Site issues are a part of life for most web application shops. Database errors, buggy code, vendor failures, and growing pains all rear their heads and keep engineers up at night.
At Box, we're no exception, and over the years we've done our fair share of triaging and solving site issues. This is a story about the evolution of site outages at Box, a grassroots campaign to scorch our tech debt, mold our postmortems, and ultimately reclaim operational confidence.
In The Beginning
After years in business, our tech debt had added up, and we were paying the price. We knew it was bad, but we weren't measuring the right things and consequently didn't have a clear picture of the damage.
We realized that the logical first step was to keep track of site outages, so we created a new JIRA project and workflow. Looking back, this seems obvious, but, at the time, fighting production fires was just considered part of daily operations—daily operations with no paper trail.
Discovering the Problem
The new JIRA project confirmed what we suspected (we had lots of site issues), but it also showed us something we hadn't expected. While we were doing a great job tracking each site issue, we were mediocre at doing anything to prevent recurrences. We were often waiting weeks or longer to conduct postmortems, and the group most hurt by an issue wasn't necessarily the group best equipped to fix the root cause.
Everyone's problem was no one's problem.
Developers were suffering from their own application code tech debt. Operations was suffering from its own infrastructure tech debt. We needed a cross-functional group of individuals to own the problem. That's when a small group of dev and ops engineers decided to do it. *We called ourselves the coroners.*
We started with collecting more data on our site issues. For each site issue tracked, we collected:
- Site down time.
- Site degraded time.
- Remediation time spent by ALL involved parties.
- An underlying reason for the issue.
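The fields above can be captured in a simple record type. The following is a minimal sketch, not Box's actual tracking schema; the class, field names, and the example values are all hypothetical:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class SiteIssue:
    """One tracked site issue, mirroring the four data points above."""
    summary: str
    downtime: timedelta             # time the site was fully down
    degraded_time: timedelta        # time the site was degraded
    remediation_hours: dict         # hours spent, keyed by team, e.g. {"ops": 6.5}
    root_cause: str                 # underlying reason for the issue

    def total_remediation_hours(self) -> float:
        # Sum remediation time across ALL involved parties.
        return sum(self.remediation_hours.values())

# Hypothetical example entry:
issue = SiteIssue(
    summary="DB connection pool exhaustion",
    downtime=timedelta(minutes=12),
    degraded_time=timedelta(hours=2),
    remediation_hours={"dev": 10.0, "ops": 6.5},
    root_cause="tech debt in web application code",
)
print(issue.total_remediation_hours())  # 16.5
```

Summing remediation hours across every involved team, rather than just the owning team, is what surfaced the "two full weeks of salaried time" figure described below.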
After letting this bake for a few months, we reviewed the data. It revealed that in our worst month we lost TWO full weeks of salaried time to remediating issues. Our biggest source of postmortems was tech debt in our web application code base, and we averaged five action items per *conducted* postmortem.
Site issues weren't going away, but we were doing a little better at tracking statistics and conducting postmortems.
All Action Items Are Not Created Equal
The picture was getting clearer, so we dug into individual postmortems and action items to see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like "make system X better" (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed to lower the bar for conducting postmortems, focus the discussions, and find a better way to categorize our action items and our technical debt.
Out of this need, PIE ("Probability of recurrence * Impact of recurrence * Ease of addressing") was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:
- A way to police our postmortem discussions. A low-probability, low-impact, hard-to-address item was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical debates.
- A straightforward way to prioritize our action items.
What's better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
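The scheme above reduces prioritization to a single number: PIE = Probability × Impact × Ease, each ranked 1 to 5, with the backlog sorted highest score first. Here is a minimal sketch of that calculation; the `ActionItem` class and the example backlog entries are hypothetical illustrations, not Box's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    title: str
    probability: int  # 1 (low) .. 5 (high): chance the issue recurs
    impact: int       # 1 (low) .. 5 (high): damage if it recurs
    ease: int         # 1 (low/hard) .. 5 (high/easy) to address

    @property
    def pie(self) -> int:
        # PIE = Probability of recurrence * Impact of recurrence * Ease of addressing.
        return self.probability * self.impact * self.ease

# A hypothetical mixed backlog: postmortem action items and existing tech debt.
backlog = [
    ActionItem("Add DB connection-pool alerting", probability=4, impact=4, ease=5),
    ActionItem("Rewrite legacy upload service", probability=2, impact=3, ease=1),
    ActionItem("Cap retry storms in API client", probability=3, impact=4, ease=4),
]

# Higher PIE score means "do this sooner".
for item in sorted(backlog, key=lambda i: i.pie, reverse=True):
    print(f"{item.pie:3d}  {item.title}")
```

Note how the massive rewrite scores 6 while the small alerting fix scores 80: exactly the dynamic described above, where vague mega-projects stop crowding out cheap, high-leverage fixes.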
The Streamlined Process
With PIE, our postmortem discussions became targeted, much shorter, and more constructive. We came up with a streamlined vision for postmortems and decided to present it to our executive team. The plan was as follows:
- Commit to conducting postmortems within 48 hours.
- Postmortem meetings should be attended by a small, targeted group, which must include the incident owner's direct manager and one coroner.
- The postmortem analysis should:
  - Be mostly completed prior to the meeting.
  - Identify the root cause (and the probability and impact of recurrence).
  - Propose at least one action item for fixing the root cause (and estimate effort).
  - Propose at least one action item for escape prevention (and estimate effort).
  - Propose at least one action item for detection (and estimate effort).
- Each action item must have an assigned individual owner (i.e. no generic ownership).
- Each action item must be addressable in the *short* term (i.e. NO half-year projects).
- Action items must be within the owner's sphere of control.
It's important to point out that, as an organization, we still encouraged the systemic fixing of tech debt by way of long-term projects, just not in the context of addressing a postmortem action item for high-probability or high-impact risks.
We re-branded our group of coroners as the "Medical Examiners" and started looking for new recruits to help promote the cause in and out of postmortems.
At this point, we were very excited about our new process, but realized that we could only consider it successful if it was adopted: this is where our executive sponsorship came into play. We got the support we needed to encourage the entire org to focus on our postmortem debt. We conducted postmortem meetings on a massive number of outstanding issues, ultimately identifying and prioritizing concrete action items. And it wasn't just postmortems; we were also able to focus on addressing our technical debt.
Emerging from this period of focus, we could already see improvement throughout our application stack, infrastructure, and engineering organization. Our engineers had fully committed to the new process and not simply because it *should* make their lives better, but because it was *simple* and it *worked*.
Which Way Now?
And that brings us to the present: site issues continue, but their severity has lessened; site service availability has improved; MTTR has dropped by an order of magnitude; and MTBF has improved but could be even better.
Our path is clearer, but not complete. We will continue to conduct postmortems, we will continue to learn from *every* mistake, and we will continue to adapt our process.