The Three Letters that Saved My Tech Debt

An Inconvenient Debt

Talk to any engineer, and they're probably more familiar than they'd like with technical debt. It's a reality in every one of our systems, one that most of us know about but don't like facing. Good practices for addressing technical debt are not discussed often enough. Read on to learn some of Box's story. This article is a continuation of a lightning talk I recently did at Velocity Santa Clara (links here for the video and the slides).

Defining Tech Debt

Technical debt can be thought of as the result of any conscious or unconscious design choice invalidated by changes in scale or usage patterns over time. In simpler terms, it's a price paid in the architecture of your software, hardware, network, and so on because of a decision made in the past. In my opinion, tech debt represents solutions which are no longer optimal.

A rapidly growing engineering organization is constantly faced with hard trade-offs between quality and delivery. It's an endless struggle to satisfy requirements with (arguably) necessary shortcuts. The problem is that as shortcuts continue to be taken (whether recognized as shortcuts or not), the system eventually becomes brittle and unsustainable.

The effects of tech debt range from annoyance to catastrophe, like potholes on a dirt road to fissures on a racetrack.

Admitting It Exists

By the time tech debt has worked its way into a system, everyone expects somebody else to deal with it. It's often so poorly understood that fixing it seems overwhelming and unjustifiable. Finally, even if someone decides to do something about their tech debt, the work of fixing it is rarely recognized.

Nonetheless, the hard reality is that you can't ignore tech debt forever. Eventually, it starts to stink. It slows down and frustrates the engineering team. It causes unexpected behavior in your product and can affect customers.

Tech Debt at Box

Box easily meets the definition of a rapidly growing organization, and denying that we have faced problems because of technical debt would be like pretending it's easy to find a cheap apartment in San Francisco. As our tech debt began to evolve from a metaphorical pothole to a fissure, we recognized that we had to do something about it.

Our first step was simply to categorize all of our known tech debt, from immediate to long-term risk. We did this by looking at our site incident postmortem analyses. Site incidents are like tech debt collectors, and our postmortems were coffers of information. We used the data from our postmortems as a learning opportunity to systematically analyze the categories of tech debt associated with our biggest areas of risk.

This was great and we wanted to get to work right away, but we had a new problem: deciding what to do first. Since everything was a P0, priority was effectively useless. We needed a way to expose the subtleties between the tech debt tickets that was both quantitative and objective.

Discovering PIE

We started by realizing two things about each tech debt ticket. First, there was a distinct likelihood of how soon the problem would occur again. Second, there was a distinct impact associated with that occurrence. We set up a scale to measure each of these, from 1 to 5. Multiplying these numbers together gave us a tech debt score (1 to 25). The higher the score, the more important it was to fix. We needed a fun way to refer to the rating, so we renamed "likelihood" to "probability" and, together with "impact," that gave us the acronym "PI."
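To make the arithmetic concrete, here is a minimal Python sketch of a PI calculation. It is illustrative only and not Box's actual tooling: the `pi_score` function, the ticket IDs, and the ratings are all hypothetical, and it assumes both factors are whole numbers rated 1 to 5.

```python
def pi_score(probability: int, impact: int) -> int:
    """Return probability * impact, with each factor rated from 1 to 5."""
    for name, value in (("probability", probability), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * impact


# Hypothetical tickets: (ticket id, probability, impact)
tickets = [("DEBT-1", 5, 2), ("DEBT-2", 3, 5), ("DEBT-3", 1, 4)]

# Highest PI first, since a higher score means a more important fix.
ranked = sorted(tickets, key=lambda t: pi_score(t[1], t[2]), reverse=True)

for ticket_id, probability, impact in ranked:
    print(ticket_id, pi_score(probability, impact))
```

Even when every ticket carries the same nominal priority, the product of the two ratings spreads the tickets out enough to rank them.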

PI gave us a way to think more strategically about tech debt priority, but there were still many competing projects. The next thing we realized is that the most important thing was reducing risk as soon as possible. This wasn't necessarily removing the tech debt completely, but doing the quickest thing to address the highest risk part of it. Doing this would give us more breathing room to work on the next high risk problem or to more thoroughly refactor out the debt.

We started measuring this "quickness" factor on a similar 1 to 5 scale and combined it with PI for a three-part tech debt score. We renamed "quickness" to "Ease," which gave us the acronym "PIE."

Putting It Into Practice

To get to a sustainable place, we needed to drive our tech debt down to a reasonable level. We approached this by setting a PI (no E) threshold and getting commitment from leadership to fix everything above the threshold within one month, starting with the easier (high E) tickets.
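The selection rule described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the process Box actually automated: the `DebtTicket` class, the `work_queue` function, the threshold value of 12, and the tie-break on higher PI are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtTicket:
    key: str
    probability: int  # 1 (unlikely to recur soon) to 5 (likely to recur soon)
    impact: int       # 1 (minor) to 5 (severe)
    ease: int         # higher means the risk can be reduced more quickly

    @property
    def pi(self) -> int:
        return self.probability * self.impact


def work_queue(tickets, pi_threshold):
    """Keep tickets whose PI is above the threshold, easiest first.

    Ties on ease are broken by higher PI (an assumption made for this sketch).
    """
    above = [t for t in tickets if t.pi > pi_threshold]
    return sorted(above, key=lambda t: (-t.ease, -t.pi))


PI_THRESHOLD = 12  # hypothetical value

tickets = [
    DebtTicket("DEBT-1", probability=5, impact=4, ease=2),  # PI 20
    DebtTicket("DEBT-2", probability=3, impact=5, ease=5),  # PI 15
    DebtTicket("DEBT-3", probability=2, impact=3, ease=5),  # PI 6, below threshold
    DebtTicket("DEBT-4", probability=4, impact=4, ease=5),  # PI 16
]

queue = work_queue(tickets, PI_THRESHOLD)
for t in queue:
    print(t.key, "PI", t.pi, "E", t.ease)
```

Note that ease only orders the work; it plays no part in deciding which tickets make the cut, which matches the "PI (no E)" threshold.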

We knew that without visibility and accountability, we would never succeed. We set up code to pull data from JIRA daily and politely nudge tech debt owners via emails that CC'ed executive leadership. No one was pointing fingers, but no one was hiding the facts either.
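As a rough idea of what such a nudge might look like, the sketch below builds a reminder message with Python's standard `email.message.EmailMessage`. Everything specific here is hypothetical: the `build_nudge` function, the addresses, the wording, and the ticket data, which is passed in as plain tuples rather than pulled from JIRA. Actually sending the message (for example with `smtplib`) and scheduling the daily run are left out.

```python
from email.message import EmailMessage


def build_nudge(owner_email, cc_emails, open_tickets):
    """Build a polite reminder listing an owner's open tech debt tickets.

    open_tickets is a list of (ticket key, PI score) tuples.
    """
    msg = EmailMessage()
    msg["From"] = "tech-debt-bot@example.com"  # hypothetical sender
    msg["To"] = owner_email
    msg["Cc"] = ", ".join(cc_emails)
    msg["Subject"] = f"Tech debt reminder: {len(open_tickets)} open ticket(s)"

    lines = ["Hi, these tech debt tickets are still open:", ""]
    lines += [f"  {key} (PI {pi})" for key, pi in open_tickets]
    lines += ["", "Thanks for helping us pay this down."]
    msg.set_content("\n".join(lines))
    return msg


# Hypothetical data standing in for a daily export from the ticket tracker.
nudge = build_nudge(
    owner_email="owner@example.com",
    cc_emails=["cto@example.com", "vp-eng@example.com"],
    open_tickets=[("DEBT-1", 20), ("DEBT-4", 16)],
)
print(nudge)
```

Keeping message construction separate from fetching and sending makes the tone and the CC list easy to review before anything goes out.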

How Did That Work Out?

The entire engineering organization rallied around the cause, and by the end of the first month we had closed all of the tickets above the threshold. We started seeing improvement immediately:

  • Teams previously most affected by tech debt were getting more engineering work done
  • There were fewer surprises during development and deployment
  • We had come together as an organization (from product to dev to ops) with clear priorities and goals
  • We felt good about ourselves
  • We had a framework for staying on top of our tech debt

Ultimately, we could operate at a higher level of confidence both in development and in knowing that we could tackle the tech debt challenges of a rapidly growing, high traffic system.

Parting Thoughts

We've come a long way, but there's still work to do. Technical debt is a living thing and the risk of a shortcut can change between today, tomorrow, and next month. PIE gives us a reliable system for staying aware of our known challenges and for raising visibility into which of them are the most relevant to address first.

I personally learned that cultural change in a large organization is possible. People want to do the right thing, but it should be rewarding for them to do it and embarrassing to NOT do it. Accountability to fix what's broken can't be expected if risk can't be measured and there isn't a clear message about priority.

Share your thoughts with me on Twitter at @bvanevery.