Continuous Deployment in Desktop Software

In an effort to maintain the agility of our startup days and deliver the best software possible, Box has been moving towards continuous deployment. We’ve had hugely successful results in our web application and, about 2 years ago, we decided we’d find out what continuous deployment looks like in client software. We had started a complete redesign and rewrite of our desktop application—Box Sync—and we wondered, “Could we apply the same lessons we learned on the web to a domain where we don’t control the machines and we can’t roll back the code if something goes wrong?”

Acceptance

The automated acceptance 'big board' for Box Sync. Builds that do not pass all the tests are not deployed, and no further commits are accepted until the problem is fixed. (Red means something bad happened.)

The first step towards continuous deployment is continuous delivery—code is immediately deployable after commit, even if it is not actually deployed, and the first step towards continuous delivery is setting up automated acceptance tests.

At a high level, the best practices applied here are the same for the desktop and the web:

  1. Every time a developer pushes a new commit, the application is built in its entirety (“continuous integration”) and the full suite of tests is run
  2. If a test fails, the build cannot be deployed and further commits are rejected until the test failure is fixed (“stop the line”)
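
In practice, the "stop the line" rule is just a gate in the build pipeline: if any step fails, nothing downstream is allowed to happen. Here is a minimal sketch in Python; the make targets and messages are hypothetical stand-ins for whatever the real CI system runs, not our actual tooling.

```python
#!/usr/bin/env python3
# Minimal sketch of a "stop the line" gate (hypothetical commands).
import subprocess
import sys

def run(step, cmd):
    print(f"[{step}] $ {' '.join(cmd)}")
    return subprocess.run(cmd).returncode == 0

def main():
    # Build the application in its entirety on every push (continuous integration).
    if not run("build", ["make", "build"]):
        sys.exit("Build failed: the line is stopped, nothing deploys.")
    # Run the full suite of tests; a single failure blocks deployment
    # and further commits until it is fixed.
    if not run("test", ["make", "test"]):
        sys.exit("Tests failed: the line is stopped, nothing deploys.")
    print("Green build: this commit is deployable.")

if __name__ == "__main__":
    main()
```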

Box Sync is particularly well suited for automated acceptance testing. It is light on UI and its basic job—ensuring two sets of files in two different places match—is very easy for a computer to verify. We run three types of automated tests:

  1. We have full code coverage via unit tests
  2. The main syncing algorithm is covered by integration-style tests, called "B to Y," that simulate the network and the file system (the local file system is A, and the network is Z); a sketch of this style of test follows the list
  3. We run full-scale integration tests that launch the built version of Sync (the full .app or .exe, depending on platform), play with files on the local hard drive or on Box, and verify the right things end up in the right place at the end. We call this “chimp.”
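
To make the "B to Y" idea concrete, here is a minimal sketch of that style of test. The names (FakeFileSystem, FakeBoxApi, sync_once) are illustrative stand-ins rather than our actual harness; the point is that the sync algorithm sits between two fakes whose final state a computer can compare directly.

```python
import unittest

class FakeFileSystem:
    """In-memory stand-in for the local file system (the real A)."""
    def __init__(self, files=None):
        self.files = dict(files or {})   # path -> contents

class FakeBoxApi:
    """In-memory stand-in for the Box servers (the real Z)."""
    def __init__(self, files=None):
        self.files = dict(files or {})

def sync_once(local, remote):
    # Placeholder for the real syncing algorithm under test:
    # after one pass, both sides should hold the same set of files.
    merged = {**remote.files, **local.files}
    local.files, remote.files = dict(merged), dict(merged)

class BToYTest(unittest.TestCase):
    def test_new_local_file_reaches_box(self):
        local = FakeFileSystem({"notes.txt": b"hello"})
        remote = FakeBoxApi()
        sync_once(local, remote)
        # Sync's basic job: two sets of files in two places must match.
        self.assertEqual(local.files, remote.files)

if __name__ == "__main__":
    unittest.main()
```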

We currently support a number of platforms and versions of operating systems, so all of our tests are run on each supported platform and OS. The whole process runs over 13,000 distinct tests and takes roughly 20 minutes.

There’s one other issue that our testing framework deals with for us: we are entirely dependent on the Box web API for functionality, and as we are making changes to Sync, other engineers are making changes to the servers that power Sync. As part of their deployment process, we run a full suite of integration tests (“chimp-staging”) to ensure compatibility between the two.

Deployment

The version of Sync running on client computers. We release to a group of beta testers roughly once a day, and do full releases once or twice a week.

Deploying client software is completely unlike deploying a web app, so our first goal was to make the process as consistent as possible with our web deployments, while respecting the different domain requirements and maintaining high user experience standards. A deployment follows these steps:

  1. Once an hour, the client makes an API call, passing the current client version. The API responds with a new version, or otherwise tells the client to try again later. This call is authenticated, so users or enterprises can block updates if they choose.
  2. If an update is found, the client downloads the new version, performs security and versioning checks, launches the new version, and quits.
  3. The new version installs itself in the appropriate location, updates any links or other leftover pieces, and continues running as if nothing happened.
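
Taken together, the client side of that flow is essentially a polling loop. The endpoint, field names, and helpers below are assumptions for the sake of illustration, not the actual update API.

```python
import time
import requests

UPDATE_URL = "https://example.com/api/sync/update"  # hypothetical endpoint
CHECK_INTERVAL = 60 * 60                            # once an hour

def check_for_update(current_version, auth_token):
    # Authenticated call, so users or enterprises can block updates.
    resp = requests.get(
        UPDATE_URL,
        params={"version": current_version},
        headers={"Authorization": f"Bearer {auth_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The server either offers a new version or says "try again later" (None).
    return resp.json().get("new_version")

def install_and_restart(new_version):
    # Placeholder for: download the build, run security and versioning
    # checks, launch the new version, and quit the old one.
    raise NotImplementedError

def update_loop(current_version, auth_token):
    while True:
        new_version = check_for_update(current_version, auth_token)
        if new_version:
            install_and_restart(new_version)
            return  # the old build exits; the new one carries on from here
        time.sleep(CHECK_INTERVAL)
```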

When a new build is ready for release, the API response is updated, and clients update over the next hour. There are a few caveats and requirements to updates that I’d like to note, briefly.

Firstly, for users, there is really only one thing that matters about deployment: it must be completely silent.

Secondly, there are two problems we run into with desktop software that you don’t get when you own the hardware: backwards compatibility and the absence of a big red cancel button. It’s impossible to keep everyone updated at all times, if only because users turn their computers off at different times (and more than we want them to). So you have to keep in mind that, every time you deploy, you will be updating some clients that are months old. You need to be incredibly disciplined about backwards compatibility on your updates. In addition to that, if we push a bad release, there’s no fixing it. Any user who receives it will be stuck with that build until they manually update (which in reality is probably never).
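
One small habit that illustrates the discipline involved: the server must keep accepting whatever months-old clients send, so old parameter names and response fields are never removed or repurposed, only added to. A minimal sketch with hypothetical names, not our actual endpoint:

```python
LATEST_VERSION = "4.0.500"  # illustrative value

def handle_update_check(params):
    # Suppose old clients send "version" and a newer scheme added
    # "client_version": both must keep working indefinitely, because some
    # clients in the field are months old.
    current = params.get("client_version") or params.get("version")
    response = {
        # Fields that shipped clients already read can never be renamed.
        "new_version": LATEST_VERSION if current != LATEST_VERSION else None,
    }
    # New fields are purely additive; old clients simply ignore them.
    response["rollout_percent"] = 100
    return response
```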

Because the stakes are so high, before releasing a build to all users, we upgrade Box employees to the new build, let it run for a day, and then double-check that it can update to the next day’s build.

Unfortunately, we haven’t yet gotten to true continuous deployment. There are three very solvable things holding us back:

  1. Shipping a complete copy of the application multiple times a day would saturate both our bandwidth and our customers'. We can solve this with differential updates (sketched after this list).
  2. One of the features of Box Sync is that it shows icons overlaid on the user’s files in Finder or Explorer indicating whether the file is synced or not. This is very hard for us to test in an automated way. Additionally, because of the way it interacts with the OS, it’s rather fragile. Before each release, we manually check that icons are still working on all platforms.
  3. We’re not yet at a point where we can automate the reading of our remote monitoring graphs after we release code. This means that it is still fairly manual and time-consuming to ensure that when a release has gone out, it’s not causing problems.

The good news is that we’re moving in the right direction. We’re now able to sustain a release to Box employees once a day, and a full release to all users twice a week.

Monitoring

Remote monitoring of application crashes. This is updated hourly and linked to our JIRA instance, so we know what we've fixed and what we need to work on.

Effective monitoring is critical to the continuous deployment process in general, and it plays a central role in how we deploy Box Sync. Our monitoring focuses on three areas: deployment, general health, and development.

During deployment, we want to know if the new build is causing problems, which allows us to stop rolling out the build to any more users. We remotely monitor any exceptions, errors, or warnings the clients encounter. We also monitor good things, including uploads, downloads, and OAuth 2 session renewals, so we can easily tell if the update makes the client quietly die or stop syncing without sending an error. We continue running these dashboards even when we're not in the middle of a deployment in order to monitor the general health of the application. We also use our remote monitoring to guide what we build and how we prioritize fixing bugs.

The biggest challenge in effectively monitoring the clients is good aggregation. In particular, if the site goes down, the resulting errors can trigger still more errors as clients try to report them to Box, and we can end up with our own clients mounting a DDoS attack on the Box servers. We clearly needed strong limits on how much data, and at what frequency, the clients could send. To enforce this, we built a client-side aggregation system that attaches hashes to events and only writes new events to the DB if a row with that event's particular hash doesn't already exist (and bumps up the count field if the row does exist). A leaky bucket on the client and rate limiting on the server ensure the rate of requests stays sufficiently low.
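
Here is a minimal sketch of that hash-based aggregation, using an in-memory SQLite table as a stand-in for the real events store; the schema and names are illustrative only.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the real events store
db.execute("""CREATE TABLE events (
    hash TEXT PRIMARY KEY, event_type TEXT, message TEXT, count INTEGER)""")

def event_hash(event_type, message):
    # The client attaches this hash; identical events always hash the same.
    return hashlib.sha256(f"{event_type}:{message}".encode()).hexdigest()

def record_event(event_type, message):
    h = event_hash(event_type, message)
    # A repeat of an already-seen event only bumps the count; a new event
    # gets its own row. An error storm thus becomes one row with a large
    # count rather than a flood of writes.
    updated = db.execute(
        "UPDATE events SET count = count + 1 WHERE hash = ?", (h,)).rowcount
    if updated == 0:
        db.execute(
            "INSERT INTO events (hash, event_type, message, count) VALUES (?, ?, ?, 1)",
            (h, event_type, message))

record_event("oauth_error", "token refresh failed")
record_event("oauth_error", "token refresh failed")
print(db.execute("SELECT event_type, count FROM events").fetchall())
# -> [('oauth_error', 2)]
```

The leaky bucket on the client and the rate limiting on the server are not shown here; they sit in front of this write path and simply cap how many of these writes are allowed per unit of time.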

Aggregating data is not only critical for technical reasons, but also makes data analysis considerably easier.

Beyond the Desktop

One of the things we learned while building Box Sync is that even if we cannot reach true continuous deployment for technical reasons, having it as a goal makes a strong, positive impact on our culture and development practices. One of the next big experiments that we could undertake is to understand what continuous deployment would mean on native mobile applications. There are many very real challenges in the mobile application space and even reaching continuous delivery with weekly or biweekly deployments would likely be a huge win.