Load Testing in Production

As a member of the performance team, my responsibilities include load and scalability testing to ensure that Box applications scale and perform reasonably as we gain new customers.

Most of the performance testing I have done in the past has been in lab environments. The typical flow involved setting up a performance lab where the target application would be load tested using emulated users, an automated framework and synthetic data that could be refreshed at will. In general, one wouldn't care much if the application crashed because of the testing. In fact, a performance engineer is secretly delighted when they can successfully bring down the target application!

Load testing on production is a completely different ball game. The testing has to be designed, executed and very closely monitored to ensure that real users on the site are not affected. Simultaneously, the application needs to be stressed sufficiently to help identify bottlenecks.

How is this accomplished? Read on to find out.

Why Load Testing in Production

Since load testing in production is inherently risky, why do it in the first place? The primary reason is the difficulty of simulating the production environment in the lab. Although it is possible to deploy the application easily enough in the lab, reproducing the data set is another matter.

  • If the data is of any significant size, the costs to duplicate the data infrastructure can quickly become prohibitive.
  • Rules and regulations on user data privacy require that the data be scrubbed.
  • Keeping this data in sync with production can become very difficult and lead to unstable performance numbers. After all, the whole purpose of the lab is to give us the ability to do reliable performance testing.

For these reasons, usually the lab data set is synthetically generated or is a subset of live data at a point in time that has been scrubbed.

This means that the lab can never truly catch scalability problems that originate in the data layer. The only way to truly measure scalability is by load testing in production.

Load Testing Framework

We have adopted Faban, an open source performance workload creation and execution framework, as our load testing framework. Faban allows us a high level of customization of the workload. Its intuitive web interface and its ability to collect system and application performance metrics on various machines in the infrastructure are attractive features as well.

We have developed a wide variety of workloads for Faban and use it both in the performance lab and for production testing.


Whereas most content sites are focused on downloads, for Box the primary workload is uploads. As users sign on to Box, their first action is to upload their files. We want to ensure that this operation is as smooth as possible.

Box also frequently does marketing campaigns that sign up thousands of users in a single day - this can cause huge spikes in traffic (e.g. when users sign up and upload their entire photo library).

So our first workload was uploads - to ensure that our site can handle the traffic.


We didn't start load testing the live site on day 1. Instead, we took it gradually, first measuring performance on servers that were taken out of rotation (but that still talked to the same back end infrastructure). We then gradually moved to larger and larger configurations. The process we followed is described below:


  • Develop the workload to measure and test the target functionality (in our case, file uploads). Faban comes with some sample workloads in the samples/benchmarks directory.
  • Create necessary test accounts and other required data sources.
  • Test the workload using a small load to ensure it does what is expected.
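The last preparation step, checking the workload under a small load, can be sketched roughly as follows. This is a minimal illustration in Python rather than an actual Faban driver: smoke_test and fake_upload are hypothetical names, and fake_upload merely stands in for a real upload call made with a test account.

```python
from concurrent.futures import ThreadPoolExecutor


def smoke_test(operation, users=2, ops_per_user=5):
    """Run a small, fixed number of operations and count successes and failures."""

    def run_user(user_id):
        ok, failed = 0, 0
        for op_index in range(ops_per_user):
            try:
                operation(user_id, op_index)
                ok += 1
            except Exception:
                failed += 1
        return ok, failed

    # One thread per emulated user; each user runs its operations in sequence.
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(run_user, range(users)))

    return {
        "ok": sum(r[0] for r in results),
        "failed": sum(r[1] for r in results),
    }


def fake_upload(user_id, op_index):
    """Hypothetical stand-in for uploading one file as a test account."""
    if op_index < 0:
        raise ValueError("invalid operation index")


if __name__ == "__main__":
    print(smoke_test(fake_upload, users=2, ops_per_user=5))
```

Any failure at this tiny load level points to a problem in the workload itself rather than in the site, which is exactly what this step is meant to catch before real load is applied.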

Single Node Test

Isolate one set of servers that form a complete path to the database from the production infrastructure. In our case, we only needed one web and one upload server. Run the load against these servers until they are maxed out. Make sure to monitor the database and any other shared infrastructure. For most deployments that use hundreds of servers, this load should be insignificant to the shared infrastructure, since it comes through only one server.
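One way to decide that the isolated server is "maxed out" is to increase the load in steps and look for the point where throughput stops growing. The sketch below assumes you have already collected (load level, throughput) pairs from such a ramp. The function name, the 5% threshold and the sample numbers are all illustrative assumptions, not measurements from our tests.

```python
def find_saturation(samples, min_gain=0.05):
    """Return the load level after which throughput stops improving.

    samples is a list of (load_level, throughput) pairs ordered by
    increasing load. The result is the last load level before a step whose
    relative throughput gain fell below min_gain, or None if throughput
    kept growing through the whole ramp.
    """
    for (prev_load, prev_tput), (_, tput) in zip(samples, samples[1:]):
        if prev_tput > 0 and (tput - prev_tput) / prev_tput < min_gain:
            return prev_load
    return None


if __name__ == "__main__":
    # Hypothetical ramp: emulated users versus uploads per minute.
    ramp = [(10, 100.0), (20, 195.0), (40, 380.0), (80, 390.0), (160, 385.0)]
    print(find_saturation(ramp))
```

In this made-up ramp, throughput nearly doubles with each step up to 40 users and then flattens, so 40 is reported as the saturation point.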

Multi Node Test

Isolate a set of servers (we used 10) from the web server tier, put a separate load balancer in front of them and drive load against this set. We were now able to drive 10 times the load of the previous test. Monitor all possible components closely. We monitored the databases, caching tier, web server tier, etc. to identify bottlenecks. When you are able to scale the load to a level where the CPU on the server set is exhausted, move to the next step.
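A simple check at this stage is to compare the throughput of the multi-node set against what perfect linear scaling from the single node test would predict. The sketch below shows the arithmetic. The function name and the numbers are hypothetical and only illustrate the calculation.

```python
def scaling_efficiency(single_node_tput, nodes, measured_tput):
    """Ratio of measured throughput to ideal linear scaling (1.0 is perfect)."""
    if single_node_tput <= 0 or nodes <= 0:
        raise ValueError("single_node_tput and nodes must be positive")
    return measured_tput / (single_node_tput * nodes)


if __name__ == "__main__":
    # Hypothetical numbers: one server peaked at 400 uploads/min,
    # and the 10-server set peaked at 3400 uploads/min.
    efficiency = scaling_efficiency(400, 10, 3400)
    print(f"scaling efficiency: {efficiency:.2f}")
```

An efficiency well below 1.0 while the web servers still have CPU to spare suggests that a shared component, such as the database or caching tier, is the bottleneck rather than the servers under test.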

Data Center Level Test

At this point, we decided to run the test against an entire data center. We chose a time when traffic was typically minimal, then diverted all traffic away from this data center. This allowed us to continue to drive the load against the front-end servers in isolation without affecting live traffic. However, the load did have an impact on all shared back end infrastructure. Once (but only once) we brought down the database. Since we were well prepared, it didn't take long to terminate the load and bring the site back up. This level of testing is the only way to catch difficult issues that may occur only at very high loads.
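Being prepared to terminate the load quickly can be partly automated with a guard that watches a shared back end metric and drops the load as soon as a limit is crossed. The following is a minimal sketch of that idea, not a description of our actual tooling: run_with_guard, FakeBackend, the latency metric and the limit are all hypothetical.

```python
def run_with_guard(load_steps, apply_load, read_metric, limit):
    """Apply load steps in order and abort as soon as the metric exceeds limit.

    apply_load(level) sets the offered load, and read_metric() returns the
    current value of a monitored back end metric. The load is always
    dropped to zero at the end, whether or not the run was aborted.
    """
    applied = []
    for level in load_steps:
        apply_load(level)
        applied.append(level)
        if read_metric() > limit:
            apply_load(0)  # terminate the load immediately
            return {"aborted": True, "applied": applied}
    apply_load(0)
    return {"aborted": False, "applied": applied}


class FakeBackend:
    """Hypothetical stand-in for the load driver and a database latency probe."""

    def __init__(self):
        self.level = 0
        self.history = []

    def apply_load(self, level):
        self.level = level
        self.history.append(level)

    def db_latency_ms(self):
        # Pretend latency grows linearly with the offered load.
        return self.level * 0.5


if __name__ == "__main__":
    backend = FakeBackend()
    result = run_with_guard([50, 100, 200, 400],
                            backend.apply_load, backend.db_latency_ms, limit=100)
    print(result, backend.history)
```

In a real run the metric would come from your monitoring system, and the limit would be set conservatively, since the shared back end is also serving live traffic from the other data centers.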

Full Site Test

The final step is to run against the entire production infrastructure. After successfully driving load against one data center and with all known issues fixed, we took the plunge to run against the entire site. This actually went very well - much better than the previous steps - and we were able to push our target load while ensuring that all services ran smoothly. This success was primarily due to the step-by-step approach, which allowed us to catch issues early and fix them before applying further load.

We continue to routinely load test production to ensure that our site sees and handles a given load level well before we anticipate that level from real users.