A few years ago, we began splitting up the monolithic PHP application that powers Box into microservices. We knew we'd ultimately need dozens (even hundreds) of microservices to be successful, but there was a serious problem: our model for provisioning new services was slightly... antiquated. And by antiquated I mean that people in the 1800s probably had better ways of building and deploying microservices than what we had at Box.
In the beginning
If you wanted to deploy a new production service, you first had to ask the operations team for dedicated hardware. That's right - because we started Box both before AWS and before virtualization was internally practical, much of our technology stack was still fundamentally based on bare metal servers dedicated to specific services. It could take weeks (or even months) to get your hardware ordered, racked, and online. Then you had to write Puppet profiles to customize your specific servers. The Puppet repo was of course highly secured and locked down to prevent any mishaps, which was great for security and stability but not so great for developer velocity.
After a few weeks of customizing your Puppet configs, getting some servers for load balancers, customizing those, deploying your own Nagios checks to ensure your servers and services were up, and probably finishing pretty much every article on Wikipedia, you were finally ready to deploy your code. You also needed to do that for the development, staging, and production environments. The work was such a burden that some teams would skip deploying to staging altogether (or skip "inconsequential" things like load balancers or service authentication), creating significant inconsistencies between environments.
And all this work was for just launching one new service - now imagine that effort multiplied by dozens or hundreds and you can see that we had a very serious problem. Something had to be done, so we began investigating how to make service deployment and management a much simpler activity.
Picking a solution
We knew we needed some kind of internal PaaS - a platform that would make it quick and easy for developers to take their service and deploy it to development, staging, and production environments. The first decision we had to make was whether to build a platform based on virtual machines or containers. We knew that the rapid development and deployment characteristics of containers were a perfect fit for our goals, and we wanted to skate to where the puck was going, so we chose to build our platform around containerization technology.
Docker was the obvious choice for the image and container format, but the technology to use for managing the containers across many servers was much less obvious, especially when we were making this decision in late 2014. At this point, Kubernetes had only just launched, Amazon ECS didn't exist, and Mesos didn't natively support Docker containers.
After a thorough evaluation of every orchestration technology we could find, the only two that seemed to have the sophistication (and openness) that we needed were Kubernetes and Mesos. The reasons we chose Kubernetes over Mesos could fill a blog post of their own, so we'll just say that Kubernetes' worldview of what an orchestration system should look like, and the capabilities it provides, much more closely matched our own architectural vision. To be honest, the API that Kubernetes provides is the one we've always wanted. Add that to the fact that we were incredibly impressed with the Kubernetes team - after all, who better to build an orchestration platform than the engineers who helped build and maintain the largest containerization platform in the world?
The new workflow
After about 18 months of work, we've built and deployed a platform that massively streamlines what an engineer has to do to get a service in production. The new workflow goes something like this:
An engineer writes a Dockerfile to package up their service into a Docker image. Once their image has been built by Jenkins and published to our internal registry, they write and test the Kubernetes objects that run their service, set up service discovery, generate and load secrets, provide run-time configuration, and more.
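For the packaging step, a minimal Dockerfile might look something like the following sketch. The base image, paths, and service name here are hypothetical illustrations, not Box's actual configuration:

```dockerfile
# Hypothetical Dockerfile for a small PHP service.
# Base image, source layout, and port are illustrative only.
FROM php:7.0-apache

# Copy the service's source into the web root of the image.
COPY src/ /var/www/html/

# The port the service listens on inside the container.
EXPOSE 80

# Run Apache in the foreground so Docker can supervise the process.
CMD ["apache2-foreground"]
```

Jenkins would build an image like this on each commit and push it to the internal registry, where Kubernetes can pull it at deploy time.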
Note: This is one key difference between Kubernetes and other orchestration solutions - while most solutions would require you to go to many different systems to manage these pieces (or to write your own glue to tie those systems together), Kubernetes believes that your infrastructure should fundamentally be describable through a set of Kubernetes objects that are submitted to and stored on the master. This is not to be confused with Kubernetes itself being monolithic - each of the above pieces is implemented as an individual microservice inside of Kubernetes. It's only the data model that is unified.
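To illustrate that unified data model, a single manifest file can declare both a workload and its service-discovery entry as plain Kubernetes objects. This is a generic sketch (names and registry URL are hypothetical), using the Deployment API group as it existed around Kubernetes 1.3:

```yaml
# Hypothetical example: a service-discovery entry and a workload,
# declared together as two Kubernetes objects in one file.
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example-service        # routes traffic to matching pods
  ports:
    - port: 80                  # port exposed to other services
      targetPort: 8080          # port the container listens on
---
apiVersion: extensions/v1beta1  # Deployment API group circa Kubernetes 1.3
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: registry.internal/example-service:1.0
```

Everything - workload, discovery, secrets, config - is just more objects in the same store, which is what makes a single "apply" step able to reconcile all of it.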
Once the engineer has written their config (in the Jsonnet templating language for easier refactoring), they add the configs to the central git repository. We then have an "applier" that's responsible for continually reconciling the state of the git repository with the state of the various Kubernetes masters we have in each of our datacenters using "kubectl apply." (We helped contribute much of the code powering "kubectl apply" and we'll be open sourcing the applier (built by our intern!) shortly.)
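To show why Jsonnet makes these configs easier to refactor than raw YAML or JSON, here's a minimal sketch of the kind of template a team might write. The `deployment` helper and all names are hypothetical, not Box's actual library:

```jsonnet
// Hypothetical helper: one function generates a whole Deployment object,
// so a schema change only has to be made in one place.
local deployment(name, image, replicas) = {
  apiVersion: "extensions/v1beta1",  // Deployment API group circa Kubernetes 1.3
  kind: "Deployment",
  metadata: { name: name },
  spec: {
    replicas: replicas,
    template: {
      metadata: { labels: { app: name } },
      spec: {
        containers: [{ name: name, image: image }],
      },
    },
  },
};

// Each service's config shrinks to a one-line call.
deployment("example-service", "registry.internal/example-service:1.0", 3)
```

The Jsonnet compiler evaluates this down to plain JSON, which is exactly what "kubectl apply" expects, so the applier never needs to know templates were involved.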
At that point Kubernetes takes over and creates Docker containers on the various servers, automatically configures our haproxy load-balancers using service-loadbalancer, provides secrets and configuration to the instances, and so on. Deploying to different environments or clusters is as simple as adding an if statement to generate a few more files in the central git repository.
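The "if statement" for targeting multiple environments might look like this in Jsonnet - again a hedged sketch, with the `env` variable and all values invented for illustration:

```jsonnet
// Hypothetical: "env" would be set per cluster/datacenter when the
// configs are generated; everything else derives from it.
local env = "staging";

{
  // Scale and image tag vary by environment; the rest of the
  // config is shared, so environments can't silently drift apart.
  replicas: if env == "production" then 20 else 2,
  image: "registry.internal/example-service:"
         + (if env == "production" then "v1.4.2" else "latest"),
}
```

Because every environment is generated from the same template, the inconsistencies between development, staging, and production that plagued the old bare-metal workflow largely disappear.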
The above workflow has cut the time to launch a new service from six months to an average of less than a week. Some of our microservices have gone from local development to production in multiple datacenters in less than an hour.
We're currently running Kubernetes clusters in each of our production datacenters, and several critical-path services have migrated to the platform, with many more to come. We're especially excited for the new PetSet support in Kubernetes 1.3 so we can ultimately run services like Hadoop on Kubernetes as well.
There is no doubt that this project would have been impossible without Kubernetes. The Kubernetes community has been an absolutely phenomenal model for true open source collaboration. We've been continually impressed with how transparent and open the entire design and development process has been, and it's been a huge vote of confidence for our continuing investment in the project. Kubernetes is a production-ready platform that we look forward to building on for years to come.