Node.js High Availability at Box

One of Node.js's most interesting characteristics is that it's single-threaded. That means running a single Node.js instance on a typical server would result in massive under-utilization of today's multicore CPUs. As compared to other application servers, Node.js isn't too bright on its own with regards to this excess capacity. That means you're left with three choices for deploying Node.js:
  1. A single Node.js instance per multicore CPU server (under utilizing the CPUs)
  2. A single Node.js instance on a VM with a single core CPU (eliminating the excess CPU cores)
  3. Multiple Node.js instances per multicore CPU server
In the early days of Node.js at Box, we happily experimented with the first option (one instance per multicore CPU server) as it mimicked the setup we had using Apache. As such, we could easily swap in Node.js for Apache and use the same overall server setup. After a while, we wanted to explore the third option of running multiple Node.js instances per multicore CPU server. There are a lot of different ways that companies achieve this and so we spent time investigating and talking to folks at Yahoo and PayPal who were already deep into this process. Of course, we also scoured the Internet looking for information on how other companies solved this problem.

The Investigation

We started by looking at the cluster module that comes bundled with Node.js. There are a lot of examples of how to use cluster out in the wild and it seemed like people were generally happy using it. Our performance testing yielded uneven results when using more than four Node.js instances - some instances were being maxed out while others sat idle. Additionally, I had personal distaste for the manual nature of cluster, which requires you to code up exactly how you want everything to work. I really preferred having a more fully-baked solution. So we moved on to investigate some more "productionized" versions of cluster such as strong-cluster-control. StrongLoop has a fair number of modules related to the multicore problem and together they form a nice ecosystem. We strongly considered using strong-cluster-control; my only personal concern was needing to bundle the multicore logic with the application itself. Perhaps we could get around that through some more clever packaging, but it was an issue to resolve. The tipping point in the investigation came with a visit from Dav Glass (Yahoo) who shared his experience running Node.js on the heavily-trafficked Yahoo network. He warned us that he had seen Node.js-based cluster management fail because even though Node.js is relatively stable, it still has a tendency to crash at a higher rate than other application servers. If we used cluster or a cluster-like solution, we'd basically be left with a sticky situation of a master Node.js instance monitoring slave Node.js instances...but what if the master Node.js instance crashed? All the slave Node.js instances would become unavailable (since the master does routing as well as healthchecks). That would necessarily mean we'd need a process watcher for the master Node.js instance. So that would mean a process watcher to watch the master Node.js instance, which in turn would watch the slave Node.js instances. That seemed like a strange separation of responsibilities amongst the various processes on the server. Dav's recommendation was to use a non-Node.js based solution to monitor the Node.js instances on the server. He pointed us to Yahoo's Monitr, which spawns a thread that monitors a Node.js process, and we looked deeply at the entire Yahoo high availability toolset to better understand their approach.

The Result

After a lot of investigation and hard work by the Box Engineering team, we ended up with the solution that is running in production right now. The basic setup is as follows: For process monitoring, we use Supervisor. This was a tool our ops team had been using and it fit the bill perfectly. Its only job is to watch processes and monitor their lifecycles. We setup a Supervisor configuration file specifying how many Node.js instances we want to run and Supervisor ensures that those instances are started and the target number is maintained. Additionally, we use the httpok plugin to perform HTTP-based healthchecks of each Node.js instance. That means if a Node.js process becomes unresponsive, Supervisor will terminate it and start a new one in its place; if a Node.js process crashes, Supervisor also notices and starts a new process to replace it. Nginx acts as a reverse proxy for a single physical server (or virtual machine). It handles serving and caching of static assets as well as SSL termination. Traffic comes in from our external load balancer directly to Nginx. Dynamic content requests are forwarded from Nginx to HAProxy. HAProxy's primary responsibility is as the internal load balancer across the running Node.js instances. We considered skipping HAProxy and just using Nginx's load balancing, but decided we wanted more than Nginx's naive round robin approach. We configure HAProxy with a healthchecked round robin as an extra measure of protection against unresponsive Node.js instances. HAProxy only ever receives traffic from Nginx and only ever sends traffic to the Node.js instances. The Node.js instances, then, are free to do their job without worry about process management. Our source code repository has no notion of there being any more than one instance of the Node.js application running at a time.


We've already rolled out this setup at Box, and so far, it's given us exactly what we hoped: stability and reliability while maximizing each server's CPU utilization. The investigation phase of this project was extremely fruitful and we're glad we took the time to fully vet the various approaches before choosing one. This final solution nicely separated the responsibilities of each part of the system, allowing us to reuse familiar tools (Nginx, HAProxy, and Supervisor) to create a new server configuration for Node.js applications. We're still investigating the optimal number of Node.js processes to run per server and experimenting with using domain sockets inside of the server instead of TCP ports. We'll continue to tune and modify this solution as we observe more about its performance characteristics in production. Interested in working on problems like this? We're looking for front-end engineers with Node.js experience.