Fast test feedback is a critical component of building a successful deployment pipeline. Simultaneously, having a full suite of automated tests run against every change is another critical component. The challenge engineering teams face is scaling their automated test suites without losing that critical fast feedback.
This post discusses how we adopted ClusterRunner within Box Technical Operations to get near-instant feedback on a nontrivially sized suite of Puppet unit tests.
ClusterRunner was developed internally at Box to speed up test feedback for our PHP web application test suite. Back in 2014, we open sourced the code and we’ve seen good adoption. Late last year, as we began writing unit tests in earnest for our Puppet codebase, we realized that we were going to need to speed things up. Unfortunately, due to regulatory restrictions, it's not possible to run the Puppet unit tests using the existing ClusterRunner system in the development environment. The only option was to install a new cluster in production. We learned a lot during the experience and hope this post can capture and share some of those learnings.
The goal of our build pipeline is to identify breaking commits as soon as possible and then block any new commits from being merged (i.e. Stop the Line).
At a high level, our happy path Git workflow resembles the following:
- A developer clones the Puppet repository from the Git master server
- The developer makes local changes and gets code review
- The developer merges their changes upstream to the Git master server
In this post, we are interested in the third step: specifically, how to get fast feedback when a bad commit is merged, and how to prevent new commits from landing until the build is fixed. To achieve this, we implemented the following workflow:
- On every git push, a pre-receive Git Hook runs that determines the state of the build.
- If the build is green, the change is allowed to proceed.
- If the build is red, the change is blocked, UNLESS the change’s commit message indicates that it is a fix for the build.
- If the change is allowed to proceed, a post-receive Git Hook runs that executes the entire suite of unit tests and sets the state of the build (green if ALL tests passed, and red if ANY test failed).
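The decision logic in the steps above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the actual hook code: the `BuildState` enum, the function names, and especially the `[fix-build]` marker are hypothetical, since the post does not say how a fix commit is identified.

```python
from enum import Enum


class BuildState(Enum):
    GREEN = "green"
    RED = "red"


# Hypothetical marker: how a commit message flags itself as a build fix
# is an assumption made for this sketch.
FIX_MARKER = "[fix-build]"


def pre_receive_allows(build_state, commit_messages, fix_marker=FIX_MARKER):
    """Return True if the push may proceed, False if it must be blocked."""
    if build_state is BuildState.GREEN:
        return True
    # The build is red: only accept a push that claims to fix the build.
    return any(fix_marker in message for message in commit_messages)


def post_receive_state(test_results):
    """Green only if every test passed; red if any test failed."""
    return BuildState.GREEN if all(test_results) else BuildState.RED
```

In this sketch the pre-receive check only reads the stored build state, so it stays cheap, while the post-receive step is the only place the state changes.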
The diagram below describes this process visually.
Running Puppet specs with ClusterRunner
Initially, when the Puppet unit tests were run serially, they took 8 minutes to complete. Delegating parallelization and scheduling to ClusterRunner brought the time to run the tests down to a mere 20 seconds! ClusterRunner determines the most efficient way to run a set of tests, so with enough parallelism the total run time is bottlenecked by the slowest test in the suite. We hope to drive this number down even further in the future by ratcheting down slow tests.
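To see why the slowest test becomes the floor, here is a small Python sketch that estimates wall-clock time for a set of test durations spread over a number of executors. It uses a simple greedy "longest test first" assignment, which is an assumption made for illustration and not a description of ClusterRunner's actual scheduler. The function name and the sample durations are hypothetical.

```python
import heapq


def estimated_wall_time(durations, num_executors):
    """Estimate total run time when tests are spread over parallel executors.

    Greedy strategy: hand the longest remaining test to whichever executor
    currently has the least work. Returns the load of the busiest executor.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    loads = [0.0] * num_executors
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical per-file durations in seconds, for illustration only.
sample = [20, 10, 10, 5, 5, 5, 5]
serial = estimated_wall_time(sample, 1)
parallel = estimated_wall_time(sample, 3)
saturated = estimated_wall_time(sample, 100)
```

With these made-up numbers, going from one executor to three cuts the estimate from the sum of all durations to the duration of the slowest file, and adding further executors does not help. At that point only speeding up the slowest test reduces the total.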
Anatomy of a Puppet Unit Test
ClusterRunner operates on a concept of atomic tests. An atomic test is the smallest unit into which an individual test suite can be broken. The tree structure for a typical Puppet module can be seen below:
The Puppet unit test files are located in the spec directory and match the pattern "*_spec.rb". So, the natural choice for the atomic unit is the individual spec file.
As Puppet uses rake spec to execute unit tests, we found that substantial parallelization can be achieved without requiring any custom code. This is because rake spec is a combination of three subcommands:
- rake spec_prep creates fixtures needed by all the tests.
- rake spec_standalone runs the actual tests.
- rake spec_clean deletes those fixtures.
- rake spec runs all three in sequence.
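The split above suggests a natural plan: run the fixture setup once, run each spec file as its own atom, then clean up once. The Python sketch below builds such a plan for a module directory. The function names are hypothetical, and passing a single file through the `SPEC` environment variable relies on the RSpec rake task honoring it, which is an assumption worth verifying against your own Rakefile.

```python
from pathlib import Path


def atomize_specs(module_root):
    """Return every *_spec.rb file under <module_root>/spec, relative to the root."""
    root = Path(module_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in (root / "spec").rglob("*_spec.rb")
    )


def build_plan(module_root):
    """Describe one setup step, one command per spec file, and one teardown step."""
    return {
        "setup": ["rake spec_prep"],
        # Assumption: SPEC=<file> limits spec_standalone to a single spec file.
        "atoms": [
            f"rake spec_standalone SPEC={spec}"
            for spec in atomize_specs(module_root)
        ],
        "teardown": ["rake spec_clean"],
    }
```

Because every atom is an independent command that only depends on the shared fixtures, the atoms can be handed to any number of workers in any order.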
Using the information above, the content for the clusterrunner.yaml file can be determined:
To understand what all the variables mean, please see the ClusterRunner job configuration docs.
Tying it all Together
Jenkins is used to hold the state of the build and ClusterRunner is the tool used to execute the unit tests. In addition, another open source tool, Bart, is used in the workflow to execute the Git Hooks and to interact with Jenkins. Bart has a robust Git Hooks framework that makes writing Git Hooks a painless task. It also provides a simple client to interact with Jenkins.
As the suite grows, the time to run it can be kept the same (as long as no test slower than the current slowest test is added) by adding more ClusterRunner slaves and/or by running more tests in parallel on each slave (configurable via the max_executors setting).
Nadeem Ahmad is a software engineer on the Ops Platform team at Box. For more engineering insights, follow along on his blog.