Metrics data is a key pillar of Observability at Box. Engineers rely heavily on service metrics to measure performance and debug issues. The Observability Team at Box provides instrumentation libraries that services use to produce metrics data; under the hood, these libraries use Codahale Metrics.
Histograms are an effective tool for measuring the distribution of values and are widely used at Box for measuring latencies, speeds, etc. Histogram and Timer metrics that need to produce statistical quantiles, e.g. the 95th percentile, must collect data points over a period of time and create snapshots of these quantiles when they are reported. These data points are collected in Reservoirs.
The metrics library at Box has been using ExponentiallyDecayingReservoir as the default reservoir. This Reservoir uses Cormode et al.'s forward-decaying priority sampling method to produce a statistically representative sampling reservoir, exponentially biased towards newer entries. The sampling aims to bias the reservoir to the past 5 minutes of measurements. However, the expected margin of error and the assumption that the data is normally distributed can cause significant inaccuracies in the percentiles.
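To get a feel for how slowly older samples fade under exponential decay, here is a minimal sketch. It assumes the decay factor alpha = 0.015 per second that the library documents as the default for ExponentiallyDecayingReservoir; the class and method names are hypothetical and exist only for illustration.

```java
/** Illustrative only: relative weight of older samples under exponential decay. */
final class ForwardDecayWeights {
    // Assumption: the library's documented default decay factor, per second.
    static final double ALPHA = 0.015;

    /** Weight of a sample that is ageSeconds old, relative to a sample recorded right now. */
    public static double relativeWeight(double ageSeconds) {
        return Math.exp(-ALPHA * ageSeconds);
    }

    public static void main(String[] args) {
        // Print the relative weight at a few ages of interest.
        for (int age : new int[] {0, 60, 120, 300}) {
            System.out.printf("age %3ds -> relative weight %.3f%n", age, relativeWeight(age));
        }
    }
}
```

Under this assumption, a sample that is one minute old still carries about 0.41 of the weight of a fresh sample, and only after about five minutes does the weight drop to roughly 0.01.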
Our metrics library reports Histograms by calling the getSnapshot() method at a fixed interval of 60s and sending the percentile values to our time series database. We use Wavefront as the backend, which provides efficient visualizations of the time series data and allows slicing and aggregation of metric data down to one-minute granularity. When data is reported every minute, the Exponentially Decaying Reservoir is not optimal because each report reflects measurements from roughly the past 5 minutes rather than only the last minute. Since the visualizations already show trends over any desired duration, at the end of every minute we would like to report only information pertaining to that minute.
Justin Mason's Weblog has discussed this problem in detail and presented a few viable solutions. However, none of them completely fits our architecture. One particularly interesting solution uses the new SlidingTimeWindowArrayReservoir, introduced in version 3.2.3 of Codahale Metrics. If we used a time window equal to our reporting interval of 60s, this reservoir could fix the slow decay issue of the Exponentially Decaying Reservoir. We did not fully adopt this alternative because it cuts through the abstractions our metrics library establishes between metrics creation and metrics reporting. The fact that we report metrics data every 60s is purely a property of our reporters, and having to specify this interval when creating Histograms is not ideal.
The problem also shows up in metrics that stop receiving updates. In the absence of new data points, the old data never decays and the Histograms keep reporting the last value of each metric. If that value was beyond an alert threshold, the corresponding alert would never resolve, causing disruptions. We see this frequently with services that use two instances, or sides, to serve requests. When the service switches sides, the metric for the older side never vanishes, as seen in the following chart of P99 latency when the service switches sides from blue to orange.
To address the issue for our use case, we designed our own Reservoir called ResetOnSnapshotReservoir. This reservoir creates a new UniformReservoir every time the reporter reads a snapshot to report metrics to our time series database. In our setup, the reservoir accumulates data for 60s, and when the reporter calls getSnapshot(), the existing reservoir is discarded and a new one is created for the next 60s window. We now use ResetOnSnapshotReservoir as the default reservoir for all Histograms and Timers.
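The reset-on-snapshot idea can be sketched without any dependencies. The following is not the open-sourced implementation: the class ResetOnSnapshotSketch and its methods are hypothetical stand-ins for the library's Reservoir interface, and the buffer keeps every value in the window rather than sampling the way UniformReservoir does.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

/** Dependency-free sketch of a reservoir that starts over after every snapshot. */
final class ResetOnSnapshotSketch {
    // Buffer for the current reporting window; replaced wholesale on each snapshot.
    private final AtomicReference<ConcurrentLinkedQueue<Long>> window =
            new AtomicReference<>(new ConcurrentLinkedQueue<Long>());

    /** Records one measurement in the current window. */
    public void update(long value) {
        // An update racing with a swap may land in the just-discarded buffer;
        // this sketch accepts that small loss.
        window.get().add(value);
    }

    /** Returns the current window's values in ascending order and starts a new, empty window. */
    public long[] snapshotAndReset() {
        ConcurrentLinkedQueue<Long> finished =
                window.getAndSet(new ConcurrentLinkedQueue<Long>());
        return finished.stream().mapToLong(Long::longValue).sorted().toArray();
    }

    /** Nearest-rank quantile over sorted values; 0.0 for an empty window (a choice made for this sketch). */
    public static double quantile(long[] sorted, double q) {
        if (sorted.length == 0) {
            return 0.0;
        }
        int rank = (int) Math.ceil(q * sorted.length);
        int index = Math.min(Math.max(rank - 1, 0), sorted.length - 1);
        return sorted[index];
    }

    public static void main(String[] args) {
        ResetOnSnapshotSketch reservoir = new ResetOnSnapshotSketch();
        for (long latencyMs : new long[] {40, 10, 30, 20}) {
            reservoir.update(latencyMs);
        }
        long[] firstWindow = reservoir.snapshotAndReset();
        long[] secondWindow = reservoir.snapshotAndReset(); // no updates arrived in between
        System.out.println("first window max: " + quantile(firstWindow, 1.0));
        System.out.println("second window size: " + secondWindow.length);
    }
}
```

Because each snapshot hands back only the values recorded since the previous one, a metric that stops receiving updates yields an empty window on the next report instead of repeating its last value.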
We are open sourcing this reservoir here.