At Box, our cloud service evolved from just a handful of application servers and databases into a high-scale, high-performance collaboration platform. Like most large-scale web companies, we depend on a distributed caching tier for frequently accessed data.
Box uses memcached (a distributed, in-memory key-value store) to serve billions of requests for frequently used data objects every day. However, we occasionally see issues where a single data object is requested at an unusually high rate, sometimes overwhelming the bandwidth capacity of the server storing the data. This tends to be caused by an overzealous background task, a change in application code, or atypical customer activity.
In the graph below, you can see bandwidth to one of several memcached servers has suddenly increased (brown line). The cause is a hot key: a single data object which is being requested at very high rates or has suddenly become very large. A hot key can impair a memcached server's ability to maintain high performance and serve all requests.
During events like this, it can be difficult to determine which key is responsible for the bandwidth spike. Unlike databases, many caches are designed to serve requests at such a high rate that request logs are not feasible. A different approach is needed to identify hot keys.
Today, Box is releasing memsniff, a robust, open-source traffic analyzer for memcached.
Memsniff inspects network packets on a memcached server and provides realtime statistics about individual keys: their size, request rate, and bandwidth used. This helps identify hot keys without impacting the memcached service.
Memsniff in action
Inspired by mctop from Etsy and memkeys from Tumblr, memsniff was built to be robust, efficient, and scalable. Memsniff can handle nearly all network packets (over 99.99%) under heavy traffic loads. Using Go's simple concurrency primitives, memsniff is able to keep up with production traffic without inordinate CPU or memory usage (see Performance).
$ go get github.com/box/memsniff
$ go build github.com/box/memsniff
In typical Go fashion, you will find a single statically-linked memsniff binary in your working directory, ready to be transferred to your memcached servers, or packaged in your distribution's preferred format.
Memsniff requires superuser privileges to capture network traffic on most operating systems. When capturing live traffic, the -i option is required and indicates the interface from which to capture. Example usage:
$ sudo memsniff -i eth0
Memsniff also has the ability to read from packet dumps captured with tcpdump:
$ sudo memsniff -r eth0.pcap
For further command-line options, see the memsniff GitHub page.
How memsniff works
- Raw packets are captured on the main thread from libpcap using GoPacket.
- Batches of raw packets are sent to the decode pool, where workers parse the memcached protocol looking for responses to GET requests. The key and size of the value returned are extracted into a response summary.
- Batches of response summaries are sent to the analysis pool where the stream is hash partitioned by cache key and sent to workers. Each worker maintains a hotlist of the busiest keys in its hash partition.
- In response to periodic requests from the UI, the analysis pool merges reports from all its workers into a single sorted hotlist, which is displayed to the user.
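The decode and analysis stages above can be illustrated with a minimal, single-threaded sketch. It is not memsniff's actual code: the type and function names (ResponseSummary, ParseValueHeader, Partition, Worker, MergeHotlist) are hypothetical, the FNV hash is an arbitrary choice, and the workers run sequentially rather than as goroutines. It relies only on the memcached text protocol, in which each value in a GET response is introduced by a header line of the form "VALUE <key> <flags> <bytes>".

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// ResponseSummary is a hypothetical record of one GET response.
type ResponseSummary struct {
	Key  string
	Size int
}

// KeyStats aggregates the traffic observed for a single key.
type KeyStats struct {
	Key      string
	Requests int
	Bytes    int
}

// ParseValueHeader extracts the key and value size from a text-protocol
// header line: "VALUE <key> <flags> <bytes> [<cas unique>]".
func ParseValueHeader(line string) (ResponseSummary, bool) {
	fields := strings.Fields(line)
	if len(fields) < 4 || fields[0] != "VALUE" {
		return ResponseSummary{}, false
	}
	size, err := strconv.Atoi(fields[3])
	if err != nil || size < 0 {
		return ResponseSummary{}, false
	}
	return ResponseSummary{Key: fields[1], Size: size}, true
}

// Partition maps a key to a worker index so that all traffic for one key
// reaches the same worker. workers must be greater than zero.
func Partition(key string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(workers))
}

// Worker tracks statistics for the keys in its hash partition.
// Assumption: it keeps every key, where a real tool would bound the list.
type Worker struct {
	stats map[string]*KeyStats
}

func NewWorker() *Worker { return &Worker{stats: make(map[string]*KeyStats)} }

// Observe records one response against its key.
func (w *Worker) Observe(r ResponseSummary) {
	s, ok := w.stats[r.Key]
	if !ok {
		s = &KeyStats{Key: r.Key}
		w.stats[r.Key] = s
	}
	s.Requests++
	s.Bytes += r.Size
}

// MergeHotlist combines all workers' stats into one list sorted by bytes
// (descending, ties broken by key) and returns at most n entries.
func MergeHotlist(workers []*Worker, n int) []KeyStats {
	var all []KeyStats
	for _, w := range workers {
		for _, s := range w.stats {
			all = append(all, *s)
		}
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Bytes != all[j].Bytes {
			return all[i].Bytes > all[j].Bytes
		}
		return all[i].Key < all[j].Key
	})
	if len(all) > n {
		all = all[:n]
	}
	return all
}

func main() {
	workers := []*Worker{NewWorker(), NewWorker(), NewWorker(), NewWorker()}
	lines := []string{"VALUE user:42 0 2048", "VALUE session:abc 0 128", "VALUE user:42 0 2048", "END"}
	for _, line := range lines {
		if r, ok := ParseValueHeader(line); ok {
			workers[Partition(r.Key, len(workers))].Observe(r)
		}
	}
	for _, s := range MergeHotlist(workers, 10) {
		fmt.Printf("%s requests=%d bytes=%d\n", s.Key, s.Requests, s.Bytes)
	}
}
```

Because each key hashes to exactly one worker, the per-worker lists never overlap, so the merge step only has to concatenate and sort them rather than reconcile duplicate entries.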
Performance
On a server running an Intel Xeon E5-2470 handling ~350,000 memcached requests per second, we observe:
- 4-5 cores utilized (~20 threads, ~20% CPU utilization each)
- Effectively 100% of packets handled
- The 99.99% figure that memsniff displays reflects a small, constant number of packets dropped at startup.
- During a hot key event where the NIC is effectively saturated, 99.9% of packets are handled.
- ~40 MB heap utilized
- ~100 MB RSS (tunable via GOGC)
- Typical GC pause: 0.6 ms
- Maximum GC pause: 2.0 ms
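Heap and GC pause figures like these are exposed by the Go runtime itself. The following is a minimal sketch of how any Go program could read them, assuming nothing about how the numbers above were actually collected. GCSummary and ReadGCSummary are hypothetical names, and debug.SetGCPercent is the in-process equivalent of setting the GOGC environment variable. RSS is not reported by the runtime and has to be read from the operating system.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// GCSummary holds a few figures taken from runtime.MemStats.
type GCSummary struct {
	HeapAlloc uint64        // bytes of allocated heap objects
	NumGC     uint32        // completed GC cycles
	MaxPause  time.Duration // longest pause among the recent cycles retained by the runtime
}

// ReadGCSummary snapshots the runtime's memory statistics.
func ReadGCSummary() GCSummary {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// PauseNs is a circular buffer of recent pause times; unused slots are zero.
	var maxNs uint64
	for _, ns := range m.PauseNs {
		if ns > maxNs {
			maxNs = ns
		}
	}
	return GCSummary{HeapAlloc: m.HeapAlloc, NumGC: m.NumGC, MaxPause: time.Duration(maxNs)}
}

func main() {
	// Same effect as running with GOGC=50: collect more often, use less memory.
	previous := debug.SetGCPercent(50)
	fmt.Println("previous GC percent:", previous)

	runtime.GC() // force one cycle so there is a pause to report
	s := ReadGCSummary()
	fmt.Printf("heap=%d bytes, cycles=%d, max recent pause=%v\n", s.HeapAlloc, s.NumGC, s.MaxPause)
}
```

A lower GOGC value trades CPU time for a smaller heap, which is the kind of tuning the RSS figure above refers to.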
Vision / Roadmap
We anticipate memsniff evolving in some of the following ways.
- TCP stream reassembly: get-miss tracking, binary protocol support, redis support
- Triggers (e.g. fire alerts when hot keys emerge)
- Automatic logging to disk when specified conditions are met (e.g. aggregate or single key traffic exceeds a threshold)
- Capability to restrict data collection to keys that match a filter
- Capture traces of individual request/response cycles
- Break out traffic by client IP
- Support non-default memcached server ports
- Support alternate sorting methods
- Support listening to traffic from multiple server ports simultaneously
- Support additional operations beyond GET
- View filtering
- Create a stable report format for output to disk
- Supply build support for common package formats (.deb, .rpm, …)
Want to contribute to memsniff? Visit our GitHub page for steps on how to develop memsniff.
Do you enjoy working on hard problems at scale? Join us at box.com/careers.