Container Networking: The Mystery of the Missing RSTs

As we posted in July, we recently began migrating our bare-metal infrastructure to Docker containers running under Kubernetes. Some pieces like filesystem namespaces and cgroups have worked quite well without incident, but other pieces like networking have resulted in subtle differences that show how containerization is not yet mainstream and can result in unexpected behavior.

Our metrics pipeline at Box includes an in-house Scala service, Castle, which consumes metrics from Kafka and sends them via a TCP socket connection to Wavefront Proxy, which forwards them to our time-series database service Wavefront (similar to Datadog, InfluxDB, etc.). WF Proxy is a simple Java service provided to us by Wavefront.

As part of our effort to migrate services to Kubernetes, WF Proxy was one of the first we chose. The resulting architecture looks something like this:

Blue boxes denote bare-metal machines. Yellow boxes are containers / pods within Kubernetes

A couple of nuances regarding Castle’s TCP connection handling:

  1. Castle holds multiple TCP connections to WF Proxy, one per Kafka topic that the Castle instance is responsible for, each running on its own thread.
  2. Castle establishes a TCP connection and holds it open for as long as it can, reconnecting only if an attempt to write to the connection raises an exception.
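
The connection-handling pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not Castle's actual code (Castle is written in Scala and its source is not shown here); the class name `ReconnectingSender` and its methods are invented for this example. The point it illustrates is that a sender which never reads has exactly one way to learn that its peer is gone: a write that throws.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Castle's actual code: one long-lived connection
// that is replaced only when a write throws.
final class ReconnectingSender implements AutoCloseable {
    private final String host;
    private final int port;
    private Socket socket;     // null until the first connect
    private int connectCount;  // how many times a connection was opened

    ReconnectingSender(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Writes one line, reconnecting once if the existing connection fails.
    public synchronized void send(String line) throws IOException {
        byte[] payload = (line + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            write(payload);
        } catch (IOException writeFailed) {
            // A failed write is the only signal this sender ever gets that
            // the peer is gone, because it never reads from the socket.
            close();
            write(payload);
        }
    }

    // Opens the connection lazily, then writes and flushes the payload.
    private void write(byte[] payload) throws IOException {
        if (socket == null) {
            socket = new Socket(host, port);
            connectCount++;
        }
        OutputStream out = socket.getOutputStream();
        out.write(payload);
        out.flush();
    }

    public synchronized int connectCount() {
        return connectCount;
    }

    @Override
    public synchronized void close() {
        if (socket != null) {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing a dead socket fails.
            }
            socket = null;
        }
    }
}
```

With a design like this, a write that is silently accepted by the local TCP stack looks like success to the application, which is why the behavior described next went unnoticed by Castle itself.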

Under Kubernetes, a rolling deployment of WF Proxy involves shutting down the existing Docker containers and replacing them with containers running the new version one at a time. As this is happening, the containers register and deregister with the Octoproxy load balancer dynamically. (Octoproxy is our internal fork of service-loadbalancer.) Our expectation was that as the old containers shut down, attempts by Castle to write to TCP connections to these containers would fail, triggering Castle to re-establish the connection with the new containers.

But what we noticed was that only one or two of the TCP connections held by Castle would raise an exception and get closed. The remaining connections would continue accepting data for 10-15 minutes before finally raising exceptions. For this duration, these connections were essentially a black hole, as the container they connected to no longer existed.

Understanding TCP
Packet captures on the machine running Octoproxy showed FIN packets being sent to all TCP connections when WF Proxy was shut down, but RST packets being sent only for a few of the connections held by Castle - these were the connections that were terminating properly. The remaining connections would not receive RST packets and as a result, Castle would continue writing to them without an exception being raised.

Understanding what was happening here requires us to get our hands dirty with the nitty-gritty of TCP, specifically how a TCP connection is terminated.

A TCP connection is established as two one-way byte streams, for communication in either direction.

Castle writes data to WF Proxy over Stream 1, and if it were to read from WF Proxy, it would be sent data over Stream 2. When WF Proxy shuts down, the TCP specification requires a FIN packet to be sent over Stream 2, indicating that no more data will be sent by WF Proxy. This was being done correctly as reflected in our packet captures, as well as the fact that all connections on the Castle machine were transitioning to state CLOSE_WAIT indicating a FIN packet had been received.

On Castle Machine

In our particular scenario, Castle never reads data from WF Proxy as the communication is exclusively one-way. This means that while the FIN packet was received by the machine running Castle, it is not received by the Castle application, and the shutdown of WF Proxy is unknown to Castle. This puts the connection in a half-open state, which is defined in the TCP specifications (RFC 793) [1]:

“An established connection is said to be "half-open" if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other [...]”

RFC 793 recognizes this as a situation that should be unusual, but is not an abnormal state to be in. In a case where client A only writes to client B, this situation is the inevitable result of a shutdown by client B. RFC 793 proceeds to detail the recovery process from this state [1]:

“If at site A the connection no longer exists, then an attempt by the user at site B to send any data on it will result in the site B TCP receiving a reset control message. Such a message indicates to the site B TCP that something is wrong, and it is expected to abort the connection.”

The reset control message, indicated by an RST packet, is exactly what Castle relies on to know that WF Proxy has gone away. The reset control message manifests as a Broken pipe error in Java triggering a re-connection. However, as evidenced by our packet captures and the absence of exceptions in Castle, these RSTs were not coming through reliably.
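
To make the earlier point about the FIN concrete, here is a minimal loopback sketch of how an orderly close surfaces to a Java application. The class name `FinVisibilityDemo` is invented for this example, and a local `ServerSocket` stands in for WF Proxy. The peer's FIN only becomes visible when the application reads: `read()` returns -1 for end of stream. An application that only ever writes, as Castle does, never makes that call and so never sees the FIN.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical loopback demo: the server side stands in for WF Proxy.
final class FinVisibilityDemo {

    // Returns what read() reports after the peer has closed its socket.
    static int readAfterPeerClose() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Orderly release: accept the connection and close it right away,
            // which sends a FIN to the client.
            server.accept().close();
            // Only a read surfaces the FIN to the application, as end of stream.
            return client.getInputStream().read();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close returned " + readAfterPeerClose());
    }
}
```

Having a write-only client poll `read()` just to notice a FIN would be one application-level workaround, but it changes the client; the rest of this post looks at why the RST that should have covered this case went missing.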

Through further experimentation, and based on a hunch we had, we enabled host networking on the WF Proxy container. This means that Docker will not create a virtual network interface for the container and will instead reuse the host's networking interface. The advantage is that the networking stack is less complex and more efficient, but the disadvantage is that the container must share the host's port space with every other container on that host, which goes against Kubernetes' networking philosophy of network isolation between containers.

As soon as we enabled host networking, the issue went away - RSTs would now come through reliably for all connections and Castle would re-establish all of them. So the issue lies somewhere within the Docker networking stack, and it’s time for us to roll up our sleeves and dive into that.

Networking in Docker
Docker’s default networking mode operates by creating a virtual network interface, typically called docker0. Docker then creates a pair of peer interfaces for each container - one in the container’s network namespace, called eth0, and the other in the host machine’s network namespace, called something random like vethAQI2QT. docker0 serves as a virtual ethernet bridge, forwarding packets from the host machine’s ethernet interface to the container’s virtual interface, vethAQI2QT, which is “peered” with eth0 within the container.

Interfaces in blue are actual interfaces. Yellow ones are virtual interfaces

So what’s going on?
With a sufficient understanding of TCP and Docker networking, we were finally able to put the pieces of the jigsaw together and see what was going on. When the WF Proxy containers were shut down, Docker would clean up the associated virtual network interface, vethAQI2QT. In a traditional environment where network interfaces are static while services are ephemeral, the TCP stack on the host machine would return RST packets when it receives an incoming TCP message on a port that no service is listening on. However, what’s the right thing to do when the interface itself is ephemeral?

We ran a packet capture on the docker0 interface to see what was happening during and after the shutdown of the container. The capture shows that after the shutdown handshake (lines 10-11) another message is received with no response being sent out. This triggers several attempts at the sender’s side at retransmitting the packet.

Packet capture on the docker0 interface

Packets that are received by docker0 cannot be routed further since the virtual interface it bridges to no longer exists. In this situation docker0 cannot, with certainty, say that the receiving peer is no longer listening. It is possible that the virtual interface disappearance was caused by a temporary blip in the network and the interface will return, and so it does not respond with an RST. The sender, after not receiving an ACK for a certain duration, attempts a retransmission. This continues until the retransmission limit is hit at which point the sender considers the receiving host to be “down” and tears down the connection, unblocking Castle.
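
As a rough sanity check on the 10-15 minute delay, the retransmission budget can be estimated with a little arithmetic. The numbers below are assumptions, not measurements from our hosts: they are the commonly cited Linux defaults of 15 retransmissions (`tcp_retries2`), a minimum retransmission timeout of 200 ms and a maximum of 120 s, with the timeout doubling after each attempt. The class name `RetransmissionBudget` is invented for this example.

```java
// Hypothetical estimate of how long a sender keeps retransmitting before
// giving up, under exponential backoff with a capped timeout.
final class RetransmissionBudget {

    // Sums the timeout of the original send plus each retransmission.
    static double totalSeconds(int retries, double minRtoSeconds, double maxRtoSeconds) {
        double total = 0.0;
        double rto = minRtoSeconds;
        for (int attempt = 0; attempt <= retries; attempt++) {
            total += Math.min(rto, maxRtoSeconds); // the timeout is capped at the maximum
            rto *= 2;                              // exponential backoff
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed Linux defaults: 15 retries, 200 ms minimum, 120 s maximum.
        double seconds = totalSeconds(15, 0.2, 120.0);
        System.out.printf("%.1f seconds (about %.1f minutes)%n", seconds, seconds / 60.0);
    }
}
```

Under those assumptions the total comes to roughly 924.6 seconds, or a little over 15 minutes. Real timeouts are derived from the measured round-trip time, so the effective figure varies, but it is in the same range as the delay we observed.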

The Connection Release Process
The root cause of the issue here is that WF Proxy and Castle do not negotiate a complete shutdown handshake before the Docker container and the virtual interface are torn down.

WF Proxy initiates what is known as an “orderly connection release” by sending a FIN packet at the time of closing the socket. A FIN indicates that WF Proxy will no longer send any data to Castle, but it does not indicate that WF Proxy is no longer reading data. As described by Java’s document on Connection release [3]:

“The problem is a slight mismatch between the semantics of Socket.close() and the TCP FIN message. Sending a TCP FIN message means "I am finished sending", whereas Socket.close() means "I am finished sending and receiving."”

As per the TCP spec, neither Castle nor the TCP stack on Castle’s host should interpret this FIN to mean that WF Proxy is no longer reading data as well. In a regular environment with static network interfaces, the subsequent attempt to write data by Castle will cause an RST to be sent by the TCP stack indicating that Castle should close the connection.

So how do we fix it?
After a week of hair-pulling investigation, learning about TCP, Docker networking, Java sockets and what-not, we were finally able to understand the root cause and reliably reproduce it. But that raises the next question: how do we fix it? After all, that’s what sent us down this rabbit hole in the first place.

For the short term, we fixed the issue by requesting Wavefront to allow us to configure WF Proxy to do “abortive connection releases” [3]. An abortive connection release by WF Proxy will send an RST packet to Castle instead of a FIN, indicating that WF Proxy will no longer read or write data on the connection. On receipt of the RST, Castle will immediately close the connection and re-establish a new one. This is done by using the method `Socket.setSoLinger()`.
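
A minimal loopback sketch of an abortive release is shown below. It is not WF Proxy's code; the class name `AbortiveReleaseDemo` is invented, and a local `ServerSocket` again stands in for the proxy. The relevant call is `setSoLinger(true, 0)`: with linger enabled and a timeout of zero, `close()` resets the connection instead of performing the orderly FIN exchange. The demo observes the reset through a read because that is deterministic on loopback; a write-only client like Castle would see it as a failed write instead.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

// Hypothetical loopback demo: the server side stands in for WF Proxy.
final class AbortiveReleaseDemo {

    // Returns true if the client observes a reset rather than end of stream.
    static boolean peerSeesReset() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            Socket accepted = server.accept();
            // Linger on with a zero timeout: close() sends an RST, not a FIN.
            accepted.setSoLinger(true, 0);
            accepted.close();
            try {
                client.getInputStream().read();
                return false; // end of stream would mean an orderly release
            } catch (SocketException reset) {
                return true;  // typically reported as "Connection reset"
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("client saw reset: " + peerSeesReset());
    }
}
```

The trade-off of an abortive release is that any data still in flight on the connection is discarded, which is acceptable here only because the proxy is shutting down anyway.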

In the long run, we want to address this issue in a way that abstracts the underlying cause of disappearing interfaces away from applications. We’re currently investigating adding an iptables rule that can be configured to return RSTs for requests to containers that have been torn down and no longer exist.

Wrapping it up
Because of how Kubernetes and Docker implement networking, what was once a rare occurrence (the complete and sudden disappearance of a network interface) is now commonplace. As a result, we need to be sure that our software can handle this edge case gracefully.

Of course, using a higher-level protocol like HTTP would also be a great way to handle situations like this, but that’s not always feasible. In our case, due to the high volume of data sent, we did not want to take on the additional overhead of HTTP.

[1] RFC 793: Transmission Control Protocol Specification
[2] Docker Networking
[3] Orderly Versus Abortive Connection Release in Java

Note: In [3], the document erroneously refers to the method as Socket.setLinger() when the method is actually Socket.setSoLinger().