
Investigating TCP Self-Throttling Triggered Overload

April 25, 2022 | By Kuba Kiciński

This blog post explains how Meta’s Kernel team debugged and fixed a high request completion time issue caused by CPU overload.

Our storage stack recently hit a problem where network processing would use 100% of a CPU and yet not be able to send out enough traffic to meet the application’s needs. We, the Kernel team at Meta, traced this problem to a particular Network Interface Controller (NIC) design choice and excessive latencies causing TCP to spread out transmissions into small frames instead of using NIC segmentation offload to its full potential. As a result, we decided to limit the impact of TCP pacing on frame sizes by increasing the net.ipv4.tcp_min_tso_segs value using sysctl. Long term, starting with Linux 5.18, TCP will automatically allow full TCP segmentation offload (TSO) bursts for in-datacenter connections.

The Event

Our storage team notified us about request completion times skyrocketing. It mostly affected machines with a particular NIC. Affected hosts are configured to use 8 NIC queue sets, so 8 CPUs can perform the packet processing. Looking at affected machines, we noticed that a portion of the CPUs handling network processing was exclusively running the kernel’s software interrupts (softirqs) and stuck at 100%, while the other cores dealt with networking just fine, using only 20-40% of CPU for softirqs.

This overload state is sticky and can last for hours, unlike normal network processing spikes, which usually resolve as soon as traffic goes down. Looking at atop, we observed that the overload starts when traffic spikes (which seems to happen once an hour for the app), at which point all CPUs process networking at 100%. The problem is that some remain at 100% for a long time (usually one, although we saw multiple CPUs stuck as well).

We also observed packets waiting to be sent on the affected cores (backlogged in the Linux Qdisc), leading to drops at the Qdisc level.

The problem would generally get fixed if we touched any of the NIC settings. NIC setting changes, however, also change the Rx hashing key as a side effect in our case, which changes TCP flow to CPU mapping. So we were not sure if it was a reshuffling of flows or a NIC reset that fixed the situation.

NIC Completion Queues

Since the problem is a CPU overload, our first instinct was to increase the number of NIC queues and spread the load across more CPUs. This proved not to be sufficient: a few days after the increase, we found machines with more rings that exhibited the same problem. We turned our attention to the fact that machines most affected by the problem had packets stuck in the egress Qdisc, even though the transmit (Tx) rate was nowhere near the line rate.

To explain this, we need to look at the design of the queues used for sending packets between the host and the NIC. Each network processing core uses a set of queues. There would be only two queues in a set in very old NICs—one for receive (Rx) and one for Tx. Modern NICs create separate queues for sending notifications back to the host (notifications meaning “packet was sent” and “packet was received”). This allows the host to learn about the events by reading from its local memory, or hopefully even the L3 cache, rather than reading NIC registers which would be very slow.

NICs may either have a separate queue for Rx and Tx events (image on the left) or use just a single notification queue (image on the right):

In the configuration with a single completion queue, Tx completions can get stuck behind Rx traffic, which takes more time to process. Under normal circumstances, this doesn’t matter much because the CPU can drain the completion queue and process all events. However, when the CPU is 100% busy, it can’t keep up with the completions and processes only a portion of them. As a result, even though all the Tx packets have already been sent, the CPU will not know about it until it gets to the corresponding notifications, and those notifications are spread out through the completion queue, interleaved with Rx packets.

In other words, a single completion queue imposes a form of “fairness” between Rx and Tx. This is a problem for this particular storage workload, which wants to send much more traffic than it receives. It’s a generally accepted best practice in NIC/network stack design to prioritize Tx.
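
To make the effect concrete, here is a minimal C-style sketch (all names are illustrative; this is not a real driver) of a NAPI poll pass over a shared completion queue. Once the per-pass budget is spent on expensive Rx entries, Tx completions further down the queue remain unprocessed even though those packets left the wire long ago:

enum cqe_type { CQE_INVALID, CQE_RX, CQE_TX };

struct cqe {
    enum cqe_type type;
    /* descriptor index, length, flags, ... */
};

void handle_rx(struct cqe *e);    /* expensive: build the skb, run the stack */
void handle_tx(struct cqe *e);    /* cheap: free the already-sent buffer */

/* One NAPI poll pass over a shared completion queue. */
int poll_shared_cq(struct cqe *cq, int budget)
{
    int done = 0;

    while (done < budget && cq[done].type != CQE_INVALID) {
        if (cq[done].type == CQE_RX)
            handle_rx(&cq[done]);
        else
            handle_tx(&cq[done]);
        done++;
    }
    /* If the budget is spent chewing through Rx entries, the Tx completions
     * behind them stay unseen until a later poll pass. */
    return done;
}

With split completion queues, the driver can process Tx completions regardless of how much Rx work is pending, which matches the Tx-prioritization best practice mentioned above.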

Configuring separate completion queues for this particular NIC design comes at the cost of halving the number of total available queues.

Problem Generalized

To easily reproduce the problem, we created an experiment—a dozen machines with the number of queues constrained to just 4. The storage team had also invested in creating a new metric tracking the highest softirq load across cores. This allowed us to easily catch machines with single cores stuck at 100% busy even if the application metric did not go dangerously high. Unfortunately, we found that even with a well-configured NIC, the TCP misbehavior could still be reproduced.

Looking at the traffic stats of the overloaded queue (blue line) vs. other queues, we observed:

The overloaded core sends much smaller TSO super-frames. It sends a little more traffic in terms of bytes (~33% more) than the other queues, but it does so using 7 times the number of packets. Since network processing cost is primarily a function of the number of frames, not bytes, we now know why the core is at 100%: it generates smaller packets.

Standing Queue

To understand why the core produces tiny TSO frames rather than sending full 64kB chunks of data down to the NIC, we need to understand a bit about TCP in the data center.

Traditional TCP aims at achieving a steady state called “ACK clocking:”

Let’s assume the application has a lot of data to send and network latency is relatively high. It’s the job of the congestion control algorithm to figure out “how many packets to put in the network” (i.e., the congestion window or SND_CWND in TCP parlance), and then send a new packet every time an ACK for a previous packet comes back. ACKs are the “go” signal for the next send, hence “ACK clocking.”
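
As a toy model of that steady state (not kernel code; the flow structure and helper functions below are invented for the illustration), each ACK frees one window slot and immediately becomes the trigger for the next transmission:

/* Toy model of ACK clocking: ACK arrivals pace the sender. */
struct flow {
    int in_flight;    /* packets currently in the network */
    int cwnd;         /* congestion window chosen by the CC algorithm */
};

int have_data(struct flow *f);            /* does the app have more to send? */
void send_next_packet(struct flow *f);    /* hand one packet to the NIC */

void on_ack_received(struct flow *f)
{
    f->in_flight--;                  /* the ACKed packet has left the network */
    while (f->in_flight < f->cwnd && have_data(f)) {
        send_next_packet(f);         /* the ACK is the "go" signal */
        f->in_flight++;
    }
}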

Unfortunately, in reality, the ACK signal is not very accurate. The ACKs often get “compressed” to consume less uplink bandwidth in the Internet wilderness. In the data center, the workloads are bursty (RPC vs continuous stream) and the network’s latencies are far lower than the software processing noise. Modern TCP tries to estimate the correct send rate and send out the traffic at that rate without relying on ACK timing. Maintaining a good, well-paced send rate was a cornerstone of the Bottleneck Bandwidth and Round-trip propagation time (BBR) algorithm, but today it’s used for all Linux TCP traffic to a greater or lesser extent.

We are back to the storage workload’s problem. Here is the histogram of pacing rates TCP has computed for flows terminating on good cores vs. the bad core (in bytes per second):

@[good cores]:
[2M, 4M)         1954 |@@@@                                                |
[4M, 8M)         4188 |@@@@@@@@                                            |
[8M, 16M)        4887 |@@@@@@@@@@                                          |
[16M, 32M)       6778 |@@@@@@@@@@@@@@                                      |
[32M, 64M)       8128 |@@@@@@@@@@@@@@@@                                    |
[64M, 128M)     12713 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[128M, 256M)    25025 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256M, 512M)     9010 |@@@@@@@@@@@@@@@@@@                                  |
[512M, 1G)       5649 |@@@@@@@@@@@                                         |
[1G, 2G)         5418 |@@@@@@@@@@@                                         |
[2G, 4G)        11199 |@@@@@@@@@@@@@@@@@@@@@@@                             |

@[bad cores]:
[16K, 32K)       5671 |@@@                                                 |
[32K, 64K)       5273 |@@@                                                 |
[64K, 128K)     17296 |@@@@@@@@@@@                                         |
[128K, 256K)    54935 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[256K, 512K)    60336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[512K, 1M)      79204 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)        33225 |@@@@@@@@@@@@@@@@@@@@@                               |
[2M, 4M)        13774 |@@@@@@@@@                                           |
[4M, 8M)         8349 |@@@@@                                               |
[8M, 16M)        4222 |@@                                                  |

The pacing rates computed for flows on the bad core are abysmal.

The Linux TCP stack tries to send at least one packet per millisecond, which means that to send a TCP Segmentation Offload (TSO) frame with more than 2 MSS worth of data, we need a pacing rate of at least 3MB/s (roughly 2 x 1.5kB every millisecond). The pacing rate is computed as the Congestion Window (CWND) divided by the Round-Trip Time (RTT), so let’s look at the RTT (in usec):

@good
[128, 256)       2272 |@@                                                  |
[256, 512)        195 |                                                    |
[512, 1K)       17187 |@@@@@@@@@@@@@@@@@                                   |
[1K, 2K)        50020 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)        19803 |@@@@@@@@@@@@@@@@@@@@                                |
[4K, 8K)        12951 |@@@@@@@@@@@@@                                       |
[8K, 16K)        7616 |@@@@@@@                                             |
[16K, 32K)       5089 |@@@@@                                               |
[32K, 64K)       2765 |@@                                                  |
[64K, 128K)      1173 |@                                                   |
[128K, 256K)       96 |                                                    |

@bad
[512K, 1M)      82845 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)        51511 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |

The RTT confirms our hunch that CPU overload leads to excessive queuing. The extent of the queuing is mind-blowing. 1M usecs is 1 second of queuing delay in a data center network. However, upon reflection, when we consider the relatively low rate of packets sent and received, it seems entirely plausible.

We see 2k-4k TSO frames stuck in the Qdisc. With a sending rate of only tens of thousands of frames per second (the per-core rate is inversely proportional to the number of queues), packets can spend 100ms+ just waiting to be sent; for example, 3,000 queued frames at 30,000 frames per second is 100ms of waiting. The situation is even more dire on the receive path. Our NICs are configured with relatively long Rx and Tx queues to absorb traffic bursts or temporary CPU overloads: there are a few thousand entries on the Rx rings. Again, that is hundreds of milliseconds worth of packets at the relatively low packet rates at which these servers operate.

This is a textbook example of the bufferbloat problem that has been discussed at length in the last decade. The queue should absorb bursts, but not stay filled.

You Say Bufferbloat, I Say CoDel

This section describes a complete tangent, so if you’re getting tired, skip to the next section. The solution to bufferbloat is well known: smaller queues or, better still, actively managing the length of the queue. CoDel is the poster child of queue management that prevents bufferbloat.

NICs don’t provide us with much in the way of active queue management (AQM), not even simple Random Early Detection (RED). Even the industry’s newest NICs we’re aware of implement AQM at the wrong point:

We can, however, execute the policy after we pick the packet up from the queue. In a sense, that may be closer to CoDel’s intent. We can drop or mark the packets quite efficiently thanks to XDP, Linux’s BPF hook that lets us filter raw buffers coming from the NIC before any processing has been invested in them.

To implement CoDel, we need the sojourn time of the packets. Using timing information to control packet flow and prevent network congestion is a very hot topic that multiple teams at Meta are working on. Here we want to know how long the packet stayed in the host queue. NICs have their own clock that can tell us when a packet arrived from the wire with nanosecond precision (as vendors promise). You may have heard of these clocks in the context of PTP (Precision Time Protocol), another important project at Meta. To get the sojourn time, we have to compare the NIC arrival time to the current time when the CPU processes the packet.

Two challenges that we need to solve are:

  • XDP doesn’t have access to the packet timestamp (yet).
  • Time is relative; the NIC clock is not the same clock as the host clock.

We can solve these problems by attaching a few extra kprobes:

  • kprobe (2) runs on every received packet on its way to the stack, where all timestamps are easily accessible to BPF. It calculates the delay as:
      delay = bpf_ktime_get_ns() - convert(skb->hwtstamp);

    It saves the packet delay to the per-queue BPF map which is then used by the XDP logic.

  • XDP AQM logic makes its decisions based on the timing of the previous packet. If the previous packet had been in the queue for longer than the desired threshold (10ms), we can mark or drop the current one. In case of drops, the packet will not reach kprobe (2), where its delay would be measured, which means we may keep acting on stale information. Luckily, the CUBIC traffic (which we drop rather than mark with ECN) is usually interleaved with Data Center TCP (DCTCP) traffic (which we mark), so in practice this doesn’t matter. And, like any good AQM, we deploy an averaging algorithm to act on the recent history of the delay rather than just the delay of the most recent frame. A minimal sketch of this XDP logic follows the list.
  • kprobe (1) runs every 64 packets. It’s hooked into the New API (NAPI) scheduling algorithm in Linux, where Linux decides which NIC queue to service next. The exact attachment point is not relevant; it just needs to run periodically. The job of this BPF program is to estimate the parameters for the NIC -> host time conversion.
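
For illustration, here is a minimal sketch of what the XDP side of this AQM could look like. It is not the production program: the map layout, queue count, and fixed 10ms threshold are assumptions, the averaging logic is omitted, and the per-queue delay is assumed to be published by kprobe (2) as described above.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define DELAY_TARGET_NS (10ULL * 1000 * 1000)   /* 10ms sojourn target */

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 64);                    /* one slot per Rx queue */
    __type(key, __u32);
    __type(value, __u64);                       /* last measured delay, in ns */
} queue_delay SEC(".maps");

SEC("xdp")
int aqm_sketch(struct xdp_md *ctx)
{
    __u32 q = ctx->rx_queue_index;
    __u64 *delay = bpf_map_lookup_elem(&queue_delay, &q);

    /* Act on the delay measured for an earlier packet on this queue. */
    if (delay && *delay > DELAY_TARGET_NS)
        return XDP_DROP;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

In the real setup, DCTCP traffic would be ECN-marked rather than dropped, and the decision would be based on an averaged recent delay as described above.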

The above logic works pretty nicely, but it doesn’t solve the problem. The following table displays the output from a machine that is not overloaded; the control app prints stats every second. There are only two queues in this test. The first two lines of the output are per-queue stats, followed by a “total” line that shows packets per second for the entire system, then a report from the next cycle, and so on.

XDP-cpumap  CPU    pps      drop  expt  lo   NAPI   busy   q(ms)  ACK  CE   nECT
XDP-RX        5    241,611     0     0  10   4,289  3,055   0.06    0    0     0
XDP-RX       11    215,923     0     0  23   7,166  2,347   0.07    0    0     0
            total  457,535     0
XDP-RX        5    228,791     0     0  36   6,177  2,582   0.06    0    0     0
XDP-RX       11    245,263     0     0   5   5,932  2,923   0.08    0    0     0
            total  474,055     0
XDP-RX        5    273,172     0     0   4   2,653  3,779   0.30   75  776     0
XDP-RX       11    255,099     0     0   9   3,767  3,277   0.42  197  514     0
            total  528,272     0
XDP-RX        5    259,929     0     0  13   4,340  3,293   1.53    0    0     0
XDP-RX       11    212,942     0     0  13   6,876  2,491   0.07    0    0     0
            total  472,872     0
XDP-cpumap  CPU    pps      drop  expt  lo   NAPI   busy   q(ms)  ACK  CE   nECT
XDP-RX        5    208,368     0     0  15   5,824  2,373   0.08    0    0     0
XDP-RX       11    213,279     0     0  13   7,622  2,138   0.06    0    0     0
            total  421,648     0
XDP-RX        5    195,566     0     0   4   6,518  2,059   0.18    0    0     0
XDP-RX       11    182,328     0     0  20   9,716  1,643   0.05    0    0     0
            total  377,894     0
XDP-RX        5    194,545     0     0  13   3,417  2,423   0.07    0    0     0
XDP-RX       11    195,263     0     0  21   7,508  1,958   0.09    0    0     0
            total  389,809     0
XDP-RX        5    215,466     0     0   6   5,484  2,435   0.09    0    0     0
XDP-RX       11    219,894     0     0  14   7,608  2,404   0.05    0    0     0


In the above table, NAPI is the number of NAPI loops that managed to drain the completion queue, and busy is the number of NAPI loops that read 64 packets and returned to the NAPI scheduler even though the completion queue was not fully drained. We can see that with just 2 queues enabled on the system, our wimpy storage CPU had many completions to process, and busy gets reported quite a lot. The softirq CPU load for the networking cores was at 70% here.

We can see CoDel in action in the third reporting interval of the above table, where the CE column is nonzero: there is a spike of traffic (528k packets received). Somewhere during that period, latency rose above our target of 10ms and packets got marked (CE counts marked packets; drop would count dropped ones, but there was no CUBIC traffic during the bad period).

After waiting a bit, the machine with just 2 queues gets overloaded, and we have one CPU (CPU 11 in this case) spinning at 100%:

XDP-cpumap  CPU    pps     drop  expt  lo   NAPI    busy   q(ms)   ACK    CE    nECT
XDP-RX        5    35,900     0     0  40   13,676   117    0.07       0     0     0
XDP-RX       11    63,773    71     0   0        0   996   96.01   55081  7261  1430
            total  99,673    71
XDP-RX        5    31,213     0     0  54   12,567   124    0.02       0     0     0
XDP-RX       11    54,624    65     0   0        0   853   77.22   47038  6327  1259
            total  85,837    65
XDP-RX        5    28,804     0     0  36   12,896   110    0.02       0     0     0
XDP-RX       11    56,384    61     0   0        0   881   82.95   48663  6477  1244
            total  85,189    61
XDP-RX        5    21,727     0     0  42   13,005     0    0.04       0     0     0
XDP-RX       11    51,827    61     0   0        0   809   72.17   43349  5767  1185
            total  73,554    61


Because the CPU is spinning, the NAPI column (completion queue fully drained) is 0, the queue is “standing,” and all NAPI loops end as busy. XDP CoDel calculated the queue delay to oscillate around 80ms. Keep in mind that the max delay of a standing queue is inversely proportional to the load, so with 16 queues, by rough math, this number could be 8 times larger.

The XDP CoDel implementation prevents queue buildup and does help quite a bit in removing Rx queue overflow drops:

Why is the Queue Standing

The XDP experiment was a partial success. It did not fix the overload, but it revealed why the queue is stable: most of the packets coming to the machine are ACKs. Recall the ACK clocking section: ACKs that we receive will cause a new transmission to be scheduled and, subsequently, a new ACK to take a spot in the Rx queue.

In other words, the queue is ”standing” and doesn’t get drained most likely because the workload is Tx-heavy. Unfortunately, we can’t do much to fix a standing queue full of ACKs using an AQM like CoDel, because congestion control algorithms do not directly control the ACKs.

The Congestion Window

Back from the queue tangents and to the important question—why are we sending tiny packets instead of full TSO frames? Full TSO frames are cheaper to process at the sender and generate fewer ACKs.

My naive understanding of TCP congestion control suggests that the increase in queueing delay should be compensated for by an increase in the congestion window. It happens, but not to an extent large enough to compensate for the latency. Here is the histogram of SND_CWND (in segments):

@[CUBIC, good cores]:
[4, 8)            218 |@@                                                  |
[8, 16)          4136 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[CUBIC, bad core]:
[4, 8)            444 |@@@@@@                                              |
[8, 16)          3452 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)         1995 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32, 64)         3417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128)        2603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[128, 256)       2229 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[256, 512)       1454 |@@@@@@@@@@@@@@@@@@@@@                               |

@[DCTCP, good cores]:
[4, 8)           2868 |@@@@                                                |
[8, 16)         35148 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)        16200 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[32, 64)        12664 |@@@@@@@@@@@@@@@@@@                                  |
[64, 128)        8250 |@@@@@@@@@@@@                                        |
[128, 256)       4295 |@@@@@@                                              |
[256, 512)       2509 |@@@                                                 |
[512, 1K)        1001 |@                                                   |
[1K, 2K)          638 |                                                    |

@[DCTCP, bad core]:
[4, 8)           4455 |@@@@@                                               |
[8, 16)         43640 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)        28767 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[32, 64)        35249 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[64, 128)       12776 |@@@@@@@@@@@@@@@                                     |
[128, 256)       8932 |@@@@@@@@@@                                          |
[256, 512)       2648 |@@@                                                 |
[512, 1K)        1634 |@                                                   |

The windows do grow, but they are nowhere near large enough for the pacing rate (i.e., SND_CWND divided by RTT) to reach efficient TSO use. They rarely reach 100 packets, which covers only 25ms of latency if we want to send TSO frames of just 6kB (100 packets is roughly 145kB, and 145kB over a 25ms RTT is about 6MB/s, i.e., 6kB per millisecond). This is likely because the congestion control algorithm is conservative, trying to avoid causing bufferbloat in the network. What might also play a role is that these storage flows are relatively short-lived; they seem to terminate after transferring only 10MB of data.

The Fix

As mentioned, raising the congestion window too aggressively in response to latency can itself induce bufferbloat. Tuning the congestion control algorithms could be an option, but it requires careful research and testing.

We need to counteract Linux’s tendency to smooth out the traffic too much, and for that we have the tcp_min_tso_segs variable, which we can control using sysctl.

tcp_min_tso_segs - INTEGER

Minimal number of segments per TSO frame. Since linux-3.12, TCP does an automatic sizing of TSO frames, depending on flow rate, instead of filling 64Kbytes packets. For specific usages, it's possible to force TCP to build big TSO frames. Note that TCP stack might split too big TSO packets if available window is too small.
Default: 2

Note that the tcp_min_tso_segs parameter does not override SND_CWND. If the sender is CWND-constrained, it will still generate small frames. Luckily, in our tests on overloaded storage boxes there’s usually enough headroom to generate a larger frame.
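
To show where this knob bites, here is a simplified, standalone sketch of how Linux sizes TSO frames from the pacing rate (roughly what tcp_tso_autosize() does; the helper below is an approximation for illustration, not the kernel function): the stack budgets about 1ms worth of bytes per frame, and tcp_min_tso_segs acts as the floor.

/* Simplified approximation of Linux's TSO autosizing: target roughly 1ms
 * worth of data per TSO frame, but never fewer than min_tso_segs segments.
 * The real kernel code also caps the result by the GSO limits, and the
 * congestion window can still shrink the frame further. */
static unsigned int tso_segs_sketch(unsigned long pacing_rate_bytes_per_sec,
                                    unsigned int mss,
                                    unsigned int min_tso_segs)
{
    unsigned long bytes_per_ms = pacing_rate_bytes_per_sec / 1000;
    unsigned int segs = bytes_per_ms / mss;

    return segs > min_tso_segs ? segs : min_tso_segs;
}

With the pacing rates seen on the bad core (well under 3MB/s), the default floor of 2 segments dominates; raising the floor forces bigger TSO frames despite the low computed rate.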

The TCP stack has other parameters which make the pacing rate more aggressive. However, increasing the floor of the TSO packet size seems like a good fit for our problem. In the experiment with NICs configured to use only 4 queues, setting tcp_min_tso_segs to 8 appeared to be sufficient.
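
For reference, making that change persistent is a one-line sysctl setting; the value 8 is what worked in our 4-queue experiment, and the right value depends on the workload and queue count (the file name below is just an example):

# /etc/sysctl.d/90-tcp-tso.conf (example)
net.ipv4.tcp_min_tso_segs = 8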

After we reported the problem upstream, the Linux TCP maintainer shared his suggested patch. Instead of increasing the minimum TSO size globally, the patch forces it to be high if the estimated min_rtt is below 500us, effectively attempting 64kB sends for all in-datacenter traffic, CWND allowing. This fix will be part of the Linux 5.18 release.

Appendix: bpftrace trickery

As in most of our investigations, we depended heavily on bpftrace to gather data. Not all of the scripts were saved (or worth saving), but here are a couple of examples to give the reader an idea.

To collect the histogram of CWND for currently sent TSO frames, comparing normal queues vs. the overloaded queue (queue 3 in the example), we used:

tracepoint:net:net_dev_start_xmit {
    if (args->gso_type != 16) { return; } /* skip non-TSO */

    $skb = (struct sk_buff *)args->skbaddr;
    $tp = (struct tcp_sock *)$skb->sk;
    $icsk = (struct inet_connection_sock *)$tp;

    /* key: congestion control ops, and whether the flow maps to queue 3 */
    @[$icsk->icsk_ca_ops,
      $skb->sk->__sk_common.skc_rx_queue_mapping == 3] =
        hist($tp->snd_cwnd);
}

The following is a slightly more complex example, which gets the kprobe’s parameters from registers and histograms the congestion window on the input path, keyed by CPU and the ECN marking of the packet. Using a kfunc instead of a kprobe does not require jumping through the register-casting hoops, but it would occasionally take a very long time to attach due to the CPU livelock in this investigation. In this case, using kprobes was quicker:

kprobe:tcp_filter {
    /* tcp_filter(struct sock *sk, struct sk_buff *skb): di = sk, si = skb */
    $skb = (struct sk_buff *)reg("si");
    $sk = (struct sock *)reg("di");
    $tp = (struct tcp_sock *)$sk;

    $ip = (struct ipv6hdr *)($skb->head + $skb->network_header);
    $th = (struct tcphdr *)($skb->head + $skb->transport_header);

    /* ECN bits are the low two bits of the IPv6 Traffic Class */
    $ecn = ($ip->flow_lbl[0] & 0x30) >> 4;
    @[cpu, $ecn] = hist($tp->snd_cwnd);
}
