When "idle" isn't idle: How a Linux Kernel Optimization became a QUIC Bug

Congestion Control Algorithms (CCAs) are the invisible conductors of the internet, governing how data flows between servers and clients. CUBIC, the default congestion controller in the Linux kernel, is designed to maximize bandwidth utilization while avoiding network collapse. However, a subtle bug in Cloudflare's open-source QUIC implementation, quiche, caused connections to enter a "death spiral" where the congestion window (cwnd) would permanently pin at its minimum value, failing to recover even after network loss stopped.

This is a cautionary tale of how a kernel-level optimization, when ported to user-space QUIC, created a state machine trap that only surfaced under specific, high-loss conditions.

The Symptom: A 61% Failure Rate

Most congestion control tests focus on steady-state throughput or growth phases. However, the true test of a CCA is its ability to recover from a congestion collapse. Cloudflare identified this issue through an integration test pipeline that simulated a brutal environment: a 10 MB file download with 30% random packet loss during the first two seconds of the connection.

Under these conditions, CUBIC should have throttled back during the loss phase and then ramped up once the loss stopped. Instead, roughly 61% of test runs failed to complete the download within a generous 10-second timeout.
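The loss profile in this test is easy to model. The following is a minimal sketch, not the actual test harness: LossWindow and its methods are hypothetical names, and the xorshift64 generator is only a stand-in for whatever random source the real pipeline uses.

```rust
/// Hypothetical loss model: drop each packet with probability `rate`
/// while the connection is younger than `window_ms`, never afterwards.
struct LossWindow {
    window_ms: u64,
    rate: f64,
    rng_state: u64, // xorshift64 state; must be non-zero
}

impl LossWindow {
    fn new(window_ms: u64, rate: f64, seed: u64) -> Self {
        // A zero seed would lock xorshift64 at zero, so clamp it to 1.
        LossWindow { window_ms, rate, rng_state: seed.max(1) }
    }

    /// Uniform value in [0, 1) from one xorshift64 step.
    fn next_unit(&mut self) -> f64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        // Keep the top 53 bits so the result fits an f64 mantissa exactly.
        (x >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Decide whether a packet seen `elapsed_ms` into the connection is dropped.
    fn should_drop(&mut self, elapsed_ms: u64) -> bool {
        elapsed_ms < self.window_ms && self.next_unit() < self.rate
    }
}
```

With LossWindow::new(2_000, 0.30, seed), packets are dropped about 30% of the time during the first two seconds and never after that, which is the regime the test forces CUBIC through.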

The Anomaly: The RTT-Synced Oscillation

Using qlog visualizations, engineers observed a bizarre pattern. After the loss phase ended at T=2s, the cwnd remained locked at the minimum floor of 2700 bytes (roughly two full-size packets). Even more strangely, the congestion state oscillated between "recovery" and "congestion avoidance" every ~14 ms, a period that closely matched the connection's round-trip time (RTT): 10 ms plus some overhead.

This oscillation happened 999 times over 6.7 seconds. Because the cwnd was so small, every incoming ACK from the client would drain the bytes_in_flight to zero. This triggered a specific piece of logic in the CUBIC implementation that misinterpreted this transient state as the connection being "idle."

The Root Cause: The "Idle" Miscalculation

To understand the bug, we have to look at the origin of the code. In 2017, the Linux kernel introduced an optimization to handle "app-limited" or idle periods. If a connection stops sending data for a while, the CUBIC growth curve (which is based on the time elapsed since the last loss event, or the "epoch") can become skewed. If the epoch isn't adjusted, the algorithm might try to inflate the cwnd to an unreasonable value the moment it resumes.

The kernel fix was to shift the epoch forward by the duration of the idle period, effectively sliding the growth curve in time.
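To see why the shift matters, here is a minimal sketch of the CUBIC window function with the constants from the CUBIC specification (C = 0.4, beta = 0.7, window in segments, time in seconds). It is neither the kernel's nor quiche's code; target_after_idle and its parameters are hypothetical names used only for illustration.

```rust
/// Constants from the CUBIC specification (RFC 9438).
const C: f64 = 0.4;
const BETA: f64 = 0.7;

/// Window (in segments) `t` seconds into the epoch, where `w_max` is the
/// window just before the last reduction.
fn w_cubic(t: f64, w_max: f64) -> f64 {
    // K is the time the curve needs to climb back to w_max.
    let k = (w_max * (1.0 - BETA) / C).cbrt();
    C * (t - k).powi(3) + w_max
}

/// Target window after `active` seconds of sending followed by `idle`
/// seconds of silence, with or without the epoch shift.
fn target_after_idle(active: f64, idle: f64, w_max: f64, shift_epoch: bool) -> f64 {
    let now = active + idle;
    // Shifting the epoch forward by the idle time removes it from `t`.
    let epoch_start = if shift_epoch { idle } else { 0.0 };
    w_cubic(now - epoch_start, w_max)
}
```

For example, with w_max = 100, one second of sending and ten seconds of silence, the unshifted curve asks for well over twice the window that the shifted curve does. The shifted curve simply resumes where it left off, at w_cubic(1.0, 100.0).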

When this was ported to quiche, the implementation checked for idleness in on_packet_sent():

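// Runs on every send: zero bytes in flight is taken to mean "idle".
if bytes_in_flight == 0 {
    // Gap since the previous send. This includes the RTT spent waiting for ACKs.
    let delta = now - self.last_sent_time;
    // Slide the recovery boundary forward by that gap.
    self.congestion_recovery_start_time += delta;
}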

The Death Spiral

This logic created a trap when cwnd was at its minimum. The sequence worked as follows:

Drain: The sender sends two packets. After one RTT, both are ACKed, and bytes_in_flight hits zero.
False Idle: The next time a packet is sent, the code sees bytes_in_flight == 0 and assumes the connection was idle.
Inflated Delta: It calculates the idle duration as now - last_sent_time. Because last_sent_time was the start of the previous RTT cycle, the "idle" duration is measured as one full RTT (~14ms), even though the actual gap between the last ACK and the next send was nearly zero.
Future Boundary: This inflated delta pushes the congestion_recovery_start_time into the future.
Stagnation: Because the algorithm believes it is still in a recovery period (since the current time is before the recovery boundary), it skips cwnd growth.

This loop repeats indefinitely, or until scheduler jitter allows the boundary to finally slip behind the current time.
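The loop can be reproduced in a toy model. The sketch below makes several assumptions: times are plain milliseconds, the next send happens the instant the ACKs arrive (no jitter), cwnd stays at two packets throughout, and the "in recovery" test is sent_time <= recovery_start, modeled on the pseudocode in RFC 9002. stalled_rounds is a hypothetical helper, not a quiche function.

```rust
/// Counts how many of `rounds` ACK rounds are treated as "still in recovery"
/// under the original idle check. Times are milliseconds since connection start.
fn stalled_rounds(rtt_ms: u64, rounds: u32) -> u32 {
    let mut now: u64 = 2_000; // the loss phase has just ended
    let mut recovery_start: u64 = now; // boundary set by the last loss event
    let mut last_sent: u64 = now; // the whole cwnd (two packets) leaves at `now`
    let mut stalled = 0;

    for _ in 0..rounds {
        let sent_at = last_sent;

        // Drain: one RTT later both packets are ACKed, bytes_in_flight == 0.
        now += rtt_ms;

        // Stagnation: packets sent at or before the boundary earn no growth.
        if sent_at <= recovery_start {
            stalled += 1;
        }

        // False idle and inflated delta: measured from the previous send.
        let delta = now - last_sent;

        // Future boundary: the boundary advances as fast as the clock does.
        recovery_start += delta;
        last_sent = now;
    }
    stalled
}
```

In this jitter-free model, stalled_rounds(14, 100) reports that all 100 rounds stall: each send pushes the boundary up to the new send time, so no packet is ever sent after it.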

The Fix: Precise Idle Measurement

The solution required changing how the idle duration is measured. Instead of measuring from the last packet sent, the algorithm now measures from the most recent activity—either the last ACK received or the last packet sent, whichever is later.

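// Idle starts at the most recent activity: last ACK or last send, whichever is later.
// Option orders None below Some, so max() picks the latest timestamp that is set.
let idle_start = cmp::max(cubic.last_ack_time, cubic.last_sent_time);
if let Some(idle_start) = idle_start {
    if idle_start < now {
        // Shift the boundary only by the time in which nothing happened.
        // recovery_start_time is the current boundary, bound outside this excerpt.
        let delta = now - idle_start;
        r.congestion_recovery_start_time = Some(recovery_start_time + delta);
    }
}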

By using last_ack_time, the delta no longer includes the RTT. The recovery boundary stops chasing the send time, allowing the cwnd to grow along the expected CUBIC curve.
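The difference between the two measurements is easy to check in isolation. This is a sketch under stated assumptions: timestamps are std::time::Duration offsets from the start of the connection rather than the clock type quiche uses, and both function names are hypothetical.

```rust
use std::cmp;
use std::time::Duration;

/// Original rule: idle time is measured from the previous send.
fn idle_since_last_send(now: Duration, last_sent: Option<Duration>) -> Duration {
    last_sent.map_or(Duration::ZERO, |t| now.saturating_sub(t))
}

/// Fixed rule: idle time is measured from the latest activity of either kind.
fn idle_since_last_activity(
    now: Duration,
    last_sent: Option<Duration>,
    last_ack: Option<Duration>,
) -> Duration {
    // None sorts below Some(_), so max() returns the latest timestamp that exists.
    match cmp::max(last_sent, last_ack) {
        Some(start) if start < now => now - start,
        _ => Duration::ZERO,
    }
}
```

With a send at 2000 ms, an ACK at 2014 ms, and the next send also at 2014 ms, the original rule reports 14 ms of "idle" time while the fixed rule reports zero. If the application then goes quiet until 3014 ms, the fixed rule reports 1000 ms, so genuine idle periods are still accounted for.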

Engineering Takeaways

This incident highlights several critical lessons in systems engineering:

The Danger of Porting without Context: As one community observer noted, this was essentially a case of copying kernel code without fully accounting for the nuances of the environment (user-space vs. kernel-space) or following up on subsequent kernel bug fixes.
The Importance of Edge-Case Testing: This bug was invisible in standard throughput dashboards. It only surfaced because the team deliberately drove the system into a "congestion collapse" regime—a state the CCA is specifically designed to handle but rarely exercises in production.
The Complexity of "Simple" States: Defining "idle" seems straightforward, but in high-performance networking, the difference between a transient zero-byte flight and true application idleness can be the difference between a healthy connection and a death spiral.
