Pushing the Limits: How Netflix Serves Video Traffic at 800Gb/s

May 22, 2026

Serving high-definition video to millions of concurrent users requires more than fast disks and a big pipe; it requires eliminating every CPU cycle and memory copy that doesn't contribute to moving bits from storage to the network. In a detailed technical presentation from the NAB Show, Netflix engineers described the architectural evolution required to push a single server's throughput toward 800Gb/s.

At this scale, the primary enemy is not raw compute power, but memory bandwidth and the overhead of the operating system kernel. To achieve these speeds, Netflix focused on a specific workload: serving pre-encoded static media files using a stack based on FreeBSD-current and NGINX.

The Evolution of the Data Path

To understand how Netflix approached 800Gb/s, it helps to look at the timeline of optimizations they implemented to reduce the "tax" on the CPU and memory bus.

1. Asynchronous Sendfile (2014)

Traditionally, serving a file involves copying data from the disk to the kernel, then to userspace, and finally back to the kernel to be sent over the network. Netflix utilized sendfile(2), which allows the kernel to transmit data directly from a file descriptor to a TCP socket, bypassing userspace entirely.
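As a rough illustration of the idea, the sketch below sends a file over a socket with Python's `os.sendfile`, which wraps the operating system's sendfile call. It is not Netflix's NGINX/FreeBSD code path; the function names `send_file_in_kernel` and `demo`, the temporary file, and the local socket pair are all assumptions made so the example is self-contained.

```python
import os
import socket
import tempfile


def send_file_in_kernel(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over `sock` using the sendfile system call."""
    sent_total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # The kernel moves bytes from the file to the socket; this
            # process never reads the payload into a userspace buffer.
            sent = os.sendfile(sock.fileno(), f.fileno(), sent_total, size - sent_total)
            if sent == 0:  # unexpected end of file
                break
            sent_total += sent
    return sent_total


def demo() -> bytes:
    """Round-trip a small temporary file through a local socket pair."""
    payload = b"segment-bytes " * 1024  # small stand-in for a media chunk
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    server, client = socket.socketpair()
    try:
        send_file_in_kernel(server, tmp.name)
        server.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        received = bytearray()
        while chunk := client.recv(65536):
            received += chunk
        return bytes(received)
    finally:
        server.close()
        client.close()
        os.unlink(tmp.name)
```

The payload is kept small so that it fits in the socket buffer and the single-threaded demo cannot block; a real server would be sending to a remote TCP peer instead.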

However, standard sendfile can block the NGINX worker if the disk read is slow. Netflix implemented asynchronous sendfile, turning the operation into a "fire and forget" mechanism. When a disk read completes, an interrupt handler informs the TCP stack that the data is ready, so NGINX workers no longer stall waiting on disk. On older hardware, this raised throughput from 23Gb/s to 36Gb/s.

2. Kernel TLS (kTLS) (2016)

Encryption is mandatory for modern traffic, but TLS typically breaks the sendfile pipeline. In a standard TLS setup, data must be copied from the kernel to userspace to be encrypted by the CPU before being sent back to the kernel.

Netflix solved this by moving the TLS symmetric encryption into the kernel. While the initial handshake remains in userspace, the bulk encryption is handled as part of the sendfile pipeline. This restored the zero-copy data flow and eliminated the massive memory bandwidth spike caused by repeated copying between kernel and userspace.

3. Mastering NUMA (2019)

As network speeds climbed, Netflix encountered the limitations of Non-Uniform Memory Access (NUMA). In multi-socket systems, memory and I/O devices are "closer" to specific CPU cores. If a CPU on Node 0 needs to encrypt data stored in memory attached to Node 1 and send it via a NIC attached to Node 0, the data must cross the NUMA fabric (the interconnect between sockets) multiple times.

In a worst-case scenario, the same data could cross the NUMA fabric four times:

  1. Disk to memory
  2. Memory to CPU (for encryption)
  3. CPU back to memory (encrypted)
  4. Memory to NIC

This congestion leads to CPU stalls. Netflix implemented Disk Centric Siloing, ensuring that the process of reading, encrypting, and transmitting happens on the NUMA node where the data resides. By moving to "Strict Disk Centric Siloing," they eliminated bulk data crossings of the NUMA bus entirely, significantly reducing fabric saturation.
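The four-step worst case above can be modeled with a small counting sketch. This is a toy model, not anything from Netflix's stack: the `Placement` class, the `numa_crossings` function, and the two-node numbering are hypothetical names introduced only to show how component placement determines the number of fabric crossings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """NUMA node that each component of the data path is attached to."""
    disk: int
    memory: int
    cpu: int
    nic: int


def numa_crossings(p: Placement) -> int:
    """Count fabric crossings across the four steps of the data path."""
    steps = [
        (p.disk, p.memory),   # 1. disk to memory
        (p.memory, p.cpu),    # 2. memory to CPU (for encryption)
        (p.cpu, p.memory),    # 3. CPU back to memory (encrypted)
        (p.memory, p.nic),    # 4. memory to NIC
    ]
    # A step crosses the fabric only when source and destination nodes differ.
    return sum(1 for src, dst in steps if src != dst)


# Worst case: the buffer lives on the other node from everything else.
worst_case = Placement(disk=0, memory=1, cpu=0, nic=0)
# Siloed: disk, buffer, CPU, and egress NIC all share one node.
siloed = Placement(disk=0, memory=0, cpu=0, nic=0)
```

In this model, `numa_crossings(worst_case)` is 4 and `numa_crossings(siloed)` is 0, which matches the goal of keeping bulk data on the node where it was read.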

The Final Leap: Inline Hardware kTLS (2022)

Even with NUMA optimizations, software-based kTLS consumes nearly half of the available CPU cycles. To break the 400Gb/s barrier, Netflix collaborated with NVIDIA (Mellanox) to implement Inline Hardware kTLS using the ConnectX-6 Dx NIC.

With hardware offload, the kernel passes the encryption keys directly to the NIC. The data flows from the disk to memory and then straight to the NIC in plaintext; the NIC encrypts the data on-the-fly as it hits the wire.

The impact is twofold:

  • CPU Relief: The host CPU is no longer involved in the bulk encryption process.
  • Memory Bandwidth Reduction: Memory bandwidth requirements are cut in half (from ~400GB/sec to ~200GB/sec for an 800Gb/s stream) because the CPU no longer needs to read and write the data for encryption.
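The halving can be checked with back-of-the-envelope arithmetic. The sketch below assumes that each byte served costs one memory pass per read or write of the buffer; the pass counts and the function name `memory_bandwidth_gbytes` are an interpretation of the figures above, not numbers taken directly from the presentation.

```python
def memory_bandwidth_gbytes(throughput_gbits: float, memory_passes: int) -> float:
    """Estimate memory bandwidth in GB/s for a given network rate in Gb/s."""
    payload_gbytes = throughput_gbits / 8  # 8 bits per byte
    return payload_gbytes * memory_passes


# Assumed passes with software kTLS: disk DMA write, CPU read,
# CPU write of the ciphertext, NIC DMA read.
SOFTWARE_KTLS_PASSES = 4
# Assumed passes with inline hardware kTLS: disk DMA write, NIC DMA read.
HARDWARE_KTLS_PASSES = 2

software = memory_bandwidth_gbytes(800, SOFTWARE_KTLS_PASSES)  # 400.0
hardware = memory_bandwidth_gbytes(800, HARDWARE_KTLS_PASSES)  # 200.0
```

An 800Gb/s stream carries 100GB/s of payload, so four passes give roughly 400GB/s and two passes give roughly 200GB/s.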

Experimental Results at 800Gb/s

Using a Dell R7525 server with dual AMD EPYC 7713 CPUs and four Mellanox ConnectX-6 Dx NICs, Netflix tested the limits of this architecture. Measured throughput climbed to 720Gb/s over several iterations:

  • Initial Result (420Gb/s): Limited by AMD's dynamic link width management (DLWM) on the xGMI links between sockets.
  • Forced Link Width (500Gb/s): After forcing xGMI to x16 and 18GT/s, they hit a plateau due to uneven I/O distribution across NVMe quadrants.
  • Disk Centric Siloing (670Gb/s): Improved xGMI hashing by changing how DMA was handled, though it introduced some pressure on the page daemon.
  • Strict Disk Centric Siloing (720Gb/s): By ensuring the egress NIC was local to the NUMA node with the disk, they reached 720Gb/s. At this point, the bottleneck shifted from the CPU to NIC output drops: because some content is more popular than other content, load was spread unevenly, with some NICs pushed to 94Gb/s while others sat at 84Gb/s.
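To put the four iterations in proportion, the relative gain of each step can be computed from the throughput figures listed above. The sketch below uses only those four numbers; the names `RESULTS_GBPS` and `step_gains` are hypothetical.

```python
# Throughput at each iteration, in Gb/s, as listed above.
RESULTS_GBPS = [
    ("Initial Result", 420),
    ("Forced Link Width", 500),
    ("Disk Centric Siloing", 670),
    ("Strict Disk Centric Siloing", 720),
]


def step_gains(results: list[tuple[str, int]]) -> list[tuple[str, float]]:
    """Return the percentage gain of each step over the previous one."""
    gains = []
    for (_, prev), (name, cur) in zip(results, results[1:]):
        gains.append((name, round((cur - prev) * 100 / prev, 1)))
    return gains


gains = step_gains(RESULTS_GBPS)
overall = round((RESULTS_GBPS[-1][1] - RESULTS_GBPS[0][1]) * 100 / RESULTS_GBPS[0][1], 1)
```

By this arithmetic, the first round of Disk Centric Siloing is the largest single step (about 34%), and the overall improvement from 420Gb/s to 720Gb/s is roughly 71%.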

Technical Synthesis and Community Perspective

This architecture highlights a fundamental shift in high-performance networking: the move toward "offloading everything." By pushing the data plane into the kernel and then into the hardware, Netflix has minimized the CPU's role to that of a coordinator rather than a data processor.

From a community perspective, some observers have questioned whether TLS is necessary for static, DRM-protected content that is already encrypted. However, the industry trend toward "TLS everywhere" suggests that the security and privacy benefits, together with the ability to use standardized hardware offloads, outweigh the overhead. Additionally, the use of FreeBSD in this stack underscores the continued relevance of the BSD kernel for specialized, high-throughput networking tasks where fine-grained control over the network stack is paramount.
