
From u32 to Root: Analyzing the io_uring ZCRX Freelist Vulnerability

May 10, 2026

The Linux kernel's io_uring subsystem has long been a focal point for security researchers due to its complexity and powerful capabilities. A recent discovery in the Zero-Copy Receive (ZCRX) subsystem highlights a classic yet devastating memory safety error: a missing bounds check on a freelist that is allocated on the heap and used as a LIFO stack. This vulnerability allows an attacker with specific capabilities to turn a 4-byte out-of-bounds (OOB) write into full root privileges.

This post breaks down the technical mechanics of the ZCRX vulnerability, the heap grooming required to weaponize it, and the eventual path to local privilege escalation (LPE).

The Vulnerability: Missing Bounds Checks in ZCRX

Introduced in Linux 6.15, ZCRX allows userspace to receive network packets directly into registered memory regions, bypassing the overhead of kernel-to-user copying. To manage these memory slots, the kernel utilizes a net_iov structure and a corresponding freelist:

  • freelist[]: A stack of available slot indices, allocated via kcalloc(num_niovs, sizeof(u32)).
  • free_count: An integer tracking the current depth of the stack.

The vulnerability exists in the io_zcrx_return_niov_freelist function. When a network I/O vector (niov) is returned to the pool, the kernel pushes its index onto the freelist and increments free_count without checking whether free_count has already reached num_niovs.

static void io_zcrx_return_niov_freelist(struct net_iov *niov)
{
    struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);

    spin_lock_bh(&area->freelist_lock);
    /* No check that free_count < num_niovs before the push. */
    area->freelist[area->free_count++] = net_iov_idx(niov);
    spin_unlock_bh(&area->freelist_lock);
}

When free_count equals num_niovs, the write occurs at freelist[num_niovs], which is exactly one 4-byte slot past the end of the allocated array. This results in a 4-byte OOB write into the adjacent slab memory.
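The offset arithmetic can be checked with a small userspace model. This is only a sketch: struct freelist_model and its helper functions are hypothetical stand-ins for the kernel's bookkeeping, not kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the area's freelist bookkeeping. */
struct freelist_model {
    uint32_t num_niovs;  /* capacity of freelist[] */
    uint32_t free_count; /* current stack depth */
};

/* Size in bytes of the allocation backing freelist[]. */
static size_t freelist_bytes(const struct freelist_model *m)
{
    return (size_t)m->num_niovs * sizeof(uint32_t);
}

/* Byte offset, from the start of freelist[], of the next unchecked push. */
static size_t next_push_offset(const struct freelist_model *m)
{
    return (size_t)m->free_count * sizeof(uint32_t);
}

/* Returns 1 when the next push would land outside the allocation. */
static int next_push_is_oob(const struct freelist_model *m)
{
    return next_push_offset(m) >= freelist_bytes(m);
}
```

With num_niovs = 32, a push at free_count = 31 writes the last valid slot, while a push at free_count = 32 targets byte offset 128, the first byte past a 128-byte array.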

Triggering the OOB Write

The exploit relies on a race condition between two kernel paths that both return niovs to the same freelist:

  1. Path A (Normal Completion): The network stack releases a packet, triggering io_pp_zc_release_netmem, which pushes the niov back to the freelist.
  2. Path B (Page Pool Teardown): When a NIC is brought down, io_pp_zc_destroy iterates through all niovs. If a niov still has a reference count, it is forced back into the freelist.

Because the ptr_ring drain (Path A) and the scrub loop (Path B) are not atomic with respect to each other, a window exists in which the same niov is returned twice. If the freelist is already nearly full, the duplicate push moves free_count past the end of the array and triggers the OOB write.
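The double count can be replayed deterministically in userspace. The sketch below is a single-threaded model of one possible interleaving, not the kernel code: struct area_model, model_return_niov, and the 4-slot capacity are hypothetical, and the array carries one spare slot so the model itself never writes out of bounds.

```c
#include <stdint.h>

#define MODEL_NIOVS 4u

/* Hypothetical model of an area with MODEL_NIOVS slots. */
struct area_model {
    uint32_t freelist[MODEL_NIOVS + 1]; /* spare slot keeps the model in bounds */
    uint32_t free_count;
};

/* Same shape as the unchecked kernel push. */
static void model_return_niov(struct area_model *a, uint32_t idx)
{
    a->freelist[a->free_count++] = idx;
}

/* Three niovs are free and niov 3 is in flight; both paths then return it. */
static uint32_t replay_double_return(void)
{
    struct area_model a = { .freelist = {0, 1, 2}, .free_count = 3 };

    model_return_niov(&a, 3); /* Path A: normal completion */
    model_return_niov(&a, 3); /* Path B: teardown still sees a reference */
    return a.free_count;
}
```

replay_double_return() ends with free_count at MODEL_NIOVS + 1, meaning the second push used index MODEL_NIOVS, one slot past a real array of that capacity.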

To trigger this from userspace, an attacker requires CAP_NET_ADMIN to bring the NIC down via SIOCSIFFLAGS. The process involves registering a ZCRX interface queue (IFQ), flooding it with UDP packets to allocate niovs, and then bringing the interface down while some packets are still in-flight.

From a Small Integer to Root: The Exploit Chain

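Writing a small integer (the niov index) might seem insignificant, but the attacker influences both the value and the location of the write through the num_niovs parameter chosen at registration. num_niovs bounds the index that gets written and fixes the size of the kcalloc allocation, which in turn determines the slab cache used (e.g., kmalloc-128). For example, 32 entries of 4 bytes fill a 128-byte object exactly, so freelist[32] lands on the first bytes of the neighboring object rather than in padding.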
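The relationship between num_niovs and the slab cache can be sketched as follows. The size-class table is an assumption (the general-purpose kmalloc classes of a typical x86-64 configuration), and both helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed kmalloc size classes for a typical x86-64 configuration. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Smallest assumed class that fits the request, or 0 if not modeled. */
static size_t kmalloc_class_for(size_t bytes)
{
    size_t n = sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]);

    for (size_t i = 0; i < n; i++)
        if (bytes <= kmalloc_classes[i])
            return kmalloc_classes[i];
    return 0;
}

/* Unused bytes between the end of freelist[] and the end of its slab object. */
static size_t freelist_slack(uint32_t num_niovs)
{
    size_t bytes = (size_t)num_niovs * sizeof(uint32_t);
    size_t cls = kmalloc_class_for(bytes);

    return cls ? cls - bytes : 0;
}
```

Under these assumptions, num_niovs = 32 gives zero slack in kmalloc-128, so the write at freelist[32] crosses into the next object. With num_niovs = 30 there are 8 bytes of slack and the same one-slot overflow stays inside the freelist's own object.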

1. Heap Grooming with msg_msg

The target for the OOB write is the struct msg_msg object. By spraying msg_msg objects via msgsnd(), the attacker can ensure a msg_msg is placed immediately after the freelist in the kmalloc-128 slab.

The OOB write hits the first 4 bytes of the adjacent msg_msg, which hold the lower 32 bits of the m_list.next pointer. Because x86-64 is little-endian, the upper 32 bits are left intact, so the corrupted pointer generally still falls within the kernel's physmap range.
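The effect of the 4-byte store on an 8-byte pointer comes down to mask arithmetic. The sketch below models the little-endian result with masks so that it behaves the same on any host; the function name and the sample pointer in the note that follows are hypothetical.

```c
#include <stdint.h>

/*
 * Models a 4-byte store over the start of an 8-byte pointer field on a
 * little-endian machine: the low 32 bits are replaced by idx and the
 * high 32 bits are preserved.
 */
static uint64_t overwrite_low32(uint64_t ptr, uint32_t idx)
{
    return (ptr & 0xffffffff00000000ULL) | (uint64_t)idx;
}
```

For an illustrative pointer 0xffff888012345678 and an index of 7, the result is 0xffff888000000007: still a canonical kernel address in the same 4 GiB-aligned region, but no longer pointing at the original list neighbor.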

2. Breaking KASLR

With a corrupted m_list.next pointer, the attacker can use msgrcv() with the MSG_COPY flag to perform an over-read of the heap. By scanning the returned data for known kernel text pointers, the attacker can calculate the kernel base address and bypass KASLR. Alternatively, if /proc/kallsyms or dmesg is readable and exposes unmasked kernel addresses, the address of modprobe_path can be obtained directly.

3. Overwriting modprobe_path

modprobe_path is a global kernel variable that holds the path of the binary the kernel executes when it needs to load a module. With CAP_SYS_ADMIN (often granted alongside CAP_NET_ADMIN in certain container configs), the attacker can overwrite modprobe_path via /proc/sys/kernel/modprobe so that it points to a malicious script. This interface takes the new path as a string, so the write itself does not depend on the previously leaked address.

Finally, requesting a socket in an address family whose module is not currently loaded (e.g., socket(AF_CAN, ...)) makes the kernel invoke the modprobe_path binary, which now runs the malicious script as root.

Mitigation and Community Perspective

The vulnerability was addressed in commit 770594e, which introduces a critical bounds check:

if (WARN_ON_ONCE(area->free_count >= area->nia.num_niovs))
    return;
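The effect of the check can be seen in a small userspace model. This is a sketch rather than the patched kernel function: struct checked_area and checked_push are hypothetical names, and an error return stands in for WARN_ON_ONCE.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the patched freelist state. */
struct checked_area {
    uint32_t *freelist;  /* backing array with num_niovs entries */
    uint32_t num_niovs;  /* capacity */
    uint32_t free_count; /* current stack depth */
};

/* Returns 0 on success, -1 when the push is refused because the list is full. */
static int checked_push(struct checked_area *a, uint32_t idx)
{
    if (a->free_count >= a->num_niovs)
        return -1; /* the kernel warns once and returns here */
    a->freelist[a->free_count++] = idx;
    return 0;
}
```

With a capacity of two, the third push is refused and free_count stays at 2, so a duplicate return can no longer move the write index past the array.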

Critical Analysis

While the technical chain is sophisticated, the community has noted that the prerequisite privileges (CAP_NET_ADMIN and CAP_SYS_ADMIN) significantly limit the exploit's impact. As one commenter noted:

"If you can write modprobe_path, is it really news that you can find a way to execute code?"

However, others argue that io_uring continues to be a "security nightmare" due to its massive attack surface and the frequency of privilege escalation bugs. The consensus suggests that for high-security environments, disabling io_uring entirely via sysctl -w kernel.io_uring_disabled=2 may be the safest course of action.
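To confirm the current setting programmatically, the sysctl can be read from /proc. The following is a minimal sketch; the helper name is made up, and it returns -1 when the file is missing (kernels that predate the sysctl) or unreadable.

```c
#include <stdio.h>

/* Returns 0, 1, or 2 as read from the sysctl file, or -1 if unavailable. */
static int read_io_uring_disabled(void)
{
    FILE *f = fopen("/proc/sys/kernel/io_uring_disabled", "r");
    int value = -1;

    if (!f)
        return -1;
    /* Treat anything other than a single value in 0..2 as unavailable. */
    if (fscanf(f, "%d", &value) != 1 || value < 0 || value > 2)
        value = -1;
    fclose(f);
    return value;
}
```

A return value of 2 means io_uring is disabled for all processes, 1 restricts it to privileged users, and 0 leaves it enabled.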

Summary of Requirements

Requirement     | Detail
Kernel version  | 6.15 – 6.19 (without commit 770594e)
Configuration   | CONFIG_IO_URING_ZCRX=y
Hardware        | ZCRX-capable NIC (e.g., Mellanox ConnectX-6+, Intel E800)
Privileges      | CAP_NET_ADMIN (and CAP_SYS_ADMIN for the modprobe_path write)
