Copy Fail: Breaking Container Isolation via Page Cache Poisoning
The security of modern containerized environments relies on the assumption that namespaces and cgroups provide a robust boundary between workloads. However, a recently disclosed vulnerability known as "Copy Fail" shatters this assumption by targeting a fundamental component of the Linux kernel: the page cache. Unlike traditional kernel exploits that rely on fragile race conditions or Use-After-Free (UAF) bugs to achieve code execution, Copy Fail provides a deterministic primitive for rewriting cached file contents, enabling attackers to move laterally between pods or escape to the host entirely.
The Mechanics of Copy Fail
At its core, Copy Fail is a local privilege escalation vulnerability that exploits a memory corruption flaw in the kernel code handling IPsec ESP Extended Sequence Numbers (the authencesn crypto template). This functionality is exposed to unprivileged users via AF_ALG sockets, the userland interface for the Linux kernel cryptography subsystem.
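Because the exposure hinges on AF_ALG being reachable, a quick way to gauge it from inside a workload is to try opening such a socket. The following is a minimal sketch; the function name `af_alg_reachable` is made up for illustration. A `True` result only means the socket family can be opened from the current context. It does not indicate whether the running kernel is actually vulnerable.

```python
import socket


def af_alg_reachable() -> bool:
    """Return True if this process can open an AF_ALG socket.

    This only creates and closes the socket. It never binds to an
    algorithm, so it exercises no crypto code paths.
    """
    family = getattr(socket, "AF_ALG", None)
    if family is None:
        # Non-Linux platform or a Python build without AF_ALG support.
        return False
    try:
        sock = socket.socket(family, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # Blocked by seccomp/LSM policy, or the kernel lacks AF_ALG.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_ALG reachable:", af_alg_reachable())
```

Running this inside a container before and after applying a restrictive seccomp profile is one way to confirm the profile took effect.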
By confusing the kernel into treating a mutable reference to the page cache as disposable scratch memory, an attacker can use splice(2) to perform a controlled 4-byte write into the page cache backing any readable file. This allows the attacker to modify the cached version of a file without ever changing the bytes stored on the physical disk. Because the write bypasses standard accounting and the overlayfs "copy-up" mechanism, the modification lands directly in the cached pages of the shared lower-layer files rather than in a private upper-layer copy.
Why Containers Are Vulnerable
Container isolation is implemented via mount, network, PID, user, and IPC namespaces. Crucially, none of these namespaces creates a per-container page cache. The kernel's page cache is shared across the entire system.
In a Kubernetes environment, container images are composed of read-only lower layers. To save space, container runtimes (like containerd or CRI-O) deduplicate these layers by content hash. If two different pods on the same node share a base image (e.g., debian:bookworm-slim or python:3.12-slim), they share the same underlying host inode and address_space.
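The sharing described here comes down to two paths resolving to the same inode on the same device. As a rough sketch, assuming you are on the host and can see the real snapshot paths of the lower layers, the helper below (a hypothetical name, `same_backing_file`) checks that condition. Inside a container, overlayfs may report its own device and inode numbers, so treat the result there as indicative only.

```python
import os


def same_backing_file(path_a: str, path_b: str) -> bool:
    """Return True if both paths resolve to the same (device, inode) pair.

    Files that share an inode also share one address_space, and
    therefore one set of page cache pages.
    """
    return os.path.samestat(os.stat(path_a), os.stat(path_b))
```

For example, calling it on the same library path as seen through two different container snapshots on the host would show whether the runtime deduplicated that layer.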
When Copy Fail mutates a folio in the page cache, every file descriptor across all containers that points to that same address_space will see the modified bytes. This creates two primary attack vectors:
Scenario 1: Cross-Container Poisoning
In this scenario, an attacker with code execution in one pod (or simply the right to create pods) can target a widely shared base layer.
- Target Selection: The attacker identifies a common file, such as a Python module in site-packages or a shared library like glibc within a base layer.
- The Write: Using Copy Fail, the attacker chains 4-byte writes to patch the target file in the page cache.
- The Trigger: When a second, unrelated pod on the same node imports that module or executes that library, it loads the poisoned bytes from the cache and executes the attacker's code.
This allows an attacker to compromise a hardened backend pod simply because it shares a base image with a compromised or attacker-controlled pod on the same node. Furthermore, if the attacker has pods/create permissions, they can intentionally schedule a pod on a victim's node and pull the same base image to trigger the poisoning.
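For defenders, a first triage step is to find which pods are co-located on a node with the same image. The sketch below works on an already-collected inventory. The dictionary keys (`name`, `node`, `images`) are assumptions made for this example and not a Kubernetes API shape. Grouping by image reference also underestimates exposure, since different images can still share individual layers.

```python
from collections import defaultdict


def shared_image_groups(pods):
    """Group pod names by (node, image) and keep groups with more than one pod.

    `pods` is an iterable of dicts with the hypothetical keys
    "name", "node", and "images" (a list of image references).
    """
    groups = defaultdict(set)
    for pod in pods:
        for image in pod["images"]:
            groups[(pod["node"], image)].add(pod["name"])
    # Only pairs where at least two pods share the image are of interest.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

A group that mixes a sensitive backend with a lower-trust workload is a candidate for node separation or a sandboxed runtime.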
Scenario 2: Container Escape to Host Root
Copy Fail can also be used to achieve a full escape from an unprivileged container to the host. This path mirrors the "Dirty Pipe" escape pattern:
- Force runc Execution: The attacker overwrites /bin/sh inside the container with a shebang pointing to /proc/self/exe. When an administrator runs kubectl exec, runc is invoked and becomes pinned in the container's PID namespace.
- Locate and Poison: The attacker identifies the runc process and opens its /proc/<pid>/exe symlink. Since runc is bind-mounted from the host, this provides a path to the host's runc binary in the page cache.
- Execution: The attacker uses Copy Fail to overwrite the runc ELF header with a malicious payload. The next time runc is executed on the host (via a probe, a pod start, or another exec), the malicious code runs as root on the host.
Detection and Mitigation
One of the most dangerous aspects of Copy Fail is its invisibility to traditional security tooling. Because the on-disk bytes remain unchanged, image registry scanners (Trivy, Clair), agent-less disk scanners, and file-integrity monitors (AIDE, Tripwire) will report that the system is clean.
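One way to look for this kind of divergence is to hash a file twice: once through ordinary reads, which are served from the page cache, and once with O_DIRECT, which asks the kernel to read from storage instead. This is a minimal sketch rather than a production detector. The function names are made up for this example, and it assumes the filesystem honors O_DIRECT, which not all do (tmpfs on older kernels, for instance). A file being legitimately written during the check can also produce a false mismatch.

```python
import hashlib
import mmap
import os

CHUNK = 1 << 20  # 1 MiB, a multiple of common logical block sizes


def sha256_cached(path):
    """Hash the file through ordinary reads (served from the page cache)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_direct(path):
    """Hash the file using O_DIRECT reads, which bypass the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        size = os.fstat(fd).st_size
        # An anonymous mmap gives the page-aligned buffer O_DIRECT requires.
        buf = mmap.mmap(-1, CHUNK)
        digest = hashlib.sha256()
        done = 0
        while done < size:
            n = os.readv(fd, [buf])
            if n == 0:
                break
            digest.update(buf[:n])
            done += n
        buf.close()
        return digest.hexdigest()
    finally:
        os.close(fd)


def compare_cache_to_disk(path):
    """Return "match", "mismatch", or "unsupported" for the given file."""
    if not hasattr(os, "O_DIRECT"):
        return "unsupported"
    try:
        direct = sha256_direct(path)
    except OSError:
        # The filesystem rejected O_DIRECT or the file is unreadable.
        return "unsupported"
    return "match" if sha256_cached(path) == direct else "mismatch"
```

A "mismatch" on a read-only image layer or on a host binary such as runc is worth investigating. A "match" only says the two views agreed at the moment of the check.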
Effective Defenses
| Defense Mechanism | Effectiveness | Note |
|---|---|---|
| Kernel Patching | High | The only permanent fix is updating the host kernel to a patched version. |
| Seccomp Profiles | High | Blocking socket(AF_ALG, ...) removes the exploit primitive. |
| gVisor / Kata Containers | High | These provide a separate kernel or microVM, eliminating the shared page cache. |
| Managed MicroVMs (Fargate) | High | Per-pod kernels prevent cross-pod poisoning. |
| Runtime EDR | Partial | Can detect post-exploitation behavior or in-memory page mismatches. |
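To make the seccomp row concrete, the sketch below generates a profile in the Docker/OCI JSON format that fails socket() calls whose first argument is AF_ALG (address family 38 on Linux). It is an illustration of the single rule only. The allow-by-default action is an assumption chosen to keep the example short, and in practice the rule would be merged into the runtime's default allowlist profile. It also does not cover the legacy socketcall path on 32-bit x86.

```python
import json
import socket

# AF_ALG is address family 38 on Linux; fall back to that value elsewhere.
AF_ALG = int(getattr(socket, "AF_ALG", 38))


def build_af_alg_deny_profile():
    """Build a seccomp profile dict that rejects socket(AF_ALG, ...)."""
    return {
        # Assumption for brevity: everything else is allowed.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                # Match only when argument 0 (the domain) equals AF_ALG.
                "args": [
                    {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_af_alg_deny_profile(), indent=2))
```

The resulting JSON can be saved to a file and referenced from the container runtime or a pod's seccomp profile setting. Most workloads never open AF_ALG sockets, so the rule is unlikely to break applications, but that is worth verifying per workload.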
Conclusion
Copy Fail demonstrates that the shared nature of the Linux page cache is a significant architectural blind spot in container security. While namespaces provide a logical separation of resources, the underlying memory management of the kernel remains a global resource. For organizations requiring hard multi-tenancy, this vulnerability reinforces the argument that containers should not be used as a primary security boundary and that VM-based isolation (or sandboxed runtimes like gVisor) is essential for protecting high-risk workloads.