Copy Fail: Breaking Container Isolation via Page Cache Poisoning

The security of modern containerized environments relies on the assumption that namespaces and cgroups provide a robust boundary between workloads. However, a recently disclosed vulnerability known as "Copy Fail" shatters this assumption by targeting a fundamental component of the Linux kernel: the page cache. Unlike traditional kernel exploits that rely on fragile race conditions or Use-After-Free (UAF) bugs to achieve code execution, Copy Fail provides a deterministic primitive for rewriting cached file contents, enabling attackers to move laterally between pods or escape to the host entirely.

The Mechanics of Copy Fail

At its core, Copy Fail is a local privilege escalation vulnerability that exploits a memory corruption flaw in authencesn, the kernel crypto template that handles authenticated encryption for IPsec ESP with Extended Sequence Numbers. This functionality is exposed to unprivileged users via AF_ALG sockets, the userspace interface to the Linux kernel cryptography subsystem.

By confusing the kernel into treating a mutable reference to the page cache as disposable scratch memory, an attacker can use splice(2) to perform a controlled 4-byte write into the page cache backing any readable file. This allows the attacker to modify the cached version of a file without ever changing the bytes stored on the physical disk. Because the write bypasses standard accounting and the overlayfs "copy-up" mechanism, the modification happens directly to the shared lower-layer files.
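Because the vulnerable code is reached through AF_ALG sockets, a useful first question for a defender is whether a given workload can create one at all. The following is a minimal sketch of such a probe, not an exploit and not a vulnerability test: it only reports whether the interface is reachable from the current process. The function name af_alg_available is hypothetical, and the sketch assumes that a blocking policy returns an error rather than killing the process.

```python
import socket


def af_alg_available() -> bool:
    """Return True if this process can create an AF_ALG socket."""
    # socket.AF_ALG is only defined on Linux builds of Python.
    family = getattr(socket, "AF_ALG", None)
    if family is None:
        return False
    try:
        sock = socket.socket(family, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # Denied by a seccomp/LSM policy, or the kernel lacks AF_ALG support.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_ALG reachable:", af_alg_available())
```

A result of True means only that the attack surface is exposed; whether the kernel is actually affected depends on its version and patch level.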

Why Containers Are Vulnerable

Container isolation is implemented via mount, network, PID, user, and IPC namespaces. Crucially, none of these namespaces create a per-container page cache. The kernel's page cache is shared across the entire system.

In a Kubernetes environment, container images are composed of read-only lower layers. To save space, container runtimes (like containerd or CRI-O) deduplicate these layers by content hash. If two different pods on the same node share a base image (e.g., debian:bookworm-slim or python:3.12-slim), each file in the shared layers resolves to the same underlying host inode and address_space for both pods.

When Copy Fail mutates a folio in the page cache, every file descriptor across all containers that points to that same address_space will see the modified bytes. This creates two primary attack vectors:
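The sharing described above can be illustrated with a small check: two paths are backed by the same cached pages when they resolve to the same device and inode. This is a sketch under assumptions. The helper names are hypothetical, and the comparison is meant to be run on the host against the runtime's layer directories, since overlayfs may report its own device and inode numbers inside a container. The demo uses a hard link in a temporary directory as a stand-in for a deduplicated layer file.

```python
import os
import tempfile
from typing import Tuple


def backing_identity(path: str) -> Tuple[int, int]:
    """Return the (device, inode) pair that identifies the file behind path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def shares_backing_file(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same inode on the same device."""
    return backing_identity(path_a) == backing_identity(path_b)


def demo() -> Tuple[bool, bool]:
    """Compare a hard-linked alias and an independent copy of the same bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "module.py")
        alias = os.path.join(tmp, "alias.py")
        copy = os.path.join(tmp, "copy.py")
        with open(original, "wb") as f:
            f.write(b"print('hello')\n")
        os.link(original, alias)  # same inode, second name
        with open(copy, "wb") as f:
            f.write(b"print('hello')\n")  # same bytes, different inode
        return (shares_backing_file(original, alias),
                shares_backing_file(original, copy))


if __name__ == "__main__":
    print(demo())
```

The alias shares an inode with the original while the byte-identical copy does not, which is the distinction that matters here: identical content alone does not imply a shared cache, but a deduplicated layer file does.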

Scenario 1: Cross-Container Poisoning

In this scenario, an attacker with code execution in one pod (or simply the right to create pods) can target a widely shared base layer.

Target Selection: The attacker identifies a common file, such as a Python module in site-packages or a shared library like glibc within a base layer.
The Write: Using Copy Fail, the attacker chains 4-byte writes to patch the target file in the page cache.
The Trigger: When a second, unrelated pod on the same node imports that module or executes that library, it loads the poisoned bytes from the cache and executes the attacker's code.

This allows an attacker to compromise a hardened backend pod simply because it shares a base image with a compromised or attacker-controlled pod on the same node. Furthermore, if the attacker has pods/create permissions, they can intentionally schedule a pod on a victim's node and pull the same base image to trigger the poisoning.

Scenario 2: Container Escape to Host Root

Copy Fail can also be used to achieve a full escape from an unprivileged container to the host. This path mirrors the "Dirty Pipe" escape pattern:

Force runc Execution: The attacker overwrites /bin/sh inside the container with a script whose shebang line points to /proc/self/exe. When an administrator runs kubectl exec, runc is invoked, the shebang resolves to the runc binary itself, and a runc process ends up running inside the container's PID namespace.
Locate and Poison: The attacker identifies the runc process and opens its /proc/<pid>/exe symlink. Since runc is bind-mounted from the host, this provides a path to the host's runc binary in the page cache.
Execution: The attacker uses Copy Fail to overwrite the runc ELF header with a malicious payload. The next time runc is executed on the host (via a probe, a pod start, or another exec), the malicious code runs as root on the host.

Detection and Mitigation

One of the most dangerous aspects of Copy Fail is its invisibility to traditional security tooling. Because the on-disk bytes remain unchanged, image registry scanners (Trivy, Clair), agent-less disk scanners, and file-integrity monitors (AIDE, Tripwire) will report that the system is clean.
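One way to look for such a divergence is to hash a file twice: once through ordinary buffered reads, which are served from the page cache, and once with O_DIRECT, which asks the kernel to bypass the cache. The sketch below assumes that the poisoned pages are not marked dirty and are therefore never written back, which is consistent with the description of the write bypassing standard accounting but is not confirmed here. The function names are hypothetical. O_DIRECT is not supported on every filesystem, so the sketch returns None when a direct read is not possible.

```python
import hashlib
import mmap
import os
from typing import Optional

CHUNK = 1 << 20  # 1 MiB, a multiple of common logical block sizes


def cached_digest(path: str) -> str:
    """SHA-256 of the file as seen through normal (page cache) reads."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def direct_digest(path: str) -> Optional[str]:
    """SHA-256 of the file read with O_DIRECT, or None if unsupported."""
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        return None
    try:
        fd = os.open(path, os.O_RDONLY | flag)
    except OSError:
        return None  # filesystem refuses O_DIRECT
    h = hashlib.sha256()
    buf = mmap.mmap(-1, CHUNK)  # anonymous mapping, page-aligned for O_DIRECT
    try:
        remaining = os.fstat(fd).st_size
        while remaining > 0:
            n = os.readv(fd, [buf])
            if n == 0:
                break
            h.update(buf[:n])
            remaining -= n
    except OSError:
        return None
    finally:
        buf.close()
        os.close(fd)
    return h.hexdigest()


def cache_matches_disk(path: str) -> Optional[bool]:
    """True/False if both views could be compared, None if O_DIRECT failed."""
    direct = direct_digest(path)
    if direct is None:
        return None
    return direct == cached_digest(path)
```

A False result is a signal to investigate rather than proof of compromise, since a file that is being legitimately modified during the comparison can also produce a mismatch.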

Effective Defenses

Defense Mechanism	Effectiveness	Note
Kernel Patching	High	The only permanent fix is updating the host kernel to a patched version.
Seccomp Profiles	High	Blocking `socket(AF_ALG, ...)` removes the exploit primitive.
gVisor / Kata Containers	High	These provide a separate kernel or microVM, eliminating the shared page cache.
Managed MicroVMs (Fargate)	High	Per-pod kernels prevent cross-pod poisoning.
Runtime EDR	Partial	Can detect post-exploitation behavior or in-memory page mismatches.
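As an illustration of the seccomp row, the following sketch generates a profile in the JSON format accepted by OCI runtimes such as runc, denying socket() only when its first argument is AF_ALG. The default-allow action is used purely to keep the example short and is an assumption, not a recommendation: in practice the rule would be merged into the runtime's existing default profile. The function name and output filename are hypothetical, and 32-bit x86 workloads that reach sockets through socketcall(2) are not covered by this rule.

```python
import json

AF_ALG = 38  # numeric value of AF_ALG in the Linux socket headers


def deny_af_alg_profile() -> dict:
    """Build a seccomp profile that rejects socket(AF_ALG, ...)."""
    return {
        # Illustration only: a real profile should start from the
        # runtime's default allowlist rather than allow everything.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                "args": [
                    # Match only when argument 0 (the address family) is AF_ALG.
                    {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical output filename; place it wherever the node's
    # runtime expects local seccomp profiles.
    with open("deny-af-alg.json", "w") as f:
        json.dump(deny_af_alg_profile(), f, indent=2)
```

Filtering on the argument rather than blocking socket() outright leaves ordinary network sockets untouched, so most workloads should be unaffected unless they use the kernel crypto API directly.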

Conclusion

Copy Fail demonstrates that the shared nature of the Linux page cache is a significant architectural blind spot in container security. While namespaces provide a logical separation of resources, the underlying memory management of the kernel remains a global resource. For organizations requiring hard multi-tenancy, this vulnerability reinforces the argument that containers should not be used as a primary security boundary and that VM-based isolation (or sandboxed runtimes like gVisor) is essential for protecting high-risk workloads.
