Cutting Inference Cold Starts by 40x: The Engineering Behind Truly Serverless GPUs
In the current era of AI, the demand for inference is characterized by extreme variability. Unlike training workloads, which are predictable and steady, inference is driven by external user behavior, leading to spiky demand patterns. For engineers, this creates a tension between Quality of Service (QoS) and cost: over-provisioning GPUs to handle peaks leads to dismal "GPU Allocation Utilization," while under-provisioning leads to latency spikes and 503 errors.
To achieve "truly serverless" GPUs, the time it takes to scale from a request to a running replica must be reduced from minutes to seconds. Modal has implemented a stack of four key optimizations—Cloud Buffers, a custom lazy filesystem (FUSE), CPU Checkpoint/Restore, and CUDA Checkpoint/Restore—to reduce cold starts by up to 40x (from ~2,000 seconds to ~50 seconds).
1. Removing Instance Allocation from the Hot Path
The first bottleneck in scaling is the time required to spin up a new virtual machine and perform health checks, which can take several minutes. Modal removes this from the critical path by maintaining a cloud buffer of idle, healthy GPUs shared across many applications.
By scheduling new replicas onto these pre-warmed units and asynchronously replenishing the buffer, the system eliminates the initial allocation latency. This is managed as a linear programming problem using Google's GLOP solver, which balances cost, requested capacity, and observed cloud provider supply.
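The buffer-sizing trade-off can be illustrated with a deliberately small model. The sketch below is not Modal's formulation: it assumes a single demand constraint, hypothetical pool names, and made-up prices and supply figures. For that reduced linear program (minimize total cost subject to a capacity target and per-pool supply caps), filling the cheapest pools first is optimal, so the sketch uses a greedy loop in place of a solver such as GLOP.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str             # hypothetical pool label, e.g. a region/instance-type pair
    cost_per_hour: float  # assumed price of holding one idle GPU in this pool
    supply: int           # assumed number of GPUs the provider can offer right now


def plan_buffer(pools, target):
    """Solve  min sum(c_i * x_i)  s.t.  sum(x_i) >= target,  0 <= x_i <= supply_i.

    With only one covering constraint, taking the cheapest pools first
    gives an optimal solution, so no LP solver is needed for this toy case.
    """
    plan = {}
    remaining = target
    for pool in sorted(pools, key=lambda p: p.cost_per_hour):
        take = min(pool.supply, remaining)
        if take > 0:
            plan[pool.name] = take
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise ValueError(f"observed supply is {remaining} GPUs short of the target")
    return plan


pools = [Pool("a", 2.0, 3), Pool("b", 1.0, 2), Pool("c", 3.0, 10)]
print(plan_buffer(pools, 4))
```

A production formulation would add more constraints (per-GPU-type demand, regional placement, and so on), at which point the greedy shortcut stops being valid and a general LP solver becomes the natural tool.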
Crucially, this approach includes aggressive GPU health checking. Because GPUs fail more frequently than standard server hardware, Modal implements a two-tier health check: a short active check on boot and more intensive diagnostics (like dcgmi diag) on a weekly cadence.
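As a rough illustration of the two-tier policy, the following sketch decides which checks a host is due for. The function name, the tier labels, and the rule that a host with no recorded deep check is due for one are all assumptions made for the example; only the boot-time check and the weekly cadence come from the description above.

```python
from datetime import datetime, timedelta

DEEP_CHECK_INTERVAL = timedelta(days=7)  # the "weekly cadence" described above


def checks_due(just_booted, last_deep_check, now):
    """Return the health-check tiers to run for one GPU host.

    Assumption: a host with no recorded deep check is treated as due for one.
    """
    due = []
    if just_booted:
        due.append("active")  # short check before the host joins the buffer
    if last_deep_check is None or now - last_deep_check >= DEEP_CHECK_INTERVAL:
        due.append("deep")    # longer diagnostics, e.g. something like dcgmi diag
    return due


now = datetime(2025, 1, 15)
print(checks_due(True, datetime(2025, 1, 14), now))   # freshly booted, recently diagnosed
print(checks_due(False, datetime(2025, 1, 1), now))   # long-running, diagnostics overdue
```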
2. Lazy Container Loading via ImageFS
Standard container startups are bottlenecked by the need to pull and unpack gigabytes of root filesystem data. Modal solves this by disaggregating the container launcher from the image delivery using a custom filesystem called ImageFS built with libfuse.
Lazy Loading and Content-Addressing
Instead of loading the entire image, ImageFS only loads the metadata (an index) during startup, which takes less than 100ms. The actual file contents are loaded lazily as the application requests them. Since most containers never access a large portion of their filesystem (e.g., locale or timezone data), much of the image is never actually transferred.
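The index-first idea can be sketched in a few lines. This is a toy model, not ImageFS itself: `LazyImage`, the `fetch` callable, and the example paths and keys are hypothetical, and a real FUSE filesystem would serve reads at block granularity instead of whole files.

```python
class LazyImage:
    """Toy image: the index is available up front, contents load on first read."""

    def __init__(self, index, fetch):
        self.index = index    # path -> content key (small metadata, loaded eagerly)
        self._fetch = fetch   # hypothetical callable: content key -> bytes
        self._loaded = {}     # contents fetched so far, keyed by content key

    def read(self, path):
        key = self.index[path]
        if key not in self._loaded:
            # First access to this content: transfer it now and keep it.
            self._loaded[key] = self._fetch(key)
        return self._loaded[key]

    def bytes_transferred(self):
        return sum(len(data) for data in self._loaded.values())


# Stand-in for remote storage; the timezone file is never read, so never fetched.
store = {"k1": b"model code", "k2": b"x" * 1000}
image = LazyImage({"/app/main.py": "k1", "/usr/share/zoneinfo/UTC": "k2"}, store.__getitem__)
image.read("/app/main.py")
print(image.bytes_transferred())
```

Only the 10 bytes behind `/app/main.py` are transferred in this run; the 1,000-byte timezone entry stays in remote storage because nothing asked for it.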
To optimize the data that is accessed, Modal uses a tiered, content-addressed cache:
- Page Cache: Microsecond latency for the most frequent hits.
- Local SSD: High-throughput storage for commonly used content.
- Regional CDN/Blob Storage: Effectively unbounded capacity for the long tail of data, at higher latency than the tiers above.
By using content-addressing rather than path-based or layer-based caching, Modal ensures that shared bytes across different images are stored only once, regardless of which layer they reside in.
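A minimal sketch of a content-addressed, tiered lookup follows. The class, the use of SHA-256 as the content key, and the promote-on-read policy are assumptions for illustration; the source does not say which hash or eviction policy Modal uses, and plain dictionaries stand in for the page cache, the SSD, and blob storage.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Address content by its hash, so identical bytes share a single key."""
    return hashlib.sha256(data).hexdigest()


class TieredCache:
    def __init__(self, blob_store):
        self.memory = {}              # stands in for the page cache
        self.ssd = {}                 # stands in for the local SSD tier
        self.blob_store = blob_store  # stands in for regional CDN / blob storage
        self.hits = {"memory": 0, "ssd": 0, "blob": 0}

    def get(self, key):
        if key in self.memory:
            self.hits["memory"] += 1
            return self.memory[key]
        if key in self.ssd:
            self.hits["ssd"] += 1
            data = self.ssd[key]
        else:
            self.hits["blob"] += 1
            data = self.blob_store[key]
            self.ssd[key] = data      # promote into the SSD tier
        self.memory[key] = data       # promote into the fastest tier
        return data


# The same library at different paths in two images maps to a single key.
lib = b"shared library bytes"
blob = {content_key(lib): lib}
image_a = {"/usr/lib/libfoo.so": content_key(lib)}
image_b = {"/opt/vendor/libfoo.so": content_key(lib)}

cache = TieredCache(blob)
cache.get(image_a["/usr/lib/libfoo.so"])     # first read goes to blob storage
cache.get(image_b["/opt/vendor/libfoo.so"])  # second image hits the memory tier
print(cache.hits)
```

Because the key depends only on the bytes, the second image benefits from the first image's fetch even though the paths and layers differ.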
3. Fast-Forwarding Host Startup with CPU Snapshots
Even after a container starts, the application must initialize. In Python-heavy AI stacks, a simple import torch can trigger thousands of syscalls and several seconds of overhead.
Modal utilizes Checkpoint/Restore (C/R) to bypass this. By using gVisor's runsc runtime, Modal treats the container as a state machine. It creates a memory snapshot of a process (including its heap, thread state, and file descriptor table) and saves it to disk.
When a new replica is needed, the system restores the process directly from this snapshot into memory. This "fast-forwards" the application to a ready state, providing roughly a 10x reduction in host-side startup time. However, this requires snapshots to be compatible with the underlying CPU instruction sets (e.g., avoiding instructions not supported by specific AWS instance types).
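The compatibility constraint amounts to a subset check: every CPU feature the snapshotted process relied on must exist on the restore host. The sketch below assumes each snapshot is stored alongside a recorded set of required feature flags. That bookkeeping, the function names, and the snapshot labels are hypothetical; the flag names follow the style Linux reports in /proc/cpuinfo.

```python
def missing_features(snapshot_features, host_features):
    """CPU features the snapshot relies on that the host does not offer."""
    return sorted(set(snapshot_features) - set(host_features))


def pick_snapshot(snapshots, host_features):
    """Return the name of the first snapshot restorable on this host, else None."""
    for name, required in snapshots.items():
        if not missing_features(required, host_features):
            return name
    return None


# Hypothetical snapshots of the same app, taken on two kinds of host.
snapshots = {
    "wide": {"sse4_2", "avx2", "avx512f"},
    "baseline": {"sse4_2", "avx2"},
}
host = {"sse4_2", "avx2"}  # a host without AVX-512
print(missing_features(snapshots["wide"], host))
print(pick_snapshot(snapshots, host))
```

Listing the more specialized snapshot first means a capable host gets it, while older hosts fall through to the baseline snapshot or, if nothing matches, to a regular cold boot.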
4. Eliminating Device Initialization with CUDA Checkpoints
The final and often most significant bottleneck is GPU-side initialization. This involves two main tasks:
- Weight Loading: Moving billions of parameters from storage to GPU VRAM.
- Inference Engine Setup: Compute-heavy tasks like capturing CUDA graphs or running the Torch compiler.
While weight loading is primarily a throughput bottleneck (limited by network/disk speed), engine setup is a compute bottleneck. Modal leverages recent NVIDIA driver capabilities to checkpoint device memory into host memory.
By combining host-side and device-side snapshots, Modal can restore the entire CUDA context. For LLM servers like vLLM or SGLang, this reduces boot latency significantly. For example, in tests with a 1 GiB model, vLLM boot latency dropped from a mean of ~95 seconds to ~13 seconds when snapshots were enabled.
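To put the quoted figures on a common scale, the short calculation below turns the before/after latencies from this article into speedup factors. It uses only numbers stated in the text; the helper name is arbitrary.

```python
def speedup(before_s, after_s):
    """Ratio of the old latency to the new latency."""
    return before_s / after_s


# Full stack, as quoted in the introduction: ~2,000 s -> ~50 s
print(round(speedup(2000, 50), 1))  # 40.0

# vLLM boot with snapshots enabled: ~95 s -> ~13 s
print(round(speedup(95, 13), 1))    # 7.3
```

The vLLM example therefore shows roughly a 7x improvement on its own, while the 40x headline figure refers to the full stack of optimizations applied together.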
Summary of Latency Reductions
| Optimization | Target Component | Latency Impact |
|---|---|---|
| Cloud Buffers | Machine Management | Minutes → Seconds |
| ImageFS (FUSE) | Local SSD / Network | Minutes → Seconds |
| CPU Snapshots | CPU / RAM | Tens of Seconds → Seconds |
| CUDA Snapshots | GPU / VRAM | Minutes → Tens of Seconds |
Real-World Application: Reducto
These optimizations enable workloads with high peak-to-average ratios to scale efficiently. Reducto, a document processing platform, uses vision-language models to process massive enterprise datasets. Their workloads require scaling to thousands of GPUs for short bursts to meet tight deadlines. By utilizing GPU memory snapshotting, Reducto reduced their cold starts from ~70 seconds to ~12 seconds, allowing them to operate a "kilo-GPU" workload in a truly serverless fashion without maintaining costly idle capacity.
Technical Counterpoints and Considerations
While the Modal approach is highly effective, the community has noted several technical trade-offs:
- FUSE Overhead: The use of libfuse introduces additional context switches between user and kernel space. While this is negligible for throughput-heavy AI workloads, it can be a bottleneck for latency-sensitive file operations.
- Snapshot Fragility: Memory snapshots are highly sensitive to the host environment. A snapshot created on a machine with specific CPU instructions cannot be restored on a machine lacking them, necessitating multiple snapshots for heterogeneous clusters.
- Multi-GPU Complexity: Snapshotting multi-GPU programs is challenging because communication libraries like NCCL are not designed for pauses and can deadlock during the restore process.