The Evolution of LLM Efficiency: KV Sharing, mHC, and Compressed Attention
The landscape of Large Language Model (LLM) architectures is shifting. While the decoder-only transformer remains the standard design, the focus has moved from scaling parameter counts to optimizing long-context efficiency. Reasoning models and agentic workflows must keep more tokens in memory for longer periods, so KV-cache size, memory traffic, and attention cost have become the primary bottlenecks.
Recent releases from Google, DeepSeek, and other open-weight contributors reveal a trend: the introduction of intricate "architecture tricks" designed to reduce the computational and memory footprint of long-context inference without sacrificing representational capacity.
Gemma 4: KV Sharing and Per-Layer Embeddings
Google's Gemma 4 suite introduces two significant efficiency-oriented design choices in its smaller variants (E2B and E4B).
Cross-Layer KV Sharing
To combat the memory demands of the KV cache, Gemma 4 employs a shared KV cache scheme. While Grouped Query Attention (GQA) already shares KV heads across multiple query heads within a single layer, Gemma 4 takes this further by sharing KV projections across different layers.
In this setup, later layers reuse the key-value states from the most recent earlier non-shared layer of the same attention type. For example, in the Gemma 4 E2B model, only the first 15 of 35 layers compute their own KV projections; the remaining 20 reuse them. Because those 20 layers add nothing to the cache, the KV cache shrinks by roughly half, saving approximately 2.7 GB of memory for 128K contexts in the E2B model and 6 GB in the E4B model.
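The reuse rule can be sketched in a few lines. This is a minimal illustration, not Gemma's implementation: the function name `kv_source_layers` and the layer layout (every fifth layer global, the rest sliding-window) are assumptions made for the example; only the 15-of-35 split comes from the text above.

```python
def kv_source_layers(layer_types, num_unshared):
    """Return, for each layer, the index of the layer whose KV states it reads."""
    last_unshared = {}   # attention type -> most recent layer that computed its own KV
    sources = []
    for idx, kind in enumerate(layer_types):
        if idx < num_unshared:
            last_unshared[kind] = idx            # computes and caches its own KV
            sources.append(idx)
        else:
            sources.append(last_unshared[kind])  # reuses KV from an earlier layer
    return sources

# Hypothetical layout: every fifth layer is global, the rest are sliding-window.
layer_types = ["global" if i % 5 == 4 else "sliding" for i in range(35)]
sources = kv_source_layers(layer_types, num_unshared=15)

layers_with_own_kv = len(set(sources))
print(layers_with_own_kv, "of", len(layer_types), "layers store KV states")
print("layer 20 (sliding) reads KV from layer", sources[20])
print("layer 34 (global) reads KV from layer", sources[34])
```

Under this assumed layout, every shared sliding-window layer reads from the last non-shared sliding-window layer, and every shared global layer reads from the last non-shared global layer, so the cache holds entries for only 15 layers.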
Per-Layer Embeddings (PLE)
While KV sharing reduces memory, PLE focuses on parameter efficiency. The goal is to allow small models to utilize more token-specific information without scaling the entire transformer stack.
Instead of giving each block a full copy of the token embedding layer, PLE provides each transformer block with a small, layer-specific token vector. This vector is gated by the hidden state and added as an extra residual update after the feed-forward branch. This allows the model to maintain a smaller "effective" parameter count for the expensive transformer blocks while storing additional capacity in cheaper, lookup-style embedding tables.
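As a rough sketch of the mechanism just described, the update below looks up a small per-layer token vector, gates it with the hidden state, and adds the result as a residual. The sigmoid gate, the up-projection `w_up`, the toy sizes, and all names are assumptions chosen for illustration; the source does not specify these details.

```python
import math
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def ple_update(hidden, token_id, layer_table, w_gate, w_up):
    """Add a gated, layer-specific token vector to the hidden state."""
    token_vec = layer_table[token_id]                # cheap lookup, size ple_dim
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_gate, hidden)]  # gate from hidden state
    gated = [g * t for g, t in zip(gate, token_vec)]
    update = matvec(w_up, gated)                     # project back to model width
    return [h + u for h, u in zip(hidden, update)]   # extra residual update

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, ple_dim, vocab = 8, 2, 10        # toy sizes, hypothetical
layer_table = random_matrix(vocab, ple_dim)   # one such table per transformer block
w_gate = random_matrix(ple_dim, d_model)
w_up = random_matrix(d_model, ple_dim)

hidden = [random.uniform(-1, 1) for _ in range(d_model)]  # stands in for the post-FFN state
new_hidden = ple_update(hidden, 3, layer_table, w_gate, w_up)
print(len(new_hidden))
```

The point of the design is visible in the shapes: the per-token table has width `ple_dim`, much smaller than `d_model`, so the added capacity lives in a lookup rather than in the matrix multiplications of the block.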
Laguna XS.2: Layer-wise Attention Budgeting
Poolside's Laguna XS.2 introduces the concept of "Layer-wise attention budgeting," which challenges the assumption that every transformer layer requires the same attention capacity.
Laguna XS.2 varies the attention cost by layer using a mix of 30 sliding-window attention layers (local context) and 10 global attention layers (full context). The innovation lies in the use of per-layer query-head counts. Specifically, the model assigns more query heads to the cheaper sliding-window layers and fewer query heads to the expensive global layers, while keeping the KV heads fixed. A global layer's attention cost grows with the full context length, whereas a sliding-window layer's cost is capped by the window size, so this allocation spends query heads where each one is cheapest.
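A back-of-the-envelope cost model makes the trade-off concrete. Only the 30/10 layer split comes from the text above; the head counts, head dimension, context length, and window size below are hypothetical, and the cost formula is a simplified estimate of attention multiply-adds rather than a measurement of Laguna XS.2.

```python
def attention_cost(q_heads, head_dim, attended_tokens):
    """Approximate multiply-adds per generated token for one attention layer.

    Counts the QK^T scores and the weighted sum over V: each costs about
    head_dim multiply-adds per query head per attended token.
    """
    return 2 * q_heads * head_dim * attended_tokens

def stack_cost(sliding_heads, global_heads, context, window,
               n_sliding=30, n_global=10, head_dim=128):
    """Total attention cost of a stack of sliding-window and global layers."""
    sliding = n_sliding * attention_cost(sliding_heads, head_dim, min(window, context))
    full = n_global * attention_cost(global_heads, head_dim, context)
    return sliding + full

context, window = 131072, 4096                   # hypothetical context and window sizes
uniform = stack_cost(32, 32, context, window)    # same head count in every layer
budgeted = stack_cost(48, 16, context, window)   # more heads on cheap layers, fewer on global

print("total query heads:", 30 * 32 + 10 * 32, "vs", 30 * 48 + 10 * 16)
print("relative attention cost:", round(budgeted / uniform, 3))
```

With these made-up numbers the budgeted stack has more query heads in total (1600 versus 1280) yet needs roughly 59% of the attention compute, because the heads it removes sit on the layers that scan the full context.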
ZAYA1-8B: Compressed Convolutional Attention (CCA)
ZAYA1-8B, developed by Zyphra, introduces Compressed Convolutional Attention (CCA), a mechanism that operates directly in a compressed latent space.
Unlike Multi-head Latent Attention (MLA), which uses latent representations primarily to reduce the KV cache before projecting them back for computation, CCA performs the attention operation itself within the compressed space. This reduces not only the KV cache size but also the FLOPs required during prefill and training.
To mitigate the loss of expressiveness caused by compression, CCA employs convolutional mixing on the compressed Query (Q) and Key (K) representations. These convolutions provide the compressed vectors with local context before attention scores are computed, which the developers claim allows CCA to outperform MLA under comparable compression settings.
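The idea can be sketched as a toy single-head attention that never leaves the compressed space. This is an illustrative reading of the description above, not Zyphra's implementation: the down-projection matrices, the causal convolution taps, the single head, and the omission of a final up-projection are all assumptions.

```python
import math
import random

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def causal_conv(seq, taps):
    """Mix each vector with its predecessors; taps[0] weights the current token."""
    out = []
    for t in range(len(seq)):
        mixed = [0.0] * len(seq[t])
        for lag, w in enumerate(taps):
            if t - lag >= 0:
                mixed = [m + w * x for m, x in zip(mixed, seq[t - lag])]
        out.append(mixed)
    return out

def compressed_attention(hidden_seq, w_q, w_k, w_v, taps):
    """Causal single-head attention computed entirely in the compressed space."""
    q = causal_conv([matvec(w_q, h) for h in hidden_seq], taps)  # local context for Q
    k = causal_conv([matvec(w_k, h) for h in hidden_seq], taps)  # local context for K
    v = [matvec(w_v, h) for h in hidden_seq]   # only compressed k and v would be cached
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for t in range(len(q)):
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s])) for s in range(t + 1)]
        peak = max(scores)                     # subtract the max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[s][i] for s, w in enumerate(weights))
                        for i in range(len(v[0]))])
    return outputs

random.seed(0)
d_model, d_compressed, seq_len = 8, 2, 5       # toy sizes, hypothetical
def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

w_q, w_k, w_v = (random_matrix(d_compressed, d_model) for _ in range(3))
hidden_seq = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
outputs = compressed_attention(hidden_seq, w_q, w_k, w_v, taps=[0.6, 0.3, 0.1])
print(len(outputs), len(outputs[0]))
```

Both the score computation and the cached vectors have width `d_compressed` rather than `d_model`, which is where the savings in FLOPs and cache size would come from in this simplified picture.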
DeepSeek V4: mHC and Sequence Compression
DeepSeek V4 adds considerable architectural complexity, reworking both the residual pathway and the attention mechanism.
Manifold-Constrained Hyper-Connections (mHC)
DeepSeek V4 modernizes the residual connection by replacing the single residual stream with several parallel residual streams (hyper-connections). To prevent the signal from amplifying or shrinking unpredictably across deep layers, DeepSeek introduces "manifold constraints."
The residual mapping is projected onto the manifold of doubly stochastic matrices (all entries are non-negative, and every row and every column sums to 1). Such a matrix can only redistribute signal across the parallel streams, never amplify it, which keeps the mixing stable across deep stacks. This makes the residual pathway more expressive without significantly increasing the FLOPs of the Attention or MoE layers.
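One standard way to obtain an approximately doubly stochastic matrix from unconstrained weights is Sinkhorn-Knopp normalization, which alternates row and column scaling. The sketch below uses it to build a stream-mixing matrix; treating this as the projection step is an assumption for illustration, and the function names, logits, and stream values are made up.

```python
import math

def sinkhorn(logits, iters=50):
    """Approximately project a square matrix onto the doubly stochastic matrices."""
    m = [[math.exp(x) for x in row] for row in logits]     # make every entry positive
    for _ in range(iters):
        m = [[x / sum(row) for x in row] for row in m]     # normalize rows
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / c for x, c in zip(row, col_sums)] for row in m]  # normalize columns
    return m

def mix_streams(mixing, streams):
    """Redistribute residual streams: new_i = sum_j mixing[i][j] * streams[j]."""
    width = len(streams[0])
    return [[sum(mixing[i][j] * streams[j][d] for j in range(len(streams)))
             for d in range(width)] for i in range(len(mixing))]

# Hypothetical learned logits for 4 parallel residual streams.
logits = [[0.5, -1.0, 0.2, 0.0],
          [1.0, 0.3, -0.7, 0.1],
          [-0.4, 0.8, 0.6, -1.0],
          [0.0, 0.2, -0.3, 0.9]]
mixing = sinkhorn(logits)

streams = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 4.0]]  # 4 streams of width 2
mixed = mix_streams(mixing, streams)

print("row sums:", [round(sum(row), 6) for row in mixing])
print("column sums:", [round(sum(col), 6) for col in zip(*mixing)])
```

Because the columns sum to 1, the per-dimension total across streams is the same before and after mixing; the matrix moves information between streams without changing how much of it there is.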
CSA and HCA: Sequence-Length Compression
While MLA compresses the representation of each token, DeepSeek V4's Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) compress the sequence length itself.
- CSA (Compressed Sparse Attention): Uses a mild compression rate and a sparse selector to identify the most relevant compressed history blocks.
- HCA (Heavily Compressed Attention): Employs aggressive compression (e.g., compressing 128 tokens into one entry) and performs dense attention over those entries.
By interleaving CSA and HCA layers and maintaining a local sliding-window branch for recent tokens, DeepSeek V4-Pro substantially reduces long-context overhead. At a 1M-token context, it uses only 10% of the KV cache and 27% of the inference FLOPs of DeepSeek V3.2.
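The two compression regimes can be illustrated with a toy example. Mean pooling stands in for the learned compressor, a dot-product top-k stands in for the sparse selector, and the 4:1 rate for the mild setting is a hypothetical choice; only the 128:1 example rate comes from the text above.

```python
def compress_blocks(vectors, block):
    """Mean-pool each full group of `block` consecutive vectors into one entry."""
    entries = []
    for start in range(0, len(vectors) - block + 1, block):
        chunk = vectors[start:start + block]
        entries.append([sum(col) / block for col in zip(*chunk)])
    return entries

def select_blocks(query, entries, top_k):
    """Sparse selector: indices of the top_k entries by dot-product score."""
    scores = [sum(q * e for q, e in zip(query, entry)) for entry in entries]
    ranked = sorted(range(len(entries)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

# Toy history: 1024 two-dimensional keys; positions 512..639 point along +x.
keys = [[1.0, 0.0] if 512 <= t < 640 else [0.0, 1.0] for t in range(1024)]

heavy = compress_blocks(keys, block=128)   # HCA-style: few entries, attend to all of them
mild = compress_blocks(keys, block=4)      # CSA-style: more entries, attend to a selection
chosen = select_blocks([1.0, 0.0], mild, top_k=8)

print(len(keys), "tokens ->", len(heavy), "heavy entries,", len(mild), "mild entries")
print("selected mild entries:", chosen)
print("entries for 1M tokens at 128:1 ->", 1_000_000 // 128)
```

In this sketch the heavy branch keeps a coarse summary of the entire history that is cheap enough to attend to densely, while the mild branch keeps finer entries and relies on the selector to read only the relevant ones.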
Summary of Architectural Trends
The evolution from GPT-2 to DeepSeek V4 shows a clear trajectory: the transformer block is no longer a static entity but a modular system of specialized optimizations. The current trend is to increase complexity within the block to decrease costs at runtime.
| Model | Primary Efficiency Innovation | Target Metric |
|---|---|---|
| Gemma 4 | Cross-layer KV sharing & PLE | KV Cache Memory / Parameter Efficiency |
| Laguna XS.2 | Per-layer query-head budgeting | Attention FLOPs |
| ZAYA1-8B | Compressed Convolutional Attention | KV Cache & Attention FLOPs |
| DeepSeek V4 | mHC & CSA/HCA | Residual Expressiveness / Long-Context KV Cache & FLOPs |
As these models move toward agentic workflows and massive context windows, the ability to surgically reduce memory and compute overhead while maintaining modeling quality will be the defining characteristic of next-generation LLM architectures.