Orthrus-Qwen3: Accelerating LLM Inference with Diffusion-Based Speculative Decoding
The central challenge of autoregressive (AR) Transformers, the architecture behind most modern Large Language Models (LLMs), is their sequential nature. Generating tokens one at a time is computationally expensive and often becomes the bottleneck in real-world applications. To address this, researchers have explored speculative decoding, in which a smaller, faster 'drafter' model predicts several tokens ahead and the larger 'target' model verifies them in parallel.
Orthrus-Qwen3 introduces a novel approach to this problem by integrating a trainable diffusion attention module directly into each layer of a frozen Qwen3 backbone. Unlike traditional speculative decoding, Orthrus does not require an external drafter model or a separate KV cache, resulting in significant speedups without sacrificing accuracy.
How Orthrus Works: The Diffusion-Attention Mechanism
At its core, Orthrus modifies the standard AR Transformer by injecting a trainable diffusion attention module into each layer. The key innovation is that the base model (the frozen AR Transformer) and the diffusion head are designed to share a single KV cache.
The process works in two stages:
- The Diffusion Pass: The diffusion head proposes $K=32$ tokens in parallel. This is a single-step denoising process that drafts a block of candidate next tokens.
- The Verification Pass: The AR head then checks these candidates in a second pass. It accepts the longest prefix of the draft that matches what the base model itself would have generated.
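The two stages above can be sketched as a generic draft-and-verify loop under greedy decoding. This is a minimal illustration, not the authors' implementation: `toy_base` and `toy_drafter` are hypothetical stand-ins for the frozen AR head and the diffusion head, the per-position loop replaces what would be one batched forward pass, and emitting a "bonus" token when the whole draft is accepted is a common speculative-decoding convention assumed here.

```python
from typing import Callable, List, Sequence

Token = int
BaseModel = Callable[[Sequence[Token]], Token]           # greedy next-token function
Drafter = Callable[[Sequence[Token], int], List[Token]]  # proposes k tokens at once


def accept_longest_prefix(draft: Sequence[Token], verified: Sequence[Token]) -> List[Token]:
    """Tokens emitted in one round. verified[i] is the base model's greedy
    choice given the prefix plus draft[:i]; it has len(draft) + 1 entries."""
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must have exactly one more entry than draft")
    out: List[Token] = []
    for i, tok in enumerate(draft):
        if tok != verified[i]:
            out.append(verified[i])   # first mismatch: keep the base model's token, stop
            return out
        out.append(tok)               # draft token matches the base model: accept it
    out.append(verified[len(draft)])  # whole draft accepted: assumed bonus token
    return out


def speculative_generate(base: BaseModel, drafter: Drafter,
                         prompt: Sequence[Token], k: int, max_new: int) -> List[Token]:
    seq = list(prompt)
    target_len = len(seq) + max_new
    while len(seq) < target_len:
        draft = drafter(seq, k)
        # A real system scores all k + 1 positions in a single forward pass.
        verified = [base(seq + draft[:i]) for i in range(len(draft) + 1)]
        seq.extend(accept_longest_prefix(draft, verified))
    return seq[:target_len]


def autoregressive_generate(base: BaseModel, prompt: Sequence[Token], max_new: int) -> List[Token]:
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(base(seq))
    return seq


def toy_base(prefix: Sequence[Token]) -> Token:
    """Hypothetical deterministic 'model' over a 17-token vocabulary."""
    return (31 * sum(prefix) + len(prefix)) % 17


def toy_drafter(prefix: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical drafter: mostly agrees with toy_base, wrong at every third slot."""
    ctx, out = list(prefix), []
    for i in range(k):
        tok = toy_base(ctx)
        if i % 3 == 2:
            tok = (tok + 1) % 17      # deliberate drafting error
        out.append(tok)
        ctx.append(tok)
    return out
```

Because every emitted token is either a draft token that equals the base model's greedy choice or the base model's own correction, the speculative output in this toy setting matches plain autoregressive decoding token for token; only the number of verification rounds changes.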
Because the base model's weights are frozen and every emitted token must pass verification by the AR head, the output distribution is provably identical to that of the original Qwen3 model. Users therefore receive responses of the same quality, only significantly faster.
Performance Benchmarks and Advantages
According to the authors, Orthrus-Qwen3 delivers substantial improvements in throughput and efficiency compared to both traditional diffusion LMs and existing speculative decoding methods.
Throughput and Speed
- Tokens Per Forward (TPF): Orthrus achieves up to 7.8 TPF, that is, up to 7.8 tokens emitted per forward pass, with approximately 6x wall-clock speedup on the MATH-500 benchmark.
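The gap between tokens per forward pass and wall-clock speedup can be read through a simple cost model. The sketch below assumes that plain AR decoding emits one token per forward pass and that speedup equals TPF divided by the relative cost of one draft-and-verify step; both function names are illustrative, and the two reported figures may not come from the same run, so the result is only a rough reading.

```python
def wall_clock_speedup(tpf: float, step_cost: float) -> float:
    """Speedup over plain AR decoding (1 token per forward pass), assuming each
    draft-and-verify step costs `step_cost` times one plain AR forward pass."""
    if step_cost <= 0:
        raise ValueError("step_cost must be positive")
    return tpf / step_cost


def implied_step_cost(tpf: float, speedup: float) -> float:
    """Invert the same assumed model: relative cost per step implied by the
    reported TPF and wall-clock speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return tpf / speedup


# Reported figures: up to 7.8 TPF and roughly 6x wall-clock speedup.
overhead = implied_step_cost(7.8, 6.0)
```

Under these assumptions, the reported numbers would correspond to each draft-and-verify step costing roughly 1.3 times a plain AR forward pass.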
Comparison with Diffusion LMs
Traditional diffusion LMs (such as Dream, Fast-dLLM-v2, and Mercury) often modify the base weights of the model to enable parallel generation. This frequently leads to a loss in accuracy. For instance, Fast-dLLM-v2 saw an 11-point drop on MATH-500. In contrast, Orthrus freezes the backbone, ensuring that accuracy matches Qwen3-8B exactly.
Comparison with Speculative Decoding
Compared to methods like EAGLE-3 and DFlash, Orthrus offers several architectural advantages:
- No External Drafter: There is no need to initialize or synchronize a separate model, which eliminates the Time to First Token (TTFT) penalty.
- Memory Efficiency: The additional KV overhead is minimal and constant, $O(1)$ (approximately 4.5 MiB flat).
- Higher Acceptance Rates: On the MATH-500 benchmark, Orthrus achieved an acceptance length of 11.7, compared to 7.9 for DFlash and 3.5 for EAGLE-3.
Training and Implementation Details
Training Orthrus is comparatively cheap. Only 16% of the parameters are trained, and the model was trained on fewer than 1B tokens over 24 hours using 8x H200 GPUs.
The authors report that KL distillation outperformed Cross-Entropy (CE) training for improving the acceptance rate, and that a single-step denoising process (6.35 TPF) outperformed multi-step denoising (3.53 TPF).
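To make the difference between the two objectives concrete, here is a minimal sketch for a single token position. It assumes the forward direction KL(teacher || student), with the frozen AR head as teacher and the diffusion head as student; the source does not state the direction or the exact loss, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(student_logits: Sequence[float], target_token: int) -> float:
    """CE against one hard label: only the target token's probability matters."""
    return -log_softmax(student_logits)[target_token]


def kl_distillation(teacher_logits: Sequence[float],
                    student_logits: Sequence[float]) -> float:
    """KL(teacher || student): matches the teacher's full distribution
    (direction is an assumption made for this sketch)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher = [2.0, 1.0, 0.0]   # made-up logits for a 3-token vocabulary
student = [0.0, 1.0, 2.0]
ce_loss = cross_entropy(student, target_token=0)
kl_loss = kl_distillation(teacher, student)
```

CE only rewards probability mass on a single labelled token, whereas the KL term is zero exactly when the student reproduces the teacher's whole distribution, which is the quantity that verification compares against.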
Limitations and Considerations
While the Orthrus-Qwen3 results are impressive, the model is bounded by its frozen base model: it inherits all the biases, hallucinations, and knowledge gaps of the original Qwen3. Furthermore, the current evaluation is limited to Qwen3 and to greedy and rejection sampling.
Community discussion of this work has turned to whether it could be applied to other models, such as DeepSeek-V3, or to quantized GGUF versions for local LLM execution. If successfully ported, it could significantly reduce latency and serving load for large-scale AI providers and local enthusiasts alike.