Optimizing LLM Efficiency: Reducing Input Tokens by 70% with Adola
As Large Language Models (LLMs) are integrated into production environments, the cost and latency associated with massive context windows are becoming critical bottlenecks. When developers implement Retrieval-Augmented Generation (RAG) or complex agentic workflows, they often encounter "noisy" context (redundant information, irrelevant snippets, or bloated tool transcripts) that increases token consumption without adding value to the final answer.
Adola introduces Rose 1, a semantic prompt compression engine designed to trim this noise before the model call. By focusing on "keeping what matters," Rose 1 aims to reduce input tokens by up to 70% while ensuring that the core reasoning and factual accuracy of the model remain intact.
How Semantic Compression Works
Unlike simple truncation, which might cut off vital information at the end of a prompt, Adola uses semantic compression. This process identifies and removes redundant or irrelevant data while preserving the essential spans of text required to answer a specific query.
For example, in a technical support scenario, a prompt might contain duplicate notes from a support thread or unrelated ticket history. Rose 1 filters these out, retaining only the critical schema, policy exceptions, account tiers, and citation trails necessary for the model to provide a safe and accurate response.
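To make the idea concrete, here is a deliberately simple, query-aware filter. It is not Rose 1's algorithm, which the article does not describe; it only illustrates the general pattern of dropping duplicates and keeping the snippets most relevant to the query. The function names, the word-overlap scoring, and the choice to apply the ratio to snippet counts rather than tokens are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real semantic features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_compress(snippets: list[str], query: str, target_ratio: float) -> list[str]:
    """Drop exact duplicates, then keep the snippets that overlap most with the query."""
    # 1. Remove empty entries and exact duplicates, preserving first-seen order.
    unique = list(dict.fromkeys(s.strip() for s in snippets if s.strip()))
    # 2. Score each snippet by the number of query words it shares.
    query_words = tokenize(query)
    scores = {s: len(tokenize(s) & query_words) for s in unique}
    # 3. Keep the top-scoring snippets: at least one, at most target_ratio of the input count.
    keep_count = max(1, round(len(snippets) * target_ratio))
    kept = set(sorted(unique, key=lambda s: scores[s], reverse=True)[:keep_count])
    # 4. Return survivors in their original order so the context still reads naturally.
    return [s for s in unique if s in kept]
```

A real semantic compressor would work on meaning rather than shared words and would operate below the snippet level, but the inputs and outputs have the same shape: raw context and a query go in, a shorter context comes out.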
Performance and Benchmarks
One of the primary challenges with prompt compression is the risk of "turning the model blind": removing so much information that accuracy drops. Adola has released production benchmarks for Rose 1 across six major evaluation sets, reporting little or no accuracy loss even at 70% compression (keeping only 30% of the original prompt).
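The relationship between the target ratio and the token bill is simple arithmetic, sketched below. The helper names and the per-million-token price are hypothetical; substitute the input price of whichever model provider is in use.

```python
def compressed_token_count(original_tokens: int, target_ratio: float) -> int:
    """Tokens remaining when only `target_ratio` of the prompt is kept."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in the range (0, 1]")
    return round(original_tokens * target_ratio)


def input_cost_saved(original_tokens: int, target_ratio: float,
                     price_per_million_tokens: float) -> float:
    """Input cost avoided per request, in the same currency as the price."""
    removed = original_tokens - compressed_token_count(original_tokens, target_ratio)
    return removed / 1_000_000 * price_per_million_tokens


# A 20,000-token prompt kept at ratio 0.3 leaves 6,000 tokens and removes 14,000.
kept = compressed_token_count(20_000, 0.3)
# Hypothetical price of 3.00 per million input tokens.
saved = input_cost_saved(20_000, 0.3, price_per_million_tokens=3.00)
```

This covers only the input-token side of the bill; any fee for the compression step itself would need to be subtracted from the saving.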
Benchmark Results
| Evaluation Set | Focus Area | Accuracy Impact |
|---|---|---|
| AIME | Competition Math | 0% decrease |
| GPQA Diamond | Expert Science QA | 0% decrease |
| GDPval-AA | Professional Tasks | 0% decrease |
| CommonsenseQA | Commonsense Reasoning | 0% decrease |
| GSM8K | Grade-school Math | 0% decrease |
| ARC-Challenge | Grade-school Science | 2% decrease |
Five of the six sets show no reported accuracy decrease, and ARC-Challenge shows a 2% decrease. These results suggest that, for most of the reasoning tasks tested, semantic noise can be stripped away without affecting the model's ability to reach the correct conclusion.
Production Use Cases
Adola is positioned as a "pre-model API" or a prompt gateway, meaning it can be inserted into existing workflows without requiring a change in the model provider. Key application areas include:
1. Agent Traces
In multi-step agentic workflows, tool transcripts can grow very long. Adola can trim these transcripts before the next planning step, reducing the cost of subsequent turns in the conversation.
2. RAG Retrieval
Retrieval systems often over-retrieve chunks to ensure the answer is present. Rose 1 shrinks these over-retrieved chunks while keeping the answer-bearing spans, optimizing the context window.
3. Support Copilots
For customer support bots, compressing ticket history, policy documents, and account context allows the model to handle more history and documentation without hitting token limits or increasing latency.
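The "pre-model" placement common to all three use cases can be sketched as a thin wrapper that compresses accumulated context just before the prompt is assembled. Everything here is an illustrative assumption: the `Compressor` signature, the character threshold, and the `keep_matching_lines` stand-in, which simply keeps lines that share a word with the query and is not how Rose 1 works.

```python
from typing import Callable

# Assumed shape of a compressor: (context, query) -> shorter context.
Compressor = Callable[[str, str], str]


def build_planner_prompt(goal: str, tool_transcripts: list[str],
                         compress: Compressor, max_chars: int = 2000) -> str:
    """Assemble the next planning prompt, compressing tool output only when it is large."""
    transcript = "\n".join(tool_transcripts)
    if len(transcript) > max_chars:
        # The compression step sits here, before the model call.
        transcript = compress(transcript, goal)
    return f"Goal: {goal}\n\nTool output so far:\n{transcript}\n\nNext step:"


def keep_matching_lines(context: str, query: str) -> str:
    """Stand-in compressor for local testing: keep lines sharing a word with the query."""
    words = set(query.lower().split())
    return "\n".join(
        line for line in context.splitlines() if words & set(line.lower().split())
    )
```

Because the wrapper only depends on a callable, the stand-in can be swapped for a call to a real compression service without touching the prompt-building code or the model provider.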
Implementation
Adola offers SDKs for Python, JavaScript, TypeScript, Go, and Rust. A typical implementation sends the raw context and the user query to the Adola API, along with a target compression ratio.
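```python
from adola import Adola

client = Adola(api_key="rose_...")

# Read the raw, uncompressed context that would otherwise go straight to the model.
with open("retrieved_context.txt") as f:
    raw_context = f.read()

result = client.compress(
    input=raw_context,
    query="Which incident caused latency?",
    compression={"target_ratio": 0.3},  # keep about 30% of the input
    include_spans=False,
)
compressed = result["output"]
```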
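Since the compression step sits in front of every model call, it may be worth guarding against it failing. The helper below is a hypothetical sketch, not part of the Adola SDK: it takes the compression call as a zero-argument callable and falls back to the raw context if that call raises or returns an empty string.

```python
from typing import Callable


def compress_or_passthrough(compress_call: Callable[[], str], raw_context: str) -> str:
    """Return the compressed context, or the raw context if compression fails."""
    try:
        compressed = compress_call()
    except Exception:
        # Assumption: a failed compression step should not block the model call.
        return raw_context
    # An empty result would leave the model with no context, so fall back as well.
    return compressed if compressed.strip() else raw_context
```

With the snippet above, the call could be wrapped as `compress_or_passthrough(lambda: client.compress(...)["output"], raw_context)`, where `raw_context` is the text read from the file. The trade-off is that a fallback request costs the full, uncompressed token count.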
Community Perspectives
While the initial reception highlights the potential for cost savings, some users have raised questions regarding the flexibility of the compression strategy. One user noted:
"Can I choose a strategy for token reduction, based on what I'm optimizing for? I might be ok with a quality drop for a great cost savings, for example."
This suggests a demand for more granular control over the trade-off between accuracy and aggressive cost reduction, a feature that would allow developers to tune the compression level based on the specific criticality of the task.
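One way a developer could approximate this kind of control today is to choose the target ratio per request based on how critical the task is. The mapping below is purely illustrative: only the 0.3 ratio corresponds to the benchmarked setting, while the other values and the `Criticality` levels are hypothetical and would need to be validated against a team's own evaluation data.

```python
from enum import Enum


class Criticality(Enum):
    LOW = "low"        # e.g. internal summaries, where some quality loss is acceptable
    MEDIUM = "medium"  # default traffic
    HIGH = "high"      # e.g. policy or billing answers, where accuracy matters most


# Hypothetical ratios: the fraction of the prompt to keep at each level.
TARGET_RATIO_BY_CRITICALITY = {
    Criticality.LOW: 0.2,
    Criticality.MEDIUM: 0.3,
    Criticality.HIGH: 0.6,
}


def compression_settings(criticality: Criticality) -> dict[str, float]:
    """Build a settings dict in the same shape as the `compression` argument shown earlier."""
    return {"target_ratio": TARGET_RATIO_BY_CRITICALITY[criticality]}
```

For example, `compression_settings(Criticality.HIGH)` keeps more of the prompt for a sensitive request, at the price of a smaller token saving.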