Optimizing LLM Efficiency: Reducing Input Tokens by 70% with Adola

As Large Language Models (LLMs) move into production, the cost and latency of large prompts are becoming critical bottlenecks. Developers who implement Retrieval-Augmented Generation (RAG) or complex agentic workflows often encounter "noisy" context (redundant information, irrelevant snippets, or bloated tool transcripts) that increases token consumption without adding value to the final answer.

Adola introduces Rose 1, a semantic prompt compression engine designed to trim this noise before the model call. By focusing on "keeping what matters," Rose 1 aims to reduce input tokens by up to 70% while ensuring that the core reasoning and factual accuracy of the model remain intact.

How Semantic Compression Works

Unlike simple truncation, which might cut off vital information at the end of a prompt, Adola uses semantic compression. This process identifies and removes redundant or irrelevant data while preserving the essential spans of text required to answer a specific query.

For example, in a technical support scenario, a prompt might contain duplicate notes from a support thread or unrelated ticket history. Rose 1 filters these out, retaining only the critical schema, policy exceptions, account tiers, and citation trails necessary for the model to provide a safe and accurate response.
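
The filtering described above can be illustrated with a deliberately simple toy. This is not how Rose 1 works internally (the source does not describe its method); it is a minimal sketch that assumes exact-duplicate removal plus word overlap with the query as a crude stand-in for semantic relevance. The names `tokenize` and `toy_compress` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real semantic features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_compress(sentences: list[str], query: str, target_ratio: float) -> list[str]:
    """Keep roughly `target_ratio` of the input sentences, preferring query-relevant ones."""
    # 1. Drop duplicates (ignoring case and extra whitespace), keeping the first copy.
    seen: set[str] = set()
    unique: list[tuple[int, str]] = []
    for index, sentence in enumerate(sentences):
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((index, sentence))

    # 2. Score each remaining sentence by how many words it shares with the query.
    query_words = tokenize(query)
    scored = [(len(tokenize(s) & query_words), i, s) for i, s in unique]

    # 3. Keep the best-scoring sentences within the budget (at least one).
    #    The budget is relative to the original input, duplicates included.
    budget = max(1, round(len(sentences) * target_ratio))
    kept = sorted(scored, key=lambda item: (-item[0], item[1]))[:budget]

    # 4. Restore the original order so the output still reads coherently.
    return [s for _, _, s in sorted(kept, key=lambda item: item[1])]
```

Given a support thread with a duplicated note and unrelated ticket history, a call such as `toy_compress(notes, "Is this Enterprise customer eligible for a refund exception?", 0.4)` would keep only the sentences that share the most words with the question. A real semantic compressor would rely on much richer signals than word overlap.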

Performance and Benchmarks

One of the primary challenges with prompt compression is the risk of "turning the model blind": removing so much information that accuracy drops. Adola has released production benchmarks for Rose 1 across six evaluation sets, reporting stable accuracy at a 70% compression ratio (that is, keeping only 30% of the original prompt).

Benchmark Results

Evaluation Set	Focus Area	Accuracy Impact
AIME	Competition Math	0% decrease
GPQA Diamond	Expert Science QA	0% decrease
GDPval-AA	Professional Tasks	0% decrease
CommonsenseQA	Commonsense Reasoning	0% decrease
GSM8K	Grade-school Math	0% decrease
ARC-Challenge	Grade-school Science	2% decrease

In these benchmarks, accuracy was unchanged on five of the six sets, and ARC-Challenge dropped by 2%. This suggests that, on hard reasoning tasks like these, much of the prompt can be stripped away without affecting the model's ability to reach the correct conclusion.
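
To make the arithmetic behind a 70% reduction concrete, the sketch below estimates input-token cost before and after compression. The function name and the per-token price are hypothetical, chosen only for illustration; the source gives no pricing, and the estimate ignores whatever the compression call itself costs.

```python
def input_cost_savings(
    input_tokens: int,
    target_ratio: float,
    usd_per_million_tokens: float,
) -> dict[str, float]:
    """Estimate input cost before and after keeping `target_ratio` of the tokens."""
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")

    # Tokens that survive compression (0.3 means roughly 30% are kept).
    kept_tokens = round(input_tokens * target_ratio)

    cost_before = input_tokens / 1_000_000 * usd_per_million_tokens
    cost_after = kept_tokens / 1_000_000 * usd_per_million_tokens

    return {
        "kept_tokens": kept_tokens,
        "usd_before": cost_before,
        "usd_after": cost_after,
        "usd_saved": cost_before - cost_after,
    }
```

For a 40,000-token prompt at an assumed price of $3.00 per million input tokens, `input_cost_savings(40_000, 0.3, 3.0)` keeps 12,000 tokens and lowers the input cost of that single call from about $0.12 to about $0.036.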

Production Use Cases

Adola is positioned as a "pre-model API" or a prompt gateway, meaning it can be inserted into existing workflows without requiring a change in the model provider. Key application areas include:

1. Agent Traces

In multi-step agentic workflows, tool transcripts can become incredibly long. Adola can trim these transcripts before the next planning step, reducing the cost of subsequent turns in the conversation.

2. RAG Retrieval

Retrieval systems often over-retrieve chunks to ensure the answer is present. Rose 1 shrinks these over-retrieved chunks while keeping the answer-bearing spans, optimizing the context window.

3. Support Copilots

For customer support bots, compressing ticket history, policy documents, and account context allows the model to handle more history and documentation without hitting token limits or increasing latency.
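
As a rough illustration of the "pre-model" placement in an agent loop, the sketch below compresses only long tool transcripts before the next planning step and leaves every other message untouched. Everything here is an assumption made for illustration: the message shape (`role`/`content` dictionaries), the `min_chars` threshold, and the `compress` callable, which could wrap a call such as the `client.compress` example in the Implementation section.

```python
from typing import Callable

Message = dict[str, str]
# Hypothetical signature: (text, query) -> compressed text.
Compressor = Callable[[str, str], str]


def trim_tool_transcripts(
    messages: list[Message],
    query: str,
    compress: Compressor,
    min_chars: int = 2000,
) -> list[Message]:
    """Return a copy of `messages` with long tool outputs compressed."""
    trimmed: list[Message] = []
    for message in messages:
        is_long_tool_output = (
            message["role"] == "tool" and len(message["content"]) >= min_chars
        )
        if is_long_tool_output:
            # Copy the message so the caller's history is not mutated.
            trimmed.append({**message, "content": compress(message["content"], query)})
        else:
            # User, system, and short tool messages pass through verbatim.
            trimmed.append(message)
    return trimmed
```

Passing the compressor in as a callable keeps the loop independent of any particular SDK, and the same pattern would apply to over-retrieved RAG chunks or ticket history by changing which messages qualify.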

Implementation

Adola offers SDKs for Python, JavaScript, TypeScript, Go, and Rust. A typical integration sends the raw context and the user query to the Adola API along with a target compression ratio.

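from adola import Adola

client = Adola(api_key="rose_...")

# Read the retrieved context; the with-block closes the file afterwards.
with open("retrieved_context.txt", encoding="utf-8") as f:
    context = f.read()

result = client.compress(
    input=context,
    query="Which incident caused latency?",
    compression={"target_ratio": 0.3},  # presumably the kept fraction: about 30% of the input
    include_spans=False,
)

# The compressed prompt, ready to pass to the model call.
compressed = result["output"]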
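
When a compression step sits in front of every model call, it can help to guard it. The sketch below is one possible wrapper, not part of the Adola SDK: it falls back to the raw context if the compressor raises or returns nothing, and it reports how much text was kept, using whitespace-separated words as a rough proxy for tokens. The name `compress_or_fallback` and the `(text, query) -> text` callable signature are hypothetical.

```python
from typing import Callable


def compress_or_fallback(
    compress: Callable[[str, str], str],
    context: str,
    query: str,
) -> tuple[str, float]:
    """Return (prompt_text, kept_ratio); kept_ratio is 1.0 when falling back."""
    try:
        compressed = compress(context, query)
    except Exception:
        # Broad catch on purpose in this sketch: any failure means "send the raw context".
        return context, 1.0

    if not compressed.strip():
        # An empty result would leave the model with no context, so fall back.
        return context, 1.0

    # Word counts only approximate token counts; use a real tokenizer for billing.
    original_words = max(1, len(context.split()))
    return compressed, len(compressed.split()) / original_words
```

Logging the returned ratio next to the requested target is a cheap way to notice when the compressor is keeping far more or far less than expected.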

Community Perspectives

While the initial reception highlights the potential for cost savings, some users have raised questions regarding the flexibility of the compression strategy. One user noted:

"Can I choose a strategy for token reduction, based on what I'm optimizing for? I might be ok with a quality drop for a great cost savings, for example."

The question points to interest in more granular control over the trade-off between accuracy and aggressive cost reduction: a setting that would let developers tune the compression level to how critical the task is.
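
The `target_ratio` parameter in the Implementation example suggests one way a caller might approximate this on their own side. The sketch below maps an optimization goal to a ratio. The preset names and every value except 0.3 (the setting used in the benchmarks above) are illustrative assumptions, and the source reports no accuracy figures for other ratios.

```python
# Hypothetical presets: only 0.3 comes from the article; the rest are illustrative.
TARGET_RATIO_PRESETS: dict[str, float] = {
    "quality": 0.5,   # keep more context, save less
    "balanced": 0.3,  # the ratio used in the published benchmarks
    "cost": 0.15,     # keep less context, accept a possible quality drop
}


def compression_options(goal: str) -> dict[str, float]:
    """Build a `compression` argument for the chosen optimization goal."""
    try:
        return {"target_ratio": TARGET_RATIO_PRESETS[goal]}
    except KeyError:
        valid = ", ".join(sorted(TARGET_RATIO_PRESETS))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {valid}") from None
```

A caller could then pass `compression=compression_options("cost")` for low-stakes tasks and measure the quality impact on their own evaluation set before relying on it.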
