Building an Agent that Tunes Its Own Cache

The challenge of optimizing Large Language Model (LLM) applications is often a game of trial and error. Developers typically set Time-to-Live (TTL) values and caching strategies based on intuition, then monitor logs to see if they work. However, the gap between manual configuration and real-world usage patterns is often wide, creating inefficiencies in both cost and latency.

In a recent project, the creator of BetterDB developed an agentic caching system for a RAG (Retrieval-Augmented Generation) application built over Valkey/Redis/Dragonfly documentation. The goal was to "dogfood" the project's own caching libraries by letting the agent monitor its own performance and suggest configuration changes in real time.

A Multi-Tier Caching Architecture

To maximize efficiency, the system employs a two-tier caching strategy designed to handle different types of user interactions:

1. The Exact-Match Tool Cache

This tier sits between the SDK and the tools. Every call is normalized and checked for an exact match. This is ideal for predefined questions, repeated copy-pasted queries, or users checking the same technical detail multiple times. If a hit occurs, the system returns the result immediately, bypassing the LLM entirely.
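As a rough illustration of the exact-match tier, the sketch below normalizes a tool call into a stable key and serves repeats from a store. It is not the project's implementation: the names normalize_call and ExactMatchCache, the normalization rules (lowercasing and collapsing whitespace), and the in-memory dict standing in for Valkey are all assumptions made for the example.

```python
import hashlib
import json


def normalize_call(tool_name, args):
    """Build a stable key for a tool call (hypothetical normalization rules)."""
    def clean(value):
        # Assumed rule: lowercase strings and collapse runs of whitespace.
        if isinstance(value, str):
            return " ".join(value.lower().split())
        return value

    canonical = json.dumps(
        {"tool": tool_name, "args": {k: clean(v) for k, v in args.items()}},
        sort_keys=True,  # argument order must not change the key
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory stand-in for the exact-match tier (a real one would use Valkey)."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, tool_name, args, tool_fn):
        key = normalize_call(tool_name, args)
        if key in self._store:
            return self._store[key], True   # hit: the tool is not invoked
        result = tool_fn(**args)            # miss: run the tool and remember it
        self._store[key] = result
        return result, False
```

With this sketch, "XADD  performance" and "xadd performance" map to the same key, so the second call is a hit and the tool runs only once.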

2. The Semantic Cache

Because humans rarely phrase questions identically, the system also uses a semantic cache. This tier embeds the prompt and performs a K-Nearest Neighbors (KNN) search via valkey-search. If the cosine distance between the new prompt and a cached prompt is small enough, the system streams the cached response.
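The distance check can be sketched in a few lines. The code below is a brute-force stand-in for the KNN search, not a use of the valkey-search API: the function names, the toy two-dimensional vectors, and the max_distance threshold are assumptions, and the article does not state which threshold the project uses.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_cached(query_vec, entries, max_distance):
    """Return (response, distance) for the closest entry, or (None, distance) on a miss.

    entries is a list of (embedding, cached_response) pairs. A real deployment
    would delegate this scan to a vector index rather than loop in Python.
    """
    best_response, best_dist = None, None
    for vec, response in entries:
        dist = cosine_distance(query_vec, vec)
        if best_dist is None or dist < best_dist:
            best_response, best_dist = response, dist
    if best_dist is not None and best_dist <= max_distance:
        return best_response, best_dist
    return None, best_dist
```

The threshold is the sensitive parameter here: too loose and unrelated questions share an answer, too tight and paraphrases miss.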

When a cache miss occurs in both tiers, the system records the prompt embedding, the model used, and the input/output tokens from the OpenAI usage report. This allows the system to track the exact dollar amount avoided through subsequent hits.
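To make the savings arithmetic concrete, here is a minimal sketch of how a recorded miss could be turned into a dollar figure for later hits. The record fields, the model name "example-model", and the per-token prices are hypothetical placeholders, not values from the project or from any real price list.

```python
# Hypothetical prices in USD per 1M tokens; real prices differ by model and date.
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


def call_cost(record):
    """Cost of the original LLM call, from the token counts recorded on the miss."""
    price = PRICES[record["model"]]
    total = (record["input_tokens"] * price["input"]
             + record["output_tokens"] * price["output"])
    return total / 1_000_000


def cost_avoided(records):
    """Each later hit on an entry avoids one call of the same recorded cost."""
    return sum(call_cost(r) * r["hits"] for r in records)
```

For instance, an entry recorded with 1,000 input and 500 output tokens costs 0.003 USD at these placeholder prices, so three subsequent hits avoid 0.009 USD.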

Closing the Loop: Self-Tuning and Monitoring

The true innovation in this project is the move from static configuration to an agentic loop. The system stores metadata in the Valkey/Redis instance, which is then analyzed by a monitoring process.

This monitoring loop operates as follows:

Analysis: The monitoring tool reads the cache metadata and analyzes usage patterns.
Suggestion: The system suggests improvements (such as TTL changes) via an MCP (Model Context Protocol) server.
Execution: In this demo environment, the agent is permitted to approve and apply its own suggestions. Because the libraries read configuration directly from the Valkey instance, changes take effect immediately without requiring a server restart.
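The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the config dict stands in for keys stored in Valkey, the key naming and the stats fields are invented, the "double the TTL when most misses were expirations" heuristic is hypothetical, and the MCP server that carries suggestions in the real system is omitted.

```python
# Stand-in for configuration keys stored in Valkey (hypothetical key naming).
config = {"ttl_seconds:search_docs": 300}


def current_ttl(tool):
    # Read on every call, so an applied suggestion takes effect without a restart.
    return config["ttl_seconds:" + tool]


def suggest_ttl(tool, stats):
    """Analysis + suggestion: propose a longer TTL if most misses hit expired keys."""
    misses = stats["misses"]
    if misses == 0:
        return None
    expired_share = stats["expired_misses"] / misses
    if expired_share > 0.5:  # assumed cutoff, chosen for the example
        return {
            "key": "ttl_seconds:" + tool,
            "new_value": current_ttl(tool) * 2,
            "reason": f"{expired_share:.0%} of misses were expired entries",
        }
    return None


def apply_suggestion(suggestion):
    """Execution: write the approved value back to the shared config store."""
    config[suggestion["key"]] = suggestion["new_value"]
```

Outside a demo, the approval step would normally sit with a human or a policy check rather than the agent itself.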

Lessons Learned: Config vs. Code

During testing, the developer observed a fascinating trend. Over three runs, the number of tool calls dropped from 15 to 13, and finally to 8. While the agent suggested several TTL changes, the developer noted a critical limitation: TTL is often the wrong point of control.

For example, a user might ask "How fast is XADD?" and "XADD performance." These are semantically identical but string-different. A TTL change cannot fix the fact that the second query misses the exact-match cache. The only real fix is an architectural change: moving those specific tools from the exact-match tier to the semantic cache checks.
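A small sketch shows why no TTL value helps here. The TtlExactCache class and its normalization (lowercasing and collapsing whitespace) are assumptions for the example, not the project's code; the point is only that the two phrasings produce different keys, so the lookup fails before expiry is ever consulted.

```python
import time


class TtlExactCache:
    """Toy exact-match cache with expiry (hypothetical, in-memory)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._store = {}

    @staticmethod
    def key(query):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(query.lower().split())

    def put(self, query, value, now=None):
        now = time.time() if now is None else now
        self._store[self.key(query)] = (value, now + self.ttl_seconds)

    def get(self, query, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(self.key(query))
        if entry is None or entry[1] <= now:
            return None  # unknown key or expired entry
        return entry[0]
```

Whether the TTL is one minute or one year, storing an answer under "How fast is XADD?" and looking up "XADD performance" returns nothing, because the keys never match.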

This realization highlights a key insight for LLM developers: not all optimizations can be solved by configuration. Some inefficiencies are inherent to the routing logic and require code changes rather than parameter tuning.

Future Directions

To further refine the system, the developer is looking into making the routing logic itself configurable. This would allow the agent to move tools between the exact-match and semantic tiers without needing a redeploy, enabling faster iteration loops and verification of optimization hypotheses.
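One way such configurable routing might look is sketched below. Everything here is hypothetical: the routing table, the tool names, the tier labels, and the lookup callables are invented for illustration, and the dict stands in for a routing key that the agent could update in Valkey.

```python
# Hypothetical routing table; in practice this would live in the shared config store.
routing = {"search_docs": "exact", "get_command_info": "exact"}


def tier_for(tool, default="exact"):
    # Looked up per call, so moving a tool between tiers needs no redeploy.
    return routing.get(tool, default)


def cached_lookup(tool, query, exact_lookup, semantic_lookup):
    """Dispatch to whichever tier the routing table currently assigns to the tool."""
    if tier_for(tool) == "semantic":
        return semantic_lookup(query)
    return exact_lookup(query)
```

Flipping routing["search_docs"] to "semantic" would then send the next call for that tool through the semantic tier, which makes it cheap to test whether a given tool benefits from the move.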
