Beyond the Basics: Making LLM Token Streams Resumable and Multi-Device
For many developers, the transition from synchronous chatbots to asynchronous AI agents (tools that run in the background while a user works) seems like a simple transport change. The common advice is to use Server-Sent Events (SSE) with the Last-Event-ID header to create a durable stream. On paper, it sounds easy. In practice, implementing this at scale introduces significant architectural friction.
When moving beyond a simple demo to a production-grade agentic experience, three core requirements emerge: resumable streams, reliable cancellations, and multi-device synchronization. While SSE can technically handle these, the implementation details reveal a gap between "doable" and "easy."
The Overhead of Token Streaming
Before diving into the architectural challenges, it is important to understand the nature of LLM responses. Whether you are using the Vercel AI SDK, OpenAI, or Anthropic, the data returned is not just a stream of text. Each "token" is wrapped in significant metadata.
For instance, a single event from the Anthropic API might contain 125 characters of JSON metadata just to deliver 5 characters of actual text delta. This metadata overhead becomes a critical bottleneck the moment you decide to make your streams durable by storing them in a database.
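To make the ratio concrete, here is a minimal sketch that measures the envelope around one text delta. The event shape is modeled on Anthropic's streaming `content_block_delta` events, but the literal below is illustrative rather than captured from a real response, and the `overhead` helper is a hypothetical name. The exact character counts will differ between providers and SDK layers.

```python
import json

# Illustrative event, modeled on a streaming `content_block_delta` event.
# This literal is an assumption for the example, not captured API output.
event = {
    "type": "content_block_delta",
    "index": 0,
    "delta": {"type": "text_delta", "text": "Hello"},
}


def overhead(event: dict) -> tuple[int, int]:
    """Return (payload_chars, envelope_chars) for one streamed event."""
    payload = len(event["delta"]["text"])
    # Compact separators give the smallest possible JSON encoding.
    total = len(json.dumps(event, separators=(",", ":")))
    return payload, total - payload


payload_chars, envelope_chars = overhead(event)
print(f"{payload_chars} text chars carried by {envelope_chars} chars of JSON envelope")
```

Persisting each event verbatim means storing the envelope as well as the text, which is why the overhead matters once tokens are written to a database.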
The Challenge of Resumable Streams
To make a stream resumable, the client must be able to reconnect after a drop and request the missing tokens. The SSE specification provides the Last-Event-ID header for this purpose. If every event has a unique ID (e.g., response_id:token_index), the client can tell the server exactly where it left off.
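As a rough sketch of that ID scheme, the helpers below frame a token as an SSE event and parse the value a reconnecting client sends back in Last-Event-ID. The function names and the `response_id:token_index` layout are assumptions taken from the example above, not part of any particular framework.

```python
from typing import Optional, Tuple


def format_sse_event(response_id: str, token_index: int, data: str) -> str:
    """Frame a payload as one SSE event carrying a resumable id."""
    # SSE requires every line of a multi-line payload to get its own data: field.
    data_lines = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"id: {response_id}:{token_index}\n{data_lines}\n"


def parse_last_event_id(header: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split a Last-Event-ID value into (response_id, token_index)."""
    if not header:
        return None
    # Split on the last colon so a response_id may itself contain colons.
    response_id, sep, index = header.rpartition(":")
    if not sep or not response_id or not index.isdecimal():
        return None
    return response_id, int(index)


print(format_sse_event("resp_1", 3, "hi"), end="")
print(parse_last_event_id("resp_1:3"))
```

On reconnect, the server would resume from the token after the parsed index. A malformed or missing header returns `None`, which a server could treat as "start from the beginning".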
However, in a modern, horizontally scalable architecture with stateless server replicas, this introduces a massive write amplification problem. Because any replica might handle the reconnection request, every single token must be written to a shared database or cache in real time, as it is generated.
This creates a paradoxical situation: you are performing a database write for every few characters of text, often for requests that will never actually drop. Once the LLM finishes the response, these individual token records become useless and must be cleaned up in favor of the final "full response" record. The result is a high-cost infrastructure burden for a feature that only benefits a small percentage of users.
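The write pattern described above can be sketched with an in-memory stand-in for the shared store. `TokenStore` and its methods are hypothetical names, and a real deployment would replace the dictionaries with a database or cache. The `writes` counter exists only to show how many store operations one response costs.

```python
from collections import defaultdict


class TokenStore:
    """In-memory stand-in for the shared database or cache (hypothetical)."""

    def __init__(self) -> None:
        self.tokens: dict[str, list[str]] = defaultdict(list)
        self.final: dict[str, str] = {}
        self.writes = 0  # counts store operations, to show the amplification

    def append(self, response_id: str, token: str) -> int:
        """One write per token; returns the index assigned to the token."""
        self.tokens[response_id].append(token)
        self.writes += 1
        return len(self.tokens[response_id]) - 1

    def replay_after(self, response_id: str, last_index: int) -> list[str]:
        """Tokens a reconnecting client has not yet seen."""
        return self.tokens.get(response_id, [])[last_index + 1:]

    def finalize(self, response_id: str) -> str:
        """Compact per-token records into one full-response record."""
        text = "".join(self.tokens.pop(response_id, []))
        self.final[response_id] = text
        self.writes += 1
        return text


store = TokenStore()
for token in ["Hel", "lo", " world"]:
    store.append("r1", token)
missed = store.replay_after("r1", 0)  # client last saw index 0
full_text = store.finalize("r1")
print(missed, repr(full_text), store.writes)
```

Every token costs a write whether or not `replay_after` is ever called, and `finalize` adds one more write plus a cleanup of the per-token records.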
Handling Cancellations in a Durable World
In a basic SSE setup, a dropped connection often signals the server to cancel the LLM request. But once you implement resumable streams, a dropped connection no longer means the user wants to stop; it might just mean they entered a tunnel or refreshed their browser.
Cancellations now require a dedicated out-of-band mechanism. You must implement a separate endpoint (e.g., POST /cancel/{response_id}) that writes a "cancel marker" into the shared database. The server replica handling the LLM inference must then constantly poll this shared store between generating tokens to check if it should abort the upstream call. This adds latency and complexity to the inference loop.
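A minimal sketch of that inference loop follows, assuming a hypothetical `CancelStore` in place of the shared database and a plain Python generator in place of the upstream LLM call. The `check_every` parameter is an illustrative knob for trading cancellation latency against store reads.

```python
from typing import Iterable, Iterator


class CancelStore:
    """In-memory stand-in for the shared store holding cancel markers."""

    def __init__(self) -> None:
        self._cancelled: set[str] = set()

    def mark_cancelled(self, response_id: str) -> None:
        # What a POST /cancel/{response_id} handler would do.
        self._cancelled.add(response_id)

    def is_cancelled(self, response_id: str) -> bool:
        return response_id in self._cancelled


def stream_with_cancel(
    response_id: str,
    upstream: Iterable[str],
    store: CancelStore,
    check_every: int = 1,
) -> Iterator[str]:
    """Yield upstream tokens, polling the store every `check_every` tokens."""
    for i, token in enumerate(upstream):
        if i % check_every == 0 and store.is_cancelled(response_id):
            # Abort the upstream call if it exposes close(), as generators do.
            close = getattr(upstream, "close", None)
            if close is not None:
                close()
            return
        yield token


def fake_llm() -> Iterator[str]:
    yield from ["a", "b", "c", "d"]


store = CancelStore()
received = []
for tok in stream_with_cancel("r1", fake_llm(), store):
    received.append(tok)
    if tok == "b":
        store.mark_cancelled("r1")  # simulates the cancel endpoint firing
print(received)
```

With `check_every=1` the loop performs one store read per token. Raising it cuts the read load but lets more tokens through after the user has asked to stop.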
The Multi-Device Synchronization Gap
Supporting multiple devices introduces two distinct problems:
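- State Recovery: A second device needs the conversation so far, including any partially streamed response. This part is largely solved: since tokens are already being stored in the database for resumability, the device can fetch the history and pick up the current stream.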
- Real-time Notification: How does Device B know that Device A has sent a prompt and a response is currently streaming?
Without a persistent bidirectional connection, Device B must rely on polling. Polling is a lose-lose trade-off: poll infrequently and you suffer high latency; poll frequently and you hammer your servers with unnecessary traffic.
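The trade-off can be put in rough numbers. The sketch below assumes a new response is equally likely to start at any moment within a polling interval, so the average wait before a device notices is half the interval. The `polling_cost` helper and the device count are made up for illustration.

```python
def polling_cost(devices: int, interval_s: float) -> tuple[float, float]:
    """Return (mean notification delay in seconds, requests per second)."""
    # An event arriving uniformly within an interval waits half of it on average.
    mean_delay = interval_s / 2
    # Each device sends one request per interval.
    requests_per_s = devices / interval_s
    return mean_delay, requests_per_s


for interval in (30.0, 1.0):
    delay, rps = polling_cost(devices=10_000, interval_s=interval)
    print(f"interval={interval}s  mean delay={delay:.1f}s  load={rps:.0f} req/s")
```

Shortening the interval improves the delay linearly and increases the load by the same factor, and most of those requests return nothing new.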
Alternative Architectures: Pub/Sub
Given these frictions, some argue that HTTP is fundamentally the wrong transport for async agentic applications. A pub/sub (Publisher/Subscriber) pattern offers a more elegant solution by decoupling the connection lifetime from the agent lifecycle.
In a pub/sub model:
- Persistence: The channel exists independently of the client. Tokens are published to the channel and remain available for the client to "rewind" and collect upon reconnection.
- Multi-device: Multiple clients can subscribe to the same channel and receive the same stream in real time without polling.
- Efficiency: The transport layer can handle the compaction of token deltas into full responses, reducing the number of messages the client needs to process when catching up.
- Routing: Cancellations and interrupts can be published as messages to the channel, which the server process consumes immediately, eliminating the need to poll a database for cancel markers.
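The properties above can be illustrated with a toy, single-process channel. This is a sketch of the pattern only, not any specific pub/sub product: the `Channel` class, the `serial` field, and the `rewind_from` parameter are all hypothetical, and a real service would add network delivery, authentication, and retention limits.

```python
from typing import Callable, Optional


class Channel:
    """In-memory stand-in for a pub/sub channel that keeps history."""

    def __init__(self) -> None:
        self.history: list[dict] = []
        self.subscribers: list[Callable[[dict], None]] = []

    def publish(self, message: dict) -> int:
        """Store the message, fan it out to subscribers, return its serial."""
        serial = len(self.history)
        stored = {**message, "serial": serial}
        self.history.append(stored)
        for callback in list(self.subscribers):
            callback(stored)
        return serial

    def subscribe(
        self,
        callback: Callable[[dict], None],
        rewind_from: Optional[int] = None,
    ) -> None:
        """Attach a subscriber, first replaying history from `rewind_from`."""
        if rewind_from is not None:
            for message in self.history[rewind_from:]:
                callback(message)
        self.subscribers.append(callback)


channel = Channel()
device_a: list[dict] = []
device_b: list[dict] = []

channel.subscribe(device_a.append)                 # Device A is live from the start
channel.publish({"type": "token", "text": "Hel"})
channel.publish({"type": "token", "text": "lo"})
channel.subscribe(device_b.append, rewind_from=0)  # Device B joins late and rewinds
channel.publish({"type": "cancel"})                # a cancel is just another message

print([m["serial"] for m in device_b])
```

Both devices end up with the same ordered messages, and the agent process could subscribe to the same channel to receive the cancel message without polling a database.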
Community Perspectives
While the challenges of SSE are significant, some developers suggest that the "difficulty" is a matter of implementation. For example, one contributor suggested that a Redis Streams-backed SSE implementation could solve many of the state and durability challenges without abandoning the SSE protocol entirely. Others pointed toward emerging tools like Durable Streams or specialized APIs designed to handle token events and socket closures automatically.
Ultimately, the choice depends on the scale of the application. For simple bots, SSE is sufficient. But for complex, multi-device agents that require high reliability and low latency, moving the state management from the database to the transport layer is a powerful architectural shift.