Beyond the Prompt: The Rise of Native Interaction Models
For years, the primary interface for interacting with Large Language Models (LLMs) has been the "turn-based" exchange: a user provides a prompt, and the model processes it and generates a response. While effective for static tasks, this pattern creates a significant collaboration bottleneck. In real-world work, humans don't collaborate via discrete packets of information; we rely on copresence, contemporality, and simultaneity: interjecting, listening, and reacting in real time.
Thinking Machines has introduced a research preview of Interaction Models, a new approach that treats interactivity not as an external "harness" or scaffolding, but as a native capability of the model itself. By moving interaction into the model's core, the goal is to allow AI to collaborate with humans as naturally as people collaborate with each other.
The Collaboration Bottleneck
Most current frontier models are optimized for autonomy—the ability to complete long tasks without human intervention. However, as noted in recent model cards, when these models are used in "hands-on-keyboard" synchronous patterns, they often feel too slow or unresponsive. This is because the interface has no room for the human to remain in the loop effectively.
Today's models experience reality in a single thread. They wait for a user to finish typing or speaking before they begin perceiving, and their own perception freezes while they are generating a response. This narrow channel limits the amount of practical knowledge and intuitive judgment (what James C. Scott calls mētis) that can be transmitted between the human and the AI.
A Native Approach to Interactivity
Rather than bolting on Voice Activity Detection (VAD) or other external components to emulate interruptions, Thinking Machines trains its interaction model from scratch to handle these dynamics natively. This shift enables several qualitatively new capabilities:
- Seamless Dialog Management: The model implicitly tracks whether a speaker is thinking, yielding, or self-correcting without a separate management component.
- Simultaneous Speech: The ability for the user and model to speak concurrently, essential for tasks like live translation.
- Verbal and Visual Interjections: The model can jump in based on context or visual cues, rather than waiting for a silence trigger.
- Time-Awareness: A direct sense of elapsed time, allowing the model to track durations or initiate speech at specific intervals.
- Concurrent Tool Use: The model can perform searches or generate UI while simultaneously listening and speaking to the user.
Technical Architecture: Micro-Turns and Dual-Model Design
To achieve this, Thinking Machines employs a time-aligned micro-turn design. Instead of alternating long turns, the model processes continuous input and output streams split into 200 ms "micro-turns."
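To make the micro-turn idea concrete, here is a minimal sketch of slicing a continuous stream into 200 ms chunks and pairing input and output chunks on a shared timeline. Only the 200 ms figure comes from the description above; the 16 kHz sample rate and the helper names `micro_turns` and `time_aligned` are assumptions for illustration, not part of the Thinking Machines design.

```python
from typing import Iterator, List, Sequence, Tuple

MICRO_TURN_MS = 200       # micro-turn length described in the article
SAMPLE_RATE_HZ = 16_000   # assumption: a common speech sample rate


def micro_turns(samples: Sequence[float],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                turn_ms: int = MICRO_TURN_MS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    chunk = sample_rate_hz * turn_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), chunk):
        yield samples[start:start + chunk]


def time_aligned(inputs: List[Sequence[float]],
                 outputs: List[Sequence[float]]
                 ) -> List[Tuple[int, Sequence[float], Sequence[float]]]:
    """Pair the i-th input chunk with the i-th output chunk and its start time in ms."""
    return [(i * MICRO_TURN_MS, inp, out)
            for i, (inp, out) in enumerate(zip(inputs, outputs))]


one_second = [0.0] * SAMPLE_RATE_HZ
chunks = list(micro_turns(one_second))
timeline = time_aligned(chunks, chunks)
```

Under these assumptions, one second of audio becomes five micro-turns of 3,200 samples each, so input and output are never more than one micro-turn out of step.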
The Interaction vs. Background Split
The system is architected around two distinct but coordinating models:
- The Interaction Model: A real-time, multimodal model that maintains a constant two-way exchange with the user. It handles the immediate "presence" of the conversation.
- The Background Model: An asynchronous model that handles sustained reasoning, complex tool use, and longer-horizon work.
When a task requires deeper reasoning, the interaction model delegates the work to the background model. The results stream back and are interleaved into the conversation naturally, avoiding the abrupt context switches common in current AI agents.
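The delegation pattern can be sketched with `asyncio`: a fast loop keeps ticking through micro-turns while a slower task runs concurrently, and its result is folded into the conversation on the first tick after it completes. Everything here (`background_model`, `interaction_loop`, the shortened timings) is a hypothetical stand-in for illustration, not the actual system.

```python
import asyncio
from typing import List, Optional

MICRO_TURN_S = 0.02  # shortened from 200 ms so the example runs quickly


async def background_model(task: str) -> str:
    """Hypothetical stand-in for slow, asynchronous reasoning."""
    await asyncio.sleep(0.05)
    return f"result for {task!r}"


async def interaction_loop(num_micro_turns: int, delegate_at: int) -> List[str]:
    """Keep responding every micro-turn while delegated work runs in parallel."""
    transcript: List[str] = []
    pending: Optional[asyncio.Task] = None
    for turn in range(num_micro_turns):
        if turn == delegate_at:
            # Hand off the heavy work without blocking the conversation.
            pending = asyncio.create_task(background_model("deep question"))
            transcript.append(f"turn {turn}: delegated")
        elif pending is not None and pending.done():
            # Interleave the finished result into the ongoing exchange.
            transcript.append(f"turn {turn}: {pending.result()}")
            pending = None
        else:
            transcript.append(f"turn {turn}: listening")
        await asyncio.sleep(MICRO_TURN_S)
    return transcript
```

Calling `asyncio.run(interaction_loop(10, 1))` produces a transcript in which the loop keeps "listening" for a few micro-turns after delegating, then reports the result once the background task is done.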
Engineering Optimizations
Real-time performance at this scale (a 276B-parameter MoE with 12B active parameters, roughly 4% of the total) requires significant infrastructure optimization:
- Encoder-free Early Fusion: The model avoids large standalone encoders, instead using lightweight embedding layers for audio (dMel) and image patches (hMLP), co-trained with the transformer.
- Streaming Sessions: To avoid the overhead of frequent small prefills, the team implemented streaming sessions where chunks are appended to a persistent sequence in GPU memory, a feature they have upstreamed to SGLang.
- Batch-Invariant Kernels: To keep training stable and make debugging tractable, they use custom communication kernels (NVLS) to achieve bitwise alignment between different parallelism strategies.
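Why bitwise alignment across parallelism strategies takes deliberate effort: floating-point addition is not associative, so summing the same values in a different grouping can change the result. The sketch below shows this in plain Python with hypothetical helpers (`reduce_sequential`, `reduce_sharded`); it illustrates the underlying numerical issue only and says nothing about how the NVLS kernels are implemented.

```python
from typing import List


def reduce_sequential(xs: List[float]) -> float:
    """Sum left to right, as a single worker would."""
    total = 0.0
    for x in xs:
        total += x
    return total


def reduce_sharded(xs: List[float], num_shards: int) -> float:
    """Sum each shard separately, then sum the partial results."""
    shard_size = -(-len(xs) // num_shards)  # ceiling division
    partials = [reduce_sequential(xs[i:i + shard_size])
                for i in range(0, len(xs), shard_size)]
    return reduce_sequential(partials)


# Large and small magnitudes mixed together expose the rounding differences.
values = [1e16, 1.0, -1e16, 1.0]

single = reduce_sequential(values)   # one grouping of the additions
sharded = reduce_sharded(values, 2)  # a different grouping of the same additions
print(single == sharded)             # the two groupings disagree
```

Here the exact sum is 2.0, yet neither grouping recovers it and the two disagree with each other. Fixing the reduction order regardless of how work is split is one way to make different layouts produce identical bits.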
Benchmarking Interactivity
Traditional benchmarks fail to capture the nuance of real-time collaboration. Thinking Machines introduced new internal benchmarks to measure proactive capabilities:
- TimeSpeak: Testing the ability to initiate speech at user-specified times (e.g., "remind me to breathe every 4 seconds").
- CueSpeak: Testing the ability to respond to specific verbal cues in real time, even while the user is still speaking.
- Visual Proactivity: Using benchmarks like RepCount-A and ProactiveVideoQA to test if the model can react to visual changes (e.g., counting pushups) without an audio prompt.
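As a rough illustration of what a TimeSpeak-style check could measure, the sketch below compares the times at which a model should have spoken with the times it actually did. The metric (mean absolute timing error) and the function names are assumptions made for this example; the benchmark's real scoring method is not described here.

```python
from typing import List


def expected_times(interval_s: float, duration_s: float) -> List[float]:
    """Times (in seconds) at which speech is due for an 'every N seconds' request."""
    count = int(duration_s // interval_s)
    return [interval_s * k for k in range(1, count + 1)]


def mean_abs_timing_error(expected: List[float], actual: List[float]) -> float:
    """Average gap, in seconds, between each due time and the observed onset."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)


# "Remind me to breathe every 4 seconds", observed over a 12-second window.
due = expected_times(4.0, 12.0)
observed = [4.5, 8.0, 11.0]  # hypothetical onsets, made up for illustration
error = mean_abs_timing_error(due, observed)
```

A real evaluation would also need to handle missed or extra utterances, which this sketch simply rejects with an error.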
According to Thinking Machines' reported results, TML-Interaction-Small outperforms existing frontier models in interaction quality (FD-bench) while remaining competitive on general intelligence benchmarks.
Critical Perspectives and Future Directions
While the technical achievement is significant, the community has raised important questions regarding the utility and UX of such systems. Some observers note that while the demos are impressive, the actual use cases—such as "auto-slouch-detectors"—can feel dystopian or contrived.
"The UX is also an issue - the model interrupting the user... is jarring and makes one lose their flow. A human, when participating in this (rare) 'invited interruption,' has the ability to speak 'under' the main speaker and I feel it's generally timed with a lot of nuance."
There is also a debate regarding the flexibility of moving the interaction harness into the model. Some argue that an external harness allows for faster iteration and easier UI customization, whereas a native model approach might be less flexible for specific, evolving user needs.
Despite these critiques, the move toward native interaction represents a fundamental shift in how AI is integrated into human workflows—moving from a tool we prompt to a collaborator that shares our presence.