Needle: Distilling Tool Calling into a 26M Parameter Model

The prevailing trend in Large Language Models (LLMs) has been a race toward greater scale. However, for specific agentic tasks, particularly tool calling, massive models are often overkill. Tool calling is fundamentally a task of retrieval and assembly: matching a user query to a tool name, extracting the necessary arguments, and emitting a structured JSON response. On this view, it is not a complex reasoning task.
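To make the "retrieval and assembly" framing concrete, here is a minimal sketch of a single-shot tool call as data, together with a check that an emitted call is well-formed. The tool names, the schema layout, and the `validate_call` helper are illustrative assumptions, not Needle's actual prompt or output format.

```python
import json

# Hypothetical tool list, as it might be supplied alongside the user query.
TOOLS = {
    "set_timer": {"required": ["duration_minutes"]},
    "get_weather": {"required": ["city"]},
}

def validate_call(raw: str, tools: dict) -> dict:
    """Parse a model's JSON output and check it against the tool list."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in tools:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    missing = [key for key in tools[name]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return call

# Assumed output for the query "set a timer for ten minutes".
emitted = '{"name": "set_timer", "arguments": {"duration_minutes": 10}}'
call = validate_call(emitted, TOOLS)
```

Every field in the output is either copied from the tool list (the name) or lifted from the query (the argument value), which is the sense in which the task resembles retrieval.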

Recognizing this, the team at Cactus has released Needle, an experimental 26M parameter model designed specifically for single-shot function calling. Built for consumer devices like phones, watches, and glasses, Needle demonstrates that highly specialized, tiny models can outperform much larger counterparts in narrow, structured tasks.

The Architecture: "Attention Is Actually All You Need"

The most striking aspect of Needle's design is the complete removal of Feed-Forward Network (FFN) layers. In a standard Transformer, the FFNs are commonly regarded as the place where the model stores "factual knowledge." The Cactus team observed that when the model is given external structured knowledge in the prompt (such as a list of available tools), the FFN parameters become redundant.
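For a sense of how much is saved, the sketch below counts weights in one textbook Transformer layer: four d×d attention projections (query, key, value, output) and a two-matrix FFN with a 4x hidden expansion, ignoring biases and normalization. The `d_model` value is an arbitrary assumption for illustration; Needle's real dimensions and the cost of its gating are not given here.

```python
def layer_params(d_model: int, ffn_mult: int = 4) -> dict:
    """Weight counts for one classic Transformer layer (no biases or norms)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model   # up- and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }

# d_model=512 is an assumed width, chosen only to make the numbers concrete.
counts = layer_params(d_model=512)
print(counts)
```

Under these assumptions the FFN accounts for two thirds of each layer's weights, whatever the width, so dropping it is the largest single saving available in a standard block.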

Key Technical Specifications:

Parameter Count: 26 Million.
Architecture: Simple Attention Networks (Attention and gating only; no MLPs).
Performance: 6,000 tok/s prefill and 1,200 tok/s decode on consumer hardware.
Training: Pretrained on 200B tokens (27 hours on 16 TPU v6e) and post-trained on 2B tokens of synthesized function-calling data (45 minutes).
Data Source: Training data was synthesized via Gemini across 15 tool categories, including navigation, smart home, and messaging.
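The throughput figures listed above can be turned into a rough latency estimate for a single call. The sketch below assumes prefill and decode simply add, and the prompt and output token counts are made-up inputs; it ignores tokenization, model loading, and any device-specific overhead.

```python
PREFILL_TOK_PER_S = 6000  # reported prefill throughput
DECODE_TOK_PER_S = 1200   # reported decode throughput

def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Approximate seconds to read the prompt and emit the tool call."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return prefill_s + decode_s

# Assumed sizes: a 600-token prompt (query plus tool list), 60-token JSON reply.
latency = estimate_latency_s(prompt_tokens=600, output_tokens=60)
print(f"{latency:.2f} s")  # prints "0.15 s"
```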

By stripping away the FFNs, Needle operates as a lean retrieval engine. This finding suggests a broader principle: for any task where the model has access to external structured knowledge (such as RAG or tool use), the model does not need to memorize facts in its weights if those facts are provided in the input.

Benchmarks and Performance

Despite its size, Needle outperforms several larger models in single-shot function calling, including:

FunctionGemma-270M
Qwen-0.6B
Granite-350M
LFM2.5-350M

It is important to note that while Needle excels at the specific task of tool calling, it lacks the conversational capacity and general reasoning abilities of these larger models. It is designed to be a specialized component of a larger system, not a general-purpose chatbot.

Real-World Applications and Use Cases

The community has identified several high-impact applications for a model of this size:

1. On-Device Voice Assistants

Because it can run locally on low-power hardware, Needle is an ideal candidate for a "Siri-like" core. It can take transcribed text and a list of available tools to execute commands like "set a timer" or "check the weather" without sending data to the cloud.
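A minimal sketch of the assistant's back half, assuming the model has already emitted a JSON tool call: the call is parsed and dispatched to a local handler. The handler functions, the argument names, and the JSON layout are all hypothetical.

```python
import json
from typing import Callable, Dict

def set_timer(minutes: int) -> str:
    # Placeholder: a real handler would talk to the device's timer service.
    return f"Timer set for {minutes} minutes"

def check_weather(city: str) -> str:
    # Placeholder: a real handler would query a weather source.
    return f"Looking up weather for {city}"

# Maps tool names (as they appear in the prompt) to local functions.
HANDLERS: Dict[str, Callable[..., str]] = {
    "set_timer": set_timer,
    "check_weather": check_weather,
}

def dispatch(model_output: str) -> str:
    """Run the handler named in the model's JSON output."""
    call = json.loads(model_output)
    handler = HANDLERS.get(call.get("name"))
    if handler is None:
        return "Sorry, I can't do that."
    return handler(**call.get("arguments", {}))

# Assumed model output for the transcript "set a timer for five minutes".
reply = dispatch('{"name": "set_timer", "arguments": {"minutes": 5}}')
```

Nothing in this loop needs a network connection, which is what makes the fully local design plausible.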

2. Smart Home Integration

Users have noted its potential for platforms like Home Assistant, where a tiny model can act as a low-latency bridge between natural language commands ("Computer! Lights!") and device toggles.

3. Natural Language CLIs

There is potential to integrate Needle into command-line tools, allowing users to specify arguments in natural language which the model then parses into valid flags and commands.
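One way this could look, sketched with Python's standard `argparse`: the model's extracted arguments are converted into an argv list, and the CLI's own parser then validates them. The `resize` tool, its flags, and the shape of the model output are invented for illustration.

```python
import argparse
from typing import List

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI; the flags are not taken from any real tool.
    parser = argparse.ArgumentParser(prog="resize")
    parser.add_argument("--input", required=True)
    parser.add_argument("--width", type=int, required=True)
    parser.add_argument("--grayscale", action="store_true")
    return parser

def to_argv(arguments: dict) -> List[str]:
    """Turn extracted arguments into command-line flags."""
    argv: List[str] = []
    for key, value in arguments.items():
        flag = f"--{key}"
        if value is True:
            argv.append(flag)               # boolean switch, no value
        elif value is False or value is None:
            continue                        # switch left off
        else:
            argv.extend([flag, str(value)])
    return argv

# Assumed model output for "shrink cat.png to 640 wide in black and white".
arguments = {"input": "cat.png", "width": 640, "grayscale": True}
argv = to_argv(arguments)
args = build_parser().parse_args(argv)
```

Routing the result through the existing parser means a malformed or hallucinated flag is rejected by the same code path that guards hand-typed commands.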

4. Agentic Orchestration

Some developers suggest using Needle as a "first pass" in a complex agent system. Needle can handle the initial tool selection and argument extraction, handing the resulting data off to a larger, more capable model for final synthesis or reasoning.
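The handoff described above might be wired together roughly as follows. Both models are passed in as plain callables, and the stubs below return canned values so the sketch runs on its own; none of the names reflect a real Needle or Cactus API.

```python
import json
from typing import Any, Callable, Dict

def run_pipeline(
    query: str,
    select_tool: Callable[[str], str],
    tools: Dict[str, Callable[..., Any]],
    synthesize: Callable[[str, Any], str],
) -> str:
    """Small model picks the tool, the tool runs, a larger model answers."""
    call = json.loads(select_tool(query))               # first pass
    result = tools[call["name"]](**call["arguments"])   # execute the tool
    return synthesize(query, result)                    # final synthesis

# Stand-ins with canned outputs, for illustration only.
def fake_small_model(query: str) -> str:
    return '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 4}  # made-up value, not real data

def fake_large_model(query: str, result: dict) -> str:
    return f"It is {result['temp_c']} C in {result['city']}."

answer = run_pipeline(
    "Do I need a coat in Oslo?",
    fake_small_model,
    {"get_weather": get_weather},
    fake_large_model,
)
```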

Community Critique and Considerations

While the release has been met with excitement, several technical and legal points were raised during the discussion:

The "Distillation" Controversy: Some users pointed out that distilling Gemini (using a larger model to generate training data for a smaller one) may violate Google's Terms of Service, which prohibit using their services to develop competing models.

Handling Ambiguity: Questions remain about how the model handles complex, multi-step chains or ambiguous requests. One user noted that for a query like "in 1 hour set a timer for 1 hour," the model struggled to determine whether it should create a delayed timer or a two-hour timer.

State Management: Needle is optimized for single-shot calls, so whether it can track state across a multi-turn conversation remains an open question. As one commenter noted:

"I'm curious whether this generalizes to multi-turn tool calling where the model needs to track state across several calls, or if it breaks down there."

Conclusion

Needle represents a shift toward "right-sizing" AI. By making the case that single-shot tool calling is closer to a retrieval task than a reasoning task, Cactus has opened the door to efficient, private, and local agentic experiences on the smallest of devices. The project is MIT licensed, and the weights are available on Hugging Face for those looking to fine-tune the model for their own toolsets.
