← Back to Blogs
HN Story

Qwen3.7-Max: Pushing the Boundaries of Long-Horizon Autonomous Agents

May 22, 2026

Qwen3.7-Max: Pushing the Boundaries of Long-Horizon Autonomous Agents

The transition from LLMs as chatbots to LLMs as autonomous agents requires a fundamental shift in capability. While traditional models excel at short-turn interactions, true agency requires the ability to plan, execute, and self-correct over hundreds or thousands of steps without losing coherence.

With the introduction of Qwen3.7-Max, Alibaba is positioning its latest proprietary model as a foundation specifically for this "agent era." Designed to be a versatile backbone for coding agents, office automation, and complex engineering tasks, Qwen3.7-Max emphasizes sustained autonomous execution and the ability to generalize across different agent scaffolds.

A Foundation for the Agent Era

Qwen3.7-Max is engineered to operate as a reliable core for various agentic frameworks. Unlike models optimized for a single specific tool-use pattern, Qwen3.7-Max is designed for cross-scaffold generalization, meaning it performs consistently whether it is deployed via Claude Code, OpenClaw, Qwen Code, or custom internal frameworks.

Performance Benchmarks

The model demonstrates significant gains across several key agentic and reasoning domains:

  • Coding Agency: It shows strong performance on SWE-Pro (60.6) and Terminal Bench 2.0-Terminus (69.7), where it outperforms competitors like DS-V4-Pro Max. It remains competitive with Opus-4.6 Max on SWE-Verified (80.4).
  • General-Purpose Agency: The model excels in MCP (Model Context Protocol) integrations, scoring 60.8 on MCP-Mark and 76.4 on MCP-Atlas. It also shows high proficiency in office automation via SpreadSheetBench-v1 (87).
  • Hard Reasoning: Qwen3.7-Max achieves leading results on the most challenging reasoning benchmarks, including GPQA Diamond (92.4) and HMMT 2026 Feb (97.1).

Sustained Long-Horizon Execution

One of the most impressive claims regarding Qwen3.7-Max is its ability to maintain coherent reasoning over extremely long horizons. This is best illustrated by a real-world optimization task involving a production-grade attention operator in SGLang.

The 35-Hour Kernel Optimization Case Study

Tasked with optimizing an "Extend Attention" kernel on a hardware platform it had never seen during training (T-Head ZW-M890 PPUs), Qwen3.7-Max operated autonomously for approximately 35 hours. Over the course of 1,158 tool calls and 432 kernel evaluations, the model iteratively improved the kernel through several structural transitions:

  1. Parallelism Redesign: Implementing Split-KV partitioning to maximize SM occupancy.
  2. Overhead Reduction: Replacing synchronous cudaMalloc/cudaFree with pre-allocated tensors.
  3. Adaptive Tuning: Developing workload-size-dependent heuristics for splitting.
  4. Architectural Overhaul: Redesigning the kernel to process four query tokens simultaneously, sharing K/V loads to amortize memory costs.

The result was a 10.0x geometric mean speedup over the Triton reference implementation, demonstrating that the model can use runtime feedback to solve problems for which it has no prior memorized knowledge.

Scaling Agentic Capabilities

Alibaba attributes these gains to an "environment scaling" approach. Much like how LLMs generalize from diverse text during pretraining, Qwen3.7-Max was trained in a vast array of diverse agentic environments.

To prevent the model from simply learning "shortcuts" for specific benchmarks, the team used a decoupled infrastructure where Tasks, Harnesses, and Verifiers are orthogonal. By pairing the same task with different harnesses and verifiers, the model is forced to learn general problem-solving strategies rather than harness-specific patterns.

Enterprise and Physical World Applications

Beyond coding, Qwen3.7-Max is being applied to high-complexity professional workflows:

  • Startup Management: In the YC-Bench simulation (a year-long startup lifecycle), Qwen3.7-Max generated 2.08M USD in revenue—nearly double that of Qwen3.6-Plus—by autonomously identifying malicious clients and recovering from mid-term crises.
  • Office Automation: The model can autonomously reformat complex documents (e.g., university theses) by reading specification files and executing a series of office-cli tool calls.
  • Robotics: Through the Qwen-RobotClaw and Qwen-RobotNav frameworks, the model can operate a robot dog, handling physical planning and decision-making in real-time.

Community Reception and Critical Perspectives

While the technical benchmarks are impressive, the Hacker News community raised several critical points regarding the model's deployment and transparency:

  • Geopolitical and Ethical Concerns: Several users expressed reluctance to use models from Chinese labs due to concerns over telemetry, data privacy, and government censorship, specifically regarding sensitive political topics.
  • Benchmark Transparency: Some users noted that the benchmarks often compare the new model against older versions of competitors (e.g., Opus-4.6), questioning if the comparisons are fully current.
  • Accessibility: Users reported challenges in obtaining API keys and navigating the Alibaba Cloud Model Studio interface, with some noting high CPU usage on the landing page.

Despite these concerns, developers using Qwen's previous versions have praised the "sweet spot" of being cheap, fast, and highly capable, with some using Qwen3.6 as a free alternative to Claude Code for smaller tasks.

Integration and Availability

Qwen3.7-Max is designed for easy integration into existing developer workflows. It supports the preserve_thinking feature, which allows the model to retain reasoning chains from previous turns—a critical requirement for complex agentic tasks. It is compatible with the OpenAI specification and the Anthropic API protocol, allowing it to be used as a drop-in replacement in tools like Claude Code and OpenClaw.

References

HN Stories