Can LLMs Model Real-World Systems in TLA+? The SysMoBench Findings
The promise of agentic model checking—where an AI can autonomously analyze a codebase and generate a formal specification—is a significant frontier in systems engineering. However, a critical question remains: can Large Language Models (LLMs) actually model the implementation of a real-world system, or are they simply reciting the textbook definitions of the protocols those systems use?
Recent research from the Specula team introduces SysMoBench, an automated benchmark designed to differentiate between "textbook modeling" and faithful system representation using TLA+ (Temporal Logic of Actions). The results highlight a systemic gap in how LLMs handle the transition from abstract protocol knowledge to concrete implementation details.
The "Textbook Modeling" Trap
When asked to write a TLA+ specification for a system like Etcd's Raft implementation, frontier LLMs often produce a polished, syntactically correct model that passes the TLC model checker. However, upon closer inspection, these models frequently mirror the original Raft paper's appendix rather than the specific architectural choices made in the Etcd codebase.
This phenomenon is termed "textbook modeling." Because LLMs are trained on nearly every TLA+ example available online, they are highly proficient at recalling known protocols. But when tasked with abstracting logic from a complex, real-world implementation—where atomic actions may be decomposed differently or state is managed via specific data structures—the models often fail.
How SysMoBench Evaluates Formal Models
SysMoBench utilizes eleven systems spanning concurrent synchronization and distributed protocols. To move beyond simple syntax checks, it employs a four-phase evaluation pipeline:
- Syntax Phase: Checks if the specification compiles.
- Runtime Phase: Verifies if the TLC model checker can execute the spec without error.
- Conformance Phase: Uses trace validation to compare execution traces from the actual code against the model.
- Invariant Phase: Checks if the spec satisfies key safety and liveness properties.
Transition Validation
The most revealing part of this process is Transition Validation. By cutting real-world execution traces into "transition windows" (pre-state, action, post-state), the benchmark can test whether a spec's action can actually move a system from one real state to another. This allows for action-granular diagnostics, pinpointing exactly where a model diverges from the code.
Common Failure Modes
Across leading models like Claude, GPT, and Gemini, two recurring patterns of failure emerge during the conformance phase:
1. Over-Approximation (Entering Impossible States)
LLMs often use common formalization templates that don't match the system's actual data structures. For example, in a ZooKeeper Fast Leader Election (FLE) spec, an LLM might model a vote collection as a set union (accumulating all votes), whereas the actual ZooKeeper code uses a map that overwrites old votes from the same peer. This leads the model to enter states that the real system could never reach.
2. Under-Approximation (Missing Reachable States)
LLMs frequently merge multiple implementation steps into a single atomic guard. In the ZooKeeper example, a server might first update its logical clock and then process a message. An LLM might fuse these into one step, effectively erasing the intermediate states that the real system enters in every election round.
The Performance Gap
The data shows a stark divergence between syntax and conformance. While most frontier LLMs score near 100% in the syntax phase, their performance plummets in the conformance and invariant phases, often averaging around 46% and 41% respectively.
On simple systems (e.g., Asterinas Mutex), LLMs perform well across all phases. However, on complex distributed systems like RedisRaft or CURP, even the most powerful models struggle, with some scoring as low as 25% overall. This suggests that while LLMs can "write TLA+", they cannot yet reliably "model a specific system."
Community Perspectives and Counterpoints
The discussion around these findings raises several critical points regarding the future of formal methods:
- Verification vs. Modeling: Some argue that the gap between model and implementation is an inherent flaw of TLA+. Approaches like Verus are cited as alternatives because they couple implementation and verification, ensuring the model cannot diverge from the code.
- The Role of Human Intent: A recurring concern is whether automating the design process removes human intent. As one commenter noted, if an LLM generates both the design and the program, the "proof" may lack meaningful assurance because a human didn't specify the desired properties.
- Liveness and State Explosion: Experienced practitioners note that LLMs struggle significantly with liveness properties and often produce models that trigger state space explosion, requiring human guidance to optimize the state space.
Looking Forward
While raw LLMs struggle, the Specula team is developing specialized agents that can autonomously read repositories and drive the specification workflow. Their specialized agent, Specula, has already shown the ability to achieve full conformance and invariant scores on current SysMoBench tasks, suggesting that the path forward lies in agentic workflows rather than single-prompt generation.
As formal methods move toward an AI-augmented future, the challenge remains in expanding trace coverage and improving state abstraction to ensure that the "digital twin" created by the AI is a faithful representation of the system it intends to verify.