Testing Distributed Systems with AI Agents: A Claim-Driven Approach
Testing distributed systems is notoriously difficult. The standard approach of writing a handful of integration tests and hoping for the best rarely catches the bugs that actually cause production outages: partial network partitions, non-deterministic concurrency, and complex crash-recovery scenarios.
To address this, a new framework has emerged that leverages AI coding agents (such as Claude Code, Cursor, or Gemini) to implement a rigorous, opinionated workflow for distributed systems testing. By providing agents with specific "skills" in the form of Markdown files, developers can automate the design and execution of tests that are grounded in formal claims rather than simple setup scripts.
Moving from Test-Driven to Claim-Driven Testing
The core philosophy of this approach is a shift from test-driven development to claim-driven testing. In traditional testing, a test is often named after its setup (e.g., test_network_partition_node_1). Because such a name records only the setup and not the property being checked, these tests tend to be weakened or "watered down" as the system evolves.
In a claim-driven model, every scenario is designed to falsify a specific product claim. For example, instead of a generic partition test, a scenario might be named linearizable_append_under_partition, specifically targeting the claim: "Every acknowledged append is durable and linearizable."
As noted by community members, this framing makes tests harder to weaken because the goal is explicitly tied to a business or technical invariant. This is particularly critical for stateful systems where invariants like idempotent posting and "no lost acknowledgments" are the primary measures of correctness.
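To make the contrast concrete, here is a minimal Python sketch of a scenario named after the claim it tries to falsify. The `Op` record, the function names, and the result dictionary are hypothetical and are not taken from the framework. The check covers only the durability half of the claim; linearizability would need a real checker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One client append recorded in the test history (hypothetical schema)."""
    value: str
    acked: bool  # True if the system acknowledged the append to the client


def lost_acknowledged_appends(history, final_log):
    """Return acknowledged values that are missing from the final log."""
    present = set(final_log)
    return [op.value for op in history if op.acked and op.value not in present]


def acked_append_is_durable_under_partition(history, final_log):
    """Scenario named after the claim: any lost acknowledged append falsifies it."""
    lost = lost_acknowledged_appends(history, final_log)
    return {
        "claim": "Every acknowledged append is durable",
        "falsified": bool(lost),
        "counterexamples": lost,
    }
```

An unacknowledged append that goes missing does not count against the claim; only acknowledged ones do. Weakening the scenario would mean visibly editing the claim string or the counterexample check, rather than quietly loosening a setup script.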
The AI Agent Workflow: Design and Execution
The framework splits the testing process into two distinct AI agent skills: Designing and Executing.
1. Designing Distributed System Tests
The design skill transforms the codebase and product documentation into a structured Markdown test plan. This process involves:
- Extracting Claims: Identifying what the product promises to the user.
- Hypothesis Generation: Creating failure-mode hypotheses tied to those claims.
- Coverage Matrix: Mapping claims against hypotheses to ensure no critical failure mode is ignored.
- Technique Selection: Choosing the right testing methodology from a curated catalog (e.g., Jepsen-style linearizability checking, deterministic simulation, or formal methods like TLA+).
- Adequacy Argument: Providing a formal argument for why the chosen scenarios are sufficient to ship the product, along with a list of residual uncertainties.
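The coverage-matrix step above can be sketched in a few lines of Python. This is an illustrative assumption about how such a matrix might be computed, not the design skill's actual output format; the function names and the "GAP" marker are hypothetical.

```python
def uncovered_cells(claims, hypotheses, scenarios):
    """List (claim, hypothesis) pairs that no scenario targets.

    scenarios maps a scenario name to the (claim, hypothesis) pair it targets.
    """
    covered = set(scenarios.values())
    return [(c, h) for c in claims for h in hypotheses if (c, h) not in covered]


def coverage_matrix(claims, hypotheses, scenarios):
    """Render the claims-by-hypotheses matrix as a Markdown table."""
    cell = {pair: name for name, pair in scenarios.items()}
    lines = [
        "| Claim | " + " | ".join(hypotheses) + " |",
        "|---|" + "---|" * len(hypotheses),
    ]
    for c in claims:
        row = [cell.get((c, h), "GAP") for h in hypotheses]
        lines.append("| " + c + " | " + " | ".join(row) + " |")
    return "\n".join(lines)
```

Not every gap has to be closed. A cell left as a gap on purpose would presumably be carried into the list of residual uncertainties rather than dropped silently.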
2. Executing Distributed System Tests
The execution skill takes the design plan and turns it into reality. Rather than just running a script and checking for a "green" build, it follows a strict discipline:
- Toolbox Discovery: The agent first discovers existing runbooks and fault-injection scaffolding to avoid reinventing the wheel.
- Nemesis Landing Evidence: To prevent "silent passes," the agent must prove the fault actually occurred. If a network partition was supposed to happen, the agent looks for evidence (e.g., iptables drop counters) to prove the "nemesis" actually landed.
- Model + History + Checker: For consistency-critical tests, the agent binds an abstract model (like a log or queue) to an operation-history schema and a named checker (e.g., Porcupine for linearizability).
- 9-State Verdicts: Instead of a binary Pass/Fail, the agent assigns a verdict from a 9-state taxonomy. This distinguishes between a genuine SUT (System Under Test) failure and a failure in the test harness or environment.
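The interplay between nemesis landing evidence and a non-binary verdict can be sketched as follows. The four verdict labels are hypothetical stand-ins: the framework's actual nine states are not listed here, and this sketch does not attempt to reproduce them. It also assumes the drop counters were already read before and after the fault window (for example, from firewall rule counters).

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical labels for illustration only, not the framework's taxonomy.
    PASS_WITH_EVIDENCE = "pass_with_evidence"
    SUT_FAILURE = "sut_failure"
    HARNESS_FAILURE = "harness_failure"
    INCONCLUSIVE_NO_FAULT = "inconclusive_no_fault"


def nemesis_landed(drops_before, drops_after):
    """The fault counts as landed only if the drop counter actually increased."""
    return drops_after > drops_before


def classify(checker_ok, harness_ok, drops_before, drops_after):
    """Map raw observations to a verdict instead of a bare pass/fail."""
    if not harness_ok:
        # A broken harness says nothing about the system under test.
        return Verdict.HARNESS_FAILURE
    if not checker_ok:
        # An invariant violation is a finding whether or not the fault landed.
        return Verdict.SUT_FAILURE
    if not nemesis_landed(drops_before, drops_after):
        # A green checker with no observed fault is a silent pass, not a pass.
        return Verdict.INCONCLUSIVE_NO_FAULT
    return Verdict.PASS_WITH_EVIDENCE
```

The ordering of the checks is a design choice in this sketch: harness health is evaluated first so that a test-infrastructure problem is never reported as a system defect.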
The Technique Catalog
One of the most valuable aspects of this framework is the integration of a technique catalog distilled from industry literature and seminal papers (including work from OSDI, SOSP, and NSDI). The AI agent is guided to reach for specific tools based on the symptoms it is trying to detect:
| Technique | Best For... |
|---|---|
| Jepsen/Elle | Linearizability and serializability under faults |
| Deterministic Simulation | Reproducible bugs in async-heavy code |
| Chaos/Fault Injection | Real-cluster partial or asymmetric faults |
| Fuzzing | Input or concurrency bugs under sanitizers |
| Formal Methods (TLA+) | Protocol correctness at the design stage |
| Crash-Recovery | Durability, replay, and idempotency |
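As one illustration of a catalog entry, the idea behind deterministic simulation can be shown with a toy Python sketch: all nondeterminism (here, message delivery order) is drawn from a single seeded random source, so a failing run can be replayed exactly from its seed. The `simulate` function is a hypothetical example and far simpler than a real simulator, which would also control time, I/O, and scheduling.

```python
import random


def simulate(seed, messages):
    """Deliver pending messages in an order chosen by a seeded RNG.

    Running again with the same seed and inputs reproduces the same order.
    """
    rng = random.Random(seed)  # private RNG, independent of global state
    pending = list(messages)
    delivered = []
    while pending:
        idx = rng.randrange(len(pending))  # pick which message arrives next
        delivered.append(pending.pop(idx))
    return delivered
```

Logging the seed alongside a failure would then be enough to reproduce the interleaving that triggered it.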
Implementation and Practicality
The system is implemented as a set of SKILL.md files that can be installed into any agent that can read Markdown and execute shell commands. This allows the AI to act as a technical lead who designs the plan and a QA engineer who executes it, producing a final findings report that a human reviewer can use to make a ship/no-ship decision without needing to re-run the tests.
However, the use of "pure markdown skills" does come with risks. Some practitioners have noted that frontier models can occasionally hallucinate the completion of steps—claiming a file was created when it wasn't. This highlights the importance of the "execution evidence" and