Resurf: A Realistic and Reproducible Testing Framework for AI Browser Agents
Testing AI browser agents is notoriously difficult. When developers rely on live websites, they face a constant battle with flakiness, rate-limiting, and the expense of bypassing CAPTCHAs. Conversely, static-HTML benchmarks often lack the state and dynamic behavior needed to verify whether an agent can actually handle a real-world interaction.
Resurf introduces a systematic approach to this problem by providing a realistic, stateful, and instrumented framework specifically designed for browser agent evaluation. By moving away from live sites and toward synthetic environments, it allows developers to build more reliable and deterministic testing pipelines.
The Challenges of Current Testing Paradigms
To understand why a framework like Resurf is necessary, one must look at the current state of browser agent testing. Most developers currently oscillate between two extremes:
Live Website Testing: While realistic, this is inherently unstable. Websites change their DOM structure, implement aggressive rate-limiting, and employ anti-bot measures. This makes it very hard to build a deterministic test suite in which a failure can be attributed to the agent's logic rather than to the external environment.
Static Benchmarks: These are often used for academic purposes, but they lack the statefulness of a modern web application. An agent cannot truly be tested on its ability to navigate a complex multi-step process—such as a checkout flow or a database update—if the environment is just a collection of static pages.
Core Features of Resurf
Resurf bridges this gap by offering a synthetic environment that mimics real-world complexity without the instability of the live web. Its core capabilities include:
Deterministic and Reproducible Environments
Because Resurf uses synthetic websites, the environment is fully controlled. This ensures that tests are reproducible. If an agent fails a specific step in a workflow, developers can replay the exact same state to debug that failure without worrying that a live site has changed in the meantime.
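To make the idea of replaying the exact same state concrete, here is a minimal sketch in Python. It is not Resurf's actual API: the `SyntheticShop` class and its `snapshot` and `restore` methods are hypothetical names chosen for illustration. The sketch assumes that a synthetic site can be generated from a seed and that its state can be copied and restored on demand.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class SyntheticShop:
    """Hypothetical stand-in for a synthetic site; not Resurf's actual API."""
    seed: int
    inventory: dict = field(default_factory=dict)
    orders: list = field(default_factory=list)

    def __post_init__(self):
        # A private, seeded RNG makes the generated data identical on every run.
        rng = random.Random(self.seed)
        self.inventory = {f"sku-{i}": rng.randint(1, 20) for i in range(5)}

    def place_order(self, sku: str, qty: int) -> bool:
        # Reject the order if there is not enough stock; otherwise record it.
        if self.inventory.get(sku, 0) < qty:
            return False
        self.inventory[sku] -= qty
        self.orders.append({"sku": sku, "qty": qty})
        return True

    def snapshot(self) -> dict:
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy({"inventory": self.inventory, "orders": self.orders})

    def restore(self, snap: dict) -> None:
        # Return the site to a previously captured state for replay.
        self.inventory = copy.deepcopy(snap["inventory"])
        self.orders = copy.deepcopy(snap["orders"])
```

In a setup like this, a test would take a snapshot before the agent runs, let the agent act, and restore the snapshot to replay a failing step against the same starting state.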
Failure-Mode Injection
One of the most powerful aspects of the framework is the ability to inject failures. To build a resilient agent, you must test how it handles errors. Resurf allows developers to simulate:
Network Latency: Testing if the agent times out or retries correctly.
Payment Errors: Simulating failed transactions to see if the agent can recover or report the error accurately.
5xx Server Errors: Testing the agent's ability to handle unexpected server-side crashes.
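The three failure modes above can be sketched as a small fault-injection layer. This is an illustrative sketch under stated assumptions, not Resurf's real interface: `Fault`, `FaultyServer`, and `call_with_retry` are hypothetical names, the request paths are made up, and a simulated clock stands in for real latency so the example never sleeps. The payment error is modeled with HTTP status 402 purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    """Hypothetical fault description; field names are illustrative."""
    path: str             # request path the fault applies to
    status: int = 200     # HTTP status to return (e.g. 503 for a server error)
    delay_s: float = 0.0  # simulated latency added before the response
    times: int = 1        # how many matching requests are affected


class FaultyServer:
    """Toy server that applies configured faults before answering."""

    def __init__(self, faults):
        self.faults = list(faults)
        self.clock = 0.0  # simulated clock, so tests never actually sleep

    def handle(self, path: str):
        for fault in self.faults:
            if fault.path == path and fault.times > 0:
                fault.times -= 1
                self.clock += fault.delay_s
                if fault.status != 200:
                    return fault.status, {"error": "injected fault"}
                break  # latency-only fault: fall through to a normal response
        return 200, {"ok": True}


def call_with_retry(server, path, max_attempts=3):
    """Retry on 5xx only; client errors such as 402 are returned at once."""
    for attempt in range(1, max_attempts + 1):
        status, _body = server.handle(path)
        if status < 500:
            return status, attempt
    return status, attempt


server = FaultyServer([
    Fault("/cart", status=503, times=2),  # two injected server errors
    Fault("/pay", status=402),            # one injected payment failure
    Fault("/search", delay_s=2.5),        # one slow response
])
cart_result = call_with_retry(server, "/cart")
pay_result = call_with_retry(server, "/pay")
```

Here the retry loop recovers from the two injected 503 responses on its third attempt, while the payment failure is surfaced immediately instead of being retried. That distinction is the kind of behavior a failure-injection test is meant to check.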
Auditable Success Evaluation
Many current AI agent evaluations rely on an "LLM judge", another LLM that looks at the result and guesses whether the agent succeeded. Resurf moves away from this by providing auditable success evaluations based on actual database state. Instead of asking an LLM if the agent "seems