Archal

Test your agents & software
against

Start a clone →
Backed by Y Combinator
“Good evaluations help teams ship AI agents more confidently. Without them, it’s easy to get stuck in reactive loops — catching issues only in production, where fixing one failure creates others.”
Evaluation guidance for AI agents

01. The loop

Move the eval
loop earlier

Four steps from markdown to CI.

01

Write the scenario

A scenario is a markdown file that captures the starting state of the clone, the task you want the agent to handle, and what counts as success. Scenarios live in your repo and get reviewed in PRs like any other code.

02

Run against a stateful clone

We provision a sandboxed copy of GitHub, Slack, Stripe, or whatever service your production software or agent talks to, usually in under a minute. The clone uses the real API surface, with the same endpoints, error semantics, and rate limits. State persists for the run and resets cleanly between scenarios, so each one starts from a known baseline.
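
The reset-between-scenarios behavior described above can be sketched in a few lines. This is only an illustration of the pattern, assuming a toy in-memory object: `InMemoryClone` and `scenario` are hypothetical names, not Archal's API, and a real clone would sit behind the service's actual endpoints.

```python
import copy
from contextlib import contextmanager


class InMemoryClone:
    """Hypothetical stand-in for a service clone (not Archal's API)."""

    def __init__(self, baseline):
        # Keep a private copy of the baseline so scenarios cannot mutate it.
        self._baseline = copy.deepcopy(baseline)
        self.state = copy.deepcopy(baseline)

    def reset(self):
        # Restore the known starting state.
        self.state = copy.deepcopy(self._baseline)


@contextmanager
def scenario(clone):
    """Run one scenario: state persists inside the block and resets after it."""
    clone.reset()
    try:
        yield clone
    finally:
        clone.reset()


clone = InMemoryClone({"issues": []})
with scenario(clone) as c:
    c.state["issues"].append("agent opened a bug")  # persists for the run
# After the block, the next scenario starts from the baseline again.
```

The design point is that every scenario sees the same starting state, so a failure can be attributed to the agent's behavior rather than to leftovers from a previous run.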

03

Capture the full trace

Every tool call, API request, response body, and state change is captured during the run. You can replay the trace, click through it like a debugger, and diff it against earlier runs to see exactly what changed when behavior started failing.
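
To make the idea of diffing a trace against an earlier run concrete, here is a minimal sketch using Python's standard `difflib`. The trace shape (a list of dicts with `tool` and `status` keys) and the function names are assumptions for illustration; the page does not specify the real trace format.

```python
import difflib
import json


def trace_lines(trace):
    # Serialize each step deterministically so equal steps compare equal.
    return [json.dumps(step, sort_keys=True) for step in trace]


def diff_traces(baseline, candidate):
    """Return unified-diff lines between two runs (empty list if identical)."""
    return list(
        difflib.unified_diff(
            trace_lines(baseline),
            trace_lines(candidate),
            fromfile="baseline",
            tofile="candidate",
            lineterm="",
        )
    )


# Hypothetical traces: the second run gets a different status on the same call.
earlier = [{"tool": "create_issue", "status": 201}]
latest = [{"tool": "create_issue", "status": 422}]

for line in diff_traces(earlier, latest):
    print(line)
```

Lines prefixed with `-` come from the earlier run and lines prefixed with `+` from the latest one, which is usually enough to spot the first step where behavior diverged.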

04

Fail loudly in CI

Wire scenarios into your existing pipeline so they run on every push. When behavior regresses, the build breaks before the change ships. You find out in minutes, instead of from a customer ticket two weeks later.

02. Surfaces

Stateful, typed,
reference-checked daily

Stateful clones of real services via MCP tools and REST routes. Same objects, familiar errors, covered edge cases.

03. Pricing

Pay for minutes used

A session-minute is one clone running for one minute inside a workspace, so three clones running for ten minutes use 30 session-minutes. Evals are the scored checks you run against those sessions. Each workspace has its own shared pool of session-minutes and evals.

Free

$0
  • 500 session-minutes / workspace
  • 100 evals / workspace
  • 1 workspace (2 users max)
  • 3 concurrent sessions / workspace
  • All clones included

Pro / Teams

$99/mo per seat
  • 5,000 session-minutes / seat / month
  • 500 evals / seat / month
  • All clones included
  • 10 concurrent sessions / workspace
  • 1 workspace, up to 5 seats with pooled usage
Example for 1 seat:
  • Evals / month: 500 (500 included · $0.20 / eval monthly add-on)
  • Session-minutes / month: 5,000 (5,000 included · $0.05 / min monthly add-on)
  • Total / month: $99
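
The Pro / Teams arithmetic can be sketched as a small estimator. The prices and included amounts come from the tier above. The billing model is an assumption: that usage is pooled across seats and only usage beyond the pooled allowance is charged at the add-on rate. `estimate_monthly_cost` is a hypothetical helper, not an Archal API.

```python
# Figures from the Pro / Teams tier, kept in cents to avoid float rounding.
SEAT_PRICE_CENTS = 9_900
INCLUDED_MINUTES_PER_SEAT = 5_000
INCLUDED_EVALS_PER_SEAT = 500
ADDON_CENTS_PER_MINUTE = 5
ADDON_CENTS_PER_EVAL = 20
MAX_SEATS = 5


def estimate_monthly_cost(seats, session_minutes, evals):
    """Estimate the monthly bill in dollars for a Pro / Teams workspace.

    Assumption: allowances are pooled across seats and only the excess
    is billed at the add-on rates.
    """
    if not 1 <= seats <= MAX_SEATS:
        raise ValueError("Pro / Teams supports 1 to 5 seats")
    extra_minutes = max(0, session_minutes - seats * INCLUDED_MINUTES_PER_SEAT)
    extra_evals = max(0, evals - seats * INCLUDED_EVALS_PER_SEAT)
    total_cents = (
        seats * SEAT_PRICE_CENTS
        + extra_minutes * ADDON_CENTS_PER_MINUTE
        + extra_evals * ADDON_CENTS_PER_EVAL
    )
    return total_cents / 100


print(estimate_monthly_cost(1, 5_000, 500))   # 99.0
print(estimate_monthly_cost(2, 12_000, 900))  # 298.0
```

In the second call, two seats pool 10,000 session-minutes and 1,000 evals, so only the 2,000 extra minutes are billed (2,000 × $0.05 = $100) on top of 2 × $99.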

Enterprise

Contact us
  • Unlimited session-minutes
  • Unlimited seats per workspace
  • Unlimited workspaces
  • 50 concurrent sessions / workspace
  • SAML SSO
  • SCIM provisioning
  • SOC 2 (in progress)
  • Custom clones & eval support
  • Dedicated onboarding & support

04. Get started

One minute
to your first eval

or read the quickstart