The run loop
When you execute archal run scenario.md, five things happen:
1. Parse the scenario
The runner reads your markdown file and extracts these sections:
- Setup - natural language describing the initial twin state
- Prompt (optional) - the task given to the agent. If omitted, the setup is used as the task.
- Expected behavior - what the agent should do (used only for evaluation, never shown to the agent)
- Success criteria - evaluable statements tagged as [D] (deterministic) or [P] (probabilistic)
- Config - which twins to use, timeout, number of runs
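As a rough illustration of the tagging convention, the sketch below groups success criteria by their [D] or [P] tag. The line layout (a "- " bullet followed by the tag) and the name split_criteria are assumptions made for this example; Archal's actual parser may accept a different layout.

```python
import re

# Hypothetical format: one criterion per bullet, starting with a [D] or [P] tag.
CRITERION = re.compile(r"^\s*-\s*\[(D|P)\]\s*(.+?)\s*$")


def split_criteria(lines):
    """Group criterion text by tag: 'D' (deterministic) or 'P' (probabilistic)."""
    grouped = {"D": [], "P": []}
    for line in lines:
        match = CRITERION.match(line)
        if match:  # untagged lines are ignored in this sketch
            tag, text = match.groups()
            grouped[tag].append(text)
    return grouped


example = [
    "- [D] Exactly 2 issues are closed",
    "- [P] The agent explains why it closed each issue",
]
criteria = split_criteria(example)
```

Keeping the two groups separate mirrors the evaluation step, where each kind of criterion is checked differently.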
2. Provision cloud twins
Archal requests a hosted session for the required twins. Each twin is pre-loaded with state generated from the scenario’s setup section. The hosted twins expose MCP/API endpoints that your configured execution engine can reach.
3. Run the engine
Archal executes your agent against the hosted twins. The mode is inferred from which flags you provide:
- API mode (--engine-endpoint): sends the scenario task to a remote /v1/responses endpoint (e.g. an OpenClaw gateway or your own agent API). The engine receives the task plus twin endpoint URLs.
- Harness mode (--harness-dir): spawns a local agent command from a directory, optionally configured with an archal-harness.json manifest
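The flag-based mode selection can be pictured with a small sketch. The function infer_mode and its error handling are hypothetical; in particular, how Archal treats the case where both flags or neither flag is given is not stated here, so the sketch simply rejects both cases.

```python
def infer_mode(engine_endpoint=None, harness_dir=None):
    """Pick the execution mode from whichever flag value was provided."""
    if engine_endpoint and harness_dir:
        # Assumption: supplying both flags is treated as ambiguous in this sketch.
        raise ValueError("provide only one of --engine-endpoint or --harness-dir")
    if engine_endpoint:
        return "api"  # task is sent to a remote /v1/responses endpoint
    if harness_dir:
        return "harness"  # a local agent command is spawned from the directory
    # Assumption: the sketch requires one flag; the real CLI may have a default.
    raise ValueError("provide --engine-endpoint or --harness-dir")
```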
4. Evaluate
After each run, the evaluator checks every success criterion:
- Deterministic criteria [D] - checked against the twin’s final state. Numeric comparisons, existence checks, count assertions. Free and instant.
- Probabilistic criteria [P] - assessed by an LLM (Claude) using the trace, final state, and expected behavior description
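To make the three kinds of deterministic check concrete, here is a minimal sketch run against a made-up final state. The shape of final_state and the helper count_where are assumptions for illustration only; real twin state will look different.

```python
# Hypothetical final state of a twin, shaped for illustration only.
final_state = {
    "issues": [
        {"id": 1, "state": "closed"},
        {"id": 2, "state": "closed"},
        {"id": 3, "state": "open"},
    ]
}


def count_where(items, **fields):
    """Count items whose named fields all equal the given values."""
    return sum(
        all(item.get(key) == value for key, value in fields.items())
        for item in items
    )


issues = final_state["issues"]

# Count assertion: exactly 2 issues are closed.
closed_ok = count_where(issues, state="closed") == 2
# Existence check: issue 3 is still present.
exists_ok = any(issue["id"] == 3 for issue in issues)
# Numeric comparison: fewer than 2 issues remain open.
open_ok = count_where(issues, state="open") < 2
```

Checks like these need no model call, which is why deterministic criteria are described as free and instant.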
5. Score
The scenario runs N times (default 1; increase with -n). Each run yields a pass/fail result per criterion. The satisfaction score is the percentage of criterion-run pairs that passed across all runs.
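The score definition can be worked through with a short sketch. The function name satisfaction_score, the criterion names, and the dict-per-run input shape are hypothetical; only the formula (passed criterion-run pairs divided by all pairs) comes from the description above.

```python
def satisfaction_score(runs):
    """Return the percentage of criterion-run pairs that passed.

    runs: list of dicts, one per run, mapping criterion name -> bool.
    """
    results = [passed for run in runs for passed in run.values()]
    if not results:
        raise ValueError("no criterion results to score")
    return 100.0 * sum(results) / len(results)


# Three runs of a scenario with two criteria each: six criterion-run pairs.
runs = [
    {"closed_two_issues": True, "explained_reasoning": True},
    {"closed_two_issues": True, "explained_reasoning": False},
    {"closed_two_issues": False, "explained_reasoning": False},
]
score = satisfaction_score(runs)  # 3 of 6 pairs passed, so 50.0
```

Because every pair counts equally, a criterion that fails in every run lowers the score as much as several criteria that each fail occasionally.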