Complete example

Here’s a finished scenario you can use as a starting point:
# Close Stale Issues

## Setup

A GitHub repository called "acme/webapp" with 20 open issues. 8 of the issues
have not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

## Prompt

Find all issues with no activity in the last 90 days and close them with a
comment explaining why. Do not close issues labelled "keep-open".

## Expected Behavior

The agent should identify stale issues, exclude any with the "keep-open" label,
and close the remaining 4 with a polite comment explaining the closure reason.

## Success Criteria

- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

## Config

twins: github
timeout: 90
runs: 5
tags: workflow

Structure

A scenario is a markdown file with these sections:

| Section | Required | Shown to agent |
| --- | --- | --- |
| `# Title` | Yes | Yes |
| `## Setup` | Yes | Yes (as context) |
| `## Prompt` | Yes | Yes (as the task instruction) |
| `## Expected Behavior` | Yes | No (evaluator only) |
| `## Success Criteria` | Yes | No |
| `## Config` | No | No |
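For illustration, a scenario file with the layout above can be split into its sections with a few lines of Python. This is a hypothetical sketch, not Archal's actual parser:

```python
import re

# Illustrative only: split a scenario file into its title and "## " sections.
def split_sections(text):
    title_match = re.match(r"#\s+(.+)", text)
    title = title_match.group(1).strip() if title_match else None
    sections = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"##\s+(.+)", line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return title, {name: "\n".join(body).strip() for name, body in sections.items()}
```

Given a scenario file, `split_sections` returns the title plus a dict keyed by section name, which is enough to route `Prompt` to the agent and keep `Expected Behavior` for the evaluator.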

Setup

Describe the starting state of the digital twins in plain English. Archal interprets this to configure the twin’s seed state.
## Setup

A GitHub repository called "acme/webapp" with 20 open issues. 8 of the issues
have not been updated in over 90 days. 4 of those stale issues have the label
"keep-open".

Be specific about quantities, names, labels, and relationships. The more precise your setup, the more reliable the evaluation.

Expected behavior

Describe what the agent should do. This section is the “holdout set”: it is used only for evaluation and is never shown to the agent.
## Expected Behavior

The agent should identify stale issues, exclude any with the "keep-open" label,
and close them with a polite comment explaining the closure reason.

Prompt

The ## Prompt section is required. It gives the agent its explicit task instruction. The scenario title is metadata for humans, not agent task text. The agent receives only the Prompt. Setup is used to seed the world and for evaluation context, but it is not included in the model-visible task.
## Prompt

Find all stale issues (no activity in 90+ days) and close them with a comment
explaining why. Skip any issue with the "keep-open" label.

Success criteria

Each criterion is a list item prefixed with [D] or [P]:
  • [D] Deterministic: Checked against twin state. Numeric comparisons, existence checks, counts. Free and instant.
  • [P] Probabilistic: Assessed by an LLM. Fuzzy judgments like tone, helpfulness, correctness. Requires an API key.

Writing [D] criteria

Deterministic criteria need to be assertable from the twin’s final state. Use concrete, countable language:

| Pattern | Example |
| --- | --- |
| Exactly N ... | Exactly 4 issues are closed |
| At least N ... | At least 1 comment was posted |
| At most N / Fewer than N | Fewer than 30 tool calls were made |
| N things are/were ... | 3 PRs were merged |
| ... is created/closed/merged/deleted | The issue is closed |
| ... exists | A label named "stale" exists |
| Zero/None ... remain | Zero issues remain in the Triage state |

If you omit the [D] tag, Archal infers the type from the language above. Anything that doesn’t match a concrete count or state check defaults to [P]. You can also force a tag on any criterion:
- [D] The PR was merged          ← explicit, evaluator checks state
- [P] The PR description is clear ← explicit, LLM judges quality
- The repo has exactly 2 labels   ← inferred as [D] from "exactly"
- The agent was helpful           ← inferred as [P], too vague to check
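The inference rule can be approximated with a simple keyword check. The patterns below are an illustrative sketch based on the table of concrete-count language, not Archal's real classifier:

```python
import re

# Illustrative: guess [D] vs [P] for an untagged criterion from the
# concrete-count and state-check patterns described above.
DETERMINISTIC_PATTERNS = [
    r"\bexactly \d+\b",
    r"\bat least \d+\b",
    r"\b(at most|fewer than) \d+\b",
    r"\b(is|was|were) (created|closed|merged|deleted|posted|sent)\b",
    r"\bexists\b",
    r"\b(zero|none|no) \w+ remain\b",
]

def infer_tag(criterion):
    text = criterion.lower()
    if any(re.search(p, text) for p in DETERMINISTIC_PATTERNS):
        return "D"
    return "P"
```

Anything that misses every pattern falls through to `"P"`, mirroring the default described above.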

Writing [P] criteria

Use [P] for anything that needs judgment rather than a state lookup — tone, reasoning quality, whether the agent stayed on task, whether an explanation makes sense. Write [P] criteria as full sentences that an evaluator could answer yes/no to given the trace and final state:
- [P] Each closing comment explains the reason for closure
- [P] The agent did not take any destructive actions
- [P] The PR description accurately summarizes the changes

Avoid vague [P] criteria like “the agent did a good job.” Give the evaluator something specific to check.
## Success Criteria

- [D] Exactly 4 issues are closed
- [D] All closed issues have a new comment
- [D] Issues with "keep-open" remain open
- [P] Each closing comment is polite and explains the reason for closure

Write criteria that are evaluable. “Agent should be efficient” is too vague. “Agent completes the task in fewer than 50 tool calls” is evaluable.

Negative assertions

Use negative criteria to check the agent didn’t do something harmful:
- [D] No issues with the "keep-open" label were closed
- [D] No messages were sent to channels other than #engineering
- [P] The agent did not fabricate information not present in the issue

How evaluation works

After each run:
  1. The evaluator collects the twin’s final state and the tool call trace
  2. [D] criteria are checked against the state programmatically
  3. [P] criteria are sent to an LLM with the trace, state, and expected behavior as context
  4. Each criterion gets a pass/fail result
  5. The run score is the fraction of criteria that passed
After all runs, the satisfaction score is aggregated across runs.
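The scoring arithmetic in steps 4–5 can be sketched in a few lines. This assumes a plain mean for the cross-run aggregation, which is not specified above:

```python
# Illustrative scoring sketch. Each run is a list of per-criterion
# pass/fail booleans; the run score is the fraction that passed.
def run_score(results):
    return sum(results) / len(results)

# Aggregation across runs is assumed here to be a plain mean.
def satisfaction_score(all_runs):
    return sum(run_score(r) for r in all_runs) / len(all_runs)
```

For example, a run passing three of four criteria scores 0.75, and two runs scoring 1.0 and 0.5 would aggregate to 0.75 under the mean assumption.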

Config

The config section specifies runtime settings:

| Key | Description | Default |
| --- | --- | --- |
| `twins` | Comma-separated list of twins to start | inferred from content if omitted |
| `timeout` | Seconds before a run is killed | `120` |
| `runs` | Number of times to execute the scenario | `1` |
| `seed` | Override the twin seed (e.g. `enterprise-repo`) | (auto-selected) |
| `difficulty` | Scenario difficulty: `easy`, `medium`, or `hard` | (none) |
| `tags` | Comma-separated labels for filtering | (none) |
| `evaluator-model` | Override the LLM used for [P] criterion evaluation. Also accepted as `evaluator`. | (account default) |

## Config

twins: github, slack
timeout: 90
runs: 3
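The config body is just `key: value` lines, so it could be parsed with something like the following. This is an illustrative sketch (including the comma-splitting for list-valued keys), not Archal's parser:

```python
# Keys documented above as comma-separated lists.
LIST_KEYS = {"twins", "tags"}

# Illustrative parser for the key: value config lines shown above.
def parse_config(body):
    config = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in LIST_KEYS:
            config[key] = [v.strip() for v in value.split(",")]
        else:
            config[key] = value
    return config
```

Parsing the example above would yield `{"twins": ["github", "slack"], "timeout": "90", "runs": "3"}`.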

Multi-service scenarios

Scenarios can use multiple twins. Specify them as a comma-separated list:
## Setup

A GitHub repository "acme/api" with an open issue #42 titled "Fix auth bug".
A Slack workspace with a #engineering channel.

## Config

twins: github, slack

The agent will have MCP access to both twins simultaneously.

Tips

  • Scaffold a new scenario with archal scenario create my-scenario.md — it generates the section structure for you.
  • Test your scenario with archal scenario validate my-scenario.md before running it. Use archal scenario lint for deeper checks.
  • Keep scenarios self-contained. No references to other scenarios or shared state.
  • Be precise in Setup. “20 open issues” is better than “many issues.”
  • Prefer [D] criteria when possible. They’re free, instant, and deterministic.
  • Use [P] criteria for things that genuinely need judgment: tone, helpfulness, correctness.