A harness is the bridge between your AI agent and Archal’s digital twins. It receives a task, discovers tools, calls an LLM, executes tool calls, and repeats until done.
Quick start
Using a bundled harness
Archal ships five bundled harnesses you can use immediately:
| Harness | Description | Best for |
|---|---|---|
| react | Full ReAct agent — step-by-step reasoning, error recovery, retries | Production evaluations (recommended) |
| hardened | Security-focused — investigation-before-action, social engineering resistance | Security and safety scenarios |
| zero-shot | Minimal prompt, basic error handling, no retry | Testing raw model capability |
| naive | No system prompt, no error handling, no retry | Baseline comparison only |
| openclaw | Full OpenClaw workspace agent — uses OpenClaw’s bootstrap files and MCP client | Testing OpenClaw agents directly |
The table above is the full list of bundled presets. If you want to preflight-check them against one of your scenario files:
archal demo --model gemini-2.5-flash --scenario path/to/your-scenario.md --preflight-only
Run a scenario with a bundled harness:
archal run scenario.md \
--harness react \
--agent-model gemini-2.5-flash
Set a default so you don’t need --harness react every time:
archal config set engine.defaultHarness react
Using a custom harness directory
Point --harness-dir at any directory containing your agent code:
archal run scenario.md \
--harness-dir ./my-agent \
-n 3
No manifest file needed — Archal will scan the directory for .md files and load them as prompt context. For more control, add an archal-harness.json manifest.
Manifest (archal-harness.json)
The manifest tells Archal how to spawn your agent. Drop it in your harness directory.
{
"version": 1,
"name": "my-harness",
"description": "Brief description of what this harness does.",
"defaultModel": "gpt-4.1",
"promptFiles": ["system-prompt.md", "safety-guidelines.md"],
"local": {
"command": "node",
"args": ["agent.mjs"],
"env": {
"MY_CUSTOM_VAR": "value"
}
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| version | 1 | Yes | Schema version. Must be 1. |
| name | string | No | Human-readable name for the harness. |
| description | string | No | Brief description of the harness. |
| defaultModel | string | No | Fallback model ID when --agent-model is not provided. |
| promptFiles | string[] | No | Markdown files loaded in order and prepended to the scenario task. Paths are relative to the harness directory. |
| local.command | string | No | Command to spawn (e.g., node, python, npx). |
| local.args | string[] | No | Arguments passed to the command. |
| local.env | object | No | Extra environment variables injected into the harness process. |
If no manifest exists, Archal scans the directory for .md files and loads them as prompt context.
Changing the system prompt
The system prompt is the most important lever for changing how your agent behaves. There are three ways to set it, depending on what you’re building.
Option 1: Prompt files in the manifest
The cleanest approach. List markdown files in promptFiles — they’re concatenated in order and prepended to the scenario task:
{
"version": 1,
"promptFiles": ["system-prompt.md", "safety-guidelines.md"],
"local": { "command": "node", "args": ["agent.mjs"] }
}
Your agent receives the combined content as the ARCHAL_ENGINE_TASK environment variable:
[Contents of system-prompt.md]
[Contents of safety-guidelines.md]
---
[Scenario prompt]
This is how the bundled hardened harness works — it ships a SAFETY.md prompt file with investigation-before-action policies and social engineering resistance guidelines.
Common prompt file patterns:
- System prompt — role definition, reasoning instructions, tool-use guidance
- Safety guidelines — refusal policies, escalation procedures, authorization checks
- Domain context — company-specific terminology, workflow rules
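The combined-task layout shown above can be sketched as a simple join. This is a sketch only: the exact separator and blank-line spacing are assumptions inferred from the example, not a guaranteed format.

```javascript
// Sketch: how prompt files and the scenario prompt are combined into
// the value of ARCHAL_ENGINE_TASK. The "---" separator and "\n\n"
// spacing are assumptions based on the example layout above.
function buildTask(promptFileContents, scenarioPrompt) {
  return [...promptFileContents, '---', scenarioPrompt].join('\n\n');
}

const task = buildTask(
  ['You are a cautious agent.', 'Never act without authorization.'],
  'Process the pending refund requests.'
);
console.log(task);
```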
Option 2: Inline system prompt in your agent code
If you’re writing a custom agent, you have full control. Read ARCHAL_ENGINE_TASK for the scenario task and construct messages however you want:
const TASK = process.env.ARCHAL_ENGINE_TASK;
const messages = [
{
role: 'system',
content: `You are a cautious agent. Never delete anything without confirmation.
Always read before writing. Explain your reasoning.`
},
{ role: 'user', content: TASK }
];
This is how the bundled react and zero-shot harnesses work — the system prompt is hardcoded in the agent code.
Option 3: No manifest, just .md files
Drop markdown files in your harness directory without a manifest. Archal finds them, concatenates them alphabetically, and passes them as part of ARCHAL_ENGINE_TASK. Simple, zero-config.
my-agent/
agent.mjs
01-role.md ← loaded first
02-safety.md ← loaded second
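The zero-config scan described above amounts to filtering for `.md` files and sorting alphabetically. The actual scanner may differ; this sketch just mirrors the documented behavior.

```javascript
// Sketch: order the markdown files in a harness directory the way the
// zero-config scan does — .md files only, alphabetical, so 01-role.md
// loads before 02-safety.md. (Illustrative; not Archal's real code.)
function orderPromptFiles(entries) {
  return entries
    .filter((name) => name.endsWith('.md'))
    .sort((a, b) => a.localeCompare(b));
}

const ordered = orderPromptFiles(['agent.mjs', '02-safety.md', '01-role.md']);
console.log(ordered); // → ['01-role.md', '02-safety.md']
```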
How system prompts are delivered to different providers
The way you send a system prompt depends on the LLM provider. If you’re writing a custom agent, you need to handle this:
| Provider | Mechanism | Notes |
|---|---|---|
| OpenAI (GPT-4o, GPT-4.1, GPT-5.2) | system message role | Standard chat format |
| OpenAI (o1, o3, o4-mini) | developer message role | System messages aren’t supported — use developer or merge into user message |
| Anthropic (Claude) | system parameter (separate from messages) | Not inside the messages array. Supports cache_control for prompt caching. |
| Gemini | system_instruction parameter | Separate from contents array |
The bundled harnesses handle this automatically. If you’re writing a custom agent, use the provider’s native format.
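For a custom agent, the table above boils down to putting the same system prompt in a different slot per provider. A minimal sketch, with simplified request shapes (auth, model, and generation parameters omitted; field names follow the table):

```javascript
// Sketch: place a system prompt in each provider's native slot.
// Shapes are simplified and illustrative, not complete request bodies.
function withSystemPrompt(provider, systemPrompt, userText) {
  switch (provider) {
    case 'openai': // GPT-4o / GPT-4.1 style chat format
      return { messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userText },
      ] };
    case 'openai-reasoning': // o1/o3/o4-mini: use the developer role
      return { messages: [
        { role: 'developer', content: systemPrompt },
        { role: 'user', content: userText },
      ] };
    case 'anthropic': // system is a top-level parameter, not a message
      return { system: systemPrompt,
               messages: [{ role: 'user', content: userText }] };
    case 'gemini': // system_instruction is separate from contents
      return { system_instruction: { parts: [{ text: systemPrompt }] },
               contents: [{ role: 'user', parts: [{ text: userText }] }] };
    default:
      throw new Error(`unknown provider: ${provider}`);
  }
}
```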
Changing temperature
Temperature controls how random the model’s outputs are. Lower values produce more deterministic, consistent tool calls. Higher values produce more varied responses.
The ARCHAL_TEMPERATURE, ARCHAL_MAX_TOKENS, ARCHAL_THINKING_BUDGET, and ARCHAL_REASONING_EFFORT environment variables are read by the bundled harnesses automatically. If you’re writing a fully custom agent with --harness-dir, your code is responsible for reading these env vars and passing them to the LLM. Archal injects them into your process environment — it doesn’t intercept your API calls.
Setting temperature
Via environment variable (works out of the box with bundled harnesses):
export ARCHAL_TEMPERATURE=0.0
archal run scenario.md --harness react --agent-model gpt-4.1
Via manifest environment:
{
"version": 1,
"local": {
"command": "node",
"args": ["agent.mjs"],
"env": { "ARCHAL_TEMPERATURE": "0.0" }
}
}
In your agent code (if writing a custom harness):
const temperature = parseFloat(process.env.ARCHAL_TEMPERATURE ?? '0.2');
// Pass to your LLM call
Recommended values
| Use case | Temperature |
|---|---|
| Tool calling and structured actions | 0.0 |
| Balanced agent tasks (bundled harness default) | 0.2 |
| Creative or generative tasks | 0.7–1.0 |
Provider-specific caveats
- OpenAI reasoning models (o1, o3, o4-mini): Temperature is not accepted. The API returns an error if you set it. The bundled harnesses skip it automatically for these models.
- Anthropic with extended thinking: Temperature cannot be modified when thinking is enabled. The bundled harnesses handle this.
- Gemini: Works normally. The default of 0.2 is recommended; values below 0.1 can sometimes cause response looping.
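A custom agent can encode the caveats above as a small guard that decides whether to send a temperature value at all. The model-name checks here are illustrative prefixes, not an exhaustive list:

```javascript
// Sketch: return the temperature to send, or undefined to omit it.
// Prefix checks are illustrative assumptions, not a complete model list.
function effectiveTemperature(model, requested, thinkingEnabled = false) {
  const isOpenAIReasoning = /^(o1|o3|o4)/.test(model);
  if (isOpenAIReasoning) return undefined;           // API rejects temperature
  const isClaude = model.startsWith('claude');
  if (isClaude && thinkingEnabled) return undefined; // locked while thinking
  return requested;
}
```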
Changing max output tokens
Max tokens controls how many tokens the model can generate per LLM call. Set this high enough for the model to reason and generate tool calls, but not so high you burn money on rambling.
Setting max tokens
Via environment variable:
export ARCHAL_MAX_TOKENS=16384
Via manifest environment:
{
"version": 1,
"local": {
"env": { "ARCHAL_MAX_TOKENS": "32768" }
}
}
Recommended defaults
The bundled harnesses use these defaults per model. Override with ARCHAL_MAX_TOKENS for any model.
| Model | Default max tokens | Notes |
|---|---|---|
| GPT-4o, GPT-4o-mini | 32,768 | |
| GPT-4.1 | 65,536 | Large context model |
| GPT-5.2 | 32,768 | |
| o1, o3-mini, o4-mini | 32,768–65,536 | Uses max_completion_tokens (reasoning tokens are separate) |
| Claude Opus 4.6 | 32,768 | |
| Claude Sonnet 4.6 | 32,768 | |
| Claude Haiku 4.5 | 16,384 | |
| Gemini 2.0 Flash | 16,384 | |
| Gemini 2.5 Pro, 3.0 Pro | 32,768–65,536 | |
| Gemini 2.5 Flash, 3.0 Flash | 16,384–32,768 | |
Extended thinking and reasoning
Some models can “think” internally before responding. This usually improves performance on complex multi-step tasks but costs more tokens and takes longer.
Anthropic extended thinking
Claude models support extended thinking — the model reasons in a thinking block before producing the visible response.
Control via environment variable:
# Enable adaptive thinking (default — model decides how much to think)
export ARCHAL_THINKING_BUDGET=adaptive
# Disable thinking entirely
export ARCHAL_THINKING_BUDGET=off
# Set explicit token budget (minimum 1,024)
export ARCHAL_THINKING_BUDGET=8192
What the bundled harnesses do: Thinking defaults to adaptive for Claude Opus 4.6 and Sonnet 4.6. This means the model decides how much to think per turn. You can disable it with ARCHAL_THINKING_BUDGET=off if you want faster, cheaper runs.
Constraints when thinking is enabled:
- tool_choice is limited to auto or none (no forcing specific tools)
- temperature and top_k cannot be modified
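A custom harness can interpret ARCHAL_THINKING_BUDGET with a small parser. The 1,024-token minimum comes from the comment above; clamping low values up to that minimum is an assumption (a real harness might reject them instead):

```javascript
// Sketch: interpret ARCHAL_THINKING_BUDGET. Returns 'adaptive', 'off',
// or a numeric token budget. Clamping to the 1,024 minimum is an
// assumption; the real harness may handle low values differently.
function parseThinkingBudget(raw) {
  if (raw === undefined || raw === '' || raw === 'adaptive') return 'adaptive';
  if (raw === 'off') return 'off';
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n)) throw new Error(`invalid ARCHAL_THINKING_BUDGET: ${raw}`);
  return Math.max(n, 1024);
}
```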
OpenAI reasoning models
OpenAI’s o-series (o1, o3, o4-mini) reason internally. Their reasoning is not exposed in the API response — you can’t see or control the thinking content.
Control via environment variable:
# low = fast and cheap, medium = default, high = thorough
export ARCHAL_REASONING_EFFORT=medium
Important differences from standard models:
- No temperature, top_p, frequency_penalty, or presence_penalty — the API rejects these
- No system messages — use the developer role or merge the system prompt into the first user message
- Uses max_completion_tokens instead of max_tokens
- Don’t use chain-of-thought prompting — the model reasons internally, and explicit CoT can hurt
Gemini thinking
Gemini 2.5+ models support thinking via thinkingBudget. Gemini 3 models use thinkingLevel (minimal, low, medium, high).
Control via environment variable:
# Set explicit budget (token count, -1 for dynamic)
export ARCHAL_THINKING_BUDGET=4096
When thinking is enabled, Gemini returns encrypted “thought signatures” that must be passed back in subsequent requests. The bundled harnesses and SDKs handle this automatically.
Thinking comparison across providers
| Feature | Anthropic | OpenAI (o-series) | Gemini |
|---|---|---|---|
| Thinking visibility | Full thinking blocks returned | Hidden (internal only) | Full thinking parts returned |
| Control mechanism | ARCHAL_THINKING_BUDGET | ARCHAL_REASONING_EFFORT | ARCHAL_THINKING_BUDGET |
| Default in bundled harnesses | Adaptive (on) | Medium | Adaptive (on) |
| Can disable | Yes (off) | No (always reasons) | Yes (off) |
| Temperature constraint | Cannot modify when enabled | Not supported at all | No constraint |
| Tool calling | Works with thinking | Works with reasoning | Works with thinking |
Tool choice
Controls whether the model must, may, or must not call tools on a given turn. The bundled harnesses always use auto (model decides). If you’re writing a custom agent:
| Intent | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Model decides | "auto" | {"type": "auto"} | "AUTO" |
| Must call a tool | "required" | {"type": "any"} | "ANY" |
| Force specific tool | {"type": "function", "function": {"name": "..."}} | {"type": "tool", "name": "..."} | Use allowed_function_names |
| No tools | "none" | {"type": "none"} | "NONE" |
For agents, auto is almost always correct. Use required / any only when you know the model should always call a tool on the next turn (e.g., a mandatory first step).
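The mapping above can be captured in a lookup table. This is a sketch following the table, with simplified shapes; forcing a specific tool on Gemini is omitted because it goes through allowed_function_names rather than a tool_choice value:

```javascript
// Sketch: translate one intent into each provider's tool_choice value,
// mirroring the table above. Check provider docs before relying on it.
function toolChoice(provider, intent, toolName) {
  const table = {
    openai: { auto: 'auto', required: 'required', none: 'none',
              force: { type: 'function', function: { name: toolName } } },
    anthropic: { auto: { type: 'auto' }, required: { type: 'any' },
                 none: { type: 'none' },
                 force: { type: 'tool', name: toolName } },
    // Gemini "force" uses allowed_function_names, so it's omitted here.
    gemini: { auto: 'AUTO', required: 'ANY', none: 'NONE' },
  };
  const value = table[provider]?.[intent];
  if (value === undefined) throw new Error(`unsupported: ${provider}/${intent}`);
  return value;
}
```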
Environment variables
Archal injects these into every harness process. Your agent reads them to get the task, connect to twins, and call LLMs.
Task and model
| Variable | Description | Example |
|---|---|---|
| ARCHAL_ENGINE_TASK | The full task text (prompt files + scenario prompt). ## Expected Behavior is never included; it’s the evaluator holdout. ## Setup is likewise not model-visible. | "You are a security-conscious agent...\n\n---\n\nProcess the pending refund requests." |
| ARCHAL_ENGINE_MODEL | Model identifier. Set via --agent-model or manifest defaultModel. | gpt-4.1 |
Twin connectivity
| Variable | Description | Example |
|---|---|---|
| ARCHAL_MCP_CONFIG | Path to MCP server config JSON. Use this to connect via MCP transport. | /tmp/run-xyz-mcp-config.json |
| ARCHAL_MCP_SERVERS | Stringified MCP servers JSON (same data as the config file). | {"github":{"url":"https://..."}} |
| ARCHAL_TWIN_NAMES | Comma-separated list of twin names in the scenario. | github,slack,stripe |
| ARCHAL_<TWIN>_URL | REST base URL for each twin (name uppercased). Use this for REST transport. | ARCHAL_GITHUB_URL=https://hosted.archal.ai/... |
| ARCHAL_TOKEN | Bearer token for authenticated twin requests. | archal_... |
API keys
These are passed through from your environment or config:
| Variable | Provider |
|---|---|
| OPENAI_API_KEY | OpenAI (GPT-4o, GPT-4.1, GPT-5.2, o3, o4-mini) |
| ANTHROPIC_API_KEY | Anthropic (Claude Opus, Sonnet, Haiku) |
| GEMINI_API_KEY | Google (Gemini 2.5 Pro, Flash, 3.0) |
| ARCHAL_ENGINE_API_KEY | Generic override — takes priority over provider-specific keys |
Tuning overrides
The bundled harnesses (react, hardened, zero-shot, naive) read these automatically and apply them to LLM calls. If you’re writing a fully custom agent, your code needs to read these env vars and pass them to the LLM provider yourself — Archal injects them into the process environment but doesn’t intercept your API calls.
| Variable | Description | Default |
|---|---|---|
| ARCHAL_MAX_TOKENS | Max completion tokens per LLM call | Model-specific (see table above) |
| ARCHAL_TEMPERATURE | Sampling temperature | 0.2 |
| ARCHAL_REASONING_EFFORT | For OpenAI reasoning models: low, medium, or high | medium |
| ARCHAL_THINKING_BUDGET | Extended thinking: adaptive, off, or a token count | adaptive |
| ARCHAL_LLM_TIMEOUT | Per-LLM-call timeout in seconds | 120 |
| ARCHAL_LOG_LEVEL | Harness log verbosity: debug, info, warn, error | info |
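A custom harness might gather these overrides in one place. This is a sketch: the fallback of 16,384 max tokens is an assumption (the bundled harnesses use model-specific defaults, per the table earlier), and Archal only injects the variables — your code still has to pass the values to the LLM call.

```javascript
// Sketch: read the tuning overrides with local fallbacks. The 16384
// max-tokens fallback is an assumption; bundled harnesses pick a
// model-specific default instead.
function readTuning(env = process.env) {
  return {
    maxTokens: Number.parseInt(env.ARCHAL_MAX_TOKENS ?? '16384', 10),
    temperature: Number.parseFloat(env.ARCHAL_TEMPERATURE ?? '0.2'),
    reasoningEffort: env.ARCHAL_REASONING_EFFORT ?? 'medium',
    thinkingBudget: env.ARCHAL_THINKING_BUDGET ?? 'adaptive',
    llmTimeoutMs: Number.parseInt(env.ARCHAL_LLM_TIMEOUT ?? '120', 10) * 1000,
  };
}
```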
Base URL overrides
For Azure OpenAI, API proxies, or self-hosted endpoints:
| Variable | Default |
|---|---|
| ARCHAL_OPENAI_BASE_URL | https://api.openai.com/v1 |
| ARCHAL_ANTHROPIC_BASE_URL | https://api.anthropic.com |
| ARCHAL_GEMINI_BASE_URL | https://generativelanguage.googleapis.com/v1beta |
Metrics and trace output
These are set by the orchestrator — your harness can write to them for richer reports:
| Variable | Description |
|---|---|
| ARCHAL_METRICS_FILE | Path to write metrics JSON (token counts, timing, exit reason) |
| ARCHAL_AGENT_TRACE_FILE | Path to write agent trace JSON (thinking, text, tool calls per step) |
Twin transport
Your harness connects to Archal’s digital twins to discover and call tools. Two transport options:
REST (recommended for custom harnesses)
Simple HTTP endpoints. Each twin exposes its URL via ARCHAL_<TWIN>_URL.
// Discover tools
const res = await fetch(`${process.env.ARCHAL_GITHUB_URL}/tools`);
const tools = await res.json();
// Call a tool
const result = await fetch(`${process.env.ARCHAL_GITHUB_URL}/tools/call`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: 'create_issue', arguments: { title: '...' } }),
});
MCP (used by bundled harnesses)
Full Model Context Protocol transport using @modelcontextprotocol/sdk. Archal writes an MCP server config to the path in ARCHAL_MCP_CONFIG.
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';
const config = JSON.parse(fs.readFileSync(process.env.ARCHAL_MCP_CONFIG, 'utf8'));
const transport = new StreamableHTTPClientTransport(new URL(config.mcpServers.github.url));
const client = new Client({ name: 'my-agent' });
await client.connect(transport);
const { tools } = await client.listTools();
When presenting tools to your LLM, namespace them as mcp__<twin>__<tool_name> (e.g., mcp__github__create_issue). This matches the format Archal’s evaluator expects when reading traces.
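Building and splitting that namespace is a one-liner each way. This sketch assumes twin names are simple identifiers without double underscores (like github, slack, stripe in the examples above):

```javascript
// Sketch: build and split the mcp__<twin>__<tool_name> namespace.
// Assumes twin names contain no "__"; tool names may contain underscores.
function namespaceTool(twin, toolName) {
  return `mcp__${twin}__${toolName}`;
}

function parseNamespacedTool(name) {
  const match = name.match(/^mcp__(\w+?)__(.+)$/);
  if (!match) throw new Error(`not a namespaced tool: ${name}`);
  return { twin: match[1], tool: match[2] };
}
```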
Agent loop design
The core pattern for any harness:
1. Read ARCHAL_ENGINE_TASK
2. Discover tools from twins (REST /tools or MCP listTools)
3. Build initial messages (system prompt + task)
4. Loop:
a. Call LLM with messages and tools
b. If no tool calls → done
c. Execute each tool call against the twin
d. Append results to messages
e. Repeat
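The steps above can be sketched as a loop with the LLM and tool executor injected, which keeps the control flow testable. The callLLM and execTool signatures here are illustrative, not an Archal API:

```javascript
// Skeleton of the agent loop. callLLM(messages, tools) is assumed to
// return { message, toolCalls }; execTool(call) returns a result string.
// Both are injected so the loop itself stays provider-agnostic.
async function agentLoop({ task, tools, callLLM, execTool, maxSteps = 30 }) {
  const messages = [
    { role: 'system', content: 'You are a helpful agent.' },
    { role: 'user', content: task },
  ];
  for (let step = 0; step < maxSteps; step++) {
    const { message, toolCalls } = await callLLM(messages, tools);
    messages.push(message);
    if (!toolCalls?.length) return messages; // no tool calls → done
    for (const call of toolCalls) {
      const result = await execTool(call);
      messages.push({ role: 'tool', tool_call_id: call.id, content: result });
    }
  }
  return messages; // step limit reached
}
```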
Practical tips
Investigate before acting. The strongest harnesses read channel messages, check ticket statuses, and review policies before executing write actions. This mirrors real agent behavior and catches social engineering in scenarios.
Handle errors gracefully. Tool calls can fail — return the error message to the LLM and let it retry or adjust. The bundled react and hardened harnesses bail out after 8 consecutive errors to avoid infinite loops.
Retry on transient failures. LLM API calls fail with 429 (rate limit), 500, 502, 503. The bundled harnesses retry with exponential backoff (1s → 2s → 4s, capped at 30s) and respect Retry-After headers.
Set a step limit. Cap the agent loop at 20–50 iterations. Without a limit, the agent can loop indefinitely on ambiguous tasks. The bundled harnesses cap between 20 and 80 steps depending on the harness.
Use temperature 0–0.2 for tool calling. Deterministic outputs produce consistent, valid structured data for function calls.
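The retry schedule described in the tips (1s → 2s → 4s, capped at 30s, Retry-After honored) reduces to a small delay function:

```javascript
// Sketch of the retry schedule: exponential backoff starting at 1s,
// doubling per attempt, capped at 30s. A server-sent Retry-After
// (in seconds) takes precedence when present.
function retryDelayMs(attempt, retryAfterSeconds) {
  if (retryAfterSeconds != null) return retryAfterSeconds * 1000;
  return Math.min(1000 * 2 ** attempt, 30000);
}
```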
CLI flags
These flags on archal run affect harness behavior:
| Flag | Description | Default |
|---|---|---|
| --harness <name> | Use a bundled harness (react, hardened, zero-shot, naive, openclaw) | None |
| --harness-dir <path> | Use a custom harness directory | None |
| --agent-model <model> | Model identifier passed to the harness | Manifest defaultModel |
| --api-key <key> | API key for the model provider | From env vars |
| -n, --runs <count> | Number of runs per scenario (max 100) | 5 |
| -t, --timeout <seconds> | Timeout per run — the harness is killed after this (max 3600) | 180 |
| --seed <name> | Override twin seed | Scenario default |
| --rate-limit <count> | Max tool calls before 429 | None |
| -q, --quiet | Suppress non-error output | false |
| -v, --verbose | Enable debug logging | false |
--harness and --harness-dir are mutually exclusive. You can set a permanent default with archal config set engine.defaultHarness react.
Example: minimal custom harness
A complete harness in ~40 lines using REST transport and the OpenAI API:
// agent.mjs
const TASK = process.env.ARCHAL_ENGINE_TASK;
const MODEL = process.env.ARCHAL_ENGINE_MODEL || 'gpt-4.1';
const API_KEY = process.env.OPENAI_API_KEY;
// Discover tools from all twins
const tools = [];
for (const [key, url] of Object.entries(process.env)) {
const match = key.match(/^ARCHAL_(\w+)_URL$/);
if (!match || !url) continue;
const twin = match[1].toLowerCase();
const res = await fetch(`${url}/tools`);
for (const tool of await res.json()) {
tools.push({
type: 'function',
function: {
name: `mcp__${twin}__${tool.name}`,
description: tool.description,
parameters: tool.inputSchema,
},
});
}
}
// Agent loop
let messages = [
{ role: 'system', content: 'You are a helpful agent. Use tools to complete the task.' },
{ role: 'user', content: TASK },
];
for (let step = 0; step < 30; step++) {
const res = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: { Authorization: `Bearer ${API_KEY}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ model: MODEL, messages, tools, temperature: 0, max_tokens: 16384 }),
});
const data = await res.json();
const choice = data.choices[0];
messages.push(choice.message);
if (!choice.message.tool_calls?.length) break; // done
for (const call of choice.message.tool_calls) {
const [, twin, toolName] = call.function.name.match(/^mcp__(\w+)__(.+)$/);
const twinUrl = process.env[`ARCHAL_${twin.toUpperCase()}_URL`];
const result = await fetch(`${twinUrl}/tools/call`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ name: toolName, arguments: JSON.parse(call.function.arguments) }),
});
messages.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(await result.json()) });
}
}
With a manifest:
{
"version": 1,
"defaultModel": "gpt-4.1",
"local": { "command": "node", "args": ["agent.mjs"] }
}
Run it:
archal run scenario.md --harness-dir ./my-agent
Bundled harness comparison
| Feature | naive | zero-shot | react | hardened | openclaw |
|---|---|---|---|---|---|
| System prompt | None | "Complete the task. Use the tools provided." | Full reasoning guidance | Security-focused (7 principles + SAFETY.md) | OpenClaw bootstrap files |
| Error handling | Crashes on first error | Logs and continues | Retries with backoff, bails at 8 consecutive | Same as react | Same as react |
| Max steps | 20 | 40 | 80 | 50 | 80 |
| Retry on transient | No | No | Yes (4 retries, exponential backoff) | Yes (4 retries, exponential backoff) | Yes |
| Thinking extraction | No | Yes | Yes | Yes | Yes |
| Temperature | Model default | 0.2 | 0.2 | 0.2 | 0.2 |
| Best for | Establishing baselines | Measuring raw model capability | Production evaluations | Security-critical scenarios | Testing OpenClaw agents |