aec-bench

Agent Harnesses

Agent harnesses define how a model attempts a task, from a single direct call through tool loops, recursive REPL work, and structured Lambda-RLM report generation.

An agent harness is the strategy used to drive a model during a task. It answers practical questions: does the model get one call or many, can it use tools, does it keep persistent state, and is the output assembled by a structured workflow?

AECBench currently ships four built-in agent harnesses: direct, tool_loop, rlm, and lambda-rlm. They share an internal Adapter interface, so the benchmark runner can schedule, execute, score, and compare them through the same trial pipeline. That does not mean every harness is equally suitable for every task. A task with mandatory source lookup or shell execution will usually need a tool-capable harness.

The protocol

Internally, each agent harness implements the Adapter protocol:

class Adapter(Protocol):
    def execute(self, request: AdapterRequest) -> AdapterResult: ...
    def adapter_name(self) -> str: ...
    def resolved_model(self) -> str: ...

AdapterRequest carries the instruction, optional system prompt, task-declared tools, harness-specific configuration, and output target:

class AdapterRequest:
    instruction: str
    system_prompt: str | None
    tools: list[ToolSpec]
    configuration: dict[str, Any]
    output_path: str              # defaults to /workspace/output.jsonl
    output_format: str            # "jsonl", "markdown", "json"

AdapterResult captures the normalised outcome. This trimmed view shows the fields most users inspect first:

class AdapterResult:
    adapter_name: str
    resolved_model: str
    configuration_record: dict[str, Any]
    agent_output: AgentOutput
    transcript: list[TranscriptEntry]
    failure_kind: AdapterFailureKind | None
    raw_output_text: str | None
    provider_error: str | None
    usage_input_tokens: int | None
    usage_output_tokens: int | None
    # + cache and advisor usage fields
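
To make the contract concrete, here is a minimal sketch of a class that satisfies the same three-method shape. `SketchRequest`, `SketchResult`, `SketchAdapter`, `EchoAdapter`, and `run_trial` are hypothetical names invented for this example. They are simplified stand-ins, not AECBench's real models, and the echo behaviour only illustrates why one runner can drive any harness.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class SketchRequest:
    """Simplified stand-in for AdapterRequest (hypothetical, not the real model)."""
    instruction: str
    system_prompt: str | None = None
    configuration: dict[str, Any] = field(default_factory=dict)
    output_path: str = "/workspace/output.jsonl"
    output_format: str = "jsonl"


@dataclass
class SketchResult:
    """Simplified stand-in for AdapterResult with a handful of fields."""
    adapter_name: str
    resolved_model: str
    raw_output_text: str | None = None
    failure_kind: str | None = None


class SketchAdapter(Protocol):
    """Mirrors the three methods of the Adapter protocol shown above."""
    def execute(self, request: SketchRequest) -> SketchResult: ...
    def adapter_name(self) -> str: ...
    def resolved_model(self) -> str: ...


class EchoAdapter:
    """Toy harness: returns the instruction unchanged instead of calling a model."""

    def execute(self, request: SketchRequest) -> SketchResult:
        return SketchResult(
            adapter_name=self.adapter_name(),
            resolved_model=self.resolved_model(),
            raw_output_text=request.instruction,
        )

    def adapter_name(self) -> str:
        return "echo"

    def resolved_model(self) -> str:
        return "none"


def run_trial(adapter: SketchAdapter, instruction: str) -> SketchResult:
    """The caller depends only on the protocol, never on a concrete harness."""
    return adapter.execute(SketchRequest(instruction=instruction))


result = run_trial(EchoAdapter(), "Classify this beam.")
print(result.adapter_name, result.failure_kind)
```

Because `run_trial` is typed against the protocol, swapping in a different harness would not change the calling code.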

Choosing a harness

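| Harness    | Turns             | Tools             | Best for                                       |
|------------|-------------------|-------------------|------------------------------------------------|
| Direct     | 1                 | None              | Classification, extraction, simple generation  |
| Tool Loop  | N                 | Whitelist         | Shell execution, search, general tool use      |
| RLM        | Recursive         | Python REPL       | Iterative analysis, long-context work          |
| Lambda-RLM | Structured phases | Template pipeline | Reports, scopes, and known document workflows  |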

Direct

Direct makes one model call and returns one response. It is the right baseline for tasks where the instruction contains enough context for the model to answer without external files or computation.

result = direct_adapter.execute(
    AdapterRequest(
        instruction="Classify this beam as simply-supported or cantilever...",
        configuration={"max_tokens": 8192},
    )
)

Direct ignores task-declared tools. If the task requires shell execution, source lookup, or iterative correction, use a tool-capable harness instead.
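
The selection guidance can be read as a small decision rule. The sketch below is one possible reading of the table in "Choosing a harness", not logic that AECBench ships. The function name, its three flags, and the order in which they are checked are all assumptions made for illustration.

```python
def suggest_harness(
    needs_tools: bool,
    needs_persistent_state: bool,
    known_output_shape: bool,
) -> str:
    """Map coarse task traits to a harness name (illustrative heuristic only)."""
    if known_output_shape:
        # Templated reports and scopes: structured phases fit best.
        return "lambda-rlm"
    if needs_persistent_state:
        # Chunked document inspection or long-running analysis.
        return "rlm"
    if needs_tools:
        # Shell execution, search, or other declared tools.
        return "tool_loop"
    # The instruction already contains everything the model needs.
    return "direct"


print(suggest_harness(needs_tools=True, needs_persistent_state=False, known_output_shape=False))
```

A real choice will also depend on cost, latency, and what the task declares, so treat the result as a starting point.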

Tool Loop

Tool Loop runs a bounded multi-turn interaction. The model asks for a tool call, the harness executes it, the result goes back to the model, and the loop continues until the model writes the required output or reaches max_turns (default 8).

harness = "tool_loop"
model = "claude-sonnet-4"

[configuration]
max_turns = 12

Tool Loop enforces the task's tool whitelist. A request for a tool that was not declared fails the run with undeclared_tool_request (AdapterFailureKind.UNDECLARED_TOOL_REQUEST in Python). This makes tool access part of the benchmark contract instead of something the model can invent mid-run.

The optional advisor tool is configured separately. When enabled, it is an explicit harness capability, not a task-declared tool.

A typical turn sequence for the voltage-drop task:

user:      "Calculate voltage drop using bash and the supplied cable data."
assistant: [tool_call: bash, command: "python voltage_calc.py"]
tool:      [stdout: "V_drop = 5.2V (compliant)"]
assistant: [writes /workspace/output.jsonl]
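
The control flow described above (bounded turns, whitelist check, tool execution, early exit on output) can be sketched in a few lines. This is a simplified illustration, not the harness's actual implementation. `ToolCall`, `FinalOutput`, `run_tool_loop`, and the scripted model are hypothetical; only the failure strings and the default of 8 turns come from this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict


@dataclass
class FinalOutput:
    text: str


def run_tool_loop(
    model: Callable[[list], "ToolCall | FinalOutput"],
    tools: dict[str, Callable[[dict], str]],
    declared: set[str],
    max_turns: int = 8,
) -> tuple[str | None, str | None]:
    """Return (output, failure_kind); exactly one of the two is None."""
    history: list = []
    for _ in range(max_turns):
        action = model(history)
        if isinstance(action, FinalOutput):
            return action.text, None
        if action.tool_name not in declared:
            # The whitelist is part of the task contract.
            return None, "undeclared_tool_request"
        try:
            observation = tools[action.tool_name](action.arguments)
        except Exception:
            return None, "tool_execution_failed"
        # Feed the tool result back to the model on the next turn.
        history.append((action, observation))
    return None, "turn_limit_reached"


def scripted_model(history: list) -> ToolCall | FinalOutput:
    """Stand-in for an LLM: one bash call, then the final answer."""
    if not history:
        return ToolCall("bash", {"command": "python voltage_calc.py"})
    return FinalOutput(history[-1][1])


def fake_bash(arguments: dict) -> str:
    """Stand-in for a real executor; returns canned stdout."""
    return "V_drop = 5.2V (compliant)"


print(run_tool_loop(scripted_model, {"bash": fake_bash}, declared={"bash"}))
```

Calling the same loop with an empty `declared` set returns the `undeclared_tool_request` failure instead of running the tool.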

RLM (Recursive Language Model)

RLM is based on Recursive Language Models (Zhang, Kraska, Khattab, 2025). In AECBench, it gives the model a sandboxed Python REPL, persistent scratchpad, helper functions, and optional compaction for long runs.

# rlm.toml
[guardrails]
token_budget = 100_000
max_iterations = 20
max_subcall_depth = 3

[execution]
context_limit = 1_000_000
compaction_threshold_pct = 0.85

Key features:

  • Persistent notes: NOTE("key", value) and RECALL("key") store working data outside the conversation history.
  • REPL helpers: HELP, SHOW_VARS, grep, parallel, fill_parallel, FINAL, and FINAL_VAR support file inspection, calculation, report filling, and finalisation.
  • Compaction: when context approaches compaction_threshold_pct, older turns can be summarised while preserving variables, scratchpad entries, and template progress.
  • Guardrails: token_budget, max_iterations, max_subcall_depth, and optional budget caps keep long runs bounded.

Use RLM when the model needs to inspect documents in chunks, run calculations, preserve intermediate state, or recover from partial progress without starting over.
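
The reason notes survive compaction is that they live outside the conversation history. The sketch below illustrates that idea with a hypothetical `Scratchpad` class and two helper functions. It assumes that `compaction_threshold_pct` is a fraction of `context_limit`, which the example values (0.85 and 1,000,000) suggest but this page does not state. None of these names are AECBench APIs.

```python
class Scratchpad:
    """Key-value notes kept outside the message history (illustrative)."""

    def __init__(self) -> None:
        self._notes: dict = {}

    def note(self, key: str, value: object) -> None:
        self._notes[key] = value

    def recall(self, key: str, default: object = None) -> object:
        return self._notes.get(key, default)


def should_compact(used_tokens: int, context_limit: int, threshold_pct: float) -> bool:
    """Assumed rule: compact once usage reaches the given fraction of the limit."""
    return used_tokens >= context_limit * threshold_pct


def compact(history: list, keep_last: int) -> list:
    """Replace older turns with one placeholder summary; keep the newest turns."""
    if len(history) <= keep_last:
        return list(history)
    dropped = len(history) - keep_last
    summary = f"[summary of {dropped} earlier turns]"
    return [summary] + history[-keep_last:]


pad = Scratchpad()
pad.note("cable_length_m", 42)

history = [f"turn {i}" for i in range(10)]
if should_compact(used_tokens=900_000, context_limit=1_000_000, threshold_pct=0.85):
    history = compact(history, keep_last=3)

# The history shrank, but the note is still available.
print(len(history), pad.recall("cable_length_m"))
```

A real compactor would ask a model to write the summary; the placeholder string here only marks where that summary would go.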

Lambda-RLM

Lambda-RLM is a structured report workflow rather than a free-form tool loop. It is designed for tasks where the output shape is known before the run starts: scopes, compliance reports, design notes, fee proposals, and other templated technical documents.

The high-level phases are:

  1. Plan - build an extraction schedule from the report template
  2. Extract - pull required facts from source documents
  3. Review (optional) - check extraction against contract requirements
  4. Generate - produce prose per section
  5. Output - assemble the final document

# lambda-rlm.toml
[template]
tier = "dependency_tree"
definition = "report_template.toml"

[planner]
context_window_chars = 200_000
max_branching_factor = 4

[review]
enabled = true
max_retries_per_source = 1
max_supplements_per_section = 1

[guardrails]
token_budget = 500_000

[execution]
max_parallel_workers = 4

Current Lambda-RLM runs can use:

  • Report templates with guided sections, composed sections, fields, fragments, and required-output metadata
  • Planning passes that seed a compose scratchpad before section generation
  • Block-task routing so different section types can use different handlers
  • Structure enforcement that validates generated blocks against required fields and retries targeted gaps
  • Grounding reports that check whether generated content is supported by declared source material
  • Best-of-K synthesis where multiple candidate blocks can be generated and synthesised into a final section

Lambda-RLM trades free-form exploration for repeatable extraction and composition. It is a poor fit for open-ended tasks where the agent must decide the workflow from scratch.
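
The five phases can be pictured as a fixed pipeline in which each step consumes the output of the previous one. The sketch below uses plain dictionaries and string formatting in place of model calls. Every function name, the template shape, and the review rule (fail on missing fields) are assumptions chosen to keep the example small, not the harness's real behaviour.

```python
from __future__ import annotations


def plan(template: dict[str, list[str]]) -> list[str]:
    """Plan: list every field the template needs, in section order, once each."""
    schedule: list[str] = []
    for fields in template.values():
        for name in fields:
            if name not in schedule:
                schedule.append(name)
    return schedule


def extract(schedule: list[str], sources: dict[str, str]) -> dict[str, str]:
    """Extract: pull each scheduled fact that the sources contain."""
    return {name: sources[name] for name in schedule if name in sources}


def review(schedule: list[str], facts: dict[str, str]) -> list[str]:
    """Review: report scheduled fields that extraction did not find."""
    return [name for name in schedule if name not in facts]


def generate(template: dict[str, list[str]], facts: dict[str, str]) -> dict[str, str]:
    """Generate: produce one line of text per section."""
    return {
        section: "; ".join(f"{name}: {facts.get(name, '[missing]')}" for name in fields)
        for section, fields in template.items()
    }


def assemble(template: dict[str, list[str]], sections: dict[str, str]) -> str:
    """Output: join sections in template order."""
    return "\n\n".join(f"{name}\n{sections[name]}" for name in template)


def run_report(template, sources, review_enabled: bool = True) -> str:
    schedule = plan(template)
    facts = extract(schedule, sources)
    if review_enabled:
        missing = review(schedule, facts)
        if missing:
            raise ValueError(f"missing required fields: {missing}")
    return assemble(template, generate(template, facts))


template = {"Scope": ["client", "site"], "Fees": ["fee"]}
sources = {"client": "Acme", "site": "Lot 4", "fee": "12,000"}
report = run_report(template, sources)
print(report)
```

The workflow is decided by the template before the run starts, which is why this style suits known document shapes and not open-ended exploration.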

Tools

Task-declared tools use the ToolSpec contract:

class ToolSpec(StrictModel):
    name: NonEmptyStr
    source: str             # relative path in the task dir
    description: NonEmptyStr
    returns_image: bool = False

Tool Loop resolves each declared name to a ToolExecutor in its registry:

class ToolExecutor(Protocol):
    def execute(self, tool_name: str, arguments: dict[str, Any]) -> ToolExecutionResult: ...

The built-in bash executor runs shell commands in the staged workspace and returns stdout, stderr, and exit code. Custom tools register by name against the same protocol.
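
A shell executor of the kind described above can be sketched with the standard library. This is not AECBench's built-in executor. `SketchToolResult`, `SketchBashExecutor`, the `command` argument key, and the timeout value are assumptions made for the example; only the returned fields (stdout, stderr, exit code) and the register-by-name idea come from this page.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchToolResult:
    """Stand-in for ToolExecutionResult with the three fields named above."""
    stdout: str
    stderr: str
    exit_code: int


class SketchBashExecutor:
    """Runs a shell command inside a workspace directory (illustrative)."""

    def __init__(self, workspace: str, timeout_s: float = 30.0) -> None:
        self.workspace = workspace
        self.timeout_s = timeout_s

    def execute(self, tool_name: str, arguments: dict[str, Any]) -> SketchToolResult:
        completed = subprocess.run(
            arguments["command"],   # assumed argument key
            shell=True,             # the command is a shell string
            cwd=self.workspace,     # run inside the staged workspace
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
        )
        return SketchToolResult(completed.stdout, completed.stderr, completed.returncode)


# Executors are looked up by the tool name that the task declared.
registry = {"bash": SketchBashExecutor(workspace=".")}

result = registry["bash"].execute("bash", {"command": "echo hello"})
print(result.exit_code, result.stdout.strip())
```

A custom tool would add another entry to `registry` under its declared name, exposing the same `execute` method.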

RLM and Lambda-RLM expose different capabilities. RLM works through the Python REPL and injected helpers. Lambda-RLM works through its report template, source mapping, sandbox, review, synthesis, and grounding configuration.

Failure modes

When a run fails, failure_kind tells the runner which structural category the failure belongs to. Python enum names are uppercase; serialized values are lowercase:

| Kind                    | Meaning                                               |
|-------------------------|-------------------------------------------------------|
| provider_error          | LLM API returned an error, such as a rate limit or 5xx |
| turn_limit_reached      | Tool Loop exhausted max_turns without output          |
| timeout                 | Harness exceeded the wall-clock budget                |
| undeclared_tool_request | Agent tried to use a tool not in the task whitelist   |
| tool_execution_failed   | A tool call raised an error                           |
| missing_output          | Harness finished but no output file was written       |

The evaluation pipeline treats any non-null failure_kind as a structural failure. The reward is 0.0, regardless of how close the transcript came to a correct answer.
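
The naming convention and the zero-reward rule can be sketched as follows. `SketchFailureKind` and `final_reward` are hypothetical. The member names other than UNDECLARED_TOOL_REQUEST are inferred from the uppercase-name, lowercase-value rule stated above and not copied from the AECBench source.

```python
from __future__ import annotations

from enum import Enum


class SketchFailureKind(str, Enum):
    """Uppercase names in Python, lowercase values when serialized."""
    PROVIDER_ERROR = "provider_error"
    TURN_LIMIT_REACHED = "turn_limit_reached"
    TIMEOUT = "timeout"
    UNDECLARED_TOOL_REQUEST = "undeclared_tool_request"
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    MISSING_OUTPUT = "missing_output"


def final_reward(score: float, failure_kind: SketchFailureKind | None) -> float:
    """Any structural failure overrides the task score with 0.0."""
    return 0.0 if failure_kind is not None else score


# Round trip: a serialized value maps back to the enum member.
kind = SketchFailureKind("timeout")
print(kind.name, kind.value, final_reward(0.9, kind), final_reward(0.9, None))
```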

Transcripts

Runs can emit a structured JSONL trajectory, usually at /workspace/trajectory.jsonl inside the staged workspace. Entries are validated as TrajectoryEntry records:

{"version": 1, "format": "aec-bench-trajectory"}
{"role": "system", "step": 0, "content": "..."}
{"role": "user", "step": 1, "content": "Calculate voltage drop..."}
{"role": "assistant", "step": 2, "content": "I'll run the calculation..."}
{"role": "tool_call", "step": 2, "tool_name": "bash", "command": "python voltage_calc.py"}
{"role": "tool_result", "step": 2, "tool_name": "bash", "stdout": "V_drop = 5.2V", "exit_code": 0}

Because trajectory entries are flushed during execution, a crashed run can still leave a readable partial trace. Trajectories feed trace inspection and behavioural classification (see Evaluation).
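
A reader that tolerates a crashed run only needs to stop at the first line that fails to parse. The sketch below assumes the header is the first record without a "role" key, which matches the example above but is not a documented rule. `read_trajectory` is a hypothetical helper and does not perform TrajectoryEntry validation.

```python
import json


def read_trajectory(text: str):
    """Return (header, entries), stopping at the first unparsable line."""
    header = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Most likely a final line cut short when the run crashed.
            break
        if header is None and "role" not in record:
            header = record
        else:
            entries.append(record)
    return header, entries


sample = "\n".join([
    '{"version": 1, "format": "aec-bench-trajectory"}',
    '{"role": "user", "step": 1, "content": "Calculate voltage drop..."}',
    '{"role": "tool_call", "step": 2, "tool_name": "bash", "command": "python voltage_calc.py"}',
    '{"role": "tool_result", "step": 2, "tool_na',  # truncated mid-write
])

header, entries = read_trajectory(sample)
print(header["format"], len(entries), entries[-1]["role"])
```

Here the truncated tool_result line is discarded and the two complete entries before it remain usable.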
