Agent Harnesses
Agent harnesses define how a model attempts a task, from a single direct call through tool loops, recursive REPL work, and structured Lambda-RLM report generation.
An agent harness is the strategy used to drive a model during a task. It answers practical questions: does the model get one call or many, can it use tools, does it keep persistent state, and is the output assembled by a structured workflow?
AECBench currently ships four built-in agent harnesses: direct, tool_loop, rlm, and lambda-rlm. They share an internal Adapter interface, so the benchmark runner can schedule, execute, score, and compare them through the same trial pipeline. That does not mean every harness is equally suitable for every task. A task with mandatory source lookup or shell execution will usually need a tool-capable harness.
The protocol
Internally, each agent harness implements the Adapter protocol:
class Adapter(Protocol):
    def execute(self, request: AdapterRequest) -> AdapterResult: ...
    def adapter_name(self) -> str: ...
    def resolved_model(self) -> str: ...

AdapterRequest carries the instruction, optional system prompt, task-declared tools, harness-specific configuration, and output target:
class AdapterRequest:
    instruction: str
    system_prompt: str | None
    tools: list[ToolSpec]
    configuration: dict[str, Any]
    output_path: str  # defaults to /workspace/output.jsonl
    output_format: str  # "jsonl", "markdown", "json"

AdapterResult captures the normalised outcome. This trimmed view shows the fields most users inspect first:
class AdapterResult:
    adapter_name: str
    resolved_model: str
    configuration_record: dict[str, Any]
    agent_output: AgentOutput
    transcript: list[TranscriptEntry]
    failure_kind: AdapterFailureKind | None
    raw_output_text: str | None
    provider_error: str | None
    usage_input_tokens: int | None
    usage_output_tokens: int | None
    # + cache and advisor usage fields

Choosing a harness
| Harness | Turns | Tools | Best for |
|---|---|---|---|
| Direct | 1 | none | Classification, extraction, simple generation |
| Tool Loop | N | Whitelist | Shell execution, search, general tool use |
| RLM | Recursive | Python REPL | Iterative analysis, long-context work |
| Lambda-RLM | Structured phases | Template pipeline | Reports, scopes, and known document workflows |
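One way to read the table above is as a simple decision order. The sketch below is only an illustration: TaskNeeds and suggest_harness are hypothetical names, not part of AECBench, and the precedence between the checks is an assumption. Only the four harness names come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskNeeds:
    """Hypothetical summary of what a task requires (not an AECBench type)."""

    needs_tools: bool = False             # shell execution, search, other declared tools
    needs_persistent_state: bool = False  # iterative analysis, long-context work
    has_known_template: bool = False      # output shape is fixed before the run starts


def suggest_harness(needs: TaskNeeds) -> str:
    """Return one of the four built-in harness names as a starting point."""
    if needs.has_known_template:
        return "lambda-rlm"
    if needs.needs_persistent_state:
        return "rlm"
    if needs.needs_tools:
        return "tool_loop"
    return "direct"


choice = suggest_harness(TaskNeeds(needs_tools=True))
```

Treat the result as a first guess. A task can reasonably be run under several harnesses to compare them through the same trial pipeline.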
Direct
Direct makes one model call and returns one response. It is the right baseline for tasks where the instruction contains enough context for the model to answer without external files or computation.
result = direct_adapter.execute(
    AdapterRequest(
        instruction="Classify this beam as simply-supported or cantilever...",
        configuration={"max_tokens": 8192},
    )
)

Direct ignores task-declared tools. If the task requires shell execution, source lookup, or iterative correction, use a tool-capable harness instead.
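To make the single-call behaviour concrete, here is a minimal sketch of a Direct-style adapter. It is not the AECBench implementation: SketchRequest and SketchResult are simplified stand-ins for AdapterRequest and AdapterResult, the call_model callable is a hypothetical provider hook, and the 1024-token fallback is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class SketchRequest:
    """Simplified stand-in for AdapterRequest (subset of fields)."""

    instruction: str
    system_prompt: Optional[str] = None
    tools: list = field(default_factory=list)
    configuration: dict = field(default_factory=dict)


@dataclass
class SketchResult:
    """Simplified stand-in for AdapterResult (subset of fields)."""

    adapter_name: str
    resolved_model: str
    raw_output_text: Optional[str]
    failure_kind: Optional[str] = None
    provider_error: Optional[str] = None


class DirectSketch:
    """One model call, one response. Task-declared tools are never consulted."""

    def __init__(self, model: str, call_model: Callable[[str, Optional[str], int], str]):
        self._model = model
        self._call_model = call_model  # hypothetical provider hook

    def adapter_name(self) -> str:
        return "direct"

    def resolved_model(self) -> str:
        return self._model

    def execute(self, request: SketchRequest) -> SketchResult:
        # The fallback value is an assumption made for this sketch.
        max_tokens = int(request.configuration.get("max_tokens", 1024))
        try:
            text = self._call_model(request.instruction, request.system_prompt, max_tokens)
        except Exception as exc:
            # A provider failure becomes a structural failure, not an exception.
            return SketchResult(
                self.adapter_name(), self._model, None,
                failure_kind="provider_error", provider_error=str(exc),
            )
        return SketchResult(self.adapter_name(), self._model, text)
```

Note that request.tools is accepted but ignored, which mirrors the behaviour described above.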
Tool Loop
Tool Loop runs a bounded multi-turn interaction. The model asks for a tool call, the harness executes it, the result goes back to the model, and the loop continues until the model writes the required output or reaches max_turns (default 8).
harness = "tool_loop"
model = "claude-sonnet-4"
[configuration]
max_turns = 12

Tool Loop enforces the task's tool whitelist. A request for a tool that was not declared fails the run with undeclared_tool_request (AdapterFailureKind.UNDECLARED_TOOL_REQUEST in Python). This makes tool access part of the benchmark contract instead of something the model can invent mid-run.
The optional advisor tool is configured separately. When enabled, it is an explicit harness capability, not a task-declared tool.
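The bounded loop and whitelist check can be sketched as follows. This is an illustration under stated assumptions, not the real harness: run_tool_loop, the action dictionaries, and the two callables are hypothetical, and stopping the run on a tool error is a simplification. Only max_turns, its default of 8, and the failure kind strings come from this page.

```python
from typing import Any, Callable, Optional

Transcript = list  # list of plain dict entries in this sketch


def run_tool_loop(
    next_action: Callable[[Transcript], dict],
    execute_tool: Callable[[str, dict], Any],
    declared_tools: set,
    max_turns: int = 8,
) -> tuple:
    """Return (failure_kind, transcript); failure_kind is None on success."""
    transcript: Transcript = []
    for _ in range(max_turns):
        # Hypothetical action shape: {"type": "tool", "tool_name", "arguments"}
        # or {"type": "final", "content"}.
        action = next_action(transcript)
        if action["type"] == "final":
            transcript.append({"role": "assistant", "content": action["content"]})
            return None, transcript

        name = action["tool_name"]
        if name not in declared_tools:
            # The whitelist is part of the contract: undeclared tools fail the run.
            return "undeclared_tool_request", transcript

        transcript.append(
            {"role": "tool_call", "tool_name": name, "arguments": action["arguments"]}
        )
        try:
            output = execute_tool(name, action["arguments"])
        except Exception as exc:
            transcript.append({"role": "tool_result", "tool_name": name, "error": str(exc)})
            return "tool_execution_failed", transcript
        transcript.append({"role": "tool_result", "tool_name": name, "output": output})

    # The model never produced final output within the turn budget.
    return "turn_limit_reached", transcript


script = iter([
    {"type": "tool", "tool_name": "bash", "arguments": {"command": "python voltage_calc.py"}},
    {"type": "final", "content": "wrote /workspace/output.jsonl"},
])
failure, transcript = run_tool_loop(
    next_action=lambda _t: next(script),
    execute_tool=lambda name, args: "V_drop = 5.2V (compliant)",
    declared_tools={"bash"},
)
```

The scripted run at the end uses a canned model and a canned tool so the control flow can be followed without a provider.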
A typical turn sequence for the voltage-drop task:
user: "Calculate voltage drop using bash and the supplied cable data."
assistant: [tool_call: bash, command: "python voltage_calc.py"]
tool: [stdout: "V_drop = 5.2V (compliant)"]
assistant: [writes /workspace/output.jsonl]

RLM (Recursive Language Model)
RLM is based on Recursive Language Models (Zhang, Kraska, Khattab, 2025). In AECBench, it gives the model a sandboxed Python REPL, persistent scratchpad, helper functions, and optional compaction for long runs.
# rlm.toml
[guardrails]
token_budget = 100_000
max_iterations = 20
max_subcall_depth = 3
[execution]
context_limit = 1_000_000
compaction_threshold_pct = 0.85

Key features:
- Persistent notes: NOTE("key", value) and RECALL("key") store working data outside the conversation history.
- REPL helpers: HELP, SHOW_VARS, grep, parallel, fill_parallel, FINAL, and FINAL_VAR support file inspection, calculation, report filling, and finalisation.
- Compaction: when context approaches compaction_threshold_pct, older turns can be summarised while preserving variables, scratchpad entries, and template progress.
- Guardrails: token_budget, max_iterations, max_subcall_depth, and optional budget caps keep long runs bounded.
Use RLM when the model needs to inspect documents in chunks, run calculations, preserve intermediate state, or recover from partial progress without starting over.
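The relationship between the scratchpad and compaction can be sketched in a few lines. Everything here is a simplified stand-in rather than the RLM implementation: Scratchpad, should_compact, and compact are hypothetical names, the summarise callable would be a model call in practice, and the sketch assumes context usage and context_limit are measured in the same unit.

```python
from typing import Any, Callable


class Scratchpad:
    """Stand-in for NOTE/RECALL: state kept outside the conversation history."""

    def __init__(self) -> None:
        self._notes: dict = {}

    def note(self, key: str, value: Any) -> None:
        self._notes[key] = value

    def recall(self, key: str, default: Any = None) -> Any:
        return self._notes.get(key, default)


def should_compact(context_used: int, context_limit: int = 1_000_000,
                   threshold_pct: float = 0.85) -> bool:
    """True once usage reaches the configured fraction of the limit."""
    return context_used >= context_limit * threshold_pct


def compact(history: list, keep_recent: int, summarise: Callable[[list], str]) -> list:
    """Replace older turns with one summary entry and keep the newest turns."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(history)
    return [{"role": "summary", "content": summarise(older)}] + recent


pad = Scratchpad()
pad.note("cable_length_m", 42.0)

history = [{"role": "assistant", "content": f"turn {i}"} for i in range(6)]
if should_compact(context_used=900_000):
    history = compact(history, keep_recent=2, summarise=lambda turns: f"{len(turns)} turns summarised")

# The note survives because it never lived in the history.
length = pad.recall("cable_length_m")
```

The point of the design is that compaction only rewrites the conversation. Anything stored through the scratchpad is untouched, so the model can resume from its notes.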
Lambda-RLM
Lambda-RLM is a structured report workflow rather than a free-form tool loop. It is designed for tasks where the output shape is known before the run starts: scopes, compliance reports, design notes, fee proposals, and other templated technical documents.
The high-level phases are:
- Plan - build an extraction schedule from the report template
- Extract - pull required facts from source documents
- Review (optional) - check extraction against contract requirements
- Generate - produce prose per section
- Output - assemble the final document
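The phases above can be sketched as a small pipeline. This is an illustration only: run_report and its callables are hypothetical, the template is reduced to a mapping of section names to field names, and the real planner, reviewer, and generator are considerably richer.

```python
from typing import Callable, Optional


def run_report(
    template: dict,                      # section name -> list of field names
    sources: dict,                       # stand-in for source documents
    extract: Callable[[str, str, dict], str],
    generate: Callable[[str, dict], str],
    review: Optional[Callable[[dict], dict]] = None,
) -> str:
    # Plan: derive an extraction schedule from the template.
    schedule = [(section, name) for section, names in template.items() for name in names]

    # Extract: pull each required fact from the sources.
    facts = {(section, name): extract(section, name, sources) for section, name in schedule}

    # Review (optional): let a checker amend or reject extracted facts.
    if review is not None:
        facts = review(facts)

    # Generate: produce prose for each section from its own facts.
    prose = {
        section: generate(section, {name: facts[(section, name)] for name in names})
        for section, names in template.items()
    }

    # Output: assemble sections in template order.
    return "\n\n".join(f"{section}\n{prose[section]}" for section in template)


report = run_report(
    template={"Scope": ["client", "site"], "Fees": ["total"]},
    sources={"client": "Acme", "site": "Perth", "total": "10k"},
    extract=lambda section, name, src: src[name],
    generate=lambda section, facts: ", ".join(f"{k}={v}" for k, v in facts.items()),
)
```

Because the schedule is derived from the template before any extraction happens, the output shape is fixed up front, which is the trade-off this harness makes.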
# lambda-rlm.toml
[template]
tier = "dependency_tree"
definition = "report_template.toml"
[planner]
context_window_chars = 200_000
max_branching_factor = 4
[review]
enabled = true
max_retries_per_source = 1
max_supplements_per_section = 1
[guardrails]
token_budget = 500_000
[execution]
max_parallel_workers = 4

Current Lambda-RLM runs can use:
- Report templates with guided sections, composed sections, fields, fragments, and required-output metadata
- Planning passes that seed a compose scratchpad before section generation
- Block-task routing so different section types can use different handlers
- Structure enforcement that validates generated blocks against required fields and retries targeted gaps
- Grounding reports that check whether generated content is supported by declared source material
- Best-of-K synthesis where multiple candidate blocks can be generated and synthesised into a final section
Lambda-RLM trades free-form exploration for repeatable extraction and composition. It is a poor fit for open-ended tasks where the agent must decide the workflow from scratch.
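As one example of the capabilities listed above, structure enforcement can be pictured as a validate-then-retry loop over required fields. The sketch below is a guess at the general shape, not the Lambda-RLM code: missing_fields, enforce_structure, and the fill_gap callable are hypothetical, and a block is reduced to a flat dictionary.

```python
from typing import Callable


def missing_fields(block: dict, required: list) -> list:
    """Required field names that are absent, None, or blank in the block."""
    gaps = []
    for name in required:
        value = block.get(name)
        if value is None or not str(value).strip():
            gaps.append(name)
    return gaps


def enforce_structure(
    block: dict,
    required: list,
    fill_gap: Callable[[str, dict], str],  # hypothetical targeted regeneration hook
    max_retries: int = 1,
) -> tuple:
    """Retry only the missing fields; return (block, still_missing)."""
    block = dict(block)  # do not mutate the caller's block
    gaps = missing_fields(block, required)
    for _ in range(max_retries):
        if not gaps:
            break
        for name in gaps:
            block[name] = fill_gap(name, block)
        gaps = missing_fields(block, required)
    return block, gaps


fixed, still_missing = enforce_structure(
    {"summary": "Cable run is compliant.", "standard": ""},
    required=["summary", "standard", "clause"],
    fill_gap=lambda name, blk: "" if name == "clause" else f"filled:{name}",
)
```

Retrying only the gaps keeps already valid content stable between attempts, and the returned list makes any remaining gap visible to the caller.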
Tools
Task-declared tools use the ToolSpec contract:
class ToolSpec(StrictModel):
    name: NonEmptyStr
    source: str  # relative path in the task dir
    description: NonEmptyStr
    returns_image: bool = False

Tool Loop resolves each declared name to a ToolExecutor in its registry:

class ToolExecutor(Protocol):
    def execute(self, tool_name: str, arguments: dict[str, Any]) -> ToolExecutionResult: ...

The built-in bash executor runs shell commands in the staged workspace and returns stdout, stderr, and exit code. Custom tools register by name against the same protocol.
RLM and Lambda-RLM expose different capabilities. RLM works through the Python REPL and injected helpers. Lambda-RLM works through its report template, source mapping, sandbox, review, synthesis, and grounding configuration.
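For Tool Loop, registering a custom tool by name might look like the sketch below. It assumes a plain dictionary as the registry and a result type with stdout, stderr, and exit_code fields. SketchToolResult, SketchToolExecutor, MillimetreConverter, and dispatch are hypothetical names, and the actual registration API may differ.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class SketchToolResult:
    """Stand-in for ToolExecutionResult; the field names are an assumption."""

    stdout: str
    stderr: str = ""
    exit_code: int = 0


class SketchToolExecutor(Protocol):
    def execute(self, tool_name: str, arguments: dict) -> SketchToolResult: ...


class MillimetreConverter:
    """Hypothetical custom tool: converts millimetres to metres."""

    def execute(self, tool_name: str, arguments: dict) -> SketchToolResult:
        try:
            metres = float(arguments["mm"]) / 1000.0
        except (KeyError, TypeError, ValueError) as exc:
            # Report bad input through the result instead of raising.
            return SketchToolResult(stdout="", stderr=f"bad arguments: {exc!r}", exit_code=1)
        return SketchToolResult(stdout=f"{metres} m")


# Registry keyed by the tool name a task would declare.
registry: dict = {"mm_to_m": MillimetreConverter()}


def dispatch(tool_name: str, arguments: dict) -> SketchToolResult:
    """Look up the executor registered under tool_name and run it."""
    return registry[tool_name].execute(tool_name, arguments)


result = dispatch("mm_to_m", {"mm": 2500})
```

Because the executor is matched structurally through the protocol, MillimetreConverter does not need to inherit from anything.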
Failure modes
When something goes wrong, failure_kind tells the runner which structural category failed. Python enum names are uppercase; serialized values are lowercase:
| Kind | Meaning |
|---|---|
| provider_error | LLM API returned an error, such as a rate limit or 5xx |
| turn_limit_reached | Tool Loop exhausted max_turns without output |
| timeout | Harness exceeded the wall-clock budget |
| undeclared_tool_request | Agent tried to use a tool not in the task whitelist |
| tool_execution_failed | A tool call raised an error |
| missing_output | Harness finished but no output file was written |
The evaluation pipeline treats any failure_kind as a structural failure. The reward is 0.0, regardless of how close the transcript looked.
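The naming convention and the zero-reward rule can be illustrated with a small enum. FailureKindSketch is a stand-in that lists only the kinds in the table above, and modelling it as a string-valued Enum is an assumption. final_reward is a hypothetical helper, not the evaluation pipeline's API.

```python
from enum import Enum
from typing import Optional


class FailureKindSketch(str, Enum):
    """Uppercase Python names, lowercase serialized values."""

    PROVIDER_ERROR = "provider_error"
    TURN_LIMIT_REACHED = "turn_limit_reached"
    TIMEOUT = "timeout"
    UNDECLARED_TOOL_REQUEST = "undeclared_tool_request"
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    MISSING_OUTPUT = "missing_output"


def final_reward(failure_kind: Optional[FailureKindSketch], scored_reward: float) -> float:
    """Any structural failure overrides the score with 0.0."""
    return 0.0 if failure_kind is not None else scored_reward


# Round trip between the serialized value and the enum member.
kind = FailureKindSketch("turn_limit_reached")
reward = final_reward(kind, scored_reward=0.9)
```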
Transcripts
Runs can emit a structured JSONL trajectory, usually at /workspace/trajectory.jsonl inside the staged workspace. Entries are validated as TrajectoryEntry records:
{"version": 1, "format": "aec-bench-trajectory"}
{"role": "system", "step": 0, "content": "..."}
{"role": "user", "step": 1, "content": "Calculate voltage drop..."}
{"role": "assistant", "step": 2, "content": "I'll run the calculation..."}
{"role": "tool_call", "step": 2, "tool_name": "bash", "command": "python voltage_calc.py"}
{"role": "tool_result", "step": 2, "tool_name": "bash", "stdout": "V_drop = 5.2V", "exit_code": 0}

Because trajectory entries are flushed during execution, a crashed run can still leave a readable partial trace. Trajectories feed trace inspection and behavioural classification (see Evaluation).
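A reader for partial traces might look like the sketch below. It assumes the first record is the header (the one without a role key) and that a crash can leave a truncated final line. read_partial_trajectory is a hypothetical helper and does not perform the TrajectoryEntry validation mentioned above.

```python
import json


def read_partial_trajectory(text: str) -> tuple:
    """Parse JSONL trajectory text, stopping at the first unreadable line.

    Returns (header, entries). header is None if no header record was seen.
    """
    header = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Most likely a line cut off mid-write; treat it as the crash point.
            break
        if not isinstance(record, dict):
            break
        if header is None and "role" not in record:
            header = record
        else:
            entries.append(record)
    return header, entries


sample = "\n".join([
    '{"version": 1, "format": "aec-bench-trajectory"}',
    '{"role": "user", "step": 1, "content": "Calculate voltage drop..."}',
    '{"role": "tool_call", "step": 2, "tool_na',  # truncated by a crash
])
header, entries = read_partial_trajectory(sample)
```

Stopping at the first malformed line, rather than skipping it, keeps the returned entries a contiguous prefix of the run.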
Templates
A template defines a parameterised engineering problem family that generates concrete task instances through parameter sampling and Jinja2 rendering.
Configuration
Agent configuration covers provider credentials, experiment agent entries, harness parameters, and custom internal adapter registration.