Architecture

aec-bench is organised as a dependency-layered pipeline of seven domains, from contracts and tasks through to evaluation and reporting.

End-to-End Flow

A benchmark run goes through six stages:

Define Task

Resolve Instance

Stage Environment

Execute Agent

Score Output

Aggregate & Report

Define Task

Resolve Instance

Stage Environment

Execute Agent

Score Output

Aggregate & Report

Stage	What happens	Key module
Define	A `task.toml` describes the problem, `instruction.md` provides the prompt, and a verifier script defines how to score	`contracts/task_definition`
Resolve	Paths are resolved, tools discovered, and the task is packaged into a `ResolvedTaskInstance`	`tasks/instance`
Stage	A workspace or sandbox is prepared, with declared files, fixtures, and tools staged for the selected backend	`harness/local_environment`
Execute	An agent harness runs the task — calling tools, writing output, recording a trajectory	`adapters/*`
Score	The task's verifier, usually `tests/verify.py`, runs against the agent's output, producing a reward (0.0–1.0) and per-field details	`evaluation/pipeline`
Aggregate	Trial records are appended to the ledger, scores aggregated, and reports generated	`ledger`, `communication`

Core Domains

The library is organised into seven domains with strict dependency boundaries — each layer only depends on layers above it.

Contracts

Tasks

Adapters

Generation

Harness

Evaluation

Communication

Feedback

Contracts

Tasks

Adapters

Generation

Harness

Evaluation

Communication

Feedback

Contracts

The foundation layer. Pydantic models that define the shape of data at every boundary:

TaskDefinition — what a task is (prompt, environment, verifier, difficulty)
TrialRecord — immutable record of one agent run with full provenance
EvaluationResult — reward score, per-field breakdown, confidence, behavioural classification
ExperimentManifest — which tasks x which agents x how many repetitions

All other domains depend on contracts. Contracts depend on nothing.

Agent harnesses

Provider-neutral agent execution strategies. Four built-in harnesses share the same internal execution protocol:

Harness	How it works	Best for
Direct	Single LLM call, no tools	Classification, extraction
Tool Loop	Multi-turn with tool calls (max N turns)	General tasks with tool use
RLM	Sandboxed REPL with structured sub-calls	Complex calculations, iterative work
Lambda-RLM	Pre-planned steps with bounded execution	Structured reports, known workflows

All agent harnesses produce the same output envelope: the agent's answer, a full execution trajectory, and token usage.

See Agent Harnesses for details on each strategy.

Parameterised templates that produce reproducible task instances. A template defines parameters (soil type, load magnitude, cable length), archetypes (sandy soil, clay), and difficulty presets. The generator samples parameters and renders concrete tasks.

Template (Jinja2 + ParamSpec)
    → sample parameters
    → render instruction
    → generate verifier
    → scaffold task directory

See Templates for the template engine reference.

Execution harness

Orchestrates trial execution. Handles container lifecycle, tool staging, agent-harness dispatch, and trajectory recording. Runs trials on pluggable compute backends:

Local — in-process execution on your machine
Modal — serverless cloud execution
Docker, e2b, Daytona — alternative sandbox targets

For production deployment across a shared fleet, aec-bench dispatches jobs through Harbor — a separate orchestration service that schedules trials across backends. Harbor is not an agent harness; it decides where trial jobs run. See Deployment.

Evaluation

Multi-stage scoring pipeline:

Verifier — task-specific script that checks agent output against expected results
Reward — normalised score (0.0–1.0) with per-field breakdown
Confidence — statistical metadata (Cohen's kappa, Wilson CI)
Behavioural classification — categorise agent turns using the bond taxonomy: execution, verification, deliberation, and exploration

Communication

Reporting and visualisation: HTML reports, leaderboard tables, trace viewers, and structured exports (JSON, JSONL).

Feedback

Structured expert review, calibration, adjudication, and benchmark improvement. Feedback captures human judgments as provenanced data, feeds task and verifier fixes back into the benchmark, and keeps holdout-sensitive observations out of public or training-visible surfaces.

Design Principles

aec-bench follows a strict priority order when design decisions conflict:

Priority	Principle	What it means
1	Validity	Results reflect real agent capability, not benchmark artefacts
2	Reproducibility	Any result can be reconstructed from recorded inputs
3	Coverage	Benchmark spans meaningful AEC domain breadth and depth
4	Cost	Experiments are affordable enough to run frequently
5	Throughput	New tasks, agents, and models can be onboarded quickly

Key invariants

Trial records are immutable — once written to the ledger, a trial record is never modified
Tasks are self-contained — everything needed to run and score a task lives in its directory
Agent harnesses are provider-neutral — the same task can be run with any LLM provider
Verifier scoring is deterministic — the same output and verifier revision produce the same mechanical score

On this page