Architecture
aec-bench is organised as a dependency-layered pipeline of seven domains, from contracts and tasks through to evaluation and reporting.
End-to-End Flow
A benchmark run goes through six stages:
| Stage | What happens | Key module |
|---|---|---|
| Define | A task.toml describes the problem, instruction.md provides the prompt, and a verifier script defines how to score | contracts/task_definition |
| Resolve | Paths are resolved, tools discovered, and the task is packaged into a ResolvedTaskInstance | tasks/instance |
| Stage | A workspace or sandbox is prepared, with declared files, fixtures, and tools staged for the selected backend | harness/local_environment |
| Execute | An agent harness runs the task — calling tools, writing output, recording a trajectory | adapters/* |
| Score | The task's verifier, usually tests/verify.py, runs against the agent's output, producing a reward (0.0–1.0) and per-field details | evaluation/pipeline |
| Aggregate | Trial records are appended to the ledger, scores aggregated, and reports generated | ledger, communication |
Core Domains
The library is organised into seven domains with strict dependency boundaries — each layer only depends on layers above it.
Contracts
The foundation layer. Pydantic models that define the shape of data at every boundary:
TaskDefinition— what a task is (prompt, environment, verifier, difficulty)TrialRecord— immutable record of one agent run with full provenanceEvaluationResult— reward score, per-field breakdown, confidence, behavioural classificationExperimentManifest— which tasks x which agents x how many repetitions
All other domains depend on contracts. Contracts depend on nothing.
Tasks
Loading, discovery, and lifecycle management of benchmark tasks. Tasks live on the filesystem as directories containing task.toml, instruction.md, a Docker environment, and a verifier script.
tasks/
└── electrical/
└── cable-voltage-drop/
├── task.toml # metadata, timeout, difficulty
├── instruction.md # the prompt sent to the agent
├── environment/
│ ├── Dockerfile
│ └── fixtures/ # input data staged into the container
└── tests/
└── verify.py # scoring scriptSee Tasks for the full reference.
Agent harnesses
Provider-neutral agent execution strategies. Four built-in harnesses share the same internal execution protocol:
| Harness | How it works | Best for |
|---|---|---|
| Direct | Single LLM call, no tools | Classification, extraction |
| Tool Loop | Multi-turn with tool calls (max N turns) | General tasks with tool use |
| RLM | Sandboxed REPL with structured sub-calls | Complex calculations, iterative work |
| Lambda-RLM | Pre-planned steps with bounded execution | Structured reports, known workflows |
All agent harnesses produce the same output envelope: the agent's answer, a full execution trajectory, and token usage.
See Agent Harnesses for details on each strategy.
Generation
Parameterised templates that produce reproducible task instances. A template defines parameters (soil type, load magnitude, cable length), archetypes (sandy soil, clay), and difficulty presets. The generator samples parameters and renders concrete tasks.
Template (Jinja2 + ParamSpec)
→ sample parameters
→ render instruction
→ generate verifier
→ scaffold task directorySee Templates for the template engine reference.
Execution harness
Orchestrates trial execution. Handles container lifecycle, tool staging, agent-harness dispatch, and trajectory recording. Runs trials on pluggable compute backends:
- Local — in-process execution on your machine
- Modal — serverless cloud execution
- Docker, e2b, Daytona — alternative sandbox targets
For production deployment across a shared fleet, aec-bench dispatches jobs through Harbor — a separate orchestration service that schedules trials across backends. Harbor is not an agent harness; it decides where trial jobs run. See Deployment.
Evaluation
Multi-stage scoring pipeline:
- Verifier — task-specific script that checks agent output against expected results
- Reward — normalised score (0.0–1.0) with per-field breakdown
- Confidence — statistical metadata (Cohen's kappa, Wilson CI)
- Behavioural classification — categorise agent turns using the bond taxonomy: execution, verification, deliberation, and exploration
Communication
Reporting and visualisation: HTML reports, leaderboard tables, trace viewers, and structured exports (JSON, JSONL).
Feedback
Structured expert review, calibration, adjudication, and benchmark improvement. Feedback captures human judgments as provenanced data, feeds task and verifier fixes back into the benchmark, and keeps holdout-sensitive observations out of public or training-visible surfaces.
Design Principles
aec-bench follows a strict priority order when design decisions conflict:
| Priority | Principle | What it means |
|---|---|---|
| 1 | Validity | Results reflect real agent capability, not benchmark artefacts |
| 2 | Reproducibility | Any result can be reconstructed from recorded inputs |
| 3 | Coverage | Benchmark spans meaningful AEC domain breadth and depth |
| 4 | Cost | Experiments are affordable enough to run frequently |
| 5 | Throughput | New tasks, agents, and models can be onboarded quickly |
Key invariants
- Trial records are immutable — once written to the ledger, a trial record is never modified
- Tasks are self-contained — everything needed to run and score a task lives in its directory
- Agent harnesses are provider-neutral — the same task can be run with any LLM provider
- Verifier scoring is deterministic — the same output and verifier revision produce the same mechanical score