aec-benchaec-bench

Architecture

aec-bench is organised as a dependency-layered pipeline of seven domains, from contracts and tasks through to evaluation and reporting.

End-to-End Flow

A benchmark run goes through six stages:

StageWhat happensKey module
DefineA task.toml describes the problem, instruction.md provides the prompt, and a verifier script defines how to scorecontracts/task_definition
ResolvePaths are resolved, tools discovered, and the task is packaged into a ResolvedTaskInstancetasks/instance
StageA workspace or sandbox is prepared, with declared files, fixtures, and tools staged for the selected backendharness/local_environment
ExecuteAn agent harness runs the task — calling tools, writing output, recording a trajectoryadapters/*
ScoreThe task's verifier, usually tests/verify.py, runs against the agent's output, producing a reward (0.0–1.0) and per-field detailsevaluation/pipeline
AggregateTrial records are appended to the ledger, scores aggregated, and reports generatedledger, communication

Core Domains

The library is organised into seven domains with strict dependency boundaries — each layer only depends on layers above it.

Contracts

The foundation layer. Pydantic models that define the shape of data at every boundary:

  • TaskDefinition — what a task is (prompt, environment, verifier, difficulty)
  • TrialRecord — immutable record of one agent run with full provenance
  • EvaluationResult — reward score, per-field breakdown, confidence, behavioural classification
  • ExperimentManifest — which tasks x which agents x how many repetitions

All other domains depend on contracts. Contracts depend on nothing.

Tasks

Loading, discovery, and lifecycle management of benchmark tasks. Tasks live on the filesystem as directories containing task.toml, instruction.md, a Docker environment, and a verifier script.

tasks/
└── electrical/
    └── cable-voltage-drop/
        ├── task.toml           # metadata, timeout, difficulty
        ├── instruction.md      # the prompt sent to the agent
        ├── environment/
        │   ├── Dockerfile
        │   └── fixtures/       # input data staged into the container
        └── tests/
            └── verify.py       # scoring script

See Tasks for the full reference.

Agent harnesses

Provider-neutral agent execution strategies. Four built-in harnesses share the same internal execution protocol:

HarnessHow it worksBest for
DirectSingle LLM call, no toolsClassification, extraction
Tool LoopMulti-turn with tool calls (max N turns)General tasks with tool use
RLMSandboxed REPL with structured sub-callsComplex calculations, iterative work
Lambda-RLMPre-planned steps with bounded executionStructured reports, known workflows

All agent harnesses produce the same output envelope: the agent's answer, a full execution trajectory, and token usage.

See Agent Harnesses for details on each strategy.

Generation

Parameterised templates that produce reproducible task instances. A template defines parameters (soil type, load magnitude, cable length), archetypes (sandy soil, clay), and difficulty presets. The generator samples parameters and renders concrete tasks.

Template (Jinja2 + ParamSpec)
    → sample parameters
    → render instruction
    → generate verifier
    → scaffold task directory

See Templates for the template engine reference.

Execution harness

Orchestrates trial execution. Handles container lifecycle, tool staging, agent-harness dispatch, and trajectory recording. Runs trials on pluggable compute backends:

  • Local — in-process execution on your machine
  • Modal — serverless cloud execution
  • Docker, e2b, Daytona — alternative sandbox targets

For production deployment across a shared fleet, aec-bench dispatches jobs through Harbor — a separate orchestration service that schedules trials across backends. Harbor is not an agent harness; it decides where trial jobs run. See Deployment.

Evaluation

Multi-stage scoring pipeline:

  1. Verifier — task-specific script that checks agent output against expected results
  2. Reward — normalised score (0.0–1.0) with per-field breakdown
  3. Confidence — statistical metadata (Cohen's kappa, Wilson CI)
  4. Behavioural classification — categorise agent turns using the bond taxonomy: execution, verification, deliberation, and exploration

Communication

Reporting and visualisation: HTML reports, leaderboard tables, trace viewers, and structured exports (JSON, JSONL).

Feedback

Structured expert review, calibration, adjudication, and benchmark improvement. Feedback captures human judgments as provenanced data, feeds task and verifier fixes back into the benchmark, and keeps holdout-sensitive observations out of public or training-visible surfaces.

Design Principles

aec-bench follows a strict priority order when design decisions conflict:

PriorityPrincipleWhat it means
1ValidityResults reflect real agent capability, not benchmark artefacts
2ReproducibilityAny result can be reconstructed from recorded inputs
3CoverageBenchmark spans meaningful AEC domain breadth and depth
4CostExperiments are affordable enough to run frequently
5ThroughputNew tasks, agents, and models can be onboarded quickly

Key invariants

  • Trial records are immutable — once written to the ledger, a trial record is never modified
  • Tasks are self-contained — everything needed to run and score a task lives in its directory
  • Agent harnesses are provider-neutral — the same task can be run with any LLM provider
  • Verifier scoring is deterministic — the same output and verifier revision produce the same mechanical score

On this page