aec-benchaec-bench

Contracts

Contracts are the Pydantic models that define the shape of data crossing every boundary in aec-bench.

The central run path connects TaskDefinition, AgentOutput, EvaluationResult, and TrialRecord:

TaskDefinition — what to solve

A TaskDefinition is the complete description of a benchmark problem. It carries the prompt, the environment the agent runs in, and the script that scores the output.

Here's a real example — a voltage drop calculation task:

tasks/electrical/voltage-drop/task.toml
version = "1.0"

[metadata]
difficulty = "easy"
category = "reasoning"
tags = ["electrical", "buildings-electrical", "deterministic", "AS-NZS-3008"]

[agent]
timeout_sec = 600.0

[verifier]
timeout_sec = 120.0

[environment]
extensions = []
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 5120
allow_internet = true

The instruction.md alongside it contains the actual prompt:

tasks/electrical/voltage-drop/instruction.md
You are a senior electrical engineer specializing in building services.

## Problem

Calculate the voltage drop for a three-phase cable circuit using the
impedance method, and determine whether it complies with the maximum
allowable voltage drop limit.

## Given

| Parameter | Value | Unit |
|-----------|-------|------|
| Load current | 45 | A |
| Cable length (one way) | 80 | m |
| Cable resistance (R) | 0.524 | ohm/km |
| ...

Every task has three key pieces:

  • Instruction — the self-contained prompt sent to the agent
  • Environment — a declared runtime environment with tools, fixtures, and resource limits
  • Verifier — a Python script that computes ground truth and scores the agent's output

Lifecycle and visibility

Tasks have two independent axes that control when and where they appear:

publicholdout
activeStandard benchmark taskRuns but hidden from reports (anti-contamination)
deprecatedStill runnable, shown as supersededRarely used

proposed tasks aren't runnable yet. retired tasks are archived permanently.

AgentOutput — what came back

When an agent harness finishes executing, it returns an AgentOutput — a minimal envelope that says what the agent produced and where to find it:

class AgentOutput(StrictModel):
    status: AgentOutputStatus   # completed | partial | failed | empty
    output_path: NonEmptyStr    # where the result file lives
    output_format: NonEmptyStr  # "jsonl", "markdown", "json"
    error_message: str | None   # what went wrong (if anything)

The harness reports status and location; parsing the output file is the evaluation pipeline's job.

EvaluationResult — how it scored

The evaluation pipeline reads the agent's output file, runs the task's verifier script, and produces an EvaluationResult:

class EvaluationResult(StrictModel):
    reward: float                # 0.0 to 1.0
    validity: ValidityCheck      # did the output parse correctly?
    breakdown: dict | None       # per-field scores
    error_taxonomy: list | None  # classified errors
    confidence: ConfidenceMetadata | None
    annotations: list | None     # human review verdicts

The reward is a normalised score from 0.0 to 1.0. The supporting fields add diagnostic detail:

Validity checks whether the output was structurally sound before scoring:

class ValidityCheck(StrictModel):
    output_parseable: bool    # could we read the file at all?
    schema_valid: bool        # did it match the expected format?
    verifier_completed: bool  # did the scoring script finish?
    errors: list[str]         # what went wrong

Breakdown gives per-field scores. For the voltage drop task, this looks like:

{
  "voltage_drop_v": 0.95,
  "voltage_drop_pct": 1.0,
  "compliance": 1.0
}

One strict rule: if the output isn't parseable, the reward is always 0.0. No partial credit for unreadable output.

TrialRecord — the permanent record

A TrialRecord captures everything about one agent run — enough provenance to reproduce it from scratch.

class TrialRecord(StrictModel):
    trial_id: NonEmptyStr
    experiment_id: NonEmptyStr
    dataset_id: str | None
    timestamp: datetime

    # What was tested
    task: TaskReference              # task_id + content hash
    agent: AgentReference            # harness, model, configuration

    # How it ran
    environment: EnvironmentSnapshot # runtime image, compute backend
    inputs: InputRecord              # instruction, system prompt, files

    # What happened
    outputs: OutputRecord            # agent output + trajectory paths
    evaluation: EvaluationResult     # the score

    # Metadata
    timing: TimingRecord             # wall-clock times per phase
    cost: CostRecord | None          # tokens + estimated USD
    adaptation: AdaptationProvenance | None
    completeness: Completeness       # complete | partial

Trial records are immutable and append-only. Once written to the ledger, they are never modified. This is the core reproducibility guarantee.

A complete trial must include full provenance — harness revision, tool versions, and input file hashes. Partial records are allowed during execution but cannot be promoted to the official ledger.

Trajectory entries

A TrajectoryEntry records one event within an agent run. These entries are written to trajectory.jsonl so evaluators and trace viewers can inspect how a result was produced, not just the final answer.

class TrajectoryEntry(StrictModel):
    step: int
    role: str                 # assistant | tool_call | tool_result | system | user
    content: str | None
    tool_name: str | None
    command: str | None
    arguments: dict | None
    stdout: str | None
    stderr: str | None
    exit_code: int | None
    duration_ms: int | None
    media: list[str] | None
    timestamp: str | None

Cost tracking

Every trial optionally records token usage and cost:

class CostRecord(StrictModel):
    tokens_in: int | None
    tokens_out: int | None
    cache_read_tokens: int | None
    cache_write_tokens: int | None
    estimated_cost_usd: float | None

This feeds into the leaderboard's cost-efficiency analysis — how much does it cost a model to achieve a given score?

ExperimentManifest — what to run

An experiment defines which tasks to run, with which agent harnesses, how many times:

experiment.yaml
experiment_id: claude-vs-gpt-electrical
name: Claude vs GPT on electrical tasks
repetitions: 3

tasks:
  dataset: electrical-v1@1.0.0
  difficulties:
    - easy
    - medium

agents:
  - name: claude-sonnet
    harness: tool_loop
    model: $ANTHROPIC_MODEL
  - name: gpt-4.1
    harness: tool_loop
    model: gpt-4.1

compute:
  backend: modal

The TaskSelector supports filtering by dataset, domains, difficulties, and glob patterns. The model field supports $ENV_VAR references so model names can stay deployment-specific.

Design rules

A few principles that apply across all contracts:

  • No extra fieldsStrictModel rejects any field not in the schema. This catches typos and version mismatches early.
  • Non-empty strings — the NonEmptyStr type prevents accidental blank values. A task with an empty instruction fails validation at load time, not at runtime.
  • Relative paths — all file paths in contracts are relative to the task directory. No absolute paths leak into the data model.
  • Validators over conventions — invariants like "unparseable output = reward 0.0" are enforced by Pydantic model validators, not by documentation.

On this page