aec-bench

Tasks

A task is a self-contained engineering problem — prompt, environment, and verifier — that an agent attempts and the system scores.

aec-bench uses the word "task" at three levels:

| Layer | What it means | Typical file |
|-------|---------------|--------------|
| Source task spec | A proposed benchmark idea captured from the catalogue or source material | source_task.json |
| Template | A parameterised task family that can produce many concrete scenarios | template metadata and renderers |
| Runnable task instance | A concrete problem with a prompt, runtime, and verifier | task.toml + instruction.md |

Public benchmark reports should count a specific dataset or release, not the whole repository tree. The repository contains catalogue seeds, templates, generated instances, and development task families; these are all useful assets, but they are not interchangeable and should not be summed into a single benchmark count.

Anatomy of a task

A runnable task instance is a directory with required files and optional review aids:

tasks/electrical/voltage-drop/
├── task.toml          # metadata and resource limits
├── instruction.md     # the prompt sent to the agent
├── environment/
│   └── Dockerfile     # runtime image for the task
├── solution/
│   └── solve.py       # reference solution (not sent to agent)
└── tests/
    ├── test.sh        # verifier entry point
    ├── verify.py      # scoring logic
    └── fixtures/      # golden pass/fail examples for verifier testing
        ├── golden_pass.md
        └── golden_fail.md

The loader requires task.toml, instruction.md, an environment Dockerfile, and a verifier entry point. solution/ and golden fixtures are strongly recommended for review, but they are not part of the agent prompt.
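
As a rough illustration of the required-file rule above, the sketch below checks a task directory for the four required paths. The function name `missing_required_files` and the exact relative paths (taken from the example tree, with tests/test.sh assumed as the verifier entry point) are assumptions for illustration, not the loader's actual implementation.

```python
from pathlib import Path

# Relative paths copied from the example tree above; the real loader
# may accept other layouts (assumption).
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_required_files(task_dir: Path) -> list[str]:
    """Return the required files that are absent from a task directory."""
    return [rel for rel in REQUIRED_FILES if not (task_dir / rel).is_file()]
```

An empty list means the directory has everything this sketch looks for; solution/ and the golden fixtures are deliberately not checked because they are optional.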

The prompt

instruction.md is the prompt sent to the agent. It must stand on its own and explicitly name any staged files, outputs, or constraints the agent is expected to use.

instruction.md (voltage drop task)
You are a senior electrical engineer specializing in building services.

## Problem
Calculate the voltage drop for a three-phase cable circuit using the
impedance method, and determine whether it complies with the maximum
allowable voltage drop limit.

## Given
| Parameter | Value | Unit |
|-----------|-------|------|
| Load current | 45 | A |
| Cable length (one way) | 80 | m |
| Cable resistance (R) | 0.524 | ohm/km |
| Cable reactance (X) | 0.08 | ohm/km |
| Power factor (cos phi) | 0.85 | - |
| System voltage (line-to-line) | 400 | V |
| Maximum allowable voltage drop | 5 | % |

## Required
Calculate the following:
- Voltage drop (V)
- Voltage drop as a percentage of system voltage (%)
- Compliance with the 5% limit (1 if compliant, 0 if not)

The configuration

task.toml describes metadata, timeouts, and runtime limits — not the problem itself. The loader combines this file with instruction.md and produces a TaskDefinition contract.

task.toml
version = "1.0"

[metadata]
difficulty = "easy"
category = "reasoning"
tags = ["electrical", "buildings-electrical", "deterministic", "AS-NZS-3008"]

[agent]
timeout_sec = 600.0

[verifier]
timeout_sec = 120.0

[environment]
extensions = []
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 5120
allow_internet = true

The environment

Each runnable task provides a Dockerfile that defines its runtime image. This is usually minimal — just the tools the task requires:

environment/Dockerfile
FROM --platform=linux/amd64 ubuntu:24.04

RUN apt-get update && apt-get install -y \
    python3 \
    bc \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

Some tasks stage input files such as spreadsheets, drawings, code extracts, or reference data into the runtime via fixture directories. The agent reads them as normal files on disk.

The verifier

The verifier computes ground truth independently, reads the agent's output, and writes a reward (0.0–1.0) plus per-field details. For Harbor-backed runs, tests/test.sh is the entry point; it usually calls tests/verify.py.

Here's the core scoring logic from the voltage drop verifier:

tests/verify.py
import math


def compute_ground_truth() -> dict[str, float]:
    """Compute expected answers from three-phase voltage drop formula."""
    current = 45.0 # load current (A)
    L = 80.0       # cable length (m)
    R = 0.524      # resistance (ohm/km)
    X = 0.08       # reactance (ohm/km)
    pf = 0.85      # power factor
    V = 400.0      # system voltage (V)

    sin_phi = math.sin(math.acos(pf))
    Vd = math.sqrt(3) * current * L * (R * pf + X * sin_phi) / 1000
    Vd_pct = Vd / V * 100
    compliance = 1.0 if Vd_pct <= 5.0 else 0.0

    return {
        "voltage_drop_v": Vd,
        "voltage_drop_pct": Vd_pct,
        "compliance": compliance,
    }

The verifier writes two files under /logs/verifier/:

  • reward.json — the headline score: {"reward": 0.67}
  • details.json — per-field breakdown: {"voltage_drop_v": 1.0, "voltage_drop_pct": 1.0, "compliance": 0.0}

Each field is scored independently with configurable tolerances. Numerical fields typically allow 3% relative tolerance; boolean/categorical fields require an exact match.
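
The per-field scoring described above could look roughly like the sketch below. It assumes the headline reward is the unweighted mean of the field scores rounded to two decimals, which matches the 0.67 example but is not stated by the source; the function names and the `out_dir` parameter are hypothetical.

```python
import json
from pathlib import Path


def score_numeric(expected: float, actual: float, rel_tol: float = 0.03) -> float:
    """Return 1.0 if actual is within rel_tol of expected, else 0.0."""
    return 1.0 if abs(actual - expected) <= rel_tol * abs(expected) else 0.0


def score_exact(expected: float, actual: float) -> float:
    """Return 1.0 only on an exact match (boolean/categorical fields)."""
    return 1.0 if actual == expected else 0.0


def score_submission(truth: dict, answer: dict) -> tuple[float, dict]:
    """Score each field independently, then average (averaging is an assumption)."""
    details = {
        "voltage_drop_v": score_numeric(truth["voltage_drop_v"], answer["voltage_drop_v"]),
        "voltage_drop_pct": score_numeric(truth["voltage_drop_pct"], answer["voltage_drop_pct"]),
        "compliance": score_exact(truth["compliance"], answer["compliance"]),
    }
    reward = round(sum(details.values()) / len(details), 2)
    return reward, details


def write_results(out_dir: Path, reward: float, details: dict) -> None:
    """Write reward.json and details.json into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "reward.json").write_text(json.dumps({"reward": reward}))
    (out_dir / "details.json").write_text(json.dumps(details))


# Both numeric fields are within tolerance, but the compliance flag is wrong.
truth = {"voltage_drop_v": 3.04, "voltage_drop_pct": 0.76, "compliance": 1.0}
answer = {"voltage_drop_v": 3.05, "voltage_drop_pct": 0.7625, "compliance": 0.0}
reward, details = score_submission(truth, answer)
print(reward, details)
```

In a real run the output directory would be /logs/verifier/; the sketch takes it as a parameter so it can be exercised anywhere.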

Golden fixtures

Most verifiers ship with test fixtures — a golden_pass.md that should score 1.0 and a golden_fail.md that should score less. These are used to test the verifier itself, not the agent.

fixtures/golden_pass.md (excerpt)
## Step 2: Voltage Drop
Vd = sqrt(3) x 45 x 80 x (0.524 x 0.85 + 0.08 x 0.5268) / 1000
Vd = 3.0400 V

## Step 3: Percentage
Vd% = 3.0400 / 400 x 100 = 0.7600%
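
One way to picture a fixture check is sketched below: it pulls the final voltage drop out of the excerpt above and compares it with an independently computed value at 3% relative tolerance. The regex and the helper `extract_voltage_drop` are assumptions for illustration; the real verifier's parsing of agent output is not shown on this page.

```python
import math
import re

# Excerpt copied from fixtures/golden_pass.md above.
GOLDEN_PASS_EXCERPT = """\
## Step 2: Voltage Drop
Vd = sqrt(3) x 45 x 80 x (0.524 x 0.85 + 0.08 x 0.5268) / 1000
Vd = 3.0400 V

## Step 3: Percentage
Vd% = 3.0400 / 400 x 100 = 0.7600%
"""


def extract_voltage_drop(text: str) -> float:
    """Find a line of the form 'Vd = <number> V' (hypothetical output format)."""
    match = re.search(r"^Vd = ([0-9.]+) V$", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no voltage drop line found")
    return float(match.group(1))


# Recompute the expected value with the same formula the verifier uses.
pf = 0.85
expected = math.sqrt(3) * 45.0 * 80.0 * (0.524 * pf + 0.08 * math.sin(math.acos(pf))) / 1000
reported = extract_voltage_drop(GOLDEN_PASS_EXCERPT)
within_tolerance = abs(reported - expected) <= 0.03 * abs(expected)
print(reported, within_tolerance)
```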

Task lifecycle

Tasks move through four stages: proposed → active → deprecated → retired.

  • Proposed — draft task or source spec under review
  • Active — production benchmark task eligible for datasets and reports
  • Deprecated — superseded by a better version but still runnable for historical comparison
  • Retired — permanently archived and not runnable

Separately, tasks can be public (appears in reports and leaderboard) or holdout (runs but results are hidden). Holdout tasks are used for anti-contamination testing — if a model scores suspiciously well on holdout tasks it hasn't seen, that's a red flag.
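
The two independent axes above (lifecycle stage and visibility) can be modelled as a pair of enums. This is only a sketch: the class and property names are hypothetical, and treating proposed tasks as not runnable is an assumption, since the source only says that deprecated tasks remain runnable and retired ones do not.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


class Visibility(Enum):
    PUBLIC = "public"
    HOLDOUT = "holdout"


@dataclass(frozen=True)
class TaskStatus:
    stage: Stage
    visibility: Visibility

    @property
    def runnable(self) -> bool:
        # Excluding PROPOSED here is an assumption; RETIRED is stated as not runnable.
        return self.stage in (Stage.ACTIVE, Stage.DEPRECATED)

    @property
    def shown_in_reports(self) -> bool:
        # Only active tasks are report-eligible, and holdout results stay hidden.
        return self.stage is Stage.ACTIVE and self.visibility is Visibility.PUBLIC


status = TaskStatus(Stage.ACTIVE, Visibility.HOLDOUT)
print(status.runnable, status.shown_in_reports)
```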

Task categories

Categories describe the type of work a task demands. They are free-form strings, not a closed enum. Use the category that best describes the primary demand, and keep more specific descriptors in tags.

| Category | What the agent does |
|----------|---------------------|
| reasoning | Apply formulas and first principles to produce a numerical or boolean answer |
| report-generation | Produce a structured engineering or proposal deliverable from source material |
| hydraulic-calculations | Run civil hydraulic calculations against given inputs |
| load-analysis | Resolve loads, assumptions, and checks for a structural scenario |
| short-circuit | Calculate fault levels or related electrical protection values |

Difficulty levels

Each task is rated easy, medium, or hard. Ratings should be calibrated by domain experts against what a professional engineer at each level would be expected to handle.

| Difficulty | Characteristics | Calibrated against |
|------------|-----------------|--------------------|
| Easy | Single-step, all parameters given, clear formula | Junior engineer, no references needed |
| Medium | Multi-step, some judgment, may need standard lookups | Intermediate engineer |
| Hard | Complex analysis, multiple variables, judgment critical | Senior engineer |

Source specs, seeds, and generated instances

There are three common ways work enters the benchmark:

Source task specs capture proposed benchmark ideas before they are runnable. They usually live as source_task.json files and preserve the original discipline, category, standards, inputs, outputs, and source file.

Seed runnable instances are hand-authored tasks with a prompt, runtime, and custom verifier. These are useful when the task needs expert judgement, a bespoke source document, or a narrow real-world scenario.

Generated instances come from parameterised templates. A template defines the problem structure with variable parameters, and the generator samples concrete values to create reproducible task instances. Generated task.toml files include generation metadata such as template name, seed, archetype, and difficulty. See Templates for how this works.
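
The idea of seeded, reproducible generation can be sketched as follows. Everything here is hypothetical: the parameter ranges, the `VoltageDropParams` type, and the metadata key names are illustrative and are not the project's actual template API.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VoltageDropParams:
    current_a: float
    length_m: float
    pf: float


def sample_params(seed: int) -> VoltageDropParams:
    """Sample one concrete scenario; the same seed always gives the same values."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return VoltageDropParams(
        current_a=float(rng.randrange(10, 101, 5)),   # 10 to 100 A in 5 A steps
        length_m=float(rng.randrange(20, 201, 10)),   # 20 to 200 m in 10 m steps
        pf=rng.choice([0.8, 0.85, 0.9, 0.95]),
    )


def generation_metadata(template: str, seed: int, difficulty: str) -> dict:
    """Record how an instance was produced (key names are assumptions)."""
    return {"template": template, "seed": seed, "difficulty": difficulty}


first = sample_params(seed=42)
second = sample_params(seed=42)
print(first == second)  # same seed, same instance
```

Storing the seed alongside the template name is what lets a generated instance be rebuilt later from its metadata alone.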
