Tasks
A task is a self-contained engineering problem — prompt, environment, and verifier — that an agent attempts and the system scores.
aec-bench uses the word "task" at three levels:
| Layer | What it means | Typical file |
|---|---|---|
| Source task spec | A proposed benchmark idea captured from the catalogue or source material | source_task.json |
| Template | A parameterised task family that can produce many concrete scenarios | template metadata and renderers |
| Runnable task instance | A concrete problem with a prompt, runtime, and verifier | task.toml + instruction.md |
Public benchmark reports should count a specific dataset or release, not the whole repository tree. The repository contains catalogue seeds, templates, generated instances, and development task families; those are useful assets, but they do not add up to a single benchmark count.
Anatomy of a task
A runnable task instance is a directory with required files and optional review aids:
tasks/electrical/voltage-drop/
├── task.toml           # metadata and resource limits
├── instruction.md      # the prompt sent to the agent
├── environment/
│   └── Dockerfile      # runtime image for the task
├── solution/
│   └── solve.py        # reference solution (not sent to agent)
└── tests/
    ├── test.sh         # verifier entry point
    ├── verify.py       # scoring logic
    └── fixtures/       # golden pass/fail examples for verifier testing
        ├── golden_pass.md
        └── golden_fail.md

The loader requires task.toml, instruction.md, an environment Dockerfile, and a verifier entry point. solution/ and golden fixtures are strongly recommended for review, but they are not part of the agent prompt.
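The required-file rule can be sketched as a small pre-flight check. This is only an illustration: `REQUIRED_PATHS` and `missing_required_files` are hypothetical names, the paths are copied from the directory layout above, and the real loader may check more (or differently).

```python
from pathlib import Path

# Relative paths taken from the example layout; tests/test.sh is assumed
# to be the verifier entry point, as in the tree above.
REQUIRED_PATHS = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_required_files(task_dir) -> list[str]:
    """Return the required relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).is_file()]
```

An empty list means the directory has everything this sketch looks for; solution/ and the golden fixtures are deliberately not checked because they are optional.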
The prompt
instruction.md is the prompt sent to the agent. It must stand on its own and explicitly name any staged files, outputs, or constraints the agent is expected to use.
You are a senior electrical engineer specializing in building services.
## Problem
Calculate the voltage drop for a three-phase cable circuit using the
impedance method, and determine whether it complies with the maximum
allowable voltage drop limit.
## Given
| Parameter | Value | Unit |
|-----------|-------|------|
| Load current | 45 | A |
| Cable length (one way) | 80 | m |
| Cable resistance (R) | 0.524 | ohm/km |
| Cable reactance (X) | 0.08 | ohm/km |
| Power factor (cos phi) | 0.85 | - |
| System voltage (line-to-line) | 400 | V |
| Maximum allowable voltage drop | 5 | % |
## Required
Calculate the following:
- Voltage drop (V)
- Voltage drop as a percentage of system voltage (%)
- Compliance with the 5% limit (1 if compliant, 0 if not)

The configuration
task.toml describes metadata, timeouts, and runtime limits — not the problem itself. The loader combines this file with instruction.md and produces a TaskDefinition contract.
version = "1.0"
[metadata]
difficulty = "easy"
category = "reasoning"
tags = ["electrical", "buildings-electrical", "deterministic", "AS-NZS-3008"]
[agent]
timeout_sec = 600.0
[verifier]
timeout_sec = 120.0
[environment]
extensions = []
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 5120
allow_internet = true

The environment
Each runnable task provides a Dockerfile that defines its runtime image. This is usually minimal — just the tools the task requires:
FROM --platform=linux/amd64 ubuntu:24.04
RUN apt-get update && apt-get install -y \
    python3 \
    bc \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /workspace

Some tasks stage input files such as spreadsheets, drawings, code extracts, or reference data into the runtime via fixture directories. The agent reads them as normal files on disk.
The verifier
The verifier computes ground truth independently, reads the agent's output, and writes a reward (0.0–1.0) plus per-field details. For Harbor-backed runs, tests/test.sh is the entry point; it usually calls tests/verify.py.
Here is the ground-truth computation from the voltage drop verifier:
import math


def compute_ground_truth() -> dict[str, float]:
    """Compute expected answers from the three-phase voltage drop formula."""
    current = 45.0  # load current (A)
    L = 80.0        # cable length, one way (m)
    R = 0.524       # resistance (ohm/km)
    X = 0.08        # reactance (ohm/km)
    pf = 0.85       # power factor (cos phi)
    V = 400.0       # system voltage, line-to-line (V)
    sin_phi = math.sin(math.acos(pf))
    # Dividing by 1000 converts the per-km impedance to the length in metres.
    Vd = math.sqrt(3) * current * L * (R * pf + X * sin_phi) / 1000
    Vd_pct = Vd / V * 100
    compliance = 1.0 if Vd_pct <= 5.0 else 0.0
    return {
        "voltage_drop_v": Vd,
        "voltage_drop_pct": Vd_pct,
        "compliance": compliance,
    }

The verifier writes two files under /logs/verifier/:
- reward.json holds the headline score: {"reward": 0.67}
- details.json holds the per-field breakdown: {"voltage_drop_v": 1.0, "voltage_drop_pct": 1.0, "compliance": 0.0}
Each field is scored independently with configurable tolerances. Numerical fields typically allow 3% relative tolerance; boolean/categorical fields require an exact match.
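The per-field scoring described above might look roughly like the following. This is a sketch under stated assumptions, not the verifier's actual code: `score_fields` and `write_results` are hypothetical names, the tolerance is measured relative to the expected value, and the headline reward is assumed to be the mean of the field scores (which is consistent with the 0.67 example but not confirmed by it).

```python
import json
from pathlib import Path

REL_TOL = 0.03                 # 3% relative tolerance for numerical fields
EXACT_FIELDS = {"compliance"}  # boolean/categorical fields need an exact match


def score_fields(expected: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Score each expected field as 1.0 or 0.0; a missing field scores 0.0."""
    details = {}
    for name, truth in expected.items():
        value = actual.get(name)
        if value is None:
            ok = False
        elif name in EXACT_FIELDS:
            ok = value == truth
        else:
            # Tolerance relative to the expected value; an expected value
            # of zero therefore requires an exact match.
            ok = abs(value - truth) <= REL_TOL * abs(truth)
        details[name] = 1.0 if ok else 0.0
    return details


def write_results(details: dict[str, float], log_dir="/logs/verifier") -> float:
    """Write reward.json and details.json; the reward is assumed to be the mean."""
    reward = round(sum(details.values()) / len(details), 2)
    out = Path(log_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "reward.json").write_text(json.dumps({"reward": reward}))
    (out / "details.json").write_text(json.dumps(details))
    return reward
```

With two numerical fields inside tolerance and a wrong compliance flag, this produces the same shape as the example files: two fields at 1.0, one at 0.0, and a reward of 0.67.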
Golden fixtures
Most verifiers ship with test fixtures: a golden_pass.md that should score 1.0 and a golden_fail.md that should score below 1.0. These are used to test the verifier itself, not the agent.
## Step 2: Voltage Drop
Vd = sqrt(3) x 45 x 80 x (0.524 x 0.85 + 0.08 x 0.5268) / 1000
Vd = 3.0400 V
## Step 3: Percentage
Vd% = 3.0400 / 400 x 100 = 0.7600%

Task lifecycle
Tasks move through four stages: proposed → active → deprecated → retired.
- Proposed — draft task or source spec under review
- Active — production benchmark task eligible for datasets and reports
- Deprecated — superseded by a better version but still runnable for historical comparison
- Retired — permanently archived and not runnable
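One way to model these stages in code is an enum with an explicit transition table. The sketch below is an assumption about how the lifecycle could be encoded, not how aec-bench stores it: the `Stage` enum and the helper names are hypothetical, and it assumes a task only ever moves one step forward along the chain.

```python
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


# Assumption: each stage has at most one successor and there is no going back.
NEXT_STAGE = {
    Stage.PROPOSED: Stage.ACTIVE,
    Stage.ACTIVE: Stage.DEPRECATED,
    Stage.DEPRECATED: Stage.RETIRED,
}


def can_transition(current: Stage, target: Stage) -> bool:
    """True only when target is the immediate successor of current."""
    return NEXT_STAGE.get(current) is target


def is_report_eligible(stage: Stage) -> bool:
    """Only active tasks are eligible for datasets and reports."""
    return stage is Stage.ACTIVE
```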
Separately, tasks can be public (appears in reports and leaderboard) or holdout (runs but results are hidden). Holdout tasks are used for anti-contamination testing — if a model scores suspiciously well on holdout tasks it hasn't seen, that's a red flag.
Task categories
Categories describe the type of work a task demands. They are free-form strings, not a closed enum. Use the category that best describes the primary demand, and keep more specific descriptors in tags.
| Category | What the agent does |
|---|---|
| reasoning | Apply formulas and first principles to produce a numerical or boolean answer |
| report-generation | Produce a structured engineering or proposal deliverable from source material |
| hydraulic-calculations | Run civil hydraulic calculations against given inputs |
| load-analysis | Resolve loads, assumptions, and checks for a structural scenario |
| short-circuit | Calculate fault levels or related electrical protection values |
Difficulty levels
Each task is rated easy, medium, or hard. Ratings should be calibrated by domain experts against what a professional engineer at each level would be expected to handle.
| Difficulty | Characteristics | Calibrated against |
|---|---|---|
| Easy | Single-step, all parameters given, clear formula | Junior engineer, no references needed |
| Medium | Multi-step, some judgment, may need standard lookups | Intermediate engineer |
| Hard | Complex analysis, multiple variables, judgment critical | Senior engineer |
Source specs, seeds, and generated instances
There are three common ways work enters the benchmark:
Source task specs capture proposed benchmark ideas before they are runnable. They usually live as source_task.json files and preserve the original discipline, category, standards, inputs, outputs, and source file.
Seed runnable instances are hand-authored tasks with a prompt, runtime, and custom verifier. These are useful when the task needs expert judgement, a bespoke source document, or a narrow real-world scenario.
Generated instances come from parameterised templates. A template defines the problem structure with variable parameters, and the generator samples concrete values to create reproducible task instances. Generated task.toml files include generation metadata such as template name, seed, archetype, and difficulty. See Templates for how this works.
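To illustrate why a seed makes generated instances reproducible, here is a minimal sketch of a seeded sampler for a voltage-drop style template. Everything in it is hypothetical: the function name, the parameter names, the value ranges, and the metadata keys are invented for illustration and do not reflect the real templates or generator.

```python
import random


def generate_voltage_drop_instance(seed: int) -> dict:
    """Sample one concrete scenario; the same seed always yields the same values."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    params = {
        "load_current_a": rng.choice([16, 32, 45, 63, 100]),
        "cable_length_m": rng.randrange(20, 201, 10),  # 20 to 200 m in 10 m steps
        "power_factor": rng.choice([0.8, 0.85, 0.9, 0.95]),
    }
    # Illustrative generation metadata: template name and seed only.
    return {"template": "voltage-drop", "seed": seed, "params": params}
```

Because the random generator is created from the seed inside the function, calling it twice with the same seed returns identical parameters, which is what lets a generated instance be rebuilt later from its recorded metadata.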