Tasks

A task is a self-contained engineering problem — prompt, environment, and verifier — that an agent attempts and the system scores.

aec-bench uses the word "task" at three levels:

| Layer | What it means | Typical file |
|-------|---------------|--------------|
| Source task spec | A proposed benchmark idea captured from the catalogue or source material | `source_task.json` |
| Template | A parameterised task family that can produce many concrete scenarios | template metadata and renderers |
| Runnable task instance | A concrete problem with a prompt, runtime, and verifier | `task.toml` + `instruction.md` |

Public benchmark reports should count the tasks in a specific dataset or release, not everything in the repository tree. The repository also holds catalogue seeds, templates, generated instances, and development task families; these are useful assets, but they are not interchangeable and do not add up to a single benchmark count.

Anatomy of a task

A runnable task instance is a directory with required files and optional review aids:

tasks/electrical/voltage-drop/
├── task.toml          # metadata and resource limits
├── instruction.md     # the prompt sent to the agent
├── environment/
│   └── Dockerfile     # runtime image for the task
├── solution/
│   └── solve.py       # reference solution (not sent to agent)
└── tests/
    ├── test.sh        # verifier entry point
    ├── verify.py      # scoring logic
    └── fixtures/      # golden pass/fail examples for verifier testing
        ├── golden_pass.md
        └── golden_fail.md

The loader requires task.toml, instruction.md, an environment Dockerfile, and a verifier entry point. solution/ and golden fixtures are strongly recommended for review, but they are not part of the agent prompt.
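
The required-file rule can be sketched as a small pre-flight check. This is only an illustration: `REQUIRED_FILES` and `missing_required_files` are hypothetical names, and the relative paths are assumed from the example layout above rather than taken from the actual loader.

```python
from pathlib import Path

# Assumed relative paths, copied from the example directory tree above;
# the real loader may accept other locations.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_required_files(task_dir: Path) -> list[str]:
    """Return the required files that are absent from task_dir."""
    return [rel for rel in REQUIRED_FILES if not (task_dir / rel).is_file()]
```

Calling `missing_required_files(Path("tasks/electrical/voltage-drop"))` on a complete task directory would return an empty list; `solution/` and the golden fixtures are deliberately not checked because they are optional.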

The prompt

instruction.md is the prompt sent to the agent. It must stand on its own and explicitly name any staged files, outputs, or constraints the agent is expected to use.

instruction.md (voltage drop task)

You are a senior electrical engineer specializing in building services.

## Problem
Calculate the voltage drop for a three-phase cable circuit using the
impedance method, and determine whether it complies with the maximum
allowable voltage drop limit.

## Given
| Parameter | Value | Unit |
|-----------|-------|------|
| Load current | 45 | A |
| Cable length (one way) | 80 | m |
| Cable resistance (R) | 0.524 | ohm/km |
| Cable reactance (X) | 0.08 | ohm/km |
| Power factor (cos phi) | 0.85 | - |
| System voltage (line-to-line) | 400 | V |
| Maximum allowable voltage drop | 5 | % |

## Required
Calculate the following:
- Voltage drop (V)
- Voltage drop as a percentage of system voltage (%)
- Compliance with the 5% limit (1 if compliant, 0 if not)

The configuration

task.toml describes metadata, timeouts, and runtime limits — not the problem itself. The loader combines this file with instruction.md and produces a TaskDefinition contract.

task.toml

version = "1.0"

[metadata]
difficulty = "easy"
category = "reasoning"
tags = ["electrical", "buildings-electrical", "deterministic", "AS-NZS-3008"]

[agent]
timeout_sec = 600.0

[verifier]
timeout_sec = 120.0

[environment]
extensions = []
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 5120
allow_internet = true

The environment

Each runnable task provides a Dockerfile that defines its runtime image. This is usually minimal — just the tools the task requires:

environment/Dockerfile

FROM --platform=linux/amd64 ubuntu:24.04

RUN apt-get update && apt-get install -y \
    python3 \
    bc \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

Some tasks stage input files such as spreadsheets, drawings, code extracts, or reference data into the runtime via fixture directories. The agent reads them as normal files on disk.

The verifier

The verifier computes ground truth independently, reads the agent's output, and writes a reward (0.0–1.0) plus per-field details. For Harbor-backed runs, tests/test.sh is the entry point; it usually calls tests/verify.py.

Here's the ground-truth computation from the voltage drop verifier, which the scoring step compares against the agent's output:

tests/verify.py

import math


def compute_ground_truth() -> dict[str, float]:
    """Compute expected answers from three-phase voltage drop formula."""
    current = 45.0 # load current (A)
    L = 80.0       # cable length (m)
    R = 0.524      # resistance (ohm/km)
    X = 0.08       # reactance (ohm/km)
    pf = 0.85      # power factor
    V = 400.0      # system voltage (V)

    sin_phi = math.sin(math.acos(pf))
    Vd = math.sqrt(3) * current * L * (R * pf + X * sin_phi) / 1000
    Vd_pct = Vd / V * 100
    compliance = 1.0 if Vd_pct <= 5.0 else 0.0

    return {
        "voltage_drop_v": Vd,
        "voltage_drop_pct": Vd_pct,
        "compliance": compliance,
    }
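
Written out as an equation, the computation above is the three-phase impedance-method expression, with the cable length $L$ in metres and $R$, $X$ in ohm/km (hence the division by 1000):

```latex
\[
V_d = \frac{\sqrt{3}\, I\, L\, \left(R\cos\varphi + X\sin\varphi\right)}{1000},
\qquad
V_{d,\%} = \frac{V_d}{V} \times 100
\]
```

For the given inputs, $\sin\varphi = \sqrt{1 - 0.85^2} \approx 0.5268$, so $V_d \approx 3.04$ V and $V_{d,\%} \approx 0.76\%$, which is within the 5% limit (compliance = 1).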

The verifier writes two files under /logs/verifier/:

reward.json — the headline score: {"reward": 0.67}
details.json — per-field breakdown: {"voltage_drop_v": 1.0, "voltage_drop_pct": 1.0, "compliance": 0.0}

Each field is scored independently with configurable tolerances. Numerical fields typically allow 3% relative tolerance; boolean/categorical fields require an exact match.
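
The per-field scoring described above might look roughly like the following. Treat it as a sketch under stated assumptions: `score_fields`, `write_results`, `REL_TOL`, and `EXACT_FIELDS` are hypothetical names, and taking the reward as the mean of the field scores is an assumption that merely matches the 0.67 example, not a documented rule.

```python
import json
from pathlib import Path

REL_TOL = 0.03                 # 3% relative tolerance for numeric fields
EXACT_FIELDS = {"compliance"}  # boolean/categorical fields: exact match


def score_fields(expected: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Score each expected field as 1.0 (pass) or 0.0 (fail)."""
    details = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got is None:
            ok = False  # a missing field scores zero
        elif name in EXACT_FIELDS:
            ok = got == want
        else:
            # Tolerance is measured relative to the expected value.
            ok = abs(got - want) <= REL_TOL * abs(want)
        details[name] = 1.0 if ok else 0.0
    return details


def write_results(details: dict[str, float], out_dir: Path) -> float:
    """Write reward.json and details.json; reward is the mean field score (assumed)."""
    reward = round(sum(details.values()) / len(details), 2)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "reward.json").write_text(json.dumps({"reward": reward}))
    (out_dir / "details.json").write_text(json.dumps(details))
    return reward
```

With two numeric fields inside tolerance and a wrong compliance flag, `score_fields` yields the breakdown shown in details.json above, and `write_results` returns 0.67.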

Golden fixtures

Most verifiers ship with test fixtures: a golden_pass.md that should score 1.0 and a golden_fail.md that should score below 1.0. These are used to test the verifier itself, not the agent.

fixtures/golden_pass.md (excerpt)

## Step 2: Voltage Drop
Vd = sqrt(3) x 45 x 80 x (0.524 x 0.85 + 0.08 x 0.5268) / 1000
Vd = 3.0400 V

## Step 3: Percentage
Vd% = 3.0400 / 400 x 100 = 0.7600%
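
One way to exercise a verifier against its fixtures is a small check like the one below. It is a sketch only: `check_golden_fixtures` is a hypothetical helper, and `score` stands in for whatever function turns a submission's text into a reward, which this page does not show.

```python
from collections.abc import Callable
from pathlib import Path


def check_golden_fixtures(score: Callable[[str], float], fixtures_dir: Path) -> None:
    """Assert that the verifier separates the golden pass and fail examples."""
    pass_score = score((fixtures_dir / "golden_pass.md").read_text())
    fail_score = score((fixtures_dir / "golden_fail.md").read_text())
    assert pass_score == 1.0, f"golden_pass.md scored {pass_score}"
    assert fail_score < 1.0, f"golden_fail.md scored {fail_score}"
```

Running a check of this kind in CI would catch a verifier that has become too lenient (the fail fixture passes) or too strict (the pass fixture fails) before any agent run depends on it.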

Task lifecycle

Tasks move through four stages: proposed → active → deprecated → retired.

Proposed — draft task or source spec under review
Active — production benchmark task eligible for datasets and reports
Deprecated — superseded by a better version but still runnable for historical comparison
Retired — permanently archived and not runnable
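
The four stages map naturally onto an enumeration. The sketch below assumes a task only ever moves one step forward along the sequence; that restriction, and the names `Stage` and `can_transition`, are illustrative assumptions rather than documented behaviour.

```python
from enum import Enum


class Stage(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


# Assumption: each stage has at most one successor, and nothing moves backwards.
_NEXT = {
    Stage.PROPOSED: Stage.ACTIVE,
    Stage.ACTIVE: Stage.DEPRECATED,
    Stage.DEPRECATED: Stage.RETIRED,
}


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True if target is the stage directly after current."""
    return _NEXT.get(current) is target
```

Under these assumptions `can_transition(Stage.PROPOSED, Stage.ACTIVE)` is true, while skipping a stage or reviving a retired task is rejected.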

Separately, tasks can be public (appears in reports and leaderboard) or holdout (runs but results are hidden). Holdout tasks are used for anti-contamination testing — if a model scores suspiciously well on holdout tasks it hasn't seen, that's a red flag.

Task categories

Categories describe the type of work a task demands. They are free-form strings, not a closed enum. Use the category that best describes the primary demand, and keep more specific descriptors in tags.

| Category | What the agent does |
|----------|---------------------|
| `reasoning` | Apply formulas and first principles to produce a numerical or boolean answer |
| `report-generation` | Produce a structured engineering or proposal deliverable from source material |
| `hydraulic-calculations` | Run civil hydraulic calculations against given inputs |
| `load-analysis` | Resolve loads, assumptions, and checks for a structural scenario |
| `short-circuit` | Calculate fault levels or related electrical protection values |

Difficulty levels

Each task is rated easy, medium, or hard. Ratings should be calibrated by domain experts against what a professional engineer at each level would be expected to handle.

| Difficulty | Characteristics | Calibrated against |
|------------|-----------------|--------------------|
| Easy | Single-step, all parameters given, clear formula | Junior engineer, no references needed |
| Medium | Multi-step, some judgment, may need standard lookups | Intermediate engineer |
| Hard | Complex analysis, multiple variables, judgment critical | Senior engineer |

Source specs, seeds, and generated instances

There are three common ways work enters the benchmark:

Source task specs capture proposed benchmark ideas before they are runnable. They usually live as source_task.json files and preserve the original discipline, category, standards, inputs, outputs, and source file.

Seed runnable instances are hand-authored tasks with a prompt, runtime, and custom verifier. These are useful when the task needs expert judgement, a bespoke source document, or a narrow real-world scenario.

Generated instances come from parameterised templates. A template defines the problem structure with variable parameters, and the generator samples concrete values to create reproducible task instances. Generated task.toml files include generation metadata such as template name, seed, archetype, and difficulty. See Templates for how this works.
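
Seeded sampling is what makes generated instances reproducible. The sketch below shows the idea for a voltage drop family; `VoltageDropParams`, `sample_params`, and the parameter ranges are all made up for illustration and are not taken from any real template.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VoltageDropParams:
    current_a: float
    length_m: float
    power_factor: float


def sample_params(seed: int) -> VoltageDropParams:
    """Draw one concrete scenario; the same seed always yields the same values."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return VoltageDropParams(
        current_a=round(rng.uniform(10.0, 100.0), 1),   # illustrative range
        length_m=round(rng.uniform(20.0, 200.0), 0),    # illustrative range
        power_factor=round(rng.uniform(0.8, 0.95), 2),  # illustrative range
    )
```

Because the seed fully determines the draw, recording it alongside the template name in the generated task.toml is enough to regenerate the same instance later.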
