aec-benchaec-bench

Datasets

A dataset is a versioned, content-hashed snapshot of a task collection that anchors comparable leaderboard results.

Datasets are the formal "this is the benchmark slice" artefact. They sit between task generation and experiment execution, making runs comparable even as templates and tasks continue to evolve.

Anatomy

A dataset lives under a versioned path:

artefacts/datasets/
└── electrical-v1/
    └── 1.0.0/
        └── manifest.json

The manifest pins every included task:

{
  "name": "electrical-v1",
  "version": "1.0.0",
  "content_hash": "sha256:9b2f...",
  "description": {
    "summary": "Electrical engineering benchmark",
    "purpose": "Baseline evaluation for cable and power-system tasks",
    "domains": ["electrical"],
    "difficulty_distribution": { "easy": 3, "medium": 3 },
    "standards": ["AS/NZS 3008"],
    "task_count": 6
  },
  "tasks": [
    {
      "task_id": "electrical/cable-sizing/voltage-drop/example",
      "task_path": "tasks/electrical/cable-sizing/voltage-drop/example",
      "content_hash": "sha256:a4b1...",
      "domain": "electrical",
      "difficulty": "easy",
      "tags": ["deterministic"]
    }
  ]
}

Content hashing

Hashes apply at two levels:

  1. Task hash — instruction, task config, environment, verifier, and task-local files.
  2. Dataset hash — ordered combination of every task hash and manifest metadata that affects benchmark meaning.

If a verifier changes, the task hash changes. If any task hash changes, the dataset hash changes. This is what prevents a leaderboard from mixing scores from different task definitions.

Creating a dataset

Freeze existing tasks by domain, difficulty, or pattern:

uv run aec-bench dataset create \
  --name electrical-v1 \
  --version 1.0.0 \
  --domain electrical \
  --difficulty easy \
  --difficulty medium \
  --summary "Electrical deterministic baseline" \
  --purpose "Comparable cable and power-system evaluation"

Freeze the exact output of a generated suite:

uv run aec-bench generate suite --config suite.toml --output tasks/generated

uv run aec-bench dataset create \
  --name suite-v1 \
  --version 1.0.0 \
  --from-suite-output tasks/generated/dataset.json

Use --overwrite only when replacing a local mistake before sharing results. Published dataset versions should be immutable.

Running from a dataset

Generate an experiment manifest from a dataset reference:

uv run aec-bench dataset config electrical-v1@1.0.0 \
  --model gpt-4.1-mini \
  --harness tool_loop \
  --backend modal \
  --output experiment.yaml

uv run aec-bench run --config experiment.yaml

Manual experiment manifests can also reference a dataset:

tasks:
  dataset: electrical-v1@1.0.0
  difficulties: ["medium", "hard"]

Additional filters are intersections on top of the dataset, not replacements.

Inspecting and moving datasets

uv run aec-bench dataset list
uv run aec-bench dataset info electrical-v1@1.0.0
uv run aec-bench dataset validate electrical-v1@1.0.0
uv run aec-bench dataset results electrical-v1@1.0.0
uv run aec-bench dataset export electrical-v1@1.0.0 --output electrical-v1.tar.gz
uv run aec-bench dataset import electrical-v1.tar.gz

dataset validate exits non-zero if files are missing or drifted from the recorded hashes.

Versioning policy

Use semver-like discipline:

ChangeBump
Add tasks without changing existing task contentminor
Change task instructions, verifiers, fixtures, or environmentmajor
Edit non-scoring metadata onlypatch

The practical rule is simple: if a previous reward is no longer comparable, create a new major dataset version.

On this page