Datasets
A dataset is a versioned, content-hashed snapshot of a task collection that anchors comparable leaderboard results.
Datasets are the formal "this is the benchmark slice" artefact. They sit between task generation and experiment execution, making runs comparable even as templates and tasks continue to evolve.
Anatomy
A dataset lives under a versioned path:
artefacts/datasets/
└── electrical-v1/
└── 1.0.0/
└── manifest.jsonThe manifest pins every included task:
{
"name": "electrical-v1",
"version": "1.0.0",
"content_hash": "sha256:9b2f...",
"description": {
"summary": "Electrical engineering benchmark",
"purpose": "Baseline evaluation for cable and power-system tasks",
"domains": ["electrical"],
"difficulty_distribution": { "easy": 3, "medium": 3 },
"standards": ["AS/NZS 3008"],
"task_count": 6
},
"tasks": [
{
"task_id": "electrical/cable-sizing/voltage-drop/example",
"task_path": "tasks/electrical/cable-sizing/voltage-drop/example",
"content_hash": "sha256:a4b1...",
"domain": "electrical",
"difficulty": "easy",
"tags": ["deterministic"]
}
]
}Content hashing
Hashes apply at two levels:
- Task hash — instruction, task config, environment, verifier, and task-local files.
- Dataset hash — ordered combination of every task hash and manifest metadata that affects benchmark meaning.
If a verifier changes, the task hash changes. If any task hash changes, the dataset hash changes. This is what prevents a leaderboard from mixing scores from different task definitions.
Creating a dataset
Freeze existing tasks by domain, difficulty, or pattern:
uv run aec-bench dataset create \
--name electrical-v1 \
--version 1.0.0 \
--domain electrical \
--difficulty easy \
--difficulty medium \
--summary "Electrical deterministic baseline" \
--purpose "Comparable cable and power-system evaluation"Freeze the exact output of a generated suite:
uv run aec-bench generate suite --config suite.toml --output tasks/generated
uv run aec-bench dataset create \
--name suite-v1 \
--version 1.0.0 \
--from-suite-output tasks/generated/dataset.jsonUse --overwrite only when replacing a local mistake before sharing results. Published dataset versions should be immutable.
Running from a dataset
Generate an experiment manifest from a dataset reference:
uv run aec-bench dataset config electrical-v1@1.0.0 \
--model gpt-4.1-mini \
--harness tool_loop \
--backend modal \
--output experiment.yaml
uv run aec-bench run --config experiment.yamlManual experiment manifests can also reference a dataset:
tasks:
dataset: electrical-v1@1.0.0
difficulties: ["medium", "hard"]Additional filters are intersections on top of the dataset, not replacements.
Inspecting and moving datasets
uv run aec-bench dataset list
uv run aec-bench dataset info electrical-v1@1.0.0
uv run aec-bench dataset validate electrical-v1@1.0.0
uv run aec-bench dataset results electrical-v1@1.0.0
uv run aec-bench dataset export electrical-v1@1.0.0 --output electrical-v1.tar.gz
uv run aec-bench dataset import electrical-v1.tar.gzdataset validate exits non-zero if files are missing or drifted from the recorded hashes.
Versioning policy
Use semver-like discipline:
| Change | Bump |
|---|---|
| Add tasks without changing existing task content | minor |
| Change task instructions, verifiers, fixtures, or environment | major |
| Edit non-scoring metadata only | patch |
The practical rule is simple: if a previous reward is no longer comparable, create a new major dataset version.