Quickstart
This quickstart takes you from a fresh checkout to generating a task, running a local trial, and evaluating the imported ledger.
This walkthrough uses the source checkout because it is the current public setup path while the package release is being prepared.
Prerequisites
- Python 3.13 or later
- uv for dependency management
- An LLM provider key for the model you run, unless you only generate and validate tasks
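The first two prerequisites can be checked from a shell before installing. The following is a minimal sketch, not part of aec-bench: `python_ok` and `check_prereqs` are hypothetical helper names, and the check inspects the system `python3` only. Because uv can provision its own interpreter, a failing Python line here is a hint rather than a hard stop.

```shell
# Hypothetical helper: succeed when a "major.minor[.patch]" version string is 3.13 or later.
python_ok() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # text before the next dot (or all of it)
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 13 ]; }
}

# Hypothetical helper: report whether python3 and uv look usable.
check_prereqs() {
  version=$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null) || version=""
  if [ -n "$version" ] && python_ok "$version"; then
    echo "python $version: ok"
  else
    echo "python ${version:-not found}: need 3.13 or later"
  fi
  if command -v uv > /dev/null 2>&1; then
    echo "uv: ok"
  else
    echo "uv: not found"
  fi
}
```

Calling `check_prereqs` prints one status line per tool and never aborts the shell, so it is safe to paste into an interactive session.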
Install
git clone https://github.com/TheodoreGalanos/aec-bench.git
cd aec-bench
uv sync --dev
uv run aec-bench --help
In a source checkout, run commands through uv run so the local package code and dependencies stay aligned.
Generate
List the built-in templates and generate one deterministic task instance:
uv run aec-bench generate list-templates --discipline electrical
uv run aec-bench generate task voltage-drop \
--instances 1 \
--difficulty easy \
--seed 42 \
--output tasks/generated
Generated instances are normal task directories: instruction, task config, verifier files, and provenance linking the instance back to its template and seed.
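Since generation is described as deterministic for a given seed, one way to check that locally is to generate the same seed into two scratch directories and compare them. This is a sketch under assumptions: `check_seed_reproducible` is a hypothetical helper, it reuses only the flags shown above, and it assumes determinism means byte-identical files. If the provenance file records something like a timestamp or output path, `diff` may report a difference even though the task content matches.

```shell
# Hypothetical helper: generate one instance twice with the same seed and diff the results.
# Returns 0 when both trees are identical, 1 when they differ, 2 when generation failed.
check_seed_reproducible() {
  seed=$1
  out_a=$(mktemp -d)
  out_b=$(mktemp -d)

  for out in "$out_a" "$out_b"; do
    uv run aec-bench generate task voltage-drop \
      --instances 1 \
      --difficulty easy \
      --seed "$seed" \
      --output "$out" || return 2
  done

  # Recursive comparison; any differing file is printed.
  rc=0
  diff -r "$out_a" "$out_b" || rc=$?

  rm -rf "$out_a" "$out_b"
  return "$rc"
}
```

For example, `check_seed_reproducible 42 && echo "seed 42 is reproducible"`.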
Validate
Before treating a task as benchmark material, validate its structure and verifier:
uv run aec-bench task validate \
tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00
For template work, validate the source template rather than only the generated instance:
uv run aec-bench generate validate-template \
src/aec_bench/templates/builtin/electrical/voltage_drop
Run
Run one task locally with a real model. The local runner copies the task into a temporary workspace, executes the selected harness, verifies the result, and imports the trial unless you opt out.
export ANTHROPIC_API_KEY="..."
uv run aec-bench run-local \
tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00 \
--model claude-sonnet-4-20250514 \
--harness direct
Use --harness rlm for a recursive-language-model harness, or --keep-workspace when debugging the temporary workspace after a run.
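Because a real-model run calls a provider API, it can be worth gating the run on validation and on the key being present. The wrapper below is a sketch, not part of aec-bench: `validate_then_run` is a hypothetical name, it uses only the commands and flags shown in this quickstart, and it assumes `aec-bench task validate` exits non-zero when validation fails, which the quickstart does not confirm.

```shell
# Hypothetical wrapper: validate a task directory, then run it locally only if validation passed.
# Usage: validate_then_run <task-dir> [direct|rlm]
validate_then_run() {
  task_dir=$1
  harness=${2:-direct}   # the two harness names mentioned in this quickstart

  # Fail early instead of partway through a run.
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 2
  fi

  # Assumption: a non-zero exit status signals a validation failure.
  uv run aec-bench task validate "$task_dir" || return 1

  uv run aec-bench run-local "$task_dir" \
    --model claude-sonnet-4-20250514 \
    --harness "$harness"
}
```

Passing `rlm` as the second argument switches harnesses without retyping the rest of the command.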
Evaluate
Unless you opted out, the trial from the previous step was imported under the experiment id local. Evaluate that experiment, optionally writing an HTML report:
uv run aec-bench evaluate --experiment local
uv run aec-bench evaluate --experiment local --report report.html
For a repeatable benchmark slice, freeze tasks into a dataset, generate an experiment config, then run that config:
uv run aec-bench dataset create \
--name electrical-v1 \
--version 1.0.0 \
--domain electrical
uv run aec-bench dataset config electrical-v1@1.0.0 \
--model claude-sonnet-4-20250514 \
--output experiment.yaml
uv run aec-bench run --config experiment.yaml
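The three steps above can be chained so that a failure in one stops the rest. This is a sketch rather than a supported workflow: `run_frozen_slice` is a hypothetical helper, the `name@version` reference is inferred from the `electrical-v1@1.0.0` example, and it assumes each command exits non-zero on failure.

```shell
# Hypothetical helper: freeze a dataset, derive an experiment config, and run it.
# Usage: run_frozen_slice <name> <version> <domain> [config-file]
run_frozen_slice() {
  name=$1
  version=$2
  domain=$3
  config=${4:-experiment.yaml}

  # Step 1: freeze matching tasks into a named, versioned dataset.
  uv run aec-bench dataset create \
    --name "$name" \
    --version "$version" \
    --domain "$domain" || return 1

  # Step 2: write an experiment config for that dataset (referenced as name@version).
  uv run aec-bench dataset config "${name}@${version}" \
    --model claude-sonnet-4-20250514 \
    --output "$config" || return 1

  # Step 3: run the experiment described by the config.
  uv run aec-bench run --config "$config"
}
```

With the values from this section, the call would be `run_frozen_slice electrical-v1 1.0.0 electrical`.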