Quickstart

This quickstart walks from a fresh checkout through generating a task, running a local trial, and evaluating the imported ledger.

This walkthrough uses the source checkout because it is the current public setup path while the package release is being prepared.

Prerequisites

Python 3.13 or later
uv for dependency management
An LLM provider key for the model you run, unless you only generate and validate tasks
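
The prerequisites above can be checked before cloning. The following is a minimal sketch, not part of aec-bench: version_ge and check_prereqs are hypothetical helper names, and the check assumes python3 is on PATH. uv can typically provision its own interpreter, so a failing system Python check is not necessarily a blocker.

```shell
# Hypothetical pre-flight helpers; these names are not part of aec-bench.

# version_ge A B: succeeds when A >= B, comparing major.minor only.
# Both arguments must contain at least "major.minor".
version_ge() {
  a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
  b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
  [ "$a_major" -gt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# check_prereqs: reports whether python3 (3.13 or later) and uv are on PATH.
check_prereqs() {
  if ! command -v python3 >/dev/null 2>&1; then
    echo "python3 not found on PATH" >&2
    return 1
  fi
  py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  if ! version_ge "$py_version" "3.13"; then
    echo "Python $py_version found; 3.13 or later is required" >&2
    return 1
  fi
  if ! command -v uv >/dev/null 2>&1; then
    echo "uv not found on PATH" >&2
    return 1
  fi
  echo "Python $py_version and uv are available"
}
```

Call check_prereqs after sourcing the snippet. The provider key is deliberately not checked here, since it is only needed once you run a model.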

Install

git clone https://github.com/TheodoreGalanos/aec-bench.git
cd aec-bench
uv sync --dev
uv run aec-bench --help

In a source checkout, run commands through uv run so the local package code and dependencies stay aligned.

Generate

List the built-in templates for a discipline, then generate one task instance with a fixed seed so the result is deterministic:

uv run aec-bench generate list-templates --discipline electrical

uv run aec-bench generate task voltage-drop \
  --instances 1 \
  --difficulty easy \
  --seed 42 \
  --output tasks/generated

Each generated instance is a normal task directory containing an instruction, a task config, verifier files, and provenance that links the instance back to its template and seed.
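
Because the instance comes from a fixed seed, one way to sanity-check determinism is to generate twice into separate directories and compare the trees. This is a sketch under assumptions: dirs_identical and check_seed_determinism are hypothetical helpers, the scratch directory is arbitrary, and it assumes the provenance files contain nothing run-specific such as timestamps. If they do, expect differences limited to those files.

```shell
# dirs_identical A B: succeeds when both directory trees hold the same files and contents.
dirs_identical() {
  diff -r "$1" "$2" >/dev/null 2>&1
}

# check_seed_determinism: hypothetical helper; run it from the repository root.
# It uses only the flags shown in the generate command above.
check_seed_determinism() {
  scratch="$(mktemp -d)"
  for run in a b; do
    uv run aec-bench generate task voltage-drop \
      --instances 1 --difficulty easy --seed 42 \
      --output "$scratch/$run" || return 1
  done
  if dirs_identical "$scratch/a" "$scratch/b"; then
    echo "same seed produced identical output"
  else
    echo "outputs differ; inspect $scratch" >&2
    return 1
  fi
}
```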

Validate

Before treating a task as benchmark material, validate its structure and verifier:

uv run aec-bench task validate \
  tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00

For template work, validate the source template rather than only the generated instance:

uv run aec-bench generate validate-template \
  src/aec_bench/templates/builtin/electrical/voltage_drop
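
To validate every generated instance rather than a single path, the task validate command can be looped over the output directory. The helper below is a hypothetical sketch: for_each_task_dir is not an aec-bench command, and the four-level layout it globs over is inferred only from the example path above, so adjust the pattern if your tree differs.

```shell
# for_each_task_dir ROOT CMD...: runs CMD <dir> for each task directory under ROOT
# and succeeds only if every invocation succeeds.
# Assumes instances sit four levels below ROOT, as in the example path
# tasks/generated/electrical/cable-sizing/voltage-drop/<instance>.
for_each_task_dir() {
  root="$1"; shift
  failures=0
  for dir in "$root"/*/*/*/*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    "$@" "${dir%/}" || failures=$((failures + 1))
  done
  [ "$failures" -eq 0 ]
}

# Usage, from the repository root:
#   for_each_task_dir tasks/generated uv run aec-bench task validate
```

Counting failures instead of stopping at the first one means a single run reports every invalid instance.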

Run

Run one task locally with a real model. The local runner copies the task into a temporary workspace, executes the selected harness, verifies the result, and imports the trial unless you opt out.

export ANTHROPIC_API_KEY="..."

uv run aec-bench run-local \
  tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00 \
  --model claude-sonnet-4-20250514 \
  --harness direct

Use --harness rlm to run a recursive-language-model harness instead of the direct one. Add --keep-workspace to keep the temporary workspace so you can inspect it after the run.
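
The provider key has to be exported, not merely set, for a child process such as uv run to see it. A small guard can catch that before a run starts. This is a sketch: require_env is a hypothetical helper, and it relies on printenv, which only reports exported variables.

```shell
# require_env NAME: succeeds only when NAME is exported and non-empty,
# which is what a child process will actually inherit.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "$1 is not exported or is empty" >&2
    return 1
  fi
}

# Usage before the run-local command above:
#   require_env ANTHROPIC_API_KEY && uv run aec-bench run-local \
#     tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00 \
#     --model claude-sonnet-4-20250514 \
#     --harness direct
```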

Evaluate

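Unless you opted out, the trial from the previous step was imported under the local experiment id, so you can evaluate it directly. The second command also writes a report to report.html: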

uv run aec-bench evaluate --experiment local
uv run aec-bench evaluate --experiment local --report report.html

For a repeatable benchmark slice, freeze tasks into a dataset, generate an experiment config, then run that config:

uv run aec-bench dataset create \
  --name electrical-v1 \
  --version 1.0.0 \
  --domain electrical

uv run aec-bench dataset config electrical-v1@1.0.0 \
  --model claude-sonnet-4-20250514 \
  --output experiment.yaml

uv run aec-bench run --config experiment.yaml
