aec-bench

Quickstart

This quickstart takes you from a fresh checkout to generating a task, running a local trial, and evaluating the imported ledger.

This walkthrough uses the source checkout because it is the current public setup path while the package release is being prepared.

Prerequisites

  • Python 3.13 or later
  • uv for dependency management
  • An LLM provider key for the model you run, unless you only generate and validate tasks
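The prerequisites above can be checked with a small shell sketch before installing. It is not part of aec-bench: version_at_least and check_prereqs are hypothetical helper names, and the check assumes the python3 on your PATH is the interpreter you intend to use. Since uv can provision its own interpreter, the Python check may be stricter than necessary. It also assumes a sort that supports -V (GNU coreutils and recent BSD/macOS).

```shell
# Hypothetical helpers, not shipped with aec-bench.

# Succeeds when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Checks that uv is on PATH and that python3 reports 3.13 or later.
check_prereqs() {
  command -v uv >/dev/null 2>&1 || {
    echo "uv not found on PATH" >&2
    return 1
  }
  py_version="$(python3 -c 'import platform; print(platform.python_version())')" || return 1
  version_at_least "$py_version" "3.13" || {
    echo "Python $py_version is older than 3.13" >&2
    return 1
  }
  echo "prerequisites look OK"
}
```

Source the snippet and call check_prereqs once; a non-zero exit status means one of the prerequisites needs attention.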

Install

git clone https://github.com/TheodoreGalanos/aec-bench.git
cd aec-bench
uv sync --dev
uv run aec-bench --help

In a source checkout, run commands through uv run so the local package code and dependencies stay aligned.

Generate

List the built-in templates and generate one deterministic task instance:

uv run aec-bench generate list-templates --discipline electrical

uv run aec-bench generate task voltage-drop \
  --instances 1 \
  --difficulty easy \
  --seed 42 \
  --output tasks/generated

Generated instances are normal task directories: instruction, task config, verifier files, and provenance linking the instance back to its template and seed.
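To see what a generated instance actually contains, listing its files is enough. The sketch below uses only standard tools; list_task_files is a hypothetical helper name, and the exact file names inside a task directory are not assumed here.

```shell
# Hypothetical helper: print every file in a task directory, relative to it.
list_task_files() {
  task_dir="$1"
  [ -d "$task_dir" ] || {
    echo "not a directory: $task_dir" >&2
    return 1
  }
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$task_dir" && find . -type f | sort)
}
```

For example, list_task_files tasks/generated prints every file produced by the generate command above, which makes it easy to locate the instance directory to pass to later steps.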

Validate

Before treating a task as benchmark material, validate its structure and verifier:

uv run aec-bench task validate \
  tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00

For template work, validate the source template rather than only the generated instance:

uv run aec-bench generate validate-template \
  src/aec_bench/templates/builtin/electrical/voltage_drop

Run

Run one task locally with a real model. The local runner copies the task into a temporary workspace, executes the selected harness, verifies the result, and imports the trial unless you opt out.

export ANTHROPIC_API_KEY="..."

uv run aec-bench run-local \
  tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00 \
  --model claude-sonnet-4-20250514 \
  --harness direct

Use --harness rlm for the recursive-language-model harness. Add --keep-workspace to keep the temporary workspace for debugging after a run.
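Because a run with a real model needs a provider key, it can help to fail early when the key is missing. The guard below is a minimal sketch: require_env is a hypothetical helper name, not an aec-bench command, and it checks only that the variable is exported and non-empty, not that the key is valid.

```shell
# Hypothetical helper: fail unless the named variable is exported and non-empty.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "missing or empty environment variable: $1" >&2
    return 1
  fi
}
```

A typical use is to chain it before the run, for example require_env ANTHROPIC_API_KEY && uv run aec-bench run-local followed by the task path and options shown above.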

Evaluate

Unless you opt out, the local runner imports each trial under the experiment id "local". Evaluate that experiment, optionally writing an HTML report:

uv run aec-bench evaluate --experiment local
uv run aec-bench evaluate --experiment local --report report.html

For a repeatable benchmark slice, freeze tasks into a dataset, generate an experiment config, then run that config:

uv run aec-bench dataset create \
  --name electrical-v1 \
  --version 1.0.0 \
  --domain electrical

uv run aec-bench dataset config electrical-v1@1.0.0 \
  --model claude-sonnet-4-20250514 \
  --output experiment.yaml

uv run aec-bench run --config experiment.yaml
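The three steps above can be wrapped in one function so a slice is always created, configured, and run with matching names. This is only a sketch: run_slice and the AEC_BENCH override variable are hypothetical, and the aec-bench subcommands and flags are exactly the ones shown above. It assumes a POSIX-style shell such as sh or bash, where unquoted variables are word-split.

```shell
# Hypothetical wrapper around the three commands shown above.
# Arguments: dataset name, version, domain, model, optional config path.
run_slice() {
  name="$1"
  version="$2"
  domain="$3"
  model="$4"
  config="${5:-experiment.yaml}"

  # AEC_BENCH is a hypothetical override hook; set AEC_BENCH=echo for a dry run.
  # The unquoted expansion is intentional so the command splits into words.
  set -- ${AEC_BENCH:-uv run aec-bench}

  # Stop at the first failing step.
  "$@" dataset create --name "$name" --version "$version" --domain "$domain" &&
    "$@" dataset config "${name}@${version}" --model "$model" --output "$config" &&
    "$@" run --config "$config"
}
```

With AEC_BENCH=echo set, calling run_slice electrical-v1 1.0.0 electrical claude-sonnet-4-20250514 prints the three argument lists instead of executing them, which is a cheap way to review the commands before spending model calls.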
