CLI

The aec-bench CLI is the command-line interface for generating tasks, running experiments, exporting environments, and inspecting results.

The command surface is grouped by lifecycle: setup, generation, execution, analysis, review, evolution, datasets, and integrations.

uv run aec-bench --version
uv run aec-bench --help

The global flags --json and --text force machine-readable and human-readable output, respectively, on commands that support both.

Setup

init

Scaffold a project directory:

uv run aec-bench init my-bench

config

Manage user defaults:

uv run aec-bench config view
uv run aec-bench config set tasks-root tasks
uv run aec-bench config set ledger-root artefacts/ledger
uv run aec-bench config reset

Generation

generate list-templates

List built-in and user-supplied templates:

uv run aec-bench generate list-templates
uv run aec-bench generate list-templates --discipline mechanical

generate task

Generate concrete task instances from a built-in or local template:

uv run aec-bench generate task voltage-drop \
  --instances 5 \
  --difficulty easy,medium \
  --seed 42 \
  --output tasks/generated

Useful options:

| Flag | Purpose |
| --- | --- |
| --template PATH | Generate from a local template directory instead of a built-in name |
| --instances N | Number of instances to create |
| --difficulty easy,medium | Comma-separated difficulty filter |
| --seed N | Reproducible sampling seed |
| --tool-mode MODE | Override the template's tool mode |
| --dry-run | Show the plan without writing task instances |
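
When generation is driven from a script, it can help to assemble the argument list programmatically rather than by string concatenation. The following is a minimal sketch, not part of aec-bench: `build_generate_task_cmd` is a hypothetical helper that uses only the flags shown above, and it assumes `--dry-run` can be combined with the other options. It leaves out `--template` and `--tool-mode`, since their interaction with the positional template name is not described here.

```python
from __future__ import annotations

import shlex


def build_generate_task_cmd(
    template: str,
    *,
    instances: int | None = None,
    difficulties: tuple[str, ...] | list[str] = (),
    seed: int | None = None,
    output: str | None = None,
    dry_run: bool = False,
) -> list[str]:
    """Hypothetical helper: build an argv list for `aec-bench generate task`."""
    cmd = ["uv", "run", "aec-bench", "generate", "task", template]
    if instances is not None:
        cmd += ["--instances", str(instances)]
    if difficulties:
        # The CLI takes a single comma-separated value, e.g. "easy,medium".
        cmd += ["--difficulty", ",".join(difficulties)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if output is not None:
        cmd += ["--output", output]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_generate_task_cmd(
    "voltage-drop",
    instances=5,
    difficulties=["easy", "medium"],
    seed=42,
    output="tasks/generated",
    dry_run=True,
)
# Print a copy-pasteable shell line for review before running it.
print(shlex.join(cmd))
```

Returning a list keeps the command safe to pass straight to `subprocess.run` without shell quoting concerns.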

generate validate-template

Validate a template directory:

uv run aec-bench generate validate-template \
  src/aec_bench/templates/builtin/electrical/voltage_drop

generate suite

Generate a suite from suite.toml:

uv run aec-bench generate suite --config suite.toml --dry-run
uv run aec-bench generate suite --config suite.toml --validate-only

generate dataset is retained as a deprecated alias; prefer generate suite for generation and dataset create for freezing.

Tasks and Datasets

task validate

Check a task directory:

uv run aec-bench task validate tasks/electrical/voltage-drop

dataset

Freeze, inspect, export, import, and evaluate task snapshots:

uv run aec-bench dataset create \
  --name electrical-v1 \
  --version 1.0.0 \
  --domain electrical

uv run aec-bench dataset config electrical-v1@1.0.0 \
  --model gpt-4.1-mini \
  --output experiment.yaml

uv run aec-bench dataset list
uv run aec-bench dataset info electrical-v1@1.0.0
uv run aec-bench dataset validate electrical-v1@1.0.0
uv run aec-bench dataset results electrical-v1@1.0.0
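
The dataset commands above address a frozen snapshot as name@version, for example electrical-v1@1.0.0. A script that passes these references around may want to split and sanity-check them first. The sketch below is an illustration only: `DatasetRef` and `parse_dataset_ref` are hypothetical names, and the CLI's own parsing and validation rules may differ.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    """Hypothetical container for a `name@version` dataset reference."""

    name: str
    version: str


def parse_dataset_ref(ref: str) -> DatasetRef:
    """Split `name@version` on the last '@' and reject incomplete references."""
    name, sep, version = ref.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected NAME@VERSION, got {ref!r}")
    return DatasetRef(name=name, version=version)


ref = parse_dataset_ref("electrical-v1@1.0.0")
print(ref.name, ref.version)
```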

Running Experiments

run

Run an experiment manifest:

uv run aec-bench run --config experiment.yaml

Common overrides include task roots, model, harness, backend, repetitions, and --dry-run. The CLI still accepts --adapter as an alias for --harness.

run-local

Run one task without Docker or Harbor:

uv run aec-bench run-local tasks/electrical/voltage-drop \
  --model gpt-4.1-mini \
  --harness direct

Useful options:

| Flag | Purpose |
| --- | --- |
| --output DIR | Save local run output in a specific directory |
| --keep-workspace | Preserve the temporary workspace for debugging |
| --legacy-script | Use the older standalone script runner instead of the library adapter |
| --no-import | Do not import the local result into the ledger |
| --no-normalise | Skip canonical-reference normalisation before verification |
| --constitutional-model MODEL | Override the constitutional inference model for RLM tasks that declare one |

import, import-local, and import-prime-eval

Import completed results into the ledger:

uv run aec-bench import jobs/exp-001
uv run aec-bench import-local tasks/electrical/voltage-drop/_local_runs/run-001
uv run aec-bench import-prime-eval <prime-eval-id> --experiment prime-eval-001

import-prime-eval reads Prime hosted eval samples and writes normal aec-bench ledger artefacts, so hosted rollouts can be inspected with the same trace and behavioural report commands as local runs.

Analysis and Review

evaluate

Summarise an experiment's ledger records:

uv run aec-bench evaluate --experiment exp-001
uv run aec-bench evaluate --experiment exp-001 --report report.html
uv run aec-bench evaluate --experiment exp-001 --model gpt-4.1-mini

report

Generate focused outputs from ledger data:

uv run aec-bench report summary --experiment-id exp-001
uv run aec-bench report leaderboard
uv run aec-bench report traces --experiment-id exp-001
uv run aec-bench report behavioral --experiment-id exp-001 --classifier claude-sonnet-4-20250514

ledger

Query and export trial records:

uv run aec-bench ledger list
uv run aec-bench ledger export --output trials.jsonl
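
An exported file such as trials.jsonl can be loaded for ad hoc analysis. The sketch below assumes, from the .jsonl extension only, that the export holds one JSON object per line; the record schema is not documented on this page, so the code makes no assumptions about field names. `parse_trials` and `load_trials` are hypothetical helpers.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterable


def parse_trials(lines: Iterable[str]) -> list[Any]:
    """Parse JSON Lines text, skipping blank lines and reporting bad ones."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return records


def load_trials(path: str | Path) -> list[Any]:
    """Read an exported ledger file (assumed to be JSON Lines) from disk."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_trials(fh)
```

Calling `load_trials("trials.jsonl")` after the export command returns a list of records, which is a convenient starting point for inspecting the keys of the first record.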

remediate

Run verifier-driven remediation on a completed run:

uv run aec-bench remediate --help

Use remediation when the verifier can point to a concrete gap and the original task artefacts are still available.

Interactive Surfaces

uv run aec-bench tui
uv run aec-bench web

The TUI is for local browsing, triage, comparison, and review. The web UI exposes the same ledger and review concepts through the FastAPI/Svelte surface in the library repo. Before launching the browser UI, run uv sync --extra webui --dev in the source checkout to install its dependencies.

Discovery

Search templates and seed tasks:

uv run aec-bench search "voltage drop"
uv run aec-bench search "cable" --discipline electrical

Library Catalogue

Export the public task and template catalogue consumed by the website:

uv run aec-bench library export --pretty
uv run aec-bench library export --stdout --pretty

See Library Catalogue for schema and sync details.

Evolution

Single-workspace evolution:

uv run aec-bench evolve init workspaces/my-ws --name voltage-drop-evo --harness rlm
uv run aec-bench evolve run --config evolution.yaml
uv run aec-bench evolve history workspaces/my-ws
uv run aec-bench evolve rollback workspaces/my-ws evo-20260404-1220-2

Multi-agent QD swarm evolution:

uv run aec-bench swarm run swarm.yaml
uv run aec-bench swarm status <run-id>
uv run aec-bench swarm resume <run-id>
uv run aec-bench swarm history
uv run aec-bench swarm stop <run-id>

See Evolution and Swarm.

Prime Lab

Export AEC-Bench tasks as Prime/verifiers environments and run local or hosted evals:

uv run aec-bench prime doctor
uv run aec-bench prime adapters
uv run aec-bench prime export --name aec-smoke --task electrical/voltage-drop
uv run aec-bench prime push --name aec-smoke --task electrical/voltage-drop
uv run aec-bench prime eval --remote-env owner/aec-smoke --hosted --model "Qwen/Qwen3.5-4B"
uv run aec-bench prime eval --remote-env owner/aec-smoke --hosted --model "Qwen/Qwen3.5-4B" --adapter-from-run <training-run-id>
uv run aec-bench import-prime-eval <prime-eval-id> --experiment prime-smoke-import
uv run aec-bench prime train-config --environment owner/aec-smoke -o train.toml
uv run aec-bench prime train-config \
  --environment owner/aec-suite \
  -o filtered.toml \
  --difficulty-ratio easy=0.45 \
  --difficulty-ratio medium=0.40 \
  --difficulty-ratio hard=0.15 \
  --online-difficulty-filtering
uv run aec-bench prime train train.toml
uv run aec-bench prime smoke --name aec-smoke --task electrical/voltage-drop

See Prime Lab for export lanes, deployed adapter discovery, adapter eval, hosted training, online difficulty filtering, and hosted rollout import.
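
The filtered train-config example above repeats --difficulty-ratio once per level, and its three shares (0.45, 0.40, 0.15) add up to 1. When the mix is computed in a script, a small check before building the flags can catch typos early. This is a sketch under assumptions: `difficulty_ratio_args` is a hypothetical helper, and whether the CLI itself requires the shares to sum to 1 is not stated here.

```python
from __future__ import annotations

import math


def difficulty_ratio_args(ratios: dict[str, float]) -> list[str]:
    """Turn {"easy": 0.45, ...} into repeated --difficulty-ratio flags.

    Assumption (not confirmed by the CLI docs): shares are non-negative
    and should sum to 1.
    """
    if any(share < 0 for share in ratios.values()):
        raise ValueError("difficulty shares must be non-negative")
    total = sum(ratios.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"difficulty shares sum to {total}, expected 1.0")
    args: list[str] = []
    for level, share in ratios.items():
        # ":g" drops trailing zeros, so 0.40 is rendered as "0.4".
        args += ["--difficulty-ratio", f"{level}={share:g}"]
    return args


args = difficulty_ratio_args({"easy": 0.45, "medium": 0.40, "hard": 0.15})
print(" ".join(args))
```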

Exit Codes

The CLI uses conventional exit codes:

| Code | Meaning |
| --- | --- |
| 0 | Success |
| 1 | User error, missing file, validation failure, or runtime failure depending on command |
| 2+ | Typer/Click argument parsing and command invocation failures |

When scripting, prefer --json where supported and branch on the returned status, data, and errors fields rather than only the numeric process code.
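
One way to follow this advice from Python is to capture stdout and inspect the JSON payload alongside the process code. The sketch below is illustrative only: `CliResult`, `parse_cli_output`, and `run_cli` are hypothetical names, the payload is assumed to be a single JSON object carrying the status, data, and errors fields mentioned above, and the possible values of status are not documented here, so the code treats status as an opaque string.

```python
from __future__ import annotations

import json
import subprocess
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CliResult:
    """Hypothetical wrapper around an exit code plus the --json payload."""

    returncode: int
    status: str | None = None
    data: Any = None
    errors: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Conservative rule: exit code 0 and no reported errors.
        return self.returncode == 0 and not self.errors


def parse_cli_output(returncode: int, stdout: str) -> CliResult:
    """Interpret stdout as the JSON payload, falling back to the exit code."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        # For example, usage text printed when argument parsing fails.
        return CliResult(returncode=returncode)
    if not isinstance(payload, dict):
        return CliResult(returncode=returncode, data=payload)
    errors = payload.get("errors") or []
    if not isinstance(errors, list):
        errors = [errors]
    return CliResult(returncode, payload.get("status"), payload.get("data"), errors)


def run_cli(argv: list[str]) -> CliResult:
    """Run a full command line (including --json where supported)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return parse_cli_output(proc.returncode, proc.stdout)


# Made-up payload for illustration; real status values may differ.
sample = parse_cli_output(1, '{"status": "error", "data": null, "errors": ["missing file"]}')
print(sample.ok, sample.errors)
```

Keeping the parsing separate from the subprocess call makes the branching logic easy to unit test without invoking the CLI.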
