CLI
The aec-bench CLI generates tasks, runs experiments, exports environments, and inspects results.
The command surface is grouped by lifecycle: setup, generation, execution, analysis, review, evolution, datasets, and integrations.
uv run aec-bench --version
uv run aec-bench --help
The global flags --json and --text force machine-readable or human-readable output, respectively, where a command supports both.
Setup
init
Scaffold a project directory:
uv run aec-bench init my-bench
config
Manage user defaults:
uv run aec-bench config view
uv run aec-bench config set tasks-root tasks
uv run aec-bench config set ledger-root artefacts/ledger
uv run aec-bench config reset
Generation
generate list-templates
List built-in and user-supplied templates:
uv run aec-bench generate list-templates
uv run aec-bench generate list-templates --discipline mechanical
generate task
Generate concrete task instances from a built-in or local template:
uv run aec-bench generate task voltage-drop \
--instances 5 \
--difficulty easy,medium \
--seed 42 \
--output tasks/generated
Useful options:
| Flag | Purpose |
|---|---|
| --template PATH | Generate from a local template directory instead of a built-in name |
| --instances N | Number of instances to create |
| --difficulty easy,medium | Comma-separated difficulty filter |
| --seed N | Reproducible sampling seed |
| --tool-mode MODE | Override the template's tool mode |
| --dry-run | Show the plan without writing task instances |
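When task generation is driven from a script, it can help to build the command as an argument list instead of concatenating strings. The following is a minimal sketch, not part of aec-bench: `generate_task_argv` is a hypothetical helper, and it only uses flags listed in the table above. Whether every combination of these flags is accepted together is an assumption to check with `--help`.

```python
from typing import List, Optional, Sequence


def generate_task_argv(
    template: str,
    instances: int,
    difficulties: Sequence[str] = (),
    seed: Optional[int] = None,
    output: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build an argv list for `aec-bench generate task` (hypothetical helper)."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    argv = [
        "uv", "run", "aec-bench", "generate", "task", template,
        "--instances", str(instances),
    ]
    if difficulties:
        # The CLI takes difficulties as one comma-separated value.
        argv += ["--difficulty", ",".join(difficulties)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if output is not None:
        argv += ["--output", output]
    if dry_run:
        argv.append("--dry-run")
    return argv


example_argv = generate_task_argv(
    "voltage-drop", 5, ["easy", "medium"], seed=42,
    output="tasks/generated", dry_run=True,
)
print(" ".join(example_argv))
```

The resulting list can be passed directly to `subprocess.run(example_argv, check=True)`, which avoids shell quoting problems in template names and paths.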
generate validate-template
Validate a template directory:
uv run aec-bench generate validate-template \
src/aec_bench/templates/builtin/electrical/voltage_drop
generate suite
Generate a suite from suite.toml:
uv run aec-bench generate suite --config suite.toml --dry-run
uv run aec-bench generate suite --config suite.toml --validate-only
generate dataset is retained as a deprecated alias; prefer generate suite for generation and dataset create for freezing.
Tasks and Datasets
task validate
Check a task directory:
uv run aec-bench task validate tasks/electrical/voltage-drop
dataset
Freeze, inspect, export, import, and evaluate task snapshots:
uv run aec-bench dataset create \
--name electrical-v1 \
--version 1.0.0 \
--domain electrical
uv run aec-bench dataset config electrical-v1@1.0.0 \
--model gpt-4.1-mini \
--output experiment.yaml
uv run aec-bench dataset list
uv run aec-bench dataset info electrical-v1@1.0.0
uv run aec-bench dataset validate electrical-v1@1.0.0
uv run aec-bench dataset results electrical-v1@1.0.0
Running Experiments
run
Run an experiment manifest:
uv run aec-bench run --config experiment.yaml
Common overrides include task roots, model, harness, backend, repetitions, and --dry-run. The CLI still accepts --adapter as an alias for --harness.
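Scripts written against the older flag name can be migrated mechanically. The sketch below assumes commands are held as argv lists; `prefer_harness_flag` is a hypothetical helper, and the `direct` harness value is borrowed from the run-local example purely for illustration.

```python
from typing import List, Sequence


def prefer_harness_flag(argv: Sequence[str]) -> List[str]:
    """Rewrite the `--adapter` alias to `--harness` in an argv list.

    Only the exact flag and its `--adapter=VALUE` form are rewritten, so
    unrelated flags that merely start with the same letters, such as
    `--adapter-from-run`, are left untouched.
    """
    rewritten: List[str] = []
    for arg in argv:
        if arg == "--adapter":
            rewritten.append("--harness")
        elif arg.startswith("--adapter="):
            rewritten.append("--harness=" + arg[len("--adapter="):])
        else:
            rewritten.append(arg)
    return rewritten


legacy = ["uv", "run", "aec-bench", "run", "--config", "experiment.yaml",
          "--adapter", "direct"]
print(prefer_harness_flag(legacy))
```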
run-local
Run one task without Docker or Harbor:
uv run aec-bench run-local tasks/electrical/voltage-drop \
--model gpt-4.1-mini \
--harness direct
Useful options:
| Flag | Purpose |
|---|---|
| --output DIR | Save local run output in a specific directory |
| --keep-workspace | Preserve the temporary workspace for debugging |
| --legacy-script | Use the older standalone script runner instead of the library adapter |
| --no-import | Do not import the local result into the ledger |
| --no-normalise | Skip canonical-reference normalisation before verification |
| --constitutional-model MODEL | Override the constitutional inference model for RLM tasks that declare one |
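For a debugging pass over several tasks, the options above can be combined so that each run keeps its workspace, writes to its own directory, and stays out of the ledger. This is a sketch under assumptions: `run_local_debug_argv` and the `debug-runs` directory are hypothetical, and the per-task output layout is one possible convention, not something aec-bench prescribes.

```python
from pathlib import PurePosixPath
from typing import List


def run_local_debug_argv(task_dir: str, model: str, output_root: str) -> List[str]:
    """Build a `run-local` command for one task, tuned for debugging."""
    # Use the last path component of the task directory as the run folder name.
    task_name = PurePosixPath(task_dir).name
    return [
        "uv", "run", "aec-bench", "run-local", task_dir,
        "--model", model,
        "--harness", "direct",
        "--output", str(PurePosixPath(output_root) / task_name),
        "--keep-workspace",  # keep the temporary workspace for inspection
        "--no-import",       # do not write this run into the ledger
    ]


task_dirs = ["tasks/electrical/voltage-drop"]
commands = [run_local_debug_argv(t, "gpt-4.1-mini", "debug-runs") for t in task_dirs]
for command in commands:
    print(" ".join(command))
```

A run that turns out to be worth keeping can be added to the ledger afterwards with `import-local`.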
import, import-local, and import-prime-eval
Import completed results into the ledger:
uv run aec-bench import jobs/exp-001
uv run aec-bench import-local tasks/electrical/voltage-drop/_local_runs/run-001
uv run aec-bench import-prime-eval <prime-eval-id> --experiment prime-eval-001
import-prime-eval reads Prime hosted eval samples and writes normal aec-bench ledger artefacts, so hosted rollouts can be inspected with the same trace and behavioural report commands as local runs.
Analysis and Review
evaluate
Summarise an experiment's ledger records:
uv run aec-bench evaluate --experiment exp-001
uv run aec-bench evaluate --experiment exp-001 --report report.html
uv run aec-bench evaluate --experiment exp-001 --model gpt-4.1-mini
report
Generate focused outputs from ledger data:
uv run aec-bench report summary --experiment-id exp-001
uv run aec-bench report leaderboard
uv run aec-bench report traces --experiment-id exp-001
uv run aec-bench report behavioral --experiment-id exp-001 --classifier claude-sonnet-4-20250514
ledger
Query and export trial records:
uv run aec-bench ledger list
uv run aec-bench ledger export --output trials.jsonl
remediate
Run verifier-driven remediation on a completed run:
uv run aec-bench remediate --help
Use remediation when the verifier can point to a concrete gap and the original task artefacts are still available.
Interactive Surfaces
uv run aec-bench tui
uv run aec-bench web
The TUI is for local browsing, triage, comparison, and review. The web UI exposes the same ledger and review concepts through the FastAPI/Svelte surface in the library repo.
Use uv sync --extra webui --dev in the source checkout before launching the browser UI.
Discovery
Search templates and seed tasks:
uv run aec-bench search "voltage drop"
uv run aec-bench search "cable" --discipline electrical
Library Catalogue
Export the public task and template catalogue consumed by the website:
uv run aec-bench library export --pretty
uv run aec-bench library export --stdout --pretty
See Library Catalogue for schema and sync details.
Evolution
Single-workspace evolution:
uv run aec-bench evolve init workspaces/my-ws --name voltage-drop-evo --harness rlm
uv run aec-bench evolve run --config evolution.yaml
uv run aec-bench evolve history workspaces/my-ws
uv run aec-bench evolve rollback workspaces/my-ws evo-20260404-1220-2
Multi-agent QD swarm evolution:
uv run aec-bench swarm run swarm.yaml
uv run aec-bench swarm status <run-id>
uv run aec-bench swarm resume <run-id>
uv run aec-bench swarm history
uv run aec-bench swarm stop <run-id>
Prime Lab
Export AEC-Bench tasks as Prime/verifiers environments and run local or hosted evals:
uv run aec-bench prime doctor
uv run aec-bench prime adapters
uv run aec-bench prime export --name aec-smoke --task electrical/voltage-drop
uv run aec-bench prime push --name aec-smoke --task electrical/voltage-drop
uv run aec-bench prime eval --remote-env owner/aec-smoke --hosted --model "Qwen/Qwen3.5-4B"
uv run aec-bench prime eval --remote-env owner/aec-smoke --hosted --model "Qwen/Qwen3.5-4B" --adapter-from-run <training-run-id>
uv run aec-bench import-prime-eval <prime-eval-id> --experiment prime-smoke-import
uv run aec-bench prime train-config --environment owner/aec-smoke -o train.toml
uv run aec-bench prime train-config \
--environment owner/aec-suite \
-o filtered.toml \
--difficulty-ratio easy=0.45 \
--difficulty-ratio medium=0.40 \
--difficulty-ratio hard=0.15 \
--online-difficulty-filtering
uv run aec-bench prime train train.toml
uv run aec-bench prime smoke --name aec-smoke --task electrical/voltage-drop
See Prime Lab for export lanes, deployed adapter discovery, adapter eval, hosted training, online difficulty filtering, and hosted rollout import.
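The filtered train-config command above repeats --difficulty-ratio once per difficulty level. A small helper can generate those flags from a mapping and catch typos before the command runs. This is a sketch: `difficulty_ratio_flags` and `train_config_argv` are hypothetical names, and the check that ratios sum to 1.0 is an assumption based on the example values (0.45 + 0.40 + 0.15), not a documented requirement.

```python
import math
from typing import Dict, List


def difficulty_ratio_flags(ratios: Dict[str, float]) -> List[str]:
    """Expand {"easy": 0.45, ...} into repeated --difficulty-ratio flags."""
    if any(value < 0 for value in ratios.values()):
        raise ValueError("ratios must be non-negative")
    total = sum(ratios.values())
    # Assumption: the ratios describe a full split and should sum to 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"ratios sum to {total}, expected 1.0")
    flags: List[str] = []
    for level, value in ratios.items():
        flags += ["--difficulty-ratio", f"{level}={value:.2f}"]
    return flags


def train_config_argv(environment: str, output: str,
                      ratios: Dict[str, float]) -> List[str]:
    """Build the filtered `prime train-config` command shown above."""
    return [
        "uv", "run", "aec-bench", "prime", "train-config",
        "--environment", environment,
        "-o", output,
        *difficulty_ratio_flags(ratios),
        "--online-difficulty-filtering",
    ]


filtered = train_config_argv(
    "owner/aec-suite", "filtered.toml",
    {"easy": 0.45, "medium": 0.40, "hard": 0.15},
)
print(" ".join(filtered))
```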
Exit Codes
The CLI uses conventional exit codes:
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | User error, missing file, validation failure, or runtime failure depending on command |
| 2+ | Typer/Click argument parsing and command invocation failures |
When scripting, prefer --json where supported and branch on the returned status, data, and errors fields rather than only the numeric process code.
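A minimal sketch of that pattern follows. It assumes the --json output is a single JSON object with status, data, and errors keys, and that errors is a list; the exact schema and the possible status values are not specified here, so the sample payloads are illustrative only. `interpret_cli_result` and `describe_exit_code` are hypothetical helpers.

```python
import json
from typing import Any, Dict, Tuple


def describe_exit_code(code: int) -> str:
    """Map a process exit code to the categories in the table above."""
    if code == 0:
        return "success"
    if code == 1:
        return "command failure"
    return "argument or invocation failure"


def interpret_cli_result(returncode: int, stdout: str) -> Tuple[bool, Dict[str, Any]]:
    """Return (succeeded, envelope) for one captured CLI invocation."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # No usable JSON (for example, argument parsing failed before the
        # command ran): fall back to the exit code alone.
        return returncode == 0, {"status": None, "data": None, "errors": []}
    errors = payload.get("errors") or []
    envelope = {
        "status": payload.get("status"),
        "data": payload.get("data"),
        "errors": errors,
    }
    # Treat the run as successful only if the process exited cleanly and
    # the envelope reports no errors.
    return returncode == 0 and not errors, envelope


# Illustrative payloads; real status values may differ.
ok, envelope = interpret_cli_result(
    0, '{"status": "ok", "data": {"count": 3}, "errors": []}')
print(ok, envelope["data"])
```

In a real script, `returncode` and `stdout` would come from `subprocess.run(argv, capture_output=True, text=True)` on a command invoked with --json.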