Config
aec-bench's configuration is split across several TOML and YAML files, each living near the concept it configures.
This page is the structural reference. Narrative docs for each concept live under Core, Agents, and Advanced.
aec-bench.toml — project root
Defines paths and defaults for a project. Placed at the repo root.
[project]
name = "my-bench"
[paths]
tasks = "tasks"
templates = "templates"
seeds = "seeds"
ledger = "artefacts/ledger"
feedback = "artefacts/feedback"
datasets = "artefacts/datasets"
jobs = "jobs"
[compute]
backend = "modal"
All paths are relative to the project root. Omitted keys fall back to the defaults above.
The aec-bench config CLI manages the per-user JSON defaults for paths. Project TOML can also set compute.backend; that value is read by the project loader, not by aec-bench config set.
task.toml — per task
Every task directory has one. Full schema:
version = "1.0"
[metadata]
difficulty = "easy" # easy | medium | hard
visibility = "public" # public | holdout
category = "reasoning"
tags = ["electrical", "AS-NZS-3008"]
[agent]
timeout_sec = 600.0
[verifier]
timeout_sec = 300.0
[environment]
extensions = []
build_timeout_sec = 600.0
cpus = 2
memory_mb = 4096
storage_mb = 10240
allow_internet = true
See Tasks for what lives alongside task.toml in a task directory.
experiment.yaml — per run
A single experiment: which tasks, which agents, how to run them.
experiment_id: exp-20260412-001
name: "Claude vs GPT electrical"
description: "Optional narrative description"
tasks:
  dataset: electrical-v1@1.0.0
  include_patterns: ["electrical/*"]
  exclude_patterns: []
  domains: ["electrical"]
  difficulties: ["easy", "medium"]
agents:
  - name: claude-tool-loop
    harness: tool_loop
    model: claude-sonnet-4-20250514
    parameters:
      max_turns: 12
      system_prompt_file: prompts/electrical.md # optional
  - name: gpt4-direct
    harness: direct
    model: gpt-4.1
    parameters:
      max_tokens: 8192
compute:
  backend: modal
  resource_limits:
    timeout_override: 1200
    memory_mb: 8192
repetitions: 3
disable_verification: false
The model field supports $ENV_VAR expansion. The manifest parser accepts harness as the public synonym for the internal adapter field; generated YAML may still contain adapter. The tasks selector filters by dataset, patterns, domain, difficulty, and lifecycle as set intersections.
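The three behaviours described here (environment expansion, the harness/adapter synonym, and intersecting filters) can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: `normalise_agent` and `select_tasks` are hypothetical names, `os.path.expandvars` stands in for whatever expansion rules the real parser applies, and only two of the five filters are shown.

```python
import os


def normalise_agent(entry: dict) -> dict:
    """Map the public `harness` key to `adapter` and expand $ENV_VAR in `model`."""
    agent = dict(entry)
    if "harness" in agent:
        agent["adapter"] = agent.pop("harness")
    agent["model"] = os.path.expandvars(agent["model"])
    return agent


def select_tasks(tasks: list[dict], domains=None, difficulties=None) -> list[str]:
    """Keep only tasks that pass every filter that is set (set intersection)."""
    selected = {t["task_id"] for t in tasks}
    if domains is not None:
        selected &= {t["task_id"] for t in tasks if t["domain"] in domains}
    if difficulties is not None:
        selected &= {t["task_id"] for t in tasks if t["difficulty"] in difficulties}
    return sorted(selected)


os.environ["AEC_DEMO_MODEL"] = "claude-sonnet-4-20250514"
agent = normalise_agent(
    {"name": "claude-tool-loop", "harness": "tool_loop", "model": "$AEC_DEMO_MODEL"}
)
catalogue = [
    {"task_id": "electrical.a", "domain": "electrical", "difficulty": "easy"},
    {"task_id": "electrical.b", "domain": "electrical", "difficulty": "hard"},
    {"task_id": "civil.a", "domain": "civil", "difficulty": "easy"},
]
print(agent)
print(select_tasks(catalogue, domains={"electrical"}, difficulties={"easy", "medium"}))
```

Because the filters intersect, adding a filter can only narrow the selection; a task in the right domain but with the wrong difficulty is dropped.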
Adapter configs
rlm.toml
Adapter-specific config for RLM runs:
[template]
tier = "flat" # flat | dependency_tree
definition = "report_template.toml" # optional reference
[inputs.source_doc]
type = "string"
source = "fixtures/spec.md"
pre_parse = false
description = "Reference specification"
[hints]
phases = ["understand", "calculate", "verify"]
prohibited = ["guess", "skip_verification"]
[subcalls.extract]
enabled = true
# custom_impl: unset by default; omit the key (TOML has no null literal)
description = "Pull values from source doc"
[guardrails]
token_budget = 500_000
max_iterations = 100
max_subcall_depth = 1
budget_warning_pct = 80.0
max_subcalls = 0 # 0 = unlimited
max_budget_usd = 0.0
billable_input_budget = 0
[execution]
scaffolding = true
compaction_threshold_pct = 0.85
hard_ceiling_pct = 0.95
# compaction_model: omit the key to use the agent's model (TOML has no null literal)
# subcall_model: unset by default; omit the key
context_limit = 1_000_000
max_parallel_workers = 4
[advisor]
model = "claude-haiku-4-5-20251001"
max_uses = 5
max_response_tokens = 500
context_window = 10
enabled = true
lambda-rlm.toml
[template]
tier = "dependency_tree"
# definition: unset by default; omit the key (TOML has no null literal)
[planner]
context_window_chars = 100_000
accuracy_target = 0.80
leaf_accuracy = 0.95
compose_accuracy = 0.90
max_branching_factor = 20
[review]
enabled = true
max_retries_per_source = 1
max_supplements_per_section = 1
[execution]
max_parallel_workers = 4
[guardrails]
token_budget = 500_000
[advisor]
model = "claude-haiku-4-5-20251001"
enabled = true
evolution.yaml
Drives the evolution loop (see Evolution):
workspace_path: workspaces/voltage-drop-evo
models:
  classifier: env:AWS_HAIKU_MODEL_ID
  evolver: env:AWS_SONNET_MODEL_ID
solver:
  name: solver-v1
  harness: rlm
  model: env:AWS_SONNET_MODEL_ID
client:
  kind: bedrock
  settings: {}
generate:
  template: voltage-drop
  count: 10
  seed: 999
  difficulties: ["easy", "medium"]
tasks:
  domains: [electrical]
  include_patterns: ["electrical/*"]
  exclude_patterns: []
backend: local
batch_size: 5
max_cycles: 10
improvement_threshold: 0.01
stagnation_window: 5
timeout: 1800
harness_config: experiment.yaml # optional
suite.toml — generated-suite config
Declares how aec-bench generate suite should select templates, allocate instances, and write generated task output:
name = "my-suite"
seed = 20260524
[coverage]
difficulties = { easy = 0.3333333333, medium = 0.3333333333, hard = 0.3333333334 }
min_tasks_per_discipline = 3
[templates]
include = ["electrical/*", "civil/*"]
user_dirs = []
[visibility]
mix = { all_given = 0.78, partial = 0.22 }
[tool_mode]
mix = { with_tool = 1.0 }
[instances]
per_task = 3
total_max = 200
[output]
dir = "tasks/generated/my-suite"
The templates.include entries match each template's logical discipline/name path, such as electrical/voltage-drop. generate suite --dry-run returns the planned counts by discipline, difficulty, visibility, and tool mode before writing task instances.
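To get a feel for how a fractional mix turns into whole-number counts, here is a minimal Python sketch. It assumes largest-remainder rounding, which is one common choice; this page does not state which rounding rule generate suite actually uses, and `allocate` is a hypothetical helper.

```python
def allocate(total: int, mix: dict[str, float]) -> dict[str, int]:
    """Split `total` across keys in proportion to `mix` (largest-remainder rounding)."""
    weight_sum = sum(mix.values())
    exact = {key: total * weight / weight_sum for key, weight in mix.items()}
    counts = {key: int(value) for key, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining slots to the keys with the largest fractional parts.
    by_remainder = sorted(mix, key=lambda key: exact[key] - counts[key], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


difficulty_counts = allocate(
    10, {"easy": 0.3333333333, "medium": 0.3333333333, "hard": 0.3333333334}
)
visibility_counts = allocate(200, {"all_given": 0.78, "partial": 0.22})
print(difficulty_counts)
print(visibility_counts)
```

Whatever rule the generator applies, comparing a hand calculation like this against the --dry-run output is a cheap way to confirm the mix before any task instances are written.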
Dataset manifest
Dataset manifests are generated by aec-bench dataset create, not hand-written. The structure is still worth knowing for tooling:
{
  "name": "electrical-v1",
  "version": "1.0.0",
  "content_hash": "sha256:...",
  "created_at": "2026-04-12T00:00:00Z",
  "description": {
    "summary": "...",
    "purpose": "...",
    "standards": ["AS/NZS 3008"],
    "domains": ["electrical"],
    "difficulty_distribution": { "easy": 3, "medium": 3 },
    "task_count": 6
  },
  "tasks": [
    {
      "task_id": "electrical.voltage-drop.basic",
      "task_path": "tasks/electrical/voltage-drop/basic",
      "content_hash": "sha256:...",
      "domain": "electrical",
      "difficulty": "easy",
      "tags": ["deterministic", "AS-NZS-3008"]
    }
  ],
  "source": {
    "method": "manual",
    "suite_config": {},
    "seed": null
  }
}
See Datasets for the content hashing and versioning policy.
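For tooling that consumes a manifest, one useful sanity check is to recompute the summary fields from the tasks array. The sketch below assumes task_count and difficulty_distribution are meant to agree with that array (the sample above abbreviates the array to one entry, so it would not pass as printed); `summarise_manifest` and `mismatches` are hypothetical helpers, not part of aec-bench.

```python
import json
from collections import Counter


def summarise_manifest(manifest: dict) -> dict:
    """Recompute task_count and difficulty_distribution from the tasks array."""
    tasks = manifest["tasks"]
    return {
        "task_count": len(tasks),
        "difficulty_distribution": dict(Counter(t["difficulty"] for t in tasks)),
    }


def mismatches(manifest_text: str) -> list[str]:
    """List description fields that disagree with the tasks array."""
    manifest = json.loads(manifest_text)
    recomputed = summarise_manifest(manifest)
    described = manifest["description"]
    return [key for key, value in recomputed.items() if described.get(key) != value]


sample = json.dumps({
    "description": {"difficulty_distribution": {"easy": 1, "medium": 1}, "task_count": 3},
    "tasks": [
        {"task_id": "electrical.a", "difficulty": "easy"},
        {"task_id": "electrical.b", "difficulty": "medium"},
    ],
})
print(mismatches(sample))
```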
swarm.yaml
Swarm config extends the evolution workspace model with parallel agents, shared budget, and quality-diversity archive settings:
task:
  workspace: ./workspaces/my-swarm
  task_path: tasks/electrical/voltage-drop
agents:
  count: 4
  default_model: au.anthropic.claude-sonnet-4-6
budget:
  max_cost_usd: 20.0
  eval_budget_usd: 5.0
  wind_down_threshold: 0.8
  final_threshold: 0.95
evaluation:
  timeout: 300
  backend: local
evolution:
  batch_size: 1
  improvement_threshold: 0.01
heartbeat:
  pivot_after: 5
See Swarm for the runtime model and event log.
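One plausible reading of the two budget thresholds is that they are fractions of max_cost_usd at which the swarm changes behaviour. The Python sketch below encodes that reading only; it is an assumption, the phase names are invented for illustration, and the Swarm page remains the authority on what the runtime actually does.

```python
def budget_phase(
    spent_usd: float,
    max_cost_usd: float,
    wind_down_threshold: float = 0.8,
    final_threshold: float = 0.95,
) -> str:
    """Classify spend as a fraction of the cap (assumed meaning of the thresholds)."""
    fraction = spent_usd / max_cost_usd
    if fraction >= final_threshold:
        return "final"      # hypothetical phase name
    if fraction >= wind_down_threshold:
        return "wind_down"  # hypothetical phase name
    return "explore"        # hypothetical phase name


for spent in (10.0, 17.0, 19.5):
    print(spent, budget_phase(spent, max_cost_usd=20.0))
```

Under this reading, the sample config would start winding down at 16 USD of spend and enter its final phase at 19 USD.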
Prime environment args
Prime export commands take selection flags directly. Hosted eval and training commands can forward environment arguments into load_environment(...):
uv run aec-bench prime eval \
  --remote-env owner/aec-suite \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --split eval \
  --difficulty medium \
  --harness stateful \
  --env-num-examples 10 \
  --seed 20260509 \
  --env-arg max_turns=20
Values passed through --env-arg KEY=VALUE are parsed as JSON where possible, so numbers and booleans keep their types.
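The "JSON where possible" rule is easy to picture in Python. This is a sketch of the general technique rather than the CLI's actual parser; `parse_env_arg` is a hypothetical name, and the choice to fall back to the raw string for non-JSON values is an assumption.

```python
import json


def parse_env_arg(raw: str) -> tuple[str, object]:
    """Split KEY=VALUE and decode VALUE as JSON, falling back to the raw string."""
    key, sep, value = raw.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {raw!r}")
    try:
        return key, json.loads(value)
    except json.JSONDecodeError:
        # Not valid JSON (for example a bare word), so keep it as a string.
        return key, value


for raw in ("max_turns=20", "shuffle=true", "split=eval", 'tags=["a","b"]'):
    print(parse_env_arg(raw))
```

With this approach, `max_turns=20` arrives as an integer and `shuffle=true` as a boolean, while a bare word such as `eval` stays a string. To force a numeric-looking value to stay a string, quote it as JSON, taking care that the shell passes the inner quotes through.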
Generated Prime environments apply task selection in this order: difficulty filter, deterministic split selection, optional seeded shuffle, then num_examples. Use split=all when you intentionally want the full exported slice.
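The ordering matters because truncation happens last: num_examples is taken from whatever survives the earlier steps. A minimal Python sketch of the four stages follows. It assumes each exported task already carries a `split` label and a `difficulty` field; `select_examples` and those field names are hypothetical, and the generated environment code may differ in detail.

```python
import random


def select_examples(tasks, difficulty=None, split="all", seed=None, num_examples=None):
    """Apply the documented order: difficulty, split, seeded shuffle, num_examples."""
    # 1. Difficulty filter.
    rows = [t for t in tasks if difficulty is None or t["difficulty"] == difficulty]
    # 2. Deterministic split selection (split="all" keeps the full slice).
    if split != "all":
        rows = [t for t in rows if t["split"] == split]
    # 3. Optional seeded shuffle on a copy, so the input order is untouched.
    if seed is not None:
        rows = rows[:]
        random.Random(seed).shuffle(rows)
    # 4. Truncate to num_examples.
    return rows if num_examples is None else rows[:num_examples]


demo_tasks = [
    {
        "id": i,
        "difficulty": "medium" if i % 2 else "easy",
        "split": "eval" if i < 6 else "train",
    }
    for i in range(10)
]
picked = select_examples(
    demo_tasks, difficulty="medium", split="eval", seed=20260509, num_examples=2
)
print([t["id"] for t in picked])
```

Running the same call twice with the same seed yields the same two tasks, which is what makes a seeded eval slice reproducible.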
configs/endpoints.toml
In Prime Lab workspaces, endpoint aliases live in configs/endpoints.toml and are referenced by Prime commands and generated experiment configs. Keep provider credentials in environment variables or the provider's own auth store, not in this file.