
Config

aec-bench's configuration is split across several TOML and YAML files, each living near the concept it configures.

This page is the structural reference. Narrative docs for each concept live under Core, Agents, and Advanced.

aec-bench.toml — project root

Defines paths and defaults for a project. Placed at the repo root.

[project]
name = "my-bench"

[paths]
tasks      = "tasks"
templates  = "templates"
seeds      = "seeds"
ledger     = "artefacts/ledger"
feedback   = "artefacts/feedback"
datasets   = "artefacts/datasets"
jobs       = "jobs"

[compute]
backend = "modal"

All paths are relative to the project root. Omitted keys fall back to the defaults above.

The aec-bench config CLI manages the per-user JSON defaults for paths. Project TOML can also set compute.backend; that value is read by the project loader, not by aec-bench config set.

task.toml — per task

Every task directory has one. Full schema:

version = "1.0"

[metadata]
difficulty = "easy"              # easy | medium | hard
visibility = "public"            # public | holdout
category   = "reasoning"
tags       = ["electrical", "AS-NZS-3008"]

[agent]
timeout_sec = 600.0

[verifier]
timeout_sec = 300.0

[environment]
extensions         = []
build_timeout_sec  = 600.0
cpus               = 2
memory_mb          = 4096
storage_mb         = 10240
allow_internet     = true

See Tasks for what lives alongside task.toml in a task directory.

experiment.yaml — per run

A single experiment: which tasks, which agents, how to run them.

experiment_id: exp-20260412-001
name: "Claude vs GPT electrical"
description: "Optional narrative description"

tasks:
  dataset: electrical-v1@1.0.0
  include_patterns: ["electrical/*"]
  exclude_patterns: []
  domains:      ["electrical"]
  difficulties: ["easy", "medium"]

agents:
  - name: claude-tool-loop
    harness: tool_loop
    model: claude-sonnet-4-20250514
    parameters:
      max_turns: 12
    system_prompt_file: prompts/electrical.md   # optional

  - name: gpt4-direct
    harness: direct
    model: gpt-4.1
    parameters:
      max_tokens: 8192

compute:
  backend: modal
  resource_limits:
    timeout_override: 1200
    memory_mb: 8192

repetitions: 3
disable_verification: false

The model field supports $ENV_VAR expansion. The manifest parser accepts harness as the public synonym for the internal adapter field; generated YAML may still contain adapter. The tasks selector treats dataset, patterns, domains, difficulties, and lifecycle as filters and combines them by set intersection: a task runs only if it satisfies every filter.
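Combining the selector's filters as set intersections might look like the sketch below. It is illustrative only: the task records and the `select` function are hypothetical, patterns are assumed to be glob-style and are matched with `fnmatch`, and an empty filter is assumed to impose no constraint.

```python
from fnmatch import fnmatch

# Hypothetical task records; real task metadata lives in task.toml.
TASKS = [
    {"id": "electrical/voltage-drop", "domain": "electrical", "difficulty": "easy"},
    {"id": "electrical/cable-sizing", "domain": "electrical", "difficulty": "hard"},
    {"id": "civil/beam-deflection", "domain": "civil", "difficulty": "easy"},
]


def select(tasks, include_patterns=(), exclude_patterns=(), domains=(), difficulties=()):
    """Intersect the task-id sets produced by each filter that is set."""
    ids = {t["id"] for t in tasks}
    selected = set(ids)
    if include_patterns:
        selected &= {i for i in ids if any(fnmatch(i, p) for p in include_patterns)}
    if exclude_patterns:
        selected -= {i for i in ids if any(fnmatch(i, p) for p in exclude_patterns)}
    if domains:
        selected &= {t["id"] for t in tasks if t["domain"] in domains}
    if difficulties:
        selected &= {t["id"] for t in tasks if t["difficulty"] in difficulties}
    return sorted(selected)


chosen = select(
    TASKS,
    include_patterns=["electrical/*"],
    domains=["electrical"],
    difficulties=["easy", "medium"],
)
print(chosen)  # ['electrical/voltage-drop']
```

The hard electrical task passes the pattern and domain filters but fails the difficulty filter, so the intersection drops it.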

Adapter configs

rlm.toml

Adapter-specific config for RLM runs:

[template]
tier       = "flat"              # flat | dependency_tree
definition = "report_template.toml"   # optional reference

[inputs.source_doc]
type        = "string"
source      = "fixtures/spec.md"
pre_parse   = false
description = "Reference specification"

[hints]
phases     = ["understand", "calculate", "verify"]
prohibited = ["guess", "skip_verification"]

[subcalls.extract]
enabled     = true
# custom_impl omitted: TOML has no null literal, so leave the key out to keep it unset
description = "Pull values from source doc"

[guardrails]
token_budget         = 500_000
max_iterations       = 100
max_subcall_depth    = 1
budget_warning_pct   = 80.0
max_subcalls         = 0          # 0 = unlimited
max_budget_usd       = 0.0
billable_input_budget = 0

[execution]
scaffolding              = true
compaction_threshold_pct = 0.85
hard_ceiling_pct         = 0.95
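# compaction_model omitted: TOML has no null literal; when unset, the agent's model is used
# subcall_model omitted: TOML has no null literal, so leave the key out to keep it unset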
context_limit            = 1_000_000
max_parallel_workers     = 4

[advisor]
model               = "claude-haiku-4-5-20251001"
max_uses            = 5
max_response_tokens = 500
context_window      = 10
enabled             = true

lambda-rlm.toml

[template]
tier       = "dependency_tree"
# definition omitted: TOML has no null literal, so leave the key out to keep it unset

[planner]
context_window_chars  = 100_000
accuracy_target       = 0.80
leaf_accuracy         = 0.95
compose_accuracy      = 0.90
max_branching_factor  = 20

[review]
enabled                     = true
max_retries_per_source      = 1
max_supplements_per_section = 1

[execution]
max_parallel_workers = 4

[guardrails]
token_budget = 500_000

[advisor]
model   = "claude-haiku-4-5-20251001"
enabled = true

evolution.yaml

Drives the evolution loop (see Evolution):

workspace_path: workspaces/voltage-drop-evo

models:
  classifier: env:AWS_HAIKU_MODEL_ID
  evolver:    env:AWS_SONNET_MODEL_ID

solver:
  name: solver-v1
  harness: rlm
  model: env:AWS_SONNET_MODEL_ID
  client:
    kind: bedrock
    settings: {}

generate:
  template: voltage-drop
  count: 10
  seed: 999
  difficulties: ["easy", "medium"]

tasks:
  domains: [electrical]
  include_patterns: ["electrical/*"]
  exclude_patterns: []

backend: local
batch_size: 5
max_cycles: 10
improvement_threshold: 0.01
stagnation_window: 5
timeout: 1800
harness_config: experiment.yaml      # optional

suite.toml — generated-suite config

Declares how aec-bench generate suite should select templates, allocate instances, and write generated task output:

name = "my-suite"
seed = 20260524

[coverage]
difficulties = { easy = 0.3333333333, medium = 0.3333333333, hard = 0.3333333334 }
min_tasks_per_discipline = 3

[templates]
include = ["electrical/*", "civil/*"]
user_dirs = []

[visibility]
mix = { all_given = 0.78, partial = 0.22 }

[tool_mode]
mix = { with_tool = 1.0 }

[instances]
per_task = 3
total_max = 200

[output]
dir = "tasks/generated/my-suite"

The templates.include entries match each template's logical discipline/name path, such as electrical/voltage-drop. generate suite --dry-run returns the planned counts by discipline, difficulty, visibility, and tool mode before writing task instances.

Dataset manifest

Dataset manifests are generated by aec-bench dataset create, not hand-written. The structure is still worth knowing for tooling:

{
  "name": "electrical-v1",
  "version": "1.0.0",
  "content_hash": "sha256:...",
  "created_at": "2026-04-12T00:00:00Z",
  "description": {
    "summary": "...",
    "purpose": "...",
    "standards": ["AS/NZS 3008"],
    "domains": ["electrical"],
    "difficulty_distribution": { "easy": 3, "medium": 3 },
    "task_count": 6
  },
  "tasks": [
    {
      "task_id": "electrical.voltage-drop.basic",
      "task_path": "tasks/electrical/voltage-drop/basic",
      "content_hash": "sha256:...",
      "domain": "electrical",
      "difficulty": "easy",
      "tags": ["deterministic", "AS-NZS-3008"]
    }
  ],
  "source": {
    "method": "manual",
    "suite_config": {},
    "seed": null
  }
}

See Datasets for the content hashing and versioning policy.
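Tooling that consumes a manifest can cross-check the description block against the task list. The sketch below assumes only the field names shown in the structure above; `check_description` is a hypothetical helper, and the two-task manifest is made-up sample data.

```python
import json
from collections import Counter

# Made-up sample manifest using the field names shown above.
manifest = json.loads("""
{
  "name": "electrical-v1",
  "version": "1.0.0",
  "description": {
    "difficulty_distribution": { "easy": 1, "medium": 1 },
    "task_count": 2
  },
  "tasks": [
    { "task_id": "electrical.voltage-drop.basic", "difficulty": "easy" },
    { "task_id": "electrical.voltage-drop.long-run", "difficulty": "medium" }
  ]
}
""")


def check_description(manifest: dict) -> list[str]:
    """Hypothetical check: compare the description summary with the task list."""
    tasks = manifest["tasks"]
    description = manifest["description"]
    problems = []
    if description["task_count"] != len(tasks):
        problems.append(
            f"task_count is {description['task_count']}, found {len(tasks)} tasks"
        )
    counted = dict(Counter(t["difficulty"] for t in tasks))
    if description["difficulty_distribution"] != counted:
        problems.append(f"difficulty_distribution differs from counted {counted}")
    return problems


print(check_description(manifest))  # []
```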

swarm.yaml

Swarm config extends the evolution workspace model with parallel agents, shared budget, and quality-diversity archive settings:

task:
  workspace: ./workspaces/my-swarm
  task_path: tasks/electrical/voltage-drop

agents:
  count: 4
  default_model: au.anthropic.claude-sonnet-4-6

budget:
  max_cost_usd: 20.0
  eval_budget_usd: 5.0
  wind_down_threshold: 0.8
  final_threshold: 0.95

evaluation:
  timeout: 300
  backend: local

evolution:
  batch_size: 1
  improvement_threshold: 0.01

heartbeat:
  pivot_after: 5

See Swarm for the runtime model and event log.

Prime environment args

Prime export commands take selection flags directly. Hosted eval and training commands can forward environment arguments into load_environment(...):

uv run aec-bench prime eval \
  --remote-env owner/aec-suite \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --split eval \
  --difficulty medium \
  --harness stateful \
  --env-num-examples 10 \
  --seed 20260509 \
  --env-arg max_turns=20

Values passed through --env-arg KEY=VALUE are parsed as JSON where possible, so numbers and booleans keep their types.
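The "JSON where possible" rule can be illustrated with a short parser. This is a sketch of the described behaviour, not the CLI's actual code: `parse_env_arg` is a hypothetical name, and falling back to the raw string when JSON decoding fails is an assumption.

```python
import json


def parse_env_arg(arg: str):
    """Split KEY=VALUE and decode VALUE as JSON, falling back to a string."""
    key, sep, raw = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw  # not valid JSON, so keep the text as-is
    return key, value


print(parse_env_arg("max_turns=20"))   # ('max_turns', 20)
print(parse_env_arg("verbose=true"))   # ('verbose', True)
print(parse_env_arg("mode=fast"))      # ('mode', 'fast')
```

Under this rule `true` becomes a boolean, while `True` is not valid JSON and would stay a string.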

Generated Prime environments apply task selection in this order: difficulty filter, deterministic split selection, optional seeded shuffle, then num_examples. Use split=all when you intentionally want the full exported slice.
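The selection order above can be sketched as a four-step pipeline. Everything except the order of the steps is an assumption: the task records are made up, `select_examples` is a hypothetical function, and the hash-based split rule is only a placeholder for whatever deterministic split the generated environment really uses.

```python
import hashlib
import random

# Made-up task records for illustration.
TASKS = [
    {"id": f"electrical/task-{n}", "difficulty": d}
    for n, d in enumerate(["easy", "medium", "medium", "hard", "medium", "easy"])
]


def in_split(task_id: str, split: str) -> bool:
    """Placeholder deterministic split: one hash bucket in five goes to eval."""
    if split == "all":
        return True
    if split not in ("train", "eval"):
        raise ValueError(f"unknown split {split!r}")
    bucket = hashlib.sha256(task_id.encode()).digest()[0] % 5
    return (bucket == 0) == (split == "eval")


def select_examples(tasks, difficulty=None, split="all", seed=None, num_examples=None):
    # 1. Difficulty filter.
    selected = [t for t in tasks if difficulty is None or t["difficulty"] == difficulty]
    # 2. Deterministic split selection.
    selected = [t for t in selected if in_split(t["id"], split)]
    # 3. Optional seeded shuffle.
    if seed is not None:
        random.Random(seed).shuffle(selected)
    # 4. Truncate to num_examples last, so the cap applies to the shuffled slice.
    if num_examples is not None:
        selected = selected[:num_examples]
    return selected


picked = select_examples(TASKS, difficulty="medium", split="all", seed=20260509, num_examples=2)
print([t["id"] for t in picked])
```

Because truncation comes last, changing the seed changes which examples survive the cap, while the same seed always yields the same slice.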

configs/endpoints.toml

In Prime Lab workspaces, endpoint aliases live in configs/endpoints.toml and are referenced by Prime commands and generated experiment configs. Keep provider credentials in environment variables or the provider's own auth store, not in this file.
