Templates

A template defines a parameterised engineering problem family that generates concrete task instances through parameter sampling and Jinja2 rendering.

How templates work

A template is a reusable engineering problem family. It samples realistic parameters, renders an instruction, computes ground truth, and scaffolds complete task instances that can be validated, frozen into datasets, and run through the same benchmark pipeline as hand-authored tasks.

The current library catalogue contains 184 built templates across five disciplines, plus proposed seed tasks that have not yet been converted into deterministic templates.

| Discipline | Built templates | Proposed seeds | Typical coverage |
|------------|-----------------|----------------|------------------|
| Civil | 57 | 30 | Hydrology, hydraulics, transport geometry, coastal, drainage, wind and load derivations |
| Electrical | 52 | 92 | Cable sizing, PV, grounding, arc flash, busbar, thermal rating, short-circuit |
| Ground | 10 | 3 | Bearing capacity, settlement, CPT/SPT interpretation, slope and retaining-wall checks |
| Mechanical | 50 | 92 | HVAC, pumps, fire services, process calculations, acoustics, vibration, wastewater |
| Structural | 15 | 67 | Marine, concrete, structural fire, load combinations, movement and connection checks |

The catalogue used by the site is generated by aec-bench library export; see Library Catalogue.

Template anatomy

Each built-in template is a directory under src/aec_bench/templates/builtin/<discipline>/<template>/:

src/aec_bench/templates/builtin/electrical/voltage_drop/
├── params.toml       # metadata, parameters, archetypes, difficulty presets
├── instruction.md    # Jinja2 template for the problem statement
├── engine.py         # pure ground-truth computation
└── __init__.py

The three required files are the contract:

| File | Role |
|------|------|
| params.toml | Declares metadata, inputs, sampling ranges, archetypes, outputs, tolerances, and difficulty presets |
| instruction.md | Renders the task prompt from sampled parameters and visibility rules |
| engine.py | Computes expected outputs from sampled parameters for verifier and fixture generation |

params.toml

params.toml is the public contract for a template:

params.toml
[meta]
name = "voltage-drop"
description = "Cable voltage drop calculation per AS/NZS 3008.1.1"
discipline = "electrical"
category = "cable-sizing"
standards = ["AS/NZS 3008.1.1"]
tool_mode = "with-tool"

[params.cable_size_mm2]
type = "enum"
unit = "mm²"
description = "Cable conductor cross-sectional area"
values = ["1.5", "2.5", "4", "6", "10", "16", "25", "35", "50", "70", "95", "120", "150", "185", "240"]

[params.length_m]
type = "float"
unit = "m"
description = "Cable route length (one way)"
min = 1
max = 500

[params.load_current_a]
type = "float"
unit = "A"
description = "Design load current"
min = 0.5
max = 500

[params.power_factor]
type = "float"
description = "Load power factor"
min = 0.5
max = 1.0
default = 0.8

[params.conductor_material]
type = "enum"
description = "Conductor material"
values = ["copper", "aluminium"]
derivable_from = "archetype"

[params.circuit_type]
type = "enum"
description = "Circuit type"
values = ["single_phase", "three_phase"]

Supported parameter types include float, int, and enum. Templates can also use archetype-derived values so generated cases remain realistic rather than random-but-implausible.
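
As a rough illustration of how declarations like these could drive sampling, the sketch below draws one value per parameter from a seeded generator. The `sample_param` and `sample_all` helpers and the in-memory dict layout are hypothetical stand-ins, not aec-bench's actual sampler; only the `type`, `min`, `max`, and `values` keys come from the params.toml shown above.

```python
import random
from typing import Any

# Hypothetical in-memory mirror of a few [params.*] tables from params.toml.
PARAMS: dict[str, dict[str, Any]] = {
    "cable_size_mm2": {"type": "enum", "values": ["1.5", "2.5", "4", "6", "10"]},
    "length_m": {"type": "float", "min": 1, "max": 500},
    "load_current_a": {"type": "float", "min": 0.5, "max": 500},
    "circuit_type": {"type": "enum", "values": ["single_phase", "three_phase"]},
}


def sample_param(spec: dict[str, Any], rng: random.Random) -> Any:
    """Draw one value according to the declared parameter type."""
    kind = spec["type"]
    if kind == "enum":
        return rng.choice(spec["values"])
    if kind == "float":
        return rng.uniform(spec["min"], spec["max"])
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])
    raise ValueError(f"unsupported parameter type: {kind!r}")


def sample_all(params: dict[str, dict[str, Any]], seed: int) -> dict[str, Any]:
    """Sample every parameter from one seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return {name: sample_param(spec, rng) for name, spec in params.items()}


if __name__ == "__main__":
    print(sample_all(PARAMS, seed=42))
```

Sampling every parameter independently like this is exactly what archetypes are meant to constrain.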

Archetypes

Archetypes bundle values that should move together:

params.toml
[archetypes.sydney_suburban_lighting]
description = "Suburban lighting circuit with moderate route length"
site_contexts = ["sydney-suburban", "melbourne-suburban"]
length_m = { min = 5, max = 30 }
load_current_a = { min = 1, max = 10 }

This matters in AEC tasks because sampling each input independently often produces physically implausible combinations. A soil profile, hydraulic duty point, cable route, or structural load case usually has correlated values.
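
One way to picture an archetype is as a set of overrides that narrow the base ranges before sampling. The sketch below assumes that interpretation; `apply_archetype` is a hypothetical helper, and the rule that an archetype range must sit inside the base range is an assumption rather than documented behaviour.

```python
from typing import Any

# Base ranges as declared under [params.*] in the example above.
BASE: dict[str, dict[str, Any]] = {
    "length_m": {"type": "float", "min": 1, "max": 500},
    "load_current_a": {"type": "float", "min": 0.5, "max": 500},
}

# Mirror of the [archetypes.sydney_suburban_lighting] table.
ARCHETYPE: dict[str, Any] = {
    "description": "Suburban lighting circuit with moderate route length",
    "length_m": {"min": 5, "max": 30},
    "load_current_a": {"min": 1, "max": 10},
}


def apply_archetype(
    base: dict[str, dict[str, Any]], archetype: dict[str, Any]
) -> dict[str, dict[str, Any]]:
    """Return a copy of the base specs with archetype ranges applied."""
    narrowed: dict[str, dict[str, Any]] = {}
    for name, spec in base.items():
        override = archetype.get(name)
        if isinstance(override, dict):
            lo, hi = override["min"], override["max"]
            # Assumed rule: an archetype may only narrow a range, never widen it.
            if lo < spec["min"] or hi > spec["max"]:
                raise ValueError(f"archetype range for {name} exceeds base range")
            narrowed[name] = {**spec, "min": lo, "max": hi}
        else:
            # Parameters the archetype does not mention keep their base spec.
            narrowed[name] = dict(spec)
    return narrowed
```

Non-parameter keys such as `description` are ignored here because they never match a parameter name.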

Difficulty presets

Difficulty controls which archetypes can be sampled and how much information is visible:

params.toml
[difficulty.easy]
description = "All calculation inputs are visible"
visibility = "all_given"
archetypes = ["residential_lighting", "residential_power"]

[difficulty.hard]
description = "Some inputs must be inferred from the scenario"
visibility = "partial"
archetypes = ["commercial_submain", "industrial_feeder"]
hidden_params = ["conductor_material"]
replacement_text = "Use the stated project context to select a suitable conductor material."

The built-in convention is:

| Difficulty | Expected shape |
|------------|----------------|
| easy | Direct calculation with all or nearly all values visible |
| medium | More steps, more distractors, or a modest inference |
| hard | Wider context, hidden values, richer unit handling, or more opportunities for wrong assumptions |

instruction.md

Instructions are Jinja2 templates:

instruction.md
## Given

| Parameter | Value | Unit |
|-----------|-------|------|
| Cable size | {{ cable_size_mm2 }} | mm² |
| Cable route length | {{ length_m }} | m |
| Design load current | {{ load_current_a }} | A |
| Power factor | {{ power_factor }} | - |
{% if conductor_material is defined %}
| Conductor material | {{ conductor_material }} | - |
{% endif %}
| Circuit type | {{ circuit_type }} | - |

## Required

Calculate the voltage drop percentage and state whether it is within the allowable limit.

Difficulty visibility decides which variables are rendered into the prompt. The hidden values still exist for the engine and verifier.
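
The split between what the prompt sees and what the engine sees can be sketched as follows. `split_for_difficulty` is a hypothetical helper written for illustration; the only behaviour taken from this page is that hidden parameters are left out of the rendered prompt while remaining available to the engine and verifier.

```python
from typing import Any


def split_for_difficulty(
    sampled: dict[str, Any], hidden_params: list[str]
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (prompt_context, engine_inputs) for one sampled instance."""
    hidden = set(hidden_params)
    unknown = hidden - sampled.keys()
    if unknown:
        raise KeyError(f"hidden_params not present in sample: {sorted(unknown)}")
    # Hidden names are dropped entirely, so a Jinja2 guard such as
    # `{% if conductor_material is defined %}` evaluates to false.
    prompt_context = {k: v for k, v in sampled.items() if k not in hidden}
    # The engine always receives the full sampled set.
    engine_inputs = dict(sampled)
    return prompt_context, engine_inputs


if __name__ == "__main__":
    sampled = {"length_m": 42.0, "conductor_material": "aluminium"}
    context, inputs = split_for_difficulty(sampled, ["conductor_material"])
    print(sorted(context), sorted(inputs))
```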

engine.py

The engine is intentionally small and deterministic:

engine.py
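def compute(
    cable_size_mm2: str,
    length_m: float,
    load_current_a: float,
    power_factor: float,
    conductor_material: str = "copper",
    circuit_type: str = "single_phase",
) -> dict[str, float]:
    """Return the expected outputs for one sampled parameter set."""
    # Placeholders: the lookup and arithmetic are omitted in this excerpt.
    vc_mv_per_a_m = ...     # voltage-drop factor for the cable, in mV/A·m
    voltage_drop_v = ...    # absolute voltage drop, in volts
    voltage_drop_pct = ...  # voltage drop as a percentage
    return {
        "vc_mv_per_a_m": vc_mv_per_a_m,
        "voltage_drop_v": voltage_drop_v,
        "voltage_drop_percent": voltage_drop_pct,
        # The pass/fail flag is encoded as a float so every output is numeric.
        "compliant": 1.0 if voltage_drop_pct <= 5.0 else 0.0,
    }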

Good engines are pure functions over sampled parameters. They should not call model APIs, inspect the generated prompt, depend on local machine state, or rely on unstated intent or prose-only judgement.
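
To make the shape of a pure engine concrete, here is a minimal sketch that is not the built-in voltage-drop engine. It assumes the voltage-drop factor is supplied directly instead of being looked up from cable size and material, ignores power factor, and assumes nominal supply voltages of 230 V (single-phase) and 400 V (three-phase). The function name `compute_sketch` and the factor used in the example call are illustrative only and not taken from any standard table.

```python
# Assumed nominal supply voltages; not taken from the template itself.
NOMINAL_VOLTAGE_V = {"single_phase": 230.0, "three_phase": 400.0}


def compute_sketch(
    vc_mv_per_a_m: float,
    length_m: float,
    load_current_a: float,
    circuit_type: str = "single_phase",
    limit_percent: float = 5.0,
) -> dict[str, float]:
    """Pure function: same inputs always give the same outputs."""
    # The factor is in millivolts per ampere-metre, so divide by 1000 for volts.
    voltage_drop_v = vc_mv_per_a_m * length_m * load_current_a / 1000.0
    voltage_drop_percent = 100.0 * voltage_drop_v / NOMINAL_VOLTAGE_V[circuit_type]
    return {
        "vc_mv_per_a_m": vc_mv_per_a_m,
        "voltage_drop_v": voltage_drop_v,
        "voltage_drop_percent": voltage_drop_percent,
        "compliant": 1.0 if voltage_drop_percent <= limit_percent else 0.0,
    }


if __name__ == "__main__":
    # Illustrative factor of 10 mV/A·m over 20 m at 10 A gives a 2.0 V drop.
    print(compute_sketch(vc_mv_per_a_m=10.0, length_m=20.0, load_current_a=10.0))
```

Because the function touches no files, clocks, or network, calling it twice with the same arguments is a cheap determinism check.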

Generating instances

Generate concrete tasks from a built-in template:

uv run aec-bench generate task voltage-drop \
  --instances 5 \
  --difficulty easy,medium \
  --seed 42 \
  --output tasks/generated

List and filter the catalogue:

uv run aec-bench generate list-templates
uv run aec-bench generate list-templates --discipline structural

Validate a custom template before using it:

uv run aec-bench generate validate-template ./my-template

Generate a configured suite:

uv run aec-bench generate suite --config suite.toml --dry-run
uv run aec-bench generate suite --config suite.toml

Each generated instance records its template name, sampled values, difficulty, and seed so the task can be reproduced.
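
The reproducibility claim can be illustrated with a small record-and-replay sketch. The record layout, the field names, and the `sample`, `make_record`, and `replay_matches` helpers are all hypothetical; the actual on-disk format written by aec-bench is not shown on this page.

```python
import json
import random
from typing import Any


def sample(seed: int) -> dict[str, float]:
    """Toy sampler standing in for the template's real parameter sampling."""
    rng = random.Random(seed)
    return {
        "length_m": round(rng.uniform(5, 30), 2),
        "load_current_a": round(rng.uniform(1, 10), 2),
    }


def make_record(template: str, difficulty: str, seed: int) -> dict[str, Any]:
    """Bundle everything needed to regenerate the same instance later."""
    return {
        "template": template,
        "difficulty": difficulty,
        "seed": seed,
        "params": sample(seed),
    }


def replay_matches(record: dict[str, Any]) -> bool:
    """Re-sample from the stored seed and compare with the stored values."""
    return sample(record["seed"]) == record["params"]


if __name__ == "__main__":
    record = make_record("voltage-drop", "easy", seed=42)
    restored = json.loads(json.dumps(record))  # simulate a save and reload
    print(replay_matches(restored))
```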

Built-in template scope

The built-in templates are strongest when the engineering contract is explicit:

  • Deterministic calculations with numeric or categorical inputs
  • Stable formulae, embedded lookup tables, or clearly bounded reductions of a design method
  • Outputs with concrete tolerances
  • Verifiers that can score mechanically without relying on unstated intent or prose-only judgement
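
Mechanical scoring against concrete tolerances could look like the sketch below. It assumes a per-output relative tolerance and treats a missing tolerance as an exact match; `score_outputs` and the dict shapes are hypothetical and not the built-in verifier.

```python
import math


def score_outputs(
    expected: dict[str, float],
    submitted: dict[str, float],
    rel_tolerances: dict[str, float],
) -> dict[str, bool]:
    """Return a pass/fail flag for each expected output."""
    results: dict[str, bool] = {}
    for name, target in expected.items():
        if name not in submitted:
            # A missing answer fails outright; nothing is inferred from prose.
            results[name] = False
            continue
        # Outputs without a declared tolerance must match exactly.
        tol = rel_tolerances.get(name, 0.0)
        # math.isclose scales rel_tol by the larger magnitude of the two values.
        results[name] = math.isclose(submitted[name], target, rel_tol=tol)
    return results


if __name__ == "__main__":
    expected = {"voltage_drop_percent": 2.50, "compliant": 1.0}
    submitted = {"voltage_drop_percent": 2.52, "compliant": 1.0}
    print(score_outputs(expected, submitted, {"voltage_drop_percent": 0.01}))
```

Keeping the comparison this small is what lets a verifier run unattended across thousands of generated instances.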

Tasks are deferred rather than templated when they depend on open-ended document review, hidden standards tables, iterative solvers without a reduced contract, or broad design judgement that has not been made explicit.

Writing your own

Start with the smallest useful deterministic contract:

  1. Define the expected inputs and outputs in params.toml.
  2. Add realistic archetypes so sampled values make sense together.
  3. Render a clear instruction.md with difficulty-aware visibility.
  4. Implement compute() in engine.py.
  5. Run uv run aec-bench generate validate-template ./my-template.
  6. Generate easy, medium, and hard instances and validate the generated task directories.
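
Before running the validator in step 5, a quick local check that `compute()` accepts exactly the declared parameters can catch naming drift between params.toml and engine.py. The sketch below is a hypothetical pre-check and makes no claim about what `validate-template` itself inspects.

```python
import inspect
from typing import Callable


def check_signature(compute: Callable[..., dict], declared: set[str]) -> list[str]:
    """List mismatches between declared parameter names and compute()'s arguments."""
    parameters = inspect.signature(compute).parameters
    accepted = set(parameters)
    problems: list[str] = []
    for name in sorted(declared - accepted):
        problems.append(f"declared but not accepted by compute(): {name}")
    for name in sorted(accepted - declared):
        # Arguments with defaults are tolerated; required ones must be declared.
        if parameters[name].default is inspect.Parameter.empty:
            problems.append(f"required by compute() but not declared: {name}")
    return problems


def example_compute(length_m: float, load_current_a: float, limit: float = 5.0) -> dict:
    """Toy engine used only to exercise the check."""
    return {"product": length_m * load_current_a, "limit": limit}


if __name__ == "__main__":
    print(check_signature(example_compute, {"length_m", "load_current_a"}))
    print(check_signature(example_compute, {"length_m", "power_factor"}))
```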

Use templates for reproducible benchmark families. Use hand-authored tasks for bespoke workflows that are not yet reducible to a stable generation contract.
