Introduction
aec-bench is an open-source platform for benchmarking AI agents on Architecture, Engineering and Construction (AEC) tasks.
Why aec-bench?
AEC work mixes deterministic calculation, document interpretation, tool use, and engineering judgement. General model benchmarks do not tell you whether an agent can follow an engineering brief, use the right formula, inspect source material, run a verifier, and leave an auditable trace.
aec-bench provides:
- A public catalogue of built templates and proposed seed tasks across civil, electrical, ground, mechanical, and structural engineering
- Agent harnesses for direct answers, tool loops, RLM, and Lambda-RLM workflows
- Automated scoring through task-local verifiers and structured reward contracts
- Versioned datasets so comparable runs are anchored to immutable task snapshots
- Prime Lab export for local eval, hosted eval, adapter eval, and hosted training
- Evolution and swarm workflows for improving agent workspaces against real benchmark failures
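
To make "task-local verifier" and "structured reward" concrete, here is a minimal Python sketch of a numeric verifier that returns a small reward record instead of a bare pass/fail flag. It is an illustration only: `Reward`, `verify_numeric`, and the 1% default tolerance are hypothetical names and values, not part of the aec-bench API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reward:
    """Hypothetical structured reward: a score plus the evidence behind it."""
    score: float   # 1.0 on pass, 0.0 on fail in this sketch
    passed: bool
    detail: str    # human-readable explanation, useful when auditing a run


def verify_numeric(answer: float, expected: float, rel_tol: float = 0.01) -> Reward:
    """Pass when the answer is within a relative tolerance of the reference value."""
    if expected == 0:
        # Relative error is undefined for a zero reference value.
        raise ValueError("expected must be non-zero for a relative tolerance check")
    error = abs(answer - expected) / abs(expected)
    passed = error <= rel_tol
    return Reward(
        score=1.0 if passed else 0.0,
        passed=passed,
        detail=f"relative error {error:.4f}, tolerance {rel_tol}",
    )


# Example: a bearing-capacity style answer checked against a reference value.
reward = verify_numeric(answer=100.5, expected=100.0)
print(reward.passed, reward.detail)
```

Returning a record rather than a boolean keeps the reason for each score alongside the score itself, which is the kind of information an auditable trace needs.
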
Disciplines
| Discipline | Example coverage |
|---|---|
| Civil | Hydrology, hydraulics, drainage, roads, coastal, wind and load derivations |
| Electrical | Cable sizing, PV, grounding, arc flash, thermal rating, short-circuit |
| Ground | Bearing capacity, settlement, CPT/SPT interpretation, slope and retaining-wall checks |
| Mechanical | HVAC, pumps, fire services, process calculations, acoustics, wastewater |
| Structural | Marine, concrete, structural fire, load combinations, movement and connection checks |
Mental model
The core loop is:
1. Define or generate tasks.
2. Freeze a dataset when you need comparability.
3. Run an agent harness against the selected tasks.
4. Verify outputs and write trial records.
5. Evaluate, report, and inspect traces.
Prime export, evolution, and swarm runs all build on the same task, trial, verifier, and trace records.
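
The loop above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the aec-bench implementation: `Task`, `freeze`, `run_trials`, and `pass_rate` are hypothetical names, the "agent" is any callable that maps a prompt to a number, and trace capture is left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an id, a prompt, and a reference answer."""
    task_id: str
    prompt: str
    expected: float


def freeze(tasks: list[Task]) -> str:
    """Return a content hash that identifies this exact set of tasks."""
    ordered = sorted(tasks, key=lambda t: t.task_id)
    payload = json.dumps([asdict(t) for t in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_trials(
    tasks: list[Task], agent: Callable[[str], float], rel_tol: float = 0.01
) -> list[dict]:
    """Run the agent on every task, verify each answer, and record a trial."""
    snapshot = freeze(tasks)  # every trial record points at the same snapshot
    records = []
    for task in tasks:
        answer = agent(task.prompt)
        passed = abs(answer - task.expected) <= rel_tol * abs(task.expected)
        records.append(
            {"dataset": snapshot, "task_id": task.task_id,
             "answer": answer, "passed": passed}
        )
    return records


def pass_rate(records: list[dict]) -> float:
    """Fraction of trials whose answer passed verification."""
    return sum(r["passed"] for r in records) / len(records)


# Toy usage: an "agent" that always answers 12.0.
tasks = [
    Task("slab-area", "Area of a 3 m by 4 m slab, in square metres?", 12.0),
    Task("beam-load", "Total load of 4 kN/m over a 5 m span, in kN?", 20.0),
]
records = run_trials(tasks, agent=lambda prompt: 12.0)
print(pass_rate(records))  # 0.5: the first task passes, the second fails
```

Hashing the sorted task contents means two runs report the same `dataset` value only if they saw identical tasks, which is the property a frozen dataset is meant to provide for comparable runs.
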
Next steps
- Quickstart — Generate and run a first task
- Templates — Understand the built-in template catalogue
- Agent Harnesses — Choose an execution strategy
- Prime Lab — Export tasks for hosted eval and training