aec-bench

Introduction

aec-bench is an open-source platform for benchmarking AI agents on Architecture, Engineering and Construction (AEC) tasks.

Why aec-bench?

AEC work mixes deterministic calculation, document interpretation, tool use, and engineering judgement. General model benchmarks do not tell you whether an agent can follow an engineering brief, use the right formula, inspect source material, run a verifier, and leave an auditable trace.

aec-bench provides:

  • A public catalogue of built templates and proposed seed tasks across civil, electrical, ground, mechanical, and structural engineering
  • Agent harnesses for direct answers, tool loops, RLM, and Lambda-RLM workflows
  • Automated scoring through task-local verifiers and structured reward contracts
  • Versioned datasets so comparable runs are anchored to immutable task snapshots
  • Prime Lab export for local eval, hosted eval, adapter eval, and hosted training
  • Evolution and swarm workflows for improving agent workspaces against real benchmark failures
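To make "task-local verifiers and structured reward contracts" concrete, here is a minimal sketch of what such a pair could look like. It is not the aec-bench API: the `Reward` class, the `verify_numeric` function, and the 1% default tolerance are all hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Reward:
    """Hypothetical structured reward: a score plus the evidence behind it."""
    score: float   # 1.0 for a pass, 0.0 for a fail in this sketch
    passed: bool
    detail: str    # human-readable record of what was compared


def verify_numeric(answer: float, expected: float, rel_tol: float = 0.01) -> Reward:
    """Hypothetical task-local verifier: pass if the answer is within a relative tolerance."""
    ok = math.isclose(answer, expected, rel_tol=rel_tol)
    detail = f"answer={answer}, expected={expected}, rel_tol={rel_tol}"
    return Reward(score=1.0 if ok else 0.0, passed=ok, detail=detail)
```

Returning a structured object rather than a bare number keeps the comparison details alongside the score, which is one way to support an auditable trace.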

Disciplines

Discipline    Example coverage
Civil         Hydrology, hydraulics, drainage, roads, coastal, wind and load derivations
Electrical    Cable sizing, PV, grounding, arc flash, thermal rating, short-circuit
Ground        Bearing capacity, settlement, CPT/SPT interpretation, slope and retaining-wall checks
Mechanical    HVAC, pumps, fire services, process calculations, acoustics, wastewater
Structural    Marine, concrete, structural fire, load combinations, movement and connection checks

Mental model

The core loop is:

  1. Define or generate tasks.
  2. Freeze a dataset when you need comparability.
  3. Run an agent harness against the selected tasks.
  4. Verify outputs and write trial records.
  5. Evaluate, report, and inspect traces.

Prime export, evolution, and swarm runs all build on the same task, trial, verifier, and trace records.
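The core loop above can be sketched in a few lines of plain Python. This is an illustrative model only, not the aec-bench interface: `Task`, `freeze_dataset`, `run_trials`, `pass_rate`, `toy_agent`, and the record fields are hypothetical, and the snapshot is modelled here as a content hash over the task definitions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task definition (step 1)."""
    task_id: str
    prompt: str
    expected: float


def freeze_dataset(tasks):
    """Step 2: derive a snapshot id from task content so runs can be compared."""
    ordered = sorted(tasks, key=lambda t: t.task_id)
    payload = json.dumps([asdict(t) for t in ordered], sort_keys=True)
    snapshot_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return snapshot_id, tuple(ordered)


def run_trials(snapshot_id, tasks, agent, rel_tol=0.01):
    """Steps 3 and 4: run the agent, verify each answer, write a trial record."""
    records = []
    for task in tasks:
        trace = []                      # the agent appends its steps here
        answer = agent(task, trace)
        passed = abs(answer - task.expected) <= rel_tol * abs(task.expected)
        records.append({
            "snapshot_id": snapshot_id,
            "task_id": task.task_id,
            "answer": answer,
            "passed": passed,
            "trace": trace,
        })
    return records


def pass_rate(records):
    """Step 5: a simple aggregate over the trial records."""
    return sum(r["passed"] for r in records) / len(records)


def toy_agent(task, trace):
    """Stand-in agent: multiplies 2.0 by 0.5 regardless of the prompt."""
    trace.append("computed width * depth = 2.0 * 0.5")
    return 2.0 * 0.5
```

Because the snapshot id depends only on task content, two runs that report the same id were scored against the same tasks, which is the comparability property that step 2 is meant to provide.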
