LIVE · dataset release · tasks 552 · models 18 · last submission · built

SECTION 01 / HERO

How capable is AI at real engineering?

aec-bench measures AI performance across 500+ tasks in architecture, engineering and construction — cable sizing, seismic design, hydraulic modelling, HVAC, geotech. Real problems, real standards, automated scoring.

SECTION 02 / CURRENT STANDINGS

Current standings

dataset release · 552 tasks · 5 disciplines

~/aec-bench / leaderboard.tsv · 18 rows · release eval
aec-bench ~ $ bench leaderboard --top 4 --by reward › release ok
rank  model                provider · harness   reward  coverage
#01   Grok 4.3             other · tool_loop    0.89    100%
#02   Grok 4.20 Reasoning  other · tool_loop    0.87    100%
#03   Kimi K2.6            other · tool_loop    0.86    96%
#04   GPT-5.2              openai · tool_loop   0.83    100%
C civil · E electrical · G ground · M mechanical · S structural
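The `bench leaderboard --top 4 --by reward` command above amounts to a sort-and-truncate over the leaderboard file. Here is a minimal sketch of that idea using only the Python standard library. The column names (`model`, `provider`, `harness`, `reward`, `coverage`) and the `top_by` helper are assumptions made for illustration; the actual layout of `leaderboard.tsv` may differ.

```python
import csv
import io

# Hypothetical TSV layout; the four rows reuse the values shown in the standings.
LEADERBOARD_TSV = (
    "model\tprovider\tharness\treward\tcoverage\n"
    "GPT-5.2\topenai\ttool_loop\t0.83\t1.00\n"
    "Kimi K2.6\tother\ttool_loop\t0.86\t0.96\n"
    "Grok 4.3\tother\ttool_loop\t0.89\t1.00\n"
    "Grok 4.20 Reasoning\tother\ttool_loop\t0.87\t1.00\n"
)


def top_by(tsv_text, key, n):
    """Return the n rows with the highest numeric value in column `key`."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rows.sort(key=lambda row: float(row[key]), reverse=True)
    return rows[:n]


# Print a ranked listing similar in spirit to the panel above.
for rank, row in enumerate(top_by(LEADERBOARD_TSV, "reward", 4), start=1):
    print(f"#{rank:02d}  {row['model']:<20} {row['reward']}")
```

Passing `"coverage"` as the key would rank by completion coverage instead, assuming the file stores it as a number.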

SECTION 03 / REWARD × LATENCY

Reward × Latency

release results pair task performance with runtime and completion coverage

── TOP_4 ──
  • #01 Grok 4.3 · 0.89
  • #02 Grok 4.20 Reasoning · 0.87
  • #03 Kimi K2.6 · 0.86
  • #04 GPT-5.2 · 0.83
── REWARD × LATENCY ──
[scatter plot: x-axis median latency, y-axis reward]
  • Grok 4.3: reward 0.89, median 21.9s, coverage 100%
  • Grok 4.20 Reasoning: reward 0.87, median 15.2s, coverage 100%
  • Kimi K2.6: reward 0.86, median 47.0s, coverage 96%
  • GPT-5.2: reward 0.83, median 14.2s, coverage 100%
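One way to read a reward × latency chart is to ask which models are not beaten on both axes at once. The sketch below computes that Pareto front from the four data points listed above. The `Result` type and `pareto_front` function are illustrative names and not part of aec-bench, and treating lower median latency as strictly better is an assumption.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    reward: float    # higher is better
    median_s: float  # median latency in seconds, lower is better


# Values copied from the reward × latency panel.
RESULTS = [
    Result("Grok 4.3", 0.89, 21.9),
    Result("Grok 4.20 Reasoning", 0.87, 15.2),
    Result("Kimi K2.6", 0.86, 47.0),
    Result("GPT-5.2", 0.83, 14.2),
]


def pareto_front(results):
    """Keep models that no other model beats on both reward and latency."""
    front = []
    best_reward = float("-inf")
    # Walk from fastest to slowest; a model survives only if it scores
    # higher than every faster model seen so far.
    for r in sorted(results, key=lambda r: (r.median_s, -r.reward)):
        if r.reward > best_reward:
            front.append(r)
            best_reward = r.reward
    return front


for r in pareto_front(RESULTS):
    print(f"{r.model}: reward {r.reward}, median {r.median_s}s")
```

On these four points, Kimi K2.6 drops out because Grok 4.3 is both faster and higher scoring; the other three each represent a different speed and reward trade-off.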

SECTION 05 / HOW IT WORKS

Define → run → score

six-stage pipeline · same flow every run

01 Define task · Template + params
02 Resolve instance · Jinja render
03 Stage env · Sandbox + tools
04 Execute agent · Harness drives the model
05 Score output · Automated verifier
06 Aggregate · Ledger + report
aec-bench ~ $ uv run aec-bench run-local \
  tasks/generated/electrical/cable-sizing/voltage-drop/sydney-suburban-residential-lighting-00 \
  --model claude-sonnet-4-20250514 --harness direct
› staging temporary workspace … ok
› executing harness direct
› verifier complete · reward 0.83 · imported as experiment local
aec-bench ~ $ uv run aec-bench evaluate --experiment local --report report.html
done. report written to report.html
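To make the define → run → score flow concrete, here is a toy sketch of the stages in plain Python. It is not the aec-bench implementation: `string.Template` stands in for the Jinja render, the agent is a plain callable instead of a sandboxed harness, and the `Task` fields, the 2% tolerance, and the pass/fail reward rule are all assumptions for illustration. The example task uses Ohm's law (V = I × R) rather than a real standards-based calculation.

```python
from dataclasses import dataclass
from statistics import mean
from string import Template


@dataclass
class Task:
    template: str            # stage 01: prompt template with $placeholders
    params: dict             # stage 01: values for the placeholders
    expected: float          # reference answer used by the verifier
    tolerance: float = 0.02  # accepted relative error (assumed value)


def resolve(task):
    """Stage 02: render the template into a concrete prompt."""
    return Template(task.template).substitute(task.params)


def score(output, task):
    """Stage 05: reward 1.0 if the numeric answer is within tolerance, else 0.0."""
    try:
        value = float(output.strip())
    except ValueError:
        return 0.0
    rel_err = abs(value - task.expected) / abs(task.expected)
    return 1.0 if rel_err <= task.tolerance else 0.0


def run(tasks, agent):
    """Stages 03-04 are collapsed into calling `agent`; stage 06 averages rewards."""
    return mean(score(agent(resolve(t)), t) for t in tasks)


PROMPT = "A circuit carries $current_a A through $resistance_ohm ohm. Voltage drop in V?"
TASKS = [
    Task(PROMPT, {"current_a": 10, "resistance_ohm": 0.5}, expected=5.0),
    Task(PROMPT, {"current_a": 20, "resistance_ohm": 0.6}, expected=12.0),
]

# A canned agent that always answers "5.0": right for the first task only.
print(run(TASKS, lambda prompt: "5.0"))
```

Keeping the verifier separate from the agent call means the same scoring rule applies regardless of which model or harness produced the output.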

SECTION 06 / RUN IT YOURSELF

Benchmark your model against real engineering.

Open-source. Reproducible. Runs locally or against any provider.

git clone https://github.com/TheodoreGalanos/aec-bench.git
source checkout · github.com/TheodoreGalanos/aec-bench · 2.4k ★