Prime Lab

Prime Lab integration exports aec-bench tasks as verifiers environments for local eval, hosted eval, adapter eval, hosted training, and rollout import.

Prime Lab support is optional. It lets aec-bench package selected tasks as Prime/verifiers environments while preserving the aec-bench task verifier as the reward authority.

Export lanes

aec-bench classifies tasks into conservative Prime lanes:

| AEC-Bench task shape | Prime environment shape | Status |
| --- | --- | --- |
| Deterministic formula and calculation tasks | SingleTurnEnv style package with deterministic rubric | Supported |
| Workspace tasks that need files or commands | StatefulToolEnv style package with workspace tools | Supported |
| RLM and lambda-RLM tasks | Stateful workspace export carrying policy guardrails | Supported for eval; deeper policy export is future work |
| Multi-environment training suites | Difficulty-filtered environment views with Prime buffer ratios | Supported through prime train-config |

The key invariant is that the original task verifier still decides reward. Prime changes the evaluation substrate, not the task's definition of correctness.

Readiness check

Install the optional integration and check local readiness:

uv sync --extra prime --dev
uv run aec-bench prime doctor
uv run aec-bench prime doctor --check-inference

prime doctor checks that the optional dependency is installed and that the Prime CLI is available. --check-inference additionally checks inference connectivity by listing the available models.

Export

Export one task or a dataset slice as a generated environment package:

uv run aec-bench prime export \
  --name aec-voltage-drop \
  --task electrical/voltage-drop

uv run aec-bench prime export \
  --name aec-electrical-v1 \
  --dataset electrical-v1@1.0.0 \
  --harness-mode auto

Generated packages default to prime-rl/environments/. They are local build artefacts and should be regenerated from source tasks rather than edited by hand.

Generated environments accept split, difficulty, num_examples, seed, and harness through load_environment(...). For exported environments, difficulty filters first, split then chooses a deterministic training or eval slice, and seed only shuffles within that slice. That keeps train and eval from overlapping when a dataset is large enough to split. Very small slices stay intact so smoke tests do not export an empty environment.
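
The ordering described above (difficulty filter, then a deterministic split, then a seeded shuffle inside the chosen slice) can be sketched in a few lines. This is an illustration only, not the generated package's code: the function name select_examples, the row fields id and difficulty, the eval_fraction of 0.2, and the min_split_size threshold are all assumptions made for the example.

```python
import random


def select_examples(rows, difficulty=None, split="train", num_examples=None,
                    seed=0, eval_fraction=0.2, min_split_size=5):
    """Hypothetical sketch of the selection order: difficulty -> split -> seed."""
    # 1. Difficulty filters first (accepts one value or several).
    if difficulty is not None:
        allowed = {difficulty} if isinstance(difficulty, str) else set(difficulty)
        rows = [r for r in rows if r["difficulty"] in allowed]

    # 2. The split is deterministic: it depends on a stable sort by id,
    #    never on the seed, so train and eval cannot overlap.
    rows = sorted(rows, key=lambda r: r["id"])
    if split in ("train", "eval") and len(rows) >= min_split_size:
        n_eval = max(1, int(len(rows) * eval_fraction))
        rows = rows[-n_eval:] if split == "eval" else rows[:-n_eval]
    # Slices below min_split_size (an assumed threshold) stay intact,
    # so a smoke test never ends up with an empty environment.

    # 3. The seed only shuffles within the slice chosen above.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows if num_examples is None else rows[:num_examples]


tasks = [{"id": f"t{i:02d}", "difficulty": "easy" if i % 2 == 0 else "medium"}
         for i in range(20)]
train_ids = {r["id"] for r in select_examples(tasks, "easy", "train", seed=1)}
eval_ids = {r["id"] for r in select_examples(tasks, "easy", "eval", seed=2)}
```

Because the split happens before the shuffle, changing the seed reorders examples inside a slice but never moves an example from train to eval.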

Smoke test

Use smoke before pushing or training:

uv run aec-bench prime smoke \
  --name aec-voltage-drop \
  --task electrical/voltage-drop

Add --model to also run a one-example prime eval; omit it to keep the smoke test local and focused on package loading.

Push

Push a generated environment to Prime Hub:

uv run aec-bench prime push \
  --name aec-electrical-v1 \
  --dataset electrical-v1@1.0.0 \
  --visibility PRIVATE \
  --owner your-owner

Private visibility is the safe default while validating reward behaviour and task selection.

Hosted eval

Run a hosted eval against an existing Hub environment:

uv run aec-bench prime eval \
  --remote-env your-owner/aec-electrical-v1 \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --split eval \
  --difficulty medium \
  --harness stateful \
  --env-num-examples 10 \
  --seed 20260509 \
  --num-examples 5 \
  --rollouts-per-example 3 \
  --max-tokens 4096 \
  --eval-name aec-electrical-base-medium

--split, --difficulty, --harness, --env-num-examples, and --seed are forwarded to the generated environment's load_environment(...) function through Prime env args.

Use repeated --difficulty values for mixed slices:

uv run aec-bench prime eval \
  --remote-env your-owner/aec-electrical-v1 \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --difficulty easy \
  --difficulty medium

Use --env-arg KEY=VALUE for additional load_environment(...) arguments that are not first-class CLI flags yet.
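
As a rough picture of what forwarding KEY=VALUE pairs involves, the sketch below turns a list of pairs into keyword arguments. It is not the aec-bench parser: the function name parse_env_args is hypothetical, and decoding each value as JSON with a fallback to a plain string is an assumption about how typed values might be handled.

```python
import json


def parse_env_args(pairs):
    """Hypothetical helper: turn ["KEY=VALUE", ...] into a kwargs dict."""
    args = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        try:
            # Assumption: numbers, booleans, and lists are written as JSON.
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Anything that is not valid JSON is kept as a plain string.
            value = raw
        args[key] = value
    return args


env_args = parse_env_args(["split=eval", "seed=20260509", "num_examples=10"])
# env_args could then be passed on as load_environment(**env_args)
```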

Adapter eval

Prime evaluates base models and hosted-training adapters through the same environment boundary. The adapter used for inference must be a deployed Prime inference adapter, not a raw training checkpoint id.

List available adapter deployments:

uv run aec-bench prime adapters
uv run aec-bench --json prime adapters

If you already know the deployed adapter id, pass the base model as --model and the deployed adapter id as --adapter-id:

uv run aec-bench prime eval \
  --remote-env your-owner/aec-electrical-v1 \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --adapter-id uv124zgh7ttg3in94f7jzmv2 \
  --split eval \
  --difficulty medium \
  --harness stateful \
  --env-num-examples 10 \
  --seed 20260509 \
  --num-examples 5 \
  --rollouts-per-example 3 \
  --max-tokens 4096 \
  --eval-name aec-electrical-adapter-medium

aec-bench composes the Prime model string as <base-model>:<adapter-id>. To avoid mixing checkpoint ids and deployment ids, resolve the adapter directly from a hosted training run:

uv run aec-bench prime eval \
  --remote-env your-owner/aec-electrical-v1 \
  --hosted \
  --model "Qwen/Qwen3.5-4B" \
  --adapter-from-run <training-run-id> \
  --adapter-step latest \
  --split eval \
  --difficulty medium \
  --harness stateful \
  --env-num-examples 10 \
  --seed 20260509 \
  --num-examples 5 \
  --rollouts-per-example 3 \
  --max-tokens 4096 \
  --eval-name aec-electrical-adapter-medium

For base-versus-adapter comparisons, keep the environment slug, split, difficulty, harness, seed, number of examples, rollout count, and token budget fixed. Change only the model adapter.
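
One way to keep a comparison honest is to build both runs from a single set of fixed settings and confirm that only the model differs. The sketch below assumes nothing about aec-bench internals beyond the <base-model>:<adapter-id> format stated above; the helper names and the settings keys are illustrative, and the values are copied from the example commands.

```python
def compose_model(base_model, adapter_id=None):
    """Build the model string in the <base-model>:<adapter-id> format."""
    return f"{base_model}:{adapter_id}" if adapter_id else base_model


# Settings that stay fixed across both runs (key names are illustrative).
FIXED = {
    "remote_env": "your-owner/aec-electrical-v1",
    "split": "eval",
    "difficulty": "medium",
    "harness": "stateful",
    "seed": 20260509,
    "num_examples": 5,
    "rollouts_per_example": 3,
    "max_tokens": 4096,
}


def eval_settings(base_model, adapter_id=None):
    """Hypothetical helper: fixed settings plus the one thing that varies."""
    return {**FIXED, "model": compose_model(base_model, adapter_id)}


def differing_keys(a, b):
    """Return the sorted setting names whose values differ between two runs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


base = eval_settings("Qwen/Qwen3.5-4B")
adapter = eval_settings("Qwen/Qwen3.5-4B", "uv124zgh7ttg3in94f7jzmv2")
changed = differing_keys(base, adapter)
```

If changed contains anything other than the model entry, the two runs are not directly comparable.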

Hosted training

Write a conservative single-slice training config:

uv run aec-bench prime train-config \
  --environment your-owner/aec-electrical-v1 \
  --output train.toml \
  --model "Qwen/Qwen3.5-0.8B" \
  --split train \
  --difficulty easy \
  --harness stateful \
  --num-examples 50 \
  --max-steps 20

For broad training suites, generate a multi-environment config with one difficulty-filtered view per ratio. This keeps the environment package fixed while Prime samples from easy, medium, and hard slices according to [buffer].env_ratios:

uv run aec-bench prime train-config \
  --environment your-owner/aec-release-train \
  --output configs/rl/aec-filtered-ratios.toml \
  --model "Qwen/Qwen3.5-9B" \
  --split all \
  --harness stateful \
  --difficulty-ratio easy=0.45 \
  --difficulty-ratio medium=0.40 \
  --difficulty-ratio hard=0.15 \
  --max-steps 50 \
  --batch-size 64 \
  --rollouts-per-example 8 \
  --max-tokens 4096 \
  --online-difficulty-filtering \
  --easy-threshold 0.8 \
  --hard-threshold 0.2 \
  --easy-fraction 0.25 \
  --hard-fraction 0.25

Each --difficulty-ratio DIFFICULTY=RATIO flag emits a separate [[env]] block with that difficulty passed to load_environment(...). The ratio flags are mutually exclusive with repeated --difficulty, which still means one environment view that allows multiple difficulties. --online-difficulty-filtering adds Prime's adaptive buffer thresholds; retaining some easy and hard examples helps noisy AEC reward surfaces avoid collapsing into only already-solved examples or only zero-signal failures.
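
The mapping from ratio flags to environment views can be pictured as follows. This is a sketch under stated assumptions, not the train-config implementation: the function names, the id and args keys, and the normalisation of ratios to sum to 1 are all hypothetical, and the real TOML schema is not reproduced here.

```python
def parse_difficulty_ratios(flags):
    """Hypothetical helper: parse ["easy=0.45", ...] into {"easy": 0.45, ...}."""
    ratios = {}
    for flag in flags:
        name, sep, raw = flag.partition("=")
        if not sep or not name:
            raise ValueError(f"expected DIFFICULTY=RATIO, got {flag!r}")
        if name in ratios:
            raise ValueError(f"duplicate difficulty {name!r}")
        ratio = float(raw)
        if ratio <= 0:
            raise ValueError(f"ratio for {name!r} must be positive")
        ratios[name] = ratio
    return ratios


def build_views(environment, ratio_flags, difficulties=(), split="all",
                harness="stateful"):
    """One view per ratio flag; the same package is reused with different args."""
    if ratio_flags and difficulties:
        # Mirrors the rule that ratio flags exclude repeated --difficulty.
        raise ValueError("use either --difficulty-ratio or --difficulty, not both")
    ratios = parse_difficulty_ratios(ratio_flags)
    views = [
        {"id": environment,
         "args": {"difficulty": name, "split": split, "harness": harness}}
        for name in ratios
    ]
    total = sum(ratios.values())
    # Assumption: ratios are normalised so the sampling weights sum to 1.
    env_ratios = [r / total for r in ratios.values()]
    return views, env_ratios


views, env_ratios = build_views(
    "your-owner/aec-release-train",
    ["easy=0.45", "medium=0.40", "hard=0.15"],
)
```

Each entry in views corresponds to one environment block, and env_ratios lists the sampling weights in the same order.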

Launch training through Prime:

uv run aec-bench prime train train.toml --yes

Before scaling a training run, run a one-example hosted eval and inspect reward, stop conditions, and tool-call metrics. Small evals are smoke tests, not benchmark claims.

After training finishes, use aec-bench prime adapters to find the deployed inference adapter before running adapter evals. Training checkpoint ids and deployed inference adapter ids are different Prime records.

Import hosted rollouts

Prime hosted evals keep sample-level rollout payloads: prompt, completion messages, tool calls, reward, timing, token usage, stop condition, and environment metrics. Import those samples into the normal aec-bench ledger when you want the same trace and behavioural tooling used for local or Harbor-backed runs:

uv run aec-bench import-prime-eval <prime-eval-id> \
  --experiment prime-eval-medium-stateful

The importer writes one TrialRecord per Prime sample plus local artefacts under the ledger: conversation.jsonl, the raw prime_sample.json, and output.md when the rollout called submit_answer with submitted content. It does not replace Prime's hosted result; it materialises the result locally so the rest of aec-bench can inspect it.

After import, use the existing report commands:

uv run aec-bench report traces \
  --experiment-id prime-eval-medium-stateful

uv run aec-bench report behavioral \
  --experiment-id prime-eval-medium-stateful \
  --classifier claude-sonnet-4-20250514 \
  --output prime-eval-behaviour.json

Use the same import-and-report path for base and adapter evals. For comparison reports, keep the experiment ids distinct and compare aggregate reward, stop conditions, tool-call counts, submit_answer calls, and behavioural classifications side by side.

What to inspect

For stateful AEC-Bench exports, inspect more than mean reward:

  • Whether the rollout calls submit_answer
  • Whether file and command tools are actually used when expected
  • Runtime errors versus verifier disagreement
  • Per-example reward rows, so the aggregate does not hide a single failing task
  • Token budget and max-turn behaviour for RLM-style tasks
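
Several of these checks can be computed per example from imported samples. The sketch below is illustrative only: the field names example_id, reward, tool_calls, and error are assumptions for the example and are not the schema of prime_sample.json or TrialRecord, and the sample values are made up to show the shape of the output.

```python
from collections import defaultdict
from statistics import mean


def summarise(samples):
    """Hypothetical per-example summary: reward, submit rate, runtime errors."""
    by_example = defaultdict(list)
    for sample in samples:
        by_example[sample["example_id"]].append(sample)

    rows = []
    for example_id in sorted(by_example):
        group = by_example[example_id]
        rows.append({
            "example_id": example_id,
            "rollouts": len(group),
            "mean_reward": mean(s["reward"] for s in group),
            # Share of rollouts that actually called submit_answer.
            "submit_rate": mean(
                1.0 if "submit_answer" in s["tool_calls"] else 0.0 for s in group
            ),
            # Runtime errors are counted apart from low-reward rollouts.
            "errors": sum(1 for s in group if s.get("error")),
        })
    return rows


# Made-up samples for illustration, not real results.
samples = [
    {"example_id": "a", "reward": 1.0, "tool_calls": ["read_file", "submit_answer"]},
    {"example_id": "a", "reward": 0.0, "tool_calls": ["read_file"]},
    {"example_id": "b", "reward": 0.0, "tool_calls": [], "error": "timeout"},
]
rows = summarise(samples)
```

A row with a low submit rate or a non-zero error count points at a harness or runtime problem rather than verifier disagreement, which is the distinction the list above asks for.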

The Prime path should remain explicit: select the task slice, export it, smoke it, run a small hosted eval, import the hosted rollouts, inspect behaviour, then scale.
