Prime Lab
Prime Lab integration exports aec-bench tasks as verifiers environments for local eval, hosted eval, adapter eval, hosted training, and rollout import.
Prime Lab support is optional. It lets aec-bench package selected tasks as Prime/verifiers environments while preserving the aec-bench task verifier as the reward authority.
Export lanes
aec-bench classifies tasks into conservative Prime lanes:
| AEC-Bench task shape | Prime environment shape | Status |
|---|---|---|
| Deterministic formula and calculation tasks | SingleTurnEnv style package with deterministic rubric | Supported |
| Workspace tasks that need files or commands | StatefulToolEnv style package with workspace tools | Supported |
| RLM and lambda-RLM tasks | Stateful workspace export carrying policy guardrails | Supported for eval; deeper policy export is future work |
| Multi-environment training suites | Difficulty-filtered environment views with Prime buffer ratios | Supported through prime train-config |
The key invariant is that the original task verifier still decides reward. Prime changes the evaluation substrate, not the task's definition of correctness.
Readiness check
Install the optional integration and check local readiness:
uv sync --extra prime --dev
uv run aec-bench prime doctor
uv run aec-bench prime doctor --check-inference
doctor checks the optional dependency and Prime CLI availability. --check-inference also checks model-list connectivity.
Export
Export one task or a dataset slice as a generated environment package:
uv run aec-bench prime export \
--name aec-voltage-drop \
--task electrical/voltage-drop
uv run aec-bench prime export \
--name aec-electrical-v1 \
--dataset electrical-v1@1.0.0 \
--harness-mode auto
Generated packages default to prime-rl/environments/. They are local build artefacts and should be regenerated from source tasks rather than edited by hand.
Generated environments accept split, difficulty, num_examples, seed, and harness through load_environment(...). For exported environments, difficulty filters first, split then chooses a deterministic training or eval slice, and seed only shuffles within that slice. That keeps train and eval from overlapping when a dataset is large enough to split. Very small slices stay intact so smoke tests do not export an empty environment.
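The selection order described above can be sketched in Python. This is an illustrative sketch, not the generated package's actual code: the function name `select_examples`, the row fields `id` and `difficulty`, the hash-based split, the `MIN_SPLIT_SIZE` and `EVAL_FRACTION` constants, and the `-1` default for `num_examples` are all assumptions made for the example.

```python
import hashlib
import random

MIN_SPLIT_SIZE = 4   # hypothetical: slices smaller than this are left unsplit
EVAL_FRACTION = 0.2  # hypothetical: share of rows reserved for eval


def _bucket(task_id: str) -> float:
    """Map a task id to a stable value in [0, 1), independent of the seed."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def select_examples(rows, split="all", difficulty=None, num_examples=-1, seed=0):
    """Illustrative order: difficulty filter, then split, then seeded shuffle."""
    # 1. The difficulty filter runs first.
    if difficulty:
        allowed = {difficulty} if isinstance(difficulty, str) else set(difficulty)
        rows = [r for r in rows if r["difficulty"] in allowed]

    # 2. The split picks a deterministic slice; very small slices stay intact.
    if split in ("train", "eval") and len(rows) >= MIN_SPLIT_SIZE:
        want_eval = split == "eval"
        rows = [r for r in rows if (_bucket(r["id"]) < EVAL_FRACTION) == want_eval]

    # 3. The seed only reorders rows inside the chosen slice.
    rows = sorted(rows, key=lambda r: r["id"])
    random.Random(seed).shuffle(rows)

    # A negative num_examples keeps the whole slice (assumed convention).
    return rows if num_examples < 0 else rows[:num_examples]
```

Because split membership in this sketch depends only on the task id, changing the seed reorders a slice without moving any task between train and eval.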
Smoke test
Use smoke before pushing or training:
uv run aec-bench prime smoke \
--name aec-voltage-drop \
--task electrical/voltage-drop
Add --model to run a one-example prime eval run; omit it to keep the smoke local and package-load focused.
Push
Push a generated environment to Prime Hub:
uv run aec-bench prime push \
--name aec-electrical-v1 \
--dataset electrical-v1@1.0.0 \
--visibility PRIVATE \
--owner your-owner
Private visibility is the safe default while validating reward behaviour and task selection.
Hosted eval
Run a hosted eval against an existing Hub environment:
uv run aec-bench prime eval \
--remote-env your-owner/aec-electrical-v1 \
--hosted \
--model "Qwen/Qwen3.5-4B" \
--split eval \
--difficulty medium \
--harness stateful \
--env-num-examples 10 \
--seed 20260509 \
--num-examples 5 \
--rollouts-per-example 3 \
--max-tokens 4096 \
--eval-name aec-electrical-base-medium
--split, --difficulty, --harness, --env-num-examples, and --seed are forwarded to the generated environment's load_environment(...) function through Prime env args.
Use repeated --difficulty values for mixed slices:
uv run aec-bench prime eval \
--remote-env your-owner/aec-electrical-v1 \
--hosted \
--model "Qwen/Qwen3.5-4B" \
--difficulty easy \
--difficulty medium
Use --env-arg KEY=VALUE for additional load_environment(...) arguments that are not first-class CLI flags yet.
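A minimal sketch of how first-class flags and --env-arg pairs could be merged into load_environment(...) keyword arguments. The helper names `parse_env_args` and `build_env_kwargs`, the JSON decoding of values, the rule that first-class flags win over duplicate keys, and the example key `max_turns` are assumptions for illustration; the real CLI may behave differently.

```python
import json


def parse_env_args(pairs):
    """Turn repeated KEY=VALUE strings into a keyword-argument dict."""
    kwargs = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        try:
            # Assumption: values that parse as JSON (numbers, booleans) are decoded.
            kwargs[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Anything else is kept as a plain string.
            kwargs[key] = raw
    return kwargs


def build_env_kwargs(split=None, difficulty=(), harness=None,
                     env_num_examples=None, seed=None, env_arg=()):
    """Merge first-class flags with extra --env-arg pairs (hypothetical helper)."""
    kwargs = parse_env_args(env_arg)
    first_class = {
        "split": split,
        # Assumption: repeated --difficulty values arrive as a list.
        "difficulty": list(difficulty) or None,
        "harness": harness,
        # Assumption: --env-num-examples maps to the num_examples argument.
        "num_examples": env_num_examples,
        "seed": seed,
    }
    # Assumption: first-class flags win over a duplicate --env-arg key.
    kwargs.update({k: v for k, v in first_class.items() if v is not None})
    return kwargs
```

For example, `build_env_kwargs(split="eval", difficulty=["easy", "medium"], env_arg=["max_turns=8"])` would yield a dict carrying `split`, `difficulty`, and the hypothetical `max_turns` argument.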
Adapter eval
Prime evaluates base models and hosted-training adapters through the same environment boundary. The adapter used for inference must be a deployed Prime inference adapter, not a raw training checkpoint id.
List available adapter deployments:
uv run aec-bench prime adapters
uv run aec-bench --json prime adapters
If you already know the deployed adapter id, pass the base model as --model and the deployed adapter id as --adapter-id:
uv run aec-bench prime eval \
--remote-env your-owner/aec-electrical-v1 \
--hosted \
--model "Qwen/Qwen3.5-4B" \
--adapter-id uv124zgh7ttg3in94f7jzmv2 \
--split eval \
--difficulty medium \
--harness stateful \
--env-num-examples 10 \
--seed 20260509 \
--num-examples 5 \
--rollouts-per-example 3 \
--max-tokens 4096 \
--eval-name aec-electrical-adapter-medium
aec-bench composes the Prime model string as <base-model>:<adapter-id>. To avoid mixing checkpoint ids and deployment ids, resolve the adapter directly from a hosted training run:
uv run aec-bench prime eval \
--remote-env your-owner/aec-electrical-v1 \
--hosted \
--model "Qwen/Qwen3.5-4B" \
--adapter-from-run <training-run-id> \
--adapter-step latest \
--split eval \
--difficulty medium \
--harness stateful \
--env-num-examples 10 \
--seed 20260509 \
--num-examples 5 \
--rollouts-per-example 3 \
--max-tokens 4096 \
--eval-name aec-electrical-adapter-medium
For base-versus-adapter comparisons, keep the environment slug, split, difficulty, harness, seed, number of examples, rollout count, and token budget fixed. Change only the model adapter.
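The model-string composition and the fixed-settings rule can be sketched as follows. Only the `<base-model>:<adapter-id>` format comes from the text above; the function names and the dictionary keys in `FIXED_KEYS` are hypothetical stand-ins for however the two eval runs are recorded.

```python
def compose_model(base_model, adapter_id=None):
    """Return '<base-model>:<adapter-id>', or the base model when no adapter is set."""
    if not adapter_id:
        return base_model
    return f"{base_model}:{adapter_id}"


# Settings to hold fixed across a base-versus-adapter comparison
# (key names are hypothetical; they mirror the CLI flags used above).
FIXED_KEYS = (
    "remote_env", "split", "difficulty", "harness", "seed",
    "env_num_examples", "num_examples", "rollouts_per_example", "max_tokens",
)


def comparison_drift(base_run, adapter_run):
    """List the fixed settings whose values differ between the two runs."""
    return [k for k in FIXED_KEYS if base_run.get(k) != adapter_run.get(k)]
```

An empty list from `comparison_drift` suggests the two runs differ only in the model string, which is the condition the comparison relies on.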
Hosted training
Write a conservative single-slice training config:
uv run aec-bench prime train-config \
--environment your-owner/aec-electrical-v1 \
--output train.toml \
--model "Qwen/Qwen3.5-0.8B" \
--split train \
--difficulty easy \
--harness stateful \
--num-examples 50 \
--max-steps 20
For broad training suites, generate a multi-environment config with one
difficulty-filtered view per ratio. This keeps the environment package fixed
while Prime samples from easy, medium, and hard slices according to
[buffer].env_ratios:
uv run aec-bench prime train-config \
--environment your-owner/aec-release-train \
--output configs/rl/aec-filtered-ratios.toml \
--model "Qwen/Qwen3.5-9B" \
--split all \
--harness stateful \
--difficulty-ratio easy=0.45 \
--difficulty-ratio medium=0.40 \
--difficulty-ratio hard=0.15 \
--max-steps 50 \
--batch-size 64 \
--rollouts-per-example 8 \
--max-tokens 4096 \
--online-difficulty-filtering \
--easy-threshold 0.8 \
--hard-threshold 0.2 \
--easy-fraction 0.25 \
--hard-fraction 0.25
Each --difficulty-ratio DIFFICULTY=RATIO flag emits a separate [[env]]
block with that difficulty passed to load_environment(...). The ratio flags
are mutually exclusive with repeated --difficulty, which still means one
environment view that allows multiple difficulties. --online-difficulty-filtering
adds Prime's adaptive buffer thresholds; retaining some easy and hard examples
helps noisy AEC reward surfaces avoid collapsing into only already-solved
examples or only zero-signal failures.
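A rough sketch of how the ratio flags could expand into one environment view per difficulty. The function names, the shape of each view dictionary, and the validation rules are assumptions; the actual TOML layout written by prime train-config is not reproduced here, and whether Prime expects the ratios to sum to 1 is not stated in this document.

```python
def parse_difficulty_ratios(flags):
    """Parse repeated DIFFICULTY=RATIO strings into an ordered dict."""
    ratios = {}
    for flag in flags:
        name, sep, raw = flag.partition("=")
        if not sep or not name:
            raise ValueError(f"expected DIFFICULTY=RATIO, got {flag!r}")
        ratio = float(raw)
        if ratio <= 0:
            raise ValueError(f"ratio for {name!r} must be positive")
        if name in ratios:
            raise ValueError(f"duplicate difficulty {name!r}")
        ratios[name] = ratio
    return ratios


def plan_env_views(environment, ratio_flags=(), difficulties=()):
    """Return (views, ratios); each view stands in for one [[env]] block."""
    if ratio_flags and difficulties:
        raise ValueError("--difficulty-ratio and --difficulty are mutually exclusive")
    if ratio_flags:
        ratios = parse_difficulty_ratios(ratio_flags)
        # One difficulty-filtered view of the same environment per ratio flag.
        views = [{"id": environment, "difficulty": [name]} for name in ratios]
        return views, list(ratios.values())
    # Repeated --difficulty means a single view that allows several difficulties.
    view = {"id": environment}
    if difficulties:
        view["difficulty"] = list(difficulties)
    return [view], []
```

The point of the design is that the environment package stays the same in every view; only the difficulty argument and the sampling ratio differ.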
Launch training through Prime:
uv run aec-bench prime train train.toml --yes
Before scaling a training run, run a one-example hosted eval and inspect reward, stop conditions, and tool-call metrics. Small evals are smoke tests, not benchmark claims.
After training finishes, use aec-bench prime adapters to find the deployed inference adapter before running adapter evals. Training checkpoint ids and deployed inference adapter ids are different Prime records.
Import hosted rollouts
Prime hosted evals keep sample-level rollout payloads: prompt, completion messages, tool calls, reward, timing, token usage, stop condition, and environment metrics. Import those samples into the normal aec-bench ledger when you want the same trace and behavioural tooling used for local or Harbor-backed runs:
uv run aec-bench import-prime-eval <prime-eval-id> \
--experiment prime-eval-medium-stateful
The importer writes one TrialRecord per Prime sample plus local artefacts under the ledger: conversation.jsonl, the raw prime_sample.json, and output.md when the rollout called submit_answer with submitted content. It does not replace Prime's hosted result; it materialises the result locally so the rest of aec-bench can inspect it.
After import, use the existing report commands:
uv run aec-bench report traces \
--experiment-id prime-eval-medium-stateful
uv run aec-bench report behavioral \
--experiment-id prime-eval-medium-stateful \
--classifier claude-sonnet-4-20250514 \
--output prime-eval-behaviour.json
Use the same import-and-report path for base and adapter evals. For comparison reports, keep the experiment ids distinct and compare aggregate reward, stop conditions, tool-call counts, submit_answer calls, and behavioural classifications side by side.
What to inspect
For stateful AEC-Bench exports, inspect more than mean reward:
- Whether the rollout calls submit_answer
- Whether file and command tools are actually used when expected
- Runtime errors versus verifier disagreement
- Per-example reward rows, so the aggregate does not hide a single task's failure
- Token budget and max-turn behaviour for RLM-style tasks
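Several of the checks in this list can be computed from imported rollouts with a short script. The sketch below assumes each sample is a plain dict with hypothetical fields `example_id`, `reward`, `tool_calls` (a list of dicts with a `name`), and an optional `error`; the real TrialRecord and prime_sample.json layouts may differ.

```python
from collections import defaultdict
from statistics import mean


def summarise_rollouts(samples):
    """Summarise rollouts beyond mean reward (field names are hypothetical)."""
    if not samples:
        raise ValueError("no samples to summarise")

    per_example = defaultdict(list)
    submitted = errored = tool_calls = 0
    for sample in samples:
        per_example[sample["example_id"]].append(sample["reward"])
        names = [call["name"] for call in sample.get("tool_calls", [])]
        tool_calls += len(names)
        # Count rollouts that called submit_answer at least once.
        submitted += "submit_answer" in names
        # Count rollouts that ended in a runtime error rather than a verdict.
        errored += bool(sample.get("error"))

    total = len(samples)
    return {
        "mean_reward": mean(s["reward"] for s in samples),
        "per_example_reward": {k: mean(v) for k, v in per_example.items()},
        "submit_rate": submitted / total,
        "error_count": errored,
        "tool_calls_per_rollout": tool_calls / total,
    }
```

A mean reward of 0.5 can come from one example that always succeeds and another that always fails, which is why the per-example rows are reported alongside the aggregate.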
The Prime path should remain explicit: select the task slice, export it, smoke it, run a small hosted eval, import the hosted rollouts, inspect behaviour, then scale.