Traces
A trace is the per-turn trajectory plus the TrialRecord that preserves task, agent, output, evaluation, timing, and cost evidence.
The trajectory
During execution, an agent harness can append to /workspace/trajectory.jsonl. The file starts with a format header, then records structured entries for messages, tool calls, tool results, and run metadata:

```json
{"version": 1, "format": "aec-bench-trajectory"}
{"step": 0, "role": "system", "content": "You are a senior electrical engineer..."}
{"step": 0, "role": "user", "content": "Calculate voltage drop..."}
{"step": 1, "role": "assistant", "content": "I'll compute R and X contributions..."}
{"step": 1, "role": "tool_call", "tool_name": "python", "command": "python calc.py"}
{"step": 1, "role": "tool_result", "tool_name": "python", "stdout": "V_drop = 5.2V", "exit_code": 0, "duration_ms": 142}
{"step": 2, "role": "assistant", "content": "Writing result to output.jsonl..."}
```

Each entry carries step and role. Text entries usually carry content; tool entries can carry tool_name, command, arguments, stdout, stderr, exit_code, duration_ms, media, and optional metadata. The schema is deliberately small so traces stay easy to parse.
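Because every entry carries step and role, a reader for this format can stay short. The following is a minimal sketch, not aec-bench's own loader: the helper name `read_trajectory` and the strictness of the header check are assumptions made for illustration.

```python
import json
from typing import Any, Iterable


def read_trajectory(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse trajectory JSONL lines and return the entries after the header.

    Hypothetical helper: checks the format header shown above, then requires
    the two fields every entry carries (step and role).
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows or rows[0].get("format") != "aec-bench-trajectory":
        raise ValueError("missing aec-bench-trajectory header")
    entries = rows[1:]
    for entry in entries:
        if "step" not in entry or "role" not in entry:
            raise ValueError(f"entry missing step or role: {entry}")
    return entries


sample = [
    '{"version": 1, "format": "aec-bench-trajectory"}',
    '{"step": 0, "role": "user", "content": "Calculate voltage drop..."}',
    '{"step": 1, "role": "tool_call", "tool_name": "python", "command": "python calc.py"}',
    '{"step": 1, "role": "tool_result", "tool_name": "python", "stdout": "V_drop = 5.2V", "exit_code": 0}',
]
entries = read_trajectory(sample)
# Filter by role to pull out just the tool results.
tool_results = [e for e in entries if e["role"] == "tool_result"]
print(len(entries), tool_results[0]["exit_code"])
```

In practice the lines would come from an open file handle on trajectory.jsonl rather than an in-memory list.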
The trial record
Once evaluation finishes, the runner assembles a TrialRecord, the self-contained record of one agent run:
```python
class TrialRecord(StrictModel):
    trial_id: NonEmptyStr
    experiment_id: NonEmptyStr
    dataset_id: str | None = None
    timestamp: datetime
    task: TaskReference                      # task_id + content hash
    agent: AgentReference                    # harness, model, configuration
    environment: EnvironmentSnapshot         # Docker image, compute backend
    inputs: InputRecord                      # instruction, system prompt, files
    outputs: OutputRecord                    # agent output, raw output, conversation, trajectory
    evaluation: EvaluationResult
    timing: TimingRecord                     # wall-clock per phase
    cost: CostRecord | None                  # tokens, advisor calls, estimated USD
    adaptation: AdaptationProvenance | None
    completeness: Completeness               # complete | partial
```

Everything needed to reproduce the run is pinned by hash:
- Task reference: task_id plus task_revision
- Agent reference: harness name, resolved model, optional harness revision, and effective configuration
- Input record: instruction, optional system prompt, and input file references
- Environment snapshot: runtime image, compute backend, and tool versions when available
If any of these change later, the new run should produce a different record rather than mutating the old one.
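The idea of pinning a reference by hash can be illustrated with a short sketch. The source does not say which hashing scheme aec-bench uses, so everything here is an assumption: SHA-256 over a sorted-key JSON encoding, the helper name `fingerprint`, and the field names in the example dictionaries.

```python
import hashlib
import json
from typing import Any


def fingerprint(payload: dict[str, Any]) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding (assumed scheme)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical agent references; field names are illustrative only.
agent_a = {"harness": "example-harness", "model": "example-model", "config": {"max_turns": 20}}
agent_b = {"config": {"max_turns": 20}, "model": "example-model", "harness": "example-harness"}
agent_c = {**agent_a, "config": {"max_turns": 40}}

same = fingerprint(agent_a) == fingerprint(agent_b)     # key order does not matter
changed = fingerprint(agent_a) != fingerprint(agent_c)  # a config change yields a new hash
print(same, changed)
```

Under a scheme like this, a changed configuration cannot silently reuse an old identity, which is what lets a later run produce a new record instead of overwriting the old one.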
Completeness
Trial records have two states:
| State | Meaning |
|---|---|
| partial | Written during execution. Some provenance fields may be missing. |
| complete | Sealed after evaluation with required provenance. Eligible for the ledger. |
Complete records must include agent.adapter_revision, environment.tool_versions, and inputs.input_files. Partial records are useful for debugging, but they should not back a leaderboard number.
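The required-provenance rule can be sketched as a check over a dict-shaped record. This is an illustrative sketch only: the helper `missing_provenance` is hypothetical, and treating a field as missing when it is absent or None is an assumption about how the real validation behaves.

```python
from typing import Any

# Dotted paths the text lists as required for a complete record.
REQUIRED_FOR_COMPLETE = (
    "agent.adapter_revision",
    "environment.tool_versions",
    "inputs.input_files",
)


def missing_provenance(record: dict[str, Any]) -> list[str]:
    """Return the required dotted paths that are absent or None in the record."""
    missing = []
    for path in REQUIRED_FOR_COMPLETE:
        node: Any = record
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            missing.append(path)
    return missing


# Example record with tool versions not yet captured.
record = {
    "agent": {"adapter_revision": "abc123"},
    "environment": {"tool_versions": None},
    "inputs": {"input_files": []},
}
gaps = missing_provenance(record)
state = "complete" if not gaps else "partial"
print(state, gaps)
```

A report that feeds a leaderboard could use a check like this to skip any record that still has gaps.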
Cost tracking
Each trial optionally records token usage and an estimated USD cost:
```python
class CostRecord(StrictModel):
    tokens_in: int | None
    tokens_out: int | None
    cache_read_tokens: int | None
    cache_write_tokens: int | None
    estimated_cost_usd: float | None
    advisor_calls: int | None
    advisor_input_tokens: int | None
    advisor_output_tokens: int | None
```

These fields support cost-efficiency views and make advisor-heavy runs visible during review.
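One way to surface advisor-heavy runs is a ratio of advisor tokens to agent tokens. The sketch below makes several assumptions: the `Cost` dataclass is a stand-in that mirrors a subset of CostRecord fields rather than the real model, `advisor_to_agent_ratio` is a hypothetical helper, and advisor tokens are assumed to be counted separately from tokens_in and tokens_out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cost:
    """Stand-in mirroring a subset of CostRecord fields (not the real model)."""
    tokens_in: Optional[int] = None
    tokens_out: Optional[int] = None
    advisor_input_tokens: Optional[int] = None
    advisor_output_tokens: Optional[int] = None


def advisor_to_agent_ratio(cost: Cost) -> Optional[float]:
    """Advisor tokens divided by agent tokens.

    Returns None when any figure is unknown or the agent total is zero,
    so missing data is never mistaken for a zero ratio.
    """
    values = (
        cost.tokens_in,
        cost.tokens_out,
        cost.advisor_input_tokens,
        cost.advisor_output_tokens,
    )
    if any(v is None for v in values):
        return None
    agent_total = cost.tokens_in + cost.tokens_out
    if agent_total == 0:
        return None
    return (cost.advisor_input_tokens + cost.advisor_output_tokens) / agent_total


ratio = advisor_to_agent_ratio(Cost(8000, 2000, 4000, 1000))
unknown = advisor_to_agent_ratio(Cost())
print(ratio, unknown)
```

Returning None for incomplete data matters because every CostRecord field is optional; treating a missing count as zero would understate advisor use.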
The ledger
The ledger of completed trial records is append-only. Once written to the ledger, a trial record should not be modified. New information attaches through new records or review artifacts that reference the original trial_id.
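The append-only rule can be sketched with a small JSONL-backed store. This is not the aec-bench ledger implementation: the `Ledger` class, the JSONL storage layout, and the `references_trial_id` field are all hypothetical and serve only to illustrate the behaviour.

```python
import json
import tempfile
from pathlib import Path


class Ledger:
    """Hypothetical append-only JSONL ledger keyed by trial_id."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def trial_ids(self) -> set[str]:
        """Return every trial_id currently stored."""
        if not self.path.exists():
            return set()
        with self.path.open(encoding="utf-8") as fh:
            return {json.loads(line)["trial_id"] for line in fh if line.strip()}

    def append(self, record: dict) -> None:
        """Append a record; refuse to replace an existing trial_id."""
        if record["trial_id"] in self.trial_ids():
            raise ValueError(f"trial {record['trial_id']} is already in the ledger")
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    ledger = Ledger(Path(tmp) / "ledger.jsonl")
    ledger.append({"trial_id": "t-001", "completeness": "complete"})
    try:
        # A second write under the same trial_id is rejected.
        ledger.append({"trial_id": "t-001", "completeness": "complete"})
        rejected = False
    except ValueError:
        rejected = True
    # New information arrives as a new record that points back to the original.
    ledger.append({"trial_id": "t-002", "references_trial_id": "t-001"})
    stored_ids = ledger.trial_ids()

print(rejected, sorted(stored_ids))
```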
Imported Prime rollouts
Hosted Prime eval samples can be materialised into the same ledger shape:
```shell
uv run aec-bench import-prime-eval <prime-eval-id> \
    --experiment prime-eval-medium-stateful
```

The import writes one partial TrialRecord per hosted sample. Each record points to a local conversation.jsonl built from the Prime prompt and completion messages, preserves the raw prime_sample.json, and records Prime metrics such as reward, timing, token usage, stop condition, tool-call counts, and submit_answer calls in the normal output/evaluation fields.
Imported Prime records are diagnostic artefacts. They are useful for reviewing rollouts, comparing base and adapter behaviour, and running trace reports; leaderboard-grade claims still need the usual benchmark controls around task slice, model, repetitions, verifier provenance, and review policy.
Inspecting a trace
For quick diagnostics, the evaluation layer can load the trajectory first and fall back to conversation.jsonl when no trajectory is present. Trace summaries expose turn counts, tool errors, first error seen, and the behavioural classification described in Classification. Full trajectories are the evidence surface for step-by-step inspection.
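The load-then-fall-back behaviour and a basic summary can be sketched as follows. This is an illustration under stated assumptions, not the evaluation layer's code: `load_entries` and `summarise` are hypothetical helpers, conversation.jsonl is assumed to share the entry shape of the trajectory, and a nonzero exit_code is assumed to mark a tool error.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_entries(run_dir: Path) -> list[dict[str, Any]]:
    """Prefer trajectory.jsonl; fall back to conversation.jsonl when it is absent."""
    for name in ("trajectory.jsonl", "conversation.jsonl"):
        path = run_dir / name
        if path.exists():
            with path.open(encoding="utf-8") as fh:
                rows = [json.loads(line) for line in fh if line.strip()]
            # Drop the format header line if present (it has no role field).
            return [r for r in rows if "role" in r]
    return []


def summarise(entries: list[dict[str, Any]]) -> dict[str, Any]:
    """Count assistant turns and tool errors, and report the first error seen."""
    errors = [
        e for e in entries
        if e.get("role") == "tool_result" and e.get("exit_code") not in (None, 0)
    ]
    return {
        "turns": sum(1 for e in entries if e.get("role") == "assistant"),
        "tool_errors": len(errors),
        "first_error": errors[0].get("stderr") if errors else None,
    }


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    lines = [
        {"step": 1, "role": "assistant", "content": "Running the calculation..."},
        {"step": 1, "role": "tool_result", "tool_name": "python", "stderr": "NameError: x", "exit_code": 1},
        {"step": 2, "role": "assistant", "content": "Fixing the script..."},
    ]
    # Only conversation.jsonl exists here, so the fallback path is exercised.
    (run_dir / "conversation.jsonl").write_text(
        "\n".join(json.dumps(line) for line in lines), encoding="utf-8"
    )
    summary = summarise(load_entries(run_dir))

print(summary)
```

The behavioural classification mentioned above is out of scope for this sketch; it covers only the counts and the first error.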