
Traces

A trace is the per-turn trajectory plus the TrialRecord that preserves task, agent, output, evaluation, timing, and cost evidence.

The trajectory

During execution, an agent harness can append to /workspace/trajectory.jsonl. The file starts with a format header, then records structured entries for messages, tool calls, tool results, and run metadata:

{"version": 1, "format": "aec-bench-trajectory"}
{"step": 0, "role": "system", "content": "You are a senior electrical engineer..."}
{"step": 0, "role": "user", "content": "Calculate voltage drop..."}
{"step": 1, "role": "assistant", "content": "I'll compute R and X contributions..."}
{"step": 1, "role": "tool_call", "tool_name": "python", "command": "python calc.py"}
{"step": 1, "role": "tool_result", "tool_name": "python", "stdout": "V_drop = 5.2V", "exit_code": 0, "duration_ms": 142}
{"step": 2, "role": "assistant", "content": "Writing result to output.jsonl..."}

Each entry carries step and role. Text entries usually carry content; tool entries can carry tool_name, command, arguments, stdout, stderr, exit_code, duration_ms, media, and optional metadata. The schema is deliberately small so traces stay easy to parse.
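Because the format is line-delimited JSON, a reader needs very little code. The following is a minimal sketch, not the project's own loader: the function name parse_trajectory is hypothetical, and the only checks it makes are the ones stated above (a format header on the first line, then step and role on every entry).

```python
import json
from typing import Any

EXPECTED_FORMAT = "aec-bench-trajectory"  # value taken from the header line shown above


def parse_trajectory(text: str) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split JSONL trajectory text into (header, entries). Hypothetical helper."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty trajectory")
    header = json.loads(lines[0])
    if header.get("format") != EXPECTED_FORMAT:
        raise ValueError(f"unexpected format: {header.get('format')!r}")
    entries = []
    for position, line in enumerate(lines[1:], start=2):
        entry = json.loads(line)
        # Every entry is documented to carry step and role.
        if "step" not in entry or "role" not in entry:
            raise ValueError(f"non-blank line {position}: missing step or role")
        entries.append(entry)
    return header, entries


SAMPLE = "\n".join([
    '{"version": 1, "format": "aec-bench-trajectory"}',
    '{"step": 0, "role": "user", "content": "Calculate voltage drop..."}',
    '{"step": 1, "role": "tool_call", "tool_name": "python", "command": "python calc.py"}',
    '{"step": 1, "role": "tool_result", "tool_name": "python", "stdout": "V_drop = 5.2V", "exit_code": 0, "duration_ms": 142}',
])

header, entries = parse_trajectory(SAMPLE)
tool_results = [e for e in entries if e["role"] == "tool_result"]
print(header["version"], len(entries), tool_results[0]["exit_code"])
```

Optional fields such as stderr or media are left untouched, so callers should read them with dict.get rather than assume they are present.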

The trial record

Once evaluation finishes, the runner assembles a TrialRecord, the self-contained record of one agent run:

class TrialRecord(StrictModel):
    trial_id: NonEmptyStr
    experiment_id: NonEmptyStr
    dataset_id: str | None = None
    timestamp: datetime

    task: TaskReference              # task_id + content hash
    agent: AgentReference            # harness, model, configuration

    environment: EnvironmentSnapshot # Docker image, compute backend
    inputs: InputRecord              # instruction, system prompt, files
    outputs: OutputRecord            # agent output, raw output, conversation, trajectory
    evaluation: EvaluationResult

    timing: TimingRecord             # wall-clock per phase
    cost: CostRecord | None          # tokens, advisor calls, estimated USD
    adaptation: AdaptationProvenance | None
    completeness: Completeness       # complete | partial

Everything needed to reproduce the run is pinned in the record, by content hash or revision:

  • Task reference: task_id plus task_revision
  • Agent reference: harness name, resolved model, optional harness revision, and effective configuration
  • Input record: instruction, optional system prompt, and input file references
  • Environment snapshot: runtime image, compute backend, and tool versions when available

If any of these change later, the new run should produce a different record rather than mutating the old one.
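To illustrate why pinning makes a changed input show up as a different record, here is a sketch of a content hash over a set of files. It is an assumption for illustration only: the function content_hash and the choice of SHA-256 are hypothetical, and this page does not state how aec-bench derives its task revision.

```python
import hashlib


def content_hash(files: dict[str, bytes]) -> str:
    """Hash file names and contents in a stable order (hypothetical scheme)."""
    digest = hashlib.sha256()
    for name in sorted(files):  # sorting makes the result independent of dict order
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name and content cannot run together
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


original = {"instruction.md": b"Calculate voltage drop...", "data.csv": b"r,x\n"}
reordered = {"data.csv": b"r,x\n", "instruction.md": b"Calculate voltage drop..."}
edited = {"instruction.md": b"Calculate voltage rise...", "data.csv": b"r,x\n"}

same = content_hash(original) == content_hash(reordered)
changed = content_hash(original) != content_hash(edited)
print(same, changed)
```

Any edit to a file changes the digest, which is the property that lets a later run be told apart from the original instead of silently replacing it.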

Completeness

Trial records have two states:

State     Meaning
partial   Written during execution. Some provenance fields may be missing.
complete  Sealed after evaluation with required provenance. Eligible for the ledger.

Complete records must include agent.adapter_revision, environment.tool_versions, and inputs.input_files. Partial records are useful for debugging, but they should not back a leaderboard number.
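The completeness rule can be checked mechanically. The sketch below works on a plain dict (for example, a record loaded from JSON) rather than on the TrialRecord model itself; the helper name missing_provenance is hypothetical, and treating an absent key or a None value as "missing" is an assumption.

```python
from typing import Any

# Dotted paths taken from the paragraph above.
REQUIRED_FOR_COMPLETE = (
    "agent.adapter_revision",
    "environment.tool_versions",
    "inputs.input_files",
)


def missing_provenance(record: dict[str, Any]) -> list[str]:
    """Return the required dotted paths that are absent or None."""
    missing = []
    for path in REQUIRED_FOR_COMPLETE:
        node: Any = record
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            missing.append(path)
    return missing


record = {
    "agent": {"adapter_revision": "abc123"},  # placeholder value
    "environment": {},                        # tool_versions not captured
    "inputs": {"input_files": []},            # empty list still counts as present
}
print(missing_provenance(record))
```

A record for which this returns a non-empty list would stay partial under the rule above.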

Cost tracking

Each trial optionally records token usage and an estimated USD cost:

class CostRecord(StrictModel):
    tokens_in: int | None
    tokens_out: int | None
    cache_read_tokens: int | None
    cache_write_tokens: int | None
    estimated_cost_usd: float | None
    advisor_calls: int | None
    advisor_input_tokens: int | None
    advisor_output_tokens: int | None

These fields support cost-efficiency views and make advisor-heavy runs visible during review.
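As a rough illustration of such a view, the sketch below totals tokens, cost, and advisor calls across trials. CostSketch is a stand-in dataclass that mirrors a subset of the CostRecord fields; it is not the real model, and summarise_costs is a hypothetical helper. Trials with no cost record, or with no price estimate, are counted but not priced.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CostSketch:
    """Stand-in for a subset of CostRecord fields; not the real model."""
    tokens_in: int | None = None
    tokens_out: int | None = None
    estimated_cost_usd: float | None = None
    advisor_calls: int | None = None


def summarise_costs(costs: list[CostSketch | None]) -> dict[str, float | int]:
    """Aggregate optional cost fields, treating None as unknown."""
    known = [c for c in costs if c is not None]
    priced = [c.estimated_cost_usd for c in known if c.estimated_cost_usd is not None]
    return {
        "trials": len(costs),
        "trials_with_cost": len(priced),
        "total_tokens": sum((c.tokens_in or 0) + (c.tokens_out or 0) for c in known),
        "total_usd": sum(priced),
        "advisor_calls": sum(c.advisor_calls or 0 for c in known),
    }


costs = [
    CostSketch(tokens_in=1200, tokens_out=300, estimated_cost_usd=0.5, advisor_calls=2),
    CostSketch(tokens_in=800, tokens_out=200),  # tokens known, price unknown
    None,                                       # trial without a cost record
]
summary = summarise_costs(costs)
print(summary)
```

Reporting trials_with_cost alongside total_usd keeps a partially priced experiment from looking cheaper than it was.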

The ledger

The ledger is append-only. Once a completed trial record is written to it, the record should not be modified. New information attaches through new records or review artefacts that reference the original trial_id.
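The no-overwrite rule can be sketched with a JSONL file that is only ever opened in append mode. The storage layout is an assumption: this page does not say how the ledger is stored, and append_record and ledger.jsonl are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def append_record(ledger: Path, record: dict) -> None:
    """Append one record; refuse a trial_id that is already present."""
    existing = set()
    if ledger.exists():
        with ledger.open(encoding="utf-8") as handle:
            existing = {json.loads(line)["trial_id"] for line in handle if line.strip()}
    if record["trial_id"] in existing:
        raise ValueError(f"trial_id {record['trial_id']!r} already in ledger")
    # Mode "a" only ever adds lines; earlier records are never rewritten.
    with ledger.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "ledger.jsonl"
    append_record(ledger, {"trial_id": "t-001", "completeness": "complete"})
    try:
        append_record(ledger, {"trial_id": "t-001", "completeness": "complete"})
        rejected = False
    except ValueError:
        rejected = True
    line_count = len(ledger.read_text(encoding="utf-8").splitlines())

print(rejected, line_count)
```

A correction to an existing trial would then be written as a separate record or review artefact carrying the original trial_id as a reference, rather than as an edit to the stored line.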

Imported Prime rollouts

Hosted Prime eval samples can be materialised into the same ledger shape:

uv run aec-bench import-prime-eval <prime-eval-id> \
  --experiment prime-eval-medium-stateful

The import writes one partial TrialRecord per hosted sample. Each record points to a local conversation.jsonl built from the Prime prompt and completion messages, preserves the raw prime_sample.json, and records Prime metrics such as reward, timing, token usage, stop condition, tool-call counts, and submit_answer calls in the normal output/evaluation fields.

Imported Prime records are diagnostic artefacts. They are useful for reviewing rollouts, comparing base and adapter behaviour, and running trace reports; leaderboard-grade claims still need the usual benchmark controls around task slice, model, repetitions, verifier provenance, and review policy.

Inspecting a trace

For quick diagnostics, the evaluation layer can load the trajectory first and fall back to conversation.jsonl when no trajectory is present. Trace summaries expose turn counts, tool errors, first error seen, and the behavioural classification described in Classification. Full trajectories are the evidence surface for step-by-step inspection.
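The fallback and a basic summary can be sketched as follows. This is not the evaluation layer's own code: load_entries and summarise are hypothetical names, and several details are assumptions, namely that both files sit in one trial directory, that conversation.jsonl uses the same role-tagged JSONL entries, that a turn is one assistant entry, and that a tool error is a tool_result with a non-zero exit_code.

```python
import json
import tempfile
from pathlib import Path


def load_entries(trial_dir: Path) -> tuple[str, list[dict]]:
    """Prefer trajectory.jsonl; fall back to conversation.jsonl."""
    for name in ("trajectory.jsonl", "conversation.jsonl"):
        path = trial_dir / name
        if path.exists():
            text = path.read_text(encoding="utf-8")
            rows = [json.loads(line) for line in text.splitlines() if line.strip()]
            # Drop the format header, which has no role.
            return name, [row for row in rows if "role" in row]
    raise FileNotFoundError(f"no trajectory or conversation in {trial_dir}")


def summarise(entries: list[dict]) -> dict:
    """Count assistant turns and failed tool results (assumed definitions)."""
    errors = [
        e for e in entries
        if e.get("role") == "tool_result" and e.get("exit_code") not in (None, 0)
    ]
    return {
        "turns": sum(1 for e in entries if e.get("role") == "assistant"),
        "tool_errors": len(errors),
        "first_error": errors[0].get("stderr") if errors else None,
    }


rows = [
    {"step": 0, "role": "user", "content": "Calculate voltage drop..."},
    {"step": 1, "role": "assistant", "content": "Running the calculation."},
    {"step": 1, "role": "tool_result", "tool_name": "python", "stderr": "NameError", "exit_code": 1},
    {"step": 2, "role": "assistant", "content": "Fixing the script."},
]

with tempfile.TemporaryDirectory() as tmp:
    trial_dir = Path(tmp)
    # Only conversation.jsonl exists here, so the fallback path is exercised.
    (trial_dir / "conversation.jsonl").write_text(
        "\n".join(json.dumps(row) for row in rows), encoding="utf-8"
    )
    source, entries = load_entries(trial_dir)
    summary = summarise(entries)

print(source, summary)
```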
