aec-bench

Scoring

Scoring turns an agent's output into a validated reward by reading verifier artifacts into an EvaluationResult.

When a trial finishes, the execution backend runs the task's verifier against the agent output file. The verifier writes JSON artifacts, the runner parses them into an EvaluationResult, and that result is attached to the trial record.

The verifier contract

Each task exposes a VerifierSpec:

class VerifierSpec(StrictModel):
    script: str
    expected_output_path: str
    reward_path: str
    details_path: str | None = None

For current task directories, the loader resolves tests/test.sh as the verifier entry point. Harbor expects that shell script inside the container. The verifier reads the agent output from expected_output_path, usually a /workspace/... path inferred from the task instruction, and writes:

/logs/verifier/reward.json     # mandatory: {"reward": 0.0-1.0}
/logs/verifier/details.json    # optional: per-field scores and evidence

A minimal reward.json is just the headline number:

{ "reward": 0.93 }

The optional details.json carries the per-dimension breakdown:

{
  "voltage_drop_v": { "score": 0.95, "max_score": 1.0, "evidence": "within 2% of reference" },
  "voltage_drop_pct": { "score": 1.0, "max_score": 1.0, "evidence": "exact match" },
  "compliance": { "score": 1.0, "max_score": 1.0, "evidence": "correctly flagged compliant" }
}

Reward rollup

reward.json is the authoritative trial reward. Artifact ingestion does not recompute the reward from details.json; it reads the reward, validates it, and stores details.json as EvaluationResult.breakdown.

For tasks that need rubric-style scoring, verifier code or judge pipelines can call the rubric scorer before writing artifacts. The scorer normalises each dimension to score / max_score, clamps it to [0, 1], and combines dimensions with a rollup strategy:

| Strategy | Behaviour | Used when |
| --- | --- | --- |
| weighted_mean (default) | sum(normalised * weight) / sum(weight) | Balanced judgement across fields |
| min | Worst dimension wins | A single wrong field should fail the task |

Rubric rewards round to 4 decimals and must land in [0.0, 1.0]. If a verifier writes reward.json directly without a rubric, that value is used as-is after validation.
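A minimal sketch of the two rollup strategies, assuming each dimension is a dict with score and max_score and an optional weight key that defaults to 1.0. The function name rollup and the weight key are illustrative assumptions, not the rubric scorer's real interface.

```python
def rollup(dimensions: dict[str, dict[str, float]], strategy: str = "weighted_mean") -> float:
    """Combine rubric dimensions into a single reward in [0.0, 1.0]."""
    if not dimensions:
        raise ValueError("at least one dimension is required")

    # Normalise each dimension to score / max_score and clamp to [0, 1].
    normalised: dict[str, float] = {}
    for name, dim in dimensions.items():
        if dim["max_score"] <= 0:
            raise ValueError(f"max_score must be positive for {name!r}")
        normalised[name] = min(1.0, max(0.0, dim["score"] / dim["max_score"]))

    if strategy == "min":
        reward = min(normalised.values())
    elif strategy == "weighted_mean":
        # Assumed convention: a missing weight counts as 1.0.
        weights = {name: dim.get("weight", 1.0) for name, dim in dimensions.items()}
        reward = sum(normalised[n] * weights[n] for n in normalised) / sum(weights.values())
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return round(reward, 4)


dims = {
    "voltage_drop_v": {"score": 0.95, "max_score": 1.0},
    "voltage_drop_pct": {"score": 1.0, "max_score": 1.0},
    "compliance": {"score": 1.0, "max_score": 1.0},
}
mean_reward = rollup(dims)        # 0.9833
min_reward = rollup(dims, "min")  # 0.95
```

With equal weights the three dimensions average to 2.95 / 3, which rounds to 0.9833, while min reports the weakest dimension, 0.95.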

Validity gates

Before scoring, the pipeline checks whether the output is even fit for scoring:

class ValidityCheck(StrictModel):
    output_parseable: bool       # did the file parse as its declared format?
    schema_valid: bool           # did it match the expected shape?
    verifier_completed: bool     # did reward.json appear?
    errors: list[str] = Field(default_factory=list)

If output_parseable=False, a non-zero reward is invalid: the EvaluationResult cross-field validator rejects that combination. A missing reward artifact is ingested as a reward of 0.0 with verifier_completed=False.

The other two flags, schema_valid and verifier_completed, are diagnostic. They do not force the reward by themselves, but they surface in reports and help distinguish "the agent was wrong" from "the agent did not produce anything usable".
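The cross-field rule can be sketched with standard-library dataclasses standing in for StrictModel. The check in __post_init__ is an assumed equivalent of the real validator, and the trimmed EvaluationResult here carries only the two fields the rule needs.

```python
from dataclasses import dataclass, field


@dataclass
class ValidityCheck:
    output_parseable: bool
    schema_valid: bool
    verifier_completed: bool
    errors: list[str] = field(default_factory=list)


@dataclass
class EvaluationResult:
    reward: float
    validity: ValidityCheck

    def __post_init__(self) -> None:
        if not 0.0 <= self.reward <= 1.0:
            raise ValueError(f"reward out of range: {self.reward}")
        # Cross-field rule: unparseable output cannot carry a non-zero reward.
        if not self.validity.output_parseable and self.reward != 0.0:
            raise ValueError("non-zero reward with output_parseable=False")


# Unparseable output with a zero reward is accepted.
ok = EvaluationResult(0.0, ValidityCheck(False, False, True, ["output is not valid JSON"]))

# Unparseable output with a non-zero reward is rejected.
try:
    EvaluationResult(0.5, ValidityCheck(False, False, True))
    rejected = False
except ValueError:
    rejected = True
```

Only output_parseable gates the reward in this sketch; schema_valid and verifier_completed are carried along without affecting validation, matching their diagnostic role.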

The full EvaluationResult

Putting it together:

class EvaluationResult(StrictModel):
    reward: float                          # 0.0 to 1.0
    validity: ValidityCheck
    breakdown: dict[str, Any] | None       # per-field scores + evidence
    error_taxonomy: list[ErrorTag] | None  # classified failures
    confidence: ConfidenceMetadata | None  # statistical metadata
    annotations: list[Annotation] | None   # human review verdicts

Error taxonomy

Verifiers (or post-hoc analysis) can tag failures for grouping in reports:

class ErrorTag(StrictModel):
    category: str              # "tool failure", "parsing error", "unit mismatch"
    description: str | None = None
    source: ErrorSource        # mechanical | human | judge

Tasks define their own categories; there is no fixed taxonomy. source tracks whether the classification came from the runner, a reviewer, or an automated judge.

Annotations

annotations is for human review after the automated run:

class Annotation(StrictModel):
    reviewer_id: str
    reviewer_discipline: str | None = None
    timestamp: datetime
    judgment: Judgment             # pass | fail | defer
    categories: list[str] = Field(default_factory=list)
    notes: str | None

These do not overwrite the automated reward. They sit alongside it, so disagreement between a reviewer and a verifier stays visible on the record.
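One way a report could surface such disagreement is sketched below. The pass_threshold used to read the automated reward as pass or fail is purely an assumption for this example; the source does not define one, and disagrees is a hypothetical helper.

```python
from enum import Enum


class Judgment(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    DEFER = "defer"


def disagrees(reward: float, judgment: Judgment, pass_threshold: float = 0.5) -> bool:
    """Return True when a reviewer's verdict contradicts the automated reward."""
    if judgment is Judgment.DEFER:
        # A deferred verdict takes no position, so it cannot disagree.
        return False
    verifier_passed = reward >= pass_threshold
    reviewer_passed = judgment is Judgment.PASS
    return verifier_passed != reviewer_passed


high_reward_failed_review = disagrees(0.93, Judgment.FAIL)
high_reward_passed_review = disagrees(0.93, Judgment.PASS)
```

The reward itself is never modified here; the helper only compares the two signals, which keeps the record consistent with annotations sitting alongside the automated score.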

Determinism

Mechanical verifier scoring should be deterministic: the same agent output should produce the same reward. When human review or judge-based classification is used, record the extra uncertainty in confidence, annotations, and error_taxonomy rather than hiding it inside an unexplained reward. See Classification.
