Scoring
Scoring turns an agent's output into a validated reward by reading verifier artifacts into an EvaluationResult.
When a trial finishes, the execution backend runs the task's verifier against the agent output file. The verifier writes JSON artifacts, the runner parses them into an EvaluationResult, and that result is attached to the trial record.
The verifier contract
Each task exposes a VerifierSpec:
class VerifierSpec(StrictModel):
    script: str
    expected_output_path: str
    reward_path: str
    details_path: str | None = None
For current task directories, the loader resolves tests/test.sh as the verifier entry point. Harbor expects that shell script inside the container. The verifier reads the agent output from expected_output_path, usually a /workspace/... path inferred from the task instruction, and writes:
/logs/verifier/reward.json   # mandatory: {"reward": 0.0-1.0}
/logs/verifier/details.json  # optional: per-field scores and evidence
A minimal reward.json is just the headline number:
{ "reward": 0.93 }
The optional details.json carries the per-dimension breakdown:
{
  "voltage_drop_v": { "score": 0.95, "max_score": 1.0, "evidence": "within 2% of reference" },
  "voltage_drop_pct": { "score": 1.0, "max_score": 1.0, "evidence": "exact match" },
  "compliance": { "score": 1.0, "max_score": 1.0, "evidence": "correctly flagged compliant" }
}
Reward rollup
reward.json is the authoritative trial reward. Artifact ingestion does not recompute the reward from details.json; it reads the reward, validates it, and stores details.json as EvaluationResult.breakdown.
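The ingestion step described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function name `ingest_artifacts`, the plain-dict return shape, and the choice to raise `ValueError` on a malformed or out-of-range reward are all assumptions. Only the "missing reward.json becomes 0.0 with verifier_completed=False" behaviour comes from this page.

```python
from __future__ import annotations

import json
import math
from pathlib import Path
from typing import Any


def ingest_artifacts(reward_path: Path, details_path: Path | None = None) -> dict[str, Any]:
    """Read verifier artifacts without recomputing the reward (hypothetical helper)."""
    # A missing reward artifact is ingested as 0.0 and flagged as incomplete.
    if not reward_path.is_file():
        return {"reward": 0.0, "verifier_completed": False, "breakdown": None}

    payload = json.loads(reward_path.read_text(encoding="utf-8"))
    reward = payload.get("reward") if isinstance(payload, dict) else None

    # Assumption: anything that is not a finite number in [0.0, 1.0] is rejected.
    if isinstance(reward, bool) or not isinstance(reward, (int, float)):
        raise ValueError("reward.json must contain a numeric 'reward' field")
    if not math.isfinite(reward) or not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward {reward!r} is outside [0.0, 1.0]")

    # details.json is stored verbatim as the breakdown; it never changes the reward.
    breakdown = None
    if details_path is not None and details_path.is_file():
        breakdown = json.loads(details_path.read_text(encoding="utf-8"))

    return {"reward": float(reward), "verifier_completed": True, "breakdown": breakdown}
```

The breakdown is passed through untouched, which keeps reward.json as the single source of truth for the headline number.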
For tasks that need rubric-style scoring, verifier code or judge pipelines can call the rubric scorer before writing artifacts. The scorer normalises each dimension to score / max_score, clamps it to [0, 1], and combines dimensions with a rollup strategy:
| Strategy | Behaviour | Used when |
|---|---|---|
| weighted_mean (default) | sum(normalised * weight) / sum(weight) | Balanced judgement across fields |
| min | Worst dimension wins | A single wrong field should fail the task |
Rubric rewards round to 4 decimals and must land in [0.0, 1.0]. If a verifier writes reward.json directly without a rubric, that value is used as-is after validation.
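To make the two rollup strategies concrete, here is a small sketch of the arithmetic. The `Dimension` dataclass, the `rollup` function, and the per-dimension `weight` field defaulting to 1.0 are hypothetical names chosen for illustration; the real rubric scorer may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    score: float
    max_score: float
    weight: float = 1.0  # assumption: equal weights unless stated otherwise


def normalise(dim: Dimension) -> float:
    """Normalise to score / max_score and clamp to [0, 1]."""
    if dim.max_score <= 0:
        raise ValueError("max_score must be positive")
    return min(1.0, max(0.0, dim.score / dim.max_score))


def rollup(dimensions: dict[str, Dimension], strategy: str = "weighted_mean") -> float:
    """Combine normalised dimensions into one reward, rounded to 4 decimals."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    values = {name: normalise(dim) for name, dim in dimensions.items()}

    if strategy == "min":
        reward = min(values.values())
    elif strategy == "weighted_mean":
        total_weight = sum(dim.weight for dim in dimensions.values())
        if total_weight <= 0:
            raise ValueError("weights must sum to a positive number")
        weighted = sum(values[name] * dim.weight for name, dim in dimensions.items())
        reward = weighted / total_weight
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return round(reward, 4)


dims = {
    "voltage_drop_v": Dimension(score=0.95, max_score=1.0),
    "voltage_drop_pct": Dimension(score=1.0, max_score=1.0),
    "compliance": Dimension(score=1.0, max_score=1.0),
}
print(rollup(dims))         # 0.9833 with equal weights
print(rollup(dims, "min"))  # 0.95, the worst dimension
```

With equal weights the three example dimensions average to 2.95 / 3, while `min` lets the weakest dimension set the reward on its own.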
Validity gates
Before scoring, the pipeline checks whether the output is even fit for scoring:
class ValidityCheck(StrictModel):
    output_parseable: bool     # did the file parse as its declared format?
    schema_valid: bool         # did it match the expected shape?
    verifier_completed: bool   # did reward.json appear?
    errors: list[str] = Field(default_factory=list)
If output_parseable=False, a non-zero reward is invalid. The EvaluationResult cross-field validator rejects that combination; missing reward artifacts are ingested as 0.0 with verifier_completed=False.
The other two flags are diagnostic. They do not force the reward by themselves, but they surface in reports and help distinguish "the agent was wrong" from "the agent did not produce anything usable".
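The cross-field rule can be illustrated with a standard-library stand-in. The sketch below uses dataclasses rather than StrictModel, so it shows the shape of the check, not the project's validator; the class names mirror the ones on this page only for readability, and the error messages are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ValidityCheck:
    output_parseable: bool
    schema_valid: bool
    verifier_completed: bool
    errors: list[str] = field(default_factory=list)


@dataclass
class EvaluationResult:
    reward: float
    validity: ValidityCheck

    def __post_init__(self) -> None:
        # Stand-in for the cross-field validator (assumed behaviour).
        if not 0.0 <= self.reward <= 1.0:
            raise ValueError("reward must be in [0.0, 1.0]")
        if not self.validity.output_parseable and self.reward != 0.0:
            raise ValueError("non-zero reward with output_parseable=False")


# Unparseable output with a zero reward is accepted.
EvaluationResult(0.0, ValidityCheck(False, False, True, ["output was not valid JSON"]))

# schema_valid=False is diagnostic only, so a non-zero reward is still allowed.
EvaluationResult(0.4, ValidityCheck(True, False, True))
```

Only `output_parseable` gates the reward here; the other two flags pass through untouched so that reports can use them.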
The full EvaluationResult
Putting it together:
class EvaluationResult(StrictModel):
    reward: float                            # 0.0 to 1.0
    validity: ValidityCheck
    breakdown: dict[str, Any] | None         # per-field scores + evidence
    error_taxonomy: list[ErrorTag] | None    # classified failures
    confidence: ConfidenceMetadata | None    # statistical metadata
    annotations: list[Annotation] | None     # human review verdicts
Error taxonomy
Verifiers (or post-hoc analysis) can tag failures for grouping in reports:
class ErrorTag(StrictModel):
    category: str                    # "tool failure", "parsing error", "unit mismatch"
    description: str | None = None
    source: ErrorSource              # mechanical | human | judge
Tasks define their own categories; there is no fixed taxonomy. source tracks whether the classification came from the runner, a reviewer, or an automated judge.
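As an illustration of how free-form tags might be grouped for a report, the sketch below counts tags by category and source. The dataclass stand-in for ErrorTag, the enum member names, and the `count_tags` helper are assumptions; only the three source values come from this page.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorSource(str, Enum):
    MECHANICAL = "mechanical"
    HUMAN = "human"
    JUDGE = "judge"


@dataclass(frozen=True)
class ErrorTag:
    category: str
    source: ErrorSource
    description: str | None = None


def count_tags(tags: list[ErrorTag]) -> Counter:
    """Count tags per (category, source) pair (hypothetical report helper)."""
    return Counter((tag.category, tag.source.value) for tag in tags)


tags = [
    ErrorTag("unit mismatch", ErrorSource.MECHANICAL),
    ErrorTag("unit mismatch", ErrorSource.MECHANICAL, "volts reported as millivolts"),
    ErrorTag("unit mismatch", ErrorSource.HUMAN),
    ErrorTag("parsing error", ErrorSource.JUDGE),
]
for (category, source), count in count_tags(tags).most_common():
    print(f"{category} [{source}]: {count}")
```

Keeping the source in the grouping key means a category flagged by both the runner and a reviewer shows up as two separate rows instead of being merged.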
Annotations
annotations is for human review after the automated run:
class Annotation(StrictModel):
    reviewer_id: str
    reviewer_discipline: str | None = None
    timestamp: datetime
    judgment: Judgment    # pass | fail | defer
    categories: list[str] = Field(default_factory=list)
    notes: str | None
These do not overwrite the automated reward. They sit alongside it, so disagreement between a reviewer and a verifier stays visible on the record.
Determinism
Mechanical verifier scoring should be deterministic: the same agent output should produce the same reward. When human review or judge-based classification is used, record the extra uncertainty in confidence, annotations, and error_taxonomy rather than hiding it inside an unexplained reward. See Classification.
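One way to spot-check the determinism expectation is to score the same output several times and compare the results. The helper below is a hypothetical sketch: `is_deterministic` and the `score` callable are illustrative names, and a real check would invoke the task's verifier script, not an in-process function.

```python
from typing import Callable


def is_deterministic(score: Callable[[str], float], agent_output: str, runs: int = 3) -> bool:
    """Return True if repeated scoring of the same output yields a single reward."""
    if runs < 2:
        raise ValueError("at least two runs are needed to compare rewards")
    rewards = {score(agent_output) for _ in range(runs)}
    return len(rewards) == 1


def exact_match(agent_output: str) -> float:
    # Toy mechanical scorer: full reward only for the expected answer.
    return 1.0 if agent_output.strip() == "compliant" else 0.0


print(is_deterministic(exact_match, "compliant"))
```

A passing check does not prove determinism, but a failing one is a cheap signal that the verifier depends on hidden state such as time, randomness, or ordering.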