aec-bench

Classification

Classification tags assistant turns with behavioural bond types and compares trace structure against high-reward references.

The bond taxonomy

Each assistant turn in a trajectory can be tagged with one of four bond types. The chemistry language is only an analogy; the stored values are lowercase strings:

| Bond | Chemical analogue | Behavioural shape | Typical signals |
| --- | --- | --- | --- |
| execution | Covalent | Normal operation | tool calls, code execution, structured output |
| verification | Hydrogen | Self-reflection | result comparison, error checking, backward references |
| deliberation | Metallic | Deep reasoning | causal chains, committed plans, step-by-step logic |
| exploration | Van der Waals | Hypothesising | hedging, alternatives, branching, open questions |

High-reward traces are used to build reference transition patterns and, where useful, an ideal bond sequence. The common shape is deliberation -> execution -> verification, but the reference can be learned from the selected traces rather than hard-coded.
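
As a rough illustration of how reference transition patterns might be learned from selected traces, the sketch below sums adjacent bond pairs over every trace whose reward meets a cutoff. The function names, the `(reward, sequence)` tuple shape, and the `min_reward` parameter are assumptions made for this example, not part of aec-bench.

```python
from collections import Counter


def transition_counts(sequence):
    """Count adjacent (from_bond, to_bond) pairs in one bond sequence."""
    return Counter(zip(sequence, sequence[1:]))


def reference_transitions(traces, min_reward):
    """Sum transition counts over traces whose reward meets the cutoff.

    `traces` is a list of (reward, bond_sequence) tuples; this shape and
    the cutoff rule are hypothetical, chosen only for illustration.
    """
    total = Counter()
    for reward, sequence in traces:
        if reward >= min_reward:
            total += transition_counts(sequence)
    return total


traces = [
    (0.9, ["deliberation", "execution", "verification"]),
    (0.8, ["deliberation", "execution", "execution", "verification"]),
    (0.2, ["exploration", "exploration", "execution"]),  # below cutoff
]
reference = reference_transitions(traces, min_reward=0.5)
```

Because the low-reward trace is excluded, its exploration-heavy transitions never enter the reference, which is the sense in which the pattern is learned rather than hard-coded.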

Turn classification

The LLMTurnClassifier reads assistant turns in batches and emits a TurnClassification for each one:

from dataclasses import dataclass


@dataclass(frozen=True)
class TurnClassification:
    turn_index: int
    bond_type: BondType           # execution | verification | deliberation | exploration
    confidence: float             # 0.0–1.0
    rationale: str = ""           # why this bond was assigned

The loader prefers structured trajectory.jsonl and falls back to conversation.jsonl when needed. Each classification carries its own confidence so low-certainty tags can be inspected before they influence aggregates.
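
One way to inspect low-certainty tags before they reach aggregates is to split classifications at a confidence cutoff. This is a minimal sketch: the `split_by_confidence` helper and the 0.7 threshold are hypothetical, and `bond_type` is simplified to a plain string so the example is self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnClassification:
    turn_index: int
    bond_type: str      # simplified to a plain string for this sketch
    confidence: float   # 0.0 to 1.0
    rationale: str = ""


def split_by_confidence(classifications, threshold=0.7):
    """Return (trusted, needs_review) lists; the threshold is an assumption."""
    trusted, needs_review = [], []
    for item in classifications:
        if item.confidence >= threshold:
            trusted.append(item)
        else:
            needs_review.append(item)
    return trusted, needs_review


tags = [
    TurnClassification(0, "deliberation", 0.95),
    TurnClassification(1, "exploration", 0.40, "hedged wording, unclear intent"),
    TurnClassification(2, "execution", 0.70),
]
trusted, needs_review = split_by_confidence(tags)
```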

Structural scoring

Individual turn tags are only part of the picture. The sequence of bonds is what characterises a run. Two metrics compare a trajectory against a reference built from high-reward trials:

  • Transition matrix similarity: cosine similarity between the observed transition matrix and the reference matrix. Captures which bonds tend to follow which.
  • Sequence edit distance: how many insertions, deletions, and substitutions separate the observed bond sequence from the reference sequence.
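
The transition matrix comparison can be sketched in plain Python as follows. The source does not say whether the matrices hold raw counts or normalised frequencies, so row normalisation here is an assumption, as are the function names and the bond ordering.

```python
import math

# Fixed bond order used for matrix rows and columns (an arbitrary choice).
BONDS = ["execution", "verification", "deliberation", "exploration"]


def transition_matrix(sequence):
    """Build a row-normalised matrix of bond-to-bond transition frequencies."""
    index = {bond: i for i, bond in enumerate(BONDS)}
    matrix = [[0.0] * len(BONDS) for _ in BONDS]
    for src, dst in zip(sequence, sequence[1:]):
        matrix[index[src]][index[dst]] += 1.0
    for row in matrix:
        total = sum(row)
        if total:
            row[:] = [value / total for value in row]
    return matrix


def cosine_similarity(a, b):
    """Cosine similarity between two matrices, treated as flat vectors."""
    flat_a = [value for row in a for value in row]
    flat_b = [value for row in b for value in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm else 0.0


reference = transition_matrix(["deliberation", "execution", "verification"])
same = transition_matrix(["deliberation", "execution", "verification"])
reversed_run = transition_matrix(["verification", "execution", "deliberation"])

sim_same = cosine_similarity(reference, same)
sim_reversed = cosine_similarity(reference, reversed_run)
```

The reversed run shares no transitions with the reference, so its similarity drops to zero even though both runs use the same three bonds.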

A run can have plausible bond counts but the wrong order, such as verifying before executing. The structural score catches that case better than a tag histogram.
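
To make the ordering point concrete, here is a small sketch using a standard Levenshtein distance over bond sequences. The `edit_distance` name and the example sequences are illustrative; the source does not specify which edit-distance variant aec-bench uses.

```python
from collections import Counter


def edit_distance(observed, reference):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(reference) + 1))
    for i, obs in enumerate(observed, start=1):
        curr = [i]
        for j, ref in enumerate(reference, start=1):
            cost = 0 if obs == ref else 1
            curr.append(min(
                prev[j] + 1,         # delete from observed
                curr[j - 1] + 1,     # insert into observed
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


reference = ["deliberation", "execution", "verification"]
observed = ["deliberation", "verification", "execution"]  # verifies too early

same_histogram = Counter(observed) == Counter(reference)
distance = edit_distance(observed, reference)
```

Here the tag histograms are identical, yet the edit distance is non-zero because verification precedes execution.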

Confidence metadata

Before a classification is trusted for aggregation, its statistical footing gets recorded:

class ConfidenceMetadata(StrictModel):
    annotator_count: int | None = None
    inter_rater_agreement: float | None = None
    confidence_interval: tuple[float, float] | None = None
    confidence_method: str | None = None

Two standard statistical tools are useful when classifications are being calibrated against human labels:

  • Cohen's kappa measures agreement between two raters after correcting for the agreement expected by chance. As a rule of thumb, values above 0.8 indicate strong agreement; below 0.6, the classification should not be trusted.
  • The Wilson confidence interval gives an interval for a binomial success rate. It behaves better than the normal approximation on small samples and when the rate is close to 0 or 1.
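
Both statistics follow from their textbook formulas, sketched below in plain Python. The function names are illustrative rather than aec-bench API, and the handling of the degenerate kappa case (both raters using one identical label) is a choice made for this sketch.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rates, per label.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


human = ["execution", "execution", "verification", "verification"]
model = ["execution", "execution", "verification", "execution"]
kappa = cohens_kappa(human, model)
low, high = wilson_interval(successes=8, n=10)
```

Note how raw agreement of 3 out of 4 shrinks once chance agreement is subtracted, and how wide the interval stays for only ten trials.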

Judgment readiness

Automated behavioural judgment should only affect aggregate analysis when the calibration evidence is good enough:

| Gate | Threshold | Why |
| --- | --- | --- |
| Annotator count | >= 2 | Single-rater classifications cannot produce agreement stats |
| Inter-rater agreement | >= 0.8 | Low agreement means raters or the classifier are inconsistent |
| CI width | < 0.2 | Wide intervals mean the sample isn't decisive |
| Calibration sample size | >= 12 | Fewer than a dozen trials cannot resist one odd run |

If any gate fails, keep the classification as analysis evidence but do not roll it into aggregate claims.
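
The gates above could be checked with a small helper like the one sketched here. It uses a plain dataclass as a stand-in for the `StrictModel`-based `ConfidenceMetadata`; the `failed_gates` function, the separate `calibration_sample_size` argument, and the rule that a missing value fails its gate are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ConfidenceMetadata:  # stand-in for the StrictModel version
    annotator_count: Optional[int] = None
    inter_rater_agreement: Optional[float] = None
    confidence_interval: Optional[Tuple[float, float]] = None
    confidence_method: Optional[str] = None


def failed_gates(meta, calibration_sample_size):
    """Return the names of gates that fail; an empty list means ready."""
    failures = []
    if meta.annotator_count is None or meta.annotator_count < 2:
        failures.append("annotator_count")
    if meta.inter_rater_agreement is None or meta.inter_rater_agreement < 0.8:
        failures.append("inter_rater_agreement")
    if meta.confidence_interval is None:
        failures.append("ci_width")  # assumption: missing interval fails
    else:
        low, high = meta.confidence_interval
        if high - low >= 0.2:
            failures.append("ci_width")
    if calibration_sample_size < 12:
        failures.append("calibration_sample_size")
    return failures


ready = ConfidenceMetadata(2, 0.85, (0.75, 0.93), "wilson")
not_ready = ConfidenceMetadata(annotator_count=1)

ready_failures = failed_gates(ready, calibration_sample_size=12)
not_ready_failures = failed_gates(not_ready, calibration_sample_size=5)
```

Returning the list of failed gates, rather than a single boolean, keeps the reason visible when a classification is held back as analysis evidence.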
