Classification
Classification tags assistant turns with behavioural bond types and compares trace structure against high-reward references.
The bond taxonomy
Each assistant turn in a trajectory can be tagged with one of four bond types. The chemistry language is only an analogy; the stored values are lowercase strings:
| Bond | Chemical analogue | Behavioural shape | Typical signals |
|---|---|---|---|
| execution | Covalent | Normal operation | tool calls, code execution, structured output |
| verification | Hydrogen | Self-reflection | result comparison, error checking, backward references |
| deliberation | Metallic | Deep reasoning | causal chains, committed plans, step-by-step logic |
| exploration | Van der Waals | Hypothesising | hedging, alternatives, branching, open questions |
High-reward traces are used to build reference transition patterns and, where useful, an ideal bond sequence. The common shape is deliberation -> execution -> verification, but the reference can be learned from the selected traces rather than hard-coded.
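As a minimal sketch of how a reference transition pattern could be counted from selected traces: the `BondType` enum below is a stand-in that mirrors the four lowercase values in the table, and `transition_counts` is a hypothetical helper, not the project's actual API. The two traces are made-up data.

```python
from collections import Counter
from enum import Enum


class BondType(str, Enum):
    # Stand-in enum; values mirror the lowercase strings in the table.
    EXECUTION = "execution"
    VERIFICATION = "verification"
    DELIBERATION = "deliberation"
    EXPLORATION = "exploration"


def transition_counts(traces):
    """Count adjacent (from_bond, to_bond) pairs across all traces."""
    counts = Counter()
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            counts[(prev, nxt)] += 1
    return counts


# Two illustrative high-reward traces (made-up data).
D, E, V = BondType.DELIBERATION, BondType.EXECUTION, BondType.VERIFICATION
reference = transition_counts([[D, E, V], [D, E, E, V]])
```

Because the enum subclasses `str`, `BondType("execution")` resolves a stored lowercase value back to its member, which keeps serialised tags and in-memory tags interchangeable.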
Turn classification
The LLMTurnClassifier reads assistant turns in batches and emits a TurnClassification for each one:

    @dataclass(frozen=True)
    class TurnClassification:
        turn_index: int
        bond_type: BondType  # execution | verification | deliberation | exploration
        confidence: float  # 0.0–1.0
        rationale: str = ""  # why this bond was assigned

The loader prefers structured trajectory.jsonl and falls back to conversation.jsonl when needed. Each classification carries its own confidence so low-certainty tags can be inspected before they influence aggregates.
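One way to act on the per-turn confidence is to set low-certainty tags aside for review before aggregating. The sketch below uses a local stand-in for `TurnClassification` with `bond_type` as a plain string; `split_by_confidence` and its 0.7 cutoff are illustrative assumptions, not values from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnClassification:
    # Local stand-in mirroring the documented fields; bond_type is a plain str here.
    turn_index: int
    bond_type: str
    confidence: float
    rationale: str = ""


def split_by_confidence(classifications, threshold=0.7):
    """Return (trusted, needs_review); the 0.7 cutoff is an illustrative assumption."""
    trusted = [c for c in classifications if c.confidence >= threshold]
    needs_review = [c for c in classifications if c.confidence < threshold]
    return trusted, needs_review


# Made-up classifications for three assistant turns.
tags = [
    TurnClassification(0, "deliberation", 0.92),
    TurnClassification(1, "execution", 0.95),
    TurnClassification(2, "exploration", 0.41, "hedged wording, no tool call"),
]
trusted, needs_review = split_by_confidence(tags)
```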
Structural scoring
Individual turn tags are only part of the picture. The sequence of bonds is what characterises a run. Two metrics compare a trajectory against a reference built from high-reward trials:
- Transition matrix similarity: cosine similarity between the observed transition matrix and the reference matrix. Captures which bonds tend to follow which.
- Sequence edit distance: how many insertions, deletions, and substitutions separate the observed bond sequence from the reference sequence.
A run can have plausible bond counts but the wrong order, such as verifying before executing. The structural score catches that case better than a tag histogram.
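Both metrics can be sketched with the standard library alone. The code below assumes row-normalised transition matrices and plain Levenshtein distance; the function names and the normalisation choice are assumptions for illustration, and the project's scorer may differ.

```python
import math

BONDS = ["execution", "verification", "deliberation", "exploration"]


def transition_matrix(sequence):
    """Row-normalised 4x4 matrix of bond-to-bond transition frequencies."""
    index = {b: i for i, b in enumerate(BONDS)}
    counts = [[0.0] * len(BONDS) for _ in BONDS]
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[index[prev]][index[nxt]] += 1.0
    for row in counts:
        total = sum(row)
        if total:
            row[:] = [v / total for v in row]
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two matrices, treated as flattened vectors."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = math.sqrt(sum(x * x for x in flat_a)) * math.sqrt(sum(y * y for y in flat_b))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two bond sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            # deletion, insertion, substitution (free when the bonds match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


reference = ["deliberation", "execution", "verification"]
observed = ["verification", "deliberation", "execution"]  # same tags, wrong order
similarity = cosine_similarity(transition_matrix(observed), transition_matrix(reference))
distance = edit_distance(observed, reference)
```

The two sequences have identical tag histograms, yet only one of their transitions (deliberation to execution) is shared, so the similarity falls well below 1 and the edit distance is non-zero.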
Confidence metadata
Before a classification is trusted for aggregation, its statistical footing gets recorded:

    class ConfidenceMetadata(StrictModel):
        annotator_count: int | None = None
        inter_rater_agreement: float | None = None
        confidence_interval: tuple[float, float] | None = None
        confidence_method: str | None = None

Two standard statistical tools are useful when classifications are being calibrated against human labels:
- Cohen's kappa measures inter-rater agreement above chance. Values above 0.8 indicate strong agreement; below 0.6 means the classification should not be trusted.
- Wilson confidence interval gives a binomial interval for a success rate. It behaves better than the normal approximation on small samples.
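Both statistics are short enough to compute directly. This is a sketch using the textbook formulas for two raters and a binomial proportion; the function names are illustrative, the labels and counts are made-up data, and a production setup might prefer an established statistics library.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Made-up labels: a human rater and the classifier on the same four turns.
human = ["execution", "execution", "verification", "deliberation"]
model = ["execution", "verification", "verification", "deliberation"]
kappa = cohens_kappa(human, model)
low, high = wilson_interval(successes=9, trials=12)
```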
Judgment readiness
Automated behavioural judgment should only affect aggregate analysis when the calibration evidence is good enough:
| Gate | Threshold | Why |
|---|---|---|
| Annotator count | >= 2 | Single-rater classifications cannot produce agreement stats |
| Inter-rater agreement | >= 0.8 | Low agreement means raters or the classifier are inconsistent |
| CI width | < 0.2 | Wide intervals mean the sample isn't decisive |
| Calibration sample size | >= 12 | With fewer than a dozen trials, a single odd run can swing the result |
If any gate fails, keep the classification as analysis evidence but do not roll it into aggregate claims.
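The four gates can be sketched as a single check that reports which ones fail. The `Calibration` container and `failed_gates` helper are hypothetical; the `sample_size` field is an assumed addition (it is not part of `ConfidenceMetadata` as shown), and treating missing evidence as a failure is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Calibration:
    # Hypothetical container; the first three fields mirror ConfidenceMetadata.
    annotator_count: Optional[int] = None
    inter_rater_agreement: Optional[float] = None
    confidence_interval: Optional[Tuple[float, float]] = None
    sample_size: Optional[int] = None  # assumed field for the calibration sample size


def failed_gates(c):
    """Return the names of gates that fail; missing evidence counts as a failure."""
    failures = []
    if c.annotator_count is None or c.annotator_count < 2:
        failures.append("annotator_count")
    if c.inter_rater_agreement is None or c.inter_rater_agreement < 0.8:
        failures.append("inter_rater_agreement")
    ci = c.confidence_interval
    if ci is None or (ci[1] - ci[0]) >= 0.2:
        failures.append("ci_width")
    if c.sample_size is None or c.sample_size < 12:
        failures.append("calibration_sample_size")
    return failures


# Made-up calibration evidence for two classifier runs.
ready = Calibration(2, 0.85, (0.78, 0.93), 40)
not_ready = Calibration(1, 0.85, (0.50, 0.90), 8)
```

An empty list from `failed_gates` means the classification may feed aggregate claims; any non-empty result means it stays as analysis evidence only.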