Classification

Classification tags assistant turns with behavioural bond types and compares trace structure against high-reward references.

The bond taxonomy

Each assistant turn in a trajectory can be tagged with one of four bond types. The chemistry language is only an analogy; the stored values are lowercase strings:

Bond	Chemical analogue	Behavioural shape	Typical signals
`execution`	Covalent	Normal operation	tool calls, code execution, structured output
`verification`	Hydrogen	Self-reflection	result comparison, error checking, backward references
`deliberation`	Metallic	Deep reasoning	causal chains, committed plans, step-by-step logic
`exploration`	Van der Waals	Hypothesising	hedging, alternatives, branching, open questions
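As a rough illustration of how the four lowercase strings might be represented in code, the sketch below uses a string-valued enum. The enum and the `parse_bond` helper are hypothetical names introduced here; the project's actual `BondType` may be defined differently.

```python
from enum import Enum


class BondType(str, Enum):
    """Hypothetical enum; the real BondType may be defined differently."""

    EXECUTION = "execution"
    VERIFICATION = "verification"
    DELIBERATION = "deliberation"
    EXPLORATION = "exploration"


def parse_bond(raw: str) -> BondType:
    """Normalise a stored tag and reject anything outside the taxonomy."""
    # Enum lookup by value raises ValueError for unknown strings.
    return BondType(raw.strip().lower())
```

Using a closed set of values means a misspelt or unexpected tag fails loudly instead of silently becoming a fifth bond type.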

High-reward traces are used to build reference transition patterns and, where useful, an ideal bond sequence. The common shape is deliberation -> execution -> verification, but the reference can be learned from the selected traces rather than hard-coded.
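One way a reference could be learned from selected traces is sketched below. It assumes each trace is already a list of bond strings; `reference_transitions` and `modal_sequence` are hypothetical helpers, and taking the most common collapsed sequence as the "ideal" is an assumption made for illustration, not necessarily how the reference is actually built.

```python
from collections import Counter
from itertools import groupby

BONDS = ("execution", "verification", "deliberation", "exploration")


def reference_transitions(traces):
    """Row-normalised bond-to-bond transition frequencies over all traces."""
    counts = Counter()
    for seq in traces:
        # Count each adjacent (source, destination) pair in the trace.
        counts.update(zip(seq, seq[1:]))
    matrix = {}
    for src in BONDS:
        row_total = sum(counts[(src, dst)] for dst in BONDS)
        matrix[src] = {
            dst: (counts[(src, dst)] / row_total if row_total else 0.0)
            for dst in BONDS
        }
    return matrix


def modal_sequence(traces):
    """Most common sequence after collapsing consecutive repeated bonds."""
    collapsed = [tuple(bond for bond, _ in groupby(seq)) for seq in traces]
    return list(Counter(collapsed).most_common(1)[0][0])


high_reward = [
    ["deliberation", "execution", "execution", "verification"],
    ["deliberation", "execution", "verification"],
    ["exploration", "deliberation", "execution", "verification"],
]
reference_matrix = reference_transitions(high_reward)
ideal = modal_sequence(high_reward)
```

With these three sample traces, the collapsed sequence that occurs most often is deliberation, execution, verification, matching the common shape described above.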

Turn classification

The `LLMTurnClassifier` reads assistant turns in batches and emits a `TurnClassification` for each one:

from dataclasses import dataclass

@dataclass(frozen=True)
class TurnClassification:
    turn_index: int
    bond_type: BondType           # execution | verification | deliberation | exploration
    confidence: float             # 0.0–1.0
    rationale: str = ""           # why this bond was assigned

The loader prefers structured `trajectory.jsonl` and falls back to `conversation.jsonl` when needed. Each classification carries its own confidence so low-certainty tags can be inspected before they influence aggregates.
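The fallback and the confidence check can be sketched as follows. Both helpers are hypothetical, the `0.6` cut-off is an assumed value chosen only for illustration, and the classifications are assumed to expose a `confidence` attribute as `TurnClassification` does.

```python
from pathlib import Path


def pick_trace_file(trial_dir: Path) -> Path:
    """Prefer trajectory.jsonl; fall back to conversation.jsonl."""
    for name in ("trajectory.jsonl", "conversation.jsonl"):
        candidate = trial_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no trace file found in {trial_dir}")


def split_by_confidence(classifications, threshold=0.6):
    """Separate tags considered safe to aggregate from tags needing review.

    The threshold is an assumption, not a documented default.
    """
    trusted = [c for c in classifications if c.confidence >= threshold]
    review = [c for c in classifications if c.confidence < threshold]
    return trusted, review
```

Keeping the low-confidence tags in a separate list, rather than dropping them, leaves them available for manual inspection.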

Structural scoring

Individual turn tags are only part of the picture. The sequence of bonds is what characterises a run. Two metrics compare a trajectory against a reference built from high-reward trials:

Transition matrix similarity: cosine similarity between the observed transition matrix and the reference matrix. Captures which bonds tend to follow which.
Sequence edit distance: the minimum number of insertions, deletions, and substitutions needed to turn the observed bond sequence into the reference sequence.
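A minimal sketch of the transition matrix comparison is shown below. It assumes raw transition counts are flattened row by row into a vector before taking the cosine; whether the real scorer uses counts or normalised probabilities is not stated here, and all function names are hypothetical.

```python
import math
from collections import Counter

BONDS = ("execution", "verification", "deliberation", "exploration")


def transition_vector(sequence):
    """Flatten the 4x4 transition count matrix into a 16-element vector."""
    counts = Counter(zip(sequence, sequence[1:]))
    return [counts[(src, dst)] for src in BONDS for dst in BONDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transition_similarity(observed, reference):
    """Compare two bond sequences by the transitions they contain."""
    return cosine_similarity(
        transition_vector(observed), transition_vector(reference)
    )
```

Under these assumptions, a sequence that shares no transitions with the reference scores 0.0, and a sequence with the same transition pattern scores approximately 1.0.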

A run can have plausible bond counts but the wrong order, such as verifying before executing. A tag histogram discards order and cannot see that; the structural score can.
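The ordering point can be made concrete with a small edit distance sketch. The function below is a standard Levenshtein distance over bond sequences; the name `edit_distance` and the sample sequences are illustrative assumptions.

```python
from collections import Counter


def edit_distance(observed, reference):
    """Levenshtein distance between two bond sequences."""
    prev = list(range(len(reference) + 1))
    for i, obs in enumerate(observed, start=1):
        curr = [i]
        for j, ref in enumerate(reference, start=1):
            cost = 0 if obs == ref else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a bond from observed
                    curr[j - 1] + 1,     # insert a bond into observed
                    prev[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


reference = ["deliberation", "execution", "verification"]
swapped = ["deliberation", "verification", "execution"]

same_histogram = Counter(swapped) == Counter(reference)
distance = edit_distance(swapped, reference)
```

Here the two sequences have identical bond counts, so a histogram cannot tell them apart, while the edit distance is non-zero because verification comes before execution.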

Confidence metadata

Before a classification is trusted for aggregation, its statistical footing gets recorded:

class ConfidenceMetadata(StrictModel):
    annotator_count: int | None = None
    inter_rater_agreement: float | None = None
    confidence_interval: tuple[float, float] | None = None
    confidence_method: str | None = None

Two standard statistical tools are useful when classifications are being calibrated against human labels:

Cohen's kappa measures agreement between two raters, corrected for the agreement expected by chance. Values of 0.8 or above indicate strong agreement; values below 0.6 mean the classification should not be trusted.
The Wilson confidence interval gives a binomial interval for a success rate. It behaves better than the normal approximation on small samples and for rates near 0 or 1.
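Both statistics are short enough to sketch with the standard library. The functions below follow the textbook formulas; the names, the default `z=1.96` (roughly a 95% interval), and the handling of the degenerate single-label case are assumptions, not the project's own implementation.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of turns")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if each rater labelled independently at their own rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return centre - half, centre + half
```

The resulting values map naturally onto the `inter_rater_agreement` and `confidence_interval` fields of `ConfidenceMetadata`, with `confidence_method` recording which interval was used.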

Judgment readiness

Automated behavioural judgment should only affect aggregate analysis when the calibration evidence is good enough:

Gate	Threshold	Why
Annotator count	`>= 2`	Single-rater classifications cannot produce agreement stats
Inter-rater agreement	`>= 0.8`	Low agreement means raters or the classifier are inconsistent
CI width	`< 0.2`	Wide intervals mean the sample isn't decisive
Calibration sample size	`>= 12`	With fewer than a dozen trials, a single odd run can dominate the result

If any gate fails, keep the classification as analysis evidence but do not roll it into aggregate claims.
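The four gates can be checked together, as in the sketch below. The function name, the gate labels, and the choice to treat a missing value as a failed gate are assumptions made for illustration; only the thresholds come from the table.

```python
def failed_gates(annotator_count, inter_rater_agreement,
                 confidence_interval, sample_size):
    """Return the names of readiness gates that do not pass.

    A value of None is treated as a failure (an assumption of this sketch).
    """
    failures = []
    if annotator_count is None or annotator_count < 2:
        failures.append("annotator_count")
    if inter_rater_agreement is None or inter_rater_agreement < 0.8:
        failures.append("inter_rater_agreement")
    if confidence_interval is None:
        failures.append("ci_width")
    else:
        low, high = confidence_interval
        if high - low >= 0.2:
            failures.append("ci_width")
    if sample_size is None or sample_size < 12:
        failures.append("calibration_sample_size")
    return failures


def ready_for_aggregation(**evidence):
    """True only when every gate passes."""
    return not failed_gates(**evidence)
```

Returning the list of failed gates, rather than a bare boolean, makes it easier to record why a classification was kept as analysis evidence only.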
