aec-bench

Evolution

Evolution is a closed-loop system that mutates an agent workspace, mainly its system prompt and skill library, based on trial failures.

Evolution improves the workspace an agent reads. The base model does not change; prompts and skills change under version control, then each candidate workspace is evaluated against tasks.

Use single-workspace evolution when you want a controlled hill-climb. Use Swarm when you want multiple agents exploring a quality-diversity frontier.

The cycle

Each evolution cycle runs six phases:

| Phase | What happens |
| --- | --- |
| Classify | Read recent trials and tag turns, failures, and behavioural patterns |
| Analyse | Identify repeated failure modes and the prompt or skill surface implicated |
| Evolve | Propose structured prompt or skill mutations |
| Apply | Write the candidate workspace version |
| Gate | Run the candidate on a batch of tasks and compare with the incumbent |
| Version | Promote the candidate when it clears the threshold; otherwise retain the previous best |

The loop stops when max_cycles is reached or when the configured stagnation_window elapses without an improvement.
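
The gate and the two stopping conditions can be sketched in Python. This is a minimal illustration, not aec-bench's implementation: `run_loop`, `propose`, and `evaluate` are hypothetical names, and the sketch assumes that stagnation_window counts consecutive cycles without a promotion and that a candidate is promoted only when it beats the incumbent by at least improvement_threshold.

```python
from typing import Callable


def run_loop(
    incumbent: str,
    propose: Callable[[str], str],     # hypothetical: Classify, Analyse, Evolve, Apply
    evaluate: Callable[[str], float],  # hypothetical: Gate score on a task batch
    max_cycles: int = 10,
    improvement_threshold: float = 0.01,
    stagnation_window: int = 5,
) -> tuple[str, float, int]:
    """Return (best workspace, best score, number of cycles run)."""
    best_score = evaluate(incumbent)
    stagnant = 0  # consecutive cycles without a promotion (assumed meaning)
    cycles = 0
    while cycles < max_cycles and stagnant < stagnation_window:
        cycles += 1
        candidate = propose(incumbent)
        score = evaluate(candidate)
        if score - best_score >= improvement_threshold:
            # Version: promote the candidate and reset the stagnation counter.
            incumbent, best_score, stagnant = candidate, score, 0
        else:
            # Version: retain the previous best.
            stagnant += 1
    return incumbent, best_score, cycles
```

Because a rejected candidate never replaces the incumbent, every proposal in this sketch is derived from the best workspace found so far.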

Setup

Create a workspace:

uv run aec-bench evolve init workspaces/voltage-drop-evo \
  --name voltage-drop-evo \
  --harness rlm

The workspace contains a manifest, system prompt, and skill directory. Treat it like source code: review diffs, keep changes scoped, and avoid contaminating benchmark task definitions with answer leakage.

Configuration

Evolution is driven by a YAML file:

workspace_path: workspaces/voltage-drop-evo

models:
  classifier: env:AWS_HAIKU_MODEL_ID
  evolver: env:AWS_SONNET_MODEL_ID

solver:
  name: solver-v1
  harness: rlm
  model: env:AWS_SONNET_MODEL_ID

tasks:
  domains: [electrical]
  include_patterns: ["electrical/*"]
  difficulties: ["easy", "medium"]

backend: local
batch_size: 5
max_cycles: 10
improvement_threshold: 0.01
stagnation_window: 5
timeout: 1800

Run the loop, then inspect the workspace's version history:

uv run aec-bench evolve run --config evolution.yaml
uv run aec-bench evolve history workspaces/voltage-drop-evo

Rollback is non-destructive: it restores an older tag as a new workspace version rather than deleting history.

uv run aec-bench evolve rollback workspaces/voltage-drop-evo evo-20260404-1220-2
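
Non-destructive rollback can be pictured as an append-only history in which restoring an old version creates a new entry. The sketch below is a toy model under that assumption; `Version`, `History`, and the tag names are hypothetical and say nothing about how aec-bench stores versions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    tag: str
    prompt: str


@dataclass
class History:
    versions: list[Version] = field(default_factory=list)

    def commit(self, tag: str, prompt: str) -> Version:
        version = Version(tag, prompt)
        self.versions.append(version)
        return version

    def rollback(self, target_tag: str, new_tag: str) -> Version:
        """Re-commit the content of target_tag under new_tag; nothing is deleted."""
        for version in self.versions:
            if version.tag == target_tag:
                return self.commit(new_tag, version.prompt)
        raise KeyError(f"unknown tag: {target_tag!r}")
```

After a rollback the history is one entry longer, so the versions that were rolled past remain available for comparison.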

Mutations

The evolver emits structured workspace edits, not free-form rewriting:

from dataclasses import dataclass


@dataclass(frozen=True)
class MutationAction:
    action_type: str                 # write_skill | modify_skill | delete_skill | modify_prompt
    skill_name: str | None = None
    skill_description: str | None = None
    skill_discipline: str | None = None
    skill_body: str | None = None
    prompt_content: str | None = None

Each mutation is validated before application. Applied versions can be evaluated, compared, archived, or rolled back.

Selection strategy

The default strategy is hill-climbing: mutate from the best-scoring workspace seen so far. That is deliberately conservative and easy to interpret. More exploratory strategies belong in the swarm path, where diversity is a first-class objective.
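
In code, the hill-climbing choice of parent reduces to taking the maximum over the scores seen so far. The sketch below is illustrative only; `select_parent` and the version tags are hypothetical.

```python
def select_parent(scores: dict[str, float]) -> str:
    """Return the tag of the best-scoring workspace seen so far."""
    if not scores:
        raise ValueError("no evaluated workspaces to select from")
    # On ties, max() keeps the first tag in insertion order (the earliest seen).
    return max(scores, key=lambda tag: scores[tag])


history = {"v0": 0.50, "v1": 0.62, "v2": 0.58}
parent = select_parent(history)  # the next mutation branches from v1, not from v2
```

The most recent version is not necessarily the parent: here v2 scored lower than v1, so the next candidate is derived from v1.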

Backend independence

Evolution delegates solving through the same harness boundary as normal experiments. A candidate workspace can be evaluated locally during development and on a larger backend when the loop needs scale.

The invariant is that the task verifier remains the reward authority. Evolution may change how the agent reasons, but it must not change what counts as correct for the task.
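
One possible guard for this invariant, assuming mutations are written as files under the workspace directory, is to reject any write whose resolved path falls outside that directory. The helper `is_inside_workspace` and the paths shown are hypothetical, and the sketch requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def is_inside_workspace(workspace: Path, target: Path) -> bool:
    """True when target resolves to a location under workspace."""
    # resolve() collapses '..' segments, so traversal out of the workspace is caught.
    return target.resolve().is_relative_to(workspace.resolve())


workspace = Path("workspaces/voltage-drop-evo")
skill_ok = is_inside_workspace(workspace, workspace / "skills" / "cable-sizing.md")
escape = is_inside_workspace(workspace, workspace / ".." / ".." / "tasks" / "verify.py")
```

Under this check a mutation can edit prompts and skills but cannot reach task definitions or verifier code that live elsewhere in the repository.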

When to use it

Evolution is useful when:

  • Failures repeat across tasks in a narrow domain.
  • The agent workspace has editable prompts and skills.
  • You have enough tasks and budget for a signal beyond one lucky run.
  • You can inspect promoted changes before trusting the next generation.

For one-off benchmark runs, use a fixed harness. For broad exploration where multiple strategies should survive, use Swarm.
