Evolution
Evolution is a closed-loop system that mutates an agent workspace, mainly its system prompt and skill library, based on trial failures.
Evolution improves the workspace an agent reads. The base model does not change; prompts and skills change under version control, then each candidate workspace is evaluated against tasks.
Use single-workspace evolution when you want a controlled hill-climb. Use Swarm when you want multiple agents exploring a quality-diversity frontier.
The cycle
Each evolution cycle runs six phases:
| Phase | What happens |
|---|---|
| Classify | Read recent trials and tag turns, failures, and behavioural patterns |
| Analyse | Identify repeated failure modes and the prompt or skill surface implicated |
| Evolve | Propose structured prompt or skill mutations |
| Apply | Write the candidate workspace version |
| Gate | Run the candidate on a batch of tasks and compare with the incumbent |
| Version | Promote the candidate when it clears the improvement threshold; otherwise retain the previous best |
The loop stops when max_cycles is reached or when the configured stagnation window passes without an improvement.
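The stopping rule can be sketched as below. This is a minimal illustration rather than the tool's actual implementation: `should_stop` is a hypothetical helper, and it assumes that a cycle counts as an improvement only when its gate score beats the best earlier score by at least `improvement_threshold`.

```python
from typing import Sequence


def should_stop(
    scores: Sequence[float],
    max_cycles: int,
    stagnation_window: int,
    improvement_threshold: float,
) -> bool:
    """Return True when the loop should end (hypothetical helper).

    scores[i] is the gate score recorded for cycle i. Assumption: the
    recent window has stagnated if its best score does not beat the best
    earlier score by at least improvement_threshold.
    """
    if len(scores) >= max_cycles:
        return True
    if len(scores) <= stagnation_window:
        return False  # not enough history to call it stagnation
    best_before = max(scores[:-stagnation_window])
    best_recent = max(scores[-stagnation_window:])
    return best_recent - best_before < improvement_threshold
```

With the values from the sample configuration (`max_cycles: 10`, `stagnation_window: 5`, `improvement_threshold: 0.01`), six flat scores in a row would end the loop early under this reading, while a clear gain inside the last five cycles would keep it running.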
Setup
Create a workspace:
    uv run aec-bench evolve init workspaces/voltage-drop-evo \
      --name voltage-drop-evo \
      --harness rlm

The workspace contains a manifest, system prompt, and skill directory. Treat it like source code: review diffs, keep changes scoped, and avoid contaminating benchmark task definitions with answer leakage.
Configuration
Evolution is driven by a YAML file:
    workspace_path: workspaces/voltage-drop-evo
    models:
      classifier: env:AWS_HAIKU_MODEL_ID
      evolver: env:AWS_SONNET_MODEL_ID
    solver:
      name: solver-v1
      harness: rlm
      model: env:AWS_SONNET_MODEL_ID
    tasks:
      domains: [electrical]
      include_patterns: ["electrical/*"]
      difficulties: ["easy", "medium"]
    backend: local
    batch_size: 5
    max_cycles: 10
    improvement_threshold: 0.01
    stagnation_window: 5
    timeout: 1800

Run it with:
    uv run aec-bench evolve run --config evolution.yaml
    uv run aec-bench evolve history workspaces/voltage-drop-evo

Rollback is non-destructive: it restores an older tag as a new workspace version rather than deleting history.

    uv run aec-bench evolve rollback workspaces/voltage-drop-evo evo-20260404-1220-2

Mutations
The evolver emits structured workspace edits, not free-form rewriting:
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class MutationAction:
        action_type: str  # write_skill | modify_skill | delete_skill | modify_prompt
        skill_name: str | None = None
        skill_description: str | None = None
        skill_discipline: str | None = None
        skill_body: str | None = None
        prompt_content: str | None = None

Each mutation is validated before application. Applied versions can be evaluated, compared, archived, or rolled back.
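As an illustration of what pre-application validation might look like, the sketch below checks that the action type is known and that the fields it needs are present. The `REQUIRED_FIELDS` table and the `validate` function are assumptions made for this example; the actual validation rules are not documented here. The dataclass is repeated (with `Optional` for wider Python compatibility) so the block runs on its own.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed rule set for illustration only: which fields each action needs.
REQUIRED_FIELDS = {
    "write_skill": ("skill_name", "skill_description", "skill_body"),
    "modify_skill": ("skill_name", "skill_body"),
    "delete_skill": ("skill_name",),
    "modify_prompt": ("prompt_content",),
}


@dataclass(frozen=True)
class MutationAction:
    action_type: str
    skill_name: Optional[str] = None
    skill_description: Optional[str] = None
    skill_discipline: Optional[str] = None
    skill_body: Optional[str] = None
    prompt_content: Optional[str] = None


def validate(action: MutationAction) -> list[str]:
    """Return a list of problems; an empty list means the action passes."""
    required = REQUIRED_FIELDS.get(action.action_type)
    if required is None:
        return [f"unknown action_type: {action.action_type!r}"]
    # Treat None and empty strings alike as missing.
    return [
        f"{action.action_type} requires {name}"
        for name in required
        if not getattr(action, name)
    ]
```

Returning a list of problems instead of raising on the first one lets a caller log every defect in a rejected mutation at once, which can be useful when the evolver's output is reviewed later.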
Selection strategy
The default strategy is hill-climbing: mutate from the best-scoring workspace seen so far. That is deliberately conservative and easy to interpret. More exploratory strategies belong in the swarm path, where diversity is a first-class objective.
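A hill-climbing selection step of this kind could look roughly like the following. The `Version` record, `select_parent`, and `promote` are hypothetical names used only for this sketch, and the promotion rule assumes the candidate must beat the incumbent by at least `improvement_threshold`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    tag: str      # hypothetical version label
    score: float  # assumed: mean verifier score on the gate batch


def select_parent(versions: list[Version]) -> Version:
    """Hill-climbing: always mutate from the best-scoring version so far."""
    return max(versions, key=lambda v: v.score)


def promote(
    incumbent: Version, candidate: Version, improvement_threshold: float
) -> Version:
    """Keep the incumbent unless the candidate beats it by the threshold."""
    if candidate.score - incumbent.score >= improvement_threshold:
        return candidate
    return incumbent
```

Because the parent is always the single best version, a regression in one cycle never becomes the starting point for the next, which is what makes the strategy easy to interpret.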
Backend independence
Evolution delegates solving through the same harness boundary as normal experiments. A candidate workspace can be evaluated locally during development and on a larger backend when the loop needs scale.
The invariant is that the task verifier remains the reward authority. Evolution may change how the agent reasons, but it must not change what counts as correct for the task.
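One way to guard this invariant in practice is to fingerprint the task definitions before a cycle and confirm they are unchanged after mutations are applied. The sketch below is an assumption about how such a check could be written, not a description of the tool's behaviour; `fingerprint` and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(task_dir: Path) -> str:
    """Hash every file under task_dir by relative path and content."""
    digest = hashlib.sha256()
    for path in sorted(p for p in task_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(task_dir).as_posix().encode())
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

A caller would record `fingerprint(task_dir)` before the Apply phase and compare it afterwards; any difference means a mutation touched task or verifier files and the candidate should be rejected.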
When to use it
Evolution is useful when:
- Failures repeat across tasks in a narrow domain.
- The agent workspace has editable prompts and skills.
- You have enough tasks and budget to distinguish a real improvement from one lucky run.
- You can inspect promoted changes before trusting the next generation.
For one-off benchmark runs, use a fixed harness. For broad exploration where multiple strategies should survive, use Swarm.