Evolution

Evolution is a closed-loop system that mutates an agent workspace, mainly its system prompt and skill library, based on trial failures.

Evolution improves the workspace an agent reads. The base model does not change; prompts and skills change under version control, then each candidate workspace is evaluated against tasks.

Use single-workspace evolution when you want a controlled hill-climb. Use Swarm when you want multiple agents exploring a quality-diversity frontier.

The cycle

Each evolution cycle runs six phases:

Phase	What happens
Classify	Read recent trials and tag turns, failures, and behavioural patterns
Analyse	Identify repeated failure modes and the prompt or skill surface implicated
Evolve	Propose structured prompt or skill mutations
Apply	Write the candidate workspace version
Gate	Run the candidate on a batch of tasks and compare with the incumbent
Version	Promote the candidate when it clears the threshold; otherwise retain the previous best

The loop stops when max_cycles is reached, or when no candidate improves on the incumbent within the configured stagnation_window.
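
The stopping rule can be sketched as a small hill-climb. This is an illustration under stated assumptions, not the aec-bench implementation: run_cycle is a hypothetical callable standing in for the Classify through Gate phases and returning the candidate's score, and reading the stagnation window as a count of consecutive cycles without a promotion is an assumption.

```python
from typing import Callable


def evolve(
    run_cycle: Callable[[int], float],  # hypothetical: runs one cycle, returns the candidate's gate score
    baseline: float,
    max_cycles: int = 10,
    improvement_threshold: float = 0.01,
    stagnation_window: int = 5,
) -> tuple[float, list[int]]:
    """Promote a candidate only when it beats the incumbent by the threshold."""
    best = baseline
    promoted: list[int] = []
    stagnant = 0  # assumed meaning: consecutive cycles without a promotion
    for cycle in range(1, max_cycles + 1):
        candidate = run_cycle(cycle)
        if candidate - best >= improvement_threshold:
            best = candidate  # Version phase: candidate becomes the incumbent
            promoted.append(cycle)
            stagnant = 0
        else:
            stagnant += 1  # previous best is retained
            if stagnant >= stagnation_window:
                break
    return best, promoted


# Toy scores: cycles 1 and 3 improve, later cycles do not.
scores = {1: 0.50, 2: 0.45, 3: 0.60}
best, promoted = evolve(lambda c: scores.get(c, 0.55), baseline=0.40, stagnation_window=3)
print(best, promoted)  # 0.6 [1, 3]
```

In this toy run the loop exits after cycle 6, once three cycles in a row fail to beat the incumbent, well before max_cycles.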

Setup

Create a workspace:

uv run aec-bench evolve init workspaces/voltage-drop-evo \
  --name voltage-drop-evo \
  --harness rlm

The workspace contains a manifest, system prompt, and skill directory. Treat it like source code: review diffs, keep changes scoped, and avoid contaminating benchmark task definitions with answer leakage.

Configuration

Evolution is driven by a YAML file:

workspace_path: workspaces/voltage-drop-evo

models:
  classifier: env:AWS_HAIKU_MODEL_ID
  evolver: env:AWS_SONNET_MODEL_ID

solver:
  name: solver-v1
  harness: rlm
  model: env:AWS_SONNET_MODEL_ID

tasks:
  domains: [electrical]
  include_patterns: ["electrical/*"]
  difficulties: ["easy", "medium"]

backend: local
batch_size: 5
max_cycles: 10
improvement_threshold: 0.01
stagnation_window: 5
timeout: 1800

Run the loop, then inspect the version history:

uv run aec-bench evolve run --config evolution.yaml
uv run aec-bench evolve history workspaces/voltage-drop-evo

Rollback is non-destructive: it restores an older tag as a new workspace version rather than deleting history.

uv run aec-bench evolve rollback workspaces/voltage-drop-evo evo-20260404-1220-2
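
Non-destructive rollback can be pictured as an append-only history. The toy model below is a sketch only: WorkspaceHistory, its methods, and the tags v1 to v3 are hypothetical and say nothing about how aec-bench stores versions on disk.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceHistory:
    """Append-only list of (tag, prompt) versions; a toy stand-in for a versioned workspace."""
    versions: list[tuple[str, str]] = field(default_factory=list)

    def commit(self, tag: str, prompt: str) -> None:
        self.versions.append((tag, prompt))

    def rollback(self, tag: str, new_tag: str) -> None:
        """Restore an older tag by appending its content as a new version."""
        for old_tag, prompt in self.versions:
            if old_tag == tag:
                self.versions.append((new_tag, prompt))
                return
        raise KeyError(f"unknown tag {tag!r}")

    @property
    def current(self) -> str:
        return self.versions[-1][1]


history = WorkspaceHistory()
history.commit("v1", "Check units before answering.")
history.commit("v2", "Answer quickly.")
history.rollback("v1", new_tag="v3")
print(history.current)        # Check units before answering.
print(len(history.versions))  # 3
```

Because v2 is still in the list after the rollback, the rejected change stays available for later comparison.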

Mutations

The evolver emits structured workspace edits rather than free-form rewrites:

from dataclasses import dataclass


@dataclass(frozen=True)
class MutationAction:
    action_type: str                 # write_skill | modify_skill | delete_skill | modify_prompt
    skill_name: str | None = None
    skill_description: str | None = None
    skill_discipline: str | None = None
    skill_body: str | None = None
    prompt_content: str | None = None

Each mutation is validated before application. Applied versions can be evaluated, compared, archived, or rolled back.

When to use it

Evolution is a good fit when:

Failures repeat across tasks in a narrow domain.
The agent workspace has editable prompts and skills.
You have enough tasks and budget for a signal beyond one lucky run.
You can inspect promoted changes before trusting the next generation.
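
The point about needing more than one lucky run can be made concrete with a rough noise estimate. The sketch assumes the gate score is a mean pass rate over independent pass/fail tasks, which the page does not state; pass_rate_standard_error is a hypothetical helper.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a mean pass rate over n independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


for n in (5, 20, 80):
    print(n, round(pass_rate_standard_error(0.6, n), 3))
```

Under that assumption, an agent passing about 60% of tasks has a standard error of roughly 0.22 on 5 tasks and roughly 0.05 on 80, so small score differences on small batches deserve caution.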

For one-off benchmark runs, use a fixed harness. For broad exploration where multiple strategies should survive, use Swarm.
