Deployment
Harbor is the orchestration path aec-bench uses to dispatch experiment trials.
For normal experiment runs, aec-bench builds a Harbor job config from the experiment manifest, invokes the Harbor CLI, detects the produced job directory, and imports the completed results into the ledger.
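The "detects the produced job directory" step could be approximated by snapshotting the jobs root before and after the Harbor invocation. The sketch below is an assumption about how such detection might work, not aec-bench's actual implementation; `snapshot_job_dirs`, `detect_new_job_dir`, and the idea of a single jobs root are hypothetical.

```python
from pathlib import Path


def snapshot_job_dirs(jobs_root: Path) -> set[str]:
    """Return the names of the directories currently under jobs_root."""
    if not jobs_root.is_dir():
        return set()
    return {p.name for p in jobs_root.iterdir() if p.is_dir()}


def detect_new_job_dir(jobs_root: Path, before: set[str]) -> Path:
    """Return the single directory that appeared since the snapshot.

    Raises RuntimeError when zero or several new directories are found,
    because guessing could import the wrong job.
    """
    new = sorted(snapshot_job_dirs(jobs_root) - before)
    if len(new) != 1:
        raise RuntimeError(f"expected exactly one new job directory, found {new}")
    return jobs_root / new[0]
```

In use, the snapshot would be taken immediately before the Harbor CLI is invoked and compared once it exits. Failing loudly on an ambiguous result is a deliberate choice: importing the wrong directory is worse than stopping.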
Where it sits
Harbor is the orchestration layer around compute backends, not a harness type and not a replacement for the task verifier:
| Layer | Responsibility | Examples |
|---|---|---|
| Compute backend | Executes a single trial | Modal, Docker, e2b, Daytona |
| Orchestration | Schedules, queues, retries jobs across many trials | Harbor |
| Ingestion | Reads back results and writes trial records to the ledger | aec-bench import |
The job config
aec-bench builds a Harbor job config from your experiment manifest and submits it by invoking the Harbor CLI:

    uv run harbor run -c job.yaml

The per-trial config that Harbor receives looks like this:

    HarborTrialConfig(
        task=HarborTaskConfig(path="tasks/electrical/voltage-drop"),
        agent=HarborAgentConfig(
            name="solver-v1",
            model_name="au.anthropic.claude-sonnet-4-6",
        ),
        environment=HarborEnvironmentConfig(type="docker"),
        job_id="exp-20260412-001",
    )

Harbor resolves paths, stages tasks onto its compute, and runs the entrypoint agent plus verifier. The per-trial contract is unchanged: same task, same harness selection, same verifier artifacts.
Configuration
The experiment manifest selects the compute environment that Harbor should use:

    compute:
      backend: modal
    resource_limits:
      n_concurrent_trials: 4

Harbor-specific service configuration lives with the Harbor CLI/runtime, not inside the public experiment manifest. Keep service credentials in environment variables or the Harbor auth store rather than committed YAML.
Bringing results back
Harbor writes each completed trial into a job directory with result.json plus agent outputs and trajectory artifacts. aec-bench ingests these with the standard import command:

    uv run aec-bench import jobs/exp-20260412-001

The importer walks the directory, validates each result.json against the Harbor result contract, reads verifier artifacts, and converts the result into an aec-bench TrialRecord.
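The walk-and-validate step can be illustrated with a minimal sketch. It is not the real importer: `load_results` is a hypothetical helper, and `REQUIRED_KEYS` is an assumed stand-in for the Harbor result contract, whose actual fields are not listed here.

```python
import json
from pathlib import Path

# Hypothetical required keys; the real Harbor result contract may differ.
REQUIRED_KEYS = {"trial_id", "reward"}


def load_results(job_dir: Path) -> tuple[list[dict], list[str]]:
    """Walk job_dir and split result.json files into valid results and errors."""
    valid: list[dict] = []
    errors: list[str] = []
    for path in sorted(job_dir.rglob("result.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append(f"{path}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(data, dict):
            errors.append(f"{path}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(data)
        if missing:
            errors.append(f"{path}: missing keys {sorted(missing)}")
            continue
        valid.append(data)
    return valid, errors
```

Collecting errors instead of raising on the first bad file lets one malformed trial be reported without blocking the rest of the job.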
From the ledger's perspective, comparable trial records have the same shape even if they arrived through different execution paths.
When to use Harbor
Direct local execution and small backend runs are fine for single-user work. Harbor starts paying off when:
- Multiple teams submit to shared compute: queuing and priority matter
- Runs are long-lived: detach, go home, come back to finished results
- Jobs need retries or preemption handling centrally rather than per-experiment
- Compute sits behind an organisational boundary: VPN, private fleet, tiered access
If none of those apply, staying on a direct backend is simpler and avoids the extra moving part.
Operational notes
- Determinism still applies. Harbor does not change trial semantics. The same task revision and harness configuration should produce the same reward whether run locally or via Harbor.
- Provenance travels with results. TrialRecord provenance captures task, agent, environment, input, output, timing, and evaluation evidence.
- Import is idempotent. Re-importing a finished job directory should not duplicate records; trial_id is unique per trial.
- Failures surface at import. A failed Harbor trial can be represented as a partial record with diagnostic errors, while complete records remain the only safe basis for leaderboard aggregation.
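The idempotency and failure-handling notes above can be sketched with a toy in-memory ledger. `Record` and `Ledger` are hypothetical, simplified stand-ins rather than aec-bench classes; the only properties assumed from the text are that trial_id is unique per trial and that partial records carry diagnostic errors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical, simplified stand-in for an aec-bench TrialRecord."""
    trial_id: str
    reward: Optional[float] = None  # None when the trial did not finish
    errors: list[str] = field(default_factory=list)

    @property
    def complete(self) -> bool:
        return self.reward is not None and not self.errors


class Ledger:
    """In-memory ledger keyed by trial_id, so re-imports cannot duplicate."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def import_records(self, records: list[Record]) -> int:
        """Insert records not seen before; return how many were added."""
        added = 0
        for rec in records:
            if rec.trial_id not in self._records:
                self._records[rec.trial_id] = rec
                added += 1
        return added

    def mean_reward(self) -> Optional[float]:
        """Average reward over complete records only; None if there are none."""
        rewards = [r.reward for r in self._records.values() if r.complete]
        return sum(rewards) / len(rewards) if rewards else None
```

Importing the same batch twice adds nothing the second time, and a partial record stays visible in the ledger for diagnosis without affecting the aggregate.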