aec-bench

Deployment

Harbor is the orchestration path aec-bench uses to dispatch experiment trials to compute backends.

For normal experiment runs, aec-bench builds a Harbor job config from the experiment manifest, invokes the Harbor CLI, detects the job directory that Harbor produces, and imports the completed results into the ledger.

Where it sits

Harbor is the orchestration layer around compute backends, not a harness type and not a replacement for the task verifier:

Layer            Responsibility                                             Examples
Compute backend  Executes a single trial                                    Modal, Docker, e2b, Daytona
Orchestration    Schedules, queues, and retries jobs across many trials     Harbor
Ingestion        Reads back results and writes trial records to the ledger  aec-bench import

The job config

aec-bench builds a Harbor job config from your experiment manifest and submits it by invoking the Harbor CLI:

uv run harbor run -c job.yaml
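
The submit-and-detect step can be sketched as below. This is a minimal illustration, not aec-bench's actual implementation: the helper names (`harbor_command`, `snapshot`, `detect_new_job_dir`, `submit`) are hypothetical, and it assumes that each run creates one new subdirectory under a jobs root such as `jobs/`. Only the CLI invocation itself is taken from the command above.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Set


def harbor_command(config_path: Path) -> List[str]:
    """Build the argv for the CLI call shown above."""
    return ["uv", "run", "harbor", "run", "-c", str(config_path)]


def snapshot(jobs_root: Path) -> Set[str]:
    """Record the names of the job directories that already exist."""
    return {p.name for p in jobs_root.iterdir() if p.is_dir()}


def detect_new_job_dir(jobs_root: Path, before: Set[str]) -> Optional[Path]:
    """Return the directory that appeared after `before` was captured.

    Assumption: a run creates its job directory directly under jobs_root.
    """
    created = [
        p for p in jobs_root.iterdir() if p.is_dir() and p.name not in before
    ]
    if not created:
        return None
    # If several directories appeared, prefer the most recently modified one.
    return max(created, key=lambda p: p.stat().st_mtime)


def submit(config_path: Path, jobs_root: Path) -> Optional[Path]:
    """Run Harbor on a job config and report the job directory it produced."""
    before = snapshot(jobs_root)
    # check=True raises CalledProcessError if the CLI exits non-zero.
    subprocess.run(harbor_command(config_path), check=True)
    return detect_new_job_dir(jobs_root, before)
```

Taking the snapshot before the run keeps detection independent of how the job directory is named, at the cost of being ambiguous if two runs write to the same jobs root at once.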

The per-trial config that Harbor receives looks like this:

HarborTrialConfig(
    task=HarborTaskConfig(path="tasks/electrical/voltage-drop"),
    agent=HarborAgentConfig(
        name="solver-v1",
        model_name="au.anthropic.claude-sonnet-4-6",
    ),
    environment=HarborEnvironmentConfig(type="docker"),
    job_id="exp-20260412-001",
)

Harbor resolves paths, stages tasks onto its compute, and runs the entrypoint agent and the verifier. The per-trial contract is unchanged: same task, same harness selection, same verifier artifacts.

Configuration

The experiment manifest selects the compute environment that Harbor should use:

compute:
  backend: modal
  resource_limits:
    n_concurrent_trials: 4

Harbor-specific service configuration lives with the Harbor CLI/runtime, not inside the public experiment manifest. Keep service credentials in environment variables or the Harbor auth store rather than committed YAML.

Bringing results back

Harbor writes each completed trial into a job directory containing result.json, the agent outputs, and the trajectory artifacts. aec-bench ingests these with the standard import command:

uv run aec-bench import jobs/exp-20260412-001

The importer walks the directory, validates each result.json against the Harbor result contract, reads verifier artifacts, and converts the result into an aec-bench TrialRecord.
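
The walk-validate-convert flow described above can be sketched roughly as follows. Everything here is illustrative: the `TrialRecord` dataclass is a simplified stand-in for the real aec-bench type, the required keys (`trial_id`, `reward`) are assumptions rather than the actual Harbor result contract, and the sketch assumes one result.json per trial subdirectory. Reading verifier artifacts is left out.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

# Hypothetical required keys; the real Harbor result contract may differ.
REQUIRED_KEYS = ("trial_id", "reward")


@dataclass
class TrialRecord:
    """Simplified stand-in for the aec-bench TrialRecord."""
    trial_id: str
    reward: Optional[float]
    complete: bool
    errors: List[str] = field(default_factory=list)


def load_trial(result_path: Path) -> TrialRecord:
    """Validate one result.json and convert it into a record.

    Invalid input yields a partial record with diagnostics, not an exception.
    """
    fallback_id = result_path.parent.name
    try:
        data = json.loads(result_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return TrialRecord(fallback_id, None, False, [f"invalid JSON: {exc}"])
    if not isinstance(data, dict):
        return TrialRecord(fallback_id, None, False, ["result.json is not an object"])

    trial_id = str(data.get("trial_id", fallback_id))
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        return TrialRecord(trial_id, None, False, [f"missing keys: {missing}"])
    try:
        reward = float(data["reward"])
    except (TypeError, ValueError):
        return TrialRecord(trial_id, None, False, ["reward is not numeric"])
    return TrialRecord(trial_id, reward, True)


def import_job(job_dir: Path) -> List[TrialRecord]:
    """Walk the job directory and convert every result.json found."""
    return [load_trial(p) for p in sorted(job_dir.rglob("result.json"))]
```

Returning a partial record instead of raising lets one malformed trial be reported without aborting the import of the rest of the job.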

From the ledger's perspective, comparable trial records have the same shape even if they arrived through different execution paths.

When to use Harbor

Direct local execution and small backend runs are fine for single-user work. Harbor starts paying off when:

  • Multiple teams submit to shared compute: queuing and priority matter
  • Runs are long-lived: detach, go home, come back to finished results
  • Jobs need retries or preemption handling centrally rather than per-experiment
  • Compute sits behind an organisational boundary: VPN, private fleet, tiered access

If none of those apply, staying on a direct backend is simpler and avoids the extra moving part.

Operational notes

  • Determinism still applies. Harbor does not change trial semantics. The same task revision and harness configuration should produce the same reward whether run locally or via Harbor.
  • Provenance travels with results. TrialRecord provenance captures task, agent, environment, input, output, timing, and evaluation evidence.
  • Import is idempotent. Re-importing a finished job directory should not duplicate records, because trial_id is unique per trial.
  • Failures surface at import. A failed Harbor trial can be represented as a partial record with diagnostic errors, while complete records remain the only safe basis for leaderboard aggregation.
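
The idempotency and aggregation notes above can be illustrated with a small in-memory sketch. The `Ledger` and `LedgerEntry` classes are hypothetical stand-ins, not the aec-bench ledger API; the only properties taken from the notes are that trial_id is unique per trial and that only complete records feed aggregation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LedgerEntry:
    """Hypothetical, simplified trial record."""
    trial_id: str
    reward: Optional[float]
    complete: bool


class Ledger:
    """In-memory ledger keyed by trial_id, so re-imports are no-ops."""

    def __init__(self) -> None:
        self._entries: Dict[str, LedgerEntry] = {}

    def __len__(self) -> int:
        return len(self._entries)

    def add(self, entry: LedgerEntry) -> bool:
        """Insert the entry unless its trial_id is already present."""
        if entry.trial_id in self._entries:
            return False
        self._entries[entry.trial_id] = entry
        return True

    def mean_reward(self) -> Optional[float]:
        """Aggregate over complete entries only; partial ones are skipped."""
        rewards = [
            e.reward
            for e in self._entries.values()
            if e.complete and e.reward is not None
        ]
        return sum(rewards) / len(rewards) if rewards else None
```

In this sketch a second import of the same trial is ignored rather than overwriting the first, which is one possible policy; a real ledger might instead compare the two records and flag a mismatch.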
