
Backends

A compute backend executes a single trial inside a container or sandbox on a target compute platform.

Backends are pluggable via the ComputeBackend protocol. Each implementation takes a prepared trial, runs it, and hands the collected artifacts back across the harness boundary.

This page covers the compute layer. For the orchestration layer that dispatches jobs to external services, see Deployment.

The backend protocol

Every direct backend implements the same five-method interface:

from pathlib import Path
from typing import Protocol

# TrialHandle, BackendExecutionRequest, BackendExecutionResult, and
# CollectedArtifacts live in aec_bench.harness.backend.

class ComputeBackend(Protocol):
    def build_environment(self, *, task_dir: Path) -> str: ...
    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> TrialHandle: ...
    def execute_trial(
        self,
        *,
        handle: TrialHandle,
        request: BackendExecutionRequest,
    ) -> BackendExecutionResult: ...
    def collect_outputs(self, *, handle: TrialHandle) -> CollectedArtifacts: ...
    def teardown(self, *, handle: TrialHandle) -> None: ...

The orchestrator drives the sequence: build the task environment, launch a runtime for one trial, execute the agent and verifier inside, collect artifacts, and tear down. A TrialHandle is an opaque identifier: a Docker container ID, a Modal sandbox reference, or whatever the backend needs.
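The call order described above can be sketched with a stand-in backend. This is an illustration only, not the real orchestrator: RecordingBackend and run_one_trial are hypothetical names, handles are plain strings, and running teardown in a finally block (so the runtime is released even when execution fails) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class RecordingBackend:
    """Hypothetical stand-in that records which protocol methods were called."""

    calls: list = field(default_factory=list)
    fail_on_execute: bool = False

    def build_environment(self, *, task_dir: Path) -> str:
        self.calls.append("build_environment")
        return f"image-for-{task_dir.name}"

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> str:
        self.calls.append("launch_trial")
        return f"handle-for-{image_ref}"

    def execute_trial(self, *, handle: str, request: Any) -> dict:
        self.calls.append("execute_trial")
        if self.fail_on_execute:
            raise RuntimeError("agent crashed")
        return {"handle": handle, "request": request}

    def collect_outputs(self, *, handle: str) -> dict:
        self.calls.append("collect_outputs")
        return {"logs": f"logs from {handle}"}

    def teardown(self, *, handle: str) -> None:
        self.calls.append("teardown")


def run_one_trial(backend, *, task_dir: Path, workspace_dir: Path, request: Any):
    """Drive one trial through the five protocol methods in order."""
    image_ref = backend.build_environment(task_dir=task_dir)
    handle = backend.launch_trial(image_ref=image_ref, workspace_dir=workspace_dir)
    try:
        result = backend.execute_trial(handle=handle, request=request)
        artifacts = backend.collect_outputs(handle=handle)
        return result, artifacts
    finally:
        # Assumption: teardown always runs once a runtime has been launched.
        backend.teardown(handle=handle)


backend = RecordingBackend()
run_one_trial(
    backend,
    task_dir=Path("tasks/demo"),
    workspace_dir=Path("work"),
    request={"model": "example"},
)
print(backend.calls)
```

Because every backend exposes the same five methods, a driver like this never needs to know whether the handle refers to a container, a sandbox, or something else.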

Available backends

Different backends suit different workflows. aec-bench run --backend currently accepts modal, docker, e2b, and daytona. Local execution is exposed through aec-bench run-local and through evolution/swarm development paths.

| Backend | Where trials run | Suited for |
| --- | --- | --- |
| modal | Modal serverless containers | Scaling out without infra |
| docker | Local Docker daemon | Reproducible runs on your machine |
| e2b | E2B sandboxes | Sandbox-per-trial isolation |
| daytona | Daytona workspaces | Dev-environment-backed runs |
| local runtime | In-process on the host | Harness iteration, verifier debugging |

Maturity varies. The modal and local runtime paths are the best exercised. The others are supported at the protocol level but may need environment tuning for your workloads.

Local

In-process execution on your machine. The adapter runs in Python directly; the verifier runs as a subprocess against a local workspace directory. No containerisation.

uv run aec-bench run-local tasks/electrical/voltage-drop \
  --model gpt-4.1-mini \
  --harness direct

Zero infrastructure setup. Good for harness development, debugging verifier scripts, and short experiments where you would rather see stack traces than log artefacts. Throughput is bounded by a single host.
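To make the "verifier runs as a subprocess against a local workspace directory" idea concrete, here is a minimal sketch using only the standard library. The run_verifier helper and the one-line verifier script are hypothetical and are not part of aec-bench. The sketch only shows the general pattern of running a check with the workspace as the working directory and reading its exit code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_verifier(verifier_cmd: list, *, workspace_dir: Path, timeout_sec: float = 60.0):
    """Run a verifier command with the workspace as its working directory."""
    return subprocess.run(
        verifier_cmd,
        cwd=workspace_dir,
        capture_output=True,
        text=True,
        timeout=timeout_sec,
        check=False,  # a non-zero exit is a verdict here, not a harness error
    )


# Hypothetical verifier: passes only if the agent wrote answer.txt.
VERIFIER_SOURCE = (
    "import pathlib, sys\n"
    "sys.exit(0 if pathlib.Path('answer.txt').exists() else 1)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cmd = [sys.executable, "-c", VERIFIER_SOURCE]
    before = run_verifier(cmd, workspace_dir=workspace)
    (workspace / "answer.txt").write_text("3.2 V\n")
    after = run_verifier(cmd, workspace_dir=workspace)

print(before.returncode, after.returncode)
```

Since nothing is containerised, a failing verifier leaves its stderr in the returned object and the workspace on disk until it is cleaned up, which is what makes this mode convenient for debugging.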

Modal

Each trial runs in an ephemeral Modal sandbox built from the task's environment image. Task fixtures are staged onto a Modal volume, and outputs are pulled back when the trial completes.

compute:
  backend: modal
  modal:
    environment: aec-bench-prod
    timeout_sec: 600

Modal scales horizontally without infrastructure work. Cost is per-second, so short trials are cheap and long ones show up on the bill.

Docker

A local Docker daemon runs each trial in a container built from the task's Dockerfile. This gives most of the same guarantees as Modal (reproducible environment, isolation) without leaving your machine.

compute:
  backend: docker

Good middle ground between local runtime (no isolation) and modal (cloud cost plus latency). Throughput is bounded by host resources.

E2B and Daytona

Both provide per-trial sandboxed environments hosted off-machine:

  • E2B: purpose-built sandboxes for agent workloads, quick to spin up per trial
  • Daytona: dev-environment-style workspaces, useful where trials need longer-lived state or heavier tooling

Config follows the same pattern:

compute:
  backend: e2b        # or daytona

Reach for these when Modal does not fit: different regions, different sandbox semantics, or existing infrastructure investments.
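The compute blocks shown on this page share one shape: a backend name plus an optional block of backend-specific options. The sketch below validates that shape on an already-parsed config mapping. It is an assumption-laden illustration, not the aec-bench config loader: resolve_backend is a hypothetical helper, and nesting options under a key named after the backend is inferred from the modal example only.

```python
SUPPORTED_BACKENDS = ("modal", "docker", "e2b", "daytona")


def resolve_backend(config: dict) -> tuple:
    """Return (backend_name, backend_options) from a parsed config mapping."""
    compute = config.get("compute") or {}
    backend = compute.get("backend")
    if backend not in SUPPORTED_BACKENDS:
        allowed = ", ".join(SUPPORTED_BACKENDS)
        raise ValueError(f"unknown backend {backend!r}; expected one of: {allowed}")
    # Assumption: per-backend options sit under a key named after the backend,
    # as in the modal example. A missing block is treated as "no options".
    options = dict(compute.get(backend) or {})
    return backend, options


modal_cfg = {
    "compute": {
        "backend": "modal",
        "modal": {"environment": "aec-bench-prod", "timeout_sec": 600},
    }
}
docker_cfg = {"compute": {"backend": "docker"}}

print(resolve_backend(modal_cfg))
print(resolve_backend(docker_cfg))
```

The allowed names mirror what aec-bench run --backend is described as accepting. Local execution is deliberately absent because it goes through run-local instead.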

Writing a custom backend

Adding a backend means implementing the five-method protocol:

from pathlib import Path
from aec_bench.harness.backend import (
    BackendExecutionRequest,
    BackendExecutionResult,
    CollectedArtifacts,
    TrialHandle,
)

class KubernetesBackend:
    def build_environment(self, *, task_dir: Path) -> str:
        # push image to registry, return ref
        ...

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> TrialHandle:
        # create pod, return handle
        ...

    def execute_trial(
        self,
        *,
        handle: TrialHandle,
        request: BackendExecutionRequest,
    ) -> BackendExecutionResult:
        # exec adapter + verifier in pod, return result
        ...

    def collect_outputs(self, *, handle: TrialHandle) -> CollectedArtifacts:
        # pull logs and reward artifacts
        ...

    def teardown(self, *, handle: TrialHandle) -> None:
        # delete pod
        ...

Once registered, the new backend can be wired into the experiment runner. Trial semantics do not change: same harness, same verifier, same trial record at the other end.
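This page does not show how registration works, so the following is only one possible shape for it. BackendShape, BACKEND_CLASSES, and register_backend are hypothetical names. BackendShape is a local mirror of the five method names rather than the real ComputeBackend, and its issubclass check confirms only that the methods exist, not that their signatures match.

```python
from pathlib import Path
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class BackendShape(Protocol):
    """Local mirror of the five method names; not the real ComputeBackend."""

    def build_environment(self, *, task_dir: Path) -> str: ...
    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> Any: ...
    def execute_trial(self, *, handle: Any, request: Any) -> Any: ...
    def collect_outputs(self, *, handle: Any) -> Any: ...
    def teardown(self, *, handle: Any) -> None: ...


# Hypothetical registry mapping a backend name to its class.
BACKEND_CLASSES: Dict[str, type] = {}


def register_backend(name: str, backend_cls: type) -> None:
    """Reject classes that are missing any of the five protocol methods."""
    if not issubclass(backend_cls, BackendShape):
        raise TypeError(f"{backend_cls.__name__} does not provide all five methods")
    BACKEND_CLASSES[name] = backend_cls


class InMemoryBackend:
    """Toy backend that keeps everything in process, for illustration."""

    def build_environment(self, *, task_dir: Path) -> str:
        return f"memory://{task_dir.name}"

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> str:
        return f"{image_ref}#trial"

    def execute_trial(self, *, handle: str, request: Any) -> dict:
        return {"handle": handle, "ok": True}

    def collect_outputs(self, *, handle: str) -> dict:
        return {"handle": handle, "files": []}

    def teardown(self, *, handle: str) -> None:
        return None


class IncompleteBackend:
    """Missing four of the five methods, so registration should fail."""

    def build_environment(self, *, task_dir: Path) -> str:
        return "unused"


register_backend("memory", InMemoryBackend)
print(sorted(BACKEND_CLASSES))
```

A structural check at registration time catches a missing method before any trial is launched, which is cheaper than discovering it when teardown is first called.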
