Backends
A compute backend executes a single prepared trial inside a container or sandbox on a target compute platform, then returns the collected artifacts to the harness boundary.
Backends are pluggable via the ComputeBackend protocol.
This page covers the compute layer. For the orchestration layer that dispatches jobs to external services, see Deployment.
The backend protocol
Every direct backend implements the same five-method interface:
```python
class ComputeBackend(Protocol):
    def build_environment(self, *, task_dir: Path) -> str: ...

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> TrialHandle: ...

    def execute_trial(
        self,
        *,
        handle: TrialHandle,
        request: BackendExecutionRequest,
    ) -> BackendExecutionResult: ...

    def collect_outputs(self, *, handle: TrialHandle) -> CollectedArtifacts: ...

    def teardown(self, *, handle: TrialHandle) -> None: ...
```

The orchestrator drives the sequence: build the task environment, launch a runtime for one trial, execute the agent and verifier inside, collect artifacts, and tear down. A TrialHandle is an opaque identifier: a Docker container ID, a Modal sandbox reference, or whatever the backend needs.
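The call order described above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual code: `FakeBackend`, `run_one_trial`, the string handles, and the dict-shaped request and results are all hypothetical stand-ins for the real types, and running teardown in a `finally` block is an assumption about how cleanup might be handled.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeBackend:
    """Hypothetical in-memory backend that records the order of calls."""

    calls: list[str] = field(default_factory=list)

    def build_environment(self, *, task_dir: Path) -> str:
        self.calls.append("build_environment")
        return f"fake-image:{task_dir.name}"

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> str:
        self.calls.append("launch_trial")
        return f"handle-for-{image_ref}"

    def execute_trial(self, *, handle: str, request: dict) -> dict:
        self.calls.append("execute_trial")
        return {"handle": handle, "exit_code": 0}

    def collect_outputs(self, *, handle: str) -> dict:
        self.calls.append("collect_outputs")
        return {"logs": "ok"}

    def teardown(self, *, handle: str) -> None:
        self.calls.append("teardown")


def run_one_trial(backend, *, task_dir: Path, workspace_dir: Path, request: dict):
    """Drive one trial through the five protocol methods, in order."""
    image_ref = backend.build_environment(task_dir=task_dir)
    handle = backend.launch_trial(image_ref=image_ref, workspace_dir=workspace_dir)
    try:
        result = backend.execute_trial(handle=handle, request=request)
        artifacts = backend.collect_outputs(handle=handle)
    finally:
        # Assumption: teardown should run even if execution or collection fails.
        backend.teardown(handle=handle)
    return result, artifacts


backend = FakeBackend()
result, artifacts = run_one_trial(
    backend,
    task_dir=Path("tasks/electrical/voltage-drop"),
    workspace_dir=Path("workspace"),
    request={},
)
```

After the call, `backend.calls` lists the five method names in the order the text describes, which makes the sketch usable as a smoke test when prototyping a new backend.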
Available backends
Different backends suit different workflows. aec-bench run --backend currently accepts modal, docker, e2b, and daytona. Local execution is exposed through aec-bench run-local and through evolution/swarm development paths.
| Backend | Where trials run | Suited for |
|---|---|---|
| modal | Modal serverless containers | Scaling out without infra |
| docker | Local Docker daemon | Reproducible runs on your machine |
| e2b | E2B sandboxes | Sandbox-per-trial isolation |
| daytona | Daytona workspaces | Dev-environment-backed runs |
| local runtime | In-process on the host | Harness iteration, verifier debugging |
Maturity varies. modal and local runtime paths are the best exercised. The others are supported at the protocol level but may need environment tuning for your workloads.
Local
In-process execution on your machine. The adapter runs in Python directly; the verifier runs as a subprocess against a local workspace directory. No containerisation.
```bash
uv run aec-bench run-local tasks/electrical/voltage-drop \
  --model gpt-4.1-mini \
  --harness direct
```

Zero infrastructure setup. Good for harness development, debugging verifier scripts, and short experiments where you would rather see stack traces than log artefacts. Throughput is bounded by a single host.
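To make "the verifier runs as a subprocess against a local workspace directory" concrete, here is a minimal sketch using only the standard library. The verifier script, the `answer.txt` file name, and the `reward=` output format are hypothetical placeholders; real verifiers are task-specific and the local runtime's actual invocation may differ.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical verifier: reads an answer file from the current working
# directory and prints a reward line. Stands in for a task's real verifier.
VERIFIER_SOURCE = """
from pathlib import Path
answer = Path("answer.txt").read_text().strip()
print("reward=1.0" if answer == "42" else "reward=0.0")
"""


def run_verifier(workspace_dir: Path) -> subprocess.CompletedProcess:
    """Run the stand-in verifier as a subprocess rooted at the workspace."""
    return subprocess.run(
        [sys.executable, "-c", VERIFIER_SOURCE],
        cwd=workspace_dir,      # the verifier sees the workspace as its cwd
        capture_output=True,    # keep stdout/stderr for inspection
        text=True,
        timeout=30,
    )


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    # Pretend the agent wrote its answer into the workspace.
    (workspace / "answer.txt").write_text("42\n")
    completed = run_verifier(workspace)

verifier_stdout = completed.stdout.strip()
```

Because nothing is containerised, a failing verifier surfaces directly through `completed.returncode` and `completed.stderr`, which is the debugging convenience the local runtime trades isolation for.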
Modal
Each trial runs in an ephemeral Modal sandbox built from the task's environment image. Task fixtures stage onto a Modal volume; outputs pull back when the trial completes.
```yaml
compute:
  backend: modal
  modal:
    environment: aec-bench-prod
    timeout_sec: 600
```

Modal scales horizontally without infrastructure work. Cost is per-second, so short trials are cheap and long ones show up on the bill.
Docker
A local Docker daemon runs each trial in a container built from the task's Dockerfile. You get most of the same guarantees as Modal (reproducible environment, isolation) without leaving your machine.
```yaml
compute:
  backend: docker
```

Good middle ground between local runtime (no isolation) and modal (cloud cost plus latency). Throughput is bounded by host resources.
E2B and Daytona
Both provide per-trial sandboxed environments hosted off-machine:
- E2B: purpose-built sandboxes for agent workloads, quick to spin up per trial
- Daytona: dev-environment-style workspaces, useful where trials need longer-lived state or heavier tooling
Config follows the same pattern:
```yaml
compute:
  backend: e2b # or daytona
```

Reach for these when Modal does not fit: different regions, different sandbox semantics, or existing infrastructure investments.
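Since every backend is selected through the same `compute.backend` key, a config check can be very small. The sketch below assumes the YAML has already been parsed into a plain dict; `resolve_backend_name` and `ACCEPTED_BACKENDS` are hypothetical names for illustration, and the accepted values are the ones listed for `aec-bench run --backend` earlier on this page.

```python
# Backend names taken from the `aec-bench run --backend` list on this page.
ACCEPTED_BACKENDS = ("modal", "docker", "e2b", "daytona")


def resolve_backend_name(config: dict) -> str:
    """Return the backend name from a parsed config mapping, or raise ValueError."""
    compute = config.get("compute") or {}
    name = compute.get("backend")
    if name not in ACCEPTED_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {', '.join(ACCEPTED_BACKENDS)}"
        )
    return name


# Dict equivalent of the YAML fragment above once parsed.
selected = resolve_backend_name({"compute": {"backend": "e2b"}})
```

Failing early on an unknown name is a design choice of this sketch: it surfaces typos before any environment is built.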
Writing a custom backend
Adding a backend means implementing the five-method protocol:
```python
from pathlib import Path

from aec_bench.harness.backend import (
    BackendExecutionRequest,
    BackendExecutionResult,
    CollectedArtifacts,
    TrialHandle,
)


class KubernetesBackend:
    def build_environment(self, *, task_dir: Path) -> str:
        # push image to registry, return ref
        ...

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> TrialHandle:
        # create pod, return handle
        ...

    def execute_trial(
        self,
        *,
        handle: TrialHandle,
        request: BackendExecutionRequest,
    ) -> BackendExecutionResult:
        # exec adapter + verifier in pod, return result
        ...

    def collect_outputs(self, *, handle: TrialHandle) -> CollectedArtifacts:
        # pull logs and reward artifacts
        ...

    def teardown(self, *, handle: TrialHandle) -> None:
        # delete pod
        ...
```

Once registered, the new backend can be wired into the experiment runner. Trial semantics do not change: same harness, same verifier, same trial record at the other end.
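Before wiring a new backend in, it can help to check that the class exposes all five methods. The sketch below uses Python's `typing.runtime_checkable` with a local stand-in protocol, `ComputeBackendLike`, which is a hypothetical mirror of the method names shown above rather than the library's own class; whether the real ComputeBackend is runtime-checkable is not stated here. Note that `isinstance` against a runtime-checkable protocol only verifies that the methods exist, not that their signatures match.

```python
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ComputeBackendLike(Protocol):
    """Hypothetical local stand-in mirroring the five protocol method names."""

    def build_environment(self, *, task_dir: Path) -> str: ...
    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> Any: ...
    def execute_trial(self, *, handle: Any, request: Any) -> Any: ...
    def collect_outputs(self, *, handle: Any) -> Any: ...
    def teardown(self, *, handle: Any) -> None: ...


class CompleteBackend:
    """Implements all five methods with placeholder bodies."""

    def build_environment(self, *, task_dir: Path) -> str:
        return "image-ref"

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> Any:
        return "handle"

    def execute_trial(self, *, handle: Any, request: Any) -> Any:
        return {"exit_code": 0}

    def collect_outputs(self, *, handle: Any) -> Any:
        return {}

    def teardown(self, *, handle: Any) -> None:
        return None


class MissingTeardown:
    """Deliberately incomplete: no teardown method."""

    def build_environment(self, *, task_dir: Path) -> str:
        return "image-ref"

    def launch_trial(self, *, image_ref: str, workspace_dir: Path) -> Any:
        return "handle"

    def execute_trial(self, *, handle: Any, request: Any) -> Any:
        return {"exit_code": 0}

    def collect_outputs(self, *, handle: Any) -> Any:
        return {}


complete_ok = isinstance(CompleteBackend(), ComputeBackendLike)
incomplete_ok = isinstance(MissingTeardown(), ComputeBackendLike)
```

A static type checker gives a stricter guarantee, since it also compares parameter names and return types against the protocol.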