aec-bench

Configuration

Agent configuration covers provider credentials, experiment agent entries, harness parameters, and custom internal adapter registration.

Providers

Provider handling depends on the execution path. Agent harnesses backed by PydanticAI infer provider routing from the model string and available credentials. Script-style agents can also build sandbox environment variables from an explicit provider name.

Runtime path | Required env vars
Anthropic API | ANTHROPIC_API_KEY
Azure OpenAI or Azure AI Foundry v1 | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT; optional AZURE_OPENAI_API_VERSION
Bedrock through PydanticAI | AWS_REGION or AWS_DEFAULT_REGION, plus AWS credentials available to the process
Bedrock script-style provider | AWS_BEDROCK_ENDPOINT, AWS_BEARER_TOKEN or AWS_BEARER_TOKEN_BEDROCK, AWS_REGION or AWS_DEFAULT_REGION
OpenAI script-style provider | OPENAI_API_KEY
Together AI | TOGETHER_API_KEY

# .env
ANTHROPIC_API_KEY=sk-ant-...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://example.services.ai.azure.com/openai/v1/
OPENAI_API_KEY=sk-...
TOGETHER_API_KEY=...
AWS_REGION=us-west-2

Keep credentials out of config files. Agent definitions should reference environment variables or rely on the provider SDK's normal credential chain.
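A pre-flight check can catch missing credentials before a run starts. The sketch below encodes the provider table from this section as a plain dictionary. The names REQUIRED_ENV and missing_env_vars, and the dictionary keys, are hypothetical and not part of aec-bench. Alternatives such as AWS_REGION or AWS_DEFAULT_REGION are modelled as tuples where any one member satisfies the requirement.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Each requirement is a tuple of alternatives: any one name being set satisfies it.
# Mirrors the provider table in this section; the keys are illustrative labels only.
REQUIRED_ENV: dict[str, list[tuple[str, ...]]] = {
    "anthropic": [("ANTHROPIC_API_KEY",)],
    "azure_openai": [("AZURE_OPENAI_API_KEY",), ("AZURE_OPENAI_ENDPOINT",)],
    # The SDK credential chain (AWS credentials) is not checked here.
    "bedrock_pydantic_ai": [("AWS_REGION", "AWS_DEFAULT_REGION")],
    "bedrock_script": [
        ("AWS_BEDROCK_ENDPOINT",),
        ("AWS_BEARER_TOKEN", "AWS_BEARER_TOKEN_BEDROCK"),
        ("AWS_REGION", "AWS_DEFAULT_REGION"),
    ],
    "openai_script": [("OPENAI_API_KEY",)],
    "together": [("TOGETHER_API_KEY",)],
}


def missing_env_vars(path: str, environ: Mapping[str, str] | None = None) -> list[str]:
    """Return a human-readable list of unmet requirements for one runtime path."""
    env = os.environ if environ is None else environ
    missing = []
    for alternatives in REQUIRED_ENV[path]:
        # A requirement is met when at least one alternative is set and non-empty.
        if not any(env.get(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

Calling missing_env_vars("bedrock_script") before launching a run would list whichever of the three requirements are still unset in the current process environment.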

For Azure AI Foundry deployments that expose the v1 OpenAI-compatible API, set AZURE_OPENAI_ENDPOINT to the /openai/v1/ endpoint and pass the deployment name as model. For Together AI, prefix the model with together: so routing stays explicit when multiple provider credentials are present:

uv run aec-bench run-local tasks/electrical/voltage-drop \
  --model "together:Qwen/Qwen3.7-Max" \
  --harness direct
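
To illustrate why an explicit prefix removes ambiguity, here is a minimal sketch of prefix-based routing. KNOWN_PROVIDERS and split_provider_prefix are hypothetical names; the real routing happens inside PydanticAI and aec-bench and may work differently.

```python
from __future__ import annotations

# Hypothetical set of recognised prefixes, for illustration only.
KNOWN_PROVIDERS = {"anthropic", "azure", "bedrock", "openai", "together"}


def split_provider_prefix(model: str) -> tuple[str | None, str]:
    """Split 'provider:model' into (provider, model); provider is None if absent."""
    prefix, sep, rest = model.partition(":")
    # Only treat the text before ':' as a provider when it is a known name,
    # so model IDs that themselves contain a colon are left intact.
    if sep and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return None, model
```

With a prefix, the provider is read directly from the model string; without one, the caller has to fall back to inference from whichever credentials happen to be present.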

Agent definitions

An agent entry in an experiment manifest picks the public harness name, model, and optional parameters:

# experiment.yaml
agents:
  - name: claude-sonnet-tool-loop
    harness: tool_loop
    model: claude-sonnet-4-20250514
    parameters:
      max_turns: 12

  - name: gpt4-direct
    harness: direct
    model: gpt-4.1
    parameters:
      max_tokens: 8192

The manifest parser also accepts the older adapter field, but public docs use harness. Supplying both is an error.
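
The harness/adapter rule can be sketched as a small normalisation step. This is an assumption about the behaviour described above, not the parser's actual code; resolve_harness is a hypothetical helper, and the exception type the real parser raises is not documented here.

```python
def resolve_harness(entry: dict) -> str:
    """Return the harness name from an agent entry, accepting the older 'adapter' key."""
    has_harness = "harness" in entry
    has_adapter = "adapter" in entry
    if has_harness and has_adapter:
        # Supplying both fields is ambiguous, so reject the entry.
        raise ValueError(f"agent {entry.get('name')!r}: set 'harness' or 'adapter', not both")
    if not (has_harness or has_adapter):
        raise ValueError(f"agent {entry.get('name')!r}: missing 'harness'")
    return entry["harness"] if has_harness else entry["adapter"]
```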

The model field supports $ENV_VAR references so pinned model IDs stay out of config:

agents:
  - name: pinned
    harness: tool_loop
    model: $ANTHROPIC_MODEL     # resolved at run time
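
A minimal sketch of how such a reference might be resolved at run time follows. resolve_model is a hypothetical helper; whether the real parser supports other forms such as ${VAR}, or how it reports an unset variable, is an assumption left open here.

```python
import os


def resolve_model(value, environ=None):
    """Return the model ID, expanding a leading-'$' environment reference."""
    env = os.environ if environ is None else environ
    if not value.startswith("$"):
        # Literal model IDs pass through unchanged.
        return value
    name = value[1:]
    resolved = env.get(name)
    if not resolved:
        # Fail loudly rather than sending an empty model name to a provider.
        raise KeyError(f"model references ${name}, but it is not set")
    return resolved
```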

Harness parameters

Each harness accepts its own set of parameters. In an experiment manifest, the parameters block is passed to the execution layer and reaches the internal adapter as request.configuration.

Direct: simple generation settings such as output token budget:

max_tokens = 16384

Tool Loop: bounded turn count:

max_turns = 8

RLM: workspace-level rlm.toml, grouped into guardrails and execution:

# rlm.toml
[guardrails]
token_budget = 100_000
max_iterations = 20
max_subcall_depth = 3
max_budget_usd = 5.00

[execution]
scaffolding = true
context_limit = 1_000_000
compaction_threshold_pct = 0.85
max_parallel_workers = 4

Lambda-RLM: workspace-level lambda-rlm.toml, with template, planner, review, guardrails, and execution settings:

# lambda-rlm.toml
[template]
tier = "dependency_tree"
definition = "report_template.toml"

[planner]
context_window_chars = 200_000
max_branching_factor = 4

[review]
enabled = true
max_retries_per_source = 1
max_supplements_per_section = 1

[guardrails]
token_budget = 500_000

[execution]
max_parallel_workers = 4

RLM and Lambda-RLM configuration files live in the staged task workspace. The experiment manifest selects the harness and model; the workspace TOML controls the harness-specific runtime behaviour.

Custom internal adapters

Public documentation calls these agent harnesses, but the Python execution protocol is still named Adapter. Any class matching that protocol can be registered:

from aec_bench.adapters.base import AdapterRequest, AdapterResult


class MyAdapter:
    """Example harness implementing the Adapter protocol."""

    def __init__(self, model_name: str, workspace: str, **kwargs):
        self.model_name = model_name
        self.workspace = workspace

    def execute(self, request: AdapterRequest) -> AdapterResult:
        # Your strategy here: run the model and return an AdapterResult.
        raise NotImplementedError

    def adapter_name(self) -> str:
        # Identifier for this adapter.
        return "my_adapter"

    def resolved_model(self) -> str:
        # Model identifier used for the run.
        return self.model_name

Register against a LocalAdapterRegistry with a builder function:

from aec_bench.adapters.local_registry import LocalAdapterRegistry

registry = LocalAdapterRegistry()
registry.register(
    "my_adapter",
    lambda model_name, workspace, **kwargs: MyAdapter(model_name, workspace, **kwargs),
)

Once registered, the adapter kind is addressable from an experiment config through the public harness field:

agents:
  - name: my-experimental-agent
    harness: my_adapter
    model: claude-sonnet-4

What a good adapter does

  • Respects the whitelist: only call tools present in request.tools.
  • Writes to request.output_path: producing that file is the adapter's job.
  • Records a transcript: capture model turns, tool calls, and tool results.
  • Reports token usage: populate usage_input_tokens and usage_output_tokens when available.
  • Classifies failures: set failure_kind so downstream reports can group errors.
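
The checklist above can be sketched with stand-in types. SketchRequest and SketchResult are hypothetical substitutes for AdapterRequest and AdapterResult, whose real fields may differ; only tools, output_path, configuration, usage_input_tokens, usage_output_tokens, and failure_kind are named in this page. The transcript field name and the failure labels are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchRequest:  # stand-in for AdapterRequest; real fields may differ
    tools: list[str]
    output_path: str
    configuration: dict = field(default_factory=dict)


@dataclass
class SketchResult:  # stand-in for AdapterResult; real fields may differ
    transcript: list[dict] = field(default_factory=list)
    usage_input_tokens: int | None = None
    usage_output_tokens: int | None = None
    failure_kind: str | None = None


def run_once(
    request: SketchRequest,
    wanted_tool: str,
    answer: str,
    usage: tuple[int, int] | None = None,
) -> SketchResult:
    """Make one pretend tool call and write the answer, following the checklist."""
    result = SketchResult()
    # Respect the whitelist: refuse tools that are not in request.tools.
    if wanted_tool not in request.tools:
        result.failure_kind = "tool_not_allowed"  # hypothetical label
        return result
    # Record a transcript entry for the tool call.
    result.transcript.append({"role": "tool_call", "name": wanted_tool})
    try:
        # Producing the output file is the adapter's job.
        Path(request.output_path).write_text(answer, encoding="utf-8")
    except OSError:
        result.failure_kind = "output_write_failed"  # hypothetical label
        return result
    # Report token usage only when the model client made it available.
    if usage is not None:
        result.usage_input_tokens, result.usage_output_tokens = usage
    return result
```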

The contract is deliberately thin. Keeping it thin is what lets the same task compare cleanly across many harnesses.
