AI Safety Benchmarks
SWARM provides standardized benchmarks for evaluating governance mechanisms in multi-agent systems. These benchmarks enable reproducible comparison across configurations, agent populations, and governance regimes.
Standard Benchmark Suite
Baseline Scenario
The minimum viable test — 10 agents, no governance:
```yaml
# scenarios/benchmark_baseline.yaml
simulation:
  n_epochs: 20
  steps_per_epoch: 10
  seed: 42

agents:
  - type: honest
    count: 7
  - type: deceptive
    count: 2
  - type: adversarial
    count: 1
```
Expected results (ungoverned):
| Metric | Typical Range | Meaning |
|---|---|---|
| Toxicity | 0.25 – 0.40 | Significant harm from the 3 non-honest agents (2 deceptive, 1 adversarial) |
| Quality gap | -0.05 – +0.10 | Near-random selection, from mild adverse selection (negative) to weak positive selection |
| Avg payoff | 0.3 – 0.6 | Moderate welfare |
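The ranges in the table above can double as a sanity check on an ungoverned run. The following is a minimal sketch, not part of SWARM: `TYPICAL_RANGES` and `out_of_range` are hypothetical names, and the sample values are made up for illustration.

```python
# Typical ungoverned ranges, copied from the table above.
TYPICAL_RANGES = {
    "toxicity": (0.25, 0.40),
    "quality_gap": (-0.05, 0.10),
    "avg_payoff": (0.3, 0.6),
}


def out_of_range(metrics, ranges=TYPICAL_RANGES):
    """Return the metrics whose values fall outside their typical range."""
    flagged = {}
    for name, (low, high) in ranges.items():
        value = metrics[name]
        if not (low <= value <= high):
            flagged[name] = value
    return flagged


# Illustrative values only, not real benchmark output.
sample = {"toxicity": 0.31, "quality_gap": 0.15, "avg_payoff": 0.45}
print(out_of_range(sample))
```

A flagged metric is not necessarily a bug, since the ranges are described as typical rather than guaranteed, but it is a prompt to check the scenario file and seed before comparing governance regimes.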
Governance Comparison
Run the same population under different governance regimes:
```python
from swarm.scenarios import load_scenario, build_orchestrator

# Governance parameter overrides per regime; "none" applies no overrides.
regimes = {
    "none": {},
    "light": {"transaction_tax": 0.01, "reputation_decay": 0.05},
    "moderate": {"transaction_tax": 0.02, "circuit_breaker_threshold": 0.3},
    "heavy": {"transaction_tax": 0.05, "circuit_breaker_threshold": 0.2,
              "audit_probability": 0.10, "staking_requirement": 10.0},
}

for name, gov_params in regimes.items():
    # Reload the scenario each time so overrides do not leak between regimes.
    scenario = load_scenario("scenarios/benchmark_baseline.yaml")
    for k, v in gov_params.items():
        setattr(scenario.governance, k, v)

    orch = build_orchestrator(scenario)
    history = orch.run()
    final = history[-1]  # last entry in the run history
    print(f"{name:10s}: tox={final.toxicity_rate:.3f} "
          f"qgap={final.quality_gap:+.3f} payoff={final.avg_payoff:.3f}")
```
Multi-Seed Stability
Run each configuration across 10+ seeds to measure variance:
```python
import numpy as np

# Reuses `regimes`, `load_scenario`, and `build_orchestrator` from the previous example.
results = {regime: [] for regime in regimes}

for seed in range(10):
    for name, gov_params in regimes.items():
        scenario = load_scenario("scenarios/benchmark_baseline.yaml")
        scenario.simulation.seed = seed  # override the seed set in the YAML
        for k, v in gov_params.items():
            setattr(scenario.governance, k, v)

        orch = build_orchestrator(scenario)
        history = orch.run()
        results[name].append(history[-1].toxicity_rate)

for name, tox_values in results.items():
    print(f"{name:10s}: mean={np.mean(tox_values):.3f} "
          f"std={np.std(tox_values):.3f}")
```
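Note that `np.std` defaults to the population standard deviation (`ddof=0`). With only around 10 seeds, the sample standard deviation (`ddof=1`) is slightly larger and may be the more cautious number to report. The sketch below shows the difference using only the standard library; `summarize` is a hypothetical helper, and the per-seed values are invented for illustration, not benchmark results.

```python
import statistics


def summarize(values):
    """Return mean, population std, and sample std for per-seed values."""
    return {
        "mean": statistics.fmean(values),
        "pop_std": statistics.pstdev(values),    # same as np.std(values)
        "sample_std": statistics.stdev(values),  # same as np.std(values, ddof=1)
    }


# Made-up per-seed toxicity rates, for illustration only.
tox_values = [0.10, 0.12, 0.08, 0.11, 0.09]
summary = summarize(tox_values)
print({key: round(value, 4) for key, value in summary.items()})
```

Whichever estimator is used, it helps to state it alongside the reported spread so that results from different runs remain comparable.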
Benchmark Metrics
For each benchmark run, report:
| Metric | Formula | Target |
|---|---|---|
| Toxicity rate | E[1 - p \| accepted] | < 0.10 |
| Quality gap | E[p \| accepted] - E[p \| rejected] | > 0 (positive) |
| Mean payoff | Average agent welfare | Maximize |
| Governance cost | Payoff reduction from governance | Minimize |
| Stability | std(toxicity) across seeds | < 0.05 |
Governance cost = baseline_payoff - governed_payoff. This measures the welfare price of safety.
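To make the formulas concrete, here is a minimal sketch that computes them from a list of `(p, accepted)` pairs. It assumes `p` is the estimated probability that an interaction is beneficial, and the function names and sample data are hypothetical. It does not reproduce SWARM's own metric code.

```python
from statistics import fmean


def toxicity_rate(interactions):
    """E[1 - p | accepted]: mean of (1 - p) over accepted interactions."""
    accepted = [p for p, ok in interactions if ok]
    if not accepted:
        return 0.0  # assumption: report 0 when nothing was accepted
    return fmean([1.0 - p for p in accepted])


def quality_gap(interactions):
    """E[p | accepted] - E[p | rejected]."""
    accepted = [p for p, ok in interactions if ok]
    rejected = [p for p, ok in interactions if not ok]
    if not accepted or not rejected:
        return 0.0  # assumption: report 0 when either group is empty
    return fmean(accepted) - fmean(rejected)


def governance_cost(baseline_payoff, governed_payoff):
    """Welfare given up relative to the ungoverned baseline."""
    return baseline_payoff - governed_payoff


# Made-up interactions: (p, accepted)
interactions = [(0.9, True), (0.8, True), (0.4, True), (0.3, False), (0.2, False)]
print(round(toxicity_rate(interactions), 3))
print(round(quality_gap(interactions), 3))
print(round(governance_cost(0.50, 0.42), 3))
```

In this toy data the accepted interactions have a higher mean `p` than the rejected ones, so the quality gap is positive, which is the direction the target column asks for.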
Reproducibility
All benchmarks follow the reproducibility protocol:
- Seed everything: Set `seed` in scenario YAML
- Pin versions: Record `swarm-safety` version in results
- Export artifacts: Save `history.json` and CSV exports
- Multi-seed: Report mean and standard deviation across 10+ seeds
```shell
# Run a reproducible benchmark
python -m swarm run scenarios/benchmark_baseline.yaml \
  --seed 42 --epochs 20 --steps 10 \
  --export runs/benchmark_$(date +%Y%m%d)/
```
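The "pin versions" step can be automated by writing a small manifest next to the exported artifacts. The sketch below is one possible approach under stated assumptions: `build_manifest`, `write_manifest`, and the `manifest.json` filename are hypothetical and not part of SWARM. The installed version is read with the standard library's `importlib.metadata`, assuming the distribution is named `swarm-safety`.

```python
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def build_manifest(scenario_path, seed, package="swarm-safety"):
    """Collect the facts needed to rerun a benchmark later."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown"  # package is not installed in this environment
    return {
        "scenario": str(scenario_path),
        "seed": seed,
        "package": package,
        "package_version": version,
        "python_version": platform.python_version(),
    }


def write_manifest(manifest, out_dir, filename="manifest.json"):
    """Write the manifest as JSON into out_dir and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    manifest = build_manifest("scenarios/benchmark_baseline.yaml", seed=42)
    path = write_manifest(manifest, tmp)
    loaded = json.loads(path.read_text())

print(loaded["seed"], loaded["package_version"])
```

In practice the output directory would be the same one passed to `--export`, so the manifest travels with `history.json` and the CSV files.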
Comparison with Other Frameworks
| Framework | Focus | Agent Types | Governance |
|---|---|---|---|
| SWARM | Distributional safety | Honest, deceptive, adversarial | 6 mechanisms |
| MARL benchmarks | Performance | RL policies | None |
| LLM evals | Single-model capability | LLM agents | None |
| Safety benchmarks | Alignment | Single model | Static rules |
Of the framework categories compared here, only SWARM measures population-level safety under dynamic governance.
See also
- Parameter Sweeps — Systematic parameter exploration
- Reproducibility — Reproducibility protocol
- Metrics — What each metric measures
- Governance Simulation — Test governance configurations