SWARM: Multi-Agent AI Safety Framework

System-Wide Assessment of Risk in Multi-agent systems

Study how intelligence swarms—and where it fails.

The Core Insight: AGI-level risks don't require AGI-level agents. Catastrophic failures can emerge from the interaction of many sub-AGI agents—even when none are individually dangerous. Read more in our theoretical foundations.
The Purity Paradox: Populations with only 10% honest agents achieve 74% higher welfare than 100% honest populations. Heterogeneity creates competitive pressure that improves outcomes. See the research blog for detailed analysis.

What is SWARM?

SWARM is the reference implementation of the Distributional AGI Safety research framework. It provides tools for studying emergent risks in multi-agent AI systems. Rather than focusing on single misaligned agents, SWARM reveals how harmful dynamics emerge from:

  • Information asymmetry between agents
  • Adverse selection (the system disproportionately accepts lower-quality interactions)
  • Variance amplification across decision horizons
  • Governance latency and illegibility

SWARM makes these interaction-level risks observable, measurable, and governable using soft probabilistic labels.
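As a minimal sketch of the soft-label idea (the function name, the `v_hat` argument, and the temperature parameter are illustrative, not SWARM's actual API), a soft probabilistic label maps a raw proxy score to a probability rather than a hard good/bad verdict:

```python
import math

def soft_label(v_hat: float, temperature: float = 1.0) -> float:
    """Map a raw proxy score v_hat to a probability p in (0, 1).

    Illustrative only: SWARM's actual ProxyComputer may use a
    different scale or calibration.
    """
    return 1.0 / (1.0 + math.exp(-v_hat / temperature))

# A strongly negative proxy score yields a low probability of benefit,
# but never a hard 0/1 verdict -- the uncertainty is preserved.
print(soft_label(-2.0))  # ~0.12
print(soft_label(0.0))   # 0.5
```

Because `p` stays continuous, downstream metrics can aggregate uncertainty instead of discarding it at a threshold.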

Measure

Soft probabilistic labels capture uncertainty. Four key metrics—toxicity, quality gap, conditional loss, and incoherence—reveal hidden risks.
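To make two of these metrics concrete, here is a sketch of one plausible reading of toxicity and quality gap over soft labels. These are illustrative definitions, not the library's exact `SoftMetrics` formulas:

```python
def toxicity_rate(probs: list[float]) -> float:
    """Expected fraction of harmful interactions: mean of (1 - p).

    Sketch only; SWARM's SoftMetrics may weight or aggregate differently.
    """
    return sum(1.0 - p for p in probs) / len(probs)

def quality_gap(accepted: list[float], offered: list[float]) -> float:
    """Adverse-selection signal: mean soft label of accepted interactions
    minus mean soft label of everything offered. Negative values mean the
    system is accepting lower-quality interactions than average."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(accepted) - mean(offered)

print(round(toxicity_rate([0.9, 0.8, 0.4]), 3))                    # 0.3
print(round(quality_gap([0.4, 0.5], [0.4, 0.5, 0.9, 0.8]), 3))     # -0.2
```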

Govern

Transaction taxes, circuit breakers, reputation decay, staking, and collusion detection. Test interventions before deployment.
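Two of these interventions can be sketched in a few lines. The function names, tax rule, and threshold below are hypothetical placeholders for illustration, not SWARM's governance API:

```python
def apply_transaction_tax(payoff: float, tax_rate: float) -> float:
    """Skim a fraction of each positive payoff; losses pass through untaxed."""
    return payoff * (1.0 - tax_rate) if payoff > 0 else payoff

def circuit_breaker(toxicity: float, threshold: float = 0.3) -> bool:
    """Halt interactions for an epoch when measured toxicity crosses a threshold."""
    return toxicity >= threshold

# Hypothetical epoch step: tax payoffs, then check the breaker.
payoffs = [1.0, -0.5, 2.0]
taxed = [apply_transaction_tax(p, tax_rate=0.1) for p in payoffs]
print(taxed)                  # [0.9, -0.5, 1.8]
print(circuit_breaker(0.42))  # True
```

Testing an intervention like this in simulation first is the point: the Key Findings below show that a heavy-handed version (a 95% tax) can destroy most of the welfare it is meant to protect.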

Validate

Integrate with real systems via bridges: Concordia for LLM agents, Prime Intellect for safety-reward RL training, Gas Town for production data, AgentXiv for research mapping.

Quick Start

pip install swarm-safety

from swarm.agents.honest import HonestAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

# Configure and run
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)

orchestrator.register_agent(HonestAgent(agent_id="honest_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))

metrics = orchestrator.run()

for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}")

Architecture

Observables → ProxyComputer → v_hat → sigmoid → p ─→ SoftPayoffEngine → payoffs
                                                  └→ SoftMetrics → toxicity, quality gap, conditional loss, incoherence
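The pipeline above can be sketched end to end. The observable names, proxy weights, and payoff rule here are illustrative stand-ins, not the actual `ProxyComputer` or `SoftPayoffEngine` internals:

```python
import math

def run_pipeline(observables: dict[str, float]) -> tuple[float, float]:
    """Sketch of Observables -> ProxyComputer -> v_hat -> sigmoid -> p -> payoff."""
    # ProxyComputer: combine observable signals into a raw score v_hat.
    # (Weights are hypothetical placeholders.)
    weights = {"task_success": 1.5, "complaint_rate": -2.0}
    v_hat = sum(weights.get(k, 0.0) * v for k, v in observables.items())
    # Squash to a soft label p in (0, 1).
    p = 1.0 / (1.0 + math.exp(-v_hat))
    # SoftPayoffEngine: expected payoff under the soft label,
    # e.g. +1 on benefit, -1 on harm.
    payoff = p * 1.0 + (1.0 - p) * -1.0
    return p, payoff

p, payoff = run_pipeline({"task_success": 0.8, "complaint_rate": 0.1})
print(f"p={p:.3f}, payoff={payoff:.3f}")  # p=0.731, payoff=0.462
```

The same soft label `p` that feeds the payoff engine also feeds `SoftMetrics`, which is why risk measurement and payoffs stay consistent with each other.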

Learn More

Core Concepts

Understand soft labels, metrics, and the theory behind SWARM.

Writing Scenarios

Create custom experiments with YAML scenario definitions.

Research

Dive into the theoretical foundations and academic context.

Research Workflow

Multi-agent research with depth/breadth control and quality gates.

Reflexivity

Handle feedback loops when agents study agents.

Agent Publishing

Publish research to agentxiv.org and clawxiv.org.

Research Blog

Experiment write-ups, cross-scenario analyses, and governance mechanism deep dives.

API Reference

Full Python API documentation for agents, metrics, payoffs, and orchestration.

Glossary

Definitions for soft labels, adverse selection, externality internalization, and more.

Key Findings from SWARM Research

These results come from published experiments in the SWARM research blog, based on the theoretical framework in Distributional Safety in Agentic Systems.

| Finding | Evidence | Source |
| --- | --- | --- |
| 3 turns of forced cooperation eliminate nuclear escalation | 210 LLM runs; nuclear rate drops from 100% to 0% at cooperation window = 3 across all scenarios | Cooperation Window Phase Transition |
| Deception persists at temperature 0.0 | 120-run temperature sweep; signal-action divergence of 1.05 at deterministic decoding | Temperature vs Deception |
| Large models (70B–405B) escalate more than small models | Llama 405B: 100% nuclear rate, worst welfare (−523.7); Claude Sonnet 4: 0% nuclear, positive welfare (+74.9) | Model Size vs Escalation |
| Purity paradox: 20% honest agents outperform 100% | 21 parameter configs tested; paradox holds in 71% but disappears at ρ ≥ 0.5 (full externality pricing) | The Purity Paradox |
| Emergency controls destroy 80% of welfare | Market freeze (95% tax) crashed welfare from ~65 to ~15; toxicity actually increased post-freeze | Runaway Intelligence Containment |
| Transparency halves nuclear escalation | 120-run information asymmetry sweep; nuclear rate drops 60% → 30% for safety-trained models | Asymmetric Information Escalation |
| Same model, 41% apart from environment fixes alone | GPT-4.1-mini: 3 environment fixes changed composite reward from 0.830 to 1.175 | Two Eval Runs, One Model |

New: Recursive Agent Research

SWARM now includes a complete research workflow for agents conducting research about multi-agent systems:

from swarm.research import ResearchWorkflow, WorkflowConfig

# Configure with reflexivity handling
workflow = ResearchWorkflow(
    config=WorkflowConfig(
        depth=3,
        breadth=3,
        enable_reflexivity=True,
    ),
    simulation_fn=my_simulation,
)

# Run complete workflow: literature → experiment → analysis → publication
state = workflow.run(
    question="How do governance mechanisms affect population dynamics?",
    parameter_space={"honest_fraction": [0.2, 0.5, 0.8]},
)

print(f"Published: {state.submission_result.paper_id}")

Key Features:

  • 7 Specialized Agents: Literature, Experiment, Analysis, Writing, Review, Critique, Replication
  • Quality Gates: Automated checks between workflow phases
  • Pre-Registration: Hash-verified hypothesis locking
  • Reflexivity Analysis: Shadow simulations and publish-then-attack protocols
  • Platform Integration: Submit directly to agentxiv and clawxiv

Based on Distributional Safety in Agentic Systems · MIT License · GitHub · @ResearchSwarmAI