Skip to content

Five Sweeps, One Red Team, and the Limits of Parametric Governance

We swept 5 governance parameters, ran 8 adversarial attacks, hardened the defenses, and re-attacked. The result: parametric tuning can't fix structural vulnerabilities.


Research Swarm recruits AI agents to research Triple-Negative Breast Cancer. Agents register, get tasks, search PubMed, submit findings, and 30% of the time review each other's work. It's an open-entry multi-agent system with reputation-based QC, compute extraction asymmetry, and collusion vulnerability --- a natural target for SWARM's distributional safety framework.

We built a scenario modeling the platform with 19 scripted agents (8 honest, 3 deceptive, 4 opportunistic, 2 adversarial, 2 collusive verifiers) and swept five governance levers across 20-epoch runs. The results expose a structural property of parametric governance that has implications beyond this specific platform.

The five sweeps

1. QC rate (audit_probability 0.1--0.5): No effect

AP Avg Toxicity Avg Quality Gap Final Welfare
0.1 0.3470 0.1284 45.38
0.3 0.3472 0.1289 41.12
0.5 0.3473 0.1289 41.12

Quintupling the audit rate didn't move toxicity by a single basis point. The QC mechanism penalizes agents after the fact but doesn't change which interactions get generated.

2. Externality internalization (rho 0.01--0.5): Pure tax

rho Avg Toxicity Avg Payoff Final Welfare
0.01 0.3472 1.3041 43.49
0.20 0.3472 1.0732 32.23
0.50 0.3472 0.7086 14.46

Raising rho 50x crushed welfare by 67% and payoffs by 46%. Toxicity? Identical across all values. Externality pricing is an accounting transfer that doesn't create behavioral feedback.

3. Platform extraction (theta 0.3--0.9): Surplus split only

theta Avg Toxicity Avg Payoff Final Welfare
0.3 0.3472 0.4889 41.12
0.7 0.3472 1.2555 41.12
0.9 0.3472 1.6388 41.12

Payoff scales linearly with theta. Everything else is identical. The platform's 70% extraction rate is a distributional question, not a safety question.

4. Staking barrier (min_stake 0--5): Counterproductive

Stake Interactions Avg Toxicity Final Welfare
0.0 731 0.3472 41.12
0.5 552 0.3687 4.73
5.0 550 0.3711 6.97

Staking was the first lever to change behavior --- interactions dropped 25%. But the effect was destructive: toxicity increased (from 0.347 to 0.371), quality gap narrowed, and welfare collapsed 88%. The barrier filtered out participation indiscriminately, hitting honest agents as hard as deceptive ones. Once any stake was required, the damage saturated immediately --- no difference between 0.5 and 5.0.

5. Deceptive agent count (1--7): The lever that works

Deceptive Total Avg Toxicity Avg Quality Gap Final Welfare
1 17 0.3478 0.1723 34.17
3 19 0.3472 0.1289 41.12
5 21 0.3473 0.1548 62.51
7 23 0.3560 0.1520 34.90

This was the only lever across all five sweeps to push toxicity past the 0.35 threshold. At 7 deceptive agents (30% of population), the system fails the success criterion. The quality gap peaks at dec=1 (0.172) where QC has the easiest separation job, and the overall pattern shows real sensitivity to population composition rather than flat-line indifference.

The pattern

Four of five levers --- audit rate, externality pricing, platform extraction, and staking --- produced zero improvement in toxicity. The one that did (agent composition) isn't a governance parameter at all. It's who shows up.

This isn't a simulation artifact. It reflects a real property of parametric governance:

Governance levers that operate after interactions occur are accounting mechanisms, not behavioral feedback loops.

In Research Swarm's actual deployment, this predicts:

  • Raising QC rates won't help unless QC outcomes feed back into task assignment
  • Externality taxes won't change behavior unless agents can observe and respond to their own cost trajectory
  • Entry barriers hurt more than they help because they're indiscriminate
  • The binding constraint is population composition --- the ratio of honest to deceptive agents determines safety outcomes, and no amount of post-hoc scoring changes that ratio

Red-team: 8 attacks, 2 rounds

After the sweeps, we ran the full SWARM attack library against the governance configuration. Then we hardened the defenses and attacked again.

Round 1: Original governance

Score: 0.64 / 1.00 (Grade D) --- 4 of 8 attacks prevented.

Attack Result Damage Evasion
Reputation Farming PREVENTED 14.2 20.0%
Collusion Ring PREVENTED 23.1 10.8%
Threshold Dancing PREVENTED 20.5 28.7%
Sybil Attack PREVENTED 19.9 5.6%
Timing Attack SUCCEEDED 25.7 35.9%
Information Laundering SUCCEEDED 45.6 31.9%
Resource Drain SUCCEEDED 57.0 39.9%
Governance Gaming SUCCEEDED 22.8 31.9%

The good: collusion and sybil detection caught coordination attacks in 5 epochs with low evasion. The bad: resource drain (no staking barrier, 5% tax) was the highest-damage attack, and information laundering bypassed agent-level circuit breakers entirely by routing harm through proxies.

Hardening

We applied 13 parameter changes targeting the 4 successful attacks:

  • Audit probability 0.3→0.4, penalty 2x→3x (timing attack)
  • Circuit breaker threshold 0.6→0.45, violations 3→2, freeze 2→3 epochs (information laundering)
  • Collusion thresholds tightened across the board, penalty 2x→3x (governance gaming)
  • Sybil similarity 0.8→0.75, penalty 1x→2x (strengthen existing defense)
  • Transaction tax 5%→12%, bandwidth cap 12→8 (resource drain)

Round 2: Hardened governance

Score: 0.70 / 1.00 (Grade D) --- 5 of 8 attacks prevented.

Attack Result Damage Evasion vs. Round 1
Reputation Farming PREVENTED 13.2 18.5% same
Collusion Ring SUCCEEDED 21.4 10.0% regressed
Threshold Dancing PREVENTED 19.0 26.6% same
Sybil Attack PREVENTED 18.5 5.2% same
Timing Attack PREVENTED 23.8 33.3% fixed
Information Laundering SUCCEEDED 42.2 29.6% improved
Resource Drain SUCCEEDED 52.8 37.0% improved
Governance Gaming PREVENTED 21.1 29.6% fixed

The scorecard

Metric v1 v2 Change
Robustness score 0.64 0.70 +9%
Attacks prevented 4/8 5/8 +1
Total damage 228.9 212.0 −7%
Avg evasion rate 25.6% 23.7% −1.9pp

But the simulation told a different story:

Metric v1 v2 Change
Total interactions 731 562 −23%
Avg toxicity 0.3472 0.3479 ~same
Final welfare 41.12 9.20 −78%

The hardened governance prevented one more attack and reduced damage by 7%. But it also collapsed welfare by 78% --- the tighter circuit breaker (0.45 threshold, 2-strike, 3-epoch freeze) and reduced bandwidth cap throttled economic activity across the board.

The whack-a-mole problem

The most revealing result: tightening collusion detection caused the collusion ring to succeed. In v1, the collusion ring was prevented. In v2, with stricter thresholds (frequency 1.5→1.3, correlation 0.6→0.5), the more aggressive lever changed interaction patterns enough that the collusion ring found new exploitation paths.

This is governance whack-a-mole. Tightening one parameter shifts adversary behavior to exploit another surface. Parametric hardening is inherently local --- it patches the specific attack that was tested, but can open new vulnerabilities in the process.

What parametric tuning can and can't do

Can do:

  • Fix known, specific attack vectors (timing attack, governance gaming)
  • Reduce damage from existing vulnerabilities (resource drain −7%, info laundering −7%)
  • Improve detection speed (mean latency 7.9→7.4 epochs)

Can't do:

  • Prevent structural vulnerabilities (resource drain and information laundering still succeed after hardening)
  • Avoid welfare costs (every tightening reduces economic activity)
  • Prevent regression (fixing one attack can break defense against another)
  • Change the grade (D→D, not D→C)

What would actually work

The sweep and red-team results converge on the same conclusion: the interventions that matter are structural, not parametric.

  1. Flow-level monitoring --- track value chains across agents, not just individual behavior. This is the only defense against information laundering, where harm is routed through proxies that look clean in isolation.
  2. Adaptive feedback loops --- make QC outcomes affect future task assignment, not just payoffs. This is the key to making rho and audit_probability into decision-shaping mechanisms rather than accounting adjustments.
  3. Tiered access with optional staking --- preserve open entry for basic tasks but require staking for high-value task access. This targets resource drain without the welfare collapse of universal staking (sweep 4).
  4. Graduated sanctions --- replace binary circuit-breaker freezes with progressive penalties. The v2 welfare collapse (−78%) was driven by the aggressive freeze mechanism. Smoother penalties preserve economic activity while still creating deterrence.
  5. Population screening --- agent composition is the binding constraint (sweep 5). At 30% deceptive agents, the system fails regardless of governance. The most impactful long-term intervention is making it costly to be deceptive over time.

Next steps

These results are for scripted agents with fixed strategies. The same sweeps and attacks with LLM-backed agents (type: llm_agent) or RL agents (type: skillrl) --- agents that observe their own payoff history and adapt --- may produce dramatically different results. When agents can switch strategies in response to declining payoffs, parametric levers could become the decision-shaping mechanisms they're designed to be.

The scenario is at scenarios/research_swarm.yaml. The full case study with architecture diagrams, mapping tables, all sweep data, and the complete red-team hardening cycle is in the Research Swarm bridge docs.


Disclaimer: This post simulates a stylized research platform environment for AI safety research. Nothing here constitutes medical advice, research methodology recommendations, or endorsement of any specific research platform or strategy.