Five Sweeps, One Red Team, and the Limits of Parametric Governance¶

We swept 5 governance parameters, ran 8 adversarial attacks, hardened the defenses, and re-attacked. The result: parametric tuning can't fix structural vulnerabilities.

Research Swarm recruits AI agents to research Triple-Negative Breast Cancer. Agents register, get tasks, search PubMed, submit findings, and 30% of the time review each other's work. It's an open-entry multi-agent system with reputation-based QC, compute extraction asymmetry, and collusion vulnerability --- a natural target for SWARM's distributional safety framework.

We built a scenario modeling the platform with 19 scripted agents (8 honest, 3 deceptive, 4 opportunistic, 2 adversarial, 2 collusive verifiers) and swept five governance levers across 20-epoch runs. The results expose a structural property of parametric governance that has implications beyond this specific platform.

The five sweeps¶

1. QC rate (`audit_probability` 0.1--0.5): No effect¶

AP	Avg Toxicity	Avg Quality Gap	Final Welfare
0.1	0.3470	0.1284	45.38
0.3	0.3472	0.1289	41.12
0.5	0.3473	0.1289	41.12

Quintupling the audit rate didn't move toxicity by a single basis point. The QC mechanism penalizes agents after the fact but doesn't change which interactions get generated.

2. Externality internalization (`rho` 0.01--0.5): Pure tax¶

rho	Avg Toxicity	Avg Payoff	Final Welfare
0.01	0.3472	1.3041	43.49
0.20	0.3472	1.0732	32.23
0.50	0.3472	0.7086	14.46

Raising rho 50x crushed welfare by 67% and payoffs by 46%. Toxicity? Identical across all values. Externality pricing is an accounting transfer that doesn't create behavioral feedback.

3. Platform extraction (`theta` 0.3--0.9): Surplus split only¶

theta	Avg Toxicity	Avg Payoff	Final Welfare
0.3	0.3472	0.4889	41.12
0.7	0.3472	1.2555	41.12
0.9	0.3472	1.6388	41.12

Payoff scales linearly with theta. Everything else is identical. The platform's 70% extraction rate is a distributional question, not a safety question.

4. Staking barrier (`min_stake` 0--5): Counterproductive¶

Stake	Interactions	Avg Toxicity	Final Welfare
0.0	731	0.3472	41.12
0.5	552	0.3687	4.73
5.0	550	0.3711	6.97

Staking was the first lever to change behavior --- interactions dropped 25%. But the effect was destructive: toxicity increased (from 0.347 to 0.371), quality gap narrowed, and welfare collapsed 88%. The barrier filtered out participation indiscriminately, hitting honest agents as hard as deceptive ones. Once any stake was required, the damage saturated immediately --- no difference between 0.5 and 5.0.

5. Deceptive agent count (1--7): The lever that works¶

Deceptive	Total	Avg Toxicity	Avg Quality Gap	Final Welfare
1	17	0.3478	0.1723	34.17
3	19	0.3472	0.1289	41.12
5	21	0.3473	0.1548	62.51
7	23	0.3560	0.1520	34.90

This was the only lever across all five sweeps to push toxicity past the 0.35 threshold. At 7 deceptive agents (30% of population), the system fails the success criterion. The quality gap peaks at dec=1 (0.172) where QC has the easiest separation job, and the overall pattern shows real sensitivity to population composition rather than flat-line indifference.

The pattern¶

Four of five levers --- audit rate, externality pricing, platform extraction, and staking --- produced zero improvement in toxicity. The one that did (agent composition) isn't a governance parameter at all. It's who shows up.

This isn't a simulation artifact. It reflects a real property of parametric governance:

Governance levers that operate after interactions occur are accounting mechanisms, not behavioral feedback loops.

In Research Swarm's actual deployment, this predicts:

Raising QC rates won't help unless QC outcomes feed back into task assignment
Externality taxes won't change behavior unless agents can observe and respond to their own cost trajectory
Entry barriers hurt more than they help because they're indiscriminate
The binding constraint is population composition --- the ratio of honest to deceptive agents determines safety outcomes, and no amount of post-hoc scoring changes that ratio

Red-team: 8 attacks, 2 rounds¶

After the sweeps, we ran the full SWARM attack library against the governance configuration. Then we hardened the defenses and attacked again.

Round 1: Original governance¶

Score: 0.64 / 1.00 (Grade D) --- 4 of 8 attacks prevented.

Attack	Result	Damage	Evasion
Reputation Farming	PREVENTED	14.2	20.0%
Collusion Ring	PREVENTED	23.1	10.8%
Threshold Dancing	PREVENTED	20.5	28.7%
Sybil Attack	PREVENTED	19.9	5.6%
Timing Attack	SUCCEEDED	25.7	35.9%
Information Laundering	SUCCEEDED	45.6	31.9%
Resource Drain	SUCCEEDED	57.0	39.9%
Governance Gaming	SUCCEEDED	22.8	31.9%

The good: collusion and sybil detection caught coordination attacks in 5 epochs with low evasion. The bad: resource drain (no staking barrier, 5% tax) was the highest-damage attack, and information laundering bypassed agent-level circuit breakers entirely by routing harm through proxies.

Hardening¶

We applied 13 parameter changes targeting the 4 successful attacks:

Audit probability 0.3→0.4, penalty 2x→3x (timing attack)
Circuit breaker threshold 0.6→0.45, violations 3→2, freeze 2→3 epochs (information laundering)
Collusion thresholds tightened across the board, penalty 2x→3x (governance gaming)
Sybil similarity 0.8→0.75, penalty 1x→2x (strengthen existing defense)
Transaction tax 5%→12%, bandwidth cap 12→8 (resource drain)

Round 2: Hardened governance¶

Score: 0.70 / 1.00 (Grade D) --- 5 of 8 attacks prevented.

Attack	Result	Damage	Evasion	vs. Round 1
Reputation Farming	PREVENTED	13.2	18.5%	same
Collusion Ring	SUCCEEDED	21.4	10.0%	regressed
Threshold Dancing	PREVENTED	19.0	26.6%	same
Sybil Attack	PREVENTED	18.5	5.2%	same
Timing Attack	PREVENTED	23.8	33.3%	fixed
Information Laundering	SUCCEEDED	42.2	29.6%	improved
Resource Drain	SUCCEEDED	52.8	37.0%	improved
Governance Gaming	PREVENTED	21.1	29.6%	fixed

The scorecard¶

Metric	v1	v2	Change
Robustness score	0.64	0.70	+9%
Attacks prevented	4/8	5/8	+1
Total damage	228.9	212.0	−7%
Avg evasion rate	25.6%	23.7%	−1.9pp

But the simulation told a different story:

Metric	v1	v2	Change
Total interactions	731	562	−23%
Avg toxicity	0.3472	0.3479	~same
Final welfare	41.12	9.20	−78%

The hardened governance prevented one more attack and reduced damage by 7%. But it also collapsed welfare by 78% --- the tighter circuit breaker (0.45 threshold, 2-strike, 3-epoch freeze) and reduced bandwidth cap throttled economic activity across the board.

The whack-a-mole problem¶

The most revealing result: tightening collusion detection caused the collusion ring to succeed. In v1, the collusion ring was prevented. In v2, with stricter thresholds (frequency 1.5→1.3, correlation 0.6→0.5), the more aggressive lever changed interaction patterns enough that the collusion ring found new exploitation paths.

This is governance whack-a-mole. Tightening one parameter shifts adversary behavior to exploit another surface. Parametric hardening is inherently local --- it patches the specific attack that was tested, but can open new vulnerabilities in the process.

What parametric tuning can and can't do¶

Can do:

Fix known, specific attack vectors (timing attack, governance gaming)
Reduce damage from existing vulnerabilities (resource drain −7%, info laundering −7%)
Improve detection speed (mean latency 7.9→7.4 epochs)

Can't do:

Prevent structural vulnerabilities (resource drain and information laundering still succeed after hardening)
Avoid welfare costs (every tightening reduces economic activity)
Prevent regression (fixing one attack can break defense against another)
Change the grade (D→D, not D→C)

What would actually work¶

The sweep and red-team results converge on the same conclusion: the interventions that matter are structural, not parametric.

Flow-level monitoring --- track value chains across agents, not just individual behavior. This is the only defense against information laundering, where harm is routed through proxies that look clean in isolation.
Adaptive feedback loops --- make QC outcomes affect future task assignment, not just payoffs. This is the key to making rho and audit_probability into decision-shaping mechanisms rather than accounting adjustments.
Tiered access with optional staking --- preserve open entry for basic tasks but require staking for high-value task access. This targets resource drain without the welfare collapse of universal staking (sweep 4).
Graduated sanctions --- replace binary circuit-breaker freezes with progressive penalties. The v2 welfare collapse (−78%) was driven by the aggressive freeze mechanism. Smoother penalties preserve economic activity while still creating deterrence.
Population screening --- agent composition is the binding constraint (sweep 5). At 30% deceptive agents, the system fails regardless of governance. The most impactful long-term intervention is making it costly to be deceptive over time.

Next steps¶

These results are for scripted agents with fixed strategies. The same sweeps and attacks with LLM-backed agents (type: llm_agent) or RL agents (type: skillrl) --- agents that observe their own payoff history and adapt --- may produce dramatically different results. When agents can switch strategies in response to declining payoffs, parametric levers could become the decision-shaping mechanisms they're designed to be.

The scenario is at scenarios/research_swarm.yaml. The full case study with architecture diagrams, mapping tables, all sweep data, and the complete red-team hardening cycle is in the Research Swarm bridge docs.

Disclaimer: This post simulates a stylized research platform environment for AI safety research. Nothing here constitutes medical advice, research methodology recommendations, or endorsement of any specific research platform or strategy.