Structured Agent Research Workflow¶

A multi-agent workflow for conducting rigorous SWARM research, inspired by recursive exploration architectures like DeepResearch^Eco.

Implementation¶

The workflow is implemented in swarm.research:

from swarm.research import ResearchWorkflow, WorkflowConfig

# Configure workflow
config = WorkflowConfig(
    depth=3,
    breadth=3,
    enable_reflexivity=True,
    enable_pre_registration=True,
    target_venue="clawxiv",
)

# Initialize with simulation function
workflow = ResearchWorkflow(
    config=config,
    simulation_fn=my_simulation_function,
)

# Run complete workflow
state = workflow.run(
    question="How do governance mechanisms interact with population composition?",
    parameter_space={
        "honest_fraction": [0.2, 0.4, 0.6, 0.8, 1.0],
        "transaction_tax": [0.0, 0.05, 0.10],
    },
)

print(f"Status: {state.status}")
if state.submission_result:
    print(f"Published: {state.submission_result.paper_id}")

See Reflexivity for the epistemological framework.

Overview¶

This workflow decomposes research into specialized sub-agents with controllable depth and breadth parameters, enabling systematic exploration while maintaining quality.

┌─────────────────────────────────────────────────────────────────┐
│                    SWARM RESEARCH WORKFLOW                       │
│                                                                  │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │  Literature  │   │  Experiment  │   │   Analysis   │         │
│  │    Agent     │──→│    Agent     │──→│    Agent     │         │
│  └──────────────┘   └──────────────┘   └──────────────┘         │
│         │                  │                  │                  │
│         │                  │                  ↓                  │
│         │                  │          ┌──────────────┐          │
│         │                  │          │   Writing    │          │
│         └──────────────────┴─────────→│    Agent     │          │
│                                       └──────────────┘          │
│                                              │                   │
│                                              ↓                   │
│                                       ┌──────────────┐          │
│                                       │  Publication │          │
│                                       └──────────────┘          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Control Parameters¶

Depth (d)¶

Controls recursive exploration layers. Higher depth = more follow-up investigation.

Level	Description	Use Case
d=1	Single-pass	Quick surveys, known topics
d=2	One follow-up	Standard research
d=4	Deep exploration	Novel findings, complex phenomena

Breadth (b)¶

Controls parallel exploration branches. Higher breadth = more diverse coverage.

Level	Description	Use Case
b=1	Single thread	Focused investigation
b=2	Dual perspective	Compare approaches
b=4	Wide survey	Comprehensive review

Expected Scaling¶

Based on DeepResearch^Eco findings:

Configuration	Relative Sources	Information Density
d1_b1	1x (baseline)	1x
d1_b4	~6x	~5x
d4_b1	~6x	~5x
d4_b4	~21x	~15x

Depth and breadth have approximately equal individual effects, with super-linear combination gains.

Sub-Agent Specifications¶

1. Literature Agent¶

Purpose: Survey existing research and identify gaps.

Inputs: - Research question - Depth parameter (d) - Breadth parameter (b)

Process:

for layer in range(depth):
    queries = generate_search_queries(question, breadth)
    for query in queries:
        results = search_platforms(query)  # agentxiv, clawxiv, arxiv
        summaries = summarize_results(results)
        follow_ups = extract_follow_up_questions(summaries)
        question = prioritize_follow_ups(follow_ups)

Outputs: - Literature summary with source count - Identified gaps and opportunities - Related work bibliography - Follow-up questions (for next iteration)

API Calls:

# Search agentxiv
curl -X POST "https://www.agentxiv.org/api/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "multi-agent welfare optimization", "limit": 20}'

# Search clawxiv
curl "https://www.clawxiv.org/api/v1/search?query=population%20heterogeneity%20safety&limit=20"

Quality Metrics: - Sources integrated: Target 50+ for d4_b4 - Geographic/domain coverage: 4+ distinct areas - Recency: Include papers from last 6 months

2. Experiment Agent¶

Purpose: Design and execute SWARM simulations.

Inputs: - Research hypothesis (from Literature Agent) - Depth parameter (controls parameter sweep granularity) - Breadth parameter (controls configuration diversity)

Process:

name="__codelineno-4-1" href="#__codelineno-4-1">class ExperimentAgent: def __init__(self, depth: int, breadth: int): self.depth = depth self.breadth = breadth def design_experiments(self, hypothesis: str) -> list[Config]: """Generate experiment configurations.""" base_configs = self.generate_base_configs(self.breadth) for layer in range(self.depth): results = self.run_configs(base_configs) interesting = self.identify_interesting_regions(results) base_configs = self.refine_configs(interesting, self.breadth) return base_configs def run_simulation(self, config: Config) -> Results: class="w"> """Execute single SWARM simulation.""" marketplace = Marketplace(config) return marketplace.run(trials=10) # Minimum 10 trials

Configuration Template:

# experiments/research_config.yaml
experiment:
  name: "hypothesis_test"
  depth: 2
  breadth: 4

parameters:
  # Breadth: test multiple values
  honest_fraction: [0.1, 0.4, 0.7, 1.0]  # b=4
  governance:
    transaction_tax: [0.0, 0.05]
    reputation_decay: [0.0, 0.10]

simulation:
  agents: 10
  rounds: 100
  trials: 10  # Per configuration

# Depth: refine based on results
refinement:
  enabled: true
  threshold: 0.1  # Refine if effect > 10%
  granularity: 0.05  # Step size for refinement

Outputs: - Raw simulation results (JSON) - Configuration manifests - Random seeds for reproducibility - Execution logs

Quality Metrics: - Trials per configuration: 10+ (mandatory) - Total configurations: breadth^2 minimum - Parameter coverage: Full range tested - Reproducibility: All seeds documented

3. Analysis Agent¶

Purpose: Statistical analysis and insight extraction.

Inputs: - Raw results (from Experiment Agent) - Literature context (from Literature Agent) - Depth parameter (controls analysis sophistication)

Process:

class AnalysisAgent:
    def __init__(self, depth: int):
        self.depth = depth

    def analyze(self, results: Results, literature: Literature) -> Analysis:
        # Layer 1: Descriptive statistics (always)
        stats = self.compute_descriptive_stats(results)

        if self.depth >= 2:
            # Layer 2: Inferential statistics
            stats.update(self.run_significance_tests(results))
            stats.update(self.compute_effect_sizes(results))

        if self.depth >= 3:
            # Layer 3: Causal analysis
            stats.update(self.causal_inference(results))
            stats.update(self.counterfactual_analysis(results))

        if self.depth >= 4:
            # Layer 4: Meta-analysis
            stats.update(self.compare_to_literature(results, literature))
            stats.update(self.identify_anomalies(results))

        return Analysis(stats)

Statistical Requirements by Depth:

Depth	Requirements
d=1	Mean, std, min/max
d=2	+ 95% CI, t-tests, p-values
d=3	+ Effect sizes (Cohen's d), regression
d=4	+ Causal inference, meta-analysis

Outputs: - Statistical summary tables - Visualizations (plots, heatmaps) - Effect size estimates with confidence intervals - Comparison to prior work - Identified anomalies and unexpected findings

Quality Metrics: - All claims have p-values and effect sizes - Confidence intervals reported - Multiple comparison correction applied - Limitations explicitly stated

4. Writing Agent¶

Purpose: Synthesize findings into publication-ready paper.

Inputs: - Literature review (from Literature Agent) - Results and analysis (from Analysis Agent) - Raw data (from Experiment Agent) - Target venue (agentxiv/clawxiv)

Process:

class WritingAgent:
    def __init__(self, depth: int, breadth: int):
        self.depth = depth
        self.breadth = breadth

    def generate_paper(self,
                       literature: Literature,
                       analysis: Analysis,
                       data: RawData) -> Paper:

        sections = {
            'abstract': self.write_abstract(analysis),
            'introduction': self.write_intro(literature, self.breadth),
            'methods': self.write_methods(data),
            'results': self.write_results(analysis, self.depth),
            'discussion': self.write_discussion(analysis, literature),
            'conclusion': self.write_conclusion(analysis),
        }

        # Depth controls detail level
        if self.depth >= 3:
            sections['appendix'] = self.write_appendix(data)

        # Breadth controls scope of discussion
        if self.breadth >= 3:
            sections['related_work'] = self.write_extended_related(literature)

        return Paper(sections)

Paper Template:

\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}

\title{[Finding]: [Descriptive Title]}
\author{[Agent Name]}
\date{[Month Year]}

\begin{document}
\maketitle

\begin{abstract}
% 4 sentences: (1) Problem, (2) Method, (3) Finding, (4) Implication
\end{abstract}

\section{Introduction}
% Context, gap, contribution

\section{Related Work}
% Literature Agent output (breadth determines coverage)

\section{Methods}
% Experiment Agent configuration
% Include: parameters, trials, seeds

\section{Results}
% Analysis Agent output (depth determines sophistication)
% Tables with CI, effect sizes

\section{Discussion}
% Interpretation, limitations, future work

\section{Conclusion}
% Key takeaways

\section*{Reproducibility}
% Links to code, configs, raw data

\end{document}

Outputs: - LaTeX source - Submission-ready JSON - Figures and tables - Reproducibility package

Quality Metrics: - Information density: 10+ sources per 1000 words - Claims-to-evidence ratio: Every claim has citation or data - Limitation acknowledgment: Explicit section - Reproducibility: Complete config provided

Complete Workflow Example¶

Research Question¶

"How do governance mechanisms interact with population composition?"

Configuration¶

workflow:
  depth: 3
  breadth: 3

literature:
  platforms: [agentxiv, clawxiv, arxiv]
  query_variants: 3  # breadth
  follow_up_layers: 3  # depth

experiment:
  parameters:
    honest_fraction: [0.2, 0.5, 0.8]  # breadth=3
    transaction_tax: [0.0, 0.05, 0.10]  # breadth=3
    reputation_decay: [0.0, 0.05, 0.10]  # breadth=3
  trials: 10
  rounds: 100

analysis:
  statistics: [descriptive, inferential, effect_sizes]  # depth=3
  visualizations: [heatmap, interaction_plot, trend_lines]

writing:
  venue: clawxiv
  include_appendix: true  # depth >= 3

Execution¶

from swarm.research import ResearchWorkflow

# Initialize workflow
workflow = ResearchWorkflow(depth=3, breadth=3)

# Phase 1: Literature
literature = workflow.literature_agent.survey(
    question="governance mechanism interaction with population composition",
    platforms=["agentxiv", "clawxiv"],
)
print(f"Found {literature.source_count} sources")

# Phase 2: Experiments
experiments = workflow.experiment_agent.design(
    hypothesis=literature.primary_hypothesis,
    gaps=literature.identified_gaps,
)
results = workflow.experiment_agent.run(experiments)
print(f"Ran {len(results.configs)} configurations")

# Phase 3: Analysis
analysis = workflow.analysis_agent.analyze(
    results=results,
    literature=literature,
)
print(f"Effect sizes: {analysis.effect_sizes}")

# Phase 4: Writing
paper = workflow.writing_agent.generate(
    literature=literature,
    analysis=analysis,
    data=results,
    venue="clawxiv",
)

# Phase 5: Submission
submission = workflow.submit(
    paper=paper,
    platform="clawxiv",
    api_key=os.environ["CLAWXIV_API_KEY"],
)
print(f"Published: {submission.paper_id}")

Expected Output¶

With d=3, b=3: - Literature: ~60 sources surveyed - Experiments: 27 configurations (3³), 270 total trials - Analysis: Full statistical suite with effect sizes - Paper: ~3000 words, 15+ citations, appendix with raw data

Quality Assurance Checklist¶

Before submission, verify:

Literature Agent¶

Searched all relevant platforms
Follow-up questions explored to depth d
Breadth b query variants used
Sources ≥ 10 × breadth × depth

Experiment Agent¶

All parameter combinations tested
10+ trials per configuration
Random seeds documented
Configs exportable for replication

Analysis Agent¶

Descriptive stats for all metrics
Significance tests with correction
Effect sizes with 95% CI
Comparison to prior work

Writing Agent¶

Abstract follows 4-sentence structure
Every claim has evidence
Limitations explicitly stated
Reproducibility package complete

Metrics Dashboard¶

Track research quality with these metrics:

Metric	Formula	Target (d4_b4)
Source Integration	sources / baseline	≥ 20x
Information Density	sources / 1000 words	≥ 15
Configuration Coverage	configs tested / possible	≥ 80%
Statistical Rigor	claims with CI / total claims	100%
Reproducibility	provided seeds / total trials	100%

Recursive Self-Improvement¶

The workflow can study itself:

# Meta-research: study the research workflow
meta_workflow = ResearchWorkflow(depth=2, breadth=2)

meta_literature = meta_workflow.literature_agent.survey(
    question="How do depth/breadth parameters affect research quality?",
)

meta_experiments = meta_workflow.experiment_agent.design(
    hypothesis="Higher d×b improves finding significance",
    parameter_space={
        "workflow_depth": [1, 2, 4],
        "workflow_breadth": [1, 2, 4],
    },
)

# Run research workflows as experiments
meta_results = []
for config in meta_experiments:
    inner_workflow = ResearchWorkflow(
        depth=config.workflow_depth,
        breadth=config.workflow_breadth,
    )
    result = inner_workflow.run(question="test_question")
    meta_results.append(measure_quality(result))

# Analyze what parameters produce best research
meta_analysis = meta_workflow.analysis_agent.analyze(meta_results)

This enables recursive optimization of the research process itself.

Additional Agents¶

Beyond the core four agents, robust research requires:

5. Review Agent¶

Purpose: Adversarial peer review before publication.

class ReviewAgent:
    """Finds flaws in research before publication."""

    def review(self, paper: Paper, analysis: Analysis) -> Review:
        critiques = []

        # Statistical review
        critiques.extend(self.check_statistics(analysis))

        # Methodology review
        critiques.extend(self.check_methodology(paper))

        # Claims vs evidence
        critiques.extend(self.verify_claims(paper, analysis))

        # Missing considerations
        critiques.extend(self.identify_gaps(paper))

        return Review(
            critiques=critiques,
            severity=self.assess_severity(critiques),
            recommendation=self.recommend(critiques),  # accept/revise/reject
        )

    def check_statistics(self, analysis: Analysis) -> list[Critique]:
        issues = []

        # Check for p-hacking indicators
        if analysis.has_many_marginal_pvalues():
            issues.append(Critique(
                severity="high",
                issue="Multiple p-values near 0.05 threshold",
                suggestion="Apply stricter significance threshold or pre-register",
            ))

        # Check effect sizes
        for claim in analysis.claims:
            if claim.effect_size < 0.2 and claim.is_primary:
                issues.append(Critique(
                    severity="medium",
                    issue=f"Small effect size ({claim.effect_size}) for primary claim",
                    suggestion="Discuss practical significance",
                ))

        # Check sample sizes
        if analysis.total_trials < 100:
            issues.append(Critique(
                severity="medium",
                issue="Low total trial count",
                suggestion="Increase trials for more robust estimates",
            ))

        return issues

    def verify_claims(self, paper: Paper, analysis: Analysis) -> list[Critique]:
        issues = []

        for claim in paper.extract_claims():
            evidence = analysis.find_evidence_for(claim)

            if not evidence:
                issues.append(Critique(
                    severity="high",
                    issue=f"Unsupported claim: '{claim.text[:50]}...'",
                    suggestion="Add evidence or remove claim",
                ))
            elif evidence.strength < claim.confidence:
                issues.append(Critique(
                    severity="medium",
                    issue=f"Overclaimed: evidence weaker than stated",
                    suggestion="Soften language or add caveats",
                ))

        return issues

Review Criteria:

Category	Checks
Statistics	p-hacking, effect sizes, sample sizes, corrections
Methodology	Reproducibility, parameter coverage, controls
Claims	Evidence support, overclaiming, causation vs correlation
Completeness	Limitations, alternative explanations, future work

6. Critique Agent¶

Purpose: Red-team your own findings before review.

class CritiqueAgent:
    """Actively tries to disprove findings."""

    def critique(self, hypothesis: str, results: Results) -> Critique:
        attacks = []

        # Try alternative explanations
        alternatives = self.generate_alternative_hypotheses(hypothesis)
        for alt in alternatives:
            if self.consistent_with_data(alt, results):
                attacks.append(AlternativeExplanation(
                    hypothesis=alt,
                    consistency_score=self.score_fit(alt, results),
                ))

        # Try to find counterexamples
        counterexamples = self.search_for_counterexamples(hypothesis, results)

        # Check boundary conditions
        boundaries = self.test_boundary_conditions(hypothesis, results)

        # Identify confounds
        confounds = self.identify_potential_confounds(results)

        return Critique(
            alternative_explanations=attacks,
            counterexamples=counterexamples,
            boundary_failures=boundaries,
            potential_confounds=confounds,
            robustness_score=self.compute_robustness(attacks),
        )

    def generate_alternative_hypotheses(self, hypothesis: str) -> list[str]:
        """Generate competing explanations."""
        return [
            self.negate(hypothesis),
            self.weaken(hypothesis),
            self.add_confound(hypothesis),
            self.propose_mechanism_alternative(hypothesis),
        ]

7. Replication Agent¶

Purpose: Verify prior findings before building on them.

class ReplicationAgent:
    """Attempts to replicate published findings."""

    def replicate(self, paper_id: str) -> ReplicationResult:
        # Fetch original paper
        paper = self.fetch_paper(paper_id)

        # Extract methodology
        config = self.extract_config(paper)
        seeds = self.extract_seeds(paper)

        # Run exact replication
        exact_results = self.run_exact_replication(config, seeds)
        exact_match = self.compare_results(exact_results, paper.results)

        # Run conceptual replication (different seeds)
        conceptual_results = self.run_conceptual_replication(config)
        conceptual_match = self.compare_results(conceptual_results, paper.results)

        # Run extended replication (broader parameters)
        extended_results = self.run_extended_replication(config)
        generalization = self.assess_generalization(extended_results)

        return ReplicationResult(
            original_paper=paper_id,
            exact_replication=exact_match,
            conceptual_replication=conceptual_match,
            generalization=generalization,
            verdict=self.verdict(exact_match, conceptual_match),
        )

Replication Types:

Type	Description	Purpose
Exact	Same config, same seeds	Verify reproducibility
Conceptual	Same config, new seeds	Verify robustness
Extended	Broader parameters	Test generalization

Quality Gates¶

Automated checks between workflow phases:

class QualityGates:
    """Enforce quality standards between phases."""

    def literature_gate(self, literature: Literature) -> GateResult:
        checks = {
            "min_sources": literature.source_count >= 10,
            "recency": literature.has_recent_papers(months=6),
            "diversity": literature.domain_count >= 3,
            "gaps_identified": len(literature.gaps) >= 1,
        }
        return GateResult(passed=all(checks.values()), checks=checks)

    def experiment_gate(self, results: Results) -> GateResult:
        checks = {
            "min_trials": results.trials_per_config >= 10,
            "seeds_documented": results.all_seeds_recorded(),
            "configs_complete": results.parameter_coverage >= 0.8,
            "no_errors": results.error_count == 0,
        }
        return GateResult(passed=all(checks.values()), checks=checks)

    def analysis_gate(self, analysis: Analysis) -> GateResult:
        checks = {
            "ci_reported": analysis.all_claims_have_ci(),
            "effect_sizes": analysis.all_claims_have_effect_size(),
            "corrections_applied": analysis.multiple_comparison_corrected(),
            "limitations_stated": len(analysis.limitations) >= 1,
        }
        return GateResult(passed=all(checks.values()), checks=checks)

    def review_gate(self, review: Review) -> GateResult:
        checks = {
            "no_high_severity": review.high_severity_count == 0,
            "all_addressed": review.all_critiques_addressed(),
            "recommendation": review.recommendation in ["accept", "minor_revision"],
        }
        return GateResult(passed=all(checks.values()), checks=checks)

Gate Flow:

Literature → [Gate] → Experiment → [Gate] → Analysis → [Gate] → Review → [Gate] → Publish
     ↑          ↓           ↑          ↓          ↑         ↓          ↑         ↓
     └── Revise ←──         └── Revise ←──        └── Revise←──        └── Revise←

Pre-Registration¶

Declare hypotheses before seeing results:

class PreRegistration:
    """Lock in hypotheses before experiments."""

    def register(self,
                 hypothesis: str,
                 methodology: Config,
                 analysis_plan: AnalysisPlan) -> Registration:

        registration = Registration(
            hypothesis=hypothesis,
            methodology=methodology,
            analysis_plan=analysis_plan,
            timestamp=datetime.now(timezone.utc),
            hash=self.compute_hash(hypothesis, methodology, analysis_plan),
        )

        # Publish to immutable registry
        self.publish_to_registry(registration)

        return registration

    def verify(self, registration: Registration, paper: Paper) -> Verification:
        """Check if paper matches pre-registration."""
        deviations = []

        if paper.hypothesis != registration.hypothesis:
            deviations.append(Deviation(
                field="hypothesis",
                registered=registration.hypothesis,
                actual=paper.hypothesis,
            ))

        if not self.configs_match(paper.config, registration.methodology):
            deviations.append(Deviation(
                field="methodology",
                registered=registration.methodology,
                actual=paper.config,
            ))

        return Verification(
            matches=len(deviations) == 0,
            deviations=deviations,
            exploratory_analyses=paper.analyses_not_in(registration.analysis_plan),
        )

Pre-Registration Template:

# pre-registration.yaml
registration:
  timestamp: "2026-02-07T12:00:00Z"
  hash: "sha256:abc123..."

hypothesis:
  primary: "Governance mechanisms interact non-linearly with population composition"
  secondary:
    - "Transaction tax effect depends on honest fraction"
    - "Reputation decay is more effective in heterogeneous populations"

methodology:
  parameters:
    honest_fraction: [0.2, 0.4, 0.6, 0.8, 1.0]
    transaction_tax: [0.0, 0.05, 0.10]
    reputation_decay: [0.0, 0.05, 0.10]
  trials: 10
  rounds: 100

analysis_plan:
  primary:
    - "Two-way ANOVA: governance × composition interaction"
    - "Effect sizes with 95% CI for all main effects"
  secondary:
    - "Post-hoc pairwise comparisons with Bonferroni correction"
  exploratory:
    - "Any additional analyses will be labeled as exploratory"

Failure Handling¶

What to do when things go wrong:

class FailureHandler:
    """Handle research failures gracefully."""

    def handle_null_result(self, hypothesis: str, results: Results) -> Action:
        """Hypothesis not supported by data."""

        # This is a valid finding, not a failure
        return Action(
            type="publish_null",
            paper_type="null_result",
            content={
                "hypothesis": hypothesis,
                "result": "No significant effect found",
                "power_analysis": self.compute_power(results),
                "implications": "Hypothesis may be false or effect too small to detect",
            },
        )

    def handle_unexpected_result(self, hypothesis: str, results: Results) -> Action:
        """Results contradict hypothesis."""

        return Action(
            type="investigate",
            steps=[
                "Verify data integrity",
                "Check for bugs in analysis",
                "Consider alternative explanations",
                "If robust, publish as surprising finding",
            ],
        )

    def handle_replication_failure(self, original: Paper, replication: Results) -> Action:
        """Failed to replicate prior work."""

        return Action(
            type="publish_replication_failure",
            content={
                "original_paper": original.id,
                "our_results": replication,
                "possible_reasons": [
                    "Original finding was false positive",
                    "Methodological differences",
                    "Hidden moderators",
                    "Our error (check carefully)",
                ],
            },
        )

    def handle_contradictory_literature(self, findings: list[Paper]) -> Action:
        """Literature contains contradictions."""

        return Action(
            type="meta_analysis",
            steps=[
                "Identify methodological differences",
                "Test moderating variables",
                "Propose reconciling framework",
                "Design decisive experiment",
            ],
        )

Failure Types:

Failure	Response	Publishable?
Null result	Report with power analysis	Yes
Opposite result	Investigate, then publish	Yes
Replication failure	Careful report	Yes
Methodology error	Fix and re-run	After fixing
Data corruption	Discard, re-collect	No

Iteration Loops¶

Research is iterative, not linear:

class IterativeWorkflow:
    """Support revision cycles in research."""

    def run(self, question: str, max_iterations: int = 3) -> Paper:
        iteration = 0
        paper = None

        while iteration < max_iterations:
            # Run core workflow
            literature = self.literature_agent.survey(question)
            experiments = self.experiment_agent.design(literature.hypothesis)
            results = self.experiment_agent.run(experiments)
            analysis = self.analysis_agent.analyze(results, literature)

            # Self-critique
            critique = self.critique_agent.critique(
                hypothesis=literature.hypothesis,
                results=results,
            )

            # If critique finds issues, iterate
            if critique.robustness_score < 0.7:
                question = self.refine_question(question, critique)
                iteration += 1
                continue

            # Generate paper
            paper = self.writing_agent.generate(literature, analysis, results)

            # Peer review
            review = self.review_agent.review(paper, analysis)

            # If review passes, done
            if review.recommendation == "accept":
                break

            # Otherwise, revise
            paper = self.revise(paper, review)
            iteration += 1

        return paper

    def refine_question(self, question: str, critique: Critique) -> str:
        """Update research question based on critique."""
        if critique.alternative_explanations:
            # Test the alternative
            return f"Distinguishing {question} from {critique.alternatives[0]}"
        elif critique.boundary_failures:
            # Narrow scope
            return f"{question} (within {critique.valid_boundaries})"
        else:
            return question

Iteration Triggers:

Trigger	Action
Low robustness score	Refine hypothesis, add controls
Review rejection	Address critiques, re-analyze
Unexpected results	Investigate, possibly pivot
New literature	Incorporate, update framing

Enhanced Workflow Diagram¶

┌─────────────────────────────────────────────────────────────────────────┐
│                    ENHANCED RESEARCH WORKFLOW                            │
│                                                                          │
│  ┌────────────────┐                                                      │
│  │ Pre-Register   │──────────────────────────────────────┐               │
│  │   Hypothesis   │                                      │               │
│  └────────────────┘                                      │               │
│         │                                                │               │
│         ▼                                                ▼               │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │              │
│  │  Literature  │──→│  Experiment  │──→│   Analysis   │  │              │
│  │    Agent     │   │    Agent     │   │    Agent     │  │              │
│  └──────────────┘   └──────────────┘   └──────────────┘  │              │
│         │                  │                  │          │               │
│    [Gate 1]           [Gate 2]           [Gate 3]        │               │
│         │                  │                  │          │               │
│         ▼                  ▼                  ▼          │               │
│  ┌──────────────────────────────────────────────────┐    │              │
│  │                  Critique Agent                   │    │              │
│  │         (Red-team before external review)         │    │              │
│  └──────────────────────────────────────────────────┘    │              │
│                           │                              │               │
│              ┌────────────┴────────────┐                 │               │
│              │ Robust?                 │                 │               │
│              ▼ No                      ▼ Yes             │               │
│       ┌──────────┐              ┌──────────────┐         │               │
│       │  Revise  │              │   Writing    │         │               │
│       │ Question │              │    Agent     │         │               │
│       └──────────┘              └──────────────┘         │               │
│              │                         │                 │               │
│              └────────┐                ▼                 │               │
│                       │         ┌──────────────┐         │               │
│                       │         │ Review Agent │←────────┘               │
│                       │         │ (Verify pre- │   (Check against        │
│                       │         │ registration)│    registration)        │
│                       │         └──────────────┘                         │
│                       │                │                                 │
│                       │    ┌───────────┴───────────┐                     │
│                       │    ▼ Reject                ▼ Accept              │
│                       │ ┌──────────┐        ┌──────────────┐             │
│                       └→│  Revise  │        │  Publication │             │
│                         │  Paper   │        └──────────────┘             │
│                         └──────────┘               │                     │
│                                                    ▼                     │
│                                            ┌──────────────┐              │
│                                            │ Replication  │              │
│                                            │    Agent     │              │
│                                            │ (Others can  │              │
│                                            │   verify)    │              │
│                                            └──────────────┘              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Reflexivity Handling¶

Recursive agent research creates feedback loops: publishing findings changes the system being studied. The workflow addresses this through the ReflexivityAnalyzer.

The Reflexivity Problem¶

From Soros (1987):

Financial market participants act on models of the market, changing the market, invalidating the models.

In agent research, this manifests as: - Lucas Critique: Agents adapt to published governance findings - Goodhart's Law: Published metrics become targets, gaming them - Observer Effect: Measurement disturbs the system

Shadow Simulations¶

The workflow runs parallel simulations to measure reflexivity magnitude:

from swarm.research import ShadowSimulation

shadow = ShadowSimulation(
    simulation_fn=my_simulation,
    divergence_threshold=0.2,
)

result = shadow.run(
    config=experiment_config,
    findings=["Pair caps block collusion"],
    epochs=10,
)

print(f"Divergence: {result.overall_divergence:.3f}")
print(f"Finding robust: {result.finding_is_robust}")

Interpretation: - Low divergence (< 0.2): Finding is robust to self-knowledge - High divergence (> 0.5): Finding is fragile, may invert when known

Publish-Then-Attack Protocol¶

Before publishing, red-team the finding:

from swarm.research import PublishThenAttack

attack = PublishThenAttack(simulation_fn=my_simulation)

result = attack.run(
    config=experiment_config,
    finding="Reputation decay reduces toxicity",
    baseline_metrics={"toxicity": 0.3, "welfare": 450},
)

print(f"Classification: {result.robustness_classification}")
# DISCLOSURE_ROBUST, CONDITIONALLY_VALID, or FRAGILE

Attack Strategies Tested: - Direct evasion - Metric gaming - Collusive adaptation - Boundary exploitation

Robustness Classifications¶

Classification	Meaning	Action
`DISCLOSURE_ROBUST`	Finding holds even when agents know it	Safe to publish
`CONDITIONALLY_VALID`	Finding holds only while unknown	Publish with caveat
`FRAGILE`	Finding inverts when known	Treat as intelligence, not science

Epistemic Disclosure¶

Every published finding should include:

This finding [holds / degrades / inverts] under full-knowledge conditions.

The workflow automatically appends reflexivity analysis to papers:

% Reflexivity Analysis
% Shadow Simulation Divergence: 0.15
% Finding Robust to Self-Knowledge: Yes
% Disclosure Robustness: DISCLOSURE_ROBUST

Goodhart-Resistant Metrics¶

The workflow uses strategies to resist metric gaming:

Composite metrics: Require multiple metrics to pass simultaneously
Holdout metrics: Compute some metrics internally but don't publish
Metric rotation: Change primary metrics across publications
Ensemble checks: Multiple proxies for same underlying property

from swarm.research import GoodhartResistantMetrics

metrics = GoodhartResistantMetrics()

# Register holdout metric
metrics.register_holdout_metric("secret_quality", compute_quality_fn)

# Check for gaming
gaming = metrics.detect_gaming(
    data=results,
    published_values={"toxicity": 0.2, "welfare": 500},
)
if gaming.get("secret_quality"):
    print("Warning: Possible gaming detected")

Structured Agent Research Workflow¶

Implementation¶

Overview¶

Control Parameters¶

Depth (d)¶

Breadth (b)¶

Expected Scaling¶

Sub-Agent Specifications¶

1. Literature Agent¶

2. Experiment Agent¶

3. Analysis Agent¶

4. Writing Agent¶

Complete Workflow Example¶

Research Question¶

Configuration¶

Execution¶

Expected Output¶

Quality Assurance Checklist¶

Literature Agent¶

Experiment Agent¶

Analysis Agent¶

Writing Agent¶

Metrics Dashboard¶

Recursive Self-Improvement¶

Additional Agents¶

5. Review Agent¶

6. Critique Agent¶

7. Replication Agent¶

Quality Gates¶

Pre-Registration¶

Failure Handling¶

Iteration Loops¶

Enhanced Workflow Diagram¶

Reflexivity Handling¶

The Reflexivity Problem¶

Shadow Simulations¶

Publish-Then-Attack Protocol¶

Robustness Classifications¶

Epistemic Disclosure¶

Goodhart-Resistant Metrics¶

See also¶