Unified Provenance Trace Schema¶

This document describes the unified provenance trace schema and exporters added to SWARM's event logging system.

Overview¶

The provenance trace schema extends the existing Event model with deterministic IDs and parent-child relationships, enabling:

Complete audit trails for tool calls, artifacts, audits, and interventions
Deterministic IDs that remain stable across replays
CSV export for analysis with standard data tools
Parent-child linking to reconstruct causal chains

Key Features¶

1. Provenance Fields on Events¶

All events now support the following optional fields:

Field	Type	Description
`provenance_id`	`str`	Deterministic 12-char hex ID for this event in provenance chain
`parent_event_id`	`str`	Links to parent event for causal chains
`tool_call_id`	`str`	ID for tool call if this is a tool execution
`artifact_id`	`str`	ID for artifact if this event produces/references one
`audit_id`	`str`	ID for audit if this is an audit event
`intervention_id`	`str`	ID for governance intervention

2. Deterministic ID Generation¶

from swarm.models.events import generate_deterministic_id

# Generate a stable ID based on event attributes
prov_id = generate_deterministic_id(
    event_type="tool_call",
    agent_id="agent_1",
    timestamp=timestamp,
    tool_name="query_db",
)

Deterministic IDs are: - Stable: Same inputs always generate the same ID - Compact: 12-character hex strings (first 12 chars of SHA-256) - Collision-resistant: Based on cryptographic hash

3. Provenance-Aware Event Factories¶

Convenience functions for creating events with provenance tracking:

from swarm.models.events import (
    tool_call_executed_event,
    artifact_created_event,
    audit_event,
    intervention_event,
    agent_message_event,
)

# Tool call with automatic ID generation
tool_evt = tool_call_executed_event(
    agent_id="agent_1",
    tool_name="query_database",
    arguments={"query": "SELECT 1"},
    result={"rows": 1},
    success=True,
    parent_event_id=parent_id,  # Optional: link to parent
)

# Artifact creation
artifact_evt = artifact_created_event(
    agent_id="agent_1",
    artifact_id="artifact_123",
    artifact_type="memory",
    artifact_data={"title": "Test", "summary": "..."},
    parent_event_id=tool_evt.provenance_id,
)

# Audit event
audit_evt = audit_event(
    agent_id="auditor",
    audit_type="safety",
    audit_result={"decision": "approve", "risk": 0.1},
    audited_event_id=tool_evt.event_id,
    parent_event_id=tool_evt.provenance_id,
)

# Intervention event
intervention_evt = intervention_event(
    agent_id="governance",
    intervention_type="freeze",
    intervention_action="freeze_agent",
    intervention_reason="suspicious_behavior",
    affected_agents=["agent_2"],
    parent_event_id=audit_evt.provenance_id,
)

4. CSV Export¶

Two export methods are available:

Full CSV Export¶

Exports all events with all fields:

from swarm.logging.event_log import EventLog

log = EventLog("events.jsonl")

# Export with standard columns
log.to_csv("events.csv")

# Include payload as JSON string
log.to_csv("events.csv", include_payload=True)

Columns: event_id, timestamp, event_type, agent_id, provenance_id, parent_event_id, tool_call_id, artifact_id, audit_id, intervention_id, etc.

Provenance-Focused CSV Export¶

Exports only events with provenance tracking, with extracted key fields:

log.to_provenance_csv("provenance.csv")

Columns: provenance_id, event_type, timestamp, agent_id, parent_event_id, tool_call_id, artifact_id, audit_id, intervention_id, tool_name, artifact_type, audit_type, intervention_type, success, payload_summary

This format is optimized for: - Reconstructing provenance chains - Analyzing tool usage patterns - Auditing governance interventions - Tracking artifact lineage

Usage Examples¶

Basic Provenance Chain¶

from swarm.logging.event_log import EventLog
from swarm.models.events import tool_call_executed_event, audit_event

log = EventLog("log.jsonl")

# 1. Tool call
tool_evt = tool_call_executed_event(
    agent_id="agent_1",
    tool_name="risky_op",
    arguments={"danger": "high"},
    result={"status": "executed"},
    success=True,
)
log.append(tool_evt)

# 2. Audit the tool call
audit_evt = audit_event(
    agent_id="auditor",
    audit_type="safety",
    audit_result={"flagged": True, "risk": 0.9},
    audited_event_id=tool_evt.event_id,
    parent_event_id=tool_evt.provenance_id,  # Link to tool call
)
log.append(audit_evt)

# 3. Export to CSV
log.to_provenance_csv("provenance.csv")

Querying Provenance Chains¶

# Read back from JSONL
for event in log.replay():
    if event.provenance_id:
        print(f"Event: {event.event_type}")
        print(f"  Provenance ID: {event.provenance_id}")
        print(f"  Parent: {event.parent_event_id or 'None'}")

        # Check specific IDs
        if event.tool_call_id:
            print(f"  Tool Call: {event.tool_call_id}")
        if event.audit_id:
            print(f"  Audit: {event.audit_id}")

Analyzing CSV Output¶

The CSV exports are designed to work with standard data tools:

import pandas as pd

# Load provenance CSV
df = pd.read_csv("provenance.csv")

# Count tool calls by agent
tool_calls = df[df["tool_call_id"] != ""]
print(tool_calls.groupby("agent_id").size())

# Find audit chains
audits = df[df["audit_id"] != ""]
print(audits[["provenance_id", "parent_event_id", "audit_type"]])

# Trace interventions back to their triggers
interventions = df[df["intervention_id"] != ""]
for _, row in interventions.iterrows():
    parent_id = row["parent_event_id"]
    parent = df[df["provenance_id"] == parent_id]
    if not parent.empty:
        print(f"Intervention {row['provenance_id']} triggered by {parent.iloc[0]['event_type']}")

Backward Compatibility¶

The provenance fields are optional and backward compatible:

Old events without provenance fields deserialize correctly
Existing code continues to work without changes
JSONL logs can mix old and new event formats
CSV export handles missing fields gracefully

Testing¶

Comprehensive tests are in tests/test_provenance_trace.py:

pytest tests/test_provenance_trace.py -v

Test coverage includes: - Deterministic ID generation - Provenance-aware event factories - Event serialization/deserialization - CSV export formats - Provenance chain reconstruction - Backward compatibility

Example Script¶

Run the full example demonstrating all features:

python examples/provenance_tracking.py

This creates a complete provenance chain with tool calls, artifacts, audits, and interventions, then exports to both JSONL and CSV formats.

Design Rationale¶

Why Deterministic IDs?¶

Reproducibility: Same simulation with same seed generates same IDs
Stable references: IDs remain valid across log replays
Efficient: No need for centralized ID allocation
Verifiable: Anyone can recompute IDs from event data

Why Parent-Child Linking?¶

Causal chains: Trace root causes of interventions
Audit trails: Answer "what triggered this?"
Debugging: Understand event sequences
Governance: Justify decisions with evidence chain

Why Two CSV Formats?¶

Full CSV: For comprehensive analysis and data warehousing
Provenance CSV: For focused investigation of specific chains

Different use cases need different views of the same data.

Future Enhancements¶

Potential future additions:

Graph export for visualization (e.g., DOT/Graphviz)
Provenance query DSL for complex filtering
Automatic chain validation (detect cycles, orphans)
Integration with existing metrics exporters
Provenance-aware event replay with filtering

Event Schema — Full Event model
Event Log — JSONL logger
Export Utilities — Simulation result exporters
Simulated API Schema — Alternative provenance approach