Plan: llama.cpp Integration for Local CPU Inference

Goal

Enable SWARM to use local LLaMA models via llama.cpp for CPU-only inference (e.g. on a MacBook), without requiring any cloud API keys.

Architecture Decision

Primary path (Option A): llama-server as an OpenAI-compatible HTTP endpoint.

SWARM already has _call_openai_compatible_async in LLMAgent — llama-server exposes /v1/chat/completions with the same schema. This is a near-zero-friction integration.

Secondary path (Option B): In-process llama-cpp-python bindings. Useful for maximum determinism (pinned threads, KV cache, seeds) and zero network overhead. Implemented as a separate code path behind the same LLMProvider.LLAMA_CPP enum value, selected via config.

┌─────────────────────────────────────────────────────┐
│ SWARM Scenario YAML                                  │
│   provider: llama_cpp                                │
│   model: llama-3.2-3b-instruct                       │
│   base_url: http://localhost:8080/v1  (Option A)     │
│   OR                                                 │
│   model_path: ./models/model.gguf     (Option B)     │
└────────────────────────┬────────────────────────────┘
          ┌──────────────┴──────────────┐
          │                             │
    Option A (HTTP)            Option B (in-process)
          │                             │
  ┌───────▼────────┐          ┌────────▼─────────┐
  │ _call_openai_  │          │ _call_llama_cpp_ │
  │ compatible_    │          │ direct_async()   │
  │ async()        │          │                  │
  │ (existing)     │          │ llama-cpp-python │
  └───────┬────────┘          └────────┬─────────┘
          │                             │
  ┌───────▼────────┐          ┌────────▼─────────┐
  │ llama-server   │          │ llama_cpp.Llama  │
  │ :8080/v1/...   │          │ (in-process)     │
  └────────────────┘          └──────────────────┘

Implementation Steps

Step 1: Add LLAMA_CPP provider to LLMProvider enum

File: swarm/agents/llm_config.py

  • Add LLAMA_CPP = "llama_cpp" to the LLMProvider enum.
  • Add new optional fields to LLMConfig (see the sketch after this list):
      • model_path: Optional[str] = None — path to a local .gguf file (Option B only).
      • n_ctx: int = 4096 — context window size for in-process mode.
      • n_threads: Optional[int] = None — CPU thread count (defaults to auto-detect).
      • llama_seed: int = -1 — sampling seed for determinism (-1 = random).
  • In __post_init__, set a default base_url = "http://localhost:8080/v1" when the provider is LLAMA_CPP and no base_url is provided.
  • Skip API key validation for LLAMA_CPP (as for Ollama).
  • Add LLAMA_CPP cost entries to LLMUsageStats (all 0.0, since inference is local).
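
A minimal sketch of these additions, with the surrounding enum and dataclass reduced to only the relevant members (field names follow the list above; the actual SWARM definitions contain more fields):

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class LLMProvider(str, Enum):
    # ... existing members (OPENAI, OLLAMA, ...) elided
    LLAMA_CPP = "llama_cpp"

@dataclass
class LLMConfig:
    provider: LLMProvider = LLMProvider.LLAMA_CPP
    base_url: Optional[str] = None
    # New fields for llama.cpp support:
    model_path: Optional[str] = None   # Option B: path to a local .gguf file
    n_ctx: int = 4096                  # context window for in-process mode
    n_threads: Optional[int] = None    # CPU threads; None = auto-detect
    llama_seed: int = -1               # sampling seed; -1 = random

    def __post_init__(self) -> None:
        # Option A default: point at a local llama-server endpoint.
        if self.provider is LLMProvider.LLAMA_CPP and self.base_url is None:
            self.base_url = "http://localhost:8080/v1"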

Step 2: Wire routing in LLMAgent._call_llm_async

File: swarm/agents/llm_agent.py

  • Add an LLMProvider.LLAMA_CPP case to the provider dispatch:
      • If model_path is set → call the new _call_llama_cpp_direct_async() (Option B).
      • Otherwise → call the existing _call_openai_compatible_async() (Option A).
  • Add the _call_llama_cpp_direct_async() method (sketched below):
      • Lazy-load a llama_cpp.Llama model instance (one per agent, cached on self).
      • Use create_chat_completion() for inference.
      • Extract the text and token counts from the response.
      • Run in a thread pool (loop.run_in_executor), since llama-cpp-python is blocking.
  • In _get_api_key_from_env(), return a dummy key for LLAMA_CPP (llama-server ignores it, but the OpenAI client requires a non-empty string).
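
A condensed sketch of the new method, assuming the config fields from Step 1, messages already in OpenAI chat format, and a self.config attribute holding the LLMConfig; the cache attribute self._llama is illustrative, not an existing SWARM internal:

import asyncio
from functools import partial

async def _call_llama_cpp_direct_async(self, messages: list[dict]) -> tuple[str, dict]:
    # Lazy-load the model once per agent and cache it on self.
    if getattr(self, "_llama", None) is None:
        from llama_cpp import Llama
        self._llama = Llama(
            model_path=self.config.model_path,
            n_ctx=self.config.n_ctx,
            n_threads=self.config.n_threads,  # None lets llama.cpp auto-detect
            seed=self.config.llama_seed,
            verbose=False,
        )
    # llama-cpp-python is blocking, so run inference in the default executor.
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(
        None,
        partial(
            self._llama.create_chat_completion,
            messages=messages,
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens,
        ),
    )
    text = response["choices"][0]["message"]["content"]
    usage = response.get("usage", {})  # prompt_tokens / completion_tokens
    return text, usage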

Step 3: Add health check utility

File: swarm/agents/llm_health.py (new, small)

A tiny helper that SWARM can call before a run to verify the llama-server is reachable:

def check_llama_server(base_url: str = "http://localhost:8080") -> bool:
    """Ping llama-server /health endpoint. Returns True if ready."""

This check runs during orchestrator startup when the provider is LLAMA_CPP with Option A, so users get a clear error instead of a timeout mid-simulation.
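
A minimal stdlib-only sketch of the body (llama-server answers GET /health with HTTP 200 once the model is loaded):

import urllib.request

def check_llama_server(base_url: str = "http://localhost:8080") -> bool:
    """Ping llama-server /health endpoint. Returns True if ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError (connection refused, 503, ...)
        return False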

Step 4: Add optional dependency

File: pyproject.toml

[project.optional-dependencies]
llama_cpp = [
    "llama-cpp-python>=0.3.0",
]

Also add it to the llm extras group so pip install -e ".[llm]" pulls it in. The openai package (already in the llm extras) is needed for Option A.

Step 5: Create example scenario YAML

File: scenarios/llm_llama_cpp.yaml

Two variant blocks showing both options:

# Option A: llama-server (recommended)
agents:
  - type: llm
    count: 3
    llm:
      provider: llama_cpp
      model: llama-3.2-3b-instruct   # name passed to llama-server
      base_url: http://localhost:8080/v1
      temperature: 0.2
      max_tokens: 512
      seed: 42

# Option B: in-process (uncomment to use)
#  - type: llm
#    count: 3
#    llm:
#      provider: llama_cpp
#      model_path: ./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
#      n_ctx: 4096
#      n_threads: 8
#      temperature: 0.2
#      max_tokens: 512
#      seed: 42

Step 6: Helper script for model download + server launch

File: scripts/llama-server-setup.sh

Bash script that:

  1. Checks if llama-server binary exists; if not, prints build instructions (or downloads a release binary).
  2. Downloads a recommended small GGUF model if not present (e.g., Llama-3.2-3B-Instruct-Q4_K_M.gguf from HuggingFace — ~2 GB, runs well on MacBook CPU).
  3. Launches llama-server -m <model> --port 8080 --ctx-size 4096 --threads <auto>.
  4. Waits for /health to return OK.

Documented usage:

# One-time setup
./scripts/llama-server-setup.sh download   # fetch model
./scripts/llama-server-setup.sh start      # start server

# Then run SWARM
python -m swarm run scenarios/llm_llama_cpp.yaml --seed 42

Step 7: Tests

File: tests/test_llama_cpp_provider.py

  • Unit tests (no server required; see the sketch below):
      • LLMConfig with provider=llama_cpp sets the correct defaults.
      • Routing dispatches to _call_openai_compatible_async (Option A) or _call_llama_cpp_direct_async (Option B) based on model_path.
      • The health check returns False when the server is down (mocked).
  • Integration test (marked @pytest.mark.slow, skipped when no server is running):
      • Start llama-server in a fixture, send one chat completion, and verify the response structure.
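
Illustrative pytest sketches for the config-default cases, assuming the import path and field names proposed in Step 1:

from swarm.agents.llm_config import LLMConfig, LLMProvider

def test_llama_cpp_defaults():
    cfg = LLMConfig(provider=LLMProvider.LLAMA_CPP)
    assert cfg.base_url == "http://localhost:8080/v1"
    assert cfg.model_path is None          # no path -> Option A (HTTP) routing
    assert cfg.llama_seed == -1

def test_model_path_selects_in_process_mode():
    cfg = LLMConfig(
        provider=LLMProvider.LLAMA_CPP,
        model_path="./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    )
    assert cfg.model_path is not None      # routing should pick Option B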

Step 8: Documentation

File: docs/guides/llama-cpp-local-inference.md

Covers:

  • Prerequisites (llama.cpp build or binary, GGUF model).
  • Option A walkthrough (server mode).
  • Option B walkthrough (in-process mode).
  • Recommended models for CPU (3B Q4_K_M for light workloads, 8B Q4_K_M for quality).
  • Determinism controls (seed, temperature 0, fixed thread count).
  • Throughput tips (batching, the --parallel flag on llama-server for multi-agent runs).
  • Observability: prompt audit logging (already built in via prompt_audit_path).

Files Changed (Summary)

File                                        Change
swarm/agents/llm_config.py                  Add LLAMA_CPP enum + config fields
swarm/agents/llm_agent.py                   Add routing + _call_llama_cpp_direct_async
swarm/agents/llm_health.py                  New — health check utility
pyproject.toml                              Add llama_cpp optional dep
scenarios/llm_llama_cpp.yaml                New — example scenario
scripts/llama-server-setup.sh               New — model download + server launcher
tests/test_llama_cpp_provider.py            New — unit + integration tests
docs/guides/llama-cpp-local-inference.md    New — user guide

Design Notes

  • Why not just use the existing Ollama provider? Ollama wraps llama.cpp but adds its own HTTP layer, model management, and overhead. Direct llama-server gives lower latency, better determinism control, and avoids Ollama as a dependency. Users who prefer Ollama can already use provider: ollama.
  • Why both options? Option A (HTTP) is simpler, supports multi-process, and matches the existing OpenAI-compatible pattern. Option B (in-process) is needed for tight determinism (pinned seeds, threads, KV cache) in reproducible experiments — a core SWARM requirement.
  • No new heavy dependencies for Option A. The openai Python package (already in [llm] extras) is the only runtime dependency. The user just needs llama-server binary running externally.
  • Model files are gitignored. The models/ directory (GGUF files) should be added to .gitignore — these are multi-GB binaries.