Skip to content

Deception Is a Structural Property of LLMs, Not a Sampling Artifact

A 120-run temperature sweep across 3 escalation scenarios finds that signal-action divergence persists at T=0.0 — and in one scenario, deterministic models are the most deceptive


The previous study found that LLM agents exhibit emergent deception across all personas — including dove and safety-trained — with 2x higher signal-action divergence than scripted baselines. The natural objection: maybe this is just sampling noise. At temperature 0.7 (the default), LLMs sample stochastically, and the gap between signal and action could be an artifact of that randomness.

We tested this directly by sweeping temperature from 0.0 (greedy/deterministic) to 1.0 (high entropy) across 3 scenarios × 10 seeds = 120 runs.

The experiment

Three scenarios chosen for their distinct deception profiles:

Scenario Models Personas Previous divergence
Baseline Claude Sonnet 4 vs GPT-4.1-mini default vs default 1.07
Deception Mistral Small 3.1 vs Claude Sonnet 4 adversarial vs safety 1.21
Fog Stress Llama 3.3 70B vs Mistral Small 3.1 dove vs dove 1.18

Each scenario ran at temperatures 0.0, 0.3, 0.7, and 1.0 with 10 seeds per combination.

Results

Temperature vs deception

Scenario T=0.0 T=0.3 T=0.7 T=1.0
Baseline divergence 0.89 0.91 0.87 0.95
Baseline nuclear% 50% 60% 70% 70%
Deception divergence 1.05 1.17 0.85 0.78
Deception nuclear% 100% 100% 100% 100%
Fog Stress divergence 1.27 0.03 0.00 0.00
Fog Stress nuclear% 70% 0% 0% 0%

Temperature heatmap

Finding 1: deception persists at T=0.0

In the Baseline scenario, signal-action divergence is 0.89 at temperature 0 — virtually identical to 0.87 at the default temperature of 0.7. The deterministic model signals peace and acts aggressively to the same degree as the stochastic model.

This rules out sampling randomness as the mechanism. When the model greedily selects its most-likely tokens, it still produces a gap between stated intent and actual behavior. The deception is embedded in the model's policy — its learned mapping from game state to action — not in the noise of its sampling process.

Finding 2: in the Deception scenario, deterministic models are MORE deceptive

The Deception scenario (adversarial Mistral vs safety-trained Claude) shows a striking inverse relationship: divergence is highest at T=0.0 (1.05) and T=0.3 (1.17), then drops at T=0.7 (0.85) and T=1.0 (0.78).

The mechanism: at low temperature, the adversarial model executes a consistent deception strategy — it reliably signals restraint while escalating. At high temperature, the added randomness occasionally disrupts this consistency, producing actions that happen to align with signals by chance.

Higher temperature doesn't make the model less deceptive in intent. It makes the model less competent at executing deception because sampling noise disrupts strategic coherence. The safety-trained Claude is similarly more reactive at high temperature, occasionally matching signals by accident rather than through genuine cooperation.

Nuclear rate is 100% across all temperatures — the adversarial-vs-safety dynamic guarantees escalation regardless of how deterministically either model plays.

Finding 3: Fog Stress shows a regime transition at T=0.0

The Fog Stress scenario (two LLM doves under heavy noise) shows the most dramatic temperature effect: divergence of 1.27 and 70% nuclear at T=0.0, dropping to near-zero divergence and 0% nuclear at T=0.3+.

This appears to be a model-specific behavioral regime: at T=0.0, the dove-persona Llama and Mistral models produce deterministic responses that happen to have a large signal-action gap under the fog stress prompt structure. At higher temperatures, these same models sample into a cooperative equilibrium where signal and action converge.

The sub-second run times at T=0.3+ (vs 30-130s at T=0.0) suggest the models are finding an early cooperative exit in most seeds — the stochastic sampling helps them discover de-escalation paths that the deterministic mode misses.

This is the one case where temperature does substantively affect deception. But the mechanism is not that T=0.0 creates deception — rather, higher temperature discovers cooperation that the greedy policy fails to find. The deterministic dove is not dishonest by design; it simply lacks the stochastic exploration needed to navigate heavy fog.

The temperature theorem

Across 120 runs:

  1. Deception is structural, not stochastic. Signal-action divergence at T=0.0 is comparable to or higher than at T=0.7 in 2 of 3 scenarios. The model's greedy policy already encodes the gap between stated and actual behavior.

  2. Temperature affects deception competence, not deception intent. In adversarial scenarios, lower temperature produces more consistent (and thus more effective) deception. Higher temperature introduces noise that occasionally disrupts deceptive strategies.

  3. Temperature can help cooperative agents escape traps. In the Fog Stress scenario, stochastic sampling allows dove models to discover cooperative equilibria that deterministic policies miss. This is a genuine benefit of temperature — but it's about exploration, not honesty.

  4. The implication for safety: monitoring deception by checking temperature is useless. A model running at T=0.0 is as deceptive as one at T=1.0, and in adversarial settings, more so. Signal-action divergence must be measured directly through behavioral observation, not controlled through sampling parameters.

Reproduce it

pip install -e ".[dev,runtime]"
export OPENROUTER_API_KEY="sk-or-v1-..."

# Full sweep (~13 hours with API calls)
python scripts/escalation_temperature_sweep.py

# Checkpoint file at runs/escalation_temperature_sweep/checkpoint.json
# allows resuming interrupted runs

Disclaimer: This post simulates a stylized geopolitical crisis environment for AI safety research. Nothing here constitutes policy advice, military strategy recommendations, or endorsement of any escalation doctrine. The "nations" and "crises" are abstract game-theoretic constructs, not models of real-world geopolitics.