Lessons from ARC-AGI-3 Agent Development
Overview
ARC-AGI-3 is the first interactive reasoning benchmark — video-game-like environments on a 64x64 pixel grid where agents explore, learn rules, and solve puzzles. We built a Claude Sonnet 4.5-powered agent (ClaudeAgent) that uses vision, hypothesis-driven reasoning, and tool-use to compete. This document captures key lessons from 11 iterations (V1-V11) of agent development.
Key Architectural Decisions
Game Type Detection is Critical
ARC-AGI-3 environments are not homogeneous. We identified three distinct game types from the available_actions field:
| Game Type | Actions Available | Example | Strategy |
|---|---|---|---|
| Movement | [1,2,3,4] (directional) | ls20 | Interactive puzzle with switches, pattern matching |
| ARC Puzzle | [1,2,3,4,5,6] (move+confirm+click) | ft09 | Classic input/output transformation, click to edit cells |
| Click Only | [6] (click) | vc33 | Pure click-based puzzle solving |
Lesson: A single prompt strategy fails across game types. The agent must detect the game type from available actions and dispatch to game-type-specific system prompts. Our initial maze-focused prompt caused the ARC puzzle agent to repeatedly reset instead of clicking, and the click-only agent to click random coordinates.
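The dispatch described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the `GameType` enum, the `detect_game_type` and `select_system_prompt` helpers, and the placeholder prompt strings are all assumptions; only the action sets come from the table.

```python
from enum import Enum


class GameType(Enum):
    MOVEMENT = "movement"
    ARC_PUZZLE = "arc_puzzle"
    CLICK_ONLY = "click_only"
    UNKNOWN = "unknown"


# Placeholder prompt text (hypothetical); the real prompts are not shown here.
SYSTEM_PROMPTS = {
    GameType.MOVEMENT: "Movement game prompt goes here.",
    GameType.ARC_PUZZLE: "ARC puzzle prompt goes here.",
    GameType.CLICK_ONLY: "Click-only prompt goes here.",
    GameType.UNKNOWN: "Generic fallback prompt goes here.",
}


def detect_game_type(available_actions):
    """Classify a game from the first frame's available_actions list."""
    actions = set(available_actions)
    if actions == {6}:
        return GameType.CLICK_ONLY
    if {5, 6} <= actions:
        # Confirm (5) and click (6) are both present.
        return GameType.ARC_PUZZLE
    if actions and actions <= {1, 2, 3, 4}:
        # Only directional actions are present.
        return GameType.MOVEMENT
    return GameType.UNKNOWN


def select_system_prompt(available_actions):
    return SYSTEM_PROMPTS[detect_game_type(available_actions)]
```

An explicit `UNKNOWN` fallback keeps an unseen action set from silently inheriting a prompt written for a different game type.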
The Timer Bar Breaks Naive Change Detection
Every game has a progress bar at row 63 that ticks down 2 pixels per action (~32-40 actions per cycle). This means:
- `frame_hash` changes every action regardless of meaningful grid changes
- `stuck_counter` stays at 0 even when the agent is completely stuck
- The agent never receives "you're stuck" guidance from the prompt
Fix: content_hash — hash only rows 0-62, ignoring the timer bar. This made stuck detection functional and immediately improved behavior (the agent started resetting and trying new approaches when truly stuck).
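A minimal sketch of the idea, assuming the frame is a list of 64 rows of small integer color values and that the timer occupies only the bottom row. The `StuckDetector` class, the `timer_rows` parameter, and the choice of MD5 are illustrative assumptions rather than the agent's actual implementation.

```python
import hashlib

TIMER_ROWS = 1  # assumption: the timer bar occupies only the last row


def content_hash(grid, timer_rows=TIMER_ROWS):
    """Hash every row except the trailing timer rows."""
    content = grid[: len(grid) - timer_rows]
    # Color values are assumed to fit in one byte each.
    payload = bytes(value for row in content for value in row)
    return hashlib.md5(payload).hexdigest()


class StuckDetector:
    """Counts consecutive frames whose non-timer content did not change."""

    def __init__(self):
        self.last_hash = None
        self.stuck_counter = 0

    def update(self, grid):
        current = content_hash(grid)
        if current == self.last_hash:
            self.stuck_counter += 1
        else:
            self.stuck_counter = 0
        self.last_hash = current
        return self.stuck_counter
```

If the timer turns out to span more than one row, only `timer_rows` needs to change.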
MazeNavigator is Wrong for Interactive Puzzles
V4-V6 introduced a MazeNavigator (persistent DFS with graph building across timer resets) designed for movement games. It was fast (instant, no API calls) and could efficiently explore maze corridors. However:
- Interactive objects are invisible to DFS. The navigator treats every passable cell identically. Walking onto a rotation switch has the same weight as walking onto empty floor.
- DFS explores exhaustively when it should be purposeful. The ls20 puzzle can be solved in ~13 moves (touch switch once, navigate to target). MazeNavigator used 40+ actions exploring the entire grid without ever understanding the puzzle.
- Early handoff kills reasoning. With `MAZE_MODE_AFTER=2`, Claude only got 2 reasoning calls before MazeNavigator took over, which is not enough to observe the puzzle mechanics.
Lesson: For interactive puzzles, let the LLM reason about every move. The cost (5-8 seconds per API call, ~7K tokens per turn) is high but necessary. Blind programmatic exploration cannot solve puzzles that require understanding cause-and-effect relationships.
The navigate_to Virtual Tool: LLM-Directed Programmatic Navigation
V10 introduced a middle ground between full LLM reasoning per move and blind programmatic exploration: a navigate_to(x, y) virtual tool. Claude specifies a destination, and the agent executes a greedy Manhattan-distance path programmatically (no API calls during transit).
Benefits:
- Reduced API calls from 200 (every move) to ~20-30 (only at decision points)
- 10-20x faster navigation through open areas
- Claude retains strategic control: it decides WHERE to go, the agent handles HOW
Challenges:
- Greedy paths can't handle walls. A simple "move toward target" path hits walls in corridor-heavy games. The agent needs wall detection, retry logic, and progress-based abort.
- Wall retry oscillation. When blocked going left, the agent goes up/down to get around, then tries left again, creating infinite loops. Fixed with progress-based abort: if Manhattan distance to target doesn't decrease after 12 steps, abort and return to Claude.
- Content change detection is noisy. We tried interrupting navigation when the grid changed (to detect puzzle events like switch activation). But the timer bar and moving sprites cause 50+ pixel changes per step, drowning out real events. Abandoned in favor of letting Claude observe at arrival.
- Arrival precision. With 5-pixel movement steps, the player can only land on positions modulo 5 from the start. A "within 3 pixels" arrival threshold may not mean the player is ON the target.
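The greedy walk with progress-based abort can be sketched as below. This is an illustration under stated assumptions, not the agent's code: `try_move` is a hypothetical callback that performs one game action and returns the resulting player position (unchanged when blocked), direction names are placeholders, and the arrival threshold is interpreted as Manhattan distance.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def navigate_to(start, target, try_move, arrive_within=3, patience=12, max_iters=200):
    """Greedy walk toward target. Returns (final_position, status)."""
    pos = start
    best = manhattan(pos, target)
    since_progress = 0
    for _ in range(max_iters):
        if manhattan(pos, target) <= arrive_within:
            return pos, "arrived"
        dx, dy = target[0] - pos[0], target[1] - pos[1]
        horizontal = "right" if dx > 0 else "left"
        vertical = "down" if dy > 0 else "up"
        # Try the axis with the larger remaining distance first; if that move
        # is blocked, fall back to the other axis (the wall retry).
        order = [horizontal, vertical] if abs(dx) >= abs(dy) else [vertical, horizontal]
        for direction in order:
            new_pos = try_move(pos, direction)
            if new_pos != pos:
                pos = new_pos
                break
        # Progress-based abort: give up once the best distance seen so far
        # has not improved for `patience` consecutive iterations.
        dist = manhattan(pos, target)
        if dist < best:
            best, since_progress = dist, 0
        else:
            since_progress += 1
            if since_progress >= patience:
                return pos, "aborted"
    return pos, "aborted"
```

Note that the fallback move can itself step away from the target, which reproduces the oscillation described above; the abort counter is what ends it and hands control back to the LLM.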
Recording Analysis is the Most Important Debugging Tool
The single biggest breakthrough came from analyzing JSONL recordings frame-by-frame:
- V9: Discovered the timer bar (rows 62-63) changes every action, breaking stuck detection. Led to `content_hash`, which immediately improved ft09 from 125 to 21 actions for level 1.
- V11: Discovered that ls20 has NO switch: the "blue object" was the player sprite. The pattern rotation is autonomous, not player-triggered. Every previous iteration was operating on a fundamentally wrong model of the game mechanics.
Lesson: Never iterate on prompts without understanding the ground truth. 6 versions (V4-V10) were spent optimizing for a game mechanic that didn't exist. One recording analysis session revealed the real mechanics and required a complete prompt rewrite.
Prompt Engineering Insights
Confirm Spam is a Real Failure Mode
When the agent successfully modifies some cells in the ARC puzzle, it often enters a "confirm loop" — submitting the same incorrect answer 20+ times consecutively. This happens because:
- The frame changes after each confirm (timer ticks), so the agent doesn't detect stuckness
- The prompt doesn't explicitly warn against repeated confirms
- Claude's reasoning gets anchored on "I think my answer is correct" and doesn't re-evaluate
Mitigations:
- Hard cap: after 3 consecutive confirms without level advancement, force a reset
- Prompt warning: "If confirm doesn't advance to the next level, your answer is WRONG"
- Content-hash-based stuck detection that ignores the timer
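The hard cap from the mitigations above could look roughly like this. The `ConfirmGuard` class, the string action labels, and the use of a level counter as the advancement signal are assumptions made for the sketch.

```python
CONFIRM, RESET = "confirm", "reset"  # hypothetical action labels
MAX_CONSECUTIVE_CONFIRMS = 3


class ConfirmGuard:
    """Overrides a confirm with RESET after too many fruitless confirms."""

    def __init__(self, limit=MAX_CONSECUTIVE_CONFIRMS):
        self.limit = limit
        self.consecutive = 0
        self.last_level = None

    def filter(self, proposed_action, current_level):
        """Return the action that should actually be executed."""
        # A level advance means the previous confirm worked: start over.
        if self.last_level is not None and current_level > self.last_level:
            self.consecutive = 0
        self.last_level = current_level
        if proposed_action != CONFIRM:
            self.consecutive = 0
            return proposed_action
        if self.consecutive >= self.limit:
            self.consecutive = 0
            return RESET
        self.consecutive += 1
        return CONFIRM
```

Because the guard sits between the model's proposed action and the environment, it works even when the prompt warning is ignored.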
Objects List > Vision for Coordinates
Claude's vision on 512x512 upscaled images is good for understanding layout but imprecise for exact pixel coordinates. The extract_objects() function provides:
- orange (10px): x=[39-43] y=[44-48], center (41,46)
- blue (6px): x=[19-23] y=[30-34], center (21,32)
This structured data is more useful than vision alone for navigation. The movement prompt should direct Claude to "use the objects list coordinates to plan direct paths" rather than trying to visually estimate positions.
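For readers who want to produce a similar list, here is one possible way to do it. It is not the `extract_objects()` used by the agent, whose internals are not shown in this document: the sketch assumes objects are 4-connected clusters of a single color, that color 0 is background, and that the center is the bounding-box midpoint (which matches the two sample lines above).

```python
from collections import deque


def extract_objects_sketch(grid, background=0):
    """Return one dict per 4-connected cluster of same-colored pixels."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    objects = []
    for y0 in range(height):
        for x0 in range(width):
            color = grid[y0][x0]
            if seen[y0][x0] or color == background:
                continue
            # Flood-fill the cluster that contains (x0, y0).
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            objects.append({
                "color": color,
                "pixels": len(pixels),
                "x_range": (min(xs), max(xs)),
                "y_range": (min(ys), max(ys)),
                "center": ((min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2),
            })
    return objects


def format_object(obj):
    """Render one object in the same shape as the list shown above."""
    (x0, x1), (y0, y1) = obj["x_range"], obj["y_range"]
    cx, cy = obj["center"]
    return (f"- color {obj['color']} ({obj['pixels']}px): "
            f"x=[{x0}-{x1}] y=[{y0}-{y1}], center ({cx},{cy})")
```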
System Prompt Size Matters for Cost
Each Claude call includes the full system prompt. With prompt caching (cache_control: {"type": "ephemeral"}), repeated calls within 5 minutes reuse cached tokens. But the system prompt still contributes to context window pressure.
Key optimizations:
- Keep system prompt under 500 tokens
- Use sliding window message history (6 turns = 18 messages max)
- Send images selectively (every 3rd turn for movement, every turn for ARC)
- MAX_TOKENS=512 for responses (tool calls are compact)
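The sliding window and selective image rules above can be sketched as follows. The helper names and game-type keys are hypothetical, and grouping messages by turn is a design assumption: trimming whole turns rather than single messages keeps related messages (for example a tool call and its result) together.

```python
from collections import deque

MAX_TURNS = 6  # 6 turns of 3 messages each gives the 18-message cap
IMAGE_EVERY = {"movement": 3, "arc_puzzle": 1}  # hypothetical game-type keys


def make_history():
    # Each entry holds all messages of one turn; old turns fall off the left.
    return deque(maxlen=MAX_TURNS)


def add_turn(history, messages):
    history.append(list(messages))


def flatten(history):
    """Build the flat message list to send with the next request."""
    return [message for turn in history for message in turn]


def should_send_image(turn_index, game_type):
    """Send an image every Nth turn; unknown game types get one every turn."""
    return turn_index % IMAGE_EVERY.get(game_type, 1) == 0
```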
Game-Specific Findings
ft09 (ARC Puzzle)
- Grid structure: 4 quadrants — top-left (example input), top-right (example output), bottom-left (test input), bottom-right (editable test output)
- Cell size: Each logical cell is a 6x6 pixel block in the 64x64 grid
- Click behavior: Each click cycles a cell to the next color
- Frame layers: Initially 5 layers (blinking cursor animation), collapses to 1 after first action
- Transformation type (level 1): Makes 3x3 grids 4-fold symmetric
- V8 result: 1 level completed (score 2.0) in 125 actions (baseline: 15). First-ever level completion.
- V9 result: 1 level completed (score 11.36) in 21 actions (baseline: 15, ratio: 1.4x). 5.7x score improvement over V8.
- V9 improvement: Content-hash stuck detection + confirm spam cap → level 1 solved in 21 actions (was 125 in V8)
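Given the 6x6 cell size noted above, a click target for a logical cell can be computed as in this sketch. The function name is hypothetical, and `origin_x` / `origin_y` (the top-left pixel of the editable quadrant) are assumed inputs that would have to be measured from an actual frame; the sketch also assumes cells are packed with no gaps between them.

```python
CELL_SIZE = 6  # each logical cell is a 6x6 pixel block


def cell_to_click(col, row, origin_x, origin_y, cell_size=CELL_SIZE):
    """Return an (x, y) pixel near the center of a logical cell."""
    half = cell_size // 2
    return (origin_x + col * cell_size + half,
            origin_y + row * cell_size + half)
```

Clicking near the cell center rather than its corner leaves some margin if the measured origin is off by a pixel or two.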
ls20 (Movement/Pattern Puzzle)
- Actual mechanics (discovered V11 via recording analysis):
  - A reference pattern box (upper area, rows 8-16, cols 32-40) shows the target pattern
  - An answer box (bottom-left, rows 52-63, cols 0-11) displays a pattern that rotates automatically
  - A moving indicator (pink+maroon sprite) autonomously cycles through green corridors
  - Each time the indicator reaches the reference box, the answer box pattern rotates 90 degrees
  - Pattern cycle: 4 states (original → H-flip → 180-rot → V-flip)
  - The player must navigate to the answer box when its pattern matches the reference
  - There is no switch. What we identified as a "blue switch" was actually the player sprite's blue accent pixels
- Grid structure: Green (3) walls form corridors and box borders. Yellow (4) and gray (5) are walkable. The player starts at ~(41,45) in an open area surrounded by corridors.
- Timer: ~40 actions per life, 3 lives total, 7 levels to complete
- V4-V8 failure: MazeNavigator explores blindly, never understands puzzle mechanics
- V9 change: Disabled MazeNavigator, full Claude reasoning per move — still wandered aimlessly
- V10-V10.3: Added `navigate_to` virtual tool for efficient multi-step navigation. Fixed wall detection bug, added progress-based abort for infinite loops. Claude correctly navigated to key locations but didn't complete levels.
- V11: Completely rewrote prompt after recording analysis revealed true game mechanics. Previous prompts told Claude to find a non-existent "switch" and navigate to the wrong target box.
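Since the grid notes above separate walls (green, 3) from walkable cells (yellow 4, gray 5), a wall-aware alternative to greedy movement is a breadth-first search. The sketch below is not part of any agent version described here. It treats the player as a single pixel moving one cell at a time, which is a simplification: the real player is a multi-pixel sprite that moves in 5-pixel steps.

```python
from collections import deque

WALKABLE = {4, 5}  # yellow and gray, per the grid structure notes


def bfs_path(grid, start, goal, walkable=WALKABLE):
    """Shortest 4-connected path from start to goal as (x, y) tuples, or None."""
    height, width = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if (0 <= nx < width and 0 <= ny < height
                    and nxt not in parents and grid[ny][nx] in walkable):
                parents[nxt] = current
                queue.append(nxt)
    return None
```

A `None` result is itself useful information: it tells the caller that the target is unreachable under the assumed walkable colors, so control can return to the LLM immediately instead of after a run of wasted moves.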
vc33 (Click Only)
- Grid: Green left half, black right half, with maroon/grey/teal objects
- Timer: Cycles every ~8 frames (fast)
- 7 levels to complete
- Grid never changed across 111 frames in V6 run — suggests clicks weren't landing on interactive targets
- Not yet tested with game-type-aware V8/V9 agent
Cost Analysis
| Version | Game | Actions | Input Tokens | Output Tokens | Levels | Cost (est) |
|---|---|---|---|---|---|---|
| V6 | ft09 | 201 | ~50K | ~3K | 0 | ~$0.15 |
| V8 | ft09 | 201 | ~600K | ~10K | 1 | ~$2.00 |
| V8 | ls20 | 201 | ~16K | ~1K | 0 | ~$0.05 |
| V9 | ft09 | 201 | ~1.6M | ~22K | 1 (score 11.36) | ~$5.00 |
| V9 | ls20 | 201 | ~1.8M | ~14K | 0 | ~$6.00 |
| V10 | ls20 | 201 | ~500K | ~7K | 0 | ~$1.50 |
| V10.4 | ls20 | 201 | ~570K | ~7K | 0 | ~$1.70 |
The key cost driver is images. Each 512x512 PNG is ~1500-3000 tokens. Sending images every turn for 200 actions adds ~400K-600K tokens. For movement games, sending every 3rd turn saves ~70% of image cost.
The navigate_to tool (V10+) significantly reduced token usage for movement games by replacing per-step API calls with programmatic navigation, cutting total cost from ~$6 (V9) to ~$1.50 (V10).
Meta-Lessons
- Analyze recordings before iterating. Every breakthrough came from studying JSONL recordings frame-by-frame (grid diffs, action sequences, frame counts). Understanding the game mechanics precisely was worth 10x more than prompt tweaks. We wasted 6 versions optimizing for non-existent game mechanics because we hadn't studied the recordings carefully enough.
- Game-type detection should happen as early as possible. The first frame's `available_actions` field contains enough information to select the right strategy. Don't waste actions figuring out what kind of game you're playing.
- Programmatic components complement but don't replace LLM reasoning. The MazeNavigator is fast and cheap but blind to semantics. The LLM is slow and expensive but understands intent. The ideal agent uses programmatic execution of LLM-generated plans. The `navigate_to` tool follows this pattern: Claude decides the destination, the agent executes the path.
- Stuck detection requires ignoring "noise" changes. Timer bars, animation frames, and blinking cursors all change the frame hash without indicating real progress. Content-aware hashing is essential. But even "content-aware" hashing (skipping timer rows) isn't enough: moving sprites, indicators, and autonomous animations cause 50+ pixel changes per step. True stuck detection may require tracking only player-relevant metrics (position, level completion).
- Hard caps prevent catastrophic action waste. Without a confirm-spam cap, the agent burned 22 consecutive actions on futile confirms. Without a progress-based nav abort, the agent burned 25 consecutive actions oscillating between two wall positions. Simple guardrails save the budget for useful exploration.
- Multi-layer frames are common and require deduplication. ft09 starts with 5 layers (cursor blink animation). Rendering all 5 as separate images wastes tokens. MD5 deduplication across layers reduces this to 1-2 unique images.
- Don't hardcode game mechanics you haven't verified. The movement prompt's "SPATIAL LAYOUT" section hardcoded specific coordinates (switch at y~30-35, target at y<25). This was based on an incorrect model of the game. When the model was wrong, Claude was given precise but incorrect instructions, worse than no instructions at all. Game-specific hints should only be added after frame-level verification from recordings.
- Greedy navigation fails in complex environments. A greedy Manhattan-distance path works for open areas but fails in corridor-heavy environments with walls. The agent needs proper pathfinding (BFS/A*) or, at minimum, progress-based abort with fallback to exploratory individual moves. Wall retry + perpendicular approach creates oscillation; progress tracking catches it.
- The objects list can be misleading. `extract_objects()` identifies colored clusters by pixel count, but can't distinguish game-relevant entities from decorative elements. The "blue switch" at (21,32) was actually the player sprite's accent pixels. Objects need semantic context (what role they play), not just spatial data (where they are).
- LLM vision is better at layout comprehension than coordinate extraction. Claude can identify "there are two pattern boxes and a corridor structure" from the image better than it can read exact pixel coordinates. But for navigation, it needs the objects list's precise coordinates. The combination is essential: vision for understanding, structured data for action.
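The MD5 deduplication of frame layers mentioned in the list above can be sketched like this, assuming each layer is a list of rows of small integer color values and all layers share the same dimensions. The function name is hypothetical.

```python
import hashlib


def dedupe_layers(layers):
    """Keep the first occurrence of each distinct layer, preserving order."""
    seen = set()
    unique = []
    for layer in layers:
        # Color values are assumed to fit in one byte each.
        payload = bytes(value for row in layer for value in row)
        digest = hashlib.md5(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(layer)
    return unique
```

Only the layers returned here would then be rendered to images and sent to the model.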
Bug Taxonomy
A catalog of bugs encountered during development, useful for anyone building similar agents:
| Bug | Version | Impact | Root Cause | Fix |
|---|---|---|---|---|
| Timer bar breaks stuck detection | V8→V9 | Agent never detects stuckness, stuck_counter=0 always | frame_hash includes timer bar pixels that change every action | content_hash that hashes only rows 0-62 |
| Confirm spam | V8→V9 | 22+ actions wasted confirming wrong answers | No cap on consecutive confirms; timer changes mask stuckness | Hard cap: 3 consecutive confirms → force RESET |
| _prev_player_pos update ordering | V10→V10.1 | Every nav step looks like a wall hit, navigate_to aborts after 3 steps | _prev_player_pos updated before wall detection comparison | Save old_player_pos before update, use it in comparison |
| Claude navigates to wrong target | V10.1→V10.2 | Goes to (59,61) instead of target box | No spatial guidance in prompt; Claude guesses wrong box | Added SPATIAL LAYOUT section (later found to be wrong itself) |
| Content change interruption fires every step | V10.3 | navigate_to interrupted on every step, defeating its purpose | Player movement changes grid (2 px), timer changes grid (49+ px), threshold too low | Changed to pixel magnitude threshold (>5), then discovered 51px noise floor from timer |
| Nav queue infinite oscillation | V10.3b→V10.4 | 25+ actions wasted bouncing between two positions | Wall retry counter resets on successful perpendicular moves | Progress-based abort: if no distance improvement after 12 steps, abort |
| Wrong game model (entire paradigm) | V4→V11 | 6 versions optimizing for non-existent mechanics | Never analyzed recordings to verify switch/rotation hypothesis | Frame-by-frame recording analysis revealed autonomous indicator, no switch |
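The `_prev_player_pos` ordering bug in the table is easy to reintroduce, so here is a minimal sketch of the corrected order. The `WallDetector` class is hypothetical; only the `_prev_player_pos` and `old_player_pos` names come from the table.

```python
class WallDetector:
    """Flags a wall hit when a move leaves the player where it was."""

    def __init__(self, start_pos):
        self._prev_player_pos = start_pos

    def hit_wall(self, new_player_pos):
        # Save the old position BEFORE overwriting it. Updating first makes
        # the comparison below always true, so every step looks like a wall.
        old_player_pos = self._prev_player_pos
        self._prev_player_pos = new_player_pos
        return new_player_pos == old_player_pos
```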
Version History
| Version | Key Changes | ft09 Score | ls20 Score |
|---|---|---|---|
| V1-V3 | Basic vision + tool calling | 0 | 0 |
| V4-V6 | MazeNavigator (DFS), multi-layer dedup | 0 | 0 |
| V7-V8 | Game type detection, game-specific prompts | 2.0 | 0 |
| V9 | content_hash, confirm spam cap, disable MazeNav | 11.36 | 0 |
| V10 | navigate_to virtual tool | — | 0 |
| V10.1 | Wall detection fix | — | 0 |
| V10.2 | Spatial layout hints | — | 0 |
| V10.3-10.4 | Content change detection, progress abort | — | 0 |
| V11 | Corrected game mechanics from recording analysis | — | TBD |