2026-05-06 · 8 min read

My Claude Code session went great - until turn 30. What broke?

The first 20 turns felt like magic. The next 10 felt like a different model. Here's what was actually happening to my context window.

Harness Engineering · Part 1 of 10. Next: I asked for one thing. The agent did three.

I was three hours into a refactor I'd been chasing for two days. Up to about turn twenty, the agent had been excellent: read the right files, made the right edits, ran the right tests, came back with a clean diff. Then somewhere around turn twenty-five it started forgetting which module we were working in. By turn thirty it confidently re-introduced a bug it had personally fixed an hour earlier — and cited the broken commit as evidence the fix had landed.

I rewound, gave it a sterner system prompt, and tried again. It got worse.

This post is me figuring out what was actually happening.

What I tried first

The instinct, when an agent starts drifting, is to give it more. More instructions. More reminders. A longer system prompt that re-states what we're building and why. A pinned message that says "do not touch the legacy adapter." Maybe a CLAUDE.md update with the conventions it just violated.

So that's what I did. I expanded my prompt with a recap of the architecture, a list of files that were in scope, a list of files that were not in scope, and a polite paragraph asking the model to keep all of that in mind.

It got worse, faster. By turn fifteen of the new run, the agent was hedging every edit. By turn twenty it was repeating the same diagnostic loop — running the same test, reading the same file, drawing the same conclusion — instead of moving on. By turn twenty-five it had quoted my own pinned reminder back at me as a finding, like it had discovered the architecture by reading the code.

That last one is what tipped me off. The reminder I had added to fix the problem had been absorbed into the problem. The model couldn't tell the difference between a load-bearing instruction and an old observation it had recorded itself. Everything in the window was just tokens, and tokens have weight.

The naive fix had pulled the rope tighter. The rope was the issue.

What clicked

The reframe is short and it took me embarrassingly long to internalize: the context window is not RAM. It is an attention budget.

When I treated the window as RAM, "more context" was free. Why not paste the whole file? Why not keep the previous turn's tool output? Why not pin a long reminder? It's all just sitting there. The model can ignore what it doesn't need.

That mental model is wrong. The model cannot ignore tokens. It attends to all of them, and every token I add competes with every other token for the model's finite ability to track what matters. Anthropic's context engineering writeup makes this explicit: a transformer's pairwise attention scales as n² with sequence length, models have less training data on very long sequences, and the practical implication is that context is "a finite resource with diminishing marginal returns." Chroma's Context Rot report goes further — across eighteen frontier models, performance degrades with input length even on tasks as trivial as repeating a string back. The thousandth token is not as cheap as the hundredth.
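
If it helps to see the scaling as arithmetic rather than a claim, here is a toy illustration (nothing model-specific, just the quadratic growth):

```python
# Back-of-the-envelope arithmetic: the number of pairwise attention interactions
# grows quadratically with context length. Real implementations optimize heavily,
# but the n^2 shape is the point.
for n in (1_000, 10_000, 100_000):
    pairs = n * n  # every token can attend to every other token
    print(f"{n:,} tokens -> {pairs:,} pairwise interactions")
```

A tenfold longer window means a hundredfold more pairwise interactions competing for the same fixed ability to track what matters, which is the arithmetic behind "diminishing marginal returns."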

Drew Breunig's two posts gave me the vocabulary I was missing. Long contexts don't just get worse; they fail in specific ways. Distraction: past actions in the window pull the model toward repeating itself instead of synthesizing. Confusion: irrelevant tools and context degrade tool selection — the Berkeley function-calling work he cites shows models choking when the toolbox gets too big, even with room to spare in the window. Poisoning: a wrong observation that lands in the context gets re-cited as fact later. Clash: contradictions between earlier and later turns that the model resolves by quietly picking one side.

Once I had those names, I could go back and label what had happened to me. The first run was distraction (turn thirty kept replaying turn ten's reasoning). The second run was poisoning (my pinned reminder, plus a stale tool output it had decided was canonical).

The bigger shift was about what the harness is for. I had been treating Claude Code as a memory: every turn appended, nothing pruned, the conversation as the source of truth. But the harness's actual job — the part that makes it different from a chat box — is the opposite. It's there to prune. /compact is the explicit version of this: take the conversation so far, summarize it, start fresh with the summary as the seed. Auto-compaction is the implicit one, kicking in around 95% of the window. Both are admitting the same thing: the raw transcript is not the artifact. The model's working understanding of the task is the artifact, and you have to actively defend it from the transcript that produced it.
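
To make that concrete, here is a minimal sketch of the shape of a compaction step. It is not Claude Code's implementation; `count_tokens` and `summarize` are hypothetical stand-ins supplied by the caller, and the 95% threshold mirrors the auto-compaction behavior described above.

```python
# A sketch of compaction: when the transcript nears the budget, replace it with a
# summary plus the latest turn. Not Claude Code's internals; `count_tokens` and
# `summarize` are hypothetical stand-ins.

WINDOW_BUDGET = 200_000   # assumed window size, in tokens
COMPACT_AT = 0.95         # auto-compaction kicks in near the top of the window

def maybe_compact(messages, count_tokens, summarize):
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < COMPACT_AT * WINDOW_BUDGET:
        return messages                      # room to spare: keep the raw transcript

    # The raw transcript is not the artifact; the working understanding is.
    summary = summarize(messages)            # e.g. "module A refactored; interface is X"
    return [
        {"role": "system", "content": "Summary of the session so far:\n" + summary},
        messages[-1],                        # keep the latest turn verbatim to continue from
    ]
```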

That reframes a lot of decisions. CLAUDE.md is not "extra context I'm sneaking in." It's persistent state that survives compaction, which is exactly why it's load-bearing and why everything else in the window is disposable. Subagents aren't a parallelism trick; they're context isolation — a fresh window for a sub-task whose intermediate noise should never touch the parent. Skills load on demand because if they loaded eagerly they'd burn attention budget for nothing.
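
Here is the same kind of sketch for subagents as context isolation; `run_agent` is a hypothetical callable, not a Claude Code API.

```python
# Context isolation, sketched: a sub-task gets a fresh window seeded with only a
# minimal brief, and only its distilled conclusion crosses back to the parent.
# `run_agent` is a hypothetical callable, not a Claude Code API.
def run_subtask(run_agent, brief, task):
    result = run_agent(messages=[
        {"role": "system", "content": brief},   # persistent conventions, e.g. CLAUDE.md
        {"role": "user", "content": task},      # the sub-task itself, nothing else
    ])
    # The subagent's intermediate tool calls and noise stay in its own window;
    # the parent only ever sees the summary.
    return result["summary"]
```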

The harness is not preserving my session. It is curating it.

What I'd do differently next time

Three concrete moves, in order of how much they helped me.

Compact early, not late. I used to wait for the auto-compact warning. Now I run /compact the moment a sub-task is done — when I've finished the refactor in module A and I'm about to start module B. The intermediate tool outputs from module A are pure noise for module B. Summarizing them down to "module A is done, here is the resulting interface" is exactly the kind of pruning the model can't do for itself mid-turn.
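
As a concrete example of the move: /compact takes optional instructions about what the summary should keep (check your version's docs), so at the module boundary I write something like the line below. The phrasing is mine, not a prescribed format.

```
/compact Keep module A's resulting interface and the files still in scope for module B; drop A's intermediate test output and tool logs.
```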

Externalize state to files, not turns. If the agent discovers something it'll need three turns from now — a config value, a function signature, a decision — I have it write to a scratch file and read it back when needed. The file is durable. The turn it lived in is going to get summarized away. Drew Breunig calls this context offloading, and Anthropic's post recommends the same move. It's the most robust thing I've changed.
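
A minimal sketch of that scratch-file pattern, assuming a JSON file at a path of my choosing (neither post prescribes a format):

```python
# Context offloading, sketched: durable facts go to a scratch file the agent can
# re-read later, instead of living in a turn that compaction will summarize away.
# The path and JSON format are arbitrary choices, not a convention from either post.
import json
from pathlib import Path

SCRATCH = Path(".agent-scratch.json")

def remember(key, value):
    """Persist a discovered fact (a config value, a signature, a decision)."""
    notes = json.loads(SCRATCH.read_text()) if SCRATCH.exists() else {}
    notes[key] = value
    SCRATCH.write_text(json.dumps(notes, indent=2))

def recall(key, default=None):
    """Read a fact back in a later turn, even after the original turn is gone."""
    if not SCRATCH.exists():
        return default
    return json.loads(SCRATCH.read_text()).get(key, default)
```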

Fresh session per task. When I'm about to start something genuinely new — not a continuation — I just start a new session. The cost of re-loading a few files into a clean window is much smaller than the cost of dragging an hour of unrelated context behind me. I default to a new session over --continue now, and it has been a free win.

What I'm still unsure about

I don't have a good intuition for when compaction actually hurts. Twice this week I compacted between sub-tasks and the agent lost a small detail I'd assumed was in CLAUDE.md but wasn't, and the recovery cost more than the noise would have. There's a frontier here I haven't figured out: which intermediate state is genuinely worth keeping in raw form vs. summarizing.

I'm also unsure how much of this generalizes off Claude Code. The shape of the problem is universal — context rot is documented across every frontier model — but the handles I have for managing it (slash-compact, CLAUDE.md, subagents) are Claude Code's. I haven't yet sat down and asked: what's the equivalent move in Cursor, or Aider, or a homegrown loop? That's a future post.

The thing I'm most sure of is the reframe itself. Context as attention budget, not RAM. Once you see the harness as a pruner, every other decision about how to drive it gets sharper.

References

  • Drew Breunig, How Long Contexts Fail — gave me the four failure-mode vocabulary (distraction, confusion, poisoning, clash) that let me label what had actually gone wrong on turn thirty. https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
  • Drew Breunig, How to Fix Your Context — the companion piece, with six concrete techniques (RAG, tool loadout, quarantine, pruning, summarization, offloading). The "context offloading" idea is what convinced me to start writing scratch files instead of relying on turns. https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html
  • Chroma Research, Context Rot: How Increasing Input Tokens Impacts LLM Performance — empirical proof, across eighteen models, that performance degrades with input length even on trivial tasks. This was the paper that broke my "context is RAM" assumption. https://research.trychroma.com/context-rot
  • Anthropic, Effective Context Engineering for AI Agents — the model-builder's view: context as a finite attention budget, n² pairwise attention, diminishing marginal returns. The phrase "smallest possible set of high-signal tokens" is the one I now repeat to myself before adding anything to a prompt. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • Hamel Husain, P6: Context Rot — Hamel's annotated walkthrough of Kelly Hong's Chroma presentation. The practitioner framing — "thoughtful context engineering is critical" — is what pulled this from theory into a daily habit. https://hamel.dev/notes/llm/rag/p6-context_rot.html
  • Claude Code docs, How Claude Code works — the official description of auto-compaction, /compact, /context, and how subagents and skills isolate context. Re-reading this with the attention-budget frame in mind made the design choices obvious. https://code.claude.com/docs/en/how-claude-code-works

Harness Engineering · Part 1 of 10. Next: I asked for one thing. The agent did three.
