
What does it mean for the harness to "trust" the model?

Every hook I write is a vote of no-confidence in the model. The question is which votes pay off.

Harness Engineering · Part 6 of 10. Previous: Skill, slash command, sub-agent, or just a better prompt?. Next: My agent remembers me across sessions.


A few weeks ago I had a settings.json that was, frankly, paranoid. I had PreToolUse hooks on Bash, on Edit, on Write. I had a hook that lint-checked every command before it ran. I had one that diffed every file the agent was about to touch and asked me to confirm. I had built it carefully, hook by hook, every time the agent did something I didn't like.

I sat down to do a small refactor. The agent asked me to confirm reading a file. Then to confirm a grep. Then to confirm an npm install it had already run twice that morning. Twenty minutes in I had clicked y forty times and edited two lines of code. I closed the laptop. I'd built a forms wizard.

This post is me figuring out what hooks are actually for.

What I tried first

The instinct, the first time an agent does something irreversible that you didn't want, is to lock the door it walked through. Mine had run rm -rf node_modules in the wrong directory once. So I added a PreToolUse hook on Bash that asked for confirmation. Then it tried to amend a commit that wasn't mine. Hook on git commit. Then it ran a migration on what it thought was the dev database. Hook on anything matching migrate. Each addition felt like it was making the agent safer. Each one was independently defensible.

The cumulative effect was different. The agent was no longer driving; I was. Every loop went: agent proposes, I read the proposal, I click y or n, agent continues. The harness had stopped being a force multiplier and become a permission slip generator. Worse — because most of the prompts were on safe operations (a grep, a cat, an npm install in a sandboxed repo) — I was clicking through them on autopilot. The signal-to-noise was so bad that when a hook fired on something genuinely dangerous, I almost missed it. I'd trained myself, by hook design, to stop reading.

So I did the other thing. I disabled all of them. For two days the agent ran wild and I watched. It did not, in fact, rm -rf my homedir. It did, twice, do something I would have caught with a single well-placed gate — once it force-pushed a branch I'd been about to PR; once it dropped a table in a staging DB without checking the schema first.

Both extremes were wrong. The forms-wizard version made the agent unusable. The no-hooks version made specific failures cheap to imagine. The question wasn't whether to install hooks. It was which actions deserve the friction.

What clicked

The reframe is this: every hook is a vote of no-confidence in the model on a specific action. And the question to ask, every time you're tempted to add one, is whether that vote will pay off.

Which means trust is not a property of the agent. It's a property of the action.

I trust the model on grep. I trust it on cat. I trust it on ls, on npm test, on reading a file, on writing to a scratch file in /tmp. The downside if it's wrong is microseconds of wasted compute and a confused next turn. There is no scenario where intercepting a grep pays off — the action changes nothing, so there is nothing to undo, and the friction of a hook strictly costs more than it saves.

I do not trust the model on rm -rf. Or git push --force. Or anything that runs a schema migration against a live DB. Or git commit --amend on a shared branch. Or gh pr merge. The downside if any of those goes wrong is hours of recovery work, lost data, or — for the migration case — recovery that isn't even possible. The friction of a hook is irrelevant compared to the cost of a single miss.

The heuristic that came out of this is reversibility. Before adding a hook, I ask: if the model gets this action wrong, what does it cost me to undo it? If the answer is "nothing, I'll just tell it to try again," no hook. If the answer is "open a P0 incident," hook.

The Claude Code hooks reference spells out the mechanism — PreToolUse fires after the model has chosen a tool and built the parameters but before the tool runs, and it can allow, deny, ask, or defer. Which is exactly the right shape. The hook doesn't argue with the model's reasoning; it intercepts at the boundary where reasoning becomes consequence. That distinction matters: a hook is not a second model checking the first. It's a mechanical gate at the moment of irreversibility.
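To make that concrete, here is roughly what a PreToolUse hook for a Bash call sees on stdin at that boundary: not the model's reasoning, just the tool name and the exact parameters it built. The field names are from my reading of the hooks reference, so treat the shape as approximate.

```json
{
  "hook_event_name": "PreToolUse",
  "tool_name": "Bash",
  "tool_input": {
    "command": "git push --force origin main",
    "description": "Push the rebased branch"
  }
}
```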

The cleanest version of my settings now has three hooks. One PreToolUse matcher on Bash that pattern-matches on rm -rf, git push --force, and a small list of destructive commands; it asks me. One on anything that touches migrations/. One on gh pr merge. That's it. Everything else — Edit, Write, Read, Grep, any Bash command that doesn't match those patterns, npm, the entire MCP toolbox — runs without me in the loop. The agent is fast again.
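For reference, a sketch of roughly what that wiring looks like in settings.json. The hooks structure and the matcher regex follow the hooks reference; the script names are mine, and in this sketch the gh pr merge check lives inside the same Bash guard script.

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "python3 .claude/hooks/guard_bash.py" }
        ]
      },
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "python3 .claude/hooks/guard_migrations.py" }
        ]
      }
    ]
  }
}
```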

The interesting move that made this work: I added each hook only after a real miss. Not because I imagined a failure mode. Because one happened. That's a different discipline from "list every dangerous tool and gate it" — it constrains me to the actual blast radius the model has demonstrated. Imagined failures produce paranoid configs; real ones produce minimum-viable gates.

There's a related idea from HumanLayer's writing about human-in-the-loop: the gate should be at the act, not at the plan. Reviewing every plan slows the agent to chat speed. Reviewing every irreversible action keeps the agent autonomous on everything else and pulls you in only at the moments that matter.

What I'd do differently next time

If I were starting fresh on a new repo, I'd ship one hook: a PreToolUse on Bash that asks on a small explicit list of destructive command patterns — rm -rf, git push --force, git reset --hard on a remote-tracked branch, anything matching DROP TABLE or TRUNCATE. Nothing else, day one.
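A minimal sketch of what that day-one script could look like, assuming a Python hook wired in as above. The pattern list is the one from this post; the output schema is my reading of the hooks reference, so check the exact field names against the current docs before copying it.

```python
#!/usr/bin/env python3
"""Hypothetical day-one PreToolUse guard for Bash.

Asks for confirmation on a short, explicit list of destructive command
patterns and stays silent on everything else.
"""
import json
import re
import sys

DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\s+(--force|-f)\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bdrop\s+table\b",
    r"\btruncate\b",
]

payload = json.load(sys.stdin)  # PreToolUse input arrives as JSON on stdin
command = payload.get("tool_input", {}).get("command", "")

if any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS):
    # Ask the human; everything that doesn't match falls through untouched.
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "ask",
            "permissionDecisionReason": "Command matched a destructive pattern.",
        }
    }))

sys.exit(0)  # no match: exit quietly and let the normal permission flow decide
```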

Then I'd watch. The next time the agent did something I didn't want and the cost of undoing it was real, I'd add that specific gate — not a category of gates, the one. Repeat for a couple of weeks. By the end, I'd have four or five hooks, each tied to a specific incident, and zero hooks on cheap operations.

The mental model I'd carry into a new project: a hook is not a safety feature, it is a tax. The tax is paid by every legitimate use of the gated tool, not just the misuses. So the gate is only worth installing when the cost of one misuse exceeds the cumulative cost of taxing every legitimate use forever. For rm -rf, that math is easy. For grep, it's the opposite — and I'd resist the instinct to "be safe" by adding a hook there, because being safe in that direction is what produced the forms-wizard version in the first place.

What I'm still unsure about

The piece I haven't figured out is whether the model can be trusted to self-rate its confidence on a given action — and whether, if it could, we could push some of the trust decision into the model rather than enforcing it from outside. Anthropic's writing on long-running harnesses introduces a separate evaluator agent for exactly the reason that models tend to praise their own work; calibrated self-evaluation seems to require a different agent or scoring rubric, not just asking nicely.

But I can imagine a version of PreToolUse where the model has to declare its confidence and the irreversibility of the action before the tool fires, and the hook is just a function over those two values. I don't have a working version of that. I'm not sure if it would beat the dumb pattern-match-on-rm -rf approach. The dumb version is at least legible.
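For what it's worth, the shape I imagine is a hook reduced to a pure function over two values the model declares before the tool fires. This is entirely hypothetical: neither input exists in the current tool-call flow, and the thresholds are invented.

```python
# Purely hypothetical sketch: the model declares its confidence and the
# action's irreversibility before the tool fires, and the hook is just a
# function over those two values. Neither input exists in my setup today.
def gate(confidence: float, irreversibility: float) -> str:
    """Map self-declared confidence (0-1) and irreversibility (0-1) to a decision."""
    if irreversibility > 0.8:
        return "ask"    # near-irreversible: always pull a human in
    if confidence < 0.5 and irreversibility > 0.3:
        return "ask"    # unsure, and non-trivial to undo
    return "allow"      # cheap to undo, or the model is confident
```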

References

  • Claude Code docs, Hooks reference — the canonical description of PreToolUse, the four decision outcomes (allow, deny, ask, defer), and the matcher syntax. This is the reference I came back to every time I was about to add or remove a hook. https://code.claude.com/docs/en/hooks
  • Anthropic, Effective harnesses for long-running agents — introduces the planner / generator / evaluator three-agent design and is candid about the failure mode that motivated it: agents asked to evaluate their own work tend to confidently praise it. That's the empirical case for why intercepts can't just be "ask the model nicely." https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
  • Hamel Husain, LLM Evals: Everything You Need to Know — Hamel's argument that evals are the closest thing the field has to a load-bearing trust mechanism, and that you build them by error-analyzing real failures rather than imagining them. The "look at twenty real outputs first" discipline is what convinced me to gate on observed misses, not imagined ones. https://hamel.dev/blog/posts/evals-faq/
  • Inspect AI (UK AI Security Institute), Inspect framework documentation — the framework I keep returning to when I want to see what systematic verification looks like end-to-end. Reading the agent-evaluation tooling here changed how I think about what a hook is even trying to approximate. https://inspect.aisi.org.uk/
  • HumanLayer, humanlayer/humanlayer on GitHub — Dex Horthy's project, built on the lesson that an early prototype almost dropped a SQL table because the agent had unsupervised control. The framing — "let agents notify humans for approval before taking significant actions" — is the cleanest articulation I've found of why the gate belongs at the act, not the plan. https://github.com/humanlayer/humanlayer
  • Latent Space, Claude Code: Anthropic's Agent in Your Terminal (with Boris Cherny and Cat Wu) — the design-philosophy episode. The "smallest building blocks that are useful, understandable, and extensible" line is the one I think about whenever I'm tempted to over-configure my hooks. https://www.latent.space/p/claude-code

Harness Engineering · Part 6 of 10. Previous: Skill, slash command, sub-agent, or just a better prompt?. Next: My agent remembers me across sessions.
