2026-05-06 · 9 min read

The agent hallucinated a tool argument. Whose fault was that?

I blamed the model. Then I read my own tool's JSON schema. It was the bug.

Harness Engineering · Part 3 of 10. Previous: "I asked for one thing. The agent did three." Next: "Cognition vs. Anthropic on multi-agents."


I had a tool called search_logs. The agent called it with query="error AND service:billing" — neat-looking Lucene-ish syntax — and got back zero rows. It tried again with query="error service=billing". Zero. Four more variations, each a confident guess at a query language my tool did not speak. My tool wanted a flat substring. That's it. The agent had invented a DSL out of thin air, watched it fail six times, and was about to invent a seventh. I sat there muttering "why is the model hallucinating" and reached for the prompt to scold it.

What I tried first

My first move was the move I always make when an agent does something dumb: write more words at it. I added a paragraph to the system prompt. "When calling search_logs, use plain substrings only. Do not use boolean operators. Do not use field qualifiers like service:foo. The query is matched as a literal substring against the log line." I felt good about this paragraph. It was specific. It named the failure mode. It even gave examples.

It worked for about two turns and then drifted. The agent would do one clean substring search, find a hit, and on the very next call slip back into status:500 path:/checkout because the previous tool result looked like a structured log line. The shape of the data it had just seen overrode the shape of the instructions I had pinned. I added another reminder. The agent started hedging, calling the tool twice per step, once with a clean substring and once with a structured query. Tool budget burned. Same wrong answers.

I tried renaming the tool. I changed search_logs to grep_logs, hoping the verb would carry the semantics. Better, briefly. Then it regressed. I added a required: true flag and a regex on the input. Now the model just got validation errors and apologized and tried again with the same wrong query. The system prompt couldn't outshout the tool's own description, because the tool's description was part of every call, and my system prompt was several thousand tokens away.
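
Here is roughly what that validation patch looked like, reconstructed from memory; the exact regex is hypothetical, but the failure mode is faithful: a pattern can reject a bad query, it cannot teach the model a good one.

# Reconstruction of the validation attempt; the regex is hypothetical.
search_logs_schema = {
    "name": "search_logs",
    "description": "Search application logs.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query.",
                # Reject boolean operators and field qualifiers outright.
                # The model just gets a validation error back, with no
                # hint about what a valid query actually looks like.
                "pattern": "^(?!.*\\b(AND|OR|NOT)\\b)(?!.*:).*$",
            },
        },
        "required": ["query"],
    },
}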

I was patching the wrong file.

What clicked

I finally opened the JSON schema for my own tool. Here is what was actually being shipped to the model on every turn:

{
  "name": "search_logs",
  "description": "Search application logs.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "The search query." },
      "limit": { "type": "integer", "description": "Max results." }
    },
    "required": ["query"]
  }
}

Read that the way the model reads it. "Search application logs" with a "search query." Of course it tried Lucene. Of course it tried boolean operators. "Search query" is the most overloaded phrase in software; every search box the model has ever seen in training supports some kind of operator syntax. My schema was an open invitation to invent one. The type signature said string, which carries zero intent. The description said "The search query," which is not a description; it is a restatement of the parameter name. I had given the model nothing to work with and then got mad when it filled the vacuum.

The reframe is short and it took me embarrassingly long to internalize: the tool description is the spec. The model reads it on every call. Most "hallucinations" are schema-clarity bugs.

Once I saw it that way, the fix wrote itself:

{
  "name": "search_logs",
  "description": "Find log lines containing a literal substring. Does NOT support boolean operators (AND/OR/NOT), field qualifiers (service:billing), regex, or wildcards. The query is matched case-insensitively against the raw log line. To narrow by service, include the service name as part of the substring (e.g. 'billing timeout').",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "A literal substring to match. Examples of GOOD queries: 'NullPointerException', 'billing timeout', 'user_id=42 error'. Examples of BAD queries that will return zero results because they are interpreted literally: 'error AND timeout', 'service:billing', 'status=5*'."
      },
      "limit": {
        "type": "integer",
        "description": "Max results to return. Default 50. Use a smaller value (5-10) when iterating, larger only when you need full context.",
        "default": 50
      }
    },
    "required": ["query"]
  }
}

Same tool. Same code. Same model. The hallucination stopped on the first run after the schema change. I did not need a sterner prompt. I did not need a smarter model. I needed to write the documentation I had been too lazy to write.

The thing I keep coming back to: a type signature does not carry intent. query: string permits every string. The model has to guess which strings are real, and it guesses from its priors. The description field is the only place I get to say "no, the prior is wrong, here is what counts as a valid call." Type signatures constrain shape. Descriptions constrain meaning. I had been treating the description like a comment, a polite annotation for humans skimming the schema, when it was actually load-bearing product copy. Anthropic's Writing Effective Tools for AI Agents makes this explicit: the description and the parameter docs are the most leveraged thing you can prompt-engineer, because they ride along with every single call. The system prompt is read once. The tool description is read on every invocation, in the most attended position in the window.
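
To make "rides along with every call" concrete, here is a minimal sketch using the Anthropic Messages API; the model name and prompts are illustrative, and the schema is abbreviated from the full version above.

import anthropic

# The corrected schema, abbreviated; the full description text is above.
tools = [{
    "name": "search_logs",
    "description": (
        "Find log lines containing a literal substring. Does NOT support "
        "boolean operators (AND/OR/NOT), field qualifiers (service:billing), "
        "regex, or wildcards."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "A literal substring to match."},
        },
        "required": ["query"],
    },
}]

client = anthropic.Anthropic()

# `tools` is serialized into this request and into every subsequent turn
# of the agent loop, so the description is in context at the moment the
# model decides how to call. That is why fixing it beats another prompt patch.
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system="You are a log-triage assistant.",
    tools=tools,
    messages=[{"role": "user", "content": "Why is billing timing out?"}],
)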

The framing that finally stuck for me: I am not writing an API for engineers, I am writing API documentation for the model itself. The audience is a non-deterministic user who will read my words and then act on them, immediately, with no follow-up question. Every ambiguity I leave is a coin flip. Every example I include is a free constraint. The "wrong call" example in particular — here is a call that looks plausible and will fail — does more work than three paragraphs of prose, because it preempts the exact failure mode the model would otherwise walk into.

This is also where I finally understood why MCP servers in the wild vary so wildly in quality. The protocol just gives you a slot for description and trusts you to fill it. Most servers fill it with a sentence. The good ones fill it with a small, opinionated user manual. The bad ones make their model look stupid.
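
With the official MCP Python SDK the description slot is usually just your function's docstring, which makes the laziness easy: whatever you would have tossed off as a comment becomes the model's entire manual. A sketch, with a hypothetical server name and implementation:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("logs")  # hypothetical server name

@mcp.tool()
def search_logs(query: str, limit: int = 50) -> str:
    """Find log lines containing a literal substring.

    Does NOT support boolean operators (AND/OR/NOT), field qualifiers
    (service:billing), regex, or wildcards. The query is matched
    case-insensitively against the raw log line. GOOD: 'billing timeout'.
    BAD (matched literally, returns nothing): 'error AND timeout'.
    """
    # Hypothetical implementation: a flat, case-insensitive substring scan.
    with open("/var/log/app.log") as f:  # hypothetical log path
        hits = [line for line in f if query.lower() in line.lower()]
    return "".join(hits[:limit])

if __name__ == "__main__":
    mcp.run()

The docstring is what gets exposed over the protocol as the tool's description, so the small, opinionated user manual lives right next to the code it documents.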

What I'd do differently next time

Three rules I now follow, in order of how much they helped me.

Every tool earns one good wrong-call example. Not "here is how to use it." That's table stakes. Here is a call that looks correct and will not work, and here is why. The model has seen a billion search boxes; my one negative example is the only thing standing between it and the wrong prior. I now refuse to ship a tool description without one.

Treat the description like a PR description for an agent. Audience, context, examples, edge cases, what-not-to-do. Same discipline I'd apply to a function I expect a junior engineer to call without asking. If a human teammate could plausibly misuse the tool from reading my description, the model definitely will, and faster.

When the agent makes a wrong call, read the schema before reaching for the prompt. Nine times out of ten the bug is in the tool description, not the system prompt, not the model. The system prompt is where I tell the model what we're doing. The schema is where I tell the model how to do it. Conflating those two is what produced the hours I lost to search_logs.
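
The habit that makes this rule cheap is logging the arguments, not just the outcomes. A small sketch, assuming a harness that records tool calls as dicts; the record shape here is hypothetical, adapt it to whatever yours actually logs.

from collections import Counter

def arg_value_report(calls: list[dict]) -> dict[str, Counter]:
    """Tally every value the agent passed for each tool argument.

    `calls` is assumed to look like
    {"tool": "search_logs", "args": {"query": "error AND timeout"}}.
    """
    report: dict[str, Counter] = {}
    for call in calls:
        for arg, value in call["args"].items():
            key = f'{call["tool"]}.{arg}'
            report.setdefault(key, Counter())[str(value)] += 1
    return report

# Six invented query dialects show up as six distinct values under
# "search_logs.query": a schema bug, visible before anyone blames the model.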

What I'm still unsure about

I don't have a clean intuition for when generic tools beat purpose-built ones. grep and find and bash are unreasonably effective in agent harnesses — partly because the model has seen so much of them in training that the "description" is essentially free. Every well-worn primitive comes with a built-in spec the model already trusts. A purpose-built tool always has to fight to establish its own semantics. So when does the cost of writing a great description for a custom tool beat the cost of letting the model compose from primitives? My current heuristic is: if the action is dangerous, irreversible, or expensive, build a custom tool with strict args. If it's cheap and exploratory, let the model use bash. But that's a vibe, not a rule, and I've been wrong both ways.
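
Sketched as tool definitions, the heuristic looks like this; both tools and the enum values are hypothetical.

# Dangerous, irreversible, expensive: purpose-built, strict args, no free text.
drop_search_index = {
    "name": "drop_search_index",
    "description": (
        "Permanently delete a search index. Irreversible. Takes an exact "
        "index name from a fixed list; there is no wildcard or bulk form."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "index_name": {
                "type": "string",
                "enum": ["staging-logs", "staging-metrics"],  # whitelist, not free text
            },
        },
        "required": ["index_name"],
    },
}

# Cheap and exploratory: a generic primitive whose "description" the model
# effectively learned in training.
bash = {
    "name": "bash",
    "description": "Run a shell command in the sandbox; returns stdout and stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}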

I'm also unsure how this scales when there are forty tools in scope. Anthropic's writing on tool sets is blunt: more tools means worse selection, even with room in the window. I haven't yet figured out how to write descriptions that are simultaneously detailed enough to prevent misuse and short enough to not blow up the toolbox.

References

  • Anthropic, Writing Effective Tools for AI Agents — the post that named what I had been doing wrong. The line that stuck: tool descriptions and parameter docs are the most leveraged piece of prompt engineering you have, because they ride along with every call. https://www.anthropic.com/engineering/writing-tools-for-agents
  • Anthropic, Building Effective Agents — the section on agent-computer interface design is what reframed tool design as a UX problem for me, not an API problem. The core move: invest in tool docs the way you'd invest in human-facing docs. https://www.anthropic.com/research/building-effective-agents
  • Anthropic, Effective Context Engineering for AI Agents — gave me the language for bloated tool sets and ambiguous decision points. The "if a human engineer can't tell which tool to use, the agent can't either" test is now my first review for any toolbox. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • Anthropic API docs, How tool use works — the bit I had been skimming for years: tool use is a contract between your application and the model where you specify what operations are available and what shape their inputs take. The word "contract" is doing more work than I had given it credit for. https://platform.claude.com/docs/en/agents-and-tools/tool-use/how-tool-use-works
  • Model Context Protocol, Specification — the schema is just name, description, inputSchema. The protocol gives you one slot to teach the model your tool. Reading the spec made it obvious why MCP server quality varies so much: the description field is where the work is, and most authors don't do it. https://modelcontextprotocol.io/specification/2025-11-25
  • Dex Horthy / HumanLayer, 12-Factor Agents — Factor 1 (natural language to tool calls) is the framing I now use to explain to teammates why a wobbly tool description shows up as model "hallucination." The agent's job is to convert intent into a schema-valid call; if the schema is vague, the call will be too. https://github.com/humanlayer/12-factor-agents
  • Hamel Husain, How do I evaluate agentic workflows? — the step-level diagnostics framing (tool choice, parameter extraction, error handling) is what convinced me to start logging which tool args the agent invented, not just whether the run passed. Most of my "the model hallucinated" claims dissolved once I looked at the actual arg distributions. https://hamel.dev/blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html

