I built my own agent harness over a weekend. Here's what I got wrong.
A while-loop, an LLM, and a tool registry. What could go wrong? Most of it.
Harness Engineering · Part 9 of 10. Previous: Why Claude Code feels different from Cursor. Next: Are skills the new programming primitive?
Saturday morning. Coffee, blank harness.py, one question I'd been dodging for eight posts: if I had to write the loop myself — no Claude Code, no SDK helpers beyond the raw messages.create — how much would I actually need? I'd been writing about hooks and memory and sub-agents like they were obvious additions. But I'd never sat down and asked which of them were load-bearing on day one and which were scar tissue from problems I hadn't hit yet. The point of the weekend was to find out.
What I tried first
I started elegant. Of course I did.
Forty minutes in I had an Agent base class, a Tool ABC with schema() and run() methods, a MessageBus that wrapped anthropic.Anthropic() so I could "swap providers later," a HookRegistry with pre_tool, post_tool, and on_error slots, and a Memory interface with two implementations (InMemory, JSONLFile) behind it. I had not yet made a single API call. The README in my head was beautiful.
Around hour two I tried to actually run the thing. The agent's job was small: read a file, count the TODOs, write a summary to disk. Three tools — read_file, write_file, list_dir. The first call failed because my Tool.schema() returned a Pydantic model and messages.create wanted a plain dict. I added a serializer. The second call failed because my HookRegistry was async and my tools were sync, so pre_tool never awaited. I added an event loop. The third call worked, except the model called read_file with {"path": "./TODO.md"} and my "elegant" path-resolver decided ./ should be relative to the Memory working directory, not the cwd. The agent insisted the file didn't exist. I insisted it did. We argued for twenty minutes.
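That first failure, the Pydantic-model-versus-plain-dict mismatch, turns out to be a one-liner once you know where to look. A minimal sketch, assuming Pydantic v2 (where `model_json_schema()` exists); `ReadFileInput` and `to_tool_schema` are hypothetical names for illustration, not from my harness:

```python
from pydantic import BaseModel


class ReadFileInput(BaseModel):
    # Hypothetical input model for the read_file tool from the story above
    path: str


def to_tool_schema(name: str, description: str, model: type[BaseModel]) -> dict:
    """Convert a Pydantic input model into the plain dict messages.create expects."""
    return {
        "name": name,
        "description": description,
        # model_json_schema() emits a JSON-serializable dict, not a model instance
        "input_schema": model.model_json_schema(),
    }


schema = to_tool_schema("read_file", "Read a file and return its contents.", ReadFileInput)
```

The serializer I "added" was essentially this function; the API contract wants dicts all the way down.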
By Sunday morning I had deleted half of it. The MessageBus was a one-line wrapper around a function I already had. The HookRegistry had no second hook to register. Memory had one implementation and a TODO comment for the other. The abstractions were predicting a future I hadn't lived yet, and they were getting in the way of the present I was trying to debug.
What clicked
The minimum viable harness is shorter than the imports section of what I'd written.
Here is the skeleton I ended up with — a real loop, not pseudocode. Python, Anthropic SDK, one tool. It runs.
```python
import anthropic
import subprocess

client = anthropic.Anthropic()

TOOLS = [{
    "name": "run_bash",
    "description": "Run a bash command and return stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"cmd": {"type": "string"}},
        "required": ["cmd"],
    },
}]


def run_bash(cmd: str) -> str:
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
    return (r.stdout + r.stderr)[:4000]


def step(messages):
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        return messages, True
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            out = run_bash(**block.input) if block.name == "run_bash" else "unknown tool"
            results.append({"type": "tool_result", "tool_use_id": block.id, "content": out})
    messages.append({"role": "user", "content": results})
    return messages, False


def agent(task: str, max_turns: int = 25):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        messages, done = step(messages)
        if done:
            return messages
    return messages


if __name__ == "__main__":
    agent("Find every TODO in this repo and summarize them.")
```
That's the whole thing. Roughly eighty lines once you add a system prompt, basic logging, and a second tool. An LLM call. A tool registry (a list). A while loop with a turn cap. Anthropic's Building Effective Agents says it out loud — the simplest agent is a model in a loop with tools, and most production systems don't need more. For a weekend, they really don't.
What it taught me was which of the things I'd been writing about were earned and which were premature.
Hooks were earned the moment I added a second tool. Once run_bash could rm anything, I wanted an "are you sure" gate before destructive commands. That's a pre-tool hook. I didn't need a registry — I needed an if statement. The hook abstraction is real, but it shows up when you've copy-pasted the same if three times across three tools, not before.
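That if statement is small enough to show whole. A sketch assuming the skeleton's run_bash; the `DESTRUCTIVE` pattern and `guarded_run_bash` name are mine for illustration, and the confirm callback defaults to plain `input`:

```python
import re
import subprocess

# Same run_bash as the skeleton above
def run_bash(cmd: str) -> str:
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
    return (r.stdout + r.stderr)[:4000]

# Commands I want a human in front of; extend to taste
DESTRUCTIVE = re.compile(r"\brm\b|\bgit\s+reset\b|\bdrop\s+table\b", re.IGNORECASE)


def guarded_run_bash(cmd: str, confirm=input) -> str:
    """The entire 'pre-tool hook': check, maybe ask, then run."""
    if DESTRUCTIVE.search(cmd):
        if confirm(f"Run destructive command {cmd!r}? [y/N] ").strip().lower() != "y":
            return "blocked: user declined destructive command"
    return run_bash(cmd)
```

Point the tool dispatch at guarded_run_bash instead of run_bash and you have the gate with no registry in sight.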
Memory was earned the moment a turn produced a useful artifact a later turn needed. The first time the agent ran the test suite, summarized the failures, and then thirty turns later asked to re-run the tests because it had forgotten the summary, I added a scratch.md file and a read_scratch tool. That's it. No Memory interface. A file. The "memory" abstraction is a file with manners.
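The file-with-manners version fits in a few lines. A sketch: scratch.md and read_scratch are from the story above, while write_scratch and the exact tool schemas are my additions for symmetry:

```python
from pathlib import Path

SCRATCH = Path("scratch.md")  # the entire "memory subsystem"

SCRATCH_TOOLS = [
    {
        "name": "write_scratch",
        "description": "Append a note to the shared scratch file.",
        "input_schema": {
            "type": "object",
            "properties": {"note": {"type": "string"}},
            "required": ["note"],
        },
    },
    {
        "name": "read_scratch",
        "description": "Read back everything written to the scratch file.",
        "input_schema": {"type": "object", "properties": {}},
    },
]


def write_scratch(note: str) -> str:
    # Append-only: later turns can re-read what earlier turns learned
    with SCRATCH.open("a") as f:
        f.write(note.rstrip() + "\n")
    return "noted"


def read_scratch() -> str:
    return SCRATCH.read_text() if SCRATCH.exists() else "(scratch is empty)"
```

Concatenate SCRATCH_TOOLS onto the skeleton's TOOLS list, dispatch on block.name, and the agent can survive its own forgetfulness.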
Sub-agents were earned the moment one task generated so much intermediate noise that the parent loop drowned. I asked the agent to refactor a module — it read fifteen files, ran tests, made edits, ran tests again. By turn twenty the parent context was full of file contents I didn't care about. I wanted the parent to delegate "do this refactor, return a summary" and never see the fifteen files. That is what a sub-agent is for: context isolation. Not parallelism, not specialization. A fresh window with a return value. Until I felt that pain, the sub-agent was a feature in search of a problem.
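A fresh window with a return value is also small enough to sketch. The names run_subagent and run_loop are hypothetical; run_loop stands in for any model-in-a-loop function that takes an initial message list and returns the final one (the skeleton's agent, adapted to accept messages, would do):

```python
def run_subagent(task: str, run_loop) -> str:
    """Context isolation in one function: a fresh message list in, one string out.

    The parent never sees the child's intermediate turns; the fifteen file
    reads and test runs stay inside the child's own context window.
    """
    final_messages = run_loop([{"role": "user", "content": task}])
    # Walk backwards to the last assistant message and return only its text
    for msg in reversed(final_messages):
        if msg["role"] == "assistant":
            content = msg["content"]
            if isinstance(content, str):
                return content
            # Content blocks are shown here in dict form; the real SDK returns
            # block objects with .type/.text attributes instead
            return "".join(b["text"] for b in content if b.get("type") == "text")
    return "(sub-agent produced no answer)"
```

Expose this as a delegate_task tool in the parent's registry and the parent's transcript gains one summary string per delegation instead of fifteen files.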
The pattern was the same every time. Each abstraction I'd written about over the past eight posts mapped to a specific moment when the eighty-line loop stopped being enough. The mistake on Saturday was building the abstractions before I'd lived the moments.
What I'd do differently next time
Start uglier.
Specifically: write the loop inline. One file. Hard-code the model name. Hard-code the tools. No classes, no interfaces, no config. Make a real API call by hour one, not hour three. Let the messy parts stay messy until the third copy-paste hurts.
The "third copy-paste hurts" rule is the one I've internalized hardest. The first time I write a try/except around a tool call, it's a try/except. The second time, it's still a try/except — maybe I notice the duplication and shrug. The third time is when I have evidence: this is a pattern, here is what it looks like across three real tools, now I can name the abstraction and it will fit. Before the third instance, every abstraction is a guess; after, it's a refactor. Dex Horthy's "12-Factor Agents" puts a sharper version of this: most of an agent is just software, and the parts that aren't are smaller than you think. Start with software. Add agent shape only where it earns its keep.
The other thing I'd do differently: keep the "elegant" Saturday version in a branch. Not because I'll merge it. Because looking at what I deleted on Sunday is the cheapest way to remember which abstractions are mine to add and which are someone else's job.
What I'm still unsure about
How much of the eighty-line harness survives once frameworks cover the basics well? Smolagents already fits agent logic in ~1000 lines. The OpenAI Agents SDK (the production successor to Swarm) ships handoffs and tracing as primitives. If the floor keeps rising, the value of writing your own harness drops — except for the part that taught me what I just wrote. There's a learning artifact and a production artifact, and I'm not sure they should be the same code.
I'm also unsure where the line is between "tool registry" and "skill." A tool is a function the model can call. A skill, in Claude Code's sense, is a folder of instructions plus tools that loads on demand. Is the second one just a tool registry with lazy loading and prose? That's the question I want to chew on in Post 10.
References
- Anthropic, Building Effective Agents — the post that gave me permission to delete the MessageBus. The line "find the simplest solution and only increase complexity when needed" is the one I taped above the desk on Sunday morning. https://www.anthropic.com/research/building-effective-agents
- Anthropic API Docs, Tool use overview — the actual contract: how tool_use blocks come back, how tool_result goes in, why client-executed tools require you to drive the loop. Re-reading this on Saturday is what made the eighty-line skeleton possible. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Anthropic Cookbook, Customer Service Agent with Client-Side Tools — the runnable notebook I copied the loop shape from. Strip the customer service domain off it and you're left with the same model-tools-loop pattern. https://github.com/anthropics/claude-cookbooks/blob/main/tool_use/customer_service_agent.ipynb
- Dex Horthy, 12-Factor Agents — the principles repo that named what I was feeling about my Saturday version: most of an agent is just software, and the framework-y parts are smaller than they look. The "own your control flow" factor is the one I felt in my bones by Sunday. https://github.com/humanlayer/12-factor-agents
- HuggingFace, smolagents — the existence of a 1000-line agent framework is itself the argument. If the entire library fits in 1000 lines, my eighty are not unreasonable; they're the same idea with one less layer of paint. https://github.com/huggingface/smolagents
- OpenAI, Swarm — the educational multi-agent framework (now superseded by the Agents SDK). Reading Swarm's source on Sunday afternoon is what convinced me sub-agents are just "loops calling loops with a return value," not a separate concept. https://github.com/openai/swarm