Krosoft

Agent Harnesses Beat Agent Theater

Today’s strongest AI discourse pointed in the same direction: agent gains are increasingly coming from better harnesses, evaluation loops, and document-preparation workflows rather than from autonomy theater. The practical lesson is to invest in scaffolding, normalization, and human review bounda...

Executive Summary

The most important shift in today’s AI discourse is that agent progress looks less like a race toward full autonomy and more like a discipline of better scaffolding. Across a strong practitioner writeup from AI Tinkerers’ Post-Training, a fresh Simon Willison eval note, and a practical Nate B Jones workflow video, the lesson was the same: builders are getting better results by tightening harnesses, normalizing context, and keeping humans at the decision boundary.

What Happened

The clearest artifact was Coral Bricks’ Post-Training essay on building a domain-specific agent harness for finance tasks (AI Tinkerers / Post-Training). The headline was not a new model release but a harness result: the Reef-based setup reportedly reached an 82.6% pass rate on Vals AI Finance Agent v2 with Kimi K2.6, versus 44.87% for the reference harness and 49.8% for a retrieval-only variant using the same underlying data. That is a gap of roughly 38 percentage points over the reference harness with the model held constant. The claimed gains came from structure: versioned skills, lazy loading, typed bindings, planner-versus-specialist separation, and runtime constraints that block obvious failure modes.

That matters because it fits a pattern that keeps getting stronger. The article describes a previous setup that had ballooned into thousands of lines of prompt instructions and dozens of overlapping tools. The improvement came from moving conventions out of prompt prose and into code, interfaces, and execution rules.
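
To make "moving conventions into code" concrete, here is a minimal Python sketch of the idea, not the Reef harness itself. Every name in it (Skill, Harness, the "read"/"write" kinds, the two example skills) is hypothetical and chosen only for illustration. It shows three of the listed ingredients in miniature: versioned skills, lazy loading, and a runtime constraint that blocks a class of actions regardless of what the prompt says.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Skill:
    """A versioned, typed tool binding (hypothetical shape)."""
    name: str
    version: str
    kind: str  # "read" or "write"; an assumed convention for this sketch
    load: Callable[[], Callable[[str], str]]  # built only on first use


class Harness:
    """Enforces rules in code instead of describing them in prompt prose."""

    def __init__(self, skills: Sequence[Skill], blocked_kinds: Sequence[str] = ("write",)):
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}
        self._loaded: Dict[str, Callable[[str], str]] = {}
        self._blocked = frozenset(blocked_kinds)

    def call(self, skill_name: str, payload: str) -> str:
        skill = self._skills[skill_name]  # unknown skills fail with KeyError
        if skill.kind in self._blocked:
            # Runtime constraint: the model cannot talk its way past this.
            raise PermissionError(f"{skill.name}@{skill.version} is blocked ({skill.kind})")
        if skill_name not in self._loaded:
            self._loaded[skill_name] = skill.load()  # lazy loading
        return self._loaded[skill_name](payload)

    def loaded_names(self) -> List[str]:
        return sorted(self._loaded)


def _load_ratio_skill() -> Callable[[str], str]:
    def ratio(payload: str) -> str:
        numerator, denominator = (float(x) for x in payload.split("/"))
        return f"{numerator / denominator:.4f}"
    return ratio


def _load_filing_skill() -> Callable[[str], str]:
    return lambda payload: "filed"


harness = Harness([
    Skill("ratio", "1.0.0", "read", _load_ratio_skill),
    Skill("file_return", "0.1.0", "write", _load_filing_skill),
])
print(harness.call("ratio", "3/4"))
print(harness.loaded_names())
```

The design point is that the "never file on the user's behalf" rule lives in one conditional that can be tested, rather than in a paragraph of instructions the model may or may not follow.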

That same pattern showed up in a delayed-discovery Simon Willison note about evaluating Datasette Agent prompts with DSPy (Simon Willison). The useful part was not “DSPy as magic,” but the use of a gold dataset, real tools, and custom metrics to catch a concrete failure mode: the agent guessed column names because it had incomplete schema visibility and guidance that discouraged repeated inspection. Again, the story was not prompt cleverness in isolation. It was that evaluation infrastructure made the defect visible.
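
The kind of check described here can be sketched without any framework. The following is an illustrative Python example that assumes a SQLite-backed setup; it does not use DSPy and is not Simon Willison's code. The gold examples, the guessing_agent stub, and the function names are all hypothetical. The metric asks one narrow question: does the SQL the agent produced compile against the real schema, or did it reference a column that does not exist?

```python
import sqlite3
from typing import Callable, Dict, List

SCHEMA = "CREATE TABLE trips (id INTEGER, rider_name TEXT, fare REAL)"

# A tiny hypothetical gold dataset: each example pairs a question with a schema.
GOLD: List[Dict[str, str]] = [
    {"question": "List rider names", "schema": SCHEMA},
    {"question": "List fares", "schema": SCHEMA},
]


def columns_resolve(schema_sql: str, candidate_sql: str) -> bool:
    """Custom metric: True if the candidate query compiles against the schema.

    SQLite raises OperationalError for unknown columns or tables (and for
    syntax errors), so a guessed column name shows up as a failure.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema_sql)
        conn.execute("EXPLAIN " + candidate_sql)  # compiles without running it
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


def guessing_agent(question: str) -> str:
    """Stand-in for an agent that guesses 'name' instead of inspecting the schema."""
    if "names" in question:
        return "SELECT name FROM trips"  # wrong: the column is rider_name
    return "SELECT fare FROM trips"


def pass_rate(gold: List[Dict[str, str]], agent: Callable[[str], str]) -> float:
    passed = sum(columns_resolve(ex["schema"], agent(ex["question"])) for ex in gold)
    return passed / len(gold)


print(pass_rate(GOLD, guessing_agent))
```

A metric this small is useful because it turns "the agent sometimes seems confused about the schema" into a number that moves when the prompt or tool guidance changes.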

The newest supporting signal came from Nate B Jones, who argued that the durable opportunity for agents is not theatrical end-to-end automation but high-trust paperwork preparation (YouTube). His proposed agent skeleton was explicit: ingest documents, chunk them, normalize them, retrieve with citations, assemble a packet, and leave the final act to a human. In his framing, the valuable agent is the one that turns messy evidence into a reviewable case file for an insurance appeal or tax handoff, not the one that “clicks the button” on your behalf.
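
The skeleton above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not Nate B Jones's implementation: the word-overlap retrieval, the eight-word chunk size, the sample documents, and every name (Chunk, Packet, assemble, submit) are hypothetical. The part worth noticing is the last step, where submission fails unless a human has approved the packet.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int
    text: str


def normalize(text: str) -> str:
    """Lowercase, drop punctuation except hyphens, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s-]", "", text.lower())
    return " ".join(cleaned.split())


def chunk(doc_id: str, text: str, size: int = 8) -> List[Chunk]:
    words = normalize(text).split()
    return [
        Chunk(doc_id, i // size, " ".join(words[i:i + size]))
        for i in range(0, len(words), size)
    ]


def retrieve(chunks: List[Chunk], query: str) -> List[Chunk]:
    """Naive retrieval: keep chunks sharing at least one word with the query."""
    terms = set(normalize(query).split())
    return [c for c in chunks if terms & set(c.text.split())]


@dataclass
class Packet:
    claim: str
    evidence: List[Chunk]
    approved: bool = False  # only a human reviewer should flip this

    def citations(self) -> List[str]:
        return [f"{c.doc_id}#{c.index}" for c in self.evidence]


def assemble(claim: str, docs: Dict[str, str], query: str) -> Packet:
    chunks = [c for doc_id, text in docs.items() for c in chunk(doc_id, text)]
    return Packet(claim=claim, evidence=retrieve(chunks, query))


def submit(packet: Packet) -> str:
    if not packet.approved:
        raise PermissionError("a human must review and approve the packet first")
    return f"submitted with {len(packet.evidence)} citations"


docs = {
    "denial_letter": "Claim 123 was DENIED for   missing pre-authorization.",
    "policy": "Pre-authorization is not required for emergency care.",
}
packet = assemble("Appeal denial of claim 123", docs, "pre-authorization")
print(packet.citations())
```

Calling submit(packet) at this point raises PermissionError. Only after a reviewer sets packet.approved = True does the final act go through, which keeps the agent's job at preparation and the decision with the human.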

Why It Matters

This is a useful correction to a lot of agent discourse. The field still produces plenty of demos that imply autonomy is the product. Today’s stronger evidence says the product is often disciplined preparation: better context packaging, narrower tool surfaces, explicit checks, and evals that reveal where the system actually breaks.

That does not weaken the agent story; it sharpens it. If a model can be made materially more reliable through harness design, then the center of gravity shifts from model-selection hype toward workflow engineering. The practical frontier becomes: what information is available, how it is normalized, which actions are forbidden, what gets logged, and where a human must still review. That is a more boring story than “fully autonomous agents,” but it is also the one with clearer signs of compounding progress.

It also reinforces an emerging canon in this digest: a lot of current AI leverage is coming from system design around the model, not just from the model itself. Better scaffolding is starting to look like the durable advantage.

Workflow Implications

For builders, the near-term takeaway is concrete.

First, treat the harness as a product surface. If your agent depends on a giant prompt and a wide-open tool list, you are probably hiding failures rather than eliminating them.

Second, evaluate against real tasks and real artifacts. The Simon Willison example is important precisely because it exposed a small, embarrassing, very fixable defect that anecdotal prompting could have missed.

Third, aim agents at preparation-heavy workflows before execution-heavy ones. Document normalization, citation, packet assembly, and handoff design appear to be more robust value zones than unsupervised submission, payment, filing, or irreversible action.

Finally, cheaper models may become more viable than expected when context is clean and the task boundary is narrow. That is one of the quiet but important subtexts in both the harness and paperwork-prep framings.

Further Reading

  • Simon Willison’s delayed-discovery note on a thin coding-agent shell built on his llm framework is worth skimming for its inspectable artifact chain (spec, commits, tests, transcript, and README) rather than for any single benchmark claim: https://simonwillison.net/2026/Jul/2/llm-coding-agent/