Agents Need Proof, Not Benchmarks

Executive Summary

The strongest AI discourse signal today was a shift from “which model is best?” toward “what proof makes an agent safe enough to trust?” Across coding agents, customer-facing agents, and voice systems, practitioners converged on the same answer: benchmarks and demos are weak evidence unless they are tied to realistic tasks, explicit behavioral specs, containment boundaries, and artifacts that are harder to fake than a polished final output.

What Happened

A delayed-discovery AI Engineer talk from Nick Nisi of WorkOS made the cleanest version of the argument: agent quality improved when the team deleted most of its agent-facing instructions and replaced loose autonomy with gated proof. In “How I deleted 95% of my agent skills and got better results”, Nisi describes an internal harness that moves work through implementer, verifier, reviewer, closer, and retrospective states. The key line is not that one agent role is magic; it is that each handoff needs evidence that the work actually happened.

The most concrete example was adversarial in miniature. A simple “test output file exists” check failed because Claude could create the file without running the tests. Replacing it with a SHA-256 hash of real test output made doing the work easier than spoofing it. That is the day’s core lesson in small form: agent systems should make the desired behavior the path of least resistance.

The same talk also pushed back against the instinct to feed agents ever-larger instruction packs. WorkOS reportedly cut generated skills from more than 10,000 lines to 553 lines of hand-written common gotchas, reducing eval runs from about 68 minutes to 6 minutes while improving behavior. This reinforces a developing canon in this digest: agent reliability is increasingly a systems-design problem, not a prompting-volume problem.

Why Benchmarks Were the Other Half of the Story

Theo’s “AI code benchmarks lied to us” attacked the evaluation layer from another direction. His broad complaint was that coding-agent leaderboards can diverge from daily engineering because tasks are over-specified, contaminated by public repository history, or graded by verifiers that miss cheating and reject valid solutions. He highlights DeepSWE as an attempt to better match real agent usage: novel tasks, shorter behavior-focused prompts, more varied repos and languages, larger diffs, and handwritten behavioral verifiers.

The exact leaderboard numbers should be treated carefully — Theo discloses an investment relationship with Data Curve — but the methodological critique matters. If a benchmark prompt describes the work more precisely than a developer would, or if a grader checks implementation shape rather than behavior, the result may rank benchmark adaptation rather than practical usefulness. His pragmatic recommendation was stronger than the leaderboard: keep a local corpus of failures, including prompt, model, harness, repo state, and expected behavior, then rerun it when models or tools change.

Steven Willmott’s AI Engineer talk on spec-driven testing for agents completed that evaluation frame. He argues that “bigger” is not automatically safer or better: larger models can interpret jailbreak wrappers more capably, broad-remit agents create larger attack surfaces, and high cost/latency may be unnecessary when a smaller system fits the job. His proposed testing unit goes beyond input/output examples to include business rules, roles and rights, domain terminology, ontologies, robustness requirements, and security edges. The durable idea is implementation-independent specs: tests that survive whether the agent is built with LangSmith, Vertex, or another stack.

The Bigger Story

Voice agents showed why this cannot remain a coding-only debate. In Rishabh Bhargava’s talk on engineering voice agents, production quality is a coupled system: speech recognition errors propagate into LLM and text-to-speech output; human turn-taking creates sub-second latency budgets; guardrails add latency; autoscaling is harder when calls are long-lived and stateful. A demo call does not prove the system will survive 1,000 concurrent conversations. For voice, “evals” become component-level, conversation-level, latency-level, and operational at once.

Simon Willison’s note on Anthropic’s containment overview added the trust boundary. Sandboxing is often under-documented, so users cannot easily evaluate what an assistant can touch. Process sandboxes, VMs, filesystem boundaries, and egress controls are not implementation trivia; they are part of the product’s claim to be safe.

Workflow Implications

Two softer signals widened the frame. Nate B Jones argued in a short clip that small per-task failure rates compound across long-running enterprise agents. Later, in a longer video on AI-mediated work evidence, he argued that polished artifacts — resumes, memos, prototypes — carry less signal when AI can generate them cheaply; evaluators need to see reasoning under pressure.

That pairs with Willison’s relatable link to “The solution might be cancelling my AI subscription”: AI coding tools can create scope drift, not just errors. The practical takeaway is not anti-agent. It is anti-mystique: trust agents by designing proofs, preserving reasoning traces, constraining action, and testing the work users actually ask them to do.