Agents Move From Chat to Engineering Surfaces
Executive Summary
The strongest signal today is that “agent” discourse is becoming less about a model answering in a chat box and more about the surfaces that let models do verifiable work: repositories, skills, traces, evals, compute jobs, file APIs, metrics tables, and tool-specific context. The developing canon is shifting from “better prompting unlocks agency” toward “better operating environments make agency reliable.”
What Happened
The clearest example came from Ben Burtenshaw’s AI Engineer talk, “Your Coding Agent Should Do AI System Engineering”. Burtenshaw argues that coding agents are starting to matter closer to the hardware/model stack, not just in ordinary application code. His examples escalate from writing and benchmarking custom CUDA kernels, to fine-tuning a Hugging Face model from a prompt and dataset, to a multi-agent AutoLab setup where researcher, planner, worker, reviewer, and reporter agents search papers, propose hypotheses, run training jobs, submit patches, and track results.
The important part is not the theatrical multi-agent framing. It is the repeated emphasis on open, inspectable primitives: kernel repositories with compatibility metadata, project-maintained skill files, the upskill library for evaluating skills across models, Hugging Face jobs for compute, and Trackio/Parquet as an agent-readable metrics layer. Burtenshaw’s line that agents “work really well with primitives and open primitives” captures the day’s throughline: useful agentic work increasingly depends on environments designed to be read, changed, measured, and re-run by machines.
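To make the "agent-readable metrics layer" idea concrete, here is a minimal sketch of why a flat table of runs is easy for an agent to work with. It is not Trackio's API and it does not read Parquet: plain Python rows stand in for the table, and the column names (`run`, `step`, `loss`) and the `best_run` helper are assumptions made up for illustration.

```python
from typing import TypedDict


class MetricRow(TypedDict):
    # Hypothetical schema; a real metrics table may use different columns.
    run: str
    step: int
    loss: float


def best_run(rows: list[MetricRow]) -> tuple[str, float]:
    """Return the run whose last logged step has the lowest loss."""
    final: dict[str, MetricRow] = {}
    for row in rows:
        prev = final.get(row["run"])
        # Keep only the latest step seen for each run.
        if prev is None or row["step"] > prev["step"]:
            final[row["run"]] = row
    winner = min(final.values(), key=lambda r: r["loss"])
    return winner["run"], winner["loss"]


# Illustrative numbers only, not results from the talk.
rows: list[MetricRow] = [
    {"run": "baseline", "step": 100, "loss": 0.92},
    {"run": "baseline", "step": 200, "loss": 0.81},
    {"run": "patch-a", "step": 100, "loss": 0.88},
    {"run": "patch-a", "step": 200, "loss": 0.74},
]

name, loss = best_run(rows)
```

The point of the sketch is the shape of the data: when every run writes rows to one inspectable table, "which patch helped?" becomes a query rather than a screenshot to interpret.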
That connected directly with Marc Klingen’s Langfuse talk on building a coding-agent skill for observability and evals setup. Klingen’s pattern is progressive disclosure rather than one giant static instruction dump: give the agent style rules, make it ask follow-up questions before major decisions, point it toward current documentation and search endpoints, and let it use APIs or CLIs instead of relying on stale pretraining memory. His practical lessons were similarly grounded: inspect traces first, treat agent questions as observable product signals, start with basic evals rather than none, and reference dynamic docs instead of copying them into a brittle prompt.
Taken together, these two talks make “skills” look less like prompt-engineering garnish and more like maintainable operational packaging: file-based few-shot context, tied to current documentation, measured through traces and evals, and updated as the project changes.
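The progressive-disclosure pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the Langfuse skill itself: the `Skill` class, its field names, the section contents, and the placeholder docs URL are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical skill: a short summary up front, details on request."""
    name: str
    summary: str
    sections: dict[str, str] = field(default_factory=dict)
    docs_url: str = ""  # pointer to live docs instead of a pasted copy

    def initial_context(self) -> str:
        # Only the summary and the section titles are shown at the start.
        titles = ", ".join(sorted(self.sections))
        return (
            f"[{self.name}] {self.summary}\n"
            f"Available sections: {titles}\n"
            f"Docs: {self.docs_url}"
        )

    def disclose(self, section: str) -> str:
        # A section body enters the context only when the agent asks for it.
        if section not in self.sections:
            return f"No section named {section!r}; consult {self.docs_url}"
        return self.sections[section]


skill = Skill(
    name="observability-setup",
    summary="Set up tracing and basic evals; ask before major decisions.",
    sections={
        "style": "Prefer small, reviewable changes.",
        "evals": "Start with a handful of basic checks.",
    },
    docs_url="https://example.invalid/docs",  # placeholder, not a real endpoint
)
```

Calling `skill.initial_context()` yields a compact header, while `skill.disclose("evals")` returns the detail only when it is needed. Keeping a docs pointer rather than a pasted copy is what lets the skill stay current as the project changes.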
The Bigger Story
Google’s post-I/O discourse reinforced the same pattern from a different angle. The Cognitive Revolution episode with Google DeepMind’s Logan Kilpatrick and Tulsee Doshi, “The Model Eats the Scaffolding”, was positioned around Gemini 3.5 Flash, Omni video generation, Gemini Spark, efficiency, agent harnesses, context limits, model psychology, AI welfare, and recursive self-improvement. Without a transcript, the episode’s specific claims should be treated cautiously, but the framing is revealing: even when a lab talks about more capable models, the discourse immediately turns to harnesses, context management, modality routing, and product integration.
Patrick Löber’s AI Engineer demo, “Gemini’s any-to-any multimodal agent”, made that concrete. The current product pattern is not one universal multimodal model doing everything. It is Gemini acting as a reasoning and orchestration layer while specialized models handle images, speech, video, function calls, and code. The demo architecture, a NotebookLM-like assistant over PDFs, images, videos, audio, and YouTube URLs with caching and token budgeting, is another version of the same claim: the system is the product, not just the model.
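Token budgeting across mixed sources can be sketched as a simple greedy packing step. This is an assumption about how such a budget might work, not the demo's actual code: the `Source` type, the `pack` helper, the priority scheme, and the token estimates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    kind: str        # e.g. "pdf", "video", "image"
    est_tokens: int  # hypothetical estimate, not a real tokenizer count
    priority: int    # lower number means more important


def pack(sources: list[Source], budget: int) -> tuple[list[Source], list[Source]]:
    """Greedy packing: take sources in priority order while they fit."""
    included: list[Source] = []
    deferred: list[Source] = []
    remaining = budget
    for src in sorted(sources, key=lambda s: s.priority):
        if src.est_tokens <= remaining:
            included.append(src)
            remaining -= src.est_tokens
        else:
            # Too large for what is left; a real system might summarize it,
            # cache it, or route it to a specialized model instead.
            deferred.append(src)
    return included, deferred


# Illustrative numbers only.
sources = [
    Source("report.pdf", "pdf", 12_000, priority=1),
    Source("talk.mp4", "video", 90_000, priority=2),
    Source("notes.png", "image", 1_000, priority=3),
]
included, deferred = pack(sources, budget=20_000)
```

Here the video is deferred because it alone exceeds the remaining budget, while the smaller image still fits. The deferred list is where orchestration decisions such as caching or handing off to another model would attach.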
This complicates the “model eats the scaffolding” slogan. The model may absorb some workflow glue over time, but today’s practical direction still points toward better scaffolding: richer APIs, clearer tool boundaries, context caches, eval surfaces, and agent-readable state. The scaffolding is not disappearing so much as becoming a first-class engineering object.
Workflow Implications
For operators, the useful takeaway is conservative but actionable. Treat agent quality as a systems problem. If an agent is failing, the fix may not be a cleverer prompt; it may be better docs, searchable examples, a narrower skill, trace inspection, explicit evals, cleaner repositories, or a metrics format the agent can inspect.
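"Start with basic evals rather than none" can be as small as the following sketch. The `run_evals` helper, the `stub_agent` stand-in, and the two cases are hypothetical; a real setup would call an actual agent and log the results to whatever trace or eval tool is in use.

```python
from typing import Callable

# Each case pairs a prompt with a check on the agent's output.
EvalCase = tuple[str, Callable[[str], bool]]


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case and record pass/fail; an exception counts as a failure."""
    results: dict[str, bool] = {}
    for prompt, check in cases:
        try:
            results[prompt] = bool(check(agent(prompt)))
        except Exception:
            results[prompt] = False
    return results


def stub_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real agent call.
    return "pip install example" if "install" in prompt else "I am not sure"


cases: list[EvalCase] = [
    ("how do I install it?", lambda out: "pip install" in out),
    ("which version is current?", lambda out: "not sure" not in out),
]

results = run_evals(stub_agent, cases)
pass_rate = sum(results.values()) / len(results)
```

Even two checks like these turn "the agent feels worse" into a number that can be compared before and after a change to docs, skills, or prompts.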
Nate B Jones’s video, “You’re Not Bad at Prompting. You’re Bad at Defining the Work”, adds the human side of that shift. His claim that frontier tools should be treated more like a senior partner than a junior assistant is mostly a workflow framing, but it fits the larger pattern: as agents get more capable, human leverage moves toward defining the work, setting boundaries, and pointing the system at the right evidence.
A smaller supporting signal came from Theo’s discussion of extension and package supply-chain risk in AI-assisted coding environments, especially around auto-updates and poisoned developer tooling. The incident details need confirmation from direct advisories before they can support a main claim, but the discourse point is durable: agents may amplify both defensive audit passes and attacker leverage. As agentic development moves deeper into engineering systems, the security properties of those systems become part of the agent story.
Further Reading
- Simon Willison’s short note on LLM tokens-per-second simulation is a useful calibration tool for product conversations about latency and perceived speed.
- Willison’s Google I/O note, adjacent to the “Stuff we figured out about AI in 2025” coverage, is worth preserving mainly for its launch-discipline reminder: prefer what can be tried now over preview claims that may change before release.

