May 13, 2026

Shipping agents, not demos

A working note on what separates an LLM toy from something on-call engineers can live with.

The gap between an LLM demo and an LLM in production is wider than most teams budget for. The demo is the prompt. Production is everything else: the eval suite, the trace store, the budget guardrails, the off-switch the support team can actually reach, the runbook for when the model provider has a bad afternoon.

We have learned to budget the demo at 15% of the build and the rest of the work at 85%. It is not a popular ratio in pitch decks. It happens to be the one that keeps customer-facing agents working on a Tuesday morning six months in.

Three things we never skip

Trace from the first commit. If you cannot replay a run with full inputs and outputs, you cannot debug a regression. Adding tracing later is twice the work.
Shadow mode before write mode. Run the agent against real traffic, log what it would have done, do not let it act. Two weeks of shadow data is cheaper than two weeks of incident reviews.
One person on call for the agent. Not a rotation, not a Slack channel — a name. Until that exists, the agent is a side project.
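
A minimal sketch of the first two items, assuming the agent is a plain Python function that turns a request into a proposed action. `TraceStore`, `handle`, and `replay` are hypothetical names for illustration, not a real library, and the in-memory list stands in for whatever trace store you actually run.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    run_id: str
    timestamp: float
    mode: str              # "shadow" or "write"
    request: dict          # full input to the agent
    proposed_action: dict  # full output of the agent
    allowed_to_act: bool   # True only in write mode


class TraceStore:
    """Hypothetical in-memory store; each record is kept as one JSON line."""

    def __init__(self):
        self.lines = []

    def append(self, record):
        self.lines.append(json.dumps(asdict(record), sort_keys=True))

    def load(self, run_id):
        for line in self.lines:
            data = json.loads(line)
            if data["run_id"] == run_id:
                return data
        return None


def handle(request, agent, execute, store, mode="shadow"):
    """Run the agent, always trace, and act only when mode is "write"."""
    proposed = agent(request)
    record = TraceRecord(
        run_id=uuid.uuid4().hex,
        timestamp=time.time(),
        mode=mode,
        request=request,
        proposed_action=proposed,
        allowed_to_act=(mode == "write"),
    )
    # Persist the trace before acting, so a failed write still leaves a record.
    store.append(record)
    if mode == "write":
        execute(proposed)
    return record


def replay(store, run_id, agent):
    """Re-run the agent on the stored input and compare with the stored output."""
    data = store.load(run_id)
    if data is None:
        raise KeyError(run_id)
    return agent(data["request"]) == data["proposed_action"]
```

In shadow mode, `handle` records what the agent would have done and never calls `execute`. `replay` is only a meaningful regression check if the agent is deterministic for a given input, which a real model call usually is not unless you pin the model version and sampling settings; treat a mismatch as a prompt to look, not as proof of a bug.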

None of this is glamorous. All of it is what separates a fortnight of fun engineering from a thing your CEO is happy to mention on a customer call.

