The agent worked in the demo. In production it looped for 47 minutes.
Demos are not production. This is the gap nobody warns you about.
The task looked straightforward: build an agent that monitors competitor activity, pulls public filings and pricing pages, and surfaces a weekly summary. The kind of thing that takes a junior analyst four hours and a language model, in theory, twenty minutes.
We built it on LangGraph with a ReAct loop. Six tools: web search, URL fetcher, PDF reader, structured data extractor, a simple calculator for number comparisons, and a report writer that formatted the final output. Tested it against a dozen scenarios. It handled every one cleanly. The demo was genuinely impressive — the kind that makes a stakeholder want to shake your hand.
We shipped to production two days later.
Day two. A task came in for a company with an ambiguous name.
Two different companies with similar names, one larger and the other a smaller regional subsidiary. The search tool returned results for both. The agent tried to disambiguate: it fetched more pages, found more ambiguous results, and fetched more pages again. The URL fetcher returned paginated responses, and the agent tried to exhaust the pagination to get the full picture.
There was no max_iterations ceiling in the graph. No exit condition for the case where the task cannot be cleanly resolved. After 47 minutes and 340 tool calls, it was still running when we killed it.
We had tested the happy path. We had not tested what happens when the agent cannot find what it is looking for.
An agent is not a person with judgment. It is a loop with a language model in the middle. When the loop hits something it was not designed for, it does not stop. It tries harder.
The looping problem was fixable. The second problem we found was subtler and took longer to diagnose.
On tasks that ran longer than eight or nine tool calls, the model started fabricating tool arguments. Calling the calculator with numbers that were not in the context. Calling the PDF reader with file paths that had never been returned by any tool. Invoking the report writer mid-task before the data was assembled.
The context window is the agent's only memory. As tool results accumulate, early context ends up further and further behind the most recent output. On a 128k context model this sounds like it should not matter, but in practice the model's attention degrades over very long contexts in ways that show up as reasoning errors rather than obvious failures. By turn 10, facts established in turn 2 were not reliably influencing decisions.
Demo tasks are short, clean, and designed by the people building the system. They resolve in four to six steps. Production tasks are longer, dirtier, and full of edge cases nobody thought to test. That gap is where agentic systems fall apart.
The demo succeeds because you control everything: the task, the tools, the data, the scope. Production removes every one of those controls simultaneously.
What we changed
Set a hard iteration ceiling. Not a suggestion — a hard stop.
Every LangGraph agent we build now has max_iterations set explicitly, with a handler for hitting the ceiling that returns partial work and a structured error state to the caller. Silent infinite loops do not show up in any dashboard until someone notices the bill.
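A minimal sketch of the idea, written in plain Python rather than against LangGraph so it runs on its own. The names AgentResult, run_with_ceiling and step are hypothetical; step stands in for one model-plus-tool turn. In LangGraph itself the built-in ceiling is, to our knowledge, the recursion_limit value in the run config, so treat max_iterations here as a label for the concept rather than a library parameter.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentResult:
    status: str                                   # "done" or "iteration_limit"
    findings: List[str] = field(default_factory=list)
    iterations: int = 0
    error: Optional[str] = None                   # set only when the ceiling is hit


def run_with_ceiling(
    step: Callable[[List[str]], Optional[str]],
    max_iterations: int = 25,
) -> AgentResult:
    """Run `step` until it reports completion or the ceiling is reached.

    `step` (hypothetical) receives the findings so far and returns a new
    finding, or None when the task is finished.
    """
    findings: List[str] = []
    for i in range(max_iterations):
        finding = step(findings)
        if finding is None:
            return AgentResult("done", findings, iterations=i + 1)
        findings.append(finding)
    # Ceiling reached: return the partial work plus a structured error state.
    return AgentResult(
        status="iteration_limit",
        findings=findings,
        iterations=max_iterations,
        error=f"stopped after {max_iterations} iterations without resolving the task",
    )


# A step that never finishes, like the pagination chase described earlier.
def never_done(findings: List[str]) -> str:
    return f"page {len(findings) + 1}"


result = run_with_ceiling(never_done, max_iterations=5)
print(result.status, len(result.findings))  # iteration_limit 5
```

The caller gets a status it can branch on and whatever was collected before the stop, instead of a process that has to be killed by hand.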
Tool schemas have to be airtight.
Agents call tools based on their interpretation of the schema description. If the description is vague, the model interprets it differently under different conditions. Every tool we define now has explicit input types, required fields, and — critically — a description of what the tool does not do. That last part is often what prevents the wrong tool from being called.
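As an illustration, here is roughly what such a definition could look like for the PDF reader, in the JSON-Schema style most tool-calling APIs accept. The field names, the wording of the description and the validate_args helper are assumptions made for this sketch, not our production schema.

```python
from typing import Any, Dict, List

PDF_READER_SCHEMA: Dict[str, Any] = {
    "name": "pdf_reader",
    "description": (
        "Extracts text from a PDF at a URL previously returned by the URL "
        "fetcher. Does NOT search the web, does NOT accept local file paths, "
        "and does NOT summarise the document."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Absolute URL of the PDF."},
            "max_pages": {"type": "integer", "description": "Optional page cap."},
        },
        "required": ["url"],
        "additionalProperties": False,
    },
}

# Only the primitive types this sketch needs.
_PY_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_args(schema: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    params = schema["parameters"]
    props = params["properties"]
    problems = [
        f"missing required field: {name}"
        for name in params["required"]
        if name not in args
    ]
    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected field: {name}")
            continue
        type_name = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_bool = isinstance(value, bool) and type_name != "boolean"
        if wrong_bool or not isinstance(value, _PY_TYPES[type_name]):
            problems.append(
                f"{name} should be {type_name}, got {type(value).__name__}"
            )
    return problems


# A fabricated local path, like the ones the model invented on long tasks.
print(validate_args(PDF_READER_SCHEMA, {"path": "/tmp/report.pdf"}))
# ['missing required field: url', 'unexpected field: path']
```

Validating arguments before the tool runs also gives the agent a precise error to react to when it fabricates a field, rather than a stack trace from inside the tool.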
Test the error paths, not just the happy path.
In demos you control what tools return. In production, tools time out, return empty results, return paginated data, return rate-limit errors. We now maintain a test suite in which every tool is exercised against its failure modes (empty response, timeout, malformed output), and we verify that the agent surfaces a clean error rather than spiralling. Most of the production failures we have seen come from exactly these cases.
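A stripped-down version of that kind of suite might look like the following. The wrapper call_tool_safely, the ToolError type and the fake tools are all hypothetical names for this sketch; a real suite would drive the full agent rather than a single wrapper.

```python
from typing import Any, Callable


class ToolError(Exception):
    """A tool failure the agent should report rather than retry forever."""


def call_tool_safely(
    tool: Callable[[str], Any], arg: str, max_retries: int = 2
) -> dict:
    """Call a tool, retrying timeouts a bounded number of times and
    rejecting empty or malformed output with a clean error."""
    for _attempt in range(max_retries + 1):
        try:
            result = tool(arg)
        except TimeoutError:
            continue  # bounded retry: the loop ends after max_retries + 1 tries
        if not isinstance(result, dict) or "content" not in result:
            raise ToolError("malformed output")
        if not result["content"]:
            raise ToolError("empty response")
        return result
    raise ToolError(f"timeout after {max_retries + 1} attempts")


# Fake tools, one per failure mode, plus one that works.
def times_out(arg: str) -> dict:
    raise TimeoutError()


def returns_empty(arg: str) -> dict:
    return {"content": ""}


def returns_garbage(arg: str) -> str:
    return "<html>502 Bad Gateway</html>"


def works(arg: str) -> dict:
    return {"content": f"fetched {arg}"}


def expect_error(tool: Callable[[str], Any], message: str) -> None:
    """Assert that the wrapper turns this tool's failure into a ToolError."""
    try:
        call_tool_safely(tool, "https://example.com")
    except ToolError as exc:
        assert message in str(exc), str(exc)
    else:
        raise AssertionError("expected ToolError")


expect_error(times_out, "timeout after 3 attempts")
expect_error(returns_empty, "empty response")
expect_error(returns_garbage, "malformed output")
assert call_tool_safely(works, "https://example.com")["content"] == "fetched https://example.com"
```

The point of the fakes is that each failure mode has a named, repeatable trigger, so a regression in error handling shows up in the test run and not in production.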
Compress intermediate state explicitly.
For long-running tasks we added a compression step every four to five tool calls — a short model call that summarises what has been found so far and what remains to be done, replacing the raw tool outputs in context. This keeps the working memory small and the reasoning grounded. Without it, context window drift starts showing up as fabrication around turn 8.
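The mechanics can be sketched as a small working-memory object. WorkingMemory and stub_summarise are hypothetical names; in a real system the summariser would be the short model call described above, and the stub here only stands in for it so the example runs.

```python
from typing import Callable, List


class WorkingMemory:
    """Holds a running summary plus the raw tool outputs since the last one."""

    def __init__(self, summarise: Callable[[str, List[str]], str], every: int = 4):
        self.summarise = summarise
        self.every = every          # compress after this many tool calls
        self.summary = ""
        self.raw: List[str] = []

    def add_tool_output(self, output: str) -> None:
        self.raw.append(output)
        if len(self.raw) >= self.every:
            # Fold the previous summary and the raw outputs into a new
            # summary, then drop the raw outputs from context.
            self.summary = self.summarise(self.summary, self.raw)
            self.raw = []

    def context(self) -> List[str]:
        """What would be sent to the model on the next turn."""
        head = [f"SUMMARY SO FAR: {self.summary}"] if self.summary else []
        return head + self.raw


def stub_summarise(previous: str, outputs: List[str]) -> str:
    # Placeholder for a model call: keeps only a count of what was condensed.
    note = f"{len(outputs)} tool results condensed"
    return f"{previous}; {note}" if previous else note


memory = WorkingMemory(stub_summarise, every=4)
for n in range(1, 7):
    memory.add_tool_output(f"raw output {n} " + "x" * 500)

print(len(memory.context()))  # 3: one summary line plus two raw outputs
```

After six tool calls the context holds one summary and the two most recent raw outputs, so its size stays roughly bounded by the compression interval rather than growing with the length of the task.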
Ambiguity needs a human handoff path.
When a task cannot be resolved cleanly — ambiguous entity, conflicting data, missing information — the agent needs to say so and stop. We added an explicit needs_clarification tool that the agent can call to surface the ambiguity and pause execution. It feels like an admission of failure. It is actually the correct behaviour.
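One way to wire this up, sketched in plain Python: the tool raises a dedicated exception, and the runner converts it into a paused state for the caller. Only the tool name needs_clarification comes from our system; the exception class, resolve_company, run_task and the company names are made up for this example.

```python
from typing import List


class NeedsClarification(Exception):
    """Raised by the needs_clarification tool to pause execution."""

    def __init__(self, question: str, candidates: List[str]):
        super().__init__(question)
        self.question = question
        self.candidates = candidates


def needs_clarification(question: str, candidates: List[str]) -> None:
    """Tool the agent calls when it cannot choose between interpretations."""
    raise NeedsClarification(question, candidates)


def resolve_company(name: str, search_results: List[str]) -> str:
    """Return the single matching company, or hand the decision to a human."""
    matches = [r for r in search_results if name.lower() in r.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        needs_clarification(f"No company found matching '{name}'.", [])
    needs_clarification(
        f"'{name}' matches {len(matches)} companies. Which one is meant?", matches
    )
    raise AssertionError("unreachable: needs_clarification always raises")


def run_task(name: str, search_results: List[str]) -> dict:
    try:
        company = resolve_company(name, search_results)
    except NeedsClarification as pause:
        # Stop here and surface the ambiguity instead of fetching more pages.
        return {
            "status": "needs_clarification",
            "question": pause.question,
            "candidates": pause.candidates,
        }
    return {"status": "ok", "company": company}


results = ["Acme Holdings", "Acme Regional Ltd", "Other Corp"]
print(run_task("Acme", results)["status"])   # needs_clarification
print(run_task("Other", results)["status"])  # ok
```

With this in place, the ambiguous-name task from day two would have ended after one search with a question for a human, not after 340 tool calls.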
The system we run now has more scaffolding and fewer surprises. The demo is slightly less magical because we show the iteration count and the error handling. Production is considerably more reliable because of it.
Agentic AI is not hard because the models are bad. It is hard because the gap between a clean demo and a robust production system is almost entirely infrastructure — and that part is invisible until it fails.