Reliability | Field Note

The New Model Was Not the Missing System

A smarter model can raise the ceiling on what an AI system might do. It does not repair the queue, the memory discipline, the verification harness, or the operating process that decides whether work actually gets done. Gartner’s forecast that over 40% of agentic AI projects will be canceled by the end of 2027 is a useful warning: the cited causes are escalating costs, unclear business value, and inadequate risk controls, not raw model IQ [1].

The field pattern

I have watched an autonomous multi-agent system fail in a way that was easy to misdiagnose. It had access to stronger models over time. The underlying LLMs became more capable, more fluent, and better at local reasoning. Yet the system still failed to complete simple end-to-end work. Not because the model could not write a paragraph, inspect a page, summarize a source, or reason through a next step. It failed because the surrounding coordination layer kept dropping the thread.

The symptoms were familiar: tasks were accepted but not finished, context was summarized too loosely, verification was skipped or overstated, handoffs lacked enough state for the next agent, and recovery logic confused partial progress with completion. Upgrading the model changed the tone of the failures. It made them faster and more articulate. It did not make them reliable.

That distinction matters. A model can be impressive inside a single prompt and still be embedded in a broken system. Once work spans tools, tabs, files, queues, agents, source checks, screenshots, or human-visible outcomes, reliability is no longer a model property. It is a system property.

Multi-agent failures cluster around coordination, not just cognition

Observed evidence	Named finding	Reliability implication
1000+ annotated execution traces	Failures studied across 7 multi-agent frameworks	The failure surface is system-level, not isolated to one prompt style
14 failure modes	MAST taxonomy	Failures need diagnosis categories, not a generic “use a better model” answer
3 top-level clusters	Specification issues, inter-agent misalignment, task verification	The recurring defects are process, coordination, and verification defects

The weakest layer wins

The reliability ceiling of an agentic system is set by its weakest coordination layer. If task boundaries are ambiguous, a stronger model will pursue the wrong task with better prose. If the handoff protocol loses state, the next agent will reason from a distorted premise. If the verifier only checks that something happened rather than checking that the right thing happened, the system will confidently report progress while the user-facing result remains absent.

The MAST taxonomy is useful because it names the shape of this problem. Its three top-level failure clusters are specification issues, inter-agent misalignment, and task verification [2]. Those are not solved by swapping a larger model into the same broken harness. They are solved by narrowing task contracts, preserving context, defining stop conditions, and requiring evidence before status claims.

This is why “the model is smarter now” can become a dangerous sentence. It tempts the team to keep a weak operating process and expect the model to compensate. In practice, the model absorbs some mess, then produces a polished version of the same operational confusion.

Context is not continuity

Longer context windows help, but they are not memory systems. They allow more text to fit inside the prompt. They do not guarantee that the system preserves the right facts, weighs them correctly, or retrieves the decisive detail at the moment of action. “Lost in the Middle” showed that models can perform worse when relevant information sits in the middle of a long context, even when the model technically supports long inputs [4].

That maps closely to agent operations. A multi-agent system can carry a large transcript and still forget the one rule that matters. It can preserve a long history and still miss the latest user correction. It can summarize ten steps and quietly omit the blocker. The fix is not merely “give the model more context.” The fix is to make continuity structured: active task, owner, evidence required, latest user instruction, current blocker, next safe action, and exact criteria for done.
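As a minimal sketch of what structured continuity could look like, the record below carries the fields named above as explicit state rather than as free text. The `TaskLedger` type, its field names, and the example values are assumptions made for illustration; they do not come from any particular agent framework.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TaskLedger:
    """Hypothetical continuity record handed from one agent to the next."""
    active_task: str
    owner: str
    evidence_required: list
    done_criteria: list
    latest_user_instruction: str = ""
    current_blocker: Optional[str] = None
    next_safe_action: str = ""
    decision_log: list = field(default_factory=list)

    def record_user_instruction(self, text: str) -> None:
        # The newest correction replaces the previous instruction instead of
        # being buried in a transcript; the old one is kept in the log.
        if self.latest_user_instruction:
            self.decision_log.append(f"superseded: {self.latest_user_instruction}")
        self.latest_user_instruction = text

    def to_handoff(self) -> str:
        # Serialize every field so nothing depends on a loose summary.
        return json.dumps(asdict(self), indent=2)


ledger = TaskLedger(
    active_task="Publish release notes",
    owner="writer-agent",
    evidence_required=["URL of published page", "screenshot of rendered page"],
    done_criteria=["page loads for a logged-out user", "version matches the tag"],
)
ledger.record_user_instruction("Use version 2.1")
ledger.record_user_instruction("Correction: the version is 2.2, not 2.1")
ledger.current_blocker = "CMS credentials expired"
ledger.next_safe_action = "Ask the owner to refresh credentials"

# The receiving agent rebuilds the same record from the handoff payload.
restored = TaskLedger(**json.loads(ledger.to_handoff()))
print(restored.latest_user_instruction)
print(restored.current_blocker)
```

The point of the design is that the latest correction and the current blocker are fields the next agent must read, not sentences it might skim past.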

Unstructured context is a pile. Operational memory is a ledger. Reliability comes from the second one.

Capability horizons are not guarantees

METR’s time-horizon work is one of the more useful ways to discuss agent capability without hype. It expresses AI performance as the length of task, measured in the time a human would need, that a model can complete with a given probability. The headline is impressive: frontier time horizons have been improving rapidly, and the paper estimated a 50% completion horizon of about 50 minutes for frontier models on software tasks at publication time [3].

But a 50% completion horizon is not the same thing as operational reliability. A system that succeeds half the time on tasks of a certain length may be a research milestone. It is not a dependable production worker unless the harness catches failures, retries safely, narrows scope, and verifies outcomes. The more important lesson is not that agents are useless. It is that autonomy has a measurable failure curve. Longer tasks accumulate more chances to drift, repeat, omit, misread, or stop early.
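To make the failure curve concrete, here is a rough sketch that assumes a constant failure rate per minute of work, so the chance of success halves with each additional horizon length. That is a simplifying assumption chosen for illustration, not the model METR fits [3], and the function names are hypothetical.

```python
def success_probability(task_minutes: float, horizon_minutes: float = 50.0) -> float:
    """Chance of finishing a task in a single attempt.

    Assumes a constant failure rate, calibrated so that a task exactly as
    long as the 50% horizon succeeds half the time.
    """
    return 0.5 ** (task_minutes / horizon_minutes)


def success_with_retries(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds.

    This only helps if a verifier can tell a real success from a false one.
    """
    return 1.0 - (1.0 - p_single) ** attempts


for minutes in (10, 50, 100, 200):
    p = success_probability(minutes)
    print(
        f"{minutes:>4} min task: {p:.0%} in one attempt, "
        f"{success_with_retries(p, 3):.0%} with 3 verified attempts"
    )
```

Under this assumption, a task twice as long as the horizon succeeds about 25% of the time on a single attempt. Retries recover much of that loss, but only when the harness can verify which attempt actually succeeded.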

When teams ignore that curve, they assign long-horizon work to a fragile loop and then blame the model when the loop collapses. The model mattered. The harness mattered more.

AI adoption is broad, but scaled value remains narrow

McKinsey finding	Reported level	What it suggests
Regular AI use in at least one business function	88%	Access and experimentation are no longer the hard part
Organizations not yet scaling AI across the enterprise	Nearly two-thirds	Operating model and workflow integration remain bottlenecks
Organizations reporting enterprise-level EBIT impact	39%	Usage does not automatically convert into material business value
Organizations at least experimenting with AI agents	62%	Agent curiosity is ahead of reliable deployment discipline

The enterprise version

The same pattern appears at enterprise scale. McKinsey found that AI use is widespread, but most organizations remain early in scaling and enterprise-level value capture [5]. Gartner’s agentic AI warning points in the same direction: cost, unclear value, and inadequate risk controls are enough to cancel projects even while the technology itself improves [1].

That is the core trap. Leaders buy capability, then discover they also needed operating discipline. They needed workflow redesign, ownership, evaluation suites, governance, exception handling, and evidence standards. Without those layers, a better model is like a faster engine bolted to a loose frame. It can move. It cannot be trusted at speed.

NIST’s AI Risk Management Framework is valuable here because it treats trustworthy AI as a lifecycle practice for organizations designing, developing, deploying, and using AI systems [6]. That framing is less glamorous than a model release, but it is closer to the work that actually determines whether a system survives contact with reality.

The upgrade fallacy

The upgrade fallacy says: “This failed because the model was not smart enough.” Sometimes that is true. There are tasks where the model simply lacks the reasoning, perception, tool-use, or domain capability required. But in the field pattern I am describing, the failures persisted after model upgrades because the failure was not located inside the model alone.

The real defects were mundane and severe. The system did not always know what promise it had made. It did not always preserve the latest correction. It did not reliably distinguish “submitted,” “queued,” “observed,” “verified,” and “complete.” It sometimes treated the absence of an error as proof of success. It sometimes stopped after producing an artifact that still needed validation. A smarter model can describe those distinctions. The system has to enforce them.
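One way to have the system enforce those distinctions, rather than merely describe them, is a small state machine. The sketch below is illustrative only: the `TrackedTask` class and `IllegalTransition` exception are hypothetical names, and the strict one-step ordering of states is an assumption.

```python
from enum import Enum


class Status(Enum):
    SUBMITTED = 1
    QUEUED = 2
    OBSERVED = 3
    VERIFIED = 4
    COMPLETE = 5


class IllegalTransition(Exception):
    """Raised when a status claim skips a state or lacks evidence."""


class TrackedTask:
    """Hypothetical task record that refuses to skip states."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = Status.SUBMITTED
        self.evidence = []

    def advance(self, new_status: Status, evidence: str = "") -> None:
        # Each state may only follow the one directly before it.
        if new_status.value != self.status.value + 1:
            raise IllegalTransition(f"{self.status.name} -> {new_status.name}")
        # Absence of an error is not proof: VERIFIED needs an artifact.
        if new_status is Status.VERIFIED and not evidence:
            raise IllegalTransition("VERIFIED requires evidence")
        if evidence:
            self.evidence.append(evidence)
        self.status = new_status


task = TrackedTask("send weekly report")
task.advance(Status.QUEUED)
try:
    task.advance(Status.COMPLETE)  # tries to skip OBSERVED and VERIFIED
except IllegalTransition as err:
    print("rejected:", err)
```

With this shape, an agent cannot report “complete” on a task that was only queued, no matter how confident its prose is.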

Once you see that, the design question changes. You stop asking, “Which model can we upgrade to?” as the first move. You ask, “Where does the work lose truth?” That question points to contracts, instrumentation, replayable logs, bounded tasks, and verification gates. It points to engineering.

What actually fixes it

Define tasks as contracts: owner, input, expected output, evidence required, timeout, and explicit done criteria.
Separate progress from proof: queued, attempted, observed, verified, and complete should be different states.
Make context structured: carry the latest user instruction, current blocker, decision log, and next action as machine-readable state.
Design verification before autonomy: every external claim needs a source, probe, screenshot, test result, or reproducible artifact.
Keep evaluations close to the real workflow: benchmark the full harness, not only the model’s answer quality.
Use stronger models deliberately: upgrade them when the measured bottleneck is model capability, not when the process is leaking state.
Treat failures as system telemetry: every drift, premature stop, false completion, and weak proof claim should harden the harness.
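The contract and verification items in the list above can be sketched together. This is a minimal illustration under stated assumptions, not a reference implementation: `TaskContract`, `verify`, and the example checks are hypothetical names, and real done criteria would probe the actual artifact rather than a string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskContract:
    """Hypothetical contract; the fields follow the list above."""
    owner: str
    task_input: str
    expected_output: str
    timeout_seconds: float
    # Each done criterion is a named check that inspects the actual result.
    done_criteria: dict


def verify(contract: TaskContract, result: str, elapsed_seconds: float) -> dict:
    """Report each check; `complete` requires every check to pass in time."""
    checks = {
        name: bool(check(result))
        for name, check in contract.done_criteria.items()
    }
    within_timeout = elapsed_seconds <= contract.timeout_seconds
    return {
        "checks": checks,
        "within_timeout": within_timeout,
        # A contract with no criteria can never be marked complete.
        "complete": within_timeout and bool(checks) and all(checks.values()),
    }


non_empty: Callable[[str], bool] = lambda text: len(text.strip()) > 0
cites_ticket: Callable[[str], bool] = lambda text: "TICKET-" in text

contract = TaskContract(
    owner="summary-agent",
    task_input="quarterly incident log",
    expected_output="summary that names the top incident and cites a ticket",
    timeout_seconds=120,
    done_criteria={"non_empty": non_empty, "cites_ticket": cites_ticket},
)

# "Something happened" satisfies the weak check but not the contract.
weak = verify(contract, "Summary written successfully.", elapsed_seconds=30)
strong = verify(contract, "Top incident: cache outage (TICKET-4821).", elapsed_seconds=30)
print(weak["complete"], strong["complete"])
```

Returning the per-check report, not just a boolean, gives the failure something to be logged against, which is what lets each false completion harden the harness.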

A smarter model is still worth using. The mistake is expecting it to rescue an unreliable system from its own coordination debt. The model can raise the possible ceiling. The process determines the actual one.

Sources

[1] Gartner, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 2025. [Online]. Available: https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[2] MAST / NeurIPS 2025, “Why Do Multi-Agent LLM Systems Fail?,” 2025. [Online]. Available: https://nips.cc/virtual/2025/poster/121528
[3] METR, “Measuring AI Ability to Complete Long Software Tasks,” March 2025. [Online]. Available: https://arxiv.org/abs/2503.14499
[4] Liu et al. (TACL), “Lost in the Middle: How Language Models Use Long Contexts,” November 2023. [Online]. Available: https://arxiv.org/abs/2307.03172
[5] McKinsey & Company, “The State of AI in 2025: Agents, Innovation, and Transformation,” November 2025. [Online]. Available: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
[6] NIST, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” January 2023. [Online]. Available: https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
