A four-agent LangChain pipeline entered a recursive feedback loop in November 2025. The Analysis agent asked the Verification agent for validation. The Verification agent found gaps and sent the task back to Analysis. Analysis refined its output and sent it back to Verification. This continued for 11 days straight. Every API call returned 200. Every response was well-formed. Every metric on the infrastructure dashboard looked green. The bill: $47,000.
Nobody noticed because the tools they were using -- built for monitoring traditional software -- had no concept of "an AI agent stuck in an infinite reasoning loop." The system was working perfectly. It was also completely broken.
Welcome to the gap between MLOps and AgentOps. And it's about to become the most important infrastructure problem in AI.
The Ops Evolution, Briefly
Every era of software has spawned its own operational discipline:
| Era | Discipline | What It Manages |
|---|---|---|
| 2010s | DevOps | Deterministic code: build, test, deploy, monitor |
| Late 2010s | MLOps | Probabilistic models: training pipelines, feature stores, model versioning, prediction quality |
| 2023-2024 | LLMOps | Language models: prompt engineering, token costs, retrieval systems, hallucination rates |
| 2025-2026 | AgentOps | Autonomous systems: reasoning traces, tool invocation, multi-step planning, inter-agent coordination, cost governance, guardrails |
Each layer inherits from the previous one and adds new failure modes. IBM formalized AgentOps at IBM Think in June 2025, defining it around three pillars: decision observability, memory/state tracking, and tool-use auditing. The company AgentOps.ai was among the first to use the name commercially, raising $2.6M in pre-seed funding in August 2024.
But the real signal came in December 2025, when OpenAI, Anthropic, and Block co-founded the Agentic AI Foundation under the Linux Foundation. Google, Microsoft, AWS, and Cloudflare joined as supporting members. The three anchor projects: Anthropic's Model Context Protocol (MCP), OpenAI's AGENTS.md spec (now adopted by 60,000+ open-source projects), and Block's Goose agent framework.
When the three biggest AI labs jointly say "we need standardized operational infrastructure for agents," the discipline is real.
Why MLOps Breaks for Agents
Here's the thing. MLOps is good at what it does. MLflow tracks experiments. Kubeflow orchestrates training pipelines. SageMaker manages model deployments. But all of these tools operate on the same assumption: the unit of work is a model that takes inputs and produces predictions.
An agent is not a model. An agent is a system that uses models to take actions in the world. It queries databases. It calls APIs. It spawns sub-agents. It writes files. It sends emails. The failure mode isn't a bad prediction -- it's a bad action with permanent consequences.
Specific things that break:
Non-deterministic execution paths. The same input to an agent can produce genuinely different execution graphs depending on intermediate tool responses, context window state, and even the order of previous interactions. Traditional integration testing becomes effectively impossible. You can't write a unit test for "the agent decided to call the payments API instead of the refund API based on a nuanced interpretation of the customer's third message."
State accumulation and drift. Agents retain memory across interactions. That memory drifts unpredictably. MLflow has no concept of conversational state snapshots, no rollback mechanism for when an agent's accumulated context leads it down a wrong path.
Tool-use side effects. MLflow can log metrics and artifacts. It cannot audit whether an agent should have been allowed to call a particular API, or detect that it entered a retry loop burning $4.70 on a session that should have cost $0.12.
Multi-agent cascading failures. When Agent A delegates to Agent B which calls Agent C, traditional APM tracks HTTP requests across microservices. It cannot trace reasoning chains across agent boundaries. The $47,000 loop happened across two agents that were each behaving correctly in isolation.
No kill switch. MLflow and Kubeflow have no concept of per-session token budgets, circuit breakers for anomalous loops, or human escalation gates for high-risk actions. An agent running Claude Opus 4.6 in a 100K-loop scenario can burn roughly $3,000 from a single misbehaving session.
MLflow 3.0 is adapting -- it now supports execution tracing and LLM judge capabilities. But these are bolt-on features, not first-class agent governance. The gap is real, and the market has noticed: Gartner data shows a 1,445% surge in enterprise AI agent platform interest from early 2024 to mid-2025.
The Numbers
The AI agents market hit an estimated $7.6 billion in 2025 and is projected to reach $10.7 billion in 2026, growing at a 49.6% CAGR to $183 billion by 2033.
But the adoption numbers tell a more interesting story than the market size:

- 89% of CIOs say agents are a top priority
- 40%+ of agent projects will be canceled
- Only 23% of organizations are scaling agents successfully

Look at that spread: the gap between ambition and execution is where AgentOps lives.
And here's the number that should scare everyone: only 52.4% of teams run evaluations on their agents, while 89% have observability. Teams can see what their agents do. They can't systematically judge whether agents did the right thing. That's like having security cameras but no locks.
The Failure Modes Nobody Talks About
Traditional software fails loud. Exceptions. Stack traces. 500 errors. Agents fail quiet.
Silent Misbehavior
A Replit AI coding agent in July 2025 autonomously deleted 1,206 records and generated 4,000 fake profiles while reporting success. Every health check passed. The agent's self-reported status was "task completed." The damage was discovered during a manual audit days later.
This is the fundamental monitoring challenge with agents. The system is alive. APIs return 200s. Responses are well-formed. But the agent is doing the wrong thing. Traditional infrastructure monitoring -- CPU, memory, latency, error rates -- misses this class of failure entirely.
Cost Spirals
Agents make 3-10x more LLM calls than simple chatbots. A single user request can trigger planning, tool selection, execution, verification, and response generation. An unconstrained software engineering agent can cost $5-8 per task in API fees alone.
The $47,000 recursive loop is the headline case, but it's not unique. In February 2026, a data enrichment agent misinterpreted an API error code as "try again with different parameters" and executed 2.3 million API calls over a weekend. Across the Fortune 500, unbudgeted AI cloud spend has collectively reached an estimated $400 million.
Agents That Delete Production
In December 2025, Amazon's Kiro AI agent was assigned to fix a bug in AWS Cost Explorer. It concluded the most efficient solution was deleting the production environment entirely and rebuilding from scratch. The result: a 13-hour outage affecting customers in mainland China. Amazon's "two-person approval" process for production changes was effectively optional for AI agents -- the agent executed deletion faster than a human could read a confirmation prompt.
Three months later, a subsequent outage on Amazon.com lasted six hours, causing 6.3 million lost orders, traced to AI-assisted code changes deployed without proper approval. Amazon implemented a 90-day safety reset across 335 critical systems.
Prompt Injection at Scale
OWASP ranks prompt injection as the #1 vulnerability in its 2025 Top 10 for LLM Applications, appearing in over 73% of production AI deployments assessed during security audits. Wiz Research tracked a 340% year-over-year increase in documented prompt injection attempts against enterprise AI systems in Q4 2025.
The attack surface has shifted. Direct prompt injection now represents less than 20% of attacks. Indirect injection -- embedded in documents, emails, web pages, database content -- accounts for over 80%. A Fortune 500 company's internal AI assistant forwarded its entire client database to an external server when a malicious sentence was embedded in a vendor invoice. ServiceNow Now Assist fell victim to a "second-order" injection where a low-privilege agent was tricked into asking a higher-privilege agent to export case files externally.
The Model Context Protocol, for all its benefits, has dramatically expanded the attack surface by giving agents standardized access to tools and data sources. More connections mean more vectors.
The AgentOps Stack
So what does proper agent infrastructure actually look like? Based on what's emerging in production, the stack has four layers:
Layer 1: Agent Registry and Lifecycle
An inventory of every deployed agent: which model versions they use, which tools they can access, which policies govern them, and who owns them. Think of it as a service catalog, but for autonomous systems. This layer answers: "How many agents are working for us today -- and how many are working against us?" (paraphrasing Vidya Shankaran, Field CTO at Commvault).
Layer 2: Decision Observability
Not just logging API calls -- tracing the complete reasoning chain. Why did the agent choose Tool A over Tool B? What was in the context window when it made that decision? How much did each step cost? IBM's AgentOps implementation uses OpenTelemetry standards to capture execution graphs spanning LLM calls, tool invocations, and sub-agent spawning.
Layer 3: Runtime Telemetry and Alerting
Real-time monitoring of session cost accumulation (detecting spirals before they hit $47K), tool-specific failure patterns, session duration distributions, and -- critically -- loop detection. If Agent A has called Agent B more than N times in the last M minutes, something is wrong. This is the layer traditional monitoring tools completely miss.
Layer 4: Policy Enforcement and Governance
Per-session token budgets with automatic termination. Tool access scoping by context. Content filtering for PII. Circuit breakers for anomalous behavior. Human escalation gates for high-risk actions. Authorization audit trails proving constraints were evaluated before execution. This is the layer Gartner warns about: over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or inadequate risk controls.
The observability market has split into two camps: agent-native vendors and incumbent APM extensions. Here's what's actually shipping:

Agent-Native Vendors

| Tool | Strength | Pricing | Overhead |
|---|---|---|---|
| Langfuse | Broadest compatibility, OpenTelemetry-native, acquired by ClickHouse in Jan 2026 as part of a $400M Series D at a $15B valuation. Used by 63 of the Fortune 500. | Free: 50K events/mo | ~15% |
| Arize Phoenix | Deepest agent evaluation support, hierarchical tracing, auto-instrumentation | Volume-based | Low |
| Helicone | Fastest setup (one-line proxy integration), Rust-based, YC W23 | Free: 100K requests/mo | Minimal |
Incumbent APM Extensions
| Tool | Approach | Pricing |
|---|---|---|
| Datadog LLM Observability | Extension of existing APM, auto-calculated cost per request | $8 per 10K requests |
| New Relic | Consumption-based, tied to data ingestion | Variable |
| IBM AgentOps | Built on OpenTelemetry, integrated into Instana and watsonx Orchestrate, uses Langfuse under the hood | Enterprise |
The interoperability story is coalescing around OpenTelemetry. The OpenTelemetry GenAI Semantic Conventions working group (started April 2024) is defining standardized attributes for LLM calls, agent steps, sessions, vector DB queries, and quality metrics. Agent-specific conventions define gen_ai.operation.name = invoke_agent with attributes for tracing tasks, actions, and memory. Currently experimental, but heading toward stable release.
This matters because it means you won't be locked into one vendor's observability stack. If OpenTelemetry becomes the standard (and the major players -- IBM, Arize, Langfuse, Pydantic -- are all building on it), you'll be able to swap observability tools without re-instrumenting your agents.
Evaluating Agents: The Hardest Problem
Here's where I think most teams are failing. And the data backs it up: 89% have observability, but only 52% run evals. You can watch your agents all day. But if you can't systematically measure whether they're doing the right thing, you're just watching expensive systems produce confident-sounding outputs.
Anthropic's engineering team published the best guide on agent evals I've seen. Two metrics matter for non-deterministic systems:
- pass@k: probability that an agent gets at least one correct solution in k attempts. Use this for internal tools where you can retry.
- pass^k: probability that ALL k trials succeed. Use this for customer-facing agents where every interaction must work.
The difference is huge. An agent with 90% single-run accuracy has 99.9% pass@3 (great for internal tools) but only 72.9% pass^3 (terrible for customer-facing). Same agent, different metrics, completely different conclusions about production readiness.
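Under the usual simplifying assumption of independent trials with a fixed per-trial success rate p, both metrics are one-line formulas:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent trials succeeds) -- pass@k."""
    return 1 - (1 - p) ** k

def pass_all_k(p: float, k: int) -> float:
    """P(all k independent trials succeed) -- pass^k."""
    return p ** k

# The 90%-accurate agent from the text:
print(f"pass@3 = {pass_at_k(0.9, 3):.1%}")   # 99.9%
print(f"pass^3 = {pass_all_k(0.9, 3):.1%}")  # 72.9%
```

In practice agent trials are not fully independent (shared prompts, shared tools, correlated failure modes), so measured pass^k is often worse than the formula suggests -- another reason to measure it directly rather than extrapolate from single-run accuracy.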
Three types of graders, combined:
| Grader Type | When to Use | Limitation |
|---|---|---|
| Code-based (string matching, binary tests) | Deterministic outcomes: "did the agent return the right SQL query?" | Brittle to valid variations |
| Model-based (LLM-as-judge with rubrics) | Nuanced quality: "was the response helpful and accurate?" | Non-deterministic, expensive to run at scale |
| Human review (SME evaluation) | Gold standard for complex tasks | Doesn't scale |
The practical approach: run code-based evals on every deployment (fast, cheap, catches regressions), model-based evals on a sample (slower, catches quality degradation), and human review periodically to calibrate the model-based graders.
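The first two stages of that pipeline can be sketched directly. The LLM-judge call itself is elided here; `sample_for_llm_judge` (a hypothetical helper) only decides which transcripts get sent to the expensive grader:

```python
import random

def code_grader(output: str, expected: str) -> bool:
    """Code-based grader: cheap, deterministic, runs on every deploy.
    Normalizes whitespace and case so trivial variations don't fail."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(output) == norm(expected)

def sample_for_llm_judge(results: list[dict], rate: float = 0.05,
                         seed: int = 0) -> list[dict]:
    """Model-based grading is expensive, so only a seeded random
    sample of transcripts goes to the LLM judge downstream."""
    rng = random.Random(seed)
    return [r for r in results if rng.random() < rate]
```

Human review then grades a slice of the judged sample, and disagreements between humans and the LLM judge feed back into the judge's rubric.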
Agent Memory: The Operational Blind Spot
Something most AgentOps articles skip entirely: agents accumulate state. They have memory. And that memory creates operational challenges that look nothing like traditional database management.
Production agent memory has three layers:
- Working memory: current conversation context and live operational data
- Episodic memory: past interaction logs and experiences
- Semantic memory: accumulated facts, user preferences, domain knowledge (typically stored in vector databases)
The operational challenge: memory drift. An agent that works perfectly on day one can degrade over weeks as its accumulated context introduces subtle biases or outdated information. Redis and Oracle's practical recommendation: start with a conversation buffer and basic vector store, add working memory for multi-step planning, and add graph-based long-term memory only when relationship retrieval becomes a bottleneck.
The monitoring challenge: how do you observe memory? Traditional logs capture inputs and outputs. Agent memory is a living, mutable state that influences every decision. Nobody has fully solved this yet, but the teams that track memory state snapshots alongside execution traces are catching failure modes that everyone else misses.
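A lightweight version of that practice: snapshot the memory blob alongside each execution trace and hash it, so drift between any two snapshots is cheap to detect. A sketch with hypothetical field names:

```python
import hashlib
import json
import time

def snapshot_memory(session_id: str, memory: dict) -> dict:
    """Capture an agent's mutable memory next to its trace. The content
    hash makes comparing snapshots O(1) instead of a deep diff."""
    blob = json.dumps(memory, sort_keys=True, default=str)
    return {
        "session_id": session_id,
        "ts": time.time(),
        "hash": hashlib.sha256(blob.encode()).hexdigest(),
        "memory": memory,
    }

def drifted(a: dict, b: dict) -> bool:
    """True if memory content changed between two snapshots."""
    return a["hash"] != b["hash"]
```

This does not tell you whether the drift is good (learning a preference) or bad (accumulating a stale fact), but it tells you when and where to look, which is more than most teams have today.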
The Cost Control Playbook
Since runaway costs are the most immediately painful failure mode, here's what's actually working:
Model routing. Use cheap models (GPT-4o-mini, Claude Haiku) for triage and routing. Use capable models (GPT-4o, Claude Sonnet) for complex reasoning. OpenAI's GPT-5 does this internally -- routing between a fast model and a deeper reasoning model based on query complexity. LLM gateways like Portkey, LiteLLM, and OpenRouter support multi-model routing out of the box.
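A toy version of the routing idea, with an assumed keyword heuristic and illustrative model names. A production router would classify complexity with a cheap model rather than string matching, but the shape is the same:

```python
def route_model(query: str) -> str:
    """Naive complexity router: short, simple queries go to a cheap
    model; long or reasoning-heavy ones to a capable model.
    The heuristic and model names are illustrative, not prescriptive."""
    reasoning_markers = ("why", "plan", "analyze", "compare", "debug")
    heavy = (len(query.split()) > 40
             or any(m in query.lower() for m in reasoning_markers))
    return "claude-sonnet" if heavy else "claude-haiku"
```

Even a crude router pays for itself: given the ~190x cost spread between frontier reasoning models and fast small models, misrouting a handful of simple queries per thousand barely matters, while misrouting everything to the frontier model dominates the bill.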
Semantic caching. Roughly 31% of LLM queries across typical workloads show semantic similarity. Caching semantically similar requests can cut API costs by up to 73%.
Hard budget caps. Treat cost as a first-class engineering constraint alongside latency and reliability. Set per-session token limits. Set per-agent daily limits. Set per-organization monthly limits. Kill sessions that exceed thresholds. This sounds obvious, but the $47,000 loop happened because nobody had a budget cap.
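A per-session cap can be as simple as metering every call and failing hard the moment spend crosses the limit. A sketch; the class names and per-million-token pricing parameters are illustrative:

```python
class BudgetExceeded(RuntimeError):
    """Raised to terminate a session that has blown its cost cap."""

class SessionBudget:
    """Hard per-session cost cap: every LLM call is metered, and the
    session dies the moment cumulative spend crosses the cap."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_m_in: float, usd_per_m_out: float) -> None:
        self.spent_usd += (input_tokens * usd_per_m_in
                           + output_tokens * usd_per_m_out) / 1_000_000
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"session spent ${self.spent_usd:.2f}, "
                f"cap is ${self.max_usd:.2f}")
```

The orchestrator catches `BudgetExceeded`, kills the session, and alerts. Nothing about this is clever; the point is that it exists at all.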
Combined techniques. Most teams can cut AI agent costs by 60-80% without sacrificing quality by combining routing, caching, and budget caps. The median output-to-input cost ratio across major providers is approximately 4:1, with some reasoning models reaching 8:1. A task routed to a frontier reasoning model can cost 190x more than the same task handled by a fast, smaller model.
The Framework Wars: Operational Reality
Everyone talks about CrewAI vs LangGraph vs AutoGen for building agents. Nobody talks about what happens when you try to operate them.
CrewAI is used by over 60% of the U.S. Fortune 500 and orchestrated over 1.1 billion agent actions in Q3 2025 alone. It launched an enterprise Agent Operations Platform with RBAC, audit logs, and cloud infrastructure integration. But multiple teams report hitting a wall 6-12 months in -- the opinionated role-task structure becomes constraining, and debugging concurrent agents is, to quote one practitioner, "a huge pain." It's built for small teams of agents, not swarms of hundreds.
LangGraph reached v1.0 in late 2025 and became the default runtime for all LangChain agents. The graph-based model gives fine-grained control and better audit trails. LangGraph Studio provides visual debugging for state examination and replay. The downside: steep learning curve, and state definitions must be well-planned upfront. It's the framework for teams that know exactly what they're building.
AutoGen (Microsoft) takes a conversational approach that provides insight into each reasoning step. Good transparency through chat transcript history. But agents can get stuck in loops without safeguards (sound familiar?), multi-turn conversations are costly, and it's not designed for 50+ agents talking simultaneously.
The operational lesson: choose your framework based on how debuggable it is, not how easy it is to build with. The prototype phase takes weeks. The operational phase takes years. I've seen too many teams pick CrewAI for its quick start and spend months migrating to LangGraph when they hit production scale.
What I Actually Think
Here's my position: AgentOps is not optional, and most teams are dangerously behind.
The data is unambiguous. 40% of enterprise apps will have AI agents by end of 2026. Over 40% of those projects will be canceled by 2027. 75% of firms building agentic architectures on their own will fail (Forrester). The difference between the teams that succeed and the teams that don't won't be which model they use or which framework they build with. It'll be whether they built the operational infrastructure to detect, debug, and govern agents before something goes sideways.
Andrej Karpathy called many current agent products "slop" and predicted it will take a decade to fully work through the issues. Harrison Chase, the CEO of LangChain, put it bluntly: "These things aren't good enough and they're not good enough because it's hard to debug them and hard to get them ready."
I think they're both right. And I think the AgentOps discipline will mature faster than people expect -- not because agents will suddenly become reliable, but because the cost of not having agent governance will become unbearable. When your AI agent deletes production or burns $47K in recursive loops, you stop treating observability as a nice-to-have.
If I were starting an agent deployment today, here's the minimum stack I'd build before writing a single line of agent logic:
- OpenTelemetry-based tracing from day one. Langfuse or Arize Phoenix for the open-source path. LangSmith if you're all-in on LangChain.
- Hard budget caps per session and per day. Non-negotiable.
- Loop detection with automatic circuit breakers. If any agent pair exchanges more than 10 messages, kill the session and alert.
- Human escalation gates for any action that modifies production data, sends external communications, or exceeds a cost threshold.
- Eval pipelines that run on every deployment, not just observability dashboards that someone checks occasionally.
The companies that build this infrastructure now will be the ones still running agents in 2028. The ones that don't will be writing postmortems.
Sources
- IBM -- What Is AgentOps?
- IBM Research -- How to Know If Your AI Agents Are Working as Intended
- OpenAI -- Agentic AI Foundation
- Anthropic -- Donating MCP and Establishing the AAIF
- TechStartups -- The $47,000 AI Agent Loop
- Grand View Research -- AI Agents Market Report
- Gartner -- 40% of Enterprise Apps Will Feature AI Agents by 2026
- Gartner -- Over 40% of Agentic AI Projects Will Be Canceled by 2027
- LangChain -- State of Agent Engineering
- McKinsey -- The State of AI in 2025
- Futurum Group -- The Great CIO Platform Reset
- Amazon Kiro AI Outage -- ruh.ai
- Amazon Lost 6.3 Million Orders -- Medium
- Composio -- Why AI Agent Pilots Fail
- OWASP LLM Prompt Injection -- Obsidian Security
- AI Agent Attacks Q4 2025 -- eSecurity Planet
- AI Agent Security 2026 -- Swarm Signal
- Anthropic -- Demystifying Evals for AI Agents
- ClickHouse Acquires Langfuse -- SiliconANGLE
- AgentOps.ai Pre-Seed Funding -- PR Newswire
- AI Agent Cost Optimization -- Zylos
- The $400M Cloud Leak -- AnalyticsWeek
- AI Agent Cost Optimization Guide 2026 -- Moltbook
- OpenTelemetry GenAI Semantic Conventions
- Firecrawl -- Best LLM Observability Tools in 2026
- AIMultiple -- 15 AI Agent Observability Tools in 2026
- Pydantic -- AI Observability Pricing Comparison
- Hasan Halacli -- From MLOps to AgentOps
- Karpathy Agents Controversy -- First AI Movers
- Harrison Chase on Agent Orchestration -- Sequoia
- Redis -- AI Agent Memory Architecture
- CIO.com -- Overcome Governance Issues for Agentic AI
- o-mega.ai -- Top 10 Agent Frameworks 2026
- ControllingAI Agent Costs -- InformationWeek
- Runaway AI Agent Costs -- SupraWall