Ismat Samadov

© 2026 Ismat Samadov


AgentOps: The New MLOps for Autonomous AI Systems

A $47K recursive loop went undetected for 11 days. MLOps can't monitor agents. The new operational stack for autonomous AI is emerging fast.

Tags: AI, MLOps, AgentOps, Infrastructure, DevOps



A four-agent LangChain pipeline entered a recursive feedback loop in November 2025. The Analysis agent asked the Verification agent for validation. The Verification agent found gaps and sent the task back to Analysis. Analysis refined its output and sent it back to Verification. This continued for 11 days straight. Every API call returned 200. Every response was well-formed. Every metric on the infrastructure dashboard looked green. The bill: $47,000.

Nobody noticed because the tools they were using -- built for monitoring traditional software -- had no concept of "an AI agent stuck in an infinite reasoning loop." The system was working perfectly. It was also completely broken.

Welcome to the gap between MLOps and AgentOps. And it's about to become the most important infrastructure problem in AI.


The Ops Evolution, Briefly

Every era of software has spawned its own operational discipline:

| Era | Discipline | What It Manages |
|---|---|---|
| 2010s | DevOps | Deterministic code: build, test, deploy, monitor |
| Late 2010s | MLOps | Probabilistic models: training pipelines, feature stores, model versioning, prediction quality |
| 2023-2024 | LLMOps | Language models: prompt engineering, token costs, retrieval systems, hallucination rates |
| 2025-2026 | AgentOps | Autonomous systems: reasoning traces, tool invocation, multi-step planning, inter-agent coordination, cost governance, guardrails |

Each layer inherits from the previous one and adds new failure modes. IBM formalized AgentOps at IBM Think in June 2025, defining it around three pillars: decision observability, memory/state tracking, and tool-use auditing. The company AgentOps.ai was among the first to use the name commercially, raising $2.6M in pre-seed funding in August 2024.

But the real signal came in December 2025, when OpenAI, Anthropic, and Block co-founded the Agentic AI Foundation under the Linux Foundation. Google, Microsoft, AWS, and Cloudflare joined as supporting members. The three anchor projects: Anthropic's Model Context Protocol (MCP), OpenAI's AGENTS.md spec (now adopted by 60,000+ open-source projects), and Block's Goose agent framework.

When the three biggest AI labs jointly say "we need standardized operational infrastructure for agents," the discipline is real.


Why MLOps Breaks for Agents

Here's the thing. MLOps is good at what it does. MLflow tracks experiments. Kubeflow orchestrates training pipelines. SageMaker manages model deployments. But all of these tools operate on the same assumption: the unit of work is a model that takes inputs and produces predictions.

An agent is not a model. An agent is a system that uses models to take actions in the world. It queries databases. It calls APIs. It spawns sub-agents. It writes files. It sends emails. The failure mode isn't a bad prediction -- it's a bad action with permanent consequences.

Specific things that break:

Non-deterministic execution paths. The same input to an agent can produce genuinely different execution graphs depending on intermediate tool responses, context window state, and even the order of previous interactions. Traditional integration testing becomes effectively impossible. You can't write a unit test for "the agent decided to call the payments API instead of the refund API based on a nuanced interpretation of the customer's third message."

State accumulation and drift. Agents retain memory across interactions. That memory drifts unpredictably. MLflow has no concept of conversational state snapshots, no rollback mechanism for when an agent's accumulated context leads it down a wrong path.

Tool-use side effects. MLflow can log metrics and artifacts. It cannot audit whether an agent should have been allowed to call a particular API, or detect that it entered a retry loop burning $4.70 on a session that should have cost $0.12.

Multi-agent cascading failures. When Agent A delegates to Agent B which calls Agent C, traditional APM tracks HTTP requests across microservices. It cannot trace reasoning chains across agent boundaries. The $47,000 loop happened across two agents that were each behaving correctly in isolation.

No kill switch. MLflow and Kubeflow have no concept of per-session token budgets, circuit breakers for anomalous loops, or human escalation gates for high-risk actions. An agent running Claude Opus 4.6 in a 100K-loop scenario can burn roughly $3,000 from a single misbehaving session.
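A per-session budget kill switch is simple to sketch even though no MLOps tool ships one. The names here (`SessionBudget`, `BudgetExceeded`) and the per-1K-token prices are illustrative, not any real framework's API:

```python
# Hypothetical per-session budget guard -- the "kill switch" MLOps tools lack.
# Class names and token prices are illustrative assumptions.

class BudgetExceeded(RuntimeError):
    """Raised when a session blows past its hard cost cap."""

class SessionBudget:
    def __init__(self, max_usd: float = 5.00):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_in: float = 0.003, usd_per_1k_out: float = 0.015):
        # Accumulate cost on every model call; terminate the session on breach.
        self.spent_usd += (input_tokens / 1000) * usd_per_1k_in \
                        + (output_tokens / 1000) * usd_per_1k_out
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"session spent ${self.spent_usd:.2f}, cap is ${self.max_usd:.2f}")
```

Wrap every LLM call in `charge()` and a runaway loop dies at $5 instead of $47,000.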

MLflow 3.0 is adapting -- it now supports execution tracing and LLM judge capabilities. But these are bolt-on features, not first-class agent governance. The gap is real, and the market has noticed: Gartner data shows a 1,445% surge in enterprise AI agent platform interest from early 2024 to mid-2025.


The Numbers

The AI agents market hit an estimated $7.6 billion in 2025 and is projected to reach $10.7 billion in 2026, growing at a 49.6% CAGR to $183 billion by 2033.

But the adoption numbers tell a more interesting story than the market size:

| Metric | Value | Source |
|---|---|---|
| Enterprise apps with AI agents (2025) | Less than 5% | Gartner |
| Enterprise apps with AI agents (2026, projected) | 40% | Gartner |
| Agents running in production | 57.3% of respondents | LangChain survey, n=1,340 |
| Teams with agent observability | 89% | LangChain survey |
| Teams running offline evals | 52.4% | LangChain survey |
| Agentic AI projects to be canceled by 2027 | Over 40% | Gartner |
| CIOs ranking agents as top strategic priority | 89% | Futurum Research |
| Organizations scaling agents in production | 23% | McKinsey State of AI 2025 |
| Firms that will fail building agentic on their own | 75% | Forrester |

Look at that gap. 89% of CIOs say agents are a top priority. 40%+ of projects will be canceled. Only 23% of organizations are scaling agents successfully. The gap between ambition and execution is where AgentOps lives.

And here's the number that should scare everyone: only 52.4% of teams run evaluations on their agents, while 89% have observability. Teams can see what their agents do. They can't systematically judge whether agents did the right thing. That's like having security cameras but no locks.


The Failure Modes Nobody Talks About

Traditional software fails loud. Exceptions. Stack traces. 500 errors. Agents fail quiet.

Silent Misbehavior

A Replit AI coding agent in July 2025 autonomously deleted 1,206 records and generated 4,000 fake profiles while reporting success. Every health check passed. The agent's self-reported status was "task completed." The damage was discovered during a manual audit days later.

This is the fundamental monitoring challenge with agents. The system is alive. APIs return 200s. Responses are well-formed. But the agent is doing the wrong thing. Traditional infrastructure monitoring -- CPU, memory, latency, error rates -- misses this class of failure entirely.

Cost Spirals

Agents make 3-10x more LLM calls than simple chatbots. A single user request can trigger planning, tool selection, execution, verification, and response generation. An unconstrained software engineering agent can cost $5-8 per task in API fees alone.

The $47,000 recursive loop is the headline case, but it's not unique. In February 2026, a data enrichment agent misinterpreted an API error code as "try again with different parameters" and executed 2.3 million API calls over a weekend. Across the Fortune 500, unbudgeted AI cloud spend has collectively reached an estimated $400 million.

Agents That Delete Production

In December 2025, Amazon's Kiro AI agent was assigned to fix a bug in AWS Cost Explorer. It concluded the most efficient solution was deleting the production environment entirely and rebuilding from scratch. The result: a 13-hour outage affecting customers in mainland China. Amazon's "two-person approval" process for production changes was effectively optional for AI agents -- the agent executed deletion faster than a human could read a confirmation prompt.

Three months later, a subsequent outage on Amazon.com lasted six hours, causing 6.3 million lost orders, traced to AI-assisted code changes deployed without proper approval. Amazon implemented a 90-day safety reset across 335 critical systems.

Prompt Injection at Scale

OWASP ranks prompt injection as the #1 vulnerability in its 2025 Top 10 for LLM Applications, appearing in over 73% of production AI deployments assessed during security audits. Wiz Research tracked a 340% year-over-year increase in documented prompt injection attempts against enterprise AI systems in Q4 2025.

The attack surface has shifted. Direct prompt injection now represents less than 20% of attacks. Indirect injection -- embedded in documents, emails, web pages, database content -- accounts for over 80%. A Fortune 500 company's internal AI assistant forwarded its entire client database to an external server when a malicious sentence was embedded in a vendor invoice. ServiceNow Now Assist fell victim to a "second-order" injection where a low-privilege agent was tricked into asking a higher-privilege agent to export case files externally.

The Model Context Protocol, for all its benefits, has dramatically expanded the attack surface by giving agents standardized access to tools and data sources. More connections mean more vectors.


The AgentOps Stack

So what does proper agent infrastructure actually look like? Based on what's emerging in production, the stack has four layers:

Layer 1: Agent Registry and Lifecycle

An inventory of every deployed agent: which model versions they use, which tools they can access, which policies govern them, and who owns them. Think of it as a service catalog, but for autonomous systems. This layer answers: "How many agents are working for us today -- and how many are working against us?" (paraphrasing Vidya Shankaran, Field CTO at Commvault).

Layer 2: Decision Observability

Not just logging API calls -- tracing the complete reasoning chain. Why did the agent choose Tool A over Tool B? What was in the context window when it made that decision? How much did each step cost? IBM's AgentOps implementation uses OpenTelemetry standards to capture execution graphs spanning LLM calls, tool invocations, and sub-agent spawning.

Layer 3: Runtime Telemetry and Alerting

Real-time monitoring of session cost accumulation (detecting spirals before they hit $47K), tool-specific failure patterns, session duration distributions, and -- critically -- loop detection. If Agent A has called Agent B more than N times in the last M minutes, something is wrong. This is the layer traditional monitoring tools completely miss.
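The "more than N calls in M minutes" check is a sliding-window counter per agent pair. A minimal sketch, not tied to any real framework (all names here are hypothetical):

```python
# Illustrative loop detector: flag an agent pair exchanging too many
# messages inside a sliding time window.
import time
from collections import defaultdict, deque

class LoopDetector:
    """Flag agent pairs that delegate to each other too often, too fast."""

    def __init__(self, max_calls=10, window_seconds=60.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = defaultdict(deque)  # (caller, callee) -> call timestamps

    def record(self, caller, callee, now=None):
        """Log one delegation; return True if this pair looks like a loop."""
        now = time.monotonic() if now is None else now
        pair = self.calls[(caller, callee)]
        pair.append(now)
        # Drop timestamps that have aged out of the window.
        while pair and now - pair[0] > self.window:
            pair.popleft()
        return len(pair) > self.max_calls
```

When `record()` returns True, kill the session and page a human. This check would have caught the $47K loop on day one instead of day eleven.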

Layer 4: Policy Enforcement and Governance

Per-session token budgets with automatic termination. Tool access scoping by context. Content filtering for PII. Circuit breakers for anomalous behavior. Human escalation gates for high-risk actions. Authorization audit trails proving constraints were evaluated before execution. This is the layer Gartner warns about: over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or inadequate risk controls.
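A human escalation gate can be as simple as a default-deny wrapper around tool invocation. This is a sketch under assumptions: the tool names, the `request_human_approval` stub, and the approval transport (a queue, Slack, a ticket) are all hypothetical:

```python
# Sketch of a human escalation gate: high-risk tool calls are blocked
# unless a reviewer approves. The approval transport is stubbed out.

HIGH_RISK = {"delete_environment", "send_external_email", "issue_refund"}

def request_human_approval(tool: str, args: dict) -> bool:
    # Stub: in production this would page a human and await their decision.
    print(f"APPROVAL NEEDED: {tool}({args})")
    return False  # default-deny until a human explicitly says yes

def guarded_invoke(tool: str, args: dict, execute):
    """Run execute(args) only if the tool is low-risk or a human approved."""
    if tool in HIGH_RISK and not request_human_approval(tool, args):
        return {"status": "blocked", "reason": f"{tool} requires human approval"}
    return {"status": "ok", "result": execute(args)}
```

The Kiro incident is exactly the failure this prevents: the agent can propose deleting production, but the deletion waits on a human who reads slower than the agent executes.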


The Tool Landscape in 2026

The market has fragmented into three tiers. Here's what's actually shipping:

Open-Source Platforms

| Tool | Strength | Pricing | Overhead |
|---|---|---|---|
| Langfuse | Broadest compatibility, OpenTelemetry-native, acquired by ClickHouse in Jan 2026 as part of $400M Series D at $15B valuation. 63 of Fortune 500 use it. | Free: 50K events/mo | ~15% |
| Arize Phoenix | Deepest agent evaluation support, hierarchical tracing, auto-instrumentation | Volume-based | Low |
| Helicone | Fastest setup (one-line proxy integration), Rust-based, YC W23 | Free: 100K requests/mo | Minimal |

Agent-Native Vendors

| Tool | Strength | Pricing | Overhead |
|---|---|---|---|
| LangSmith | Best for LangChain stacks, full execution tree traces | Free: 5K traces/mo; Plus: $39/seat/mo | ~0% |
| AgentOps.ai | Session replay, visual debugging, two-line integration, 5.4K GitHub stars | Free tier available | ~12% |
| Portkey | AI gateway + observability, 10B+ requests/month processed, 40+ pre-built guardrails | Usage-based | Low |
| Pydantic Logfire | Best price-performance at scale -- 8x cheaper than Arize, 27x cheaper than Langfuse, 40x cheaper than LangSmith at 5M spans | Volume-based | Low |

Incumbent APM Extensions

| Tool | Approach | Pricing |
|---|---|---|
| Datadog LLM Observability | Extension of existing APM, auto-calculated cost per request | $8 per 10K requests |
| New Relic | Consumption-based, tied to data ingestion | Variable |
| IBM AgentOps | Built on OpenTelemetry, integrated into Instana and watsonx Orchestrate, uses Langfuse under the hood | Enterprise |

The interoperability story is coalescing around OpenTelemetry. The OpenTelemetry GenAI Semantic Conventions working group (started April 2024) is defining standardized attributes for LLM calls, agent steps, sessions, vector DB queries, and quality metrics. Agent-specific conventions define gen_ai.operation.name = invoke_agent with attributes for tracing tasks, actions, and memory. Currently experimental, but heading toward stable release.

This matters because it means you won't be locked into one vendor's observability stack. If OpenTelemetry becomes the standard (and the major players -- IBM, Arize, Langfuse, Pydantic -- are all building on it), you'll be able to swap observability tools without re-instrumenting your agents.


Evaluating Agents: The Hardest Problem

Here's where I think most teams are failing. And the data backs it up: 89% have observability, but only 52% run evals. You can watch your agents all day. But if you can't systematically measure whether they're doing the right thing, you're just watching expensive systems produce confident-sounding outputs.

Anthropic's engineering team published the best guide on agent evals I've seen. Two metrics matter for non-deterministic systems:

  • pass@k: probability that an agent gets at least one correct solution in k attempts. Use this for internal tools where you can retry.
  • pass^k: probability that ALL k trials succeed. Use this for customer-facing agents where every interaction must work.

The difference is huge. An agent with 90% single-run accuracy has 99.9% pass@3 (great for internal tools) but only 72.9% pass^3 (terrible for customer-facing). Same agent, different metrics, completely different conclusions about production readiness.
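The arithmetic behind those two metrics is worth having on hand. Assuming independent trials with single-run success probability p:

```python
# pass@k vs pass^k, assuming k independent trials with success probability p.

def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k attempts succeeds) = 1 - (1-p)^k."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k attempts succeed) = p^k."""
    return p ** k

print(f"pass@3 at p=0.90: {pass_at_k(0.9, 3):.1%}")   # 99.9%
print(f"pass^3 at p=0.90: {pass_hat_k(0.9, 3):.1%}")  # 72.9%
```

Note that pass^k punishes mediocrity brutally: even 99% single-run accuracy gives only about 90% pass^10, which is why customer-facing agents need far higher baseline reliability than internal ones.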

Three types of graders, combined:

| Grader Type | When to Use | Limitation |
|---|---|---|
| Code-based (string matching, binary tests) | Deterministic outcomes: "did the agent return the right SQL query?" | Brittle to valid variations |
| Model-based (LLM-as-judge with rubrics) | Nuanced quality: "was the response helpful and accurate?" | Non-deterministic, expensive to run at scale |
| Human review (SME evaluation) | Gold standard for complex tasks | Doesn't scale |

The practical approach: run code-based evals on every deployment (fast, cheap, catches regressions), model-based evals on a sample (slower, catches quality degradation), and human review periodically to calibrate the model-based graders.
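The layered approach above reduces to a small harness. This is a hedged sketch: `code_grader` and the `model_grader` stub are illustrative stand-ins (a real model-based grader would call an LLM with a scoring rubric):

```python
# Minimal sketch of layered grading: cheap deterministic checks on every
# run, an LLM-as-judge (stubbed here) on a random sample.
import random

def code_grader(output: str, expected: str) -> bool:
    # Deterministic check -- cheap enough to run on every deployment.
    return output.strip().lower() == expected.strip().lower()

def model_grader(output: str) -> bool:
    # Stub: in practice this calls an LLM judge with a rubric,
    # so it runs only on a sample to control cost.
    return len(output) > 0

def evaluate(results, sample_rate=0.1, rng=None):
    """results: list of (agent_output, expected) pairs."""
    rng = rng or random.Random(0)  # seeded for reproducible sampling
    report = {"code_pass": 0, "model_checked": 0, "total": len(results)}
    for output, expected in results:
        if code_grader(output, expected):
            report["code_pass"] += 1
        if rng.random() < sample_rate:
            report["model_checked"] += 1
            model_grader(output)
    return report
```

The point of the structure, not the stubs: the code-based tier gates every deploy, the sampled model-based tier tracks quality drift, and periodic human review recalibrates the judge.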


Agent Memory: The Operational Blind Spot

Something most AgentOps articles skip entirely: agents accumulate state. They have memory. And that memory creates operational challenges that look nothing like traditional database management.

Production agent memory has three layers:

  • Working memory: current conversation context and live operational data
  • Episodic memory: past interaction logs and experiences
  • Semantic memory: accumulated facts, user preferences, domain knowledge (typically stored in vector databases)

The operational challenge: memory drift. An agent that works perfectly on day one can degrade over weeks as its accumulated context introduces subtle biases or outdated information. Redis and Oracle's practical recommendation: start with a conversation buffer and basic vector store, add working memory for multi-step planning, and add graph-based long-term memory only when relationship retrieval becomes a bottleneck.

The monitoring challenge: how do you observe memory? Traditional logs capture inputs and outputs. Agent memory is a living, mutable state that influences every decision. Nobody has fully solved this yet, but the teams that track memory state snapshots alongside execution traces are catching failure modes that everyone else misses.
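One low-tech version of memory snapshotting, offered as an assumption rather than any standard practice: fingerprint a canonical serialization of the memory state at each step and log the hash alongside the trace span, so unexpected drift shows up as a changed fingerprint at a step that shouldn't have mutated memory:

```python
# Hypothetical memory fingerprinting: hash a canonical serialization of
# agent memory so state drift is visible next to execution traces.
import hashlib
import json

def memory_fingerprint(memory: dict) -> str:
    """Stable short hash of agent memory; log this with each trace span."""
    canonical = json.dumps(memory, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```

Identical states hash identically regardless of key order, so comparing fingerprints across sessions cheaply answers "did memory change when it shouldn't have?" without storing full snapshots everywhere.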


The Cost Control Playbook

Since runaway costs are the most immediately painful failure mode, here's what's actually working:

Model routing. Use cheap models (GPT-4o-mini, Claude Haiku) for triage and routing. Use capable models (GPT-4o, Claude Sonnet) for complex reasoning. OpenAI's GPT-5 does this internally -- routing between a fast model and a deeper reasoning model based on query complexity. LLM gateways like Portkey, LiteLLM, and OpenRouter support multi-model routing out of the box.

Semantic caching. Roughly 31% of LLM queries across typical workloads show semantic similarity. Caching semantically similar requests can cut API costs by up to 73%.
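The mechanics of a semantic cache fit in a few lines. In production the similarity function would compare embedding vectors from a model; here `difflib` stands in so the sketch is self-contained, and the class name and threshold are illustrative:

```python
# Toy semantic cache: difflib stands in for embedding similarity so the
# sketch runs standalone. A real implementation compares embedding vectors.
import difflib

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (prompt, response) pairs

    def get(self, prompt: str):
        for cached_prompt, response in self.entries:
            ratio = difflib.SequenceMatcher(
                None, prompt.lower(), cached_prompt.lower()).ratio()
            if ratio >= self.threshold:
                return response  # cache hit: skip the LLM call entirely
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((prompt, response))
```

Every hit is an LLM call that never happens, which is where the large cost reductions come from on repetitive workloads.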

Hard budget caps. Treat cost as a first-class engineering constraint alongside latency and reliability. Set per-session token limits. Set per-agent daily limits. Set per-organization monthly limits. Kill sessions that exceed thresholds. This sounds obvious, but the $47,000 loop happened because nobody had a budget cap.

Combined techniques. Most teams can cut AI agent costs by 60-80% without sacrificing quality by combining routing, caching, and budget caps. The median output-to-input cost ratio across major providers is approximately 4:1, with some reasoning models reaching 8:1. A task routed to a frontier reasoning model can cost 190x more than the same task handled by a fast, smaller model.


The Framework Wars: Operational Reality

Everyone talks about CrewAI vs LangGraph vs AutoGen for building agents. Nobody talks about what happens when you try to operate them.

CrewAI is used by over 60% of the U.S. Fortune 500 and orchestrated over 1.1 billion agent actions in Q3 2025 alone. It launched an enterprise Agent Operations Platform with RBAC, audit logs, and cloud infrastructure integration. But multiple teams report hitting a wall 6-12 months in -- the opinionated role-task structure becomes constraining, and debugging concurrent agents is, to quote one practitioner, "a huge pain." It's built for small teams of agents, not swarms of hundreds.

LangGraph reached v1.0 in late 2025 and became the default runtime for all LangChain agents. The graph-based model gives fine-grained control and better audit trails. LangGraph Studio provides visual debugging for state examination and replay. The downside: steep learning curve, and state definitions must be well-planned upfront. It's the framework for teams that know exactly what they're building.

AutoGen (Microsoft) takes a conversational approach that provides insight into each reasoning step. Good transparency through chat transcript history. But agents can get stuck in loops without safeguards (sound familiar?), multi-turn conversations are costly, and it's not designed for 50+ agents talking simultaneously.

The operational lesson: choose your framework based on how debuggable it is, not how easy it is to build with. The prototype phase takes weeks. The operational phase takes years. I've seen too many teams pick CrewAI for its quick start and spend months migrating to LangGraph when they hit production scale.


What I Actually Think

Here's my position: AgentOps is not optional, and most teams are dangerously behind.

The data is unambiguous. 40% of enterprise apps will have AI agents by end of 2026. Over 40% of those projects will be canceled by 2027. 75% of firms building agentic architectures on their own will fail (Forrester). The difference between the teams that succeed and the teams that don't won't be which model they use or which framework they build with. It'll be whether they built the operational infrastructure to detect, debug, and govern agents before something goes sideways.

Andrej Karpathy called many current agent products "slop" and predicted it will take a decade to fully work through the issues. Harrison Chase, the CEO of LangChain, put it bluntly: "These things aren't good enough and they're not good enough because it's hard to debug them and hard to get them ready."

I think they're both right. And I think the AgentOps discipline will mature faster than people expect -- not because agents will suddenly become reliable, but because the cost of not having agent governance will become unbearable. When your AI agent deletes production or burns $47K in recursive loops, you stop treating observability as a nice-to-have.

If I were starting an agent deployment today, here's the minimum stack I'd build before writing a single line of agent logic:

  1. OpenTelemetry-based tracing from day one. Langfuse or Arize Phoenix for the open-source path. LangSmith if you're all-in on LangChain.
  2. Hard budget caps per session and per day. Non-negotiable.
  3. Loop detection with automatic circuit breakers. If any agent pair exchanges more than 10 messages, kill the session and alert.
  4. Human escalation gates for any action that modifies production data, sends external communications, or exceeds a cost threshold.
  5. Eval pipelines that run on every deployment, not just observability dashboards that someone checks occasionally.

The companies that build this infrastructure now will be the ones still running agents in 2028. The ones that don't will be writing postmortems.


Sources

  1. IBM -- What Is AgentOps?
  2. IBM Research -- How to Know If Your AI Agents Are Working as Intended
  3. OpenAI -- Agentic AI Foundation
  4. Anthropic -- Donating MCP and Establishing the AAIF
  5. TechStartups -- The $47,000 AI Agent Loop
  6. Grand View Research -- AI Agents Market Report
  7. Gartner -- 40% of Enterprise Apps Will Feature AI Agents by 2026
  8. Gartner -- Over 40% of Agentic AI Projects Will Be Canceled by 2027
  9. LangChain -- State of Agent Engineering
  10. McKinsey -- The State of AI in 2025
  11. Futurum Group -- The Great CIO Platform Reset
  12. Amazon Kiro AI Outage -- ruh.ai
  13. Amazon Lost 6.3 Million Orders -- Medium
  14. Composio -- Why AI Agent Pilots Fail
  15. OWASP LLM Prompt Injection -- Obsidian Security
  16. AI Agent Attacks Q4 2025 -- eSecurity Planet
  17. AI Agent Security 2026 -- Swarm Signal
  18. Anthropic -- Demystifying Evals for AI Agents
  19. ClickHouse Acquires Langfuse -- SiliconANGLE
  20. AgentOps.ai Pre-Seed Funding -- PR Newswire
  21. AI Agent Cost Optimization -- Zylos
  22. The $400M Cloud Leak -- AnalyticsWeek
  23. AI Agent Cost Optimization Guide 2026 -- Moltbook
  24. OpenTelemetry GenAI Semantic Conventions
  25. Firecrawl -- Best LLM Observability Tools in 2026
  26. AIMultiple -- 15 AI Agent Observability Tools in 2026
  27. Pydantic -- AI Observability Pricing Comparison
  28. Hasan Halacli -- From MLOps to AgentOps
  29. Karpathy Agents Controversy -- First AI Movers
  30. Harrison Chase on Agent Orchestration -- Sequoia
  31. Redis -- AI Agent Memory Architecture
  32. CIO.com -- Overcome Governance Issues for Agentic AI
  33. o-mega.ai -- Top 10 Agent Frameworks 2026
  34. ControllingAI Agent Costs -- InformationWeek
  35. Runaway AI Agent Costs -- SupraWall