Ismat Samadov

The 10M-Token Context Window vs the $1M/Day Inference Bill: AI's Fundamental Economics Problem

Sora cost $15M/day to run. Lifetime revenue: $2.1M. Context windows keep growing. The economics that decide which AI products survive.

AI · Economics · Infrastructure · LLM · Startups



OpenAI's Sora cost $15 million per day to run. Its lifetime revenue was $2.1 million. Not per day -- total. In March 2026, OpenAI shut it down. Bill Peebles, Sora's lead, said what everyone already knew: "The economics are currently completely unsustainable."

That same month, Anthropic hit $19 billion in annualized revenue and approached break-even. Same industry. Same GPU costs. Same fundamental technology. One company burned $5.4 billion annualized on a product nobody paid for. The other built a sustainable business.

The difference wasn't the models. It was the economics. And understanding those economics -- the cost curves, the pricing traps, the infrastructure bets, and the optimization tricks -- is the single most important skill in AI product development right now.


The Price of a Token

Let's start with what things actually cost. Here's the current pricing for major models as of early 2026:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | 1M |
| GPT-5.2 Pro | $21.00 | $168.00 | 1M |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| Gemini 3 Flash | $0.50 | $3.00 | 1M |
| Grok 4.1 | $0.20 | $0.50 | 128K |
| GPT-5 mini | $0.25 | $2.00 | 128K |
| GPT-5 nano | $0.05 | $0.40 | 128K |

Sources: IntuitionLabs, PricePerToken

The spread is enormous. GPT-5.2 Pro output costs 420x more than GPT-5 nano output. A task that costs $0.004 on the cheapest model costs $1.68 on the most expensive. At scale -- millions of requests per day -- that difference is the difference between a profitable product and a catastrophic money pit.

And these are the subsidized prices. Sam Altman admitted OpenAI is "currently losing money" on its $200/month ChatGPT Pro subscriptions. Industry analysts estimate 30-50% API price increases over the next 18 months as vendors move toward sustainable unit economics. The prices above may be the floor, not the ceiling.


The Context Window Trap

Here's what the "10 million token context window" marketing doesn't tell you: attention scales quadratically.

Standard self-attention costs O(N^2) with sequence length. Double the context window, quadruple the computation. The KV-cache -- the memory structure that stores processed context -- scales linearly with context length. At 10 million tokens, the KV-cache alone requires an estimated 32 TB of memory. No single GPU or multi-GPU server comes close.
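The memory blow-up is easy to reproduce from first principles. The sketch below estimates KV-cache size for a hypothetical 70B-class model with full multi-head attention; the layer count, head count, and head dimension are illustrative assumptions, not any vendor's published config:

```python
# Rough KV-cache sizing for a hypothetical 70B-class model with full
# multi-head attention (assumed: 80 layers, 64 heads, head_dim 128, fp16).
def kv_cache_bytes(tokens, layers=80, heads=64, head_dim=128, bytes_per_val=2):
    # 2x for keys and values, stored at every layer for every token
    return 2 * layers * heads * head_dim * bytes_per_val * tokens

per_token = kv_cache_bytes(1)            # ~2.6 MB of cache per token
at_10m = kv_cache_bytes(10_000_000)      # ~26 TB at a 10M-token context
print(f"{per_token / 1e6:.1f} MB/token, {at_10m / 1e12:.1f} TB at 10M tokens")
```

Grouped-query attention (say, 8 KV heads instead of 64) would cut this roughly 8x, and cache quantization further still, but even then a 10M-token cache stays in the terabyte range -- far beyond any single server.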

Let's do the math on what a single 10M-token query actually costs:

| Model | 10M Input Cost | Notes |
|---|---|---|
| Claude Opus 4.6 | $50.00 | Per query, input only |
| Gemini 3.1 Pro (standard) | $20.00 | Per query, input only |
| Gemini 3.1 Pro (long context) | $40.00 | 2x price beyond 200K tokens |
| GPT-5.2 | $17.50 | Per query, input only |

$50 per query. Just for input. Add output tokens and you're looking at $75-$150+ per request. A chatbot handling 100,000 queries per day at these context lengths would cost $5-$15 million per day. That's Sora-level economics.

And the performance doesn't justify the cost. As I wrote about in my article on Llama 4 Scout's context window, performance collapses long before you hit those limits. You're paying 2x for tokens the model isn't even using effectively.

The practical reality: most production workloads run between 4K-32K tokens. The 1M+ context windows are for RAG retrieval, code analysis, and document processing -- use cases where you can batch process offline and amortize costs. Anyone designing a real-time product around 10M-token context windows needs to talk to their CFO first.


The $690 Billion Infrastructure Bet

While individual API calls might seem cheap, the aggregate numbers are staggering. Here's what the hyperscalers plan to spend on AI infrastructure in 2026:

| Company | 2026 CapEx | Notes |
|---|---|---|
| Amazon | $200B | Most for data centers; Jassy says AI capacity monetized as fast as installed |
| Alphabet (Google) | $175-185B | Third upward revision; cloud backlog surged 55% to $240B+ |
| Microsoft | $120B+ | $37.5B in most recent quarter alone |
| Meta | $115-135B | 1GW Ohio data center; Louisiana facility potentially 5GW |
| Oracle | $50B | 136% increase over 2025 |
| Total | $660-690B | Nearly double 2025's ~$380B |

Sources: Futurum Group, CNBC, IEEE

Roughly two-thirds of that -- around $450 billion -- is directly tied to AI infrastructure: GPUs, custom silicon, cooling systems, power generation. These companies are collectively betting that demand for AI compute will grow fast enough to justify spending nearly $700 billion in a single year.

For context, that's more than the GDP of Belgium. In one year. On computers.

The bet works if inference demand scales exponentially. It doesn't work if the demand curve flattens -- if companies hit cost ceilings, find that AI products don't generate enough revenue to justify the compute, or discover that optimization techniques reduce the total compute needed.


The Jevons Paradox: Cheaper Tokens, Higher Bills

Here's the number that captures the whole problem in one stat: tokens got 99.7% cheaper between GPT-4's launch and mid-2025. Enterprise AI cloud spending tripled from $11.5 billion to $37 billion in the same period.

This is the Jevons Paradox applied to AI. When something gets cheaper, you use dramatically more of it. The per-token cost dropped more than 300x, but total spending went up 3x. 72% of IT leaders now report AI spending as "unmanageable."

The mechanism is agentic workflows. A simple chatbot makes one API call per user message. An AI agent -- one that reasons, plans, uses tools, and verifies its work -- makes 50-500x more calls per task. A customer support agent might make 15-30 LLM calls to resolve a single ticket. A coding agent might make 50-100. A research agent with tool use might make hundreds.

At a16z's tracked decline rate -- roughly 10x cost reduction per year for equivalent performance -- you'd think this problem would solve itself. And for simple use cases, it does. But the frontier keeps moving: more capable models enable more complex tasks, which require more compute, which consumes the savings.

The companies losing money on AI aren't the ones paying too much per token. They're the ones making too many calls per user interaction.


Who's Actually Making Money

Let me be specific about the financial reality of the major players:

OpenAI:

  • 2025 ARR: ~$20 billion (3x increase year-over-year)
  • 2025 losses: $13.5 billion net loss in the first half alone
  • 2026 projected: $13B revenue vs ~$22B spending = $14 billion loss
  • Cumulative cash burn through 2029: expected $115 billion
  • Profitability target: 2029 or 2030, with $200B annual revenue projection
  • HSBC analysts say OpenAI "likely won't make money by 2030" and faces a $207B funding shortfall

Anthropic:

  • March 2026 ARR: $19 billion (up from $9B at end of 2025, $1B fifteen months before)
  • Growth rate: ~10x per year vs. OpenAI's 3.4x
  • Expected to surpass OpenAI in revenue by mid-2026
  • Break-even expected in 2026; positive cash flow projected by 2027
  • Cash burn projected at ~1/3 of revenue in 2026, dropping to 9% by 2027

The broader industry:

  • 3,800 AI startups shut down in 2025 (27% of the 14,000+ launched in 2024)
  • Another 1,800 closed in early 2026
  • MIT's Project NANDA: 95% of enterprise generative AI pilots failed to deliver measurable ROI
  • Only the API providers and infrastructure companies are generating meaningful revenue. Most application-layer companies are losing money.

The Anthropic vs. OpenAI comparison is instructive. Both sell API access to foundation models. But Anthropic reached near-profitability on $19B ARR while OpenAI projects a $14B loss on similar revenue. The difference appears to be operational discipline -- Anthropic's model efficiency (Sonnet as the workhorse, Opus for premium), focused enterprise sales, and a more conservative approach to consumer products (no Sora-equivalent money pits).


The Sora Autopsy: A Case Study in Economics

Sora deserves a closer look because it's the most dramatic example of AI product economics gone wrong.

The numbers: $15 million per day in infrastructure costs. $5.4 billion annualized. Each 10-second video clip required roughly 40 minutes of total GPU time (8-10 minutes on 4 GPUs simultaneously), costing an estimated $1.30 per clip. User downloads dropped 66% from the November 2025 peak (3.33 million) to February 2026 (1.1 million). Total lifetime revenue: $2.1 million from in-app purchases.

What went wrong:

1. Video generation is a compute multiplier. Text generation produces tokens sequentially. Video generation produces frames spatially -- each second of output requires orders of magnitude more compute than equivalent text. The fundamental unit economics of video AI are worse than text AI by a factor of 100-1000x.

2. No pricing model could work. At $1.30 per clip in compute cost, you'd need to charge $5-$10 per generation to break even with overhead. Consumer willingness to pay for 10-second AI videos? Approximately zero, as the $2.1M lifetime revenue proved.

3. The Disney deal that wasn't. A rumored $1 billion partnership with Disney never formalized. No agreement was signed. The enterprise revenue that might have justified the infrastructure spend never materialized.

4. Usage declined as novelty wore off. AI-generated video has a novelty curve that peaks fast and drops hard. Without a compelling use case beyond "look what I made," retention collapses.

The Sora team pivoted to robotics research under the codename "Spud." The infrastructure was repurposed. $5.4 billion in annualized compute spending vanished in a press release.


The Custom Silicon Disruption

The GPU monoculture is breaking. And the economics of inference are about to shift because of it.

On Christmas Eve 2025, NVIDIA acquired Groq for $20 billion. Groq's Language Processing Units (LPUs) serve Llama 2 70B at 300 tokens per second. Their pricing: $0.79 per million output tokens for Llama 3.3 70B -- dramatically cheaper than equivalent GPU-based inference.

In January 2026, OpenAI signed a $10 billion+ deal with Cerebras for 750 megawatts of computing power through 2028. Cerebras's wafer-scale chips (each the size of a dinner plate) put entire models on a single piece of silicon, breaking the 1,000-token-per-second barrier for Llama 3.1-405B.

At GTC 2026, NVIDIA introduced the Groq 3 LPX -- dedicated inference hardware added to NVIDIA's platform for the first time. This is NVIDIA acknowledging that general-purpose GPUs are suboptimal for inference workloads and that specialized silicon can deliver 5x speed at 50% lower cost.

The GPU price crash is already happening. H100 cloud rental costs crashed 64-75% from the 2024 peak of $8-10/hour to $2-$4.50/hour in 2026. B200 availability is expected to push H100s below $2/hour by year-end.

| GPU | 2024 Peak Price | 2026 Price | Decline |
|---|---|---|---|
| H100 (cloud rental/hr) | $8-$10 | $2-$4.50 | -64% to -75% |
| H200 (cloud rental/hr) | N/A | $3.72-$10.60 | New |
| B200 (cloud rental/hr) | N/A | $2.25-$16.00 | New |

Source: JarvisLabs, gpu.fm

This matters because GPU cost is the single largest component of inference cost. Every 50% drop in GPU pricing flows directly to the per-token economics. If dedicated inference hardware delivers the promised 5x efficiency improvement, the cost curves for production AI change dramatically.


The Optimization Playbook

You don't have to wait for hardware to get cheaper. There's a stack of techniques that can cut inference costs 60-90% today:

Model Routing

The simplest and highest-impact optimization. Not every query needs a frontier model. Route simple queries to cheap, fast models (GPT-5 nano at $0.05/1M input) and only escalate to expensive models (Claude Opus at $5/1M input) for complex tasks.

OpenAI's GPT-5 does this internally, routing between efficiency and reasoning modes. You can build the same pattern with LLM gateways like Portkey, LiteLLM, or OpenRouter.

Potential savings: 80-90% on blended cost, since the majority of queries are simple.
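A minimal sketch of the pattern, using a keyword-and-length heuristic as a stand-in for a real complexity classifier (model names and prices come from the pricing table above; the routing rule itself is illustrative):

```python
# Model-routing sketch: cheap model by default, frontier model on escalation.
MODELS = {
    "cheap":    {"name": "gpt-5-nano",      "input_per_m": 0.05},
    "frontier": {"name": "claude-opus-4.6", "input_per_m": 5.00},
}

HARD_HINTS = ("prove", "refactor", "multi-step", "analyze", "debug")

def route(query: str) -> dict:
    # Escalate only when the query looks complex; everything else goes cheap.
    hard = len(query) > 500 or any(h in query.lower() for h in HARD_HINTS)
    return MODELS["frontier" if hard else "cheap"]

def blended_cost(queries, tokens_per_query=2_000):
    # Input-side cost only, for illustration
    return sum(route(q)["input_per_m"] * tokens_per_query / 1e6 for q in queries)

# 80/20 split of simple vs complex queries
qs = ["What's the return policy?"] * 80 + ["Debug this multi-step pipeline"] * 20
print(f"${blended_cost(qs):.4f}")  # $0.208 vs $1.00 if every query went frontier
```

With this 80/20 mix the blended input cost is about a fifth of sending everything to the frontier model; a well-tuned classifier pushes the split, and the savings, further.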

Semantic Caching

Roughly 31% of LLM queries across typical workloads show semantic similarity. If someone asks "What's the return policy?" 500 times, you don't need 500 separate LLM calls. Cache the response and serve it directly.

Potential savings: up to 73% of API costs for workloads with repetitive queries.
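The core loop is small. The sketch below uses a toy bag-of-words similarity as a stand-in for a real embedding model; the lookup-before-LLM-call structure is the point:

```python
# Semantic-cache sketch. embed() is a toy bag-of-words stand-in for a
# real embedding model; production systems use a vector index.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.entries = []          # list of (embedding, response)
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        for emb, resp in self.entries:
            if cosine(q, emb) >= self.threshold:
                return resp        # cache hit: no LLM call needed
        return None                # miss: caller makes the real call, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the return policy", "30 days, free returns.")
print(cache.get("what is the return policy please"))  # hit despite rephrasing
```

The threshold is the key tuning knob: too low and you serve stale or wrong answers, too high and the hit rate collapses.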

KV-Cache Compression

Google's TurboQuant (2026) compresses the KV-cache to 3 bits with zero measured accuracy loss, achieving 6x memory reduction. This enables longer context windows without the linear memory scaling that makes 10M-token windows impractical.

Quantization

Reducing model weights from 32-bit to 8-bit or 4-bit cuts memory requirements by 4-8x and speeds up inference proportionally, with minimal quality loss for most tasks. NVIDIA's Blackwell GPUs support native FP4/FP8 computation, making quantized inference a first-class operation.
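The basic mechanism is a shared scale factor. A minimal absmax int8 sketch -- pure Python for clarity; real quantizers operate on full weight tensors with per-channel or per-group scales:

```python
# Absmax int8 quantization sketch: weights map to [-127, 127] with one
# fp32 scale per tensor; memory drops 4x vs fp32 at a bounded precision cost.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.50, 0.033, 0.91, -0.27]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(q, f"max error {max_err:.4f}")  # rounding error is bounded by scale/2
```

The same idea at 4 bits (16 levels instead of 255) is why 4-bit weights need finer-grained scales to hold quality.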

Speculative Decoding

A small, fast "draft" model proposes 4-8 tokens; the large model verifies them in a single forward pass. Typical acceptance rates of 70-85% reduce the number of expensive forward passes by 3-5x without any quality loss.
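Structurally it is a propose-verify loop. The sketch below uses stand-in draft and target models to show the acceptance logic; a real implementation compares token probabilities, not strings:

```python
# Speculative-decoding skeleton: draft_model proposes k tokens cheaply,
# target_model verifies them in one (simulated) expensive pass.
def draft_model(prefix, k=4):
    # stand-in: a cheap model guessing the next k tokens
    return [f"t{len(prefix) + i}" for i in range(k)]

def target_model(prefix, proposed):
    # stand-in: what the big model would emit at each proposed position;
    # here it disagrees at position 2 to show partial acceptance
    return [tok if i != 2 else "X" for i, tok in enumerate(proposed)]

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    verified = target_model(prefix, proposed)
    accepted = []
    for p, v in zip(proposed, verified):
        accepted.append(v)          # the target's token is always kept
        if p != v:                  # first mismatch ends acceptance
            break
    return prefix + accepted        # up to k tokens for one expensive pass

print(speculative_step(["hello"]))  # ['hello', 't1', 't2', 'X']
```

Each expensive pass yields between 1 and k tokens instead of exactly 1, which is where the 3-5x reduction in forward passes comes from.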

Combined Impact

Most teams can cut costs 60-80% without sacrificing quality. The best combinations achieve 70-90% savings. GPU utilization has improved from 30-40% to 70-80% through these techniques.


A Decision Framework for AI Product Economics

If you're building an AI product, here's how to think about the economics before you write a single line of code:

Step 1: Calculate Your Token Budget Per User Interaction

Estimate the average tokens (input + output) per user action. Multiply by your model's pricing. Multiply by expected daily interactions per user. That's your per-user daily cost.

```python
# Example: Customer support agent
avg_input_tokens = 8_000   # conversation history + system prompt
avg_output_tokens = 2_000  # response
calls_per_ticket = 15      # agent makes multiple LLM calls
tickets_per_day = 50       # per user/seat

# Using Claude Sonnet 4.6 ($3.00 input / $15.00 output per 1M tokens)
input_cost = (avg_input_tokens * calls_per_ticket * tickets_per_day) / 1_000_000 * 3.00
output_cost = (avg_output_tokens * calls_per_ticket * tickets_per_day) / 1_000_000 * 15.00
daily_cost_per_seat = input_cost + output_cost
# = $18.00 + $22.50 = $40.50/day = ~$1,215/month per seat
```

Step 2: Compare Against Revenue Per User

If your product charges $99/month per seat and costs $1,215/month in compute per seat, you have a Sora problem. Either raise prices, reduce token consumption, or use cheaper models.
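Two helper calculations make the gap concrete (the inputs are the assumed Step 1 figures, not real product numbers):

```python
# Unit-economics sanity check for the assumed $99/seat price vs ~$1,215/seat compute.
def gross_margin(monthly_price, monthly_compute_cost):
    return (monthly_price - monthly_compute_cost) / monthly_price

def required_price(monthly_compute_cost, target_margin=0.80):
    # price at which compute is at most (1 - target_margin) of revenue
    return monthly_compute_cost / (1 - target_margin)

print(f"{gross_margin(99.00, 1215.00):.0%}")     # deeply negative margin
print(f"${required_price(1215.00):,.0f}/month")  # price needed for 80% gross margin
```

If the required price is far above what the market will pay, the fix has to come from the cost side: fewer calls, cheaper models, smaller contexts.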

Step 3: Identify Your Optimization Levers

| Lever | Savings | Effort | Quality Impact |
|---|---|---|---|
| Model routing (send 80% to cheap model) | 70-85% | Medium | Minimal if routed well |
| Semantic caching | 30-73% | Low | None |
| Reduce context window (RAG instead of full context) | 50-80% | High | Depends on implementation |
| Prompt compression | 20-40% | Low | Minimal |
| Speculative decoding | 40-60% | High | None |

Step 4: Build Cost Monitoring From Day One

Track cost per user, per conversation, per feature. Set alerts. Build dashboards. The companies that blow their budgets are the ones that don't measure until the bill arrives.
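A minimal cost meter can be a thin wrapper around every LLM call; the rate table and budget below are illustrative:

```python
# Per-request cost tracking sketch: price every call from a rate table
# and alert once daily spend crosses a budget.
from collections import defaultdict

RATES = {"claude-sonnet-4.6": (3.00, 15.00)}  # $ per 1M input/output tokens

class CostMeter:
    def __init__(self, daily_budget_usd=500.0):
        self.daily_budget = daily_budget_usd
        self.by_feature = defaultdict(float)  # spend bucketed by feature

    def record(self, feature, model, input_tokens, output_tokens):
        rate_in, rate_out = RATES[model]
        cost = (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
        self.by_feature[feature] += cost
        if sum(self.by_feature.values()) > self.daily_budget:
            print(f"ALERT: daily spend over ${self.daily_budget}")
        return cost

meter = CostMeter()
c = meter.record("support_agent", "claude-sonnet-4.6", 8_000, 2_000)
print(f"${c:.3f} for one call")
```

In production this feeds a dashboard keyed by user, conversation, and feature; the point is that the instrumentation exists from the first prototype, not after the first surprise invoice.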


The Historical Cost Curve

Zoom out and the long-term trend is relentlessly downward:

| Date | GPT-4-Class Cost (per 1M tokens) | Milestone |
|---|---|---|
| Late 2022 | ~$20.00 | GPT-4 preview pricing |
| Early 2024 | ~$2.00 | Competition drives prices down 10x |
| Mid-2025 | ~$0.40 | Continued decline |
| Early 2026 | ~$0.14 | Near-commodity for mid-tier tasks |
| 2028 (projected) | under $0.01 | a16z projection |

Source: a16z, Epoch AI

That's a roughly 10x decline per year for equivalent capability. Since January 2024, the decline rate has accelerated to 200x per year for some benchmarks. By 2028, GPT-4-equivalent inference should cost less than a penny per million tokens.

But here's what the optimistic cost-curve story misses: the frontier moves too. When GPT-4-class inference costs a penny, nobody will be using GPT-4-class models. They'll be using GPT-7-class models that cost $10 per million tokens. The per-token cost of the latest models has been remarkably stable even as the cost of older models collapses.

The inference market grew from ~$12B in 2023 to an estimated ~$55B in 2026. Inference now represents roughly 67% of total AI compute, up from about a third in 2023. Training gets the headlines. Inference gets the bills.


What I Actually Think

Here's my position: inference economics will determine which AI companies survive, and most current business models are unsustainable.

The math is unforgiving. OpenAI projects a $14 billion loss in 2026 and expects cumulative cash burn of $115 billion through 2029. Even with $20B in revenue, the spend outpaces the income by a factor of two. HBR argues that AI companies fundamentally don't have a profitable business model yet.

I think Anthropic's trajectory shows the business model can work, but only with ruthless efficiency: a tiered model lineup (Haiku for cheap, Sonnet for value, Opus for premium), enterprise-focused distribution, and no vanity products burning $15M/day. Their projected break-even on $19B ARR, while OpenAI loses $14B on similar revenue, is the clearest signal in the industry about what works.

For builders -- the people reading this blog, not the people running AI labs -- the implications are concrete:

  1. Every AI feature you ship has a recurring cost. Traditional software has near-zero marginal cost per user. AI products have per-token, per-call, per-user costs that scale linearly with usage. Design for this from day one.

  2. Model routing isn't optional; it's table stakes. Sending every query to Opus when 80% could be handled by Haiku is like running every database query against your production replica. Match the model to the task.

  3. Context windows are a trap. Marketing says 10M tokens. Physics says quadratic scaling. Economics says $50 per query. Use retrieval and summarization to keep context small, not brute-force large context windows.

  4. The cost curve will save you, but not yet. 10x per year decline is real, but you still need to survive this year. Build for today's costs, not 2028's.

  5. The companies that win will be the ones that treat inference cost as a core engineering discipline. Not an afterthought. Not "we'll optimize later." From the first prototype.

Sora died because nobody did this math before shipping. Every 10-second clip that nobody paid $1.30 for was a slow bleed toward a $5.4 billion annualized lesson. The opposite of Sora isn't "don't build AI products." It's "build AI products where the economics work."

The 10M-token context window is a capability. The $1M/day inference bill is a constraint. The founders and engineers who understand both will build the AI products that actually survive.


Sources

  1. OpenAI Sora Shutdown: $15M/Day Costs, $2.1M Revenue -- Medium
  2. The Real Sora Cost: OpenAI's $5 Billion Problem -- Remio
  3. Sora Lost $1M Per Day -- Digital Applied
  4. Sora Shutdown Highlights Cost Challenges -- CIOL
  5. LLM API Pricing Comparison 2025 -- IntuitionLabs
  6. AI API Pricing Comparison 2026 -- IntuitionLabs
  7. LLM API Pricing -- PricePerToken
  8. OpenAI Losing Money on ChatGPT Pro -- TechCrunch
  9. AI CapEx 2026: The $690B Sprint -- Futurum Group
  10. Tech AI Spending Approaches $700B in 2026 -- CNBC
  11. Hyperscaler CapEx >$600B in 2026 -- IEEE
  12. Tokens Got 99.7% Cheaper, Bills Tripled -- NavyaAI
  13. LLMflation: Inference Cost Going Down Fast -- a16z
  14. OpenAI Plans Stunning Annual Losses Through 2028 -- Fortune
  15. OpenAI's $14 Billion 2026 Loss -- ainvest
  16. Anthropic Could Surpass OpenAI in Revenue -- Epoch AI
  17. AI Companies Don't Have a Profitable Business Model -- HBR
  18. The Real Reason AI Startups Are Failing -- Medium
  19. NVIDIA's $20 Billion Groq Acquisition -- FinancialContent
  20. GTC 2026: Groq 3 LPX -- The Decoder
  21. AWS to Deploy Cerebras Chips -- IEEE
  22. Groq vs Cerebras 2026 -- Algeria Tech
  23. H100 Price Guide 2026 -- JarvisLabs
  24. B200 Complete Buyer's Guide -- gpu.fm
  25. 10M Token Context Window Analysis -- Medium
  26. The 1 Trillion Token Context Window -- Siskar
  27. AI Agent Cost Optimization: Token Economics -- Zylos
  28. AI Agent Cost Optimization Guide -- Moltbook
  29. Top 5 AI Model Optimization Techniques -- NVIDIA
  30. LLM Inference Optimization -- Clarifai
  31. LLM Inference Optimization: Cut Cost and Latency -- Morph
  32. AI Inference Economics: The 1,000x Cost Collapse -- GPUnex
  33. LLM Inference Price Trends -- Epoch AI
  34. Inference Economics: 2026 Enterprise AI Cost Crisis -- AnalyticsWeek