OpenAI's Sora cost $15 million per day to run. Its lifetime revenue was $2.1 million. Not per day -- total. In March 2026, OpenAI shut it down. Bill Peebles, Sora's lead, said what everyone already knew: "The economics are currently completely unsustainable."
That same month, Anthropic hit $19 billion in annualized revenue and approached break-even. Same industry. Same GPU costs. Same fundamental technology. One company burned $5.4 billion annualized on a product nobody paid for. The other built a sustainable business.
The difference wasn't the models. It was the economics. And understanding those economics -- the cost curves, the pricing traps, the infrastructure bets, and the optimization tricks -- is the single most important skill in AI product development right now.
The Price of a Token
Let's start with what things actually cost. Here's the current pricing for major models as of early 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | 1M |
| GPT-5.2 Pro | $21.00 | $168.00 | 1M |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| Gemini 3 Flash | $0.50 | $3.00 | 1M |
| Grok 4.1 | $0.20 | $0.50 | 128K |
| GPT-5 mini | $0.25 | $2.00 | 128K |
| GPT-5 nano | $0.05 | $0.40 | 128K |
Sources: IntuitionLabs, PricePerToken
The spread is enormous. GPT-5.2 Pro output costs 420x more than GPT-5 nano output. A task that costs $0.004 on the cheapest model costs $1.68 on the most expensive. At scale -- millions of requests per day -- that difference is the difference between a profitable product and a catastrophic money pit.
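The arithmetic behind that spread, using output prices from the table (the model names here are labels from the table, not real API identifiers):

```python
# Cost of generating 10K output tokens at each end of the pricing table.
output_tokens = 10_000
price_per_1m_output = {"GPT-5 nano": 0.40, "GPT-5.2 Pro": 168.00}
for model, price in price_per_1m_output.items():
    print(f"{model}: ${output_tokens / 1_000_000 * price:.3f}")
# GPT-5 nano: $0.004
# GPT-5.2 Pro: $1.680
```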
And these are the subsidized prices. Sam Altman admitted OpenAI is "currently losing money" on its $200/month ChatGPT Pro subscriptions. Industry analysts estimate 30-50% API price increases over the next 18 months as vendors move toward sustainable unit economics. The prices above may be the floor, not the ceiling.
The Context Window Trap
Here's what the "10 million token context window" marketing doesn't tell you: attention scales quadratically.
Standard self-attention costs O(N^2) with sequence length. Double the context window, quadruple the computation. The KV-cache -- the memory structure that stores processed context -- scales linearly with context length. At 10 million tokens, the KV-cache alone requires an estimated 32 TB of memory. No single GPU or multi-GPU server comes close.
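To see where a figure like 32 TB comes from, here's a back-of-envelope sketch. The dimensions (100 layers, an 8,192-wide KV projection, FP16 values) are my assumptions for a hypothetical frontier-scale dense model, not any vendor's published config:

```python
# KV-cache sizing: 2 tensors (K and V) per layer, per token.
layers, kv_width, bytes_fp16 = 100, 8192, 2
per_token = 2 * layers * kv_width * bytes_fp16  # bytes per token
print(f"{per_token / 1e6:.1f} MB per token")    # ~3.3 MB
for ctx in (32_000, 1_000_000, 10_000_000):
    print(f"{ctx:>10,} tokens -> {per_token * ctx / 1e12:.1f} TB")
# 32K -> 0.1 TB, 1M -> 3.3 TB, 10M -> 32.8 TB
```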
Let's do the math on what a single 10M-token query actually costs:
| Model | 10M Input Cost | Notes |
|---|---|---|
| Claude Opus 4.6 | $50.00 | Per query, input only |
| Gemini 3.1 Pro (standard) | $20.00 | Per query, input only |
| Gemini 3.1 Pro (long context) | $40.00 | 2x price beyond 200K tokens |
| GPT-5.2 | $17.50 | Per query, input only |
$50 per query. Just for input. Add output tokens and you're looking at $75-$150+ per request. A chatbot handling 100,000 queries per day at these context lengths would cost $5-$15 million per day. That's Sora-level economics.
And the performance doesn't justify the cost. As I wrote about in my article on Llama 4 Scout's context window, performance collapses long before you hit those limits. You're paying 2x for tokens the model isn't even using effectively.
The practical reality: most production workloads run between 4K-32K tokens. The 1M+ context windows are for RAG retrieval, code analysis, and document processing -- use cases where you can batch process offline and amortize costs. Anyone designing a real-time product around 10M-token context windows needs to talk to their CFO first.
The $690 Billion Infrastructure Bet
While individual API calls might seem cheap, the aggregate numbers are staggering. Here's what the hyperscalers plan to spend on AI infrastructure in 2026:
| Company | 2026 CapEx | Notes |
|---|---|---|
| Amazon | $200B | Most for data centers; Jassy says AI capacity monetized as fast as installed |
| Alphabet (Google) | $175-185B | Third upward revision; cloud backlog surged 55% to $240B+ |
| Microsoft | $120B+ | $37.5B in most recent quarter alone |
| Meta | $115-135B | 1GW Ohio data center; Louisiana facility potentially 5GW |
| Oracle | $50B | 136% increase over 2025 |
| Total | $660-690B | Nearly double 2025's ~$380B |
Sources: Futurum Group, CNBC, IEEE
Roughly three-quarters of that -- around $500 billion -- is directly tied to AI infrastructure: GPUs, custom silicon, cooling systems, power generation. These companies are collectively betting that demand for AI compute will grow fast enough to justify spending nearly $700 billion in a single year.
For context, that's more than the GDP of Belgium. In one year. On computers.
The bet works if inference demand scales exponentially. It doesn't work if the demand curve flattens -- if companies hit cost ceilings, find that AI products don't generate enough revenue to justify the compute, or discover that optimization techniques reduce the total compute needed.
The Jevons Paradox: Cheaper Tokens, Higher Bills
Here's the number that captures the whole problem in one stat: tokens got 99.7% cheaper between GPT-4's launch and mid-2025. Enterprise AI cloud spending tripled from $11.5 billion to $37 billion in the same period.
This is the Jevons Paradox applied to AI. When something gets cheaper, you use dramatically more of it. The per-token cost dropped more than 300x, yet total spending tripled. 72% of IT leaders now report AI spending as "unmanageable."
The mechanism is agentic workflows. A simple chatbot makes one API call per user message. An AI agent -- one that reasons, plans, uses tools, and verifies its work -- makes tens to hundreds of times more calls per task. A customer support agent might make 15-30 LLM calls to resolve a single ticket. A coding agent might make 50-100. A research agent with tool use might make hundreds.
At a16z's tracked decline rate -- roughly 10x cost reduction per year for equivalent performance -- you'd think this problem would solve itself. And for simple use cases, it does. But the frontier keeps moving: more capable models enable more complex tasks, which require more compute, which consumes the savings.
The companies losing money on AI aren't the ones paying too much per token. They're the ones making too many calls per user interaction.
Who's Actually Making Money
Let me be specific about the financial reality of the major players:
OpenAI:
- Projected 2026 revenue: ~$20 billion; projected 2026 loss: $14 billion
- Growth rate: ~3.4x per year
- Cumulative cash burn expected to reach $115 billion through 2029
- Sora shut down in March 2026: $5.4 billion annualized infrastructure spend against $2.1 million lifetime revenue
Anthropic:
- March 2026 ARR: $19 billion (up from $9B at end of 2025, $1B fifteen months before)
- Growth rate: ~10x per year vs. OpenAI's 3.4x
- Expected to surpass OpenAI in revenue by mid-2026
- Break-even expected in 2026; positive cash flow projected by 2027
- Cash burn projected at ~1/3 of revenue in 2026, dropping to 9% by 2027
The broader industry:
- 3,800 AI startups shut down in 2025 (27% of the 14,000+ launched in 2024)
- Another 1,800 closed in early 2026
- MIT's Project NANDA: 95% of enterprise generative AI pilots failed to deliver measurable ROI
- Only the API providers and infrastructure companies are generating meaningful revenue. Most application-layer companies are losing money.
The Anthropic vs. OpenAI comparison is instructive. Both sell API access to foundation models. But Anthropic reached near-profitability on $19B ARR while OpenAI projects a $14B loss on similar revenue. The difference appears to be operational discipline -- Anthropic's model efficiency (Sonnet as the workhorse, Opus for premium), focused enterprise sales, and a more conservative approach to consumer products (no Sora-equivalent money pits).
The Sora Autopsy: A Case Study in Economics
Sora deserves a closer look because it's the most dramatic example of AI product economics gone wrong.
The numbers: $15 million per day in infrastructure costs. $5.4 billion annualized. Each 10-second video clip required roughly 40 minutes of total GPU time (8-10 minutes on 4 GPUs simultaneously), costing an estimated $1.30 per clip. User downloads dropped 66% from the November 2025 peak (3.33 million) to February 2026 (1.1 million). Total lifetime revenue: $2.1 million from in-app purchases.
What went wrong:
1. Video generation is a compute multiplier. Text generation produces tokens sequentially. Video generation produces frames across space and time -- each second of output requires orders of magnitude more compute than the equivalent text. The fundamental unit economics of video AI are worse than text AI by a factor of 100 to 1,000.
2. No pricing model could work. At $1.30 per clip in compute cost, you'd need to charge $5-$10 per generation to break even with overhead. Consumer willingness to pay for 10-second AI videos? Approximately zero, as the $2.1M lifetime revenue proved.
3. The Disney deal that wasn't. A rumored $1 billion partnership with Disney never formalized. No agreement was signed. The enterprise revenue that might have justified the infrastructure spend never materialized.
4. Usage declined as novelty wore off. AI-generated video has a novelty curve that peaks fast and drops hard. Without a compelling use case beyond "look what I made," retention collapses.
The Sora team pivoted to robotics research under the codename "Spud." The infrastructure was repurposed. $5.4 billion in annualized compute spending vanished in a press release.
The Custom Silicon Disruption
The GPU monoculture is breaking. And the economics of inference are about to shift because of it.
On Christmas Eve 2025, NVIDIA acquired Groq for $20 billion. Groq's Language Processing Units (LPUs) serve Llama 2 70B at 300 tokens per second. Their pricing: $0.79 per million output tokens for Llama 3.3 70B -- dramatically cheaper than equivalent GPU-based inference.
In January 2026, OpenAI signed a $10 billion+ deal with Cerebras for 750 megawatts of computing power through 2028. Cerebras's wafer-scale chips (each the size of a dinner plate) put entire models on a single piece of silicon, breaking the 1,000-token-per-second barrier for Llama 3.1-405B.
At GTC 2026, NVIDIA introduced the Groq 3 LPX -- dedicated inference hardware added to NVIDIA's platform for the first time. This is NVIDIA acknowledging that general-purpose GPUs are suboptimal for inference workloads and that specialized silicon can deliver 5x speed at 50% lower cost.
The GPU price crash is already happening. H100 cloud rental costs crashed 64-75% from the 2024 peak of $8-10/hour to $2-$4.50/hour in 2026. B200 availability is expected to push H100s below $2/hour by year-end.
| GPU | 2024 Peak Price | 2026 Price | Decline |
|---|---|---|---|
| H100 (cloud rental/hr) | $8-$10 | $2-$4.50 | -64% to -75% |
| H200 (cloud rental/hr) | N/A | $3.72-$10.60 | New |
| B200 (cloud rental/hr) | N/A | $2.25-$16.00 | New |
Source: JarvisLabs, gpu.fm
This matters because GPU cost is the single largest component of inference cost. Every 50% drop in GPU pricing flows directly to the per-token economics. If dedicated inference hardware delivers the promised 5x efficiency improvement, the cost curves for production AI change dramatically.
The Optimization Playbook
You don't have to wait for hardware to get cheaper. There's a stack of techniques that can cut inference costs 60-90% today:
Model Routing
The simplest and highest-impact optimization. Not every query needs a frontier model. Route simple queries to cheap, fast models (GPT-5 nano at $0.05/1M input) and only escalate to expensive models (Claude Opus at $5/1M input) for complex tasks.
OpenAI's GPT-5 does this internally, routing between efficiency and reasoning modes. You can build the same pattern with LLM gateways like Portkey, LiteLLM, or OpenRouter.
Potential savings: 70-90% on blended cost, since the majority of queries are simple.
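A minimal sketch of the pattern. The keyword heuristic stands in for the learned classifier or gateway policy a production router would use; the model names are just labels from the pricing table:

```python
# Route cheap-by-default, escalate on signals of complexity.
CHEAP_MODEL = "gpt-5-nano"         # $0.05 / 1M input tokens
PREMIUM_MODEL = "claude-opus-4.6"  # $5.00 / 1M input tokens

COMPLEX_MARKERS = ("analyze", "debug", "refactor", "prove", "step by step")

def route(query: str) -> str:
    """Send short, simple queries to the cheap model; escalate the rest."""
    looks_complex = (
        len(query.split()) > 150
        or any(marker in query.lower() for marker in COMPLEX_MARKERS)
    )
    return PREMIUM_MODEL if looks_complex else CHEAP_MODEL

print(route("What's the return policy?"))               # gpt-5-nano
print(route("Debug this stack trace and refactor..."))  # claude-opus-4.6
```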
Semantic Caching
Roughly 31% of LLM queries across typical workloads show semantic similarity. If someone asks "What's the return policy?" 500 times, you don't need 500 separate LLM calls. Cache the response and serve it directly.
Potential savings: up to 73% of API costs for workloads with repetitive queries.
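A toy version of the idea, assuming you supply an embedding function and tune the similarity threshold per workload; production systems typically back this with a vector database rather than a linear scan:

```python
import numpy as np

class SemanticCache:
    """Reuse a stored response when a new query's embedding is close
    enough (cosine similarity) to a previously cached query's."""

    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn    # any text -> np.ndarray embedding
        self.threshold = threshold  # similarity cutoff (needs tuning)
        self.entries = []           # list of (unit_embedding, response)

    def get(self, query: str):
        q = self.embed_fn(query)
        q = q / np.linalg.norm(q)
        for emb, response in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:
                return response     # cache hit: no LLM call needed
        return None

    def put(self, query: str, response: str):
        q = self.embed_fn(query)
        self.entries.append((q / np.linalg.norm(q), response))
```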
KV-Cache Compression
Google's TurboQuant (2026) compresses the KV-cache to 3 bits with zero measured accuracy loss, achieving 6x memory reduction. This enables longer context windows without the linear memory scaling that makes 10M-token windows impractical.
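Applied to the 10M-token example above (my arithmetic, using the ~6x figure):

```python
# Effect of 3-bit KV quantization on the 10M-token KV-cache estimate.
fp16_tb = 32.8   # from the earlier back-of-envelope
compression = 6  # reduction cited for 3-bit KV-cache quantization
print(f"{fp16_tb / compression:.1f} TB")  # ~5.5 TB -- smaller, still huge
```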
Quantization
Reducing model weights from 32-bit to 8-bit or 4-bit cuts memory requirements by 4-8x and speeds up inference proportionally, with minimal quality loss for most tasks. NVIDIA's Blackwell GPUs support native FP4/FP8 computation, making quantized inference a first-class operation.
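The weight-memory arithmetic, for a hypothetical 70B-parameter model (weights only; KV-cache and activations come on top):

```python
# Memory needed just to hold 70B weights at different precisions.
params = 70e9
for precision, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{precision}: {params * bytes_per_weight / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```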
Speculative Decoding
A small, fast "draft" model proposes 4-8 tokens; the large model verifies them in a single forward pass. Typical acceptance rates of 70-85% reduce the number of expensive forward passes by 3-5x without any quality loss.
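A deliberately simplified greedy sketch of one round. `draft_next` and `target_batch` are placeholder stubs for real model calls; real implementations use probabilistic acceptance over sampled distributions, but the structure is the same:

```python
def speculative_step(prefix, draft_next, target_batch, k=4):
    """One round of greedy speculative decoding.

    draft_next(ctx)            -> draft model's next token for ctx
    target_batch(prefix, toks) -> target model's greedy prediction at
                                  every drafted position, in ONE pass
    """
    # 1. Draft k tokens cheaply, one at a time.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. One expensive pass verifies all k positions at once.
    target_preds = target_batch(prefix, drafted)

    # 3. Keep the longest agreeing prefix; on the first disagreement,
    #    take the target's token so output always matches the target.
    accepted = []
    for draft_tok, target_tok in zip(drafted, target_preds):
        if draft_tok != target_tok:
            accepted.append(target_tok)
            break
        accepted.append(draft_tok)
    return accepted
```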
Combined Impact
Most teams can cut costs 60-80% without sacrificing quality. The best combinations achieve 70-90% savings. GPU utilization has improved from 30-40% to 70-80% through these techniques.
A Decision Framework for AI Product Economics
If you're building an AI product, here's how to think about the economics before you write a single line of code:
Step 1: Calculate Your Token Budget Per User Interaction
Estimate the average tokens (input + output) per user action. Multiply by your model's pricing. Multiply by expected daily interactions per user. That's your per-user daily cost.
```python
# Example: Customer support agent
avg_input_tokens = 8_000    # conversation history + system prompt
avg_output_tokens = 2_000   # response
calls_per_ticket = 15       # agent makes multiple LLM calls
tickets_per_day = 50        # per user/seat

# Using Claude Sonnet 4.6: $3.00/1M input, $15.00/1M output
input_cost = (avg_input_tokens * calls_per_ticket * tickets_per_day) / 1_000_000 * 3.00
output_cost = (avg_output_tokens * calls_per_ticket * tickets_per_day) / 1_000_000 * 15.00
daily_cost_per_seat = input_cost + output_cost
# = $18.00 + $22.50 = $40.50/day = ~$1,215/month per seat
```
Step 2: Compare Against Revenue Per User
If your product charges $99/month per seat and costs $1,215/month in compute per seat, you have a Sora problem. Either raise prices, reduce token consumption, or use cheaper models.
Step 3: Identify Your Optimization Levers
| Lever | Savings | Effort | Quality Impact |
|---|---|---|---|
| Model routing (send 80% to cheap model) | 70-85% | Medium | Minimal if routed well |
| Semantic caching | 30-73% | Low | None |
| Reduce context window (RAG instead of full context) | 50-80% | High | Depends on implementation |
| Prompt compression | 20-40% | Low | Minimal |
| Speculative decoding | 40-60% | High | None |
Step 4: Build Cost Monitoring From Day One
Track cost per user, per conversation, per feature. Set alerts. Build dashboards. The companies that blow their budgets are the ones that don't measure until the bill arrives.
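A minimal sketch of what day-one cost accounting can look like; the pricing entry and the $5/day alert threshold are placeholder assumptions:

```python
from collections import defaultdict

PRICES = {"claude-sonnet-4.6": (3.00, 15.00)}  # ($/1M input, $/1M output)
DAILY_ALERT_USD = 5.00                         # placeholder threshold

class CostTracker:
    """Accumulate per-user spend as LLM calls happen, not at month end."""

    def __init__(self):
        self.spend = defaultdict(float)  # user_id -> dollars today

    def record(self, user_id, model, input_tokens, output_tokens):
        p_in, p_out = PRICES[model]
        cost = input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out
        self.spend[user_id] += cost
        if self.spend[user_id] > DAILY_ALERT_USD:
            print(f"ALERT: {user_id} at ${self.spend[user_id]:.2f} today")
        return cost
```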
The Historical Cost Curve
Zoom out and the long-term trend is relentlessly downward:
| Date | GPT-4-Class Cost (per 1M tokens) | Milestone |
|---|---|---|
| Early 2023 | ~$20.00 | GPT-4 launch-era pricing |
| Early 2024 | ~$2.00 | Competition drives prices down 10x |
| Mid-2025 | ~$0.40 | Continued decline |
| Early 2026 | ~$0.14 | Near-commodity for mid-tier tasks |
| 2028 (projected) | under $0.01 | a16z projection |
Source: a16z, Epoch AI
That's a roughly 10x decline per year for equivalent capability. Since January 2024, the decline rate has accelerated to 200x per year for some benchmarks. By 2028, GPT-4-equivalent inference should cost less than a penny per million tokens.
But here's what the optimistic cost-curve story misses: the frontier moves too. When GPT-4-class inference costs a penny, nobody will be using GPT-4-class models. They'll be using GPT-7-class models that cost $10 per million tokens. The per-token cost of the latest models has been remarkably stable even as the cost of older models collapses.
The inference market grew from ~$12B in 2023 to an estimated ~$55B in 2026. Inference now represents roughly 67% of total AI compute, up from about a third in 2023. Training gets the headlines. Inference gets the bills.
What I Actually Think
Here's my position: inference economics will determine which AI companies survive, and most current business models are unsustainable.
The math is unforgiving. OpenAI projects a $14 billion loss in 2026 and expects cumulative cash burn of $115 billion through 2029. Even with $20B in revenue, the spend outpaces the income by nearly a factor of two. HBR argues that AI companies fundamentally don't have a profitable business model yet.
I think Anthropic's trajectory shows the business model can work, but only with ruthless efficiency: a tiered model lineup (Haiku for cheap, Sonnet for value, Opus for premium), enterprise-focused distribution, and no vanity products burning $15M/day. Their projected break-even on $19B ARR, while OpenAI loses $14B on similar revenue, is the clearest signal in the industry about what works.
For builders -- the people reading this blog, not the people running AI labs -- the implications are concrete:
- Every AI feature you ship has a recurring cost. Traditional software has near-zero marginal cost per user. AI products have per-token, per-call, per-user costs that scale linearly with usage. Design for this from day one.
- Model routing isn't optional; it's table stakes. Sending every query to Opus when 80% could be handled by Haiku is like running every database query against your production replica. Match the model to the task.
- Context windows are a trap. Marketing says 10M tokens. Physics says quadratic scaling. Economics says $50 per query. Use retrieval and summarization to keep context small, not brute-force large context windows.
- The cost curve will save you, but not yet. 10x per year decline is real, but you still need to survive this year. Build for today's costs, not 2028's.
- The companies that win will be the ones that treat inference cost as a core engineering discipline. Not an afterthought. Not "we'll optimize later." From the first prototype.
Sora died because nobody did this math before shipping. Every 10-second clip that nobody paid $1.30 for was a slow bleed toward a $5.4 billion annualized lesson. The opposite of Sora isn't "don't build AI products." It's "build AI products where the economics work."
The 10M-token context window is a capability. The $1M/day inference bill is a constraint. The founders and engineers who understand both will build the AI products that actually survive.
Sources
- OpenAI Sora Shutdown: $15M/Day Costs, $2.1M Revenue -- Medium
- The Real Sora Cost: OpenAI's $5 Billion Problem -- Remio
- Sora Lost $1M Per Day -- Digital Applied
- Sora Shutdown Highlights Cost Challenges -- CIOL
- LLM API Pricing Comparison 2025 -- IntuitionLabs
- AI API Pricing Comparison 2026 -- IntuitionLabs
- LLM API Pricing -- PricePerToken
- OpenAI Losing Money on ChatGPT Pro -- TechCrunch
- AI CapEx 2026: The $690B Sprint -- Futurum Group
- Tech AI Spending Approaches $700B in 2026 -- CNBC
- Hyperscaler CapEx >$600B in 2026 -- IEEE
- Tokens Got 99.7% Cheaper, Bills Tripled -- NavyaAI
- LLMflation: Inference Cost Going Down Fast -- a16z
- OpenAI Plans Stunning Annual Losses Through 2028 -- Fortune
- OpenAI's $14 Billion 2026 Loss -- ainvest
- Anthropic Could Surpass OpenAI in Revenue -- Epoch AI
- AI Companies Don't Have a Profitable Business Model -- HBR
- The Real Reason AI Startups Are Failing -- Medium
- NVIDIA's $20 Billion Groq Acquisition -- FinancialContent
- GTC 2026: Groq 3 LPX -- The Decoder
- AWS to Deploy Cerebras Chips -- IEEE
- Groq vs Cerebras 2026 -- Algeria Tech
- H100 Price Guide 2026 -- JarvisLabs
- B200 Complete Buyer's Guide -- gpu.fm
- 10M Token Context Window Analysis -- Medium
- The 1 Trillion Token Context Window -- Siskar
- AI Agent Cost Optimization: Token Economics -- Zylos
- AI Agent Cost Optimization Guide -- Moltbook
- Top 5 AI Model Optimization Techniques -- NVIDIA
- LLM Inference Optimization -- Clarifai
- LLM Inference Optimization: Cut Cost and Latency -- Morph
- AI Inference Economics: The 1,000x Cost Collapse -- GPUnex
- LLM Inference Price Trends -- Epoch AI
- Inference Economics: 2026 Enterprise AI Cost Crisis -- AnalyticsWeek