OpenAI's gpt-oss-120b has 117 billion parameters. It activates 5.1 billion per token -- 4.4% of the total. With MXFP4 quantization, the entire model fits on a single 80GB H100 GPU. A frontier-class model. One GPU. That sentence would have been absurd eighteen months ago.
And it's not alone. Llama 4 Maverick, Mistral Small 4, GLM-5, Qwen 3.5 -- every major open-source frontier model released since early 2025 uses Mixture of Experts. According to NVIDIA, the top 10 most intelligent open-source models all use MoE architecture. Over 60% of open-source AI releases in 2026 are sparse. The architectural debate is over. MoE won.
What nobody's talking about is what this means for the people actually deploying these models.
The Scoreboard: Every Frontier Model in 2026
Here's the state of open-source frontier models as of April 2026. Count the dense ones.
| Model | Release | Total Params | Active Params | Experts | Activation Rate | Architecture |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Aug 2025 | 117B | 5.1B | Undisclosed | 4.4% | MoE |
| Llama 4 Maverick | Apr 2025 | 400B | 17B | 128 + 1 shared | 4.3% | MoE |
| Llama 4 Scout | Apr 2025 | 109B | 17B | 16 | 15.6% | MoE |
| DeepSeek R1 | Jan 2025 | 671B | 37B | 256 + shared | 5.5% | MoE |
| GLM-5 | Feb 2026 | 744B | 44B | 256 | 5.9% | MoE |
| Mistral Small 4 | Mar 2026 | 119B | 6B | 128 | 5.0% | MoE |
| Qwen 3.5 | Feb 2026 | 397B | 17B | 512 | 4.3% | MoE |
| Kimi K2 | Jul 2025 | 1,040B | 32B | 384 | 3.1% | MoE |
Eight models. Zero dense. The last major dense frontier model was Qwen 2.5-72B, released in late 2024. Since then, every serious contender has been sparse.
The closed frontier tells a similar story. Google's Gemini 2.5 Pro is sparse MoE with 64 experts per block. GPT-4 has been widely rumored to be MoE since mid-2023. GPT-5 uses a routed duo system -- not classical token-level MoE, but still sparse routing. Claude is the notable holdout; Anthropic hasn't disclosed its architecture.
Dense models haven't disappeared -- Google still ships dense Gemma variants at 2B, 4B, and 31B for edge deployment. But at the frontier? It's MoE all the way down.
How MoE Actually Works (In Plain English)
Skip this section if you know the basics. But most articles get this wrong, so I'm going to explain it properly.
A standard ("dense") transformer processes every token through every parameter. If your model has 70 billion parameters, all 70 billion participate in generating each token. More parameters means more knowledge capacity, but also more compute per token.
MoE replaces the feed-forward network (FFN) in each transformer layer with a collection of smaller, specialized FFNs -- the "experts." A tiny routing network (the "router," or gate) looks at each incoming token and picks which experts should handle it. Only the selected experts compute; the rest sit idle.
Here's the math that matters:
- Llama 4 Maverick: 400B total parameters. 128 experts. Only 1 expert activated per MoE layer (plus a shared expert). Result: 17B active parameters per token. You get the knowledge capacity of 400B with the compute cost of 17B.
- Kimi K2: 1.04 trillion total parameters. 384 experts. 8 activated per token. Result: 32B active. Trillion-parameter knowledge in a 32B compute envelope.
The activation rates across frontier models have converged to a narrow band: 3-6% of total parameters per token. This isn't a coincidence. It's the sweet spot where you maximize knowledge capacity per FLOP.
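The mechanics are easy to see in code. Below is a minimal, illustrative top-k MoE layer in plain NumPy -- the toy dimensions, random weights, and ReLU expert FFNs are my assumptions for the sketch, not any particular model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, w_router, experts, k=2):
    """Toy top-k MoE FFN for a single token.
    x: (d,) hidden state; w_router: (d, n_experts); experts: list of (w_in, w_out)."""
    scores = x @ w_router                        # one routing score per expert
    top_k = np.argsort(scores)[-k:]              # pick the k highest-scoring experts
    gates = np.exp(scores[top_k] - scores[top_k].max())
    gates /= gates.sum()                         # softmax over the SELECTED experts only
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top_k):
        w_in, w_out = experts[idx]
        out += gate * (np.maximum(x @ w_in, 0.0) @ w_out)  # each expert: a small ReLU FFN
    return out, top_k

d, d_ff, n_experts, k = 16, 64, 8, 2
w_router = rng.normal(size=(d, n_experts)) * 0.1
experts = [(rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1)
           for _ in range(n_experts)]

out, selected = moe_layer(rng.normal(size=d), w_router, experts, k=k)
print(f"experts used: {sorted(selected.tolist())} of {n_experts}; "
      f"FLOP fraction vs running all experts: {k / n_experts:.0%}")
```

Only the two selected experts' weights participate in the matmuls; the other six contribute zero FLOPs for this token, which is the entire trick.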
The Router Is the Whole Game
The gating network is typically a single linear layer that takes a token's hidden state and produces a score for each expert. Top-k scores win; those experts activate. Three routing strategies exist:
- Tokens choose experts (standard): Each token picks its top-k experts. Simple. But creates load imbalance -- popular experts get overwhelmed.
- Experts choose tokens (Expert Choice, Zhou et al. 2022): Each expert picks its top-k tokens. Guarantees balance but adds latency.
- Auxiliary-loss-free balancing (DeepSeek's innovation): Adds dynamic bias terms to routing scores based on expert utilization. No auxiliary loss function interfering with gradients. This is what DeepSeek V3 and R1 use, and it's arguably the most important routing innovation since Shazeer's 2017 paper.
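DeepSeek's auxiliary-loss-free idea is simple enough to sketch. The toy below is my own simplification (sign-based bias updates, synthetic Gaussian scores, one artificially "popular" expert), but it shows the core mechanism: a per-expert bias steers selection back toward balance without ever appearing in a loss term:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, k, gamma, n_tokens = 8, 2, 0.02, 256
bias = np.zeros(n_experts)        # routing-only bias, updated OUTSIDE the gradient path

def route(scores, bias, k):
    # Selection uses the biased scores; gate values would still use the raw
    # scores, so no auxiliary loss term ever leaks into the gradients.
    return np.argsort(scores + bias)[-k:]

skew = np.array([2.0] + [0.0] * (n_experts - 1))   # expert 0 is artificially "popular"
for step in range(300):
    counts = np.zeros(n_experts)
    for s in rng.normal(size=(n_tokens, n_experts)) + skew:
        counts[route(s, bias, k)] += 1
    # Push down the bias of overloaded experts, pull up the underloaded ones.
    bias -= gamma * np.sign(counts - counts.mean())

print("tokens per expert after balancing:", counts.astype(int))
```

Without the bias, expert 0 would absorb nearly every token; with it, the loads converge toward the mean while the gradient path stays untouched.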
A 35-Year History in 60 Seconds
MoE is not new. People forget this.
| Year | Milestone | Significance |
|---|---|---|
| 1991 | Jacobs, Jordan, Nowlan, Hinton | Original MoE paper. Vowel discrimination with local expert networks. |
| 2017 | Shazeer et al. at Google | "Outrageously Large Neural Networks." 137B parameter MoE LSTM. Modern era begins. |
| 2020 | GShard (Google) | 600B params. 16x model size for only 3.6x compute increase. Sub-linear scaling proven. |
| 2021 | Switch Transformer (Google) | 1.6T params. Simplified top-1 routing. Introduced auxiliary load balancing loss. |
| Dec 2023 | Mixtral 8x7B (Mistral) | Open-source MoE breakthrough. 45B total / 14B active. Better than GPT-3.5. Apache 2.0. |
| Dec 2024 | DeepSeek V3 | 671B / 37B active. Trained for $5.576M. Rewrote the economics of frontier AI. |
| Apr 2025 | Llama 4 (Meta) | Meta's first MoE. 400B / 17B active. Signaled the end of the dense-vs-sparse debate. |
| 2026 | GLM-5, Mistral Small 4, Qwen 3.5 | MoE becomes the default. Dense frontier models stop shipping. |
The inflection point was Mixtral 8x7B in December 2023. Before that, MoE was a research curiosity with Google papers and not much real-world impact. Mixtral proved you could build a better-than-GPT-3.5 model that anybody could download and run, using a fraction of the compute. After Mixtral, every lab started building sparse.
DeepSeek V3 was the second inflection. Training a 671B frontier model for $5.576M -- even acknowledging that figure excludes prior research costs -- demonstrated that MoE wasn't just more efficient. It was a completely different cost curve.
The Self-Hosting Revolution Nobody Saw Coming
Here's the part that matters for practitioners.
MoE models have a paradoxical hardware profile. They need less compute per token (because only active parameters fire) but the same memory as a dense model of their total size (because all experts must be loaded). This creates a strange situation where the binding constraint isn't how fast you can compute, but how much VRAM you have.
But quantization changes the equation.
What Fits on What
| Model | Total Params | FP16 VRAM | INT4 VRAM | Minimum Hardware | Approx. Cost |
|---|---|---|---|---|---|
| gpt-oss-120b | 117B | ~234 GB | ~60 GB | 1x H100 80GB | $15-25K |
| Qwen3.5-122B-A10B | 122B | ~244 GB | ~60 GB | 1x H100 80GB | $15-25K |
| Mistral Small 4 | 119B | ~238 GB | ~60 GB | 4x H100 (FP16) or 1x H100 (quantized) | $15-25K |
| Llama 4 Scout | 109B | ~218 GB | ~55 GB | 1x H100 80GB | $15-25K |
| Qwen3.5-35B-A3B | 35B | ~70 GB | ~22 GB | 1x RTX 4090 24GB | $1,500-2K |
| DeepSeek R1 | 671B | ~1.34 TB | ~340 GB | 8x H100 or 4x H200 | $60-250K |
| GLM-5 | 744B | ~1.49 TB | ~375 GB | 4x H200 | $150-250K |
| Kimi K2 | 1,040B | ~2.08 TB | ~520 GB | 8x H200 | $300K+ |
Source: Hardware estimates from Onyx AI, Spheron, model cards
Read that table again. A frontier-class 120B MoE model fits on a single H100 when quantized. That same H100 can serve the model at thousands of tokens per second, with only ~5B active parameters computing per token. You're getting 120B-class intelligence at 5B-class compute cost on $25K-class hardware.
Compare that to a dense 70B model -- which, quantized to 4-bit, fits on the same single H100 but activates all 70 billion parameters per token. The MoE model has nearly twice the knowledge capacity at a fraction of the compute.
The Cost Per Token Collapse
The numbers are staggering:
| Year | Approximate Cost per Million Tokens | Notes |
|---|---|---|
| Late 2022 | ~$20 | GPT-4 class |
| Early 2024 | ~$5 | GPT-4-turbo, competition begins |
| Late 2024 | ~$1 | MoE models, DeepSeek |
| Early 2026 | ~$0.40 | Blackwell hardware + MoE |
| Projected 2028 | ~$0.01 | Vera Rubin + next-gen MoE |
That's a 50x reduction in three years. And it's not just API pricing -- self-hosting costs followed the same curve. DeepInfra cut their cost per million tokens from $0.20 on Hopper to $0.05 on Blackwell with NVFP4 -- a 4x improvement from hardware alone.
The formula, per Epoch AI: an 8-way sparse MoE has inference economics comparable to a dense model with 50% of its total parameters. For 4-way sparsity, the ratio is roughly 65%. So gpt-oss-120b (117B total, 5.1B active) has inference economics roughly equivalent to a ~60B dense model -- but with the knowledge capacity of 117B.
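As a sanity check on that formula -- with the caveat that the 50% and 65% figures are the only two anchors quoted, so the interpolation and clamping below are my assumptions, not Epoch AI's:

```python
def dense_equivalent(total_b, active_b):
    """Dense-equivalent size from the two anchors quoted above:
    total/active = 4 -> 65% of total params, total/active = 8 -> 50%.
    Linear interpolation between them, clamped outside them (an assumption)."""
    sparsity = total_b / active_b
    if sparsity <= 4:
        ratio = 0.65
    elif sparsity >= 8:
        ratio = 0.50
    else:
        ratio = 0.65 + (sparsity - 4) * (0.50 - 0.65) / (8 - 4)
    return total_b * ratio

print(f"gpt-oss-120b (117B/5.1B): ~{dense_equivalent(117, 5.1):.0f}B dense-equivalent")
print(f"Mixtral 8x7B  (45B/14B):  ~{dense_equivalent(45, 14):.0f}B dense-equivalent")
```

For gpt-oss-120b this lands at roughly 58-60B, matching the figure above.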
What Most MoE Articles Get Wrong
"MoE models need less memory"
Wrong. This is the single most dangerous misconception. All expert parameters must be loaded into VRAM or RAM, because routing is decided per token, per layer -- any token can be sent to any expert, so every expert's weights must be resident. A 671B MoE model needs the same memory as a 671B dense model. You save on compute per token, not on memory.
The confusion comes from the "active parameters" marketing. When someone says "DeepSeek R1 only has 37B active parameters," that doesn't mean you need 37B parameters worth of VRAM. You need 671B parameters worth of VRAM. The 37B figure tells you about compute cost, not hardware cost.
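The arithmetic is worth making explicit. A back-of-envelope weight-memory calculator (weights only -- KV cache, activations, and framework overhead add more on top):

```python
def weight_vram_gb(params_billions, bits_per_param):
    """Weight memory in GB: (params * bits) / 8, with params given in billions."""
    return params_billions * bits_per_param / 8

# DeepSeek R1: memory follows TOTAL parameters, compute follows ACTIVE parameters.
print(f"671B total  @ FP16: {weight_vram_gb(671, 16):,.0f} GB  (the ~1.34 TB above)")
print(f"671B total  @ INT4: {weight_vram_gb(671, 4):,.1f} GB  (the ~340 GB above)")
print(f"37B 'active' @ FP16: {weight_vram_gb(37, 16):.0f} GB  -- NOT your VRAM bill")
```

The 74 GB figure on the last line is the trap: it is what the "active parameters" number suggests, and it is off by an order of magnitude.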
"Sparse means faster"
It depends. During large-batch, throughput-oriented serving (the way APIs work), MoE is absolutely faster. FLOPs scale with active parameters, not total. But during single-user, low-batch inference -- the way most local deployments work -- memory bandwidth becomes the bottleneck, not compute. You're loading all those expert weights from VRAM even though most sit idle. In that regime, a dense model with the same active parameter count can actually be slightly faster because there's no routing overhead.
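A crude roofline model makes the regime change visible. The hardware numbers below are illustrative H100-ish figures (3.35 TB/s HBM, ~989 FP16 TFLOPs), and the worst-case assumption that a decode step streams the full weight footprint is deliberately pessimistic -- this is a sketch, not a benchmark:

```python
def ms_per_token(active_b, total_b, batch, hbm_tb_s=3.35, fp16_tflops=989):
    """Crude per-token decode latency, FP16 weights (2 bytes/param) assumed."""
    # Compute side: ~2 FLOPs per active weight per token.
    compute_s = 2 * active_b * 1e9 * batch / (fp16_tflops * 1e12)
    # Memory side: pessimistic worst case -- the full weight footprint streams
    # from HBM once per step, amortized across the whole batch.
    memory_s = total_b * 1e9 * 2 / (hbm_tb_s * 1e12)
    return max(compute_s, memory_s) / batch * 1e3

print(f"MoE 35B-A3B, batch=1:   {ms_per_token(3, 35, 1):6.2f} ms/token (bandwidth-bound)")
print(f"dense 3B,    batch=1:   {ms_per_token(3, 3, 1):6.2f} ms/token")
print(f"MoE 35B-A3B, batch=256: {ms_per_token(3, 35, 256):6.2f} ms/token (amortized)")
```

At batch 1 the MoE pays for all 35B of weights despite computing with 3B; at batch 256 the weight traffic is shared and the compute advantage reappears.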
"MoE is always cheaper to train"
Mostly true, with a big asterisk. DeepSeek V3's $5.576M training cost is real but excludes all prior research, ablation experiments, and architecture search. The true all-in cost of developing a frontier MoE model is substantially higher. Still cheaper than equivalent dense models, but not by the 20x factor the headlines suggest.
GShard's numbers are more honest: 16x increase in parameters for 3.6x increase in compute. That's a ~4.4x efficiency gain. Real. Significant. Not miraculous.
The Unsolved Problems
MoE isn't a free lunch. Here are the problems that still bite.
Expert Collapse
Some experts get all the traffic while others go dormant. Without careful regularization, the model converges to using 1-2 experts for nearly all tokens, breaking the entire sparse computation promise. The traditional fix -- auxiliary load balancing loss -- works but introduces interference gradients that impair model performance. DeepSeek's auxiliary-loss-free balancing is the best solution so far, but it's relatively new and less battle-tested.
Communication Overhead
Expert parallelism (distributing experts across GPUs) generates roughly 9x the communication volume compared to tensor parallelism. Each token might route to experts on different GPUs, requiring All-to-All communication. NVIDIA's NVLink at 1.8 TB/s per GPU helps, but multi-node deployments still hit this wall.
For ultra-sparse models like Llama 4 Maverick (1 of 128 routed experts per token -- a 0.78% expert activation density), vLLM found that expert parallelism actually hurts performance: disabling it (EP=0) outperforms enabling it (EP=1) by 7-12%. The overhead exceeds the benefit when so few experts activate.
Batch Size Requirements
MoE needs larger batch sizes than dense models to be economical. Each token in a batch activates different experts, so you need enough tokens to keep all active experts busy. Dense models share weight-loading costs across batch tokens naturally. Under SLA constraints (guaranteed tokens/sec/user), providers operate at suboptimal batch sizes, erasing MoE's theoretical advantage.
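You can estimate the effect directly. Assuming (optimistically) uniform routing, the expected number of distinct experts a batch touches follows a coupon-collector-style formula:

```python
def expected_busy_experts(n_experts, k, batch):
    """Expected distinct experts hit by one batch, assuming uniform routing:
    each token independently activates k of n_experts."""
    return n_experts * (1 - (1 - k / n_experts) ** batch)

for batch in (1, 8, 64, 512):
    busy = expected_busy_experts(128, 1, batch)   # Maverick-like: 1 of 128 routed
    print(f"batch {batch:4d}: ~{busy:6.1f}/128 experts busy "
          f"({busy / 128:.0%} of expert weights earning their keep)")
```

At batch 1, a single expert out of 128 does useful work per layer; it takes batches in the hundreds before most of the resident expert weights are actually earning their memory footprint.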
Reinforcement Learning Instability
Here's a problem almost nobody discusses. The top-k routing operator creates discontinuities in the optimization landscape. Gradients with respect to unselected experts' logits are exactly zero almost everywhere. This causes "gradient blackouts and training collapses" during RL fine-tuning. It's why dense models between 0.6B and 30B still win in some agentic AI use cases -- the continuous, differentiable policy mappings of dense models are simply more stable for RL training.
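The discontinuity is easy to demonstrate. Top-k selection is piecewise constant in the router logits: a losing expert's logit can move freely with zero effect on the output, until it crosses the winner's and the output jumps. A minimal top-1 illustration:

```python
import numpy as np

def top1_route(logits):
    """Hard top-1 routing: the whole gate collapses onto a single winner."""
    return int(np.argmax(logits))

logits = np.array([2.0, 1.0, 0.5])
winner = top1_route(logits)                  # expert 0 wins

# Nudge a losing expert's logit: the selection -- and thus the layer output --
# does not move at all. The gradient w.r.t. that logit is exactly zero.
nudged = logits.copy()
nudged[2] += 1e-4
assert top1_route(nudged) == winner

# ...until the logit crosses the winner's, where the selection jumps discontinuously.
nudged[2] = logits[0] + 1e-4
print(f"winner flips: expert {winner} -> expert {top1_route(nudged)}")
```

Zero gradient almost everywhere, plus a jump at the boundary: exactly the landscape that destabilizes RL fine-tuning.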
Self-Hosting MoE: The Practical Guide
If you want to run a frontier MoE model yourself, here's what you actually need.
Step 1: Pick Your Model and Hardware Tier
Match your budget against the "What Fits on What" table above, and leave VRAM headroom beyond the quantized weights -- the table's figures cover weights only, and KV cache and activations come on top.
Step 2: Choose Your Serving Framework
# vLLM -- best for high-throughput production
pip install vllm
vllm serve openai/gpt-oss-120b --quantization mxfp4 --tensor-parallel-size 1
# SGLang -- best for DeepSeek models with expert parallelism
pip install sglang
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 8 --enable-ep-moe
# Ollama -- simplest for local development
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b
Choosing between them:
- Low concurrency (under 128 concurrent requests): Use tensor parallelism in vLLM
- High concurrency (over 512 concurrent requests): Use data parallelism + expert parallelism in SGLang
- DeepSeek specifically: SGLang has the best optimizations, including EPLB load balancing and elastic expert parallelism
- Consumer hardware / dev: Ollama or KTransformers for CPU/GPU hybrid
Step 3: Quantize Appropriately
| Format | Memory Savings | Quality Impact | Best For |
|---|---|---|---|
| FP8 | 2x vs FP16 | Near-lossless | Production serving |
| INT4 / NF4 | 4x vs FP16 | Minor degradation on math/code | Budget deployments |
| MXFP4 | ~4x vs FP16 | Near-lossless on models trained in it (gpt-oss) | NVIDIA Blackwell; also runs on H100 |
| GGUF Q4_K_M | 4x vs FP16 | Good balance | llama.cpp / Ollama |
Warning: MoE models react differently to aggressive quantization than dense models. Different experts learn different weight distributions, so uniform quantization can disproportionately harm certain expert pathways. FP8 is the safe bet for production. INT4 is fine for development and non-critical workloads.
Step 4: Monitor and Optimize
The gotcha with MoE serving: expert load imbalance. If your workload distribution doesn't match training data distribution, some experts get hammered while others idle. Monitor per-expert throughput and consider KTransformers for CPU/GPU hybrid inference, which achieves 4.6-19.7x prefilling speedups by intelligently placing experts across CPU and GPU memory.
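A minimal metric worth tracking is the max-over-mean load factor (the per-expert token counts below are hypothetical, standing in for whatever your serving stack exports):

```python
import numpy as np

def load_factor(per_expert_token_counts):
    """Max-over-mean load: 1.0 is perfectly balanced; in an expert-parallel
    setup the busiest GPU finishes roughly this factor behind the ideal."""
    counts = np.asarray(per_expert_token_counts, dtype=float)
    return counts.max() / counts.mean()

print(f"balanced workload: {load_factor([100, 95, 105, 100]):.2f}x")
print(f"skewed workload:   {load_factor([370, 10, 15, 5]):.2f}x")
```

A factor near 1.0 means your sparse compute is being used as designed; a factor of 3-4x means one expert (and the GPU holding it) is doing most of the work while the rest idle.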
When Dense Still Wins
I've painted a rosy picture of MoE. Let me be honest about where it doesn't apply.
Edge deployment. You're not running 512 experts on a phone. Dense models at 2-4B parameters remain the only option for on-device inference. Google ships dense Gemma variants specifically for this.
RL-heavy agentic workloads. The routing discontinuity problem is real. If your pipeline involves heavy reinforcement learning fine-tuning, dense models in the 8-32B range offer more stable training dynamics.
Latency-critical, single-user scenarios. If you're serving one user at a time (a coding assistant, a personal chatbot), the memory bandwidth overhead of loading all those expert weights offsets the compute savings. A dense 7B will feel snappier than a 35B-A3B MoE even though the MoE activates fewer parameters.
Simplicity. Dense models are easier to train, fine-tune, quantize, deploy, and debug. If your team doesn't have MoE serving expertise and your workload doesn't need frontier capability, a dense model is still the rational choice.
What I Actually Think
The MoE transition is the most important architectural shift in AI since the transformer itself. And I don't think people have fully internalized what it means.
Here's what it means: the cost of frontier intelligence just decoupled from the cost of frontier hardware.
Before MoE, building a frontier model meant spending $100M+ on training and requiring multi-million-dollar GPU clusters to serve. Only a handful of companies could play. The moat was capital -- whoever could spend the most on compute owned the frontier.
MoE broke that model. DeepSeek trained a frontier-class model for $5.576M. OpenAI released gpt-oss-120b that fits on a $25K GPU. Qwen 3.5's 35B-A3B variant runs on a consumer RTX 4090 while outperforming models ten times its active size. The capital moat is eroding.
This is terrible news for companies whose business model depends on selling access to frontier models via API. If a startup can self-host a 120B MoE on a single H100 and get 80%+ of GPT-5 quality for a one-time $25K investment, why are they paying OpenAI $15 per million output tokens?
It's great news for everyone else. The self-hosting economics have flipped. The break-even against API pricing at high utilization is now 2-3 months, not 2-3 years. Over a two-year horizon with steady demand, self-hosting is 5-10x cheaper than API access. That math was never this good before MoE.
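The break-even claim is easy to check against your own numbers. Everything below is an illustrative assumption except the $0.05-per-million self-serve figure quoted earlier; the $300/month for power and colocation is a placeholder:

```python
def breakeven_months(hardware_cost, tokens_m_per_month, api_price_per_m,
                     self_cost_per_m=0.05, fixed_per_month=300.0):
    """Months until a one-time hardware purchase beats ongoing API spend."""
    api = tokens_m_per_month * api_price_per_m
    self_host = tokens_m_per_month * self_cost_per_m + fixed_per_month
    if api <= self_host:
        return float("inf")                  # never pays off at this volume
    return hardware_cost / (api - self_host)

# $25K H100 box, 3,000M tokens/month, $3 per million output tokens via API:
print(f"break-even in {breakeven_months(25_000, 3_000, 3.0):.1f} months")
```

At that volume the payback is about three months; drop the volume low enough and self-hosting never pays off, which is the honest flip side of the math.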
I think the next 12 months will see a massive shift toward self-hosted MoE deployments. Not because companies want to. Because the economics demand it. When frontier-class AI costs $0.05 per million tokens to serve on your own hardware, paying $3-15 per million tokens to an API provider becomes an indefensible line item.
The dense model era lasted about six and a half years -- from the original Transformer paper in 2017 to Mixtral's proof-of-concept in late 2023. The MoE era is just beginning. And unlike dense scaling, which hit a wall of diminishing returns versus compute cost, MoE scaling has room to run. More experts, sparser activation, better routing -- each axis offers independent improvement.
We went from "frontier AI requires a data center" to "frontier AI fits on one GPU" in eighteen months. Dense models didn't make that happen. MoE did.
Sources
- NVIDIA Blog -- Mixture of Experts Powers Frontier AI Models
- LLM Stats -- AI Trends 2026
- HuggingFace -- gpt-oss-120b Model Card
- OpenAI -- Introducing gpt-oss
- Meta AI -- Llama 4: Open, Multimodal Intelligence
- HuggingFace -- Welcome Llama 4
- HuggingFace -- DeepSeek-R1
- DeepSeek V3 Technical Report (arXiv)
- GLM-5 Official Page
- Mistral AI -- Mistral Small 4
- HuggingFace -- Qwen3.5-397B-A17B
- Kimi K2 GitHub
- Gemini 2.5 Pro Technical Report (PDF)
- Encord -- GPT-5 Technical Breakdown
- Jacobs et al. 1991 -- Adaptive Mixtures of Local Experts
- Shazeer et al. 2017 -- Outrageously Large Neural Networks
- GShard (arXiv)
- Switch Transformers (JMLR)
- Mistral AI -- Mixtral of Experts
- Signal65 -- From Dense to MoE: New Economics of AI Inference
- Epoch AI -- MoE vs Dense Models Inference
- Onyx AI -- Best Self-Hosted LLMs 2026
- Spheron -- Deploy GPT-OSS on GPU Cloud
- The Register -- DeepSeek's Real Training Cost
- APXML -- Expert Collapse in MoE
- HuggingFace -- Mixture of Experts Explained
- DeepSeek -- Auxiliary-Loss-Free Load Balancing (arXiv)
- Microsoft DeepSpeed -- MoE Inference and Training
- Expert Choice Routing (arXiv)
- vLLM -- Expert Parallel Deployment Guide
- SGLang -- Expert Parallelism
- LMSYS -- Large-Scale Expert Parallelism on 96 H100s
- KTransformers -- CPU/GPU Hybrid MoE Inference
- KTransformers SOSP25 Paper (PDF)
- arXiv -- MoE Routing Instability in RL
- DigitalApplied -- Open Source AI Landscape April 2026
- Tom's Hardware -- NVIDIA Vera Rubin NVL72
- NVIDIA Developer -- GB200 NVL72 + Dynamo for MoE
- Tensor Economics -- MoE Inference from First Principles
- DeepInfra -- Quantization Guide
- GPUStack -- Quantization Impact on vLLM Performance
- ZenML -- Self-Hosting DeepSeek-R1 Cost-Benefit
- vLLM Blog -- GPT-OSS Optimizations on Blackwell
- NVIDIA Blog -- Blackwell 10x Token Cost Reduction
- Hacker News -- GPT-4 MoE Architecture Leak