OpenAI's gpt-oss-120b has 117 billion parameters. It activates 5.1 billion per token -- 4.4% of the total. With MXFP4 quantization, the entire model fits on a single 80GB H100 GPU. A frontier-class model. One GPU. That sentence would have been absurd eighteen months ago.
And it's not alone. Llama 4 Maverick, Mistral Small 4, GLM-5, Qwen 3.5 -- every major open-source frontier model released since early 2025 uses Mixture of Experts. According to NVIDIA, the top 10 most intelligent open-source models all use MoE architecture. Over 60% of open-source AI releases in 2026 are sparse. The architectural debate is over. MoE won.
What nobody's talking about is what this means for the people actually deploying these models.
The Scoreboard: Every Frontier Model in 2026
Here's the state of open-source frontier models as of April 2026. Count the dense ones.
| Model | Release | Total Params | Active Params | Experts | Activation Rate | Architecture |
|---|---|---|---|---|---|---|
| gpt-oss-120b | Aug 2025 | 117B | 5.1B | Undisclosed | 4.4% | MoE |
| Llama 4 Maverick | Apr 2025 | 400B | 17B | 128 + 1 shared | 4.3% | MoE |
| Llama 4 Scout | Apr 2025 | 109B | 17B | 16 | 15.6% | MoE |
| DeepSeek R1 | Jan 2025 | 671B | 37B | 256 + shared | 5.5% | MoE |
| GLM-5 | Feb 2026 | 744B | 44B | 256 | 5.9% | MoE |
| Mistral Small 4 | Mar 2026 | 119B | 6B | 128 | 5.0% | MoE |
| Qwen 3.5 | Feb 2026 | 397B | 17B | 512 | 4.3% | MoE |
| Kimi K2 | Jul 2025 | 1,040B | 32B | 384 | 3.1% | MoE |
Eight models. Zero dense. The last major dense frontier model was Qwen 2.5-72B, released in late 2024. Since then, every serious contender has been sparse.
The closed frontier tells a similar story. Google's Gemini 2.5 Pro is sparse MoE with 64 experts per block. GPT-4 has been widely rumored to be MoE since mid-2023. GPT-5 uses a routed duo system -- not classical token-level MoE, but still sparse routing. Claude is the notable holdout; Anthropic hasn't disclosed its architecture.
Dense models haven't disappeared -- Google still ships dense Gemma variants at 2B, 4B, and 31B for edge deployment. But at the frontier? It's MoE all the way down.
How MoE Actually Works (In Plain English)
Skip this section if you know the basics. But most articles get this wrong, so I'm going to explain it properly.
A standard ("dense") transformer processes every token through every parameter. If your model has 70 billion parameters, all 70 billion participate in generating each token. More parameters means more knowledge capacity, but also more compute per token.
MoE replaces the feed-forward network (FFN) in each transformer layer with a collection of smaller, specialized FFNs -- the "experts." A tiny routing network (the "router," or gate) looks at each incoming token and picks which experts should handle it. Only the selected experts compute; the rest sit idle.
Here's the math that matters:
- Llama 4 Maverick: 400B total parameters. 128 experts. Only 1 expert activated per MoE layer (plus a shared expert). Result: 17B active parameters per token. You get the knowledge capacity of 400B with the compute cost of 17B.
- Kimi K2: 1.04 trillion total parameters. 384 experts. 8 activated per token. Result: 32B active. Trillion-parameter knowledge in a 32B compute envelope.
The activation rates across frontier models have converged to a narrow band: 3-6% of total parameters per token. This isn't a coincidence. It's the sweet spot where you maximize knowledge capacity per FLOP.
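The mechanics are easy to see in code. Below is a minimal, illustrative top-k MoE layer in plain NumPy -- the toy dimensions, random weights, and ReLU expert FFNs are my assumptions for the sketch, not any particular model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, w_router, experts, k=2):
    """Toy top-k MoE FFN for a single token.
    x: (d,) hidden state; w_router: (d, n_experts); experts: list of (w_in, w_out)."""
    scores = x @ w_router                        # one routing score per expert
    top_k = np.argsort(scores)[-k:]              # pick the k highest-scoring experts
    gates = np.exp(scores[top_k] - scores[top_k].max())
    gates /= gates.sum()                         # softmax over the SELECTED experts only
    out = np.zeros_like(x)
    for gate, idx in zip(gates, top_k):
        w_in, w_out = experts[idx]
        out += gate * (np.maximum(x @ w_in, 0.0) @ w_out)  # each expert: a small ReLU FFN
    return out, top_k

d, d_ff, n_experts, k = 16, 64, 8, 2
w_router = rng.normal(size=(d, n_experts)) * 0.1
experts = [(rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1)
           for _ in range(n_experts)]

out, selected = moe_layer(rng.normal(size=d), w_router, experts, k=k)
print(f"experts used: {sorted(selected.tolist())} of {n_experts}; "
      f"FLOP fraction vs running all experts: {k / n_experts:.0%}")
```

Only the two selected experts' weights participate in the matmuls; the other six contribute zero FLOPs for this token, which is the entire trick.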
The Router Is the Whole Game
The gating network is typically a single linear layer that takes a token's hidden state and produces a score for each expert. Top-k scores win; those experts activate. Three routing strategies exist:
- Tokens choose experts (standard): Each token picks its top-k experts. Simple. But creates load imbalance -- popular experts get overwhelmed.
- Experts choose tokens (Expert Choice, Zhou et al. 2022): Each expert picks its top-k tokens. Guarantees balance but adds latency.
- Auxiliary-loss-free balancing (DeepSeek's innovation): Adds dynamic bias terms to routing scores based on expert utilization. No auxiliary loss function interfering with gradients. This is what DeepSeek V3 and R1 use, and it's arguably the most important routing innovation since Shazeer's 2017 paper.
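DeepSeek's auxiliary-loss-free idea is simple enough to sketch. The toy below is my own simplification (sign-based bias updates, synthetic Gaussian scores, one artificially "popular" expert), but it shows the core mechanism: a per-expert bias steers selection back toward balance without ever appearing in a loss term:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, k, gamma, n_tokens = 8, 2, 0.02, 256
bias = np.zeros(n_experts)        # routing-only bias, updated OUTSIDE the gradient path

def route(scores, bias, k):
    # Selection uses the biased scores; gate values would still use the raw
    # scores, so no auxiliary loss term ever leaks into the gradients.
    return np.argsort(scores + bias)[-k:]

skew = np.array([2.0] + [0.0] * (n_experts - 1))   # expert 0 is artificially "popular"
for step in range(300):
    counts = np.zeros(n_experts)
    for s in rng.normal(size=(n_tokens, n_experts)) + skew:
        counts[route(s, bias, k)] += 1
    # Push down the bias of overloaded experts, pull up the underloaded ones.
    bias -= gamma * np.sign(counts - counts.mean())

print("tokens per expert after balancing:", counts.astype(int))
```

Without the bias, expert 0 would absorb nearly every token; with it, the loads converge toward the mean while the gradient path stays untouched.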
A 35-Year History in 60 Seconds
MoE is not new. People forget this.
| Year | Milestone | Significance |
|---|---|---|
| 1991 | Jacobs, Jordan, Nowlan, Hinton | Original MoE paper. Vowel discrimination with local expert networks. |
| 2017 | Shazeer et al. at Google | "Outrageously Large Neural Networks." 137B parameter MoE LSTM. Modern era begins. |
| 2020 | GShard (Google) | 600B params. 16x model size for only 3.6x compute increase. Sub-linear scaling proven. |
| 2021 | Switch Transformer (Google) | 1.6T params. Simplified top-1 routing. Introduced auxiliary load balancing loss. |
| Dec 2023 | Mixtral 8x7B (Mistral) | Open-source MoE breakthrough. 45B total / 14B active. Better than GPT-3.5. Apache 2.0. |
| Dec 2024 | DeepSeek V3 | 671B / 37B active. Trained for $5.576M. Rewrote the economics of frontier AI. |
| Apr 2025 | Llama 4 (Meta) | Meta's first MoE. 400B / 17B active. Signaled the end of the dense-vs-sparse debate. |
| 2026 | GLM-5, Mistral Small 4, Qwen 3.5 | MoE becomes the default. Dense frontier models stop shipping. |
The inflection point was Mixtral 8x7B in December 2023. Before that, MoE was a research curiosity with Google papers and not much real-world impact. Mixtral proved you could build a better-than-GPT-3.5 model that anybody could download and run, using a fraction of the compute. After Mixtral, every lab started building sparse.
DeepSeek V3 was the second inflection. Training a 671B frontier model for $5.576M -- even acknowledging that figure excludes prior research costs -- demonstrated that MoE wasn't just more efficient. It was a completely different cost curve.
The Self-Hosting Revolution Nobody Saw Coming
Here's the part that matters for practitioners.
MoE models have a paradoxical hardware profile. They need less compute per token (because only active parameters fire) but the same memory as a dense model of their total size (because all experts must be loaded). This creates a strange situation where the binding constraint isn't how fast you can compute, but how much VRAM you have.
But quantization changes the equation.
What Fits on What
| Model | Total Params | FP16 VRAM | INT4 VRAM | Minimum Hardware | Approx. Cost |
|---|---|---|---|---|---|
| gpt-oss-120b | 117B | ~234 GB | ~60 GB | 1x H100 80GB | $15-25K |
| Qwen3.5-122B-A10B | 122B | ~244 GB | ~60 GB | 1x H100 80GB | $15-25K |
| Mistral Small 4 | 119B | ~238 GB | ~60 GB | 4x H100 (FP16) or 1x H100 (quantized) | $15-25K |
| Llama 4 Scout | 109B | ~218 GB | ~55 GB | 1x H100 80GB | $15-25K |
| Qwen3.5-35B-A3B | 35B | ~70 GB | ~22 GB | 1x RTX 4090 24GB | $1,500-2K |
| DeepSeek R1 | 671B | ~1.34 TB | ~340 GB | 8x H100 or 4x H200 | $60-250K |
| GLM-5 | 744B | ~1.49 TB | ~375 GB | 4x H200 | $150-250K |
| Kimi K2 | 1,040B | ~2.08 TB | ~520 GB | 8x H200 | $300K+ |
Source: Hardware estimates from Onyx AI, Spheron, model cards
Read that table again. A frontier-class 120B MoE model fits on a single H100 when quantized. That same H100 can serve the model at thousands of tokens per second, with only ~5B active parameters computing per token. You're getting 120B-class intelligence at 5B-class compute cost on $25K-class hardware.
Compare that to a dense 70B model -- which, quantized to 4-bit, fits on the same single H100 but activates all 70 billion parameters per token. The MoE model has nearly twice the knowledge capacity at a fraction of the compute.
The Cost Per Token Collapse
The numbers are staggering:
| Year | Approximate Cost per Million Tokens | Notes |
|---|---|---|
| Late 2022 | ~$20 | GPT-4 class |
| Early 2024 | ~$5 | GPT-4-turbo, competition begins |
| Late 2024 | ~$1 | MoE models, DeepSeek |
| Early 2026 | ~$0.40 | Blackwell hardware + MoE |
| Projected 2028 | ~$0.01 | Vera Rubin + next-gen MoE |
That's a 50x reduction in three years. And it's not just API pricing -- self-hosting costs followed the same curve. DeepInfra cut their cost per million tokens from $0.20 on Hopper to $0.05 on Blackwell with NVFP4 -- a 4x improvement from hardware alone.
The formula, per Epoch AI: an 8-way sparse MoE has inference economics comparable to a dense model with 50% of its total parameters. For 4-way sparsity, the ratio is roughly 65%. So gpt-oss-120b (117B total, 5.1B active) has inference economics roughly equivalent to a ~60B dense model -- but with the knowledge capacity of 117B.
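As a sanity check on that formula -- with the caveat that the 50% and 65% figures are the only two anchors quoted, so the interpolation and clamping below are my assumptions, not Epoch AI's:

```python
def dense_equivalent(total_b, active_b):
    """Dense-equivalent size from the two anchors quoted above:
    total/active = 4 -> 65% of total params, total/active = 8 -> 50%.
    Linear interpolation between them, clamped outside them (an assumption)."""
    sparsity = total_b / active_b
    if sparsity <= 4:
        ratio = 0.65
    elif sparsity >= 8:
        ratio = 0.50
    else:
        ratio = 0.65 + (sparsity - 4) * (0.50 - 0.65) / (8 - 4)
    return total_b * ratio

print(f"gpt-oss-120b (117B/5.1B): ~{dense_equivalent(117, 5.1):.0f}B dense-equivalent")
print(f"Mixtral 8x7B  (45B/14B):  ~{dense_equivalent(45, 14):.0f}B dense-equivalent")
```

For gpt-oss-120b this lands at roughly 58-60B, matching the figure above.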
What Most MoE Articles Get Wrong
"MoE models need less memory"
Wrong. This is the single most dangerous misconception. All expert parameters must be loaded into VRAM or RAM, because routing is decided per token, per layer -- any token can be sent to any expert, so every expert's weights must be resident. A 671B MoE model needs the same memory as a 671B dense model. You save on compute per token, not on memory.
The confusion comes from the "active parameters" marketing. When someone says "DeepSeek R1 only has 37B active parameters," that doesn't mean you need 37B parameters worth of VRAM. You need 671B parameters worth of VRAM. The 37B figure tells you about compute cost, not hardware cost.
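The arithmetic is worth making explicit. A back-of-envelope weight-memory calculator (weights only -- KV cache, activations, and framework overhead add more on top):

```python
def weight_vram_gb(params_billions, bits_per_param):
    """Weight memory in GB: (params * bits) / 8, with params given in billions."""
    return params_billions * bits_per_param / 8

# DeepSeek R1: memory follows TOTAL parameters, compute follows ACTIVE parameters.
print(f"671B total  @ FP16: {weight_vram_gb(671, 16):,.0f} GB  (the ~1.34 TB above)")
print(f"671B total  @ INT4: {weight_vram_gb(671, 4):,.1f} GB  (the ~340 GB above)")
print(f"37B 'active' @ FP16: {weight_vram_gb(37, 16):.0f} GB  -- NOT your VRAM bill")
```

The 74 GB figure on the last line is the trap: it is what the "active parameters" number suggests, and it is off by an order of magnitude.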
"Sparse means faster"
It depends. During large-batch, throughput-oriented serving (the way APIs work), MoE is absolutely faster. FLOPs scale with active parameters, not total. But during single-user, low-batch inference -- the way most local deployments work -- memory bandwidth becomes the bottleneck, not compute. You're loading all those expert weights from VRAM even though most sit idle. In that regime, a dense model with the same active parameter count can actually be slightly faster because there's no routing overhead.
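A crude roofline model makes the regime change visible. The hardware numbers below are illustrative H100-ish figures (3.35 TB/s HBM, ~989 FP16 TFLOPs), and the worst-case assumption that a decode step streams the full weight footprint is deliberately pessimistic -- this is a sketch, not a benchmark:

```python
def ms_per_token(active_b, total_b, batch, hbm_tb_s=3.35, fp16_tflops=989):
    """Crude per-token decode latency, FP16 weights (2 bytes/param) assumed."""
    # Compute side: ~2 FLOPs per active weight per token.
    compute_s = 2 * active_b * 1e9 * batch / (fp16_tflops * 1e12)
    # Memory side: pessimistic worst case -- the full weight footprint streams
    # from HBM once per step, amortized across the whole batch.
    memory_s = total_b * 1e9 * 2 / (hbm_tb_s * 1e12)
    return max(compute_s, memory_s) / batch * 1e3

print(f"MoE 35B-A3B, batch=1:   {ms_per_token(3, 35, 1):6.2f} ms/token (bandwidth-bound)")
print(f"dense 3B,    batch=1:   {ms_per_token(3, 3, 1):6.2f} ms/token")
print(f"MoE 35B-A3B, batch=256: {ms_per_token(3, 35, 256):6.2f} ms/token (amortized)")
```

At batch 1 the MoE pays for all 35B of weights despite computing with 3B; at batch 256 the weight traffic is shared and the compute advantage reappears.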
"MoE is always cheaper to train"
Mostly true, with a big asterisk. DeepSeek V3's $5.576M training cost is real but excludes all prior research, ablation experiments, and architecture search. The true all-in cost of developing a frontier MoE model is substantially higher. Still cheaper than equivalent dense models, but not by the 20x factor the headlines suggest.
GShard's numbers are more honest: 16x increase in parameters for 3.6x increase in compute. That's a ~4.4x efficiency gain. Real. Significant. Not miraculous.
The Unsolved Problems
MoE isn't a free lunch. Here are the problems that still bite.
Expert Collapse
Some experts get all the traffic while others go dormant. Without careful regularization, the model converges to using 1-2 experts for nearly all tokens, breaking the entire sparse computation promise. The traditional fix -- auxiliary load balancing loss -- works but introduces interference gradients that impair model performance. DeepSeek's auxiliary-loss-free balancing is the best solution so far, but it's relatively new and less battle-tested.
Communication Overhead
Expert parallelism (distributing experts across GPUs) generates roughly 9x the communication volume compared to tensor parallelism. Each token might route to experts on different GPUs, requiring All-to-All communication. NVIDIA's NVLink at 1.8 TB/s per GPU helps, but multi-node deployments still hit this wall.
For ultra-sparse models like Llama 4 Maverick (1 of 128 routed experts per token -- a 0.78% expert activation density), vLLM found that expert parallelism actually hurts performance: disabling it (EP=0) outperforms enabling it (EP=1) by 7-12%. The overhead exceeds the benefit when so few experts activate.
Batch Size Requirements
MoE needs larger batch sizes than dense models to be economical. Each token in a batch activates different experts, so you need enough tokens to keep all active experts busy. Dense models share weight-loading costs across batch tokens naturally. Under SLA constraints (guaranteed tokens/sec/user), providers operate at suboptimal batch sizes, erasing MoE's theoretical advantage.
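You can estimate the effect directly. Assuming (optimistically) uniform routing, the expected number of distinct experts a batch touches follows a coupon-collector-style formula:

```python
def expected_busy_experts(n_experts, k, batch):
    """Expected distinct experts hit by one batch, assuming uniform routing:
    each token independently activates k of n_experts."""
    return n_experts * (1 - (1 - k / n_experts) ** batch)

for batch in (1, 8, 64, 512):
    busy = expected_busy_experts(128, 1, batch)   # Maverick-like: 1 of 128 routed
    print(f"batch {batch:4d}: ~{busy:6.1f}/128 experts busy "
          f"({busy / 128:.0%} of expert weights earning their keep)")
```

At batch 1, a single expert out of 128 does useful work per layer; it takes batches in the hundreds before most of the resident expert weights are actually earning their memory footprint.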
Reinforcement Learning Instability
Here's a problem almost nobody discusses. The top-k routing operator creates discontinuities in the optimization landscape. Gradients with respect to unselected experts' logits are exactly zero almost everywhere. This causes "gradient blackouts and training collapses" during RL fine-tuning. It's why dense models between 0.6B and 30B still win in some agentic AI use cases -- the continuous, differentiable policy mappings of dense models are simply more stable for RL training.
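The discontinuity is easy to demonstrate. Top-k selection is piecewise constant in the router logits: a losing expert's logit can move freely with zero effect on the output, until it crosses the winner's and the output jumps. A minimal top-1 illustration:

```python
import numpy as np

def top1_route(logits):
    """Hard top-1 routing: the whole gate collapses onto a single winner."""
    return int(np.argmax(logits))

logits = np.array([2.0, 1.0, 0.5])
winner = top1_route(logits)                  # expert 0 wins

# Nudge a losing expert's logit: the selection -- and thus the layer output --
# does not move at all. The gradient w.r.t. that logit is exactly zero.
nudged = logits.copy()
nudged[2] += 1e-4
assert top1_route(nudged) == winner

# ...until the logit crosses the winner's, where the selection jumps discontinuously.
nudged[2] = logits[0] + 1e-4
print(f"winner flips: expert {winner} -> expert {top1_route(nudged)}")
```

Zero gradient almost everywhere, plus a jump at the boundary: exactly the landscape that destabilizes RL fine-tuning.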
Self-Hosting MoE: The Practical Guide
If you want to run a frontier MoE model yourself, here's what you actually need.
Step 1: Pick Your Model and Hardware Tier
Match your budget against the "What Fits on What" table above, and leave VRAM headroom beyond the quantized weights -- the table's figures cover weights only, and KV cache and activations come on top.
Step 2: Choose Your Serving Framework
# vLLM -- best for high-throughput production
pip install vllm
vllm serve openai/gpt-oss-120b --quantization mxfp4 --tensor-parallel-size 1
# SGLang -- best for DeepSeek models with expert parallelism
pip install sglang
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 --tp 8 --enable-ep-moe
# Ollama -- simplest for local development
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b
Choosing between them:
- Low concurrency (under 128 concurrent requests): Use tensor parallelism in vLLM
- High concurrency (over 512 concurrent requests): Use data parallelism + expert parallelism in SGLang
- DeepSeek specifically: SGLang has the best optimizations, including EPLB load balancing and elastic expert parallelism
- Consumer hardware / dev: Ollama or KTransformers for CPU/GPU hybrid
Step 3: Quantize Appropriately
| Format | Memory Savings | Quality Impact | Best For |
|---|---|---|---|
| FP8 | 2x vs FP16 | Near-lossless | Production serving |
| INT4 / NF4 | 4x vs FP16 | Minor degradation on math/code | Budget deployments |
| MXFP4 | ~4x vs FP16 | Near-lossless on models trained in it (gpt-oss) | NVIDIA Blackwell; also runs on H100 |
| GGUF Q4_K_M | 4x vs FP16 | Good balance | llama.cpp / Ollama |
Warning: MoE models react differently to aggressive quantization than dense models. Different experts learn different weight distributions, so uniform quantization can disproportionately harm certain expert pathways. FP8 is the safe bet for production. INT4 is fine for development and non-critical workloads.
Step 4: Monitor and Optimize
The gotcha with MoE serving: expert load imbalance. If your workload distribution doesn't match training data distribution, some experts get hammered while others idle. Monitor per-expert throughput and consider KTransformers for CPU/GPU hybrid inference, which achieves 4.6-19.7x prefilling speedups by intelligently placing experts across CPU and GPU memory.
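A minimal metric worth tracking is the max-over-mean load factor (the per-expert token counts below are hypothetical, standing in for whatever your serving stack exports):

```python
import numpy as np

def load_factor(per_expert_token_counts):
    """Max-over-mean load: 1.0 is perfectly balanced; in an expert-parallel
    setup the busiest GPU finishes roughly this factor behind the ideal."""
    counts = np.asarray(per_expert_token_counts, dtype=float)
    return counts.max() / counts.mean()

print(f"balanced workload: {load_factor([100, 95, 105, 100]):.2f}x")
print(f"skewed workload:   {load_factor([370, 10, 15, 5]):.2f}x")
```

A factor near 1.0 means your sparse compute is being used as designed; a factor of 3-4x means one expert (and the GPU holding it) is doing most of the work while the rest idle.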
When Dense Still Wins
I've painted a rosy picture of MoE. Let me be honest about where it doesn't apply.
Edge deployment. You're not running 512 experts on a phone. Dense models at 2-4B parameters remain the only option for on-device inference. Google ships dense Gemma variants specifically for this.
RL-heavy agentic workloads. The routing discontinuity problem is real. If your pipeline involves heavy reinforcement learning fine-tuning, dense models in the 8-32B range offer more stable training dynamics.
Latency-critical, single-user scenarios. If you're serving one user at a time (a coding assistant, a personal chatbot), the memory bandwidth overhead of loading all those expert weights offsets the compute savings. A dense 7B will feel snappier than a 35B-A3B MoE even though the MoE activates fewer parameters.
Simplicity. Dense models are easier to train, fine-tune, quantize, deploy, and debug. If your team doesn't have MoE serving expertise and your workload doesn't need frontier capability, a dense model is still the rational choice.
What I Actually Think
The MoE transition is the most important architectural shift in AI since the transformer itself. And I don't think people have fully internalized what it means.
Here's what it means: the cost of frontier intelligence just decoupled from the cost of frontier hardware.
Before MoE, building a frontier model meant spending $100M+ on training and requiring multi-million-dollar GPU clusters to serve. Only a handful of companies could play. The moat was capital -- whoever could spend the most on compute owned the frontier.
MoE broke that model. DeepSeek trained a frontier-class model for $5.576M. OpenAI released gpt-oss-120b that fits on a $25K GPU. Qwen 3.5's 35B-A3B variant runs on a consumer RTX 4090 while outperforming models ten times its active size. The capital moat is eroding.
This is terrible news for companies whose business model depends on selling access to frontier models via API. If a startup can self-host a 120B MoE on a single H100 and get 80%+ of GPT-5 quality for a one-time $25K investment, why are they paying OpenAI $15 per million output tokens?
It's great news for everyone else. The self-hosting economics have flipped. The break-even against API pricing at high utilization is now 2-3 months, not 2-3 years. Over a two-year horizon with steady demand, self-hosting is 5-10x cheaper than API access. That math was never this good before MoE.
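The break-even claim is easy to check against your own numbers. Everything below is an illustrative assumption except the $0.05-per-million self-serve figure quoted earlier; the $300/month for power and colocation is a placeholder:

```python
def breakeven_months(hardware_cost, tokens_m_per_month, api_price_per_m,
                     self_cost_per_m=0.05, fixed_per_month=300.0):
    """Months until a one-time hardware purchase beats ongoing API spend."""
    api = tokens_m_per_month * api_price_per_m
    self_host = tokens_m_per_month * self_cost_per_m + fixed_per_month
    if api <= self_host:
        return float("inf")                  # never pays off at this volume
    return hardware_cost / (api - self_host)

# $25K H100 box, 3,000M tokens/month, $3 per million output tokens via API:
print(f"break-even in {breakeven_months(25_000, 3_000, 3.0):.1f} months")
```

At that volume the payback is about three months; drop the volume low enough and self-hosting never pays off, which is the honest flip side of the math.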
I think the next 12 months will see a massive shift toward self-hosted MoE deployments. Not because companies want to. Because the economics demand it. When frontier-class AI costs $0.05 per million tokens to serve on your own hardware, paying $3-15 per million tokens to an API provider becomes an indefensible line item.
The dense model era lasted about six and a half years -- from the original Transformer paper in 2017 to Mixtral's proof-of-concept in late 2023. The MoE era is just beginning. And unlike dense scaling, which hit a wall of diminishing returns versus compute cost, MoE scaling has room to run. More experts, sparser activation, better routing -- each axis offers independent improvement.
We went from "frontier AI requires a data center" to "frontier AI fits on one GPU" in eighteen months. Dense models didn't make that happen. MoE did.
Sources
- NVIDIA Blog -- Mixture of Experts Powers Frontier AI Models
- LLM Stats -- AI Trends 2026
- HuggingFace -- gpt-oss-120b Model Card
- OpenAI -- Introducing gpt-oss
- Meta AI -- Llama 4: Open, Multimodal Intelligence
- HuggingFace -- Welcome Llama 4
- HuggingFace -- DeepSeek-R1
- DeepSeek V3 Technical Report (arXiv)
- GLM-5 Official Page
- Mistral AI -- Mistral Small 4
- HuggingFace -- Qwen3.5-397B-A17B
- Kimi K2 GitHub
- Gemini 2.5 Pro Technical Report (PDF)
- Encord -- GPT-5 Technical Breakdown
- Jacobs et al. 1991 -- Adaptive Mixtures of Local Experts
- Shazeer et al. 2017 -- Outrageously Large Neural Networks
- GShard (arXiv)
- Switch Transformers (JMLR)
- Mistral AI -- Mixtral of Experts
- Signal65 -- From Dense to MoE: New Economics of AI Inference
- Epoch AI -- MoE vs Dense Models Inference
- Onyx AI -- Best Self-Hosted LLMs 2026
- Spheron -- Deploy GPT-OSS on GPU Cloud
- The Register -- DeepSeek's Real Training Cost
- APXML -- Expert Collapse in MoE
- HuggingFace -- Mixture of Experts Explained
- DeepSeek -- Auxiliary-Loss-Free Load Balancing (arXiv)
- Microsoft DeepSpeed -- MoE Inference and Training
- Expert Choice Routing (arXiv)
- vLLM -- Expert Parallel Deployment Guide
- SGLang -- Expert Parallelism
- LMSYS -- Large-Scale Expert Parallelism on 96 H100s
- KTransformers -- CPU/GPU Hybrid MoE Inference
- KTransformers SOSP25 Paper (PDF)
- arXiv -- MoE Routing Instability in RL
- DigitalApplied -- Open Source AI Landscape April 2026
- Tom's Hardware -- NVIDIA Vera Rubin NVL72
- NVIDIA Developer -- GB200 NVL72 + Dynamo for MoE
- Tensor Economics -- MoE Inference from First Principles
- DeepInfra -- Quantization Guide
- GPUStack -- Quantization Impact on vLLM Performance
- ZenML -- Self-Hosting DeepSeek-R1 Cost-Benefit
- vLLM Blog -- GPT-OSS Optimizations on Blackwell
- NVIDIA Blog -- Blackwell 10x Token Cost Reduction
- Hacker News -- GPT-4 MoE Architecture Leak