Ismat Samadov

vLLM vs TGI vs Ollama: Self-Hosting LLMs Without Burning Money or Losing Sleep

Ollama peaks at 41 tok/s. vLLM hits 793. TGI is in maintenance mode. Here's the self-hosting guide I wish existed before I started.

Tags: AI, LLM, Infrastructure, Python




Our OpenAI bill crossed $19,000 in February. The CTO asked me to "look into self-hosting." I spun up Ollama on my MacBook in 20 minutes, felt like a genius for 48 hours, then tried to serve 50 concurrent users with it and watched it collapse at 41 tokens per second. Three frameworks, two GPU providers, and one very expensive lesson later, I finally have a self-hosted stack that actually works. This is the guide I wish someone had written before I started.

The Self-Hosting Decision: When It Actually Makes Sense

Let me kill the hype upfront: self-hosting LLMs is not cheaper for most teams. Below 50 million tokens per month, APIs are almost always cheaper; the crossover from API to self-hosted happens at roughly 40-100 million tokens per month, depending on the model and instance type.

The math works like this. Groq charges $0.11 per million tokens for Llama 4 Scout. A self-hosted Llama model on an H100 costs about $1.49-2.99/hour on cloud GPUs, which translates to $1,080-2,150 per month running 24/7. Add 20-40 hours of engineer time for initial setup and 5-10 hours per month for ongoing maintenance at $75-150/hour fully loaded cost.
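To make that crossover concrete, here's the arithmetic as a small script. The GPU and API prices are the figures quoted above; the maintenance hours and engineer rate are assumed midpoints of the ranges — swap in your own numbers.

```python
# Break-even sketch for API vs self-hosted, using the article's figures.
# maintenance_hours and engineer_hourly are assumed midpoints — replace them.

def monthly_self_hosted_cost(gpu_hourly, maintenance_hours=7.5, engineer_hourly=110.0):
    """One GPU running 24/7 plus ongoing engineer time (initial setup excluded)."""
    return gpu_hourly * 24 * 30 + maintenance_hours * engineer_hourly

def monthly_api_cost(tokens_per_month, price_per_million):
    return tokens_per_month / 1_000_000 * price_per_million

def break_even_tokens(gpu_hourly, price_per_million):
    """Monthly token volume at which self-hosting matches the API bill."""
    return monthly_self_hosted_cost(gpu_hourly) / price_per_million * 1_000_000

# H100 at $1.49/hr against Groq's $0.11 per million tokens:
tokens = break_even_tokens(1.49, 0.11)
print(f"break-even vs Groq pricing: {tokens / 1e9:.1f}B tokens/month")
```

Against a floor price like Groq's, the break-even volume is enormous — which is exactly why "self-hosting saves money" only holds against pricier frontier-model APIs or at genuinely large volume. Rerun it with your real blended API rate before deciding anything.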

So why self-host at all? Three reasons:

  1. Data privacy. If you can't send customer data to third-party APIs — healthcare, finance, legal, government — self-hosting isn't a cost optimization. It's a compliance requirement.
  2. Latency control. API calls add 100-500ms of network latency. Self-hosted inference on local GPUs can hit under 50ms time-to-first-token.
  3. Volume economics. Above 2 million tokens daily (roughly 8,000+ conversations per day), self-hosting starts saving 40-60% compared to APIs. Most teams see payback within 6-12 months.

If none of these apply to you, stop reading and keep using the API. I'm serious. Self-hosting is operationally complex and the cost savings only work at scale.

The Three Frameworks: Different Tools for Different Jobs

Here's the mistake every comparison article makes: they benchmark vLLM, TGI, and Ollama on the same workload and declare a winner. That's like benchmarking a Formula 1 car, a delivery truck, and a bicycle on a racetrack. They're not competing — they're solving different problems.

Ollama: The Developer's Best Friend

Ollama hit 52 million monthly downloads in Q1 2026, with over 90,000 GitHub stars. It's the Docker of LLMs — one command pulls and runs a model:

# That's it. You're running Llama 3.2 locally.
ollama run llama3.2

# Or serve it as an API
ollama serve
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain PagedAttention in one paragraph"
}'

Ollama wraps llama.cpp under the hood, handles quantization automatically, exposes an OpenAI-compatible API, and supports NVIDIA (CUDA), Apple Silicon (Metal), and AMD (ROCm) GPUs. The model library includes Llama, Mistral, Gemma, Phi, Qwen, and 135,000+ GGUF-formatted models on HuggingFace.

But here's what nobody tells you: Ollama caps at roughly 4 parallel requests by default and peaks around 41 tokens per second under load. At 10 concurrent users, total throughput is only ~150 tokens per second. It has no PagedAttention, no continuous batching, and no multi-node support.

Ollama is perfect for: local development, prototyping, demos, single-user tools, CLI assistants, and small internal apps with under 5 concurrent users. It is not for production serving at scale.

vLLM: The Production Workhorse

vLLM is what you deploy when performance matters. Built at UC Berkeley, its core innovation is PagedAttention — inspired by virtual memory paging in operating systems, it breaks the KV cache into fixed-size blocks that can be stored anywhere in GPU memory and reused across requests.

The impact is massive. Existing LLM serving systems waste 60-80% of KV cache memory. vLLM reduces that waste to under 4%. That memory efficiency translates directly to throughput: vLLM achieves up to 24x higher throughput than HuggingFace Transformers and hits 793 tokens per second where Ollama manages 41.
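To see why KV cache management dominates serving economics, it helps to size the cache. The numbers below are the published Llama-3.1-8B configuration (32 layers, 8 grouped-query KV heads, head dimension 128) at fp16; this is a back-of-envelope sketch, not vLLM's actual allocator.

```python
# Back-of-envelope KV cache sizing for Llama-3.1-8B: 32 layers, 8 KV heads
# (grouped-query attention), head_dim 128, fp16 (2 bytes). Illustrative only.

def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # factor of 2 = one key vector + one value vector per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()   # 131072 bytes = 128 KiB per token
per_request = per_token * 8192     # a request holding a full 8K context
print(f"{per_token // 1024} KiB/token, {per_request / 2**30:.1f} GiB per 8K-context request")
```

A serving system that preallocates the full context window per request pins that entire 1 GiB whether the request generates 8,000 tokens or 80 — that reservation is most of the 60-80% waste, and PagedAttention removes it by handing out small fixed-size blocks on demand.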

# Install and serve
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# OpenAI-compatible API out of the box
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

vLLM V1 (January 2025) brought a major architectural upgrade with 1.7x speedup, and the latest releases add full NVIDIA Blackwell SM120 and H200 support. Key features:

  • Continuous batching — new requests join the batch without waiting for the current batch to finish
  • Tensor parallelism — split a model across multiple GPUs
  • Speculative decoding — use a small draft model to speed up inference
  • Structured outputs — constrained decoding for JSON, regex, grammar-based generation
  • 85-92% GPU utilization under high concurrency

vLLM is the default choice for production LLM serving in 2026. But it requires a GPU with sufficient VRAM, more setup than Ollama, and operational knowledge of CUDA, model loading, and memory management.

TGI: The Fallen Champion (Now in Maintenance Mode)

This is the part most comparison articles get wrong in 2026, because they're still recommending TGI as a production option.

Hugging Face put TGI into maintenance mode in December 2025. Only minor bug fixes and documentation improvements are accepted. Hugging Face is now contributing directly to the vLLM project and integrating vLLM as a TGI backend. They're also partnering with the llama.cpp team for CPU-based inference.

TGI was a great framework. It introduced continuous batching and Flash Attention to the ecosystem. In 2024, TGI v3.0 was 13x faster than vLLM on long prompts. But the ecosystem moved, and Hugging Face made the pragmatic decision to converge rather than maintain two separate engines.

My recommendation: Don't start new deployments on TGI. If you're already running TGI in production, plan a migration to vLLM or SGLang within the next 6-12 months. The clock is ticking.

The Benchmark Table Nobody Shows You

Most comparison articles show single-metric benchmarks. Here's the full picture, based on Llama 3.1 8B on a single GPU:

| Metric | Ollama | TGI | vLLM |
|---|---|---|---|
| Single-user throughput | 65 tok/s | 110 tok/s | 140 tok/s |
| 10 concurrent users | ~150 tok/s total | ~500 tok/s total | ~800 tok/s total |
| Peak throughput | 41 tok/s per user | ~50 tok/s per user | 793 tok/s total |
| P99 latency (peak) | 673 ms | ~200 ms | 80 ms |
| GPU utilization | 40-60% | 65-80% | 85-92% |
| PagedAttention | No | Yes (via vLLM kernels) | Yes (native) |
| Continuous batching | No | Yes | Yes |
| Multi-GPU support | Limited | Yes | Yes (tensor parallelism) |
| Setup time | 5 minutes | 30 minutes | 45 minutes |
| OpenAI-compatible API | Yes | Yes | Yes |
| Structured output | Yes | Limited | Yes |
| Maintenance status | Active | Maintenance mode | Active |

The gap between Ollama and vLLM isn't 2x. It's 19x at peak throughput. If you're serving more than a handful of concurrent users, that difference isn't a nice-to-have — it's the difference between a responsive application and a queue that backs up indefinitely.
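The practical consequence of that gap is queueing, not just slowness. A toy sketch makes it visible — the traffic numbers are illustrative assumptions, so plug in your own:

```python
# Why the throughput gap becomes a queue: when requests arrive faster than the
# engine's aggregate token rate can drain them, the backlog grows every second.
# Traffic numbers below are illustrative assumptions.

def backlog_after(seconds, arrival_rps, tokens_per_request, engine_tok_s):
    served_rps = engine_tok_s / tokens_per_request  # requests finished per second
    return max(0.0, (arrival_rps - served_rps) * seconds)

arrival, tokens = 3.0, 200   # 3 requests/s, 200 output tokens each
print("Ollama (~150 tok/s) backlog after 60s:", backlog_after(60, arrival, tokens, 150))
print("vLLM   (~800 tok/s) backlog after 60s:", backlog_after(60, arrival, tokens, 800))
```

At this modest load Ollama falls 135 requests behind in a minute while vLLM keeps up with room to spare — the "queue that backs up indefinitely" is just arrival rate exceeding service rate.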

The Dark Horse: SGLang

I'd be doing you a disservice if I didn't mention SGLang. It's the fastest inference engine in 2026 for multi-turn workloads.

SGLang delivers approximately 16,200 tokens per second on H100s, compared to vLLM's ~12,500 — a 29% throughput advantage. Its core innovation is RadixAttention: it stores cached prefixes in a radix tree, enabling up to 5x faster inference for workloads with shared prefixes.
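To build intuition for RadixAttention, here's a toy prefix cache over token IDs. The real engine caches KV blocks in GPU memory and handles eviction; this sketch only shows why requests with shared prefixes are nearly free to start.

```python
# Toy illustration of the RadixAttention idea: store seen prefixes in a trie
# so a new request only pays compute for tokens past its longest cached prefix.
# Real SGLang operates on KV cache blocks; this just demonstrates the lookup.

class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        node, depth = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, depth = node[t], depth + 1
        return depth  # tokens whose KV entries can be reused

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])             # system prompt + first turn
hit = cache.longest_prefix([1, 2, 3, 9])  # second turn shares a 3-token prefix
print(f"reused {hit} of 4 tokens")        # only the last token needs fresh compute
```

In a multi-turn chat, every turn shares the system prompt and all prior turns as a prefix — which is exactly the workload shape where the claimed up-to-5x speedup shows up.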

SGLang already powers over 400,000 GPUs across xAI (Grok 3), Microsoft Azure, LinkedIn, and Cursor. If your workload involves multi-turn conversations, shared document context, or structured output, SGLang is worth evaluating alongside vLLM.

But vLLM has the more mature ecosystem, larger community, and broader model support. For most teams in 2026, vLLM is still the safer default.

Quantization: How to Run 70B Models on Consumer GPUs

You don't need a $30,000 H100 to run serious models. Quantization compresses model weights from 16-bit floats to 4-bit or 5-bit integers, cutting memory requirements by 70-75% with 92-95% quality retention.

The practical cheat sheet:

| Quantization | Quality Retention | VRAM for 7B | VRAM for 70B | Best For |
|---|---|---|---|---|
| FP16 (no quant) | 100% | ~14 GB | ~140 GB | Benchmarking only |
| Q8_0 | ~99% | ~8 GB | ~75 GB | Quality-critical production |
| Q5_K_M | ~97% | ~5.5 GB | ~50 GB | Production sweet spot |
| Q4_K_M | ~95% | ~4.5 GB | ~40 GB | Most users, best balance |
| Q3_K_M | ~90% | ~3.5 GB | ~32 GB | Extremely constrained |

Q4_K_M is the mainstream choice — good for most tasks with acceptable quality loss. Q5_K_M is recommended for critical applications where you can afford the extra VRAM.

For Ollama, quantization is automatic — ollama run llama3.2 pulls a Q4 variant by default. For vLLM, you'll typically use AWQ or GPTQ quantized models from HuggingFace:

# vLLM with AWQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 4096

The GPU Decision: What Hardware to Actually Buy (or Rent)

The fundamental constraint for LLM serving is VRAM, not compute FLOPS. The formula: model parameters x bytes per parameter (set by quantization) + KV cache overhead + 2-4 GB framework runtime.
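That formula turns into a sanity-check script easily. The bytes-per-parameter values below are rough averages I'm assuming from the quantization table earlier (GGUF K-quants store slightly more than their nominal bit width), and the KV cache and runtime overheads are placeholders — adjust all three for your setup.

```python
# Sanity check for the VRAM formula: weights + KV cache + framework runtime.
# bytes_per_param is assumed: ~2.0 for FP16, ~0.6 for Q4_K_M quantization.
# kv_gb and runtime_gb are placeholder overheads, not measured values.

def vram_gb(params_billions, bytes_per_param, kv_gb=2.0, runtime_gb=3.0):
    return params_billions * bytes_per_param + kv_gb + runtime_gb

print(f"70B @ Q4_K_M: ~{vram_gb(70, 0.6):.0f} GB")  # ~47 GB
print(f"8B  @ FP16:   ~{vram_gb(8, 2.0):.0f} GB")   # ~21 GB
```

The ~40 GB of weights alone is why a 70B Q4 model is tight on an A100 40GB and comfortable on an 80 GB card once you add KV cache headroom.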

For development (Ollama):

  • 12 GB GPU (RTX 3060, RTX 4060 Ti): 7B models at Q4, comfortable for prototyping
  • 16 GB GPU (RTX 4080, M2 Pro): 13B models, sweet spot for local development
  • 24 GB GPU (RTX 4090, M3 Max): 30B models, entry point for 70B with aggressive quantization

For production (vLLM):

  • A100 40GB ($1.19/hr on RunPod): 70B Q4, good for moderate traffic
  • A100 80GB ($1.59/hr on RunPod): 70B Q5, comfortable headroom for KV cache
  • H100 80GB ($1.49-2.99/hr cloud): The production workhorse, 70B FP16 with room to spare

Current cloud GPU pricing for a single card running 24/7:

| GPU | Cloud Price/Hr | Monthly Cost | VRAM | Best For |
|---|---|---|---|---|
| A100 40GB | ~$1.19 | ~$860 | 40 GB | Budget production |
| A100 80GB | ~$1.59 | ~$1,145 | 80 GB | Standard production |
| H100 80GB | ~$1.49-2.99 | ~$1,080-2,150 | 80 GB | High-throughput production |
| RTX 4090 | Buy: ~$1,600 | Electricity only | 24 GB | On-prem development |

Practical Deployment: From Zero to Serving

Here's the deployment path I'd follow today if starting from scratch.

Phase 1: Prototype with Ollama (Day 1)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and test your target model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# Test the API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'

Build your application against Ollama's OpenAI-compatible API. All your code will work unchanged when you switch to vLLM later — same API format.
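Because both servers speak the same OpenAI-compatible protocol, you can isolate request construction behind one function so that only the base URL and model name change between dev and prod. A minimal stdlib-only sketch (endpoint paths match the curl examples above):

```python
# One request builder for both backends: the OpenAI-compatible payload is
# identical; only base URL and model name differ between Ollama and vLLM.

import json
import urllib.request

def chat_request(base_url, model, user_message):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Same call site, different backend:
dev  = chat_request("http://localhost:11434", "llama3.1:8b", "Hello")
prod = chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello")
# urllib.request.urlopen(dev) would send it; omitted since no server is running here.
```

In a real app you'd read the base URL and model name from config, which makes the Ollama-to-vLLM switch a one-line environment change.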

Phase 2: Benchmark your workload (Days 2-3)

Before committing to a GPU and framework, measure your actual traffic:

  • How many concurrent users do you expect?
  • What's the average input/output token length?
  • Do you need multi-turn conversations (favors SGLang) or single-turn (vLLM is fine)?
  • What's your latency budget?

If the answers are "under 5 concurrent users" and "latency doesn't matter much" — stay on Ollama. You're done.
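One way to use those answers: convert them into a required aggregate throughput and compare that number against the benchmark table. The function and inputs below are illustrative — measure your own traffic first.

```python
# Turn workload answers into a required aggregate throughput you can compare
# against the benchmark table. Inputs are illustrative assumptions.

def required_tok_s(concurrent_users, avg_output_tokens, latency_budget_s):
    """Aggregate tokens/s the engine must sustain so every in-flight
    response finishes within the latency budget."""
    return concurrent_users * avg_output_tokens / latency_budget_s

need = required_tok_s(concurrent_users=20, avg_output_tokens=300,
                      latency_budget_s=10.0)
print(f"need ~{need:.0f} tok/s aggregate")  # 600 — above Ollama's ~150, within vLLM's ~800
```

If the result lands under ~150 tok/s, Ollama survives; anywhere near or above that, you're in vLLM territory.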

Phase 3: Deploy vLLM for production (Days 4-7)

# Docker deployment (recommended for production)
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

For multi-GPU setups:

# Tensor parallel across 2 GPUs for 70B models
docker run --runtime nvidia --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90

Phase 4: Add monitoring and failover (Week 2)

Put a reverse proxy (Nginx, Caddy) in front with authentication. Never expose vLLM directly to the internet. Add health checks, log request latency, and set up alerts for GPU memory and throughput degradation.

Use LiteLLM as your proxy layer for cost tracking, rate limiting, and API key management. It adds load balancing across multiple vLLM instances and automatic failover to cloud APIs when your self-hosted cluster is overloaded.

The Mistakes I Made (So You Don't Have To)

Mistake 1: Using vLLM for development. I spent an entire day configuring CUDA, downloading model shards, and debugging tensor parallel settings — for a prototype that only I was using. Should've used Ollama for development and switched to vLLM only when deploying to production. The setup overhead is zero with Ollama.

Mistake 2: Undersizing RAM. Our first production deployment had 32 GB system RAM for a 70B quantized model (~35 GB at Q4). The model loaded fine but the system became unusable — no headroom for the OS, KV cache overflow to CPU memory, constant swapping. System RAM should be at least 1.5x model size. We upgraded to 64 GB and the problems vanished.

Mistake 3: Ignoring model quality on our actual use case. We picked the model with the best MMLU benchmark score. It was terrible at our specific task (customer support ticket classification). Benchmark performance doesn't match real-world task quality. We wasted two weeks before realizing a smaller, fine-tuned model outperformed the larger one on our data.

Mistake 4: No fallback to cloud APIs. GPU hardware fails. CUDA OOMs happen. When our self-hosted endpoint went down at 2 AM, our entire product was down. Now we route through LiteLLM with automatic fallback to OpenAI when self-hosted latency exceeds 5 seconds. The cloud API bill during outages is negligible compared to the cost of downtime.

Mistake 5: Running at 100% GPU utilization. Sounds efficient. Isn't. At 100% GPU memory utilization, there's no headroom for KV cache growth during longer conversations. We had random OOMs during peak hours because a single long-context request would push memory over the edge. We settled on 90% GPU memory utilization (--gpu-memory-utilization 0.90) — the remaining 10% acts as a buffer.

The Decision Framework

Stop googling benchmarks. Answer these four questions:

How many concurrent users?

  • Under 5 → Ollama. Done.
  • 5-50 → vLLM on a single GPU.
  • 50-500 → vLLM with tensor parallelism or SGLang.
  • 500+ → Multiple vLLM instances behind a load balancer.
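The concurrency ladder is mechanical enough to encode directly — the thresholds are this article's, and the output is a starting point, not a verdict:

```python
# The concurrency question as a function; thresholds match the list above.

def framework_for(concurrent_users):
    if concurrent_users < 5:
        return "Ollama"
    if concurrent_users <= 50:
        return "vLLM on a single GPU"
    if concurrent_users <= 500:
        return "vLLM with tensor parallelism, or SGLang"
    return "multiple vLLM instances behind a load balancer"

print(framework_for(3))    # Ollama
print(framework_for(120))  # vLLM with tensor parallelism, or SGLang
```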

What's your budget?

  • $0/month → Ollama on existing hardware.
  • Under $1,000/month → Single A100 on RunPod or Lambda.
  • $1,000-5,000/month → H100 with room for redundancy.
  • Over $5,000/month → Multi-GPU cluster, consider reserved instances.

How sensitive is your data?

  • Public data → Use APIs. Cheaper, simpler, better.
  • Internal but not regulated → Self-host on cloud GPUs.
  • Regulated (HIPAA, SOC2, GDPR) → Self-host on dedicated or on-prem hardware.

How much ops capacity do you have?

  • No dedicated infra team → Ollama or managed APIs.
  • Some infra experience → vLLM on a managed GPU cloud.
  • Full platform team → vLLM/SGLang on bare metal.

What I Actually Think

The self-hosting discourse is full of people who either think everyone should self-host (they're selling GPU compute) or nobody should (they're selling API access). The truth is boring: it depends on your scale, your data sensitivity, and your willingness to maintain infrastructure.

I think Ollama is one of the best developer tools released in the last three years. The fact that you can ollama run llama3.2 and have a working LLM in 20 seconds is extraordinary. But I also think too many people try to run Ollama in production and wonder why it falls over. It's a development tool. Use it for development.

I think vLLM is the correct default for production self-hosting in 2026. Not because it's always the fastest — SGLang beats it on multi-turn by 29% — but because it has the largest community, the most integrations, and the deepest documentation. When you hit a problem at 3 AM, there are Stack Overflow answers for vLLM. There aren't for SGLang.

I think TGI is effectively dead for new deployments. Hugging Face made the right call putting it in maintenance mode — maintaining two inference engines makes no sense when you can contribute to the one that's winning. If you're on TGI today, start planning your migration.

I think most teams that self-host will save less money than they expect and spend more time on operations than they budgeted. The break-even point is real, but it's higher than the blog posts claim once you factor in engineer time, GPU redundancy, and the operational overhead of keeping a CUDA stack healthy. Self-host because you need to — for data privacy, latency, or genuine volume economics — not because it sounds cool.

And I think the biggest unlock isn't the framework choice. It's quantization. The fact that you can run a 70B parameter model on a single $1,600 RTX 4090 with Q4 quantization and get 95% of the original quality is the real revolution. Three years ago, that model required a $200,000 cluster. Now it runs on hardware you can buy at Best Buy.

Start with Ollama. Build your app. Measure your traffic. And only when the numbers actually justify it, deploy vLLM on a rented GPU. Everything else is premature optimization.

