Our Datadog bill hit $47,000 in a single month. We were running 120 hosts, ingesting about 2 TB of logs per day, and tracking 80,000 custom metrics across a Kubernetes cluster. Nobody authorized that number. Nobody even predicted it. A noisy logging change in one microservice doubled our log ingestion overnight, and Datadog happily charged us for every byte.
That invoice was the beginning of the end of our Datadog contract — and the beginning of our OpenTelemetry migration. We weren't alone. Across the industry, engineering teams are discovering that the observability vendor they chose three years ago is now their second-largest infrastructure cost after compute itself.
OpenTelemetry isn't just an alternative to Datadog. It's a fundamentally different approach to observability — one where you own your instrumentation, choose your backend, and never get surprised by a bill again. Here's why it's winning, how the stack actually works, and what the migration looks like in practice.
The Numbers Tell the Story
Let's start with what's happening in the market, because the shift is bigger than most people realize.
OpenTelemetry is the second-highest-velocity project in the CNCF, behind only Kubernetes. It saw a 39% rise in commits, and its active contributor base grew from 1,301 to 1,756 in a single year, a 35% increase. All-time, the project counts 28,435 contributors across 5,353 organizations.
On the adoption side, 48% of cloud-native organizations are already using OpenTelemetry, with another 25% planning to implement it soon. More than 61% of organizations believe OpenTelemetry is a "Very Important" or "Critical" enabler of observability.
Meanwhile, Datadog is still growing — revenue hit $3.43 billion in fiscal year 2025, up 28% year-over-year — but there's a tension underneath those numbers. Their guidance for 2026 projects $4.06-$4.10 billion, implying growth deceleration to under 20%. The observability market itself is expected to reach $6.93 billion by 2031, growing at 15.6% CAGR. Datadog is growing faster than the market — for now — but the pricing pressure is real.
Here's the uncomfortable truth for Datadog: their revenue growth is partly driven by the same pricing model that's pushing customers away. Many teams report Datadog bills growing 30-50% year-over-year, not because they're using more features, but because their infrastructure is scaling and Datadog's per-host, per-metric, per-GB pricing scales with it.
Why Datadog Bills Explode
If you've never been surprised by a Datadog invoice, you haven't been using it at scale. The pricing model has layers of complexity that only reveal themselves during traffic spikes or infrastructure growth.
The breakdown:
| Component | Datadog Pricing | What Actually Happens |
|---|---|---|
| Infrastructure | $15-23/host/month | K8s pods count as hosts. Autoscaling = unpredictable costs |
| Log Management | $0.10/GB ingested + $1.70/million indexed | One noisy service = thousands in surprise charges |
| APM (traces) | $31/host/month + $1.70/million spans | Distributed tracing across microservices adds up fast |
| Custom Metrics | $0.05/metric/month | A K8s cluster with Prometheus exporters easily generates 50K+ metrics |
| Synthetics | $5-12/test/month | Monitoring dozens of endpoints gets expensive |
The math gets ugly quickly. A mid-size team running 50 hosts with moderate log and trace volume can easily reach $15,000 to $30,000 per month. A real-world comparison using the OpenTelemetry Demo application showed Datadog costing approximately $174/day versus OpenObserve at approximately $3/day for identical telemetry data — a 58x cost difference.
The worst part: custom metrics are particularly expensive at $0.05 per metric per month. A typical Kubernetes cluster with Prometheus exporters can easily generate 50,000+ custom metrics. That's an extra $2,500/month on top of your host costs, just for metrics you might not even be looking at.
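To see how fast these line items compound, here's a back-of-the-envelope estimator using the list prices from the table above. This is a sketch only: real bills include committed-use discounts, included metric allotments, APM, and log indexing, none of which are modeled here.

```python
def estimate_monthly_cost(hosts, log_gb_per_day, custom_metrics,
                          host_rate=23.0,        # $/host/month (top of range)
                          log_rate_per_gb=0.10,  # $/GB ingested
                          metric_rate=0.05):     # $/custom metric/month
    """Rough Datadog bill from the three biggest line items."""
    infra = hosts * host_rate
    logs = log_gb_per_day * 30 * log_rate_per_gb  # ~30 days/month
    metrics = custom_metrics * metric_rate
    return infra + logs + metrics

# A 50-host cluster, 200 GB of logs/day, 50,000 custom metrics:
print(f"${estimate_monthly_cost(50, 200, 50_000):,.0f}/month")  # $4,250/month
```

Note what dominates: the metrics line alone. And if one noisy deploy doubles your log volume, the logs line doubles with it; nothing in the pricing model caps it.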
Teams are responding in predictable ways. Many are moving to a split model: keep Datadog for the one or two products it does best (usually APM), and move everything else to cheaper alternatives. Logs are usually the first thing to migrate because they're the biggest line item and the easiest to redirect via OpenTelemetry.
What OpenTelemetry Actually Is (And Isn't)
Here's where most articles get it wrong. OpenTelemetry is not a Datadog replacement. It's not a monitoring platform. It's not something you install and get dashboards.
OpenTelemetry is an instrumentation standard — a set of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, and logs). It standardizes how your applications produce observability data. It does not store that data or visualize it. You still need a backend for that.
Think of it this way:
```
Your Application
       │
       │  (OpenTelemetry SDK — generates telemetry)
       ▼
OTel Collector
       │
       │  (OTLP protocol — vendor-neutral format)
       ▼
Backend of Your Choice
       ├── Grafana + Prometheus + Loki + Tempo (self-hosted)
       ├── Jaeger (traces only)
       ├── SigNoz (self-hosted full stack)
       ├── Grafana Cloud (managed)
       ├── Datadog (yes, it accepts OTel data)
       └── Elastic / New Relic / Honeycomb / etc.
```
The power is in that middle layer. Once your application emits data via OpenTelemetry, switching backends is a configuration change in the Collector — not a re-instrumentation of your entire codebase.
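As a concrete sketch: swapping the trace backend from Tempo to Jaeger (which accepts OTLP natively in modern versions) is just an exporter change. Endpoints here are illustrative:

```yaml
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317        # was: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]   # was: [otlp/tempo]
```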
This is the fundamental difference from Datadog's approach. With Datadog, you install the Datadog agent, use the Datadog SDK, send data in Datadog's format, store it on Datadog's platform, and visualize it in Datadog's UI. If you cancel, you lose access to historical data. Your instrumentation is coupled to your vendor.
With OpenTelemetry, your instrumentation is vendor-neutral. Your data is stored in whatever backend you choose, and you can migrate between backends without losing data fidelity. You own the data because you control where it lives.
The LGTM Stack: Open-Source Observability That Actually Works
The most popular open-source backend for OpenTelemetry data is the LGTM stack — Loki, Grafana, Tempo, Mimir (or Prometheus). Each component handles one telemetry type:
| Component | Handles | Replaces |
|---|---|---|
| Prometheus / Mimir | Metrics | Datadog Infrastructure, Custom Metrics |
| Loki | Logs | Datadog Log Management |
| Tempo | Traces | Datadog APM |
| Grafana | Visualization | Datadog Dashboards |
| OTel Collector | Collection & routing | Datadog Agent |
This modular architecture means you can adopt incrementally. Start with metrics (Prometheus is already everywhere). Add log aggregation with Loki. Add distributed tracing with Tempo. Visualize everything in Grafana. Each piece works independently.
The performance numbers back it up. One analysis found that integrating Prometheus, Grafana, Loki, and Tempo reduced mean time to resolution (MTTR) by 65% compared to traditional monitoring stacks, and Loki helped Paytm Insider cut logging and monitoring costs by 75%.
And 78% of enterprises now operate in hybrid cloud environments, where a single vendor's agent often can't cover everything. The open-source stack runs wherever you run — on-prem, any cloud, edge nodes — without per-host licensing.
Setting Up OpenTelemetry: Python in 10 Minutes
Let me show you how fast it is to get started. This is a real setup, not a toy example.
Step 1: Install the SDK
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
```
The opentelemetry-bootstrap command detects libraries in your environment and installs the matching instrumentation packages automatically. If you're running Flask, SQLAlchemy, and Redis — it finds all three and installs the instrumentation for each.
Step 2: Auto-Instrument Your Application
Zero code changes. Seriously.
```bash
OTEL_SERVICE_NAME=my-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```
That's it. Your Flask routes, SQLAlchemy queries, Redis calls, and HTTP client requests are now producing traces, metrics, and logs in OTLP format. The auto-instrumentation works by monkey-patching library functions at runtime to capture telemetry data.
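The mechanism is less magical than it sounds. Here's a stripped-down illustration in plain Python of what wrapping a library function looks like. This is a toy, not the actual OTel code, which also propagates trace context, records exceptions as span events, and exports over OTLP:

```python
import functools
import json
import time

def instrument(fn):
    """Wrap a function so every call records a pseudo-span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            print(f"span: {fn.__name__} took {duration_ms:.2f}ms")
    return wrapper

# Auto-instrumentation does essentially this to library internals at
# startup (e.g. requests.Session.send, flask.Flask.wsgi_app):
json.dumps = instrument(json.dumps)   # monkey-patch in place
json.dumps({"hello": "world"})        # now emits a timing record per call
```

The wrapped function still returns its normal result; the telemetry is a side effect, which is why auto-instrumentation needs no code changes in your application.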
Step 3: Add Custom Spans
Auto-instrumentation catches framework-level operations, but you'll want custom spans for your business logic:
```python
from opentelemetry import trace

tracer = trace.get_tracer("my-service")

@app.route("/checkout")
def checkout():
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("cart.item_count", len(cart.items))
        span.set_attribute("cart.total", cart.total)

        with tracer.start_as_current_span("validate_inventory"):
            validate_inventory(cart)

        with tracer.start_as_current_span("charge_payment"):
            charge_payment(cart.total)

        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(user)

    return {"status": "confirmed"}
```
Each start_as_current_span creates a child span in the trace. When you view this in Jaeger or Tempo, you'll see the full waterfall: how long each step took, which one was slow, and where errors occurred.
Step 4: Deploy the Collector
The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. Here's a production-ready config:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
```
Deploy it alongside your services:
```yaml
# docker-compose.yaml (relevant section)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    command: ["--config", "/etc/otel/config.yaml"]
```
Now every service that sends data to port 4317 gets its telemetry routed to the right backend — traces to Tempo, metrics to Mimir, logs to Loki. Switching from Tempo to Jaeger? Change one line in the exporter config. Adding Datadog as a secondary destination? Add another exporter. No application changes.
The Migration Path: Datadog to OpenTelemetry
You don't rip out Datadog on a Friday afternoon. Here's the migration strategy that actually works:
Phase 1: Dual-Ship (Weeks 1-4)
Install the OTel Collector alongside the Datadog agent. Configure your applications to send telemetry to the Collector, and configure the Collector to export to both your new backend and Datadog:
```yaml
exporters:
  otlp/tempo:
    endpoint: tempo:4317
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]   # send to both
```
This gives you side-by-side comparison without risk. Your team still uses Datadog dashboards while you validate the new stack.
Phase 2: Migrate Logs First (Weeks 3-6)
Logs are the biggest cost driver and the easiest signal to migrate. Redirect log output from the Datadog agent to the OTel Collector, exporting to Loki. Build equivalent dashboards in Grafana. Once the team is comfortable, disable Datadog log ingestion.
Expected savings: 40-60% of your Datadog bill, depending on log volume.
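If your services write logs to files rather than emitting them over OTLP, the contrib Collector's filelog receiver can tail the same files the Datadog agent was reading. A minimal sketch, where the paths and parsing are illustrative assumptions:

```yaml
receivers:
  filelog:
    include: [/var/log/containers/*.log]
    operators:
      - type: json_parser        # assumes structured JSON log lines

service:
  pipelines:
    logs:
      receivers: [otlp, filelog]
      processors: [batch]
      exporters: [loki]
```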
Phase 3: Migrate Metrics (Weeks 5-8)
Switch from Datadog custom metrics to Prometheus/Mimir. If you're already running Prometheus exporters (most Kubernetes clusters do), this is mostly a matter of routing. The OTel Collector can scrape Prometheus endpoints and forward to Mimir.
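The Collector's prometheus receiver embeds a standard Prometheus scrape config, so existing exporters keep working unchanged while the data flows to Mimir. A sketch with an illustrative scrape target:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node-exporter        # illustrative target
          scrape_interval: 30s
          static_configs:
            - targets: ["node-exporter:9100"]

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```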
Phase 4: Migrate Traces (Weeks 7-12)
Traces are the hardest to migrate because they require the most dashboard rebuilding. APM dashboards, service maps, and error tracking need to be recreated in Grafana with Tempo as the data source. This is where most of the engineering time goes — not the instrumentation, but the visualization.
Phase 5: Decommission Datadog (Weeks 10-14)
Remove the Datadog agent. Cancel the contract. Keep screenshots of your final Datadog bill for the celebration Slack message.
Total timeline: 3-4 months for a mid-size team. Longer if you have complex custom dashboards or SLO configurations that need migration.
The Competitive Landscape in 2026
It's not just Datadog feeling the pressure. The entire commercial observability market is responding to OpenTelemetry.
The pattern is clear: every vendor now supports OpenTelemetry ingestion. The question is no longer "should we use OTel?" — it's "what backend should we pair with OTel?"
Datadog itself now accepts OpenTelemetry data. That tells you everything about where the industry is heading. When the incumbent builds compatibility with the open standard, the standard has already won.
What Most Articles Get Wrong
Most "OpenTelemetry vs Datadog" articles frame it as a simple cost comparison. It's not. Here's what they miss:
The operational cost of self-hosting. Running Prometheus, Loki, Tempo, and Grafana in production requires engineering time. You need to manage storage, handle upgrades, tune retention policies, and build alerting rules. For a team of 3-5 engineers, the operational overhead of self-hosted observability can consume 10-20% of one engineer's time. That's real cost — potentially $20-40K per year in engineering time.
The talent gap. Your team knows Datadog's UI. They know where to click when production is on fire. Switching to Grafana + Tempo means retraining — and during incidents, familiarity saves minutes that matter. Budget 2-4 weeks for the team to build muscle memory with the new tools.
Datadog's real value is correlation. The reason Datadog charges premium prices is that their platform correlates logs, traces, metrics, and infrastructure data in a single view. When you click on a slow trace, you see the corresponding logs, the host metrics, and the deployment that might have caused it. Rebuilding this correlation in Grafana is possible (Tempo-to-Loki links, exemplars in Prometheus) but it takes effort.
OTel isn't free either. The software is free. The compute is not. Running a self-hosted observability stack for 100+ services requires dedicated infrastructure — expect 8-16 vCPUs and 32-64 GB RAM for the backend components alone, plus storage that grows with retention. On AWS, that's $500-1,500/month in infrastructure. Still dramatically cheaper than Datadog, but not zero.
The Decision Framework
Here's how I'd decide:
Stay with Datadog if:
- Your team is under 20 engineers and observability isn't your core competency
- Your Datadog bill is under $5,000/month (the convenience is worth it at that scale)
- You're in a regulated industry where a SOC 2-certified SaaS vendor simplifies compliance
- You rely heavily on Datadog-specific features like RUM, Synthetics, or their security products
Migrate to OTel + self-hosted backend if:
- Your Datadog bill exceeds $15,000/month and is growing faster than your infrastructure
- You have a platform engineering team that can own the observability stack
- You're running hybrid or multi-cloud and need vendor-neutral instrumentation
- You want to avoid vendor lock-in as a strategic decision
Use OTel + managed backend (Grafana Cloud, New Relic) if:
- You want OTel's portability without the operational burden of self-hosting
- Your Datadog bill is painful but you don't have the team to run infrastructure
- You want a middle ground: open instrumentation, managed storage and visualization
The hybrid approach is increasingly common. Many teams keep Datadog for APM and move logs to Loki, cutting their bill by 40-60% while keeping the most valuable Datadog features.
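In Collector terms, the split model is just per-signal routing. A sketch of keeping APM in Datadog while logs go to Loki (endpoints are illustrative):

```yaml
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]   # keep APM where the team already lives
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]      # the biggest line item moves off Datadog
```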
A Production OTel Setup: What We Actually Run
After migrating from Datadog, here's our production stack for a 60-service Kubernetes cluster:
```
┌─────────────┐     OTLP      ┌────────────────┐
│  Services   │──────────────▶│ OTel Collector │
│  (60 pods)  │   gRPC:4317   │  (3 replicas)  │
└─────────────┘               └────────┬───────┘
                                       │
                     ┌─────────────────┼─────────────────┐
                     │                 │                 │
                     ▼                 ▼                 ▼
              ┌───────────┐     ┌───────────┐     ┌───────────┐
              │   Mimir   │     │   Loki    │     │   Tempo   │
              │ (metrics) │     │  (logs)   │     │ (traces)  │
              │ 3 ingest  │     │ 3 ingest  │     │ 3 ingest  │
              │ 2 store   │     │ 2 store   │     │ 2 store   │
              └─────┬─────┘     └─────┬─────┘     └─────┬─────┘
                    │                 │                 │
                    └─────────────────┼─────────────────┘
                                      │
                               ┌──────▼──────┐
                               │   Grafana   │
                               │ (3 replicas)│
                               └─────────────┘
```
Infrastructure cost: approximately $1,200/month on AWS (EKS). Our Datadog bill for the same workload was $28,000/month. The savings pay for a full-time engineer with money left over.
Key configuration decisions:
```yaml
# Retention policies — balance cost vs. debugging needs
mimir:
  retention: 90d        # metrics: 90 days
loki:
  retention: 30d        # logs: 30 days (most debugging is recent)
tempo:
  retention: 14d        # traces: 14 days (only need recent traces)
  sampling_rate: 0.1    # sample 10% of traces in production
```
Trace sampling is where the real savings come from. Storing 100% of traces is wasteful; you'll never look at the vast majority of them. Tail-based sampling in the Collector (which waits until a trace completes before deciding whether to keep it) gives you the best of both worlds: keep every error and every slow request, sample only 10% of routine traffic, and get low storage costs with guaranteed visibility into problems.
```yaml
# Tail-based sampling in the OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```
This config keeps 100% of error traces, 100% of traces slower than 2 seconds, and 10% of everything else. In practice, this means your storage is dominated by the traces you actually care about.
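You can sanity-check the storage impact of such a policy with simple arithmetic. A sketch, where the error and slow-trace rates are illustrative assumptions rather than measurements:

```python
def retained_fraction(error_rate, slow_rate, base_sample=0.10):
    """Fraction of traces stored: all errors and slow traces are kept,
    plus base_sample of everything else."""
    always_kept = error_rate + slow_rate
    return always_kept + (1 - always_kept) * base_sample

# Assume 1% of traces error and 2% exceed the latency threshold:
frac = retained_fraction(error_rate=0.01, slow_rate=0.02)
print(f"{frac:.1%} of traces stored")  # 12.7% of traces stored
```

Under these assumptions, storage drops roughly 8x while every trace you would actually investigate is kept.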
Monitoring the Monitor
One lesson from running our own observability stack: you need to monitor the monitoring. If your OTel Collector drops spans and nobody notices, your traces have gaps. If Loki's ingestion falls behind, your logs are delayed during the exact moment you need them most.
Essential alerts for a self-hosted stack:
```yaml
groups:
  - name: observability-health
    rules:
      - alert: OTelCollectorDroppedSpans
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        annotations:
          summary: "OTel Collector is dropping spans"
      - alert: LokiIngestionLag
        expr: loki_ingester_wal_replay_duration_seconds > 300
        for: 10m
        annotations:
          summary: "Loki ingestion is lagging"
      - alert: TempoCompactionFailing
        # use increase() so the alert clears once compaction recovers,
        # instead of firing forever after a single historical failure
        expr: increase(tempo_compactor_runs_failed_total[1h]) > 0
        for: 15m
        annotations:
          summary: "Tempo compaction is failing"
      - alert: MimirIngesterUnhealthy
        expr: cortex_ring_members{state="Unhealthy"} > 0
        for: 5m
        annotations:
          summary: "Mimir ingester is unhealthy"
```
Use a separate, lightweight monitoring stack (even a simple Prometheus + Alertmanager) to watch your primary observability infrastructure. Sounds circular, but it's the same principle as having a backup phone number for your alerting system.
What I Actually Think
OpenTelemetry has already won the instrumentation war. Every major observability vendor supports it. The CNCF recommends it in the "adopt" position of their technology radar. 48% of organizations are using it today, and that number is only going one direction.
But I want to be honest: Datadog is still a great product. The UI is polished. The correlation engine is genuinely useful. The onboarding experience is years ahead of anything in the open-source world. If someone handed me a greenfield project with a $10K/month observability budget and a small team, I'd probably still start with Datadog — and instrument with OTel from day one so I could leave later.
The problem is that $10K/month budget doesn't stay at $10K. It grows to $20K, then $40K, and suddenly you're spending more on monitoring your infrastructure than on the infrastructure itself. That's the moment when OpenTelemetry + LGTM stack goes from "nice to have" to "business necessity."
The strongest argument for OpenTelemetry isn't cost — it's portability. When you instrument with OTel, you're making a decision that lasts longer than any vendor contract. The observability market will consolidate. Vendors will raise prices, get acquired, or shut down. Your instrumentation should survive all of that. OTel gives you that insurance.
My prediction for 2027: Datadog will still be the market leader by revenue, but the majority of new instrumentation will be OTel-native. Datadog becomes a premium backend that accepts OTel data — essentially Grafana Cloud's more expensive competitor. The companies that instrumented with proprietary SDKs will wish they hadn't.
The best time to adopt OpenTelemetry was two years ago. The second best time is your next sprint.
Sources:

- CNCF Mid-Year 2025 Project Velocity
- CNCF 2025 Project Velocity Analysis
- Grafana OpenTelemetry Report
- Datadog Q4 2025 Financial Results
- Datadog 2026 Revenue Guidance — Seeking Alpha
- Mordor Intelligence Observability Market Report
- OneUptime — Datadog Bill Shock in 2026
- OpenObserve — Datadog Pricing Explained
- SigNoz — Datadog Pricing Caveats
- SigNoz — OpenTelemetry vs Datadog
- Cloud-Native Observability Stack 2026
- OpenTelemetry Python Getting Started
- OpenTelemetry Python Auto-Instrumentation
- Datadog OpenTelemetry Solutions
- Gartner Observability Platform Reviews
- Coherent Market Insights — Observability Market
- Datadog Market Cap — MacroTrends