Ismat Samadov

© 2026 Ismat Samadov


OpenTelemetry Is Eating Datadog's Lunch — The Open-Source Observability Stack in 2026

Our Datadog bill hit $47K/month. OpenTelemetry + LGTM stack replaced it for $1,200. The instrumentation war is over — OTel won.

Tags: Monitoring, DevOps, Open Source, Infrastructure



Our Datadog bill hit $47,000 in a single month. We were running 120 hosts, ingesting about 2 TB of logs per day, and tracking 80,000 custom metrics across a Kubernetes cluster. Nobody authorized that number. Nobody even predicted it. A noisy logging change in one microservice doubled our log ingestion overnight, and Datadog happily charged us for every byte.

That invoice was the beginning of the end of our Datadog contract — and the beginning of our OpenTelemetry migration. We weren't alone. Across the industry, engineering teams are discovering that the observability vendor they chose three years ago is now their second-largest infrastructure cost after compute itself.

OpenTelemetry isn't just an alternative to Datadog. It's a fundamentally different approach to observability — one where you own your instrumentation, choose your backend, and never get surprised by a bill again. Here's why it's winning, how the stack actually works, and what the migration looks like in practice.

The Numbers Tell the Story

Let's start with what's happening in the market, because the shift is bigger than most people realize.

OpenTelemetry is the second-highest velocity project in the CNCF, behind only Kubernetes. It saw a 39% rise in commits and its contributor base grew from 1,301 to 1,756 in a single year — a 35% increase. The project now has 28,435 total contributors across 5,353 organizations.

On the adoption side, 48% of cloud-native organizations are already using OpenTelemetry, with another 25% planning to implement it soon. More than 61% of organizations believe OpenTelemetry is a "Very Important" or "Critical" enabler of observability.

Meanwhile, Datadog is still growing — revenue hit $3.43 billion in fiscal year 2025, up 28% year-over-year — but there's a tension underneath those numbers. Their guidance for 2026 projects $4.06-$4.10 billion, implying growth deceleration to under 20%. The observability market itself is expected to reach $6.93 billion by 2031, growing at 15.6% CAGR. Datadog is growing faster than the market — for now — but the pricing pressure is real.

Here's the uncomfortable truth for Datadog: their revenue growth is partly driven by the same pricing model that's pushing customers away. Datadog bills are growing 30-50% year-over-year for most teams, not because they're using more features, but because their infrastructure is scaling and Datadog's per-host, per-metric, per-GB pricing scales with it.

Why Datadog Bills Explode

If you've never been surprised by a Datadog invoice, you haven't been using it at scale. The pricing model has layers of complexity that only reveal themselves during traffic spikes or infrastructure growth.

The breakdown:

| Component      | Datadog Pricing                            | What Actually Happens                                                 |
|----------------|--------------------------------------------|-----------------------------------------------------------------------|
| Infrastructure | $15-23/host/month                          | K8s pods count as hosts. Autoscaling = unpredictable costs            |
| Log Management | $0.10/GB ingested + $1.70/million indexed  | One noisy service = thousands in surprise charges                     |
| APM (traces)   | $31/host/month + $1.70/million spans       | Distributed tracing across microservices adds up fast                 |
| Custom Metrics | $0.05/metric/month                         | A K8s cluster with Prometheus exporters easily generates 50K+ metrics |
| Synthetics     | $5-12/test/month                           | Monitoring dozens of endpoints gets expensive                         |

The math gets ugly quickly. A mid-size team running 50 hosts with moderate log and trace volume can easily reach $15,000 to $30,000 per month. A real-world comparison using the OpenTelemetry Demo application showed Datadog costing approximately $174/day versus OpenObserve at approximately $3/day for identical telemetry data — a 58x cost difference.

The worst part: custom metrics are particularly expensive at $0.05 per metric per month. A typical Kubernetes cluster with Prometheus exporters can easily generate 50,000+ custom metrics. That's an extra $2,500/month on top of your host costs — just for metrics you might not even be looking at.
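That arithmetic is worth sanity-checking yourself. Here is a back-of-envelope sketch using the list prices from the table above (illustrative only; it deliberately omits log indexing, APM, and synthetics, which is exactly why real bills outrun naive estimates):

```python
# Back-of-envelope Datadog cost estimate using the list prices quoted above.
# Rates are illustrative; real contracts have discounts and volume tiers.

def estimate_monthly_cost(hosts, log_gb_per_day, custom_metrics,
                          host_rate=23.0,        # $/host/month (Infrastructure)
                          log_ingest_rate=0.10,  # $/GB ingested
                          metric_rate=0.05):     # $/custom metric/month
    logs = log_gb_per_day * 30 * log_ingest_rate
    return hosts * host_rate + logs + custom_metrics * metric_rate

# 120 hosts, 2 TB/day of logs, 80K custom metrics: roughly the scenario above
total = estimate_monthly_cost(120, 2048, 80_000)
print(round(total))  # -> 12904
```

Even this partial estimate lands around $13K/month before indexing, spans, or synthetics are counted, which is how a spike in one line item turns into a $47K invoice.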

Teams are responding in predictable ways. Many are moving to a split model: keep Datadog for the one or two products it does best (usually APM), and move everything else to cheaper alternatives. Logs are usually the first thing to migrate because they're the biggest line item and the easiest to redirect via OpenTelemetry.

What OpenTelemetry Actually Is (And Isn't)

Here's where most articles get it wrong. OpenTelemetry is not a Datadog replacement. It's not a monitoring platform. It's not something you install and get dashboards.

OpenTelemetry is an instrumentation standard — a set of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (traces, metrics, and logs). It standardizes how your applications produce observability data. It does not store that data or visualize it. You still need a backend for that.

Think of it this way:

Your Application
     │
     │  (OpenTelemetry SDK — generates telemetry)
     ▼
OTel Collector
     │
     │  (OTLP protocol — vendor-neutral format)
     ▼
Backend of Your Choice
├── Grafana + Prometheus + Loki + Tempo (self-hosted)
├── Jaeger (traces only)
├── SigNoz (self-hosted full stack)
├── Grafana Cloud (managed)
├── Datadog (yes, it accepts OTel data)
└── Elastic / New Relic / Honeycomb / etc.

The power is in that middle layer. Once your application emits data via OpenTelemetry, switching backends is a configuration change in the Collector — not a re-instrumentation of your entire codebase.

This is the fundamental difference from Datadog's approach. With Datadog, you install the Datadog agent, use the Datadog SDK, send data in Datadog's format, store it on Datadog's platform, and visualize it in Datadog's UI. If you cancel, you lose access to historical data. Your instrumentation is coupled to your vendor.

With OpenTelemetry, your instrumentation is vendor-neutral. Your data is stored in whatever backend you choose, and you can migrate between backends without losing data fidelity. You own the data because you control where it lives.
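Concretely, "migrating backends" is an exporter swap in the Collector config. A sketch, assuming a Jaeger instance that accepts OTLP on its default gRPC port (service names are illustrative):

```yaml
# Same pipeline, different destination: swap Tempo for Jaeger.
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # Jaeger accepts OTLP natively (v1.35+)
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]   # was: [otlp/tempo]
```

Your applications never notice the change; they keep emitting OTLP to the same Collector endpoint.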

The LGTM Stack: Open-Source Observability That Actually Works

The most popular open-source backend for OpenTelemetry data is the LGTM stack — Loki, Grafana, Tempo, Mimir (or Prometheus). Each component handles one telemetry type:

| Component          | Handles              | Replaces                               |
|--------------------|----------------------|----------------------------------------|
| Prometheus / Mimir | Metrics              | Datadog Infrastructure, Custom Metrics |
| Loki               | Logs                 | Datadog Log Management                 |
| Tempo              | Traces               | Datadog APM                            |
| Grafana            | Visualization        | Datadog Dashboards                     |
| OTel Collector     | Collection & routing | Datadog Agent                          |

This modular architecture means you can adopt incrementally. Start with metrics (Prometheus is already everywhere). Add log aggregation with Loki. Add distributed tracing with Tempo. Visualize everything in Grafana. Each piece works independently.

The performance numbers back it up. Integration of Prometheus, Grafana, Loki, and Tempo reduces mean time to resolution (MTTR) by 65% compared to traditional monitoring stacks. Loki helped Paytm Insider save 75% of logging and monitoring costs.

And 78% of enterprises now operate in hybrid cloud environments, where a single vendor's agent often can't cover everything. The open-source stack runs wherever you run — on-prem, any cloud, edge nodes — without per-host licensing.

Setting Up OpenTelemetry: Python in 10 Minutes

Let me show you how fast it is to get started. This is a real setup, not a toy example.

Step 1: Install the SDK

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

The opentelemetry-bootstrap command detects libraries in your environment and installs the matching instrumentation packages automatically. If you're running Flask, SQLAlchemy, and Redis — it finds all three and installs the instrumentation for each.

Step 2: Auto-Instrument Your Application

Zero code changes. Seriously.

OTEL_SERVICE_NAME=my-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py

That's it. Your Flask routes, SQLAlchemy queries, Redis calls, and HTTP client requests are now producing traces, metrics, and logs in OTLP format. The auto-instrumentation works by monkey-patching library functions at runtime to capture telemetry data.

Step 3: Add Custom Spans

Auto-instrumentation catches framework-level operations, but you'll want custom spans for your business logic:

from opentelemetry import trace

tracer = trace.get_tracer("my-service")

@app.route("/checkout")
def checkout():
    with tracer.start_as_current_span("process_checkout") as span:
        span.set_attribute("cart.item_count", len(cart.items))
        span.set_attribute("cart.total", cart.total)

        with tracer.start_as_current_span("validate_inventory"):
            validate_inventory(cart)

        with tracer.start_as_current_span("charge_payment"):
            charge_payment(cart.total)

        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(user)

        return {"status": "confirmed"}

Each start_as_current_span creates a child span in the trace. When you view this in Jaeger or Tempo, you'll see the full waterfall: how long each step took, which one was slow, and where errors occurred.
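When spans cross service boundaries, they're stitched into one trace by IDs carried in the W3C traceparent HTTP header. The SDK propagates this automatically; a stdlib-only sketch of the header format, purely illustrative and not part of the OTel SDK:

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context 'traceparent' value:
    version-traceid-spanid-flags, all lowercase hex."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes  -> 16 hex chars
    flags = "01" if sampled else "00"  # 01 = this trace is sampled
    return f"00-{trace_id}-{span_id}-{flags}"

header = make_traceparent()
version, trace_id, span_id, flags = header.split("-")
```

Every outgoing HTTP call from an instrumented service carries a header like this, which is how Tempo or Jaeger can assemble one waterfall from spans emitted by many services.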

Step 4: Deploy the Collector

The OpenTelemetry Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. Here's a production-ready config:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

Deploy it alongside your services:

# docker-compose.yaml (relevant section)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    command: ["--config", "/etc/otel/config.yaml"]

Now every service that sends data to port 4317 gets its telemetry routed to the right backend — traces to Tempo, metrics to Mimir, logs to Loki. Switching from Tempo to Jaeger? Change one line in the exporter config. Adding Datadog as a secondary destination? Add another exporter. No application changes.

The Migration Path: Datadog to OpenTelemetry

You don't rip out Datadog on a Friday afternoon. Here's the migration strategy that actually works:

Phase 1: Dual-Ship (Weeks 1-4)

Install the OTel Collector alongside the Datadog agent. Configure your applications to send telemetry to the Collector, and configure the Collector to export to both your new backend and Datadog:

exporters:
  otlp/tempo:
    endpoint: tempo:4317
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]  # send to both

This gives you side-by-side comparison without risk. Your team still uses Datadog dashboards while you validate the new stack.

Phase 2: Migrate Logs First (Weeks 3-6)

Logs are the biggest cost driver and the easiest signal to migrate. Redirect log output from the Datadog agent to the OTel Collector, exporting to Loki. Build equivalent dashboards in Grafana. Once the team is comfortable, disable Datadog log ingestion.

Expected savings: 40-60% of your Datadog bill, depending on log volume.

Phase 3: Migrate Metrics (Weeks 5-8)

Switch from Datadog custom metrics to Prometheus/Mimir. If you're already running Prometheus exporters (most Kubernetes clusters do), this is mostly a matter of routing. The OTel Collector can scrape Prometheus endpoints and forward to Mimir.
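A sketch of that routing, assuming the contrib Collector image (job and target names are placeholders):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: app-exporters          # placeholder job name
          scrape_interval: 30s
          static_configs:
            - targets: ["my-app:9090"]     # existing Prometheus exporter

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]        # OTLP push + Prometheus scrape
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]   # forwards to Mimir, as before
```

Exporters you already run keep working unmodified; the Collector just becomes the scraper.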

Phase 4: Migrate Traces (Weeks 7-12)

Traces are the hardest to migrate because they require the most dashboard rebuilding. APM dashboards, service maps, and error tracking need to be recreated in Grafana with Tempo as the data source. This is where most of the engineering time goes — not the instrumentation, but the visualization.

Phase 5: Decommission Datadog (Weeks 10-14)

Remove the Datadog agent. Cancel the contract. Keep screenshots of your final Datadog bill for the celebration Slack message.

Total timeline: 3-4 months for a mid-size team. Longer if you have complex custom dashboards or SLO configurations that need migration.

The Competitive Landscape in 2026

It's not just Datadog feeling the pressure. The entire commercial observability market is responding to OpenTelemetry:

| Vendor         | OTel Support           | Strategy                                                   |
|----------------|------------------------|------------------------------------------------------------|
| Datadog        | Accepts OTel data      | Embrace and extend — accept OTel but add proprietary features |
| New Relic      | Full OTel support      | Pivoted hard to OTel-native, generous free tier            |
| Dynatrace      | OTel ingestion         | 15-time Gartner Leader, Davis AI differentiator            |
| Grafana Cloud  | OTel-native            | Managed LGTM stack, natural OTel destination               |
| Elastic        | OTel Collector support | Converging observability + security                        |
| Splunk (Cisco) | OTel-native            | $28B acquisition validated the market                      |
| SigNoz         | OTel-only              | Built from ground up on OTel, open-source alternative      |
| Honeycomb      | OTel-native            | High-cardinality tracing focus                             |

The pattern is clear: every vendor now supports OpenTelemetry ingestion. The question is no longer "should we use OTel?" — it's "what backend should we pair with OTel?"

Datadog itself now accepts OpenTelemetry data. That tells you everything about where the industry is heading. When the incumbent builds compatibility with the open standard, the standard has already won.

What Most Articles Get Wrong

Most "OpenTelemetry vs Datadog" articles frame it as a simple cost comparison. It's not. Here's what they miss:

The operational cost of self-hosting. Running Prometheus, Loki, Tempo, and Grafana in production requires engineering time. You need to manage storage, handle upgrades, tune retention policies, and build alerting rules. For a team of 3-5 engineers, the operational overhead of self-hosted observability can consume 10-20% of one engineer's time. That's real cost — potentially $20-40K per year in engineering time.

The talent gap. Your team knows Datadog's UI. They know where to click when production is on fire. Switching to Grafana + Tempo means retraining — and during incidents, familiarity saves minutes that matter. Budget 2-4 weeks for the team to build muscle memory with the new tools.

Datadog's real value is correlation. The reason Datadog charges premium prices is that their platform correlates logs, traces, metrics, and infrastructure data in a single view. When you click on a slow trace, you see the corresponding logs, the host metrics, and the deployment that might have caused it. Rebuilding this correlation in Grafana is possible (Tempo-to-Loki links, exemplars in Prometheus) but it takes effort.

OTel isn't free either. The software is free. The compute is not. Running a self-hosted observability stack for 100+ services requires dedicated infrastructure — expect 8-16 vCPUs and 32-64 GB RAM for the backend components alone, plus storage that grows with retention. On AWS, that's $500-1,500/month in infrastructure. Still dramatically cheaper than Datadog, but not zero.

The Decision Framework

Here's how I'd decide:

Stay with Datadog if:

  • Your team is under 20 engineers and observability isn't your core competency
  • Your Datadog bill is under $5,000/month (the convenience is worth it at that scale)
  • You're in a regulated industry where a SOC 2-certified SaaS vendor simplifies compliance
  • You rely heavily on Datadog-specific features like RUM, Synthetics, or their security products

Migrate to OTel + self-hosted backend if:

  • Your Datadog bill exceeds $15,000/month and is growing faster than your infrastructure
  • You have a platform engineering team that can own the observability stack
  • You're running hybrid or multi-cloud and need vendor-neutral instrumentation
  • You want to avoid vendor lock-in as a strategic decision

Use OTel + managed backend (Grafana Cloud, New Relic) if:

  • You want OTel's portability without the operational burden of self-hosting
  • Your Datadog bill is painful but you don't have the team to run infrastructure
  • You want a middle ground: open instrumentation, managed storage and visualization

The hybrid approach is increasingly common. Many teams keep Datadog for APM and move logs to Loki, cutting their bill by 40-60% while keeping the most valuable Datadog features.
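In Collector terms, that split is just two pipelines with different exporters. A sketch using the datadog and loki exporters from the contrib image:

```yaml
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:              # keep the most valuable Datadog feature: APM
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
    logs:                # the biggest line item moves to Loki
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```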

A Production OTel Setup: What We Actually Run

After migrating from Datadog, here's our production stack for a 60-service Kubernetes cluster:

┌─────────────┐     OTLP      ┌───────────────┐
│  Services   │──────────────▶│  OTel Collector│
│  (60 pods)  │   gRPC:4317   │  (3 replicas)  │
└─────────────┘               └───────┬────────┘
                                      │
                    ┌─────────────────┼─────────────────┐
                    │                 │                   │
                    ▼                 ▼                   ▼
             ┌───────────┐    ┌────────────┐     ┌────────────┐
             │  Mimir     │    │   Loki     │     │   Tempo    │
             │  (metrics) │    │   (logs)   │     │  (traces)  │
             │  3 ingest  │    │  3 ingest  │     │  3 ingest  │
             │  2 store   │    │  2 store   │     │  2 store   │
             └─────┬──────┘    └─────┬──────┘     └─────┬──────┘
                   │                 │                   │
                   └─────────────────┼───────────────────┘
                                     │
                              ┌──────▼──────┐
                              │   Grafana   │
                              │  (3 replicas)│
                              └─────────────┘

Infrastructure cost: approximately $1,200/month on AWS (EKS). Our steady-state Datadog bill for the same workload was $28,000/month (the $47K month was an outlier driven by the log spike). The savings pay for a full-time engineer with money left over.

Key configuration decisions:

# Retention policies — balance cost vs. debugging needs
mimir:
  retention: 90d        # metrics: 90 days
loki:
  retention: 30d        # logs: 30 days (most debugging is recent)
tempo:
  retention: 14d        # traces: 14 days (only need recent traces)
  sampling_rate: 0.1    # sample 10% of traces (enforced upstream in the SDK/Collector, not by Tempo itself)

Trace sampling is where the real savings come from. Sending 100% of traces to your backend is wasteful — you'll never look at 90% of them. A 10% head-based sampling rate with tail-based sampling for errors and slow requests gives you the best of both worlds: low storage costs and guaranteed visibility into problems.

# Tail-based sampling in the OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

This config keeps 100% of error traces, 100% of traces slower than 2 seconds, and 10% of everything else. In practice, this means your storage is dominated by the traces you actually care about.
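You can estimate the retained volume for a policy like this. A quick sketch with hypothetical error and latency rates (the 1% and 2% figures are assumptions, not measurements):

```python
def retained_fraction(error_rate, slow_rate, base_sample):
    """Fraction of traces kept under tail sampling:
    all errors + all slow traces + a probabilistic slice of the rest.
    Assumes the error and slow populations don't overlap (worst case
    for storage)."""
    rest = 1.0 - error_rate - slow_rate
    return error_rate + slow_rate + rest * base_sample

# Hypothetical workload: 1% errors, 2% slow (>2s), 10% baseline sampling
kept = retained_fraction(0.01, 0.02, 0.10)
print(f"{kept:.3f}")  # ~0.127, i.e. roughly 13% of traces stored
```

So with those assumed rates you'd store about an eighth of your trace volume, and nearly a quarter of what you keep is the error and slow traffic you actually debug.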

Monitoring the Monitor

One lesson from running our own observability stack: you need to monitor the monitoring. If your OTel Collector drops spans and nobody notices, your traces have gaps. If Loki's ingestion falls behind, your logs are delayed during the exact moment you need them most.

Essential alerts for a self-hosted stack:

groups:
  - name: observability-health
    rules:
      - alert: OTelCollectorDroppedSpans
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        annotations:
          summary: "OTel Collector is dropping spans"

      - alert: LokiIngestionLag
        expr: loki_ingester_wal_replay_duration_seconds > 300
        for: 10m
        annotations:
          summary: "Loki ingestion is lagging"

      - alert: TempoCompactionFailing
        expr: tempo_compactor_runs_failed_total > 0
        for: 15m
        annotations:
          summary: "Tempo compaction is failing"

      - alert: MimirIngesterUnhealthy
        expr: cortex_ring_members{state="Unhealthy"} > 0
        for: 5m
        annotations:
          summary: "Mimir ingester is unhealthy"

Use a separate, lightweight monitoring stack (even a simple Prometheus + Alertmanager) to watch your primary observability infrastructure. Sounds circular, but it's the same principle as having a backup phone number for your alerting system.

What I Actually Think

OpenTelemetry has already won the instrumentation war. Every major observability vendor supports it. The CNCF recommends it in the "adopt" position of their technology radar. 48% of organizations are using it today, and that number is only going one direction.

But I want to be honest: Datadog is still a great product. The UI is polished. The correlation engine is genuinely useful. The onboarding experience is years ahead of anything in the open-source world. If someone handed me a greenfield project with a $10K/month observability budget and a small team, I'd probably still start with Datadog — and instrument with OTel from day one so I could leave later.

The problem is that $10K/month budget doesn't stay at $10K. It grows to $20K, then $40K, and suddenly you're spending more on monitoring your infrastructure than on the infrastructure itself. That's the moment when OpenTelemetry + LGTM stack goes from "nice to have" to "business necessity."

The strongest argument for OpenTelemetry isn't cost — it's portability. When you instrument with OTel, you're making a decision that lasts longer than any vendor contract. The observability market will consolidate. Vendors will raise prices, get acquired, or shut down. Your instrumentation should survive all of that. OTel gives you that insurance.

My prediction for 2027: Datadog will still be the market leader by revenue, but the majority of new instrumentation will be OTel-native. Datadog becomes a premium backend that accepts OTel data — essentially Grafana Cloud's more expensive competitor. The companies that instrumented with proprietary SDKs will wish they hadn't.

The best time to adopt OpenTelemetry was two years ago. The second best time is your next sprint.


Sources: CNCF Mid-Year 2025 Project Velocity, CNCF 2025 Project Velocity Analysis, Grafana OpenTelemetry Report, Datadog Q4 2025 Financial Results, Datadog 2026 Revenue Guidance — Seeking Alpha, Mordor Intelligence Observability Market Report, OneUptime — Datadog Bill Shock in 2026, OpenObserve — Datadog Pricing Explained, SigNoz — Datadog Pricing Caveats, SigNoz — OpenTelemetry vs Datadog, Cloud-Native Observability Stack 2026, OpenTelemetry Python Getting Started, OpenTelemetry Python Auto-Instrumentation, Datadog OpenTelemetry Solutions, Gartner Observability Platform Reviews, Coherent Market Insights — Observability Market, Datadog Market Cap — MacroTrends.