Our batch ETL pipeline ran every 15 minutes. That sounds reasonable until a customer changed their subscription tier and spent 14 minutes seeing the wrong price. Support got the ticket. Engineering got the blame. The pipeline was "working correctly."
That was the moment I stopped believing in batch.
Change Data Capture (CDC) replaced our entire ETL pipeline — not incrementally, not as an experiment, but wholesale. Debezium reading the Postgres write-ahead log, streaming every INSERT, UPDATE, and DELETE to Kafka in real time. No more scheduling. No more "it'll sync on the next run." No more praying that the batch window finishes before the next one starts.
This is the practical guide I wish I had before we made the switch. What CDC actually is, how Debezium works with Postgres, what broke along the way, and why I think batch ETL is dying for most use cases.
What Batch ETL Actually Costs You
Every team that runs batch ETL knows the pain, even if they've normalized it.
Your pipeline runs on a schedule — every 5 minutes, every 15 minutes, every hour. Between runs, your downstream systems are lying. They're showing stale data and calling it truth. For dashboards, that's annoying. For billing, pricing, or inventory, it's a business problem.
Here's what batch ETL looked like for us:
Airflow DAG: runs every 15 minutes
├── Extract: SELECT * FROM orders WHERE updated_at > last_run
├── Transform: normalize, deduplicate, join with users table
├── Load: UPSERT into analytics warehouse
└── Duration: 3-8 minutes depending on volume
The problems compound:
Missed updates. If your extraction query uses updated_at > last_run, you miss rows that were updated during the previous extraction window. You need watermarking, and watermarking needs overlap, and overlap means processing the same rows twice.
Schema drift. Someone adds a column on Tuesday. The pipeline doesn't know until it breaks on Wednesday's run. Now you're backfilling three days of data at 2 AM.
Ordering problems. A customer updates their address, then updates it again. Batch might process these out of order, or collapse them into one update and lose the intermediate state.
Scaling walls. As your data grows, the extraction query gets slower. You add parallelism. Now you need coordination. Your "simple" pipeline is a distributed system you didn't sign up to build.
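The watermark problem is easy to demonstrate in a few lines. Here's a hedged sketch (table and field names are illustrative) of why a strict updated_at > last_run filter drops late-committing rows, and why the overlap fix forces you into idempotent loads:

```python
# Sketch of the missed-update failure mode. A row committed while the
# previous extraction was running can carry an updated_at earlier than the
# watermark you recorded, so a strict `>` filter skips it. The usual fix,
# re-reading an overlap window, brings the same rows back twice, which is
# why the loader must be an upsert rather than an insert.

def extract(rows, watermark, overlap_seconds=0):
    """Simulates SELECT * FROM orders WHERE updated_at > watermark - overlap."""
    cutoff = watermark - overlap_seconds
    return [r for r in rows if r["updated_at"] > cutoff]

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 118},  # committed late: timestamped before the
                                   # watermark the previous run recorded
]

# Run 1 finished at t=120 and recorded watermark=120; row 2 slipped in late.
assert extract(rows, watermark=120) == []  # row 2 is silently skipped

# A 30-second overlap recovers row 2, but row 1 also reappears on reruns,
# so downstream needs deduplication or idempotent upserts.
recovered = extract(rows, watermark=120, overlap_seconds=30)
assert [r["id"] for r in recovered] == [1, 2]
```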
Fivetran's 2025 data integration report found that enterprises with batch-only pipelines spend an average of $15 million per year on costs attributable to delayed data decisions: downstream errors, manual reconciliation, and incident response. The data integration market is projected to reach $15.24 billion by 2026, driven largely by organizations moving away from exactly this kind of architecture.
CDC eliminates the scheduling problem entirely. Instead of asking "what changed since my last run?", you listen to every change as it happens.
How Change Data Capture Actually Works
CDC isn't a product — it's a pattern. The idea is simple: instead of polling your database for changes, you read the database's own change log.
Every serious database maintains a log of every write operation. Postgres calls it the Write-Ahead Log (WAL). MySQL calls it the binlog. MongoDB has the oplog. These logs exist for crash recovery and replication, but CDC hijacks them for data integration.
The flow looks like this:
Application writes to Postgres
│
▼
Postgres writes to WAL (write-ahead log)
│
▼
Debezium reads WAL via logical replication slot
│
▼
Change event published to Kafka topic
│
▼
Consumers process events (warehouse, cache, search index)
Each change event contains:
- Before state — what the row looked like before the change (for UPDATEs and DELETEs)
- After state — what the row looks like now (for INSERTs and UPDATEs)
- Operation type — INSERT, UPDATE, DELETE, or TRUNCATE
- Source metadata — transaction ID, timestamp, schema name, table name, LSN position
This is fundamentally different from application-level event sourcing. You don't need to modify your application code. You don't need to publish events from your service layer. The database is already recording every change — CDC just makes those changes available to other systems.
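To make the event shape concrete, here's a minimal sketch of applying before/after/op events to an in-memory table. The id primary-key field is an assumption for illustration; real consumers would key on the Kafka record key:

```python
def apply_change(state, event):
    """Apply a CDC-style change event to a dict keyed by primary key.
    Mirrors the before/after/op shape described above."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: use `after`
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":            # delete: only `before` is populated
        del state[event["before"]["id"]]
    return state

state = {}
apply_change(state, {"op": "c", "before": None, "after": {"id": 1, "status": "new"}})
apply_change(state, {"op": "u", "before": {"id": 1, "status": "new"},
                     "after": {"id": 1, "status": "paid"}})
apply_change(state, {"op": "d", "before": {"id": 1, "status": "paid"}, "after": None})
assert state == {}  # create, update, delete round-trips to empty
```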
Why Debezium Won
There are multiple CDC tools available: Debezium, Fivetran, Airbyte, Estuary Flow, Streamkap. But Debezium dominates the open-source CDC space, and for good reason.
Debezium runs as a Kafka Connect source connector. It reads the WAL, converts each change into a structured JSON event, and publishes it to Kafka topics — one topic per table by default. It handles snapshotting (the initial full read of existing data), schema changes, and exactly-once semantics with the right configuration.
Here's what makes Debezium the default choice:
It's truly real-time. Most "CDC" tools (Fivetran, Airbyte) are actually micro-batch — they poll for changes every few minutes. Debezium streams continuously from the WAL. The latency difference is seconds versus minutes.
It captures everything. Because it reads the WAL, it sees every change including those made by database triggers, stored procedures, or direct SQL. Application-level CDC misses anything that bypasses your application layer.
It preserves ordering. Changes arrive in the same order they were committed. Your downstream consumers see a consistent view of the data.
The connector ecosystem is massive. Postgres, MySQL, MongoDB, SQL Server, Oracle, Cassandra, Db2. The Postgres connector alone handles logical decoding, pgoutput, and decoderbufs.
One important note: Confluent deprecated their own Debezium-based connector in late 2025, with EOL set for March 2026. The community Debezium connector is now the standard path — Confluent actually recommends it in their migration guide. This is a rare case where the open-source version won against the commercial fork.
Setting Up CDC: Postgres to Kafka with Debezium
Let me walk through the actual setup. This isn't a toy example — it's close to what we run in production.
Step 1: Configure Postgres for Logical Replication
First, your Postgres instance needs wal_level set to logical. On managed services like RDS or Cloud SQL, this is a parameter group change. On self-hosted:
-- postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
-- Restart required after changing wal_level
Create a dedicated replication user:
CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'your-secure-password';
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO cdc_user;
Verify logical replication is active:
SHOW wal_level;
-- Should return: logical
SELECT * FROM pg_replication_slots;
-- Empty initially, Debezium will create its slot
Step 2: Deploy Kafka Connect with Debezium
We run Kafka Connect as a Docker container. Here's the compose file:
version: '3.8'
services:
  kafka-connect:
    image: debezium/connect:2.5
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect-cluster
      CONFIG_STORAGE_TOPIC: cdc_connect_configs
      OFFSET_STORAGE_TOPIC: cdc_connect_offsets
      STATUS_STORAGE_TOPIC: cdc_connect_statuses
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    depends_on:
      - kafka
Step 3: Register the Debezium Connector
POST the connector configuration to Kafka Connect's REST API:
{
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "your-secure-password",
    "database.dbname": "production",
    "topic.prefix": "cdc",
    "schema.include.list": "public",
    "table.include.list": "public.orders,public.users,public.subscriptions",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "publication.name": "cdc_publication",
    "snapshot.mode": "initial",
    "tombstones.on.delete": "true",
    "decimal.handling.mode": "string",
    "time.precision.mode": "connect"
  }
}
Key configuration choices:
plugin.name: pgoutput — Use the built-in Postgres logical decoding plugin. No extensions needed.
snapshot.mode: initial — Takes a full snapshot of existing data on first start, then switches to streaming.
tombstones.on.delete: true — Sends a null-value tombstone record after DELETE events, which Kafka log compaction uses to clean up.
decimal.handling.mode: string — Prevents floating-point precision loss for monetary values.
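The tombstone choice matters on the consumer side too: after every DELETE, consumers see a second record with the same key and a null value. A minimal sketch (simplified event shape, over already-decoded messages) of applying the d event and skipping the tombstone:

```python
# With tombstones.on.delete enabled, each DELETE produces two Kafka records:
# the "d" change event (whose value carries `before`), then a tombstone with
# the same key and a null value. Act on the "d" event; skip tombstones, which
# exist for Kafka log compaction, not for your consumer.

import json

def process(messages, table):
    """messages: list of (key_bytes, value_bytes_or_None) pairs."""
    for key, value in messages:
        if value is None:
            continue  # tombstone: compaction bookkeeping, nothing to apply
        event = json.loads(value)
        if event["op"] == "d":
            table.pop(json.loads(key)["id"], None)
        else:
            row = event["after"]
            table[row["id"]] = row
    return table

msgs = [
    (b'{"id": 7}', b'{"op": "c", "after": {"id": 7, "status": "new"}}'),
    (b'{"id": 7}', b'{"op": "d", "before": {"id": 7, "status": "new"}}'),
    (b'{"id": 7}', None),  # tombstone emitted after the delete
]
assert process(msgs, {}) == {}
```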
Step 4: Verify It's Working
# Check connector status
curl -s localhost:8083/connectors/orders-cdc-connector/status | jq .
# List created topics
kafka-topics --bootstrap-server kafka:9092 --list | grep cdc
# Watch events in real time
kafka-console-consumer \
--bootstrap-server kafka:9092 \
--topic cdc.public.orders \
--from-beginning \
--max-messages 5
A typical change event looks like this:
{
  "before": null,
  "after": {
    "id": 42,
    "customer_id": 1001,
    "total": "149.99",
    "status": "confirmed",
    "created_at": 1714435200000
  },
  "source": {
    "version": "2.5.0",
    "connector": "postgresql",
    "name": "cdc",
    "ts_ms": 1714435200123,
    "db": "production",
    "schema": "public",
    "table": "orders",
    "txId": 98765,
    "lsn": 234567890
  },
  "op": "c",
  "ts_ms": 1714435200456
}
The op field tells you the operation: c for create, u for update, d for delete, r for read (during snapshot).
The Snapshot Problem Nobody Warns You About
The initial snapshot is where most CDC deployments stall.
When Debezium starts for the first time with snapshot.mode: initial, it reads every row from every table in your include list. For a 50 GB orders table, this means Debezium is essentially running SELECT * FROM orders — and it holds a transaction open for the entire duration.
On our production database with 80 million rows in the orders table, the initial snapshot took 4 hours. During those 4 hours:
WAL accumulation. Postgres can't reclaim WAL segments referenced by the replication slot until the snapshot completes. Our WAL storage grew from 2 GB to 47 GB. If you're on a managed service with limited storage, this can fill your disk and crash the database.
Lock contention. Debezium takes an ACCESS SHARE lock during snapshotting. This doesn't block writes, but it does block DDL operations. No schema migrations during a snapshot.
Memory pressure. All those rows flow through Kafka Connect's memory. We had to bump the JVM heap to 4 GB to prevent OOM kills.
The mitigation strategies we used:
{
  "snapshot.mode": "initial",
  "snapshot.fetch.size": "10240",
  "snapshot.max.threads": "2",
  "heartbeat.interval.ms": "30000"
}
snapshot.fetch.size controls the JDBC fetch size. Smaller values reduce memory pressure but increase snapshot duration.
snapshot.max.threads parallelizes the snapshot across tables.
heartbeat.interval.ms sends periodic heartbeat events that advance the replication slot position, preventing some WAL accumulation scenarios.
For truly massive tables, consider snapshot.mode: schema_only — it captures the schema but skips existing data, only streaming new changes. You can backfill historical data through a separate batch process.
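If you take the schema_only route, the backfill and the live stream race each other, so the merge has to be version-aware. A sketch under the assumption that rows carry a usable updated_at column (comparing on Debezium's source LSN works just as well):

```python
# Merging a batch backfill with a live CDC stream without clobbering newer
# data: upsert by primary key, but only when the incoming row is at least as
# new as what's already materialized. `updated_at` as the version column is
# an assumption for illustration.

def upsert_if_newer(store, row):
    current = store.get(row["id"])
    if current is None or row["updated_at"] >= current["updated_at"]:
        store[row["id"]] = row
    return store

store = {}
# The live stream delivers a fresh update first...
upsert_if_newer(store, {"id": 1, "status": "shipped", "updated_at": 200})
# ...then the backfill replays an older copy of the same row.
upsert_if_newer(store, {"id": 1, "status": "pending", "updated_at": 150})
assert store[1]["status"] == "shipped"  # backfill did not regress the row
```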
Consumer Patterns That Actually Work
Getting data into Kafka is half the battle. The other half is building consumers that handle the CDC event stream correctly.
Pattern 1: The Upsert Consumer
The most common pattern — materializing CDC events into a downstream database or warehouse:
from confluent_kafka import Consumer
import json
import psycopg2

WAREHOUSE_DSN = "postgresql://warehouse-host/analytics"  # placeholder DSN

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'warehouse-sync',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,
})
consumer.subscribe(['cdc.public.orders'])
conn = psycopg2.connect(dsn=WAREHOUSE_DSN)

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.value() is None:  # tombstone record: nothing to apply
        consumer.commit(message=msg)
        continue

    event = json.loads(msg.value())
    op = event['op']

    if op in ('c', 'r', 'u'):  # create, read (snapshot), update
        row = event['after']
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO orders (id, customer_id, total, status)
                VALUES (%(id)s, %(customer_id)s, %(total)s, %(status)s)
                ON CONFLICT (id) DO UPDATE SET
                    customer_id = EXCLUDED.customer_id,
                    total = EXCLUDED.total,
                    status = EXCLUDED.status
            """, row)
    elif op == 'd':  # delete
        row = event['before']
        with conn.cursor() as cur:
            cur.execute("DELETE FROM orders WHERE id = %(id)s", row)

    conn.commit()
    consumer.commit(message=msg)
Pattern 2: The Cache Invalidation Consumer
CDC is the best cache invalidation strategy I've ever used. Instead of TTL-based expiry or manual invalidation, you invalidate exactly when data changes:
import redis

redis_client = redis.Redis(host='redis', port=6379)

def handle_cdc_event(event):
    table = event['source']['table']

    if table == 'products':
        product_id = (event['after'] or event['before'])['id']
        # Invalidate the specific product cache
        redis_client.delete(f"product:{product_id}")
        # Invalidate any list caches that might include this product
        redis_client.delete("products:featured")
        if event['after']:
            redis_client.delete(f"products:category:{event['after']['category_id']}")
    elif table == 'users':
        user_id = (event['after'] or event['before'])['id']
        redis_client.delete(f"user:{user_id}")
        redis_client.delete(f"user:session:{user_id}")
No more stale caches. No more "have you tried clearing the cache?" No more cache-database inconsistency that takes 15 minutes to resolve itself.
Pattern 3: The Search Index Synchronizer
Keeping Elasticsearch or Typesense in sync with your primary database is a nightmare with batch ETL. With CDC:
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://elasticsearch:9200'])

def sync_to_search(event):
    if event['source']['table'] != 'products':
        return

    op = event['op']
    if op in ('c', 'r', 'u'):
        product = event['after']
        es.index(
            index='products',
            id=product['id'],
            document={
                'name': product['name'],
                'description': product['description'],
                'price': float(product['price']),
                'category': product['category'],
                'updated_at': product['updated_at'],
            }
        )
    elif op == 'd':
        es.delete(index='products', id=event['before']['id'], ignore=[404])
Your search index stays within seconds of your primary database, automatically.
Production Gotchas That Will Bite You
After running CDC in production for over a year, here's what actually went wrong:
Replication Slot Growth
If your consumer stops processing — Kafka Connect crash, network partition, deployment gone wrong — the replication slot keeps accumulating WAL. Postgres won't reclaim it. I've seen replication slots grow to consume 200 GB of disk in 6 hours during an outage.
Set up monitoring and alerts:
SELECT
    slot_name,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal_size,
    active
FROM pg_replication_slots;
Add a dead man's switch: if retained WAL exceeds a threshold, page someone. On Postgres 13 and later (exposed as a parameter on managed services like RDS), you can also set max_slot_wal_keep_size as a hard cap, and Postgres will invalidate any slot that retains more WAL than the limit. An invalidated slot forces a full re-snapshot, so treat it as a last line of defense, not a substitute for alerting.
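Here's roughly what our dead man's switch looks like, reduced to the decision logic. The query mirrors the SQL above; the paging hook and connection handling are left out:

```python
# Poll pg_replication_slots, compute retained WAL per slot, and flag any slot
# over the threshold. Inactive slots are the usual culprits, but an active
# slot that can't keep up with write throughput qualifies too.

RETAINED_WAL_QUERY = """
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes,
           active
    FROM pg_replication_slots
"""

def slots_over_threshold(rows, max_bytes=5 * 1024**3):
    """rows: (slot_name, retained_bytes, active) tuples from the query above.
    Returns names of slots retaining more WAL than max_bytes."""
    return [name for name, retained, _active in rows if retained > max_bytes]

# With the default 5 GB threshold, a stuck slot retaining 12 GB should page:
rows = [
    ("debezium_orders", 12 * 1024**3, False),
    ("healthy_slot", 10 * 1024**2, True),
]
assert slots_over_threshold(rows) == ["debezium_orders"]
```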
Schema Evolution
When you ALTER TABLE in production, Debezium needs to handle it. With pgoutput, Debezium detects schema changes through the replication stream. But your consumers need to handle it too.
The safest approach is using Avro with a Schema Registry instead of JSON:
{
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://schema-registry:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
Schema Registry enforces compatibility rules — by default, backward compatibility — so consumers using the old schema can still read events with the new schema.
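Compatibility rules protect the wire format, but consumer code still has to tolerate both shapes during a rollout: replayed old offsets lack the new column, fresh events carry it. A sketch with a hypothetical discount_code column added after the consumer shipped:

```python
# Defensive reads for schema evolution: required fields accessed directly,
# later-added optional fields read with .get() so both old and new event
# shapes are safe. `discount_code` is an illustrative added column.

def extract_order(after):
    return {
        "id": after["id"],            # primary key: always present
        "total": after["total"],
        "discount_code": after.get("discount_code"),  # added later; may be absent
    }

old_event = {"id": 1, "total": "99.00"}                             # pre-migration
new_event = {"id": 2, "total": "49.00", "discount_code": "SPRING"}  # post-migration
assert extract_order(old_event)["discount_code"] is None
assert extract_order(new_event)["discount_code"] == "SPRING"
```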
The 7,000 Events/Second Ceiling
A single Debezium connector task maxes out around 7,000 events per second in our benchmarks. For most applications, this is fine. But if you're capturing changes from a high-throughput OLTP system, you'll hit this ceiling.
Debezium's Postgres connector doesn't support task parallelism within a single connector (because WAL is a single stream). To scale beyond this, you need:
- Multiple databases with separate connectors
- Partitioned source tables with the newer Debezium parallel snapshot features
- Or accept the throughput limit and size your Kafka partitions to handle the consumer side
Transaction Boundaries
By default, Debezium publishes each row change as an independent event. If a transaction updates 5 rows, you get 5 events. This can cause consistency issues if a consumer processes event 3 of 5 and crashes.
Debezium 2.x added transaction metadata support:
{
  "provide.transaction.metadata": "true",
  "transaction.topic": "cdc.transaction"
}
This publishes BEGIN and END markers to a transaction topic, letting consumers correlate events that belong to the same transaction and process them atomically.
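A minimal sketch of what a transaction-aware consumer does with those markers: buffer data events per transaction id and hand the whole batch to the sink only when END arrives. The event shapes here are simplified; real markers also carry per-table event counts you can use to detect gaps:

```python
# Buffer data events by transaction id; apply atomically on END. A crash
# before END loses only buffered (uncommitted-to-sink) events, which will be
# re-delivered on restart, so no partial transaction is ever applied.

def run(stream, apply_batch):
    pending = {}  # tx_id -> buffered data events
    for item in stream:
        if item.get("status") == "BEGIN":
            pending[item["id"]] = []
        elif item.get("status") == "END":
            apply_batch(pending.pop(item["id"]))  # apply the batch atomically
        else:  # data event: carries a `transaction` block when metadata is on
            pending.setdefault(item["transaction"]["id"], []).append(item)

applied = []
run(
    [
        {"status": "BEGIN", "id": "tx-1"},
        {"op": "u", "transaction": {"id": "tx-1"}, "after": {"id": 1}},
        {"op": "u", "transaction": {"id": "tx-1"}, "after": {"id": 2}},
        {"status": "END", "id": "tx-1"},
    ],
    applied.append,
)
assert len(applied) == 1 and len(applied[0]) == 2  # both rows land together
```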
TOAST Columns
Postgres stores large column values (over 2 KB) in TOAST tables. When a row is updated but the TOASTed column isn't changed, the WAL entry doesn't include the TOAST value. Debezium receives a placeholder instead.
This means if you UPDATE a row's status column but don't touch its description column (which is TOASTed), the CDC event will have description: null or a special unavailable marker in the after state.
The workaround is REPLICA IDENTITY FULL:
ALTER TABLE products REPLICA IDENTITY FULL;
This forces Postgres to include all column values in WAL entries, at the cost of larger WAL volume. Apply it selectively — only on tables where you need complete row data in CDC events.
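If you can't afford REPLICA IDENTITY FULL everywhere, the consumer can compensate by never overwriting real data with the placeholder. The placeholder string below is the default for Debezium's unavailable.value.placeholder setting; verify it against your version:

```python
# When an unchanged TOASTed column arrives as Debezium's placeholder, fall
# back to the last state you materialized instead of writing the placeholder
# into your target. Assumes you keep (or can read back) the previous row.

PLACEHOLDER = "__debezium_unavailable_value"

def merge_row(existing, after):
    merged = dict(after)
    for col, val in after.items():
        if val == PLACEHOLDER and existing is not None:
            merged[col] = existing.get(col)  # keep the last known value
    return merged

existing = {"id": 1, "status": "active", "description": "a very long text"}
after = {"id": 1, "status": "archived", "description": PLACEHOLDER}
merged = merge_row(existing, after)
assert merged["status"] == "archived"                # changed column applied
assert merged["description"] == "a very long text"  # TOAST gap filled
```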
CDC vs Batch ETL: When Each Makes Sense
I said batch ETL is dying, but I should be precise about what I mean. Batch isn't dead — it's inappropriate as a default.
Use CDC when:
- Downstream latency matters (real-time dashboards, cache invalidation, search sync)
- You need event-by-event processing (audit trails, compliance logging)
- You're syncing between operational systems (service A needs to react to changes in service B)
- You want to decouple your services without building explicit event publishing into every write path
- Your data volume is continuously growing and batch windows are getting tight
Stick with batch when:
- You're doing heavy analytical transformations that need the full dataset (ML training, complex aggregations)
- Your source system doesn't support log-based CDC (some legacy databases, SaaS APIs)
- Latency genuinely doesn't matter (weekly reports, monthly reconciliation)
- You need complex multi-table joins during extraction — CDC gives you single-table streams
The sweet spot for most teams is a hybrid architecture: CDC for operational data synchronization and event-driven flows, batch for heavy analytical workloads that consume the CDC-populated warehouse.
The Architecture We Landed On
After ripping out batch ETL, here's what our production architecture looks like:
┌─────────────┐ WAL ┌───────────┐ Kafka ┌──────────────┐
│ Postgres │────────────▶│ Debezium │────────────▶│ Kafka │
│ (primary) │ logical │ Connect │ topics per │ Cluster │
│ │ replication│ │ table │ │
└─────────────┘ └───────────┘ └──────┬───────┘
│
┌────────────────────────────────┼────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌────────────┐ ┌─────────┐ ┌────────┐ ┌──────┐
│ Warehouse │ │ Redis │ │ Elastic│ │Audit │
│ (ClickHouse│ │ Cache │ │ Search │ │ Log │
│ sink) │ │ Inval. │ │ Sync │ │ │
└────────────┘ └─────────┘ └────────┘ └──────┘
Four consumers, one source of truth. When an order is confirmed in Postgres:
- The warehouse gets it in under 3 seconds (was 15 minutes with batch)
- The product availability cache invalidates immediately (was TTL-based, 5-minute staleness)
- The order appears in search within 2 seconds (was next batch run)
- The audit trail captures the exact change with before/after state (was reconstructed from snapshots, often inaccurate)
Total events processed: roughly 2,400 per second at peak, well within the single-connector throughput limit. End-to-end latency from Postgres commit to all four consumers processing the event: median 1.2 seconds, p99 4.8 seconds.
Monitoring Your CDC Pipeline
CDC pipelines fail silently if you're not watching. Here's the minimum monitoring setup:
# Prometheus metrics to alert on
alerts:
  - name: debezium_replication_lag
    expr: debezium_metrics_MilliSecondsBehindSource > 30000
    message: "CDC lag exceeds 30 seconds"
  - name: replication_slot_wal_growth
    expr: pg_replication_slot_wal_bytes > 5368709120  # 5 GB
    message: "Replication slot retaining too much WAL"
  - name: connector_status
    expr: kafka_connect_connector_status != 1
    message: "Debezium connector is not running"
  - name: consumer_lag
    expr: kafka_consumer_group_lag > 100000
    message: "CDC consumer falling behind"
The most important metric is replication slot WAL retention. If that number is growing, something is wrong — either Debezium is stuck or it can't keep up with write throughput.
Migration Strategy: Batch to CDC Without Downtime
You don't have to rip out batch overnight. Here's how we migrated:
Week 1-2: Deploy CDC in shadow mode. Run Debezium alongside your existing batch ETL. CDC events flow to Kafka but don't feed downstream systems. Monitor connector stability, throughput, and WAL impact.
Week 3-4: Dual-write comparison. Run a reconciliation job that compares the CDC-populated target against the batch-populated target. Identify and fix discrepancies. Common issues: timezone handling, decimal precision, NULL semantics.
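Our reconciliation job boiled down to normalize-then-diff on the primary key. A sketch with illustrative column names, normalizing the two discrepancy classes we hit most (decimal rendering and string casing):

```python
# Compare the batch-populated and CDC-populated targets row by row on a
# shared primary key, normalizing fields first so representation differences
# ("149.99" vs 149.99, "Confirmed" vs "confirmed") don't count as mismatches.

from decimal import Decimal

def normalize(row):
    return {
        "id": row["id"],
        "total": Decimal(str(row["total"])),   # unify string/float decimals
        "status": row["status"].strip().lower(),
    }

def diff(batch_rows, cdc_rows):
    batch = {r["id"]: normalize(r) for r in batch_rows}
    cdc = {r["id"]: normalize(r) for r in cdc_rows}
    missing = set(batch) - set(cdc)          # in batch target, not in CDC target
    extra = set(cdc) - set(batch)            # in CDC target only
    mismatched = {k for k in batch.keys() & cdc.keys() if batch[k] != cdc[k]}
    return missing, extra, mismatched

batch = [{"id": 1, "total": "149.99", "status": "Confirmed"}]
cdc = [{"id": 1, "total": 149.99, "status": "confirmed"},
       {"id": 2, "total": "5.00", "status": "new"}]
missing, extra, mismatched = diff(batch, cdc)
assert missing == set() and extra == {2} and mismatched == set()
```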
Week 5: Cut over non-critical consumers. Switch search indexing and cache invalidation to CDC. Keep batch ETL running for the warehouse. If CDC has issues, these consumers can tolerate a brief interruption.
Week 6-8: Cut over the warehouse. Switch the warehouse feed from batch ETL to CDC. Keep the batch pipeline in standby for two more weeks. Monitor data quality, completeness, and latency.
Week 9+: Decommission batch. Turn off the Airflow DAGs. Remove the extraction queries. Reclaim the compute resources. Pour one out for WHERE updated_at > last_run.
What I Actually Think
CDC isn't just a technical improvement over batch ETL — it's a philosophical shift in how you think about data movement. Batch ETL treats data synchronization as a periodic task. CDC treats it as a continuous stream. The difference matters more than most teams realize until they've lived with both.
Debezium specifically is one of those rare open-source projects that genuinely works in production without heroics. The Postgres connector is battle-tested across thousands of production deployments. The documentation is excellent. The community is active and helpful.
But I want to be honest about the operational complexity. You're adding Kafka, Kafka Connect, schema management, consumer group coordination, and WAL monitoring to your stack. If you're a three-person team running a monolith on a single Postgres instance, batch ETL with a cron job is probably fine. The threshold where CDC starts paying for itself is roughly when you have more than two downstream systems that need to react to changes in your primary database, or when your batch windows are consistently exceeding 50% of the interval between runs.
The industry is clearly moving this direction. 86% of IT leaders surveyed by Confluent in 2025 said they're prioritizing real-time data streaming. The CDC market is projected to exceed $3 billion by 2025 (and it likely already has). Gartner predicts 30% of enterprise data will be processed in real-time by 2026, up from under 10% a few years ago.
Batch ETL isn't disappearing tomorrow, but it's no longer the default answer. If you're starting a new project and your first instinct is to reach for Airflow and a cron schedule — stop. Ask whether CDC would give you the same result with less code and better latency. The answer is yes more often than you'd expect.
The best time to set up CDC was before your first batch pipeline caused a production incident. The second best time is now.
Sources: Debezium documentation and changelog, Confluent Platform documentation, PostgreSQL logical replication documentation, Fivetran 2025 data integration report, Confluent 2025 streaming survey, Gartner real-time data processing predictions, Kafka Connect REST API reference, PostgreSQL TOAST documentation, Streamkap CDC benchmark data, Red Hat Debezium community resources, Estuary Flow documentation, wal2json and pgoutput plugin comparisons, RDS logical replication parameter documentation, Schema Registry compatibility documentation, Kafka consumer group protocol specification, PeerDB CDC benchmarks, Airbyte CDC connector architecture, and production operational experience.