Kafka Is Overkill for 90% of Teams

80% of Fortune 100 companies use Apache Kafka. IBM just paid $11 billion to acquire Confluent. Over 150,000 organizations run it in production. So when I say Kafka is overkill for most teams, I'm not saying it's bad software. I'm saying most teams aren't Fortune 100.

The typical Kafka deployment I've encountered looks like this: a team of 8 engineers, processing maybe 5,000 events per second, running a 3-broker cluster with 6 ZooKeeper nodes (or wrestling with a KRaft migration), spending 20% of an SRE's time babysitting consumer lag and partition rebalances. They chose Kafka because a blog post told them to. They stayed because migration is terrifying.

Simpler tools exist. Redpanda, NATS, even plain SQS or Redis Streams will handle what most teams actually need, with a fraction of the operational burden.

How We Got Here

Jay Kreps, Jun Rao, and Neha Narkhede built Kafka at LinkedIn around 2010 to solve a specific problem: low-latency ingestion of massive event data into a lambda architecture that fed both Hadoop and real-time processing systems. By 2011, LinkedIn was already ingesting over 1 billion events per day. Nothing else could do that at the time.

Kreps named it after the author Franz Kafka because, as he put it, it's "a system optimized for writing."

The problem is that Kafka's design decisions made sense for LinkedIn's scale. Partitioned append-only logs, consumer groups, configurable retention; these are brilliant when you're processing a billion events daily across hundreds of services. They're expensive overhead when you're routing webhook payloads between three microservices.

Kafka graduated from the Apache Incubator in 2012. Confluent was founded in 2014 to commercialize it. And somewhere between 2015 and 2020, "use Kafka" became the default answer to every messaging question, regardless of whether the question warranted it.

The Operational Tax Nobody Mentions Up Front

Running Kafka means running a distributed system. That sounds obvious. In practice, it means debugging partition reassignment storms at 2 AM.

Before KRaft, you were managing two distributed systems: Kafka brokers and a ZooKeeper ensemble. Each partition state change required a synchronous ZooKeeper write. Performance degraded as the number of znodes grew. Watchers and notifications across thousands of partitions created cascading overhead.

KRaft fixes the ZooKeeper dependency, but the migration itself is brutal. It's a one-way operation with no rollback path. Teams attempting migration on older versions hit controller election failures during high metadata load, ACL migration failures with large permission sets, and metadata migration timeouts with large topic counts. Kafka 3.9 is the bridge release; anything older requires a multi-hop upgrade path.

Then there's the day-to-day: partition count planning, consumer group rebalancing, ISR (in-sync replica) monitoring, broker capacity planning, schema registry management, Connect cluster maintenance. Each is manageable individually. Together, they consume a significant chunk of engineering time.
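To make one of those chores concrete: consumer lag is the number teams end up babysitting most. A minimal lag check with kafka-python might look like this (a sketch; the broker address, group, topic, and alert threshold are all placeholders):

from kafka import KafkaConsumer, TopicPartition

def consumer_lag(bootstrap_servers: str, group_id: str, topic: str) -> dict:
    """Per-partition lag: latest broker offset minus the group's committed offset."""
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False,
    )
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag[tp.partition] = end_offsets[tp] - committed
    consumer.close()
    return lag

# Example: page someone if any partition falls more than 10,000 messages behind
if any(v > 10_000 for v in consumer_lag("kafka:9092", "orders-service", "orders").values()):
    print("consumer lag threshold exceeded")

And that's the easy part. The hard part is deciding whether the lag is a slow consumer, a rebalance storm, or a broker problem.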

A team I worked with calculated their Kafka overhead at roughly 15 hours per week of SRE time across a 6-person platform team. They were processing 3,000 messages per second. An SQS queue would have handled that load for a small fraction of the cost, with near-zero operational overhead.

Redpanda: Kafka Without the Baggage (Mostly)

Redpanda is the most interesting Kafka alternative because it's Kafka-compatible. Your existing producers, consumers, and tooling work with it. No code changes. It's written in C++, eliminates the JVM, eliminates ZooKeeper, and ships as a single binary.
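That compatibility claim is easy to verify: a kafka-python producer neither knows nor cares that the broker on the other end is Redpanda. A sketch (the broker address is hypothetical):

from kafka import KafkaProducer

# An unmodified Kafka client pointed at a Redpanda broker; only the address changes
producer = KafkaProducer(bootstrap_servers="redpanda-0.internal:9092")
producer.send("orders", b'{"order_id": 123, "status": "created"}')
producer.flush()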

The performance claims are impressive: at least 10x faster tail latencies than Kafka on medium to high throughput workloads, with up to 3x fewer nodes. Redpanda broke the 1 GB/sec barrier on 3 nodes while Kafka required at least double the hardware and introduced severe latency penalties.

But those claims deserve scrutiny.

Jack Vanlightly ran independent benchmarks and found that Redpanda's 1 GB/s benchmark "is not at all generalizable". Performance deteriorated significantly with 50 producers instead of 4. It also degraded after running for more than 12 hours. His conclusion: "Redpanda can outshine only when it gets the right workload. There are many workloads where Redpanda cannot outperform Kafka."

That doesn't make Redpanda bad. It means you should benchmark with your workload, not trust marketing numbers. For most small-to-medium teams, operational simplicity matters more than peak throughput anyway.

Where Redpanda genuinely wins is operational cost. No JVM heap tuning. No garbage collection pauses. No ZooKeeper quorum management. For teams under 10,000 messages per second, the reduced ops burden is worth more than any throughput benchmark.

One thing I appreciate about the Redpanda team: they published their benchmarks openly, and when Vanlightly poked holes in them, the conversation was productive. That kind of transparency is rare in infrastructure marketing. Compare it to cloud providers who bury latency numbers in footnotes and measure availability using definitions that exclude half their outages.

If you need Kafka's semantics (partitions, consumer groups, exactly-once) but not Kafka's complexity, Redpanda is the obvious choice. Just test it with your actual workload shape, not a synthetic one. Run it for at least a week under production-like load before committing. The 12-hour degradation Vanlightly observed matters if you're running 24/7 workloads.

NATS: The Broker You Can Forget About

If Redpanda is "Kafka but simpler," NATS is "do you even need Kafka?"

NATS is a lightweight messaging system written in Go. The binary is around 20 MB. It starts in milliseconds. Configuration is minimal. JetStream (added later) gives you persistence, replay, and exactly-once delivery when you need it.

The performance profile is different from Kafka. Recent benchmarks show NATS JetStream achieving 200,000-400,000 messages per second with persistence enabled, while Kafka reaches 500,000 to 1 million with batching. The CNCF Performance Report 2024 measured NATS bursting to over 50 million messages per second for non-persistent workloads.

Companies like JetBrains and Bosch use NATS in production. It's a CNCF incubating project, so the governance model is solid.

NATS makes the most sense for three scenarios:

  1. Microservice communication: Request/reply, fan-out, load-balanced queues. NATS handles these patterns natively without the concept of partitions or consumer groups.

  2. Edge and IoT: The tiny footprint means you can run NATS on devices with constrained resources. Try running a Kafka broker on a Raspberry Pi.

  3. Ephemeral messaging: If you don't need message replay or long-term retention, core NATS (without JetStream) gives you a pub/sub system with almost zero configuration.

The trade-off is clear: NATS won't handle terabyte-scale log retention or complex stream processing topologies. It's not trying to.
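To show how little ceremony the request/reply pattern involves, here's a sketch with the nats-py client (the subject and payloads are illustrative):

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    # Responder: any service subscribed to this subject can answer
    async def handler(msg):
        await msg.respond(b"order 123: shipped")

    await nc.subscribe("orders.status", cb=handler)

    # Requester: send a request and wait up to one second for a reply
    reply = await nc.request("orders.status", b"order 123?", timeout=1)
    print(reply.data.decode())

    await nc.drain()

asyncio.run(main())

No partitions, no consumer groups, no offset management. That's the point.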

The Other Options Worth Knowing

RabbitMQ has been around since 2007. It's the original "just works" message broker for routing patterns: direct, topic, fanout, headers. If your problem is "Service A needs to send tasks to Service B," RabbitMQ does that with no ceremony. RabbitMQ Streams (added in 3.9) gives it append-only log semantics similar to Kafka, enabling replay and fan-out without switching platforms.

For cloud-native teams, managed services eliminate operational overhead entirely:

  • Amazon SQS + SNS: SQS for point-to-point queues, SNS for pub/sub fan-out. Together they cover 80% of messaging patterns. Zero servers to manage. You pay per request. For teams under 10,000 messages per second, this is often the right starting point.

  • Google Cloud Pub/Sub: Serverless, globally distributed, automatic scaling. Google built it to handle YouTube's event volume. Your 5,000 events per second won't even register on their meters.

  • Redis Streams: If you already run Redis, Streams give you a persistent, replayable log without introducing a new system. The limitation is durability: Redis Pub/Sub is fire-and-forget with no history. Redis Streams adds persistence, but it's still fundamentally an in-memory system. For non-critical event processing, this trade-off is often acceptable (see the sketch after this list).
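A minimal Redis Streams sketch with redis-py (stream, group, and consumer names are illustrative):

import redis

r = redis.Redis()

# Producer: append an event to the stream
r.xadd("orders", {"order_id": "123", "status": "created"})

# One-time setup: a consumer group that tracks per-consumer progress
try:
    r.xgroup_create("orders", "billing", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Consumer: read new entries for this group, process, acknowledge
for _stream, entries in r.xreadgroup("billing", "worker-1", {"orders": ">"}, count=10):
    for entry_id, fields in entries:
        print(entry_id, fields)
        r.xack("orders", "billing", entry_id)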

The mistake I see most often: teams evaluate Kafka vs. Redpanda vs. NATS and skip the managed cloud options entirely. If you're on AWS already, SQS should be your default until you hit a concrete reason it won't work.
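If that describes you, the entire integration is a handful of boto3 calls (the queue URL and handler are hypothetical):

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

# Producer
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody='{"order_id": 123}')

# Consumer: long-poll for up to 10 messages, delete only what you've processed
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    handle(msg["Body"])  # hypothetical processing function
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])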

The Decision Nobody Wants to Make Honestly

Most teams reach for Kafka because they're afraid of outgrowing a simpler tool. "What if we need replay later?" "What if throughput doubles?" These are valid questions. But they're also the kind of questions that lead to premature complexity.

The real decision framework is shorter than most articles make it:

You need Kafka (or Redpanda) when:

  • Message replay is a requirement, not a nice-to-have
  • You process over 50,000 events per second sustained
  • You need exactly-once semantics with multi-partition transactions
  • Multiple independent consumer groups read the same stream
  • You're building an event sourcing architecture as your primary data pattern

You don't need Kafka when:

  • You're routing tasks between services (use SQS, RabbitMQ, or NATS)
  • You need simple pub/sub without replay (use Redis Pub/Sub or NATS core)
  • Your throughput is under 10,000 events per second (almost anything works)
  • You have fewer than 3 dedicated infrastructure engineers
  • "We might need it later" is the primary justification

Here's how this breaks down by team size:

Team Size          Throughput           Recommended Approach
1-5 engineers      Under 1K msg/sec     SQS, Redis Streams, or managed Pub/Sub
5-15 engineers     1K-10K msg/sec       NATS JetStream or RabbitMQ Streams
15-50 engineers    10K-100K msg/sec     Redpanda or Confluent Cloud
50+ engineers      100K+ msg/sec        Kafka (self-managed or Confluent)

These are guidelines, not rules. A 5-person fintech team processing payment events might need Kafka's durability guarantees regardless of throughput. A 200-person company doing analytics might be fine with managed Pub/Sub.

What "Just Use Kafka" Actually Costs

Let me make this concrete. Take a typical startup: 12 engineers, 3 microservices, processing order events at about 2,000 per second.

With Kafka on AWS (3 brokers, m5.large):

# Monthly cost breakdown (approximate)
EC2 instances (3x m5.large):    $210
EBS storage (3x 500GB gp3):     $120
Data transfer:                   $50
Engineer time (10 hrs/week):     $2,500  # at $60/hr loaded cost
                                 ------
Total monthly:                   $2,880

With SQS:

# Monthly cost breakdown (assumes 10-message batching and long polling)
SQS requests (2K/sec):          $620  # ~1.6B batched requests/month at $0.40/M
Engineer time (1 hr/week):      $250
                                ------
Total monthly:                  $870

With NATS on a single instance:

# Monthly cost breakdown  
EC2 instance (1x t3.medium):    $30
EBS storage (100GB):            $8
Engineer time (2 hrs/week):     $500
                                ------
Total monthly:                  $538

The infrastructure line items matter less than the engineer time, which is where the gap really opens up. Kafka doesn't just cost compute; it costs attention. Every hour an engineer spends debugging consumer lag is an hour not spent building product.

And these numbers are generous to Kafka. They assume a clean, well-tuned cluster. In practice, most small teams run into at least one multi-hour incident per quarter: a partition leader election that stalls, a consumer group rebalance that cascades, a disk that fills up because retention wasn't configured correctly. Factor in incident response time and the gap widens.

Confluent Cloud eliminates the operational burden, but at a premium. For 2,000 messages per second on their Basic tier, expect roughly $400-600/month depending on retention and throughput patterns. Better than self-managed, but you're still paying for Kafka-specific features that, in this scenario, nobody is using.

When Companies Walk Away

The pattern is consistent. XPENG Motors replaced Kafka and reduced costs by over 50%. PalmPay did the same, cutting costs by more than half by moving to AutoMQ's cloud-native approach.

These aren't small companies abandoning Kafka because they couldn't figure it out. They're companies that ran Kafka in production, measured the overhead, and concluded simpler alternatives met their actual requirements at lower cost.

The common trigger: teams realize they weren't using Kafka's distinctive features. They didn't need replay. They didn't need multi-consumer-group fan-out. They didn't need exactly-once transactional guarantees. They were using a distributed log as a fancy message queue.

That's like buying a semi truck to deliver groceries. The groceries get delivered. The truck works. But you're paying for diesel, CDL drivers, and a loading dock when a van would have been fine.

In 2026, teams often leave Kafka "not because it is slow or unreliable, but because simpler systems deliver the guarantees they need with less cognitive overhead." That sentence captures everything. The problem was never reliability. It was complexity that wasn't buying anything.

The Strongest Argument for Kafka

I've been making the case against default Kafka adoption, so let me steelman the other side.

Kafka is a battle-tested system with fifteen years of production history. It handles trillions of messages daily at companies like LinkedIn, Netflix, and Uber. The ecosystem is massive: Kafka Connect has connectors for nearly every data system, Kafka Streams provides a lightweight stream processing library, and the Schema Registry solves data contract evolution. Confluent's fiscal year 2025 subscription revenue hit $1.12 billion, which means real commercial support exists.

When Kai Waehner mapped the data streaming landscape for 2026, Kafka remained the center of gravity. Alternatives orbit it; they don't replace it.

If you're building a platform that multiple teams will build on for years, Kafka's ecosystem and mindshare are genuine advantages. Hiring is easier (more engineers know Kafka). Tooling is richer. Documentation is deeper. The operational burden, while real, is a solved problem at companies willing to invest in dedicated platform teams.

The argument isn't that Kafka is bad. It's that Kafka is specialized infrastructure being deployed for general-purpose messaging.

A Practical Migration Path

If you're running Kafka and suspect it's overkill, don't rip it out overnight. Here's the measured approach:

Week 1-2: Audit your usage.

# Identify which topics actually use Kafka-specific features
# (kafka-python; describe_configs returns raw protocol responses, and the
# exact tuple shape can vary slightly across library versions)
from kafka.admin import ConfigResource, ConfigResourceType, KafkaAdminClient

def get_consumer_groups_for_topic(admin: KafkaAdminClient, topic: str) -> list:
    """Find every consumer group with committed offsets on this topic."""
    groups = []
    for group_id, _protocol in admin.list_consumer_groups():
        offsets = admin.list_consumer_group_offsets(group_id)
        if any(tp.topic == topic for tp in offsets):
            groups.append(group_id)
    return groups

def get_retention_ms(admin: KafkaAdminClient, topic: str) -> int:
    """Read retention.ms for a topic, defaulting to the broker's 7 days."""
    resource = ConfigResource(ConfigResourceType.TOPIC, topic)
    response = admin.describe_configs([resource])[0]
    # Each resource tuple ends with its list of (name, value, ...) config entries
    for entry in response.resources[0][-1]:
        if entry[0] == "retention.ms" and entry[1] is not None:
            return int(entry[1])
    return 604800000

def audit_kafka_usage(bootstrap_servers: str) -> dict:
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    topics = admin.list_topics()

    report = {
        "total_topics": len(topics),
        "needs_replay": [],           # retention > 24h with multiple readers
        "multi_consumer": [],         # topics with 2+ consumer groups
        "single_consumer_queue": [],  # topics acting as simple queues
    }

    for topic in topics:
        consumer_groups = get_consumer_groups_for_topic(admin, topic)
        retention_ms = get_retention_ms(admin, topic)

        if retention_ms > 86400000 and len(consumer_groups) > 1:
            report["needs_replay"].append(topic)
        elif len(consumer_groups) > 1:
            report["multi_consumer"].append(topic)
        else:
            report["single_consumer_queue"].append(topic)

    return report

Topics in single_consumer_queue are your migration candidates. They're using Kafka as a message queue; SQS, NATS, or RabbitMQ can replace them directly.

Week 3-4: Run the alternative in parallel. Dual-write to both Kafka and the new system. Compare delivery latency, ordering guarantees, and failure modes under your actual workload.
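A dual-write wrapper doesn't need to be clever. A sketch for the Kafka-to-SQS case (topic names and queue URL are hypothetical), with Kafka staying the source of truth while the shadow path is validated:

import json
import boto3
from kafka import KafkaProducer

kafka_producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # hypothetical

def publish_order_event(event: dict) -> None:
    # Kafka remains the source of truth during the trial period
    kafka_producer.send("orders", event)
    # Shadow write to SQS; log failures rather than breaking the main path
    try:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))
    except Exception as exc:
        print(f"shadow write failed: {exc}")  # feed this into your comparison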

Week 5-8: Migrate consumers one at a time. Start with the lowest-risk topic. Move the consumer. Monitor for a week. Repeat. Keep Kafka running for the topics that genuinely need it.

Month 3+: Decommission. Once all simple-queue topics are migrated, evaluate whether the remaining Kafka topics justify a full cluster. Sometimes 2-3 topics that need replay can move to a single Redpanda instance.

The biggest risk in migration isn't technical. It's organizational. The team that built the Kafka infrastructure has expertise, ownership, and sometimes identity wrapped up in it. "We're the Kafka team" is a real thing at mid-size companies. Approach the migration as a resource reallocation, not a judgment on past decisions. The team's distributed systems knowledge is valuable regardless of which broker runs underneath.

I also want to be honest about what can go wrong. Message ordering guarantees are the subtlest trap. Kafka guarantees ordering within a partition. SQS FIFO queues guarantee ordering within a message group. NATS JetStream guarantees ordering within a stream. These are similar but not identical. If your consumers rely on specific ordering behavior, test exhaustively before switching. I once saw a team migrate from Kafka to SQS Standard (not FIFO) and spend two weeks debugging why their event sourcing projections were occasionally wrong. The messages arrived out of order. The fix was simple (switch to FIFO), but the debugging time was painful.
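If ordering matters, make it explicit rather than assumed. With SQS that means a FIFO queue and a MessageGroupId (a sketch; the queue URL is hypothetical):

import boto3

sqs = boto3.client("sqs")
FIFO_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # hypothetical

# Messages sharing a MessageGroupId are delivered in order within that group;
# the deduplication ID suppresses duplicate sends within a 5-minute window
sqs.send_message(
    QueueUrl=FIFO_URL,
    MessageBody='{"order_id": 123, "status": "shipped"}',
    MessageGroupId="order-123",
    MessageDeduplicationId="order-123-shipped",
)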

Stop Choosing Infrastructure Out of Fear

The streaming market is projected to grow from $1.6 billion to $5.3 billion by 2025, according to IDC. That growth is real, and it reflects genuine demand for event-driven architectures. But market growth doesn't mean every team needs the market leader.

I've watched three different teams adopt Kafka, spend 6 months building expertise, then realize SQS would have covered their actual requirements from day one. The time spent learning partition strategies, tuning consumer group rebalance protocols, and debugging ISR shrinkage was time not spent building the product their users cared about.

The fear is always the same: "We'll outgrow the simpler tool." Maybe. But you can migrate from SQS to Kafka in a few weeks when that day comes. You can't get back the six months you spent operating Kafka before you needed it.

Before you docker pull confluentinc/cp-kafka, ask one question: "What happens if I use SQS instead?"

If the answer is "nothing bad," you don't need Kafka. You need a queue. Queues are boring, cheap, and they work.

Kafka solved LinkedIn's billion-event-per-day problem in 2011. It's excellent at that. Fifteen years later, it remains the gold standard for high-throughput event streaming. But your 2,000-event-per-second order pipeline isn't LinkedIn, and pretending otherwise doesn't make your architecture more sophisticated. It just makes your on-call rotation more miserable.

Acknowledge that, pick the boring tool, and spend the saved engineering hours on the thing your company actually does. Your users don't care which message broker you run. They care whether the feature ships.


References:

  1. LinkedIn - Kafka's Origin Story at LinkedIn
  2. Wikipedia - Apache Kafka
  3. eWeek - Apache Kafka Gains Adoption as Streaming Data Grows
  4. Redpanda - Redpanda vs Kafka Performance Benchmark
  5. Jack Vanlightly - Kafka vs Redpanda Performance: Do the Claims Add Up?
  6. OSO - Apache Kafka's KRaft Protocol
  7. OSO - Guide to ZooKeeper to KRaft Migration
  8. Onidel - NATS JetStream vs RabbitMQ vs Apache Kafka Benchmarks 2025
  9. Synadia - NATS and Kafka Compared
  10. AutoMQ - Top 12 Kafka Alternatives 2025
  11. Confluent - Q3 2025 Financial Results
  12. Yahoo Finance - Confluent FY2025 Results and IBM Acquisition
  13. Kai Waehner - Data Streaming Landscape 2026
  14. Kai Waehner - Data Streaming Landscape 2025
  15. Tinybird - Apache Kafka Alternatives: 10 Best Options Compared
  16. BuildShift - Which Message Broker Should You Use in 2025?