Ismat Samadov

Rate Limiting, Circuit Breakers, and Backpressure: The Three Patterns That Keep Distributed Systems Alive

A missing timeout killed our checkout on Black Friday. Rate limiting, circuit breakers, and backpressure are the three patterns that prevent cascading failures.

Tags: Architecture, Backend, Performance, Python



© 2026 Ismat Samadov


A single missing timeout brought down our entire checkout flow on Black Friday 2023. The payment service slowed to 12-second response times. The order service kept retrying. The gateway queued thousands of requests. Within four minutes, every service in the cluster was unresponsive — not because they were broken, but because they were all waiting on each other. The CPU was fine. Memory was fine. The system drowned in its own politeness.

That incident taught me something most architecture articles skip: distributed systems don't fail from bugs. They fail from the absence of boundaries. Rate limiting, circuit breakers, and backpressure are those boundaries. They're the three patterns that keep a slow dependency from becoming a total outage.

This is the practical guide to all three — what they are, how they differ, when to use which, and the Python implementations I actually run in production.

Why Systems Cascade

Before jumping into the patterns, you need to understand why distributed systems fail the way they do. It's not random. It's physics.

Every service has a maximum throughput — the number of requests it can handle per second before latency starts climbing. When a downstream dependency slows down, the calling service's threads (or coroutine slots, or connection pool entries) start piling up waiting for responses. Those occupied resources can't serve new requests. The calling service slows down. Its callers slow down. And so on, all the way up to the user.
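The arithmetic behind this is Little's law: concurrency = arrival rate × latency. With a fixed pool of threads or connections, the maximum sustainable arrival rate is pool size divided by latency. A sketch with hypothetical numbers (the pool size and latencies here are illustrative, not from the incident):

```python
# Little's law: concurrency = arrival_rate * latency.
# With a fixed worker/connection pool, the maximum sustainable
# arrival rate is pool_size / latency.

def max_throughput(pool_size: int, latency_s: float) -> float:
    """Requests/second a service can sustain before requests queue."""
    return pool_size / latency_s

# Healthy dependency: 200 threads, 50ms responses
healthy = max_throughput(200, 0.05)

# Same pool when the dependency slows to 12-second responses
degraded = max_throughput(200, 12.0)
```

With those numbers, capacity collapses from 4,000 requests/second to under 17 — everything beyond that queues, and the queue is what takes the cluster down.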

This is a cascading failure. Netflix built an entire engineering discipline around preventing them, eventually creating Hystrix — the library that popularized the circuit breaker pattern. Hystrix is deprecated now (Resilience4j replaced it in the Java world), but the problems it solved haven't changed.

The three patterns address different failure modes:

| Pattern | Protects Against | Direction | Analogy |
|---|---|---|---|
| Rate Limiting | Too many requests coming in | Inbound | Bouncer at a club |
| Circuit Breaker | Downstream dependency failing | Outbound | Electrical fuse |
| Backpressure | Producer faster than consumer | Flow control | Dam on a river |

They're complementary, not competitive. Most production systems need all three.

Rate Limiting: The Bouncer

Rate limiting controls how many requests a client (or all clients) can make in a given time window. It's the simplest of the three patterns, and the most misunderstood.

Most developers think rate limiting is about preventing abuse. It is — but that's the least interesting use case. The real value of rate limiting is protecting your system from itself. A misconfigured batch job, a retry storm from a buggy client, a marketing email that drives 10x normal traffic — these are all scenarios where your own legitimate traffic kills you.

The Algorithms

There are four main rate limiting algorithms. Each makes a different tradeoff:

Token Bucket — The most common algorithm. A bucket holds tokens; each request consumes one. Tokens refill at a fixed rate. If the bucket is empty, requests are rejected. The bucket size controls burst capacity — a bucket of 100 tokens with a 10/second refill rate allows bursts of 100 requests followed by a sustained 10/second.

import time
import threading

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now

            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

from fastapi import HTTPException  # assuming a FastAPI handler

# Sustained 100 requests/sec with a burst capacity of 50
limiter = TokenBucket(rate=100, capacity=50)

if not limiter.allow():
    # Return 429 Too Many Requests
    raise HTTPException(status_code=429)

Sliding Window Counter — Splits time into fixed windows but weights the previous window's count based on how far into the current window you are. Better accuracy than fixed windows, lower memory than sliding logs. This is what Redis recommends for most use cases.
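A minimal in-process sketch of the weighted-count idea (the class name and structure are mine, not from any particular library):

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's
    count by how much of it still overlaps the sliding window."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_start = time.monotonic()
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.current_start
        if elapsed >= self.window:
            # Roll the window forward. If more than one full window
            # passed, the previous count is effectively zero.
            self.previous_count = (
                self.current_count if elapsed < 2 * self.window else 0
            )
            self.current_count = 0
            self.current_start += (elapsed // self.window) * self.window
            elapsed = now - self.current_start

        # The further we are into the current window, the less the
        # previous window contributes to the estimate.
        weight = 1.0 - (elapsed / self.window)
        estimated = self.previous_count * weight + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```

Two counters per key instead of a timestamp log — that's the memory win over the sliding log approach.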

Fixed Window — Simple counter that resets every N seconds. Problem: a burst at the window boundary can allow 2x the intended rate. If your limit is 100/minute and a client sends 100 requests at 11:59:59 and another 100 at 12:00:01, they've sent 200 requests in 2 seconds.

Leaky Bucket — Processes requests at a fixed rate regardless of input rate. Excess requests queue up and eventually drop. Good for smoothing traffic but adds latency to bursty workloads.
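A leaky bucket can be modeled without an actual queue by tracking the bucket's "water level" — this counter-based variant is a common simplification of the queue-based original:

```python
import time

class LeakyBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # leak (processing) rate, requests/sec
        self.capacity = capacity  # max buffered requests
        self.level = 0.0
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Water drains at a constant rate, regardless of inflow
        self.level = max(0.0, self.level - (now - self.last_leak) * self.rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False  # bucket full — request dropped
```

Note the symmetry with the token bucket: one counts capacity draining in, the other counts work draining out.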

| Algorithm | Burst Tolerance | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Token Bucket | High (configurable) | O(1) | Good | APIs with burst traffic |
| Sliding Window | Low | O(1) | Very good | General-purpose limiting |
| Fixed Window | Boundary spike risk | O(1) | Fair | Simple use cases |
| Leaky Bucket | None (smoothing) | O(1) | Good | Traffic shaping |

For most APIs, token bucket is the strongest default. It models capacity accumulation naturally and the burst parameter gives you a tuning knob that the other algorithms lack.

Distributed Rate Limiting

The single-process token bucket above breaks in a distributed system. If you have 10 API servers behind a load balancer, each running its own token bucket, a client can send 10x your intended rate by hitting different servers.

The standard solution is a centralized counter in Redis:

import time
import uuid

import redis

r = redis.Redis(host='localhost', port=6379)

def rate_limit(key: str, limit: int, window: int) -> bool:
    """Sliding window rate limiter using Redis."""
    now = time.time()
    pipeline = r.pipeline()

    # Remove entries older than the window
    pipeline.zremrangebyscore(key, 0, now - window)
    # Add current request; the unique suffix prevents two requests
    # with the same timestamp from collapsing into one member
    pipeline.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    # Count requests in window
    pipeline.zcard(key)
    # Expire the key so idle clients don't leak memory
    pipeline.expire(key, window)

    results = pipeline.execute()
    request_count = results[2]

    return request_count <= limit

# 100 requests per 60-second window
if not rate_limit(f"user:{user_id}", limit=100, window=60):
    return Response(status_code=429, headers={
        "Retry-After": "60",
        "X-RateLimit-Limit": "100",
        "X-RateLimit-Remaining": "0",
    })

This uses a Redis sorted set as a sliding window log. Each request is a member with its timestamp as the score. We remove expired entries, add the new one, and check the count — all in a single pipeline for atomicity.

The tradeoff: every request now requires a Redis round-trip. At Ably's scale, they found this adds 1-2ms per request, which is acceptable for most APIs but matters at very high throughput.

Multi-Level Rate Limiting

Production systems rarely use a single rate limit. You want layers:

RATE_LIMITS = {
    "per_user": {"limit": 100, "window": 60},
    "per_ip": {"limit": 1000, "window": 60},
    "per_endpoint": {"limit": 50, "window": 60},
    "global": {"limit": 10000, "window": 60},
}

def check_all_limits(user_id: str, ip: str, endpoint: str) -> bool:
    # Note: every check records the request, so a rejected call still
    # consumes quota at each level. Short-circuit if that matters.
    checks = [
        rate_limit(f"user:{user_id}", **RATE_LIMITS["per_user"]),
        rate_limit(f"ip:{ip}", **RATE_LIMITS["per_ip"]),
        rate_limit(f"endpoint:{endpoint}", **RATE_LIMITS["per_endpoint"]),
        rate_limit("global", **RATE_LIMITS["global"]),
    ]
    return all(checks)

Per-user limits prevent individual abuse. Per-IP limits catch credential stuffing attacks that rotate user accounts. Per-endpoint limits protect expensive operations (search, reports). Global limits are your last line of defense against traffic you didn't anticipate.

As of 2025, 31% of organizations use multiple API gateways, each potentially enforcing its own rate limits. If you're running Kong, Envoy, or AWS API Gateway in front of your services, you already have gateway-level rate limiting — but you still need application-level limits for business logic rules that the gateway can't express.

Circuit Breakers: The Fuse

Rate limiting protects you from too much inbound traffic. Circuit breakers protect you from broken outbound dependencies.

The concept is borrowed directly from electrical engineering. When current exceeds a safe threshold, a circuit breaker trips and stops the flow. In software, when a downstream service starts failing, the circuit breaker trips and stops sending requests — returning an immediate error (or fallback) instead of waiting for timeouts.

The Three States

A circuit breaker operates in three states:

CLOSED (normal operation)
  │
  │  failure_count > threshold
  ▼
OPEN (failing fast)
  │
  │  timeout expires
  ▼
HALF-OPEN (testing recovery)
  │
  ├── success → CLOSED
  └── failure → OPEN

Closed: All requests pass through. The breaker tracks success/failure rates. When the failure rate exceeds a threshold (say, 50% of the last 100 requests), the breaker trips to Open.

Open: All requests immediately fail without calling the downstream service. This is the key insight — you're not waiting for a timeout. You're failing in milliseconds instead of seconds. After a configurable timeout (say, 30 seconds), the breaker transitions to Half-Open.

Half-Open: A limited number of test requests are allowed through. If they succeed, the breaker closes. If they fail, it reopens. This is the recovery probe.

Python Implementation

Here's a production-grade circuit breaker:

import time
import threading
from enum import Enum
from collections import deque
from typing import Callable, Any

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
        window_size: int = 100,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.window_size = window_size

        self.state = State.CLOSED
        self.failures = deque(maxlen=window_size)
        self.last_failure_time = 0
        self.half_open_calls = 0
        self.lock = threading.Lock()

    @property
    def failure_rate(self) -> float:
        if not self.failures:
            return 0.0
        return sum(self.failures) / len(self.failures)

    def call(self, func: Callable, *args, **kwargs) -> Any:
        with self.lock:
            if self.state == State.OPEN:
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    raise CircuitOpenError(
                        f"Circuit is OPEN. Retry after "
                        f"{self.recovery_timeout}s"
                    )

            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError("Half-open call limit reached")
                self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self.lock:
            self.failures.append(0)
            if self.state == State.HALF_OPEN:
                self.state = State.CLOSED

    def _on_failure(self):
        with self.lock:
            self.failures.append(1)
            self.last_failure_time = time.monotonic()
            if self.state == State.HALF_OPEN:
                self.state = State.OPEN
            elif (len(self.failures) >= self.failure_threshold
                  and self.failure_rate > 0.5):
                self.state = State.OPEN

class CircuitOpenError(Exception):
    pass

Usage:

payment_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30.0,
)

async def process_payment(order_id: str):
    try:
        result = payment_breaker.call(
            payment_client.charge,
            order_id=order_id,
        )
        return result
    except CircuitOpenError:
        # Fallback: queue for retry, show "payment pending"
        await payment_queue.enqueue(order_id)
        return PaymentResult(status="pending")
    except PaymentError:
        # Individual failure, breaker is tracking it
        raise

Fallback Strategies

The circuit breaker's power comes from what you do when it's open. Simply throwing an error is the minimum. Better options:

Cached response: Return the last known good response. Works for read operations — a slightly stale product catalog is better than no catalog.

Degraded response: Return a partial result. If the recommendation service is down, show popular items instead of personalized ones.

Queue for retry: Accept the request, queue it, process it when the dependency recovers. Works for writes that aren't time-critical.

Alternative service: Route to a backup provider. If your primary payment processor is down, fall back to the secondary one.

class PaymentService:
    def __init__(self):
        self.primary_breaker = CircuitBreaker(failure_threshold=3)
        self.fallback_breaker = CircuitBreaker(failure_threshold=5)

    async def charge(self, amount: float) -> PaymentResult:
        # Try primary
        try:
            return self.primary_breaker.call(
                self.stripe_client.charge, amount
            )
        except CircuitOpenError:
            pass

        # Fallback to secondary
        try:
            return self.fallback_breaker.call(
                self.braintree_client.charge, amount
            )
        except CircuitOpenError:
            # Both providers down — queue it
            await self.retry_queue.push(amount)
            return PaymentResult(status="queued")
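The cached-response fallback (the first strategy above) can be as simple as remembering the last good result per key. A sketch — the wrapper is deliberately independent of the breaker, so you can compose it around any protected call; the TTL policy and cache store are up to you:

```python
import time
from typing import Any, Callable

class CachedFallback:
    """Serve the last known good response when a call fails."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self.cache: dict[Any, tuple[float, Any]] = {}

    def call(self, key: Any, func: Callable, *args, **kwargs) -> Any:
        try:
            result = func(*args, **kwargs)
            self.cache[key] = (time.monotonic(), result)  # refresh cache
            return result
        except Exception:
            entry = self.cache.get(key)
            if entry and time.monotonic() - entry[0] < self.ttl:
                return entry[1]  # stale but usable
            raise  # no cached value — surface the failure
```

Composed with a breaker, a catalog read becomes something like `fallback.call("catalog", breaker.call, fetch_catalog)`: fresh data when healthy, yesterday's catalog when not.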

What Netflix Learned

Netflix built Hystrix specifically because cascading failures were their number one cause of customer-facing outages. A key insight from their experience: the circuit breaker's recovery timeout needs to be long enough for the downstream service to actually recover.

If you set it to 5 seconds and the downstream service takes 30 seconds to restart, you'll just hammer it with half-open probe requests during its most vulnerable startup period. Netflix uses chaos engineering (Chaos Monkey, Chaos Kong) to test these scenarios — simulating AWS region failures and verifying that circuit breakers trip correctly.

Hystrix was deprecated in 2018, but the patterns it established live on in Resilience4j (Java), Polly (.NET), and Python libraries like pybreaker and pyresilience.

Backpressure: The Dam

Rate limiting and circuit breakers are binary — they either allow or reject a request. Backpressure is more subtle. It's a continuous feedback mechanism where a slow consumer tells a fast producer to slow down.

Think of it like water management. Without a dam, a rainstorm floods the valley. With a dam, you control the flow rate to match what the downstream infrastructure can handle. The water doesn't disappear — it accumulates in the reservoir. If the reservoir fills up, you open the spillway (drop messages) or stop accepting inflow (apply upstream backpressure).

Where Backpressure Shows Up

Backpressure is everywhere in distributed systems, even when you don't realize it:

TCP flow control. TCP has built-in backpressure. The receiver advertises a window size — the amount of data it can accept. When the receiver's buffer fills, the window shrinks to zero, and the sender stops. This is why a slow consumer doesn't crash when a fast producer sends data over a TCP connection. The protocol handles it.

Message queues. Kafka, RabbitMQ, and SQS all implement backpressure differently. Kafka consumers pull messages at their own pace — natural backpressure. RabbitMQ supports consumer prefetch counts. SQS uses visibility timeouts.

Stream processing. Apache Flink and Kafka Streams implement backpressure natively. When a downstream operator can't keep up, the upstream operator slows its output. This propagates all the way back to the source.

HTTP APIs. This is where backpressure is most often missing. HTTP has no built-in flow control beyond TCP. If your API receives requests faster than it can process them, you need to implement backpressure yourself.

The Queue-Based Backpressure Pattern

The most common backpressure implementation uses a bounded queue between producer and consumer:

import asyncio
from asyncio import Queue

class BackpressureProcessor:
    def __init__(self, max_queue_size: int = 1000, workers: int = 10):
        self.queue: Queue = Queue(maxsize=max_queue_size)
        self.workers = workers
        self.is_accepting = True

    async def submit(self, item: dict) -> bool:
        """Submit work. Returns False if backpressure is active."""
        if self.queue.qsize() >= self.queue.maxsize * 0.8:
            # 80% full — signal backpressure
            self.is_accepting = False
            return False

        await self.queue.put(item)

        if self.queue.qsize() < self.queue.maxsize * 0.5:
            # Below 50% — accept again
            self.is_accepting = True

        return True

    async def worker(self, worker_id: int):
        while True:
            item = await self.queue.get()
            try:
                await self.process(item)
            except Exception as e:
                await self.handle_error(item, e)
            finally:
                self.queue.task_done()
                if self.queue.qsize() < self.queue.maxsize * 0.5:
                    # Drained below the low-water mark — accept again.
                    # Without this, a producer gated on is_accepting
                    # would stop calling submit() and the flag would
                    # stay False forever.
                    self.is_accepting = True

    async def process(self, item: dict):
        # Your actual processing logic
        await asyncio.sleep(0.1)  # simulate work

    async def handle_error(self, item: dict, exc: Exception):
        # Log, dead-letter, or retry — don't let one bad item kill a worker
        pass

    async def start(self):
        return [
            asyncio.create_task(self.worker(i))
            for i in range(self.workers)
        ]

The key detail: the 80% high-water mark and 50% low-water mark. This hysteresis prevents oscillation — without it, the system would rapidly flip between accepting and rejecting at the boundary.

Backpressure in HTTP APIs

For HTTP APIs, backpressure typically manifests as one of three strategies:

429 with Retry-After. The simplest approach. When your processing queue is full, return a 429 status code with a Retry-After header telling the client when to try again.

from fastapi import FastAPI, HTTPException, Response

app = FastAPI()
processor = BackpressureProcessor(max_queue_size=5000)

@app.post("/events")
async def ingest_event(event: dict, response: Response):
    if not processor.is_accepting:
        response.headers["Retry-After"] = "5"
        raise HTTPException(
            status_code=429,
            detail="System under load. Retry after 5 seconds."
        )

    accepted = await processor.submit(event)
    if not accepted:
        response.headers["Retry-After"] = "5"
        raise HTTPException(status_code=429)

    return {"status": "accepted"}

Load shedding. When the system is overloaded, drop low-priority requests entirely. This is different from rate limiting — you're not counting requests, you're measuring system capacity.

import psutil

def should_shed_load() -> bool:
    """Shed load when system resources are constrained."""
    cpu_percent = psutil.cpu_percent(interval=0.1)
    memory_percent = psutil.virtual_memory().percent

    # Shed non-critical requests above 85% CPU or 90% memory
    return cpu_percent > 85 or memory_percent > 90

@app.post("/search")
async def search(query: str, priority: str = "normal"):
    if should_shed_load() and priority != "critical":
        raise HTTPException(
            status_code=503,
            detail="Service temporarily degraded"
        )
    return await perform_search(query)

Adaptive concurrency limits. Netflix's concurrency-limits library (and its concept) dynamically adjusts how many concurrent requests a service will accept based on measured latency. As latency increases (indicating saturation), the concurrency limit decreases. As latency decreases, the limit increases. This is TCP congestion control applied to HTTP.
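The idea can be sketched as an AIMD loop — additive increase, multiplicative decrease, the same rule TCP congestion control uses. This is a simplification of the concept, not the actual algorithm Netflix's library implements:

```python
class AdaptiveConcurrencyLimit:
    """AIMD concurrency limit driven by observed latency."""

    def __init__(self, initial: int = 20, min_limit: int = 1,
                 max_limit: int = 200, target_latency: float = 0.1):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_latency = target_latency  # seconds
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.limit:
            return False  # shed: at the current concurrency ceiling
        self.in_flight += 1
        return True

    def release(self, latency: float):
        self.in_flight -= 1
        if latency > self.target_latency:
            # Saturation signal: back off multiplicatively
            self.limit = max(self.min_limit, int(self.limit * 0.9))
        else:
            # Healthy: probe for more capacity, one slot at a time
            self.limit = min(self.max_limit, self.limit + 1)
```

The appeal over a static limit: nobody has to guess the right concurrency number — the system discovers it, and rediscovers it when the hardware or workload changes.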

The Kafka Backpressure Problem

Kafka consumers implement natural backpressure by pulling messages — but that doesn't mean it's free. If your consumer falls behind, the consumer lag grows. That lag represents unprocessed messages sitting in Kafka, consuming disk space and increasing end-to-end latency.

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'my-processor',
    'auto.offset.reset': 'latest',
    'max.poll.interval.ms': 300000,  # 5 minutes max between polls
    'fetch.max.bytes': 1048576,      # 1MB max per fetch
})
consumer.subscribe(['events'])  # topic name is illustrative

def process_with_backpressure():
    while True:
        messages = consumer.consume(
            num_messages=100,  # process in small batches
            timeout=1.0
        )

        if not messages:
            continue

        # Process batch
        for msg in messages:
            if msg.error():
                continue
            process_message(msg)

        # Commit after successful processing
        consumer.commit(asynchronous=False)

        # If processing is slow, the consumer naturally
        # pulls fewer messages — backpressure by design

The danger with Kafka backpressure is silent accumulation. Consumer lag growing from thousands to millions of messages can happen gradually, and by the time you notice, you're hours behind. Monitor consumer lag aggressively:

# Monitor with prometheus_client
from prometheus_client import Gauge

consumer_lag = Gauge(
    'kafka_consumer_lag',
    'Number of messages behind',
    ['topic', 'partition']
)

def record_lag(consumer):
    """Export lag = high watermark minus the consumer's position."""
    for tp in consumer.assignment():
        low, high = consumer.get_watermark_offsets(tp)
        position = consumer.position([tp])[0].offset
        if position >= 0:  # negative means no position yet
            consumer_lag.labels(tp.topic, tp.partition).set(high - position)

# Alert when lag exceeds threshold
# lag > 10000 for 5 minutes → page someone

Putting Them Together

Here's where it gets interesting. These three patterns aren't independent — they compose into a resilience layer that handles failures at every level.

                    Inbound Request
                         │
                    ┌────▼────┐
                    │  Rate   │ ← "Are we accepting this client?"
                    │ Limiter │
                    └────┬────┘
                         │ (allowed)
                    ┌────▼────┐
                    │  Back-  │ ← "Can we handle this right now?"
                    │pressure │
                    └────┬────┘
                         │ (capacity available)
                    ┌────▼────┐
                    │ Circuit │ ← "Is the dependency working?"
                    │ Breaker │
                    └────┬────┘
                         │ (closed)
                    ┌────▼─────┐
                    │Downstream│
                    │ Service  │
                    └──────────┘

A request arrives. Rate limiting checks if this client has exceeded their quota. Backpressure checks if the system has capacity. The circuit breaker checks if the downstream dependency is healthy. Only if all three say yes does the request proceed.

Here's a combined middleware for FastAPI:

from fastapi import FastAPI, HTTPException, Request, Response

app = FastAPI()

# Per-user rate limiter — assume a RedisRateLimiter class wrapping
# the Redis-backed rate_limit function from earlier
user_limiter = RedisRateLimiter(limit=100, window=60)

# System-level backpressure
processor = BackpressureProcessor(max_queue_size=5000)

# Per-dependency circuit breakers
breakers = {
    "payment": CircuitBreaker(failure_threshold=5, recovery_timeout=30),
    "inventory": CircuitBreaker(failure_threshold=10, recovery_timeout=15),
    "notification": CircuitBreaker(failure_threshold=20, recovery_timeout=60),
}

@app.middleware("http")
async def resilience_middleware(request: Request, call_next):
    user_id = request.headers.get("X-User-ID", request.client.host)

    # Layer 1: Rate limiting
    if not user_limiter.allow(user_id):
        return Response(status_code=429, headers={"Retry-After": "60"})

    # Layer 2: Backpressure
    if not processor.is_accepting:
        return Response(status_code=503, headers={"Retry-After": "5"})

    # Layer 3: Circuit breakers checked per-call in handlers
    response = await call_next(request)
    return response

@app.post("/orders")
async def create_order(order: OrderRequest):
    # Circuit breaker on payment
    try:
        payment = breakers["payment"].call(
            payment_service.charge, order.amount
        )
    except CircuitOpenError:
        return {"status": "payment_pending", "retry": True}

    # Circuit breaker on inventory
    try:
        breakers["inventory"].call(
            inventory_service.reserve, order.items
        )
    except CircuitOpenError:
        # Reverse the payment
        payment_service.refund(payment.id)
        raise HTTPException(503, "Inventory service unavailable")

    return {"status": "confirmed", "payment_id": payment.id}

Notice the different thresholds for each breaker. Payment is critical and low-volume — trip after 5 failures. Inventory is called more frequently — trip after 10. Notifications are fire-and-forget — tolerate 20 failures before tripping because a missed notification isn't a business-critical failure.

The Decision Framework

When should you reach for each pattern? Here's how I think about it:

| Scenario | Pattern | Why |
|---|---|---|
| Public API with multiple clients | Rate Limiting | Prevent one client from starving others |
| Calling a flaky third-party service | Circuit Breaker | Stop waiting for timeouts |
| Processing queue growing unbounded | Backpressure | Slow producers before memory explodes |
| Login endpoint under brute force | Rate Limiting | Security + availability |
| Database connection pool exhaustion | Circuit Breaker | Stop sending queries to a saturated DB |
| Event ingestion pipeline at capacity | Backpressure | Reject or delay events gracefully |
| Microservice calling 5 dependencies | All three | Rate limit inbound, circuit break outbound, backpressure on queues |

If you're building a monolith with a single database, you probably only need rate limiting. If you're calling external APIs, add circuit breakers. If you're processing asynchronous workloads, add backpressure. If you're running microservices, you need all three.

Common Mistakes

Setting recovery timeouts too short. If your circuit breaker reopens after 5 seconds but the downstream service takes 30 seconds to restart, you're just hammering it with probe requests during its most vulnerable period. Start with 30 seconds and tune from there.

Rate limiting without Retry-After headers. If you reject a request with 429 but don't tell the client when to retry, they'll just retry immediately — making the problem worse. Always include Retry-After.

Ignoring partial failures. A service returning 200 but with degraded data is still a failure for circuit breaker purposes. Track success quality, not just HTTP status codes.

No backpressure on internal queues. Every unbounded queue is a memory leak waiting to happen. If your queue can grow without limit, it will — usually at 3 AM on a weekend.

Using the same limits everywhere. Not all endpoints are equal. Your health check endpoint should have different limits than your payment endpoint. Not all dependencies are equal either — a slow CDN is not the same as a slow database.

Forgetting about the thundering herd. When a circuit breaker closes after being open, all queued clients retry simultaneously. Add jitter to your retry logic:

import random

def retry_with_jitter(attempt: int, base_delay: float = 1.0) -> float:
    """Exponential backoff with full jitter."""
    max_delay = base_delay * (2 ** attempt)
    return random.uniform(0, max_delay)
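Wired into a retry loop, it looks like this (self-contained sketch; the small base delay and generic exception handling are illustrative):

```python
import random
import time

def retry_with_jitter(attempt: int, base_delay: float = 1.0) -> float:
    """Exponential backoff with full jitter."""
    max_delay = base_delay * (2 ** attempt)
    return random.uniform(0, max_delay)

def call_with_retries(func, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts — surface the error
            # Full jitter spreads retries across the window so
            # recovering services don't get hit by a synchronized wave
            time.sleep(retry_with_jitter(attempt, base_delay=0.1))
```

Full jitter (uniform over the whole backoff window) beats equal or no jitter precisely because it maximizes the spread — the goal is decorrelation, not politeness.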

Tools Worth Knowing

For Python specifically:

| Tool | What It Does | When to Use |
|---|---|---|
| pybreaker | Circuit breaker | Simple CB needs |
| pyresilience | CB + retry + rate limit + bulkhead | Full resilience stack |
| slowapi | Rate limiting for FastAPI/Starlette | API rate limiting |
| redis-py | Distributed rate limiting | Multi-server setups |
| tenacity | Retry with backoff | Retry logic |
| asyncio.Queue | Bounded queue backpressure | Async processing |

For infrastructure-level protection:

| Tool | What It Does |
|---|---|
| Kong | API gateway with rate limiting plugins |
| Envoy | Service mesh with circuit breaking |
| Istio | Service mesh with all three patterns |
| NGINX | Reverse proxy with rate limiting |

The trend in 2025-2026 is pushing these patterns into the infrastructure layer — service meshes and API gateways — rather than implementing them in application code. Kong and Envoy are the dominant choices for teams that want resilience without modifying every service.

But here's my contrarian take: infrastructure-level resilience is necessary but not sufficient. Your API gateway can rate limit, but only your application knows that a payment request is more important than a notification request. Your service mesh can circuit-break, but only your application can provide a meaningful fallback. The best systems implement resilience at both layers.

What I Actually Think

After building and operating distributed systems for years, I've come to believe that most production outages aren't caused by hardware failures, bugs, or traffic spikes. They're caused by the absence of these three patterns. The system worked fine under normal conditions, and nobody bothered to add the guardrails for abnormal conditions.

Rate limiting is the easiest to implement and the hardest to tune. The right limits are different for every endpoint, every client, and every time of day. Start too aggressive and you'll reject legitimate traffic. Start too permissive and the limits won't help when you need them. My advice: start with conservative limits and loosen them based on observed usage patterns. It's easier to increase a limit than to recover from an outage.

Circuit breakers are the highest-impact pattern for microservices. A single circuit breaker on your most critical dependency will prevent more outages than any amount of horizontal scaling. The timeout is the most important parameter — not the failure threshold, not the window size. Get the timeout right, and the rest is tuning.

Backpressure is the most underappreciated of the three. Every team I've worked with has had at least one incident caused by an unbounded queue or a producer-consumer speed mismatch. The fix is always the same: add a bounded buffer, measure the fill level, and signal the producer when it's getting full. It's not glamorous, but it prevents the 3 AM page.

The order I'd implement them in a new system: rate limiting first (it's the simplest and protects against the most common failure mode), circuit breakers second (they prevent cascading failures from dependencies), backpressure third (it handles the subtler producer-consumer mismatches).

Don't build your own in production unless you have specific needs that existing libraries don't cover. Use pybreaker or pyresilience for circuit breakers. Use Redis for distributed rate limiting. Use asyncio.Queue or Kafka consumer config for backpressure. These are solved problems — the value is in tuning them for your specific system, not reimplementing them.

The three patterns together form a complete resilience layer. Rate limiting handles the demand side. Circuit breakers handle the supply side. Backpressure handles the flow between them. Miss any one of the three, and you have a gap that will eventually become an incident. Usually at 3 AM. Usually on a holiday.


Sources: Netflix Hystrix Wiki, Resilience4j Documentation, Redis Rate Limiting Tutorial, Arcjet Rate Limiting Algorithm Comparison, Ably Distributed Rate Limiting, Microservices.io Circuit Breaker Pattern, InfoQ Cascading Failures, Netflix Fault Injection and Chaos Engineering, Streamkap Backpressure in Stream Processing, Jay Phelps Backpressure Explained, GeeksforGeeks Back Pressure in Distributed Systems, DigitalAPI API Gateway Framework 2026 Guide, Calmops Kong and Envoy API Gateways 2026, PyBreaker Documentation, Pyresilience Documentation, Kafka Consumer Backpressure, OneUptime Kafka Backpressure Guide, API7 Rate Limiting Guide, Designing a Distributed Rate Limiter.