The best infrastructure engineer I've ever worked with quit on a Tuesday. No drama, no counteroffer negotiation. He just walked into our CTO's office, said "I can't do this anymore," and put in his two weeks. He'd been paged 3-4 times per night for six weeks straight because he was one of two people on the rotation, and the only one who could debug our payment system.
We lost $340K in fully-loaded hiring and onboarding costs to replace him. The payment system had three more incidents in the first week after he left. And the whole thing was avoidable.
65% of engineers report currently experiencing burnout. On-call is the single biggest contributor I've seen. Not because on-call is inherently bad — it's because most companies do it in a way that slowly destroys their teams.
The Numbers Behind the Burnout
On-call has gotten measurably worse, not better, despite a decade of SRE evangelism.
The alert fatigue stat is the one that kills me: teams receive over 2,000 alerts weekly, but only 3% need immediate human action. That means 97% of pages are noise. And 73% of organizations have experienced outages because real alerts got lost in that noise.
Operational toil rose to 30% from 25%, marking the first increase in five years. We're going backwards. More tools, more alerts, more dashboards, more exhaustion.
How We Broke It
I've seen the same pattern at multiple companies. Here's how on-call goes from "manageable" to "people are quitting."
Stage 1: The Hero Phase
One or two senior engineers know the system best. When things break, they get called. There's no formal rotation. It works because incidents are rare and the heroes are willing.
Stage 2: The Rotation That Isn't
The team creates a "rotation" with 2-3 people. But it's not a real rotation — it's the same heroes with a shared calendar. When something complex breaks, the on-call person escalates to the hero anyway. The hero is effectively always on-call.
Stage 3: Alert Proliferation
More services, more monitoring, more alerts. Nobody prunes old alerts. A threshold set during a traffic spike in 2023 now fires every Tuesday at 3 AM because the baseline shifted. The on-call engineer wakes up, sees it's a false alarm, goes back to sleep. Gets paged again at 4:15 AM. Another false alarm.
Stage 4: Normalized Misery
Engineers stop expecting sleep during on-call weeks. They plan their lives around the rotation — canceling plans, warning partners, sleeping with their phones. The team collectively agrees "this is just how it is." New hires are warned: "On-call sucks here, but every company is the same."
Every company is not the same. I know this because I've seen teams where on-call weeks are boring — where the pager rarely goes off, where runbooks cover 90% of scenarios, and where engineers don't dread their rotation. Those teams exist. They just don't get written about because "our on-call is fine" isn't a compelling blog post.
Stage 5: People Leave
The best engineers leave first because they have the most options. The remaining team inherits their on-call burden, making each rotation worse. Hiring slows because candidates ask about on-call culture in interviews and don't like what they hear. The spiral accelerates.
I've watched this entire sequence play out in under 18 months.
The worst part? Management often doesn't see it happening. Alert volume isn't tracked in executive dashboards. On-call burden isn't discussed in leadership meetings. The first signal leadership gets is when a senior engineer gives notice. By then, the damage is done — you've lost institutional knowledge, the remaining team is more overloaded, and hiring a replacement takes months.
On-call burnout isn't loud. It's quiet. Engineers don't usually complain publicly. They just start updating their LinkedIn profiles and taking recruiter calls during lunch.
What "Good" Actually Looks Like
Good on-call exists. I've seen it. It's not magical — it's just intentional.
Rotation Design
Engineers should be on-call no more than 1 week every 6-8 weeks. Anything more frequent leads to fatigue. A 2-person rotation is not a rotation — it's a burden split. You need at least 5-6 people to maintain a healthy weekly rotation.
The Follow-the-Sun model — where teams in different time zones hand off shifts so everyone works during daytime hours — is the gold standard. If you can't do that, at minimum:
- Limit on-call shifts to 7 days maximum
- Guarantee a minimum rest period between rotations
- Reduce sprint commitments by 30-40% during on-call weeks so incident work has dedicated capacity (not 100% of normal workload plus on-call)
- Provide compensatory time off after heavy on-call weeks
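The rotation constraints above are easy to check mechanically. Here's a minimal Python sketch (the team names and the `build_rotation` helper are hypothetical, not from any tool) that builds a weekly round-robin and refuses rotations too small to guarantee the rest period:

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks, min_rest_weeks=5):
    """Weekly round-robin; a rotation of N people gives each engineer
    N-1 weeks of rest between shifts, so N must exceed min_rest_weeks."""
    if len(engineers) <= min_rest_weeks:
        raise ValueError("Rotation too small to guarantee the rest period")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Backup is the next person up, so juniors always have a second pair of eyes
        backup = engineers[(week + 1) % len(engineers)]
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, primary, backup))
    return schedule

team = ["ana", "ben", "chloe", "dev", "emil", "fatima"]  # 6-person rotation
for shift_start, primary, backup in build_rotation(team, date(2026, 1, 5), 8):
    print(shift_start, primary, backup)
```

With six people, each engineer is primary one week in six, which satisfies the 1-week-in-6-to-8 guideline; a 2- or 3-person list fails loudly instead of quietly producing a burden split.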
Alert Quality Over Quantity
The single most impactful change you can make: ruthlessly delete alerts.
If an alert fired 50 times last quarter and never required action, delete it. If an alert fires and the runbook says "acknowledge and ignore unless it persists for 30 minutes," change the threshold or add a delay. If an alert goes to the team channel instead of a specific person, it's not an alert — it's noise.
As Charity Majors wrote: the most durable strategy for on-call burnout prevention is reducing the number of incidents that require human response in the first place. This means:
- Every alert must be actionable. If there's nothing to do, delete it.
- Every alert must have a runbook. If the response isn't documented, the alert is broken.
- Review alert volume monthly. If total pages per week are trending up, that's a management problem, not an engineering problem.
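The rules above can be applied mechanically to a page-history export. A rough sketch, assuming rows shaped like `csv.DictReader` output with `alert_name` and `required_action` columns (both column names are assumptions about your tooling, not a real export schema):

```python
from collections import Counter

def audit_alerts(rows):
    """Flag alerts that fire constantly but never need a human."""
    fired = Counter()
    acted_on = Counter()
    for row in rows:
        name = row["alert_name"]
        fired[name] += 1
        if row["required_action"] == "yes":
            acted_on[name] += 1
    delete, keep = [], []
    for name, count in fired.items():
        # Fired 50+ times with zero human action: candidate for deletion
        if count >= 50 and acted_on[name] == 0:
            delete.append(name)
        else:
            keep.append(name)
    return delete, keep
```

Feed it any iterable of dicts; everything on the delete list matches the "fired 50 times last quarter and never required action" rule, and the keep list is what still deserves a runbook review.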
Compensation and Recognition
Engineering teams that treat on-call as an informal obligation — without compensation, time back, or acknowledgment — send a clear message that engineers' time outside business hours doesn't matter.
The approaches that work:
| Compensation Model | How It Works | Where I've Seen It |
|---|---|---|
| Flat stipend | $500-$1,500/week for being on-call | Mid-size startups |
| Per-page bonus | $50-$200 per after-hours page | Some enterprise teams |
| Comp time | Day off after each on-call week | Common at European companies |
| Reduced workload | 30-40% less sprint work during on-call | Google SRE model |
| Hybrid | Stipend + comp time | Best-in-class companies |
Gergely Orosz's research shows that healthy on-call practices correlate strongly with team retention. The specific compensation model matters less than consistency and transparency — engineers need to know the program is fair and the organization recognizes the burden.
The Incident Response Rebuild: Step by Step
After the departure I described, I had to rebuild our incident management. Here's what we did.
Step 1: Audit Everything (Week 1-2)
We pulled every alert from the last 90 days and categorized them:
```shell
# Quick PagerDuty audit via API
curl -s "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-03-31" \
  -H "Authorization: Token token=YOUR_TOKEN" \
  | jq '.incidents | group_by(.service.summary) |
        map({service: .[0].service.summary, count: length}) |
        sort_by(-.count)'
```
We found:
- 847 total alerts in 90 days
- Top 3 services generated 72% of all alerts
- 61% of alerts were auto-resolved before anyone looked at them
- Only 23 alerts (2.7%) were actual incidents requiring human intervention
We deleted 340 alerts that day. Nobody noticed.
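If you'd rather crunch the export offline, the same breakdown falls out of a few lines of Python. A rough sketch; the sample data mirrors the shape of the API response, but treat the field paths as assumptions about your export:

```python
from collections import Counter

def summarize(incidents):
    """Group incidents by service and measure how concentrated the noise is."""
    by_service = Counter(i["service"]["summary"] for i in incidents)
    total = sum(by_service.values())
    top3 = by_service.most_common(3)
    top3_share = sum(count for _, count in top3) / total
    return total, top3, top3_share

# Tiny hand-built sample in the same shape as the API response
sample = [{"service": {"summary": s}}
          for s in ["payments"] * 5 + ["search"] * 3 + ["batch"] * 2]
total, top3, share = summarize(sample)
print(f"{total} incidents; top 3 services account for {share:.0%}")
```

In our real data this is how we found that three services generated 72% of all alerts: the Counter output makes the concentration impossible to argue with.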
Step 2: Build Severity Levels (Week 2)
We implemented a simple severity framework based on Google's SRE practices:
| Severity | Definition | Response | Example |
|---|---|---|---|
| SEV-1 | Complete outage, revenue impacted | All hands, 15-min response | Payment processing down |
| SEV-2 | Major degradation, users affected | Primary on-call + backup, 30-min | Search returning errors for 20% of users |
| SEV-3 | Minor issue, workaround exists | On-call during business hours | Batch job delayed, no user impact |
| SEV-4 | Cosmetic or non-urgent | Next business day, ticket only | Dashboard showing stale data |
The key insight: SEV-3 and SEV-4 never page anyone outside business hours. Before this change, every alert was implicitly treated as SEV-1.
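That routing rule is small enough to express directly in the paging layer. A sketch; the business-hours window and the function name are assumptions, not from any specific tool:

```python
from datetime import datetime

BUSINESS_HOURS = range(9, 18)  # 09:00-17:59 local time, an assumed window

def should_page_now(severity: int, now: datetime) -> bool:
    """SEV-1/2 page immediately; SEV-3 waits for business hours;
    SEV-4 never pages (next-business-day ticket only)."""
    if severity <= 2:
        return True
    if severity == 3:
        # Weekday (Mon-Fri) and inside the business-hours window
        return now.weekday() < 5 and now.hour in BUSINESS_HOURS
    return False

# A 3 AM SEV-1 pages; a 3 AM SEV-3 does not
print(should_page_now(1, datetime(2026, 3, 3, 3, 0)))  # True
print(should_page_now(3, datetime(2026, 3, 3, 3, 0)))  # False
```

The point isn't the code, it's that the policy is explicit: before our change, "every alert pages now" was the implicit default, and nobody had ever written the rule down.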
Step 3: Write Runbooks for Everything (Week 3-4)
Every alert that survived the audit got a runbook. Runbooks dramatically reduce MTTR and lower cognitive load, especially for junior engineers or anyone new to the system.
Our template:
```yaml
# runbook-template.yaml
alert_name: "payment_processing_error_rate_high"
severity: SEV-2
description: "Payment error rate exceeds 2% over 5 minutes"
first_responder_actions:
  - Check Grafana dashboard: [link]
  - Check Stripe status page: [link]
  - If Stripe is down: Update status page, no action needed
  - If our side: Check payment-service logs for stack traces
escalation:
  - After 15 min without resolution: page payments team lead
  - After 30 min: page engineering manager
known_causes:
  - "Connection pool exhaustion": restart payment-service pods
  - "Stripe rate limiting": reduce batch size in config
  - "Database timeout": check pg_stat_activity for locks
```
Writing runbooks feels tedious. I know. But they pay for themselves the first time a junior engineer resolves a SEV-2 at 2 AM without escalating, because the runbook told them exactly what to check and exactly what to do. That's the difference between "this on-call week was fine" and "I called the team lead at 2 AM and everyone was grumpy the next day."
We wrote 34 runbooks in two weeks. It was brutal. But in the following quarter, our escalation rate dropped from 65% to 18%.
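One thing that kept the runbooks from rotting: we enforced "every alert has a runbook" in CI. A sketch of that check, validating a runbook (already parsed from YAML) against the template's fields; the required-field list and rules here are our choices, not a standard:

```python
REQUIRED_FIELDS = {"alert_name", "severity", "description",
                   "first_responder_actions", "escalation"}

def validate_runbook(runbook: dict) -> list:
    """Return a list of problems; an empty list means the runbook passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - runbook.keys()]
    if not runbook.get("first_responder_actions"):
        # No first-responder steps means the alert fails the actionability rule
        problems.append("no first-responder actions: alert is not actionable")
    return problems

rb = {"alert_name": "payment_processing_error_rate_high",
      "severity": "SEV-2",
      "description": "Payment error rate exceeds 2% over 5 minutes",
      "first_responder_actions": ["Check Grafana dashboard"],
      "escalation": ["After 15 min: page payments team lead"]}
print(validate_runbook(rb))  # []
```

Run it over the runbook directory in CI and a new alert without a runbook fails the build, which is exactly the forcing function "every alert must have a runbook" needs.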
Step 4: Implement Blameless Postmortems (Week 4)
Blameless postmortems originated in healthcare and aviation — industries where mistakes can be fatal. The principle is simple: focus on systemic causes, not individual blame.
This matters for retention more than anything else. When a culture of finger pointing prevails, people don't bring issues to light for fear of punishment. Engineers who feel psychologically safe take smarter risks and act sooner. The ones who don't? They keep quiet, let issues fester, and eventually quit.
Our postmortem format:
- Timeline — What happened, when, in chronological order
- Impact — Who was affected, for how long, and how
- Root cause — The systemic reason, not "John pushed a bad deploy"
- Contributing factors — What made it worse or delayed recovery
- Action items — Specific, assigned, and deadline-bound
- What went well — Always include this. Recovery matters too.
Step 5: Evaluate Tooling
We evaluated our tooling stack. OpsGenie is being sunset (stopped new sales June 2025, full shutdown April 2027), so if you're on it, start migrating now.
| Tool | Best For | Price Point | Key Feature |
|---|---|---|---|
| PagerDuty | Large orgs with complex routing | $$$ | ML-powered incident suggestions |
| incident.io | Slack-native teams | $$ | Chat-native incident management |
| Rootly | Teams wanting automation | $$ | Automated retrospectives |
| FireHydrant | Process-heavy teams | $$ | Compliance-friendly workflows |
| Better Stack | Small teams, simple needs | $ | Monitoring + alerting + status pages |
Teams save 30-60% on total cost by moving from PagerDuty's add-on model to newer tools. For teams under 50 engineers, PagerDuty is often more tool than you need.
Step 6: Measure and Iterate (Ongoing)
Track the DORA metrics alongside on-call health metrics:
| Metric | What It Tells You | Target |
|---|---|---|
| Pages per week | Alert quality | Fewer than 10 actionable pages |
| MTTA (Mean Time to Acknowledge) | Response readiness | Under 5 minutes |
| MTTR (Mean Time to Resolve) | Resolution effectiveness | Under 60 min for SEV-1 |
| After-hours pages | Night disruption | Trending down month over month |
| Deployment frequency | Delivery velocity | Multiple times per day (elite) |
| Change failure rate | Deploy safety | Under 15% (elite) |
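MTTA and MTTR fall straight out of incident timestamps, so there's no excuse not to track them. A sketch, assuming each incident record carries ISO-8601 `created`, `acknowledged`, and `resolved` fields (the field names are assumptions about your export):

```python
from datetime import datetime
from statistics import mean

def mtta_mttr(incidents):
    """Mean time to acknowledge / resolve, in minutes."""
    def minutes(start, end):
        delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        return delta.total_seconds() / 60
    mtta = mean(minutes(i["created"], i["acknowledged"]) for i in incidents)
    mttr = mean(minutes(i["created"], i["resolved"]) for i in incidents)
    return mtta, mttr

sample = [
    {"created": "2026-03-01T02:00:00", "acknowledged": "2026-03-01T02:04:00",
     "resolved": "2026-03-01T02:38:00"},
    {"created": "2026-03-02T14:00:00", "acknowledged": "2026-03-02T14:02:00",
     "resolved": "2026-03-02T14:44:00"},
]
mtta, mttr = mtta_mttr(sample)
print(f"MTTA {mtta:.0f} min, MTTR {mttr:.0f} min")  # MTTA 3 min, MTTR 41 min
```

Compute it weekly, per severity level, and put the trend line in front of whoever owns the on-call program; a number nobody looks at is a number nobody improves.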
Lowe's reduced their MTTR by 82% and MTTA by 97% by streamlining their workflow from alerting to blameless postmortems. Those aren't aspirational numbers. They're achievable with basic process discipline.
The Results
Six months after our rebuild, here's where we landed:
- After-hours pages dropped from 47/week to 6/week
- MTTR for SEV-1 incidents went from 2+ hours to 38 minutes
- Zero engineers quit due to on-call burnout (we had lost 3 in the prior year)
- Deployment frequency increased from weekly to 3x daily
- Our hiring pipeline improved — candidates stopped asking "how bad is on-call?" because current engineers started saying "it's fine, actually"
None of this required new technology. We used the same monitoring stack (Grafana, Prometheus, PagerDuty). The change was entirely structural: fewer alerts, better severity classification, runbooks, blameless postmortems, and fair compensation.
The most surprising result was deployment frequency. When engineers stopped dreading deploys — because they weren't afraid of getting paged at 3 AM if something went wrong — they started deploying more often. Smaller changes. Lower risk. Faster feedback loops. The reliability improvement fed itself.
This matches the broader DORA data. Elite-performing teams deploy multiple times per day with a change failure rate under 15%. They achieve this not by deploying less carefully, but by building systems where deploying is safe. On-call is a critical part of that safety net. When the net is broken, everyone deploys less, which means bigger changes, which means more incidents. The vicious cycle.
The Anti-Patterns That Keep Showing Up
Before I share my take, let me call out the patterns I see companies repeat despite knowing better.
"We'll fix on-call after we ship this feature." On-call improvement always gets deprioritized. But every month you delay, you're paying the toil tax and increasing attrition risk. It never gets less urgent — it gets more urgent.
"We need AI-powered incident management." No, you need to delete 80% of your alerts and write runbooks for the rest. AI incident correlation is useful at enormous scale. For most teams, the problem is simpler: too many alerts, not enough process.
"On-call compensation would be too expensive." A $1,500/week on-call stipend costs about $78K/year across a 6-person rotation. Losing one senior engineer costs $340K+ in replacement costs. The math is obvious.
"Everyone takes turns, it's fair." Fair doesn't mean equal. A junior engineer shouldn't be on primary for a system they barely understand. Pair them with a senior backup. Use shadowing rotations. Build expertise gradually instead of throwing people into the deep end and calling it fairness.
"Our monitoring is fine, we see everything." Seeing everything is the problem. If you see everything, your on-call engineer sees everything too — including the 97% that doesn't matter. Monitoring is only as good as the alerts it produces, and most alerting configurations prioritize coverage over precision.
What I Actually Think
I think on-call is the most important test of an engineering organization's health. Not architecture. Not tech stack. Not hiring bar. On-call.
Here's why: on-call is where incentives become visible. If management treats reliability as someone else's problem, on-call falls on the same 2-3 heroes. If there's no accountability for noisy alerts, they pile up. If the culture is blame-heavy, postmortems become interrogations and people hide mistakes instead of fixing them.
As Charity Majors put it: it is engineering's responsibility to be on call and own their code. It is management's responsibility to make sure that on-call does not suck. This is a handshake. It goes both ways. And if you don't hold up your end, they should quit and leave you.
That's not rhetoric. That's exactly what happened to us. And it's happening to companies everywhere right now.
The fix isn't complicated. Delete most of your alerts. Build real rotations with 6+ people. Write runbooks. Run blameless postmortems. Compensate on-call fairly. Measure alert volume and MTTR. Hold managers accountable for on-call health the same way you hold engineers accountable for uptime.
The $9.4 million annual cost of developer toil that Runframe reports for large organizations? Most of it is preventable. Not with AI-powered incident management platforms or ML-driven alert correlation. With basic hygiene: fewer alerts, better runbooks, fair rotations.
I interview candidates regularly, and I've started asking every one of them: "Tell me about on-call at your last company." The answers are revealing. Engineers from well-run organizations describe it matter-of-factly — a week every two months, clear runbooks, reasonable compensation. Engineers from broken organizations describe it with visible stress — the 3 AM pages, the impossible rotations, the blame-heavy postmortems.
You can tell more about a company's engineering culture from its on-call practices than from its tech stack, its interview process, or its blog posts about engineering values. On-call is where values meet reality.
On-call doesn't have to destroy your team. But if you're not actively managing it, it will.
Sources
- Runframe — State of Incident Management 2025
- DEV Community — On-Call Burnout: What Incident Data Doesn't Show
- incident.io — Alert Fatigue Solutions for DevOps Teams 2025
- DevOps.com — On-Call Rotation Best Practices
- The New Stack — Is Your On-Call Rotation Burning Out Top Talent?
- Runframe — On-Call Rotation Guide
- Charity Majors — On Call Shouldn't Suck: A Guide for Managers
- Gergely Orosz — Healthy On-Call Practices
- Honeycomb — I Don't Want to Be On Call Anymore
- Google SRE — Postmortem Culture
- Google SRE — Incident Management Guide
- Google Cloud Blog — How Lowe's Improved Incident Response with SRE
- Medium — Blameless Postmortems and Blameless Culture
- Medium — Designing Sustainable On-Call Rotations
- Rootly — On-Call Pay: Compensation Models
- incident.io — 3 Best PagerDuty Alternatives 2025
- DORA — DORA Metrics
- Rootly — Incident Response Metrics Guide
- incident.io — 7 Ways SRE Teams Reduce MTTR