On-Call Destroyed My Team — How We Rebuilt Incident Management From Zero

97% of alerts are noise. 65% of engineers report burnout. We lost 3 engineers to bad on-call. Here's how we rebuilt incident management from scratch.

The best infrastructure engineer I've ever worked with quit on a Tuesday. No drama, no counteroffer negotiation. He just walked into our CTO's office, said "I can't do this anymore," and put in his two weeks. He'd been paged 3-4 times per night for six weeks straight because he was one of two people on the rotation, and the only one who could debug our payment system.

We lost $340K in fully-loaded hiring and onboarding costs to replace him. The payment system had three more incidents in the first week after he left. And the whole thing was avoidable.

65% of engineers report currently experiencing burnout. On-call is the single biggest contributor I've seen. Not because on-call is inherently bad — it's because most companies do it in a way that slowly destroys their teams.


The Numbers Behind the Burnout

On-call has gotten measurably worse, not better, despite a decade of SRE evangelism.

| Metric | Value | Source |
|---|---|---|
| Operational toil (% of dev time) | 30% (up from 25%) | Runframe 2025 |
| Devs spending 30%+ time on toil | 78% | Runframe 2025 |
| Engineers working 40+ hrs/week | 88% | Runframe 2025 |
| Weekly alerts received per team | 2,000+ | incident.io |
| Alerts needing immediate action | 3% | incident.io |
| Orgs with outages from ignored alerts | 73% | incident.io |
| SREs handling 5+ incidents/month | 46% | Runframe 2025 |
| Cost of replacing a senior engineer | ~$340K | Industry estimates |

That alert fatigue stat is the one that kills me. Teams receive over 2,000 alerts weekly, but only 3% need immediate human action. That means 97% of pages are noise. And 73% of organizations have experienced outages because real alerts got lost in the noise.

Operational toil rose to 30% from 25%, marking the first increase in five years. We're going backwards. More tools, more alerts, more dashboards, more exhaustion.


How We Broke It

I've seen the same pattern at multiple companies. Here's how on-call goes from "manageable" to "people are quitting."

Stage 1: The Informal Hero

One or two senior engineers know the system best. When things break, they get called. There's no formal rotation. It works because incidents are rare and the heroes are willing.

Stage 2: The Rotation That Isn't

The team creates a "rotation" with 2-3 people. But it's not a real rotation — it's the same heroes with a shared calendar. When something complex breaks, the on-call person escalates to the hero anyway. The hero is effectively always on-call.

Stage 3: Alert Proliferation

More services, more monitoring, more alerts. Nobody prunes old alerts. A threshold set during a traffic spike in 2023 now fires every Tuesday at 3 AM because the baseline shifted. The on-call engineer wakes up, sees it's a false alarm, goes back to sleep. Gets paged again at 4:15 AM. Another false alarm.

Stage 4: Normalized Misery

Engineers stop expecting sleep during on-call weeks. They plan their lives around the rotation — canceling plans, warning partners, sleeping with their phones. The team collectively agrees "this is just how it is." New hires are warned: "On-call sucks here, but every company is the same."

Every company is not the same. I know this because I've seen teams where on-call weeks are boring — where the pager rarely goes off, where runbooks cover 90% of scenarios, and where engineers don't dread their rotation. Those teams exist. They just don't get written about because "our on-call is fine" isn't a compelling blog post.

Stage 5: People Leave

The best engineers leave first because they have the most options. The remaining team inherits their on-call burden, making each rotation worse. Hiring slows because candidates ask about on-call culture in interviews and don't like what they hear. The spiral accelerates.

I've watched this entire sequence play out in under 18 months.

The worst part? Management often doesn't see it happening. Alert volume isn't tracked in executive dashboards. On-call burden isn't discussed in leadership meetings. The first signal leadership gets is when a senior engineer gives notice. By then, the damage is done — you've lost institutional knowledge, the remaining team is more overloaded, and hiring a replacement takes months.

On-call burnout isn't loud. It's quiet. Engineers don't usually complain publicly. They just start updating their LinkedIn profiles and taking recruiter calls during lunch.


What "Good" Actually Looks Like

Good on-call exists. I've seen it. It's not magical — it's just intentional.

Rotation Design

Engineers should be on-call no more than 1 week every 6-8 weeks. Anything more frequent leads to fatigue. A 2-person rotation is not a rotation — it's a burden split. You need at least 5-6 people to maintain a healthy weekly rotation.

The Follow-the-Sun model — where teams in different time zones hand off shifts so everyone works during daytime hours — is the gold standard. If you can't do that, at minimum:

  • Limit on-call shifts to 7 days maximum
  • Guarantee a minimum rest period between rotations
  • Allocate 30-40% of on-call bandwidth to incident responsibilities (not 100% of normal workload plus on-call)
  • Provide compensatory time off after heavy on-call weeks
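
If you want to sanity-check your own cadence, the math is trivial to script. Here's a throwaway sketch (the roster names are made up) that prints who's primary over the next couple of months and flags a rotation that's really a burden split:

# rotation-check.sh (sketch; roster names are hypothetical)
ROSTER=(alice bob carol dave erin frank)
WEEKS_AHEAD=8
TEAM_SIZE=${#ROSTER[@]}

# Fewer than 5 people means each person is on call too often to recover
if [ "$TEAM_SIZE" -lt 5 ]; then
  echo "WARNING: $TEAM_SIZE people = on-call every $TEAM_SIZE weeks. That's a burden split, not a rotation."
fi

THIS_WEEK=$((10#$(date +%V)))   # ISO week number, forced base-10
for i in $(seq 0 $((WEEKS_AHEAD - 1))); do
  IDX=$(( (THIS_WEEK + i) % TEAM_SIZE ))
  echo "$i week(s) from now: primary = ${ROSTER[$IDX]}"
done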

Alert Quality Over Quantity

The single most impactful change you can make: ruthlessly delete alerts.

If an alert fired 50 times last quarter and never required action, delete it. If an alert fires and the runbook says "acknowledge and ignore unless it persists for 30 minutes," change the threshold or add a delay. If an alert goes to the team channel instead of a specific person, it's not an alert — it's noise.

As Charity Majors wrote: the most durable strategy for on-call burnout prevention is reducing the number of incidents that require human response in the first place. This means:

  • Every alert must be actionable. If there's nothing to do, delete it.
  • Every alert must have a runbook. If the response isn't documented, the alert is broken.
  • Review alert volume monthly. If total pages per week are trending up, that's a management problem, not an engineering problem.
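
Most of this can be enforced in the alerting config itself, not just in policy. As a sketch of the pattern (the metric names and runbook URL are placeholders), a surviving alert might look like this as a Prometheus rule: a real threshold, a "for:" duration so transient blips never page, a severity label, and a runbook link.

# prometheus-rules.yaml (sketch; metric names and URL are placeholders)
groups:
  - name: payment-alerts
    rules:
      - alert: payment_processing_error_rate_high
        expr: |
          sum(rate(payment_requests_total{status="error"}[5m]))
            /
          sum(rate(payment_requests_total[5m])) > 0.02
        for: 10m                 # must persist before firing, so blips never page
        labels:
          severity: sev2         # routed according to the severity levels below
        annotations:
          summary: "Payment error rate above 2% for 10+ minutes"
          runbook: "https://wiki.example.com/runbooks/payment-error-rate"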

Compensation and Recognition

Engineering teams that treat on-call as an informal obligation — without compensation, time back, or acknowledgment — send a clear message that engineers' time outside business hours doesn't matter.

The approaches that work:

| Compensation Model | How It Works | Where I've Seen It |
|---|---|---|
| Flat stipend | $500-$1,500/week for being on-call | Mid-size startups |
| Per-page bonus | $50-$200 per after-hours page | Some enterprise teams |
| Comp time | Day off after each on-call week | Common at European companies |
| Reduced workload | 30-40% less sprint work during on-call | Google SRE model |
| Hybrid | Stipend + comp time | Best-in-class companies |

Gergely Orosz's research shows that healthy on-call practices correlate strongly with team retention. The specific compensation model matters less than consistency and transparency — engineers need to know the program is fair and the organization recognizes the burden.


The Incident Response Rebuild: Step by Step

After the departure I described above, I had to rebuild our incident management. Here's what we did.

Step 1: Audit Everything (Week 1-2)

We pulled every alert from the last 90 days and categorized them:

# Quick PagerDuty audit via API
# Note: the API paginates (at most 100 incidents per request), so loop over the
# offset parameter to cover the full window.
curl -s "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-03-31&limit=100" \
  -H "Authorization: Token token=YOUR_TOKEN" \
  | jq '.incidents
        | group_by(.service.summary)
        | map({service: .[0].service.summary, count: length})
        | sort_by(-.count)'

We found:

  • 847 total alerts in 90 days
  • Top 3 services generated 72% of all alerts
  • 61% of alerts were auto-resolved before anyone looked at them
  • Only 23 alerts (2.7%) were actual incidents requiring human intervention

We deleted 340 alerts that day. Nobody noticed.

Step 2: Build Severity Levels (Week 2)

We implemented a simple severity framework based on Google's SRE practices:

| Severity | Definition | Response | Example |
|---|---|---|---|
| SEV-1 | Complete outage, revenue impacted | All hands, 15-min response | Payment processing down |
| SEV-2 | Major degradation, users affected | Primary on-call + backup, 30-min | Search returning errors for 20% of users |
| SEV-3 | Minor issue, workaround exists | On-call during business hours | Batch job delayed, no user impact |
| SEV-4 | Cosmetic or non-urgent | Next business day, ticket only | Dashboard showing stale data |

The key insight: SEV-3 and SEV-4 never page anyone outside business hours. Before this change, every alert was implicitly treated as SEV-1.
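
The cleanest way to make that stick is to enforce it at the routing layer instead of trusting people to ignore pages. A rough Alertmanager sketch (receiver names are placeholders, and the exact schema depends on your Alertmanager version): SEV-1/SEV-2 always page, SEV-3/SEV-4 are muted outside business hours.

# alertmanager.yml (excerpt; receiver names are placeholders)
time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "18:00"
            end_time: "24:00"
      - weekdays: ["saturday", "sunday"]

route:
  receiver: slack-triage                     # default: no page
  routes:
    - matchers: ['severity =~ "sev1|sev2"']
      receiver: pagerduty-oncall             # real pages, any hour
    - matchers: ['severity =~ "sev3|sev4"']
      receiver: slack-triage
      mute_time_intervals:
        - outside-business-hours             # never notifies after hours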

Step 3: Write Runbooks for Everything (Week 3-4)

Every alert that survived the audit got a runbook. Runbooks dramatically reduce MTTR and lower cognitive load, especially for junior engineers or anyone new to the system.

Our template:

# runbook-template.yaml
alert_name: "payment_processing_error_rate_high"
severity: SEV-2
description: "Payment error rate exceeds 2% over 5 minutes"

first_responder_actions:
  - Check Grafana dashboard: [link]
  - Check Stripe status page: [link]
  - If Stripe is down: Update status page, no action needed
  - If our side: Check payment-service logs for stack traces
  
escalation:
  - After 15 min without resolution: page payments team lead
  - After 30 min: page engineering manager
  
known_causes:
  - "Connection pool exhaustion": restart payment-service pods
  - "Stripe rate limiting": reduce batch size in config
  - "Database timeout": check pg_stat_activity for locks

Writing runbooks feels tedious. I know. But they pay for themselves the first time a junior engineer resolves a SEV-2 at 2 AM without escalating, because the runbook told them exactly what to check and exactly what to do. That's the difference between "this on-call week was fine" and "I called the team lead at 2 AM and everyone was grumpy the next day."

We wrote 34 runbooks in two weeks. It was brutal. But in the following quarter, our escalation rate dropped from 65% to 18%.

Step 4: Implement Blameless Postmortems (Week 4)

Blameless postmortems originated in healthcare and aviation — industries where mistakes can be fatal. The principle is simple: focus on systemic causes, not individual blame.

This matters for retention more than anything else. When a culture of finger pointing prevails, people don't bring issues to light for fear of punishment. Engineers who feel psychologically safe take smarter risks and act sooner. The ones who don't? They keep quiet, let issues fester, and eventually quit.

Our postmortem format:

  1. Timeline — What happened, when, in chronological order
  2. Impact — Who was affected, for how long, and how
  3. Root cause — The systemic reason, not "John pushed a bad deploy"
  4. Contributing factors — What made it worse or delayed recovery
  5. Action items — Specific, assigned, and deadline-bound
  6. What went well — Always include this. Recovery matters too.

Step 5: Choose Better Tools (Week 5-6)

We evaluated our tooling stack. OpsGenie is being sunset (stopped new sales June 2025, full shutdown April 2027), so if you're on it, start migrating now.

| Tool | Best For | Price Point | Key Feature |
|---|---|---|---|
| PagerDuty | Large orgs with complex routing | $$$ | ML-powered incident suggestions |
| incident.io | Slack-native teams | $$ | Chat-native incident management |
| Rootly | Teams wanting automation | $$ | Automated retrospectives |
| FireHydrant | Process-heavy teams | $$ | Compliance-friendly workflows |
| Better Stack | Small teams, simple needs | $ | Monitoring + alerting + status pages |

Teams save 30-60% on total cost by moving from PagerDuty's add-on model to newer tools. For teams under 50 engineers, PagerDuty is often more tool than you need.

Step 6: Measure and Iterate (Ongoing)

Track the DORA metrics alongside on-call health metrics:

| Metric | What It Tells You | Target |
|---|---|---|
| Pages per week | Alert quality | Fewer than 10 actionable pages |
| MTTA (Mean Time to Acknowledge) | Response readiness | Under 5 minutes |
| MTTR (Mean Time to Resolve) | Resolution effectiveness | Under 60 min for SEV-1 |
| After-hours pages | Night disruption | Trending down month over month |
| Deployment frequency | Delivery velocity | Multiple times per day (elite) |
| Change failure rate | Deploy safety | Under 15% (elite) |

Lowe's reduced their MTTR by 82% and MTTA by 97% by streamlining their workflow from alerting to blameless postmortems. Those aren't aspirational numbers. They're achievable with basic process discipline.
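
None of these metrics need a vendor dashboard to compute. Here's a rough sketch using jq over an incident export; the field names (created_at, acknowledged_at, resolved_at) are assumptions, so use whatever your export or API actually calls them.

# mttr-sketch.sh (rough sketch; field names depend on your incident export)
# Expects incidents.json: an array of objects with ISO-8601 UTC timestamps, e.g.
# [{"created_at": "...", "acknowledged_at": "...", "resolved_at": "..."}, ...]
jq -r '
  def secs(f): (f | sub("\\.[0-9]+"; "") | fromdateiso8601);   # strip fractional seconds, parse
  [ .[] | select(.resolved_at != null)
        | { tta: (secs(.acknowledged_at) - secs(.created_at)),
            ttr: (secs(.resolved_at)     - secs(.created_at)) } ]
  | { incidents: length,
      mtta_minutes: (([.[].tta] | add / length) / 60 | floor),
      mttr_minutes: (([.[].ttr] | add / length) / 60 | floor) }
' incidents.json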


The Results

Six months after our rebuild, here's where we landed:

  • After-hours pages dropped from 47/week to 6/week
  • MTTR for SEV-1 incidents went from 2+ hours to 38 minutes
  • Zero engineers quit due to on-call burnout (we had lost 3 in the prior year)
  • Deployment frequency increased from weekly to 3x daily
  • Our hiring pipeline improved — candidates stopped asking "how bad is on-call?" because current engineers started saying "it's fine, actually"

None of this required new technology. We used the same monitoring stack (Grafana, Prometheus, PagerDuty). The change was entirely structural: fewer alerts, better severity classification, runbooks, blameless postmortems, and fair compensation.

The most surprising result was deployment frequency. When engineers stopped dreading deploys — because they weren't afraid of getting paged at 3 AM if something went wrong — they started deploying more often. Smaller changes. Lower risk. Faster feedback loops. The reliability improvement fed itself.

This matches the broader DORA data. Elite-performing teams deploy multiple times per day with a change failure rate under 5%. They achieve this not by deploying less carefully, but by building systems where deploying is safe. On-call is a critical part of that safety net. When the net is broken, everyone deploys less, which means bigger changes, which means more incidents. The vicious cycle.


The Anti-Patterns That Keep Showing Up

Before I share my take, let me call out the patterns I see companies repeat despite knowing better.

"We'll fix on-call after we ship this feature." On-call improvement always gets deprioritized. But every month you delay, you're paying the toil tax and increasing attrition risk. It never gets less urgent — it gets more urgent.

"We need AI-powered incident management." No, you need to delete 80% of your alerts and write runbooks for the rest. AI incident correlation is useful at enormous scale. For most teams, the problem is simpler: too many alerts, not enough process.

"On-call compensation would be too expensive." A $1,500/week on-call stipend costs about $78K/year across a 6-person rotation. Losing one senior engineer costs $340K+ in replacement costs. The math is obvious.

"Everyone takes turns, it's fair." Fair doesn't mean equal. A junior engineer shouldn't be on primary for a system they barely understand. Pair them with a senior backup. Use shadowing rotations. Build expertise gradually instead of throwing people into the deep end and calling it fairness.

"Our monitoring is fine, we see everything." Seeing everything is the problem. If you see everything, your on-call engineer sees everything too — including the 97% that doesn't matter. Monitoring is only as good as the alerts it produces, and most alerting configurations prioritize coverage over precision.


What I Actually Think

I think on-call is the most important test of an engineering organization's health. Not architecture. Not tech stack. Not hiring bar. On-call.

Here's why: on-call is where incentives become visible. If management treats reliability as someone else's problem, on-call falls on the same 2-3 heroes. If there's no accountability for noisy alerts, they pile up. If the culture is blame-heavy, postmortems become interrogations and people hide mistakes instead of fixing them.

As Charity Majors put it: it is engineering's responsibility to be on call and own their code. It is management's responsibility to make sure that on-call does not suck. This is a handshake. It goes both ways. And if you don't hold up your end, they should quit and leave you.

That's not rhetoric. That's exactly what happened to us. And it's happening to companies everywhere right now.

The fix isn't complicated. Delete most of your alerts. Build real rotations with 6+ people. Write runbooks. Run blameless postmortems. Compensate on-call fairly. Measure alert volume and MTTR. Hold managers accountable for on-call health the same way you hold engineers accountable for uptime.

The $9.4 million annual cost of developer toil that Runframe reports for large organizations? Most of it is preventable. Not with AI-powered incident management platforms or ML-driven alert correlation. With basic hygiene: fewer alerts, better runbooks, fair rotations.

I interview candidates regularly, and I've started asking every one of them: "Tell me about on-call at your last company." The answers are revealing. Engineers from well-run organizations describe it matter-of-factly — a week every two months, clear runbooks, reasonable compensation. Engineers from broken organizations describe it with visible stress — the 3 AM pages, the impossible rotations, the blame-heavy postmortems.

You can tell more about a company's engineering culture from its on-call practices than from its tech stack, its interview process, or its blog posts about engineering values. On-call is where values meet reality.

On-call doesn't have to destroy your team. But if you're not actively managing it, it will.


Sources

  1. Runframe — State of Incident Management 2025
  2. DEV Community — On-Call Burnout: What Incident Data Doesn't Show
  3. incident.io — Alert Fatigue Solutions for DevOps Teams 2025
  4. DevOps.com — On-Call Rotation Best Practices
  5. The New Stack — Is Your On-Call Rotation Burning Out Top Talent?
  6. Runframe — On-Call Rotation Guide
  7. Charity Majors — On Call Shouldn't Suck: A Guide for Managers
  8. Gergely Orosz — Healthy On-Call Practices
  9. Honeycomb — I Don't Want to Be On Call Anymore
  10. Google SRE — Postmortem Culture
  11. Google SRE — Incident Management Guide
  12. Google Cloud Blog — How Lowe's Improved Incident Response with SRE
  13. Medium — Blameless Postmortems and Blameless Culture
  14. Medium — Designing Sustainable On-Call Rotations
  15. Rootly — On-Call Pay: Compensation Models
  16. incident.io — 3 Best PagerDuty Alternatives 2025
  17. DORA — DORA Metrics
  18. Rootly — Incident Response Metrics Guide
  19. incident.io — 7 Ways SRE Teams Reduce MTTR