On-Call Destroyed My Team — How We Rebuilt Incident Management From Zero

97% of alerts are noise. 65% of engineers report burnout. We lost 3 engineers to bad on-call. Here's how we rebuilt incident management from scratch.

The best infrastructure engineer I've ever worked with quit on a Tuesday. No drama, no counteroffer negotiation. He just walked into our CTO's office, said "I can't do this anymore," and put in his two weeks. He'd been paged 3-4 times per night for six weeks straight because he was one of two people on the rotation, and the only one who could debug our payment system.

We lost $340K in fully-loaded hiring and onboarding costs to replace him. The payment system had three more incidents in the first week after he left. And the whole thing was avoidable.

65% of engineers report currently experiencing burnout. On-call is the single biggest contributor I've seen. Not because on-call is inherently bad — it's because most companies do it in a way that slowly destroys their teams.


The Numbers Behind the Burnout

On-call has gotten measurably worse, not better, despite a decade of SRE evangelism.

| Metric | Value | Source |
|---|---|---|
| Operational toil (% of dev time) | 30% (up from 25%) | Runframe 2025 |
| Devs spending 30%+ time on toil | 78% | Runframe 2025 |
| Engineers working 40+ hrs/week | 88% | Runframe 2025 |
| Weekly alerts received per team | 2,000+ | incident.io |
| Alerts needing immediate action | 3% | incident.io |
| Orgs with outages from ignored alerts | 73% | incident.io |
| SREs handling 5+ incidents/month | 46% | Runframe 2025 |
| Cost of replacing a senior engineer | ~$340K | Industry estimates |

That alert fatigue stat is the one that kills me. Teams receive over 2,000 alerts weekly, but only 3% need immediate human action. That means 97% of pages are noise. And 73% of organizations have experienced outages because real alerts got lost in the noise.

Operational toil rose to 30% from 25%, marking the first increase in five years. We're going backwards. More tools, more alerts, more dashboards, more exhaustion.


How We Broke It

I've seen the same pattern at multiple companies. Here's how on-call goes from "manageable" to "people are quitting."

Stage 1: The Informal Hero

One or two senior engineers know the system best. When things break, they get called. There's no formal rotation. It works because incidents are rare and the heroes are willing.

Stage 2: The Rotation That Isn't

The team creates a "rotation" with 2-3 people. But it's not a real rotation — it's the same heroes with a shared calendar. When something complex breaks, the on-call person escalates to the hero anyway. The hero is effectively always on-call.

Stage 3: Alert Proliferation

More services, more monitoring, more alerts. Nobody prunes old alerts. A threshold set during a traffic spike in 2023 now fires every Tuesday at 3 AM because the baseline shifted. The on-call engineer wakes up, sees it's a false alarm, goes back to sleep. Gets paged again at 4:15 AM. Another false alarm.

Stage 4: Normalized Misery

Engineers stop expecting sleep during on-call weeks. They plan their lives around the rotation — canceling plans, warning partners, sleeping with their phones. The team collectively agrees "this is just how it is." New hires are warned: "On-call sucks here, but every company is the same."

Every company is not the same. I know this because I've seen teams where on-call weeks are boring — where the pager rarely goes off, where runbooks cover 90% of scenarios, and where engineers don't dread their rotation. Those teams exist. They just don't get written about because "our on-call is fine" isn't a compelling blog post.

Stage 5: People Leave

The best engineers leave first because they have the most options. The remaining team inherits their on-call burden, making each rotation worse. Hiring slows because candidates ask about on-call culture in interviews and don't like what they hear. The spiral accelerates.

I've watched this entire sequence play out in under 18 months.

The worst part? Management often doesn't see it happening. Alert volume isn't tracked in executive dashboards. On-call burden isn't discussed in leadership meetings. The first signal leadership gets is when a senior engineer gives notice. By then, the damage is done — you've lost institutional knowledge, the remaining team is more overloaded, and hiring a replacement takes months.

On-call burnout isn't loud. It's quiet. Engineers don't usually complain publicly. They just start updating their LinkedIn profiles and taking recruiter calls during lunch.


What "Good" Actually Looks Like

Good on-call exists. I've seen it. It's not magical — it's just intentional.

Rotation Design

Engineers should be on-call no more than 1 week every 6-8 weeks. Anything more frequent leads to fatigue. A 2-person rotation is not a rotation — it's a burden split. You need at least 5-6 people to maintain a healthy weekly rotation.

The Follow-the-Sun model — where teams in different time zones hand off shifts so everyone works during daytime hours — is the gold standard. If you can't do that, at minimum:

  • Limit on-call shifts to 7 days maximum
  • Guarantee a minimum rest period between rotations
  • Allocate 30-40% of on-call bandwidth to incident responsibilities (not 100% of normal workload plus on-call)
  • Provide compensatory time off after heavy on-call weeks
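
If you want to sanity-check your own cadence, the math is trivial to script. Here's a throwaway sketch (the roster names are made up) that prints who's primary over the next couple of months and flags a rotation that's really a burden split:

# rotation-check.sh (sketch; roster names are hypothetical)
ROSTER=(alice bob carol dave erin frank)
WEEKS_AHEAD=8
TEAM_SIZE=${#ROSTER[@]}

# Fewer than 5 people means each person is on call too often to recover
if [ "$TEAM_SIZE" -lt 5 ]; then
  echo "WARNING: $TEAM_SIZE people = on-call every $TEAM_SIZE weeks. That's a burden split, not a rotation."
fi

THIS_WEEK=$((10#$(date +%V)))   # ISO week number, forced base-10
for i in $(seq 0 $((WEEKS_AHEAD - 1))); do
  IDX=$(( (THIS_WEEK + i) % TEAM_SIZE ))
  echo "$i week(s) from now: primary = ${ROSTER[$IDX]}"
done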

Alert Quality Over Quantity

The single most impactful change you can make: ruthlessly delete alerts.

If an alert fired 50 times last quarter and never required action, delete it. If an alert fires and the runbook says "acknowledge and ignore unless it persists for 30 minutes," change the threshold or add a delay. If an alert goes to the team channel instead of a specific person, it's not an alert — it's noise.

As Charity Majors wrote: the most durable strategy for on-call burnout prevention is reducing the number of incidents that require human response in the first place. This means:

  • Every alert must be actionable. If there's nothing to do, delete it.
  • Every alert must have a runbook. If the response isn't documented, the alert is broken.
  • Review alert volume monthly. If total pages per week are trending up, that's a management problem, not an engineering problem.
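
Most of this can be enforced in the alerting config itself, not just in policy. As a sketch of the pattern (the metric names and runbook URL are placeholders), a surviving alert might look like this as a Prometheus rule: a real threshold, a "for:" duration so transient blips never page, a severity label, and a runbook link.

# prometheus-rules.yaml (sketch; metric names and URL are placeholders)
groups:
  - name: payment-alerts
    rules:
      - alert: payment_processing_error_rate_high
        expr: |
          sum(rate(payment_requests_total{status="error"}[5m]))
            /
          sum(rate(payment_requests_total[5m])) > 0.02
        for: 10m                 # must persist before firing, so blips never page
        labels:
          severity: sev2         # routed according to the severity levels below
        annotations:
          summary: "Payment error rate above 2% for 10+ minutes"
          runbook: "https://wiki.example.com/runbooks/payment-error-rate"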

Compensation and Recognition

Engineering teams that treat on-call as an informal obligation — without compensation, time back, or acknowledgment — send a clear message that engineers' time outside business hours doesn't matter.

The approaches that work:

| Compensation Model | How It Works | Where I've Seen It |
|---|---|---|
| Flat stipend | $500-$1,500/week for being on-call | Mid-size startups |
| Per-page bonus | $50-$200 per after-hours page | Some enterprise teams |
| Comp time | Day off after each on-call week | Common at European companies |
| Reduced workload | 30-40% less sprint work during on-call | Google SRE model |
| Hybrid | Stipend + comp time | Best-in-class companies |

Gergely Orosz's research shows that healthy on-call practices correlate strongly with team retention. The specific compensation model matters less than consistency and transparency — engineers need to know the program is fair and the organization recognizes the burden.


The Incident Response Rebuild: Step by Step

After the departure I described above, I had to rebuild our incident management. Here's what we did.

Step 1: Audit Everything (Week 1-2)

We pulled every alert from the last 90 days and categorized them:

# Quick PagerDuty audit via API
# Note: the API paginates (at most 100 incidents per request), so loop over the
# offset parameter to cover the full window.
curl -s "https://api.pagerduty.com/incidents?since=2026-01-01&until=2026-03-31&limit=100" \
  -H "Authorization: Token token=YOUR_TOKEN" \
  | jq '.incidents
        | group_by(.service.summary)
        | map({service: .[0].service.summary, count: length})
        | sort_by(-.count)'

We found:

  • 847 total alerts in 90 days
  • Top 3 services generated 72% of all alerts
  • 61% of alerts were auto-resolved before anyone looked at them
  • Only 23 alerts (2.7%) were actual incidents requiring human intervention

We deleted 340 alerts that day. Nobody noticed.

Step 2: Build Severity Levels (Week 2)

We implemented a simple severity framework based on Google's SRE practices:

| Severity | Definition | Response | Example |
|---|---|---|---|
| SEV-1 | Complete outage, revenue impacted | All hands, 15-min response | Payment processing down |
| SEV-2 | Major degradation, users affected | Primary on-call + backup, 30-min | Search returning errors for 20% of users |
| SEV-3 | Minor issue, workaround exists | On-call during business hours | Batch job delayed, no user impact |
| SEV-4 | Cosmetic or non-urgent | Next business day, ticket only | Dashboard showing stale data |

The key insight: SEV-3 and SEV-4 never page anyone outside business hours. Before this change, every alert was implicitly treated as SEV-1.
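
The cleanest way to make that stick is to enforce it at the routing layer instead of trusting people to ignore pages. A rough Alertmanager sketch (receiver names are placeholders, and the exact schema depends on your Alertmanager version): SEV-1/SEV-2 always page, SEV-3/SEV-4 are muted outside business hours.

# alertmanager.yml (excerpt; receiver names are placeholders)
time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "18:00"
            end_time: "24:00"
      - weekdays: ["saturday", "sunday"]

route:
  receiver: slack-triage                     # default: no page
  routes:
    - matchers: ['severity =~ "sev1|sev2"']
      receiver: pagerduty-oncall             # real pages, any hour
    - matchers: ['severity =~ "sev3|sev4"']
      receiver: slack-triage
      mute_time_intervals:
        - outside-business-hours             # never notifies after hours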

Step 3: Write Runbooks for Everything (Week 3-4)

Every alert that survived the audit got a runbook. Runbooks dramatically reduce MTTR and lower cognitive load, especially for junior engineers or anyone new to the system.

Our template:

# runbook-template.yaml
alert_name: "payment_processing_error_rate_high"
severity: SEV-2
description: "Payment error rate exceeds 2% over 5 minutes"

first_responder_actions:
  - Check Grafana dashboard: [link]
  - Check Stripe status page: [link]
  - If Stripe is down: Update status page, no action needed
  - If our side: Check payment-service logs for stack traces
  
escalation:
  - After 15 min without resolution: page payments team lead
  - After 30 min: page engineering manager
  
known_causes:
  - "Connection pool exhaustion": restart payment-service pods
  - "Stripe rate limiting": reduce batch size in config
  - "Database timeout": check pg_stat_activity for locks

Writing runbooks feels tedious. I know. But they pay for themselves the first time a junior engineer resolves a SEV-2 at 2 AM without escalating, because the runbook told them exactly what to check and exactly what to do. That's the difference between "this on-call week was fine" and "I called the team lead at 2 AM and everyone was grumpy the next day."

We wrote 34 runbooks in two weeks. It was brutal. But in the following quarter, our escalation rate dropped from 65% to 18%.

Step 4: Implement Blameless Postmortems (Week 4)

Blameless postmortems originated in healthcare and aviation — industries where mistakes can be fatal. The principle is simple: focus on systemic causes, not individual blame.

This matters for retention more than anything else. When a culture of finger pointing prevails, people don't bring issues to light for fear of punishment. Engineers who feel psychologically safe take smarter risks and act sooner. The ones who don't? They keep quiet, let issues fester, and eventually quit.

Our postmortem format:

  1. Timeline — What happened, when, in chronological order
  2. Impact — Who was affected, for how long, and how
  3. Root cause — The systemic reason, not "John pushed a bad deploy"
  4. Contributing factors — What made it worse or delayed recovery
  5. Action items — Specific, assigned, and deadline-bound
  6. What went well — Always include this. Recovery matters too.

Step 5: Choose Better Tools (Week 5-6)

We evaluated our tooling stack. OpsGenie is being sunset (stopped new sales June 2025, full shutdown April 2027), so if you're on it, start migrating now.

| Tool | Best For | Price Point | Key Feature |
|---|---|---|---|
| PagerDuty | Large orgs with complex routing | $$$ | ML-powered incident suggestions |
| incident.io | Slack-native teams | $$ | Chat-native incident management |
| Rootly | Teams wanting automation | $$ | Automated retrospectives |
| FireHydrant | Process-heavy teams | $$ | Compliance-friendly workflows |
| Better Stack | Small teams, simple needs | $ | Monitoring + alerting + status pages |

Teams save 30-60% on total cost by moving from PagerDuty's add-on model to newer tools. For teams under 50 engineers, PagerDuty is often more tool than you need.

Step 6: Measure and Iterate (Ongoing)

Track the DORA metrics alongside on-call health metrics:

| Metric | What It Tells You | Target |
|---|---|---|
| Pages per week | Alert quality | Fewer than 10 actionable pages |
| MTTA (Mean Time to Acknowledge) | Response readiness | Under 5 minutes |
| MTTR (Mean Time to Resolve) | Resolution effectiveness | Under 60 min for SEV-1 |
| After-hours pages | Night disruption | Trending down month over month |
| Deployment frequency | Delivery velocity | Multiple times per day (elite) |
| Change failure rate | Deploy safety | Under 15% (elite) |

Lowe's reduced their MTTR by 82% and MTTA by 97% by streamlining their workflow from alerting to blameless postmortems. Those aren't aspirational numbers. They're achievable with basic process discipline.
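
None of these metrics need a vendor dashboard to compute. Here's a rough sketch using jq over an incident export; the field names (created_at, acknowledged_at, resolved_at) are assumptions, so use whatever your export or API actually calls them.

# mttr-sketch.sh (rough sketch; field names depend on your incident export)
# Expects incidents.json: an array of objects with ISO-8601 UTC timestamps, e.g.
# [{"created_at": "...", "acknowledged_at": "...", "resolved_at": "..."}, ...]
jq -r '
  def secs(f): (f | sub("\\.[0-9]+"; "") | fromdateiso8601);   # strip fractional seconds, parse
  [ .[] | select(.resolved_at != null)
        | { tta: (secs(.acknowledged_at) - secs(.created_at)),
            ttr: (secs(.resolved_at)     - secs(.created_at)) } ]
  | { incidents: length,
      mtta_minutes: (([.[].tta] | add / length) / 60 | floor),
      mttr_minutes: (([.[].ttr] | add / length) / 60 | floor) }
' incidents.json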


The Results

Six months after our rebuild, here's where we landed:

  • After-hours pages dropped from 47/week to 6/week
  • MTTR for SEV-1 incidents went from 2+ hours to 38 minutes
  • Zero engineers quit due to on-call burnout (we had lost 3 in the prior year)
  • Deployment frequency increased from weekly to 3x daily
  • Our hiring pipeline improved — candidates stopped asking "how bad is on-call?" because current engineers started saying "it's fine, actually"

None of this required new technology. We used the same monitoring stack (Grafana, Prometheus, PagerDuty). The change was entirely structural: fewer alerts, better severity classification, runbooks, blameless postmortems, and fair compensation.

The most surprising result was deployment frequency. When engineers stopped dreading deploys — because they weren't afraid of getting paged at 3 AM if something went wrong — they started deploying more often. Smaller changes. Lower risk. Faster feedback loops. The reliability improvement fed itself.

This matches the broader DORA data. Elite-performing teams deploy multiple times per day with a change failure rate under 5%. They achieve this not by deploying less carefully, but by building systems where deploying is safe. On-call is a critical part of that safety net. When the net is broken, everyone deploys less, which means bigger changes, which means more incidents. The vicious cycle.


The Anti-Patterns That Keep Showing Up

Before I share my take, let me call out the patterns I see companies repeat despite knowing better.

"We'll fix on-call after we ship this feature." On-call improvement always gets deprioritized. But every month you delay, you're paying the toil tax and increasing attrition risk. It never gets less urgent — it gets more urgent.

"We need AI-powered incident management." No, you need to delete 80% of your alerts and write runbooks for the rest. AI incident correlation is useful at enormous scale. For most teams, the problem is simpler: too many alerts, not enough process.

"On-call compensation would be too expensive." A $1,500/week on-call stipend costs about $78K/year across a 6-person rotation. Losing one senior engineer costs $340K+ in replacement costs. The math is obvious.

"Everyone takes turns, it's fair." Fair doesn't mean equal. A junior engineer shouldn't be on primary for a system they barely understand. Pair them with a senior backup. Use shadowing rotations. Build expertise gradually instead of throwing people into the deep end and calling it fairness.

"Our monitoring is fine, we see everything." Seeing everything is the problem. If you see everything, your on-call engineer sees everything too — including the 97% that doesn't matter. Monitoring is only as good as the alerts it produces, and most alerting configurations prioritize coverage over precision.


What I Actually Think

I think on-call is the most important test of an engineering organization's health. Not architecture. Not tech stack. Not hiring bar. On-call.

Here's why: on-call is where incentives become visible. If management treats reliability as someone else's problem, on-call falls on the same 2-3 heroes. If there's no accountability for noisy alerts, they pile up. If the culture is blame-heavy, postmortems become interrogations and people hide mistakes instead of fixing them.

As Charity Majors put it: it is engineering's responsibility to be on call and own their code. It is management's responsibility to make sure that on-call does not suck. This is a handshake. It goes both ways. And if you don't hold up your end, they should quit and leave you.

That's not rhetoric. That's exactly what happened to us. And it's happening to companies everywhere right now.

The fix isn't complicated. Delete most of your alerts. Build real rotations with 6+ people. Write runbooks. Run blameless postmortems. Compensate on-call fairly. Measure alert volume and MTTR. Hold managers accountable for on-call health the same way you hold engineers accountable for uptime.

The $9.4 million annual cost of developer toil that Runframe reports for large organizations? Most of it is preventable. Not with AI-powered incident management platforms or ML-driven alert correlation. With basic hygiene: fewer alerts, better runbooks, fair rotations.

I interview candidates regularly, and I've started asking every one of them: "Tell me about on-call at your last company." The answers are revealing. Engineers from well-run organizations describe it matter-of-factly — a week every two months, clear runbooks, reasonable compensation. Engineers from broken organizations describe it with visible stress — the 3 AM pages, the impossible rotations, the blame-heavy postmortems.

You can tell more about a company's engineering culture from its on-call practices than from its tech stack, its interview process, or its blog posts about engineering values. On-call is where values meet reality.

On-call doesn't have to destroy your team. But if you're not actively managing it, it will.


Sources

  1. Runframe — State of Incident Management 2025
  2. DEV Community — On-Call Burnout: What Incident Data Doesn't Show
  3. incident.io — Alert Fatigue Solutions for DevOps Teams 2025
  4. DevOps.com — On-Call Rotation Best Practices
  5. The New Stack — Is Your On-Call Rotation Burning Out Top Talent?
  6. Runframe — On-Call Rotation Guide
  7. Charity Majors — On Call Shouldn't Suck: A Guide for Managers
  8. Gergely Orosz — Healthy On-Call Practices
  9. Honeycomb — I Don't Want to Be On Call Anymore
  10. Google SRE — Postmortem Culture
  11. Google SRE — Incident Management Guide
  12. Google Cloud Blog — How Lowe's Improved Incident Response with SRE
  13. Medium — Blameless Postmortems and Blameless Culture
  14. Medium — Designing Sustainable On-Call Rotations
  15. Rootly — On-Call Pay: Compensation Models
  16. incident.io — 3 Best PagerDuty Alternatives 2025
  17. DORA — DORA Metrics
  18. Rootly — Incident Response Metrics Guide
  19. incident.io — 7 Ways SRE Teams Reduce MTTR