IT Incident Response

Walk into any incident with a clear head: this checklist covers every phase from first alert to blameless postmortem, with the process discipline that keeps teams coordinated, stakeholders informed, and organizations actually learning from outages. For more background and examples, see the guidance below; for built-in tools and options, use the quick tools guide.

Author: Checklistify Editorial Team

What Poor Incident Response Actually Costs

The financial argument for process discipline is not abstract. Industry research consistently places the cost of unplanned downtime for critical systems at $5,000–$9,000 per minute for large enterprises — and significantly higher for revenue-critical infrastructure like payment processing or e-commerce during peak periods. But visible downtime cost is rarely the largest cost. The slower, harder-to-quantify costs — engineer trust eroded by chaotic on-call, customer confidence lost through poor communication, and repeat incidents caused by postmortems that produced no lasting change — often dwarf the direct outage impact.

  • 74% of incident time is typically spent on diagnosis and coordination — not on applying the fix. Better process reduces this fraction directly, without changing the underlying system at all.
  • Teams without a designated Incident Commander take longer to resolve incidents than teams with one — a consistently observed pattern in incident retrospective data across organizations.
  • 58% of engineers report being paged for incidents that required no action — a primary driver of on-call burnout and one of the most addressable incident response problems.

🔧 Runbooks: The Companion Artifact This Checklist Needs

This checklist provides the process framework — the phases, decisions, and communication patterns that structure every incident regardless of what broke. What it deliberately omits is how to restart your specific payment service or which exact dashboard shows your database connection pool. That is what runbooks are for.

A runbook is a system-specific document that captures operational knowledge your team has accumulated about a particular service or failure mode. A good runbook for a database connection pool issue might include the exact monitoring URL to open first, the precise command to inspect live connection counts, the steps to safely restart the connection manager without dropping active transactions, and the name of the secondary contact if the primary owner is unreachable.

📝 What every runbook for a critical service should include:

  • What healthy looks like — baseline metric values with specific numbers, not vague descriptions
  • Symptom-to-cause mapping for the three or four most common failure modes
  • Step-by-step recovery procedures with exact commands or UI paths
  • Escalation contacts with a secondary if the primary is unavailable
  • Last updated date and the engineer who updated it

⚠️ A runbook that is never updated becomes dangerous faster than no runbook at all. Stale procedures applied confidently during a crisis are worse than acknowledged uncertainty. Treat runbook maintenance as part of the deployment process: if a deployment changes a service's behavior, the runbook for that service gets updated in the same pull request.
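The "update the runbook in the same pull request" rule can be enforced mechanically in CI. A minimal sketch, assuming runbooks are stored as structured files alongside the service code — the field names and the staleness threshold here are illustrative, not a standard:

```python
from datetime import date, timedelta

# Fields every critical-service runbook should carry (illustrative names,
# mirroring the checklist above).
REQUIRED_FIELDS = {
    "healthy_baselines",    # specific metric values, not vague descriptions
    "symptom_to_cause",     # mapping for the most common failure modes
    "recovery_steps",       # exact commands or UI paths
    "escalation_contacts",  # primary plus a secondary
    "last_updated",         # date of last revision
    "updated_by",           # engineer who revised it
}

MAX_AGE_DAYS = 180  # illustrative staleness threshold

def runbook_problems(runbook: dict, today: date) -> list[str]:
    """Return the reasons this runbook should block the pull request."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - runbook.keys())]
    updated = runbook.get("last_updated")
    if isinstance(updated, date) and today - updated > timedelta(days=MAX_AGE_DAYS):
        problems.append(f"stale: last updated {updated.isoformat()}")
    if len(runbook.get("escalation_contacts", [])) < 2:
        problems.append("needs a secondary escalation contact")
    return problems
```

Wired into a CI job, a non-empty return value fails the build — turning runbook freshness from a good intention into a checked property of every deployment.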

⚡ Game Days: Rehearse the Response Before the Incident Finds You

The strongest predictor of smooth incident response is not the quality of the monitoring stack or the seniority of the engineers — it is whether the team has practiced the response process before a real incident. Organizations that run game days consistently report lower mean time to resolve than those that respond only to live incidents, because game days surface process failures before they produce customer impact.

Tabletop exercise — 90 minutes, no systems touched

A facilitator describes an evolving incident scenario in a meeting room; participants describe what they would do at each step. No systems are actually affected. Best for: validating the process, identifying gaps in runbooks and escalation contacts, and familiarizing engineers who are new to the Incident Commander role. The most common tabletop finding: the team discovers that a critical service runbook lists a primary on-call contact who left the company six months ago.

Live fire exercise — 3–4 hours, real failure injected

A chaos engineering tool (Chaos Monkey, Gremlin, or a hand-crafted failure script) injects an actual failure into a staging or isolated environment. The team runs the full response process in real time without knowing in advance what will fail. Best for: testing the actual technical response, validating runbooks against real system behavior, and building muscle memory for high-stress decision-making. Debrief immediately afterward while observations are fresh.
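A hand-crafted failure script can be very small. This sketch — illustrative only, not a Gremlin or Chaos Monkey API — wraps a client call in a staging environment with injected latency, so a dependency suddenly appears slow on real dashboards:

```python
import random
import time
from functools import wraps

def inject_latency(probability: float, delay_seconds: float):
    """Decorator: with the given probability, sleep before the real call.

    Staging use only -- wrap a client method during a game day so the team
    sees realistic slow-dependency symptoms on their actual monitoring.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # the injected failure
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical usage: make ~30% of calls to a payments client take 2s longer.
# @inject_latency(probability=0.3, delay_seconds=2.0)
# def charge(order_id): ...
```

The point is not the sophistication of the injection but that the team does not know which dependency was wrapped — they have to find it the same way they would at 2 AM.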

The most common objection to game days is time. The counterargument: a 90-minute tabletop exercise that reveals a gap in an escalation path saves far more than 90 minutes during a live critical incident. Quarterly game days have consistently separated high-performing reliability teams from reactive ones.

🧮 The Two Numbers Worth Tracking Over Time

Most teams know they had incidents last quarter. Fewer can demonstrate whether their incident response is improving or worsening. Two metrics close that gap without requiring expensive tooling:

Mean Time to Detect (MTTD)

The average elapsed time between when an incident begins and when your team becomes aware of it. MTTD measures your observability investment: alerting coverage, alert quality, and the signal-to-noise ratio in your monitoring. A high MTTD means incidents have been affecting users for extended periods before the team knows. Reduce it by auditing alert coverage against your critical user journeys and aggressively eliminating false positive noise that conditions engineers to ignore pages.

Mean Time to Resolve (MTTR)

The average time from incident declaration to confirmed resolution. MTTR measures response process quality — how efficiently the team coordinates, diagnoses, and mitigates. Improvements come from clearer severity criteria, faster Incident Commander assignment, better runbooks that reduce diagnosis time, and mitigation options like feature flags that allow faster service restoration without a full root cause fix. Track MTTR by severity level separately — P1 and P3 incident patterns are structurally very different.

💡 Start simple: a spreadsheet logging each incident's declared time, resolved time, severity, and cause category. After three months you have a baseline. After six months you can see whether process improvements are producing measurable MTTR reductions — or whether the same failure modes are recurring, signaling that postmortem action items are not being completed.
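The same spreadsheet log can be computed directly. A minimal sketch, assuming each incident record carries began/detected/declared/resolved timestamps and a severity label — the field names are illustrative, not a standard schema:

```python
from collections import defaultdict
from statistics import mean

def mttd_minutes(incidents) -> float:
    """Mean Time to Detect: incident start until the team became aware."""
    return mean((i["detected"] - i["began"]).total_seconds() / 60 for i in incidents)

def mttr_by_severity(incidents) -> dict:
    """Mean Time to Resolve per severity level -- P1 and P3 patterns are
    structurally different, so averaging them together hides both."""
    durations = defaultdict(list)
    for i in incidents:
        minutes = (i["resolved"] - i["declared"]).total_seconds() / 60
        durations[i["severity"]].append(minutes)
    return {sev: mean(vals) for sev, vals in durations.items()}
```

Run quarterly against the incident log, these two functions are the entire measurement program — no tooling purchase required until the log outgrows a spreadsheet.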

🧑‍💻 The Human Side: Sustainable On-Call

Incident response process improvements fail when the people running them are exhausted. On-call burnout is real, measurable, and directly correlated with both response quality and engineer attrition. A team that is chronically paged — especially for non-actionable alerts — will eventually stop responding with the urgency and quality that real incidents require. The best reliability programs treat on-call sustainability as a first-class concern, not an afterthought.

⚠️ Signs of unsustainable on-call

  • More than 2–3 actionable pages per week per engineer
  • More than 25% of alerts require no action at all
  • Rotation that excludes senior or staff engineers
  • No compensation policy or recovery time for overnight incidents
  • Engineers describing upcoming on-call weeks with dread
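The first two warning signs are measurable straight from the paging log. A sketch assuming each page record notes whether any action was required — the field name and thresholds mirror the list above and are illustrative:

```python
def oncall_warnings(pages, engineers: int, weeks: int) -> list[str]:
    """Flag the two quantifiable signs of unsustainable on-call."""
    warnings = []
    actionable = [p for p in pages if p["action_required"]]

    # Sign 1: more than ~3 actionable pages per engineer per week.
    per_engineer_week = len(actionable) / (engineers * weeks)
    if per_engineer_week > 3:
        warnings.append(
            f"{per_engineer_week:.1f} actionable pages/engineer/week (target: <= 2-3)")

    # Sign 2: more than 25% of alerts required no action at all.
    noise = 1 - len(actionable) / len(pages)
    if noise > 0.25:
        warnings.append(f"{noise:.0%} of alerts required no action (target: < 25%)")

    return warnings
```

A non-empty result is the agenda for the next alert-review ritual; an empty one is evidence, not just a feeling, that the rotation is sustainable.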

✅ Practices that protect engineers

  • Alert review as a standing team ritual — audit and retire noisy alerts regularly
  • Written handoff documentation so the incoming engineer has context on open issues
  • Shadow rotations for engineers new to on-call, paired with experienced responders
  • Compensatory time off after on-call periods with significant overnight incidents
  • Clear escalation paths so no one is ever the last line of defense alone

The organizations with the best incident response records are not always those with the most resilient infrastructure — they are the ones with the most invested engineers. Process and tooling are multipliers, but the team behind them is the foundation.

📖 What Winging It Looks Like at 2 AM

A composite scenario based on patterns observed across multiple real incident retrospectives

An alert fires at 2:04 AM. Three engineers independently acknowledge it and begin investigating — none of them knows the others are looking. One engineer restarts a service without announcing it anywhere; a second is in the middle of pulling logs from that same service and loses the connection mid-read. A third is capturing a database state snapshot, unaware that the restart has already changed the system state they are trying to capture.

At 2:31 AM, no one knows what has been changed, what the current system state is, or whether any of the actions taken have helped or made things worse. Meanwhile, the Director of Engineering receives a message from the CEO asking what is happening — neither has received an update since the incident began, because no one was assigned to stakeholder communication. The customer support team is fielding inbound tickets with nothing to tell users.

By 4:18 AM, the incident resolves — not because anyone diagnosed the root cause, but because the 2:10 AM restart happened to clear a stuck process and the system stabilized on its own. Two weeks later, the stuck process recurs. There is no postmortem, no action items, and no institutional record of what actually fixed it. Total duration: 134 minutes. The same incident, with a structured response and a single coordinator directing actions: a conservative estimate of under 25 minutes to mitigation, 45 minutes to full resolution — and a postmortem that prevents the recurrence entirely.

Master This Checklist Quickly

Every important button and option for this pre-made checklist, shown in a glance-friendly format.

Start Here

  1. Click any item row to mark it complete.

  2. Use the note row under each item for quick notes.

  3. Use the tool row for undo, redo, reset, and check all.

  4. Use Save Progress when you want to continue later.

Checklist Row Tools

  • Undo
  • Redo
  • Reset
  • Check all
  • Collapse/Expand sections
  • Show/Hide details
  • Inline notes

Top Action Buttons

Share

Open all sharing and export options in one menu.

  • Email Draft
  • Continue on another device
  • Print or Save as PDF
  • Plain Text (.txt)
  • Word (.docx)
  • Excel (.xlsx)

Add & Ask

Open one menu for apps and AI guidance.

  • Notion
  • Todoist CSV
  • ChatGPT
  • Claude

Copy and customize

Create a new editable checklist pre-filled with your chosen content.

Save Progress

Adds this checklist to My Checklists and keeps your progress in this browser.

Most Natural Usage

Track over time

Check items -> Add notes where needed -> Save Progress

Send or export

Open Share -> Choose format -> Continue

Make your own version

Copy and customize -> Open create page -> Edit freely