IT Incident Response

Walk into any incident with a clear head: this checklist covers every phase from first alert to blameless postmortem, with the process discipline that keeps teams coordinated, stakeholders informed, and organizations actually learning from outages. For more background and examples, see the guidance below; for built-in tools and options, use the quick tools guide.

Author: Checklistify Editorial Team

What Poor Incident Response Actually Costs

The financial argument for process discipline is not abstract. Industry research consistently places the cost of unplanned downtime for critical systems at $5,000–$9,000 per minute for large enterprises — and significantly higher for revenue-critical infrastructure like payment processing or e-commerce during peak periods. But visible downtime cost is rarely the largest cost. The slower, harder-to-quantify costs — engineer trust eroded by chaotic on-call, customer confidence lost through poor communication, and repeat incidents caused by postmortems that produced no lasting change — often dwarf the direct outage impact.

  • 74% of incident time is typically spent on diagnosis and coordination — not on applying the fix. Better process reduces this fraction directly, without changing the underlying system at all.
  • Teams without a designated Incident Commander take longer to resolve incidents than teams with one — a consistently observed pattern in incident retrospective data across organizations.
  • 58% of engineers report being paged for incidents that required no action — a primary driver of on-call burnout and one of the most addressable incident response problems.

🔧 Runbooks: The Companion Artifact This Checklist Needs

This checklist provides the process framework — the phases, decisions, and communication patterns that structure every incident regardless of what broke. What it deliberately omits is how to restart your specific payment service or which exact dashboard shows your database connection pool. That is what runbooks are for.

A runbook is a system-specific document that captures operational knowledge your team has accumulated about a particular service or failure mode. A good runbook for a database connection pool issue might include the exact monitoring URL to open first, the precise command to inspect live connection counts, the steps to safely restart the connection manager without dropping active transactions, and the name of the secondary contact if the primary owner is unreachable.

📝 What every runbook for a critical service should include:

  • What healthy looks like — baseline metric values with specific numbers, not vague descriptions
  • Symptom-to-cause mapping for the three or four most common failure modes
  • Step-by-step recovery procedures with exact commands or UI paths
  • Escalation contacts with a secondary if the primary is unavailable
  • Last updated date and the engineer who updated it

⚠️ A runbook that is never updated becomes dangerous faster than no runbook at all. Stale procedures applied confidently during a crisis are worse than acknowledged uncertainty. Treat runbook maintenance as part of the deployment process: if a deployment changes a service's behavior, the runbook for that service gets updated in the same pull request.
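The "update the runbook in the same pull request" rule can be enforced mechanically in CI. A minimal sketch, assuming runbooks are stored as structured files alongside the service code — the field names and the staleness threshold here are illustrative, not a standard:

```python
from datetime import date, timedelta

# Fields every critical-service runbook should carry (illustrative names,
# mirroring the checklist above).
REQUIRED_FIELDS = {
    "healthy_baselines",    # specific metric values, not vague descriptions
    "symptom_to_cause",     # mapping for the most common failure modes
    "recovery_steps",       # exact commands or UI paths
    "escalation_contacts",  # primary plus a secondary
    "last_updated",         # date of last revision
    "updated_by",           # engineer who revised it
}

MAX_AGE_DAYS = 180  # illustrative staleness threshold

def runbook_problems(runbook: dict, today: date) -> list[str]:
    """Return the reasons this runbook should block the pull request."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - runbook.keys())]
    updated = runbook.get("last_updated")
    if isinstance(updated, date) and today - updated > timedelta(days=MAX_AGE_DAYS):
        problems.append(f"stale: last updated {updated.isoformat()}")
    if len(runbook.get("escalation_contacts", [])) < 2:
        problems.append("needs a secondary escalation contact")
    return problems
```

Wired into a CI job, a non-empty return value fails the build — turning runbook freshness from a good intention into a checked property of every deployment.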

⚡ Game Days: Rehearse the Response Before the Incident Finds You

The strongest predictor of smooth incident response is not the quality of the monitoring stack or the seniority of the engineers — it is whether the team has practiced the response process before a real incident. Organizations that run game days consistently report lower mean time to resolve than those that respond only to live incidents, because game days surface process failures before they produce customer impact.

Tabletop exercise — 90 minutes, no systems touched

A facilitator describes an evolving incident scenario in a meeting room; participants describe what they would do at each step. No systems are actually affected. Best for: validating the process, identifying gaps in runbooks and escalation contacts, and familiarizing engineers who are new to the Incident Commander role. The most common tabletop finding: the team discovers that a critical service runbook lists a primary on-call contact who left the company six months ago.

Live fire exercise — 3–4 hours, real failure injected

A chaos engineering tool (Chaos Monkey, Gremlin, or a hand-crafted failure script) injects an actual failure into a staging or isolated environment. The team runs the full response process in real time without knowing in advance what will fail. Best for: testing the actual technical response, validating runbooks against real system behavior, and building muscle memory for high-stress decision-making. Debrief immediately afterward while observations are fresh.
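A hand-crafted failure script can be very small. This sketch — illustrative only, not a Gremlin or Chaos Monkey API — wraps a client call in a staging environment with injected latency, so a dependency suddenly appears slow on real dashboards:

```python
import random
import time
from functools import wraps

def inject_latency(probability: float, delay_seconds: float):
    """Decorator: with the given probability, sleep before the real call.

    Staging use only -- wrap a client method during a game day so the team
    sees realistic slow-dependency symptoms on their actual monitoring.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # the injected failure
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical usage: make ~30% of calls to a payments client take 2s longer.
# @inject_latency(probability=0.3, delay_seconds=2.0)
# def charge(order_id): ...
```

The point is not the sophistication of the injection but that the team does not know which dependency was wrapped — they have to find it the same way they would at 2 AM.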

The most common objection to game days is time. The counterargument: a 90-minute tabletop exercise that reveals a gap in an escalation path saves far more than 90 minutes during a live critical incident. Quarterly game days have consistently separated high-performing reliability teams from reactive ones.

🧮 The Two Numbers Worth Tracking Over Time

Most teams know they had incidents last quarter. Fewer can demonstrate whether their incident response is improving or worsening. Two metrics close that gap without requiring expensive tooling:

Mean Time to Detect (MTTD)

The average elapsed time between when an incident begins and when your team becomes aware of it. MTTD measures your observability investment: alerting coverage, alert quality, and the signal-to-noise ratio in your monitoring. A high MTTD means incidents have been affecting users for extended periods before the team knows. Reduce it by auditing alert coverage against your critical user journeys and aggressively eliminating false positive noise that conditions engineers to ignore pages.

Mean Time to Resolve (MTTR)

The average time from incident declaration to confirmed resolution. MTTR measures response process quality — how efficiently the team coordinates, diagnoses, and mitigates. Improvements come from clearer severity criteria, faster Incident Commander assignment, better runbooks that reduce diagnosis time, and mitigation options like feature flags that allow faster service restoration without a full root cause fix. Track MTTR by severity level separately — P1 and P3 incident patterns are structurally very different.

💡 Start simple: a spreadsheet logging each incident's declared time, resolved time, severity, and cause category. After three months you have a baseline. After six months you can see whether process improvements are producing measurable MTTR reductions — or whether the same failure modes are recurring, signaling that postmortem action items are not being completed.
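The same spreadsheet log can be computed directly. A minimal sketch, assuming each incident record carries began/detected/declared/resolved timestamps and a severity label — the field names are illustrative, not a standard schema:

```python
from collections import defaultdict
from statistics import mean

def mttd_minutes(incidents) -> float:
    """Mean Time to Detect: incident start until the team became aware."""
    return mean((i["detected"] - i["began"]).total_seconds() / 60 for i in incidents)

def mttr_by_severity(incidents) -> dict:
    """Mean Time to Resolve per severity level -- P1 and P3 patterns are
    structurally different, so averaging them together hides both."""
    durations = defaultdict(list)
    for i in incidents:
        minutes = (i["resolved"] - i["declared"]).total_seconds() / 60
        durations[i["severity"]].append(minutes)
    return {sev: mean(vals) for sev, vals in durations.items()}
```

Run quarterly against the incident log, these two functions are the entire measurement program — no tooling purchase required until the log outgrows a spreadsheet.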

🧑‍💻 The Human Side: Sustainable On-Call

Incident response process improvements fail when the people running them are exhausted. On-call burnout is real, measurable, and directly correlated with both response quality and engineer attrition. A team that is chronically paged — especially for non-actionable alerts — will eventually stop responding with the urgency and quality that real incidents require. The best reliability programs treat on-call sustainability as a first-class concern, not an afterthought.

⚠️ Signs of unsustainable on-call

  • More than 2–3 actionable pages per week per engineer
  • More than 25% of alerts require no action at all
  • Rotation that excludes senior or staff engineers
  • No compensation policy or recovery time for overnight incidents
  • Engineers describing upcoming on-call weeks with dread
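The first two warning signs are measurable straight from the paging log. A sketch assuming each page record notes whether any action was required — the field name and thresholds mirror the list above and are illustrative:

```python
def oncall_warnings(pages, engineers: int, weeks: int) -> list[str]:
    """Flag the two quantifiable signs of unsustainable on-call."""
    warnings = []
    actionable = [p for p in pages if p["action_required"]]

    # Sign 1: more than ~3 actionable pages per engineer per week.
    per_engineer_week = len(actionable) / (engineers * weeks)
    if per_engineer_week > 3:
        warnings.append(
            f"{per_engineer_week:.1f} actionable pages/engineer/week (target: <= 2-3)")

    # Sign 2: more than 25% of alerts required no action at all.
    noise = 1 - len(actionable) / len(pages)
    if noise > 0.25:
        warnings.append(f"{noise:.0%} of alerts required no action (target: < 25%)")

    return warnings
```

A non-empty result is the agenda for the next alert-review ritual; an empty one is evidence, not just a feeling, that the rotation is sustainable.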

✅ Practices that protect engineers

  • Alert review as a standing team ritual — audit and retire noisy alerts regularly
  • Written handoff documentation so the incoming engineer has context on open issues
  • Shadow rotations for engineers new to on-call, paired with experienced responders
  • Compensatory time off after on-call periods with significant overnight incidents
  • Clear escalation paths so no one is ever the last line of defense alone

The organizations with the best incident response records are not always those with the most resilient infrastructure — they are the ones with the most invested engineers. Process and tooling are multipliers, but the team behind them is the foundation.

📖 What Winging It Looks Like at 2 AM

A composite scenario based on patterns observed across multiple real incident retrospectives

An alert fires at 2:04 AM. Three engineers independently acknowledge it and begin investigating — none of them knows the others are looking. One engineer restarts a service without announcing it anywhere; a second is in the middle of pulling logs from that same service and loses the connection mid-read. A third is capturing a database state snapshot, unaware that the restart has already changed the system state they are trying to capture.

At 2:31 AM, no one knows what has been changed, what the current system state is, or whether any of the actions taken have helped or made things worse. Meanwhile, the Director of Engineering receives a message from the CEO asking what is happening — neither has received an update since the incident began, because no one was assigned to stakeholder communication. The customer support team is fielding inbound tickets with nothing to tell users.

By 4:18 AM, the incident resolves — not because anyone diagnosed the root cause, but because the 2:10 AM restart happened to clear a stuck process and the system stabilized on its own. Two weeks later, the stuck process recurs. There is no postmortem, no action items, and no institutional record of what actually fixed it. Total duration: 134 minutes. The same incident, with a structured response and a single coordinator directing actions: a conservative estimate of under 25 minutes to mitigation, 45 minutes to full resolution — and a postmortem that prevents the recurrence entirely.

Master This Checklist Quickly

Every important button and option for this pre-made checklist, shown in a glance-friendly format.

Start Here

  1. Click any item row to mark it complete.

  2. Use the note row under each item for quick notes.

  3. Use the tool row for undo, redo, reset, and check all.

  4. Use Save Progress when you want to continue later.

Checklist Row Tools

  • Undo
  • Redo
  • Reset
  • Check all
  • Collapse/Expand sections
  • Show/Hide details
  • Inline notes

Top Action Buttons

Share

Open all sharing and export options in one menu.

  • Email Draft
  • Continue on another device
  • Print or Save as PDF
  • Plain Text (.txt)
  • Word (.docx)
  • Excel (.xlsx)

Add & Ask

Open one menu for apps and AI guidance.

  • Notion
  • Todoist CSV
  • ChatGPT
  • Claude

Copy and customize

Create a new editable checklist pre-filled with your chosen content.

Save Progress

Adds this checklist to My Checklists and keeps your progress in this browser.

Most Natural Usage

Track over time

Check items -> Add notes where needed -> Save Progress

Send or export

Open Share -> Choose format -> Continue

Make your own version

Copy and customize -> Open create page -> Edit freely