You are on-call for an infrastructure service that has a large number of dependent systems. You receive an alert indicating that the service is failing to serve most of its requests, and all of its dependent systems, with hundreds of thousands of users, are affected. As part of your Site Reliability Engineering (SRE) incident management protocol, you declare yourself Incident Commander (IC) and pull in two experienced people from your team as Operations Lead (OL) and Communications Lead (CL). What should you do next?
A. Look for ways to mitigate user impact and deploy the mitigations to production.
B. Contact the affected service owners and update them on the status of the incident.
C. Establish a communication channel where incident responders and leads can communicate with each other.
D. Start a postmortem, add incident information, circulate the draft internally, and ask internal stakeholders for input.
Disclaimer
This is a practice question. There is no guarantee that this question will appear on the certification exam.
Answer
C
Explanation
A. Look for ways to mitigate user impact and deploy the mitigations to production.
(This is also an important step, but it should proceed in parallel with, not ahead of, establishing a communication channel.)
B. Contact the affected service owners and update them on the status of the incident.
(This is also important, but it happens during the incident, typically through the Communications Lead, once responders have a channel to coordinate on.)
C. Establish a communication channel where incident responders and leads can communicate with each other.
(https://sre.google/workbook/incident-response/
Prepare Beforehand
In addition to incident response training, it helps to prepare for an incident beforehand. Use the following tips and strategies to be better prepared.
Decide on a communication channel
Decide and agree on a communication channel [Slack, a phone bridge, IRC, HipChat, etc.] beforehand.
Keep your audience informed
Unless you acknowledge that an incident is happening and actively being addressed, people will automatically assume nothing is being done to resolve the issue. Similarly, if you forget to call off the response once the issue has been mitigated or resolved, people will assume the incident is ongoing. You can preempt this dynamic by keeping your audience informed throughout the incident with regular status updates. Having a prepared list of contacts (see the next tip) saves valuable time and ensures you don’t miss anyone.)
D. Start a postmortem, add incident information, circulate the draft internally, and ask internal stakeholders for input.
(This is important, but it should be done later, once the incident is resolved.)