Your company follows Site Reliability Engineering principles. You are writing a postmortem for an incident, triggered by a software change, that severely affected users. You want to prevent severe incidents from happening in the future. What should you do?
A. Identify engineers responsible for the incident and escalate to their senior management.
B. Ensure that test cases that catch errors of this type are run successfully before new software is released.
C. Follow up with the employees who reviewed the changes and prescribe practices they should follow in the future.
D. Design a policy that will require on-call teams to immediately call engineers and management to discuss a plan of action if an incident occurs.
Disclaimer
This is a practice question. There is no guarantee that this question will appear in the certification exam.
Answer
B
Explanation
A. Identify engineers responsible for the incident and escalate to their senior management.
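(Postmortems should be blameless. Singling out engineers and escalating to their management assigns blame instead of fixing the underlying cause.)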
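B. Ensure that test cases that catch errors of this type are run successfully before new software is released.
(Correct. This is a preventive action: tests that catch this class of error stop a similar change from reaching users again.)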
C. Follow up with the employees who reviewed the changes and prescribe practices they should follow in the future.
(Postmortems should be blameless. Also, relying on reviewers to follow prescribed practices still leaves room for human error.)
D. Design a policy that will require on-call teams to immediately call engineers and management to discuss a plan of action if an incident occurs.
(This is incident management. It governs the response once an incident occurs and does not prevent the incident from happening.)
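Example
As an illustration of option B, here is a minimal sketch of a regression test used as a release gate. Everything in it is hypothetical and not part of the question: `parse_retry_limit` stands in for the code that caused the incident (assumed here to have crashed on empty input), the default of 3 is an assumed value, and `release_allowed` is an assumed gate that a release pipeline could call.

```python
import unittest


def parse_retry_limit(raw: str) -> int:
    """Hypothetical function; an earlier version is assumed to have crashed on empty input."""
    raw = raw.strip()
    if not raw:
        return 3  # assumed default, chosen only for this example
    value = int(raw)
    if value < 0:
        raise ValueError("retry limit must be non-negative")
    return value


class RetryLimitRegressionTest(unittest.TestCase):
    """Tests written after the incident to catch this class of error."""

    def test_empty_input_uses_default(self):
        self.assertEqual(parse_retry_limit(""), 3)

    def test_whitespace_input_uses_default(self):
        self.assertEqual(parse_retry_limit("   "), 3)

    def test_negative_value_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_retry_limit("-1")


def release_allowed() -> bool:
    """Return True only if every regression test passes."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RetryLimitRegressionTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A release pipeline could call `release_allowed()` and stop the rollout when it returns False, so the check does not depend on any individual remembering to run it.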