You support a popular mobile game application deployed on Google Kubernetes Engine (GKE) across several Google Cloud regions. Each region has multiple Kubernetes clusters. You receive a report that none of the users in a specific region can connect to the application. You want to resolve the incident while following Site Reliability Engineering practices. What should you do first?
A. Reroute the user traffic from the affected region to other regions that don’t report issues.
B. Use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region.
C. Add an extra node pool that consists of high memory and high CPU machine type instances to the cluster.
D. Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs.
Disclaimer
This is a practice question. There is no guarantee that this question will appear on the certification exam.
Answer
A
Explanation
A. Reroute the user traffic from the affected region to other regions that don’t report issues.
(SRE practice always aims to stop the impact of an incident first, then find the root cause: restore the service first, then do everything else.)
B. Use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region.
(SRE suggests resolving the incident impact first.)
C. Add an extra node pool that consists of high memory and high CPU machine type instances to the cluster.
(Adding capacity assumes a cause that has not been diagnosed yet. SRE suggests resolving the incident impact first. Troubleshoot later.)
D. Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs.
(SRE suggests resolving the incident impact first. Troubleshoot later.)
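As an illustration of the "mitigate first" reasoning behind answer A, the following is a minimal Python sketch of draining an affected region and spreading its traffic share across the remaining regions. It is not a Google Cloud API: the function name reroute_traffic, the weights dictionary, and the region names are hypothetical and exist only to show the idea. In practice the rerouting would be done through whatever load balancing or DNS configuration the application already uses.

```python
def reroute_traffic(weights: dict[str, float], affected: str) -> dict[str, float]:
    """Drain the affected region and redistribute its share.

    `weights` maps a region name to its current share of user traffic.
    The affected region's weight is set to 0 and the remaining regions
    are scaled up proportionally so the total stays the same.
    (Hypothetical helper for illustration only.)
    """
    if affected not in weights:
        raise KeyError(f"unknown region: {affected}")

    # Total share currently served by regions that are not affected.
    healthy_total = sum(w for r, w in weights.items() if r != affected)
    if healthy_total <= 0:
        raise ValueError("no healthy region available to take the traffic")

    total = healthy_total + weights[affected]
    scale = total / healthy_total

    return {
        region: 0.0 if region == affected else weight * scale
        for region, weight in weights.items()
    }


if __name__ == "__main__":
    # Hypothetical traffic split before the incident (percent of users).
    current = {"us-central1": 50.0, "europe-west1": 30.0, "asia-east1": 20.0}
    print(reroute_traffic(current, "asia-east1"))
```

Once users are being served from healthy regions again, the investigation steps described in options B and D (checking metrics and logs for the affected clusters) can proceed without ongoing user impact.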