Google Professional Cloud DevOps Engineer Certification Guide

23 Mar

This is a handcrafted guide for the Google Professional Cloud DevOps Engineer certification. It covers the key areas, a summary of each topic, and important links, and is meant as a last-minute review before you sit the exam.

Summary

Key Areas

  • SRE Principles
    • Class SRE implements DevOps
    • SLI (Availability, Durability, Latency, etc.), SLO (Binding target with SLIs), SLA (Between service provider and service user)
    • SLIs drive SLOs which inform SLAs
    • SLI – 95th percentile latency of homepage requests over past 5 minutes < 300ms
    • SLO – 95th percentile homepage SLI will succeed 99.9% over trailing year
    • SLA – Service credits if 95th percentile homepage SLI succeeds less than 99.5% over trailing year
    • Error Budgets (99.9% SLO -> 43.2 minutes/month Error Budget)
    • Overhead (Work not tied directly to production, e.g. email, expense reports, meetings, traveling, etc.)
    • Toil & Toil budgets (Reduce manual things on production by automating it)
    • Blameless Postmortem
  • Manage Service Incidents
    • Incident management
    • Mitigating incident impact
    • Roles in incident management (Incident Commander, Operations Lead, Communications Lead)
    • Burnouts & Handovers
  • Implement service monitoring strategies
    • Types of logs (Audit/Data Access/Event)
    • Exporting logs using log sink
    • Third-party integration (OpenTelemetry, etc.)
    • Managing application/infrastructure logs (fluentd, etc.)
    • Monitoring of resources and SLIs
    • Monitoring dashboards (Custom dashboards and access levels, etc.)
    • Filtering logs for security and PII data
    • Ops Agent
  • Building and Managing CICD pipelines
    • CICD pipelines using Cloud Build
    • Setup build triggers
    • Integrating third-party tools (Jenkins, Spinnaker, GitLab, etc.)
    • Security features
    • Binary Authorization for GKE
    • Vulnerability analysis with Artifact Registry
  • Optimize service performance
    • Cost optimization
    • Cloud trace (Latency)
    • Cloud Profiler (CPU, Memory usage, etc.)
    • Billing and TCO (Total Cost of Ownership)
  • Metric Kinds
    • Gauge – the value measures a specific instant in time, e.g. CPU utilization, Memory utilization, Disk utilization, HTTP request latencies, etc.
    • Delta – the value measures the change in a time interval, e.g. number of requests received, etc.
    • Cumulative – the value constantly increases over time, e.g. total number of bytes sent, etc.
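
To make the three metric kinds above concrete, here is a minimal Python sketch. The Sample class, the cumulative_to_delta helper, and all sample values are hypothetical and chosen only for illustration; they are not part of any Google Cloud API. It treats a running "bytes sent" total as a cumulative metric, derives per-interval deltas from it, and contrasts both with a single gauge reading.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One cumulative reading: seconds since start and the running total."""
    t: int
    total: int


def cumulative_to_delta(samples: List[Sample]) -> List[int]:
    """Turn a cumulative series into the change seen in each interval."""
    return [cur.total - prev.total for prev, cur in zip(samples, samples[1:])]


# Gauge: a reading at one instant (hypothetical CPU utilization).
cpu_utilization = 0.42

# Cumulative: a hypothetical "bytes sent" counter sampled every 60 seconds.
bytes_sent = [Sample(0, 0), Sample(60, 1200), Sample(120, 1500), Sample(180, 4500)]

# Delta: the change within each 60-second interval.
deltas = cumulative_to_delta(bytes_sent)

# A rate (bytes per second) derived from the deltas.
rates = [d / 60 for d in deltas]
```

A cumulative series only ever grows (until a reset), so dashboards usually chart the derived delta or rate rather than the raw total.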

Developer tools

    • Google Cloud Build
      • It can be integrated with Pub/Sub to publish messages.
      • It uses the /workspace directory as the working directory of each build step.
      • It can trigger Spinnaker using Pub/Sub.
      • It uses a build config file (cloudbuild.yaml or cloudbuild.json) to specify the build steps.
      • Assets written to /workspace are persisted and passed on to the next step.
    • Binary Authorization and Vulnerability Scanning
      • Binary Authorization uses attestations to verify that an image was built by a trusted build system.
      • It’s part of secure supply chain.
    • Google Source Repositories
      • Private Git repositories hosted on Google Cloud.
      • Version controlled.
    • Google Artifact Registry
      • Supports regional and multi-regional repositories.
    • Google Cloud Deployment
    • Google Cloud Client Libraries
      • Google Cloud provides client libraries and SDKs for languages such as Node.js, Python, etc., so that Google Cloud APIs can be called from those languages.
      • If a language has no client library or SDK, the REST API can be used directly.
    • Deployment Techniques
      • Recreate deployment – Scale down to zero, and scale up with new application version.
      • Rolling update – Update only subset of application instances at a time.
      • Blue/Green deployment – Also known as Red/Black deployment. Run two identical deployments of your application. Blue is the current version with 100% of traffic; Green is the new version with 0% of traffic. To release the new version, switch 100% of traffic to Green. If you run into issues, roll back by switching 100% of traffic to Blue, the previous version.
      • GKE supports recreate and rolling updates for pods. It supports rolling and Blue/Green deployments for nodes. Managed instance groups support rolling updates.
        • maxSurge – Maximum number of pods that can be created above the desired replica count during a rolling update.
        • maxUnavailable – Maximum number of pods that can be unavailable during a rolling update.
    • Testing Strategies
      • Canary testing – Rollout the change partially and evaluate the performance against the baseline.
      • A/B testing – Test a hypothesis by routing a share of traffic to a variant implementation and comparing the results. Used for making business decisions.
    • Spinnaker
      • Supports Blue/Green deployments using Replica Sets.
    • Ops Agent – Used to collect telemetry from virtual machine (VM) instances or third party applications.
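
The maxSurge and maxUnavailable settings listed under Deployment Techniques bound how many pods exist and how many stay available during a rolling update. The sketch below works through that arithmetic in Python. The function names resolve and rolling_update_bounds are hypothetical; the rounding rule (percentages round up for maxSurge and down for maxUnavailable) follows the Kubernetes Deployment documentation, and everything else is a simplification.

```python
import math
from typing import Dict, Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rolling_update_bounds(replicas: int,
                          max_surge: Union[int, str],
                          max_unavailable: Union[int, str]) -> Dict[str, int]:
    """Return the pod-count limits that hold while the update is in progress."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,             # old + new pods combined
        "min_available_pods": replicas - unavailable,   # pods that must keep serving
    }


# 10 replicas with 25% surge and 25% unavailable.
bounds = rolling_update_bounds(10, "25%", "25%")
```

With these inputs the update may run up to 13 pods at once and must keep at least 8 available. Setting maxUnavailable to 0 keeps full capacity throughout, at the cost of needing room for the surge pods.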

Operations Suite

  • Cloud Monitoring
    • A virtual machine needs the Ops Agent installed for additional metrics, e.g. CPU utilization, Disk I/O, etc.
    • Monitoring API supports push or export of custom metrics.
    • Logs can be exported through sinks to Cloud Storage, Pub/Sub, BigQuery, or external tools like Splunk.
    • Uptime checks can be used to verify that the application is reachable.
  • Cloud Logging
    • It is a real-time log analysis and management tool.
    • You can do custom logging from your application.
    • The Logging agent needs to be installed on a virtual machine to send custom application logs.
    • The Logging agent is based on fluentd, and a fluentd filter can be used to redact sensitive data from application logs.
    • VPC Flow Logs record network traffic sent to and from virtual machines.
    • Log based metrics can be used to create alerts from logs.
  • Cloud Error Reporting
    • It counts, analyzes and aggregates the crashes in the cloud services.
  • Cloud Profiler
    • It continuously profiles the CPU and memory usage of applications, whether they run on GCP or on-premises.
  • Cloud Trace
    • It is a distributed tracing tool that collects latency data from applications.
  • Cloud Debugger
    • It lets you debug a running application without changing the application code.
    • Debug Logpoints – Inject log statements into a running application.
    • Debug Snapshots – Capture local variable values and the call stack.
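
The Cloud Logging notes above mention redacting sensitive data with a fluentd filter. The same idea can be applied inside the application before a log line ever leaves the process. The sketch below uses only Python's standard logging module; it is not fluentd configuration and not a Cloud Logging API. The regular expressions and the RedactingFilter name are hypothetical and deliberately simple, so treat them as a starting point rather than a complete PII detector.

```python
import logging
import re

# Hypothetical patterns: a basic email address and a 16-digit card number.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[REDACTED_CARD]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that scrubs the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, only its content changes


logger = logging.getLogger("app")
logger.addFilter(RedactingFilter())
```

Calling logger.warning("login by %s", "bob@example.org") on this logger would emit the message with the address replaced by the placeholder. Filtering at the agent is still useful as a second line of defense for logs the application does not control.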

Compute Services

  • Compute Engine
    • Preemptible VMs – For batch jobs/workloads, short term needs and lower costs.
    • Spot VMs – The newer version of Preemptible VMs. Unlike Preemptible VMs, which run for at most 24 hours, Spot VMs have no maximum runtime.
    • Committed Usage Discounts – Long-term usage cost benefits.
    • Managed Instance Groups – Auto-scaling and Auto-healing.
  • Google Kubernetes Engine
    • Scaling
      • Cluster AutoScaler for scaling the cluster
      • Horizontal Pod Autoscaler – Scales the number of pods based on CPU or Memory usage.
      • Vertical Pod Autoscaler – Adjusts the CPU and memory resources requested by pods.
    • Kubernetes Secrets to store secrets.
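
For the Horizontal Pod Autoscaler listed above, the Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas x currentMetricValue / desiredMetricValue). The Python sketch below applies that rule. The function name desired_replicas is hypothetical, and the tolerance of 0.1 is the documented default but is configurable per cluster, so it is an assumption here. The real controller also considers pod readiness, missing metrics, and min/max replica limits, which this sketch leaves out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA ratio rule."""
    ratio = current_value / target_value
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_value / target_value)


# Hypothetical readings: average CPU utilization in percent, target 60%.
scale_up = desired_replicas(4, 90, 60)     # load above target
scale_down = desired_replicas(10, 30, 60)  # load below target
no_change = desired_replicas(4, 63, 60)    # within tolerance
```

With these inputs the rule scales 4 pods up to 6, scales 10 pods down to 5, and leaves the third case unchanged.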

Security

  • Cloud Key Management Service
    • Used to create and manage cryptographic keys, e.g. the keys that encrypt data in Cloud Storage
  • Secret Manager
    • Used to store secrets like passwords, etc.

Site Reliability Engineering

  • Increasing reliability, observability, collaboration, and reducing toil using automation.
  • SLI = (good events / valid events) x 100%
  • SLIs fall between 0% and 100%
  • SLI Menu
    • Request/Response
      • Availability – The proportion of valid requests served successfully.
      • Latency – The proportion of valid requests served faster than a threshold.
      • Quality – The proportion of valid requests served without degrading quality.
    • Data Processing
      • Freshness – The proportion of valid data updated more recently than a threshold.
      • Coverage – The proportion of valid data processed successfully.
      • Correctness – The proportion of valid data producing correct output.
      • Throughput – The proportion of time where the data processing rate is faster than a threshold.
    • Storage
      • Latency
      • Throughput
  • SLOs – choosing the measurement method
    • Client side instrumentation
    • Application and infrastructure metrics
    • Synthetic clients to measure user experience
    • Logs processing
  • SLA – Explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
  • Error budget – Provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter.
  • Toil – Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
  • Define Error Budget and Error Budget policy which needs to be aligned with all the stakeholders and help plan releases to focus on features vs reliability
  • Main focus of SRE is on reducing toil – identify repetitive tasks and automate
  • Production Readiness Review (PRR)
    • Performance testing should be done for all applications before they are deployed to production.
    • SLOs should never be modified/adjusted for production deployments.
  • SRE practice
    • Incident Management and Response
      • Mitigate the issue, rollback the broken deployment and find the root cause.
      • Incident Live State Document – Track the events and decision making which can be useful for postmortem.
      • Incident Commander/Manager
        • Setup a communication channel for all to collaborate
        • Assign and delegate roles. IC can assume any role if not delegated.
        • Responsible for Incident Live State Document.
      • Communications Lead
        • Periodic updates to all the stakeholders and customers
      • Operations Lead
        • Responds to the incident and should be the only group working on the system during an incident.
  • Postmortem
    • Should be Blameless
    • Should contain the root cause
    • Should be shared with all for collaboration and feedback
    • Should be shared with all the stakeholders
    • Should have proper action items to prevent recurrence with an owner and collaborators
  • Alerting – While there may be many alerts ultimately, your goal is to be notified for a significant event; an event that consumes a large fraction of the error budget.
  • Monitoring – Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts, etc.
  • Managing Risk – Item or risk that may cause you to not meet the SLO.
  • Response Structure – Communication and structure are key parts of handling an incident.
  • API Lifecycle
  • API Response Codes – HTTP 2xx (success), 3xx (redirection), 4xx (client error), and 5xx (server error).
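
The SLI formula and the error budget idea from this section can be worked through in a few lines of Python. This is a minimal sketch: the function names and the request counts are hypothetical, and the 30-day window is an assumption chosen because it reproduces the 43.2 minutes/month figure quoted for a 99.9% SLO under Key Areas.

```python
def sli(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: (good events / valid events) x 100."""
    if valid_events == 0:
        raise ValueError("no valid events in the window")
    return good_events / valid_events * 100


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unreliability a time-based SLO allows in the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_consumed(good_events: int, valid_events: int, slo_percent: float) -> float:
    """Fraction of an event-based error budget that has been spent."""
    allowed_bad = valid_events * (100 - slo_percent) / 100
    actual_bad = valid_events - good_events
    return actual_bad / allowed_bad


# Hypothetical month: 1,000,000 valid requests, 999,500 of them good.
availability = sli(999_500, 1_000_000)
monthly_budget = round(error_budget_minutes(99.9), 1)   # 43.2
spent = budget_consumed(999_500, 1_000_000, 99.9)
```

Here 500 bad requests against an allowance of about 1,000 means roughly half of the budget is spent. An error budget policy would typically say what happens as that fraction approaches 1, for example pausing feature releases in favor of reliability work.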

YouTube Playlists

Books

Important Concepts Summary

Test Exam


