Disaster Planning and Recovery Strategies

This is post 6 of 7 in the series “Designing Reliable Solution in Google Cloud”

Choosing Google Cloud Storage and Data Solutions
Choosing a Google Cloud Deployment Platform
Designing Google Cloud Networks
Designing Reliable Systems
GCP Compute Engine Instance Lifecycle
Disaster Planning and Recovery Strategies
Designing Secured Systems in GCP

Disaster planning involves creating strategies and mechanisms to mitigate the impact of unforeseen events, such as natural disasters, system failures, cyberattacks, or human errors, on the functionality, availability, and integrity of computer systems and data. It aims to ensure business continuity and minimize downtime in the face of adverse circumstances.

Disaster Planning

High Availability Using Instance Groups

High availability can be achieved by deploying to multiple zones in a region. With Compute Engine, you can use a regional managed instance group, which provides built-in functionality to keep instances running. Use autohealing with an application health check, and use load balancing to distribute load.
For data, the storage solution you select determines what is needed to achieve high availability. Cloud SQL, for example, can be configured for high availability, which provides data redundancy and a standby instance of the database server in another zone.

High Availability Using GKE

Google Kubernetes Engine clusters can also be deployed to either a single zone or multiple zones. A cluster consists of a master (the control plane) and one or more node pools.
Regional clusters increase the availability of both a cluster's master and its nodes by replicating them across multiple zones of a region.

Creating Health Checks

If you are using instance groups for your service, you should create a health check to enable autohealing. The health check is a test endpoint in your service.
It should indicate that your service is available and ready to accept requests, not just that the server is running.
The challenge in creating a good health check endpoint is that, if your service depends on other back-end services, the endpoint needs to confirm that they are available too. If the services it depends on are not available, your service should not report itself as available.
If a health check fails, the instance group removes the failing instance and creates a new one.
Load balancers can also use health checks to determine which instances to send requests to.
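
As a rough illustration of the idea above, here is a minimal sketch of a health check endpoint that reports "ready" only when every dependency is reachable. It uses only the Python standard library. The `/healthz` path, port 8080, and the `database` and `cache` checks are assumptions made for this example; they do not come from any Google Cloud API, and the placeholder checks must be replaced with real probes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check and return (HTTP status, response body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    ready = all(results.values())
    # 200 tells the health checker the service is ready; 503 tells it it is not.
    return (200 if ready else 503), {"ready": ready, "dependencies": results}


# Hypothetical dependency checks. Replace the lambdas with real probes,
# for example a cheap query against the database the service needs.
DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":  # assumed path for this sketch
            self.send_error(404)
            return
        status, body = evaluate_readiness(DEPENDENCY_CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The health check configured for the instance group or load balancer would then poll the chosen path and treat any non-200 response as unhealthy.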

High Availability for Storage and Database Services

For Google Cloud Storage, you can achieve high availability with multi-region storage buckets if the latency impact is negligible.

Bucket Type	Availability	Price per month (us-central1)
Multi-region	99.95%	$0.026/GB
Single region	99.90%	$0.020/GB

If you are using Cloud SQL and need high availability, you can create a failover replica.

The picture above shows this configuration: the master runs in one zone, and a replica is created in another zone of the same region. Remember that you pay for the extra instance with this design.

Firestore and Spanner both offer single and multi-region deployments. A multi-region location is a general geographical area, such as the United States.

Database	Availability SLA
Firestore single region	99.99%
Firestore multi-region	99.999%
Spanner single region	99.99%
Spanner multi-region (nam3)	99.999%

Data in a multi-region location is replicated across multiple regions. Within a region, data is replicated across zones. Multi-region locations can withstand the loss of entire regions and maintain availability without losing data. The multi-region configurations for both Firestore and Spanner offer five 9s of availability (99.999%), which is less than 6 minutes of downtime per year.
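
To make the "five 9s" figure concrete, the short sketch below converts an availability percentage into the maximum downtime it allows per year. The function name is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def max_downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Availability levels taken from the tables in this post.
for label, availability in [
    ("99.90%", 99.9),
    ("99.95%", 99.95),
    ("99.99%", 99.99),
    ("99.999%", 99.999),
]:
    minutes = max_downtime_minutes_per_year(availability)
    print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

At 99.999% the budget works out to roughly 5.26 minutes per year, while 99.99% allows about 52.56 minutes.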

Risk/Cost Analysis

Deploying for high availability increases costs because extra resources are used. It is important that you consider the costs of your architectural decisions as part of your design process. Don’t just estimate the cost of the resources used, but also consider the cost of your service being down.

Deployment	Estimate Cost	Availability %	Cost of being down
Single zone
Multiple zones in a region
Multiple regions

A table like this is an effective way to assess risk versus cost: list the deployment options and balance the cost of each against the cost of being down.
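
One way to fill in such a table is to add the expected cost of downtime to the infrastructure cost for each option. The sketch below does that. Every number in it (infrastructure costs, availability percentages, and cost per hour of downtime) is a hypothetical placeholder chosen for illustration, not a Google Cloud price or SLA.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


@dataclass(frozen=True)
class DeploymentOption:
    name: str
    annual_infra_cost: float     # hypothetical, in dollars
    availability_percent: float  # hypothetical estimate, not an SLA

    def expected_downtime_hours(self) -> float:
        return HOURS_PER_YEAR * (1 - self.availability_percent / 100)

    def total_annual_cost(self, cost_per_hour_down: float) -> float:
        """Infrastructure cost plus the expected cost of being down."""
        return self.annual_infra_cost + self.expected_downtime_hours() * cost_per_hour_down


# Hypothetical figures for the three rows of the table.
OPTIONS = [
    DeploymentOption("Single zone", 10_000, 99.5),
    DeploymentOption("Multiple zones in a region", 20_000, 99.9),
    DeploymentOption("Multiple regions", 45_000, 99.99),
]


def cheapest_option(cost_per_hour_down: float) -> DeploymentOption:
    return min(OPTIONS, key=lambda o: o.total_annual_cost(cost_per_hour_down))


for cost_per_hour in (100, 5_000):  # hypothetical cost of one hour of downtime
    best = cheapest_option(cost_per_hour)
    print(f"Downtime costs ${cost_per_hour}/hour -> cheapest overall: {best.name}")
```

With these placeholder numbers, the single-zone deployment is cheapest when downtime is inexpensive, and the multi-region deployment becomes cheapest once an hour of downtime is costly. That is exactly the trade-off the table is meant to surface.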

Brainstorm Scenarios (that might cause data loss and/or service failure)

What could happen that would cause a failure?
What is the Recovery Point Objective (amount of data that would be acceptable to lose)?
What is the Recovery Time Objective (amount of time it can take to be back up and running)?

Answering these questions gives structure to the different scenarios and helps you prioritize them.
For example:

Service	Scenario	Recovery Point Objective (RPO)	Recovery Time Objective (RTO)	Priority
Product Rating Service	Programmer deleted all ratings accidentally	24 hours	1 hour	Medium
Orders Service	Database server crashed	0	1 minute	High

Formulate a plan to recover.
For example:

Resource	Backup Strategy	Backup Location	Recovery Procedure
Ratings MySQL Database	Daily automated backups	Multi-regional Cloud Storage bucket	Run the restore script
Orders Spanner database	Multi-region deployment	us-east1 backup region	Snapshot and backup at regular intervals, outside of the serving infrastructure; e.g. Cloud Storage.

The procedure should be tested and validated regularly, at least once per year. Ideally, recovery becomes part of daily operations, which helps streamline the process.
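
When you test a recovery procedure, it helps to compare the results against the objectives you set. The sketch below checks a drill against an RPO and RTO. It assumes periodic backups, so the worst-case data loss equals the backup interval. The class names and the 90-minute restore time are hypothetical values for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjective:
    service: str
    rpo: timedelta  # acceptable amount of data loss
    rto: timedelta  # acceptable time to be back up and running


@dataclass(frozen=True)
class DrillResult:
    backup_interval: timedelta        # worst-case data loss with periodic backups
    measured_restore_time: timedelta  # how long the restore took in the drill


def evaluate_drill(objective: RecoveryObjective, result: DrillResult) -> List[str]:
    """Return a list of problems; an empty list means both objectives were met."""
    problems = []
    if result.backup_interval > objective.rpo:
        problems.append(
            f"{objective.service}: backups every {result.backup_interval} exceed the RPO of {objective.rpo}"
        )
    if result.measured_restore_time > objective.rto:
        problems.append(
            f"{objective.service}: restore took {result.measured_restore_time}, over the RTO of {objective.rto}"
        )
    return problems


# Objectives from the Product Rating Service row above; the drill timing is made up.
ratings = RecoveryObjective("Product Rating Service", rpo=timedelta(hours=24), rto=timedelta(hours=1))
drill = DrillResult(backup_interval=timedelta(hours=24), measured_restore_time=timedelta(minutes=90))

for problem in evaluate_drill(ratings, drill):
    print(problem)
```

In this made-up drill, daily backups satisfy the 24-hour RPO, but a 90-minute restore misses the 1-hour RTO, so the restore script or the procedure around it would need work.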

Disaster Recovery Strategies

Strategy 1 – Cold Standby

A simple disaster recovery strategy is to have a cold standby. Create snapshots of persistent disks, machine images, and data backups, and store them in multi-region storage.

These snapshots can be used to recreate the system. If the main region fails, you can spin up the service in the backup region from the snapshots, machine images, and data backups. You will have to route requests to the new region, and it is vital to document and test this recovery procedure regularly.

Strategy 2 – Hot Standby

Another disaster recovery strategy is to have a hot standby, where instance groups exist in multiple regions, and traffic is forwarded with a global load balancer.

You can also implement this for data storage services like multi-regional Cloud Storage buckets and database services like Spanner and Firestore.

Store unstructured data in multi-region buckets. For structured data, use a multi-region database such as Spanner or Firestore.
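
The routing behaviour behind a hot standby can be pictured with a small simulation. This is a deliberately simplified sketch and not how a global load balancer is implemented: a real load balancer also considers proximity to the user and backend capacity. The function name and the health map are assumptions for this example.

```python
from typing import Dict, List, Optional


def choose_region(preferred_order: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy region in preference order, or None if all are down."""
    for region in preferred_order:
        # A region with no reported health status is treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None


regions = ["us-central1", "us-east1"]

# Normal operation: traffic goes to the preferred region.
print(choose_region(regions, {"us-central1": True, "us-east1": True}))

# The main region fails its health checks: traffic moves to the standby.
print(choose_region(regions, {"us-central1": False, "us-east1": True}))
```

Because the standby region is already running, failover here is only a routing decision, which is what keeps recovery time so much shorter than with a cold standby.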

Prepare the teams for disaster (by using drills)

Planning

What can go wrong with your system?
What are your plans to address each scenario?
Document the plans.

Practice

Can be in production or a test environment as appropriate.
Assess the risks carefully.
Balance against the risk of not knowing your system’s weaknesses.

At each stage, assess the risks carefully and balance the costs of availability against the cost of unavailability. The cost of unavailability will help you evaluate the risk of not knowing the system’s weaknesses.
