Apr 19, 2024

Disaster Planning and Recovery Strategies

  1. Choosing Google Cloud Storage and Data Solutions
  2. Choosing a Google Cloud Deployment Platform
  3. Designing Google Cloud Networks
  4. Designing Reliable Systems
  5. GCP Compute Engine Instance Lifecycle
  6. Disaster Planning and Recovery Strategies
  7. Designing Secured Systems in GCP

Disaster planning involves creating strategies and mechanisms to mitigate the impact of unforeseen events, such as natural disasters, system failures, cyberattacks, or human errors, on the functionality, availability, and integrity of computer systems and data. It aims to ensure business continuity and minimize downtime in the face of adverse circumstances.

Disaster Planning

High Availability Using Instance Groups

  • High availability can be achieved by deploying to multiple zones in a region. When using Compute Engine, you can use a regional managed instance group, which provides built-in functionality to keep instances running. Use autohealing with an application health check, and use load balancing to distribute load.
  • For data, the storage solution selected will affect what is needed to achieve high availability. For Cloud SQL, the database can be configured for high availability which provides data redundancy and a standby instance of the database server in another zone.

[Figure: ha-instance-groups]

High Availability Using GKE

  • Google Kubernetes Engine clusters can also be deployed to either a single zone or multiple zones. A cluster consists of a control plane (the master) and one or more node pools.
  • Regional clusters increase the availability of both a cluster's control plane and its nodes by replicating them across multiple zones of a region.

Creating Health Checks

  • If you are using instance groups for your service, you should create a health check to enable auto healing. The health check is a test endpoint in your service.
  • It should indicate that your service is available and ready to accept requests and not just that the server is running.
  • A challenge with creating a good health check endpoint is that if your service uses other back-end services, the endpoint needs to check that they are available before it can positively confirm that your service is ready. If the services it depends on are not available, your service should report itself as unavailable.
  • If a health check fails, the instance group will remove the failing instance and create a new one.
  • Health checks can also be used by the load balancers to determine which instances to send requests to.
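
As a rough illustration of the points above, here is a minimal sketch of a health check endpoint that reports healthy only when its dependencies respond. The `/healthz` path, port 8080, and the `check_database` and `check_cache` probes are assumptions made for this example; they are not part of any Google Cloud API, and a real service would replace the probes with cheap calls to its actual back ends.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical dependency probes. Replace each with a real, cheap check,
# for example a "SELECT 1" against the database.
def check_database() -> bool:
    return True


def check_cache() -> bool:
    return True


DEPENDENCIES: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "cache": check_cache,
}


def readiness(dependencies: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (HTTP status, body): 200 only if every dependency probe passes."""
    failed = []
    for name, probe in dependencies.items():
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises counts as unavailable
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unavailable: " + ", ".join(failed)
    return 200, "ok"


class HealthHandler(BaseHTTPRequestHandler):
    """Serves the readiness result on the assumed /healthz path."""

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = readiness(DEPENDENCIES)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# To serve the endpoint (this call blocks forever):
# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 503 when a dependency is down means the instance is treated as unhealthy rather than merely "running", which is the distinction the list above draws. One trade-off to weigh: if a shared dependency fails, every instance will fail its check at once, so autohealing may recreate instances without fixing anything.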

High Availability for Storage and Database Services

For Google Cloud Storage, you can achieve high availability with multi-region storage buckets if the latency impact is negligible.

Bucket Type   | Availability | Price (us-central1)
Multi-region  | 99.95%       | $0.026/GB
Single region | 99.90%       | $0.020/GB

If you are using Cloud SQL and need high availability, you can create a failover replica.
[Figure: ha-cloud-sql]
The above picture shows the configuration where a master is configured in one zone and a replica is created in another zone but in the same region. Remember that you are paying for the extra instance with this design.

Firestore and Spanner both offer single and multi-region deployments. A multi-region location is a general geographical area, such as the United States.

Database                    | Availability SLA
Firestore single region     | 99.99%
Firestore multi-region      | 99.999%
Spanner single region       | 99.99%
Spanner multi-region (nam3) | 99.999%

Data in a multi-region location is replicated across multiple regions. Within a region, data is replicated across zones. Multi-region locations can withstand the loss of entire regions and maintain availability without losing data. The multi-region configurations for both Firestore and Spanner offer five nines (99.999%) of availability, which allows less than 6 minutes of downtime per year.
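
To make the availability figures concrete, the short sketch below converts an availability percentage into the maximum downtime it allows per year. The function name is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year (in minutes) allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Availability figures taken from the tables in this section.
for label, pct in [
    ("Single-region bucket", 99.90),
    ("Multi-region bucket", 99.95),
    ("Firestore/Spanner single region", 99.99),
    ("Firestore/Spanner multi-region", 99.999),
]:
    print(f"{label}: {pct}% allows about {max_downtime_minutes(pct):.1f} minutes/year")
```

For 99.999% this works out to roughly 5.3 minutes per year, and for 99.99% to roughly 53 minutes, so each additional nine cuts the downtime budget by a factor of ten.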

Risk/Cost Analysis

Deploying for high availability increases costs because extra resources are used. It is important that you consider the costs of your architectural decisions as part of your design process. Don’t just estimate the cost of the resources used, but also consider the cost of your service being down.

Deployment                 | Estimated Cost | Availability % | Cost of being down
Single zone                |                |                |
Multiple zones in a region |                |                |
Multiple regions           |                |                |

Filling in this table is an effective way to assess risk versus cost: it sets the cost of each deployment option against the cost of being down.
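
One way to work through such a table is to add the expected cost of downtime to the cost of the resources for each option. The sketch below does this; every number in it (infrastructure costs, availability percentages, and hourly downtime costs) is a made-up placeholder and not a Google Cloud price or SLA, so substitute your own estimates.

```python
from dataclasses import dataclass
from typing import List

HOURS_PER_YEAR = 365 * 24


@dataclass
class Option:
    name: str
    annual_infra_cost: float      # estimated resource cost per year
    availability_percent: float   # expected availability of the deployment


def expected_annual_cost(option: Option, downtime_cost_per_hour: float) -> float:
    """Resource cost plus the expected cost of being down for a year."""
    downtime_hours = (1 - option.availability_percent / 100) * HOURS_PER_YEAR
    return option.annual_infra_cost + downtime_hours * downtime_cost_per_hour


def cheapest(options: List[Option], downtime_cost_per_hour: float) -> Option:
    """Pick the option with the lowest combined cost."""
    return min(options, key=lambda o: expected_annual_cost(o, downtime_cost_per_hour))


# Placeholder estimates for illustration only.
OPTIONS = [
    Option("Single zone", 10_000, 99.5),
    Option("Multiple zones in a region", 20_000, 99.99),
    Option("Multiple regions", 45_000, 99.999),
]

for hourly_cost in (100, 1_000, 1_000_000):
    best = cheapest(OPTIONS, hourly_cost)
    print(f"Downtime cost {hourly_cost}/hour: choose {best.name}")
```

With these placeholder figures, the cheapest option moves from a single zone to multiple zones to multiple regions as the hourly cost of downtime grows, which is why the cost of being down belongs in the same table as the cost of the resources.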

Brainstorm Scenarios (that might cause data loss and/or service failure)

What could happen that would cause a failure?
What is the Recovery Point Objective (amount of data that would be acceptable to lose)?
What is the Recovery Time Objective (amount of time it can take to be back up and running)?

This can be helpful to provide structure on the different scenarios and to prioritize them accordingly.
For example:

Service                | Scenario                                    | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) | Priority
Product Rating Service | Programmer deleted all ratings accidentally | 24 hours                       | 1 hour                        | Medium
Orders Service         | Database server crashed                     | 0                              | 1 minute                      | High

Formulate a plan to recover.
For example:

Resource                | Backup Strategy         | Backup Location                     | Recovery Procedure
Ratings MySQL Database  | Daily automated backups | Multi-regional Cloud Storage bucket | Run the restore script
Orders Spanner database | Multi-region deployment | us-east1 backup region              | Snapshot and backup at regular intervals, outside of the serving infrastructure; e.g. Cloud Storage.

The procedure should be tested and validated regularly, at least once per year. Ideally, recovery becomes a part of daily operations, which helps streamline the process.
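
A recovery plan can be checked against its objectives mechanically. The sketch below compares a backup interval with the RPO and a measured restore time with the RTO. The RPO, RTO, and backup interval values come from the example tables above; the restore times are hypothetical stand-ins for what a recovery drill would measure, and the class and function names are invented for this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RecoveryPlan:
    service: str
    rpo: timedelta                    # maximum acceptable data loss
    rto: timedelta                    # maximum acceptable time to recover
    backup_interval: timedelta        # zero for synchronous replication
    measured_restore_time: timedelta  # taken from the latest recovery drill


def violations(plan: RecoveryPlan) -> List[str]:
    """List the objectives that the plan does not currently meet."""
    problems = []
    # Worst case, the failure happens just before the next backup,
    # so up to one full backup interval of data is lost.
    if plan.backup_interval > plan.rpo:
        problems.append("backup interval exceeds RPO")
    if plan.measured_restore_time > plan.rto:
        problems.append("restore time exceeds RTO")
    return problems


ratings = RecoveryPlan(
    service="Product Rating Service",
    rpo=timedelta(hours=24),
    rto=timedelta(hours=1),
    backup_interval=timedelta(hours=24),         # daily automated backups
    measured_restore_time=timedelta(minutes=45), # hypothetical drill result
)
orders = RecoveryPlan(
    service="Orders Service",
    rpo=timedelta(0),
    rto=timedelta(minutes=1),
    backup_interval=timedelta(0),                # multi-region replication
    measured_restore_time=timedelta(minutes=5),  # hypothetical drill result
)

for plan in (ratings, orders):
    print(plan.service, violations(plan) or "meets objectives")
```

With these assumed drill results, the ratings plan meets both objectives, while the orders plan misses its one-minute RTO. That is the kind of gap regular testing is meant to surface before a real failure does.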

Disaster Recovery Strategies

Strategy 1 – Cold Standby

A simple disaster recovery strategy is to have a cold standby. Create snapshots of persistent disks, machine images, and data backups, and store them in multi-region storage.
[Figure: dr-cold-standby]
Snapshots are taken that could be used to recreate the system. If the main region fails, you can spin up service in the backup region using the snapshot images and persistent disks. You will have to route requests to the new region, and it’s vital to document and test this recovery procedure regularly.

Strategy 2 – Hot Standby

Another disaster recovery strategy is to have a hot standby, where instance groups exist in multiple regions, and traffic is forwarded with a global load balancer.
[Figure: dr-hot-standby]
You can also implement this for data storage services like multi-regional Cloud Storage buckets and database services like Spanner and Firestore.

Store unstructured data in multi-region buckets. For structured data, use a multi-region database such as Spanner or Firestore.

Prepare the teams for disaster (by using drills)

Planning

  • What can go wrong with your system?
  • What are your plans to address each scenario?
  • Document the plans.

Practice

  • Can be in production or a test environment as appropriate.
  • Assess the risks carefully.
  • Balance against the risk of not knowing your system’s weaknesses.

At each stage, assess the risks carefully and balance the costs of availability against the cost of unavailability. The cost of unavailability will help you evaluate the risk of not knowing the system’s weaknesses.