Designing Reliable Systems

  1. Choosing Google Cloud Storage and Data Solutions
  2. Choosing a Google Cloud Deployment Platform
  3. Designing Google Cloud Networks
  4. Designing Reliable Systems
  5. GCP Compute Engine Instance Lifecycle
  6. Disaster Planning and Recovery Strategies
  7. Designing Secured Systems in GCP

This article highlights proven strategies for designing reliable systems, covering single points of failure, correlated failures, cascading failures, query-of-death overload, lazy deletion, and more, along with solutions for each. In the cloud era, designing reliable systems is critical: if a system gains popularity overnight, its user base can grow rapidly, and the design must withstand system and cloud failures. These strategies focus mainly on Google Cloud; however, you can employ them in any cloud, including a private cloud.

Key Performance Metrics


Availability is the percentage of time a system is running and able to process requests. Monitoring is vital to achieving high availability: health checks can detect when an application instance is unhealthy, and more detailed monitoring of services using white-box metrics to count traffic successes and failures helps predict problems before they escalate. Building in fault tolerance, for example by removing single points of failure, is also vital for improving availability. Backup systems play a key role as well.
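As a sketch of the white-box idea, a service can derive a health signal from its own recent success and failure counts, so a load balancer can stop routing to an instance before it fails outright. The class and threshold below are illustrative, not a Google Cloud API:

```python
from collections import deque

class HealthCheck:
    """Reports unhealthy when the error rate over a sliding window of
    recent requests crosses a threshold (illustrative, not a GCP API)."""

    def __init__(self, window=100, max_error_rate=0.05):
        self.results = deque(maxlen=window)  # recent request outcomes
        self.max_error_rate = max_error_rate

    def record(self, success: bool):
        self.results.append(success)

    def is_healthy(self) -> bool:
        if not self.results:
            return True  # no traffic yet; assume healthy
        errors = self.results.count(False)
        return errors / len(self.results) <= self.max_error_rate

hc = HealthCheck(window=10, max_error_rate=0.2)
for ok in [True] * 8 + [False] * 2:
    hc.record(ok)
print(hc.is_healthy())  # True: 2/10 = 0.2 error rate, at the threshold
hc.record(False)
print(hc.is_healthy())  # False: the window slides to 3/10 errors
```

A real health endpoint would expose this signal over HTTP so the load balancer or instance group can poll it.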

Durability is the risk of losing data because of a hardware or system failure. Ensuring that data is preserved and available requires a mixture of replication and backup: data can be replicated in multiple zones, and regular restores from backup should be performed to confirm that the recovery process works as expected.

Scalability is the ability of a system to continue to work as user load and data grow. Monitoring and autoscaling should be used to respond to variations in load. The metrics for scaling can be the standard metrics, like CPU or memory, or you can create custom metrics, like number of players on a game server.
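The scaling decision an autoscaler makes from such a metric is simple arithmetic. A minimal sketch, assuming an autoscaler that targets a fixed per-instance value for a custom metric (the function name and numbers are made up for illustration):

```python
import math

def target_instances(metric_total, target_per_instance, min_instances=1):
    """Instance count needed so each instance stays at or below the
    per-instance target for a custom metric, e.g. players per game
    server. Mirrors proportional autoscaling; names are illustrative."""
    needed = math.ceil(metric_total / target_per_instance)
    return max(min_instances, needed)

# 930 concurrent players, target of 100 players per game server:
print(target_instances(930, 100))  # 10
```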

Designing for Reliability

Avoid single point of failure

  • Avoid single points of failure by replicating data and creating multiple virtual machine instances.
  • It is important to define your unit of deployment and understand its capabilities.
  • To avoid single points of failure, you should deploy two extra instances or N+2 to handle both failure and upgrades.
  • These deployments should ideally be in different zones to mitigate for zonal failures.
  • Don’t make any single deployment unit too large.
    • For example: Consider three VMs that are load balanced to achieve N plus two. If one is being upgraded and another fails, 50 percent of the available capacity of the compute is removed, which potentially doubles the load on the remaining instance and increases the chance of that failing. This is where capacity planning and knowing the capability of your deployment unit is important.
  • Also, for ease of scaling, it is a good practice to make the deployment units interchangeable, stateless clones.
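The N+2 sizing and capacity math above can be sketched as follows (the function names and numbers are illustrative):

```python
import math

def instances_for_n_plus_2(peak_load_qps, per_instance_qps):
    """N+2 sizing: N instances carry the peak load, plus two more to
    cover one instance down for upgrade and one unplanned failure."""
    n = math.ceil(peak_load_qps / per_instance_qps)
    return n + 2

def worst_case_utilization(peak_load_qps, per_instance_qps, total):
    """Per-instance utilization with one instance upgrading and one failed."""
    return peak_load_qps / ((total - 2) * per_instance_qps)

total = instances_for_n_plus_2(2000, 1000)
print(total)  # 4: N=2 carries the load, plus 2 for upgrade and failure
# With one upgrading and one failed, the remaining N run at full capacity:
print(worst_case_utilization(2000, 1000, total))  # 1.0
```

Smaller, interchangeable units mean losing any single one removes a smaller share of capacity, which is why the three-VM example above is fragile.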

Beware of correlated failures

  • These occur when related items fail at the same time. At the simplest level, if a single machine fails, all requests served by that machine fail.
  • At a hardware level, if a top-of-rack switch fails, the complete rack fails. At the cloud level, if a zone or region is lost, all the resources are unavailable. Servers running the same software suffer from the same issue. If there’s a fault in the software, the service may fail at a similar time.
  • Correlated failures can also apply to configuration data. If a global configuration system fails and multiple systems depend on it, they potentially fail too.
  • When we have a group of related items that could fail together, we refer to it as a failure or fault domain.

To avoid correlated failures

  • Decouple servers and use microservices distributed among multiple failure domains.
    • Divide business logic into services based on failure domains.
    • Deploy to multiple zones and/or regions.
    • Split responsibility into components and spread over multiple processes.
    • Design independent, loosely coupled but collaborating services. A failure in one service should not cause a failure in another service.

Beware of cascading failures

  • Cascading failures occur when one system fails, causing others to be overloaded and subsequently fail. For example, a message queue could be overloaded because a backend fails and it cannot process messages placed on the queue.
  • For example: A cloud load balancer distributing load across two backend servers. Each server can handle a maximum of 1,000 queries per second. The load balancer is currently sending 600 queries per second to each instance. If server B now fails, all 1,200 queries per second have to be sent to just server A. This is much higher than the specified maximum and could lead to a cascading failure.
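The load-balancer arithmetic in this example can be checked with a short sketch (names are illustrative):

```python
def redistribute(load_per_server, servers, failed):
    """Load per surviving server after `failed` servers go down,
    assuming the balancer spreads the total evenly."""
    total = load_per_server * servers
    return total / (servers - failed)

MAX_QPS = 1000  # specified per-server maximum from the example
per_server = redistribute(load_per_server=600, servers=2, failed=1)
print(per_server)            # 1200.0 qps on the surviving server
print(per_server > MAX_QPS)  # True: over capacity, risking a cascade
```

With four servers at the same total load, the same failure leaves each survivor at 800 qps, comfortably under the maximum.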


Avoid cascading failures

  • Cascading failures can be handled with support from the deployment platform. For example, you can use health checks in Compute Engine or readiness and liveness probes in GKE to enable the detection and repair of unhealthy instances. You want to ensure that new instances start fast and ideally do not rely on other backends or systems to start up before they are ready.
  • With sufficient spare capacity, the failure of one server can be absorbed by the remaining servers based on the current traffic. If the system uses Compute Engine with managed instance groups and autohealing, the failed server is automatically replaced with a new instance.


Query of death overload

  • You also want to plan against a query of death, where a request made to a service causes a failure in the service. It is called a query of death because the error manifests itself as overconsumption of resources, when in reality it is due to an error in the business logic itself.
  • This can be difficult to diagnose and requires good monitoring, observability, and logging to determine the root cause of the problem.
  • Latency, resource utilization, and error rates should be monitored as requests are made to help identify the problem.


Positive feedback cycle overload failure

  • This is a problem caused by trying to prevent problems: you try to make the system more reliable by adding retries in the event of a failure, but instead of fixing the failure, the retries create the potential for overload. You may actually be adding more load to an already overloaded system.
  • The solution is intelligent retries that make use of feedback from the service that is failing. Two strategies to avoid this:
    • If a service fails, it is okay to try again:
      • Continue to retry, but wait a while between attempts.
      • Wait a little longer each time the request fails.
      • Set a maximum length of time and maximum number of requests.
      • Eventually, give up.
      • Example:
        • Request fails; wait 1 second + random_number_milliseconds and retry.
        • Request fails; wait 2 seconds + random_number_milliseconds and retry.
        • Request fails; wait 4 seconds + random_number_milliseconds and retry.
        • And so on, up to a maximum_backoff time.
        • Continue waiting and retrying up to some maximum number of retries.
    • The circuit breaker pattern
      • Plan for degraded state operations.
      • If a service is down and all its clients are retrying, the increasing number of requests can make matters worse.
        • Protect the service behind a proxy that monitors service health (the circuit breaker).
        • If the service is not healthy, don’t forward requests to it.
      • If using GKE, leverage Istio to automatically implement circuit breakers.
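The truncated exponential backoff schedule above can be sketched as a retry helper. The function name and parameters are illustrative, and the sleep function is injectable so the schedule can be inspected without actually waiting:

```python
import random
import time

def call_with_backoff(operation, max_retries=5, max_backoff=32.0,
                      sleep=time.sleep):
    """Retry `operation` with truncated exponential backoff: wait
    2**attempt seconds plus random jitter between attempts, cap the
    wait at max_backoff, and eventually give up after max_retries."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # eventually give up
            # 1s, 2s, 4s, ... plus jitter, up to a maximum backoff
            delay = min(2 ** attempt, max_backoff) + random.random()
            sleep(delay)
```

The random jitter spreads retries from many clients over time, so a recovering service is not hit by a synchronized wave of requests.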
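The circuit breaker pattern can likewise be sketched in a few lines. This is a simplified illustration (a consecutive-failure threshold and a single reset timeout), not Istio's actual implementation:

```python
import time

class CircuitBreaker:
    """After `failure_threshold` consecutive failures the breaker opens
    and calls fail fast without reaching the service; after
    `reset_timeout` seconds one trial call is let through to probe
    recovery. Names and thresholds are illustrative."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

While the circuit is open, clients fail fast instead of piling more requests onto the unhealthy service, giving it room to recover.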

Lazy deletion to reliably recover (when user deletes data by mistake)

  • Lazy deletion is a method that builds in the ability to reliably recover data when a user deletes the data by mistake.
  • In the first stage, the user deletes the data, but it can still be restored within a pre-defined time period, for example 30 days. This protects against mistakes by the user.
  • When that period is over, the data is no longer visible to the user but moves to the soft deletion phase, where it can be restored by user support or administrators. This phase protects against mistakes in the application.
  • After the soft deletion period of 15, 30, 45, or even 60 days, the data is deleted and no longer available. At that point the only way to restore it is from whatever backups or archives were made of the data.
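The three stages above map naturally onto a small helper that classifies deleted data by its age; the 30-day periods and stage names are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Illustrative periods: user-restorable for 30 days after deletion,
# then soft-deleted (restorable by support/admins) for another 30 days.
USER_RESTORE_DAYS = 30
SOFT_DELETE_DAYS = 30

def deletion_stage(deleted_at: datetime, now: datetime) -> str:
    age = now - deleted_at
    if age <= timedelta(days=USER_RESTORE_DAYS):
        return "user-restorable"  # the user can undo the mistake
    if age <= timedelta(days=USER_RESTORE_DAYS + SOFT_DELETE_DAYS):
        return "soft-deleted"     # support or admins can still restore
    return "destroyed"            # only backups or archives remain

d = datetime(2024, 1, 1)
print(deletion_stage(d, d + timedelta(days=10)))  # user-restorable
print(deletion_stage(d, d + timedelta(days=45)))  # soft-deleted
print(deletion_stage(d, d + timedelta(days=90)))  # destroyed
```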