Availability

Availability refers to a system's ability to remain operational and accessible in the face of failures or unexpected issues.

Factors Affecting Availability

Several factors and architectural patterns can impact a system's availability:

  • Network Connectivity: Network disruptions in a data center or hosting region can block user access. Deploying services across multiple Availability Zones (AZs), or across multiple regions, mitigates this.
  • Hardware and Software Failures: Data centers hosting your services can experience hardware failures. Without redundancy and data replication, this will disrupt services. Similarly, software bugs can crash application instances, leading to downtime.
  • Scalability Issues: If a system is not designed to scale dynamically with user demand, sudden traffic spikes can cause high latency or dropped requests.
  • Security Breaches: Attacks (such as Distributed Denial of Service, or DDoS) can compromise system integrity, overwhelm resources, or force temporary shutdowns to contain the breach.
  • External Dependencies: Systems often rely on third-party APIs, payment gateways, or cloud storage. If these external dependencies experience downtime, it directly impacts your system's availability.

Availability Patterns

These established architectural patterns help keep a system operational and accessible during failures or unexpected issues.

Failover Pattern

The failover pattern involves maintaining a backup component or system to replace the primary system if a failure occurs.

The primary system handles all incoming requests and is continuously monitored. If a failure is detected, traffic is automatically redirected to the secondary component, which is promoted to the primary role. This ensures minimal service disruption.

Here are the main ways to implement this pattern:

  • Active-Passive: The secondary component does not process active requests. Instead, it monitors the primary system using "heartbeat" messages. If the secondary system stops receiving heartbeats for a specified period, it assumes the primary is down and automatically promotes itself to primary.
  • Active-Active: Multiple systems handle traffic simultaneously, and a load balancer distributes requests among them. If one system fails, the load balancer detects it and routes all traffic to the remaining operational systems.
  • Hot-Standby: The secondary component runs simultaneously alongside the primary system. It receives real-time data replication from the primary, ensuring its state is completely up-to-date. If the primary fails, the hot-standby takes over immediately with near-zero downtime or data loss.
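
The heartbeat check described for the Active-Passive variant can be sketched as follows. This is a minimal illustration, not a production design: the `StandbyNode` class, its field names, and the 3-second timeout are all assumptions made for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class StandbyNode:
    """Hypothetical passive node that promotes itself when heartbeats stop."""
    heartbeat_timeout: float       # seconds of silence tolerated (assumed value)
    last_heartbeat: float = 0.0    # timestamp of the most recent heartbeat
    role: str = "secondary"

    def receive_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat from the primary arrives.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # Promote only when the silence exceeds the configured timeout.
        silence = now - self.last_heartbeat
        if self.role == "secondary" and silence > self.heartbeat_timeout:
            self.role = "primary"
        return self.role


standby = StandbyNode(heartbeat_timeout=3.0)
standby.receive_heartbeat(now=10.0)
role_while_healthy = standby.check(now=12.0)   # 2.0 s of silence: stays secondary
role_after_silence = standby.check(now=14.5)   # 4.5 s of silence: promotes itself
```

A shorter timeout detects failures faster but raises the risk of promoting the secondary while the primary is only slow, which is one reason the timeout is usually tuned per system.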

Trade-offs: Failover adds hardware costs and can still cause data loss or brief downtime, depending on how long the secondary component takes to replace the primary.

Replication Pattern

The replication pattern involves copying data across different locations so it can be retrieved from an alternate source if a primary location fails.

Master-Master Replication

In this model, multiple servers act as masters and synchronize writes between each other. If one server goes down, the system continues to handle both reads and writes using the remaining infrastructure.

  • Requirements: Needs a load balancer or specific application logic to route traffic across multiple servers.
  • Consistency Challenges: Synchronizing multiple servers can weaken ACID compliance or cause high latency due to the coordination overhead.
  • Data Loss Risks: Data can be lost if one server fails before its changes have been synchronized with the other masters.
  • Conflict Resolution: Resolving write conflicts between servers increases complexity and latency as more nodes are added to the system.
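
To make the conflict-resolution cost concrete, here is a small sketch of two masters reconciling conflicting writes. It uses last-write-wins, which is only one of several possible strategies and is chosen here as an assumption for brevity. The `Master` class and the integer timestamps are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master node storing key -> (timestamp, value)."""
    name: str
    store: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        self.store[key] = (timestamp, value)

    def merge_from(self, other):
        # Last-write-wins: keep whichever version has the larger timestamp.
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)


def synchronize(a, b):
    # Exchange state in both directions so the two masters converge.
    a.merge_from(b)
    b.merge_from(a)


m1, m2 = Master("m1"), Master("m2")
m1.write("cart", ["book"], timestamp=1)
m2.write("cart", ["pen"], timestamp=2)   # conflicting write on the other master
synchronize(m1, m2)
```

After synchronization both masters hold the later write, and the earlier `["book"]` value is silently discarded. That is the kind of weakened consistency the list above warns about.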

Master-Slave Replication

In this model, a single designated master server handles all writes and propagates those changes to multiple slave (read) servers. Slaves are read-only and can pass data down to their own child nodes. If the master server fails, one of the slave servers is promoted to take its place.

  • Complex Failover: Additional logic and coordination are required to promote a slave to a master automatically.
  • Downtime Hazards: Writes are unavailable while a slave is being promoted, and any writes that had not yet reached the promoted slave when the master failed are lost.
  • Performance Impact: Replaying master write logs onto read replicas can bog them down, decreasing overall read performance.
  • Replication Lag: Adding more slave servers increases replication overhead, which can cause significant synchronization delays (replication lag).
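
Replication lag and the data-loss window during promotion can be illustrated with a toy model. This sketch assumes asynchronous, log-based replication and promotes the most up-to-date replica. The `Cluster` and `Replica` classes are hypothetical and do not reflect any particular database.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0   # number of master log entries applied so far


class Cluster:
    def __init__(self, replica_count):
        self.log = []      # the master's ordered write log
        self.replicas = [Replica() for _ in range(replica_count)]

    def write(self, key, value):
        # Only the master accepts writes; replicas catch up later.
        self.log.append((key, value))

    def replicate(self, replica, max_entries):
        # Replay up to max_entries pending log entries on one replica.
        pending = self.log[replica.applied:replica.applied + max_entries]
        for key, value in pending:
            replica.data[key] = value
        replica.applied += len(pending)

    def lag(self, replica):
        # Replication lag, measured in unapplied log entries.
        return len(self.log) - replica.applied

    def promote_most_current(self):
        # On master failure, pick the replica that applied the most entries.
        new_master = max(self.replicas, key=lambda r: r.applied)
        lost_writes = self.log[new_master.applied:]
        return new_master, lost_writes


cluster = Cluster(replica_count=2)
for i in range(5):
    cluster.write(f"k{i}", i)
cluster.replicate(cluster.replicas[0], max_entries=3)
cluster.replicate(cluster.replicas[1], max_entries=1)
new_master, lost_writes = cluster.promote_most_current()
```

Here the first replica is promoted because it lags by two entries rather than four, and those two unreplicated writes are what the system loses.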

Measuring Availability

Availability is the fraction of time a system remains operational: total uptime divided by the sum of uptime and downtime. It is usually expressed as a percentage.

$$\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}$$

Availability is commonly expressed in "nines." For example, 99.999% availability is referred to as "five nines."

Nines of Availability
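
The downtime budget implied by each level of nines follows directly from the availability formula. The sketch below assumes a 365-day year, and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def availability(uptime: float, downtime: float) -> float:
    """Fraction of total time the system was up."""
    return uptime / (uptime + downtime)


def yearly_downtime_seconds(nines: int) -> float:
    """Downtime budget per year for a given number of nines."""
    unavailability = 10 ** -nines      # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


for n in range(1, 6):
    percent = 100 * (1 - 10 ** -n)
    minutes = yearly_downtime_seconds(n) / 60
    print(f"{n} nines ({percent:.3f}%): {minutes:,.1f} minutes of downtime per year")
```

Under this assumption, three nines (99.9%) allows about 8.76 hours of downtime per year, while five nines (99.999%) allows about 5.26 minutes.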

Availability in Parallel vs. in Sequence

If a service consists of multiple components prone to failure, the service's overall availability depends on whether the components are in sequence or in parallel.

  • Sequence: Availability (Total) = Availability (Foo) * Availability (Bar)
  • Parallel: Availability (Total) = 1 - (1 - Availability (Foo)) * (1 - Availability (Bar))
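
As a worked example of the two formulas, the sketch below generalizes them from two components (Foo and Bar) to any number of components, which is an extension beyond what is stated above. It assumes component failures are independent.

```python
import math


def sequence_availability(*availabilities: float) -> float:
    # Every component must be up, so the availabilities multiply.
    return math.prod(availabilities)


def parallel_availability(*availabilities: float) -> float:
    # The service is down only if every component is down at once.
    return 1 - math.prod(1 - a for a in availabilities)


foo, bar = 0.999, 0.999
in_sequence = sequence_availability(foo, bar)
in_parallel = parallel_availability(foo, bar)
print(f"sequence: {in_sequence:.6f}, parallel: {in_parallel:.6f}")
```

With two 99.9% components, chaining them in sequence lowers total availability to about 99.8%, while running them in parallel raises it to about 99.9999%.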