Reliability Patterns
These patterns provide a way to design and implement systems that can withstand failures, maintain high levels of performance, and recover quickly from disruptions.
Availability
Availability is measured as a percentage of uptime, and defines the proportion of time that a system is functional and working. Availability is affected by system errors, infrastructure problems, malicious attacks, and system load. Cloud applications typically provide users with a service level agreement (SLA), which means that applications must be designed and implemented to maximize availability.
- Deployment Stamps
- Geodes
- Throttling
- Health Endpoint Monitoring
- Queue-Based Load Leveling
Deployment Stamps
It involves provisioning, managing, and scaling a group of identical, independent infrastructure stacks (including web servers, databases, and network components) to host a specific subset of tenants or users. Each self-contained infrastructure unit is called a Stamp (or a scale unit, pod, or cell). Instead of scaling a single global infrastructure monolith up to its absolute physical limits, you scale out by stamping out new copies of the entire environment.
┌─────────────────────┐
│ Global Router │
│ (Traffic Director) │
└─────────────────────┘
/ | \
Tenant 1-500 / │ \ Tenant 1001-1500
(US West Region) │ \ (EU Central Region)
▼ Tenant 501-1000 ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Stamp 1 │ │ Stamp 2 │ │ Stamp 3 │
│ (App + DB)│ │ (App + DB)│ │ (App + DB)│
└───────────┘ └───────────┘ └───────────┘
Pros:
- Strict Fault Isolation (Tiny Blast Radius)
- Deterministic Scalability
- Geographic Compliance
- Smooth Update Deployments
Cons:
- High CI/CD DevOps Overhead
- Cross-Stamp Communication Gaps
- Resource Fragmentation and Waste
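The routing step in the diagram above can be sketched as a lookup from tenant ID to stamp. This is a minimal illustration, assuming tenants are assigned by contiguous ID range as in the diagram; the `Stamp` class, the `route` function, and the stamp and region names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stamp:
    name: str
    region: Optional[str]  # None where the diagram does not state a region
    first_tenant: int      # inclusive
    last_tenant: int       # inclusive


# Hypothetical assignment table mirroring the diagram above.
STAMPS = [
    Stamp("stamp-1", "us-west", 1, 500),
    Stamp("stamp-2", None, 501, 1000),
    Stamp("stamp-3", "eu-central", 1001, 1500),
]


def route(tenant_id: int) -> Stamp:
    """Return the stamp that hosts the given tenant."""
    for stamp in STAMPS:
        if stamp.first_tenant <= tenant_id <= stamp.last_tenant:
            return stamp
    # No stamp covers this tenant yet: a new stamp would have to be provisioned.
    raise LookupError(f"no stamp provisioned for tenant {tenant_id}")
```

Scaling out then means appending a new entry to the table rather than resizing an existing stamp.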
Global Endpoint Deployment/Geode
It involves deploying a collection of fully independent, self-contained infrastructure stacks across multiple geographic regions worldwide, where every single region can read and write to a globally synchronized database layer.
┌─────────────────────────┐
│ Global Traffic Router │
│ (Anycast / Route53) │
└─────────────────────────┘
/ | \
Closest Region / │ \ Closest Region
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Region │ │ Region │ │ Region │
│ (Americas)│ │ (Europe) │ │ (Asia) │
└───────────┘ └───────────┘ └───────────┘
▲ ▲ ▲
│ │ │
◄─────┴─────────────┴─────────────┴─────►
Globally Replicated Multi-Master Database
(Cosmos DB / CockroachDB / Spanner)
Pros:
- Ultra-Low Global Latency
- Extreme High Availability
- No Single Point of Failure
Cons:
- Data Consistency Challenges
- Complex Conflict Resolution due to globally replicated multi-master database
- High Infrastructure Cost
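One common policy for the multi-master write conflicts listed above is last-write-wins. The sketch below illustrates that policy only; it is not a description of how any of the databases named in the diagram behaves, and `Versioned` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # write time as reported by the writing region
    region: str       # used only as a deterministic tie-breaker


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write; break exact ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp, v.region))
```

Because every region applies the same rule, all replicas converge on the same winner regardless of the order in which they see the two writes. The trade-off is that the losing write is silently discarded and the result depends on reasonably synchronized clocks.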
Throttling
It involves intentionally limiting the number of requests a user, device, or application tenant can make to an API or service within a given timeframe. It can be used to mitigate DDoS attacks, contain the noisy neighbour problem, and make infrastructure costs predictable. Some common strategies are:
- Rate Limiting (Fixed Window, Token Bucket, or Leaky Bucket): Caps the number of requests per time period (e.g., a maximum of 60 requests per minute). With a fixed-window counter, the 61st request in a window is rejected until the window resets; token bucket and leaky bucket algorithms enforce the same average rate while permitting or smoothing short bursts.
- Concurrent Request Throttling: Limits the number of active, parallel connections a client can hold open at the same time. This protects worker threads from being tied up by slow-running requests.
- Tier-Based Throttling: Varies limits dynamically based on service agreements (e.g., Free Tier accounts are capped at 5 requests/min, while Premium Tier accounts are allocated up to 1,000 requests/min).
[ Client App ]
│
▼ (Sends 61st Request in 1 minute)
┌───────────────────────┐
│ API Gateway / WAF │
└───────────────────────┘
│
▼ (Rejects & Drops Request)
Returns: HTTP 429 Too Many Requests
Headers:
- Retry-After: 15 (Wait 15 seconds)
- X-RateLimit-Limit: 60
- X-RateLimit-Remaining: 0
Pros:
- Guaranteed System Availability
- Monetization Enabler
- Resource Predictability
Cons:
- Frontend User Friction
- State Performance Bottleneck
- Integration Complexity: Client applications must be intentionally engineered to gracefully intercept HTTP 429 codes, implement backoff logic, and retry operations without crashing the user interface.
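The token bucket strategy named above can be sketched as follows. This is a minimal single-process illustration, assuming one bucket per client; the `TokenBucket` class and its parameters are hypothetical, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should get a 429."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `capacity=60` and `refill_per_second=1`, a client can burst 60 requests, after which it is limited to roughly one request per second. A gateway would translate a `False` result into the HTTP 429 response shown in the diagram.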
Health Endpoint Monitoring
It involves exposing a dedicated HTTP endpoint (such as /healthz, /health, or /status) from an application. External monitoring tools, cloud orchestrators, or load balancers periodically query this endpoint to verify that the service is not just alive, but actually capable of performing its core duties. There are two types of check:
- Shallow Checks: A shallow check simply verifies that the application process is running and can return a fast response. It does not check external dependencies.
- Deep Checks: A deep check runs lightweight queries against the internal and external dependencies the service needs (such as the database, cache, and local disk) to confirm that they are healthy and responding, and therefore that the service is fully functional.
Pros:
- Allows orchestrators to automatically tear down and replace dead, locked, or unresponsive app instances without human intervention.
- Load balancers instantly isolate unhealthy instances, preventing users from seeing error screens.
- Integrates directly with monitoring ecosystems (like Prometheus, Datadog, or cloud watchdogs) to alert operators before a minor dependency issue causes a major outage.
Cons:
- If your database goes down, a deep check across, say, 50 app instances will fail on all of them simultaneously. If your orchestrator is configured to restart "unhealthy" nodes, it will kill and reboot all 50 containers at once, turning a minor database blip into a catastrophic infrastructure reboot loop.
- If hundreds of monitoring agents query a deep health endpoint too frequently, the continuous database pings and dependency checks can create noticeable resource strain.
- Health endpoints can leak internal system details (like database status or version numbers) if left unsecured. They should be restricted to internal networks or protected via basic network security filters.
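The difference between the two check types can be sketched as two handler functions that return an HTTP status code and a response body. This is a framework-free illustration; `shallow_check`, `deep_check`, and the probe callables are hypothetical names, and a real service would wire them to its own routes and dependency clients.

```python
from typing import Callable, Dict, Tuple


def shallow_check() -> Tuple[int, dict]:
    """The process is up and able to answer; no dependencies are touched."""
    return 200, {"status": "ok"}


def deep_check(dependencies: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run one lightweight probe per dependency and report each result."""
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = "ok" if probe() else "failing"
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = "failing"
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body
```

One way to limit the restart-loop risk described above is to point restart decisions at the shallow check and use the deep check only for routing and alerting.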
Bulkhead
It is an application design strategy that isolates elements of an application into independent pools so that if one fails, the others continue to function.
Traditional Shared Resources (Risk)
[ Incoming Traffic ] ──► [ Single Shared Thread Pool ] ──► Handles everything (Auth, Orders, Ads)
*If Ads hang, all threads freeze.
Bulkhead Isolation (Safe)
┌─────────────────────────────┐
│ API Gateway / Router │
└─────────────────────────────┘
/ | \
Auth Traffic / │ \ Marketing Traffic
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Thread │ │ Thread │ │ Thread │
│ Pool A │ │ Pool B │ │ Pool C │
│ (Auth) │ │ (Orders) │ │ (Ads) │
└───────────┘ └───────────┘ └───────────┘
Common implementation approaches are:
- Thread pools: Assigning a dedicated thread pool or queue to each distinct backend service or consumer class.
- Container isolation: Moving entirely separate features into isolated Docker containers or microservices, ensuring that a CPU spike or Out-of-Memory (OOM) error in one service cannot crash another.
- Tenant Isolation: In multi-tenant SaaS environments, allocating separate infrastructure clusters (or Deployment Stamps) exclusively for premium VIP clients, ensuring that a traffic surge on a free tier tenant doesn't slow down high-value users.
Pros:
- Contained Blast Radius: Prevents a minor issue in a low-priority feature from cascading into a catastrophic, full-system outage.
- Preserved Core Functionality: Ensures critical application pathways (e.g., checkout or authentication) stay online even when secondary systems are failing.
- Granular Priority Control: Allows you to dedicate your highest-performing hardware or largest resource queues directly to vital business operations.
Cons:
- Resource Underutilization: If the Advertisement thread pool sits idle 99% of the day, those allocated threads cannot automatically be borrowed by the overloaded Order pool, leading to structural inefficiency.
- Increased Management Overhead: Designing, monitoring, and fine-tuning the exact size of multiple isolated resource pools adds configuration and runtime complexity.
- Configuration Drift Penalties: If the pool sizes are chosen poorly (e.g., allocating too few threads to a critical service), you can inadvertently cause artificial bottlenecks under completely normal usage spikes.
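The isolation idea can be sketched with one concurrency limit per feature. Note that this illustration uses a semaphore per feature rather than a separate thread pool, which is a lighter-weight variant of the same pattern; the `Bulkhead` class and `BulkheadFullError` are hypothetical names.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps how many calls for one feature may run at the same time."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing, so a hung feature
        # cannot pile up callers waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With `ads = Bulkhead("ads", 2)` and `orders = Bulkhead("orders", 20)`, hung Ads calls can occupy at most two slots, and Orders traffic keeps its own capacity.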
Circuit Breaker
It is an architectural stability pattern that prevents an application from repeatedly trying to execute an operation that is highly likely to fail. A circuit breaker operates as a state machine with three distinct positions:
- Closed State: Traffic flows normally. The circuit breaker monitors the failure rate of the calls passing through it. As long as failures stay below a specific threshold (e.g., fewer than 5% of calls fail), the circuit stays Closed.
- Open State (Tripped / Failing Fast): If the downstream service crashes and the failure rate crosses your threshold (e.g., 5% or more of recent calls fail), the circuit breaker trips Open. For a configured duration (e.g., 60 seconds), any subsequent requests hitting the circuit breaker are rejected instantly at the application layer without calling the downstream service.
- Half-Open State (Testing the Waters): Once the open duration expires, the breaker permits a small, limited number of trial requests to pass through to the downstream service. If they all succeed, the breaker assumes the issue is resolved and returns to the Closed state. If any of them fail, the breaker assumes the service is still broken, resets its cooling-off timer, and immediately trips back to the Open state.
┌──────────────────────────────────────────────┐
│ CLOSED STATE │◀───────┐
│ (Traffic flows; checks failures) │ │
└──────────────────────────────────────────────┘ │
│ │
Failure Threshold Crossed │ Trial Calls
V │ All Pass
┌──────────────────────────────────────────────┐ │
│ OPEN STATE │ │
│ (Fails fast immediately; bypasses network) │ │
└──────────────────────────────────────────────┘ │
│ │
Sleep Timer Expires │
V │
┌──────────────────────────────────────────────┐ │
┌────►│ HALF-OPEN STATE │────────┘
│ │ (Sends limited trial requests) │
│ └──────────────────────────────────────────────┘
│ │
└────────────────────────────┘
Any Trial Call Fails
Pros:
- Prevents Cascading Outages: Keeps a single broken downstream service from hogging resources and taking down your entire infrastructure stack.
- Graceful Degradation: Allows you to design smart Fallback Strategies (e.g., if the Recommendation Service breaks, display standard generic top-sellers instead of a blank page or an error screen).
- System Healing Space: Protects recovering services from being crushed by a backlog of accumulated traffic the moment they come back up.
Cons:
- Complex Configuration Testing: Finding the exact right failure thresholds, timeouts, and reset window metrics requires careful load testing. Setting them too tightly causes false-alarm tripping.
- Distributed State Synchronization: If you have 50 horizontally scaled containers running an API, agreeing globally on whether the circuit is open or closed requires a shared coordination layer; otherwise you must accept that each instance tracks its own breaker state and may trip at a different time.
- Testing Obstacles: Simulating random network timeouts, partial packet losses, and state transitions during integration tests requires specialized chaos engineering tools (like Chaos Mesh or Gremlin).
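The three states described above can be sketched as a small state machine. This is a simplified illustration under stated assumptions: it counts consecutive failures instead of a failure percentage, and it lets a single trial call decide the Half-Open outcome. The `CircuitBreaker` class, its parameters, and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("failing fast; downstream not called")
            self.state = "half_open"  # timer expired: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()  # (re)start the cooling-off timer

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller would typically catch `CircuitOpenError` and serve a fallback, such as the generic top-sellers list mentioned above. Each instance of this class tracks its own state, which is the per-instance trade-off noted in the cons.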
Security
Security provides confidentiality, integrity, and availability assurances against malicious attacks on information systems (and safety assurances for attacks on operational technology systems). Losing these assurances can negatively impact your business operations and revenue, as well as your organization's reputation in the marketplace. Maintaining security requires following well-established practices (security hygiene) and being vigilant to detect and rapidly remediate vulnerabilities and active attacks.
- Federated Identity
- Gatekeeper
- Valet Key
Federated Identity
It involves separating user authentication from the application itself by delegating it to an external, trusted Identity Provider (IdP) such as Google or GitHub.
Pros:
- Drastically Reduced Security Liability: You no longer store passwords or manage complex registration flows, eliminating the risk of database password leaks on your end.
- Frictionless Onboarding: Users can register and log into your application with a single click, significantly increasing sign-up conversion rates.
- Centralized Access Control: For corporate environments, if an employee leaves the company, deleting their account in the central enterprise directory (like Active Directory) automatically blocks their access to all federated apps instantly.
Cons:
- Single Point of Failure: If the Identity Provider experiences an outage, users will be completely unable to log into your application.
- Account Lockout Vulnerability: If a user loses access to their primary provider account (e.g., their Google account gets suspended), they lose access to your platform as well.
- Complex Cross-Domain Mapping: If different identity providers format user schemas differently (e.g., one uses firstName and another uses given_name), your application code must handle data translation layers to map them to an internal user domain.
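The translation layer mentioned in the last point can be sketched as a function that normalizes provider payloads into one internal type. This is an illustration only: the provider names, the `sub` identifier key, and the payload shapes are assumptions, not the actual schemas of any particular provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InternalUser:
    provider: str
    subject: str     # the provider's stable identifier for the user
    first_name: str


# Keys under which different providers might send the first name (assumed).
FIRST_NAME_KEYS = ("given_name", "firstName")


def to_internal_user(provider: str, claims: dict) -> InternalUser:
    """Map a provider-specific payload onto the application's own user model."""
    first_name = next((claims[k] for k in FIRST_NAME_KEYS if k in claims), "")
    return InternalUser(provider=provider, subject=str(claims["sub"]), first_name=first_name)
```

Keying internal accounts on the pair (provider, subject) rather than on an email address avoids merging two different people who happen to share an address across providers.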
Gatekeeper
It involves using a dedicated, isolated host instance (the Gatekeeper) to act as a primary firewall and validation layer in front of your core backend services and storage components. Instead of allowing external clients to communicate directly with your sensitive application logic or database layers, the Gatekeeper intercepts all traffic, validates identity and incoming schemas, and safely proxies clean requests forward.
(Untrusted Internet) (Trusted Internal Network)
[ Public Clients ] ───> [ Gatekeeper Host ] ───> [ Trusted Key Host ] ───> [ Database ]
(Validates Token, (Executes Business
Sanitizes Input) Logic)
Pros:
- Minimized Attack Surface: An attacker who successfully compromises or breaks into the Gatekeeper still cannot access the database or application secrets directly, as they are isolated on a completely separate backend server tier.
- Offloaded Security Overhead: Moves the heavy CPU burden of cryptographic verification, input validation, and SSL termination away from your core business logic runners.
- Defense in Depth: Even if code updates introduce security bugs into the inner application layer, the outer Gatekeeper layer can block malicious attempts to exploit them before they ever reach the flawed code.
Cons:
- Increased Network Hop Latency: Introducing an explicit middleman host between the client and the backend adds an extra internal network transmission hop, increasing overall response times by a few milliseconds.
- Single Point of Failure / Management Complexity: If the Gatekeeper cluster misconfigures its validation rules or goes down, your entire application goes completely dark.
- Synchronization Penalties: Every time your backend engineers update an API request schema or path structure, the validation configurations inside the Gatekeeper layer must be synchronized instantly to prevent valid requests from being dropped.
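The validate-then-forward behaviour can be sketched as a single function that rejects anything it does not recognize. This is a toy illustration: the schema, the token set, and the names `gatekeeper`, `RejectedRequest`, and `forward` are hypothetical, and a set lookup stands in for real token verification.

```python
# Hypothetical request schema: field name -> required type.
ALLOWED_FIELDS = {"order_id": int, "note": str}


class RejectedRequest(Exception):
    """Raised when the gatekeeper refuses to forward a request."""


def gatekeeper(token, payload, valid_tokens, forward):
    """Validate identity and schema, then pass a clean copy to the trusted host."""
    if token not in valid_tokens:
        raise RejectedRequest("invalid token")
    unknown = set(payload) - set(ALLOWED_FIELDS)
    if unknown:
        raise RejectedRequest(f"unexpected fields: {sorted(unknown)}")
    for field, expected_type in ALLOWED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            raise RejectedRequest(f"missing or malformed field: {field}")
    # Forward only the whitelisted fields, never the raw client payload.
    clean = {field: payload[field] for field in ALLOWED_FIELDS}
    return forward(clean)
```

The schema table is the part that has to stay in step with the backend, which is the synchronization cost described in the last point above.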
Resiliency
Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and malicious. The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware, means there is an increased likelihood that both transient and more permanent faults will arise. The connected nature of the internet and the rise in sophistication and volume of attacks increase the likelihood of a security disruption. Detecting failures and recovering quickly and efficiently is necessary to maintain resiliency.
- Bulkhead
- Circuit Breaker
- Compensating Transaction
- Health Endpoint Monitoring
- Leader Election
- Queue-Based Load Leveling
- Retry
- Scheduling Agent Supervisor
Compensating Transaction
It involves executing an explicit set of undo operations to reverse the effects of a partially completed multi-step workflow when a failure occurs midway through. In a traditional monolith backed by a single database, consistency is maintained through synchronous ACID transactions, and Two-Phase Commit (2PC) protocols extend this across databases by locking rows (such as inventory and user balance) so that the updates commit together. In a distributed microservices architecture, where a transaction spans the independent databases of an Order Service, Payment Service, and Shipping Service, this approach breaks down because running a single ACID transaction across network boundaries is highly impractical. Holding distributed locks over the network for the duration of external API calls (like a payment gateway) causes resource starvation, performance degradation, and gridlock. To solve this, systems implement the Compensating Transaction pattern: each microservice commits its work immediately to maximize throughput, and if a subsequent step fails, the system runs a sequence of explicit counter-transactions to undo the changes made by the previous steps.
[ Step 1: Book Flight ] ──> Success (Committed to Flight DB)
│
V
[ Step 2: Rent Car ] ──> FAILURE (Car Rental Database Out of Stock!)
│
V (Failure Triggers Rollback Sequence)
[ Compensating Step ] ──> Runs "Cancel Flight Booking" (Restores original state)
Key rules for implementation are:
- Idempotency is Mandatory: Because network transmissions can fail or repeat, a compensating transaction might be executed multiple times for the exact same failure. The code must be idempotent—running the undo operation twice must not alter the database state a second time.
- Eventually Consistent: During the brief window between Step 1 succeeding and the final step completing (or failing), the data is in an intermediate state. Other users can see that the flight seat is temporarily booked, even if it gets cancelled 3 seconds later.
- Commutative Execution: The system must be engineered to handle race conditions where a compensating transaction (the cancellation) accidentally arrives at a microservice before the original forward transaction (the booking) due to network routing delays.
Pros:
- High System Throughput: Eliminates heavy, long-lived database locks across networks, allowing distributed microservices to run at maximum speed.
- Elastic Scalability: Complex, multi-step workflows can stretch out over hours or days (e.g., an enterprise procurement approval chain) without degrading database connection pools.
- Resilient Distributed Flows: Provides a structured, predictable blueprint to handle business-level errors across completely decoupled infrastructure domains.
Cons:
- High Development Overhead: Engineers must write twice as much code for every business feature—one function to execute the logic, and an entire separate function to cleanly undo it.
- Data Visual Shifting (Dirty Reads): Because steps commit immediately, a user might see their bank account balance drop, only to watch it jump back up a few seconds later because a downstream step failed and issued a compensation refund.
- Complex Compensation Failures: If a compensating transaction itself fails (e.g., the Flight Service database crashes right when trying to process the cancellation), the system falls into an inconsistent state. This requires specialized dead-letter queues (DLQs) or manual operator intervention to fix.
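The flight and car example above can be sketched as a small runner that remembers each completed step's undo operation and replays them in reverse on failure. This is an in-memory illustration; `run_saga`, `SagaFailed`, and the booking functions are hypothetical stand-ins for calls to separate services.

```python
class SagaFailed(Exception):
    """Raised after the completed steps have been compensated."""


def run_saga(steps):
    """steps: list of (name, action, compensation) tuples, run in order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo already-committed steps, most recent first.
            for undo in reversed(completed):
                undo()
            raise SagaFailed(name) from exc
        completed.append(compensate)


bookings = set()  # stands in for the Flight service's database


def book_flight():
    bookings.add("flight-42")


def cancel_flight():
    # Idempotent: discarding a booking that is already gone is a no-op.
    bookings.discard("flight-42")


def rent_car():
    raise RuntimeError("no cars available")


def cancel_car():
    bookings.discard("car-7")


TRIP = [
    ("book flight", book_flight, cancel_flight),
    ("rent car", rent_car, cancel_car),
]
```

Calling `run_saga(TRIP)` books the flight, fails on the car, cancels the flight, and raises `SagaFailed("rent car")`. The sketch lets a failing compensation propagate as an error; a production system would route that case to a dead-letter queue or an operator, as described above.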
Retry
It enables an application to handle temporary, short-lived network or infrastructure failures by automatically retrying a failed operation a specified number of times before giving up. Systems use structured timing rules such as:
- Fixed Interval: Retry every n seconds.
- Exponential Backoff: Double the wait after each failed attempt (e.g., try, wait 1s, retry, wait 2s, retry, wait 4s, retry, wait 8s).
- Jitter (Random Delay): Adding a small amount of random variance (jitter) to the backoff timing ensures that if 1,000 distributed clients fail simultaneously, they don't all retry at the exact same millisecond, smoothing out traffic spikes.
Pros:
- Improves Application Resilience: Transparently heals transient connectivity blips without showing error screens to the end user.
- Simple Implementation: Easily added via existing resilience libraries (like Polly in .NET or Resilience4j in Java) or custom middleware in Node/Python.
Cons:
- Aggravates Overload (Without Backoff): If a downstream service is crashing due to high CPU, a naive retry loop across hundreds of client instances acts like a self-inflicted Distributed Denial of Service (DDoS) attack.
- Thread and Memory Latency: Keeping a request alive while it waits through multiple retry delays traps execution threads and network sockets, which can bubble up and slow down the caller service.
- Strict Idempotency Required: If a network call times out after the database wrote the data but before sending the success response, retrying it will create a duplicate row unless the endpoint is strictly idempotent.
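Exponential backoff combined with jitter can be sketched as a small helper. This is a minimal illustration, assuming the "full jitter" variant in which each wait is a random fraction of the current backoff; the `retry` function and its parameters are hypothetical, and `sleep` and `rng` are injectable only so the timing can be tested.

```python
import random
import time


def retry(fn, attempts=5, base_delay=1.0, max_delay=30.0,
          retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds, retrying only errors assumed to be transient."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            backoff = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            sleep(backoff * rng())  # wait a random fraction of the backoff
```

Restricting `retry_on` to errors that are plausibly transient matters: retrying a validation error or an authentication failure only adds load. The wrapped operation should also be idempotent, for the reason given in the last point above.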
