Cloud design patterns

Cloud design patterns are solutions to common problems that arise when building systems that run on a cloud platform. These patterns provide a way to design and implement systems that can take advantage of the unique characteristics of the cloud, such as scalability, elasticity, and pay-per-use pricing. Cloud workloads are vulnerable to the fallacies of distributed computing, which are common but incorrect assumptions about how distributed systems operate. Examples of these fallacies include:

  • The network is reliable.
  • Latency is zero.
  • Bandwidth is infinite.
  • The network is secure.
  • Topology doesn't change.
  • There's one administrator.
  • Component versioning is simple.
  • Observability implementation can be delayed.

Design and Implementation

Good design encompasses factors such as consistency and coherence in component design and deployment, maintainability to simplify administration and development, and reusability to allow components and subsystems to be used in other applications and in other scenarios. Decisions made during the design and implementation phase have a huge impact on the quality and the total cost of ownership of cloud hosted applications and services.

Strangler Fig pattern

The Strangler Fig pattern is a software architecture strategy for incrementally migrating a legacy (monolithic) system to a new architecture (typically microservices). Instead of a high-risk "Big Bang" rewrite where you replace the whole system at once, you replace specific features piece by piece. Over time, the new system grows until the old one is completely eclipsed and can be decommissioned.

The pattern relies on an interception layer (like an API Gateway, Reverse Proxy, or Load Balancer) placed in front of both systems.

  • Interception: All user requests go to the interceptor first.
  • Coexistence: Initially, 100% of traffic goes to the legacy system.
  • Incremental Migration: You rebuild one specific capability in the new architecture. The interceptor is reconfigured to route requests for that specific feature to the new service, while everything else still goes to the monolith.
  • Elimination: Repeat this process feature by feature until the legacy system has zero traffic and can be safely turned off.

[ Client Requests ]
        │
        v
┌────────────────┐
│  Interceptor   │ ───► (Routes traffic based on migrated paths)
└────────────────┘
   │            │
   v            v
┌──────────┐  ┌─────────────────┐
│  Legacy  │  │  New Services   │
│ Monolith │  │ (Microservices) │
└──────────┘  └─────────────────┘
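
The routing decision made by the interceptor can be sketched in a few lines of Python. This is only an illustration, not a real gateway configuration: the path prefixes and the backend labels `legacy` and `new` are hypothetical.

```python
# Hypothetical set of path prefixes that have already been migrated.
MIGRATED_PREFIXES = {"/orders", "/invoices"}

def route(path: str) -> str:
    """Return the backend that should serve the given request path."""
    for prefix in MIGRATED_PREFIXES:
        # Match the prefix itself or any sub-path below it.
        if path == prefix or path.startswith(prefix + "/"):
            return "new"
    # Everything that has not been migrated yet still goes to the monolith.
    return "legacy"

print(route("/orders/42"))  # new
print(route("/users/7"))    # legacy
```

Migrating another feature then amounts to adding its prefix to the set, and the legacy system can be retired once no path falls through to the default.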

The Sidecar Pattern

The Sidecar pattern is a software architecture design pattern in which an auxiliary component is deployed alongside a primary application container. Its primary goal is separation of concerns: developers offload peripheral tasks from the main application code so that the application can focus purely on business logic. Sidecars are designed to be language agnostic, independently deployable, and loosely coupled. The trade-offs are resource overhead and increased latency. Common use cases are:

  • Service Mesh & Network Proxies: Handling service discovery, traffic routing, encryption (mTLS), and circuit breaking (e.g., Envoy in Istio).
  • Logging and Monitoring: Collecting application logs or metrics and forwarding them to a centralized system (e.g., Fluentd or Prometheus exporters).
  • Configuration & Secrets: Fetching configuration updates or rotating security tokens dynamically without restarting the main application.
  • Distributed Tracing: Injecting trace IDs and exporting telemetry data to systems like Jaeger or Zipkin.

Static Content Hosting pattern

It involves deploying static website assets (HTML, CSS, JavaScript, images, videos) directly to a cloud-based storage service instead of serving them from a traditional web or application server. By offloading these files, you reduce compute load on backend servers, slash hosting costs, and improve global page load times.

Instead of routing every user request to a web server (like Node.js, Go, or a framework running on a VM), you decouple the frontend from the backend:

  • Storage: Static files are uploaded to a managed object storage service (e.g., AWS S3, Google Cloud Storage, Supabase Storage).
  • Delivery (CDN): A Content Delivery Network (e.g., Cloudflare, CloudFront, Vercel) sits in front of the storage bucket. It caches the static assets at edge locations close to users worldwide.
  • Execution: The browser downloads the static shell (HTML/JS) from the nearest edge node. Any dynamic interaction or data fetching happens client-side via asynchronous API calls to a separate, lightweight backend or serverless functions.

Pros:

  • Cost Efficiency
  • Massive Scalability
  • Reduced Server Load
  • Security and Isolation

Cons:

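  • No Server-Side Dynamic Content: The storage service only serves files as they are. Use client-side rendering (CSR) and API calls for anything dynamic.
  • Stale Assets After Deployment: Cached copies can make users see old files. Implement cache busting and short TTL headers.
  • CORS (Cross-Origin Resource Sharing): The frontend and the API usually live on different origins, so set proper CORS headers.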
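
One common way to implement cache busting is to embed a content hash in each asset's file name, so a changed file gets a new URL and old cached copies are never served for it. The sketch below assumes this fingerprinting approach; the function name `fingerprint` and the file paths are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath

def fingerprint(name: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(name)
    # "static/app.js" becomes "static/app.<hash>.js"
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))

old_name = fingerprint("static/app.js", b"console.log('v1')")
new_name = fingerprint("static/app.js", b"console.log('v2')")
print(old_name)
print(new_name)  # differs from old_name because the content changed
```

With fingerprinted names, the assets themselves can be cached for a long time, and only the HTML that references them needs a short TTL.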

Leader Election

The Leader Election pattern ensures that, out of a cluster of identical application instances, exactly one coordinator node is dynamically chosen to manage centralized tasks, while the remaining instances act as followers. If the active leader crashes or loses network connectivity, the surviving nodes detect the failure and elect a new leader so that operation continues. Coordination services such as etcd and ZooKeeper are commonly used to implement it.

In distributed architectures, running certain operations concurrently across multiple instances can cause race conditions, data corruption, or duplicate resource usage. The leader node takes charge of these sensitive tasks:

  • Coordinating Schedules: Running periodic cron jobs or batch data processing pipelines exactly once.
  • Sequential Writes: Managing transactional write workflows to a database or shared file system to prevent conflicts.
  • Resource Allocation: Dynamically assigning partitions, shards, or processing queues to worker nodes.

Pros:

  • Avoids Race Conditions: By routing all critical mutating actions through a single node, you avoid concurrent writers contending for the same rows and overwriting each other's data.
  • High Performance via In-Memory State: Processing transactional tasks sequentially in the leader's RAM bypasses slow disk I/O bottlenecks during peak traffic.
  • High availability: Unlike a hardcoded master server, if an elected leader crashes, the cluster heals itself by electing a new one once the failure is detected. How quickly that happens depends on the configured heartbeat or lease timeouts.
  • Simplified client routing: Downstream worker nodes only need to track a single source of truth rather than coordinating distributed state across themselves.

Cons:

  • Increased Network Latency: Workers must perform an extra internal network hop to communicate with the leader instead of modifying data locally or hitting the database directly.
  • Split-Brain Risk: If a network partition occurs, a disconnected leader might still think it's in charge while the rest of the cluster elects a new one. Preventing duplicate execution requires fencing tokens or time-limited leases (for example, etcd leases).
  • Complexity of State Recovery: When a leader drops dead, reconstructing its volatile memory on a new server requires robust, synchronized write-ahead logging or replication, which is difficult to implement from scratch.
  • Resource Bottleneck: If the leader node becomes overwhelmed with requests from too many worker servers, it can become a system-wide performance bottleneck.
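
The interplay of leases and fencing tokens can be illustrated with a toy, single-process model. `LeaseStore` below is a hypothetical in-memory stand-in for a coordination service, not the etcd or ZooKeeper API, and time is passed in explicitly so the example is deterministic.

```python
class LeaseStore:
    """Toy stand-in for a coordination service that grants one lease at a time."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.holder = None     # id of the current leader, if any
        self.expires_at = 0.0  # time at which the current lease lapses
        self.token = 0         # fencing token, bumped for every new leadership term

    def try_acquire(self, node: str, now: float):
        """Return a fencing token if `node` is leader after this call, else None."""
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl  # renew the existing lease
            return self.token
        if self.holder is None or now >= self.expires_at:
            self.holder = node                # lease is free or has expired
            self.expires_at = now + self.ttl
            self.token += 1
            return self.token
        return None                           # someone else holds a valid lease

class Resource:
    """Protected resource that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0

    def write(self, token: int) -> bool:
        if token < self.highest_token:
            return False
        self.highest_token = token
        return True

store, db = LeaseStore(ttl=5.0), Resource()
token_a = store.try_acquire("A", now=0.0)  # A becomes leader
blocked = store.try_acquire("B", now=1.0)  # B is refused while A's lease is valid
token_b = store.try_acquire("B", now=6.0)  # A's lease expired, B takes over
ok_b = db.write(token_b)                   # accepted
ok_a = db.write(token_a)                   # rejected: A is a stale leader
print(token_a, blocked, token_b, ok_b, ok_a)
```

The point of the token check is that a partitioned former leader cannot corrupt shared state even if it still believes it is in charge.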

CQRS (Command Query Responsibility Segregation)

It splits your application into two completely separate paths: one for modifying data (Commands) and one for reading data (Queries) to optimize performance, scaling, and security. In most traditional systems, the same database model is used to create, update, and read records. While this works fine for simple applications, it breaks down under heavy traffic:

  • Conflicting Needs: A read operation wants data nicely flattened, pre-joined, and indexed for fast retrieval (like a dashboard). A write operation wants data normalized to prevent duplication and ensure strict transactional consistency.
  • Resource Contention: Heavy, complex read queries can lock database tables, slowing down critical write operations (like user sign-ups or checkouts).

CQRS divides the system into two distinct responsibilities:

  • The Command Side (Writes): Handles operations that change system state and focuses on validation logic, data consistency, and transactional safety. It is often backed by a normalized relational database or an event store.
  • The Query Side (Reads): Handles operations that fetch data and focuses on returning data as fast as possible. It uses a read-optimized, flat data structure, frequently hosted on NoSQL databases, search engines, or caches.

[ User Action ] ---> [ Command Service ] ---> Writes to ---> [ Write DB ]
                                                                   |
                                                                   v
                                                            Publishes Event
                                                                   |
                                                                   v
[ Dashboard ]   <--- [ Query Service ]   <--- Reads from <-- [ Read DB (Redis/NoSQL) ]
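
The flow in the diagram can be sketched with two in-memory dictionaries standing in for the write and read databases. All names here (`OrderPlaced`, `place_order`, `customer_summary`) are hypothetical, and the event is projected synchronously for simplicity, whereas a real system would usually publish it through a broker.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderPlaced:
    order_id: int
    customer: str
    total: float

write_db: dict = {}  # normalized source of truth: order_id -> event
read_db: dict = {}   # denormalized read model: customer -> summary

def project(event: OrderPlaced) -> None:
    """Query side: update the read model from a published event."""
    summary = read_db.setdefault(event.customer, {"orders": 0, "spent": 0.0})
    summary["orders"] += 1
    summary["spent"] += event.total

def place_order(order_id: int, customer: str, total: float) -> None:
    """Command side: validate, write, then publish the event."""
    if total <= 0:
        raise ValueError("total must be positive")
    if order_id in write_db:
        raise ValueError("duplicate order id")
    event = OrderPlaced(order_id, customer, total)
    write_db[order_id] = event
    project(event)  # stands in for publishing to a message broker

def customer_summary(customer: str) -> dict:
    """Query side: a single key lookup, no joins or aggregation at read time."""
    return read_db.get(customer, {"orders": 0, "spent": 0.0})

place_order(1, "alice", 30.0)
place_order(2, "alice", 12.5)
print(customer_summary("alice"))  # {'orders': 2, 'spent': 42.5}
```

When the projection runs asynchronously, the read model lags behind the write model for a short time, which is the eventual consistency trade-off of this pattern.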

Pros:

  • Independent Scaling for read and write resources.
  • Optimized Data Structures, as the read side can store data exactly how the UI needs it, eliminating heavy SQL JOIN statements and minimizing CPU overhead.
  • Simpler Domain Logic, as the write side doesn't have to care about performance tricks for reports.

Cons:

  • Eventual Consistency
  • High Complexity due to managing two databases and keeping them in sync
  • Code Duplication as you often have to maintain separate code models for input validation (Commands) and data output representation (DTOs/Queries).

Pipes and Filters

The Pipes and Filters pattern decomposes a complex task that processes a stream of data into a series of separate, reusable components called Filters. These filters are chained together sequentially using communication channels called Pipes. Each filter performs a single, specific data transformation and passes its output to the next pipe until the entire pipeline is complete.

[ Raw Data ] -> ( Pipe ) -> 【 Filter 1: Parse 】-> ( Pipe ) ->【 Filter 2: Enrich 】-> ( Pipe ) -> [ Processed Data ]
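
Within a single process, the same structure can be sketched with Python generators, where each generator is a filter and the iterator passed between them plays the role of the pipe. The record fields `qty` and `price` are hypothetical sample data.

```python
import json
from typing import Callable, Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Filter 1: turn raw JSON lines into dictionaries."""
    for line in lines:
        yield json.loads(line)

def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """Filter 2: add a computed field to each record."""
    for record in records:
        yield {**record, "total": record["qty"] * record["price"]}

def pipeline(source: Iterable, *filters: Callable) -> Iterable:
    """Chain the filters so each one consumes the previous one's output."""
    stream = source
    for stage in filters:
        stream = stage(stream)
    return stream

raw = ['{"qty": 2, "price": 5}', '{"qty": 1, "price": 3}']
result = list(pipeline(raw, parse, enrich))
print(result)
```

In a distributed deployment the pipes would typically be message queues and each filter a separately scaled service, but the composition idea is the same.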

Pros:

  • High Reusability: filters are completely isolated, so they can be mixed and matched to create entirely different pipelines.
  • Each step can be scaled independently.
  • Because each filter is isolated, it can be debugged and tested easily.

Cons:

  • Format conversion overhead, as each filter may expect a different input type and data must be serialized and deserialized between steps.
  • Complex error handling, both for detecting failures and for recovering from them.
  • State management and synchronization if a filter takes input from multiple streams.

Ambassador

It involves deploying a helper service or container alongside your main application instance to handle common, off-the-shelf infrastructure tasks such as logging, routing, circuit breaking, security, and metrics.

┌─────────────────────────────────────────────────────────┐
│                    Pod / Host Machine                   │
│                                                         │
│  ┌───────────────────┐             ┌─────────────────┐  │      Outbound Call
│  │    Application    │ <─────────> │   Ambassador    │  │───────────────> [ Remote API / ]
│  │  (Business Logic) │  Localhost  │ (Sidecar Proxy) │  │   (Secured/Retried) [ Microservice]
│  └───────────────────┘             └─────────────────┘  │
└─────────────────────────────────────────────────────────┘

Its common use cases are:

  • Migrating legacy apps to service mesh
  • Centralized connectivity logic

Pros:

  • Language Agnostic
  • Separation of Concerns
  • Dynamic Configurations

Cons:

  • Resource Overhead
  • Increased Latency
  • Debugging Complexity

Gateway Routing

It involves using a single, centralized entry point such as an API Gateway to route incoming requests to various downstream microservices or backend systems based on the request's path, host, headers, or query parameters.

                       ┌─────────────────┐      /users      ┌───────────────────┐
                       │                 │ ───────────────> │   User Service    │
[ Client Request ] ──> │   API Gateway   │                  └───────────────────┘
(api.mycompany.com)    │ (Gateway Route) │      /orders     ┌───────────────────┐
                       │                 │ ───────────────> │   Order Service   │
                       └─────────────────┘                  └───────────────────┘

Its common use cases are:

  • Microservices
  • Legacy Migration
  • Multi-Tenant routing

Pros:

  • Decoupling and independent scaling of different resources.
  • The gateway can accept standard public HTTP/REST requests from the internet and translate them into fast internal protocols.
  • It provides a natural place to bundle shared edge concerns like global rate limiting, CORS configuration, SSL termination, and centralized logging before requests ever touch your application layer.

Cons:

  • Single Point of Failure: if the API Gateway goes down, every route behind it becomes unreachable
  • Increased Latency
  • Configuration bottleneck.

Gateway Offloading

It involves moving shared cross-cutting infrastructure concerns such as authentication, authorization, rate limiting, SSL termination, and IP whitelisting away from individual downstream microservices and onto a centralized API Gateway at the edge of the network.

                       ┌─────────────────┐       Valid Token      ┌───────────────────┐
                       │                 │ ─────────────────────> │   User Service    │
[ Client Request ] ──> │   API Gateway   │                        │  (No Auth Code)   │
(Raw, Unverified)      │ (Validates JWT) │                        └───────────────────┘
                       │  (Rate Limits)  │       Blocked Request  ┌───────────────────┐
                       └─────────────────┘ ─────────────────────> │ X [Dropped Edge]  │
                                                                  └───────────────────┘

Pros:

  • Engineering teams don't waste time writing, updating, or debugging authentication libraries across different backend languages
  • Security rules are managed in a single configuration file at the gateway level
  • Offloading heavy tasks like SSL decryption and caching frees up CPU and memory on your downstream service nodes

Cons:

  • Single Point of Failure (Security & Routing)
  • Tighter Coupling to Infrastructure
  • Latency Overhead

Gateway Aggregation

It involves using a centralized gateway to accept a single client request, fan out multiple parallel requests to various downstream microservices, consolidate their individual responses, and return a single, unified data payload back to the client.

                      ┌─────────────────┐  (Parallel Internal Calls)
                      │                 │ ───────> [ User Service ] ───┐
[ 1. Single Request ] │   API Gateway   │                              │        (3 Responses)
 ───────────────────> │                 │ ───────> [ Order Service ] ──┼─────> [ Collates/Maps ] ──┐
   (api.com/profile)  │ (Aggregates)    │                              │                           │
                      │                 │ ───────> [ Payment Service ] ┘                           │
                      └─────────────────┘                                                          │
                               ^                                                                   │
                               │                                                                   │
                               └───────────────── [ 2. Unified JSON Payload ] ─────────────────────┘
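
The fan-out and merge step can be sketched with `asyncio.gather`, which runs the downstream calls concurrently. The three `fetch_*` coroutines are hypothetical placeholders that simulate network latency with a short sleep; a real gateway would issue HTTP or gRPC requests instead.

```python
import asyncio

# Hypothetical downstream calls, simulated with a short sleep.
async def fetch_user(user_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"id": user_id, "name": "Ada"}

async def fetch_orders(user_id: int) -> list:
    await asyncio.sleep(0.01)
    return [{"order_id": 1}, {"order_id": 2}]

async def fetch_payments(user_id: int) -> list:
    await asyncio.sleep(0.01)
    return [{"payment_id": 10}]

async def get_profile(user_id: int) -> dict:
    """Fan out to all three services concurrently and merge the results."""
    user, orders, payments = await asyncio.gather(
        fetch_user(user_id), fetch_orders(user_id), fetch_payments(user_id)
    )
    return {"user": user, "orders": orders, "payments": payments}

profile = asyncio.run(get_profile(7))
print(profile)
```

Because the calls run concurrently, the total response time is bounded by the slowest service rather than the sum of all three. A production gateway would also add per-call timeouts and decide whether a partial response is acceptable.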

Pros:

  • Drastically Reduced Network Roundtrips
  • Payload Optimization
  • Separation of Concerns

Cons:

  • Tight Architectural Coupling as gateway has to know which route and which service to request.
  • Increased latency due to slowest service
  • Single point of failure

External Config Store

It involves moving all application configurations, feature flags, and environment variables out of the application deployment package (like a Docker image or a compiled binary) and placing them into a centralized, external service.

        ┌───────────────────────────────────┐
        │    Central Configuration Store    │
        │  (Consul / Vault / AWS AppConfig) │
        └───────────────────────────────────┘
              /           │           \
Fetch Config /            │            \ Fetch Config
            /             │             \
           v              v              v
┌───────────┐       ┌───────────┐      ┌───────────┐
│ Service A │       │ Service B │      │ Service C │
└───────────┘       └───────────┘      └───────────┘

Pros:

  • Zero-Downtime Updates: You can change system behavior, rotate database passwords, or toggle feature flags at runtime, without redeploying the application.
  • Centralized Audit Trails: It provides a single point of control to monitor exactly who changed which configuration parameter, and when it was modified.
  • Enhanced Security: Secrets and operational settings can be kept entirely separate from developer source code repositories, using strict role-based access control (RBAC).

Cons:

  • Single Point of Failure
  • Caching and Staleness
  • Increased Startup Latency due to network call to fetch the configuration
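
A client-side cache with a time-to-live is one way to soften the single point of failure and latency concerns, at the cost of some staleness. The sketch below is an assumption about how such a client could look: `CachedConfig` and `fake_fetch` are hypothetical names, and the fetch function stands in for a call to whatever store is actually used.

```python
import time
from typing import Callable

class CachedConfig:
    """Caches fetched configuration and serves the last known value on failure."""

    def __init__(self, fetch: Callable[[], dict], ttl: float, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._value = None
        self._fetched_at = None

    def get(self) -> dict:
        now = self._clock()
        fresh = self._fetched_at is not None and now - self._fetched_at < self._ttl
        if not fresh:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                if self._value is None:
                    raise  # nothing cached yet, so the error must surface
                # Otherwise keep serving the stale value until the store recovers.
        return self._value

# Demo with a fake store and a controllable clock.
calls = []
now = [0.0]

def fake_fetch() -> dict:
    calls.append(1)
    return {"feature_x": len(calls) > 1}  # the flag flips on the second fetch

config = CachedConfig(fake_fetch, ttl=30.0, clock=lambda: now[0])
first = config.get()   # fetched from the store
now[0] = 10.0
second = config.get()  # served from the cache
now[0] = 31.0
third = config.get()   # TTL elapsed, fetched again
print(first, second, third, len(calls))
```

The TTL is the knob that trades freshness against load on the store: a shorter TTL picks up changes sooner but makes more network calls.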

Compute Resource Consolidation

It involves bundling multiple independent tasks or applications onto a smaller number of compute instances (such as Virtual Machines or containers) instead of giving each individual task its own dedicated server. It can be useful in:

  • Managing Low-Traffic or Background Tasks
  • Development, Staging, and Testing Environments
  • Legacy App Migration
  • Shared Web Hosting Architecture

Pros:

  • Significant Cost Savings
  • Simplified Management
  • Better Resource Density

Cons:

  • Noisy Neighbors
  • Single Point of Failure
  • Security Isolation Risks
  • Deployment Friction

Backends for Frontends (BFF)

It is a variant of the API Gateway pattern where you create separate, dedicated backend services for each specific type of user interface (such as a mobile app, a web app, and a desktop portal), rather than forcing every frontend to consume a single, generic API. Each client application communicates exclusively with its own dedicated BFF service. The BFF then interacts with the internal downstream microservices on behalf of that client.

┌──────────────┐      ┌───────────────┐
│  Mobile App  │ ───► │  Mobile BFF   │ ───┐
└──────────────┘      └───────────────┘    │
                                           │   (Internal Calls)
┌──────────────┐      ┌───────────────┐    ├─> [ User Service ]
│ Web Browser  │ ───► │    Web BFF    │ ───┤
└──────────────┘      └───────────────┘    ├─> [ Order Service ]

┌──────────────┐      ┌───────────────┐    ├─> [ Billing Service ]
│ Third-Party  │ ───► │ Public API BFF│ ───┘
└──────────────┘      └───────────────┘

Pros:

  • Decoupled Deployments
  • Optimized Performance
  • Simplified Client Code

Cons:

  • Code Duplication
  • Operational Overhead
  • Silo Risk: If not managed properly, BFFs can inadvertently start absorbing core business logic that actually belongs inside the downstream core microservices.

Anti-corruption Layer

It is a design strategy used to translate communications between two different subsystems (usually a new, clean application and an old, legacy system). Instead of allowing the messy, outdated data models, APIs, or database schemas of the legacy system to bleed into and "corrupt" the architecture of the new system, the ACL acts as an isolated, bidirectional translator between them.

┌────────────────┐          ┌──────────────────────┐          ┌────────────────┐
│  Modern App    │ ◄──────► │ Anti-Corruption Layer│ ◄──────► │  Legacy App    │
│ (Clean Domain) │  Modern  │ (Translates Models)  │  Legacy  │ (Messy Domain) │
└────────────────┘   API    └──────────────────────┘ Protocol └────────────────┘
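
In code, the translation layer often reduces to a pair of mapping functions at the boundary. The sketch below shows only the legacy-to-modern direction. The legacy field names (`CUST_NO`, `FNAME`, `LNAME`, `SIGNUP_DT`, `STATUS_CD`) and the `Customer` model are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Customer:
    """Clean domain model used inside the modern application."""
    id: int
    full_name: str
    signup_date: date
    active: bool

def from_legacy(row: dict) -> Customer:
    """Translate a (hypothetical) legacy record into the modern model."""
    first = row["FNAME"].strip()
    last = row["LNAME"].strip()
    return Customer(
        id=int(row["CUST_NO"]),                      # zero-padded string -> int
        full_name=f"{first} {last}".title(),         # upper-case, padded -> title case
        signup_date=datetime.strptime(row["SIGNUP_DT"], "%Y%m%d").date(),
        active=row["STATUS_CD"] == "A",              # status code -> boolean
    )

legacy_row = {
    "CUST_NO": "00994",
    "FNAME": "ALICE ",
    "LNAME": "JONES",
    "SIGNUP_DT": "20240131",
    "STATUS_CD": "A",
}
customer = from_legacy(legacy_row)
print(customer)
```

Keeping all knowledge of the legacy format inside functions like this means the rest of the modern code never sees legacy field names or encodings.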

Pros:

  • Protects Modern Architecture
  • Simplifies Decommissioning
  • Parallel Development

Cons:

  • Latency Penalty
  • Maintenance Overhead
  • Over-engineering Risk

Data Management

Data management is the key element of cloud applications, and influences most of the quality attributes. Data is typically hosted in different locations and across multiple servers for reasons such as performance, scalability or availability, and this can present a range of challenges.

Valet Key

It involves using a token or a cryptographically signed URL to give clients direct, time-limited access to a specific data resource (like a cloud storage bucket or CDN) without forcing the data to pass through your main application servers.

[ Client App ] ─────> 1. "Can I download file X?" ─────> [ API Server ]
      ^                                                        │
      │ 2. Returns Valet Key (Signed URL with 15m expiry) <────┘

      └────── 3. Downloads file X directly ─────────────> [ Cloud Storage (S3/CDN) ]
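
The signing and verification steps can be illustrated with an HMAC over the path and expiry time. This is a generic sketch, not the signing scheme of any particular cloud provider (each has its own presigned URL or shared access signature format). `SECRET`, `sign_url`, and `verify_url` are hypothetical names.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key; keep real keys out of source code

def _sign(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """API server side: append an expiry timestamp and a signature to the path."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _sign(path, expires)})
    return f"{path}?{query}"

def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Storage side: accept only an untampered, unexpired URL."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    expected = _sign(parsed.path, expires)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(signature.encode(), expected.encode()) and current < expires

url = sign_url("/files/report.pdf", ttl_seconds=900, now=1_000.0)
print(url)
print(verify_url(url, now=1_500.0))  # True: within the 15-minute window
print(verify_url(url, now=2_000.0))  # False: expired
```

Because the key stays valid until it expires, the expiry window is the main lever for limiting the damage if a URL leaks.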

Pros:

  • Massive Cost & Resource Savings
  • Excellent Scalability
  • Fine-Grained Security

Cons:

  • Loss of Real-Time Control
  • Data Decoupling Challenges
  • Public Exposure Risk

Sharding

It involves horizontally partitioning a single massive database into smaller, faster, and more manageable pieces called shards. To shard a database, you must choose a Shard Key (a column present in your data, like user_id or tenant_id). A hashing algorithm or routing function uses this key to determine exactly which server stores a specific row:

                  ┌──────────────────────┐
                  │  Application Router  │
                  └──────────────────────┘
                   /          |         \
Hash(user_id)%100 /           |          \
   Maps to 0-33  /  Maps 34-66|           \ Maps 67-99
                v             v            v
          ┌───────────┐ ┌────────────┐ ┌────────────┐
          │  Shard 1  │ │   Shard 2  │ │   Shard 3  │
          │(Hash 0-33)│ │(Hash 34-66)│ │(Hash 67-99)│
          └───────────┘ └────────────┘ └────────────┘

Common Sharding Strategies are:

  • Algorithmic (Hash-Based) Sharding: Takes the shard key and runs a hash function modulo the number of shards (Shard = Hash(Key) % N). This distributes data evenly, but makes adding new shards later highly complex because changing N remaps most keys, so most of the data has to be moved.
  • Range-Based Sharding: Divides data based on ranges of a feature (e.g., Shard 1 holds IDs 1–1,000,000; Shard 2 holds 1,000,001–2,000,000). This makes adding shards easy, but can create massive data imbalances (e.g., if newer IDs are heavily active while old IDs are completely idle).
  • Directory-Based Sharding: A centralized lookup service or state map tracks exactly which IDs live on which physical nodes. This is highly flexible but introduces a single point of failure and a network hop just to locate the data.
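
The resharding cost of the hash-modulo strategy is easy to demonstrate. The sketch below assumes SHA-256 as the hash function and made-up keys of the form `user-<n>`; it counts how many keys land on a different shard when the shard count grows from 3 to 4.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Shard = Hash(Key) % N, using a hash that is stable across processes."""
    # Python's built-in hash() is randomized per process for strings,
    # so it is not suitable for routing decisions that must be repeatable.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(shard_for(k, 3) != shard_for(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shard when going from 3 to 4 shards")
```

A key keeps its shard only when its hash gives the same remainder modulo 3 and modulo 4, which holds for about a quarter of uniformly distributed hashes, so roughly three quarters of the data has to move. Consistent hashing is a common alternative when shards are added frequently.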

Pros:

  • Near-Linear Horizontal Scalability
  • Fast Read/Writes
  • Fault Isolation

Cons:

  • No Join Operations Across Shards
  • Loss of Referential Integrity
  • Hot Spots: If one specific shard key becomes intensely popular, the shard holding that key's data can be saturated while the other shards sit idle.
  • Operational Overhead: Backups, schema updates, index management, and database migrations must now be coordinated across dozens of separate database clusters.

Materialized Views

It involves pre-computing and saving the results of a complex, slow database query into a physical table on disk.

[ Complex Tables ] ─── (Heavy Query/Joins) ───> [ Materialized View ] ───>  [ Fast Reads ]
(Users, Orders, Items)                           (Pre-computed Table)      (Dashboards/APIs)

Pros:

  • Instant Reads
  • Reduced DB Load for each aggregated read
  • Simpler Code

Cons:

  • The data is only as fresh as the last refresh cycle
  • Storage Overhead
  • Refresh Cost: If you force the database to refresh the view frequently, each refresh re-runs the heavy query, which can saturate the CPU and lock tables during the refresh window.

Index Table

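It involves creating a specialized secondary table to speed up data retrieval when querying a database by columns other than its primary key. In the example below, the second table maps each email address back to the user_id of the main table.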

+-----------------------+-----------------+-------------+---------+
| user_id (Primary Key) | email_address   | full_name   | country |
+-----------------------+-----------------+-------------+---------+
| 883                   | bob@email.com   | Bob Smith   | USA     |
| 994                   | alice@email.com | Alice Jones | India   |
+-----------------------+-----------------+-------------+---------+

+-----------------------------+-------------------+
| email_address (Primary Key) | user_id (Pointer) |
+-----------------------------+-------------------+
| alice@email.com             | 994               |
| bob@email.com               | 883               |
+-----------------------------+-------------------+
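
The two tables above can be modeled with two dictionaries to show both the fast lookup and the extra write work. This is a simplified sketch: `save_user` and `find_by_email` are hypothetical helpers, and the code assumes email addresses are unique.

```python
users = {}        # main table: user_id -> row
email_index = {}  # index table: email_address -> user_id

def save_user(user_id: int, email: str, full_name: str, country: str) -> None:
    """Write the row and keep the index table in step with it."""
    previous = users.get(user_id)
    if previous is not None and previous["email_address"] != email:
        del email_index[previous["email_address"]]  # remove the stale pointer
    users[user_id] = {"email_address": email, "full_name": full_name, "country": country}
    email_index[email] = user_id

def find_by_email(email: str):
    """Two key lookups instead of a scan over every row."""
    user_id = email_index.get(email)
    return None if user_id is None else users[user_id]

save_user(883, "bob@email.com", "Bob Smith", "USA")
save_user(994, "alice@email.com", "Alice Jones", "India")
save_user(994, "alice@new.example", "Alice Jones", "India")  # email change

print(find_by_email("bob@email.com"))
print(find_by_email("alice@email.com"))  # None: the old pointer was removed
```

Every write touches both structures, which is the write penalty of this pattern, and if the two updates are not applied atomically a reader can briefly see a stale or missing pointer.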

Pros:

  • Fast Alternative Queries: Eliminates resource-hogging full-table scans when searching by non-primary attributes.
  • Optimized Storage: The index table only contains the indexed column and a pointer, keeping its storage footprint minimal so it can easily fit into fast server RAM.
  • Flexibility: You can create multiple index tables for the same primary table (e.g., indexing by email, username, or creation_date simultaneously).

Cons:

  • Write Penalty (Overhead): Every time you insert, update, or delete a row in the main table, you must also write to the index table. This slows down write performance
  • Eventual Consistency Risks: If the database updates the index table asynchronously in the background, a query to the index might briefly return stale data or missing references right after a write.
  • Storage Cost: While slim, creating dozens of secondary index tables across billions of rows still consumes additional disk space and memory.

Event Sourcing

Instead of only saving the current state of an object in a database row, Event Sourcing records every single change as an immutable sequence of events in an Event Store. The current state of the system is derived by replaying these historical events from the beginning of time.

Sequence #1: [Cart Created]       - User 123 started a cart
Sequence #2: [Item Added]         - Added 'Shoes'
Sequence #3: [Item Added]         - Added 'Hat'
Sequence #4: [Item Removed]       - Removed 'Shoes'

If an object has millions of events (like a long-lived bank account), replaying every event from day one would take too long. To fix this, the system periodically takes a Snapshot (e.g., every 1,000 events). The application loads the last snapshot and only replays the handful of events that occurred after that snapshot.
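
Replay and snapshotting can be sketched with the cart example above. The event names and the tuple representation are assumptions made for this illustration; the state is modeled as a set of item names.

```python
def apply(cart: frozenset, event: tuple) -> frozenset:
    """Return the new cart state after applying one event."""
    kind, item = event
    if kind == "ItemAdded":
        return cart | {item}
    if kind == "ItemRemoved":
        return cart - {item}
    return cart  # CartCreated (and anything unknown) leaves the contents unchanged

def replay(events: list, snapshot: frozenset = frozenset(), start: int = 0) -> frozenset:
    """Rebuild state from `snapshot` by applying events from index `start` on."""
    state = snapshot
    for event in events[start:]:
        state = apply(state, event)
    return state

events = [
    ("CartCreated", None),
    ("ItemAdded", "Shoes"),
    ("ItemAdded", "Hat"),
    ("ItemRemoved", "Shoes"),
]

full = replay(events)                           # replay everything from the start
snapshot = replay(events[:3])                   # state saved after the third event
fast = replay(events, snapshot, start=3)        # snapshot plus the remaining event
print(sorted(full), sorted(snapshot), sorted(fast))
```

Both paths end in the same state, which is what makes snapshots a safe optimization: they change how much is replayed, not the result.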

Pros:

  • 100% Accurate Audit Trail
  • Time Travel Debugging
  • Because the database only performs append operations (INSERT), there are no complex table locks or in-place updates, making writes very fast.

Cons:

  • High Learning Curve: It requires a complete shift in mindset. You cannot run simple SQL queries like SELECT * FROM Users WHERE status = 'active' directly against the event store; you need a separate read model for that.
  • Schema Evolution Challenges: If your event structure changes over time (e.g., adding a new field to the Item Added event), your code must remain backwards-compatible to handle old, legacy events during replays.
  • Latency: Replaying events to generate read models takes time, meaning users might experience a brief delay before their changes appear on screen.

Cache Aside

It is a caching strategy where the application handles data synchronization between the database and the cache. Instead of the cache talking directly to the database, the application acts as the coordinator, pulling data into the cache only when a cache miss occurs.

read path:

               +-----------------------+
               |  Application Request  |
               +-----------------------+
                           |
                           v
                 /───────────────────\
                <  Is data in Cache?  >
                 \───────────────────/
                   /               \
            YES   /                 \  NO (Cache Miss)
                 v                   v
        [ Return Data ]     [ Read from DB ]
                                     |
                                     v
                            [ Write to Cache ]
                                     |
                                     v
                              [ Return Data ]

write path:

  [ User Update ]

         v
+─────────────────────────────────+
| 1. Update Primary Database      |
|    (e.g., SQL UPDATE query)     |
+─────────────────────────────────+

         │ Database Write Succeeds
         v
+─────────────────────────────────+
| 2. Delete Key from Cache        |
|    (Evict old data from Redis)  |
+─────────────────────────────────+

         v
  [ Success Response ]
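
Both paths fit in a few lines when the database and cache are modeled as dictionaries. This is a minimal sketch under that assumption: in practice the cache would be something like Redis and entries would also carry a TTL. The `stats` counter exists only to make the cache hits visible.

```python
database = {"user:1": "Alice"}  # stands in for the primary database
cache = {}                      # stands in for the cache
stats = {"db_reads": 0}

def read(key: str) -> str:
    """Read path: try the cache first, fall back to the database on a miss."""
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    value = database[key]
    cache[key] = value          # populate the cache for subsequent reads
    return value

def write(key: str, value: str) -> None:
    """Write path: update the database first, then evict the cached copy."""
    database[key] = value
    cache.pop(key, None)

first = read("user:1")    # miss: goes to the database
second = read("user:1")   # hit: served from the cache
write("user:1", "Alicia") # invalidates the cached entry
third = read("user:1")    # miss again: reloads the new value
print(first, second, third, stats["db_reads"])  # Alice Alice Alicia 2
```

Deleting the key rather than updating it in place keeps the write path simple: the next reader repopulates the cache from the database.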

Pros:

  • Memory Efficiency, as the cache only stores data that users are actively requesting.
  • Resilience to Cache Failures: If the caching layer crashes, the application doesn't break down entirely. It simply falls back to executing every query against the database.
  • Data Model Independence: Because the application coordinates everything, the structure of the data in the cache can be completely different from the database schema.

Cons:

  • Cache Miss Penalty (Three Network Hops)
  • Stale Data Risks
  • The Thundering Herd Problem: If a highly popular key expires or gets deleted during peak traffic, thousands of concurrent user requests will miss the cache simultaneously, crashing the primary database with duplicate queries.

Messaging

Messaging is a pattern that allows for the communication and coordination between different components or systems, using messaging technologies such as message queues, message brokers, and event buses.

Sequential Convoy/Saga Queue Convoy/FIFO Message Grouping

It is a pattern used in messaging and event-driven systems to process a set of related messages in a strict, sequential order, while still allowing unrelated messages to be processed concurrently across multiple worker instances.

If we have 10 workers processing messages from a single queue concurrently, Message #2 might finish processing before Message #1 because of network blips or variations in task complexity. For many business operations, out-of-order execution causes severe corruption. For example, in an e-commerce platform, if an Order Cancelled event gets processed before the Order Created event, your system will break or leave the database in an inconsistent state.

            Incoming Queue (Ordered by Group)
[ UserB: Msg2 ] [ UserA: Msg2 ] [ UserB: Msg1 ] [ UserA: Msg1 ]
       │               │               │               │
       └───────────────┼───────────────┼───────────────┘
                       v               v
           ┌──────────────────────────────────────┐
           |        Message Router / Broker       |
           └──────────────────────────────────────┘
                  /                        \
      (Locks Group: UserA)      (Locks Group: UserB)
                /                            \
               v                              v
     ┌──────────────────┐           ┌──────────────────┐
     │     Worker 1     │           │     Worker 2     │
     │ Processes UserA  │           │ Processes UserB  │
     │  Msg1 then Msg2  │           │  Msg1 then Msg2  │
     └──────────────────┘           └──────────────────┘
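
The grouping idea can be approximated with static hash partitioning: every message with the same group ID is sent to the same worker queue, so its order is preserved while different groups can run in parallel. Note that this is a simplified stand-in for the dynamic group locking a broker performs, and the `partition` helper is hypothetical.

```python
import zlib
from collections import defaultdict

def partition(messages: list, num_workers: int) -> dict:
    """Assign each (group_id, payload) message to one worker, keeping arrival order."""
    queues = defaultdict(list)
    for group_id, payload in messages:
        # crc32 is stable across runs, so a group always maps to the same worker.
        worker = zlib.crc32(group_id.encode()) % num_workers
        queues[worker].append((group_id, payload))
    return dict(queues)

messages = [
    ("UserA", "Msg1"),
    ("UserB", "Msg1"),
    ("UserA", "Msg2"),
    ("UserB", "Msg2"),
]
queues = partition(messages, num_workers=2)
for worker, queue in sorted(queues.items()):
    print(worker, queue)
```

Which worker a group lands on depends on the hash, and two groups may share a worker. What matters is that a group is never split across workers, so each worker can process its queue sequentially.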

Pros:

  • Guaranteed Ordered Execution
  • Scalable Concurrency
  • Simplified Business Logic

Cons:

  • If a single message in a convoy fails or encounters a heavy processing bottleneck, it halts the entire group sequence behind it.
  • If one specific Group ID suddenly receives millions of events, a single worker node will get overwhelmed while other workers sit idle, defeating horizontal scaling.
  • Managing runtime message locks, group states, and FIFO guarantees across thousands of distributed queues introduces higher CPU and storage complexity inside your message broker.

Orchestrator-Worker/Scheduling Agent Supervisor

It is a design strategy used to manage complex, long-running distributed tasks across a cluster of worker nodes. A centralized component (the Supervisor) schedules tasks, assigns them to available instances (the Agents), monitors their health, and coordinates recovery logic if an agent crashes midway through a job.

In distributed systems, handling large batch workloads (like processing millions of video transcodes, running massive ETL pipelines, or orchestrating multi-step airline bookings) introduces structural stability risks such as:

  • If a worker node crashes or loses power while executing a 3-hour processing task, that task vanishes into a black hole unless an external supervisor is actively tracking its lifecycle.
  • Without a centralized scheduler, some workers might be overloaded while others sit idle.
  • Many workflows require strict execution order (e.g., Task B cannot start until Task A successfully finishes). A supervisor cleanly maps and manages these dependency trees.

                 ┌───────────────────────────┐
                 │         Supervisor        │
                 │  (Orchestrator / State)   │
                 └───────────────────────────┘
                   /           │           \
   Assign Task 1  /            │            \  Assign Task 3
& Monitor Heart  /  Task 2     │             \  & Monitor Heart
                 v             v              v
           ┌───────────┐ ┌───────────┐  ┌───────────┐
           │  Agent 1  │ │  Agent 2  │  │  Agent 3  │
           │ (Worker)  │ │ (Worker)  │  │ (Worker)  │
           └───────────┘ └───────────┘  └───────────┘

The supervisor detects agent failures through heartbeat timeouts and reassigns the affected tasks. Idempotency safeguards ensure that a task retried after a suspected failure does not apply its effects twice.
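
A minimal sketch of the heartbeat-timeout idea, assuming an in-memory supervisor and a caller-supplied clock value (`now`). The `Supervisor` class and all of its method names are hypothetical; a production supervisor would persist this state durably.

```python
class Supervisor:
    """Tracks which agent owns each task and requeues tasks whose heartbeat expired."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.pending = []      # task IDs waiting for an agent
        self.running = {}      # task_id -> (agent_id, time of last heartbeat)
        self.completed = set()

    def submit(self, task_id):
        self.pending.append(task_id)

    def assign(self, agent_id, now):
        """Hand the oldest pending task to an agent, or None if nothing is pending."""
        if not self.pending:
            return None
        task_id = self.pending.pop(0)
        self.running[task_id] = (agent_id, now)
        return task_id

    def heartbeat(self, task_id, agent_id, now):
        """Refresh the lease; rejected if the agent no longer owns the task."""
        owner = self.running.get(task_id)
        if owner is None or owner[0] != agent_id:
            return False
        self.running[task_id] = (agent_id, now)
        return True

    def complete(self, task_id, agent_id):
        """Idempotency guard: only the current owner may complete a task, and only once."""
        owner = self.running.get(task_id)
        if owner is None or owner[0] != agent_id:
            return False
        del self.running[task_id]
        self.completed.add(task_id)
        return True

    def reap(self, now):
        """Requeue every task whose agent has been silent longer than the timeout."""
        expired = [task_id for task_id, (_, last) in self.running.items()
                   if now - last > self.heartbeat_timeout]
        for task_id in expired:
            del self.running[task_id]
            self.pending.append(task_id)
        return expired


supervisor = Supervisor(heartbeat_timeout=10)
supervisor.submit("transcode-1")
supervisor.assign("agent-1", now=0)
print(supervisor.reap(now=11))               # agent-1 went silent, task is requeued
print(supervisor.assign("agent-2", now=12))  # the task goes to another agent
```

Rejecting `complete` calls from an agent that has lost ownership is one way to keep a late-returning agent from finishing a task that has already been reassigned.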

Pros:

  • High Fault Tolerance: Ensures that no distributed task is ever permanently dropped or forgotten due to unexpected server crashes or network drops
  • Efficient Load Balancing: The centralized scheduler can inspect current cluster utilization and distribute tasks evenly, maximizing total compute efficiency.
  • Clear State Visibility: Operators can look directly at the supervisor’s state tables to see the exact real-time progress, history, and bottleneck areas of a massive global processing pipeline.

Cons:

  • Single Point of Failure / Bottleneck
  • State Synchronization Overhead
  • Implementing reliable distributed timers, dealing with network split-brains (where two agents think they are actively processing the same task ID), and writing safe idempotency controls adds significant engineering surface area.

Queue-Based Load Leveling

It uses a message queue as a buffer between a client application and a backend service to absorb sudden spikes in traffic, allowing the backend to consume tasks at its own steady pace.

(Volatile Inflow)                 (Leveled Outflow)
[ Client Apps ] ───────────────────► [ Message ] ───────────────────► [ Backend Workers ]
(Sudden Traffic Spikes)             Queue (Buffer)                   (Consumes at a safe, constant rate)
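
The leveling effect can be illustrated with a small tick-based simulation. This is a sketch under simplifying assumptions: `simulate_leveling` is a hypothetical helper, time advances in discrete ticks, and the backend drains at most `drain_rate` messages per tick.

```python
from collections import deque


def simulate_leveling(arrivals_per_tick, drain_rate):
    """Return (processed per tick, backlog per tick) for a fixed-rate consumer."""
    queue = deque()
    processed_per_tick, backlog_per_tick = [], []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Clients enqueue as fast as they like (volatile inflow).
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # The backend takes at most drain_rate messages per tick (leveled outflow).
        processed = 0
        while queue and processed < drain_rate:
            queue.popleft()
            processed += 1
        processed_per_tick.append(processed)
        backlog_per_tick.append(len(queue))
    return processed_per_tick, backlog_per_tick


# A burst of 10 requests is absorbed and drained 3 at a time.
print(simulate_leveling([10, 0, 0, 0, 0], drain_rate=3))
# Sustained inflow above the drain rate makes the backlog grow every tick.
print(simulate_leveling([5, 5, 5], drain_rate=3))
```

The second call shows why the queue alone is not backpressure: if arrivals stay above the drain rate, the backlog never recovers.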

Pros:

  • Eliminates application crashes and database connection pool starvation caused by sudden, unexpected traffic spikes
  • You don't need to over-provision expensive, massive server clusters to handle rare peak-load scenarios.
  • The web-facing APIs are completely decoupled from the processing logic, database states, and availability of the backend workers

Cons:

  • No Direct Backpressure: If the incoming traffic rate remains consistently higher than the backend's consumption rate over a long period, the queue will grow indefinitely, exhausting disk space or causing significant processing delays
  • Increased System Complexity: You introduce a stateful infrastructure component (the message queue broker) that requires independent configuration, dead-letter queue (DLQ) tracking, and operational monitoring.
  • Asynchronous UX Friction: Because requests are processed asynchronously, users cannot see the final result of their action immediately. The frontend interface must rely on long-polling, WebSockets, or background email notifications to inform the user when the task actually completes

Publisher-Subscriber

It is an asynchronous messaging model where creators of data (Publishers) send messages without knowing who will receive them. Instead of routing messages directly to specific destinations, publishers send them to a topic on a central component called a Message Broker. Interested receivers (Subscribers) register their interest in the topic, and the broker delivers a copy of each message to every active subscriber.

┌─────────────────┐       ┌───────────────────┐
│    Publisher    │       │    Subscriber     │
│(CheckoutService)│       │ (InventoryService)│
└─────────────────┘       └───────────────────┘
        │                          ^
  Publishes Event             Receives Copy
        v                          │
┌─────────────────┐ ────> ┌───────────────────┐
│  Message Topic  │       │    Subscriber     │
│ ("OrderPlaced") │       │ (ShippingService) │
└─────────────────┘ ────> └───────────────────┘
                          │    Subscriber     │
                          │  (EmailService)   │
                          └───────────────────┘
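
The fan-out in the diagram can be sketched with a tiny in-process broker. This is an illustration only: the `Broker` class is hypothetical, and it calls handlers synchronously, whereas a real broker delivers asynchronously over the network.

```python
import copy
from collections import defaultdict


class Broker:
    """Keeps a list of handlers per topic and gives each one its own copy."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver a separate copy to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(copy.deepcopy(message))
        return len(handlers)


broker = Broker()
inventory_inbox, shipping_inbox, email_inbox = [], [], []
broker.subscribe("OrderPlaced", inventory_inbox.append)
broker.subscribe("OrderPlaced", shipping_inbox.append)
broker.subscribe("OrderPlaced", email_inbox.append)

# The publisher only knows the topic name, not who is listening.
delivered = broker.publish("OrderPlaced", {"order_id": 42, "items": ["book"]})
print(delivered, inventory_inbox, shipping_inbox, email_inbox)
```

Adding a fourth subscriber requires no change to the publishing code, which is the extensibility benefit the pattern is chosen for.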

Pros:

  • Publishers and subscribers operate in complete isolation
  • You can add new downstream features (e.g., a fraud detection service) by simply subscribing it to the existing topic
  • If the Email Service crashes, messages queue up inside the broker (provided its subscription is durable). The Checkout Service remains unaffected, and the emails are processed once the service recovers.

Cons:

  • Publishers operate on a "fire-and-forget" basis. The Checkout Service cannot easily receive a success code or display an immediate error message back to the user if a downstream subscriber fails to process its data.
  • Systems must deal with distributed network realities. You must design your system to handle scenarios like duplicate messages (ensuring subscribers are idempotent) or out-of-order delivery.
  • Because messages flow asynchronously through background brokers, debugging distributed data pipelines or tracing a single transaction's lifecycle across dozens of loosely coupled topics requires specialized distributed tracing infrastructure

Priority Queue

It is a messaging and data-structure pattern where items are processed based on an assigned priority level rather than strictly in arrival order.

            Incoming Tasks (with Priority levels)
[ Report: P1 ] [ FraudCheck: P3 ] [ Statement: P1 ] [ ResetToken: P3 ]
       │                │               │               │
       └────────────────┼───────────────┼───────────────┘
                        v               v
           ┌──────────────────────────────────────┐
           |      Priority Queue Broker           |
           | (Sorts dynamically by priority code) |
           └──────────────────────────────────────┘
                   /                        \
      (High Priority: P3)              (Low Priority: P1)
                 /                            \
                v                              v
     ┌──────────────────┐           ┌──────────────────┐
     │  Urgent Workers  │           │ Batch Workers    │
     │ Processes P3 Rows│           │ Processes P1 Rows│
     │   Immediately    │           │As Capacity Allows│
     └──────────────────┘           └──────────────────┘
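
A minimal sketch of the sorting step, using Python's standard `heapq` module. The `PriorityTaskQueue` class is hypothetical; it follows the diagram's convention that a higher number means higher priority, and it keeps arrival order among tasks of equal priority.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Pops the highest-priority task first; ties are served in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves FIFO order

    def push(self, priority, task):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


tasks = PriorityTaskQueue()
tasks.push(1, "Report")
tasks.push(3, "FraudCheck")
tasks.push(1, "Statement")
tasks.push(3, "ResetToken")

while tasks:
    print(tasks.pop())
```

Both P3 tasks come out before either P1 task, which also illustrates the starvation risk: as long as P3 work keeps arriving, the P1 tasks stay at the back.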

Pros:

  • Guaranteed Low Latency for Critical Actions
  • Granular Quality of Service (QoS)
  • Backend infrastructure can be fine-tuned to dedicate more processing nodes or faster hardware threads exclusively to the high-priority queues

Cons:

  • If the application experiences a sustained, heavy volume of high-priority messages, the low-priority tasks may sit in the queue indefinitely and never get processed
  • Keeping the queue ordered adds work on every insert: a heap-based queue, for example, needs O(log n) reordering per insert instead of a constant-time append
  • Because messages skip the line based on priority, arrival-order guarantees are broken, so downstream services must tolerate out-of-order processing and remain idempotent to prevent state collision bugs.

Competing Consumers

It involves deploying multiple identical worker instances that read and process messages from a single shared message channel simultaneously, increasing throughput.

                               ┌───────────────────┐
                               │     Worker 1      │
                        ┌────> │ (Processes Msg 1) │
                        │      └───────────────────┘
┌───────────────┐       │      ┌───────────────────┐
│ Message Queue │ ──────┼────> │     Worker 2      │
│ [M3] [M2] [M1]│       │      │ (Processes Msg 2) │
└───────────────┘       │      └───────────────────┘
                        │      ┌───────────────────┐
                        └────> │     Worker 3      │
                               │ (Processes Msg 3) │
                               └───────────────────┘
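
The diagram can be approximated in-process with threads sharing one `queue.Queue`. This is only a sketch: `run_workers` is a hypothetical helper, and real deployments would use separate processes or machines reading from a broker.

```python
import queue
import threading


def run_workers(messages, worker_count):
    """Start worker_count threads that compete for messages on one shared queue."""
    shared = queue.Queue()
    for message in messages:
        shared.put(message)

    processed = []            # (worker name, message) pairs
    lock = threading.Lock()   # protects the shared result list

    def worker(name):
        while True:
            try:
                message = shared.get_nowait()  # each message goes to exactly one worker
            except queue.Empty:
                return                         # queue drained, worker exits
            with lock:
                processed.append((name, message))

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(1, worker_count + 1)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


results = run_workers(["M1", "M2", "M3"], worker_count=3)
print(sorted(message for _, message in results))
```

Which worker handles which message varies between runs, and so does the completion order, which is the loss of ordering listed under the cons.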

Pros:

  • Seamless Horizontal Scaling
  • High Availability & Fault Tolerance
  • Built-in Load Balancing

Cons:

  • Loss of Ordering Guarantees
  • Strict Idempotency Requirement: If a network blip occurs right after a worker finishes a job but before it can send its ACK, the broker will assume the worker died and hand the message to another node. The system will process some tasks twice, meaning your worker code must be perfectly idempotent
  • Poison Message Bottlenecks: If a specific message contains malformed data that causes a worker to crash, the broker will re-queue it, causing it to crash the next worker, and the next, trapping your cluster in a continuous crash-loop
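
Two common mitigations for the last two points are deduplicating by message ID and moving a repeatedly failing message to a dead-letter list. The sketch below assumes each message carries a unique ID; `consume` and its `max_attempts` parameter are hypothetical names, and the in-memory set stands in for a durable deduplication store.

```python
def consume(messages, handler, max_attempts=3):
    """Process (message_id, body) pairs once each; dead-letter repeated failures."""
    processed_ids = set()   # IDs already handled, used to skip duplicate deliveries
    attempts = {}           # message_id -> number of failed attempts so far
    dead_letters = []       # messages given up on after max_attempts failures
    pending = list(messages)

    while pending:
        message_id, body = pending.pop(0)
        if message_id in processed_ids:
            continue  # duplicate delivery: already handled, so do nothing
        try:
            handler(body)
            processed_ids.add(message_id)
        except Exception:
            attempts[message_id] = attempts.get(message_id, 0) + 1
            if attempts[message_id] >= max_attempts:
                dead_letters.append((message_id, body))  # stop retrying the poison message
            else:
                pending.append((message_id, body))       # re-queue for another attempt
    return processed_ids, dead_letters


def handle(body):
    if body == "malformed":
        raise ValueError("cannot parse message")


done, dead = consume([(1, "ok"), (1, "ok"), (2, "malformed"), (3, "ok")], handle)
print(done, dead)
```

The retry cap bounds the damage of a poison message: after `max_attempts` failures it is set aside for inspection instead of circulating through the workers forever.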

Choreography

It is an approach to coordinating microservices in which individual services communicate asynchronously by reacting to events published to a message broker, rather than relying on a centralized controller (an orchestrator).

             ┌─────────────────┐
             │  Order Service  │
             └─────────────────┘

           Publishes: "OrderCreated"
                      v
             ┌─────────────────┐
             │ Message Broker  │ <─── (Decentralized Event Hub)
             └─────────────────┘
                /           \
Listens to "OrderCreated"   Listens to "OrderCreated"
              /               \
             v                 v
   ┌──────────────────┐       ┌──────────────────┐
   │ Payment Service  │       │Inventory Service │
   └──────────────────┘       └──────────────────┘
            │                           │
Publishes: "PaymentPaid"     Publishes: "StockReserved"
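
The event chain in the diagram can be sketched with a small in-memory event bus. Everything here is illustrative: `EventBus` and the two service functions are hypothetical, and events are dispatched from a local queue rather than a real broker.

```python
from collections import defaultdict, deque


class EventBus:
    """Queues published events and dispatches each one to its subscribers."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._events = deque()
        self.log = []  # event types in the order they were dispatched

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, data):
        self._events.append((event_type, data))

    def run(self):
        """Dispatch until no events remain; handlers may publish follow-up events."""
        while self._events:
            event_type, data = self._events.popleft()
            self.log.append(event_type)
            for handler in self._handlers.get(event_type, []):
                handler(data)


def register_payment_service(bus):
    # Reacts to OrderCreated on its own; no controller tells it to.
    bus.subscribe("OrderCreated", lambda order: bus.publish("PaymentPaid", order))


def register_inventory_service(bus):
    bus.subscribe("OrderCreated", lambda order: bus.publish("StockReserved", order))


bus = EventBus()
register_payment_service(bus)
register_inventory_service(bus)
bus.publish("OrderCreated", {"order_id": 42})  # the Order Service's only action
bus.run()
print(bus.log)
```

No component holds the full workflow: the sequence only becomes visible by reading the event log, which is why tracing tools matter for this pattern.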

Pros:

  • No Single Point of Failure
  • Extreme Loose Coupling
  • Fast Performance

Cons:

  • Because data flow is completely decentralized, it can be incredibly difficult for engineers to visualize the entire global workflow or map end-to-end business lifecycles.
  • Tracing a bug across an event chain requires robust distributed tracing tools
  • Split-Brain Rollback Risks: If a highly nested sequence fails 5 steps down the line, managing the matrix of background compensating events to cleanly undo data states requires rigorous integration testing.

Claim Check

It involves splitting a large message into two parts: a lightweight claim check token and the heavy raw payload. The payload is written to external storage, and only the token travels through the message broker. Message brokers (like RabbitMQ, Apache Kafka, or Amazon SQS) are designed to route large volumes of lightweight messages with low latency, and many keep active queues largely in memory to achieve this speed. Keeping message size small is therefore really beneficial.

[ Sender App ] ─────── 1. Stores 50MB Payload ───────► [ Cloud Storage (S3/Blob) ]
      │                                                         ^
2. Sends Token                                                  │
(ID: abc-123)                                           4. Fetches 50MB Payload
      v                                                         │
[ Message Queue ] ──── 3. Forwards Token (ID: abc-123) ──► [ Receiver App ]
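
The four numbered steps can be sketched with a dictionary standing in for cloud storage and a deque standing in for the message queue. `ClaimCheckChannel` and its field names are hypothetical; a real implementation would call a storage SDK and a broker client instead.

```python
import uuid
from collections import deque


class ClaimCheckChannel:
    """Stores payloads out of band and passes only a small token through the queue."""

    def __init__(self):
        self.blob_store = {}   # stand-in for cloud storage: token -> payload
        self.queue = deque()   # stand-in for the message queue

    def send(self, payload: bytes) -> str:
        token = str(uuid.uuid4())
        self.blob_store[token] = payload                 # 1. store the heavy payload
        self.queue.append({"claim_check": token,         # 2. enqueue only the token
                           "size_bytes": len(payload)})
        return token

    def receive(self) -> bytes:
        message = self.queue.popleft()                   # 3. receive the token
        # 4. fetch the payload; pop() also deletes it so storage does not pile up
        return self.blob_store.pop(message["claim_check"])


channel = ClaimCheckChannel()
channel.send(b"x" * 1_000_000)
print(channel.queue[0]["size_bytes"])   # the queued message itself stays tiny
print(len(channel.receive()))
```

Deleting the payload on retrieval is one simple cleanup policy; it only works when each payload has a single receiver.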

Pros:

  • Protects Infrastructure Health: Prevents memory saturation, high disk-swapping latency, and throughput degradation inside your core message brokers.
  • Bypasses Platform Limits: Allows you to orchestrate workloads involving gigabytes of data over messaging platforms that enforce strict payload size caps (often measured in kilobytes)
  • Cost Optimization

Cons:

  • Increased System Latency
  • Storage Accumulation (Lifecycle Debt): stored payloads pile up if they are not deleted once the request completes
  • Dual-Component Vulnerability: If your cloud storage bucket goes down, or if a network split prevents the receiver from reaching it, the entire event pipeline stalls even though your message broker is completely healthy.

Asynchronous Request-Reply

It decouples client-backend communication when a request triggers a long-running task. Instead of keeping an HTTP connection open for minutes while the server processes the job, the server accepts the request immediately, hands the client a status endpoint, and terminates the initial connection. The client then monitors the status endpoint until the work is complete.

[ Client App ] ─────── 1. POST /api/reports ────────────────> [ API Gateway / Server ]
      ^                                                              │
      │ 2. Returns HTTP 202 Accepted                                 │
      │    Header -> Location: /api/status/123 <─────────────────────┘

      ├─── 3. GET /api/status/123 (Polling Loop) ──────────────> [ Status Endpoint ]
      │ <─── Returns HTTP 200 OK {"status": "Processing"} ───────────┤
      │                                                              │
      ├─── 4. GET /api/status/123 (Final Poll) ──────────────────>   │
        <─── Returns HTTP 303 See Other                              │
             Header -> Location: /api/downloads/report.pdf ◄─────────┘
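
The exchange in the diagram can be sketched without a web framework by modelling each response as a `(status, headers, body)` tuple. `ReportServer` and its method names are hypothetical, and `finish` stands in for the background worker completing the job.

```python
import itertools


class ReportServer:
    """Accepts a job immediately and exposes its progress on a status path."""

    def __init__(self):
        self._ids = itertools.count(123)  # job IDs start at 123 to match the diagram
        self._jobs = {}                   # job_id -> "Processing" or "Done"

    def post_report(self):
        job_id = next(self._ids)
        self._jobs[job_id] = "Processing"
        # 202 Accepted: work has been queued, and the client is told where to poll.
        return 202, {"Location": f"/api/status/{job_id}"}, None

    def finish(self, job_id):
        self._jobs[job_id] = "Done"

    def get_status(self, path):
        job_id = int(path.rsplit("/", 1)[1])
        if job_id not in self._jobs:
            return 404, {}, None
        if self._jobs[job_id] == "Processing":
            return 200, {}, {"status": "Processing"}
        # 303 See Other: redirect the client to the finished resource.
        return 303, {"Location": "/api/downloads/report.pdf"}, None


server = ReportServer()
status, headers, _ = server.post_report()
status_path = headers["Location"]
print(status, status_path)
print(server.get_status(status_path))   # still processing
server.finish(123)
print(server.get_status(status_path))   # redirect to the download
```

The initial request returns at once regardless of how long the job takes, so no connection is held open while the work runs.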

Pros:

  • Eliminates Connection Timeouts
  • Highly Scalable Architecture
  • Standardized UX Interface

Cons:

  • If thousands of clients poll the status endpoint every 2 seconds, the resulting "chatter" can stress the server layer. One mitigation is to push the final state change to the frontend over Server-Sent Events (SSE) or WebSockets instead of polling
  • State Persistence Necessity: The server must maintain a persistent, fast state database to share the real-time status of the worker job with the API endpoint.
  • Increased System Complexity: The architecture requires adding background runners, message ingestion brokers, and robust distributed task tracing to coordinate the lifecycle safely.