Monitoring
Distributed applications and services that run in the cloud are complex pieces of software that comprise many moving parts. In a production environment, it's important to track how customers use your system, trace resource utilization, and monitor the health and performance of your system. You can use this information to detect and correct problems and to catch potential problems before they occur.
Health Monitoring
The system should raise an alert within seconds if any part is unhealthy. The status of a system can be expressed with a traffic-light analogy: red (unhealthy), yellow (partially healthy), green (healthy). We can collect several kinds of data to assess the health of a system:
- Trace the execution of user requests
- Log exceptions, faults, and warnings
- Monitor the health of external services such as databases and APIs
- Monitor the endpoints of the server
- Collect ambient performance information such as CPU utilization, I/O operations, and network activity
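As a rough illustration of the traffic-light idea, the sketch below rolls several health checks up into a single status. The check names, the `critical` set, and the rule "any critical failure means red, any other failure means yellow" are assumptions made for this example, not a prescribed policy.

```python
from enum import Enum
from typing import Callable, Dict, Set


class Health(Enum):
    GREEN = "healthy"
    YELLOW = "partially healthy"
    RED = "unhealthy"


def overall_health(checks: Dict[str, Callable[[], bool]], critical: Set[str]) -> Health:
    """Run every check and roll the results up into one traffic-light status."""
    failed = set()
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated the same as a failed check.
            ok = False
        if not ok:
            failed.add(name)
    if failed & critical:
        return Health.RED      # at least one critical dependency is down
    if failed:
        return Health.YELLOW   # only non-critical dependencies are down
    return Health.GREEN


# Hypothetical checks; real ones would ping a database, call an API, and so on.
checks = {
    "database": lambda: True,
    "payments_api": lambda: False,
    "cache": lambda: True,
}
status = overall_health(checks, critical={"database"})
print(status.name, "-", status.value)
```

Here the failing `payments_api` check is not in the critical set, so the overall status is yellow rather than red.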
Availability Monitoring
Availability monitoring tracks the availability of the system and its components to generate uptime statistics, such as 99.999% availability. It shows how much of the time the services that users rely on are up and working. The availability of a system depends on factors at multiple levels, which might be specific to the application, the system, or the environment. An availability monitoring solution provides current and historical views of the availability status of each subsystem. It quickly alerts you when one or more services fail or when users can't connect to services. Use this information to identify trends that might cause subsystems to fail.
% Availability = ((Total Time - Total Downtime) / Total Time) * 100
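The formula can be worked through in a few lines of code. The sketch below also inverts it to get the downtime budget implied by a target; the function names and the 30-day period are chosen for illustration only.

```python
def availability_percent(total_time: float, total_downtime: float) -> float:
    """Apply the availability formula; both arguments use the same time unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= total_downtime <= total_time:
        raise ValueError("total_downtime must be between 0 and total_time")
    return (total_time - total_downtime) / total_time * 100


def allowed_downtime(total_time: float, target_percent: float) -> float:
    """Invert the formula: how much downtime a target availability permits."""
    return total_time * (1 - target_percent / 100)


MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

print(round(availability_percent(MINUTES_PER_30_DAYS, 43.2), 3))
print(round(allowed_downtime(MINUTES_PER_30_DAYS, 99.999), 3))
```

Over a 30-day period, 43.2 minutes of downtime corresponds to 99.9% availability, while a 99.999% target leaves less than half a minute of downtime.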
Performance Monitoring
Performance monitoring is the continuous process of tracking, measuring, and analyzing a system's operational metrics to ensure it runs efficiently, remains available, and delivers a good user experience. Depending on what we are monitoring (infrastructure, networks, or applications), we look at the following:
- Latency
- Traffic
- Rate of Errors
- Saturation of different resources
- CPU utilization
- Memory Utilization
- Disk I/O Utilization
- Network throughput
To determine whether the system's performance is good or bad, you need to know its typical performance level. Observe the system while it functions under a typical load and capture the data for each metric (KPI) over a period of time. Consider running the system under a simulated load in a test environment and gathering the appropriate data before you deploy it to a production environment.
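One way to turn "typical performance" into something comparable is to summarize the captured samples as percentiles and check later measurements against them. This is a minimal sketch using only the standard library; the choice of median and 95th percentile, and the 20% tolerance, are assumptions for illustration.

```python
import statistics
from typing import Dict, Sequence


def baseline(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples (at least two) as median and 95th percentile."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": statistics.median(samples), "p95": cuts[94]}


def is_degraded(current_p95: float, base: Dict[str, float], tolerance: float = 1.2) -> bool:
    """Flag a regression when the current p95 exceeds the baseline p95 by more than the tolerance factor."""
    return current_p95 > base["p95"] * tolerance


# Hypothetical latencies in milliseconds, captured under typical load.
typical_load_ms = [float(v) for v in range(1, 101)]
base = baseline(typical_load_ms)
print(base)
print(is_degraded(120.0, base), is_degraded(100.0, base))
```

Percentiles are used here rather than the mean because a few slow requests can hide behind a healthy-looking average.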
Security Monitoring
Security monitoring is the continuous process of collecting, analyzing, and correlating data across an organization's IT infrastructure to detect, investigate, and respond to potential security threats and vulnerabilities. It can help detect attacks on the system. For example, several failed sign-in attempts might indicate a brute-force attack, and an unexpected surge in requests might be the result of a DDoS attack. In a system that requires users to be authenticated, we record the following information:
- All sign-in attempts and whether they fail or succeed
- All operations performed by an authenticated user and the details of all resources that they access
- When a user ends a session and signs out
The recorded data should also be analyzed to detect when:
- One account makes repeated failed sign-in attempts within a specified period
- One authenticated account repeatedly tries to access a prohibited resource during a specified period
- A large number of unauthenticated or unauthorized requests occur during a specified period
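Detecting repeated failed sign-in attempts within a specified period can be sketched as a sliding window per account. The class name, the thresholds, and the use of plain numeric timestamps are all assumptions for this example; a production system would typically run this kind of rule in its log analytics or SIEM tooling.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailedSignInDetector:
    """Flags an account once it accumulates too many failures inside a time window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 60.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, account: str, timestamp: float) -> bool:
        """Record one failed attempt; return True if the account should be flagged."""
        recent = self._failures[account]
        recent.append(timestamp)
        # Drop attempts that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_failures


detector = FailedSignInDetector(max_failures=3, window_seconds=10)
# Hypothetical failure times (in seconds) for one account.
flags = [detector.record_failure("alice", t) for t in (0, 4, 8, 30)]
print(flags)
```

The third failure lands within 10 seconds of the first and triggers the flag; the attempt at 30 seconds starts a fresh window.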
SLA Monitoring
Commercial systems that support paying customers make commitments about system performance in the form of service-level agreements (SLAs). SLAs state that the system can handle a defined volume of work within an agreed time frame and without losing critical information. The metrics that define an SLA include:
- Overall system availability, such as more than 99.9% availability.
- Operational throughput, such as handling 100k concurrent users.
- Operational response time, such as 99% of all business transactions finishing within 2 seconds.
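A response-time target like the one above can be checked directly against recorded transaction durations. The sketch below assumes durations are available as a list of seconds; the function names and default values mirror the example target and are not part of any standard.

```python
from typing import Sequence


def sla_compliance(durations: Sequence[float], limit_seconds: float = 2.0) -> float:
    """Percentage of transactions that finished within the time limit."""
    if not durations:
        raise ValueError("no transactions recorded")
    within = sum(1 for d in durations if d <= limit_seconds)
    return within * 100 / len(durations)


def meets_sla(durations: Sequence[float], limit_seconds: float = 2.0,
              target_percent: float = 99.0) -> bool:
    """True when the share of on-time transactions reaches the target."""
    return sla_compliance(durations, limit_seconds) >= target_percent


# Hypothetical sample: 99 fast transactions and one slow one.
sample = [0.5] * 99 + [3.0]
print(sla_compliance(sample), meets_sla(sample))
```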
Usage Monitoring
Usage monitoring tracks how customers use an application's features and components. We can use the collected data to:
- Identify popular features and potential hotspots in the system. High-traffic elements might benefit from functional partitioning or replication to spread the load more evenly. You can also use this information to determine which features are infrequently used and are possible candidates for retirement or replacement in a future version of the system.
- Obtain information about the operational events of the system under normal use. For example, in an e-commerce site, you can record statistical information about the number of transactions and the volume of customers responsible for them. You can use this information for capacity planning as the number of customers grows.
- Detect user satisfaction with the performance and functionality of the system. For example, if many customers in an e-commerce system regularly abandon their shopping carts, there might be a problem with the checkout functionality.
- Generate billing information. A commercial application or multitenant service might charge customers for the resources that they use.
- Enforce quotas. If a user in a multitenant system exceeds their paid quota of processing time or resource usage during a specified period, you can limit their access or throttle processing.
- Detect noisy neighbor problems. To help drive error investigations or product decisions, determine whether traffic is evenly spread or if a small set of users generates most of the traffic. If a single user generates significant traffic, the feature might need performance tuning. Alternatively, you might decide to impose extra quotas to reduce traffic.
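Two of the uses above, enforcing quotas and spotting noisy neighbors, reduce to simple aggregations over usage records. This sketch assumes a request log that is just a list of user IDs and a usage table keyed by user; the names, the 50% share threshold, and the choice to treat a user with no configured quota as having a quota of zero are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def top_talkers(request_log: Iterable[str], share_threshold: float = 0.5) -> Dict[str, float]:
    """Return users whose share of all requests is at or above the threshold."""
    counts = Counter(request_log)
    total = sum(counts.values())
    return {
        user: count / total
        for user, count in counts.items()
        if count / total >= share_threshold
    }


def over_quota(usage: Dict[str, float], quotas: Dict[str, float]) -> List[str]:
    """Return users whose usage exceeds their quota (missing quota counts as 0)."""
    return sorted(user for user, used in usage.items() if used > quotas.get(user, 0))


# Hypothetical request log: tenant "a" generates 8 of 10 requests.
log = ["a"] * 8 + ["b", "c"]
print(top_talkers(log))
print(over_quota({"a": 120, "b": 30}, {"a": 100, "b": 50}))
```

Users returned by `over_quota` are candidates for throttling, while a user returned by `top_talkers` might point to a feature that needs performance tuning.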
Instrumentation
Instrumentation is the process of embedding code, hooks, or monitoring agents into an application and its underlying infrastructure to measure performance, diagnose errors, and track state. There are two types:
- Automatic Instrumentation: Code is injected or wrapped automatically at runtime or compile time without manual changes to the source code. It uses byte-code manipulation, middleware, agents, or kernel technologies such as eBPF on Linux. It is quick to set up and immediately captures requests, database calls, and standard OS metrics, but it provides less business-specific context.
- Manual Instrumentation: Developers explicitly write code using an SDK or API to track specific events, metrics, or execution paths. It includes adding lines of code to start a span, increment a counter, or log a custom error. It's highly precise because it captures domain-specific business data, but it requires codebase modification and ongoing maintenance.
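To make manual instrumentation concrete, the sketch below wraps a function in a decorator that counts calls, counts errors, and records durations. It uses only the standard library, and the in-memory dictionaries are stand-ins for a real metrics backend; an SDK such as OpenTelemetry would replace them in practice. The `checkout` function and metric names are hypothetical.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-ins for a metrics backend (assumption for this sketch).
call_counts = defaultdict(int)
error_counts = defaultdict(int)
durations = defaultdict(list)


def instrumented(name):
    """Decorator that records call count, error count, and duration under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            call_counts[name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                error_counts[name] += 1
                raise  # instrumentation must not swallow the error
            finally:
                durations[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total: float) -> str:
    # Hypothetical business operation used only to exercise the decorator.
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return f"charged {cart_total:.2f}"


print(checkout(25.0))
print(call_counts["checkout"], error_counts["checkout"])
```

The decorator keeps the measurement code out of the business logic, which limits the maintenance cost that manual instrumentation otherwise adds.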
+----------------------------------------------------------+
| Application Layer (Custom code, business logic)          | --> Traces, Logs, Metrics
+----------------------------------------------------------+
| Framework/Runtime Layer (HTTP servers, Database drivers) | --> Latency, Errors, Connections
+----------------------------------------------------------+
| Infrastructure Layer (OS, Containers, Hypervisor)        | --> CPU, Memory, Disk I/O
+----------------------------------------------------------+
Some standards and frameworks are:
- OpenTelemetry (OTel): The industry standard for cloud-native software instrumentation. It provides a unified set of APIs, SDKs, and tooling to generate and export telemetry data (Metrics, Logs, and Traces).
- eBPF (Extended Berkeley Packet Filter): A Linux kernel technology that allows running sandboxed programs inside the kernel without changing kernel source code or loading modules. It enables zero-code instrumentation for network, security, and performance tracking directly from the OS layer.
Visualization and Alerts
Visualization and alerting translate raw telemetry data (metrics, logs, and traces) into actionable intelligence. Instrumentation collects the data, but dashboards and alerting systems are how humans interact with it to prevent or resolve outages. Common graph types include time series graphs, heat maps, histograms, and single-stat blocks.
An alert is a notification triggered when a system metric violates a predefined condition. Poorly designed alerting leads to alert fatigue, where engineers ignore critical notifications because they are constantly bombarded by false alarms.
[ Telemetry Data ]
│
▼
[ Alerting Engine (Prometheus Alertmanager / Datadog) ]
│
├─► P1 / Critical ──► PagerDuty / Opsgenie (Wakes up an on-call engineer)
│
└─► P3 / Warning ──► Slack / Microsoft Teams (For non-urgent review)
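The routing in the diagram, together with one common way to cut down on false alarms, can be sketched as follows. The rule only fires when the condition holds for several consecutive samples, then maps severity to a channel. The `AlertRule` fields, the channel names, and the consecutive-samples approach are assumptions for this example and do not reflect the configuration format of any specific alerting tool.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical channel names standing in for a pager and a chat integration.
ROUTES = {"P1": "pager", "P3": "chat"}


@dataclass
class AlertRule:
    name: str
    threshold: float
    for_samples: int  # consecutive breaching samples required (must be >= 1)
    severity: str     # "P1" or "P3"


def evaluate(rule: AlertRule, samples: Sequence[float]) -> Optional[str]:
    """Return the target channel if the rule fires, otherwise None."""
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    recent = samples[-rule.for_samples:]
    if len(recent) < rule.for_samples:
        return None  # not enough data to decide yet
    if all(value > rule.threshold for value in recent):
        return ROUTES[rule.severity]
    return None


high_cpu = AlertRule(name="high_cpu", threshold=90.0, for_samples=3, severity="P1")
print(evaluate(high_cpu, [50, 95, 96, 97]))  # sustained breach
print(evaluate(high_cpu, [95, 50, 96, 97]))  # brief spike, then only two breaches
```

Requiring a sustained breach means a single spike does not page anyone, which is one way to limit alert fatigue.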