What is observability?
In software engineering, observability refers to the ability to understand what's happening inside a system based on the data it produces — particularly when something goes wrong.
It’s about answering "why is this happening?" without needing to manually dig into the code or reproduce the issue.
Key Components of Observability
- Logs
- Textual records of events that happen during software execution.
- Useful for debugging and tracing individual events or requests.
- Metrics
- Numeric measurements over time, like CPU usage, memory consumption, request rates, or error counts.
- Great for spotting trends and performance issues.
- Traces
- Show the journey of a request as it passes through different services or components.
- Help pinpoint slowdowns or failures in distributed systems.
Monitoring vs. Observability
| Aspect |
Monitoring |
Observability |
| Focus |
What is happening? |
Why is it happening? |
| Based on |
Predefined checks & metrics |
Logs, metrics, traces (rich data) |
| Use case |
Alerting, uptime tracking |
Debugging, root cause analysis |
| Scope |
Reactive |
Proactive & investigative |
Who implements observability?
Observability in software is typically implemented collaboratively, involving multiple roles. Here's how it's usually broken down:
- Developers
- Instrumentation: Application developers are typically responsible for embedding logging, tracing, and metrics directly into the code. This instrumentation forms the backbone of observability by providing detailed insights into how the application functions.
- Best Practices: Developers adhere to established practices for writing log messages, using correlation IDs, and ensuring that error details are captured meaningfully.
- DevOps and Site Reliability Engineering (SRE) Teams
- Monitoring Setup: DevOps and SRE teams typically design and maintain the monitoring infrastructure, which includes setting up systems such as Prometheus, Grafana, ELK stack, or other observability platforms.
- Alerting and Incident Response: These teams define thresholds, alerts, and dashboards based on the data collected, allowing for rapid identification and diagnosis of problems in production.
- Automation: They often automate the collection and correlation of logs, metrics, and traces across different environments to ensure continuous observability.
Metrics
Metrics are quantitative measurements that describe the behavior, performance, and health of a system over time. They're a core component of observability, helping teams understand how systems are functioning and whether they're meeting performance and reliability goals.
Types of Metrics
Here are the most common categories of metrics in software systems:
System Metrics
- CPU usage (e.g.,
cpu_usage_percent)
- Memory usage (e.g.,
memory_free_bytes)