Observability

What is observability?

In software engineering, observability is the ability to understand what is happening inside a system from the data it produces, particularly when something goes wrong.

It’s about answering "why is this happening?" without needing to manually dig into the code or reproduce the issue.

Key Components of Observability

Logs
- Textual records of events that happen during software execution.
- Useful for debugging and tracing individual events or requests.
Metrics
- Numeric measurements over time, like CPU usage, memory consumption, request rates, or error counts.
- Great for spotting trends and performance issues.
Traces
- Show the journey of a request as it passes through different services or components.
- Help pinpoint slowdowns or failures in distributed systems.
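
To make the three signals concrete, here is a minimal sketch that emits a log line, a metric, and a trace span for one simulated request. It uses only the Python standard library; the names (`handle_request`, `traced`, `requests_total`, `load_user`, `render`) are hypothetical, and a real system would more likely send this data to a dedicated observability library or backend.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # metric: numeric counts aggregated over time
spans = []            # trace: timed steps that share one trace_id
log_lines = []        # log: textual records of individual events


def log(trace_id, message):
    """Record one structured log line tagged with the request's trace_id."""
    line = json.dumps({"ts": time.time(), "trace_id": trace_id, "msg": message})
    log_lines.append(line)
    print(line)


def traced(trace_id, name, func):
    """Run func and record how long it took as a span of the given trace."""
    start = time.perf_counter()
    try:
        return func()
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request():
    trace_id = uuid.uuid4().hex          # one id ties logs and spans together
    metrics["requests_total"] += 1
    log(trace_id, "request received")
    user = traced(trace_id, "load_user", lambda: {"id": 1})
    traced(trace_id, "render", lambda: f"hello {user['id']}")
    log(trace_id, "request done")
    return trace_id


trace_id = handle_request()
print(dict(metrics))
print([span["name"] for span in spans])
```

The metric answers "how many requests?", the spans show where the time went inside one request, and the log lines give the event-by-event detail, all linked by the same `trace_id`.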

Monitoring vs. Observability

Aspect	Monitoring	Observability
Focus	What is happening?	Why is it happening?
Based on	Predefined checks & metrics	Logs, metrics, traces (rich data)
Use case	Alerting, uptime tracking	Debugging, root cause analysis
Approach	Reactive	Proactive & investigative

Who implements observability?

Observability in software is typically implemented collaboratively, involving multiple roles. Here's how it's usually broken down:

Developers
- Instrumentation: Application developers are typically responsible for embedding logging, tracing, and metrics directly into the code. This instrumentation forms the backbone of observability by providing detailed insights into how the application functions.
- Best Practices: Developers adhere to established practices for writing log messages, using correlation IDs, and ensuring that error details are captured meaningfully.
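
As one possible illustration of correlation IDs and meaningful error capture, the sketch below attaches a per-request ID to every log line using Python's standard `logging` and `contextvars` modules. The service name `checkout` and the functions `handle_request` and `charge_card` are assumptions made up for the example, and the in-memory stream stands in for a real log destination.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # captured in memory here; a service would use stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
handler.addFilter(CorrelationIdFilter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def charge_card(amount):
    if amount <= 0:
        raise ValueError(f"invalid amount: {amount}")
    logger.info("charged %s", amount)


def handle_request(amount):
    token = correlation_id.set(uuid.uuid4().hex[:8])
    try:
        logger.info("request started")
        charge_card(amount)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("charge failed")
    finally:
        correlation_id.reset(token)


handle_request(25)
handle_request(0)
print(stream.getvalue())
```

Every line from the same request carries the same ID, so the failing request's error and traceback can be matched to its earlier "request started" entry.
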
DevOps and Site Reliability Engineering (SRE) Teams
- Monitoring Setup: DevOps and SRE teams typically design and maintain the monitoring infrastructure, which includes setting up tools such as Prometheus, Grafana, the ELK stack, or other observability platforms.
- Alerting and Incident Response: These teams define thresholds, alerts, and dashboards based on the data collected, allowing for rapid identification and diagnosis of problems in production.
- Automation: They often automate the collection and correlation of logs, metrics, and traces across different environments to ensure continuous observability.
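
The threshold-and-alert idea above can be sketched in a few lines. This is a simplified illustration, not how any particular platform evaluates rules: the `AlertRule` class, the rule names, the thresholds, and the sample values are all made up for the example. The rule fires only after several consecutive samples breach the threshold, which is one common way to avoid alerting on a single spike.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    min_breaches: int  # consecutive samples above threshold before firing (>= 1)


def evaluate(rule, samples):
    """Return True if the last `min_breaches` samples all exceed the threshold."""
    recent = samples[-rule.min_breaches:]
    return len(recent) == rule.min_breaches and all(v > rule.threshold for v in recent)


# Hypothetical rules a team might define.
rules = [
    AlertRule("HighErrorRate", "error_rate", 0.05, 3),
    AlertRule("HighCpu", "cpu_usage_percent", 90.0, 2),
]

# Made-up samples, oldest first, as if scraped at a fixed interval.
collected = {
    "error_rate": [0.01, 0.06, 0.07, 0.09],
    "cpu_usage_percent": [40.0, 95.0, 60.0],
}

firing = [rule.name for rule in rules if evaluate(rule, collected[rule.metric])]
print(firing)  # ['HighErrorRate']
```

Here the error rate has been above 5% for three samples in a row, so that alert fires, while the single CPU spike to 95% does not trigger anything.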

Metrics

Metrics are quantitative measurements that describe the behavior, performance, and health of a system over time. They're a core component of observability, helping teams understand how systems are functioning and whether they're meeting performance and reliability goals.
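
As a rough sketch of how raw measurements turn into metrics that can be checked against goals, the example below derives an availability ratio and a 95th-percentile latency from a handful of request samples. The sample data, the goal values, and the nearest-rank percentile method are assumptions chosen for illustration.

```python
import math

# Made-up request samples: (timestamp_s, latency_ms, succeeded)
samples = [
    (0, 120, True), (1, 95, True), (2, 410, False), (3, 101, True), (4, 88, True),
    (5, 130, True), (6, 99, True), (7, 105, True), (8, 92, True), (9, 115, True),
]


def percentile(values, fraction):
    """Nearest-rank percentile: smallest value covering `fraction` of the samples."""
    ordered = sorted(values)
    rank = math.ceil(fraction * len(ordered))
    return ordered[max(rank, 1) - 1]


latencies = [latency for _, latency, _ in samples]
availability = sum(1 for _, _, ok in samples if ok) / len(samples)
p95_latency_ms = percentile(latencies, 0.95)

# Hypothetical performance and reliability goals.
goals = {"availability": 0.99, "p95_latency_ms": 300}

report = {
    "availability": availability >= goals["availability"],
    "latency_p95": p95_latency_ms <= goals["p95_latency_ms"],
}
print(availability, p95_latency_ms, report)
```

With these samples, one failed and slow request out of ten is enough to miss both goals, which is the kind of signal a dashboard or alert would surface.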

Types of Metrics

Here are the most common categories of metrics in software systems:

System Metrics

CPU usage (e.g., cpu_usage_percent)
Memory usage (e.g., memory_free_bytes)