Building Observable Systems: Monitoring and Observability Patterns
Observability is more than monitoring: it is the ability to understand a system's internal state from its external outputs. Here's how to build observable systems.
The Three Pillars
1. Metrics (Prometheus)
Collect and query time-series data about system behavior.
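To make "time-series data" concrete: Prometheus collects metrics by scraping an HTTP endpoint that returns samples as plain text. The sketch below renders one counter in that text format using only the standard library. The metric name, the label names, and the `render_counter` helper are illustrative assumptions, and a real service would normally rely on an official client library instead of formatting output by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter's value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts keyed by (method, status) labels.
requests_total = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "GET"), ("status", "500")): 3,
}

exposition = render_counter(
    "http_requests_total", "Total HTTP requests handled.", requests_total
)
print(exposition)
```

Each distinct label combination becomes its own time series once scraped, which is why label values should stay low in cardinality.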
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
2. Logs (Loki)
Aggregate and query logs from all services.
import logging
from pythonjsonlogger import jsonlogger
# Structured logging for better querying
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING, which would drop info() calls
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
# user and duration come from the surrounding request handler
logger.info("Request processed", extra={
    "user_id": user.id,
    "duration_ms": duration,
    "status_code": 200
})
3. Traces (Jaeger/Tempo)
Track requests across service boundaries.
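from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Without a configured provider the tracer is a no-op. This setup exports over OTLP,
# which both Jaeger and Tempo accept, in place of the legacy Jaeger exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order"):
    # Your business logic here
    order = create_order(data)
    # Child span: nested under process_order in the resulting trace
    with tracer.start_as_current_span("payment"):
        payment = process_payment(order)
Key Metrics to Track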
Golden Signals (SRE)
- Latency: Response time
- Traffic: Request volume
- Errors: Failure rate
- Saturation: Resource utilization
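To make the four signals concrete, the sketch below computes them from raw request records for a single time window. It is a simplified illustration, not how Prometheus derives them: the `RequestRecord` type, the worker counts used as a stand-in for saturation, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    status_code: int


def golden_signals(records, window_seconds, busy_workers, total_workers):
    """Summarize the four golden signals for one non-empty observation window."""
    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank p95, computed with integer math: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(records) / window_seconds,
        "error_ratio": errors / len(records),
        "saturation": busy_workers / total_workers,
    }


# Hypothetical 10-second window: 18 fast successes, one slow success, one failure.
records = (
    [RequestRecord(100.0, 200)] * 18
    + [RequestRecord(250.0, 200), RequestRecord(900.0, 500)]
)
signals = golden_signals(records, window_seconds=10, busy_workers=6, total_workers=8)
print(signals)
```

Latency is reported as a percentile, not a mean, because a few slow requests can hide behind a healthy-looking average.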
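# Example Prometheus queries
# Traffic: requests per second
rate(http_requests_total[5m])
# Latency: 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Errors: fraction of requests that returned a 5xx status (summed so both sides share one label set)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Grafana Dashboard Best Practices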
Dashboard Organization
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"my-service\"}[5m])"
          }
        ]
      }
    ]
  }
}
Variables for Flexibility
Use dashboard variables to make dashboards reusable across environments and services.
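As a rough illustration, the sketch below generates a dashboard definition in which the hard-coded job name is replaced by a `$service` variable. The `templating` section and the `label_values()` variable query follow Grafana's dashboard JSON model and Prometheus data source as I understand them. The `build_dashboard` helper is hypothetical, and a real exported dashboard carries many more fields (data source, refresh settings, panel layout).

```python
import json


def build_dashboard(metric="http_requests_total"):
    """Build a dashboard dict whose panel query is parameterized by $service."""
    return {
        "dashboard": {
            "title": "Service Overview",
            "templating": {
                "list": [
                    {
                        "name": "service",
                        "type": "query",
                        # Variable query that lists the values of the job label.
                        "query": f"label_values({metric}, job)",
                    }
                ]
            },
            "panels": [
                {
                    "title": "Request Rate",
                    "targets": [
                        # $service is substituted by Grafana when the panel renders.
                        {"expr": f'rate({metric}{{job="$service"}}[5m])'}
                    ],
                }
            ],
        }
    }


dashboard_json = json.dumps(build_dashboard(), indent=2)
print(dashboard_json)
```

With a variable in place, one dashboard serves every service that exposes the metric, instead of one copy per service.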
Alerting Strategy
Define meaningful alerts that require action.
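# Prometheus alerting rules file (loaded via rule_files in prometheus.yml); Alertmanager only routes the resulting alerts
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
Implementation Checklist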
- Deploy Prometheus for metrics collection
- Set up Grafana for visualization
- Configure Loki for log aggregation
- Implement distributed tracing
- Define SLIs/SLOs for your services
- Create runbooks for common alerts
- Test alert routing and escalation
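For the SLI/SLO item in the checklist above, a common way to turn an SLO into something actionable is an error budget: the number of failures the target still permits. The sketch below shows the arithmetic, assuming a request-based availability SLI. The `error_budget` helper and all figures are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the failures the SLO allows and the fraction of that budget spent."""
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical month: 99.9% availability target, one million requests, 250 failures.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(budget)
```

With these numbers the budget is about 1,000 failed requests, of which roughly a quarter is spent. Alerting on how fast the budget is consumed ties alerts back to user impact.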
Conclusion
Observability is an investment that pays dividends during incidents and performance optimization. Start with the golden signals, add structured logging, and progressively enhance your observability stack.
The goal is not to collect all possible data, but to collect the right data that enables quick diagnosis and resolution of issues.