Building Observable Systems: Monitoring and Observability Patterns

January 20, 2025

#monitoring#observability#prometheus#grafana#sre

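Observability is more than monitoring: it is the ability to understand your system's internal state from its external outputs. This post covers the three pillars (metrics, logs, and traces), the key metrics to track, and dashboard and alerting practices for building observable systems.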

The Three Pillars

1. Metrics (Prometheus)

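Collect and query time-series data about system behavior. The configuration below scrapes targets every 15 seconds and keeps only the pods whose prometheus.io/scrape annotation is set to "true".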

yaml

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
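
The `keep` rule above drops every discovered pod unless its scrape annotation matches the regex. The following is a simplified model of that filtering step, not Prometheus's own implementation; the function name `keep_target` and the sample pods are made up for illustration. It mirrors two behaviors worth knowing: a missing label is treated as an empty string, and the regex has to match the whole value.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep_target(labels, source_labels, regex, separator=";"):
    """Simplified model of a relabel rule with action: keep."""
    # Missing labels count as empty strings; values are joined with the separator.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors relabel regexes, so the whole value has to match.
    return re.fullmatch(regex, value) is not None


# Hypothetical discovered pods: only the first is annotated for scraping.
pods = [
    {"pod": "api-1", SCRAPE_LABEL: "true"},
    {"pod": "batch-1", SCRAPE_LABEL: "false"},
    {"pod": "legacy-1"},  # no annotation at all
]
kept = [p["pod"] for p in pods if keep_target(p, [SCRAPE_LABEL], "true")]
print(kept)
```

A pod that never sets the annotation is dropped just like one that sets it to "false", which is why this pattern is opt-in per pod.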

2. Logs (Loki)

Aggregate and query logs from all services.

python

import logging
from pythonjsonlogger import jsonlogger

# Structured logging for better querying
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING, which would drop info()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)

# user and duration come from the surrounding request handler
logger.info("Request processed", extra={
    "user_id": user.id,
    "duration_ms": duration,
    "status_code": 200
})
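
If adding a dependency is not an option, the same idea can be sketched with the standard library alone. This is a minimal illustration rather than a replacement for python-json-logger: the class name `StdlibJsonFormatter`, the logger name, and the field values are assumptions made for the example. It writes to an in-memory stream so the output can be parsed back and inspected.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else was passed via `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class StdlibJsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StdlibJsonFormatter())

log = logging.getLogger("json_sketch")  # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info(
    "Request processed",
    extra={"user_id": 42, "duration_ms": 12.5, "status_code": 200},
)
entry = json.loads(stream.getvalue())
print(entry)
```

Because every field is a top-level JSON key, a log backend can filter on `status_code` or `user_id` without regex parsing of the message text.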

3. Traces (Jaeger/Tempo)

Track requests across service boundaries.

python

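from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Without a configured provider, spans are no-ops. Export over OTLP, which both
# Jaeger and Tempo accept (the exporter defaults to localhost:4317).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order"):
    # Your business logic here
    order = create_order(data)
    # Child span, nested under process_order in the trace view
    with tracer.start_as_current_span("payment"):
        payment = process_payment(order)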
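
Crossing a service boundary works by passing the trace identity along with the request. The sketch below shows the shape of the W3C Trace Context `traceparent` header (version, trace ID, parent span ID, flags) using only the standard library. The helper names `new_traceparent` and `child_traceparent` are hypothetical; in practice OpenTelemetry's propagators and instrumentation libraries inject and extract this header for you.

```python
import secrets


def new_traceparent():
    """Start a trace: version 00, random trace and span IDs, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Continue the caller's trace with a fresh span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A starts the trace and sends the header with its outgoing request.
outgoing_headers = {"traceparent": new_traceparent()}

# Service B reads the header and creates its own span in the same trace.
downstream = child_traceparent(outgoing_headers["traceparent"])

print(outgoing_headers["traceparent"])
print(downstream)
```

The shared trace ID is what lets the tracing backend stitch spans from different services into a single request timeline.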

Key Metrics to Track

Golden Signals (SRE)

Latency: Response time
Traffic: Request volume
Errors: Failure rate
Saturation: Resource utilization

promql

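# Example Prometheus queries
# Traffic: requests per second, per series
rate(http_requests_total[5m])
# Latency: 95th percentile, with buckets aggregated by le
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Errors: share of all requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))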
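
It helps to know that `histogram_quantile` returns an estimate: it finds the bucket that contains the requested rank and interpolates linearly inside it. The function below is a simplified model of that idea, not Prometheus's exact implementation, and the bucket bounds and counts are invented for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical latency histogram: cumulative counts per upper bound in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(p95)
```

The estimate can never be more precise than the bucket layout, so bucket boundaries are worth placing close to your latency targets.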

Grafana Dashboard Best Practices

Dashboard Organization

json

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"my-service\"}[5m])"
          }
        ]
      }
    ]
  }
}

Variables for Flexibility

Use dashboard variables to make dashboards reusable across environments and services.
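
As a rough sketch of what that looks like, the snippet below builds a trimmed-down dashboard definition in which the panel query refers to a variable instead of a hard-coded job name. The variable name `service` and its `label_values` query are assumptions for illustration, and a dashboard exported from Grafana carries many more fields (data source, panel type, layout) than shown here.

```python
import json

# Hypothetical variable named "service", populated from the values of the job label.
dashboard = {
    "title": "Service Overview",
    "templating": {
        "list": [
            {
                "name": "service",
                "type": "query",
                "query": "label_values(http_requests_total, job)",
            }
        ]
    },
    "panels": [
        {
            "title": "Request Rate",
            "targets": [
                # $service is replaced with the value picked in the dashboard dropdown.
                {"expr": 'rate(http_requests_total{job="$service"}[5m])'}
            ],
        }
    ],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

With the variable in place, one dashboard serves every service that exposes the same metric names, rather than one copy per service.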

Alerting Strategy

Define meaningful alerts that require action.

yaml

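# alert_rules.yml (a Prometheus rule file loaded via rule_files, not the Alertmanager config)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"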
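
The `for: 5m` clause in the rule above is what keeps short spikes from paging anyone: the condition has to hold continuously for the whole duration before the alert fires. The following is a simplified model of that pending-to-firing behavior, with a made-up function name and sample data; it is not how Prometheus is implemented internally.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each (timestamp, value) sample."""
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            held_for = ts - pending_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            pending_since = None  # any dip resets the timer
            state = "inactive"
        yield ts, state


# Hypothetical error ratios sampled every 60 s: a brief spike at t=60,
# then a sustained problem starting at t=180.
samples = [
    (0, 0.01), (60, 0.08), (120, 0.02), (180, 0.07), (240, 0.09),
    (300, 0.08), (360, 0.07), (420, 0.06), (480, 0.06),
]
states = list(alert_states(samples, threshold=0.05, for_seconds=300))
for ts, state in states:
    print(ts, state)
```

In this run the spike at t=60 only reaches the pending state, while the sustained breach fires once it has lasted the full five minutes.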

Implementation Checklist

Deploy Prometheus for metrics collection
Set up Grafana for visualization
Configure Loki for log aggregation
Implement distributed tracing
Define SLIs/SLOs for your services
Create runbooks for common alerts
Test alert routing and escalation
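
For the SLI/SLO item, the arithmetic behind an error budget is short enough to sketch. The function names, the 99.9% target, and the request counts below are illustrative assumptions, not recommendations.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget that is still unspent."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


downtime_budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(round(downtime_budget, 1))
print(round(remaining, 2))
```

A 99.9% target over 30 days leaves about 43.2 minutes of downtime, and 250 failures out of a million requests spends a quarter of the request-based budget.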

Observability is an investment that pays dividends during incidents and performance optimization. Start with the golden signals, add structured logging, and progressively enhance your observability stack.

The goal is not to collect all possible data, but to collect the right data that enables quick diagnosis and resolution of issues.