Building Observable Systems: Monitoring and Observability Patterns

Observability is more than monitoring: it is the ability to understand your system's internal state from its external outputs. Here is a practical starting point, built on Prometheus, Loki, Jaeger or Tempo, and Grafana.

The Three Pillars

1. Metrics (Prometheus)

Collect and query time-series data about system behavior.

yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
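
To make the `keep` rule above concrete, here is a rough Python model of how the filtering behaves. It is not Prometheus code: the `keep_targets` helper and the sample pod names are hypothetical, and the sketch assumes the documented relabeling behavior, where source label values are joined with `;` and the regex must match the whole joined value.

```python
import re

def keep_targets(targets, source_labels, regex, separator=";"):
    """Return only the targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A label that is absent is treated as an empty string
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept

# Hypothetical discovered pods, one label set per pod
pods = [
    {"__meta_kubernetes_pod_name": "api-1",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"},
    {"__meta_kubernetes_pod_name": "batch-1",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "false"},
    {"__meta_kubernetes_pod_name": "legacy-1"},  # no scrape annotation at all
]

scraped = keep_targets(
    pods,
    ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
    "true",
)
print([p["__meta_kubernetes_pod_name"] for p in scraped])
```

Only pods that opt in through the `prometheus.io/scrape: "true"` annotation survive the rule, which keeps scraping opt-in rather than opt-out.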

2. Logs (Loki)

Aggregate and query logs from all services.

python
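import logging
from pythonjsonlogger import jsonlogger

# Structured (JSON) logging makes individual fields queryable in Loki
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING, which would drop info()

log_handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s")
log_handler.setFormatter(formatter)
logger.addHandler(log_handler)

# user_id and duration_ms are placeholder values; take them from your request context
logger.info("Request processed", extra={
    "user_id": 42,
    "duration_ms": 12.5,
    "status_code": 200
})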
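
If you want to see what a JSON formatter does under the hood, or cannot add a dependency, the same idea can be sketched with the standard library alone. The `StdlibJsonFormatter` class and the `demo.orders` logger name are hypothetical, and this minimal version skips timestamps and exception formatting.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else arrived through `extra=`
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}

class StdlibJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including `extra` fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-serializable values from raising
        return json.dumps(payload, default=str)

# Write to an in-memory stream so the output can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StdlibJsonFormatter())

demo_logger = logging.getLogger("demo.orders")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info(
    "Request processed",
    extra={"user_id": 42, "duration_ms": 12.5, "status_code": 200},
)
entry = json.loads(stream.getvalue())
print(entry)
```

Because every field is a top-level JSON key, a log query can filter on `status_code` or `user_id` directly instead of pattern-matching free text.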

3. Traces (Jaeger/Tempo)

Track requests across service boundaries.

python
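from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans over OTLP, which Jaeger and Tempo both accept; the endpoint is an assumed local address
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order"):
    # Your business logic here
    order = create_order(data)
    with tracer.start_as_current_span("payment"):
        payment = process_payment(order)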
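
Tracking a request across service boundaries depends on passing trace context between services, usually in an HTTP header. The sketch below builds and forwards a W3C Trace Context `traceparent` value using only the standard library. The helper names are hypothetical, and it deliberately ignores `tracestate` and the rule that all-zero IDs are invalid; in practice the OpenTelemetry SDK handles propagation for you.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and fresh span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace instead
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

upstream = new_traceparent()              # header sent by the first service
downstream = child_traceparent(upstream)  # header sent on the next outgoing call
print(upstream)
print(downstream)
```

The shared trace ID is what lets a tracing backend stitch the `process_order` and `payment` spans into a single trace even when they run in different services.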

Key Metrics to Track

Golden Signals (SRE)

  1. Latency: Response time
  2. Traffic: Request volume
  3. Errors: Failure rate
  4. Saturation: Resource utilization

promql
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Latency: 95th percentile request duration
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Errors: fraction of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
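
It helps to know that `histogram_quantile` returns an estimate: it finds the bucket containing the requested rank and interpolates linearly inside it. The following Python sketch approximates that logic under simplifying assumptions (cumulative buckets sorted by upper bound, lowest bucket starting at 0, last bucket `+Inf`). The bucket counts are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[i - 1][0]
            in_bucket = count - prev_count
            # Linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative request counts per latency bucket (seconds)
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.9, latency_buckets))
```

The practical consequence is that quantile accuracy depends on bucket boundaries, so it is worth placing boundaries near the latencies your SLOs care about.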

Grafana Dashboard Best Practices

Dashboard Organization

json
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"my-service\"}[5m])"
          }
        ]
      }
    ]
  }
}

Variables for Flexibility

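Use dashboard variables (for example `$service` or `$environment`) in panel queries instead of hard-coded label values, so that one dashboard can be reused across environments and services.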
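
As an illustration, the sketch below generates a dashboard definition in which the panel query references a `$service` variable rather than a fixed job name. The `build_dashboard` helper is hypothetical, and the field names follow my understanding of Grafana's dashboard JSON model; real variable definitions usually carry more fields, such as the datasource and refresh behavior.

```python
import json

def build_dashboard(title, metric="http_requests_total"):
    """Return a dashboard dict whose panel query uses a $service variable."""
    return {
        "dashboard": {
            "title": title,
            "templating": {
                "list": [
                    {
                        # Populates the dropdown from the metric's job label values
                        "name": "service",
                        "type": "query",
                        "query": f"label_values({metric}, job)",
                    }
                ]
            },
            "panels": [
                {
                    "title": "Request Rate",
                    "targets": [
                        # Doubled braces emit literal { } inside the f-string
                        {"expr": f'sum(rate({metric}{{job="$service"}}[5m]))'}
                    ],
                }
            ],
        }
    }

dashboard = build_dashboard("Service Overview")
print(json.dumps(dashboard, indent=2))
```

Generating dashboards from code like this also makes them easy to review and keep in version control.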

Alerting Strategy

Define alerts only for conditions that require someone to act.

yaml
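# alert_rules.yml (a Prometheus rule file loaded via rule_files, not Alertmanager config)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"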
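
The `for: 5m` clause is what keeps short spikes from paging anyone: the condition has to hold at every evaluation for the full duration before the alert fires. Here is a simplified Python model of that state machine. The `alert_states` function and the sample values are hypothetical, and a real Prometheus server tracks this per label set.

```python
def alert_states(samples, threshold, for_seconds):
    """Map (timestamp_seconds, value) evaluations to alert states.

    An alert is "pending" once the threshold is breached and becomes
    "firing" only after the breach has lasted at least for_seconds.
    """
    states = []
    breach_started = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts
            held_for = ts - breach_started
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer
            breach_started = None
            states.append("inactive")
    return states

# One evaluation per minute; a brief spike, then a sustained breach
error_ratios = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07]
samples = [(i * 60, ratio) for i, ratio in enumerate(error_ratios)]

for (ts, ratio), state in zip(samples, alert_states(samples, 0.05, 300)):
    print(f"t={ts:>3}s ratio={ratio:.2f} {state}")
```

The two-minute spike never leaves the pending state, while the sustained breach fires once it has lasted five minutes.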

Implementation Checklist

  • Deploy Prometheus for metrics collection
  • Set up Grafana for visualization
  • Configure Loki for log aggregation
  • Implement distributed tracing
  • Define SLIs/SLOs for your services
  • Create runbooks for common alerts
  • Test alert routing and escalation
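
For the SLI/SLO item above, the underlying arithmetic is small enough to sketch. This example assumes a request-based availability SLI, and the `error_budget` helper and traffic numbers are hypothetical: a 99.9% target over one million requests allows roughly 1,000 failures, and the budget shrinks as failures accumulate.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize an availability SLO as an SLI plus remaining error budget."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        # Fraction of requests that succeeded
        "sli": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        # 1.0 means untouched budget; a negative value means the SLO is violated
        "budget_remaining_fraction": remaining / allowed_failures,
    }

report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

Tracking the remaining budget gives alerts and release decisions a shared, measurable basis.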

Conclusion

Observability is an investment that pays dividends during incidents and performance optimization. Start with the golden signals, add structured logging, and progressively enhance your observability stack.

The goal is not to collect all possible data, but to collect the right data that enables quick diagnosis and resolution of issues.
