Production monitoring, observability, SLO/SLI management, and incident response.
Trigger terms: monitoring, observability, SRE, site reliability, alerting, incident response,
SLO, SLI, error budget, Prometheus, Grafana, Datadog, New Relic, ELK stack, logs, metrics,
traces, on-call, production monitoring, health checks, uptime, availability, dashboards,
post-mortem, incident management, runbook.
Completes SDD Stage 8 (Monitoring) with comprehensive production observability:
- SLI/SLO definitions and tracking
- Monitoring stack setup (Prometheus, Grafana, ELK, Datadog, etc.)
- Alert rules and notification channels
- Incident response runbooks
- Observability dashboards (logs, metrics, traces)
- Post-mortem templates and analysis
- Health check endpoints
- Error budget tracking
Use when: the user needs production monitoring, an observability platform, alerting, SLOs,
incident response, or post-deployment health tracking.
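
The SLI/SLO and error budget tracking listed above comes down to simple arithmetic, sketched below for a request-based availability SLO. The names `ErrorBudgetReport` and `error_budget_report` are hypothetical and belong to no particular monitoring tool. The request counts are assumed to come from whichever metrics backend is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetReport:
    """Hypothetical container for one SLO window's figures."""
    sli: float               # measured success ratio in the window
    allowed_failures: float  # failures the SLO tolerates in the window
    budget_consumed: float   # fraction of the error budget already spent
    budget_remaining: float  # fraction left; negative when overspent


def error_budget_report(slo_target: float,
                        total_requests: int,
                        failed_requests: int) -> ErrorBudgetReport:
    """Compute the SLI and error budget usage for a request-based SLO.

    Assumes slo_target is a ratio such as 0.999 (not a percentage) and
    that both counts cover the same SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be within [0, total_requests]")

    # SLI: share of requests that succeeded.
    sli = 1.0 - failed_requests / total_requests
    # Error budget: the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return ErrorBudgetReport(
        sli=sli,
        allowed_failures=allowed_failures,
        budget_consumed=consumed,
        budget_remaining=1.0 - consumed,
    )


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over one million requests.
    print(error_budget_report(0.999, 1_000_000, 250))
```

A negative `budget_remaining` means the window has already exceeded its budget, which is the kind of condition an alert rule would typically key on.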