🚨 The 3 AM Wake-Up Call
Your Kubernetes cluster is down. Users are angry. You have no idea what happened. This is why Fortune 500 companies spend $500K/year on observability.
The Complete Observability Stack
📊 The Three Pillars
- Metrics – What’s happening right now (Prometheus)
- Logs – What happened in the past (Loki)
- Traces – Why requests are slow (Jaeger)
⚡ Quick Install (Helm)
# Add the Prometheus community chart repo and refresh the local index
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
# Install the Prometheus + Grafana stack into its own namespace
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
# Get the Grafana admin password (stored base64-encoded in a Secret)
kubectl get secret -n monitoring monitoring-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode
# Port-forward Grafana to http://localhost:3000
kubectl port-forward -n monitoring \
  svc/monitoring-grafana 3000:80
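Result: Prometheus, Alertmanager, and Grafana running in about 5 minutes, with a set of pre-built Kubernetes dashboards included. Loki (logs) and Jaeger (traces) are separate installs.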
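The quick install runs on chart defaults. With those defaults Prometheus typically has no persistent volume, so metrics may not survive a pod restart. Here is a minimal sketch of a values file that sets retention and storage. The file name, the 15d retention, and the 50Gi size are illustrative assumptions, not recommendations from the chart.

```shell
# Write a small values file for kube-prometheus-stack.
# File name, retention period, and volume size are assumptions for illustration.
cat > monitoring-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    retention: 15d              # how long Prometheus keeps samples
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi     # size of the persistent volume claim
EOF
```

To use it, add `-f monitoring-values.yaml` to the `helm install` command (or to a later `helm upgrade`). Check that your cluster has a default StorageClass first, otherwise the claim will stay pending.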
🎯 Critical Alerts
- Pod restarts > 5 in 10 min
- Memory > 90% used
- CPU throttling detected
- Disk > 85% full
- Node not ready > 2 min
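The alerts listed above can be sketched as a `PrometheusRule` for the Prometheus Operator. Treat this as a starting point, not a tuned rule set. Assumptions: the Helm release is named `monitoring` (the `release` label is what the operator usually selects rules on), "memory" and "disk" are read as node-level, and the 25% throttling ratio and the `for:` durations other than the 2-minute node check are made-up thresholds. The metric names come from kube-state-metrics, node-exporter, and cAdvisor, which the stack installs by default.

```shell
# Write the alert rules to a file; review it, then apply it with kubectl.
cat > critical-alerts.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: critical-alerts
  namespace: monitoring
  labels:
    release: monitoring   # assumption: matches the Helm release name
spec:
  groups:
    - name: critical.rules
      rules:
        - alert: PodRestartingFrequently
          expr: |
            increase(kube_pod_container_status_restarts_total[10m]) > 5
          labels:
            severity: critical
        - alert: NodeMemoryHigh
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
          for: 5m
          labels:
            severity: critical
        - alert: ContainerCPUThrottled
          # 0.25 is an assumed threshold: more than 25% of CPU periods throttled
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 10m
          labels:
            severity: warning
        - alert: NodeDiskAlmostFull
          expr: |
            (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
               / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
          for: 5m
          labels:
            severity: critical
        - alert: NodeNotReady
          expr: |
            kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 2m
          labels:
            severity: critical
EOF
```

Apply it with `kubectl apply -f critical-alerts.yaml`. The chart already ships similar built-in rules, so check the Alerts page in Prometheus for overlap before adding your own.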
📈 Key Metrics
- Request rate (QPS)
- Error rate (4xx, 5xx)
- Latency (p50, p95, p99)
- Saturation (CPU, RAM)
- Availability (uptime %)
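Each of these metrics maps to a PromQL query. The sketch below keeps them as shell variables and adds a small helper that calls the Prometheus HTTP API. Assumptions: Prometheus is port-forwarded to `localhost:9090`, and your services expose `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket`. Those two are hypothetical application metrics, so substitute whatever your instrumentation actually emits. The node metrics come from node-exporter.

```shell
# Assumed setup: Prometheus reachable locally, for example via
#   kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# (confirm the service name with: kubectl get svc -n monitoring)
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query against the Prometheus HTTP API and print the JSON reply.
promq() {
  curl -sS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Request rate (QPS), averaged over 5 minutes. Hypothetical app metric.
REQUEST_RATE='sum(rate(http_requests_total[5m]))'

# Error rate: fraction of requests answered with a 4xx or 5xx status.
ERROR_RATE='sum(rate(http_requests_total{status=~"[45].."}[5m])) / sum(rate(http_requests_total[5m]))'

# Latency percentiles from a histogram; swap 0.99 for 0.50 or 0.95 as needed.
LATENCY_P99='histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'

# Saturation: cluster-wide CPU and memory in use, as a fraction of capacity.
CPU_SATURATION='1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'
MEM_SATURATION='1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)'

# Availability: fraction of successful scrapes per target over the last 24 hours.
AVAILABILITY='avg_over_time(up[24h])'
```

With the port-forward running, `promq "$ERROR_RATE"` returns the current value as JSON. The same expressions can be pasted into a Grafana panel.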
🔥 Production War Stories
Pod memory slowly climbing. Grafana showed it weeks before the crash. We ignored it. Pod OOMKilled during Black Friday. Lost $50K in sales. Now we have alerts.
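A slow climb like this is the kind of trend `predict_linear` can catch early. Here is a rough sketch of such an alert, not the rule we actually run: it extrapolates each container's memory from the last day and fires if the projection crosses the container's memory limit within three days. The 1-day window, 3-day horizon, `for: 1h`, and the `release: monitoring` label are all assumptions, and containers without a memory limit are not covered.

```shell
# Write a leak-detection rule to a file; review it, then apply it with kubectl.
cat > memory-leak-alert.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-leak-alert
  namespace: monitoring
  labels:
    release: monitoring   # assumption: matches the Helm release name
spec:
  groups:
    - name: memory-trend.rules
      rules:
        - alert: ContainerMemoryTrendingToLimit
          # Fit a line to the last 1d of usage and project 259200s (3 days) ahead.
          # Both sides are aggregated to the same labels so they match one-to-one.
          expr: |
            max by (namespace, pod, container) (
              predict_linear(container_memory_working_set_bytes{container!=""}[1d], 259200)
            )
            >
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{resource="memory"}
            )
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} may hit its memory limit within 3 days"
EOF
```

Apply it with `kubectl apply -f memory-leak-alert.yaml`. Linear extrapolation is noisy for workloads with daily traffic cycles, so expect to tune the window before trusting it.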
Request spike alert fired. Dashboard showed 10,000% traffic increase from one IP range. Blocked at CDN level. Attack neutralized before impacting users. Total downtime: 0 seconds.
| Dashboard | Purpose | When to Check |
|---|---|---|
| Cluster Overview | Overall health | Daily morning check |
| Node Metrics | Hardware utilization | Capacity planning |
| Pod Metrics | Application performance | During deployments |
| API Server | Kubernetes health | When things feel slow |
“Before Prometheus: We found out about outages from angry customers. After Prometheus: We fix issues before users notice them.”
