
# SRE Monitoring: Building Effective Alerting Systems
After years of being on-call and managing production systems, I've learned that effective monitoring isn't about having the most alerts; it's about having the right ones. Here's my approach to building monitoring systems that actually help rather than overwhelm your team.
## The Golden Signals: Start with What Matters
Google’s SRE book introduced the concept of the Four Golden Signals, and they remain the best starting point for any monitoring strategy:
- **Latency**: how long requests take
- **Traffic**: how much demand is on your system
- **Errors**: rate of failed requests
- **Saturation**: how "full" your service is
### Implementing Golden Signals with CloudWatch
Here’s how I implement these signals for a typical web application on AWS:
```yaml
# CloudWatch Alarms for Golden Signals
Resources:
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-high-latency"
      AlarmDescription: "Average response time exceeds threshold"
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 1.0  # 1 second
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: LoadBalancer
          # The AWS/ApplicationELB dimension expects the full name (app/<name>/<id>), not the ARN
          Value: !GetAtt ApplicationLoadBalancer.LoadBalancerFullName

  ErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-error-rate"
      AlarmDescription: "5XX error count exceeds threshold"
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: LoadBalancer
          Value: !GetAtt ApplicationLoadBalancer.LoadBalancerFullName
```
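The template above covers latency and errors. For completeness, here's a hedged sketch of the other two signals that slots into the same `Resources:` section: traffic via the load balancer's `RequestCount`, and saturation via fleet CPU, assuming the targets run in an EC2 Auto Scaling group (`WebAutoScalingGroup` is a placeholder resource name, and both thresholds should come from your own data):

```yaml
  # Traffic: how much demand is hitting the service
  TrafficSpikeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-traffic-spike"
      AlarmDescription: "Request volume well above normal levels"
      MetricName: RequestCount
      Namespace: AWS/ApplicationELB
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 50000  # set from historical traffic data
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: LoadBalancer
          Value: !GetAtt ApplicationLoadBalancer.LoadBalancerFullName

  # Saturation: how "full" the fleet is (CPU here; pick whatever resource saturates first)
  HighSaturationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-high-saturation"
      AlarmDescription: "Fleet CPU utilization approaching capacity"
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref WebAutoScalingGroup  # placeholder resource
```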
## The Art of Alert Thresholds

Setting good thresholds is more art than science. Here's my approach:

### 1. Use Historical Data
Don’t guess—analyze your actual metrics:
```python
# Python script to analyze historical latency patterns
import boto3
import pandas as pd
from datetime import datetime, timedelta

def analyze_latency_patterns(metric_name, days=30, dimensions=None):
    cloudwatch = boto3.client('cloudwatch')
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=days)

    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ApplicationELB',
        MetricName=metric_name,
        Dimensions=dimensions or [],  # scope the query to a specific load balancer
        StartTime=start_time,
        EndTime=end_time,
        # GetMetricStatistics returns at most 1,440 datapoints per call,
        # so use a 1-hour period for a 30-day window
        Period=3600,
        Statistics=['Average', 'Maximum']
    )

    df = pd.DataFrame(response['Datapoints'])

    # Calculate percentiles for threshold setting
    p95 = df['Average'].quantile(0.95)
    p99 = df['Average'].quantile(0.99)

    print(f"95th percentile: {p95:.2f}")
    print(f"99th percentile: {p99:.2f}")
    print(f"Recommended alert threshold: {p95 * 1.2:.2f}")
```
### 2. Implement Gradual Escalation
Not every issue needs to wake someone up at 3 AM:
```hcl
# Terraform example for graduated alerting
resource "aws_sns_topic" "alerts_warning" {
  name = "application-alerts-warning"
}

resource "aws_sns_topic" "alerts_critical" {
  name = "application-alerts-critical"
}

# Warning alert - Slack notification
resource "aws_cloudwatch_metric_alarm" "latency_warning" {
  alarm_name          = "latency-warning"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Average"
  threshold           = 0.5
  alarm_description   = "Response time elevated"
  alarm_actions       = [aws_sns_topic.alerts_warning.arn]
  treat_missing_data  = "notBreaching"
}

# Critical alert - PagerDuty notification
resource "aws_cloudwatch_metric_alarm" "latency_critical" {
  alarm_name          = "latency-critical"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Average"
  threshold           = 2.0
  alarm_description   = "Response time critically high"
  alarm_actions       = [aws_sns_topic.alerts_critical.arn]
  treat_missing_data  = "breaching"
}
```
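The topics above still need subscribers. A minimal sketch, assuming you use PagerDuty's SNS/CloudWatch integration (the endpoint URL is a placeholder) and a Slack-posting Lambda function, `aws_lambda_function.slack_notifier`, that is assumed to be defined elsewhere in the configuration:

```hcl
# Critical alerts page someone via PagerDuty's CloudWatch integration URL (placeholder below)
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn              = aws_sns_topic.alerts_critical.arn
  protocol               = "https"
  endpoint               = "https://events.pagerduty.com/integration/PLACEHOLDER/enqueue"
  endpoint_auto_confirms = true
}

# Warnings go to a Slack-posting Lambda (assumed to exist elsewhere)
resource "aws_sns_topic_subscription" "slack" {
  topic_arn = aws_sns_topic.alerts_warning.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}
```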
## Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

SLOs give you objective targets for your monitoring:

### Defining SLIs
```python
# Example SLI calculations
def calculate_availability_sli(total_requests, error_requests):
    """Calculate availability SLI as percentage of successful requests"""
    if total_requests == 0:
        return 100.0
    success_rate = ((total_requests - error_requests) / total_requests) * 100
    return round(success_rate, 2)

def calculate_latency_sli(response_times, threshold_ms=1000):
    """Calculate latency SLI as percentage of requests under threshold"""
    if not response_times:
        return 100.0
    under_threshold = sum(1 for rt in response_times if rt < threshold_ms)
    percentage = (under_threshold / len(response_times)) * 100
    return round(percentage, 2)

# Example usage (response_times is a list of request latencies in milliseconds,
# collected from your logs or load balancer metrics)
availability_sli = calculate_availability_sli(10000, 50)   # 99.5%
latency_sli = calculate_latency_sli(response_times, 1000)  # e.g., 99.2%
```
### Error Budget Monitoring
```python
def calculate_error_budget_burn_rate(current_sli, slo_target, window_hours=24):
    """
    Calculate how fast we're burning through our error budget
    """
    error_rate = 100 - current_sli
    allowed_error_rate = 100 - slo_target

    if allowed_error_rate == 0:
        return float('inf') if error_rate > 0 else 0

    burn_rate = error_rate / allowed_error_rate

    # Project how long until budget is exhausted
    hours_to_exhaustion = window_hours / burn_rate if burn_rate > 0 else float('inf')

    return {
        'burn_rate': burn_rate,
        'hours_to_exhaustion': hours_to_exhaustion,
        'alert_level': 'critical' if burn_rate > 10 else 'warning' if burn_rate > 2 else 'ok'
    }
```
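A quick worked example, assuming a 99.9% availability SLO and a measured SLI of 99.4% over the last 24 hours:

```python
status = calculate_error_budget_burn_rate(current_sli=99.4, slo_target=99.9)
# error_rate = 0.6 and allowed_error_rate = 0.1, so burn_rate is roughly 6:
# the 24-hour budget is gone in about 24 / 6 = 4 hours -> 'warning'
print(status['alert_level'])
```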
## Reducing Alert Fatigue

The biggest enemy of effective monitoring is alert fatigue. Here's how to combat it:

### 1. Alert on Symptoms, Not Causes

❌ Bad: alert on high CPU usage

✅ Good: alert on high response times
```yaml
# Focus on user-impacting metrics
HighResponseTimeAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "user-experience-degraded"
    AlarmDescription: "Users experiencing slow response times"
    MetricName: TargetResponseTime
    # ... rest of configuration

# CPU alerts should be informational, not urgent
HighCPUInformation:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "high-cpu-investigation-needed"
    AlarmDescription: "High CPU - investigate for optimization"
    TreatMissingData: notBreaching
    # Send to a different, non-urgent channel
```
### 2. Implement Alert Suppression

```python
import time

# Simple alert suppression logic
class AlertManager:
    def __init__(self):
        self.suppressed_alerts = {}
        self.suppression_duration = 3600  # 1 hour

    def should_send_alert(self, alert_name, current_time):
        if alert_name in self.suppressed_alerts:
            last_sent = self.suppressed_alerts[alert_name]
            if current_time - last_sent < self.suppression_duration:
                return False
        self.suppressed_alerts[alert_name] = current_time
        return True

    def send_alert_if_needed(self, alert_name, message):
        current_time = time.time()
        if self.should_send_alert(alert_name, current_time):
            self.send_to_pagerduty(message)
            return True
        return False

    def send_to_pagerduty(self, message):
        # Placeholder: wire this up to your paging integration
        print(f"PAGE: {message}")
```
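Usage looks something like this (the alert name and message are illustrative):

```python
alerts = AlertManager()
alerts.send_alert_if_needed('high-latency', 'p95 latency above 2s')  # sends, returns True
alerts.send_alert_if_needed('high-latency', 'still above 2s')        # suppressed, returns False
```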
### 3. Use Composite Alarms
AWS CloudWatch Composite Alarms let you combine multiple conditions:
```yaml
CompositeServiceDegradedAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: "service-degraded"
    AlarmRule: !Sub |
      (ALARM("${HighLatencyAlarm}") OR ALARM("${HighErrorRateAlarm}"))
      AND ALARM("${HealthCheckAlarm}")
    AlarmActions:
      - !Ref CriticalAlertsTopic
    AlarmDescription: "Service is degraded - multiple symptoms detected"
```
## Building Effective Runbooks
Every alert should have a corresponding runbook. Here’s my template:
````markdown
# Alert: High Response Time

## Severity: Critical

**When to escalate:** If response time doesn't improve within 15 minutes

## Quick Checks (< 5 minutes)

1. Check application logs for errors
   ```bash
   kubectl logs -f deployment/web-app --tail=100
   ```
2. Verify downstream dependencies
   ```bash
   curl -s https://api.dependency.com/health | jq .
   ```
3. Check resource utilization
   ```bash
   kubectl top pods
   kubectl describe nodes
   ```

## Likely Causes

- Database connection pool exhaustion
- Downstream API latency
- Memory leaks causing GC pressure
- Traffic spike beyond capacity

## Mitigation Steps

1. Immediate relief (if traffic spike):
   ```bash
   kubectl scale deployment web-app --replicas=10
   ```
2. Database issues:
   ```bash
   # Check active connections
   psql -c "SELECT count(*) FROM pg_stat_activity;"
   ```
3. Application restart (if memory leak suspected):
   ```bash
   kubectl rollout restart deployment/web-app
   kubectl rollout status deployment/web-app
   ```

## Recovery Verification

- Response time back to normal (< 500ms)
- Error rate below 1%
- All health checks passing
- Downstream dependencies responding normally

## Post-Incident

- Update monitoring thresholds if needed
- Review capacity planning
- Schedule post-mortem if impact > 15 minutes
````
## Custom Metrics for Business Logic
Don't rely only on infrastructure metrics. Monitor your business logic:
```python
# Example: Custom metrics for an e-commerce application
import boto3
from datetime import datetime

class BusinessMetrics:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

    def track_order_processing_time(self, order_id, processing_time_ms):
        """Track how long it takes to process an order"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Orders',
            MetricData=[
                {
                    'MetricName': 'OrderProcessingTime',
                    'Value': processing_time_ms,
                    'Unit': 'Milliseconds',
                    'Dimensions': [
                        {
                            'Name': 'OrderType',
                            'Value': self._get_order_type(order_id)
                        }
                    ],
                    'Timestamp': datetime.utcnow()
                }
            ]
        )

    def track_payment_failures(self, reason, amount):
        """Track payment failures by reason"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Payments',
            MetricData=[
                {
                    'MetricName': 'PaymentFailures',
                    'Value': 1,
                    'Unit': 'Count',
                    'Dimensions': [
                        {
                            'Name': 'FailureReason',
                            'Value': reason
                        },
                        {
                            'Name': 'AmountRange',
                            'Value': self._get_amount_range(amount)
                        }
                    ]
                }
            ]
        )

    def track_inventory_levels(self, product_id, current_stock):
        """Monitor inventory levels for stock-out prevention"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Inventory',
            MetricData=[
                {
                    'MetricName': 'StockLevel',
                    'Value': current_stock,
                    'Unit': 'Count',
                    'Dimensions': [
                        {
                            'Name': 'ProductId',
                            'Value': product_id
                        }
                    ]
                }
            ]
        )

    def _get_order_type(self, order_id):
        # Implementation to determine order type
        return "standard"  # or "premium", "express", etc.

    def _get_amount_range(self, amount):
        # Categorize amounts for better analysis
        if amount < 50:
            return "small"
        elif amount < 200:
            return "medium"
        else:
            return "large"
```
## Distributed Tracing for Complex Systems
When you have microservices, single metrics aren’t enough. Implement distributed tracing:
```python
# Example using AWS X-Ray
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Automatically instrument AWS SDK calls
patch_all()

@xray_recorder.capture('order_processing')
def process_order(order_data):
    # process_payment and update_inventory are the application's own functions
    # Start a subsegment for payment processing
    with xray_recorder.in_subsegment('payment_processing'):
        payment_result = process_payment(order_data['payment_info'])

        # Add metadata to the trace
        xray_recorder.current_subsegment().put_metadata(
            'payment_method',
            order_data['payment_info']['method']
        )

    # Another subsegment for inventory update
    with xray_recorder.in_subsegment('inventory_update'):
        inventory_result = update_inventory(order_data['items'])

        if inventory_result['stock_warning']:
            # Add annotation for easy filtering
            xray_recorder.current_subsegment().put_annotation(
                'low_stock_warning',
                True
            )

    return {'status': 'completed', 'order_id': order_data['id']}
```
## Monitoring Dashboard Best Practices

Your dashboards should tell a story. Here's my approach:

### 1. The 5-Second Rule

Anyone should be able to look at your dashboard and understand system health within 5 seconds.

### 2. Use the Inverted Pyramid Structure

- Top: overall system health (RED metrics: rate, errors, duration)
- Middle: service-level metrics
- Bottom: infrastructure metrics

### 3. Color Coding That Makes Sense
```yaml
# CloudWatch Dashboard with consistent colors
# (green for success, red for errors, blue for latency)
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: "production-overview"
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${LoadBalancer}", {"color": "#1f77b4"}],
                [".", "HTTPCode_Target_2XX_Count", ".", ".", {"color": "#2ca02c"}],
                [".", "HTTPCode_Target_5XX_Count", ".", ".", {"color": "#d62728"}]
              ],
              "period": 300,
              "stat": "Average",
              "region": "us-east-1",
              "title": "Request Metrics",
              "yAxis": {
                "left": {
                  "min": 0
                }
              }
            }
          }
        ]
      }
```
## Implementing Chaos Engineering
Test your monitoring by breaking things intentionally:
```bash
#!/bin/bash
# Simple chaos engineering script

echo "Starting chaos engineering test..."

# Test 1: Increase response time
echo "Injecting latency..."
kubectl patch deployment web-app -p='{"spec":{"template":{"spec":{"containers":[{"name":"web-app","env":[{"name":"ARTIFICIAL_DELAY","value":"2000"}]}]}}}}'

sleep 300  # Wait 5 minutes

# Verify alerts fired
echo "Checking if latency alerts triggered..."
aws cloudwatch describe-alarms --state-value ALARM --query 'MetricAlarms[?contains(AlarmName, `latency`)].AlarmName'

# Restore normal operation
echo "Removing latency injection..."
kubectl patch deployment web-app -p='{"spec":{"template":{"spec":{"containers":[{"name":"web-app","env":[{"name":"ARTIFICIAL_DELAY","value":"0"}]}]}}}}'

echo "Chaos engineering test complete."
```
## Key Takeaways for SRE Monitoring

- **Start with the Four Golden Signals** - they cover 80% of issues
- **Alert on symptoms, not causes** - focus on user impact
- **Use data to set thresholds** - don't guess
- **Implement gradual escalation** - not everything is critical
- **Write runbooks for every alert** - make response predictable
- **Monitor business metrics** - infrastructure metrics aren't enough
- **Test your monitoring** - use chaos engineering
- **Fight alert fatigue aggressively** - too many alerts = no alerts
## Tools I Recommend

Based on my experience with AWS-centric environments:

- **Metrics:** CloudWatch + custom metrics
- **Alerting:** CloudWatch Alarms + PagerDuty
- **Dashboards:** CloudWatch Dashboards + Grafana
- **Tracing:** AWS X-Ray
- **Log aggregation:** CloudWatch Logs + ELK Stack
- **Chaos engineering:** Chaos Monkey + custom scripts
Remember: The best monitoring system is the one your team actually uses and trusts. Start simple, iterate based on real incidents, and always prioritize reducing noise over adding more data.
How does your team handle monitoring and alerting? I’d love to discuss different approaches and learn from your experiences. Connect with me on LinkedIn or drop me an email.