SRE Monitoring: Building Effective Alerting Systems


After years of being on-call and managing production systems, I’ve learned that effective monitoring isn’t about having the most alerts—it’s about having the right alerts. Here’s what I’ve learned about building monitoring systems that actually help rather than overwhelm your team.

The Golden Signals: Start with What Matters

Google’s SRE book introduced the concept of the Four Golden Signals, and they remain the best starting point for any monitoring strategy:

  1. Latency - How long requests take
  2. Traffic - How much demand is on your system
  3. Errors - Rate of failed requests
  4. Saturation - How “full” your service is

Implementing Golden Signals with CloudWatch

Here’s how I implement these signals for a typical web application on AWS:

# CloudWatch Alarms for Golden Signals
Resources:
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-high-latency"
      AlarmDescription: "Average response time exceeds threshold"
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 1.0  # 1 second
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: LoadBalancer
          # ALB metrics need the "app/name/id" value, which LoadBalancerFullName provides
          # (a plain !Ref returns the full ARN and the alarm would never receive data)
          Value: !GetAtt ApplicationLoadBalancer.LoadBalancerFullName

  ErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-error-rate"
      AlarmDescription: "5xx responses from targets exceed threshold"
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: LoadBalancer
          Value: !GetAtt ApplicationLoadBalancer.LoadBalancerFullName
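
Traffic usually works better as a dashboard line than an alarm (more demand isn't a problem by itself), but saturation deserves its own alarm. A minimal sketch only — it assumes the application runs on ECS, and the ECSClusterName/ECSServiceName parameters are placeholders, not part of the template above:

  HighSaturationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApplicationName}-high-saturation"
      AlarmDescription: "Service CPU utilization approaching capacity"
      MetricName: CPUUtilization
      Namespace: AWS/ECS
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ECSClusterName
        - Name: ServiceName
          Value: !Ref ECSServiceName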

The Art of Alert Thresholds

Setting good thresholds is more art than science. Here’s my approach:

1. Use Historical Data

Don’t guess—analyze your actual metrics:

# Python script to analyze historical latency patterns
import boto3
import pandas as pd
from datetime import datetime, timedelta

def analyze_latency_patterns(metric_name, load_balancer, days=30):
    """Pull historical datapoints and suggest an alert threshold from real traffic."""
    cloudwatch = boto3.client('cloudwatch')

    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=days)

    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ApplicationELB',
        MetricName=metric_name,
        # ALB metrics are published per load balancer, so the dimension is required
        Dimensions=[{'Name': 'LoadBalancer', 'Value': load_balancer}],
        StartTime=start_time,
        EndTime=end_time,
        Period=300,
        Statistics=['Average', 'Maximum']
    )

    df = pd.DataFrame(response['Datapoints'])

    # Calculate percentiles of the 5-minute averages for threshold setting
    p95 = df['Average'].quantile(0.95)
    p99 = df['Average'].quantile(0.99)

    print(f"95th percentile: {p95:.2f}")
    print(f"99th percentile: {p99:.2f}")
    print(f"Recommended alert threshold: {p95 * 1.2:.2f}")

    return {'p95': p95, 'p99': p99, 'recommended_threshold': p95 * 1.2}

2. Implement Gradual Escalation

Not every issue needs to wake someone up at 3 AM:

# Terraform example for graduated alerting
resource "aws_sns_topic" "alerts_warning" {
  name = "application-alerts-warning"
}

resource "aws_sns_topic" "alerts_critical" {
  name = "application-alerts-critical"
}

# Warning alert - Slack notification
resource "aws_cloudwatch_metric_alarm" "latency_warning" {
  alarm_name          = "latency-warning"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = "300"
  statistic           = "Average"
  threshold           = "0.5"
  alarm_description   = "Response time elevated"
  alarm_actions       = [aws_sns_topic.alerts_warning.arn]
  treat_missing_data  = "notBreaching"
}

# Critical alert - PagerDuty notification  
resource "aws_cloudwatch_metric_alarm" "latency_critical" {
  alarm_name          = "latency-critical"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "3"
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = "300"
  statistic           = "Average"
  threshold           = "2.0"
  alarm_description   = "Response time critically high"
  alarm_actions       = [aws_sns_topic.alerts_critical.arn]
  treat_missing_data  = "breaching"
}
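
Neither topic does anything until something subscribes to it. A minimal sketch of wiring the critical topic to PagerDuty — the integration URL variable is hypothetical and comes from PagerDuty's CloudWatch integration, which auto-confirms the SNS subscription:

# Hypothetical variable holding the PagerDuty CloudWatch integration URL
variable "pagerduty_integration_url" {
  type = string
}

resource "aws_sns_topic_subscription" "pagerduty_critical" {
  topic_arn              = aws_sns_topic.alerts_critical.arn
  protocol               = "https"
  endpoint               = var.pagerduty_integration_url
  endpoint_auto_confirms = true
}

The warning topic can feed Slack instead (for example via AWS Chatbot or a small forwarding Lambda) so elevated-but-not-critical signals stay off the pager.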

Service Level Indicators (SLIs) and Service Level Objectives (SLOs)

SLOs give you objective targets for your monitoring:

Defining SLIs

# Example SLI calculations
def calculate_availability_sli(total_requests, error_requests):
    """Calculate availability SLI as percentage of successful requests"""
    if total_requests == 0:
        return 100.0
    
    success_rate = ((total_requests - error_requests) / total_requests) * 100
    return round(success_rate, 2)

def calculate_latency_sli(response_times, threshold_ms=1000):
    """Calculate latency SLI as percentage of requests under threshold"""
    if not response_times:
        return 100.0
    
    under_threshold = sum(1 for rt in response_times if rt < threshold_ms)
    percentage = (under_threshold / len(response_times)) * 100
    return round(percentage, 2)

# Example usage
availability_sli = calculate_availability_sli(10000, 50)  # 99.5%
latency_sli = calculate_latency_sli(response_times, 1000)  # response_times = list of observed latencies in ms; e.g., 99.2%

Error Budget Monitoring

def calculate_error_budget_burn_rate(current_sli, slo_target, window_hours=24):
    """
    Calculate how fast we're burning through our error budget.

    current_sli and slo_target are percentages (e.g. 99.5 and 99.9);
    window_hours is the window the error budget is defined over.
    """
    error_rate = 100 - current_sli
    allowed_error_rate = 100 - slo_target
    
    if allowed_error_rate == 0:
        return float('inf') if error_rate > 0 else 0
    
    burn_rate = error_rate / allowed_error_rate
    
    # Project how long until budget is exhausted
    hours_to_exhaustion = window_hours / burn_rate if burn_rate > 0 else float('inf')
    
    return {
        'burn_rate': burn_rate,
        'hours_to_exhaustion': hours_to_exhaustion,
        'alert_level': 'critical' if burn_rate > 10 else 'warning' if burn_rate > 2 else 'ok'
    }
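
As a quick sanity check of those thresholds: a service at 99.5% against a 99.9% objective is burning budget roughly five times faster than allowed, which lands in the warning band rather than paging anyone:

status = calculate_error_budget_burn_rate(current_sli=99.5, slo_target=99.9)
print(round(status['burn_rate'], 1))   # 5.0
print(status['alert_level'])           # warning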

Reducing Alert Fatigue

The biggest enemy of effective monitoring is alert fatigue. Here’s how to combat it:

1. Alert on Symptoms, Not Causes

❌ Bad: Alert on high CPU usage
✅ Good: Alert on high response times

# Focus on user-impacting metrics
HighResponseTimeAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "user-experience-degraded"
    AlarmDescription: "Users experiencing slow response times"
    MetricName: TargetResponseTime
    # ... rest of configuration

# CPU alerts should be informational, not urgent
HighCPUInformation:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "high-cpu-investigation-needed"
    AlarmDescription: "High CPU - investigate for optimization"
    TreatMissingData: notBreaching
    # Send to different, non-urgent channel

2. Implement Alert Suppression

# Simple alert suppression logic
import time

class AlertManager:
    def __init__(self, suppression_duration=3600):
        self.suppressed_alerts = {}                        # alert name -> last sent timestamp
        self.suppression_duration = suppression_duration   # seconds; 1 hour by default

    def should_send_alert(self, alert_name, current_time):
        """Return True only if this alert hasn't been sent within the suppression window."""
        if alert_name in self.suppressed_alerts:
            last_sent = self.suppressed_alerts[alert_name]
            if current_time - last_sent < self.suppression_duration:
                return False

        self.suppressed_alerts[alert_name] = current_time
        return True

    def send_alert_if_needed(self, alert_name, message):
        current_time = time.time()

        if self.should_send_alert(alert_name, current_time):
            self.send_to_pagerduty(message)
            return True
        return False

    def send_to_pagerduty(self, message):
        # Placeholder: wire this up to your paging integration (PagerDuty Events API, SNS, etc.)
        print(f"PAGE: {message}")
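
A quick illustration of the behavior — the first page goes out, the repeat within the hour is swallowed (the alert name and message are made up):

manager = AlertManager(suppression_duration=3600)
manager.send_alert_if_needed("high-latency", "p95 latency above 2s")  # True, pages
manager.send_alert_if_needed("high-latency", "p95 latency above 2s")  # False, suppressed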

3. Use Composite Alarms

AWS CloudWatch Composite Alarms let you combine multiple conditions:

CompositeServiceDegradedAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: "service-degraded"
    AlarmRule: !Sub |
      (ALARM("${HighLatencyAlarm}") OR ALARM("${HighErrorRateAlarm}"))
      AND ALARM("${HealthCheckAlarm}")
    AlarmActions:
      - !Ref CriticalAlertsTopic
    AlarmDescription: "Service is degraded - multiple symptoms detected"

Building Effective Runbooks

Every alert should have a corresponding runbook. Here’s my template:

Alert: High Response Time

Severity: Critical
When to escalate: If response time doesn't improve within 15 minutes

Quick Checks (< 5 minutes)

  1. Check application logs for errors

     kubectl logs -f deployment/web-app --tail=100

  2. Verify downstream dependencies

     curl -s https://api.dependency.com/health | jq .

  3. Check resource utilization

     kubectl top pods
     kubectl describe nodes

Likely Causes

  • Database connection pool exhaustion
  • Downstream API latency
  • Memory leaks causing GC pressure
  • Traffic spike beyond capacity

Mitigation Steps

  1. Immediate relief (if traffic spike):

     kubectl scale deployment web-app --replicas=10

  2. Database issues:

     # Check active connections
     psql -c "SELECT count(*) FROM pg_stat_activity;"

  3. Application restart (if memory leak suspected):

     kubectl rollout restart deployment/web-app
     kubectl rollout status deployment/web-app

Recovery Verification

  • Response time back to normal (< 500ms)
  • Error rate below 1%
  • All health checks passing
  • Downstream dependencies responding normally

Post-Incident

  • Update monitoring thresholds if needed
  • Review capacity planning
  • Schedule post-mortem if impact > 15 minutes

Custom Metrics for Business Logic

Don't rely only on infrastructure metrics. Monitor your business logic:

# Example: Custom metrics for an e-commerce application
import boto3
from datetime import datetime

class BusinessMetrics:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
    
    def track_order_processing_time(self, order_id, processing_time_ms):
        """Track how long it takes to process an order"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Orders',
            MetricData=[
                {
                    'MetricName': 'OrderProcessingTime',
                    'Value': processing_time_ms,
                    'Unit': 'Milliseconds',
                    'Dimensions': [
                        {
                            'Name': 'OrderType',
                            'Value': self._get_order_type(order_id)
                        }
                    ],
                    'Timestamp': datetime.utcnow()
                }
            ]
        )
    
    def track_payment_failures(self, reason, amount):
        """Track payment failures by reason"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Payments',
            MetricData=[
                {
                    'MetricName': 'PaymentFailures',
                    'Value': 1,
                    'Unit': 'Count',
                    'Dimensions': [
                        {
                            'Name': 'FailureReason',
                            'Value': reason
                        },
                        {
                            'Name': 'AmountRange',
                            'Value': self._get_amount_range(amount)
                        }
                    ]
                }
            ]
        )
    
    def track_inventory_levels(self, product_id, current_stock):
        """Monitor inventory levels for stock-out prevention"""
        self.cloudwatch.put_metric_data(
            Namespace='ECommerce/Inventory',
            MetricData=[
                {
                    'MetricName': 'StockLevel',
                    'Value': current_stock,
                    'Unit': 'Count',
                    'Dimensions': [
                        {
                            'Name': 'ProductId',
                            'Value': product_id
                        }
                    ]
                }
            ]
        )
    
    def _get_order_type(self, order_id):
        # Implementation to determine order type
        return "standard"  # or "premium", "express", etc.
    
    def _get_amount_range(self, amount):
        # Categorize amounts for better analysis
        if amount < 50:
            return "small"
        elif amount < 200:
            return "medium"
        else:
            return "large"

Distributed Tracing for Complex Systems

When you run microservices, per-service metrics alone aren't enough to explain where a slow request spent its time. Implement distributed tracing:

# Example using AWS X-Ray
import boto3
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

# Automatically instrument AWS SDK calls
patch_all()

@xray_recorder.capture('order_processing')
def process_order(order_data):
    # Start a subsegment for payment processing
    with xray_recorder.in_subsegment('payment_processing'):
        payment_result = process_payment(order_data['payment_info'])
        
        # Add metadata to the trace
        xray_recorder.current_subsegment().put_metadata(
            'payment_method', 
            order_data['payment_info']['method']
        )
    
    # Another subsegment for inventory update
    with xray_recorder.in_subsegment('inventory_update'):
        inventory_result = update_inventory(order_data['items'])
        
        if inventory_result['stock_warning']:
            # Add annotation for easy filtering
            xray_recorder.current_subsegment().put_annotation(
                'low_stock_warning', 
                True
            )
    
    return {'status': 'completed', 'order_id': order_data['id']}

Monitoring Dashboard Best Practices

Your dashboards should tell a story. Here’s my approach:

1. The 5-Second Rule

Anyone should be able to look at your dashboard and understand system health within 5 seconds.

2. Use the Inverted Pyramid Structure

  • Top: Overall system health (RED metrics: rate, errors, duration)
  • Middle: Service-level metrics
  • Bottom: Infrastructure metrics

3. Color Coding That Makes Sense

# CloudWatch Dashboard with consistent colors
Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: "production-overview"
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${LoadBalancer}"],
                [".", "HTTPCode_Target_2XX_Count", ".", "."],
                [".", "HTTPCode_Target_5XX_Count", ".", "."]
              ],
              "period": 300,
              "stat": "Average",
              "region": "us-east-1",
              "title": "Request Metrics",
              "yAxis": {
                "left": {
                  "min": 0
                }
              }
            }
          }
        ]
      }
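
Each metric entry in the dashboard body accepts a trailing options object, which is where consistent color coding (and per-metric stat or axis overrides) lives. A sketch of the same metrics array with errors always in red, successes in green, and the counts moved to the right axis so they don't flatten the latency line:

"metrics": [
  ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${LoadBalancer}", {"color": "#1f77b4"}],
  [".", "HTTPCode_Target_2XX_Count", ".", ".", {"color": "#2ca02c", "stat": "Sum", "yAxis": "right"}],
  [".", "HTTPCode_Target_5XX_Count", ".", ".", {"color": "#d62728", "stat": "Sum", "yAxis": "right"}]
]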

Implementing Chaos Engineering

Test your monitoring by breaking things intentionally:

#!/bin/bash
# Simple chaos engineering script

echo "Starting chaos engineering test..."

# Test 1: Increase response time
echo "Injecting latency..."
kubectl patch deployment web-app -p='{"spec":{"template":{"spec":{"containers":[{"name":"web-app","env":[{"name":"ARTIFICIAL_DELAY","value":"2000"}]}]}}}}'

sleep 300  # Wait 5 minutes

# Verify alerts fired
echo "Checking if latency alerts triggered..."
aws cloudwatch describe-alarms --state-value ALARM --query 'MetricAlarms[?contains(AlarmName, `latency`)].AlarmName'

# Restore normal operation
echo "Removing latency injection..."
kubectl patch deployment web-app -p='{"spec":{"template":{"spec":{"containers":[{"name":"web-app","env":[{"name":"ARTIFICIAL_DELAY","value":"0"}]}]}}}}'

echo "Chaos engineering test complete."

Key Takeaways for SRE Monitoring

  1. Start with the Four Golden Signals - They cover 80% of issues
  2. Alert on symptoms, not causes - Focus on user impact
  3. Use data to set thresholds - Don’t guess
  4. Implement gradual escalation - Not everything is critical
  5. Write runbooks for every alert - Make response predictable
  6. Monitor business metrics - Infrastructure metrics aren’t enough
  7. Test your monitoring - Use chaos engineering
  8. Fight alert fatigue aggressively - Too many alerts = no alerts

Tools I Recommend

Based on my experience with AWS-centric environments:

  • Metrics: CloudWatch + Custom Metrics
  • Alerting: CloudWatch Alarms + PagerDuty
  • Dashboards: CloudWatch Dashboards + Grafana
  • Tracing: AWS X-Ray
  • Log Aggregation: CloudWatch Logs + ELK Stack
  • Chaos Engineering: Chaos Monkey + custom scripts

Remember: The best monitoring system is the one your team actually uses and trusts. Start simple, iterate based on real incidents, and always prioritize reducing noise over adding more data.


How does your team handle monitoring and alerting? I’d love to discuss different approaches and learn from your experiences. Connect with me on LinkedIn or drop me an email.