The Complete AWS CloudWatch Alerts Setup Guide

📋 Contents

Setup Parameters
1. ECS
2. EC2
3. RDS
4. Lambda
5. ALB
6. API Gateway
7. SQS
8. DynamoDB
9. ElastiCache
10. Cost & Budget

Show all code as: Switches all code blocks

⚙️ Setup Parameters — Read This First

Every snippet uses placeholder values. Replace them before deploying:

YOUR_SNS_TOPIC_ARN — ARN of your SNS topic (e.g. arn:aws:sns:eu-central-1:123456789012:alerts)
YOUR_CLUSTER_NAME / YOUR_SERVICE_NAME — ECS cluster and service names
YOUR_INSTANCE_ID — EC2 instance ID (e.g. i-0abc123def456789)
YOUR_DB_INSTANCE_ID — RDS DB instance identifier
YOUR_FUNCTION_NAME — Lambda function name
YOUR_ALB_SUFFIX — Part after loadbalancer/ in ALB ARN (e.g. app/my-alb/abc123def456)
YOUR_API_NAME / YOUR_STAGE — API Gateway name and stage (e.g. prod)
YOUR_QUEUE_NAME — SQS queue name
YOUR_TABLE_NAME — DynamoDB table name
YOUR_CACHE_CLUSTER_ID — ElastiCache cluster ID
YOUR_MONTHLY_BUDGET — Your monthly AWS budget in USD

📫 Create an SNS topic that emails you

☁️ CloudFormation YAML

🟣 Terraform HCL

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  AlertEmail:
    Type: String
    Description: Email address to receive CloudWatch alerts

Resources:
  AlertsTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: infra-alerts
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

Outputs:
  SnsTopicArn:
    Value: !Ref AlertsTopic
    Description: Use this ARN as YOUR_SNS_TOPIC_ARN in all alarm snippets below

variable "alert_email" {
  description = "Email address to receive alerts"
  type        = string
}

resource "aws_sns_topic" "alerts" {
  name = "infra-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Use aws_sns_topic.alerts.arn as var.sns_topic_arn in alarm resources below

💡 After deploying, AWS sends a confirmation email. Click "Confirm subscription" in that email — alarms won't deliver until you do.

🐳

1. ECS — Elastic Container Service

ECS containers can silently exhaust CPU/memory or stop running without the load balancer health check catching it in time. These alarms detect saturation and task crashes before users are impacted.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`CPUUtilization`	> 80%	5 min	2	WARN	Sustained CPU pressure — scale before saturation
`CPUUtilization`	> 95%	5 min	2	CRITICAL	Tasks CPU-throttled; latency spikes imminent
`MemoryUtilization`	> 85%	5 min	2	WARN	Memory pressure building; OOM kill possible
`MemoryUtilization`	> 95%	5 min	2	CRITICAL	Near OOM; task will be killed and restarted
`RunningTaskCount`	< desired count	1 min	1	CRITICAL	Tasks crashed and not recovering; service may be down

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  ClusterName:
    Type: String
    Default: YOUR_CLUSTER_NAME
  ServiceName:
    Type: String
    Default: YOUR_SERVICE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DesiredTaskCount:
    Type: Number
    Default: 2
    Description: Alarm when running tasks fall below this number

Resources:
  EcsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-warn"
      AlarmDescription: ECS CPU utilization above 80% for 10 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-critical"
      AlarmDescription: ECS CPU above 95% - tasks are throttled
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-warn"
      AlarmDescription: ECS memory utilization above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-critical"
      AlarmDescription: ECS memory utilization above 95% - OOM kill imminent
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsRunningTasksCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-tasks-critical"
      AlarmDescription: Running task count below desired - service may be down
      Namespace: AWS/ECS
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref DesiredTaskCount
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

variable "cluster_name"    { type = string }
variable "service_name"    { type = string }
variable "sns_topic_arn"   { type = string }
variable "desired_count"   { type = number; default = 2 }

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_warn" {
  alarm_name          = "${var.service_name}-cpu-warn"
  alarm_description   = "ECS CPU above 80% for 10 minutes"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_critical" {
  alarm_name          = "${var.service_name}-cpu-critical"
  alarm_description   = "ECS CPU above 95% - tasks throttled"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_warn" {
  alarm_name          = "${var.service_name}-memory-warn"
  alarm_description   = "ECS memory above 85%"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_critical" {
  alarm_name          = "${var.service_name}-memory-critical"
  alarm_description   = "ECS memory above 95% - OOM kill imminent"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_running_tasks" {
  alarm_name          = "${var.service_name}-tasks-critical"
  alarm_description   = "Running tasks below desired count"
  namespace           = "AWS/ECS"
  metric_name         = "RunningTaskCount"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  threshold           = var.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

🖥️

2. EC2 — Elastic Compute Cloud

EC2 instances can become unresponsive due to hardware failures, runaway processes, or network issues. Status check alarms catch hard failures that Auto Scaling or ELB health checks may miss initially.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`CPUUtilization`	> 85%	5 min	3	WARN	Sustained high CPU; investigate before saturation
`CPUUtilization`	> 95%	5 min	2	CRITICAL	Instance at capacity; requests will queue or fail
`StatusCheckFailed`	> 0	1 min	2	CRITICAL	Instance or system check failing — likely unresponsive
`StatusCheckFailed_System`	> 0	1 min	2	CRITICAL	AWS hardware issue — instance may need recovery
`NetworkIn`	< 1000 bytes/period	5 min	3	WARN	Traffic dropped to near-zero — instance may be isolated

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  InstanceId:
    Type: String
    Default: YOUR_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Ec2CpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-warn"
      AlarmDescription: EC2 CPU above 85% for 15 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2CpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-critical"
      AlarmDescription: EC2 CPU above 95% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-failed"
      AlarmDescription: EC2 status check failed - instance may be unresponsive
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailedSystem:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-system"
      AlarmDescription: EC2 system status check failed - AWS hardware issue
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2NetworkInDrop:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-network-in-drop"
      AlarmDescription: EC2 NetworkIn near zero - traffic may have stopped
      Namespace: AWS/EC2
      MetricName: NetworkIn
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

variable "instance_id"   { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_warn" {
  alarm_name          = "${var.instance_id}-cpu-warn"
  alarm_description   = "EC2 CPU above 85% for 15 minutes"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_critical" {
  alarm_name          = "${var.instance_id}-cpu-critical"
  alarm_description   = "EC2 CPU above 95%"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check" {
  alarm_name          = "${var.instance_id}-status-check"
  alarm_description   = "EC2 status check failed"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check_system" {
  alarm_name          = "${var.instance_id}-status-check-system"
  alarm_description   = "EC2 system status check failed - hardware issue"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_network_in_drop" {
  alarm_name          = "${var.instance_id}-network-in-drop"
  alarm_description   = "EC2 NetworkIn near zero - traffic may have stopped"
  namespace           = "AWS/EC2"
  metric_name         = "NetworkIn"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 1000
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

🗃️

3. RDS — Relational Database Service

Databases fail silently — connections pile up, disk fills, replicas fall behind. By the time your app throws errors, it's already too late. These alarms give you a 10–30 minute warning window.

ⓘ max_connections lookup: Set DatabaseConnections threshold to 80% of your instance's max_connections value:

Instance Class	max_connections	80% threshold
db.t3.micro	87	69
db.t3.small	171	136
db.t3.medium	341	272
db.t3.large	648	518
db.r5.large	1365	1092
db.r5.xlarge	2730	2184
db.r5.2xlarge	5460	4368

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`CPUUtilization`	> 80%	5 min	3	WARN	DB under CPU load; queries slowing down
`DatabaseConnections`	> 80% of max	5 min	2	WARN	Connection pool filling; new connections will fail soon
`FreeStorageSpace`	< 10 GB	5 min	2	WARN	Disk filling; DB will stop accepting writes when full
`FreeStorageSpace`	< 2 GB	5 min	1	CRITICAL	Critically low disk — DB failure imminent
`ReplicaLag`	> 300 s	1 min	2	WARN	Read replica falling behind; stale reads possible
`FreeableMemory`	< 256 MB	5 min	3	WARN	Low memory; buffer pool shrinking, queries slowing

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  DbInstanceId:
    Type: String
    Default: YOUR_DB_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  MaxConnectionsThreshold:
    Type: Number
    Default: 272
    Description: 80% of max_connections for your instance class (see table above)

Resources:
  RdsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-cpu-warn"
      AlarmDescription: RDS CPU above 80% for 15 minutes
      Namespace: AWS/RDS
      MetricName: CPUUtilization
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-warn"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Ref MaxConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-warn"
      AlarmDescription: RDS free storage below 10 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10737418240
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-critical"
      AlarmDescription: RDS free storage critically low (below 2 GB)
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 2147483648
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsReplicaLag:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-replica-lag"
      AlarmDescription: RDS read replica lag above 5 minutes (read replicas only)
      Namespace: AWS/RDS
      MetricName: ReplicaLag
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-memory-warn"
      AlarmDescription: RDS freeable memory below 256 MB
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 268435456
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "db_instance_id"          { type = string }
variable "sns_topic_arn"            { type = string }
variable "max_connections_threshold" { type = number; default = 272 }
# Set max_connections_threshold to 80% of your instance's max_connections
# db.t3.micro=69, db.t3.small=136, db.t3.medium=272, db.r5.large=1092

resource "aws_cloudwatch_metric_alarm" "rds_cpu_warn" {
  alarm_name          = "${var.db_instance_id}-cpu-warn"
  alarm_description   = "RDS CPU above 80% for 15 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_connections_warn" {
  alarm_name          = "${var.db_instance_id}-connections-warn"
  alarm_description   = "RDS connections above 80% of max_connections"
  namespace           = "AWS/RDS"
  metric_name         = "DatabaseConnections"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.max_connections_threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_warn" {
  alarm_name          = "${var.db_instance_id}-disk-warn"
  alarm_description   = "RDS free storage below 10 GB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10737418240  # 10 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_critical" {
  alarm_name          = "${var.db_instance_id}-disk-critical"
  alarm_description   = "RDS free storage critically low (below 2 GB)"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 2147483648  # 2 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_replica_lag" {
  # Apply only to read replicas
  alarm_name          = "${var.db_instance_id}-replica-lag"
  alarm_description   = "RDS replica lag above 5 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_memory_warn" {
  alarm_name          = "${var.db_instance_id}-memory-warn"
  alarm_description   = "RDS freeable memory below 256 MB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeableMemory"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 268435456  # 256 MB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

4. Lambda

Lambda errors are silent by default — your function fails and nothing tells you. Throttles mean requests are being dropped. Duration alerts catch runaway executions before they eat your budget.

ⓘ Duration threshold: Set the Duration alarm threshold to 80% of your function's configured timeout. For example, if your timeout is 30 seconds, set threshold to 24000ms (24 seconds). You must set this manually — there is no automatic way to reference the function timeout in a CloudWatch alarm.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`Errors`	> 0	1 min	1	WARN	Any function error — investigate immediately
`Errors`	> 5	1 min	2	CRITICAL	Repeated errors — function may be completely broken
`Throttles`	> 0	1 min	2	WARN	Requests being dropped due to concurrency limit
`Duration`	> 80% of timeout	1 min	2	WARN	Function nearing timeout; will fail if trend continues
`ConcurrentExecutions`	> 800 (80% of default 1000)	1 min	2	WARN	Approaching account concurrency limit; throttles incoming

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  FunctionName:
    Type: String
    Default: YOUR_FUNCTION_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DurationThresholdMs:
    Type: Number
    Default: 24000
    Description: |
      80% of your function timeout in ms.
      e.g. 30s timeout -> 24000ms, 15s timeout -> 12000ms, 5s timeout -> 4000ms

Resources:
  LambdaErrorsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-warn"
      AlarmDescription: Lambda function errors detected
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  LambdaErrorsCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-critical"
      AlarmDescription: Lambda function errors above 5 - may be completely broken
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaThrottlesWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-throttles"
      AlarmDescription: Lambda throttles detected - requests being dropped
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaDurationWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-duration-warn"
      AlarmDescription: !Sub "Lambda duration above 80% of timeout (${DurationThresholdMs}ms)"
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 2
      Threshold: !Ref DurationThresholdMs
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaConcurrencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-concurrency-warn"
      AlarmDescription: Lambda concurrent executions above 800 (80% of default limit 1000)
      Namespace: AWS/Lambda
      MetricName: ConcurrentExecutions
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 800
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "function_name"       { type = string }
variable "sns_topic_arn"        { type = string }
variable "duration_threshold_ms" {
  type        = number
  default     = 24000
  description = "80% of function timeout in ms. e.g. 30s timeout -> 24000"
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_warn" {
  alarm_name          = "${var.function_name}-errors-warn"
  alarm_description   = "Lambda errors detected"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_critical" {
  alarm_name          = "${var.function_name}-errors-critical"
  alarm_description   = "Lambda errors above 5"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_throttles" {
  alarm_name          = "${var.function_name}-throttles"
  alarm_description   = "Lambda throttles - requests being dropped"
  namespace           = "AWS/Lambda"
  metric_name         = "Throttles"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration_warn" {
  alarm_name          = "${var.function_name}-duration-warn"
  alarm_description   = "Lambda p99 duration above 80% of timeout"
  namespace           = "AWS/Lambda"
  metric_name         = "Duration"
  dimensions          = { FunctionName = var.function_name }
  extended_statistic  = "p99"
  period              = 60
  evaluation_periods  = 2
  threshold           = var.duration_threshold_ms
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_concurrency_warn" {
  alarm_name          = "${var.function_name}-concurrency-warn"
  alarm_description   = "Lambda concurrent executions above 800 (80% of default limit)"
  namespace           = "AWS/Lambda"
  metric_name         = "ConcurrentExecutions"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 800
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

⚖

5. ALB — Application Load Balancer

Your load balancer is the front door to your application. 5XX errors mean backends are failing. Unhealthy hosts mean containers are crashing. These alarms catch both.

ⓘ Finding your ALB suffix: In the AWS console, go to EC2 → Load Balancers, click your ALB, and copy the ARN. The suffix is everything after loadbalancer/ (e.g. app/my-alb/abc123def456).

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`HTTPCode_Target_5XX_Count`	> 0	1 min	2	WARN	Backend returning server errors
`HTTPCode_Target_5XX_Count`	> 10	1 min	2	CRITICAL	High rate of 5XX — backend likely down
`TargetResponseTime`	> 2 s	5 min	3	WARN	Slow responses — users experiencing latency
`TargetResponseTime`	> 5 s	5 min	2	CRITICAL	Very slow responses — likely timing out for users
`UnHealthyHostCount`	> 0	1 min	2	CRITICAL	Targets failing health checks — service degraded
`RejectedConnectionCount`	> 0	1 min	2	WARN	ALB at max connections — requests being dropped

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  AlbSuffix:
    Type: String
    Default: YOUR_ALB_SUFFIX
    Description: e.g. app/my-alb/abc123def456 (after "loadbalancer/" in the ARN)
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Alb5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-warn-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors detected
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-critical-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors above 10 per minute
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-warn-${AlbSuffix}"
      AlarmDescription: ALB target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-critical-${AlbSuffix}"
      AlarmDescription: ALB target response time above 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbUnhealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-unhealthy-hosts-${AlbSuffix}"
      AlarmDescription: ALB unhealthy target count above zero
      Namespace: AWS/ApplicationELB
      MetricName: UnHealthyHostCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbRejectedConnections:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-rejected-connections-${AlbSuffix}"
      AlarmDescription: ALB rejected connections - load balancer at max capacity
      Namespace: AWS/ApplicationELB
      MetricName: RejectedConnectionCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "alb_suffix"    { type = string }  # e.g. "app/my-alb/abc123def456"
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "alb_5xx_warn" {
  alarm_name          = "alb-5xx-warn"
  alarm_description   = "ALB 5XX errors detected"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_5xx_critical" {
  alarm_name          = "alb-5xx-critical"
  alarm_description   = "ALB 5XX errors above 10/min"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_latency_warn" {
  alarm_name          = "alb-latency-warn"
  alarm_description   = "ALB p99 response time above 2 seconds"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  dimensions          = { LoadBalancer = var.alb_suffix }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 3
  threshold           = 2
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_latency_critical" {
  alarm_name          = "alb-latency-critical"
  alarm_description   = "ALB p99 response time above 5 seconds"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  dimensions          = { LoadBalancer = var.alb_suffix }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
  alarm_name          = "alb-unhealthy-hosts"
  alarm_description   = "ALB unhealthy targets detected"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_rejected_connections" {
  alarm_name          = "alb-rejected-connections"
  alarm_description   = "ALB at max connections - requests being dropped"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "RejectedConnectionCount"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

🌐

6. API Gateway

API Gateway has a hard 29-second timeout limit. If your backends are slow, requests will silently time out. 5XX and 4XX errors can indicate broken integrations or client misconfigurations at scale.

ⓘ Traffic drop detection: Detecting a sudden drop in request Count requires metric math (comparing current Count to a rolling average). Standard CloudWatch alarms can't do this natively — use CloudWatch Anomaly Detection or external monitoring for this alarm. The snippets below cover the simpler threshold-based alarms.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`5XXError`	> 5 count	1 min	2	WARN	Backend integration errors; Lambda or HTTP backend failing
`4XXError`	> high rate	5 min	3	WARN	High client error rate; API misuse or broken client
`Latency`	> 3000 ms p99	5 min	3	WARN	Slow backend responses; users experiencing delays
`Latency`	> 10000 ms	5 min	2	CRITICAL	Near 29s timeout; requests will start failing
`Count`	sudden drop > 50%	—	—	WARN	Requires metric math / anomaly detection (see note above)

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  ApiName:
    Type: String
    Default: YOUR_API_NAME
  Stage:
    Type: String
    Default: prod
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  ApiGw5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-5xx-warn"
      AlarmDescription: API Gateway 5XX errors above 5 per minute
      Namespace: AWS/ApiGateway
      MetricName: 5XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  ApiGw4xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-4xx-warn"
      AlarmDescription: API Gateway 4XX errors above 50 per 5 minutes
      Namespace: AWS/ApiGateway
      MetricName: 4XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-warn"
      AlarmDescription: API Gateway p99 latency above 3 seconds
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 3000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-critical"
      AlarmDescription: API Gateway latency above 10 seconds - near 29s timeout
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "api_name"      { type = string }
variable "stage"         { type = string; default = "prod" }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "apigw_5xx_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-5xx-warn"
  alarm_description   = "API Gateway 5XX errors above 5/min"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_4xx_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-4xx-warn"
  alarm_description   = "API Gateway 4XX high volume"
  namespace           = "AWS/ApiGateway"
  metric_name         = "4XXError"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 50
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_latency_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-latency-warn"
  alarm_description   = "API Gateway p99 latency above 3 seconds"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 3
  threshold           = 3000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_latency_critical" {
  alarm_name          = "${var.api_name}-${var.stage}-latency-critical"
  alarm_description   = "API Gateway latency above 10s - near 29s timeout"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

📩

7. SQS — Simple Queue Service

A backed-up SQS queue means your consumers have stopped or are too slow. Old messages indicate processing failures. Left unattended, queues can grow to millions of messages and take hours to drain.

ⓘ Traffic drop detection: Detecting a sudden drop in NumberOfMessagesSent requires metric math (comparing to a rolling baseline). Use CloudWatch Anomaly Detection alarms for this — the standard alarm snippets below cover threshold-based alarms only.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`ApproximateNumberOfMessagesNotVisible`	> 1000	5 min	3	WARN	Queue building up; consumers may be slow or failing
`ApproximateNumberOfMessagesNotVisible`	> 10000	5 min	2	CRITICAL	Severe queue backup; consumers definitely failing
`ApproximateAgeOfOldestMessage`	> 300 s	5 min	2	WARN	Messages sitting unprocessed for 5+ minutes
`ApproximateAgeOfOldestMessage`	> 900 s	5 min	2	CRITICAL	Messages 15+ minutes old; SLA likely being breached
`NumberOfMessagesSent`	sudden drop	—	—	WARN	Requires anomaly detection / metric math (see note above)

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  QueueName:
    Type: String
    Default: YOUR_QUEUE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  SqsQueueDepthWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-warn"
      AlarmDescription: SQS queue depth above 1000 - consumers may be lagging
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesNotVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  SqsQueueDepthCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-critical"
      AlarmDescription: SQS queue depth above 10000 - severe consumer failure
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesNotVisible
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-warn"
      AlarmDescription: SQS oldest message age above 5 minutes
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-critical"
      AlarmDescription: SQS oldest message age above 15 minutes - SLA breach
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 900
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "queue_name"    { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "sqs_depth_warn" {
  alarm_name          = "${var.queue_name}-depth-warn"
  alarm_description   = "SQS queue depth above 1000"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesNotVisible"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 1000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_depth_critical" {
  alarm_name          = "${var.queue_name}-depth-critical"
  alarm_description   = "SQS queue depth above 10000 - severe consumer failure"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesNotVisible"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_message_age_warn" {
  alarm_name          = "${var.queue_name}-age-warn"
  alarm_description   = "SQS oldest message above 5 minutes old"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_message_age_critical" {
  alarm_name          = "${var.queue_name}-age-critical"
  alarm_description   = "SQS oldest message above 15 minutes - SLA breach"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 900
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

⚡

8. DynamoDB

DynamoDB throttling is silent and cumulative. Throttled requests are retried with exponential backoff, which means your application slows down before it starts failing. Catch throttles early.

ⓘ On-demand vs provisioned mode: If you use on-demand (PAY_PER_REQUEST) mode, skip the ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits alarms — there is no provisioned limit to alarm against. Keep ThrottledRequests and SystemErrors for all modes.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`SystemErrors`	> 0	1 min	2	CRITICAL	AWS-side DynamoDB errors; likely service issue
`UserErrors`	> 0	5 min	3	WARN	Client-side errors (bad requests, auth issues)
`ConsumedReadCapacityUnits`	> 80% of provisioned	5 min	2	WARN	Read capacity filling up (provisioned mode only)
`ConsumedWriteCapacityUnits`	> 80% of provisioned	5 min	2	WARN	Write capacity filling up (provisioned mode only)
`ThrottledRequests`	> 0	5 min	2	WARN	Requests being throttled; app latency increasing

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  TableName:
    Type: String
    Default: YOUR_TABLE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  ProvisionedReadCapacity:
    Type: Number
    Default: 100
    Description: Your table's provisioned RCU (skip for on-demand mode)
  ProvisionedWriteCapacity:
    Type: Number
    Default: 100
    Description: Your table's provisioned WCU (skip for on-demand mode)

Resources:
  DynamoDbSystemErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-system-errors"
      AlarmDescription: DynamoDB system errors detected - possible AWS service issue
      Namespace: AWS/DynamoDB
      MetricName: SystemErrors
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbUserErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-user-errors"
      AlarmDescription: DynamoDB user errors - bad requests or auth issues
      Namespace: AWS/DynamoDB
      MetricName: UserErrors
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbReadCapacityWarn:
    # Remove this resource if using on-demand mode
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-read-capacity-warn"
      AlarmDescription: DynamoDB read capacity above 80% of provisioned
      Namespace: AWS/DynamoDB
      MetricName: ConsumedReadCapacityUnits
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Sub "${ProvisionedReadCapacity * 0.8 * 300}"
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbWriteCapacityWarn:
    # Remove this resource if using on-demand mode
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-write-capacity-warn"
      AlarmDescription: DynamoDB write capacity above 80% of provisioned
      Namespace: AWS/DynamoDB
      MetricName: ConsumedWriteCapacityUnits
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Sub "${ProvisionedWriteCapacity * 0.8 * 300}"
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbThrottledRequests:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-throttled"
      AlarmDescription: DynamoDB throttled requests - requests being delayed
      Namespace: AWS/DynamoDB
      MetricName: ThrottledRequests
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "table_name"                { type = string }
variable "sns_topic_arn"              { type = string }
variable "provisioned_read_capacity"  { type = number; default = 100 }
variable "provisioned_write_capacity" { type = number; default = 100 }
# Set provisioned_read/write_capacity to 0 if using on-demand mode and remove capacity alarms

resource "aws_cloudwatch_metric_alarm" "dynamodb_system_errors" {
  alarm_name          = "${var.table_name}-system-errors"
  alarm_description   = "DynamoDB system errors - possible AWS service issue"
  namespace           = "AWS/DynamoDB"
  metric_name         = "SystemErrors"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_user_errors" {
  alarm_name          = "${var.table_name}-user-errors"
  alarm_description   = "DynamoDB user errors - bad requests or auth issues"
  namespace           = "AWS/DynamoDB"
  metric_name         = "UserErrors"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_read_capacity_warn" {
  # Remove this block if using on-demand mode
  alarm_name          = "${var.table_name}-read-capacity-warn"
  alarm_description   = "DynamoDB consumed read capacity above 80% of provisioned"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ConsumedReadCapacityUnits"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  # Threshold = 80% of provisioned RCU * period seconds
  threshold           = var.provisioned_read_capacity * 0.8 * 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_write_capacity_warn" {
  # Remove this block if using on-demand mode
  alarm_name          = "${var.table_name}-write-capacity-warn"
  alarm_description   = "DynamoDB consumed write capacity above 80% of provisioned"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ConsumedWriteCapacityUnits"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.provisioned_write_capacity * 0.8 * 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_throttled" {
  alarm_name          = "${var.table_name}-throttled"
  alarm_description   = "DynamoDB requests being throttled"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ThrottledRequests"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

🔆

9. ElastiCache (Redis)

Redis is often invisible until it fails — then everything that depends on it slows down or crashes. Low cache hit rate means your backend database is absorbing all the traffic Redis should be handling.

Metric	Threshold	Period	Eval Periods	Severity	Why It Matters
`CPUUtilization`	> 80%	5 min	2	WARN	Redis single-threaded; high CPU causes latency spikes
`FreeableMemory`	< 100 MB	5 min	2	WARN	Redis evicting keys; cache effectiveness dropping
`CacheHitRate`	< 0.8 (80%)	5 min	3	WARN	Cache not effective; DB taking excessive load
`CurrConnections`	> 1000	5 min	2	WARN	High connection count; connection pool exhaustion possible
`ReplicationLag`	> 60 s	1 min	2	WARN	Replica falling behind primary; stale reads from replica

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  CacheClusterId:
    Type: String
    Default: YOUR_CACHE_CLUSTER_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  RedisCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-cpu-warn"
      AlarmDescription: ElastiCache CPU above 80%
      Namespace: AWS/ElastiCache
      MetricName: CPUUtilization
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-memory-warn"
      AlarmDescription: ElastiCache freeable memory below 100 MB - keys may be evicted
      Namespace: AWS/ElastiCache
      MetricName: FreeableMemory
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 104857600
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCacheHitRateWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-hit-rate-warn"
      AlarmDescription: ElastiCache cache hit rate below 80% - DB taking excessive load
      Namespace: AWS/ElastiCache
      MetricName: CacheHitRate
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0.8
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCurrConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-connections-warn"
      AlarmDescription: ElastiCache connections above 1000
      Namespace: AWS/ElastiCache
      MetricName: CurrConnections
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisReplicationLagWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-replication-lag"
      AlarmDescription: ElastiCache replication lag above 60 seconds
      Namespace: AWS/ElastiCache
      MetricName: ReplicationLag
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 60
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

variable "cache_cluster_id" { type = string }
variable "sns_topic_arn"     { type = string }

resource "aws_cloudwatch_metric_alarm" "redis_cpu_warn" {
  alarm_name          = "${var.cache_cluster_id}-cpu-warn"
  alarm_description   = "ElastiCache CPU above 80%"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CPUUtilization"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_memory_warn" {
  alarm_name          = "${var.cache_cluster_id}-memory-warn"
  alarm_description   = "ElastiCache freeable memory below 100 MB"
  namespace           = "AWS/ElastiCache"
  metric_name         = "FreeableMemory"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 104857600  # 100 MB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_hit_rate_warn" {
  alarm_name          = "${var.cache_cluster_id}-hit-rate-warn"
  alarm_description   = "ElastiCache cache hit rate below 80%"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CacheHitRate"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 0.8
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_connections_warn" {
  alarm_name          = "${var.cache_cluster_id}-connections-warn"
  alarm_description   = "ElastiCache connections above 1000"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CurrConnections"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 1000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_replication_lag" {
  alarm_name          = "${var.cache_cluster_id}-replication-lag"
  alarm_description   = "ElastiCache replication lag above 60 seconds"
  namespace           = "AWS/ElastiCache"
  metric_name         = "ReplicationLag"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 60
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

💰

10. Cost & Budget Alerts

Cost alerts use AWS Budgets, not CloudWatch. They notify you when actual or forecasted spend crosses a threshold — giving you time to investigate before the bill arrives.

ⓘ This is AWS Budgets, not CloudWatch. The resources below are AWS::Budgets::Budget (CloudFormation) and aws_budgets_budget (Terraform), not CloudWatch alarms. They still send alerts to email or SNS.

Alert Type	Threshold	Type	Severity	Why It Matters
Monthly spend actual	80% of budget	ACTUAL	WARN	Early warning to review usage before hitting budget
Monthly spend actual	100% of budget	ACTUAL	CRITICAL	Budget exceeded — take action now
Monthly spend forecasted	100% of budget	FORECASTED	WARN	Projected to exceed budget by month end
Anomaly detection	$50 above expected	ANOMALY	WARN	Unusual spending pattern — runaway resource possible

☁️ CloudFormation YAML

🟣 Terraform HCL

Parameters:
  MonthlyBudgetAmount:
    Type: Number
    Default: 100
    Description: Monthly AWS budget in USD
  AlertEmail:
    Type: String
    Default: you@yourcompany.com
    Description: Email for budget alerts

Resources:
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-aws-budget
        BudgetType: COST
        TimeUnit: MONTHLY
        BudgetLimit:
          Amount: !Ref MonthlyBudgetAmount
          Unit: USD
      NotificationsWithSubscribers:
        # 80% actual spend warning
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        # 100% actual spend - critical
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        # Forecasted to exceed 100%
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail

  # Cost Anomaly Detection
  # Note: AWS::CE::AnomalyMonitor and AnomalySubscription are separate resources
  CostAnomalyMonitor:
    Type: AWS::CE::AnomalyMonitor
    Properties:
      MonitorName: aws-cost-anomaly-monitor
      MonitorType: DIMENSIONAL
      MonitorDimension: SERVICE

  CostAnomalySubscription:
    Type: AWS::CE::AnomalySubscription
    Properties:
      SubscriptionName: cost-anomaly-alerts
      MonitorArnList:
        - !GetAtt CostAnomalyMonitor.MonitorArn
      Subscribers:
        - Address: !Ref AlertEmail
          Type: EMAIL
      Threshold: 50
      Frequency: DAILY

variable "monthly_budget_amount" {
  type        = number
  default     = 100
  description = "Monthly AWS budget in USD"
}

variable "alert_email" {
  type        = string
  description = "Email for budget alerts"
}

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-aws-budget"
  budget_type  = "COST"
  limit_amount = var.monthly_budget_amount
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # 80% actual spend - warning
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }

  # 100% actual spend - critical
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }

  # Forecasted to exceed budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.alert_email]
  }
}

# Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "main" {
  name         = "aws-cost-anomaly-monitor"
  monitor_type = "DIMENSIONAL"

  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "main" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [aws_ce_anomaly_monitor.main.arn]

  subscriber {
    address = var.alert_email
    type    = "EMAIL"
  }

  # Alert when spend is $50 above expected
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["50"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

Alarms set up. What happens when they fire?

ConvOps sends CloudWatch alarms to WhatsApp or Slack with AI root cause analysis. Investigate and act from your phone — no laptop needed.

Try ConvOps Free — 2 minutes to connect

No credit card. Works with the alarms you just set up.