
⚙️ Setup Parameters — Read This First

Every snippet uses placeholder values. Replace them before deploying:

  • YOUR_SNS_TOPIC_ARN — ARN of your SNS topic (e.g. arn:aws:sns:eu-central-1:123456789012:alerts)
  • YOUR_CLUSTER_NAME / YOUR_SERVICE_NAME — ECS cluster and service names
  • YOUR_INSTANCE_ID — EC2 instance ID (e.g. i-0abc123def456789)
  • YOUR_DB_INSTANCE_ID — RDS DB instance identifier
  • YOUR_FUNCTION_NAME — Lambda function name
  • YOUR_ALB_SUFFIX — Part after loadbalancer/ in ALB ARN (e.g. app/my-alb/abc123def456)
  • YOUR_API_NAME / YOUR_STAGE — API Gateway name and stage (e.g. prod)
  • YOUR_QUEUE_NAME — SQS queue name
  • YOUR_TABLE_NAME — DynamoDB table name
  • YOUR_CACHE_CLUSTER_ID — ElastiCache cluster ID
  • YOUR_MONTHLY_BUDGET — Your monthly AWS budget in USD
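If the snippets live in files, the placeholders can be swapped in one pass with sed. The file name alarms.yaml and the substituted values below are examples from this guide, not anything AWS-defined:

```shell
# Bulk-replace placeholders before deploying.
# GNU sed shown; on macOS use `sed -i ''` instead of `sed -i`.
sed -i \
  -e 's|YOUR_SNS_TOPIC_ARN|arn:aws:sns:eu-central-1:123456789012:alerts|g' \
  -e 's|YOUR_CLUSTER_NAME|my-cluster|g' \
  -e 's|YOUR_SERVICE_NAME|my-service|g' \
  alarms.yaml
```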

📫 Create an SNS topic that emails you

☁️ CloudFormation YAML
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  AlertEmail:
    Type: String
    Description: Email address to receive CloudWatch alerts

Resources:
  AlertsTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: infra-alerts
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

Outputs:
  SnsTopicArn:
    Value: !Ref AlertsTopic
    Description: Use this ARN as YOUR_SNS_TOPIC_ARN in all alarm snippets below
🟣 Terraform HCL
variable "alert_email" {
  description = "Email address to receive alerts"
  type        = string
}

resource "aws_sns_topic" "alerts" {
  name = "infra-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Use aws_sns_topic.alerts.arn as var.sns_topic_arn in alarm resources below
💡 After deploying, AWS sends a confirmation email. Click "Confirm subscription" in that email — alarms won't deliver until you do.
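You can verify the pipeline from the command line too (the topic ARN below is the example value from the setup table). An unconfirmed subscription shows SubscriptionArn as PendingConfirmation, and once confirmed a test publish should land in your inbox:

```shell
# Check whether the email subscription is still pending confirmation
aws sns list-subscriptions-by-topic \
  --topic-arn arn:aws:sns:eu-central-1:123456789012:alerts

# Send a test message end-to-end once confirmed
aws sns publish \
  --topic-arn arn:aws:sns:eu-central-1:123456789012:alerts \
  --subject "Test alert" \
  --message "If you can read this, SNS delivery works."
```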
🐳 1. ECS — Elastic Container Service

ECS containers can silently exhaust CPU/memory or stop running without the load balancer health check catching it in time. These alarms detect saturation and task crashes before users are impacted.

| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 2 | WARN | Sustained CPU pressure — scale before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Tasks CPU-throttled; latency spikes imminent |
| MemoryUtilization | > 85% | 5 min | 2 | WARN | Memory pressure building; OOM kill possible |
| MemoryUtilization | > 95% | 5 min | 2 | CRITICAL | Near OOM; task will be killed and restarted |
| RunningTaskCount | < desired count | 1 min | 1 | CRITICAL | Tasks crashed and not recovering; service may be down |
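A quick way to read the Period and Eval Periods columns: an alarm fires only after the metric breaches its threshold for that many consecutive periods. A tiny helper (illustrative only, not part of any AWS API) makes the timing explicit:

```python
def minutes_until_alarm(period_seconds: int, evaluation_periods: int) -> float:
    """Sustained-breach time before a CloudWatch alarm transitions to ALARM."""
    return period_seconds * evaluation_periods / 60

# The ECS CPU WARN alarm (period 300 s, 2 evaluation periods) needs
# 10 minutes of sustained breach before it fires:
print(minutes_until_alarm(300, 2))  # 10.0
# The task-count alarm (60 s, 1 period) fires within a minute:
print(minutes_until_alarm(60, 1))   # 1.0
```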
☁️ CloudFormation YAML
Parameters:
  ClusterName:
    Type: String
    Default: YOUR_CLUSTER_NAME
  ServiceName:
    Type: String
    Default: YOUR_SERVICE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DesiredTaskCount:
    Type: Number
    Default: 2
    Description: Alarm when running tasks fall below this number

Resources:
  EcsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-warn"
      AlarmDescription: ECS CPU utilization above 80% for 10 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-critical"
      AlarmDescription: ECS CPU above 95% - tasks are throttled
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-warn"
      AlarmDescription: ECS memory utilization above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-critical"
      AlarmDescription: ECS memory utilization above 95% - OOM kill imminent
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsRunningTasksCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-tasks-critical"
      AlarmDescription: Running task count below desired - service may be down (requires Container Insights)
      Namespace: ECS/ContainerInsights
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref DesiredTaskCount
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
🟣 Terraform HCL
variable "cluster_name"  { type = string }
variable "service_name"  { type = string }
variable "sns_topic_arn" { type = string }

variable "desired_count" {
  type    = number
  default = 2
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_warn" {
  alarm_name          = "${var.service_name}-cpu-warn"
  alarm_description   = "ECS CPU above 80% for 10 minutes"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_critical" {
  alarm_name          = "${var.service_name}-cpu-critical"
  alarm_description   = "ECS CPU above 95% - tasks throttled"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_warn" {
  alarm_name          = "${var.service_name}-memory-warn"
  alarm_description   = "ECS memory above 85%"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_critical" {
  alarm_name          = "${var.service_name}-memory-critical"
  alarm_description   = "ECS memory above 95% - OOM kill imminent"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_running_tasks" {
  alarm_name          = "${var.service_name}-tasks-critical"
  alarm_description   = "Running tasks below desired count (requires Container Insights)"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  threshold           = var.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}
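Once deployed, the full notification path can be exercised without waiting for a real incident by forcing an alarm into the ALARM state. The alarm name below assumes a service named my-service; the alarm returns to its real state on the next metric evaluation:

```shell
# Force an alarm into ALARM state to verify SNS -> email delivery end to end
aws cloudwatch set-alarm-state \
  --alarm-name my-service-cpu-warn \
  --state-value ALARM \
  --state-reason "Manual end-to-end notification test"
```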
💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →
🖥️ 2. EC2 — Elastic Compute Cloud

EC2 instances can become unresponsive due to hardware failures, runaway processes, or network issues. Status check alarms catch hard failures that Auto Scaling or ELB health checks may miss initially.

| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 85% | 5 min | 3 | WARN | Sustained high CPU; investigate before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Instance at capacity; requests will queue or fail |
| StatusCheckFailed | > 0 | 1 min | 2 | CRITICAL | Instance or system check failing — likely unresponsive |
| StatusCheckFailed_System | > 0 | 1 min | 2 | CRITICAL | AWS hardware issue — instance may need recovery |
| NetworkIn | < 1000 bytes/period | 5 min | 3 | WARN | Traffic dropped to near-zero — instance may be isolated |
☁️ CloudFormation YAML
Parameters:
  InstanceId:
    Type: String
    Default: YOUR_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Ec2CpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-warn"
      AlarmDescription: EC2 CPU above 85% for 15 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2CpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-critical"
      AlarmDescription: EC2 CPU above 95% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-failed"
      AlarmDescription: EC2 status check failed - instance may be unresponsive
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailedSystem:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-system"
      AlarmDescription: EC2 system status check failed - AWS hardware issue
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2NetworkInDrop:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-network-in-drop"
      AlarmDescription: EC2 NetworkIn near zero - traffic may have stopped
      Namespace: AWS/EC2
      MetricName: NetworkIn
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
🟣 Terraform HCL
variable "instance_id"   { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_warn" {
  alarm_name          = "${var.instance_id}-cpu-warn"
  alarm_description   = "EC2 CPU above 85% for 15 minutes"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_critical" {
  alarm_name          = "${var.instance_id}-cpu-critical"
  alarm_description   = "EC2 CPU above 95%"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check" {
  alarm_name          = "${var.instance_id}-status-check"
  alarm_description   = "EC2 status check failed"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check_system" {
  alarm_name          = "${var.instance_id}-status-check-system"
  alarm_description   = "EC2 system status check failed - hardware issue"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_network_in_drop" {
  alarm_name          = "${var.instance_id}-network-in-drop"
  alarm_description   = "EC2 NetworkIn near zero - traffic may have stopped"
  namespace           = "AWS/EC2"
  metric_name         = "NetworkIn"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 1000
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}
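Beyond notifying, a system status check alarm can trigger EC2's built-in recover action, which moves the instance to healthy hardware. A sketch of a separate recovery alarm in Terraform — the eu-central-1 region in the action ARN is an example, and automatic recovery is only supported for certain instance types and configurations:

```hcl
# Optional: auto-recover the instance on a *system* status check failure,
# in addition to notifying via SNS.
resource "aws_cloudwatch_metric_alarm" "ec2_auto_recover" {
  alarm_name          = "${var.instance_id}-auto-recover"
  alarm_description   = "System status check failed - trigger EC2 recovery"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  alarm_actions = [
    "arn:aws:automate:eu-central-1:ec2:recover",
    var.sns_topic_arn,
  ]
}
```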
🗃️ 3. RDS — Relational Database Service

Databases fail silently — connections pile up, disk fills, replicas fall behind. By the time your app throws errors, it's already too late. These alarms give you a 10–30 minute warning window.

max_connections lookup: Set DatabaseConnections threshold to 80% of your instance's max_connections value:
| Instance Class | max_connections | 80% threshold |
|---|---|---|
| db.t3.micro | 87 | 69 |
| db.t3.small | 171 | 136 |
| db.t3.medium | 341 | 272 |
| db.t3.large | 648 | 518 |
| db.r5.large | 1365 | 1092 |
| db.r5.xlarge | 2730 | 2184 |
| db.r5.2xlarge | 5460 | 4368 |
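The 80% column is just max_connections times 0.8, rounded down; for an instance class not listed above, the same arithmetic applies (illustrative Python, not an AWS API):

```python
def connections_threshold(max_connections: int) -> int:
    """80% of max_connections, rounded down, for the DatabaseConnections alarm."""
    return int(max_connections * 0.8)

print(connections_threshold(87))    # db.t3.micro  -> 69
print(connections_threshold(341))   # db.t3.medium -> 272
print(connections_threshold(1365))  # db.r5.large  -> 1092
```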
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 3 | WARN | DB under CPU load; queries slowing down |
| DatabaseConnections | > 80% of max | 5 min | 2 | WARN | Connection pool filling; new connections will fail soon |
| FreeStorageSpace | < 10 GB | 5 min | 2 | WARN | Disk filling; DB will stop accepting writes when full |
| FreeStorageSpace | < 2 GB | 5 min | 1 | CRITICAL | Critically low disk — DB failure imminent |
| ReplicaLag | > 300 s | 1 min | 2 | WARN | Read replica falling behind; stale reads possible |
| FreeableMemory | < 256 MB | 5 min | 3 | WARN | Low memory; buffer pool shrinking, queries slowing |
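CloudWatch expects the FreeStorageSpace and FreeableMemory thresholds in raw bytes, which is where the long numbers in the snippets below come from:

```python
# Binary units: CloudWatch storage/memory thresholds are plain byte counts.
GIB = 1024 ** 3
MIB = 1024 ** 2

print(10 * GIB)   # 10737418240 -> 10 GB disk WARN threshold
print(2 * GIB)    # 2147483648  -> 2 GB disk CRITICAL threshold
print(256 * MIB)  # 268435456   -> 256 MB memory WARN threshold
```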
☁️ CloudFormation YAML
Parameters:
  DbInstanceId:
    Type: String
    Default: YOUR_DB_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  MaxConnectionsThreshold:
    Type: Number
    Default: 272
    Description: 80% of max_connections for your instance class (see table above)

Resources:
  RdsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-cpu-warn"
      AlarmDescription: RDS CPU above 80% for 15 minutes
      Namespace: AWS/RDS
      MetricName: CPUUtilization
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-warn"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Ref MaxConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-warn"
      AlarmDescription: RDS free storage below 10 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10737418240
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-critical"
      AlarmDescription: RDS free storage critically low (below 2 GB)
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 2147483648
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsReplicaLag:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-replica-lag"
      AlarmDescription: RDS read replica lag above 5 minutes (read replicas only)
      Namespace: AWS/RDS
      MetricName: ReplicaLag
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-memory-warn"
      AlarmDescription: RDS freeable memory below 256 MB
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 268435456
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
🟣 Terraform HCL
variable "db_instance_id" { type = string }
variable "sns_topic_arn"  { type = string }

# Set max_connections_threshold to 80% of your instance's max_connections
# db.t3.micro=69, db.t3.small=136, db.t3.medium=272, db.r5.large=1092
variable "max_connections_threshold" {
  type    = number
  default = 272
}

resource "aws_cloudwatch_metric_alarm" "rds_cpu_warn" {
  alarm_name          = "${var.db_instance_id}-cpu-warn"
  alarm_description   = "RDS CPU above 80% for 15 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_connections_warn" {
  alarm_name          = "${var.db_instance_id}-connections-warn"
  alarm_description   = "RDS connections above 80% of max_connections"
  namespace           = "AWS/RDS"
  metric_name         = "DatabaseConnections"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.max_connections_threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_warn" {
  alarm_name          = "${var.db_instance_id}-disk-warn"
  alarm_description   = "RDS free storage below 10 GB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10737418240  # 10 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_critical" {
  alarm_name          = "${var.db_instance_id}-disk-critical"
  alarm_description   = "RDS free storage critically low (below 2 GB)"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 2147483648  # 2 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_replica_lag" {
  # Apply only to read replicas
  alarm_name          = "${var.db_instance_id}-replica-lag"
  alarm_description   = "RDS replica lag above 5 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_memory_warn" {
  alarm_name          = "${var.db_instance_id}-memory-warn"
  alarm_description   = "RDS freeable memory below 256 MB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeableMemory"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 268435456  # 256 MB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
λ 4. Lambda

Lambda errors are silent by default — your function fails and nothing tells you. Throttles mean requests are being dropped. Duration alerts catch runaway executions before they eat your budget.

Duration threshold: Set the Duration alarm threshold to 80% of your function's configured timeout. For example, if your timeout is 30 seconds, set threshold to 24000ms (24 seconds). You must set this manually — there is no automatic way to reference the function timeout in a CloudWatch alarm.
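The same arithmetic as a quick sanity check (illustrative Python, not part of any AWS API):

```python
def duration_threshold_ms(timeout_seconds: int, fraction: float = 0.8) -> int:
    """Duration alarm threshold in ms as a fraction of the Lambda timeout."""
    return int(timeout_seconds * fraction * 1000)

print(duration_threshold_ms(30))  # 24000 -> 30 s timeout
print(duration_threshold_ms(15))  # 12000 -> 15 s timeout
print(duration_threshold_ms(5))   # 4000  -> 5 s timeout
```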
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| Errors | > 0 | 1 min | 1 | WARN | Any function error — investigate immediately |
| Errors | > 5 | 1 min | 2 | CRITICAL | Repeated errors — function may be completely broken |
| Throttles | > 0 | 1 min | 2 | WARN | Requests being dropped due to concurrency limit |
| Duration | > 80% of timeout | 1 min | 2 | WARN | Function nearing timeout; will fail if trend continues |
| ConcurrentExecutions | > 800 (80% of default 1000) | 1 min | 2 | WARN | Approaching account concurrency limit; throttles incoming |
☁️ CloudFormation YAML
Parameters:
  FunctionName:
    Type: String
    Default: YOUR_FUNCTION_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DurationThresholdMs:
    Type: Number
    Default: 24000
    Description: |
      80% of your function timeout in ms.
      e.g. 30s timeout -> 24000ms, 15s timeout -> 12000ms, 5s timeout -> 4000ms

Resources:
  LambdaErrorsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-warn"
      AlarmDescription: Lambda function errors detected
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  LambdaErrorsCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-critical"
      AlarmDescription: Lambda function errors above 5 - may be completely broken
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaThrottlesWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-throttles"
      AlarmDescription: Lambda throttles detected - requests being dropped
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaDurationWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-duration-warn"
      AlarmDescription: !Sub "Lambda duration above 80% of timeout (${DurationThresholdMs}ms)"
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 2
      Threshold: !Ref DurationThresholdMs
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaConcurrencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-concurrency-warn"
      AlarmDescription: Lambda concurrent executions above 800 (80% of default limit 1000)
      Namespace: AWS/Lambda
      MetricName: ConcurrentExecutions
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 800
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
🟣 Terraform HCL
variable "function_name"       { type = string }
variable "sns_topic_arn"        { type = string }
variable "duration_threshold_ms" {
  type        = number
  default     = 24000
  description = "80% of function timeout in ms. e.g. 30s timeout -> 24000"
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_warn" {
  alarm_name          = "${var.function_name}-errors-warn"
  alarm_description   = "Lambda errors detected"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_critical" {
  alarm_name          = "${var.function_name}-errors-critical"
  alarm_description   = "Lambda errors above 5"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_throttles" {
  alarm_name          = "${var.function_name}-throttles"
  alarm_description   = "Lambda throttles - requests being dropped"
  namespace           = "AWS/Lambda"
  metric_name         = "Throttles"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration_warn" {
  alarm_name          = "${var.function_name}-duration-warn"
  alarm_description   = "Lambda p99 duration above 80% of timeout"
  namespace           = "AWS/Lambda"
  metric_name         = "Duration"
  dimensions          = { FunctionName = var.function_name }
  extended_statistic  = "p99"
  period              = 60
  evaluation_periods  = 2
  threshold           = var.duration_threshold_ms
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_concurrency_warn" {
  alarm_name          = "${var.function_name}-concurrency-warn"
  alarm_description   = "Lambda concurrent executions above 800 (80% of default limit)"
  namespace           = "AWS/Lambda"
  metric_name         = "ConcurrentExecutions"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 800
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →
5. ALB — Application Load Balancer

Your load balancer is the front door to your application. 5XX errors mean backends are failing. Unhealthy hosts mean containers are crashing. These alarms catch both.

Finding your ALB suffix: In the AWS console, go to EC2 → Load Balancers, click your ALB, and copy the ARN. The suffix is everything after loadbalancer/ (e.g. app/my-alb/abc123def456).
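If Terraform manages the ALB, you can skip the console copy-paste entirely. A sketch (the resource name `this` and the example ARN are placeholders, not part of the snippets below):

```hcl
# Option 1: aws_lb exports the CloudWatch dimension value directly:
#   dimensions = { LoadBalancer = aws_lb.this.arn_suffix }

# Option 2: if you only have the full ARN, split off everything after "loadbalancer/":
locals {
  alb_arn    = "arn:aws:elasticloadbalancing:eu-central-1:123456789012:loadbalancer/app/my-alb/abc123def456"
  alb_suffix = split("loadbalancer/", local.alb_arn)[1] # "app/my-alb/abc123def456"
}
```

Using `arn_suffix` keeps the alarms correct even if the load balancer is ever recreated with a new ARN.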
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
| --- | --- | --- | --- | --- | --- |
| HTTPCode_Target_5XX_Count | > 0 | 1 min | 2 | WARN | Backend returning server errors |
| HTTPCode_Target_5XX_Count | > 10 | 1 min | 2 | CRITICAL | High rate of 5XX — backend likely down |
| TargetResponseTime | > 2 s | 5 min | 3 | WARN | Slow responses — users experiencing latency |
| TargetResponseTime | > 5 s | 5 min | 2 | CRITICAL | Very slow responses — likely timing out for users |
| UnHealthyHostCount | > 0 | 1 min | 2 | CRITICAL | Targets failing health checks — service degraded |
| RejectedConnectionCount | > 0 | 1 min | 2 | WARN | ALB at max connections — requests being dropped |
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  AlbSuffix:
    Type: String
    Default: YOUR_ALB_SUFFIX
    Description: e.g. app/my-alb/abc123def456 (after "loadbalancer/" in the ARN)
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Alb5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-warn-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors detected
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-critical-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors above 10 per minute
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-warn-${AlbSuffix}"
      AlarmDescription: ALB target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-critical-${AlbSuffix}"
      AlarmDescription: ALB target response time above 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbUnhealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-unhealthy-hosts-${AlbSuffix}"
      AlarmDescription: ALB unhealthy target count above zero
      Namespace: AWS/ApplicationELB
      MetricName: UnHealthyHostCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbRejectedConnections:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-rejected-connections-${AlbSuffix}"
      AlarmDescription: ALB rejected connections - load balancer at max capacity
      Namespace: AWS/ApplicationELB
      MetricName: RejectedConnectionCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
variable "alb_suffix"    { type = string }  # e.g. "app/my-alb/abc123def456"
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "alb_5xx_warn" {
  alarm_name          = "alb-5xx-warn"
  alarm_description   = "ALB 5XX errors detected"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_5xx_critical" {
  alarm_name          = "alb-5xx-critical"
  alarm_description   = "ALB 5XX errors above 10/min"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_latency_warn" {
  alarm_name          = "alb-latency-warn"
  alarm_description   = "ALB p99 response time above 2 seconds"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  dimensions          = { LoadBalancer = var.alb_suffix }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 3
  threshold           = 2
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_latency_critical" {
  alarm_name          = "alb-latency-critical"
  alarm_description   = "ALB p99 response time above 5 seconds"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  dimensions          = { LoadBalancer = var.alb_suffix }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
  alarm_name          = "alb-unhealthy-hosts"
  alarm_description   = "ALB unhealthy targets detected"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "UnHealthyHostCount"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "alb_rejected_connections" {
  alarm_name          = "alb-rejected-connections"
  alarm_description   = "ALB at max connections - requests being dropped"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "RejectedConnectionCount"
  dimensions          = { LoadBalancer = var.alb_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
🌐
6. API Gateway

API Gateway has a hard 29-second timeout limit. If your backends are slow, requests will silently time out. 5XX and 4XX errors can indicate broken integrations or client misconfigurations at scale.

Traffic drop detection: Detecting a sudden drop in request Count requires metric math (comparing current Count to a rolling average). Standard CloudWatch alarms can't do this natively — use CloudWatch Anomaly Detection or external monitoring for this alarm. The snippets below cover the simpler threshold-based alarms.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
| --- | --- | --- | --- | --- | --- |
| 5XXError | > 5 | 1 min | 2 | WARN | Backend integration errors; Lambda or HTTP backend failing |
| 4XXError | > 50 | 5 min | 3 | WARN | High client error rate; API misuse or broken client |
| Latency | > 3000 ms (p99) | 5 min | 3 | WARN | Slow backend responses; users experiencing delays |
| Latency | > 10000 ms (avg) | 5 min | 2 | CRITICAL | Near 29 s timeout; requests will start failing |
| Count | sudden drop > 50% | n/a | n/a | WARN | Requires metric math / anomaly detection (see note above) |
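The sudden-drop alarm can be expressed with CloudWatch Anomaly Detection, which standard threshold alarms cannot replicate. A Terraform sketch, using the same variables as the Terraform tab; the band width of 2 standard deviations and the evaluation settings are assumptions to tune:

```hcl
# Sketch: fire when request Count falls below the anomaly model's lower band.
resource "aws_cloudwatch_metric_alarm" "apigw_count_drop" {
  alarm_name          = "${var.api_name}-${var.stage}-count-drop"
  alarm_description   = "API Gateway request count below expected baseline"
  comparison_operator = "LessThanLowerThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "band"
  treat_missing_data  = "breaching" # no data here likely means traffic stopped
  alarm_actions       = [var.sns_topic_arn]

  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(count, 2)"
    label       = "Expected request count"
    return_data = true
  }

  metric_query {
    id          = "count"
    return_data = true
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "Count"
      period      = 300
      stat        = "Sum"
      dimensions  = { ApiName = var.api_name, Stage = var.stage }
    }
  }
}
```

The same pattern works for other drop-detection cases in this guide, such as SQS `NumberOfMessagesSent`; swap in the namespace, metric, and dimensions.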
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  ApiName:
    Type: String
    Default: YOUR_API_NAME
  Stage:
    Type: String
    Default: prod
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  ApiGw5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-5xx-warn"
      AlarmDescription: API Gateway 5XX errors above 5 per minute
      Namespace: AWS/ApiGateway
      MetricName: 5XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  ApiGw4xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-4xx-warn"
      AlarmDescription: API Gateway 4XX errors above 50 per 5 minutes
      Namespace: AWS/ApiGateway
      MetricName: 4XXError
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-warn"
      AlarmDescription: API Gateway p99 latency above 3 seconds
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 3000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  ApiGwLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ApiName}-${Stage}-latency-critical"
      AlarmDescription: API Gateway latency above 10 seconds - near 29s timeout
      Namespace: AWS/ApiGateway
      MetricName: Latency
      Dimensions:
        - Name: ApiName
          Value: !Ref ApiName
        - Name: Stage
          Value: !Ref Stage
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
variable "api_name"      { type = string }
variable "sns_topic_arn" { type = string }

variable "stage" {
  type    = string
  default = "prod"
}

resource "aws_cloudwatch_metric_alarm" "apigw_5xx_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-5xx-warn"
  alarm_description   = "API Gateway 5XX errors above 5/min"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_4xx_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-4xx-warn"
  alarm_description   = "API Gateway 4XX high volume"
  namespace           = "AWS/ApiGateway"
  metric_name         = "4XXError"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 50
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_latency_warn" {
  alarm_name          = "${var.api_name}-${var.stage}-latency-warn"
  alarm_description   = "API Gateway p99 latency above 3 seconds"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  extended_statistic  = "p99"
  period              = 300
  evaluation_periods  = 3
  threshold           = 3000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "apigw_latency_critical" {
  alarm_name          = "${var.api_name}-${var.stage}-latency-critical"
  alarm_description   = "API Gateway latency above 10s - near 29s timeout"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiName = var.api_name, Stage = var.stage }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
📩
7. SQS — Simple Queue Service

A backed-up SQS queue means your consumers have stopped or are too slow. Old messages indicate processing failures. Left unattended, queues can grow to millions of messages and take hours to drain.

Traffic drop detection: Detecting a sudden drop in NumberOfMessagesSent requires metric math (comparing to a rolling baseline). Use CloudWatch Anomaly Detection alarms for this — the standard alarm snippets below cover threshold-based alarms only.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
| --- | --- | --- | --- | --- | --- |
| ApproximateNumberOfMessagesVisible | > 1000 | 5 min | 3 | WARN | Queue building up; consumers may be slow or failing |
| ApproximateNumberOfMessagesVisible | > 10000 | 5 min | 2 | CRITICAL | Severe queue backup; consumers definitely failing |
| ApproximateAgeOfOldestMessage | > 300 s | 5 min | 2 | WARN | Messages sitting unprocessed for 5+ minutes |
| ApproximateAgeOfOldestMessage | > 900 s | 5 min | 2 | CRITICAL | Messages 15+ minutes old; SLA likely being breached |
| NumberOfMessagesSent | sudden drop | n/a | n/a | WARN | Requires anomaly detection / metric math (see note above) |
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  QueueName:
    Type: String
    Default: YOUR_QUEUE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  SqsQueueDepthWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-warn"
      AlarmDescription: SQS queue depth above 1000 - consumers may be lagging
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible  # Visible = backlog awaiting processing; NotVisible counts in-flight messages
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  SqsQueueDepthCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-depth-critical"
      AlarmDescription: SQS queue depth above 10000 - severe consumer failure
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible  # Visible = backlog awaiting processing; NotVisible counts in-flight messages
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-warn"
      AlarmDescription: SQS oldest message age above 5 minutes
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  SqsMessageAgeCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${QueueName}-age-critical"
      AlarmDescription: SQS oldest message age above 15 minutes - SLA breach
      Namespace: AWS/SQS
      MetricName: ApproximateAgeOfOldestMessage
      Dimensions:
        - Name: QueueName
          Value: !Ref QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 900
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
variable "queue_name"    { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "sqs_depth_warn" {
  alarm_name          = "${var.queue_name}-depth-warn"
  alarm_description   = "SQS queue depth above 1000"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible" # backlog awaiting processing
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 1000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_depth_critical" {
  alarm_name          = "${var.queue_name}-depth-critical"
  alarm_description   = "SQS queue depth above 10000 - severe consumer failure"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible" # backlog awaiting processing
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_message_age_warn" {
  alarm_name          = "${var.queue_name}-age-warn"
  alarm_description   = "SQS oldest message above 5 minutes old"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "sqs_message_age_critical" {
  alarm_name          = "${var.queue_name}-age-critical"
  alarm_description   = "SQS oldest message above 15 minutes - SLA breach"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = var.queue_name }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 900
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
8. DynamoDB

DynamoDB throttling is silent and cumulative. Throttled requests are retried with exponential backoff, which means your application slows down before it starts failing. Catch throttles early.

On-demand vs provisioned mode: If you use on-demand (PAY_PER_REQUEST) mode, skip the ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits alarms — there is no provisioned limit to alarm against. Keep ThrottledRequests and SystemErrors for all modes.
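Rather than deleting resources by hand, the Terraform variant can gate the capacity alarms on the table's billing mode. A sketch (the `billing_mode` variable is an assumption; `table_name`, `sns_topic_arn`, and `provisioned_read_capacity` match the variables in the Terraform tab):

```hcl
variable "billing_mode" {
  type    = string
  default = "PROVISIONED" # set to "PAY_PER_REQUEST" for on-demand tables
}

resource "aws_cloudwatch_metric_alarm" "read_capacity_gated" {
  # count = 0 skips this alarm entirely for on-demand tables
  count = var.billing_mode == "PROVISIONED" ? 1 : 0

  alarm_name          = "${var.table_name}-read-capacity-warn"
  alarm_description   = "DynamoDB consumed read capacity above 80% of provisioned"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ConsumedReadCapacityUnits"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.provisioned_read_capacity * 0.8 * 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
```

The write-capacity alarm can be gated the same way; switching a table to on-demand then only requires flipping one variable.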
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
| --- | --- | --- | --- | --- | --- |
| SystemErrors | > 0 | 1 min | 2 | CRITICAL | AWS-side DynamoDB errors; likely service issue |
| UserErrors | > 0 | 5 min | 3 | WARN | Client-side errors (bad requests, auth issues) |
| ConsumedReadCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Read capacity filling up (provisioned mode only) |
| ConsumedWriteCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Write capacity filling up (provisioned mode only) |
| ThrottledRequests | > 0 | 5 min | 2 | WARN | Requests being throttled; app latency increasing |
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  TableName:
    Type: String
    Default: YOUR_TABLE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  ProvisionedReadCapacity:
    Type: Number
    Default: 100
    Description: Provisioned RCU, used to hand-compute the read threshold below (skip for on-demand mode)
  ProvisionedWriteCapacity:
    Type: Number
    Default: 100
    Description: Provisioned WCU, used to hand-compute the write threshold below (skip for on-demand mode)

Resources:
  DynamoDbSystemErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-system-errors"
      AlarmDescription: DynamoDB system errors detected - possible AWS service issue
      Namespace: AWS/DynamoDB
      MetricName: SystemErrors
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbUserErrors:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-user-errors"
      AlarmDescription: DynamoDB user errors - bad requests or auth issues
      Namespace: AWS/DynamoDB
      MetricName: UserErrors
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbReadCapacityWarn:
    # Remove this resource if using on-demand mode
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-read-capacity-warn"
      AlarmDescription: DynamoDB read capacity above 80% of provisioned
      Namespace: AWS/DynamoDB
      MetricName: ConsumedReadCapacityUnits
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      # CloudFormation cannot evaluate arithmetic inside !Sub; precompute
      # ProvisionedReadCapacity x 0.8 x Period (default 100 x 0.8 x 300 = 24000)
      Threshold: 24000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbWriteCapacityWarn:
    # Remove this resource if using on-demand mode
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-write-capacity-warn"
      AlarmDescription: DynamoDB write capacity above 80% of provisioned
      Namespace: AWS/DynamoDB
      MetricName: ConsumedWriteCapacityUnits
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      # CloudFormation cannot evaluate arithmetic inside !Sub; precompute
      # ProvisionedWriteCapacity x 0.8 x Period (default 100 x 0.8 x 300 = 24000)
      Threshold: 24000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  DynamoDbThrottledRequests:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${TableName}-throttled"
      AlarmDescription: DynamoDB throttled requests - requests being delayed
      Namespace: AWS/DynamoDB
      MetricName: ThrottledRequests
      Dimensions:
        - Name: TableName
          Value: !Ref TableName
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
variable "table_name"    { type = string }
variable "sns_topic_arn" { type = string }

# Set these to your table's provisioned capacity; remove the capacity alarms below if using on-demand mode
variable "provisioned_read_capacity" {
  type    = number
  default = 100
}

variable "provisioned_write_capacity" {
  type    = number
  default = 100
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_system_errors" {
  alarm_name          = "${var.table_name}-system-errors"
  alarm_description   = "DynamoDB system errors - possible AWS service issue"
  namespace           = "AWS/DynamoDB"
  metric_name         = "SystemErrors"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_user_errors" {
  alarm_name          = "${var.table_name}-user-errors"
  alarm_description   = "DynamoDB user errors - bad requests or auth issues"
  namespace           = "AWS/DynamoDB"
  metric_name         = "UserErrors"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_read_capacity_warn" {
  # Remove this block if using on-demand mode
  alarm_name          = "${var.table_name}-read-capacity-warn"
  alarm_description   = "DynamoDB consumed read capacity above 80% of provisioned"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ConsumedReadCapacityUnits"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  # Threshold = 80% of provisioned RCU * period seconds
  threshold           = var.provisioned_read_capacity * 0.8 * 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_write_capacity_warn" {
  # Remove this block if using on-demand mode
  alarm_name          = "${var.table_name}-write-capacity-warn"
  alarm_description   = "DynamoDB consumed write capacity above 80% of provisioned"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ConsumedWriteCapacityUnits"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.provisioned_write_capacity * 0.8 * 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "dynamodb_throttled" {
  alarm_name          = "${var.table_name}-throttled"
  alarm_description   = "DynamoDB requests being throttled"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ThrottledRequests"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
🔆
9. ElastiCache (Redis)

Redis is often invisible until it fails — then everything that depends on it slows down or crashes. Low cache hit rate means your backend database is absorbing all the traffic Redis should be handling.

| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
| --- | --- | --- | --- | --- | --- |
| CPUUtilization | > 80% | 5 min | 2 | WARN | Redis is single-threaded; high CPU causes latency spikes |
| FreeableMemory | < 100 MB | 5 min | 2 | WARN | Redis evicting keys; cache effectiveness dropping |
| CacheHitRate | < 0.8 (80%) | 5 min | 3 | WARN | Cache not effective; DB taking excessive load |
| CurrConnections | > 1000 | 5 min | 2 | WARN | High connection count; connection pool exhaustion possible |
| ReplicationLag | > 60 s | 1 min | 2 | WARN | Replica falling behind primary; stale reads from replica |
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  CacheClusterId:
    Type: String
    Default: YOUR_CACHE_CLUSTER_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  RedisCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-cpu-warn"
      AlarmDescription: ElastiCache CPU above 80%
      Namespace: AWS/ElastiCache
      MetricName: CPUUtilization
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-memory-warn"
      AlarmDescription: ElastiCache freeable memory below 100 MB - keys may be evicted
      Namespace: AWS/ElastiCache
      MetricName: FreeableMemory
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 104857600
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCacheHitRateWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-hit-rate-warn"
      AlarmDescription: ElastiCache cache hit rate below 80% - DB taking excessive load
      Namespace: AWS/ElastiCache
      MetricName: CacheHitRate
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 0.8
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisCurrConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-connections-warn"
      AlarmDescription: ElastiCache connections above 1000
      Namespace: AWS/ElastiCache
      MetricName: CurrConnections
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RedisReplicationLagWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${CacheClusterId}-replication-lag"
      AlarmDescription: ElastiCache replication lag above 60 seconds
      Namespace: AWS/ElastiCache
      MetricName: ReplicationLag
      Dimensions:
        - Name: CacheClusterId
          Value: !Ref CacheClusterId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 60
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
variable "cache_cluster_id" { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "redis_cpu_warn" {
  alarm_name          = "${var.cache_cluster_id}-cpu-warn"
  alarm_description   = "ElastiCache CPU above 80%"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CPUUtilization"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_memory_warn" {
  alarm_name          = "${var.cache_cluster_id}-memory-warn"
  alarm_description   = "ElastiCache freeable memory below 100 MB"
  namespace           = "AWS/ElastiCache"
  metric_name         = "FreeableMemory"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 104857600  # 100 MB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_hit_rate_warn" {
  alarm_name          = "${var.cache_cluster_id}-hit-rate-warn"
  alarm_description   = "ElastiCache cache hit rate below 80%"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CacheHitRate"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 0.8
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_connections_warn" {
  alarm_name          = "${var.cache_cluster_id}-connections-warn"
  alarm_description   = "ElastiCache connections above 1000"
  namespace           = "AWS/ElastiCache"
  metric_name         = "CurrConnections"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 1000
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "redis_replication_lag" {
  alarm_name          = "${var.cache_cluster_id}-replication-lag"
  alarm_description   = "ElastiCache replication lag above 60 seconds"
  namespace           = "AWS/ElastiCache"
  metric_name         = "ReplicationLag"
  dimensions          = { CacheClusterId = var.cache_cluster_id }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 60
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
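Two bits of arithmetic hide inside the thresholds above: FreeableMemory is expressed in raw bytes, and an alarm fires only after `evaluation_periods` consecutive breaching datapoints, so the worst-case detection latency is `period × evaluation_periods`. A minimal Python sketch for sanity-checking those numbers before deploying (the helper names are illustrative, not an AWS API):

```python
# Illustrative helpers for double-checking CloudWatch alarm parameters.

def mb_to_bytes(mb: int) -> int:
    """Convert megabytes (MiB) to the raw byte value CloudWatch expects."""
    return mb * 1024 * 1024

def worst_case_detection_seconds(period: int, evaluation_periods: int) -> int:
    """Longest a sustained breach can go unreported: one full datapoint
    per period, and evaluation_periods datapoints must breach in a row."""
    return period * evaluation_periods

# The FreeableMemory alarm above uses 100 MB expressed in bytes:
print(mb_to_bytes(100))                      # 104857600
# Most alarms here use period=300, evaluation_periods=2:
print(worst_case_detection_seconds(300, 2))  # 600 seconds, i.e. 10 minutes
```

If 10 minutes is too slow for your workload, tighten `period` or `evaluation_periods` rather than the threshold, but expect more false positives on short-lived spikes.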
💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →
💰
10. Cost & Budget Alerts

Cost alerts use AWS Budgets, not CloudWatch. They notify you when actual or forecasted spend crosses a threshold — giving you time to investigate before the bill arrives.

Note: the resources below are AWS::Budgets::Budget (CloudFormation) and aws_budgets_budget (Terraform), not CloudWatch alarms. They still deliver alerts to email or SNS.
| Alert Type | Threshold | Type | Severity | Why It Matters |
|---|---|---|---|---|
| Monthly spend actual | 80% of budget | ACTUAL | WARN | Early warning to review usage before hitting budget |
| Monthly spend actual | 100% of budget | ACTUAL | CRITICAL | Budget exceeded — take action now |
| Monthly spend forecasted | 100% of budget | FORECASTED | WARN | Projected to exceed budget by month end |
| Anomaly detection | $50 above expected | ANOMALY | WARN | Unusual spending pattern — runaway resource possible |
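Since the budget notifications are percentage-based, it helps to translate them into the actual dollar amounts at which each alert fires. A trivial Python sketch (the function name is mine, not an AWS API) for a given YOUR_MONTHLY_BUDGET:

```python
# Illustrative helper: dollar trigger points for the budget notifications below.

def budget_trigger_points(monthly_budget: float) -> dict:
    """Map each notification to the spend (USD) at which it fires."""
    return {
        "actual_warn_80pct": monthly_budget * 0.80,
        "actual_critical_100pct": monthly_budget * 1.00,
        "forecasted_warn_100pct": monthly_budget * 1.00,  # forecasted, not actual
    }

# With a $100/month budget, the warning fires at $80 of actual spend:
print(budget_trigger_points(100))
```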
☁️ CloudFormation YAML
🟣 Terraform HCL
Parameters:
  MonthlyBudgetAmount:
    Type: Number
    Default: 100
    Description: Monthly AWS budget in USD
  AlertEmail:
    Type: String
    Default: you@yourcompany.com
    Description: Email for budget alerts

Resources:
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-aws-budget
        BudgetType: COST
        TimeUnit: MONTHLY
        BudgetLimit:
          Amount: !Ref MonthlyBudgetAmount
          Unit: USD
      NotificationsWithSubscribers:
        # 80% actual spend warning
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        # 100% actual spend - critical
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
        # Forecasted to exceed 100%
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
            ThresholdType: PERCENTAGE
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail

  # Cost Anomaly Detection
  # Note: AWS::CE::AnomalyMonitor and AnomalySubscription are separate resources
  CostAnomalyMonitor:
    Type: AWS::CE::AnomalyMonitor
    Properties:
      MonitorName: aws-cost-anomaly-monitor
      MonitorType: DIMENSIONAL
      MonitorDimension: SERVICE

  CostAnomalySubscription:
    Type: AWS::CE::AnomalySubscription
    Properties:
      SubscriptionName: cost-anomaly-alerts
      MonitorArnList:
        - !GetAtt CostAnomalyMonitor.MonitorArn
      Subscribers:
        - Address: !Ref AlertEmail
          Type: EMAIL
      Threshold: 50  # Alert when spend is $50 above expected
      Frequency: DAILY
variable "monthly_budget_amount" {
  type        = number
  default     = 100
  description = "Monthly AWS budget in USD"
}

variable "alert_email" {
  type        = string
  description = "Email for budget alerts"
}

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-aws-budget"
  budget_type  = "COST"
  limit_amount = var.monthly_budget_amount
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # 80% actual spend - warning
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }

  # 100% actual spend - critical
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.alert_email]
  }

  # Forecasted to exceed budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.alert_email]
  }
}

# Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "main" {
  name         = "aws-cost-anomaly-monitor"
  monitor_type = "DIMENSIONAL"

  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "main" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [aws_ce_anomaly_monitor.main.arn]

  subscriber {
    address = var.alert_email
    type    = "EMAIL"
  }

  # Alert when spend is $50 above expected
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["50"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
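The FORECASTED notification relies on AWS's own spend projection, which accounts for historical usage patterns. A naive linear extrapolation in Python (all names here are mine) still illustrates the key point: a forecasted alert can fire mid-month while actual spend is comfortably under budget.

```python
# Naive month-end forecast by linear extrapolation of month-to-date spend.
# AWS Budgets' real forecast is more sophisticated; this only sketches the idea.

def linear_forecast(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Project month-end spend from the average daily burn rate so far."""
    daily_rate = spend_to_date / day_of_month
    return daily_rate * days_in_month

# $60 spent by day 15 of a 30-day month projects to $120, so a $100
# FORECASTED notification would already fire:
print(linear_forecast(60.0, 15, 30))  # 120.0
```

That is exactly why the forecasted alert is listed as WARN rather than CRITICAL: it gives you roughly half a month of lead time to shut down whatever is burning money.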
💡 ConvOps tip: Once these alarms fire, ConvOps sends them to WhatsApp or Slack and lets you investigate & act without leaving your phone. Get started →

Alarms set up. What happens when they fire?

ConvOps sends CloudWatch alarms to WhatsApp or Slack with AI root cause analysis. Investigate and act from your phone — no laptop needed.

Try ConvOps Free — 2 minutes to connect

No credit card. Works with the alarms you just set up.