⚙️ Setup Parameters — Read This First
Every snippet uses placeholder values. Replace them before deploying:
- YOUR_SNS_TOPIC_ARN — ARN of your SNS topic (e.g. arn:aws:sns:eu-central-1:123456789012:alerts)
- YOUR_CLUSTER_NAME / YOUR_SERVICE_NAME — ECS cluster and service names
- YOUR_INSTANCE_ID — EC2 instance ID (e.g. i-0abc123def456789)
- YOUR_DB_INSTANCE_ID — RDS DB instance identifier
- YOUR_FUNCTION_NAME — Lambda function name
- YOUR_ALB_SUFFIX — Part after loadbalancer/ in ALB ARN (e.g. app/my-alb/abc123def456)
- YOUR_API_NAME / YOUR_STAGE — API Gateway name and stage (e.g. prod)
- YOUR_QUEUE_NAME — SQS queue name
- YOUR_TABLE_NAME — DynamoDB table name
- YOUR_CACHE_CLUSTER_ID — ElastiCache cluster ID
- YOUR_MONTHLY_BUDGET — Your monthly AWS budget in USD
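Before deploying, it can save a failed stack to scan rendered templates for leftover placeholders. A minimal sketch (the helper name is mine, not from any library):

```python
import re

def find_placeholders(text: str) -> list[str]:
    """Return any unreplaced YOUR_* placeholder tokens found in a template."""
    return sorted(set(re.findall(r"\bYOUR_[A-Z_]+\b", text)))

template = "TopicArn: YOUR_SNS_TOPIC_ARN\nClusterName: prod-cluster\n"
print(find_placeholders(template))  # ['YOUR_SNS_TOPIC_ARN']
```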
📫 Create an SNS topic that emails you
SNS email subscriptions must be confirmed: click the link in the confirmation email before relying on any of these alarms.
**CloudFormation**

```yaml
AWSTemplateFormatVersion: '2010-09-09'

Parameters:
  AlertEmail:
    Type: String
    Description: Email address to receive CloudWatch alerts

Resources:
  AlertsTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: infra-alerts
      Subscription:
        - Protocol: email
          Endpoint: !Ref AlertEmail

Outputs:
  SnsTopicArn:
    Value: !Ref AlertsTopic
    Description: Use this ARN as YOUR_SNS_TOPIC_ARN in all alarm snippets below
```
**Terraform**

```hcl
variable "alert_email" {
  description = "Email address to receive alerts"
  type        = string
}

resource "aws_sns_topic" "alerts" {
  name = "infra-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Use aws_sns_topic.alerts.arn as var.sns_topic_arn in the alarm resources below
```
🐳 ECS Alarms — CPU, Memory & Task Count
ECS services can silently exhaust CPU or memory, or stop running tasks, without the load balancer health check catching it in time. These alarms detect saturation and task crashes before users are impacted.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 2 | WARN | Sustained CPU pressure — scale before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Tasks CPU-throttled; latency spikes imminent |
| MemoryUtilization | > 85% | 5 min | 2 | WARN | Memory pressure building; OOM kill possible |
| MemoryUtilization | > 95% | 5 min | 2 | CRITICAL | Near OOM; task will be killed and restarted |
| RunningTaskCount | < desired count | 1 min | 1 | CRITICAL | Tasks crashed and not recovering; service may be down (requires Container Insights) |
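The Period and Eval Periods columns combine as "N consecutive breaching datapoints". A simplified model of how CloudWatch evaluates this (it ignores M-of-N datapoints-to-alarm and missing-data handling, so treat it as illustration, not the real evaluation engine):

```python
def alarm_state(datapoints, threshold, evaluation_periods, comparison="GreaterThanThreshold"):
    """Simplified CloudWatch evaluation: ALARM when the most recent
    `evaluation_periods` datapoints all breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    if comparison == "GreaterThanThreshold":
        breach = lambda v: v > threshold
    else:  # LessThanThreshold
        breach = lambda v: v < threshold
    return "ALARM" if all(breach(v) for v in recent) else "OK"

# CPU averages per 5-minute period; 2 consecutive breaches of 80 -> ALARM
print(alarm_state([70, 82, 86], threshold=80, evaluation_periods=2))  # ALARM
```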
**CloudFormation**

```yaml
Parameters:
  ClusterName:
    Type: String
    Default: YOUR_CLUSTER_NAME
  ServiceName:
    Type: String
    Default: YOUR_SERVICE_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DesiredTaskCount:
    Type: Number
    Default: 2
    Description: Alarm when running tasks fall below this number

Resources:
  EcsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-warn"
      AlarmDescription: ECS CPU utilization above 80% for 10 minutes
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  EcsCpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-cpu-critical"
      AlarmDescription: ECS CPU above 95% - tasks are throttled
      Namespace: AWS/ECS
      MetricName: CPUUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-warn"
      AlarmDescription: ECS memory utilization above 85%
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsMemoryCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-memory-critical"
      AlarmDescription: ECS memory utilization above 95% - OOM kill imminent
      Namespace: AWS/ECS
      MetricName: MemoryUtilization
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  EcsRunningTasksCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ServiceName}-tasks-critical"
      AlarmDescription: Running task count below desired - service may be down
      # RunningTaskCount is published by Container Insights, not AWS/ECS;
      # enable Container Insights on the cluster or this alarm gets no data.
      Namespace: ECS/ContainerInsights
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
        - Name: ServiceName
          Value: !Ref ServiceName
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: !Ref DesiredTaskCount
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
```
**Terraform**

```hcl
variable "cluster_name" { type = string }
variable "service_name" { type = string }
variable "sns_topic_arn" { type = string }

variable "desired_count" {
  type    = number
  default = 2
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_warn" {
  alarm_name          = "${var.service_name}-cpu-warn"
  alarm_description   = "ECS CPU above 80% for 10 minutes"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_critical" {
  alarm_name          = "${var.service_name}-cpu-critical"
  alarm_description   = "ECS CPU above 95% - tasks throttled"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_warn" {
  alarm_name          = "${var.service_name}-memory-warn"
  alarm_description   = "ECS memory above 85%"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_critical" {
  alarm_name          = "${var.service_name}-memory-critical"
  alarm_description   = "ECS memory above 95% - OOM kill imminent"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ecs_running_tasks" {
  alarm_name        = "${var.service_name}-tasks-critical"
  alarm_description = "Running tasks below desired count"

  # RunningTaskCount is published by Container Insights, not AWS/ECS;
  # enable Container Insights on the cluster or this alarm gets no data.
  namespace   = "ECS/ContainerInsights"
  metric_name = "RunningTaskCount"

  dimensions          = { ClusterName = var.cluster_name, ServiceName = var.service_name }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  threshold           = var.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}
```
🖥️ EC2 Alarms — Status Checks & Saturation
EC2 instances can become unresponsive due to hardware failures, runaway processes, or network issues. Status-check alarms catch hard failures that Auto Scaling or ELB health checks may miss at first.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 85% | 5 min | 3 | WARN | Sustained high CPU; investigate before saturation |
| CPUUtilization | > 95% | 5 min | 2 | CRITICAL | Instance at capacity; requests will queue or fail |
| StatusCheckFailed | > 0 | 1 min | 2 | CRITICAL | Instance or system check failing — likely unresponsive |
| StatusCheckFailed_System | > 0 | 1 min | 2 | CRITICAL | AWS hardware issue — instance may need recovery |
| NetworkIn | < 1000 bytes/period | 5 min | 3 | WARN | Traffic dropped to near-zero — instance may be isolated |
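For StatusCheckFailed_System, a CloudWatch alarm can trigger EC2's built-in recovery action (`arn:aws:automate:<region>:ec2:recover`) in addition to notifying you; note that recovery is supported only on certain EBS-backed instance types. A hedged sketch of the `put_metric_alarm` arguments (the helper function is mine):

```python
def recover_alarm_params(instance_id: str, region: str, sns_topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm() that both notify
    and trigger EC2 auto-recovery when the system status check fails."""
    return {
        "AlarmName": f"{instance_id}-status-check-system",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # built-in recovery action
            sns_topic_arn,
        ],
    }

params = recover_alarm_params(
    "i-0abc123def456789", "eu-central-1",
    "arn:aws:sns:eu-central-1:123456789012:alerts",
)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```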
**CloudFormation**

```yaml
Parameters:
  InstanceId:
    Type: String
    Default: YOUR_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Ec2CpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-warn"
      AlarmDescription: EC2 CPU above 85% for 15 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Ec2CpuCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-cpu-critical"
      AlarmDescription: EC2 CPU above 95% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 95
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailed:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-failed"
      AlarmDescription: EC2 status check failed - instance may be unresponsive
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2StatusCheckFailedSystem:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-status-check-system"
      AlarmDescription: EC2 system status check failed - AWS hardware issue
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  Ec2NetworkInDrop:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${InstanceId}-network-in-drop"
      AlarmDescription: EC2 NetworkIn near zero - traffic may have stopped
      Namespace: AWS/EC2
      MetricName: NetworkIn
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 3
      Threshold: 1000
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]
```
**Terraform**

```hcl
variable "instance_id" { type = string }
variable "sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_warn" {
  alarm_name          = "${var.instance_id}-cpu-warn"
  alarm_description   = "EC2 CPU above 85% for 15 minutes"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 85
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_cpu_critical" {
  alarm_name          = "${var.instance_id}-cpu-critical"
  alarm_description   = "EC2 CPU above 95%"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 95
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check" {
  alarm_name          = "${var.instance_id}-status-check"
  alarm_description   = "EC2 status check failed"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_status_check_system" {
  alarm_name          = "${var.instance_id}-status-check-system"
  alarm_description   = "EC2 system status check failed - hardware issue"
  namespace           = "AWS/EC2"
  metric_name         = "StatusCheckFailed_System"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "ec2_network_in_drop" {
  alarm_name          = "${var.instance_id}-network-in-drop"
  alarm_description   = "EC2 NetworkIn near zero - traffic may have stopped"
  namespace           = "AWS/EC2"
  metric_name         = "NetworkIn"
  dimensions          = { InstanceId = var.instance_id }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 3
  threshold           = 1000
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}
```
🗄️ RDS Alarms — Connections, Disk & Memory
Databases fail quietly: connections pile up, disk fills, replicas fall behind. By the time your app throws errors, it's already too late. These alarms give you a 10–30 minute warning window.
| Instance Class | max_connections | 80% threshold |
|---|---|---|
| db.t3.micro | 87 | 69 |
| db.t3.small | 171 | 136 |
| db.t3.medium | 341 | 272 |
| db.t3.large | 648 | 518 |
| db.r5.large | 1365 | 1092 |
| db.r5.xlarge | 2730 | 2184 |
| db.r5.2xlarge | 5460 | 4368 |
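The table values track the MySQL/MariaDB default formula, `max_connections = DBInstanceClassMemory / 12582880`. Note that `DBInstanceClassMemory` is somewhat less than the instance's nominal RAM, so treat this as an estimate and verify with `SHOW VARIABLES LIKE 'max_connections';` on your instance. A sketch of the arithmetic:

```python
def mysql_max_connections(instance_memory_gib: float) -> int:
    """Approximate RDS MySQL/MariaDB default:
    max_connections = DBInstanceClassMemory / 12582880.
    DBInstanceClassMemory excludes OS/RDS overhead, so real values
    come in a bit below this estimate."""
    return int(instance_memory_gib * 1024**3 / 12582880)

def connections_warn_threshold(instance_memory_gib: float) -> int:
    """80% of the estimated max_connections."""
    return int(mysql_max_connections(instance_memory_gib) * 0.8)

print(mysql_max_connections(4))       # db.t3.medium (4 GiB) -> 341
print(connections_warn_threshold(4))  # -> 272
```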
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| CPUUtilization | > 80% | 5 min | 3 | WARN | DB under CPU load; queries slowing down |
| DatabaseConnections | > 80% of max | 5 min | 2 | WARN | Connection pool filling; new connections will fail soon |
| FreeStorageSpace | < 10 GB | 5 min | 2 | WARN | Disk filling; DB will stop accepting writes when full |
| FreeStorageSpace | < 2 GB | 5 min | 1 | CRITICAL | Critically low disk — DB failure imminent |
| ReplicaLag | > 300 s | 1 min | 2 | WARN | Read replica falling behind; stale reads possible |
| FreeableMemory | < 256 MB | 5 min | 3 | WARN | Low memory; buffer pool shrinking, queries slowing |
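FreeStorageSpace and FreeableMemory thresholds are specified in bytes, so the magic numbers in the snippets below are just binary-unit conversions:

```python
GIB = 1024**3  # gibibyte
MIB = 1024**2  # mebibyte

# CloudWatch storage/memory thresholds are given in bytes
thresholds = {
    "disk_warn":      10 * GIB,  # FreeStorageSpace < 10 GB
    "disk_critical":   2 * GIB,  # FreeStorageSpace < 2 GB
    "memory_warn":   256 * MIB,  # FreeableMemory < 256 MB
}
print(thresholds)  # {'disk_warn': 10737418240, 'disk_critical': 2147483648, 'memory_warn': 268435456}
```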
**CloudFormation**

```yaml
Parameters:
  DbInstanceId:
    Type: String
    Default: YOUR_DB_INSTANCE_ID
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  MaxConnectionsThreshold:
    Type: Number
    Default: 272
    Description: 80% of max_connections for your instance class (see table above)

Resources:
  RdsCpuWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-cpu-warn"
      AlarmDescription: RDS CPU above 80% for 15 minutes
      Namespace: AWS/RDS
      MetricName: CPUUtilization
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsConnectionsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-connections-warn"
      AlarmDescription: RDS connections above 80% of max_connections
      Namespace: AWS/RDS
      MetricName: DatabaseConnections
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: !Ref MaxConnectionsThreshold
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-warn"
      AlarmDescription: RDS free storage below 10 GB
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 10737418240  # 10 GB in bytes
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsDiskCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-disk-critical"
      AlarmDescription: RDS free storage critically low (below 2 GB)
      Namespace: AWS/RDS
      MetricName: FreeStorageSpace
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 1
      Threshold: 2147483648  # 2 GB in bytes
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsReplicaLag:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-replica-lag"
      AlarmDescription: RDS read replica lag above 5 minutes (read replicas only)
      Namespace: AWS/RDS
      MetricName: ReplicaLag
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 60
      EvaluationPeriods: 2
      Threshold: 300
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  RdsFreeMemoryWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${DbInstanceId}-memory-warn"
      AlarmDescription: RDS freeable memory below 256 MB
      Namespace: AWS/RDS
      MetricName: FreeableMemory
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Ref DbInstanceId
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 268435456  # 256 MB in bytes
      ComparisonOperator: LessThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
```
**Terraform**

```hcl
variable "db_instance_id" { type = string }
variable "sns_topic_arn" { type = string }

# Set max_connections_threshold to 80% of your instance's max_connections:
# db.t3.micro=69, db.t3.small=136, db.t3.medium=272, db.r5.large=1092
variable "max_connections_threshold" {
  type    = number
  default = 272
}

resource "aws_cloudwatch_metric_alarm" "rds_cpu_warn" {
  alarm_name          = "${var.db_instance_id}-cpu-warn"
  alarm_description   = "RDS CPU above 80% for 15 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_connections_warn" {
  alarm_name          = "${var.db_instance_id}-connections-warn"
  alarm_description   = "RDS connections above 80% of max_connections"
  namespace           = "AWS/RDS"
  metric_name         = "DatabaseConnections"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = var.max_connections_threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_warn" {
  alarm_name          = "${var.db_instance_id}-disk-warn"
  alarm_description   = "RDS free storage below 10 GB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 10737418240 # 10 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_disk_critical" {
  alarm_name          = "${var.db_instance_id}-disk-critical"
  alarm_description   = "RDS free storage critically low (below 2 GB)"
  namespace           = "AWS/RDS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 2147483648 # 2 GB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_replica_lag" {
  # Apply only to read replicas
  alarm_name          = "${var.db_instance_id}-replica-lag"
  alarm_description   = "RDS replica lag above 5 minutes"
  namespace           = "AWS/RDS"
  metric_name         = "ReplicaLag"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 2
  threshold           = 300
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_memory_warn" {
  alarm_name          = "${var.db_instance_id}-memory-warn"
  alarm_description   = "RDS freeable memory below 256 MB"
  namespace           = "AWS/RDS"
  metric_name         = "FreeableMemory"
  dimensions          = { DBInstanceIdentifier = var.db_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 268435456 # 256 MB in bytes
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
```
⚡ Lambda Alarms — Errors, Throttles & Duration
Lambda errors are silent by default: your function fails and nothing tells you. Throttles mean requests are being dropped. Duration alerts catch runaway executions before they hit the timeout or eat your budget.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| Errors | > 0 | 1 min | 1 | WARN | Any function error — investigate immediately |
| Errors | > 5 | 1 min | 2 | CRITICAL | Repeated errors — function may be completely broken |
| Throttles | > 0 | 1 min | 2 | WARN | Requests being dropped due to concurrency limit |
| Duration | > 80% of timeout | 1 min | 2 | WARN | Function nearing timeout; will fail if trend continues |
| ConcurrentExecutions | > 800 (80% of default 1000) | 1 min | 2 | WARN | Function consuming most of the account's concurrency; throttling follows |
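The Duration threshold is derived from your function's timeout; the arithmetic is trivial but worth pinning down, since the alarm takes milliseconds while timeouts are usually configured in seconds:

```python
def duration_threshold_ms(timeout_seconds: int, fraction: float = 0.8) -> int:
    """Alarm threshold for the Duration metric: a fraction of the
    function timeout, converted to milliseconds."""
    return int(timeout_seconds * 1000 * fraction)

print(duration_threshold_ms(30))  # 24000
print(duration_threshold_ms(15))  # 12000
print(duration_threshold_ms(5))   # 4000
```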
**CloudFormation**

```yaml
Parameters:
  FunctionName:
    Type: String
    Default: YOUR_FUNCTION_NAME
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN
  DurationThresholdMs:
    Type: Number
    Default: 24000
    Description: |
      80% of your function timeout in ms.
      e.g. 30s timeout -> 24000ms, 15s timeout -> 12000ms, 5s timeout -> 4000ms

Resources:
  LambdaErrorsWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-warn"
      AlarmDescription: Lambda function errors detected
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  LambdaErrorsCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-errors-critical"
      AlarmDescription: Lambda function errors above 5 - may be completely broken
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaThrottlesWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-throttles"
      AlarmDescription: Lambda throttles detected - requests being dropped
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaDurationWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-duration-warn"
      AlarmDescription: !Sub "Lambda duration above 80% of timeout (${DurationThresholdMs}ms)"
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 2
      Threshold: !Ref DurationThresholdMs
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  LambdaConcurrencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${FunctionName}-concurrency-warn"
      AlarmDescription: Lambda concurrent executions above 800 (80% of default limit 1000)
      Namespace: AWS/Lambda
      MetricName: ConcurrentExecutions
      Dimensions:
        - Name: FunctionName
          Value: !Ref FunctionName
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 800
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
```
**Terraform**

```hcl
variable "function_name" { type = string }
variable "sns_topic_arn" { type = string }

variable "duration_threshold_ms" {
  type        = number
  default     = 24000
  description = "80% of function timeout in ms. e.g. 30s timeout -> 24000"
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_warn" {
  alarm_name          = "${var.function_name}-errors-warn"
  alarm_description   = "Lambda errors detected"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_errors_critical" {
  alarm_name          = "${var.function_name}-errors-critical"
  alarm_description   = "Lambda errors above 5"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 5
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_throttles" {
  alarm_name          = "${var.function_name}-throttles"
  alarm_description   = "Lambda throttles - requests being dropped"
  namespace           = "AWS/Lambda"
  metric_name         = "Throttles"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_duration_warn" {
  alarm_name          = "${var.function_name}-duration-warn"
  alarm_description   = "Lambda p99 duration above 80% of timeout"
  namespace           = "AWS/Lambda"
  metric_name         = "Duration"
  dimensions          = { FunctionName = var.function_name }
  extended_statistic  = "p99"
  period              = 60
  evaluation_periods  = 2
  threshold           = var.duration_threshold_ms
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}

resource "aws_cloudwatch_metric_alarm" "lambda_concurrency_warn" {
  alarm_name          = "${var.function_name}-concurrency-warn"
  alarm_description   = "Lambda concurrent executions above 800 (80% of default limit)"
  namespace           = "AWS/Lambda"
  metric_name         = "ConcurrentExecutions"
  dimensions          = { FunctionName = var.function_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 800
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
}
```
⚖️ ALB Alarms — 5XX, Latency & Unhealthy Targets
Your load balancer is the front door to your application. 5XX errors mean backends are failing; unhealthy hosts mean containers are crashing. These alarms catch both.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
| HTTPCode_Target_5XX_Count | > 0 | 1 min | 2 | WARN | Backend returning server errors |
| HTTPCode_Target_5XX_Count | > 10 | 1 min | 2 | CRITICAL | High rate of 5XX — backend likely down |
| TargetResponseTime | > 2 s | 5 min | 3 | WARN | Slow responses — users experiencing latency |
| TargetResponseTime | > 5 s | 5 min | 2 | CRITICAL | Very slow responses — likely timing out for users |
| UnHealthyHostCount | > 0 | 1 min | 2 | CRITICAL | Targets failing health checks — service degraded |
| RejectedConnectionCount | > 0 | 1 min | 2 | WARN | ALB at max connections — requests being dropped |
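A fixed 5XX count fires very differently at 10 requests/s than at 10,000. If your traffic varies widely, a metric-math alarm on the error *rate* can be steadier. A hedged sketch of the `put_metric_alarm` arguments using the `Metrics` parameter (the alarm name and 5% threshold are assumptions to tune):

```python
def alb_5xx_rate_alarm_params(alb_suffix: str, sns_topic_arn: str,
                              rate_threshold: float = 0.05) -> dict:
    """put_metric_alarm() kwargs for a metric-math alarm on the 5XX rate
    (errors / requests) rather than a raw count."""
    def stat(metric_id: str, metric_name: str) -> dict:
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": alb_suffix}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs to the expression, not alarmed on directly
        }
    return {
        "AlarmName": "alb-5xx-rate",
        "Metrics": [
            stat("errors", "HTTPCode_Target_5XX_Count"),
            stat("requests", "RequestCount"),
            {"Id": "rate", "Expression": "errors / requests",
             "Label": "5XX rate", "ReturnData": True},
        ],
        "EvaluationPeriods": 2,
        "Threshold": rate_threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }

params = alb_5xx_rate_alarm_params("app/my-alb/abc123def456", "YOUR_SNS_TOPIC_ARN")
# boto3.client("cloudwatch").put_metric_alarm(**params)
```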
**CloudFormation**

```yaml
Parameters:
  AlbSuffix:
    Type: String
    Default: YOUR_ALB_SUFFIX
    Description: e.g. app/my-alb/abc123def456 (after "loadbalancer/" in the ARN)
  TargetGroupSuffix:
    Type: String
    Default: YOUR_TARGET_GROUP_SUFFIX
    Description: e.g. targetgroup/my-tg/abc123def456 (the final segment of the target group ARN)
  SnsTopicArn:
    Type: String
    Default: YOUR_SNS_TOPIC_ARN

Resources:
  Alb5xxWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-warn-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors detected
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
      OKActions: [!Ref SnsTopicArn]

  Alb5xxCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-5xx-critical-${AlbSuffix}"
      AlarmDescription: ALB backend 5XX errors above 10 per minute
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyWarn:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-warn-${AlbSuffix}"
      AlarmDescription: ALB target response time above 2 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbLatencyCritical:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-latency-critical-${AlbSuffix}"
      AlarmDescription: ALB target response time above 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      ExtendedStatistic: p99
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbUnhealthyHosts:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-unhealthy-hosts-${AlbSuffix}"
      AlarmDescription: ALB unhealthy target count above zero
      Namespace: AWS/ApplicationELB
      MetricName: UnHealthyHostCount
      # UnHealthyHostCount is reported per target group, so this alarm
      # needs both the TargetGroup and LoadBalancer dimensions.
      Dimensions:
        - Name: TargetGroup
          Value: !Ref TargetGroupSuffix
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]

  AlbRejectedConnections:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "alb-rejected-connections-${AlbSuffix}"
      AlarmDescription: ALB rejected connections - load balancer at max capacity
      Namespace: AWS/ApplicationELB
      MetricName: RejectedConnectionCount
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref AlbSuffix
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions: [!Ref SnsTopicArn]
```
variable "alb_suffix" { type = string } # e.g. "app/my-alb/abc123def456"
variable "sns_topic_arn" { type = string }
resource "aws_cloudwatch_metric_alarm" "alb_5xx_warn" {
alarm_name = "alb-5xx-warn"
alarm_description = "ALB 5XX errors detected"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_Target_5XX_Count"
dimensions = { LoadBalancer = var.alb_suffix }
statistic = "Sum"
period = 60
evaluation_periods = 2
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
ok_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "alb_5xx_critical" {
alarm_name = "alb-5xx-critical"
alarm_description = "ALB 5XX errors above 10/min"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_Target_5XX_Count"
dimensions = { LoadBalancer = var.alb_suffix }
statistic = "Sum"
period = 60
evaluation_periods = 2
threshold = 10
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "alb_latency_warn" {
alarm_name = "alb-latency-warn"
alarm_description = "ALB p99 response time above 2 seconds"
namespace = "AWS/ApplicationELB"
metric_name = "TargetResponseTime"
dimensions = { LoadBalancer = var.alb_suffix }
extended_statistic = "p99"
period = 300
evaluation_periods = 3
threshold = 2
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "alb_latency_critical" {
alarm_name = "alb-latency-critical"
alarm_description = "ALB p99 response time above 5 seconds"
namespace = "AWS/ApplicationELB"
metric_name = "TargetResponseTime"
dimensions = { LoadBalancer = var.alb_suffix }
extended_statistic = "p99"
period = 300
evaluation_periods = 2
threshold = 5
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "alb_unhealthy_hosts" {
alarm_name = "alb-unhealthy-hosts"
alarm_description = "ALB unhealthy targets detected"
namespace = "AWS/ApplicationELB"
metric_name = "UnHealthyHostCount"
# UnHealthyHostCount is published per target group: add a TargetGroup
# dimension ("targetgroup/<name>/<id>") alongside LoadBalancer, otherwise
# the alarm sits in INSUFFICIENT_DATA
dimensions = { LoadBalancer = var.alb_suffix }
statistic = "Maximum"
period = 60
evaluation_periods = 2
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "alb_rejected_connections" {
alarm_name = "alb-rejected-connections"
alarm_description = "ALB at max connections - requests being dropped"
namespace = "AWS/ApplicationELB"
metric_name = "RejectedConnectionCount"
dimensions = { LoadBalancer = var.alb_suffix }
statistic = "Sum"
period = 60
evaluation_periods = 2
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
API Gateway REST APIs enforce a 29-second integration timeout by default: if a backend takes longer, the gateway returns a 504 even when the backend eventually succeeds. 5XX errors point to failing integrations; a high 4XX rate at scale usually means misconfigured or broken clients.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
5XXError | > 5 count | 1 min | 2 | WARN | Backend integration errors; Lambda or HTTP backend failing |
4XXError | > high rate | 5 min | 3 | WARN | High client error rate; API misuse or broken client |
Latency | > 3000 ms p99 | 5 min | 3 | WARN | Slow backend responses; users experiencing delays |
Latency | > 10000 ms | 5 min | 2 | CRITICAL | Near 29s timeout; requests will start failing |
Count | sudden drop > 50% | — | — | WARN | Requires metric math / anomaly detection (see note above) |
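The "sudden drop" row in the table can't be expressed as a static threshold. One way to cover it is a CloudWatch anomaly-detection alarm on the Count metric, which alerts when traffic falls below a learned baseline. This is a sketch, not a drop-in: it reuses the var.api_name, var.stage, and var.sns_topic_arn variables declared in the Terraform snippet later in this section, and the band width (2 standard deviations), period, and evaluation periods are starting points to tune.

```hcl
resource "aws_cloudwatch_metric_alarm" "apigw_traffic_drop" {
  alarm_name          = "${var.api_name}-${var.stage}-traffic-drop"
  alarm_description   = "API Gateway request count below the expected baseline"
  comparison_operator = "LessThanLowerThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "band" # anomaly alarms reference a band, not a fixed threshold
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]

  # Expected range: 2 standard deviations around the learned baseline
  metric_query {
    id          = "band"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)"
    label       = "Expected request count"
    return_data = true
  }

  # The raw request count being compared to the band
  metric_query {
    id          = "m1"
    return_data = true
    metric {
      namespace   = "AWS/ApiGateway"
      metric_name = "Count"
      period      = 300
      stat        = "Sum"
      dimensions  = { ApiName = var.api_name, Stage = var.stage }
    }
  }
}
```

The model needs a few days of traffic history before the band is meaningful, so expect noise immediately after creation.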
Parameters:
ApiName:
Type: String
Default: YOUR_API_NAME
Stage:
Type: String
Default: prod
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
ApiGw5xxWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-5xx-warn"
AlarmDescription: API Gateway 5XX errors above 5 per minute
Namespace: AWS/ApiGateway
MetricName: 5XXError
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
ApiGw4xxWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-4xx-warn"
AlarmDescription: API Gateway 4XX errors above 50 per 5 minutes
Namespace: AWS/ApiGateway
MetricName: 4XXError
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 50
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
ApiGwLatencyWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-latency-warn"
AlarmDescription: API Gateway p99 latency above 3 seconds
Namespace: AWS/ApiGateway
MetricName: Latency
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
ExtendedStatistic: p99
Period: 300
EvaluationPeriods: 3
Threshold: 3000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
ApiGwLatencyCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${ApiName}-${Stage}-latency-critical"
AlarmDescription: API Gateway latency above 10 seconds - near 29s timeout
Namespace: AWS/ApiGateway
MetricName: Latency
Dimensions:
- Name: ApiName
Value: !Ref ApiName
- Name: Stage
Value: !Ref Stage
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 10000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
variable "api_name" { type = string }
variable "stage" { type = string; default = "prod" }
variable "sns_topic_arn" { type = string }
resource "aws_cloudwatch_metric_alarm" "apigw_5xx_warn" {
alarm_name = "${var.api_name}-${var.stage}-5xx-warn"
alarm_description = "API Gateway 5XX errors above 5/min"
namespace = "AWS/ApiGateway"
metric_name = "5XXError"
dimensions = { ApiName = var.api_name, Stage = var.stage }
statistic = "Sum"
period = 60
evaluation_periods = 2
threshold = 5
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
ok_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "apigw_4xx_warn" {
alarm_name = "${var.api_name}-${var.stage}-4xx-warn"
alarm_description = "API Gateway 4XX high volume"
namespace = "AWS/ApiGateway"
metric_name = "4XXError"
dimensions = { ApiName = var.api_name, Stage = var.stage }
statistic = "Sum"
period = 300
evaluation_periods = 3
threshold = 50
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "apigw_latency_warn" {
alarm_name = "${var.api_name}-${var.stage}-latency-warn"
alarm_description = "API Gateway p99 latency above 3 seconds"
namespace = "AWS/ApiGateway"
metric_name = "Latency"
dimensions = { ApiName = var.api_name, Stage = var.stage }
extended_statistic = "p99"
period = 300
evaluation_periods = 3
threshold = 3000
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "apigw_latency_critical" {
alarm_name = "${var.api_name}-${var.stage}-latency-critical"
alarm_description = "API Gateway latency above 10s - near 29s timeout"
namespace = "AWS/ApiGateway"
metric_name = "Latency"
dimensions = { ApiName = var.api_name, Stage = var.stage }
statistic = "Average"
period = 300
evaluation_periods = 2
threshold = 10000
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
A backed-up SQS queue means your consumers have stopped or are too slow. Old messages indicate processing failures. Left unattended, queues can grow to millions of messages and take hours to drain.
NumberOfMessagesSent requires metric math (comparing to a rolling baseline). Use CloudWatch Anomaly Detection alarms for this — the standard alarm snippets below cover threshold-based alarms only.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
ApproximateNumberOfMessagesVisible | > 1000 | 5 min | 3 | WARN | Queue building up; consumers may be slow or failing |
ApproximateNumberOfMessagesVisible | > 10000 | 5 min | 2 | CRITICAL | Severe queue backup; consumers definitely failing |
ApproximateAgeOfOldestMessage | > 300 s | 5 min | 2 | WARN | Messages sitting unprocessed for 5+ minutes |
ApproximateAgeOfOldestMessage | > 900 s | 5 min | 2 | CRITICAL | Messages 15+ minutes old; SLA likely being breached |
NumberOfMessagesSent | sudden drop | — | — | WARN | Requires anomaly detection / metric math (see note above) |
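Depth alarms tell you the queue is deep, not how long recovery will take. CloudWatch metric math can estimate drain time from the current backlog and the consumers' deletion rate. A sketch, assuming the variables from the Terraform snippet later in this section; the resource name and the 30-minute threshold are illustrative:

```hcl
resource "aws_cloudwatch_metric_alarm" "sqs_drain_time" {
  alarm_name          = "${var.queue_name}-drain-time"
  alarm_description   = "Estimated time to drain the queue exceeds 30 minutes"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 30 # minutes
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]

  # minutes to drain = backlog / (deletions per second) / 60;
  # rate is a 300 s Sum, so deletions/sec = rate / 300
  metric_query {
    id          = "drain"
    expression  = "IF(rate > 0, depth * 300 / rate / 60, 0)"
    label       = "Estimated minutes to drain"
    return_data = true
  }
  metric_query {
    id = "depth"
    metric {
      namespace   = "AWS/SQS"
      metric_name = "ApproximateNumberOfMessagesVisible"
      period      = 300
      stat        = "Maximum"
      dimensions  = { QueueName = var.queue_name }
    }
  }
  metric_query {
    id = "rate"
    metric {
      namespace   = "AWS/SQS"
      metric_name = "NumberOfMessagesDeleted"
      period      = 300
      stat        = "Sum"
      dimensions  = { QueueName = var.queue_name }
    }
  }
}
```

The IF guard returns 0 when consumers delete nothing at all, so pair this with the message-age alarms below, which catch the fully-stalled case.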
Parameters:
QueueName:
Type: String
Default: YOUR_QUEUE_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
SqsQueueDepthWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-depth-warn"
AlarmDescription: SQS queue depth above 1000 - consumers may be lagging
Namespace: AWS/SQS
MetricName: ApproximateNumberOfMessagesVisible
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 3
Threshold: 1000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
OKActions: [!Ref SnsTopicArn]
SqsQueueDepthCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-depth-critical"
AlarmDescription: SQS queue depth above 10000 - severe consumer failure
Namespace: AWS/SQS
MetricName: ApproximateNumberOfMessagesVisible
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 10000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
SqsMessageAgeWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-age-warn"
AlarmDescription: SQS oldest message age above 5 minutes
Namespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 300
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
SqsMessageAgeCritical:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${QueueName}-age-critical"
AlarmDescription: SQS oldest message age above 15 minutes - SLA breach
Namespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Dimensions:
- Name: QueueName
Value: !Ref QueueName
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 900
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
variable "queue_name" { type = string }
variable "sns_topic_arn" { type = string }
resource "aws_cloudwatch_metric_alarm" "sqs_depth_warn" {
alarm_name = "${var.queue_name}-depth-warn"
alarm_description = "SQS queue depth above 1000"
namespace = "AWS/SQS"
metric_name = "ApproximateNumberOfMessagesNotVisible"
dimensions = { QueueName = var.queue_name }
statistic = "Maximum"
period = 300
evaluation_periods = 3
threshold = 1000
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
ok_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "sqs_depth_critical" {
alarm_name = "${var.queue_name}-depth-critical"
alarm_description = "SQS queue depth above 10000 - severe consumer failure"
namespace = "AWS/SQS"
metric_name = "ApproximateNumberOfMessagesNotVisible"
dimensions = { QueueName = var.queue_name }
statistic = "Maximum"
period = 300
evaluation_periods = 2
threshold = 10000
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "sqs_message_age_warn" {
alarm_name = "${var.queue_name}-age-warn"
alarm_description = "SQS oldest message above 5 minutes old"
namespace = "AWS/SQS"
metric_name = "ApproximateAgeOfOldestMessage"
dimensions = { QueueName = var.queue_name }
statistic = "Maximum"
period = 300
evaluation_periods = 2
threshold = 300
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "sqs_message_age_critical" {
alarm_name = "${var.queue_name}-age-critical"
alarm_description = "SQS oldest message above 15 minutes - SLA breach"
namespace = "AWS/SQS"
metric_name = "ApproximateAgeOfOldestMessage"
dimensions = { QueueName = var.queue_name }
statistic = "Maximum"
period = 300
evaluation_periods = 2
threshold = 900
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
DynamoDB throttling is silent and cumulative. Throttled requests are retried with exponential backoff, which means your application slows down before it starts failing. Catch throttles early.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
SystemErrors | > 0 | 1 min | 2 | CRITICAL | AWS-side DynamoDB errors; likely service issue |
UserErrors | > 0 | 5 min | 3 | WARN | Client-side errors (bad requests, auth issues) |
ConsumedReadCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Read capacity filling up (provisioned mode only) |
ConsumedWriteCapacityUnits | > 80% of provisioned | 5 min | 2 | WARN | Write capacity filling up (provisioned mode only) |
ThrottledRequests | > 0 | 5 min | 2 | WARN | Requests being throttled; app latency increasing |
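ThrottledRequests counts whole throttled requests, so a batch request with a single throttled key surfaces only in the finer-grained ReadThrottleEvents and WriteThrottleEvents metrics. If you want that per-event visibility, a sketch following the same pattern as the Terraform snippets later in this section (the resource and alarm names are illustrative; it reuses var.table_name and var.sns_topic_arn):

```hcl
resource "aws_cloudwatch_metric_alarm" "dynamodb_read_throttle_events" {
  alarm_name          = "${var.table_name}-read-throttle-events"
  alarm_description   = "DynamoDB read throttle events - individual reads being rejected"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ReadThrottleEvents"
  dimensions          = { TableName = var.table_name }
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # the metric is only emitted when throttling occurs
  alarm_actions       = [var.sns_topic_arn]
}
```

A matching alarm on WriteThrottleEvents covers the write side; both metrics also accept a GlobalSecondaryIndexName dimension if a specific index is the hot spot.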
Parameters:
TableName:
Type: String
Default: YOUR_TABLE_NAME
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
ProvisionedReadCapacity:
Type: Number
Default: 100
Description: Your table's provisioned RCU (skip for on-demand mode)
ProvisionedWriteCapacity:
Type: Number
Default: 100
Description: Your table's provisioned WCU (skip for on-demand mode)
Resources:
DynamoDbSystemErrors:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-system-errors"
AlarmDescription: DynamoDB system errors detected - possible AWS service issue
Namespace: AWS/DynamoDB
MetricName: SystemErrors
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 60
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbUserErrors:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-user-errors"
AlarmDescription: DynamoDB user errors - bad requests or auth issues
Namespace: AWS/DynamoDB
MetricName: UserErrors
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 3
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbReadCapacityWarn:
# Remove this resource if using on-demand mode
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-read-capacity-warn"
AlarmDescription: DynamoDB read capacity above 80% of provisioned
Namespace: AWS/DynamoDB
MetricName: ConsumedReadCapacityUnits
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
# CloudFormation cannot evaluate arithmetic inside !Sub. Precompute
# ProvisionedReadCapacity x 0.8 x 300 (the per-period consumed sum) and hardcode it:
Threshold: 24000 # 100 RCU x 0.8 x 300 s
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbWriteCapacityWarn:
# Remove this resource if using on-demand mode
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-write-capacity-warn"
AlarmDescription: DynamoDB write capacity above 80% of provisioned
Namespace: AWS/DynamoDB
MetricName: ConsumedWriteCapacityUnits
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
# As above, precompute ProvisionedWriteCapacity x 0.8 x 300 and hardcode it:
Threshold: 24000 # 100 WCU x 0.8 x 300 s
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
DynamoDbThrottledRequests:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${TableName}-throttled"
AlarmDescription: DynamoDB throttled requests - requests being delayed
Namespace: AWS/DynamoDB
MetricName: ThrottledRequests
Dimensions:
- Name: TableName
Value: !Ref TableName
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
variable "table_name" { type = string }
variable "sns_topic_arn" { type = string }
variable "provisioned_read_capacity" { type = number; default = 100 }
variable "provisioned_write_capacity" { type = number; default = 100 }
# If the table uses on-demand mode, remove the two capacity alarms below (the provisioned_* variables are then unused)
resource "aws_cloudwatch_metric_alarm" "dynamodb_system_errors" {
alarm_name = "${var.table_name}-system-errors"
alarm_description = "DynamoDB system errors - possible AWS service issue"
namespace = "AWS/DynamoDB"
metric_name = "SystemErrors"
dimensions = { TableName = var.table_name }
statistic = "Sum"
period = 60
evaluation_periods = 2
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "dynamodb_user_errors" {
alarm_name = "${var.table_name}-user-errors"
alarm_description = "DynamoDB user errors - bad requests or auth issues"
namespace = "AWS/DynamoDB"
metric_name = "UserErrors"
dimensions = { TableName = var.table_name }
statistic = "Sum"
period = 300
evaluation_periods = 3
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "dynamodb_read_capacity_warn" {
# Remove this block if using on-demand mode
alarm_name = "${var.table_name}-read-capacity-warn"
alarm_description = "DynamoDB consumed read capacity above 80% of provisioned"
namespace = "AWS/DynamoDB"
metric_name = "ConsumedReadCapacityUnits"
dimensions = { TableName = var.table_name }
statistic = "Sum"
period = 300
evaluation_periods = 2
# Threshold = 80% of provisioned RCU * period seconds
threshold = var.provisioned_read_capacity * 0.8 * 300
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "dynamodb_write_capacity_warn" {
# Remove this block if using on-demand mode
alarm_name = "${var.table_name}-write-capacity-warn"
alarm_description = "DynamoDB consumed write capacity above 80% of provisioned"
namespace = "AWS/DynamoDB"
metric_name = "ConsumedWriteCapacityUnits"
dimensions = { TableName = var.table_name }
statistic = "Sum"
period = 300
evaluation_periods = 2
threshold = var.provisioned_write_capacity * 0.8 * 300
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "dynamodb_throttled" {
alarm_name = "${var.table_name}-throttled"
alarm_description = "DynamoDB requests being throttled"
namespace = "AWS/DynamoDB"
metric_name = "ThrottledRequests"
dimensions = { TableName = var.table_name }
statistic = "Sum"
period = 300
evaluation_periods = 2
threshold = 0
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
Redis is often invisible until it fails — then everything that depends on it slows down or crashes. Low cache hit rate means your backend database is absorbing all the traffic Redis should be handling.
| Metric | Threshold | Period | Eval Periods | Severity | Why It Matters |
|---|---|---|---|---|---|
CPUUtilization | > 80% | 5 min | 2 | WARN | Redis single-threaded; high CPU causes latency spikes |
FreeableMemory | < 100 MB | 5 min | 2 | WARN | Redis evicting keys; cache effectiveness dropping |
CacheHitRate | < 0.8 (80%) | 5 min | 3 | WARN | Cache not effective; DB taking excessive load |
CurrConnections | > 1000 | 5 min | 2 | WARN | High connection count; connection pool exhaustion possible |
ReplicationLag | > 60 s | 1 min | 2 | WARN | Replica falling behind primary; stale reads from replica |
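CacheHitRate is a native ElastiCache metric for Redis, but if your engine version doesn't expose it, the same ratio can be derived with metric math from CacheHits and CacheMisses. A sketch reusing var.cache_cluster_id and var.sns_topic_arn from the Terraform snippet later in this section (the resource name is illustrative):

```hcl
resource "aws_cloudwatch_metric_alarm" "redis_hit_rate_math" {
  alarm_name          = "${var.cache_cluster_id}-hit-rate-math"
  alarm_description   = "Cache hit rate below 80%, computed from CacheHits/CacheMisses"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3
  threshold           = 0.8
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]

  # hit rate = hits / (hits + misses); empty result (no traffic) is treated as not breaching
  metric_query {
    id          = "hit_rate"
    expression  = "hits / (hits + misses)"
    label       = "Cache hit rate"
    return_data = true
  }
  metric_query {
    id = "hits"
    metric {
      namespace   = "AWS/ElastiCache"
      metric_name = "CacheHits"
      period      = 300
      stat        = "Sum"
      dimensions  = { CacheClusterId = var.cache_cluster_id }
    }
  }
  metric_query {
    id = "misses"
    metric {
      namespace   = "AWS/ElastiCache"
      metric_name = "CacheMisses"
      period      = 300
      stat        = "Sum"
      dimensions  = { CacheClusterId = var.cache_cluster_id }
    }
  }
}
```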
Parameters:
CacheClusterId:
Type: String
Default: YOUR_CACHE_CLUSTER_ID
SnsTopicArn:
Type: String
Default: YOUR_SNS_TOPIC_ARN
Resources:
RedisCpuWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-cpu-warn"
AlarmDescription: ElastiCache CPU above 80%
Namespace: AWS/ElastiCache
MetricName: CPUUtilization
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 80
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisFreeMemoryWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-memory-warn"
AlarmDescription: ElastiCache freeable memory below 100 MB - keys may be evicted
Namespace: AWS/ElastiCache
MetricName: FreeableMemory
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 2
Threshold: 104857600
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisCacheHitRateWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-hit-rate-warn"
AlarmDescription: ElastiCache cache hit rate below 80% - DB taking excessive load
Namespace: AWS/ElastiCache
MetricName: CacheHitRate
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 300
EvaluationPeriods: 3
Threshold: 0.8
ComparisonOperator: LessThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisCurrConnectionsWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-connections-warn"
AlarmDescription: ElastiCache connections above 1000
Namespace: AWS/ElastiCache
MetricName: CurrConnections
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Maximum
Period: 300
EvaluationPeriods: 2
Threshold: 1000
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
RedisReplicationLagWarn:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub "${CacheClusterId}-replication-lag"
AlarmDescription: ElastiCache replication lag above 60 seconds
Namespace: AWS/ElastiCache
MetricName: ReplicationLag
Dimensions:
- Name: CacheClusterId
Value: !Ref CacheClusterId
Statistic: Average
Period: 60
EvaluationPeriods: 2
Threshold: 60
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions: [!Ref SnsTopicArn]
variable "cache_cluster_id" { type = string }
variable "sns_topic_arn" { type = string }
resource "aws_cloudwatch_metric_alarm" "redis_cpu_warn" {
alarm_name = "${var.cache_cluster_id}-cpu-warn"
alarm_description = "ElastiCache CPU above 80%"
namespace = "AWS/ElastiCache"
metric_name = "CPUUtilization"
dimensions = { CacheClusterId = var.cache_cluster_id }
statistic = "Average"
period = 300
evaluation_periods = 2
threshold = 80
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "redis_memory_warn" {
alarm_name = "${var.cache_cluster_id}-memory-warn"
alarm_description = "ElastiCache freeable memory below 100 MB"
namespace = "AWS/ElastiCache"
metric_name = "FreeableMemory"
dimensions = { CacheClusterId = var.cache_cluster_id }
statistic = "Average"
period = 300
evaluation_periods = 2
threshold = 104857600 # 100 MB in bytes
comparison_operator = "LessThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "redis_hit_rate_warn" {
alarm_name = "${var.cache_cluster_id}-hit-rate-warn"
alarm_description = "ElastiCache cache hit rate below 80%"
namespace = "AWS/ElastiCache"
metric_name = "CacheHitRate"
dimensions = { CacheClusterId = var.cache_cluster_id }
statistic = "Average"
period = 300
evaluation_periods = 3
threshold = 0.8
comparison_operator = "LessThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "redis_connections_warn" {
alarm_name = "${var.cache_cluster_id}-connections-warn"
alarm_description = "ElastiCache connections above 1000"
namespace = "AWS/ElastiCache"
metric_name = "CurrConnections"
dimensions = { CacheClusterId = var.cache_cluster_id }
statistic = "Maximum"
period = 300
evaluation_periods = 2
threshold = 1000
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
resource "aws_cloudwatch_metric_alarm" "redis_replication_lag" {
alarm_name = "${var.cache_cluster_id}-replication-lag"
alarm_description = "ElastiCache replication lag above 60 seconds"
namespace = "AWS/ElastiCache"
metric_name = "ReplicationLag"
dimensions = { CacheClusterId = var.cache_cluster_id }
statistic = "Average"
period = 60
evaluation_periods = 2
threshold = 60
comparison_operator = "GreaterThanThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [var.sns_topic_arn]
}
Cost alerts use AWS Budgets, not CloudWatch. They notify you when actual or forecasted spend crosses a threshold — giving you time to investigate before the bill arrives.
| Alert Type | Threshold | Type | Severity | Why It Matters |
|---|---|---|---|---|
| Monthly spend actual | 80% of budget | ACTUAL | WARN | Early warning to review usage before hitting budget |
| Monthly spend actual | 100% of budget | ACTUAL | CRITICAL | Budget exceeded — take action now |
| Monthly spend forecasted | 100% of budget | FORECASTED | WARN | Projected to exceed budget by month end |
| Anomaly detection | $50 above expected | ANOMALY | WARN | Unusual spending pattern — runaway resource possible |
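The budget below tracks total account spend. AWS Budgets can also scope to a single service, which helps attribute a runaway resource faster than an account-wide alert. A sketch under stated assumptions: the budget name, $20 limit, and "AWS Lambda" filter value are illustrative, the cost_filter block requires a reasonably recent AWS provider, and var.alert_email comes from the Terraform snippet later in this section.

```hcl
resource "aws_budgets_budget" "lambda_only" {
  name         = "lambda-monthly-budget"
  budget_type  = "COST"
  limit_amount = "20"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Restrict this budget to a single service (names as shown in Cost Explorer)
  cost_filter {
    name   = "Service"
    values = ["AWS Lambda"]
  }

  # Alert when forecasted Lambda spend will exceed the limit
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.alert_email]
  }
}
```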
Parameters:
MonthlyBudgetAmount:
Type: Number
Default: 100
Description: Monthly AWS budget in USD
AlertEmail:
Type: String
Default: you@yourcompany.com
Description: Email for budget alerts
Resources:
MonthlyBudget:
Type: AWS::Budgets::Budget
Properties:
Budget:
BudgetName: monthly-aws-budget
BudgetType: COST
TimeUnit: MONTHLY
BudgetLimit:
Amount: !Ref MonthlyBudgetAmount
Unit: USD
NotificationsWithSubscribers:
# 80% actual spend warning
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 80
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
# 100% actual spend - critical
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 100
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
# Forecasted to exceed 100%
- Notification:
NotificationType: FORECASTED
ComparisonOperator: GREATER_THAN
Threshold: 100
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref AlertEmail
# Cost Anomaly Detection
# Note: AWS::CE::AnomalyMonitor and AnomalySubscription are separate resources
CostAnomalyMonitor:
Type: AWS::CE::AnomalyMonitor
Properties:
MonitorName: aws-cost-anomaly-monitor
MonitorType: DIMENSIONAL
MonitorDimension: SERVICE
CostAnomalySubscription:
Type: AWS::CE::AnomalySubscription
Properties:
SubscriptionName: cost-anomaly-alerts
MonitorArnList:
- !GetAtt CostAnomalyMonitor.MonitorArn
Subscribers:
- Address: !Ref AlertEmail
Type: EMAIL
Threshold: 50 # USD; the Threshold property is legacy - newer templates can use ThresholdExpression instead
Frequency: DAILY
variable "monthly_budget_amount" {
type = number
default = 100
description = "Monthly AWS budget in USD"
}
variable "alert_email" {
type = string
description = "Email for budget alerts"
}
resource "aws_budgets_budget" "monthly" {
name = "monthly-aws-budget"
budget_type = "COST"
limit_amount = var.monthly_budget_amount
limit_unit = "USD"
time_unit = "MONTHLY"
# 80% actual spend - warning
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.alert_email]
}
# 100% actual spend - critical
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.alert_email]
}
# Forecasted to exceed budget
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [var.alert_email]
}
}
# Cost Anomaly Detection
resource "aws_ce_anomaly_monitor" "main" {
name = "aws-cost-anomaly-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_subscription" "main" {
name = "cost-anomaly-alerts"
frequency = "DAILY"
monitor_arn_list = [aws_ce_anomaly_monitor.main.arn]
subscriber {
address = var.alert_email
type = "EMAIL"
}
# Alert when spend is $50 above expected
threshold_expression {
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["50"]
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
Alarms set up. What happens when they fire?
ConvOps sends CloudWatch alarms to WhatsApp or Slack with AI root cause analysis. Investigate and act from your phone — no laptop needed.
Try ConvOps Free — 2 minutes to connect. No credit card. Works with the alarms you just set up.