CloudWatch Monitoring Guide: Metrics, Alarms, Logs, and Dashboards on AWS
Your application is live. Users are signing up. Everything looks great in the console. Then one morning you wake up to angry emails: the site has been down for four hours and nobody noticed. No alerts fired. No dashboards showed anything wrong. Because you never set up monitoring.
Monitoring is not optional. It is the difference between finding out about problems from your users and finding out before they do. This guide will teach you how to monitor your AWS applications properly using CloudWatch, from basic metrics to production-ready dashboards.
Prerequisites: You should understand EC2 instance types and Lambda before starting this article.
What You Will Learn
By the end of this article, you will be able to:
- Explain the Three Pillars of Observability and the Four Golden Signals, and evaluate which metrics matter most for a given workload
- Configure CloudWatch alarms with appropriate thresholds, evaluation periods, and composite logic to catch real problems without alert fatigue
- Design structured logging for your applications and troubleshoot production issues using CloudWatch Logs Insights queries
- Implement a CloudWatch dashboard that surfaces the Four Golden Signals at a glance for any web application
- Compare CloudWatch, X-Ray, Synthetics, and Contributor Insights, and decide which observability tool fits each use case
The Three Pillars of Observability
Before we dive into CloudWatch, let us establish the framework. Observability is the ability to understand what is happening inside your system by looking at its outputs. There are three pillars:
Metrics tell you WHAT is happening. CPU is at 95%. Error rate jumped to 5%. Latency doubled.
Logs tell you WHY it happened. The database connection pool is exhausted. A third-party API is returning 503 errors. A configuration change broke the authentication flow.
Traces tell you WHERE the problem is. The request spent 200ms in the API Gateway, 50ms in the Lambda function, and 3,000ms waiting for the database. The bottleneck is clear.
Together, these three pillars give you complete visibility into your system. CloudWatch handles metrics and logs natively. AWS X-Ray handles traces.
The Four Golden Signals
Before we touch CloudWatch, let us talk about what to monitor. Google's Site Reliability Engineering book popularized the Four Golden Signals. These four metrics tell you the health of any system:
1. Latency
How long does it take to serve a request? Track both successful request latency and failed request latency. A failed request that returns an error in 10 milliseconds is very different from one that times out after 30 seconds.
What to watch: Average response time, p50, p95, p99 percentiles.
Why it matters: Users notice latency before anything else. A slow page feels broken even if it is technically working.
Percentiles explained:
| Percentile | Meaning | Why It Matters |
|---|---|---|
| p50 (median) | Half of requests are faster than this | Your typical user experience |
| p95 | 95% of requests are faster than this | Your slow-but-not-extreme user experience |
| p99 | 99% of requests are faster than this | Your worst-case user experience |
| Average | Sum of all times / count | Hides outliers; use percentiles instead |
Always alert on p95 or p99, never on average. An average of 200ms can hide the fact that 5% of your users are waiting 10 seconds.
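To see why averages mislead, here is a tiny standard-library illustration with hypothetical latencies (95 fast requests, 5 very slow ones):

# Why averages hide outliers (Python standard library only)
import statistics

latencies = [120] * 95 + [10_000] * 5  # hypothetical: 95 fast, 5 very slow requests

print(f"average: {statistics.mean(latencies):.0f} ms")  # 614 ms -- looks fine
cuts = statistics.quantiles(latencies, n=100)  # 99 cut points: p1 .. p99
print(f"p50: {cuts[49]:.0f} ms")  # 120 ms -- typical user is happy
print(f"p95: {cuts[94]:.0f} ms")  # ~9500 ms -- the pain shows up here
print(f"p99: {cuts[98]:.0f} ms")  # 10000 ms

The average says the system is fine; the percentiles show that 1 in 20 users is suffering.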
2. Traffic
How many requests is your system handling? This tells you the demand on your system and helps you spot anomalies.
What to watch: Requests per second, concurrent users, messages processed per minute.
Why it matters: Sudden spikes might indicate a marketing campaign (good) or a DDoS attack (bad). Sudden drops might mean your DNS is broken or a deployment went wrong.
3. Errors
How many requests are failing? Track both explicit errors (HTTP 500s) and implicit errors (HTTP 200 responses that return wrong data).
What to watch: Error rate (errors / total requests), error count by type, HTTP status code distribution.
Why it matters: A 2% error rate might be acceptable. A 50% error rate means half your users are having a bad time. A sudden jump from 0.1% to 5% means something just broke.
4. Saturation
How full is your system? What percentage of capacity are you using? This is the early warning signal that tells you to add resources before things break.
What to watch: CPU utilization, memory usage, disk usage, queue depth, connection pool usage.
Why it matters: A server at 95% CPU is one traffic spike away from falling over. A database at 90% disk is one log rotation away from crashing.
CloudWatch: The Monitoring Foundation
Amazon CloudWatch is the core monitoring service on AWS. It collects metrics, stores logs, triggers alarms, and displays dashboards. Almost every AWS service sends metrics to CloudWatch automatically.
CloudWatch Metrics
Metrics are the numerical data points that represent the behavior of your resources over time. Every AWS service publishes metrics to CloudWatch.
Built-in metrics (free, no setup required):
| Service | Key Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count |
| RDS | CPUUtilization, FreeStorageSpace, DatabaseConnections, ReadLatency |
| Lambda | Invocations, Duration, Errors, Throttles, ConcurrentExecutions |
| SQS | NumberOfMessagesSent, ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests |
| API Gateway | Count, Latency, 4XXError, 5XXError |
| CloudFront | Requests, BytesDownloaded, 4xxErrorRate, 5xxErrorRate |
| NAT Gateway | BytesOutToDestination, PacketsDropCount, ErrorPortAllocation |
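Every metric in the table above can also be read programmatically. A minimal boto3 sketch (the instance ID is a placeholder) that pulls the last hour of CPUUtilization:

# Read a built-in metric with boto3 (instance ID is a placeholder)
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": "i-xxxxx"}],
            },
            "Period": 300,  # 5-minute data points (basic monitoring)
            "Stat": "Average",
        },
    }],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
)
for t, v in zip(response["MetricDataResults"][0]["Timestamps"],
                response["MetricDataResults"][0]["Values"]):
    print(t, round(v, 1))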
Important: EC2 does NOT send memory or disk usage metrics by default. You need the CloudWatch Agent for those:
# Install the CloudWatch Agent on an EC2 instance
sudo yum install -y amazon-cloudwatch-agent
# Configure the agent (launches an interactive wizard that writes config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Load the wizard's config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
The CloudWatch Agent configuration file specifies which metrics to collect:
{
"metrics": {
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent"],
"metrics_collection_interval": 60
},
"disk": {
"measurement": ["disk_used_percent"],
"resources": ["*"],
"metrics_collection_interval": 60
}
}
}
}
Custom metrics: You can publish your own metrics for application-specific data:
# Publish a custom metric
aws cloudwatch put-metric-data \
--namespace "MyApp" \
--metric-name "OrdersProcessed" \
--value 42 \
--unit Count
# Publish with dimensions (to track per-environment)
aws cloudwatch put-metric-data \
--namespace "MyApp" \
--metric-name "OrdersProcessed" \
--value 42 \
--unit Count \
--dimensions Environment=prod,Service=order-api
# Publish from application code (Python example)
# import boto3
# cloudwatch = boto3.client('cloudwatch')
# cloudwatch.put_metric_data(
# Namespace='MyApp',
# MetricData=[{
# 'MetricName': 'OrdersProcessed',
# 'Value': 42,
# 'Unit': 'Count',
# 'Dimensions': [
# {'Name': 'Environment', 'Value': 'prod'},
# {'Name': 'Service', 'Value': 'order-api'}
# ]
# }]
# )
Metric resolution:
- Standard resolution: Data points at 1-minute intervals (free for basic metrics)
- High resolution: Data points at 1-second intervals (for custom metrics, costs more)
For most applications, standard resolution is sufficient. Use high resolution only for time-sensitive metrics like trading systems or real-time gaming.
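If you do need high resolution, the only change when publishing is the StorageResolution field. A minimal boto3 sketch (namespace and metric name reused from the earlier example):

# Publish a high-resolution (1-second) custom metric with boto3
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[{
        "MetricName": "OrdersProcessed",
        "Value": 42,
        "Unit": "Count",
        "StorageResolution": 1,  # 1 = high resolution; 60 (the default) = standard
    }],
)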
CloudWatch Alarms
Alarms watch a single metric (or a metric math expression) and take action when it crosses a threshold. This is how you get notified before users do.
Alarm states:
- OK: The metric is within the threshold
- ALARM: The metric has breached the threshold
- INSUFFICIENT_DATA: Not enough data to determine the state
Setting Up Your First Alarm
Create an alarm that notifies you when EC2 CPU exceeds 80%:
# Create an SNS topic for notifications
aws sns create-topic --name monitoring-alerts
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:monitoring-alerts \
--protocol email \
--notification-endpoint your-email@example.com
# Create the alarm
aws cloudwatch put-metric-alarm \
--alarm-name "High-CPU-Alarm" \
--alarm-description "CPU utilization exceeds 80% for 5 minutes" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:monitoring-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:monitoring-alerts \
--dimensions Name=InstanceId,Value=i-xxxxx
This alarm triggers when average CPU exceeds 80% for two consecutive 5-minute periods (10 minutes total). This prevents false alarms from brief spikes. The --ok-actions parameter also sends a notification when the alarm recovers, so you know the problem is resolved.
Alarm Math: Understanding Evaluation
The evaluation logic is critical to getting alarms right:
| Setting | Value | Meaning |
|---|---|---|
| Period | 300 seconds | Each evaluation window is 5 minutes |
| Evaluation Periods | 2 | Must breach for 2 consecutive periods |
| Datapoints to Alarm | 2 (default = evaluation periods) | 2 of 2 periods must breach |
| Total detection time | 10 minutes | 2 periods x 5 minutes each |
For faster detection, reduce the period. For fewer false positives, increase evaluation periods. Find the balance that works for your application.
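A third option is "M out of N" logic: set Datapoints to Alarm lower than Evaluation Periods so a single noisy period does not trigger, without stretching total detection time. A boto3 sketch (instance ID and topic ARN are placeholders):

# "M out of N" alarm: 3 breaching datapoints out of the last 5 periods trigger it
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="High-CPU-3-of-5",
    MetricName="CPUUtilization",
    Namespace="AWS/EC2",
    Statistic="Average",
    Period=300,
    Threshold=80,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=5,
    DatapointsToAlarm=3,  # 3 of the last 5 periods must breach
    Dimensions=[{"Name": "InstanceId", "Value": "i-xxxxx"}],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:monitoring-alerts"],
)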
Alarms Every Application Should Have
Here are the essential alarms for a typical web application:
| Alarm | Metric | Threshold | Why |
|---|---|---|---|
| High CPU | CPUUtilization | > 80% for 10 min | Server is struggling |
| High error rate | HTTPCode_Target_5XX_Count | > 10 in 5 min | Application is failing |
| High latency | TargetResponseTime | p95 > 2 seconds | Users are waiting |
| Low disk space | FreeStorageSpace (RDS) | < 20% remaining | Database will crash |
| Queue backing up | ApproximateAgeOfOldestMessage | > 300 seconds | Processing is falling behind |
| Lambda errors | Errors (Lambda) | > 5 in 5 min | Functions are failing |
| Lambda throttles | Throttles (Lambda) | > 0 in 5 min | Hitting concurrency limits |
| Billing | EstimatedCharges | > $10 (or your budget) | Cost protection |
| Healthy hosts | HealthyHostCount (ALB) | < 2 | Losing capacity |
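Rather than clicking these together one at a time, you can provision them in a loop. A minimal sketch covering two rows of the table, assuming the SNS topic from earlier and placeholder resource names:

# Provision several of the table's alarms from one definition list
import boto3

TOPIC = "arn:aws:sns:us-east-1:123456789012:monitoring-alerts"
ALARMS = [
    # (name, namespace, metric, stat, threshold, comparison, dimensions)
    ("High-CPU", "AWS/EC2", "CPUUtilization", "Average", 80,
     "GreaterThanThreshold", [{"Name": "InstanceId", "Value": "i-xxxxx"}]),
    ("Lambda-Throttles", "AWS/Lambda", "Throttles", "Sum", 0,
     "GreaterThanThreshold", [{"Name": "FunctionName", "Value": "my-function"}]),
]

cloudwatch = boto3.client("cloudwatch")
for name, ns, metric, stat, threshold, op, dims in ALARMS:
    cloudwatch.put_metric_alarm(
        AlarmName=name, Namespace=ns, MetricName=metric, Statistic=stat,
        Period=300, EvaluationPeriods=2, Threshold=threshold,
        ComparisonOperator=op, Dimensions=dims,
        AlarmActions=[TOPIC], OKActions=[TOPIC],  # notify on breach and recovery
    )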
Composite Alarms
Sometimes a single metric is not enough context. Composite alarms combine multiple alarms with AND/OR logic:
# Alarm only when BOTH high CPU AND high error rate
aws cloudwatch put-composite-alarm \
--alarm-name "Critical-App-Health" \
--alarm-rule "ALARM(High-CPU-Alarm) AND ALARM(High-Error-Rate)" \
--alarm-actions arn:aws:sns:us-east-1:123456789012:critical-alerts
This reduces alert fatigue. High CPU alone might be fine (batch processing). High CPU plus high errors means something is wrong.
Anomaly Detection Alarms
For metrics without obvious fixed thresholds, use anomaly detection:
aws cloudwatch put-metric-alarm \
--alarm-name "Unusual-Request-Count" \
--evaluation-periods 3 \
--threshold-metric-id "ad1" \
--comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
--metrics '[
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "RequestCount",
"Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/xxxxx"}]
},
"Period": 300,
"Stat": "Sum"
},
"ReturnData": true
},
{
"Id": "ad1",
"Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
"ReturnData": true
}
]'
CloudWatch learns the normal pattern (daily cycles, weekend drops) and alerts when the metric deviates significantly.
CloudWatch Logs
Logs are the narrative of what happened in your application. Metrics tell you something is wrong. Logs tell you why.
Log Concepts
Log Group: A collection of log streams with the same retention, monitoring, and access settings. Usually one per application or service. Example: /aws/lambda/my-function or /my-app/production.
Log Stream: A sequence of log events from a single source. Usually one per instance or function invocation. Example: i-xxxxx or 2026/05/12/[$LATEST]abc123.
Log Event: A single log entry with a timestamp and a message.
Structured Logging
Do not log unstructured text. Log structured JSON. This makes searching, filtering, and analyzing logs dramatically easier.
Bad (unstructured):
Processing order 12345 for user john@example.com, total $99.99
Good (structured JSON):
{
"level": "INFO",
"message": "Processing order",
"orderId": "12345",
"userId": "john@example.com",
"total": 99.99,
"timestamp": "2026-05-12T14:30:00Z",
"requestId": "abc-123-def",
"service": "order-processor"
}
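If you are wondering how to produce logs in that shape, here is a minimal standard-library sketch (field names mirror the example above; the service name is a placeholder, and most logging frameworks have a JSON formatter that does this for you):

# Minimal structured JSON logging with the Python standard library
import json, logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": "order-processor",
        }
        # Merge request-specific fields passed via extra={"context": {...}}
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing order",
            extra={"context": {"orderId": "12345", "total": 99.99}})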
With structured logs, you can use CloudWatch Logs Insights to query:
fields @timestamp, orderId, total
| filter level = "ERROR"
| sort @timestamp desc
| limit 20
CloudWatch Logs Insights
Logs Insights lets you run queries across your log groups using a purpose-built, pipe-based query language (think SQL crossed with Unix pipes). This is enormously powerful for debugging.
Find the most common errors:
fields @message
| filter level = "ERROR"
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10
Calculate p95 response time from application logs:
fields @timestamp, responseTime
| filter path = "/api/users"
| stats avg(responseTime) as avg,
percentile(responseTime, 95) as p95,
percentile(responseTime, 99) as p99
by bin(5m)
Find slow requests:
fields @timestamp, path, responseTime, userId
| filter responseTime > 3000
| sort responseTime desc
| limit 50
Identify error patterns over time:
fields @timestamp, level
| filter level = "ERROR"
| stats count(*) as errors by bin(15m)
| sort @timestamp asc
Find Lambda cold starts:
fields @timestamp, @duration, @initDuration
| filter ispresent(@initDuration)
| stats count(*) as coldStarts,
avg(@initDuration) as avgColdStartMs,
max(@initDuration) as maxColdStartMs
by bin(1h)
Metric Filters
Metric filters extract metrics from log data automatically. This is how you create custom metrics from application logs without changing your code:
# Create a metric filter that counts ERROR logs
aws logs put-metric-filter \
--log-group-name /my-app/production \
--filter-name ErrorCount \
--filter-pattern '{ $.level = "ERROR" }' \
--metric-transformations \
metricName=ApplicationErrors,metricNamespace=MyApp,metricValue=1
# Now you can alarm on this custom metric
aws cloudwatch put-metric-alarm \
--alarm-name "High-Application-Errors" \
--metric-name ApplicationErrors \
--namespace MyApp \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:monitoring-alerts
Log Retention
CloudWatch Logs never expire by default. Set a retention policy to control costs:
aws logs put-retention-policy \
--log-group-name /my-app/production \
--retention-in-days 30
| Retention | Use Case |
|---|---|
| 1-7 days | Development, debugging |
| 30 days | Standard production |
| 90 days | Compliance requirements |
| 365 days | Audit and regulatory |
| Never expire | Legal hold |
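Retention is easy to forget on newly created log groups. A small sketch that applies a 30-day policy to every group that has none (groups set to "Never expire" simply omit the retention key):

# Apply a 30-day retention policy to every log group missing one
import boto3

logs = boto3.client("logs")
paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:  # "Never expire" groups omit this key
            logs.put_retention_policy(
                logGroupName=group["logGroupName"], retentionInDays=30)
            print("Set 30-day retention on", group["logGroupName"])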
For cost optimization, export old logs to S3, where storage is cheaper (about $0.023/GB-month in S3 Standard versus $0.03/GB-month in CloudWatch Logs, with even larger savings in Glacier tiers).
CloudWatch Dashboards
Dashboards are visual displays of your metrics. A good dashboard tells a story at a glance: is the system healthy, and if not, where is the problem?
Building an Effective Dashboard
Rule 1: One dashboard per service or application. Do not cram everything into one screen. Create separate dashboards for your API, your database, your queue processing, and your overall account health.
Rule 2: Put the Four Golden Signals at the top. Latency, traffic, errors, saturation. If these are all green, your system is healthy.
Rule 3: Use consistent time ranges. All widgets on a dashboard should show the same time range. Mixed time ranges make correlation impossible.
Rule 4: Add annotations for deployments. Mark when deployments happened so you can correlate changes in metrics with code changes.
Rule 5: Include alarm status widgets. Show current alarm states so you can see at a glance if anything needs attention.
# Create a dashboard
aws cloudwatch put-dashboard \
--dashboard-name MyApp-Production \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "API Latency (p95)",
"metrics": [
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/my-alb/xxxxx",
{"stat": "p95"}]
],
"period": 60,
"view": "timeSeries"
}
},
{
"type": "metric",
"x": 12, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "Request Count",
"metrics": [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/my-alb/xxxxx",
{"stat": "Sum"}]
],
"period": 60,
"view": "timeSeries"
}
},
{
"type": "metric",
"x": 0, "y": 6, "width": 12, "height": 6,
"properties": {
"title": "5XX Errors",
"metrics": [
["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/my-alb/xxxxx",
{"stat": "Sum"}]
],
"period": 60,
"view": "timeSeries"
}
},
{
"type": "metric",
"x": 12, "y": 6, "width": 12, "height": 6,
"properties": {
"title": "CPU Utilization",
"metrics": [
["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "my-asg",
{"stat": "Average"}]
],
"period": 60,
"view": "timeSeries"
}
}
]
}'
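Dashboard bodies like this get repetitive fast. One approach (a sketch, reusing the same placeholder load balancer name) is to generate the widget list in code and hand the serialized body to put_dashboard:

# Generate a Golden Signals dashboard body programmatically
import json
import boto3

ALB = "app/my-alb/xxxxx"  # placeholder load balancer dimension
PANELS = [
    ("API Latency (p95)", "TargetResponseTime", "p95"),
    ("Request Count", "RequestCount", "Sum"),
    ("5XX Errors", "HTTPCode_Target_5XX_Count", "Sum"),
]

widgets = []
for i, (title, metric, stat) in enumerate(PANELS):
    widgets.append({
        "type": "metric",
        "x": (i % 2) * 12, "y": (i // 2) * 6, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [["AWS/ApplicationELB", metric,
                         "LoadBalancer", ALB, {"stat": stat}]],
            "period": 60,
            "view": "timeSeries",
        },
    })

boto3.client("cloudwatch").put_dashboard(
    DashboardName="MyApp-Production",
    DashboardBody=json.dumps({"widgets": widgets}),
)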
Beyond CloudWatch: The Observability Ecosystem
CloudWatch is the foundation, but AWS offers additional observability tools:
AWS X-Ray (Distributed Tracing)
X-Ray traces requests as they flow through your distributed application. If a request hits API Gateway, then Lambda, then DynamoDB, then SQS, X-Ray shows you the entire path and where time was spent.
Use X-Ray when:
- You have a microservices architecture
- You need to identify which service in the chain is slow
- You want to understand the flow of requests
CloudWatch Synthetics (Canary Tests)
Synthetics runs scripted tests against your endpoints at regular intervals. Think of it as a robot that continuously checks if your website is working.
Canary runs every 5 minutes -> Loads your homepage -> Checks for expected content -> Alerts if it fails
Use Synthetics to catch problems before users report them, especially for public-facing websites and APIs.
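If you only need an availability probe rather than full browser scripting, you can approximate the same idea by hand: a Lambda function on an EventBridge schedule that fetches a page and publishes a pass/fail metric you can alarm on. This is a sketch, not the Synthetics canary API; the URL and metric name are placeholders:

# Hand-rolled availability probe (schedule it, e.g., every 5 minutes)
import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    ok = 0
    try:
        with urllib.request.urlopen("https://example.com/", timeout=10) as resp:
            ok = 1 if resp.status == 200 else 0
    except Exception:
        ok = 0  # DNS failure, timeout, or non-2xx all count as down
    # 1 = site reachable, 0 = failed; alarm on Minimum < 1
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{"MetricName": "HomepageUp", "Value": ok, "Unit": "Count"}],
    )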
CloudWatch Application Insights
Application Insights automatically detects and sets up monitoring for common application patterns (.NET, SQL Server, IIS, etc.). It reduces the manual work of configuring metrics and alarms.
CloudWatch Contributor Insights
Contributor Insights analyzes log data and creates time-series data showing the top contributors. For example: which IP addresses are generating the most 5XX errors, or which API endpoints are the slowest.
# Create a contributor insights rule
aws cloudwatch put-insight-rule \
--rule-name "TopSlowEndpoints" \
--rule-state ENABLED \
--rule-definition '{
"Schema": {"Name": "CloudWatchLogRule", "Version": 1},
"LogGroupNames": ["/my-app/production"],
"LogFormat": "JSON",
"Contribution": {
"Keys": ["$.path"],
"ValueOf": "$.responseTime",
"Filters": [{"Match": "$.level", "EqualTo": "INFO"}]
},
"AggregateOn": "Sum"
}'
Troubleshooting Common Errors
Custom metric not appearing
After calling put-metric-data, custom metrics can take up to 15 minutes to appear in the CloudWatch console. Verify that your namespace, metric name, and dimensions are spelled exactly as you published them (they are case-sensitive). A common mistake is publishing to namespace MyApp but searching in myapp. Run aws cloudwatch list-metrics --namespace MyApp to confirm your metric exists. Also check that the IAM role or user has cloudwatch:PutMetricData permission.
Alarm stuck in INSUFFICIENT_DATA
This state means CloudWatch is not receiving enough data points to evaluate the alarm. The most common causes: the metric has no data during the evaluation period (for example, an EC2 instance that is stopped), the period is shorter than the metric's reporting interval, or the dimensions on the alarm do not match the dimensions on the metric. Run aws cloudwatch describe-alarms --alarm-names "Your-Alarm-Name" and compare the Namespace, MetricName, and Dimensions fields against the actual metric. For metrics that are sometimes legitimately absent, set TreatMissingData to notBreaching (or ignore) instead of the default, missing.
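In boto3, that fix is the TreatMissingData parameter on put_metric_alarm (a sketch reusing the earlier CPU alarm; the instance ID is a placeholder):

# Treat missing data as "not breaching" so gaps do not flip the alarm state
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="High-CPU-Alarm",
    MetricName="CPUUtilization", Namespace="AWS/EC2",
    Statistic="Average", Period=300, Threshold=80,
    ComparisonOperator="GreaterThanThreshold", EvaluationPeriods=2,
    TreatMissingData="notBreaching",  # options: breaching, notBreaching, ignore, missing
    Dimensions=[{"Name": "InstanceId", "Value": "i-xxxxx"}],
)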
Logs Insights query returns empty First confirm you selected the correct log group and time range. Then verify that your filter pattern matches the log format. If your logs are JSON, use the $.fieldName syntax (for example, filter $.level = "ERROR"). If your logs are plain text, use filter @message like /pattern/. Also check log retention: if the retention policy is shorter than the time range you are querying, older logs will already be deleted.
Monitoring Cost on AWS
CloudWatch pricing can add up if you are not careful:
| Feature | Free Tier | Pricing After |
|---|---|---|
| Basic metrics | Unlimited | Free (built-in service metrics) |
| Detailed monitoring | 10 metrics | $0.30/metric/month (1-minute EC2 metrics) |
| Custom metrics | 10 metrics | $0.30/metric/month |
| Alarms | 10 alarms | $0.10/alarm/month |
| Logs ingestion | 5 GB/month | $0.50/GB |
| Logs storage | 5 GB/month | $0.03/GB/month |
| Dashboard | 3 dashboards | $3/dashboard/month |
| Logs Insights queries | None | $0.005/GB scanned |
Cost optimization tips:
- Use standard resolution metrics unless you need 1-second granularity
- Set log retention policies aggressively (30 days covers most needs)
- Export old logs to S3 for long-term storage (see the sketch after this list)
- Use metric filters instead of Logs Insights for repeated queries
- Delete unused dashboards and alarms
- Reduce verbose logging in production (debug logs cost money)
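For the S3 export mentioned above, CloudWatch Logs has a built-in export task. A hedged sketch (the bucket name is a placeholder, and the bucket policy must allow the CloudWatch Logs service principal to write to it):

# Export the last 30 days of a log group to S3
import time
import boto3

now_ms = int(time.time() * 1000)
boto3.client("logs").create_export_task(
    taskName="export-production-logs",
    logGroupName="/my-app/production",
    fromTime=now_ms - 30 * 24 * 3600 * 1000,  # timestamps are in milliseconds
    to=now_ms,
    destination="my-log-archive-bucket",  # placeholder bucket name
    destinationPrefix="my-app/production",
)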
The Monitoring Checklist
Here is what to set up for any production application:
Day 1 (Essential)
- Enable CloudTrail for API activity logging
- Create a billing alarm ($5 or $10 threshold)
- Enable detailed monitoring for EC2 instances (1-minute metrics)
- Set up an SNS topic for alarm notifications
- Create alarms for CPU, errors, and disk space
Week 1 (Important)
- Enable VPC Flow Logs
- Configure application logs to CloudWatch Logs (structured JSON)
- Set log retention policies
- Build a dashboard with the Four Golden Signals
- Create alarms for latency p95 and error rates
- Install CloudWatch Agent for memory and disk metrics
Month 1 (Production-Ready)
- Set up CloudWatch Synthetics for endpoint health checks
- Enable X-Ray tracing for distributed applications
- Create composite alarms to reduce alert fatigue
- Set up log-based metric filters for business metrics
- Export logs to S3 for long-term retention
- Review and tune alarm thresholds based on baseline data
- Set up Contributor Insights for top-N analysis
Common Monitoring Anti-Patterns
Alert fatigue. Too many alarms firing for non-critical issues. People start ignoring them, and then they miss the real ones. Be selective. Only alert on things that require action.
Dashboard overload. A dashboard with 50 widgets is useless. If you cannot understand the system's health in 5 seconds of looking at the dashboard, simplify it.
Monitoring only infrastructure. CPU and memory tell you about the server, not the user experience. Always include application-level metrics: response time, error rate, and business metrics (orders processed, sign-ups completed).
No baselines. An alarm that fires when CPU exceeds 80% is meaningless if your normal CPU is 75%. Observe your system under normal conditions first, then set thresholds based on deviations from normal.
Ignoring logs. Metrics tell you something is wrong. Logs tell you why. Without good logging, you will spend hours guessing during an incident.
Alerting on averages. Average latency of 200ms sounds fine. But if 5% of users experience 10-second latency, that is terrible. Alert on percentiles (p95, p99), not averages.
How This Shows Up in Architecture Decisions
- CloudWatch collects metrics and logs. CloudTrail records API calls. They are different services with different purposes.
- Basic monitoring is free (5-minute intervals for EC2). Detailed monitoring costs extra (1-minute intervals).
- Custom metrics use put-metric-data. You publish them from your application.
- Metric filters extract metrics from log data (e.g., count the number of ERROR lines).
- CloudWatch Logs agent or the unified CloudWatch agent sends logs and custom metrics from EC2 instances to CloudWatch.
- Alarms can trigger Auto Scaling, SNS, or EC2 actions (stop, terminate, reboot).
- Logs Insights is for ad-hoc querying. Metric filters are for continuous monitoring.
- CloudWatch does NOT collect memory or disk metrics by default. You need the CloudWatch Agent.
- Anomaly detection uses machine learning to identify unusual metric behavior.
- Composite alarms combine multiple alarms with AND/OR logic.
Start Monitoring Today
Here is the honest truth about monitoring: most teams set up too many alarms and act on too few of them. Dashboards get built during an incident, then never looked at again. Logs pile up with no retention policy until someone notices the bill.
Do not be that team. Set up one alarm today. Just one. A billing alarm for $5, a CPU alarm on your busiest instance, whatever makes sense for your workload. Get the SNS notification flowing to your inbox. Then tomorrow, add structured logging to one service. The day after, build a single dashboard with the Four Golden Signals.
Small, consistent steps beat a massive monitoring overhaul that never actually ships.
Pricing note: CloudWatch alarm, custom metric, log ingestion, and dashboard costs cited in this article are for us-east-1 and were verified in May 2026. Check the AWS Pricing Calculator for current rates in your Region.
Hands-On Challenge
Set up a complete observability stack for a running AWS application (an EC2 instance, Lambda function, or ECS service).
Success criteria:
- Publish at least one custom metric from your application using put-metric-data (for example, orders processed or request count)
- Create a CloudWatch alarm on that custom metric with appropriate threshold and evaluation periods, connected to an SNS topic that emails you
- Build a CloudWatch dashboard with at least four widgets covering the Four Golden Signals: latency, traffic, errors, and saturation
- Configure structured JSON logging from your application to a CloudWatch log group with a 30-day retention policy
- Write a Logs Insights query that returns the top 10 most frequent errors from your log group in the past 24 hours
- Create a composite alarm that combines two or more individual alarms with AND logic to reduce false positives
Build it yourself: This topic is covered hands-on in Module 14: Monitoring and Observability of our AWS Bootcamp.