Lab 14: Monitoring with CloudWatch Metrics, Alarms, Logs Insights, and X-Ray
Objective
Create a CloudWatch dashboard that monitors a Lambda function, configure alarms on error rate and latency, query Lambda logs using CloudWatch Logs Insights, enable X-Ray tracing on a Lambda function, and analyze the resulting traces.
Architecture Diagram
This lab builds a monitoring stack around a Lambda function that you instrument and observe:
API Gateway (REST API)
|
v
Lambda function: bootcamp-monitored-function
| (X-Ray active tracing enabled)
|
├── DynamoDB: bootcamp-monitoring-table (GetItem)
└── CloudWatch Logs: /aws/lambda/bootcamp-monitored-function
|
v
CloudWatch Logs Insights (queries)
CloudWatch Metrics
├── Lambda: Invocations, Duration, Errors, Throttles
├── Custom metric: BootcampApp/OrdersProcessed
└── Alarms:
├── bootcamp-error-alarm (Lambda Errors > 1)
└── bootcamp-latency-alarm (Lambda Duration p95 > 1000ms)
CloudWatch Dashboard: bootcamp-operations
└── Widgets: Lambda metrics, alarm status, error rate, latency percentiles
AWS X-Ray
└── Service map: API Gateway -> Lambda -> DynamoDB
└── Traces: individual request paths with timing
Prerequisites
- Completed Lab 09: Serverless Computing with Lambda (Lambda function creation and API Gateway integration)
- Completed Module 14: Monitoring and Observability lesson content
- Signed in to the AWS Management Console as the
bootcamp-adminIAM user - AWS CloudShell available (or the AWS CLI installed and configured locally)
Duration
60 minutes
Instructions
Step 1: Create a Lambda Function with Structured Logging
In this step, you create a Lambda function that uses structured JSON logging and interacts with DynamoDB. This function will be the target of your monitoring setup.
- Create a DynamoDB table named
bootcamp-monitoring-tablewith a partition keyid(String). Use on-demand capacity mode.
aws dynamodb create-table \
--table-name bootcamp-monitoring-table \
--attribute-definitions AttributeName=id,AttributeType=S \
--key-schema AttributeName=id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST \
--region us-east-1
- Create a Lambda execution role with permissions for DynamoDB, CloudWatch Logs, and X-Ray:
ROLE_ARN=$(aws iam create-role \
--role-name bootcamp-monitoring-lambda-role \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "lambda.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}' \
--query "Role.Arn" --output text)
echo "Role ARN: $ROLE_ARN"
aws iam attach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam attach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
aws iam attach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
- Create a file named
index.pywith the following Lambda function code that uses structured JSON logging:
import json
import logging
import os
import time
import boto3
logger = logging.getLogger()
logger.setLevel(logging.INFO)
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("TABLE_NAME", "bootcamp-monitoring-table"))
def handler(event, context):
start_time = time.time()
request_id = context.aws_request_id
logger.info(json.dumps({
"level": "INFO",
"message": "Request received",
"requestId": request_id,
"path": event.get("path", "/"),
"method": event.get("httpMethod", "GET")
}))
try:
item_id = event.get("queryStringParameters", {}).get("id", "default")
response = table.get_item(Key={"id": item_id})
item = response.get("Item", {"id": item_id, "status": "not found"})
duration = (time.time() - start_time) * 1000
logger.info(json.dumps({
"level": "INFO",
"message": "Request completed",
"requestId": request_id,
"durationMs": round(duration, 2),
"itemFound": "Item" in response
}))
return {
"statusCode": 200,
"body": json.dumps(item, default=str)
}
except Exception as e:
logger.error(json.dumps({
"level": "ERROR",
"message": "Request failed",
"requestId": request_id,
"error": str(e),
"errorType": type(e).__name__
}))
return {
"statusCode": 500,
"body": json.dumps({"error": "Internal server error"})
}
- Package and deploy the function:
zip function.zip index.py
# Wait 10 seconds for the role to propagate
sleep 10
aws lambda create-function \
--function-name bootcamp-monitored-function \
--runtime python3.12 \
--handler index.handler \
--role $ROLE_ARN \
--zip-file fileb://function.zip \
--timeout 30 \
--memory-size 256 \
--environment "Variables={TABLE_NAME=bootcamp-monitoring-table}" \
--tracing-config Mode=Active \
--region us-east-1
Expected result: The function is created with X-Ray active tracing enabled.
- Invoke the function a few times to generate metrics and logs:
for i in $(seq 1 10); do
aws lambda invoke \
--function-name bootcamp-monitored-function \
--payload '{"queryStringParameters": {"id": "test-'$i'"}}' \
--cli-binary-format raw-in-base64-out \
/tmp/response.json \
--region us-east-1
echo "Invocation $i: $(cat /tmp/response.json)"
done
Step 2: Create CloudWatch Alarms
In this step, you create two alarms: one for Lambda errors and one for high latency.
- Create an error alarm that triggers when the function produces more than 1 error in a 5-minute period:
aws cloudwatch put-metric-alarm \
--alarm-name bootcamp-error-alarm \
--metric-name Errors \
--namespace AWS/Lambda \
--dimensions Name=FunctionName,Value=bootcamp-monitored-function \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--alarm-description "Alarm when Lambda function errors exceed 1 in 5 minutes" \
--region us-east-1
- Create a latency alarm that triggers when the p95 duration exceeds 1000ms:
aws cloudwatch put-metric-alarm \
--alarm-name bootcamp-latency-alarm \
--metric-name Duration \
--namespace AWS/Lambda \
--dimensions Name=FunctionName,Value=bootcamp-monitored-function \
--extended-statistic p95 \
--period 300 \
--evaluation-periods 1 \
--threshold 1000 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--alarm-description "Alarm when Lambda p95 latency exceeds 1000ms" \
--region us-east-1
- Verify the alarms exist:
aws cloudwatch describe-alarms \
--alarm-name-prefix bootcamp \
--query "MetricAlarms[].{Name:AlarmName,State:StateValue}" \
--output table \
--region us-east-1
Expected result: Both alarms are listed with state OK or INSUFFICIENT_DATA.
Step 3: Query Logs with CloudWatch Logs Insights
In this step, you use CloudWatch Logs Insights to query the structured logs your Lambda function produces. Logs Insights uses a purpose-built query language that lets you filter, aggregate, and visualize log data without exporting it to another tool.
Navigate to the CloudWatch console, choose Logs Insights in the left navigation pane, and select the log group /aws/lambda/bootcamp-monitored-function.
Run the following queries and analyze the results:
- Find all log entries from the last hour:
fields @timestamp, @message
| sort @timestamp desc
| limit 20
- Calculate the average and p95 duration from the structured logs:
filter @message like /Request completed/
| parse @message '"durationMs":*,' as duration
| stats avg(duration) as avgMs, pct(duration, 95) as p95Ms, count(*) as requests by bin(5m)
- Find all error log entries:
filter @message like /ERROR/
| fields @timestamp, @message
| sort @timestamp desc
| limit 10
Hint: If you do not see structured log fields parsed automatically, use the
parsecommand to extract JSON fields from the@messagefield. CloudWatch Logs Insights automatically discovers JSON fields in log entries that are formatted as pure JSON objects.
After running these queries, experiment with writing your own query. For example, try to find all requests where itemFound is false.
Step 4: Analyze X-Ray Traces
In this step, you examine the X-Ray traces generated by your Lambda function.
Navigate to the X-Ray console (or the CloudWatch console under X-Ray traces) and explore the following:
-
Service map. View the service map to see the relationship between your Lambda function and DynamoDB. Note the average latency and request count on each node.
-
Traces. Choose a trace to see the detailed timeline. Identify:
- The total request duration
- The time spent in the Lambda initialization phase (cold start, if present)
- The time spent in the DynamoDB
GetItemcall - Any subsegments created by the AWS SDK
-
Trace filtering. Use the filter expression to find traces with errors or high latency:
service("bootcamp-monitored-function") { error = true }service("bootcamp-monitored-function") { responsetime > 0.5 }
Hint: If the service map is empty, wait 2 to 3 minutes for X-Ray to process the trace data. X-Ray aggregates data in near-real-time, but there can be a short delay.
Analyze the traces and identify which portion of the request takes the most time. Is it the Lambda initialization, the function code, or the DynamoDB call?
Step 5: Build a CloudWatch Dashboard
In this step, you create a CloudWatch dashboard that displays the four golden signals for your Lambda function.
Design and build a dashboard named bootcamp-operations that includes the following widgets:
- An alarm status widget showing the state of
bootcamp-error-alarmandbootcamp-latency-alarm - A line graph showing Lambda
Invocationsover time (traffic signal) - A line graph showing Lambda
Durationwith p50, p95, and p99 statistics (latency signal) - A line graph showing Lambda
Errorsover time (error signal) - A number widget showing the current
ConcurrentExecutionsvalue (saturation signal)
Hint: In the CloudWatch console, choose Dashboards, then Create dashboard. Add widgets using the Add widget button. For each metric widget, select the
AWS/Lambdanamespace, choose theFunctionNamedimension, and selectbootcamp-monitored-function. For the alarm status widget, choose the Alarm status widget type and select your two alarms.
After building the dashboard, review it and consider: does it give you enough information to diagnose a problem quickly? What additional widgets would you add for a production application?
Step 6: Publish a Custom Metric
In this step, you publish a custom metric to CloudWatch and add it to your dashboard.
Publish a custom metric that tracks the number of "orders processed" (simulated):
for i in $(seq 1 5); do
aws cloudwatch put-metric-data \
--namespace "BootcampApp" \
--metric-name "OrdersProcessed" \
--dimensions "Environment=lab,Service=order-api" \
--value $((RANDOM % 100 + 1)) \
--unit Count \
--region us-east-1
sleep 2
done
After publishing the data points, add a widget to your bootcamp-operations dashboard that displays the BootcampApp/OrdersProcessed metric.
Hint: When adding the widget, select the
BootcampAppnamespace (under Custom namespaces) instead of theAWS/Lambdanamespace. Custom metrics may take 1 to 2 minutes to appear in the console after the firstPutMetricDatacall.
Validation
Confirm the following:
- The Lambda function
bootcamp-monitored-functionis deployed with X-Ray active tracing enabled - Invoking the function produces structured JSON log entries in CloudWatch Logs
- Two CloudWatch alarms exist:
bootcamp-error-alarmandbootcamp-latency-alarm - CloudWatch Logs Insights queries return results from the function's log group
- The X-Ray service map shows the Lambda function and its DynamoDB dependency
- X-Ray traces show the request timeline with subsegments for DynamoDB calls
- A CloudWatch dashboard named
bootcamp-operationsdisplays Lambda metrics and alarm status - A custom metric
BootcampApp/OrdersProcessedappears in the CloudWatch console
Cleanup
Delete all resources created in this lab to avoid charges:
-
Delete the CloudWatch dashboard:
- Navigate to the CloudWatch console.
- Choose Dashboards, select
bootcamp-operations, and choose Delete.
-
Delete the CloudWatch alarms:
aws cloudwatch delete-alarms \
--alarm-names bootcamp-error-alarm bootcamp-latency-alarm \
--region us-east-1
- Delete the Lambda function:
aws lambda delete-function \
--function-name bootcamp-monitored-function \
--region us-east-1
- Delete the DynamoDB table:
aws dynamodb delete-table \
--table-name bootcamp-monitoring-table \
--region us-east-1
- Delete the IAM role and policies:
aws iam detach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam detach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess
aws iam detach-role-policy \
--role-name bootcamp-monitoring-lambda-role \
--policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
aws iam delete-role --role-name bootcamp-monitoring-lambda-role
- Delete the CloudWatch Logs log group:
aws logs delete-log-group \
--log-group-name /aws/lambda/bootcamp-monitored-function \
--region us-east-1
- Clean up the local files:
rm -f function.zip index.py /tmp/response.json
Warning: CloudWatch custom metrics cannot be deleted manually. They expire automatically after 15 months of no new data points. The custom metric namespace
BootcampAppwill disappear on its own after you stop publishing data.
Challenge (Optional)
Extend this lab by implementing the following:
-
Create an anomaly detection alarm on the Lambda
Durationmetric instead of a static threshold. Let CloudWatch learn the baseline for 24 hours, then test whether it detects an artificially introduced latency spike (add atime.sleep(2)to the Lambda function code). -
Create a composite alarm that triggers only when both the error alarm AND the latency alarm are in ALARM state simultaneously. Test it by modifying the Lambda function to both throw errors and introduce latency.
-
Set up a CloudWatch Logs subscription filter that streams all ERROR-level log entries to a Lambda function, which then sends a notification to an SNS topic. This creates a real-time error notification pipeline.
These challenges combine monitoring concepts with Lambda, SNS, and alarm configuration from previous modules.
AWS Bootcamp: From Novice to Architect Author: Samuel Ogunti License: CC BY-NC 4.0