Why This Phase Exists
You cannot improve what you cannot measure. You cannot fix what you cannot see. You cannot prevent what you cannot predict.
Every piece of infrastructure you have deployed across the previous eight phases is running right now. EC2 instances are consuming CPU cycles. Lambda functions are processing invocations. RDS databases are accepting connections. ALBs are routing requests. S3 buckets are serving objects. ECS tasks are executing containers. Every one of those services is producing signals: metrics that indicate health, logs that record behavior, traces that reveal request paths, and cost data that quantifies resource consumption.
Without operational visibility, you are flying blind. An EC2 instance reaches 98% CPU utilization and your users experience degraded performance for hours before anyone notices. A Lambda function starts throwing errors after a deployment and the team discovers it only when customers report broken functionality. A misconfigured service silently accumulates cost at $200/day and the invoice arrives 30 days later with an unexplained $6,000 charge. A single slow downstream dependency adds 3 seconds of latency to 40% of requests, but your monitoring only shows average response times that look acceptable.
These are not hypothetical scenarios. They are the operational reality of running distributed systems without observability. This phase teaches you to run infrastructure like a professional: to detect problems before your users do, to diagnose root causes in minutes rather than hours, to optimize spending systematically rather than reactively, to recover from failures automatically rather than manually, and to maintain operational standards at scale.
The distinction between a junior engineer and a senior engineer is not the ability to deploy infrastructure. Deployment is straightforward. The distinction is the ability to operate that infrastructure reliably, efficiently, and sustainably over months and years. Phase 9 is where you develop that operational maturity.
What You Will Master
By the end of Phase 9, you will be able to:
- Design comprehensive monitoring architectures using CloudWatch metrics, alarms, and dashboards that provide real-time visibility into infrastructure and application health
- Build centralized logging solutions that aggregate logs from EC2, Lambda, ECS, API Gateway, and VPC Flow Logs into queryable, searchable, alertable systems
- Implement distributed tracing with X-Ray that visualizes request flow across microservices and identifies latency bottlenecks at the subsegment level
- Construct cost management frameworks using Cost Explorer, Budgets, and Cost Anomaly Detection that provide visibility, governance, and automated alerts
- Apply resource optimization strategies using Compute Optimizer, Trusted Advisor, and right-sizing analysis that reduce waste without sacrificing performance
- Architect backup solutions using AWS Backup with cross-region, cross-account vault replication and compliance-driven retention policies
- Design disaster recovery architectures across the four DR strategies (backup/restore, pilot light, warm standby, multi-site active/active) with defined RTO/RPO targets
- Automate operational tasks using Systems Manager with Run Command, Patch Manager, State Manager, and maintenance windows that eliminate manual intervention
Modules in This Phase
| Module | Title | Key Focus Areas |
|---|---|---|
| 61 | CloudWatch Metrics & Alarms | Namespaces, dimensions, custom metrics, alarm states, composite alarms, dashboards, anomaly detection |
| 62 | CloudWatch Logs | Log groups, log streams, Logs Insights queries, metric filters, subscription filters, structured logging |
| 63 | Distributed Tracing with X-Ray | Traces, segments, subsegments, service maps, sampling rules, annotations, OpenTelemetry integration |
| 64 | Cost Management | Cost Explorer, Budgets, Cost Anomaly Detection, CUR, cost allocation tags, Savings Plans |
| 65 | Resource Optimization | Compute Optimizer, Trusted Advisor, right-sizing, instance scheduling, spot strategies, storage tiering |
| 66 | AWS Backup | Backup plans, vault lock, cross-region/cross-account replication, compliance frameworks, restore testing |
| 67 | Disaster Recovery | DR strategies, RTO/RPO planning, pilot light, warm standby, multi-site, failover automation |
| 68 | Systems Manager | Run Command, Patch Manager, State Manager, Session Manager, maintenance windows, automation documents |
The Progressive Path
This phase follows a deliberate progression from visibility through optimization to operational automation.
Modules 61 through 63 establish the three pillars of observability. Module 61 starts with metrics because they are the fastest signal. A metric tells you that something changed: CPU spiked, error rate increased, latency degraded. Metrics are cheap, fast, and aggregatable. They answer the question "is something wrong?" Module 62 adds logs, which provide the context metrics lack. A metric tells you errors increased. A log tells you why: the specific exception, the request parameters that triggered it, the downstream service that failed. Logs answer "what went wrong?" Module 63 introduces traces, which reveal causality across distributed systems. When a request touches an API Gateway, a Lambda function, a DynamoDB table, and an SQS queue, a trace shows you exactly where the bottleneck occurs. Traces answer "where did it go wrong?"
Together, these three pillars form a complete observability stack. Metrics detect. Logs explain. Traces localize. No single pillar is sufficient. A production architecture requires all three operating in coordination.
Module 64 shifts to cost management. Cost is a first-class operational metric. A service that runs correctly but costs three times more than it should is operationally deficient. Cost Explorer provides visibility. Budgets provide governance. Cost Anomaly Detection provides alerting. The Cost and Usage Report provides the raw data for custom analysis. Module 65 follows with resource optimization: the actions you take once cost visibility reveals inefficiency. Right-sizing, instance scheduling, Savings Plans, spot utilization, and storage lifecycle policies turn cost awareness into cost reduction.
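The alerting idea can be illustrated with a deliberately simple statistical stand-in. This is a minimal sketch, not the ML model that Cost Anomaly Detection actually uses: it flags any day whose spend exceeds the trailing mean by a few standard deviations. The function name `flag_anomalies`, the 7-day baseline, the 3-sigma limit, and the spend figures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def flag_anomalies(daily_spend: Sequence[float], baseline_days: int = 7,
                   sigmas: float = 3.0) -> List[int]:
    """Return the indexes of days whose spend exceeds the trailing baseline.

    Each day is compared against the mean plus `sigmas` standard deviations
    of the preceding `baseline_days` days.
    """
    flagged = []
    for day in range(baseline_days, len(daily_spend)):
        baseline = daily_spend[day - baseline_days:day]
        limit = mean(baseline) + sigmas * stdev(baseline)
        if daily_spend[day] > limit:
            flagged.append(day)
    return flagged


# Hypothetical daily spend in USD: a steady week, then a misconfigured
# service adds roughly $200/day.
spend = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 300.0]

anomalous_days = flag_anomalies(spend)
extra_per_day = spend[7] - mean(spend[:7])
print(anomalous_days)                # indexes of flagged days
print(round(extra_per_day * 30))     # projected 30-day overrun in USD
```

A trailing baseline like this absorbs the spike after a day or two and stops flagging it, which is one reason a purpose-built service is preferable to a hand-rolled threshold.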
Module 66 introduces data protection through backup. Every production system requires a backup strategy with defined retention, tested restores, and cross-region replication for critical data. Module 67 extends data protection into disaster recovery: what happens when an entire region becomes unavailable, or when a catastrophic failure requires rebuilding infrastructure from scratch. The four DR strategies (backup/restore, pilot light, warm standby, multi-site active/active) represent a continuum of cost versus recovery speed.
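The cost-versus-recovery-speed continuum can be sketched as a lookup: walk the strategies from cheapest to most expensive and stop at the first one that meets the required RTO and RPO. The RTO/RPO figures below are illustrative placeholders, not AWS guarantees, and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Strategy(NamedTuple):
    name: str
    rto_minutes: float  # assumed achievable recovery time objective
    rpo_minutes: float  # assumed achievable recovery point objective


# Ordered cheapest first. The numbers are placeholders for illustration only;
# real values depend on the workload and how the strategy is implemented.
STRATEGIES = (
    Strategy("backup/restore", rto_minutes=480, rpo_minutes=1440),
    Strategy("pilot light", rto_minutes=60, rpo_minutes=15),
    Strategy("warm standby", rto_minutes=10, rpo_minutes=5),
    Strategy("multi-site active/active", rto_minutes=1, rpo_minutes=1),
)


def cheapest_strategy(required_rto: float, required_rpo: float,
                      strategies: Sequence[Strategy] = STRATEGIES
                      ) -> Optional[Strategy]:
    """Return the first (cheapest) strategy meeting both targets, or None."""
    for strategy in strategies:
        if (strategy.rto_minutes <= required_rto
                and strategy.rpo_minutes <= required_rpo):
            return strategy
    return None


# A workload that tolerates 2 hours of downtime and 30 minutes of data loss.
choice = cheapest_strategy(required_rto=120, required_rpo=30)
print(choice.name if choice else "no strategy meets the targets")
```

The point of the sketch is the ordering: tighter targets push the selection toward the more expensive end of the list.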
Module 68 concludes with Systems Manager, the operational automation platform. Much of the routine work behind monitoring, logging, cost management, and recovery can be automated through Systems Manager: patching operating systems across a fleet, running diagnostic commands without SSH access, enforcing configuration state, and scheduling maintenance windows. Systems Manager transforms manual runbooks into automated workflows.
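As a rough illustration of running a diagnostic command without SSH, the sketch below builds the keyword arguments for the Systems Manager `SendCommand` API as exposed by boto3's `send_command`. The helper name `run_command_request`, the tag key and value, and the concurrency settings are assumptions; the code only builds the request, so it runs without AWS credentials.

```python
from typing import Any, Dict, List


def run_command_request(tag_key: str, tag_value: str,
                        commands: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for an SSM SendCommand call.

    Targets instances by tag rather than by instance ID, so the same request
    works as the fleet grows or shrinks.
    """
    return {
        # AWS-managed document that runs shell commands on Linux instances.
        "DocumentName": "AWS-RunShellScript",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"commands": commands},
        # Roll out to at most 10% of matching instances at a time.
        "MaxConcurrency": "10%",
        # Limit how many targets may fail before the rollout stops.
        "MaxErrors": "1",
        "Comment": "Disk usage check",
    }


# Hypothetical tag: instances tagged Environment=production.
request = run_command_request("Environment", "production", ["df -h"])
print(request["Targets"])

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ssm").send_command(**request)
```

Targeting by tag instead of a fixed instance list is the design choice that lets the same command scale from one instance to a fleet.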
Services You Will Command
Monitoring and Observability
- Amazon CloudWatch — Metrics, alarms, dashboards, anomaly detection, Contributor Insights
- Amazon CloudWatch Logs — Log aggregation, Logs Insights queries, metric filters, Live Tail
- AWS X-Ray — Distributed tracing, service maps, trace analysis, sampling configuration
- Amazon CloudWatch ServiceLens — Unified observability combining metrics, logs, and traces
- AWS Distro for OpenTelemetry — Vendor-neutral telemetry collection (metrics, traces, logs)
Cost and Optimization
- AWS Cost Explorer — Spend visualization, forecasting, Savings Plans recommendations
- AWS Budgets — Cost/usage budgets, alerts, automated actions
- AWS Cost Anomaly Detection — ML-based detection of unexpected spending patterns
- AWS Cost and Usage Report — Detailed billing data export for custom analysis
- AWS Compute Optimizer — Right-sizing recommendations for EC2, Lambda, EBS, ECS
- AWS Trusted Advisor — Best practice checks across cost, performance, security, fault tolerance
- Cost Optimization Hub — Centralized recommendations across Organizations accounts
Data Protection and Recovery
- AWS Backup — Centralized backup management with cross-region/cross-account vaults
- AWS Elastic Disaster Recovery — Continuous replication for rapid failover
- Amazon S3 Glacier — Long-term archival storage for backup retention
Operational Automation
- AWS Systems Manager — Operational hub for fleet management, patching, automation
- AWS Systems Manager Run Command — Remote command execution without SSH/RDP
- AWS Systems Manager Patch Manager — Automated OS and application patching
- AWS Systems Manager State Manager — Configuration drift remediation
- AWS Systems Manager Session Manager — Secure shell access without open inbound ports
- AWS Systems Manager Automation — Multi-step runbook execution
The Three Pillars of Observability
Modern distributed systems require three complementary signals to achieve full operational visibility. Each pillar answers a distinct operational question, and no single pillar can replace the others.
┌─────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌───────────────┐      ┌───────────────┐      ┌───────────────┐   │
│   │    METRICS    │      │     LOGS      │      │    TRACES     │   │
│   │               │      │               │      │               │   │
│   │  "Is it       │      │  "What        │      │  "Where is    │   │
│   │   broken?"    │      │   happened?"  │      │   the delay?" │   │
│   │               │      │               │      │               │   │
│   │  CloudWatch   │      │  CloudWatch   │      │  X-Ray /      │   │
│   │  Metrics      │      │  Logs         │      │  OpenTelemetry│   │
│   └───────┬───────┘      └───────┬───────┘      └───────┬───────┘   │
│           │                      │                      │           │
│           └──────────────────────┼──────────────────────┘           │
│                                  │                                  │
│                      ┌───────────▼───────────┐                      │
│                      │      CloudWatch       │                      │
│                      │      ServiceLens      │                      │
│                      │    (Unified View)     │                      │
│                      └───────────────────────┘                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
| Pillar | Signal Type | Latency | Cardinality | Best For |
|---|---|---|---|---|
| Metrics | Numeric time-series | Seconds | Low (aggregated) | Detection, alerting, dashboards, SLO tracking |
| Logs | Structured/unstructured text | Seconds to minutes | High (per-event) | Root cause analysis, audit, debugging |
| Traces | Request-scoped spans | Seconds | Medium (sampled) | Latency analysis, dependency mapping, bottleneck identification |
Metrics are numeric measurements collected at regular intervals. They are lightweight, fast to query, and ideal for real-time alerting. A spike in error rate or CPU utilization can trigger an alarm within a single evaluation period, depending on how many breaching datapoints the alarm requires. But metrics lack context: they tell you that errors increased, not which specific requests failed or why.
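How quickly an alarm fires depends on how many recent datapoints must breach the threshold. The sketch below is a simplified model of the "M out of N datapoints" idea behind CloudWatch alarms, assuming a greater-than threshold and ignoring missing-data handling. The function name `alarm_state` and the CPU values are illustrative.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a > threshold alarm.

    Simplified model: only the most recent `evaluation_periods` datapoints
    are considered, and missing-data treatment is not modeled.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical one-minute CPU utilization averages, oldest first.
cpu_percent = [42.0, 55.0, 91.0, 97.0, 98.0]

# 1 out of 1: fires on a single breaching period (fast, but noisy).
print(alarm_state(cpu_percent, 90.0, evaluation_periods=1, datapoints_to_alarm=1))

# 4 out of 5: requires sustained pressure (slower, fewer false alarms).
print(alarm_state(cpu_percent, 90.0, evaluation_periods=5, datapoints_to_alarm=4))
```

The trade-off is between detection speed and noise: a 1-of-1 alarm reacts immediately, while an M-of-N alarm waits for the condition to persist.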
Logs are immutable records of discrete events. They contain the full context of what happened: the request payload, the exception stack trace, the authentication failure reason, the SQL query that timed out. Logs are essential for root cause analysis but expensive to store at scale and slow to search without proper indexing.
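One common way to keep logs searchable is to emit each event as a single JSON object, so fields such as a request ID can be filtered on directly. The sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `context` attribute, and the field names are assumptions for illustration, and the in-memory stream stands in for wherever the application's logs are actually shipped.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute supplied through `extra=`.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "query timed out",
    extra={"context": {"request_id": "req-123", "duration_ms": 3050}},
)

print(stream.getvalue().strip())
```

Because every event carries the same named fields, a query tool can filter on values such as `duration_ms` rather than pattern-matching free text.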
Traces follow a single request as it traverses multiple services. In a microservice architecture, a single API call might touch an API Gateway, a Lambda authorizer, a backend Lambda function, a DynamoDB query, and an SQS publish. A trace reveals exactly which service introduced latency, which call failed, and how the request flowed through the system.
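To make "which service introduced latency" concrete, the sketch below walks a trace represented as nested dictionaries and reports the node with the most self time, meaning its own duration minus the time spent in its children. The field names `name`, `start_time`, `end_time`, and `subsegments` mirror X-Ray segment documents, but the trace data and the assumption that children run sequentially are hypothetical.

```python
from typing import Iterator, Tuple


def self_times(node: dict) -> Iterator[Tuple[str, float]]:
    """Yield (name, self_time_seconds) for a node and all its descendants.

    Self time is the node's duration minus the summed duration of its direct
    children, assuming the children do not overlap in time.
    """
    children = node.get("subsegments", [])
    duration = node["end_time"] - node["start_time"]
    child_total = sum(c["end_time"] - c["start_time"] for c in children)
    yield node["name"], duration - child_total
    for child in children:
        yield from self_times(child)


# Hypothetical trace for one API call; times are seconds since request start.
trace = {
    "name": "api-gateway", "start_time": 0.0, "end_time": 3.5,
    "subsegments": [
        {"name": "lambda-authorizer", "start_time": 0.0, "end_time": 0.25},
        {"name": "backend-lambda", "start_time": 0.25, "end_time": 3.5,
         "subsegments": [
             {"name": "dynamodb-query", "start_time": 0.25, "end_time": 3.25},
             {"name": "sqs-publish", "start_time": 3.25, "end_time": 3.5},
         ]},
    ],
}

slowest_name, slowest_seconds = max(self_times(trace), key=lambda item: item[1])
print(f"{slowest_name}: {slowest_seconds:.2f}s")  # dynamodb-query: 3.00s
```

Ranking by self time rather than total duration avoids blaming a parent for latency that actually originates in one of its downstream calls.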
Architecture Context
Phase 9 operationalizes everything you have built across the previous eight phases. The infrastructure from Phases 1 through 8 generates the signals. Phase 9 teaches you to collect, analyze, and act on those signals.
The EC2 instances from Phase 3 emit CPU, network, and disk I/O metrics to CloudWatch automatically; memory and disk-space utilization require the CloudWatch agent. The Lambda functions from Phase 4 report invocation count, duration, and error count. The RDS databases from Phase 5 surface connection count, freeable memory, and read/write latency. The ALBs from Phase 6 measure request count, target response time, and HTTP error codes. Every service you deployed is already producing metrics. Phase 9 teaches you to interpret them, alarm on them, and visualize them in dashboards that provide at-a-glance operational health.
The CloudWatch agent on EC2, the automatic log emission from Lambda, the VPC Flow Logs from your networking configuration, the CloudTrail events from Phase 8: all of these log sources feed into the centralized logging architecture you build in Module 62. X-Ray traces the requests flowing through your API Gateway, Lambda, DynamoDB, and SQS integrations from Phases 4 and 5.
Cost management from Module 64 quantifies the financial impact of every architectural decision across all phases: the instance type selections, the storage class choices, the data transfer patterns, and the reserved capacity commitments. Resource optimization from Module 65 acts on that data to reduce waste.
The backup and DR modules (66 and 67) protect the stateful resources: RDS databases, DynamoDB tables, EFS file systems, S3 buckets. Systems Manager from Module 68 automates the operational tasks that keep the entire stack healthy: patching, configuration enforcement, and incident response.
Phase Exam
After completing all eight modules, you will take the Phase 9 Operations & Observability exam:
- 35 multiple-choice questions covering CloudWatch metrics and alarms, log architecture, distributed tracing, cost management, resource optimization, backup strategy, disaster recovery design, and operational automation
- 55 minutes time limit
- 70% pass threshold (at least 25 of 35 correct)
- Questions emphasize operational decisions: which metric to alarm on for a given failure mode, how to structure log queries for incident investigation, when to use sampling in traces, how to design budgets with automated actions, which DR strategy meets specific RTO/RPO requirements, and how to automate patching across a multi-account fleet
- Expect scenario-based questions that present an operational problem (performance degradation, unexpected cost spike, data loss scenario, compliance-driven backup requirement) and ask you to select the correct service combination, configuration, or architecture pattern
- CloudWatch alarm math, Logs Insights query syntax, X-Ray sampling rules, Cost Explorer filters, Backup vault lock configurations, DR strategy trade-offs, and Systems Manager document structure are heavily represented