Why This Phase Exists

The data layer is the most critical architectural decision you will make. Get it wrong and you are stuck.

Compute is fungible. You can swap a Lambda function for a container or an EC2 instance with a weekend of engineering effort. Networking is configurable. Route tables, security groups, and load balancers change in minutes. But your database? Once you have a terabyte of relational data in Aurora, or a billion items in DynamoDB with carefully designed partition keys, or a year of time-series metrics in Timestream, migration becomes an exercise measured in months, not hours. Applications shape themselves around their data model. Access patterns harden into code. Indexes, queries, and consistency expectations permeate every layer of your application.

AWS offers 15+ purpose-built database services because one size does not fit all. A relational database optimized for ACID transactions will never deliver single-digit millisecond reads at millions of requests per second. A key-value store that scales horizontally to virtually unlimited throughput will never give you the ad-hoc join capabilities of SQL. A graph database that traverses billions of relationships in milliseconds would be absurdly wasteful for storing user session data.

A Solutions Architect chooses the right database for the access pattern, not the other way around. You do not start with "we use PostgreSQL for everything" and then fight the technology when your requirements demand something different. You start with the access pattern, the consistency requirement, the scale target, and the latency budget, then select the engine that was purpose-built for exactly that profile.

This phase takes you from understanding managed relational databases through NoSQL at scale, in-memory caching, analytics engines, and the full spectrum of purpose-built databases. By the end, you will make database technology decisions with the same confidence you now have selecting compute or storage services.

What You Will Master

By the end of Phase 5, you will be able to:

Deploy and manage production RDS instances across multiple Availability Zones with automated failover, backups, and point-in-time recovery
Architect Aurora clusters that deliver up to five times the throughput of standard MySQL with storage that scales automatically to 128 TiB
Design DynamoDB tables with partition key strategies that distribute load evenly across partitions at any scale
Implement advanced DynamoDB patterns including GSIs, LSIs, single-table design, DynamoDB Streams, and global tables
Deploy ElastiCache and MemoryDB clusters that can offload as much as 90% of reads from the database on read-heavy workloads and deliver sub-millisecond response times
Build analytics architectures using Redshift for petabyte-scale warehousing and Athena for serverless queries against S3 data lakes
Select the correct purpose-built database (DocumentDB, Neptune, Keyspaces, QLDB, Timestream) based on data model and access pattern requirements
Make the relational vs NoSQL vs in-memory vs graph vs time-series vs ledger decision correctly on the first attempt for any workload

Modules in This Phase

Module	Title	Key Focus Areas
27	RDS Fundamentals	Managed relational databases, engine selection, instance classes, storage types, backup/recovery, security, monitoring
28	RDS High Availability	Multi-AZ deployments, read replicas, cross-region replicas, automated failover, promotion strategies
29	Amazon Aurora	Aurora architecture, cluster topology, storage auto-scaling, Aurora Serverless v2, Global Database, parallel query
30	DynamoDB Fundamentals	Tables, items, attributes, partition keys, sort keys, read/write capacity modes, consistency models, basic operations
31	DynamoDB Advanced	GSIs, LSIs, single-table design, DynamoDB Streams, Global Tables, DAX, transactions, TTL
32	ElastiCache & MemoryDB	Redis vs Memcached, cluster modes, caching strategies, session stores, MemoryDB for durable in-memory workloads
33	Analytics Databases	Redshift architecture and distribution styles, Athena serverless queries, Redshift Spectrum, data lake analytics patterns
34	Purpose-Built Databases	DocumentDB (MongoDB-compatible), Neptune (graph), Keyspaces (Cassandra-compatible), QLDB (ledger), Timestream (time-series)

The Progressive Path

This phase follows a deliberate progression that builds competence in layers.

Modules 27 and 28 form the relational database arc. You start with RDS fundamentals because relational databases remain the backbone of most enterprise applications. You learn what RDS manages for you, how to select engines and instance classes, and how backup and security work. Module 28 then layers on the high availability and scalability patterns (Multi-AZ, read replicas) that make RDS production-ready. You cannot design read replica architectures without first understanding the fundamentals from Module 27.

Module 29 introduces Aurora as the evolution of managed relational databases on AWS. Aurora's shared storage architecture, automatic failover, and Aurora Serverless v2 represent a fundamentally different approach from standard RDS. You must understand standard RDS to appreciate what Aurora improves and where it differs.

Modules 30 and 31 pivot to NoSQL with DynamoDB. This is intentional. After mastering relational patterns, you need to understand when relational is the wrong answer. DynamoDB Fundamentals teaches you a completely different data modeling paradigm: access-pattern-first design, single-digit millisecond performance at any scale, and eventual consistency as a deliberate default, with strongly consistent reads available when you need them. Module 31 advances into patterns that make DynamoDB viable for complex applications: secondary indexes, single-table design, streams, and global tables.

Module 32 introduces the in-memory tier. Caching is not optional at scale. ElastiCache and MemoryDB sit in front of your databases, absorbing the read load that would otherwise crush your relational or NoSQL backend. You cannot design an effective caching layer without understanding the databases it protects.
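One common way a cache absorbs read load is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache with a time-to-live. The sketch below is a minimal in-process illustration, not ElastiCache client code. The class name `CacheAside`, the loader `load_user_from_db`, and the 60-second TTL are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """In-process stand-in for a managed cache (illustrative only)."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical function that reads from the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry time, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: read from the database, then populate the cache.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (now + self._ttl, value)
        return value


db_reads = []  # records every call that actually reaches the "database"


def load_user_from_db(user_id: str) -> str:
    db_reads.append(user_id)
    return f"profile-for-{user_id}"


cache = CacheAside(load_user_from_db, ttl_seconds=60.0)
for _ in range(10):
    cache.get("user-1")
print(cache.hits, cache.misses, len(db_reads))  # 9 1 1
```

Ten reads of the same key reach the backing store once. The TTL bounds how stale a cached value can become, which is the main trade-off to tune in this pattern.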

Module 33 shifts to analytics. OLTP databases (RDS, DynamoDB) serve transactional workloads. When you need to analyze billions of rows across years of historical data, you need purpose-built analytics engines. Redshift and Athena solve that problem at different price points and with different operational models.

Module 34 concludes the phase with the specialized engines. Not every data problem is relational, key-value, or analytical. Document databases, graph databases, wide-column stores, ledgers, and time-series databases each exist because certain access patterns are fundamentally incompatible with general-purpose engines.

Services You Will Command

Amazon RDS

Relational Database Service eliminates the undifferentiated heavy lifting of database administration: hardware provisioning, OS patching, database engine installation, backup configuration, and minor version upgrades. RDS supports six engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2) and gives you Multi-AZ failover, automated backups with point-in-time recovery, and read replicas for horizontal read scaling. You retain full SQL access and application compatibility while AWS handles everything below the database layer.
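Read replicas only help if the application sends reads to them. The sketch below shows one simple way to split traffic: writes go to the primary endpoint and reads rotate across replica endpoints. The class name `ReadWriteRouter` and the hostnames are hypothetical, and treating every statement that starts with SELECT as a read is a simplification (for example, SELECT ... FOR UPDATE belongs on the primary).

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Chooses an endpoint per statement (illustrative only)."""

    def __init__(self, writer: str, readers: List[str]) -> None:
        self._writer = writer
        # Fall back to the writer when no replicas are configured.
        self._readers = itertools.cycle(readers or [writer])

    def endpoint_for(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._readers) if is_read else self._writer


# Hypothetical hostnames; real RDS endpoints look different.
router = ReadWriteRouter(
    writer="primary.db.example.internal",
    readers=["replica-1.db.example.internal", "replica-2.db.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders WHERE id = 7"))
print(router.endpoint_for("UPDATE orders SET status = 'shipped' WHERE id = 7"))
```

Because RDS read replicas are updated asynchronously, a read routed to a replica immediately after a write may not yet see that write. Reads that must observe their own writes should go to the primary.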

Amazon Aurora

Aurora is AWS's cloud-native relational database, compatible with MySQL and PostgreSQL but architecturally distinct. Its shared distributed storage layer automatically replicates six copies of your data across three Availability Zones, scales to 128 TiB without downtime, and delivers up to five times the throughput of standard MySQL. Aurora Serverless v2 scales compute capacity in fine-grained increments based on application demand, reducing the capacity planning needed for variable workloads. Aurora Global Database extends replication across Regions with replication lag that is typically under one second.

Amazon DynamoDB

DynamoDB is a fully managed NoSQL key-value and document database that delivers single-digit millisecond performance at any scale. It has no servers to manage, no storage to provision, and no throughput ceilings that cannot be raised. Tables can grow from zero to petabytes and from zero to millions of requests per second without architectural changes. DynamoDB Streams captures a time-ordered sequence of item changes for event-driven architectures. Global Tables provides multi-region, multi-active replication with sub-second convergence.

Amazon ElastiCache

ElastiCache provides fully managed in-memory data stores compatible with Redis and Memcached. A Redis cluster on ElastiCache delivers sub-millisecond read latency and supports data structures (strings, hashes, lists, sets, sorted sets, streams) that enable use cases beyond simple key-value caching: leaderboards, session stores, rate limiters, real-time analytics, and message queues. Cluster mode distributes data across multiple shards for horizontal scaling.

Amazon MemoryDB

MemoryDB for Redis is a durable, in-memory database service that delivers microsecond reads and single-digit millisecond writes with Multi-AZ durability. Unlike ElastiCache (which is primarily a cache layer), MemoryDB is designed to be your primary database for workloads that need both the speed of in-memory access and the durability guarantees of a persistent database. It records writes in a Multi-AZ transaction log so that acknowledged writes survive node failures.

Amazon Redshift

Redshift is a petabyte-scale data warehouse purpose-built for OLAP (Online Analytical Processing). It uses columnar storage, massively parallel processing (MPP), and result caching to execute complex analytical queries across billions of rows in seconds. Redshift Serverless eliminates cluster management entirely. Redshift Spectrum extends your queries to data sitting in S3 without loading it into the warehouse, enabling a data lakehouse architecture.

Amazon Athena

Athena is a serverless interactive query service that lets you analyze data directly in S3 using standard SQL. There is no infrastructure to manage and no data to load. You point Athena at your S3 data, define a schema, and run queries. You pay per terabyte scanned. Athena integrates with AWS Glue Data Catalog for schema management and supports formats including Parquet, ORC, JSON, and CSV. For data lake architectures, Athena provides the query layer without the cost or complexity of a dedicated warehouse.
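Because the bill follows bytes scanned, file format and column selection drive cost directly. The arithmetic below is a rough sketch only: the price of 5.00 USD per terabyte, the treatment of a terabyte as 2**40 bytes, and the claim that a columnar query touches one tenth of the data are all assumptions for illustration. Check current Athena pricing for real figures.

```python
TIB = 1024 ** 4  # assumption: a billed "terabyte" is treated as 2**40 bytes here


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float) -> float:
    """Cost of one query given how many bytes it scanned."""
    return bytes_scanned / TIB * usd_per_tb


ASSUMED_USD_PER_TB = 5.0            # assumed price, not taken from this document
raw_csv_bytes = 2 * TIB             # query over CSV must read every byte
columnar_bytes = raw_csv_bytes // 10  # assumed: columnar layout lets the query read ~10%

print(scan_cost_usd(raw_csv_bytes, ASSUMED_USD_PER_TB))
print(round(scan_cost_usd(columnar_bytes, ASSUMED_USD_PER_TB), 2))
```

Under these assumptions the same question costs about a tenth as much against Parquet or ORC as against raw CSV, which is why columnar formats and partitioning are standard practice in S3 data lakes.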

Amazon DocumentDB

DocumentDB is a fully managed document database compatible with MongoDB 3.6, 4.0, and 5.0 APIs. It separates compute and storage, replicates data six ways across three AZs, and scales read throughput with up to 15 read replicas. Use DocumentDB when your application already uses the MongoDB API and you need the operational simplicity of a fully managed service with built-in high availability. It supports JSON documents with nested fields, arrays, and flexible schemas.

Amazon Neptune

Neptune is a fully managed graph database that supports both the Property Graph model (with Apache TinkerPop Gremlin) and the RDF model (with SPARQL). Graph databases excel when the relationships between entities are as important as the entities themselves: social networks, fraud detection, knowledge graphs, recommendation engines, and network topology analysis. Neptune stores billions of relationships and executes graph traversal queries with millisecond latency.

Amazon Keyspaces

Keyspaces is a fully managed wide-column database compatible with Apache Cassandra. It runs Cassandra Query Language (CQL) workloads without managing servers, software, or patching. Keyspaces automatically scales tables up and down in response to traffic, replicates data three times across multiple AZs, and offers both on-demand and provisioned capacity modes. Use Keyspaces when your application is built on Cassandra and you want to eliminate operational overhead.

Amazon QLDB

Quantum Ledger Database is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log. Every change to your data is tracked in an append-only journal that cannot be altered or deleted. QLDB uses a hash-chained journal (similar to blockchain) to prove that no modifications have occurred. Use QLDB for systems of record where you need an authoritative, verifiable history: financial transactions, supply chain tracking, regulatory reporting.
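The idea behind a hash-chained journal can be shown in a few lines: each entry stores the hash of the previous entry, so changing any historical record invalidates every hash after it. This is a simplified sketch of the general technique, not QLDB's actual journal format or API. The function names `append` and `verify` and the entry layout are assumptions for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Hash the previous entry's hash together with a canonical form of the payload.
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(journal: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    prev_hash = journal[-1]["hash"] if journal else GENESIS
    journal.append({"payload": payload, "prev": prev_hash,
                    "hash": _entry_hash(prev_hash, payload)})


def verify(journal: List[Dict[str, Any]]) -> bool:
    # Recompute every hash from the start; any edit breaks the chain.
    prev_hash = GENESIS
    for entry in journal:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


journal: List[Dict[str, Any]] = []
append(journal, {"account": "A", "delta": 100})
append(journal, {"account": "A", "delta": -30})
print(verify(journal))                 # True

journal[0]["payload"]["delta"] = 1000  # tamper with history
print(verify(journal))                 # False
```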

Amazon Timestream

Timestream is a fully managed time-series database built for collecting, storing, and querying time-stamped data at scale. It automatically moves recent data from a memory store (fast queries) to a magnetic store (cost-optimized) based on retention policies you define. Timestream is purpose-built for IoT telemetry, DevOps monitoring, industrial equipment sensors, and application metrics where every data point has a timestamp and you query by time range, aggregation, and interpolation.

The Database Decision Framework

This framework is the single most important deliverable from Phase 5. Memorize it. Internalize it. It will appear on every architecture review you participate in.

Use a relational database (RDS/Aurora) when:

Your data has defined relationships that require joins across tables
You need ACID transactions with strong consistency
Your schema is well-defined and relatively stable
You need complex queries with aggregations, subqueries, and multi-table operations
Regulatory requirements mandate SQL-accessible audit capabilities
Your read/write ratio and scale fit within vertical scaling limits (or read replicas for read scaling)

Use DynamoDB (NoSQL key-value) when:

You need single-digit millisecond latency at any scale
Your access patterns are known and limited (you can design keys around them)
You need horizontal scaling without application changes
Your data model fits key-value or document patterns without complex joins
You need consistent performance regardless of table size (10 GB or 10 PB)
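Designing keys around access patterns is easier to see with a toy model. The sketch below imitates a table with a partition key and a sort key using plain dictionaries, and serves two access patterns from one partition: fetch a customer profile, and list that customer's orders in date order. It is not the DynamoDB API. `TinyTable`, the `CUSTOMER#` and `ORDER#` key prefixes, and the sample items are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


class TinyTable:
    """Toy partition-key / sort-key store (illustrative only)."""

    def __init__(self) -> None:
        # partition key -> {sort key -> item}
        self._partitions: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, pk: str, sk: str, item: dict) -> None:
        self._partitions[pk][sk] = item

    def query(self, pk: str, sk_prefix: str = "") -> List[dict]:
        # Only one partition is touched; items come back in sort-key order.
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition) if sk.startswith(sk_prefix)]


table = TinyTable()
table.put("CUSTOMER#42", "PROFILE", {"name": "Ada"})
table.put("CUSTOMER#42", "ORDER#2024-03-02", {"total": 80})
table.put("CUSTOMER#42", "ORDER#2024-01-15", {"total": 25})
table.put("CUSTOMER#77", "PROFILE", {"name": "Lin"})

print(table.query("CUSTOMER#42", "PROFILE"))  # [{'name': 'Ada'}]
print(table.query("CUSTOMER#42", "ORDER#"))   # [{'total': 25}, {'total': 80}]
```

Both questions are answered by naming one partition and a sort-key prefix, with no join and no scan. An access pattern that was not planned for, such as "all orders over 50 across every customer", has no efficient path in this layout, which is why the access patterns need to be known up front.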

Use ElastiCache/MemoryDB (in-memory) when:

You need sub-millisecond read latency for hot data
You are caching database query results, session data, or computed values
You need atomic operations on data structures (counters, sorted sets, queues)
Your application has a high read-to-write ratio with a hot working set

Use Neptune (graph) when:

Relationships between entities are first-class citizens in your queries
You need to traverse many levels of connections efficiently (friends-of-friends, shortest path)
Your queries are shaped like "find all entities connected to X within N hops"
Use cases include fraud detection, recommendation engines, social networks, knowledge graphs
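The "within N hops" query shape is a bounded breadth-first traversal. The sketch below runs it over an in-memory adjacency list to show what the query computes. In Neptune you would express this in Gremlin or SPARQL rather than application code. The function name `within_n_hops` and the sample `follows` graph are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def within_n_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Return every node reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}
print(sorted(within_n_hops(follows, "alice", 2)))  # ['bob', 'carol', 'dave', 'erin']
```

In a relational schema each additional hop is typically another self-join, so the cost of this query grows quickly with depth. A graph engine stores the edges so that each hop is a direct lookup.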

Use Timestream (time-series) when:

Every record has a timestamp and you query by time range
You need built-in time-series functions (interpolation, smoothing, aggregation by time window)
Data volume is high but older data can be moved to cheaper storage automatically
Use cases include IoT telemetry, application monitoring, and industrial sensor data
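Aggregation by time window, the core operation in the list above, amounts to rounding each timestamp down to the start of its window and aggregating per bucket. The sketch below does this in plain Python to show the mechanics. A time-series engine performs the same step inside the query. The function name `average_by_window` and the sample readings are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_by_window(points: Iterable[Tuple[int, float]],
                      window_seconds: int) -> Dict[int, float]:
    """Average values per fixed window; keys are window start times in seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - timestamp % window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# (epoch seconds, temperature) pairs from a hypothetical sensor
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (125, 30.0)]
print(average_by_window(readings, 60))  # {0: 21.0, 60: 22.0, 120: 30.0}
```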

Use QLDB (ledger) when:

You need a cryptographically verifiable, immutable history of all changes
Auditability is a hard requirement (not just "nice to have")
You need to prove that records have not been tampered with
Use cases include financial transactions, supply chain provenance, regulatory compliance
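Taken together, the checklists in this framework can be read as an ordered set of rules. The sketch below encodes one possible ordering as a function. The field names, the `Workload` type, and in particular the order in which the rules are checked are assumptions made for illustration, not an official selection algorithm. Real workloads often combine several engines.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical requirement flags distilled from the checklists above."""
    verifiable_immutable_history: bool = False
    relationship_traversal: bool = False
    time_range_queries: bool = False
    sub_millisecond_hot_reads: bool = False
    needs_joins_or_acid: bool = False
    known_key_access_patterns: bool = False


def recommend(w: Workload) -> str:
    # Assumed ordering: the most specialized requirements are checked first.
    if w.verifiable_immutable_history:
        return "QLDB"
    if w.relationship_traversal:
        return "Neptune"
    if w.time_range_queries:
        return "Timestream"
    if w.sub_millisecond_hot_reads:
        return "ElastiCache/MemoryDB"
    if w.needs_joins_or_acid:
        return "RDS/Aurora"
    if w.known_key_access_patterns:
        return "DynamoDB"
    return "undecided: gather access patterns first"


print(recommend(Workload(needs_joins_or_acid=True)))        # RDS/Aurora
print(recommend(Workload(known_key_access_patterns=True)))  # DynamoDB
print(recommend(Workload(relationship_traversal=True)))     # Neptune
```

The value of writing the rules down is less the function itself than the questions it forces: each flag corresponds to something you have to establish about the workload before choosing an engine.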

Architecture Context

Phase 5 builds directly on the foundations from prior phases and establishes the data layer that every subsequent phase depends on.

From earlier phases, your databases will deploy into the private subnets you configured in Phase 2 (Networking). Security groups from Phase 3 (Network Security) control which compute resources can reach your database endpoints. IAM policies from Phase 2 (IAM) govern who can manage database instances and, in some cases (DynamoDB, RDS IAM Auth), who can connect to them. KMS keys from your security modules encrypt data at rest. CloudWatch from your monitoring foundations provides the metrics and alarms that keep your databases healthy.

Looking ahead, the database skills from this phase become the persistence layer for everything you build afterward. Your containerized applications (Phase 6) will connect to RDS and DynamoDB. Your CI/CD pipelines (Phase 7) will need to manage database migrations and schema changes. Your monitoring and observability practices (Phase 8) will track database performance metrics. Your cost optimization strategies (Phase 9) will target database right-sizing as one of the highest-impact levers.

Every production system you architect will have a database. The decisions you learn to make in this phase determine whether that database is an enabler that scales with your business or a bottleneck that constrains every system built on top of it.

Phase Exam

After completing all eight modules, you will take the Phase 5 Databases exam:

35 multiple-choice questions covering all database services, architectural decisions, and access pattern analysis from this phase
55 minutes time limit
70% pass threshold (25/35 correct)
Questions emphasize database selection decisions, high availability configurations, access pattern analysis, and performance optimization
Expect scenario-based questions that present a workload and ask you to select the correct database engine, configuration, or scaling strategy
DynamoDB key design, Aurora vs RDS decisions, caching strategy selection, and purpose-built database matching are heavily represented
