Why This Phase Exists
The data layer is the most critical architectural decision you will make. Get it wrong and you are stuck.
Compute is fungible. You can swap a Lambda function for a container or an EC2 instance with a weekend of engineering effort. Networking is configurable. Route tables, security groups, and load balancers change in minutes. But your database? Once you have a terabyte of relational data in Aurora, or a billion items in DynamoDB with carefully designed partition keys, or a year of time-series metrics in Timestream, migration becomes an exercise measured in months, not hours. Applications shape themselves around their data model. Access patterns harden into code. Indexes, queries, and consistency expectations permeate every layer of your application.
AWS offers 15+ purpose-built database services because one size does not fit all. A relational database optimized for ACID transactions will never deliver single-digit millisecond reads at millions of requests per second. A key-value store that scales horizontally to virtually unlimited throughput will never give you the ad-hoc join capabilities of SQL. A graph database that traverses billions of relationships in milliseconds would be absurdly wasteful for storing user session data.
A Solutions Architect chooses the right database for the access pattern, not the other way around. You do not start with "we use PostgreSQL for everything" and then fight the technology when your requirements demand something different. You start with the access pattern, the consistency requirement, the scale target, and the latency budget, then select the engine that was purpose-built for exactly that profile.
This phase takes you from understanding managed relational databases through NoSQL at scale, in-memory caching, analytics engines, and the full spectrum of purpose-built databases. By the end, you will make database technology decisions with the same confidence you now have selecting compute or storage services.
What You Will Master
By the end of Phase 5, you will be able to:
- Deploy and manage production RDS instances across multiple Availability Zones with automated failover, backups, and point-in-time recovery
- Architect Aurora clusters that deliver up to five times the throughput of standard MySQL with storage that scales automatically to 128 TiB
- Design DynamoDB tables with partition key strategies that distribute load evenly across partitions at any scale
- Implement advanced DynamoDB patterns including GSIs, LSIs, single-table design, DynamoDB Streams, and global tables
- Deploy ElastiCache and MemoryDB clusters that can reduce database read load by up to 90% and deliver sub-millisecond response times
- Build analytics architectures using Redshift for petabyte-scale warehousing and Athena for serverless queries against S3 data lakes
- Select the correct purpose-built database (DocumentDB, Neptune, Keyspaces, QLDB, Timestream) based on data model and access pattern requirements
- Make the relational vs NoSQL vs in-memory vs graph vs time-series vs ledger decision correctly on the first attempt for any workload
Modules in This Phase
| Module | Title | Key Focus Areas |
|---|---|---|
| 27 | RDS Fundamentals | Managed relational databases, engine selection, instance classes, storage types, backup/recovery, security, monitoring |
| 28 | RDS High Availability | Multi-AZ deployments, read replicas, cross-region replicas, automated failover, promotion strategies |
| 29 | Amazon Aurora | Aurora architecture, cluster topology, storage auto-scaling, Aurora Serverless v2, Global Database, parallel query |
| 30 | DynamoDB Fundamentals | Tables, items, attributes, partition keys, sort keys, read/write capacity modes, consistency models, basic operations |
| 31 | DynamoDB Advanced | GSIs, LSIs, single-table design, DynamoDB Streams, Global Tables, DAX, transactions, TTL |
| 32 | ElastiCache & MemoryDB | Redis vs Memcached, cluster modes, caching strategies, session stores, MemoryDB for durable in-memory workloads |
| 33 | Analytics Databases | Redshift architecture and distribution styles, Athena serverless queries, Redshift Spectrum, data lake analytics patterns |
| 34 | Purpose-Built Databases | DocumentDB (MongoDB-compatible), Neptune (graph), Keyspaces (Cassandra-compatible), QLDB (ledger), Timestream (time-series) |
The Progressive Path
This phase follows a deliberate progression that builds competence in layers.
Modules 27 and 28 form the relational database arc. You start with RDS fundamentals because relational databases remain the backbone of most enterprise applications. You learn what RDS manages for you, how to select engines and instance classes, and how backup and security work. Module 28 then layers on the high availability and scalability patterns (Multi-AZ, read replicas) that make RDS production-ready. You cannot design read replica architectures without first understanding the fundamentals from Module 27.
Module 29 introduces Aurora as the evolution of managed relational databases on AWS. Aurora's shared storage architecture, automatic failover, and Aurora Serverless v2 represent a fundamentally different approach from standard RDS. You must understand standard RDS to appreciate what Aurora improves and where it differs.
Modules 30 and 31 pivot to NoSQL with DynamoDB. This is intentional. After mastering relational patterns, you need to understand when relational is the wrong answer. DynamoDB fundamentals teaches you a completely different data modeling paradigm: access-pattern-first design, single-digit millisecond performance at any scale, and eventual consistency as a deliberate trade-off (with strongly consistent reads available when you need them). Module 31 advances into patterns that make DynamoDB viable for complex applications: secondary indexes, single-table design, streams, and global tables.
Module 32 introduces the in-memory tier. Caching is not optional at scale. ElastiCache and MemoryDB sit in front of your databases, absorbing the read load that would otherwise crush your relational or NoSQL backend. You cannot design an effective caching layer without understanding the databases it protects.
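To make "absorbing the read load" concrete, here is a minimal sketch of the cache-aside (lazy loading) pattern in plain Python. It assumes an in-process dictionary as a stand-in for a Redis or Memcached node, and `fake_db_lookup` is a hypothetical placeholder for a real database query; neither is an ElastiCache API.

```python
import time
from typing import Callable, Dict, List, Tuple


class CacheAside:
    """Check the cache first; on a miss, load from the database and populate the cache."""

    def __init__(
        self,
        load_from_db: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            self.hits += 1
            return entry[1]
        # Miss or expired entry: go to the database, then cache the result.
        self.misses += 1
        value = self._load_from_db(key)
        self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads fresh data."""
        self._store.pop(key, None)


db_calls: List[str] = []


def fake_db_lookup(key: str) -> str:
    # Hypothetical stand-in for a real query against RDS or DynamoDB.
    db_calls.append(key)
    return f"row-for-{key}"


cache = CacheAside(fake_db_lookup, ttl_seconds=30.0)
for _ in range(10):
    cache.get("user#42")

print(cache.hits, cache.misses, len(db_calls))  # 9 1 1
```

Ten reads of the same key reach the database once. The TTL and the explicit `invalidate` call are the two levers that bound how stale a cached value can become.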
Module 33 shifts to analytics. OLTP databases (RDS, DynamoDB) serve transactional workloads. When you need to analyze billions of rows across years of historical data, you need purpose-built analytics engines. Redshift and Athena solve that problem at different price points and with different operational models.
Module 34 concludes the phase with the specialized engines. Not every data problem is relational, key-value, or analytical. Document databases, graph databases, wide-column stores, ledgers, and time-series databases each exist because certain access patterns are fundamentally incompatible with general-purpose engines.
Services You Will Command
Amazon RDS
Relational Database Service eliminates the undifferentiated heavy lifting of database administration: hardware provisioning, OS patching, database engine installation, backup configuration, and minor version upgrades. RDS supports six engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2) and gives you Multi-AZ failover, automated backups with point-in-time recovery, and read replicas for horizontal read scaling. You retain full SQL access and application compatibility while AWS handles everything below the database layer.
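Read replicas only help if the application actually sends reads to them. The following is a deliberately naive sketch of read/write splitting, assuming hypothetical endpoint hostnames and a simple "starts with SELECT" rule. It ignores replication lag and statements such as `SELECT ... FOR UPDATE`, which a real router would have to handle.

```python
import itertools
from typing import Iterator, Optional, Sequence


class EndpointRouter:
    """Send writes to the primary and spread reads across replicas round-robin."""

    def __init__(self, writer: str, readers: Sequence[str]) -> None:
        self.writer = writer
        self._readers: Optional[Iterator[str]] = (
            itertools.cycle(readers) if readers else None
        )

    def endpoint_for(self, sql: str) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._readers is not None:
            return next(self._readers)
        # Writes, and reads when no replica exists, go to the primary.
        return self.writer


# Hypothetical endpoint names, for illustration only.
router = EndpointRouter(
    writer="writer.db.example.internal",
    readers=["reader-1.db.example.internal", "reader-2.db.example.internal"],
)

print(router.endpoint_for("SELECT * FROM orders WHERE id = 7"))
print(router.endpoint_for("UPDATE orders SET status = 'shipped' WHERE id = 7"))
```

Because replicas are updated asynchronously, a read routed this way may not yet see a write that just succeeded on the primary.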
Amazon Aurora
Aurora is AWS's cloud-native relational database, compatible with MySQL and PostgreSQL but architecturally distinct. Its shared distributed storage layer automatically replicates six copies of your data across three Availability Zones, scales to 128 TiB without downtime, and delivers up to five times the throughput of standard MySQL. Aurora Serverless v2 scales compute capacity in fine-grained increments based on application demand, eliminating capacity planning for variable workloads. Aurora Global Database extends replication across Regions with sub-second replication lag.
Amazon DynamoDB
DynamoDB is a fully managed NoSQL key-value and document database that delivers single-digit millisecond performance at any scale. It has no servers to manage, no storage to provision, and no throughput ceilings that cannot be raised. Tables can grow from zero to petabytes and from zero to millions of requests per second without architectural changes. DynamoDB Streams captures a time-ordered sequence of item changes for event-driven architectures. Global Tables provides multi-Region, multi-active replication that typically converges within a second.
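Scaling "without architectural changes" depends on partition keys that spread traffic evenly. The sketch below simulates that effect in plain Python. It assumes a SHA-256 hash and a fixed count of eight partitions purely for illustration; DynamoDB's internal hash function and partition layout are not something this example reproduces.

```python
import hashlib
from collections import Counter
from typing import Iterable

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key value to a partition by hashing it (stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def load_by_partition(keys: Iterable[str]) -> Counter:
    """Count how many requests land on each partition."""
    return Counter(partition_for(k) for k in keys)


# Low-cardinality key: only three distinct values exist.
status_keys = [("ACTIVE", "PENDING", "CLOSED")[i % 3] for i in range(9000)]
# High-cardinality key: one distinct value per user.
user_keys = [f"USER#{i}" for i in range(9000)]

status_load = load_by_partition(status_keys)
user_load = load_by_partition(user_keys)

print("status key -> partitions used:", len(status_load),
      "busiest:", max(status_load.values()))
print("user key   -> partitions used:", len(user_load),
      "busiest:", max(user_load.values()))
```

With three distinct key values, the same 9,000 requests can never touch more than three partitions, so at least one partition carries a third or more of the traffic. The per-user key spreads the same volume across all of them.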
Amazon ElastiCache
ElastiCache provides fully managed in-memory data stores compatible with Redis and Memcached. A Redis cluster on ElastiCache delivers sub-millisecond read latency and supports data structures (strings, hashes, lists, sets, sorted sets, streams) that enable use cases beyond simple key-value caching: leaderboards, session stores, rate limiters, real-time analytics, and message queues. Cluster mode distributes data across multiple shards for horizontal scaling.
Amazon MemoryDB
MemoryDB for Redis is a durable, in-memory database service that delivers microsecond reads and single-digit millisecond writes with Multi-AZ durability. Unlike ElastiCache (which is primarily a cache layer), MemoryDB is designed to be your primary database for workloads that need both the speed of in-memory access and the durability guarantees of a persistent database. It persists writes to a Multi-AZ transaction log so that committed data survives node failures.
Amazon Redshift
Redshift is a petabyte-scale data warehouse purpose-built for OLAP (Online Analytical Processing). It uses columnar storage, massively parallel processing (MPP), and result caching to execute complex analytical queries across billions of rows in seconds. Redshift Serverless eliminates cluster management entirely. Redshift Spectrum extends your queries to data sitting in S3 without loading it into the warehouse, enabling a data lakehouse architecture.
Amazon Athena
Athena is a serverless interactive query service that lets you analyze data directly in S3 using standard SQL. There is no infrastructure to manage and no data to load. You point Athena at your S3 data, define a schema, and run queries. You pay per terabyte scanned. Athena integrates with AWS Glue Data Catalog for schema management and supports formats including Parquet, ORC, JSON, and CSV. For data lake architectures, Athena provides the query layer without the cost or complexity of a dedicated warehouse.
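Because billing follows bytes scanned, the file format and the columns a query touches drive the cost directly. A back-of-the-envelope sketch, assuming a price of 5 USD per terabyte and a decimal terabyte (both are assumptions; check the current pricing page for your Region), and assuming a hypothetical 2 TB table where a query needs 3 of 30 equally sized columns:

```python
TB = 10**12  # assumption: decimal terabyte


def scan_cost_usd(bytes_scanned: int, price_per_tb_usd: float = 5.0) -> float:
    """Estimated query cost under a flat per-terabyte-scanned price (assumed rate)."""
    return bytes_scanned / TB * price_per_tb_usd


# Row-oriented CSV: the query has to read every byte of the table.
csv_bytes = 2 * TB

# Columnar format (Parquet or ORC): only the referenced columns are read.
# Assumes 30 equally sized columns and ignores compression for simplicity.
columnar_bytes = csv_bytes * 3 // 30

print(f"CSV scan:      ${scan_cost_usd(csv_bytes):.2f}")
print(f"Columnar scan: ${scan_cost_usd(columnar_bytes):.2f}")
```

Under these assumptions the same query costs a tenth as much against the columnar layout, before any savings from compression or partition pruning.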
Amazon DocumentDB
DocumentDB is a fully managed document database compatible with MongoDB 3.6, 4.0, and 5.0 APIs. It separates compute and storage, replicates data six ways across three AZs, and scales read throughput with up to 15 read replicas. Use DocumentDB when your application already uses the MongoDB API and you need the operational simplicity of a fully managed service with built-in high availability. It supports JSON documents with nested fields, arrays, and flexible schemas.
Amazon Neptune
Neptune is a fully managed graph database that supports both the Property Graph model (with Apache TinkerPop Gremlin) and the RDF model (with SPARQL). Graph databases excel when the relationships between entities are as important as the entities themselves: social networks, fraud detection, knowledge graphs, recommendation engines, and network topology analysis. Neptune stores billions of relationships and executes graph traversal queries with millisecond latency.
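The characteristic graph query is "everything reachable from X within N hops." In Neptune you would express this in Gremlin or SPARQL; the plain-Python breadth-first search below is only a sketch of what such a traversal computes, using a small hypothetical `follows` graph held in memory.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


def within_n_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Return every node reachable from `start` in at most `max_hops` edges."""
    seen: Set[str] = {start}
    frontier: Deque[Tuple[str, int]] = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical social graph: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

print(sorted(within_n_hops(follows, "alice", 2)))  # ['bob', 'carol', 'dave', 'erin']
```

Expressing the same question in SQL requires one self-join per hop, which is why this query shape is a signal to consider a graph engine.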
Amazon Keyspaces
Keyspaces is a fully managed wide-column database compatible with Apache Cassandra. It runs Cassandra Query Language (CQL) workloads without managing servers, software, or patching. Keyspaces automatically scales tables up and down in response to traffic, replicates data three times across multiple AZs, and offers both on-demand and provisioned capacity modes. Use Keyspaces when your application is built on Cassandra and you want to eliminate operational overhead.
Amazon QLDB
Quantum Ledger Database is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log. Every change to your data is tracked in an append-only journal that cannot be altered or deleted. QLDB uses a hash-chained journal (similar to blockchain) to prove that no modifications have occurred. Use QLDB for systems of record where you need an authoritative, verifiable history: financial transactions, supply chain tracking, regulatory reporting.
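The idea behind a hash-chained journal can be shown in a few lines. This is an illustrative sketch only, assuming SHA-256 over a JSON encoding of each entry; it is not QLDB's actual journal format or API, and all function names here are hypothetical.

```python
import copy
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(journal: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append-only write: each entry commits to everything before it."""
    prev_hash = journal[-1]["hash"] if journal else GENESIS_HASH
    journal.append(
        {"payload": payload, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, payload)}
    )


def verify(journal: List[Dict[str, Any]]) -> bool:
    """Recompute the chain from the start; any edit breaks a link."""
    prev_hash = GENESIS_HASH
    for entry in journal:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


journal: List[Dict[str, Any]] = []
append(journal, {"account": "A-1", "delta": 100})
append(journal, {"account": "A-1", "delta": -40})
append(journal, {"account": "B-7", "delta": 40})
print(verify(journal))  # True

tampered = copy.deepcopy(journal)
tampered[1]["payload"]["delta"] = -4  # silently alter a historical record
print(verify(tampered))  # False
```

Changing any historical payload changes its hash, which no longer matches the value stored in the entry, so verification fails without needing a trusted copy of the data.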
Amazon Timestream
Timestream is a fully managed time-series database built for collecting, storing, and querying time-stamped data at scale. It automatically moves recent data from a memory store (fast queries) to a magnetic store (cost-optimized) based on retention policies you define. Timestream is purpose-built for IoT telemetry, DevOps monitoring, industrial equipment sensors, and application metrics where every data point has a timestamp and you query by time range, aggregation, and interpolation.
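The tiering behavior can be modeled as a function of a record's age. The sketch below is a simplified model, assuming both retention periods are measured from the record's timestamp; the function name and the example retention values are hypothetical, not Timestream defaults.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def storage_tier(
    record_time: datetime,
    now: datetime,
    memory_retention: timedelta,
    magnetic_retention: timedelta,
) -> Optional[str]:
    """Return which tier holds a record, or None once it has aged out entirely."""
    age = now - record_time
    if age <= memory_retention:
        return "memory"  # recent data: fast queries
    if age <= magnetic_retention:
        return "magnetic"  # older data: cost-optimized storage
    return None  # past both retention periods


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
memory_retention = timedelta(hours=24)   # example value
magnetic_retention = timedelta(days=365)  # example value

for age in (timedelta(hours=2), timedelta(days=10), timedelta(days=400)):
    print(age, "->", storage_tier(now - age, now, memory_retention, magnetic_retention))
```

Choosing the memory retention period is a cost and latency trade-off: it should cover the time window your dashboards and alerts query most often.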
The Database Decision Framework
This framework is the single most important deliverable from Phase 5. Memorize it. Internalize it. It will appear on every architecture review you participate in.
Use a relational database (RDS/Aurora) when:
- Your data has defined relationships that require joins across tables
- You need ACID transactions with strong consistency
- Your schema is well-defined and relatively stable
- You need complex queries with aggregations, subqueries, and multi-table operations
- Regulatory requirements mandate SQL-accessible audit capabilities
- Your write load fits within the vertical scaling limits of a single writer instance, and read replicas can absorb read growth
Use DynamoDB (NoSQL key-value) when:
- You need single-digit millisecond latency at any scale
- Your access patterns are known and limited (you can design keys around them)
- You need horizontal scaling without application changes
- Your data model fits key-value or document patterns without complex joins
- You need consistent performance regardless of table size (10 GB or 10 PB)
Use ElastiCache/MemoryDB (in-memory) when:
- You need sub-millisecond read latency for hot data
- You are caching database query results, session data, or computed values
- You need atomic operations on data structures (counters, sorted sets, queues)
- Your application has a high read-to-write ratio with a hot working set
Use Neptune (graph) when:
- Relationships between entities are first-class citizens in your queries
- You need to traverse many levels of connections efficiently (friends-of-friends, shortest path)
- Your queries are shaped like "find all entities connected to X within N hops"
- Use cases include fraud detection, recommendation engines, social networks, knowledge graphs
Use Timestream (time-series) when:
- Every record has a timestamp and you query by time range
- You need built-in time-series functions (interpolation, smoothing, aggregation by time window)
- Data volume is high but older data can be moved to cheaper storage automatically
- Use cases include IoT telemetry, application monitoring, and industrial sensor data
Use QLDB (ledger) when:
- You need a cryptographically verifiable, immutable history of all changes
- Auditability is a hard requirement (not just "nice to have")
- You need to prove that records have not been tampered with
- Use cases include financial transactions, supply chain provenance, regulatory compliance
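One way to internalize the framework is to write it down as a decision function. The sketch below is a simplification with hypothetical field names. The order of the checks (specialized engines before general-purpose ones) is an assumption of this example, not a rule stated by the framework, and real workloads often combine several engines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical summary of a workload's dominant requirements."""
    needs_verifiable_history: bool = False
    relationship_traversal: bool = False
    time_series: bool = False
    sub_millisecond_reads: bool = False
    needs_joins_or_acid: bool = False
    known_key_access_patterns: bool = False


def recommend(profile: WorkloadProfile) -> str:
    """Map a workload profile to a starting-point engine family."""
    if profile.needs_verifiable_history:
        return "QLDB"
    if profile.relationship_traversal:
        return "Neptune"
    if profile.time_series:
        return "Timestream"
    if profile.sub_millisecond_reads:
        return "ElastiCache/MemoryDB"
    if profile.needs_joins_or_acid:
        return "RDS/Aurora"
    if profile.known_key_access_patterns:
        return "DynamoDB"
    return "needs more analysis"


print(recommend(WorkloadProfile(needs_joins_or_acid=True)))        # RDS/Aurora
print(recommend(WorkloadProfile(known_key_access_patterns=True)))  # DynamoDB
print(recommend(WorkloadProfile(relationship_traversal=True)))     # Neptune
```

The final branch matters: if none of the criteria clearly applies, the right move is to gather more detail about access patterns rather than default to a familiar engine.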
Architecture Context
Phase 5 builds directly on the foundations from prior phases and establishes the data layer that every subsequent phase depends on.
Looking back, your databases deploy into the private subnets you configured in Phase 2 (Networking). Security groups from Phase 3 (Network Security) control which compute resources can reach your database endpoints. IAM policies from Phase 2 (IAM) govern who can manage database instances and, in some cases (DynamoDB, RDS IAM database authentication), who can connect to them. KMS keys from your security modules encrypt data at rest. CloudWatch from your monitoring foundations provides the metrics and alarms that keep your databases healthy.
Looking ahead, the database skills from this phase become the persistence layer for everything you build afterward. Your containerized applications (Phase 6) will connect to RDS and DynamoDB. Your CI/CD pipelines (Phase 7) will need to manage database migrations and schema changes. Your monitoring and observability practices (Phase 8) will track database performance metrics. Your cost optimization strategies (Phase 9) will target database right-sizing as one of the highest-impact levers.
Every production system you architect will have a database. The decisions you learn to make in this phase determine whether that database is an enabler that scales with your business or a bottleneck that constrains every system built on top of it.
Phase Exam
After completing all eight modules, you will take the Phase 5 Databases exam:
- 35 multiple-choice questions covering all database services, architectural decisions, and access pattern analysis from this phase
- 55-minute time limit
- 70% pass threshold (at least 25 of 35 correct)
- Questions emphasize database selection decisions, high availability configurations, access pattern analysis, and performance optimization
- Expect scenario-based questions that present a workload and ask you to select the correct database engine, configuration, or scaling strategy
- DynamoDB key design, Aurora vs RDS decisions, caching strategy selection, and purpose-built database matching are heavily represented