AWS Service Decision Guides
When to use this guide: This is reference material designed for Phases 3-5 of the bootcamp (Weeks 4-8). If you're in Weeks 1-3, bookmark this page and return when you begin designing architectures. You won't fully appreciate these comparisons until you have hands-on experience with the underlying services.
How to use this guide: When you need to choose between similar AWS services (e.g., Lambda vs. ECS, RDS vs. DynamoDB), find the relevant decision table below. Start with the workload requirements, not the service. Then read the "Choose X when..." recommendations below each table. If you're still unsure, work through the scenario examples.
Compute: Lambda vs. ECS vs. EC2
| Factor | Lambda | ECS (Fargate) | EC2 |
|---|---|---|---|
| Best for | Event-driven, short tasks (<15 min) | Long-running services, microservices | Full OS control, GPU, legacy apps |
| Scaling | Automatic per-invocation | Task-level auto scaling | Auto Scaling groups |
| Pricing | Per request + duration | Per vCPU/memory per second | Per instance-hour |
| Cold starts | Yes (mitigated with provisioned concurrency) | Task startup ~30-60s | Instance launch ~1-3 min |
| Max runtime | 15 minutes | No limit | No limit |
| State | Stateless | Stateless or stateful with EFS | Stateful |
| Ops overhead | Minimal | Low | High |
Choose Lambda when you have event-driven workloads, APIs with variable traffic, or data processing triggered by S3/SQS/DynamoDB events.
Choose ECS (Fargate) when you need long-running services, consistent baseline traffic, or container-based microservices without managing servers.
Choose EC2 when you need full OS access, GPU instances, specific kernel configurations, or are running legacy applications that cannot be containerized.
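The three recommendations above can be read as a short decision procedure. The following is a minimal sketch in Python, not an official AWS rule set: the function name `recommend_compute`, its parameters, and the order in which the rules are checked are all assumptions made for illustration.

```python
def recommend_compute(needs_os_or_gpu: bool, event_driven: bool, max_runtime_min: float) -> str:
    """Suggest a compute service from the comparison table.

    Hypothetical helper: the parameters and the rule order are illustrative.
    """
    # Hard constraints first: full OS access, GPUs, or legacy apps point to EC2.
    if needs_os_or_gpu:
        return "EC2"
    # Lambda only fits event-driven work that finishes within its 15-minute limit.
    if event_driven and max_runtime_min <= 15:
        return "Lambda"
    # Everything else is treated as a long-running containerized service.
    return "ECS (Fargate)"


print(recommend_compute(needs_os_or_gpu=False, event_driven=True, max_runtime_min=2))
```

Checking the hard constraints before the convenience options mirrors the advice to start from workload requirements rather than from a preferred service.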
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| REST API with 10 requests/minute during business hours, near-zero at night | Lambda + API Gateway | Variable traffic, pay-per-request, no idle cost |
| Web app serving 1,000 concurrent users 24/7 | ECS (Fargate) | Consistent baseline traffic, long-running, no server management |
| Machine learning model training with GPU | EC2 (P-series instances) | GPU required, long-running, full OS control |
| Thumbnail generation triggered by S3 uploads | Lambda | Event-driven, short execution, automatic scaling |
| Legacy Java app that requires specific JVM tuning | EC2 | Full OS control, custom JVM configuration |
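To make the thumbnail scenario more concrete, here is a sketch of the entry point such a Lambda function might have. It only extracts the bucket and key from an S3 event notification and leaves the image processing as a comment. The bucket name, the key, and the returned dictionary shape are made up for illustration.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the bucket and key of every uploaded object in an S3 event."""
    uploads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uploads.append((bucket, key))
        # A real function would download the object and write a thumbnail here.
    return {"processed": uploads}


# Local smoke test with a hand-written event; bucket and key are hypothetical.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-uploads"}, "object": {"key": "photos/my+cat.jpg"}}}
    ]
}
print(handler(sample_event, None))
```

Because the handler is a plain function, it can be exercised locally with a sample event before it is deployed.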
Database: RDS vs. DynamoDB vs. ElastiCache
| Factor | RDS | DynamoDB | ElastiCache |
|---|---|---|---|
| Data model | Relational (SQL) | Key-value / document (NoSQL) | Key-value (in-memory) |
| Best for | Complex queries, joins, transactions | High-throughput, predictable access patterns | Caching, session storage, leaderboards |
| Scaling | Vertical (instance size) + read replicas | Horizontal (automatic partitioning) | Cluster mode with sharding |
| Latency | Single-digit milliseconds | Single-digit milliseconds | Sub-millisecond |
| Pricing | Per instance-hour + storage | Per request + storage (on-demand) or provisioned capacity | Per node-hour |
| Multi-AZ | Multi-AZ standby (automatic failover) | Built-in (3 AZs by default) | Multi-AZ with automatic failover |
| Schema | Fixed schema, migrations required | Schema-less (flexible attributes) | No schema (key-value) |
Choose RDS when you need complex queries with joins, ACID transactions, or your team has strong SQL expertise.
Choose DynamoDB when you have well-defined access patterns, need single-digit millisecond latency at any scale, or want minimal operational overhead (no servers, patching, or manual partitioning to manage).
Choose ElastiCache when you need sub-millisecond reads for frequently accessed data, session management, or real-time leaderboards. Use it alongside RDS or DynamoDB, not as a replacement.
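Using a cache alongside a database usually means some form of the cache-aside pattern: read the cache first, fall back to the database on a miss, then populate the cache. The sketch below imitates that flow with plain dictionaries standing in for ElastiCache and for RDS or DynamoDB. The key names, the `get_user` function, and the 60-second TTL are assumptions chosen for illustration.

```python
import time

database = {"user:1": {"name": "Ada"}}   # stand-in for RDS or DynamoDB
cache = {}                               # stand-in for ElastiCache: key -> (expires_at, value)
TTL_SECONDS = 60                         # illustrative expiry, not a recommendation


def get_user(key, now=None):
    """Return (value, source); read the cache first and fall back to the database."""
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1], "cache"
    value = database.get(key)            # cache miss or expired entry
    if value is not None:
        cache[key] = (now + TTL_SECONDS, value)
    return value, "database"


print(get_user("user:1"))   # first read comes from the database
print(get_user("user:1"))   # second read is served from the cache
```

The database remains the source of truth; losing the cache costs latency, not data, which is why the cache complements the database rather than replacing it.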
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| E-commerce order history with joins across customers, products, and payments | RDS (PostgreSQL or MySQL) | Complex relational queries, ACID transactions |
| Product catalog browsed by millions of users with simple key lookups | DynamoDB | High throughput, predictable access pattern, single-digit ms latency |
| Session storage for a web app with 50,000 concurrent users | ElastiCache (Redis) | Sub-millisecond reads, automatic expiration (TTL) |
| IoT sensor data with 100,000 writes/second | DynamoDB | Horizontal scaling, high write throughput |
| Financial reporting with ad-hoc SQL queries across multiple tables | RDS (PostgreSQL) | Complex joins, aggregations, ad-hoc queries |
Storage: S3 vs. EBS vs. EFS
| Factor | S3 | EBS | EFS |
|---|---|---|---|
| Type | Object storage | Block storage | File storage (NFS) |
| Access | HTTP/HTTPS API | Attached to one EC2 instance (or multi-attach io2) | Shared across multiple EC2/ECS/Lambda |
| Durability | 99.999999999% (11 nines) | 99.8-99.9% (gp3, io1); 99.999% (io2) | 99.999999999% (11 nines) |
| Scaling | Unlimited objects | Up to 16 TiB per volume (gp3); up to 64 TiB (io2 Block Express) | Automatic (petabyte scale) |
| Latency | ~100ms (first byte) | Sub-millisecond | Low single-digit milliseconds |
| Cost | Lowest (per GB stored) | Medium (per GB provisioned) | Highest (per GB used) |
| Use case | Static assets, backups, data lakes | Boot volumes, databases, high-IOPS apps | Shared config, CMS, container storage |
Choose S3 for static website hosting, backups, data lakes, and any data accessed via HTTP. Use lifecycle policies to move infrequently accessed data to cheaper storage classes.
Choose EBS for EC2 boot volumes, databases (RDS uses EBS under the hood), and any workload requiring low-latency block storage attached to a single instance.
Choose EFS when multiple compute resources (EC2 instances, ECS tasks, Lambda functions) need shared file access.
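The S3 lifecycle policies mentioned above are expressed as a list of rules. The sketch below shows the general shape of such a configuration as a Python dictionary, in the form that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The rule ID, the `logs/` prefix, and the day counts are hypothetical values, not recommendations.

```python
import json

# Lifecycle configuration in the shape boto3 expects for
# put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-logs",          # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # hypothetical prefix
            "Transitions": [
                # Move to cheaper storage classes as objects age.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},       # delete after one year
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

The call itself is omitted so the example runs without AWS credentials; applying it would require boto3 and an existing bucket.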
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| Static website assets (images, CSS, JS) served via CloudFront | S3 | Object storage, HTTP access, lowest cost, CDN-friendly |
| PostgreSQL database storage | EBS (gp3 or io2) | Block storage, sub-millisecond latency, attached to the database instance |
| Shared configuration files across 10 ECS containers | EFS | NFS mount shared across multiple tasks |
| Data lake with 50 TB of Parquet files queried by Athena | S3 | Unlimited scale, lowest cost, native Athena integration |
| Machine learning training data accessed by multiple EC2 instances | EFS | Shared access, automatic scaling |
Load Balancer: ALB vs. NLB vs. CLB
| Factor | ALB | NLB | CLB (Legacy) |
|---|---|---|---|
| Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 4 + basic Layer 7 |
| Best for | Web apps, microservices, APIs | High throughput, low latency, non-HTTP | Legacy applications only |
| Routing | Path-based, host-based, header-based | Port-based | Basic round-robin |
| WebSocket | Yes | Yes | No |
| Static IP | No (use Global Accelerator) | Yes (Elastic IP per AZ) | No |
| Latency | ~ms | ~100μs | ~ms |
| Cost | Per hour + LCU | Per hour + NLCU | Per hour + data |
Choose ALB for HTTP/HTTPS workloads, REST APIs, microservices with path-based routing, or any application that needs Layer 7 features.
Choose NLB for TCP/UDP workloads, extreme performance requirements, static IPs, or when you need to preserve the client source IP.
Avoid CLB for new architectures. It exists for backward compatibility only.
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| REST API with /api/* routed to backend, /web/* to frontend | ALB | Path-based routing, Layer 7, HTTP-aware |
| Real-time gaming server handling millions of UDP packets/second | NLB | Layer 4, ultra-low latency, UDP support |
| gRPC microservices with HTTP/2 | ALB | HTTP/2 and gRPC support, target group routing |
| VPN endpoint requiring a static IP address | NLB | Static Elastic IP per AZ |
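Path-based routing on an ALB works by evaluating listener rules in priority order and forwarding to the target group of the first rule that matches, with a default action as the fallback. The sketch below imitates that behavior for the `/api/*` and `/web/*` scenario. The target group names are hypothetical, and Python's `fnmatchcase` is used only as an approximation of ALB wildcard matching.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); all names are hypothetical.
rules = [
    (10, "/api/*", "backend-targets"),
    (20, "/web/*", "frontend-targets"),
]
DEFAULT_TARGET = "default-targets"


def route(path: str) -> str:
    """Return the target group of the lowest-numbered rule whose pattern matches."""
    for _priority, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):
            return target
    return DEFAULT_TARGET  # stands in for the listener's default action


for path in ("/api/orders", "/web/index.html", "/health"):
    print(path, "->", route(path))
```

An NLB has no equivalent of this step: it forwards by listener port without inspecting the request path, which is why HTTP-aware routing points to an ALB.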
Messaging: SQS vs. SNS vs. EventBridge
| Factor | SQS | SNS | EventBridge |
|---|---|---|---|
| Pattern | Queue (pull) | Pub/sub (push) | Event bus (push, rule-based) |
| Delivery | One consumer per message (standard) | Fan-out to many subscribers | Rule-based routing to targets |
| Ordering | FIFO queues guarantee order (per message group) | FIFO topics guarantee order (per message group) | No ordering guarantee |
| Retry | Built-in (visibility timeout + DLQ) | Retry policies per subscription | Built-in retry with DLQ |
| Use case | Decouple producer/consumer, buffer spikes | Notifications, fan-out to multiple queues | Cross-service event routing, SaaS integration |
Choose SQS when you need to decouple a producer from a consumer, buffer traffic spikes, or guarantee at-least-once processing with retries.
Choose SNS when you need to fan out a single event to multiple subscribers (email, SQS queues, Lambda functions, HTTP endpoints).
Choose EventBridge when you need content-based routing (filter events by fields), cross-account event delivery, or integration with SaaS providers.
Common pattern: SNS → SQS fan-out. Publish to an SNS topic, subscribe multiple SQS queues. Each queue processes the event independently.
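The fan-out pattern can be illustrated without any AWS calls. In the sketch below, a small `Topic` class stands in for an SNS topic and `deque` objects stand in for subscribed SQS queues. The class, the queue names, and the message fields are hypothetical.

```python
from collections import deque


class Topic:
    """In-memory stand-in for an SNS topic with SQS queue subscriptions."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        # Every subscribed queue receives its own copy of the message.
        for queue in self.queues.values():
            queue.append(dict(message))


topic = Topic()
email_q = topic.subscribe("email")
analytics_q = topic.subscribe("analytics")
topic.publish({"event": "user_signed_up", "user_id": 42})

# Consumers drain their queues independently of one another.
email_q.popleft()
print(len(email_q), len(analytics_q))
```

Because each queue holds its own copy, a slow or failing consumer on one queue does not block or lose messages for the others.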
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| Order processing: decouple the web tier from the payment processor | SQS | Queue buffers requests, retries on failure, DLQ for poison messages |
| New user signup triggers email, analytics, and provisioning | SNS → SQS fan-out | One event fans out to three independent consumers |
| Route S3 upload events to different Lambda functions based on file type | EventBridge | Content-based filtering on event fields (e.g., file extension) |
| Process exactly-once financial transactions in strict order | SQS FIFO | Exactly-once processing, guaranteed ordering by message group |
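Content-based routing in EventBridge is driven by event patterns: JSON documents whose fields list the values an event must have. The sketch below shows a pattern for S3 "Object Created" events filtered by key suffix, together with a deliberately simplified matcher that supports only exact values and `suffix` filters. The matcher functions are hypothetical illustrations, not the EventBridge implementation, and the bucket and key are made up.

```python
def value_matches(actual, allowed: list) -> bool:
    """True if `actual` equals a literal in `allowed` or satisfies a {"suffix": ...} entry."""
    for entry in allowed:
        if isinstance(entry, dict):
            suffix = entry.get("suffix")
            if suffix is not None and isinstance(actual, str) and actual.endswith(suffix):
                return True
        elif actual == entry:
            return True
    return False


def matches(pattern: dict, event: dict) -> bool:
    """Simplified subset of event pattern matching: every pattern field must match."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        if isinstance(expected, dict):
            # Nested objects in the pattern are matched recursively.
            if not isinstance(event[field], dict) or not matches(expected, event[field]):
                return False
        elif not value_matches(event[field], expected):
            return False
    return True


image_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"object": {"key": [{"suffix": ".png"}, {"suffix": ".jpg"}]}},
}
event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "example-uploads"}, "object": {"key": "cat.png"}},
}
print(matches(image_pattern, event))
```

One rule per file type, each with its own pattern and Lambda target, gives the routing described in the scenario without any filtering code inside the functions.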
IaC: CloudFormation vs. SAM vs. CDK vs. Terraform
| Factor | CloudFormation | SAM | CDK | Terraform |
|---|---|---|---|---|
| Language | YAML / JSON | YAML (shorthand) | TypeScript, Python, Java, etc. | HCL |
| Best for | Any AWS resource | Serverless apps (Lambda, API GW, DynamoDB) | Complex infra with logic (loops, conditions) | Multi-cloud or team already using Terraform |
| Learning curve | Medium | Low (if you know CloudFormation) | Medium (requires programming) | Medium |
| State management | Managed by AWS (stacks) | Managed by AWS (stacks) | Managed by AWS (stacks) | State file you manage (commonly an S3 backend with DynamoDB locking) |
| Drift detection | Yes (via CloudFormation) | Yes (via CloudFormation) | Yes (via CloudFormation) | Yes (terraform plan) |
| AWS integration | Native | Native | Native (compiles to CloudFormation) | Provider-based |
Choose CloudFormation when you need direct, declarative YAML/JSON templates and your team prefers configuration over code.
Choose SAM when building serverless applications. SAM shorthand reduces boilerplate for Lambda, API Gateway, and DynamoDB resources.
Choose CDK when you need programming constructs (loops, conditionals, abstractions) or your team prefers writing infrastructure in a familiar language.
Choose Terraform when you manage resources across multiple cloud providers or your organization has standardized on Terraform.
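The appeal of programming constructs is easiest to see with a loop. The sketch below is not CDK itself; it uses plain Python to generate a CloudFormation template with one SQS queue per environment, which imitates the idea of code that compiles down to a template. The environment names, logical IDs, and queue names are hypothetical.

```python
import json

environments = ["dev", "staging", "prod"]  # hypothetical environment names

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": {}}
for env in environments:
    # Logical IDs must be alphanumeric, so the environment name is capitalized and appended.
    template["Resources"][f"OrdersQueue{env.capitalize()}"] = {
        "Type": "AWS::SQS::Queue",
        "Properties": {"QueueName": f"orders-{env}"},
    }

print(json.dumps(template, indent=2))
```

Written by hand in YAML or JSON, the same three resources would be three near-identical blocks; that repetition is what CDK-style loops and abstractions remove.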
Security: Security Groups vs. NACLs
| Factor | Security Groups | NACLs |
|---|---|---|
| Level | Instance / ENI | Subnet |
| State | Stateful (return traffic auto-allowed) | Stateless (must allow both inbound and outbound) |
| Rules | Allow only | Allow and deny |
| Evaluation | All rules evaluated together | Rules evaluated in order (lowest number first) |
| Default | New group: deny all inbound, allow all outbound | Default NACL: allow all inbound and outbound; custom NACL: deny all until rules are added |
| Use case | Primary firewall for instances | Subnet-level guardrails, block specific IPs |
Use security groups as your primary firewall. They are stateful, easier to manage, and sufficient for most use cases.
Add NACLs as a secondary defense layer when you need to explicitly deny traffic from specific IP ranges or add subnet-level controls.
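The ordered, first-match evaluation of NACL rules is what makes explicit denies possible. The sketch below models it for inbound traffic using only the source address; real NACL rules also match on protocol and port. The rule numbers and CIDR ranges are illustrative, and the addresses come from ranges reserved for documentation.

```python
from ipaddress import ip_address, ip_network

# (rule number, source CIDR, action); values are illustrative.
inbound_rules = [
    (100, "203.0.113.0/24", "deny"),   # block one specific range first
    (200, "0.0.0.0/0", "allow"),       # then allow everything else
]


def nacl_decision(source_ip: str) -> str:
    """Evaluate rules in ascending rule-number order; the first match wins."""
    for _number, cidr, action in sorted(inbound_rules):
        if ip_address(source_ip) in ip_network(cidr):
            return action
    return "deny"  # the implicit final '*' rule denies anything unmatched


print(nacl_decision("203.0.113.7"))
print(nacl_decision("198.51.100.1"))
```

If the two rule numbers were swapped, the allow rule would match first and the deny would never take effect. A security group cannot express this at all, because it has no deny rules.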
DR Strategy: Backup & Restore vs. Pilot Light vs. Warm Standby vs. Active-Active
| Strategy | RTO | RPO | Cost | Complexity | Key AWS Services |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Lowest | Low | AWS Backup, S3 Cross-Region Replication |
| Pilot Light | 10s of minutes | Minutes | Low | Medium | RDS cross-Region read replica, Route 53, AMIs |
| Warm Standby | Minutes | Seconds–minutes | Medium | Medium-high | Auto Scaling (scaled down), RDS replica, Route 53 failover |
| Active-Active | Near-zero | Near-zero | Highest | High | DynamoDB Global Tables, Route 53 multivalue/latency, Aurora Global |
Choose Backup & Restore for non-critical workloads where hours of downtime are acceptable.
Choose Pilot Light when you need faster recovery than backup/restore but want to minimize cost. Data is replicated continuously to the recovery Region, while application servers stay switched off until failover, when they are started and scaled up.
Choose Warm Standby for business-critical workloads that need recovery in minutes. A scaled-down copy of the production environment runs continuously.
Choose Active-Active for mission-critical workloads with near-zero tolerance for downtime. Traffic is served from multiple regions simultaneously.
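The four recommendations above amount to mapping a recovery time objective onto the cheapest strategy that meets it. The sketch below shows one way to encode that mapping. The function name and the numeric thresholds (1, 10, and 60 minutes) are assumptions that loosely follow the RTO column of the table; they are not AWS-defined cutoffs, and a real decision would also weigh RPO and cost.

```python
def recommend_dr_strategy(rto_minutes: float) -> str:
    """Pick the least expensive strategy whose typical RTO fits the requirement.

    Hypothetical helper: the thresholds are illustrative, not official limits.
    """
    if rto_minutes < 1:
        return "Active-Active"      # near-zero downtime tolerance
    if rto_minutes < 10:
        return "Warm Standby"       # recovery in minutes
    if rto_minutes < 60:
        return "Pilot Light"        # recovery in tens of minutes
    return "Backup & Restore"       # hours of downtime acceptable


for rto in (0.5, 5, 30, 240):
    print(rto, "->", recommend_dr_strategy(rto))
```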
Scenario examples:
| Scenario | Recommended | Why |
|---|---|---|
| Internal wiki used by 50 employees | Backup & Restore | Low criticality, hours of downtime acceptable, lowest cost |
| E-commerce site with $10K/hour revenue | Warm Standby | Minutes of downtime acceptable, cost-justified by revenue impact |
| Payment processing system for a bank | Active-Active | Near-zero downtime required, regulatory requirements |
| Development/staging environment | Backup & Restore | Non-production, rebuild from IaC templates if needed |
Quick Reference: Which Service for Which Job?
| I need to... | Use this |
|---|---|
| Run code in response to an event | Lambda |
| Run a containerized web service 24/7 | ECS (Fargate) |
| Run a VM with full OS control | EC2 |
| Store and query relational data | RDS |
| Store key-value data at massive scale | DynamoDB |
| Cache frequently accessed data | ElastiCache |
| Store files, images, backups | S3 |
| Attach a disk to an EC2 instance | EBS |
| Share files across multiple instances | EFS |
| Route HTTP traffic to microservices | ALB |
| Route TCP/UDP traffic with static IPs | NLB |
| Decouple two services with a queue | SQS |
| Send one event to many consumers | SNS |
| Route events based on content | EventBridge |
| Define infrastructure as YAML | CloudFormation |
| Define serverless infrastructure | SAM |
| Define infrastructure as code (TypeScript/Python) | CDK |
| Encrypt data at rest | KMS |
| Store secrets securely | Secrets Manager |
| Monitor metrics and set alarms | CloudWatch |
| Trace requests across services | X-Ray |
| Automate deployments | CodePipeline |
| Serve static content globally | CloudFront |
| Register a domain and route DNS | Route 53 |
AWS Bootcamp: From Novice to Architect | Author: Samuel Ogunti | License: CC BY-NC 4.0