Microservices vs. Monolith on AWS: When to Use Each (Decision Framework)
"Should we use microservices?" is one of the most frequently asked questions in cloud architecture, and one of the most frequently answered incorrectly. The internet is full of advice telling you to break everything into microservices from day one. That advice has cost companies millions of dollars and years of lost productivity.
The truth is simpler and less dramatic: monoliths are fine. Microservices are fine. Serverless is fine. The right answer depends on your team, your application, and where you are in the growth journey. This guide will give you a clear framework for making that decision on AWS.
Prerequisites: You should understand Docker containers, ECS, and AWS messaging services (SQS, SNS, EventBridge) before starting this article.
What You Will Learn
By the end of this article, you will be able to:
- Evaluate the trade-offs of monolith, microservices, and serverless architectures for a given set of project requirements
- Design a migration path from monolith to microservices using the Strangler Fig pattern
- Implement inter-service communication using SQS, SNS, EventBridge, and the Saga pattern on AWS
- Compare ECS and EKS as container orchestration platforms and select the right one for your team
- Troubleshoot common distributed systems issues including cascading failures and data consistency problems
Three Architecture Patterns on AWS
The Monolith
A monolith is a single application that contains all your business logic, deployed as one unit. On AWS, this typically looks like:
- An EC2 instance (or Auto Scaling group) running your entire application
- A single RDS database backing everything
- An Application Load Balancer in front
User --> ALB --> EC2 (entire application) --> RDS
That is it. One deployment, one codebase, one database. Simple.
Monolith advantages:
| Advantage | Why It Matters |
|---|---|
| Simple to develop | One codebase, one IDE, one repo |
| Simple to deploy | Build one artifact, deploy to one place |
| Simple to debug | All the code is in one process, stack traces are complete |
| Simple to test | Integration tests run against one application |
| Low operational overhead | One service to monitor, one log stream to watch |
| Fast to start | No distributed systems complexity on day one |
| Easy transactions | ACID transactions across all data in one database |
Monolith disadvantages:
| Disadvantage | Why It Matters |
|---|---|
| Scaling is all-or-nothing | You cannot scale just the part that is under load |
| Deployments affect everything | A bug in one feature can take down the whole application |
| Technology lock-in | The entire application uses one language/framework |
| Team coordination overhead | As the team grows, merge conflicts and coordination increase |
| Longer build and deploy times | As code grows, builds slow down |
| Blast radius | A memory leak in one feature crashes the entire process |
Microservices
A microservices architecture breaks the application into small, independently deployable services. Each service owns one piece of business functionality and has its own data store.
On AWS, this typically looks like:
- Multiple ECS or EKS services, each running a different component
- Each service has its own database (or DynamoDB table)
- Services communicate via API calls, SQS queues, or EventBridge events
- API Gateway or an ALB routes external traffic to the right service
User --> API Gateway --> Service A (ECS) --> DynamoDB
--> Service B (ECS) --> RDS
--> Service C (ECS) --> DynamoDB
Service A --SQS--> Service D (ECS) --> S3
Microservices advantages:
| Advantage | Why It Matters |
|---|---|
| Independent scaling | Scale each service based on its own demand |
| Independent deployment | Deploy Service A without touching Service B |
| Technology flexibility | Each service can use the best language/framework for its job |
| Team autonomy | Small teams own entire services end-to-end |
| Fault isolation | If Service C crashes, Services A and B keep running |
| Smaller codebases | Each service is easier to understand and modify |
| Organizational alignment | Service boundaries mirror team boundaries |
Microservices disadvantages:
| Disadvantage | Why It Matters |
|---|---|
| Distributed system complexity | Network calls fail, services go down, data gets out of sync |
| Operational overhead | N services means N deployments, N log streams, N monitoring dashboards |
| Data consistency challenges | No more simple database transactions across services |
| Testing complexity | Integration testing across services is significantly harder |
| Debugging difficulty | A request touches 5 services; finding the bug requires distributed tracing |
| Higher infrastructure cost | More load balancers, more containers, more networking |
| Network latency | Every service-to-service call adds milliseconds |
Serverless
Serverless is not really a third architecture pattern. It is a deployment model that can implement either monolithic or microservices designs. But it is worth discussing separately because it changes the trade-offs significantly.
On AWS, serverless typically looks like:
- Lambda functions handling business logic
- API Gateway for HTTP routing
- DynamoDB for data storage
- SQS and EventBridge for async communication
- S3 for file storage
- Step Functions for orchestrating workflows
User --> API Gateway --> Lambda (handler) --> DynamoDB
--> SQS --> Lambda (processor)
--> S3
Serverless advantages:
| Advantage | Why It Matters |
|---|---|
| Zero idle cost | You pay nothing when nobody is using your application |
| Automatic scaling | From zero to thousands of concurrent executions without configuration |
| No server management | No patching, no capacity planning, no OS maintenance |
| Fast development | Focus entirely on business logic |
| Built-in high availability | Lambda runs across multiple AZs automatically |
| Pay-per-invocation | Costs scale exactly with usage |
Serverless disadvantages:
| Disadvantage | Why It Matters |
|---|---|
| Cold starts | First invocation after idle period adds latency (100ms-1s) |
| 15-minute execution limit | Long-running processes cannot use Lambda |
| Vendor lock-in | Your code is tightly coupled to AWS services |
| Limited compute resources | Max 10 GB memory per Lambda function |
| Complex debugging | Distributed, event-driven systems are harder to trace |
| State management | Lambda is stateless; you need external storage for everything |
| Concurrency limits | Default 1,000 concurrent executions per region (can increase) |
When the Monolith Is the Right Choice
This is the section most architecture articles skip, and it is the most important one.
Start with a monolith when:
- You are a small team (fewer than 10 developers). The overhead of managing multiple services, deployment pipelines, and inter-service communication is not justified. Your team will move faster with one codebase.
- You are building an MVP or prototype. You do not know what the final architecture will look like. A monolith lets you iterate quickly, figure out the domain boundaries, and refactor later with actual knowledge instead of guesses.
- Your application has tightly coupled features. If Feature A always needs data from Feature B and Feature C in the same request, splitting them into separate services adds network latency and complexity with no benefit.
- You need simple transactions. If your business logic requires atomic database transactions across multiple entities, a monolith with a single database handles this trivially. In microservices, you need distributed transactions or eventual consistency patterns like sagas, which are dramatically more complex.
- You value deployment simplicity. One build, one deploy, one rollback. In a monolith, deploying is boring. In microservices, coordinating deployments across dependent services is a project in itself.
Real-world example on AWS:
Internet --> ALB --> Auto Scaling Group (t3.large instances)
--> Application (Python/Django)
--> RDS PostgreSQL (Multi-AZ)
This architecture serves millions of requests, is highly available, and costs a fraction of what a microservices equivalent would cost. For most startups and small teams, this is the right answer.
The Monolith Cost Advantage
Here is a cost comparison that illustrates why starting with a monolith often makes sense:
| Component | Monolith | Microservices (5 services) |
|---|---|---|
| Compute | 2x t3.large ($120/mo) | 5x ECS tasks ($250/mo) |
| Load Balancer | 1x ALB ($22/mo) | 1x ALB + internal ($44/mo) |
| Database | 1x RDS ($100/mo) | 3x DynamoDB + 2x RDS ($300/mo) |
| NAT Gateway | 1x ($35/mo) | 1x ($35/mo) |
| Monitoring | CloudWatch basic ($10/mo) | CloudWatch + X-Ray ($50/mo) |
| Total | ~$287/month | ~$679/month |
The microservices version costs 2.4x more for the same functionality. That difference only makes sense when you actually need the benefits microservices provide.
When to Move to Microservices
Microservices solve specific problems. If you do not have those problems, you do not need microservices.
Consider microservices when:
- Your team has grown past 15-20 developers. Multiple teams stepping on each other in the same codebase is the number one sign you need service boundaries. Each team should own a service they can develop and deploy independently.
- You have components with vastly different scaling needs. If your image processing pipeline needs 10x the compute of your user API, scaling them independently saves money and improves performance.
- You need independent deployment cycles. If your payments team needs to deploy 3 times a day but your analytics team deploys weekly, coupling them in a monolith creates friction.
- You need technology diversity. Maybe your real-time processing needs Go for performance, your ML pipeline needs Python, and your API needs Node.js. Microservices let each team choose the best tool.
- You want fault isolation. In a monolith, a memory leak in one feature crashes the entire application. In microservices, one service can fail without taking down everything else (if you design the system correctly).
- You need to comply with organizational standards. Large organizations often require teams to own and operate their own services. Microservices align with this organizational model.
Real-world example on AWS:
Internet --> API Gateway
/users --> ECS Service (Node.js) --> DynamoDB (Users table)
/orders --> ECS Service (Python) --> RDS PostgreSQL (Orders)
/payments --> ECS Service (Go) --> DynamoDB (Payments table)
/search --> ECS Service (Java) --> OpenSearch
Order Service --SQS--> Payment Service
Payment Service --EventBridge--> Notification Service
This architecture makes sense when you have separate teams owning Users, Orders, Payments, and Search, and when those services have different scaling profiles and technology needs.
Microservices on AWS: ECS vs EKS
If you choose microservices, you need a container orchestration platform. The two main options on AWS:
| Feature | ECS (Elastic Container Service) | EKS (Elastic Kubernetes Service) |
|---|---|---|
| Complexity | Lower (AWS-native) | Higher (Kubernetes) |
| Learning curve | Moderate | Steep |
| Portability | AWS only | Multi-cloud, on-premises |
| Cost | Just Fargate/EC2 pricing | $0.10/hour per cluster + compute |
| Integration | Deep AWS integration | Good AWS integration + K8s ecosystem |
| Best for | AWS-focused teams | Teams with K8s experience, multi-cloud |
# Create an ECS service for one microservice
aws ecs create-service \
--cluster my-microservices-cluster \
--service-name user-service \
--task-definition user-service:3 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration '{
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-abc123"],
"assignPublicIp": "DISABLED"
}
}' \
--load-balancers '[{
"targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/user-svc/abc123",
"containerName": "user-service",
"containerPort": 8080
}]'
# Configure service auto scaling
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/my-microservices-cluster/user-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 20
When Serverless Wins
Serverless is ideal when:
- Your traffic is unpredictable. An API that gets 10 requests one hour and 10,000 the next. Serverless scales to zero and up to thousands without any configuration.
- You are building event-driven workflows. File uploaded to S3? Process it with Lambda. Message arrives in SQS? Lambda handles it. Database record changed? Lambda reacts. These event-driven patterns are what serverless was built for.
- Your team is small and does not want to manage infrastructure. Serverless eliminates patching, capacity planning, and OS management entirely.
- Your workloads are short-lived. API requests, file processing, data transformations, and webhook handlers that complete in seconds or minutes are perfect for Lambda.
- You want minimal cost at low traffic. A serverless application serving 1,000 requests per day costs pennies. The same application on EC2 costs at least $7-8/month even if idle.
Real-world example on AWS:
Internet --> CloudFront --> S3 (static frontend)
--> API Gateway --> Lambda functions --> DynamoDB
--> SES (email)
--> S3 (file storage)
--> EventBridge --> Lambda (scheduled tasks)
This is a complete production application with zero servers to manage. It scales from zero to massive and costs almost nothing at low traffic.
Serverless Cost Comparison at Different Traffic Levels
| Monthly Requests | Lambda Cost | Equivalent EC2 (t3.small) |
|---|---|---|
| 10,000 | $0.02 | $15.18 |
| 100,000 | $0.20 | $15.18 |
| 1,000,000 | $2.00 | $15.18 |
| 10,000,000 | $20.00 | $15.18 |
| 50,000,000 | $100.00 | $30.36 (need larger) |
| 100,000,000 | $200.00 | $60.72 (need scaling) |
The crossover point where EC2 becomes cheaper than Lambda depends on your workload, but it is typically around 10-50 million requests/month. Below that, serverless wins on cost. Above that, you need to evaluate.
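To see roughly where that crossover falls, here is a back-of-envelope calculator in Python. The pricing constants (Lambda request and GB-second rates, a t3.small monthly price, and an assumed per-instance capacity of 50 million requests/month) are illustrative assumptions for the sketch, not current AWS rates:

```python
# Sketch: estimate the monthly cost crossover between Lambda and EC2.
# All pricing figures below are illustrative assumptions, not live AWS rates.

LAMBDA_REQUEST_PRICE = 0.20 / 1_000_000   # $ per request (assumed)
LAMBDA_GB_SECOND_PRICE = 0.0000166667     # $ per GB-second (assumed)
EC2_T3_SMALL_MONTHLY = 15.18              # $ per instance-month (assumed)

def lambda_monthly_cost(requests, avg_duration_ms=100, memory_gb=0.128):
    """Request charge plus duration charge for one month of invocations."""
    request_cost = requests * LAMBDA_REQUEST_PRICE
    gb_seconds = requests * (avg_duration_ms / 1000) * memory_gb
    return request_cost + gb_seconds * LAMBDA_GB_SECOND_PRICE

def ec2_monthly_cost(requests, capacity_per_instance=50_000_000):
    """EC2 cost steps up in whole instances as traffic grows."""
    instances = max(1, -(-requests // capacity_per_instance))  # ceiling division
    return instances * EC2_T3_SMALL_MONTHLY

# At low traffic Lambda is cheaper; at high traffic EC2 pulls ahead.
print(lambda_monthly_cost(100_000) < ec2_monthly_cost(100_000))          # True
print(lambda_monthly_cost(100_000_000) > ec2_monthly_cost(100_000_000))  # True
```

Plug in your own average duration and memory size; duration charges shift the crossover point significantly for heavier functions.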
The Decision Framework
Use this flowchart to choose your architecture:
Question 1: How big is your team?
- Fewer than 10 developers? Start with a monolith (on EC2/ECS) or serverless.
- 10-20 developers? Consider a modular monolith or begin splitting into a few services.
- More than 20 developers? Microservices likely make sense for team autonomy.
Question 2: What is your traffic pattern?
- Consistent, high-volume traffic? EC2 with Auto Scaling (monolith or microservices).
- Spiky or unpredictable? Serverless (Lambda) or containers with Fargate.
- Low traffic or MVPs? Serverless. You pay almost nothing.
Question 3: How long do your processes run?
- Under 15 minutes? Lambda is an option.
- Over 15 minutes? EC2 or ECS/Fargate with longer-running tasks.
- Continuous processing? EC2 or ECS.
Question 4: How important is deployment independence?
- One team, deploying together? Monolith is simpler.
- Multiple teams needing independent release cycles? Microservices.
Question 5: What is your budget for operational overhead?
- Minimal ops budget? Serverless (AWS manages everything).
- Moderate ops team? Monolith on ECS/Fargate (managed containers).
- Dedicated platform team? Microservices on ECS/EKS.
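The five questions above can be encoded as a first-pass recommendation function. The thresholds and return values mirror the guidance in this framework; treat it as an illustration of the decision logic, not a substitute for judgment:

```python
# Sketch: the decision framework as a function. Thresholds follow the
# article's guidance; real architecture decisions need more context.

def recommend_architecture(team_size, traffic, max_runtime_minutes,
                           independent_deploys, ops_budget):
    """traffic: 'steady' | 'spiky' | 'low'; ops_budget: 'minimal' | 'moderate' | 'platform-team'."""
    if max_runtime_minutes > 15 and traffic != "steady":
        return "containers (ECS/Fargate)"   # Lambda's 15-minute limit rules it out
    if team_size < 10 and traffic in ("spiky", "low") and ops_budget == "minimal":
        return "serverless"
    if team_size > 20 or independent_deploys:
        return "microservices"
    return "monolith (or modular monolith)"

print(recommend_architecture(5, "low", 1, False, "minimal"))           # serverless
print(recommend_architecture(30, "steady", 5, True, "platform-team"))  # microservices
print(recommend_architecture(8, "steady", 5, False, "moderate"))       # monolith (or modular monolith)
```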
Summary table:
| Factor | Monolith | Microservices | Serverless |
|---|---|---|---|
| Team size | Small (1-10) | Large (10+) | Any |
| Complexity | Low | High | Medium |
| Scaling | All-or-nothing | Per-service | Automatic |
| Deployment speed | Fast (one unit) | Fast (per service) | Fastest |
| Infrastructure cost | Medium | Higher | Lowest at low traffic |
| Operational overhead | Low | High | Lowest |
| Transaction support | Strong (ACID) | Eventual consistency | Eventual consistency |
| Debugging | Easy (one process) | Hard (distributed) | Medium (event-driven) |
| Best for | MVPs, small teams, tightly coupled features | Large orgs, diverse scaling needs, team autonomy | Event-driven, variable traffic, no-ops teams |
The Pragmatic Middle Ground: The Modular Monolith
There is a pattern that does not get enough attention: the modular monolith. You build a single deployable application, but you organize the code into well-defined modules with clear boundaries and interfaces.
This gives you the simplicity of a monolith for deployment and operations, while setting up clean boundaries that make a future migration to microservices straightforward if you ever need it.
On AWS, this looks identical to a monolith from an infrastructure perspective. The difference is entirely in code organization. Many successful companies run on modular monoliths, including Shopify, which handles massive scale with this approach.
Key rules for a modular monolith:
- Each module has a public API (interface) and private implementation
- Modules communicate through interfaces, never by reaching into another module's database tables
- Each module could theoretically become its own service without rewriting the interface
- Shared database, but each module owns its own tables
Application
/modules
/users (owns: users table, profiles table)
/orders (owns: orders table, line_items table)
/payments (owns: payments table, refunds table)
/search (owns: search_index table)
/notifications (owns: notification_log table)
/shared
/auth (shared authentication middleware)
/logging (shared logging utilities)
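The boundary rule above, modules talk through public interfaces and never reach into each other's tables, can be sketched in Python. The UserService and OrderService class names are illustrative, not from a real codebase:

```python
# Sketch of the module boundary rule: Orders talks to Users only through
# the users module's public API, never by querying the users table directly.
# All names here are illustrative; the dicts stand in for private tables.

from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    email: str

class UserService:
    """Public API of the users module; owns the users table."""
    def __init__(self):
        self._users = {}  # private storage; other modules must not touch this

    def register(self, user_id, email):
        self._users[user_id] = User(user_id, email)

    def get_user(self, user_id):
        return self._users.get(user_id)

class OrderService:
    """The orders module depends on the users interface, not its storage."""
    def __init__(self, users: UserService):
        self._users = users
        self._orders = []

    def place_order(self, user_id, total):
        user = self._users.get_user(user_id)   # interface call, not a table join
        if user is None:
            raise ValueError("unknown user")
        self._orders.append({"user": user.email, "total": total})
        return self._orders[-1]

users = UserService()
users.register("u1", "a@example.com")
orders = OrderService(users)
print(orders.place_order("u1", 99.99))  # {'user': 'a@example.com', 'total': 99.99}
```

Because OrderService only ever calls the UserService interface, extracting the users module into its own service later means swapping the in-process call for an HTTP or queue call without rewriting the orders logic.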
Communication Patterns in Microservices
If you do go the microservices route, one of the most important decisions is how your services talk to each other. This is where most microservices implementations go wrong.
Synchronous Communication (REST/gRPC)
Service A calls Service B directly and waits for a response.
Service A --HTTP GET /users/123--> Service B --> response
Use when: The caller needs the response immediately to continue processing (e.g., an API request that needs user data to build the response).
Risk: If Service B is down, Service A fails too. This creates tight coupling through availability. Chain enough synchronous calls together and you have a distributed monolith: all the complexity of microservices with none of the benefits.
Mitigating the risk:
# Use circuit breakers to prevent cascading failures
# When Service B is unavailable, the circuit breaker "opens"
# and returns a fallback response instead of waiting and timing out
# AWS App Mesh provides built-in circuit breaker support
# Or implement in your application code with libraries like:
# - resilience4j (Java)
# - polly (C#)
# - tenacity (Python)
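A minimal circuit breaker can be sketched in a few lines of Python. This is an illustrative toy, not a production implementation; the failure threshold and reset window are assumptions you would tune per service:

```python
# Minimal circuit-breaker sketch (illustrative, not production code):
# after `max_failures` consecutive failures the breaker opens and calls
# return the fallback immediately instead of hitting the failing dependency.

import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback          # open: fail fast, no network call
            self.opened_at = None        # half-open: try the dependency again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback

def flaky():
    raise ConnectionError("service B is down")

breaker = CircuitBreaker(max_failures=2)
print(breaker.call(flaky, fallback="cached"))   # cached (1st failure)
print(breaker.call(flaky, fallback="cached"))   # cached (2nd failure; breaker opens)
print(breaker.opened_at is not None)            # True
```

The libraries listed above implement the same idea with better failure-rate accounting (rolling windows, half-open probes); use one of those rather than rolling your own in production.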
Asynchronous Communication (Queues and Events)
Service A puts a message on a queue. Service B processes it later.
Service A --message--> SQS Queue --> Service B processes when ready
Service A --event--> EventBridge --> Service B, C, D all react independently
Use when: The caller does not need an immediate response. Order processing, notification sending, data synchronization, and log processing are all naturally asynchronous.
AWS services for async communication:
| Service | Best For | Delivery | Ordering |
|---|---|---|---|
| SQS Standard | Point-to-point, at-least-once delivery | At least once | Best effort |
| SQS FIFO | Point-to-point, exactly-once, ordered | Exactly once | Guaranteed |
| SNS | Fan-out to multiple subscribers | At least once | Best effort |
| EventBridge | Event routing with filtering rules | At least once | Best effort |
| Kinesis | Real-time streaming data processing | At least once | Per-shard |
| Step Functions | Orchestrating multi-step workflows | Exactly once | Sequential |
Real-world example: When a customer places an order, the Order Service writes the order to its database and publishes an "OrderPlaced" event to EventBridge. The Payment Service, Inventory Service, and Notification Service all subscribe to that event and process it independently. If the Notification Service is down, the order still gets processed. The notification will be sent when the service recovers.
# Publish an event when an order is placed
aws events put-events \
--entries '[{
"Source": "com.myapp.orders",
"DetailType": "OrderPlaced",
"Detail": "{\"orderId\": \"ord-123\", \"customerId\": \"cust-456\", \"total\": 99.99}",
"EventBusName": "default"
}]'
# Create rules that route the event to different services
aws events put-rule \
--name "OrderPlaced-to-Payment" \
--event-pattern '{
"source": ["com.myapp.orders"],
"detail-type": ["OrderPlaced"]
}'
aws events put-targets \
--rule "OrderPlaced-to-Payment" \
--targets '[
{"Id": "payment-queue", "Arn": "arn:aws:sqs:us-east-1:123456789012:payment-processing"},
{"Id": "inventory-queue", "Arn": "arn:aws:sqs:us-east-1:123456789012:inventory-updates"},
{"Id": "notification-fn", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:send-order-confirmation"}
]'
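One consequence of at-least-once delivery is that a subscriber like the Notification Service may receive the same OrderPlaced event twice. A sketch of an idempotent consumer in Python; the in-memory set stands in for a durable store such as a DynamoDB table with conditional writes:

```python
# Because SQS Standard and EventBridge deliver at least once, a consumer can
# see the same event twice. An idempotent handler records processed event IDs
# so redeliveries become no-ops. The in-memory set is a stand-in for a
# durable store (e.g. DynamoDB with a conditional write).

processed_ids = set()
notifications_sent = []

def handle_order_placed(event):
    event_id = event["orderId"]
    if event_id in processed_ids:
        return "duplicate-skipped"
    processed_ids.add(event_id)
    notifications_sent.append(f"confirmation for {event_id}")
    return "processed"

event = {"orderId": "ord-123", "customerId": "cust-456", "total": 99.99}
print(handle_order_placed(event))   # processed
print(handle_order_placed(event))   # duplicate-skipped (redelivery is harmless)
print(len(notifications_sent))      # 1
```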
The Saga Pattern for Distributed Transactions
In a monolith, you can wrap multiple database operations in a single transaction. In microservices, each service has its own database, so you cannot do that. The Saga pattern solves this with a sequence of local transactions coordinated by either choreography (events) or orchestration (Step Functions).
# Example: Order Saga with Step Functions (orchestration)
# Step 1: Create Order (Order Service)
# Step 2: Reserve Inventory (Inventory Service)
# Step 3: Process Payment (Payment Service)
# Step 4: Confirm Order (Order Service)
#
# If Step 3 fails:
# Compensate Step 2: Release Inventory
# Compensate Step 1: Cancel Order
# Step Functions handles the orchestration and compensation automatically
aws stepfunctions create-state-machine \
--name "OrderSaga" \
--definition '{
"StartAt": "CreateOrder",
"States": {
"CreateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-order",
"Next": "ReserveInventory",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "CancelOrder"}]
},
"ReserveInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserve-inventory",
"Next": "ProcessPayment",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}]
},
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
"Next": "ConfirmOrder",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseInventory"}]
},
"ConfirmOrder": {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:confirm-order", "End": true},
"ReleaseInventory": {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:release-inventory", "Next": "CancelOrder"},
"CancelOrder": {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:cancel-order", "End": true}
}
}' \
--role-arn "arn:aws:iam::123456789012:role/StepFunctionsRole"
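The same saga logic can be sketched in plain Python: run each step's local transaction, and on failure run the compensations for already-completed steps in reverse order. The step and function names here are illustrative stand-ins for the services involved:

```python
# Saga sketch: each step is a (action, compensation) pair. On failure,
# compensate completed steps in reverse order. Functions are illustrative
# stand-ins for the Order, Inventory, and Payment services.

def run_saga(steps):
    """steps: list of (action, compensation) pairs; returns (ok, log)."""
    log, done = [], []
    for action, compensate in steps:
        try:
            action()
            log.append(f"{action.__name__}: ok")
            done.append(compensate)
        except Exception as exc:
            log.append(f"{action.__name__}: failed ({exc})")
            for comp in reversed(done):       # compensate in reverse order
                comp()
                log.append(f"{comp.__name__}: compensated")
            return False, log
    return True, log

def create_order(): pass
def cancel_order(): pass
def reserve_inventory(): pass
def release_inventory(): pass
def process_payment():
    raise RuntimeError("card declined")       # simulate the payment step failing
def refund_payment(): pass

ok, log = run_saga([
    (create_order, cancel_order),
    (reserve_inventory, release_inventory),
    (process_payment, refund_payment),
])
print(ok)       # False
print(log[-1])  # cancel_order: compensated
```

Step Functions gives you the same semantics declaratively, plus retries, timeouts, and an execution history you can inspect when a saga fails midway.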
The Strangler Fig Pattern: Migrating Gradually
If you have an existing monolith and decide you need microservices, do not rewrite everything at once. Use the Strangler Fig pattern:
- Identify one feature to extract (start with something self-contained)
- Build the new microservice alongside the monolith
- Route traffic for that feature to the new service (API Gateway or ALB path-based routing)
- Once the new service is stable, remove the old code from the monolith
- Repeat for the next feature
Phase 1: Monolith handles everything
Phase 2: New Search Service handles /search, monolith handles everything else
Phase 3: New User Service handles /users, Search Service handles /search
Phase 4: Continue until the monolith is empty (or small enough to maintain)
On AWS, this is straightforward with an Application Load Balancer or API Gateway. You create path-based routing rules that send specific URL patterns to the new service while everything else continues to hit the monolith.
# ALB path-based routing example
# /api/search/* goes to the new Search Service target group
# Everything else goes to the monolith target group
aws elbv2 create-rule \
--listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456 \
--conditions '[{"Field":"path-pattern","Values":["/api/search/*"]}]' \
--actions '[{"Type":"forward","TargetGroupArn":"arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/search-service/abc123"}]' \
--priority 10
This pattern reduces risk because you are migrating incrementally. If the new service has problems, you route traffic back to the monolith. No big bang cutover required.
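The routing decision that path-based rules encode can be sketched as a function: requests matching an extracted path prefix go to the new service, and everything else falls through to the monolith. The prefixes and target names are illustrative:

```python
# Sketch of strangler-fig routing: extracted path prefixes map to new
# services; anything unmatched still hits the monolith (the default rule).
# Prefixes and target names are illustrative.

EXTRACTED_PREFIXES = {
    "/api/search/": "search-service",
    "/api/users/": "user-service",
}

def route(path):
    for prefix, target in EXTRACTED_PREFIXES.items():
        if path.startswith(prefix):
            return target
    return "monolith"     # default: un-migrated traffic stays on the monolith

print(route("/api/search/products"))  # search-service
print(route("/api/orders/123"))       # monolith
```

Rolling back a problematic extraction is just deleting its entry from the routing table, which is exactly why this migration pattern is low-risk.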
How to Choose What to Extract First
When using the Strangler Fig pattern, choose your first extraction carefully:
| Good First Candidates | Why | Bad First Candidates | Why |
|---|---|---|---|
| Search functionality | Usually self-contained, read-heavy | User authentication | Touches everything, high risk |
| Notification sending | Naturally async, low coupling | Core business logic | Too tightly coupled initially |
| File/image processing | Event-driven, clear boundary | Payment processing | High risk, needs atomic transactions |
| Reporting/analytics | Read-only, different scaling needs | Shared data models | Creates distributed data challenges |
Observability: The Non-Negotiable for Microservices
If you run microservices without proper observability, you will spend most of your time debugging. When a request touches 5 services and something fails, you need to trace the request across all of them.
Three pillars of observability:
| Pillar | What It Shows | AWS Service |
|---|---|---|
| Logs | What happened in each service | CloudWatch Logs |
| Metrics | How the system is performing | CloudWatch Metrics |
| Traces | How a request flowed across services | X-Ray |
AWS X-Ray is especially important for microservices. It traces requests across Lambda functions, ECS services, API Gateway, and other AWS services, showing you exactly where time is spent and where failures occur.
# Enable X-Ray tracing on an API Gateway stage
aws apigateway update-stage \
--rest-api-id abc123 \
--stage-name prod \
--patch-operations op=replace,path=/tracingEnabled,value=true
# Enable X-Ray on an ECS task definition
# Add the X-Ray daemon as a sidecar container in your task definition
# The daemon collects traces from your application and sends them to X-Ray
# Query X-Ray for traces with errors in the last hour
# (note: "date -u -v-1H" is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%s)
aws xray get-trace-summaries \
--start-time $(date -u -v-1H +%s) \
--end-time $(date -u +%s) \
--filter-expression 'service("order-service") AND fault = true'
# Create a CloudWatch dashboard for microservices health
aws cloudwatch put-dashboard \
--dashboard-name "Microservices-Health" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ECS", "CPUUtilization", "ServiceName", "user-service"],
["AWS/ECS", "CPUUtilization", "ServiceName", "order-service"],
["AWS/ECS", "CPUUtilization", "ServiceName", "payment-service"]
],
"title": "CPU Utilization by Service",
"period": 300
}
}
]
}'
Without distributed tracing, debugging microservices is like debugging a monolith with no stack traces. You know something is broken, but you have no idea where.
Service Mesh with AWS App Mesh
For complex microservices deployments, a service mesh adds a layer of infrastructure that handles service-to-service communication, observability, and traffic management:
# Create an App Mesh virtual service
aws appmesh create-virtual-service \
--mesh-name my-app-mesh \
--virtual-service-name user-service.local \
--spec '{
"provider": {
"virtualRouter": {
"virtualRouterName": "user-service-router"
}
}
}'
App Mesh gives you circuit breakers, retry policies, and traffic shifting without modifying your application code. It is useful for large microservices deployments but adds complexity that smaller deployments do not need.
Troubleshooting Common Errors
Circuit breaker tripping too aggressively
Your circuit breaker opens and returns fallback responses even though the downstream service is healthy. This usually means your thresholds are too sensitive. Start with a failure rate threshold of 50% over a 60-second rolling window, then tune from there. In App Mesh, check the outlierDetection settings on your virtual node. Also confirm your health check endpoints return 200 quickly and are not timing out on cold starts.
Distributed tracing gaps (missing spans in X-Ray)
You see incomplete traces where requests disappear between services. This happens when one or more services do not propagate the X-Ray trace header (X-Amzn-Trace-Id). Every service in the call chain must forward that header on outbound requests. If you use an HTTP client library, configure it to pass through the trace header automatically. For ECS, verify the X-Ray daemon sidecar container is running and healthy in each task definition.
Service discovery failures (ECS services cannot find each other)
Containers start successfully but fail to connect to other services by name. If you use AWS Cloud Map for service discovery, confirm that your services are registering instances to the correct namespace and that the security groups allow traffic on the expected ports between services. Run aws servicediscovery list-instances --service-id <id> to verify registrations. DNS-based discovery can also fail if the VPC DNS resolution settings are not enabled.
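The trace-header fix for X-Ray gaps amounts to copying the incoming X-Amzn-Trace-Id header onto every outbound request. A minimal sketch; forward_trace_header is a hypothetical helper, and instrumented HTTP clients usually do this for you:

```python
# Sketch of X-Ray trace-header propagation: copy the incoming
# X-Amzn-Trace-Id header onto outbound requests so X-Ray can stitch
# spans across services. forward_trace_header is an illustrative helper.

TRACE_HEADER = "X-Amzn-Trace-Id"

def forward_trace_header(incoming_headers, outgoing_headers):
    """Copy the X-Ray trace header, if present, onto an outbound request."""
    trace = incoming_headers.get(TRACE_HEADER)
    if trace is not None:
        outgoing_headers[TRACE_HEADER] = trace
    return outgoing_headers

incoming = {TRACE_HEADER: "Root=1-67891233-abcdef012345678912345678"}
outgoing = forward_trace_header(incoming, {"Content-Type": "application/json"})
print(TRACE_HEADER in outgoing)  # True
```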
How This Shows Up in Architecture Decisions
Architecture reviews and design discussions frequently present these kinds of scenarios:
- "A startup with 5 developers wants to build an MVP quickly." (Monolith or serverless)
- "A company has teams that need to deploy independently." (Microservices)
- "An application processes images uploaded to S3." (Serverless/Lambda)
- "A workload has steady, predictable traffic." (EC2 with Reserved Instances or Savings Plans)
- "A company wants to minimize operational overhead." (Serverless)
- "An application needs to coordinate a multi-step workflow with compensation." (Step Functions)
- "Services need to communicate without tight coupling." (SQS, SNS, or EventBridge)
No single pattern is universally "right." The skill is matching the pattern to the requirements. Understanding trade-offs is what matters.
Quick Reference for Architecture Decisions
| If the requirement says... | Think... |
|---|---|
| "MVP", "small team", "simple" | Monolith or serverless |
| "Independent deployment", "team autonomy" | Microservices |
| "Event-driven", "S3 trigger", "variable traffic" | Serverless (Lambda) |
| "Long-running process" (>15 min) | ECS/Fargate or EC2 |
| "Decouple services", "loose coupling" | SQS or EventBridge |
| "Orchestrate workflow", "compensation" | Step Functions |
| "Minimize operational overhead" | Serverless or Fargate |
| "Container orchestration", "multi-cloud" | EKS (Kubernetes) |
| "Container orchestration", "AWS-native" | ECS |
Pricing note: Monthly cost estimates (such as the monolith vs. microservices comparison) cited in this article are for us-east-1 and were verified in May 2026. Check the AWS Pricing Calculator for current rates in your Region.
Hands-On Challenge
Deploy two services that communicate asynchronously on AWS. When you are finished, verify you have met all of these success criteria:
- Two ECS Fargate services are running in the same cluster, each with its own task definition and container image
- An SQS queue connects the two services (Service A sends messages, Service B consumes them)
- A dead letter queue is configured on the SQS queue with maxReceiveCount set to 3
- Both services write logs to CloudWatch Logs with distinct log groups
- X-Ray tracing is enabled and you can view a trace map showing the request flow between both services
- Service B processes a test message published by Service A, and you can confirm delivery by checking the CloudWatch logs
- You can stop Service B, send a message from Service A, restart Service B, and confirm the message was still processed (proving the decoupling works)
Next Steps
If you are building something new, start with the simplest architecture that meets your requirements. You can always add complexity later. You cannot easily remove it.
If you are evaluating an existing system, ask yourself: "What specific problem would microservices solve that I cannot solve by better organizing my current architecture?" If the answer is not clear, you probably do not need microservices yet.
Remember Martin Fowler's advice: "You should not start a new project with microservices, even if you are sure your application will be big enough to make it worthwhile." Start with a monolith, keep it modular, and extract services when you have a clear need.
Build it yourself: This topic is covered hands-on in Module 18: Architecture Patterns on AWS of our AWS Bootcamp.