Cloud Concepts & AWS Services
Complete notes for a Junior DevOps role. Learn core cloud principles, then dive deep into AWS services with real-world examples and diagrams. GCP & Azure equivalents included throughout.
Cloud Fundamentals
The Core Idea
Cloud computing means renting computing resources over the internet instead of buying and managing your own hardware. Think of it like electricity: you don't build your own power plant; you plug into the grid and pay for what you use.
Before cloud, a startup wanting to launch an app needed to buy servers, rent datacenter space, hire a sysadmin, buy networking hardware, and wait weeks for delivery, all before writing a single line of code. Cloud made that a 5-minute signup.
NIST 5 Essential Characteristics
The official NIST definition says cloud computing must have all 5 of these:
1. On-Demand Self-Service
You provision resources yourself, without talking to a human. Spin up an EC2 instance at 2 AM, no approval required.
2. Broad Network Access
Resources are accessible over the internet from any device: laptop, phone, or another server anywhere on the planet.
3. Resource Pooling (Multi-tenancy)
Provider serves many customers from the same physical hardware, dynamically assigning resources. You don't know (or care) which physical server you're on.
4. Rapid Elasticity
Scale up or down fast β sometimes automatically. Resources feel unlimited from the user's perspective. Traffic spike at 9 AM? Auto Scaling adds servers in minutes.
5. Measured Service
You pay for exactly what you use, like a utility bill. AWS charges per hour/second for compute, per GB for storage, and per million requests for API calls.
CAPEX vs OPEX
| Model | What it means | Example | Cloud relevance |
|---|---|---|---|
| CAPEX (Capital Expense) | Upfront large purchase. You own the asset. | Buying 50 physical servers | Traditional / On-prem model |
| OPEX (Operational Expense) | Ongoing cost. Pay as you go. | Paying AWS monthly bill | Cloud model: predictable, flexible |
The "Pizza as a Service" Analogy
These models define how much of the stack the cloud provider manages vs how much you manage.
| Layer | On-Prem (you manage) | IaaS | PaaS | SaaS |
|---|---|---|---|---|
| Application | You | You | You | Provider |
| Data | You | You | You | Provider |
| Runtime / Middleware | You | You | Provider | Provider |
| OS | You | You | Provider | Provider |
| Virtualization | You | Provider | Provider | Provider |
| Hardware / Network / DC | You | Provider | Provider | Provider |
IaaS: Infrastructure as a Service
You get raw compute, storage, and networking. You manage the OS up. Most control, most responsibility.
Real example: You rent an EC2 instance, install Ubuntu, install Nginx, deploy your app. If the OS crashes, that's on you to fix.
PaaS: Platform as a Service
You just deploy your application code/container. The provider handles OS patching, scaling infrastructure, runtime. Less control, less ops work.
Real example: You push a Python Flask app to Elastic Beanstalk. AWS auto-provisions EC2, load balancer, and auto-scaling. You never SSH into a server.
SaaS: Software as a Service
You're just a user of a complete application. No infrastructure, no app management. Just login and use it.
Real example: Gmail, Slack, Salesforce. AWS WorkMail is also SaaS.
AWS: IaaS: EC2 | PaaS: Elastic Beanstalk, Lambda | SaaS: WorkMail, Chime
GCP: IaaS: Compute Engine (GCE) | PaaS: App Engine, Cloud Run | SaaS: Google Workspace
Azure: IaaS: Azure VMs | PaaS: Azure App Service, Azure Functions | SaaS: Microsoft 365, Dynamics 365
Public Cloud
Resources run on provider's shared infrastructure, accessible over the public internet. AWS, GCP, Azure are all public clouds. Best for: startups, variable workloads, apps without strict data residency needs.
Private Cloud
Cloud infrastructure dedicated to one organization. Can be on-prem or in a provider's dedicated facility. Tech: OpenStack, VMware vSphere. Best for: banks, government, healthcare, and others with strict compliance requirements.
Hybrid Cloud
Mix of on-prem (private) + public cloud, connected by VPN or Direct Connect. Best for: organizations with legacy systems migrating gradually to cloud, or data residency requirements with burst needs.
Multi-Cloud
Using multiple public cloud providers simultaneously (e.g., AWS for compute + GCP for ML). Best for: avoiding vendor lock-in, using best-of-breed services, or regulatory reasons.
The Most Important Concept in Cloud Security
AWS (and all cloud providers) operate under a shared responsibility model. In simple terms: AWS is responsible for security OF the cloud. YOU are responsible for security IN the cloud.
+------------------------------------------------------------------+
|                     CUSTOMER RESPONSIBILITY                      |
|                     (Security IN the cloud)                      |
|                                                                  |
|  +-------------+  +-------------+  +--------------------------+  |
|  | Customer    |  | Platform,   |  | Identity & Access Mgmt   |  |
|  | Data        |  | App, OS     |  | (IAM users, policies)    |  |
|  +-------------+  +-------------+  +--------------------------+  |
|  +-------------+  +-------------+  +--------------------------+  |
|  | Firewall /  |  | Network     |  | Client-side & Server-    |  |
|  | Sec Groups  |  | Config      |  | side Encryption          |  |
|  +-------------+  +-------------+  +--------------------------+  |
+------------------------------------------------------------------+
+------------------------------------------------------------------+
|                        AWS RESPONSIBILITY                        |
|                     (Security OF the cloud)                      |
|                                                                  |
|  +------------------------------------------------------------+  |
|  | Compute | Storage | Networking | Database (managed infra)  |  |
|  +------------------------------------------------------------+  |
|  +------------------------------------------------------------+  |
|  | Physical Security of Datacenters, Hardware, Network Infra  |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
Responsibility Shifts Based on Service Model
| Concern | IaaS (EC2) | PaaS (Beanstalk) | SaaS (WorkMail) |
|---|---|---|---|
| Physical datacenter | AWS | AWS | AWS |
| Hypervisor / Hardware | AWS | AWS | AWS |
| OS patching | You | AWS | AWS |
| Runtime/middleware | You | AWS | AWS |
| Application code | You | You | AWS |
| Data & encryption | You | You | You |
| IAM / access control | You | You | You |
GCP: Same model: Google secures the infrastructure, you secure your workloads and data. Called "Shared Fate" in GCP (a more collaborative tone: Google provides security tools to help you).
Azure: Same model. Azure's documentation explicitly shows a layered diagram. For managed services (like Azure SQL), Azure takes on more responsibility than for VMs.
Global Infrastructure
Why Geography Matters in Cloud
Your users are physically distributed. A request from a user in India to a server in the US can easily take ~150 ms or more per round trip. Cloud providers build datacenters globally to solve this. But geography also matters for compliance (GDPR restricts moving EU personal data outside the EU/EEA), disaster recovery (separate physical locations), and cost (prices vary by region).
Regions
A Region is a geographically separate area of the world with a cluster of datacenters. Each region has a unique name like us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland).
- AWS has 33+ regions worldwide (2024)
- Regions are designed to be independent: a region-wide failure shouldn't affect other regions
- Not all services are available in all regions (e.g., some AI services only in US regions initially)
- Data does NOT automatically replicate across regions; you must explicitly configure cross-region replication
Availability Zones (AZs)
Each region has at least 3 AZs (some have up to 6). An AZ is one or more discrete datacenters with:
- Independent power supply (UPS + diesel generators)
- Independent networking (separate internet uplinks)
- Physical separation (miles apart, so one fire/flood doesn't take both)
- But connected with high-speed, low-latency private fiber within the region (typically low single-digit milliseconds)
Edge Locations & CloudFront PoPs
For CDN (CloudFront), AWS has 600+ edge locations worldwide, far more than regions. These are smaller cache servers placed close to end users. Content cached here gets served with ultra-low latency. Edge locations are also used by Route 53 (DNS) and AWS Shield (DDoS protection).
Other AWS Infrastructure Types
| Type | What it is | Use case |
|---|---|---|
| Local Zones | AWS compute placed in specific cities (e.g., Delhi, Chicago), extending a region | Sub-10ms latency for city users. Gaming, live video, AR/VR. |
| Wavelength Zones | AWS compute embedded in telecom 5G networks | Ultra-low latency apps delivered via 5G. Mobile gaming, real-time video. |
| AWS Outposts | AWS-managed rack in YOUR datacenter running AWS services | On-prem workloads needing AWS APIs. Compliance requiring on-prem data. |
GCP: Regions & Zones (similar concept). A GCP Zone is like an AZ (e.g., asia-south1-a). Also has Cloud CDN PoPs for edge caching. ~40 regions.
Azure: Regions & Availability Zones. Azure also has Availability Sets (older: ensures VMs are spread across fault/update domains within a single datacenter, NOT the same as AZs). Azure AZs are like AWS AZs. Also has Azure Edge Zones, similar to AWS Local Zones.
Azure Paired Regions: Most Azure regions are paired with another region in the same geography (e.g., East US and West US). Microsoft staggers updates across pairs and replicates some services automatically. AWS doesn't have an exact equivalent: you manage cross-region replication manually.
1. Compliance & Data Residency
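GDPR (EU) restricts transferring EU residents' personal data outside the EU/EEA, so in practice it is usually kept in EU regions. Other regimes include HIPAA (US healthcare) and PCI-DSS (payments). If law requires data to stay in a specific country, that region wins, period. No other factor overrides this.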
2. Latency (Proximity to Users)
Deploy closest to your users. If 80% of users are in India, ap-south-1 (Mumbai). Use CloudFront for global CDN on top. Test with cloudpingtest.com.
3. Service Availability
Not all services exist in all regions. New services launch in us-east-1 first. Check the AWS Regional Services table before designing architecture. Bedrock (AI) has limited region availability.
4. Pricing
Same EC2 instance type costs differently per region. us-east-1 tends to be cheapest. ap-southeast-1 (Singapore) is ~10-20% more. Factor this into cost modeling.
High Availability, Scalability & Disaster Recovery
Three Related But Different Concepts
These terms are often confused. Think of a hospital as an analogy:
High Availability (HA)
System is designed to be "always on" with minimal downtime. If a component fails, the system automatically recovers quickly. A hospital with a backup generator: a brief flicker, but it stays running.
Fault Tolerance (FT)
System continues operating WITH ZERO downtime or data loss even when a component fails. An airplane with 4 engines that can fly on 3: no passengers even notice. Much harder and more expensive than HA.
Disaster Recovery (DR)
Your plan for recovering from catastrophic failure (entire datacenter destroyed, full region outage). Like a hospital's evacuation plan: you hope you never need it but must have it. Usually involves a separate region.
Nines of Availability
| Availability % | Downtime per year | Downtime per month | Typical system |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Basic single-server app |
| 99.9% ("three nines") | 8.76 hours | 43.8 minutes | Simple multi-AZ setup |
| 99.99% ("four nines") | 52.6 minutes | 4.4 minutes | Production multi-AZ + failover |
| 99.999% ("five nines") | 5.26 minutes | 26 seconds | Active-active multi-region |
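The figures in the table follow from simple arithmetic: allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming a 365-day year and an average month of 730 hours (the function and constant names are illustrative):

```python
def downtime_seconds(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    return (1 - availability_pct / 100) * period_hours * 3600


HOURS_PER_YEAR = 365 * 24               # 8,760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours (average month)

for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(pct, HOURS_PER_YEAR)
    per_month = downtime_seconds(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year / 3600:.2f} h/year, {per_month / 60:.1f} min/month")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the jump from four to five nines usually requires a different architecture rather than just more care.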
RPO and RTO: The Two DR Metrics
RPO: Recovery Point Objective
How much data can you afford to lose? Measured as maximum time between last backup and the disaster. RPO = 1 hour means you're OK losing up to 1 hour of data. Lower RPO = more frequent backups = more cost.
RTO: Recovery Time Objective
How long can your system be down? Time from disaster to full recovery. RTO = 4 hours means you need to be back up within 4 hours. Lower RTO = more standby infrastructure = more cost.
Normal ------------------- DISASTER ------------------- Recovered
operation                   event                       state
                              |
[Last backup] <---- RPO ----->|<------ RTO ------> [Back online]
               (data gap)     |   (recovery time)

Example: RPO = 1 hr, RTO = 4 hr
-> You can lose at most 1 hour of data
-> You must be back online within 4 hours of the disaster
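To make the two metrics concrete, the sketch below checks one incident against its objectives. It is only an illustration: the function name and the timestamps are made up, and real DR reviews track far more than two numbers.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, disaster, back_online, rpo, rto):
    """Return (rpo_met, rto_met) for a single incident."""
    data_loss_window = disaster - last_backup   # work since the last backup is lost
    downtime = back_online - disaster           # how long the system was unavailable
    return data_loss_window <= rpo, downtime <= rto


disaster = datetime(2024, 1, 1, 12, 0)
rpo_met, rto_met = meets_objectives(
    last_backup=disaster - timedelta(minutes=45),
    disaster=disaster,
    back_online=disaster + timedelta(hours=5),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(rpo_met, rto_met)  # True False
```

Here the backup was recent enough (45 minutes of data lost against a 1-hour RPO), but recovery took 5 hours, so the 4-hour RTO was missed.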
4 AWS DR Strategies: Cost vs Speed Tradeoff
                Faster recovery (lower RTO/RPO)
          ------------------------------------------>
CHEAPEST  Backup &    Pilot      Warm       Active-Active   MOST
(cold)    Restore     Light      Standby    (Multi-site)    EXPENSIVE
             |          |          |             |
             v          v          v             v
RTO:       Hours     Minutes    Minutes       Seconds
RPO:       Hours     Minutes    Seconds       Near-zero
Cost:        $         $$         $$$           $$$$
             |          |          |             |
             |          |          |             +-- Full copy in 2nd region
             |          |          +-- Scaled-down running copy
             |          +-- Minimal services always running
             +-- Just backups, nothing running
Strategy 1: Backup & Restore
Regularly back up data and snapshots to S3. When disaster hits, spin up new infrastructure from those backups. Simplest, cheapest, but slowest.
Example: EC2 AMI snapshots every 6 hours to S3. RDS automated backups to another region. If primary region fails, launch new EC2 from AMI, restore RDS from backup. Takes hours.
Strategy 2: Pilot Light
A minimal version of your app is always running in the DR region: just the core data-syncing layer (e.g., a database replicating from primary). Application servers are OFF but AMIs/configs are ready. Scale up when needed.
Example: RDS read replica in DR region (always syncing). EC2 AMIs ready. When disaster strikes: promote the read replica to primary, launch app servers from AMIs. Takes 15-30 minutes.
Strategy 3: Warm Standby
A scaled-down but fully running copy of your system in DR region. It receives traffic in normal times or just sits ready. During disaster, scale it up to full production capacity.
Example: 2 t3.small EC2s in DR region vs 10 m5.xlarge in production. During disaster, scale DR to full size and redirect DNS.
Strategy 4: Active-Active (Multi-Site)
Full production deployment in 2+ regions, ALL serving live traffic. Route 53 routes users to nearest healthy region. If one region fails, all traffic goes to the other with no perceivable downtime.
Example: Netflix runs in multiple AWS regions. If us-east-1 has issues, traffic goes to us-west-2. Users might see a brief slowdown, but no outage.
AWS: DR strategies built around multi-region architecture. Key services: Route 53 (DNS failover), S3 CRR (cross-region replication), RDS Read Replicas, Aurora Global Database, DynamoDB Global Tables.
GCP: Same concepts. Multi-region Cloud Storage, Cloud Spanner (global DB), Cloud DNS with failover routing. GCP also has Managed Instance Groups with regional autoscaling.
Azure: Azure Site Recovery (ASR) is Azure's dedicated DR service. ASR can replicate VMs to a secondary region and automate failover. Azure Traffic Manager handles DNS-level failover (like Route 53).
Azure Site Recovery (ASR): Dedicated managed DR service that replicates VMs, manages failover plans, and handles RPO/RTO tracking. The closest AWS counterpart is AWS Elastic Disaster Recovery; DNS failover is still handled separately with Route 53.
Scalability = Can It Grow? Elasticity = Does It Grow Automatically?
Scalability means your architecture can handle increased load. Elasticity means it automatically scales up AND back down as load changes, so you're not paying for idle capacity at 3 AM.
Vertical Scaling (Scale Up)
Give the existing server more power. Upgrade from t3.medium (2 vCPU, 4GB) to m5.4xlarge (16 vCPU, 64GB). Simple but has limits (biggest instance size), requires downtime, and creates a single point of failure.
VERTICAL (Scale Up)              HORIZONTAL (Scale Out)
Before: [Server 2GB RAM]         Before: [Server] [Server]
              |                               |
              v                               v
After:  [Server 16GB RAM]        After:  [Server] [Server] [Server] [Server]
                                                      ^
One server, bigger               Load Balancer distributes traffic
Single point of failure          No SPOF, much more resilient
Horizontal Scaling (Scale Out)
Add more instances of the same server: 1 server becomes 5 servers behind a load balancer. No single point of failure. Nearly unlimited scale. Requires your app to be stateless (session data stored in Redis/DB, not locally).
Auto Scaling
AWS Auto Scaling automatically adjusts the number of instances based on rules you define. You define a minimum (always have at least 2), maximum (never exceed 20), and desired (target 4 normally).
Scaling can be triggered by: CPU usage > 70%, request count, memory, schedule, or custom CloudWatch metrics.
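The interaction between a metric target and the min/max bounds can be sketched in a few lines. This is a simplified proportional rule, not AWS's actual Auto Scaling algorithm; the function name and default values (70% CPU target, 2 to 20 instances) are taken from the examples above purely for illustration.

```python
import math


def desired_capacity(current: int, cpu_pct: float, target_pct: float = 70.0,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Scale the fleet proportionally to load, then clamp to [minimum, maximum]."""
    wanted = math.ceil(current * cpu_pct / target_pct)
    return max(minimum, min(maximum, wanted))


print(desired_capacity(4, 90))    # 6  (scale out: CPU above target)
print(desired_capacity(4, 20))    # 2  (scale in, but never below the minimum)
print(desired_capacity(15, 140))  # 20 (capped at the maximum)
```

The clamp is the important part: the minimum protects availability when traffic is low, and the maximum protects your bill when a metric misbehaves.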
Cloud Networking Fundamentals
What is a VPC?
A VPC (Virtual Private Cloud) is a logically isolated private network in the cloud. Think of it as your own private section of AWS that no one else can access. In a VPC you create yourself, nothing inside can reach the internet by default, and the internet can't reach your VPC: you must explicitly configure that. (The default VPC that AWS creates in each region is the exception; it ships with an internet gateway and public subnets.)
Analogy: AWS is a massive apartment building. Your VPC is your apartment: you can furnish it however you like inside, but outsiders can't get in unless you let them.
+----------------------- AWS Region (ap-south-1) ------------------------+
|                                                                        |
|  +----------------------- VPC (10.0.0.0/16) ------------------------+  |
|  |                                                                  |  |
|  | +--- AZ-1a -------------------+  +--- AZ-1b -------------------+ |  |
|  | |                             |  |                             | |  |
|  | | [Public Subnet 10.0.1.0/24] |  | [Public Subnet 10.0.2.0/24] | |  |
|  | |  +-----------+              |  |  +-----------+              | |  |
|  | |  | EC2 (web) |              |  |  | EC2 (web) |              | |  |
|  | |  | Public IP |   NAT GW     |  |  | Public IP |              | |  |
|  | |  +-----------+     ^        |  |  +-----------+              | |  |
|  | |                    |        |  |                             | |  |
|  | | [Private Sub 10.0.3.0/24]   |  | [Private Sub 10.0.4.0/24]   | |  |
|  | |  +-----------+     |        |  |  +-----------+              | |  |
|  | |  | EC2 (app) |-----+        |  |  | RDS (DB)  |              | |  |
|  | |  | No pub IP | (outbound)   |  |  | No pub IP |              | |  |
|  | |  +-----------+              |  |  +-----------+              | |  |
|  | +-----------------------------+  +-----------------------------+ |  |
|  |                                |                                 |  |
|  |                     Internet Gateway (IGW)                       |  |
|  +--------------------------------+---------------------------------+  |
|                                   |                                    |
+-----------------------------------+------------------------------------+
                                    |
                                INTERNET
Key Networking Components
CIDR Block
When you create a VPC, you assign it a CIDR block like 10.0.0.0/16. This defines the IP range for your entire VPC (65,536 IPs). You then carve subnets from this range.
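Python's standard `ipaddress` module is a handy way to check CIDR arithmetic before you commit to a layout. The sketch below uses the 10.0.0.0/16 range from the text and carves it into /24 subnets; the /24 size is just one common choice.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)          # 65536

# Carve the VPC range into /24 subnets (256 addresses each).
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))               # 256
print(subnets[1])                 # 10.0.1.0/24

# Membership test: which subnet does an address fall into?
print(ipaddress.ip_address("10.0.3.25") in subnets[3])  # True
```

Note that AWS reserves 5 addresses in every subnet, so a /24 gives you 251 usable IPs rather than 256.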
Subnets
Subnets divide the VPC into smaller networks, and they're tied to a specific AZ. A public subnet has a route to the internet via an Internet Gateway. A private subnet has no direct internet route, so instances here can't be reached from the internet.
Internet Gateway (IGW)
A horizontally-scaled, redundant, HA component attached to your VPC that enables communication between your VPC and the internet. Free. Without an IGW, your VPC has no internet connectivity at all. Only one IGW per VPC.
NAT Gateway
Allows instances in private subnets to initiate outbound connections to the internet (for software updates, API calls, etc.) but prevents the internet from initiating connections TO those instances. Deployed in a public subnet, charges per hour + data processed.
Route Tables
Every subnet has a route table that defines where traffic goes. A public subnet's route table has an entry: 0.0.0.0/0 -> igw-xxxxx (default route to internet via IGW). A private subnet's route table: 0.0.0.0/0 -> nat-xxxxx (outbound only via NAT).
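Route selection uses longest-prefix match: the most specific route that contains the destination wins. The sketch below models that with plain dictionaries. The `local` entry stands for the route AWS adds for traffic inside the VPC's own CIDR; the `igw-xxxxx` and `nat-xxxxx` targets are the placeholders from the text, and `next_hop` is an illustrative helper, not an AWS API.

```python
import ipaddress


def next_hop(route_table: dict, destination: str) -> str:
    """Pick the target of the most specific route containing the destination."""
    dest = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matching:
        raise LookupError(f"no route to {destination}")
    network, target = max(matching, key=lambda pair: pair[0].prefixlen)
    return target


public_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxxxx"}
private_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-xxxxx"}

print(next_hop(public_routes, "10.0.3.25"))   # local
print(next_hop(public_routes, "8.8.8.8"))     # igw-xxxxx
print(next_hop(private_routes, "8.8.8.8"))    # nat-xxxxx
```

The only difference between the two tables is the default route's target, which is exactly what makes one subnet "public" and the other "private".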
Security Groups
Virtual firewalls at the instance (EC2) level. Stateful: if you allow inbound on port 80, the response automatically comes back out without needing an outbound rule. Default: deny all inbound, allow all outbound.
Network ACLs (NACLs)
Firewall at the subnet level. Stateless: you must define both inbound AND outbound rules explicitly. Rules are evaluated in number order (lowest first), and the first rule that matches, ALLOW or DENY, is applied and stops evaluation. Less commonly tweaked than Security Groups.
| Feature | Security Group | NACL |
|---|---|---|
| Applied at | Instance (ENI) level | Subnet level |
| State | Stateful (response auto-allowed) | Stateless (must allow both directions) |
| Rules | Allow only | Allow and Deny |
| Rule evaluation | All rules evaluated, most permissive wins | Rules in number order, first match wins |
| Default behavior | Deny all inbound, allow all outbound | Default NACL allows all inbound and outbound; a custom NACL denies everything until you add rules |
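The "Rule evaluation" row is the difference that most often trips people up, so here is a toy model of both behaviours. It is heavily simplified: it looks only at the destination port and ignores protocol, source CIDR, and statefulness, and the rule tuples are an invented format for illustration.

```python
def nacl_allows(rules, port):
    """NACL: evaluate rules in number order; the first match decides."""
    for number, (low, high), action in sorted(rules):
        if low <= port <= high:
            return action == "allow"
    return False  # nothing matched: the implicit final rule denies


def sg_allows(allowed_ranges, port):
    """Security group: allow-only rules; any matching rule permits traffic."""
    return any(low <= port <= high for low, high in allowed_ranges)


# Rule 100 denies SSH before the broad allow in rule 200 is reached.
nacl = [(200, (0, 65535), "allow"), (100, (22, 22), "deny")]
print(nacl_allows(nacl, 22))    # False
print(nacl_allows(nacl, 443))   # True

print(sg_allows([(80, 80), (443, 443)], 443))  # True
print(sg_allows([(80, 80), (443, 443)], 22))   # False
```

Swapping the rule numbers in the NACL (allow at 100, deny at 200) would let SSH through, because the deny rule would never be reached.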
GCP: Also called VPC. Key difference: GCP VPCs are global by default (span all regions), while AWS VPCs are regional. In GCP, one VPC can have subnets in multiple regions; each subnet is regional. Security Groups map roughly to Firewall Rules (defined on the VPC network, not attached per instance). No direct NACL equivalent.
Azure: Called Virtual Network (VNet). Same concept: private IP space, subnets, gateways. Azure has Network Security Groups (NSGs) which work like AWS Security Groups but can be applied to subnets OR individual NICs. Azure also has Application Security Groups (ASGs) to group VMs logically.
What is a Load Balancer?
A load balancer distributes incoming traffic across multiple backend servers. It's the entry point users hit β they don't talk to individual servers directly. This enables high availability (if one server dies, traffic goes elsewhere), horizontal scaling, and no single point of failure.
Layer 4 vs Layer 7 Load Balancing
L4: Transport Layer (TCP/UDP)
Routes traffic based on IP address and port number. Doesn't look inside the packet. Fast, low-overhead. Good for: non-HTTP traffic, TCP-based apps, ultra-low latency, gaming, VoIP, financial trading.
AWS: NLB (Network Load Balancer)
L7: Application Layer (HTTP/HTTPS)
Looks inside the HTTP request β path, hostname, headers, cookies. Can route /api/* to one group, /images/* to another. Smarter but slightly more overhead. Good for: web apps, microservices, content-based routing.
AWS: ALB (Application Load Balancer)
Load Balancing Algorithms
| Algorithm | How it works | Best for |
|---|---|---|
| Round Robin | Send each request to next server in sequence: A, B, C, A, B, C... | Similar servers, similar request sizes |
| Least Connections | Send to server with fewest active connections | Variable request processing time |
| IP Hash / Sticky Sessions | Same client IP always goes to same server | Apps that need session affinity (stateful) |
| Weighted | Some servers get more traffic by weight (70/30 split) | Gradual deployments (blue/green, canary) |
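Three of these algorithms fit in a few lines each. The sketch below is illustrative only: the server names and connection counts are made up, and real load balancers add health state, weights, and connection draining on top.

```python
import hashlib
import itertools


def round_robin(servers):
    """Endless iterator cycling through servers in order."""
    return itertools.cycle(servers)


def least_connections(active):
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a server deterministically, so it always lands on the same one."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["A", "B", "C"]
rr = round_robin(servers)
print([next(rr) for _ in range(5)])                   # ['A', 'B', 'C', 'A', 'B']
print(least_connections({"A": 12, "B": 3, "C": 7}))   # B
print(ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers))  # True
```

Note that simple IP hashing reshuffles most clients when the server list changes, which is one reason cookie-based stickiness is often preferred for session affinity.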
Health Checks
Load balancers continuously ping backend servers (e.g., HTTP GET /health every 30s). If a server fails health check 3 times, the LB removes it from rotation. When it recovers and passes, it's added back. This is how HA works in practice.
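The remove-after-three-failures behaviour is a small state machine per backend. A minimal sketch, assuming a threshold of 3 consecutive failures and a single success to rejoin (real load balancers usually also require several consecutive successes; the class name is illustrative):

```python
class HealthTracker:
    """Tracks one backend's health-check results and its rotation status."""

    def __init__(self, unhealthy_threshold: int = 3):
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.in_rotation = True

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return whether the backend is in rotation."""
        if check_passed:
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


tracker = HealthTracker()
results = [tracker.record(ok) for ok in [False, False, True, False, False, False, True]]
print(results)  # [True, True, True, True, True, False, True]
```

Two failures followed by a success reset the counter, so a single slow response does not eject a healthy server.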
The Problem CDNs Solve
Your origin server is in us-east-1. A user in Mumbai requests your 5MB homepage image. The packet travels ~14,000 km. High latency. With a CDN, that image is cached in an edge server in Mumbai, and the user gets it from there. Fast.
WITHOUT CDN:                          WITH CDN (cache HIT):
User (Mumbai) ------------------>     User (Mumbai) --> Edge (Mumbai) --> User
   14,000 km to us-east-1                               [cached!]
Response: ~300ms latency              Response: ~5ms latency

FIRST REQUEST (cache MISS):
User (Mumbai) --> Edge (Mumbai) --> Origin (us-east-1) --> Edge caches it
Response: ~300ms (one time)

SUBSEQUENT REQUESTS (cache HIT, within TTL):
User (Mumbai) --> Edge (Mumbai) --> Serve from cache: ~5ms
Key CDN Concepts
- Origin: The source of truth, i.e. your actual server (S3 bucket, EC2, ALB).
- Edge Location: CDN's cache servers distributed globally.
- TTL (Time To Live): How long content is cached before being re-fetched from origin. Too long = stale content. Too short = defeats the purpose.
- Cache Invalidation: Manually expire cached content when you deploy new files. In CloudFront, you create an invalidation request.
- Origin Shield: Extra caching layer between edge locations and origin, reducing origin load. One central cache instead of 100s of edges hitting origin.
What to cache: Static assets (images, CSS, JS, videos). What NOT to cache: User-specific pages, API responses with sensitive data, frequently changing data (unless you manage TTL carefully).
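The HIT/MISS, TTL, and invalidation concepts above can be captured in a tiny in-memory model of one edge location. This is a teaching sketch, not how CloudFront is implemented; `EdgeCache` and `origin` are invented names, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class EdgeCache:
    """One edge location: serves from cache until the TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds: float):
        self.fetch_from_origin = fetch_from_origin
        self.ttl_seconds = ttl_seconds
        self.store = {}  # path -> (content, expires_at)

    def get(self, path: str, now: float):
        entry = self.store.get(path)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"
        content = self.fetch_from_origin(path)           # slow trip to the origin
        self.store[path] = (content, now + self.ttl_seconds)
        return content, "MISS"

    def invalidate(self, path: str) -> None:
        """Drop a cached object, e.g. right after deploying a new version."""
        self.store.pop(path, None)


def origin(path):
    return f"content of {path}"


edge = EdgeCache(origin, ttl_seconds=60)
print(edge.get("/logo.png", now=0)[1])    # MISS (first request)
print(edge.get("/logo.png", now=30)[1])   # HIT  (within TTL)
print(edge.get("/logo.png", now=61)[1])   # MISS (TTL expired)
edge.invalidate("/logo.png")
print(edge.get("/logo.png", now=62)[1])   # MISS (invalidated)
```

The TTL trade-off is visible directly: a longer `ttl_seconds` means fewer origin fetches but a longer window in which stale content can be served.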
AWS: CloudFront is AWS's CDN. 600+ PoPs. Integrates with S3, EC2, ALB. Supports Lambda@Edge for dynamic logic at the edge.
GCP: Cloud CDN works with Cloud Load Balancing. Also Media CDN for high-throughput streaming.
Azure: Azure Front Door combines CDN, WAF, and global load balancing in one product. More feature-rich than a pure CDN. Also legacy Azure CDN (being retired in favour of Front Door).
Azure Front Door's global load balancing (routing users to the closest healthy region based on latency, not just caching) is more tightly integrated than AWS CloudFront + Route 53 combination.
Security Concepts, IaC & Modern Patterns
Identity & Access Management (IAM): Core Concepts
IAM answers: Who are you? What can you do? To what resources?
- Authentication: Proving who you are (password, MFA, API key)
- Authorization: What you're allowed to do once authenticated (policies)
- Principal: An entity that can make requests (user, role, service)
- Principle of Least Privilege: Grant only the permissions needed for the specific task. Not "give admin and let them figure it out."
Encryption
Encryption at Rest
Data encrypted while stored. If someone steals a hard drive, they get garbage. AWS does this for EBS, S3, RDS with keys managed by KMS. In S3, you can enable SSE-S3 (AWS manages key) or SSE-KMS (you manage key via KMS).
Encryption in Transit
Data encrypted while moving over a network. Uses TLS (formerly SSL). HTTPS is HTTP + TLS. Your AWS API calls are all HTTPS. Between services: use TLS wherever possible. Between on-prem and AWS: VPN or Direct Connect with MACsec.
MFA: Multi-Factor Authentication
Something you know (password) + something you have (phone/hardware key). Even if your AWS root password is stolen, attacker can't login without your MFA device. Always enable MFA on root account and all IAM users with console access.
Zero Trust Model
Traditional model: "Trust everything inside the network perimeter." Zero Trust: "Trust nothing, verify everything." Even requests from inside the VPC are not automatically trusted; every request is authenticated and authorized. Implemented via mutual TLS (mTLS), service meshes (Istio), and strict IAM policies.
What is IaC and Why Does It Matter?
IaC means defining your cloud infrastructure in code files (YAML, JSON, HCL) instead of clicking through the console. You check these files into Git, review them in PRs, run them through CI/CD. Infrastructure becomes reproducible, auditable, and versionable.
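Declarative
"I want 3 EC2 instances with these properties." The tool figures out HOW to make that happen. CloudFormation, Terraform.
Imperative
"First create VPC, then subnet, then EC2..." You specify exact steps. Scripts with AWS CLI/SDK. AWS CDK and Pulumi sit in between: you write code in a general-purpose language, but that code produces a declarative description of the desired state, which the tool then applies.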
Key IaC Tools
| Tool | Type | Language | Multi-cloud? | Best for |
|---|---|---|---|---|
| AWS CloudFormation | Declarative | YAML/JSON | AWS only | AWS-native teams, no extra setup needed |
| Terraform | Declarative | HCL | Yes (all clouds) | Multi-cloud, most popular in industry |
| AWS CDK | Imperative/Declarative | Python/TS/Java | AWS only | Devs who prefer real languages over YAML |
| Pulumi | Imperative | Python/TS/Go/C# | Yes | Teams wanting full programming language power |
Terraform vs CloudFormation: Terraform uses HCL, e.g. resource "aws_s3_bucket" "my_bucket" { bucket = "my-app-bucket" }. CloudFormation uses YAML or JSON with an AWSTemplateFormatVersion header and more verbose syntax. Terraform is often considered more readable and is multi-cloud, but requires the Terraform binary. CloudFormation is AWS-native and has deeper service integration (like StackSets for multi-account deployments).
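What makes a declarative tool "figure out HOW" is a diff between the desired state in your files and the actual state in the cloud. The sketch below shows that idea with plain dictionaries. It is a conceptual illustration only, not how Terraform or CloudFormation actually compute their plans, and the resource names are invented.

```python
def plan(desired: dict, actual: dict) -> list:
    """Compare desired and actual state; return the actions needed to converge."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


desired = {
    "web-1": {"type": "t3.micro"},
    "web-2": {"type": "t3.micro"},
    "web-3": {"type": "t3.micro"},
}
actual = {
    "web-1": {"type": "t3.micro"},
    "web-2": {"type": "t3.small"},   # drifted from the definition
    "old-db": {"type": "m5.large"},  # no longer in the code
}

print(plan(desired, actual))
# [('update', 'web-2'), ('create', 'web-3'), ('delete', 'old-db')]
```

Running the same plan again after applying it yields an empty list, which is the idempotency property that makes declarative IaC safe to re-run.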
Serverless
You write functions, the cloud runs them. No servers to provision, patch, or manage. You pay only when code runs (per invocation + per ms of execution). Serverless does not mean no servers: there ARE servers, you just don't manage them.
Key characteristics: Event-driven (triggered by HTTP, S3 upload, queue message, schedule). Scales to zero (no traffic = no cost). Scales to millions (auto-scale). Stateless (function runs fresh each time).
Containers
Containers package your app + all dependencies (libraries, config, runtime) into a portable unit. Unlike VMs, containers share the host OS kernel, which makes them much more lightweight. Docker is the de-facto container standard.
Virtual Machines                  Containers
+--------+--------+--------+      +--------+--------+--------+
| App A  | App B  | App C  |      | App A  | App B  | App C  |
| Libs   | Libs   | Libs   |      | Libs   | Libs   | Libs   |
| OS     | OS     | OS     |      +--------+--------+--------+
+--------+--------+--------+      |    Container Runtime     |
|        Hypervisor        |      |   (Docker/containerd)    |
+--------------------------+      +--------------------------+
|         Host OS          |      |      Host OS (ONE)       |
+--------------------------+      +--------------------------+
|         Hardware         |      |         Hardware         |
+--------------------------+      +--------------------------+
Each VM: 1-2GB RAM overhead       Each Container: ~50MB overhead
Slow to start (minutes)           Starts in seconds
Container Orchestration
When you run 100s of containers, you need something to manage them: scheduling, health checks, service discovery, rolling updates, secret management. Kubernetes is the industry standard.
| Concept | What it means |
|---|---|
| Pod | Smallest deployable unit in Kubernetes: 1+ containers sharing network/storage |
| Deployment | Desired state: "run 3 replicas of this pod" |
| Service | Stable network endpoint for pods (pods restart with new IPs; the Service address stays the same) |
| Ingress | HTTP routing rules (like an L7 LB/reverse proxy for K8s) |
| Namespace | Logical isolation within a cluster (like separate teams/environments) |
AWS: Serverless: Lambda | Containers: ECS, EKS (managed K8s), Fargate (serverless containers) | Container Registry: ECR
GCP: Serverless: Cloud Functions, Cloud Run (containers as serverless!) | Containers: GKE (Google Kubernetes Engine) | Registry: Artifact Registry
Azure: Serverless: Azure Functions, Container Apps | Containers: AKS (Azure Kubernetes Service), Container Instances | Registry: ACR (Azure Container Registry)
Cloud Run: Deploy any Docker container and it runs serverless (scale to zero, pay per request). More flexible than Lambda (any language, any binary). The closest AWS equivalent is Lambda with container images, but those images must implement the Lambda runtime API, whereas Cloud Run accepts any container that serves HTTP on a port.
What is CI/CD?
Continuous Integration (CI): Every code commit is automatically built, tested, and validated. You catch bugs immediately, not 3 months later during a manual deployment.
Continuous Delivery (CD): After CI passes, the artifact (container, zip, AMI) is automatically deployed to staging. Deployment to production requires manual approval.
Continuous Deployment: Like CD but with no manual approval; changes go straight to production automatically after tests pass. Used by companies doing 100s of deploys per day.
graph LR
A[Developer Push] --> B[Source Repo]
B --> C[CI: Build & Test]
C --> D{Tests Pass?}
D -- No --> E[Notify Dev, Stop]
D -- Yes --> F[Create Artifact]
F --> G[Deploy to Staging]
G --> H[Integration Tests]
H --> I{Approved?}
I -- Manual Approve --> J[Deploy to Prod]
I -- Auto Deploy --> J
J --> K[Monitor & Alert]
Deployment Strategies
| Strategy | How it works | Downtime? | Rollback? | Best for |
|---|---|---|---|---|
| In-Place (Rolling) | Update existing servers one by one | Brief per server | Slow | Simple apps, non-critical |
| Blue/Green | Two identical envs. Swap DNS/LB to new version. Old stays as backup. | None | Instant (flip LB back) | Critical apps needing instant rollback |
| Canary | Send 5% of traffic to new version. Gradually increase if healthy. | None | Shift traffic back | Risk-sensitive features, A/B testing |
| Feature Flags | Deploy code disabled. Enable via config for % of users. | None | Toggle flag off | Gradual feature rollouts, experimentation |
AWS: CodeCommit (Git repo; AWS stopped onboarding new customers in 2024) | CodeBuild (CI: build & test) | CodeDeploy (CD: deploy to EC2/Lambda/ECS) | CodePipeline (orchestrates all stages). Also integrates with GitHub, GitLab, Jenkins.
GCP: Cloud Source Repositories (Git; closed to new customers in 2024) | Cloud Build (CI/CD) | Artifact Registry (store artifacts) | Cloud Deploy (managed delivery pipelines to GKE, Cloud Run).
Azure: Azure DevOps (all-in-one: repos, pipelines, boards, test plans, artifacts) | Azure Pipelines (CI/CD, free for open source). Azure DevOps is more mature/unified than the AWS CodePipeline family.
Azure DevOps Boards: Kanban/Scrum project tracking built into the same product as CI/CD. AWS doesn't have a native project management tool; you would need Jira, Linear, etc.
Compute
What is EC2?
EC2 is AWS's virtual machine service (IaaS). You rent a virtual server that runs on AWS hardware, choose the OS, configure storage and networking. You have full root/admin access. It's the foundation of most AWS architectures.
Key EC2 Components
Instance Types
EC2 instances come in families optimized for different workloads. The naming convention is [Family][Generation].[Size], e.g., m5.xlarge = general purpose, 5th gen, xlarge.
| Family | Optimized for | Examples | Use case |
|---|---|---|---|
| t3, t4g | Burstable (credits) | t3.micro, t4g.small | Dev/test, low-traffic sites |
| m5, m6i, m7i | General purpose | m5.xlarge, m6i.2xlarge | Web servers, app servers, small DBs |
| c5, c6i, c7g | Compute-optimized | c5.2xlarge, c7g.xlarge | Batch processing, gaming, video encoding |
| r5, r6i, r7g | Memory-optimized | r5.4xlarge, r6i.large | In-memory DBs (Redis), big data, SAP HANA |
| i3, i4i | Storage-optimized | i3.xlarge, i4i.2xlarge | High IOPS workloads, Cassandra, Elasticsearch |
| p4, p5, g4, g5 | GPU-accelerated | p4d.24xlarge, g5.xlarge | ML training, inference, 3D rendering, gaming |
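The naming convention can be made concrete with a small parser. This is a sketch that handles the common names shown in the table; the `attributes` field (extra letters after the generation, such as the `g` in `t4g`) and the regular expression are assumptions of this example, and unusual names may not match.

```python
import re
from typing import NamedTuple

class InstanceType(NamedTuple):
    family: str       # e.g. "m" (general purpose)
    generation: int   # e.g. 5
    attributes: str   # optional letters after the generation, e.g. "g", "i", "d"
    size: str         # e.g. "xlarge"

# family letters, generation digits, optional attribute letters, then ".size"
_PATTERN = re.compile(r"^([a-z]+?)(\d+)([a-z0-9-]*)\.([a-z0-9-]+)$")

def parse_instance_type(name: str) -> InstanceType:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)

for example in ["m5.xlarge", "t4g.small", "p4d.24xlarge"]:
    print(example, "->", parse_instance_type(example))
```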
AMI - Amazon Machine Image
An AMI is a pre-configured template (OS + optional installed software) used to launch EC2 instances. Like a VM snapshot you can clone. AWS provides Amazon Linux, Ubuntu, RHEL, Windows images. You can create custom AMIs (e.g., "Amazon Linux + Nginx + your app pre-installed") for faster launches. This is called a "Golden AMI".
User Data
A bash script that runs once on first boot. Used to install packages, download code, configure services without baking them into an AMI. Passed at launch time:
#!/bin/bash
yum update -y
yum install -y nginx
systemctl start nginx
systemctl enable nginx
Key Pairs
RSA key pair for SSH access. AWS stores the public key; you keep the private key (.pem file). Connect with: ssh -i my-key.pem ec2-user@<public-ip>. If you lose the private key, you can't SSH in anymore, because AWS has no backup. Best practice: use SSM Session Manager instead of SSH (no key pair, no open port 22 needed, auditable).
Instance Metadata Service (IMDS)
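From inside an EC2 instance, you can query http://169.254.169.254/latest/meta-data/ to get instance info: instance ID, public IP, IAM role credentials, AZ, etc. Critical for automation scripts running on EC2. IMDSv2 (session-token based, more secure) is the recommended mode and the default on newer AMIs such as Amazon Linux 2023; configure instances to require it so that IMDSv1 requests are rejected.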
# Get instance ID from inside the instance
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id
# Get IAM role temporary credentials
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole
EC2 Pricing Models
| Model | How it works | Discount vs On-Demand | Best for |
|---|---|---|---|
| On-Demand | Pay by the hour/second. No commitment. | None (baseline) | Unpredictable workloads, short-term dev/test |
| Reserved Instances (RI) | 1-year or 3-year commitment to a specific instance type/region. | Up to 72% off | Steady-state production workloads |
| Savings Plans | Commit to $X/hr of usage for 1 or 3 years. Compute Savings Plans apply across instance families and regions. | Up to 66% off (Compute); up to 72% off (EC2 Instance Savings Plans) | More flexible than RIs for a comparable discount |
| Spot Instances | Use spare EC2 capacity at the current Spot price (no bidding needed). AWS can reclaim the instance with a 2-minute notice. | Up to 90% off | Fault-tolerant, batch jobs, big data, CI runners |
| Dedicated Hosts | Physical server dedicated to you. Useful for per-socket/per-core licenses. | More expensive | Compliance, BYOL software licenses |
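To see what the discount column means in practice, the sketch below computes a best-case monthly bill for one always-on instance under each model. The $0.10/hour on-demand rate is a made-up figure for illustration, and the discounts are the "up to" maximums from the table, so real savings are usually smaller.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months

# Maximum discounts quoted in the table above (best case, not typical).
MAX_DISCOUNT = {
    "on_demand": 0.00,
    "reserved": 0.72,
    "savings_plan": 0.66,
    "spot": 0.90,
}

def monthly_cost(on_demand_hourly: float, model: str) -> float:
    """Best-case monthly cost for one instance running all month."""
    discount = MAX_DISCOUNT[model]
    return round(on_demand_hourly * (1 - discount) * HOURS_PER_MONTH, 2)

# Hypothetical instance priced at $0.10/hour on-demand.
for model in MAX_DISCOUNT:
    print(f"{model:>12}: ${monthly_cost(0.10, model):.2f}/month")
```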
Placement Groups
Controls how AWS places EC2 instances on physical hardware:
- Cluster: Pack instances close together in same AZ. Ultra-low latency network (~25Gbps). Use for: HPC, big data jobs needing fast node-to-node comms. Risk: AZ failure takes all down.
- Spread: Instances on different hardware. Reduces correlated hardware failure. Max 7 instances per AZ per group. Use for: small critical clusters of distinct VMs.
- Partition: Groups of instances in different partitions (separate racks). Good for large distributed systems (Kafka, Hadoop, Cassandra) where partial failures are tolerable.
Elastic IP (EIP)
A static public IPv4 address you can allocate and associate with an EC2 instance. When an EC2 instance stops and starts, its auto-assigned public IP changes; an EIP stays fixed. Cost: since February 2024 AWS charges hourly for every public IPv4 address, including EIPs, whether or not it is attached to a running instance, so release EIPs you don't need. Best practice: use a load balancer with a stable DNS name instead of EIPs for production.
GCP: Compute Engine (GCE). Similar instance types. GCP offers Spot VMs (like AWS Spot) and the older Preemptible VMs they supersede. GCP's equivalent of AMIs are Custom Images. GCP has Committed Use Discounts (CUDs) instead of RIs/Savings Plans.
Azure: Azure Virtual Machines. Pricing: Pay-as-you-go (On-Demand), Reserved VM Instances (1 or 3 yr), Spot VMs (like AWS Spot). Azure's equivalent of AMIs are Azure VM Images (stored in Compute Gallery).
What is Lambda?
AWS Lambda lets you run code without provisioning any servers. You upload a function (zip or container), define what triggers it, and Lambda runs it on-demand. You're billed per invocation + per GB-second of memory used. No code running = zero cost.
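The billing model (requests plus GB-seconds) is easiest to understand as arithmetic. The sketch below is a rough estimator that ignores any free tier; the two prices passed in are assumed example values, so check the current Lambda pricing page for your region and architecture before relying on them.

```python
def lambda_monthly_cost(
    invocations: int,
    avg_duration_ms: float,
    memory_mb: int,
    price_per_million_requests: float,
    price_per_gb_second: float,
) -> dict:
    """Estimate Lambda cost as request charges plus GB-seconds of compute."""
    # GB-seconds = invocations x duration in seconds x memory in GB
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * price_per_million_requests
    compute_cost = gb_seconds * price_per_gb_second
    return {
        "gb_seconds": gb_seconds,
        "request_cost": round(request_cost, 2),
        "compute_cost": round(compute_cost, 2),
        "total": round(request_cost + compute_cost, 2),
    }

estimate = lambda_monthly_cost(
    invocations=3_000_000,
    avg_duration_ms=200,
    memory_mb=512,
    price_per_million_requests=0.20,   # assumed example price
    price_per_gb_second=0.0000166667,  # assumed example price
)
print(estimate)
```

Note how memory size and duration multiply: halving either one halves the compute portion of the bill.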
Key Lambda Concepts
Triggers (Event Sources)
Lambda is event-driven. Something must trigger it:
- HTTP/API: API Gateway → Lambda. REST or WebSocket APIs.
- S3 Events: File uploaded to S3 → Lambda. Common for image processing, ETL.
- Scheduled: EventBridge cron rule → Lambda. Like cron jobs, but serverless.
- Queue/Stream: SQS message → Lambda. Kinesis stream → Lambda. Event processing.
- DynamoDB Stream: Record change in DynamoDB → Lambda. Triggers on insert/update/delete.
- SNS / EventBridge: Pub/sub messages or event bus events → Lambda. Decoupled architectures.
Execution Environment
Lambda runs your code inside a lightweight Firecracker microVM. Your function gets:
- Memory: 128MB to 10GB. CPU scales proportionally with memory.
- Timeout: Max 15 minutes per invocation.
- /tmp storage: 512MB to 10GB ephemeral disk (may survive between warm invocations, but is lost when the execution environment is recycled).
- Ephemeral by design: Don't rely on state persisting between invocations.
Cold Start
When Lambda hasn't run recently, AWS needs to initialize the execution environment (download code, start runtime, run initialization code). This adds 200ms-2s latency. Subsequent "warm" invocations reuse the same container (~1ms overhead).
# Lambda handler (Python example)
import boto3

# Code HERE runs once per COLD start (execution environment init)
s3_client = boto3.client('s3')  # Initialize once, reused on warm invocations

def handler(event, context):
    # Code HERE runs on EVERY invocation (warm or cold)
    # Assumes a custom event shaped like {"bucket": "...", "key": "..."}
    bucket = event['bucket']
    key = event['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return {'statusCode': 200, 'body': response['Body'].read().decode()}
Layers
Lambda Layers are zip archives containing dependencies (libraries) that can be shared across multiple functions. Instead of bundling numpy in every ML Lambda, put it in a layer and reference it. Max 5 layers per function. Reduces deployment package size and enables sharing.
Concurrency
Lambda scales horizontally automatically. If 1000 events arrive simultaneously, Lambda spins up 1000 instances of your function. Default account limit: 1000 concurrent executions (soft limit, can increase). You can set Reserved Concurrency (cap a function to protect others) or Provisioned Concurrency (keep containers warm, eliminate cold starts, extra cost).
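A useful rule of thumb for sizing against the concurrency limit is: concurrent executions ≈ requests per second × average duration in seconds. The sketch below applies that estimate; the traffic numbers are hypothetical and the default limit of 1000 is the soft limit mentioned above.

```python
import math

def required_concurrency(requests_per_second: int, avg_duration_ms: int) -> int:
    """Estimate concurrent executions: request rate x average duration (in seconds)."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)

def would_throttle(requests_per_second: int, avg_duration_ms: int,
                   concurrency_limit: int = 1000) -> bool:
    """True if the estimated concurrency exceeds the account or function limit."""
    return required_concurrency(requests_per_second, avg_duration_ms) > concurrency_limit

# Hypothetical workloads
print(required_concurrency(500, 200))   # 500 req/s at 200 ms each
print(would_throttle(4000, 500))        # 4000 req/s at 500 ms each
```

The practical takeaway is that making a function faster lowers the concurrency it needs, which matters as much as raising the limit.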
IAM Execution Role
Each Lambda function has an execution role: an IAM role Lambda assumes to make API calls. If your function needs to read from S3, the execution role must have s3:GetObject permission. Never put AWS credentials inside Lambda code; use the execution role.
GCP: Cloud Functions (event-driven, like Lambda) and Cloud Run (containerized serverless: more flexible, any language, scale to zero). Cloud Run is often preferred over Cloud Functions for complex apps.
Azure: Azure Functions. Same concept. Supports Consumption Plan (pay-per-use, cold starts), Premium Plan (pre-warmed, VNet integration, no cold starts), and Dedicated Plan (runs on App Service Plan). Durable Functions is an Azure-specific extension for stateful workflows written in code; the closest AWS counterpart is pairing Lambda with Step Functions.
AWS Container Ecosystem Overview
Need containers on AWS?
         │
         ▼
Kubernetes or AWS-native orchestration?
┌──────────────────┬──────────────────┐
│ AWS-native       │ Kubernetes       │
│ (ECS)            │ (EKS)            │
└────────┬─────────┴────────┬─────────┘
         │                  │
   Where to run containers? │
┌────────┴──────────────────┴─────────┐
│                                     │
EC2 (you manage nodes)       Fargate (serverless nodes)
More control, cheaper        No node management, slightly pricier
for stable workloads         great for variable/small workloads
ECS - Elastic Container Service
AWS's own container orchestration service. Not Kubernetes, but AWS's proprietary system. Simpler to operate than EKS for pure AWS workloads.
- Task Definition: JSON/YAML file defining your container(s): image URI, CPU/memory, port mappings, env vars, logging, IAM role. Think of it like a Pod spec in Kubernetes.
- Task: A running instance of a Task Definition. Like a Pod.
- Service: Ensures a desired number of tasks are running. Handles health checks, restarts, load balancer integration, rolling deploys. Like a Deployment + Service in K8s.
- Cluster: Logical group of resources (EC2 instances or Fargate capacity) where tasks run.
# Example Task Definition (simplified JSON; a real Fargate task also needs an executionRoleArn to pull from ECR and write logs)
{
  "family": "my-web-app",
  "networkMode": "awsvpc",
  "containerDefinitions": [{
    "name": "web",
    "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/my-app:v1.2",
    "cpu": 256,
    "memory": 512,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "environment": [{"name": "ENV", "value": "production"}],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-web-app",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "web"
      }
    }
  }],
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512"
}
EKS - Elastic Kubernetes Service
Managed Kubernetes. AWS runs and manages the Kubernetes control plane (API server, etcd). You manage the worker nodes (EC2 node groups) or use Fargate. Best when you need Kubernetes compatibility (standard K8s manifests, Helm charts, existing K8s tooling).
- Managed Node Groups: AWS creates/updates EC2 instances as worker nodes. You pick instance type, scaling policy.
- Fargate Profiles: Pods matching certain selectors run on Fargate (serverless).
- Add-ons: Managed plugins like CoreDNS, kube-proxy, VPC CNI, AWS Load Balancer Controller.
Fargate - Serverless Containers
Fargate is a compute engine for ECS and EKS where AWS manages the underlying EC2 instances. You just specify CPU/memory for your container: no node groups to manage, no EC2 to patch.
Use Fargate when
Variable workloads, you don't want to manage nodes, small team, serverless containers, batch jobs, don't need GPU.
Use EC2 nodes when
Need GPU instances, need specific instance types, want Spot instance savings, need local NVMe storage, running Windows containers on EKS (Fargate supports Windows only with ECS), very high compute needs.
ECR - Elastic Container Registry
AWS's private Docker image registry. Like Docker Hub but private and integrated with IAM. Push images here, pull from ECS/EKS. ECR also scans images for security vulnerabilities. Free private repos (storage charged separately). You authenticate with: aws ecr get-login-password | docker login ...
GCP: GKE (Google Kubernetes Engine; widely regarded as the most mature managed K8s service, from the company that created Kubernetes) | Cloud Run (serverless containers, like Fargate but easier) | Artifact Registry (like ECR). GCP does NOT have an ECS equivalent; Google steers everyone to GKE or Cloud Run.
Azure: AKS (Azure Kubernetes Service) | Azure Container Apps (serverless containers, like Cloud Run, built on K8s internally) | Azure Container Instances (ACI) (simple single-container runs, like Fargate but simpler) | ACR (Azure Container Registry).
Storage
What is S3?
S3 is AWS's object storage service, the most fundamental AWS service. Store any file (object) up to 5TB in size. Highly durable (99.999999999%, eleven 9s), highly available, globally accessible. Used for: static file hosting, backup, data lake, ML training data, CloudFront origin, application logs, artifacts.
Key S3 Concepts
Buckets & Objects
A bucket is a container (globally unique name). An object is the file + metadata stored in a bucket. Objects are addressed by a key (the "path"): s3://my-bucket/images/profile/user123.jpg. Despite looking like folders, S3 is flat: the "/" is just part of the key name. The "folders" you see in the console are just a UI fiction (prefix grouping).
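The "folders are just prefixes" idea can be shown without touching AWS at all. The sketch below groups a flat list of keys the way a console view would, using a prefix and a delimiter; the function name and the sample keys are made up for illustration, and the real S3 list API exposes the same idea through its prefix and delimiter parameters.

```python
from typing import Iterable, List, Tuple

def list_level(keys: Iterable[str], prefix: str = "",
               delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Group flat keys into ('folders', objects) directly under a prefix."""
    folders = set()
    objects = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a folder in the UI.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(folders), sorted(objects)

# Hypothetical bucket contents: just four flat keys.
keys = [
    "images/profile/user123.jpg",
    "images/banner.png",
    "logs/2024/app.log",
    "readme.txt",
]
print(list_level(keys))
print(list_level(keys, "images/"))
```

Nothing is stored for `images/` itself; it only appears because two keys happen to share that prefix.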
S3 Storage Classes
Storage classes, ordered from frequently accessed (highest storage cost) to rarely accessed (lowest storage cost):
| Storage class | Best for | Retrieval time | Minimum storage duration |
|---|---|---|---|
| S3 Standard | Any data accessed frequently | Milliseconds | None |
| S3 Intelligent-Tiering | Unknown or changing access patterns; auto-moves objects between tiers | Milliseconds (automatic access tiers) | None |
| S3 Standard-IA | Backups, disaster recovery | Milliseconds | 30 days |
| S3 Glacier Instant Retrieval | Long-term backups accessed about once a quarter | Milliseconds | 90 days |
| S3 Glacier Deep Archive | Long-term archive, 7-10 year retention | Within 12 hours | 180 days |
Versioning
Enable versioning on a bucket to keep multiple versions of every object. Protects against accidental deletes and overwrites. When you delete an object, S3 adds a "delete marker"; the old version still exists and you can restore it. Once enabled, versioning can be suspended but NOT fully disabled. Versions accumulate cost, so use Lifecycle rules to clean up old versions.
Lifecycle Policies
Automate object transitions between storage classes or expiration:
# Example: Move to IA after 30 days, Glacier after 90 days, delete after 365 days
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
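To check your reading of a lifecycle rule, it can help to model it as a function of object age. The sketch below mirrors the example rule above (30 days, 90 days, 365 days); it is a simplified model that assumes transitions happen exactly at the day boundary, whereas S3 applies lifecycle actions asynchronously.

```python
from typing import Optional

# Mirrors the example rule above: (minimum age in days, storage class).
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER")]
EXPIRATION_DAYS = 365

def storage_class_at(age_days: int) -> Optional[str]:
    """Storage class an object would be in at a given age, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current

for age in (0, 30, 90, 365):
    print(age, storage_class_at(age))
```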
Bucket Policies vs ACLs vs IAM
| Method | What it controls | Use when |
|---|---|---|
| IAM Policy | What an IAM user/role can do with S3 | Controlling access for your AWS users/services |
| Bucket Policy | JSON policy on the bucket itself. Can grant cross-account access. | Granting access to other AWS accounts, making bucket public, enforcing HTTPS |
| ACLs | Legacy per-object permissions | Avoid if possible. Disabled by default on new buckets (Object Ownership set to bucket owner enforced). |
# Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}
Pre-signed URLs
Temporarily grant access to a private object without making it public. A pre-signed URL is signed with your credentials and has an expiry. Your backend generates it and sends it to a user, who can then download the private file for the next 15 minutes. Used for: file downloads in apps, direct-to-S3 uploads from browser (bypasses your server).
# Generate pre-signed URL (Python boto3)
import boto3

s3_client = boto3.client('s3')
url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'report.pdf'},
    ExpiresIn=900  # 15 minutes
)
# Now share this URL; it expires automatically after 15 minutes
S3 Replication
- Cross-Region Replication (CRR): Replicate objects to a bucket in another region. For DR, compliance (EU data must also be in EU-West), lower latency for users in different regions.
- Same-Region Replication (SRR): Replicate within same region to another bucket. For log aggregation, test-prod separation, compliance copies.
Both require versioning enabled. Replication is asynchronous (not instant). By default it copies only objects uploaded after replication is configured; use S3 Batch Replication to backfill existing objects.
GCP: Cloud Storage. Storage classes: Standard, Nearline (monthly access), Coldline (quarterly), Archive (yearly). Has Object Lifecycle Management (like S3 Lifecycle). Signed URLs = equivalent to Pre-signed URLs. HMAC keys for S3-compatible API access.
Azure: Azure Blob Storage. Tiers: Hot (frequent), Cool (infrequent), Cold, Archive. Objects are called "blobs". Containers ≈ S3 buckets. Shared Access Signatures (SAS tokens) = equivalent to S3 Pre-signed URLs. Azure Data Lake Storage Gen2 (Blob + hierarchical namespace for analytics).
What is EBS?
EBS provides block storage volumes for EC2 instances, like a virtual hard drive. Unlike S3 (object storage accessible over HTTP), EBS appears as a raw block device to the OS (like /dev/xvda). You format it with a filesystem (ext4, xfs) and mount it. EBS volumes persist independently of the EC2 instance lifecycle: a stopped instance keeps its volumes, and a volume survives termination unless DeleteOnTermination is set (which is the default for root volumes).
EBS Volume Types
| Type | Name | Max IOPS | Max Throughput | Best for |
|---|---|---|---|---|
| gp3 | General Purpose SSD | 16,000 | 1,000 MB/s | Most workloads. Default. Boot volumes, dev DBs. |
| gp2 | General Purpose SSD (legacy) | 16,000 | 250 MB/s | Legacy; migrate to gp3 (cheaper, more flexible) |
| io2 Block Express | Provisioned IOPS SSD | 256,000 | 4,000 MB/s | Mission-critical: SAP HANA, Oracle, high-perf DBs |
| io1 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | Production I/O-intensive databases |
| st1 | Throughput Optimized HDD | 500 | 500 MB/s | Big data, data warehouses, log processing |
| sc1 | Cold HDD | 250 | 250 MB/s | Infrequently accessed, lowest cost |
EBS Snapshots
Point-in-time backup of an EBS volume to S3 (you don't see this S3 bucket; it's AWS-managed). Snapshots are incremental: first snapshot copies everything, subsequent snapshots only store changed blocks. You can create volumes from snapshots in any AZ of the same region. You can copy snapshots across regions (for DR). Cost: per GB-month of data stored in snapshot.
# Create snapshot via AWS CLI
aws ec2 create-snapshot --volume-id vol-0abc123 --description "Pre-deploy backup"
# Create volume from snapshot in different AZ (useful for migrating data)
aws ec2 create-volume --snapshot-id snap-0xyz789 --availability-zone ap-south-1b --volume-type gp3
EBS vs Instance Store
| Feature | EBS | Instance Store |
|---|---|---|
| Persistence | Persists independently of instance | Data LOST when instance stops/terminates |
| Performance | Good (up to 256K IOPS) | Excellent (physically attached NVMe) |
| Cost | Separate charge per GB-month | Included in instance price |
| Use case | Boot volumes, databases, general storage | Temp data, buffers, cache, Kafka, Spark shuffle |
GCP: Persistent Disks (standard HDD, balanced SSD, extreme SSD) and Hyperdisk (ultra-high performance). Google's equivalent of EBS. Also Local SSDs = instance store equivalent (ephemeral).
Azure: Azure Managed Disks. Types: Standard HDD, Standard SSD, Premium SSD, Ultra Disk (for SAP HANA, etc.). Azure also has Azure Shared Disks (multi-VM attach, for Windows WSFC clusters).
What is EFS?
EFS is a managed NFS (Network File System) that can be mounted by multiple EC2 instances simultaneously across multiple AZs. Unlike EBS (one instance at a time), EFS is shared storage. It grows and shrinks automatically, so you pay only for what you use. No capacity planning needed.
Key EFS Features
- Multi-AZ by default: Data stored redundantly across multiple AZs. Highly durable and available.
- Shared mount: 100s or 1000s of EC2 instances can mount the same EFS simultaneously. Read AND write from multiple instances.
- Performance modes: General Purpose (low latency) | Max I/O (high throughput, slightly higher latency for massively parallel workloads).
- Throughput modes: Elastic (auto-scales throughput with load), Bursting (throughput proportional to size), Provisioned (fix throughput independently).
- Storage tiers: Standard (active) → Infrequent Access (EFS IA, cheaper) via lifecycle policies.
# Mount EFS on EC2 (Amazon Linux)
sudo yum install -y amazon-efs-utils
sudo mkdir /mnt/efs
sudo mount -t efs fs-0abc12345:/ /mnt/efs
# Or add to /etc/fstab for persistent mount:
echo "fs-0abc12345:/ /mnt/efs efs defaults,_netdev 0 0" | sudo tee -a /etc/fstab
Use EFS when
Shared content (CMS media files), home directories for multiple users, container shared storage, web farm with shared assets, machine learning training data accessed by multiple GPU nodes.
Don't use EFS when
App needs a database (use RDS), high-performance single-instance block storage (use EBS), object storage for files (use S3), very cost-sensitive (~3x more expensive than EBS per GB).
| Feature | S3 | EBS | EFS |
|---|---|---|---|
| Type | Object | Block | File (NFS) |
| Access | HTTP API / SDK | Single EC2 (usually) | Multiple EC2, multiple AZs |
| Durability | 11 nines | 99.8-99.9% (gp3); 99.999% (io2) | 11 nines |
| Use case | Blobs, backups, data lake | Boot disk, databases | Shared file system |
| Cost (approx, per GB-month) | $0.023 | $0.08 (gp3) | $0.30 (Standard) |
GCP: Filestore, a managed NFS service similar to EFS. Also Cloud Storage FUSE (mount a GCS bucket as a filesystem, not true NFS).
Azure: Azure Files, managed SMB/NFS file shares. Works with Windows AND Linux. Azure NetApp Files for enterprise NAS workloads (SAP, Oracle). Azure also has Azure File Sync to sync on-prem Windows file servers with Azure Files.
Azure File Sync: Extend your on-prem Windows File Server to Azure Files automatically. AWS has no direct equivalent; the closest options are Storage Gateway (File Gateway) or DataSync. Common hybrid use case for enterprises migrating file shares to cloud.
Networking Deep Dive
VPC Advanced Concepts
VPC Peering
Connect two VPCs so resources can communicate using private IPs, as if they were in the same network. Can peer across accounts and regions. Non-transitive: if VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot talk to VPC-C through VPC-B. You'd need a direct peering or Transit Gateway.
VPC-A ◄──peering──► VPC-B ◄──peering──► VPC-C
EC2 in VPC-A → VPC-B: ✓ (direct peering)
EC2 in VPC-A → VPC-C: ✗ (non-transitive; no direct peering)
Transit Gateway (TGW)
A central hub that connects multiple VPCs and on-prem networks. Solves the peering mesh problem: instead of N×(N-1)/2 peering connections for N VPCs, you connect each VPC to one TGW. TGW is transitive. Think of it as a cloud router. Supports: inter-VPC, VPC-to-on-prem (via VPN/Direct Connect), inter-region peering via TGW.
WITHOUT TGW (5 VPCs, 10 peerings needed):    WITH TGW (5 VPCs, 5 attachments):
VPC-A ──── VPC-B                             VPC-A ──┐
VPC-A ──── VPC-C                             VPC-B ──┤
VPC-A ──── VPC-D                             VPC-C ──┼── Transit Gateway ── On-Prem
VPC-A ──── VPC-E                             VPC-D ──┤
VPC-B ──── VPC-C ... etc.                    VPC-E ──┘
Non-transitive, complex route tables.        Central hub, transitive, one TGW.
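The scaling difference between a full peering mesh and a Transit Gateway is simple arithmetic, sketched below. The function names are illustrative only, and on-prem attachments are left out of the count.

```python
def peering_connections(vpc_count: int) -> int:
    """Full mesh: every pair of VPCs needs its own peering connection."""
    return vpc_count * (vpc_count - 1) // 2

def tgw_attachments(vpc_count: int) -> int:
    """Hub and spoke: each VPC attaches once to the Transit Gateway."""
    return vpc_count

for n in (5, 10, 20):
    print(f"{n} VPCs: {peering_connections(n)} peerings vs {tgw_attachments(n)} TGW attachments")
```

The mesh grows quadratically while the hub grows linearly, which is why peering stops being manageable well before 20 VPCs.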
VPC Endpoints
Access AWS services (S3, DynamoDB, SSM, etc.) from within your VPC without traffic leaving through the internet. Traffic stays on AWS's private network. More secure and often faster.
| Type | How it works | Supported services |
|---|---|---|
| Gateway Endpoint | Free. Route table entry routes traffic to AWS service. No ENI. | S3 and DynamoDB only |
| Interface Endpoint (PrivateLink) | Creates an ENI with private IP in your subnet. DNS resolves service to private IP. Charged per hour + data. | 100+ services: SSM, Secrets Manager, KMS, API Gateway, ECR, and more |
VPC Flow Logs
Capture information about IP traffic going to/from network interfaces in your VPC. Sent to CloudWatch Logs or S3. Not real-time packet capture (use Traffic Mirroring for that), just metadata: source/dest IP, ports, protocol, bytes, action (ACCEPT/REJECT).
# Example flow log entry:
# version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0abc 10.0.1.10 10.0.2.20 45678 443 6 20 4000 1620000000 1620000060 ACCEPT OK
2 123456789012 eni-0abc 1.2.3.4 10.0.1.10 12345 22 6 5 300 1620000010 1620000070 REJECT OK
# ↑ Blocked SSH attempt from 1.2.3.4 to our server (Security Group or NACL blocked it)
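Because each record is a space-separated line in a fixed field order, flow logs are easy to parse in scripts. The sketch below handles only the default field layout shown in the header comment above and assumes every numeric field is present; records where fields are logged as "-" would need extra handling. The helper names are made up for this example.

```python
from typing import Dict

FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol", "packets", "bytes", "start", "end"}

def parse_flow_log(line: str) -> Dict[str, object]:
    """Split one default-format flow log line into a dict with typed values."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record: Dict[str, object] = dict(zip(FIELDS, parts))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record

def is_rejected_ssh(record: Dict[str, object]) -> bool:
    """True for rejected TCP (protocol 6) traffic aimed at port 22."""
    return (record["action"] == "REJECT"
            and record["dstport"] == 22
            and record["protocol"] == 6)

sample = "2 123456789012 eni-0abc 1.2.3.4 10.0.1.10 12345 22 6 5 300 1620000010 1620000070 REJECT OK"
record = parse_flow_log(sample)
print(record["srcaddr"], is_rejected_ssh(record))
```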
NAT Gateway Details
- Deployed in a public subnet with an Elastic IP
- Private instances route 0.0.0.0/0 to the NAT GW → it translates their private IP to its public EIP → sends to internet
- Cost: ~$0.045/hour + $0.045/GB data processed. In high-traffic envs, this adds up.
- For high-availability: deploy a NAT Gateway in EACH AZ. Don't share one NAT GW across AZs (AZ failure kills outbound internet for other AZs).
- NAT Instance vs NAT Gateway: NAT Instance is a self-managed EC2 instance doing NAT. Cheaper, more configurable, but you manage patching, HA. NAT Gateway is managed, scales automatically, no maintenance. Use NAT Gateway unless you have a specific reason for NAT Instance.
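To see how NAT Gateway costs add up, the sketch below combines the hourly charge and the per-GB processing charge using the approximate rates quoted in the list above. The gateway count and traffic volume are hypothetical, and rates vary by region.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months

def nat_gateway_monthly_cost(
    gateways: int,
    gb_processed: float,
    hourly_rate: float = 0.045,  # approximate rate quoted above
    per_gb_rate: float = 0.045,  # approximate rate quoted above
) -> float:
    """Monthly cost = hourly charge per gateway + per-GB data processing."""
    hourly = gateways * hourly_rate * HOURS_PER_MONTH
    data = gb_processed * per_gb_rate
    return round(hourly + data, 2)

# Hypothetical setup: one NAT Gateway per AZ across 3 AZs, 5 TB processed per month.
print(nat_gateway_monthly_cost(gateways=3, gb_processed=5000))
# A single idle gateway still has a fixed cost.
print(nat_gateway_monthly_cost(gateways=1, gb_processed=0))
```

In this example the data processing charge is larger than the hourly charge, which is why routing S3 and DynamoDB traffic through free Gateway Endpoints instead of the NAT Gateway is a common saving.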
What is Route 53?
AWS's managed DNS service. Also handles domain registration, health checks, and sophisticated traffic routing policies. Named after port 53 (DNS port). Has a 100% availability SLA, which is very rare among AWS services.
Hosted Zones
A hosted zone is a container for DNS records for a domain. Public hosted zone: records accessible over the internet (your website). Private hosted zone: records for resources within your VPC (internal service discovery, e.g. db.internal → 10.0.3.5).
Record Types
| Record | Purpose | Example |
|---|---|---|
| A | Maps hostname to IPv4 address | api.example.com → 54.123.45.67 |
| AAAA | Maps hostname to IPv6 address | api.example.com → 2001:db8::1 |
| CNAME | Maps hostname to another hostname. Cannot be used on zone apex (root domain). | www.example.com → example.com |
| Alias | AWS-specific. Like CNAME but can be used on root domain. Points to AWS resources (ALB, CloudFront, S3). Free queries for Alias records. | example.com → my-alb.us-east-1.elb.amazonaws.com |
| MX | Mail exchange servers for email routing | example.com → mail1.example.com (priority 10) |
| TXT | Text records. Used for domain verification, SPF, DKIM. | example.com → "v=spf1 include:_spf.google.com ~all" |
| NS | Name server records: which DNS servers handle this zone | Automatically created by Route 53 when you create a zone |
| PTR | Reverse DNS: IP to hostname | 67.45.123.54.in-addr.arpa → api.example.com (reverse of 54.123.45.67) |
Routing Policies
| Policy | How it routes | Use case |
|---|---|---|
| Simple | Returns one or more IPs (round-robin if multiple). No health checks. | Basic single-resource routing |
| Weighted | Distribute traffic by weight (70/30). Sum doesn't need to be 100. | Blue/green deploys, A/B testing, gradual migrations |
| Latency | Route to region with lowest latency for the user. AWS measures latency to each region. | Multi-region apps wanting best performance for each user |
| Failover | Primary record is health-checked. If unhealthy, Route 53 serves the secondary. | Active-passive DR. Route to DR region on failure. |
| Geolocation | Route based on user's geographic location (country/continent). Strict: no match = no response unless a default record exists. | Legal compliance (EU users → EU servers), localized content |
| Geoproximity | Route based on physical distance. Can shift traffic by adjusting bias values. Requires Traffic Flow (extra cost). | Multi-region with granular traffic shifting |
| Multivalue Answer | Returns up to 8 healthy records. Like Simple but with health checks per record. | Simple client-side load balancing with health checks. Not a replacement for ALB. |
| IP-Based | Route based on client IP CIDR ranges. | Route corporate network traffic to internal endpoints |
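For the Weighted policy, each record's share of traffic is its weight divided by the sum of all weights, which is why the weights don't need to add up to 100. A minimal sketch, assuming at least one record has a non-zero weight (the record names are hypothetical):

```python
from typing import Dict

def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of traffic each record receives: weight / sum of all weights."""
    total = sum(weights.values())
    if total == 0:
        # Simplification: this sketch does not model the all-zero case.
        raise ValueError("at least one record needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}

print(traffic_share({"blue": 70, "green": 30}))
print(traffic_share({"v1": 3, "v2": 1}))  # weights need not sum to 100
```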
Health Checks
Route 53 health checkers (globally distributed) probe your endpoint every 30 seconds, or every 10 seconds with fast checks. The endpoint counts as healthy while more than 18% of checkers report it healthy; at 18% or fewer it is marked unhealthy and Failover routing activates. You can health-check: HTTP/HTTPS/TCP endpoints, CloudWatch alarms, or calculated health checks (composite of multiple checks).
What is CloudFront?
AWS's global CDN with 600+ edge locations. Accelerates delivery of static and dynamic content by caching at the edge. Also provides: DDoS protection (Shield Standard free), HTTPS termination, compression, WAF integration, Lambda@Edge for programmable edge logic.
Key CloudFront Concepts
Distribution
A CloudFront configuration object. You create one distribution per app/site. A distribution has a CloudFront domain (d1abc23efg.cloudfront.net) which you CNAME your domain to. Has one or more origins and one or more cache behaviors.
Origins
Where CloudFront fetches content when it's not cached (cache miss). Can be: S3 bucket, ALB, EC2, API Gateway, or any HTTP server. A distribution can have multiple origins.
Cache Behaviors
Rules that define how CloudFront handles requests matching a URL path pattern. Different paths can route to different origins with different cache settings:
https://example.com
│
├── /api/* ──────────► ALB → EC2 (no caching, dynamic)
│     Cache: TTL=0, forward all headers
│
├── /static/* ───────► S3 Bucket (cached, long TTL)
│     Cache: TTL=86400 (1 day)
│
└── /* (Default) ────► S3 (index.html, SPA)
      Cache: TTL=300 (5 min)
OAC - Origin Access Control
Allows CloudFront to access a private S3 bucket on your behalf. Users access content via CloudFront URL only β the S3 bucket can block all direct access. Prevents bucket hotlinking, enforces CloudFront caching. OAC is the modern replacement for OAI (Origin Access Identity).
Lambda@Edge & CloudFront Functions
- CloudFront Functions: Ultra-lightweight JS functions running at the edge for request/response manipulation. Sub-ms latency. Free tier available. Good for: URL rewrites/redirects, add security headers, A/B testing at edge.
- Lambda@Edge: Full Lambda functions deployed globally to CloudFront PoPs. More powerful (Node.js, Python), slightly higher latency. Good for: authentication at edge, dynamic content generation, API calls at edge.
// CloudFront Function example: add security headers to all responses
function handler(event) {
    var response = event.response;
    var headers = response.headers;
    headers['strict-transport-security'] = {value: 'max-age=63072000; includeSubdomains; preload'};
    headers['x-content-type-options'] = {value: 'nosniff'};
    headers['x-frame-options'] = {value: 'DENY'};
    return response;
}
Application Load Balancer (ALB) - Layer 7
ALB operates at HTTP/HTTPS layer. It understands your request content and can make intelligent routing decisions. Every request is terminated at the ALB (it opens a new connection to the backend). Essential for microservices architectures.
Key ALB Components
- Listener: Waits on a port (80 or 443). Defines rules to route requests.
- Rules: IF (conditions match) THEN (action). Conditions: path, hostname, headers, query strings, source IP, HTTP method. Actions: forward, redirect, return fixed response.
- Target Groups: Collection of targets (EC2 instances, IP addresses, Lambda functions, containers). Each TG has health check configuration.
Client HTTP Request
        │
        ▼
┌─────────────────┐
│  ALB Listener   │ :443 (HTTPS)
│  ─────────────  │
│  Rule 1:        │ /users/* ──────────► Target Group A (User Service)
│  Rule 2:        │ /orders/* ─────────► Target Group B (Order Service)
│  Rule 3:        │ /api/* (host:api) ─► Target Group C (API backend)
│  Default:       │ /* ────────────────► Target Group D (Frontend SPA)
└─────────────────┘
Each Target Group:
┌───────────────────────────────────────────────────────┐
│ EC2: i-001 (healthy ✓)  i-002 (healthy ✓)  i-003 ✗    │
│ Health check: GET /health → 200 OK every 30s          │
└───────────────────────────────────────────────────────┘
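The listener diagram above boils down to "evaluate rules in priority order, first match wins, otherwise use the default". The sketch below models that with shell-style wildcards; it is an illustration rather than ALB's actual matcher, and the `api.*` host pattern is an assumed reading of the "host:api" condition in the diagram.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (priority, host pattern or None, path pattern, target group), mirroring the diagram.
Rule = Tuple[int, Optional[str], str, str]
RULES: List[Rule] = [
    (1, None, "/users/*", "Target Group A (User Service)"),
    (2, None, "/orders/*", "Target Group B (Order Service)"),
    (3, "api.*", "/api/*", "Target Group C (API backend)"),  # host pattern is an assumption
]
DEFAULT_TARGET = "Target Group D (Frontend SPA)"

def route(host: str, path: str) -> str:
    """Return the target group of the first rule whose conditions all match."""
    for _priority, host_pattern, path_pattern, target in sorted(RULES, key=lambda r: r[0]):
        if host_pattern is not None and not fnmatchcase(host, host_pattern):
            continue  # host condition present but not satisfied
        if fnmatchcase(path, path_pattern):
            return target
    return DEFAULT_TARGET

print(route("example.com", "/users/42"))
print(route("api.example.com", "/api/v1/items"))
print(route("example.com", "/api/v1/items"))  # host condition fails, falls to default
```

Because all conditions in a rule must match, the last call falls through to the default target even though the path matches rule 3.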
ALB Features
- HTTPS Termination: ALB decrypts HTTPS and talks to backend via HTTP. Offloads SSL processing from backends.
- Sticky Sessions: Route same user to same backend target using a cookie. Use with caution (undermines horizontal scaling).
- Weighted Target Groups: Send 90% to v2 TG, 10% to v3 TG. Canary deploys without DNS changes.
- Authentication: Native OpenID Connect/Cognito authentication. Reject unauthenticated requests before they hit your app.
- Access Logs: Log every request to S3. Useful for traffic analysis, debugging, compliance.
Network Load Balancer (NLB) - Layer 4
NLB operates at TCP/UDP layer. Doesn't inspect packet contents. Handles millions of requests per second with ultra-low latency (<1ms). Has static IP addresses (useful for whitelisting). Supports TLS termination at L4.
| Feature | ALB | NLB |
|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) |
| Routing intelligence | Path, host, headers, cookies | IP + Port only |
| Performance | Good | Extreme (millions RPS) |
| Static IP | No (put Global Accelerator or an NLB in front) | Yes (one per AZ) |
| Protocol support | HTTP/HTTPS/WebSocket/gRPC | TCP/UDP/TLS |
| Price | Moderate | Moderate |
| Use case | Web apps, microservices, APIs | Gaming, IoT, VoIP, financial trading, TCP apps |
AWS Site-to-Site VPN
Encrypted connection between your on-premises network and your AWS VPC over the public internet. Uses IPsec. Two tunnels per VPN connection (for redundancy). Managed on AWS side by Virtual Private Gateway (VGW) or Transit Gateway. Bandwidth: ~1.25 Gbps max per tunnel, varies with internet conditions.
# VPN Connection components:
On-Prem Router/Firewall (Customer Gateway) ==IPsec Tunnel==> Virtual Private Gateway (VGW)
                                                                      |
                                                          Route table entry in VPC
                                                          10.0.0.0/8 -> vgw-xxxxx
AWS Direct Connect (DX)
A dedicated physical private network connection from your datacenter to AWS. NOT over the internet: it is a private fiber link provisioned at a Direct Connect location (a colocation facility), often through an AWS Direct Connect partner. More expensive to set up, but it gives consistent bandwidth, lower and more predictable latency, and cheaper transfer at volume (data transfer pricing is lower on DX than over the internet).
| Feature | Site-to-Site VPN | Direct Connect |
|---|---|---|
| Connection type | Over internet (encrypted) | Private dedicated fiber |
| Setup time | Hours (AWS console + router config) | Weeks to months (physical provisioning) |
| Bandwidth | Up to ~1.25 Gbps per tunnel (variable, internet-dependent) | 1, 10, or 100 Gbps dedicated ports (hosted connections from 50 Mbps), consistent |
| Cost | Low (hourly + data transfer) | High (port fee + partner fee + data) |
| Reliability | Internet outages affect it | Dedicated, very reliable |
| Latency | Variable | Consistent and low |
| Use case | Small/medium orgs, dev, backup link | Enterprise hybrid cloud, large data transfers, compliance |
GCP: Cloud VPN (like Site-to-Site VPN) | Cloud Interconnect (like Direct Connect). Cloud Interconnect types: Dedicated Interconnect (10 or 100 Gbps circuits) and Partner Interconnect.
Azure VPN Gateway (like Site-to-Site VPN) | Azure ExpressRoute (like Direct Connect). ExpressRoute also has ExpressRoute Global Reach: link your on-prem offices to each other across the Microsoft backbone through their ExpressRoute circuits (the closest AWS counterpart is Direct Connect SiteLink).
IAM & Security
Core IAM Entities
IAM is a free, global service; it's not region-specific. IAM controls who can do what on which AWS resources. Everything in AWS is an API call, and every call goes through IAM for authorization.
IAM User
A person or application with permanent long-term credentials (password + access keys). Represents one specific identity. Avoid creating users for services; use roles instead.
IAM Group
Collection of users. Attach policies to groups, not individual users. E.g., "Developers" group has S3 + EC2 read. Add a new dev → add to group. Remove dev → remove from group. Clean, scalable.
IAM Role
An identity with permissions, but NO permanent credentials. Assumed temporarily by users, AWS services (EC2, Lambda), or other accounts. Credentials are auto-rotated. Preferred over users for services.
IAM Policy
JSON document defining what actions are allowed/denied on which resources. Attached to users, groups, or roles. AWS-managed policies (maintained by AWS) or customer-managed (you control them).
IAM Policy Structure
// Annotations are for study only: JSON has no comments, so strip them before use.
// s3:prefix is a ListBucket condition key, so GetObject is scoped by its Resource ARN instead.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListReportsPrefix",                // Optional statement ID
      "Effect": "Allow",                               // "Allow" or "Deny"
      "Action": "s3:ListBucket",                       // What action is allowed
      "Resource": "arn:aws:s3:::my-company-bucket",    // The bucket itself (for ListBucket)
      "Condition": {                                   // Optional: extra conditions
        "StringLike": {
          "s3:prefix": "reports/*"                     // Only list keys under "reports/"
        }
      }
    },
    {
      "Sid": "AllowReadReportsObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-company-bucket/reports/*"  // Objects under "reports/"
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-company-bucket/*"
    }
  ]
}
IAM Policy Types
| Policy Type | Attached to | Purpose |
|---|---|---|
| Identity-based | User, Group, Role | What that identity can do |
| Resource-based | Resource (S3 bucket, Lambda, SQS) | Who can access this resource (enables cross-account) |
| Permission Boundary | User or Role | Maximum permissions ceiling. Even if identity has broader policy, boundary limits it. |
| SCP (Service Control Policy) | AWS Organization Account/OU | Max permissions for all member accounts in an org. Even a member account's root user can't exceed the SCP (the management account is exempt). |
| Session Policy | AssumeRole call | Further restrict permissions for a specific role session |
Policy Evaluation Logic
Request arrives
  |
  v
Explicit DENY in any policy?  -- Yes --> DENY (deny wins, always)
  | No
  v
SCP allows it (Organizations)?  -- No --> DENY
  | Yes
  v
Explicit ALLOW present?  -- No --> Implicit DENY (default deny)
  | Yes
  v
Within Permission Boundary?  -- No --> DENY
  | Yes
  v
ALLOW

Rule: EXPLICIT DENY always wins. Default is DENY.
You must explicitly ALLOW everything you want permitted.
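The core of that rule (explicit deny beats allow, and no match means deny) can be sketched in a few lines. This is a deliberately reduced model, not IAM's real engine: it ignores conditions, SCPs, permission boundaries and resource policies, uses case-sensitive `fnmatchcase` wildcards as a stand-in for IAM's matching, and the `POLICY` statements (including the broad `s3:*` allow) are made up for the demonstration.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM lets Action/Resource be a string or a list; normalize to a list."""
    return [value] if isinstance(value, str) else list(value)


def _matches(statement, action, resource):
    return (any(fnmatchcase(action, a) for a in _as_list(statement["Action"]))
            and any(fnmatchcase(resource, r) for r in _as_list(statement["Resource"])))


def evaluate(statements, action, resource):
    """Explicit deny wins, then explicit allow, otherwise implicit deny."""
    matching = [s for s in statements if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return "DENY (explicit)"
    if any(s["Effect"] == "Allow" for s in matching):
        return "ALLOW"
    return "DENY (implicit)"


POLICY = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::my-company-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::my-company-bucket/*"},
]

obj = "arn:aws:s3:::my-company-bucket/reports/q1.csv"
print(evaluate(POLICY, "s3:GetObject", obj))
print(evaluate(POLICY, "s3:DeleteObject", obj))
print(evaluate(POLICY, "ec2:RunInstances", "arn:aws:ec2:ap-south-1:111111111111:instance/*"))
```

Note how `s3:DeleteObject` is denied even though the broad `s3:*` allow also matches it.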
IAM Roles β The Key Pattern
Instead of creating a user for your EC2 instance and storing access keys on the server (dangerous, because keys can leak), you attach an IAM Role to EC2. EC2 automatically gets temporary credentials via IMDS (the Instance Metadata Service). The credentials are short-lived and are rotated automatically before they expire. Lambda, ECS tasks, and other services follow the same pattern through their own credential endpoints.
# BAD: Access keys hardcoded or in environment (never do this)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# GOOD: Use IAM Role attached to the EC2/Lambda/ECS task
# boto3 automatically fetches temp credentials from IMDS
import boto3
s3 = boto3.client('s3')  # No credentials needed: role creds are used automatically
s3.get_object(Bucket='my-bucket', Key='file.txt')
Cross-Account Access with Roles
A role in Account B can be assumed by Account A's resources. This is how centralized tooling (one DevOps account managing multiple app accounts) works. The trust policy on the role in Account B says "allow Account A's role X to assume me."
# Trust Policy on Role in Account B (the target role)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/DeployRole"  // Account A's role
    },
    "Action": "sts:AssumeRole"
  }]
}
# In Account A, assume the role:
aws sts assume-role \
--role-arn "arn:aws:iam::222222222222:role/DeployTargetRole" \
--role-session-name "deploy-session-$(date +%s)"
MFA (Multi-Factor Authentication)
- Virtual MFA: Authenticator app (Google Authenticator, Authy)
- Hardware MFA: Physical TOTP device or FIDO2 security key (YubiKey)
- Always enable MFA on the root account: root has unrestricted access, and the root user of the Organizations management account can't be restricted by SCPs
- You can enforce MFA for specific actions via policy condition:
"Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
GCP: Cloud IAM. Key difference: GCP uses Roles (not policies) as the primary permission unit. Predefined roles (like AWS managed policies), custom roles. Service Accounts = IAM Roles for services. Workload Identity Federation = allows external identities (GitHub Actions, on-prem) to access GCP without service account keys, similar to AWS OIDC federation.
Azure RBAC (Role-Based Access Control). Built-in roles: Owner, Contributor, Reader, plus 100+ service-specific roles. Service Principals = IAM Roles for services. Managed Identities (System-assigned or User-assigned) = equivalent to EC2 IAM roles, with no credentials stored. Azure AD / Entra ID is the identity provider (in AWS, IAM is separate from the directory; Azure integrates them).
Azure Active Directory (Entra ID): Azure integrates identity directory (user management, SSO, conditional access) directly with RBAC. In AWS, you'd use IAM + AWS SSO (IAM Identity Center) + potentially an external IdP (Okta, Azure AD itself). Many companies use Azure AD as their IdP even for AWS.
AWS KMS: Key Management Service
KMS is a managed service for creating and controlling encryption keys. It's the central key vault for all AWS encryption. When you "enable encryption" in S3, EBS, or RDS, they're using KMS keys under the hood.
Key Types
| Key Type | Who manages | Rotation | Cost | Use when |
|---|---|---|---|---|
| AWS Managed Key | AWS (auto-created per service) | Auto (1 yr) | Free | Basic encryption, fine for most cases |
| Customer Managed Key (CMK) | You | Auto or manual | $1/month + API calls | Need control, cross-account, custom key policy, audit |
| AWS CloudHSM | You (hardware module) | You manage | $$$ | Strict compliance (FIPS 140-2 Level 3), custom HSM |
Envelope Encryption
KMS doesn't encrypt your 5 GB file directly: the KMS Encrypt API only accepts small payloads (up to 4 KB), and KMS keys never leave KMS. Instead: KMS generates a Data Encryption Key (DEK). Your code uses the DEK to encrypt the actual data. The DEK itself is encrypted with a KMS key (the "master key"). You store the encrypted DEK alongside the encrypted data. To decrypt: call KMS to decrypt the DEK, then use the DEK to decrypt the data. The master key never leaves KMS.
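The flow above can be walked through with a toy model. Everything here is an assumption made for illustration: `FakeKMS` stands in for the real service (whose corresponding operations are GenerateDataKey and Decrypt), and `_keystream_xor` is a toy cipher built from HMAC-SHA256 so the sketch runs on the standard library alone. Real code would use an authenticated cipher such as AES-GCM and must not reuse this cipher.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (illustration only): XOR data with an HMAC-SHA256 keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class FakeKMS:
    """Stand-in for KMS: the master key exists only inside this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        """Return (plaintext DEK, DEK encrypted under the master key)."""
        dek = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        return dek, nonce + _keystream_xor(self._master_key, nonce, dek)

    def decrypt(self, encrypted_dek: bytes) -> bytes:
        nonce, body = encrypted_dek[:16], encrypted_dek[16:]
        return _keystream_xor(self._master_key, nonce, body)


def encrypt_data(kms, data: bytes) -> dict:
    dek, encrypted_dek = kms.generate_data_key()
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(dek, nonce, data)   # bulk data is encrypted locally
    # Only the ENCRYPTED DEK is stored next to the ciphertext.
    return {"encrypted_dek": encrypted_dek, "nonce": nonce, "ciphertext": ciphertext}


def decrypt_data(kms, envelope: dict) -> bytes:
    dek = kms.decrypt(envelope["encrypted_dek"])    # one small call to "KMS"
    return _keystream_xor(dek, envelope["nonce"], envelope["ciphertext"])


kms = FakeKMS()
envelope = encrypt_data(kms, b"quarterly report contents")
print(decrypt_data(kms, envelope))
```

The point of the design is that only the 32-byte DEK ever travels to the key service, however large the data is.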
AWS Secrets Manager
Centralized, encrypted storage for secrets: database passwords, API keys, OAuth tokens. Auto-rotates secrets (can trigger a Lambda to rotate passwords in RDS). Applications retrieve secrets at runtime via API, so there are no hardcoded passwords in code.
# Retrieve secret at runtime (Python)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
secret = client.get_secret_value(SecretId='prod/myapp/db-password')
db_creds = json.loads(secret['SecretString'])
db_host = db_creds['host']
db_pass = db_creds['password']
# With rotation enabled (e.g., every 30 days), Secrets Manager invokes a Lambda
# that updates the RDS user password automatically; the app just re-reads the secret
Secrets Manager vs SSM Parameter Store
| Feature | Secrets Manager | SSM Parameter Store |
|---|---|---|
| Cost | $0.40/secret/month + API calls | Free (Standard) / $0.05/adv param |
| Auto-rotation | Yes (built-in for RDS, Redshift, DocumentDB) | No (manual or custom Lambda) |
| Encryption | Always encrypted (KMS) | Optional (use SecureString type for encrypted) |
| Cross-account | Yes, with resource policy | Limited (advanced-tier parameters can be shared via AWS RAM) |
| Best for | Database passwords, API keys, credentials requiring rotation | App configs, feature flags, non-secret parameters |
GCP: Secret Manager (like Secrets Manager) | Cloud KMS (like AWS KMS) | Cloud HSM (like CloudHSM). GCP Secret Manager also supports version control of secrets.
Azure Key Vault: combines secrets, keys, AND certificates in one service (AWS splits these: Secrets Manager + KMS + ACM). Key Vault has Managed HSM tier for FIPS 140-2 Level 3. Azure App Configuration is like SSM Parameter Store for feature flags and app settings.
Azure Key Vault Certificates: Key Vault can manage the full TLS certificate lifecycle (request, renew, store, deploy). AWS splits this: ACM (Certificate Manager) for provisioning/renewal, Secrets Manager for custom cert storage.
AWS WAF: Web Application Firewall
WAF protects your web apps from common exploits at the application layer (L7). Works with CloudFront, ALB, API Gateway, AppSync. You define rules that filter HTTP requests.
Built-in rule groups (AWS Managed Rules): SQL injection protection, XSS protection, known bad IPs, AWS IP reputation lists. You can also write custom rules: "Block all requests where URI contains ../" or "Rate limit to 1000 req/5min per IP."
AWS Shield
| Tier | Cost | Protection |
|---|---|---|
| Shield Standard | Free (automatic) | L3/L4 DDoS protection for all AWS resources. Protects against SYN floods, UDP reflection, etc. |
| Shield Advanced | $3,000/month | Enhanced DDoS protection, 24/7 Shield Response Team (SRT, formerly the DDoS Response Team), cost protection (AWS refunds scale-out costs from DDoS), advanced metrics. |
Amazon GuardDuty
AI-powered threat detection service that continuously monitors your AWS account for malicious activity and unusual behavior. Analyzes: VPC Flow Logs, DNS logs, CloudTrail management events, S3 data events (via CloudTrail), EKS audit logs. Detects: compromised EC2 instances communicating with known bad IPs, unusual API calls, credential theft, S3 data exfiltration patterns.
AWS Security Hub
Central security dashboard aggregating findings from GuardDuty, Inspector, Macie, Firewall Manager, and third-party tools. Runs automated compliance checks against CIS AWS Foundations, PCI-DSS, and other standards. Gives you a security score and prioritized findings list.
Other Key Security Services
| Service | What it does |
|---|---|
| Amazon Inspector | Vulnerability scanning for EC2 instances and container images in ECR. Continuously scans for CVEs, network exposure. Integrates with ECR to block vulnerable images. |
| Amazon Macie | ML-based data security for S3. Discovers and protects sensitive data: PII (names, SSNs, credit cards, passports). Alerts you if sensitive data is in a public bucket. |
| AWS Config | Continuous resource configuration recording. "Who changed what, when?" Compliance rules: "All S3 buckets must have encryption enabled." Alerts on drift. |
| AWS CloudTrail | Audit log of all AWS API calls: who made the call, from which IP, when, what changed. The "flight recorder" of your AWS account. Enabled by default but save to S3 for long-term retention. |
GCP: Cloud Armor (WAF + DDoS) | Security Command Center (like Security Hub + GuardDuty) | Cloud Audit Logs (like CloudTrail) | Container Analysis (like Inspector for containers).
Azure WAF (part of App Gateway or Front Door) | Azure DDoS Protection (Standard = like Shield Advanced) | Microsoft Defender for Cloud (like GuardDuty + Security Hub combined) | Azure Monitor Activity Log (like CloudTrail).
Databases
What is RDS?
RDS is a managed relational database service. AWS handles: OS patching, DB engine upgrades (with your approval), automated backups, replication, failover. You just connect and query. Supported engines: MySQL, PostgreSQL, MariaDB, Oracle, Microsoft SQL Server, IBM Db2, and Amazon Aurora (custom AWS engine).
Key RDS Concepts
Multi-AZ Deployment
The most important RDS HA feature. When enabled, AWS automatically maintains a synchronous standby replica in a different AZ. If primary fails, RDS automatically fails over to standby. Failover takes 60-120 seconds (DNS update). The standby is NOT accessible for reads; it's purely for failover. Separate from Read Replicas.
MULTI-AZ (High Availability):
+---------------------------------------+
| AZ-1a: Primary RDS (Read+Write)       |
|          |  sync                      |
|          v  (failover)                |
| AZ-1b: Standby RDS (NOT accessible)   |
+---------------------------------------+
For: automatic failover / HA

READ REPLICAS (Scalability):
+---------------------------------------+
| Primary (R+W) --async--> Replica 1    |
|               --async--> Replica 2    |
|               --async--> Replica 3    |
| Replicas: READ ONLY                   |
| Can be in different region!           |
+---------------------------------------+
For: scale out reads, reports, analytics, DR (promote to primary)
Read Replicas
Asynchronous copies of your primary DB, used to offload read traffic. Up to 15 read replicas for Aurora; RDS for MySQL, MariaDB, and PostgreSQL now also allow up to 15, while Oracle and SQL Server allow 5. Can be in a different region (cross-region read replicas for DR). In a disaster, promote a read replica to a standalone DB, which becomes the new primary.
Automated Backups
- Daily automated backup during your backup window (entire DB + transaction logs)
- Retained for 1-35 days (console default 7; 0 disables automated backups). After that, deleted automatically.
- Point-in-time recovery: restore to any second within the backup retention period
- Manual snapshots: you control them, persist indefinitely until you delete them
RDS Proxy
A fully managed, highly available database proxy that sits between your app and RDS. Why use it? Lambda functions opening thousands of connections overwhelm RDS (too many connections). RDS Proxy pools and reuses connections: Lambda connects to Proxy, Proxy maintains a small pool to RDS. Also speeds up failover: clients connect to the Proxy endpoint, which auto-routes to the healthy instance.
Storage Auto Scaling
Enable and set a maximum storage limit. If your DB is about to run out of disk space, RDS automatically scales up storage without downtime. You can never shrink it back (only grow). Set a high maximum and don't worry about disk again.
Amazon Aurora
AWS's custom-built cloud-native relational DB. MySQL and PostgreSQL compatible, so your app code doesn't change. But it's re-engineered from scratch for cloud performance and resilience.
| Feature | Standard RDS (MySQL) | Aurora MySQL |
|---|---|---|
| Storage redundancy | Single AZ volume (Multi-AZ adds standby) | 6 copies across 3 AZs by default |
| Read Replicas | Up to 15 on current versions (originally 5) | 15 max (Aurora Replicas) |
| Failover | 60-120 seconds | ~30 seconds (in-cluster replicas) |
| Performance | Baseline MySQL | Up to 5x MySQL throughput (AWS's figure) |
| Cost | Lower | ~20% more than RDS |
| Max storage | Up to 64 TiB | Up to 128 TiB, auto-scales |
Aurora Serverless v2
Aurora that scales capacity in fine-grained increments (in 0.5 ACU steps from 0.5 to 256 ACUs) based on actual demand, in seconds. No pre-provisioning. Pay per second of actual ACU usage. Perfect for: unpredictable workloads, dev/test, multi-tenant SaaS with variable tenant load.
GCP: Cloud SQL (managed MySQL, PostgreSQL, SQL Server; like standard RDS) | AlloyDB (like Aurora: PostgreSQL-compatible, high performance; Google claims up to 4x faster than standard PostgreSQL for transactional workloads). Also Cloud Spanner: globally distributed SQL (long without a direct AWS equivalent).
Azure SQL Database (managed SQL Server) | Azure Database for MySQL/PostgreSQL (like standard RDS) | Azure SQL Managed Instance (SQL Server with near-100% compatibility, for lift-and-shift). Azure's Hyperscale tier is similar to Aurora in concept.
Cloud Spanner: Globally distributed, horizontally scalable relational DB with ACID transactions across regions. Historically no true equivalent in AWS or Azure (AWS DocumentDB is NoSQL, and Aurora Global Database takes writes in a single primary region); the newer Amazon Aurora DSQL targets a similar niche. Used by Google for their own core infrastructure.
What is DynamoDB?
DynamoDB is AWS's fully managed NoSQL key-value and document database. No servers, no OS, no capacity planning. Single-digit millisecond performance at any scale. Used by Amazon itself for their shopping cart, sessions, order management. Built for internet-scale applications.
Core Concepts
Tables, Items, Attributes
DynamoDB is schemaless (except for keys). A Table holds Items (like rows), each with Attributes (like columns). No fixed schema: different items can have different attributes. Only the primary key is required.
Primary Key Types
Simple Primary Key (Partition Key only)
Single attribute used as the primary key. Must be unique. Used when you query by a single ID.
Example: userId as partition key. Query: "Give me all data for userId=U123"
Composite Primary Key (Partition + Sort Key)
Two attributes together are unique. Multiple items can share partition key but must have different sort keys. Enables range queries.
Example: userId (partition) + orderDate (sort). Query: "Give me all orders for userId=U123 in 2024"
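As a rough mental model of the composite key, the sketch below keeps items grouped by partition key and sorted by sort key, which is what makes range queries cheap. `ToyTable` and its methods are hypothetical names for this illustration only; the real service distributes partitions across storage nodes and is accessed through its Query API.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> list of (sort_key, item), kept sorted by sort key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put_item(self, partition_key, sort_key, item):
        rows = self._partitions[partition_key]
        keys = [k for k, _ in rows]
        i = bisect_left(keys, sort_key)
        if i < len(rows) and rows[i][0] == sort_key:
            rows[i] = (sort_key, item)      # same full key: overwrite the item
        else:
            rows.insert(i, (sort_key, item))

    def query_between(self, partition_key, low, high):
        """All items in one partition whose sort key is within [low, high]."""
        rows = self._partitions.get(partition_key, [])
        keys = [k for k, _ in rows]
        return [item for _, item in rows[bisect_left(keys, low):bisect_right(keys, high)]]


table = ToyTable()
table.put_item("U123", "2023-12-30", {"orderId": "O1"})
table.put_item("U123", "2024-03-05", {"orderId": "O2"})
table.put_item("U123", "2024-11-20", {"orderId": "O3"})
table.put_item("U456", "2024-06-01", {"orderId": "O4"})

# "Give me all orders for userId=U123 in 2024"
orders_2024 = table.query_between("U123", "2024-01-01", "2024-12-31")
print(orders_2024)
```

ISO-formatted dates are used as sort keys because they order correctly as plain strings.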
Read Capacity Units (RCU) and Write Capacity Units (WCU)
DynamoDB bills on throughput. 1 RCU = 1 strongly consistent read (or 2 eventually consistent reads) of up to 4KB/second. 1 WCU = 1 write of up to 1KB/second. You either provision RCU/WCU (predictable, cheaper) or use On-Demand mode (pay per request, no planning, costlier per request but no idle waste).
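The capacity arithmetic is easy to get wrong because item sizes round up to the next 4 KB (reads) or 1 KB (writes). A small calculator, assuming 1 KB = 1024 bytes and using hypothetical function names, might look like this:

```python
import math


def rcus_needed(reads_per_second, item_size_bytes, strongly_consistent=True):
    """RCUs for a steady read rate; each read costs ceil(size / 4 KB) units."""
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total = total / 2        # eventually consistent reads cost half
    return math.ceil(total)


def wcus_needed(writes_per_second, item_size_bytes):
    """WCUs for a steady write rate; each write costs ceil(size / 1 KB) units."""
    return writes_per_second * max(1, math.ceil(item_size_bytes / 1024))


# 100 reads/s of 6 KB items: each read rounds up to 8 KB, i.e. 2 units
print(rcus_needed(100, 6 * 1024))                              # strongly consistent
print(rcus_needed(100, 6 * 1024, strongly_consistent=False))   # eventually consistent
# 50 writes/s of 2.5 KB items: each write rounds up to 3 KB, i.e. 3 units
print(wcus_needed(50, 2560))
```

The rounding is why many small items are cheaper to read than the same data stored in a few items just over a 4 KB boundary.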
Global Secondary Indexes (GSI)
Query your DynamoDB table on a different attribute. If your table's partition key is userId, but you need to query "all users who signed up on date X", create a GSI with signupDate as partition key. GSIs have their own RCU/WCU separate from the main table.
DynamoDB Streams
A time-ordered stream of item-level changes (inserts, updates, deletes) in a DynamoDB table. Retained for 24 hours. Trigger Lambda functions on changes, which is powerful for: replication, cache invalidation, event sourcing, audit logs.
DynamoDB Accelerator (DAX)
In-memory cache for DynamoDB. API-compatible: swap your DynamoDB client for a DAX client, same code. Reduces read latency from single-digit ms to microseconds. Handles millions of reads per second. Use for: high-read, cost-sensitive workloads (DAX reads are cheaper than DynamoDB reads at high volume).
Global Tables
Multi-region, multi-active DynamoDB. Write to any region, DynamoDB replicates to others within seconds. Last-writer-wins conflict resolution. Perfect for: global apps needing local read/write latency everywhere, multi-region active-active architecture.
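Last-writer-wins is simple to state but has a consequence worth seeing: the earlier of two concurrent writes is silently discarded. The sketch below is only a model of that rule; the record layout, the millisecond timestamps, and the region-name tie-break are assumptions for the example, not a description of DynamoDB's internals.

```python
def merge_last_writer_wins(versions):
    """Pick the version with the latest timestamp.

    Ties are broken by region name purely so this toy example is deterministic.
    """
    return max(versions, key=lambda v: (v["updated_at_ms"], v["region"]))


concurrent_writes = [
    {"region": "us-east-1", "updated_at_ms": 1_700_000_000_120,
     "item": {"userId": "U123", "plan": "free"}},
    {"region": "ap-south-1", "updated_at_ms": 1_700_000_000_450,
     "item": {"userId": "U123", "plan": "pro"}},
]

winner = merge_last_writer_wins(concurrent_writes)
print(winner["region"], winner["item"])
```

If both regions must contribute to the final value (for example, counters), the application has to be designed so that concurrent writes touch different items or attributes.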
GCP: Cloud Firestore (document NoSQL, like DynamoDB but more flexible querying) | Cloud Bigtable (wide-column NoSQL, Apache HBase compatible, for massive analytics). No exact DynamoDB equivalent; Firestore is closest for serverless apps.
Azure Cosmos DB: multi-model NoSQL (document, key-value, graph, column-family) with multi-region active-active. More flexible than DynamoDB. Supports multiple APIs: NoSQL (formerly Core/SQL), MongoDB, Cassandra, Gremlin, Table. 99.999% availability SLA.
Cosmos DB's multi-model support: each Cosmos DB account is created with one API (for example MongoDB, Cassandra, or NoSQL), and the same underlying service backs all of them; use separate accounts to run several APIs side by side. You can use existing MongoDB drivers unchanged. AWS has separate services for each (DynamoDB, DocumentDB for MongoDB, Keyspaces for Cassandra).
What is ElastiCache?
Managed in-memory caching service. Two engines: Redis and Memcached. Dramatically reduces database load and latency by serving frequent reads from memory (microseconds) instead of disk (milliseconds).
Redis vs Memcached on ElastiCache
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, bitmaps, geospatial, streams | Simple key-value strings only |
| Persistence | Yes (RDB snapshots, AOF logs) | None (restart = all data lost) |
| Replication | Yes (primary + replicas) | No |
| Multi-AZ Failover | Yes | No |
| Pub/Sub | Yes | No |
| Cluster mode | Yes (sharding) | Yes |
| Use cases | Sessions, leaderboards, rate limiting, pub/sub, queues, ML | Simple cache (horizontal scaling, multi-threaded) |
Common Caching Patterns
# Lazy Loading (Cache-Aside): most common pattern
# Assumes `redis` is a connected Redis client and `db` is your database helper.
import json

def get_user(user_id):
    # Try cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)  # Cache HIT
    # Cache MISS: query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # Store in cache with TTL (expiry)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))  # Cache 1 hour
    return user

# Write-Through: write to the DB, then refresh the cache in the same operation
def update_user(user_id, data):
    db.update("UPDATE users SET ... WHERE id = ?", user_id, data)
    redis.setex(f"user:{user_id}", 3600, json.dumps(data))  # Always fresh
GCP: Memorystore for Redis and Memorystore for Memcached: same concept. Also Memorystore for Redis Cluster for large-scale sharding.
Azure Cache for Redis: same concept. Tiers: Basic (single node, no SLA), Standard (primary+replica), Premium (Redis Cluster, persistence, VNet injection), Enterprise (Redis Enterprise software, higher performance).
Monitoring & Observability
What is CloudWatch?
CloudWatch is AWS's unified observability platform. It collects metrics, logs, traces, and events from AWS services and your applications. Like a central nervous system for your AWS environment. Three pillars: Metrics (what's happening), Logs (what happened), Alarms (alert when something's wrong).
CloudWatch Metrics
Numeric data points over time. AWS services automatically push metrics: EC2 CPU%, RDS connections, ALB request count, Lambda errors. You can publish custom metrics from your application code.
| Metric | Service | What to monitor |
|---|---|---|
| CPUUtilization | EC2 | Alert if >80% sustained for 5min → need to scale |
| DatabaseConnections | RDS | Alert if near max_connections limit |
| RequestCount, TargetResponseTime | ALB | Alert on traffic spikes or high latency |
| Errors, Duration, Throttles | Lambda | Alert on elevated error rate or timeouts |
| ApproximateNumberOfMessagesVisible (queue depth) | SQS | Alert if messages accumulating (consumers slow) |
| BucketSizeBytes, NumberOfObjects | S3 | Storage growth tracking (daily granularity) |
CloudWatch Logs
Centralized log storage and analysis. Logs are organized in Log Groups (one per app/service), which contain Log Streams (one per instance/invocation). Lambda, ECS, and other services push logs automatically. EC2 needs the CloudWatch Agent installed to push logs.
# CloudWatch Agent config (simplified): push /var/log/nginx/access.log
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/nginx/access.log",
          "log_group_name": "/ec2/nginx/access",
          "log_stream_name": "{instance_id}",
          "timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
        }]
      }
    }
  }
}
CloudWatch Logs Insights
Query language for analyzing logs. Like SQL for your logs. Very useful for debugging:
# Find all Lambda errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
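# Average processing time per 5-minute bucket, for a log group whose entries
# include a targetProcessingTime field (ALB access logs themselves are delivered to S3)
fields @timestamp, targetProcessingTime
| stats avg(targetProcessingTime) as avgTime, count() as requests by bin(5m)
| sort avgTime desc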
CloudWatch Alarms
Trigger actions when a metric crosses a threshold. States: OK (metric within threshold), ALARM (metric breached threshold), INSUFFICIENT_DATA (not enough data yet).
Actions on ALARM: SNS notification (email/SMS), Auto Scaling (add/remove instances), EC2 action (stop/reboot/recover instance), Systems Manager action.
# Create alarm via CLI: alert if EC2 CPU > 80% for 2 consecutive 5-min periods
aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-ec2-web-01" \
  --alarm-description "CPU usage too high" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc12345 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:ap-south-1:123456789012:ops-alerts"
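How the three alarm states follow from `--period`, `--evaluation-periods`, and `--threshold` can be sketched as below. This is a simplified model with a hypothetical function name: it assumes every evaluated period must breach (no "M out of N" setting) and treats any missing datapoint as insufficient data, whereas the real service has configurable missing-data handling.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return the alarm state for the most recent `evaluation_periods` datapoints.

    `datapoints` holds one aggregated value per period, oldest first;
    None marks a period with no data.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    if all(p > threshold for p in recent):      # GreaterThanThreshold
        return "ALARM"
    return "OK"


# Average CPU % per 5-minute period
print(alarm_state([40, 85, 91], threshold=80, evaluation_periods=2))
print(alarm_state([85, 91, 60], threshold=80, evaluation_periods=2))
print(alarm_state([91], threshold=80, evaluation_periods=2))
```

Requiring two consecutive breaching periods is what keeps a single short CPU spike from paging anyone.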
CloudWatch Dashboards
Custom dashboards combining metrics from multiple services. Create a single pane view: EC2 CPU + RDS connections + ALB latency + Lambda errors + SQS queue depth. Share with team. Use as your operations wall display.
CloudWatch Events / EventBridge
Rule-based event routing. React to AWS service events or scheduled triggers. EventBridge is the evolution of CloudWatch Events: more powerful, supports custom event buses, third-party SaaS events, schema registry.
# EventBridge scheduled rule (simplified; fields as in a CloudFormation AWS::Events::Rule):
# trigger Lambda every day at 8 AM UTC (cron)
{
  "Name": "daily-report",
  "ScheduleExpression": "cron(0 8 * * ? *)",
  "Targets": [{"Id": "DailyReport", "Arn": "arn:aws:lambda:...daily-report"}]
}
# EventBridge rule: trigger when EC2 instance state changes to "stopped"
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {"state": ["stopped"]}
}
GCP: Cloud Monitoring (metrics + dashboards + alerting, like CloudWatch) | Cloud Logging (like CloudWatch Logs) | Cloud Trace (distributed tracing, like AWS X-Ray) | Cloud Profiler (continuous CPU/memory profiling of running apps). All under the Google Cloud Observability umbrella.
Azure Monitor (umbrella service: metrics, logs, alerts) | Log Analytics Workspace (like CloudWatch Logs Insights, uses KQL query language) | Application Insights (APM for apps, auto-traces HTTP, DB queries, exceptions; no direct AWS equivalent as a single managed service) | Azure Event Grid (like EventBridge).
Application Insights: Full APM (Application Performance Monitoring) with auto-instrumentation of .NET, Java, Node, Python apps. Tracks requests, dependencies, exceptions, performance counters, user flows, availability tests. AWS would need a combination of X-Ray + CloudWatch + third-party APM (Datadog, Dynatrace).
AWS CloudTrail
Records every AWS API call made in your account β via Console, CLI, SDK, or other AWS services. Who did what, when, from where. The audit trail for your entire AWS account. Enabled automatically but events only kept 90 days in CloudTrail console; create a Trail to send to S3 for long-term retention (required for compliance).
Trail Types
- Management Events: Control plane operations such as CreateBucket, RunInstances, DeleteUser. Enabled by default. Free for first copy.
- Data Events: Data plane operations such as S3 object reads/writes (PutObject, GetObject) and Lambda invocations. High volume, extra cost. Enable for critical resources.
- Insight Events: Detect unusual API activity (e.g., sudden spike in IAM calls). Extra cost but powerful anomaly detection.
# Example CloudTrail event: someone deleted an S3 bucket
{
  "eventTime": "2024-01-15T14:23:01Z",
  "eventName": "DeleteBucket",
  "userIdentity": {"type": "IAMUser", "userName": "john.doe"},
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-customer-data-backup"},
  "eventSource": "s3.amazonaws.com"
}
# => John deleted the production backup bucket from IP 203.0.113.45 at 2:23 PM UTC
AWS X-Ray: Distributed Tracing
X-Ray helps debug and analyze distributed applications (microservices). When a user request flows through API Gateway → Lambda → DynamoDB → SQS → another Lambda, X-Ray traces the entire journey, showing where latency comes from and where errors occur.
User Request (Total: 450ms)
|
+-- API Gateway: 5ms
|
+-- Lambda: process-order (380ms total)
|   +-- Init (cold start): 150ms   <-- performance problem!
|   +-- DynamoDB PutItem: 12ms
|   +-- SQS SendMessage: 8ms
|   +-- Execution: 210ms
|
+-- Response: 65ms

X-Ray shows: Cold start is causing 33% of total latency.
Fix: Enable Provisioned Concurrency on this Lambda.
To use X-Ray: add the X-Ray SDK to your app code, or enable active tracing on Lambda/API Gateway (no code changes). X-Ray automatically generates a service map showing all components and their interconnections.
GCP: Cloud Trace (distributed tracing, like X-Ray) | Cloud Audit Logs (like CloudTrail: Admin Activity, Data Access, System Event logs). Cloud Trace auto-instruments GCP services.
Azure: Application Insights Distributed Tracing (like X-Ray, part of App Insights) | Azure Monitor Activity Log (like CloudTrail: tracks all subscription-level operations).
DevOps Tools: IaC, CI/CD & Automation
What is CloudFormation?
AWS's native IaC service. Define your entire infrastructure in YAML or JSON templates. CloudFormation handles creation, update, and deletion of resources in the right order. Free: you only pay for the resources it creates.
CloudFormation Template Structure
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web Application Stack'
Parameters:                  # User inputs at deploy time
  EnvironmentName:
    Type: String
    Default: production
    AllowedValues: [development, staging, production]
  InstanceType:
    Type: String
    Default: t3.micro
Mappings:                    # Lookup tables (e.g., AMI IDs per region)
  RegionAMIMap:
    ap-south-1:
      AMI: ami-0abc12345
    us-east-1:
      AMI: ami-0xyz67890
Conditions:                  # Conditional resource creation
  IsProd: !Equals [!Ref EnvironmentName, production]
Resources:                   # Actual AWS resources (required)
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMIMap, !Ref 'AWS::Region', AMI]
      SecurityGroupIds: [!GetAtt WebSecurityGroup.GroupId]
      Tags:
        - Key: Environment
          Value: !Ref EnvironmentName
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP/HTTPS
      SecurityGroupIngress:
        - IpProtocol: tcp      # HTTP
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp      # HTTPS, to match the group description
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
  # Only create this in production
  ElasticIP:
    Type: AWS::EC2::EIP
    Condition: IsProd
    Properties:
      InstanceId: !Ref MyEC2Instance
Outputs:                     # Values returned after stack creation
  InstancePublicIP:
    Value: !GetAtt MyEC2Instance.PublicIp
    Export:
      Name: !Sub "${AWS::StackName}-PublicIP"
Key CloudFormation Concepts
Stacks & Stack Sets
A Stack is a deployed instance of a template (all the resources it creates). You update a stack by updating the template and running a changeset. StackSets deploy one template across multiple accounts and regions simultaneously, which is essential for large organizations.
Changesets
Preview what changes CloudFormation will make before actually making them. Shows: which resources will be added, modified, or deleted. Always review changesets before applying, and especially check for resource replacements (which cause downtime).
Drift Detection
Checks if actual resource state differs from what CloudFormation expects. If someone manually changed a Security Group that CloudFormation manages, drift detection finds it. Important for compliance and ensuring IaC is the source of truth.
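The idea behind drift detection can be sketched as a plain property-by-property comparison. This is a minimal illustration, not how CloudFormation implements it; the `detect_drift` helper and the two state dictionaries are hypothetical.

```python
def detect_drift(expected: dict, actual: dict) -> dict:
    """Return {property: (expected, actual)} for every property that differs."""
    drift = {}
    for key in sorted(set(expected) | set(actual)):
        if expected.get(key) != actual.get(key):
            drift[key] = (expected.get(key), actual.get(key))
    return drift


# Hypothetical security group: the template opens port 80 only,
# but someone added port 22 by hand in the console.
template_state = {"GroupDescription": "Allow HTTP", "IngressPorts": [80]}
live_state = {"GroupDescription": "Allow HTTP", "IngressPorts": [22, 80]}

drift = detect_drift(template_state, live_state)
print(drift)
```

An empty result means the live resource still matches the template; anything else is a candidate for either reverting the manual change or updating the template.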
!Ref and !GetAtt
Built-in (intrinsic) functions for referencing other resources within the template. !Ref MyBucket returns the bucket name. !GetAtt MyBucket.Arn returns the bucket ARN. !Sub "arn:aws:s3:::${MyBucket}/*" substitutes variables into a string.
CloudFormation vs Terraform (Key Differences)
| Aspect | CloudFormation | Terraform |
|---|---|---|
| Language | YAML/JSON | HCL (HashiCorp Configuration Language) |
| Cloud support | AWS only | Multi-cloud (AWS, GCP, Azure, 1000+ providers) |
| State management | AWS manages state (no state file) | State file (must manage securely in S3/Terraform Cloud) |
| Native AWS support | Usually supports new AWS services at or shortly after launch | Depends on provider update (usually within days) |
| Free | Yes | CLI is free (source-available under the Business Source License since 2023; Terraform Enterprise is paid) |
| Module system | Nested stacks (complex) | Modules (cleaner, community registry) |
| Drift detection | Built in | Shown by terraform plan (or terraform plan -refresh-only) |
| Industry adoption | AWS shops | Most popular IaC tool overall |
GCP: Deployment Manager (like CloudFormation, GCP-native, YAML/Jinja/Python) | Config Connector (manage GCP resources via Kubernetes CRDs) | In practice, Terraform is more commonly used in GCP environments than Deployment Manager.
Azure Resource Manager (ARM) Templates (like CloudFormation, JSON-based, verbose) | Bicep (ARM's modern replacement: cleaner syntax, transpiles to ARM JSON) | Azure Blueprints (for governance at scale: deploy policies + RBAC + resource groups together).
AWS CI/CD Toolchain Overview
AWS CodePipeline stages:
| Source | Build | Test | Deploy |
|---|---|---|---|
| CodeCommit, GitHub, Bitbucket, S3 | CodeBuild (compile, lint, docker build, push to ECR) | CodeBuild (unit tests, integration tests) | CodeDeploy → EC2, CodeDeploy → Lambda, ECS (Blue/Green), CloudFormation, Beanstalk |
Each stage has actions. Failure at any stage stops the pipeline.
CodeBuild: Build Service
Managed build server. Runs your build commands in a Docker container, compiles code, runs tests, creates artifacts. Defined in a buildspec.yml file at the root of your repo.
# buildspec.yml: defines build steps
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.11
    commands:
      - pip install -r requirements.txt
  pre_build:
    commands:
      - echo Logging into ECR...
      - aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=$COMMIT_HASH
  build:
    commands:
      - echo Running tests...
      - pytest tests/ --junitxml=test-results.xml
      - echo Building Docker image...
      - docker build -t $ECR_URI:$IMAGE_TAG .
  post_build:
    commands:
      - docker push $ECR_URI:$IMAGE_TAG
      - echo Build complete. Image $ECR_URI:$IMAGE_TAG
      # Write the file listed under artifacts ("web" is a placeholder container name)
      - printf '[{"name":"web","imageUri":"%s"}]' "$ECR_URI:$IMAGE_TAG" > imagedefinitions.json
artifacts:
  files:
    - imagedefinitions.json   # Read by the CodePipeline ECS deploy action
reports:
  TestResults:
    files:
      - test-results.xml
    file-format: JUNITXML
CodeDeploy: Deployment Service
Automates application deployments to EC2, on-premises servers, Lambda, and ECS. Handles rolling updates, blue/green deployments, automatic rollback on failure. Defined in appspec.yml.
# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /app
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_dependencies.sh
      timeout: 120
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 60
CodeDeploy Deployment Configurations
| Config | How it deploys | Downtime? |
|---|---|---|
| CodeDeployDefault.AllAtOnce | All instances simultaneously | Yes (if deploy fails) |
| CodeDeployDefault.HalfAtATime | 50% first, then 50% | Partial |
| CodeDeployDefault.OneAtATime | One instance at a time (slowest, safest) | No |
| Custom (e.g., 25% at a time) | Define your own batch size | Depends |
| Blue/Green (ECS/Lambda) | New version deployed alongside old, traffic shifted gradually | No, instant rollback |
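To make the batch behaviour in the table concrete, here is a small sketch of how a fleet might be split under each built-in configuration. It is a simplified model: the `plan_batches` helper and instance IDs are hypothetical, and it assumes HalfAtATime rounds half the fleet down. Real CodeDeploy also tracks minimum healthy hosts, which this ignores.

```python
def plan_batches(instances, strategy):
    """Return the list of batches a deployment would work through."""
    total = len(instances)
    if total == 0:
        return []
    if strategy == "AllAtOnce":
        size = total
    elif strategy == "HalfAtATime":
        size = max(1, total // 2)  # half the fleet, rounded down (assumption)
    elif strategy == "OneAtATime":
        size = 1
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [instances[i:i + size] for i in range(0, total, size)]


fleet = ["i-01", "i-02", "i-03", "i-04", "i-05"]
for strategy in ("AllAtOnce", "HalfAtATime", "OneAtATime"):
    print(strategy, plan_batches(fleet, strategy))
```

Fewer instances per batch means more capacity stays in service during the rollout, at the cost of a longer deployment.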
Elastic Beanstalk: PaaS Deploy
If you don't want to manage CI/CD pipelines at all, Elastic Beanstalk is AWS's PaaS. Upload your app code (zip), EB handles EC2 provisioning, Auto Scaling, Load Balancer, health monitoring, and rolling deploys. Runs on top of standard AWS services (you can still see and modify the EC2 instances). Great for smaller teams or migrating existing apps quickly. Less flexible than managing EC2/ECS directly.
GCP: Cloud Build (like CodeBuild) | Cloud Deploy (managed delivery to GKE/Cloud Run, with promotion through environments) | Artifact Registry (store build artifacts, Docker images)
Azure Pipelines (CI + CD in one service, like CodePipeline + CodeBuild + CodeDeploy combined, and more integrated) | Azure Artifacts (package/artifact storage) | GitHub Actions (Microsoft owns GitHub, hence the deep Azure integration)
What is AWS Systems Manager?
SSM is a collection of operational tools for managing your EC2 instances and on-premises servers at scale. Often overlooked but incredibly powerful for DevOps. It's a suite of services, not just one thing.
SSM Session Manager
Connect to EC2 instances via browser or CLI without opening port 22, running a bastion host, or managing SSH keys. The SSM Agent on the instance communicates outbound to the SSM service, so no inbound port is needed. Sessions are auditable: session activity can be logged to S3 or CloudWatch Logs.
# Connect to EC2 via SSM (no SSH key, no port 22)
aws ssm start-session --target i-0abc12345
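# Port forwarding via SSM (reach RDS in a private subnet through the instance)
# AWS-StartPortForwardingSession only forwards to a port on the instance itself;
# forwarding on to another host needs the ...ToRemoteHost document and a host parameter.
aws ssm start-session --target i-0abc12345 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.cluster.ap-south-1.rds.amazonaws.com"],"portNumber":["3306"],"localPortNumber":["13306"]}'
# Now: mysql -h 127.0.0.1 -P 13306 -u admin -p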
SSM Parameter Store
Store configuration values and secrets. Types: String, StringList, SecureString (KMS-encrypted). Use for: app config, database hostnames, feature flags, non-sensitive or mildly-sensitive parameters.
# Store a parameter
aws ssm put-parameter \
--name "/myapp/production/db-host" \
--value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
--type String
# Store an encrypted secret
aws ssm put-parameter \
--name "/myapp/production/api-key" \
--value "sk-abc123secret" \
--type SecureString \
--key-id alias/myapp-key
# Retrieve in code (Python)
import boto3
ssm = boto3.client('ssm')
# WithDecryption only matters for SecureString values; it is ignored for String
param = ssm.get_parameter(Name='/myapp/production/db-host', WithDecryption=True)
db_host = param['Parameter']['Value']
SSM Run Command
Run commands across multiple EC2 instances without SSH. Execute shell scripts, PowerShell, or Python across your entire fleet in seconds. Target instances by resource tags or resource groups: "Run this on all instances tagged Environment=production."
SSM Patch Manager
Automate OS patching across your fleet. Define patch baselines (which patches to apply, e.g., only Critical + High severity), maintenance windows (when to apply, e.g., 2 AM Sunday), and patch groups (which instances). Never manually SSH to patch 50 servers again.
SSM State Manager
Keep instances in a desired state. Define an association: "All prod instances must have the CWAgent installed and running." State Manager periodically checks and enforces this. If someone removes the agent, SSM reinstalls it.
Messaging & Decoupling
What is SQS?
SQS is a fully managed message queue service. It decouples producers (who send messages) from consumers (who process them). If your consumer is slow or down, messages accumulate safely in the queue until they are processed or the retention period expires (4 days by default, configurable up to 14). This is the classic async communication pattern.
WITHOUT SQS (Tight Coupling):
  Web App --HTTP--> Worker Service
  If Worker is slow/down: Web App blocks or errors
WITH SQS (Loose Coupling):
  Web App --SendMessage--> [SQS Queue] <--ReceiveMessage-- Worker Service
  - Web App returns immediately
  - Worker processes at its own pace
  - Queue buffers messages during spikes
  - Worker can scale independently
  - Messages survive worker crashes
SQS Key Concepts
Queue Types
Standard Queue
Nearly unlimited throughput. Best-effort ordering (usually in order, but not guaranteed). At-least-once delivery: a message may be delivered more than once, so make your consumer idempotent. Good for most use cases where order doesn't strictly matter.
FIFO Queue
Guaranteed order (First-In-First-Out) within a message group. Exactly-once processing (duplicates sent within the 5-minute deduplication interval are dropped). By default limited to 300 API calls/sec per action, which is 3,000 msg/sec with batches of 10; high-throughput mode raises this. For: financial transactions, order processing, inventory changes where sequence matters.
Visibility Timeout
When a consumer reads a message, it's hidden from other consumers for the visibility timeout period (default 30s, max 12h). The consumer must delete the message before timeout expires. If it doesn't (consumer crashed), the message becomes visible again for another consumer to process. Set visibility timeout to slightly longer than your max processing time.
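The visibility timeout is easier to follow with a toy model. The sketch below is an in-memory stand-in, not the SQS API: `ToyQueue` and its methods are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class ToyQueue:
    """In-memory stand-in for an SQS queue (illustration only)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> [body, time at which it is visible again]

    def send(self, msg_id, body):
        self.messages[msg_id] = [body, 0]

    def receive(self, now):
        for msg_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide from other consumers
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("m1", "resize-image-42")

first = queue.receive(now=0)         # consumer A gets the message
hidden = queue.receive(now=10)       # consumer B sees nothing: message is in flight
redelivered = queue.receive(now=31)  # A never deleted it, so it reappears
queue.delete("m1")                   # successful processing ends the cycle
after_delete = queue.receive(now=100)
```

The redelivery at t=31 is why consumers should be idempotent: a slow consumer and a second consumer can both end up processing the same message.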
Dead Letter Queue (DLQ)
If a message fails processing too many times (exceeds maxReceiveCount), SQS moves it to a Dead Letter Queue. DLQ lets you isolate and debug problematic messages without losing them. Always configure a DLQ for production queues; otherwise failed messages keep cycling until the retention period expires, consuming resources.
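A minimal sketch of the redrive rule, assuming the message is moved once its receive count exceeds maxReceiveCount. The `receive_with_redrive` helper and the message dictionary are hypothetical; SQS tracks the receive count for you.

```python
def receive_with_redrive(message, max_receive_count, dead_letter_queue):
    """Count one more receive; move the message to the DLQ once the count
    exceeds max_receive_count. Returns the message, or None if it was moved."""
    message["receive_count"] += 1
    if message["receive_count"] > max_receive_count:
        dead_letter_queue.append(message)
        return None
    return message


dlq = []
poison = {"body": "{not valid json", "receive_count": 0}

attempts = 0
while receive_with_redrive(poison, max_receive_count=3, dead_letter_queue=dlq) is not None:
    attempts += 1  # consumer tries to process, fails, and never deletes the message

print(attempts, len(dlq))
```

With maxReceiveCount set to 3, the consumer gets three attempts and the fourth receive diverts the message to the DLQ instead.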
Long Polling
When a consumer calls ReceiveMessage and the queue is empty, short polling returns immediately (wasteful API calls). Long polling waits up to 20 seconds for a message to arrive before returning empty. Reduces cost (fewer API calls) and reduces false-empty responses. Always use long polling (WaitTimeSeconds=20).
# Sending a message (Python boto3)
import json
import boto3

QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456/my-queue'
sqs = boto3.client('sqs')
response = sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'order_id': 'ORD-12345',
        'customer_id': 'CUST-789',
        'items': [{'product': 'laptop', 'qty': 1}]
    }),
    MessageAttributes={
        'EventType': {'StringValue': 'OrderPlaced', 'DataType': 'String'}
    }
)
# Consuming messages (long polling)
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,     # Long polling
        VisibilityTimeout=60    # 60s to process
    )
    for message in response.get('Messages', []):
        # process_order is your own handler; keep it idempotent
        process_order(json.loads(message['Body']))
        # Delete after successful processing
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message['ReceiptHandle']
        )
GCP: Cloud Pub/Sub acts as both a queue AND pub/sub. Pull subscriptions work like SQS (consumer polls). Push subscriptions push to an HTTP endpoint. At-least-once delivery. No native FIFO, but the ordering key feature ensures ordered delivery within a key.
Azure Service Bus (full-featured queue + pub/sub, like SQS + some SNS features: supports sessions for FIFO, dead-lettering, transactions) | Azure Queue Storage (simpler, cheaper, like a basic SQS standard queue; default message TTL of 7 days vs Service Bus's 14 days).
Amazon SNS: Simple Notification Service
SNS is a publish/subscribe (pub/sub) messaging service. A publisher sends a message to a Topic, and SNS fans it out to all subscribers simultaneously. One message → many consumers. Perfect for: fanout pattern, notifications, decoupled event broadcasting.
SNS Subscribers
A topic can have multiple subscribers of different types: SQS queue, Lambda function, HTTP/HTTPS endpoint, Email, SMS, Mobile Push (APNs, FCM), Kinesis Data Firehose.
Order Service publishes "OrderPlaced" event to SNS Topic
                      |
      +---------------+---------------+
      v               v               v
  SQS Queue       Lambda Fn       SQS Queue
  (Inventory      (Send email     (Analytics
   Service)        confirmation)   Service)
All three consumers receive the same message independently.
If one consumer is down, others still get the message.
SNS vs SQS: Key Difference
| Feature | SNS (Pub/Sub) | SQS (Queue) |
|---|---|---|
| Pattern | 1 publisher → many subscribers (fanout) | Producers → queue → one consumer per message |
| Message persistence | No persistence (if no subscriber, message lost) | Persists up to 14 days |
| Consumers | Multiple, all receive the message | One consumer per message (competing consumers) |
| Pull vs Push | Push to subscribers | Consumer pulls |
| Best for | Broadcast notifications, fanout, alerting | Task queue, work distribution, decoupling |
SNS + SQS Fanout Pattern
The most common real-world pattern: SNS pushes to multiple SQS queues. This gives you fanout (SNS) with durability and retry (SQS):
# Architecture:
# New Product Added -> SNS Topic "product-events"
#   -> SQS "inventory-queue" -> Inventory Lambda
#   -> SQS "search-index-queue" -> Search Index Lambda
#   -> SQS "notification-queue" -> Push Notification Lambda
# If Search Index Lambda is down: messages buffer in search-index-queue
# Inventory and Push Notification still work independently
# When Search Lambda recovers: processes all buffered messages
# This is the gold standard for reliable event-driven microservices.
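The fanout behaviour described above can be modelled in a few lines. This is an in-memory sketch only: `ToyTopic` is a hypothetical stand-in for an SNS topic, and each subscriber queue is a plain `deque` rather than a real SQS queue.

```python
from collections import deque


class ToyTopic:
    """Stand-in for an SNS topic whose subscribers are all queues."""

    def __init__(self):
        self.subscriber_queues = {}

    def subscribe(self, name):
        self.subscriber_queues[name] = deque()
        return self.subscriber_queues[name]

    def publish(self, message):
        # Every subscriber queue gets its own copy of the message
        for subscriber_queue in self.subscriber_queues.values():
            subscriber_queue.append(dict(message))


topic = ToyTopic()
inventory = topic.subscribe("inventory-queue")
search = topic.subscribe("search-index-queue")
notify = topic.subscribe("notification-queue")

topic.publish({"event": "ProductAdded", "sku": "SKU-1"})
topic.publish({"event": "ProductAdded", "sku": "SKU-2"})

# The inventory consumer is healthy and drains its queue...
inventory_seen = [inventory.popleft()["sku"] for _ in range(len(inventory))]
# ...while the search consumer is down, so its messages simply wait.
search_backlog = len(search)
```

Because each consumer has its own queue, one consumer being down only grows that consumer's backlog; the others are unaffected.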
Amazon EventBridge
An event bus service for building event-driven applications. More powerful than SNS for complex routing: you can filter events by content, transform them, route to 20+ AWS services, connect to third-party SaaS apps (Salesforce, Zendesk, Datadog), and create custom event buses per service.
- Default Event Bus: Receives AWS service events (EC2 state changes, CodePipeline updates, etc.)
- Custom Event Bus: For your own application events. Publish events from your microservices here.
- Partner Event Bus: Receive events from SaaS partners (Shopify orders, GitHub events)
# Publish a custom event to EventBridge
import json
import boto3

events = boto3.client('events')
events.put_events(
    Entries=[{
        'Source': 'com.mycompany.orders',
        'DetailType': 'OrderPlaced',
        'Detail': json.dumps({'orderId': 'ORD-123', 'total': 599.99}),
        'EventBusName': 'my-app-events'
    }]
)
# EventBridge rule routes this to: Lambda for fulfillment,
# SQS for analytics, another EventBridge bus in a different account
# Based on content: {"source": ["com.mycompany.orders"], "detail-type": ["OrderPlaced"]}
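Content-based filtering can be sketched as a recursive match of an event pattern against an event. This is a deliberately reduced model: it handles only exact-value lists and nested objects, not prefix, numeric, or other operators, and the `matches` helper is hypothetical.

```python
def matches(pattern, event):
    """True if every field in the pattern is present in the event and the
    event's value is one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: descend into the matching sub-object
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


order_event = {
    "source": "com.mycompany.orders",
    "detail-type": "OrderPlaced",
    "detail": {"orderId": "ORD-123", "total": 599.99},
}

order_rule = {"source": ["com.mycompany.orders"], "detail-type": ["OrderPlaced"]}
billing_rule = {"source": ["com.mycompany.billing"]}
detail_rule = {"detail": {"orderId": ["ORD-123"]}}

print(matches(order_rule, order_event), matches(billing_rule, order_event))
```

A rule only fires when all of its fields match, which is what lets several rules on one bus each select a different slice of the event stream.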
Amazon Kinesis: Real-Time Streaming
For high-throughput, real-time data streaming. Unlike SQS (a queue, where messages are consumed and deleted), Kinesis retains data as a stream that multiple consumers can read from. Think of it as a real-time data pipeline.
| Service | What it does | Use case |
|---|---|---|
| Kinesis Data Streams | Real-time data stream. Shards provide throughput (1MB/s write per shard). Multiple consumers. Retain 1-365 days. | Real-time clickstream, app logs, IoT telemetry |
| Kinesis Data Firehose | Fully managed ETL: stream data directly to S3, Redshift, OpenSearch, Splunk. Auto-scales, buffers, compresses, transforms. | Load streaming data to S3 data lake or Redshift without code |
| Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | Run SQL or Apache Flink on streaming data in real-time | Real-time dashboards, anomaly detection, aggregations |
| MSK (Managed Kafka) | Fully managed Apache Kafka. For teams that need Kafka compatibility. | Kafka migration, complex event streaming, ecosystem tools |
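Shard sizing for Kinesis Data Streams is simple arithmetic. The sketch below uses the 1 MB/s per-shard write figure from the table; the 1,000 records/s write and 2 MB/s read per-shard figures are provisioned-mode limits as I understand them, so treat them as assumptions and check current quotas. The `shards_needed` helper is hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # from the table above
WRITE_RECORDS_PER_SHARD = 1000   # assumption: provisioned-mode limit
READ_MB_PER_SHARD = 2.0          # assumption: shared-throughput read limit


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec=0.0):
    """Smallest shard count that satisfies every per-shard limit."""
    by_write_volume = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_record_rate = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_volume = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, by_write_volume, by_record_rate, by_read_volume)


# Clickstream: 5,000 small events per second totalling 2.5 MB/s
clickstream_shards = shards_needed(write_mb_per_sec=2.5, records_per_sec=5000)
print(clickstream_shards)
```

Here the record rate, not the byte volume, is the binding limit, which is common for streams of many small events.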
GCP: Pub/Sub (handles both SNS and SQS use cases, with push and pull modes). Dataflow (like Kinesis Data Analytics, uses Apache Beam). Pub/Sub Lite (lower cost, regional, like Kinesis for ordered streams).
Azure Event Grid (like EventBridge: event routing, serverless, pay-per-event) | Azure Event Hubs (like Kinesis Data Streams: high-throughput event streaming with a Kafka-compatible API) | Azure Service Bus Topics (like SNS: pub/sub with filtering)
Azure Event Hubs Kafka-compatible API: You can use your existing Apache Kafka clients to produce/consume from Event Hubs without code changes. Just change the broker endpoint and connection settings. AWS MSK also offers Kafka compatibility, but Event Hubs combines a serverless model with a Kafka-compatible endpoint in a single PaaS service.
IAM & Security Services
What is IAM?
IAM is AWS's centralized service for controlling who can do what on which AWS resources. It's global (not region-specific) and free. Every API call to AWS is authenticated and authorized through IAM. No IAM permission means the API call is denied, period.
IAM Entities
IAM Users
A person or application that needs permanent, long-term credentials to interact with AWS. Has a username + password (console) and/or access key + secret key (programmatic). Best practice: don't use the root account; create individual IAM users. Even better: use IAM Identity Center (SSO) for humans.
IAM Groups
A collection of IAM users. Attach policies to the group and all members inherit those permissions. Groups contain only users: you can't add roles to a group or nest groups inside one another. Simplifies permission management: add a user to the "Developers" group and they get all developer permissions.
IAM Roles
An IAM identity without permanent credentials. Instead, when something assumes a role, it gets temporary security credentials (valid minutes to hours). Used by: EC2 instances (instead of hardcoded keys), Lambda functions, cross-account access, federated users (SSO), ECS tasks. This is the correct way for AWS services to access other services: never hardcode access keys in code.
EC2 instance needs to write to S3
------------------------------------------------------------------
BAD:  Hardcode access_key + secret in app -> leaked in Git -> disaster
------------------------------------------------------------------
GOOD: EC2 IAM Role with s3:PutObject permission:
  IAM Role "EC2-S3-Writer" --attached to--> EC2 Instance
    |
    +-- Policy: Allow s3:PutObject on arn:aws:s3:::my-bucket/*
Inside EC2: AWS SDK auto-fetches temporary credentials from IMDS
  http://169.254.169.254/latest/meta-data/iam/security-credentials/EC2-S3-Writer
  -> Access Key (temp), Secret Key (temp), Session Token, Expiration
  -> SDK auto-refreshes these before expiry
IAM Policies
JSON documents defining permissions. A policy has one or more statements, each with:
- Effect: Allow or Deny. Explicit Deny always wins over Allow.
- Action: API operations (e.g., s3:GetObject, ec2:*)
- Resource: ARN of the specific resource (or * for all)
- Condition: Optional restrictions (e.g., only from this IP, only with MFA)
- Principal: (In resource-based policies) Who the policy applies to
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteUnlessMFA",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-app-bucket/*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    }
  ]
}
Policy Types
| Type | Attached to | Managed by | Use case |
|---|---|---|---|
| AWS Managed Policy | User, Group, Role | AWS creates & updates | Common permission sets: AmazonS3ReadOnlyAccess, AdministratorAccess |
| Customer Managed Policy | User, Group, Role | You create & manage | Custom permissions for your org. Reusable. Versionable. |
| Inline Policy | Single User, Group, or Role | You create, embedded in identity | Strict 1:1 relationship. Deleted when identity deleted. Avoid when possible. |
| Resource-based Policy | The resource itself (S3 bucket, SQS queue, Lambda) | You create on the resource | Grant cross-account access without assuming a role. Used for S3 bucket policies, Lambda resource policies. |
| Permission Boundary | User or Role | Admin sets max permissions ceiling | Delegate IAM permission management to devs but cap what they can grant. |
| Service Control Policy (SCP) | AWS Org OUs or accounts | Org admin | Maximum permissions guardrails across entire AWS accounts. "Nobody in this account can touch us-west-1." |
IAM Policy Evaluation Logic
When a request is made, AWS evaluates all applicable policies:
API request arrives
        |
        v
1. Explicit DENY anywhere?          YES --> DENY (stops here)
        | NO
        v
2. SCP allows? (if AWS Org)         NO  --> DENY
        | YES
        v
3. Resource-based policy allows?    YES --> (may ALLOW without identity policy)
        | NO
        v
4. Permission Boundary allows?      NO  --> DENY
        | YES
        v
5. Identity policy allows?          YES --> ALLOW
        | NO
        v
DENY (implicit: default deny everything)
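The decision flow above translates almost line for line into code. This sketch mirrors the simplified five-step flow in these notes and nothing more: it omits session policies and the extra rules for cross-account requests, and the `evaluate_request` function and its flags are hypothetical.

```python
def evaluate_request(*, explicit_deny=False, scp_allows=True,
                     resource_policy_allows=False, boundary_allows=True,
                     identity_policy_allows=False):
    """Walk the simplified evaluation order and return the decision."""
    if explicit_deny:
        return "DENY (explicit)"
    if not scp_allows:
        return "DENY (SCP)"
    if resource_policy_allows:
        return "ALLOW (resource-based policy)"
    if not boundary_allows:
        return "DENY (permission boundary)"
    if identity_policy_allows:
        return "ALLOW (identity policy)"
    return "DENY (implicit)"


# An identity policy allows the call, but an SCP blocks it for the whole account
print(evaluate_request(identity_policy_allows=True, scp_allows=False))
# Nothing grants access at all: the default answer is deny
print(evaluate_request())
```

The ordering is the point: an Allow anywhere is only reached if no earlier check has already denied the request.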
Cross-Account Access
Account A's EC2 wants to access Account B's S3 bucket. Process:
- Account B creates an IAM Role with a trust policy allowing Account A to assume it
- Account B's role has the S3 permissions needed
- Account A's EC2 role has an identity policy allowing sts:AssumeRole on Account B's role
- Account A's EC2 calls sts:AssumeRole for Account B's role
- It gets temporary credentials for Account B and can now access Account B's S3
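# Trust policy on Account B's role (who can assume it).
# The principal is Account A's EC2 role. JSON has no comment syntax, so keep notes outside the document.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111111111111:role/EC2-Role"},
    "Action": "sts:AssumeRole"
  }]
}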
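# Account A EC2 assuming Account B's role (boto3):
import boto3
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::222222222222:role/S3-Access-Role',  # Account B
    RoleSessionName='my-session'
)
creds = response['Credentials']
# Use creds to create an S3 client for Account B
s3 = boto3.client(
    's3',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)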
GCP: Cloud IAM. Key differences: GCP uses Service Accounts (like IAM roles but with an email identity; they can be granted access to specific resources). GCP IAM is more resource-centric (bind roles to resources). No inline policies: roles are always separate entities. Workload Identity = IAM roles for GKE pods.
Azure Active Directory (Azure AD / Entra ID) for identity + Azure RBAC for access control. Azure uses Entra ID for both human users and service principals (like IAM roles). Managed Identities = IAM roles for Azure VMs/Functions. Azure RBAC assigns built-in or custom roles to identities at various scopes (management group, subscription, resource group, resource).
Azure Entra ID (Active Directory): Much more feature-rich identity provider than AWS IAM. It supports OAuth 2.0, SAML, OIDC federation with thousands of apps natively, Conditional Access policies (block login from outside the country), Privileged Identity Management (JIT access). AWS equivalent would be IAM Identity Center + Cognito combined, with less enterprise AD integration.
AWS KMS: Key Management Service
KMS manages cryptographic keys used to encrypt your data. You never handle raw key material: KMS keeps keys secure inside Hardware Security Modules (HSMs). Services like S3, EBS, RDS, Secrets Manager all use KMS keys to encrypt data.
KMS Key Types
- AWS Managed Keys: Free. AWS creates them and manages rotation. Automatically used by services (e.g., the aws/s3 key for S3 SSE). Less control: you can't change rotation or grant cross-account access.
- Customer Managed Keys (CMK): You create, own, and manage. $1/month/key. Control rotation (optional, annual). Can grant cross-account usage. Needed for compliance where you must control the key.
- AWS CloudHSM: Dedicated hardware HSM. You control the keys completely, AWS has no access. Most expensive, highest compliance. Used for PCI-DSS, FIPS 140-2 Level 3 requirements.
# Encrypt data with KMS (AWS CLI)
aws kms encrypt \
--key-id arn:aws:kms:ap-south-1:123456789:key/abc-123 \
--plaintext fileb://secret.txt \
--output text --query CiphertextBlob | base64 --decode > encrypted.bin
# Decrypt
aws kms decrypt \
--ciphertext-blob fileb://encrypted.bin \
--output text --query Plaintext | base64 --decode
Envelope Encryption
KMS uses envelope encryption: a Data Encryption Key (DEK) is generated to encrypt your actual data. The DEK itself is encrypted by the KMS CMK. Only the encrypted DEK is stored with the data. To decrypt: call KMS to decrypt the DEK, use plaintext DEK to decrypt data. This way, large amounts of data never pass through KMS API.
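The envelope flow can be traced end to end with a toy model. Everything here is a stand-in: `toy_cipher` is an insecure XOR keystream used only so the example runs on the standard library (a real system would use something like AES-GCM), and `kms_generate_data_key` / `kms_decrypt` are hypothetical local functions that imitate the role KMS plays.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. NOT secure; a placeholder
    for a real cipher. Applying it twice with the same key decrypts."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


master_key = secrets.token_bytes(32)  # lives only inside "KMS"


def kms_generate_data_key():
    """Return a fresh DEK in plaintext plus the same DEK encrypted under the master key."""
    plaintext_dek = secrets.token_bytes(32)
    return plaintext_dek, toy_cipher(master_key, plaintext_dek)


def kms_decrypt(encrypted_dek: bytes) -> bytes:
    return toy_cipher(master_key, encrypted_dek)


# Encrypt: the bulk data is encrypted locally with the DEK, never sent to KMS
plaintext_dek, encrypted_dek = kms_generate_data_key()
document = b"large payload " * 1000
stored = {"ciphertext": toy_cipher(plaintext_dek, document), "encrypted_dek": encrypted_dek}
del plaintext_dek  # only the encrypted DEK is kept alongside the data

# Decrypt: one small KMS call recovers the DEK, then the data is decrypted locally
recovered_dek = kms_decrypt(stored["encrypted_dek"])
recovered = toy_cipher(recovered_dek, stored["ciphertext"])
```

Only the 32-byte DEK ever crosses the "KMS" boundary, which is why envelope encryption scales to large objects.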
AWS Secrets Manager
Store, manage, and rotate secrets (DB passwords, API keys, OAuth tokens). Secrets are encrypted at rest via KMS. Applications retrieve secrets via API, so there are no plaintext secrets in code or environment variables.
- Automatic rotation: Secrets Manager can automatically rotate DB credentials (works natively with RDS). It creates a new password, updates the DB, and stores the new secret, all without downtime.
- Versioning: Keeps previous versions during rotation (AWSPREVIOUS stage) so apps using old password still work briefly while they update.
- Cost: $0.40/secret/month + $0.05 per 10,000 API calls.
# Retrieve secret in Python (boto3)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
response = client.get_secret_value(SecretId='prod/myapp/db-credentials')
secret = json.loads(response['SecretString'])
db_password = secret['password'] # Fresh from Secrets Manager, never hardcoded
AWS Systems Manager Parameter Store
Lightweight configuration and secrets storage. Two tiers (Standard and Advanced), plus the SecureString type for encrypted values:
- Standard Parameters: Free. Up to 4KB. Good for non-sensitive config (app settings, feature flags, environment config).
- Advanced Parameters: $0.05/param/month. Up to 8KB, parameter policies (TTL, auto-notification when approaching expiry).
- SecureString Parameters: A parameter type rather than a tier. Encrypted with KMS. Good for secrets that don't need rotation. No extra cost beyond KMS calls.
Secrets Manager vs Parameter Store: Use Secrets Manager when you need automatic rotation. Use Parameter Store for config, non-sensitive data, or cost-sensitive secrets (it's free for standard).
# Store a parameter (CLI)
aws ssm put-parameter \
--name "/myapp/prod/db-host" \
--value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
--type "String"
aws ssm put-parameter \
--name "/myapp/prod/db-password" \
--value "SuperSecret123!" \
--type "SecureString" # Encrypted with KMS
# Retrieve in app
aws ssm get-parameter --name "/myapp/prod/db-password" --with-decryption
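The cost side of the Secrets Manager vs Parameter Store choice is simple arithmetic using the prices quoted in these notes. The function names are hypothetical, and the model ignores KMS request charges and any Parameter Store API throughput fees, so treat the result as a rough comparison only.

```python
def secrets_manager_monthly_cost(num_secrets, api_calls):
    """$0.40 per secret per month plus $0.05 per 10,000 API calls."""
    return num_secrets * 0.40 + (api_calls / 10_000) * 0.05


def parameter_store_monthly_cost(num_standard, num_advanced):
    """Standard parameters are free; advanced cost $0.05 each per month."""
    return num_standard * 0.0 + num_advanced * 0.05


# 20 secrets read one million times a month
sm_cost = secrets_manager_monthly_cost(num_secrets=20, api_calls=1_000_000)
ps_cost = parameter_store_monthly_cost(num_standard=20, num_advanced=0)
print(f"Secrets Manager: ${sm_cost:.2f}  Parameter Store: ${ps_cost:.2f}")
```

The gap is what you pay for managed rotation; if you do not need rotation, the free standard tier is usually enough.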
GCP: Cloud KMS (key management, like AWS KMS) | Secret Manager (like Secrets Manager: stores secrets, automatic versioning, access via API). GCP Cloud HSM is part of Cloud KMS. No direct equivalent to SSM Parameter Store; Secret Manager serves both use cases.
Azure Key Vault: unified service for secrets, keys, AND certificates. Equivalent to AWS KMS + Secrets Manager combined. Key Vault also manages TLS/SSL certificates with automatic renewal. Azure Dedicated HSM = AWS CloudHSM equivalent.
Azure Key Vault Certificates: natively manages TLS certificates (creation, renewal, storage) in one service. AWS equivalent requires ACM (certificates) + KMS (keys) + Secrets Manager (secrets) as separate services.
AWS WAF: Web Application Firewall
Protects web applications from common web exploits (OWASP Top 10): SQL injection, XSS, bad bots, malformed requests. Deployed in front of CloudFront, ALB, API Gateway, or AppSync. You define Web ACLs with rules.
- Managed Rule Groups: Pre-built rule sets from AWS or AWS Marketplace (e.g., "AWS Managed Rules - Core rule set" blocks common OWASP attacks).
- Custom Rules: Block requests matching your logic (rate limiting by IP, block specific user agents, geo-blocking).
- Rate-based rules: Automatically block IPs exceeding X requests per 5 minutes.
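# Rate-based rule example (CloudFormation-style fragment of a WAFv2 Web ACL; rule name is a placeholder):
# Block any IP that sends more than 2000 requests per 5 minutes
Rules:
  - Name: RateLimitPerIP
    Priority: 1
    Action:
      Block: {}
    Statement:
      RateBasedStatement:
        Limit: 2000
        AggregateKeyType: IP
    VisibilityConfig:
      SampledRequestsEnabled: true
      CloudWatchMetricsEnabled: true
      MetricName: RateLimitPerIP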
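What a rate-based rule does can be approximated with a sliding window per client IP. This is a conceptual sketch, not AWS WAF's implementation (WAF evaluates rates periodically rather than on every request); the `RateBasedRule` class is hypothetical and the tiny limit is only there to keep the example short.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Block an IP once it exceeds `limit` requests within the window."""

    def __init__(self, limit, window_seconds=300):
        self.limit = limit
        self.window_seconds = window_seconds
        self.requests = defaultdict(deque)  # ip -> timestamps inside the window

    def check(self, ip, now):
        """Record one request and return 'ALLOW' or 'BLOCK'."""
        window = self.requests[ip]
        # Drop timestamps that have aged out of the window
        while window and window[0] <= now - self.window_seconds:
            window.popleft()
        window.append(now)
        return "BLOCK" if len(window) > self.limit else "ALLOW"


rule = RateBasedRule(limit=3)  # the real rule above would use limit=2000
decisions = [rule.check("203.0.113.7", now=t) for t in (0, 1, 2, 3)]
other_client = rule.check("198.51.100.9", now=3)
after_cooldown = rule.check("203.0.113.7", now=400)
print(decisions, other_client, after_cooldown)
```

Counting per IP is what `AggregateKeyType: IP` expresses: one noisy client is blocked without affecting anyone else, and the block lifts once its request rate falls back under the limit.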
AWS Shield: DDoS Protection
- Shield Standard: Free, automatically enabled for all AWS customers. Protects against common L3/L4 DDoS attacks (SYN floods, UDP reflection attacks). Integrated with CloudFront and Route 53.
- Shield Advanced: $3,000/month per organization. Protects against large sophisticated DDoS. Includes 24/7 access to the Shield Response Team (SRT, formerly the DDoS Response Team), real-time attack visibility, and cost protection (AWS credits if your bill spikes due to scaling during a DDoS attack).
Amazon GuardDuty: Threat Detection
Continuous security monitoring service that analyzes: VPC Flow Logs, CloudTrail API logs, DNS logs, and optionally EKS audit logs and S3 data events. Uses ML to detect threats like: EC2 cryptomining, root credential usage, unusual API calls from unknown IPs, port scanning, compromised credentials accessing S3.
GuardDuty doesn't block anything: it generates findings (alerts) with severity levels (low/medium/high). You automate responses via EventBridge → Lambda (e.g., auto-isolate compromised instance by removing from security groups).
Amazon Inspector
Automated vulnerability scanning for EC2 instances, container images, and Lambda functions. Continuously scans for OS package vulnerabilities (CVEs), network exposure issues, software vulnerabilities. Integrates with ECR to scan images on push. Different from GuardDuty (runtime threat detection): Inspector is about vulnerability assessment.
GCP: Cloud Armor (= WAF + DDoS, like WAF + Shield combined) | Security Command Center (SCC) (threat detection, vulnerability findings, like GuardDuty + Inspector combined) | Container Analysis (vulnerability scanning in Artifact Registry, like ECR + Inspector).
Azure WAF (via Front Door or Application Gateway) | Azure DDoS Protection Standard (like Shield Advanced) | Microsoft Defender for Cloud (threat detection + vulnerability assessment, like GuardDuty + Inspector + more) | Microsoft Sentinel (SIEM/SOAR, no direct AWS equivalent).
Microsoft Sentinel: A full SIEM/SOAR platform that ingests logs from Azure + on-prem + multi-cloud + third-party tools, uses ML for threat hunting, and automates playbooks. AWS equivalent would be custom-built using CloudTrail + GuardDuty + Macie + Security Hub + custom Lambda playbooks. Sentinel is more turnkey.
Databases
What is RDS?
RDS is a managed relational database service. AWS handles: provisioning hardware, installing the DB engine, patching, backups, monitoring, Multi-AZ failover. You focus on your schema and queries. Supports: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.
RDS Key Features
Multi-AZ
Your primary DB runs in one AZ. A standby replica runs in a different AZ, synchronously receiving every write. If the primary fails, AWS automatically promotes the standby and updates the DNS endpoint within 1-2 minutes. Your app reconnects to the same endpoint, with no code changes. In a Multi-AZ instance deployment the standby is NOT readable; it exists purely for failover. For read scale, use Read Replicas.
MULTI-AZ (for HA/failover):
  App --> RDS Endpoint
            |
            v
  Primary DB (AZ-a) --sync repl--> Standby DB (AZ-b) [not readable]
  Failover: ~60-120 seconds, auto
  Standby ONLY for failover

READ REPLICAS (for read scale):
  App --> Primary (write endpoint)
            |
            +--async repl--> Read Replica 1
            +--async repl--> Read Replica 2
            +--async repl--> Read Replica (another region)
  Read: use separate read endpoint
  Slight replication lag (async)
Read Replicas
- Up to 15 read replicas per source on Aurora; RDS MySQL/MariaDB/PostgreSQL originally allowed 5 and now also allow up to 15
- Async replication, so slight lag is possible. Apps must tolerate slightly stale reads.
- Can be in same AZ, different AZ, or different region (Cross-Region Read Replica)
- Can be promoted to standalone (good for DR); promotion breaks replication
- Useful for: analytics queries, reporting, geographically close reads
Automated Backups & Snapshots
- Automated backups: Daily snapshot + transaction logs. Retention: 0-35 days. Enables point-in-time recovery (PITR). Free storage equal to DB size.
- Manual snapshots: You trigger them. Stored until you delete. Survive DB deletion. Good for: pre-migration checkpoints, long-term retention.
- Restore: Creates a NEW DB instance from the backup (doesn't restore in-place). Update your app's endpoint.
RDS Proxy
A managed connection pool between your app and RDS. Databases have limited connections (MySQL's default max_connections scales with instance memory, so the smallest instance classes allow fewer than 100). Lambda functions scale to thousands of concurrent invocations; without RDS Proxy, they'd exhaust DB connections. RDS Proxy maintains a warm pool and multiplexes application connections. It also improves failover time (connections are held during failover, reducing app errors).
What is Aurora?
Aurora is AWS's proprietary cloud-native relational database, compatible with MySQL and PostgreSQL. It's NOT just a managed MySQL: AWS redesigned the storage layer from scratch. Result, according to AWS: up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL. Higher cost than standard RDS (~20%) but typically worth it for production workloads.
Aurora Architecture
Aurora's storage is completely separate from the compute (DB instances). Storage is a distributed, fault-tolerant, self-healing cluster spread across 3 AZs x 2 copies = 6 copies of your data. It can lose 2 copies without losing write availability, and 3 copies without losing read availability.
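Those tolerance figures follow from quorum arithmetic. A minimal sketch, assuming the 4-of-6 write quorum and 3-of-6 read quorum described in AWS's published Aurora design (the function name and constants are illustrative, not an AWS API):

```python
TOTAL_COPIES = 6   # 3 AZs x 2 copies
WRITE_QUORUM = 4   # copies that must acknowledge a write (assumption from the Aurora design)
READ_QUORUM = 3    # copies needed to read/repair data (assumption from the Aurora design)


def aurora_availability(copies_lost):
    """Report whether writes and reads are still possible after losing copies."""
    if not 0 <= copies_lost <= TOTAL_COPIES:
        raise ValueError("copies_lost must be between 0 and 6")
    healthy = TOTAL_COPIES - copies_lost
    return {
        "healthy_copies": healthy,
        "writes_available": healthy >= WRITE_QUORUM,
        "reads_available": healthy >= READ_QUORUM,
    }


if __name__ == "__main__":
    for lost in range(5):
        print(lost, aurora_availability(lost))
```

Losing a whole AZ removes 2 copies, so writes continue; losing an AZ plus one more copy removes 3, so reads still work but writes pause until the storage layer repairs itself.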
+-----------------------------------------------------------------+
|                         AURORA CLUSTER                          |
|                                                                 |
|  Writer Instance (Primary) --+                                  |
|  Reader Instance 1 ----------+--> Shared Storage Cluster        |
|  Reader Instance 2 ----------+    (6 copies, 3 AZs)             |
|                                                                 |
|      AZ-1: [Data Copy 1] [Data Copy 2]                          |
|      AZ-2: [Data Copy 3] [Data Copy 4]                          |
|      AZ-3: [Data Copy 5] [Data Copy 6]                          |
+-----------------------------------------------------------------+
Failover: ~30 seconds (promote a reader; readers already share the same storage, so no data copy is needed)
Aurora Features
Aurora Serverless v2
Aurora capacity auto-scales in fine-grained increments (0.5 ACU steps) based on actual load, within seconds. You define min/max ACUs and pay per second of capacity used. Ideal for: variable workloads, dev/test, new apps with unpredictable traffic. Can scale from nearly zero to 128 ACUs (about 256GB RAM) in seconds.
Aurora Global Database
One primary region with up to 5 secondary read-only regions. Replication lag < 1 second globally (uses AWS's dedicated infrastructure, not the internet). Used for: global read scale and DR (RPO < 1s, RTO < 1 minute: just promote a secondary to primary). Unlike standard cross-region read replicas, Global Database can keep up with replication even under high write load.
Aurora Backtrack
MySQL-compatible only. Rewind the DB to a point in the past without restoring from backup. Goes back in time by replaying the storage log. Can backtrack up to 72 hours. Near-instant: takes seconds, versus hours for a restore. Useful for: "oops we just ran DELETE without WHERE."
Cloud SQL (managed MySQL, PostgreSQL, SQL Server; like standard RDS) | Cloud Spanner (global, horizontally scalable relational DB with true horizontal write scale across regions and ACID transactions; no direct AWS equivalent, closest is Aurora Global Database plus Vitess-style sharding).
Azure SQL Database (managed SQL Server, like RDS SQL Server) | Azure Database for MySQL/PostgreSQL (like RDS MySQL/PostgreSQL) | Azure Cosmos DB for PostgreSQL (distributed PostgreSQL, like Citus; no direct AWS equivalent for this exact feature).
Cloud Spanner: Globally distributed, ACID-compliant relational DB that scales horizontally for writes across regions. Aurora Global Database scales reads globally, but writes go to a single region; Spanner scales both. The closest AWS options would be DynamoDB Global Tables (NoSQL) or CockroachDB on EC2.
What is DynamoDB?
DynamoDB is AWS's managed NoSQL key-value and document database. Fully serverless: no instances to size, automatic scaling, single-digit millisecond performance at any scale. Powers Amazon.com's shopping cart, Lyft's ride data, Duolingo's learning streak: workloads at massive scale.
DynamoDB Data Model
- Table: Collection of items (like a table in SQL, but schemaless)
- Item: A single record (like a row). Max 400KB per item.
- Attribute: A data field (like a column, but each item can have different attributes)
- Partition Key (PK): Required. Used to distribute data across partitions. Every Query or GetItem must supply the PK; only a full-table Scan can omit it, so design your access patterns around it.
- Sort Key (SK): Optional. Combined with PK = composite primary key. Enables range queries within a partition.
# Example DynamoDB table for an e-commerce app:
# PK = UserID, SK = OrderID
Items:
{ "UserID": "user123", "OrderID": "order001", "Status": "Delivered", "Total": 299.99 }
{ "UserID": "user123", "OrderID": "order002", "Status": "Shipped", "Total": 49.99 }
{ "UserID": "user456", "OrderID": "order003", "Status": "Pending", "Total": 799.00 }
# Query: All orders for user123 (efficient - same partition)
aws dynamodb query \
--table-name Orders \
--key-condition-expression "UserID = :uid" \
--expression-attribute-values '{":uid": {"S": "user123"}}'
Capacity Modes
| Mode | How it works | Best for |
|---|---|---|
| On-Demand | Pay per request (read/write request units). Scales automatically with traffic. No capacity planning. | New apps, variable traffic, unknown load. Slightly more expensive per request than provisioned at steady state. |
| Provisioned | You set RCUs (Read Capacity Units) and WCUs (Write Capacity Units). Can use Auto Scaling to adjust. Cheaper at steady state. May throttle if you exceed provisioned capacity. | Predictable steady workloads. Pair with Auto Scaling for some elasticity. |
Capacity Units
- 1 RCU = 1 strongly consistent read/sec (or 2 eventually consistent reads/sec) for items up to 4KB
- 1 WCU = 1 write/sec for items up to 1KB
- A 10KB item read with strong consistency = 3 RCUs. Same item, eventual consistency = 1.5 RCUs (round up = 2).
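The capacity-unit rules above reduce to a little rounding arithmetic. A minimal sketch (the function names are illustrative, not part of any AWS SDK):

```python
import math


def read_capacity_units(item_size_kb, strongly_consistent=True):
    """RCUs consumed by one read per second of an item of the given size."""
    blocks = math.ceil(item_size_kb / 4)  # reads are metered in 4 KB blocks
    return float(blocks) if strongly_consistent else blocks / 2


def write_capacity_units(item_size_kb):
    """WCUs consumed by one write per second of an item of the given size."""
    return math.ceil(item_size_kb / 1)  # writes are metered in 1 KB blocks


def provisioned_rcus(reads_per_sec, item_size_kb, strongly_consistent=True):
    """Whole RCUs to provision for a steady read rate (rounded up)."""
    per_read = read_capacity_units(item_size_kb, strongly_consistent)
    return math.ceil(reads_per_sec * per_read)


if __name__ == "__main__":
    print(read_capacity_units(10, strongly_consistent=True))
    print(read_capacity_units(10, strongly_consistent=False))
    print(provisioned_rcus(1, 10, strongly_consistent=False))
    print(write_capacity_units(2.5))
```

The first three calls reproduce the 10KB example from the list: 3 RCUs with strong consistency, 1.5 with eventual consistency, rounded up to 2 when provisioning.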
DynamoDB Advanced Features
Global Secondary Indexes (GSI)
Query on non-primary key attributes. A GSI has its own partition key + sort key (different from table's PK/SK) and its own capacity. Enables different access patterns without data duplication in your code.
# Table: PK=UserID, SK=OrderID
# Query by Status: not possible without an index (a full table scan is expensive)
# Add GSI: PK=Status, SK=CreatedAt; you can now query "all Pending orders, newest first"
# Values are case-sensitive: use "Pending" exactly as stored in the items
aws dynamodb query \
--table-name Orders \
--index-name StatusIndex \
--key-condition-expression "#s = :status" \
--expression-attribute-names '{"#s": "Status"}' \
--expression-attribute-values '{":status": {"S": "Pending"}}'
DynamoDB Streams
A time-ordered stream of item-level changes (insert/update/delete) in your table. Retained for 24 hours. Used with Lambda to react to data changes (send email when order status changes, sync to another table, audit log, real-time analytics).
DynamoDB Global Tables
Multi-region, multi-active (all regions accept reads AND writes). DynamoDB handles conflict resolution (last-writer-wins). Near-zero RPO/RTO for region failure. Used for: globally distributed apps where users in each region write and read data locally.
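Last-writer-wins is simple to picture in code. The sketch below only illustrates the rule; it is not DynamoDB's internal implementation, and `ItemVersion`, its fields, and the timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemVersion:
    region: str        # region where the write was accepted
    written_at: float  # hypothetical write timestamp (epoch seconds)
    attributes: dict


def resolve_last_writer_wins(a, b):
    """Keep whichever version was written most recently."""
    return a if a.written_at >= b.written_at else b


if __name__ == "__main__":
    mumbai = ItemVersion("ap-south-1", 1700000000.0, {"Status": "Shipped"})
    virginia = ItemVersion("us-east-1", 1700000002.5, {"Status": "Delivered"})
    winner = resolve_last_writer_wins(mumbai, virginia)
    print(winner.region, winner.attributes)
```

The practical consequence: if two regions update the same item at nearly the same time, one update is silently discarded, so avoid concurrent cross-region writes to the same item where that matters.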
DynamoDB Accelerator (DAX)
In-memory cache built specifically for DynamoDB. Read latency drops from milliseconds to microseconds. API-compatible: swap in the DAX client SDK and point it at the DAX cluster endpoint instead of DynamoDB. Best for: read-heavy apps, repeated reads of the same items, caching leaderboards/hot items. Not useful for write-heavy workloads or data that changes frequently.
Firestore (document database, like DynamoDB but with more flexible querying and real-time sync) | Bigtable (wide-column NoSQL for massive analytics/IoT; like DynamoDB at petabyte scale for time-series/analytics, used by Google internally).
Azure Cosmos DB: Multi-model NoSQL (can use SQL, MongoDB, Cassandra, Table, Gremlin APIs). Has global distribution with 99.999% SLA. Cosmos DB for NoSQL is closest to DynamoDB but with richer querying. Cosmos DB is Azure's flagship database and is more flexible than DynamoDB in query capabilities.
Azure Cosmos DB multi-model API: One service with MongoDB API compatibility, Cassandra API, Gremlin (graph) API, etc. If you have an existing MongoDB or Cassandra app, you can point it at Cosmos DB with minimal changes. AWS would require separate DocumentDB (MongoDB-compatible) or Keyspaces (Cassandra-compatible) services.
What is ElastiCache?
Managed in-memory data store. Two engines: Redis and Memcached. Used to cache frequently accessed data, reducing database load and cutting response times from seconds to milliseconds. Common pattern (cache-aside): check the cache first. Cache hit? Return instantly. Cache miss? Read from the DB, write to the cache, return.
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, Hashes, Lists, Sets, Sorted Sets, Pub/Sub, Streams, Geospatial | Strings only |
| Persistence | Optional (RDB snapshots, AOF log) | No persistence (pure cache) |
| Replication | Master-replica, Multi-AZ | No replication |
| Clustering | Redis Cluster (sharding) | Multi-node (simpler sharding) |
| Lua scripting | Yes | No |
| Use case | Sessions, leaderboards, pub/sub, real-time analytics, queues, rate limiting | Simple object caching, stateless horizontal scaling |
# Cache-aside example (Python + Redis via ElastiCache):
import json
import redis
# Endpoint is a placeholder; use your own cluster's primary endpoint
r = redis.Redis(host='my-cache.abc123.ng.0001.apse1.cache.amazonaws.com', port=6379)
def get_user_profile(user_id):
    # Try cache first
    cached = r.get(f'user:{user_id}')
    if cached:
        return json.loads(cached)  # Cache HIT: sub-millisecond response
    # Cache MISS: query the database (db is your app's existing DB client, not defined here)
    profile = db.query("SELECT * FROM users WHERE id = %s", user_id)
    r.setex(f'user:{user_id}', 300, json.dumps(profile))  # Cache for 5 minutes
    return profile
Cloud Memorystore: Managed Redis and Memcached. Same concepts. Fully compatible with open-source Redis/Memcached clients. Redis Cluster mode available.
Azure Cache for Redis: Managed Redis. Tiers: Basic (single node), Standard (replication), Premium (clustering, persistence, VNet injection), Enterprise (Redis Enterprise modules like RedisJSON, RediSearch).
Monitoring & Observability
What is CloudWatch?
CloudWatch is AWS's primary observability service: a unified platform for metrics, logs, dashboards, alarms, and events. Almost every AWS service automatically sends metrics to CloudWatch. It's your first stop for understanding what's happening in your AWS environment.
CloudWatch Metrics
Time-series data points published by AWS services and your own apps. Organized by Namespace (e.g., AWS/EC2) -> Metric Name (e.g., CPUUtilization) -> Dimension (e.g., InstanceId=i-0abc123).
Default EC2 Metrics (every 5 min, free):
CPUUtilization, NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, StatusCheckFailed
Detailed Monitoring (every 1 min, extra cost):
Same metrics but at 1-minute resolution. Needed for faster Auto Scaling reactions.
Custom Metrics:
Publish your own metrics from app code or scripts. Standard resolution = 1 min. High resolution = 1 second (extra cost). Example: publish queue depth, active sessions, order processing rate.
# Publish custom metric (CLI):
aws cloudwatch put-metric-data \
--namespace "MyApp/OrderService" \
--metric-name "OrdersPerMinute" \
--value 142 \
--dimensions Environment=Production,Service=OrderService
# Publish from Python:
import boto3
cw = boto3.client('cloudwatch')
cw.put_metric_data(
Namespace='MyApp/OrderService',
MetricData=[{
'MetricName': 'ActiveConnections',
'Value': 89,
'Unit': 'Count'
}]
)
CloudWatch Logs
- Log Groups: Container for log streams from the same source (e.g., /aws/lambda/my-function, /ecs/my-service)
- Log Streams: Sequence of log events from a single source (one EC2 instance, one Lambda execution environment)
- Retention: 1 day to 10 years (or never expire). Set per log group. Storage charged.
- Insights: Query language for searching/analyzing logs. Run across multiple log groups.
# CloudWatch Logs Insights query: find all errors (set the time range to the last hour in the query window):
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Find slow Lambda invocations (>5 seconds) using the @duration field from REPORT lines:
filter @type = "REPORT"
| filter @duration > 5000
| stats avg(@duration), max(@duration), count() by bin(5m)
CloudWatch Alarms
Watches a metric and transitions between states based on thresholds. States: OK, ALARM, INSUFFICIENT_DATA. When ALARM state: send SNS notification, trigger Auto Scaling action, stop/reboot/terminate EC2, invoke Lambda.
# CLI: Create alarm that fires when average CPU > 80% for 5 consecutive 1-minute periods
# --period is in seconds; --evaluation-periods is the number of consecutive periods
# (1-minute EC2 datapoints require detailed monitoring; basic monitoring reports every 5 minutes)
aws cloudwatch put-metric-alarm \
--alarm-name "HighCPU-EC2-prod" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 60 \
--evaluation-periods 5 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=InstanceId,Value=i-0abc123 \
--alarm-actions arn:aws:sns:ap-south-1:123456:AlertsTopic \
--ok-actions arn:aws:sns:ap-south-1:123456:AlertsTopic
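How the three alarm states fall out of "threshold for N consecutive periods" can be sketched as follows. This is a simplified model, not CloudWatch's actual evaluator: it ignores the configurable missing-data treatment and "M out of N" datapoint settings, and `evaluate_alarm` is a hypothetical name.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Return 'OK', 'ALARM', or 'INSUFFICIENT_DATA' for the latest periods.

    datapoints: one value per period, oldest first; None marks a missing period.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in recent):  # GreaterThanThreshold on every period
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    print(evaluate_alarm([50, 85, 90, 95, 88, 91], threshold=80, evaluation_periods=5))
    print(evaluate_alarm([85, 90, 70, 95, 88], threshold=80, evaluation_periods=5))
    print(evaluate_alarm([85, 90], threshold=80, evaluation_periods=5))
```

A single dip below the threshold inside the window keeps the alarm in OK, which is why short evaluation windows react faster but flap more.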
CloudWatch Agent
Install the CloudWatch agent on EC2 (or on-prem servers) to collect metrics not available by default: memory usage (RAM), disk usage, swap, process-level metrics. Also collects logs from any file (system logs, app logs, custom log files) and ships them to CloudWatch Logs.
CloudWatch Dashboards
Custom visualizations: widgets showing metric graphs, numbers, text, alarms. Share dashboards across accounts. Create one per team/service. JSON-configurable. Free to view, charged per dashboard per month.
Cloud Monitoring (metrics, dashboards, alerting) | Cloud Logging (logs, like CloudWatch Logs) | Cloud Trace (distributed tracing) | Cloud Profiler (CPU/memory profiling). These are unified under Google Cloud Observability (formerly Stackdriver).
Azure Monitor (umbrella service for all observability: metrics, logs, alerts, like CloudWatch) | Log Analytics Workspace (centralized log store with Kusto query language; richer querying than CloudWatch Logs Insights) | Application Insights (APM for web apps; no single native AWS equivalent).
Application Insights: Full APM (Application Performance Monitoring) natively integrated into Azure Monitor. Tracks request rates, failure rates, response times, dependency calls, exceptions, user sessions. The AWS equivalent would be X-Ray plus custom CloudWatch metrics, which is more complex to set up.
AWS CloudTrail - API Audit Logging
Records API calls made to AWS (via Console, CLI, SDK, or other services): who did what, when, and from where. It is the forensic record of your AWS account. Event History keeps the last 90 days of management events by default; create a Trail to deliver events to S3 for longer retention.
# CloudTrail log entry example: someone deleted an S3 bucket
{
"eventTime": "2024-01-15T14:32:01Z",
"eventSource": "s3.amazonaws.com",
"eventName": "DeleteBucket",
"userIdentity": {
"type": "IAMUser",
"userName": "john.dev",
"arn": "arn:aws:iam::123456789:user/john.dev"
},
"sourceIPAddress": "103.210.45.67", # The IP that made the call
"requestParameters": {"bucketName": "prod-data-bucket"}
}
AWS X-Ray - Distributed Tracing
Traces requests as they flow through distributed systems (multiple services, Lambda, DynamoDB, RDS, external APIs). Generates a service map showing which services call which. Identifies bottlenecks and errors. Essential for microservices: when a user's request goes through 5 services and fails, X-Ray shows exactly which service caused the error and how long each took.
Amazon EventBridge - Event Bus
A serverless event bus for routing events between AWS services, your own apps, and SaaS partners. Think of it as AWS's "if this then that" at scale. Events go to EventBridge -> rules match events -> targets receive events.
# EventBridge rule: "When EC2 instance state changes to STOPPED, run a Lambda"
Event Pattern:
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {"state": ["stopped"]}
}
Target: Lambda function -> notify team on Slack
# Another example: Run DB backup Lambda every day at 2AM IST
Schedule: cron(30 20 * * ? *) # 20:30 UTC = 02:00 IST (next day)
Target: Lambda function -> trigger RDS snapshot
EventBridge is the successor to CloudWatch Events. It has: a default event bus (AWS events), custom event buses (your app events), and partner event buses (SaaS integrations like Datadog, PagerDuty).
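The "rules match events" step can be approximated in a few lines. This sketch covers only exact-value matching, as used in the EC2 pattern above; real EventBridge patterns also support prefix, numeric, and anything-but operators, which are not modeled here. `matches` and the sample event are illustrative.

```python
def matches(pattern, event):
    """Return True if every field in the pattern is satisfied by the event.

    Pattern leaves are lists of allowed values; nested dicts match nested fields.
    Fields present in the event but absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:  # expected is a list of allowed values
            return False
    return True


PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

if __name__ == "__main__":
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0abc123", "state": "stopped"},
    }
    print(matches(PATTERN, event))
```

Because unlisted fields are ignored, a narrower pattern (more keys) matches fewer events, and an empty pattern matches everything.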
DevOps & Automation Tools
What is CloudFormation?
AWS's native IaC service. Define your AWS infrastructure in YAML or JSON templates. CloudFormation creates, updates, and deletes resources as a Stack. Resources in a stack are managed together: create the stack and all its resources are created; delete the stack and all its resources are deleted.
Template Structure
AWSTemplateFormatVersion: '2010-09-09'
Description: 'My web app infrastructure'
Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues: [t3.micro, t3.small, t3.medium]
  Environment:  # Referenced by the IsProd condition below
    Type: String
    Default: staging
    AllowedValues: [staging, production]
Mappings:
  RegionAMI:
    ap-south-1:
      AMI: ami-0c55b159cbfafe1f0  # Example ID (Amazon Linux 2); look up the current AMI for your region
Conditions:
  IsProd: !Equals [!Ref Environment, production]
Resources:
  # The ONLY required section
  MyBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'my-app-${AWS::AccountId}'
      VersioningConfiguration:
        Status: Enabled
  MyEC2:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMI, !Ref 'AWS::Region', AMI]
      # Assumes an AWS::IAM::InstanceProfile resource named EC2InstanceProfile is defined elsewhere in this template
      IamInstanceProfile: !Ref EC2InstanceProfile
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-web'
Outputs:
  BucketName:
    Value: !Ref MyBucket
    Export:
      Name: !Sub '${AWS::StackName}-BucketName'
CloudFormation Key Concepts
Change Sets
Before updating a stack, create a Change Set to preview what CloudFormation will actually do: which resources will be added, modified, or deleted. Always use Change Sets in production: a resource replacement (e.g., changing an RDS property that requires replacement) means data loss if you're not prepared.
Stack Sets
Deploy CloudFormation stacks across multiple AWS accounts and regions in one operation. Managed from a central admin account. Used for: applying security baseline to all accounts in an org, deploying global app infrastructure to 5 regions at once.
Drift Detection
Detects when actual resource configuration differs from CloudFormation's expected state (someone made a manual console change). Drift detection identifies what changed so you can fix it. Best practice: make all changes through CloudFormation only, and treat the console as read-only for production.
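Conceptually, drift detection is a comparison between the properties a template declares and the live configuration. A minimal sketch under that assumption (the function and the sample property dictionaries are hypothetical; the real service works per resource type through the CloudFormation API):

```python
def detect_drift(expected, actual):
    """Compare template-declared properties against the live configuration.

    Only properties declared in the template are checked, mirroring the idea
    that drift is measured against what the template explicitly sets.
    """
    drift = {}
    for key in sorted(expected):
        live_value = actual.get(key)
        if live_value != expected[key]:
            drift[key] = {"expected": expected[key], "actual": live_value}
    return drift


if __name__ == "__main__":
    template_props = {"InstanceType": "t3.micro", "Monitoring": False}
    live_props = {"InstanceType": "t3.large", "Monitoring": False, "EbsOptimized": True}
    print(detect_drift(template_props, live_props))
```

Here someone resized the instance in the console, so only `InstanceType` is reported; `EbsOptimized` is ignored because the template never declared it.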
Helper Scripts (cfn-signal, cfn-init)
cfn-signal: Allows an EC2 instance to signal CloudFormation that it has finished initializing (bootstrapping complete). CloudFormation waits for the signal (configured with a CreationPolicy or a WaitCondition) before marking the resource as created. Without this, CloudFormation marks EC2 as created the moment it starts, even if your app isn't ready yet. cfn-init: Reads the resource's AWS::CloudFormation::Init metadata and applies it (installs packages, writes files, starts services).
# In EC2 UserData:
#!/bin/bash
/opt/aws/bin/cfn-init -v --stack my-stack --resource MyEC2 --region ap-south-1
# ... install and configure app ...
/opt/aws/bin/cfn-signal -e $? --stack my-stack --resource MyEC2 --region ap-south-1
The AWS CI/CD Toolchain
GitHub / CodeCommit
        | Code push
        v
+---------------------------------------------------------------------+
|                          AWS CodePipeline                           |
|                                                                     |
|  Stage 1: SOURCE        Stage 2: BUILD         Stage 3: DEPLOY      |
|  +-------------+        +-------------+        +-------------+      |
|  | GitHub /    |        | CodeBuild:  |        | CodeDeploy: |      |
|  | CodeCommit  |------> | - Install   |------> | - EC2/ECS/  |      |
|  | Webhook     |        | - Test      |        |   Lambda    |      |
|  +-------------+        | - Build     |        | - Blue/Green|      |
|                         | - Push ECR  |        | - Canary    |      |
|                         +-------------+        +-------------+      |
|                                |                      |             |
|                          buildspec.yml           appspec.yml        |
+---------------------------------------------------------------------+
CodeBuild
Fully managed build service. Runs your build commands in a container. Defined by buildspec.yml in your repo root. Scales automatically, with no build servers to manage. Charged per build minute.
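# buildspec.yml example (Node.js app -> Docker -> ECR)
# Assumes AWS_ACCOUNT_ID, IMAGE_NAME, ECR_URI, and CONTAINER_NAME are set as CodeBuild environment variables
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to ECR...
      - aws ecr get-login-password | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
  build:
    commands:
      - echo Running tests...
      - npm test
      - echo Building Docker image...
      - docker build -t $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker tag $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      - docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
      # Write the artifact listed below; CONTAINER_NAME must match the container name in the ECS task definition
      - printf '[{"name":"%s","imageUri":"%s"}]' "$CONTAINER_NAME" "$ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION" > imagedefinitions.json
      - echo Build completed
artifacts:
  files:
    - imagedefinitions.json  # Tells the CodePipeline ECS deploy action which image to use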
CodeDeploy
Automates application deployments to EC2, Lambda, or ECS. Supports deployment strategies: in-place, blue/green, canary, linear. Defined by appspec.yml.
# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /dist
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_deps.sh
      timeout: 60
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 30
Auto Scaling Group (ASG)
An ASG maintains a fleet of EC2 instances. You define min/max/desired count. ASG continuously monitors health, replaces unhealthy instances automatically, and scales based on policies.
Launch Template vs Launch Configuration
Launch Template (modern, prefer this): Defines EC2 parameters (AMI, instance type, key pair, security groups, user data). Supports versioning, can specify multiple instance types, supports Spot + On-Demand mix. Launch Configuration (legacy, deprecated): Older, no versioning, only one instance type. Always use Launch Templates for new ASGs.
ASG Scaling Policies
| Policy Type | How it works | Best for |
|---|---|---|
| Simple Scaling | Alarm triggers: add/remove N instances. Cooldown period before next action. | Rarely used now: slow response, blunt |
| Step Scaling | Different scaling magnitudes based on alarm severity. CPU 70-80%: add 1. CPU 80-90%: add 3. CPU >90%: add 5. | Variable load spikes with different intensities |
| Target Tracking | Keep a metric at a target value. "Keep average CPU at 60%": ASG figures out how many instances to add/remove. | Most common: easy to configure, handles scale-in/out automatically |
| Scheduled Scaling | Pre-set scaling at specific times. Scale out at 8AM, scale in at 10PM. | Predictable traffic patterns (business hours, weekly spikes) |
| Predictive Scaling | ML-based forecasting using historical data. Pre-scales before expected traffic increase. | Cyclical/recurring load patterns |
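The difference between step scaling and target tracking is easier to see as arithmetic. The sketch below uses the step boundaries from the table (treating each lower bound as inclusive, which is an assumption) and a proportional estimate for target tracking; AWS does not publish its exact target-tracking algorithm, so treat `target_tracking_estimate` as an approximation.

```python
import math

# (lower bound inclusive, upper bound exclusive, instances to add)
STEP_ADJUSTMENTS = [
    (70, 80, 1),
    (80, 90, 3),
    (90, float("inf"), 5),
]


def step_scaling_adjustment(cpu_percent):
    """Instances to add for the given CPU level under the step policy."""
    for lower, upper, add in STEP_ADJUSTMENTS:
        if lower <= cpu_percent < upper:
            return add
    return 0  # below the first step: no scale-out


def target_tracking_estimate(current_instances, current_metric, target_metric,
                             min_size, max_size):
    """Approximate desired capacity that brings the metric back to target."""
    desired = math.ceil(current_instances * current_metric / target_metric)
    return max(min_size, min(max_size, desired))


if __name__ == "__main__":
    print(step_scaling_adjustment(85))
    print(target_tracking_estimate(4, 90, 60, min_size=2, max_size=20))
```

With 4 instances at 90% CPU and a 60% target, the proportional estimate asks for 6 instances, while the step policy would add 5 regardless of fleet size. That fleet-size awareness is a large part of why target tracking is the usual default.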
Mixed Instance Types & Spot
Launch Templates support specifying multiple instance types and a mix of On-Demand + Spot instances in an ASG. E.g., "run 2 On-Demand as baseline, fill capacity with cheapest Spot instances from this list: m5.xlarge, m5a.xlarge, m6i.xlarge." If Spot is interrupted, ASG replaces with another Spot or falls back to On-Demand. Major cost savings for stateless workloads.
# CDK example (simplified, Python): mixed-instance ASG
# Assumes `vpc` and the launch template `lt` are defined earlier in the same stack
from aws_cdk import aws_autoscaling as autoscaling, aws_ec2 as ec2
asg = autoscaling.AutoScalingGroup(self, "MyASG",
    vpc=vpc,
    min_capacity=2, max_capacity=20,
    mixed_instances_policy=autoscaling.MixedInstancesPolicy(
        instances_distribution=autoscaling.InstancesDistribution(
            on_demand_base_capacity=2,  # Always keep 2 On-Demand
            on_demand_percentage_above_base=20,  # 20% On-Demand, 80% Spot above base
            # Favors Spot pools with the most spare capacity (fewer interruptions), not the lowest price
            spot_allocation_strategy=autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED,
        ),
        launch_template=lt,
        launch_template_overrides=[
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5a.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m6i.xlarge")),
        ],
    ),
)
Managed Instance Groups (MIG) with Autoscaler. Uses Instance Templates (like Launch Templates). Supports scale out on CPU, LB capacity, custom metrics. Also has Spot VMs integration in MIGs.
Azure Virtual Machine Scale Sets (VMSS). Like ASG but Azure-flavored. Supports Flex (flexible orchestration) and Uniform orchestration modes. Auto-scale based on metrics or schedule. Spot instance support in VMSS.
AWS Systems Manager (SSM)
A suite of tools for managing EC2 instances (and on-prem servers) at scale. The SSM Agent runs on your instances and connects to the SSM service. Key features:
Session Manager
Browser-based or CLI shell access to EC2 instances with no SSH, no bastion host, no open inbound ports. Authentication via IAM. All sessions are logged to CloudWatch/S3. The modern way to access EC2. Significant security improvement over SSH.
# Start session (CLI) - no SSH keys needed
aws ssm start-session --target i-0abc123def456
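# Port forwarding via SSM to a remote host (e.g., reach RDS in a private subnet through an EC2 instance)
# The host value below is a placeholder; use your own RDS endpoint
aws ssm start-session \
--target i-0abc123def456 \
--document-name AWS-StartPortForwardingSessionToRemoteHost \
--parameters '{"host":["mydb.example.ap-south-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'
# Now: psql -h localhost -p 5432 -U admin mydb (via SSM tunnel)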
Parameter Store
Already covered in the security section: stores config and secrets. Accessible from EC2 instances, Lambda, and ECS tasks via the SSM API.
Run Command
Execute shell commands on one or multiple EC2 instances without SSH. Run across hundreds of instances using tags. Output captured in CloudWatch. Good for: emergency patches, config changes, one-off maintenance tasks.
# Run command on all tagged "Environment=Production" instances
aws ssm send-command \
--targets "Key=tag:Environment,Values=Production" \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["yum update -y kernel", "reboot"]'
Patch Manager
Automates OS patching across your fleet. Define patch baselines (which patches to approve), maintenance windows (when to patch), and patch groups. Generates compliance reports. Integrates with Run Command to actually apply patches.
State Manager
Ensures your instances are in a defined state (software installed, config files present, services running). Like configuration management (Ansible/Chef) but AWS-native. Uses SSM Documents to define state.
Messaging & Async Services
Why Async Messaging?
In a synchronous architecture, Service A calls Service B directly. If B is slow or down, A is slow or failing too. With async messaging, A puts a message in a queue and returns immediately. B processes it when it can. They're decoupled: A doesn't care about B's state.
SYNC:  Order Service --HTTP--> Inventory Service --HTTP--> Notification Service
       (if either downstream fails, the order fails and the user gets an error)
ASYNC: Order Service --> SQS Queue <-- Inventory Service (processes when ready)
             |
             +--> SNS Topic --fan-out--> Email Notification Lambda
                                    +--> Push Notification Lambda
                                    +--> Analytics Lambda
Amazon SQS - Simple Queue Service
Fully managed message queue. A producer sends messages; a consumer polls for them, processes them, and deletes them afterwards. Guarantees at-least-once delivery (the same message might be delivered more than once, so make consumers idempotent).
Queue Types
Standard Queue
Nearly unlimited throughput. Messages are delivered at least once, with best-effort ordering (order is not guaranteed). Best for: high-throughput workloads where some duplicate processing is OK. Default choice.
FIFO Queue
Exactly-once processing. Messages delivered exactly once, strictly in order. Throughput: 3,000 msg/s with batching (300/s without). Best for: financial transactions, order processing, any use case where order and deduplication matter.
Key SQS Concepts
- Visibility Timeout: When a consumer picks up a message, it becomes invisible for this duration. If not deleted within timeout (consumer crashed), it reappears for another consumer. Default 30s, max 12 hours. Set to > max processing time.
- Dead Letter Queue (DLQ): After N failed processing attempts (maxReceiveCount), message moves to DLQ. Use DLQ to capture and analyze unprocessable messages without losing them.
- Long Polling: Consumer waits up to 20 seconds for messages instead of returning empty immediately. Reduces empty API responses and costs.
- Message Retention: 1 minute to 14 days. Default 4 days. Plan accordingly.
- Max Message Size: 256KB. For larger payloads, store in S3 and put S3 reference in the message.
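The interaction between visibility timeout, retries, and the DLQ can be simulated in memory. This is a toy model, not the SQS API: `drain_with_dlq` and `handler` are hypothetical, and a failed handler stands in for a consumer that crashed before deleting the message.

```python
from collections import deque


def drain_with_dlq(messages, handler, max_receive_count=3):
    """Process messages; move any that fail max_receive_count times to a DLQ."""
    pending = deque((body, 0) for body in messages)  # (body, receive count so far)
    processed, dead_letter = [], []
    while pending:
        body, receives = pending.popleft()
        receives += 1  # every receive increments the message's receive count
        try:
            handler(body)
            processed.append(body)  # success: the consumer deletes the message
        except Exception:
            if receives >= max_receive_count:
                dead_letter.append(body)  # retries exhausted: redrive to the DLQ
            else:
                # Not deleted in time: the message becomes visible again
                pending.append((body, receives))
    return processed, dead_letter


if __name__ == "__main__":
    def handler(body):
        if body == "bad":
            raise ValueError("cannot process")

    done, dlq = drain_with_dlq(["ok-1", "bad", "ok-2"], handler)
    print("processed:", done)
    print("dead-lettered:", dlq)
```

The poison message is retried up to `max_receive_count` times and then parked in the DLQ, so it stops blocking the healthy messages behind it.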
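# SQS Producer (send message to a FIFO queue):
import boto3, json
sqs = boto3.client('sqs')
# FIFO queue names must end in .fifo
QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456789/OrderQueue.fifo'
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'orderId': 'ORD-2024-001',
        'userId': 'user123',
        'total': 299.99
    }),
    MessageGroupId='user123',  # FIFO only: same group = in-order
    MessageDeduplicationId='ORD-2024-001'  # FIFO only: required unless content-based deduplication is enabled
)
# SQS Consumer (receive and delete):
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # Long polling
    VisibilityTimeout=60
)
for msg in response.get('Messages', []):
    process(json.loads(msg['Body']))  # process() is your own handler, not defined here
    # Delete only after successful processing; otherwise the message reappears after the visibility timeout
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])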
Amazon SNS - Simple Notification Service
Pub/Sub messaging. Publishers send to a Topic. Subscribers receive all messages published to that topic. Fan-out: one message -> many subscribers. Supports: SQS, Lambda, HTTP/HTTPS, email, SMS, mobile push (APNS, FCM).
# SNS Fan-out: Order created -> notify multiple systems
SNS Topic: "OrderCreated"
|-- SQS: InventoryQueue -> Inventory Lambda (update stock)
|-- SQS: ShippingQueue -> Shipping Lambda (create shipment)
|-- Lambda: EmailSender -> Send confirmation email
+-- Lambda: Analytics -> Record to analytics DB
# Each subscriber independently processes the same event
Amazon Kinesis - Real-Time Streaming
For high-volume, real-time data streaming (millions of events/sec). Unlike SQS (a queue: each message is consumed once, then deleted), Kinesis retains records (24 hours by default, extendable up to 365 days) and multiple consumers can read the same stream.
- Kinesis Data Streams: Real-time streaming. Producers write records, consumers read. Partitioned by shards (1MB/s write, 2MB/s read per shard). Ordered within a shard. Good for: real-time analytics, click stream, log ingestion, IoT telemetry.
- Kinesis Data Firehose: Fully managed delivery of streaming data to S3, Redshift, OpenSearch (formerly Elasticsearch Service), Splunk. No consumers to write: batching and delivery are automatic. Buffer size and interval configurable. Good for: log delivery to S3 for analysis, streaming ETL.
- Kinesis Data Analytics (Managed Service for Apache Flink): Run SQL or Flink queries on streaming data in real time. Good for: real-time dashboards, anomaly detection, streaming aggregations.
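The per-shard limits quoted for Data Streams translate directly into a sizing estimate. A rough sketch using only those two figures (it deliberately ignores other per-shard limits such as records per second, and consumer options like enhanced fan-out; `shards_needed` is an illustrative name):

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0  # ingest limit per shard
SHARD_READ_MB_PER_SEC = 2.0   # shared read limit per shard


def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Minimum shard count that satisfies both the write and read throughput."""
    by_write = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_read = math.ceil(read_mb_per_sec / SHARD_READ_MB_PER_SEC)
    return max(1, by_write, by_read)


if __name__ == "__main__":
    # 4.5 MB/s of clickstream in, one consumer reading it all back out
    print(shards_needed(write_mb_per_sec=4.5, read_mb_per_sec=4.5))
    # Same ingest, but three independent consumers each reading the full stream
    print(shards_needed(write_mb_per_sec=4.5, read_mb_per_sec=13.5))
```

Adding consumers multiplies the read side, so a stream sized only for ingest can become read-bound once several applications share it.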
| Feature | SQS | SNS | Kinesis Data Streams |
|---|---|---|---|
| Pattern | Queue (consume once) | Pub/Sub (fan-out) | Stream (replay, multiple consumers) |
| Retention | 14 days max | No retention | 1-365 days |
| Ordering | FIFO (with FIFO queue) | No guarantee | Ordered per shard |
| Replay | No | No | Yes (replay from any position) |
| Throughput | Nearly unlimited (standard queues; FIFO is capped) | Nearly unlimited | 1MB/s write per shard |
| Use case | Task queues, job processing | Notifications, fan-out | Real-time analytics, event sourcing |
Cloud Pub/Sub: Combines SQS + SNS in one service (pub/sub model with at-least-once delivery, pull or push subscriptions). Also Cloud Tasks (task queues, more like SQS: delayed execution, rate limits, HTTP targets).
Azure Service Bus (enterprise messaging, like SQS/SNS with richer features: sessions, dead-lettering, transactions, topic subscriptions = fan-out) | Azure Event Grid (event routing, like EventBridge) | Azure Event Hubs (high-throughput streaming, like Kinesis Data Streams; compatible with the Apache Kafka protocol).
Azure Event Hubs Kafka compatibility: Azure Event Hubs has a Kafka-compatible API. Migrate existing Kafka producers/consumers to Event Hubs with minimal code changes. AWS offers Amazon MSK (Managed Kafka), but that is a full Kafka cluster and heavier to run. Event Hubs is lighter and Kafka-compatible at the same time.
Cloud Fundamentals - Quick Review
HA, Scaling & DR - Quick Review
Networking, Security & Modern Patterns - Quick Review
Compute - Quick Review
aws ecr get-login-password | docker login. Scans images for CVEs. Image URI format: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>