☁ Cloud Concepts & AWS Services

Complete notes for a Junior DevOps role. Learn core cloud principles, then dive deep into AWS services with real-world examples and diagrams. GCP & Azure equivalents included throughout.

AWS GCP Azure DevOps IaC Serverless
☁ Cloud Concepts
CC-M1

Cloud Fundamentals

1 What is Cloud Computing?

The Core Idea

Cloud computing means renting computing resources over the internet instead of buying and managing your own hardware. Think of it like electricity: you don't build your own power plant; you plug into the grid and pay for what you use.

Before cloud, a startup wanting to launch an app needed to: buy servers, rent datacenter space, hire a sysadmin, buy networking hardware, and wait weeks for delivery, all before writing a single line of code. Cloud made that a 5-minute signup.

NIST 5 Essential Characteristics

The official NIST definition says cloud computing must have all 5 of these:

1. On-Demand Self-Service

You provision resources yourself, without talking to a human. Spin up an EC2 instance at 2 AM, no approval required.

2. Broad Network Access

Resources are accessible over the internet from any device: a laptop, a phone, or another server anywhere on the planet.

3. Resource Pooling (Multi-tenancy)

Provider serves many customers from the same physical hardware, dynamically assigning resources. You don't know (or care) which physical server you're on.

4. Rapid Elasticity

Scale up or down fast, sometimes automatically. Resources feel unlimited from the user's perspective. Traffic spike at 9 AM? Auto Scaling adds servers in minutes.

5. Measured Service

Usage is metered and you pay for what you use, like a utility bill. AWS charges per hour or per second for compute, per GB for storage, and per million requests for API calls.

CAPEX vs OPEX

Model | What it means | Example | Cloud relevance
CAPEX (Capital Expense) | Upfront large purchase. You own the asset. | Buying 50 physical servers | Traditional / on-prem model
OPEX (Operational Expense) | Ongoing cost. Pay as you go. | Paying the monthly AWS bill | Cloud model: predictable, flexible

Why it matters for DevOps: Cloud moves IT from CAPEX to OPEX. This means faster experimentation (no hardware order), easier budgeting, and no wasted capital on underused hardware. As a DevOps engineer, you'll constantly make decisions that affect cloud spend.
2 Service Models: IaaS, PaaS, SaaS

The "Pizza as a Service" Analogy

These models define how much of the stack the cloud provider manages vs how much you manage.

Layer | On-Prem (you manage) | IaaS | PaaS | SaaS
Application | You | You | You | Provider
Data | You | You | You | Provider
Runtime / Middleware | You | You | Provider | Provider
OS | You | You | Provider | Provider
Virtualization | You | Provider | Provider | Provider
Hardware / Network / DC | You | Provider | Provider | Provider

IaaS: Infrastructure as a Service

You get raw compute, storage, and networking. You manage the OS up. Most control, most responsibility.

Real example: You rent an EC2 instance, install Ubuntu, install Nginx, deploy your app. If the OS crashes, that's on you to fix.

PaaS: Platform as a Service

You just deploy your application code/container. The provider handles OS patching, scaling infrastructure, runtime. Less control, less ops work.

Real example: You push a Python Flask app to Elastic Beanstalk. AWS auto-provisions EC2, load balancer, and auto-scaling. You never SSH into a server.

SaaS: Software as a Service

You're just a user of a complete application. No infrastructure, no app management. Just login and use it.

Real example: Gmail, Slack, Salesforce. AWS WorkMail is also SaaS.

Cloud Provider Service Model Equivalents
AWS

IaaS: EC2  |  PaaS: Elastic Beanstalk, Lambda  |  SaaS: WorkMail, Chime

GCP

IaaS: Compute Engine (GCE)  |  PaaS: App Engine, Cloud Run  |  SaaS: Google Workspace

Azure

IaaS: Azure VMs  |  PaaS: Azure App Service, Azure Functions  |  SaaS: Microsoft 365, Dynamics 365

3 Deployment Models: Public, Private, Hybrid, Multi-Cloud
Public Cloud

Resources run on provider's shared infrastructure, accessible over the public internet. AWS, GCP, Azure are all public clouds. Best for: startups, variable workloads, apps without strict data residency needs.

Private Cloud

Cloud infrastructure dedicated to one organization. Can be on-prem or in a provider's dedicated facility. Tech: OpenStack, VMware vSphere. Best for: banks, government, healthcare, and other sectors with strict compliance requirements.

Hybrid Cloud

Mix of on-prem (private) + public cloud, connected by VPN or Direct Connect. Best for: organizations with legacy systems migrating gradually to cloud, or data residency requirements with burst needs.

Multi-Cloud

Using multiple public cloud providers simultaneously (e.g., AWS for compute + GCP for ML). Best for: avoiding vendor lock-in, using best-of-breed services, or regulatory reasons.

Real-World Example (Hybrid): A bank keeps customer data on private on-prem servers (regulatory compliance) but uses AWS for its web frontend and analytics dashboards. The private datacenter connects to AWS via AWS Direct Connect, creating a hybrid setup.
Multi-Cloud vs Hybrid Cloud: Multi-cloud = multiple public cloud providers. Hybrid cloud = public cloud + on-prem/private cloud. These are different! Many companies end up with both (multi-cloud-hybrid) in practice.
4 Shared Responsibility Model

The Most Important Concept in Cloud Security

AWS (and all cloud providers) operate under a shared responsibility model. In simple terms: AWS is responsible for security OF the cloud. YOU are responsible for security IN the cloud.

Shared Responsibility Model: EC2 (IaaS) Example
  +------------------------------------------------------------------+
  |                      CUSTOMER RESPONSIBILITY                     |
  |  (Security IN the cloud)                                         |
  |                                                                  |
  |  +-------------+  +-------------+  +--------------------------+  |
  |  |  Customer   |  |  Platform,  |  |  Identity & Access Mgmt  |  |
  |  |    Data     |  |  App, OS    |  |  (IAM users, policies)   |  |
  |  +-------------+  +-------------+  +--------------------------+  |
  |  +-------------+  +-------------+  +--------------------------+  |
  |  |  Firewall / |  |  Network    |  |  Client-side & Server-   |  |
  |  |  Sec Groups |  |  Config     |  |  side Encryption         |  |
  |  +-------------+  +-------------+  +--------------------------+  |
  +------------------------------------------------------------------+
  +------------------------------------------------------------------+
  |                       AWS RESPONSIBILITY                         |
  |  (Security OF the cloud)                                         |
  |                                                                  |
  |  +------------------------------------------------------------+  |
  |  |  Compute | Storage | Networking | Database (managed infra) |  |
  |  +------------------------------------------------------------+  |
  |  +------------------------------------------------------------+  |
  |  |  Physical Security of Datacenters, Hardware, Network Infra |  |
  |  +------------------------------------------------------------+  |
  +------------------------------------------------------------------+

Responsibility Shifts Based on Service Model

Concern | IaaS (EC2) | PaaS (Beanstalk) | SaaS (WorkMail)
Physical datacenter | AWS | AWS | AWS
Hypervisor / Hardware | AWS | AWS | AWS
OS patching | You | AWS | AWS
Runtime/middleware | You | AWS | AWS
Application code | You | You | AWS
Data & encryption | You | You | You
IAM / access control | You | You | You
Common Mistake: Many cloud breaches happen because people think "the cloud provider secures everything." It doesn't. Leaving an S3 bucket publicly readable, using weak IAM policies, or not patching your EC2 OS: all of these are your responsibility. The provider doesn't protect you from YOUR mistakes inside the cloud.
Shared Responsibility: Other Providers
GCP

Same model: Google secures the infrastructure, you secure your workloads and data. Called "Shared Fate" in GCP (a more collaborative tone: Google provides security tools to help you).

Azure

Same model. Azure's documentation explicitly shows a layered diagram. For managed services (like Azure SQL), Azure takes on more responsibility than for VMs.

CC-M2

Global Infrastructure

1 Regions, Availability Zones & Edge Locations

Why Geography Matters in Cloud

Your users are physically distributed. A server in the US takes ~150ms to respond to a user in India. Cloud providers build datacenters globally to solve this. But geography also matters for compliance (for example, GDPR restricts transferring EU personal data outside the EU, so many teams keep it in EU regions), disaster recovery (separate physical locations), and cost (prices vary by region).

AWS Global Infrastructure Hierarchy
AWS Region (e.g. ap-south-1 Mumbai): geographically isolated. 33+ regions worldwide. Services replicated within region. 3 AZs minimum per region.
  AZ-1 (ap-south-1a): Data Center A (EC2 instances, RDS, etc.), Data Center B (separate power & network)
  AZ-2 (ap-south-1b): Data Center C (independent failure domain), Data Center D (own UPS, generators)
  AZ-3 (ap-south-1c): Data Center E (flood/fire isolated), Data Center F (redundant transit links)
  AZs are kilometers apart and connected by low-latency AWS fiber.

Regions

A Region is a geographically separate area of the world with a cluster of datacenters. Each region has a unique name like us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland).

  • AWS has 33+ regions worldwide (2024)
  • Regions are designed to be independent: a region-wide failure should not affect other regions
  • Not all services are available in all regions (e.g., some AI services only in US regions initially)
  • Data does NOT automatically replicate across regions β€” you must explicitly configure cross-region replication

Availability Zones (AZs)

Each region has at least 3 AZs (up to 6 in the largest regions). An AZ is one or more discrete datacenters with:

  • Independent power supply (UPS + diesel generators)
  • Independent networking (separate internet uplinks)
  • Physical separation (miles apart, so one fire/flood doesn't take both)
  • But connected with high-bandwidth, low-latency private fiber within the region (typically single-digit millisecond latency between AZs)
Practical Rule: Deploy critical resources across at least 2 AZs. If one AZ fails (power outage, hardware failure), your app keeps running in the other AZ. This is the foundation of High Availability in AWS.

Edge Locations & CloudFront PoPs

For CDN (CloudFront), AWS has 600+ edge locations worldwide, far more than regions. These are smaller cache servers placed close to end users. Content cached here gets served with ultra-low latency. Edge locations are also used by Route 53 (DNS) and AWS Shield (DDoS protection).

Other AWS Infrastructure Types

Type | What it is | Use case
Local Zones | AWS compute placed in specific cities (e.g., Delhi, Chicago), extending a region | Sub-10ms latency for city users. Gaming, live video, AR/VR.
Wavelength Zones | AWS compute embedded in telecom 5G networks | Ultra-low latency apps delivered via 5G. Mobile gaming, real-time video.
AWS Outposts | AWS-managed rack in YOUR datacenter running AWS services | On-prem workloads needing AWS APIs. Compliance requiring on-prem data.

Global Infrastructure: Other Providers
GCP

Regions & Zones (similar concept). A Zone is like an AZ. GCP calls them Zones directly (e.g., asia-south1-a). Also has Cloud CDN PoPs for edge caching. ~40 regions.

Azure

Regions & Availability Zones. Azure also has Availability Sets (older: ensures VMs spread across fault/update domains within a single datacenter, NOT the same as AZs). Azure AZs are like AWS AZs. Also has Azure Edge Zones similar to AWS Local Zones.

Azure-Only

Azure Paired Regions: Most Azure regions are paired with another region in the same geography (e.g., East US ↔ West US). Microsoft staggers updates across pairs and replicates some services automatically. AWS doesn't have an exact equivalent; you configure cross-region replication yourself.

2 Choosing a Region: 4 Key Factors
1. Compliance & Data Residency

GDPR (EU) restricts moving EU personal data outside the EU, so it is usually kept in EU regions. HIPAA (US healthcare) and PCI-DSS (payments) add their own requirements. If the law requires data to stay in a specific country, that region wins, period. No other factor overrides this.

2. Latency (Proximity to Users)

Deploy closest to your users. If 80% of users are in India, ap-south-1 (Mumbai). Use CloudFront for global CDN on top. Test with cloudpingtest.com.

3. Service Availability

Not all services exist in all regions. New services often launch in us-east-1 first. Check the AWS Regional Services table before designing architecture. Bedrock (AI) has limited region availability.

4. Pricing

The same EC2 instance type is priced differently in each region. us-east-1 tends to be among the cheapest. ap-southeast-1 (Singapore) is roughly 10-20% more. Factor this into cost modeling.

Pro Tip (us-east-1 is special): AWS us-east-1 (N. Virginia) is AWS's oldest and largest region. New services usually launch here first. It's also where AWS Console global resources (like IAM, Route 53, CloudFront) show up. When something seems to not exist in your region, check whether it's in us-east-1 only.
CC-M3

High Availability, Scalability & Disaster Recovery

1 High Availability vs Fault Tolerance vs Disaster Recovery

Three Related But Different Concepts

These terms are often confused. Think of a hospital as an analogy:

High Availability (HA)

System is designed to be "always on" with minimal downtime. If a component fails, the system automatically recovers quickly. A hospital with a backup generator: a brief flicker, but it stays running.

Fault Tolerance (FT)

System continues operating WITH ZERO downtime or data loss even when a component fails. An airplane with 4 engines that can fly on 3: no passengers even notice. Much harder and more expensive than HA.

Disaster Recovery (DR)

Your plan for recovering from catastrophic failure (entire datacenter destroyed, full region outage). Like a hospital's evacuation plan: you hope you never need it but must have it. Usually involves a separate region.

Nines of Availability

Availability % | Downtime per year | Downtime per month | Typical system
99% | 3.65 days | 7.3 hours | Basic single-server app
99.9% ("three nines") | 8.76 hours | 43.8 minutes | Simple multi-AZ setup
99.99% ("four nines") | 52.6 minutes | 4.4 minutes | Production multi-AZ + failover
99.999% ("five nines") | 5.26 minutes | 26 seconds | Active-active multi-region
AWS SLAs: The EC2 SLA is 99.99% at the region level. S3 Standard is designed for 99.99% availability, with an SLA of 99.9%. Route 53 has a 100% availability SLA. These are contractual commitments: if AWS misses them, you get service credits.
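
The downtime figures in the table follow directly from the availability percentage. Here is a minimal Python sketch of that arithmetic, assuming a 365-day year and an average month of one twelfth of a year. The function and constant names are illustrative and not part of any AWS tool.

```python
HOURS_PER_YEAR = 365 * 24                # 8760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12    # 730 hours, an average month


def downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the maximum downtime (in hours) allowed within a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_hours * unavailable_fraction


# Print the allowed downtime for each "nines" level in the table.
for level in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(level, HOURS_PER_YEAR)
    per_month_minutes = downtime_hours(level, HOURS_PER_MONTH) * 60
    print(f"{level}%: {per_year:.2f} h/year, {per_month_minutes:.1f} min/month")
```

Each extra nine divides the allowed downtime by ten, which is why every step up the table tends to cost noticeably more to achieve.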

RPO and RTO: The Two DR Metrics

RPO: Recovery Point Objective

How much data can you afford to lose? Measured as maximum time between last backup and the disaster. RPO = 1 hour means you're OK losing up to 1 hour of data. Lower RPO = more frequent backups = more cost.

RTO: Recovery Time Objective

How long can your system be down? Time from disaster to full recovery. RTO = 4 hours means you need to be back up within 4 hours. Lower RTO = more standby infrastructure = more cost.
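
To make the two metrics concrete, the sketch below checks one incident against an RPO and an RTO. It is only an illustration: the function name and the timestamps are hypothetical, chosen to match the RPO = 1 hour, RTO = 4 hours example used in these notes.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, disaster, back_online, rpo, rto):
    """Return (rpo_met, rto_met) for one incident.

    Data lost is everything written between the last backup and the
    disaster; recovery time runs from the disaster until service is back.
    """
    data_loss_window = disaster - last_backup
    recovery_time = back_online - disaster
    return data_loss_window <= rpo, recovery_time <= rto


# Hypothetical incident: backup at 09:30, failure at 10:00, recovered at 13:00.
rpo_met, rto_met = meets_objectives(
    last_backup=datetime(2024, 1, 1, 9, 30),
    disaster=datetime(2024, 1, 1, 10, 0),
    back_online=datetime(2024, 1, 1, 13, 0),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(rpo_met, rto_met)
```

Here 30 minutes of data is lost and recovery takes 3 hours, so both objectives are met. Backing up only every 2 hours could break the RPO even if recovery stayed fast.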

RPO and RTO on a Timeline
  Normal --------------------+ DISASTER +------------------ Recovered
  Operation                  |  event   |                   state
                             |          |
  [Last backup] <------------+          +---------------> [Back online]
                    RPO      |          |      RTO
                 (data gap)  |          |  (recovery time)

  Example: RPO=1hr, RTO=4hr
  -> You can lose max 1 hour of data
  -> You must be back online within 4 hours of the disaster
2 Disaster Recovery Strategies

4 AWS DR Strategies: Cost vs Speed Tradeoff

DR Strategy Comparison: Cost vs Recovery Speed
                        Faster recovery (lower RTO/RPO)
                        ------------------------------>

  CHEAPEST    Backup &    Pilot     Warm      Active-Active    MOST
  (cold)      Restore     Light     Standby   (Multi-site)     EXPENSIVE
              |           |         |         |
              v           v         v         v
  RTO:        Hours       Minutes   Minutes   Seconds
  RPO:        Hours       Minutes   Seconds   Near-zero
  Cost:       $           $$        $$$       $$$$
              |           |         |         |
              |           |         |         +-- Full copy in 2nd region
              |           |         +---- Scaled-down running copy
              |           +---- Minimal services always running
              +---- Just backups, nothing running

Strategy 1: Backup & Restore

Regularly back up data and snapshots to S3. When disaster hits, spin up new infrastructure from those backups. Simplest, cheapest, but slowest.

Example: EC2 AMI snapshots every 6 hours to S3. RDS automated backups to another region. If primary region fails, launch new EC2 from AMI, restore RDS from backup. Takes hours.

Strategy 2: Pilot Light

A minimal version of your app is always running in the DR region: just the core data-syncing layer (e.g., a database replicating from primary). Application servers are OFF but AMIs/configs are ready. Scale up when needed.

Example: RDS read replica in the DR region (always syncing). EC2 AMIs ready. When disaster strikes: promote the read replica to primary, launch app servers from AMIs. Takes 15-30 minutes.

Strategy 3: Warm Standby

A scaled-down but fully running copy of your system in DR region. It receives traffic in normal times or just sits ready. During disaster, scale it up to full production capacity.

Example: 2 t3.small EC2s in DR region vs 10 m5.xlarge in production. During disaster, scale DR to full size and redirect DNS.

Strategy 4: Active-Active (Multi-Site)

Full production deployment in 2+ regions, ALL serving live traffic. Route 53 routes users to nearest healthy region. If one region fails, all traffic goes to the other with no perceivable downtime.

Example: Netflix runs in multiple AWS regions. If us-east-1 has issues, traffic goes to us-west-2. Users might see a brief slowdown, but no outage.
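
The cost-versus-speed tradeoff across the four strategies can be sketched as a simple lookup: pick the cheapest strategy whose typical recovery time still fits the target RTO. The RTO numbers below are illustrative assumptions loosely based on the "hours / minutes / seconds" labels in these notes, not AWS figures, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float   # illustrative assumption, not an AWS number
    relative_cost: int           # 1 = cheapest ($), 4 = most expensive ($$$$)


STRATEGIES = [
    Strategy("Backup & Restore", 240, 1),
    Strategy("Pilot Light", 30, 2),
    Strategy("Warm Standby", 10, 3),
    Strategy("Active-Active", 1, 4),
]


def cheapest_strategy(target_rto_minutes: float):
    """Return the lowest-cost strategy that recovers within the target RTO."""
    candidates = [s for s in STRATEGIES
                  if s.typical_rto_minutes <= target_rto_minutes]
    if not candidates:
        return None   # no strategy in this table is fast enough
    return min(candidates, key=lambda s: s.relative_cost)


print(cheapest_strategy(60).name)
```

In practice the choice also depends on RPO, budget, and how much operational complexity the team can absorb, so treat this as a way to frame the discussion, not as a decision procedure.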

DR Concepts: All Clouds
AWS

DR strategies built around multi-region architecture. Key services: Route 53 (DNS failover), S3 CRR (cross-region replication), RDS Read Replicas, Aurora Global Database, DynamoDB Global Tables.

GCP

Same concepts. Multi-region Cloud Storage, Cloud Spanner (global DB), Cloud DNS with failover routing. GCP also has Managed Instance Groups with regional autoscaling.

Azure

Azure Site Recovery (ASR) is Azure's dedicated DR service (the closest AWS counterpart is AWS Elastic Disaster Recovery). ASR can replicate VMs to a secondary region and automate failover. Azure Traffic Manager handles DNS-level failover (like Route 53).

Azure-Only

Azure Site Recovery (ASR): Dedicated managed DR service that replicates VMs, manages failover plans, and handles RPO/RTO tracking. On AWS, the closest managed counterpart is AWS Elastic Disaster Recovery; otherwise DR is assembled from CloudFormation, scripting, and Route 53.

3 Scalability & Elasticity

Scalability = Can It Grow? Elasticity = Does It Grow Automatically?

Scalability means your architecture can handle increased load. Elasticity means it automatically scales up AND back down as load changes, so you're not paying for idle capacity at 3 AM.

Vertical Scaling (Scale Up)

Give the existing server more power. Upgrade from t3.medium (2 vCPU, 4GB) to m5.4xlarge (16 vCPU, 64GB). Simple but has limits (biggest instance size), requires downtime, and creates a single point of failure.

Vertical vs Horizontal Scaling
  VERTICAL (Scale Up)                 HORIZONTAL (Scale Out)

  Before: [Server 2GB RAM]            Before: [Server] [Server]
              |                                   |
              v                                   v
  After:  [Server 16GB RAM]           After:  [Server] [Server] [Server] [Server]
                                                         |
          One server, bigger                      Load Balancer distributes traffic
          Single point of failure                 No SPOF, much more resilient

Horizontal Scaling (Scale Out)

Add more instances of the same server: 1 server becomes 5 servers behind a load balancer. No single point of failure. Nearly unlimited scale. Requires your app to be stateless (session data stored in Redis/DB, not locally).

Auto Scaling

AWS Auto Scaling automatically adjusts the number of instances based on rules you define. You define a minimum (always have at least 2), maximum (never exceed 20), and desired (target 4 normally).

Scaling can be triggered by: CPU usage > 70%, request count, memory (published as a custom metric, since EC2 does not report it to CloudWatch by default), a schedule, or other custom CloudWatch metrics.
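
The min / max / desired idea can be sketched in a few lines. This is a simplified proportional rule under stated assumptions (a 70% CPU target, a minimum of 2 and a maximum of 20 instances, taken from the examples above); it is not AWS's actual scaling algorithm, and the function name is hypothetical.

```python
import math


def desired_capacity(current: int, cpu_percent: float,
                     target_cpu: float = 70.0,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Propose an instance count that brings average CPU near the target.

    If 4 instances run at 95% CPU, the same load spread over more
    instances should drop toward the target, so we scale proportionally
    and then clamp the result to the configured bounds.
    """
    if current <= 0:
        return minimum
    proposed = math.ceil(current * cpu_percent / target_cpu)
    return max(minimum, min(maximum, proposed))


print(desired_capacity(current=4, cpu_percent=95))   # scale out
print(desired_capacity(current=4, cpu_percent=10))   # scale in, floor at minimum
```

The clamp is the important part: the minimum protects availability during quiet periods and the maximum caps the bill during a runaway spike.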

Real-World Scenario: You run an e-commerce site. On a normal day, 4 EC2 instances handle traffic. Black Friday comes and traffic spikes 10x. Auto Scaling detects the CPU spike, scales out to 20 instances in 5 minutes, and Black Friday is handled. At midnight, when traffic drops, it scales back to 4. You only paid for the extra instances for those hours.
Elasticity Requires Stateless Apps: If your app stores user sessions in local server memory, horizontal scaling breaks: user logs in -> Server A has the session -> next request hits Server B -> session not found -> user gets logged out. Fix: store sessions in ElastiCache (Redis) or DynamoDB, not in server memory.
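
The session problem is easy to reproduce in miniature. In this sketch a plain Python dict stands in for an external store such as Redis or DynamoDB (an assumption made so the example runs without any services); the class and method names are hypothetical.

```python
class Server:
    """A toy app server that keeps sessions either locally or in a shared store."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # With no shared store, each server gets its own private dict,
        # which models sessions held in local server memory.
        self._sessions = shared_store if shared_store is not None else {}

    def login(self, session_id, user):
        self._sessions[session_id] = user

    def whoami(self, session_id):
        return self._sessions.get(session_id)


# Stateful: the session lives only on server A, so server B cannot see it.
a, b = Server("A"), Server("B")
a.login("s1", "alice")
print(b.whoami("s1"))        # None: the user appears logged out

# Stateless servers: both read and write the same external store.
store = {}                   # stand-in for Redis / DynamoDB
a2, b2 = Server("A", store), Server("B", store)
a2.login("s1", "alice")
print(b2.whoami("s1"))       # alice
```

With the shared store, any server can handle any request, which is what lets a load balancer and Auto Scaling add or remove instances freely.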
CC-M4

Cloud Networking Fundamentals

1 Virtual Private Cloud (VPC): Your Private Network

What is a VPC?

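A VPC is a logically isolated private network in the cloud. Think of it as your own private section of AWS that no one else can access. In a VPC you create yourself, nothing inside can reach the internet by default, and the internet can't reach your VPC; you must explicitly configure that. (The default VPC that AWS creates in each region is the exception: it comes with an internet gateway and public subnets.)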

Analogy: AWS is a massive apartment building. Your VPC is your apartment: you can furnish it however you like inside, but outsiders can't get in unless you let them.

VPC Architecture: Key Components
  +------------------------- AWS Region (ap-south-1) --------------------------+
  |                                                                            |
  |  +------------------------- VPC (10.0.0.0/16) --------------------------+  |
  |  |                                                                      |  |
  |  |  +-- AZ-1a ---------------------+  +-- AZ-1b ---------------------+  |  |
  |  |  |                              |  |                              |  |  |
  |  |  | [Public Subnet 10.0.1.0/24]  |  | [Public Subnet 10.0.2.0/24]  |  |  |
  |  |  |  +----------+                |  |  +----------+                |  |  |
  |  |  |  | EC2 (web)|                |  |  | EC2 (web)|                |  |  |
  |  |  |  |  Public  | NAT GW         |  |  |  Public  |                |  |  |
  |  |  |  |  IP: ✓   |--+             |  |  |  IP: ✓   |                |  |  |
  |  |  |  +----------+  |             |  |  +----------+                |  |  |
  |  |  |                |             |  |                              |  |  |
  |  |  | [Private Sub 10.0.3.0/24]    |  | [Private Sub 10.0.4.0/24]    |  |  |
  |  |  |  +----------+  |             |  |  +----------+                |  |  |
  |  |  |  | EC2 (app)|<-+ (outbound)  |  |  | RDS (DB) |                |  |  |
  |  |  |  | No pub IP|                |  |  | No pub IP|                |  |  |
  |  |  |  +----------+                |  |  +----------+                |  |  |
  |  |  +------------------------------+  +------------------------------+  |  |
  |  |                       |                                              |  |
  |  |            Internet Gateway (IGW)                                    |  |
  |  +-----------------------+----------------------------------------------+  |
  |                          |                                                 |
  +--------------------------+-------------------------------------------------+
                             |
                         INTERNET

Key Networking Components

CIDR Block

When you create a VPC, you assign it a CIDR block like 10.0.0.0/16. This defines the IP range for your entire VPC (65,536 IPs). You then carve subnets from this range.
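
The CIDR arithmetic can be checked with Python's standard `ipaddress` module. This is a small sketch using the 10.0.0.0/16 range from these notes; the variable names are illustrative.

```python
import ipaddress

# The VPC range: a /16 leaves 16 host bits, so 2**16 addresses.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)

# Carve the VPC into /24 subnets (256 addresses each).
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))
print(subnets[1], subnets[1].num_addresses)

# Confirm that a subnet really sits inside the VPC range.
print(ipaddress.ip_network("10.0.3.0/24").subnet_of(vpc))
```

Note that AWS reserves 5 addresses in every subnet, so a /24 subnet has 251 usable addresses, not 256.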

Subnets

Subnets divide the VPC into smaller networks, and they're tied to a specific AZ. A public subnet has a route to the internet via an Internet Gateway. A private subnet has no direct internet route, so instances here can't be reached from the internet.

Internet Gateway (IGW)

A horizontally-scaled, redundant, HA component attached to your VPC that enables communication between your VPC and the internet. Free. Without an IGW, your VPC has no internet connectivity at all. Only one IGW per VPC.

NAT Gateway

Allows instances in private subnets to initiate outbound connections to the internet (for software updates, API calls, etc.) but prevents the internet from initiating connections TO those instances. Deployed in a public subnet, charges per hour + data processed.

Route Tables

Every subnet has a route table that defines where traffic goes. A public subnet's route table has an entry: 0.0.0.0/0 -> igw-xxxxx (default route to internet via IGW). A private subnet's route table: 0.0.0.0/0 -> nat-xxxxx (outbound only via NAT).
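
Route selection works by longest-prefix match: the most specific route that contains the destination wins. The sketch below models that with the standard `ipaddress` module. The tables are plain dicts for illustration, the `igw-xxxxx` and `nat-xxxxx` targets are the placeholders used above, and the `local` entry represents the route AWS adds to every route table for traffic inside the VPC.

```python
import ipaddress


def next_hop(route_table: dict, destination: str):
    """Return the target of the most specific route matching the destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # Longest prefix (largest prefix length) is the most specific match.
    return max(matches)[1]


PUBLIC_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxxxx"}
PRIVATE_ROUTES = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-xxxxx"}

print(next_hop(PUBLIC_ROUTES, "10.0.3.15"))   # stays inside the VPC
print(next_hop(PUBLIC_ROUTES, "8.8.8.8"))     # leaves via the internet gateway
print(next_hop(PRIVATE_ROUTES, "8.8.8.8"))    # leaves via the NAT gateway
```

What makes a subnet "public" or "private" is only this default route: the instances and the subnet itself are otherwise the same kind of resource.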

Security Groups

Virtual firewalls at the instance (EC2) level. Stateful: if you allow inbound on port 80, the response automatically comes back out without needing an outbound rule. Default: deny all inbound, allow all outbound.

Network ACLs (NACLs)

Firewall at the subnet level. Stateless: you must define both inbound AND outbound rules explicitly. Rules evaluated in number order (lowest first). An explicit DENY stops evaluation. Less commonly tweaked than Security Groups.

Feature | Security Group | NACL
Applied at | Instance (ENI) level | Subnet level
State | Stateful (response auto-allowed) | Stateless (must allow both directions)
Rules | Allow only | Allow and Deny
Rule evaluation | All rules evaluated, most permissive wins | Rules in number order, first match wins
Default behavior | Deny all inbound, allow all outbound | Allow all inbound and outbound
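
The difference in rule evaluation is the part that most often trips people up, so here is a toy model of it. It matches on port only, which is a deliberate simplification (real rules also match protocol, CIDR range, and direction), and all names are hypothetical.

```python
from typing import NamedTuple


class NaclRule(NamedTuple):
    number: int
    port: int
    action: str   # "allow" or "deny"


def nacl_allows(rules, port: int) -> bool:
    """NACL: walk rules in ascending number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action == "allow"
    return False   # nothing matched: the final catch-all rule denies


def security_group_allows(allowed_ports: set, port: int) -> bool:
    """Security group: allow-only rules, so any matching rule permits traffic."""
    return port in allowed_ports


nacl = [
    NaclRule(100, 22, "deny"),    # evaluated first, so SSH is blocked...
    NaclRule(200, 22, "allow"),   # ...and this later allow is never reached
    NaclRule(300, 80, "allow"),
]
print(nacl_allows(nacl, 22), nacl_allows(nacl, 80))
print(security_group_allows({80, 443}, 443))
```

Rule numbering therefore matters for NACLs: a deny placed at a lower number shadows any allow that follows it, while security groups have no ordering to reason about.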
VPC Equivalents in Other Clouds
GCP

Also called VPC. Key difference: GCP VPCs are global by default (span all regions). AWS VPCs are regional. In GCP, one VPC can have subnets in multiple regions. Subnets are regional. The Security Group equivalent is Firewall Rules (global, not per-instance). No direct NACL equivalent.

Azure

Called Virtual Network (VNet). Same concept: private IP space, subnets, gateways. Azure has Network Security Groups (NSGs) which work like AWS Security Groups but can be applied to subnets OR individual NICs. Azure also has Application Security Groups (ASGs) to group VMs logically.

2 Load Balancing Concepts

What is a Load Balancer?

A load balancer distributes incoming traffic across multiple backend servers. It's the entry point users hit; they don't talk to individual servers directly. This enables high availability (if one server dies, traffic goes elsewhere), horizontal scaling, and no single point of failure.

Layer 4 vs Layer 7 Load Balancing

L4: Transport Layer (TCP/UDP)

Routes traffic based on IP address and port number. Doesn't look inside the packet. Fast, low-overhead. Good for: non-HTTP traffic, TCP-based apps, ultra-low latency, gaming, VoIP, financial trading.

AWS: NLB (Network Load Balancer)

L7: Application Layer (HTTP/HTTPS)

Looks inside the HTTP request β€” path, hostname, headers, cookies. Can route /api/* to one group, /images/* to another. Smarter but slightly more overhead. Good for: web apps, microservices, content-based routing.

AWS: ALB (Application Load Balancer)

Load Balancing Algorithms

Algorithm | How it works | Best for
Round Robin | Send each request to next server in sequence: A, B, C, A, B, C... | Similar servers, similar request sizes
Least Connections | Send to server with fewest active connections | Variable request processing time
IP Hash / Sticky Sessions | Same client IP always goes to same server | Apps that need session affinity (stateful)
Weighted | Some servers get more traffic by weight (70/30 split) | Gradual deployments (blue/green, canary)
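
Three of the algorithms in the table fit in a few lines each. This is a sketch of the selection logic only, not of how any particular load balancer implements it; the function names are hypothetical, and CRC32 is used for the IP hash simply because it is deterministic across runs (Python's built-in `hash()` for strings is randomized per process).

```python
import itertools
import zlib


def round_robin(servers):
    """Yield servers in a repeating sequence: A, B, C, A, B, C..."""
    return itertools.cycle(servers)


def least_connections(active_connections: dict) -> str:
    """Pick the server currently handling the fewest connections."""
    return min(active_connections, key=active_connections.get)


def ip_hash(client_ip: str, servers: list) -> str:
    """Map a client IP to a fixed server so repeat requests land together."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


servers = ["A", "B", "C"]
rr = round_robin(servers)
print([next(rr) for _ in range(5)])
print(least_connections({"A": 5, "B": 2, "C": 7}))
print(ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers))
```

Note that simple modulo hashing reshuffles most clients whenever the server list changes, which is one reason external session storage is usually preferred over relying on stickiness.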

Health Checks

Load balancers continuously ping backend servers (e.g., HTTP GET /health every 30s). If a server fails health check 3 times, the LB removes it from rotation. When it recovers and passes, it's added back. This is how HA works in practice.
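
The remove-and-restore behaviour can be modelled as a small state machine. This sketch follows the description above (out of rotation after 3 consecutive failures, back in after a passing check); real load balancers usually also require several consecutive passes before restoring a target. The class name and threshold parameter are illustrative.

```python
class Backend:
    """Tracks health-check results for one backend server."""

    def __init__(self, name: str, unhealthy_threshold: int = 3):
        self.name = name
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0
        self.in_rotation = True

    def record_check(self, passed: bool) -> None:
        if passed:
            # A passing check resets the failure streak and restores the server.
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.in_rotation = False


web1 = Backend("web-1")
for result in (False, False, False):
    web1.record_check(result)
print(web1.in_rotation)    # False: removed after the third failure
web1.record_check(True)
print(web1.in_rotation)    # True: added back once it passes again
```

Requiring several failures in a row avoids ejecting a healthy server because of a single slow response.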

3 CDN Concepts: Content Delivery Networks

The Problem CDNs Solve

Your origin server is in us-east-1. A user in Mumbai requests your 5MB homepage image. The packet travels ~14,000 km. High latency. With a CDN, that image is cached in an edge server in Mumbai, and the user gets it from there. Fast.

CDN: Cache Hit vs Cache Miss Flow
  WITHOUT CDN:                          WITH CDN (cache HIT):
  User (Mumbai) ------------------>     User (Mumbai) --> Edge (Mumbai) --> User
   14,000km to us-east-1                                  [cached!] <---+
   Response: ~300ms latency                               Response: ~5ms latency

  FIRST REQUEST (cache MISS):
  User (Mumbai) --> Edge (Mumbai) --> Origin (us-east-1) --> Edge caches it
                                       Response: ~300ms (one time)

  SUBSEQUENT REQUESTS (cache HIT, within TTL):
  User (Mumbai) --> Edge (Mumbai) --> Serve from cache -> ~5ms ✓

Key CDN Concepts

  • Origin: The source of truth, i.e. your actual server (S3 bucket, EC2, ALB).
  • Edge Location: CDN's cache servers distributed globally.
  • TTL (Time To Live): How long content is cached before being re-fetched from origin. Too long = stale content. Too short = defeats the purpose.
  • Cache Invalidation: Manually expire cached content when you deploy new files. In CloudFront, you create an invalidation request.
  • Origin Shield: Extra caching layer between edge locations and origin, reducing origin load. One central cache instead of 100s of edges hitting origin.

What to cache: Static assets (images, CSS, JS, videos). What NOT to cache: User-specific pages, API responses with sensitive data, frequently changing data (unless you manage TTL carefully).
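The hit, miss, TTL, and invalidation ideas above fit in a small in-memory model. This is a sketch under stated assumptions: `EdgeCache`, `fetch_from_origin`, and the explicit `now` argument (used instead of a real clock so the behaviour is deterministic) are hypothetical names, not part of any CDN API.

```python
class EdgeCache:
    """Toy edge cache: serves from memory until an entry's TTL expires."""

    def __init__(self, fetch_from_origin, ttl_seconds):
        self.fetch_from_origin = fetch_from_origin  # callable: path -> body
        self.ttl_seconds = ttl_seconds
        self._store = {}  # path -> (body, expires_at)

    def get(self, path, now):
        entry = self._store.get(path)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"
        # Cache miss or expired entry: go back to the origin and re-cache.
        body = self.fetch_from_origin(path)
        self._store[path] = (body, now + self.ttl_seconds)
        return body, "MISS"

    def invalidate(self, path):
        # Equivalent in spirit to an invalidation request after a deploy.
        self._store.pop(path, None)
```

With a 60-second TTL, a request at `now=0` is a miss, a request at `now=30` is a hit, and a request at `now=60` goes back to the origin.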

CDN Services
AWS

CloudFront: AWS's CDN. 600+ PoPs. Integrates with S3, EC2, ALB. Supports Lambda@Edge for dynamic logic at the edge.

GCP

Cloud CDN: Works with Cloud Load Balancing. Also Media CDN for high-throughput streaming.

Azure

Azure Front Door: combines CDN, WAF, and global load balancing in one product. More feature-rich than a pure CDN. Also legacy Azure CDN (being retired in favour of Front Door).

Azure-Only

Azure Front Door's global load balancing (routing users to the closest healthy region based on latency, not just caching) is more tightly integrated than AWS CloudFront + Route 53 combination.

CC-M5

Security Concepts, IaC & Modern Patterns

1 Cloud Security: IAM, Encryption & Zero Trust

Identity & Access Management (IAM): Core Concepts

IAM answers: Who are you? What can you do? To what resources?

  • Authentication: Proving who you are (password, MFA, API key)
  • Authorization: What you're allowed to do once authenticated (policies)
  • Principal: An entity that can make requests (user, role, service)
  • Principle of Least Privilege: Grant only the permissions needed for the specific task. Not "give admin and let them figure it out."

Encryption

Encryption at Rest

Data encrypted while stored. If someone steals a hard drive, they get garbage. AWS does this for EBS, S3, RDS with keys managed by KMS. In S3, you can enable SSE-S3 (S3 manages the key for you) or SSE-KMS (the key lives in KMS, where you control its key policy and get an audit trail of its use).

Encryption in Transit

Data encrypted while moving over a network. Uses TLS (formerly SSL). HTTPS is HTTP + TLS. Your AWS API calls are all HTTPS. Between services: use TLS wherever possible. Between on-prem and AWS: VPN or Direct Connect with MACsec.

MFA: Multi-Factor Authentication

Something you know (password) + something you have (phone/hardware key). Even if your AWS root password is stolen, attacker can't login without your MFA device. Always enable MFA on root account and all IAM users with console access.

Zero Trust Model

Traditional model: "Trust everything inside the network perimeter." Zero Trust: "Trust nothing, verify everything." Even requests from inside the VPC are not automatically trusted: authenticate and authorize every request. Implemented via mutual TLS (mTLS), service meshes (Istio), and strict IAM policies.

Security Anti-Patterns to Avoid

  1. Using the root account for daily operations: create IAM users/roles instead.
  2. Hardcoding AWS credentials in code: use IAM roles for EC2/Lambda.
  3. Opening SSH port 22 to the world (0.0.0.0/0): use SSM Session Manager instead.
  4. Storing secrets unencrypted in environment variables: use Secrets Manager.

2 Infrastructure as Code (IaC)

What is IaC and Why Does It Matter?

IaC means defining your cloud infrastructure in code files (YAML, JSON, HCL) instead of clicking through the console. You check these files into Git, review them in PRs, run them through CI/CD. Infrastructure becomes reproducible, auditable, and versionable.

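Declarative

"I want 3 EC2 instances with these properties." The tool figures out HOW to make that happen. CloudFormation and Terraform work this way. Pulumi programs also resolve to a desired state, even though you write them in a general-purpose language.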

Imperative

"First create VPC, then subnet, then EC2..." You specify exact steps. AWS CDK, scripts with AWS CLI/SDK.

Key IaC Tools

Tool | Type | Language | Multi-cloud? | Best for
AWS CloudFormation | Declarative | YAML/JSON | AWS only | AWS-native teams, no extra setup needed
Terraform | Declarative | HCL | Yes (all clouds) | Multi-cloud, most popular in industry
AWS CDK | Imperative/Declarative | Python/TS/Java | AWS only | Devs who prefer real languages over YAML
Pulumi | Imperative | Python/TS/Go/C# | Yes | Teams wanting full programming language power
Example: Terraform vs CloudFormation

Both can create an S3 bucket. Terraform uses HCL: resource "aws_s3_bucket" "my_bucket" { bucket = "my-app-bucket" }. CloudFormation uses YAML with an AWSTemplateFormatVersion header and more verbose syntax. Terraform is more readable and multi-cloud but requires the Terraform binary. CloudFormation is AWS-native and has deeper service integration (like StackSets for multi-account deployments).

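What "the tool figures out HOW" means for a declarative tool can be pictured as a diff between desired and actual state. The sketch below is a deliberately simplified illustration; the function name `plan` and the dictionary shapes are assumptions, and real tools such as Terraform or CloudFormation also track dependencies, ordering, and provider-specific update rules.

```python
def plan(desired, actual):
    """Return the create/update/delete actions needed to reach `desired`.

    Both arguments map a resource name to its properties (plain dicts).
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


desired = {"web-1": {"type": "t3.micro"}, "web-2": {"type": "t3.micro"}}
actual = {"web-1": {"type": "t3.small"}, "old-db": {"type": "m5.large"}}
print(plan(desired, actual))
```

Running the plan a second time after the changes are applied yields an empty list, which is the property that makes declarative definitions safe to re-run.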
3 Serverless & Containers: Modern App Patterns

Serverless

You write functions, the cloud runs them. No servers to provision, patch, or manage. You pay only when code runs (per invocation + per ms of execution). Serverless ≠ no servers: there ARE servers, you just don't manage them.

Key characteristics: Event-driven (triggered by HTTP, S3 upload, queue message, schedule). Scales to zero (no traffic = no cost). Scales to millions (auto-scale). Stateless (function runs fresh each time).

Cold Start Problem

When a function hasn't run recently, the cloud provider needs to spin up a container and load your code. This takes 200ms-2 seconds (cold start). Subsequent warm calls add only ~1ms of overhead. Solutions: AWS Provisioned Concurrency (keep containers warm, extra cost), keep functions small (faster load), use lightweight runtimes (Python/Node typically start faster than Java).

Containers

Containers package your app + all dependencies (libraries, config, runtime) into a portable unit. Unlike VMs, containers share the host OS kernel, which makes them much more lightweight. Docker is the de-facto container standard.

VMs vs Containers
  Virtual Machines                    Containers
  ┌──────────────────────────┐        ┌──────────────────────────┐
  │ App A  │ App B  │ App C  │        │ App A  │ App B  │ App C  │
  │ Libs   │ Libs   │ Libs   │        │ Libs   │ Libs   │ Libs   │
  │ OS     │ OS     │ OS     │        ├──────────────────────────┤
  ├────────┴────────┴────────┤        │     Container Runtime    │
  │      Hypervisor          │        │     (Docker/containerd)  │
  ├──────────────────────────┤        ├──────────────────────────┤
  │     Host OS              │        │     Host OS (ONE)        │
  ├──────────────────────────┤        ├──────────────────────────┤
  │     Hardware             │        │     Hardware             │
  └──────────────────────────┘        └──────────────────────────┘
  Each VM: 1-2GB RAM overhead         Each Container: ~50MB overhead
  Slow to start (minutes)             Starts in seconds

Container Orchestration

When you run 100s of containers, you need something to manage them: scheduling, health checks, service discovery, rolling updates, secret management. Kubernetes is the industry standard.

Concept | What it means
Pod | Smallest deployable unit in Kubernetes: 1+ containers sharing network/storage
Deployment | Desired state: "run 3 replicas of this pod"
Service | Stable network endpoint for pods (pods restart with new IPs; the Service address stays the same)
Ingress | HTTP routing rules (like an L7 LB/reverse proxy for K8s)
Namespace | Logical isolation within a cluster (like separate teams/environments)
Serverless & Containers: All Providers
AWS

Serverless: Lambda | Containers: ECS, EKS (managed K8s), Fargate (serverless containers) | Container Registry: ECR

GCP

Serverless: Cloud Functions, Cloud Run (containers as serverless!) | Containers: GKE (Google Kubernetes Engine) | Registry: Artifact Registry

Azure

Serverless: Azure Functions, Container Apps | Containers: AKS (Azure Kubernetes Service), Container Instances | Registry: ACR (Azure Container Registry)

GCP-Only

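Cloud Run: Deploy any Docker container and it runs serverless (scale to zero, pay per request). More flexible than Lambda (any language, any binary). The closest AWS equivalent is Lambda with container images, which must implement the Lambda runtime interface; Cloud Run serves ordinary HTTP containers, although it still incurs cold starts when scaling up from zero.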

4 CI/CD Concepts in the Cloud

What is CI/CD?

Continuous Integration (CI): Every code commit is automatically built, tested, and validated. You catch bugs immediately, not 3 months later during a manual deployment.

Continuous Delivery (CD): After CI passes, the artifact (container, zip, AMI) is automatically deployed to staging. Deployment to production requires manual approval.

Continuous Deployment: Like CD but with no manual approval: changes go straight to production automatically after tests pass. Used by companies doing 100s of deploys per day.

Mermaid Diagram: CI/CD Pipeline Flow
graph LR
    A[Developer Push] --> B[Source Repo]
    B --> C[CI: Build & Test]
    C --> D{Tests Pass?}
    D -- No --> E[Notify Dev, Stop]
    D -- Yes --> F[Create Artifact]
    F --> G[Deploy to Staging]
    G --> H[Integration Tests]
    H --> I{Approved?}
    I -- Manual Approve --> J[Deploy to Prod]
    I -- Auto Deploy --> J
    J --> K[Monitor & Alert]

Deployment Strategies

Strategy | How it works | Downtime? | Rollback? | Best for
In-Place (Rolling) | Update existing servers one by one | Brief per server | Slow | Simple apps, non-critical
Blue/Green | Two identical envs. Swap DNS/LB to new version. Old stays as backup. | None | Instant (flip LB back) | Critical apps needing instant rollback
Canary | Send 5% of traffic to new version. Gradually increase if healthy. | None | Shift traffic back | Risk-sensitive features, A/B testing
Feature Flags | Deploy code disabled. Enable via config for % of users. | None | Toggle flag off | Gradual feature rollouts, experimentation
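One common way to implement the percentage split behind canary releases and feature flags is to hash a stable identifier into a bucket from 0 to 99. The sketch below assumes a hypothetical `route` function and string user IDs; it is one possible approach, not how any particular load balancer or flag service does it.

```python
import hashlib


def route(user_id, canary_percent):
    """Send a stable share of users to the canary, based on a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(1000)]
on_canary = sum(route(u, 5) == "canary" for u in users)
print(f"{on_canary} of {len(users)} users routed to the canary")
```

Because the bucket depends only on the user ID, a given user always sees the same version, and raising the percentage only adds users to the canary group; nobody already on the new version is moved back.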
CI/CD Tools: Cloud Providers
AWS

CodeCommit (Git repo; AWS stopped onboarding new customers in July 2024, so check its current status before adopting it) | CodeBuild (CI: build & test) | CodeDeploy (CD: deploy to EC2/Lambda/ECS) | CodePipeline (orchestrates all stages). Also integrates with GitHub, GitLab, Jenkins.

GCP

Cloud Source Repositories (Git; not available to new customers since June 2024) | Cloud Build (CI/CD) | Artifact Registry (store artifacts) | Cloud Deploy (managed delivery pipelines to GKE, Cloud Run).

Azure

Azure DevOps (all-in-one: repos, pipelines, boards, test plans, artifacts) | Azure Pipelines (CI/CD, free for open source). Azure DevOps is more mature/unified than AWS CodePipeline family.

Azure-Only

Azure DevOps Boards: Kanban/Scrum project tracking built into the same product as CI/CD. AWS doesn't have a native project management tool; you would need Jira, Linear, etc.

🟠 AWS Services & Concepts
AWS-M1

Compute

EC2 Elastic Compute Cloud

What is EC2?

EC2 is AWS's virtual machine service (IaaS). You rent a virtual server that runs on AWS hardware, choose the OS, configure storage and networking. You have full root/admin access. It's the foundation of most AWS architectures.

Key EC2 Components

Instance Types

EC2 instances come in families optimized for different workloads. The naming convention: [Family][Generation][Size] → e.g., m5.xlarge = General purpose, 5th gen, xlarge.
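The naming convention can be checked mechanically. The helper below is a hypothetical sketch: `parse_instance_type` and its regular expression are illustrative, and the optional letters after the generation number (such as the `g` in `c7g` for Graviton) are captured as an extra `attributes` field that the convention above does not spell out.

```python
import re

# family letters, generation digits, optional attribute letters, then ".size"
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.(.+)$")


def parse_instance_type(name):
    """Split an EC2 instance type such as 'c7g.xlarge' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "attributes": attributes,  # e.g. "g" = Graviton, "d" = local NVMe disk
        "size": size,
    }


for example in ("m5.xlarge", "c7g.xlarge", "p4d.24xlarge"):
    print(example, parse_instance_type(example))
```

Some special-purpose names do not follow this pattern, so the function raises `ValueError` rather than guessing.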

Family | Optimized for | Examples | Use case
t3, t4g | Burstable (credits) | t3.micro, t4g.small | Dev/test, low-traffic sites
m5, m6i, m7i | General purpose | m5.xlarge, m6i.2xlarge | Web servers, app servers, small DBs
c5, c6i, c7g | Compute-optimized | c5.2xlarge, c7g.xlarge | Batch processing, gaming, video encoding
r5, r6i, r7g | Memory-optimized | r5.4xlarge, r6i.large | In-memory DBs (Redis), big data, SAP HANA
i3, i4i | Storage-optimized | i3.xlarge, i4i.2xlarge | High IOPS workloads, Cassandra, Elasticsearch
p4, p5, g4, g5 | GPU-accelerated | p4d.24xlarge, g5.xlarge | ML training, inference, 3D rendering, gaming
Graviton (arm64) Instances

t4g, m7g, c7g, r7g are ARM-based instances using AWS's own Graviton chips. They are typically priced around 20% below comparable x86 instances, and AWS claims up to 40% better price-performance. If your app can run on ARM (most Linux apps can), prefer Graviton. This is a key cost-optimization lever.

AMI: Amazon Machine Image

An AMI is a pre-configured template (OS + optional installed software) used to launch EC2 instances. Like a VM snapshot you can clone. AWS provides Amazon Linux, Ubuntu, RHEL, Windows images. You can create custom AMIs (e.g., "Amazon Linux + Nginx + your app pre-installed") for faster launches. Such an image is called a "Golden AMI".

User Data

A bash script that runs once on first boot. Used to install packages, download code, configure services without baking them into an AMI. Passed at launch time:

#!/bin/bash
yum update -y
yum install -y nginx
systemctl start nginx
systemctl enable nginx

Key Pairs

RSA or ED25519 key pair for SSH access. AWS stores the public key; you keep the private key (.pem file). ssh -i my-key.pem ec2-user@<public-ip>. If you lose the private key, you can no longer SSH in with that key pair, because AWS keeps no copy of it. Best practice: use SSM Session Manager instead of SSH (no key pair, no open port 22 needed, auditable).

Instance Metadata Service (IMDS)

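From inside an EC2 instance, you can query http://169.254.169.254/latest/meta-data/ to get instance info: instance ID, public IP, IAM role credentials, AZ, etc. Critical for automation scripts running on EC2. IMDSv2 (session-based, requires a token) is the recommended version and the default on newer AMIs such as Amazon Linux 2023; you can enforce it per instance by setting HttpTokens to required.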

# Get instance ID from inside the instance
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id

# Get IAM role temporary credentials
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole

EC2 Pricing Models

Model | How it works | Discount vs On-Demand | Best for
On-Demand | Pay by the hour/second. No commitment. | None (baseline) | Unpredictable workloads, short-term dev/test
Reserved Instances (RI) | 1-year or 3-year commitment to a specific instance type/region. | Up to 72% off | Steady-state production workloads
Savings Plans | 1-year or 3-year commitment to $X/hr of usage. Compute Savings Plans apply to any instance family and region; EC2 Instance Savings Plans are tied to one family in one region. | Up to 66% off (Compute), up to 72% off (EC2 Instance) | More flexible than RIs for similar savings
Spot Instances | Use spare EC2 capacity at the current Spot price (no bidding; you may set an optional maximum price). AWS can reclaim the instance with a 2-minute notice. | Up to 90% off | Fault-tolerant, batch jobs, big data, CI runners
Dedicated Hosts | Physical server dedicated to you. Useful for per-socket/per-core licenses. | More expensive | Compliance, BYOL software licenses
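To get a feel for what these discounts mean on a bill, the arithmetic can be sketched as below. The hourly rate is made up for illustration, the discounts are the "up to" best cases from the table rather than quotes for any real instance, and `monthly_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def monthly_cost(on_demand_hourly, discount_percent=0.0, hours=HOURS_PER_MONTH):
    """Monthly cost in dollars after applying a percentage discount."""
    return round(on_demand_hourly * hours * (1 - discount_percent / 100), 2)


HYPOTHETICAL_RATE = 0.10  # dollars per hour, not a real AWS price

best_case_discounts = [
    ("On-Demand", 0),
    ("Savings Plan", 66),
    ("Reserved", 72),
    ("Spot", 90),
]
for model, discount in best_case_discounts:
    print(f"{model:<13} ${monthly_cost(HYPOTHETICAL_RATE, discount):.2f} per month")
```

Real discounts depend on term length, payment option, instance family, and region, so treat the output as an upper bound on savings.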

Placement Groups

Controls how AWS places EC2 instances on physical hardware:

  • Cluster: Pack instances close together in same AZ. Ultra-low latency network (~25Gbps). Use for: HPC, big data jobs needing fast node-to-node comms. Risk: AZ failure takes all down.
  • Spread: Instances on different hardware. Reduces correlated hardware failure. Max 7 instances per AZ per group. Use for: small critical clusters of distinct VMs.
  • Partition: Groups of instances in different partitions (separate racks). Good for large distributed systems (Kafka, Hadoop, Cassandra) where partial failures are tolerable.

Elastic IP (EIP)

A static public IPv4 address you can allocate and associate with an EC2 instance. When an EC2 instance stops and starts, its auto-assigned public IP changes; an EIP stays fixed. But: since February 2024 AWS charges hourly for every public IPv4 address, including EIPs attached to running instances, so an idle EIP is pure waste. Best practice: use a load balancer with a stable DNS name instead of EIPs for production.

EC2 Equivalents
GCP

Compute Engine (GCE). Similar instance types. GCP offers Spot VMs (like AWS Spot), which supersede the older Preemptible VMs. GCP's equivalent of AMIs are Custom Images. GCP has Committed Use Discounts (CUDs) instead of RIs/Savings Plans.

Azure

Azure Virtual Machines. Pricing: Pay-as-you-go (On-Demand), Reserved VM Instances (1 or 3 yr), Spot VMs (like AWS Spot). Azure's equivalent of AMIs are Azure VM Images (stored in Compute Gallery).

Lambda Serverless Functions

What is Lambda?

AWS Lambda lets you run code without provisioning any servers. You upload a function (zip or container), define what triggers it, and Lambda runs it on-demand. You're billed per invocation + per GB-second of memory used. No code running = zero cost.

Key Lambda Concepts

Triggers (Event Sources)

Lambda is event-driven. Something must trigger it:

HTTP/API

API Gateway → Lambda. REST or WebSocket APIs.

S3 Events

File uploaded to S3 → Lambda. Common for image processing, ETL.

Scheduled

EventBridge cron rule → Lambda. Like cron jobs, serverless.

Queue/Stream

SQS message → Lambda. Kinesis stream → Lambda. Event processing.

DynamoDB Stream

Record change in DynamoDB → Lambda. Triggers on insert/update/delete.

SNS / EventBridge

Pub/sub messages or event bus events → Lambda. Decoupled architectures.

Execution Environment

Lambda runs your code inside a Firecracker microVM (a lightweight virtual machine). Your function gets:

  • Memory: 128MB to 10GB. CPU scales proportionally with memory.
  • Timeout: Max 15 minutes per invocation.
  • /tmp storage: 512MB to 10GB ephemeral disk (may survive between warm invocations of the same environment, but is never guaranteed to persist).
  • Ephemeral by design: Don't rely on state persisting between invocations.

Cold Start

When Lambda hasn't run recently, AWS needs to initialize the execution environment (download code, start runtime, run initialization code). This adds 200ms-2s latency. Subsequent "warm" invocations reuse the same container (~1ms overhead).

# Lambda handler (Python example)
import boto3

# Code HERE runs on every COLD start (container init)
s3_client = boto3.client('s3')  # Initialize once, reused on warm invocations

def handler(event, context):
    # Code HERE runs on EVERY invocation (warm or cold)
    bucket = event['bucket']
    key = event['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return {'statusCode': 200, 'body': response['Body'].read().decode()}

Layers

Lambda Layers are zip archives containing dependencies (libraries) that can be shared across multiple functions. Instead of bundling numpy in every ML Lambda, put it in a layer and reference it. Max 5 layers per function. Reduces deployment package size and enables sharing.

Concurrency

Lambda scales horizontally automatically. If 1000 events arrive simultaneously, Lambda spins up as many as 1000 instances of your function. Default account limit: 1000 concurrent executions per Region (soft limit, can increase). You can set Reserved Concurrency (cap a function to protect others) or Provisioned Concurrency (keep containers warm, eliminate cold starts, extra cost).

IAM Execution Role

Each Lambda function has an execution role: an IAM role Lambda assumes to make API calls. If your function needs to read from S3, the execution role must have s3:GetObject permission. Never put AWS credentials inside Lambda code; use the execution role.

Real-World Lambda Architecture

User uploads profile photo to S3 → S3 triggers Lambda → Lambda resizes image to thumbnail, saves to a different S3 path, records metadata in DynamoDB, sends SNS notification "Profile photo processed." Zero servers, auto-scales, costs pennies per million photos.

Serverless Functions: Equivalents
GCP

Cloud Functions (event-driven, like Lambda) and Cloud Run (containerized serverless: more flexible, any language, scale to zero). Cloud Run is often preferred over Cloud Functions for complex apps.

Azure

Azure Functions. Same concept. Supports Consumption Plan (pay-per-use, cold starts), Premium Plan (pre-warmed, VNet integration, no cold starts), and Dedicated Plan (runs on App Service Plan). Durable Functions is an Azure-specific extension for stateful workflows written in code; the closest AWS counterpart is Step Functions orchestrating Lambda.

ECS / EKS / Fargate Container Services

AWS Container Ecosystem Overview

Container Service Decision Tree
  Need containers on AWS?
          │
          ▼
  Kubernetes or AWS-native orchestration?
  ┌──────────────────┬──────────────────┐
  │  AWS-native      │  Kubernetes      │
  │  (ECS)           │  (EKS)           │
  └────────┬─────────┴────────┬─────────┘
           │                  │
  Where to run containers?    │
  ┌─────────────────────────────────────┐
  │                                     │
  EC2 (you manage nodes)    Fargate (serverless nodes)
  More control, cheaper     No node management, slightly pricier
  for stable workloads      great for variable/small workloads

ECS: Elastic Container Service

AWS's own container orchestration service. Not Kubernetes: it is AWS's proprietary system. Simpler to operate than EKS for pure AWS workloads.

  • Task Definition: JSON document defining your container(s): image URI, CPU/memory, port mappings, env vars, logging, IAM role. Think of it like a Pod spec in Kubernetes.
  • Task: A running instance of a Task Definition. Like a Pod.
  • Service: Ensures a desired number of tasks are running. Handles health checks, restarts, load balancer integration, rolling deploys. Like a Deployment + Service in K8s.
  • Cluster: Logical group of resources (EC2 instances or Fargate capacity) where tasks run.
# Example Task Definition (simplified JSON; account ID and role name are placeholders)
{
  "family": "my-web-app",
  "networkMode": "awsvpc",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "web",
    "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/my-app:v1.2",
    "cpu": 256,
    "memory": 512,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "environment": [{"name": "ENV", "value": "production"}],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-web-app",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "web"
      }
    }
  }],
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512"
}

EKS: Elastic Kubernetes Service

Managed Kubernetes. AWS runs and manages the Kubernetes control plane (API server, etcd). You manage the worker nodes (EC2 node groups) or use Fargate. Best when you need Kubernetes compatibility (standard K8s manifests, Helm charts, existing K8s tooling).

  • Managed Node Groups: AWS creates/updates EC2 instances as worker nodes. You pick instance type, scaling policy.
  • Fargate Profiles: Pods matching certain selectors run on Fargate (serverless).
  • Add-ons: Managed plugins like CoreDNS, kube-proxy, VPC CNI, AWS Load Balancer Controller.

Fargate: Serverless Containers

Fargate is a compute engine for ECS and EKS where AWS manages the underlying compute. You just specify CPU/memory for your container: no node groups to manage, no EC2 to patch.

Use Fargate when

Variable workloads, you don't want to manage nodes, small team, serverless containers, batch jobs, don't need GPU.

Use EC2 nodes when

Need GPU instances, need specific instance types, want Spot instance savings, need local NVMe storage, running Windows containers on EKS (ECS on Fargate does support Windows), very high compute needs.

ECR: Elastic Container Registry

AWS's private Docker image registry. Like Docker Hub but private and integrated with IAM. Push images here, pull from ECS/EKS. ECR also scans images for security vulnerabilities. Free private repos (storage charged separately). You authenticate with: aws ecr get-login-password | docker login ...

Container Services: Equivalents
GCP

GKE (Google Kubernetes Engine: a mature managed K8s service from Google, where Kubernetes originated) | Cloud Run (serverless containers, like Fargate but easier) | Artifact Registry (like ECR). GCP does NOT have an ECS equivalent; it steers users to GKE or Cloud Run.

Azure

AKS (Azure Kubernetes Service) | Azure Container Apps (serverless containers, like Cloud Run, built on K8s internally) | Azure Container Instances (ACI) (simple single-container runs, like Fargate but simpler) | ACR (Azure Container Registry).

AWS-M2

Storage

S3 Simple Storage Service

What is S3?

S3 is AWS's object storage service and one of its most fundamental services. Store any file (object) up to 5TB in size. Highly durable (99.999999999%, eleven 9s), highly available, globally accessible. Used for: static file hosting, backup, data lake, ML training data, CloudFront origin, application logs, artifacts.

Key S3 Concepts

Buckets & Objects

A bucket is a container (globally unique name). An object is the file + metadata stored in a bucket. Objects are addressed by a key (the "path"): s3://my-bucket/images/profile/user123.jpg. Despite looking like folders, S3 is flat: the "/" is just part of the key name. The "folders" you see in console are just a UI fiction (prefix grouping).
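The "folders are a UI fiction" point can be shown with a small in-memory model of a prefix/delimiter listing, loosely mirroring what the S3 list API returns as objects and common prefixes. `list_prefix` is a hypothetical helper operating on a plain list of keys; it makes no AWS calls.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Group flat keys into direct objects and folder-like common prefixes."""
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter looks like a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = [
    "images/profile/user123.jpg",
    "images/banner.png",
    "logs/app.log",
    "readme.txt",
]
print(list_prefix(keys))
print(list_prefix(keys, prefix="images/"))
```

The store holds only four flat keys; "images/" and "logs/" appear purely because the listing groups keys by the delimiter.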

S3 Storage Classes

S3 Storage Classes: Access Frequency vs Cost
  FREQUENTLY ACCESSED ◄──────────────────────────────► RARELY ACCESSED
  HIGHEST COST                                           LOWEST COST

  S3 Standard     │ S3 Intelligent │ S3 Standard-IA │ S3 Glacier    │ S3 Glacier
                  │ Tiering        │                │ Instant       │ Deep Archive
                  │                │                │ Retrieval     │
  ----------------│----------------│----------------│---------------│-----------
  Any data        │ Unknown or     │ Backups,       │ Long-term     │ Long-term
  accessed        │ changing       │ disaster       │ backups, read │ archive, 7-10yr
  frequently      │ access pattern │ recovery       │ ~1/quarter    │ retention
                  │ Auto-moves     │                │               │
  Retrieval: ms   │ between tiers  │ Retrieval: ms  │ Retrieval: ms │ Retrieval: 12hr
                  │                │ Min 30 days    │ Min 90 days   │ Min 180 days

Intelligent-Tiering

If you're unsure how frequently an object will be accessed, use Intelligent-Tiering. S3 monitors access patterns and automatically moves objects between tiers. There is a small monthly fee per 1000 objects for this monitoring, but it saves on storage cost. Best for new data lakes where access patterns are unknown.

Versioning

Enable versioning on a bucket to keep multiple versions of every object. Protects against accidental deletes and overwrites. When you delete an object, S3 adds a "delete marker"; the old version still exists and you can restore it. Once enabled, versioning can be suspended but NOT fully disabled. Versions accumulate cost, so use Lifecycle rules to clean up old versions.

Lifecycle Policies

Automate object transitions between storage classes or expiration:

# Example: Move to IA after 30 days, Glacier after 90 days, delete after 365 days
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
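To see what a rule like the one above does to a single object over time, the transitions can be simulated locally. This is a simplified model: `storage_class_for_age` is a hypothetical helper, it assumes objects start in STANDARD, and it ignores that S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day.

```python
def storage_class_for_age(age_days, transitions, expiration_days):
    """Return the storage class for an object of this age, or None if expired."""
    if age_days >= expiration_days:
        return None  # the object has been deleted by the Expiration action
    current = "STANDARD"
    for transition in sorted(transitions, key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule_transitions = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
]
for age in (10, 45, 120, 400):
    print(age, storage_class_for_age(age, rule_transitions, expiration_days=365))
```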

Bucket Policies vs ACLs vs IAM

Method | What it controls | Use when
IAM Policy | What an IAM user/role can do with S3 | Controlling access for your AWS users/services
Bucket Policy | JSON policy on the bucket itself. Can grant cross-account access. | Granting access to other AWS accounts, making bucket public, enforcing HTTPS
ACLs | Legacy per-bucket/per-object permissions | Avoid if possible. Disabled by default on new buckets (Object Ownership set to "bucket owner enforced").
# Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}

Pre-signed URLs

Temporarily grant access to a private object without making it public. A pre-signed URL is signed with your credentials and has an expiry. Your backend generates it and sends it to a user, who can then download the private file for the next 15 minutes. Used for: file downloads in apps, direct-to-S3 uploads from browser (bypasses your server).

# Generate pre-signed URL (Python boto3)
import boto3

s3_client = boto3.client('s3')
url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'report.pdf'},
    ExpiresIn=900  # 15 minutes
)
# Now share this URL; it expires in 15 minutes automatically

S3 Replication

  • Cross-Region Replication (CRR): Replicate objects to a bucket in another region. For DR, compliance (for example, keeping a second copy in another EU region), lower latency for users in different regions.
  • Same-Region Replication (SRR): Replicate within same region to another bucket. For log aggregation, test-prod separation, compliance copies.

Both require versioning enabled. Replication is asynchronous (not instant). By default it does NOT replicate existing objects, only new uploads after replication is configured; existing objects need S3 Batch Replication.

Object Storage: Equivalents
GCP

Cloud Storage. Storage classes: Standard, Nearline (monthly access), Coldline (quarterly), Archive (yearly). Has Object Lifecycle Management (like S3 Lifecycle). Signed URLs = equivalent to Pre-signed URLs. HMAC keys for S3-compatible API access.

Azure

Azure Blob Storage. Tiers: Hot (frequent), Cool (infrequent), Cold, Archive. Objects are called "blobs". Containers ≈ S3 buckets. Shared Access Signatures (SAS tokens) = equivalent to S3 Pre-signed URLs. Azure Data Lake Storage Gen2 (Blob + hierarchical namespace for analytics).

EBS Elastic Block Store

What is EBS?

EBS provides block storage volumes for EC2 instances, like a virtual hard drive. Unlike S3 (object storage accessible over HTTP), EBS appears as a raw block device to the OS (like /dev/xvda). You format it with a filesystem (ext4, xfs) and mount it. EBS volumes persist independently of the EC2 instance lifecycle: stopping an instance keeps its volumes, and a volume survives termination when its DeleteOnTermination flag is false (the default for additional data volumes; root volumes are deleted on termination by default).

EBS Volume Types

Type | Name | Max IOPS | Max Throughput | Best for
gp3 | General Purpose SSD | 16,000 | 1,000 MB/s | Most workloads. Default. Boot volumes, dev DBs.
gp2 | General Purpose SSD (legacy) | 16,000 | 250 MB/s | Legacy: migrate to gp3 (cheaper, more flexible)
io2 Block Express | Provisioned IOPS SSD | 256,000 | 4,000 MB/s | Mission-critical: SAP HANA, Oracle, high-perf DBs
io1 | Provisioned IOPS SSD | 64,000 | 1,000 MB/s | Production I/O-intensive databases
st1 | Throughput Optimized HDD | 500 | 500 MB/s | Big data, data warehouses, log processing
sc1 | Cold HDD | 250 | 250 MB/s | Infrequently accessed, lowest cost

gp3 vs gp2

gp2 IOPS are tied to volume size (3 IOPS/GB, so 100GB = 300 IOPS baseline). gp3 lets you configure IOPS independently of size. A gp3 volume starts at 3,000 IOPS and 125 MB/s regardless of size. gp3 is also 20% cheaper than gp2 per GB. Always prefer gp3 for new volumes.
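The size-to-IOPS relationship is simple enough to compute directly. The sketch below covers baseline IOPS only: the function names are hypothetical, it uses the 3 IOPS per GiB rule with gp2's 100 IOPS floor and 16,000 IOPS ceiling, and it ignores gp2 burst credits as well as any extra IOPS you can provision on gp3.

```python
def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return max(100, min(3 * size_gib, 16_000))


def gp3_baseline_iops(size_gib):
    """gp3 baseline is 3,000 IOPS regardless of volume size."""
    return 3_000


for size in (20, 100, 1_000, 6_000):
    print(f"{size:>5} GiB  gp2={gp2_baseline_iops(size):>6}  gp3={gp3_baseline_iops(size):>6}")
```

Under these assumptions a gp2 volume needs to be about 1,000 GiB before its baseline matches what every gp3 volume gets by default.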

EBS Snapshots

Point-in-time backup of an EBS volume to S3 (you don't see this S3 bucket; it's AWS-managed). Snapshots are incremental: first snapshot copies everything, subsequent snapshots only store changed blocks. You can create volumes from a snapshot in any AZ of the same Region (a handy way to move data across AZs). You can copy snapshots across regions (for DR). Cost: per GB-month of data stored in snapshot.

# Create snapshot via AWS CLI
aws ec2 create-snapshot --volume-id vol-0abc123 --description "Pre-deploy backup"

# Create volume from snapshot in different AZ (useful for migrating data)
aws ec2 create-volume --snapshot-id snap-0xyz789 --availability-zone ap-south-1b --volume-type gp3

EBS vs Instance Store

Feature | EBS | Instance Store
Persistence | Persists independently of instance | Data LOST when instance stops/terminates
Performance | Good (up to 256K IOPS) | Excellent (physically attached NVMe)
Cost | Separate charge per GB-month | Included in instance price
Use case | Boot volumes, databases, general storage | Temp data, buffers, cache, Kafka, Spark shuffle
Block Storage: Equivalents
GCP

Persistent Disks (standard HDD, balanced SSD, extreme SSD) and Hyperdisk (ultra-high performance). Google's equivalent of EBS. Also Local SSDs = instance store equivalent (ephemeral).

Azure

Azure Managed Disks. Types: Standard HDD, Standard SSD, Premium SSD, Ultra Disk (for SAP HANA, etc.). Azure also has Azure Shared Disks (multi-VM attach, for Windows WSFC clusters).

EFS Elastic File System

What is EFS?

EFS is a managed NFS (Network File System) that can be mounted by multiple EC2 instances simultaneously across multiple AZs. Unlike EBS (one instance at a time), EFS is shared storage. It grows and shrinks automatically, so you pay only for what you use. No capacity planning needed.

Key EFS Features

  • Multi-AZ by default: Data stored redundantly across multiple AZs. Highly durable and available.
  • Shared mount: 100s or 1000s of EC2 instances can mount the same EFS simultaneously. Read AND write from multiple instances.
  • Performance modes: General Purpose (low latency) | Max I/O (high throughput, slightly higher latency for massively parallel workloads).
  • Throughput modes: Elastic (auto-scales throughput with load), Bursting (throughput proportional to size), Provisioned (fix throughput independently).
  • Storage tiers: Standard (active) → Infrequent Access (EFS IA, cheaper) via lifecycle policies.
# Mount EFS on EC2 (Amazon Linux)
sudo yum install -y amazon-efs-utils
sudo mkdir /mnt/efs
sudo mount -t efs fs-0abc12345:/ /mnt/efs
# Or add to /etc/fstab for persistent mount:
echo "fs-0abc12345:/ /mnt/efs efs defaults,_netdev 0 0" | sudo tee -a /etc/fstab
Use EFS when

Shared content (CMS media files), home directories for multiple users, container shared storage, web farm with shared assets, machine learning training data accessed by multiple GPU nodes.

Don't use EFS when

App needs a database (use RDS), high-performance single-instance block storage (use EBS), object storage for files (use S3), very cost-sensitive (~3x more expensive than EBS per GB).

Feature | S3 | EBS | EFS
Type | Object | Block | File (NFS)
Access | HTTP API / SDK | Single EC2 (usually) | Multiple EC2, multiple AZs
Durability | 99.999999999% (11 nines) | 99.8-99.9% (gp2/gp3), 99.999% (io2) | 99.999999999% (11 nines)
Use case | Blobs, backups, data lake | Boot disk, databases | Shared file system
Cost (approx, per GB-month) | $0.023 | $0.08 (gp3) | $0.30 (Standard)
Shared File Storage: Equivalents
GCP

Filestore: managed NFS. Similar to EFS. Also Cloud Storage FUSE (mount a GCS bucket as a filesystem, not true NFS).

Azure

Azure Files: managed SMB/NFS file shares. Works with Windows AND Linux. Azure NetApp Files for enterprise NAS workloads (SAP, Oracle). Azure also has Azure File Sync to sync on-prem Windows file servers with Azure Files.

Azure-Only

Azure File Sync: Extend your on-prem Windows File Server to Azure Files automatically. No direct AWS equivalent; the closest options are AWS Storage Gateway (File Gateway) or DataSync, otherwise custom scripting. Common hybrid use case for enterprises migrating file shares to cloud.

AWS-M3

Networking Deep Dive

VPC Virtual Private Cloud β€” Deep Dive

VPC Advanced Concepts

VPC Peering

Connect two VPCs so resources can communicate using private IPs, as if they were in the same network. Can peer across accounts and regions. Non-transitive: if VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot talk to VPC-C through VPC-B. You'd need a direct peering or Transit Gateway.

VPC-A ←──peering──► VPC-B ←──peering──► VPC-C
EC2 in VPC-A → VPC-B: ✓ (direct peering)
EC2 in VPC-A → VPC-C: ✗ (non-transitive: no direct peering)
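Non-transitivity can be modelled as "only a direct peering counts". This is a minimal sketch with hypothetical names (`PEERINGS`, `can_reach`); it ignores route tables and security groups, which also have to allow the traffic in a real VPC.

```python
# Each peering connection is an unordered pair of VPCs.
PEERINGS = {
    frozenset({"VPC-A", "VPC-B"}),
    frozenset({"VPC-B", "VPC-C"}),
}


def can_reach(src: str, dst: str, peerings=PEERINGS) -> bool:
    """True only if src and dst are the same VPC or share a DIRECT peering.

    No path search is done on purpose: VPC peering never forwards traffic
    through an intermediate VPC.
    """
    return src == dst or frozenset({src, dst}) in peerings


if __name__ == "__main__":
    print("A -> B:", can_reach("VPC-A", "VPC-B"))
    print("A -> C:", can_reach("VPC-A", "VPC-C"))
```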

Transit Gateway (TGW)

A central hub that connects multiple VPCs and on-prem networks. Solves the peering mesh problem: instead of N×(N-1)/2 peering connections for N VPCs, you connect each VPC to one TGW. TGW is transitive. Think of it as a cloud router. Supports: inter-VPC, VPC-to-on-prem (via VPN/Direct Connect), inter-region peering via TGW.

Transit Gateway vs VPC Peering at Scale
  WITHOUT TGW (5 VPCs, 10 peerings needed):     WITH TGW (5 VPCs, 5 attachments):
  VPC-A ──── VPC-B                               VPC-A ──┐
  VPC-A ──── VPC-C                               VPC-B ──┤
  VPC-A ──── VPC-D                               VPC-C ──┼── Transit Gateway ── On-Prem
  VPC-A ──── VPC-E                               VPC-D ──┤
  VPC-B ──── VPC-C ... etc.                      VPC-E ──┘
  Non-transitive, complex route tables.          Central hub, transitive, one TGW.
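The difference in growth rate is plain arithmetic. A small sketch (function names are illustrative) that compares the full-mesh peering count N×(N-1)/2 with one TGW attachment per VPC:

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count: int) -> int:
    """Attachments needed when every VPC connects to a single Transit Gateway."""
    return vpc_count


if __name__ == "__main__":
    for n in (3, 5, 10, 50):
        print(f"{n:>3} VPCs: {full_mesh_peerings(n):>5} peerings vs {tgw_attachments(n):>3} TGW attachments")
```

The peering count grows quadratically while the attachment count grows linearly, so the gap widens quickly as VPCs are added.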

VPC Endpoints

Access AWS services (S3, DynamoDB, SSM, etc.) from within your VPC without traffic leaving through the internet. Traffic stays on AWS's private network. More secure and often faster.

Type | How it works | Supported services
Gateway Endpoint | Free. A route table entry routes traffic to the AWS service. No ENI. | S3 and DynamoDB only
Interface Endpoint (PrivateLink) | Creates an ENI with a private IP in your subnet. DNS resolves the service to that private IP. Charged per hour + per GB of data. | 100+ services: SSM, Secrets Manager, KMS, API Gateway, ECR, and more
Real-World Example: Your Lambda in a private VPC subnet needs to call the SSM Parameter Store API. Without a VPC endpoint, it would have to route through a NAT Gateway (costly) to reach SSM's public endpoint. With an SSM Interface Endpoint, traffic goes Lambda → ENI in your VPC → private AWS backbone → SSM. No NAT cost, more secure.

VPC Flow Logs

Capture information about IP traffic going to/from network interfaces in your VPC. Sent to CloudWatch Logs or S3. Not real-time packet capture (use Traffic Mirroring for that), just metadata: source/dest IP, ports, protocol, bytes, action (ACCEPT/REJECT).

# Example flow log entries (default format; AWS account IDs are 12 digits):
# version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0abc 10.0.1.10 10.0.2.20 45678 443 6 20 4000 1620000000 1620000060 ACCEPT OK
2 123456789012 eni-0abc 1.2.3.4   10.0.1.10 12345 22  6  5  300  1620000010 1620000070 REJECT OK
# → Blocked SSH attempt from 1.2.3.4 to our server (Security Group or NACL blocked it)
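Because each record is a space-separated line, it can be parsed with a few lines of code. The sketch below assumes the default field order shown in the header comment above; the helper names (`parse_flow_log`, `is_rejected_ssh`) are hypothetical, and custom flow log formats would need a different field list.

```python
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_flow_log(line: str) -> dict:
    """Split one default-format flow log line into a dict of named fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        # "-" marks a field with no data (e.g. NODATA records); keep it as None.
        record[name] = None if record[name] == "-" else int(record[name])
    return record


def is_rejected_ssh(record: dict) -> bool:
    """True for a rejected TCP (protocol 6) connection to port 22."""
    return (
        record["action"] == "REJECT"
        and record["protocol"] == 6
        and record["dstport"] == 22
    )


if __name__ == "__main__":
    sample = "2 123456789012 eni-0abc 1.2.3.4 10.0.1.10 12345 22 6 5 300 1620000010 1620000070 REJECT OK"
    rec = parse_flow_log(sample)
    print(rec["srcaddr"], "->", rec["dstaddr"], "rejected SSH:", is_rejected_ssh(rec))
```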

NAT Gateway Details

  • Deployed in a public subnet with an Elastic IP
  • Private instances route 0.0.0.0/0 to the NAT GW → it translates their private IP to its public EIP → sends to internet
  • Cost: ~$0.045/hour + $0.045/GB data processed. In high-traffic envs, this adds up.
  • For high-availability: deploy a NAT Gateway in EACH AZ. Don't share one NAT GW across AZs (AZ failure kills outbound internet for other AZs).
  • NAT Instance vs NAT Gateway: NAT Instance is a self-managed EC2 instance doing NAT. Cheaper, more configurable, but you manage patching, HA. NAT Gateway is managed, scales automatically, no maintenance. Use NAT Gateway unless you have a specific reason for NAT Instance.
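To see how the NAT Gateway bill "adds up", here is a rough monthly estimate using the approximate prices quoted in the list above. The 730 hours/month figure and the function name are assumptions for illustration; actual prices vary by region.

```python
HOURLY_RATE_USD = 0.045   # approximate, from the list above
PER_GB_RATE_USD = 0.045   # approximate, from the list above
HOURS_PER_MONTH = 730     # assumption: average hours in a month


def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Estimated monthly cost: fixed hourly charge per gateway + data processing."""
    fixed = gateways * HOURS_PER_MONTH * HOURLY_RATE_USD
    variable = gb_processed * PER_GB_RATE_USD
    return fixed + variable


if __name__ == "__main__":
    # One gateway vs one per AZ (3 AZs), each scenario processing 1 TB in total.
    print(f"1 NAT GW, 1000 GB: ${nat_gateway_monthly_cost(1000):.2f}")
    print(f"3 NAT GW, 1000 GB: ${nat_gateway_monthly_cost(1000, gateways=3):.2f}")
```

The per-AZ layout recommended above triples only the fixed hourly part; the data-processing charge depends on traffic volume, not on the number of gateways.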
Route 53 DNS & Traffic Routing

What is Route 53?

AWS's managed DNS service. Also handles domain registration, health checks, and sophisticated traffic routing policies. Named after port 53 (DNS port). Has a 100% availability SLA, the only AWS service with this guarantee.

Hosted Zones

A hosted zone is a container for DNS records for a domain. Public hosted zone: records accessible over the internet (your website). Private hosted zone: records for resources within your VPC (internal service discovery, e.g. db.internal → 10.0.3.5).

Record Types

Record | Purpose | Example
A | Maps hostname to IPv4 address | api.example.com → 54.123.45.67
AAAA | Maps hostname to IPv6 address | api.example.com → 2001:db8::1
CNAME | Maps hostname to another hostname. Cannot be used on zone apex (root domain). | www.example.com → example.com
Alias | AWS-specific. Like CNAME but can be used on root domain. Points to AWS resources (ALB, CloudFront, S3). Free queries for Alias records. | example.com → my-alb.us-east-1.elb.amazonaws.com
MX | Mail exchange servers for email routing | example.com → mail1.example.com (priority 10)
TXT | Text records. Used for domain verification, SPF, DKIM. | example.com → "v=spf1 include:_spf.google.com ~all"
NS | Name server records: which DNS servers handle this zone | Automatically created by Route 53 when you create a zone
PTR | Reverse DNS: IP to hostname | 67.45.123.54.in-addr.arpa → api.example.com
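A PTR record is stored under a special name: the IPv4 octets reversed, followed by `in-addr.arpa`. Python's standard library can build that name, which is a quick way to check a reverse-DNS record name. The wrapper function name below is illustrative.

```python
import ipaddress


def ptr_record_name(ip: str) -> str:
    """Return the DNS name under which the PTR record for `ip` is stored."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    print(ptr_record_name("54.123.45.67"))
    print(ptr_record_name("2001:db8::1"))
```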

Routing Policies

Policy | How it routes | Use case
Simple | Returns one or more IPs (if multiple, all are returned in random order and the client picks one). No health checks. | Basic single-resource routing
Weighted | Distribute traffic by weight (70/30). Sum doesn't need to be 100. | Blue/green deploys, A/B testing, gradual migrations
Latency | Route to region with lowest latency for the user. AWS measures latency to each region. | Multi-region apps wanting best performance for each user
Failover | Primary record is health-checked. If unhealthy, Route 53 serves the secondary. | Active-passive DR. Route to DR region on failure.
Geolocation | Route based on user's geographic location (country/continent). Strict: no match = no response unless a default record exists. | Legal compliance (EU users → EU servers), localized content
Geoproximity | Route based on physical distance. Can shift traffic by adjusting bias values. Requires Traffic Flow (extra cost). | Multi-region with granular traffic shifting
Multivalue Answer | Returns up to 8 healthy records. Like Simple but with health checks per record. | Simple client-side load balancing with health checks. Not a replacement for ALB.
IP-Based | Route based on client IP CIDR ranges. | Route corporate network traffic to internal endpoints
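For the Weighted policy, each record receives its weight divided by the sum of all weights, which is why the weights need not add up to 100. A minimal sketch (the function name is hypothetical, and the all-zero-weights case is deliberately not modelled):

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of DNS responses each weighted record is expected to receive."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight in this sketch")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    print(traffic_shares({"blue": 70, "green": 30}))
    # Weights of 3 and 1 work just as well as 75 and 25.
    print(traffic_shares({"v1": 3, "v2": 1}))
```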

Health Checks

Route 53 health checkers (globally distributed) probe your endpoint every 10 or 30 seconds. Route 53 treats the endpoint as healthy only while more than 18% of checkers report it healthy; at 18% or fewer, the endpoint is marked unhealthy and Failover routing activates. You can health-check: HTTP/HTTPS/TCP endpoints, CloudWatch alarms, or calculated health checks (composite of multiple checks).
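The aggregation across checkers can be sketched as a simple threshold test. This assumes the rule as documented by AWS (an endpoint counts as healthy while more than 18% of checkers report it healthy); the names are illustrative and the real service also applies per-checker failure thresholds that are not modelled here.

```python
HEALTHY_FRACTION_THRESHOLD = 0.18  # assumption: Route 53's documented cut-off


def endpoint_is_healthy(checker_results: list[bool]) -> bool:
    """Aggregate per-checker verdicts (True = healthy) into one endpoint status."""
    if not checker_results:
        raise ValueError("need at least one checker result")
    healthy_fraction = sum(checker_results) / len(checker_results)
    return healthy_fraction > HEALTHY_FRACTION_THRESHOLD


if __name__ == "__main__":
    # 16 checkers: 3 healthy is 18.75% (above threshold), 2 healthy is 12.5% (below).
    print(endpoint_is_healthy([True] * 3 + [False] * 13))
    print(endpoint_is_healthy([True] * 2 + [False] * 14))
```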

CloudFront CDN

What is CloudFront?

AWS's global CDN with 600+ edge locations. Accelerates delivery of static and dynamic content by caching at the edge. Also provides: DDoS protection (Shield Standard free), HTTPS termination, compression, WAF integration, Lambda@Edge for programmable edge logic.

Key CloudFront Concepts

Distribution

A CloudFront configuration object. You create one distribution per app/site. A distribution has a CloudFront domain (d1abc23efg.cloudfront.net) which you CNAME your domain to. Has one or more origins and one or more cache behaviors.

Origins

Where CloudFront fetches content when it's not cached (cache miss). Can be: S3 bucket, ALB, EC2, API Gateway, or any HTTP server. A distribution can have multiple origins.

Cache Behaviors

Rules that define how CloudFront handles requests matching a URL path pattern. Different paths can route to different origins with different cache settings:

CloudFront: Multiple Origins via Cache Behaviors
  https://example.com
  │
  ├── /api/*  ─────────────────────────► ALB → EC2 (no caching, dynamic)
  │   Cache: TTL=0, forward all headers
  │
  ├── /static/* ────────────────────────► S3 Bucket (cached, long TTL)
  │   Cache: TTL=86400 (1 day)
  │
  └── /* (Default) ─────────────────────► S3 (index.html, SPA)
      Cache: TTL=300 (5 min)
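The behavior lookup in the diagram amounts to "first matching path pattern wins, otherwise the default". A rough sketch using `fnmatch` as a stand-in for CloudFront's wildcard matching; the origin labels and the `match_behavior` helper are made up for illustration.

```python
from fnmatch import fnmatchcase

# (path pattern, origin label, TTL in seconds), checked in order.
BEHAVIORS = [
    ("/api/*", "alb", 0),
    ("/static/*", "s3-assets", 86400),
]
DEFAULT_BEHAVIOR = ("s3-spa", 300)  # used when no pattern matches


def match_behavior(path: str) -> tuple[str, int]:
    """Return (origin, ttl) for the first behavior whose pattern matches `path`."""
    for pattern, origin, ttl in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin, ttl
    return DEFAULT_BEHAVIOR


if __name__ == "__main__":
    for p in ("/api/users/42", "/static/app.js", "/index.html"):
        print(p, "->", match_behavior(p))
```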

OAC β€” Origin Access Control

Allows CloudFront to access a private S3 bucket on your behalf. Users access content via the CloudFront URL only, and the S3 bucket can block all direct access. Prevents bucket hotlinking, enforces CloudFront caching. OAC is the modern replacement for OAI (Origin Access Identity).

Lambda@Edge & CloudFront Functions

  • CloudFront Functions: Ultra-lightweight JS functions running at the edge for request/response manipulation. Sub-ms latency. Free tier available. Good for: URL rewrites/redirects, add security headers, A/B testing at edge.
  • Lambda@Edge: Full Lambda functions (Node.js, Python) replicated globally and run at CloudFront's regional edge caches rather than at every PoP. More powerful, slightly higher latency. Good for: authentication at edge, dynamic content generation, API calls at edge.
// CloudFront Function example: add security headers to all responses
function handler(event) {
    var response = event.response;
    var headers = response.headers;
    headers['strict-transport-security'] = {value: 'max-age=63072000; includeSubdomains; preload'};
    headers['x-content-type-options'] = {value: 'nosniff'};
    headers['x-frame-options'] = {value: 'DENY'};
    return response;
}
ELB Elastic Load Balancing: ALB & NLB

Application Load Balancer (ALB): Layer 7

ALB operates at HTTP/HTTPS layer. It understands your request content and can make intelligent routing decisions. Every request is terminated at the ALB (it opens a new connection to the backend). Essential for microservices architectures.

Key ALB Components

  • Listener: Waits on a port (80 or 443). Defines rules to route requests.
  • Rules: IF (conditions match) THEN (action). Conditions: path, hostname, headers, query strings, source IP, HTTP method. Actions: forward, redirect, return fixed response.
  • Target Groups: Collection of targets (EC2 instances, IP addresses, Lambda functions, containers). Each TG has health check configuration.
ALB: Path-Based Routing to Microservices
  Client HTTP Request
         │
         ▼
  ┌─────────────────┐
  │  ALB Listener   │ :443 (HTTPS)
  │  ─────────────  │
  │  Rule 1:        │ /users/* ────────► Target Group A (User Service)
  │  Rule 2:        │ /orders/* ───────► Target Group B (Order Service)
  │  Rule 3:        │ /api/* (host:api) ► Target Group C (API backend)
  │  Default:       │ /* ──────────────► Target Group D (Frontend SPA)
  └─────────────────┘

  Each Target Group:
  ┌────────────────────────────────────────────────────────┐
  │  EC2: i-001 (healthy ✓)  i-002 (healthy ✓)  i-003 ✗    │
  │  Health check: GET /health → 200 OK every 30s          │
  └────────────────────────────────────────────────────────┘

ALB Features

  • HTTPS Termination: ALB decrypts HTTPS and talks to backend via HTTP. Offloads SSL processing from backends.
  • Sticky Sessions: Route same user to same backend target using a cookie. Use with caution (undermines horizontal scaling).
  • Weighted Target Groups: Send 90% to v2 TG, 10% to v3 TG. Canary deploys without DNS changes.
  • Authentication: Native OpenID Connect/Cognito authentication. Reject unauthenticated requests before they hit your app.
  • Access Logs: Log every request to S3. Useful for traffic analysis, debugging, compliance.

Network Load Balancer (NLB): Layer 4

NLB operates at TCP/UDP layer. Doesn't inspect packet contents. Handles millions of requests per second with ultra-low latency (<1ms). Has static IP addresses (useful for whitelisting). Supports TLS termination at L4.

Feature | ALB | NLB
OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS)
Routing intelligence | Path, host, headers, cookies | IP + Port only
Performance | Good | Extreme (millions RPS)
Static IP | No (put Global Accelerator or an NLB in front) | Yes (one per AZ)
Protocol support | HTTP/HTTPS/WebSocket/gRPC | TCP/UDP/TLS
Price | Moderate | Moderate
Use case | Web apps, microservices, APIs | Gaming, IoT, VoIP, financial trading, TCP apps
VPN / Direct Connect Hybrid Connectivity

AWS Site-to-Site VPN

Encrypted connection between your on-premises network and your AWS VPC over the public internet. Uses IPsec. Two tunnels per VPN connection (for redundancy). Managed on AWS side by Virtual Private Gateway (VGW) or Transit Gateway. Bandwidth: ~1.25 Gbps max per tunnel, varies with internet conditions.

# VPN Connection components:
On-Prem Router/Firewall (Customer Gateway) ──IPsec Tunnel──► Virtual Private Gateway (VGW)
                                                                        │
                                                               Route table entry in VPC
                                                               10.0.0.0/8 → vgw-xxxxx

AWS Direct Connect (DX)

A dedicated physical private network connection from your datacenter to AWS. NOT over the internet: a private fiber link through an AWS Direct Connect partner (colocation facility). More expensive to set up but: consistent bandwidth, lower latency, more predictable, can carry more traffic more cheaply (data transfer pricing is lower on DX vs internet).

Feature | Site-to-Site VPN | Direct Connect
Connection type | Over internet (encrypted) | Private dedicated fiber
Setup time | Hours (AWS console + router config) | Weeks to months (physical provisioning)
Bandwidth | Up to ~1.25 Gbps per tunnel (variable, internet-dependent) | 1, 10, or 100 Gbps dedicated ports (smaller hosted connections via partners), consistent
Cost | Low (hourly + data transfer) | High (port fee + partner fee + data)
Reliability | Internet outages affect it | Dedicated, very reliable
Latency | Variable | Consistent and low
Use case | Small/medium orgs, dev, backup link | Enterprise hybrid cloud, large data transfers, compliance
Best Practice: DX + VPN Backup Use Direct Connect as the primary link and a VPN connection as a failover. If DX goes down, traffic automatically fails over to the VPN (slower but encrypted). This gives you the performance of DX with the resilience of VPN as backup.
Hybrid Connectivity: Equivalents
GCP

Cloud VPN (like Site-to-Site VPN) | Cloud Interconnect (like Direct Connect). Cloud Interconnect types: Dedicated Interconnect (100 Gbps!) and Partner Interconnect.

Azure

Azure VPN Gateway (like Site-to-Site VPN) | Azure ExpressRoute (like Direct Connect). ExpressRoute also has ExpressRoute Global Reach: link your on-prem sites to each other across the Microsoft backbone via their ExpressRoute circuits (the closest AWS counterpart is Direct Connect SiteLink).

AWS-M4

IAM & Security

IAM Identity & Access Management

Core IAM Entities

IAM is a free, global service; it's not region-specific. IAM controls who can do what on which AWS resources. Everything in AWS is an API call, and every call goes through IAM for authorization.

IAM User

A person or application with permanent long-term credentials (password + access keys). Represents one specific identity. Avoid creating users for services; use roles instead.

IAM Group

Collection of users. Attach policies to groups, not individual users. E.g., "Developers" group has S3 + EC2 read. Add a new dev → add to group. Remove dev → remove from group. Clean, scalable.

IAM Role

An identity with permissions, but NO permanent credentials. Assumed temporarily by users, AWS services (EC2, Lambda), or other accounts. Credentials are auto-rotated. Preferred over users for services.

IAM Policy

JSON document defining what actions are allowed/denied on which resources. Attached to users, groups, or roles. AWS-managed policies (maintained by AWS) or customer-managed (you control them).

IAM Policy Structure

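// Annotated for study. JSON has no comment syntax: remove the // notes before use.
// The s3:prefix condition key exists only on ListBucket requests, so listing and
// reading get separate statements; GetObject is scoped through its Resource ARN.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListReportsPrefix",         // Optional statement ID
      "Effect": "Allow",                       // "Allow" or "Deny"
      "Action": "s3:ListBucket",               // Bucket-level action
      "Resource": "arn:aws:s3:::my-company-bucket",  // The bucket itself (for ListBucket)
      "Condition": {                           // Optional: extra conditions
        "StringLike": {
          "s3:prefix": "reports/*"             // Only list keys under the "reports/" prefix
        }
      }
    },
    {
      "Sid": "AllowReadReportsObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",                // Object-level action
      "Resource": "arn:aws:s3:::my-company-bucket/reports/*"  // Objects under "reports/"
    },
    {
      "Sid": "DenyObjectDeletes",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-company-bucket/*"
    }
  ]
}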

IAM Policy Types

Policy Type | Attached to | Purpose
Identity-based | User, Group, Role | What that identity can do
Resource-based | Resource (S3 bucket, Lambda, SQS) | Who can access this resource (enables cross-account)
Permission Boundary | User or Role | Maximum permissions ceiling. Even if the identity has a broader policy, the boundary limits it.
SCP (Service Control Policy) | AWS Organization Account/OU | Max permissions for all member accounts in an org. Even a member account's root user can't exceed the SCP.
Session Policy | AssumeRole call | Further restrict permissions for a specific role session

Policy Evaluation Logic

IAM Authorization: Evaluation Order
  Request arrives
        │
        ▼
  Explicit DENY in any policy? ──── Yes ──► DENY ✗ (Deny wins, always)
        │ No
        ▼
  SCP allows it? (Organizations) ── No ───► DENY ✗
        │ Yes
        ▼
  Explicit ALLOW in a policy? ───── No ───► Implicit DENY ✗ (default deny)
        │ Yes
        ▼
  Within Permission Boundary? ───── No ───► DENY ✗
        │ Yes
        ▼
  ALLOW ✓

  Rule: EXPLICIT DENY always wins. Default is DENY.
  You must explicitly ALLOW everything you want permitted.
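The same decision chain can be written as a short function. This is a simplified sketch of the flow shown above, for a single-account request; it leaves out resource-based and session policies, and the parameter names are illustrative rather than anything from an AWS SDK.

```python
def evaluate(
    explicit_deny: bool,
    identity_allow: bool,
    scp_allow: bool = True,        # True when no SCP applies or the SCP permits the action
    boundary_allow: bool = True,   # True when no boundary is set or it permits the action
) -> str:
    """Return the authorization decision for one request."""
    if explicit_deny:
        return "DENY (explicit)"
    if not scp_allow:
        return "DENY (SCP)"
    if not identity_allow:
        return "DENY (implicit)"
    if not boundary_allow:
        return "DENY (permission boundary)"
    return "ALLOW"


if __name__ == "__main__":
    print(evaluate(explicit_deny=False, identity_allow=True))
    print(evaluate(explicit_deny=True, identity_allow=True))
    print(evaluate(explicit_deny=False, identity_allow=False))
    print(evaluate(explicit_deny=False, identity_allow=True, boundary_allow=False))
```

Every branch except the last returns a deny, which mirrors the rule that a request succeeds only when nothing denies it and something explicitly allows it.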

IAM Roles: The Key Pattern

Instead of creating a user for your EC2 instance and storing access keys on the server (dangerous: keys can leak), you attach an IAM Role to the instance. EC2 automatically gets temporary credentials via IMDS (the instance metadata service), and AWS rotates them automatically before they expire. Lambda, ECS tasks, and other services follow the same pattern: the runtime hands temporary role credentials to your code.

# BAD: Access keys hardcoded or in environment (never do this)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# GOOD: Use IAM Role attached to the EC2/Lambda/ECS task
# boto3 automatically fetches temp credentials for the role (from IMDS on EC2)
import boto3
s3 = boto3.client('s3')  # No credentials needed: role creds are used automatically
s3.get_object(Bucket='my-bucket', Key='file.txt')

Cross-Account Access with Roles

A role in Account B can be assumed by Account A's resources. This is how centralized tooling (one DevOps account managing multiple app accounts) works. The trust policy on the role in Account B says "allow Account A's role X to assume me."

# Trust Policy on Role in Account B (the target role)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/DeployRole"  // Account A's role
    },
    "Action": "sts:AssumeRole"
  }]
}

# In Account A, assume the role:
aws sts assume-role \
  --role-arn "arn:aws:iam::222222222222:role/DeployTargetRole" \
  --role-session-name "deploy-session-$(date +%s)"

MFA (Multi-Factor Authentication)

  • Virtual MFA: Authenticator app (Google Authenticator, Authy)
  • Hardware MFA: Physical TOTP device or FIDO2 security key (YubiKey)
  • Always enable MFA on the root account: root has unrestricted power within its account (SCPs can limit root only in member accounts of an AWS Organization, never in the management account)
  • You can enforce MFA for specific actions via policy condition: "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
Root Account Security: The AWS root account (the email you signed up with) has unrestricted access to everything in the account. IAM policies cannot restrict it, and SCPs restrict it only in member accounts of an Organization, not in the management account. Best practices: Enable MFA on root immediately. Create a strong password. Store root credentials in a password manager in a vault. Never use root for day-to-day operations. Create IAM admin users for regular work.
IAM Equivalents
GCP

Cloud IAM. Key difference: GCP uses Roles (not policies) as the primary permission unit. Predefined roles (like AWS managed policies), custom roles. Service Accounts = IAM Roles for services. Workload Identity Federation = allows external identities (GitHub Actions, on-prem) to access GCP without service account keys, similar to AWS OIDC federation.

Azure

Azure RBAC (Role-Based Access Control). Built-in roles: Owner, Contributor, Reader, plus 100+ service-specific roles. Service Principals = IAM Roles for services. Managed Identities (System-assigned or User-assigned) = equivalent to EC2 IAM roles, with no credentials stored. Azure AD / Entra ID is the identity provider (in AWS, IAM is separate from the directory; Azure integrates them).

Azure-Only

Azure Active Directory (Entra ID): Azure integrates identity directory (user management, SSO, conditional access) directly with RBAC. In AWS, you'd use IAM + AWS SSO (IAM Identity Center) + potentially an external IdP (Okta, Azure AD itself). Many companies use Azure AD as their IdP even for AWS.

KMS / Secrets Manager Key & Secrets Management

AWS KMS: Key Management Service

KMS is a managed service for creating and controlling encryption keys. It's the central key vault for all AWS encryption. When you "enable encryption" in S3, EBS, or RDS, they're using KMS keys under the hood.

Key Types

Key Type | Who manages | Rotation | Cost | Use when
AWS Managed Key | AWS (auto-created per service) | Auto (1 yr) | Free | Basic encryption, fine for most cases
Customer Managed Key (CMK) | You | Auto or manual | $1/month + API calls | Need control, cross-account, custom key policy, audit
AWS CloudHSM | You (hardware module) | You manage | $$$ | Strict compliance (FIPS 140-2 Level 3), custom HSM

Envelope Encryption

KMS doesn't encrypt your 5GB file directly: the KMS Encrypt API accepts at most 4 KB of data, and KMS keys never leave KMS. Instead: KMS generates a Data Encryption Key (DEK). Your code uses the DEK to encrypt the actual data. The DEK itself is encrypted with a KMS key (the "master key"). You store the encrypted DEK alongside the encrypted data. To decrypt: call KMS to decrypt the DEK, then use the DEK to decrypt the data. The master key never leaves KMS.
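The flow above can be walked through end to end without an AWS account. The sketch below is a toy model only: `ToyKms` is a hypothetical in-memory stand-in for KMS, and the XOR keystream "cipher" is NOT secure and merely keeps the example dependency-free. Real code would call the KMS GenerateDataKey and Decrypt APIs and use an authenticated cipher such as AES-GCM from a vetted library.

```python
import hashlib
import secrets


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (XOR with a SHA-256 counter keystream). NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)  # XOR is its own inverse, so this both encrypts and decrypts


class ToyKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self) -> tuple[bytes, bytes]:
        """Return (plaintext DEK, DEK encrypted under the master key)."""
        dek = secrets.token_bytes(32)
        return dek, _toy_cipher(self._master_key, dek)

    def decrypt(self, encrypted_dek: bytes) -> bytes:
        return _toy_cipher(self._master_key, encrypted_dek)


def envelope_encrypt(kms: ToyKms, data: bytes) -> tuple[bytes, bytes]:
    """Encrypt data locally with a fresh DEK; keep only the ENCRYPTED DEK."""
    dek, encrypted_dek = kms.generate_data_key()
    ciphertext = _toy_cipher(dek, data)
    return encrypted_dek, ciphertext  # the plaintext DEK is discarded here


def envelope_decrypt(kms: ToyKms, encrypted_dek: bytes, ciphertext: bytes) -> bytes:
    """Ask the KMS stand-in to unwrap the DEK, then decrypt the data locally."""
    dek = kms.decrypt(encrypted_dek)
    return _toy_cipher(dek, ciphertext)


if __name__ == "__main__":
    kms = ToyKms()
    wrapped_key, blob = envelope_encrypt(kms, b"quarterly report contents")
    print(envelope_decrypt(kms, wrapped_key, blob))
```

Only the small DEK ever travels to the key service; the bulk data is encrypted and decrypted locally, which is the point of the envelope pattern.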

AWS Secrets Manager

Centralized, encrypted storage for secrets: database passwords, API keys, OAuth tokens. Auto-rotates secrets (can trigger a Lambda to rotate passwords in RDS). Applications retrieve secrets at runtime via API, so there are no hardcoded passwords in code.

# Retrieve secret at runtime (Python)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
secret = client.get_secret_value(SecretId='prod/myapp/db-password')
db_creds = json.loads(secret['SecretString'])
db_host = db_creds['host']
db_pass = db_creds['password']

# Rotation happens on the Secrets Manager side, e.g. every 30 days:
# a rotation Lambda changes the RDS user password and stores the new value,
# so the next get_secret_value call returns the current password

Secrets Manager vs SSM Parameter Store

Feature | Secrets Manager | SSM Parameter Store
Cost | $0.40/secret/month + API calls | Free (Standard) / $0.05 per advanced parameter per month
Auto-rotation | Yes (built-in for RDS, Redshift, DocumentDB) | No (manual or custom Lambda)
Encryption | Always encrypted (KMS) | Optional (use SecureString type for encrypted)
Cross-account | Yes, with resource policy | Limited (advanced-tier parameters can be shared via AWS RAM)
Best for | Database passwords, API keys, credentials requiring rotation | App configs, feature flags, non-secret parameters
Secrets & Key Management: Equivalents
GCP

Secret Manager (like Secrets Manager) | Cloud KMS (like AWS KMS) | Cloud HSM (like CloudHSM). GCP Secret Manager also supports version control of secrets.

Azure

Azure Key Vault: combines secrets, keys, AND certificates in one service (AWS splits these: Secrets Manager + KMS + ACM). Key Vault has a Managed HSM tier for FIPS 140-2 Level 3. Azure App Configuration is like SSM Parameter Store for feature flags and app settings.

Azure-Only

Azure Key Vault Certificates: Key Vault can manage the full TLS certificate lifecycle: request, renew, store, deploy. AWS splits this: ACM (Certificate Manager) for provisioning/renewal, Secrets Manager for custom cert storage.

WAF / Shield / GuardDuty Threat Protection

AWS WAF: Web Application Firewall

WAF protects your web apps from common exploits at the application layer (L7). Works with CloudFront, ALB, API Gateway, AppSync. You define rules that filter HTTP requests.

Built-in rule groups (AWS Managed Rules): SQL injection protection, XSS protection, known bad IPs, AWS IP reputation lists. You can also write custom rules: "Block all requests where URI contains ../" or "Rate limit to 1000 req/5min per IP."

AWS Shield

Tier | Cost | Protection
Shield Standard | Free (automatic) | L3/L4 DDoS protection for all AWS resources. Protects against SYN floods, UDP reflection, etc.
Shield Advanced | $3,000/month | Enhanced DDoS protection, 24/7 access to the Shield Response Team (SRT, formerly the DDoS Response Team), cost protection (AWS refunds scale-out costs from DDoS), advanced metrics.

Amazon GuardDuty

ML-powered threat detection service that continuously monitors your AWS account for malicious activity and unusual behavior. Analyzes: VPC Flow Logs, DNS logs, CloudTrail management events, S3 data events (via CloudTrail), EKS audit logs. Detects: compromised EC2 instances communicating with known bad IPs, unusual API calls, credential theft, S3 data exfiltration patterns.

Enable GuardDuty on Every Account GuardDuty is pay-per-use (per GB of log data analyzed), has a 30-day free trial, and requires literally zero configuration to start getting value. Enable it and connect to Security Hub for centralized findings. It's one of the highest-value-per-effort security services in AWS.

AWS Security Hub

Central security dashboard aggregating findings from GuardDuty, Inspector, Macie, Firewall Manager, and third-party tools. Runs automated compliance checks against CIS AWS Foundations, PCI-DSS, and other standards. Gives you a security score and prioritized findings list.

Other Key Security Services

Service | What it does
Amazon Inspector | Vulnerability scanning for EC2 instances and container images in ECR. Continuously scans for CVEs, network exposure. Its ECR scan findings let you gate deployments on vulnerable images.
Amazon Macie | ML-based data security for S3. Discovers and protects sensitive data: PII (names, SSNs, credit cards, passports). Alerts you if sensitive data is in a public bucket.
AWS Config | Continuous resource configuration recording. "Who changed what, when?" Compliance rules: "All S3 buckets must have encryption enabled." Alerts on drift.
AWS CloudTrail | Audit log of all AWS API calls: who made the call, from which IP, when, what changed. The "flight recorder" of your AWS account. The last 90 days of management events are viewable by default (Event history); create a trail that delivers to S3 for long-term retention.
Security Services: Equivalents
GCP

Cloud Armor (WAF + DDoS) | Security Command Center (like Security Hub + GuardDuty) | Cloud Audit Logs (like CloudTrail) | Container Analysis (like Inspector for containers).

Azure

Azure WAF (part of App Gateway or Front Door) | Azure DDoS Protection (Standard = like Shield Advanced) | Microsoft Defender for Cloud (like GuardDuty + Security Hub combined) | Azure Monitor Activity Log (like CloudTrail).

AWS-M5

Databases

RDS Relational Database Service

What is RDS?

RDS is a managed relational database service. AWS handles: OS patching, DB engine upgrades (with your approval), automated backups, replication, failover. You just connect and query. Supported engines: MySQL, PostgreSQL, MariaDB, Oracle, Microsoft SQL Server and Amazon Aurora (custom AWS engine).

Key RDS Concepts

Multi-AZ Deployment

The most important RDS HA feature. When enabled, AWS automatically maintains a synchronous standby replica in a different AZ. If the primary fails, RDS automatically fails over to the standby. Failover typically takes 60-120 seconds (DNS update). In a Multi-AZ DB instance deployment the standby is NOT accessible for reads; it's purely for failover. Separate from Read Replicas.

RDS Multi-AZ vs Read Replicas
  MULTI-AZ (High Availability):           READ REPLICAS (Scalability):
  ┌──────────────────────────────┐        ┌──────────────────────────────┐
  │ AZ-1a: Primary RDS  ──sync──►│        │ Primary ──async──► Replica 1 │
  │         Read+Write  ◄failover│        │  (R+W)  ──async──► Replica 2 │
  │                              │        │         ──async──► Replica 3 │
  │ AZ-1b: Standby RDS           │        │                              │
  │         (NOT accessible)     │        │ Replicas: READ ONLY          │
  └──────────────────────────────┘        │ Can be in different region!  │
  For: automatic failover / HA            └──────────────────────────────┘
                                          For: scale out reads, reports,
                                          analytics, DR (promote to primary)

Read Replicas

Asynchronous copies of your primary DB, used to offload read traffic. Up to 15 read replicas for Aurora and for RDS MySQL, MariaDB, and PostgreSQL; 5 for Oracle and SQL Server. Can be in a different region (cross-region read replicas for DR). In a disaster, promote a read replica to a standalone DB, which becomes the new primary.

Automated Backups

  • Daily automated backup during your backup window (entire DB + transaction logs)
  • Retained for 1-35 days (default 7). After that, deleted automatically.
  • Point-in-time recovery: restore to any second within the backup retention period
  • Manual snapshots: you control them, persist indefinitely until you delete them

RDS Proxy

A fully managed, highly available database proxy that sits between your app and RDS. Why use it? Lambda functions opening thousands of connections overwhelm RDS (too many connections). RDS Proxy pools and reuses connections: Lambda connects to the Proxy, and the Proxy maintains a small pool to RDS. It also speeds up failover: clients connect to the Proxy endpoint, which auto-routes to a healthy instance.

Storage Auto Scaling

Enable and set a maximum storage limit. If your DB is about to run out of disk space, RDS automatically scales up storage without downtime. You can never shrink it back (only grow). Set a high maximum and don't worry about disk again.

Amazon Aurora

AWS's custom-built cloud-native relational DB. MySQL and PostgreSQL compatible, so your app code doesn't change. But it's re-engineered from scratch for cloud performance and resilience.

Feature | Standard RDS (MySQL) | Aurora MySQL
Storage layout | Single-AZ volume (Multi-AZ adds standby) | 6 copies across 3 AZs by default
Read Replicas | 5 max | 15 max (Aurora Replicas)
Failover | 60-120 seconds | ~30 seconds (in-cluster replicas)
Performance | Baseline MySQL | Up to 5x MySQL throughput (AWS's claim)
Cost | Lower | ~20% more than RDS
Max storage | Up to 64TB | Up to 128TB, auto-scales

Aurora Serverless v2

Aurora that scales capacity in fine-grained increments (in 0.5 ACU steps from 0.5 to 256 ACUs) based on actual demand, in seconds. No pre-provisioning. Pay per second of actual ACU usage. Perfect for: unpredictable workloads, dev/test, multi-tenant SaaS with variable tenant load.

Managed Relational DB - Equivalents
GCP

Cloud SQL (managed MySQL, PostgreSQL, SQL Server; like standard RDS) | AlloyDB (like Aurora: PostgreSQL-compatible, high performance; Google claims up to 4x faster than standard PostgreSQL for transactional workloads). Also Cloud Spanner: globally distributed SQL (unique, no AWS equivalent).

Azure

Azure SQL Database (managed SQL Server) | Azure Database for MySQL/PostgreSQL (like standard RDS) | Azure SQL Managed Instance (SQL Server with near-100% compatibility, for lift-and-shift). Azure's Hyperscale tier is similar to Aurora in concept.

GCP-Only

Cloud Spanner: Globally distributed, horizontally scalable relational DB with ACID transactions across regions. No true equivalent in AWS or Azure (AWS DocumentDB is NoSQL, and global Aurora has limits). Used by Google for their own core infrastructure.

DynamoDB NoSQL Database

What is DynamoDB?

DynamoDB is AWS's fully managed NoSQL key-value and document database. No servers, no OS, no capacity planning. Single-digit millisecond performance at any scale. Used by Amazon itself for their shopping cart, sessions, order management. Built for internet-scale applications.

Core Concepts

Tables, Items, Attributes

DynamoDB is schemaless (except for keys). A Table holds Items (like rows), each with Attributes (like columns). There is no fixed schema: different items can have different attributes. Only the primary key is required.

Primary Key Types

Simple Primary Key (Partition Key only)

Single attribute used as the primary key. Must be unique. Used when you query by a single ID.
Example: userId as partition key. Query: "Give me all data for userId=U123"

Composite Primary Key (Partition + Sort Key)

Two attributes together are unique. Multiple items can share partition key but must have different sort keys. Enables range queries.
Example: userId (partition) + orderDate (sort). Query: "Give me all orders for userId=U123 in 2024"
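
To make the composite-key idea concrete, here is a minimal in-memory sketch (not the DynamoDB API). The class name OrdersTable and its methods are hypothetical; it only models the rule that partition key + sort key together identify one item, and that a query targets one partition and can filter on a sort-key prefix.

```python
class OrdersTable:
    """Toy model of a table with userId (partition key) + orderDate (sort key)."""

    def __init__(self):
        # partition key -> {sort key -> item}
        self._partitions = {}

    def put_item(self, user_id, order_date, item):
        # Same (partition, sort) pair overwrites: the composite key is unique
        self._partitions.setdefault(user_id, {})[order_date] = item

    def query(self, user_id, sort_key_prefix=""):
        # One partition only, results ordered by sort key
        items = self._partitions.get(user_id, {})
        return [items[sk] for sk in sorted(items) if sk.startswith(sort_key_prefix)]


table = OrdersTable()
table.put_item("U123", "2023-12-30", {"order": "A"})
table.put_item("U123", "2024-03-01", {"order": "B"})
table.put_item("U123", "2024-01-15", {"order": "C"})
table.put_item("U999", "2024-02-02", {"order": "D"})

orders_2024 = table.query("U123", sort_key_prefix="2024")
```

Here orders_2024 holds only U123's 2024 orders, in date order. In real DynamoDB the same shape is a Query with a key condition on the partition key plus a begins_with or between condition on the sort key.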

Read Capacity Units (RCU) and Write Capacity Units (WCU)

DynamoDB bills on throughput. 1 RCU = 1 strongly consistent read (or 2 eventually consistent reads) of up to 4KB/second. 1 WCU = 1 write of up to 1KB/second. You either provision RCU/WCU (predictable, cheaper) or use On-Demand mode (pay per request, no planning, costlier per request but no idle waste).
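
The capacity arithmetic above can be sketched as a small calculator. This is a rough sizing aid under the stated rules only (4 KB read units, 1 KB write units, eventually consistent reads at half cost); the function names are made up for illustration.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """RCUs needed for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / 4096)  # each read rounds up to 4 KB blocks
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = math.ceil(total / 2)  # eventually consistent reads cost half
    return total


def write_capacity_units(item_size_bytes, writes_per_second):
    """WCUs needed for a steady write rate."""
    units_per_write = math.ceil(item_size_bytes / 1024)  # each write rounds up to 1 KB blocks
    return units_per_write * writes_per_second


# 6,000-byte items read 100 times/sec, 2,500-byte items written 50 times/sec
strong_rcu = read_capacity_units(6000, 100)
eventual_rcu = read_capacity_units(6000, 100, strongly_consistent=False)
wcu = write_capacity_units(2500, 50)
```

A 6,000-byte item needs two 4 KB units per strongly consistent read, so item size matters as much as request rate when provisioning.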

Global Secondary Indexes (GSI)

Query your DynamoDB table on a different attribute. If your table's partition key is userId, but you need to query "all users who signed up on date X" β€” create a GSI with signupDate as partition key. GSIs have their own RCU/WCU separate from the main table.

DynamoDB Streams

A time-ordered stream of item-level changes (inserts, updates, deletes) in a DynamoDB table. Retained for 24 hours. Trigger Lambda functions on changes. Powerful for: replication, cache invalidation, event sourcing, audit logs.
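
A minimal sketch of a Lambda handler consuming a stream batch. It assumes a table whose partition key is a string attribute named userId (an assumption for illustration) and only groups the changed keys by event type; the sample event is hand-written in the shape Lambda receives from DynamoDB Streams.

```python
def handler(event, context=None):
    """Group changed userIds by change type (INSERT / MODIFY / REMOVE)."""
    summary = {"INSERT": [], "MODIFY": [], "REMOVE": []}
    for record in event.get("Records", []):
        event_name = record["eventName"]
        keys = record["dynamodb"]["Keys"]
        # Stream attribute values are typed, e.g. {"S": "U123"} for a string
        summary[event_name].append(keys["userId"]["S"])
    return summary


# Hand-written sample batch for local testing
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"userId": {"S": "U123"}}}},
        {"eventName": "MODIFY", "dynamodb": {"Keys": {"userId": {"S": "U123"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"userId": {"S": "U456"}}}},
    ]
}

result = handler(sample_event)
```

A cache-invalidation function would delete the cache entry for each key in MODIFY and REMOVE instead of collecting them.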

DynamoDB Accelerator (DAX)

In-memory cache for DynamoDB. API-compatible: swap your DynamoDB client for a DAX client, same code. Reduces read latency from single-digit ms to microseconds. Handles millions of reads per second. Use for: high-read, cost-sensitive workloads (DAX reads are cheaper than DynamoDB reads at high volume).

Global Tables

Multi-region, multi-active DynamoDB. Write to any region, DynamoDB replicates to others within seconds. Last-writer-wins conflict resolution. Perfect for: global apps needing local read/write latency everywhere, multi-region active-active architecture.
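
Last-writer-wins can be illustrated with a toy merge function. This is a conceptual sketch only: the Version type, its fields, and the tie-breaking rule are assumptions, and DynamoDB's internal timestamps and tie handling are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    written_at: float  # write timestamp in seconds (illustrative)
    region: str


def resolve(a, b):
    """Keep whichever version was written later (ties keep the first argument)."""
    return a if a.written_at >= b.written_at else b


mumbai = Version(value="address=Pune", written_at=1000.0, region="ap-south-1")
virginia = Version(value="address=Mumbai", written_at=1000.4, region="us-east-1")

winner = resolve(mumbai, virginia)
```

The earlier write is silently discarded, which is why concurrent updates to the same item from two regions need care in application design.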

When to use DynamoDB vs RDS

Use DynamoDB when: access patterns are known and simple (get by key, query by key + sort), need massive scale (millions of TPS), no complex SQL queries needed, need single-digit ms at any scale, fully serverless architecture.

Use RDS when: complex queries, JOINs, ACID transactions across multiple tables, unknown/evolving access patterns, need SQL, reporting/analytics.

NoSQL DB - Equivalents
GCP

Cloud Firestore (document NoSQL, like DynamoDB but more flexible querying) | Cloud Bigtable (wide-column NoSQL, Apache HBase compatible, for massive analytics). No exact DynamoDB equivalent; Firestore is closest for serverless apps.

Azure

Azure Cosmos DB: multi-model NoSQL (document, key-value, graph, column-family) with multi-region active-active. More flexible than DynamoDB. Supports multiple APIs: Core (SQL), MongoDB, Cassandra, Gremlin, Table. 99.999% availability SLA.

Azure-Only

Cosmos DB's multi-model support: One Cosmos DB instance supports MongoDB API, Cassandra API, and SQL API simultaneously (with different collections). You can use existing MongoDB drivers unchanged. AWS has separate services for each (DynamoDB, DocumentDB for MongoDB, Keyspaces for Cassandra).

ElastiCache In-Memory Caching

What is ElastiCache?

Managed in-memory caching service. Two engines: Redis and Memcached. Dramatically reduces database load and latency by serving frequent reads from memory (microseconds) instead of disk (milliseconds).

Redis vs Memcached on ElastiCache

Feature | Redis | Memcached
Data structures | Strings, hashes, lists, sets, sorted sets, bitmaps, geospatial, streams | Simple key-value strings only
Persistence | Yes (RDB snapshots, AOF logs) | None (restart = all data lost)
Replication | Yes (primary + replicas) | No
Multi-AZ Failover | Yes | No
Pub/Sub | Yes | No
Cluster mode | Yes (sharding) | Yes
Use cases | Sessions, leaderboards, rate limiting, pub/sub, queues, ML | Simple cache (horizontal scaling, multi-threaded)

Choose Redis Almost Always

Unless you have a specific need for Memcached's multi-threaded horizontal scaling or already use Memcached, Redis is the better choice: more features, persistence, HA. In practice, most teams use Redis.

Common Caching Patterns

# Lazy Loading (Cache-Aside): most common pattern
import json
import redis

# Hypothetical endpoint; replace with your ElastiCache primary endpoint
cache = redis.Redis(host="my-cache.example.internal", port=6379)

def get_user(user_id):
    # Try cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)  # Cache HIT

    # Cache MISS: query database (db is a placeholder for your DB client)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # Store in cache with TTL (expiry)
    cache.setex(f"user:{user_id}", 3600, json.dumps(user))  # Cache 1 hour

    return user

# Write-Through: write to the DB, then refresh the cache in the same code path
def update_user(user_id, data):
    db.update("UPDATE users SET ... WHERE id = ?", user_id, data)
    cache.setex(f"user:{user_id}", 3600, json.dumps(data))  # Always fresh

Managed Cache - Equivalents
GCP

Memorystore for Redis and Memorystore for Memcached: same concept. Also Memorystore for Redis Cluster for large-scale sharding.

Azure

Azure Cache for Redis: same concept. Tiers: Basic (single node, no SLA), Standard (primary+replica), Premium (Redis Cluster, persistence, VNet injection), Enterprise (Redis Enterprise software, higher performance).

AWS-M6

Monitoring & Observability

CloudWatch Metrics, Logs & Alarms

What is CloudWatch?

CloudWatch is AWS's unified observability platform. It collects metrics, logs, traces, and events from AWS services and your applications. Like a central nervous system for your AWS environment. Three pillars: Metrics (what's happening), Logs (what happened), Alarms (alert when something's wrong).

CloudWatch Metrics

Numeric data points over time. AWS services automatically push metrics: EC2 CPU%, RDS connections, ALB request count, Lambda errors. You can publish custom metrics from your application code.

Metric | Service | What to monitor
CPUUtilization | EC2 | Alert if >80% sustained for 5 min (signal to scale)
DatabaseConnections | RDS | Alert if near max_connections limit
RequestCount, TargetResponseTime | ALB | Alert on traffic spikes or high latency
Errors, Duration, Throttles | Lambda | Alert on elevated error rate or timeouts
ApproximateNumberOfMessagesVisible | SQS | Alert if messages are accumulating (consumers slow)
BucketSizeBytes, NumberOfObjects | S3 | Storage growth tracking (daily granularity)

EC2 Default vs Detailed Monitoring

By default, EC2 sends metrics every 5 minutes (basic monitoring, free). Enable detailed monitoring for 1-minute granularity ($0.30/metric/month). For auto-scaling decisions, a 5-minute lag can be too slow, so enable detailed monitoring on production.

CloudWatch Logs

Centralized log storage and analysis. Logs are organized in Log Groups (one per app/service), which contain Log Streams (one per instance/invocation). Lambda, ECS, and other services push logs automatically. EC2 needs the CloudWatch Agent installed to push logs.

# CloudWatch Agent config (simplified): push /var/log/nginx/access.log
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/nginx/access.log",
          "log_group_name": "/ec2/nginx/access",
          "log_stream_name": "{instance_id}",
          "timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
        }]
      }
    }
  }
}

CloudWatch Logs Insights

Query language for analyzing logs. Like SQL for your logs. Very useful for debugging:

# Find all Lambda errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

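# Calculate average response time. Assumes log events with a parsed targetProcessingTime field;
# ALB access logs are delivered to S3, so they must be shipped to CloudWatch Logs first.
fields @timestamp, targetProcessingTime
| stats avg(targetProcessingTime) as avgTime, count() as requests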

CloudWatch Alarms

Trigger actions when a metric crosses a threshold. States: OK (metric within threshold), ALARM (metric breached threshold), INSUFFICIENT_DATA (not enough data yet).

Actions on ALARM: SNS notification (email/SMS), Auto Scaling (add/remove instances), EC2 action (stop/reboot/recover instance), Systems Manager action.

# Create alarm via CLI: alert if EC2 CPU > 80% for 2 consecutive 5-min periods
aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-ec2-web-01" \
  --alarm-description "CPU usage too high" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc12345 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:ap-south-1:123456789012:ops-alerts"
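
How the three states fall out of the settings in the CLI example (threshold 80, 2 evaluation periods, GreaterThanThreshold) can be sketched as follows. This is a simplified model: real CloudWatch also supports "M out of N" datapoints and configurable missing-data treatment, neither of which is modeled, and the function name is made up.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return OK, ALARM, or INSUFFICIENT_DATA for a GreaterThanThreshold alarm.

    datapoints: one aggregated value per period, oldest first; None = no data.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    if all(p > threshold for p in recent):
        return "ALARM"  # every evaluated period breached the threshold
    return "OK"


# Average CPU per 5-minute period
state_spike = alarm_state([50, 85, 70], threshold=80, evaluation_periods=2)
state_sustained = alarm_state([50, 85, 90], threshold=80, evaluation_periods=2)
state_new = alarm_state([90], threshold=80, evaluation_periods=2)
```

A single 85% period followed by 70% stays OK, which is the point of requiring consecutive periods: short spikes do not page anyone.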

CloudWatch Dashboards

Custom dashboards combining metrics from multiple services. Create a single pane view: EC2 CPU + RDS connections + ALB latency + Lambda errors + SQS queue depth. Share with team. Use as your operations wall display.

CloudWatch Events / EventBridge

Rule-based event routing. React to AWS service events or scheduled triggers. EventBridge is the evolution of CloudWatch Events: more powerful, with support for custom event buses, third-party SaaS events, and a schema registry.

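# EventBridge scheduled rule: trigger Lambda every day at 8 AM UTC (cron)
# Scheduled rules use a schedule expression rather than an event pattern.
aws events put-rule \
  --name DailyReport \
  --schedule-expression "cron(0 8 * * ? *)"
aws events put-targets \
  --rule DailyReport \
  --targets '[{"Id": "DailyReport", "Arn": "arn:aws:lambda:...daily-report"}]'
# The Lambda function also needs a resource policy allowing events.amazonaws.com to invoke it.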

# EventBridge rule: trigger when EC2 instance state changes to "stopped"
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["stopped"]}
}
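
To show how an event pattern like the one above selects events, here is a toy matcher. It is a deliberately reduced sketch: it handles only exact-value lists and nested objects, not the prefix, numeric, anything-but, or array-valued matching that EventBridge supports. The function name is hypothetical.

```python
def matches(pattern, event):
    """True if every field in the pattern is present in the event with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: expected is a list of allowed values
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc12345", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0abc12345", "state": "running"}}
```

Fields the pattern does not mention (such as instance-id) are ignored, so narrower patterns simply add more keys.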

Monitoring & Observability - Equivalents
GCP

Cloud Monitoring (metrics + dashboards + alerting, like CloudWatch) | Cloud Logging (like CloudWatch Logs) | Cloud Trace (distributed tracing, like AWS X-Ray) | Cloud Profiler (continuous CPU/memory profiling of running apps). All under Google Cloud Observability umbrella.

Azure

Azure Monitor (umbrella service: metrics, logs, alerts) | Log Analytics Workspace (like CloudWatch Logs Insights, uses KQL query language) | Application Insights (APM for apps, auto-traces HTTP, DB queries, exceptions; no direct AWS equivalent as a single managed service) | Azure Event Grid (like EventBridge).

Azure-Only

Application Insights: Full APM (Application Performance Monitoring) with auto-instrumentation of .NET, Java, Node, Python apps. Tracks requests, dependencies, exceptions, performance counters, user flows, availability tests. AWS would need a combination of X-Ray + CloudWatch + third-party APM (Datadog, Dynatrace).

CloudTrail / X-Ray Audit & Tracing

AWS CloudTrail

Records every AWS API call made in your account, whether via Console, CLI, SDK, or other AWS services. Who did what, when, from where. The audit trail for your entire AWS account. Enabled automatically, but events are only kept for 90 days in the CloudTrail console; create a Trail to send them to S3 for long-term retention (required for compliance).

Trail Types

  • Management Events: Control plane operations such as CreateBucket, RunInstances, DeleteUser. Enabled by default. Free for first copy.
  • Data Events: Data plane operations such as S3 object reads/writes (PutObject, GetObject) and Lambda invocations. High volume, extra cost. Enable for critical resources.
  • Insight Events: Detect unusual API activity (e.g., sudden spike in IAM calls). Extra cost but powerful anomaly detection.

# Example CloudTrail event: someone deleted an S3 bucket
{
  "eventTime": "2024-01-15T14:23:01Z",
  "eventName": "DeleteBucket",
  "userIdentity": {"type": "IAMUser", "userName": "john.doe"},
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-customer-data-backup"},
  "eventSource": "s3.amazonaws.com"
}
# → John deleted the production backup bucket from IP 203.0.113.45 at 2:23 PM UTC

AWS X-Ray - Distributed Tracing

X-Ray helps debug and analyze distributed applications (microservices). When a user request flows through API Gateway → Lambda → DynamoDB → SQS → another Lambda, X-Ray traces the entire journey, showing where latency comes from and where errors occur.

X-Ray Trace - Following a Request
  User Request (Total: 450ms)
  │
  ├── API Gateway: 5ms
  │
  ├── Lambda: process-order (380ms total)
  │   ├── Init (cold start): 150ms   ← performance problem!
  │   ├── DynamoDB PutItem: 12ms
  │   ├── SQS SendMessage: 8ms
  │   └── Execution: 210ms
  │
  └── Response: 65ms

  X-Ray shows: Cold start is causing 33% of total latency.
  Fix: Enable Provisioned Concurrency on this Lambda.

To use X-Ray: add the X-Ray SDK to your app code, or enable active tracing on Lambda/API Gateway (no code changes). X-Ray automatically generates a service map showing all components and their interconnections.

Tracing & Audit - Equivalents
GCP

Cloud Trace (distributed tracing, like X-Ray) | Cloud Audit Logs (like CloudTrail: Admin Activity, Data Access, System Event logs). Cloud Trace auto-instruments GCP services.

Azure

Application Insights Distributed Tracing (like X-Ray, part of App Insights) | Azure Monitor Activity Log (like CloudTrail: tracks all subscription-level operations).

AWS-M7

DevOps Tools - IaC, CI/CD & Automation

CloudFormation Infrastructure as Code

What is CloudFormation?

AWS's native IaC service. Define your entire infrastructure in YAML or JSON templates. CloudFormation handles creation, update, and deletion of resources in the right order. The service itself is free; you only pay for the resources it creates.

CloudFormation Template Structure

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web Application Stack'

Parameters:  # User inputs at deploy time
  EnvironmentName:
    Type: String
    Default: production
    AllowedValues: [development, staging, production]
  InstanceType:
    Type: String
    Default: t3.micro

Mappings:  # Lookup tables (e.g., AMI IDs per region)
  RegionAMIMap:
    ap-south-1:
      AMI: ami-0abc12345
    us-east-1:
      AMI: ami-0xyz67890

Conditions:  # Conditional resource creation
  IsProd: !Equals [!Ref EnvironmentName, production]

Resources:  # Actual AWS resources (required)
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMIMap, !Ref AWS::Region, AMI]
      SecurityGroupIds: [!Ref WebSecurityGroup]
      Tags:
        - Key: Environment
          Value: !Ref EnvironmentName

  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  # Only create this in production
  ElasticIP:
    Type: AWS::EC2::EIP
    Condition: IsProd
    Properties:
      InstanceId: !Ref MyEC2Instance

Outputs:  # Values returned after stack creation
  InstancePublicIP:
    Value: !GetAtt MyEC2Instance.PublicIp
    Export:
      Name: !Sub "${AWS::StackName}-PublicIP"

Key CloudFormation Concepts

Stacks & Stack Sets

A Stack is a deployed instance of a template (all the resources it creates). You update a stack by updating the template and running a changeset. StackSets deploy one template across multiple accounts and regions simultaneously, which is essential for large organizations.

Changesets

Preview what changes CloudFormation will make before actually making them. Shows: which resources will be added, modified, or deleted. Always review changesets before applying, and especially check for resource replacements (which cause downtime).

Drift Detection

Checks if actual resource state differs from what CloudFormation expects. If someone manually changed a Security Group that CloudFormation manages, drift detection finds it. Important for compliance and ensuring IaC is the source of truth.

!Ref and !GetAtt

Built-in functions for referencing other resources within the template. !Ref MyBucket returns the bucket name. !GetAtt MyBucket.Arn returns the bucket ARN. !Sub "arn:aws:s3:::${MyBucket}/*" substitutes variable into string.

CloudFormation vs Terraform (Key Differences)

Aspect | CloudFormation | Terraform
Language | YAML/JSON | HCL (HashiCorp Configuration Language)
Cloud support | AWS only | Multi-cloud (AWS, GCP, Azure, 1000+ providers)
State management | AWS manages state (no state file) | State file (must manage securely in S3/Terraform Cloud)
Native AWS support | Usually supports new AWS services at or near launch | Depends on provider update (usually within days)
Free | Yes | Free CLI, source-available under the BSL since 2023 (Terraform Enterprise is paid)
Module system | Nested stacks (complex) | Modules (cleaner, community registry)
Drift detection | Built in | Shown by terraform plan (or terraform plan -refresh-only)
Industry adoption | AWS shops | Most popular IaC tool overall

IaC - Equivalents
GCP

Deployment Manager (like CloudFormation, GCP-native, YAML/Jinja/Python) | Config Connector (manage GCP resources via Kubernetes CRDs) | Terraform is actually more commonly used in GCP environments than Deployment Manager.

Azure

Azure Resource Manager (ARM) Templates (like CloudFormation, JSON-based, verbose) | Bicep (ARM's modern replacement: cleaner syntax, transpiles to ARM JSON) | Azure Blueprints (for governance at scale: deploy policies + RBAC + resource groups together).

CodePipeline / CodeBuild / CodeDeploy AWS CI/CD

AWS CI/CD Toolchain Overview

AWS CodePipeline - Full CI/CD Flow
  ┌────────────────────────────────────────────────────────────────────┐
  │                          AWS CodePipeline                          │
  ├────────────┬───────────────┬──────────────┬────────────────────────┤
  │  SOURCE    │    BUILD      │    TEST      │       DEPLOY           │
  │            │               │              │                        │
  │ CodeCommit │  CodeBuild    │  CodeBuild   │  CodeDeploy → EC2      │
  │  GitHub    │  (compile,    │  (unit tests,│  CodeDeploy → Lambda   │
  │  Bitbucket │   lint,       │   integration│  ECS (Blue/Green)      │
  │  S3        │   docker build│   tests)     │  CloudFormation        │
  │            │   push to ECR)│              │  Beanstalk             │
  └────────────┴───────────────┴──────────────┴────────────────────────┘
  Each stage has actions. Failure at any stage stops the pipeline.

CodeBuild - Build Service

Managed build server. Runs your build commands in a Docker container, compiles code, runs tests, creates artifacts. Defined in a buildspec.yml file at the root of your repo.

# buildspec.yml: defines build steps
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.11
    commands:
      - pip install -r requirements.txt

  pre_build:
    commands:
      - echo Logging into ECR...
      - aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=$COMMIT_HASH

  build:
    commands:
      - echo Running tests...
      - pytest tests/ --junitxml=test-results.xml
      - echo Building Docker image...
      - docker build -t $ECR_URI:$IMAGE_TAG .

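  post_build:
    commands:
      - docker push $ECR_URI:$IMAGE_TAG
      # Write the image definitions file; "app" is a placeholder container name
      - printf '[{"name":"app","imageUri":"%s"}]' $ECR_URI:$IMAGE_TAG > imagedefinitions.json
      - echo Build complete. Image $ECR_URI:$IMAGE_TAG

artifacts:
  files:
    - imagedefinitions.json  # Used by the CodePipeline ECS deploy action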
reports:
  TestResults:
    files: test-results.xml
    file-format: JUNITXML

CodeDeploy - Deployment Service

Automates application deployments to EC2, on-premises servers, Lambda, and ECS. Handles rolling updates, blue/green deployments, automatic rollback on failure. Defined in appspec.yml.

# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /app
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_dependencies.sh
      timeout: 120
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 60

CodeDeploy Deployment Configurations

Config | How it deploys | Downtime?
CodeDeployDefault.AllAtOnce | All instances simultaneously | Yes (brief during deploy; full outage if the deploy fails)
CodeDeployDefault.HalfAtATime | 50% first, then 50% | Partial (reduced capacity)
CodeDeployDefault.OneAtATime | One instance at a time (slowest, safest) | No
Custom (e.g., 25% at a time) | Define your own batch size | Depends
Blue/Green (ECS/Lambda) | New version deployed alongside old, traffic shifted gradually | No; near-instant rollback
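
The batching behaviour in the table can be sketched as a small planner. This is an approximation for intuition only: CodeDeploy actually expresses these configurations as minimum healthy hosts rather than fixed batches, and the function name and short config labels here are made up.

```python
def deployment_batches(instances, config):
    """Split instances into the batches a deployment config would roll through."""
    count = len(instances)
    if config == "AllAtOnce":
        size = max(1, count)
    elif config == "HalfAtATime":
        size = max(1, count // 2)  # fractions rounded down, at least one instance
    elif config == "OneAtATime":
        size = 1
    else:
        raise ValueError(f"unknown config: {config}")
    return [instances[i:i + size] for i in range(0, count, size)]


fleet = ["i-01", "i-02", "i-03", "i-04"]
all_at_once = deployment_batches(fleet, "AllAtOnce")
half = deployment_batches(fleet, "HalfAtATime")
one = deployment_batches(fleet, "OneAtATime")
```

With four instances, HalfAtATime keeps two in service while the other two are updated, which is why capacity drops but the service stays up.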

Elastic Beanstalk - PaaS Deploy

If you don't want to manage CI/CD pipelines at all, Elastic Beanstalk is AWS's PaaS. Upload your app code (zip), EB handles EC2 provisioning, Auto Scaling, Load Balancer, health monitoring, and rolling deploys. Runs on top of standard AWS services (you can still see and modify the EC2 instances). Great for smaller teams or migrating existing apps quickly. Less flexible than managing EC2/ECS directly.

CI/CD Tools - Equivalents
GCP

Cloud Build (like CodeBuild) | Cloud Deploy (managed delivery to GKE/Cloud Run, with promotion through environments) | Artifact Registry (store build artifacts, Docker images)

Azure

Azure Pipelines (CI + CD in one service, like CodePipeline + CodeBuild + CodeDeploy combined; more integrated) | Azure Artifacts (package/artifact storage) | GitHub Actions (Microsoft owns GitHub, so deep Azure integration)

Systems Manager (SSM) Ops Automation

What is AWS Systems Manager?

SSM is a collection of operational tools for managing your EC2 instances and on-premises servers at scale. Often overlooked but incredibly powerful for DevOps. It's a suite of services, not just one thing.

SSM Session Manager

Connect to EC2 instances via browser or CLI without opening port 22, without a bastion host, without managing SSH keys. The SSM Agent on the instance communicates outbound to the SSM service, so no inbound port is needed. Auditable: session activity can be logged to S3 or CloudWatch Logs.

# Connect to EC2 via SSM (no SSH key, no port 22)
aws ssm start-session --target i-0abc12345

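# Port forwarding via SSM (reach RDS in a private subnet through the instance)
# The ...ToRemoteHost document forwards to another host; the plain
# AWS-StartPortForwardingSession document only forwards to a port on the instance itself.
aws ssm start-session --target i-0abc12345 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.cluster.ap-south-1.rds.amazonaws.com"],"portNumber":["3306"],"localPortNumber":["13306"]}'
# Now: mysql -h 127.0.0.1 -P 13306 -u admin -p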

SSM Parameter Store

Store configuration values and secrets. Types: String, StringList, SecureString (KMS-encrypted). Use for: app config, database hostnames, feature flags, non-sensitive or mildly-sensitive parameters.

# Store a parameter
aws ssm put-parameter \
  --name "/myapp/production/db-host" \
  --value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
  --type String

# Store an encrypted secret
aws ssm put-parameter \
  --name "/myapp/production/api-key" \
  --value "sk-abc123secret" \
  --type SecureString \
  --key-id alias/myapp-key

# Retrieve in code (Python)
import boto3

ssm = boto3.client('ssm')
# WithDecryption only matters for SecureString; it is ignored for String values
param = ssm.get_parameter(Name='/myapp/production/db-host', WithDecryption=True)
db_host = param['Parameter']['Value']

SSM Run Command

Run commands across multiple EC2 instances without SSH. Execute shell scripts, PowerShell, or Python across your entire fleet in seconds. Target instances by resource tag or resource group: "Run this on all instances tagged Environment=production."

SSM Patch Manager

Automate OS patching across your fleet. Define patch baselines (which patches to apply, e.g., only Critical + High severity), maintenance windows (when to apply, e.g., 2 AM Sunday), and patch groups (which instances). Never manually SSH to patch 50 servers again.

SSM State Manager

Keep instances in a desired state. Define an association: "All prod instances must have the CWAgent installed and running." State Manager periodically checks and enforces this. If someone removes the agent, SSM reinstalls it.

AWS-M8

Messaging & Decoupling

SQS Simple Queue Service

What is SQS?

SQS is a fully managed message queue service. It decouples producers (who send messages) from consumers (who process them). If your consumer is slow or down, messages accumulate safely in the queue until they are consumed or the retention period expires (default 4 days, max 14 days). Classic async communication pattern.

SQS - Decoupling Producer and Consumer
  WITHOUT SQS (Tight Coupling):
  Web App ──HTTP──► Worker Service
  If Worker is slow/down → Web App blocks or errors ✗

  WITH SQS (Loose Coupling):
  Web App ──SendMessage──► [SQS Queue] ◄──ReceiveMessage── Worker Service
  Web App returns immediately ✓           Worker processes at its own pace ✓
  Queue buffers messages during spikes ✓  Worker can scale independently ✓
  Messages survive worker crashes ✓

SQS Key Concepts

Queue Types

Standard Queue

Nearly unlimited throughput. Best-effort ordering (usually FIFO, but not guaranteed). At-least-once delivery (a message may be delivered more than once, so make your consumer idempotent). Good for most use cases where order doesn't strictly matter.

FIFO Queue

Guaranteed order (First-In-First-Out). Exactly-once processing (no duplicates). Limited to 3,000 msg/sec with batching (300 without). For: financial transactions, order processing, inventory changes where sequence matters.
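
Because a standard queue may deliver the same message twice, the consumer has to tolerate duplicates. A minimal sketch of an idempotent consumer follows; the class and field names are hypothetical, and the in-memory set stands in for what would need to be a durable store (for example a database table) in a real system.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by message ID."""

    def __init__(self):
        self.seen_ids = set()   # stand-in for a durable deduplication store
        self.order_totals = {}

    def handle(self, message_id, order_id, amount):
        if message_id in self.seen_ids:
            return False  # duplicate delivery: skip the side effect
        self.seen_ids.add(message_id)
        self.order_totals[order_id] = self.order_totals.get(order_id, 0) + amount
        return True


consumer = IdempotentConsumer()
first = consumer.handle("msg-1", "ORD-12345", 500)
duplicate = consumer.handle("msg-1", "ORD-12345", 500)  # redelivered copy
second = consumer.handle("msg-2", "ORD-12345", 250)
```

Without the duplicate check, the redelivered message would charge the order twice.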

Visibility Timeout

When a consumer reads a message, it's hidden from other consumers for the visibility timeout period (default 30s, max 12h). The consumer must delete the message before timeout expires. If it doesn't (consumer crashed), the message becomes visible again for another consumer to process. Set visibility timeout to slightly longer than your max processing time.

Dead Letter Queue (DLQ)

If a message fails processing too many times (exceeds maxReceiveCount), SQS moves it to a Dead Letter Queue. A DLQ lets you isolate and debug problematic messages without losing them. Always configure a DLQ for production queues; otherwise failed messages keep cycling until the retention period expires, consuming resources.
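
Visibility timeout and the DLQ redrive rule interact, which a small simulation makes easier to see. This is a toy model, not the SQS API: SimulatedQueue and its methods are invented for illustration, and time is passed in explicitly as a number of seconds.

```python
class SimulatedQueue:
    """Toy queue with a visibility timeout and a maxReceiveCount redrive rule."""

    def __init__(self, visibility_timeout, max_receive_count):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}      # id -> {"body", "visible_at", "receive_count"}
        self.dead_letter = []   # bodies moved to the DLQ
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self.messages[self._next_id] = {"body": body, "visible_at": 0.0, "receive_count": 0}
        return self._next_id

    def receive(self, now):
        for message_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # in flight: hidden from other consumers
            if msg["receive_count"] >= self.max_receive_count:
                # Already received the maximum number of times: move to DLQ
                self.dead_letter.append(msg["body"])
                del self.messages[message_id]
                continue
            msg["receive_count"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return message_id, msg["body"]
        return None

    def delete(self, message_id):
        self.messages.pop(message_id, None)


queue = SimulatedQueue(visibility_timeout=30, max_receive_count=2)
queue.send("order-1")
attempt_1 = queue.receive(now=0)    # consumer crashes, never deletes
hidden = queue.receive(now=10)      # still inside the visibility timeout
attempt_2 = queue.receive(now=31)   # visible again, crashes again
attempt_3 = queue.receive(now=62)   # exceeds maxReceiveCount: goes to the DLQ
```

After two failed attempts the message lands in queue.dead_letter instead of being retried indefinitely.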

Long Polling

When a consumer calls ReceiveMessage and the queue is empty, short polling returns immediately (wasteful API calls). Long polling waits up to 20 seconds for a message to arrive before returning empty. Reduces cost (fewer API calls) and reduces false-empty responses. Always use long polling (WaitTimeSeconds=20).

# Sending a message (Python boto3)
import json
import boto3

QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456/my-queue'  # placeholder URL

def process_order(order):
    # Placeholder for your business logic
    print(f"Processing {order['order_id']}")

sqs = boto3.client('sqs')
response = sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'order_id': 'ORD-12345',
        'customer_id': 'CUST-789',
        'items': [{'product': 'laptop', 'qty': 1}]
    }),
    MessageAttributes={
        'EventType': {'StringValue': 'OrderPlaced', 'DataType': 'String'}
    }
)

# Consuming messages (long polling)
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,        # Long polling
        VisibilityTimeout=60       # 60s to process
    )
    for message in response.get('Messages', []):
        process_order(json.loads(message['Body']))
        # Delete only after successful processing; on an exception the
        # message reappears after the visibility timeout and is retried
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message['ReceiptHandle']
        )

Message Queues - Equivalents
GCP

Cloud Pub/Sub: acts as both a queue AND pub/sub. Pull subscriptions work like SQS (consumer polls). Push subscriptions push to an HTTP endpoint. At-least-once delivery. No native FIFO, but the ordering key feature ensures ordered delivery within a key.

Azure

Azure Service Bus (full-featured queue + pub/sub, like SQS + some SNS features; supports sessions for FIFO, dead-lettering, transactions) | Azure Queue Storage (simpler, cheaper, like a basic SQS standard queue; 7-day default message retention vs Service Bus's 14 days).

SNS / EventBridge Pub/Sub & Events

Amazon SNS - Simple Notification Service

SNS is a publish/subscribe (pub/sub) messaging service. A publisher sends a message to a Topic, and SNS fans it out to all subscribers simultaneously. One message → many consumers. Perfect for: fanout pattern, notifications, decoupled event broadcasting.

SNS Subscribers

A topic can have multiple subscribers of different types: SQS queue, Lambda function, HTTP/HTTPS endpoint, Email, SMS, Mobile Push (APNs, GCM), Kinesis Data Firehose.

SNS Fanout Pattern - One Message, Multiple Consumers
  Order Service publishes "OrderPlaced" event to SNS Topic
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
        SQS Queue       Lambda Fn        SQS Queue
        (Inventory      (Send email      (Analytics
         Service)        confirmation)    Service)

  All three consumers receive the same message independently.
  If one consumer is down, others still get the message.

SNS vs SQS - Key Difference

Feature | SNS (Pub/Sub) | SQS (Queue)
Pattern | 1 publisher → many subscribers (fanout) | Producers → queue → one consumer per message
Message persistence | No persistence (if no subscriber, message lost) | Persists up to 14 days
Consumers | Multiple, all receive the message | One consumer per message (competing consumers)
Pull vs Push | Push to subscribers | Consumer pulls
Best for | Broadcast notifications, fanout, alerting | Task queue, work distribution, decoupling

SNS + SQS Fanout Pattern

The most common real-world pattern: SNS pushes to multiple SQS queues. This gives you fanout (SNS) with durability and retry (SQS):

# Architecture:
# New Product Added β†’ SNS Topic "product-events"
#   β†’ SQS "inventory-queue" β†’ Inventory Lambda
#   β†’ SQS "search-index-queue" β†’ Search Index Lambda
#   β†’ SQS "notification-queue" β†’ Push Notification Lambda

# If Search Index Lambda is down: messages buffer in search-index-queue
# Inventory and Push Notification still work independently
# When Search Lambda recovers: processes all buffered messages
# This is the gold standard for reliable event-driven microservices.
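
One detail the fanout pattern depends on: each SQS queue needs an access policy that lets the SNS topic send to it, otherwise deliveries to that queue are rejected. A minimal sketch of building such a policy in Python follows. The account ID, region, ARNs, and the function name are hypothetical placeholders, not values from these notes.

```python
import json

def sqs_policy_for_sns(queue_arn: str, topic_arn: str) -> dict:
    """Build a queue access policy that lets one SNS topic deliver messages."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowSnsTopicToSend",
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Restrict delivery to this one topic rather than any SNS topic
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }

# Hypothetical ARNs for illustration only
TOPIC_ARN = "arn:aws:sns:ap-south-1:111111111111:product-events"
QUEUE_ARN = "arn:aws:sqs:ap-south-1:111111111111:inventory-queue"

policy = sqs_policy_for_sns(QUEUE_ARN, TOPIC_ARN)
print(json.dumps(policy, indent=2))
```

The resulting JSON string would typically be set as the queue's Policy attribute (for example with boto3's set_queue_attributes) before subscribing the queue to the topic.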

Amazon EventBridge

An event bus service for building event-driven applications. More powerful than SNS for complex routing: you can filter events by content, transform them, route to 20+ AWS services, connect to third-party SaaS apps (Salesforce, Zendesk, Datadog), and create custom event buses per service.

  • Default Event Bus: Receives AWS service events (EC2 state changes, CodePipeline updates, etc.)
  • Custom Event Bus: For your own application events. Publish events from your microservices here.
  • Partner Event Bus: Receive events from SaaS partners (Shopify orders, GitHub events)
# Publish a custom event to EventBridge
import json
import boto3

events = boto3.client('events')
events.put_events(
    Entries=[{
        'Source': 'com.mycompany.orders',
        'DetailType': 'OrderPlaced',
        'Detail': json.dumps({'orderId': 'ORD-123', 'total': 599.99}),
        'EventBusName': 'my-app-events'
    }]
)
# EventBridge rule routes this to: Lambda for fulfillment,
# SQS for analytics, another EventBridge bus in a different account
# Based on content: {"source": ["com.mycompany.orders"], "detail-type": ["OrderPlaced"]}
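
To make the content-based rule above concrete, here is a deliberately simplified sketch of how an event pattern is matched against an event. It covers only exact-value lists and nested keys; real EventBridge patterns support more operators (prefix, numeric ranges, anything-but, and so on), and the function name `matches` is illustrative.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching: every pattern key must match.

    A list in the pattern means "the event value must equal one of these";
    a nested dict means "recurse into the same key of the event".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

pattern = {"source": ["com.mycompany.orders"], "detail-type": ["OrderPlaced"]}
event = {
    "source": "com.mycompany.orders",
    "detail-type": "OrderPlaced",
    "detail": {"orderId": "ORD-123", "total": 599.99},
}
other = {"source": "com.mycompany.billing", "detail-type": "InvoicePaid", "detail": {}}

print(matches(pattern, event))   # the order event satisfies the pattern
print(matches(pattern, other))   # the billing event does not
```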

Amazon Kinesis - Real-Time Streaming

For high-throughput, real-time data streaming. Unlike SQS (a queue, where messages are consumed and deleted), Kinesis retains data as a stream that multiple consumers can read from. Think of it as a real-time data pipeline.

Service | What it does | Use case
Kinesis Data Streams | Real-time data stream. Shards provide throughput (1 MB/s write per shard). Multiple consumers. Retain 1-365 days. | Real-time clickstream, app logs, IoT telemetry
Kinesis Data Firehose | Fully managed ETL: stream data directly to S3, Redshift, OpenSearch, Splunk. Auto-scales, buffers, compresses, transforms. | Load streaming data to S3 data lake or Redshift without code
Kinesis Data Analytics | Run SQL or Apache Flink on streaming data in real-time | Real-time dashboards, anomaly detection, aggregations
MSK (Managed Kafka) | Fully managed Apache Kafka. For teams that need Kafka compatibility. | Kafka migration, complex event streaming, ecosystem tools
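
Because shards set the throughput of a Kinesis Data Stream, a rough shard estimate for provisioned mode is simple arithmetic. The sketch below uses the 1 MB/s write limit from the table; the 1,000 records/s write limit and 2 MB/s read limit per shard are assumptions taken from commonly published AWS limits, and the function name is illustrative.

```python
import math

WRITE_MB_PER_SHARD = 1.0         # from the table above
WRITE_RECORDS_PER_SHARD = 1000   # assumed per-shard record limit
READ_MB_PER_SHARD = 2.0          # assumed shared read throughput per shard

def shards_needed(write_mb_s: float, records_s: float, read_mb_s: float) -> int:
    """Return the smallest shard count that satisfies every per-shard limit."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )

# 4.5 MB/s in, 3,000 records/s, 9 MB/s read by standard consumers
print(shards_needed(write_mb_s=4.5, records_s=3000, read_mb_s=9.0))
```

Whichever limit is hit first decides the count, which is why many small records can need more shards than the byte rate alone suggests.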
Pub/Sub & Streaming - Equivalents
GCP

Pub/Sub (handles both SNS and SQS use cases, with push and pull modes). Dataflow (like Kinesis Data Analytics, uses Apache Beam). Pub/Sub Lite (lower cost, regional, like Kinesis for ordered streams).

Azure

Azure Event Grid (like EventBridge: event routing, serverless, pay-per-event) | Azure Event Hubs (like Kinesis Data Streams: high-throughput event streaming, with a Kafka-compatible API) | Azure Service Bus Topics (like SNS: pub/sub with filtering)

Azure-Only

Azure Event Hubs Kafka-compatible API: You can use your existing Apache Kafka clients to produce/consume from Event Hubs with little or no code change, mainly by pointing them at the Event Hubs endpoint and updating the connection settings. AWS MSK also offers Kafka compatibility, but Event Hubs being serverless AND Kafka-compatible is unique in the PaaS space.

AWS-M4

IAM & Security Services

IAM Identity & Access Management

What is IAM?

IAM is AWS's centralized service for controlling who can do what on which AWS resources. It's global (not region-specific) and free. Every API call to AWS is authenticated and authorized through IAM. No IAM permission → API call denied, period.

IAM Entities

IAM Users

A person or application that needs permanent, long-term credentials to interact with AWS. Has a username + password (console) and/or access key + secret key (programmatic). Best practice: don't use the root account for daily work; create individual IAM users. Even better: use IAM Identity Center (SSO) for humans.

IAM Groups

A collection of IAM users. Attach policies to the group and all members inherit those permissions. Groups contain only users: you cannot nest groups or put roles in a group. Simplifies permission management: add user to "Developers" group → gets all developer permissions.

IAM Roles

An IAM identity without permanent credentials. Instead, when something assumes a role, it gets temporary security credentials (valid minutes to hours). Used by: EC2 instances (instead of hardcoded keys), Lambda functions, cross-account access, federated users (SSO), ECS tasks. This is the correct way for AWS services to access other services: never hardcode access keys in code.

IAM Role - EC2 Instance Assuming a Role
  EC2 instance needs to write to S3
  ─────────────────────────────────────────────────────────────────────
  BAD:  Hardcode access_key + secret in app → leaked in Git → disaster
  ─────────────────────────────────────────────────────────────────────
  GOOD: EC2 IAM Role with s3:PutObject permission:

  IAM Role "EC2-S3-Writer" ──attached to──► EC2 Instance
       │
       └── Policy: Allow s3:PutObject on arn:aws:s3:::my-bucket/*

  Inside EC2: AWS SDK auto-fetches temporary credentials from IMDS
  http://169.254.169.254/latest/meta-data/iam/security-credentials/EC2-S3-Writer
  → Access Key (temp), Secret Key (temp), Session Token, Expiration
  → SDK auto-refreshes these before expiry

IAM Policies

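JSON documents defining permissions. A policy has one or more statements, each with an Effect (Allow or Deny), one or more Actions, the Resources they apply to, and optionally a Condition and a Sid (statement label):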

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteUnlessMFA",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-app-bucket/*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    }
  ]
}

Policy Types

Type | Attached to | Managed by | Use case
AWS Managed Policy | User, Group, Role | AWS creates & updates | Common permission sets: AmazonS3ReadOnlyAccess, AdministratorAccess
Customer Managed Policy | User, Group, Role | You create & manage | Custom permissions for your org. Reusable. Versionable.
Inline Policy | Single User, Group, or Role | You create, embedded in identity | Strict 1:1 relationship. Deleted when identity deleted. Avoid when possible.
Resource-based Policy | The resource itself (S3 bucket, SQS queue, Lambda) | You create on the resource | Grant cross-account access without assuming a role. Used for S3 bucket policies, Lambda resource policies.
Permission Boundary | User or Role | Admin sets max permissions ceiling | Delegate IAM permission management to devs but cap what they can grant.
Service Control Policy (SCP) | AWS Org OUs or accounts | Org admin | Maximum permissions guardrails across entire AWS accounts. "Nobody in this account can touch us-west-1."

IAM Policy Evaluation Logic

When a request is made, AWS evaluates all applicable policies:

IAM Policy Evaluation Order
  API request arrives
         │
         ▼
  1. Explicit DENY anywhere? ───── YES ──► DENY (stops here)
         │ NO
         ▼
  2. SCP allows? (if AWS Org) ──── NO ───► DENY
         │ YES
         ▼
  3. Resource-based policy allows? ─ YES ─► (may ALLOW without identity policy)
         │ NO
         ▼
  4. Permission Boundary allows? ── NO ───► DENY
         │ YES
         ▼
  5. Identity policy allows? ────── YES ──► ALLOW
         │ NO
         ▼
         DENY (implicit: default deny everything)
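
The core of the flow above (explicit deny wins, then an allow is required, otherwise implicit deny) can be sketched in a few lines of Python. This is a teaching sketch, not AWS's evaluator: it looks only at identity-policy statements, ignores Conditions, SCPs, permission boundaries, and resource-based policies, and uses case-sensitive wildcard matching, whereas IAM action names are case-insensitive. All function names are illustrative.

```python
from fnmatch import fnmatchcase

def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]

def _statement_applies(stmt: dict, action: str, resource: str) -> bool:
    action_ok = any(fnmatchcase(action, p) for p in _as_list(stmt["Action"]))
    resource_ok = any(fnmatchcase(resource, p) for p in _as_list(stmt["Resource"]))
    return action_ok and resource_ok

def evaluate(statements: list, action: str, resource: str) -> str:
    applicable = [s for s in statements if _statement_applies(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in applicable):
        return "DENY (explicit)"      # an explicit deny always wins
    if any(s["Effect"] == "Allow" for s in applicable):
        return "ALLOW"
    return "DENY (implicit)"          # nothing allowed it

statements = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::my-app-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::my-app-bucket/*"},
]

obj = "arn:aws:s3:::my-app-bucket/report.csv"
print(evaluate(statements, "s3:GetObject", obj))
print(evaluate(statements, "s3:DeleteObject", obj))
print(evaluate(statements, "ec2:StartInstances", "arn:aws:ec2:ap-south-1:111111111111:instance/i-0abc123"))
```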

Cross-Account Access

Account A's EC2 wants to access Account B's S3 bucket. Process:

  1. Account B creates an IAM Role with a trust policy allowing Account A to assume it
  2. Account B's role has the S3 permissions needed
  3. Account A's EC2 calls sts:AssumeRole for Account B's role
  4. Gets temporary credentials for Account B → can now access Account B's S3
# Trust policy on Account B's role (who can assume it).
# The Principal is the role used by Account A's EC2 instance.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111111111111:role/EC2-Role"},
    "Action": "sts:AssumeRole"
  }]
}

# Account A EC2 assuming Account B's role (boto3):
import boto3
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::222222222222:role/S3-Access-Role',  # Account B
    RoleSessionName='my-session'
)
creds = response['Credentials']
# Use the temporary credentials to create an S3 client for Account B
s3_b = boto3.client(
    's3',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)
IAM Equivalents
GCP

Cloud IAM. Key differences: GCP uses Service Accounts (like IAM roles but with an email identity; can be granted access to specific resources). GCP IAM is more resource-centric (bind roles to resources). No inline policies: roles are always separate entities. Workload Identity = IAM roles for GKE pods.

Azure

Microsoft Entra ID (formerly Azure Active Directory / Azure AD) for identity + Azure RBAC for access control. Azure uses Entra ID for both human users and service principals (like IAM roles). Managed Identities = IAM roles for Azure VMs/Functions. Azure RBAC assigns built-in or custom roles to identities at various scopes (management group, subscription, resource group, resource).

Azure-Only

Microsoft Entra ID (formerly Azure Active Directory): Much more feature-rich identity provider than AWS IAM. It supports OAuth 2.0, SAML, and OIDC federation with thousands of apps natively, Conditional Access policies (block login from outside the country), and Privileged Identity Management (JIT access). AWS equivalent would be IAM Identity Center + Cognito combined, with less enterprise AD integration.

KMS / Secrets Manager / SSM Secrets & Key Management

AWS KMS - Key Management Service

KMS manages cryptographic keys used to encrypt your data. You never handle raw key material: KMS keeps keys secure inside Hardware Security Modules (HSMs). Services like S3, EBS, RDS, Secrets Manager all use KMS keys to encrypt data.

KMS Key Types

# Encrypt data with KMS (AWS CLI)
aws kms encrypt \
  --key-id arn:aws:kms:ap-south-1:123456789:key/abc-123 \
  --plaintext fileb://secret.txt \
  --output text --query CiphertextBlob | base64 --decode > encrypted.bin

# Decrypt
aws kms decrypt \
  --ciphertext-blob fileb://encrypted.bin \
  --output text --query Plaintext | base64 --decode

Envelope Encryption

KMS uses envelope encryption: a Data Encryption Key (DEK) is generated to encrypt your actual data. The DEK itself is encrypted by the KMS key (formerly called a customer master key, CMK). Only the encrypted DEK is stored with the data. To decrypt: call KMS to decrypt the DEK, then use the plaintext DEK to decrypt the data. This way, large amounts of data never pass through the KMS API.
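
The envelope flow can be followed end to end with a toy stand-in for KMS. Everything below is illustrative: `ToyKms` is a hypothetical in-memory class, and the XOR keystream cipher is NOT secure and is used only so the sketch runs with the standard library. A real application would call KMS for the data key and use an authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets

def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy cipher (not secure): XOR data with HMAC-SHA256(key, nonce || counter)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)

class ToyKms:
    """Stands in for KMS: the master key never leaves this object."""
    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        """Return (plaintext DEK, DEK encrypted under the master key)."""
        dek = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        return dek, nonce + _keystream_xor(self._master_key, nonce, dek)

    def decrypt(self, encrypted_dek: bytes) -> bytes:
        nonce, body = encrypted_dek[:16], encrypted_dek[16:]
        return _keystream_xor(self._master_key, nonce, body)

def envelope_encrypt(kms: ToyKms, data: bytes) -> dict:
    dek, encrypted_dek = kms.generate_data_key()
    nonce = secrets.token_bytes(16)
    ciphertext = nonce + _keystream_xor(dek, nonce, data)  # data is encrypted locally
    # Only the encrypted DEK is stored next to the data; the plaintext DEK is discarded
    return {"encrypted_dek": encrypted_dek, "ciphertext": ciphertext}

def envelope_decrypt(kms: ToyKms, envelope: dict) -> bytes:
    dek = kms.decrypt(envelope["encrypted_dek"])  # small request to "KMS"
    blob = envelope["ciphertext"]
    return _keystream_xor(dek, blob[:16], blob[16:])

kms = ToyKms()
envelope = envelope_encrypt(kms, b"order data " * 1000)
print(len(envelope["encrypted_dek"]), "bytes went through the key service")
print(envelope_decrypt(kms, envelope) == b"order data " * 1000)
```

Note that only the 32-byte data key ever passes through the key service; the 11 KB payload is encrypted and decrypted locally.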

AWS Secrets Manager

Store, manage, and rotate secrets (DB passwords, API keys, OAuth tokens). Secrets are encrypted at rest via KMS. Applications retrieve secrets via API, so no plaintext secrets live in code or environment variables.

# Retrieve secret in Python (boto3)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
response = client.get_secret_value(SecretId='prod/myapp/db-credentials')
secret = json.loads(response['SecretString'])
db_password = secret['password']  # Fresh from Secrets Manager, never hardcoded

AWS Systems Manager Parameter Store

Lightweight configuration and secrets storage. Two tiers: Standard (free, values up to 4 KB) and Advanced (paid, values up to 8 KB, plus parameter policies such as expiration).

Secrets Manager vs Parameter Store: Use Secrets Manager when you need automatic rotation. Use Parameter Store for config, non-sensitive data, or cost-sensitive secrets (the Standard tier is free).

# Store a parameter (CLI)
aws ssm put-parameter \
  --name "/myapp/prod/db-host" \
  --value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
  --type "String"

aws ssm put-parameter \
  --name "/myapp/prod/db-password" \
  --value "SuperSecret123!" \
  --type "SecureString"  # Encrypted with KMS

# Retrieve in app
aws ssm get-parameter --name "/myapp/prod/db-password" --with-decryption
Secrets & Key Management - Equivalents
GCP

Cloud KMS (key management, like AWS KMS) | Secret Manager (like Secrets Manager: stores secrets, automatic versioning, access via API). GCP Cloud HSM is part of Cloud KMS. No direct equivalent to SSM Parameter Store; Secret Manager serves both use cases.

Azure

Azure Key Vault: unified service for secrets, keys, AND certificates. Equivalent to AWS KMS + Secrets Manager combined. Key Vault also manages TLS/SSL certificates with automatic renewal. Azure Dedicated HSM = AWS CloudHSM equivalent.

Azure-Only

Azure Key Vault Certificates: natively manages TLS certificates (creation, renewal, storage) in one service. AWS equivalent requires ACM (certificates) + KMS (keys) + Secrets Manager (secrets) as separate services.

WAF / Shield / GuardDuty Threat Protection

AWS WAF - Web Application Firewall

Protects web applications from common web exploits (OWASP Top 10): SQL injection, XSS, bad bots, malformed requests. Deployed in front of CloudFront, ALB, API Gateway, or AppSync. You define Web ACLs with rules.

# Rate-based rule example (simplified pseudo-config, not literal Terraform or CloudFormation):
# Block any IP that sends more than 2000 requests per 5 minutes
Rule: RateBasedRule
  Action: BLOCK
  Statement:
    RateBasedStatement:
      Limit: 2000
      AggregateKeyType: IP

AWS Shield - DDoS Protection

Amazon GuardDuty - Threat Detection

Continuous security monitoring service that analyzes: VPC Flow Logs, CloudTrail API logs, DNS logs, and optionally EKS audit logs and S3 data events. Uses ML to detect threats like: EC2 cryptomining, root credential usage, unusual API calls from unknown IPs, port scanning, compromised credentials accessing S3.

GuardDuty doesn't block anything: it generates findings (alerts) with severity levels (low/medium/high). You automate responses via EventBridge → Lambda (e.g., auto-isolate a compromised instance by swapping its security groups for a restrictive quarantine group).
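
A sketch of the decision step inside such a Lambda follows. It makes several assumptions: the event shape (`detail.severity`, `detail.type`, `detail.resource.instanceDetails.instanceId`) follows the GuardDuty finding format as commonly documented, the numeric severity bands (low below 4.0, medium 4.0 to 6.9, high 7.0 and above) are taken from AWS documentation rather than these notes, and the action names are hypothetical. The handler only decides what to do; it makes no AWS calls.

```python
def severity_label(severity: float) -> str:
    """Map GuardDuty's numeric severity to a label (assumed bands)."""
    if severity >= 7.0:
        return "high"
    if severity >= 4.0:
        return "medium"
    return "low"

def handler(event, context=None):
    detail = event["detail"]
    label = severity_label(float(detail["severity"]))
    instance_id = (
        detail.get("resource", {}).get("instanceDetails", {}).get("instanceId")
    )
    # Only isolate automatically for high-severity findings on a known instance
    action = "isolate-instance" if label == "high" and instance_id else "notify-only"
    return {
        "findingType": detail["type"],
        "severity": label,
        "instanceId": instance_id,
        "action": action,
    }

sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {
        "type": "CryptoCurrency:EC2/BitcoinTool.B!DNS",
        "severity": 8,
        "resource": {"instanceDetails": {"instanceId": "i-0abc123"}},
    },
}
print(handler(sample_event))
```

In a real function the "isolate-instance" branch would go on to change the instance's security groups and notify the team.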

Amazon Inspector

Automated vulnerability scanning for EC2 instances and container images. Continuously scans for OS package vulnerabilities (CVEs), network exposure issues, software vulnerabilities. Integrates with ECR to scan images on push. Different from GuardDuty (runtime threat detection): Inspector is about vulnerability assessment.

Security Services - Equivalents
GCP

Cloud Armor (= WAF + DDoS, like WAF + Shield combined) | Security Command Center (SCC) (threat detection, vulnerability findings, like GuardDuty + Inspector combined) | Container Analysis (vulnerability scanning in Artifact Registry, like ECR + Inspector).

Azure

Azure WAF (via Front Door or Application Gateway) | Azure DDoS Protection Standard (like Shield Advanced) | Microsoft Defender for Cloud (threat detection + vulnerability assessment, like GuardDuty + Inspector + more) | Microsoft Sentinel (SIEM/SOAR, no direct AWS equivalent).

Azure-Only

Microsoft Sentinel: A full SIEM/SOAR platform that ingests logs from Azure + on-prem + multi-cloud + third-party tools, uses ML for threat hunting, and automates playbooks. AWS equivalent would be custom-built using CloudTrail + GuardDuty + Macie + Security Hub + custom Lambda playbooks. Sentinel is more turnkey.

AWS-M5

Databases

RDS Relational Database Service

What is RDS?

RDS is a managed relational database service. AWS handles: provisioning hardware, installing the DB engine, patching, backups, monitoring, Multi-AZ failover. You focus on your schema and queries. Supports: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.

RDS Key Features

Multi-AZ

Your primary DB runs in one AZ. A standby replica runs in a different AZ, synchronously receiving every write. If the primary fails, AWS automatically promotes the standby and updates the DNS endpoint within 1-2 minutes. Your app reconnects to the same endpoint, with no code changes. The Multi-AZ standby is NOT readable: it exists purely for failover. For read scale, use Read Replicas.

RDS Multi-AZ vs Read Replicas
  MULTI-AZ (for HA/failover):              READ REPLICAS (for read scale):

  App ──► RDS Endpoint                     App ──► Primary (write endpoint)
          │                                         │
          ▼                                         ├──async repl──► Read Replica 1
  Primary DB (AZ-a) ──sync repl──►                 ├──async repl──► Read Replica 2
  Standby DB (AZ-b) [not readable]                 └──async repl──► Read Replica (another region)

  Failover: ~60-120 seconds, auto         Read: use separate read endpoint
  Standby ONLY for failover               Slight replication lag (async)

Read Replicas

Automated Backups & Snapshots

RDS Proxy

A managed connection pool between your app and RDS. Databases have limited connections (a small instance such as db.t3.micro MySQL defaults to well under 100). Lambda functions scale to thousands of concurrent invocations, so without RDS Proxy they'd exhaust DB connections. RDS Proxy maintains a warm pool and multiplexes application connections. Also improves failover time (connections held during failover, reducing app errors).

Lambda + RDS = Use RDS Proxy. Avoid connecting Lambda directly to RDS at scale without RDS Proxy. Each concurrent Lambda execution environment holds its own DB connection, so at 1000 concurrent executions you'd hit DB connection limits quickly. With RDS Proxy, Lambda connects to the proxy, which maintains a small pool to RDS. Classic serverless + relational DB pattern.
Aurora AWS's Cloud-Native DB

What is Aurora?

Aurora is AWS's proprietary cloud-native relational database compatible with MySQL and PostgreSQL. It's NOT just a managed MySQL: AWS redesigned the storage layer from scratch. Result, per AWS's own claims: up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL. Higher cost than standard RDS (~20%) but typically worth it for production workloads.

Aurora Architecture

Aurora's storage is completely separate from the compute (DB instances). Storage is a distributed, fault-tolerant, self-healing cluster across 3 AZs × 2 copies = 6 copies of your data. Can lose 2 copies without write availability loss, 3 copies without read availability loss.
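
Those two tolerances follow from quorum arithmetic. The sketch below assumes Aurora's published quorum sizes (writes need 4 of 6 copies, reads need 3 of 6); the constant and function names are illustrative.

```python
TOTAL_COPIES = 6
WRITE_QUORUM = 4  # assumed: writes need 4 of 6 copies
READ_QUORUM = 3   # assumed: reads need 3 of 6 copies

def availability(copies_lost: int) -> dict:
    """Report whether reads and writes are still possible after losing copies."""
    healthy = TOTAL_COPIES - copies_lost
    return {"writes": healthy >= WRITE_QUORUM, "reads": healthy >= READ_QUORUM}

for lost in range(5):
    print(lost, "copies lost ->", availability(lost))
```

Losing a whole AZ removes 2 copies, which still leaves 4 and keeps writes available; losing an AZ plus one more copy leaves 3, enough for reads only.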

Aurora Storage - Distributed Across 3 AZs
  ┌─────────────────────────────────────────────────────────────────┐
  │                    AURORA CLUSTER                               │
  │                                                                 │
  │  Writer Instance (Primary) ──────────────────────────────────┐  │
  │  Reader Instance 1          ─── Shared Storage Cluster ────► │  │
  │  Reader Instance 2          ─── (6 copies, 3 AZs)            │  │
  │                                                              │  │
  │  AZ-1: [Data Copy 1] [Data Copy 2]                           │  │
  │  AZ-2: [Data Copy 3] [Data Copy 4]                           │  │
  │  AZ-3: [Data Copy 5] [Data Copy 6]                           │  │
  └─────────────────────────────────────────────────────────────────┘

  Failover: ~30 seconds (promote a reader on the same shared storage;
  no data copy needed since readers already share storage)

Aurora Features

Aurora Serverless v2

Aurora capacity auto-scales in fine-grained increments (0.5 ACU steps) based on actual load, within seconds. You define min/max ACUs. Pay per second of capacity used. Ideal for: variable workloads, dev/test, new apps with unpredictable traffic. Can scale from nearly zero to 128 ACUs (≈256 GB RAM) in seconds.

Aurora Global Database

One primary region with up to 5 secondary read-only regions. Replication lag < 1 second globally (uses AWS's dedicated infrastructure, not the internet). Used for: global read scale, DR (RPO < 1s, RTO < 1 minute: just promote a secondary to primary). Unlike standard cross-region read replicas, Global DB can handle replication even under high write load.

Aurora Backtrack

MySQL-compatible only. Rewind the DB to a point in the past without restoring from backup. Goes back in time by replaying the storage log. Can backtrack up to 72 hours. Fast: takes seconds vs hours for a restore. Useful for: "oops we just ran DELETE without WHERE."

Managed Relational DB - Equivalents
GCP

Cloud SQL (managed MySQL, PostgreSQL, SQL Server, like standard RDS) | Cloud Spanner (global, horizontally scalable relational DB; no direct AWS equivalent, but closest to Aurora Global + Vitess. True horizontal write scale across regions with ACID transactions). Spanner is unique: AWS has nothing comparable.

Azure

Azure SQL Database (managed SQL Server, like RDS SQL Server) | Azure Database for MySQL/PostgreSQL (like RDS MySQL/PostgreSQL) | Azure Cosmos DB for PostgreSQL (distributed PostgreSQL, like Citus; no direct AWS equivalent for this exact feature).

GCP-Only

Cloud Spanner: Globally distributed, ACID-compliant relational DB that scales horizontally for writes across regions. AWS Aurora Global DB scales reads globally but writes are single-region. Spanner scales both globally. AWS has no equivalent; closest would be DynamoDB Global Tables (NoSQL) or CockroachDB on EC2.

DynamoDB Serverless NoSQL

What is DynamoDB?

DynamoDB is AWS's managed NoSQL key-value and document database. Fully serverless: no instances to size, automatic scaling, single-digit millisecond performance at any scale. Powers Amazon.com's shopping cart, Lyft's ride data, Duolingo's learning streak: workloads at massive scale.

DynamoDB Data Model

# Example DynamoDB table for an e-commerce app:
# PK = UserID, SK = OrderID

Items:
{ "UserID": "user123",  "OrderID": "order001", "Status": "Delivered", "Total": 299.99 }
{ "UserID": "user123",  "OrderID": "order002", "Status": "Shipped",   "Total": 49.99  }
{ "UserID": "user456",  "OrderID": "order003", "Status": "Pending",   "Total": 799.00 }

# Query: All orders for user123 (efficient - same partition)
aws dynamodb query \
  --table-name Orders \
  --key-condition-expression "UserID = :uid" \
  --expression-attribute-values '{":uid": {"S": "user123"}}'

Capacity Modes

Mode | How it works | Best for
On-Demand | Pay per request (read/write request units). Scales automatically. No capacity planning. | New apps, variable traffic, don't know your load. Slightly more expensive per request than provisioned at steady state.
Provisioned | You set RCUs (Read Capacity Units) and WCUs (Write Capacity Units). Can use Auto Scaling to adjust. Cheaper at steady state. May throttle if you exceed provisioned capacity. | Predictable steady workloads. Pair with Auto Scaling for some elasticity.

Capacity Units
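
A rough sizing sketch for provisioned mode, assuming the standard DynamoDB unit definitions: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. The function names are illustrative, and transactional operations (which cost double) are left out.

```python
import math

def rcus(item_kb: float, reads_per_sec: int, strongly_consistent: bool = True) -> int:
    """Read capacity units needed for a steady read rate."""
    units_per_read = math.ceil(item_kb / 4)   # reads are billed in 4 KB steps
    total = units_per_read * reads_per_sec
    # Eventually consistent reads cost half as much
    return total if strongly_consistent else math.ceil(total / 2)

def wcus(item_kb: float, writes_per_sec: int) -> int:
    """Write capacity units needed for a steady write rate."""
    return math.ceil(item_kb) * writes_per_sec  # writes are billed in 1 KB steps

print(rcus(item_kb=6, reads_per_sec=100))                             # strongly consistent
print(rcus(item_kb=6, reads_per_sec=100, strongly_consistent=False))  # eventually consistent
print(wcus(item_kb=2.5, writes_per_sec=50))
```

Item size is rounded up per operation, so a 6 KB item costs 2 RCUs per strongly consistent read and a 2.5 KB item costs 3 WCUs per write.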

DynamoDB Advanced Features

Global Secondary Indexes (GSI)

Query on non-primary key attributes. A GSI has its own partition key + sort key (different from table's PK/SK) and its own capacity. Enables different access patterns without data duplication in your code.

# Table: PK=UserID, SK=OrderID
# Query by Status: can't do this without an index (full table scan is expensive)
# Add GSI: PK=Status, SK=CreatedAt → can now query "all Pending orders, newest first"

aws dynamodb query \
  --table-name Orders \
  --index-name StatusIndex \
  --key-condition-expression "#s = :status" \
  --expression-attribute-names '{"#s": "Status"}' \
  --expression-attribute-values '{":status": {"S": "Pending"}}' \
  --no-scan-index-forward   # descending sort key order = newest first

DynamoDB Streams

A time-ordered stream of item-level changes (insert/update/delete) in your table. Retained for 24 hours. Used with Lambda to react to data changes (send email when order status changes, sync to another table, audit log, real-time analytics).
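
A sketch of a Lambda handler reacting to order status changes from a stream. It assumes the stream is configured with the NEW_AND_OLD_IMAGES view type and reuses the Orders table attributes from the earlier example; the handler returns the notifications it would send instead of calling an email service, and the sample event is trimmed to the fields used.

```python
def handler(event, context=None):
    """Collect a notification for every record whose Status attribute changed."""
    notifications = []
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":   # ignore INSERT and REMOVE
            continue
        change = record["dynamodb"]
        old_status = change.get("OldImage", {}).get("Status", {}).get("S")
        new_status = change.get("NewImage", {}).get("Status", {}).get("S")
        if old_status != new_status:
            notifications.append({
                "orderId": change["Keys"]["OrderID"]["S"],
                "from": old_status,
                "to": new_status,
            })
    return notifications

sample_event = {
    "Records": [
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {"UserID": {"S": "user123"}, "OrderID": {"S": "order002"}},
                "OldImage": {"Status": {"S": "Shipped"}},
                "NewImage": {"Status": {"S": "Delivered"}},
            },
        },
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"UserID": {"S": "user456"}, "OrderID": {"S": "order004"}},
                "NewImage": {"Status": {"S": "Pending"}},
            },
        },
    ]
}
print(handler(sample_event))
```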

DynamoDB Global Tables

Multi-region, multi-active (all regions accept reads AND writes). DynamoDB handles conflict resolution (last-writer-wins). Near-zero RPO/RTO for region failure. Used for: globally distributed apps where users in each region write and read data locally.

DynamoDB Accelerator (DAX)

In-memory cache specifically for DynamoDB. Read latency drops from milliseconds to microseconds. API-compatible with DynamoDB: you use the DAX client SDK and point it at the DAX cluster endpoint instead of the DynamoDB endpoint. Best for: read-heavy apps, repeated reads of same items, caching leaderboards/hot items. Not useful for write-heavy workloads or data that changes frequently.

DynamoDB Design Tip - Single Table Design: In DynamoDB, the access pattern drives the data model, not the other way around (unlike SQL). Many experienced DynamoDB users put ALL entities in a single table with composite keys. E.g., PK="USER#user123", SK="PROFILE" for profile; PK="USER#user123", SK="ORDER#2024-01-15" for orders. This avoids expensive joins (DynamoDB doesn't have joins) and keeps related data in the same partition.
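
The composite-key idea can be tried without a table. In the sketch below an in-memory list stands in for the table and `query` imitates a DynamoDB Query with an optional begins_with condition on the sort key; the helper names and sample items are illustrative.

```python
def profile_key(user_id: str) -> dict:
    return {"PK": f"USER#{user_id}", "SK": "PROFILE"}

def order_key(user_id: str, order_date: str) -> dict:
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_date}"}

# One "table" holding two entity types
items = [
    {**profile_key("user123"), "Name": "Asha"},
    {**order_key("user123", "2024-01-15"), "Total": 299.99},
    {**order_key("user123", "2024-02-03"), "Total": 49.99},
    {**order_key("user456", "2024-01-20"), "Total": 799.00},
]

def query(table: list, pk: str, sk_prefix: str = "") -> list:
    """Imitate Query: same partition key, sort key begins_with prefix, sorted by SK."""
    hits = [i for i in table if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda i: i["SK"])

# Profile and all orders for one user in a single partition read
print(query(items, "USER#user123"))
# Only that user's orders
print(query(items, "USER#user123", "ORDER#"))
```

Because both entity types share the partition key, one query returns a user's profile together with their orders, which is the job a join would do in SQL.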
NoSQL - Equivalents
GCP

Firestore (document database, like DynamoDB but more flexible querying, real-time sync) | Bigtable (wide-column NoSQL for massive analytics/IoT, like DynamoDB at petabyte scale for time-series/analytics, used by Google internally).

Azure

Azure Cosmos DB: Multi-model NoSQL (can use SQL, MongoDB, Cassandra, Table, Gremlin APIs). Has global distribution with 99.999% SLA. Cosmos DB for NoSQL is closest to DynamoDB but with richer querying. Cosmos DB is Azure's flagship database and is more flexible than DynamoDB in query capabilities.

Azure-Only

Azure Cosmos DB multi-model API: One service with MongoDB API compatibility, Cassandra API, Gremlin (graph) API, etc. If you have an existing MongoDB or Cassandra app, you can point it at Cosmos DB with minimal changes. AWS would require separate DocumentDB (MongoDB-compatible) or Keyspaces (Cassandra-compatible) services.

ElastiCache In-Memory Caching

What is ElastiCache?

Managed in-memory data store. Two engines: Redis and Memcached. Used to cache frequently accessed data, reducing database load, improving response times from seconds to milliseconds. Common pattern: check cache first → cache hit? return instantly. Cache miss? read from DB, write to cache, return.

Feature | Redis | Memcached
Data structures | Strings, Hashes, Lists, Sets, Sorted Sets, Pub/Sub, Streams, Geospatial | Strings only
Persistence | Optional (RDB snapshots, AOF log) | No persistence (pure cache)
Replication | Master-replica, Multi-AZ | No replication
Clustering | Redis Cluster (sharding) | Multi-node (simpler sharding)
Lua scripting | Yes | No
Use case | Sessions, leaderboards, pub/sub, real-time analytics, queues, rate limiting | Simple object caching, stateless horizontal scaling
Choose Redis almost always. Redis supports everything Memcached does, plus persistence, replication, and rich data structures. Memcached's only advantage: multi-threaded (better on very large nodes). In practice, most use cases → Redis.
# Profile caching example (cache-aside with Redis via ElastiCache):
import redis, json
r = redis.Redis(host='my-cache.abc123.ng.0001.apse1.cache.amazonaws.com', port=6379)

def get_user_profile(user_id):
    # Try cache first
    cached = r.get(f'user:{user_id}')
    if cached:
        return json.loads(cached)  # Cache HIT: sub-millisecond response

    # Cache MISS: query database
    # (`db` is a placeholder for your app's database handle; it must return a JSON-serializable dict)
    profile = db.query("SELECT * FROM users WHERE id = %s", user_id)
    r.setex(f'user:{user_id}', 300, json.dumps(profile))  # Cache 5 min
    return profile
Managed Cache - Equivalents
GCP

Cloud Memorystore: Managed Redis and Memcached. Same concepts. Fully compatible with open-source Redis/Memcached clients. Redis Cluster mode available.

Azure

Azure Cache for Redis: Managed Redis. Tiers: Basic (single node), Standard (replication), Premium (clustering, persistence, VNet injection), Enterprise (Redis Enterprise modules like RedisJSON, RediSearch).

AWS-M6

Monitoring & Observability

CloudWatch Metrics, Logs & Alarms

What is CloudWatch?

CloudWatch is AWS's primary observability service: a unified platform for metrics, logs, dashboards, alarms, and events. Almost every AWS service automatically sends metrics to CloudWatch. It's your first stop for understanding what's happening in your AWS environment.

CloudWatch Metrics

Time-series data points published by AWS services and your own apps. Organized by Namespace (e.g., AWS/EC2) → Metric Name (e.g., CPUUtilization) → Dimension (e.g., InstanceId=i-0abc123).

Default EC2 Metrics (every 5 min, free):

CPUUtilization, NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, StatusCheckFailed

Detailed Monitoring (every 1 min, extra cost):

Same metrics but at 1-minute resolution. Needed for faster Auto Scaling reactions.

Custom Metrics:

Publish your own metrics from app code or scripts. Standard resolution = 1 min. High resolution = 1 second (extra cost). Example: publish queue depth, active sessions, order processing rate.

# Publish custom metric (CLI):
aws cloudwatch put-metric-data \
  --namespace "MyApp/OrderService" \
  --metric-name "OrdersPerMinute" \
  --value 142 \
  --dimensions Environment=Production,Service=OrderService

# Publish from Python:
import boto3
cw = boto3.client('cloudwatch')
cw.put_metric_data(
    Namespace='MyApp/OrderService',
    MetricData=[{
        'MetricName': 'ActiveConnections',
        'Value': 89,
        'Unit': 'Count'
    }]
)

CloudWatch Logs

# CloudWatch Logs Insights query: find all errors in last hour:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Find slow Lambda invocations (>5 seconds):
filter @type = "REPORT"
| parse @message "Duration: * ms" as duration
| filter duration > 5000
| stats avg(duration), max(duration), count() by bin(5m)

CloudWatch Alarms

Watches a metric and transitions between states based on thresholds. States: OK, ALARM, INSUFFICIENT_DATA. When ALARM state: send SNS notification, trigger Auto Scaling action, stop/reboot/terminate EC2, invoke Lambda.

# CLI: Create alarm that alerts when CPU > 80% for 5 consecutive minutes.
# --period 60 = 1-minute periods (assumes detailed monitoring is enabled);
# --evaluation-periods 5 = 5 consecutive periods must breach.
# Comments cannot follow a trailing backslash, so they are kept up here.
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-EC2-prod" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --alarm-actions arn:aws:sns:ap-south-1:123456:AlertsTopic \
  --ok-actions arn:aws:sns:ap-south-1:123456:AlertsTopic
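
How the three alarm states arise from the period and evaluation-period settings can be sketched in plain Python. This is a simplification: real CloudWatch alarms also support "M out of N" datapoints and configurable treatment of missing data, and the function name is illustrative.

```python
def alarm_state(datapoints: list, threshold: float = 80.0, evaluation_periods: int = 5) -> str:
    """Evaluate the most recent periods of a metric against a threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"        # not enough periods collected yet
    if all(value > threshold for value in recent):
        return "ALARM"                    # every recent period breached
    return "OK"

# One average CPU value per 1-minute period
print(alarm_state([40, 55]))                      # too few datapoints
print(alarm_state([60, 85, 90, 88, 92, 95]))      # last 5 all above 80
print(alarm_state([85, 90, 88, 92, 79]))          # one period dipped below
```

A single period at or below the threshold resets the streak, which is why short spikes do not trigger this alarm.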

CloudWatch Agent

Install the CloudWatch agent on EC2 (or on-prem servers) to collect metrics not available by default: memory usage (RAM), disk usage, swap, process-level metrics. Also collects logs from any file (system logs, app logs, custom log files) and ships them to CloudWatch Logs.

CloudWatch Dashboards

Custom visualizations β€” widgets showing metrics graphs, numbers, text, alarms. Share dashboards across accounts. Create one per team/service. JSON-configurable. Free to view, charged per dashboard per month.

Monitoring β€” Equivalents
GCP

Cloud Monitoring (metrics, dashboards, alerting) | Cloud Logging (logs, like CloudWatch Logs) | Cloud Trace (distributed tracing) | Cloud Profiler (CPU/memory profiling). These are unified under Google Cloud Observability (formerly Stackdriver).

Azure

Azure Monitor (umbrella service for all observability: metrics, logs, alerts; like CloudWatch) | Log Analytics Workspace (centralized log store with Kusto query language, richer querying than CloudWatch Logs Insights) | Application Insights (APM for web apps, no direct AWS equivalent natively).

Azure-Only

Application Insights: Full APM (Application Performance Monitoring) natively integrated into Azure Monitor. Tracks request rates, failure rates, response times, dependency calls, exceptions, user sessions. The AWS equivalent would be X-Ray + custom CloudWatch metrics, which is more complex to set up.

CloudTrail / X-Ray / EventBridge

AWS CloudTrail: API Audit Logging

Records API calls made to your AWS account (via Console, CLI, SDK, or other services): who did what, when, and from where. It is the forensic record of your AWS account. Event History keeps the last 90 days of management events by default; create a Trail to deliver logs to S3 for longer retention (and to capture data events such as S3 object-level calls, which are not recorded by default).

# CloudTrail log entry example: someone deleted an S3 bucket.
# sourceIPAddress is the IP that made the call.
{
  "eventTime": "2024-01-15T14:32:01Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteBucket",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "john.dev",
    "arn": "arn:aws:iam::123456789:user/john.dev"
  },
  "sourceIPAddress": "103.210.45.67",
  "requestParameters": {"bucketName": "prod-data-bucket"}
}
Enable CloudTrail: First Thing, Always

If you get hacked, CloudTrail logs tell you WHAT was done. Without it, you're blind. Enable a multi-region trail on day 1 and send it to S3 with MFA Delete enabled so attackers can't delete the logs. Also enable CloudTrail log file integrity validation.
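
As a rough illustration of using these logs for forensics, the sketch below scans a CloudTrail log file (JSON with a top-level "Records" array) for a given event name. The helper `find_events` and the second sample user `jane.ops` are hypothetical; real log files delivered to S3 are gzip-compressed and would need to be downloaded and decompressed first.

```python
import json
from typing import Dict, List


def find_events(log_text: str, event_name: str) -> List[Dict[str, str]]:
    """Return when/who/where for every record whose eventName matches."""
    records = json.loads(log_text).get("Records", [])
    hits = []
    for rec in records:
        if rec.get("eventName") != event_name:
            continue
        identity = rec.get("userIdentity", {})
        hits.append({
            "time": rec.get("eventTime", ""),
            "who": identity.get("arn", identity.get("type", "unknown")),
            "ip": rec.get("sourceIPAddress", ""),
        })
    return hits


# Illustrative sample data shaped like the log entry shown earlier
sample = json.dumps({"Records": [
    {"eventTime": "2024-01-15T14:32:01Z", "eventName": "DeleteBucket",
     "userIdentity": {"type": "IAMUser",
                      "arn": "arn:aws:iam::123456789:user/john.dev"},
     "sourceIPAddress": "103.210.45.67"},
    {"eventTime": "2024-01-15T14:35:10Z", "eventName": "ListBuckets",
     "userIdentity": {"type": "IAMUser",
                      "arn": "arn:aws:iam::123456789:user/jane.ops"},
     "sourceIPAddress": "103.210.45.68"},
]})

for hit in find_events(sample, "DeleteBucket"):
    print(hit["time"], hit["who"], hit["ip"])
```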

AWS X-Ray: Distributed Tracing

Traces requests as they flow through distributed systems (multiple services, Lambda, DynamoDB, RDS, external APIs). Generates a service map showing which services call which. Identifies bottlenecks and errors. Essential for microservices: when a user's request goes through 5 services and fails, X-Ray shows exactly which service caused the error and how long each took.

Amazon EventBridge: Event Bus

A serverless event bus for routing events between AWS services, your own apps, and SaaS partners. Think of it as AWS's "if this then that" at scale. Events go to EventBridge → rules match events → targets receive events.

# EventBridge rule: "When EC2 instance state changes to STOPPED, run a Lambda"
Event Pattern:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["stopped"]}
}
Target: Lambda function → notify team on Slack

# Another example: Run DB backup Lambda every day at 2AM IST
Schedule: cron(30 20 * * ? *)   # 20:30 UTC = 02:00 IST
Target: Lambda function → trigger RDS snapshot

EventBridge replaced CloudWatch Events. It has a default event bus (AWS events), custom event buses (your app events), and partner event buses (SaaS integrations like Datadog, PagerDuty).
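
To make the "rules match events" step concrete, here is a minimal sketch of exact-value pattern matching in Python. The function `matches` is hypothetical and covers only the matching style used in the rule above (a list of allowed values per field, nested objects for `detail`); real EventBridge patterns also support prefix, numeric, anything-but and other operators that this ignores.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """True if every field in the pattern is satisfied by the event.

    A list in the pattern means "the event value must equal one of these";
    a nested dict means "apply the same check one level down".
    Fields present in the event but absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc123", "state": "stopped"},
}
running_event = {**stopped_event,
                 "detail": {"instance-id": "i-0abc123", "state": "running"}}

print(matches(pattern, stopped_event), matches(pattern, running_event))
```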

AWS-M7

DevOps & Automation Tools

CloudFormation IaC: Native AWS

What is CloudFormation?

AWS's native IaC service. Define your AWS infrastructure in YAML or JSON templates. CloudFormation creates, updates, and deletes resources as a Stack. Resources in a stack are managed together: create the stack → all resources created. Delete the stack → all resources deleted.

Template Structure

AWSTemplateFormatVersion: '2010-09-09'
Description: 'My web app infrastructure'

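Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues: [t3.micro, t3.small, t3.medium]
  Environment:  # referenced by the IsProd condition below
    Type: String
    Default: development
    AllowedValues: [development, production]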

Mappings:
  RegionAMI:
    ap-south-1:
      AMI: ami-0c55b159cbfafe1f0  # Amazon Linux 2

Conditions:
  IsProd: !Equals [!Ref Environment, production]

Resources:
  # The ONLY required section
  MyBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'my-app-${AWS::AccountId}'
      VersioningConfiguration:
        Status: Enabled

  MyEC2:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMI, !Ref AWS::Region, AMI]
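      # Assumes an AWS::IAM::InstanceProfile resource named EC2InstanceProfile
      # is declared elsewhere in this template (omitted here for brevity).
      IamInstanceProfile: !Ref EC2InstanceProfile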
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-web'

Outputs:
  BucketName:
    Value: !Ref MyBucket
    Export:
      Name: !Sub '${AWS::StackName}-BucketName'

CloudFormation Key Concepts

Change Sets

Before updating a stack, create a Change Set to preview what CloudFormation will actually do: which resources will be added, modified, or deleted. Always use Change Sets in production: a resource replacement (e.g., changing an RDS property that requires replacement) means data loss if you're not prepared.

Stack Sets

Deploy CloudFormation stacks across multiple AWS accounts and regions in one operation. Managed from a central admin account. Used for: applying security baseline to all accounts in an org, deploying global app infrastructure to 5 regions at once.

Drift Detection

Detects when actual resource configuration differs from CloudFormation's expected state (someone made a manual console change). Drift detection identifies what changed so you can fix it. Best practice: make all changes through CloudFormation only and treat the console as read-only for production.
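
Conceptually, drift detection is a comparison between the property values declared in the template and the values found on the live resource. The sketch below models that idea with plain dictionaries; `detect_drift` and the sample property values are hypothetical, and it compares only properties that the template sets explicitly.

```python
from typing import Any, Dict, List, Tuple


def detect_drift(expected: Dict[str, Any],
                 actual: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (property, expected_value, actual_value) for each difference.

    Only properties present in the template (`expected`) are checked;
    extra properties on the live resource are ignored in this sketch.
    """
    drift = []
    for key in sorted(expected):
        live_value = actual.get(key)
        if expected[key] != live_value:
            drift.append((key, expected[key], live_value))
    return drift


template_props = {"InstanceType": "t3.micro", "Monitoring": False}
live_props = {"InstanceType": "t3.large", "Monitoring": False,
              "EbsOptimized": True}

for prop, want, got in detect_drift(template_props, live_props):
    print(f"{prop}: template={want!r} live={got!r}")
```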

Helper Scripts (cfn-signal, cfn-init)

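cfn-signal: Allows an EC2 instance to signal CloudFormation that it has finished initializing (bootstrapping complete). CloudFormation waits for the signal (configured with a CreationPolicy on the resource, or with a separate WaitCondition) before marking the resource as created. Without this, CloudFormation marks the EC2 instance as created as soon as it launches, even if your app isn't ready yet. cfn-init: Reads the AWS::CloudFormation::Init metadata attached to the resource and applies it (installs packages, writes files, starts services).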

# In EC2 UserData:
#!/bin/bash
/opt/aws/bin/cfn-init -v --stack my-stack --resource MyEC2 --region ap-south-1
# ... install and configure app ...
/opt/aws/bin/cfn-signal -e $? --stack my-stack --resource MyEC2 --region ap-south-1
CodePipeline / CodeBuild / CodeDeploy CI/CD on AWS

The AWS CI/CD Toolchain

AWS Native CI/CD Pipeline
  GitHub / CodeCommit
          │ Code push
          ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                        AWS CodePipeline                             │
  │                                                                     │
  │  Stage 1: SOURCE         Stage 2: BUILD         Stage 3: DEPLOY     │
  │  ┌─────────────┐         ┌─────────────┐        ┌─────────────┐     │
  │  │  GitHub /   │         │ CodeBuild:  │        │ CodeDeploy: │     │
  │  │ CodeCommit  │──────►  │ - Install   │──────► │ - EC2/ECS/  │     │
  │  │  Webhook    │         │ - Test      │        │   Lambda    │     │
  │  └─────────────┘         │ - Build     │        │ - Blue/Green│     │
  │                          │ - Push ECR  │        │ - Canary    │     │
  │                          └─────────────┘        └─────────────┘     │
  │                               ↑                       ↑             │
  │                        buildspec.yml            appspec.yml         │
  └─────────────────────────────────────────────────────────────────────┘

CodeBuild

Fully managed build service. Runs your build commands in a container. Defined by buildspec.yml in your repo root. Scales automatically, so there are no build servers to manage. Charged per build minute.

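# buildspec.yml example (Node.js app → Docker → ECR)
# Assumes AWS_ACCOUNT_ID, IMAGE_NAME, ECR_URI (full repository URI) and
# CONTAINER_NAME are set as environment variables on the CodeBuild project.
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to ECR...
      # Kept on one line: a trailing backslash is not a line continuation inside a YAML list item
      - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
  build:
    commands:
      - echo Running tests...
      - npm ci
      - npm test
      - echo Building Docker image...
      - docker build -t $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker tag $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      - docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
      # Write the file that the artifacts section below expects
      - printf '[{"name":"%s","imageUri":"%s"}]' "$CONTAINER_NAME" "$ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION" > imagedefinitions.json
      - echo Build completed
artifacts:
  files:
    - imagedefinitions.json  # Tells the CodePipeline ECS deploy action which image to use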

CodeDeploy

Automates application deployments to EC2, Lambda, or ECS. Supports deployment strategies: in-place, blue/green, canary, linear. Defined by appspec.yml.

# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /dist
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_deps.sh
      timeout: 60
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 30
Auto Scaling Groups / Launch Templates

Auto Scaling Group (ASG)

An ASG maintains a fleet of EC2 instances. You define min/max/desired count. ASG continuously monitors health, replaces unhealthy instances automatically, and scales based on policies.

Launch Template vs Launch Configuration

Launch Template (modern, prefer this): Defines EC2 parameters (AMI, instance type, key pair, security groups, user data). Supports versioning, can specify multiple instance types, supports Spot + On-Demand mix. Launch Configuration (legacy, deprecated): Older, no versioning, only one instance type. Always use Launch Templates for new ASGs.

ASG Scaling Policies

Policy Type | How it works | Best for
Simple Scaling | Alarm triggers: add/remove N instances. Cooldown period before next action. | Rarely used now: slow response, blunt
Step Scaling | Different scaling magnitudes based on alarm severity. CPU 70-80%: add 1. CPU 80-90%: add 3. CPU >90%: add 5. | Variable load spikes with different intensities
Target Tracking | Keep a metric at a target value. "Keep average CPU at 60%": ASG figures out how many instances to add/remove. | Most common: easy to configure, handles scale-in/out automatically
Scheduled Scaling | Pre-set scaling at specific times. Scale out at 8AM, scale in at 10PM. | Predictable traffic patterns (business hours, weekly spikes)
Predictive Scaling | ML-based forecasting using historical data. Pre-scales before expected traffic increase. | Cyclical/recurring load patterns
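
The step scaling row can be expressed as a small lookup. This sketch uses the exact bands from the table (70-80% adds 1, 80-90% adds 3, above 90% adds 5); the function names are hypothetical, the handling of values that fall exactly on a boundary is an assumption, and the result is clamped to the group's min/max size as an ASG would do.

```python
from typing import List, Optional, Tuple

# (lower_bound, upper_bound, instances_to_add); upper_bound None = no limit.
# Assumption: lower bounds are inclusive, upper bounds exclusive.
STEPS: List[Tuple[float, Optional[float], int]] = [
    (70.0, 80.0, 1),
    (80.0, 90.0, 3),
    (90.0, None, 5),
]


def step_adjustment(cpu_percent: float) -> int:
    """Return how many instances to add for the given CPU utilization."""
    for lower, upper, add in STEPS:
        if cpu_percent >= lower and (upper is None or cpu_percent < upper):
            return add
    return 0  # below the first step: no scale-out


def new_desired_capacity(current: int, cpu_percent: float,
                         min_size: int, max_size: int) -> int:
    """Apply the step adjustment, then clamp to the group's size limits."""
    wanted = current + step_adjustment(cpu_percent)
    return max(min_size, min(max_size, wanted))


print(new_desired_capacity(current=4, cpu_percent=85, min_size=2, max_size=20))
```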

Mixed Instance Types & Spot

Launch Templates support specifying multiple instance types and a mix of On-Demand + Spot instances in an ASG. E.g., "run 2 On-Demand as baseline, fill capacity with Spot instances from this list: m5.xlarge, m5a.xlarge, m6i.xlarge." If a Spot instance is interrupted, the ASG tries to launch a replacement from another Spot pool in the list; it does not automatically fall back to On-Demand, which is why the On-Demand baseline matters. Major cost savings for stateless workloads.

# CDK example (simplified): Mixed instance ASG.
# Assumes `vpc` (ec2.Vpc) and `lt` (ec2.LaunchTemplate) are defined earlier in the stack.
from aws_cdk import aws_autoscaling as autoscaling, aws_ec2 as ec2

asg = autoscaling.AutoScalingGroup(self, "MyASG",
    vpc=vpc,
    min_capacity=2, max_capacity=20,
    mixed_instances_policy=autoscaling.MixedInstancesPolicy(
        instances_distribution=autoscaling.InstancesDistribution(
            on_demand_base_capacity=2,                    # Always keep 2 On-Demand
            on_demand_percentage_above_base_capacity=20,  # 20% On-Demand, 80% Spot above base
            # Prefer Spot pools with the most spare capacity (fewer interruptions), not the cheapest
            spot_allocation_strategy=autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED
        ),
        launch_template=lt,
        launch_template_overrides=[
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5a.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m6i.xlarge")),
        ]
    )
)
Auto Scaling: Equivalents
GCP

Managed Instance Groups (MIG) with Autoscaler. Uses Instance Templates (like Launch Templates). Supports scale out on CPU, LB capacity, custom metrics. Also has Spot VMs integration in MIGs.

Azure

Azure Virtual Machine Scale Sets (VMSS). Like ASG but Azure-flavored. Supports Flex (flexible orchestration) and Uniform orchestration modes. Auto-scale based on metrics or schedule. Spot instance support in VMSS.

SSM Session Manager / Systems Manager

AWS Systems Manager (SSM)

A suite of tools for managing EC2 instances (and on-prem servers) at scale. The SSM Agent runs on your instances and connects to the SSM service. Key features:

Session Manager

Browser-based or CLI shell access to EC2 instances with no SSH, no bastion host, no open inbound ports. Authentication via IAM. All sessions are logged to CloudWatch/S3. The modern way to access EC2. Significant security improvement over SSH.

# Start session (CLI) - no SSH keys needed
aws ssm start-session --target i-0abc123def456

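# Port forwarding via SSM to a remote host (e.g., connect to RDS in a private subnet).
# The RDS endpoint below is a placeholder; the target instance must be able to reach it.
aws ssm start-session \
  --target i-0abc123def456 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.example.ap-south-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'
# Now: psql -h localhost -p 5432 -U admin mydb  (via SSM tunnel)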

Parameter Store

Already covered in security section: stores config and secrets. Accessible from EC2 instances, Lambda, ECS tasks via SSM API.

Run Command

Execute shell commands on one or multiple EC2 instances without SSH. Run across hundreds of instances using tags. Output captured in CloudWatch. Good for: emergency patches, config changes, one-off maintenance tasks.

# Run command on all tagged "Environment=Production" instances
aws ssm send-command \
  --targets "Key=tag:Environment,Values=Production" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["yum update -y kernel", "reboot"]'

Patch Manager

Automates OS patching across your fleet. Define patch baselines (which patches to approve), maintenance windows (when to patch), and patch groups. Generates compliance reports. Integrates with Run Command to actually apply patches.

State Manager

Ensures your instances are in a defined state (software installed, config files present, services running). Like configuration management (Ansible/Chef) but AWS-native. Uses SSM Documents to define state.

AWS-M8

Messaging & Async Services

SQS / SNS / EventBridge Decoupling Patterns

Why Async Messaging?

In a synchronous architecture, Service A calls Service B directly. If B is slow or down → A is slow or failing too. With async messaging, A puts a message in a queue and returns immediately. B processes when it can. They're decoupled: A doesn't care about B's state.

Sync vs Async Architecture
  SYNC:  Order Service ──HTTP──► Inventory Service ──HTTP──► Notification Service
         (if either downstream fails → order fails, user gets error)

  ASYNC: Order Service ──► SQS Queue ◄── Inventory Service (processes when ready)
              │
              └──► SNS Topic ──fan-out──► Email Notification Lambda
                                     └──► Push Notification Lambda
                                     └──► Analytics Lambda

Amazon SQS: Simple Queue Service

Fully managed message queue. Producer sends messages, consumer polls and processes them, deletes after processing. Guarantees at-least-once delivery (the same message might be delivered more than once, so make consumers idempotent).

Queue Types

Standard Queue

Nearly unlimited throughput. Messages are delivered at least once, with best-effort ordering (order is not guaranteed). Best for: high-throughput workloads where some duplicate processing is OK. Default choice.

FIFO Queue

Exactly-once processing: messages are delivered in strict order within a message group, and duplicates sent within the 5-minute deduplication window are dropped. Throughput: 300 msg/s without batching, 3,000 msg/s with batching. Best for: financial transactions, order processing, any use case where order and deduplication matter.

Key SQS Concepts

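# SQS Producer (send message to a FIFO queue):
import boto3, json

# FIFO queue names must end in .fifo
QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456789/OrderQueue.fifo'
sqs = boto3.client('sqs')
sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'orderId': 'ORD-2024-001',
        'userId': 'user123',
        'total': 299.99
    }),
    MessageGroupId='user123',              # FIFO only: same group = in-order
    MessageDeduplicationId='ORD-2024-001'  # FIFO only: required unless content-based deduplication is enabled
)

# SQS Consumer (receive and delete):
def process(order):
    # Placeholder handler; replace with real business logic
    print('Processing', order['orderId'])

response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # Long polling
    VisibilityTimeout=60
)
for msg in response.get('Messages', []):
    process(json.loads(msg['Body']))
    # Delete only after successful processing; otherwise the message
    # becomes visible again once the visibility timeout expires
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])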
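
Because SQS delivers at least once, the consumer has to tolerate seeing the same message twice. The sketch below shows one common approach, keyed on a business identifier. It is a minimal in-memory illustration: `consume` and `seen_ids` are hypothetical names, and a real system would keep the processed-ID set in a durable store (for example a database table with a uniqueness constraint) rather than a Python set.

```python
import json
from typing import Callable, Dict, List, Set


def consume(messages: List[Dict[str, str]],
            handler: Callable[[dict], None],
            seen_ids: Set[str]) -> int:
    """Run handler once per orderId; return how many messages were handled."""
    handled = 0
    for msg in messages:
        body = json.loads(msg["Body"])
        key = body["orderId"]      # business-level idempotency key
        if key in seen_ids:
            continue               # duplicate delivery: skip the side effect
        handler(body)
        seen_ids.add(key)
        handled += 1
    return handled


# The same message delivered twice, as can happen with a Standard queue
duplicate_batch = [
    {"Body": json.dumps({"orderId": "ORD-2024-001", "total": 299.99})},
    {"Body": json.dumps({"orderId": "ORD-2024-001", "total": 299.99})},
]
charged: List[dict] = []
seen: Set[str] = set()
print(consume(duplicate_batch, charged.append, seen), len(charged))
```

In a real consumer the duplicate would still be deleted from the queue; only the side effect (charging the customer, updating stock) is skipped.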

Amazon SNS: Simple Notification Service

Pub/Sub messaging. Publishers send to a Topic. Subscribers receive all messages published to that topic. Fan-out: one message → many subscribers. Supports: SQS, Lambda, HTTP/HTTPS, email, SMS, mobile push (APNs, FCM).

# SNS Fan-out: Order created → notify multiple systems
SNS Topic: "OrderCreated"
├── SQS: InventoryQueue  → Inventory Lambda (update stock)
├── SQS: ShippingQueue   → Shipping Lambda (create shipment)
├── Lambda: EmailSender  → Send confirmation email
└── Lambda: Analytics    → Record to analytics DB

# Each subscriber independently processes the same event

SNS → SQS Fan-out Pattern

Best practice is to use SNS + SQS together. SNS fans out to multiple SQS queues. Each queue has its own consumer. This gives you: fan-out, durable storage (SQS retains messages if consumer is down), and independent scaling of each consumer. Don't fan out SNS directly to Lambda in high-throughput scenarios; use SQS as a buffer.

Amazon Kinesis: Real-Time Streaming

For high-volume, real-time data streaming (millions of events/sec). Unlike SQS (a queue: each message is consumed once, then deleted), Kinesis Data Streams retains records for 24 hours by default (configurable up to 365 days), and multiple consumers can read the same stream.

Feature | SQS | SNS | Kinesis Data Streams
Pattern | Queue (consume once) | Pub/Sub (fan-out) | Stream (replay, multiple consumers)
Retention | 14 days max | No retention | 1-365 days
Ordering | FIFO (with FIFO queue) | No guarantee | Ordered per shard
Replay | No | No | Yes (replay from any position)
Throughput | Unlimited | Unlimited | 1 MB/s write per shard
Use case | Task queues, job processing | Notifications, fan-out | Real-time analytics, event sourcing
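
"Ordered per shard" works because every record carries a partition key, and Kinesis maps the MD5 hash of that key (a 128-bit integer) onto the hash key range owned by a shard, so records with the same key always land on the same shard. The sketch below illustrates the mapping under the assumption that the hash space is split evenly between shards; `shard_for` is a hypothetical helper, and real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index (0-based), assuming equal ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


# Same key, same shard: all of one user's events stay in order
for key in ("user123", "user123", "user456"):
    print(key, "-> shard", shard_for(key, shard_count=4))
```

This is also why a single "hot" partition key limits throughput: all of its records compete for one shard's capacity.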
Messaging Services: Equivalents
GCP

Cloud Pub/Sub: Combines SQS + SNS in one service (pub/sub model with at-least-once delivery, pull or push subscriptions). Also Cloud Tasks (task queues, more like SQS: delayed execution, rate limits, HTTP targets).

Azure

Azure Service Bus (enterprise messaging, like SQS/SNS with richer features: sessions, dead-lettering, transactions, topic subscriptions = fan-out) | Azure Event Grid (event routing, like EventBridge) | Azure Event Hubs (high-throughput streaming, like Kinesis Data Streams; compatible with the Apache Kafka protocol).

Azure-Only

Azure Event Hubs Kafka compatibility: Azure Event Hubs has a Kafka-compatible API. Migrate existing Kafka producers/consumers to Event Hubs with minimal code changes. AWS offers Amazon MSK (Managed Streaming for Apache Kafka), but it's a full Kafka cluster, which is heavier. Event Hubs is lighter and Kafka-compatible at the same time.

⚑ Revision: Cloud Concepts
CC-R1

Cloud Fundamentals: Quick Review

1 What is Cloud & Service Models
Cloud = renting computing over the internet. NIST 5 traits: On-demand self-service, Broad network access, Resource pooling (multi-tenancy), Rapid elasticity, Measured service (pay-per-use).
CAPEX vs OPEX: Traditional IT = CAPEX (buy hardware upfront). Cloud = OPEX (pay monthly). Cloud → no wasted capital, faster iteration.
IaaS: You manage OS up. Provider: hardware + virtualization. AWS EC2, GCP GCE, Azure VMs.
PaaS: You manage app + data. Provider: everything else. AWS Elastic Beanstalk/Lambda, GCP App Engine/Cloud Run, Azure App Service.
SaaS: Just use the app. Gmail, Slack, Salesforce, AWS WorkMail.
Deployment models: Public (AWS/GCP/Azure), Private (on-prem, OpenStack), Hybrid (public + on-prem), Multi-Cloud (multiple public providers). Multi-cloud ≠ Hybrid cloud.
2 Shared Responsibility & Global Infrastructure
Shared Responsibility: AWS = Security OF the cloud (hardware, DCs, hypervisor). You = Security IN the cloud (IAM, OS patches, data, encryption, firewall config).
More managed service = AWS takes more responsibility. EC2 (you patch OS) → Lambda (AWS manages OS) → SaaS (AWS manages everything).
Region: Geographically isolated cluster of DCs. Independent of each other. 33+ regions. Data does NOT auto-replicate across regions.
AZ: One or more DCs with independent power/network within a Region. Connected by low-latency (<1ms) private fiber. Deploy across 2+ AZs for HA.
Edge Locations: 600+ CDN cache servers globally (CloudFront, Route 53, Shield). More numerous than regions, and closer to end users.
Local Zones: AWS compute in specific cities (sub-10ms). Wavelength: AWS in 5G networks. Outposts: AWS rack in YOUR datacenter.
Region selection factors: 1) Compliance/data residency (non-negotiable), 2) Latency to users, 3) Service availability, 4) Pricing. us-east-1 = new services first, usually cheapest.
Azure Paired Regions: Azure-specific. Each region is paired with another for automatic DR. AWS has no equivalent (you manually configure cross-region replication).
CC-R2

HA, Scaling & DR: Quick Review

3 HA vs FT vs DR & Scaling
HA (High Availability): System recovers quickly from failure. Multi-AZ = HA. 99.9% → 8.76 hrs downtime/yr. 99.99% → 52 min/yr.
Fault Tolerance (FT): Zero downtime even on failure. Harder, more expensive. Active-active multi-region setups approach FT.
RPO = max acceptable data loss (how much data can you afford to lose?). RTO = max acceptable downtime (how fast must you recover?).
DR Strategies (cheapest to costliest, fastest RTO last): Backup & Restore → Pilot Light → Warm Standby → Active-Active (Multi-Site).
Vertical scaling = bigger server. Single point of failure, has hardware limit, often requires downtime. Horizontal scaling = more servers. Resilient, near-unlimited, needs stateless app.
Elasticity = auto scale up AND back down. Requires stateless apps. Store sessions in Redis/DynamoDB, not local server memory.
Azure Site Recovery (ASR): Azure's managed DR service, providing VM replication + automated failover plans. The closest AWS equivalent is AWS Elastic Disaster Recovery (DRS); beyond that you'd build with CloudFormation + Route 53 + scripting.
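
The downtime figures in the HA line follow from simple arithmetic: allowed downtime = hours in a year × (1 − availability). A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Maximum downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours/yr ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why moving from 99.9% to 99.99% usually requires architectural changes (multi-AZ, automated failover) rather than just better operations.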
CC-R3

Networking, Security & Modern Patterns: Quick Review

4 VPC, Load Balancing & CDN
VPC = logically isolated private network. Public subnet (has IGW route) vs Private subnet (no direct internet route). Route table controls where traffic goes.
Security Group: Instance-level, stateful, allow-only rules. NACL: Subnet-level, stateless (both directions), allow AND deny, numbered rule order. Explicit Deny in NACL overrides SG allow.
NAT Gateway: Allows private instances outbound internet. NOT inbound. Placed in PUBLIC subnet with EIP. Deploy one per AZ for HA.
IGW: VPC ↔ internet. Required for any public access. Free. One per VPC. VPN Gateway: VPC ↔ on-prem (IPsec over internet). Direct Connect: Dedicated private fiber, consistent bandwidth.
L4 LB (NLB): Routes by IP+port. Ultra-fast, static IP, non-HTTP. L7 LB (ALB): Routes by HTTP path/host/headers. Smart routing, microservices.
CDN: Cache static content at edge servers close to users. Cache hit = fast. Cache miss = fetch from origin, cache it. TTL controls freshness. Invalidation expires cache early.
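GCP VPC is global (one VPC spans all regions; each subnet is regional). AWS VPC is regional. Azure VNet = regional (like AWS). GCP firewall rules are defined at the VPC network level and targeted via network tags or service accounts (not attached per instance like SGs).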
5 Security Concepts, IaC & Modern Patterns
Authentication = who are you. Authorization = what can you do. Least Privilege = only grant what's needed. MFA = password + something you have.
Encryption at rest = data encrypted on disk (EBS, S3, RDS all support it). Encryption in transit = TLS/HTTPS for data moving over network. Both required for proper security posture.
Zero Trust: Trust nothing, verify everything. Even internal traffic authenticated. mTLS, service meshes (Istio), strict IAM = Zero Trust implementation.
IaC: Infrastructure defined as code. Declarative (CloudFormation, Terraform) vs Imperative (CDK, scripts). Versionable, reproducible, reviewable in PRs. Terraform = most popular, multi-cloud.
Serverless: No servers to manage. Event-driven. Pay per invocation. Scale to zero. Cold start problem (200ms-2s first invocation). Lambda = AWS, Cloud Functions = GCP, Azure Functions = Azure.
Containers vs VMs: Containers share host OS kernel (~50MB overhead, seconds to start). VMs have full OS (~1-2GB overhead, minutes to start). Docker = container standard. Kubernetes = orchestration standard.
CI/CD: CI = auto build+test on commit. CD = auto deploy to staging (manual prod). Continuous Deployment = auto deploy to prod. Blue/Green = instant rollback. Canary = gradual traffic shift. Feature flags = code-level rollout.
GCP Cloud Run: Serverless containers (any Docker image, scale to zero). More flexible than Lambda, no cold-start shim. Azure DevOps: All-in-one CI/CD + project management. Azure Container Apps: Serverless K8s-based containers like Cloud Run.
⚑ Revision: AWS Services
AWS-R1

Compute: Quick Review

1 EC2 & Lambda
EC2 = virtual machine. AMI = OS template. Instance families: t (burstable), m (general), c (compute), r (memory), i (storage), p/g (GPU). Graviton (ARM) = 20-40% cheaper than x86, so use it when possible.
EC2 Pricing: On-Demand (no commit) → Savings Plans/RI (1-3yr commit, up to 72% off) → Spot (up to 90% off, interruptible). Spot = use for fault-tolerant batch jobs, CI runners.
User Data = script runs on first boot (installs software). IMDS = query instance metadata from inside (169.254.169.254). Always use IMDSv2 (token-based).
Placement Groups: Cluster (same AZ, low latency, HPC) | Spread (max 7/AZ, different hardware, critical VMs) | Partition (large distributed apps like Kafka).
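EIP = static public IPv4 address. Public IPv4 addresses are billed hourly (since February 2024 this applies whether or not the EIP is attached), so release unused ones. Use LB DNS name instead for production. EBS = persistent block storage (attaches like hard drive). Instance Store = ephemeral, lost on stop/terminate, faster.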
Lambda: Serverless. Event-driven triggers: S3, API GW, SQS, DynamoDB Streams, EventBridge, SNS. Max 15 min timeout. 128MB-10GB memory. Cold start: first invocation ~200ms-2s. Provisioned Concurrency = keep warm.
Lambda execution role = IAM role Lambda assumes. Never put access keys in Lambda code; use the execution role. Layers = shared dependencies (max 5 per function). Reserved concurrency = caps a function's concurrency and protects other functions from being starved.
ECS: Task Definition (blueprint) → Task (running container) → Service (desired count + LB + rolling deploy) → Cluster. EKS: Managed K8s. Fargate: Serverless compute for ECS/EKS, with no EC2 management.
ECR = private Docker registry. Authenticate with aws ecr get-login-password | docker login. Scans images for CVEs. Image URI format: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
AWS-R2

Storage: Quick Review

2 S3, EBS, EFS
S3 = object storage. Bucket names are globally unique. Objects have keys (the "path"). NOT a filesystem: a flat namespace where "/" in key names gives the illusion of folders.
S3 Storage Classes: Standard → Standard-IA (30-day min) → Glacier Instant → Glacier Flexible (minutes-hours retrieval) → Glacier Deep Archive (12hr retrieval). Intelligent-Tiering = auto-moves between tiers.
S3 Versioning: Once enabled, can only suspend (not disable). Delete = adds delete marker (old versions remain). Lifecycle rules = auto-transition or expire objects.
S3 Replication: CRR (cross-region, for DR/compliance) and SRR (same-region). Requires versioning. Async. Does NOT replicate existing objects by default, only new uploads after enabling (use S3 Batch Replication for existing objects).
Pre-signed URLs: Temporarily grant access to private S3 objects. Backend generates signed URL with expiry. User downloads directly from S3. Also supports direct-upload from browser to S3 (pre-signed PUT).
EBS: Block storage, attach to ONE EC2 (usually). gp3 = preferred over gp2 (separate IOPS from size, 20% cheaper). io2 Block Express = up to 256K IOPS for DBs. EBS volumes are AZ-specific.
EBS Snapshots: Incremental, stored in S3 (AWS-managed). Can copy cross-region. Create new volume from snapshot in any AZ. Always snapshot before risky changes.
EFS = shared NFS. Multiple EC2 instances across multiple AZs mount simultaneously. Auto-scales. Regional file systems store data redundantly across multiple AZs. ~3x more expensive than EBS per GB. Use for: CMS files, shared assets, ML training data across GPU nodes.
Storage comparison: S3 = objects over HTTP ($0.023/GB) | EBS = block/disk for single EC2 ($0.08/GB) | EFS = shared NFS for multiple EC2 ($0.30/GB) | Instance Store = ephemeral NVMe, included in instance price.
AWS-R3

Networking: Quick Review

3 VPC, Route 53, CloudFront, ELB
VPC Peering: Connect two VPCs, private IP routing. Non-transitive (A↔B, B↔C, A cannot reach C). Transit Gateway: Hub-and-spoke, transitive, connects N VPCs + on-prem. Use TGW over VPC peering at scale.
VPC Endpoints: Gateway (free, S3/DynamoDB only, route table entry) | Interface/PrivateLink (ENI in subnet, charged, supports 100+ services). Traffic stays on AWS network, so no NAT GW is needed for these services.
VPC Flow Logs: IP traffic metadata (not packet content) per ENI. Goes to CloudWatch/S3. Use to debug connectivity issues and security analysis. Look for REJECT entries to find blocked connections.
Route 53 Routing Policies: Simple | Weighted (A/B, gradual migration) | Latency (lowest latency region) | Failover (primary/secondary, health-checked) | Geolocation (country/continent) | Geoproximity (distance + bias) | Multivalue (up to 8 healthy IPs).
Alias record: AWS-specific CNAME substitute. Can be used on root domain (zone apex). Points to AWS resources. FREE queries (unlike CNAME). Use Alias for ALB, CloudFront, S3 website endpoints.
CloudFront: Distribution = config. Origin = S3/ALB/EC2. Cache Behavior = path pattern rules. OAC = CloudFront-only S3 access (modern, replaces OAI). Lambda@Edge = full Lambda at PoPs. CloudFront Functions = lightweight JS at edge.
ALB vs NLB: ALB = L7, path/host/header routing, HTTPS termination, microservices. NLB = L4, TCP/UDP, static IP, millions RPS, ultra-low latency, gaming/IoT. Both do health checks and AZ distribution.
Site-to-Site VPN: IPsec over internet. Quick to set up (hours). Up to ~1.25Gbps. Direct Connect: Private fiber. Weeks to set up. 1-100Gbps, consistent. DX + VPN as backup = best practice.
AWS-R4

IAM & Security: Quick Review

4 IAM, KMS, Secrets, WAF
IAM: Users (permanent creds) | Groups (users only, no roles) | Roles (temp creds, assumed by services/users/cross-account) | Policies (JSON permission docs). Global service, free.
Policy evaluation: Explicit DENY wins above all → SCP → Resource policy → Permission boundary → Identity policy. Default = DENY everything. Must have explicit ALLOW.
Policy types: AWS Managed (pre-built) | Customer Managed (custom, reusable) | Inline (embedded in one identity, avoid) | Resource-based (on S3/SQS/Lambda, enables cross-account) | SCP (org-wide ceiling) | Permission Boundary (max ceiling for a role).
IAM Roles for services: EC2 assumes a role → credentials via IMDS (169.254.169.254). Lambda execution role → auto-injected. NEVER hardcode access keys in code. Use roles. If code is in Git with hardcoded keys = critical security incident.
Cross-account: Account B creates role with trust policy allowing Account A. Account A STS AssumeRole → gets temp creds for Account B. Least privilege on both sides.
KMS: Manages encryption keys. AWS Managed Keys (free, less control) vs Customer Managed Keys ($1/month, full control, cross-account). Envelope encryption: DEK encrypts data, CMK encrypts DEK.
Secrets Manager: Store secrets encrypted by KMS. Auto-rotation for RDS. $0.40/secret/month. SSM Parameter Store: Free for standard/SecureString. No auto-rotation. Good for config + non-rotating secrets. Use SM for rotating DB passwords, SSM for config.
WAF: Blocks OWASP Top 10, SQL injection, XSS, bad bots. Applied to CloudFront/ALB/API Gateway. Use AWS Managed Rule Groups. Rate-based rules for DDoS defense at L7.
Shield Standard: Free, automatic L3/L4 DDoS. Shield Advanced: $3K/month, L7 DDoS, Shield Response Team access (SRT, formerly DRT), cost protection. GuardDuty: Threat detection via ML on VPC Flow Logs + CloudTrail + DNS. Finds cryptomining, compromised creds, port scans.
Azure Key Vault = KMS + Secrets Manager + certificate management in one. Microsoft Entra ID (formerly Azure AD) = feature-rich identity (OAuth, SAML, Conditional Access, MFA), more than AWS IAM alone. Microsoft Sentinel = full SIEM/SOAR, no AWS equivalent.
AWS-R5

Databases: Quick Review

5 RDS, Aurora, DynamoDB, ElastiCache
RDS: Managed relational DB. Engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Aurora. AWS manages hardware, OS, patching, backups. You manage: schema, queries, scaling decisions.
RDS Multi-AZ: Synchronous standby replica in another AZ. Auto-failover ~60-120s. Standby is NOT readable. For read scale → Read Replicas (async, max 5 for MySQL/PostgreSQL, max 15 for Aurora).
RDS Proxy: Connection pooler. Strongly recommended for Lambda + RDS (1,000 concurrent Lambda invocations can open 1,000 connections and exhaust the DB). Reduces failover time. Improves connection reuse.
Aurora: AWS-proprietary. MySQL/PostgreSQL compatible. AWS claims up to 5x the throughput of MySQL and 3x that of PostgreSQL. 6 copies across 3 AZs. Failover ~30s (vs 60-120s for RDS Multi-AZ; readers share storage, so no data copy is needed on promote).
Aurora Serverless v2: Auto-scales in 0.5 ACU increments, seconds. Pay-per-second. Min to max ACU range. Aurora Global: Multi-region, <1s replication lag, promote secondary for DR (RPO <1s, RTO <1min).
DynamoDB: NoSQL key-value + document. Serverless. Single-digit ms at any scale. Table → Items → Attributes. Primary Key: PK alone (simple) or PK+SK (composite). Max item size: 400KB.
DynamoDB Capacity: On-Demand (pay per request, auto-scale) vs Provisioned (set RCU/WCU, cheaper at steady load). 1 RCU = 4KB strongly consistent read/s. 1 WCU = 1KB write/s.
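A worked example of the provisioned-capacity arithmetic above. Item sizes round up to the next 4KB (reads) or 1KB (writes). One detail not stated in the line above: an eventually consistent read costs half as much as a strongly consistent one. The function names are hypothetical.

```python
import math


def read_capacity_units(item_kb: float, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: each read costs ceil(item size / 4KB) units."""
    total = math.ceil(item_kb / 4) * reads_per_sec
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_kb: float, writes_per_sec: int) -> int:
    """WCUs needed: each write costs ceil(item size / 1KB) units."""
    return math.ceil(item_kb) * writes_per_sec


# 6KB items read 10 times/s: ceil(6/4) = 2 units per read
print(read_capacity_units(6, 10))                             # 20
print(read_capacity_units(6, 10, strongly_consistent=False))  # 10
# 2.5KB items written 5 times/s: ceil(2.5) = 3 units per write
print(write_capacity_units(2.5, 5))                           # 15
```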
DynamoDB GSI: Query by non-PK attributes. Own PK+SK, own capacity. Design data model around your access patterns FIRST (unlike SQL). Single-table design = all entities in one table with composite keys.
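Single-table design is easier to see with data. The sketch below is an in-memory stand-in, not the DynamoDB API: the `USER#`/`ORDER#` key prefixes and the `query` helper are hypothetical, and the helper only imitates a Query with a "begins with" condition on the sort key.

```python
# All entity types live in one table; the composite key encodes the entity.
items = [
    {"PK": "USER#42", "SK": "PROFILE", "name": "Ada"},
    {"PK": "USER#42", "SK": "ORDER#2024-03-02", "total": 12},
    {"PK": "USER#42", "SK": "ORDER#2024-01-15", "total": 30},
    {"PK": "USER#7", "SK": "PROFILE", "name": "Lin"},
]


def query(pk: str, sk_prefix: str = "") -> list[dict]:
    """Return items in one partition whose sort key starts with sk_prefix,
    ordered by sort key (as a partition's items are)."""
    matches = [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["SK"])


# Access pattern 1: "get a user's orders, oldest first"
orders = query("USER#42", "ORDER#")
print([o["SK"] for o in orders])  # ['ORDER#2024-01-15', 'ORDER#2024-03-02']

# Access pattern 2: "get a user's profile and orders in one request"
print(len(query("USER#42")))      # 3
```

The keys were chosen to serve these two access patterns; a new pattern (for example, "all orders on a given date") would typically need a GSI with its own PK+SK.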
DynamoDB Streams: Change stream (24hr retention). Triggers Lambda on insert/update/delete. Global Tables: Multi-region, multi-active writes. DAX: In-memory cache for DynamoDB, microsecond reads. Drop-in replacement (same API).
ElastiCache Redis: In-memory. Rich data types (hashes, sorted sets, pub/sub). Persistence optional. Replication + Multi-AZ. Sessions, leaderboards, rate limiting. Memcached: Simple caching, multi-threaded, no persistence, no replication. Choose Redis 95% of the time.
GCP Cloud Spanner: Global relational DB with horizontal write scaling; no AWS equivalent. Azure Cosmos DB: Multi-model NoSQL (MongoDB/Cassandra/Gremlin APIs in one service). Azure Cosmos DB for PostgreSQL = Citus distributed PostgreSQL.
AWS-R6

Monitoring, DevOps & Messaging: Quick Review

6 CloudWatch, CloudTrail, CI/CD & Messaging
CloudWatch Metrics: Time-series from AWS services. EC2 default: every 5 min (CPU, Network, Disk, Status). Detailed monitoring: 1 min (extra cost). Custom metrics: push your own (queue depth, sessions, etc.).
CloudWatch Logs: Log Groups → Log Streams → Log Events. Set retention per group. CloudWatch Logs Insights = query language (filter, parse, stats). CloudWatch Agent = install on EC2 to collect RAM/disk/custom log files.
CloudWatch Alarms: Watch metric → threshold → States: OK/ALARM/INSUFFICIENT_DATA. Actions: SNS notification, Auto Scaling, EC2 actions (stop/reboot). Composite alarms = AND/OR of multiple alarms.
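The three alarm states can be illustrated with a toy evaluator. This is a deliberate simplification: real CloudWatch alarms also support "M out of N" datapoints and configurable missing-data handling. Here a missing datapoint is represented as `None`, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]],
                threshold: float, periods: int) -> str:
    """State after looking at the most recent `periods` datapoints (periods >= 1)."""
    recent = list(datapoints[-periods:])
    if len(recent) < periods or any(d is None for d in recent):
        return "INSUFFICIENT_DATA"   # not enough data to decide
    if all(d > threshold for d in recent):
        return "ALARM"               # every recent datapoint breached the threshold
    return "OK"


def composite_and(*states: str) -> str:
    """Toy composite alarm: ALARM only if every child alarm is in ALARM."""
    return "ALARM" if all(s == "ALARM" for s in states) else "OK"


cpu = alarm_state([40, 85, 90, 95], threshold=80, periods=3)
latency = alarm_state([120, 130, 90], threshold=100, periods=3)
print(cpu)                                    # ALARM
print(latency)                                # OK
print(alarm_state([], 80, 3))                 # INSUFFICIENT_DATA
print(composite_and(cpu, latency))            # OK
```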
CloudTrail: Every AWS API call logged (who, what, when, from where). Default 90-day Event History. Create a Trail → S3 for longer retention. Enable on Day 1. Multi-region trail covers all regions. Enable log file integrity validation.
EventBridge: Serverless event bus. Rules match events → targets. Default bus (AWS events) + custom buses (your app events). Replaced CloudWatch Events. Cron schedules, service event reactions (EC2 state change → Lambda).
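How a rule matches an event can be sketched with a small matcher. The pattern below uses the standard shape of an EC2 state-change event pattern (`source`, `detail-type`, `detail`), but the `matches` function is a hypothetical, simplified stand-in: it handles only exact-value lists and nested objects, not prefix, numeric, or other advanced operators.

```python
def matches(pattern: dict, event: dict) -> bool:
    """True if every field in the pattern is present in the event with one of
    the listed values. Fields the pattern does not mention are ignored."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of acceptable values.
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},  # made-up ID
}

print(matches(pattern, event))  # True
event["detail"]["state"] = "running"
print(matches(pattern, event))  # False
```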
CloudFormation: YAML/JSON templates → Stacks. Resources created/updated/deleted together. Always use Change Sets before updating production stacks. Drift detection = find manual changes. StackSets = multi-account/region deployments.
Auto Scaling Group: Min/Max/Desired. Scaling policies: Target Tracking (most common, "keep CPU at 60%") | Step Scaling | Scheduled | Predictive. Mixed instances (Spot + On-Demand) = major cost savings. Health checks replace unhealthy instances automatically.
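A rough mental model for target tracking ("keep CPU at 60%") is proportional scaling, clamped to the group's Min/Max. The formula below is an approximation for intuition only, not AWS's actual algorithm, which also accounts for cooldowns, warm-up, and conservative scale-in. The function name is hypothetical.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Approximate capacity needed to bring the average metric back to target."""
    raw = math.ceil(current * metric / target)
    # The group never goes below Min or above Max.
    return max(min_size, min(max_size, raw))


# 4 instances averaging 90% CPU, target 60%: 4 * 90 / 60 = 6 instances
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))  # 6
# Same load, but Max caps the group at 5
print(desired_capacity(4, 90, 60, min_size=2, max_size=5))   # 5
# Load drops to 15%: 4 * 15 / 60 = 1, but Min keeps 2 running
print(desired_capacity(4, 15, 60, min_size=2, max_size=10))  # 2
```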
SSM Session Manager: Shell access to EC2 with no SSH keys, no port 22 open, no bastion host. IAM-authenticated. All sessions logged. Use instead of SSH for production access. Also: SSM Run Command (run commands on fleets), Patch Manager, Parameter Store.
SQS: Queue. Standard (nearly unlimited throughput, at-least-once delivery, best-effort ordering) vs FIFO (up to 3K msg/s with batching, exactly-once processing, strictly ordered). Visibility timeout = how long a message stays hidden from other consumers while one processes it. DLQ = captures messages that still fail after N receives. Long polling = wait up to 20s for messages, which reduces empty responses.
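Visibility timeout and the DLQ are easiest to understand by simulating them. The class below is a toy in-memory queue with a logical clock (`now` in seconds), not the SQS API; `ToyQueue`, `max_receive_count`, and the other names are hypothetical. A message moves to the dead-letter list once it has already been received `max_receive_count` times without being deleted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    body: str
    receive_count: int = 0
    visible_at: int = 0  # logical time when the message becomes receivable again


@dataclass
class ToyQueue:
    visibility_timeout: int = 30
    max_receive_count: int = 3
    messages: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: int) -> Optional[Message]:
        for msg in list(self.messages):
            if msg.visible_at > now:
                continue  # hidden: a consumer is (supposedly) still processing it
            if msg.receive_count >= self.max_receive_count:
                # Received too many times without being deleted: move to the DLQ.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.visible_at = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: Message) -> None:
        """Consumers delete a message only after processing it successfully."""
        self.messages.remove(msg)


q = ToyQueue(visibility_timeout=30, max_receive_count=2)
q.send("resize image 17")

first = q.receive(now=0)        # delivered, hidden until t=30
print(q.receive(now=10))        # None: still inside the visibility timeout
second = q.receive(now=30)      # consumer never deleted it, so it reappears
print(second.receive_count)     # 2
print(q.receive(now=60))        # None: message moved to the DLQ instead
print(len(q.dead_letters))      # 1
```

A successful consumer would call `q.delete(first)` before the timeout expires, and the message would never reappear.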
SNS: Pub/Sub. One topic → many subscribers (SQS, Lambda, HTTP, Email, SMS). SNS→SQS fan-out pattern: SNS fans to multiple SQS queues, each with independent consumer. Durable (SQS buffers if consumer is down).
Kinesis Data Streams: Real-time streaming. Records retained 1-365 days. Multiple consumers can re-read (unlike SQS, where a message is consumed once). Ordered per shard. 1MB/s write per shard. Kinesis Firehose: Managed delivery to S3/Redshift/ES; no consumer code needed.
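"Ordered per shard" follows from how records are routed: Kinesis takes the MD5 hash of the partition key as a 128-bit integer and sends the record to the shard whose hash-key range contains it, so records with the same key always land on the same shard. The sketch below assumes the hash space is split evenly across shards (which resharding can change); the function names are hypothetical.

```python
import hashlib
import math


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard that would receive this key, assuming even ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")   # 128-bit integer
    range_size = 2**128 // shard_count
    return min(hash_key // range_size, shard_count - 1)


def shards_for_write(mb_per_sec: float) -> int:
    """Minimum shards for a write rate, using the 1MB/s per-shard limit."""
    return max(1, math.ceil(mb_per_sec))


# Same partition key -> same shard, so per-device ordering is preserved.
print(shard_for("device-17", 4) == shard_for("device-17", 4))  # True
print(shards_for_write(3.2))                                   # 4
```

A low-cardinality partition key (for example, a constant string) sends everything to one shard and wastes the rest, which is why keys such as a device or user ID are common choices.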
SQS vs SNS vs Kinesis: SQS = job queue (consume once, delete). SNS = fan-out notifications (no retention). Kinesis = real-time stream (replay, multiple consumers, time-ordered). Pick based on pattern: task processing → SQS, event notifications → SNS, real-time analytics → Kinesis.
Azure Application Insights: Full APM (request rates, errors, response times, user sessions); no native AWS equivalent (use X-Ray + custom CloudWatch). Azure Event Hubs: Kafka-compatible streaming, like Kinesis but with Kafka API support. GCP Cloud Pub/Sub: SQS + SNS combined in one service.