☁ Cloud Concepts & AWS Services

Complete notes for a Junior DevOps role. Learn core cloud principles, then dive deep into AWS services with real-world examples and diagrams. GCP & Azure equivalents included throughout.

AWS GCP Azure DevOps IaC Serverless

☁ Cloud Concepts

CC-M1

Cloud Fundamentals

1 What is Cloud Computing?

The Core Idea

Cloud computing means renting computing resources over the internet instead of buying and managing your own hardware. Think of it like electricity — you don't build your own power plant; you plug into the grid and pay for what you use.

Before cloud, a startup wanting to launch an app needed to: buy servers, rent datacenter space, hire a sysadmin, buy networking hardware, wait weeks for delivery — all before writing a single line of code. Cloud made that a 5-minute signup.

NIST 5 Essential Characteristics

The official NIST definition says cloud computing must have all 5 of these:

1. On-Demand Self-Service

You provision resources yourself, without talking to a human. Spin up an EC2 instance at 2 AM, no approval required.

2. Broad Network Access

Resources are accessible over the internet from any device — laptop, phone, another server anywhere on the planet.

3. Resource Pooling (Multi-tenancy)

Provider serves many customers from the same physical hardware, dynamically assigning resources. You don't know (or care) which physical server you're on.

4. Rapid Elasticity

Scale up or down fast — sometimes automatically. Resources feel unlimited from the user's perspective. Traffic spike at 9 AM? Auto Scaling adds servers in minutes.

5. Measured Service

You pay for exactly what you use. Like a utility bill. AWS charges per hour/second for compute, per GB for storage, per million for API calls.

CAPEX vs OPEX

Model	What it means	Example	Cloud relevance
CAPEX (Capital Expense)	Upfront large purchase. You own the asset.	Buying 50 physical servers	Traditional / On-prem model
OPEX (Operational Expense)	Ongoing cost. Pay as you go.	Paying AWS monthly bill	Cloud model — predictable, flexible

Why it matters for DevOps: Cloud moves IT from CAPEX to OPEX. This means faster experimentation (no hardware order), easier budgeting, and no wasted capital on underused hardware. As a DevOps engineer, you'll constantly make decisions that affect cloud spend.

2 Service Models — IaaS, PaaS, SaaS

The "Pizza as a Service" Analogy

These models define how much of the stack the cloud provider manages vs how much you manage.

Layer	On-Prem (you manage)	IaaS	PaaS	SaaS
Application	You	You	You	Provider
Data	You	You	You	Provider
Runtime / Middleware	You	You	Provider	Provider
OS	You	You	Provider	Provider
Virtualization	You	Provider	Provider	Provider
Hardware / Network / DC	You	Provider	Provider	Provider

IaaS — Infrastructure as a Service

You get raw compute, storage, and networking. You manage the OS up. Most control, most responsibility.

Real example: You rent an EC2 instance, install Ubuntu, install Nginx, deploy your app. If the OS crashes, that's on you to fix.

PaaS — Platform as a Service

You just deploy your application code/container. The provider handles OS patching, scaling infrastructure, runtime. Less control, less ops work.

Real example: You push a Python Flask app to Elastic Beanstalk. AWS auto-provisions EC2, load balancer, and auto-scaling. You never SSH into a server.

SaaS — Software as a Service

You're just a user of a complete application. No infrastructure, no app management. Just login and use it.

Real example: Gmail, Slack, Salesforce. AWS WorkMail is also SaaS.

Cloud Provider Service Model Equivalents

AWS

IaaS: EC2 | PaaS: Elastic Beanstalk, Lambda | SaaS: WorkMail, Chime

GCP

IaaS: Compute Engine (GCE) | PaaS: App Engine, Cloud Run | SaaS: Google Workspace

Azure

IaaS: Azure VMs | PaaS: Azure App Service, Azure Functions | SaaS: Microsoft 365, Dynamics 365

3 Deployment Models — Public, Private, Hybrid, Multi-Cloud

Public Cloud

Resources run on provider's shared infrastructure, accessible over the public internet. AWS, GCP, Azure are all public clouds. Best for: startups, variable workloads, apps without strict data residency needs.

Private Cloud

Cloud infrastructure dedicated to one organization. Can be on-prem or in a provider's dedicated facility. Tech: OpenStack, VMware vSphere. Best for: banks, government, healthcare — strict compliance requirements.

Hybrid Cloud

Mix of on-prem (private) + public cloud, connected by VPN or Direct Connect. Best for: organizations with legacy systems migrating gradually to cloud, or data residency requirements with burst needs.

Multi-Cloud

Using multiple public cloud providers simultaneously (e.g., AWS for compute + GCP for ML). Best for: avoiding vendor lock-in, using best-of-breed services, or regulatory reasons.

Real-World Example (Hybrid): A bank keeps customer data on private on-prem servers (regulatory compliance) but uses AWS for its web frontend and analytics dashboards. The private datacenter connects to AWS via AWS Direct Connect, creating a hybrid setup.

Multi-Cloud vs Hybrid Cloud: Multi-cloud = multiple public cloud providers. Hybrid cloud = public cloud + on-prem/private cloud. These are different! Many companies end up with both (multi-cloud-hybrid) in practice.

4 Shared Responsibility Model

The Most Important Concept in Cloud Security

AWS (and all cloud providers) operate under a shared responsibility model. In simple terms: AWS is responsible for security OF the cloud. YOU are responsible for security IN the cloud.

Shared Responsibility Model — EC2 (IaaS) Example

  ┌──────────────────────────────────────────────────────────────────┐
  │                      CUSTOMER RESPONSIBILITY                     │
  │  (Security IN the cloud)                                         │
  │                                                                  │
  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────────────┐ │
  │  │  Customer   │  │  Platform,  │  │  Identity & Access Mgmt  │ │
  │  │    Data     │  │  App, OS    │  │  (IAM users, policies)   │ │
  │  └─────────────┘  └─────────────┘  └──────────────────────────┘ │
  │  ┌─────────────┐  ┌─────────────┐  ┌──────────────────────────┐ │
  │  │  Firewall / │  │  Network    │  │  Client-side & Server-   │ │
  │  │  Sec Groups │  │  Config     │  │  side Encryption         │ │
  │  └─────────────┘  └─────────────┘  └──────────────────────────┘ │
  └──────────────────────────────────────────────────────────────────┘
  ┌──────────────────────────────────────────────────────────────────┐
  │                       AWS RESPONSIBILITY                         │
  │  (Security OF the cloud)                                         │
  │                                                                  │
  │  ┌────────────────────────────────────────────────────────────┐  │
  │  │  Compute | Storage | Networking | Database (managed infra) │  │
  │  └────────────────────────────────────────────────────────────┘  │
  │  ┌────────────────────────────────────────────────────────────┐  │
  │  │  Physical Security of Datacenters, Hardware, Network Infra │  │
  │  └────────────────────────────────────────────────────────────┘  │
  └──────────────────────────────────────────────────────────────────┘

Responsibility Shifts Based on Service Model

Concern	IaaS (EC2)	PaaS (Beanstalk)	SaaS (WorkMail)
Physical datacenter	AWS	AWS	AWS
Hypervisor / Hardware	AWS	AWS	AWS
OS patching	You	AWS	AWS
Runtime/middleware	You	AWS	AWS
Application code	You	You	AWS
Data & encryption	You	You	You
IAM / access control	You	You	You

Common Mistake: Many cloud breaches happen because people think "the cloud provider secures everything." It doesn't. Leaving an S3 bucket publicly readable, using weak IAM policies, or not patching your EC2 OS: all of these are your responsibility. The provider doesn't protect you from YOUR mistakes inside the cloud.

Shared Responsibility — Other Providers

GCP

Same model: Google secures the infrastructure, you secure your workloads and data. Called "Shared Fate" in GCP (more collaborative tone — Google provides security tools to help you).

Azure

Same model. Azure's documentation explicitly shows a layered diagram. For managed services (like Azure SQL), Azure takes on more responsibility than for VMs.

CC-M2

Global Infrastructure

1 Regions, Availability Zones & Edge Locations

Why Geography Matters in Cloud

Your users are physically distributed. A round trip between a server in the US and a user in India typically takes 150 ms or more. Cloud providers build datacenters globally to solve this. But geography also matters for compliance (GDPR restricts transferring EU personal data outside the EU without safeguards, so many teams keep it in EU regions), disaster recovery (separate physical locations), and cost (prices vary by region).

AWS Global Infrastructure Hierarchy

Regions

A Region is a geographically separate area of the world with a cluster of datacenters. Each region has a unique name like us-east-1 (N. Virginia), ap-south-1 (Mumbai), eu-west-1 (Ireland).

AWS has 33+ regions worldwide (2024)
Regions are completely independent — a region-wide failure doesn't affect other regions
Not all services are available in all regions (e.g., some AI services only in US regions initially)
Data does NOT automatically replicate across regions — you must explicitly configure cross-region replication

Availability Zones (AZs)

Each region has multiple AZs (AWS documents a minimum of three for current regions, and the largest have six). An AZ is one or more discrete datacenters with:

Independent power supply (UPS + diesel generators)
Independent networking (separate internet uplinks)
Physical separation (miles apart, so one fire/flood doesn't take both)
But connected with high-speed, low-latency private fiber within the region (single-digit millisecond latency between AZs)

Practical Rule: Deploy critical resources across at least 2 AZs. If one AZ fails (power outage, hardware failure), your app keeps running in the other AZ. This is the foundation of High Availability in AWS.
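Why a second AZ helps can be sketched with a little arithmetic. The snippet below is a rough illustration, not an AWS formula: it assumes AZ failures are fully independent (real outages can be correlated), and the 99% per-AZ figure and the `combined_availability` helper are made up for this example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent (an idealization): the system
    is down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


# Hypothetical per-AZ availability of 99%, chosen only for illustration.
for azs in (1, 2, 3):
    print(f"{azs} AZ(s): {combined_availability(0.99, azs):.6f}")
```

Under these assumptions each extra AZ multiplies the chance of a total outage by the per-AZ failure probability, which is why even two AZs make a large difference.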

Edge Locations & CloudFront PoPs

For CDN (CloudFront), AWS has 600+ edge locations worldwide — far more than regions. These are smaller cache servers placed close to end users. Content cached here gets served with ultra-low latency. Edge locations are also used by Route 53 (DNS) and AWS Shield (DDoS protection).

Other AWS Infrastructure Types

Type	What it is	Use case
Local Zones	AWS compute placed in specific cities (e.g., Delhi, Chicago), extending a region	Sub-10ms latency for city users. Gaming, live video, AR/VR.
Wavelength Zones	AWS compute embedded in telecom 5G networks	Ultra-low latency apps delivered via 5G. Mobile gaming, real-time video.
AWS Outposts	AWS-managed rack in YOUR datacenter running AWS services	On-prem workloads needing AWS APIs. Compliance requiring on-prem data.

Global Infrastructure — Other Providers

GCP

Regions & Zones (similar concept). A Zone is like an AZ. GCP calls them Zones directly (e.g., asia-south1-a). Also has Cloud CDN PoPs for edge caching. ~40 regions.

Azure

Regions & Availability Zones. Azure also has Availability Sets (older: ensures VMs spread across fault/update domains within a single datacenter — NOT the same as AZs). Azure AZs are like AWS AZs. Also has Azure Edge Zones similar to AWS Local Zones.

Azure-Only

Azure Paired Regions: Most Azure regions are paired with another region in the same geography (e.g., East US ↔ West US); some newer regions have no pair. Microsoft staggers updates across pairs and replicates some services automatically. AWS doesn't have an exact equivalent, so you manage cross-region replication yourself.

2 Choosing a Region — 4 Key Factors

1. Compliance & Data Residency

GDPR restricts transfers of EU personal data outside the EU, and sector rules such as HIPAA (US healthcare) and PCI-DSS (payments) add their own constraints. If law or contract requires data to stay in a specific country, that region wins, period. No other factor overrides this.

2. Latency (Proximity to Users)

Deploy closest to your users. If 80% of users are in India, choose ap-south-1 (Mumbai). Use CloudFront for global CDN on top. Test with cloudpingtest.com.

3. Service Availability

Not all services exist in all regions. New services launch in us-east-1 first. Check the AWS Regional Services table before designing architecture. Bedrock (AI) has limited region availability.

4. Pricing

Same EC2 instance type costs differently per region. us-east-1 tends to be cheapest. ap-southeast-1 (Singapore) is ~10-20% more. Factor this into cost modeling.

Pro Tip: us-east-1 is special. AWS us-east-1 (N. Virginia) is AWS's oldest and largest region. New services often launch here first. It's also where AWS Console global resources (like IAM, Route 53, CloudFront) show up. When something seems to not exist in your region, check whether it's available only in us-east-1.

CC-M3

High Availability, Scalability & Disaster Recovery

1 High Availability vs Fault Tolerance vs Disaster Recovery

Three Related But Different Concepts

These terms are often confused. Think of a hospital as an analogy:

High Availability (HA)

System is designed to be "always on" with minimal downtime. If a component fails, the system automatically recovers quickly. A hospital with a backup generator — brief flicker but stays running.

Fault Tolerance (FT)

System continues operating WITH ZERO downtime or data loss even when a component fails. An airplane with 4 engines that can fly on 3 — no passengers even notice. Much harder and more expensive than HA.

Disaster Recovery (DR)

Your plan for recovering from catastrophic failure (entire datacenter destroyed, full region outage). Like a hospital's evacuation plan — you hope you never need it but must have it. Usually involves a separate region.

Nines of Availability

Availability %	Downtime per year	Downtime per month	Typical system
99%	3.65 days	7.3 hours	Basic single-server app
99.9% ("three nines")	8.76 hours	43.8 minutes	Simple multi-AZ setup
99.99% ("four nines")	52.6 minutes	4.4 minutes	Production multi-AZ + failover
99.999% ("five nines")	5.26 minutes	26 seconds	Active-active multi-region
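The downtime figures in the table follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year and an average month of one twelfth of a year (the `downtime_hours` helper is introduced here for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours(availability_percent: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed within a period for a given availability target."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_min = downtime_hours(target) * 60
    per_month_min = per_year_min / 12  # average month = 1/12 of a year
    print(f"{target}%: {per_year_min:.1f} min/year, {per_month_min:.1f} min/month")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why each step up the table costs noticeably more to achieve.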

AWS SLAs: The EC2 Region-level SLA is 99.99%. S3 Standard is designed for 99.99% availability and carries a 99.9% SLA. Route 53 has a 100% uptime SLA. These are contractual commitments: if AWS misses them, you get service credits.

RPO and RTO — The Two DR Metrics

RPO — Recovery Point Objective

How much data can you afford to lose? Measured as maximum time between last backup and the disaster. RPO = 1 hour means you're OK losing up to 1 hour of data. Lower RPO = more frequent backups = more cost.

RTO — Recovery Time Objective

How long can your system be down? Time from disaster to full recovery. RTO = 4 hours means you need to be back up within 4 hours. Lower RTO = more standby infrastructure = more cost.

RPO and RTO on a Timeline

  Normal ──────────────────┐ DISASTER ┌──────────────── Recovered
  Operation                │ event    │                 state
                           │          │
  [Last backup] ◄──────────┤          ├──────────────► [Back online]
                   RPO     │          │      RTO
                (data gap) │          │  (recovery time)

  Example: RPO=1hr, RTO=4hr
  → You can lose max 1 hour of data
  → You must be back online within 4 hours of the disaster
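The timeline above can be turned into a simple check. This is a minimal sketch with an invented incident timeline; `meets_objectives` is a hypothetical helper, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup: datetime, disaster: datetime,
                     back_online: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """Return True if an incident stayed within both RPO and RTO."""
    data_gap = disaster - last_backup       # data written after the last backup is lost
    recovery_time = back_online - disaster  # how long the system was down
    return data_gap <= rpo and recovery_time <= rto


# Hypothetical incident, invented for illustration.
last_backup = datetime(2024, 1, 1, 9, 30)
disaster = datetime(2024, 1, 1, 10, 15)
back_online = datetime(2024, 1, 1, 13, 45)

print(meets_objectives(last_backup, disaster, back_online,
                       rpo=timedelta(hours=1), rto=timedelta(hours=4)))
```

Here the data gap is 45 minutes and recovery took 3.5 hours, so both the 1-hour RPO and the 4-hour RTO are met. Tightening the RPO to 30 minutes would make the same incident a miss.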

2 Disaster Recovery Strategies

4 AWS DR Strategies — Cost vs Speed Tradeoff

DR Strategy Comparison — Cost vs Recovery Speed

                        Faster recovery (lower RTO/RPO)
                        ─────────────────────────────►

  CHEAPEST    Backup &    Pilot     Warm      Active-Active    MOST
  (cold)      Restore     Light     Standby   (Multi-site)     EXPENSIVE
              │           │         │         │
              ▼           ▼         ▼         ▼
  RTO:        Hours      Minutes   Minutes   Seconds
  RPO:        Hours      Minutes   Seconds   Near-zero
  Cost:       $           $$        $$$       $$$$
              │           │         │         │
              │           │         │         └─ Full copy in 2nd region
              │           │         └─── Scaled-down running copy
              │           └──── Minimal services always running
              └──── Just backups, nothing running

Strategy 1: Backup & Restore

Regularly back up data and snapshots to S3. When disaster hits, spin up new infrastructure from those backups. Simplest, cheapest, but slowest.

Example: EC2 AMI snapshots every 6 hours to S3. RDS automated backups to another region. If primary region fails, launch new EC2 from AMI, restore RDS from backup. Takes hours.

Strategy 2: Pilot Light

A minimal version of your app is always running in DR region — just the core data-syncing layer (e.g., a database replicating from primary). Application servers are OFF but AMIs/configs are ready. Scale up when needed.

Example: RDS read replica in DR region (always syncing). EC2 AMIs ready. When disaster strikes: promote the read replica to primary, then launch app servers from AMIs. Takes 15-30 minutes.

Strategy 3: Warm Standby

A scaled-down but fully running copy of your system in DR region. It can serve a small share of live traffic in normal times or simply sit ready. During disaster, scale it up to full production capacity.

Example: 2 t3.small EC2s in DR region vs 10 m5.xlarge in production. During disaster, scale DR to full size and redirect DNS.

Strategy 4: Active-Active (Multi-Site)

Full production deployment in 2+ regions, ALL serving live traffic. Route 53 routes users to nearest healthy region. If one region fails, all traffic goes to the other with no perceivable downtime.

Example: Netflix runs in multiple AWS regions. If us-east-1 has issues, traffic goes to us-west-2. Users might see a brief slowdown, but no outage.
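Choosing among the four strategies is usually a matter of picking the cheapest one that still meets your RTO and RPO. The sketch below shows that selection logic. The minute values are rough placeholders assumed for illustration (the notes only say "hours", "minutes", "seconds"), and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed typical recovery time
    rpo_minutes: float   # assumed typical data-loss window
    relative_cost: int   # 1 = "$" ... 4 = "$$$$"


# Placeholder numbers for illustration only; measure your own system.
STRATEGIES = [
    Strategy("Backup & Restore", 240, 240, 1),
    Strategy("Pilot Light", 30, 10, 2),
    Strategy("Warm Standby", 10, 1, 3),
    Strategy("Active-Active", 1, 0.1, 4),
]


def cheapest_strategy(rto_target: float, rpo_target: float) -> Optional[Strategy]:
    """Cheapest strategy whose assumed RTO and RPO both meet the targets."""
    candidates = [s for s in STRATEGIES
                  if s.rto_minutes <= rto_target and s.rpo_minutes <= rpo_target]
    return min(candidates, key=lambda s: s.relative_cost) if candidates else None


choice = cheapest_strategy(rto_target=60, rpo_target=15)
print(choice.name if choice else "no strategy meets the targets")
```

With a 60-minute RTO and 15-minute RPO this picks Pilot Light under the assumed numbers. Tighter targets push you toward the more expensive options.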

DR Concepts — All Clouds

AWS

DR strategies built around multi-region architecture. Key services: Route 53 (DNS failover), S3 CRR (cross-region replication), RDS Read Replicas, Aurora Global Database, DynamoDB Global Tables.

GCP

Same concepts. Multi-region Cloud Storage, Cloud Spanner (global DB), Cloud DNS with failover routing. GCP also has Managed Instance Groups with regional autoscaling.

Azure

Azure Site Recovery (ASR) is Azure's dedicated DR service; the closest AWS counterpart is AWS Elastic Disaster Recovery (DRS). ASR can replicate VMs to a secondary region and automate failover. Azure Traffic Manager handles DNS-level failover (like Route 53).

Azure vs AWS

Azure Site Recovery (ASR): Dedicated managed DR service that replicates VMs, manages failover plans, and handles RPO/RTO tracking. On AWS, the comparable managed service is AWS Elastic Disaster Recovery (DRS); DNS failover is still handled separately with Route 53.

3 Scalability & Elasticity

Scalability = Can It Grow? Elasticity = Does It Grow Automatically?

Scalability means your architecture can handle increased load. Elasticity means it automatically scales up AND back down as load changes — so you're not paying for idle capacity at 3 AM.

Vertical Scaling (Scale Up)

Give the existing server more power. Upgrade from t3.medium (2 vCPU, 4GB) to m5.4xlarge (16 vCPU, 64GB). Simple but has limits (biggest instance size), requires downtime, and creates a single point of failure.

Vertical vs Horizontal Scaling

  VERTICAL (Scale Up)                 HORIZONTAL (Scale Out)

  Before: [Server 2GB RAM]            Before: [Server] [Server]
              │                                   │
              ▼                                   ▼
  After:  [Server 16GB RAM]           After:  [Server] [Server] [Server] [Server]
                                                         │
          One server, bigger                      Load Balancer distributes traffic
          Single point of failure                 No SPOF — much more resilient

Horizontal Scaling (Scale Out)

Add more instances of the same server. 1 server → 5 servers behind a load balancer. No single point of failure. Nearly unlimited scale. Requires your app to be stateless (session data stored in Redis/DB, not locally).

Auto Scaling

AWS Auto Scaling automatically adjusts the number of instances based on rules you define. You define a minimum (always have at least 2), maximum (never exceed 20), and desired (target 4 normally).

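Scaling can be triggered by: CPU usage > 70%, request count, a schedule, or custom CloudWatch metrics. Memory falls in the last group: EC2 does not report it by default, so it must be published by the CloudWatch agent.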

Real-World Scenario: You run an e-commerce site. On a normal day, 4 EC2 instances handle traffic. Black Friday comes and traffic spikes 10x. Auto Scaling detects the CPU spike, scales out to 20 instances in about 5 minutes, and Black Friday is handled. At midnight, when traffic drops, it scales back to 4. You only pay for the extra instances for those hours.
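The scaling decision in this scenario can be approximated in a few lines. This is a simplified sketch in the spirit of target tracking, not AWS's actual algorithm (which also involves cooldowns, warm-up, and alarm evaluation periods); `desired_capacity` and its default values are assumptions for illustration.

```python
import math


def desired_capacity(current: int, cpu_percent: float,
                     target_cpu: float = 70.0,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Instance count that would bring average CPU back near the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    # If 4 instances run at 95% and the target is 70%, we need 4 * 95 / 70 instances.
    proposed = math.ceil(current * cpu_percent / target_cpu)
    # Never go below the minimum or above the maximum.
    return max(minimum, min(maximum, proposed))


print(desired_capacity(current=4, cpu_percent=95))   # moderate spike: scale out
print(desired_capacity(current=4, cpu_percent=700))  # extreme spike: capped at maximum
print(desired_capacity(current=20, cpu_percent=10))  # traffic gone: scale in
```

The minimum and maximum act as guard rails: the group never shrinks below what you need for availability and never grows past what you are willing to pay for.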

Elasticity Requires Stateless Apps: If your app stores user sessions in local server memory, horizontal scaling breaks: User logs in → Server A has session → Next request hits Server B → Session not found → User gets logged out. Fix: store sessions in ElastiCache (Redis) or DynamoDB, not in server memory.
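The session problem is easy to reproduce in miniature. In this sketch a plain dictionary stands in for a shared store such as Redis; the `Server` class is a toy model invented for illustration, not real web-server code.

```python
from typing import Dict, Optional


class Server:
    """Toy app server that keeps sessions locally or in a shared store."""

    def __init__(self, name: str, shared_store: Optional[Dict[str, str]] = None):
        self.name = name
        # Without a shared store, each server keeps sessions in its own memory.
        self.sessions = shared_store if shared_store is not None else {}

    def login(self, session_id: str, user: str) -> None:
        self.sessions[session_id] = user

    def whoami(self, session_id: str) -> Optional[str]:
        return self.sessions.get(session_id)


# Local sessions: the login on A is invisible to B.
a, b = Server("A"), Server("B")
a.login("s1", "alice")
print(b.whoami("s1"))  # None, so the user appears logged out

# Shared store (stand-in for Redis/DynamoDB): any server can serve the request.
store: Dict[str, str] = {}
a2, b2 = Server("A", store), Server("B", store)
a2.login("s1", "alice")
print(b2.whoami("s1"))  # alice
```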

CC-M4

Cloud Networking Fundamentals

1 Virtual Private Cloud (VPC) — Your Private Network

What is a VPC?

A VPC is a logically isolated private network in the cloud. Think of it as your own private section of AWS that no one else can access. In a VPC you create yourself, nothing inside can reach the internet and the internet can't reach your VPC until you explicitly configure that. (The default VPC that comes with each account is the exception: it already has an internet gateway and public subnets.)

Analogy: AWS is a massive apartment building. Your VPC is your apartment — you can furnish it however you like inside, but outsiders can't get in unless you let them.

VPC Architecture — Key Components

  ┌─────────────────────── AWS Region (ap-south-1) ─────────────────────────┐
  │                                                                          │
  │  ┌──────────────────────── VPC (10.0.0.0/16) ────────────────────────┐  │
  │  │                                                                    │  │
  │  │  ┌── AZ-1a ─────────────────────┐  ┌── AZ-1b ──────────────────┐ │  │
  │  │  │                              │  │                            │ │  │
  │  │  │ [Public Subnet 10.0.1.0/24]  │  │ [Public Subnet 10.0.2.0/24]│ │  │
  │  │  │  ┌──────────┐               │  │  ┌──────────┐             │ │  │
  │  │  │  │ EC2 (web)│               │  │  │ EC2 (web)│             │ │  │
  │  │  │  │  Public  │ NAT GW        │  │  │  Public  │             │ │  │
  │  │  │  │  IP: ✓   │──┐            │  │  │  IP: ✓   │             │ │  │
  │  │  │  └──────────┘  │            │  │  └──────────┘             │ │  │
  │  │  │                │            │  │                            │ │  │
  │  │  │ [Private Sub 10.0.3.0/24]  │  │ [Private Sub 10.0.4.0/24] │ │  │
  │  │  │  ┌──────────┐  │            │  │  ┌──────────┐             │ │  │
  │  │  │  │ EC2 (app)│◄─┘ (outbound)│  │  │ RDS (DB) │             │ │  │
  │  │  │  │ No pub IP│               │  │  │ No pub IP│             │ │  │
  │  │  │  └──────────┘               │  │  └──────────┘             │ │  │
  │  │  └──────────────────────────────┘  └────────────────────────── ┘ │  │
  │  │                       │                                           │  │
  │  │              Internet Gateway (IGW)                               │  │
  │  └───────────────────────┼───────────────────────────────────────────┘  │
  │                          │                                              │
  └──────────────────────────┼──────────────────────────────────────────────┘
                             │
                         INTERNET

Key Networking Components

CIDR Block

When you create a VPC, you assign it a CIDR block like 10.0.0.0/16. This defines the IP range for your entire VPC (65,536 IPs). You then carve subnets from this range.
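The CIDR arithmetic can be checked with Python's standard `ipaddress` module. A minimal sketch using the same 10.0.0.0/16 range; the only fact added beyond the text is that AWS reserves 5 addresses in every subnet.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # 65536 addresses in a /16

# Carve the VPC range into /24 subnets.
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))   # 256 possible /24 subnets
print(subnets[1])     # 10.0.1.0/24

# AWS reserves 5 addresses per subnet (network, router, DNS, future use, broadcast).
AWS_RESERVED_PER_SUBNET = 5
usable = subnets[1].num_addresses - AWS_RESERVED_PER_SUBNET
print(usable)         # 251 usable addresses in a /24

# Membership test: which subnet does an instance IP belong to?
print(ipaddress.ip_address("10.0.3.17") in subnets[3])  # True
```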

Subnets

Subnets divide the VPC into smaller networks, and they're tied to a specific AZ. A public subnet has a route to the internet via an Internet Gateway. A private subnet has no direct internet route — instances here can't be reached from the internet.

Internet Gateway (IGW)

A horizontally-scaled, redundant, HA component attached to your VPC that enables communication between your VPC and the internet. The gateway itself has no hourly charge (data transfer is billed as usual). Without an IGW, your VPC has no internet connectivity at all. Only one IGW per VPC.

NAT Gateway

Allows instances in private subnets to initiate outbound connections to the internet (for software updates, API calls, etc.) but prevents the internet from initiating connections TO those instances. Deployed in a public subnet, charges per hour + data processed.

Route Tables

Every subnet has a route table that defines where traffic goes. A public subnet's route table has an entry: 0.0.0.0/0 → igw-xxxxx (default route to internet via IGW). A private subnet's route table: 0.0.0.0/0 → nat-xxxxx (outbound only via NAT).
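Route selection follows the most specific match: traffic inside the VPC uses the built-in local route, and everything else falls through to the default route. The sketch below models that with longest-prefix matching. `pick_route` is a hypothetical helper, and the `igw-xxxxx` and `nat-xxxxx` targets are the same placeholders used in the text.

```python
import ipaddress
from typing import Dict, Optional


def pick_route(route_table: Dict[str, str], destination: str) -> Optional[str]:
    """Return the target of the most specific route that contains the destination."""
    dest = ipaddress.ip_address(destination)
    best_target, best_prefix = None, -1
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if dest in network and network.prefixlen > best_prefix:
            best_target, best_prefix = target, network.prefixlen
    return best_target


# Every VPC route table also contains a "local" route for the VPC's own range.
public_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxxxx"}
private_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-xxxxx"}

print(pick_route(public_routes, "10.0.3.17"))  # local (stays inside the VPC)
print(pick_route(public_routes, "8.8.8.8"))    # igw-xxxxx
print(pick_route(private_routes, "8.8.8.8"))   # nat-xxxxx
```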

Security Groups

Virtual firewalls at the instance (EC2) level. Stateful: if you allow inbound on port 80, the response automatically comes back out without needing an outbound rule. Default: deny all inbound, allow all outbound.

Network ACLs (NACLs)

Firewall at the subnet level. Stateless: you must define both inbound AND outbound rules explicitly. Rules are evaluated in number order (lowest first), and the first rule that matches, ALLOW or DENY, stops evaluation. Less commonly tweaked than Security Groups.

Feature	Security Group	NACL
Applied at	Instance (ENI) level	Subnet level
State	Stateful (response auto-allowed)	Stateless (must allow both directions)
Rules	Allow only	Allow and Deny
Rule evaluation	All rules evaluated, most permissive wins	Rules in number order, first match wins
Default behavior	Deny all inbound, allow all outbound	Default NACL: allow all inbound and outbound (a custom NACL denies all until you add rules)
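The difference in rule evaluation is easier to see in code. This is a simplified model covering inbound traffic on a single port only, with no protocols, port ranges, or return traffic; the `Rule` type and both functions are invented for illustration, and the IP addresses come from documentation-reserved ranges.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    number: int   # evaluation order; only meaningful for NACLs
    port: int
    cidr: str
    allow: bool   # Security Group rules are always allow=True


def _matches(rule: Rule, port: int, source_ip: str) -> bool:
    return (rule.port == port
            and ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.cidr))


def nacl_allows(rules: List[Rule], port: int, source_ip: str) -> bool:
    """NACL: lowest rule number first; the first matching rule decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if _matches(rule, port, source_ip):
            return rule.allow
    return False  # nothing matched: the implicit final deny applies


def security_group_allows(rules: List[Rule], port: int, source_ip: str) -> bool:
    """Security Group: traffic passes if any allow rule matches."""
    return any(r.allow and _matches(r, port, source_ip) for r in rules)


nacl = [Rule(100, 22, "203.0.113.0/24", False),  # block one range first
        Rule(200, 22, "0.0.0.0/0", True)]        # then allow everyone else
print(nacl_allows(nacl, 22, "203.0.113.9"))   # False: rule 100 matched first
print(nacl_allows(nacl, 22, "198.51.100.7"))  # True: rule 200 matched

sg = [Rule(0, 443, "0.0.0.0/0", True)]
print(security_group_allows(sg, 443, "198.51.100.7"))  # True
print(security_group_allows(sg, 22, "198.51.100.7"))   # False: no allow rule
```

Because NACL evaluation stops at the first match, rule numbering matters: swapping the numbers of the two NACL rules above would let the blocked range through.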

VPC Equivalents in Other Clouds

GCP

Also called VPC. Key difference: GCP VPCs are global by default (span all regions). AWS VPCs are regional. In GCP, one VPC can have subnets in multiple regions. Subnets are regional. Security Groups → Firewall Rules (global, not per-instance). No direct NACL equivalent.

Azure

Called Virtual Network (VNet). Same concept — private IP space, subnets, gateways. Azure has Network Security Groups (NSGs) which work like AWS Security Groups but can be applied to subnets OR individual NICs. Azure also has Application Security Groups (ASGs) to group VMs logically.

2 Load Balancing Concepts

What is a Load Balancer?

A load balancer distributes incoming traffic across multiple backend servers. It's the entry point users hit — they don't talk to individual servers directly. This enables high availability (if one server dies, traffic goes elsewhere), horizontal scaling, and no single point of failure.

Layer 4 vs Layer 7 Load Balancing

L4 — Transport Layer (TCP/UDP)

Routes traffic based on IP address and port number. Doesn't look inside the packet. Fast, low-overhead. Good for: non-HTTP traffic, TCP-based apps, ultra-low latency, gaming, VoIP, financial trading.

AWS: NLB (Network Load Balancer)

L7 — Application Layer (HTTP/HTTPS)

Looks inside the HTTP request — path, hostname, headers, cookies. Can route /api/* to one group, /images/* to another. Smarter but slightly more overhead. Good for: web apps, microservices, content-based routing.

AWS: ALB (Application Load Balancer)

Load Balancing Algorithms

Algorithm	How it works	Best for
Round Robin	Send each request to next server in sequence: A, B, C, A, B, C...	Similar servers, similar request sizes
Least Connections	Send to server with fewest active connections	Variable request processing time
IP Hash / Sticky Sessions	Same client IP always goes to same server	Apps that need session affinity (stateful)
Weighted	Some servers get more traffic by weight (70/30 split)	Gradual deployments (blue/green, canary)
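Two of the algorithms in the table can be sketched in a few lines each. These are toy implementations for illustration (the class names are made up, and real load balancers also account for health and connection draining).

```python
import itertools
from typing import Dict, Iterable


class RoundRobin:
    """Hand out servers in a fixed repeating sequence."""

    def __init__(self, servers: Iterable[str]):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers: Iterable[str]):
        self.active: Dict[str, int] = {s: 0 for s in servers}

    def pick(self) -> str:
        server = min(self.active, key=self.active.get)  # ties go to the first server
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1  # call when the request finishes


rr = RoundRobin(["A", "B", "C"])
print([rr.pick() for _ in range(5)])  # ['A', 'B', 'C', 'A', 'B']

lc = LeastConnections(["A", "B"])
first, second = lc.pick(), lc.pick()  # A, then B
lc.release(second)                    # B finishes its request early
print(lc.pick())                      # B, because A is still busy
```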

Health Checks

Load balancers continuously ping backend servers (e.g., HTTP GET /health every 30s). If a server fails health check 3 times, the LB removes it from rotation. When it recovers and passes, it's added back. This is how HA works in practice.
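The remove-after-three-failures behaviour can be modelled as a small state machine. A minimal sketch: `HealthTracker` is a hypothetical class, and the healthy threshold of 1 is an assumption taken from the description above (real load balancers often require several consecutive passes before re-adding a target).

```python
class HealthTracker:
    """Track consecutive health check results for one backend server."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 1):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.in_rotation = True
        self._fails = 0
        self._passes = 0

    def record(self, passed: bool) -> bool:
        """Record one check result and return whether the server is in rotation."""
        if passed:
            self._fails = 0
            self._passes += 1
            if not self.in_rotation and self._passes >= self.healthy_threshold:
                self.in_rotation = True
        else:
            self._passes = 0
            self._fails += 1
            if self.in_rotation and self._fails >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


tracker = HealthTracker()
results = [False, False, True, False, False, False, True]
print([tracker.record(r) for r in results])
```

Counting consecutive failures rather than total failures means a single slow response does not eject a healthy server, while a server that is really down is removed after three checks.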

3 CDN Concepts — Content Delivery Networks

The Problem CDNs Solve

Your origin server is in us-east-1. A user in Mumbai requests your 5MB homepage image. The packet travels ~14,000 km. High latency. With a CDN, that image is cached in an edge server in Mumbai — user gets it from there. Fast.

CDN — Cache Hit vs Cache Miss Flow

  WITHOUT CDN:                          WITH CDN (cache HIT):
  User (Mumbai) ─────────────────►     User (Mumbai) ──► Edge (Mumbai) ──► User
   14,000km to us-east-1                                  [cached!] ◄──┘
   Response: ~300ms latency                               Response: ~5ms latency

  FIRST REQUEST (cache MISS):
  User (Mumbai) ──► Edge (Mumbai) ──► Origin (us-east-1) ──► Edge caches it
                                       Response: ~300ms (one time)

  SUBSEQUENT REQUESTS (cache HIT, within TTL):
  User (Mumbai) ──► Edge (Mumbai) ──► Serve from cache → ~5ms ✓

Key CDN Concepts

Origin: The source of truth — your actual server (S3 bucket, EC2, ALB).
Edge Location: CDN's cache servers distributed globally.
TTL (Time To Live): How long content is cached before being re-fetched from origin. Too long = stale content. Too short = defeats the purpose.
Cache Invalidation: Manually expire cached content when you deploy new files. In CloudFront, you create an invalidation request.
Origin Shield: Extra caching layer between edge locations and origin, reducing origin load. One central cache instead of 100s of edges hitting origin.

What to cache: Static assets (images, CSS, JS, videos). What NOT to cache: User-specific pages, API responses with sensitive data, frequently changing data (unless you manage TTL carefully).
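Cache hits, misses, TTL expiry, and invalidation fit into one small class. This is a toy model of a single edge location, not CloudFront's behaviour in detail; `EdgeCache` and the `origin` function are invented for illustration, and time is passed in explicitly so the example is deterministic.

```python
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy edge cache: serve from cache within TTL, otherwise fetch from origin."""

    def __init__(self, fetch_from_origin: Callable[[str], str], ttl_seconds: float):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[str, float]] = {}  # path -> (content, expires_at)

    def get(self, path: str, now: float) -> Tuple[str, str]:
        entry = self._store.get(path)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"
        content = self._fetch(path)  # slow trip to the origin
        self._store[path] = (content, now + self._ttl)
        return content, "MISS"

    def invalidate(self, path: str) -> None:
        self._store.pop(path, None)  # force the next request back to origin


origin_calls = []


def origin(path: str) -> str:
    origin_calls.append(path)
    return f"body of {path}"


edge = EdgeCache(origin, ttl_seconds=60)
print(edge.get("/logo.png", now=0)[1])   # MISS: first request goes to origin
print(edge.get("/logo.png", now=30)[1])  # HIT: still within TTL
print(edge.get("/logo.png", now=61)[1])  # MISS: TTL expired, re-fetched
print(len(origin_calls))                 # 2 origin fetches for 3 requests
```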

CDN Services

AWS

CloudFront — AWS's CDN. 600+ PoPs. Integrates with S3, EC2, ALB. Supports Lambda@Edge for dynamic logic at the edge.

GCP

Cloud CDN — Works with Cloud Load Balancing. Also Cloud Media CDN for high-throughput streaming.

Azure

Azure Front Door — combines CDN, WAF, and global load balancing in one product. More feature-rich than a pure CDN. Also legacy Azure CDN (being retired in favour of Front Door).

Azure-Only

Azure Front Door's global load balancing (routing users to the closest healthy region based on latency, not just caching) is more tightly integrated than AWS CloudFront + Route 53 combination.

CC-M5

Security Concepts, IaC & Modern Patterns

1 Cloud Security — IAM, Encryption & Zero Trust

Identity & Access Management (IAM) — Core Concepts

IAM answers: Who are you? What can you do? To what resources?

Authentication: Proving who you are (password, MFA, API key)
Authorization: What you're allowed to do once authenticated (policies)
Principal: An entity that can make requests (user, role, service)
Principle of Least Privilege: Grant only the permissions needed for the specific task. Not "give admin and let them figure it out."
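Authorization boils down to evaluating policy statements against a request. The sketch below is a heavily simplified model of that idea: an explicit Deny always wins, an Allow is needed for access, and anything unmatched is denied by default. It ignores principals, conditions, and the other policy types that real IAM evaluation includes; the statement format and `is_allowed` are invented for illustration.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def is_allowed(statements: List[Dict], action: str, resource: str) -> bool:
    """Simplified policy check: explicit Deny > Allow > implicit deny."""
    matching = [
        s for s in statements
        if any(fnmatchcase(action, pattern) for pattern in s["actions"])
        and any(fnmatchcase(resource, pattern) for pattern in s["resources"])
    ]
    if any(s["effect"] == "Deny" for s in matching):
        return False  # an explicit Deny overrides any Allow
    return any(s["effect"] == "Allow" for s in matching)  # otherwise need an Allow


# Least privilege: read-only access to one bucket, minus a sensitive prefix.
policy = [
    {"effect": "Allow", "actions": ["s3:GetObject"],
     "resources": ["arn:aws:s3:::my-app-bucket/*"]},
    {"effect": "Deny", "actions": ["s3:*"],
     "resources": ["arn:aws:s3:::my-app-bucket/secrets/*"]},
]

print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::my-app-bucket/logo.png"))        # True
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::my-app-bucket/secrets/key.pem")) # False
print(is_allowed(policy, "s3:DeleteObject", "arn:aws:s3:::my-app-bucket/logo.png"))     # False
```

The third call shows the default: nothing grants delete access, so it is denied without needing an explicit rule. Least privilege works by adding narrow Allows on top of that default.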

Encryption

Encryption at Rest

Data encrypted while stored. If someone steals a hard drive, they get garbage. AWS does this for EBS, S3, RDS with keys managed by KMS. In S3, you can enable SSE-S3 (AWS manages key) or SSE-KMS (you manage key via KMS).

Encryption in Transit

Data encrypted while moving over a network. Uses TLS (formerly SSL). HTTPS is HTTP + TLS. Your AWS API calls are all HTTPS. Between services: use TLS wherever possible. Between on-prem and AWS: VPN or Direct Connect with MACsec.

MFA — Multi-Factor Authentication

Something you know (password) + something you have (phone/hardware key). Even if your AWS root password is stolen, attacker can't login without your MFA device. Always enable MFA on root account and all IAM users with console access.

Zero Trust Model

Traditional model: "Trust everything inside the network perimeter." Zero Trust: "Trust nothing, verify everything." Even requests from inside the VPC are not automatically trusted — authenticate and authorize every request. Implemented via mutual TLS (mTLS), service meshes (Istio), and strict IAM policies.

Security Anti-Patterns to Avoid
1. Using the root account for daily operations. Create IAM users/roles instead.
2. Hardcoding AWS credentials in code. Use IAM roles for EC2/Lambda.
3. Opening SSH port 22 to 0.0.0.0/0 (the whole world). Use SSM Session Manager instead.
4. Storing secrets unencrypted in environment variables. Use Secrets Manager.

2 Infrastructure as Code (IaC)

What is IaC and Why Does It Matter?

IaC means defining your cloud infrastructure in code files (YAML, JSON, HCL) instead of clicking through the console. You check these files into Git, review them in PRs, run them through CI/CD. Infrastructure becomes reproducible, auditable, and versionable.

Declarative

"I want 3 EC2 instances with these properties." The tool figures out HOW to make that happen. CloudFormation and Terraform work this way.

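Imperative

"First create VPC, then subnet, then EC2..." You specify exact steps, as in scripts using the AWS CLI/SDK. AWS CDK and Pulumi sit in between: you write code in a general-purpose language, but that code produces a declarative desired state that the engine then reconciles.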

Key IaC Tools

Tool	Type	Language	Multi-cloud?	Best for
AWS CloudFormation	Declarative	YAML/JSON	AWS only	AWS-native teams, no extra setup needed
Terraform	Declarative	HCL	Yes (all clouds)	Multi-cloud, most popular in industry
AWS CDK	Imperative/Declarative	Python/TS/Java	AWS only	Devs who prefer real languages over YAML
Pulumi	Imperative	Python/TS/Go/C#	Yes	Teams wanting full programming language power

Example — Terraform vs CloudFormation Both can create an S3 bucket. Terraform uses HCL: resource "aws_s3_bucket" "my_bucket" { bucket = "my-app-bucket" }. CloudFormation uses YAML with AWSTemplateFormatVersion headers and more verbose syntax. Terraform is more readable and multi-cloud but requires the Terraform binary. CloudFormation is AWS-native and has deeper service integration (like StackSets for multi-account deployments).
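
The declarative model described in this section boils down to comparing a desired state with the current state and deriving the actions. The sketch below is a toy illustration of that idea, not how Terraform or CloudFormation are implemented; the function name `plan` and the resource names are hypothetical.

```python
def plan(desired: dict, current: dict) -> list:
    """Return the actions needed to move `current` to `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources: what the code says vs. what exists right now.
desired = {"web-1": {"type": "t3.micro"}, "web-2": {"type": "t3.micro"}}
current = {"web-1": {"type": "t3.small"}, "old-worker": {"type": "t3.micro"}}

print(plan(desired, current))
```

Running the same plan again after the actions are applied yields an empty list, which is the idempotency that makes declarative tools safe to re-run.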

3 Serverless & Containers — Modern App Patterns

Serverless

You write functions, the cloud runs them. No servers to provision, patch, or manage. You pay only when code runs (per invocation + per ms of execution). Serverless ≠ no servers — there ARE servers, you just don't manage them.

Key characteristics: Event-driven (triggered by HTTP, S3 upload, queue message, schedule). Scales to zero (no traffic = no cost). Scales to millions (auto-scale). Stateless (function runs fresh each time).

Cold Start Problem When a function hasn't run recently, the cloud provider needs to spin up an execution environment and load your code. This typically takes 200ms to 2 seconds (cold start). Warm invocations skip this step and add only about a millisecond of overhead. Solutions: AWS Provisioned Concurrency (keeps environments warm, extra cost), keeping functions small (faster load), using lightweight runtimes (Python/Node usually start faster than Java).

Containers

Containers package your app + all dependencies (libraries, config, runtime) into a portable unit. Unlike VMs, containers share the host OS kernel — much more lightweight. Docker is the de-facto container standard.

VMs vs Containers

  Virtual Machines                    Containers
  ┌──────────────────────────┐        ┌──────────────────────────┐
  │ App A  │ App B  │ App C  │        │ App A  │ App B  │ App C  │
  │ Libs   │ Libs   │ Libs   │        │ Libs   │ Libs   │ Libs   │
  │ OS     │ OS     │ OS     │        ├──────────────────────────┤
  ├────────┼────────┼────────┤        │     Container Runtime    │
  │      Hypervisor          │        │     (Docker/containerd)  │
  ├──────────────────────────┤        ├──────────────────────────┤
  │     Host OS              │        │     Host OS (ONE)        │
  ├──────────────────────────┤        ├──────────────────────────┤
  │     Hardware             │        │     Hardware             │
  └──────────────────────────┘        └──────────────────────────┘
  Each VM: 1-2GB RAM overhead         Each Container: ~50MB overhead
  Slow to start (minutes)             Starts in seconds

Container Orchestration

When you run 100s of containers, you need something to manage them: scheduling, health checks, service discovery, rolling updates, secret management. Kubernetes is the industry standard.

Concept	What it means
Pod	Smallest deployable unit in Kubernetes — 1+ containers sharing network/storage
Deployment	Desired state: "run 3 replicas of this pod"
Service	Stable network endpoint for pods (pods restart with new IPs — Service is static)
Ingress	HTTP routing rules (like an L7 LB/reverse proxy for K8s)
Namespace	Logical isolation within a cluster (like separate teams/environments)

Serverless & Containers — All Providers

AWS

Serverless: Lambda | Containers: ECS, EKS (managed K8s), Fargate (serverless containers) | Container Registry: ECR

GCP

Serverless: Cloud Functions, Cloud Run (containers as serverless!) | Containers: GKE (Google Kubernetes Engine) | Registry: Artifact Registry

Azure

Serverless: Azure Functions, Container Apps | Containers: AKS (Azure Kubernetes Service), Container Instances | Registry: ACR (Azure Container Registry)

GCP-Only

Cloud Run: Deploy any container that serves HTTP and it runs serverless (scale to zero, pay per request). More flexible than a zip-based function (any language, any binary). The closest AWS options are Lambda container images, which must implement the Lambda runtime interface, and App Runner. Cloud Run still incurs cold starts when scaling up from zero.

4 CI/CD Concepts in the Cloud

What is CI/CD?

Continuous Integration (CI): Every code commit is automatically built, tested, and validated. You catch bugs immediately — not 3 months later during a manual deployment.

Continuous Delivery (CD): After CI passes, the artifact (container, zip, AMI) is automatically deployed to staging. Deployment to production requires manual approval.

Continuous Deployment: Like CD but no manual approval — changes go straight to production automatically after tests pass. Used by companies doing 100s of deploys per day.

Mermaid Diagram — CI/CD Pipeline Flow

graph LR
    A[Developer Push] --> B[Source Repo]
    B --> C[CI: Build & Test]
    C --> D{Tests Pass?}
    D -- No --> E[Notify Dev, Stop]
    D -- Yes --> F[Create Artifact]
    F --> G[Deploy to Staging]
    G --> H[Integration Tests]
    H --> I{Approved?}
    I -- Manual Approve --> J[Deploy to Prod]
    I -- Auto Deploy --> J
    J --> K[Monitor & Alert]

Deployment Strategies

Strategy	How it works	Downtime?	Rollback?	Best for
In-Place (Rolling)	Update existing servers one by one	Brief per server	Slow	Simple apps, non-critical
Blue/Green	Two identical envs. Swap DNS/LB to new version. Old stays as backup.	None	Instant (flip LB back)	Critical apps needing instant rollback
Canary	Send 5% of traffic to new version. Gradually increase if healthy.	None	Shift traffic back	Risk-sensitive features, A/B testing
Feature Flags	Deploy code disabled. Enable via config for % of users.	None	Toggle flag off	Gradual feature rollouts, experimentation
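
To make the canary row concrete, here is a minimal sketch of percentage-based routing, assuming traffic is split by hashing a user ID. Real load balancers and service meshes do the weighting for you; the function name `route` and the user IDs are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send about `canary_percent`% of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return "canary" if bucket < canary_percent else "stable"


# Simulate 10,000 hypothetical users at a 5% canary weight.
users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5) == "canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")
```

Hashing keeps each user on the same version across requests, and raising the percentage only adds users to the canary group, so nobody flips back and forth during the rollout.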

CI/CD Tools — Cloud Providers

AWS

CodeCommit (Git repo; closed to new customers in July 2024, so check its current status before adopting) | CodeBuild (CI: build & test) | CodeDeploy (CD: deploy to EC2/Lambda/ECS) | CodePipeline (orchestrates all stages). Also integrates with GitHub, GitLab, Jenkins.

GCP

Cloud Source Repositories (Git; closed to new customers since June 2024, with Secure Source Manager as Google's newer offering) | Cloud Build (CI/CD) | Artifact Registry (store artifacts) | Cloud Deploy (managed delivery pipelines to GKE, Cloud Run).

Azure

Azure DevOps (all-in-one: repos, pipelines, boards, test plans, artifacts) | Azure Pipelines (CI/CD, free for open source). Azure DevOps is more mature/unified than AWS CodePipeline family.

Azure-Only

Azure DevOps Boards: Kanban/Scrum project tracking built into the same product as CI/CD. AWS doesn't have a native project management tool — would need Jira, Linear, etc.

🟠 AWS Services & Concepts

AWS-M1

Compute

EC2 Elastic Compute Cloud

What is EC2?

EC2 is AWS's virtual machine service (IaaS). You rent a virtual server that runs on AWS hardware, choose the OS, configure storage and networking. You have full root/admin access. It's the foundation of most AWS architectures.

Key EC2 Components

Instance Types

EC2 instances come in families optimized for different workloads. The naming convention: [Family][Generation][Size] → e.g., m5.xlarge = General purpose, 5th gen, xlarge.

Family	Optimized for	Examples	Use case
t3, t4g	Burstable (credits)	t3.micro, t4g.small	Dev/test, low-traffic sites
m5, m6i, m7i	General purpose	m5.xlarge, m6i.2xlarge	Web servers, app servers, small DBs
c5, c6i, c7g	Compute-optimized	c5.2xlarge, c7g.xlarge	Batch processing, gaming, video encoding
r5, r6i, r7g	Memory-optimized	r5.4xlarge, r6i.large	In-memory DBs (Redis), big data, SAP HANA
i3, i4i	Storage-optimized	i3.xlarge, i4i.2xlarge	High IOPS workloads, Cassandra, Elasticsearch
p4, p5, g4, g5	GPU-accelerated	p4d.24xlarge, g5.xlarge	ML training, inference, 3D rendering, gaming
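
The naming convention can be checked mechanically. The sketch below splits common instance type names into family, generation, optional attribute letters (for example g for Graviton, i for Intel, d for local NVMe), and size. The regex is a simplification and an assumption of this example; it does not cover every special name such as the u- high-memory types.

```python
import re

# Simplified pattern: letters, digits, optional attribute letters, dot, size.
PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z-]*)\.(?P<size>[a-z0-9]+)$"
)


def parse_instance_type(name: str) -> dict:
    """Split an EC2 instance type name into its parts."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    parts = match.groupdict()
    parts["generation"] = int(parts["generation"])
    return parts


for name in ["m5.xlarge", "c7g.xlarge", "p4d.24xlarge"]:
    print(name, parse_instance_type(name))
```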

Graviton (arm64) Instances t4g, m7g, c7g, r7g are ARM-based instances using AWS's own Graviton chips. They are typically priced around 20% below comparable x86 instances, and AWS claims up to 40% better price-performance, though actual gains depend on the workload. If your app can run on ARM (most Linux apps can), prefer Graviton. This is a key cost-optimization lever.

AMI — Amazon Machine Image

An AMI is a pre-configured template (OS + optional installed software) used to launch EC2 instances. Like a VM snapshot you can clone. AWS provides Amazon Linux, Ubuntu, RHEL, Windows images. You can create custom AMIs (e.g., "Amazon Linux + Nginx + your app pre-installed") for faster launches — called a "Golden AMI".

User Data

A bash script that runs once on first boot. Used to install packages, download code, configure services without baking them into an AMI. Passed at launch time:

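#!/bin/bash
# Runs as root on first boot. Assumes Amazon Linux 2023; on Amazon Linux 2,
# nginx is installed with amazon-linux-extras instead.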
yum update -y
yum install -y nginx
systemctl start nginx
systemctl enable nginx

Key Pairs

RSA key pair for SSH access. AWS stores the public key; you keep the private key (.pem file). ssh -i my-key.pem ec2-user@<public-ip>. If you lose the private key, you can't SSH in anymore — AWS has no backup. Best practice: use SSM Session Manager instead of SSH (no key pair, no open port 22 needed, auditable).

Instance Metadata Service (IMDS)

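From inside an EC2 instance, you can query http://169.254.169.254/latest/meta-data/ to get instance info: instance ID, public IP, IAM role credentials, AZ, etc. Critical for automation scripts running on EC2. IMDSv2 (session-oriented, requires a token) is the more secure version. IMDSv1 still works unless it is disabled, so enforce IMDSv2 by setting HttpTokens to required in the instance metadata options.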

# Get instance ID from inside the instance
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id

# Get IAM role temporary credentials
curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRole

EC2 Pricing Models

Model	How it works	Discount vs On-Demand	Best for
On-Demand	Pay by the hour/second. No commitment.	None (baseline)	Unpredictable workloads, short-term dev/test
Reserved Instances (RI)	1-year or 3-year commitment to a specific instance type/region.	Up to 72% off	Steady-state production workloads
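Savings Plans	Commit to $X/hr of usage for 1 or 3 years. Compute Savings Plans apply across instance families, sizes, and regions (and to Fargate and Lambda); EC2 Instance Savings Plans are tied to one family in one region.	Up to 66% off (Compute) or 72% off (EC2 Instance)	Same commitment term as RIs, but more flexible in what the discount covers
Spot Instances	Use spare EC2 capacity at the current Spot price (no bidding; you may set an optional maximum price). AWS can reclaim the instance with a 2-minute notice.	Up to 90% off	Fault-tolerant, batch jobs, big data, CI runners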
Dedicated Hosts	Physical server dedicated to you. Useful for per-socket/per-core licenses.	More expensive	Compliance, BYOL software licenses
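
To see what these discounts mean in practice, here is a small arithmetic sketch. The on-demand rate of $0.10/hour is a hypothetical figure, and the discounts are the "up to" maxima from the table, so real savings will usually be lower.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(on_demand_hourly: float, discount: float = 0.0) -> float:
    """Monthly cost of one always-on instance at a given discount (0.0 to 1.0)."""
    return round(on_demand_hourly * (1 - discount) * HOURS_PER_MONTH, 2)


rate = 0.10  # hypothetical on-demand price in USD per hour
for model, discount in [("On-Demand", 0.0), ("Reserved (max)", 0.72), ("Spot (max)", 0.90)]:
    print(f"{model:<15} ${monthly_cost(rate, discount):.2f}/month")
```

With these assumptions the instance costs $73.00 on demand, $20.44 at the maximum Reserved discount, and $7.30 at the maximum Spot discount.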

Placement Groups

Controls how AWS places EC2 instances on physical hardware:

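Cluster: Pack instances close together in the same AZ for low-latency, high-throughput networking between nodes (up to 10 Gbps for a single traffic flow, more in aggregate). Use for: HPC, big data jobs needing fast node-to-node comms. Risk: the instances share nearby hardware, so a rack or AZ failure can take all of them down.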
Spread: Instances on different hardware. Reduces correlated hardware failure. Max 7 instances per AZ per group. Use for: small critical clusters of distinct VMs.
Partition: Groups of instances in different partitions (separate racks). Good for large distributed systems (Kafka, Hadoop, Cassandra) where partial failures are tolerable.

Elastic IP (EIP)

A static public IPv4 address you can allocate and associate with an EC2 instance. When an instance stops and starts, its auto-assigned public IP changes; an EIP stays fixed. Cost: since February 2024 AWS charges for every public IPv4 address (about $0.005/hour), whether or not it is attached to a running instance, so release EIPs you no longer need. Best practice: use a load balancer with a stable DNS name instead of EIPs for production.

EC2 Equivalents

GCP

Compute Engine (GCE). Similar instance types. GCP offers Spot VMs (like AWS Spot); they replace the older Preemptible VMs, which were capped at 24 hours of runtime. GCP's equivalent of AMIs is Custom Images. GCP has Committed Use Discounts (CUDs) instead of RIs/Savings Plans.

Azure

Azure Virtual Machines. Pricing: Pay-as-you-go (On-Demand), Reserved VM Instances (1 or 3 yr), Spot VMs (like AWS Spot). Azure's equivalent of AMIs are Azure VM Images (stored in Compute Gallery).

Lambda Serverless Functions

What is Lambda?

AWS Lambda lets you run code without provisioning any servers. You upload a function (zip or container image), define what triggers it, and Lambda runs it on demand. You're billed per request plus per GB-second of compute (configured memory × execution time). No code running = zero cost.
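
The billing model can be estimated with simple arithmetic. The sketch below assumes list prices of $0.20 per million requests and $0.0000166667 per GB-second (x86, us-east-1 at the time of writing; treat both as assumptions and check the current pricing page) and ignores the free tier.

```python
# Assumed list prices; verify against the current AWS pricing page.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667


def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate monthly Lambda cost from request count, duration, and memory size."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    return round(request_cost + compute_cost, 2)


# 5 million invocations per month, 200 ms each, 512 MB configured memory.
print(lambda_monthly_cost(5_000_000, 200, 512))
```

Under these assumptions the example workload costs about $9.33 per month: $1.00 for requests plus roughly $8.33 for 500,000 GB-seconds of compute.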

Key Lambda Concepts

Triggers (Event Sources)

Lambda is event-driven. Something must trigger it:

HTTP/API

API Gateway → Lambda. REST or WebSocket APIs.

S3 Events

File uploaded to S3 → Lambda. Common for image processing, ETL.

Scheduled

EventBridge cron rule → Lambda. Like cron jobs, serverless.

Queue/Stream

SQS message → Lambda. Kinesis stream → Lambda. Event processing.

DynamoDB Stream

Record change in DynamoDB → Lambda. Triggers on insert/update/delete.

SNS / EventBridge

Pub/sub messages or event bus events → Lambda. Decoupled architectures.

Execution Environment

Lambda runs your code inside a Firecracker microVM. Your function gets:

Memory: 128MB to 10GB. CPU scales proportionally with memory.
Timeout: Max 15 minutes per invocation.
/tmp storage: 512MB to 10GB ephemeral disk. It may survive between warm invocations of the same execution environment but is gone once that environment is recycled.
Ephemeral by design: Don't rely on state persisting between invocations.

Cold Start

When Lambda hasn't run recently, AWS needs to initialize the execution environment (download code, start runtime, run initialization code). This adds 200ms-2s latency. Subsequent "warm" invocations reuse the same container (~1ms overhead).

# Lambda handler (Python example)
import boto3

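# Code HERE runs once per COLD start (execution environment init)
s3_client = boto3.client('s3')  # Initialize once, reused on warm invocations

def handler(event, context):
    # Code HERE runs on EVERY invocation (warm or cold)
    # Assumes a custom event such as {"bucket": "...", "key": "..."};
    # S3 notifications nest these under event['Records'][i]['s3'] instead.
    bucket = event['bucket']
    key = event['key']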
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return {'statusCode': 200, 'body': response['Body'].read().decode()}

Layers

Lambda Layers are zip archives containing dependencies (libraries) that can be shared across multiple functions. Instead of bundling numpy in every ML Lambda, put it in a layer and reference it. Max 5 layers per function. Reduces deployment package size and enables sharing.

Concurrency

Lambda scales horizontally automatically. If 1000 events arrive simultaneously, Lambda spins up 1000 instances of your function. Default account limit: 1000 concurrent executions (soft limit, can increase). You can set Reserved Concurrency (cap a function to protect others) or Provisioned Concurrency (keep containers warm, eliminate cold starts, extra cost).
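
A common rule of thumb estimates the concurrency a function needs as request rate multiplied by average duration. The sketch below applies it and compares the result with the default account limit of 1000 mentioned above; the traffic figures are hypothetical.

```python
import math

DEFAULT_ACCOUNT_LIMIT = 1000  # default concurrent executions per account


def required_concurrency(requests_per_second: float, avg_duration_ms: float) -> int:
    """Estimate concurrent executions as request rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


# Hypothetical workloads: (requests per second, average duration in ms).
for rps, duration_ms in [(500, 300), (4000, 500)]:
    needed = required_concurrency(rps, duration_ms)
    status = "within" if needed <= DEFAULT_ACCOUNT_LIMIT else "exceeds"
    print(f"{rps} req/s at {duration_ms} ms needs {needed} ({status} default limit)")
```

The first workload needs about 150 concurrent executions, while the second needs about 2000 and would be throttled until the limit is raised or the function gets faster.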

IAM Execution Role

Each Lambda function has an execution role — an IAM role Lambda assumes to make API calls. If your function needs to read from S3, the execution role must have s3:GetObject permission. Never put AWS credentials inside Lambda code — use the execution role.

Real-World Lambda Architecture User uploads profile photo to S3 → S3 triggers Lambda → Lambda resizes image to thumbnail, saves to a different S3 path, records metadata in DynamoDB, sends SNS notification "Profile photo processed." Zero servers, auto-scales, costs pennies per million photos.
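
The first step of that pipeline is reading the bucket and key out of the S3 notification. The sketch below shows one way to do it, assuming the standard S3 event shape with a Records list; object keys arrive URL-encoded, so they are decoded before use. The helper name and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


# Trimmed-down sample event with hypothetical names.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-uploads"},
                "object": {"key": "photos/user+123/profile+pic.jpg"}}}
    ]
}
print(extract_s3_objects(sample_event))
```

A handler would loop over these pairs and call the resize, DynamoDB, and SNS steps for each one.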

Serverless Functions — Equivalents

GCP

Cloud Functions (event-driven, like Lambda) and Cloud Run (containerized serverless — more flexible, any language, scale to zero). Cloud Run is often preferred over Cloud Functions for complex apps.

Azure

Azure Functions. Same concept. Supports Consumption Plan (pay-per-use, cold starts), Premium Plan (pre-warmed, VNet integration, no cold starts), and Dedicated Plan (runs on App Service Plan). Durable Functions is an Azure-specific extension for stateful workflows written in code; the closest AWS analog is Step Functions orchestrating Lambda functions.

ECS / EKS / Fargate Container Services

AWS Container Ecosystem Overview

Container Service Decision Tree

  Need containers on AWS?
          │
          ▼
  Kubernetes or AWS-native orchestration?
  ┌──────────────────┬──────────────────┐
  │  AWS-native      │  Kubernetes      │
  │  (ECS)           │  (EKS)           │
  └────────┬─────────┴────────┬─────────┘
           │                  │
           └────────┬─────────┘
                    ▼
  Where to run the containers? (applies to both ECS and EKS)
  ┌──────────────────────────┬──────────────────────────────────┐
  │ EC2 (you manage nodes)   │ Fargate (serverless nodes)       │
  │ More control, cheaper    │ No node management, slightly     │
  │ for stable workloads     │ pricier, great for variable or   │
  │                          │ small workloads                  │
  └──────────────────────────┴──────────────────────────────────┘

ECS — Elastic Container Service

AWS's own container orchestration service. Not Kubernetes — AWS's proprietary system. Simpler to operate than EKS for pure AWS workloads.

Task Definition: JSON/YAML file defining your container(s): image URI, CPU/memory, port mappings, env vars, logging, IAM role. Think of it like a Pod spec in Kubernetes.
Task: A running instance of a Task Definition. Like a Pod.
Service: Ensures a desired number of tasks are running. Handles health checks, restarts, load balancer integration, rolling deploys. Like a Deployment + Service in K8s.
Cluster: Logical group of resources (EC2 instances or Fargate capacity) where tasks run.

# Example Task Definition (simplified JSON; account ID and role name are placeholders)
{
  "family": "my-web-app",
  "networkMode": "awsvpc",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "web",
    "image": "123456789012.dkr.ecr.ap-south-1.amazonaws.com/my-app:v1.2",
    "cpu": 256,
    "memory": 512,
    "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
    "environment": [{"name": "ENV", "value": "production"}],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {"awslogs-group": "/ecs/my-web-app", "awslogs-region": "ap-south-1", "awslogs-stream-prefix": "ecs"}
    }
  }],
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512"
}

EKS — Elastic Kubernetes Service

Managed Kubernetes. AWS runs and manages the Kubernetes control plane (API server, etcd). You manage the worker nodes (EC2 node groups) or use Fargate. Best when you need Kubernetes compatibility (standard K8s manifests, Helm charts, existing K8s tooling).

Managed Node Groups: AWS creates/updates EC2 instances as worker nodes. You pick instance type, scaling policy.
Fargate Profiles: Pods matching certain selectors run on Fargate (serverless).
Add-ons: Managed plugins like CoreDNS, kube-proxy, VPC CNI, and the EBS CSI driver. The AWS Load Balancer Controller is commonly installed separately (for example via Helm).

Fargate — Serverless Containers

Fargate is a compute engine for ECS and EKS where AWS manages the underlying EC2 instances. You just specify CPU/memory for your container — no node groups to manage, no EC2 to patch.

Use Fargate when

Variable workloads, you don't want to manage nodes, small team, serverless containers, batch jobs, don't need GPU.

Use EC2 nodes when

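Need GPU instances, need specific instance types, want EC2 Spot or Reserved Instance pricing, need local NVMe storage, need DaemonSets or privileged containers, running Windows containers on EKS (ECS on Fargate supports Windows, EKS on Fargate does not), very high compute needs.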

ECR — Elastic Container Registry

AWS's private Docker image registry. Like Docker Hub but private and integrated with IAM. Push images here, pull from ECS/EKS. ECR also scans images for security vulnerabilities. Free private repos (storage charged separately). You authenticate with: aws ecr get-login-password | docker login ...

Container Services — Equivalents

GCP

GKE (Google Kubernetes Engine, a mature managed K8s service from the company that created Kubernetes) | Cloud Run (serverless containers, like Fargate but easier) | Artifact Registry (like ECR). GCP does NOT have an ECS equivalent; it steers users to GKE or Cloud Run.

Azure

AKS (Azure Kubernetes Service) | Azure Container Apps (serverless containers, like Cloud Run, built on K8s internally) | Azure Container Instances (ACI) (simple single-container runs, like Fargate but simpler) | ACR (Azure Container Registry).

AWS-M2

Storage

S3 Simple Storage Service

What is S3?

S3 is AWS's object storage service — the most fundamental AWS service. Store any file (object) up to 5TB in size. Highly durable (99.999999999% — eleven 9s), highly available, globally accessible. Used for: static file hosting, backup, data lake, ML training data, CloudFront origin, application logs, artifacts.

Key S3 Concepts

Buckets & Objects

A bucket is a container (globally unique name). An object is the file + metadata stored in a bucket. Objects are addressed by a key (the "path"): s3://my-bucket/images/profile/user123.jpg. Despite looking like folders, S3 is flat — the "/" is just part of the key name. The "folders" you see in console are just a UI fiction (prefix grouping).
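
The "folders are a UI fiction" point can be shown without touching AWS. The sketch below mimics, in plain Python, how a listing with a prefix and a delimiter groups flat keys into apparent folders. It only illustrates the idea and is not the S3 API; the function name and keys are hypothetical.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Mimic how a prefix + delimiter listing groups flat keys into 'folders'."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything past the next delimiter is rolled up into one "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common_prefixes)


# A flat set of keys; the slashes are just characters in the key names.
keys = ["images/profile/user123.jpg", "images/banner.png", "logs/2024/app.log", "readme.txt"]

print(list_prefix(keys))             # top level
print(list_prefix(keys, "images/"))  # inside the "images/" prefix
```

At the top level the listing shows one object (readme.txt) and two apparent folders (images/ and logs/), even though the bucket holds nothing but four flat keys.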

S3 Storage Classes

S3 Storage Classes — Access Frequency vs Cost

  FREQUENTLY ACCESSED ◄──────────────────────────────► RARELY ACCESSED
  HIGHEST COST                                           LOWEST COST

Storage class	Best for	Retrieval	Minimum storage duration
S3 Standard	Any data accessed frequently	Milliseconds	None
S3 Intelligent-Tiering	Unknown or changing access patterns (objects move between tiers automatically)	Milliseconds	None
S3 Standard-IA	Backups, disaster recovery	Milliseconds	30 days
S3 Glacier Instant Retrieval	Long-term backups accessed about once a quarter	Milliseconds	90 days
S3 Glacier Deep Archive	Long-term archive, 7-10 year retention	Within 12 hours	180 days

Intelligent-Tiering If you're unsure how frequently an object will be accessed, use Intelligent-Tiering. S3 monitors access patterns and automatically moves objects between tiers. Small monthly fee per 1000 objects for this monitoring, but saves on storage cost. Best for new data lakes where access patterns are unknown.

Versioning

Enable versioning on a bucket to keep multiple versions of every object. Protects against accidental deletes and overwrites. When you delete an object, S3 adds a "delete marker" — the old version still exists. You can restore it. Once enabled, versioning can be suspended but NOT fully disabled. Versions accumulate cost — use Lifecycle rules to clean old versions.

Lifecycle Policies

Automate object transitions between storage classes or expiration:

# Example: Move to IA after 30 days, Glacier after 90 days, delete after 365 days
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
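
To reason about what a rule like this does over time, the sketch below evaluates the storage class an object would be in at a given age, using the same transition and expiration values. It is a simplified model: real lifecycle actions run asynchronously, so the actual move can lag the configured day. The function name is hypothetical.

```python
def storage_class_at(age_days: int, transitions: list, expiration_days: int) -> str:
    """Return the storage class an object is expected to be in at a given age."""
    if age_days >= expiration_days:
        return "EXPIRED"
    current = "STANDARD"
    for transition in sorted(transitions, key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


# Same values as the lifecycle rule above.
TRANSITIONS = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
]

for age in (10, 45, 120, 400):
    print(age, storage_class_at(age, TRANSITIONS, expiration_days=365))
```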

Bucket Policies vs ACLs vs IAM

Method	What it controls	Use when
IAM Policy	What an IAM user/role can do with S3	Controlling access for your AWS users/services
Bucket Policy	JSON policy on the bucket itself. Can grant cross-account access.	Granting access to other AWS accounts, making bucket public, enforcing HTTPS
ACLs	Legacy per-object permissions	Avoid if possible. Disabled by default on new buckets (Object Ownership set to bucket owner enforced).

# Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}

Pre-signed URLs

Temporarily grant access to a private object without making it public. A pre-signed URL is signed with your credentials and has an expiry. Your backend generates it and sends to a user — they can download the private file for the next 15 minutes. Used for: file downloads in apps, direct-to-S3 uploads from browser (bypasses your server).

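# Generate pre-signed URL (Python boto3)
import boto3

s3_client = boto3.client('s3')  # the URL is signed with the credentials boto3 resolves
url = s3_client.generate_presigned_url(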
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'report.pdf'},
    ExpiresIn=900  # 15 minutes
)
# Now share this URL — expires in 15 minutes automatically

S3 Replication

Cross-Region Replication (CRR): Replicate objects to a bucket in another region. For DR, compliance (EU data must also be in EU-West), lower latency for users in different regions.
Same-Region Replication (SRR): Replicate within same region to another bucket. For log aggregation, test-prod separation, compliance copies.

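Both require versioning enabled on the source and destination buckets. Replication is asynchronous (not instant). By default it does NOT replicate existing objects, only new uploads after replication is configured; use S3 Batch Replication to backfill existing objects.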

Object Storage — Equivalents

GCP

Cloud Storage. Storage classes: Standard, Nearline (monthly access), Coldline (quarterly), Archive (yearly). Has Object Lifecycle Management (like S3 Lifecycle). Signed URLs = equivalent to Pre-signed URLs. HMAC keys for S3-compatible API access.

Azure

Azure Blob Storage. Tiers: Hot (frequent), Cool (infrequent), Cold, Archive. Objects are called "blobs". Containers ≈ S3 buckets. Shared Access Signatures (SAS tokens) = equivalent to S3 Pre-signed URLs. Azure Data Lake Storage Gen2 (Blob + hierarchical namespace for analytics).

EBS Elastic Block Store

What is EBS?

EBS provides block storage volumes for EC2 instances, like a virtual hard drive. Unlike S3 (object storage accessible over HTTP), EBS appears as a raw block device to the OS (like /dev/xvda). You format it with a filesystem (ext4, xfs) and mount it. EBS volumes have a lifecycle independent of the instance: data survives a stop/start, and a volume can be detached and reattached to another instance in the same AZ. On termination, the root volume is deleted by default (DeleteOnTermination is true), while additional volumes are kept by default.

EBS Volume Types

Type	Name	Max IOPS	Max Throughput	Best for
gp3	General Purpose SSD	16,000	1,000 MB/s	Most workloads. Default. Boot volumes, dev DBs.
gp2	General Purpose SSD (legacy)	16,000	250 MB/s	Legacy — migrate to gp3 (cheaper, more flexible)
io2 Block Express	Provisioned IOPS SSD	256,000	4,000 MB/s	Mission-critical: SAP HANA, Oracle, high-perf DBs
io1	Provisioned IOPS SSD	64,000	1,000 MB/s	Production I/O-intensive databases
st1	Throughput Optimized HDD	500	500 MB/s	Big data, data warehouses, log processing
sc1	Cold HDD	250	250 MB/s	Infrequently accessed, lowest cost

gp3 vs gp2 gp2 IOPS are tied to volume size (3 IOPS/GB, so 100GB = 300 IOPS). gp3 lets you configure IOPS independently of size. A gp3 volume starts at 3,000 IOPS regardless of size. gp3 is also 20% cheaper than gp2. Always prefer gp3 for new volumes.
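
The gp2 rule is easy to express as a formula. The sketch below computes gp2 baseline IOPS from volume size, assuming the documented floor of 100 IOPS and the 16,000 cap, and compares it with the flat gp3 baseline. It ignores gp2 burst credits, which let small volumes burst above their baseline.

```python
GP3_BASELINE_IOPS = 3_000  # gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return min(max(100, 3 * size_gib), 16_000)


for size in (20, 100, 1_000, 6_000):
    print(f"{size:>5} GiB  gp2={gp2_baseline_iops(size):>6}  gp3={GP3_BASELINE_IOPS}")
```

Below 1,000 GiB a gp2 volume has a lower baseline than any gp3 volume, which is why small gp2 volumes are usually the first candidates for migration.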

EBS Snapshots

Point-in-time backup of an EBS volume, stored in S3 (you don't see this S3 bucket; it's AWS-managed). Snapshots are incremental: the first snapshot copies everything, subsequent snapshots store only changed blocks. You can create volumes from a snapshot in any AZ of the same region, which is how you move a volume between AZs. You can copy snapshots across regions (for DR). Cost: per GB-month of data stored in snapshots.

# Create snapshot via AWS CLI
aws ec2 create-snapshot --volume-id vol-0abc123 --description "Pre-deploy backup"

# Create volume from snapshot in different AZ (useful for migrating data)
aws ec2 create-volume --snapshot-id snap-0xyz789 --availability-zone ap-south-1b --volume-type gp3

EBS vs Instance Store

Feature	EBS	Instance Store
Persistence	Persists independently of instance	Data LOST when instance stops/terminates
Performance	Good (up to 256K IOPS)	Excellent (physically attached NVMe)
Cost	Separate charge per GB-month	Included in instance price
Use case	Boot volumes, databases, general storage	Temp data, buffers, cache, Kafka, Spark shuffle

Block Storage — Equivalents

GCP

Persistent Disks (standard HDD, balanced SSD, extreme SSD) and Hyperdisk (ultra-high performance). Google's equivalent of EBS. Also Local SSDs = instance store equivalent (ephemeral).

Azure

Azure Managed Disks. Types: Standard HDD, Standard SSD, Premium SSD, Ultra Disk (for SAP HANA, etc.). Azure also has Azure Shared Disks (multi-VM attach, for Windows WSFC clusters).

EFS Elastic File System

What is EFS?

EFS is a managed NFS (Network File System) that can be mounted by multiple EC2 instances simultaneously across multiple AZs. Unlike EBS (one instance at a time), EFS is shared storage. Grows and shrinks automatically — pay only for what you use. No capacity planning needed.

Key EFS Features

Multi-AZ by default: Data stored redundantly across multiple AZs. Highly durable and available.
Shared mount: 100s or 1000s of EC2 instances can mount the same EFS simultaneously. Read AND write from multiple instances.
Performance modes: General Purpose (low latency) | Max I/O (high throughput, slightly higher latency for massively parallel workloads).
Throughput modes: Elastic (auto-scales throughput with load), Bursting (throughput proportional to size), Provisioned (fix throughput independently).
Storage tiers: Standard (active) → Infrequent Access (EFS IA, cheaper) via lifecycle policies.

# Mount EFS on EC2 (Amazon Linux)
sudo yum install -y amazon-efs-utils
sudo mkdir /mnt/efs
sudo mount -t efs fs-0abc12345:/ /mnt/efs
# Or add to /etc/fstab for persistent mount:
echo "fs-0abc12345:/ /mnt/efs efs defaults,_netdev 0 0" | sudo tee -a /etc/fstab

Use EFS when

Shared content (CMS media files), home directories for multiple users, container shared storage, web farm with shared assets, machine learning training data accessed by multiple GPU nodes.

Don't use EFS when

App needs a database (use RDS), high-performance single-instance block storage (use EBS), object storage for files (use S3), very cost-sensitive (~3x more expensive than EBS per GB).

Feature	S3	EBS	EFS
Type	Object	Block	File (NFS)
Access	HTTP API / SDK	Single EC2 (usually)	Multiple EC2, multiple AZs
Durability (designed for)	99.999999999% (11 nines)	99.8-99.9% (gp3); 99.999% (io2)	99.999999999% (11 nines)
Use case	Blobs, backups, data lake	Boot disk, databases	Shared file system
Cost (approx, per GB-month)	$0.023	$0.08 (gp3)	$0.30 (Standard)

Shared File Storage — Equivalents

GCP

Filestore — managed NFS. Similar to EFS. Also Cloud Storage FUSE (mount GCS bucket as a filesystem, not true NFS).

Azure

Azure Files — managed SMB/NFS file shares. Works with Windows AND Linux. Azure NetApp Files for enterprise NAS workloads (SAP, Oracle). Azure also has Azure File Sync to sync on-prem Windows file servers with Azure Files.

Azure-Only

Azure File Sync: Extend your on-prem Windows File Server to Azure Files automatically. AWS has no direct one-to-one equivalent; the closest options are AWS Storage Gateway (File Gateway) and AWS DataSync, which work differently. Common hybrid use case for enterprises migrating file shares to cloud.

AWS-M3

Networking Deep Dive

VPC Virtual Private Cloud — Deep Dive

VPC Advanced Concepts

VPC Peering

Connect two VPCs so resources can communicate using private IPs, as if they were in the same network. Can peer across accounts and regions. Non-transitive: if VPC-A peers with VPC-B and VPC-B peers with VPC-C, VPC-A cannot talk to VPC-C through VPC-B. You'd need a direct peering or Transit Gateway.

VPC-A ←──peering──► VPC-B ←──peering──► VPC-C
EC2 in VPC-A → VPC-B: ✓ (direct peering)
EC2 in VPC-A → VPC-C: ✗ (non-transitive — no direct peering)

Transit Gateway (TGW)

A central hub that connects multiple VPCs and on-prem networks. Solves the peering mesh problem: instead of N×(N-1)/2 peering connections for N VPCs, you connect each VPC to one TGW. TGW is transitive. Think of it as a cloud router. Supports: inter-VPC, VPC-to-on-prem (via VPN/Direct Connect), inter-region peering via TGW.

Transit Gateway vs VPC Peering at Scale

  WITHOUT TGW (5 VPCs, 10 peerings needed):     WITH TGW (5 VPCs, 5 attachments):
  VPC-A ──── VPC-B                               VPC-A ──┐
  VPC-A ──── VPC-C                               VPC-B ──┤
  VPC-A ──── VPC-D                               VPC-C ──┼── Transit Gateway ── On-Prem
  VPC-A ──── VPC-E                               VPC-D ──┤
  VPC-B ──── VPC-C ... etc.                      VPC-E ──┘
  Non-transitive, complex route tables.          Central hub, transitive, one TGW.
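
The scaling difference in the diagram follows directly from the N×(N-1)/2 formula. A small sketch, with hypothetical function names, comparing the two approaches as the number of VPCs grows:

```python
def peering_connections(vpc_count: int) -> int:
    """Full mesh: every pair of VPCs needs its own peering connection."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count: int) -> int:
    """Hub and spoke: each VPC needs one Transit Gateway attachment."""
    return vpc_count


for n in (5, 10, 20, 50):
    print(f"{n:>2} VPCs: {peering_connections(n):>4} peerings vs {tgw_attachments(n):>2} attachments")
```

Five VPCs need 10 peerings, as in the diagram, but 20 VPCs already need 190, while the Transit Gateway design grows linearly.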

VPC Endpoints

Access AWS services (S3, DynamoDB, SSM, etc.) from within your VPC without traffic leaving through the internet. Traffic stays on AWS's private network. More secure and often faster.

Type	How it works	Supported services
Gateway Endpoint	Free. Route table entry routes traffic to AWS service. No ENI.	S3 and DynamoDB only
Interface Endpoint (PrivateLink)	Creates an ENI with private IP in your subnet. DNS resolves service to private IP. Charged per hour + data.	100+ services: SSM, Secrets Manager, KMS, API Gateway, ECR, and more

Real-World Example Your Lambda in a private VPC needs to call the SSM Parameter Store API. Without a VPC endpoint, it would need to route through a NAT Gateway (costly) to reach SSM's public endpoint. With an SSM VPC Interface Endpoint, traffic goes Lambda → endpoint ENI in your VPC → private AWS backbone → SSM. No NAT Gateway is needed (the endpoint has its own hourly and per-GB charge), and the traffic never touches the internet.

VPC Flow Logs

Capture information about IP traffic going to/from network interfaces in your VPC. Sent to CloudWatch Logs, S3, or Amazon Data Firehose (formerly Kinesis Data Firehose). Not real-time packet capture (use Traffic Mirroring for that): just metadata such as source/dest IP, ports, protocol, bytes, and action (ACCEPT/REJECT).

# Example flow log entry:
# version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-0abc 10.0.1.10 10.0.2.20 45678 443 6 20 4000 1620000000 1620000060 ACCEPT OK
2 123456789012 eni-0abc 1.2.3.4   10.0.1.10 12345 22  6  5  300  1620000010 1620000070 REJECT OK
# → Blocked SSH attempt from 1.2.3.4 to our server (Security Group or NACL blocked it)
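
A minimal Python sketch of parsing such records, assuming the default 14-field format shown in the header comment. The helper names are made up for illustration, and records with missing values (for example NODATA entries that contain "-") are not handled.

```python
# Field order of the default flow log format shown in the example header.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_flow_log(line):
    """Split one space-separated record into a dict with typed fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


def is_rejected_ssh(record):
    """True for a rejected TCP (protocol 6) packet aimed at port 22."""
    return (record["action"] == "REJECT"
            and record["protocol"] == 6
            and record["dstport"] == 22)


sample = ("2 123456789012 eni-0abc 1.2.3.4 10.0.1.10 "
          "12345 22 6 5 300 1620000010 1620000070 REJECT OK")
record = parse_flow_log(sample)
print(record["srcaddr"], is_rejected_ssh(record))
```

In practice you would more likely query flow logs with CloudWatch Logs Insights or Athena; the parser just makes the field layout concrete.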

NAT Gateway Details

Deployed in a public subnet with an Elastic IP
Private instances route 0.0.0.0/0 to the NAT GW → it translates their private IP to its public EIP → sends to internet
Cost: ~$0.045/hour + $0.045/GB data processed. In high-traffic envs, this adds up.
For high-availability: deploy a NAT Gateway in EACH AZ. Don't share one NAT GW across AZs (AZ failure kills outbound internet for other AZs).
NAT Instance vs NAT Gateway: NAT Instance is a self-managed EC2 instance doing NAT. Cheaper, more configurable, but you manage patching, HA. NAT Gateway is managed, scales automatically, no maintenance. Use NAT Gateway unless you have a specific reason for NAT Instance.
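
To see how NAT Gateway charges add up, here is a rough estimator using the prices quoted in the list. Assumptions: a 730-hour month, the same rate in every AZ, and no separate data transfer charges.

```python
HOURLY_USD = 0.045   # per NAT Gateway per hour (figure quoted in these notes)
PER_GB_USD = 0.045   # per GB processed (figure quoted in these notes)


def nat_gateway_monthly_cost(gb_processed, gateways=1, hours=730):
    """Estimate monthly NAT Gateway cost in USD (hourly fee + data processing)."""
    hourly_part = gateways * hours * HOURLY_USD
    data_part = gb_processed * PER_GB_USD
    return round(hourly_part + data_part, 2)


# One gateway with 1 TB of traffic, versus one gateway per AZ across 3 AZs with 5 TB.
print(nat_gateway_monthly_cost(1000))
print(nat_gateway_monthly_cost(5000, gateways=3))
```

The per-GB part dominates quickly, which is why S3 and DynamoDB traffic is usually moved onto free Gateway Endpoints.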

Route 53 DNS & Traffic Routing

What is Route 53?

AWS's managed DNS service. Also handles domain registration, health checks, and sophisticated traffic routing policies. Named after port 53 (DNS port). Has a 100% availability SLA — the only AWS service with this guarantee.

Hosted Zones

A hosted zone is a container for DNS records for a domain. Public hosted zone: records accessible over the internet (your website). Private hosted zone: records for resources within your VPC (internal service discovery — db.internal → 10.0.3.5).

Record Types

Record	Purpose	Example
A	Maps hostname to IPv4 address	api.example.com → 54.123.45.67
AAAA	Maps hostname to IPv6 address	api.example.com → 2001:db8::1
CNAME	Maps hostname to another hostname. Cannot be used on zone apex (root domain).	www.example.com → example.com
Alias	AWS-specific. Like CNAME but can be used on root domain. Points to AWS resources (ALB, CloudFront, S3). Free queries for Alias records.	example.com → my-alb.us-east-1.elb.amazonaws.com
MX	Mail exchange servers for email routing	example.com → mail1.example.com (priority 10)
TXT	Text records. Used for domain verification, SPF, DKIM.	example.com → "v=spf1 include:_spf.google.com ~all"
NS	Name server records — which DNS servers handle this zone	Automatically created by Route 53 when you create a zone
PTR	Reverse DNS: IP to hostname. The record name is the reversed IP under in-addr.arpa.	67.45.123.54.in-addr.arpa → api.example.com (reverse of 54.123.45.67)

Routing Policies

Policy	How it routes	Use case
Simple	Returns one or more IPs (round-robin if multiple). No health checks.	Basic single-resource routing
Weighted	Distribute traffic by weight (70/30). Sum doesn't need to be 100.	Blue/green deploys, A/B testing, gradual migrations
Latency	Route to region with lowest latency for the user. AWS measures latency to each region.	Multi-region apps wanting best performance for each user
Failover	Primary record → health-checked. If unhealthy, Route 53 serves the secondary.	Active-passive DR. Route to DR region on failure.
Geolocation	Route based on user's geographic location (country/continent). Strict — no match = no response unless default record exists.	Legal compliance (EU users → EU servers), localized content
Geoproximity	Route based on physical distance. Can shift traffic by adjusting bias values. Requires Traffic Flow (extra cost).	Multi-region with granular traffic shifting
Multivalue Answer	Returns up to 8 healthy records. Like Simple but with health checks per record.	Simple client-side load balancing with health checks. Not a replacement for ALB.
IP-Based	Route based on client IP CIDR ranges.	Route corporate network traffic to internal endpoints
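
For the Weighted policy, "sum doesn't need to be 100" means each record's share is its weight divided by the total. A small sketch (record names are hypothetical):

```python
def weighted_shares(weights):
    """Map each record name to its expected fraction of DNS responses."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


print(weighted_shares({"blue": 70, "green": 30}))
print(weighted_shares({"old-stack": 3, "new-stack": 1}))  # weights need not sum to 100
```

A weight of 0 removes a record from rotation without deleting it, which is handy during a gradual migration.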

Health Checks

Route 53 health checkers (globally distributed) send requests to your endpoint every 30 seconds, or every 10 seconds with fast checks at extra cost. Route 53 considers the endpoint healthy when more than 18% of checkers report it healthy; at 18% or fewer it is marked unhealthy and Failover routing activates. You can health-check: HTTP/HTTPS/TCP endpoints, CloudWatch alarms, or calculated health checks (composite of multiple checks).
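
A sketch of the threshold rule as AWS documents it (an endpoint counts as healthy when more than 18% of checkers report it healthy), combined with a Failover decision. The function names and record labels are illustrative.

```python
HEALTHY_THRESHOLD = 0.18  # endpoint is healthy if MORE than 18% of checkers agree


def endpoint_is_healthy(checker_results):
    """checker_results: list of booleans, one per health checker."""
    if not checker_results:
        return False
    healthy_fraction = sum(checker_results) / len(checker_results)
    return healthy_fraction > HEALTHY_THRESHOLD


def failover_answer(primary_results, primary="primary-alb", secondary="dr-alb"):
    """Return the record a Failover policy would serve."""
    return primary if endpoint_is_healthy(primary_results) else secondary


# 3 of 16 checkers healthy = 18.75% (healthy); 2 of 16 = 12.5% (unhealthy).
print(failover_answer([True] * 3 + [False] * 13))
print(failover_answer([True] * 2 + [False] * 14))
```

The low threshold is deliberate: a handful of checkers failing because of local network trouble should not trigger a failover.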

CloudFront CDN

What is CloudFront?

AWS's global CDN with 600+ edge locations. Accelerates delivery of static and dynamic content by caching at the edge. Also provides: DDoS protection (Shield Standard free), HTTPS termination, compression, WAF integration, Lambda@Edge for programmable edge logic.

Key CloudFront Concepts

Distribution

A CloudFront configuration object. You create one distribution per app/site. A distribution has a CloudFront domain (d1abc23efg.cloudfront.net) which you CNAME your domain to. Has one or more origins and one or more cache behaviors.

Origins

Where CloudFront fetches content when it's not cached (cache miss). Can be: S3 bucket, ALB, EC2, API Gateway, or any HTTP server. A distribution can have multiple origins.

Cache Behaviors

Rules that define how CloudFront handles requests matching a URL path pattern. Different paths can route to different origins with different cache settings:

CloudFront — Multiple Origins via Cache Behaviors

  https://example.com
  │
  ├── /api/*  ─────────────────────────► ALB → EC2 (no caching, dynamic)
  │   Cache: TTL=0, forward all headers │
  │
  ├── /static/* ────────────────────────► S3 Bucket (cached, long TTL)
  │   Cache: TTL=86400 (1 day)          │
  │
  └── /* (Default) ─────────────────────► S3 (index.html, SPA)
      Cache: TTL=300 (5 min)            │
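
The matching logic in the diagram can be sketched as "first matching path pattern wins, otherwise the default behavior". The origin names are hypothetical, and Python's `fnmatchcase` stands in for CloudFront's own wildcard matcher (the two are not guaranteed to agree on every edge case).

```python
from fnmatch import fnmatchcase

# (path pattern, origin, TTL in seconds), evaluated in order.
BEHAVIORS = [
    ("/api/*", "alb-origin", 0),
    ("/static/*", "s3-static-origin", 86400),
]
DEFAULT_BEHAVIOR = ("s3-spa-origin", 300)


def match_behavior(path):
    """Return (origin, ttl) for the first behavior whose pattern matches."""
    for pattern, origin, ttl in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin, ttl
    return DEFAULT_BEHAVIOR


for path in ("/api/users/42", "/static/app.js", "/index.html"):
    print(path, "->", match_behavior(path))
```

Because order matters, more specific patterns should be listed before broader ones.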

OAC — Origin Access Control

Allows CloudFront to access a private S3 bucket on your behalf. Users access content via CloudFront URL only — the S3 bucket can block all direct access. Prevents bucket hotlinking, enforces CloudFront caching. OAC is the modern replacement for OAI (Origin Access Identity).

Lambda@Edge & CloudFront Functions

CloudFront Functions: Ultra-lightweight JS functions running at the edge for request/response manipulation. Sub-ms latency. Free tier available. Good for: URL rewrites/redirects, add security headers, A/B testing at edge.
Lambda@Edge: Full Lambda functions replicated globally and run at CloudFront regional edge caches rather than at every PoP. More powerful (Node.js, Python), slightly higher latency. Good for: authentication at edge, dynamic content generation, API calls at edge.

// CloudFront Function example: add security headers to all responses
function handler(event) {
    var response = event.response;
    var headers = response.headers;
    headers['strict-transport-security'] = {value: 'max-age=63072000; includeSubdomains; preload'};
    headers['x-content-type-options'] = {value: 'nosniff'};
    headers['x-frame-options'] = {value: 'DENY'};
    return response;
}

ELB Elastic Load Balancing — ALB & NLB

Application Load Balancer (ALB) — Layer 7

ALB operates at HTTP/HTTPS layer. It understands your request content and can make intelligent routing decisions. Every request is terminated at the ALB (it opens a new connection to the backend). Essential for microservices architectures.

Key ALB Components

Listener: Waits on a port (80 or 443). Defines rules to route requests.
Rules: IF (conditions match) THEN (action). Conditions: path, hostname, headers, query strings, source IP, HTTP method. Actions: forward, redirect, return fixed response.
Target Groups: Collection of targets (EC2 instances, IP addresses, Lambda functions, containers). Each TG has health check configuration.

ALB — Path-Based Routing to Microservices

  Client HTTP Request
         │
         ▼
  ┌─────────────────┐
  │  ALB Listener   │ :443 (HTTPS)
  │  ─────────────  │
  │  Rule 1:         │ /users/* ────────► Target Group A (User Service)
  │  Rule 2:         │ /orders/* ───────► Target Group B (Order Service)
  │  Rule 3:         │ /api/* (host:api) ► Target Group C (API backend)
  │  Default:        │ /* ──────────────► Target Group D (Frontend SPA)
  └─────────────────┘

  Each Target Group:
  ┌───────────────────────────────────────────────────────┐
  │  EC2: i-001 (healthy ✓)  i-002 (healthy ✓)  i-003 ✗  │
  │  Health check: GET /health → 200 OK every 30s         │
  └───────────────────────────────────────────────────────┘

ALB Features

HTTPS Termination: ALB decrypts HTTPS and talks to backend via HTTP. Offloads SSL processing from backends.
Sticky Sessions: Route same user to same backend target using a cookie. Use with caution (undermines horizontal scaling).
Weighted Target Groups: Send 90% to v2 TG, 10% to v3 TG. Canary deploys without DNS changes.
Authentication: Native OpenID Connect/Cognito authentication. Reject unauthenticated requests before they hit your app.
Access Logs: Log every request to S3. Useful for traffic analysis, debugging, compliance.

Network Load Balancer (NLB) — Layer 4

NLB operates at TCP/UDP layer. Doesn't inspect packet contents. Handles millions of requests per second with ultra-low latency (<1ms). Has static IP addresses (useful for whitelisting). Supports TLS termination at L4.

Feature	ALB	NLB
OSI Layer	Layer 7 (HTTP/HTTPS)	Layer 4 (TCP/UDP/TLS)
Routing intelligence	Path, host, headers, cookies	IP + Port only
Performance	Good	Extreme (millions RPS)
Static IP	No (put Global Accelerator or an NLB in front)	Yes (one per AZ)
Protocol support	HTTP/HTTPS/WebSocket/gRPC	TCP/UDP/TLS
Price	Moderate	Moderate
Use case	Web apps, microservices, APIs	Gaming, IoT, VoIP, financial trading, TCP apps

VPN / Direct Connect Hybrid Connectivity

AWS Site-to-Site VPN

Encrypted connection between your on-premises network and your AWS VPC over the public internet. Uses IPsec. Two tunnels per VPN connection (for redundancy). Managed on AWS side by Virtual Private Gateway (VGW) or Transit Gateway. Bandwidth: ~1.25 Gbps max per tunnel, varies with internet conditions.

# VPN Connection components:
On-Prem Router/Firewall (Customer Gateway) ──IPsec Tunnel──► Virtual Private Gateway (VGW)
                                                                        │
                                                               Route table entry in VPC
                                                               10.0.0.0/8 → vgw-xxxxx

AWS Direct Connect (DX)

A dedicated physical private network connection from your datacenter to AWS. NOT over the internet — a private fiber link through an AWS Direct Connect partner (colocation facility). More expensive to set up but: consistent bandwidth, lower latency, more predictable, can carry more traffic more cheaply (data transfer pricing is lower on DX vs internet).

Feature	Site-to-Site VPN	Direct Connect
Connection type	Over internet (encrypted)	Private dedicated fiber
Setup time	Hours (AWS console + router config)	Weeks to months (physical provisioning)
Bandwidth	Up to ~1.25 Gbps per tunnel (variable, internet-dependent)	Dedicated ports at 1, 10, or 100 Gbps (hosted connections at lower speeds), consistent
Cost	Low (hourly + data transfer)	High (port fee + partner fee + data)
Reliability	Internet outages affect it	Dedicated — very reliable
Latency	Variable	Consistent and low
Use case	Small/medium orgs, dev, backup link	Enterprise hybrid cloud, large data transfers, compliance

Best Practice: DX + VPN Backup Use Direct Connect as the primary link and a VPN connection as a failover. If DX goes down, traffic automatically fails over to the VPN (slower but encrypted). This gives you the performance of DX with the resilience of VPN as backup.

Hybrid Connectivity — Equivalents

GCP

Cloud VPN (like Site-to-Site VPN) | Cloud Interconnect (like Direct Connect). Cloud Interconnect types: Dedicated Interconnect (100 Gbps!) and Partner Interconnect.

Azure

Azure VPN Gateway (like Site-to-Site VPN) | Azure ExpressRoute (like Direct Connect). ExpressRoute also has ExpressRoute Global Reach, which links your on-prem sites to each other over the Microsoft backbone through their ExpressRoute circuits. The closest AWS counterpart is Direct Connect SiteLink.

AWS-M4

IAM & Security

IAM Identity & Access Management

Core IAM Entities

IAM is a free, global service — it's not region-specific. IAM controls who can do what on which AWS resources. Everything in AWS is an API call, and every call goes through IAM for authorization.

IAM User

A person or application with permanent long-term credentials (password + access keys). Represents one specific identity. Avoid creating users for services — use roles instead.

IAM Group

Collection of users. Attach policies to groups, not individual users. E.g., "Developers" group has S3 + EC2 read. Add a new dev → add to group. Remove dev → remove from group. Clean, scalable.

IAM Role

An identity with permissions, but NO permanent credentials. Assumed temporarily by users, AWS services (EC2, Lambda), or other accounts. Credentials are auto-rotated. Preferred over users for services.

IAM Policy

JSON document defining what actions are allowed/denied on which resources. Attached to users, groups, or roles. AWS-managed policies (maintained by AWS) or customer-managed (you control them).

IAM Policy Structure

Annotated example (the // notes are explanatory only; JSON does not allow comments, so strip them before use):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListingReportsPrefix",      // Optional statement ID
      "Effect": "Allow",                       // "Allow" or "Deny"
      "Action": "s3:ListBucket",               // What action is allowed
      "Resource": "arn:aws:s3:::my-company-bucket",   // ListBucket targets the bucket itself
      "Condition": {                           // Optional: extra conditions
        "StringLike": {
          "s3:prefix": "reports/*"             // Only list keys under "reports/" (s3:prefix applies to ListBucket only)
        }
      }
    },
    {
      "Sid": "AllowReadingReportObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-company-bucket/reports/*"   // Objects are scoped by ARN path, not by s3:prefix
    },
    {
      "Sid": "DenyAllDeletes",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-company-bucket/*"
    }
  ]
}

IAM Policy Types

Policy Type	Attached to	Purpose
Identity-based	User, Group, Role	What that identity can do
Resource-based	Resource (S3 bucket, Lambda, SQS)	Who can access this resource (enables cross-account)
Permission Boundary	User or Role	Maximum permissions ceiling. Even if identity has broader policy, boundary limits it.
SCP (Service Control Policy)	AWS Organization Account/OU	Max permissions for all accounts in an org. Even account root can't exceed SCP.
Session Policy	AssumeRole call	Further restrict permissions for a specific role session

Policy Evaluation Logic

IAM Authorization — Evaluation Order

  Request arrives → Check for explicit DENY in any policy
                           │
                    Yes: DENY ✗ (Deny wins, always)
                           │
                    No: Check if SCP allows (Organizations)
                           │
                    No: DENY ✗
                           │
                    Yes: Check for explicit ALLOW
                           │
                    No: Implicit DENY ✗ (default deny)
                           │
                    Yes: Check Permission Boundary
                           │
                    No: DENY ✗
                           │
                    Yes: ALLOW ✓

  Rule: EXPLICIT DENY always wins. Default is DENY.
  You must explicitly ALLOW everything you want permitted.
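
A toy evaluator makes the core rule concrete: explicit Deny wins, then explicit Allow, otherwise implicit Deny. This is a deliberately simplified sketch: it ignores Conditions, SCPs, permission boundaries and resource-based policies, and it uses case-sensitive glob matching where real IAM treats action names case-insensitively.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """Accept a single pattern or a list, as IAM policy JSON does."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def evaluate(statements, action, resource):
    """Return the decision for one request against a list of statements."""
    allowed = False
    for stmt in statements:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return "DENY (explicit)"   # an explicit Deny always wins
            allowed = True
    return "ALLOW" if allowed else "DENY (implicit)"


POLICY = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::my-company-bucket/*"},
]
obj = "arn:aws:s3:::my-company-bucket/reports/q1.csv"
print(evaluate(POLICY, "s3:GetObject", obj))
print(evaluate(POLICY, "s3:DeleteObject", obj))
print(evaluate(POLICY, "ec2:StartInstances", "*"))
```

Even with a blanket `s3:*` Allow, the delete is refused, and anything not mentioned at all falls through to the implicit deny.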

IAM Roles — The Key Pattern

Instead of creating a user for your EC2 instance and storing access keys on the server (dangerous: keys can leak), you attach an IAM Role to EC2. EC2 automatically gets temporary credentials via IMDS (the Instance Metadata Service), and they are rotated automatically before they expire. Lambda, ECS tasks, and other services follow the same pattern through their own credential delivery mechanisms.

# BAD: Access keys hardcoded or in environment (never do this)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# GOOD: Use IAM Role attached to the EC2/Lambda/ECS task
# boto3 automatically fetches temp credentials from IMDS
import boto3
s3 = boto3.client('s3')  # No credentials needed — role creds used automatically
s3.get_object(Bucket='my-bucket', Key='file.txt')

Cross-Account Access with Roles

A role in Account B can be assumed by Account A's resources. This is how centralized tooling (one DevOps account managing multiple app accounts) works. The trust policy on the role in Account B says "allow Account A's role X to assume me."

# Trust Policy on Role in Account B (the target role)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111111111111:role/DeployRole"  // Account A's role
    },
    "Action": "sts:AssumeRole"
  }]
}

# In Account A, assume the role:
aws sts assume-role \
  --role-arn "arn:aws:iam::222222222222:role/DeployTargetRole" \
  --role-session-name "deploy-session-$(date +%s)"

MFA (Multi-Factor Authentication)

Virtual MFA: Authenticator app (Google Authenticator, Authy)
Hardware MFA: Physical TOTP device or FIDO2 security key (YubiKey)
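Always enable MFA on the root user: root has full access to its account. SCPs can restrict root in member accounts, but nothing restricts the root user of the Organizations management account.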
You can enforce MFA for specific actions via policy condition: "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}

Root Account Security The AWS root account (the email you signed up with) has unrestricted access within its account. SCPs do apply to the root users of member accounts, but not to the management account. Best practices: Enable MFA on root immediately. Create a strong password. Store root credentials in a password manager or vault. Never use root for day-to-day operations. Create IAM admin users for regular work.

IAM Equivalents

GCP

Cloud IAM. Key difference: GCP uses Roles (not policies) as the primary permission unit. Predefined roles (like AWS managed policies), custom roles. Service Accounts = IAM Roles for services. Workload Identity Federation = allows external identities (GitHub Actions, on-prem) to access GCP without service account keys — similar to AWS OIDC federation.

Azure

Azure RBAC (Role-Based Access Control). Built-in roles: Owner, Contributor, Reader, plus 100+ service-specific roles. Service Principals = IAM Roles for services. Managed Identities (System-assigned or User-assigned) = equivalent to EC2 IAM roles — no credentials stored. Azure AD / Entra ID is the identity provider (IAM is separate from directory in AWS, Azure integrates them).

Azure-Only

Azure Active Directory (Entra ID): Azure integrates identity directory (user management, SSO, conditional access) directly with RBAC. In AWS, you'd use IAM + AWS SSO (IAM Identity Center) + potentially an external IdP (Okta, Azure AD itself). Many companies use Azure AD as their IdP even for AWS.

KMS / Secrets Manager Key & Secrets Management

AWS KMS — Key Management Service

KMS is a managed service for creating and controlling encryption keys. It's the central key vault for all AWS encryption. When you "enable encryption" in S3, EBS, RDS — they're using KMS keys under the hood.

Key Types

Key Type	Who manages	Rotation	Cost	Use when
AWS Managed Key	AWS (auto-created per service)	Auto (1 yr)	Free	Basic encryption, fine for most cases
Customer Managed Key (CMK)	You	Auto or manual	$1/month + API calls	Need control, cross-account, custom key policy, audit
AWS CloudHSM	You (hardware module)	You manage	$$$	Strict compliance (FIPS 140-2 Level 3), custom HSM

Envelope Encryption

KMS doesn't encrypt your 5GB file directly: the KMS Encrypt API accepts at most 4 KB of data, and key material never leaves KMS. Instead: KMS generates a Data Encryption Key (DEK) and returns it in both plaintext and encrypted form. Your code uses the plaintext DEK to encrypt the actual data, then discards it. The encrypted copy of the DEK is protected by a KMS key (the "master key"). You store the encrypted DEK alongside the encrypted data. To decrypt: call KMS to decrypt the DEK, then use the DEK to decrypt the data. The master key never leaves KMS.
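
The flow can be sketched with standard-library Python. Loud assumptions: `FakeKms` is a hypothetical in-memory stand-in for KMS, and the XOR keystream below is a TOY cipher used only to keep the example dependency-free. It is not secure; real code would call KMS for the data key and use an authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets


def _toy_xor(key, data):
    """TOY cipher, NOT secure: XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class FakeKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        """Return (plaintext DEK, DEK encrypted under the master key)."""
        dek = secrets.token_bytes(32)
        return dek, _toy_xor(self._master_key, dek)

    def decrypt(self, encrypted_dek):
        return _toy_xor(self._master_key, encrypted_dek)


def envelope_encrypt(kms, plaintext):
    dek, encrypted_dek = kms.generate_data_key()
    ciphertext = _toy_xor(dek, plaintext)  # data is encrypted locally with the DEK
    return {"ciphertext": ciphertext, "encrypted_dek": encrypted_dek}


def envelope_decrypt(kms, envelope):
    dek = kms.decrypt(envelope["encrypted_dek"])  # only the small DEK goes to KMS
    return _toy_xor(dek, envelope["ciphertext"])


kms = FakeKms()
envelope = envelope_encrypt(kms, b"quarterly-report contents " * 100)
print(len(envelope["encrypted_dek"]), "byte encrypted DEK stored next to the data")
print(envelope_decrypt(kms, envelope)[:16])
```

Only the 32-byte DEK ever goes to the key service, however large the payload is. That is what makes the pattern practical for multi-gigabyte objects.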

AWS Secrets Manager

Centralized, encrypted storage for secrets: database passwords, API keys, OAuth tokens. Auto-rotates secrets (can trigger a Lambda to rotate passwords in RDS). Applications retrieve secrets at runtime via API — no hardcoded passwords in code.

# Retrieve secret at runtime (Python)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
secret = client.get_secret_value(SecretId='prod/myapp/db-password')
db_creds = json.loads(secret['SecretString'])
db_host = db_creds['host']
db_pass = db_creds['password']

# Rotation is configured on the secret, not in the application (e.g., every 30 days):
# Secrets Manager invokes a rotation Lambda that updates the RDS user password automatically

Secrets Manager vs SSM Parameter Store

Feature	Secrets Manager	SSM Parameter Store
Cost	$0.40/secret/month + API calls	Free (Standard) / $0.05/adv param
Auto-rotation	Yes (built-in for RDS, Redshift, DocumentDB)	No (manual or custom Lambda)
Encryption	Always encrypted (KMS)	Optional (use SecureString type for encrypted)
Cross-account	Yes, with resource policy	Advanced-tier parameters only (shared via AWS RAM)
Best for	Database passwords, API keys, credentials requiring rotation	App configs, feature flags, non-secret parameters

Secrets & Key Management — Equivalents

GCP

Secret Manager (like Secrets Manager) | Cloud KMS (like AWS KMS) | Cloud HSM (like CloudHSM). GCP Secret Manager also supports version control of secrets.

Azure

Azure Key Vault — combines secrets, keys, AND certificates in one service (AWS splits these: Secrets Manager + KMS + ACM). Key Vault has Managed HSM tier for FIPS 140-2 Level 3. Azure App Configuration is like SSM Parameter Store for feature flags and app settings.

Azure-Only

Azure Key Vault Certificates: Key Vault can manage the full TLS certificate lifecycle — request, renew, store, deploy. AWS splits this: ACM (Certificate Manager) for provisioning/renewal, Secrets Manager for custom cert storage.

WAF / Shield / GuardDuty Threat Protection

AWS WAF — Web Application Firewall

WAF protects your web apps from common exploits at the application layer (L7). Works with CloudFront, ALB, API Gateway, AppSync. You define rules that filter HTTP requests.

Built-in rule groups (AWS Managed Rules): SQL injection protection, XSS protection, known bad IPs, AWS IP reputation lists. You can also write custom rules: "Block all requests where URI contains ../" or "Rate limit to 1000 req/5min per IP."

AWS Shield

Tier	Cost	Protection
Shield Standard	Free (automatic)	L3/L4 DDoS protection for all AWS resources. Protects against SYN floods, UDP reflection, etc.
Shield Advanced	$3,000/month (1-year commitment) plus data transfer fees	Enhanced DDoS protection, 24/7 Shield Response Team (SRT, formerly the DDoS Response Team), cost protection (AWS credits scale-out costs caused by DDoS), advanced metrics.

Amazon GuardDuty

ML-based threat detection service that continuously monitors your AWS account for malicious activity and unusual behavior. Analyzes: VPC Flow Logs, DNS query logs, CloudTrail management events, CloudTrail S3 data events, EKS audit logs. Detects: compromised EC2 instances communicating with known bad IPs, unusual API calls, credential theft, S3 data exfiltration patterns.

Enable GuardDuty on Every Account GuardDuty is pay-per-use (per GB of log data analyzed), has a 30-day free trial, and requires literally zero configuration to start getting value. Enable it and connect to Security Hub for centralized findings. It's one of the highest-value-per-effort security services in AWS.

AWS Security Hub

Central security dashboard aggregating findings from GuardDuty, Inspector, Macie, Firewall Manager, and third-party tools. Runs automated compliance checks against CIS AWS Foundations, PCI-DSS, and other standards. Gives you a security score and prioritized findings list.

Other Key Security Services

Service	What it does
Amazon Inspector	Vulnerability scanning for EC2 instances, Lambda functions, and container images in ECR. Continuously scans for CVEs and network exposure. ECR scan findings can be used in your pipeline to gate deployment of vulnerable images.
Amazon Macie	ML-based data security for S3. Discovers and protects sensitive data: PII (names, SSNs, credit cards, passports). Alerts you if sensitive data is in a public bucket.
AWS Config	Continuous resource configuration recording. "Who changed what, when?" Compliance rules: "All S3 buckets must have encryption enabled." Alerts on drift.
AWS CloudTrail	Audit log of AWS API calls: who made the call, from which IP, when, what changed. The "flight recorder" of your AWS account. Event history keeps the last 90 days of management events by default; create a trail that delivers to S3 for long-term retention.

Security Services — Equivalents

GCP

Cloud Armor (WAF + DDoS) | Security Command Center (like Security Hub + GuardDuty) | Cloud Audit Logs (like CloudTrail) | Container Analysis (like Inspector for containers).

Azure

Azure WAF (part of App Gateway or Front Door) | Azure DDoS Protection (Standard = like Shield Advanced) | Microsoft Defender for Cloud (like GuardDuty + Security Hub combined) | Azure Monitor Activity Log (like CloudTrail).

AWS-M5

Databases

RDS Relational Database Service

What is RDS?

RDS is a managed relational database service. AWS handles: OS patching, DB engine upgrades (with your approval), automated backups, replication, failover. You just connect and query. Supported engines: MySQL, PostgreSQL, MariaDB, Oracle, Microsoft SQL Server, IBM Db2, and Amazon Aurora (custom AWS engine).

Key RDS Concepts

Multi-AZ Deployment

The most important RDS HA feature. When enabled, AWS automatically maintains a synchronous standby replica in a different AZ. If primary fails, RDS automatically fails over to standby. Failover takes 60-120 seconds (DNS update). The standby is NOT accessible for reads — it's purely for failover. Separate from Read Replicas.

RDS Multi-AZ vs Read Replicas

  MULTI-AZ (High Availability):           READ REPLICAS (Scalability):
  ┌──────────────────────────────┐        ┌──────────────────────────────┐
  │ AZ-1a: Primary RDS  ──sync──►│        │ Primary ──async──► Replica 1 │
  │         Read+Write   ◄failover│        │  (R+W)  ──async──► Replica 2 │
  │                              │        │         ──async──► Replica 3 │
  │ AZ-1b: Standby RDS           │        │                              │
  │         (NOT accessible)     │        │ Replicas: READ ONLY          │
  └──────────────────────────────┘        │ Can be in different region!  │
  For: automatic failover / HA            └──────────────────────────────┘
                                          For: scale out reads, reports,
                                          analytics, DR (promote to master)

Read Replicas

Asynchronous copies of your primary DB, used to offload read traffic. Up to 15 read replicas for Aurora and for RDS MySQL, MariaDB, and PostgreSQL; 5 for Oracle and SQL Server. Can be in a different region (cross-region read replicas for DR). In a disaster, promote a read replica to a standalone DB, which becomes the new primary.

Automated Backups

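Daily automated backup during your backup window, a separate setting from the maintenance window (full DB snapshot + transaction logs)
Retention is configurable from 1 to 35 days (console default 7; setting 0 disables automated backups). Older backups are deleted automatically.
Point-in-time recovery: restore to any second within the backup retention period, up to the latest restorable time (typically within the last 5 minutes)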
Manual snapshots: you control them, persist indefinitely until you delete them
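
The point-in-time recovery window is simple date arithmetic. A sketch, assuming the 1 to 35 day retention range from the list and ignoring the few minutes of lag before the latest restorable time (function names are illustrative):

```python
from datetime import datetime, timedelta, timezone


def earliest_restore_time(now, retention_days):
    """Oldest moment you can restore to, given the backup retention period."""
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    return now - timedelta(days=retention_days)


def can_restore_to(target, now, retention_days):
    """True if the target time falls inside the retention window."""
    return earliest_restore_time(now, retention_days) <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
bad_deploy = datetime(2024, 6, 25, 9, 30, tzinfo=timezone.utc)
print(earliest_restore_time(now, 7))
print(can_restore_to(bad_deploy, now, 7))
```

If an incident might go unnoticed for more than a week, a 7-day retention is too short; manual snapshots cover the longer horizon.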

RDS Proxy

A fully managed, highly available database proxy that sits between your app and RDS. Why use it? Lambda functions opening thousands of connections overwhelm RDS (too many connections). RDS Proxy pools and reuses connections — Lambda connects to Proxy, Proxy maintains a small pool to RDS. Also speeds up failover: clients connect to Proxy endpoint which auto-routes to healthy instance.

Storage Auto Scaling

Enable and set a maximum storage limit. If your DB is about to run out of disk space, RDS automatically scales up storage without downtime. You can never shrink it back (only grow). Set a high maximum and don't worry about disk again.

Amazon Aurora

AWS's custom-built cloud-native relational DB. MySQL and PostgreSQL compatible — your app code doesn't change. But it's re-engineered from scratch for cloud performance and resilience.

Feature	Standard RDS (MySQL)	Aurora MySQL
Storage redundancy	Single AZ volume (Multi-AZ adds standby)	6 copies across 3 AZs by default
Read Replicas	Up to 15	Up to 15 (Aurora Replicas)
Failover	60-120 seconds	~30 seconds (in-cluster replicas)
Performance	Baseline MySQL	Up to 5x MySQL throughput (AWS's claim)
Cost	Lower	~20% more than RDS
Max storage	Up to 64TB	Up to 128TB, auto-scales

Aurora Serverless v2

Aurora that scales capacity in fine-grained increments (in 0.5 ACU steps from 0.5 to 256 ACUs) based on actual demand, in seconds. No pre-provisioning. Pay per second of actual ACU usage. Perfect for: unpredictable workloads, dev/test, multi-tenant SaaS with variable tenant load.

Managed Relational DB — Equivalents

GCP

Cloud SQL (managed MySQL, PostgreSQL, SQL Server; like standard RDS) | AlloyDB (like Aurora: PostgreSQL-compatible, high performance; Google claims up to 4x faster than standard PostgreSQL for transactional workloads). Also Cloud Spanner, a globally distributed SQL database (unique, no AWS equivalent).

Azure

Azure SQL Database (managed SQL Server) | Azure Database for MySQL/PostgreSQL (like standard RDS) | Azure SQL Managed Instance (SQL Server with near-100% compatibility, for lift-and-shift). Azure's Hyperscale tier is similar to Aurora in concept.

GCP-Only

Cloud Spanner: Globally distributed, horizontally scalable relational DB with ACID transactions across regions. No true equivalent in AWS or Azure (AWS DocumentDB is NoSQL, and global Aurora has limits). Used by Google for their own core infrastructure.

DynamoDB NoSQL Database

What is DynamoDB?

DynamoDB is AWS's fully managed NoSQL key-value and document database. No servers, no OS, no capacity planning. Single-digit millisecond performance at any scale. Used by Amazon itself for their shopping cart, sessions, order management. Built for internet-scale applications.

Core Concepts

Tables, Items, Attributes

DynamoDB is schemaless (except for keys). A Table holds Items (like rows), each with Attributes (like columns). No fixed schema — different items can have different attributes. Only the primary key is required.

Primary Key Types

Simple Primary Key (Partition Key only)

Single attribute used as the primary key. Must be unique. Used when you query by a single ID.
Example: userId as partition key. Query: "Give me all data for userId=U123"

Composite Primary Key (Partition + Sort Key)

Two attributes together are unique. Multiple items can share partition key but must have different sort keys. Enables range queries.
Example: userId (partition) + orderDate (sort). Query: "Give me all orders for userId=U123 in 2024"
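
A toy in-memory model shows why the composite key enables range-style queries: items are grouped by partition key and ordered by sort key. `ToyTable` is purely illustrative and is not the DynamoDB API.

```python
from collections import defaultdict


class ToyTable:
    """Items grouped by partition key, then addressed by sort key."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # pk -> {sk: item}

    def put_item(self, pk, sk, item):
        # The (pk, sk) pair is unique: writing it again overwrites the item.
        self._partitions[pk][sk] = item

    def query(self, pk, sk_begins_with=""):
        """Return one partition's items in sort-key order, filtered by prefix."""
        partition = self._partitions.get(pk, {})
        return [partition[sk] for sk in sorted(partition)
                if sk.startswith(sk_begins_with)]


orders = ToyTable()
orders.put_item("U123", "2023-12-30", {"total": 10})
orders.put_item("U123", "2024-01-15", {"total": 20})
orders.put_item("U123", "2024-03-02", {"total": 30})
orders.put_item("U999", "2024-02-01", {"total": 99})
print(orders.query("U123", sk_begins_with="2024"))
```

Storing dates as ISO strings keeps lexical order equal to chronological order, which is what makes the prefix query behave like "all orders in 2024".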

Read Capacity Units (RCU) and Write Capacity Units (WCU)

DynamoDB bills on throughput. 1 RCU = one strongly consistent read per second (or two eventually consistent reads per second) of an item up to 4 KB. 1 WCU = one write per second of an item up to 1 KB. Larger items consume additional units, rounded up. You either provision RCU/WCU (predictable, cheaper) or use On-Demand mode (pay per request, no planning, costlier per request but no idle waste).
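
The unit arithmetic is easy to get wrong by hand, so here is a small calculator. It assumes the standard rounding rules (item size rounds up to the next 4 KB for reads and the next 1 KB for writes) and leaves out transactional operations, which cost double.

```python
import math


def read_capacity_units(reads_per_sec, item_kb, strongly_consistent=True):
    """RCUs needed: each read costs ceil(size / 4 KB) units, halved if eventual."""
    units_per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * units_per_read
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_sec, item_kb):
    """WCUs needed: each write costs ceil(size / 1 KB) units."""
    return writes_per_sec * math.ceil(item_kb)


# 10 reads/sec of 6 KB items, and 5 writes/sec of 2.5 KB items.
print(read_capacity_units(10, 6))                             # 20
print(read_capacity_units(10, 6, strongly_consistent=False))  # 10
print(write_capacity_units(5, 2.5))                           # 15
```

Note how a 6 KB item costs as much to read as an 8 KB one. Keeping items under a 4 KB boundary can halve the read bill.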

Global Secondary Indexes (GSI)

Query your DynamoDB table on a different attribute. If your table's partition key is userId, but you need to query "all users who signed up on date X" — create a GSI with signupDate as partition key. GSIs have their own RCU/WCU separate from the main table.

DynamoDB Streams

A time-ordered stream of item-level changes (inserts, updates, deletes) in a DynamoDB table. Retained for 24 hours. Trigger Lambda functions on changes — powerful for: replication, cache invalidation, event sourcing, audit logs.

DynamoDB Accelerator (DAX)

In-memory cache for DynamoDB. API-compatible — swap your DynamoDB client for a DAX client, same code. Reduces read latency from single-digit ms to microseconds. Handles millions of reads per second. Use for: high-read, cost-sensitive workloads (DAX reads are cheaper than DynamoDB reads at high volume).

Global Tables

Multi-region, multi-active DynamoDB. Write to any region, DynamoDB replicates to others within seconds. Last-writer-wins conflict resolution. Perfect for: global apps needing local read/write latency everywhere, multi-region active-active architecture.
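
Last-writer-wins can be sketched as "keep the version with the newest timestamp". The `updated_at` field and region labels are hypothetical; the real service tracks write times internally.

```python
def last_writer_wins(versions):
    """Return the version with the newest timestamp (ties keep the first seen)."""
    return max(versions, key=lambda version: version["updated_at"])


# The same item was updated in two regions before replication caught up.
conflicting = [
    {"region": "us-east-1", "updated_at": 1700000000.120, "status": "shipped"},
    {"region": "eu-west-1", "updated_at": 1700000000.450, "status": "cancelled"},
]
print(last_writer_wins(conflicting)["status"])  # cancelled
```

The earlier write is silently discarded, so if concurrent updates to the same item must be merged rather than overwritten, the data model has to avoid writing the same item from two regions.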

When to use DynamoDB vs RDS

Use DynamoDB when: access patterns are known and simple (get by key, query by key + sort), you need massive scale (millions of TPS), no complex SQL queries are needed, you need single-digit ms latency at any scale, or the architecture is fully serverless. Use RDS when: you need complex queries, JOINs, ACID transactions across multiple tables, SQL, or reporting/analytics, or when access patterns are unknown or still evolving.

NoSQL DB — Equivalents

GCP

Cloud Firestore (document NoSQL, like DynamoDB but more flexible querying) | Cloud Bigtable (wide-column NoSQL, Apache HBase compatible, for massive analytics). No exact DynamoDB equivalent — Firestore is closest for serverless apps.

Azure

Azure Cosmos DB — multi-model NoSQL (document, key-value, graph, column-family) with multi-region active-active. More flexible than DynamoDB. Supports multiple APIs: Core (SQL), MongoDB, Cassandra, Gremlin, Table. 99.999% availability SLA.

Azure-Only

Cosmos DB's multi-model support: one service exposes MongoDB, Cassandra, Gremlin, Table, and SQL (NoSQL) APIs. The API is chosen per Cosmos DB account at creation time, so you run one account per API rather than mixing APIs inside a single account. Existing MongoDB or Cassandra drivers usually work with little or no change. AWS has separate services for each (DynamoDB, DocumentDB for MongoDB, Keyspaces for Cassandra).

ElastiCache In-Memory Caching

What is ElastiCache?

Managed in-memory caching service. Two engines: Redis and Memcached. Dramatically reduces database load and latency by serving frequent reads from memory (microseconds) instead of disk (milliseconds).

Redis vs Memcached on ElastiCache

Feature	Redis	Memcached
Data structures	Strings, hashes, lists, sets, sorted sets, bitmaps, geospatial, streams	Simple key-value strings only
Persistence	Yes (RDB snapshots, AOF logs)	None (restart = all data lost)
Replication	Yes (primary + replicas)	No
Multi-AZ Failover	Yes	No
Pub/Sub	Yes	No
Cluster mode	Yes (sharding)	Yes
Use cases	Sessions, leaderboards, rate limiting, pub/sub, queues, ML	Simple cache (horizontal scaling, multi-threaded)

Choose Redis Almost Always

Unless you have a specific need for Memcached's multi-threaded horizontal scaling or already use Memcached, Redis is the better choice: more features, persistence, HA. In practice, most teams use Redis.

Common Caching Patterns

# Lazy Loading (Cache-Aside): most common pattern
import json
import redis

# Placeholder endpoint; point this at your ElastiCache Redis endpoint
cache = redis.Redis(host="localhost", port=6379)
# `db` below is a placeholder for your own database client

def get_user(user_id):
    # Try cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)  # Cache HIT

    # Cache MISS: query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # Store in cache with TTL (expiry)
    cache.setex(f"user:{user_id}", 3600, json.dumps(user))  # Cache 1 hour

    return user

# Write-Through: update the cache on every DB write so reads stay fresh
def update_user(user_id, data):
    db.update("UPDATE users SET ... WHERE id = ?", user_id, data)
    cache.setex(f"user:{user_id}", 3600, json.dumps(data))  # Always fresh

Managed Cache — Equivalents

GCP

Memorystore for Redis and Memorystore for Memcached — same concept. Also Memorystore for Redis Cluster for large-scale sharding.

Azure

Azure Cache for Redis — same concept. Tiers: Basic (single node, no SLA), Standard (primary+replica), Premium (Redis Cluster, persistence, VNet injection), Enterprise (Redis Enterprise software, higher performance).

AWS-M6

Monitoring & Observability

CloudWatch Metrics, Logs & Alarms

What is CloudWatch?

CloudWatch is AWS's unified observability platform. It collects metrics, logs, traces, and events from AWS services and your applications. Like a central nervous system for your AWS environment. Three pillars: Metrics (what's happening), Logs (what happened), Alarms (alert when something's wrong).

CloudWatch Metrics

Numeric data points over time. AWS services automatically push metrics: EC2 CPU%, RDS connections, ALB request count, Lambda errors. You can publish custom metrics from your application code.

Metric	Service	What to monitor
CPUUtilization	EC2	Alert if >80% sustained for 5min → need to scale
DatabaseConnections	RDS	Alert if near max_connections limit
RequestCount, TargetResponseTime	ALB	Alert on traffic spikes or high latency
Errors, Duration, Throttles	Lambda	Alert on elevated error rate or timeouts
ApproximateNumberOfMessagesVisible	SQS	Alert if messages are accumulating (consumers slow)
BucketSizeBytes, NumberOfObjects	S3	Storage growth tracking (daily granularity)

EC2 Default vs Detailed Monitoring

By default, EC2 sends metrics every 5 minutes (basic monitoring, free). Enable detailed monitoring for 1-minute granularity ($0.30/metric/month). For auto-scaling decisions, a 5-minute lag can be too slow, so enable detailed monitoring on production.

CloudWatch Logs

Centralized log storage and analysis. Logs are organized in Log Groups (one per app/service), which contain Log Streams (one per instance/invocation). Lambda, ECS, and other services push logs automatically. EC2 needs the CloudWatch Agent installed to push logs.

# CloudWatch Agent config (simplified) — push /var/log/nginx/access.log
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/nginx/access.log",
          "log_group_name": "/ec2/nginx/access",
          "log_stream_name": "{instance_id}",
          "timestamp_format": "%d/%b/%Y:%H:%M:%S %z"
        }]
      }
    }
  }
}

CloudWatch Logs Insights

Query language for analyzing logs. Like SQL for your logs. Very useful for debugging:

# Find all Lambda errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50

# Calculate average target response time per 5-minute window
# (assumes the log events expose a numeric targetProcessingTime field)
fields @timestamp, targetProcessingTime
| stats avg(targetProcessingTime) as avgTime, count() as requests by bin(5m)
| sort avgTime desc

CloudWatch Alarms

Trigger actions when a metric crosses a threshold. States: OK (metric within threshold), ALARM (metric breached threshold), INSUFFICIENT_DATA (not enough data yet).

Actions on ALARM: SNS notification (email/SMS), Auto Scaling (add/remove instances), EC2 action (stop/reboot/recover instance), Systems Manager action.

# Create alarm via CLI: alert if EC2 CPU > 80% for 2 consecutive 5-min periods
aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-ec2-web-01" \
  --alarm-description "CPU usage too high" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc12345 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:ap-south-1:123456789012:ops-alerts"
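
The evaluation logic behind that command can be sketched in a few lines. This is a simplified model, assuming one averaged datapoint per 300-second period and ignoring CloudWatch's missing-data handling and "M out of N" options. The function name is made up for the example.

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=2):
    """Return OK, ALARM, or INSUFFICIENT_DATA from the most recent datapoints.

    datapoints: per-period averages, oldest first.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # GreaterThanThreshold: every evaluated period must be strictly above
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


print(alarm_state([50, 85, 90]))   # last two periods both above 80
print(alarm_state([85, 90, 70]))   # latest period recovered
print(alarm_state([95]))           # only one period of data so far
```

Requiring two consecutive breaches is what keeps a single short CPU spike from paging anyone.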

CloudWatch Dashboards

Custom dashboards combining metrics from multiple services. Create a single pane view: EC2 CPU + RDS connections + ALB latency + Lambda errors + SQS queue depth. Share with team. Use as your operations wall display.

CloudWatch Events / EventBridge

Rule-based event routing. React to AWS service events or scheduled triggers. EventBridge is the evolution of CloudWatch Events — more powerful, supports custom event buses, third-party SaaS events, schema registry.

# EventBridge scheduled rule: trigger Lambda every day at 8 AM UTC (cron)
# Scheduled rules use a ScheduleExpression instead of an event pattern
{
  "ScheduleExpression": "cron(0 8 * * ? *)",
  "State": "ENABLED",
  "Targets": [{"Id": "DailyReport", "Arn": "arn:aws:lambda:...daily-report"}]
}

# EventBridge rule: trigger when EC2 instance state changes to "stopped"
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["stopped"]}
}
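
To make the matching rule concrete, here is a deliberately simplified matcher in Python. It only covers exact-value lists and nested objects. Real EventBridge patterns support more operators (prefix, numeric, anything-but, and so on), and the `matches` function is a hypothetical helper, not an AWS API.

```python
def matches(pattern, event):
    """Simplified EventBridge-style matching.

    Every pattern key must exist in the event. A list means "any of these
    values"; a nested dict is matched recursively against the nested event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

stopped = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc12345", "state": "stopped"},
}
running = dict(stopped, detail={"instance-id": "i-0abc12345", "state": "running"})

print(matches(pattern, stopped), matches(pattern, running))
```

Fields that the pattern does not mention (such as `instance-id`) are ignored, which is why patterns can stay short.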

Monitoring & Observability — Equivalents

GCP

Cloud Monitoring (metrics + dashboards + alerting, like CloudWatch) | Cloud Logging (like CloudWatch Logs) | Cloud Trace (distributed tracing, like AWS X-Ray) | Cloud Profiler (continuous CPU/memory profiling of running apps). All under Google Cloud Observability umbrella.

Azure

Azure Monitor (umbrella service — metrics, logs, alerts) | Log Analytics Workspace (like CloudWatch Log Insights, uses KQL query language) | Application Insights (APM for apps, auto-traces HTTP, DB queries, exceptions — no direct AWS equivalent as a single managed service) | Azure Event Grid (like EventBridge).

Azure-Only

Application Insights: Full APM (Application Performance Monitoring) — auto-instrumentation of .NET, Java, Node, Python apps. Tracks requests, dependencies, exceptions, performance counters, user flows, availability tests. AWS would need a combination of X-Ray + CloudWatch + third-party APM (Datadog, Dynatrace).

CloudTrail / X-Ray Audit & Tracing

AWS CloudTrail

Records AWS API calls made in your account, whether via Console, CLI, SDK, or other AWS services. Who did what, when, from where. The audit trail for your entire AWS account. Enabled automatically, but Event history in the CloudTrail console only keeps the last 90 days of management events; create a Trail that delivers to S3 for long-term retention (often required for compliance).

Trail Types

Management Events: Control plane operations such as CreateBucket, RunInstances, DeleteUser. Enabled by default. First copy is free.
Data Events: Data plane operations — S3 object reads/writes (PutObject, GetObject), Lambda invocations. High volume, extra cost. Enable for critical resources.
Insight Events: Detect unusual API activity (e.g., sudden spike in IAM calls). Extra cost but powerful anomaly detection.

# Example CloudTrail event — someone deleted an S3 bucket
{
  "eventTime": "2024-01-15T14:23:01Z",
  "eventName": "DeleteBucket",
  "userIdentity": {"type": "IAMUser", "userName": "john.doe"},
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-customer-data-backup"},
  "eventSource": "s3.amazonaws.com"
}
# → John deleted the production backup bucket from IP 203.0.113.45 at 2:23 PM UTC
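
Reading events by eye does not scale, so audit tooling usually reduces each record to a who/what/where/when line. A minimal sketch using only the fields shown above follows. The `DESTRUCTIVE_PREFIXES` tuple is an illustrative choice, not an AWS-defined list.

```python
import json
from datetime import datetime, timezone

# The event from the example above, as it might arrive in a log file
raw_event = """
{
  "eventTime": "2024-01-15T14:23:01Z",
  "eventName": "DeleteBucket",
  "userIdentity": {"type": "IAMUser", "userName": "john.doe"},
  "sourceIPAddress": "203.0.113.45",
  "requestParameters": {"bucketName": "prod-customer-data-backup"},
  "eventSource": "s3.amazonaws.com"
}
"""

DESTRUCTIVE_PREFIXES = ("Delete", "Terminate")  # illustrative, not exhaustive


def summarize(event):
    """Turn one CloudTrail record into a who/what/where/when sentence."""
    when = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    identity = event["userIdentity"]
    who = identity.get("userName", identity["type"])
    return (f"{who} called {event['eventName']} on {event['eventSource']} "
            f"from {event['sourceIPAddress']} at {when:%Y-%m-%d %H:%M} UTC")


def is_destructive(event):
    """Flag API calls whose name suggests data or resources were removed."""
    return event["eventName"].startswith(DESTRUCTIVE_PREFIXES)


event = json.loads(raw_event)
summary = summarize(event)
print(summary, "| destructive:", is_destructive(event))
```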

AWS X-Ray — Distributed Tracing

X-Ray helps debug and analyze distributed applications (microservices). When a user request flows through API Gateway → Lambda → DynamoDB → SQS → another Lambda — X-Ray traces the entire journey, showing where latency comes from and where errors occur.

X-Ray Trace — Following a Request

  User Request (Total: 450ms)
  │
  ├── API Gateway: 5ms
  │
  ├── Lambda: process-order (380ms total)
  │   ├── Init (cold start): 150ms   ← performance problem!
  │   ├── DynamoDB PutItem: 12ms
  │   ├── SQS SendMessage: 8ms
  │   └── Execution: 210ms
  │
  └── Response: 65ms

  X-Ray shows: Cold start is causing 33% of total latency.
  Fix: Enable Provisioned Concurrency on this Lambda.
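
The percentages in a trace like this are plain arithmetic over segment durations. The sketch below recomputes them from the numbers in the diagram. The dictionary is hand-typed, not pulled from the X-Ray API.

```python
# Segment durations in milliseconds, copied from the trace above
segments = {
    "API Gateway": 5,
    "Lambda init (cold start)": 150,
    "DynamoDB PutItem": 12,
    "SQS SendMessage": 8,
    "Lambda execution": 210,
    "Response": 65,
}

total_ms = sum(segments.values())


def share(name):
    """Percentage of total request time spent in one segment, rounded."""
    return round(100 * segments[name] / total_ms)


slowest = max(segments, key=segments.get)

for name, ms in sorted(segments.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{ms:>4} ms {share(name):>3}%")
```

Execution time is the largest segment, but the cold start is the part that disappears entirely once the function is kept warm, which is why it is the first thing to fix.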

To use X-Ray: add the X-Ray SDK to your app code, or enable active tracing on Lambda/API Gateway (no code changes). X-Ray automatically generates a service map showing all components and their interconnections.

Tracing & Audit — Equivalents

GCP

Cloud Trace (distributed tracing, like X-Ray) | Cloud Audit Logs (like CloudTrail — Admin Activity, Data Access, System Event logs). Cloud Trace auto-instruments GCP services.

Azure

Application Insights Distributed Tracing (like X-Ray, part of App Insights) | Azure Monitor Activity Log (like CloudTrail — tracks all subscription-level operations).

AWS-M7

DevOps Tools — IaC, CI/CD & Automation

CloudFormation Infrastructure as Code

What is CloudFormation?

AWS's native IaC service. Define your entire infrastructure in YAML or JSON templates. CloudFormation handles creation, update, and deletion of resources in the right order. Free — you only pay for the resources it creates.

CloudFormation Template Structure

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Web Application Stack'

Parameters:  # User inputs at deploy time
  EnvironmentName:
    Type: String
    Default: production
    AllowedValues: [development, staging, production]
  InstanceType:
    Type: String
    Default: t3.micro

Mappings:  # Lookup tables (e.g., AMI IDs per region)
  RegionAMIMap:
    ap-south-1:
      AMI: ami-0abc12345
    us-east-1:
      AMI: ami-0xyz67890

Conditions:  # Conditional resource creation
  IsProd: !Equals [!Ref EnvironmentName, production]

Resources:  # Actual AWS resources (required)
  MyEC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMIMap, !Ref AWS::Region, AMI]
      SecurityGroupIds: [!Ref WebSecurityGroup]
      Tags:
        - Key: Environment
          Value: !Ref EnvironmentName

  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

  # Only create this in production
  ElasticIP:
    Type: AWS::EC2::EIP
    Condition: IsProd
    Properties:
      InstanceId: !Ref MyEC2Instance

Outputs:  # Values returned after stack creation
  InstancePublicIP:
    Value: !GetAtt MyEC2Instance.PublicIp
    Export:
      Name: !Sub "${AWS::StackName}-PublicIP"

Key CloudFormation Concepts

Stacks & Stack Sets

A Stack is a deployed instance of a template (all the resources it creates). You update a stack by updating the template and running a changeset. StackSets deploy one template across multiple accounts and regions simultaneously — essential for large organizations.

Changesets

Preview what changes CloudFormation will make before actually making them. Shows: which resources will be added, modified, or deleted. Always review changesets before applying — especially check for resource replacements (which cause downtime).

Drift Detection

Checks if actual resource state differs from what CloudFormation expects. If someone manually changed a Security Group that CloudFormation manages, drift detection finds it. Important for compliance and ensuring IaC is the source of truth.

!Ref and !GetAtt

Built-in functions for referencing other resources within the template. For an S3 bucket resource, !Ref MyBucket returns the bucket name and !GetAtt MyBucket.Arn returns the bucket ARN. !Sub "arn:aws:s3:::${MyBucket}/*" substitutes variables into a string.

CloudFormation vs Terraform (Key Differences)

Aspect	CloudFormation	Terraform
Language	YAML/JSON	HCL (HashiCorp Configuration Language)
Cloud support	AWS only	Multi-cloud (AWS, GCP, Azure, 1000+ providers)
State management	AWS manages state (no state file)	State file (must manage securely in S3/Terraform Cloud)
Native AWS support	Usually supports new AWS services at or soon after launch	Depends on provider update (usually within days)
Free	Yes	CLI is free to use (BSL license since 2023); HCP Terraform and Terraform Enterprise have paid tiers
Module system	Nested stacks (complex)	Modules (cleaner, community registry)
Drift detection	Built in	Via `terraform plan` (or `terraform plan -refresh-only`)
Industry adoption	AWS shops	Most popular IaC tool overall

IaC — Equivalents

GCP

Deployment Manager (like CloudFormation, GCP-native, YAML/Jinja/Python) | Config Connector (manage GCP resources via Kubernetes CRDs) | Terraform is actually more commonly used in GCP environments than Deployment Manager.

Azure

Azure Resource Manager (ARM) Templates (like CloudFormation, JSON-based, verbose) | Bicep (ARM's modern replacement — cleaner syntax, transpiles to ARM JSON) | Azure Blueprints (for governance at scale — deploy policies + RBAC + resource groups together).

CodePipeline / CodeBuild / CodeDeploy AWS CI/CD

AWS CI/CD Toolchain Overview

AWS CodePipeline — Full CI/CD Flow

  ┌────────────────────────────────────────────────────────────────────┐
  │                        AWS CodePipeline                           │
  ├────────────┬───────────────┬──────────────┬────────────────────────┤
  │  SOURCE    │    BUILD      │    TEST      │       DEPLOY           │
  │            │               │              │                        │
  │ CodeCommit │  CodeBuild    │  CodeBuild   │  CodeDeploy → EC2      │
  │  GitHub    │  (compile,    │  (unit tests,│  CodeDeploy → Lambda   │
  │  Bitbucket │   lint,       │   integration│  ECS (Blue/Green)      │
  │  S3        │   docker build│   tests)     │  CloudFormation        │
  │            │   push to ECR)│              │  Beanstalk             │
  └────────────┴───────────────┴──────────────┴────────────────────────┘
  Each stage has actions. Failure at any stage stops the pipeline.

CodeBuild — Build Service

Managed build server. Runs your build commands in a Docker container, compiles code, runs tests, creates artifacts. Defined in a buildspec.yml file at the root of your repo.

# buildspec.yml — defines build steps
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.11
    commands:
      - pip install -r requirements.txt

  pre_build:
    commands:
      - echo Logging into ECR...
      - aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=$COMMIT_HASH

  build:
    commands:
      - echo Running tests...
      - pytest tests/ --junitxml=test-results.xml
      - echo Building Docker image...
      - docker build -t $ECR_URI:$IMAGE_TAG .

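  post_build:
    commands:
      - docker push $ECR_URI:$IMAGE_TAG
      # Write imagedefinitions.json ("app" is a placeholder container name)
      - printf '[{"name":"app","imageUri":"%s"}]' $ECR_URI:$IMAGE_TAG > imagedefinitions.json
      - echo Build complete. Image $ECR_URI:$IMAGE_TAG

artifacts:
  files:
    - imagedefinitions.json  # Used by the CodePipeline ECS deploy action
reports:
  TestResults:
    files:
      - test-results.xml
    file-format: JUNITXML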

CodeDeploy — Deployment Service

Automates application deployments to EC2, on-premises servers, Lambda, and ECS. Handles rolling updates, blue/green deployments, automatic rollback on failure. Defined in appspec.yml.

# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /app
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_dependencies.sh
      timeout: 120
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 60

CodeDeploy Deployment Configurations

Config	How it deploys	Downtime?
CodeDeployDefault.AllAtOnce	All instances simultaneously	Yes (whole fleet is updated at once)
CodeDeployDefault.HalfAtATime	50% first, then 50%	Reduced capacity (about half the fleet at a time)
CodeDeployDefault.OneAtATime	One instance at a time (slowest, safest)	No (with more than one instance)
Custom (e.g., 25% at a time)	Define your own batch size	Depends
Blue/Green (ECS/Lambda)	New version deployed alongside old, traffic shifted gradually	No, instant rollback
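
To visualize how the in-place configurations differ, the sketch below splits a fleet into deployment batches. It is a simplification: CodeDeploy actually expresses these configurations as a minimum number of healthy hosts, not as fixed batches. The function names and instance IDs are made up.

```python
def deployment_batches(instances, batch_size):
    """Split a fleet into consecutive batches of at most batch_size instances."""
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]


def batches_for(config, instances):
    """Map a deployment configuration name to the batches it would produce."""
    if config == "AllAtOnce":
        size = max(1, len(instances))
    elif config == "HalfAtATime":
        size = max(1, len(instances) // 2)
    elif config == "OneAtATime":
        size = 1
    else:
        raise ValueError(f"unknown config: {config}")
    return deployment_batches(instances, size)


fleet = ["i-1", "i-2", "i-3", "i-4"]
for config in ("AllAtOnce", "HalfAtATime", "OneAtATime"):
    print(config, batches_for(config, fleet))
```

Fewer, larger batches finish sooner but take more capacity out of service at once, which is the trade-off the table describes.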

Elastic Beanstalk — PaaS Deploy

If you don't want to manage CI/CD pipelines at all, Elastic Beanstalk is AWS's PaaS. Upload your app code (zip), EB handles EC2 provisioning, Auto Scaling, Load Balancer, health monitoring, and rolling deploys. Runs on top of standard AWS services (you can still see and modify the EC2 instances). Great for smaller teams or migrating existing apps quickly. Less flexible than managing EC2/ECS directly.

CI/CD Tools — Equivalents

GCP

Cloud Build (like CodeBuild) | Cloud Deploy (managed delivery to GKE/Cloud Run, with promotion through environments) | Artifact Registry (store build artifacts, Docker images)

Azure

Azure Pipelines (CI + CD in one service, like CodePipeline + CodeBuild + CodeDeploy combined — more integrated) | Azure Artifacts (package/artifact storage) | GitHub Actions (Microsoft owns GitHub — deep Azure integration)

Systems Manager (SSM) Ops Automation

What is AWS Systems Manager?

SSM is a collection of operational tools for managing your EC2 instances and on-premises servers at scale. Often overlooked but incredibly powerful for DevOps. It's a suite of services, not just one thing.

SSM Session Manager

Connect to EC2 instances via browser or CLI without opening port 22, without a bastion host, and without managing SSH keys. The SSM Agent on the instance communicates outbound to the SSM service, so no inbound port is needed. Auditable: session activity can be logged to S3 or CloudWatch Logs when logging is configured.

# Connect to EC2 via SSM (no SSH key, no port 22)
aws ssm start-session --target i-0abc12345

# Port forwarding via SSM (access RDS in private subnet through the instance)
aws ssm start-session --target i-0abc12345 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.cluster.ap-south-1.rds.amazonaws.com"],"portNumber":["3306"],"localPortNumber":["13306"]}'
# Now: mysql -h 127.0.0.1 -P 13306 -u admin -p

SSM Parameter Store

Store configuration values and secrets. Types: String, StringList, SecureString (KMS-encrypted). Use for: app config, database hostnames, feature flags, non-sensitive or mildly-sensitive parameters.

# Store a parameter
aws ssm put-parameter \
  --name "/myapp/production/db-host" \
  --value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
  --type String

# Store an encrypted secret
aws ssm put-parameter \
  --name "/myapp/production/api-key" \
  --value "sk-abc123secret" \
  --type SecureString \
  --key-id alias/myapp-key

# Retrieve in code (Python)
import boto3

ssm = boto3.client('ssm')
param = ssm.get_parameter(Name='/myapp/production/db-host', WithDecryption=True)
db_host = param['Parameter']['Value']

SSM Run Command

Run commands across multiple EC2 instances without SSH. Execute shell scripts, PowerShell, or Python across your entire fleet in seconds. Target instances by resource tag or resource group: "Run this on all instances tagged Environment=production."

SSM Patch Manager

Automate OS patching across your fleet. Define patch baselines (which patches to apply, e.g., only Critical + High severity), maintenance windows (when to apply — 2 AM Sunday), and patch groups (which instances). Never manually SSH to patch 50 servers again.

SSM State Manager

Keep instances in a desired state. Define an association: "All prod instances must have the CWAgent installed and running." State Manager periodically checks and enforces this. If someone removes the agent, SSM reinstalls it.

AWS-M8

Messaging & Decoupling

SQS Simple Queue Service

What is SQS?

SQS is a fully managed message queue service. It decouples producers (who send messages) from consumers (who process them). If your consumer is slow or down, messages accumulate safely in the queue until they are processed or the retention period expires (default 4 days, configurable up to 14). Classic async communication pattern.

SQS — Decoupling Producer and Consumer

  WITHOUT SQS (Tight Coupling):
  Web App ──HTTP──► Worker Service
  If Worker is slow/down → Web App blocks or errors ✗

  WITH SQS (Loose Coupling):
  Web App ──SendMessage──► [SQS Queue] ◄──ReceiveMessage── Worker Service
  Web App returns immediately ✓           Worker processes at its own pace ✓
  Queue buffers messages during spikes ✓  Worker can scale independently ✓
  Messages survive worker crashes ✓

SQS Key Concepts

Queue Types

Standard Queue

Nearly unlimited throughput. Best-effort ordering (usually FIFO, but not guaranteed). At-least-once delivery (message may be delivered more than once — make your consumer idempotent). Good for most use cases where order doesn't strictly matter.

FIFO Queue

Guaranteed order (First-In-First-Out). Exactly-once processing (duplicates sent within the 5-minute deduplication interval are dropped). Limited by default to 3,000 msg/sec with batching (300 without); high-throughput mode raises this. For: financial transactions, order processing, inventory changes where sequence matters.
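
Because a standard queue may deliver the same message twice, the consumer has to tolerate duplicates. One common approach, sketched here with an in-memory set, is to remember processed message IDs. A real system would need a durable store for this, which is an assumption beyond these notes.

```python
processed_ids = set()   # stand-in for a durable store (assumption)
shipped = []            # side effects we must not repeat


def handle(message):
    """Process a message once, even if the queue delivers it again."""
    if message["MessageId"] in processed_ids:
        return False                       # duplicate delivery: skip side effects
    shipped.append(message["Body"])
    processed_ids.add(message["MessageId"])
    return True


deliveries = [
    {"MessageId": "m-1", "Body": "ORD-1"},
    {"MessageId": "m-2", "Body": "ORD-2"},
    {"MessageId": "m-1", "Body": "ORD-1"},   # redelivered by the queue
]
results = [handle(m) for m in deliveries]
print(results, shipped)
```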

Visibility Timeout

When a consumer reads a message, it's hidden from other consumers for the visibility timeout period (default 30s, max 12h). The consumer must delete the message before timeout expires. If it doesn't (consumer crashed), the message becomes visible again for another consumer to process. Set visibility timeout to slightly longer than your max processing time.

Dead Letter Queue (DLQ)

If a message fails processing too many times (exceeds maxReceiveCount), SQS moves it to a Dead Letter Queue. A DLQ lets you isolate and debug problematic messages without losing them. Always configure a DLQ for production queues; otherwise failed messages keep cycling and consuming resources until the retention period expires.
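
Visibility timeout and DLQ redrive interact, and a toy model makes the sequence easier to follow. The `ToyQueue` class below is an in-memory illustration with a caller-supplied clock. It is not the SQS API, and the timing values are arbitrary.

```python
class ToyQueue:
    """In-memory model of visibility timeout and DLQ redrive (not the SQS API)."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = []        # each: {"body", "visible_at", "receives"}
        self.dead_letters = []

    def send(self, body):
        self.messages.append({"body": body, "visible_at": 0, "receives": 0})

    def receive(self, now):
        """Return the first visible message and hide it, or None."""
        for msg in list(self.messages):
            if msg["visible_at"] > now:
                continue                        # still hidden (in flight)
            if msg["receives"] >= self.max_receive_count:
                self.messages.remove(msg)       # too many failed attempts
                self.dead_letters.append(msg["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """Called by the consumer after successful processing."""
        self.messages.remove(msg)


queue = ToyQueue(visibility_timeout=30, max_receive_count=2)
queue.send("poison")

first = queue.receive(now=0)     # consumer crashes and never deletes
hidden = queue.receive(now=10)   # inside the timeout: nothing visible
second = queue.receive(now=30)   # timeout expired: delivered again
third = queue.receive(now=60)    # receive count exhausted: moved to DLQ
print(hidden, second["receives"], third, queue.dead_letters)
```

A healthy consumer would call `delete` before the timeout, and the message would never reach the dead-letter list.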

Long Polling

When a consumer calls ReceiveMessage and the queue is empty, short polling returns immediately (wasteful API calls). Long polling waits up to 20 seconds for a message to arrive before returning empty. Reduces cost (fewer API calls) and reduces false-empty responses. Always use long polling (WaitTimeSeconds=20).

# Sending a message (Python boto3)
import json
import boto3

# Placeholder account ID and queue name
QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456789012/my-queue'

sqs = boto3.client('sqs')
response = sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'order_id': 'ORD-12345',
        'customer_id': 'CUST-789',
        'items': [{'product': 'laptop', 'qty': 1}]
    }),
    MessageAttributes={
        'EventType': {'StringValue': 'OrderPlaced', 'DataType': 'String'}
    }
)

# Consuming messages (long polling)
# process_order() is your own handler; it should be idempotent
while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,        # Long polling
        VisibilityTimeout=60       # 60s to process
    )
    for message in response.get('Messages', []):
        process_order(json.loads(message['Body']))
        # Delete after successful processing
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message['ReceiptHandle']
        )

Message Queues — Equivalents

GCP

Cloud Pub/Sub — acts as both a queue AND pub/sub. Pull subscriptions work like SQS (consumer polls). Push subscriptions push to HTTP endpoint. At-least-once delivery. No native FIFO, but ordering key feature ensures ordered delivery within a key.

Azure

Azure Service Bus (full-featured queue + pub/sub, like SQS + some SNS features; supports sessions for FIFO, dead-lettering, transactions) | Azure Queue Storage (simpler, cheaper, like a basic SQS standard queue; messages expire after 7 days by default).

SNS / EventBridge Pub/Sub & Events

Amazon SNS — Simple Notification Service

SNS is a publish/subscribe (pub/sub) messaging service. A publisher sends a message to a Topic, and SNS fans it out to all subscribers simultaneously. One message → many consumers. Perfect for: fanout pattern, notifications, decoupled event broadcasting.

SNS Subscribers

A topic can have multiple subscribers of different types: SQS queue, Lambda function, HTTP/HTTPS endpoint, Email, SMS, Mobile Push (APNs, FCM), Amazon Data Firehose (formerly Kinesis Data Firehose).

SNS Fanout Pattern — One Message, Multiple Consumers

  Order Service publishes "OrderPlaced" event to SNS Topic
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
        SQS Queue       Lambda Fn        SQS Queue
        (Inventory      (Send email      (Analytics
         Service)        confirmation)    Service)

  All three consumers receive the same message independently.
  If one consumer is down, others still get the message.

SNS vs SQS — Key Difference

Feature	SNS (Pub/Sub)	SQS (Queue)
Pattern	1 publisher → many subscribers (fanout)	Producers → queue → one consumer per message
Message persistence	No persistence (if no subscriber, message lost)	Persists up to 14 days
Consumers	Multiple, all receive the message	One consumer per message (competing consumers)
Pull vs Push	Push to subscribers	Consumer pulls
Best for	Broadcast notifications, fanout, alerting	Task queue, work distribution, decoupling

SNS + SQS Fanout Pattern

The most common real-world pattern: SNS pushes to multiple SQS queues. This gives you fanout (SNS) with durability and retry (SQS):

# Architecture:
# New Product Added → SNS Topic "product-events"
#   → SQS "inventory-queue" → Inventory Lambda
#   → SQS "search-index-queue" → Search Index Lambda
#   → SQS "notification-queue" → Push Notification Lambda

# If Search Index Lambda is down: messages buffer in search-index-queue
# Inventory and Push Notification still work independently
# When Search Lambda recovers: processes all buffered messages
# This is the gold standard for reliable event-driven microservices.
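
The buffering behavior described above can be simulated in memory. In this sketch a `Topic` class (a hypothetical stand-in for SNS) appends every published message to each subscribed queue, and one consumer is left "down" so its backlog grows.

```python
from collections import deque


class Topic:
    """Toy pub/sub topic that fans out to per-subscriber queues."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        for q in self.queues.values():
            q.append(message)      # every subscribed queue gets the message


topic = Topic()
inventory = topic.subscribe("inventory-queue")
search = topic.subscribe("search-index-queue")
notify = topic.subscribe("notification-queue")

topic.publish({"event": "ProductAdded", "sku": "P-1"})
topic.publish({"event": "ProductAdded", "sku": "P-2"})

# Inventory and notification consumers drain their queues; search is down
handled_inventory = [inventory.popleft()["sku"] for _ in range(len(inventory))]
handled_notify = [notify.popleft()["sku"] for _ in range(len(notify))]
backlog = len(search)
print(handled_inventory, handled_notify, backlog)
```

When the search consumer comes back, its two buffered messages are still waiting, which is the durability that SNS alone would not provide.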

Amazon EventBridge

An event bus service for building event-driven applications. More powerful than SNS for complex routing — you can filter events by content, transform them, route to 20+ AWS services, connect to third-party SaaS apps (Salesforce, Zendesk, Datadog), and create custom event buses per service.

Default Event Bus: Receives AWS service events (EC2 state changes, CodePipeline updates, etc.)
Custom Event Bus: For your own application events. Publish events from your microservices here.
Partner Event Bus: Receive events from SaaS partners (Shopify orders, GitHub events)

# Publish a custom event to EventBridge
import json
import boto3

events = boto3.client('events')
events.put_events(
    Entries=[{
        'Source': 'com.mycompany.orders',
        'DetailType': 'OrderPlaced',
        'Detail': json.dumps({'orderId': 'ORD-123', 'total': 599.99}),
        'EventBusName': 'my-app-events'
    }]
)
# EventBridge rule routes this to: Lambda for fulfillment,
# SQS for analytics, another EventBridge bus in a different account
# Based on content: {"source": ["com.mycompany.orders"], "detail-type": ["OrderPlaced"]}

Amazon Kinesis — Real-Time Streaming

For high-throughput, real-time data streaming. Unlike SQS (queue — messages consumed and deleted), Kinesis retains data as a stream that multiple consumers can read from. Think of it as a real-time data pipeline.

Service	What it does	Use case
Kinesis Data Streams	Real-time data stream. Shards provide throughput (1MB/s write per shard). Multiple consumers. Retain 1-365 days.	Real-time clickstream, app logs, IoT telemetry
Kinesis Data Firehose	Fully managed ETL — stream data directly to S3, Redshift, OpenSearch, Splunk. Auto-scales, buffers, compresses, transforms.	Load streaming data to S3 data lake or Redshift without code
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink)	Run Apache Flink applications (originally also SQL) on streaming data in real-time	Real-time dashboards, anomaly detection, aggregations
MSK (Managed Kafka)	Fully managed Apache Kafka. For teams that need Kafka compatibility.	Kafka migration, complex event streaming, ecosystem tools
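
For provisioned Kinesis Data Streams, shard count follows from the write limits. The sketch below uses the 1 MB/s per-shard figure from the table plus the 1,000 records/s per-shard write limit, which is not stated in the table and is added here as an assumption to verify against current quotas.

```python
import math

MB_PER_SHARD = 1.0          # write throughput per shard, from the table above
RECORDS_PER_SHARD = 1000    # per-shard record limit (assumption, see lead-in)


def shards_needed(write_mb_per_s, records_per_s):
    """Shards required so neither the byte limit nor the record limit is exceeded."""
    by_bytes = math.ceil(write_mb_per_s / MB_PER_SHARD)
    by_records = math.ceil(records_per_s / RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


big_records = shards_needed(4.5, 2000)     # bandwidth-bound
small_records = shards_needed(0.5, 6000)   # record-count-bound
print(big_records, small_records)
```

Many tiny records can need more shards than a few large ones, even at lower total bandwidth.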

Pub/Sub & Streaming — Equivalents

GCP

Pub/Sub (handles both SNS and SQS use cases — push and pull modes). Dataflow (like Kinesis Data Analytics, uses Apache Beam). Pub/Sub Lite (lower cost, regional, like Kinesis for ordered streams).

Azure

Azure Event Grid (like EventBridge — event routing, serverless, pay-per-event) | Azure Event Hubs (like Kinesis Data Streams — high-throughput event streaming, Kafka-compatible API!) | Azure Service Bus Topics (like SNS — pub/sub with filtering)

Azure-Only

Azure Event Hubs Kafka-compatible API: You can use your existing Apache Kafka clients to produce/consume from Event Hubs without code changes. Just change the broker endpoint. AWS MSK also offers Kafka compatibility, but Event Hubs being serverless AND Kafka-compatible is unique in the PaaS space.

AWS-M4

IAM & Security Services

IAM Identity & Access Management

What is IAM?

IAM is AWS's centralized service for controlling who can do what on which AWS resources. It's global (not region-specific) and free. Every API call to AWS is authenticated and authorized through IAM. No IAM permission → API call denied, period.

IAM Entities

IAM Users

A person or application that needs permanent, long-term credentials to interact with AWS. Has a username + password (console) and/or access key + secret key (programmatic). Best practice: don't use root account — create individual IAM users. Even better: use IAM Identity Center (SSO) for humans.

IAM Groups

A collection of IAM users. Attach policies to the group and all members inherit those permissions. Groups can contain only users: you can't put roles or other groups inside a group. Simplifies permission management: add a user to the "Developers" group and they get all developer permissions.

IAM Roles

An IAM identity without permanent credentials. Instead, when something assumes a role, it gets temporary security credentials (valid from 15 minutes up to 12 hours, depending on configuration). Used by: EC2 instances (instead of hardcoded keys), Lambda functions, cross-account access, federated users (SSO), ECS tasks. This is the correct way for AWS services to access other services: never hardcode access keys in code.

IAM Role — EC2 Instance Assuming a Role

  EC2 instance needs to write to S3
  ─────────────────────────────────────────────────────────────────────
  BAD:  Hardcode access_key + secret in app → leaked in Git → disaster
  ─────────────────────────────────────────────────────────────────────
  GOOD: EC2 IAM Role with s3:PutObject permission:

  IAM Role "EC2-S3-Writer" ──attached to──► EC2 Instance
       │
       └── Policy: Allow s3:PutObject on arn:aws:s3:::my-bucket/*

  Inside EC2: AWS SDK auto-fetches temporary credentials from IMDS
  (with IMDSv2, the SDK first obtains a session token, then calls:)
  http://169.254.169.254/latest/meta-data/iam/security-credentials/EC2-S3-Writer
  → Access Key (temp), Secret Key (temp), Session Token, Expiration
  → SDK auto-refreshes these before expiry

IAM Policies

JSON documents defining permissions. A policy has one or more statements, each with:

Effect: Allow or Deny. Explicit Deny always wins over Allow.
Action: API operations (e.g., s3:GetObject, ec2:*)
Resource: ARN of the specific resource (or * for all)
Condition: Optional restrictions (e.g., only from this IP, only over MFA)
Principal: (In resource-based policies) Who the policy applies to

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteUnlessMFA",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-app-bucket/*",
      "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
      }
    }
  ]
}

Policy Types

Type	Attached to	Managed by	Use case
AWS Managed Policy	User, Group, Role	AWS creates & updates	Common permission sets: `AmazonS3ReadOnlyAccess`, `AdministratorAccess`
Customer Managed Policy	User, Group, Role	You create & manage	Custom permissions for your org. Reusable. Versionable.
Inline Policy	Single User, Group, or Role	You create, embedded in identity	Strict 1:1 relationship. Deleted when identity deleted. Avoid when possible.
Resource-based Policy	The resource itself (S3 bucket, SQS queue, Lambda)	You create on the resource	Grant cross-account access without assuming a role. Used for S3 bucket policies, Lambda resource policies.
Permission Boundary	User or Role	Admin sets max permissions ceiling	Delegate IAM permission management to devs but cap what they can grant.
Service Control Policy (SCP)	AWS Org OUs or accounts	Org admin	Maximum permissions guardrails across entire AWS accounts. "Nobody in this account can touch us-west-1."

IAM Policy Evaluation Logic

When a request is made, AWS evaluates all applicable policies:

IAM Policy Evaluation Order

  API request arrives
         │
         ▼
  1. Explicit DENY anywhere? ───── YES ──► DENY (stops here)
         │ NO
         ▼
  2. SCP allows? (if AWS Org) ──── NO ───► DENY
         │ YES
         ▼
  3. Resource-based policy allows? ─ YES ─► (may ALLOW without identity policy)
         │ NO
         ▼
  4. Permission Boundary allows? ── NO ───► DENY
         │ YES
         ▼
  5. Identity policy allows? ────── YES ──► ALLOW
         │ NO
         ▼
         DENY (implicit — default deny everything)
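The decision flow in the diagram above can be sketched as a small Python function. This is only an illustration of the diagram's ordering, not AWS's actual evaluation engine; the `PolicyResults` container and its field names are hypothetical, and session policies are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyResults:
    """Outcome of each policy layer for one request (hypothetical container)."""
    explicit_deny: bool               # any applicable policy has a matching Deny
    scp_allows: Optional[bool]        # None = account is not in an AWS Organization
    resource_policy_allows: bool      # resource-based policy grants access
    boundary_allows: Optional[bool]   # None = no permission boundary attached
    identity_policy_allows: bool      # identity-based policy has a matching Allow


def evaluate(r: PolicyResults) -> str:
    """Walk the decision steps from the diagram, top to bottom."""
    if r.explicit_deny:
        return "DENY (explicit)"
    if r.scp_allows is False:
        return "DENY (SCP)"
    if r.resource_policy_allows:
        return "ALLOW (resource-based policy)"
    if r.boundary_allows is False:
        return "DENY (permission boundary)"
    if r.identity_policy_allows:
        return "ALLOW (identity policy)"
    return "DENY (implicit)"


# An identity policy Allow is not enough if the SCP does not allow the action.
print(evaluate(PolicyResults(False, False, False, None, True)))
```

The useful takeaway is the shape: an explicit Deny short-circuits everything, and with no matching Allow anywhere the request falls through to the implicit deny.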

Cross-Account Access

Account A's EC2 wants to access Account B's S3 bucket. Process:

Account B creates an IAM Role with a trust policy allowing Account A to assume it
Account B's role has the S3 permissions needed
Account A's EC2 calls sts:AssumeRole for Account B's role
Gets temporary credentials for Account B → can now access Account B's S3

# Trust policy on Account B's role (who can assume it).
# The principal below is the EC2 role in Account A (111111111111).
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111111111111:role/EC2-Role"},
    "Action": "sts:AssumeRole"
  }]
}

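# Account A EC2 assuming Account B's role (boto3):
import boto3

sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::222222222222:role/S3-Access-Role',  # Account B
    RoleSessionName='my-session'
)
creds = response['Credentials']  # temporary; valid until creds['Expiration']

# Use the temporary credentials to create an S3 client that acts as Account B's role
s3 = boto3.client(
    's3',
    aws_access_key_id=creds['AccessKeyId'],
    aws_secret_access_key=creds['SecretAccessKey'],
    aws_session_token=creds['SessionToken'],
)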

IAM Equivalents

GCP

Cloud IAM. Key differences: GCP uses Service Accounts (like IAM roles but with an email identity — can be granted access to specific resources). GCP IAM is more resource-centric (bind roles to resources). No inline policies — roles are always separate entities. Workload Identity = IAM roles for GKE pods.

Azure

Microsoft Entra ID (formerly Azure Active Directory / Azure AD) for identity + Azure RBAC for access control. Entra ID covers both human users and service principals (the closest analogue to IAM roles used by applications). Managed Identities = IAM roles for Azure VMs/Functions. Azure RBAC assigns built-in or custom roles to identities at various scopes (management group, subscription, resource group, resource).

Azure-Only

Microsoft Entra ID (formerly Azure AD): Much more feature-rich identity provider than AWS IAM. It supports OAuth 2.0, SAML, and OIDC federation with thousands of apps natively, Conditional Access policies (e.g., block login from outside the country), and Privileged Identity Management (JIT access). AWS equivalent would be IAM Identity Center + Cognito combined, with less enterprise AD integration.

KMS / Secrets Manager / SSM Secrets & Key Management

AWS KMS — Key Management Service

KMS manages cryptographic keys used to encrypt your data. You never handle raw key material — KMS keeps keys secure inside Hardware Security Modules (HSMs). Services like S3, EBS, RDS, Secrets Manager all use KMS keys to encrypt data.

KMS Key Types

AWS Managed Keys: Free. AWS creates and manages rotation. Automatically used by services (e.g., aws/s3 key for S3 SSE). Less control — you can't change rotation or grant cross-account access.
Customer Managed Keys (CMK): You create, own, and manage. $1/month/key. Control rotation (optional, annual). Can grant cross-account usage. Needed for compliance where you must control the key.
AWS CloudHSM: Dedicated hardware HSM. You control the keys completely, AWS has no access. Most expensive, highest compliance. Used for PCI-DSS, FIPS 140-2 Level 3 requirements.

# Encrypt data with KMS (AWS CLI)
aws kms encrypt \
  --key-id arn:aws:kms:ap-south-1:123456789:key/abc-123 \
  --plaintext fileb://secret.txt \
  --output text --query CiphertextBlob | base64 --decode > encrypted.bin

# Decrypt
aws kms decrypt \
  --ciphertext-blob fileb://encrypted.bin \
  --output text --query Plaintext | base64 --decode

Envelope Encryption

KMS uses envelope encryption: a Data Encryption Key (DEK) is generated to encrypt your actual data. The DEK itself is encrypted by the KMS CMK. Only the encrypted DEK is stored with the data. To decrypt: call KMS to decrypt the DEK, use plaintext DEK to decrypt data. This way, large amounts of data never pass through KMS API.
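The envelope flow described above can be sketched in plain Python. Everything here is a stand-in under loud assumptions: `FakeKms` is a hypothetical in-memory substitute for the real service (whose corresponding operations are `GenerateDataKey` and `Decrypt`), and the XOR keystream cipher is a toy that is NOT secure; real code would use an authenticated cipher such as AES-GCM. The point is only to show which bytes go where.

```python
import hashlib
import secrets
from typing import Tuple


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR cipher with a SHA-256 counter keystream. NOT secure; demo only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class FakeKms:
    """Hypothetical in-memory stand-in for KMS: the master key never leaves it."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self) -> Tuple[bytes, bytes]:
        """Return (plaintext DEK, DEK encrypted under the master key)."""
        dek = secrets.token_bytes(32)
        return dek, _toy_cipher(self._master_key, dek)

    def decrypt(self, encrypted_dek: bytes) -> bytes:
        return _toy_cipher(self._master_key, encrypted_dek)


def envelope_encrypt(kms: FakeKms, plaintext: bytes) -> dict:
    dek, encrypted_dek = kms.generate_data_key()
    ciphertext = _toy_cipher(dek, plaintext)  # bulk data is encrypted locally
    # Only the ENCRYPTED DEK is stored next to the data; the plaintext DEK is dropped.
    return {"encrypted_dek": encrypted_dek, "ciphertext": ciphertext}


def envelope_decrypt(kms: FakeKms, envelope: dict) -> bytes:
    dek = kms.decrypt(envelope["encrypted_dek"])  # only the small DEK goes to "KMS"
    return _toy_cipher(dek, envelope["ciphertext"])


kms = FakeKms()
envelope = envelope_encrypt(kms, b"large payload " * 1000)
print(len(envelope["encrypted_dek"]), len(envelope["ciphertext"]))
```

Note that the 14,000-byte payload never passes through the `FakeKms` object; only the 32-byte data key does, which mirrors why large data never has to pass through the KMS API.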

AWS Secrets Manager

Store, manage, and rotate secrets (DB passwords, API keys, OAuth tokens). Secrets are encrypted at rest via KMS. Applications retrieve secrets via API — no plaintext secrets in code or environment variables.

Automatic rotation: Secrets Manager can automatically rotate DB credentials (works natively with RDS). It creates a new password, updates the DB, stores the new secret — all without downtime.
Versioning: Keeps previous versions during rotation (AWSPREVIOUS stage) so apps using old password still work briefly while they update.
Cost: $0.40/secret/month + $0.05 per 10,000 API calls.

# Retrieve secret in Python (boto3)
import boto3, json
client = boto3.client('secretsmanager', region_name='ap-south-1')
response = client.get_secret_value(SecretId='prod/myapp/db-credentials')
secret = json.loads(response['SecretString'])
db_password = secret['password']  # Fresh from Secrets Manager, never hardcoded

AWS Systems Manager Parameter Store

Lightweight configuration and secrets storage. Parameters come in two tiers (Standard and Advanced), and a parameter in either tier can use the SecureString type:

Standard tier: Free. Values up to 4KB. Good for non-sensitive config (app settings, feature flags, environment config).
Advanced tier: $0.05/param/month. Values up to 8KB, plus parameter policies (TTL, auto-notification when approaching expiry).
SecureString type: Value is encrypted with KMS. Good for secrets that don't need rotation. No extra cost beyond KMS calls.

Secrets Manager vs Parameter Store: Use Secrets Manager when you need automatic rotation. Use Parameter Store for config, non-sensitive data, or cost-sensitive secrets (it's free for standard).

# Store a parameter (CLI)
aws ssm put-parameter \
  --name "/myapp/prod/db-host" \
  --value "mydb.cluster.ap-south-1.rds.amazonaws.com" \
  --type "String"

aws ssm put-parameter \
  --name "/myapp/prod/db-password" \
  --value "SuperSecret123!" \
  --type "SecureString"  # Encrypted with KMS

# Retrieve in app
aws ssm get-parameter --name "/myapp/prod/db-password" --with-decryption

Secrets & Key Management — Equivalents

GCP

Cloud KMS (key management, like AWS KMS) | Secret Manager (like Secrets Manager — stores secrets, automatic versioning, access via API). GCP Cloud HSM is part of Cloud KMS. No direct equivalent to SSM Parameter Store — Secret Manager serves both use cases.

Azure

Azure Key Vault: unified service for secrets, keys, AND certificates. Equivalent to AWS KMS + Secrets Manager combined. Key Vault also manages TLS/SSL certificates with automatic renewal. Azure Dedicated HSM = AWS CloudHSM equivalent.

Azure-Only

Azure Key Vault Certificates: natively manages TLS certificates (creation, renewal, storage) in one service. AWS equivalent requires ACM (certificates) + KMS (keys) + Secrets Manager (secrets) as separate services.

WAF / Shield / GuardDuty Threat Protection

AWS WAF — Web Application Firewall

Protects web applications from common web exploits (OWASP Top 10): SQL injection, XSS, bad bots, malformed requests. Deployed in front of CloudFront, ALB, API Gateway, or AppSync. You define Web ACLs with rules.

Managed Rule Groups: Pre-built rule sets from AWS or AWS Marketplace (e.g., "AWS Managed Rules - Core rule set" blocks common OWASP attacks).
Custom Rules: Block requests matching your logic (rate limiting by IP, block specific user agents, geo-blocking).
Rate-based rules: Automatically block IPs exceeding X requests per 5 minutes.

# Rate-based rule example (simplified pseudo-config, not literal Terraform or CloudFormation):
# Block any IP that sends more than 2000 requests per 5 minutes
Rule: RateBasedRule
  Action: BLOCK
  Statement:
    RateBasedStatement:
      Limit: 2000
      AggregateKeyType: IP

AWS Shield — DDoS Protection

Shield Standard: Free, automatically enabled for all AWS customers. Protects against common L3/L4 DDoS attacks (SYN floods, UDP reflection attacks). Integrated with CloudFront and Route 53.
Shield Advanced: $3,000/month per organization (with a 1-year commitment). Protects against large, sophisticated DDoS attacks. Includes 24/7 access to the Shield Response Team (SRT, formerly the DDoS Response Team), real-time attack visibility, and cost protection (AWS credits if your bill spikes because resources scaled up during a DDoS attack).

Amazon GuardDuty — Threat Detection

Continuous security monitoring service that analyzes: VPC Flow Logs, CloudTrail API logs, DNS logs, and optionally EKS audit logs and S3 data events. Uses ML to detect threats like: EC2 cryptomining, root credential usage, unusual API calls from unknown IPs, port scanning, compromised credentials accessing S3.

GuardDuty doesn't block anything — it generates findings (alerts) with severity levels (low/medium/high). You automate responses via EventBridge → Lambda (e.g., auto-isolate compromised instance by removing from security groups).
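A sketch of the EventBridge → Lambda response mentioned above might look like the following. Several things are assumptions for illustration: the finding's instance ID is read from `detail.resource.instanceDetails.instanceId` (check this path against a real finding in your account), the quarantine security group ID is hypothetical, and the EC2 client is passed in so the logic can be exercised without AWS. Because an instance must always have at least one security group, "isolating" here means swapping its groups for a quarantine group that has no rules.

```python
def isolate_instance(ec2_client, event: dict, quarantine_sg_id: str,
                     min_severity: float = 7.0):
    """Move the instance named in a GuardDuty finding into a quarantine group.

    Returns the instance ID that was isolated, or None if nothing was done.
    """
    detail = event.get("detail", {})
    # Only react to high-severity findings (7.0 and above).
    if min_severity > detail.get("severity", 0):
        return None
    instance_id = (
        detail.get("resource", {}).get("instanceDetails", {}).get("instanceId")
    )
    if not instance_id:
        return None  # finding is not about an EC2 instance
    # Replace ALL security groups with the quarantine group.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id, Groups=[quarantine_sg_id]
    )
    return instance_id


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the logic above stays testable."""
    import boto3
    # "sg-0123456789abcdef0" is a placeholder for your own quarantine group.
    return isolate_instance(boto3.client("ec2"), event, "sg-0123456789abcdef0")
```

In practice you would probably also snapshot the volumes and tag the instance for the incident team before or after isolating it.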

Amazon Inspector

Automated vulnerability scanning for EC2 instances and container images. Continuously scans for OS package vulnerabilities (CVEs), network exposure issues, software vulnerabilities. Integrates with ECR to scan images on push. Different from GuardDuty (runtime threat detection) — Inspector is about vulnerability assessment.

Security Services — Equivalents

GCP

Cloud Armor (= WAF + DDoS, like WAF + Shield combined) | Security Command Center (SCC) (threat detection, vulnerability findings, like GuardDuty + Inspector combined) | Container Analysis (vulnerability scanning in Artifact Registry, like ECR + Inspector).

Azure

Azure WAF (via Front Door or Application Gateway) | Azure DDoS Protection Standard (like Shield Advanced) | Microsoft Defender for Cloud (threat detection + vulnerability assessment, like GuardDuty + Inspector + more) | Microsoft Sentinel (SIEM/SOAR — no direct AWS equivalent).

Azure-Only

Microsoft Sentinel: A full SIEM/SOAR platform that ingests logs from Azure + on-prem + multi-cloud + third-party tools, uses ML for threat hunting, and automates playbooks. AWS equivalent would be custom-built using CloudTrail + GuardDuty + Macie + Security Hub + custom Lambda playbooks. Sentinel is more turnkey.

AWS-M5

Databases

RDS Relational Database Service

What is RDS?

RDS is a managed relational database service. AWS handles: provisioning hardware, installing the DB engine, patching, backups, monitoring, Multi-AZ failover. You focus on your schema and queries. Supports: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.

RDS Key Features

Multi-AZ

Your primary DB runs in one AZ. A standby replica runs in a different AZ, synchronously receiving every write. If the primary fails, AWS automatically promotes the standby and updates the DNS endpoint within 1-2 minutes. Your app reconnects to the same endpoint, so no code changes are needed. In this classic single-standby deployment the standby is NOT readable; it exists purely for failover. (The newer Multi-AZ DB cluster option is the exception: it has two readable standbys.) For read scale, use Read Replicas.

RDS Multi-AZ vs Read Replicas

  MULTI-AZ (for HA/failover):              READ REPLICAS (for read scale):

  App ──► RDS Endpoint                     App ──► Primary (write endpoint)
          │                                         │
          ▼                                         ├──async repl──► Read Replica 1
  Primary DB (AZ-a) ──sync repl──►                 ├──async repl──► Read Replica 2
  Standby DB (AZ-b) [not readable]                 └──async repl──► Read Replica (another region)

  Failover: ~60-120 seconds, auto         Read: use separate read endpoint
  Standby ONLY for failover               Slight replication lag (async)

Read Replicas

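Up to 15 read replicas per source for Aurora, and also for RDS MySQL, MariaDB, and PostgreSQL; 5 for Oracle and SQL Server (limits have changed over time, so check the current quotas)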
Async replication — slight lag possible. Apps must tolerate slightly stale reads.
Can be in same AZ, different AZ, or different region (Cross-Region Read Replica)
Can be promoted to standalone (good for DR) — promotion breaks replication
Useful for: analytics queries, reporting, geographically close reads

Automated Backups & Snapshots

Automated backups: Daily snapshot + transaction logs. Retention: 0-35 days. Enables point-in-time recovery (PITR). Free storage equal to DB size.
Manual snapshots: You trigger them. Stored until you delete. Survive DB deletion. Good for: pre-migration checkpoints, long-term retention.
Restore: Creates a NEW DB instance from the backup (doesn't restore in-place). Update your app's endpoint.

RDS Proxy

A managed connection pool between your app and RDS. Databases have limited connections: the MySQL default max_connections scales with instance memory, so small burstable instances allow only tens to a few hundred connections. Lambda functions scale to thousands of concurrent invocations, and without RDS Proxy they'd exhaust DB connections. RDS Proxy maintains a warm pool and multiplexes application connections. It also improves failover behavior (connections are held during failover, reducing app errors).

Lambda + RDS = Use RDS Proxy: Avoid connecting high-concurrency Lambda functions directly to RDS. Each concurrent Lambda execution environment holds its own DB connection, so at 1000 concurrent Lambdas you'd hit DB connection limits immediately. With RDS Proxy, Lambda connects to the proxy, which maintains a small pool to RDS. This is the classic serverless + relational DB pattern.

Aurora AWS's Cloud-Native DB

What is Aurora?

Aurora is AWS's proprietary cloud-native relational database, compatible with MySQL and PostgreSQL. It's NOT just a managed MySQL: AWS redesigned the storage layer from scratch. AWS claims up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL. It costs more than standard RDS (~20%) but is typically worth it for production workloads.

Aurora Architecture

Aurora's storage is completely separate from the compute (DB instances). Storage is a distributed, fault-tolerant, self-healing cluster across 3 AZs × 2 copies = 6 copies of your data. Can lose 2 copies without write availability loss, 3 copies without read availability loss.
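The failure tolerance quoted above follows from quorum arithmetic. A small sketch, assuming the 4-of-6 write quorum and 3-of-6 read quorum described in AWS's Aurora design material (the function and constant names are hypothetical):

```python
TOTAL_COPIES = 6   # 3 AZs x 2 copies each
WRITE_QUORUM = 4   # a write must be acknowledged by 4 of 6 copies
READ_QUORUM = 3    # a read needs 3 of 6 copies


def availability(lost_copies: int) -> dict:
    """Report whether reads and writes are still possible after losing copies."""
    if lost_copies not in range(TOTAL_COPIES + 1):
        raise ValueError("lost_copies must be between 0 and 6")
    healthy = TOTAL_COPIES - lost_copies
    return {"writes": healthy >= WRITE_QUORUM, "reads": healthy >= READ_QUORUM}


for lost in (2, 3, 4):
    print(lost, availability(lost))
```

Losing a whole AZ removes 2 copies, which still leaves 4 for the write quorum; losing an AZ plus one more copy leaves 3, so reads continue while writes pause until the storage layer repairs itself.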

Aurora Storage — Distributed Across 3 AZs

  ┌─────────────────────────────────────────────────────────────────┐
  │                    AURORA CLUSTER                               │
  │                                                                 │
  │  Writer Instance (Primary) ──────────────────────────────────┐  │
  │  Reader Instance 1          ─── Shared Storage Cluster ────► │  │
  │  Reader Instance 2          ─── (6 copies, 3 AZs)            │  │
  │                                                               │  │
  │  AZ-1: [Data Copy 1] [Data Copy 2]                           │  │
  │  AZ-2: [Data Copy 3] [Data Copy 4]                           │  │
  │  AZ-3: [Data Copy 5] [Data Copy 6]                           │  │
  └─────────────────────────────────────────────────────────────────┘

  Failover: ~30 seconds (promote a reader — same shared storage!
  No data copy needed since readers already share storage)

Aurora Features

Aurora Serverless v2

Aurora capacity auto-scales in fine-grained increments (0.5 ACU steps) based on actual load, within seconds. You define min/max ACUs. Pay per second of capacity used. Ideal for: variable workloads, dev/test, new apps with unpredictable traffic. Can scale from nearly zero to 128 ACUs (≈256GB RAM) in seconds.

Aurora Global Database

One primary region with up to 5 secondary read-only regions. Replication lag < 1 second globally (uses AWS's dedicated infrastructure, not the internet). Used for: global read scale, DR (RPO <1s, RTO < 1 minute — just promote a secondary to primary). Unlike standard cross-region read replicas, Global DB can handle replication even under high write load.

Aurora Backtrack

MySQL-compatible only. Rewind the DB to a point in the past without restoring from backup. Goes back in time by replaying the storage log. Can backtrack up to 72 hours. Instant — takes seconds vs hours for a restore. Useful for: "oops we just ran DELETE without WHERE."

Managed Relational DB — Equivalents

GCP

Cloud SQL (managed MySQL, PostgreSQL, SQL Server — like standard RDS) | Cloud Spanner (global, horizontally scalable relational DB — no direct AWS equivalent, but closest to Aurora Global + Vitess. True horizontal write scale across regions with ACID transactions). Spanner is unique — AWS has nothing comparable.

Azure

Azure SQL Database (managed SQL Server — like RDS SQL Server) | Azure Database for MySQL/PostgreSQL (like RDS MySQL/PostgreSQL) | Azure Cosmos DB for PostgreSQL (distributed PostgreSQL, like Citus — no direct AWS equivalent for this exact feature).

GCP-Only

Cloud Spanner: Globally distributed, ACID-compliant relational DB that scales horizontally for writes across regions. AWS Aurora Global DB scales reads globally but writes are single-region. Spanner scales both globally. AWS has no equivalent — closest would be DynamoDB Global Tables (NoSQL) or CockroachDB on EC2.

DynamoDB Serverless NoSQL

What is DynamoDB?

DynamoDB is AWS's managed NoSQL key-value and document database. Fully serverless: no instances to size, automatic scaling, single-digit millisecond performance at any scale. Powers Amazon.com's shopping cart, Lyft's ride data, Duolingo's learning streak — workloads at massive scale.

DynamoDB Data Model

Table: Collection of items (like a table in SQL, but schemaless)
Item: A single record (like a row). Max 400KB per item.
Attribute: A data field (like a column, but each item can have different attributes)
Partition Key (PK): Required. Used to distribute data across partitions. Every Query must specify the PK; only a Scan (which reads the whole table) can omit it.
Sort Key (SK): Optional. Combined with PK = composite primary key. Enables range queries within a partition.

# Example DynamoDB table for an e-commerce app:
# PK = UserID, SK = OrderID

Items:
{ "UserID": "user123",  "OrderID": "order001", "Status": "Delivered", "Total": 299.99 }
{ "UserID": "user123",  "OrderID": "order002", "Status": "Shipped",   "Total": 49.99  }
{ "UserID": "user456",  "OrderID": "order003", "Status": "Pending",   "Total": 799.00 }

# Query: All orders for user123 (efficient - same partition)
aws dynamodb query \
  --table-name Orders \
  --key-condition-expression "UserID = :uid" \
  --expression-attribute-values '{":uid": {"S": "user123"}}'

Capacity Modes

Mode	How it works	Best for
On-Demand	Pay per request (billed in read/write request units). Auto-scales instantly. No capacity planning.	New apps, variable traffic, or unknown load. More expensive per request than provisioned at steady state.
Provisioned	You set RCUs (Read Capacity Units) and WCUs (Write Capacity Units). Can use Auto Scaling to adjust. Cheaper at steady state. May throttle if you exceed provisioned capacity.	Predictable steady workloads. Pair with Auto Scaling for some elasticity.

Capacity Units

1 RCU = 1 strongly consistent read/sec (or 2 eventually consistent reads/sec) for items up to 4KB
1 WCU = 1 write/sec for items up to 1KB
A 10KB item read with strong consistency = 3 RCUs (10KB rounds up to three 4KB units). Same item, eventual consistency = 1.5 RCUs (round up to 2 when provisioning).
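The capacity-unit rules above reduce to a little arithmetic. A minimal sketch for sizing provisioned capacity (function names are hypothetical; transactional reads and writes, which cost double, are not modelled):

```python
import math


def read_capacity_units(item_size_kb: float, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: item size rounds up to 4KB units; eventual consistency costs half."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2
    return math.ceil(total)  # you can only provision whole units


def write_capacity_units(item_size_kb: float, writes_per_second: int) -> int:
    """WCUs needed: item size rounds up to 1KB units."""
    return math.ceil(item_size_kb) * writes_per_second


print(read_capacity_units(10, 1))                              # strongly consistent
print(read_capacity_units(10, 1, strongly_consistent=False))   # eventually consistent
print(write_capacity_units(2.5, 10))
```

Because sizes round up per item, many small items are cheaper to read than one item just over a 4KB boundary, which is one reason to keep items compact.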

DynamoDB Advanced Features

Global Secondary Indexes (GSI)

Query on non-primary key attributes. A GSI has its own partition key + sort key (different from table's PK/SK) and its own capacity. Enables different access patterns without data duplication in your code.

# Table: PK=UserID, SK=OrderID
# Query by Status: can't do this without an index (full table scan is expensive)
# Add GSI: PK=Status, SK=CreatedAt → can now query "all Pending orders, newest first"
# (assumes each item also carries a CreatedAt attribute)

aws dynamodb query \
  --table-name Orders \
  --index-name StatusIndex \
  --key-condition-expression "#s = :status" \
  --expression-attribute-names '{"#s": "Status"}' \
  --expression-attribute-values '{":status": {"S": "Pending"}}' \
  --no-scan-index-forward   # descending sort-key order = newest first

DynamoDB Streams

A time-ordered stream of item-level changes (insert/update/delete) in your table. Retained for 24 hours. Used with Lambda to react to data changes (send email when order status changes, sync to another table, audit log, real-time analytics).
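As an illustration of the Lambda pattern above, here is a minimal handler sketch that picks out order status changes from a stream batch. It assumes the stream's view type is NEW_AND_OLD_IMAGES (so both images are present) and reuses the `Status` and `OrderID` attribute names from the earlier Orders example; the notification step is left as a print.

```python
def status_changes(event: dict) -> list:
    """Return one entry per MODIFY record whose Status attribute changed."""
    changes = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # ignore INSERT and REMOVE
        data = record.get("dynamodb", {})
        # Stream images use DynamoDB's typed JSON, e.g. {"S": "Shipped"}.
        old = data.get("OldImage", {}).get("Status", {}).get("S")
        new = data.get("NewImage", {}).get("Status", {}).get("S")
        if old != new:
            order_id = data.get("Keys", {}).get("OrderID", {}).get("S")
            changes.append({"OrderID": order_id, "from": old, "to": new})
    return changes


def handler(event, context):
    """Lambda entry point for a DynamoDB Streams event source mapping."""
    changes = status_changes(event)
    for change in changes:
        # Placeholder for the real side effect (send email, write audit log, ...).
        print(f"Order {change['OrderID']}: {change['from']} -> {change['to']}")
    return {"changes": len(changes)}
```

Keeping the record parsing in a plain function separate from the handler makes it easy to unit test with a hand-built event.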

DynamoDB Global Tables

Multi-region, multi-active (all regions accept reads AND writes). DynamoDB handles conflict resolution (last-writer-wins). Near-zero RPO/RTO for region failure. Used for: globally distributed apps where users in each region write and read data locally.

DynamoDB Accelerator (DAX)

In-memory, write-through cache built specifically for DynamoDB. Read latency drops from milliseconds to microseconds. API-compatible with DynamoDB: swap in the DAX client SDK and point it at the DAX cluster endpoint, with little or no change to application logic. Best for: read-heavy apps, repeated reads of the same items, caching leaderboards/hot items. Not useful for write-heavy workloads, data that changes frequently, or reads that must be strongly consistent.

DynamoDB Design Tip (Single Table Design): In DynamoDB, the access pattern drives the data model, not the other way around (unlike SQL). Many experienced DynamoDB users put ALL entities in a single table with composite keys. E.g., PK="USER#user123", SK="PROFILE" for profile; PK="USER#user123", SK="ORDER#2024-01-15" for orders. This avoids expensive joins (DynamoDB doesn't have joins) and keeps related data in the same partition.
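The single-table key scheme can be tried out in memory before touching DynamoDB. The sketch below mimics a Query with a partition key plus an optional `begins_with` condition on the sort key; the items and the `query` helper are made up for illustration and are not the DynamoDB API.

```python
# Hypothetical items using the PK/SK convention from the tip above.
ITEMS = [
    {"PK": "USER#user123", "SK": "PROFILE", "Plan": "free"},
    {"PK": "USER#user123", "SK": "ORDER#2024-02-03", "Total": 49.99},
    {"PK": "USER#user123", "SK": "ORDER#2024-01-15", "Total": 299.99},
    {"PK": "USER#user456", "SK": "PROFILE", "Plan": "pro"},
]


def query(items: list, pk: str, sk_prefix: str = "") -> list:
    """Mimic Query: exact match on PK, optional begins_with on SK, sorted by SK."""
    matches = [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["SK"])


# Access pattern 1: everything about one user (profile + orders) in one call.
print([i["SK"] for i in query(ITEMS, "USER#user123")])
# Access pattern 2: only that user's orders, oldest first.
print([i["SK"] for i in query(ITEMS, "USER#user123", "ORDER#")])
```

Putting an ISO date in the sort key is what makes "orders in date order" fall out of the key design with no extra index.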

NoSQL — Equivalents

GCP

Firestore (document database, like DynamoDB but more flexible querying, real-time sync) | Bigtable (wide-column NoSQL for massive analytics/IoT — like DynamoDB at petabyte scale for time-series/analytics, used by Google internally).

Azure

Azure Cosmos DB: Multi-model NoSQL (can use SQL, MongoDB, Cassandra, Table, Gremlin APIs). Has global distribution with 99.999% SLA. Cosmos DB for NoSQL is closest to DynamoDB but with richer querying. Cosmos DB is Azure's flagship database — more flexible than DynamoDB in query capabilities.

Azure-Only

Azure Cosmos DB multi-model API: One service with MongoDB API compatibility, Cassandra API, Gremlin (graph) API, etc. If you have an existing MongoDB or Cassandra app, you can point it at Cosmos DB with minimal changes. AWS would require separate DocumentDB (MongoDB-compatible) or Keyspaces (Cassandra-compatible) services.

ElastiCache In-Memory Caching

What is ElastiCache?

Managed in-memory data store. Two engines: Redis and Memcached. Used to cache frequently accessed data, reducing database load, improving response times from seconds to milliseconds. Common pattern: check cache first → cache hit? return instantly. Cache miss? read from DB, write to cache, return.

Feature	Redis	Memcached
Data structures	Strings, Hashes, Lists, Sets, Sorted Sets, Pub/Sub, Streams, Geospatial	Strings only
Persistence	Optional (RDB snapshots, AOF log)	No persistence (pure cache)
Replication	Master-replica, Multi-AZ	No replication
Clustering	Redis Cluster (sharding)	Multi-node (simpler sharding)
Lua scripting	Yes	No
Use case	Sessions, leaderboards, pub/sub, real-time analytics, queues, rate limiting	Simple object caching, stateless horizontal scaling

Choose Redis almost always. Redis supports everything Memcached does, plus persistence, replication, and rich data structures. Memcached's only advantage: multi-threaded (better on very large nodes). In practice, 95% of use cases → Redis.

# Cache-aside example (Python + Redis via ElastiCache):
# `db` is a placeholder for your own database client; it is not defined here.
import redis, json
r = redis.Redis(host='my-cache.abc123.ng.0001.apse1.cache.amazonaws.com', port=6379)

def get_user_profile(user_id):
    # Try cache first
    cached = r.get(f'user:{user_id}')
    if cached:
        return json.loads(cached)  # Cache HIT: no database round trip

    # Cache MISS: query database
    profile = db.query("SELECT * FROM users WHERE id = %s", user_id)
    r.setex(f'user:{user_id}', 300, json.dumps(profile))  # Cache for 5 min (TTL = 300 s)
    return profile

Managed Cache — Equivalents

GCP

Cloud Memorystore: Managed Redis and Memcached. Same concepts. Fully compatible with open-source Redis/Memcached clients. Redis Cluster mode available.

Azure

Azure Cache for Redis: Managed Redis. Tiers: Basic (single node), Standard (replication), Premium (clustering, persistence, VNet injection), Enterprise (Redis Enterprise modules like RedisJSON, RediSearch).

AWS-M6

Monitoring & Observability

CloudWatch Metrics, Logs & Alarms

What is CloudWatch?

CloudWatch is AWS's primary observability service — a unified platform for metrics, logs, dashboards, alarms, and events. Almost every AWS service automatically sends metrics to CloudWatch. It's your first stop for understanding what's happening in your AWS environment.

CloudWatch Metrics

Time-series data points published by AWS services and your own apps. Organized by Namespace (e.g., AWS/EC2) → Metric Name (e.g., CPUUtilization) → Dimension (e.g., InstanceId=i-0abc123).

Default EC2 Metrics (every 5 min, free):

CPUUtilization, NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, StatusCheckFailed

Detailed Monitoring (every 1 min, extra cost):

Same metrics but at 1-minute resolution. Needed for faster Auto Scaling reactions.

Custom Metrics:

Publish your own metrics from app code or scripts. Standard resolution = 1 min. High resolution = 1 second (extra cost). Example: publish queue depth, active sessions, order processing rate.

# Publish custom metric (CLI):
aws cloudwatch put-metric-data \
  --namespace "MyApp/OrderService" \
  --metric-name "OrdersPerMinute" \
  --value 142 \
  --dimensions Environment=Production,Service=OrderService

# Publish from Python:
import boto3
cw = boto3.client('cloudwatch')
cw.put_metric_data(
    Namespace='MyApp/OrderService',
    MetricData=[{
        'MetricName': 'ActiveConnections',
        'Value': 89,
        'Unit': 'Count'
    }]
)

CloudWatch Logs

Log Groups: Container for log streams from the same source (e.g., /aws/lambda/my-function, /ecs/my-service)
Log Streams: Sequence of log events from a single source (one EC2 instance, one Lambda execution environment)
Retention: 1 day to 10 years (or never expire). Set per log group. Storage charged.
Insights: Query language for searching/analyzing logs. Run across multiple log groups.

# CloudWatch Logs Insights query — find all errors in last hour:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Find slow Lambda invocations (>5 seconds):
filter @type = "REPORT"
| parse @message "Duration: * ms" as duration
| filter duration > 5000
| stats avg(duration), max(duration), count() by bin(5m)

CloudWatch Alarms

Watches a metric and transitions between states based on thresholds. States: OK, ALARM, INSUFFICIENT_DATA. When ALARM state: send SNS notification, trigger Auto Scaling action, stop/reboot/terminate EC2, invoke Lambda.

# CLI: Create alarm that alerts when CPU > 80% for 5 consecutive minutes.
# --period 60             = 1-minute periods (needs Detailed Monitoring on EC2; basic is 5 min)
# --evaluation-periods 5  = 5 consecutive periods must breach
# (Comments can't follow a trailing backslash, so they live up here.)
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-EC2-prod" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --alarm-actions arn:aws:sns:ap-south-1:123456:AlertsTopic \
  --ok-actions arn:aws:sns:ap-south-1:123456:AlertsTopic
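How an alarm like the one above arrives at its state can be sketched in a few lines. This is a deliberately simplified model: real CloudWatch also supports "M out of N" datapoints and configurable missing-data handling, which are not modelled here. The function name and the use of None for a missing datapoint are assumptions for illustration.

```python
def alarm_state(datapoints: list, threshold: float = 80.0,
                evaluation_periods: int = 5) -> str:
    """Evaluate the most recent periods against a GreaterThanThreshold rule."""
    recent = datapoints[-evaluation_periods:]
    # Not enough periods yet, or a gap in the data: the alarm cannot decide.
    if len(recent) != evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    # ALARM only if EVERY evaluated period breaches the threshold.
    if all(p > threshold for p in recent):
        return "ALARM"
    return "OK"


print(alarm_state([85, 90, 82, 88, 91]))  # five breaching periods in a row
print(alarm_state([85, 90, 79, 88, 91]))  # one dip below the threshold
print(alarm_state([85, 90]))              # too little data so far
```

Requiring several consecutive breaching periods is what keeps a single short CPU spike from paging anyone.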

CloudWatch Agent

Install the CloudWatch agent on EC2 (or on-prem servers) to collect metrics not available by default: memory usage (RAM), disk usage, swap, process-level metrics. Also collects logs from any file (system logs, app logs, custom log files) and ships them to CloudWatch Logs.

CloudWatch Dashboards

Custom visualizations — widgets showing metrics graphs, numbers, text, alarms. Share dashboards across accounts. Create one per team/service. JSON-configurable. Free to view, charged per dashboard per month.

Monitoring — Equivalents

GCP

Cloud Monitoring (metrics, dashboards, alerting) | Cloud Logging (logs, like CloudWatch Logs) | Cloud Trace (distributed tracing) | Cloud Profiler (CPU/memory profiling). These are unified under Google Cloud Observability (formerly Stackdriver).

Azure

Azure Monitor (umbrella service for all observability — metrics, logs, alerts, like CloudWatch) | Log Analytics Workspace (centralized log store with Kusto query language — richer querying than CloudWatch Logs Insights) | Application Insights (APM for web apps — no direct AWS equivalent natively).

Azure-Only

Application Insights: Full APM (Application Performance Monitoring) natively integrated into Azure Monitor. Tracks request rates, failure rates, response times, dependency calls, exceptions, user sessions. AWS equivalent would be X-Ray + custom CloudWatch metrics — more complex to set up.

CloudTrail / X-Ray / EventBridge

AWS CloudTrail — API Audit Logging

Records API activity in your AWS account (via Console, CLI, SDK, or other services): who did what, when, from where. It is the forensic record of your AWS account. Event History keeps the last 90 days of management events by default; create a Trail to deliver logs to S3 for longer retention and to capture data events (such as S3 object reads), which are opt-in.

# CloudTrail log entry example — someone deleted an S3 bucket:
{
  "eventTime": "2024-01-15T14:32:01Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteBucket",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "john.dev",
    "arn": "arn:aws:iam::123456789:user/john.dev"
  },
  "sourceIPAddress": "103.210.45.67",  # The IP that made the call
  "requestParameters": {"bucketName": "prod-data-bucket"}
}

Enable CloudTrail: First Thing, Always. If you get hacked, CloudTrail logs tell you WHAT was done. Without it, you're blind. Enable a multi-region trail on day 1, send to S3 with MFA Delete enabled so attackers can't delete the logs. Also enable CloudTrail log file integrity validation.

AWS X-Ray — Distributed Tracing

Traces requests as they flow through distributed systems (multiple services, Lambda, DynamoDB, RDS, external APIs). Generates a service map showing which services call which. Identifies bottlenecks and errors. Essential for microservices — when a user's request goes through 5 services and fails, X-Ray shows exactly which service caused the error and how long each took.

Amazon EventBridge — Event Bus

A serverless event bus for routing events between AWS services, your own apps, and SaaS partners. Think of it as AWS's "if this then that" at scale. Events go to EventBridge → rules match events → targets receive events.

# EventBridge rule: "When EC2 instance state changes to STOPPED, run a Lambda"
Event Pattern:
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {"state": ["stopped"]}
}
Target: Lambda function → notify team on Slack

# Another example: Run DB backup Lambda every day at 2AM IST
Schedule: cron(30 20 * * ? *)   # 20:30 UTC = 02:00 IST
Target: Lambda function → trigger RDS snapshot
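EventBridge cron expressions are evaluated in UTC, so a local time has to be converted first. The sketch below shows the arithmetic behind the 02:00 IST example; `ist_to_utc_cron` is a hypothetical helper written for these notes, not part of any AWS SDK.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30))


def ist_to_utc_cron(hour_ist: int, minute_ist: int) -> str:
    """Build a daily EventBridge-style cron expression (UTC) for an IST wall-clock time."""
    # The date is arbitrary; only the time of day matters for a daily schedule.
    local = datetime(2024, 1, 15, hour_ist, minute_ist, tzinfo=IST)
    utc = local.astimezone(timezone.utc)
    return f"cron({utc.minute} {utc.hour} * * ? *)"


expr = ist_to_utc_cron(2, 0)
print(expr)  # cron(30 20 * * ? *)
```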

EventBridge is what replaced CloudWatch Events. Has: default event bus (AWS events), custom event buses (your app events), partner event buses (SaaS integrations like Datadog, PagerDuty).
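To make the "rules match events" step concrete, here is a minimal sketch of exact-value pattern matching. It is an assumption-laden simplification: the real EventBridge matcher supports many more operators (prefix, numeric ranges, anything-but, and so on), and the `matches` function and sample events are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every key in the pattern must match the event.

    A list in the pattern means "the event value must equal one of these";
    a nested dict recurses into the same key of the event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc123def456", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0abc123def456", "state": "running"}}

print(matches(PATTERN, stopped_event), matches(PATTERN, running_event))
```

Note that extra fields in the event (such as `instance-id`) do not prevent a match; only the fields named in the pattern are checked.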

AWS-M7

DevOps & Automation Tools

CloudFormation IaC — Native AWS

What is CloudFormation?

AWS's native IaC service. Define your AWS infrastructure in YAML or JSON templates. CloudFormation creates, updates, and deletes resources as a Stack. Resources in a stack are managed together — create the stack → all resources created. Delete the stack → all resources deleted.

Template Structure

AWSTemplateFormatVersion: '2010-09-09'
Description: 'My web app infrastructure'

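Parameters:
  InstanceType:
    Type: String
    Default: t3.micro
    AllowedValues: [t3.micro, t3.small, t3.medium]
  Environment:  # Referenced by the IsProd condition below
    Type: String
    Default: development
    AllowedValues: [development, production]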

Mappings:
  RegionAMI:
    ap-south-1:
      AMI: ami-0c55b159cbfafe1f0  # Amazon Linux 2

Conditions:
  IsProd: !Equals [!Ref Environment, production]

Resources:
  # The ONLY required section
  MyBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub 'my-app-${AWS::AccountId}'
      VersioningConfiguration:
        Status: Enabled

  MyEC2:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceType
      ImageId: !FindInMap [RegionAMI, !Ref AWS::Region, AMI]
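      IamInstanceProfile: !Ref EC2InstanceProfile  # Assumes an AWS::IAM::InstanceProfile resource named EC2InstanceProfile is also defined in this template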
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-web'

Outputs:
  BucketName:
    Value: !Ref MyBucket
    Export:
      Name: !Sub '${AWS::StackName}-BucketName'

CloudFormation Key Concepts

Change Sets

Before updating a stack, create a Change Set to preview what CloudFormation will actually do: which resources will be added, modified, or deleted. Always use Change Sets in production. A modification that requires replacement (e.g., changing an RDS property that forces CloudFormation to create a new DB instance) means data loss if you're not prepared.
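The idea behind a Change Set can be sketched as a diff between the current and proposed resource sections of a template. This is only a toy under stated assumptions: `preview_changes` and the sample resources are hypothetical, and a real Change Set also reports whether each modification requires replacement, which this sketch does not.

```python
def preview_changes(current: dict, proposed: dict) -> dict:
    """Classify resources (keyed by logical ID) as Add, Modify, or Remove."""
    changes = {"Add": [], "Modify": [], "Remove": []}
    for logical_id in sorted(set(current) | set(proposed)):
        if logical_id not in current:
            changes["Add"].append(logical_id)
        elif logical_id not in proposed:
            changes["Remove"].append(logical_id)
        elif current[logical_id] != proposed[logical_id]:
            changes["Modify"].append(logical_id)
    return changes


current_resources = {
    "MyBucket": {"Type": "AWS::S3::Bucket"},
    "MyEC2": {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "t3.micro"}},
    "OldQueue": {"Type": "AWS::SQS::Queue"},
}
proposed_resources = {
    "MyBucket": {"Type": "AWS::S3::Bucket"},
    "MyEC2": {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "t3.small"}},
    "NewTopic": {"Type": "AWS::SNS::Topic"},
}

changes = preview_changes(current_resources, proposed_resources)
print(changes)
```

The `Remove` entry for `OldQueue` is the kind of surprise a preview is meant to catch before the update runs.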

Stack Sets

Deploy CloudFormation stacks across multiple AWS accounts and regions in one operation. Managed from a central admin account. Used for: applying security baseline to all accounts in an org, deploying global app infrastructure to 5 regions at once.

Drift Detection

Detects when actual resource configuration differs from CloudFormation's expected state (someone made a manual console change). Drift detection identifies what changed so you can fix it. Best practice: all changes through CloudFormation only — treat console as read-only for production.

Helper Scripts (cfn-signal, cfn-init)

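cfn-init: Reads the AWS::CloudFormation::Init metadata attached to the resource and applies it on the instance (install packages, write files, start services). cfn-signal: Lets an EC2 instance signal CloudFormation that it has finished initializing (bootstrapping complete). CloudFormation waits for the signal (via a CreationPolicy or a WaitCondition) before marking the resource as created. Without this, CloudFormation marks the EC2 instance as created the moment it starts, even if your app isn't ready yet.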

# In EC2 UserData:
#!/bin/bash
/opt/aws/bin/cfn-init -v --stack my-stack --resource MyEC2 --region ap-south-1
# ... install and configure app ...
/opt/aws/bin/cfn-signal -e $? --stack my-stack --resource MyEC2 --region ap-south-1

CodePipeline / CodeBuild / CodeDeploy CI/CD on AWS

The AWS CI/CD Toolchain

AWS Native CI/CD Pipeline

  GitHub / CodeCommit
          │ Code push
          ▼
  ┌─────────────────────────────────────────────────────────────────────┐
  │                        AWS CodePipeline                             │
  │                                                                     │
  │  Stage 1: SOURCE         Stage 2: BUILD         Stage 3: DEPLOY    │
  │  ┌─────────────┐         ┌─────────────┐        ┌─────────────┐    │
  │  │  GitHub /   │         │ CodeBuild:  │        │ CodeDeploy: │    │
  │  │ CodeCommit  │──────►  │ - Install   │──────► │ - EC2/ECS/  │    │
  │  │  Webhook    │         │ - Test      │        │   Lambda    │    │
  │  └─────────────┘         │ - Build     │        │ - Blue/Green│    │
  │                          │ - Push ECR  │        │ - Canary    │    │
  │                          └─────────────┘        └─────────────┘    │
  │                               ↑                       ↑            │
  │                        buildspec.yml            appspec.yml        │
  └─────────────────────────────────────────────────────────────────────┘

CodeBuild

Fully managed build service. Runs your build commands in a container. Defined by buildspec.yml in your repo root. Scales automatically — no build servers to manage. Charged per build minute.

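# buildspec.yml example (Node.js app → Docker → ECR)
# AWS_ACCOUNT_ID, IMAGE_NAME, ECR_URI and CONTAINER_NAME are environment variables
# you define on the CodeBuild project; AWS_REGION and CODEBUILD_RESOLVED_SOURCE_VERSION are built in.
version: 0.2
phases:
  pre_build:
    commands:
      - echo Logging in to ECR...
      # Kept on one line: a trailing backslash inside a YAML plain scalar is not a shell line continuation
      - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
  build:
    commands:
      - echo Installing dependencies and running tests...
      - npm ci
      - npm test
      - echo Building Docker image...
      - docker build -t $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker tag $IMAGE_NAME:$CODEBUILD_RESOLVED_SOURCE_VERSION $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      - docker push $ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION
      # Write the artifact file; without this step the artifacts section has nothing to upload
      - printf '[{"name":"%s","imageUri":"%s"}]' "$CONTAINER_NAME" "$ECR_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION" > imagedefinitions.json
      - echo Build completed
artifacts:
  files:
    - imagedefinitions.json  # Read by the CodePipeline ECS deploy action to pick the image for each container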

CodeDeploy

Automates application deployments to EC2, Lambda, or ECS. Supports deployment strategies: in-place, blue/green, canary, linear. Defined by appspec.yml.

# appspec.yml for EC2 deployment
version: 0.0
os: linux
files:
  - source: /dist
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 30
  AfterInstall:
    - location: scripts/install_deps.sh
      timeout: 60
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 30
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 30

Auto Scaling Groups / Launch Templates

Auto Scaling Group (ASG)

An ASG maintains a fleet of EC2 instances. You define min/max/desired count. ASG continuously monitors health, replaces unhealthy instances automatically, and scales based on policies.

Launch Template vs Launch Configuration

Launch Template (modern, prefer this): Defines EC2 parameters (AMI, instance type, key pair, security groups, user data). Supports versioning, can specify multiple instance types, supports Spot + On-Demand mix.

Launch Configuration (legacy, deprecated): Older, no versioning, only one instance type. Always use Launch Templates for new ASGs.

ASG Scaling Policies

Policy Type	How it works	Best for
Simple Scaling	Alarm triggers: add/remove N instances. Cooldown period before next action.	Rarely used now — slow response, blunt
Step Scaling	Different scaling magnitudes based on alarm severity. CPU 70-80%: add 1. CPU 80-90%: add 3. CPU >90%: add 5.	Variable load spikes with different intensities
Target Tracking	Keep a metric at a target value. "Keep average CPU at 60%" — ASG figures out how many instances to add/remove.	Most common — easy to configure, handles scale-in/out automatically
Scheduled Scaling	Pre-set scaling at specific times. Scale out at 8AM, scale in at 10PM.	Predictable traffic patterns (business hours, weekly spikes)
Predictive Scaling	ML-based forecasting using historical data. Pre-scales before expected traffic increase.	Cyclical/recurring load patterns
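The arithmetic behind two of these policies can be sketched as follows. The step thresholds come from the Step Scaling row above. The target-tracking formula is a simple proportional approximation assumed for illustration; AWS does not publish its exact algorithm, and both function names are hypothetical.

```python
import math


def target_tracking_desired(current_instances: int, current_metric: float,
                            target_metric: float, min_size: int, max_size: int) -> int:
    """Approximate target tracking: scale the fleet in proportion to metric / target."""
    raw = math.ceil(current_instances * current_metric / target_metric)
    # The result is always clamped to the ASG's min/max bounds.
    return max(min_size, min(max_size, raw))


def step_scaling_adjustment(cpu_percent: float) -> int:
    """Step scaling using the example bands: 70-80% add 1, 80-90% add 3, above 90% add 5."""
    if cpu_percent > 90:
        return 5
    if cpu_percent >= 80:
        return 3
    if cpu_percent >= 70:
        return 1
    return 0


# 4 instances at 90% CPU with a 60% target: 4 * 90 / 60 = 6 instances.
print(target_tracking_desired(4, 90, 60, min_size=2, max_size=20))
print(step_scaling_adjustment(85))
```

The same proportional rule also handles scale-in: 10 instances at 30% CPU with a 60% target works out to 5.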

Mixed Instance Types & Spot

Launch Templates support specifying multiple instance types and a mix of On-Demand + Spot instances in an ASG. E.g., "run 2 On-Demand as baseline, fill the rest with Spot instances from this list: m5.xlarge, m5a.xlarge, m6i.xlarge." If a Spot instance is interrupted, the ASG launches a replacement from another Spot pool in your list. It does not automatically fall back to On-Demand, so list several instance types to keep Spot capacity available. Major cost savings for stateless workloads.

# CDK example in Python (simplified): Mixed instance ASG.
# Assumes `vpc` (ec2.Vpc) and `lt` (ec2.LaunchTemplate) are defined earlier in the stack.
from aws_cdk import aws_autoscaling as autoscaling, aws_ec2 as ec2

asg = autoscaling.AutoScalingGroup(self, "MyASG",
    vpc=vpc,
    min_capacity=2, max_capacity=20,
    mixed_instances_policy=autoscaling.MixedInstancesPolicy(
        instances_distribution=autoscaling.InstancesDistribution(
            on_demand_base_capacity=2,                    # Always keep 2 On-Demand
            on_demand_percentage_above_base_capacity=20,  # 20% On-Demand, 80% Spot above base
            # Prefer Spot pools with the most spare capacity (fewer interruptions), not the lowest price
            spot_allocation_strategy=autoscaling.SpotAllocationStrategy.CAPACITY_OPTIMIZED
        ),
        launch_template=lt,
        launch_template_overrides=[
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m5a.xlarge")),
            autoscaling.LaunchTemplateOverrides(instance_type=ec2.InstanceType("m6i.xlarge")),
        ]
    )
)
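To see what the base capacity and percentage settings mean for a given fleet size, the sketch below works through the split. `split_capacity` is a hypothetical helper, and rounding the On-Demand share up is an assumption made here for illustration; check the EC2 Auto Scaling documentation for the exact rounding behavior.

```python
import math


def split_capacity(desired: int, on_demand_base: int, on_demand_pct_above_base: int) -> tuple:
    """Return (on_demand, spot) instance counts for a desired capacity."""
    base = min(desired, on_demand_base)  # the base is filled with On-Demand first
    above_base = desired - base
    # Assumption: the On-Demand share of the remainder is rounded up.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    return base + on_demand_above, above_base - on_demand_above


# 12 instances, base of 2, 20% On-Demand above base:
# 2 base + 20% of the remaining 10 = 4 On-Demand, 8 Spot.
print(split_capacity(12, 2, 20))
```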

Auto Scaling — Equivalents

GCP

Managed Instance Groups (MIG) with Autoscaler. Uses Instance Templates (like Launch Templates). Supports scale out on CPU, LB capacity, custom metrics. Also has Spot VMs integration in MIGs.

Azure

Azure Virtual Machine Scale Sets (VMSS). Like ASG but Azure-flavored. Supports Flex (flexible orchestration) and Uniform orchestration modes. Auto-scale based on metrics or schedule. Spot instance support in VMSS.

SSM Session Manager / Systems Manager

AWS Systems Manager (SSM)

A suite of tools for managing EC2 instances (and on-prem servers) at scale. The SSM Agent runs on your instances and connects to the SSM service. Key features:

Session Manager

Browser-based or CLI shell access to EC2 instances with no SSH, no bastion host, no open inbound ports. Authentication via IAM. Sessions can be logged to CloudWatch Logs or S3 once session logging is configured. The modern way to access EC2, and a significant security improvement over SSH.

# Start session (CLI) - no SSH keys needed
aws ssm start-session --target i-0abc123def456

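# Port forwarding via SSM (e.g., connect to RDS in private subnet)
# The instance acts as a jump host. AWS-StartPortForwardingSession only forwards to a port on
# the instance itself; to reach RDS, use the ToRemoteHost document and pass the DB endpoint.
# The host value below is a placeholder: replace it with your RDS endpoint.
aws ssm start-session \
  --target i-0abc123def456 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.example.ap-south-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'
# Now: psql -h localhost -p 5432 -U admin mydb  (via SSM tunnel)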

Parameter Store

Already covered in security section — stores config and secrets. Accessible from EC2 instances, Lambda, ECS tasks via SSM API.

Run Command

Execute shell commands on one or multiple EC2 instances without SSH. Run across hundreds of instances using tags. Output captured in CloudWatch. Good for: emergency patches, config changes, one-off maintenance tasks.

# Run command on all tagged "Environment=Production" instances
aws ssm send-command \
  --targets "Key=tag:Environment,Values=Production" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["yum update -y kernel", "reboot"]'

Patch Manager

Automates OS patching across your fleet. Define patch baselines (which patches to approve), maintenance windows (when to patch), and patch groups. Generates compliance reports. Integrates with Run Command to actually apply patches.

State Manager

Ensures your instances are in a defined state (software installed, config files present, services running). Like configuration management (Ansible/Chef) but AWS-native. Uses SSM Documents to define state.

AWS-M8

Messaging & Async Services

SQS / SNS / EventBridge Decoupling Patterns

Why Async Messaging?

In a synchronous architecture, Service A calls Service B directly. If B is slow or down → A is slow or failing too. With async messaging, A puts a message in a queue and returns immediately. B processes when it can. They're decoupled — A doesn't care about B's state.

Sync vs Async Architecture

  SYNC:  Order Service ──HTTP──► Inventory Service ──HTTP──► Notification Service
         (if either downstream fails → order fails, user gets error)

  ASYNC: Order Service ──► SQS Queue ◄── Inventory Service (processes when ready)
              │
              └──► SNS Topic ──fan-out──► Email Notification Lambda
                                     └──► Push Notification Lambda
                                     └──► Analytics Lambda

Amazon SQS — Simple Queue Service

Fully managed message queue. Producer sends messages, consumer polls and processes them, deletes after processing. Guarantees at-least-once delivery (same message might be delivered more than once — make consumers idempotent).

Queue Types

Standard Queue

Nearly unlimited throughput. Messages delivered at least once, with best-effort ordering (order is not guaranteed). Best for: high-throughput workloads where occasional duplicate processing is OK. Default choice.

FIFO Queue

Exactly-once processing. Messages delivered exactly once, strictly in order. Throughput: 3,000 msg/s with batching (300/s without). Best for: financial transactions, order processing, any use case where order and deduplication matter.

Key SQS Concepts

Visibility Timeout: When a consumer picks up a message, it becomes invisible for this duration. If not deleted within timeout (consumer crashed), it reappears for another consumer. Default 30s, max 12 hours. Set to > max processing time.
Dead Letter Queue (DLQ): After N failed processing attempts (maxReceiveCount), message moves to DLQ. Use DLQ to capture and analyze unprocessable messages without losing them.
Long Polling: Consumer waits up to 20 seconds for messages instead of returning empty immediately. Reduces empty API responses and costs.
Message Retention: 1 minute to 14 days. Default 4 days. Plan accordingly.
Max Message Size: 256KB. For larger payloads, store in S3 and put S3 reference in the message.
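How visibility timeout and the DLQ interact is easier to see in a small simulation. The sketch below is an in-memory toy with a logical clock, not the SQS API; `ToyQueue`, `ToyMessage`, and the chosen timings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare by identity so list.remove() targets this exact message
class ToyMessage:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # logical time before which the message is hidden


@dataclass
class ToyQueue:
    visibility_timeout: float = 30.0
    max_receive_count: int = 3
    messages: List[ToyMessage] = field(default_factory=list)
    dead_letters: List[ToyMessage] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(ToyMessage(body))

    def receive(self, now: float) -> Optional[ToyMessage]:
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # still in flight with another consumer
            if msg.receive_count >= self.max_receive_count:
                # Already received the maximum number of times: move to the DLQ.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: ToyMessage) -> None:
        self.messages.remove(msg)  # the consumer confirms successful processing


queue = ToyQueue(visibility_timeout=30, max_receive_count=2)
queue.send("order-1")
first = queue.receive(now=0)    # delivered, hidden until t=30
hidden = queue.receive(now=10)  # nothing visible yet
second = queue.receive(now=30)  # consumer never deleted it, so it reappears
third = queue.receive(now=60)   # receive limit reached: moved to the DLQ instead
print(hidden, third, len(queue.dead_letters))
```

A consumer that calls `delete` before the timeout expires would stop the cycle after the first delivery.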

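# SQS Producer (send message):
import boto3, json

sqs = boto3.client('sqs')
# Example FIFO queue URL with a placeholder account ID; FIFO queue names must end in ".fifo"
QUEUE_URL = 'https://sqs.ap-south-1.amazonaws.com/123456789012/OrderQueue.fifo'

sqs.send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({
        'orderId': 'ORD-2024-001',
        'userId': 'user123',
        'total': 299.99
    }),
    MessageGroupId='user123',              # FIFO only: same group = in-order
    MessageDeduplicationId='ORD-2024-001'  # FIFO only: required unless content-based deduplication is enabled
)

# SQS Consumer (receive and delete):
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # Long polling
    VisibilityTimeout=60
)
for msg in response.get('Messages', []):
    process(json.loads(msg['Body']))  # process() is your own handler
    # Delete only after successful processing; otherwise the message reappears after the visibility timeout
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])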
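Because delivery is at-least-once, the handler should tolerate seeing the same message twice. One common approach, sketched below under simplifying assumptions, is to record a business key and skip repeats. The `handle` function, the `sku`/`qty` fields, and the in-memory set are hypothetical; a real consumer would keep the processed keys in a durable store.

```python
import json

processed_order_ids = set()  # Assumption: in-memory for illustration only
stock = {"widget": 10}


def handle(message: dict) -> bool:
    """Apply the order once; return False if this was a duplicate delivery."""
    order = json.loads(message["Body"])
    if order["orderId"] in processed_order_ids:
        return False  # duplicate: skip the side effects
    stock[order["sku"]] -= order["qty"]
    processed_order_ids.add(order["orderId"])
    return True


delivery = {"Body": json.dumps({"orderId": "ORD-2024-001", "sku": "widget", "qty": 1})}
first_result = handle(delivery)
second_result = handle(delivery)  # the same message delivered again
print(first_result, second_result, stock["widget"])
```

The stock is decremented once even though the message arrived twice, which is the property "idempotent consumer" refers to.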

Amazon SNS — Simple Notification Service

Pub/Sub messaging. Publishers send to a Topic. Subscribers receive all messages published to that topic. Fan-out: one message → many subscribers. Supports: SQS, Lambda, HTTP/HTTPS, email, SMS, mobile push (APNS, FCM).

# SNS Fan-out: Order created → notify multiple systems
SNS Topic: "OrderCreated"
├── SQS: InventoryQueue  → Inventory Lambda (update stock)
├── SQS: ShippingQueue   → Shipping Lambda (create shipment)
├── Lambda: EmailSender  → Send confirmation email
└── Lambda: Analytics    → Record to analytics DB

# Each subscriber independently processes the same event

SNS → SQS Fan-out Pattern: Best practice is to use SNS + SQS together. SNS fans out to multiple SQS queues. Each queue has its own consumer. This gives you: fan-out, durable storage (SQS retains messages if consumer is down), and independent scaling of each consumer. Don't fan out SNS directly to Lambda in high-throughput scenarios; use SQS as a buffer.
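The fan-out behavior can be sketched with an in-memory topic that copies each published message into every subscribed queue. This is a toy model, not the SNS or SQS API; `ToyTopic` and the queue names are hypothetical.

```python
from collections import deque


class ToyTopic:
    """Toy pub/sub topic: every subscriber queue receives its own copy of each message."""

    def __init__(self) -> None:
        self.subscriber_queues = {}

    def subscribe(self, name: str) -> deque:
        self.subscriber_queues[name] = deque()
        return self.subscriber_queues[name]

    def publish(self, message: dict) -> None:
        for queue in self.subscriber_queues.values():
            queue.append(dict(message))  # shallow copy per subscriber


topic = ToyTopic()
inventory_q = topic.subscribe("InventoryQueue")
shipping_q = topic.subscribe("ShippingQueue")

topic.publish({"orderId": "ORD-2024-001"})

handled = inventory_q.popleft()  # the inventory consumer processes its copy
print(handled, len(inventory_q), len(shipping_q))
```

The shipping queue still holds its copy after the inventory consumer has finished, which is why a slow or offline consumer does not affect the others.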

Amazon Kinesis — Real-Time Streaming

For high-volume, real-time data streaming (millions of events/sec). Unlike SQS (queue: each message is consumed once, then deleted), Kinesis retains records for 24 hours by default (configurable up to 365 days), and multiple consumers can read the same stream.

Kinesis Data Streams: Real-time streaming. Producers write records, consumers read. Partitioned by shards (1MB/s write, 2MB/s read per shard). Ordered within a shard. Good for: real-time analytics, click stream, log ingestion, IoT telemetry.
Kinesis Data Firehose (now named Amazon Data Firehose): Fully managed delivery of streaming data to S3, Redshift, OpenSearch (formerly Elasticsearch), Splunk. No consumers to write: batching and delivery are automatic. Buffer size and interval configurable. Good for: log delivery to S3 for analysis, streaming ETL.
Kinesis Data Analytics (Managed Service for Apache Flink): Run SQL or Flink queries on streaming data in real time. Good for: real-time dashboards, anomaly detection, streaming aggregations.
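The per-shard limits above translate directly into a sizing calculation. The sketch below assumes every consumer reads the full stream and shares the 2MB/s per-shard read limit (that is, no enhanced fan-out); `shards_needed` is a hypothetical helper and ignores per-shard record-count limits.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # per-shard write limit from the text
READ_MB_PER_SHARD = 2.0   # per-shard read limit, shared by all consumers in this sketch


def shards_needed(write_mb_per_s: float, consumers: int) -> int:
    """Estimate the shard count from write throughput and number of full-stream consumers."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    # Each consumer reads everything that is written.
    for_reads = math.ceil(write_mb_per_s * consumers / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)


# 5 MB/s written, 3 consumers: writes need 5 shards, reads need ceil(15 / 2) = 8.
print(shards_needed(5, 3))
```

With one or two consumers the write limit dominates; from three consumers upward the read limit becomes the bottleneck.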

Feature	SQS	SNS	Kinesis Data Streams
Pattern	Queue (consume once)	Pub/Sub (fan-out)	Stream (replay, multiple consumers)
Retention	14 days max	No retention	1-365 days
Ordering	FIFO (with FIFO queue)	No guarantee	Ordered per shard
Replay	No	No	Yes (replay from any position)
Throughput	Unlimited	Unlimited	1MB/s per shard
Use case	Task queues, job processing	Notifications, fan-out	Real-time analytics, event sourcing

Messaging Services — Equivalents

GCP

Cloud Pub/Sub: Combines SQS + SNS in one service (pub/sub model with at-least-once delivery, pull or push subscriptions). Also Cloud Tasks (task queues, more like SQS — delayed execution, rate limits, HTTP targets).

Azure

Azure Service Bus (enterprise messaging — like SQS/SNS with richer features: sessions, dead-lettering, transactions, topic subscriptions = fan-out) | Azure Event Grid (event routing, like EventBridge) | Azure Event Hubs (high-throughput streaming, like Kinesis Data Streams — compatible with Apache Kafka protocol).

Azure-Only

Azure Event Hubs Kafka compatibility: Azure Event Hubs has a Kafka-compatible API. Migrate existing Kafka producers/consumers to Event Hubs with minimal code changes. AWS offers Amazon MSK (Managed Kafka) but it's a full Kafka cluster — heavier. Event Hubs is lighter and Kafka-compatible at the same time.

⚡ Revision — Cloud Concepts

CC-R1

Cloud Fundamentals — Quick Review

1 What is Cloud & Service Models

Cloud = renting computing over the internet. NIST 5 traits: On-demand self-service, Broad network access, Resource pooling (multi-tenancy), Rapid elasticity, Measured service (pay-per-use).

CAPEX vs OPEX: Traditional IT = CAPEX (buy hardware upfront). Cloud = OPEX (pay monthly). Cloud → no wasted capital, faster iteration.

IaaS: You manage OS up. Provider: hardware + virtualization. AWS EC2, GCP GCE, Azure VMs.

PaaS: You manage app + data. Provider: everything else. AWS Elastic Beanstalk/Lambda, GCP App Engine/Cloud Run, Azure App Service.

SaaS: Just use the app. Gmail, Slack, Salesforce, AWS WorkMail.

Deployment models: Public (AWS/GCP/Azure), Private (on-prem, OpenStack), Hybrid (public + on-prem), Multi-Cloud (multiple public providers). Multi-cloud ≠ Hybrid cloud.

2 Shared Responsibility & Global Infrastructure

Shared Responsibility: AWS = Security OF the cloud (hardware, DCs, hypervisor). You = Security IN the cloud (IAM, OS patches, data, encryption, firewall config).

More managed service = AWS takes more responsibility. EC2 (you patch OS) → Lambda (AWS manages OS) → SaaS (AWS manages everything).

Region: Geographically isolated cluster of DCs. Independent of each other. 33+ regions. Data does NOT auto-replicate across regions.

AZ: One or more DCs with independent power/network within a Region. Connected by low-latency (single-digit ms) private fiber. Deploy across 2+ AZs for HA.

Edge Locations: 600+ CDN cache servers globally (CloudFront, Route 53, Shield). More than regions — closer to end users.

Local Zones: AWS compute in specific cities (sub-10ms). Wavelength: AWS in 5G networks. Outposts: AWS rack in YOUR datacenter.

Region selection factors: 1) Compliance/data residency (non-negotiable), 2) Latency to users, 3) Service availability, 4) Pricing. us-east-1 = new services first, usually cheapest.

Azure Paired Regions: Azure-specific — each region paired with another for automatic DR. AWS has no equivalent (you manually configure cross-region replication).

CC-R2

HA, Scaling & DR — Quick Review

3 HA vs FT vs DR & Scaling

HA (High Availability): System recovers quickly from failure. Multi-AZ = HA. 99.9% → 8.76 hrs downtime/yr. 99.99% → 52 min/yr.
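The downtime figures follow from simple arithmetic on the availability percentage. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime allowed by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


three_nines = downtime_hours_per_year(99.9)        # about 8.76 hours
four_nines_min = downtime_hours_per_year(99.99) * 60  # about 52.56 minutes
print(round(three_nines, 2), round(four_nines_min, 2))
```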

Fault Tolerance (FT): Zero downtime even on failure. Harder, more expensive. Active-active multi-region setups approach FT.

RPO = max acceptable data loss (how much data can you afford to lose?). RTO = max acceptable downtime (how fast must you recover?).

DR Strategies (cheapest to costliest, fastest RTO last): Backup & Restore → Pilot Light → Warm Standby → Active-Active (Multi-Site).

Vertical scaling = bigger server. Single point of failure, has hardware limit, often requires downtime. Horizontal scaling = more servers. Resilient, near-unlimited, needs stateless app.

Elasticity = auto scale up AND back down. Requires stateless apps. Store sessions in Redis/DynamoDB, not local server memory.

Azure Site Recovery (ASR): Azure's managed DR service (VM replication + automated failover plans). The closest AWS equivalent is AWS Elastic Disaster Recovery (DRS), which replicates servers into AWS and automates failover.

CC-R3

Networking, Security & Modern Patterns — Quick Review

4 VPC, Load Balancing & CDN

VPC = logically isolated private network. Public subnet (has IGW route) vs Private subnet (no direct internet route). Route table controls where traffic goes.

Security Group: Instance-level, stateful, allow-only rules. NACL: Subnet-level, stateless (both directions), allow AND deny, numbered rule order. Explicit Deny in NACL overrides SG allow.

NAT Gateway: Allows private instances outbound internet. NOT inbound. Placed in PUBLIC subnet with EIP. Deploy one per AZ for HA.

IGW: VPC ↔ internet. Required for any public access. Free. One per VPC. VPN Gateway: VPC ↔ on-prem (IPsec over internet). Direct Connect: Dedicated private fiber, consistent bandwidth.

L4 LB (NLB): Routes by IP+port. Ultra-fast, static IP, non-HTTP. L7 LB (ALB): Routes by HTTP path/host/headers. Smart routing, microservices.

CDN: Cache static content at edge servers close to users. Cache hit = fast. Cache miss = fetch from origin, cache it. TTL controls freshness. Invalidation expires cache early.

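GCP VPC is global (one VPC can hold subnets in any region; each subnet itself is regional). AWS VPC is regional. Azure VNet = regional (like AWS). GCP Firewall Rules are defined at the VPC network level and target instances by tag or service account (not attached per-instance like SGs).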

5 Security Concepts, IaC & Modern Patterns

Authentication = who are you. Authorization = what can you do. Least Privilege = only grant what's needed. MFA = password + something you have.

Encryption at rest = data encrypted on disk (EBS, S3, RDS all support it). Encryption in transit = TLS/HTTPS for data moving over network. Both required for proper security posture.

Zero Trust: Trust nothing, verify everything. Even internal traffic authenticated. mTLS, service meshes (Istio), strict IAM = Zero Trust implementation.

IaC: Infrastructure defined as code. Declarative (CloudFormation, Terraform) vs Imperative (CDK, scripts). Versionable, reproducible, reviewable in PRs. Terraform = most popular, multi-cloud.

Serverless: No servers to manage. Event-driven. Pay per invocation. Scale to zero. Cold start problem (200ms-2s first invocation). Lambda = AWS, Cloud Functions = GCP, Azure Functions = Azure.

Containers vs VMs: Containers share host OS kernel (~50MB overhead, seconds to start). VMs have full OS (~1-2GB overhead, minutes to start). Docker = container standard. Kubernetes = orchestration standard.

CI/CD: CI = auto build+test on commit. CD = auto deploy to staging (manual prod). Continuous Deployment = auto deploy to prod. Blue/Green = instant rollback. Canary = gradual traffic shift. Feature flags = code-level rollout.

GCP Cloud Run: Serverless containers (any Docker image, scale to zero). More flexible than Lambda, no cold-start shim. Azure DevOps: All-in-one CI/CD + project management. Azure Container Apps: Serverless K8s-based containers like Cloud Run.

⚡ Revision — AWS Services

AWS-R1

Compute — Quick Review

1 EC2 & Lambda

EC2 = virtual machine. AMI = OS template. Instance families: t (burstable), m (general), c (compute), r (memory), i (storage), p/g (GPU). Graviton (ARM) = 20-40% cheaper than x86 — use when possible.

EC2 Pricing: On-Demand (no commit) → Savings Plans/RI (1-3yr commit, up to 72% off) → Spot (up to 90% off, interruptible). Spot = use for fault-tolerant batch jobs, CI runners.

User Data = script runs on first boot (installs software). IMDS = query instance metadata from inside (169.254.169.254). Always use IMDSv2 (token-based).

Placement Groups: Cluster (same AZ, low latency, HPC) | Spread (max 7/AZ, different hardware, critical VMs) | Partition (large distributed apps like Kafka).

EIP = static public IP. Since Feb 2024 every public IPv4 address is billed hourly, attached or not, so release EIPs you don't need. Use LB DNS name instead for production. EBS = persistent block storage (attaches like hard drive). Instance Store = ephemeral, lost on stop/terminate, faster.

Lambda: Serverless. Event-driven triggers: S3, API GW, SQS, DynamoDB Streams, EventBridge, SNS. Max 15 min timeout. 128MB-10GB memory. Cold start: first invocation ~200ms-2s. Provisioned Concurrency = keep warm.

Lambda execution role = IAM role Lambda assumes. Never put access keys in Lambda code; use the execution role. Layers = shared dependencies (max 5 per function). Reserved concurrency = caps a function's concurrency, which also protects other functions from being starved.

ECS: Task Definition (blueprint) → Task (running container) → Service (desired count + LB + rolling deploy) → Cluster. EKS: Managed K8s. Fargate: Serverless nodes for ECS/EKS — no EC2 management.

ECR = private Docker registry. Authenticate with aws ecr get-login-password | docker login. Scans images for CVEs. Image URI format: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>

AWS-R2

Storage — Quick Review

2 S3, EBS, EFS

S3 = object storage. Buckets are globally unique. Objects have keys (the "path"). NOT a filesystem — flat namespace with "/" in key names giving folder illusion.

S3 Storage Classes: Standard → Standard-IA (30-day min) → Glacier Instant → Glacier Flexible (minutes-hours retrieval) → Glacier Deep Archive (12hr retrieval). Intelligent-Tiering = auto-moves between tiers.

S3 Versioning: Once enabled, can only suspend (not disable). Delete = adds delete marker (old versions remain). Lifecycle rules = auto-transition or expire objects.

S3 Replication: CRR (cross-region, for DR/compliance) and SRR (same-region). Requires versioning. Async. Does NOT replicate existing objects — only new uploads after enabling.

Pre-signed URLs: Temporarily grant access to private S3 objects. Backend generates signed URL with expiry. User downloads directly from S3. Also supports direct-upload from browser to S3 (pre-signed PUT).

EBS: Block storage, attach to ONE EC2 (usually). gp3 = preferred over gp2 (separate IOPS from size, 20% cheaper). io2 Block Express = up to 256K IOPS for DBs. EBS volumes are AZ-specific.

EBS Snapshots: Incremental, stored in S3 (AWS-managed). Can copy cross-region. Create new volume from snapshot in any AZ. Always snapshot before risky changes.

EFS = shared NFS. Multiple EC2 instances across multiple AZs mount simultaneously. Auto-scales. Regional file systems store data redundantly across multiple AZs. ~3x more expensive than EBS per GB. Use for: CMS files, shared assets, ML training data across GPU nodes.

Storage comparison: S3 = objects over HTTP ($0.023/GB) | EBS = block/disk for single EC2 ($0.08/GB) | EFS = shared NFS for multiple EC2 ($0.30/GB) | Instance Store = ephemeral NVMe, included in instance price.

AWS-R3

Networking — Quick Review

3 VPC, Route 53, CloudFront, ELB

VPC Peering: Connect two VPCs, private IP routing. Non-transitive (A↔B, B↔C, A cannot reach C). Transit Gateway: Hub-and-spoke, transitive, connects N VPCs + on-prem. Use TGW over VPC peering at scale.

VPC Endpoints: Gateway (free, S3/DynamoDB only, route table entry) | Interface/PrivateLink (ENI in subnet, charged, supports 100+ services). Traffic stays on AWS network — no NAT GW needed for these services.

VPC Flow Logs: IP traffic metadata (not packet content) per ENI. Goes to CloudWatch/S3. Use to debug connectivity issues and security analysis. Look for REJECT entries to find blocked connections.

Alias record: AWS-specific CNAME substitute. Can be used on root domain (zone apex). Points to AWS resources. FREE queries (unlike CNAME). Use Alias for ALB, CloudFront, S3 website endpoints.

CloudFront: Distribution = config. Origin = S3/ALB/EC2. Cache Behavior = path pattern rules. OAC = CloudFront-only S3 access (modern, replaces OAI). Lambda@Edge = full Lambda at PoPs. CloudFront Functions = lightweight JS at edge.

ALB vs NLB: ALB = L7, path/host/header routing, HTTPS termination, microservices. NLB = L4, TCP/UDP, static IP, millions RPS, ultra-low latency, gaming/IoT. Both do health checks and AZ distribution.

Site-to-Site VPN: IPsec over internet. Quick to set up (hours). Up to ~1.25Gbps. Direct Connect: Private fiber. Weeks to set up. 1-100Gbps, consistent. DX + VPN as backup = best practice.

AWS-R4

IAM & Security — Quick Review

4 IAM, KMS, Secrets, WAF

IAM: Users (permanent creds) | Groups (users only, no roles) | Roles (temp creds, assumed by services/users/cross-account) | Policies (JSON permission docs). Global service, free.

Policy evaluation: Explicit DENY wins above all → SCP → Resource policy → Permission boundary → Identity policy. Default = DENY everything. Must have explicit ALLOW.
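The core of that rule (explicit deny beats allow, and no match means deny) can be sketched in a few lines. This is a deliberately reduced model: it looks at a single list of statements and ignores SCPs, resource policies, permission boundaries, resources, and conditions. `evaluate` is a hypothetical helper, and `fnmatchcase` only approximates IAM's wildcard matching.

```python
from fnmatch import fnmatchcase


def evaluate(statements: list, action: str) -> str:
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one action."""
    matched_allow = False
    for stmt in statements:
        if any(fnmatchcase(action, pattern) for pattern in stmt["Action"]):
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"  # an explicit deny wins immediately
            if stmt["Effect"] == "Allow":
                matched_allow = True
    # With no matching statement at all, the default is deny.
    return "Allow" if matched_allow else "ImplicitDeny"


POLICY = [
    {"Effect": "Allow", "Action": ["s3:*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteBucket"]},
]

print(evaluate(POLICY, "s3:GetObject"))      # Allow
print(evaluate(POLICY, "s3:DeleteBucket"))   # ExplicitDeny
print(evaluate(POLICY, "ec2:RunInstances"))  # ImplicitDeny
```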

IAM Roles for services: EC2 assumes a role → credentials via IMDS (169.254.169.254). Lambda execution role → auto-injected. NEVER hardcode access keys in code. Use roles. Hardcoded keys committed to Git are a critical security incident: rotate them immediately.

Cross-account: Account B creates role with trust policy allowing Account A. Account A STS AssumeRole → gets temp creds for Account B. Least privilege on both sides.

KMS: Manages encryption keys. AWS Managed Keys (free, less control) vs Customer Managed Keys ($1/month, full control, cross-account). Envelope encryption: DEK encrypts data, CMK encrypts DEK.

Secrets Manager: Store secrets encrypted by KMS. Auto-rotation for RDS. $0.40/secret/month. SSM Parameter Store: Free for standard/SecureString. No auto-rotation. Good for config + non-rotating secrets. Use SM for rotating DB passwords, SSM for config.

WAF: Blocks OWASP Top 10, SQL injection, XSS, bad bots. Applied to CloudFront/ALB/API Gateway. Use AWS Managed Rule Groups. Rate-based rules for DDoS defense at L7.

Shield Standard: Free, automatic L3/L4 DDoS. Shield Advanced: $3K/month, L7 DDoS, DRT access, cost protection. GuardDuty: Threat detection via ML on VPC Flow Logs + CloudTrail + DNS. Finds cryptomining, compromised creds, port scans.

Azure Key Vault = KMS + Secrets Manager + certificate management in one. Azure Entra ID = feature-rich identity (OAuth, SAML, Conditional Access, MFA) — more than AWS IAM alone. Microsoft Sentinel = full SIEM/SOAR, no AWS equivalent.

AWS-R5

Databases — Quick Review

5 RDS, Aurora, DynamoDB, ElastiCache

RDS: Managed relational DB. Engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Aurora. AWS manages hardware, OS, patching, backups. You manage: schema, queries, scaling decisions.

RDS Multi-AZ: Synchronous standby replica in another AZ. Auto-failover ~60-120s. Standby is NOT readable. For read scale → Read Replicas (async; up to 15 for MySQL/MariaDB/PostgreSQL and for Aurora, up to 5 for Oracle/SQL Server).

RDS Proxy: Connection pooler. Strongly recommended for Lambda + RDS (1000 concurrent Lambda invocations can each open a connection and exhaust the DB's connection limit). Reduces failover time. Improves connection reuse.

Aurora: AWS-proprietary. MySQL/PostgreSQL compatible. 5x faster than MySQL, 3x PostgreSQL. 6 copies across 3 AZs. Failover ~30s (vs 60-120s for RDS Multi-AZ — readers share storage, no data copy needed on promote).

Aurora Serverless v2: Auto-scales in 0.5 ACU increments, seconds. Pay-per-second. Min to max ACU range. Aurora Global: Multi-region, <1s replication lag, promote secondary for DR (RPO <1s, RTO <1min).

DynamoDB: NoSQL key-value + document. Serverless. Single-digit ms at any scale. Table → Items → Attributes. Primary Key: PK alone (simple) or PK+SK (composite). Max item size: 400KB.

DynamoDB Capacity: On-Demand (pay per request, auto-scale) vs Provisioned (set RCU/WCU, cheaper at steady load). 1 RCU = 4KB strongly consistent read/s. 1 WCU = 1KB write/s.
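Capacity units round up per item, which is easy to get wrong when sizing a table. The sketch below works through the arithmetic; the function names are hypothetical, and the half-cost rule for eventually consistent reads is standard DynamoDB behavior added here for completeness.

```python
import math


def rcu_needed(item_kb: float, reads_per_s: int, strongly_consistent: bool = True) -> int:
    """Read capacity units: one RCU covers one strongly consistent read per second of up to 4KB."""
    units_per_read = math.ceil(item_kb / 4)
    total = units_per_read * reads_per_s
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def wcu_needed(item_kb: float, writes_per_s: int) -> int:
    """Write capacity units: one WCU covers one write per second of up to 1KB."""
    return math.ceil(item_kb) * writes_per_s


# A 6KB item rounds up to 2 RCUs per read; a 2.5KB item rounds up to 3 WCUs per write.
print(rcu_needed(6, 100), rcu_needed(6, 100, strongly_consistent=False), wcu_needed(2.5, 50))
```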

DynamoDB GSI: Query by non-PK attributes. Own PK+SK, own capacity. Design data model around your access patterns FIRST (unlike SQL). Single-table design = all entities in one table with composite keys.
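
A minimal sketch of single-table design, using a plain Python list as a stand-in for a DynamoDB table. The `USER#` / `ORDER#` key prefixes and the `query` helper are illustrative assumptions, not a real DynamoDB API; the point is that one partition key holds several entity types and the sort key prefix selects among them.

```python
# One "table" holding two entity types (profiles and orders) under shared keys.
TABLE = [
    {"PK": "USER#42", "SK": "PROFILE", "name": "Ada"},
    {"PK": "USER#42", "SK": "ORDER#2024-01-15#1001", "total": 30},
    {"PK": "USER#42", "SK": "ORDER#2024-03-02#1002", "total": 75},
    {"PK": "USER#7", "SK": "PROFILE", "name": "Linus"},
]


def query(table, pk, sk_prefix=""):
    """Mimic a Query: exact partition key match, optional sort key prefix."""
    matches = [
        item for item in table
        if item["PK"] == pk and item["SK"].startswith(sk_prefix)
    ]
    # Items within a partition come back ordered by sort key.
    return sorted(matches, key=lambda item: item["SK"])


everything_for_user = query(TABLE, "USER#42")           # profile + orders
orders_only = query(TABLE, "USER#42", sk_prefix="ORDER#")
```

Because the date is embedded in the sort key, the access pattern "orders for a user, oldest first" needs no extra index. An access pattern that does not start from the user would need a GSI.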

DynamoDB Streams: Change stream (24hr retention). Triggers Lambda on insert/update/delete. Global Tables: Multi-region, multi-active writes. DAX: In-memory cache for DynamoDB with microsecond reads on cache hits. API-compatible with DynamoDB, but the app must switch to the DAX client SDK.

ElastiCache Redis: In-memory. Rich data types (hashes, sorted sets, pub/sub). Persistence optional. Replication + Multi-AZ. Sessions, leaderboards, rate limiting. Memcached: Simple caching, multi-threaded, no persistence, no replication. Choose Redis 95% of the time.

GCP Cloud Spanner: Global relational DB with horizontal write scaling — no AWS equivalent. Azure Cosmos DB: Multi-model NoSQL (MongoDB/Cassandra/Gremlin APIs in one service). Azure Cosmos DB for PostgreSQL = Citus distributed PostgreSQL.

AWS-R6

Monitoring, DevOps & Messaging — Quick Review

6 CloudWatch, CloudTrail, CI/CD & Messaging

CloudWatch Metrics: Time-series from AWS services. EC2 default (basic monitoring): every 5 min (CPU, Network, Disk I/O, Status checks). Memory and disk-space usage are NOT included by default. Detailed monitoring: 1 min (extra cost). Custom metrics: push your own (queue depth, sessions, etc.).

CloudWatch Logs: Log Groups → Log Streams → Log Events. Set retention per group. CloudWatch Logs Insights = query language (filter, parse, stats). CloudWatch Agent = install on EC2 to collect RAM/disk/custom log files.

CloudWatch Alarms: Watch metric → threshold → States: OK/ALARM/INSUFFICIENT_DATA. Actions: SNS notification, Auto Scaling, EC2 actions (stop/terminate/reboot/recover). Composite alarms = AND/OR of multiple alarms.
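
The three alarm states and composite alarms can be sketched as a tiny state function. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes a "greater than threshold" comparison, requires every one of the last N datapoints to breach, and treats any missing datapoint as INSUFFICIENT_DATA. The function names are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods=3):
    """Return OK / ALARM / INSUFFICIENT_DATA for the most recent datapoints."""
    recent = datapoints[-evaluation_periods:]
    # Not enough data, or a gap (None) in the window: cannot decide.
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    # Simplification: every datapoint in the window must breach the threshold.
    return "ALARM" if all(v > threshold for v in recent) else "OK"


def composite_alarm(operator, *states):
    """Combine child alarm states with AND / OR, like a composite alarm rule."""
    flags = [state == "ALARM" for state in states]
    firing = all(flags) if operator == "AND" else any(flags)
    return "ALARM" if firing else "OK"


cpu = alarm_state([50, 85, 90, 95], threshold=80)     # last 3 all breach
errors = alarm_state([1, 0, 2], threshold=5)          # none breach
page_on_call = composite_alarm("AND", cpu, errors)    # both must be in ALARM
```

Using AND in the composite is a common way to cut noise: high CPU alone does not page anyone unless the error rate alarm is firing too.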

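CloudTrail: Logs AWS API calls (who, what, when, from where). Management events are recorded by default; data events (e.g., S3 object-level access, Lambda invokes) must be enabled on a trail and cost extra. Default 90-day Event History. Create a Trail → S3 for longer retention. Enable on Day 1. Multi-region trail covers all regions. Enable log file integrity validation.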

EventBridge: Serverless event bus. Rules match events → targets. Default bus (AWS events) + custom buses (your app events). Replaced CloudWatch Events. Cron schedules, service event reactions (EC2 state change → Lambda).
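
To show what "rules match events" means, here is a sketch of an event pattern for the EC2 state change example and a deliberately simplified matcher. The `matches` function is an assumption written for this note: it handles only exact-value matching on scalar fields, whereas real EventBridge patterns also support prefix, numeric, and other operators. The sample event carries only the fields needed for the demo.

```python
# Pattern: fire when an EC2 instance enters the stopped or terminated state.
PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must be present and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must equal one of the listed values.
            return False
    return True


stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"state": "running"}}
```

Fields that the pattern does not mention (such as `instance-id`) are ignored, which is why narrow patterns stay short.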

CloudFormation: YAML/JSON templates → Stacks. Resources created/updated/deleted together. Always use Change Sets before updating production stacks. Drift detection = find manual changes. StackSets = multi-account/region deployments.
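
As a concrete picture of a template, here is a minimal sketch that builds a one-resource CloudFormation template as a Python dict and serializes it to JSON. The logical ID `NotesBucket` and the description text are placeholders chosen for this example.

```python
import json

# A template is a document with a Resources section; each resource has a
# logical ID (here the hypothetical "NotesBucket"), a Type, and Properties.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal example stack: one versioned S3 bucket",
    "Resources": {
        "NotesBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

# Serialize to the JSON form that CloudFormation accepts.
template_json = json.dumps(template, indent=2)
```

Everything declared under `Resources` is created, updated, and deleted together as one stack, which is why a Change Set preview matters before touching production.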

Auto Scaling Group: Min/Max/Desired. Scaling policies: Target Tracking (most common, "keep CPU at 60%") | Step Scaling | Scheduled | Predictive. Mixed instances (Spot + On-Demand) = major cost savings. Health checks replace unhealthy instances automatically.
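
Target Tracking can be reasoned about with simple proportional arithmetic. The sketch below is an approximation for building intuition, not AWS's published algorithm (the real policy also applies cooldowns, instance warm-up, and conservative scale-in); the function name and parameters are assumptions for this example.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric,
                     min_size, max_size):
    """Proportional estimate: scale the fleet so the metric lands on target."""
    raw = math.ceil(current_instances * current_metric / target_metric)
    # The group never goes below Min or above Max.
    return max(min_size, min(max_size, raw))


# 4 instances at 90% CPU with a 60% target: 4 * 90 / 60 = 6 instances.
scale_out = desired_capacity(4, 90, 60, min_size=2, max_size=10)
# 4 instances at 30% CPU: 4 * 30 / 60 = 2 instances.
scale_in = desired_capacity(4, 30, 60, min_size=2, max_size=10)
```

The clamp to Min/Max is the part that matters in practice: a Max set too low silently caps scale-out during a traffic spike.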

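SSM Session Manager: Shell access to EC2 with no SSH keys, no port 22 open, no bastion host. IAM-authenticated. Requires the SSM Agent and an instance profile with SSM permissions. Session starts are recorded in CloudTrail; session output can be logged to S3/CloudWatch Logs when configured. Use instead of SSH for production access. Also: SSM Run Command (run commands on fleets), Patch Manager, Parameter Store.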

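SQS: Queue. Standard (nearly unlimited throughput, at-least-once delivery, best-effort ordering) vs FIFO (300 msg/s per API action, 3K/s with batching, higher in high-throughput mode; exactly-once processing; strictly ordered within a message group). Visibility timeout = how long a message stays hidden from other consumers while one processes it. DLQ = captures messages that still fail after N receives (maxReceiveCount). Long polling = wait up to 20s for messages, which reduces empty responses.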
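
Visibility timeout and DLQ behavior are easier to follow in a toy model. The class below is an in-memory simulation written for this note, not the SQS API: `MiniQueue`, its method names, and the explicit `now` argument (seconds on a fake clock) are all assumptions for illustration.

```python
class MiniQueue:
    """In-memory stand-in for a queue with a visibility timeout and a DLQ."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = []       # in-flight and waiting messages
        self.dead_letters = []   # bodies moved to the dead-letter queue

    def send(self, body):
        self.messages.append({"body": body, "visible_at": 0, "receives": 0})

    def receive(self, now):
        """Return the next visible message and hide it, or None."""
        for msg in list(self.messages):
            if msg["visible_at"] > now:
                continue  # still hidden: another consumer is processing it
            if msg["receives"] >= self.max_receive_count:
                # Too many failed attempts: move to the DLQ instead.
                self.messages.remove(msg)
                self.dead_letters.append(msg["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """A consumer deletes the message after processing it successfully."""
        self.messages.remove(msg)


queue = MiniQueue(visibility_timeout=30, max_receive_count=2)
queue.send("job-1")
first = queue.receive(now=0)     # consumer takes the job, then crashes
hidden = queue.receive(now=10)   # None: message is still invisible
retry = queue.receive(now=30)    # timeout expired, message is redelivered
gone = queue.receive(now=60)     # receive limit hit, message goes to the DLQ
```

The consumer never called `delete`, so the message reappeared after the timeout. That is the at-least-once behavior that makes idempotent consumers necessary.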

SNS: Pub/Sub. One topic → many subscribers (SQS, Lambda, HTTP, Email, SMS). SNS→SQS fan-out pattern: SNS fans to multiple SQS queues, each with independent consumer. Durable (SQS buffers if consumer is down).
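
The SNS→SQS fan-out pattern reduces to "publish once, copy to every subscribed queue". A minimal in-memory sketch follows; the `Topic` class and the queue names are hypothetical, with plain `deque` objects standing in for SQS queues.

```python
from collections import deque


class Topic:
    """Toy pub/sub topic that copies each message to every subscriber queue."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        # Fan-out: each subscribed queue gets its own copy of the message.
        for queue in self.subscribers:
            queue.append(message)


orders = Topic()
billing_queue, shipping_queue = deque(), deque()
orders.subscribe(billing_queue)
orders.subscribe(shipping_queue)

orders.publish({"order_id": 1001})

# The billing consumer processes its copy; shipping's copy is untouched.
billed = billing_queue.popleft()
```

Each queue buffers independently, so a slow or offline shipping consumer does not block billing and does not lose its messages.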

Kinesis Data Streams: Real-time streaming. Records retained 24h by default, extendable up to 365 days. Multiple consumers can re-read (unlike SQS, where a message is consumed once and deleted). Ordered per shard. 1MB/s (or 1,000 records/s) write per shard. Kinesis Data Firehose (now Amazon Data Firehose): Managed delivery to S3/Redshift/OpenSearch with no consumer code needed.
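
Per-shard limits translate directly into a shard count estimate. The sketch below assumes provisioned mode with the commonly documented per-shard limits of 1MB/s or 1,000 records/s in and 2MB/s out (shared by consumers without enhanced fan-out); verify these against current quotas. The function name is made up for this example.

```python
import math


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Estimate shard count as the tightest of the three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_per_s / 1.0)     # 1 MB/s in per shard
    by_write_records = math.ceil(records_per_s / 1000)   # 1,000 records/s in
    by_read_bytes = math.ceil(read_mb_per_s / 2.0)       # 2 MB/s out per shard
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# 3.5 MB/s of large records: write bandwidth is the bottleneck.
bandwidth_bound = shards_needed(3.5, 2000, 4)
# Many tiny records: the records/s limit dominates instead.
record_bound = shards_needed(0.5, 5000, 1)
```

The second case is the common surprise: small records can force more shards than the byte throughput alone suggests.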

SQS vs SNS vs Kinesis: SQS = job queue (consume once, delete). SNS = fan-out notifications (no retention). Kinesis = real-time stream (replay, multiple consumers, ordered per shard). Pick based on pattern: task processing → SQS, event notifications → SNS, real-time analytics → Kinesis.

Azure Application Insights: Full APM (request rates, errors, response times, user sessions). The closest AWS equivalent is a combination: X-Ray + CloudWatch (Application Signals, RUM, custom metrics). Azure Event Hubs: Kafka-compatible streaming, like Kinesis but with Kafka API support. GCP Cloud Pub/Sub: roughly SQS + SNS combined in one service.