AWS networking, from the IP up
Start at the bit, walk up through CIDR, subnets, Security Groups, and DNS. Mermaid diagrams for the hierarchy and the TCP handshake — plus a debugging toolbox for when things don't connect.
Most introductions to AWS networking start with a diagram of a VPC and try to teach you the cloud and the network at the same time. That tends to lose people in both. This post goes the other way — start at the IP, walk up through CIDR, subnets, and routing, then put AWS names on what we already understand. Diagrams along the way are Mermaid; they re-render when you toggle the site theme.
By the end you should know:
- Why AWS resources have different scope (global, regional, zonal) and what that means in practice.
- How an IPv4 address and a CIDR block actually work in binary — and how to subnet a /16 into /20s and /28s without overlapping.
- Why VPCs and subnets are organized the way they are.
- How DNS records (A, AAAA, CNAME, NS) make domains addressable.
- How to debug "why can't my app talk to the database" with tcpdump and the TCP handshake, plus a handful of other commands worth keeping in muscle memory.
The hierarchy
AWS organizes its hardware in three physical tiers. As of today: 38 geographic Regions and 128 Availability Zones (AZs).
Why the region you pick matters
- Latency. Servers closer to your users cut round-trip time. Indonesian users hitting us-east-1 add ~250ms before any of your code runs.
- Regulations. Many countries require user data to stay in-country. Financial and healthcare regulations frequently mandate this (Indonesian financial data → ap-southeast-3, Australian health records → ap-southeast-2, and so on).
- Disaster recovery. Region-level failure is rare but real. If your business cannot tolerate an entire AWS region going dark, you architect across two regions on purpose.
- Pricing. Per-hour cost of EC2, per-GB of storage, and per-GB of cross-region data transfer all vary. The price-cheapest region is rarely the same as the latency-best region.
Where the VPC slots in
The VPC is a logical layer AWS lays on top of the physical hierarchy:
The VPC itself is region-scoped — it spans all AZs in a region. Subnets are zone-scoped — each one is pinned to a single AZ. EC2 instances inside subnets inherit that zone.
A mental model: a region is a city, an AZ is a building, a VPC is a tenant's lease across multiple buildings, a subnet is a floor in one building, an EC2 instance is a desk on that floor.
Scope, in one paragraph
Where each kind of AWS object lives by tier:
Tier      Examples                                          Notes
--------  ------------------------------------------------  -----------------------------------------
Global    IAM, Route 53, CloudFront, S3 namespace           One worldwide control plane
Regional  VPC, S3 data, RDS, most managed services          Spans all AZs in the region
Zonal     Subnet, EC2 instance, EBS volume                  Pinned to one AZ; cannot span
Logical   Security Group, VPC Endpoint, Route Table         Defined inside a VPC, applied per resource

Remember which tier a resource lives in and you will never accidentally try to attach an EBS volume to an EC2 instance in a different AZ.
Networking fundamentals: IPv4 and CIDR
An IPv4 address is 32 bits, written as four 8-bit numbers (octets):
192 . 168 . 10 . 1
+----+ +----+ +----+ +----+
| 8b | | 8b | | 8b | | 8b | = 32 bits total
+----+ +----+ +----+ +----+
binary:
11000000.10101000.00001010.00000001
Each octet is between 0 and 255 (the range of an 8-bit unsigned integer):
- Valid: 192.168.10.1 (every octet is between 0 and 255).
- Invalid: 150.280.23.90 (280 doesn't fit in 8 bits).
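The octet rule is easy to check mechanically. The following is a minimal sketch, not a production validator; the function name `to_binary` is made up for illustration, and it only handles plain dotted-quad IPv4 strings.

```python
def to_binary(ip: str) -> str:
    """Return the dotted-binary form of an IPv4 address, or raise ValueError."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets, got {len(parts)}")
    octets = []
    for part in parts:
        value = int(part)  # raises ValueError on non-numeric input
        if not 0 <= value <= 255:
            raise ValueError(f"octet {value} does not fit in 8 bits")
        octets.append(format(value, "08b"))  # zero-padded 8-bit binary
    return ".".join(octets)


print(to_binary("192.168.10.1"))  # 11000000.10101000.00001010.00000001
```

Calling `to_binary("150.280.23.90")` raises `ValueError`, because 280 is outside the 0 to 255 range.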
CIDR (Classless Inter-Domain Routing) glues a prefix length onto an IP block. The prefix says "the first N bits are fixed; the remaining 32 − N bits are host addresses."
192.168.10.0 /24
11000000.10101000.00001010.00000000
|<------- 24 bit prefix ------>|<-- 8 host bits -->|
         fixed network            varies (256 IPs)
The formula for available addresses:
addresses = 2 ^ (32 - prefix)
A few common sizes:
- /16 → 2^16 = 65,536 addresses. Typical for an entire VPC.
- /20 → 2^12 = 4,096 addresses. Mid-size subnet; often used as a "reserved" allocation per AZ.
- /24 → 2^8 = 256 addresses. Typical for one subnet.
- /28 → 2^4 = 16 addresses. The smallest subnet AWS lets you create.
AWS reserves five addresses in every subnet (network address, VPC router, DNS, future use, broadcast). A /24 gives you 251 usable IPs, not 256. A /28 gives you 11 usable IPs, not 16.
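The same arithmetic can be done with Python's standard `ipaddress` module. This is a small sketch; the `10.0.0.0` blocks are arbitrary examples, and `usable_ips` is a hypothetical helper name.

```python
import ipaddress

# AWS holds back five addresses per subnet:
# network address, VPC router, DNS, future use, broadcast.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Addresses left for your own resources in an AWS subnet of this size."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.0.0/20", "10.0.0.0/24", "10.0.0.0/28"):
    print(cidr, usable_ips(cidr))
```

For the three sizes above this prints 4091, 251, and 11 usable addresses respectively.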
Worked example: subnetting a /16
Say your VPC is 172.31.0.0/16 and the AWS-default subnets already occupy these /20 blocks:
- 172.31.0.0/20
- 172.31.16.0/20
- 172.31.32.0/20
If you want to carve a small /28 somewhere safe, you need to avoid those ranges:
- 172.31.0.0/20 covers 172.31.0.0 → 172.31.15.255
- 172.31.16.0/20 covers 172.31.16.0 → 172.31.31.255
- 172.31.32.0/20 covers 172.31.32.0 → 172.31.47.255
The next free /20 block starts at 172.31.48.0. Inside that, a /28 like 172.31.48.0/28 gives you 11 usable IPs, and the next non-overlapping /28 is 172.31.48.16/28. This kind of math is what saves you when you join an existing network and need to slot something new in without IP collisions.
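If you would rather not do this by hand, the search for a free block can be sketched with the standard `ipaddress` module. `first_free` is a hypothetical helper written for this example; it simply walks candidate blocks in order and returns the first one that overlaps nothing already in use.

```python
import ipaddress

vpc = ipaddress.ip_network("172.31.0.0/16")
taken = [
    ipaddress.ip_network(c)
    for c in ("172.31.0.0/20", "172.31.16.0/20", "172.31.32.0/20")
]


def first_free(parent, used, new_prefix):
    """First block of size new_prefix inside parent that overlaps nothing in used."""
    for candidate in parent.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(u) for u in used):
            return candidate
    return None  # parent is fully allocated at this size


free_20 = first_free(vpc, taken, 20)
first_28, second_28 = list(free_20.subnets(new_prefix=28))[:2]
print(free_20, first_28, second_28)
# 172.31.48.0/20 172.31.48.0/28 172.31.48.16/28
```

A linear scan is fine at VPC scale: a /16 holds only 16 /20 blocks and 4,096 /28 blocks.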
Why we care about available addresses
IPv4 is finite, and inside a VPC even more so. Every EC2 instance, every Lambda warm-pool ENI, every RDS endpoint, every container running on Fargate consumes an IP address. In a microservices environment with aggressive autoscaling, IP exhaustion is a real failure mode — your auto-scaling group stops creating instances because the subnet has no addresses left. Provision generously the first time. A /16 VPC with /20 subnets is rarely too big; carving /24 subnets gets tight surprisingly fast.
VPC: your private network in a region
A VPC is a logical network you carve out of a region. It is defined by its CIDR block:
VPC: 10.10.0.0/16 (65,536 addresses, region-wide)
The VPC itself does not run code. It is the scope inside which subnets, route tables, gateways, and security primitives have meaning. Two VPCs in the same region can have overlapping CIDR blocks because they are isolated from each other by default. If you might connect them later, give them distinct ranges: VPC peering does not work between VPCs whose CIDR blocks overlap.
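Overlap starts to matter when two VPCs need to talk to each other, since a peering connection needs non-overlapping CIDR blocks. A quick pre-check might look like the sketch below. `can_peer` is a made-up name, and the check covers only the CIDR condition, not the rest of what peering needs.

```python
import ipaddress


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """True when the two VPC CIDR blocks do not overlap."""
    a = ipaddress.ip_network(cidr_a)
    b = ipaddress.ip_network(cidr_b)
    return not a.overlaps(b)


print(can_peer("10.10.0.0/16", "10.20.0.0/16"))    # True
print(can_peer("10.10.0.0/16", "10.10.128.0/20"))  # False
```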
Subnets and the public/private split
A subnet is a slice of the VPC's CIDR pinned to a single AZ. Two reasons to subdivide:
- High availability. Put each tier in two AZs minimum, so an AZ failure does not take you down.
- Routing. Subnets differ in whether their traffic can reach the internet. This is the public vs private split — the most important decision in your VPC layout.
- Public subnet: has a route in its route table to the Internet Gateway (IGW). Resources here that also have a public IP can be reached from the public internet (load balancers, bastion hosts) or initiate outbound calls.
- Private subnet: no route to the IGW. Resources here are reachable only from inside the VPC. To let private instances make outbound calls (e.g., apt update), put a NAT Gateway in a public subnet and route the private subnet's egress through it.
Databases live in private subnets. Application servers live in private subnets. Only the things that genuinely need to be addressable from the internet — load balancers, bastions, public APIs — live in public subnets.
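The public/private distinction comes down to one question about the subnet's route table, which the sketch below expresses as code. The dictionaries loosely mirror the shape of route entries in the EC2 API (`DestinationCidrBlock`, `GatewayId`, `NatGatewayId`). The IDs are invented and `is_public` is a hypothetical helper, not an AWS call.

```python
def is_public(route_table: list[dict]) -> bool:
    """A subnet is public if its route table sends traffic to an Internet Gateway."""
    return any(route.get("GatewayId", "").startswith("igw-") for route in route_table)


# Hypothetical route tables for one public and one private subnet.
public_rt = [
    {"DestinationCidrBlock": "10.10.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123example"},
]
private_rt = [
    {"DestinationCidrBlock": "10.10.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0456example"},
]

print(is_public(public_rt))   # True
print(is_public(private_rt))  # False
```

The private table still has a default route, but it points at a NAT Gateway, so instances can start outbound connections while staying unreachable from the internet.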
EC2: where instances actually live
An EC2 instance is a virtual machine. When you provision one you are choosing:
- CPU and RAM: encoded in the instance family and size (t3.medium, c7g.large).
- Storage: usually an attached EBS volume, which is zonal, so the volume and the EC2 instance using it must be in the same AZ.
- Network bandwidth: scales with instance size.
- Subnet: decides the AZ and whether the instance is public or private. The subnet's VPC decides which Security Groups can attach.
- Cost: directly tied to instance family, size, and region.
The instance gets an IP from its subnet's CIDR block. That IP is what shows up in VPC flow logs, Security Group rule sources, and route tables.
Security Groups
A Security Group (SG) is a virtual firewall attached to an instance's network interface — not to the subnet. It limits access by port, protocol, and source.
Two directions:
- Ingress — incoming traffic to the instance.
- Egress — outgoing traffic from the instance (usually open to all destinations by default so the server can fetch updates).
Two properties that surprise people the first time:
- Stateful. If you allow an inbound TCP connection, the return traffic is automatically allowed. You do not have to add a matching egress rule.
- Default-deny, allow-only. You can only write "allow" rules. To "block port 8080 from 108.55.33.111" you simply do not allow it; anything not explicitly allowed is dropped.
If you genuinely need explicit deny rules (e.g., to block a malicious IP range at the subnet boundary), reach for a Network ACL, which is stateless and operates at the subnet level. The two layers complement each other.
Example inbound rules:
Protocol  Port  Source            Purpose
--------  ----  ----------------  -------------------------------------
TCP       443   0.0.0.0/0         Public HTTPS on the ALB
TCP       22    192.168.1.0/24    SSH from the office network only
TCP       5432  sg-0abc123        Postgres, only from the app tier SG

The last row is the one most people miss: a Security Group's source can be another Security Group, not an IP block. That lets you write rules in terms of roles ("the app tier can reach the database tier") instead of addresses. When you autoscale and instances get new IPs, the rule keeps working because it is keyed off membership, not address.
A common first-rule mistake: **never blanket-allow 0.0.0.0/0 to arbitrary ports** on anything that is not a load balancer. Default-deny only protects you while you actually use it.
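To make the allow-only logic concrete, here is a simplified model of how the example inbound rules could be evaluated. It is a sketch for building intuition, not how AWS implements Security Groups: `Rule` and `allowed` are invented names, statefulness is not modeled, and the source IPs in the calls are arbitrary.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str
    port: int
    source: str  # a CIDR block, or a Security Group id such as "sg-0abc123"


INBOUND = [
    Rule("tcp", 443, "0.0.0.0/0"),
    Rule("tcp", 22, "192.168.1.0/24"),
    Rule("tcp", 5432, "sg-0abc123"),
]


def allowed(rules, protocol, port, src_ip, src_groups=()):
    """Allow-only evaluation: True if any rule matches, otherwise the packet is dropped."""
    for rule in rules:
        if rule.protocol != protocol or rule.port != port:
            continue
        if rule.source.startswith("sg-"):
            # Membership-based rule: the sender must belong to the named group.
            if rule.source in src_groups:
                return True
        elif ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source):
            return True
    return False  # default deny: nothing matched


print(allowed(INBOUND, "tcp", 443, "203.0.113.9"))                            # True
print(allowed(INBOUND, "tcp", 22, "203.0.113.9"))                             # False
print(allowed(INBOUND, "tcp", 5432, "10.10.1.7", src_groups=("sg-0abc123",)))  # True
print(allowed(INBOUND, "tcp", 8080, "108.55.33.111"))                         # False
```

There is no deny branch anywhere in the function. A packet is dropped only because no rule matched it, which is the allow-only behaviour described above.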
S3: outside the VPC
S3 is object storage, and unlike everything else in this post it does not live in your VPC. The bucket name is part of a global namespace (every bucket name must be unique across all AWS customers worldwide), but the data is stored in a specific region you pick at create time.
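By default, S3 is reached through its public endpoints, even from an EC2 instance inside your VPC, so the instance needs a path out through the IGW or a NAT Gateway. If you want VPC traffic to S3 to take a private route on the AWS network instead (cheaper than sending it through a NAT Gateway, more private, no IGW involved), create a VPC Endpoint for S3.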
DNS: how names become IPs
Behind every domain name there is a chain of records and servers. The four record types you see most often:
- A: maps a domain to an IPv4 address. apple.com → 17.x.x.x.
- AAAA: same idea, IPv6.
- CNAME: aliases one domain to another. www.apple.com → apple.com.
- NS: declares which name servers actually answer for a zone. apple.com → ns1.digitalocean.com. NS is how delegation works: a zone can be hosted anywhere, and the parent zone's NS record points at the right place.
When a browser resolves a name, it walks the tree:
Every step is cached. The reason a DNS change "takes 24 hours" is the TTL on each cached record at each layer.
In AWS, Route 53 plays both the registrar and authoritative-name-server roles. A typical setup: domain registered at Route 53, NS records point at Route 53's name servers, A or ALIAS records point at your ALB.
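To see what a name resolves to from the point of view of a program on a given machine, you can ask the system resolver directly. This is a minimal sketch using the standard `socket` module; `resolve_ipv4` is a made-up helper name. Because it goes through the system resolver, its answer reflects local overrides such as /etc/hosts as well as DNS.

```python
import socket


def resolve_ipv4(name: str) -> list[str]:
    """Sorted, de-duplicated IPv4 addresses the system resolver returns for name."""
    infos = socket.getaddrinfo(
        name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


try:
    print("localhost", resolve_ipv4("localhost"))
except socket.gaierror as exc:
    print("localhost did not resolve:", exc)
```

Swap in your own hostname to compare what the application sees with what an authoritative lookup returns.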
Why architecture diagrams
A good architecture diagram does three things at once: it forces you to name every box, it makes gaps visible (an arrow with no destination), and it gives the whole team the same mental model. You discover most architectural mistakes the first time you try to draw the system honestly. If you cannot draw it on a single page, that is the finding — not the diagram.
The debugging toolbox
When something does not connect, the questions you ask are usually one of these. Each has a tool worth keeping in muscle memory.
Who is listening on this box?
sudo netstat -tlupn | grep 443
sudo lsof -i :443
netstat -tlupn lists TCP and UDP listeners with process and PID. lsof -i :443 is the same idea, more readable on modern Linuxes. If nothing shows up, the application is not bound to the port you think it is.
What is actually flowing on the wire?
sudo tcpdump -i ens5 port 5432 -A
-i ens5 picks the network interface; port 5432 filters to Postgres traffic in both directions (dst port 5432 would show only the packets headed to the server and hide its replies); -A shows packet contents as ASCII (use -X for hex plus ASCII). The handshake either happens or you see exactly where it stalls.
What does this name resolve to?
nslookup api.example.com
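dig +short api.example.com
If nslookup returns the wrong IP, the bug is not in your application; it is in DNS. Both tools query DNS servers directly and skip /etc/hosts, so if they look right but the application still connects somewhere else, check local overrides:
sudo cat /etc/hosts
What is the system actually doing?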
sar -n TCP 1 # TCP-level stats, 1s resolution
sar -n DEV 1 # NIC-level stats per interface
mpstat -P ALL 1   # CPU per core, 1s resolution
When latency spikes and you don't know whether the bottleneck is the network, the CPU, or one specific core, these tell you. sar and mpstat come from the sysstat package.
What does the local service config say?
sudo cat /etc/redis/redis.conf | grep bind
sudo nvim /etc/hosts
bind 127.0.0.1 in redis.conf means Redis only accepts connections from localhost: a single line that accounts for 20% of "why can't my app reach Redis" tickets.
Cron sanity
crontab -l # list current user's cron jobs
crontab -e # edit them
tail -f /home/ubuntu/cron_test.log   # watch a job's output live
When a job "should be running but isn't," step one is crontab -l to confirm it is even registered. Step two is tail -f on its log to watch it run, or not.
Simulating heavy load — defensively
sudo t50 198.51.100.42 --protocol TCP --dport 80 --syn --threshold 10000000
t50 is a stress / flood testing tool. Useful for testing your own ALB and rate limits under SYN-flood conditions. Only run it against infrastructure you own. Pointing it at someone else's network is illegal in most jurisdictions and the kind of mistake that ends careers.
The TCP three-way handshake
When you watch tcpdump, what you are looking for is the three-way handshake:
The four diagnostic patterns you'll see in tcpdump:
- SYN leaves, no SYN-ACK comes back. The destination is not receiving you. Security Group or NACL is dropping you, or there is a route problem.
- SYN-ACK comes back, no final ACK. Asymmetric routing — packets are taking different paths in and out. Common with split-horizon network setups or misconfigured NAT.
- Three-way handshake completes, then RST. TCP connected but the application closed the connection. The process is listening but rejecting (wrong protocol, wrong TLS, etc.).
- No SYN even leaves your machine. Local routing or local firewall. Check ip route and iptables.
The Security Group, the route table, and the listener are the three places to check, in that order, every time.
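When tcpdump is not available, a plain TCP connect attempt already separates several of these cases. The sketch below uses the standard `socket` module; `probe` and its return labels are invented for this example. A "refused" result means an RST came back in reply to the SYN, which is a case not in the list above: the host is reachable but nothing is listening on that port.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and name the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # three-way handshake completed
    except ConnectionRefusedError:
        return "refused"  # RST in reply to SYN: reachable host, no listener
    except socket.timeout:
        return "timeout"  # no SYN-ACK: check Security Group, NACL, route table
    except OSError as exc:
        return f"error: {exc}"  # e.g. no route to host, name resolution failure


print(probe("127.0.0.1", 5432))
```

A "timeout" points at the Security Group or route table, and a "refused" points at the listener, which matches the order of checks above.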
Closing
The pieces are small. There are basically five primitives — region, VPC, subnet, route table, Security Group — and a handful of services that live at different scopes. Most of the complexity in real architectures comes from how those primitives are composed, not from any one being hard.
Good AWS networking instinct is built backwards: when you see a working architecture, take five minutes to trace the path of a single packet from a user's browser through Route 53, the IGW, the ALB, the public subnet, into a private subnet, through your application, to a database row. The names start sticking once you have walked the route a few times.