Angga.
aws · networking · devops

AWS networking, from the IP up

Start at the bit, walk up through CIDR, subnets, Security Groups, and DNS. Mermaid diagrams for the hierarchy and the TCP handshake — plus a debugging toolbox for when things don't connect.

· 12 min read

Most introductions to AWS networking start with a diagram of a VPC and try to teach you the cloud and the network at the same time. That tends to lose people in both. This post goes the other way — start at the IP, walk up through CIDR, subnets, and routing, then put AWS names on what we already understand. Diagrams along the way are Mermaid; they re-render when you toggle the site theme.

By the end you should know:

  • Why AWS resources have different scope (global, regional, zonal) and what that means in practice.
  • How an IPv4 address and a CIDR block actually work in binary — and how to subnet a /16 into /20s and /28s without overlapping.
  • Why VPCs and subnets are organized the way they are.
  • How DNS records (A, AAAA, CNAME, NS) make domains addressable.
  • How to debug "why can't my app talk to the database" with tcpdump and the TCP handshake — plus a handful of other commands worth keeping in muscle memory.

The hierarchy

AWS organizes its hardware in three physical tiers: Regions, the Availability Zones (AZs) inside each Region, and the data centers inside each AZ. At the time of writing there are 38 geographic Regions and 128 AZs.

Why the region you pick matters

  • Latency. Servers closer to your users cut round-trip time. Indonesian users hitting us-east-1 add ~250ms before any of your code runs.
  • Regulations. Many countries require user data to stay in-country. Financial and healthcare regulations frequently mandate this (Indonesian financial data → ap-southeast-3, Australian health records → ap-southeast-2, and so on).
  • Disaster recovery. Region-level failure is rare but real. If your business cannot tolerate an entire AWS region going dark, you architect across two regions on purpose.
  • Pricing. Per-hour cost of EC2, per-GB of storage, and per-GB of cross-region data transfer all vary. The cheapest region is rarely the one with the best latency.

Where the VPC slots in

The VPC is a logical layer AWS lays on top of the physical hierarchy:

The VPC itself is region-scoped — it spans all AZs in a region. Subnets are zone-scoped — each one is pinned to a single AZ. EC2 instances inside subnets inherit that zone.

A mental model: a region is a city, an AZ is a building, a VPC is a tenant's lease across multiple buildings, a subnet is a floor in one building, an EC2 instance is a desk on that floor.

Scope, in one paragraph

Where each kind of AWS object lives by tier:

Tier      Examples                                          Notes
--------  ------------------------------------------------  -----------------------------------------
Global    IAM, Route 53, CloudFront, S3 namespace           One worldwide control plane
Regional  VPC, S3 data, RDS, most managed services          Spans all AZs in the region
Zonal     Subnet, EC2 instance, EBS volume                  Pinned to one AZ; cannot span
Logical   Security Group, VPC Endpoint, Route Table         Defined inside a VPC, applied per resource

Remember which tier a resource lives in and you will never accidentally try to attach an EBS volume to an EC2 instance in a different AZ.

Networking fundamentals: IPv4 and CIDR

An IPv4 address is 32 bits, written as four 8-bit numbers (octets):

 192  .  168  .   10  .    1
+----+   +----+   +----+   +----+
| 8b |   | 8b |   | 8b |   | 8b |   = 32 bits total
+----+   +----+   +----+   +----+

binary:
11000000.10101000.00001010.00000001

Each octet is between 0 and 255 (the range of an 8-bit unsigned integer):

  • Valid: 192.168.10.1 — every octet is between 0 and 255.
  • Invalid: 150.280.23.90, because 280 doesn't fit in 8 bits.
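The octet rule can be checked mechanically. The following is a minimal sketch, not a full parser; the function name ipv4_to_binary is made up for illustration. It splits an address on dots, rejects any octet that does not fit in 8 bits, and prints the binary form shown above.

```python
def ipv4_to_binary(address: str) -> str:
    """Return the dotted binary form of an IPv4 address, or raise ValueError."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets, got {len(parts)}")
    octets = []
    for part in parts:
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"octet {part!r} is not a decimal number")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet {value} does not fit in 8 bits")
        # format(..., "08b") pads the binary form to exactly 8 digits
        octets.append(format(value, "08b"))
    return ".".join(octets)


print(ipv4_to_binary("192.168.10.1"))  # 11000000.10101000.00001010.00000001
```

Calling it with 150.280.23.90 raises ValueError, because 280 needs 9 bits.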

CIDR (Classless Inter-Domain Routing) glues a prefix length onto an IP block. The prefix says "the first N bits are fixed; the remaining 32 − N bits are host addresses."

192.168.10.0 /24

11000000.10101000.00001010.00000000
|<------- 24 bit prefix ------>|<-- 8 host bits -->|
       fixed network                varies (256 IPs)

The formula for available addresses:

addresses = 2 ^ (32 - prefix)

A few common sizes:

  • /16 → 2^16 = 65,536 addresses. Typical for an entire VPC.
  • /20 → 2^12 = 4,096 addresses. Mid-size subnet; often used as a "reserved" allocation per AZ.
  • /24 → 2^8 = 256 addresses. Typical for one subnet.
  • /28 → 2^4 = 16 addresses. The smallest subnet AWS lets you create.

AWS reserves five addresses in every subnet (network address, VPC router, DNS, future use, broadcast). A /24 gives you 251 usable IPs, not 256. A /28 gives you 11 usable IPs, not 16.
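The arithmetic is small enough to keep as a helper. This is a sketch with illustrative function names; the only AWS-specific facts it encodes are the five reserved addresses per subnet and the /16 to /28 limits on IPv4 subnet size.

```python
AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def total_addresses(prefix: int) -> int:
    """addresses = 2 ^ (32 - prefix) for an IPv4 CIDR block."""
    if not 0 <= prefix <= 32:
        raise ValueError("prefix must be between 0 and 32")
    return 2 ** (32 - prefix)


def usable_in_aws_subnet(prefix: int) -> int:
    """Addresses left for your own resources after AWS takes its five."""
    if not 16 <= prefix <= 28:
        raise ValueError("AWS IPv4 subnets must be between /16 and /28")
    return total_addresses(prefix) - AWS_RESERVED_PER_SUBNET


for prefix in (16, 20, 24, 28):
    print(f"/{prefix}: {total_addresses(prefix)} total, "
          f"{usable_in_aws_subnet(prefix)} usable")
```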

Worked example: subnetting a /16

Say your VPC is 172.31.0.0/16 and the AWS-default subnets already occupy these /20 blocks:

  • 172.31.0.0/20
  • 172.31.16.0/20
  • 172.31.32.0/20

If you want to carve a small /28 somewhere safe, you need to avoid those ranges:

  • 172.31.0.0/20 covers 172.31.0.0 through 172.31.15.255
  • 172.31.16.0/20 covers 172.31.16.0 through 172.31.31.255
  • 172.31.32.0/20 covers 172.31.32.0 through 172.31.47.255

The next free /20 block starts at 172.31.48.0. Inside that, a /28 like 172.31.48.0/28 gives you 11 usable IPs, and the next non-overlapping /28 is 172.31.48.16/28. This kind of math is what saves you when you join an existing network and need to slot something new in without IP collisions.
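The same search can be done with Python's standard ipaddress module instead of by hand. This is a sketch of one way to do it: first_free_block is a hypothetical helper, and the CIDR values are the ones from the example above.

```python
import ipaddress

vpc = ipaddress.ip_network("172.31.0.0/16")
taken = [
    ipaddress.ip_network(cidr)
    for cidr in ("172.31.0.0/20", "172.31.16.0/20", "172.31.32.0/20")
]


def first_free_block(vpc, taken, new_prefix):
    """Return the first block of size new_prefix in vpc that overlaps nothing in taken."""
    for candidate in vpc.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(block) for block in taken):
            return candidate
    return None  # the VPC is full at this block size


free_20 = first_free_block(vpc, taken, 20)
first_two_28s = list(free_20.subnets(new_prefix=28))[:2]

for block in taken:
    # network_address and broadcast_address are the first and last IP in the block
    print(block, "covers", block.network_address, "through", block.broadcast_address)
print("next free /20:", free_20)
print("first two /28s inside it:", [str(n) for n in first_two_28s])
```

Running the overlap check before you create a subnet is cheaper than finding the collision when the API call fails.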

Why we care about available addresses

IPv4 is finite, and inside a VPC even more so. Every EC2 instance, every Lambda warm-pool ENI, every RDS endpoint, every container running on Fargate consumes an IP address. In a microservices environment with aggressive autoscaling, IP exhaustion is a real failure mode — your auto-scaling group stops creating instances because the subnet has no addresses left. Provision generously the first time. A /16 VPC with /20 subnets is rarely too big; carving /24 subnets gets tight surprisingly fast.

VPC: your private network in a region

A VPC is a logical network you carve out of a region. It is defined by its CIDR block:

VPC: 10.10.0.0/16   (65,536 addresses, region-wide)

The VPC itself does not run code. It is the scope inside which subnets, route tables, gateways, and security primitives have meaning. Two VPCs in the same region can have overlapping CIDR blocks and never interact. The catch is that VPCs with overlapping CIDRs cannot be peered, so give each VPC a distinct range if you might ever need to connect them.

Subnets and the public/private split

A subnet is a slice of the VPC's CIDR pinned to a single AZ. Two reasons to subdivide:

  • High availability. Put each tier in two AZs minimum, so an AZ failure does not take you down.
  • Routing. Subnets differ in whether their traffic can reach the internet. This is the public vs private split, the most important decision in your VPC layout.

The two kinds of subnet:

  • Public subnet: has a route in its route table to the Internet Gateway (IGW). Resources here can be reached from the public internet (load balancers, bastion hosts) or initiate outbound calls.
  • Private subnet: no route to the IGW. Resources here are reachable only from inside the VPC. To let private instances make outbound calls (e.g., apt update), put a NAT Gateway in a public subnet and route the private subnet's egress through it.
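The public/private distinction is a property of the route table, not of the subnet object itself. The sketch below models that with plain data. The Route class, the classify_subnet helper, and the resource IDs are all hypothetical; only the igw- and nat- ID prefixes mirror what AWS uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    destination: str  # CIDR the route matches, e.g. "0.0.0.0/0"
    target: str       # "local", or a gateway ID such as "igw-..." or "nat-..."


def classify_subnet(routes):
    """Classify a subnet by where its route table sends internet-bound traffic."""
    targets = [r.target for r in routes if r.destination == "0.0.0.0/0"]
    if any(t.startswith("igw-") for t in targets):
        return "public"
    if any(t.startswith("nat-") for t in targets):
        return "private (egress via NAT)"
    return "private (no internet egress)"


# Hypothetical route tables for a 10.10.0.0/16 VPC.
public_rt = [Route("10.10.0.0/16", "local"), Route("0.0.0.0/0", "igw-0example")]
app_rt = [Route("10.10.0.0/16", "local"), Route("0.0.0.0/0", "nat-0example")]
db_rt = [Route("10.10.0.0/16", "local")]

for name, table in (("public", public_rt), ("app", app_rt), ("db", db_rt)):
    print(name, "->", classify_subnet(table))
```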

Databases live in private subnets. Application servers live in private subnets. Only the things that genuinely need to be addressable from the internet — load balancers, bastions, public APIs — live in public subnets.

EC2: where instances actually live

An EC2 instance is a virtual machine. When you provision one you are choosing:

  • CPU and RAM — encoded in the instance family and size (t3.medium, c7g.large).
  • Storage — usually an attached EBS volume, which is zonal: an EBS volume and the EC2 instance using it must be in the same AZ.
  • Network bandwidth — scales with instance size.
  • Subnet: decides the AZ and whether the instance is public or private. Security Groups are scoped to the VPC, so any group in the subnet's VPC can attach.
  • Cost — directly tied to instance family, size, and region.

The instance gets an IP from its subnet's CIDR block. That IP is what shows up in VPC flow logs, Security Group rule sources, and route tables.

Security Groups

A Security Group (SG) is a virtual firewall attached to an instance's network interface — not to the subnet. It limits access by port, protocol, and source.

Two directions:

  • Ingress — incoming traffic to the instance.
  • Egress — outgoing traffic from the instance (usually open to all destinations by default so the server can fetch updates).

Two properties that surprise people the first time:

  • Stateful. If you allow an inbound TCP connection, the return traffic is automatically allowed. You do not have to add a matching egress rule.
  • Default-deny, allow-only. You can only write "allow" rules. To "block port 8080 from 108.55.33.111" you simply do not allow it — anything not explicitly allowed is dropped.

If you genuinely need explicit deny rules (e.g., to block a malicious IP range at the subnet boundary), reach for a Network ACL, which is stateless and operates at the subnet level. The two layers complement each other.

Example inbound rules:

Protocol  Port   Source             Purpose
--------  ----   ----------------   -------------------------------------
TCP       443    0.0.0.0/0          Public HTTPS on the ALB
TCP       22     192.168.1.0/24     SSH from the office network only
TCP       5432   sg-0abc123         Postgres, only from the app tier SG

The last row is the one most people miss: a Security Group's source can be another Security Group instead of an IP block. That lets you write rules in terms of roles ("the app tier can reach the database tier") instead of addresses. When you autoscale and instances get new IPs, the rule keeps working because it is keyed off membership, not address.
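To make the allow-only model concrete, here is a toy evaluator for the inbound rules in the table above. It is a sketch of the matching logic only, not of how AWS implements it. Rule, is_allowed, and the sample IPs are hypothetical, and statefulness is left out.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str
    port: int
    source: str  # a CIDR block, or the ID of another Security Group ("sg-...")


INBOUND = [
    Rule("tcp", 443, "0.0.0.0/0"),
    Rule("tcp", 22, "192.168.1.0/24"),
    Rule("tcp", 5432, "sg-0abc123"),
]


def is_allowed(rules, protocol, port, source_ip, source_groups=frozenset()):
    """Default deny: traffic passes only if at least one allow rule matches."""
    for rule in rules:
        if rule.protocol != protocol or rule.port != port:
            continue
        if rule.source.startswith("sg-"):
            # Group-to-group rule: match on membership, not on address.
            if rule.source in source_groups:
                return True
        elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source):
            return True
    return False  # nothing matched, so the packet is dropped


print(is_allowed(INBOUND, "tcp", 443, "198.51.100.7"))               # True
print(is_allowed(INBOUND, "tcp", 22, "10.0.0.5"))                    # False
print(is_allowed(INBOUND, "tcp", 5432, "10.10.1.20", {"sg-0abc123"}))  # True
```

The last call passes whatever IP the app instance happens to have, which is the point of keying the rule off group membership.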

A common first mistake is opening arbitrary ports to 0.0.0.0/0. Never do that on anything that is not a load balancer. Default-deny only protects you while you actually use it.

S3: outside the VPC

S3 is object storage, and unlike everything else in this post it does not live in your VPC. The bucket name is part of a global namespace (every bucket name must be unique across all AWS customers worldwide), but the data is stored in a specific region you pick at create time.

By default, S3 is reached through its public endpoints, even from an EC2 instance inside your VPC, so the instance needs a path out through an IGW or a NAT Gateway. If you want VPC traffic to S3 to stay on a private path (cheaper than pushing it through a NAT Gateway, more private, no IGW involved), create a VPC Endpoint for S3.

DNS: how names become IPs

Behind every domain name there is a chain of records and servers. The four record types you see most often:

  • A — maps a domain to an IPv4 address. apple.com → 17.x.x.x.
  • AAAA — same idea, IPv6.
  • CNAME — aliases one domain to another. www.apple.com → apple.com.
  • NS — declares which name servers actually answer for a zone. apple.com → ns1.digitalocean.com. NS is how delegation works: a zone can be hosted anywhere, and the parent zone's NS record points at the right place.
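How the record types fit together is easier to see with a toy lookup table. The sketch below is not a real resolver: RECORDS is a made-up in-memory zone using documentation-only names and addresses, and resolve is a hypothetical helper that follows CNAME aliases until it finds the requested record type.

```python
# Hypothetical zone data; 203.0.113.0/24 and 2001:db8::/32 are documentation ranges.
RECORDS = {
    ("example.com", "A"): ["203.0.113.10"],
    ("example.com", "AAAA"): ["2001:db8::10"],
    ("example.com", "NS"): ["ns1.example.net", "ns2.example.net"],
    ("www.example.com", "CNAME"): ["example.com"],
}


def resolve(name, rtype="A", max_hops=8):
    """Look up rtype for name, following CNAME aliases along the way."""
    for _ in range(max_hops):
        answer = RECORDS.get((name, rtype))
        if answer:
            return answer
        alias = RECORDS.get((name, "CNAME"))
        if not alias:
            return []  # no such record and no alias to follow
        name = alias[0]
    raise RuntimeError("CNAME chain too long")


print(resolve("www.example.com"))          # ['203.0.113.10']
print(resolve("www.example.com", "AAAA"))  # ['2001:db8::10']
print(resolve("example.com", "NS"))        # ['ns1.example.net', 'ns2.example.net']
```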

When a browser resolves a name, it walks the tree:

Every step is cached. The reason a DNS change "takes 24 hours" is the TTL on each cached record at each layer.

In AWS, Route 53 plays both the registrar and authoritative-name-server roles. A typical setup: domain registered at Route 53, NS records point at Route 53's name servers, A or ALIAS records point at your ALB.

Why architecture diagrams

A good architecture diagram does three things at once: it forces you to name every box, it makes gaps visible (an arrow with no destination), and it gives the whole team the same mental model. You discover most architectural mistakes the first time you try to draw the system honestly. If you cannot draw it on a single page, that is the finding — not the diagram.

The debugging toolbox

When something does not connect, the question you are asking is usually one of these. Each has a tool worth keeping in muscle memory.

Who is listening on this box?

sudo netstat -tlupn | grep 443
sudo lsof -i :443

netstat -tlupn lists TCP and UDP listeners with process and PID. lsof -i :443 is the same idea, more readable on modern Linuxes. If nothing shows up, the application is not bound to the port you think it is.

What is actually flowing on the wire?

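sudo tcpdump -i ens5 -nn -A port 5432

-i ens5 picks the network interface; -nn skips name and port lookups; port 5432 matches Postgres traffic in both directions, so you see the server's replies as well as your own packets (dst port 5432 would show only the client-to-server half); -A shows packet contents as ASCII (use -X for hex). The handshake either happens or you see exactly where it stalls.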

What does this name resolve to?

nslookup api.example.com
dig +short api.example.com

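nslookup and dig query DNS servers directly and skip /etc/hosts, so if they return the wrong IP the bug is in DNS, not in your application. Your application usually resolves through the system resolver, which on most Linux setups consults /etc/hosts first. Check local overrides: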

sudo cat /etc/hosts

What is the system actually doing?

sar -n TCP 1       # TCP-level stats, 1s resolution
sar -n DEV 1       # NIC-level stats per interface
mpstat -P ALL 1    # CPU per core, 1s resolution

When latency spikes and you don't know whether the bottleneck is the network, the CPU, or one specific core, these tell you. sar and mpstat come from the sysstat package.

What does the local service config say?

sudo grep -n '^bind' /etc/redis/redis.conf
sudo nvim /etc/hosts

bind 127.0.0.1 in redis.conf means Redis only accepts connections from localhost. That single line is behind 20% of "why can't my app reach Redis" tickets.

Cron sanity

crontab -l                          # list current user's cron jobs
crontab -e                          # edit them
tail -f /home/ubuntu/cron_test.log  # watch a job's output live

When a job "should be running but isn't," step one is crontab -l to confirm it is even registered. Step two is tail -f on its log to watch it run, or not.

Simulating heavy load — defensively

sudo t50 198.51.100.42 --protocol TCP --dport 80 --syn --threshold 10000000

t50 is a stress / flood testing tool. Useful for testing your own ALB and rate limits under SYN-flood conditions. Only run it against infrastructure you own. Pointing it at someone else's network is illegal in most jurisdictions and the kind of mistake that ends careers.

The TCP three-way handshake

When you watch tcpdump, what you are looking for is the three-way handshake:

The four diagnostic patterns you'll see in tcpdump:

  • SYN leaves, no SYN-ACK comes back. The destination is not receiving you. Security Group or NACL is dropping you, or there is a route problem.
  • SYN-ACK comes back, no final ACK. Asymmetric routing — packets are taking different paths in and out. Common with split-horizon network setups or misconfigured NAT.
  • Three-way handshake completes, then RST. TCP connected but the application closed the connection. The process is listening but rejecting (wrong protocol, wrong TLS, etc.).
  • No SYN even leaves your machine. Local routing or local firewall. Check ip route and iptables.

The Security Group, the route table, and the listener are the three places to check, in that order, every time.
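When tcpdump is not available, a plain TCP connect from the client side already separates several of these cases. The following is a rough sketch using only the standard socket module. probe_tcp and its return strings are made up for illustration, and the interpretation in the comments is a rule of thumb rather than a guarantee.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and report how the handshake ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # SYN, SYN-ACK, ACK all happened: network path and listener are fine.
            return "connected"
    except ConnectionRefusedError:
        # The host answered the SYN with RST: reachable, but nothing is
        # listening on that port. Check the listener, not the firewall.
        return "refused"
    except socket.timeout:
        # The SYN went unanswered: typically a Security Group, NACL,
        # or routing problem somewhere on the path.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return f"error: {exc}"
```

A call such as probe_tcp("db.internal.example", 5432), with a hypothetical hostname, maps onto the checklist: "timeout" points at the Security Group or route table, "refused" points at the listener.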

Closing

The pieces are small. There are basically five primitives — region, VPC, subnet, route table, Security Group — and a handful of services that live at different scopes. Most of the complexity in real architectures comes from how those primitives are composed, not from any one being hard.

Good AWS networking instinct is built backwards: when you see a working architecture, take five minutes to trace the path of a single packet from a user's browser through Route 53, the IGW, the ALB, the public subnet, into a private subnet, through your application, to a database row. The names start sticking once you have walked the route a few times.
