Angga.
Tags: security, aws, devops, audit

How I audit cloud infrastructure

A working playbook for auditing a production cloud stack — twelve layers from network perimeter to CI/CD pipeline, with verification commands, severity ratings tied to remediation SLAs, and the trade-offs senior engineers make in practice.

25 min read

Most "security checklists" you find online were written by people who have never sat in a war room at 3am restoring a database. They list controls without context, miss the real risks, and conflate "compliance checkbox" with "actually secure."

This post is the playbook I run when I sit down to audit a production cloud infrastructure — twelve layers, from the network perimeter through the CI/CD pipeline and into a controlled attack simulation. It's opinionated. It is what I would do, not what a vendor template says to do.

By the end you'll have:

  • A mental model for moving through an audit: outside-in, by attack surface.
  • Concrete verification commands for every layer.
  • Severity ratings tied to remediation SLAs, so you can prioritize against real engineering capacity.
  • The trade-offs that distinguish a senior engineer's read from a junior's.

The severity legend

I rate every finding with the same five-tier scale. The SLA on the right is what I commit to in writing — anything else, and you have a process nobody respects.

Severity   Description                                            SLA
---------  -----------------------------------------------------  ------------
CRITICAL   Immediate exploitation risk; data-breach potential     < 24 hours
HIGH       Significant exposure; address urgently                 < 7 days
MEDIUM     Moderate risk; resolve in sprint cycle                 < 30 days
LOW        Minor finding; best-practice improvement               < 90 days
REVIEW     Requires manual review or policy decision              As scheduled

If a finding slips its SLA, it gets escalated. Not negotiable.
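To make the escalation rule mechanical, the legend can be encoded as data. The following is a minimal sketch, not part of any tool: the function names are hypothetical, and REVIEW findings are treated as having no fixed deadline because the table says "As scheduled".

```python
from datetime import datetime, timedelta, timezone

# SLA windows copied from the severity legend; REVIEW has no fixed deadline.
SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
    "LOW": timedelta(days=90),
}

def sla_deadline(severity, found_at):
    """Return the remediation deadline, or None when no fixed SLA applies."""
    window = SLA.get(severity.upper())
    return found_at + window if window else None

def needs_escalation(severity, found_at, now):
    """True once an unresolved finding has passed its SLA deadline."""
    deadline = sla_deadline(severity, found_at)
    return deadline is not None and now > deadline

found = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
print(sla_deadline("HIGH", found))
```

Running a check like this daily against the open-findings list is one way to keep escalation automatic rather than a matter of memory.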

The mental model

I move through an audit outside-in, following the path an attacker actually takes.

Each layer answers a different question:

  • Perimeter (1–3): can an attacker connect at all?
  • Conversations (4, 8): if they connect, can they intercept or read the traffic?
  • Identity (5, 7): if they're inside, what can they impersonate?
  • Network (6): if they pivot, how far can they reach?
  • Data (9): if all else fails, can we restore safely?
  • Supply chain (10, 11): did the attacker get in via a dependency or pipeline?
  • Validation (12): do the controls actually hold under pressure?

You audit in this order because each layer's findings constrain the next layer's risk model. Don't write an IAM audit before you know which networks reach which services.

1. Security Groups

Security Groups are AWS's stateful virtual firewall — return traffic for an allowed inbound connection is automatically permitted, so you only write rules for what you want to receive. Network ACLs are the stateless companion at the subnet boundary; they need explicit rules in both directions.

The single most common SG failure I see in the wild: one "app-sg" or "web-sg" reused across dev, staging, and prod. Once any one environment is compromised, lateral movement is free — the firewall agrees they're all the same tenant.

What to check

# List every SG and its inbound rules
aws ec2 describe-security-groups --query "SecurityGroups[*].{Name:GroupName,InboundRules:IpPermissions}" --output table

# Find any SG that allows 0.0.0.0/0 inbound (highest-risk finding)
aws ec2 describe-security-groups --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" --query "SecurityGroups[*].{ID:GroupId,Name:GroupName,Port:IpPermissions[].FromPort}"

# Audit NACLs for a VPC
aws ec2 describe-network-acls --filters "Name=vpc-id,Values=<vpc-id>" --output json

Common findings

Finding                            Risk                            Remediation                            Severity
---------------------------------  ------------------------------  -------------------------------------  --------
SSH (22) open to 0.0.0.0/0         Brute force, unauthorized SSH   Restrict to bastion/VPN CIDR only      CRITICAL
DB ports exposed publicly          Direct database compromise      Private subnet + app-tier SG only      CRITICAL
Outbound 0.0.0.0/0 on all ports    Data exfiltration channel       Limit egress to required CIDRs         HIGH
Identical SGs across environments  Lateral movement between envs   Separate SGs per env                   HIGH
No NACL deny rules                 No network-layer block          Add NACL denies for known-bad CIDRs    MEDIUM
Default SG in use                  Overly permissive catch-all     Empty default SG rules; per-app SGs    MEDIUM

The dangerous ports to scan for explicitly: 22 (SSH), 3389 (RDP), 5432 (Postgres), 6379 (Redis), 5672 (RabbitMQ), 3000 / 9090 (Grafana/Prometheus).
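The port list above can be checked against the JSON that describe-security-groups returns. This is a rough sketch under stated assumptions: the function name is hypothetical, the input is the "SecurityGroups" list from `--output json`, and only tcp, udp, and all-traffic rules are considered.

```python
# Ports called out above as dangerous when exposed to the internet.
DANGEROUS_PORTS = {22, 3389, 5432, 6379, 5672, 3000, 9090}

def world_open_findings(security_groups):
    """Yield (group_id, port) for dangerous ports reachable from anywhere."""
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
            open_v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
            if not (open_v4 or open_v6):
                continue
            protocol = perm.get("IpProtocol")
            if protocol == "-1":
                exposed = DANGEROUS_PORTS  # "all traffic" rule carries no port range
            elif protocol in ("tcp", "udp"):
                low, high = perm["FromPort"], perm["ToPort"]
                exposed = {p for p in DANGEROUS_PORTS if low <= p <= high}
            else:
                exposed = set()  # ICMP and other protocols have no ports
            for port in sorted(exposed):
                yield sg["GroupId"], port
```

Port ranges matter here: a rule for 0-65535 exposes every port on the list even though none of them appears literally in the rule.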

What's not on the checklist

The thing most reviewers miss: SG-as-source rules. The most expressive rule is allow tcp 5432 from sg-app-tier, not from 10.0.1.0/24. The CIDR form rots the moment you autoscale or migrate a subnet. The SG-as-source form survives because it's keyed on role, not address. If I see CIDR-based intra-VPC rules in a non-trivial deployment, that's a sign no one is reasoning about this layer at all — they're just copying from past Terraform.

2. UFW and iptables

Cloud Security Groups are the first perimeter; host-based firewalls are the second. Defense in depth means assuming a misconfigured SG is one git push away — and that if it happens, the host still drops the packet. This matters most for instances running databases, message brokers, or monitoring agents.

What to check

# UFW
sudo ufw status verbose
sudo ufw status numbered
sudo systemctl is-enabled ufw && sudo systemctl status ufw

# iptables (all tables)
sudo iptables -L -v -n --line-numbers
sudo iptables -t nat -L -v -n
sudo iptables -t mangle -L -v -n

# Persistence across reboots
sudo iptables-save | grep -v "^#"
ls /etc/iptables/rules.v4

The hardening baseline

Check                                    Expected                     Severity if failed
---------------------------------------  ---------------------------  ------------------
UFW/iptables enabled on every instance   Active & enabled at boot     HIGH
Default INPUT policy                     DROP                         CRITICAL
SSH restricted to VPN/bastion CIDR       Specific CIDR only           CRITICAL
Rules persist across reboot              Yes (rules.v4 file)          HIGH
Dropped packets logged                   Yes, with a log prefix       MEDIUM

The four principles:

  • Default-deny inbound; ACCEPT outbound only after egress proxying is in place.
  • ICMP dropped from untrusted sources reduces reconnaissance surface.
  • Rate-limit new connections with the iptables limit module (-m limit) or ufw limit to mitigate SYN floods.
  • Log dropped packets — incident response is impossible if you can't see what was attempted.

A sample minimum config:

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from <VPN_CIDR> to any port 22 proto tcp
sudo ufw allow 443/tcp
sudo ufw logging on
sudo ufw enable

What's not on the checklist

Host firewalls are routinely disabled by deployment scripts. I always grep the cloud-init, the Ansible playbook, and the Packer manifest for ufw disable or iptables --flush — they're more common than you think, often "for testing" and never re-enabled.
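The grep described above can be scripted so it runs on every audit. A minimal sketch follows; the pattern list is my own assumption about which commands count as "disabling" and is not exhaustive, and commented-out lines will match too, which I consider acceptable for a review aid.

```python
import re

# Commands that switch the host firewall off or wipe its rules.
FIREWALL_KILLERS = re.compile(
    r"\bufw\s+(disable|reset)\b"
    r"|\biptables\b[^\n#]*\s(-F|--flush)\b"
    r"|\bsystemctl\s+(stop|disable)\s+ufw\b"
)

def firewall_disabling_lines(text):
    """Return (line_number, line) pairs that disable or flush the firewall."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if FIREWALL_KILLERS.search(line)
    ]

script = (
    "#!/bin/bash\n"
    "apt-get install -y nginx\n"
    "ufw disable  # for testing\n"
    "iptables -t nat --flush\n"
    "ufw allow 443/tcp\n"
)
for number, line in firewall_disabling_lines(script):
    print(number, line)
```

Feed it the contents of cloud-init files, Ansible tasks, and Packer provisioners; anything it reports deserves a question about when the firewall gets turned back on.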

3. NGINX configuration

NGINX is the edge reverse proxy in most modern stacks. A misconfigured NGINX leaks backend topology, enables request smuggling, advertises versions to attackers, or amplifies DDoS. This section is where a security audit gets a real payoff per hour.

Rate limiting

The first line of defense against credential stuffing and volumetric attacks. NGINX uses a leaky-bucket algorithm via limit_req_zone.

# /etc/nginx/nginx.conf — global http block
http {
  limit_req_zone $binary_remote_addr zone=api_limit:10m   rate=30r/m;
  limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
  limit_req_zone $binary_remote_addr zone=general:10m     rate=100r/s;
}

# per route
location /api/auth/login {
  limit_req zone=login_limit burst=3 nodelay;
  limit_req_status 429;
  proxy_pass http://app_upstream;
}
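To reason about what burst=3 nodelay actually permits, it helps to model the bucket. The class below is a simplified single-client sketch based on my reading of how limit_req accounts for excess requests; the class name is hypothetical, and it ignores shared-memory eviction and multiple zones.

```python
class LimitReq:
    """Simplified single-key model of limit_req with nodelay.

    Counts in thousandths of a request so that slow rates such as 5r/m
    can be handled with integer arithmetic.
    """

    def __init__(self, rate_per_minute, burst):
        self.rate = rate_per_minute * 1000 // 60   # milli-requests drained per second
        self.burst = burst * 1000
        self.excess = 0
        self.last_ms = None

    def request(self, now_ms):
        """Return 200 if the request would be forwarded, 429 if rejected."""
        if self.last_ms is None:            # first request seen for this key
            self.last_ms = now_ms
            return 200
        elapsed = now_ms - self.last_ms
        excess = max(0, self.excess - self.rate * elapsed // 1000 + 1000)
        if excess > self.burst:
            return 429                      # rejected requests do not fill the bucket
        self.excess, self.last_ms = excess, now_ms
        return 200

limiter = LimitReq(rate_per_minute=5, burst=3)
codes = [limiter.request(0) for _ in range(20)]
print(codes.count(200), codes.count(429))
```

Under this model a client that fires 20 requests at once gets one request plus the burst of three through, and the rest are rejected until the bucket drains.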

Reverse proxy hardening

proxy_set_header Host              $host;
proxy_set_header X-Real-IP         $remote_addr;
proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_hide_header X-Powered-By;
# NGINX already drops the upstream Server header; server_tokens off hides its own version
server_tokens off;
proxy_connect_timeout 10s;
proxy_read_timeout    30s;
proxy_send_timeout    30s;

Security headers — the seven you owe your users

add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always;
add_header X-Frame-Options          "DENY" always;
add_header X-Content-Type-Options   "nosniff" always;
add_header X-XSS-Protection         "0" always;  # legacy XSS auditor is gone from current browsers; OWASP advises 0
add_header Referrer-Policy          "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy  "default-src 'self'; script-src 'self'" always;
add_header Permissions-Policy       "geolocation=(), microphone=(), camera=()" always;

What to validate

# Syntax check
sudo nginx -t

# Inspect live headers
curl -I https://your-domain.com

# Verify rate-limit kicks in
for i in {1..20}; do
  curl -s -o /dev/null -w '%{http_code}\n' https://your-domain.com/api/auth/login
done
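The curl -I inspection can be turned into a repeatable check. This sketch only tests for the presence of the seven headers listed earlier and for version leaks; it does not validate header values, and the function name is hypothetical.

```python
REQUIRED_HEADERS = (
    "Strict-Transport-Security",
    "X-Frame-Options",
    "X-Content-Type-Options",
    "X-XSS-Protection",
    "Referrer-Policy",
    "Content-Security-Policy",
    "Permissions-Policy",
)

def audit_headers(headers):
    """Return a list of findings for a dict of response headers."""
    seen = {name.lower(): value for name, value in headers.items()}
    findings = [f"missing {h}" for h in REQUIRED_HEADERS if h.lower() not in seen]
    if "x-powered-by" in seen:
        findings.append(f"leaks X-Powered-By: {seen['x-powered-by']}")
    server = seen.get("server", "")
    if any(ch.isdigit() for ch in server):
        findings.append(f"Server header exposes a version: {server}")
    return findings

print(audit_headers({"Server": "nginx/1.18.0", "X-Powered-By": "Express"}))
```

Header names are compared case-insensitively because HTTP/2 responses arrive lowercased.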

Required posture

Directive                Recommended Value           Severity if missing
-----------------------  --------------------------  -------------------
limit_req_zone (login)   max 5 req/min               CRITICAL
server_tokens            off                         MEDIUM
HSTS header              max-age=31536000            HIGH
X-Frame-Options          DENY                        MEDIUM
client_max_body_size     <= 10m                      HIGH
proxy_read_timeout       <= 30s                      MEDIUM

What's not on the checklist

The Server and X-Powered-By headers leak the exact NGINX and backend versions to anyone who runs curl -I, which is a free CVE lookup for an attacker. Set server_tokens off to drop the NGINX version and strip X-Powered-By at the proxy. Also: client_max_body_size defaults to 1m, which seems safe until a file-upload feature forces someone to raise it. I've seen client_max_body_size 0 (= unlimited) on prod NGINX more than once; it makes a slow-POST DoS trivial.

4. SSL/TLS certificates

TLS certs must be valid, properly configured, and cover internal service-to-service communication — not just the external endpoint. Expired or misconfigured internal certs, or missing mTLS on internal hops, create man-in-the-middle attack surface inside the perimeter — exactly where junior reviewers stop looking.

Inspecting certs across services

# External (NGINX/ALB)
openssl s_client -connect your-domain.com:443 -servername your-domain.com </dev/null 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

# Redis TLS port
openssl s_client -connect localhost:6380 </dev/null 2>/dev/null \
  | openssl x509 -noout -dates

# Postgres (STARTTLS)
openssl s_client -connect localhost:5432 -starttls postgres </dev/null 2>/dev/null \
  | openssl x509 -noout -dates -subject

# RabbitMQ AMQPS
openssl s_client -connect localhost:5671 </dev/null 2>/dev/null | openssl x509 -noout -dates

# Cipher suite enumeration
nmap --script ssl-enum-ciphers -p 443 your-domain.com

Per-service TLS requirements

Service              Port   TLS required?       Min TLS    Severity if plaintext
-------------------  -----  ------------------  ---------  ---------------------
Application (NGINX)  443    Yes — external      TLS 1.2+   CRITICAL
PostgreSQL           5432   Yes — internal      TLS 1.2+   CRITICAL
Redis                6380   Yes — TLS port      TLS 1.2+   CRITICAL
RabbitMQ             5671   Yes — AMQPS         TLS 1.2+   HIGH
Grafana              3000   Yes — HTTPS         TLS 1.2+   HIGH
SonarQube            9000   Yes — HTTPS         TLS 1.2+   HIGH

Expiry monitoring

A cert that nobody is watching expires on a Sunday morning. The minimum: a daily cron that alerts at 30, 14, and 7 days.

EXPIRY=$(openssl s_client -connect your-domain.com:443 </dev/null 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
DAYS_LEFT=$(( ( $(date -d "$EXPIRY" +%s) - $(date +%s) ) / 86400 ))
echo "Certificate expires in $DAYS_LEFT days"
[ "$DAYS_LEFT" -lt 30 ] && echo "WARNING: Renew certificate soon!"
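The shell check above warns at a single threshold. A small sketch of the three-tier alerting (30, 14, and 7 days) is shown here; it parses the date string printed by openssl x509 -enddate with the standard library's ssl.cert_time_to_seconds, and the function names are hypothetical.

```python
import ssl

ALERT_DAYS = (7, 14, 30)   # thresholds from the text, tightest first

def days_left(not_after, now_epoch):
    """Whole days until `not_after`, a string such as 'Jan  5 09:34:43 2018 GMT'."""
    return int((ssl.cert_time_to_seconds(not_after) - now_epoch) // 86400)

def alert_tier(days):
    """Return the tightest threshold the certificate has crossed, or None."""
    for threshold in ALERT_DAYS:
        if days <= threshold:
            return threshold
    return None

expiry = "Jan  5 09:34:43 2018 GMT"
ten_days_before = ssl.cert_time_to_seconds(expiry) - 10 * 86400
print(days_left(expiry, ten_days_before), alert_tier(10))
```

Passing the current time in as an argument keeps the function testable; in a cron job it would be time.time().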

Cipher remediation

ssl_protocols           TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
ssl_ciphers             ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384;
ssl_session_cache       shared:SSL:10m;
ssl_session_timeout     10m;
ssl_stapling            on;
ssl_stapling_verify     on;

What's not on the checklist

Internal services with TLS 1.0 or RC4 ciphers are everywhere because someone enabled them in 2017 "for compatibility" and never re-disabled them. Run nmap --script ssl-enum-ciphers on every internal port, not just 443. Also: ACME auto-renewal is great until your DNS-01 challenge silently breaks because someone rotated the IAM role. Test the renewal path in staging on a deliberately near-expiry cert.

5. Secrets management

Hardcoded credentials remain one of the most common root causes of cloud breaches. Three patterns leak secrets at scale:

  • Source code: .env files committed to git, API keys in test fixtures.
  • Environment variables: inherited by every child process and readable through /proc/<pid>/environ.
  • Unencrypted config: config.yaml baked into container images.

Detection

# Scan history with trufflehog
trufflehog git file://. --since-commit HEAD~50 --only-verified

# CI integration with gitleaks
gitleaks detect --source . --report-format json --report-path gitleaks-report.json

# Find .env files in git history
git log --all --full-history -- "**/.env" "**/.env.*"

# Live process env dump
cat /proc/<PID>/environ | tr "\0" "\n" | grep -iE "password|secret|key|token"

The runtime pattern that works

Secrets are fetched at runtime via IAM-role-based SDK calls, never injected as env vars at deployment time:

import boto3, json
client = boto3.client("secretsmanager", region_name="ap-southeast-1")
secret = json.loads(client.get_secret_value(SecretId="prod/app/db")["SecretString"])
DB_PASSWORD = secret["password"]

Every GetSecretValue call is logged to CloudTrail with IAM principal, timestamp, and source IP. This gives you an audit trail no env-var-injection scheme can match.

Password requirements

Secret type          Min length    Complexity                    Rotation
-------------------  ------------  ----------------------------  -----------------
Database passwords   32 chars      alphanumeric + special        Every 90 days
API keys / tokens    64 chars      cryptographically random      Every 180 days
JWT signing secret   32 bytes      base64-encoded urandom        On compromise only
Redis AUTH           32 chars      alphanumeric + special        Every 90 days
RabbitMQ user        32 chars      random alphanumeric           Every 90 days
Grafana admin        20+ chars     mixed case + numeric + sym    Every 90 days
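Secrets that meet the table's length and complexity rules are easy to generate with Python's secrets module. This is a sketch only: the special-character set is my assumption (chosen to avoid characters that commonly break connection strings), and the function names are hypothetical.

```python
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits
SPECIALS = "!#%^*-_=+"   # assumption: a set that is safe in most connection strings

def random_password(length=32, specials=SPECIALS):
    """Random password with at least one lower, upper, digit and, when
    `specials` is non-empty, one special character."""
    alphabet = ALPHANUMERIC + specials
    while True:
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)
                and (not specials or any(c in specials for c in candidate))):
            return candidate

def random_api_key(chars=64):
    """Hex API key of exactly `chars` characters (chars // 2 random bytes)."""
    return secrets.token_hex(chars // 2)

print(len(random_password()), len(random_api_key()))
```

Passing specials="" gives the "random alphanumeric" variant the table lists for RabbitMQ users.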

Pre-commit enforcement

The first line of defense is at the engineer's laptop:

# .gitignore — non-negotiable
.env
.env.*
*.pem
*.key
*.p12
*.pfx
secrets.yaml
secrets.json

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

What's not on the checklist

Rotation is the part everybody skips. A 90-day rotation policy with no automation is theatre — the team will silently extend it the moment it's inconvenient. AWS Secrets Manager's native RDS rotation (Lambda-driven) is the only path I've seen actually work. For non-RDS secrets, write the rotation Lambda and commit it before you commit the secret.

6. VPC architecture

A well-architected VPC ensures sensitive workloads (databases, queues, internal services) are never reachable from the internet. Standard pattern: public subnets host load balancers and NAT Gateways; private subnets host every application and data tier.

Subnet roles

Subnet type            Resources                       Routes to            Internet access
---------------------  ------------------------------  -------------------  ----------------------
Public Subnet          ALB, NAT GW, Bastion            IGW for outbound     Yes (controlled)
Private App Subnet     EC2 app servers, ECS tasks      NAT GW for egress    Outbound via NAT only
Private Data Subnet    RDS, ElastiCache, MQ            No internet route    None
Management Subnet      Bastion, monitoring             NAT GW or VPN        Via VPN only

What to check

# Every subnet and whether it auto-assigns public IPs
aws ec2 describe-subnets --query "Subnets[*].{ID:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone,Public:MapPublicIpOnLaunch}" --output table

# Route tables
aws ec2 describe-route-tables --query "RouteTables[*].{ID:RouteTableId,Routes:Routes,Assoc:Associations}" --output json

# CRITICAL: any route table that routes directly to an IGW from a non-public subnet
aws ec2 describe-route-tables --filters "Name=route.gateway-id,Values=igw-*" \
  --query "RouteTables[*].{RTB:RouteTableId,Associations:Associations[*].SubnetId}"

# NAT gateway HA — one per AZ
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --output table
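The IGW query lists every subnet with a direct internet route; the audit step is comparing that list with the subnets that are supposed to be public. A sketch of that comparison is below. It assumes the "RouteTables" list from describe-route-tables as input, uses hypothetical function names, and covers explicit subnet associations only, so subnets that fall back to the VPC's main route table need a separate look.

```python
def subnets_routed_to_igw(route_tables):
    """Return subnet IDs explicitly associated with a route table that
    contains a route to an internet gateway."""
    exposed = set()
    for table in route_tables:
        has_igw = any(
            route.get("GatewayId", "").startswith("igw-")
            for route in table.get("Routes", [])
        )
        if has_igw:
            exposed.update(
                a["SubnetId"] for a in table.get("Associations", []) if "SubnetId" in a
            )
    return exposed

def unexpected_public_subnets(route_tables, intended_public):
    """Subnets that reach an IGW but are not on the intended-public list."""
    return subnets_routed_to_igw(route_tables) - set(intended_public)
```

Any subnet this returns is either a data-tier subnet with an internet route (CRITICAL) or a gap in the intended-public list; both are worth knowing.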

VPC Flow Logs

Every VPC should have Flow Logs enabled, delivered to CloudWatch or S3, retained for at least 90 days:

aws ec2 create-flow-logs --resource-type VPC --resource-ids <vpc-id> \
  --traffic-type ALL --log-destination-type cloud-watch-logs \
  --log-group-name /aws/vpc/flowlogs --deliver-logs-permission-arn <iam-role-arn>

A useful CloudWatch Insights query for incident response:

fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count(*) as rejects by srcAddr, dstPort
| sort rejects desc

What's not on the checklist

NAT Gateways are zonal, not regional. Cross-AZ NAT traffic is both slower and chargeable, and an AZ failure on a single-NAT VPC takes your egress with it. Always one NAT per AZ in production. Also: route table associations get edited by humans during incidents and never reviewed — re-confirm them every audit, not just the routes themselves.

7. IAM roles and policies

IAM misconfigurations sit behind a huge share of cloud breaches. The single most dangerous pattern is wildcard actions on wildcard resources ("Action": "*" on "Resource": "*"), which gives an attacker full account access the moment any single component (a leaked access key, an SSRF on a metadata endpoint, a compromised container) is exploited.

What to check

# Every role and its policies
aws iam list-roles --query "Roles[*].{Name:RoleName,ARN:Arn}" --output table
aws iam list-attached-role-policies --role-name <role-name>
aws iam list-role-policies --role-name <role-name>

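# Wildcard-action detection (checks the default version; Action may be a string or a list)
VERSION=$(aws iam get-policy --policy-arn <arn> --query Policy.DefaultVersionId --output text)
aws iam get-policy-version --policy-arn <arn> --version-id "$VERSION" \
  | jq '.PolicyVersion.Document.Statement | if type == "array" then .[] else . end
        | select(.Effect == "Allow" and ([.Action] | flatten | any(. == "*")))'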

# IAM Access Analyzer
aws accessanalyzer list-findings --analyzer-arn <analyzer-arn> \
  --filter "status={eq=[ACTIVE]}" --output table

# Generate a least-privilege baseline from actual usage (returns a JobId to poll)
aws iam generate-service-last-accessed-details --arn <role-arn>
aws iam get-service-last-accessed-details --job-id <job-id>
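Policy documents are awkward to filter by hand because Statement, Action, and Resource can each be a single value or a list. The sketch below normalises those shapes before testing for the full-wildcard pattern; the function names are hypothetical, and service-wide wildcards such as s3:* are deliberately out of scope.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def wildcard_statements(policy_document):
    """Return Allow statements whose Action and Resource both contain "*"."""
    hits = []
    for stmt in _as_list(policy_document.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        if "*" in _as_list(stmt.get("Action")) and "*" in _as_list(stmt.get("Resource")):
            hits.append(stmt)
    return hits

admin = {"Version": "2012-10-17",
         "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"}}
print(len(wildcard_statements(admin)))
```

Run it over the Document field of every get-policy-version and get-role-policy response to cover managed and inline policies alike.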

Hardening checklist

Control                  Requirement                                       Severity
-----------------------  ------------------------------------------------  --------
Root account             MFA enabled; never used for daily ops             CRITICAL
EC2/ECS service roles    Scoped to specific resource ARNs                  HIGH
S3 bucket policies       No public s3:GetObject on sensitive buckets       CRITICAL
Cross-account access     Explicit external ID + condition key              HIGH
Wildcard policies        Zero wildcard actions on production roles         CRITICAL
Access key rotation      Rotate every 90 days                              HIGH
MFA enforcement          Required for all console and API users            HIGH
Permission boundaries    Set on all developer-created roles                MEDIUM

A scoped policy as the baseline

This is the shape every service role should have — verbs and resources both bounded:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:ap-southeast-1:ACCOUNT:secret:prod/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-app-bucket/uploads/*"
    }
  ]
}

What's not on the checklist

Inline policies are the second-place IAM smell after wildcards. They're invisible in console summaries, they don't show up in policy-version history, and they accumulate over years. Move every inline policy to a managed policy with a clear naming convention (AppName-Tier-Action) so they show up in the same review tools as everything else.

The third miss: IAM Access Analyzer is regional and is not enabled by default. The external-access analyzer costs nothing and surfaces resources shared outside the account, so create one in every region you use on day one.

8. Validating encrypted transit

"TLS is configured" is a configuration claim. "TLS is actually negotiated" is an audit claim. The two are not the same. Validation requires packet inspection.

Packet capture

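# Postgres traffic: must be encrypted bytes only, no readable SQL
# (-A has no effect together with -w, so capture first, then read the file back)
sudo tcpdump -i eth0 -s 0 -c 200 port 5432 -w /tmp/pg_capture.pcap
sudo tcpdump -A -r /tmp/pg_capture.pcap | grep -iE "SELECT|INSERT|UPDATE"

# Redis traffic: should NOT see "AUTH <password>" in plaintext (-l line-buffers output for grep)
sudo tcpdump -i lo -l -A -s 0 port 6379 | grep -iE "AUTH|PASS|password"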

# TLS handshake — record types only
sudo tcpdump -i eth0 port 443 -w /tmp/tls_handshake.pcap
# Inspect with: wireshark /tmp/tls_handshake.pcap

Socket state inspection

# Every listener with PID/process
sudo ss -tlnp

# Verify DB connections come from private IPs only
sudo ss -tnp state established | grep 5432

# Confirm Redis is not on 0.0.0.0
sudo ss -tlnp | grep 6379

TLS handshake verification per service

# Postgres — verify the SSL column is true
psql "host=localhost port=5432 user=app dbname=prod sslmode=require" \
  -c "SELECT ssl, version FROM pg_stat_ssl JOIN pg_stat_activity USING(pid) LIMIT 5;"

# Redis with TLS
redis-cli -h localhost -p 6380 --tls --cert /etc/redis/client.crt \
  --key /etc/redis/client.key --cacert /etc/redis/ca.crt PING

SAR for outbound anomaly detection

sudo apt install sysstat && sudo systemctl enable sysstat

# Real-time NIC stats
sar -n DEV 2 10

# Historical — look for off-hours exfiltration spikes
sar -n DEV -f /var/log/sysstat/sa$(date +%d) | grep eth0

What to confirm

Test                            Expected result                          Severity if failed
------------------------------  ---------------------------------------  -------------------
tcpdump on DB port              Encrypted bytes only, no readable SQL    CRITICAL
Redis port capture              No plaintext AUTH/GET commands           CRITICAL
TLS negotiation on every svc    TLS 1.2+ handshake observed              HIGH
Redis bind address              127.0.0.1 or private IP only             CRITICAL
Outbound traffic baseline       No anomalous spikes during off-hours     MEDIUM

What's not on the checklist

Junior reviewers stop at "the connection is over port 6380" or "we set sslmode=require." Senior reviewers run tcpdump and read the bytes. I've seen Redis containers configured for TLS but with the unencrypted port still open and a service quietly using it for legacy reasons. The packet capture is the only audit step that catches this.
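Reading the bytes can be partly automated once payloads have been extracted from a capture. The heuristics below are a rough sketch, not a protocol parser: one checks for a TLS record header at the start of a payload, the other looks for Redis commands in their clear-text RESP encoding. Both function names are hypothetical.

```python
def looks_like_tls_record(payload):
    """True if the bytes start with a TLS record header.

    Header layout: content type (20-23), major version 3, minor version 0-4.
    """
    return (
        len(payload) >= 5
        and payload[0] in (20, 21, 22, 23)
        and payload[1] == 3
        and payload[2] <= 4
    )

def plaintext_redis_commands(payload, watch=(b"AUTH", b"GET", b"SET")):
    """Return watched Redis commands visible in clear text (RESP encoding)."""
    upper = payload.upper()
    return [cmd.decode() for cmd in watch
            if b"$%d\r\n%s\r\n" % (len(cmd), cmd) in upper]

resp = b"*2\r\n$4\r\nAUTH\r\n$8\r\nhunter22\r\n"
print(looks_like_tls_record(resp), plaintext_redis_commands(resp))
```

A payload on the Redis port that fails the first check and passes the second is the legacy plaintext client described above.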

9. Database backups

A backup that has never been restored is not a backup. An unencrypted backup uploaded to a public S3 bucket is a plaintext copy of your entire production database, sitting on the internet.

Backup process audit

# Is the backup schedule actually running?
sudo systemctl status pg_backup.timer
crontab -l | grep backup

# When did the last upload land?
aws s3 ls s3://your-backup-bucket/postgres/ --recursive | sort | tail -10

The encrypted backup workflow

Dump → compress → encrypt → upload (SSE-KMS) → checksum → securely wipe local.

#!/bin/bash
# backup_and_upload.sh
set -euo pipefail  # stop before upload if the dump or encryption fails
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/tmp/db_backup_${DATE}.sql.gz"
ENCRYPTED_FILE="${BACKUP_FILE}.enc"

# 1. Dump and compress
PGPASSWORD=$(aws secretsmanager get-secret-value --secret-id prod/db \
  --query SecretString --output text | jq -r .password) \
  pg_dump -h localhost -U app -d production | gzip > "$BACKUP_FILE"

# 2. Encrypt before upload (AES-256)
openssl enc -aes-256-cbc -pbkdf2 -iter 100000 -salt \
  -pass env:BACKUP_ENCRYPTION_KEY \
  -in "$BACKUP_FILE" -out "$ENCRYPTED_FILE"

# 3. Upload with SSE-KMS for at-rest encryption (defense in depth)
aws s3 cp "$ENCRYPTED_FILE" s3://your-backup-bucket/postgres/ \
  --sse aws:kms --sse-kms-key-id alias/backup-key \
  --storage-class STANDARD_IA

# 4. Publish a checksum
sha256sum "$ENCRYPTED_FILE" | aws s3 cp - s3://your-backup-bucket/postgres/${DATE}.sha256

# 5. Securely wipe local
shred -vfzu "$BACKUP_FILE" "$ENCRYPTED_FILE"

S3 bucket hardening

  • Block all public access — enable every one of the four toggles.
  • Versioning on — protects against accidental delete and ransomware.
  • Object Lock (WORM) — immutable retention for compliance.
  • Lifecycle to Glacier at 30 days, expire at 1 year.
  • Access logging + CloudTrail data events — know who read backup objects.

Monthly restore drill

Once a month, decrypt, decompress, and load a real backup into a scratch database, and document the result:

openssl enc -d -aes-256-cbc -pbkdf2 -iter 100000 \
  -pass env:BACKUP_ENCRYPTION_KEY -in backup.sql.gz.enc | \
  gunzip | psql -h test-host -U app -d restore_test
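Before decrypting, it is worth confirming that the downloaded file matches the checksum published at backup time. A minimal sketch, assuming the checksum object holds a standard sha256sum line (hex digest followed by a file name); the function names are hypothetical.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so multi-gigabyte backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_checksum(path, checksum_line):
    """Compare against a `sha256sum` line: '<hex digest>  <file name>'."""
    expected = checksum_line.split()[0].lower()
    return sha256_of(path) == expected
```

A mismatch means the object was corrupted or replaced after upload, and the drill should stop there rather than restore it.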

Control summary

Control                  Requirement                              Severity
-----------------------  ---------------------------------------  --------
Backup encryption        AES-256 before S3 upload                 CRITICAL
S3 SSE-KMS               Enabled with customer-managed key        HIGH
S3 public access block   All 4 settings enabled                   CRITICAL
Backup frequency         Daily full + hourly WAL for Postgres     HIGH
Restore test             Monthly drill, documented                HIGH
Backup retention         30 days hot, 1 year archive              MEDIUM

What's not on the checklist

The dangerous moment in a backup pipeline is before the file gets encrypted: the unencrypted dump sits on local disk between pg_dump and openssl enc. If that disk is captured (a snapshot, an EBS-volume clone), every other control was useless. Use a tmpfs mount, or pipe pg_dump | gzip | openssl enc directly, so the plaintext never touches disk at all.

10. Dependency CVEs

OWASP A06:2021, "Vulnerable and Outdated Components", is one of the most routinely exploited categories in the Top 10. The audit covers application dependencies, container images, and the operational tools (Grafana, SonarQube, etc.) that often run unattended.

Application scanning

# Node.js
npm audit --audit-level=high
npm audit fix   # review the diff; --force can pull in breaking major upgrades

# Python
pip install safety
safety check -r requirements.txt --full-report

# Java/Maven
mvn org.owasp:dependency-check-maven:check -DfailBuildOnCVSS=7

# Containers — OS layers and app packages
trivy image --severity HIGH,CRITICAL --exit-code 1 your-registry/app:latest

Ops tooling — Grafana, SonarQube, etc.

The forgotten attack surface. Pin them in Docker Compose or Helm, and never use :latest in production. Subscribe to grafana.com/security and sonarsource.com/security. Maintain an update playbook: staging first, validate dashboards and quality gates, promote to production within the CVE SLA window.

# What's currently running?
docker ps --format "{{.Names}}\t{{.Image}}"
curl -s http://localhost:9000/api/system/status | jq -r .version

# Scan the running tags
trivy image grafana/grafana:10.x.x --severity CRITICAL,HIGH
trivy image sonarqube:10.x-community --severity CRITICAL,HIGH

Pipeline gate

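# .github/workflows/security.yml
on: push
jobs:
  security-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # required by upload-sarif
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master   # pin to a reviewed commit SHA in real pipelines
        with:
          image-ref: ${{ env.IMAGE_NAME }}
          format: sarif
          output: trivy-results.sarif
          severity: HIGH,CRITICAL
          exit-code: 1
      - name: Upload to GitHub Security tab
        if: always()   # upload findings even when the scan step fails the job
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif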

What's not on the checklist

A vulnerability database is only as useful as its update cadence. Trivy downloads its vulnerability DB and caches it, and a cached copy ages silently on long-lived runners; pin the action to a recent SHA and add a CI job that fails if the DB is older than 7 days. Also: npm audit fix --force is a phrase that has shipped more regressions than any other single command in JavaScript history. Land the fix on a branch, run the test suite, and only then merge.
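The seven-day freshness check could look roughly like this. It is a sketch built on an assumption: that the scanner's cached DB ships a metadata JSON document with an RFC 3339 "UpdatedAt" field. Verify the file location and field name for the Trivy version you run before relying on it; the function names are hypothetical.

```python
import json
import re
from datetime import datetime, timedelta, timezone

def db_age(metadata_json, now):
    """Age of the vulnerability DB, from its metadata document.

    Assumes an "UpdatedAt" value such as "2025-01-01T06:00:00.123456789Z".
    """
    stamp = json.loads(metadata_json)["UpdatedAt"]
    # Drop sub-second digits and spell out UTC so fromisoformat accepts it.
    stamp = re.sub(r"\.\d+", "", stamp).replace("Z", "+00:00")
    return now - datetime.fromisoformat(stamp)

def db_is_stale(metadata_json, now, max_age=timedelta(days=7)):
    """True when the DB is older than the allowed maximum age."""
    return db_age(metadata_json, now) > max_age

meta = '{"UpdatedAt": "2025-01-01T06:00:00.123456789Z"}'
print(db_is_stale(meta, datetime(2025, 1, 10, 6, 0, tzinfo=timezone.utc)))
```

In CI the job would read the metadata file, call db_is_stale with the current UTC time, and exit non-zero on True.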

11. CI/CD pipeline security

Shifting security left turns it from a release-gate into a continuous feedback loop. Every push should trigger validation across multiple layers: SAST on code, SCA on dependencies, secret scanning on diffs, IaC misconfig analysis, container CVEs, and DAST on the deployed staging environment.

SAST

sonar-scanner \
  -Dsonar.projectKey=my-app \
  -Dsonar.sources=./src \
  -Dsonar.host.url=https://sonar.internal \
  -Dsonar.token=$SONAR_TOKEN \
  -Dsonar.qualitygate.wait=true

semgrep --config=p/owasp-top-ten --config=p/secrets \
  --error --json --output semgrep-results.json ./src

IaC analysis

checkov -d ./terraform --framework terraform \
  --check 'CKV_AWS_*' --output junitxml > checkov-results.xml

tfsec ./terraform --minimum-severity HIGH --format junit > tfsec-results.xml

Container hardening

  • Run as non-root: USER 1001 at the bottom of every Dockerfile.
  • Minimal base images: distroless, alpine, or slim variants.
  • Never store secrets in ENV or ARG — they're baked into image layers.
  • Enable Docker Content Trust (DCT) to verify image signatures.
  • Read-only filesystem where possible: --read-only or securityContext in Kubernetes.

# Lint the Dockerfile and the built image
docker run --rm -i hadolint/hadolint < Dockerfile
dockle --exit-code 1 --exit-level fatal your-registry/app:latest

DAST against staging

docker run -v "$(pwd)":/zap/wrk/:rw -t ghcr.io/zaproxy/zaproxy:stable zap-full-scan.py \
  -t https://staging.your-domain.com \
  -r zap-report.html -I -j

Run DAST against staging only, never production, and fail the deploy pipeline on HIGH-risk alerts (ZAP's top risk level).

Gate matrix

Stage             Tool                          Type            Blocks pipeline?
----------------  ----------------------------  --------------  ----------------
Pre-commit        gitleaks                      secret scan     Yes
Build             SonarQube, Semgrep            SAST            Yes on CRITICAL
Build             npm audit / Trivy fs scan     SCA             Yes on HIGH+
Post-build        Trivy image scan              Container CVE   Yes on CRITICAL
IaC               Checkov, tfsec                Misconfig       Yes on HIGH+
Deploy staging    OWASP ZAP                     DAST            Yes on HIGH
Deploy prod       Manual security sign-off      Review gate     Yes

What's not on the checklist

Pipelines that don't block are pipelines that don't exist. I've seen "security scans" in Jenkins that produce a SARIF report and a green checkmark regardless of findings. Every scan must have a fail-threshold and exit non-zero when it trips. The job of the pipeline is to say no on your team's behalf when you forget to.
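A fail-threshold can be as small as a function that reads the SARIF report and decides the exit code. The sketch below assumes the standard SARIF layout (runs, then results, each with an optional level) and treats a missing level as "warning"; the function names and the choice of "error" as the only blocking level are my assumptions.

```python
def blocking_results(sarif, blocking_levels=("error",)):
    """Return (rule_id, level) for every SARIF result at a blocking level."""
    found = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            level = result.get("level", "warning")   # treated as the default level
            if level in blocking_levels:
                found.append((result.get("ruleId", "unknown"), level))
    return found

def gate_exit_code(sarif):
    """0 when the report is clean, 1 when the pipeline should stop."""
    return 1 if blocking_results(sarif) else 0

report = {"runs": [{"results": [
    {"ruleId": "CVE-2024-0001", "level": "error"},
    {"ruleId": "CVE-2024-0002", "level": "note"},
]}]}
print(gate_exit_code(report))
```

Wire the return value into sys.exit in the CI step so that a report with blocking findings can never end in a green checkmark.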

12. Attack simulation

Security controls must be validated under realistic attack conditions. Run planned, authorized attack simulations against staging or a dedicated test environment.

> Warning — always obtain written authorization before running any load or DDoS-style test. Running them against arbitrary production without authorization violates cloud-provider AUPs and may trigger legal consequences.

HTTP load testing with k6

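# Confirm the key ID below against the k6 installation docs before importing it
sudo gpg -k && sudo gpg --no-default-keyring \
  --keyring /usr/share/keyrings/k6-archive-keyring.gpg --recv-keys 8C728EB71A1B79B9D8A71
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
  | sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6

// k6_load_test.js: ramping load test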
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 50 },   // ramp up
    { duration: "5m", target: 200 },  // stress
    { duration: "2m", target: 500 },  // spike
    { duration: "2m", target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500"],
    http_req_failed:   ["rate<0.01"],
  },
};

export default function () {
  const res = http.get("https://staging.your-domain.com/api/health");
  check(res, { "status 200": (r) => r.status === 200 });
  sleep(1);
}

k6 run --out json=results.json k6_load_test.js
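
If you want to re-check the thresholds outside k6 (for example to archive a verdict with the audit), the JSON output can be summarized offline. A rough sketch, assuming `results.json` is k6's newline-delimited JSON where each sample is a `"type": "Point"` entry carrying `metric` and `data.value`; the helper names `summarize` and `passes` are hypothetical. It uses a nearest-rank percentile, so the number can differ slightly from k6's own interpolated p(95).

```python
import json


def summarize(lines):
    """Return (p95 latency in ms, failure rate) from k6 JSON output lines."""
    durations, failures = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("type") != "Point":
            continue  # skip metric declarations
        value = entry["data"]["value"]
        if entry["metric"] == "http_req_duration":
            durations.append(value)
        elif entry["metric"] == "http_req_failed":
            failures.append(value)  # 1 for a failed request, 0 otherwise
    if not durations or not failures:
        raise ValueError("no http_req_duration / http_req_failed samples found")
    durations.sort()
    # Nearest-rank 95th percentile: ceil(0.95 * n) using integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    p95 = durations[rank - 1]
    return p95, sum(failures) / len(failures)


def passes(p95_ms, fail_rate):
    """Same limits as the script's thresholds: p(95)<500 and rate<0.01."""
    return p95_ms < 500 and fail_rate < 0.01
```

Usage is `with open("results.json") as fh: print(passes(*summarize(fh)))`. The file covers every stage of the run, spike included, so the verdict is for the whole test rather than the 200-VU plateau alone.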

Rate-limit validation

for i in {1..50}; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    https://staging.your-domain.com/api/auth/login \
    -X POST -H "Content-Type: application/json" \
    -d '{"email":"test@test.com","password":"wrong"}')
  echo "Request $i: HTTP $STATUS"
done
# Expected: 401 (or whatever the API returns for bad credentials) for the first ~5, then 429 Too Many Requests
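
Eyeballing fifty lines of output gets old quickly. A small sketch of how the status codes from the loop above could be checked automatically; `first_throttled` and `rate_limit_ok` are hypothetical helpers, and the limit of five attempts is taken from the expectation stated above.

```python
def first_throttled(statuses):
    """1-based index of the first HTTP 429, or None if never throttled."""
    for i, status in enumerate(statuses, start=1):
        if status == 429:
            return i
    return None


def rate_limit_ok(statuses, max_allowed=5):
    """True if throttling starts no later than request max_allowed + 1
    and every request from that point on is also rejected with 429."""
    first = first_throttled(statuses)
    if first is None or first > max_allowed + 1:
        return False
    return all(s == 429 for s in statuses[first - 1:])
```

The second condition matters: a limiter that throttles request six and then lets request seven through is not doing its job. This assumes all fifty requests land inside a single rate-limit window; if the loop runs longer than the window, later non-429 responses can be legitimate.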

Slowloris

Hold HTTP connections open with partial requests to exhaust the connection pool:

sudo apt install slowhttptest
slowhttptest -c 500 -H -g -o slowloris_report -i 10 -r 200 -t GET \
  -u https://staging.your-domain.com -x 24 -p 3

Mitigation in NGINX:

client_body_timeout    10s;
client_header_timeout  10s;
keepalive_timeout      5s 5s;
send_timeout           10s;

AWS Shield + WAF validation

aws wafv2 list-web-acls --scope REGIONAL --output table
aws wafv2 get-web-acl-for-resource --resource-arn <alb-arn>

aws cloudwatch get-metric-statistics \
  --namespace AWS/WAFV2 --metric-name BlockedRequests \
  --dimensions Name=WebACL,Value=<acl-name> Name=Region,Value=ap-southeast-1 Name=Rule,Value=ALL \
  --start-time 2026-05-24T00:00:00Z --end-time 2026-05-25T00:00:00Z \
  --period 3600 --statistics Sum --output table
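
To turn that metric into a pass/fail signal, compare the blocked count before the test with the count during it. A minimal sketch, assuming the command above is rerun with `--output json` for each window, which returns a document with a `Datapoints` list whose entries carry a `Sum` field; the names `total_blocked` and `blocked_increased` are illustrative.

```python
import json


def total_blocked(metric_json: str) -> float:
    """Sum the hourly BlockedRequests datapoints in one CloudWatch response."""
    data = json.loads(metric_json)
    return sum(dp.get("Sum", 0.0) for dp in data.get("Datapoints", []))


def blocked_increased(before_json: str, during_json: str) -> bool:
    """True if the WAF blocked more requests during the test than before it."""
    return total_blocked(during_json) > total_blocked(before_json)
```

An empty `Datapoints` list counts as zero here. If both windows come back empty, suspect the dimensions or the time range before concluding that the WAF blocked nothing.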

Pass criteria

Test scenario              Expected behaviour                            Pass indicator
-------------------------  --------------------------------------------  --------------------------------
k6 load test (200 VUs)     P95 latency < 500ms; error rate < 1%          k6 threshold pass
Rate limit (login)         429 after 5 requests per minute               HTTP 429 observed
Slowloris (500 conns)      NGINX closes idle connections within 10s      No service degradation
DDoS simulation            Shield/WAF auto-blocks volumetric traffic     WAF BlockedRequests increases
Spike test (500 VUs)       Auto-scaling triggers; no 5xx errors          CloudWatch EC2 scaling activity
Post-attack recovery       Returns to baseline within 2 minutes          Latency returns to normal
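
The recovery row is the one people tend to judge by feel. A sketch of how it could be measured, assuming you have latency samples taken after the test stops, as (seconds since the attack ended, latency in ms) pairs sorted by time. The 10% tolerance around baseline and the name `recovery_seconds` are my assumptions, not part of any tool.

```python
def recovery_seconds(samples, baseline_ms, tolerance=0.10):
    """Return the first timestamp from which every later sample stays within
    tolerance of the baseline, or None if latency never settles."""
    limit = baseline_ms * (1 + tolerance)
    recovered_at = None
    for t, latency in samples:
        if latency <= limit:
            if recovered_at is None:
                recovered_at = t  # candidate recovery point
        else:
            recovered_at = None   # relapse: discard the candidate
    return recovered_at


def recovered_in_time(samples, baseline_ms, max_seconds=120):
    """Pass criterion from the table: back to baseline within 2 minutes."""
    t = recovery_seconds(samples, baseline_ms)
    return t is not None and t <= max_seconds
```

The relapse handling is deliberate: a service that dips back to baseline and then spikes again has not recovered, so only the start of the final stable stretch counts.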

What's not on the checklist

Run the simulation with the on-call rotation alerted. The point of an attack drill is partly to test the human response: did PagerDuty fire? Did the right person acknowledge? Did the runbook actually exist? A perfectly blocked attack with no human in the loop is also a finding, because it means your team can't tell when something's wrong.

The audit dashboard

After completing all twelve sections, I summarize the posture in a single table that goes to engineering leadership:

#   Domain                  Key risk area                       Status
--  ----------------------  ----------------------------------  ----------
01  Security Groups         Port exposure to 0.0.0.0/0          REVIEW
02  UFW / iptables          Host firewall default policies      REVIEW
03  NGINX config            Rate limiting, security headers     REVIEW
04  SSL/TLS certs           Expired certs, weak ciphers         REVIEW
05  Secrets management      Hardcoded secrets, weak passwords   REVIEW
06  VPC architecture        Private subnet IGW exposure         REVIEW
07  IAM policies            Wildcard permissions, no MFA        REVIEW
08  Encrypted transfer      Plaintext DB/Redis connections      REVIEW
09  Database backups        Unencrypted S3 uploads              REVIEW
10  Dependency CVEs         Unpatched third-party packages      REVIEW
11  CI/CD security          Missing SAST/DAST/IaC gates         REVIEW
12  Attack simulation       Rate limit & DDoS response          REVIEW

REVIEW is the starting state. After remediation, every row should read PASS or have an open ticket with an owner and a date.

Tools referenced in this playbook:

Category             Tool                              Purpose                                  License
-------------------  --------------------------------  ---------------------------------------  --------------------
Secret scanning      gitleaks                          Pre-commit & CI secret detection         Open Source
Container scanning   Trivy                             CVE scanning for images & filesystems    Open Source
SAST                 SonarQube                         Code quality + security analysis         Community/Commercial
SAST                 Semgrep                           Custom rule-based code scanning          Open Source / Pro
IaC scanning         Checkov                           Terraform/K8s misconfiguration           Open Source
Load testing         k6                                HTTP performance and stress testing      Open Source
DAST                 OWASP ZAP                         Dynamic web app security scanner         Open Source
SSL testing          testssl.sh                        TLS/SSL cipher and cert audit            Open Source
Network capture      tcpdump / Wireshark               Packet analysis for transit validation   Open Source
Cloud audit          AWS Config + IAM Access Analyzer  Continuous compliance monitoring         AWS Native

Closing

The audit is not the deliverable. The deliverable is the set of tickets the audit produces, each with an owner, a severity, an SLA, and a verification step. A 50-page PDF that nobody reads buys you nothing; ten well-scoped Jira tickets with named owners buy you a measurable improvement in posture by next quarter.

A few things I've learned over years of running this loop:

  • Severity ratings are a forcing function, not a description. Calling something CRITICAL with a 24-hour SLA changes how the team allocates time. Calling it HIGH does not.
  • Verifying the control matters more than configuring it. Run tcpdump on every port rather than trusting the config.
  • The pipeline is the audit you run every day. Everything you can move into a CI gate stops being a quarterly worry.
  • Run the attack simulation. The first time you do it, you'll find one thing that genuinely surprises you — and you'll be glad you found it before someone else did.

If you remember nothing else, remember the order: outside-in. Security Groups before IAM. IAM before backups. Backups before attack simulation. The map matches the attacker's path, because the attacker is the customer of your security work.
