How I audit cloud infrastructure
A working playbook for auditing a production cloud stack — twelve layers from network perimeter to CI/CD pipeline, with verification commands, severity ratings tied to remediation SLAs, and the trade-offs senior engineers make in practice.
Most "security checklists" you find online were written by people who have never sat in a war room at 3am restoring a database. They list controls without context, miss the real risks, and conflate "compliance checkbox" with "actually secure."
This post is the playbook I run when I sit down to audit a production cloud infrastructure — twelve layers, from the network perimeter through the CI/CD pipeline and into a controlled attack simulation. It's opinionated. It is what I would do, not what a vendor template says to do.
By the end you'll have:
- A mental model for moving through an audit: outside-in, by attack surface.
- Concrete verification commands for every layer.
- Severity ratings tied to remediation SLAs, so you can prioritize against real engineering capacity.
- The trade-offs that distinguish a senior engineer's read from a junior's.
The severity legend
I rate every finding with the same five-tier scale. The SLA on the right is what I commit to in writing — anything else, and you have a process nobody respects.
Severity   Description                                            SLA
---------  -----------------------------------------------------  ------------
CRITICAL   Immediate exploitation risk; data-breach potential     < 24 hours
HIGH       Significant exposure; address urgently                 < 7 days
MEDIUM     Moderate risk; resolve in sprint cycle                 < 30 days
LOW        Minor finding; best-practice improvement               < 90 days
REVIEW     Requires manual review or policy decision              As scheduled
If a finding slips its SLA, it gets escalated. Not negotiable.
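The escalation rule is easier to enforce when the SLA column becomes a deadline check. A minimal sketch, assuming each finding records when it was opened; `SLA_DAYS` mirrors the legend, while `Finding`, `deadline`, and `is_overdue` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# SLA windows from the severity legend; REVIEW is "as scheduled", so no deadline.
SLA_DAYS = {"CRITICAL": 1, "HIGH": 7, "MEDIUM": 30, "LOW": 90, "REVIEW": None}


@dataclass
class Finding:  # hypothetical record shape, for illustration only
    title: str
    severity: str
    opened_at: datetime


def deadline(finding: Finding) -> Optional[datetime]:
    """Return the SLA deadline, or None when the severity has no fixed SLA."""
    days = SLA_DAYS[finding.severity]
    return None if days is None else finding.opened_at + timedelta(days=days)


def is_overdue(finding: Finding, now: datetime) -> bool:
    """True when the finding has slipped its SLA and should be escalated."""
    due = deadline(finding)
    return due is not None and now > due
```

Running `is_overdue` over the open findings once a day is enough to drive the escalation list.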
The mental model
I move through an audit outside-in, following the path an attacker actually takes.
Each layer answers a different question:
- Perimeter (1–3): can an attacker connect at all?
- Conversations (4, 8): if they connect, can they intercept or read the traffic?
- Identity (5, 7): if they're inside, what can they impersonate?
- Network (6): if they pivot, how far can they reach?
- Data (9): if all else fails, can we restore safely?
- Supply chain (10, 11): did the attacker get in via a dependency or pipeline?
- Validation (12): do the controls actually hold under pressure?
You audit in this order because each layer's findings constrain the next layer's risk model. Don't write an IAM audit before you know which networks reach which services.
1. Security Groups
Security Groups are AWS's stateful virtual firewall — return traffic for an allowed inbound connection is automatically permitted, so you only write rules for what you want to receive. Network ACLs are the stateless companion at the subnet boundary; they need explicit rules in both directions.
The single most common SG failure I see in the wild: one "app-sg" or "web-sg" reused across dev, staging, and prod. Once any one environment is compromised, lateral movement is free — the firewall agrees they're all the same tenant.
What to check
# List every SG and its inbound rules
aws ec2 describe-security-groups --query "SecurityGroups[*].{Name:GroupName,InboundRules:IpPermissions}" --output table
# Find any SG that allows 0.0.0.0/0 inbound (highest-risk finding)
aws ec2 describe-security-groups --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" --query "SecurityGroups[*].{ID:GroupId,Name:GroupName,Port:IpPermissions[].FromPort}"
# Audit NACLs for a VPC
aws ec2 describe-network-acls --filters "Name=vpc-id,Values=<vpc-id>" --output json
Common findings
Finding                            Risk                            Remediation                            Severity
---------------------------------  ------------------------------  -------------------------------------  --------
SSH (22) open to 0.0.0.0/0         Brute force, unauthorized SSH   Restrict to bastion/VPN CIDR only      CRITICAL
DB ports exposed publicly          Direct database compromise      Private subnet + app-tier SG only      CRITICAL
Outbound 0.0.0.0/0 on all ports    Data exfiltration channel       Limit egress to required CIDRs         HIGH
Identical SGs across environments  Lateral movement between envs   Separate SGs per env                   HIGH
No NACL deny rules                 No network-layer block          Add NACL denies for known-bad CIDRs    MEDIUM
Default SG in use                  Overly permissive catch-all     Strip default SG rules; per-app SGs    MEDIUM
The dangerous ports to scan for explicitly: 22 (SSH), 3389 (RDP), 5432 (Postgres), 6379 (Redis), 5672 (RabbitMQ), 3000 / 9090 (Grafana/Prometheus).
What's not on the checklist
The thing most reviewers miss: SG-as-source rules. The most expressive rule is allow tcp 5432 from sg-app-tier, not from 10.0.1.0/24. The CIDR form rots the moment you autoscale or migrate a subnet. The SG-as-source form survives because it's keyed on role, not address. If I see CIDR-based intra-VPC rules in a non-trivial deployment, that's a sign no one is reasoning about this layer at all — they're just copying from past Terraform.
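The port scan described in this section can be automated against the JSON that `aws ec2 describe-security-groups` returns. A rough sketch, assuming the standard `SecurityGroups` output shape; `world_open_ports` is a hypothetical helper, and the port list is the one named above.

```python
from typing import Dict, Iterator, List, Tuple

# Ports called out in this section as dangerous when world-reachable.
DANGEROUS_PORTS = {22, 3389, 5432, 6379, 5672, 3000, 9090}
WORLD = {"0.0.0.0/0", "::/0"}


def world_open_ports(security_groups: List[Dict]) -> Iterator[Tuple[str, int]]:
    """Yield (group_id, port) for each dangerous port open to the internet."""
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            cidrs = {r["CidrIp"] for r in perm.get("IpRanges", [])}
            cidrs |= {r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])}
            if not cidrs & WORLD:
                continue  # rule is scoped to a CIDR or a source security group
            proto = perm.get("IpProtocol")
            if proto == "-1":
                lo, hi = 0, 65535  # "-1" means all protocols and all ports
            elif proto in ("tcp", "udp"):
                lo, hi = perm["FromPort"], perm["ToPort"]
            else:
                continue  # ICMP and others carry no port range
            for port in sorted(DANGEROUS_PORTS):
                if lo <= port <= hi:
                    yield sg["GroupId"], port
```

Feed it `json.loads(cli_output)["SecurityGroups"]`; it checks the IPv6 wildcard too, which the IPv4-only CLI filter does not cover.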
2. UFW and iptables
Cloud Security Groups are the first perimeter; host-based firewalls are the second. Defense in depth means assuming a misconfigured SG is one git push away — and that if it happens, the host still drops the packet. This matters most for instances running databases, message brokers, or monitoring agents.
What to check
# UFW
sudo ufw status verbose
sudo ufw status numbered
sudo systemctl is-enabled ufw && sudo systemctl status ufw
# iptables (all tables)
sudo iptables -L -v -n --line-numbers
sudo iptables -t nat -L -v -n
sudo iptables -t mangle -L -v -n
# Persistence across reboots
sudo iptables-save | grep -v "^#"
ls /etc/iptables/rules.v4
The hardening baseline
Check                                    Expected                     Severity if failed
---------------------------------------  ---------------------------  ------------------
UFW/iptables enabled on every instance   Active & enabled at boot     HIGH
Default INPUT policy                     DROP                         CRITICAL
SSH restricted to VPN/bastion CIDR       Specific CIDR only           CRITICAL
Rules persist across reboot              Yes (rules.v4 file)          HIGH
Dropped packets logged                   Yes, with a log prefix       MEDIUM
The four principles:
- Default-deny inbound; ACCEPT outbound only after egress proxying is in place.
- Drop ICMP from untrusted sources to reduce the reconnaissance surface.
- Rate-limit new connections with the iptables limit module (-m limit) to mitigate SYN floods.
- Log dropped packets; incident response is impossible if you can't see what was attempted.
A sample minimum config:
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from <VPN_CIDR> to any port 22
sudo ufw allow 443/tcp
sudo ufw enable
What's not on the checklist
Host firewalls are routinely disabled by deployment scripts. I always grep the cloud-init, the Ansible playbook, and the Packer manifest for ufw disable or iptables --flush — they're more common than you think, often "for testing" and never re-enabled.
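The grep described above can be turned into a repeatable check. A small sketch, assuming the deployment files have already been read into memory; `find_firewall_disables` and the regex are illustrative rather than exhaustive, and will miss firewalls disabled through wrapper scripts or configuration-management modules.

```python
import re
from typing import Dict, List, Tuple

# Commands that switch the host firewall off or empty its rule set.
FIREWALL_KILLERS = re.compile(
    r"\bufw\s+(?:--force\s+)?disable\b"
    r"|\biptables\b[^\n#]*\s(?:-F|--flush)\b"
    r"|\bsystemctl\s+(?:stop|disable)\s+ufw\b"
)


def find_firewall_disables(files: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (filename, line_number, line) for every suspicious line."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FIREWALL_KILLERS.search(line):
                hits.append((name, lineno, line.strip()))
    return hits
```

Any hit deserves a question to whoever owns the file: was the firewall turned back on afterwards, and where?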
3. NGINX configuration
NGINX is the edge reverse proxy in most modern stacks. A misconfigured NGINX leaks backend topology, enables request smuggling, advertises versions to attackers, or amplifies DDoS. This section is where a security audit gets a real payoff per hour.
Rate limiting
The first line of defense against credential stuffing and volumetric attacks. NGINX uses a leaky-bucket algorithm via limit_req_zone.
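To build intuition for how the rate and burst settings interact, here is a simplified model of the leaky bucket with `nodelay`. It is an approximation for reasoning about configs, not NGINX's implementation (which tracks state per key in scaled integer units); `simulate_limit_req` is a hypothetical name.

```python
from typing import List


def simulate_limit_req(arrivals: List[float], rate_per_min: float, burst: int) -> List[int]:
    """Return the HTTP status for each arrival time (in seconds), in order.

    `excess` counts how many requests the client is ahead of the configured
    rate; it drains continuously as time passes.
    """
    rate_per_sec = rate_per_min / 60.0
    excess = 0.0
    last = None
    statuses = []
    for t in arrivals:
        if last is None:
            candidate = 0.0  # the first request finds an empty bucket
        else:
            candidate = max(0.0, excess - (t - last) * rate_per_sec) + 1.0
        if candidate > burst:
            statuses.append(429)  # rejected requests do not fill the bucket
        else:
            statuses.append(200)
            excess, last = candidate, t
    return statuses
```

In this model, with a rate of 5 per minute and a burst of 3, the first request plus three more pass back-to-back and the fifth is rejected; capacity then returns at one request every 12 seconds.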
# /etc/nginx/nginx.conf (global http block)
http {
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;
    limit_req_zone $binary_remote_addr zone=login_limit:10m rate=5r/m;
    limit_req_zone $binary_remote_addr zone=general:10m rate=100r/s;
}
# per route (inside a server block)
location /api/auth/login {
    limit_req zone=login_limit burst=3 nodelay;
    limit_req_status 429;
    proxy_pass http://app_upstream;
}
Reverse proxy hardening
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_hide_header X-Powered-By;
proxy_hide_header Server;
proxy_connect_timeout 10s;
proxy_read_timeout 30s;
proxy_send_timeout 30s;
Security headers: the seven you owe your users
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always;
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self'" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
What to validate
# Syntax check
sudo nginx -t
# Inspect live headers
curl -I https://your-domain.com
# Verify rate-limit kicks in
for i in {1..20}; do
    curl -s -o /dev/null -w '%{http_code}\n' https://your-domain.com/api/auth/login
done
Required posture
Directive                Recommended Value           Severity if missing
-----------------------  --------------------------  -------------------
limit_req_zone (login)   max 5 req/min               CRITICAL
server_tokens            off                         MEDIUM
HSTS header              max-age=31536000            HIGH
X-Frame-Options          DENY                        MEDIUM
client_max_body_size     <= 10m                      HIGH
proxy_read_timeout       <= 30s                      MEDIUM
What's not on the checklist
The Server and X-Powered-By headers leak the exact NGINX and backend versions to anyone who runs curl -I, which is a free CVE lookup for an attacker. Hide them: server_tokens off strips the version from NGINX's own Server header, and proxy_hide_header X-Powered-By removes the backend's. Also: client_max_body_size defaults to 1m, which seems safe until a file-upload feature forces someone to raise it. I've seen client_max_body_size 0 (unlimited) on prod NGINX more than once; it makes a slow-POST DoS trivial.
4. SSL/TLS certificates
TLS certs must be valid, properly configured, and cover internal service-to-service communication — not just the external endpoint. Expired or misconfigured internal certs, or missing mTLS on internal hops, create man-in-the-middle attack surface inside the perimeter — exactly where junior reviewers stop looking.
Inspecting certs across services
# External (NGINX/ALB)
openssl s_client -connect your-domain.com:443 -servername your-domain.com </dev/null 2>/dev/null \
| openssl x509 -noout -dates -subject -issuer
# Redis TLS port
openssl s_client -connect localhost:6380 </dev/null 2>/dev/null \
| openssl x509 -noout -dates
# Postgres (STARTTLS)
openssl s_client -connect localhost:5432 -starttls postgres </dev/null 2>/dev/null \
| openssl x509 -noout -dates -subject
# RabbitMQ AMQPS
openssl s_client -connect localhost:5671 </dev/null 2>/dev/null | openssl x509 -noout -dates
# Cipher suite enumeration
nmap --script ssl-enum-ciphers -p 443 your-domain.com
Per-service TLS requirements
Service              Port   TLS required?       Min TLS    Severity if plaintext
-------------------  -----  ------------------  ---------  ---------------------
Application (NGINX)  443    Yes (external)      TLS 1.2+   CRITICAL
PostgreSQL           5432   Yes (internal)      TLS 1.2+   CRITICAL
Redis                6380   Yes (TLS port)      TLS 1.2+   CRITICAL
RabbitMQ             5671   Yes (AMQPS)         TLS 1.2+   HIGH
Grafana              3000   Yes (HTTPS)         TLS 1.2+   HIGH
SonarQube            9000   Yes (HTTPS)         TLS 1.2+   HIGH
Expiry monitoring
A cert that nobody is watching expires on a Sunday morning. The minimum: a daily cron that alerts at 30, 14, and 7 days.
EXPIRY=$(openssl s_client -connect your-domain.com:443 </dev/null 2>/dev/null \
| openssl x509 -noout -enddate | cut -d= -f2)
DAYS_LEFT=$(( ( $(date -d "$EXPIRY" +%s) - $(date +%s) ) / 86400 ))
echo "Certificate expires in $DAYS_LEFT days"
[ "$DAYS_LEFT" -lt 30 ] && echo "WARNING: Renew certificate soon!"
Cipher remediation
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
ssl_stapling on;
ssl_stapling_verify on;
What's not on the checklist
Internal services with TLS 1.0 or RC4 ciphers are everywhere because someone enabled them in 2017 "for compatibility" and never re-disabled them. Run nmap --script ssl-enum-ciphers on every internal port, not just 443. Also: ACME auto-renewal is great until your DNS-01 challenge silently breaks because someone rotated the IAM role. Test the renewal path in staging on a deliberately near-expiry cert.
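The 30, 14, and 7 day alert points from the expiry-monitoring step can be expressed as a small function that a cron job or a staging renewal test could call. A sketch under the assumption that the input is the `notAfter` text printed by `openssl x509 -enddate`; `days_until_expiry` and `alert_level` are hypothetical names.

```python
import ssl
from typing import Optional

ALERT_THRESHOLDS = (7, 14, 30)  # days, tightest first


def days_until_expiry(not_after: str, now_epoch: float) -> int:
    """Whole days between now and a certificate's notAfter timestamp.

    `not_after` uses the OpenSSL text form, e.g. "Jun  1 12:00:00 2030 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expires_epoch - now_epoch) // 86400)


def alert_level(days_left: int) -> Optional[int]:
    """Return the tightest threshold that has been crossed, or None."""
    for threshold in ALERT_THRESHOLDS:
        if days_left <= threshold:
            return threshold
    return None
```

Passing `time.time()` as `now_epoch` gives the live value; injecting the clock keeps the function testable against a deliberately near-expiry certificate.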
5. Secrets management
Hardcoded credentials remain one of the leading causes of cloud breaches. Three patterns leak secrets at scale:
- Source code: .env files committed to git, API keys in test fixtures.
- Environment variables: inherited by every child process and readable via /proc/<pid>/environ.
- Unencrypted config: config.yaml baked into container images.
Detection
# Scan history with trufflehog
trufflehog git file://. --since-commit HEAD~50 --only-verified
# CI integration with gitleaks
gitleaks detect --source . --report-format json --report-path gitleaks-report.json
# Find .env files in git history
git log --all --full-history -- "**/.env" "**/.env.*"
# Live process env dump
cat /proc/<PID>/environ | tr "\0" "\n" | grep -iE "password|secret|key|token"
The runtime pattern that works
Secrets are fetched at runtime via IAM-role-based SDK calls, never injected as env vars at deployment time:
import json
import boto3
# Credentials come from the instance or task IAM role; nothing is stored in code
client = boto3.client("secretsmanager", region_name="ap-southeast-1")
secret = json.loads(client.get_secret_value(SecretId="prod/app/db")["SecretString"])
DB_PASSWORD = secret["password"]
Every GetSecretValue call is logged to CloudTrail with IAM principal, timestamp, and source IP. This gives you an audit trail no env-var-injection scheme can match.
Password requirements
Secret type          Min length    Complexity                    Rotation
-------------------  ------------  ----------------------------  ------------------
Database passwords   32 chars      alphanumeric + special        Every 90 days
API keys / tokens    64 chars      cryptographically random      Every 180 days
JWT signing secret   32 bytes      base64-encoded urandom        On compromise only
Redis AUTH           32 chars      alphanumeric + special        Every 90 days
RabbitMQ user        32 chars      random alphanumeric           Every 90 days
Grafana admin        20+ chars     mixed case + numeric + sym    Every 90 days
Pre-commit enforcement
The first line of defense is at the engineer's laptop:
# .gitignore (non-negotiable; one pattern per line)
.env
.env.*
*.pem
*.key
*.p12
*.pfx
secrets.yaml
secrets.json
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
What's not on the checklist
Rotation is the part everybody skips. A 90-day rotation policy with no automation is theatre — the team will silently extend it the moment it's inconvenient. AWS Secrets Manager's native RDS rotation (Lambda-driven) is the only path I've seen actually work. For non-RDS secrets, write the rotation Lambda and commit it before you commit the secret.
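Until rotation is automated, the policy at least needs a check that cannot be silently extended. A minimal sketch that flags secrets older than the windows in the table above; the secret names and the `kinds` mapping are hypothetical, and in practice the dates would likely come from the `LastRotatedDate` field that Secrets Manager reports.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

# Rotation windows from the table above; None means "on compromise only".
ROTATION_DAYS: Dict[str, Optional[int]] = {
    "database": 90,
    "api_key": 180,
    "jwt_signing": None,
    "redis_auth": 90,
    "rabbitmq_user": 90,
    "grafana_admin": 90,
}


def overdue_secrets(last_rotated: Dict[str, date], kinds: Dict[str, str], today: date) -> List[str]:
    """Return names of secrets whose last rotation is older than policy allows."""
    overdue = []
    for name, rotated_on in last_rotated.items():
        window = ROTATION_DAYS[kinds[name]]
        if window is not None and today - rotated_on > timedelta(days=window):
            overdue.append(name)
    return sorted(overdue)
```

Anything this returns is a finding in its own right, rated by how sensitive the secret is.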
6. VPC architecture
A well-architected VPC ensures sensitive workloads (databases, queues, internal services) are never reachable from the internet. Standard pattern: public subnets host load balancers and NAT Gateways; private subnets host every application and data tier.
Subnet roles
Subnet type            Resources                       Routes to            Internet access
---------------------  ------------------------------  -------------------  ----------------------
Public Subnet          ALB, NAT GW, Bastion            IGW for outbound     Yes (controlled)
Private App Subnet     EC2 app servers, ECS tasks      NAT GW for egress    Outbound via NAT only
Private Data Subnet    RDS, ElastiCache, MQ            No internet route    None
Management Subnet      Bastion, monitoring             NAT GW or VPN        Via VPN only
What to check
# Every subnet and whether it auto-assigns public IPs
aws ec2 describe-subnets --query "Subnets[*].{ID:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone,Public:MapPublicIpOnLaunch}" --output table
# Route tables
aws ec2 describe-route-tables --query "RouteTables[*].{ID:RouteTableId,Routes:Routes,Assoc:Associations}" --output json
# CRITICAL: any route table that routes directly to an IGW from a non-public subnet
aws ec2 describe-route-tables --filters "Name=route.gateway-id,Values=igw-*" \
--query "RouteTables[*].{RTB:RouteTableId,Associations:Associations[*].SubnetId}"
# NAT gateway HA — one per AZ
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --output table
VPC Flow Logs
Every VPC should have Flow Logs enabled, delivered to CloudWatch or S3, retained for at least 90 days:
aws ec2 create-flow-logs --resource-type VPC --resource-ids <vpc-id> \
--traffic-type ALL --log-destination-type cloud-watch-logs \
--log-group-name /aws/vpc/flowlogs --deliver-logs-permission-arn <iam-role-arn>
A useful CloudWatch Insights query for incident response:
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count(*) as rejects by srcAddr, dstPort
| sort rejects desc
What's not on the checklist
NAT Gateways are zonal, not regional. Cross-AZ NAT traffic is both slower and chargeable, and an AZ failure on a single-NAT VPC takes your egress with it. Always one NAT per AZ in production. Also: route table associations get edited by humans during incidents and never reviewed — re-confirm them every audit, not just the routes themselves.
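Re-confirming route table associations is tedious by eye, so here is a sketch that cross-checks the `describe-route-tables` JSON against the list of subnets that are supposed to be public. `private_subnets_with_igw_route` is a hypothetical helper; it only sees explicit associations, so subnets that silently fall back to the main route table still need a manual look.

```python
from typing import Dict, List, Set


def private_subnets_with_igw_route(route_tables: List[Dict], public_subnets: Set[str]) -> List[str]:
    """Return subnet IDs that route to an Internet Gateway but are not meant to be public."""
    exposed = set()
    for table in route_tables:
        has_igw = any(
            route.get("GatewayId", "").startswith("igw-")
            for route in table.get("Routes", [])
        )
        if not has_igw:
            continue
        for assoc in table.get("Associations", []):
            subnet = assoc.get("SubnetId")  # absent on the main-table association
            if subnet and subnet not in public_subnets:
                exposed.add(subnet)
    return sorted(exposed)
```

An empty result is the expected state; any subnet ID it returns is a CRITICAL finding until proven otherwise.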
7. IAM roles and policies
IAM misconfigurations sit behind a huge share of cloud breaches. The single most dangerous pattern is wildcard actions on wildcard resources — "Action": "*" on "Resource": "*" — which gives an attacker full account access the moment any single component (a leaked access key, an SSRF on a metadata endpoint, a compromised container) is exploited.
What to check
# Every role and its policies
aws iam list-roles --query "Roles[*].{Name:RoleName,ARN:Arn}" --output table
aws iam list-attached-role-policies --role-name <role-name>
aws iam list-role-policies --role-name <role-name>
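# Wildcard-action detection (take <version-id> from DefaultVersionId in `aws iam get-policy`)
aws iam get-policy-version --policy-arn <arn> --version-id <version-id> \
| jq '.PolicyVersion.Document.Statement[] | select(.Effect=="Allow" and ([.Action] | flatten | any(. == "*")))'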
# IAM Access Analyzer
aws accessanalyzer list-findings --analyzer-arn <analyzer-arn> \
--filter "status={eq=[ACTIVE]}" --output table
# Generate a least-privilege baseline from actual usage
aws iam generate-service-last-accessed-details --arn <role-arn>
Hardening checklist
Control                  Requirement                                       Severity
-----------------------  ------------------------------------------------  --------
Root account             MFA enabled; never used for daily ops             CRITICAL
EC2/ECS service roles    Scoped to specific resource ARNs                  HIGH
S3 bucket policies       No public s3:GetObject on sensitive buckets       CRITICAL
Cross-account access     Explicit external ID + condition key              HIGH
Wildcard policies        Zero wildcard actions on production roles         CRITICAL
Access key rotation      Rotate every 90 days                              HIGH
MFA enforcement          Required for all console and API users            HIGH
Permission boundaries    Set on all developer-created roles                MEDIUM
A scoped policy as the baseline
This is the shape every service role should have — verbs and resources both bounded:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:ap-southeast-1:ACCOUNT:secret:prod/*"
},
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-app-bucket/uploads/*"
}
]
}
What's not on the checklist
Inline policies are the second-place IAM smell after wildcards. They're invisible in console summaries, they don't show up in policy-version history, and they accumulate over years. Move every inline policy to a managed policy with a clear naming convention (AppName-Tier-Action) so they show up in the same review tools as everything else.
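The third miss: IAM Access Analyzer is a regional service and is not on by default, so an analyzer has to be created in every region you use. External-access analysis carries no extra charge and surfaces resources shared outside the account, so turn it on everywhere on day one.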
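The wildcard check from earlier in this section trips people up because IAM lets `Statement`, `Action`, and `Resource` each be either a single value or a list. A sketch that normalizes both forms before testing; `wildcard_statements` is a hypothetical helper and deliberately narrow, so it will not flag service-level wildcards such as `s3:*` or `NotAction` tricks.

```python
from typing import Dict, List, Union


def _as_list(value: Union[str, List, Dict, None]) -> List:
    """IAM accepts a single value or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy_document: Dict) -> List[Dict]:
    """Return Allow statements that grant every action on every resource."""
    flagged = []
    for stmt in _as_list(policy_document.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged
```

Run it over the `Document` of every attached and inline policy; inline policies are exactly where these statements tend to hide.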
8. Validating encrypted transit
"TLS is configured" is a configuration claim. "TLS is actually negotiated" is an audit claim. The two are not the same. Validation requires packet inspection.
Packet capture
# Postgres traffic — must be encrypted bytes only, no readable SQL
sudo tcpdump -i eth0 -A -s 0 port 5432 -w /tmp/pg_capture.pcap
# Redis traffic — should NOT see "AUTH <password>" in plaintext
sudo tcpdump -i lo port 6379 -A -s 0 | grep -iE "AUTH|PASS|password"
# TLS handshake — record types only
sudo tcpdump -i eth0 port 443 -w /tmp/tls_handshake.pcap
# Inspect with: wireshark /tmp/tls_handshake.pcap
Socket state inspection
# Every listener with PID/process
sudo ss -tlnp
# Verify DB connections come from private IPs only
sudo ss -tnp state established | grep 5432
# Confirm Redis is not on 0.0.0.0
sudo ss -tlnp | grep 6379
TLS handshake verification per service
# Postgres — verify the SSL column is true
psql "host=localhost port=5432 user=app dbname=prod sslmode=require" \
-c "SELECT ssl, version FROM pg_stat_ssl JOIN pg_stat_activity USING(pid) LIMIT 5;"
# Redis with TLS
redis-cli -h localhost -p 6380 --tls --cert /etc/redis/client.crt \
--key /etc/redis/client.key --cacert /etc/redis/ca.crt PING
SAR for outbound anomaly detection
sudo apt install sysstat && sudo systemctl enable sysstat
# Real-time NIC stats
sar -n DEV 2 10
# Historical — look for off-hours exfiltration spikes
sar -n DEV -f /var/log/sysstat/sa$(date +%d) | grep eth0
What to confirm
Test                            Expected result                          Severity if failed
------------------------------  ---------------------------------------  -------------------
tcpdump on DB port              Encrypted bytes only, no readable SQL    CRITICAL
Redis port capture              No plaintext AUTH/GET commands           CRITICAL
TLS negotiation on every svc    TLS 1.2+ handshake observed              HIGH
Redis bind address              127.0.0.1 or private IP only             CRITICAL
Outbound traffic baseline       No anomalous spikes during off-hours     MEDIUM
What's not on the checklist
Junior reviewers stop at "the connection is over port 6380" or "we set sslmode=require." Senior reviewers run tcpdump and read the bytes. I've seen Redis containers configured for TLS but with the unencrypted port still open and a service quietly using it for legacy reasons. The packet capture is the only audit step that catches this.
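The leftover plaintext listener described here also shows up in socket state, so the `ss -tlnp` output is worth checking mechanically as well. A sketch that flags data-tier ports bound to every interface; it assumes the default column layout of `ss -tlnp`, and `exposed_listeners` plus the port set are illustrative choices.

```python
from typing import List, Tuple

SENSITIVE_PORTS = {5432, 6379, 6380, 5672, 5671}
WILDCARD_HOSTS = {"0.0.0.0", "*", "[::]", "::"}


def exposed_listeners(ss_output: str) -> List[Tuple[str, int]]:
    """Return (address, port) for sensitive ports listening on all interfaces."""
    exposed = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header row and anything unexpected
        host, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in SENSITIVE_PORTS and host in WILDCARD_HOSTS:
            exposed.append((host, int(port)))
    return exposed
```

A hit on 6379 next to a TLS listener on 6380 is the pattern from the paragraph above: TLS is configured, and the plaintext door is still open.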
9. Database backups
A backup that has never been restored is not a backup. An unencrypted backup uploaded to a public S3 bucket is a plaintext copy of your entire production database, sitting on the internet.
Backup process audit
# Is the backup schedule actually running?
sudo systemctl status pg_backup.timer
crontab -l | grep backup
# When did the last upload land?
aws s3 ls s3://your-backup-bucket/postgres/ --recursive | sort | tail -10
The encrypted backup workflow
Dump → compress → encrypt → upload (SSE-KMS) → checksum → securely wipe local.
#!/bin/bash
# backup_and_upload.sh
set -euo pipefail  # abort on any failed step, including inside pipelines
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/tmp/db_backup_${DATE}.sql.gz"
ENCRYPTED_FILE="${BACKUP_FILE}.enc"
# 1. Dump and compress
PGPASSWORD=$(aws secretsmanager get-secret-value --secret-id prod/db \
--query SecretString --output text | jq -r .password) \
pg_dump -h localhost -U app -d production | gzip > "$BACKUP_FILE"
# 2. Encrypt before upload (AES-256)
openssl enc -aes-256-cbc -pbkdf2 -iter 100000 -salt \
-pass env:BACKUP_ENCRYPTION_KEY \
-in "$BACKUP_FILE" -out "$ENCRYPTED_FILE"
# 3. Upload with SSE-KMS for at-rest encryption (defense in depth)
aws s3 cp "$ENCRYPTED_FILE" s3://your-backup-bucket/postgres/ \
--sse aws:kms --sse-kms-key-id alias/backup-key \
--storage-class STANDARD_IA
# 4. Publish a checksum
sha256sum "$ENCRYPTED_FILE" | aws s3 cp - s3://your-backup-bucket/postgres/${DATE}.sha256
# 5. Securely wipe local
shred -vfzu "$BACKUP_FILE" "$ENCRYPTED_FILE"
S3 bucket hardening
- Block all public access — enable every one of the four toggles.
- Versioning on — protects against accidental delete and ransomware.
- Object Lock (WORM) — immutable retention for compliance.
- Lifecycle to Glacier at 30 days, expire at 1 year.
- Access logging + CloudTrail data events — know who read backup objects.
Monthly restore drill
A backup that has never been restored is not a backup.
openssl enc -d -aes-256-cbc -pbkdf2 -iter 100000 \
-pass env:BACKUP_ENCRYPTION_KEY -in backup.sql.gz.enc | \
gunzip | psql -h test-host -U app -d restore_test
Control summary
Control                  Requirement                              Severity
-----------------------  ---------------------------------------  --------
Backup encryption        AES-256 before S3 upload                 CRITICAL
S3 SSE-KMS               Enabled with customer-managed key        HIGH
S3 public access block   All 4 settings enabled                   CRITICAL
Backup frequency         Daily full + hourly WAL for Postgres     HIGH
Restore test             Monthly drill, documented                HIGH
Backup retention         30 days hot, 1 year archive              MEDIUM
What's not on the checklist
The dangerous moment in a backup pipeline is before the file gets encrypted — the unencrypted dump sits on local disk between pg_dump and openssl enc. If that disk is captured (a snapshot, an EBS-volume clone), every other control was useless. Use a tmpfs mount or pipe directly between pg_dump | gzip | openssl enc so the plaintext never touches disk at all.
10. Dependency CVEs
OWASP A06:2021, "Vulnerable and Outdated Components", ranks sixth in the 2021 Top 10 and is a staple of real-world exploitation. The audit covers application dependencies, container images, and the operational tools (Grafana, SonarQube, etc.) that often run unattended.
Application scanning
# Node.js
npm audit --audit-level=high
npm audit fix --force
# Python
pip install safety
safety check -r requirements.txt --full-report
# Java/Maven
mvn org.owasp:dependency-check-maven:check -DfailBuildOnCVSS=7
# Containers — OS layers and app packages
trivy image --severity HIGH,CRITICAL --exit-code 1 your-registry/app:latest
Ops tooling: Grafana, SonarQube, etc.
The forgotten attack surface. Pin them in Docker Compose or Helm — **never use :latest in production**. Subscribe to grafana.com/security and sonarsource.com/security. Maintain an update playbook: staging first, validate dashboards and quality gates, promote to production within the CVE SLA window.
# What's currently running?
docker inspect grafana/grafana --format "{{.RepoTags}}"
curl -s http://localhost:9000/api/system/info | jq .version
# Scan the running tags
trivy image grafana/grafana:10.x.x --severity CRITICAL,HIGH
trivy image sonarqube:10.x-community --severity CRITICAL,HIGH
Pipeline gate
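# .github/workflows/security.yml (job fragment; nest under `jobs:`)
security-scan:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Run Trivy vulnerability scanner
      uses: aquasecurity/trivy-action@master  # pin to a release tag or commit SHA in production
      with:
        image-ref: ${{ env.IMAGE_NAME }}
        format: sarif
        output: trivy-results.sarif  # file consumed by the upload step below
        severity: HIGH,CRITICAL
        exit-code: 1
    - name: Upload to GitHub Security tab
      if: always()  # upload findings even when the scan step fails the job
      uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: trivy-results.sarif
What's not on the checklist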
A vulnerability database is only as useful as its update cadence. Trivy caches its vulnerability DB locally, and a stale cache silently misses new CVEs; pin the action to a recent SHA and add a CI job that fails if the DB is older than 7 days. Also: npm audit fix --force is a phrase that has shipped more regressions than any other single command in JavaScript history. Land the fix on a branch, run the test suite, and only then merge.
11. CI/CD pipeline security
Shifting security left turns it from a release-gate into a continuous feedback loop. Every push should trigger validation across multiple layers: SAST on code, SCA on dependencies, secret scanning on diffs, IaC misconfig analysis, container CVEs, and DAST on the deployed staging environment.
SAST
sonar-scanner \
-Dsonar.projectKey=my-app \
-Dsonar.sources=./src \
-Dsonar.host.url=https://sonar.internal \
-Dsonar.login=$SONAR_TOKEN \
-Dsonar.qualitygate.wait=true
semgrep --config=p/owasp-top-ten --config=p/secrets \
--error --json --output semgrep-results.json ./src
IaC analysis
checkov -d ./terraform --framework terraform \
--check 'CKV_AWS_*' --output junitxml > checkov-results.xml
tfsec ./terraform --minimum-severity HIGH --format junit > tfsec-results.xml
Container hardening
- Run as non-root: USER 1001 at the bottom of every Dockerfile.
- Minimal base images: distroless, alpine, or slim variants.
- Never store secrets in ENV or ARG: they're baked into image layers.
- Enable Docker Content Trust (DCT) to verify image signatures.
- Read-only filesystem where possible: --read-only in Docker, or readOnlyRootFilesystem in a Kubernetes securityContext.
docker run --rm -i hadolint/hadolint < Dockerfile
dockle --exit-code 1 --exit-level fatal your-registry/app:latest
DAST against staging
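docker run -v "$(pwd):/zap/wrk/:rw" -t ghcr.io/zaproxy/zaproxy:stable zap-full-scan.py \
-t https://staging.your-domain.com \
-r zap-report.html -I -j
Run DAST against staging only, never production, and fail the deploy pipeline on High-risk alerts (the top of ZAP's risk scale). The -v mount is what lets the container write zap-report.html back to the host.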
Gate matrix
Stage             Tool                          Type            Blocks pipeline?
----------------  ----------------------------  --------------  ----------------
Pre-commit        gitleaks                      secret scan     Yes
Build             SonarQube, Semgrep            SAST            Yes on CRITICAL
Build             npm audit / Trivy fs scan     SCA             Yes on HIGH+
Post-build        Trivy image scan              Container CVE   Yes on CRITICAL
IaC               Checkov, tfsec                Misconfig       Yes on HIGH+
Deploy staging    OWASP ZAP                     DAST            Yes on HIGH
Deploy prod       Manual security sign-off      Review gate     Yes
What's not on the checklist
Pipelines that don't block are pipelines that don't exist. I've seen "security scans" in Jenkins that produce a SARIF report and a green checkmark regardless of findings. Every scan must have a fail-threshold and exit non-zero when it trips. The job of the pipeline is to say no on your team's behalf when you forget to.
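A fail-threshold can be as small as a script that reads the scanner's SARIF report and turns findings into an exit code. A sketch, assuming the scanner writes standard SARIF 2.1.0 and reports its blocking severities at level `error`; `count_blocking`, `main`, and the `BLOCKING_LEVELS` choice are illustrative and should be tuned per tool.

```python
import json
import sys
from typing import Dict

BLOCKING_LEVELS = {"error"}  # assumption: only SARIF "error" results block


def count_blocking(sarif: Dict) -> int:
    """Count results at a blocking level across every run in a SARIF log."""
    total = 0
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # a result without an explicit level is treated as "warning" here
            if result.get("level", "warning") in BLOCKING_LEVELS:
                total += 1
    return total


def main(path: str) -> int:
    """Return the process exit code for a SARIF file: 1 blocks the pipeline."""
    with open(path, encoding="utf-8") as handle:
        blocking = count_blocking(json.load(handle))
    print(f"{blocking} blocking finding(s) in {path}")
    return 1 if blocking else 0
```

Wire it in as the last step of the scan job with `sys.exit(main("trivy-results.sarif"))`, so the report still uploads but the job goes red.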
12. Attack simulation
Security controls must be validated under realistic attack conditions. Run planned, authorized attack simulations against staging or a dedicated test environment.
> Warning — always obtain written authorization before running any load or DDoS-style test. Running them against arbitrary production without authorization violates cloud-provider AUPs and may trigger legal consequences.
HTTP load testing with k6
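# Key ID reproduced as published; confirm it against the k6 installation docs before trusting it
sudo gpg -k && sudo gpg --no-default-keyring \
--keyring /usr/share/keyrings/k6-archive-keyring.gpg --recv-keys 8C728EB71A1B79B9D8A71
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
| sudo tee /etc/apt/sources.list.d/k6.list
sudo apt update && sudo apt install k6
// k6_load_test.js: ramping load test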
import http from "k6/http";
import { check, sleep } from "k6";
export const options = {
stages: [
{ duration: "2m", target: 50 }, // ramp up
{ duration: "5m", target: 200 }, // stress
{ duration: "2m", target: 500 }, // spike
{ duration: "2m", target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ["p(95)<500"],
http_req_failed: ["rate<0.01"],
},
};
export default function () {
const res = http.get("https://staging.your-domain.com/api/health");
check(res, { "status 200": (r) => r.status === 200 });
sleep(1);
}
k6 run --out json=results.json k6_load_test.js
Rate-limit validation
for i in {1..50}; do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
https://staging.your-domain.com/api/auth/login \
-X POST -d '{"email":"test@test.com","password":"wrong"}')
echo "Request $i: HTTP $STATUS"
done
# Expected: the app's normal response for the first few requests (200, or 401 for bad credentials), then 429 Too Many Requests
Slowloris
Hold HTTP connections open with partial requests to exhaust the connection pool:
sudo apt install slowhttptest
slowhttptest -c 500 -H -g -o slowloris_report -i 10 -r 200 -t GET \
-u https://staging.your-domain.com -x 24 -p 3
Mitigation in NGINX:
client_body_timeout 10s;
client_header_timeout 10s;
keepalive_timeout 5s 5s;
send_timeout 10s;
AWS Shield + WAF validation
aws wafv2 list-web-acls --scope REGIONAL --output table
aws wafv2 get-web-acl-for-resource --resource-arn <alb-arn>
aws cloudwatch get-metric-statistics \
--namespace AWS/WAFV2 --metric-name BlockedRequests \
--dimensions Name=WebACL,Value=<acl-name> Name=Region,Value=ap-southeast-1 \
--start-time 2026-05-24T00:00:00Z --end-time 2026-05-25T00:00:00Z \
--period 3600 --statistics Sum --output table
Pass criteria
Test scenario              Expected behaviour                            Pass indicator
-------------------------  --------------------------------------------  --------------------------------
k6 load test (200 VUs)     P95 latency < 500ms; error rate < 1%          k6 threshold pass
Rate limit (login)         429 after 5 requests per minute               HTTP 429 observed
Slowloris (500 conns)      NGINX closes idle connections within 10s      No service degradation
DDoS simulation            Shield/WAF auto-blocks volumetric traffic     WAF BlockedRequests increases
Spike test (500 VUs)       Auto-scaling triggers; no 5xx errors          CloudWatch EC2 scaling activity
Post-attack recovery       Returns to baseline within 2 minutes          Latency returns to normal
What's not on the checklist
Run the simulation with the on-call rotation alerted. The point of an attack drill is partly to test the human response — Did PagerDuty fire? Did the right person acknowledge? Did the runbook actually exist? A perfectly-blocked attack with no human in the loop is also a finding: it means your team can't tell when something's wrong.
The audit dashboard
After completing all twelve sections, I summarize the posture in a single table that goes to the engineering leadership:
#   Domain                  Key risk area                       Status
--  ----------------------  ----------------------------------  ----------
01  Security Groups         Port exposure to 0.0.0.0/0          REVIEW
02  UFW / iptables          Host firewall default policies      REVIEW
03  NGINX config            Rate limiting, security headers     REVIEW
04  SSL/TLS certs           Expired certs, weak ciphers         REVIEW
05  Secrets management      Hardcoded secrets, weak passwords   REVIEW
06  VPC architecture        Private subnet IGW exposure         REVIEW
07  IAM policies            Wildcard permissions, no MFA        REVIEW
08  Encrypted transfer      Plaintext DB/Redis connections      REVIEW
09  Database backups        Unencrypted S3 uploads              REVIEW
10  Dependency CVEs         Unpatched third-party packages      REVIEW
11  CI/CD security          Missing SAST/DAST/IaC gates         REVIEW
12  Attack simulation       Rate limit & DDoS response          REVIEW
REVIEW is the starting state. After remediation, every row should read PASS or have an open ticket with an owner and a date.
Recommended toolchain
Category             Tool                 Purpose                                  License
-------------------  -------------------  ---------------------------------------  --------------------
Secret scanning      gitleaks             Pre-commit & CI secret detection         Open Source
Container scanning   Trivy                CVE scanning for images & filesystems    Open Source
SAST                 SonarQube            Code quality + security analysis         Community/Commercial
SAST                 Semgrep              Custom rule-based code scanning          Open Source / Pro
IaC scanning         Checkov              Terraform/K8s misconfiguration           Open Source
Load testing         k6                   HTTP performance and stress testing      Open Source
DAST                 OWASP ZAP            Dynamic web app security scanner         Open Source
SSL testing          testssl.sh           TLS/SSL cipher and cert audit            Open Source
Network capture      tcpdump / Wireshark  Packet analysis for transit validation   Open Source
Cloud audit          AWS Config + IAM AA  Continuous compliance monitoring         AWS Native
Closing
The audit is not the deliverable. The deliverable is the set of tickets the audit produces, each with an owner, a severity, an SLA, and a verification step. A 50-page PDF that nobody reads buys you nothing; ten well-scoped Jira tickets with named owners buy you a measurable improvement in posture by next quarter.
A few things I've learned over the years of running this loop:
- Severity ratings are a forcing function, not a description. Calling something CRITICAL with a 24-hour SLA changes how the team allocates time. Calling it HIGH does not.
- Verifying the control matters more than configuring it. tcpdump on every port, not just trust in the config.
- The pipeline is the audit you run every day. Everything you can move into a CI gate stops being a quarterly worry.
- Run the attack simulation. The first time you do it, you'll find one thing that genuinely surprises you — and you'll be glad you found it before someone else did.
If you remember nothing else, remember the order: outside-in. Security Groups before IAM. IAM before backups. Backups before attack simulation. The map matches the attacker's path, because the attacker is the customer of your security work.