Infrastructure Testing Strategies
Prerequisites
- Infrastructure as Code experience (Terraform or Pulumi)
- Basic programming knowledge (Go or Python preferred)
- Understanding of testing concepts (unit, integration, e2e)
- Familiarity with CI/CD pipelines
Introduction
"It worked in dev" is the infrastructure equivalent of "it works on my machine." A missing security group rule, an unencrypted database, or a load balancer pointing to the wrong subnet can take down production. Manual testing catches some issues, but it's slow, inconsistent, and doesn't scale.
Infrastructure testing automates validation, catches errors before they reach production, and enables teams to deploy with confidence. This guide covers practical testing strategies from basic linting to chaos engineering, with real code examples you can implement today.
Why Test Infrastructure?
The Cost of Failures
Scenario Without Tests:
Developer: Write Terraform for new database
↓
PR Review: Looks reasonable
↓
Apply to Production
↓
Database has public IP + default password
↓
Security breach discovered in audit
↓
Incident response, customer notification, compliance review
Total cost: $50K+ in time, reputation damage, potential fines
Scenario With Tests:
Developer: Write Terraform for new database
↓
CI runs automated tests (2 minutes)
↓
Test FAILS: "Database cannot have public IP"
↓
Fix before merge
Total cost: 2 minutes of developer time
Real Production Incidents Prevented by Testing
- Public Database Exposure - Test caught publicly_accessible = true on RDS instance
- Unencrypted Storage - Security scan flagged S3 bucket without encryption
- Missing Backups - Integration test verified backup_retention_period >= 7
- Wrong Subnet CIDR - Unit test caught overlapping CIDR blocks
- Excessive IAM Permissions - Policy test blocked Action: * on Resource: *
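To make the last item concrete, here is a minimal sketch of a check for wildcard IAM grants. It assumes the policy is available as a standard IAM JSON document; the function names are hypothetical and not part of any tool mentioned in this guide.

```python
import json


def _as_list(value):
    """IAM accepts either a single value or a list for Statement, Action and Resource."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy_json):
    """Return the Allow statements that grant every action on every resource."""
    policy = json.loads(policy_json)
    flagged = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(statement)
    return flagged
```

A CI step could fail the build whenever this function returns a non-empty list for any policy document in the change.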
The Testing Pyramid

The testing pyramid originated in the early 2000s from agile and Extreme Programming practices and was popularized by Mike Cohn as a way to guide teams toward fast, reliable feedback. It describes a layered approach where most tests live at the lowest levels—static analysis, unit, and policy checks—because they are cheap and quick, while progressively fewer tests exist at higher levels like integration and end-to-end testing due to their cost and execution time. In infrastructure testing strategies, the pyramid reinforces the idea of catching failures as early as possible, minimizing slow and expensive E2E runs while still maintaining confidence through targeted integration and system-level tests.
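One way to picture the pyramid in a pipeline is a runner that executes stages from cheapest to most expensive and stops at the first failure, so a lint error never triggers a slow E2E run. The sketch below is illustrative only; `run_pyramid` and the stage names are hypothetical.

```python
def run_pyramid(stages):
    """Run (name, check) pairs in order and stop at the first failing stage.

    Returns a tuple: (names of stages that passed, name of the failed stage or None).
    """
    passed = []
    for name, check in stages:
        if not check():
            # Fail fast: the more expensive stages after this one are skipped.
            return passed, name
        passed.append(name)
    return passed, None
```

Ordering the list as static analysis, unit, policy, integration, E2E gives the cheapest feedback first.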
Level 1: Static Analysis
Catch obvious errors instantly, before any resources are created.
Terraform Validate
# Basic validation
terraform init
terraform validate
# Check all modules
find . -type f -name "*.tf" -exec dirname {} \; | sort -u | while read dir; do
echo "Validating $dir"
(cd "$dir" && terraform init -backend=false && terraform validate)
done
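If you would rather drive validation from a script, the directory discovery in the loop above can be sketched in Python. This version deliberately differs from the shell loop in one way: it skips anything under a `.terraform` directory, which is an assumption about what you want validated. The function name is hypothetical.

```python
from pathlib import Path


def find_module_dirs(root):
    """Return the sorted, unique directories under root that contain .tf files.

    Directories inside .terraform (downloaded modules and providers) are skipped.
    """
    root = Path(root)
    dirs = {
        path.parent
        for path in root.rglob("*.tf")
        if path.is_file() and ".terraform" not in path.relative_to(root).parts
    }
    return sorted(dirs)
```

Each returned directory can then be passed to `terraform init -backend=false` and `terraform validate`, for example through `subprocess.run(..., cwd=directory)`.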
Format Checking
# Check formatting
terraform fmt -check -recursive
# Auto-fix formatting
terraform fmt -recursive
TFLint - Advanced Linting
Install:
brew install tflint
# or
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash
.tflint.hcl:
plugin "aws" {
enabled = true
version = "0.29.0"
source = "github.com/terraform-linters/tflint-ruleset-aws"
}
rule "terraform_deprecated_interpolation" {
enabled = true
}
rule "terraform_documented_variables" {
enabled = true
}
rule "terraform_naming_convention" {
enabled = true
format = "snake_case"
}
rule "aws_instance_invalid_type" {
enabled = true
}
rule "aws_db_instance_backup_retention_period_insufficient" {
enabled = true
}
rule "aws_s3_bucket_versioning_disabled" {
enabled = true
}
Run:
tflint --init
tflint --recursive
TFSec - Security Scanning
# Install
brew install tfsec
# Scan current directory
tfsec .
# Scan with minimum severity
tfsec . --minimum-severity HIGH
# Output as JSON
tfsec . --format json > tfsec-results.json
Example output:
Result #1 CRITICAL S3 bucket does not have encryption enabled
main.tf:23
Resource: aws_s3_bucket.data
More Info: https://tfsec.dev/docs/aws/s3/enable-bucket-encryption/
Result #2 HIGH Security group allows ingress from 0.0.0.0/0 to port 22
main.tf:67
Resource: aws_security_group.web
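The JSON report is easier to post-process than the text output. The sketch below assumes the report is an object with a `results` array whose entries carry a `severity` field; treat those field names as an assumption and check them against the output of your tfsec version.

```python
import json

# Ordered from least to most severe.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def findings_at_or_above(report_json, minimum="HIGH"):
    """Return the findings whose severity is at least `minimum`."""
    threshold = SEVERITY_ORDER.index(minimum)
    # "results" may be missing or null when nothing was found.
    results = json.loads(report_json).get("results") or []
    return [
        finding
        for finding in results
        if finding.get("severity") in SEVERITY_ORDER
        and SEVERITY_ORDER.index(finding["severity"]) >= threshold
    ]
```

A wrapper script could read tfsec-results.json, call this function, and exit non-zero when the returned list is not empty.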
Checkov - Compliance Scanning
# Install
pip install checkov
# Scan Terraform
checkov -d terraform/
# Scan specific frameworks
checkov -d . --framework terraform cloudformation kubernetes
# Skip specific checks
checkov -d . --skip-check CKV_AWS_20,CKV_AWS_21
# Output as JUnit XML (for CI)
checkov -d . -o junitxml > checkov-report.xml
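Most CI systems render JUnit XML natively, but a short script can also summarize failures. This sketch relies only on the generic JUnit layout (`testcase` elements with a `failure` or `error` child) and makes no assumption about how Checkov names its test cases.

```python
import xml.etree.ElementTree as ET


def failed_testcases(junit_xml):
    """Return the names of test cases that contain a <failure> or <error> element."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(case.get("name", ""))
    return failed
```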
Custom policy:
# custom_policies/require_tags.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories


class RequireEnvironmentTag(BaseResourceCheck):
    def __init__(self):
        name = "Ensure all resources have Environment tag"
        id = "CKV_CUSTOM_1"
        supported_resources = ['*']
        categories = [CheckCategories.CONVENTION]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Attribute values arrive wrapped in a list, so tags looks like [{...}]
        tags = (conf.get('tags') or [{}])[0]
        # tags may be an unresolved expression (a string) rather than a dict
        if isinstance(tags, dict) and 'Environment' in tags:
            return CheckResult.PASSED
        return CheckResult.FAILED


check = RequireEnvironmentTag()
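The tag logic can be exercised without installing Checkov by pulling it into a plain function. This is a sketch under one assumption, implied by the `[0]` in the policy: each attribute value in `conf` is wrapped in a list. The function name is hypothetical.

```python
def has_environment_tag(conf):
    """Return True when the parsed resource config carries an Environment tag."""
    tags = (conf.get("tags") or [{}])[0]
    # An unresolved expression such as "${var.tags}" is a string, not a dict.
    return isinstance(tags, dict) and "Environment" in tags
```

Testing the function with a tagged resource, an untagged one, and an unresolved expression covers the cases the policy will see most often.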
Pre-commit Hooks
Automate all static checks before commit:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
        args:
          - --hook-config=--path-to-file=README.md
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl
      - id: terraform_tfsec
        args:
          - --args=--minimum-severity=HIGH
      - id: terraform_checkov
Setup:
pip install pre-commit
pre-commit install
pre-commit run --all-files
Level 2: Unit Testing with Terratest
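Test each module in isolation: deploy it to a sandbox account, assert on its outputs, then destroy it. Terratest applies real resources, so these tests need cloud credentials and take minutes rather than seconds.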
Setup
# Create test directory
mkdir test && cd test
# Initialize Go module
go mod init github.com/company/infrastructure-tests
# Install Terratest
go get github.com/gruntwork-io/terratest/modules/terraform
go get github.com/stretchr/testify/assert
Basic Module Test
// test/vpc_test.go
package test
import (
"testing"
"fmt"
"strings"
"github.com/gruntwork-io/terratest/modules/random"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/stretchr/testify/assert"
)
func TestVPCModule(t *testing.T) {
t.Parallel()
// Generate unique name
uniqueID := strings.ToLower(random.UniqueId())
vpcName := fmt.Sprintf("test-vpc-%s", uniqueID)
terraformOptions := &terraform.Options{
// Path to Terraform code
TerraformDir: "../modules/vpc",
// Variables to pass
Vars: map[string]interface{}{
"vpc_name": vpcName,
"cidr_block": "10.0.0.0/16",
"availability_zones": []string{"us-east-1a", "us-east-1b"},
},
// Environment variables
EnvVars: map[string]string{
"AWS_DEFAULT_REGION": "us-east-1",
},
}
// Clean up resources after test
defer terraform.Destroy(t, terraformOptions)
// Run terraform init and apply
terraform.InitAndApply(t, terraformOptions)
// Validate outputs
vpcID := terraform.Output(t, terraformOptions, "vpc_id")
assert.NotEmpty(t, vpcID)
subnetIDs := terraform.OutputList(t, terraformOptions, "subnet_ids")
assert.Equal(t, 2, len(subnetIDs), "Should create 2 subnets")
vpcCIDR := terraform.Output(t, terraformOptions, "vpc_cidr")
assert.Equal(t, "10.0.0.0/16", vpcCIDR)
}
Run test:
go test -v -timeout 30m
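Some module logic can be checked with no cloud resources at all. The overlapping-CIDR case mentioned earlier is a good example: the sketch below validates a subnet layout with Python's standard `ipaddress` module. The function name and message wording are hypothetical.

```python
import ipaddress


def validate_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems with the subnet layout; an empty list means valid."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must sit inside the VPC range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside VPC range {vpc}")
    # No two subnets may share addresses.
    for index, first in enumerate(subnets):
        for second in subnets[index + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems
```

A check like this runs in milliseconds, so it can sit in the pre-commit stage alongside the linters.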
Testing with AWS SDK Validation
// test/s3_bucket_test.go
package test
import (
"fmt"
"strings"
"testing"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/service/s3"
awstest "github.com/gruntwork-io/terratest/modules/aws"
"github.com/gruntwork-io/terratest/modules/random"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/stretchr/testify/assert"
)
func TestS3BucketSecurity(t *testing.T) {
t.Parallel()
// S3 bucket names must be lowercase, and random.UniqueId() can return uppercase characters
bucketName := fmt.Sprintf("test-bucket-%s", strings.ToLower(random.UniqueId()))
awsRegion := "us-east-1"
terraformOptions := &terraform.Options{
TerraformDir: "../modules/s3-bucket",
Vars: map[string]interface{}{
"bucket_name": bucketName,
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Create S3 client
s3Client := awstest.NewS3Client(t, awsRegion)
// Test encryption
encryption, err := s3Client.GetBucketEncryption(&s3.GetBucketEncryptionInput{
Bucket: aws.String(bucketName),
})
assert.NoError(t, err)
assert.NotNil(t, encryption.ServerSideEncryptionConfiguration)
// Test versioning
versioning, err := s3Client.GetBucketVersioning(&s3.GetBucketVersioningInput{
Bucket: aws.String(bucketName),
})
assert.NoError(t, err)
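// aws.StringValue returns "" for a nil pointer, so an unversioned bucket fails the assertion instead of panicking
assert.Equal(t, "Enabled", aws.StringValue(versioning.Status))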
// Test public access block
publicAccess, err := s3Client.GetPublicAccessBlock(&s3.GetPublicAccessBlockInput{
Bucket: aws.String(bucketName),
})
assert.NoError(t, err)
assert.True(t, *publicAccess.PublicAccessBlockConfiguration.BlockPublicAcls)
assert.True(t, *publicAccess.PublicAccessBlockConfiguration.BlockPublicPolicy)
}
Testing Multiple Scenarios
func TestDatabaseConfigurations(t *testing.T) {
testCases := []struct {
name string
instanceClass string
storageSize int
multiAZ bool
expectedCost float64
}{
{"Small", "db.t3.micro", 20, false, 15.0},
{"Medium", "db.t3.small", 100, false, 45.0},
{"Production", "db.r5.large", 500, true, 450.0},
}
for _, tc := range testCases {
tc := tc // Capture range variable
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
terraformOptions := &terraform.Options{
TerraformDir: "../modules/rds",
Vars: map[string]interface{}{
"identifier": fmt.Sprintf("test-db-%s", random.UniqueId()),
"instance_class": tc.instanceClass,
"allocated_storage": tc.storageSize,
"multi_az": tc.multiAZ,
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Verify configuration
dbIdentifier := terraform.Output(t, terraformOptions, "db_identifier")
assert.NotEmpty(t, dbIdentifier)
// Add cost validation here
})
}
}
Level 3: Policy as Code Testing
Enforce organizational standards automatically.
Open Policy Agent (OPA)
Policy file:
# policies/terraform.rego
package terraform
import input as tfplan
# Deny S3 buckets without encryption
deny[msg] {
resource := tfplan.resource_changes[_]
resource.type == "aws_s3_bucket"
not has_encryption(resource)
msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.name])
}
has_encryption(resource) {
config := resource.change.after
config.server_side_encryption_configuration
}
# Deny databases without backup retention
deny[msg] {
resource := tfplan.resource_changes[_]
resource.type == "aws_db_instance"
resource.change.after.backup_retention_period < 7
msg := sprintf("RDS '%s' must have >= 7 days backup retention", [resource.name])
}
# Deny public SSH access
deny[msg] {
resource := tfplan.resource_changes[_]
resource.type == "aws_security_group"
rule := resource.change.after.ingress[_]
rule.cidr_blocks[_] == "0.0.0.0/0"
rule.from_port <= 22
rule.to_port >= 22
msg := sprintf("Security group '%s' allows public SSH access", [resource.name])
}
# Warn about expensive instances in non-prod
warn[msg] {
resource := tfplan.resource_changes[_]
resource.type == "aws_instance"
expensive := ["m5.8xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge"]
resource.change.after.instance_type == expensive[_]
tags := resource.change.after.tags
tags.Environment != "production"
msg := sprintf("Instance '%s' uses expensive type in non-prod", [resource.name])
}
Test policy:
# Generate plan
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
# Test with OPA
opa exec --decision terraform/deny --bundle policies/ tfplan.json
# Test in CI/CD (exits non-zero when at least one deny message is produced)
opa eval --fail-defined --data policies/ --input tfplan.json "data.terraform.deny[x]"
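If Rego is new to you, it can help to see two of the deny rules written imperatively. The sketch below walks the same `resource_changes` array from the plan JSON and mirrors the backup retention and public SSH rules. It is a reading aid, not a replacement for OPA, and the function name is hypothetical.

```python
def deny(plan):
    """Return deny messages for a Terraform plan parsed from `terraform show -json`."""
    messages = []
    for resource in plan.get("resource_changes", []):
        after = (resource.get("change") or {}).get("after") or {}
        name = resource.get("name")

        if resource.get("type") == "aws_db_instance":
            retention = after.get("backup_retention_period")
            # As in the Rego rule, a missing value produces no message.
            if isinstance(retention, (int, float)) and retention < 7:
                messages.append(f"RDS '{name}' must have >= 7 days backup retention")

        if resource.get("type") == "aws_security_group":
            for rule in after.get("ingress") or []:
                from_port, to_port = rule.get("from_port"), rule.get("to_port")
                if from_port is None or to_port is None:
                    continue
                public = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
                if public and from_port <= 22 <= to_port:
                    messages.append(f"Security group '{name}' allows public SSH access")
                    break  # one message per security group is enough
    return messages
```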
Conftest for Kubernetes
# policy/deployment.rego
package main
deny[msg] {
input.kind == "Deployment"
not input.spec.template.spec.securityContext.runAsNonRoot
msg = "Containers must not run as root"
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.resources.limits.cpu
msg = sprintf("Container %v must have CPU limits", [container.name])
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.resources.limits.memory
msg = sprintf("Container %v must have memory limits", [container.name])
}
Test manifests:
conftest test deployment.yaml
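The three Deployment rules read as follows when written imperatively. This sketch assumes the manifest has already been parsed into a dictionary (for example with a YAML library, which is not shown here); the function name is hypothetical.

```python
def deployment_violations(manifest):
    """Return policy violations for a parsed Kubernetes manifest."""
    if manifest.get("kind") != "Deployment":
        return []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    violations = []
    # Pod-level securityContext must set runAsNonRoot to true.
    if not (pod_spec.get("securityContext") or {}).get("runAsNonRoot"):
        violations.append("Containers must not run as root")
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        for key, label in (("cpu", "CPU"), ("memory", "memory")):
            if not limits.get(key):
                violations.append(f"Container {container.get('name')} must have {label} limits")
    return violations
```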
Level 4: Integration Testing
Test how resources work together.
Database Connectivity Test
func TestApplicationDatabaseConnection(t *testing.T) {
t.Parallel()
terraformOptions := &terraform.Options{
TerraformDir: "../examples/app-with-database",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Get database connection details
dbEndpoint := terraform.Output(t, terraformOptions, "db_endpoint")
dbUsername := terraform.Output(t, terraformOptions, "db_username")
dbPassword := terraform.Output(t, terraformOptions, "db_password")
dbName := terraform.Output(t, terraformOptions, "db_name")
// Test direct database connection
connStr := fmt.Sprintf(
"host=%s user=%s password=%s dbname=%s sslmode=require",
dbEndpoint, dbUsername, dbPassword, dbName,
)
db, err := sql.Open("postgres", connStr)
assert.NoError(t, err)
defer db.Close()
err = db.Ping()
assert.NoError(t, err, "Should connect to database")
// Test application can connect
appURL := terraform.Output(t, terraformOptions, "app_url")
maxRetries := 10
for i := 0; i < maxRetries; i++ {
resp, err := http.Get(fmt.Sprintf("%s/health", appURL))
if err == nil {
body, readErr := io.ReadAll(resp.Body)
resp.Body.Close() // Close on every attempt so retries do not leak connections
if readErr == nil && resp.StatusCode == http.StatusOK {
assert.Contains(t, string(body), "database_connected")
return
}
}
time.Sleep(10 * time.Second)
}
t.Fatal("Application failed to connect to database")
}
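The retry loop in this test is a pattern worth extracting. Below is a minimal, generic sketch; the `retry` name and its parameters are hypothetical, and the `sleep` argument exists only so the helper can be tested without waiting.

```python
import time


def retry(operation, attempts=10, delay_seconds=10.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Returns the operation's result, or raises RuntimeError chained to the
    last exception once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

Freshly created infrastructure often needs a minute or two before DNS, health checks, and database connections settle, so a bounded retry avoids both false failures and tests that hang forever.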
Network Connectivity Test
func TestPrivateSubnetConnectivity(t *testing.T) {
t.Parallel()
terraformOptions := &terraform.Options{
TerraformDir: "../modules/networking",
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Get bastion details
bastionIP := terraform.Output(t, terraformOptions, "bastion_public_ip")
privateIP := terraform.Output(t, terraformOptions, "private_instance_ip")
keyPair := terraform.Output(t, terraformOptions, "key_pair_name")
// SSH to bastion
bastionHost := ssh.Host{
Hostname: bastionIP,
SshUserName: "ec2-user",
SshKeyPair: loadKeyPair(t, keyPair),
}
// Test connection to private instance through bastion
command := fmt.Sprintf("ssh -o StrictHostKeyChecking=no ec2-user@%s echo connected", privateIP)
output, err := ssh.CheckSshCommandE(t, bastionHost, command)
assert.NoError(t, err)
assert.Contains(t, output, "connected")
}
Level 5: End-to-End Testing
Test complete infrastructure deployments.
func TestFullEnvironment(t *testing.T) {
// Skip for short tests
if testing.Short() {
t.Skip("Skipping E2E test")
}
terraformOptions := &terraform.Options{
TerraformDir: "../environments/staging",
Vars: map[string]interface{}{
"environment": "e2e-test",
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Get application URL
appURL := terraform.Output(t, terraformOptions, "app_url")
// Wait for application
http_helper.HttpGetWithRetry(
t,
fmt.Sprintf("%s/health", appURL),
nil,
200,
"healthy",
30,
10*time.Second,
)
// Test main features
t.Run("Homepage", func(t *testing.T) {
status, body := http_helper.HttpGet(t, appURL, nil)
assert.Equal(t, 200, status)
assert.Contains(t, body, "<title>")
})
t.Run("API", func(t *testing.T) {
status, body := http_helper.HttpGet(t, fmt.Sprintf("%s/api/status", appURL), nil)
assert.Equal(t, 200, status)
assert.Contains(t, body, `"status":"ok"`)
})
t.Run("Database", func(t *testing.T) {
status, body := http_helper.HttpGet(t, fmt.Sprintf("%s/api/db-check", appURL), nil)
assert.Equal(t, 200, status)
assert.Contains(t, body, `"connected":true`)
})
}
CI/CD Integration
GitHub Actions
# .github/workflows/terraform-test.yml
name: Terraform Tests
on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]
jobs:
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - name: Terraform Format
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        run: |
          terraform init -backend=false
          terraform validate
      - name: TFLint
        uses: terraform-linters/setup-tflint@v3
      # tflint --init installs the plugins declared in .tflint.hcl
      - run: |
          tflint --init
          tflint --recursive
      - name: TFSec
        uses: aquasecurity/[email protected]
      - name: Checkov
        uses: bridgecrewio/checkov-action@master
        with:
          directory: terraform/
  unit-tests:
    runs-on: ubuntu-latest
    needs: static-analysis
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.21'
      - name: Run Tests
        working-directory: test
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: go test -v -timeout 30m -parallel 5
  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.21'
      - name: Run Integration Tests
        working-directory: test
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: go test -v -timeout 60m -tags=integration
Cost Management
Ephemeral Test Environments
type TestEnvironment struct {
Name string
CreatedAt time.Time
MaxDuration time.Duration
}
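// scheduleCleanup destroys the environment if the test is still running once
// CreatedAt + MaxDuration has passed. The timer is cancelled when the test ends,
// because using a *testing.T after its test has finished causes a panic.
func (env *TestEnvironment) scheduleCleanup(t *testing.T, opts *terraform.Options) {
deadline := env.CreatedAt.Add(env.MaxDuration)
timer := time.AfterFunc(time.Until(deadline), func() {
// DestroyE returns the error instead of calling t.Fatal from this goroutine
if _, err := terraform.DestroyE(t, opts); err != nil {
t.Errorf("cleanup of %s failed: %v", env.Name, err)
}
})
t.Cleanup(func() { timer.Stop() })
}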
Resource Tagging
locals {
test_tags = {
Environment = "test"
TestID = var.test_id
CreatedAt = timestamp()
AutoDelete = "true"
}
}
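Tags like these only save money if something acts on them. The sketch below shows the decision logic a scheduled cleanup job might apply. It assumes `CreatedAt` holds the UTC RFC 3339 string that Terraform's `timestamp()` produces; the function name and the maximum-age parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_expired(tags, max_age, now=None):
    """Return True when a resource is marked AutoDelete and is older than max_age."""
    if tags.get("AutoDelete") != "true" or "CreatedAt" not in tags:
        return False
    now = now or datetime.now(timezone.utc)
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
    created = datetime.fromisoformat(tags["CreatedAt"].replace("Z", "+00:00"))
    return now - created >= max_age
```

The job would list tagged resources through the cloud provider's API, pass each tag map to this function, and delete whatever it flags.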
Best Practices
- Use the test pyramid - More fast tests, fewer slow tests
- Run tests in parallel - Use t.Parallel() in Go tests
- Always clean up resources - Use defer terraform.Destroy()
- Randomize names - Avoid test collisions
- Add retry logic - Network operations can be flaky
- Document tests - Explain what's being tested and why
- Keep tests fast - Slow tests don't get run
- Test in isolation - Each test should be independent
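As a small illustration of the "randomize names" practice, here is a sketch of a name generator that sticks to lowercase letters and digits, since some resources (S3 buckets, for example) reject uppercase characters. The function name and default suffix length are hypothetical.

```python
import secrets
import string

# Lowercase letters and digits are accepted by most cloud naming rules.
ALPHABET = string.ascii_lowercase + string.digits


def unique_name(prefix, length=6):
    """Return prefix plus a random lowercase suffix, e.g. for parallel test runs."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{prefix}-{suffix}"
```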
Conclusion
Infrastructure testing transforms infrastructure management from hope-driven to evidence-based. By implementing testing at multiple levels, teams catch errors early when they're cheap to fix.
Start Here:
- Add pre-commit hooks for static analysis (today)
- Write one Terratest for your most critical module (this week)
- Add tests to CI/CD pipeline (this sprint)
- Measure test coverage and add integration tests (next month)
Key Metrics (illustrative targets; measure your own baseline first):
- Time to detect infrastructure bugs: from days → minutes
- Production incidents from infrastructure: from 10/month → 1/month
- Deployment confidence: from 60% → 95%
- Time to deploy: from hours → minutes
The best infrastructure tests are the ones that run automatically and catch bugs before production. Start simple, measure impact, and expand coverage incrementally.
Additional Resources
- Terratest Documentation
- Open Policy Agent Documentation
- TFSec Documentation
- pre-commit-terraform Hooks