Infrastructure Testing Strategies

Introduction

"It worked in dev" is the infrastructure equivalent of "it works on my machine." A missing security group rule, an unencrypted database, or a load balancer pointing to the wrong subnet can take down production. Manual testing catches some issues, but it's slow, inconsistent, and doesn't scale.

Infrastructure testing automates validation, catches errors before they reach production, and enables teams to deploy with confidence. This guide covers practical testing strategies from basic linting to chaos engineering, with real code examples you can implement today.

Why Test Infrastructure?

The Cost of Failures

Scenario Without Tests:

Developer: Write Terraform for new database
  ↓
PR Review: Looks reasonable  
  ↓
Apply to Production
  ↓  
Database has public IP + default password
  ↓
Security breach discovered in audit
  ↓
Incident response, customer notification, compliance review
Total cost: $50K+ in time, reputation damage, potential fines

Scenario With Tests:

Developer: Write Terraform for new database
  ↓
CI runs automated tests (2 minutes)
  ↓
Test FAILS: "Database cannot have public IP"
  ↓
Fix before merge
Total cost: 2 minutes of developer time

Real Production Incidents Prevented by Testing

Public Database Exposure - Test caught publicly_accessible = true on RDS instance
Unencrypted Storage - Security scan flagged S3 bucket without encryption
Missing Backups - Integration test verified backup_retention_period >= 7
Wrong Subnet CIDR - Unit test caught overlapping CIDR blocks
Excessive IAM Permissions - Policy test blocked Action: * on Resource: *

The Testing Pyramid

The testing pyramid grew out of agile and Extreme Programming practice in the early 2000s and was popularized by Mike Cohn as a guide to fast, reliable feedback. Most tests belong at the lowest levels (static analysis, unit, and policy checks) because they are cheap and quick. Fewer tests belong at the higher levels (integration and end-to-end) because they cost more and take longer to run. For infrastructure the goal is the same: catch failures as early as possible and keep slow, expensive E2E runs to a minimum, while targeted integration and system-level tests preserve confidence.

Level 1: Static Analysis

Catch obvious errors instantly, before any resources are created.

Terraform Validate

# Basic validation
terraform init
terraform validate

# Check all modules (skip downloaded modules under .terraform)
find . -type f -name "*.tf" -not -path "*/.terraform/*" -exec dirname {} \; | sort -u | while read -r dir; do
  echo "Validating $dir"
  (cd "$dir" && terraform init -backend=false && terraform validate)
done

Format Checking

# Check formatting
terraform fmt -check -recursive

# Auto-fix formatting
terraform fmt -recursive

TFLint - Advanced Linting

Install:

brew install tflint
# or
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash

.tflint.hcl:

plugin "aws" {
  enabled = true
  version = "0.29.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_deprecated_interpolation" {
  enabled = true
}

rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}

rule "aws_instance_invalid_type" {
  enabled = true
}

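# The two rules below must exist in the installed ruleset version.
# TFLint rejects unknown rule names, so check them against the
# tflint-ruleset-aws rule list before enabling.
rule "aws_db_instance_backup_retention_period_insufficient" {
  enabled = true
}

rule "aws_s3_bucket_versioning_disabled" {
  enabled = true
}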

Run:

tflint --init
tflint --recursive

TFSec - Security Scanning

# Install
brew install tfsec

# Scan current directory
tfsec .

# Scan with minimum severity
tfsec . --minimum-severity HIGH

# Output as JSON
tfsec . --format json > tfsec-results.json

Example output:

Result #1 CRITICAL S3 bucket does not have encryption enabled
  main.tf:23
  Resource: aws_s3_bucket.data
  
  More Info: https://tfsec.dev/docs/aws/s3/enable-bucket-encryption/
  
Result #2 HIGH Security group allows ingress from 0.0.0.0/0 to port 22
  main.tf:67
  Resource: aws_security_group.web
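
The JSON report can also drive a custom gate, for example failing a build only on HIGH and CRITICAL findings while still recording the rest. The following is a minimal Go sketch, not part of tfsec itself. The field names (results, rule_id, severity, resource) are assumed from tfsec's JSON output and should be checked against the version you run, and the sample report is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finding holds the fields this sketch reads from one tfsec result.
// The JSON field names are an assumption about tfsec's report format.
type finding struct {
	RuleID   string `json:"rule_id"`
	Severity string `json:"severity"`
	Resource string `json:"resource"`
}

// blockingFindings parses a tfsec JSON report and returns the findings
// whose severity is HIGH or CRITICAL.
func blockingFindings(raw []byte) ([]finding, error) {
	var report struct {
		Results []finding `json:"results"`
	}
	if err := json.Unmarshal(raw, &report); err != nil {
		return nil, fmt.Errorf("parse tfsec report: %w", err)
	}
	var blocking []finding
	for _, f := range report.Results {
		if f.Severity == "HIGH" || f.Severity == "CRITICAL" {
			blocking = append(blocking, f)
		}
	}
	return blocking, nil
}

// sampleReport is illustrative data, not real tfsec output.
const sampleReport = `{"results": [
  {"rule_id": "example-1", "severity": "CRITICAL", "resource": "aws_s3_bucket.data"},
  {"rule_id": "example-2", "severity": "LOW", "resource": "aws_instance.web"}
]}`

func main() {
	blocking, err := blockingFindings([]byte(sampleReport))
	if err != nil {
		panic(err)
	}
	for _, f := range blocking {
		fmt.Printf("%s %s (%s)\n", f.Severity, f.Resource, f.RuleID)
	}
	fmt.Printf("%d blocking finding(s)\n", len(blocking))
}
```

In a pipeline, the sample constant would be replaced by the contents of tfsec-results.json, and a non-empty result would map to a non-zero exit code.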

Checkov - Compliance Scanning

# Install
pip install checkov

# Scan Terraform
checkov -d terraform/

# Scan specific frameworks
checkov -d . --framework terraform cloudformation kubernetes

# Skip specific checks
checkov -d . --skip-check CKV_AWS_20,CKV_AWS_21

# Output as JUnit XML (for CI)
checkov -d . -o junitxml > checkov-report.xml

Custom policy:

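# custom_policies/require_tags.py
# Load with: checkov -d . --external-checks-dir custom_policies
# (the directory also needs an empty __init__.py)
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

class RequireEnvironmentTag(BaseResourceCheck):
    def __init__(self):
        name = "Ensure all resources have Environment tag"
        id = "CKV_CUSTOM_1"
        # '*' applies the check to every resource type, including types that
        # do not support tags; narrow the list if that produces noise
        supported_resources = ['*']
        categories = [CheckCategories.CONVENTION]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Checkov wraps attribute values in a list. tags may be missing, empty,
        # or an unresolved expression (a string), so guard before the lookup.
        tags = (conf.get('tags') or [{}])[0]
        if isinstance(tags, dict) and 'Environment' in tags:
            return CheckResult.PASSED
        return CheckResult.FAILED

check = RequireEnvironmentTag()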

Pre-commit Hooks

Automate all static checks before commit:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
        args:
          - --hook-config=--path-to-file=README.md
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl
      - id: terraform_tfsec
        args:
          - --args=--minimum-severity=HIGH
      - id: terraform_checkov

Setup:

pip install pre-commit
pre-commit install
pre-commit run --all-files

Level 2: Unit Testing with Terratest

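Test one module at a time: deploy it to a sandbox account, assert on its outputs, then destroy it. Terratest applies real resources, so these tests need cloud credentials and incur cost while they run.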

Setup

# Create test directory
mkdir test && cd test

# Initialize Go module
go mod init github.com/company/infrastructure-tests

# Install Terratest
go get github.com/gruntwork-io/terratest/modules/terraform
go get github.com/stretchr/testify/assert

Basic Module Test

// test/vpc_test.go
package test

import (
    "fmt"
    "strings"
    "testing"
    
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
    t.Parallel()
    
    // Generate unique name
    uniqueID := strings.ToLower(random.UniqueId())
    vpcName := fmt.Sprintf("test-vpc-%s", uniqueID)
    
    terraformOptions := &terraform.Options{
        // Path to Terraform code
        TerraformDir: "../modules/vpc",
        
        // Variables to pass
        Vars: map[string]interface{}{
            "vpc_name":          vpcName,
            "cidr_block":        "10.0.0.0/16",
            "availability_zones": []string{"us-east-1a", "us-east-1b"},
        },
        
        // Environment variables
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": "us-east-1",
        },
    }
    
    // Clean up resources after test
    defer terraform.Destroy(t, terraformOptions)
    
    // Run terraform init and apply
    terraform.InitAndApply(t, terraformOptions)
    
    // Validate outputs
    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID)
    
    subnetIDs := terraform.OutputList(t, terraformOptions, "subnet_ids")
    assert.Equal(t, 2, len(subnetIDs), "Should create 2 subnets")
    
    vpcCIDR := terraform.Output(t, terraformOptions, "vpc_cidr")
    assert.Equal(t, "10.0.0.0/16", vpcCIDR)
}

Run test:

go test -v -timeout 30m
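
Not every module check needs a deployment. Logic such as subnet CIDR planning can be verified in plain Go before Terratest applies anything. The following is a minimal sketch using the standard library's net/netip package; the function name and the sample ranges are illustrative, not taken from any module in this guide.

```go
package main

import (
	"fmt"
	"net/netip"
)

// overlappingCIDRs returns every pair of CIDR blocks that share at least
// one address. It returns an error if any input is not a valid prefix.
func overlappingCIDRs(cidrs []string) ([][2]string, error) {
	prefixes := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", c, err)
		}
		prefixes = append(prefixes, p)
	}

	var pairs [][2]string
	for i := range prefixes {
		for j := i + 1; j < len(prefixes); j++ {
			if prefixes[i].Overlaps(prefixes[j]) {
				pairs = append(pairs, [2]string{cidrs[i], cidrs[j]})
			}
		}
	}
	return pairs, nil
}

func main() {
	// The third range sits inside the first one, so one pair is reported.
	pairs, err := overlappingCIDRs([]string{"10.0.0.0/24", "10.0.1.0/24", "10.0.0.128/25"})
	if err != nil {
		panic(err)
	}
	for _, p := range pairs {
		fmt.Printf("%s overlaps %s\n", p[0], p[1])
	}
}
```

A check like this runs in milliseconds, so it fits at the bottom of the pyramid alongside the static checks.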

Testing with AWS SDK Validation

// test/s3_bucket_test.go  
package test

import (
    "fmt"
    "strings"
    "testing"
    
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/s3"
    awstest "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestS3BucketSecurity(t *testing.T) {
    t.Parallel()
    
    // S3 bucket names must be lowercase; random.UniqueId() can contain uppercase letters
    bucketName := fmt.Sprintf("test-bucket-%s", strings.ToLower(random.UniqueId()))
    awsRegion := "us-east-1"
    
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/s3-bucket",
        Vars: map[string]interface{}{
            "bucket_name": bucketName,
        },
    }
    
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    
    // Create S3 client
    s3Client := awstest.NewS3Client(t, awsRegion)
    
    // Test encryption
    encryption, err := s3Client.GetBucketEncryption(&s3.GetBucketEncryptionInput{
        Bucket: aws.String(bucketName),
    })
    assert.NoError(t, err)
    assert.NotNil(t, encryption.ServerSideEncryptionConfiguration)
    
    // Test versioning
    versioning, err := s3Client.GetBucketVersioning(&s3.GetBucketVersioningInput{
        Bucket: aws.String(bucketName),
    })
    assert.NoError(t, err)
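    // aws.StringValue returns "" for a nil Status (versioning never enabled) instead of panicking
    assert.Equal(t, "Enabled", aws.StringValue(versioning.Status))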
    
    // Test public access block
    publicAccess, err := s3Client.GetPublicAccessBlock(&s3.GetPublicAccessBlockInput{
        Bucket: aws.String(bucketName),
    })
    assert.NoError(t, err)
    // Guard against a missing configuration before reading its fields
    if cfg := publicAccess.PublicAccessBlockConfiguration; assert.NotNil(t, cfg) {
        assert.True(t, aws.BoolValue(cfg.BlockPublicAcls))
        assert.True(t, aws.BoolValue(cfg.BlockPublicPolicy))
    }
}
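
Randomized names still have to satisfy the provider's naming rules, and S3 is stricter than most: names are limited to 3 to 63 characters, lowercase only. The sketch below generates a lowercase suffix with the standard library and checks the basic rules up front, so a bad name fails in milliseconds rather than during apply. Both function names are hypothetical, and the validator covers only the basic rules, not every S3 restriction.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueBucketName appends 6 random bytes, hex-encoded in lowercase,
// to the given prefix.
func uniqueBucketName(prefix string) (string, error) {
	buf := make([]byte, 6)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate suffix: %w", err)
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(buf)), nil
}

// validBucketName checks the basic S3 naming rules: 3 to 63 characters,
// only lowercase letters, digits, dots, and hyphens, and a letter or
// digit at both ends. It does not cover every rule (for example, names
// formatted like IP addresses).
func validBucketName(name string) bool {
	if len(name) < 3 || len(name) > 63 {
		return false
	}
	isAlnum := func(c byte) bool {
		return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
	}
	for i := 0; i < len(name); i++ {
		c := name[i]
		if !isAlnum(c) && c != '.' && c != '-' {
			return false
		}
	}
	return isAlnum(name[0]) && isAlnum(name[len(name)-1])
}

func main() {
	name, err := uniqueBucketName("test-bucket")
	if err != nil {
		panic(err)
	}
	fmt.Println(name, validBucketName(name))
}
```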

Testing Multiple Scenarios

func TestDatabaseConfigurations(t *testing.T) {
    testCases := []struct {
        name            string
        instanceClass   string
        storageSize     int
        multiAZ        bool
        expectedCost   float64
    }{
        {"Small", "db.t3.micro", 20, false, 15.0},
        {"Medium", "db.t3.small", 100, false, 45.0},
        {"Production", "db.r5.large", 500, true, 450.0},
    }
    
    for _, tc := range testCases {
        tc := tc  // Capture range variable
        t.Run(tc.name, func(t *testing.T) {
            t.Parallel()
            
            terraformOptions := &terraform.Options{
                TerraformDir: "../modules/rds",
                Vars: map[string]interface{}{
                    "identifier":      fmt.Sprintf("test-db-%s", random.UniqueId()),
                    "instance_class":  tc.instanceClass,
                    "allocated_storage": tc.storageSize,
                    "multi_az":        tc.multiAZ,
                },
            }
            
            defer terraform.Destroy(t, terraformOptions)
            terraform.InitAndApply(t, terraformOptions)
            
            // Verify configuration
            dbIdentifier := terraform.Output(t, terraformOptions, "db_identifier")
            assert.NotEmpty(t, dbIdentifier)
            
            // Add cost validation here
        })
    }
}
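
The expectedCost field in the table is declared but never checked. One way to fill in the cost validation placeholder is a tolerance comparison between an estimate and the expected figure. In this sketch the estimate is passed in as a plain number; where it comes from (a pricing API, a cost-estimation tool, a billing export) is left open. The function name and the 10% tolerance are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// costWithinTolerance reports whether estimated is within the given
// fractional tolerance of expected (0.10 means plus or minus 10%).
// A non-positive expected cost is treated as a failed check.
func costWithinTolerance(estimated, expected, tolerance float64) bool {
	if expected <= 0 {
		return false
	}
	return math.Abs(estimated-expected)/expected <= tolerance
}

func main() {
	// Expected monthly figures taken from the test table; the estimates
	// are made-up inputs for the demonstration.
	fmt.Println(costWithinTolerance(16.0, 15.0, 0.10)) // about 6.7% off
	fmt.Println(costWithinTolerance(50.0, 45.0, 0.10)) // about 11.1% off
}
```

Inside the table-driven test this would become a single assertion, for example assert.True(t, costWithinTolerance(estimate, tc.expectedCost, 0.10)).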

Level 3: Policy as Code Testing

Enforce organizational standards automatically.

Open Policy Agent (OPA)

Policy file:

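# policies/terraform.rego
# Written in pre-1.0 Rego syntax. OPA 1.0+ expects `deny contains msg if { ... }`;
# on those versions, update the rule heads or run with --v0-compatible.
package terraform

import input as tfplan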

# Deny S3 buckets without encryption
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not has_encryption(resource)
    
    msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.name])
}

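# Assumes the inline encryption block on aws_s3_bucket. With AWS provider v4+,
# encryption is set through a separate aws_s3_bucket_server_side_encryption_configuration
# resource, so this helper needs adapting for those configurations.
has_encryption(resource) {
    config := resource.change.after
    config.server_side_encryption_configuration
}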

# Deny databases without backup retention
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.after.backup_retention_period < 7
    
    msg := sprintf("RDS '%s' must have >= 7 days backup retention", [resource.name])
}

# Deny public SSH access
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_security_group"
    
    rule := resource.change.after.ingress[_]
    rule.cidr_blocks[_] == "0.0.0.0/0"
    rule.from_port <= 22
    rule.to_port >= 22
    
    msg := sprintf("Security group '%s' allows public SSH access", [resource.name])
}

# Warn about expensive instances in non-prod
warn[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_instance"
    
    expensive := ["m5.8xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge"]
    resource.change.after.instance_type == expensive[_]
    
    tags := resource.change.after.tags
    tags.Environment != "production"
    
    msg := sprintf("Instance '%s' uses expensive type in non-prod", [resource.name])
}

Test policy:

# Generate plan
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

# Test with OPA
opa exec --decision terraform/deny --bundle policies/ tfplan.json

# Test in CI/CD (deny is a set and is defined even when empty, so query its members)
opa eval --fail-defined --data policies/ --input tfplan.json "data.terraform.deny[_]"
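
To make the shape of the plan JSON concrete, the public SSH rule can be mirrored in plain Go. This is not a replacement for OPA, only a sketch of what the rule matches: each entry in resource_changes carries the planned values under change.after, and a rule is flagged when its port range includes 22 and one of its CIDR blocks is 0.0.0.0/0. The fixture is a hand-written fragment, not real terraform show output, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ingressRule and resourceChange model only the plan fields this check reads.
type ingressRule struct {
	FromPort   int      `json:"from_port"`
	ToPort     int      `json:"to_port"`
	CidrBlocks []string `json:"cidr_blocks"`
}

type resourceChange struct {
	Type   string `json:"type"`
	Name   string `json:"name"`
	Change struct {
		After struct {
			Ingress []ingressRule `json:"ingress"`
		} `json:"after"`
	} `json:"change"`
}

// allowsPublicSSH reports whether any rule opens port 22 to 0.0.0.0/0.
func allowsPublicSSH(rules []ingressRule) bool {
	for _, r := range rules {
		if r.FromPort > 22 || r.ToPort < 22 {
			continue // port range does not include 22
		}
		for _, cidr := range r.CidrBlocks {
			if cidr == "0.0.0.0/0" {
				return true
			}
		}
	}
	return false
}

// publicSSHViolations returns one message per offending security group.
func publicSSHViolations(planJSON []byte) ([]string, error) {
	var plan struct {
		ResourceChanges []resourceChange `json:"resource_changes"`
	}
	if err := json.Unmarshal(planJSON, &plan); err != nil {
		return nil, fmt.Errorf("parse plan: %w", err)
	}
	var msgs []string
	for _, rc := range plan.ResourceChanges {
		if rc.Type == "aws_security_group" && allowsPublicSSH(rc.Change.After.Ingress) {
			msgs = append(msgs, fmt.Sprintf("Security group '%s' allows public SSH access", rc.Name))
		}
	}
	return msgs, nil
}

// samplePlan is a hand-written fixture for illustration.
const samplePlan = `{"resource_changes": [
  {"type": "aws_security_group", "name": "web", "change": {"after": {"ingress": [
    {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]}}},
  {"type": "aws_security_group", "name": "internal", "change": {"after": {"ingress": [
    {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]}]}}},
  {"type": "aws_security_group", "name": "https", "change": {"after": {"ingress": [
    {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]}]}}}
]}`

func main() {
	msgs, err := publicSSHViolations([]byte(samplePlan))
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

Small fixtures like samplePlan are also a convenient input for exercising the Rego policy itself, since they are far easier to read than a full plan export.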

Conftest for Kubernetes

# policy/deployment.rego
package main

deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.securityContext.runAsNonRoot
    msg = "Containers must not run as root"
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.cpu
    msg = sprintf("Container %v must have CPU limits", [container.name])
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.memory
    msg = sprintf("Container %v must have memory limits", [container.name])
}

Test manifests:

conftest test deployment.yaml

Level 4: Integration Testing

Test how resources work together.

Database Connectivity Test

package test

import (
    "database/sql"
    "fmt"
    "io"
    "net/http"
    "testing"
    "time"
    
    "github.com/gruntwork-io/terratest/modules/terraform"
    _ "github.com/lib/pq" // registers the "postgres" driver
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestApplicationDatabaseConnection(t *testing.T) {
    t.Parallel()
    
    terraformOptions := &terraform.Options{
        TerraformDir: "../examples/app-with-database",
    }
    
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    
    // Get database connection details.
    // db_endpoint is assumed to be a bare hostname (the RDS address attribute);
    // the RDS endpoint attribute includes ":port" and would not fit host=%s.
    dbEndpoint := terraform.Output(t, terraformOptions, "db_endpoint")
    dbUsername := terraform.Output(t, terraformOptions, "db_username")
    dbPassword := terraform.Output(t, terraformOptions, "db_password")
    dbName := terraform.Output(t, terraformOptions, "db_name")
    
    // Test direct database connection
    connStr := fmt.Sprintf(
        "host=%s user=%s password=%s dbname=%s sslmode=require",
        dbEndpoint, dbUsername, dbPassword, dbName,
    )
    
    // require stops the test on failure, so later steps never run on a bad handle
    db, err := sql.Open("postgres", connStr)
    require.NoError(t, err)
    defer db.Close()
    
    require.NoError(t, db.Ping(), "Should connect to database")
    
    // Test application can connect
    appURL := terraform.Output(t, terraformOptions, "app_url")
    
    maxRetries := 10
    for i := 0; i < maxRetries; i++ {
        resp, err := http.Get(fmt.Sprintf("%s/health", appURL))
        if err == nil {
            // Always read and close the body so connections are not leaked between retries
            body, readErr := io.ReadAll(resp.Body)
            resp.Body.Close()
            if readErr == nil && resp.StatusCode == http.StatusOK {
                assert.Contains(t, string(body), "database_connected")
                return
            }
        }
        time.Sleep(10 * time.Second)
    }
    
    t.Fatal("Application failed to connect to database")
}
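
The retry loop in this test is a pattern worth factoring out, since almost every integration test waits on something that is not ready yet. Terratest ships a retry package for this purpose; the sketch below shows the same idea with only the standard library. The helper name, the error variable, and the attempt counts are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a placeholder error used by the demonstration below.
var errNotReady = errors.New("not ready")

// retry calls fn up to attempts times, sleeping between failed calls.
// It returns nil on the first success, or the last error wrapped with
// the attempt count once all attempts are used up.
func retry(attempts int, sleep time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("retry: attempts must be at least 1")
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(sleep) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady // simulate a service that needs time to start
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Because the last error is wrapped with %w, callers can still use errors.Is to distinguish a timeout from a hard failure.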

Network Connectivity Test

func TestPrivateSubnetConnectivity(t *testing.T) {
    t.Parallel()
    
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/networking",
    }
    
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    
    // Get bastion details
    bastionIP := terraform.Output(t, terraformOptions, "bastion_public_ip")
    privateIP := terraform.Output(t, terraformOptions, "private_instance_ip")
    keyPair := terraform.Output(t, terraformOptions, "key_pair_name")
    
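    // SSH to bastion (loadKeyPair is a local helper, not shown here, that
    // returns the *ssh.KeyPair for the named key)
    bastionHost := ssh.Host{
        Hostname:    bastionIP,
        SshUserName: "ec2-user",
        SshKeyPair:  loadKeyPair(t, keyPair),
    }
    
    // Test connection to private instance through bastion. The nested ssh call
    // only succeeds if the bastion can authenticate to the private instance,
    // for example through agent forwarding or a key present on the bastion.
    command := fmt.Sprintf("ssh -o StrictHostKeyChecking=no ec2-user@%s echo connected", privateIP)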
    
    output, err := ssh.CheckSshCommandE(t, bastionHost, command)
    assert.NoError(t, err)
    assert.Contains(t, output, "connected")
}

Level 5: End-to-End Testing

Test complete infrastructure deployments.

func TestFullEnvironment(t *testing.T) {
    // Skip for short tests
    if testing.Short() {
        t.Skip("Skipping E2E test")
    }
    
    terraformOptions := &terraform.Options{
        TerraformDir: "../environments/staging",
        Vars: map[string]interface{}{
            "environment": "e2e-test",
        },
    }
    
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    
    // Get application URL
    appURL := terraform.Output(t, terraformOptions, "app_url")
    
    // Wait for application
    http_helper.HttpGetWithRetry(
        t,
        fmt.Sprintf("%s/health", appURL),
        nil,
        200,
        "healthy",
        30,
        10*time.Second,
    )
    
    // Test main features
    t.Run("Homepage", func(t *testing.T) {
        status, body := http_helper.HttpGet(t, appURL, nil)
        assert.Equal(t, 200, status)
        assert.Contains(t, body, "<title>")
    })
    
    t.Run("API", func(t *testing.T) {
        status, body := http_helper.HttpGet(t, fmt.Sprintf("%s/api/status", appURL), nil)
        assert.Equal(t, 200, status)
        assert.Contains(t, body, `"status":"ok"`)
    })
    
    t.Run("Database", func(t *testing.T) {
        status, body := http_helper.HttpGet(t, fmt.Sprintf("%s/api/db-check", appURL), nil)
        assert.Equal(t, 200, status)
        assert.Contains(t, body, `"connected":true`)
    })
}

CI/CD Integration

GitHub Actions

# .github/workflows/terraform-test.yml
name: Terraform Tests

on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches: [main]

jobs:
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - uses: hashicorp/setup-terraform@v2
      
      - name: Terraform Format
        run: terraform fmt -check -recursive
        
      - name: Terraform Validate
        run: |
          terraform init -backend=false
          terraform validate
        
      - name: Setup TFLint
        uses: terraform-linters/setup-tflint@v3
        
      - name: TFLint
        run: |
          tflint --init
          tflint --recursive
        
      - name: TFSec
        uses: aquasecurity/tfsec-action@v1.0.0  # version tag assumed; pin to a release you have verified
        
      - name: Checkov
        uses: bridgecrewio/checkov-action@master
        with:
          directory: terraform/

  unit-tests:
    runs-on: ubuntu-latest
    needs: static-analysis
    steps:
      - uses: actions/checkout@v3
      
      - uses: actions/setup-go@v4
        with:
          go-version: '1.21'
          
      - name: Run Tests
        working-directory: test
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: go test -v -timeout 30m -parallel 5

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      
      - uses: actions/setup-go@v4
        with:
          go-version: '1.21'
          
      - name: Run Integration Tests
        working-directory: test
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: go test -v -timeout 60m -tags=integration

Cost Management

Ephemeral Test Environments

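// Needs "sync", "testing", "time", and Terratest's terraform package.
type TestEnvironment struct {
    Name        string
    CreatedAt   time.Time
    MaxDuration time.Duration
}

// scheduleCleanup destroys the environment when the test ends or when
// MaxDuration has elapsed since CreatedAt, whichever comes first.
func (env *TestEnvironment) scheduleCleanup(t *testing.T, opts *terraform.Options) {
    var once sync.Once
    destroy := func() {
        once.Do(func() {
            // DestroyE returns the error rather than calling t.Fatal,
            // which must not be called from the timer's goroutine
            if _, err := terraform.DestroyE(t, opts); err != nil {
                t.Logf("cleanup of %s failed: %v", env.Name, err)
            }
        })
    }
    timer := time.AfterFunc(time.Until(env.CreatedAt.Add(env.MaxDuration)), destroy)
    // t.Cleanup runs even when the test fails; once.Do prevents a second destroy
    t.Cleanup(func() {
        timer.Stop()
        destroy()
    })
}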

Resource Tagging

locals {
  test_tags = {
    Environment = "test"
    TestID      = var.test_id
    CreatedAt   = timestamp() # re-evaluated on every plan, so this tag always shows as changed
    AutoDelete  = "true"
  }
}
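
These tags give a scheduled cleanup job something to act on when a test process dies before its deferred destroy runs. The decision logic can be sketched in Go as follows. Listing and deleting the tagged resources through the cloud provider's API is left out, and the function names and two-hour limit are assumptions. The CreatedAt format is RFC 3339, which is what Terraform's timestamp() produces.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxAge is an illustrative limit for how long test resources may live.
const defaultMaxAge = 2 * time.Hour

// expired reports whether a resource with these tags is eligible for
// deletion: AutoDelete must be "true" and CreatedAt must be older than
// maxAge relative to now. Resources without AutoDelete are never expired.
func expired(tags map[string]string, now time.Time, maxAge time.Duration) (bool, error) {
	if tags["AutoDelete"] != "true" {
		return false, nil
	}
	created, err := time.Parse(time.RFC3339, tags["CreatedAt"])
	if err != nil {
		return false, fmt.Errorf("bad CreatedAt tag %q: %w", tags["CreatedAt"], err)
	}
	return now.Sub(created) > maxAge, nil
}

// mustParse is a small helper for the examples; it panics on malformed input.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	tags := map[string]string{
		"Environment": "test",
		"AutoDelete":  "true",
		"CreatedAt":   "2024-01-01T00:00:00Z",
	}
	// Three hours after creation, past the two-hour limit.
	ok, err := expired(tags, mustParse("2024-01-01T03:00:00Z"), defaultMaxAge)
	fmt.Println(ok, err) // prints "true <nil>"
}
```

Passing now as a parameter rather than calling time.Now() inside the function keeps the logic deterministic and easy to test.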

Best Practices

Use the test pyramid - More fast tests, fewer slow tests
Run tests in parallel - Use t.Parallel() in Go tests
Always clean up resources - Use defer terraform.Destroy()
Randomize names - Avoid test collisions
Add retry logic - Network operations can be flaky
Document tests - Explain what's being tested and why
Keep tests fast - Slow tests don't get run
Test in isolation - Each test should be independent

Conclusion

Infrastructure testing transforms infrastructure management from hope-driven to evidence-based. By implementing testing at multiple levels, teams catch errors early when they're cheap to fix.

Start Here:

Add pre-commit hooks for static analysis (today)
Write one Terratest for your most critical module (this week)
Add tests to CI/CD pipeline (this sprint)
Measure test coverage and add integration tests (next month)

Key Metrics to Track (example targets):

Time to detect infrastructure bugs: from days → minutes
Production incidents from infrastructure: from 10/month → 1/month
Deployment confidence: from 60% → 95%
Time to deploy: from hours → minutes

The best infrastructure tests are the ones that run automatically and catch bugs before production. Start simple, measure impact, and expand coverage incrementally.
