Infrastructure as Code Best Practices: A Practical Guide

Introduction

Infrastructure as Code (IaC) has become the standard for managing cloud infrastructure, but the gap between "it works" and "it works well" is vast. This guide explores battle-tested patterns and practices that will help you write maintainable, scalable, and reliable infrastructure code.

Whether you're managing a small startup's AWS environment or orchestrating multi-cloud infrastructure for an enterprise, these principles will help you avoid common pitfalls and build infrastructure that scales with your organization.

Why IaC Best Practices Matter

Poor IaC practices don't just create technical debt; they create operational risk. Consider these scenarios:

A terraform apply that takes 45 minutes because everything is in one massive state file
A production outage caused by applying changes from the wrong branch
Hours spent debugging why infrastructure works in staging but fails in production
A critical security patch delayed because nobody understands the infrastructure code

Good IaC practices prevent these scenarios and create infrastructure that's:

Reproducible: Any team member can recreate environments
Auditable: Changes are tracked and reviewable
Testable: Validate changes before they reach production
Maintainable: New team members can understand and modify code
Secure: Secrets are managed properly, least privilege is enforced

Core Principles

1. Treat Infrastructure Code Like Application Code

Your infrastructure code deserves the same rigor as your application code. This means:

Version Control Everything

# Good: All infrastructure in version control
infrastructure/
├── terraform/
│   ├── environments/
│   │   ├── prod/
│   │   ├── staging/
│   │   └── dev/
│   └── modules/
├── .gitignore          # Exclude secrets, state files
└── README.md           # Setup instructions

Never commit:

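State files (.tfstate, .tfstate.backup)
Secrets or credentials, including provider credentials
The local .terraform/ directory (downloaded providers and modules)

Do commit the dependency lock file (.terraform.lock.hcl): it pins provider versions and checksums so every run resolves the same providers.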

Use Pull Requests for All Changes

Every infrastructure change should go through code review:

Reviewers catch mistakes before they reach production
The team learns from each other's approaches
Documentation happens naturally in PR discussions
Changes are auditable with clear context

Write Meaningful Commit Messages

# Bad
git commit -m "fix"

# Good
git commit -m "Add lifecycle policy to prevent accidental RDS deletion

- Implement prevent_destroy on production RDS instances
- Add automated backup retention of 30 days
- Resolves incident #1234 where staging DB was accidentally destroyed"

2. State Management is Critical

State management is where many IaC disasters originate. Handle it with care.

Always Use Remote State

Local state lives on a single machine, so it cannot be shared or locked, and it is gone if that machine is. Use a remote backend:

# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    
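    # These are already the defaults; they only keep validation switched on.
    # They do not prevent deletion. Protect the bucket itself with versioning.
    skip_region_validation      = false
    skip_credentials_validation = false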
  }
}
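
The backend block assumes the bucket already exists and says nothing about how it is protected. As a rough sketch, a separate bootstrap configuration could create it along the following lines. The resource names are hypothetical; the bucket name is the one used in the backend above.

```hcl
# Hypothetical bootstrap configuration for the state bucket itself.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true # refuse any plan that would delete the bucket
  }
}

# Keep previous state versions so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State can contain secrets, so the bucket should never be public.
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Keeping this configuration in its own small state avoids the chicken-and-egg problem of storing the bucket's state inside the bucket it creates.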

Enable State Locking

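Concurrent applies against the same state can corrupt it, so always enable locking. The table below is the one named by dynamodb_table in the backend configuration (recent Terraform releases also support S3-native locking through the use_lockfile backend setting):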

# DynamoDB table for Terraform state locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "shared"
  }
}

Split State Files Strategically

Don't put everything in one state file. Split by:

Environment: Separate prod, staging, dev
Lifecycle: Networking separate from applications
Blast Radius: Critical infrastructure separate from experimental

terraform/
├── networking/          # VPCs, subnets, routing
│   ├── prod/
│   └── staging/
├── data/                # Databases, caches
│   ├── prod/
│   └── staging/
└── applications/        # App infrastructure
    ├── prod/
    └── staging/

Benefits:

Faster plan/apply cycles
Reduced blast radius
Parallel development
Independent deployment schedules
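
Once state is split, one configuration often needs values owned by another. One way to wire this up, sketched here under the assumption that the networking state exports a private_subnet_ids output, is the terraform_remote_state data source. The bucket and key reuse the backend example above; ami_id is a hypothetical variable.

```hcl
# applications/prod/main.tf (hypothetical file)

variable "ami_id" {
  description = "AMI for the application instance (illustrative input)"
  type        = string
}

# Read-only view of the outputs published by the networking state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.small"

  # Only root-module outputs of the other state are visible here.
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
```

Anyone who can read these outputs can read the whole state object, so some teams prefer to publish shared values through a parameter store instead.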

3. Module Design Patterns

Modules are how you scale IaC across teams and projects.

Create Focused, Single-Purpose Modules

# Bad: God module that does everything
module "everything" {
  source = "./modules/infrastructure"
  create_vpc = true
  create_rds = true
  create_eks = true
  # ... 50 more parameters
}

# Good: Focused modules
module "networking" {
  source = "./modules/vpc"

  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]
}

module "database" {
  source = "./modules/rds-postgres"

  vpc_id     = module.networking.vpc_id
  subnet_ids = module.networking.private_subnet_ids
}

Design for Composability

Modules should work together like LEGO blocks:

# Base networking module
module "vpc" {
  source = "git::https://github.com/myorg/terraform-modules//vpc?ref=v2.1.0"
  
  name               = "production"
  cidr_block         = "10.0.0.0/16"
  availability_zones = data.aws_availability_zones.available.names
  
  # Enable features as needed
  enable_nat_gateway = true
  enable_vpn_gateway = false
}

# Security module that uses VPC outputs
module "security_groups" {
  source = "git::https://github.com/myorg/terraform-modules//security-groups?ref=v1.3.0"
  
  vpc_id = module.vpc.vpc_id
  
  allow_ssh_from = ["10.0.0.0/8"]  # Only from private networks
}

# Application module that uses both
module "web_app" {
  source = "./modules/web-application"
  
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  security_group_id = module.security_groups.app_sg_id
}
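
Composable modules depend on a narrow, typed interface. The following is a minimal sketch of what the inside of a VPC module might look like, not the actual content of the modules referenced above. It assumes the VPC range is a /16 or larger, so that adding 8 bits yields one /24 per availability zone.

```hcl
# modules/vpc/main.tf (hypothetical implementation)

variable "cidr_block" {
  description = "CIDR range for the VPC"
  type        = string

  validation {
    # cidrhost() fails on malformed input, and can() turns that into false.
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "The cidr_block value must be a valid CIDR such as 10.0.0.0/16."
  }
}

variable "availability_zones" {
  description = "Availability zones to place private subnets in"
  type        = list(string)
}

locals {
  # cidrsubnet("10.0.0.0/16", 8, 0) is "10.0.0.0/24"; index 1 gives "10.0.1.0/24".
  private_subnet_cidrs = [
    for index, az in var.availability_zones : cidrsubnet(var.cidr_block, 8, index)
  ]
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block
}

resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = local.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
}
```

Deriving subnet ranges from a single input keeps callers from having to pass, and keep consistent, a list of hand-computed CIDRs.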

Version Your Modules

Treat modules as versioned packages:

# Bad: Using latest from main branch
module "database" {
  source = "git::https://github.com/myorg/terraform-modules//rds"
}

# Good: Pin to specific version
module "database" {
  source = "git::https://github.com/myorg/terraform-modules//rds?ref=v2.3.1"
  
  # Upgrade deliberately, test thoroughly
}
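
The same discipline applies to Terraform itself and to providers. A minimal versions.tf sketch follows; the version numbers are illustrative assumptions, not recommendations.

```hcl
# versions.tf
terraform {
  # Refuse to run on CLI versions the code has not been tested with.
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, never 6.0
    }
  }
}
```

The constraint sets the allowed range, and the committed .terraform.lock.hcl records the exact provider version that was selected within it.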

Module Documentation Standards

Every module should have:

# modules/rds-postgres/README.md

## RDS PostgreSQL Module

Creates a production-ready PostgreSQL RDS instance with:
- Automated backups
- Encryption at rest
- Multi-AZ deployment option
- Parameter group optimization
- CloudWatch monitoring

### Usage

```hcl
module "database" {
  source = "./modules/rds-postgres"

  identifier        = "myapp-prod"
  instance_class    = "db.t3.large"
  allocated_storage = 100

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}
```

### Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| identifier | Database identifier | string | n/a | yes |
| instance_class | RDS instance type | string | db.t3.medium | no |

### Outputs

| Name | Description |
|------|-------------|
| endpoint | Database endpoint |
| connection_string | Full connection string |


4. Environment Management

Use Workspaces or Directory Structure

Choose one pattern and stick with it:

Option A: Terraform Workspaces

# Simple projects, shared configuration
terraform workspace new prod
terraform workspace new staging
terraform workspace select prod

Option B: Directory-Based (Recommended for most)

terraform/
└── environments/
    ├── prod/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── terraform.tfvars
    ├── staging/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── terraform.tfvars
    └── shared/
        └── modules/

Directory-based is preferred because:

Explicit state separation
Different backend configurations per environment
Clear blast radius
Easier to apply environment-specific policies

Never Hardcode Environment Values

# Bad: Hardcoded values
resource "aws_instance" "web" {
  instance_type = "t3.large"
  count = 5
}

# Good: Variable-driven
variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "instance_count" {
  description = "Number of instances"
  type        = number
}

resource "aws_instance" "web" {
  instance_type = var.instance_type
  count         = var.instance_count
}

Then use environment-specific tfvars:

# environments/prod/terraform.tfvars
instance_type  = "t3.large"
instance_count = 5

# environments/staging/terraform.tfvars
instance_type  = "t3.small"
instance_count = 2
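
Because tfvars files are edited by hand, it can help to reject bad values at plan time. The sketch below adds validation to the two variables above; the allowed instance family and the upper bound of 20 are assumptions chosen for illustration.

```hcl
variable "instance_type" {
  description = "EC2 instance type"
  type        = string

  validation {
    # Assumed policy: only burstable t3 instances are allowed.
    condition     = can(regex("^t3\\.", var.instance_type))
    error_message = "The instance_type value must be in the t3 family, for example t3.small."
  }
}

variable "instance_count" {
  description = "Number of instances"
  type        = number

  validation {
    # Whole numbers between 1 and 20 inclusive.
    condition = (
      var.instance_count >= 1 &&
      var.instance_count <= 20 &&
      floor(var.instance_count) == var.instance_count
    )
    error_message = "The instance_count value must be a whole number between 1 and 20."
  }
}
```

With this in place, a typo such as instance_count = 200 fails during terraform plan instead of reaching the provider.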

5. Security Best Practices

Never Store Secrets in Code

# Bad: Secrets in code
resource "aws_db_instance" "db" {
  username = "admin"
  password = "supersecret123"  # NEVER DO THIS
}

# Good: Use secrets management
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/master-password"
}

resource "aws_db_instance" "db" {
  username = "admin"
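  # The resolved value is still written to state, so encrypt the state backend and restrict access to it
  password = data.aws_secretsmanager_secret_version.db_password.secret_string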
}
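
One caveat with the data source approach is that Terraform records the resolved secret value in state. A possible alternative, sketched here on the assumption of a recent AWS provider, is to let RDS create and store the master password in Secrets Manager itself. The identifier, engine, sizing, username and output name are illustrative.

```hcl
resource "aws_db_instance" "db" {
  identifier        = "myapp-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100

  username = "dbadmin"

  # RDS generates the password and keeps it in Secrets Manager,
  # so Terraform never sees the value.
  manage_master_user_password = true
}

# Applications look the secret up by ARN at runtime.
output "db_master_secret_arn" {
  description = "ARN of the Secrets Manager secret holding the master password"
  value       = aws_db_instance.db.master_user_secret[0].secret_arn
}
```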

Implement Least Privilege

# Bad: Overly permissive
resource "aws_iam_role_policy" "app_policy" {
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "*"
      Resource = "*"
    }]
  })
}

# Good: Specific permissions
resource "aws_iam_role_policy" "app_policy" {
  role = aws_iam_role.app.id # assumes an aws_iam_role.app defined elsewhere

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::my-app-bucket/*"
      },
      {
        Effect = "Allow"
        Action = ["dynamodb:GetItem", "dynamodb:PutItem"]
        Resource = aws_dynamodb_table.app_table.arn
      }
    ]
  })
}
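
The same policy can also be written with the aws_iam_policy_document data source, which checks the document's structure during planning rather than leaving malformed JSON for IAM to reject. This is a sketch of that equivalent form; the sid values, policy name and aws_iam_role.app are hypothetical.

```hcl
data "aws_iam_policy_document" "app" {
  statement {
    sid       = "AppBucketObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-app-bucket/*"]
  }

  statement {
    sid       = "AppTableItems"
    effect    = "Allow"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem"]
    resources = [aws_dynamodb_table.app_table.arn]
  }
}

resource "aws_iam_role_policy" "app_policy" {
  name   = "app-least-privilege"
  role   = aws_iam_role.app.id # hypothetical role defined elsewhere
  policy = data.aws_iam_policy_document.app.json
}
```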

Enable Encryption by Default

# S3 bucket with encryption
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
  }
}

# RDS with encryption
resource "aws_db_instance" "db" {
  # ... other configuration
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn
}
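
The examples above reference aws_kms_key.s3 without defining it. A minimal sketch of such a key follows, with a description, alias name and deletion window chosen only for illustration.

```hcl
resource "aws_kms_key" "s3" {
  description             = "Encrypts objects in my-data-bucket"
  enable_key_rotation     = true # rotate the key material automatically
  deletion_window_in_days = 30   # longest allowed grace period before deletion
}

# A readable alias so the key can be found without its ID.
resource "aws_kms_alias" "s3" {
  name          = "alias/my-data-bucket"
  target_key_id = aws_kms_key.s3.key_id
}
```

A long deletion window is a cheap safeguard, since data encrypted with a deleted key cannot be recovered.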

6. Code Organization and Style

Consistent File Naming

terraform/
├── main.tf           # Primary resources
├── variables.tf      # Input variables
├── outputs.tf        # Output values
├── versions.tf       # Provider versions
├── backend.tf        # Backend configuration
├── data.tf           # Data sources
├── locals.tf         # Local values
└── README.md         # Documentation

Use Meaningful Resource Names

# Bad: Generic names
resource "aws_instance" "instance1" {
  # ...
}

resource "aws_security_group" "sg2" {
  # ...
}

# Good: Descriptive names
resource "aws_instance" "web_server" {
  # ...
}

resource "aws_security_group" "web_server_https" {
  name        = "web-server-https"
  description = "Allow inbound HTTPS traffic to web servers"
  # ...
}

Leverage Locals for Complex Logic

locals {
  # Environment-specific configuration
  environment_config = {
    prod = {
      instance_count = 5
      instance_type  = "t3.large"
      enable_backup  = true
    }
    staging = {
      instance_count = 2
      instance_type  = "t3.small"
      enable_backup  = false
    }
  }
  
  config = local.environment_config[var.environment]
  
  # Common tags
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "web" {
  count         = local.config.instance_count
  instance_type = local.config.instance_type
  
  tags = merge(
    local.common_tags,
    {
      Name = "web-server-${count.index + 1}"
      Role = "web"
    }
  )
}

7. Change Management

Always Run Plan Before Apply

# Development workflow
terraform fmt              # Format code
terraform validate         # Validate syntax
terraform plan -out=plan   # Generate plan
# Review plan thoroughly
terraform apply plan       # Apply saved plan

Use Plan Files for Safety

# Generate a plan under a timestamped name and remember that name
PLAN_FILE="prod-change-$(date +%Y%m%d-%H%M%S).tfplan"
terraform plan -out="$PLAN_FILE"

# Review the plan
terraform show "$PLAN_FILE"

# Apply exactly what was reviewed; Terraform rejects the file if state has changed since
terraform apply "$PLAN_FILE"

Implement Approval Gates

For production changes:

Developer creates infrastructure PR
Automated CI runs terraform plan
Plan output posted to PR as comment
Required reviewers approve
Merge triggers automated apply (or manual apply with approval)

Example GitHub Actions Workflow:

name: Terraform Plan

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        
      - name: Terraform Init
        run: terraform init
        working-directory: ./terraform/environments/prod
        
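      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color
        working-directory: ./terraform/environments/prod
        continue-on-error: true  # keep going so a failed plan is still posted

      - name: Comment Plan
        uses: actions/github-script@v6
        env:
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            // Post plan output as PR comment (the job needs pull-requests: write permission)
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: 'Terraform plan output:\n\n' + process.env.PLAN,
            });

      - name: Fail on Plan Error
        if: steps.plan.outcome == 'failure'
        run: exit 1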

8. Testing Infrastructure Code

Unit Testing with Terratest

// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "cidr_block": "10.0.0.0/16",
            "name": "test-vpc",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
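
Terratest applies real infrastructure, which is slow and costs money. For cheaper checks, Terraform 1.6 and later include a native test framework. The sketch below assumes the VPC module declares name and cidr_block variables and an aws_vpc.main resource, as the example above implies; the file path is hypothetical.

```hcl
# modules/vpc/tests/vpc.tftest.hcl

variables {
  cidr_block = "10.0.0.0/16"
  name       = "test-vpc"
}

run "vpc_uses_requested_cidr" {
  # Plan only: nothing is created, so the run is fast and free.
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == var.cidr_block
    error_message = "The VPC CIDR block does not match the cidr_block input."
  }
}
```

Run it with terraform test from the module directory. A plan-only run still needs a configured provider, so this complements apply-based tests rather than replacing them.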

Static Analysis

Use tools to catch issues before deployment:

# TFLint - Terraform linting
tflint --init
tflint

# Checkov - Security scanning
checkov -d terraform/

# TFSec - Security scanner (its maintainers now point new users to Trivy: trivy config terraform/)
tfsec terraform/

# Terraform validate
terraform validate

Pre-commit Hooks

Automate checks locally:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: terraform_tfsec

9. Documentation

Document the "Why" Not Just the "What"

# Bad comment
# Create S3 bucket
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}

# Good comment
# S3 bucket for user-uploaded documents
# Versioning enabled to meet compliance requirements (SOC2 3.1.2)
# Lifecycle policy moves objects to Glacier after 90 days to reduce costs
resource "aws_s3_bucket" "user_documents" {
  bucket = "prod-user-documents"
  
  tags = {
    Purpose    = "User uploaded content storage"
    Compliance = "SOC2"
    DataClass  = "customer-data"
  }
}

Maintain an Architecture Decision Record (ADR)

# ADR-003: Use Separate VPCs Per Environment

## Status
Accepted

## Context
We need to decide whether to use a single VPC with subnet isolation
or separate VPCs for each environment (prod, staging, dev).

## Decision
We will use separate VPCs for each environment.

## Consequences

### Positive
- Complete network isolation between environments
- Easier to apply different security policies
- Independent IP address space management
- Reduced blast radius for network changes

### Negative
- Increased complexity in VPC peering if needed
- Higher AWS costs (NAT gateways per VPC)
- More infrastructure to manage

## Implementation Notes
- Production VPC: 10.0.0.0/16
- Staging VPC: 10.1.0.0/16
- Development VPC: 10.2.0.0/16

10. Monitoring and Observability

Tag Everything

locals {
  required_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
    Project     = var.project_name
  }
}

# Apply to all resources
resource "aws_instance" "app" {
  count = var.instance_count # count.index below requires count to be set
  # ... configuration
  
  tags = merge(
    local.required_tags,
    {
      Name = "app-server-${count.index}"
      Role = "application"
    }
  )
}
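
Merging tags into every resource by hand is easy to forget. As a sketch, the AWS provider's default_tags block applies a baseline set of tags to every taggable resource it manages, so individual resources only add what is specific to them. The region is a placeholder; the variables are the ones used above.

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied automatically to every taggable resource from this provider.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Team        = var.team_name
      CostCenter  = var.cost_center
      Project     = var.project_name
    }
  }
}
```

Tags set on a resource override a default tag with the same key, so per-resource Name and Role tags continue to work.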

Enable Drift Detection

# Detect configuration drift
terraform plan -detailed-exitcode

# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected

Set up automated drift detection:

# .github/workflows/drift-detection.yml
name: Terraform Drift Detection

on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 08:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        
      - name: Check for Drift
        run: |
          terraform init
          terraform plan -detailed-exitcode
        working-directory: ./terraform/environments/prod
        
      - name: Notify on Drift
        if: failure()
        run: |
          # Send alert to Slack/PagerDuty
          # failure() fires on exit code 2 (drift) and on exit code 1 (plan error)
          echo "Drift detected or plan failed in production infrastructure"

Common Antipatterns to Avoid

1. The God Module

Problem: One massive module that creates everything

# DON'T DO THIS
module "infrastructure" {
  source = "./modules/everything"
  
  # 100+ parameters
  create_vpc = true
  create_rds = true
  create_eks = true
  create_cloudfront = true
  # ... endless configuration
}

Solution: Break into focused modules

2. Copy-Paste Infrastructure

Problem: Duplicating code across environments

terraform/
├── prod/
│   ├── main.tf      # 500 lines
│   └── variables.tf
└── staging/
    ├── main.tf      # Same 500 lines, slightly different
    └── variables.tf

Solution: Use modules or workspaces
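
With modules, each environment root shrinks to a short file that differs only in its inputs. The following sketch assumes a shared web-application module exposing these three inputs; the paths and variable names are hypothetical.

```hcl
# environments/prod/main.tf
module "web_app" {
  source = "../../modules/web-application"

  environment    = "prod"
  instance_type  = "t3.large"
  instance_count = 5
}

# environments/staging/main.tf
module "web_app" {
  source = "../../modules/web-application"

  environment    = "staging"
  instance_type  = "t3.small"
  instance_count = 2
}
```

The 500 lines live once in the module, and a diff between the two roots shows exactly how the environments differ.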

3. Manual State Manipulation

Problem: Running terraform state rm or editing state files manually

Solution: Treat manual state edits as a last resort, and guard important resources with lifecycle rules:

resource "aws_instance" "app" {
  # ... configuration
  
  lifecycle {
    prevent_destroy = true        # Fail any plan that would destroy this resource
    create_before_destroy = true  # On replacement, create the new resource before removing the old
    ignore_changes = [
      tags["LastModified"],       # Ignore certain changes
    ]
  }
}
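
Most reasons for touching state by hand (renaming a resource, adopting an existing one, or releasing one without deleting it) can be expressed as reviewable configuration. A sketch, assuming Terraform 1.7 or later so that all three block types are available; aws_instance.legacy is a hypothetical address, the other names come from earlier examples.

```hcl
# Rename without destroy and recreate (replaces terraform state mv).
moved {
  from = aws_instance.instance1
  to   = aws_instance.web_server
}

# Adopt an existing bucket into state (replaces terraform import).
import {
  to = aws_s3_bucket.user_documents
  id = "prod-user-documents"
}

# Stop managing a resource but leave it running (replaces terraform state rm).
# The matching resource block must be deleted from the configuration.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```

Because these blocks go through plan and code review like any other change, the state operation is visible before it happens and recorded in version control afterwards.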

4. Ignoring Outputs

Problem: Not exposing useful information from modules

# Bad: No outputs
module "networking" {
  source = "./modules/vpc"
}

# Can't reference VPC ID elsewhere!

Solution: Always provide outputs:

# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

5. No Rollback Plan

Problem: Applying changes without a way to roll back

Solution:

Version your modules (can revert to previous version)
Keep old infrastructure in place until new is verified
Use feature flags for gradual rollouts

# Blue-Green deployment pattern
resource "aws_lb_target_group" "blue" {
  # ... current production config
}

resource "aws_lb_target_group" "green" {
  # ... new config
}

# Switch traffic gradually
resource "aws_lb_listener_rule" "main" {
  # ... listener_arn and condition omitted for brevity

  action {
    type = "forward"
    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = var.blue_weight  # Start with 100, gradually shift to 0
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = var.green_weight  # Start with 0, gradually shift to 100
      }
    }
  }
}
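
Two independent weight variables can drift out of step. One possible refinement, sketched below, drives both weights from a single input so they always add up to 100. The 0 to 100 scale is an assumption made for readability.

```hcl
variable "green_weight" {
  description = "Percentage of traffic sent to the green target group"
  type        = number
  default     = 0

  validation {
    condition = (
      var.green_weight >= 0 &&
      var.green_weight <= 100 &&
      floor(var.green_weight) == var.green_weight
    )
    error_message = "The green_weight value must be a whole number between 0 and 100."
  }
}

locals {
  # Blue always receives whatever green does not.
  blue_weight = 100 - var.green_weight
}
```

The listener rule would then use local.blue_weight and var.green_weight, and rolling back is a one-line change that sets green_weight to 0.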

Scaling Your IaC Practice

Team Workflows

For Small Teams (2-5 people):

Single repository
Directory-based environment separation
Direct code review in PRs
Manual applies with approval

For Medium Teams (6-20 people):

Separate repositories for infrastructure and applications
Module registry (internal or Terraform Registry)
Automated planning in CI/CD
Manual applies for production
Dedicated infrastructure team

For Large Organizations (more than 20 people):

Multiple repositories by service/team
Private module registry
Automated planning and applying for non-prod
Approval gates for production
Infrastructure platform team
Self-service portals for developers

Tool Selection

When to Use What:

Terraform:

Multi-cloud environments
Need mature module ecosystem
Team familiar with HCL
State management flexibility

Pulumi:

Team prefers general-purpose languages (Python, TypeScript, Go)
Complex logic in infrastructure code
Need to leverage existing libraries

CloudFormation:

AWS-only infrastructure
Want native AWS integration
No external tools in workflow

CDK:

AWS-focused
Team comfortable with TypeScript/Python
Want type safety and IDE support

Conclusion

Infrastructure as Code is about building reliable, maintainable systems. These practices help you:

Reduce operational risk through consistency
Enable team collaboration through clear patterns
Scale infrastructure management across teams
Maintain security and compliance requirements
Deploy changes confidently

Start with the fundamentals:

Version control everything
Use remote state with locking
Review all changes
Modularize your common patterns
Test before deploying

Then gradually adopt advanced practices as your infrastructure grows. The key is consistency—pick patterns that work for your team and apply them everywhere.

Next Steps

Ready to level up your IaC game?

Audit your current infrastructure: Identify which practices you're already following and which need work
Start with state management: If you're using local state, migrate to remote state immediately
Create your first module: Extract repeated patterns into reusable modules
Implement testing: Start with static analysis (tflint, tfsec) before investing in Terratest
Document decisions: Start an ADR for your next infrastructure change

Have questions or suggestions for this guide? Let us know in the comments below or reach out to our community.