Infrastructure as Code Best Practices: A Practical Guide
Prerequisites
- Basic understanding of cloud infrastructure (AWS, GCP, or Azure)
- Familiarity with at least one IaC tool (Terraform, Pulumi, or CloudFormation)
- Understanding of version control with Git
Introduction
Infrastructure as Code (IaC) has become the standard for managing cloud infrastructure, but the gap between "it works" and "it works well" is vast. This guide explores battle-tested patterns and practices that will help you write maintainable, scalable, and reliable infrastructure code.
Whether you're managing a small startup's AWS environment or orchestrating multi-cloud infrastructure for an enterprise, these principles will help you avoid common pitfalls and build infrastructure that scales with your organization.
Why IaC Best Practices Matter
Poor IaC practices don't just create technical debt; they create operational risk. Consider these real scenarios:
- A terraform apply that takes 45 minutes because everything is in one massive state file
- A production outage caused by applying changes from the wrong branch
- Hours spent debugging why infrastructure works in staging but fails in production
- A critical security patch delayed because nobody understands the infrastructure code
Good IaC practices prevent these scenarios and create infrastructure that's:
- Reproducible: Any team member can recreate environments
- Auditable: Changes are tracked and reviewable
- Testable: Validate changes before they reach production
- Maintainable: New team members can understand and modify code
- Secure: Secrets are managed properly, least privilege is enforced
Core Principles
1. Treat Infrastructure Code Like Application Code
Your infrastructure code deserves the same rigor as your application code. This means:
Version Control Everything
# Good: All infrastructure in version control
infrastructure/
├── terraform/
│   ├── environments/
│   │   ├── prod/
│   │   ├── staging/
│   │   └── dev/
│   └── modules/
├── .gitignore          # Exclude secrets, state files
└── README.md           # Setup instructions
Never commit:
- State files (.tfstate)
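- Secrets or credentials, including provider credentials
- The local .terraform/ working directory
Do commit the dependency lock file (.terraform.lock.hcl): it records the provider versions and checksums that were selected, so every machine and CI run installs the same ones.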
Use Pull Requests for All Changes
Every infrastructure change should go through code review:
- Reviewers catch mistakes before they reach production
- The team learns from each other's approaches
- Documentation happens naturally in PR discussions
- Changes are auditable with clear context
Write Meaningful Commit Messages
# Bad
git commit -m "fix"
# Good
# Each -m becomes its own paragraph, so the subject line stays separate from the body
git commit -m "Add lifecycle policy to prevent accidental RDS deletion" -m "- Implement prevent_destroy on production RDS instances
- Add automated backup retention of 30 days
- Resolves incident #1234 where staging DB was accidentally destroyed"
2. State Management is Critical
State management is where most IaC disasters originate. Handle it with care.
Always Use Remote State
Local state lives on a single machine, where it can't be shared, locked, or easily recovered. Use a remote backend:
# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # State locking (Terraform 1.10+ also supports S3-native locking via use_lockfile)

    # Keep the backend's built-in validation enabled (false is the default for both)
    skip_region_validation      = false
    skip_credentials_validation = false
  }
}
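The backend above assumes the state bucket already exists. A minimal sketch of how that bucket might be defined in a separate bootstrap configuration is shown here. The resource labels are illustrative, and enabling versioning is an assumption added so that earlier state revisions can be recovered.

```hcl
# Bootstrap configuration for the state bucket (illustrative labels)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true # Fail any plan that would delete the state bucket
  }
}

# Keep previous state revisions so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State often contains sensitive values, so block all public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Because this configuration creates the backend that everything else depends on, it is usually applied once and changed rarely.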
Enable State Locking
Concurrent modifications will corrupt your state. Always use locking:
# DynamoDB table for Terraform state locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "shared"
  }
}
Split State Files Strategically
Don't put everything in one state file. Split by:
- Environment: Separate prod, staging, dev
- Lifecycle: Networking separate from applications
- Blast Radius: Critical infrastructure separate from experimental
terraform/
├── networking/          # VPCs, subnets, routing
│   ├── prod/
│   └── staging/
├── data/                # Databases, caches
│   ├── prod/
│   └── staging/
└── applications/        # App infrastructure
    ├── prod/
    └── staging/
Benefits:
- Faster plan/apply cycles
- Reduced blast radius
- Parallel development
- Independent deployment schedules
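Once state is split, one stack often needs values from another. One common way to bridge them is the terraform_remote_state data source. This sketch assumes the networking stack stores its state at the key used in the backend example and exports a vpc_id output; the file path and the security group are hypothetical.

```hcl
# applications/prod/data.tf (hypothetical path)
# Read the outputs of the networking stack from its remote state
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Hypothetical consumer: place a security group in the shared VPC
resource "aws_security_group" "app" {
  name        = "app-servers"
  description = "Application servers"
  vpc_id      = data.terraform_remote_state.networking.outputs.vpc_id
}
```

Only root-module outputs are visible this way, and the reader needs access to the whole state snapshot, so some teams prefer to publish shared values through a parameter store instead.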
3. Module Design Patterns
Modules are how you scale IaC across teams and projects.
Create Focused, Single-Purpose Modules
# Bad: God module that does everything
module "everything" {
source = "./modules/infrastructure"
create_vpc = true
create_rds = true
create_eks = true
# ... 50 more parameters
}
# Good: Focused modules
module "networking" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b"]
}
module "database" {
source = "./modules/rds-postgres"
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
}
Design for Composability
Modules should work together like LEGO blocks:
# Base networking module
module "vpc" {
source = "git::https://github.com/myorg/terraform-modules//vpc?ref=v2.1.0"
name = "production"
cidr_block = "10.0.0.0/16"
availability_zones = data.aws_availability_zones.available.names
# Enable features as needed
enable_nat_gateway = true
enable_vpn_gateway = false
}
# Security module that uses VPC outputs
module "security_groups" {
source = "git::https://github.com/myorg/terraform-modules//security-groups?ref=v1.3.0"
vpc_id = module.vpc.vpc_id
allow_ssh_from = ["10.0.0.0/8"] # Only from private networks
}
# Application module that uses both
module "web_app" {
source = "./modules/web-application"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
security_group_id = module.security_groups.app_sg_id
}
Version Your Modules
Treat modules as versioned packages:
# Bad: Using latest from main branch
module "database" {
source = "git::https://github.com/myorg/terraform-modules//rds"
}
# Good: Pin to specific version
module "database" {
source = "git::https://github.com/myorg/terraform-modules//rds?ref=v2.3.1"
# Upgrade deliberately, test thoroughly
}
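Git refs pin one exact tag. When modules are published to a registry, a version constraint can express the same intent while still allowing patch releases. This is a sketch only: the registry hostname, namespace, and module name are placeholders, not real addresses.

```hcl
module "database" {
  # Placeholder registry address: <hostname>/<namespace>/<name>/<provider>
  source  = "registry.example.com/myorg/rds/aws"
  version = "~> 2.3.0" # Allows 2.3.x patch releases, blocks 2.4.0 and above
}
```

The version argument only works with registry sources; Git sources keep using ?ref=.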
Module Documentation Standards
Every module should ship with a README covering what it creates, how to use it, and its inputs and outputs:
# modules/rds-postgres/README.md
## RDS PostgreSQL Module
Creates a production-ready PostgreSQL RDS instance with:
- Automated backups
- Encryption at rest
- Multi-AZ deployment option
- Parameter group optimization
- CloudWatch monitoring
### Usage
```hcl
module "database" {
  source = "./modules/rds-postgres"

  identifier        = "myapp-prod"
  instance_class    = "db.t3.large"
  allocated_storage = 100
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
}
```
### Inputs
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| identifier | Database identifier | string | n/a | yes |
| instance_class | RDS instance type | string | db.t3.medium | no |
### Outputs
| Name | Description |
|------|-------------|
| endpoint | Database endpoint |
| connection_string | Full connection string |
4. Environment Management
Use Workspaces or Directory Structure
Choose one pattern and stick with it:
Option A: Terraform Workspaces
```bash
# Simple projects, shared configuration
terraform workspace new prod
terraform workspace new staging
terraform workspace select prod
```
Option B: Directory-Based (Recommended for most)
terraform/
└── environments/
    ├── prod/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── terraform.tfvars
    ├── staging/
    │   ├── main.tf
    │   ├── variables.tf
    │   └── terraform.tfvars
    └── shared/
        └── modules/
Directory-based is preferred because:
- Explicit state separation
- Different backend configurations per environment
- Clear blast radius
- Easier to apply environment-specific policies
Never Hardcode Environment Values
# Bad: Hardcoded values
resource "aws_instance" "web" {
instance_type = "t3.large"
count = 5
}
# Good: Variable-driven
variable "instance_type" {
description = "EC2 instance type"
type = string
}
variable "instance_count" {
description = "Number of instances"
type = number
}
resource "aws_instance" "web" {
instance_type = var.instance_type
count = var.instance_count
}
Then use environment-specific tfvars:
# environments/prod/terraform.tfvars
instance_type = "t3.large"
instance_count = 5
# environments/staging/terraform.tfvars
instance_type = "t3.small"
instance_count = 2
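Variable-driven configuration still accepts whatever value it is given. A validation block can reject bad input at plan time. This sketch shows how the two variables above could be tightened; the bounds and the list of allowed instance types are assumptions chosen for illustration.

```hcl
variable "instance_count" {
  description = "Number of instances"
  type        = number

  validation {
    # Assumed bounds: at least one instance, and a cap to catch typos like 50 instead of 5
    condition     = var.instance_count >= 1 && var.instance_count <= 20
    error_message = "instance_count must be between 1 and 20."
  }
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string

  validation {
    # Assumed allow-list; adjust to the sizes your team actually supports
    condition     = contains(["t3.small", "t3.medium", "t3.large"], var.instance_type)
    error_message = "instance_type must be one of t3.small, t3.medium, or t3.large."
  }
}
```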
5. Security Best Practices
Never Store Secrets in Code
# Bad: Secrets in code
resource "aws_db_instance" "db" {
username = "admin"
password = "supersecret123" # NEVER DO THIS
}
# Good: Use secrets management
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/database/master-password"
}
resource "aws_db_instance" "db" {
username = "admin"
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
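Reading the password from a secrets manager keeps it out of the repository, but the value is still written to the state file, which is one more reason to encrypt and restrict access to the state backend. Where a secret has to pass through a variable or an output, marking it sensitive keeps it out of plan and apply output. A sketch with hypothetical names (db_password, the app database name):

```hcl
# Hypothetical variable, supplied at runtime (for example via TF_VAR_db_password)
variable "db_password" {
  description = "Database master password"
  type        = string
  sensitive   = true # Redacted in plan and apply output
}

# Hypothetical output that embeds the secret
output "db_connection_string" {
  description = "Connection string including credentials"
  value       = "postgres://admin:${var.db_password}@${aws_db_instance.db.endpoint}/app"
  sensitive   = true # Terraform rejects a root output that exposes a sensitive value without this flag
}
```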
Implement Least Privilege
# Bad: Overly permissive
resource "aws_iam_role_policy" "app_policy" {
  role = aws_iam_role.app.id # Assumes a role labeled "app" is defined elsewhere
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "*"
      Resource = "*"
    }]
  })
}
# Good: Specific permissions
resource "aws_iam_role_policy" "app_policy" {
  role = aws_iam_role.app.id # Assumes a role labeled "app" is defined elsewhere
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::my-app-bucket/*"
      },
      {
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem"]
        Resource = aws_dynamodb_table.app_table.arn
      }
    ]
  })
}
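The same least-privilege policy can also be written with the aws_iam_policy_document data source, which checks the structure at plan time instead of relying on hand-built JSON. This is an alternative sketch, not a replacement for the example above; the sid values, the policy name, and the aws_iam_role.app reference are assumptions.

```hcl
data "aws_iam_policy_document" "app" {
  statement {
    sid       = "AppBucketObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::my-app-bucket/*"]
  }

  statement {
    sid       = "AppTableItems"
    effect    = "Allow"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem"]
    resources = [aws_dynamodb_table.app_table.arn]
  }
}

resource "aws_iam_role_policy" "app_policy" {
  name   = "app-least-privilege"
  role   = aws_iam_role.app.id # Hypothetical role defined elsewhere
  policy = data.aws_iam_policy_document.app.json
}
```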
Enable Encryption by Default
# S3 bucket with encryption
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
  }
}
# RDS with encryption
resource "aws_db_instance" "db" {
  # ... other configuration
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn
}
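The examples above reference aws_kms_key.s3 without defining it. A minimal sketch of what that key might look like follows; the description, alias name, rotation setting, and deletion window are assumptions rather than values from this guide.

```hcl
# Customer-managed key referenced by the S3 encryption configuration
resource "aws_kms_key" "s3" {
  description             = "Encrypts objects in the data bucket"
  enable_key_rotation     = true # Rotate key material automatically
  deletion_window_in_days = 30   # Longest waiting period before a scheduled deletion takes effect
}

# A readable alias so the key can be found without its ID (hypothetical name)
resource "aws_kms_alias" "s3" {
  name          = "alias/s3-data"
  target_key_id = aws_kms_key.s3.key_id
}
```

A key for RDS (aws_kms_key.rds) would follow the same shape.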
6. Code Organization and Style
Consistent File Naming
terraform/
├── main.tf           # Primary resources
├── variables.tf      # Input variables
├── outputs.tf        # Output values
├── versions.tf       # Provider versions
├── backend.tf        # Backend configuration
├── data.tf           # Data sources
├── locals.tf         # Local values
└── README.md         # Documentation
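As an illustration of what versions.tf typically holds, here is a minimal sketch. The version numbers are placeholders; pin whatever your team has actually tested.

```hcl
# versions.tf
terraform {
  # Placeholder range: accept 1.x releases from 1.5.0 onward
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Placeholder: any 5.x release, never 6.0
    }
  }
}
```

Keeping these constraints in one file makes provider upgrades a small, reviewable diff.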
Use Meaningful Resource Names
# Bad: Generic names
resource "aws_instance" "instance1" {
# ...
}
resource "aws_security_group" "sg2" {
# ...
}
# Good: Descriptive names
resource "aws_instance" "web_server" {
# ...
}
resource "aws_security_group" "web_server_https" {
name = "web-server-https"
description = "Allow inbound HTTPS traffic to web servers"
# ...
}
Leverage Locals for Complex Logic
locals {
  # Environment-specific configuration
  environment_config = {
    prod = {
      instance_count = 5
      instance_type  = "t3.large"
      enable_backup  = true
    }
    staging = {
      instance_count = 2
      instance_type  = "t3.small"
      enable_backup  = false
    }
  }
  config = local.environment_config[var.environment]

  # Common tags
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
  }
}
resource "aws_instance" "web" {
  count         = local.config.instance_count
  instance_type = local.config.instance_type
  tags = merge(
    local.common_tags,
    {
      Name = "web-server-${count.index + 1}"
      Role = "web"
    }
  )
}
7. Change Management
Always Run Plan Before Apply
# Development workflow
terraform fmt # Format code
terraform validate # Validate syntax
terraform plan -out=plan # Generate plan
# Review plan thoroughly
terraform apply plan # Apply saved plan
Use Plan Files for Safety
# Generate a timestamped plan file and keep its name for the later steps
PLAN_FILE="prod-change-$(date +%Y%m%d-%H%M%S).tfplan"
terraform plan -out="$PLAN_FILE"
# Review the plan
terraform show "$PLAN_FILE"
# Apply exactly what was reviewed (Terraform rejects the plan if state has changed since it was created)
terraform apply "$PLAN_FILE"
Implement Approval Gates
For production changes:
- Developer creates infrastructure PR
- Automated CI runs terraform plan
- Plan output posted to PR as comment
- Required reviewers approve
- Merge triggers automated apply (or manual apply with approval)
Example GitHub Actions Workflow:
name: Terraform Plan
on:
  pull_request:
    paths:
      - 'terraform/**'
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
        working-directory: ./terraform/environments/prod
      - name: Terraform Plan
        run: terraform plan -no-color
        working-directory: ./terraform/environments/prod
        continue-on-error: true
      - name: Comment Plan
        uses: actions/github-script@v6
        with:
          script: |
            // Post plan output as PR comment
8. Testing Infrastructure Code
Unit Testing with Terratest
// test/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"cidr_block": "10.0.0.0/16",
			"name":       "test-vpc",
		},
	}

	// Deferred so the test resources are destroyed even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)

	// Note: this applies real resources in the configured cloud account
	terraform.InitAndApply(t, terraformOptions)

	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcId)
}
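Terraform 1.6 and later also include a native test runner (terraform test) that reads .tftest.hcl files, which can be a lighter first step than a Go test suite. This sketch assumes the VPC module defines a resource named aws_vpc.main whose CIDR comes from the cidr_block variable; the file path and run name are hypothetical.

```hcl
# modules/vpc/tests/vpc.tftest.hcl (hypothetical path)
variables {
  cidr_block = "10.0.0.0/16"
  name       = "test-vpc"
}

run "vpc_uses_requested_cidr" {
  command = plan # Evaluate the plan only; nothing is created

  assert {
    condition     = aws_vpc.main.cidr_block == var.cidr_block
    error_message = "VPC CIDR block does not match the cidr_block input."
  }
}
```

Even with command = plan, the AWS provider still needs a region and credentials to build the plan.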
Static Analysis
Use tools to catch issues before deployment:
# TFLint - Terraform linting
tflint --init
tflint
# Checkov - Security scanning
checkov -d terraform/
# TFSec - Security scanner
tfsec terraform/
# Terraform validate
terraform validate
Pre-commit Hooks
Automate checks locally:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.77.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: terraform_tfsec
9. Documentation
Document the "Why" Not Just the "What"
# Bad comment
# Create S3 bucket
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
}
# Good comment
# S3 bucket for user-uploaded documents
# Versioning enabled to meet compliance requirements (SOC2 3.1.2)
# Lifecycle policy moves objects to Glacier after 90 days to reduce costs
resource "aws_s3_bucket" "user_documents" {
  bucket = "prod-user-documents"
  tags = {
    Purpose    = "User uploaded content storage"
    Compliance = "SOC2"
    DataClass  = "customer-data"
  }
}
Maintain an Architecture Decision Record (ADR)
# ADR-003: Use Separate VPCs Per Environment
## Status
Accepted
## Context
We need to decide whether to use a single VPC with subnet isolation
or separate VPCs for each environment (prod, staging, dev).
## Decision
We will use separate VPCs for each environment.
## Consequences
### Positive
- Complete network isolation between environments
- Easier to apply different security policies
- Independent IP address space management
- Reduced blast radius for network changes
### Negative
- Increased complexity in VPC peering if needed
- Higher AWS costs (NAT gateways per VPC)
- More infrastructure to manage
## Implementation Notes
- Production VPC: 10.0.0.0/16
- Staging VPC: 10.1.0.0/16
- Development VPC: 10.2.0.0/16
10. Monitoring and Observability
Tag Everything
locals {
  required_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
    Project     = var.project_name
  }
}
# Apply to all resources
resource "aws_instance" "app" {
  count = var.instance_count # count.index below is only valid when count is set
  # ... configuration
  tags = merge(
    local.required_tags,
    {
      Name = "app-server-${count.index}"
      Role = "application"
    }
  )
}
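Merging tags on every resource is easy to forget. With the AWS provider, a default_tags block can apply the required tags once at the provider level, leaving each resource to add only what is specific to it. A sketch, assuming the required_tags local above; the region and the single-instance resource are illustrative.

```hcl
provider "aws" {
  region = "us-east-1" # Assumed region, matching the backend example

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = local.required_tags
  }
}

resource "aws_instance" "app" {
  # ... configuration

  # Resource-level tags are merged on top of the provider defaults
  tags = {
    Name = "app-server"
    Role = "application"
  }
}
```

A few resource types, such as Auto Scaling groups, do not inherit default_tags and still need tags set explicitly.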
Enable Drift Detection
# Detect configuration drift
terraform plan -detailed-exitcode
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected
Set up automated drift detection:
# .github/workflows/drift-detection.yml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 8 * * *' # Daily at 08:00 UTC
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Check for Drift
        run: |
          terraform init
          terraform plan -detailed-exitcode
        working-directory: ./terraform/environments/prod
      - name: Notify on Drift
        # Runs on any non-zero exit: 2 means drift, 1 means the plan itself failed
        if: failure()
        run: |
          # Send alert to Slack/PagerDuty
          echo "Drift detected in production infrastructure"
Common Antipatterns to Avoid
1. The God Module
Problem: One massive module that creates everything
# DON'T DO THIS
module "infrastructure" {
source = "./modules/everything"
# 100+ parameters
create_vpc = true
create_rds = true
create_eks = true
create_cloudfront = true
# ... endless configuration
}
Solution: Break into focused modules
2. Copy-Paste Infrastructure
Problem: Duplicating code across environments
terraform/
├── prod/
│   ├── main.tf        # 500 lines
│   └── variables.tf
└── staging/
    ├── main.tf        # Same 500 lines, slightly different
    └── variables.tf
Solution: Use modules or workspaces
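With modules, each environment's main.tf shrinks to a short call that passes environment-specific values, and the 500 shared lines live in one place. A sketch, assuming a shared web-application module that accepts these three inputs; the relative path and input names are hypothetical.

```hcl
# prod/main.tf
module "web_app" {
  source = "../modules/web-application" # Hypothetical shared module path

  environment    = "prod"
  instance_type  = "t3.large"
  instance_count = 5
}
```

The staging directory would contain the same call with its own values:

```hcl
# staging/main.tf
module "web_app" {
  source = "../modules/web-application" # Same module, different inputs

  environment    = "staging"
  instance_type  = "t3.small"
  instance_count = 2
}
```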
3. Manual State Manipulation
Problem: Running terraform state rm or editing state files manually
Solution: Use proper lifecycle management:
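resource "aws_instance" "app" {
  # ... configuration
  lifecycle {
    prevent_destroy       = true # Fail any plan that would destroy this resource, including replacements
    create_before_destroy = true # When a replacement is allowed, create the new resource before removing the old one
    ignore_changes = [
      tags["LastModified"], # Ignore changes made to this tag outside Terraform
    ]
  }
}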
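Many of the situations that tempt people into manual state surgery (renames, adopting existing resources, handing a resource off) can be expressed in configuration instead, where they go through plan and code review. A sketch using the moved, import, and removed blocks; the resource addresses reuse names from earlier examples, and aws_instance.legacy_worker is hypothetical.

```hcl
# Rename a resource without destroying it (Terraform 1.1+)
moved {
  from = aws_instance.instance1
  to   = aws_instance.web_server
}

# Adopt an existing bucket into state (Terraform 1.5+)
import {
  to = aws_s3_bucket.user_documents
  id = "prod-user-documents"
}

# Stop managing a resource but leave it running (Terraform 1.7+)
removed {
  from = aws_instance.legacy_worker # Hypothetical resource

  lifecycle {
    destroy = false
  }
}
```

Each block shows up in terraform plan as a state change with no infrastructure change, which makes the intent reviewable.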
4. Ignoring Outputs
Problem: Not exposing useful information from modules
# Bad: No outputs
module "networking" {
source = "./modules/vpc"
}
# Can't reference VPC ID elsewhere!
Solution: Always provide outputs:
# modules/vpc/outputs.tf
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "private_subnet_ids" {
description = "List of private subnet IDs"
value = aws_subnet.private[*].id
}
output "public_subnet_ids" {
description = "List of public subnet IDs"
value = aws_subnet.public[*].id
}
5. No Rollback Plan
Problem: Applying changes without a way to roll back
Solution:
- Version your modules (can revert to previous version)
- Keep old infrastructure in place until new is verified
- Use feature flags for gradual rollouts
# Blue-Green deployment pattern
resource "aws_lb_target_group" "blue" {
  # ... current production config
}
resource "aws_lb_target_group" "green" {
  # ... new config
}
# Switch traffic gradually
resource "aws_lb_listener_rule" "main" {
  # ... listener_arn and condition omitted for brevity
  action {
    type = "forward"
    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = var.blue_weight # Start with 100, gradually shift to 0
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = var.green_weight # Start with 0, gradually shift to 100
      }
    }
  }
}
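Two independent weight variables can drift out of sync during a rollout. One way to guard against that, sketched here as an assumption rather than part of the pattern above, is to take a single validated input and derive the other weight from it.

```hcl
variable "blue_weight" {
  description = "Percentage of traffic sent to the blue target group"
  type        = number
  default     = 100

  validation {
    # Whole numbers from 0 to 100 only
    condition     = var.blue_weight >= 0 && var.blue_weight <= 100 && floor(var.blue_weight) == var.blue_weight
    error_message = "blue_weight must be a whole number between 0 and 100."
  }
}

locals {
  # Derived so the two weights always total 100; use in place of var.green_weight
  green_weight = 100 - var.blue_weight
}
```

Rolling back then means setting blue_weight back to 100 and applying.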
Scaling Your IaC Practice
Team Workflows
For Small Teams (2-5 people):
- Single repository
- Directory-based environment separation
- Direct code review in PRs
- Manual applies with approval
For Medium Teams (5-20 people):
- Separate repositories for infrastructure and applications
- Module registry (internal or Terraform Registry)
- Automated planning in CI/CD
- Manual applies for production
- Dedicated infrastructure team
For Large Organizations (20+ people):
- Multiple repositories by service/team
- Private module registry
- Automated planning and applying for non-prod
- Approval gates for production
- Infrastructure platform team
- Self-service portals for developers
Tool Selection
When to Use What:
Terraform:
- Multi-cloud environments
- Need mature module ecosystem
- Team familiar with HCL
- State management flexibility
Pulumi:
- Team prefers general-purpose languages (Python, TypeScript, Go)
- Complex logic in infrastructure code
- Need to leverage existing libraries
CloudFormation:
- AWS-only infrastructure
- Want native AWS integration
- No external tools in workflow
CDK:
- AWS-focused
- Team comfortable with TypeScript/Python
- Want type safety and IDE support
Conclusion
Infrastructure as Code is about building reliable, maintainable systems. These practices help you:
- Reduce operational risk through consistency
- Enable team collaboration through clear patterns
- Scale infrastructure management across teams
- Maintain security and compliance requirements
- Deploy changes confidently
Start with the fundamentals:
- Version control everything
- Use remote state with locking
- Review all changes
- Modularize your common patterns
- Test before deploying
Then gradually adopt advanced practices as your infrastructure grows. The key is consistency: pick patterns that work for your team and apply them everywhere.
Next Steps
Ready to level up your IaC game?
- Audit your current infrastructure: Identify which practices you're already following and which need work
- Start with state management: If you're using local state, migrate to remote state immediately
- Create your first module: Extract repeated patterns into reusable modules
- Implement testing: Start with static analysis (tflint, tfsec) before investing in Terratest
- Document decisions: Start an ADR for your next infrastructure change
Additional Resources
- Terraform Best Practices by Gruntwork
- Google Cloud's Best Practices for Terraform
- HashiCorp Learn
- Terratest Testing Framework