Terraform State Management Deep Dive
Prerequisites
- Terraform basics (resource syntax, terraform apply)
- Experience running Terraform in production
- Understanding of JSON format
- Basic knowledge of cloud storage (S3, GCS, or Azure Storage)
Introduction
Terraform state is where most infrastructure disasters originate. A corrupted state file, a lost state, concurrent modifications, or accidentally exposed secrets can destroy infrastructure or create security incidents in seconds.
This guide goes deep on Terraform state: what it is, how it works internally, how to configure it properly, how to manipulate it safely, and how to recover when things go wrong. By the end, you'll understand state well enough to debug complex issues and design resilient state management strategies for your team.
What is Terraform State?
State is Terraform's source of truth about your infrastructure. It's a mapping between your configuration files and the real resources in your cloud provider.
The Core Problem State Solves
Without state, Terraform would need to query your cloud provider every time to discover what exists. This is slow, error-prone, and limited by API rate limits. State caches this information locally.
Your Code                            State File       Cloud Provider
---------                            ----------       --------------
resource "aws_s3_bucket" "data" {    -- maps to -->   Actual bucket:
  bucket = "my-bucket"                                  ID: my-bucket-xyz123
}                                                       Region: us-east-1
                                                        Versioning: enabled
What's Actually in a State File?
Let's examine a real state file:
{
  "version": 4,
  "terraform_version": "1.6.0",
  "serial": 5,
  "lineage": "f3c8b2a1-4d5e-6f7a-8b9c-0d1e2f3a4b5c",
  "outputs": {
    "bucket_name": {
      "value": "my-data-bucket-xyz123",
      "type": "string"
    }
  },
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "data",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "id": "my-data-bucket-xyz123",
            "arn": "arn:aws:s3:::my-data-bucket-xyz123",
            "bucket": "my-data-bucket-xyz123",
            "region": "us-east-1",
            "versioning": {
              "enabled": true,
              "mfa_delete": false
            },
            "tags": {
              "Environment": "production"
            }
          },
          "private": "eyJzY2hlbWFfdmVyc2lvbiI6IjAifQ==",
          "dependencies": []
        }
      ]
    }
  ]
}
Key Fields:
- version: State file format version (currently 4)
- terraform_version: Terraform version that wrote this state
- serial: Increments with each write (detects conflicts)
- lineage: UUID that stays constant for a state's lifetime (detects state splits)
- resources: The actual infrastructure mappings
- outputs: Exported values
- private: Base64-encoded provider-specific metadata
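These fields can also be read programmatically, which is handy when auditing many state files. A minimal sketch, assuming a v4-format state document like the one above (the trimmed example below is illustrative):

```python
import json

def state_summary(state_json: str) -> dict:
    """Extract the key metadata fields from a raw Terraform state document."""
    state = json.loads(state_json)
    return {
        "format_version": state["version"],
        "written_by": state["terraform_version"],
        "serial": state["serial"],    # bumps on every write
        "lineage": state["lineage"],  # constant for the state's lifetime
        "resource_count": len(state.get("resources", [])),
    }

# Example against a trimmed-down state document
raw = '{"version": 4, "terraform_version": "1.6.0", "serial": 5, "lineage": "f3c8b2a1", "resources": []}'
print(state_summary(raw))
```

A script like this pairs naturally with `terraform state pull`, which emits exactly this JSON on stdout.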
State vs Configuration
This distinction is critical:
# Configuration (what you want)
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.medium"
}

// State (what exists)
{
  "type": "aws_instance",
  "name": "web",
  "attributes": {
    "id": "i-0abc123def456789",
    "ami": "ami-12345678",
    "instance_type": "t3.medium",
    "public_ip": "203.0.113.42",
    "private_ip": "10.0.1.45"
  }
}
State includes computed values (IDs, IPs) that don't exist in your configuration.
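To make the distinction concrete, a tiny sketch (attribute names taken from the example above) that separates what you configured from what the provider computed:

```python
# Attributes you wrote in configuration vs. attributes recorded in state.
config_attrs = {"ami", "instance_type"}
state_attrs = {"id", "ami", "instance_type", "public_ip", "private_ip"}

# Everything in state but not in config was computed by the provider.
computed = state_attrs - config_attrs
print(sorted(computed))  # ['id', 'private_ip', 'public_ip']
```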
Remote State Backends
Local state files are dangerous. Remote backends are essential for any real infrastructure.
S3 Backend (AWS)
The most common backend for AWS infrastructure:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
# Optional but recommended
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abc-def-123"
}
}
Initial Setup:
# Create S3 bucket for state
resource "aws_s3_bucket" "terraform_state" {
bucket = "my-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.terraform_state.arn
}
}
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# DynamoDB table for locking
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
lifecycle {
prevent_destroy = true
}
}
Migrating from Local to S3:
# Step 1: Add backend configuration to backend.tf
# Step 2: Initialize with migration
terraform init -migrate-state
# Terraform will prompt:
# Do you want to copy existing state to the new backend? yes
# Step 3: Verify
terraform state list
GCS Backend (GCP)
# backend.tf
terraform {
backend "gcs" {
bucket = "my-terraform-state"
prefix = "prod/networking"
# Optional: customer-managed encryption
encryption_key = "projects/my-project/locations/us/keyRings/terraform/cryptoKeys/state"
}
}
Setup:
resource "google_storage_bucket" "terraform_state" {
name = "my-terraform-state"
location = "US"
force_destroy = false
versioning {
enabled = true
}
encryption {
default_kms_key_name = google_kms_crypto_key.terraform_state.id
}
lifecycle_rule {
action {
type = "Delete"
}
condition {
num_newer_versions = 10
with_state = "ARCHIVED"
}
}
}
Azure Backend
# backend.tf
terraform {
backend "azurerm" {
resource_group_name = "terraform-state-rg"
storage_account_name = "tfstateaccount"
container_name = "tfstate"
key = "prod.terraform.tfstate"
# Optional: use managed identity
use_msi = true
}
}
Terraform Cloud Backend
# backend.tf
# Note: Terraform 1.1+ also offers a dedicated `cloud` block for this
terraform {
backend "remote" {
organization = "my-company"
workspaces {
name = "production-networking"
}
}
}
Benefits:
- Built-in locking
- State versioning and history
- Role-based access control
- Audit logs
- Cost estimation
- Policy as code (Sentinel)
Drawbacks:
- Vendor lock-in
- Requires Terraform Cloud account
- Potential latency for large states
Backend Selection Guide
Factor               S3/GCS/Azure                     Terraform Cloud                  Local
------               ------------                     ---------------                  -----
Cost                 Storage only (~$1/month)         Free tier limited                Free
Setup complexity     Medium                           Low                              None
Team collaboration   Manual setup needed              Built-in                         Not suitable
State locking        Requires DynamoDB/extra setup    Built-in                         None
Audit logs           Requires CloudTrail/extra setup  Built-in                         None
Best for             Self-hosted infrastructure       Teams wanting managed solution   Personal projects only
State Locking
State locking prevents concurrent modifications that corrupt state.
How Locking Works
Developer 1                  State Backend               Developer 2
-----------                  -------------               -----------
terraform plan
  └─> Request lock  ──────>  Acquire lock
      Lock granted  <──────
                                                         terraform plan
                                                           └─> Request lock
                             Lock denied! ───────────>         (waits...)
terraform apply
  └─> Modify state  ──────>  Update state
  └─> Release lock  ──────>  Release lock
                             Lock granted ───────────>   Continue plan
DynamoDB Locking (AWS)
Lock entries look like this:
{
  "LockID": "my-terraform-state/prod/networking/terraform.tfstate-md5",
  "Info": {
    "ID": "abc-123-def-456",
    "Operation": "OperationTypeApply",
    "Path": "prod/networking/terraform.tfstate",
    "Version": "1.6.0",
    "Created": "2024-02-06T15:30:00Z",
    "Who": "[email protected]"
  }
}
Dealing with Stuck Locks
Scenario: Someone's terraform apply crashed, leaving a lock.
# Don't immediately force-unlock!
# First, verify the lock is actually stuck:
# Check DynamoDB (AWS)
aws dynamodb scan \
--table-name terraform-state-lock \
--filter-expression "begins_with(LockID, :prefix)" \
--expression-attribute-values '{":prefix":{"S":"my-terraform-state/prod"}}'
# Verify no one is actually running terraform
# Check with team, review CI/CD pipelines
# Only then force unlock:
terraform force-unlock <LOCK_ID>
Better: Automatic Lock Expiration
# Lambda function to clean up old locks
import json
import boto3
from datetime import datetime, timedelta, timezone

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('terraform-state-lock')

def handler(event, context):
    """Remove locks older than 1 hour"""
    # Scan for all locks
    response = table.scan()
    for item in response['Items']:
        info = json.loads(item['Info'])
        # "Created" is ISO 8601 with a trailing Z; normalize for fromisoformat
        created = datetime.fromisoformat(info['Created'].replace('Z', '+00:00'))
        # If lock is older than 1 hour, remove it
        if datetime.now(timezone.utc) - created > timedelta(hours=1):
            print(f"Removing stale lock: {item['LockID']}")
            table.delete_item(Key={'LockID': item['LockID']})
Locking Best Practices
- Always use locking in team environments
- Never disable locking in production
- Set up monitoring for stuck locks
- Document lock force-unlock procedures
- Use CI/CD with exclusive runs (no parallel applies)
State Manipulation Commands
Sometimes you need to manually manipulate state. Here's when and how.
terraform state list
List all resources in state:
# List all resources
terraform state list
# Filter by resource type
terraform state list | grep aws_instance
# Example output:
# aws_instance.web[0]
# aws_instance.web[1]
# aws_security_group.web
# aws_lb.main
terraform state show
Show details of a specific resource:
terraform state show 'aws_instance.web[0]'
# Output:
# resource "aws_instance" "web" {
# id = "i-0abc123def456789"
# ami = "ami-12345678"
# instance_type = "t3.medium"
# private_ip = "10.0.1.45"
# public_ip = "203.0.113.42"
# ...
# }
terraform state mv
Move/rename resources in state:
Use Case 1: Renaming a Resource
# Old configuration
resource "aws_instance" "server" {
# ...
}
# New configuration
resource "aws_instance" "web_server" {
# ...
}
# Move state to match new name
terraform state mv 'aws_instance.server' 'aws_instance.web_server'
# Now terraform plan shows no changes
Use Case 2: Refactoring into a Module
# Before: Resources at root level
resource "aws_vpc" "main" { }
resource "aws_subnet" "private" { }
# After: Resources in module
module "networking" {
source = "./modules/vpc"
}
# Move resources into module
terraform state mv 'aws_vpc.main' 'module.networking.aws_vpc.main'
terraform state mv 'aws_subnet.private' 'module.networking.aws_subnet.private'
Use Case 3: Moving Between State Files
# Move resource from one state to another
terraform state mv \
-state=source.tfstate \
-state-out=destination.tfstate \
'aws_s3_bucket.logs' \
'aws_s3_bucket.logs'
terraform state rm
Remove resources from state without destroying them:
Use Case: Import Existing Resource Elsewhere
# Remove from Terraform management
terraform state rm 'aws_instance.legacy'
# The instance still exists in AWS, but Terraform won't manage it
Use Case: Prevent Destruction
# You want to delete configuration but keep the resource
terraform state rm 'aws_db_instance.important_database'
# Now you can remove it from config without terraform destroy deleting it
terraform state pull / push
Direct state manipulation:
# Download state to local file
terraform state pull > backup.tfstate
# Edit manually (DANGEROUS - only for experts)
# ... edit backup.tfstate ...
# Upload modified state
terraform state push backup.tfstate
When to use state push:
- Disaster recovery
- Migrating between backends
- Fixing corrupted state (extremely rare)
Warning: state push overwrites remote state. It refuses a push whose lineage differs or whose serial is older than the current state, unless you pass -force, which bypasses those checks entirely. Use with extreme caution.
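When scripting recoveries, the same lineage/serial guard can be applied before ever invoking a push. A minimal sketch of the check only (no push), assuming state documents shaped like the examples earlier:

```python
import json

def safe_to_push(current_raw: str, candidate_raw: str) -> tuple:
    """Return (ok, reason). Push only if lineage matches and serial advances."""
    current = json.loads(current_raw)
    candidate = json.loads(candidate_raw)
    if candidate["lineage"] != current["lineage"]:
        return False, "lineage differs: candidate is from a different state history"
    if candidate["serial"] <= current["serial"]:
        return False, "candidate serial is not newer: push would roll back writes"
    return True, "ok"

current = '{"lineage": "abc", "serial": 7}'
stale   = '{"lineage": "abc", "serial": 5}'
ok, reason = safe_to_push(current, stale)
print(ok, "-", reason)
```

Running the check against `terraform state pull` output before a manual push catches the two most common mistakes: pushing a backup from the wrong project, and pushing a state older than what is already stored.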
terraform import
Import existing infrastructure into Terraform:
Example: Import Existing EC2 Instance
# 1. Write the configuration
resource "aws_instance" "existing" {
ami = "ami-12345678"
instance_type = "t3.medium"
# Add other required arguments
}
# 2. Import the existing resource
terraform import 'aws_instance.existing' i-0abc123def456789
# 3. Run terraform plan to see what attributes are missing
terraform plan
# 4. Update configuration to match actual resource
# 5. Re-run until terraform plan shows no changes
Bulk Import Script:
# import_resources.py
import subprocess

# List of resources to import
resources = [
    {"type": "aws_instance", "name": "web1", "id": "i-abc123"},
    {"type": "aws_instance", "name": "web2", "id": "i-def456"},
    {"type": "aws_security_group", "name": "web", "id": "sg-xyz789"},
]

for resource in resources:
    address = f"{resource['type']}.{resource['name']}"
    resource_id = resource['id']
    print(f"Importing {address}...")
    result = subprocess.run(
        ["terraform", "import", address, resource_id],
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("  ✓ Success")
    else:
        print(f"  ✗ Failed: {result.stderr}")
terraform refresh
Update state to match real infrastructure:
# Refresh state without making changes
terraform refresh
# Better: Use plan's refresh
terraform plan -refresh-only
# Apply the refresh
terraform apply -refresh-only
When to refresh:
- Resources modified outside Terraform
- Drift detection
- Verifying manual changes
Warning: plain terraform refresh writes state updates immediately, with no chance to review them (it is deprecated in modern Terraform for exactly this reason). Prefer terraform plan -refresh-only so you can inspect the detected changes before accepting them.
Common State Problems and Solutions
Problem 1: State Drift
Symptom: terraform plan shows unexpected changes
Cause: Resources modified outside Terraform
Detection:
# Detect drift
terraform plan -detailed-exitcode
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected (drift)
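The exit-code contract above lends itself to scripting. A sketch in Python: the exit-code mapping is the documented contract, while the `detect_drift` helper is a hypothetical wrapper that assumes terraform is on PATH:

```python
import subprocess

def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    return {0: "clean", 1: "error", 2: "drift"}.get(code, "unknown")

def detect_drift(workdir: str) -> str:
    """Run a plan in the given directory and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    return classify_plan_exit(result.returncode)

print(classify_plan_exit(2))  # drift
```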
Solution:
# Option 1: Accept manual changes
terraform apply -refresh-only
# Option 2: Revert to Terraform configuration
terraform apply
Prevention:
# Prevent manual modifications
resource "aws_s3_bucket" "important" {
bucket = "critical-data"
lifecycle {
prevent_destroy = true
# Ignore specific attributes modified externally
ignore_changes = [
tags["LastModified"],
]
}
}
Problem 2: Corrupted State
Symptom: Terraform errors, invalid JSON, or missing resources
Cause:
- Concurrent modifications without locking
- Manual state editing gone wrong
- Storage backend corruption
Recovery:
# Step 1: Pull current state
terraform state pull > corrupted.tfstate
# Step 2: Validate JSON
python -m json.tool corrupted.tfstate > /dev/null
# Step 3: If S3 backend with versioning, restore previous version
aws s3api list-object-versions \
--bucket my-terraform-state \
--prefix prod/terraform.tfstate
# Get version ID from output
aws s3api get-object \
--bucket my-terraform-state \
--key prod/terraform.tfstate \
--version-id <VERSION_ID> \
recovered.tfstate
# Step 4: Push recovered state
terraform state push recovered.tfstate
# Step 5: Verify
terraform plan
Problem 3: Lost State
Symptom: State file missing entirely
Recovery (S3 with versioning):
# List all versions
aws s3api list-object-versions \
--bucket my-terraform-state \
--prefix prod/terraform.tfstate
# Restore latest version
aws s3api get-object \
--bucket my-terraform-state \
--key prod/terraform.tfstate \
--version-id <LATEST_VERSION_ID> \
terraform.tfstate
# Re-initialize
terraform init -reconfigure
# Verify
terraform plan
Recovery (no backups - disaster scenario):
# 1. Generate state from existing infrastructure
# Use terraform import for every resource
# 2. Create import script
cat > import.sh << 'EOF'
#!/bin/bash
terraform import 'aws_vpc.main' vpc-abc123
terraform import 'aws_subnet.public[0]' subnet-def456
terraform import 'aws_instance.web[0]' i-ghi789
# ... repeat for all resources
EOF
chmod +x import.sh
./import.sh
# 3. Verify all resources imported
terraform state list
# 4. Run plan to verify no changes needed
terraform plan
Problem 4: State File Too Large
Symptom: Slow plan/apply operations, timeouts
Diagnosis:
# Check state size
terraform state pull | wc -c
# Count resources
terraform state list | wc -l
# Identify large resources
terraform state pull | jq -r '.resources[] | "\(.type).\(.name): \(.instances | length) instances"' | sort -t: -k2 -rn
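The same breakdown can be done in Python over a pulled state file, ranking resources by serialized size rather than instance count. A sketch assuming the v4 state layout shown earlier (the sample `raw` document is illustrative):

```python
import json

def largest_resources(state_raw: str, top: int = 5) -> list:
    """Rank resources by the serialized size of their instances (a rough proxy)."""
    state = json.loads(state_raw)
    sizes = []
    for res in state.get("resources", []):
        address = f'{res["type"]}.{res["name"]}'
        size = len(json.dumps(res.get("instances", [])))
        sizes.append((address, size))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top]

# Illustrative state: one bloated instance (e.g. inline user_data), one small one
raw = json.dumps({"resources": [
    {"type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-1", "user_data": "x" * 500}}]},
    {"type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "logs"}}]},
]})
for address, size in largest_resources(raw):
    print(address, size)
```

Resources with large inline blobs (user_data scripts, policy documents) usually dominate the ranking and are good candidates to move into their own state.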
Solution: Split State
# Example: Split networking from applications
# 1. Create separate directories
mkdir -p networking applications
# 2. Move networking resources to own state
cd networking
terraform init
# Import resources (or move state file subset)
terraform state mv \
-state=../original.tfstate \
-state-out=terraform.tfstate \
'aws_vpc.main' \
'aws_vpc.main'
# 3. Repeat for all networking resources
# 4. Use remote state data source to share outputs
Networking State:
# networking/outputs.tf
output "vpc_id" {
value = aws_vpc.main.id
}
output "subnet_ids" {
value = aws_subnet.private[*].id
}
Application State:
# applications/main.tf
data "terraform_remote_state" "networking" {
backend = "s3"
config = {
bucket = "my-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_instance" "app" {
subnet_id = data.terraform_remote_state.networking.outputs.subnet_ids[0]
# ...
}
Problem 5: Serial Number Conflicts
Symptom: "serial number mismatch" error
Cause: Concurrent modifications or state rollback
Solution:
# Pull current state
terraform state pull > current.tfstate
# Check serial numbers
grep "serial" current.tfstate
# If serial is lower than expected, someone pushed an old state
# Verify with team, then decide:
# Option 1: Accept current state
terraform plan
# Option 2: Restore correct version from backup
aws s3api get-object \
--bucket my-terraform-state \
--key prod/terraform.tfstate \
--version-id <CORRECT_VERSION> \
correct.tfstate
terraform state push correct.tfstate
State Security
State files contain sensitive information. Treat them like production databases.
What's in State That's Sensitive?
// State may contain:
{
  "resources": [{
    "type": "aws_db_instance",
    "attributes": {
      "password": "supersecretpassword",  // Plaintext passwords!
      "endpoint": "db.example.com:5432",  // Internal endpoints
      "username": "admin"                 // Usernames
    }
  }]
}
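A quick way to audit what a pulled state exposes is to walk its attributes for suspicious key names. A minimal sketch (the marker list is illustrative, not exhaustive, and the sample document mirrors the one above):

```python
import json

SUSPICIOUS = ("password", "secret", "token", "private_key")

def find_sensitive(state_raw: str) -> list:
    """Return resource attribute paths whose key names look sensitive."""
    state = json.loads(state_raw)
    hits = []
    for res in state.get("resources", []):
        for inst in res.get("instances", []):
            for key in inst.get("attributes", {}):
                if any(marker in key.lower() for marker in SUSPICIOUS):
                    hits.append(f'{res["type"]}.{res["name"]}.{key}')
    return hits

raw = json.dumps({"resources": [{
    "type": "aws_db_instance", "name": "db",
    "instances": [{"attributes": {"password": "hunter2", "endpoint": "db:5432"}}],
}]})
print(find_sensitive(raw))  # ['aws_db_instance.db.password']
```

Key-name matching misses sensitive values stored under innocuous names, so treat this as a first pass alongside proper secrets-scanning tools, not a replacement for them.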
Security Best Practices
1. Encrypt State at Rest
# S3 backend with KMS encryption
terraform {
backend "s3" {
bucket = "terraform-state"
key = "prod/terraform.tfstate"
region = "us-east-1"
encrypt = true # Enable encryption
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abc-def"
}
}
2. Restrict Access with IAM
# IAM policy for state bucket access
resource "aws_iam_policy" "terraform_state_access" {
name = "terraform-state-access"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"s3:ListBucket"
]
Resource = "arn:aws:s3:::terraform-state"
},
{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "arn:aws:s3:::terraform-state/*"
Condition = {
StringEquals = {
"s3:x-amz-server-side-encryption" = "aws:kms"
}
}
},
{
Effect = "Allow"
Action = [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:DeleteItem"
]
Resource = "arn:aws:dynamodb:us-east-1:123456789012:table/terraform-locks"
}
]
})
}
# Only allow access from specific roles
resource "aws_iam_role" "terraform_automation" {
name = "terraform-automation"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
AWS = [
"arn:aws:iam::123456789012:role/github-actions",
"arn:aws:iam::123456789012:role/gitlab-ci"
]
}
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "terraform_state" {
role = aws_iam_role.terraform_automation.name
policy_arn = aws_iam_policy.terraform_state_access.arn
}
3. Enable Audit Logging
# CloudTrail for S3 state access
resource "aws_cloudtrail" "state_access" {
name = "terraform-state-access"
s3_bucket_name = aws_s3_bucket.cloudtrail.id
include_global_service_events = true
is_multi_region_trail = true
event_selector {
read_write_type = "All"
include_management_events = true
data_resource {
type = "AWS::S3::Object"
values = [
"${aws_s3_bucket.terraform_state.arn}/*"
]
}
}
}
# Alert on state access
resource "aws_cloudwatch_log_metric_filter" "state_access" {
name = "terraform-state-access"
log_group_name = aws_cloudwatch_log_group.cloudtrail.name
pattern = "{$.eventName = GetObject && $.requestParameters.bucketName = \"terraform-state\"}"
metric_transformation {
name = "StateFileAccess"
namespace = "Terraform"
value = "1"
}
}
resource "aws_cloudwatch_metric_alarm" "state_access" {
alarm_name = "terraform-state-accessed"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "1"
metric_name = "StateFileAccess"
namespace = "Terraform"
period = "300"
statistic = "Sum"
threshold = "5" # Alert if accessed more than 5 times in 5 minutes
alarm_actions = [aws_sns_topic.alerts.arn]
}
4. Avoid Secrets in State
# Bad: Password in state
resource "aws_db_instance" "db" {
password = "hardcoded_password" # This goes in state!
}
# Better: Reference secrets manager
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/database/password"
}
resource "aws_db_instance" "db" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
# State still contains the password, but at least it's not hardcoded
}
# Best: Use random_password (note: the generated value still lands in state, so state encryption remains essential)
resource "random_password" "db_password" {
length = 32
special = true
}
resource "aws_secretsmanager_secret" "db_password" {
name = "prod/database/password"
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = random_password.db_password.result
}
resource "aws_db_instance" "db" {
password = random_password.db_password.result
lifecycle {
ignore_changes = [password] # Don't update on subsequent applies
}
}
5. Rotate State Encryption Keys
# AWS KMS key rotation script
aws kms enable-key-rotation --key-id <KEY_ID>
# Or in Terraform
resource "aws_kms_key" "terraform_state" {
description = "Terraform state encryption"
deletion_window_in_days = 30
enable_key_rotation = true # Automatic rotation
}
Secrets Scanning
# Scan state for secrets (CI/CD integration)
# Using truffleHog
docker run --rm -v $(pwd):/repo trufflesecurity/trufflehog:latest \
filesystem /repo --json
# Using detect-secrets
pip install detect-secrets
terraform state pull > pulled.tfstate
detect-secrets scan pulled.tfstate
State in Team Environments
Pattern 1: Branch-Based Development
main branch               State: production
  │
  ├── feature/new-vpc     State: feature-new-vpc (workspace)
  │
  └── feature/add-rds     State: feature-add-rds (workspace)
# Developer workflow
git checkout -b feature/new-vpc
# Create workspace for feature
terraform workspace new feature-new-vpc
# Make changes
terraform plan
terraform apply
# When ready to merge
git checkout main
terraform workspace select production
# Review changes
terraform plan
# Apply to production
terraform apply
Pattern 2: Pull Request Workflow
# .github/workflows/terraform-pr.yml
name: Terraform PR
on:
  pull_request:
    paths:
      - 'terraform/**'
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
        working-directory: ./terraform
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: ./terraform
        continue-on-error: true
      - name: Comment Plan
        uses: actions/github-script@v6
        with:
          script: |
            const output = `#### Terraform Plan
            <details><summary>Show Plan</summary>

            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`
            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })
Pattern 3: Environment-Specific State Files
terraform/
├── environments/
│   ├── prod/
│   │   ├── backend.tf        # backend = s3, key = "prod/terraform.tfstate"
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── backend.tf        # backend = s3, key = "staging/terraform.tfstate"
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── dev/
│       ├── backend.tf        # backend = s3, key = "dev/terraform.tfstate"
│       ├── main.tf
│       └── terraform.tfvars
└── modules/
Advanced State Patterns
Pattern: State Splitting by Lifecycle
Split state based on how often resources change:
terraform/
├── foundation/            # Rarely changes
│   ├── vpc.tf
│   ├── iam-roles.tf
│   └── backend.tf         # key = "prod/foundation/terraform.tfstate"
│
├── data/                  # Infrequent changes
│   ├── rds.tf
│   ├── elasticache.tf
│   └── backend.tf         # key = "prod/data/terraform.tfstate"
│
└── applications/          # Frequent changes
    ├── ecs-services.tf
    ├── lambdas.tf
    └── backend.tf         # key = "prod/apps/terraform.tfstate"
Benefits:
- Faster plan/apply for application changes
- Reduced risk of breaking foundation
- Independent deployment cadence
Pattern: State Sharing Between Projects
# Project A exports outputs
# terraform/project-a/outputs.tf
output "vpc_id" {
value = aws_vpc.main.id
description = "VPC ID for use by other projects"
}
output "database_endpoint" {
value = aws_db_instance.main.endpoint
description = "Database connection endpoint"
sensitive = true
}
# Project B imports outputs
# terraform/project-b/main.tf
data "terraform_remote_state" "project_a" {
backend = "s3"
config = {
bucket = "terraform-state"
key = "project-a/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_security_group" "app" {
  vpc_id = data.terraform_remote_state.project_a.outputs.vpc_id
  # ...
}
resource "aws_instance" "app" {
  vpc_security_group_ids = [aws_security_group.app.id]
  # Access database
  user_data = templatefile("${path.module}/user_data.sh", {
    db_endpoint = data.terraform_remote_state.project_a.outputs.database_endpoint
  })
}
Warning: This creates coupling between projects. Changes to outputs in Project A can break Project B.
Better: Use SSM Parameter Store or Secrets Manager
# Project A: Write to parameter store
resource "aws_ssm_parameter" "vpc_id" {
name = "/infrastructure/vpc/id"
type = "String"
value = aws_vpc.main.id
}
# Project B: Read from parameter store
data "aws_ssm_parameter" "vpc_id" {
name = "/infrastructure/vpc/id"
}
resource "aws_security_group" "app" {
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
This decouples the projects while still sharing data.
Pattern: Targeted State Operations
For large states, target specific resources:
# Plan only specific resources
terraform plan -target=aws_instance.web
# Apply only specific resources
terraform apply -target=aws_security_group.db
# Refresh only specific resources
terraform apply -refresh-only -target=aws_db_instance.main
Warning: Use sparingly. Can break dependencies and create inconsistent state.
Valid use cases:
- Emergency fixes
- Debugging specific resource issues
- Working around provider bugs
State Backup and Disaster Recovery
Automated Backup Strategy
S3 Versioning:
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
# Lifecycle policy to manage old versions
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
id = "archive-old-versions"
status = "Enabled"
noncurrent_version_transition {
noncurrent_days = 30
storage_class = "GLACIER"
}
noncurrent_version_expiration {
noncurrent_days = 90
}
}
}
Cross-Region Replication:
# Primary bucket (created with the default us-east-1 provider;
# a bucket's region comes from its provider, not an argument)
resource "aws_s3_bucket" "terraform_state_primary" {
  bucket = "terraform-state-us-east-1"
}
# Replica bucket (created with a us-west-2 provider alias;
# both buckets need versioning enabled for replication to work)
resource "aws_s3_bucket" "terraform_state_replica" {
  provider = aws.us_west_2
  bucket   = "terraform-state-us-west-2"
}
}
resource "aws_s3_bucket_replication_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state_primary.id
role = aws_iam_role.replication.arn
rule {
id = "replicate-state"
status = "Enabled"
destination {
bucket = aws_s3_bucket.terraform_state_replica.arn
storage_class = "STANDARD"
}
}
}
Scheduled Backups:
#!/bin/bash
# backup-terraform-state.sh
# Run daily via cron or CI/CD
DATE=$(date +%Y-%m-%d)
BACKUP_DIR="/backups/terraform-state/$DATE"
mkdir -p "$BACKUP_DIR"
# Backup all state files
aws s3 sync \
s3://terraform-state/ \
"$BACKUP_DIR/" \
--region us-east-1
# Compress
tar -czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
rm -rf "$BACKUP_DIR"
# Upload to long-term storage
aws s3 cp \
"$BACKUP_DIR.tar.gz" \
s3://terraform-backups/state-backups/ \
--storage-class GLACIER_DEEP_ARCHIVE
# Keep only last 7 days locally
find /backups/terraform-state -type f -mtime +7 -delete
Disaster Recovery Procedures
Scenario 1: Accidental Deletion
# Step 1: List versions
aws s3api list-object-versions \
--bucket terraform-state \
--prefix prod/terraform.tfstate
# Step 2: Restore specific version
aws s3api get-object \
--bucket terraform-state \
--key prod/terraform.tfstate \
--version-id <VERSION_ID> \
terraform.tfstate
# Step 3: Verify integrity
python -m json.tool terraform.tfstate > /dev/null
# Step 4: Re-upload
terraform state push terraform.tfstate
# Step 5: Verify
terraform plan
Scenario 2: Complete Bucket Loss
# Step 1: Restore from replica or backup
aws s3 sync \
s3://terraform-state-replica/ \
s3://terraform-state/ \
--region us-west-2
# Step 2: Re-initialize Terraform
terraform init -reconfigure
# Step 3: Verify state
terraform state list
terraform plan
Scenario 3: State Corruption with No Backup
This is the nightmare scenario. Your only option is reconstruction:
# Generate import script from AWS inventory
aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,Tags[?Key==`Name`].Value|[0]]' --output text | \
while read instance_id name; do
echo "terraform import 'aws_instance.$name' $instance_id"
done > import-instances.sh
# Similar for other resources
# This is why backups matter!
Monitoring and Alerting
State Health Metrics
# Lambda function to monitor state health
import boto3
from datetime import datetime, timedelta

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')

def handler(event, context):
    bucket = 'terraform-state'
    # Check state file age
    response = s3.list_objects_v2(Bucket=bucket)
    for obj in response.get('Contents', []):
        key = obj['Key']
        last_modified = obj['LastModified']
        age = datetime.now(last_modified.tzinfo) - last_modified
        cloudwatch.put_metric_data(
            Namespace='Terraform',
            MetricData=[{
                'MetricName': 'StateFileAge',
                'Value': age.days,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'StateFile', 'Value': key}]
            }]
        )
        # Alert if state hasn't been updated in 30 days
        if age > timedelta(days=30):
            sns = boto3.client('sns')
            sns.publish(
                TopicArn='arn:aws:sns:us-east-1:123456789012:terraform-alerts',
                Subject='Terraform State Stale',
                Message=f'State file {key} has not been updated in {age.days} days'
            )
Drift Detection
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [prod, staging, dev]
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
        working-directory: ./terraform/${{ matrix.environment }}
      - name: Detect Drift
        id: drift
        run: |
          # Disable -e so exit code 2 (drift) doesn't abort before we capture it
          set +e
          terraform plan -detailed-exitcode -out=tfplan
          code=$?
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          # Fail the step only on a real error (exit code 1)
          [ "$code" -ne 1 ]
        working-directory: ./terraform/${{ matrix.environment }}
        continue-on-error: true
      - name: Alert on Drift
        if: steps.drift.outputs.exitcode == '2'
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Drift detected in ${{ matrix.environment }}`,
              body: 'Configuration drift has been detected. Review the plan in Actions.',
              labels: ['terraform', 'drift', '${{ matrix.environment }}']
            })
Conclusion
Terraform state is the foundation of your infrastructure management. Get it right and you have a reliable, auditable system. Get it wrong and you risk data loss, security breaches, and infrastructure outages.
Key Takeaways:
- Always use remote state with locking for any shared infrastructure
- Enable versioning and backups - you will need them
- Encrypt state at rest and restrict access with IAM
- Split large states for faster operations and reduced blast radius
- Understand state manipulation commands for when things go wrong
- Test your disaster recovery procedures before you need them
- Monitor state health and detect drift automatically
State management isn't glamorous, but it's where operational maturity separates successful teams from those constantly fighting fires.
Next Steps
- Audit your current state setup: Are you using remote state? Locking? Encryption?
- Implement backups: Set up S3 versioning or cross-region replication
- Document recovery procedures: Write runbooks for common state problems
- Set up monitoring: Create alerts for state age, access, and drift
- Train your team: Ensure everyone understands state basics and knows who to ask for help
Additional Resources
- HashiCorp Terraform State Documentation
- HashiCorp Backend Configuration Reference
- Terraform State CLI Command Reference
- Atlantis - Terraform Pull Request Automation
Have state management war stories or questions? Share them in the comments below.