Multi-Cloud Infrastructure Patterns: Strategy and Implementation
Prerequisites
- Strong understanding of at least one cloud provider (AWS, GCP, or Azure)
- Infrastructure as Code experience (Terraform or similar)
- Networking fundamentals (VPCs, routing, VPNs)
- Experience managing production infrastructure
Introduction
Multi-cloud is a loaded term. For some teams, it's a strategic advantage that prevents vendor lock-in and optimizes costs. For others, it's an operational nightmare that multiplies complexity without delivering value.
This guide cuts through the hype to explore when multi-cloud makes sense, what patterns actually work in production, and how to implement them without drowning in complexity. Whether you're evaluating a multi-cloud strategy or already running workloads across multiple providers, you'll learn patterns that reduce operational burden while maximizing the benefits.
The Multi-Cloud Reality Check
Before diving into patterns, let's establish some ground truth.
When Multi-Cloud Makes Sense
1. Customer Requirements
Your customers demand it. Financial institutions often require specific clouds for compliance. Government contracts may mandate certain providers. Enterprise customers might need you to run in their preferred cloud.
Real Example: A SaaS company runs primary infrastructure on AWS but
deploys dedicated instances on GCP and Azure for enterprise customers
who have existing cloud commitments and need data residency in their
tenants.
2. Risk Mitigation for Critical Systems
True redundancy against cloud provider outages. Not just multi-region, but multi-cloud.
Real Example: A payment processor runs identical infrastructure on AWS
and GCP with automatic failover. During the 2021 AWS us-east-1 outage,
they shifted 100% of traffic to GCP in under 2 minutes.
3. Best-of-Breed Services
Certain clouds excel at specific services. GCP's data analytics and ML tools. AWS's breadth of services. Azure's enterprise integration and hybrid capabilities.
Real Example: A data science company runs training workloads on GCP
for TPU access and superior BigQuery performance, while hosting their
application and database on AWS for better regional coverage and
mature RDS offerings.
4. Cost Optimization at Scale
Once you're spending millions annually, optimizing across providers can save substantial money. Spot instance availability varies by provider. Certain workload types are cheaper on specific clouds.
Real Example: A video transcoding company processes uploads on whichever
cloud has the cheapest spot instances at that moment, using a job queue
system that dispatches to AWS, GCP, or Azure based on current pricing.
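The dispatch decision above can be sketched in a few lines. This is a toy illustration, not the company's actual system; the provider names and prices are placeholders for what a real dispatcher would fetch from each provider's spot pricing API.

```python
# Hypothetical dispatcher: route a job to whichever cloud currently offers
# the cheapest spot price for the required instance class.
SPOT_PRICES_USD_PER_HOUR = {  # placeholder values; would come from pricing APIs
    "aws": 0.31,
    "gcp": 0.27,
    "azure": 0.29,
}

def cheapest_cloud(prices: dict[str, float]) -> str:
    """Return the provider with the lowest current spot price."""
    return min(prices, key=prices.get)

job_target = cheapest_cloud(SPOT_PRICES_USD_PER_HOUR)
print(job_target)  # with the sample prices above: gcp
```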
When Multi-Cloud is a Mistake
1. Premature Optimization
Your startup isn't ready for multi-cloud if you have:
- Less than $50k monthly cloud spend
- Fewer than 5 infrastructure engineers
- No standardized deployment pipeline
- Single-region footprint
Complexity will outweigh benefits. Master one cloud first.
2. "Avoiding Vendor Lock-In" Without Math
The lock-in fear is often irrational. Calculate the actual cost:
Scenario: You have 100 engineers. Each spends 20% more time managing
multi-cloud complexity.
Cost: 100 engineers × $150k salary × 0.20 = $3M/year
Does multi-cloud save you $3M annually? If not, you're losing money
"avoiding lock-in."
3. Technology Resume Building
Engineers want to learn multiple clouds. Great for learning, terrible business justification. Training environments and side projects serve this purpose without production complexity.
4. Surface-Level Multi-Cloud
Running a monitoring SaaS on one cloud and your application on another isn't multi-cloud architecture. It's using SaaS products that happen to be cloud-based. Real multi-cloud means your core workloads span providers.
Multi-Cloud Pattern Categories
Pattern 1: The Isolated Island Pattern
Description: Each cloud runs independent workloads with no inter-cloud communication.
Architecture:
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│         AWS         │   │         GCP         │   │        Azure        │
│                     │   │                     │   │                     │
│     Customer A      │   │     Customer B      │   │     Customer C      │
│     Workloads       │   │     Workloads       │   │     Workloads       │
│                     │   │                     │   │                     │
└──────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
           │                         │                         │
           └─────────────────────────┼─────────────────────────┘
                                     │
                          Central Control Plane
                          (CI/CD, monitoring, etc.)
When to Use:
- B2B SaaS with customer-specific deployments
- Regional isolation requirements
- Independent scaling per customer/region
- No shared data between clouds
Implementation:
# Terraform structure for isolated islands
project/
├── infrastructure/
│   ├── aws/
│   │   ├── customer-a/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── modules/
│   │       ├── application/
│   │       ├── database/
│   │       └── networking/
│   ├── gcp/
│   │   ├── customer-b/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── modules/
│   │       ├── application/
│   │       ├── database/
│   │       └── networking/
│   └── azure/
│       └── customer-c/
└── control-plane/
    ├── ci-cd/
    └── monitoring/
Key Terraform Pattern:
# modules/application/main.tf (cloud-agnostic interface)
variable "environment" {
description = "Environment name"
type = string
}
variable "app_config" {
description = "Application configuration"
type = object({
instance_type = string
instance_count = number
app_port = number
})
}
variable "network_config" {
description = "Network configuration"
type = object({
vpc_id = string
subnet_ids = list(string)
})
}
# AWS implementation
# infrastructure/aws/modules/application/main.tf
resource "aws_instance" "app" {
count = var.app_config.instance_count
instance_type = var.app_config.instance_type
subnet_id = var.network_config.subnet_ids[count.index % length(var.network_config.subnet_ids)]
# ... AWS-specific configuration
}
# GCP implementation
# infrastructure/gcp/modules/application/main.tf
resource "google_compute_instance" "app" {
count = var.app_config.instance_count
machine_type = var.app_config.instance_type
zone = var.zones[count.index % length(var.zones)]
# ... GCP-specific configuration
}
Pros:
- Simplest multi-cloud pattern
- No cross-cloud networking complexity
- Independent failure domains
- Easy cost attribution
Cons:
- Can't share resources or data
- Duplicate infrastructure code
- No cross-cloud redundancy
Pattern 2: The Active-Passive Disaster Recovery Pattern
Description: Primary workload on one cloud, standby replica on another for disaster recovery.
Architecture:
┌──────────────────────────────┐
│        AWS (Primary)         │
│                              │
│   ┌────────────┐             │
│   │Application │             │
│   │  Cluster   │             │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │  Database  │─────────────┼────▶ Continuous Replication
│   │   (RDS)    │             │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │   Object   │─────────────┼────▶ Cross-Cloud Replication
│   │  Storage   │             │
│   └────────────┘             │
└──────────────┬───────────────┘
               │
               │ Failover (Manual or Automatic)
               ▼
┌──────────────────────────────┐
│        GCP (Standby)         │
│                              │
│   ┌────────────┐             │
│   │Application │ (Scaled to  │
│   │  Cluster   │  zero until │
│   └──────┬─────┘  failover)  │
│          │                   │
│   ┌──────┴─────┐             │
│   │  Database  │ (Read       │
│   │ (CloudSQL) │  replica)   │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │   Object   │ (Replicated │
│   │  Storage   │  data)      │
│   └────────────┘             │
└──────────────────────────────┘
When to Use:
- Need protection against cloud provider outages
- Can tolerate RPO of minutes to hours
- Can tolerate RTO of minutes to hours
- Cost-sensitive (standby is cheaper than active-active)
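Before committing to this pattern, it's worth sanity-checking measured replication lag and failover time against the objectives you've stated. A minimal sketch (the numbers are illustrative):

```python
# RPO is bounded by replication lag; RTO by total failover duration.
def meets_dr_objectives(replication_lag_s: float, failover_duration_s: float,
                        rpo_target_s: float, rto_target_s: float) -> bool:
    """True if measured lag and failover time fit within the stated RPO/RTO."""
    return (replication_lag_s <= rpo_target_s
            and failover_duration_s <= rto_target_s)

# e.g. 90 s of lag and a 20-minute failover vs. 5 min RPO / 1 h RTO targets
ok = meets_dr_objectives(90, 20 * 60, rpo_target_s=300, rto_target_s=3600)
print(ok)  # True
```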
Implementation:
# AWS Primary Database
resource "aws_db_instance" "primary" {
identifier = "app-primary"
engine = "postgres"
engine_version = "14.7"
instance_class = "db.r6g.xlarge"
backup_retention_period = 7
# Enable automated backups for DR
backup_window = "03:00-04:00"
tags = {
Role = "primary"
DRTarget = "gcp-standby"
}
}
# Export backups to S3 for cross-cloud restore
resource "aws_db_snapshot" "automated" {
# Taken automatically by RDS
}
# Lambda to copy snapshots to S3 and trigger GCP restore
resource "aws_lambda_function" "snapshot_export" {
function_name = "export-snapshot-to-gcs"
# Exports RDS snapshot to S3
# Copies to GCS using cross-cloud credentials
# Triggers GCP Cloud Function to restore to standby
}
# GCP Standby Database
resource "google_sql_database_instance" "standby" {
name = "app-standby"
database_version = "POSTGRES_14"
settings {
    tier = "db-custom-4-16384" # Smaller than the primary; scale up on failover
    activation_policy = "NEVER" # Don't run unless activated
}
}
# Restore mechanism
resource "google_cloudfunctions_function" "restore_from_backup" {
name = "restore-db-from-s3"
description = "Restore database from AWS backup"
runtime = "python39"
# Triggered by snapshot copy completion
# Restores backup to standby instance
# Updates DNS to point to GCP
}
Failover Orchestration:
# failover.py - Automated DR failover script
# Illustrative sketch: the Cloud SQL Admin calls below are simplified.
# The real google.cloud.sql_v1 client methods take project/instance
# arguments and request objects; consult the client library docs.
import time

import boto3
import google.cloud.sql_v1


class MultiCloudDR:
    def __init__(self):
        self.aws_client = boto3.client('rds')
        self.gcp_client = google.cloud.sql_v1.SqlInstancesServiceClient()

    def check_primary_health(self):
        """Check if AWS primary is healthy"""
        try:
            response = self.aws_client.describe_db_instances(
                DBInstanceIdentifier='app-primary'
            )
            status = response['DBInstances'][0]['DBInstanceStatus']
            return status == 'available'
        except Exception as e:
            print(f"Primary health check failed: {e}")
            return False

    def activate_standby(self):
        """Activate GCP standby"""
        print("Activating GCP standby instance...")
        # 1. Activate the Cloud SQL instance
        instance = self.gcp_client.get(name='app-standby')
        instance.settings.activation_policy = 'ALWAYS'
        self.gcp_client.update(instance=instance)
        # 2. Wait for instance to be ready
        while True:
            instance = self.gcp_client.get(name='app-standby')
            if instance.state == 'RUNNABLE':
                break
            time.sleep(10)
        # 3. Promote read replica to primary
        self.gcp_client.promote_replica(name='app-standby')
        # 4. Scale up compute resources
        instance.settings.tier = 'db-custom-8-32768'
        self.gcp_client.update(instance=instance)
        # 5. Update DNS to point to GCP
        self.update_dns_to_gcp()
        print("Standby activated and serving traffic")

    def update_dns_to_gcp(self):
        """Update DNS to point to GCP instance"""
        # Update Route53 / Cloud DNS records
        pass

    def failover(self):
        """Execute failover to GCP"""
        if not self.check_primary_health():
            print("Primary unhealthy, initiating failover...")
            self.activate_standby()
        else:
            print("Primary is healthy, no action needed")


# Run health check every minute
if __name__ == '__main__':
    dr = MultiCloudDR()
    while True:
        dr.failover()
        time.sleep(60)
Pros:
- Protects against cloud provider outages
- Lower cost than active-active (standby scaled down)
- Clear primary/secondary model
Cons:
- Failover time (RTO) of minutes to hours
- Data loss window (RPO) depends on replication lag
- Manual testing required to ensure DR actually works
- Standby infrastructure costs even when unused
Pattern 3: The Active-Active Multi-Cloud Pattern
Description: Workloads actively running on multiple clouds simultaneously with traffic distributed across them.
Architecture:
                     Global Traffic Manager
                    (DNS-based or Anycast)
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
     ┌───────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
     │     AWS      │   │     GCP     │   │    Azure    │
     │   (33.3%)    │   │   (33.3%)   │   │   (33.3%)   │
     └───────┬──────┘   └──────┬──────┘   └──────┬──────┘
             │                 │                 │
     ┌───────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
     │ Application  │   │ Application │   │ Application │
     │   Cluster    │   │   Cluster   │   │   Cluster   │
     └───────┬──────┘   └──────┬──────┘   └──────┬──────┘
             │                 │                 │
             └─────────────────┼─────────────────┘
                               │
                     Distributed Database
                   (CockroachDB, Spanner, etc.)
When to Use:
- Cannot tolerate downtime from cloud provider outages
- High availability requirements (99.99%+)
- Global user base needing low latency everywhere
- Budget supports 2-3x infrastructure costs
Critical Requirements:
- Stateless Application Design
Applications must be completely stateless. Any state belongs in distributed databases or caches.
// Bad: Session state in local memory
type AppServer struct {
	sessions map[string]*Session // Lost on instance restart
}

// Good: Session state in distributed cache
type AppServer struct {
	redis *redis.ClusterClient // Works across clouds
}

func (s *AppServer) GetSession(id string) (*Session, error) {
	// Fetch from Redis cluster spanning AWS, GCP, Azure
	val, err := s.redis.Get(context.Background(), id).Result()
	if err != nil {
		return nil, err
	}
	// Deserialize and return
	var sess Session
	if err := json.Unmarshal([]byte(val), &sess); err != nil {
		return nil, err
	}
	return &sess, nil
}
- Globally Distributed Database
You need a database that handles multi-cloud replication:
Option A: CockroachDB
-- Create table with multi-region replication
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email STRING NOT NULL,
created_at TIMESTAMP DEFAULT now()
);
-- Configure multi-region
ALTER DATABASE app PRIMARY REGION "aws-us-east-1";
ALTER DATABASE app ADD REGION "gcp-us-central1";
ALTER DATABASE app ADD REGION "azure-eastus";
-- Set survival goals
ALTER DATABASE app SURVIVE REGION FAILURE;
# Terraform for CockroachDB across clouds (illustrative; check the
# CockroachDB Cloud Terraform provider docs for the current resource schema)
resource "cockroachdb_cluster" "multi_cloud" {
name = "production"
cloud_provider = "MULTI_CLOUD"
regions = [
{
name = "aws-us-east-1"
node_count = 3
},
{
name = "gcp-us-central1"
node_count = 3
},
{
name = "azure-eastus"
node_count = 3
}
]
hardware = {
machine_type = "n1-standard-4"
storage_gib = 500
}
}
Option B: Google Cloud Spanner (Multi-Region)
resource "google_spanner_instance" "multi_region" {
name = "production"
config = "nam-eur-asia1" # Multi-region configuration
display_name = "Production Multi-Region"
num_nodes = 9 # Spread across the configuration's regions
}
Option C: Self-Managed PostgreSQL with Multi-Master
# AWS RDS Postgres
resource "aws_db_instance" "postgres_aws" {
identifier = "postgres-aws"
engine = "postgres"
instance_class = "db.r6g.2xlarge"
# Enable logical replication
parameter_group_name = aws_db_parameter_group.postgres_replication.name
}
# GCP Cloud SQL Postgres
resource "google_sql_database_instance" "postgres_gcp" {
name = "postgres-gcp"
database_version = "POSTGRES_14"
settings {
tier = "db-custom-8-32768"
database_flags {
name = "wal_level"
value = "logical"
}
}
}
# Azure PostgreSQL
resource "azurerm_postgresql_flexible_server" "postgres_azure" {
name = "postgres-azure"
resource_group_name = azurerm_resource_group.main.name
location = "eastus"
sku_name = "GP_Standard_D4s_v3"
}
# Configure bidirectional replication using PostgreSQL logical replication
# Requires pglogical or similar extension
- Global Traffic Distribution
# Using Cloudflare for multi-cloud load balancing
resource "cloudflare_load_balancer" "app" {
zone_id = var.cloudflare_zone_id
name = "app.example.com"
default_pool_ids = [
cloudflare_load_balancer_pool.aws.id,
cloudflare_load_balancer_pool.gcp.id,
cloudflare_load_balancer_pool.azure.id,
]
steering_policy = "geo" # or "random", "dynamic_latency"
session_affinity = "cookie"
}
resource "cloudflare_load_balancer_pool" "aws" {
name = "aws-us-east-1"
origins {
name = "aws-origin-1"
address = aws_lb.app.dns_name
enabled = true
}
check_regions = ["WNAM", "ENAM"]
# Health check
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_pool" "gcp" {
name = "gcp-us-central1"
origins {
name = "gcp-origin-1"
address = google_compute_forwarding_rule.app.ip_address
enabled = true
}
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_pool" "azure" {
name = "azure-eastus"
origins {
name = "azure-origin-1"
address = azurerm_public_ip.app.ip_address # LB frontend public IP
enabled = true
}
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_monitor" "http" {
type = "http"
path = "/health"
port = 443
interval = 60
timeout = 5
retries = 2
expected_codes = "200"
}
Pros:
- True high availability across cloud providers
- No failover delay
- Can optimize latency per region
- Load distribution for performance
Cons:
- Highest cost (3x infrastructure)
- Operational complexity
- Data consistency challenges
- Requires significant engineering investment
Pattern 4: The Hybrid Best-of-Breed Pattern
Description: Use each cloud for what it does best, with controlled integration points.
Architecture:
┌───────────────────────────────────────────────────┐
│               AWS (Primary Platform)              │
│                                                   │
│   ┌────────────┐   ┌──────────┐   ┌──────────┐    │
│   │Application │   │PostgreSQL│   │  Redis   │    │
│   │   (ECS)    │   │  (RDS)   │   │(ElastiC.)│    │
│   └─────┬──────┘   └────┬─────┘   └────┬─────┘    │
└─────────┼───────────────┼──────────────┼──────────┘
          │               │              │
┌─────────▼──────────┐ ┌──▼────────────┐ ┌▼─────────────┐
│ GCP (ML/Analytics) │ │ GCP BigQuery  │ │  GCP Cloud   │
│                    │ │    (Data      │ │   Storage    │
│  ┌──────────────┐  │ │  Warehouse)   │ │ (Data Lake)  │
│  │  Vertex AI   │  │ └───────────────┘ └──────────────┘
│  │   Training   │  │
│  └──────────────┘  │
└────────────────────┘
When to Use:
- Need best-in-class services from different clouds
- Can isolate workloads by function
- Have budget for specialized services
- Team has expertise in multiple clouds
Implementation Example: ML Pipeline on GCP with Application on AWS
# Application runs on AWS, triggers ML training on GCP
import boto3
import google.cloud.aiplatform as aiplatform
from google.cloud import storage
from google.cloud import storage_transfer


class MultiCloudMLPipeline:
    def __init__(self):
        # AWS clients
        self.s3 = boto3.client('s3')
        self.sqs = boto3.client('sqs')
        # GCP clients
        self.gcs = storage.Client()
        aiplatform.init(project='my-project', location='us-central1')

    def trigger_training(self, dataset_id):
        """Trigger ML training job on GCP with data from AWS"""
        # 1. Export data from AWS RDS to S3
        self.export_data_to_s3(dataset_id)
        # 2. Transfer from S3 to GCS
        self.transfer_s3_to_gcs(
            f's3://my-bucket/datasets/{dataset_id}',
            f'gs://my-gcp-bucket/datasets/{dataset_id}'
        )
        # 3. Create Vertex AI training job
        job = aiplatform.CustomTrainingJob(
            display_name=f'training-{dataset_id}',
            script_path='train.py',
            container_uri='gcr.io/my-project/training:latest',
            requirements=['tensorflow==2.12.0'],
        )
        # 4. Run training
        model = job.run(
            dataset=f'gs://my-gcp-bucket/datasets/{dataset_id}',
            replica_count=4,
            machine_type='n1-highmem-8',
            accelerator_type='NVIDIA_TESLA_T4',
            accelerator_count=1,
        )
        # 5. Export model back to AWS S3 for serving
        self.export_model_to_s3(model)
        return model

    def transfer_s3_to_gcs(self, s3_uri, gcs_uri):
        """Transfer data from S3 to GCS via GCP Storage Transfer Service"""
        # Parse "s3://bucket/prefix" and "gs://bucket/prefix"
        s3_bucket, s3_path = s3_uri.removeprefix('s3://').split('/', 1)
        gcs_bucket, gcs_path = gcs_uri.removeprefix('gs://').split('/', 1)

        transfer_client = storage_transfer.StorageTransferServiceClient()
        transfer_job = {
            'description': 'Transfer dataset for ML training',
            'status': 'ENABLED',
            'project_id': 'my-project',
            'transfer_spec': {
                'aws_s3_data_source': {
                    'bucket_name': s3_bucket,
                    'path': f'{s3_path}/',
                    # Prefer role-based access in production; inline keys
                    # shown only for brevity
                    'aws_access_key': {
                        'access_key_id': 'ACCESS_KEY',
                        'secret_access_key': 'SECRET_KEY'
                    }
                },
                'gcs_data_sink': {
                    'bucket_name': gcs_bucket,
                    'path': f'{gcs_path}/'
                }
            }
        }
        result = transfer_client.create_transfer_job({'transfer_job': transfer_job})
        return result
Terraform for Hybrid Infrastructure:
# AWS Application Infrastructure
module "aws_application" {
source = "./modules/aws-app"
providers = {
aws = aws
}
environment = var.environment
vpc_cidr = "10.0.0.0/16"
}
# GCP ML Infrastructure
module "gcp_ml_platform" {
source = "./modules/gcp-ml"
providers = {
google = google
}
project_id = var.gcp_project_id
region = "us-central1"
# Reference to AWS resources for data transfer
aws_s3_bucket = module.aws_application.data_bucket_name
}
# Cross-cloud IAM for service accounts
resource "aws_iam_role" "gcp_access" {
name = "gcp-ml-service-access"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = "accounts.google.com"
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"accounts.google.com:aud" = var.gcp_service_account_email
}
}
}]
})
}
resource "aws_iam_role_policy" "gcp_s3_access" {
role = aws_iam_role.gcp_access.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:ListBucket"
]
Resource = [
module.aws_application.data_bucket_arn,
"${module.aws_application.data_bucket_arn}/*"
]
}]
})
}
Pros:
- Use best services from each cloud
- Can specialize team expertise
- Optimize costs per workload type
- Flexibility to migrate over time
Cons:
- Complex data movement between clouds
- Higher egress costs
- Multiple operational models
- Team needs multi-cloud expertise
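Egress costs deserve a concrete estimate before you adopt this pattern. A crude sketch (the per-GB rate is a placeholder; real rates vary by provider, region, and volume tier):

```python
# Rough monthly egress cost for recurring cross-cloud data movement.
def monthly_egress_cost(gb_per_day: float, usd_per_gb: float = 0.09) -> float:
    """Estimate monthly egress spend for a daily transfer volume."""
    return gb_per_day * 30 * usd_per_gb

# Moving 500 GB of training data per day from one cloud to another:
print(f"${monthly_egress_cost(500):,.0f}/month")  # $1,350/month
```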
Cross-Cloud Networking Strategies
Networking is where multi-cloud gets difficult. Here are proven approaches.
Strategy 1: VPN Mesh
Simple but Limited:
# AWS side
resource "aws_customer_gateway" "gcp" {
bgp_asn = 65000
ip_address = google_compute_address.vpn_gw.address
type = "ipsec.1"
}
resource "aws_vpn_connection" "to_gcp" {
customer_gateway_id = aws_customer_gateway.gcp.id
vpn_gateway_id = aws_vpn_gateway.main.id
type = "ipsec.1"
static_routes_only = false
}
# GCP side
resource "google_compute_vpn_gateway" "aws" {
name = "aws-vpn-gateway"
network = google_compute_network.main.id
}
resource "google_compute_vpn_tunnel" "to_aws" {
name = "to-aws"
peer_ip = aws_vpn_connection.to_gcp.tunnel1_address
shared_secret = aws_vpn_connection.to_gcp.tunnel1_preshared_key
target_vpn_gateway = google_compute_vpn_gateway.aws.id
ike_version = 2
}
Pros: Simple, uses built-in cloud features.
Cons: Limited performance, not highly available, full-mesh scaling gets complex.
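The mesh-scaling problem is easy to quantify: a full mesh between n networks needs n(n-1)/2 tunnel pairs, each with its own keys, routes, and monitoring.

```python
# Tunnel count for a full VPN mesh between n networks.
def full_mesh_tunnels(n: int) -> int:
    return n * (n - 1) // 2

for n in (3, 5, 10):
    print(n, full_mesh_tunnels(n))  # 3→3, 5→10, 10→45
```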
Strategy 2: SD-WAN Overlay
Use a third-party SD-WAN solution:
Options:
- Aviatrix
- Alkira
- Cisco Viptela
- VMware SD-WAN
# Example with Aviatrix
resource "aviatrix_transit_gateway" "aws" {
cloud_type = 1 # AWS
account_name = "aws-account"
gw_name = "aws-transit"
vpc_id = aws_vpc.main.id
vpc_reg = "us-east-1"
gw_size = "c5n.2xlarge"
subnet = aws_subnet.transit.cidr_block
}
resource "aviatrix_transit_gateway" "gcp" {
cloud_type = 4 # GCP
account_name = "gcp-account"
gw_name = "gcp-transit"
vpc_id = google_compute_network.main.name
vpc_reg = "us-central1"
gw_size = "n1-standard-4"
subnet = google_compute_subnetwork.transit.ip_cidr_range
}
resource "aviatrix_transit_gateway_peering" "aws_gcp" {
transit_gateway_name1 = aviatrix_transit_gateway.aws.gw_name
transit_gateway_name2 = aviatrix_transit_gateway.gcp.gw_name
}
Pros: High performance, simplified management, native HA.
Cons: Additional cost, vendor dependency.
Strategy 3: Direct Interconnects
High-bandwidth dedicated connections:
# AWS Direct Connect
resource "aws_dx_connection" "main" {
name = "cross-cloud-interconnect"
bandwidth = "10Gbps"
location = "EqDC2" # Equinix datacenter
}
# GCP Partner Interconnect at same colocation
resource "google_compute_interconnect_attachment" "aws" {
name = "to-aws"
type = "PARTNER"
router = google_compute_router.main.id
region = "us-central1"
admin_enabled = true
}
Pros: Lowest latency, highest bandwidth, predictable performance.
Cons: Expensive, complex setup, limited availability.
Identity and Access Management Across Clouds
Pattern: Centralized Identity Provider
Use a single identity source with federation to all clouds:
# Okta as central IDP
resource "okta_app_saml" "aws" {
label = "AWS"
# AWS SAML configuration
app_settings_json = jsonencode({
aws_environment_type = "aws.amazon"
})
}
resource "okta_app_saml" "gcp" {
label = "Google Cloud Platform"
# GCP SAML configuration
}
resource "okta_app_saml" "azure" {
label = "Microsoft Azure"
# Azure AD integration
}
# Assign groups to applications
resource "okta_app_group_assignment" "aws_admin" {
app_id = okta_app_saml.aws.id
group_id = okta_group.platform_team.id
}
Pattern: Service Account Federation
Allow services in one cloud to access another:
# GCP service account that can assume AWS role
resource "google_service_account" "aws_access" {
account_id = "aws-resource-access"
display_name = "Service account for AWS access"
}
# AWS IAM role that GCP can assume
resource "aws_iam_role" "gcp_access" {
name = "gcp-service-access"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = "accounts.google.com"
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"accounts.google.com:sub" = google_service_account.aws_access.unique_id
}
}
}]
})
}
# GCP Workload Identity to AWS
resource "google_iam_workload_identity_pool" "aws" {
workload_identity_pool_id = "aws-pool"
display_name = "AWS Identity Pool"
}
resource "google_iam_workload_identity_pool_provider" "aws" {
workload_identity_pool_id = google_iam_workload_identity_pool.aws.workload_identity_pool_id
workload_identity_pool_provider_id = "aws-provider"
display_name = "AWS Provider"
aws {
account_id = var.aws_account_id
}
attribute_mapping = {
"google.subject" = "assertion.arn"
"attribute.aws_role" = "assertion.arn.extract('assumed-role/{role}/')"
}
}
Cost Management in Multi-Cloud
Multi-cloud amplifies cost management challenges.
Unified Cost Tracking
# Tag strategy across clouds
locals {
common_tags = {
Environment = var.environment
Team = var.team
CostCenter = var.cost_center
ManagedBy = "Terraform"
Project = var.project_name
}
}
# AWS resources
resource "aws_instance" "app" {
# ... configuration
tags = merge(
local.common_tags,
{
Name = "app-server"
Cloud = "AWS"
}
)
}
# GCP resources (label keys and values must be lowercase)
resource "google_compute_instance" "app" {
  # ... configuration
  labels = merge(
    { for k, v in local.common_tags : lower(k) => lower(v) },
    {
      name  = "app-server"
      cloud = "gcp"
    }
  )
}
# Azure resources
resource "azurerm_virtual_machine" "app" {
# ... configuration
tags = merge(
local.common_tags,
{
Name = "app-server"
Cloud = "Azure"
}
)
}
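Because each provider constrains tag and label syntax differently (GCP labels, for instance, must be lowercase), it can help to keep one canonical tag set and adapt it per cloud. A minimal sketch, assuming AWS and Azure accept the tags as-is:

```python
# Normalize a canonical tag set to each provider's constraints.
def normalize_tags(tags: dict[str, str], cloud: str) -> dict[str, str]:
    if cloud == "gcp":
        # GCP labels: lowercase keys/values, no spaces
        return {k.lower(): v.lower().replace(" ", "-") for k, v in tags.items()}
    return tags  # assumption: AWS and Azure accept mixed case

common = {"Environment": "Production", "CostCenter": "Platform Eng"}
print(normalize_tags(common, "gcp"))
```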
Multi-Cloud Cost Tools
CloudHealth by VMware:
- Unified dashboard across AWS, Azure, GCP
- Reserved instance recommendations
- Budget alerts
CloudZero:
- Kubernetes cost allocation across clouds
- Unit economics tracking
- Anomaly detection
Custom Solution with OpenCost:
# OpenCost configuration for multi-cloud
apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost-config
data:
  cloud-integration: |
    aws:
      enabled: true
      account_id: "123456789"
    gcp:
      enabled: true
      project_id: "my-project"
    azure:
      enabled: true
      subscription_id: "sub-id"
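Whatever tool you pick, the core of unified cost tracking is the same rollup: fold per-cloud billing records into spend per cost center, regardless of provider. A minimal sketch with illustrative field names:

```python
# Fold per-cloud billing records into spend per cost center.
from collections import defaultdict

def rollup_by_cost_center(records: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["cost_center"]] += r["usd"]
    return dict(totals)

records = [
    {"cloud": "aws",   "cost_center": "platform", "usd": 1200.0},
    {"cloud": "gcp",   "cost_center": "ml",       "usd": 800.0},
    {"cloud": "azure", "cost_center": "platform", "usd": 300.0},
]
print(rollup_by_cost_center(records))  # {'platform': 1500.0, 'ml': 800.0}
```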
Operational Complexity Management
Standardization is Key
Standard Module Interface:
# modules/compute-cluster/interface.tf
variable "cloud_provider" {
description = "Cloud provider"
type = string
validation {
condition = contains(["aws", "gcp", "azure"], var.cloud_provider)
error_message = "Must be aws, gcp, or azure"
}
}
variable "cluster_config" {
description = "Cluster configuration"
type = object({
name = string
instance_type = string
instance_count = number
disk_size_gb = number
})
}
variable "network_config" {
description = "Network configuration"
type = object({
vpc_id = string
subnet_ids = list(string)
})
}
# Provider-specific implementations
module "aws_cluster" {
count = var.cloud_provider == "aws" ? 1 : 0
source = "./aws"
cluster_config = var.cluster_config
network_config = var.network_config
}
module "gcp_cluster" {
count = var.cloud_provider == "gcp" ? 1 : 0
source = "./gcp"
cluster_config = var.cluster_config
network_config = var.network_config
}
module "azure_cluster" {
count = var.cloud_provider == "azure" ? 1 : 0
source = "./azure"
cluster_config = var.cluster_config
network_config = var.network_config
}
Multi-Cloud Observability
Centralized Logging and Monitoring:
# Grafana Cloud for multi-cloud observability
---
# AWS CloudWatch integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-aws-cloudwatch
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: AWS CloudWatch
        type: cloudwatch
        access: proxy
        jsonData:
          authType: ec2_iam_role
          defaultRegion: us-east-1
---
# GCP Cloud Monitoring integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gcp-monitoring
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: GCP Monitoring
        type: stackdriver
        access: proxy
        jsonData:
          authenticationType: gce
---
# Azure Monitor integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-azure-monitor
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Azure Monitor
        type: grafana-azure-monitor-datasource
        access: proxy
        jsonData:
          cloudName: azuremonitor
Unified Alerting:
# Multi-cloud alert aggregation
from datadog import initialize, api

initialize(api_key='API_KEY', app_key='APP_KEY')

# Create one alert that spans all three clouds via tag filters
alert = api.Monitor.create(
    type="metric alert",
    query=("avg(last_5m):avg:system.cpu.user"
           "{cloud:aws OR cloud:gcp OR cloud:azure} by {host,cloud} > 80"),
    name="High CPU across all clouds",
    message="CPU usage is high on {{host.name}} ({{cloud.name}})",
    tags=['multi-cloud', 'infrastructure']
)
Migration Strategies
Gradual Migration Pattern
Move workloads incrementally:
Phase 1: New workloads on target cloud

  AWS (Existing)        │  GCP (New)
  - Legacy App          │  - New Feature A
  - Database            │  - Analytics Pipeline

Phase 2: Migrate non-critical workloads

  AWS                   │  GCP
  - Legacy App          │  - New Feature A
  - Database            │  - Analytics
                        │  - Staging Env (migrated)

Phase 3: Migrate critical workloads

  AWS                   │  GCP
  - Database            │  - All Applications
                        │  - Analytics
                        │  - Production (migrated)
Strangler Fig Pattern
Gradually replace old system with new:
# Route traffic between old (AWS) and new (GCP) based on feature flags.
# Field names follow the Cloudflare Terraform provider; verify against the
# current schema before applying.
resource "cloudflare_load_balancer" "migration" {
  zone_id = var.cloudflare_zone_id
  name    = "app.example.com"

  default_pool_ids = [
    cloudflare_load_balancer_pool.aws_legacy.id,
    cloudflare_load_balancer_pool.gcp_new.id,
  ]
  fallback_pool_id = cloudflare_load_balancer_pool.aws_legacy.id
  steering_policy  = "random"

  # Default split: 90% AWS, 10% GCP (gradually increase GCP)
  random_steering {
    pool_weights = {
      (cloudflare_load_balancer_pool.aws_legacy.id) = 0.9
      (cloudflare_load_balancer_pool.gcp_new.id)    = 0.1
    }
  }

  # Users flagged into the new system always hit GCP
  rules {
    name      = "new-users-to-gcp"
    priority  = 1
    condition = "http.cookie contains \"new_system=true\""
    overrides {
      default_pools = [cloudflare_load_balancer_pool.gcp_new.id]
    }
  }
}
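The routing behavior this kind of configuration produces can be sketched as a few lines of logic: opted-in users always reach the new system, and everyone else is split by weight, shifted over time.

```python
# Strangler-fig routing sketch: cookie opt-in wins, otherwise weighted split.
import random

def route(cookie: str, gcp_weight: float = 0.10, rng=random.random) -> str:
    """Return which backend ('aws' or 'gcp') a request should hit."""
    if "new_system=true" in cookie:
        return "gcp"
    return "gcp" if rng() < gcp_weight else "aws"

print(route("new_system=true"))      # gcp
print(route("", rng=lambda: 0.99))   # aws (draw above the 10% weight)
print(route("", rng=lambda: 0.05))   # gcp (draw below the 10% weight)
```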
Conclusion
Multi-cloud is a tool, not a goal. Use it when it solves real problems:
- Customer requirements demand it
- Specific services justify the complexity
- Risk mitigation is worth the cost
Don't use it for:
- Theoretical vendor lock-in avoidance
- Resume building
- Because competitors do it
Key Takeaways:
- Start with one cloud. Master it before adding complexity.
- If you go multi-cloud, standardize everything. Common modules, common monitoring, common processes.
- Justify the cost. Multi-cloud isn't free. Calculate the engineering overhead.
- Plan for Day 2 operations. Multi-cloud is hardest during incidents, not deployment.
- Invest in abstractions. Whether Kubernetes, Terraform modules, or service meshes, abstractions reduce cloud-specific code.
The best multi-cloud strategy is the one that solves your actual problems without creating new ones. Be honest about whether you need it, rigorous in implementation if you do, and ruthless about cutting complexity wherever possible.
Next Steps
If you're implementing multi-cloud:
- Document your why. Write down the specific problems multi-cloud solves for you.
- Calculate the cost. Include engineering time, not just infrastructure.
- Start with the smallest useful implementation. Don't over-engineer from day one.
- Standardize before scaling. Get your patterns right with one workload before replicating.
- Measure everything. Track costs, incidents, and time spent per cloud.
Additional Resources
- Google Cloud Hybrid and Multicloud Architecture Patterns
- AWS Prescriptive Guidance: Multi-Cloud Strategy
- Aviatrix Multi-Cloud Networking Documentation
- CockroachDB Multi-Region Architecture
Questions about multi-cloud architecture? Share your experiences in the comments below.