Multi-Cloud Infrastructure Patterns: Strategy and Implementation
Prerequisites
- Strong understanding of at least one cloud provider (AWS, GCP, or Azure)
- Infrastructure as Code experience (Terraform or similar)
- Networking fundamentals (VPCs, routing, VPNs)
- Experience managing production infrastructure
Introduction
Multi-cloud is a loaded term. For some teams, it's a strategic advantage that prevents vendor lock-in and optimizes costs. For others, it's an operational nightmare that multiplies complexity without delivering value.
This guide cuts through the hype to explore when multi-cloud makes sense, what patterns actually work in production, and how to implement them without drowning in complexity. Whether you're evaluating a multi-cloud strategy or already running workloads across multiple providers, you'll learn patterns that reduce operational burden while maximizing the benefits.
The Multi-Cloud Reality Check
Before diving into patterns, let's establish some ground truth.
When Multi-Cloud Makes Sense
1. Customer Requirements
Your customers demand it. Financial institutions often require specific clouds for compliance. Government contracts may mandate certain providers. Enterprise customers might need you to run in their preferred cloud.
Real Example: A SaaS company runs primary infrastructure on AWS but
deploys dedicated instances on GCP and Azure for enterprise customers
who have existing cloud commitments and need data residency in their
tenants.
2. Risk Mitigation for Critical Systems
True redundancy against cloud provider outages. Not just multi-region, but multi-cloud.
Real Example: A payment processor runs identical infrastructure on AWS
and GCP with automatic failover. During the 2021 AWS us-east-1 outage,
they shifted 100% of traffic to GCP in under 2 minutes.
3. Best-of-Breed Services
Certain clouds excel at specific services. GCP's data analytics and ML tools. AWS's breadth of services. Azure's enterprise integration and hybrid capabilities.
Real Example: A data science company runs training workloads on GCP
for TPU access and superior BigQuery performance, while hosting their
application and database on AWS for better regional coverage and
mature RDS offerings.
4. Cost Optimization at Scale
Once you're spending millions annually, optimizing across providers can save substantial money. Spot instance availability varies by provider. Certain workload types are cheaper on specific clouds.
Real Example: A video transcoding company processes uploads on whichever
cloud has the cheapest spot instances at that moment, using a job queue
system that dispatches to AWS, GCP, or Azure based on current pricing.
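The dispatch decision above can be sketched in a few lines. This is a toy illustration, not the company's actual system; the provider names and prices are placeholders for what a real dispatcher would fetch from each provider's spot pricing API.

```python
# Hypothetical dispatcher: route a job to whichever cloud currently offers
# the cheapest spot price for the required instance class.
SPOT_PRICES_USD_PER_HOUR = {  # placeholder values; would come from pricing APIs
    "aws": 0.31,
    "gcp": 0.27,
    "azure": 0.29,
}

def cheapest_cloud(prices: dict[str, float]) -> str:
    """Return the provider with the lowest current spot price."""
    return min(prices, key=prices.get)

job_target = cheapest_cloud(SPOT_PRICES_USD_PER_HOUR)
print(job_target)  # with the sample prices above: gcp
```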
When Multi-Cloud is a Mistake
1. Premature Optimization
Your startup isn't ready for multi-cloud if you have:
- Less than $50k monthly cloud spend
- Fewer than 5 infrastructure engineers
- No standardized deployment pipeline
- Single-region footprint
Complexity will outweigh benefits. Master one cloud first.
2. "Avoiding Vendor Lock-In" Without Math
The lock-in fear is often irrational. Calculate the actual cost:
Scenario: You have 100 engineers. Each spends 20% more time managing
multi-cloud complexity.
Cost: 100 engineers × $150k salary × 0.20 = $3M/year
Does multi-cloud save you $3M annually? If not, you're losing money
"avoiding lock-in."
3. Technology Resume Building
Engineers want to learn multiple clouds. Great for learning, terrible business justification. Training environments and side projects serve this purpose without production complexity.
4. Surface-Level Multi-Cloud
Running a monitoring SaaS on one cloud and your application on another isn't multi-cloud architecture. It's using SaaS products that happen to be cloud-based. Real multi-cloud means your core workloads span providers.
Multi-Cloud Pattern Categories
Pattern 1: The Isolated Island Pattern
Description: Each cloud runs independent workloads with no inter-cloud communication.
Architecture:
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│         AWS         │   │         GCP         │   │        Azure        │
│                     │   │                     │   │                     │
│     Customer A      │   │     Customer B      │   │     Customer C      │
│     Workloads       │   │     Workloads       │   │     Workloads       │
│                     │   │                     │   │                     │
└──────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
           │                         │                         │
           └─────────────────────────┼─────────────────────────┘
                                     │
                          Central Control Plane
                          (CI/CD, monitoring, etc.)
When to Use:
- B2B SaaS with customer-specific deployments
- Regional isolation requirements
- Independent scaling per customer/region
- No shared data between clouds
Implementation:
# Terraform structure for isolated islands
project/
├── infrastructure/
│   ├── aws/
│   │   ├── customer-a/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── modules/
│   │       ├── application/
│   │       ├── database/
│   │       └── networking/
│   ├── gcp/
│   │   ├── customer-b/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── modules/
│   │       ├── application/
│   │       ├── database/
│   │       └── networking/
│   └── azure/
│       └── customer-c/
└── control-plane/
    ├── ci-cd/
    └── monitoring/
Key Terraform Pattern:
# modules/application/main.tf (cloud-agnostic interface)
variable "environment" {
description = "Environment name"
type = string
}
variable "app_config" {
description = "Application configuration"
type = object({
instance_type = string
instance_count = number
app_port = number
})
}
variable "network_config" {
description = "Network configuration"
type = object({
vpc_id = string
subnet_ids = list(string)
})
}
# AWS implementation
# infrastructure/aws/modules/application/main.tf
resource "aws_instance" "app" {
count = var.app_config.instance_count
instance_type = var.app_config.instance_type
subnet_id = var.network_config.subnet_ids[count.index % length(var.network_config.subnet_ids)]
# ... AWS-specific configuration
}
# GCP implementation
# infrastructure/gcp/modules/application/main.tf
resource "google_compute_instance" "app" {
count = var.app_config.instance_count
machine_type = var.app_config.instance_type
zone = var.zones[count.index % length(var.zones)]
# ... GCP-specific configuration
}
Pros:
- Simplest multi-cloud pattern
- No cross-cloud networking complexity
- Independent failure domains
- Easy cost attribution
Cons:
- Can't share resources or data
- Duplicate infrastructure code
- No cross-cloud redundancy
Pattern 2: The Active-Passive Disaster Recovery Pattern
Description: Primary workload on one cloud, standby replica on another for disaster recovery.
Architecture:
┌──────────────────────────────┐
│        AWS (Primary)         │
│                              │
│   ┌────────────┐             │
│   │Application │             │
│   │  Cluster   │             │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │  Database  │─────────────┼────▶ Continuous Replication
│   │   (RDS)    │             │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │   Object   │─────────────┼────▶ Cross-Cloud Replication
│   │  Storage   │             │
│   └────────────┘             │
└──────────────┬───────────────┘
               │
               │ Failover (Manual or Automatic)
               ▼
┌──────────────────────────────┐
│        GCP (Standby)         │
│                              │
│   ┌────────────┐             │
│   │Application │ (Scaled to  │
│   │  Cluster   │  zero until │
│   └──────┬─────┘  failover)  │
│          │                   │
│   ┌──────┴─────┐             │
│   │  Database  │ (Read       │
│   │ (CloudSQL) │  replica)   │
│   └──────┬─────┘             │
│          │                   │
│   ┌──────┴─────┐             │
│   │   Object   │ (Replicated │
│   │  Storage   │  data)      │
│   └────────────┘             │
└──────────────────────────────┘
When to Use:
- Need protection against cloud provider outages
- Can tolerate RPO of minutes to hours
- Can tolerate RTO of minutes to hours
- Cost-sensitive (standby is cheaper than active-active)
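Before committing to this pattern, it's worth sanity-checking measured replication lag and failover time against the objectives you've stated. A minimal sketch (the numbers are illustrative):

```python
# RPO is bounded by replication lag; RTO by total failover duration.
def meets_dr_objectives(replication_lag_s: float, failover_duration_s: float,
                        rpo_target_s: float, rto_target_s: float) -> bool:
    """True if measured lag and failover time fit within the stated RPO/RTO."""
    return (replication_lag_s <= rpo_target_s
            and failover_duration_s <= rto_target_s)

# e.g. 90 s of lag and a 20-minute failover vs. 5 min RPO / 1 h RTO targets
ok = meets_dr_objectives(90, 20 * 60, rpo_target_s=300, rto_target_s=3600)
print(ok)  # True
```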
Implementation:
# AWS Primary Database
resource "aws_db_instance" "primary" {
identifier = "app-primary"
engine = "postgres"
engine_version = "14.7"
instance_class = "db.r6g.xlarge"
backup_retention_period = 7
# Enable automated backups for DR
backup_window = "03:00-04:00"
tags = {
Role = "primary"
DRTarget = "gcp-standby"
}
}
# Export backups to S3 for cross-cloud restore
resource "aws_db_snapshot" "automated" {
# Taken automatically by RDS
}
# Lambda to copy snapshots to S3 and trigger GCP restore
resource "aws_lambda_function" "snapshot_export" {
function_name = "export-snapshot-to-gcs"
# Exports RDS snapshot to S3
# Copies to GCS using cross-cloud credentials
# Triggers GCP Cloud Function to restore to standby
}
# GCP Standby Database
resource "google_sql_database_instance" "standby" {
name = "app-standby"
database_version = "POSTGRES_14"
settings {
    tier = "db-custom-4-16384" # Smaller than the primary; scale up on failover
    activation_policy = "NEVER" # Don't run unless activated
}
}
# Restore mechanism
resource "google_cloudfunctions_function" "restore_from_backup" {
name = "restore-db-from-s3"
description = "Restore database from AWS backup"
runtime = "python39"
# Triggered by snapshot copy completion
# Restores backup to standby instance
# Updates DNS to point to GCP
}
Failover Orchestration:
# failover.py - Automated DR failover script
# Illustrative sketch: the Cloud SQL Admin calls below are simplified.
# The real google.cloud.sql_v1 client methods take project/instance
# arguments and request objects; consult the client library docs.
import time

import boto3
import google.cloud.sql_v1


class MultiCloudDR:
    def __init__(self):
        self.aws_client = boto3.client('rds')
        self.gcp_client = google.cloud.sql_v1.SqlInstancesServiceClient()

    def check_primary_health(self):
        """Check if AWS primary is healthy"""
        try:
            response = self.aws_client.describe_db_instances(
                DBInstanceIdentifier='app-primary'
            )
            status = response['DBInstances'][0]['DBInstanceStatus']
            return status == 'available'
        except Exception as e:
            print(f"Primary health check failed: {e}")
            return False

    def activate_standby(self):
        """Activate GCP standby"""
        print("Activating GCP standby instance...")
        # 1. Activate the Cloud SQL instance
        instance = self.gcp_client.get(name='app-standby')
        instance.settings.activation_policy = 'ALWAYS'
        self.gcp_client.update(instance=instance)
        # 2. Wait for instance to be ready
        while True:
            instance = self.gcp_client.get(name='app-standby')
            if instance.state == 'RUNNABLE':
                break
            time.sleep(10)
        # 3. Promote read replica to primary
        self.gcp_client.promote_replica(name='app-standby')
        # 4. Scale up compute resources
        instance.settings.tier = 'db-custom-8-32768'
        self.gcp_client.update(instance=instance)
        # 5. Update DNS to point to GCP
        self.update_dns_to_gcp()
        print("Standby activated and serving traffic")

    def update_dns_to_gcp(self):
        """Update DNS to point to GCP instance"""
        # Update Route53 / Cloud DNS records
        pass

    def failover(self):
        """Execute failover to GCP"""
        if not self.check_primary_health():
            print("Primary unhealthy, initiating failover...")
            self.activate_standby()
        else:
            print("Primary is healthy, no action needed")


# Run health check every minute
if __name__ == '__main__':
    dr = MultiCloudDR()
    while True:
        dr.failover()
        time.sleep(60)
Pros:
- Protects against cloud provider outages
- Lower cost than active-active (standby scaled down)
- Clear primary/secondary model
Cons:
- Failover time (RTO) of minutes to hours
- Data loss window (RPO) depends on replication lag
- Manual testing required to ensure DR actually works
- Standby infrastructure costs even when unused
Pattern 3: The Active-Active Multi-Cloud Pattern
Description: Workloads actively running on multiple clouds simultaneously with traffic distributed across them.
Architecture:
                     Global Traffic Manager
                    (DNS-based or Anycast)
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
     ┌───────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
     │     AWS      │   │     GCP     │   │    Azure    │
     │   (33.3%)    │   │   (33.3%)   │   │   (33.3%)   │
     └───────┬──────┘   └──────┬──────┘   └──────┬──────┘
             │                 │                 │
     ┌───────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐
     │ Application  │   │ Application │   │ Application │
     │   Cluster    │   │   Cluster   │   │   Cluster   │
     └───────┬──────┘   └──────┬──────┘   └──────┬──────┘
             │                 │                 │
             └─────────────────┼─────────────────┘
                               │
                     Distributed Database
                   (CockroachDB, Spanner, etc.)
When to Use:
- Cannot tolerate downtime from cloud provider outages
- High availability requirements (99.99%+)
- Global user base needing low latency everywhere
- Budget supports 2-3x infrastructure costs
Critical Requirements:
- Stateless Application Design
Applications must be completely stateless. Any state belongs in distributed databases or caches.
// Bad: Session state in local memory
type AppServer struct {
	sessions map[string]*Session // Lost on instance restart
}

// Good: Session state in distributed cache
type AppServer struct {
	redis *redis.ClusterClient // Works across clouds
}

func (s *AppServer) GetSession(id string) (*Session, error) {
	// Fetch from Redis cluster spanning AWS, GCP, Azure
	val, err := s.redis.Get(context.Background(), id).Result()
	if err != nil {
		return nil, err
	}
	// Deserialize and return
	var sess Session
	if err := json.Unmarshal([]byte(val), &sess); err != nil {
		return nil, err
	}
	return &sess, nil
}
- Globally Distributed Database
You need a database that handles multi-cloud replication:
Option A: CockroachDB
-- Create table with multi-region replication
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email STRING NOT NULL,
created_at TIMESTAMP DEFAULT now()
);
-- Configure multi-region
ALTER DATABASE app PRIMARY REGION "aws-us-east-1";
ALTER DATABASE app ADD REGION "gcp-us-central1";
ALTER DATABASE app ADD REGION "azure-eastus";
-- Set survival goals
ALTER DATABASE app SURVIVE REGION FAILURE;
# Terraform for CockroachDB across clouds (illustrative; check the
# CockroachDB Cloud Terraform provider docs for the current resource schema)
resource "cockroachdb_cluster" "multi_cloud" {
name = "production"
cloud_provider = "MULTI_CLOUD"
regions = [
{
name = "aws-us-east-1"
node_count = 3
},
{
name = "gcp-us-central1"
node_count = 3
},
{
name = "azure-eastus"
node_count = 3
}
]
hardware = {
machine_type = "n1-standard-4"
storage_gib = 500
}
}
Option B: Google Cloud Spanner (Multi-Region)
resource "google_spanner_instance" "multi_region" {
name = "production"
config = "nam-eur-asia1" # Multi-region configuration
display_name = "Production Multi-Region"
num_nodes = 9 # Spread across the configuration's regions
}
Option C: Self-Managed PostgreSQL with Multi-Master
# AWS RDS Postgres
resource "aws_db_instance" "postgres_aws" {
identifier = "postgres-aws"
engine = "postgres"
instance_class = "db.r6g.2xlarge"
# Enable logical replication
parameter_group_name = aws_db_parameter_group.postgres_replication.name
}
# GCP Cloud SQL Postgres
resource "google_sql_database_instance" "postgres_gcp" {
name = "postgres-gcp"
database_version = "POSTGRES_14"
settings {
tier = "db-custom-8-32768"
database_flags {
name = "wal_level"
value = "logical"
}
}
}
# Azure PostgreSQL
resource "azurerm_postgresql_flexible_server" "postgres_azure" {
name = "postgres-azure"
resource_group_name = azurerm_resource_group.main.name
location = "eastus"
sku_name = "GP_Standard_D4s_v3"
}
# Configure bidirectional replication using PostgreSQL logical replication
# Requires pglogical or similar extension
- Global Traffic Distribution
# Using Cloudflare for multi-cloud load balancing
resource "cloudflare_load_balancer" "app" {
zone_id = var.cloudflare_zone_id
name = "app.example.com"
default_pool_ids = [
cloudflare_load_balancer_pool.aws.id,
cloudflare_load_balancer_pool.gcp.id,
cloudflare_load_balancer_pool.azure.id,
]
steering_policy = "geo" # or "random", "dynamic_latency"
session_affinity = "cookie"
}
resource "cloudflare_load_balancer_pool" "aws" {
name = "aws-us-east-1"
origins {
name = "aws-origin-1"
address = aws_lb.app.dns_name
enabled = true
}
check_regions = ["WNAM", "ENAM"]
# Health check
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_pool" "gcp" {
name = "gcp-us-central1"
origins {
name = "gcp-origin-1"
address = google_compute_forwarding_rule.app.ip_address
enabled = true
}
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_pool" "azure" {
name = "azure-eastus"
origins {
name = "azure-origin-1"
address = azurerm_public_ip.app.ip_address # LB frontend public IP
enabled = true
}
monitor = cloudflare_load_balancer_monitor.http.id
}
resource "cloudflare_load_balancer_monitor" "http" {
type = "http"
path = "/health"
port = 443
interval = 60
timeout = 5
retries = 2
expected_codes = "200"
}
Pros:
- True high availability across cloud providers
- No failover delay
- Can optimize latency per region
- Load distribution for performance
Cons:
- Highest cost (3x infrastructure)
- Operational complexity
- Data consistency challenges
- Requires significant engineering investment
Pattern 4: The Hybrid Best-of-Breed Pattern
Description: Use each cloud for what it does best, with controlled integration points.
Architecture:
┌───────────────────────────────────────────────────┐
│               AWS (Primary Platform)              │
│                                                   │
│   ┌────────────┐   ┌──────────┐   ┌──────────┐    │
│   │Application │   │PostgreSQL│   │  Redis   │    │
│   │   (ECS)    │   │  (RDS)   │   │(ElastiC.)│    │
│   └─────┬──────┘   └────┬─────┘   └────┬─────┘    │
└─────────┼───────────────┼──────────────┼──────────┘
          │               │              │
┌─────────▼──────────┐ ┌──▼────────────┐ ┌▼─────────────┐
│ GCP (ML/Analytics) │ │ GCP BigQuery  │ │  GCP Cloud   │
│                    │ │    (Data      │ │   Storage    │
│  ┌──────────────┐  │ │  Warehouse)   │ │ (Data Lake)  │
│  │  Vertex AI   │  │ └───────────────┘ └──────────────┘
│  │   Training   │  │
│  └──────────────┘  │
└────────────────────┘
When to Use:
- Need best-in-class services from different clouds
- Can isolate workloads by function
- Have budget for specialized services
- Team has expertise in multiple clouds
Implementation Example: ML Pipeline on GCP with Application on AWS
# Application runs on AWS, triggers ML training on GCP
import boto3
import google.cloud.aiplatform as aiplatform
from google.cloud import storage
from google.cloud import storage_transfer


class MultiCloudMLPipeline:
    def __init__(self):
        # AWS clients
        self.s3 = boto3.client('s3')
        self.sqs = boto3.client('sqs')
        # GCP clients
        self.gcs = storage.Client()
        aiplatform.init(project='my-project', location='us-central1')

    def trigger_training(self, dataset_id):
        """Trigger ML training job on GCP with data from AWS"""
        # 1. Export data from AWS RDS to S3
        self.export_data_to_s3(dataset_id)
        # 2. Transfer from S3 to GCS
        self.transfer_s3_to_gcs(
            f's3://my-bucket/datasets/{dataset_id}',
            f'gs://my-gcp-bucket/datasets/{dataset_id}'
        )
        # 3. Create Vertex AI training job
        job = aiplatform.CustomTrainingJob(
            display_name=f'training-{dataset_id}',
            script_path='train.py',
            container_uri='gcr.io/my-project/training:latest',
            requirements=['tensorflow==2.12.0'],
        )
        # 4. Run training
        model = job.run(
            dataset=f'gs://my-gcp-bucket/datasets/{dataset_id}',
            replica_count=4,
            machine_type='n1-highmem-8',
            accelerator_type='NVIDIA_TESLA_T4',
            accelerator_count=1,
        )
        # 5. Export model back to AWS S3 for serving
        self.export_model_to_s3(model)
        return model

    def transfer_s3_to_gcs(self, s3_uri, gcs_uri):
        """Transfer data from S3 to GCS via GCP Storage Transfer Service"""
        # Parse "s3://bucket/prefix" and "gs://bucket/prefix"
        s3_bucket, s3_path = s3_uri.removeprefix('s3://').split('/', 1)
        gcs_bucket, gcs_path = gcs_uri.removeprefix('gs://').split('/', 1)

        transfer_client = storage_transfer.StorageTransferServiceClient()
        transfer_job = {
            'description': 'Transfer dataset for ML training',
            'status': 'ENABLED',
            'project_id': 'my-project',
            'transfer_spec': {
                'aws_s3_data_source': {
                    'bucket_name': s3_bucket,
                    'path': f'{s3_path}/',
                    # Prefer role-based access in production; inline keys
                    # shown only for brevity
                    'aws_access_key': {
                        'access_key_id': 'ACCESS_KEY',
                        'secret_access_key': 'SECRET_KEY'
                    }
                },
                'gcs_data_sink': {
                    'bucket_name': gcs_bucket,
                    'path': f'{gcs_path}/'
                }
            }
        }
        result = transfer_client.create_transfer_job({'transfer_job': transfer_job})
        return result
Terraform for Hybrid Infrastructure:
# AWS Application Infrastructure
module "aws_application" {
source = "./modules/aws-app"
providers = {
aws = aws
}
environment = var.environment
vpc_cidr = "10.0.0.0/16"
}
# GCP ML Infrastructure
module "gcp_ml_platform" {
source = "./modules/gcp-ml"
providers = {
google = google
}
project_id = var.gcp_project_id
region = "us-central1"
# Reference to AWS resources for data transfer
aws_s3_bucket = module.aws_application.data_bucket_name
}
# Cross-cloud IAM for service accounts
resource "aws_iam_role" "gcp_access" {
name = "gcp-ml-service-access"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = "accounts.google.com"
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"accounts.google.com:aud" = var.gcp_service_account_email
}
}
}]
})
}
resource "aws_iam_role_policy" "gcp_s3_access" {
role = aws_iam_role.gcp_access.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:ListBucket"
]
Resource = [
module.aws_application.data_bucket_arn,
"${module.aws_application.data_bucket_arn}/*"
]
}]
})
}
Pros:
- Use best services from each cloud
- Can specialize team expertise
- Optimize costs per workload type
- Flexibility to migrate over time
Cons:
- Complex data movement between clouds
- Higher egress costs
- Multiple operational models
- Team needs multi-cloud expertise
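Egress costs deserve a concrete estimate before you adopt this pattern. A crude sketch (the per-GB rate is a placeholder; real rates vary by provider, region, and volume tier):

```python
# Rough monthly egress cost for recurring cross-cloud data movement.
def monthly_egress_cost(gb_per_day: float, usd_per_gb: float = 0.09) -> float:
    """Estimate monthly egress spend for a daily transfer volume."""
    return gb_per_day * 30 * usd_per_gb

# Moving 500 GB of training data per day from one cloud to another:
print(f"${monthly_egress_cost(500):,.0f}/month")  # $1,350/month
```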
Cross-Cloud Networking Strategies
Networking is where multi-cloud gets difficult. Here are proven approaches.
Strategy 1: VPN Mesh
Simple but Limited:
# AWS side
resource "aws_customer_gateway" "gcp" {
bgp_asn = 65000
ip_address = google_compute_address.vpn_gw.address
type = "ipsec.1"
}
resource "aws_vpn_connection" "to_gcp" {
customer_gateway_id = aws_customer_gateway.gcp.id
vpn_gateway_id = aws_vpn_gateway.main.id
type = "ipsec.1"
static_routes_only = false
}
# GCP side
resource "google_compute_vpn_gateway" "aws" {
name = "aws-vpn-gateway"
network = google_compute_network.main.id
}
resource "google_compute_vpn_tunnel" "to_aws" {
name = "to-aws"
peer_ip = aws_vpn_connection.to_gcp.tunnel1_address
shared_secret = aws_vpn_connection.to_gcp.tunnel1_preshared_key
target_vpn_gateway = google_compute_vpn_gateway.aws.id
ike_version = 2
}
Pros: Simple, uses built-in cloud features.
Cons: Limited performance, not highly available, full-mesh scaling gets complex.
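The mesh-scaling problem is easy to quantify: a full mesh between n networks needs n(n-1)/2 tunnel pairs, each with its own keys, routes, and monitoring.

```python
# Tunnel count for a full VPN mesh between n networks.
def full_mesh_tunnels(n: int) -> int:
    return n * (n - 1) // 2

for n in (3, 5, 10):
    print(n, full_mesh_tunnels(n))  # 3→3, 5→10, 10→45
```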
Strategy 2: SD-WAN Overlay
Use a third-party SD-WAN solution:
Options:
- Aviatrix
- Alkira
- Cisco Viptela
- VMware SD-WAN
# Example with Aviatrix
resource "aviatrix_transit_gateway" "aws" {
cloud_type = 1 # AWS
account_name = "aws-account"
gw_name = "aws-transit"
vpc_id = aws_vpc.main.id
vpc_reg = "us-east-1"
gw_size = "c5n.2xlarge"
subnet = aws_subnet.transit.cidr_block
}
resource "aviatrix_transit_gateway" "gcp" {
cloud_type = 4 # GCP
account_name = "gcp-account"
gw_name = "gcp-transit"
vpc_id = google_compute_network.main.name
vpc_reg = "us-central1"
gw_size = "n1-standard-4"
subnet = google_compute_subnetwork.transit.ip_cidr_range
}
resource "aviatrix_transit_gateway_peering" "aws_gcp" {
transit_gateway_name1 = aviatrix_transit_gateway.aws.gw_name
transit_gateway_name2 = aviatrix_transit_gateway.gcp.gw_name
}
Pros: High performance, simplified management, native HA.
Cons: Additional cost, vendor dependency.
Strategy 3: Direct Interconnects
High-bandwidth dedicated connections:
# AWS Direct Connect
resource "aws_dx_connection" "main" {
name = "cross-cloud-interconnect"
bandwidth = "10Gbps"
location = "EqDC2" # Equinix datacenter
}
# GCP Partner Interconnect at same colocation
resource "google_compute_interconnect_attachment" "aws" {
name = "to-aws"
type = "PARTNER"
router = google_compute_router.main.id
region = "us-central1"
admin_enabled = true
}
Pros: Lowest latency, highest bandwidth, predictable performance.
Cons: Expensive, complex setup, limited availability.
Identity and Access Management Across Clouds
Pattern: Centralized Identity Provider
Use a single identity source with federation to all clouds:
# Okta as central IDP
resource "okta_app_saml" "aws" {
label = "AWS"
# AWS SAML configuration
app_settings_json = jsonencode({
aws_environment_type = "aws.amazon"
})
}
resource "okta_app_saml" "gcp" {
label = "Google Cloud Platform"
# GCP SAML configuration
}
resource "okta_app_saml" "azure" {
label = "Microsoft Azure"
# Azure AD integration
}
# Assign groups to applications
resource "okta_app_group_assignment" "aws_admin" {
app_id = okta_app_saml.aws.id
group_id = okta_group.platform_team.id
}
Pattern: Service Account Federation
Allow services in one cloud to access another:
# GCP service account that can assume AWS role
resource "google_service_account" "aws_access" {
account_id = "aws-resource-access"
display_name = "Service account for AWS access"
}
# AWS IAM role that GCP can assume
resource "aws_iam_role" "gcp_access" {
name = "gcp-service-access"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = "accounts.google.com"
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"accounts.google.com:sub" = google_service_account.aws_access.unique_id
}
}
}]
})
}
# GCP Workload Identity to AWS
resource "google_iam_workload_identity_pool" "aws" {
workload_identity_pool_id = "aws-pool"
display_name = "AWS Identity Pool"
}
resource "google_iam_workload_identity_pool_provider" "aws" {
workload_identity_pool_id = google_iam_workload_identity_pool.aws.workload_identity_pool_id
workload_identity_pool_provider_id = "aws-provider"
display_name = "AWS Provider"
aws {
account_id = var.aws_account_id
}
attribute_mapping = {
"google.subject" = "assertion.arn"
"attribute.aws_role" = "assertion.arn.extract('assumed-role/{role}/')"
}
}
Cost Management in Multi-Cloud
Multi-cloud amplifies cost management challenges.
Unified Cost Tracking
# Tag strategy across clouds
locals {
common_tags = {
Environment = var.environment
Team = var.team
CostCenter = var.cost_center
ManagedBy = "Terraform"
Project = var.project_name
}
}
# AWS resources
resource "aws_instance" "app" {
# ... configuration
tags = merge(
local.common_tags,
{
Name = "app-server"
Cloud = "AWS"
}
)
}
# GCP resources (label keys and values must be lowercase)
resource "google_compute_instance" "app" {
  # ... configuration
  labels = merge(
    { for k, v in local.common_tags : lower(k) => lower(v) },
    {
      name  = "app-server"
      cloud = "gcp"
    }
  )
}
# Azure resources
resource "azurerm_virtual_machine" "app" {
# ... configuration
tags = merge(
local.common_tags,
{
Name = "app-server"
Cloud = "Azure"
}
)
}
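Because each provider constrains tag and label syntax differently (GCP labels, for instance, must be lowercase), it can help to keep one canonical tag set and adapt it per cloud. A minimal sketch, assuming AWS and Azure accept the tags as-is:

```python
# Normalize a canonical tag set to each provider's constraints.
def normalize_tags(tags: dict[str, str], cloud: str) -> dict[str, str]:
    if cloud == "gcp":
        # GCP labels: lowercase keys/values, no spaces
        return {k.lower(): v.lower().replace(" ", "-") for k, v in tags.items()}
    return tags  # assumption: AWS and Azure accept mixed case

common = {"Environment": "Production", "CostCenter": "Platform Eng"}
print(normalize_tags(common, "gcp"))
```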
Multi-Cloud Cost Tools
CloudHealth by VMware:
- Unified dashboard across AWS, Azure, GCP
- Reserved instance recommendations
- Budget alerts
CloudZero:
- Kubernetes cost allocation across clouds
- Unit economics tracking
- Anomaly detection
Custom Solution with OpenCost:
# OpenCost configuration for multi-cloud
apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost-config
data:
  cloud-integration: |
    aws:
      enabled: true
      account_id: "123456789"
    gcp:
      enabled: true
      project_id: "my-project"
    azure:
      enabled: true
      subscription_id: "sub-id"
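Whatever tool you pick, the core of unified cost tracking is the same rollup: fold per-cloud billing records into spend per cost center, regardless of provider. A minimal sketch with illustrative field names:

```python
# Fold per-cloud billing records into spend per cost center.
from collections import defaultdict

def rollup_by_cost_center(records: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["cost_center"]] += r["usd"]
    return dict(totals)

records = [
    {"cloud": "aws",   "cost_center": "platform", "usd": 1200.0},
    {"cloud": "gcp",   "cost_center": "ml",       "usd": 800.0},
    {"cloud": "azure", "cost_center": "platform", "usd": 300.0},
]
print(rollup_by_cost_center(records))  # {'platform': 1500.0, 'ml': 800.0}
```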
Operational Complexity Management
Standardization is Key
Standard Module Interface:
# modules/compute-cluster/interface.tf
variable "cloud_provider" {
description = "Cloud provider"
type = string
validation {
condition = contains(["aws", "gcp", "azure"], var.cloud_provider)
error_message = "Must be aws, gcp, or azure"
}
}
variable "cluster_config" {
description = "Cluster configuration"
type = object({
name = string
instance_type = string
instance_count = number
disk_size_gb = number
})
}
variable "network_config" {
description = "Network configuration"
type = object({
vpc_id = string
subnet_ids = list(string)
})
}
# Provider-specific implementations
module "aws_cluster" {
count = var.cloud_provider == "aws" ? 1 : 0
source = "./aws"
cluster_config = var.cluster_config
network_config = var.network_config
}
module "gcp_cluster" {
count = var.cloud_provider == "gcp" ? 1 : 0
source = "./gcp"
cluster_config = var.cluster_config
network_config = var.network_config
}
module "azure_cluster" {
count = var.cloud_provider == "azure" ? 1 : 0
source = "./azure"
cluster_config = var.cluster_config
network_config = var.network_config
}
Multi-Cloud Observability
Centralized Logging and Monitoring:
# Grafana Cloud for multi-cloud observability
---
# AWS CloudWatch integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-aws-cloudwatch
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: AWS CloudWatch
        type: cloudwatch
        access: proxy
        jsonData:
          authType: ec2_iam_role
          defaultRegion: us-east-1
---
# GCP Cloud Monitoring integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-gcp-monitoring
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: GCP Monitoring
        type: stackdriver
        access: proxy
        jsonData:
          authenticationType: gce
---
# Azure Monitor integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-azure-monitor
data:
  datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Azure Monitor
        type: grafana-azure-monitor-datasource
        access: proxy
        jsonData:
          cloudName: azuremonitor
Unified Alerting:
# Multi-cloud alert aggregation
from datadog import initialize, api

initialize(api_key='API_KEY', app_key='APP_KEY')

# Create one alert that spans all three clouds via tag filters
alert = api.Monitor.create(
    type="metric alert",
    query=("avg(last_5m):avg:system.cpu.user"
           "{cloud:aws OR cloud:gcp OR cloud:azure} by {host,cloud} > 80"),
    name="High CPU across all clouds",
    message="CPU usage is high on {{host.name}} ({{cloud.name}})",
    tags=['multi-cloud', 'infrastructure']
)
Migration Strategies
Gradual Migration Pattern
Move workloads incrementally:
Phase 1: New workloads on target cloud

  AWS (Existing)        │  GCP (New)
  - Legacy App          │  - New Feature A
  - Database            │  - Analytics Pipeline

Phase 2: Migrate non-critical workloads

  AWS                   │  GCP
  - Legacy App          │  - New Feature A
  - Database            │  - Analytics
                        │  - Staging Env (migrated)

Phase 3: Migrate critical workloads

  AWS                   │  GCP
  - Database            │  - All Applications
                        │  - Analytics
                        │  - Production (migrated)
Strangler Fig Pattern
Gradually replace old system with new:
# Route traffic between old (AWS) and new (GCP) based on feature flags.
# Field names follow the Cloudflare Terraform provider; verify against the
# current schema before applying.
resource "cloudflare_load_balancer" "migration" {
  zone_id = var.cloudflare_zone_id
  name    = "app.example.com"

  default_pool_ids = [
    cloudflare_load_balancer_pool.aws_legacy.id,
    cloudflare_load_balancer_pool.gcp_new.id,
  ]
  fallback_pool_id = cloudflare_load_balancer_pool.aws_legacy.id
  steering_policy  = "random"

  # Default split: 90% AWS, 10% GCP (gradually increase GCP)
  random_steering {
    pool_weights = {
      (cloudflare_load_balancer_pool.aws_legacy.id) = 0.9
      (cloudflare_load_balancer_pool.gcp_new.id)    = 0.1
    }
  }

  # Users flagged into the new system always hit GCP
  rules {
    name      = "new-users-to-gcp"
    priority  = 1
    condition = "http.cookie contains \"new_system=true\""
    overrides {
      default_pools = [cloudflare_load_balancer_pool.gcp_new.id]
    }
  }
}
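The routing behavior this kind of configuration produces can be sketched as a few lines of logic: opted-in users always reach the new system, and everyone else is split by weight, shifted over time.

```python
# Strangler-fig routing sketch: cookie opt-in wins, otherwise weighted split.
import random

def route(cookie: str, gcp_weight: float = 0.10, rng=random.random) -> str:
    """Return which backend ('aws' or 'gcp') a request should hit."""
    if "new_system=true" in cookie:
        return "gcp"
    return "gcp" if rng() < gcp_weight else "aws"

print(route("new_system=true"))      # gcp
print(route("", rng=lambda: 0.99))   # aws (draw above the 10% weight)
print(route("", rng=lambda: 0.05))   # gcp (draw below the 10% weight)
```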
Conclusion
Multi-cloud is a tool, not a goal. Use it when it solves real problems:
- Customer requirements demand it
- Specific services justify the complexity
- Risk mitigation is worth the cost
Don't use it for:
- Theoretical vendor lock-in avoidance
- Resume building
- Because competitors do it
Key Takeaways:
- Start with one cloud. Master it before adding complexity.
- If you go multi-cloud, standardize everything. Common modules, common monitoring, common processes.
- Justify the cost. Multi-cloud isn't free. Calculate the engineering overhead.
- Plan for Day 2 operations. Multi-cloud is hardest during incidents, not deployment.
- Invest in abstractions. Whether Kubernetes, Terraform modules, or service meshes, abstractions reduce cloud-specific code.
The best multi-cloud strategy is the one that solves your actual problems without creating new ones. Be honest about whether you need it, rigorous in implementation if you do, and ruthless about cutting complexity wherever possible.
Next Steps
If you're implementing multi-cloud:
- Document your why. Write down the specific problems multi-cloud solves for you.
- Calculate the cost. Include engineering time, not just infrastructure.
- Start with the smallest useful implementation. Don't over-engineer from day one.
- Standardize before scaling. Get your patterns right with one workload before replicating.
- Measure everything. Track costs, incidents, and time spent per cloud.
Additional Resources
- Google Cloud Hybrid and Multicloud Architecture Patterns
- AWS Prescriptive Guidance: Multi-Cloud Strategy
- Aviatrix Multi-Cloud Networking Documentation
- CockroachDB Multi-Region Architecture
Questions about multi-cloud architecture? Share your experiences in the comments below.