Building a Self-Service Infrastructure Platform
Prerequisites
- Infrastructure as Code experience (Terraform or Pulumi)
- Understanding of CI/CD pipelines
- Basic web development (React/API development)
- Cloud provider expertise (AWS, GCP, or Azure)
- Experience operating production infrastructure
Introduction
Every growing engineering team hits the same bottleneck: developers need infrastructure, but only a handful of platform engineers know how to provision it safely. Ticket queues grow, context switching kills productivity, and your best engineers spend their time being human APIs instead of solving hard problems.
A self-service infrastructure platform solves this by giving developers the ability to provision infrastructure themselves, within guardrails set by the platform team. Done right, it accelerates teams while maintaining security and cost control. Done wrong, it creates technical debt and shadow IT.
This guide walks through building a production-grade self-service platform from first principles, including architecture decisions, implementation patterns, and code examples you can adapt.
Why Build a Self-Service Platform?
The Ticket Queue Problem
Without Self-Service:
Developer: "I need a PostgreSQL database for my new service"
  ↓ (Create Jira ticket)
Platform Team: Ticket sits in backlog for 3 days
  ↓ (Triage meeting)
Platform Engineer: Spends 30 minutes provisioning via Terraform
  ↓ (Manual steps, credentials, documentation)
Developer: Finally receives credentials, 5 days later
Total: 5 days, 30 minutes of engineering time
With Self-Service:
Developer: Fills out form in portal
  ↓ (Automated pipeline)
Platform: Provisions database, sets up monitoring, creates credentials
  ↓ (5 minutes)
Developer: Receives credentials and documentation
Total: 5 minutes, 0 human intervention
The Real Benefits
For Developers:
- Provision infrastructure in minutes, not days
- Iterate quickly without waiting for tickets
- Understand what they're getting (standardized offerings)
- Focus on building features, not fighting infrastructure
For Platform Teams:
- Stop being human APIs
- Enforce best practices automatically
- Reduce toil and repetitive work
- Focus on platform improvements, not individual requests
For the Organization:
- Faster time to market
- Consistent, secure infrastructure
- Cost visibility and control
- Scale engineering without proportional platform team growth
When You Need It
Build a self-service platform when:
- Your platform team gets 10+ infrastructure requests per week
- Developers wait days for basic infrastructure
- You're repeating the same Terraform setups constantly
- Inconsistency in infrastructure is causing operational issues
- You want to scale to 50+ engineering teams
Don't build one if:
- You have fewer than 20 engineers
- Infrastructure needs are highly variable and specialized
- Platform team can handle current load comfortably
- You lack engineering capacity to maintain the platform
Core Design Principles
1. Paved Roads, Not Walls
Provide a well-maintained path that's easier than going off-road, but don't block custom solutions when needed.
Good: "Here's a button for a standard database. Need something custom? Here's how to do it safely."
Bad: "You can only use these exact configurations. No exceptions."
2. Sensible Defaults, Easy Customization
Cover the 80% use case with defaults; allow overrides for the remaining 20%.
# Template with smart defaults
database:
  type: postgres
  version: 14                     # Pinned stable default
  instance_class: db.t3.medium    # Sensible for most services
  storage_gb: 100
  backup_retention_days: 7

  # But allow overrides
  custom_parameters:
    max_connections: 200          # Override if needed
    shared_buffers: "256MB"
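The defaults-plus-overrides pattern amounts to recursively overlaying a user's request onto the platform template. A minimal sketch, assuming a dict-based config (the field names mirror the YAML above; `merge_config` is an illustrative helper, not a platform API):

```python
from copy import deepcopy

DEFAULTS = {
    "type": "postgres",
    "version": 14,
    "instance_class": "db.t3.medium",
    "storage_gb": 100,
    "backup_retention_days": 7,
    "custom_parameters": {"max_connections": 100},
}

def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user overrides onto platform defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Caller only specifies what differs from the paved road
config = merge_config(DEFAULTS, {"custom_parameters": {"max_connections": 200}})
```

The deep copy matters: mutating the template in place would silently change the defaults for every subsequent request.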
3. Guardrails Over Gates
Prevent mistakes and enforce policy, but don't block legitimate use cases.
# Good: Validation with helpful errors
def validate_database_request(request):
    if request.storage_gb > 10000:
        if not request.has_approval("large-database-approval"):
            return Error(
                "Databases over 10TB require approval. "
                "Please reach out to #platform-team with your use case."
            )
    if request.instance_class.startswith("db.r5"):
        # Intel-based instance; the ARM equivalent is cheaper
        return Warning(
            "ARM instances are ~20% cheaper than Intel. "
            "Consider db.r6g instead of db.r5. "
            "Proceeding with your selection."
        )
4. Progressive Disclosure
Start simple, reveal complexity only when needed.
Basic Mode: [Database Type] [Name] [Environment] → Done
Advanced: Show 30+ configuration options
Expert: Direct Terraform/API access
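One way to implement the tiers above is to tag each catalog parameter with the lowest mode that surfaces it, then filter server-side before rendering the form. A hedged sketch; the mode names match the text, but the parameter list and `visible_parameters` helper are illustrative:

```python
PARAMETERS = [
    {"name": "database_name", "mode": "basic"},
    {"name": "environment", "mode": "basic"},
    {"name": "instance_class", "mode": "advanced"},
    {"name": "backup_retention_days", "mode": "advanced"},
    {"name": "custom_parameters", "mode": "expert"},
]

MODE_RANK = {"basic": 0, "advanced": 1, "expert": 2}

def visible_parameters(mode: str) -> list[dict]:
    """Return only the parameters the selected mode should display."""
    rank = MODE_RANK[mode]
    return [p for p in PARAMETERS if MODE_RANK[p["mode"]] <= rank]

# Basic mode shows only the two required fields; expert mode shows everything
basic = visible_parameters("basic")
```

Filtering on the server keeps the portal, CLI, and API views consistent with a single source of truth.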
5. Observable and Debuggable
Developers should understand what's happening and why things fail.
Bad: "Error: Failed to create database"
Good: "PostgreSQL provisioning failed at step 3 of 5: Security group creation
Error: sg-12345 already exists with conflicting rules
Solution: Check if database 'myapp-db' already exists
Logs: https://platform.company.com/logs/req-abc123"
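Errors like the good example above are easier to emit consistently if failures carry the step, cause, suggested fix, and log link as structured fields rather than a flat string. A minimal sketch, assuming illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class ProvisioningError:
    step: int
    total_steps: int
    step_name: str
    cause: str
    suggestion: str
    logs_url: str

    def render(self) -> str:
        """Format the structured error as a human-readable message."""
        return (
            f"Provisioning failed at step {self.step} of {self.total_steps}: {self.step_name}\n"
            f"Error: {self.cause}\n"
            f"Solution: {self.suggestion}\n"
            f"Logs: {self.logs_url}"
        )

err = ProvisioningError(
    step=3, total_steps=5, step_name="Security group creation",
    cause="sg-12345 already exists with conflicting rules",
    suggestion="Check if database 'myapp-db' already exists",
    logs_url="https://platform.company.com/logs/req-abc123",
)
```

Because the fields are structured, the same error object can also be returned as JSON from the API and rendered in the portal's status view.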
Architecture Overview
High-Level Components
┌─────────────────────────────────────────────────────────┐
│                  Developer Interface                    │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌───────────┐   │
│  │   Web   │  │   CLI   │  │   API   │  │ Terraform │   │
│  │ Portal  │  │  Tool   │  │ Direct  │  │  Module   │   │
│  └─────────┘  └─────────┘  └─────────┘  └───────────┘   │
└────────────────────────────┬────────────────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │        Platform API Layer         │
           │  - Request validation             │
           │  - Policy enforcement             │
           │  - Resource naming                │
           │  - Cost estimation                │
           └─────────────────┬─────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │        Provisioning Engine        │
           │  - Terraform/Pulumi               │
           │  - GitOps workflows               │
           │  - State management               │
           │  - Drift detection                │
           └─────────────────┬─────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │          Cloud Providers          │
           │          AWS, GCP, Azure          │
           └───────────────────────────────────┘

Parallel Services:
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Monitoring  │  │   Security   │  │     Cost     │
│  & Alerting  │  │   Scanning   │  │   Tracking   │
└──────────────┘  └──────────────┘  └──────────────┘
Technology Stack Options
Frontend (Developer Interface):
- Web Portal: Next.js, React, Backstage
- CLI: Go, Python (Click), Node.js
- IDE Integration: VS Code extension
Backend (API Layer):
- FastAPI (Python)
- Express (Node.js)
- Go (Gin, Echo)
Provisioning Engine:
- Terraform with custom wrapper
- Pulumi Automation API
- Crossplane
- AWS CDK / CDK for Terraform
Data Storage:
- PostgreSQL: Request history, metadata
- Redis: Cache, job queue
- S3/GCS: Terraform state, logs
Infrastructure:
- Kubernetes for the platform itself
- Serverless functions (Lambda, Cloud Functions) for lightweight operations
- GitLab/GitHub Actions for GitOps workflows
Implementation: Core Components
Component 1: Service Catalog
Define what developers can self-provision.
# catalog/postgres-database.yaml
name: PostgreSQL Database
description: Managed PostgreSQL database with automated backups and monitoring
category: databases
icon: postgresql
version: 1.2.0
parameters:
- name: database_name
type: string
required: true
pattern: "^[a-z][a-z0-9-]{2,30}$"
description: "Database identifier (lowercase, alphanumeric, hyphens)"
- name: environment
type: enum
required: true
options: [dev, staging, prod]
description: "Deployment environment"
- name: instance_class
type: enum
required: false
default: db.t3.medium
options:
- value: db.t3.small
label: "Small (2 vCPU, 2GB RAM) - $30/month"
- value: db.t3.medium
label: "Medium (2 vCPU, 4GB RAM) - $60/month"
- value: db.t3.large
label: "Large (2 vCPU, 8GB RAM) - $120/month"
- value: db.r6g.xlarge
label: "XLarge (4 vCPU, 32GB RAM) - $250/month"
description: "Instance size"
- name: storage_gb
type: integer
required: false
default: 100
min: 20
max: 1000
description: "Storage size in GB"
- name: backup_retention_days
type: integer
required: false
default: 7
min: 1
max: 35
description: "Days to retain automated backups"
- name: multi_az
type: boolean
required: false
default: false
description: "Enable multi-AZ for high availability (2x cost)"
- name: team_owner
type: string
required: true
description: "Team responsible for this database"
- name: cost_center
type: string
required: true
description: "Cost center for billing"
outputs:
- name: endpoint
description: "Database connection endpoint"
- name: port
description: "Database port"
- name: credentials_secret_arn
description: "AWS Secrets Manager ARN containing credentials"
- name: monitoring_dashboard
description: "CloudWatch dashboard URL"
cost_estimate:
monthly_min: 30
monthly_max: 300
formula: |
base = instance_class_cost
storage = storage_gb * 0.115
backup = storage_gb * backup_retention_days * 0.02
multi_az_multiplier = 2 if multi_az else 1
total = (base + storage + backup) * multi_az_multiplier
tags:
- managed-by: platform
- service-type: database
- database-engine: postgres
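A catalog entry like this can be validated against its own parameter schema before anything reaches the provisioning engine, so the one YAML file drives both the form and the server-side checks. A sketch using only the stdlib; the dict below mirrors a subset of the YAML above, and a real loader would parse the file with a YAML library:

```python
import re

CATALOG_PARAMS = [
    {"name": "database_name", "type": "string", "required": True,
     "pattern": r"^[a-z][a-z0-9-]{2,30}$"},
    {"name": "environment", "type": "enum", "required": True,
     "options": ["dev", "staging", "prod"]},
    {"name": "storage_gb", "type": "integer", "required": False,
     "default": 100, "min": 20, "max": 1000},
]

def validate_request(params: list[dict], request: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for p in params:
        value = request.get(p["name"], p.get("default"))
        if value is None:
            if p["required"]:
                errors.append(f"{p['name']} is required")
            continue
        if p["type"] == "string" and not re.match(p["pattern"], str(value)):
            errors.append(f"{p['name']} does not match {p['pattern']}")
        elif p["type"] == "enum" and value not in p["options"]:
            errors.append(f"{p['name']} must be one of {p['options']}")
        elif p["type"] == "integer" and not p["min"] <= int(value) <= p["max"]:
            errors.append(f"{p['name']} must be between {p['min']} and {p['max']}")
    return errors

errors = validate_request(CATALOG_PARAMS, {"database_name": "my-db", "environment": "qa"})
```

Returning all errors at once, instead of failing on the first, gives developers one round trip to fix their request.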
Component 2: Web Portal
Simple React form with validation:
// components/DatabaseRequestForm.tsx
import React, { useState } from 'react';
import { useMutation } from 'react-query';
import { createDatabase } from '../api/platform';
interface DatabaseFormData {
database_name: string;
environment: 'dev' | 'staging' | 'prod';
instance_class: string;
storage_gb: number;
team_owner: string;
cost_center: string;
}
export function DatabaseRequestForm() {
const [formData, setFormData] = useState<DatabaseFormData>({
database_name: '',
environment: 'dev',
instance_class: 'db.t3.medium',
storage_gb: 100,
team_owner: '',
cost_center: '',
});
const [costEstimate, setCostEstimate] = useState<number | null>(null);
const createMutation = useMutation(createDatabase, {
onSuccess: (data) => {
console.log('Database provisioning started:', data);
// Redirect to status page
window.location.href = `/requests/${data.request_id}`;
},
});
// Real-time cost estimation
const estimateCost = async () => {
const response = await fetch('/api/v1/catalog/postgres-database/estimate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(formData),
});
const { monthly_cost } = await response.json();
setCostEstimate(monthly_cost);
};
React.useEffect(() => {
estimateCost();
}, [formData.instance_class, formData.storage_gb]);
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
createMutation.mutate(formData);
};
return (
<form onSubmit={handleSubmit} className="space-y-6">
<div>
<label className="block text-sm font-medium">Database Name</label>
<input
type="text"
value={formData.database_name}
onChange={(e) => setFormData({ ...formData, database_name: e.target.value })}
pattern="^[a-z][a-z0-9-]{2,30}$"
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="my-service-db"
/>
<p className="mt-1 text-sm text-gray-500">
Lowercase letters, numbers, and hyphens only
</p>
</div>
<div>
<label className="block text-sm font-medium">Environment</label>
<select
value={formData.environment}
onChange={(e) => setFormData({ ...formData, environment: e.target.value as any })}
className="mt-1 block w-full rounded-md border-gray-300"
>
<option value="dev">Development</option>
<option value="staging">Staging</option>
<option value="prod">Production</option>
</select>
</div>
<div>
<label className="block text-sm font-medium">Instance Size</label>
<select
value={formData.instance_class}
onChange={(e) => setFormData({ ...formData, instance_class: e.target.value })}
className="mt-1 block w-full rounded-md border-gray-300"
>
<option value="db.t3.small">Small (2 vCPU, 2GB) - ~$30/month</option>
<option value="db.t3.medium">Medium (2 vCPU, 4GB) - ~$60/month</option>
<option value="db.t3.large">Large (2 vCPU, 8GB) - ~$120/month</option>
</select>
</div>
<div>
<label className="block text-sm font-medium">Storage (GB)</label>
<input
type="number"
value={formData.storage_gb}
onChange={(e) => setFormData({ ...formData, storage_gb: parseInt(e.target.value) })}
min={20}
max={1000}
className="mt-1 block w-full rounded-md border-gray-300"
/>
</div>
<div>
<label className="block text-sm font-medium">Team Owner</label>
<input
type="text"
value={formData.team_owner}
onChange={(e) => setFormData({ ...formData, team_owner: e.target.value })}
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="backend-team"
/>
</div>
<div>
<label className="block text-sm font-medium">Cost Center</label>
<input
type="text"
value={formData.cost_center}
onChange={(e) => setFormData({ ...formData, cost_center: e.target.value })}
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="CC-1234"
/>
</div>
{costEstimate !== null && (
<div className="rounded-md bg-blue-50 p-4">
<p className="text-sm text-blue-700">
Estimated cost: <strong>${costEstimate.toFixed(2)}/month</strong>
</p>
</div>
)}
<button
type="submit"
disabled={createMutation.isLoading}
className="w-full bg-blue-600 text-white py-2 px-4 rounded-md hover:bg-blue-700 disabled:opacity-50"
>
{createMutation.isLoading ? 'Provisioning...' : 'Create Database'}
</button>
</form>
);
}
Component 3: API Backend
FastAPI backend with validation and workflow orchestration:
# api/main.py
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field, field_validator
from typing import Optional
import uuid
from datetime import datetime

app = FastAPI(title="Infrastructure Platform API")

class DatabaseRequest(BaseModel):
    database_name: str = Field(..., pattern="^[a-z][a-z0-9-]{2,30}$")
    environment: str = Field(..., pattern="^(dev|staging|prod)$")
    instance_class: str = "db.t3.medium"
    storage_gb: int = Field(default=100, ge=20, le=1000)
    backup_retention_days: int = Field(default=7, ge=1, le=35)
    multi_az: bool = False
    team_owner: str
    cost_center: str

    @field_validator('instance_class')
    @classmethod
    def validate_instance_class(cls, v: str) -> str:
        allowed = ['db.t3.small', 'db.t3.medium', 'db.t3.large', 'db.r6g.xlarge']
        if v not in allowed:
            raise ValueError(f'Instance class must be one of {allowed}')
        return v
class DatabaseResponse(BaseModel):
request_id: str
status: str
database_name: str
estimated_completion_time: str
status_url: str
@app.post("/api/v1/databases", response_model=DatabaseResponse)
async def create_database(
request: DatabaseRequest,
background_tasks: BackgroundTasks
):
"""Create a new PostgreSQL database"""
# Generate request ID
request_id = str(uuid.uuid4())
# Validate against policy
policy_check = await validate_policy(request)
if not policy_check.allowed:
raise HTTPException(
status_code=403,
detail=f"Policy violation: {policy_check.reason}"
)
# Check for naming conflicts
if await database_exists(request.database_name, request.environment):
raise HTTPException(
status_code=409,
detail=f"Database '{request.database_name}' already exists in {request.environment}"
)
# Store request in database
await store_request(request_id, request)
# Start provisioning in background
background_tasks.add_task(
provision_database,
request_id=request_id,
request=request
)
return DatabaseResponse(
request_id=request_id,
status="provisioning",
database_name=request.database_name,
estimated_completion_time="5-10 minutes",
status_url=f"/api/v1/requests/{request_id}"
)
async def provision_database(request_id: str, request: DatabaseRequest):
"""Background task to provision database"""
try:
# Update status
await update_request_status(request_id, "validating")
# Generate Terraform configuration
terraform_config = generate_terraform_config(request)
# Update status
await update_request_status(request_id, "creating_resources")
# Apply Terraform
result = await apply_terraform(
request_id=request_id,
config=terraform_config,
environment=request.environment
)
# Create monitoring dashboard
await create_monitoring_dashboard(request.database_name, result['db_instance_id'])
# Store credentials in secrets manager
credentials_arn = await store_credentials(
database_name=request.database_name,
username=result['master_username'],
password=result['master_password'],
endpoint=result['endpoint']
)
# Update status
await update_request_status(
request_id,
"completed",
outputs={
"endpoint": result['endpoint'],
"port": result['port'],
"credentials_secret_arn": credentials_arn,
"monitoring_dashboard": f"https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name={request.database_name}"
}
)
# Send notification
await notify_user(
request_id=request_id,
database_name=request.database_name,
endpoint=result['endpoint'],
credentials_arn=credentials_arn
)
except Exception as e:
await update_request_status(
request_id,
"failed",
error=str(e)
)
await notify_failure(request_id, str(e))
def generate_terraform_config(request: DatabaseRequest, request_id: str = "unknown") -> str:
    """Generate Terraform configuration from request.

    request_id is passed separately because it is generated by the API
    handler and is not a field on the request model.
    """
    return f"""
terraform {{
  backend "s3" {{
    bucket = "platform-terraform-state"
    key    = "databases/{request.environment}/{request.database_name}/terraform.tfstate"
    region = "us-east-1"
  }}
}}

module "postgres_database" {{
  source = "git::https://github.com/company/terraform-modules//postgres-rds?ref=v2.1.0"

  identifier              = "{request.database_name}"
  instance_class          = "{request.instance_class}"
  allocated_storage       = {request.storage_gb}
  backup_retention_period = {request.backup_retention_days}
  multi_az                = {str(request.multi_az).lower()}

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.database_subnet_ids

  tags = {{
    Name        = "{request.database_name}"
    Environment = "{request.environment}"
    Team        = "{request.team_owner}"
    CostCenter  = "{request.cost_center}"
    ManagedBy   = "platform"
    RequestID   = "{request_id}"
  }}
}}

output "endpoint" {{
  value = module.postgres_database.endpoint
}}

output "port" {{
  value = module.postgres_database.port
}}

output "db_instance_id" {{
  value = module.postgres_database.db_instance_id
}}
"""
async def apply_terraform(request_id: str, config: str, environment: str) -> dict:
"""Apply Terraform configuration"""
import tempfile
import subprocess
import json
# Write config to temp directory
with tempfile.TemporaryDirectory() as tmpdir:
config_path = f"{tmpdir}/main.tf"
with open(config_path, 'w') as f:
f.write(config)
# Initialize Terraform
subprocess.run(
["terraform", "init"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Plan
subprocess.run(
["terraform", "plan", "-out=tfplan"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Apply
result = subprocess.run(
["terraform", "apply", "-auto-approve", "tfplan"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Get outputs
output_result = subprocess.run(
["terraform", "output", "-json"],
cwd=tmpdir,
check=True,
capture_output=True
)
outputs = json.loads(output_result.stdout)
return {
"endpoint": outputs["endpoint"]["value"],
"port": outputs["port"]["value"],
"db_instance_id": outputs["db_instance_id"]["value"]
}
Component 4: Policy Engine
Enforce governance and cost controls:
# policy/engine.py
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PolicyResult:
    allowed: bool
    reason: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
class PolicyEngine:
def __init__(self):
self.policies = [
self.check_cost_limit,
self.check_environment_restrictions,
self.check_naming_conventions,
self.check_required_tags,
]
async def evaluate(self, request: DatabaseRequest) -> PolicyResult:
"""Evaluate all policies"""
warnings = []
for policy in self.policies:
result = await policy(request)
if not result.allowed:
return result
if result.warnings:
warnings.extend(result.warnings)
return PolicyResult(allowed=True, warnings=warnings)
async def check_cost_limit(self, request: DatabaseRequest) -> PolicyResult:
"""Enforce cost limits per environment"""
estimated_cost = calculate_monthly_cost(request)
limits = {
'dev': 100,
'staging': 500,
'prod': 10000
}
limit = limits.get(request.environment, 100)
if estimated_cost > limit:
return PolicyResult(
allowed=False,
reason=f"Estimated cost ${estimated_cost}/month exceeds limit "
f"${limit}/month for {request.environment}. "
f"Please request approval from #platform-team."
)
# Warning for high cost in non-prod
if request.environment != 'prod' and estimated_cost > 200:
return PolicyResult(
allowed=True,
warnings=[
f"Cost ${estimated_cost}/month is high for {request.environment}. "
f"Consider using a smaller instance."
]
)
return PolicyResult(allowed=True)
async def check_environment_restrictions(self, request: DatabaseRequest) -> PolicyResult:
"""Production databases require additional validation"""
if request.environment == 'prod':
# Production databases must have backups
if request.backup_retention_days < 7:
return PolicyResult(
allowed=False,
reason="Production databases must have at least 7 days backup retention"
)
# Warn if not multi-AZ
if not request.multi_az:
return PolicyResult(
allowed=True,
warnings=[
"Production database is not multi-AZ. "
"Consider enabling for high availability."
]
)
return PolicyResult(allowed=True)
async def check_naming_conventions(self, request: DatabaseRequest) -> PolicyResult:
"""Enforce naming standards"""
# Name should not contain environment (it's in tags)
if request.environment in request.database_name:
return PolicyResult(
allowed=True,
warnings=[
f"Database name contains environment '{request.environment}'. "
f"Environment is already tagged separately."
]
)
return PolicyResult(allowed=True)
async def check_required_tags(self, request: DatabaseRequest) -> PolicyResult:
"""Validate required tags are present"""
if not request.team_owner:
return PolicyResult(
allowed=False,
reason="team_owner is required for cost allocation"
)
if not request.cost_center:
return PolicyResult(
allowed=False,
reason="cost_center is required for billing"
)
return PolicyResult(allowed=True)
def calculate_monthly_cost(request: DatabaseRequest) -> float:
"""Calculate estimated monthly cost"""
# Instance costs (simplified)
instance_costs = {
'db.t3.small': 30,
'db.t3.medium': 60,
'db.t3.large': 120,
'db.r6g.xlarge': 250,
}
base_cost = instance_costs.get(request.instance_class, 60)
storage_cost = request.storage_gb * 0.115
backup_cost = request.storage_gb * request.backup_retention_days * 0.02
total = base_cost + storage_cost + backup_cost
if request.multi_az:
total *= 2
return round(total, 2)
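A worked example makes the formula concrete. With the defaults (db.t3.medium at $60, 100 GB, 7-day retention), the total is 60 + 100 × 0.115 + 100 × 7 × 0.02 = $85.50/month, doubling to $171.00 with multi-AZ. The sketch below inlines the same arithmetic so the numbers can be checked standalone:

```python
def monthly_cost(instance_cost: float, storage_gb: int,
                 backup_retention_days: int, multi_az: bool) -> float:
    """Same formula as calculate_monthly_cost, inlined for a worked example."""
    total = (instance_cost
             + storage_gb * 0.115                          # storage
             + storage_gb * backup_retention_days * 0.02)  # backups
    return round(total * (2 if multi_az else 1), 2)

single_az = monthly_cost(60, 100, 7, multi_az=False)  # 85.5
multi_az = monthly_cost(60, 100, 7, multi_az=True)    # 171.0
```

The per-GB rates here are the simplified constants used throughout this guide, not live AWS pricing.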
Component 5: Request Status Tracking
Let developers see what's happening:
# api/status.py
from enum import Enum
from datetime import datetime
from typing import Optional, Dict
class RequestStatus(str, Enum):
PENDING = "pending"
VALIDATING = "validating"
CREATING_RESOURCES = "creating_resources"
CONFIGURING = "configuring"
COMPLETED = "completed"
FAILED = "failed"
@app.get("/api/v1/requests/{request_id}")
async def get_request_status(request_id: str):
"""Get status of infrastructure request"""
request = await db.get_request(request_id)
if not request:
raise HTTPException(status_code=404, detail="Request not found")
# Get provisioning steps
steps = await get_provisioning_steps(request_id)
# Get logs
logs = await get_provisioning_logs(request_id)
return {
"request_id": request_id,
"status": request.status,
"created_at": request.created_at,
"updated_at": request.updated_at,
"estimated_completion": request.estimated_completion,
"resource": {
"type": "postgres-database",
"name": request.database_name,
"environment": request.environment
},
"steps": steps,
"outputs": request.outputs if request.status == "completed" else None,
"error": request.error if request.status == "failed" else None,
"logs_url": f"/api/v1/requests/{request_id}/logs"
}
async def get_provisioning_steps(request_id: str) -> list:
    """Get provisioning steps with status (hard-coded example payload)"""
return [
{
"step": 1,
"name": "Validate request",
"status": "completed",
"duration_seconds": 2
},
{
"step": 2,
"name": "Generate configuration",
"status": "completed",
"duration_seconds": 1
},
{
"step": 3,
"name": "Create database instance",
"status": "in_progress",
"duration_seconds": 180,
"details": "Waiting for RDS instance to be available..."
},
{
"step": 4,
"name": "Configure monitoring",
"status": "pending"
},
{
"step": 5,
"name": "Store credentials",
"status": "pending"
}
]
Advanced Features
Feature 1: Cost Forecasting
Help developers understand financial impact:
# cost/forecasting.py
from datetime import datetime, timedelta
from typing import List, Dict
class CostForecaster:
def forecast_monthly_cost(self, request: DatabaseRequest) -> Dict:
"""Forecast costs with breakdown"""
compute = self.calculate_compute_cost(request.instance_class, request.multi_az)
storage = self.calculate_storage_cost(request.storage_gb)
backup = self.calculate_backup_cost(request.storage_gb, request.backup_retention_days)
total = compute + storage + backup
return {
"monthly_total": round(total, 2),
"breakdown": {
"compute": round(compute, 2),
"storage": round(storage, 2),
"backup": round(backup, 2)
},
"annual_total": round(total * 12, 2),
"recommendations": self.get_cost_recommendations(request, total)
}
def get_cost_recommendations(self, request: DatabaseRequest, current_cost: float) -> List[str]:
    """Suggest cost optimizations"""
    recommendations = []
    # ARM instance recommendation: compare compute cost only, since
    # storage and backup pricing is the same across instance families
    if request.instance_class.startswith('db.r5'):
        arm_equivalent = request.instance_class.replace('r5', 'r6g')
        current_compute = self.calculate_compute_cost(request.instance_class, request.multi_az)
        arm_compute = self.calculate_compute_cost(arm_equivalent, request.multi_az)
        savings = current_compute - arm_compute
        if savings > 10:
            recommendations.append(
                f"Switch to {arm_equivalent} (ARM-based) to save ${savings:.2f}/month"
            )
# Storage optimization
if request.storage_gb > 500 and request.environment != 'prod':
recommendations.append(
f"Consider starting with 250GB storage for {request.environment}. "
f"You can scale up later."
)
# Backup retention
if request.backup_retention_days > 14 and request.environment == 'dev':
recommendations.append(
"Development databases rarely need >7 days backup retention"
)
return recommendations
Feature 2: Ephemeral Environments
Create temporary environments that auto-expire:
# environments/ephemeral.py
from datetime import datetime, timedelta
from typing import Dict

class EphemeralEnvironment:
    async def create_preview_environment(
        self,
        pull_request_id: int,
        base_environment: str = "staging",
        ttl_hours: int = 24
    ) -> str:
"""Create temporary environment for PR preview"""
env_name = f"pr-{pull_request_id}"
expires_at = datetime.utcnow() + timedelta(hours=ttl_hours)
# Clone configuration from base environment
config = self.clone_environment_config(base_environment)
# Scale down for cost savings
config = self.optimize_for_preview(config)
# Provision environment
env_id = await self.provision_environment(
name=env_name,
config=config,
labels={
"type": "ephemeral",
"pull_request": str(pull_request_id),
"expires_at": expires_at.isoformat()
}
)
# Schedule cleanup
await self.schedule_cleanup(env_id, expires_at)
# Comment on PR
await self.comment_on_pr(
pull_request_id,
f"Preview environment ready: https://{env_name}.preview.company.com\n"
f"Environment will be deleted in {ttl_hours} hours."
)
return env_id
def optimize_for_preview(self, config: Dict) -> Dict:
"""Scale down resources for cost savings"""
# Use smaller instance types
if 'instance_class' in config:
config['instance_class'] = 'db.t3.small'
# Reduce storage
if 'storage_gb' in config and config['storage_gb'] > 100:
config['storage_gb'] = 100
# Disable multi-AZ
config['multi_az'] = False
# Shorter backup retention
config['backup_retention_days'] = 1
return config
async def schedule_cleanup(self, env_id: str, expires_at: datetime):
"""Schedule automatic environment cleanup"""
# Use Lambda/Cloud Function with scheduled trigger
# Or add to cleanup queue
await cleanup_queue.enqueue(
job_id=f"cleanup-{env_id}",
run_at=expires_at,
action="delete_environment",
parameters={"environment_id": env_id}
)
Feature 3: Drift Detection
Alert when infrastructure diverges from defined state:
# monitoring/drift.py
import asyncio
from datetime import datetime
from typing import List, Dict
class DriftDetector:
async def detect_drift_all(self) -> List[Dict]:
"""Check all managed resources for drift"""
resources = await self.get_managed_resources()
drift_results = []
for resource in resources:
drift = await self.check_resource_drift(resource)
if drift['has_drift']:
drift_results.append(drift)
return drift_results
async def check_resource_drift(self, resource: Dict) -> Dict:
"""Check single resource for drift"""
# Get expected state from Terraform/platform
expected_state = await self.get_expected_state(resource['id'])
# Get actual state from cloud provider
actual_state = await self.get_actual_state(
resource['type'],
resource['id']
)
# Compare
differences = self.compare_states(expected_state, actual_state)
if differences:
# Alert team
await self.alert_drift(
resource_id=resource['id'],
differences=differences
)
return {
"resource_id": resource['id'],
"resource_type": resource['type'],
"has_drift": True,
"differences": differences,
"detected_at": datetime.utcnow().isoformat()
}
return {
"resource_id": resource['id'],
"has_drift": False
}
async def alert_drift(self, resource_id: str, differences: List[Dict]):
"""Send drift alert to team"""
message = f"""
🚨 Configuration Drift Detected
Resource: {resource_id}
Changes detected:
{self.format_differences(differences)}
This means the resource was modified outside of the platform.
Actions:
1. Review changes in cloud console
2. If changes are desired, update platform configuration
3. If changes are errors, use platform to revert
View details: https://platform.company.com/resources/{resource_id}/drift
"""
await self.send_slack_alert(channel="#platform-alerts", message=message)
# Run drift detection daily
async def scheduled_drift_detection():
detector = DriftDetector()
while True:
print("Running drift detection...")
drift_results = await detector.detect_drift_all()
if drift_results:
print(f"Found {len(drift_results)} resources with drift")
else:
print("No drift detected")
# Wait 24 hours
await asyncio.sleep(86400)
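The compare_states helper the detector relies on is not shown above; a shallow field-by-field diff is usually enough for a first version. A minimal sketch, ignoring nested structures and provider-injected fields, which a real implementation must handle:

```python
def compare_states(expected: dict, actual: dict) -> list[dict]:
    """Return one entry per field whose live value differs from the declared value."""
    differences = []
    for key, expected_value in expected.items():
        actual_value = actual.get(key)
        if actual_value != expected_value:
            differences.append({
                "field": key,
                "expected": expected_value,
                "actual": actual_value,
            })
    return differences

# Someone resized the instance outside the platform
diffs = compare_states(
    {"instance_class": "db.t3.medium", "multi_az": False},
    {"instance_class": "db.t3.large", "multi_az": False},
)
```

Iterating over the expected state only means fields added out-of-band (e.g. extra tags) are not flagged; whether to treat those as drift is a policy decision.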
Feature 4: Resource Lifecycle Management
Automate cleanup of unused resources:
# lifecycle/manager.py
from datetime import datetime, timedelta
from typing import List, Dict
class LifecycleManager:
async def scan_unused_resources(self):
"""Find resources that haven't been used recently"""
unused = []
# Check databases with no connections
databases = await self.get_all_databases()
for db in databases:
connections = await self.get_connection_count(db['id'], days=30)
if connections == 0:
unused.append({
"type": "database",
"id": db['id'],
"name": db['name'],
"reason": "No connections in 30 days",
"monthly_cost": db['monthly_cost'],
"owner": db['team_owner']
})
# Check EC2 instances with low CPU
instances = await self.get_all_instances()
for instance in instances:
avg_cpu = await self.get_avg_cpu(instance['id'], days=7)
if avg_cpu < 5:
unused.append({
"type": "ec2_instance",
"id": instance['id'],
"name": instance['name'],
"reason": f"Average CPU {avg_cpu}% over 7 days",
"monthly_cost": instance['monthly_cost'],
"owner": instance['team_owner']
})
# Notify owners
await self.notify_unused_resources(unused)
return unused
async def notify_unused_resources(self, unused: List[Dict]):
"""Notify teams about potentially unused resources"""
# Group by owner
by_owner = {}
for resource in unused:
owner = resource['owner']
if owner not in by_owner:
by_owner[owner] = []
by_owner[owner].append(resource)
# Send notifications
for owner, resources in by_owner.items():
total_cost = sum(r['monthly_cost'] for r in resources)
message = f"""
💰 Unused Resource Alert
Team: {owner}
Potentially unused resources: {len(resources)}
Estimated monthly savings: ${total_cost:.2f}
Resources:
{self.format_resource_list(resources)}
These resources haven't been used recently. If you no longer need them, consider deleting to save costs.
Review and manage: https://platform.company.com/resources/unused
"""
await self.send_team_notification(owner, message)
    async def auto_delete_ephemeral(self):
        """Automatically delete expired ephemeral environments"""
        now = datetime.utcnow()

        # Find expired environments
        expired = await db.query("""
            SELECT * FROM environments
            WHERE type = 'ephemeral'
              AND expires_at < $1
              AND status != 'deleted'
        """, now)

        for env in expired:
            print(f"Deleting expired environment: {env['name']}")

            # Delete resources
            await self.delete_environment(env['id'])

            # Notify owner
            await self.notify_environment_deleted(env)
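Cleanup jobs like these need to run on a schedule. A minimal sketch of a periodic runner, assuming the jobs are async callables like the methods above (the `run_periodically` helper and its parameters are illustrative, not part of any library):

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def run_periodically(job: Callable[[], Awaitable[None]],
                           interval_seconds: float,
                           max_runs: Optional[int] = None) -> int:
    """Run an async cleanup job repeatedly; one failed run must not kill the loop."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await job()
        except Exception as exc:  # log and keep going
            print(f"cleanup job failed: {exc}")
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval_seconds)
    return runs

# Example wiring (manager is the hypothetical object holding the jobs):
# asyncio.run(run_periodically(manager.auto_delete_ephemeral, 3600))
```

In production you would more likely hand this to a real scheduler (cron, Kubernetes CronJob, or a task queue), but the failure-isolation point stands either way: a transient error in one sweep should never stop future sweeps.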
Measuring Success
Key Metrics
Developer Experience:
# metrics/developer_experience.py
class PlatformMetrics:
    def calculate_time_to_infrastructure(self) -> Dict:
        """Measure speed of infrastructure delivery over the last 30 days"""
        row = db.query("""
            SELECT
                AVG(EXTRACT(EPOCH FROM (completed_at - created_at)) / 60) AS avg_minutes,
                PERCENTILE_CONT(0.5) WITHIN GROUP (
                    ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at)) / 60
                ) AS median_minutes,
                PERCENTILE_CONT(0.95) WITHIN GROUP (
                    ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at)) / 60
                ) AS p95_minutes
            FROM requests
            WHERE status = 'completed'
              AND created_at > NOW() - INTERVAL '30 days'
        """)[0]

        return {
            "avg_provision_time_minutes": row['avg_minutes'],
            "median_provision_time_minutes": row['median_minutes'],
            "p95_provision_time_minutes": row['p95_minutes']
        }
    def calculate_self_service_adoption(self) -> Dict:
        """Measure adoption of self-service vs ticket-based requests this month"""
        self_service = db.query("""
            SELECT COUNT(*) AS n
            FROM requests
            WHERE created_at > DATE_TRUNC('month', CURRENT_DATE)
        """)[0]['n']

        tickets = db.query("""
            SELECT COUNT(*) AS n
            FROM jira_tickets
            WHERE type = 'infrastructure'
              AND created_at > DATE_TRUNC('month', CURRENT_DATE)
        """)[0]['n']

        total = self_service + tickets
        return {
            "self_service_this_month": self_service,
            "tickets_this_month": tickets,
            "adoption_rate": self_service / total if total else 0.0
        }
Platform Team Efficiency:
- Tickets handled per engineer per week
- Time spent on toil vs strategic work
- Number of manual interventions required
Cost Optimization:
- Resources created within cost guardrails
- Unused resources identified and deleted
- Cost savings from rightsizing recommendations
Quality:
- Infrastructure provisioned with proper tagging
- Resources meeting compliance standards
- Drift incidents per month
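The tagging metric above reduces to a pass over the resource inventory. A sketch, assuming each resource record carries a `tags` dict (the required tag set and field names are illustrative):

```python
from typing import Dict, List

# Tags every resource must carry (illustrative policy)
REQUIRED_TAGS = {"team", "environment", "cost-center"}

def tagging_compliance(resources: List[Dict]) -> Dict:
    """Return the share of resources carrying all required tags."""
    compliant = [
        r for r in resources
        if REQUIRED_TAGS.issubset(r.get("tags", {}).keys())
    ]
    return {
        "total": len(resources),
        "compliant": len(compliant),
        "compliance_rate": len(compliant) / len(resources) if resources else 1.0,
        "missing": [
            {"id": r["id"],
             "missing_tags": sorted(REQUIRED_TAGS - r.get("tags", {}).keys())}
            for r in resources if r not in compliant
        ],
    }
```

Because the platform provisions everything through templates, this number should be close to 100%; a drop usually means resources were created outside the platform, which also feeds the drift metric.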
Dashboard Example
// components/PlatformDashboard.tsx
export function PlatformDashboard() {
  const { data: metrics } = useQuery('platform-metrics', fetchMetrics);

  // useQuery data is undefined until the first fetch resolves
  if (!metrics) return null;

  return (
    <div className="grid grid-cols-3 gap-6">
      <MetricCard
        title="Avg. Provision Time"
        value={`${metrics.avg_provision_time} min`}
        trend="-23%"
        trendDirection="down"
        description="Time from request to ready"
      />
      <MetricCard
        title="Self-Service Adoption"
        value={`${metrics.adoption_rate}%`}
        trend="+15%"
        trendDirection="up"
        description="Requests via platform vs tickets"
      />
      <MetricCard
        title="Monthly Cost Savings"
        value={`$${metrics.cost_savings}`}
        trend="+$2.1k"
        trendDirection="up"
        description="From rightsizing and cleanup"
      />
      <MetricCard
        title="Active Resources"
        value={metrics.active_resources}
        description="Managed by platform"
      />
      <MetricCard
        title="Drift Incidents"
        value={metrics.drift_incidents}
        trend="-5"
        trendDirection="down"
        description="Manual changes detected"
      />
      <MetricCard
        title="Policy Violations"
        value={metrics.policy_violations}
        trend="-12"
        trendDirection="down"
        description="Blocked by policy this month"
      />
    </div>
  );
}
Common Pitfalls
Pitfall 1: Too Much Customization
Problem: Offering every possible configuration option overwhelms developers and creates support burden.
Solution: Start with 3-5 common configurations. Add options based on demand.
# Bad: 50+ configuration options
# Good: 3 preset sizes + advanced mode
database_presets:
  small:
    instance_class: db.t3.small
    storage_gb: 50
    description: "Development and testing"
    cost: "$30/month"
  medium:
    instance_class: db.t3.large
    storage_gb: 200
    description: "Small production workloads"
    cost: "$120/month"
  large:
    instance_class: db.r6g.2xlarge
    storage_gb: 500
    description: "Production workloads"
    cost: "$500/month"
Pitfall 2: No Cost Guardrails
Problem: Developers create expensive resources without understanding costs.
Solution: Show costs upfront, set limits, require approval for expensive resources.
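One way to enforce this is a pre-provision check that routes a request to auto-approve, manager approval, or rejection based on its estimated monthly cost. A sketch with illustrative thresholds (the limits and the `GuardrailDecision` type are assumptions, not a standard API):

```python
from dataclasses import dataclass

AUTO_APPROVE_LIMIT = 200.0   # USD/month: provision immediately (illustrative)
HARD_LIMIT = 2000.0          # USD/month: reject outright (illustrative)

@dataclass
class GuardrailDecision:
    allowed: bool
    needs_approval: bool
    reason: str

def check_cost_guardrail(estimated_monthly_cost: float) -> GuardrailDecision:
    """Decide whether a request proceeds, needs approval, or is rejected."""
    if estimated_monthly_cost > HARD_LIMIT:
        return GuardrailDecision(False, False,
            f"${estimated_monthly_cost:.2f}/mo exceeds hard limit ${HARD_LIMIT:.2f}")
    if estimated_monthly_cost > AUTO_APPROVE_LIMIT:
        return GuardrailDecision(True, True,
            f"${estimated_monthly_cost:.2f}/mo requires approval")
    return GuardrailDecision(True, False, "within auto-approve limit")
```

Surfacing the `reason` string in the portal form doubles as the "show costs upfront" half of the solution: developers see why a request needs approval before they submit it.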
Pitfall 3: Ignoring Day 2 Operations
Problem: Platform makes it easy to create resources but not maintain them.
Solution: Build lifecycle management, monitoring, and cleanup into the platform.
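Lifecycle management starts at creation time: the ephemeral cleanup job shown earlier only works if every resource is stamped with expiry metadata when it is provisioned. A sketch of that stamping step, with illustrative TTL defaults and field names:

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Default lifetimes per environment type (illustrative)
DEFAULT_TTL = {"ephemeral": timedelta(days=7), "preview": timedelta(days=2)}

def with_lifecycle_metadata(request: Dict,
                            now: Optional[datetime] = None) -> Dict:
    """Attach created_at/expires_at so downstream cleanup jobs can reap it."""
    now = now or datetime.utcnow()
    env_type = request.get("type", "permanent")
    record = dict(request, created_at=now.isoformat())
    ttl = DEFAULT_TTL.get(env_type)
    if ttl is not None:  # permanent resources get no expiry
        record["expires_at"] = (now + ttl).isoformat()
    return record
```

With `expires_at` always present on short-lived resources, Day 2 cleanup becomes a simple query instead of a forensic exercise.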
Pitfall 4: Building Instead of Buying
Problem: Spending years building a custom platform when products exist.
Solution: Evaluate existing tools first:
- Backstage (Spotify): Open source developer portal
- Port (Port.io): Internal developer platform
- Humanitec: Platform orchestrator
- Terraform Cloud: Managed Terraform with RBAC
- Pulumi Cloud: Managed Pulumi with team features
Build custom only if:
- Existing tools don't fit your workflow
- You have engineering capacity to maintain
- Your requirements are truly unique
Pitfall 5: Poor Documentation
Problem: Developers don't know what's available or how to use it.
Solution: Treat documentation as a product. Include:
- Getting started guides
- Reference for each service catalog item
- Troubleshooting common issues
- Example configurations
- Architecture diagrams
Conclusion
A self-service infrastructure platform is a force multiplier for engineering organizations. Done right, it accelerates development, reduces toil, and maintains security and cost controls.
Key Principles:
- Start simple. One service catalog item is better than zero.
- Focus on 80% use cases. Don't try to handle every edge case initially.
- Measure adoption. Track usage and iterate based on feedback.
- Enforce guardrails. Prevent mistakes without blocking legitimate use cases.
- Automate Day 2 operations. Lifecycle management is as important as provisioning.
Build vs Buy Decision Matrix:
Build when:
- Existing tools don't integrate with your stack
- You have 3+ platform engineers
- Your requirements are truly unique
- You're willing to maintain long-term
Buy/adopt existing when:
- Products fit your workflow (Backstage, Port, etc.)
- Platform team is small (< 3 people)
- You want to move quickly
- Standard features meet 80% of needs
The best platform is the one that developers actually use. Get feedback early, iterate fast, and measure what matters: time to infrastructure, adoption rate, and developer satisfaction.
Next Steps
- Identify your first service catalog item: What infrastructure request do you get most often?
- Build a prototype: Create a simple form that provisions that one thing
- Get feedback: Have 3-5 developers try it and iterate
- Measure success: Track time savings and adoption
- Scale gradually: Add more catalog items based on demand
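The prototype in step 2 can be smaller than it sounds: a single handler that validates the form payload and resolves it to a named preset. A dependency-free sketch (the preset values and validation rules are illustrative):

```python
from typing import Dict

# Two starter presets (illustrative values)
PRESETS = {
    "small": {"instance_class": "db.t3.small", "storage_gb": 50},
    "medium": {"instance_class": "db.t3.large", "storage_gb": 200},
}

def handle_database_request(form: Dict) -> Dict:
    """Validate a portal form submission and return provisioning parameters."""
    name = form.get("name", "")
    size = form.get("size", "small")
    if not name.isidentifier():
        raise ValueError("name must be a valid identifier")
    if size not in PRESETS:
        raise ValueError(f"unknown size {size!r}; choose from {sorted(PRESETS)}")
    return {"name": name, **PRESETS[size],
            "tags": {"team": form.get("team", "unknown")}}
```

Hook the returned dict up to your existing Terraform or Pulumi pipeline and you have a working end-to-end slice to put in front of your first 3-5 developers.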
Additional Resources
- Backstage by Spotify: Open Source Developer Portal
- CNCF Platforms White Paper
- Pulumi Automation API Documentation
- Port: Internal Developer Portal Documentation
Building a self-service platform? Share your experiences and challenges in the comments below.