Building a Self-Service Infrastructure Platform
Prerequisites
- Infrastructure as Code experience (Terraform or Pulumi)
- Understanding of CI/CD pipelines
- Basic web development (React/API development)
- Cloud provider expertise (AWS, GCP, or Azure)
- Experience operating production infrastructure
Introduction
Every growing engineering team hits the same bottleneck: developers need infrastructure, but only a handful of platform engineers know how to provision it safely. Ticket queues grow, context switching kills productivity, and your best engineers spend their time being human APIs instead of solving hard problems.
A self-service infrastructure platform solves this by giving developers the ability to provision infrastructure themselves, within guardrails set by the platform team. Done right, it accelerates teams while maintaining security and cost control. Done wrong, it creates technical debt and shadow IT.
This guide walks through building a production-grade self-service platform from first principles, including architecture decisions, implementation patterns, and code examples you can adapt.
Why Build a Self-Service Platform?
The Ticket Queue Problem
Without Self-Service:
Developer: "I need a PostgreSQL database for my new service"
  ↓ (Create Jira ticket)
Platform Team: Ticket sits in backlog for 3 days
  ↓ (Triage meeting)
Platform Engineer: Spends 30 minutes provisioning via Terraform
  ↓ (Manual steps, credentials, documentation)
Developer: Finally receives credentials, 5 days later
Total: 5 days, 30 minutes of engineering time
With Self-Service:
Developer: Fills out form in portal
  ↓ (Automated pipeline)
Platform: Provisions database, sets up monitoring, creates credentials
  ↓ (5 minutes)
Developer: Receives credentials and documentation
Total: 5 minutes, 0 human intervention
The Real Benefits
For Developers:
- Provision infrastructure in minutes, not days
- Iterate quickly without waiting for tickets
- Understand what they're getting (standardized offerings)
- Focus on building features, not fighting infrastructure
For Platform Teams:
- Stop being human APIs
- Enforce best practices automatically
- Reduce toil and repetitive work
- Focus on platform improvements, not individual requests
For the Organization:
- Faster time to market
- Consistent, secure infrastructure
- Cost visibility and control
- Scale engineering without proportional platform team growth
When You Need It
Build a self-service platform when:
- Your platform team gets 10+ infrastructure requests per week
- Developers wait days for basic infrastructure
- You're repeating the same Terraform setups constantly
- Inconsistency in infrastructure is causing operational issues
- You want to scale to 50+ engineering teams
Don't build one if:
- You have fewer than 20 engineers
- Infrastructure needs are highly variable and specialized
- Platform team can handle current load comfortably
- You lack engineering capacity to maintain the platform
Core Design Principles
1. Paved Roads, Not Walls
Provide a well-maintained path that's easier than going off-road, but don't block custom solutions when needed.
Good: "Here's a button for a standard database. Need something custom? Here's how to do it safely."
Bad: "You can only use these exact configurations. No exceptions."
2. Sensible Defaults, Easy Customization
Cover the 80% use case with defaults; allow overrides for the remaining 20%.
# Template with smart defaults
database:
  type: postgres
  version: 14                     # Pinned stable default
  instance_class: db.t3.medium    # Sensible for most services
  storage_gb: 100
  backup_retention_days: 7

  # But allow overrides
  custom_parameters:
    max_connections: 200          # Override if needed
    shared_buffers: "256MB"
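The defaults-plus-overrides pattern amounts to recursively overlaying a user's request onto the platform template. A minimal sketch, assuming a dict-based config (the field names mirror the YAML above; `merge_config` is an illustrative helper, not a platform API):

```python
from copy import deepcopy

DEFAULTS = {
    "type": "postgres",
    "version": 14,
    "instance_class": "db.t3.medium",
    "storage_gb": 100,
    "backup_retention_days": 7,
    "custom_parameters": {"max_connections": 100},
}

def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user overrides onto platform defaults."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Caller only specifies what differs from the paved road
config = merge_config(DEFAULTS, {"custom_parameters": {"max_connections": 200}})
```

The deep copy matters: mutating the template in place would silently change the defaults for every subsequent request.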
3. Guardrails Over Gates
Prevent mistakes and enforce policy, but don't block legitimate use cases.
# Good: Validation with helpful errors
def validate_database_request(request):
    if request.storage_gb > 10000:
        if not request.has_approval("large-database-approval"):
            return Error(
                "Databases over 10TB require approval. "
                "Please reach out to #platform-team with your use case."
            )
    if request.instance_class.startswith("db.r5"):
        # Intel-based instance; the ARM equivalent is cheaper
        return Warning(
            "ARM instances are ~20% cheaper than Intel. "
            "Consider db.r6g instead of db.r5. "
            "Proceeding with your selection."
        )
4. Progressive Disclosure
Start simple, reveal complexity only when needed.
Basic Mode: [Database Type] [Name] [Environment] → Done
Advanced: Show 30+ configuration options
Expert: Direct Terraform/API access
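One way to implement the tiers above is to tag each catalog parameter with the lowest mode that surfaces it, then filter server-side before rendering the form. A hedged sketch; the mode names match the text, but the parameter list and `visible_parameters` helper are illustrative:

```python
PARAMETERS = [
    {"name": "database_name", "mode": "basic"},
    {"name": "environment", "mode": "basic"},
    {"name": "instance_class", "mode": "advanced"},
    {"name": "backup_retention_days", "mode": "advanced"},
    {"name": "custom_parameters", "mode": "expert"},
]

MODE_RANK = {"basic": 0, "advanced": 1, "expert": 2}

def visible_parameters(mode: str) -> list[dict]:
    """Return only the parameters the selected mode should display."""
    rank = MODE_RANK[mode]
    return [p for p in PARAMETERS if MODE_RANK[p["mode"]] <= rank]

# Basic mode shows only the two required fields; expert mode shows everything
basic = visible_parameters("basic")
```

Filtering on the server keeps the portal, CLI, and API views consistent with a single source of truth.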
5. Observable and Debuggable
Developers should understand what's happening and why things fail.
Bad: "Error: Failed to create database"
Good: "PostgreSQL provisioning failed at step 3 of 5: Security group creation
Error: sg-12345 already exists with conflicting rules
Solution: Check if database 'myapp-db' already exists
Logs: https://platform.company.com/logs/req-abc123"
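Errors like the good example above are easier to emit consistently if failures carry the step, cause, suggested fix, and log link as structured fields rather than a flat string. A minimal sketch, assuming illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class ProvisioningError:
    step: int
    total_steps: int
    step_name: str
    cause: str
    suggestion: str
    logs_url: str

    def render(self) -> str:
        """Format the structured error as a human-readable message."""
        return (
            f"Provisioning failed at step {self.step} of {self.total_steps}: {self.step_name}\n"
            f"Error: {self.cause}\n"
            f"Solution: {self.suggestion}\n"
            f"Logs: {self.logs_url}"
        )

err = ProvisioningError(
    step=3, total_steps=5, step_name="Security group creation",
    cause="sg-12345 already exists with conflicting rules",
    suggestion="Check if database 'myapp-db' already exists",
    logs_url="https://platform.company.com/logs/req-abc123",
)
```

Because the fields are structured, the same error object can also be returned as JSON from the API and rendered in the portal's status view.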
Architecture Overview
High-Level Components
┌─────────────────────────────────────────────────────────┐
│                  Developer Interface                    │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌───────────┐   │
│  │   Web   │  │   CLI   │  │   API   │  │ Terraform │   │
│  │ Portal  │  │  Tool   │  │ Direct  │  │  Module   │   │
│  └─────────┘  └─────────┘  └─────────┘  └───────────┘   │
└────────────────────────────┬────────────────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │        Platform API Layer         │
           │  - Request validation             │
           │  - Policy enforcement             │
           │  - Resource naming                │
           │  - Cost estimation                │
           └─────────────────┬─────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │        Provisioning Engine        │
           │  - Terraform/Pulumi               │
           │  - GitOps workflows               │
           │  - State management               │
           │  - Drift detection                │
           └─────────────────┬─────────────────┘
                             │
           ┌─────────────────▼─────────────────┐
           │          Cloud Providers          │
           │          AWS, GCP, Azure          │
           └───────────────────────────────────┘

Parallel Services:
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Monitoring  │  │   Security   │  │     Cost     │
│  & Alerting  │  │   Scanning   │  │   Tracking   │
└──────────────┘  └──────────────┘  └──────────────┘
Technology Stack Options
Frontend (Developer Interface):
- Web Portal: Next.js, React, Backstage
- CLI: Go, Python (Click), Node.js
- IDE Integration: VS Code extension
Backend (API Layer):
- FastAPI (Python)
- Express (Node.js)
- Go (Gin, Echo)
Provisioning Engine:
- Terraform with custom wrapper
- Pulumi Automation API
- Crossplane
- AWS CDK / CDK for Terraform
Data Storage:
- PostgreSQL: Request history, metadata
- Redis: Cache, job queue
- S3/GCS: Terraform state, logs
Infrastructure:
- Kubernetes for the platform itself
- Serverless functions (Lambda, Cloud Functions) for lightweight operations
- GitLab/GitHub Actions for GitOps workflows
Implementation: Core Components
Component 1: Service Catalog
Define what developers can self-provision.
# catalog/postgres-database.yaml
name: PostgreSQL Database
description: Managed PostgreSQL database with automated backups and monitoring
category: databases
icon: postgresql
version: 1.2.0
parameters:
- name: database_name
type: string
required: true
pattern: "^[a-z][a-z0-9-]{2,30}$"
description: "Database identifier (lowercase, alphanumeric, hyphens)"
- name: environment
type: enum
required: true
options: [dev, staging, prod]
description: "Deployment environment"
- name: instance_class
type: enum
required: false
default: db.t3.medium
options:
- value: db.t3.small
label: "Small (2 vCPU, 2GB RAM) - $30/month"
- value: db.t3.medium
label: "Medium (2 vCPU, 4GB RAM) - $60/month"
- value: db.t3.large
label: "Large (2 vCPU, 8GB RAM) - $120/month"
- value: db.r6g.xlarge
label: "XLarge (4 vCPU, 32GB RAM) - $250/month"
description: "Instance size"
- name: storage_gb
type: integer
required: false
default: 100
min: 20
max: 1000
description: "Storage size in GB"
- name: backup_retention_days
type: integer
required: false
default: 7
min: 1
max: 35
description: "Days to retain automated backups"
- name: multi_az
type: boolean
required: false
default: false
description: "Enable multi-AZ for high availability (2x cost)"
- name: team_owner
type: string
required: true
description: "Team responsible for this database"
- name: cost_center
type: string
required: true
description: "Cost center for billing"
outputs:
- name: endpoint
description: "Database connection endpoint"
- name: port
description: "Database port"
- name: credentials_secret_arn
description: "AWS Secrets Manager ARN containing credentials"
- name: monitoring_dashboard
description: "CloudWatch dashboard URL"
cost_estimate:
monthly_min: 30
monthly_max: 300
formula: |
base = instance_class_cost
storage = storage_gb * 0.115
backup = storage_gb * backup_retention_days * 0.02
multi_az_multiplier = 2 if multi_az else 1
total = (base + storage + backup) * multi_az_multiplier
tags:
- managed-by: platform
- service-type: database
- database-engine: postgres
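A catalog entry like this can be validated against its own parameter schema before anything reaches the provisioning engine, so the one YAML file drives both the form and the server-side checks. A sketch using only the stdlib; the dict below mirrors a subset of the YAML above, and a real loader would parse the file with a YAML library:

```python
import re

CATALOG_PARAMS = [
    {"name": "database_name", "type": "string", "required": True,
     "pattern": r"^[a-z][a-z0-9-]{2,30}$"},
    {"name": "environment", "type": "enum", "required": True,
     "options": ["dev", "staging", "prod"]},
    {"name": "storage_gb", "type": "integer", "required": False,
     "default": 100, "min": 20, "max": 1000},
]

def validate_request(params: list[dict], request: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for p in params:
        value = request.get(p["name"], p.get("default"))
        if value is None:
            if p["required"]:
                errors.append(f"{p['name']} is required")
            continue
        if p["type"] == "string" and not re.match(p["pattern"], str(value)):
            errors.append(f"{p['name']} does not match {p['pattern']}")
        elif p["type"] == "enum" and value not in p["options"]:
            errors.append(f"{p['name']} must be one of {p['options']}")
        elif p["type"] == "integer" and not p["min"] <= int(value) <= p["max"]:
            errors.append(f"{p['name']} must be between {p['min']} and {p['max']}")
    return errors

errors = validate_request(CATALOG_PARAMS, {"database_name": "my-db", "environment": "qa"})
```

Returning all errors at once, instead of failing on the first, gives developers one round trip to fix their request.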
Component 2: Web Portal
Simple React form with validation:
// components/DatabaseRequestForm.tsx
import React, { useState } from 'react';
import { useMutation } from 'react-query';
import { createDatabase } from '../api/platform';
interface DatabaseFormData {
database_name: string;
environment: 'dev' | 'staging' | 'prod';
instance_class: string;
storage_gb: number;
team_owner: string;
cost_center: string;
}
export function DatabaseRequestForm() {
const [formData, setFormData] = useState<DatabaseFormData>({
database_name: '',
environment: 'dev',
instance_class: 'db.t3.medium',
storage_gb: 100,
team_owner: '',
cost_center: '',
});
const [costEstimate, setCostEstimate] = useState<number | null>(null);
const createMutation = useMutation(createDatabase, {
onSuccess: (data) => {
console.log('Database provisioning started:', data);
// Redirect to status page
window.location.href = `/requests/${data.request_id}`;
},
});
// Real-time cost estimation
const estimateCost = async () => {
const response = await fetch('/api/v1/catalog/postgres-database/estimate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(formData),
});
const { monthly_cost } = await response.json();
setCostEstimate(monthly_cost);
};
React.useEffect(() => {
estimateCost();
}, [formData.instance_class, formData.storage_gb]);
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
createMutation.mutate(formData);
};
return (
<form onSubmit={handleSubmit} className="space-y-6">
<div>
<label className="block text-sm font-medium">Database Name</label>
<input
type="text"
value={formData.database_name}
onChange={(e) => setFormData({ ...formData, database_name: e.target.value })}
pattern="^[a-z][a-z0-9-]{2,30}$"
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="my-service-db"
/>
<p className="mt-1 text-sm text-gray-500">
Lowercase letters, numbers, and hyphens only
</p>
</div>
<div>
<label className="block text-sm font-medium">Environment</label>
<select
value={formData.environment}
onChange={(e) => setFormData({ ...formData, environment: e.target.value as any })}
className="mt-1 block w-full rounded-md border-gray-300"
>
<option value="dev">Development</option>
<option value="staging">Staging</option>
<option value="prod">Production</option>
</select>
</div>
<div>
<label className="block text-sm font-medium">Instance Size</label>
<select
value={formData.instance_class}
onChange={(e) => setFormData({ ...formData, instance_class: e.target.value })}
className="mt-1 block w-full rounded-md border-gray-300"
>
<option value="db.t3.small">Small (2 vCPU, 2GB) - ~$30/month</option>
<option value="db.t3.medium">Medium (2 vCPU, 4GB) - ~$60/month</option>
<option value="db.t3.large">Large (2 vCPU, 8GB) - ~$120/month</option>
</select>
</div>
<div>
<label className="block text-sm font-medium">Storage (GB)</label>
<input
type="number"
value={formData.storage_gb}
onChange={(e) => setFormData({ ...formData, storage_gb: parseInt(e.target.value) })}
min={20}
max={1000}
className="mt-1 block w-full rounded-md border-gray-300"
/>
</div>
<div>
<label className="block text-sm font-medium">Team Owner</label>
<input
type="text"
value={formData.team_owner}
onChange={(e) => setFormData({ ...formData, team_owner: e.target.value })}
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="backend-team"
/>
</div>
<div>
<label className="block text-sm font-medium">Cost Center</label>
<input
type="text"
value={formData.cost_center}
onChange={(e) => setFormData({ ...formData, cost_center: e.target.value })}
required
className="mt-1 block w-full rounded-md border-gray-300"
placeholder="CC-1234"
/>
</div>
{costEstimate !== null && (
<div className="rounded-md bg-blue-50 p-4">
<p className="text-sm text-blue-700">
Estimated cost: <strong>${costEstimate.toFixed(2)}/month</strong>
</p>
</div>
)}
<button
type="submit"
disabled={createMutation.isLoading}
className="w-full bg-blue-600 text-white py-2 px-4 rounded-md hover:bg-blue-700 disabled:opacity-50"
>
{createMutation.isLoading ? 'Provisioning...' : 'Create Database'}
</button>
</form>
);
}
Component 3: API Backend
FastAPI backend with validation and workflow orchestration:
# api/main.py
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field, field_validator
from typing import Optional
import uuid
from datetime import datetime

app = FastAPI(title="Infrastructure Platform API")

class DatabaseRequest(BaseModel):
    database_name: str = Field(..., pattern="^[a-z][a-z0-9-]{2,30}$")
    environment: str = Field(..., pattern="^(dev|staging|prod)$")
    instance_class: str = "db.t3.medium"
    storage_gb: int = Field(default=100, ge=20, le=1000)
    backup_retention_days: int = Field(default=7, ge=1, le=35)
    multi_az: bool = False
    team_owner: str
    cost_center: str

    @field_validator('instance_class')
    @classmethod
    def validate_instance_class(cls, v: str) -> str:
        allowed = ['db.t3.small', 'db.t3.medium', 'db.t3.large', 'db.r6g.xlarge']
        if v not in allowed:
            raise ValueError(f'Instance class must be one of {allowed}')
        return v
class DatabaseResponse(BaseModel):
request_id: str
status: str
database_name: str
estimated_completion_time: str
status_url: str
@app.post("/api/v1/databases", response_model=DatabaseResponse)
async def create_database(
request: DatabaseRequest,
background_tasks: BackgroundTasks
):
"""Create a new PostgreSQL database"""
# Generate request ID
request_id = str(uuid.uuid4())
# Validate against policy
policy_check = await validate_policy(request)
if not policy_check.allowed:
raise HTTPException(
status_code=403,
detail=f"Policy violation: {policy_check.reason}"
)
# Check for naming conflicts
if await database_exists(request.database_name, request.environment):
raise HTTPException(
status_code=409,
detail=f"Database '{request.database_name}' already exists in {request.environment}"
)
# Store request in database
await store_request(request_id, request)
# Start provisioning in background
background_tasks.add_task(
provision_database,
request_id=request_id,
request=request
)
return DatabaseResponse(
request_id=request_id,
status="provisioning",
database_name=request.database_name,
estimated_completion_time="5-10 minutes",
status_url=f"/api/v1/requests/{request_id}"
)
async def provision_database(request_id: str, request: DatabaseRequest):
"""Background task to provision database"""
try:
# Update status
await update_request_status(request_id, "validating")
# Generate Terraform configuration
terraform_config = generate_terraform_config(request)
# Update status
await update_request_status(request_id, "creating_resources")
# Apply Terraform
result = await apply_terraform(
request_id=request_id,
config=terraform_config,
environment=request.environment
)
# Create monitoring dashboard
await create_monitoring_dashboard(request.database_name, result['db_instance_id'])
# Store credentials in secrets manager
credentials_arn = await store_credentials(
database_name=request.database_name,
username=result['master_username'],
password=result['master_password'],
endpoint=result['endpoint']
)
# Update status
await update_request_status(
request_id,
"completed",
outputs={
"endpoint": result['endpoint'],
"port": result['port'],
"credentials_secret_arn": credentials_arn,
"monitoring_dashboard": f"https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name={request.database_name}"
}
)
# Send notification
await notify_user(
request_id=request_id,
database_name=request.database_name,
endpoint=result['endpoint'],
credentials_arn=credentials_arn
)
except Exception as e:
await update_request_status(
request_id,
"failed",
error=str(e)
)
await notify_failure(request_id, str(e))
def generate_terraform_config(request: DatabaseRequest, request_id: str = "unknown") -> str:
    """Generate Terraform configuration from request.

    request_id is passed separately because it is generated by the API
    handler and is not a field on the request model.
    """
    return f"""
terraform {{
  backend "s3" {{
    bucket = "platform-terraform-state"
    key    = "databases/{request.environment}/{request.database_name}/terraform.tfstate"
    region = "us-east-1"
  }}
}}

module "postgres_database" {{
  source = "git::https://github.com/company/terraform-modules//postgres-rds?ref=v2.1.0"

  identifier              = "{request.database_name}"
  instance_class          = "{request.instance_class}"
  allocated_storage       = {request.storage_gb}
  backup_retention_period = {request.backup_retention_days}
  multi_az                = {str(request.multi_az).lower()}

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.database_subnet_ids

  tags = {{
    Name        = "{request.database_name}"
    Environment = "{request.environment}"
    Team        = "{request.team_owner}"
    CostCenter  = "{request.cost_center}"
    ManagedBy   = "platform"
    RequestID   = "{request_id}"
  }}
}}

output "endpoint" {{
  value = module.postgres_database.endpoint
}}

output "port" {{
  value = module.postgres_database.port
}}

output "db_instance_id" {{
  value = module.postgres_database.db_instance_id
}}
"""
async def apply_terraform(request_id: str, config: str, environment: str) -> dict:
"""Apply Terraform configuration"""
import tempfile
import subprocess
import json
# Write config to temp directory
with tempfile.TemporaryDirectory() as tmpdir:
config_path = f"{tmpdir}/main.tf"
with open(config_path, 'w') as f:
f.write(config)
# Initialize Terraform
subprocess.run(
["terraform", "init"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Plan
subprocess.run(
["terraform", "plan", "-out=tfplan"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Apply
result = subprocess.run(
["terraform", "apply", "-auto-approve", "tfplan"],
cwd=tmpdir,
check=True,
capture_output=True
)
# Get outputs
output_result = subprocess.run(
["terraform", "output", "-json"],
cwd=tmpdir,
check=True,
capture_output=True
)
outputs = json.loads(output_result.stdout)
return {
"endpoint": outputs["endpoint"]["value"],
"port": outputs["port"]["value"],
"db_instance_id": outputs["db_instance_id"]["value"]
}
Component 4: Policy Engine
Enforce governance and cost controls:
# policy/engine.py
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PolicyResult:
    allowed: bool
    reason: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
class PolicyEngine:
def __init__(self):
self.policies = [
self.check_cost_limit,
self.check_environment_restrictions,
self.check_naming_conventions,
self.check_required_tags,
]
async def evaluate(self, request: DatabaseRequest) -> PolicyResult:
"""Evaluate all policies"""
warnings = []
for policy in self.policies:
result = await policy(request)
if not result.allowed:
return result
if result.warnings:
warnings.extend(result.warnings)
return PolicyResult(allowed=True, warnings=warnings)
async def check_cost_limit(self, request: DatabaseRequest) -> PolicyResult:
"""Enforce cost limits per environment"""
estimated_cost = calculate_monthly_cost(request)
limits = {
'dev': 100,
'staging': 500,
'prod': 10000
}
limit = limits.get(request.environment, 100)
if estimated_cost > limit:
return PolicyResult(
allowed=False,
reason=f"Estimated cost ${estimated_cost}/month exceeds limit "
f"${limit}/month for {request.environment}. "
f"Please request approval from #platform-team."
)
# Warning for high cost in non-prod
if request.environment != 'prod' and estimated_cost > 200:
return PolicyResult(
allowed=True,
warnings=[
f"Cost ${estimated_cost}/month is high for {request.environment}. "
f"Consider using a smaller instance."
]
)
return PolicyResult(allowed=True)
async def check_environment_restrictions(self, request: DatabaseRequest) -> PolicyResult:
"""Production databases require additional validation"""
if request.environment == 'prod':
# Production databases must have backups
if request.backup_retention_days < 7:
return PolicyResult(
allowed=False,
reason="Production databases must have at least 7 days backup retention"
)
# Warn if not multi-AZ
if not request.multi_az:
return PolicyResult(
allowed=True,
warnings=[
"Production database is not multi-AZ. "
"Consider enabling for high availability."
]
)
return PolicyResult(allowed=True)
async def check_naming_conventions(self, request: DatabaseRequest) -> PolicyResult:
"""Enforce naming standards"""
# Name should not contain environment (it's in tags)
if request.environment in request.database_name:
return PolicyResult(
allowed=True,
warnings=[
f"Database name contains environment '{request.environment}'. "
f"Environment is already tagged separately."
]
)
return PolicyResult(allowed=True)
async def check_required_tags(self, request: DatabaseRequest) -> PolicyResult:
"""Validate required tags are present"""
if not request.team_owner:
return PolicyResult(
allowed=False,
reason="team_owner is required for cost allocation"
)
if not request.cost_center:
return PolicyResult(
allowed=False,
reason="cost_center is required for billing"
)
return PolicyResult(allowed=True)
def calculate_monthly_cost(request: DatabaseRequest) -> float:
"""Calculate estimated monthly cost"""
# Instance costs (simplified)
instance_costs = {
'db.t3.small': 30,
'db.t3.medium': 60,
'db.t3.large': 120,
'db.r6g.xlarge': 250,
}
base_cost = instance_costs.get(request.instance_class, 60)
storage_cost = request.storage_gb * 0.115
backup_cost = request.storage_gb * request.backup_retention_days * 0.02
total = base_cost + storage_cost + backup_cost
if request.multi_az:
total *= 2
return round(total, 2)
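A worked example makes the formula concrete. With the defaults (db.t3.medium at $60, 100 GB, 7-day retention), the total is 60 + 100 × 0.115 + 100 × 7 × 0.02 = $85.50/month, doubling to $171.00 with multi-AZ. The sketch below inlines the same arithmetic so the numbers can be checked standalone:

```python
def monthly_cost(instance_cost: float, storage_gb: int,
                 backup_retention_days: int, multi_az: bool) -> float:
    """Same formula as calculate_monthly_cost, inlined for a worked example."""
    total = (instance_cost
             + storage_gb * 0.115                          # storage
             + storage_gb * backup_retention_days * 0.02)  # backups
    return round(total * (2 if multi_az else 1), 2)

single_az = monthly_cost(60, 100, 7, multi_az=False)  # 85.5
multi_az = monthly_cost(60, 100, 7, multi_az=True)    # 171.0
```

The per-GB rates here are the simplified constants used throughout this guide, not live AWS pricing.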
Component 5: Request Status Tracking
Let developers see what's happening:
# api/status.py
from enum import Enum
from datetime import datetime
from typing import Optional, Dict
class RequestStatus(str, Enum):
PENDING = "pending"
VALIDATING = "validating"
CREATING_RESOURCES = "creating_resources"
CONFIGURING = "configuring"
COMPLETED = "completed"
FAILED = "failed"
@app.get("/api/v1/requests/{request_id}")
async def get_request_status(request_id: str):
"""Get status of infrastructure request"""
request = await db.get_request(request_id)
if not request:
raise HTTPException(status_code=404, detail="Request not found")
# Get provisioning steps
steps = await get_provisioning_steps(request_id)
# Get logs
logs = await get_provisioning_logs(request_id)
return {
"request_id": request_id,
"status": request.status,
"created_at": request.created_at,
"updated_at": request.updated_at,
"estimated_completion": request.estimated_completion,
"resource": {
"type": "postgres-database",
"name": request.database_name,
"environment": request.environment
},
"steps": steps,
"outputs": request.outputs if request.status == "completed" else None,
"error": request.error if request.status == "failed" else None,
"logs_url": f"/api/v1/requests/{request_id}/logs"
}
async def get_provisioning_steps(request_id: str) -> list:
    """Get provisioning steps with status (hard-coded example payload)"""
return [
{
"step": 1,
"name": "Validate request",
"status": "completed",
"duration_seconds": 2
},
{
"step": 2,
"name": "Generate configuration",
"status": "completed",
"duration_seconds": 1
},
{
"step": 3,
"name": "Create database instance",
"status": "in_progress",
"duration_seconds": 180,
"details": "Waiting for RDS instance to be available..."
},
{
"step": 4,
"name": "Configure monitoring",
"status": "pending"
},
{
"step": 5,
"name": "Store credentials",
"status": "pending"
}
]
Advanced Features
Feature 1: Cost Forecasting
Help developers understand financial impact:
# cost/forecasting.py
from datetime import datetime, timedelta
from typing import List, Dict
class CostForecaster:
def forecast_monthly_cost(self, request: DatabaseRequest) -> Dict:
"""Forecast costs with breakdown"""
compute = self.calculate_compute_cost(request.instance_class, request.multi_az)
storage = self.calculate_storage_cost(request.storage_gb)
backup = self.calculate_backup_cost(request.storage_gb, request.backup_retention_days)
total = compute + storage + backup
return {
"monthly_total": round(total, 2),
"breakdown": {
"compute": round(compute, 2),
"storage": round(storage, 2),
"backup": round(backup, 2)
},
"annual_total": round(total * 12, 2),
"recommendations": self.get_cost_recommendations(request, total)
}
def get_cost_recommendations(self, request: DatabaseRequest, current_cost: float) -> List[str]:
    """Suggest cost optimizations"""
    recommendations = []
    # ARM instance recommendation: compare compute cost only, since
    # storage and backup pricing is the same across instance families
    if request.instance_class.startswith('db.r5'):
        arm_equivalent = request.instance_class.replace('r5', 'r6g')
        current_compute = self.calculate_compute_cost(request.instance_class, request.multi_az)
        arm_compute = self.calculate_compute_cost(arm_equivalent, request.multi_az)
        savings = current_compute - arm_compute
        if savings > 10:
            recommendations.append(
                f"Switch to {arm_equivalent} (ARM-based) to save ${savings:.2f}/month"
            )
# Storage optimization
if request.storage_gb > 500 and request.environment != 'prod':
recommendations.append(
f"Consider starting with 250GB storage for {request.environment}. "
f"You can scale up later."
)
# Backup retention
if request.backup_retention_days > 14 and request.environment == 'dev':
recommendations.append(
"Development databases rarely need >7 days backup retention"
)
return recommendations
Feature 2: Ephemeral Environments
Create temporary environments that auto-expire:
# environments/ephemeral.py
from datetime import datetime, timedelta
from typing import Dict

class EphemeralEnvironment:
    async def create_preview_environment(
        self,
        pull_request_id: int,
        base_environment: str = "staging",
        ttl_hours: int = 24
    ) -> str:
"""Create temporary environment for PR preview"""
env_name = f"pr-{pull_request_id}"
expires_at = datetime.utcnow() + timedelta(hours=ttl_hours)
# Clone configuration from base environment
config = self.clone_environment_config(base_environment)
# Scale down for cost savings
config = self.optimize_for_preview(config)
# Provision environment
env_id = await self.provision_environment(
name=env_name,
config=config,
labels={
"type": "ephemeral",
"pull_request": str(pull_request_id),
"expires_at": expires_at.isoformat()
}
)
# Schedule cleanup
await self.schedule_cleanup(env_id, expires_at)
# Comment on PR
await self.comment_on_pr(
pull_request_id,
f"Preview environment ready: https://{env_name}.preview.company.com\n"
f"Environment will be deleted in {ttl_hours} hours."
)
return env_id
def optimize_for_preview(self, config: Dict) -> Dict:
"""Scale down resources for cost savings"""
# Use smaller instance types
if 'instance_class' in config:
config['instance_class'] = 'db.t3.small'
# Reduce storage
if 'storage_gb' in config and config['storage_gb'] > 100:
config['storage_gb'] = 100
# Disable multi-AZ
config['multi_az'] = False
# Shorter backup retention
config['backup_retention_days'] = 1
return config
async def schedule_cleanup(self, env_id: str, expires_at: datetime):
"""Schedule automatic environment cleanup"""
# Use Lambda/Cloud Function with scheduled trigger
# Or add to cleanup queue
await cleanup_queue.enqueue(
job_id=f"cleanup-{env_id}",
run_at=expires_at,
action="delete_environment",
parameters={"environment_id": env_id}
)
Feature 3: Drift Detection
Alert when infrastructure diverges from defined state:
# monitoring/drift.py
import asyncio
from datetime import datetime
from typing import List, Dict
class DriftDetector:
async def detect_drift_all(self) -> List[Dict]:
"""Check all managed resources for drift"""
resources = await self.get_managed_resources()
drift_results = []
for resource in resources:
drift = await self.check_resource_drift(resource)
if drift['has_drift']:
drift_results.append(drift)
return drift_results
async def check_resource_drift(self, resource: Dict) -> Dict:
"""Check single resource for drift"""
# Get expected state from Terraform/platform
expected_state = await self.get_expected_state(resource['id'])
# Get actual state from cloud provider
actual_state = await self.get_actual_state(
resource['type'],
resource['id']
)
# Compare
differences = self.compare_states(expected_state, actual_state)
if differences:
# Alert team
await self.alert_drift(
resource_id=resource['id'],
differences=differences
)
return {
"resource_id": resource['id'],
"resource_type": resource['type'],
"has_drift": True,
"differences": differences,
"detected_at": datetime.utcnow().isoformat()
}
return {
"resource_id": resource['id'],
"has_drift": False
}
async def alert_drift(self, resource_id: str, differences: List[Dict]):
"""Send drift alert to team"""
message = f"""
🚨 Configuration Drift Detected
Resource: {resource_id}
Changes detected:
{self.format_differences(differences)}
This means the resource was modified outside of the platform.
Actions:
1. Review changes in cloud console
2. If changes are desired, update platform configuration
3. If changes are errors, use platform to revert
View details: https://platform.company.com/resources/{resource_id}/drift
"""
await self.send_slack_alert(channel="#platform-alerts", message=message)
# Run drift detection daily
async def scheduled_drift_detection():
detector = DriftDetector()
while True:
print("Running drift detection...")
drift_results = await detector.detect_drift_all()
if drift_results:
print(f"Found {len(drift_results)} resources with drift")
else:
print("No drift detected")
# Wait 24 hours
await asyncio.sleep(86400)
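The compare_states helper the detector relies on is not shown above; a shallow field-by-field diff is usually enough for a first version. A minimal sketch, ignoring nested structures and provider-injected fields, which a real implementation must handle:

```python
def compare_states(expected: dict, actual: dict) -> list[dict]:
    """Return one entry per field whose live value differs from the declared value."""
    differences = []
    for key, expected_value in expected.items():
        actual_value = actual.get(key)
        if actual_value != expected_value:
            differences.append({
                "field": key,
                "expected": expected_value,
                "actual": actual_value,
            })
    return differences

# Someone resized the instance outside the platform
diffs = compare_states(
    {"instance_class": "db.t3.medium", "multi_az": False},
    {"instance_class": "db.t3.large", "multi_az": False},
)
```

Iterating over the expected state only means fields added out-of-band (e.g. extra tags) are not flagged; whether to treat those as drift is a policy decision.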
Feature 4: Resource Lifecycle Management
Automate cleanup of unused resources:
# lifecycle/manager.py
from datetime import datetime, timedelta
from typing import List, Dict
class LifecycleManager:
async def scan_unused_resources(self):
"""Find resources that haven't been used recently"""
unused = []
# Check databases with no connections
databases = await self.get_all_databases()
for db in databases:
connections = await self.get_connection_count(db['id'], days=30)
if connections == 0:
unused.append({
"type": "database",
"id": db['id'],
"name": db['name'],
"reason": "No connections in 30 days",
"monthly_cost": db['monthly_cost'],
"owner": db['team_owner']
})
# Check EC2 instances with low CPU
instances = await self.get_all_instances()
for instance in instances:
avg_cpu = await self.get_avg_cpu(instance['id'], days=7)
if avg_cpu < 5:
unused.append({
"type": "ec2_instance",
"id": instance['id'],
"name": instance['name'],
"reason": f"Average CPU {avg_cpu}% over 7 days",
"monthly_cost": instance['monthly_cost'],
"owner": instance['team_owner']
})
# Notify owners
await self.notify_unused_resources(unused)
return unused
async def notify_unused_resources(self, unused: List[Dict]):
"""Notify teams about potentially unused resources"""
# Group by owner
by_owner = {}
for resource in unused:
owner = resource['owner']
if owner not in by_owner:
by_owner[owner] = []
by_owner[owner].append(resource)
# Send notifications
for owner, resources in by_owner.items():
total_cost = sum(r['monthly_cost'] for r in resources)
message = f"""
💰 Unused Resource Alert
Team: {owner}
Potentially unused resources: {len(resources)}
Estimated monthly savings: ${total_cost:.2f}
Resources:
{self.format_resource_list(resources)}
These resources haven't been used recently. If you no longer need them, consider deleting to save costs.
Review and manage: https://platform.company.com/resources/unused
"""
await self.send_team_notification(owner, message)
    async def auto_delete_ephemeral(self):
        """Automatically delete expired ephemeral environments"""
        now = datetime.utcnow()

        # Find expired environments
        expired = await db.query("""
            SELECT * FROM environments
            WHERE type = 'ephemeral'
              AND expires_at < $1
              AND status != 'deleted'
        """, now)

        for env in expired:
            print(f"Deleting expired environment: {env['name']}")

            # Delete resources
            await self.delete_environment(env['id'])

            # Notify owner
            await self.notify_environment_deleted(env)
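Cleanup jobs like these need to run on a schedule. A minimal sketch of a periodic runner, assuming the jobs are async callables like the methods above (the `run_periodically` helper and its parameters are illustrative, not part of any library):

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def run_periodically(job: Callable[[], Awaitable[None]],
                           interval_seconds: float,
                           max_runs: Optional[int] = None) -> int:
    """Run an async cleanup job repeatedly; one failed run must not kill the loop."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await job()
        except Exception as exc:  # log and keep going
            print(f"cleanup job failed: {exc}")
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval_seconds)
    return runs

# Example wiring (manager is the hypothetical object holding the jobs):
# asyncio.run(run_periodically(manager.auto_delete_ephemeral, 3600))
```

In production you would more likely hand this to a real scheduler (cron, Kubernetes CronJob, or a task queue), but the failure-isolation point stands either way: a transient error in one sweep should never stop future sweeps.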
Measuring Success
Key Metrics
Developer Experience:
# metrics/developer_experience.py
class PlatformMetrics:
    def calculate_time_to_infrastructure(self) -> Dict:
        """Measure speed of infrastructure delivery over the last 30 days"""
        row = db.query("""
            SELECT
                AVG(EXTRACT(EPOCH FROM (completed_at - created_at)) / 60) AS avg_minutes,
                PERCENTILE_CONT(0.5) WITHIN GROUP (
                    ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at)) / 60
                ) AS median_minutes,
                PERCENTILE_CONT(0.95) WITHIN GROUP (
                    ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at)) / 60
                ) AS p95_minutes
            FROM requests
            WHERE status = 'completed'
              AND created_at > NOW() - INTERVAL '30 days'
        """)[0]

        return {
            "avg_provision_time_minutes": row['avg_minutes'],
            "median_provision_time_minutes": row['median_minutes'],
            "p95_provision_time_minutes": row['p95_minutes']
        }
    def calculate_self_service_adoption(self) -> Dict:
        """Measure adoption of self-service vs ticket-based requests this month"""
        self_service = db.query("""
            SELECT COUNT(*) AS n
            FROM requests
            WHERE created_at > DATE_TRUNC('month', CURRENT_DATE)
        """)[0]['n']

        tickets = db.query("""
            SELECT COUNT(*) AS n
            FROM jira_tickets
            WHERE type = 'infrastructure'
              AND created_at > DATE_TRUNC('month', CURRENT_DATE)
        """)[0]['n']

        total = self_service + tickets
        return {
            "self_service_this_month": self_service,
            "tickets_this_month": tickets,
            "adoption_rate": self_service / total if total else 0.0
        }
Platform Team Efficiency:
- Tickets handled per engineer per week
- Time spent on toil vs strategic work
- Number of manual interventions required
Cost Optimization:
- Resources created within cost guardrails
- Unused resources identified and deleted
- Cost savings from rightsizing recommendations
Quality:
- Infrastructure provisioned with proper tagging
- Resources meeting compliance standards
- Drift incidents per month
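The tagging metric above reduces to a pass over the resource inventory. A sketch, assuming each resource record carries a `tags` dict (the required tag set and field names are illustrative):

```python
from typing import Dict, List

# Tags every resource must carry (illustrative policy)
REQUIRED_TAGS = {"team", "environment", "cost-center"}

def tagging_compliance(resources: List[Dict]) -> Dict:
    """Return the share of resources carrying all required tags."""
    compliant = [
        r for r in resources
        if REQUIRED_TAGS.issubset(r.get("tags", {}).keys())
    ]
    return {
        "total": len(resources),
        "compliant": len(compliant),
        "compliance_rate": len(compliant) / len(resources) if resources else 1.0,
        "missing": [
            {"id": r["id"],
             "missing_tags": sorted(REQUIRED_TAGS - r.get("tags", {}).keys())}
            for r in resources if r not in compliant
        ],
    }
```

Because the platform provisions everything through templates, this number should be close to 100%; a drop usually means resources were created outside the platform, which also feeds the drift metric.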
Dashboard Example
// components/PlatformDashboard.tsx
export function PlatformDashboard() {
  const { data: metrics } = useQuery('platform-metrics', fetchMetrics);

  // useQuery data is undefined until the first fetch resolves
  if (!metrics) return null;

  return (
    <div className="grid grid-cols-3 gap-6">
      <MetricCard
        title="Avg. Provision Time"
        value={`${metrics.avg_provision_time} min`}
        trend="-23%"
        trendDirection="down"
        description="Time from request to ready"
      />
      <MetricCard
        title="Self-Service Adoption"
        value={`${metrics.adoption_rate}%`}
        trend="+15%"
        trendDirection="up"
        description="Requests via platform vs tickets"
      />
      <MetricCard
        title="Monthly Cost Savings"
        value={`$${metrics.cost_savings}`}
        trend="+$2.1k"
        trendDirection="up"
        description="From rightsizing and cleanup"
      />
      <MetricCard
        title="Active Resources"
        value={metrics.active_resources}
        description="Managed by platform"
      />
      <MetricCard
        title="Drift Incidents"
        value={metrics.drift_incidents}
        trend="-5"
        trendDirection="down"
        description="Manual changes detected"
      />
      <MetricCard
        title="Policy Violations"
        value={metrics.policy_violations}
        trend="-12"
        trendDirection="down"
        description="Blocked by policy this month"
      />
    </div>
  );
}
Common Pitfalls
Pitfall 1: Too Much Customization
Problem: Offering every possible configuration option overwhelms developers and creates support burden.
Solution: Start with 3-5 common configurations. Add options based on demand.
# Bad: 50+ configuration options
# Good: 3 preset sizes + advanced mode
database_presets:
  small:
    instance_class: db.t3.small
    storage_gb: 50
    description: "Development and testing"
    cost: "$30/month"
  medium:
    instance_class: db.t3.large
    storage_gb: 200
    description: "Small production workloads"
    cost: "$120/month"
  large:
    instance_class: db.r6g.2xlarge
    storage_gb: 500
    description: "Production workloads"
    cost: "$500/month"
Pitfall 2: No Cost Guardrails
Problem: Developers create expensive resources without understanding costs.
Solution: Show costs upfront, set limits, require approval for expensive resources.
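One way to enforce this is a pre-provision check that routes a request to auto-approve, manager approval, or rejection based on its estimated monthly cost. A sketch with illustrative thresholds (the limits and the `GuardrailDecision` type are assumptions, not a standard API):

```python
from dataclasses import dataclass

AUTO_APPROVE_LIMIT = 200.0   # USD/month: provision immediately (illustrative)
HARD_LIMIT = 2000.0          # USD/month: reject outright (illustrative)

@dataclass
class GuardrailDecision:
    allowed: bool
    needs_approval: bool
    reason: str

def check_cost_guardrail(estimated_monthly_cost: float) -> GuardrailDecision:
    """Decide whether a request proceeds, needs approval, or is rejected."""
    if estimated_monthly_cost > HARD_LIMIT:
        return GuardrailDecision(False, False,
            f"${estimated_monthly_cost:.2f}/mo exceeds hard limit ${HARD_LIMIT:.2f}")
    if estimated_monthly_cost > AUTO_APPROVE_LIMIT:
        return GuardrailDecision(True, True,
            f"${estimated_monthly_cost:.2f}/mo requires approval")
    return GuardrailDecision(True, False, "within auto-approve limit")
```

Surfacing the `reason` string in the portal form doubles as the "show costs upfront" half of the solution: developers see why a request needs approval before they submit it.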
Pitfall 3: Ignoring Day 2 Operations
Problem: Platform makes it easy to create resources but not maintain them.
Solution: Build lifecycle management, monitoring, and cleanup into the platform.
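Lifecycle management starts at creation time: the ephemeral cleanup job shown earlier only works if every resource is stamped with expiry metadata when it is provisioned. A sketch of that stamping step, with illustrative TTL defaults and field names:

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Default lifetimes per environment type (illustrative)
DEFAULT_TTL = {"ephemeral": timedelta(days=7), "preview": timedelta(days=2)}

def with_lifecycle_metadata(request: Dict,
                            now: Optional[datetime] = None) -> Dict:
    """Attach created_at/expires_at so downstream cleanup jobs can reap it."""
    now = now or datetime.utcnow()
    env_type = request.get("type", "permanent")
    record = dict(request, created_at=now.isoformat())
    ttl = DEFAULT_TTL.get(env_type)
    if ttl is not None:  # permanent resources get no expiry
        record["expires_at"] = (now + ttl).isoformat()
    return record
```

With `expires_at` always present on short-lived resources, Day 2 cleanup becomes a simple query instead of a forensic exercise.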
Pitfall 4: Building Instead of Buying
Problem: Spending years building a custom platform when products exist.
Solution: Evaluate existing tools first:
- Backstage (Spotify): Open source developer portal
- Port (Port.io): Internal developer platform
- Humanitec: Platform orchestrator
- Terraform Cloud: Managed Terraform with RBAC
- Pulumi Cloud: Managed Pulumi with team features
Build custom only if:
- Existing tools don't fit your workflow
- You have engineering capacity to maintain
- Your requirements are truly unique
Pitfall 5: Poor Documentation
Problem: Developers don't know what's available or how to use it.
Solution: Treat documentation as a product. Include:
- Getting started guides
- Reference for each service catalog item
- Troubleshooting common issues
- Example configurations
- Architecture diagrams
Conclusion
A self-service infrastructure platform is a force multiplier for engineering organizations. Done right, it accelerates development, reduces toil, and maintains security and cost controls.
Key Principles:
- Start simple. One service catalog item is better than zero.
- Focus on 80% use cases. Don't try to handle every edge case initially.
- Measure adoption. Track usage and iterate based on feedback.
- Enforce guardrails. Prevent mistakes without blocking legitimate use cases.
- Automate Day 2 operations. Lifecycle management is as important as provisioning.
Build vs Buy Decision Matrix:
Build when:
- Existing tools don't integrate with your stack
- You have 3+ platform engineers
- Your requirements are truly unique
- You're willing to maintain long-term
Buy/adopt existing when:
- Products fit your workflow (Backstage, Port, etc.)
- Platform team is small (< 3 people)
- You want to move quickly
- Standard features meet 80% of needs
The best platform is the one that developers actually use. Get feedback early, iterate fast, and measure what matters: time to infrastructure, adoption rate, and developer satisfaction.
Next Steps
- Identify your first service catalog item: What infrastructure request do you get most often?
- Build a prototype: Create a simple form that provisions that one thing
- Get feedback: Have 3-5 developers try it and iterate
- Measure success: Track time savings and adoption
- Scale gradually: Add more catalog items based on demand
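The prototype in step 2 can be smaller than it sounds: a single handler that validates the form payload and resolves it to a named preset. A dependency-free sketch (the preset values and validation rules are illustrative):

```python
from typing import Dict

# Two starter presets (illustrative values)
PRESETS = {
    "small": {"instance_class": "db.t3.small", "storage_gb": 50},
    "medium": {"instance_class": "db.t3.large", "storage_gb": 200},
}

def handle_database_request(form: Dict) -> Dict:
    """Validate a portal form submission and return provisioning parameters."""
    name = form.get("name", "")
    size = form.get("size", "small")
    if not name.isidentifier():
        raise ValueError("name must be a valid identifier")
    if size not in PRESETS:
        raise ValueError(f"unknown size {size!r}; choose from {sorted(PRESETS)}")
    return {"name": name, **PRESETS[size],
            "tags": {"team": form.get("team", "unknown")}}
```

Hook the returned dict up to your existing Terraform or Pulumi pipeline and you have a working end-to-end slice to put in front of your first 3-5 developers.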
Additional Resources
- Backstage by Spotify: Open Source Developer Portal
- CNCF Platforms White Paper
- Pulumi Automation API Documentation
- Port: Internal Developer Portal Documentation
Building a self-service platform? Share your experiences and challenges in the comments below.