Standardization: The Prerequisite to Scale
Category: Platform Engineering
Reading Time: 6 minutes
Tags: Standards, Platform, Scale
The Chaos of 47 Ways to Deploy
When I joined NowFloats in 2018, we had a problem. A big one.
We had 47 different deployment methods across 120 microservices:
- 12 services used Jenkins
- 8 used GitLab CI
- 15 used CircleCI
- 6 used manual shell scripts
- 3 used Ansible playbooks
- 1 service (I kid you not) deployed via FTP
- The rest? A mix of tribal knowledge and "ask Sarah"
Each service had its own:
- Docker base image (23 different variations)
- Logging format (JSON, text, custom, none)
- Metrics endpoint (Prometheus, StatsD, custom, none)
- Health check path (/health, /healthz, /status, /_health, none)
- Configuration method (ENV vars, config files, hardcoded, CLI flags)
The result?
- Deploy time: 2-4 hours per service
- Failed deploys: a 30% failure rate
- Onboarding: 3 weeks to deploy first service
- Debugging: "Which service is failing?" became a 30-minute investigation
- Scaling: Impossible. Each service was a unique snowflake.
The lesson: You cannot optimize what you haven't defined. Standardization is the prerequisite to scale.
Why Standardization Matters (The Math)
Let's do some simple math:
Without Standardization
- Number of services: 100
- Unique deployment methods: 47
- Time to learn each method: 2 hours
- Total knowledge required: 94 hours (2+ weeks)
- Bus factor: Knowledge locked in 10+ people's heads
- Onboarding time: 3 weeks minimum
With Standardization
- Number of services: 100
- Standard deployment methods: 1
- Time to learn the method: 4 hours
- Total knowledge required: 4 hours
- Bus factor: No longer a concern. The process is documented and automated.
- Onboarding time: 1 day
Productivity multiplier: 23.5x
That's not theoretical. That's what we achieved at NowFloats.
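The arithmetic above is easy to sanity-check in a few lines. The hour figures are the estimates from the lists, not measurements:

```python
# Rough model of the knowledge cost before and after standardization.
# Hour estimates come from the lists above; they are illustrative.
methods_before, hours_per_method = 47, 2   # 47 unique methods, ~2h each
methods_after, hours_standard = 1, 4       # 1 documented method, ~4h once

knowledge_before = methods_before * hours_per_method   # 94 hours
knowledge_after = methods_after * hours_standard       # 4 hours
multiplier = knowledge_before / knowledge_after

print(f"{knowledge_before}h -> {knowledge_after}h = {multiplier}x")  # 94h -> 4h = 23.5x
```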
The Golden Path: Making the Right Thing the Easy Thing
The key insight: Don't force standardization. Make it irresistible.
We created what we called the "Golden Path": a paved road that makes doing the right thing easier than doing the wrong thing.
Before (Snowflake Deployment)
#!/bin/bash
# service-a/deploy.sh
ssh prod-server-1 "docker pull myregistry/service-a:${TAG}"
ssh prod-server-1 "docker stop service-a || true"
ssh prod-server-1 "docker run -d --name service-a \
  -p 8080:8080 \
  -e DB_HOST=prod-db \
  -e LOG_LEVEL=info \
  myregistry/service-a:${TAG}"
# Hope it works...
Problems:
- No rollback mechanism
- No health checks
- Hardcoded configuration
- Manual intervention required
- Can't scale horizontally
- No metrics or monitoring
After (Golden Path)
# service-a/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
  labels:
    app: service-a
    version: v1.0.0
    team: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: service-a
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: myregistry/service-a:{{ .Values.version }}
          # Standardized ports
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
          # Standardized health checks
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          # Standardized resources
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Standardized config
          envFrom:
            - configMapRef:
                name: service-a-config
            - secretRef:
                name: service-a-secrets
Deploy command:
# One command for all services
make deploy SERVICE=service-a ENV=production
What changed:
- ✅ Automatic rollbacks on failure
- ✅ Built-in health checks
- ✅ Configuration management
- ✅ Zero-downtime deployments
- ✅ Horizontal scaling ready
- ✅ Metrics automatically scraped
The Four Pillars of Standardization
1. Standard Deployment Pipeline
One pipeline template for all services:
# .github/workflows/deploy.yml (template)
name: Deploy Service
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      environment:
        required: true
        type: string
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v4
      # Step 1: Build
      - name: Build Docker image
        run: |
          docker build \
            -t ${{ inputs.service-name }}:${{ github.sha }} \
            -f Dockerfile \
            .
      # Step 2: Test
      - name: Run tests
        run: |
          docker run ${{ inputs.service-name }}:${{ github.sha }} \
            npm test
      # Step 3: Security scan
      - name: Security scan
        run: |
          trivy image ${{ inputs.service-name }}:${{ github.sha }}
      # Step 4: Push
      - name: Push to registry
        run: |
          docker push ${{ inputs.service-name }}:${{ github.sha }}
      # Step 5: Deploy
      - name: Deploy to Kubernetes
        run: |
          helm upgrade --install ${{ inputs.service-name }} \
            ./charts/service \
            --set image.tag=${{ github.sha }} \
            --namespace ${{ inputs.environment }}
      # Step 6: Verify
      - name: Smoke test
        run: |
          kubectl wait --for=condition=ready pod \
            -l app=${{ inputs.service-name }} \
            --timeout=300s
Usage in any service:
# service-a/.github/workflows/deploy-prod.yml
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  deploy:
    uses: ./.github/workflows/deploy.yml
    with:
      service-name: service-a
      environment: production
Result: Same deployment process for 100+ services.
2. Standard Observability
Every service gets the same observability stack:
# Standard Python service template
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from prometheus_client import Counter, Histogram, make_asgi_app
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import structlog

# Configure structured logging (standard)
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)
logger = structlog.get_logger()

# Configure metrics (standard)
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Configure tracing (standard)
tracer = trace.get_tracer(__name__)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Standard health checks
@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    # Check dependencies (check_database / check_redis are
    # service-specific probes each team implements)
    db_healthy = await check_database()
    cache_healthy = await check_redis()
    if not (db_healthy and cache_healthy):
        return JSONResponse({"status": "not ready"}, status_code=503)
    return {"status": "ready"}

# Standard metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Standard request logging
@app.middleware("http")
async def log_requests(request: Request, call_next):
    logger.info(
        "request_started",
        method=request.method,
        path=request.url.path,
        client=request.client.host,
    )
    with http_latency.labels(
        method=request.method,
        endpoint=request.url.path,
    ).time():
        response = await call_next(request)
    http_requests.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
    )
    return response
What this gives you:
- Consistent log format across all services
- Automatic Prometheus metrics
- Distributed tracing out of the box
- Standardized health check paths
- Easy debugging ("All logs are JSON")
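The readiness probe in the template delegates to dependency checks (`check_database`, `check_redis`) that each service implements for its own backends. A minimal, generic sketch of that pattern, assuming a short timeout so a hung dependency can't stall the probe (the helper name and timeout value are illustrative, not part of the real template):

```python
# Generic dependency check for a readiness probe: run a probe coroutine
# with a timeout; any exception or timeout means "not ready".
import asyncio
from typing import Awaitable, Callable

async def check_dependency(probe: Callable[[], Awaitable[None]],
                           timeout: float = 1.0) -> bool:
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
        return True
    except Exception:
        return False

# Example probes: one responsive, one that would hang past the timeout.
async def fast_probe() -> None:
    await asyncio.sleep(0.01)   # stands in for "SELECT 1" or a Redis PING

async def slow_probe() -> None:
    await asyncio.sleep(5)      # stands in for a stalled connection

async def main() -> None:
    print(await check_dependency(fast_probe))            # True
    print(await check_dependency(slow_probe, 0.1))       # False

asyncio.run(main())
```

In the template, `check_database` would wrap a cheap query against the connection pool and `check_redis` a PING, each through this timeout guard.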
3. Standard Configuration
ConfigMap + Secrets pattern for all services:
# Standard ConfigMap structure
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Values.serviceName }}-config
data:
  # Application config
  LOG_LEVEL: "info"
  PORT: "8080"
  METRICS_PORT: "9090"
  # Feature flags
  FEATURE_NEW_AUTH: "true"
  FEATURE_BETA_UI: "false"
  # Dependencies
  DATABASE_URL: "postgresql://db.cluster.local:5432/{{ .Values.database }}"
  REDIS_URL: "redis://redis.cluster.local:6379/0"
  # Observability
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://tempo:4317"
  SENTRY_DSN: "{{ .Values.sentryDsn }}"
---
# Standard Secrets structure
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Values.serviceName }}-secrets
type: Opaque
data:
  # Database credentials (base64 encoded)
  DATABASE_PASSWORD: {{ .Values.databasePassword | b64enc }}
  # API keys
  STRIPE_API_KEY: {{ .Values.stripeApiKey | b64enc }}
  SENDGRID_API_KEY: {{ .Values.sendgridApiKey | b64enc }}
  # OAuth
  OAUTH_CLIENT_SECRET: {{ .Values.oauthClientSecret | b64enc }}
Environment parity:
# Dev, staging, and production all use the same chart structure
helm install service-a ./charts/service \
  -f values.dev.yaml
helm install service-a ./charts/service \
  -f values.staging.yaml
helm install service-a ./charts/service \
  -f values.production.yaml
4. Standard Service Interface
Contract: All services expose the same endpoints:
GET /health/live -> 200 (always)
GET /health/ready -> 200/503 (dependencies check)
GET /metrics -> Prometheus metrics
GET /docs -> OpenAPI/Swagger docs
POST / -> Business logic
Why this matters:
- Load balancers know where to check health
- Prometheus knows where to scrape
- API gateway can auto-generate docs
- Developers know what to expect
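A contract is only useful if it's checkable. One way to keep it honest is to express the expected endpoints as data and verify them mechanically in a smoke test. A hedged sketch, where the `CONTRACT` mapping and `check_contract` helper are illustrative (the status fetcher is injected so the same check works against a live service or a stub):

```python
# Contract check: every standardized service must answer these paths
# with one of the allowed status codes.
from typing import Callable

CONTRACT = {
    "/health/live": {200},        # must always answer
    "/health/ready": {200, 503},  # depends on dependency health
    "/metrics": {200},            # Prometheus scrape target
    "/docs": {200},               # OpenAPI/Swagger docs
}

def check_contract(get_status: Callable[[str], int]) -> list[str]:
    """Return a list of contract violations (empty list = compliant)."""
    failures = []
    for path, allowed in CONTRACT.items():
        status = get_status(path)
        if status not in allowed:
            failures.append(f"{path}: got {status}, want one of {sorted(allowed)}")
    return failures

# Simulated compliant and non-compliant services:
compliant = {"/health/live": 200, "/health/ready": 503, "/metrics": 200, "/docs": 200}
broken = dict(compliant, **{"/metrics": 404})
print(check_contract(lambda p: compliant[p]))  # []
print(check_contract(lambda p: broken[p]))     # ['/metrics: got 404, want one of [200]']
```

In CI, `get_status` would be a real HTTP GET against the freshly deployed service.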
The Platform Team: Enablers, Not Gatekeepers
Our platform team's job wasn't to control. It was to enable.
We provided:
- Templates: "Start here, customize if needed"
- Documentation: "How to do X the standard way"
- Tools: "CLI that generates boilerplate"
- Support: "We'll help you migrate"
We didn't enforce through:
- Approvals
- Tickets
- Process
- Bureaucracy
We enforced through:
- Making the Golden Path easiest
- Automated validation (CI/CD checks)
- Clear incentives ("Standard services get priority support")
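The "automated validation" lever can be as simple as a CI step that fails fast when a repo drifts from the scaffold. A sketch of the idea; the `REQUIRED` list is an assumption standing in for whatever the platform template actually generates:

```python
# Golden Path check for CI: fail the build when a service repo is
# missing standard scaffolding files.
from pathlib import Path

REQUIRED = [
    "Dockerfile",
    "Makefile",
    "charts/service/Chart.yaml",
    ".github/workflows/deploy.yml",
]

def validate_repo(root: str) -> list[str]:
    """Return the required paths missing under root (empty = compliant)."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]

if __name__ == "__main__":
    missing = validate_repo(".")
    if missing:
        raise SystemExit(f"Golden Path check failed, missing: {missing}")
```

Run as a required status check, this turns the standard from a suggestion into a gate without any human in the loop.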
The CLI: Making Standards Frictionless
# Create a new service
$ nowfloats create service payment-processor
✓ Creating service structure...
✓ Generating Dockerfile...
✓ Creating Kubernetes manifests...
✓ Setting up CI/CD pipeline...
✓ Configuring monitoring...
✓ Adding to service catalog...
Your service is ready!
Next steps:
1. cd payment-processor
2. Add your business logic to src/main.py
3. git push origin main
4. Your service will auto-deploy to staging
Documentation: https://docs.nowfloats.com/golden-path
What it generated:
payment-processor/
├── Dockerfile               # Standard multi-stage build
├── docker-compose.yml       # Local development
├── .github/
│   └── workflows/
│       ├── test.yml         # Run tests on PR
│       └── deploy.yml       # Deploy on merge
├── charts/
│   └── service/             # Helm chart
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── deployment.yaml
│           ├── service.yaml
│           └── ingress.yaml
├── src/
│   ├── main.py              # FastAPI app template
│   ├── models.py
│   └── config.py
├── tests/
│   └── test_main.py
├── README.md                # Auto-generated docs
└── Makefile                 # Standard commands
Result: From idea to deployed service in 15 minutes.
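There's no magic in a generator like this: it stamps the standard layout out of templates. A toy sketch of the core loop, where the `create_service` name and the placeholder file contents are illustrative, not the real CLI internals:

```python
# Toy service scaffolder: write a standard layout from a template map.
from pathlib import Path

SCAFFOLD = {
    "Dockerfile": "# Standard multi-stage build\n",
    "src/main.py": "# FastAPI app template\n",
    "tests/test_main.py": "# Starter tests\n",
    "Makefile": "deploy:\n\thelm upgrade --install $(SERVICE) ./charts/service\n",
}

def create_service(name: str, base: str = ".") -> Path:
    """Create the standard layout for a new service and return its root."""
    root = Path(base) / name
    for rel, content in SCAFFOLD.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return root
```

The real tool adds CI wiring, Helm charts, and catalog registration, but the principle is the same: the standard lives in one template map, so every new service starts identical.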
Real-World Results at NowFloats
After 6 months of standardization:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment methods | 47 | 1 | 98% ↓ |
| Deploy time | 2-4 hours | 10 minutes | 95% ↓ |
| Failed deploys | 30% | 2% | 93% ↓ |
| Onboarding time | 3 weeks | 1 day | 95% ↓ |
| Services created/month | 2-3 | 15-20 | 7x ↑ |
| Debugging time | 30 min/issue | 5 min/issue | 83% ↓ |
| Infrastructure cost | $45K/mo | $28K/mo | 38% ↓ |
Business impact:
- Released features 3x faster
- Reduced operations team overhead by 60%
- New developers productive on day 1
- Eliminated "tribal knowledge" entirely
When to Deviate from Standards
Standards aren't a prison. They're guidelines.
When to break the rules:
- Performance: "This service needs raw TCP, not HTTP"
- Regulation: "HIPAA requires dedicated infrastructure"
- Legacy: "Migration cost outweighs benefit"
- Innovation: "We're experimenting with new tech"
How to break the rules:
- Document why: "We're using X because Y"
- Get approval: Not to gatekeep, but to share knowledge
- Create a template: If it works, standardize it
- Revisit regularly: "Is this still necessary?"
Key Takeaways
- Standardization enables scale: You can't optimize chaos
- Make the right thing easy: Golden paths beat mandates
- Automate everything: Standards without automation are suggestions
- Measure the impact: Track time saved, errors reduced
- Stay flexible: Standards evolve. Don't calcify.
The hard truth: Every unique snowflake is technical debt. Every "we do it differently" is a scaling bottleneck.
You cannot optimize what you haven't defined. Standardization isn't about controlβit's about velocity.
About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform engineering and infrastructure standardization. He's built deployment platforms serving 100+ microservices at NowFloats and scaled GenAI systems at Okta.
Want to discuss platform strategy?