
Standardization: The Prerequisite to Scale

Platform Engineering • 6 min • December 20, 2023
Tags: Standards, Platform, Scale



The Chaos of 47 Ways to Deploy

When I joined NowFloats in 2018, we had a problem. A big one.

We had 47 different deployment methods across 120 microservices:

  • 12 services used Jenkins
  • 8 used GitLab CI
  • 15 used CircleCI
  • 6 used manual shell scripts
  • 3 used Ansible playbooks
  • 1 service (I kid you not) deployed via FTP
  • The rest? A mix of tribal knowledge and "ask Sarah"

Each service had its own:

  • Docker base image (23 different variations)
  • Logging format (JSON, text, custom, none)
  • Metrics endpoint (Prometheus, StatsD, custom, none)
  • Health check path (/health, /healthz, /status, /_health, none)
  • Configuration method (ENV vars, config files, hardcoded, CLI flags)

The result?

  • Deploy time: 2-4 hours per service
  • Failed deploys: 30% failure rate
  • Onboarding: 3 weeks to deploy first service
  • Debugging: "Which service is failing?" became a 30-minute investigation
  • Scaling: Impossible. Each service was a unique snowflake.

The lesson: You cannot optimize what you haven't defined. Standardization is the prerequisite to scale.


Why Standardization Matters (The Math)

Let's do some simple math:

Without Standardization

  • Number of services: 100
  • Unique deployment methods: 47
  • Time to learn each method: 2 hours
  • Total knowledge required: 94 hours (2+ weeks)
  • Bus factor risk: knowledge locked in 10+ people's heads
  • Onboarding time: 3 weeks minimum

With Standardization

  • Number of services: 100
  • Standard deployment methods: 1
  • Time to learn the method: 4 hours
  • Total knowledge required: 4 hours
  • Bus factor risk: gone. It's documented and automated.
  • Onboarding time: 1 day

Productivity multiplier: 23.5x

That's not theoretical. That's what we achieved at NowFloats.
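The arithmetic above is simple enough to check directly (numbers are the ones quoted in this post; the multiplier is just ramp-up hours before vs. after):

```python
# Knowledge cost of deployment methods, before and after standardization.

def ramp_up_hours(methods: int, hours_per_method: float) -> float:
    """Total hours a new engineer spends learning every deployment method."""
    return methods * hours_per_method

before = ramp_up_hours(methods=47, hours_per_method=2)  # 94 hours (2+ weeks)
after = ramp_up_hours(methods=1, hours_per_method=4)    # 4 hours

multiplier = before / after
print(f"before={before}h, after={after}h, multiplier={multiplier}x")  # 23.5x
```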


The Golden Path: Making the Right Thing the Easy Thing

The key insight: Don't force standardization. Make it irresistible.

We created what we called the "Golden Path": a paved road that makes doing the right thing easier than doing the wrong thing.

Before (Snowflake Deployment)

# service-a/deploy.sh
#!/bin/bash
ssh prod-server-1 "docker pull myregistry/service-a:${TAG}"
ssh prod-server-1 "docker stop service-a || true"
ssh prod-server-1 "docker run -d --name service-a \
-p 8080:8080 \
-e DB_HOST=prod-db \
-e LOG_LEVEL=info \
myregistry/service-a:${TAG}"
# Hope it works 🤞

Problems:

  • No rollback mechanism
  • No health checks
  • Hardcoded configuration
  • Manual intervention required
  • Can't scale horizontally
  • No metrics or monitoring

After (Golden Path)

# service-a/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
  labels:
    app: service-a
    version: v1.0.0
    team: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: service-a
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-a
        version: v1.0.0
    spec:
      containers:
        - name: service-a
          image: myregistry/service-a:{{ .Values.version }}

          # Standardized ports
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090

          # Standardized health checks
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

          # Standardized resources
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"

          # Standardized config
          envFrom:
            - configMapRef:
                name: service-a-config
            - secretRef:
                name: service-a-secrets

Deploy command:

# One command for all services
make deploy SERVICE=service-a ENV=production

What changed:

  • ✅ Automatic rollbacks on failure
  • ✅ Built-in health checks
  • ✅ Configuration management
  • ✅ Zero-downtime deployments
  • ✅ Horizontal scaling ready
  • ✅ Metrics automatically scraped
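One plausible shape for what `make deploy SERVICE=... ENV=...` wraps underneath. This is a sketch, not our exact Makefile; the shared chart path, the `TAG` convention, and the `DRY_RUN` switch are assumptions for illustration:

```shell
# Sketch of the golden-path deploy entry point.
# Assumes every service uses the shared Helm chart at ./charts/service.
# DRY_RUN=1 prints the helm command instead of running it.

deploy() {
    local service="$1" env="$2" tag="${TAG:-latest}"
    local cmd=(helm upgrade --install "$service" ./charts/service
               --set "image.tag=$tag"
               --namespace "$env"
               --wait --timeout 5m)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[@]}"    # preview the exact invocation without a cluster
    else
        "${cmd[@]}"
    fi
}

# Example: preview the deploy for service-a to production
DRY_RUN=1 TAG=abc123 deploy service-a production
```

Because every service shares one chart, the only variables are the service name, environment, and image tag; everything else is the standard.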

The Four Pillars of Standardization

1. Standard Deployment Pipeline

One pipeline template for all services:

# .github/workflows/deploy.yml (template)
name: Deploy Service

on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      environment:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # Step 0: Check out the service repository
      - uses: actions/checkout@v4

      # Step 1: Build
      - name: Build Docker image
        run: |
          docker build \
            -t ${{ inputs.service-name }}:${{ github.sha }} \
            -f Dockerfile \
            .

      # Step 2: Test
      - name: Run tests
        run: |
          docker run ${{ inputs.service-name }}:${{ github.sha }} \
            npm test

      # Step 3: Security scan
      - name: Security scan
        run: |
          trivy image ${{ inputs.service-name }}:${{ github.sha }}

      # Step 4: Push (assumes registry credentials are already configured on the runner)
      - name: Push to registry
        run: |
          docker push ${{ inputs.service-name }}:${{ github.sha }}

      # Step 5: Deploy
      - name: Deploy to Kubernetes
        run: |
          helm upgrade --install ${{ inputs.service-name }} \
            ./charts/service \
            --set image.tag=${{ github.sha }} \
            --namespace ${{ inputs.environment }}

      # Step 6: Verify
      - name: Smoke test
        run: |
          kubectl wait --for=condition=ready pod \
            -l app=${{ inputs.service-name }} \
            --timeout=300s

Usage in any service:

# service-a/.github/workflows/deploy-prod.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    # A local path works when the template lives in the same repo;
    # cross-repo calls use the form org/repo/.github/workflows/deploy.yml@main
    uses: ./.github/workflows/deploy.yml
    with:
      service-name: service-a
      environment: production

Result: Same deployment process for 100+ services.


2. Standard Observability

Every service gets the same observability stack:

# Standard Python service template
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import structlog

# Configure structured logging (standard)
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)
logger = structlog.get_logger()

# Configure metrics (standard)
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Configure tracing (standard)
tracer = trace.get_tracer(__name__)
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Standard health checks
@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    # Check dependencies (check_database / check_redis are service-specific)
    db_healthy = await check_database()
    cache_healthy = await check_redis()

    if not (db_healthy and cache_healthy):
        # FastAPI doesn't support Flask-style (body, status) tuples
        return JSONResponse(status_code=503, content={"status": "not ready"})

    return {"status": "ready"}

# Standard metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Standard request logging
@app.middleware("http")
async def log_requests(request, call_next):
    logger.info(
        "request_started",
        method=request.method,
        path=request.url.path,
        client=request.client.host
    )

    with http_latency.labels(
        method=request.method,
        endpoint=request.url.path
    ).time():
        response = await call_next(request)

    http_requests.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status=response.status_code
    )

    return response

What this gives you:

  • Consistent log format across all services
  • Automatic Prometheus metrics
  • Distributed tracing out of the box
  • Standardized health check paths
  • Easy debugging ("All logs are JSON")
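Concretely, every request in every service produces a log line of the same shape. The field values below are illustrative; the exact keys depend on the structlog processors configured above:

```json
{"timestamp": "2023-12-20T10:15:30Z", "event": "request_completed", "method": "GET", "path": "/api/orders", "status": 200}
```

Once every service emits this shape, a single log query works across the entire fleet.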

3. Standard Configuration

ConfigMap + Secrets pattern for all services:

# Standard ConfigMap structure
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Values.serviceName }}-config
data:
  # Application config
  LOG_LEVEL: "info"
  PORT: "8080"
  METRICS_PORT: "9090"

  # Feature flags
  FEATURE_NEW_AUTH: "true"
  FEATURE_BETA_UI: "false"

  # Dependencies
  DATABASE_URL: "postgresql://db.cluster.local:5432/{{ .Values.database }}"
  REDIS_URL: "redis://redis.cluster.local:6379/0"

  # Observability
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://tempo:4317"
  SENTRY_DSN: "{{ .Values.sentryDsn }}"
---
# Standard Secrets structure
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Values.serviceName }}-secrets
type: Opaque
data:
  # Database credentials (base64 encoded)
  DATABASE_PASSWORD: {{ .Values.databasePassword | b64enc }}

  # API keys
  STRIPE_API_KEY: {{ .Values.stripeApiKey | b64enc }}
  SENDGRID_API_KEY: {{ .Values.sendgridApiKey | b64enc }}

  # OAuth
  OAUTH_CLIENT_SECRET: {{ .Values.oauthClientSecret | b64enc }}

Environment parity:

# Dev, staging, production all use the same structure
helm install service-a ./charts/service \
  -f values.dev.yaml

helm install service-a ./charts/service \
  -f values.staging.yaml

helm install service-a ./charts/service \
  -f values.production.yaml
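The per-environment values files differ only in values, never in structure. A sketch (key names follow the chart templates above; the specific values are illustrative):

```yaml
# values.dev.yaml
replicas: 1
version: latest
database: appdb_dev
sentryDsn: ""    # no error reporting in dev

# values.production.yaml
replicas: 3
version: v1.0.0
database: appdb
sentryDsn: "https://key@sentry.example.com/1"
```

If a key exists in one environment's file, it exists in all of them; diffs between environments stay one `diff values.*.yaml` away.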

4. Standard Service Interface

Contract: All services expose the same endpoints:

GET  /health/live      -> 200 (always)
GET  /health/ready     -> 200/503 (dependencies check)
GET  /metrics          -> Prometheus metrics
GET  /docs             -> OpenAPI/Swagger docs
POST /                 -> Business logic

Why this matters:

  • Load balancers know where to check health
  • Prometheus knows where to scrape
  • API gateway can auto-generate docs
  • Developers know what to expect
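The contract is also easy to enforce in CI. A minimal sketch in pure Python; how you obtain a running service's routes is framework-specific (with FastAPI, for instance, something like `{r.path for r in app.routes}`):

```python
# Sketch: validate that a service exposes the standard interface.
# REQUIRED_ROUTES mirrors the contract above.

REQUIRED_ROUTES = {
    "/health/live",
    "/health/ready",
    "/metrics",
    "/docs",
}

def missing_routes(exposed: set[str]) -> set[str]:
    """Return the required routes this service does not expose."""
    return REQUIRED_ROUTES - exposed

# A compliant service: all standard routes plus its business endpoints
service_routes = {"/health/live", "/health/ready", "/metrics", "/docs", "/orders"}
print(missing_routes(service_routes))       # set() -> compliant

# A snowflake service using its own health path
bad_service = {"/healthz", "/metrics"}
print(sorted(missing_routes(bad_service)))  # ['/docs', '/health/live', '/health/ready']
```

Fail the build when the returned set is non-empty and the contract stops being a suggestion.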

The Platform Team: Enablers, Not Gatekeepers

Our platform team's job wasn't to control. It was to enable.

We provided:

  1. Templates: "Start here, customize if needed"
  2. Documentation: "How to do X the standard way"
  3. Tools: "CLI that generates boilerplate"
  4. Support: "We'll help you migrate"

We didn't enforce through:

  • Approvals
  • Tickets
  • Process
  • Bureaucracy

We enforced through:

  • Making the Golden Path easiest
  • Automated validation (CI/CD checks)
  • Clear incentives ("Standard services get priority support")
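The automated validation can be as small as a CI script that rejects manifests missing the golden-path basics. A sketch (the required-field list is illustrative; in practice you'd parse rendered `helm template` output with a YAML loader):

```python
# Sketch: flag Deployment manifests that skip the golden-path basics.
# `manifest` is the parsed YAML of a Deployment (a plain dict here).

def validate_deployment(manifest: dict) -> list[str]:
    """Return a list of golden-path violations (empty means compliant)."""
    violations = []
    container = manifest["spec"]["template"]["spec"]["containers"][0]

    if "livenessProbe" not in container:
        violations.append("missing livenessProbe")
    if "readinessProbe" not in container:
        violations.append("missing readinessProbe")
    if "resources" not in container:
        violations.append("missing resource requests/limits")
    if "team" not in manifest.get("metadata", {}).get("labels", {}):
        violations.append("missing team label")
    return violations

manifest = {
    "metadata": {"labels": {"app": "service-a"}},
    "spec": {"template": {"spec": {"containers": [
        {"name": "service-a", "livenessProbe": {}, "readinessProbe": {}}
    ]}}},
}
print(validate_deployment(manifest))
# ['missing resource requests/limits', 'missing team label']
```

A non-empty list fails the pipeline with an actionable message, which teaches the standard instead of gatekeeping it.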

The CLI: Making Standards Frictionless

# Create a new service
$ nowfloats create service payment-processor

✓ Creating service structure...
✓ Generating Dockerfile...
✓ Creating Kubernetes manifests...
✓ Setting up CI/CD pipeline...
✓ Configuring monitoring...
✓ Adding to service catalog...

Your service is ready!

Next steps:
  1. cd payment-processor
  2. Add your business logic to src/main.py
  3. git push origin main
  4. Your service will auto-deploy to staging

Documentation: https://docs.nowfloats.com/golden-path

What it generated:

payment-processor/
├── Dockerfile                 # Standard multi-stage build
├── docker-compose.yml         # Local development
├── .github/
│   └── workflows/
│       ├── test.yml           # Run tests on PR
│       └── deploy.yml         # Deploy on merge
├── charts/
│   └── service/               # Helm chart
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── deployment.yaml
│           ├── service.yaml
│           └── ingress.yaml
├── src/
│   ├── main.py                # FastAPI app template
│   ├── models.py
│   └── config.py
├── tests/
│   └── test_main.py
├── README.md                  # Auto-generated docs
└── Makefile                   # Standard commands

Result: From idea to deployed service in 15 minutes.


Real-World Results at NowFloats

After 6 months of standardization:

Metric                    Before          After           Improvement
Deployment methods        47              1               98% ↓
Deploy time               2-4 hours       10 minutes      95% ↓
Failed deploys            30%             2%              93% ↓
Onboarding time           3 weeks         1 day           95% ↓
Services created/month    2-3             15-20           7x ↑
Debugging time            30 min/issue    5 min/issue     83% ↓
Infrastructure cost       $45K/mo         $28K/mo         38% ↓

Business impact:

  • Released features 3x faster
  • Reduced operations team overhead by 60%
  • New developers productive on day 1
  • Eliminated "tribal knowledge" entirely

When to Deviate from Standards

Standards aren't a prison. They're guidelines.

When to break the rules:

  1. Performance: "This service needs raw TCP, not HTTP"
  2. Regulation: "HIPAA requires dedicated infrastructure"
  3. Legacy: "Migration cost outweighs benefit"
  4. Innovation: "We're experimenting with new tech"

How to break the rules:

  1. Document why: "We're using X because Y"
  2. Get approval: Not to gatekeep, but to share knowledge
  3. Create a template: If it works, standardize it
  4. Revisit regularly: "Is this still necessary?"

Key Takeaways

  1. Standardization enables scale: You can't optimize chaos
  2. Make the right thing easy: Golden paths beat mandates
  3. Automate everything: Standards without automation are suggestions
  4. Measure the impact: Track time saved, errors reduced
  5. Stay flexible: Standards evolve. Don't calcify.

The hard truth: Every unique snowflake is technical debt. Every "we do it differently" is a scaling bottleneck.

You cannot optimize what you haven't defined. Standardization isn't about controlβ€”it's about velocity.


About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform engineering and infrastructure standardization. He's built deployment platforms serving 100+ microservices at NowFloats and scaled GenAI systems at Okta.

Want to discuss platform strategy?


I work with 3-4 high-stakes organizations per year on platform architecture and infrastructure strategy.

Schedule Strategy Session