
Standardization: The Prerequisite to Scale

Platform Engineering • 6 min • December 20, 2023
Tags: Standards, Platform, Scale



The Chaos of 47 Ways to Deploy

When I joined NowFloats in 2018, we had a problem. A big one.

We had 47 different deployment methods across 120 microservices:

  • 12 services used Jenkins
  • 8 used GitLab CI
  • 15 used CircleCI
  • 6 used manual shell scripts
  • 3 used Ansible playbooks
  • 1 service (I kid you not) deployed via FTP
  • The rest? A mix of tribal knowledge and "ask Sarah"

Each service had its own:

  • Docker base image (23 different variations)
  • Logging format (JSON, text, custom, none)
  • Metrics endpoint (Prometheus, StatsD, custom, none)
  • Health check path (/health, /healthz, /status, /_health, none)
  • Configuration method (ENV vars, config files, hardcoded, CLI flags)

The result?

  • Deploy time: 2-4 hours per service
  • Failed deploys: 30% failure rate
  • Onboarding: 3 weeks to deploy first service
  • Debugging: "Which service is failing?" became a 30-minute investigation
  • Scaling: Impossible. Each service was a unique snowflake.

The lesson: You cannot optimize what you haven't defined. Standardization is the prerequisite to scale.


Why Standardization Matters (The Math)

Let's do some simple math:

Without Standardization

  • Number of services: 100
  • Unique deployment methods: 47
  • Time to learn each method: 2 hours
  • Total knowledge required: 94 hours (2+ weeks)
  • Bus factor risk: knowledge locked in 10+ people's heads
  • Onboarding time: 3 weeks minimum

With Standardization

  • Number of services: 100
  • Standard deployment methods: 1
  • Time to learn the method: 4 hours
  • Total knowledge required: 4 hours
  • Bus factor risk: gone. It's documented and automated.
  • Onboarding time: 1 day

Productivity multiplier: 23.5x

That's not theoretical. That's what we achieved at NowFloats.
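The arithmetic above is simple enough to check directly (numbers are the ones quoted in this post; the multiplier is just ramp-up hours before vs. after):

```python
# Knowledge cost of deployment methods, before and after standardization.

def ramp_up_hours(methods: int, hours_per_method: float) -> float:
    """Total hours a new engineer spends learning every deployment method."""
    return methods * hours_per_method

before = ramp_up_hours(methods=47, hours_per_method=2)  # 94 hours (2+ weeks)
after = ramp_up_hours(methods=1, hours_per_method=4)    # 4 hours

multiplier = before / after
print(f"before={before}h, after={after}h, multiplier={multiplier}x")  # 23.5x
```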


The Golden Path: Making the Right Thing the Easy Thing

The key insight: Don't force standardization. Make it irresistible.

We created what we called the "Golden Path": a paved road that makes doing the right thing easier than doing the wrong thing.

Before (Snowflake Deployment)

# service-a/deploy.sh
#!/bin/bash
ssh prod-server-1 "docker pull myregistry/service-a:${TAG}"
ssh prod-server-1 "docker stop service-a || true"
ssh prod-server-1 "docker run -d --name service-a \
-p 8080:8080 \
-e DB_HOST=prod-db \
-e LOG_LEVEL=info \
myregistry/service-a:${TAG}"
# Hope it works 🤞

Problems:

  • No rollback mechanism
  • No health checks
  • Hardcoded configuration
  • Manual intervention required
  • Can't scale horizontally
  • No metrics or monitoring

After (Golden Path)

# service-a/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
  labels:
    app: service-a
    version: v1.0.0
    team: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: service-a
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: service-a
        version: v1.0.0
    spec:
      containers:
        - name: service-a
          image: myregistry/service-a:{{ .Values.version }}

          # Standardized ports
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090

          # Standardized health checks
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

          # Standardized resources
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"

          # Standardized config
          envFrom:
            - configMapRef:
                name: service-a-config
            - secretRef:
                name: service-a-secrets

Deploy command:

# One command for all services
make deploy SERVICE=service-a ENV=production

What changed:

  • ✅ Automatic rollbacks on failure
  • ✅ Built-in health checks
  • ✅ Configuration management
  • ✅ Zero-downtime deployments
  • ✅ Horizontal scaling ready
  • ✅ Metrics automatically scraped
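One plausible shape for what `make deploy SERVICE=... ENV=...` wraps underneath. This is a sketch, not our exact Makefile; the shared chart path, the `TAG` convention, and the `DRY_RUN` switch are assumptions for illustration:

```shell
# Sketch of the golden-path deploy entry point.
# Assumes every service uses the shared Helm chart at ./charts/service.
# DRY_RUN=1 prints the helm command instead of running it.

deploy() {
    local service="$1" env="$2" tag="${TAG:-latest}"
    local cmd=(helm upgrade --install "$service" ./charts/service
               --set "image.tag=$tag"
               --namespace "$env"
               --wait --timeout 5m)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[@]}"    # preview the exact invocation without a cluster
    else
        "${cmd[@]}"
    fi
}

# Example: preview the deploy for service-a to production
DRY_RUN=1 TAG=abc123 deploy service-a production
```

Because every service shares one chart, the only variables are the service name, environment, and image tag; everything else is the standard.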

The Four Pillars of Standardization

1. Standard Deployment Pipeline

One pipeline template for all services:

# .github/workflows/deploy.yml (template)
name: Deploy Service

on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      environment:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # Step 0: Check out the service repository
      - uses: actions/checkout@v4

      # Step 1: Build
      - name: Build Docker image
        run: |
          docker build \
            -t ${{ inputs.service-name }}:${{ github.sha }} \
            -f Dockerfile \
            .

      # Step 2: Test
      - name: Run tests
        run: |
          docker run ${{ inputs.service-name }}:${{ github.sha }} \
            npm test

      # Step 3: Security scan
      - name: Security scan
        run: |
          trivy image ${{ inputs.service-name }}:${{ github.sha }}

      # Step 4: Push (assumes registry credentials are already configured on the runner)
      - name: Push to registry
        run: |
          docker push ${{ inputs.service-name }}:${{ github.sha }}

      # Step 5: Deploy
      - name: Deploy to Kubernetes
        run: |
          helm upgrade --install ${{ inputs.service-name }} \
            ./charts/service \
            --set image.tag=${{ github.sha }} \
            --namespace ${{ inputs.environment }}

      # Step 6: Verify
      - name: Smoke test
        run: |
          kubectl wait --for=condition=ready pod \
            -l app=${{ inputs.service-name }} \
            --timeout=300s

Usage in any service:

# service-a/.github/workflows/deploy-prod.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    # A local path works when the template lives in the same repo;
    # cross-repo calls use the form org/repo/.github/workflows/deploy.yml@main
    uses: ./.github/workflows/deploy.yml
    with:
      service-name: service-a
      environment: production

Result: Same deployment process for 100+ services.


2. Standard Observability

Every service gets the same observability stack:

# Standard Python service template
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
import structlog

# Configure structured logging (standard)
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)
logger = structlog.get_logger()

# Configure metrics (standard)
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Configure tracing (standard)
tracer = trace.get_tracer(__name__)
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

# Standard health checks
@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

from fastapi.responses import JSONResponse

@app.get("/health/ready")
async def readiness():
    # Check dependencies (check_database / check_redis are service-specific)
    db_healthy = await check_database()
    cache_healthy = await check_redis()

    if not (db_healthy and cache_healthy):
        # FastAPI doesn't support Flask-style (body, status) tuples
        return JSONResponse(status_code=503, content={"status": "not ready"})

    return {"status": "ready"}

# Standard metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Standard request logging
@app.middleware("http")
async def log_requests(request, call_next):
    logger.info(
        "request_started",
        method=request.method,
        path=request.url.path,
        client=request.client.host
    )

    with http_latency.labels(
        method=request.method,
        endpoint=request.url.path
    ).time():
        response = await call_next(request)

    http_requests.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status=response.status_code
    )

    return response

What this gives you:

  • Consistent log format across all services
  • Automatic Prometheus metrics
  • Distributed tracing out of the box
  • Standardized health check paths
  • Easy debugging ("All logs are JSON")
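Concretely, every request in every service produces a log line of the same shape. The field values below are illustrative; the exact keys depend on the structlog processors configured above:

```json
{"timestamp": "2023-12-20T10:15:30Z", "event": "request_completed", "method": "GET", "path": "/api/orders", "status": 200}
```

Once every service emits this shape, a single log query works across the entire fleet.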

3. Standard Configuration

ConfigMap + Secrets pattern for all services:

# Standard ConfigMap structure
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Values.serviceName }}-config
data:
  # Application config
  LOG_LEVEL: "info"
  PORT: "8080"
  METRICS_PORT: "9090"

  # Feature flags
  FEATURE_NEW_AUTH: "true"
  FEATURE_BETA_UI: "false"

  # Dependencies
  DATABASE_URL: "postgresql://db.cluster.local:5432/{{ .Values.database }}"
  REDIS_URL: "redis://redis.cluster.local:6379/0"

  # Observability
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://tempo:4317"
  SENTRY_DSN: "{{ .Values.sentryDsn }}"
---
# Standard Secrets structure
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Values.serviceName }}-secrets
type: Opaque
data:
  # Database credentials (base64 encoded)
  DATABASE_PASSWORD: {{ .Values.databasePassword | b64enc }}

  # API keys
  STRIPE_API_KEY: {{ .Values.stripeApiKey | b64enc }}
  SENDGRID_API_KEY: {{ .Values.sendgridApiKey | b64enc }}

  # OAuth
  OAUTH_CLIENT_SECRET: {{ .Values.oauthClientSecret | b64enc }}

Environment parity:

# Dev, staging, production all use the same structure
helm install service-a ./charts/service \
  -f values.dev.yaml

helm install service-a ./charts/service \
  -f values.staging.yaml

helm install service-a ./charts/service \
  -f values.production.yaml
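The per-environment values files differ only in values, never in structure. A sketch (key names follow the chart templates above; the specific values are illustrative):

```yaml
# values.dev.yaml
replicas: 1
version: latest
database: appdb_dev
sentryDsn: ""    # no error reporting in dev

# values.production.yaml
replicas: 3
version: v1.0.0
database: appdb
sentryDsn: "https://key@sentry.example.com/1"
```

If a key exists in one environment's file, it exists in all of them; diffs between environments stay one `diff values.*.yaml` away.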

4. Standard Service Interface

Contract: All services expose the same endpoints:

GET  /health/live      -> 200 (always)
GET  /health/ready     -> 200/503 (dependencies check)
GET  /metrics          -> Prometheus metrics
GET  /docs             -> OpenAPI/Swagger docs
POST /                 -> Business logic

Why this matters:

  • Load balancers know where to check health
  • Prometheus knows where to scrape
  • API gateway can auto-generate docs
  • Developers know what to expect
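The contract is also easy to enforce in CI. A minimal sketch in pure Python; how you obtain a running service's routes is framework-specific (with FastAPI, for instance, something like `{r.path for r in app.routes}`):

```python
# Sketch: validate that a service exposes the standard interface.
# REQUIRED_ROUTES mirrors the contract above.

REQUIRED_ROUTES = {
    "/health/live",
    "/health/ready",
    "/metrics",
    "/docs",
}

def missing_routes(exposed: set[str]) -> set[str]:
    """Return the required routes this service does not expose."""
    return REQUIRED_ROUTES - exposed

# A compliant service: all standard routes plus its business endpoints
service_routes = {"/health/live", "/health/ready", "/metrics", "/docs", "/orders"}
print(missing_routes(service_routes))       # set() -> compliant

# A snowflake service using its own health path
bad_service = {"/healthz", "/metrics"}
print(sorted(missing_routes(bad_service)))  # ['/docs', '/health/live', '/health/ready']
```

Fail the build when the returned set is non-empty and the contract stops being a suggestion.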

The Platform Team: Enablers, Not Gatekeepers

Our platform team's job wasn't to control. It was to enable.

We provided:

  1. Templates: "Start here, customize if needed"
  2. Documentation: "How to do X the standard way"
  3. Tools: "CLI that generates boilerplate"
  4. Support: "We'll help you migrate"

We didn't enforce through:

  • Approvals
  • Tickets
  • Process
  • Bureaucracy

We enforced through:

  • Making the Golden Path easiest
  • Automated validation (CI/CD checks)
  • Clear incentives ("Standard services get priority support")
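The automated validation can be as small as a CI script that rejects manifests missing the golden-path basics. A sketch (the required-field list is illustrative; in practice you'd parse rendered `helm template` output with a YAML loader):

```python
# Sketch: flag Deployment manifests that skip the golden-path basics.
# `manifest` is the parsed YAML of a Deployment (a plain dict here).

def validate_deployment(manifest: dict) -> list[str]:
    """Return a list of golden-path violations (empty means compliant)."""
    violations = []
    container = manifest["spec"]["template"]["spec"]["containers"][0]

    if "livenessProbe" not in container:
        violations.append("missing livenessProbe")
    if "readinessProbe" not in container:
        violations.append("missing readinessProbe")
    if "resources" not in container:
        violations.append("missing resource requests/limits")
    if "team" not in manifest.get("metadata", {}).get("labels", {}):
        violations.append("missing team label")
    return violations

manifest = {
    "metadata": {"labels": {"app": "service-a"}},
    "spec": {"template": {"spec": {"containers": [
        {"name": "service-a", "livenessProbe": {}, "readinessProbe": {}}
    ]}}},
}
print(validate_deployment(manifest))
# ['missing resource requests/limits', 'missing team label']
```

A non-empty list fails the pipeline with an actionable message, which teaches the standard instead of gatekeeping it.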

The CLI: Making Standards Frictionless

# Create a new service
$ nowfloats create service payment-processor

✓ Creating service structure...
✓ Generating Dockerfile...
✓ Creating Kubernetes manifests...
✓ Setting up CI/CD pipeline...
✓ Configuring monitoring...
✓ Adding to service catalog...

Your service is ready!

Next steps:
  1. cd payment-processor
  2. Add your business logic to src/main.py
  3. git push origin main
  4. Your service will auto-deploy to staging

Documentation: https://docs.nowfloats.com/golden-path

What it generated:

payment-processor/
├── Dockerfile                 # Standard multi-stage build
├── docker-compose.yml         # Local development
├── .github/
│   └── workflows/
│       ├── test.yml           # Run tests on PR
│       └── deploy.yml         # Deploy on merge
├── charts/
│   └── service/               # Helm chart
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── deployment.yaml
│           ├── service.yaml
│           └── ingress.yaml
├── src/
│   ├── main.py                # FastAPI app template
│   ├── models.py
│   └── config.py
├── tests/
│   └── test_main.py
├── README.md                  # Auto-generated docs
└── Makefile                   # Standard commands

Result: From idea to deployed service in 15 minutes.


Real-World Results at NowFloats

After 6 months of standardization:

Metric                    Before          After           Improvement
Deployment methods        47              1               98% ↓
Deploy time               2-4 hours       10 minutes      95% ↓
Failed deploys            30%             2%              93% ↓
Onboarding time           3 weeks         1 day           95% ↓
Services created/month    2-3             15-20           7x ↑
Debugging time            30 min/issue    5 min/issue     83% ↓
Infrastructure cost       $45K/mo         $28K/mo         38% ↓

Business impact:

  • Released features 3x faster
  • Reduced operations team overhead by 60%
  • New developers productive on day 1
  • Eliminated "tribal knowledge" entirely

When to Deviate from Standards

Standards aren't a prison. They're guidelines.

When to break the rules:

  1. Performance: "This service needs raw TCP, not HTTP"
  2. Regulation: "HIPAA requires dedicated infrastructure"
  3. Legacy: "Migration cost outweighs benefit"
  4. Innovation: "We're experimenting with new tech"

How to break the rules:

  1. Document why: "We're using X because Y"
  2. Get approval: Not to gatekeep, but to share knowledge
  3. Create a template: If it works, standardize it
  4. Revisit regularly: "Is this still necessary?"

Key Takeaways

  1. Standardization enables scale: You can't optimize chaos
  2. Make the right thing easy: Golden paths beat mandates
  3. Automate everything: Standards without automation are suggestions
  4. Measure the impact: Track time saved, errors reduced
  5. Stay flexible: Standards evolve. Don't calcify.

The hard truth: Every unique snowflake is technical debt. Every "we do it differently" is a scaling bottleneck.

You cannot optimize what you haven't defined. Standardization isn't about controlβ€”it's about velocity.


About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform engineering and infrastructure standardization. He's built deployment platforms serving 100+ microservices at NowFloats and scaled GenAI systems at Okta.

Want to discuss platform strategy?


I work with 3-4 high-stakes organizations per year on platform architecture and infrastructure strategy.

Schedule Strategy Session