Automation and the Provenance of State

Category: Platform Engineering
Reading Time: 7 minutes
Published: January 5, 2024
Tags: Automation, IaC, DevOps


The 3 AM Phone Call That Changed Everything

It was 3:17 AM when my phone rang. Production was down. Hard.

"Did you change anything?" I asked the on-call engineer.

"No, nothing. I just restarted the database pod because it was showing high memory."

That single kubectl command—executed manually, with good intentions—cascaded into a 4-hour outage affecting 2 million users.

Why? Because that database pod had specific resource limits set manually two weeks ago to handle a spike. No one documented it. No one remembered it. And when it was recreated, it got the default (insufficient) limits.

The lesson I learned that night: Automation isn't just about speed. It's about the provenance of state.


What is Provenance of State?

Provenance means knowing the complete history of how your infrastructure reached its current state:

  • What changed?
  • When did it change?
  • Who changed it?
  • Why was it changed?
  • How can we reproduce it?

Without provenance, you're flying blind. You have a system in a certain state, but no idea how it got there or how to recreate it.
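One way to make these five questions concrete is to treat every change as a structured record. A minimal sketch in Python (the field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass, asdict


@dataclass
class ChangeRecord:
    """Answers the five provenance questions for one infrastructure change."""
    what: str  # what changed
    when: str  # ISO-8601 timestamp
    who: str   # author
    why: str   # reason / ticket reference
    how: str   # command that reproduces the change


record = ChangeRecord(
    what="scaled api-server to 10 replicas",
    when="2024-01-15T09:30:00Z",
    who="devesh@company.com",
    why="Black Friday traffic (JIRA-1234)",
    how="git checkout abc123 && kubectl apply -f .",
)

print(asdict(record))
```

A Git commit carries exactly these fields for free: the diff (what), the timestamp (when), the author (who), the message (why), and the SHA (how).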

Example: Two Kubernetes Clusters

Cluster A (Manual Configuration):

# History lost to time
$ kubectl scale deployment api-server --replicas=10
$ kubectl set resources deployment api-server --limits=cpu=2,memory=4Gi
$ kubectl patch deployment api-server -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"highcpu"}}}}}'

Questions you can't answer:

  • Why 10 replicas instead of 5 or 20?
  • Who set those resource limits and when?
  • Why is it pinned to highcpu nodes?
  • Can we recreate this in staging?

Cluster B (Infrastructure as Code):

# Git commit: abc123
# Author: devesh@company.com
# Date: 2024-01-15
# Message: "Scale api-server for Black Friday traffic (JIRA-1234)"

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  annotations:
    reason: "Black Friday traffic spike"
    jira-ticket: "JIRA-1234"
spec:
  replicas: 10 # Increased from 5 to handle 2x traffic
  template:
    spec:
      nodeSelector:
        node-type: highcpu # CPU-bound workload
      containers:
        - name: api
          resources:
            limits:
              cpu: "2" # Profiled: 1.5 CPU under load
              memory: "4Gi" # 3.2Gi average + 20% buffer

Now you know:

  • Why: Black Friday traffic (JIRA-1234)
  • When: Jan 15, 2024
  • Who: devesh@company.com
  • How to reproduce: git checkout abc123; kubectl apply -f .

Provenance = Trust = Reliability


The Architecture Has Failed

Here's a controversial statement:

If a human has to touch production, the architecture has failed.

Not because humans are bad (we're not). But because:

  1. Humans don't scale: You can't hire fast enough
  2. Knowledge is tribal: "Sarah knows how to fix this"
  3. Context is lost: "I think someone changed this last month?"
  4. Reproducibility is impossible: "It works on my machine"
  5. Auditing is a nightmare: "Who changed the firewall rules?"

Case Study: The Lookout Deployment Pipeline

When I joined Lookout, deploys were manual:

Manual Process (2019):

  1. SSH into bastion host
  2. Pull latest code
  3. Run migration scripts (hope they're idempotent)
  4. Restart services in the "right" order
  5. Check logs manually
  6. Hope nothing breaks

Time: 2-3 hours
Error rate: 15%
Deploys per week: 2-3
Rollback time: 1-2 hours
Knowledge holders: 3 people

Automated Process (2020):

# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Run database migrations
        run: |
          kubectl exec db-pod -- \
            /app/migrations/run.sh \
            --version=$(git rev-parse HEAD)

      - name: Deploy to Kubernetes
        run: |
          # GitOps: ArgoCD picks up changes automatically
          git tag deploy-prod-$(date +%s)
          git push --tags

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/api-server
          kubectl rollout status deployment/worker-pool

      - name: Run smoke tests
        run: |
          ./tests/smoke-test.sh

      - name: Notify Slack
        if: always()
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -d '{"text": "Deploy ${{ job.status }}: ${{ github.sha }}"}'

Time: 8 minutes
Error rate: 0.3%
Deploys per week: 20-30
Rollback time: 2 minutes (GitOps revert)
Knowledge holders: Codified in repository


Building for Provenance: The Layers

Layer 1: Infrastructure as Code (IaC)

Every piece of infrastructure must be defined in code.

Terraform Example:

# terraform/production/main.tf

# VPC with full provenance
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name        = "production-vpc"
    Environment = "production"
    ManagedBy   = "terraform"
    GitCommit   = var.git_commit
    LastChanged = timestamp() # Caution: produces a tag diff on every plan
    Owner       = "platform-team"
    CostCenter  = "engineering"
  }
}

# EKS cluster
resource "aws_eks_cluster" "main" {
  name     = "production-eks"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.28"  # Pinned version for reproducibility

  vpc_config {
    subnet_ids = aws_subnet.private[*].id

    # Security: Private endpoint only
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  # Enable control plane logging
  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler"
  ]

  tags = {
    GitCommit = var.git_commit
    Terraform = "true"
  }
}

Result: Complete infrastructure history in Git. Can recreate entire environment from commit SHA.
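How var.git_commit gets populated is an implementation detail; one sketch is a thin wrapper that reads the current SHA and passes it as a -var flag. This only constructs the commands (running them requires the terraform and git CLIs):

```python
import subprocess


def terraform_plan_command(module_dir: str, git_commit: str) -> list[str]:
    """Build a `terraform plan` invocation that stamps the commit into tags."""
    return [
        "terraform",
        f"-chdir={module_dir}",
        "plan",
        f"-var=git_commit={git_commit}",
    ]


def current_commit() -> str:
    """Read the current Git commit SHA (requires running inside a checkout)."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()


cmd = terraform_plan_command("terraform/production", "abc123")
print(" ".join(cmd))
```

In CI the same value usually comes from the runner's environment (e.g. the commit that triggered the pipeline) rather than a subprocess call.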


Layer 2: Configuration as Code

Don't just automate deployment. Automate configuration.

Bad (Manual ConfigMap):

kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=debug \
  --from-literal=MAX_CONNECTIONS=100

Good (Versioned in Git):

# k8s/configmaps/app-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  annotations:
    config-version: "2.3.0"
    last-updated: "2024-01-15"
    updated-by: "devesh@company.com"
data:
  LOG_LEVEL: "info" # Changed from debug (JIRA-5678)
  MAX_CONNECTIONS: "500" # Increased for peak load
  CACHE_TTL: "3600"
  FEATURE_FLAGS: |
    new_auth_flow: true
    beta_dashboard: false
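Because the ConfigMap lives in Git, every change can be diffed mechanically and surfaced in a PR. A minimal sketch of such a diff (pure Python; the dicts stand in for the parsed `data:` sections of two versions):

```python
def diff_config(old: dict, new: dict) -> dict:
    """Report added, removed, and changed keys between two config versions."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(
            k for k in set(old) & set(new) if old[k] != new[k]
        ),
    }


v2_2 = {"LOG_LEVEL": "debug", "MAX_CONNECTIONS": "100"}
v2_3 = {"LOG_LEVEL": "info", "MAX_CONNECTIONS": "500", "CACHE_TTL": "3600"}

print(diff_config(v2_2, v2_3))
```

The output is exactly what a reviewer needs to see in the PR: which keys appeared, which disappeared, and which values moved.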

Layer 3: Declarative Rollouts

Use GitOps to ensure desired state = actual state.

ArgoCD Application:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: production

  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/api-server

  destination:
    server: https://kubernetes.default.svc
    namespace: production

  syncPolicy:
    automated:
      prune: true # Delete removed resources
      selfHeal: true # Fix manual changes

    syncOptions:
      - CreateNamespace=true

  # Rollback on failure
  revisionHistoryLimit: 10

What this gives you:

  • Automatic drift detection: Manual changes are reverted
  • Audit trail: Every change is a Git commit
  • Easy rollback: git revert abc123
  • Preview changes: PR diffs show infrastructure changes

Layer 4: Policy as Code

Prevent bad changes before they happen.

Open Policy Agent (OPA) Example:

# policy/kubernetes.rego

# Deny deployments without resource limits
deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.spec.containers[_].resources.limits
  msg := "Deployments must specify resource limits"
}

# Deny privileged containers
deny[msg] {
  input.kind == "Pod"
  input.spec.containers[_].securityContext.privileged == true
  msg := "Privileged containers are not allowed"
}

# Require labels
deny[msg] {
  input.kind == "Deployment"
  not input.metadata.labels.team
  msg := "Deployments must have 'team' label"
}

# Require cost center tag
deny[msg] {
  input.kind == "Deployment"
  not input.metadata.labels["cost-center"]
  msg := "Deployments must have 'cost-center' label"
}

Enforcement:

#!/bin/bash
# Pre-commit hook: validate manifests against policy before allowing the commit
for file in k8s/*.yaml; do
  conftest test "$file" --policy policy/ || exit 1
done
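The same rules can also be unit-tested in plain Python before a manifest ever reaches conftest. A sketch mirroring the deny rules above (the manifest is a parsed-YAML dict; field paths follow the Kubernetes Deployment schema):

```python
def policy_violations(manifest: dict) -> list[str]:
    """Return deny messages for a Deployment, mirroring the OPA rules."""
    msgs = []
    if manifest.get("kind") != "Deployment":
        return msgs

    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    if any(not c.get("resources", {}).get("limits") for c in containers):
        msgs.append("Deployments must specify resource limits")

    labels = manifest.get("metadata", {}).get("labels", {})
    if "team" not in labels:
        msgs.append("Deployments must have 'team' label")
    if "cost-center" not in labels:
        msgs.append("Deployments must have 'cost-center' label")
    return msgs


bad = {"kind": "Deployment", "metadata": {"labels": {}},
       "spec": {"template": {"spec": {"containers": [{"name": "api"}]}}}}
print(policy_violations(bad))
```

Keeping a Python mirror of each Rego rule makes the policies themselves testable in CI, so a policy change is reviewed like any other code change.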

Observability for Automation

How do you know your automation is working?

1. Drift Detection

from kubernetes import client, config

def detect_drift():
    """Compare desired state (Git) vs actual state (cluster)"""

    # Load desired state from Git (assumed helper returning {name: spec})
    desired = load_manifests_from_git()

    # Load actual state from cluster
    config.load_kube_config()
    v1 = client.AppsV1Api()
    actual_deployments = v1.list_deployment_for_all_namespaces()

    drift = []
    for deployment in actual_deployments.items:
        name = deployment.metadata.name
        namespace = deployment.metadata.namespace

        # Check if in Git
        if name not in desired:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": "Not in Git (manual creation)",
                "action": "Delete or add to Git"
            })
            continue

        # Compare replicas
        if deployment.spec.replicas != desired[name].replicas:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": f"Replica count mismatch: {deployment.spec.replicas} != {desired[name].replicas}",
                "action": "Sync from Git"
            })

    # Alert on drift
    if drift:
        send_slack_alert(f"Detected {len(drift)} drifts in cluster")

    return drift

2. Deployment Audit Trail

# Log every infrastructure change
@audit_log
def apply_terraform_change(module: str, action: str):
    """Apply Terraform changes with full audit trail"""

    # Generate plan
    plan = terraform_plan(module)

    # Log intent
    logger.info(
        "terraform_plan_generated",
        module=module,
        action=action,
        changes=plan.resource_changes,
        user=get_current_user(),
        git_commit=get_git_commit()
    )

    # Require approval for production
    if module == "production":
        approval = request_approval(plan)
        if not approval.approved:
            raise Exception("Terraform change not approved")

    # Apply changes
    result = terraform_apply(module)

    # Log result
    logger.info(
        "terraform_applied",
        module=module,
        success=result.success,
        resources_created=result.created,
        resources_modified=result.modified,
        resources_deleted=result.deleted,
        duration_seconds=result.duration
    )

    return result

Real-World Results

After implementing full automation + provenance at Lookout:

Metric                       | Before    | After        | Improvement
Deploy time                  | 2-3 hours | 8 minutes    | 95% ↓
Deploy frequency             | 2-3/week  | 20-30/week   | 10x ↑
Error rate                   | 15%       | 0.3%         | 98% ↓
Rollback time                | 1-2 hours | 2 minutes    | 97% ↓
MTTR (mean time to recovery) | 4 hours   | 12 minutes   | 95% ↓
On-call pages                | 15/week   | 2/week       | 87% ↓
Knowledge silos              | 3 people  | 0 (in code)  |

Business impact:

  • Released features 10x faster
  • Reduced operations team from 8 to 3
  • Zero-downtime deployments became standard
  • New engineers could deploy on day 1

Key Takeaways

  1. Provenance is trust: If you can't explain how your system reached its current state, you can't trust it
  2. Manual operations don't scale: Humans are bottlenecks and single points of failure
  3. Everything as code: Infrastructure, configuration, policies—all in Git
  4. Automation is not optional: At scale, it's the only way to maintain reliability
  5. Drift is the enemy: Detect and eliminate configuration drift

The hard truth: Every manual change is technical debt. Every "quick fix" in production is a future outage waiting to happen.

Build systems that codify knowledge. Automation isn't about replacing humans—it's about freeing them to focus on what actually matters.

If a human has to touch production, the architecture has failed.


About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform automation and cloud infrastructure. He's built zero-downtime deployment pipelines processing billions of events at Okta and Lookout.

Want to discuss automation strategies?

I work with 3-4 high-stakes organizations per year on platform architecture and infrastructure strategy.

Schedule Strategy Session