Automation and the Provenance of State
Category: Platform Engineering
Reading Time: 7 minutes
Tags: Automation, IaC, DevOps
The 3 AM Phone Call That Changed Everything
It was 3:17 AM when my phone rang. Production was down. Hard.
"Did you change anything?" I asked the on-call engineer.
"No, nothing. I just restarted the database pod because it was showing high memory."
That single kubectl command—executed manually, with good intentions—cascaded into a 4-hour outage affecting 2 million users.
Why? Because that database pod had specific resource limits set manually two weeks ago to handle a spike. No one documented it. No one remembered it. And when it was recreated, it got the default (insufficient) limits.
The lesson I learned that night: Automation isn't just about speed. It's about the provenance of state.
What is Provenance of State?
Provenance means knowing the complete history of how your infrastructure reached its current state:
- What changed?
- When did it change?
- Who changed it?
- Why was it changed?
- How can we reproduce it?
Without provenance, you're flying blind. You have a system in a certain state, but no idea how it got there or how to recreate it.
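Those five questions can be captured as a simple record type. A minimal sketch (the `ChangeRecord` class and its field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ChangeRecord:
    """One provenance entry: answers what/when/who/why/how for a change."""
    what: str        # resource and field that changed
    when: datetime   # timestamp of the change
    who: str         # author (e.g. Git commit author)
    why: str         # intent (e.g. ticket reference)
    how: str         # reproduction steps (e.g. commit SHA to apply)

record = ChangeRecord(
    what="deployment/api-server replicas 5 -> 10",
    when=datetime(2024, 1, 15, tzinfo=timezone.utc),
    who="devesh@company.com",
    why="Black Friday traffic (JIRA-1234)",
    how="git checkout abc123 && kubectl apply -f .",
)
```

Notice that a Git commit already gives you every field for free: the diff (what), the timestamp (when), the author (who), the message (why), and the SHA (how). That is the whole argument for putting infrastructure in Git.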
Example: Two Kubernetes Clusters
Cluster A (Manual Configuration):
```bash
# History lost to time
$ kubectl scale deployment api-server --replicas=10
$ kubectl set resources deployment api-server --limits=cpu=2,memory=4Gi
$ kubectl patch deployment api-server \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"highcpu"}}}}}'
```
Questions you can't answer:
- Why 10 replicas instead of 5 or 20?
- Who set those resource limits and when?
- Why is it pinned to highcpu nodes?
- Can we recreate this in staging?
Cluster B (Infrastructure as Code):
```yaml
# Git commit: abc123
# Author: devesh@company.com
# Date: 2024-01-15
# Message: "Scale api-server for Black Friday traffic (JIRA-1234)"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  annotations:
    reason: "Black Friday traffic spike"
    jira-ticket: "JIRA-1234"
spec:
  replicas: 10  # Increased from 5 to handle 2x traffic
  template:
    spec:
      nodeSelector:
        node-type: highcpu  # CPU-bound workload
      containers:
        - name: api
          resources:
            limits:
              cpu: "2"      # Profiled: 1.5 CPU under load
              memory: "4Gi" # 3.2Gi average + 20% buffer
```
Now you know:
- Why: Black Friday traffic (JIRA-1234)
- When: Jan 15, 2024
- Who: devesh@company.com
- How to reproduce: `git checkout abc123; kubectl apply -f .`
Provenance = Trust = Reliability
The Architecture Has Failed
Here's a controversial statement:
If a human has to touch production, the architecture has failed.
Not because humans are bad (we're not). But because:
- Humans don't scale: You can't hire fast enough
- Knowledge is tribal: "Sarah knows how to fix this"
- Context is lost: "I think someone changed this last month?"
- Reproducibility is impossible: "It works on my machine"
- Auditing is a nightmare: "Who changed the firewall rules?"
Case Study: The Lookout Deployment Pipeline
When I joined Lookout, deploys were manual:
Manual Process (2019):
- SSH into bastion host
- Pull latest code
- Run migration scripts (hope they're idempotent)
- Restart services in the "right" order
- Check logs manually
- Hope nothing breaks
- Time: 2-3 hours
- Error rate: 15%
- Deploys per week: 2-3
- Rollback time: 1-2 hours
- Knowledge holders: 3 people
Automated Process (2020):
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Run database migrations
        run: |
          # No -t flag: CI runners have no TTY
          kubectl exec db-pod -- \
            /app/migrations/run.sh \
            --version=$(git rev-parse HEAD)

      - name: Deploy to Kubernetes
        run: |
          # GitOps: ArgoCD picks up changes automatically
          git tag deploy-prod-$(date +%s)
          git push --tags

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/api-server
          kubectl rollout status deployment/worker-pool

      - name: Run smoke tests
        run: |
          ./tests/smoke-test.sh

      - name: Notify Slack
        if: always()
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"Deploy ${{ job.status }}: ${{ github.sha }}\"}"
```
- Time: 8 minutes
- Error rate: 0.3%
- Deploys per week: 20-30
- Rollback time: 2 minutes (GitOps revert)
- Knowledge holders: Codified in the repository
Building for Provenance: The Layers
Layer 1: Infrastructure as Code (IaC)
Every piece of infrastructure must be defined in code.
Terraform Example:
```hcl
# terraform/production/main.tf

# VPC with full provenance
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name        = "production-vpc"
    Environment = "production"
    ManagedBy   = "terraform"
    GitCommit   = var.git_commit
    LastChanged = timestamp()
    Owner       = "platform-team"
    CostCenter  = "engineering"
  }
}

# EKS cluster
resource "aws_eks_cluster" "main" {
  name     = "production-eks"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.28" # Pinned version for reproducibility

  vpc_config {
    subnet_ids = aws_subnet.private[*].id

    # Security: Private endpoint only
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  # Enable control plane logging
  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler"
  ]

  tags = {
    GitCommit = var.git_commit
    Terraform = "true"
  }
}
```
Result: Complete infrastructure history in Git. Can recreate entire environment from commit SHA.
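Feeding the commit SHA into `var.git_commit` can be as simple as a small wrapper that reads it from Git at apply time. A sketch (the two functions are hypothetical helpers; only `git rev-parse` and `terraform apply -var` are standard CLI usage):

```python
import subprocess

def terraform_apply_cmd(git_commit: str) -> list[str]:
    """Build the apply command that stamps every resource with its source commit."""
    return ["terraform", "apply", f"-var=git_commit={git_commit}", "-auto-approve"]

def apply_with_provenance(module_dir: str) -> None:
    """Resolve HEAD and apply, so the GitCommit tag always matches the repo."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(terraform_apply_cmd(commit), cwd=module_dir, check=True)
```

In a CI pipeline the same value is usually available directly (e.g. `GITHUB_SHA`), which avoids shelling out to Git at all.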
Layer 2: Configuration as Code
Don't just automate deployment. Automate configuration.
Bad (Manual ConfigMap):
```bash
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=debug \
  --from-literal=MAX_CONNECTIONS=100
```
Good (Versioned in Git):
```yaml
# k8s/configmaps/app-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  annotations:
    config-version: "2.3.0"
    last-updated: "2024-01-15"
    updated-by: "devesh@company.com"
data:
  LOG_LEVEL: "info"       # Changed from debug (JIRA-5678)
  MAX_CONNECTIONS: "500"  # Increased for peak load
  CACHE_TTL: "3600"
  FEATURE_FLAGS: |
    new_auth_flow: true
    beta_dashboard: false
```
Layer 3: Declarative Rollouts
Use GitOps to ensure desired state = actual state.
ArgoCD Application:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/api-server
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete removed resources
      selfHeal: true  # Fix manual changes
    syncOptions:
      - CreateNamespace=true
  # Keep sync history so rollbacks stay one command away
  revisionHistoryLimit: 10
```
What this gives you:
- Automatic drift detection: Manual changes are reverted
- Audit trail: Every change is a Git commit
- Easy rollback: `git revert abc123`
- Preview changes: PR diffs show infrastructure changes
Layer 4: Policy as Code
Prevent bad changes before they happen.
Open Policy Agent (OPA) Example:
```rego
# policy/kubernetes.rego
package main

# Deny deployments without resource limits
deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.containers[_].resources.limits
    msg := "Deployments must specify resource limits"
}

# Deny privileged containers
deny[msg] {
    input.kind == "Pod"
    input.spec.containers[_].securityContext.privileged == true
    msg := "Privileged containers are not allowed"
}

# Require a team label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels.team
    msg := "Deployments must have 'team' label"
}

# Require a cost-center label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels["cost-center"]
    msg := "Deployments must have 'cost-center' label"
}
```
Enforcement:
```bash
#!/bin/bash
# Pre-commit hook: fail the commit if any manifest violates policy
set -e
for file in k8s/*.yaml; do
  conftest test "$file" --policy policy/
done
```
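The same rules are easy to unit-test outside the cluster. A rough Python equivalent of the Rego policies above (the `check_deployment` helper is illustrative, not part of OPA; it covers the Deployment rules only):

```python
def check_deployment(manifest: dict) -> list[str]:
    """Return policy violations for a Deployment manifest (mirrors the Rego rules)."""
    violations = []
    if manifest.get("kind") != "Deployment":
        return violations

    # Rule 1: every container must declare resource limits
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c in containers:
        if not c.get("resources", {}).get("limits"):
            violations.append("Deployments must specify resource limits")
            break

    # Rules 3 and 4: required labels
    labels = manifest.get("metadata", {}).get("labels", {})
    if "team" not in labels:
        violations.append("Deployments must have 'team' label")
    if "cost-center" not in labels:
        violations.append("Deployments must have 'cost-center' label")
    return violations
```

Writing the check twice sounds wasteful, but a plain-language mirror like this doubles as executable documentation of what the policy actually enforces.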
Observability for Automation
How do you know your automation is working?
1. Drift Detection
```python
from kubernetes import client, config

def detect_drift():
    """Compare desired state (Git) vs actual state (cluster)."""
    # Load desired state from Git (helper defined elsewhere;
    # returns {name: manifest})
    desired = load_manifests_from_git()

    # Load actual state from the cluster
    config.load_kube_config()
    v1 = client.AppsV1Api()
    actual_deployments = v1.list_deployment_for_all_namespaces()

    drift = []
    for deployment in actual_deployments.items:
        name = deployment.metadata.name
        namespace = deployment.metadata.namespace

        # Check if in Git
        if name not in desired:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": "Not in Git (manual creation)",
                "action": "Delete or add to Git",
            })
            continue

        # Compare replicas
        if deployment.spec.replicas != desired[name].replicas:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": (
                    f"Replica count mismatch: "
                    f"{deployment.spec.replicas} != {desired[name].replicas}"
                ),
                "action": "Sync from Git",
            })

    # Alert on drift
    if drift:
        send_slack_alert(f"Detected {len(drift)} drifts in cluster")

    return drift
```
2. Deployment Audit Trail
```python
# Log every infrastructure change
@audit_log
def apply_terraform_change(module: str, action: str):
    """Apply Terraform changes with full audit trail."""
    # Generate plan
    plan = terraform_plan(module)

    # Log intent
    logger.info(
        "terraform_plan_generated",
        module=module,
        action=action,
        changes=plan.resource_changes,
        user=get_current_user(),
        git_commit=get_git_commit(),
    )

    # Require approval for production
    if module == "production":
        approval = request_approval(plan)
        if not approval.approved:
            raise Exception("Terraform change not approved")

    # Apply changes
    result = terraform_apply(module)

    # Log result
    logger.info(
        "terraform_applied",
        module=module,
        success=result.success,
        resources_created=result.created,
        resources_modified=result.modified,
        resources_deleted=result.deleted,
        duration_seconds=result.duration,
    )
    return result
```
Real-World Results
After implementing full automation + provenance at Lookout:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deploy time | 2-3 hours | 8 minutes | 95% ↓ |
| Deploy frequency | 2-3/week | 20-30/week | 10x ↑ |
| Error rate | 15% | 0.3% | 98% ↓ |
| Rollback time | 1-2 hours | 2 minutes | 97% ↓ |
| MTTR (mean time to recovery) | 4 hours | 12 minutes | 95% ↓ |
| On-call pages | 15/week | 2/week | 87% ↓ |
| Knowledge silos | 3 people | 0 (in code) | ∞ |
Business impact:
- Released features 10x faster
- Reduced operations team from 8 to 3
- Zero-downtime deployments became standard
- New engineers could deploy on day 1
Key Takeaways
- Provenance is trust: If you can't explain how your system reached its current state, you can't trust it
- Manual operations don't scale: Humans are bottlenecks and single points of failure
- Everything as code: Infrastructure, configuration, policies—all in Git
- Automation is not optional: At scale, it's the only way to maintain reliability
- Drift is the enemy: Detect and eliminate configuration drift
The hard truth: Every manual change is technical debt. Every "quick fix" in production is a future outage waiting to happen.
Build systems that codify knowledge. Automation isn't about replacing humans—it's about freeing them to focus on what actually matters.
If a human has to touch production, the architecture has failed.
About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform automation and cloud infrastructure. He's built zero-downtime deployment pipelines processing billions of events at Okta and Lookout.
Want to discuss automation strategies?