Automation and the Provenance of State
Category: Platform Engineering
Reading Time: 7 minutes
Tags: Automation, IaC, DevOps
The 3 AM Phone Call That Changed Everything
It was 3:17 AM when my phone rang. Production was down. Hard.
"Did you change anything?" I asked the on-call engineer.
"No, nothing. I just restarted the database pod because it was showing high memory."
That single kubectl command—executed manually, with good intentions—cascaded into a 4-hour outage affecting 2 million users.
Why? Because that database pod had specific resource limits set manually two weeks ago to handle a spike. No one documented it. No one remembered it. And when it was recreated, it got the default (insufficient) limits.
The lesson I learned that night: Automation isn't just about speed. It's about the provenance of state.
What is Provenance of State?
Provenance means knowing the complete history of how your infrastructure reached its current state:
- What changed?
- When did it change?
- Who changed it?
- Why was it changed?
- How can we reproduce it?
Without provenance, you're flying blind. You have a system in a certain state, but no idea how it got there or how to recreate it.
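Those five questions can be captured as a simple record type. A minimal sketch (the `ChangeRecord` class and its field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ChangeRecord:
    """One provenance entry: answers what/when/who/why/how for a change."""
    what: str        # resource and field that changed
    when: datetime   # timestamp of the change
    who: str         # author (e.g. Git commit author)
    why: str         # intent (e.g. ticket reference)
    how: str         # reproduction steps (e.g. commit SHA to apply)

record = ChangeRecord(
    what="deployment/api-server replicas 5 -> 10",
    when=datetime(2024, 1, 15, tzinfo=timezone.utc),
    who="devesh@company.com",
    why="Black Friday traffic (JIRA-1234)",
    how="git checkout abc123 && kubectl apply -f .",
)
```

Notice that a Git commit already gives you every field for free: the diff (what), the timestamp (when), the author (who), the message (why), and the SHA (how). That is the whole argument for putting infrastructure in Git.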
Example: Two Kubernetes Clusters
Cluster A (Manual Configuration):
```bash
# History lost to time
$ kubectl scale deployment api-server --replicas=10
$ kubectl set resources deployment api-server --limits=cpu=2,memory=4Gi
$ kubectl patch deployment api-server \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-type":"highcpu"}}}}}'
```
Questions you can't answer:
- Why 10 replicas instead of 5 or 20?
- Who set those resource limits and when?
- Why is it pinned to highcpu nodes?
- Can we recreate this in staging?
Cluster B (Infrastructure as Code):
```yaml
# Git commit: abc123
# Author: devesh@company.com
# Date: 2024-01-15
# Message: "Scale api-server for Black Friday traffic (JIRA-1234)"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  annotations:
    reason: "Black Friday traffic spike"
    jira-ticket: "JIRA-1234"
spec:
  replicas: 10  # Increased from 5 to handle 2x traffic
  template:
    spec:
      nodeSelector:
        node-type: highcpu  # CPU-bound workload
      containers:
        - name: api
          resources:
            limits:
              cpu: "2"      # Profiled: 1.5 CPU under load
              memory: "4Gi" # 3.2Gi average + 20% buffer
```
Now you know:
- Why: Black Friday traffic (JIRA-1234)
- When: Jan 15, 2024
- Who: devesh@company.com
- How to reproduce: `git checkout abc123; kubectl apply -f .`
Provenance = Trust = Reliability
The Architecture Has Failed
Here's a controversial statement:
If a human has to touch production, the architecture has failed.
Not because humans are bad (we're not). But because:
- Humans don't scale: You can't hire fast enough
- Knowledge is tribal: "Sarah knows how to fix this"
- Context is lost: "I think someone changed this last month?"
- Reproducibility is impossible: "It works on my machine"
- Auditing is a nightmare: "Who changed the firewall rules?"
Case Study: The Lookout Deployment Pipeline
When I joined Lookout, deploys were manual:
Manual Process (2019):
- SSH into bastion host
- Pull latest code
- Run migration scripts (hope they're idempotent)
- Restart services in the "right" order
- Check logs manually
- Hope nothing breaks
- Time: 2-3 hours
- Error rate: 15%
- Deploys per week: 2-3
- Rollback time: 1-2 hours
- Knowledge holders: 3 people
Automated Process (2020):
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Run database migrations
        run: |
          # No -t flag: CI runners have no TTY
          kubectl exec db-pod -- \
            /app/migrations/run.sh \
            --version=$(git rev-parse HEAD)

      - name: Deploy to Kubernetes
        run: |
          # GitOps: ArgoCD picks up changes automatically
          git tag deploy-prod-$(date +%s)
          git push --tags

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/api-server
          kubectl rollout status deployment/worker-pool

      - name: Run smoke tests
        run: |
          ./tests/smoke-test.sh

      - name: Notify Slack
        if: always()
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"Deploy ${{ job.status }}: ${{ github.sha }}\"}"
```
- Time: 8 minutes
- Error rate: 0.3%
- Deploys per week: 20-30
- Rollback time: 2 minutes (GitOps revert)
- Knowledge holders: Codified in the repository
Building for Provenance: The Layers
Layer 1: Infrastructure as Code (IaC)
Every piece of infrastructure must be defined in code.
Terraform Example:
```hcl
# terraform/production/main.tf

# VPC with full provenance
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name        = "production-vpc"
    Environment = "production"
    ManagedBy   = "terraform"
    GitCommit   = var.git_commit
    LastChanged = timestamp()
    Owner       = "platform-team"
    CostCenter  = "engineering"
  }
}

# EKS cluster
resource "aws_eks_cluster" "main" {
  name     = "production-eks"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.28" # Pinned version for reproducibility

  vpc_config {
    subnet_ids = aws_subnet.private[*].id

    # Security: Private endpoint only
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  # Enable control plane logging
  enabled_cluster_log_types = [
    "api",
    "audit",
    "authenticator",
    "controllerManager",
    "scheduler"
  ]

  tags = {
    GitCommit = var.git_commit
    Terraform = "true"
  }
}
```
Result: Complete infrastructure history in Git. Can recreate entire environment from commit SHA.
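Feeding the commit SHA into `var.git_commit` can be as simple as a small wrapper that reads it from Git at apply time. A sketch (the two functions are hypothetical helpers; only `git rev-parse` and `terraform apply -var` are standard CLI usage):

```python
import subprocess

def terraform_apply_cmd(git_commit: str) -> list[str]:
    """Build the apply command that stamps every resource with its source commit."""
    return ["terraform", "apply", f"-var=git_commit={git_commit}", "-auto-approve"]

def apply_with_provenance(module_dir: str) -> None:
    """Resolve HEAD and apply, so the GitCommit tag always matches the repo."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(terraform_apply_cmd(commit), cwd=module_dir, check=True)
```

In a CI pipeline the same value is usually available directly (e.g. `GITHUB_SHA`), which avoids shelling out to Git at all.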
Layer 2: Configuration as Code
Don't just automate deployment. Automate configuration.
Bad (Manual ConfigMap):
```bash
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=debug \
  --from-literal=MAX_CONNECTIONS=100
```
Good (Versioned in Git):
```yaml
# k8s/configmaps/app-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  annotations:
    config-version: "2.3.0"
    last-updated: "2024-01-15"
    updated-by: "devesh@company.com"
data:
  LOG_LEVEL: "info"       # Changed from debug (JIRA-5678)
  MAX_CONNECTIONS: "500"  # Increased for peak load
  CACHE_TTL: "3600"
  FEATURE_FLAGS: |
    new_auth_flow: true
    beta_dashboard: false
```
Layer 3: Declarative Rollouts
Use GitOps to ensure desired state = actual state.
ArgoCD Application:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/api-server
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete removed resources
      selfHeal: true  # Fix manual changes
    syncOptions:
      - CreateNamespace=true
  # Keep sync history so rollbacks stay one command away
  revisionHistoryLimit: 10
```
What this gives you:
- Automatic drift detection: Manual changes are reverted
- Audit trail: Every change is a Git commit
- Easy rollback: `git revert abc123`
- Preview changes: PR diffs show infrastructure changes
Layer 4: Policy as Code
Prevent bad changes before they happen.
Open Policy Agent (OPA) Example:
```rego
# policy/kubernetes.rego
package main

# Deny deployments without resource limits
deny[msg] {
    input.kind == "Deployment"
    not input.spec.template.spec.containers[_].resources.limits
    msg := "Deployments must specify resource limits"
}

# Deny privileged containers
deny[msg] {
    input.kind == "Pod"
    input.spec.containers[_].securityContext.privileged == true
    msg := "Privileged containers are not allowed"
}

# Require a team label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels.team
    msg := "Deployments must have 'team' label"
}

# Require a cost-center label
deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels["cost-center"]
    msg := "Deployments must have 'cost-center' label"
}
```
Enforcement:
```bash
#!/bin/bash
# Pre-commit hook: fail the commit if any manifest violates policy
set -e
for file in k8s/*.yaml; do
  conftest test "$file" --policy policy/
done
```
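The same rules are easy to unit-test outside the cluster. A rough Python equivalent of the Rego policies above (the `check_deployment` helper is illustrative, not part of OPA; it covers the Deployment rules only):

```python
def check_deployment(manifest: dict) -> list[str]:
    """Return policy violations for a Deployment manifest (mirrors the Rego rules)."""
    violations = []
    if manifest.get("kind") != "Deployment":
        return violations

    # Rule 1: every container must declare resource limits
    containers = (
        manifest.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    for c in containers:
        if not c.get("resources", {}).get("limits"):
            violations.append("Deployments must specify resource limits")
            break

    # Rules 3 and 4: required labels
    labels = manifest.get("metadata", {}).get("labels", {})
    if "team" not in labels:
        violations.append("Deployments must have 'team' label")
    if "cost-center" not in labels:
        violations.append("Deployments must have 'cost-center' label")
    return violations
```

Writing the check twice sounds wasteful, but a plain-language mirror like this doubles as executable documentation of what the policy actually enforces.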
Observability for Automation
How do you know your automation is working?
1. Drift Detection
```python
from kubernetes import client, config

def detect_drift():
    """Compare desired state (Git) vs actual state (cluster)."""
    # Load desired state from Git (helper defined elsewhere;
    # returns {name: manifest})
    desired = load_manifests_from_git()

    # Load actual state from the cluster
    config.load_kube_config()
    v1 = client.AppsV1Api()
    actual_deployments = v1.list_deployment_for_all_namespaces()

    drift = []
    for deployment in actual_deployments.items:
        name = deployment.metadata.name
        namespace = deployment.metadata.namespace

        # Check if in Git
        if name not in desired:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": "Not in Git (manual creation)",
                "action": "Delete or add to Git",
            })
            continue

        # Compare replicas
        if deployment.spec.replicas != desired[name].replicas:
            drift.append({
                "resource": f"{namespace}/{name}",
                "issue": (
                    f"Replica count mismatch: "
                    f"{deployment.spec.replicas} != {desired[name].replicas}"
                ),
                "action": "Sync from Git",
            })

    # Alert on drift
    if drift:
        send_slack_alert(f"Detected {len(drift)} drifts in cluster")

    return drift
```
2. Deployment Audit Trail
```python
# Log every infrastructure change
@audit_log
def apply_terraform_change(module: str, action: str):
    """Apply Terraform changes with full audit trail."""
    # Generate plan
    plan = terraform_plan(module)

    # Log intent
    logger.info(
        "terraform_plan_generated",
        module=module,
        action=action,
        changes=plan.resource_changes,
        user=get_current_user(),
        git_commit=get_git_commit(),
    )

    # Require approval for production
    if module == "production":
        approval = request_approval(plan)
        if not approval.approved:
            raise Exception("Terraform change not approved")

    # Apply changes
    result = terraform_apply(module)

    # Log result
    logger.info(
        "terraform_applied",
        module=module,
        success=result.success,
        resources_created=result.created,
        resources_modified=result.modified,
        resources_deleted=result.deleted,
        duration_seconds=result.duration,
    )
    return result
```
Real-World Results
After implementing full automation + provenance at Lookout:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deploy time | 2-3 hours | 8 minutes | 95% ↓ |
| Deploy frequency | 2-3/week | 20-30/week | 10x ↑ |
| Error rate | 15% | 0.3% | 98% ↓ |
| Rollback time | 1-2 hours | 2 minutes | 97% ↓ |
| MTTR (mean time to recovery) | 4 hours | 12 minutes | 95% ↓ |
| On-call pages | 15/week | 2/week | 87% ↓ |
| Knowledge silos | 3 people | 0 (in code) | ∞ |
Business impact:
- Released features 10x faster
- Reduced operations team from 8 to 3
- Zero-downtime deployments became standard
- New engineers could deploy on day 1
Key Takeaways
- Provenance is trust: If you can't explain how your system reached its current state, you can't trust it
- Manual operations don't scale: Humans are bottlenecks and single points of failure
- Everything as code: Infrastructure, configuration, policies—all in Git
- Automation is not optional: At scale, it's the only way to maintain reliability
- Drift is the enemy: Detect and eliminate configuration drift
The hard truth: Every manual change is technical debt. Every "quick fix" in production is a future outage waiting to happen.
Build systems that codify knowledge. Automation isn't about replacing humans—it's about freeing them to focus on what actually matters.
If a human has to touch production, the architecture has failed.
About the Author
Devesh Kumar is a Staff Software Engineer specializing in platform automation and cloud infrastructure. He's built zero-downtime deployment pipelines processing billions of events at Okta and Lookout.
Want to discuss automation strategies?