Observability is the Only Reality That Matters
Category: SRE & Reliability
Reading Time: 10 minutes
Tags: SRE, Observability, Kubernetes
The Day We Lost 2% of Authentication Events (And Didn't Know for 3 Hours)
It was a Tuesday morning at Okta. Our dashboards were green. CPUs were at 40%. Memory looked fine. No alerts firing. Everything was "normal."
Except it wasn't.
A customer called support: "Users can't log in." We checked our metrics. Success rate: 99.2%. Well within SLO. We told them it was likely a local issue.
Then another call. And another.
Three hours later, we discovered the truth: A silent failure in our Kafka consumer group was dropping 2% of authentication events. No errors logged. No exceptions thrown. Messages just... vanished into the void.
The cost: $2.3M in SLA credits. Customer trust damaged. And a brutal lesson learned.
At web scale, hope is not a strategy. You cannot verify what you cannot observe.
The Three Pillars (And Why They're Not Enough)
Every SRE has heard of the "three pillars of observability":
- Metrics (Prometheus, CloudWatch)
- Logs (ELK, Splunk)
- Traces (Jaeger, Datadog APM)
But here's the dirty secret: Most companies have all three and still can't answer basic questions.
Questions like:
- "Why did this API call take 8 seconds?"
- "Which service introduced this 500 error?"
- "What percentage of users are affected by this bug?"
The problem isn't the pillars. The problem is the missing fourth pillar: semantic observability—connecting the dots between what you measure and what actually matters to the business.
Semantic Observability: The Fourth Pillar
Semantic observability means every metric, log, and trace carries business context.
Bad Observability (Syntactic)
# Prometheus metric
http_requests_total{status="500"} 142
What we know: 142 requests failed.
What we don't know: Who were they? What data was affected? Are they retrying?
Good Observability (Semantic)
# Prometheus metric with context
http_requests_total{
  status="500",
  endpoint="/auth/login",
  tenant_id="acme_corp",
  user_tier="enterprise",
  retry_attempt="3",
  correlation_id="abc123"
} 142
Now we know:
- Who: Enterprise tenant "acme_corp"
- What: Login failures
- Context: Already on 3rd retry
- Trace: Can follow correlation_id end-to-end
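Once those labels exist, the business question becomes a one-line query. A PromQL sketch (label names taken from the example above):

```promql
# Error rate per enterprise tenant over the last 5 minutes
sum by (tenant_id) (
  rate(http_requests_total{status="500", user_tier="enterprise"}[5m])
)
/
sum by (tenant_id) (
  rate(http_requests_total{user_tier="enterprise"}[5m])
)
```

Without the tenant and tier labels, this question simply cannot be asked of the metric.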
Building Observable Systems: A Blueprint
1. Instrument at the Source (Not the Edge)
Most teams instrument their API gateway and call it a day. That's like putting a speedometer on your car and expecting to diagnose engine problems.
Instrument where the work happens:
import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram
import structlog

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

# Metrics
auth_events_processed = Counter(
    'auth_events_processed_total',
    'Authentication events processed',
    ['tenant_id', 'result', 'provider']
)

auth_event_latency = Histogram(
    'auth_event_duration_seconds',
    'Time to process auth event',
    ['tenant_id', 'provider'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

async def process_auth_event(event: AuthEvent):
    # Start trace span
    with tracer.start_as_current_span(
        "process_auth_event",
        attributes={
            "tenant_id": event.tenant_id,
            "user_id": event.user_id,
            "provider": event.provider,
            "correlation_id": event.correlation_id
        }
    ) as span:
        start_time = time.time()
        try:
            # Business logic
            result = await authenticate_user(event)

            # Log with full context
            logger.info(
                "auth_event_processed",
                tenant_id=event.tenant_id,
                user_id=event.user_id,
                provider=event.provider,
                result="success",
                duration_ms=(time.time() - start_time) * 1000,
                correlation_id=event.correlation_id
            )

            # Record metrics
            auth_events_processed.labels(
                tenant_id=event.tenant_id,
                result="success",
                provider=event.provider
            ).inc()

            auth_event_latency.labels(
                tenant_id=event.tenant_id,
                provider=event.provider
            ).observe(time.time() - start_time)

            # Add span attributes
            span.set_attribute("result", "success")
            return result

        except Exception as e:
            # Structured error logging
            logger.error(
                "auth_event_failed",
                tenant_id=event.tenant_id,
                user_id=event.user_id,
                error=str(e),
                error_type=type(e).__name__,
                correlation_id=event.correlation_id
            )

            # Record failure metrics
            auth_events_processed.labels(
                tenant_id=event.tenant_id,
                result="error",
                provider=event.provider
            ).inc()

            # Mark span as error
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)
            raise
2. Correlation IDs: The Golden Thread
Every request needs a correlation ID that flows through every service, log, and trace.
Generate at entry point:
from uuid import uuid4

from fastapi import Request

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid4())

    # Inject into request context
    request.state.correlation_id = correlation_id

    # Pass to downstream services
    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id
    return response
Propagate through service calls:
import httpx

# request is passed in so the state set by the middleware is available here
async def call_downstream_service(request: Request, endpoint: str, payload: dict):
    headers = {
        "X-Correlation-ID": request.state.correlation_id,
        "X-Tenant-ID": request.state.tenant_id,
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(
            endpoint,
            json=payload,
            headers=headers
        )
    return response
Result: Full request path visibility across 20+ microservices.
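Threading the request object through every call site gets tedious. One common alternative (a sketch, not the service's actual code) is to stash the correlation ID in a `contextvars.ContextVar`, which is safe across concurrent asyncio tasks, so any function on the request path can read it:

```python
import contextvars

# Holds the correlation ID for the current request context
correlation_id_var = contextvars.ContextVar("correlation_id", default="unknown")

def set_correlation_id(value: str) -> None:
    """Called once by the entry-point middleware."""
    correlation_id_var.set(value)

def current_correlation_id() -> str:
    """Readable from any coroutine handling this request."""
    return correlation_id_var.get()
```

Log processors and HTTP clients can then pull the ID from `current_correlation_id()` instead of taking it as a parameter.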
3. High-Cardinality Dimensions (The Hidden Killer)
Be careful with high-cardinality labels. They explode your metrics storage.
Bad (High Cardinality):
# DON'T DO THIS - user_id has millions of values
requests_total.labels(user_id=user.id).inc()
Good (Low Cardinality):
# Use user tier, not user ID
requests_total.labels(
    user_tier=user.tier,  # "free", "pro", "enterprise"
    endpoint="/auth"
).inc()

# Store user_id in logs/traces instead
logger.info("request_processed", user_id=user.id)
Rule of thumb: Keep cardinality under 10K unique combinations per metric.
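Worst-case series count is roughly the product of each label's distinct values, so you can sanity-check a metric before shipping it. A back-of-the-envelope sketch (illustrative numbers, not figures from the incident):

```python
from math import prod

# Distinct values per label on a hypothetical metric
label_cardinalities = {
    "tenant_id": 500,  # tenants
    "result": 2,       # success / error
    "provider": 5,     # auth providers
}

# Worst-case number of time series this metric can create
series = prod(label_cardinalities.values())
print(series)  # 5000 -- under the ~10K rule of thumb

# Swapping tenant_id for user_id (say, 5M users) blows past it
print(5_000_000 * 2 * 5)
```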
4. Distributed Tracing: Connect the Dots
Traces show you the story of a request across services.
Set up OpenTelemetry:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# Configure tracer
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)
Manual spans for business logic:
@tracer.start_as_current_span("validate_user_permissions")
async def validate_permissions(user_id: str, resource: str):
    span = trace.get_current_span()

    # Add custom attributes
    span.set_attribute("user.id", user_id)
    span.set_attribute("resource.type", resource)

    # Business logic
    permissions = await db.get_user_permissions(user_id)
    has_access = resource in permissions

    span.set_attribute("access.granted", has_access)
    return has_access
Trace visualization:
[API Gateway] 234ms
├─ [Auth Service] 45ms
│ ├─ [Redis Cache] 2ms (hit)
│ └─ [User DB Query] 38ms
├─ [Permission Service] 156ms ← SLOW!
│ ├─ [Policy Evaluation] 145ms ← ROOT CAUSE
│ └─ [Audit Log Write] 8ms
└─ [Response Serialization] 3ms
Now we know exactly where the bottleneck is.
5. Alerting That Doesn't Cry Wolf
Most alerting is noise. Here's how to fix it:
Bad Alert (Symptom):
# Alert on high CPU
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m
Good Alert (Impact):
# Alert on user-facing impact
- alert: AuthenticationFailureRateHigh
  expr: |
    (
      sum(rate(auth_events_processed_total{result="error"}[5m]))
      /
      sum(rate(auth_events_processed_total[5m]))
    ) > 0.01  # 1% error rate
  for: 5m
  labels:
    severity: critical
    component: auth
  annotations:
    summary: "Authentication failure rate above 1%"
    description: "{{ $value | humanizePercentage }} of auth requests failing"
    runbook: "https://wiki.company.com/runbooks/auth-failures"
Alert Hierarchy:
- Critical: User-facing impact, wake someone up
- Warning: Approaching threshold, investigate during business hours
- Info: Informational, log only
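This hierarchy maps naturally onto Alertmanager's routing tree. A sketch of the idea (receiver names are placeholders, not our actual config):

```yaml
route:
  receiver: default-logs            # info: log only
  group_by: [alertname, component]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # wake someone up
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      receiver: team-slack          # business-hours triage
      repeat_interval: 12h

receivers:
  - name: pagerduty-oncall
  - name: team-slack
  - name: default-logs
```

Anything without a severity label falls through to the default receiver, so an unlabeled alert can never page anyone by accident.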
6. Canary Queries: Your Early Warning System
Don't wait for users to report issues. Proactively test critical flows:
import asyncio
import time

from prometheus_client import Counter, Histogram

# Canary metrics (definitions for the counters used below)
canary_latency = Histogram('canary_duration_seconds', 'Canary flow latency', ['flow'])
canary_successes = Counter('canary_successes_total', 'Canary successes', ['flow'])
canary_failures = Counter('canary_failures_total', 'Canary failures', ['flow'])

async def canary_auth_flow():
    """Synthetic test of end-to-end auth flow"""
    try:
        # Test login
        start = time.time()
        response = await test_client.post("/auth/login", json={
            "username": "canary@example.com",
            "password": "test123"
        })
        duration = time.time() - start

        # Record metrics
        canary_latency.labels(flow="auth_login").observe(duration)

        if response.status_code != 200:
            canary_failures.labels(flow="auth_login").inc()
            logger.error("canary_failed", flow="auth_login", status=response.status_code)
        else:
            canary_successes.labels(flow="auth_login").inc()

    except Exception as e:
        canary_failures.labels(flow="auth_login").inc()
        logger.error("canary_exception", flow="auth_login", error=str(e))

# Run every 60 seconds
async def run_canaries():
    while True:
        await canary_auth_flow()
        await asyncio.sleep(60)
The Kafka Incident: What We Learned
Remember that Kafka incident? Here's what we did after:
1. Consumer Lag Monitoring
# Alert on consumer lag
- alert: KafkaConsumerLagHigh
  expr: kafka_consumer_lag > 10000
  for: 2m
  annotations:
    summary: "Consumer {{ $labels.group }} is lagging"
2. Message Processing Metrics
# Track every message
messages_consumed = Counter(
    'kafka_messages_consumed_total',
    'Messages consumed from Kafka',
    ['topic', 'partition', 'consumer_group']
)

messages_processed = Counter(
    'kafka_messages_processed_total',
    'Messages successfully processed',
    ['topic', 'result']
)

async def consume_messages():
    async for message in kafka_consumer:
        messages_consumed.labels(
            topic=message.topic,
            partition=message.partition,
            consumer_group='auth-group'
        ).inc()

        try:
            await process_message(message)
            messages_processed.labels(
                topic=message.topic,
                result='success'
            ).inc()
        except Exception:
            messages_processed.labels(
                topic=message.topic,
                result='error'
            ).inc()
            raise
3. End-to-End Verification
from datetime import datetime, timedelta, timezone

# Reconciliation job runs every 5 minutes
async def reconcile_auth_events():
    """Verify we processed all events"""
    now = datetime.now(timezone.utc)

    # Count events in source DB
    source_count = await db.count_auth_events(
        start_time=now - timedelta(minutes=5)
    )

    # Count events we processed (from metrics)
    processed_count = get_processed_count_from_prometheus()

    if source_count != processed_count:
        logger.error(
            "event_count_mismatch",
            source_count=source_count,
            processed_count=processed_count,
            missing=source_count - processed_count
        )
        # Alert on mismatch
        reconciliation_errors.inc()
Result: Never missed events again.
Real-World Observability Stack
Here's what we run in production:
Metrics: Prometheus + Thanos
- Prometheus: Time-series metrics
- Thanos: Long-term storage, multi-cluster queries
- Grafana: Visualization and dashboards
Logs: Loki + Tempo
- Loki: Log aggregation (like Prometheus for logs)
- Tempo: Distributed tracing backend
- Grafana: Unified view (logs + metrics + traces)
Alerting: AlertManager
- Routes: Based on severity and team
- Silencing: During deployments
- Escalation: PagerDuty integration
Cost: $8K/month for 5M events/day
- Prometheus: Self-hosted on Kubernetes
- Thanos: S3 storage ($1.2K/mo)
- Loki: Self-hosted ($200/mo infra)
- Tempo: S3 storage ($800/mo)
- Grafana Cloud: Pro plan ($3.5K/mo)
- PagerDuty: Team plan ($2K/mo)
Key Takeaways
- Observability isn't optional: At web scale, you can't verify what you can't observe
- Semantic context matters: Metrics without context are noise
- Correlation IDs are non-negotiable: Track requests end-to-end
- Alert on impact, not symptoms: High CPU doesn't matter if users are happy
- Test proactively: Don't wait for users to report issues
The hard truth: Your monitoring dashboards are green until they're not. And by then, it's too late.
Build systems that tell you the truth. Observability is the only reality that matters.
About the Author
Devesh Kumar is a Staff Software Engineer with 8+ years building zero-downtime systems. He's processed billions of authentication events at Okta and built GenAI platforms at StartupManch.
Want to discuss SRE practices?