Observability is the Only Reality That Matters

Category: SRE & Reliability
Reading Time: 10 minutes
Published: January 10, 2024
Tags: SRE, Observability, Kubernetes


The Day We Lost 2% of Authentication Events (And Didn't Know for 3 Hours)

It was a Tuesday morning at Okta. Our dashboards were green. CPUs were at 40%. Memory looked fine. No alerts firing. Everything was "normal."

Except it wasn't.

A customer called support: "Users can't log in." We checked our metrics. Success rate: 99.2%. Well within SLO. We told them it was likely a local issue.

Then another call. And another.

Three hours later, we discovered the truth: A silent failure in our Kafka consumer group was dropping 2% of authentication events. No errors logged. No exceptions thrown. Messages just... vanished into the void.

The cost: $2.3M in SLA credits. Customer trust damaged. And a brutal lesson learned.

At web scale, hope is not a strategy. You cannot verify what you cannot observe.


The Three Pillars (And Why They're Not Enough)

Every SRE has heard of the "three pillars of observability":

  1. Metrics (Prometheus, CloudWatch)
  2. Logs (ELK, Splunk)
  3. Traces (Jaeger, Datadog APM)

But here's the dirty secret: Most companies have all three and still can't answer basic questions.

Questions like:

  • "Why did this API call take 8 seconds?"
  • "Which service introduced this 500 error?"
  • "What percentage of users are affected by this bug?"

The problem isn't the pillars. It's what's missing between them: semantic observability, the practice of connecting what you measure to what actually matters.


Semantic Observability: The Fourth Pillar

Semantic observability means every metric, log, and trace carries business context.

Bad Observability (Syntactic)

# Prometheus metric
http_requests_total{status="500"} 142

What we know: 142 requests failed.
What we don't know: Who were they? What data was affected? Are they retrying?

Good Observability (Semantic)

# Prometheus metric with context
# (unbounded labels like correlation_id are shown here for illustration;
# in production they belong in logs and traces, per the cardinality section below)
http_requests_total{
    status="500",
    endpoint="/auth/login",
    tenant_id="acme_corp",
    user_tier="enterprise",
    retry_attempt="3",
    correlation_id="abc123"
} 142

Now we know:

  • Who: Enterprise tenant "acme_corp"
  • What: Login failures
  • Context: Already on 3rd retry
  • Trace: Can follow correlation_id end-to-end
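
With those labels in place, previously unanswerable questions become one-line queries. A few illustrative PromQL examples against the metric above (the label values are the hypothetical ones from the snippet):

```promql
# Error rate for one enterprise tenant on the login endpoint
sum(rate(http_requests_total{status="500", tenant_id="acme_corp", endpoint="/auth/login"}[5m]))
  /
sum(rate(http_requests_total{tenant_id="acme_corp", endpoint="/auth/login"}[5m]))

# Which tenants are failing right now, worst first
topk(10, sum by (tenant_id) (rate(http_requests_total{status="500"}[5m])))
```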

Building Observable Systems: A Blueprint

1. Instrument at the Source (Not the Edge)

Most teams instrument their API gateway and call it a day. That's like putting a speedometer on your car and expecting to diagnose engine problems.

Instrument where the work happens:

import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram
import structlog

logger = structlog.get_logger()
tracer = trace.get_tracer(__name__)

# Metrics
auth_events_processed = Counter(
    'auth_events_processed_total',
    'Authentication events processed',
    ['tenant_id', 'result', 'provider']
)

auth_event_latency = Histogram(
    'auth_event_duration_seconds',
    'Time to process auth event',
    ['tenant_id', 'provider'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

async def process_auth_event(event: AuthEvent):
    # Start trace span
    with tracer.start_as_current_span(
        "process_auth_event",
        attributes={
            "tenant_id": event.tenant_id,
            "user_id": event.user_id,
            "provider": event.provider,
            "correlation_id": event.correlation_id
        }
    ) as span:
        start_time = time.time()

        try:
            # Business logic
            result = await authenticate_user(event)

            # Log with full context
            logger.info(
                "auth_event_processed",
                tenant_id=event.tenant_id,
                user_id=event.user_id,
                provider=event.provider,
                result="success",
                duration_ms=(time.time() - start_time) * 1000,
                correlation_id=event.correlation_id
            )

            # Record metrics
            auth_events_processed.labels(
                tenant_id=event.tenant_id,
                result="success",
                provider=event.provider
            ).inc()

            auth_event_latency.labels(
                tenant_id=event.tenant_id,
                provider=event.provider
            ).observe(time.time() - start_time)

            # Add span attributes
            span.set_attribute("result", "success")

            return result

        except Exception as e:
            # Structured error logging
            logger.error(
                "auth_event_failed",
                tenant_id=event.tenant_id,
                user_id=event.user_id,
                error=str(e),
                error_type=type(e).__name__,
                correlation_id=event.correlation_id
            )

            # Record failure metrics
            auth_events_processed.labels(
                tenant_id=event.tenant_id,
                result="error",
                provider=event.provider
            ).inc()

            # Mark span as error
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(e)

            raise

2. Correlation IDs: The Golden Thread

Every request needs a correlation ID that flows through every service, log, and trace.

Generate at entry point:

from uuid import uuid4

from fastapi import Request

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid4())

    # Inject into request context
    request.state.correlation_id = correlation_id

    # Pass to downstream services
    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id

    return response
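
The middleware puts the ID on the request and response, but logs emitted deep in the call stack also need it, without threading it through every function signature. A minimal stdlib sketch using `contextvars` and a logging filter (the stack above would more likely use structlog's contextvars helpers; the names here are illustrative):

```python
import contextvars
import logging

# Holds the current request's correlation ID; each asyncio task sees its own value
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")

class CorrelationIdFilter(logging.Filter):
    """Injects the correlation ID into every log record automatically."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationIdFilter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In the middleware above you would call: correlation_id_var.set(correlation_id)
correlation_id_var.set("abc123")
logger.info("auth_event_processed")  # this log line now carries abc123
```

Because `ContextVar` values are task-local, concurrent requests never see each other's IDs.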

Propagate through service calls:

import httpx

async def call_downstream_service(request: Request, endpoint: str, payload: dict):
    # Read the IDs the middleware attached to the incoming request
    headers = {
        "X-Correlation-ID": request.state.correlation_id,
        "X-Tenant-ID": request.state.tenant_id,
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(
            endpoint,
            json=payload,
            headers=headers
        )

    return response

Result: Full request path visibility across 20+ microservices.


3. High-Cardinality Dimensions (The Hidden Killer)

Be careful with high-cardinality labels. They explode your metrics storage.

Bad (High Cardinality):

# DON'T DO THIS - user_id has millions of values
requests_total.labels(user_id=user.id).inc()

Good (Low Cardinality):

# Use user tier, not user ID
requests_total.labels(
    user_tier=user.tier,  # "free", "pro", "enterprise"
    endpoint="/auth"
).inc()

# Store user_id in logs/traces instead
logger.info("request_processed", user_id=user.id)

Rule of thumb: Keep cardinality under 10K unique combinations per metric.
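
The rule of thumb is easy to check before shipping a new metric: the worst-case series count is the product of each label's possible value count. A tiny helper (hypothetical, not part of prometheus_client) makes it concrete:

```python
import math

def estimate_series(label_cardinalities: dict) -> int:
    """Worst-case time series for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())

# Low-cardinality design from above: well under the 10K rule of thumb
ok = estimate_series({"user_tier": 3, "endpoint": 50, "status": 5})          # 750
# Adding tenant_id for 2,000 tenants blows straight past it
bad = estimate_series({"user_tier": 3, "endpoint": 50, "tenant_id": 2000})   # 300,000
```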


4. Distributed Tracing: Connect the Dots

Traces show you the story of a request across services.

Setup OpenTelemetry:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="jaeger:4317", insecure=True)  # plaintext gRPC inside the cluster
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

Manual spans for business logic:

async def validate_permissions(user_id: str, resource: str):
    # Open the span explicitly: a decorator-style span around an async def
    # would close before the coroutine actually runs
    with tracer.start_as_current_span("validate_user_permissions") as span:
        # Add custom attributes
        span.set_attribute("user.id", user_id)
        span.set_attribute("resource.type", resource)

        # Business logic
        permissions = await db.get_user_permissions(user_id)
        has_access = resource in permissions

        span.set_attribute("access.granted", has_access)

        return has_access

Trace visualization:

[API Gateway] 234ms
  ├─ [Auth Service] 45ms
  │   ├─ [Redis Cache] 2ms (hit)
  │   └─ [User DB Query] 38ms
  ├─ [Permission Service] 156ms ← SLOW!
  │   ├─ [Policy Evaluation] 145ms ← ROOT CAUSE
  │   └─ [Audit Log Write] 8ms
  └─ [Response Serialization] 3ms

Now we know exactly where the bottleneck is.


5. Alerting That Doesn't Cry Wolf

Most alerting is noise. Here's how to fix it:

Bad Alert (Symptom):

# Alert on high CPU
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m

Good Alert (Impact):

# Alert on user-facing impact
- alert: AuthenticationFailureRateHigh
  expr: |
    (
      sum(rate(auth_events_processed_total{result="error"}[5m]))
      /
      sum(rate(auth_events_processed_total[5m]))
    ) > 0.01  # 1% error rate
  for: 5m
  labels:
    severity: critical
    component: auth
  annotations:
    summary: "Authentication failure rate above 1%"
    description: "{{ $value | humanizePercentage }} of auth requests failing"
    runbook: "https://wiki.company.com/runbooks/auth-failures"

Alert Hierarchy:

  1. Critical: User-facing impact, wake someone up
  2. Warning: Approaching threshold, investigate during business hours
  3. Info: Informational, log only
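
That hierarchy has to be encoded somewhere; in this stack it lands in AlertManager's routing tree. A sketch (receiver names and keys are placeholders, not the production config):

```yaml
route:
  receiver: default
  group_by: [alertname, component]
  routes:
    # Critical: user-facing impact, page someone
    - matchers: [severity = "critical"]
      receiver: pagerduty
      repeat_interval: 1h
    # Warning: investigate during business hours
    - matchers: [severity = "warning"]
      receiver: team-slack

receivers:
  - name: default
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <events-v2-key>
  - name: team-slack
    slack_configs:
      - channel: "#sre-alerts"
```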

6. Canary Queries: Your Early Warning System

Don't wait for users to report issues. Proactively test critical flows:

import asyncio
import time

from prometheus_client import Counter, Histogram

# Canary metrics; the 'flow' label keeps cardinality low
canary_latency = Histogram('canary_duration_seconds', 'Canary flow latency', ['flow'])
canary_successes = Counter('canary_success_total', 'Successful canary runs', ['flow'])
canary_failures = Counter('canary_failure_total', 'Failed canary runs', ['flow'])

async def canary_auth_flow():
    """Synthetic end-to-end auth test (test_client: an httpx client
    pointed at the service)"""
    try:
        # Test login
        start = time.time()
        response = await test_client.post("/auth/login", json={
            "username": "canary@example.com",
            "password": "test123"
        })
        duration = time.time() - start

        # Record metrics
        canary_latency.labels(flow="auth_login").observe(duration)

        if response.status_code != 200:
            canary_failures.labels(flow="auth_login").inc()
            logger.error("canary_failed", flow="auth_login", status=response.status_code)
        else:
            canary_successes.labels(flow="auth_login").inc()

    except Exception as e:
        canary_failures.labels(flow="auth_login").inc()
        logger.error("canary_exception", flow="auth_login", error=str(e))

# Run every 60 seconds
async def run_canaries():
    while True:
        await canary_auth_flow()
        await asyncio.sleep(60)
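
One gap in that loop: a canary that hangs never records anything, so the early-warning system goes quiet exactly when things are worst. Bounding each run fixes that (a sketch; the timeout budget and helper names are assumptions):

```python
import asyncio

async def run_canary_bounded(canary, timeout_s: float = 10.0) -> str:
    """Run one canary coroutine with a hard deadline; hangs count as failures."""
    try:
        await asyncio.wait_for(canary(), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        # A hung canary is itself a signal; record it instead of waiting forever
        return "timeout"

# Example with a canary that stalls
async def stuck_canary():
    await asyncio.sleep(3600)

result = asyncio.run(run_canary_bounded(stuck_canary, timeout_s=0.1))
# result == "timeout"
```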

The Kafka Incident: What We Learned

Remember that Kafka incident? Here's what we did after:

1. Consumer Lag Monitoring

# Alert on consumer lag
- alert: KafkaConsumerLagHigh
  expr: kafka_consumer_lag > 10000
  for: 2m
  annotations:
    summary: "Consumer {{ $labels.group }} is lagging"
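
The `kafka_consumer_lag` series has to come from somewhere; per partition, lag is just the log end offset minus the last committed offset. A pure sketch of that computation (where the offsets come from, such as an admin client or consumer `position()` calls, is left abstract):

```python
def compute_consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag = log end offset - last committed offset.

    A partition with no committed offset counts its full backlog as lag.
    """
    return {
        partition: end - committed.get(partition, 0)
        for partition, end in end_offsets.items()
    }

# Partition 2 has 4,200 unread messages
lag = compute_consumer_lag(
    end_offsets={0: 1000, 1: 1000, 2: 5000},
    committed={0: 1000, 1: 990, 2: 800},
)
# lag == {0: 0, 1: 10, 2: 4200}
```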

2. Message Processing Metrics

# Track every message
messages_consumed = Counter(
    'kafka_messages_consumed_total',
    'Messages consumed from Kafka',
    ['topic', 'partition', 'consumer_group']
)

messages_processed = Counter(
    'kafka_messages_processed_total',
    'Messages successfully processed',
    ['topic', 'result']
)

async def consume_messages():
    async for message in kafka_consumer:
        messages_consumed.labels(
            topic=message.topic,
            partition=message.partition,
            consumer_group='auth-group'
        ).inc()

        try:
            await process_message(message)
            messages_processed.labels(
                topic=message.topic,
                result='success'
            ).inc()
        except Exception as e:
            messages_processed.labels(
                topic=message.topic,
                result='error'
            ).inc()
            raise

3. End-to-End Verification

from datetime import datetime, timedelta

# Reconciliation job runs every 5 minutes
async def reconcile_auth_events():
    """Verify we processed all events"""
    now = datetime.utcnow()

    # Count events in source DB
    source_count = await db.count_auth_events(
        start_time=now - timedelta(minutes=5)
    )

    # Count events we processed (from metrics)
    processed_count = get_processed_count_from_prometheus()

    if source_count != processed_count:
        logger.error(
            "event_count_mismatch",
            source_count=source_count,
            processed_count=processed_count,
            missing=source_count - processed_count
        )

        # Alert on mismatch
        reconciliation_errors.inc()
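
`get_processed_count_from_prometheus()` does the heavy lifting in the job above. It would typically hit Prometheus's HTTP API (`GET /api/v1/query`) and unwrap the JSON envelope; a sketch, with the response-parsing half factored out as a pure function (the function names and query are illustrative):

```python
import json
import urllib.parse
import urllib.request

def parse_instant_scalar(body: dict) -> float:
    """Unwrap a Prometheus instant-query response into a single number.

    The API returns {"status": "success", "data": {"result": [{"value": [ts, "42"]}]}}.
    """
    result = body["data"]["result"]
    if not result:
        return 0.0  # no matching series means nothing was processed
    return float(result[0]["value"][1])

def get_processed_count_from_prometheus(base_url: str = "http://prometheus:9090") -> float:
    query = 'sum(increase(auth_events_processed_total[5m]))'
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_instant_scalar(json.load(resp))
```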

Result: Never missed events again.


Real-World Observability Stack

Here's what we run in production:

Metrics: Prometheus + Thanos

  • Prometheus: Time-series metrics
  • Thanos: Long-term storage, multi-cluster queries
  • Grafana: Visualization and dashboards

Logs & Traces: Loki + Tempo

  • Loki: Log aggregation (like Prometheus for logs)
  • Tempo: Distributed tracing backend
  • Grafana: Unified view (logs + metrics + traces)

Alerting: AlertManager

  • Routes: Based on severity and team
  • Silencing: During deployments
  • Escalation: PagerDuty integration

Cost: ~$8K/month for 5M events/day

  • Prometheus: Self-hosted on Kubernetes
  • Thanos: S3 storage ($1.2K/mo)
  • Loki: Self-hosted ($200/mo infra)
  • Tempo: S3 storage ($800/mo)
  • Grafana Cloud: Pro plan ($3.5K/mo)
  • PagerDuty: Team plan ($2K/mo)

Key Takeaways

  1. Observability isn't optional: At web scale, you can't verify what you can't observe
  2. Semantic context matters: Metrics without context are noise
  3. Correlation IDs are non-negotiable: Track requests end-to-end
  4. Alert on impact, not symptoms: High CPU doesn't matter if users are happy
  5. Test proactively: Don't wait for users to report issues

The hard truth: Your monitoring dashboards are green until they're not. And by then, it's too late.

Build systems that tell you the truth. Observability is the only reality that matters.


About the Author
Devesh Kumar is a Staff Software Engineer with 8+ years building zero-downtime systems. He's processed billions of authentication events at Okta and built GenAI platforms at StartupManch.

Want to discuss SRE practices?

I work with 3-4 high-stakes organizations per year on platform architecture and infrastructure strategy.