
Building for the Jagged Edges of LLM Latency

Category: GenAI Infrastructure | Reading Time: 8 minutes | January 15, 2024
Tags: GenAI, Architecture, Performance



The Problem: Average Latency is a Lie

When I started building GenAI platforms at Okta and later at StartupManch, I made the same mistake most engineers make: I optimized for average latency. Our LLM calls averaged 200ms. Great, right? Wrong.

The reality was far more brutal. While p50 latency sat comfortably at 200ms, p95 was 2.3 seconds, and p99 could spike to 8+ seconds. These weren't rare outliers; they were the everyday experience for 1 in 100 requests. At 5 million events per day, that's 50,000 requests a day hitting the jagged edge.

The core lesson: In GenAI systems, abstractions leak, but poor abstractions flood. You must build for the jagged edges, not the averages.


Why LLM Latency is Fundamentally Jagged

Unlike traditional APIs where latency is relatively predictable, LLM inference has inherent variability:

1. Token Generation is Sequential

LLMs generate tokens one at a time. A 500-token response takes 10x longer than a 50-token response. There's no parallelization within a single request.

# This is the reality you're working with (rough numbers)
TIME_PER_TOKEN_MS = 15                        # approximate generation time per token
short_response_ms = 50 * TIME_PER_TOKEN_MS    # 750ms for 50 tokens
long_response_ms = 500 * TIME_PER_TOKEN_MS    # 7500ms for 500 tokens -- 10x difference!

2. Cold Start Penalties

Model loading and GPU allocation aren't instant:

  • Cold container: 2-5 seconds
  • Model loading: 3-8 seconds for large models
  • Warm inference: 200-400ms
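
One common mitigation is keeping containers warm with periodic no-op inference, trading a little idle cost for never paying the 5-13 second cold path. A minimal sketch; the `ping` coroutine (e.g. a 1-token completion) and `max_pings` are assumptions for illustration, not APIs from any provider SDK:

```python
import asyncio

async def keep_warm(ping, interval_sec=60, max_pings=None):
    """Periodically fire a tiny inference request so the container stays hot
    and model weights stay resident in GPU memory."""
    sent = 0
    while max_pings is None or sent < max_pings:
        try:
            await ping()  # e.g. a 1-token completion against the model
        except Exception:
            pass          # warming is best-effort; never crash the warmer
        sent += 1
        await asyncio.sleep(interval_sec)
    return sent
```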

3. Queueing Theory at Scale

When you're processing 500K+ queries/day:

  • Peak traffic causes queueing
  • Wait times grow nonlinearly as utilization approaches saturation
  • Little's Law guarantees pain at scale

Average Queue Length = Arrival Rate × Average Wait Time

At peak load (1000 req/sec), even 100ms of delay means 100 requests waiting in queue.
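
That number is just Little's Law applied directly, which a one-liner can sanity-check:

```python
def avg_queue_length(arrival_rate_per_sec: float, avg_wait_sec: float) -> float:
    """Little's Law: L = lambda * W, independent of the arrival distribution."""
    return arrival_rate_per_sec * avg_wait_sec

# Peak load from above: 1000 req/sec with 100ms of added wait
print(avg_queue_length(1000, 0.100))  # 100 requests sitting in queue
```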

4. Provider Throttling

OpenAI, Anthropic, and others have dynamic rate limits. Hit them, and your latency doesn't degrade linearly—it falls off a cliff.
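
When you do hit a limit, the standard defense is capped exponential backoff with full jitter, so retries from many clients don't synchronize into a thundering herd. A sketch; the `retryable` parameter stands in for whatever 429 exception your provider's client raises, and none of these names come from a real SDK:

```python
import asyncio
import random

async def call_with_backoff(call, retryable, max_retries=5,
                            base_delay=0.5, max_delay=30.0):
    """Retry `call` on throttling errors, doubling the delay cap each
    attempt and sleeping a random amount below it (full jitter)."""
    for attempt in range(max_retries):
        try:
            return await call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the throttle to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            await asyncio.sleep(random.uniform(0, delay))
```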


Architecture Pattern: Event-Driven Vector Mesh

After burning through three architectures, here's what actually works:

1. Decouple Request from Response

Bad Pattern (Synchronous):

@app.post("/chat")
async def chat(request: ChatRequest):
    # User waits here for 8 seconds
    response = await llm_client.complete(request.prompt)
    return response

Good Pattern (Event-Driven):

@app.post("/chat")
async def chat(request: ChatRequest):
    # Return immediately
    job_id = await queue.enqueue({
        "user_id": request.user_id,
        "prompt": request.prompt,
        "priority": calculate_priority(request)
    })
    return {"job_id": job_id, "status": "queued"}

# Separate worker pool
async def process_llm_jobs():
    async for job in queue.consume():
        response = await llm_client.complete(job["prompt"])
        # Push the result to the client over the open socket
        await websocket.send(job["user_id"], response)

2. Implement Smart Caching with Vector Similarity

Not all prompts are unique. We cache semantically similar responses:

from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.vector_db = PineconeClient()  # your vector store client
        self.threshold = similarity_threshold

    async def get(self, prompt: str):
        # Convert prompt to vector
        embedding = self.encoder.encode(prompt)

        # Search for the nearest cached prompt
        results = await self.vector_db.query(
            vector=embedding,
            top_k=1
        )

        if results and results[0].score >= self.threshold:
            return results[0].metadata['response']
        return None

    async def set(self, prompt: str, response: str):
        embedding = self.encoder.encode(prompt)
        await self.vector_db.upsert({
            'id': str(hash(prompt)),  # vector IDs must be strings
            'vector': embedding,
            'metadata': {'response': response}
        })

Result: 40% cache hit rate, reducing average latency from 200ms to 120ms.
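
The 120ms figure is just the hit rate applied to the miss latency, assuming a cache hit is near-free compared to an LLM round trip:

```python
hit_rate = 0.40
miss_latency_ms = 200  # average LLM round trip on a miss
hit_latency_ms = 0     # vector lookup is negligible by comparison

expected_ms = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
print(expected_ms)  # 120.0
```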

3. Circuit Breakers for Cascading Failures

When LLM providers slow down, don't let them take your entire platform with them:

from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_llm_with_protection(prompt: str):
    try:
        return await llm_client.complete(prompt, timeout=5.0)
    except TimeoutError:
        # Circuit opens after 5 consecutive timeouts
        raise
    except Exception as e:
        # Log and re-raise
        logger.error(f"LLM call failed: {e}")
        raise

# Fallback handler when circuit is open
async def chat_endpoint(request):
    try:
        return await call_llm_with_protection(request.prompt)
    except CircuitBreakerError:
        # Serve degraded experience
        return {
            "response": "I'm experiencing high load. Please try again.",
            "status": "degraded"
        }

4. Adaptive Concurrency Limiting

Traditional fixed concurrency limits are too rigid. Use adaptive limits based on observed latency:

import asyncio
from collections import deque

import numpy as np

class AdaptiveConcurrencyLimiter:
    def __init__(self, min_limit=10, max_limit=200):
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.current_limit = 50
        self.in_flight = 0  # requests currently executing
        self.latency_window = deque(maxlen=100)

    def record_latency(self, latency_ms: float):
        self.latency_window.append(latency_ms)

        # Calculate p95 latency
        p95 = np.percentile(self.latency_window, 95)

        # Adjust concurrency based on latency
        if p95 < 500:  # Target: sub-500ms p95
            # Increase concurrency
            self.current_limit = min(
                self.current_limit + 5,
                self.max_limit
            )
        elif p95 > 1000:
            # Decrease concurrency
            self.current_limit = max(
                self.current_limit - 10,
                self.min_limit
            )

    async def acquire(self):
        # Block if at limit
        while self.in_flight >= self.current_limit:
            await asyncio.sleep(0.01)
        self.in_flight += 1

    def release(self):
        self.in_flight -= 1

Load Shedding: The Nuclear Option

When all else fails, drop traffic intelligently:

from fastapi import HTTPException

def calculate_priority(request: ChatRequest) -> int:
    """Higher score = higher priority"""
    score = 0

    # Premium users get priority
    if request.user.tier == "premium":
        score += 100

    # Retry requests get lower priority
    if request.metadata.get("retry_count", 0) > 0:
        score -= 50

    # Time-sensitive requests
    if request.metadata.get("real_time", False):
        score += 75

    return score

async def smart_load_shedder(request: ChatRequest):
    current_load = await metrics.get_cpu_usage()

    if current_load > 0.85:  # 85% CPU threshold
        priority = calculate_priority(request)

        # Drop low-priority requests
        if priority < 50:
            raise HTTPException(
                status_code=503,
                detail="System at capacity. Please retry."
            )

    return await process_request(request)

Observability: Measure Everything

You can't fix what you can't see. Instrument at every layer:

from prometheus_client import Histogram, Counter

# Metrics
llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

cache_hits = Counter('cache_hits_total', 'Cache hit count')
cache_misses = Counter('cache_misses_total', 'Cache miss count')

@llm_latency.time()
async def call_llm(prompt: str):
    # Check cache first
    cached = await cache.get(prompt)
    if cached:
        cache_hits.inc()
        return cached

    cache_misses.inc()
    response = await llm_client.complete(prompt)
    await cache.set(prompt, response)
    return response

Dashboard Alerts:

  • p95 latency > 2s for 5 minutes
  • Cache hit rate < 30%
  • Circuit breaker open state
  • Queue depth > 1000

Real-World Results

After implementing this architecture at StartupManch:

Metric               Before    After     Improvement
p50 latency          200ms     120ms     40% ↓
p95 latency          2.3s      800ms     65% ↓
p99 latency          8s        1.5s      81% ↓
Timeout rate         2.3%      0.1%      95% ↓
Infrastructure cost  $12K/mo   $7.5K/mo  38% ↓

Key Takeaways

  1. Design for p99, not p50: Your worst-case latency defines user experience
  2. Decouple request from response: Async is non-negotiable at scale
  3. Cache semantically: Vector similarity unlocks massive wins
  4. Fail gracefully: Circuit breakers + load shedding prevent cascading failures
  5. Measure obsessively: Observability is the only reality that matters

The jagged edges of LLM latency will hurt you. But with the right architecture, you can turn chaos into predictability.


About the Author
Devesh Kumar is a Staff Software Engineer specializing in GenAI platforms and cloud infrastructure. He's built systems processing 5M+ LLM requests/day at Okta and StartupManch.

Want to discuss GenAI architecture?

I work with 3-4 high-stakes organizations per year on platform architecture and infrastructure strategy.

Schedule Strategy Session