Building for the Jagged Edges of LLM Latency
Category: GenAI Infrastructure
Reading Time: 8 minutes
Tags: GenAI, Architecture, Performance
The Problem: Average Latency is a Lie
When I started building GenAI platforms at Okta and later at StartupManch, I made the same mistake most engineers make: I optimized for average latency. Our LLM calls averaged 200ms. Great, right? Wrong.
The reality was far more brutal. While p50 latency sat comfortably at 200ms, p95 was 2.3 seconds, and p99 could spike past 8 seconds. These weren't outliers; they were the experience for 1 in every 100 requests. At 5 million events per day, that's 50,000 requests hitting the jagged edge.
The core lesson: In GenAI systems, abstractions leak, but poor abstractions flood. You must build for the jagged edges, not the averages.
Why LLM Latency is Fundamentally Jagged
Unlike traditional APIs where latency is relatively predictable, LLM inference has inherent variability:
1. Token Generation is Sequential
LLMs generate tokens one at a time. A 500-token response takes 10x longer than a 50-token response. There's no parallelization within a single request.
# This is the reality you're working with
time_per_token_ms = 15  # approximate; varies by model, hardware, and load
short_response_ms = 50 * time_per_token_ms    # 750ms
long_response_ms = 500 * time_per_token_ms    # 7500ms, a 10x difference!
2. Cold Start Penalties
Model loading and GPU allocation aren't instant:
- Cold container: 2-5 seconds
- Model loading: 3-8 seconds for large models
- Warm inference: 200-400ms
3. Queueing Theory at Scale
When you're processing 500K+ queries/day:
- Peak traffic causes queueing
- Even small delays compound sharply as utilization approaches saturation
- Little's Law guarantees pain at scale
Average Queue Length = (Arrival Rate × Average Wait Time)
At peak load (1000 req/sec), even 100ms of delay means 100 requests waiting in queue.
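That back-of-the-envelope number is just Little's Law applied directly; a minimal sketch, using the illustrative figures from the text:

```python
def avg_queue_length(arrival_rate_per_sec: float, avg_wait_sec: float) -> float:
    """Little's Law: L = lambda * W (average number of requests in the system)."""
    return arrival_rate_per_sec * avg_wait_sec

# Peak load from the example above: 1000 req/sec with 100ms of average wait
print(avg_queue_length(1000, 0.100))  # -> 100.0 requests in queue
```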
4. Provider Throttling
OpenAI, Anthropic, and others have dynamic rate limits. Hit them, and your latency doesn't degrade linearly—it falls off a cliff.
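A common first line of defense, not shown in the original, is to retry throttled calls with exponential backoff and full jitter, so a fleet of workers doesn't re-hit the limiter in lockstep. A minimal sketch; `RateLimitError` is a placeholder for whichever throttling exception your provider SDK raises:

```python
import asyncio
import random

class RateLimitError(Exception):
    """Placeholder for a provider's 429 / throttling error."""

async def with_backoff(call, max_retries=4, base_delay=0.5):
    """Retry an async callable on rate limits, doubling the delay window each attempt."""
    for attempt in range(max_retries):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted: surface the throttle to the caller
            # Full jitter: sleep a random fraction of the exponential window
            await asyncio.sleep(base_delay * (2 ** attempt) * random.random())

# Usage: result = await with_backoff(lambda: llm_client.complete(prompt))
```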
Architecture Pattern: Event-Driven Vector Mesh
After burning through three architectures, here's what actually works:
1. Decouple Request from Response
Bad Pattern (Synchronous):
@app.post("/chat")
async def chat(request: ChatRequest):
    # User waits here for up to 8 seconds
    response = await llm_client.complete(request.prompt)
    return response
Good Pattern (Event-Driven):
@app.post("/chat")
async def chat(request: ChatRequest):
    # Return immediately
    job_id = await queue.enqueue({
        "user_id": request.user_id,
        "prompt": request.prompt,
        "priority": calculate_priority(request)
    })
    return {"job_id": job_id, "status": "queued"}

# Separate worker pool
async def process_llm_jobs():
    async for job in queue.consume():
        response = await llm_client.complete(job.prompt)
        await websocket.send(job.user_id, response)
2. Implement Smart Caching with Vector Similarity
Not all prompts are unique. We cache semantically similar responses:
import hashlib

from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.vector_db = PineconeClient()
        self.threshold = similarity_threshold

    async def get(self, prompt: str):
        # Convert prompt to a vector
        embedding = self.encoder.encode(prompt)
        # Search for the nearest cached prompt
        results = await self.vector_db.query(
            vector=embedding.tolist(),
            top_k=1
        )
        if results and results[0].score >= self.threshold:
            return results[0].metadata['response']
        return None

    async def set(self, prompt: str, response: str):
        embedding = self.encoder.encode(prompt)
        await self.vector_db.upsert({
            # Stable id: Python's built-in hash() is salted per process
            'id': hashlib.sha256(prompt.encode()).hexdigest(),
            'vector': embedding.tolist(),
            'metadata': {'response': response}
        })
Result: 40% cache hit rate, reducing average latency from 200ms to 120ms.
3. Circuit Breakers for Cascading Failures
When LLM providers slow down, don't let them take your entire platform with them:
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_llm_with_protection(prompt: str):
    try:
        return await llm_client.complete(prompt, timeout=5.0)
    except TimeoutError:
        # Circuit opens after 5 consecutive timeouts
        raise
    except Exception as e:
        # Log and re-raise
        logger.error(f"LLM call failed: {e}")
        raise

# Fallback handler when circuit is open
async def chat_endpoint(request):
    try:
        return await call_llm_with_protection(request.prompt)
    except CircuitBreakerError:
        # Serve a degraded experience instead of failing hard
        return {
            "response": "I'm experiencing high load. Please try again.",
            "status": "degraded"
        }
4. Adaptive Concurrency Limiting
Traditional fixed concurrency limits are too rigid. Use adaptive limits based on observed latency:
import asyncio
from collections import deque

import numpy as np

class AdaptiveConcurrencyLimiter:
    def __init__(self, min_limit=10, max_limit=200):
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.current_limit = 50
        self.in_flight = 0
        self.latency_window = deque(maxlen=100)

    def record_latency(self, latency_ms: float):
        self.latency_window.append(latency_ms)
        # Calculate p95 over the most recent 100 requests
        p95 = np.percentile(self.latency_window, 95)
        # Adjust concurrency based on observed latency
        if p95 < 500:  # Target: sub-500ms p95
            # Healthy: increase concurrency gently
            self.current_limit = min(
                self.current_limit + 5,
                self.max_limit
            )
        elif p95 > 1000:
            # Degraded: back off more aggressively than we ramp up
            self.current_limit = max(
                self.current_limit - 10,
                self.min_limit
            )

    async def acquire(self):
        # Block (without starving the event loop) while at the limit
        while self.in_flight >= self.current_limit:
            await asyncio.sleep(0.01)
        self.in_flight += 1

    def release(self):
        self.in_flight -= 1
Load Shedding: The Nuclear Option
When all else fails, drop traffic intelligently:
from fastapi import HTTPException

def calculate_priority(request: ChatRequest) -> int:
    """Higher score = higher priority"""
    score = 0
    # Premium users get priority
    if request.user.tier == "premium":
        score += 100
    # Retry requests get lower priority
    if request.metadata.get("retry_count", 0) > 0:
        score -= 50
    # Time-sensitive requests
    if request.metadata.get("real_time", False):
        score += 75
    return score

async def smart_load_shedder(request: ChatRequest):
    current_load = await metrics.get_cpu_usage()
    if current_load > 0.85:  # 85% CPU threshold
        priority = calculate_priority(request)
        # Drop low-priority requests while overloaded
        if priority < 50:
            raise HTTPException(
                status_code=503,
                detail="System at capacity. Please retry."
            )
    return await process_request(request)
Observability: Measure Everything
You can't fix what you can't see. Instrument at every layer:
from prometheus_client import Histogram, Counter

# Metrics
llm_latency = Histogram(
    'llm_request_duration_seconds',
    'LLM request latency',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)
cache_hits = Counter('cache_hits_total', 'Cache hit count')
cache_misses = Counter('cache_misses_total', 'Cache miss count')

@llm_latency.time()
async def call_llm(prompt: str):
    # Check the semantic cache first
    cached = await cache.get(prompt)
    if cached:
        cache_hits.inc()
        return cached
    cache_misses.inc()
    response = await llm_client.complete(prompt)
    await cache.set(prompt, response)
    return response
Dashboard Alerts:
- p95 latency > 2s for 5 minutes
- Cache hit rate < 30%
- Circuit breaker open state
- Queue depth > 1000
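The first two alert conditions translate almost one-for-one into Prometheus alerting rules. A sketch assuming the metric names from the instrumentation above, with thresholds and `for:` windows matching the list:

```yaml
groups:
  - name: llm-latency
    rules:
      - alert: LLMP95LatencyHigh
        # p95 from the Histogram's buckets, sustained for 5 minutes
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 2
        for: 5m
      - alert: LLMCacheHitRateLow
        expr: |
          rate(cache_hits_total[15m])
            / (rate(cache_hits_total[15m]) + rate(cache_misses_total[15m])) < 0.30
        for: 15m
```

The circuit-breaker and queue-depth alerts would need their own gauges, which the instrumentation above does not yet export.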
Real-World Results
After implementing this architecture at StartupManch:
| Metric | Before | After | Improvement |
|---|---|---|---|
| p50 latency | 200ms | 120ms | 40% ↓ |
| p95 latency | 2.3s | 800ms | 65% ↓ |
| p99 latency | 8s | 1.5s | 81% ↓ |
| Timeout rate | 2.3% | 0.1% | 95% ↓ |
| Infrastructure cost | $12K/mo | $7.5K/mo | 38% ↓ |
Key Takeaways
- Design for p99, not p50: Your worst-case latency defines user experience
- Decouple request from response: Async is non-negotiable at scale
- Cache semantically: Vector similarity unlocks massive wins
- Fail gracefully: Circuit breakers + load shedding prevent cascading failures
- Measure obsessively: Observability is the only reality that matters
The jagged edges of LLM latency will hurt you. But with the right architecture, you can turn chaos into predictability.
About the Author
Devesh Kumar is a Staff Software Engineer specializing in GenAI platforms and cloud infrastructure. He's built systems processing 5M+ LLM requests/day at Okta and StartupManch.
Want to discuss GenAI architecture?