Staff Software Engineer Portfolio

Devesh Kumar.
Staff Software Engineer & Cloud/Platform Architect.

8+ years designing and operating multi-tenant cloud platforms across Kubernetes, AWS, and GCP. Built GenAI retrieval systems processing 5M+ events/day and 500K+ daily queries with 99.8–99.9% availability. Cut cloud spend 15–25% through predictive autoscaling and FinOps automation while reducing alert noise by 40–60%.

Okta • Lookout • CipherCloud • StartupManch | Bengaluru, India

5M+ Events/Day
500K+ Queries/Day
99.8–99.9% Availability
15–25% Cost Savings

Strategic Pillars

Deep technical capabilities shaped by platform leadership, GenAI infrastructure, and cloud reliability at scale.

Kubernetes Fleet Reliability

Led zero-downtime fleet upgrades across regulated production clusters, sustaining 99.8–99.9% availability. Established SLO/SLI governance, burn alerts, and incident review loops to harden reliability at scale.
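As a rough illustration of the arithmetic behind SLO burn alerts, here is a minimal sketch. It assumes a 30-day window and a fast-burn paging threshold of 14.4 (a commonly cited value for that window). The function names and the threshold are illustrative assumptions, not the production configuration described above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing.

    A value of 1.0 means the budget would be used up exactly at the end
    of the window; higher values exhaust it sooner.
    """
    return error_ratio / (1.0 - slo)


def should_page(error_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Page when the observed burn rate meets or exceeds the threshold.

    The default threshold is an assumption chosen for illustration.
    """
    return burn_rate(error_ratio, slo) >= threshold


if __name__ == "__main__":
    print(f"99.9% over 30 days allows {error_budget_minutes(0.999):.1f} minutes of downtime")
    print(f"Burn rate at 2% errors: {burn_rate(0.02, 0.999):.1f}")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why short, fast-burning incidents need to page quickly.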

Multi-Tenant GenAI Platforms

Architected control/data planes for RAG pipelines processing 5M+ events/day and 100K–500K queries/day. Cut P95 indexing latency ~70% and maintained 99.8% SLOs under bursty AI traffic.

AIOps & FinOps Intelligence

Built anomaly detection and signal correlation to reduce alert noise 40–60%, paired with ML-based capacity forecasting and autoscaling to cut cloud spend 15–25%+ without SLO regressions.
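One simple way to picture signal correlation is time-window suppression of repeated alerts. The sketch below is a hypothetical illustration only: the `Alert` fields, the `(service, symptom)` grouping key, and the 5-minute window are assumptions, and the anomaly detection and ML forecasting mentioned above are not represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some epoch
    service: str
    symptom: str


def correlate(alerts, window_seconds: float = 300.0):
    """Keep one alert per (service, symptom) burst and suppress repeats.

    A repeat is suppressed when it arrives within window_seconds of the
    previous alert with the same key, so a continuously flapping signal
    produces a single notification until it goes quiet.
    """
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.symptom)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_seconds:
            kept.append(alert)
        # Track every occurrence so the quiet period is measured from the latest one.
        last_seen[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    raw = [
        Alert(0, "api", "high_latency"),
        Alert(60, "api", "high_latency"),
        Alert(120, "api", "high_latency"),
        Alert(30, "db", "replica_lag"),
        Alert(1000, "api", "high_latency"),
    ]
    kept = correlate(raw)
    print(f"{len(raw)} raw alerts -> {len(kept)} notifications")
```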

Impact Metrics Dashboard

Quantified improvements across latency, cost, availability, and operational efficiency.

P95 Indexing Latency

Before Optimization

5.0s

After Optimization

1.5s

70% Improvement

Reduced through async vector ingestion and event-driven architecture
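For readers who want to check the figure, the following sketch shows how a P95 value and the percentage reduction can be computed. The nearest-rank percentile method and the helper names are assumptions made for illustration; the source does not say how the latency was measured.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # 5.0s before and 1.5s after are the figures quoted in the dashboard.
    print(f"Reduction: {percent_reduction(5.0, 1.5):.0f}%")
```

With the quoted values, (5.0 - 1.5) / 5.0 works out to a 70% reduction.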

Technical Arsenal

Radical transparency on capabilities. What I own, what I use, and what I'm conquering next.

The Arsenal (Strong)

Battle-tested in production at scale. My primary weapons of choice.

Kubernetes (EKS/GKE) & Docker
SRE (SLO/SLI/Error Budgets)
Python & Go
Terraform / GitOps / IaC
AWS & GCP Architecture
Observability (Prometheus/Grafana/Tracing)
Vector Search & RAG Pipelines

The Expansion (Moderate)

Competent and capable. Used for enabling the core infrastructure.

Frontend (React/Next.js)
Java / Spring Boot
Azure Platform
Data & Streaming (Kafka, Pub/Sub)
Security (Vault, IAM, CSPM)
PostgreSQL & Redis

The Frontier (Focus)

Actively researching and working to master.

Advanced MLOps Control Planes
GenAI Model Serving & Monitoring
Vector DB Operations at Scale
WASM at the Edge

Career Journey

8+ Years of Evolution

From blockchain development to building planet-scale GenAI platforms—a journey of continuous growth

The Journey

Contextualizing the pivotal shifts in infrastructure strategy across scale.

StartupManch Pvt. Ltd.

Co-Founder & Engineering Lead

Jul 2020 – Present

Designed multi-tenant control/data planes on GCP processing 5M+ events/day and 100K–500K queries/day
Reduced P95 indexing latency ~70% with async ingestion and vector pipelines
Built internal GitOps platform cutting deploy time 3 days → <6 hours

Okta

Staff Software Engineer

Nov 2023 – Jun 2025

Led zero-downtime EKS upgrades sustaining 99.9% availability across global fleets
Reduced alert fatigue by 50% through AIOps correlation and suppression
Cut cloud spend 15–20% with ML-based capacity forecasting

Lookout

Senior Software Engineer

Jan 2022 – Nov 2023

Automated multi-region provisioning with Terraform/Python → 40% faster rollout
Designed global DNS failover reducing customer incidents by 90%
Reduced platform costs 15% while supporting 50% traffic growth

NowFloats Technologies

Senior (Lead) Software Engineer

May 2021 – Jan 2022

Built multi-tenant edge delivery and reverse-proxy platform
Designed programmable L7 load balancer and asset optimization pipeline
Improved page load performance by ~40%

ICBC (Blockchain)

Contractor [Blockchain Consultant]

Aug 2018 – Feb 2019

Architected Secured Ledger Wallet using Hyperledger + Ethereum
Designed Encrypted P2P Communication Protocols
Enabled Immutable Transaction History Investigation

The Whitepapers

Engineering at the Edge of Chaos.

The AI Orchestrator

Taming the Chaos of Multi-LLM Routing at 500k QPD.

The Risk

"Unpredictable 'burst' traffic from viral AI agents threatened to capsize the control plane, creating latency spikes > 5s."

The Solution

Architected a 'Shared Control Plane' with predictive autoscaling and vector-caching layers. Decoupled ingestion from varying LLM latency profiles.
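The decoupling idea can be sketched with a bounded queue between ingestion and workers whose latency varies. This is a toy model under stated assumptions: `run_pipeline`, the worker delays, and the queue size are hypothetical, and the predictive autoscaling and vector-caching layers are not shown.

```python
import asyncio


async def ingest(queue: asyncio.Queue, events) -> None:
    """Producer: enqueue events, blocking only when the queue is full."""
    for event in events:
        await queue.put(event)  # a bounded queue applies backpressure here


async def worker(queue: asyncio.Queue, results: list, delay: float) -> None:
    """Consumer: simulates a backend with its own latency profile."""
    while True:
        event = await queue.get()
        await asyncio.sleep(delay)  # stand-in for a slow downstream call
        results.append(event)
        queue.task_done()


async def run_pipeline(events, workers: int = 3, max_queue: int = 10) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    results: list = []
    # Each worker gets a different simulated latency.
    tasks = [
        asyncio.create_task(worker(queue, results, 0.001 * (i + 1)))
        for i in range(workers)
    ]
    await ingest(queue, events)
    await queue.join()  # wait until every queued event is processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


if __name__ == "__main__":
    processed = asyncio.run(run_pipeline(list(range(20))))
    print(f"processed {len(processed)} events")
```

Because producers only wait when the queue is full, a slow worker delays its own items rather than stalling ingestion outright.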

The Outcome

Stabilized 500K+ daily queries across Perplexity & ChatGPT.
Reduced P95 Indexing Latency by 70%.
Cut Cloud Spend by 25% via Spot Instance orchestration.

Full whitepaper restricted; release coming soon.

Architecture Blueprint: v2.0-SCALE

Architecture: Event-Driven Vector Mesh

Technology Adoption Framework

Adopt: Production-ready, recommended

Trial: Testing in real scenarios

Assess: Evaluating potential

Hold: Phasing out or not recommended

Engineering Insights & Lessons

Thought leadership distilled from building planet-scale systems. Real problems, real solutions.

Featured Insight

Building for the Jagged Edges of LLM Latency

Why your GenAI platform must be designed for worst-case latency, not averages. A deep dive into event-driven vector mesh architectures.

"Abstractions leak, but poor abstractions flood. In GenAI systems, you must build for the jagged edges of LLM latency, not just the averages."

8 min read

GenAI • Architecture • Performance

SRE & Reliability • 10 min read

Observability is the Only Reality That Matters

At web scale, you cannot verify what you cannot observe. How to build observable systems that actually tell the truth.

"At web scale, hope is not a strategy. You cannot verify what you cannot observe. Observability is the only reality that matters."

SRE • Observability • Kubernetes

Platform Engineering • 7 min read

Automation and the Provenance of State

Why automation isn't just about speed—it's about knowing the exact state of your infrastructure at any moment in time.

"Automation isn't just about speed; it's about the provenance of state. If a human has to touch production, the architecture has failed."

Automation • IaC • DevOps

Platform Engineering • 6 min read

Standardization: The Prerequisite to Scale

You cannot optimize what you haven't defined. How standardization unlocks velocity at scale.

"Standardization is the prerequisite to scale. You cannot optimize what you haven't defined."

Standards • Platform • Scale

Blockchain & Security • 9 min read

In Fintech, Trust is an Engineering Specification

Building immutable systems where trust is not a feeling but a mathematical certainty. Code is law.

"In Fintech, trust is an engineering specification, not a feeling. Code is law."

Blockchain • Security • Fintech

Platform Engineering Insights Newsletter

Monthly deep-dives on Kubernetes, GenAI infrastructure, and building platforms that scale. No fluff, just systems thinking.

Academic Foundation

Graduate-level training in software engineering and distributed systems, paired with telecommunications fundamentals.

M.Tech in Software Engineering

BITS Pilani

Advanced Software Systems, Distributed Computing, Cloud Architecture

B.Tech in Electronics & Telecommunication

KIIT University

Computer Networks, Embedded Systems, Signal Processing

Community & Impact

Contributing to the cloud-native ecosystem and sharing platform engineering insights.

Open Source Contributor

Active contributor to cloud-native and platform engineering projects on GitHub.

Knowledge Sharing

Sharing learnings on platform engineering, Kubernetes, and GenAI infrastructure through technical writing and community engagement.

Mentorship & Leadership

Mentoring engineers toward senior/staff growth, running incident reviews, and leading architecture discussions across teams.

Engineering Principles

"Abstractions leak, but poor abstractions flood."

"At web scale, hope is not a strategy. Observability is the only reality that matters."

"Automation isn't just about speed; it's about the provenance of state."

"Standardization is the prerequisite to scale. You cannot optimize what you haven't defined."

"In Fintech, trust is an engineering specification, not a feeling. Code is law."

Open Source Activity

Building in public, sharing infrastructure learnings

100+ Projects

Ready to rebuild your infrastructure at planet-scale?

I only work with 3-4 high-stakes organizations per year. Let's determine if we are a strategic fit.

Send Email

Technology Adoption Strategy

Kubernetes
Terraform
Prometheus
ArgoCD
Temporal
Cilium
OpenTelemetry
Pinecone
Wasm
Backstage
Crossplane
Dapr
Jenkins
Helm 2
Docker Swarm
