Staff Software Engineer Portfolio

Devesh Kumar.
Staff Software Engineer & Cloud/Platform Architect.

8+ years designing and operating multi-tenant cloud platforms across Kubernetes, AWS, and GCP. Built GenAI retrieval systems processing 5M+ events/day and 500K+ daily queries with 99.8–99.9% availability. Cut cloud spend 15–25% through predictive autoscaling and FinOps automation while reducing alert noise by 40–60%.

Okta • Lookout • CipherCloud • StartupManch | Bengaluru, India

Strategic Pillars

Deep technical capabilities shaped by platform leadership, GenAI infrastructure, and cloud reliability at scale.

Kubernetes Fleet Reliability

Led zero-downtime fleet upgrades across regulated production clusters, sustaining 99.8–99.9% availability. Established SLO/SLI governance, burn alerts, and incident review loops to harden reliability at scale.
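
A minimal sketch of how a burn alert of this kind can be evaluated, assuming a 99.9% availability SLO and a two-window check. The `Window` type, the `should_page` helper, and the 14.4 threshold (a commonly cited fast-burn value) are illustrative assumptions, not the production configuration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target          # allowed error ratio, e.g. 0.001 for 99.9%
    if window.total == 0:
        return 0.0                     # no traffic, nothing burned
    return (window.errors / window.total) / budget


def should_page(short: Window, long: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the short and the long window burn too fast.

    Requiring both windows filters out brief blips (short window only)
    and stale incidents that have already recovered (long window only).
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, a 2% error ratio in the short window and 1.5% in the long window both exceed 14.4 times the 0.1% budget, so the check pages; a long-window ratio of 0.5% does not.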

Multi-Tenant GenAI Platforms

Architected control/data planes for RAG pipelines processing 5M+ events/day and 100K–500K queries/day. Cut P95 indexing latency ~70% and maintained 99.8% SLOs under bursty AI traffic.

AIOps & FinOps Intelligence

Built anomaly detection and signal correlation to reduce alert noise 40–60%, paired with ML-based capacity forecasting and autoscaling to cut cloud spend 15–25%+ without SLO regressions.
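
One simple form of signal correlation is to collapse repeated alerts for the same service and symptom into a single incident. The sketch below assumes a hypothetical `Alert` record and a 5-minute grouping window; it illustrates the idea only and is not the system described above.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    """Hypothetical alert record: timestamp in seconds, service, symptom."""
    ts: float
    service: str
    symptom: str


def correlate(alerts, window_s: float = 300.0):
    """Group alerts sharing (service, symptom) into incidents.

    An alert joins the open incident for its key if it fires within
    window_s of that incident's most recent alert; otherwise it opens
    a new incident.
    """
    incidents = []
    open_idx = {}  # (service, symptom) -> index of the latest incident
    for a in sorted(alerts, key=lambda a: a.ts):
        key = (a.service, a.symptom)
        i = open_idx.get(key)
        if i is not None and a.ts - incidents[i]["last_ts"] <= window_s:
            incidents[i]["count"] += 1
            incidents[i]["last_ts"] = a.ts
        else:
            open_idx[key] = len(incidents)
            incidents.append(
                {"key": key, "first_ts": a.ts, "last_ts": a.ts, "count": 1}
            )
    return incidents


def noise_reduction(alerts, incidents) -> float:
    """Fraction of raw alerts suppressed by grouping."""
    return 1.0 - len(incidents) / len(alerts) if alerts else 0.0
```

Only the first alert of each incident would notify a human; the rest are attached to it as context.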

Impact Metrics Dashboard

Quantified improvements across latency, cost, availability, and operational efficiency.

P95 Indexing Latency

Before optimization: 5.0s
After optimization: 1.5s
Improvement: 70%

Reduced through async vector ingestion and event-driven architecture
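
A minimal sketch of the async ingestion pattern, assuming an in-process `asyncio.Queue` in place of a real event bus and a dictionary in place of a vector store. The names `ingest`, `worker`, and `embed_and_index` are hypothetical. The point is that producers only enqueue, while background workers batch the slow embedding and indexing work.

```python
import asyncio


async def embed_and_index(batch, index):
    """Stand-in for embedding a batch and upserting vectors (assumption)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    for doc_id, text in batch:
        index[doc_id] = len(text)  # placeholder "vector"


async def worker(queue, index, batch_size=3):
    """Drain the queue in small batches so indexing cost is amortized."""
    while True:
        batch = [await queue.get()]
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        await embed_and_index(batch, index)
        for _ in batch:
            queue.task_done()


async def ingest(events, n_workers=2):
    """Enqueue events and wait until background workers have indexed them."""
    queue = asyncio.Queue(maxsize=100)  # bounded queue gives backpressure
    index = {}
    workers = [asyncio.create_task(worker(queue, index))
               for _ in range(n_workers)]
    for event in events:
        await queue.put(event)  # producers return as soon as the item is queued
    await queue.join()          # block until every queued item is processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return index
```

Calling `asyncio.run(ingest([("a", "hello"), ("b", "hi")]))` returns the populated index once both events have been processed.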

Technical Arsenal

Radical transparency on capabilities. What I own, what I use, and what I'm conquering next.

The Arsenal (Strong)

Battle-tested in production at scale. My primary weapons of choice.

  • Kubernetes (EKS/GKE) & Docker
  • SRE (SLO/SLI/Error Budgets)
  • Python & Go
  • Terraform / GitOps / IaC
  • AWS & GCP Architecture
  • Observability (Prometheus/Grafana/Tracing)
  • Vector Search & RAG Pipelines

The Expansion (Moderate)

Competent and capable. Used for enabling the core infrastructure.

  • Frontend (React/Next.js)
  • Java / Spring Boot
  • Azure Platform
  • Data & Streaming (Kafka, Pub/Sub)
  • Security (Vault, IAM, CSPM)
  • PostgreSQL & Redis

The Frontier (Focus)

Areas I am actively researching and working to master.

  • Advanced MLOps Control Planes
  • GenAI Model Serving & Monitoring
  • Vector DB Operations at Scale
  • WASM at the Edge

Career Journey

8+ Years of Evolution

From blockchain development to building planet-scale GenAI platforms—a journey of continuous growth

The Journey

Contextualizing the pivotal shifts in infrastructure strategy across scale.

StartupManch Pvt. Ltd.

Co-Founder & Engineering Lead

Jul 2020 – Present
  • Designed multi-tenant control/data planes on GCP processing 5M+ events/day and 100K–500K queries/day
  • Reduced P95 indexing latency ~70% with async ingestion and vector pipelines
  • Built internal GitOps platform cutting deploy time 3 days → <6 hours

Okta

Staff Software Engineer

Nov 2023 – Jun 2025
  • Led zero-downtime EKS upgrades sustaining 99.9% availability across global fleets
  • Reduced alert fatigue by 50% through AIOps correlation and suppression
  • Cut cloud spend 15–20% with ML-based capacity forecasting

Lookout

Senior Software Engineer

Jan 2022 – Nov 2023
  • Automated multi-region provisioning with Terraform/Python → 40% faster rollout
  • Designed global DNS failover reducing customer incidents by 90%
  • Reduced platform costs 15% while supporting 50% traffic growth

NowFloats Technologies

Senior (Lead) Software Engineer

May 2021 – Jan 2022
  • Built multi-tenant edge delivery and reverse-proxy platform
  • Designed programmable L7 load balancer and asset optimization pipeline
  • Improved page load performance by ~40%

ICBC (Blockchain)

Contractor [Blockchain Consultant]

Aug 2018 – Feb 2019
  • Architected a secure ledger wallet using Hyperledger and Ethereum
  • Designed encrypted P2P communication protocols
  • Enabled investigation of immutable transaction history

The Whitepapers

Engineering at the Edge of Chaos.

The AI Orchestrator

Taming the Chaos of Multi-LLM Routing at 500k QPD.

The Risk

"Unpredictable 'burst' traffic from viral AI agents threatened to capsize the control plane, creating latency spikes > 5s."

The Solution

Architected a 'Shared Control Plane' with predictive autoscaling and vector-caching layers. Decoupled ingestion from varying LLM latency profiles.
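
A minimal sketch of the predictive autoscaling idea, using an exponentially weighted moving average as a simple stand-in for a real forecasting model. The function names, the 1.3 headroom factor, and the replica limits are assumptions chosen for illustration.

```python
import math


def ewma_forecast(samples, alpha=0.5):
    """Smooth recent load samples; higher alpha reacts faster to change."""
    if not samples:
        raise ValueError("need at least one load sample")
    level = samples[0]
    for x in samples[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


def desired_replicas(samples, per_replica_qps, headroom=1.3,
                     min_replicas=2, max_replicas=50, alpha=0.5):
    """Turn a load forecast into a replica count ahead of the next interval."""
    forecast = ewma_forecast(samples, alpha)
    # Never plan for less than the latest observation: bursts ramp quickly
    # and a smoothed forecast lags behind them.
    expected = max(forecast, samples[-1])
    needed = math.ceil(expected * headroom / per_replica_qps)
    return max(min_replicas, min(max_replicas, needed))
```

With samples of 100, 200, and 400 QPS and 50 QPS per replica, the smoothed forecast is 275 QPS, the plan uses the latest 400 QPS reading, and the result is 11 replicas after headroom.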

The Outcome
  • Stabilized 500K+ daily queries across Perplexity & ChatGPT.
  • Reduced P95 Indexing Latency by 70%.
  • Cut Cloud Spend by 25% via Spot Instance orchestration.

Architecture Blueprint: v2.0-SCALE

API Gateway • Load Balancer • Auth Service • Data Service • Sharded DB

Architecture: Event-Driven Vector Mesh

Technology Radar

Technology Adoption Strategy

A living framework mapping technologies across adoption stages—from production-ready tools to emerging trends under evaluation

  • Kubernetes (Ring 1, Adopt): Container orchestration in production
  • Terraform (Ring 1, Adopt): Infrastructure as Code
  • Prometheus (Ring 1, Adopt): Monitoring & alerting
  • ArgoCD (Ring 1, Adopt): GitOps deployment
  • Temporal (Ring 2, Trial): Workflow orchestration
  • Cilium (Ring 2, Trial): eBPF-based networking
  • OpenTelemetry (Ring 2, Trial): Observability standard
  • Pinecone (Ring 2, Trial): Vector database for GenAI
  • Wasm (Ring 3, Assess): Edge computing runtime
  • Backstage (Ring 3, Assess): Developer portal
  • Crossplane (Ring 3, Assess): Cloud resource management
  • Dapr (Ring 3, Assess): Distributed app runtime
  • Jenkins (Ring 4, Hold): Replaced with GitOps
  • Helm 2 (Ring 4, Hold): Migrated to Helm 3
  • Docker Swarm (Ring 4, Hold): Replaced with K8s

Technology Adoption Framework

Adopt: Production-ready, recommended
Trial: Testing in real scenarios
Assess: Evaluating potential
Hold: Phasing out or not recommended

Engineering Insights & Lessons

Thought leadership distilled from building planet-scale systems. Real problems, real solutions.

Featured Insight

Building for the Jagged Edges of LLM Latency

Why your GenAI platform must be designed for worst-case latency, not averages. A deep dive into event-driven vector mesh architectures.

"Abstractions leak, but poor abstractions flood. In GenAI systems, you must build for the jagged edges of LLM latency, not just the averages."

8 min read
GenAI • Architecture • Performance

Platform Engineering Insights Newsletter

Monthly deep-dives on Kubernetes, GenAI infrastructure, and building platforms that scale. No fluff, just systems thinking.

Academic Foundation

Graduate-level training in software engineering and distributed systems, paired with telecommunications fundamentals.

M.Tech in Software Engineering

BITS Pilani

Advanced Software Systems, Distributed Computing, Cloud Architecture

B.Tech in Electronics & Telecommunication

KIIT University

Computer Networks, Embedded Systems, Signal Processing

Community & Impact

Contributing to the cloud-native ecosystem and sharing platform engineering insights.

Open Source Contributor

Active contributor to cloud-native and platform engineering projects on GitHub.

Knowledge Sharing

Sharing learnings on platform engineering, Kubernetes, and GenAI infrastructure through technical writing and community engagement.

Mentorship & Leadership

Mentoring engineers toward senior/staff growth, running incident reviews, and leading architecture discussions across teams.

Engineering Principles

01

"Abstractions leak, but poor abstractions flood."

02

"At web scale, hope is not a strategy. Observability is the only reality that matters."

03

"Automation isn't just about speed; it's about the provenance of state."

04

"Standardization is the prerequisite to scale. You cannot optimize what you haven't defined."

05

"In Fintech, trust is an engineering specification, not a feeling. Code is law."

Open Source Activity

Building in public, sharing infrastructure learnings

8+ Years • 5 Companies • 100+ Projects

Ready to rebuild your infrastructure at planet-scale?

I only work with 3-4 high-stakes organizations per year. Let's determine if we are a strategic fit.
