A showcase of professional work and personal projects exploring platform engineering, MLOps, and cloud-native technologies.
In-depth looks at platform challenges, approaches, and outcomes.
Data-intensive analytics company processing intelligence data across multiple regions, requiring a modern data platform to handle 50TB+ of data per day under strict governance requirements.
The existing data infrastructure couldn't scale to meet growing data volumes. Manual cluster management led to inefficient resource utilization, and lack of standardized pipelines created inconsistency across data teams. Cost visibility was poor, making optimization difficult.
Designed a multi-region Databricks platform with Infrastructure as Code at its core, implementing medallion architecture for data organization and self-service capabilities for data teams.
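The medallion layering at the heart of the design can be sketched in a few lines. This is an illustrative toy, not the production pipeline: the record fields and aggregation are made up, and only the standard bronze/silver/gold convention comes from the case study.

```python
# Medallion-style pipeline sketch: raw events land in "bronze",
# are validated into "silver", and aggregated into "gold".
# Field names and logic are illustrative only.

def to_bronze(raw_events):
    """Bronze: persist records as-is, tagged with their source."""
    return [{"raw": e, "source": "ingest"} for e in raw_events]

def to_silver(bronze):
    """Silver: normalize and drop records missing required fields."""
    silver = []
    for rec in bronze:
        e = rec["raw"]
        if "region" in e and "bytes" in e:
            silver.append({"region": e["region"].lower(),
                           "bytes": int(e["bytes"])})
    return silver

def to_gold(silver):
    """Gold: business-level aggregate, e.g. bytes processed per region."""
    totals = {}
    for rec in silver:
        totals[rec["region"]] = totals.get(rec["region"], 0) + rec["bytes"]
    return totals

events = [
    {"region": "EU", "bytes": "1024"},
    {"region": "US", "bytes": "2048"},
    {"bytes": "999"},  # malformed: filtered out at the silver layer
]
print(to_gold(to_silver(to_bronze(events))))  # {'eu': 1024, 'us': 2048}
```

The point of the layering is that each stage is independently testable and replayable: bronze is an immutable record of what arrived, so silver and gold can be rebuilt from it at any time.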
Global fashion retailer (Calvin Klein, Tommy Hilfiger) running 200+ microservices on legacy VM infrastructure, facing scaling challenges and slow deployment velocity.
Weekly deployments were the norm, with each taking 2+ hours. Teams waited days for infrastructure provisioning. Observability was fragmented across tools, making incident response slow. The VM-based infrastructure couldn't efficiently handle traffic spikes.
Built a Kubernetes-native platform with self-service capabilities, comprehensive observability, and GitOps-driven deployments. Migrated services incrementally with feature flags to minimize risk.
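One way to read "migrated incrementally with feature flags" is percentage-based traffic shifting. The sketch below is an assumption about the mechanism, not a description of the actual rollout: a deterministic hash of the user ID keeps each user pinned to the same backend as the Kubernetes percentage is dialed up.

```python
import hashlib

# Hypothetical sketch of flag-driven traffic shifting during a
# VM-to-Kubernetes migration. The backend names and bucketing
# scheme are illustrative assumptions.

def backend_for(user_id: str, k8s_percent: int) -> str:
    """Deterministically bucket a user into 0-99, route by rollout %."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "kubernetes" if bucket < k8s_percent else "legacy-vm"

# At 0% everyone stays on VMs; at 100% everyone is on Kubernetes,
# and any user's assignment is stable as the percentage grows.
assert backend_for("user-42", 0) == "legacy-vm"
assert backend_for("user-42", 100) == "kubernetes"
```

Determinism is the key property: a user who lands on the new platform at 10% stays there at 20%, so a bad rollout can be rolled back by lowering one number.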
Large-scale observability requirements for 200+ services generating terabytes of logs daily, with existing Elasticsearch clusters becoming increasingly expensive and difficult to manage.
Elasticsearch licensing costs were escalating rapidly. Cluster management was manual and error-prone. Index lifecycle management was inconsistent, leading to storage bloat. Teams lacked self-service capabilities for creating dashboards and alerts.
Migrated to OpenSearch with a focus on operational efficiency, implementing automated index lifecycle management and self-service patterns for development teams.
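Automated index lifecycle management in OpenSearch is expressed as an ISM (Index State Management) policy. The sketch below follows the ISM policy schema, but the state names, ages, and sizes are illustrative, not the production values; the helper function just walks the transitions to show how an index ages through the states.

```python
# Illustrative OpenSearch ISM policy: hot -> warm -> delete.
# Schema fields follow ISM conventions; thresholds are made up.
log_lifecycle_policy = {
    "policy": {
        "description": "hot -> warm -> delete lifecycle for service logs",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [{"rollover": {"min_size": "50gb"}}],
                "transitions": [{"state_name": "warm",
                                 "conditions": {"min_index_age": "7d"}}],
            },
            {
                "name": "warm",
                "actions": [{"replica_count": {"number_of_replicas": 1}}],
                "transitions": [{"state_name": "delete",
                                 "conditions": {"min_index_age": "30d"}}],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}

def state_for_age(policy: dict, age_days: int) -> str:
    """Walk the transitions to find the state an index of this age lands in."""
    states = {s["name"]: s for s in policy["policy"]["states"]}
    current = policy["policy"]["default_state"]
    moved = True
    while moved:
        moved = False
        for t in states[current]["transitions"]:
            min_age = int(t["conditions"]["min_index_age"].rstrip("d"))
            if age_days >= min_age:
                current, moved = t["state_name"], True
                break
    return current

print(state_for_age(log_lifecycle_policy, 3))   # hot
print(state_for_age(log_lifecycle_policy, 10))  # warm
print(state_for_age(log_lifecycle_policy, 45))  # delete
```

Encoding the lifecycle as a policy document is what removes the storage bloat: retention stops being a per-team manual chore and becomes one reviewed artifact applied to every matching index.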
Personal projects exploring MLOps, AI infrastructure, and advanced Kubernetes patterns.
Kubernetes-native AI inference gateway for multi-model routing, A/B testing, and intelligent failover. Features circuit breakers with exponential backoff, OpenTelemetry tracing, smart routing (cost/latency/context-length), and configuration hot-reload. CNCF Sandbox candidate.
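The circuit-breaker-with-exponential-backoff pattern mentioned above can be sketched minimally. The thresholds, backoff base, and class shape here are illustrative assumptions, not the gateway's actual implementation; a fake clock keeps the example deterministic.

```python
import time

# Minimal circuit breaker with exponential backoff (illustrative
# thresholds; not the gateway's real configuration).
class CircuitBreaker:
    def __init__(self, failure_threshold=3, base_backoff_s=1.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.base_backoff_s = base_backoff_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def _backoff(self):
        # Window doubles with each consecutive failure past the threshold.
        return self.base_backoff_s * 2 ** (self.failures - self.failure_threshold)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow one probe once the backoff window elapses.
        return self.clock() - self.opened_at >= self._backoff()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

# Simulated clock so the behavior is reproducible.
now = [0.0]
cb = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    cb.record_failure()
assert not cb.allow_request()   # open: reject immediately, fail over
now[0] += 1.0
assert cb.allow_request()       # backoff elapsed: probe allowed
cb.record_success()
assert cb.allow_request()       # closed again
```

In a multi-model gateway, one breaker per upstream model lets the router fail over to the next candidate the moment a provider starts timing out, instead of queueing requests behind it.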
Production-ready cost optimization platform for AI/ML workloads. Features GPU utilization monitoring, budget forecasting with alerts, ML-based anomaly detection, automated right-sizing recommendations, and multi-cloud billing integration. All 3 phases complete.
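As a flavor of the spend monitoring described above, the simplest statistical baseline for anomaly detection is a z-score check against trailing history. This is a toy, not the platform's actual model; the threshold and spend figures are invented for illustration.

```python
import statistics

# Toy spend-anomaly check: flag a day's GPU spend when it deviates
# from the trailing mean by more than k standard deviations.
# (Illustrative baseline only, not the platform's ML model.)
def is_spend_anomaly(history, today, k=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > k * stdev

daily_spend = [102, 98, 105, 99, 101, 97, 103]  # USD/day, made up
assert not is_spend_anomaly(daily_spend, 104)   # within normal variation
assert is_spend_anomaly(daily_spend, 250)       # runaway job: alert
```

Even a baseline like this catches the common failure mode (a forgotten GPU instance left running overnight) cheaply, with the ML model reserved for subtler drift.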
Production-ready multi-cloud MLOps platform on AWS EKS, Azure AKS, and GCP GKE with defense-in-depth security. Enables data science teams to deploy ML models and LLMs from experimentation to production in 15 minutes with full auditability, drift detection, and GitOps-driven infrastructure.
GPU compute price aggregator — "Trivago for ML training". Arbitrages spot pricing across AWS, RunPod, and Lambda Labs to find the cheapest GPU instances for batch training jobs.
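The core arbitrage step reduces to a minimum over per-provider quotes. The provider names below come from the project description, but the prices are made up for illustration and are not live quotes.

```python
# Pick the cheapest spot quote for a GPU type across providers.
# Prices are illustrative, not live data.
def cheapest_offer(quotes):
    """quotes: {provider: USD per GPU-hour}; returns (provider, price)."""
    return min(quotes.items(), key=lambda kv: kv[1])

a100_quotes = {
    "aws-spot": 1.85,
    "runpod": 1.19,
    "lambda-labs": 1.29,
}
provider, price = cheapest_offer(a100_quotes)
print(provider, price)  # runpod 1.19
```

The hard part in practice is everything around this line: normalizing instance shapes across providers and refreshing quotes fast enough that the "cheapest" answer is still true when the job launches.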
Docker Compose for AI Agents — Declarative spec that deploys AI agent stacks to Kubernetes. GitOps-native with Kortex integration for inference governance. Transparent abstraction: generates readable K8s manifests you own.
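The "transparent abstraction" claim — readable manifests you own — can be illustrated as a small spec-to-Deployment renderer. The spec fields and output shape below are assumptions for illustration; only the idea of generating plain Kubernetes manifests comes from the project description.

```python
# Render a small declarative agent spec into a plain Kubernetes
# Deployment dict the user can read, diff, and commit (illustrative
# spec fields; not the project's actual schema).
def render_deployment(spec: dict) -> dict:
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec["name"]},
        "spec": {
            "replicas": spec.get("replicas", 1),
            "selector": {"matchLabels": {"app": spec["name"]}},
            "template": {
                "metadata": {"labels": {"app": spec["name"]}},
                "spec": {
                    "containers": [{
                        "name": spec["name"],
                        "image": spec["image"],
                        "env": [{"name": k, "value": v}
                                for k, v in spec.get("env", {}).items()],
                    }]
                },
            },
        },
    }

agent_spec = {"name": "research-agent",
              "image": "ghcr.io/example/agent:0.1",
              "replicas": 2}
manifest = render_deployment(agent_spec)
print(manifest["kind"], manifest["spec"]["replicas"])  # Deployment 2
```

Because the output is ordinary YAML-serializable manifests rather than an opaque controller state, the generated files fit directly into a GitOps repo and remain usable even if the tool itself is removed.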