A showcase of professional work and personal projects exploring platform engineering, MLOps, and cloud-native technologies.
In-depth looks at platform challenges, approaches, and outcomes.
Data-intensive analytics company processing intelligence data across multiple regions, requiring a modern data platform to handle more than 50 TB of data per day under strict governance requirements.
The existing data infrastructure couldn't scale to meet growing data volumes. Manual cluster management led to inefficient resource utilization, and lack of standardized pipelines created inconsistency across data teams. Cost visibility was poor, making optimization difficult.
Designed a multi-region Databricks platform with Infrastructure as Code at its core, implementing medallion architecture for data organization and self-service capabilities for data teams.
Global fashion retailer (Calvin Klein, Tommy Hilfiger) running 200+ microservices on legacy VM infrastructure, facing scaling challenges and slow deployment velocity.
Weekly deployments were the norm, with each taking 2+ hours. Teams waited days for infrastructure provisioning. Observability was fragmented across tools, making incident response slow. The VM-based infrastructure couldn't efficiently handle traffic spikes.
Built a Kubernetes-native platform with self-service capabilities, comprehensive observability, and GitOps-driven deployments. Migrated services incrementally with feature flags to minimize risk.
Large-scale observability requirements for 200+ services generating terabytes of logs daily, with existing Elasticsearch clusters becoming increasingly expensive and difficult to manage.
Elasticsearch licensing costs were escalating rapidly. Cluster management was manual and error-prone. Index lifecycle management was inconsistent, leading to storage bloat. Teams lacked self-service capabilities for creating dashboards and alerts.
Migrated from the ELK stack to AWS Managed OpenSearch with a focus on cost reduction and operational efficiency, implementing automated index lifecycle management and self-service patterns for development teams.
Healthcare SaaS needing a secure, EU-compliant platform for an LLM-powered clinical trial matching API. B2B API product processing de-identified medical records through multi-model LLM inference.
No infrastructure existed for the new Matching API product. It needed dedicated isolation for medical data compliance, co-located LLM inference to keep network latency to a minimum, SQS-driven async processing with autoscaling workers, and defense-in-depth security for a regulated healthcare environment.
Designed and delivered a dedicated EKS cluster with defense-in-depth security, co-located LLM proxy, CloudFront VPC Origins for zero-exposure production APIs, and KEDA-driven autoscaling for async batch LLM processing.
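As a rough illustration of queue-driven autoscaling like the KEDA setup mentioned above, the sketch below computes a worker count from queue depth using the general "ceil(backlog / target per replica), clamped to bounds" rule that average-value autoscaling follows. The function name, parameters, and default bounds are assumptions made for this example, not the project's actual configuration, and real autoscalers also apply tolerances and stabilization windows that are left out here.

```python
import math


def desired_workers(queue_length, target_per_worker, min_workers=0, max_workers=10):
    """Hypothetical helper: workers needed to drain a queue backlog.

    queue_length      -- messages currently waiting (e.g. visible SQS messages)
    target_per_worker -- backlog each worker is expected to handle
    """
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_length / target_per_worker)
    # Clamp to the configured bounds; min_workers=0 allows scale-to-zero.
    return max(min_workers, min(max_workers, wanted))
```

With a target of 5 messages per worker, a backlog of 12 messages yields 3 workers, and an empty queue scales to the configured minimum.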
Multi-cloud MLOps reference architecture enabling self-service model deployment across AWS EKS, Azure AKS, and GCP GKE with defense-in-depth security and full-stack observability.
Data science teams were spending 2-3 days per model deployment with no standardized process. Security reviews were ad-hoc, GPU spend was invisible, and there was no consistent path from experimentation to production across clouds.
Built a production-grade MLOps platform with GitOps-driven infrastructure, self-service model serving, and defense-in-depth security across three cloud providers.
Personal projects exploring MLOps, AI infrastructure, and advanced Kubernetes patterns.
Kubernetes-native AI inference gateway for multi-model routing, A/B testing, and intelligent failover. Features circuit breakers with exponential backoff, OpenTelemetry tracing, smart routing (cost/latency/context-length), and configuration hot-reload. CNCF Sandbox candidate.
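To make the "circuit breakers with exponential backoff" and failover ideas concrete, here is a minimal sketch, not the gateway's actual implementation. The class, its thresholds, and the `route` helper are hypothetical names chosen for this example. A breaker opens after a number of consecutive failures, and each time it re-trips the open period doubles up to a cap.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: the open period doubles on each consecutive trip."""

    def __init__(self, failure_threshold=3, base_delay=1.0, max_delay=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.clock = clock
        self.failures = 0      # consecutive failures while closed
        self.trips = 0         # consecutive times the breaker has opened
        self.open_until = 0.0  # calls are rejected before this time

    def allow(self):
        # Once the open period has elapsed, the next call acts as a probe.
        return self.clock() >= self.open_until

    def record_success(self):
        self.failures = 0
        self.trips = 0

    def record_failure(self):
        self.failures += 1
        # A failed probe (trips > 0) re-opens immediately with a longer delay.
        if self.failures >= self.failure_threshold or self.trips > 0:
            delay = min(self.base_delay * 2 ** self.trips, self.max_delay)
            self.open_until = self.clock() + delay
            self.trips += 1
            self.failures = 0


def route(backends, breakers, call):
    """Try backends in order, skipping any whose breaker is open (failover)."""
    for name in backends:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(name)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("no healthy backend available")
```

Passing the clock in as a parameter keeps the backoff behaviour testable without sleeping.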
Production-ready cost optimization platform for AI/ML workloads. Features GPU utilization monitoring, budget forecasting with alerts, ML-based anomaly detection, automated right-sizing recommendations, and multi-cloud billing integration. All 3 phases complete.
Production-ready multi-cloud MLOps platform on AWS EKS, Azure AKS, and GCP GKE with defense-in-depth security and full-stack observability. Enables data science teams to deploy ML models, HuggingFace transformers, and LLMs from experimentation to production in 15 minutes with full auditability, drift detection, and GitOps-driven infrastructure.
GPU compute price aggregator — "Trivago for ML training". Arbitrages spot pricing across AWS, RunPod, and Lambda Labs to find the cheapest GPU instances for batch training jobs.
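The core comparison behind a price aggregator like this can be sketched as normalizing each offer to a price per GPU-hour and taking the minimum. The `Offer` type and `cheapest_per_gpu_hour` function are hypothetical names for this example, and the prices shown are made-up placeholders, not real quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu: str
    gpus_per_instance: int
    hourly_usd: float  # price for the whole instance, not per GPU


def cheapest_per_gpu_hour(offers, gpu):
    """Return the matching offer with the lowest price per GPU-hour, or None."""
    matching = [o for o in offers if o.gpu == gpu]
    if not matching:
        return None
    return min(matching, key=lambda o: o.hourly_usd / o.gpus_per_instance)


# Placeholder numbers for illustration only.
offers = [
    Offer("aws", "A100", 8, 16.00),
    Offer("runpod", "A100", 1, 1.50),
    Offer("lambda", "A100", 1, 1.75),
]
best = cheapest_per_gpu_hour(offers, "A100")
```

Normalizing per GPU matters because providers bundle different GPU counts per instance, so raw instance prices are not directly comparable.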
Docker Compose for AI Agents — Declarative spec that deploys AI agent stacks to Kubernetes. GitOps-native with Kortex integration for inference governance. Transparent abstraction: generates readable K8s manifests you own.