I build AI-ready platforms as stacked capabilities: each layer builds on the one below, creating a foundation that serves both analytics and ML teams.
The foundation: Kubernetes clusters, infrastructure as code, GitOps workflows, observability, security baselines, and golden paths that enable teams to ship with confidence.
EKS, AKS, GKE clusters with auto-scaling, security policies, and multi-tenancy
Terraform, Pulumi, Crossplane for declarative, version-controlled infrastructure
ArgoCD, Flux, GitHub Actions for automated, auditable deployments
Prometheus, Grafana, OpenSearch for metrics, logs, traces, and alerting
My Defaults: Terraform for IaC unless the logic requires real programming constructs, in which case I use Pulumi. I avoid a service mesh unless the organization has 3+ teams actively maintaining it; the operational overhead rarely justifies the traffic management gains at smaller scale. For observability, I reach for self-managed Prometheus + Grafana over Datadog at 100+ services because per-host pricing becomes non-linear, but I'll use Datadog for teams under 50 services, where operational simplicity matters more. Between those two sizes I decide case by case.
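These defaults can be written down as a small decision rule. The following is a minimal sketch, not a tool I ship: the names OrgProfile and pick_defaults are hypothetical, the thresholds are the ones stated above, and treating the 50 to 99 service range as "case-by-case" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OrgProfile:
    """Hypothetical inputs for the platform defaults described above."""
    service_count: int
    mesh_maintainer_teams: int
    needs_programming_constructs: bool


def pick_defaults(org: OrgProfile) -> Dict[str, str]:
    """Map an organization profile to the stated platform defaults."""
    # Terraform unless the IaC logic needs real programming constructs.
    iac = "pulumi" if org.needs_programming_constructs else "terraform"

    # A service mesh is only worth considering with 3+ teams maintaining it.
    mesh = "consider" if org.mesh_maintainer_teams >= 3 else "avoid"

    # Observability thresholds from the text; the middle band is an assumption.
    if org.service_count >= 100:
        observability = "self-managed prometheus + grafana"
    elif org.service_count < 50:
        observability = "datadog"
    else:
        observability = "case-by-case"

    return {"iac": iac, "service_mesh": mesh, "observability": observability}


if __name__ == "__main__":
    print(pick_defaults(OrgProfile(120, 4, True)))
```

The point of encoding it this way is that each default has an explicit exit condition, so the rule can be challenged threshold by threshold.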
Data foundations on cloud and Kubernetes: lakehouse/warehouse architecture, streaming pipelines, data quality, governance, and cost-efficient storage/compute for analytics and ML.
Databricks, Delta Lake, medallion architecture for unified analytics
OpenSearch, Elasticsearch clusters with optimized indexing and query performance
Kafka, Spark Streaming, real-time data ingestion and transformation
Unity Catalog, access controls, lineage tracking, compliance
My Defaults: Amazon OpenSearch Service (the AWS-managed offering) over self-managed OpenSearch for most organizations. The operational burden of running stateful workloads on Kubernetes is only worth it when you have a dedicated platform team and terabyte-scale data. For lakehouse architecture, Delta Lake is my first choice when Databricks is already in the stack; the ACID transaction guarantees and time travel are worth the ecosystem lock-in.
Infrastructure for the ML lifecycle: feature pipelines, training workflows, experiment tracking, model registry, and production deployment with monitoring and safe rollouts.
Kubeflow, Argo Workflows for reproducible training and feature engineering
MLflow, model versioning, hyperparameter tracking, artifact management
Seldon Core, KServe for A/B testing, canary deployments, autoscaling
Model performance monitoring, drift detection, SLO-driven operations
My Defaults: KServe is my default for model serving on Kubernetes because it supports scale-to-zero (through Knative in its serverless mode) and canary deployments out of the box. I avoid Seldon Core unless the team needs its inference graph abstraction. For experiment tracking, MLflow wins on simplicity and open-source portability; I'd only consider Weights & Biases for teams that need collaborative experiment visualization.
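To make the KServe default concrete, here is a hedged sketch that builds an InferenceService manifest with both features switched on. The fields minReplicas and canaryTrafficPercent belong to KServe's v1beta1 predictor spec; the helper name canary_inference_service, the model name, and the s3://example-bucket storage URI are hypothetical, and the sketch assumes KServe's serverless (Knative) deployment mode, which scale-to-zero depends on.

```python
import json
from typing import Any, Dict


def canary_inference_service(
    name: str,
    storage_uri: str,
    canary_percent: int,
    model_format: str = "sklearn",
) -> Dict[str, Any]:
    """Build a KServe v1beta1 InferenceService manifest as a plain dict."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # minReplicas 0 lets the predictor scale to zero when idle
                # (assumes serverless mode backed by Knative).
                "minReplicas": 0,
                # Share of traffic routed to the newest revision; the rest
                # stays on the previously rolled-out revision.
                "canaryTrafficPercent": canary_percent,
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                },
            }
        },
    }


if __name__ == "__main__":
    manifest = canary_inference_service(
        "churn-model", "s3://example-bucket/models/churn/v2", canary_percent=10
    )
    # JSON is valid YAML, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Promoting the canary is then a matter of raising canary_percent in steps and re-applying, which keeps the rollout history in Git alongside the rest of the platform configuration.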
Hardware-aware cloud platforms for AI: GPU clusters, job schedulers, inference/training stacks, and aggressive optimization for performance and cost at scale.
NVIDIA GPU Operator, node pools, scheduling for training and inference
vLLM, Ollama, multi-model serving with intelligent routing
Spot instances, autoscaling, GPU utilization monitoring, right-sizing
Inference latency, queue depth, GPU metrics, training job monitoring
My Defaults: I always start with spot/preemptible instances for training workloads and batch inference; savings of 60% or more justify the checkpointing complexity. For LLM inference, vLLM with continuous batching is my default engine; the throughput from PagedAttention consistently beats TGI in benchmarks I've run. Cost attribution should happen at the inference gateway layer, not as an afterthought in billing dashboards.
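The "checkpointing complexity" that spot instances demand is mostly a resume-from-last-checkpoint loop. Below is a minimal, framework-free sketch of that pattern. The function names, the JSON state format, and the stop_after parameter (which simulates a preemption) are all hypothetical; a real job would save model and optimizer state to durable storage rather than a local JSON file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def save_checkpoint(path: Path, state: Dict[str, Any]) -> None:
    """Write atomically so a preemption mid-write never corrupts the file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: Path) -> Dict[str, Any]:
    """Return the last saved state, or a fresh state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "loss": None}


def train(
    path: Path,
    total_steps: int,
    checkpoint_every: int,
    stop_after: Optional[int] = None,
) -> Dict[str, Any]:
    """Run (or resume) training; stop_after simulates a spot preemption."""
    state = load_checkpoint(path)
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break  # instance reclaimed: work since the last save is lost
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real step
        steps_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = Path(workdir) / "checkpoint.json"
        train(ckpt, total_steps=100, checkpoint_every=10, stop_after=25)
        print("resuming from step", load_checkpoint(ckpt)["step"])
        print("finished at step", train(ckpt, 100, 10)["step"])
```

The checkpoint interval is the knob that trades wasted work against write overhead: a preemption costs at most checkpoint_every steps, so the interval should be sized against how often the instances actually get reclaimed.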