Migrating from ELK to AWS Managed OpenSearch: What I Learned at PVH
An opinionated take on migrating a TB-scale observability platform from the ELK stack to AWS Managed OpenSearch for 200+ microservices at PVH Europe, and when managed beats self-managed.
The Decision That Shaped Two Years of My Work
In 2023, I faced a decision that would define the next two years of platform work at PVH Europe: should we migrate our ELK stack to self-managed OpenSearch on Kubernetes, or go with AWS Managed OpenSearch?
We chose managed. It was the right call for us. It might be the wrong call for you. This post explains how I think about that decision.
Context: What We Were Dealing With
PVH Europe runs the e-commerce platforms for Calvin Klein and Tommy Hilfiger. That means 200+ microservices generating terabytes of logs per day, and a requirement for sub-second search latency so that incident response holds up during peak events like Black Friday.
Our existing Elasticsearch clusters were expensive, and licensing changes were making them more so. We needed to move, and the question was where.
The Decision Matrix
Here is how I evaluated the two options across the dimensions that actually mattered:
Operational overhead: This is where managed services win, and win hard. Self-managed means you own upgrades, security patches, backup verification, and cluster health monitoring. Our platform team was already stretched across Kubernetes, CI/CD, and observability. Adding another stateful system to operate would have been a serious drag on velocity.
Cost at scale: AWS Managed OpenSearch pricing is per-instance-hour, and at TB-scale the per-node cost is higher than running equivalent EC2 instances yourself. But the total cost of ownership includes engineering time for upgrades, patching, and incident response. When we modeled the full TCO including operational labor, managed came out ahead.
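To make the TCO argument concrete, here is a minimal sketch of the comparison: infrastructure cost plus the engineering time spent operating the cluster. Every number below is a placeholder chosen for illustration, not a PVH figure or an AWS price, and the `Option` class is hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation used for monthly cost estimates


@dataclass
class Option:
    name: str
    node_count: int
    node_hourly_usd: float      # placeholder per-node price, not a real quote
    ops_hours_per_month: float  # upgrades, patching, incidents, backup checks
    engineer_hourly_usd: float  # placeholder loaded labor rate

    def infra_cost(self) -> float:
        return self.node_count * self.node_hourly_usd * HOURS_PER_MONTH

    def labor_cost(self) -> float:
        return self.ops_hours_per_month * self.engineer_hourly_usd

    def monthly_tco(self) -> float:
        return self.infra_cost() + self.labor_cost()


# Illustrative inputs only: managed costs more per node but needs fewer ops hours.
managed = Option("managed", 12, 1.00, 20, 100.0)
self_managed = Option("self-managed", 12, 0.70, 80, 100.0)

for option in (managed, self_managed):
    print(f"{option.name}: infra={option.infra_cost():.0f} "
          f"labor={option.labor_cost():.0f} total={option.monthly_tco():.0f}")
```

The outcome depends entirely on the inputs. With a lower estimate for operational hours, or a much larger node count, the same model can favor self-managed.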
Tiering control: We needed hot/warm/cold data tiering with ISM (Index State Management) policies that moved indices based on age and access patterns. AWS Managed OpenSearch supports UltraWarm and cold storage natively. The tiering granularity is less flexible than what you can do with dedicated Kubernetes node pools, but it was sufficient for our retention requirements.
Reliability during peak events: Black Friday and seasonal sales are non-negotiable. AWS manages the underlying infrastructure availability, patching, and automated backups. For a revenue-critical e-commerce platform, offloading that responsibility was worth the cost premium.
What We Gained by Going Managed
Zero-downtime upgrades: AWS orchestrates version upgrades for the cluster, running them as blue/green deployments rather than restarts we have to coordinate ourselves. With self-managed, every minor version upgrade would have required testing in staging, coordinating rolling restarts, and checking plugin compatibility. Major versions would need a dedicated sprint. We got that time back.
Automated backups and recovery: Automated snapshots to S3 with configurable retention, managed by AWS. No need to build restore drill pipelines or verify backup integrity manually. The managed service handles this transparently.
Reduced on-call burden: OpenSearch is not part of our on-call rotation for infrastructure incidents. AWS handles shard allocation failures, JVM pressure, and node health. Our team monitors application-level OpenSearch dashboards, not cluster internals.
Faster time to value: We were productive within weeks rather than spending months building out a self-managed deployment pipeline, monitoring stack, and runbooks.
What We Built on Top
Going managed did not mean going passive. We still had significant platform work to do.
Terraform modules for self-service: We built Terraform modules that let application teams provision their own index patterns with sensible defaults (optimized mappings, appropriate shard counts, ISM policies attached automatically). This turned the platform team from a bottleneck into a guardrail.
Index lifecycle automation: Automated ISM policies moved indices from hot to UltraWarm to cold to deleted based on configurable retention. Combined with index templates that enforced mapping best practices, this kept storage growth predictable.
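A hot to UltraWarm to cold to delete lifecycle can be expressed as a single ISM policy document. The sketch below builds one as a Python dictionary. The function name, the retention ages, the `logs-*` pattern, and the `@timestamp` field are assumptions for illustration, not our production values. Submitting the policy to the domain (a PUT to `_plugins/_ism/policies/<policy_id>`) is left out.

```python
import json


def build_ism_policy(warm_after: str, cold_after: str, delete_after: str,
                     index_pattern: str) -> dict:
    """Build an ISM policy that moves indices hot -> warm -> cold -> delete by age."""

    def transition(target: str, age: str) -> list:
        return [{"state_name": target, "conditions": {"min_index_age": age}}]

    return {
        "policy": {
            "description": "hot -> UltraWarm -> cold -> delete by index age",
            "default_state": "hot",
            "states": [
                {"name": "hot", "actions": [],
                 "transitions": transition("warm", warm_after)},
                # warm_migration moves the index to UltraWarm on the managed service
                {"name": "warm", "actions": [{"warm_migration": {}}],
                 "transitions": transition("cold", cold_after)},
                # cold_migration needs a timestamp field; "@timestamp" is assumed here
                {"name": "cold",
                 "actions": [{"cold_migration": {"timestamp_field": "@timestamp"}}],
                 "transitions": transition("delete", delete_after)},
                # indices in cold storage are removed with cold_delete
                {"name": "delete", "actions": [{"cold_delete": {}}],
                 "transitions": []},
            ],
            # attach the policy automatically to new indices matching the pattern
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


policy = build_ism_policy("7d", "30d", "90d", "logs-*")
print(json.dumps(policy, indent=2))
```

Generating the policy from a few parameters is what makes retention configurable per team: the structure stays fixed and only the ages and index pattern vary.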
Cost visibility: We tagged OpenSearch domains and correlated costs with team ownership through AWS Cost Explorer and custom dashboards. This drove behavioral change: teams that could see their log volume started being more thoughtful about what they logged.
Performance tuning: Query optimization and shard right-sizing made queries 40% faster without upsizing instances. This required understanding per-index access patterns, and that work is the same whether you run managed or self-managed.
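Shard right-sizing usually starts from a target shard size. The sketch below derives a primary shard count from an expected index size. The 30 GiB default is an assumption on my part, sitting inside the commonly cited 10 to 50 GiB range, and the function name is hypothetical. Real sizing should still be checked against the access patterns of each index.

```python
import math


def primary_shard_count(index_size_gib: float, target_shard_gib: float = 30.0) -> int:
    """Return the number of primary shards that keeps each shard near the target size."""
    if index_size_gib < 0:
        raise ValueError("index_size_gib must not be negative")
    if target_shard_gib <= 0:
        raise ValueError("target_shard_gib must be positive")
    # Round up so no shard exceeds the target, and never go below one shard.
    return max(1, math.ceil(index_size_gib / target_shard_gib))


# Illustrative sizes only.
for size in (5, 100, 300):
    print(f"{size} GiB index -> {primary_shard_count(size)} primary shard(s)")
```

Small indices end up with a single shard, which matters at this scale: many tiny shards add cluster overhead without improving query speed.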
What I Would Do Differently
Invest in index mapping governance earlier: Self-service is great, but we underestimated how many teams would create indices with dynamic mappings that exploded field counts. Our Terraform modules now enforce explicit mappings with a maximum field count.
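A guardrail of this kind can be approximated with a small check that runs before an index template is applied. The sketch below is not our Terraform module. It is a hypothetical Python validator that rejects mappings with dynamic fields enabled and counts declared fields against a limit. The limit of 200 is an arbitrary example, and the counting only approximates how OpenSearch applies `index.mapping.total_fields.limit`.

```python
from typing import List


def count_fields(mapping: dict) -> int:
    """Count declared fields, including object fields and multi-fields."""
    total = 0
    for field in mapping.get("properties", {}).values():
        total += 1                             # the field or object itself
        total += len(field.get("fields", {}))  # multi-fields such as .keyword
        total += count_fields(field)           # properties nested under an object
    return total


def validate_mapping(mapping: dict, max_fields: int = 200) -> List[str]:
    """Return a list of problems; an empty list means the mapping passes."""
    problems = []
    if mapping.get("dynamic") not in ("strict", "false", False):
        problems.append("dynamic mapping must be disabled ('strict' or false)")
    declared = count_fields(mapping)
    if declared > max_fields:
        problems.append(f"{declared} fields declared, limit is {max_fields}")
    return problems


# Hypothetical mapping for a log index.
example = {
    "dynamic": "strict",
    "properties": {
        "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "http": {"properties": {"status": {"type": "integer"}}},
    },
}
print(validate_mapping(example))
```

Running a check like this in CI catches field explosions before they reach the cluster, instead of after a team has already indexed a week of logs.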
Right-size instance types from day one: We initially over-provisioned to be safe during migration. It took three months of load analysis before we right-sized. Starting with a structured capacity test during migration would have saved cost from the start.
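One input to a structured capacity test is a storage estimate for the hot tier. The sketch below applies the simplified rule of thumb from the AWS sizing guidance: source data times (1 + replicas) times 1.45 for indexing and system overhead. The daily volume, retention, and function name are illustrative assumptions, and the result says nothing about CPU or memory, which still need load testing.

```python
def min_storage_gib(daily_source_gib: float, retention_days: int,
                    replicas: int = 1, overhead_factor: float = 1.45) -> float:
    """Estimate minimum hot-tier storage for a retention window."""
    if daily_source_gib < 0 or retention_days < 0 or replicas < 0:
        raise ValueError("inputs must not be negative")
    source_gib = daily_source_gib * retention_days
    # Each replica is a full copy; the overhead factor covers indexing
    # overhead and space reserved by the OS and the service.
    return source_gib * (1 + replicas) * overhead_factor


# Illustrative inputs: 100 GiB/day kept hot for 7 days with one replica.
estimate = min_storage_gib(100, 7, replicas=1)
print(f"Estimated minimum hot storage: {estimate:.0f} GiB")
```

Data that ISM has already moved to UltraWarm or cold storage is sized differently, so only the hot retention window belongs in this estimate.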
Define ISM policies before migration, not after: We migrated data first and applied lifecycle policies later. This meant a period of storage bloat while we tuned retention. Defining the target ISM policies as part of the migration plan would have been cleaner.
My Position
AWS Managed OpenSearch wins when these conditions are met:
- Your platform team is already stretched: If your team is operating Kubernetes, CI/CD, and other infrastructure, adding a complex stateful system to self-manage is a serious operational tax. Managed lets you focus on the platform work that differentiates your organization.
- You value reliability during peak events: For revenue-critical workloads where observability downtime during Black Friday is not an option, offloading infrastructure reliability to AWS reduces risk.
- TCO includes engineering time: The per-node cost of managed is higher, but if you factor in the engineering hours for upgrades, patching, incident response, and backup verification, managed often wins on total cost.
Self-managed OpenSearch on Kubernetes wins when you have a dedicated platform team with deep distributed systems expertise, you operate at massive scale where the per-node cost delta compounds significantly, and you need integration depth with a Kubernetes-native observability stack that would make a managed service an operational island.
The worst outcome is choosing self-managed for cost reasons alone, without the team maturity to operate it. You will spend more on engineering time than you save on infrastructure.
The Results
For PVH, managed was the right call:
- 60%+ reduction in licensing costs compared to the ELK stack
- 40% faster query performance through shard optimization and caching
- 50% storage reduction via ISM automation and mapping optimization
- Self-service adoption by 85% of teams within six months
These results came from the platform engineering work we built on top of the managed service, not from the managed service itself. The decision to go managed freed up the engineering capacity to focus on self-service tooling, performance tuning, and cost visibility rather than cluster operations.
If you are evaluating this decision for your organization, I would start by honestly assessing your team's operational bandwidth. The technology choice is secondary to the team's ability to sustain it alongside everything else they are responsible for.