Engineering Methodology

Netflix's Data Migration Factory: Moving 400+ Clusters to Aurora

Dillip Chowdary • Mar 10, 2026 • 22 min read

Database migrations are notoriously high-risk operations. For a company like Netflix, where even a few seconds of metadata unavailability can disrupt the streaming experience for millions, the stakes are higher still. In March 2026, Netflix Engineering pulled back the curtain on their **Automated Migration Framework**, a specialized platform that has autonomously moved over 400 production PostgreSQL clusters from standard RDS to Amazon Aurora with zero downtime and zero manual intervention.

1. The Challenge: Managing State at Scale

Netflix maintains a massive fleet of stateful microservices. As the organization standardized on Amazon Aurora for its lower replication lag and storage auto-scaling, the infrastructure team faced a "Legacy Debt" problem. Manually migrating 400+ clusters would have taken years and required deep involvement from every application team. They needed a Migration Factory: a self-service platform that could handle the heavy lifting of data replication, validation, and cutover.

2. The Architecture: The Data Access Layer (DAL)

The core innovation enabling zero-downtime transitions is Netflix's **Data Access Layer (DAL)**, built on an extended version of the Envoy Proxy. Instead of application code connecting directly to a database endpoint (e.g., db.prod.netflix.com), all connections are routed through a local sidecar proxy.
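The indirection described above can be illustrated with a toy routing table. This is a minimal sketch of the idea only, not Envoy's configuration or API; the `EndpointRouter` class, the `metadata-db` logical name, and the `*.internal.example` hostnames are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """Where a logical database name currently points (hypothetical values)."""
    write_endpoint: str
    read_endpoint: str


class EndpointRouter:
    """Toy stand-in for a proxy routing table; not Envoy's actual API."""

    def __init__(self):
        self._routes = {}

    def register(self, logical_name, route):
        self._routes[logical_name] = route

    def resolve(self, logical_name, is_write):
        # Applications only know the logical name, never the physical host.
        route = self._routes[logical_name]
        return route.write_endpoint if is_write else route.read_endpoint

    def shift_reads(self, logical_name, new_read_endpoint):
        # Reads move to the new cluster; writes stay where they were.
        self._routes[logical_name].read_endpoint = new_read_endpoint


router = EndpointRouter()
router.register("metadata-db", Route("rds.internal.example", "rds.internal.example"))
router.shift_reads("metadata-db", "aurora.internal.example")
```

Because the application asks for `metadata-db` rather than a hostname, the physical target can change underneath it without a redeploy.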

The Envoy "Migration Filter":

3. The Methodology: The 4-Phase Migration Loop

The framework operates as a fully autonomous state machine, moving every cluster through four distinct phases:

  1. Phase 1: Hydration & Replication: The framework provisions a new Aurora cluster and establishes a low-latency replication link (using AWS Database Migration Service or native Postgres logical replication).
  2. Phase 2: Continuous Validation: A "Data Auditor" service runs asynchronously, sampling 1% of transactions across both clusters to verify that the target state matches the source state within a 5ms window.
  3. Phase 3: The "Shadow Cutover": The DAL begins routing READ traffic to the new Aurora replicas while the source RDS cluster remains the primary for writes.
  4. Phase 4: The Final Flip: Once replication lag is confirmed to be under 10ms and validation success is at 100%, the framework executes a sub-second "Stop Writes" command, waits for the final sync, and promotes Aurora to Primary.
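The four phases can be read as a small state machine with explicit gates. The sketch below is an illustration under stated assumptions, not Netflix's implementation: the 10 ms lag and 100% validation thresholds come from Phase 4 above, while the `Phase` enum, the `next_phase` function, and the gates for leaving Phases 1 and 2 are assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    HYDRATION = 1
    VALIDATION = 2
    SHADOW_CUTOVER = 3
    FINAL_FLIP = 4
    DONE = 5


MAX_LAG_MS = 10.0          # lag threshold quoted for Phase 4
REQUIRED_VALIDATION = 1.0  # 100% validation success


def next_phase(phase, replication_ready, validation_rate, lag_ms):
    """Return the phase a cluster should be in after one evaluation tick."""
    if phase is Phase.HYDRATION:
        # Assumed gate: the replication link must be established first.
        return Phase.VALIDATION if replication_ready else phase
    if phase is Phase.VALIDATION:
        # Assumed gate: reads shift only after the sampled checks all pass.
        return Phase.SHADOW_CUTOVER if validation_rate >= REQUIRED_VALIDATION else phase
    if phase is Phase.SHADOW_CUTOVER:
        # Gate described in the article: lag under 10 ms and 100% validation.
        ready = lag_ms < MAX_LAG_MS and validation_rate >= REQUIRED_VALIDATION
        return Phase.FINAL_FLIP if ready else phase
    if phase is Phase.FINAL_FLIP:
        return Phase.DONE
    return phase
```

Keeping the transition logic in one pure function makes each gate easy to test in isolation, and a cluster that fails a gate simply stays in its current phase until the next tick.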

4. Results: Reliability by Default

By treating migrations as a platform problem rather than an application problem, Netflix achieved:

Actionable Takeaways for DevOps Teams

  1. Abstract Your Endpoints: Hardcoded database hostnames make a zero-downtime migration very hard, because every client has to be reconfigured at cutover. Use a proxy layer like Envoy or a service mesh.
  2. Invest in Data Auditing: Replication is not a guarantee of consistency. Build automated tooling to sample and compare data across source and target in real time.
  3. Rollback is a Feature: Design your migration scripts so that returning to the source state is as automated and fast as the migration itself.
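The data-auditing takeaway can be sketched as a sample-and-compare check. This is a simplified illustration, not the "Data Auditor" service itself: it samples rows by primary key from in-memory dictionaries rather than live transactions, and the `row_digest` and `audit_sample` names are hypothetical.

```python
import hashlib
import random


def row_digest(row):
    """Stable digest of a row, independent of column order."""
    canonical = repr(sorted(row.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def audit_sample(source, target, sample_rate=0.01, seed=None):
    """Compare a random sample of source rows with the target.

    `source` and `target` map primary key -> row dict (a stand-in for
    real tables). Returns the sorted keys that are missing or differ.
    """
    rng = random.Random(seed)
    keys = sorted(source)
    # Always check at least one row when the table is not empty.
    sample_size = max(1, round(len(keys) * sample_rate)) if keys else 0
    mismatches = []
    for key in rng.sample(keys, sample_size):
        target_row = target.get(key)
        if target_row is None or row_digest(source[key]) != row_digest(target_row):
            mismatches.append(key)
    return sorted(mismatches)
```

In a real pipeline the two dictionaries would be replaced by queries against the source and target clusters, and a non-empty result would block the cutover or trigger the automated rollback path.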