- Mailing Lists
- in
- How Uber Built Odin to Handle 3.8 Million Containers
Archives
- By thread 4455
-
By date
- June 2021 10
- July 2021 6
- August 2021 20
- September 2021 21
- October 2021 48
- November 2021 40
- December 2021 23
- January 2022 46
- February 2022 80
- March 2022 110
- April 2022 99
- May 2022 98
- June 2022 104
- July 2022 83
- August 2022 95
- September 2022 102
- October 2022 118
- November 2022 115
- December 2022 101
- January 2023 89
- February 2023 90
- March 2023 115
- April 2023 98
- May 2023 160
- June 2023 143
- July 2023 121
- August 2023 90
- September 2023 101
- October 2023 106
- November 2023 101
- December 2023 73
- January 2024 75
- February 2024 75
- March 2024 78
- April 2024 74
- May 2024 108
- June 2024 99
- July 2024 115
- August 2024 134
- September 2024 130
- October 2024 141
- November 2024 171
- December 2024 115
- January 2025 216
- February 2025 140
- March 2025 220
- April 2025 29
How Our Live Scoring & Tech Create an Unforgettable Day
[FHL 10 Encore] Register to watch the replay!
How Uber Built Odin to Handle 3.8 Million Containers
How Uber Built Odin to Handle 3.8 Million Containers
Building AI Apps on Postgres? Start with pgai (Sponsored)pgai is a PostgreSQL extension that brings more AI workflows to PostgreSQL, like embedding creation and model completion. pgai empowers developers with AI superpowers, making it easier to build search and retrieval-augmented generation (RAG) applications. Automates embedding creation with pgai Vectorizer, keeping your embeddings up to date as your data changes—no manual syncing required. Available free on GitHub or fully managed in Timescale Cloud. Disclaimer: The details in this post have been derived from Uber Engineering Blog and other sources. All credit for the technical details goes to the Uber engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them. In the early days, the engineers at Uber had to take care of databases and storage systems manually. Whenever they needed to set up, update, or fix something, they followed written instructions called "runbooks". These runbooks were like a step-by-step guide. As Uber grew, this manual process became overwhelming. They had thousands of databases spread globally, and managing them by hand was a slow and difficult process prone to mistakes. To solve this, Uber created Odin, an automated system that manages all these databases and storage clusters without human intervention. Unlike older systems that work with only specific types of databases, Odin is technology-agnostic, meaning it can handle many different databases and storage systems seamlessly. Odin helps Uber’s engineers organize, scale, and maintain their storage infrastructure, ensuring everything runs smoothly and reliably. In this article, we will look at a comprehensive breakdown of Odin and the challenges Uber faced while developing it. The Scale of OdinSince 2014, Uber’s data infrastructure has expanded at an unprecedented scale. What started with a few hundred hosts has now evolved into a massive fleet of over 100,000 hosts, supporting a huge number of stateful workloads. These workloads are essential for Uber’s real-time services, including ride-hailing, food delivery, and payment processing, all of which depend on highly available and scalable storage solutions. As per an estimate, Uber’s fleet collectively manages multiple exbibytes of storage. To put this into perspective 1 exbibyte equals 1,152,921.5 terabytes (TB). Multiple exbibytes place Uber’s storage in the zettabyte-scale range. This level of storage capacity is necessary because Uber generates massive volumes of real-time data, such as:
Odin allows Uber to manage this enormous ecosystem through automation, self-healing mechanisms, and efficient resource scheduling. It is responsible for handling:
In essence, Odin optimizes how storage is allocated and accessed to ensure fast read/write performance while preventing unnecessary duplication and inefficiency. Odin is also technology-agnostic, meaning it doesn’t just manage a single type of database or storage system but instead integrates with 23 different storage technologies, including:
Each of these technologies serves a specific purpose within Uber’s ecosystem, and Odin ensures they can all operate efficiently within the same unified infrastructure. For example:
Odin’s Architecture and Design PrinciplesUnlike traditional imperative systems that require explicit commands to perform tasks, Odin follows a declarative model where engineers define what the system should look like (goal state) rather than how to achieve it. To maintain the goal state, Odin employs self-healing remediation loops that continuously monitor the system state and detect deviations. If any deviation is found, it takes corrective actions automatically without human intervention. See the diagram below that shows the steps within Odin’s remediation loop. For example, an engineer defines that a database should always have three replicas. Odin automatically ensures this is always the case. If a replica fails or a node crashes, Odin self-heals by spinning up a new instance to meet the goal state. This design is similar to Kubernetes but optimized for stateful workloads such as databases and large-scale storage systems. The diagram below shows a high-level view of Odin’s architecture. The key architectural components of Odin are as follows: 1 - Grail: The Data Integration PlatformOdin’s intelligence depends on having an accurate and up-to-date view of Uber’s global infrastructure. This is handled by Grail. Grail provides a real-time snapshot of all hosts, containers, and workloads across Uber’s infrastructure. It works similarly to the Kubernetes API Server but at a much larger scale. It allows engineers and remediation loops to query global system state instantly, enabling informed scheduling and decision-making. The key advantages of Grail are as follows:
2 - Remediation LoopsAt the core of Odin’s automation capabilities are remediation loops. Each remediation loop is a separate microservice. This allows independent development and deployment without affecting other parts of the system. It follows a four-step cycle:
This process is continuous, ensuring that Odin actively maintains stability and performance. 3 - The Control PlaneOdin’s control plane is responsible for orchestrating workloads, scheduling tasks, and managing cluster topology. The core responsibilities of the Control Plane are as follows:
4 - Host-Level AgentsOdin’s architecture separates global control from local execution using two host-level agents.
See the diagram below that shows the two host-level agents in more detail. By separating general host management (Odin-Agent) from database-specific logic (Worker), Odin maintains standardization and customization across different workloads. Odin’s Key Features and InnovationsAs mentioned, Odin is a stateful workload management system designed to handle Uber’s massive-scale database and storage infrastructure. 1 - Self Healing SystemOdin continuously compares the desired goal state with the actual state of the system. If a deviation is detected, it automatically triggers corrective actions through Cadence workflows. No human intervention is required in this process. The system detects and fixes failures autonomously. This eliminates downtime caused by manual debugging. Moreover, Odin constantly fine-tunes itself to maintain optimal performance. 2 - High Availability and Fault ToleranceOdin ensures high availability through dynamic workload rescheduling and failure-handling mechanisms. 60% of workloads are rescheduled each month to balance resource utilization. Workloads move between hosts dynamically to optimize performance and maintain resilience against failures. Odin makes real-time rescheduling decisions based on resource usage, fault tolerance, and availability requirements. One of Odin’s key innovations is its make-before-break strategy. Before shutting down a workload, Odin provisions a replacement instance. The old instance is only removed once the new one is fully operational. This approach is important for stateful workloads like databases, where shutting down a node without a replacement can cause data loss, unavailability, or increased latency. 3 - Efficient Storage ManagementWhen Odin was introduced, containerizing databases was still a relatively uncommon and controversial practice. However, Uber fully embraced containerized stateful workloads. An innovation Odin brought to the table was placing multiple database instances on a single host. Traditionally, databases run on dedicated machines, wasting resources. Odin enables up to 100 databases to be colocated on the same host by managing CPU, memory, and disk allocation. Unlike traditional NAS-based database solutions, Odin relies on locally attached SSDs and HDDs. This results in lower latency, higher throughput, and reduced costs. 4 - Workload Identity and Stateful MigrationStateful applications require persistent identity across rescheduling events. Unlike Kubernetes StatefulSets, which assigns a stable pod identity, Odin takes a different approach. In this approach, workload identity is preserved across rescheduling events. Data replication is performed before workload termination. Goal-state propagation ensures that the new instance picks up exactly where the old one left off. In other words, the workload transition is seamless, with no downtime or loss of service. Challenges and Optimization Strategies in OdinManaging a platform as large as Odin comes with significant engineering challenges. As Uber scaled, Odin had to evolve from a human-driven system to a fully automated, highly coordinated, and resilient infrastructure management platform. Some key takeaways are as follows:
ConclusionOdin represents a significant leap in stateful workload management, enabling Uber to scale its infrastructure from hundreds to over 100,000 hosts while maintaining high availability, cost efficiency, and operational stability. By transitioning from manual, human-driven operations to an intent-based, self-healing, and automated orchestration system, Odin has eliminated inefficiencies and minimized downtime in Uber’s mission-critical storage systems. The key innovations behind Odin, such as self-healing remediation loops, intelligent workload rescheduling, make-before-break migrations, and platform-wide coordination mechanisms, have allowed Uber to achieve 95%+ resource utilization. As Uber continues to grow, Odin is set to evolve further, integrating new optimizations, smarter automation, and deeper AI-driven workload management. References: © 2025 ByteByteGo |
by "ByteByteGo" <bytebytego@substack.com> - 10:36 - 4 Mar 2025