Scaling to 1.2 Billion Daily API Requests with Caching at RevenueCat
Disclaimer: The details in this post have been derived from the article originally published on the RevenueCat Engineering Blog. All credit for the details about RevenueCat's architecture goes to their engineering team. The link to the original article is in the references section at the end of the post. We have attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

RevenueCat is a platform that makes it easy for mobile app developers to implement and manage in-app subscriptions and purchases. The staggering part is that they handle over 1.2 billion API requests per day from those apps. At this massive scale, fast and reliable performance becomes critical. Some of it is achieved by distributing the workload uniformly across multiple servers, but an efficient caching solution also becomes essential.

Caching allows frequently accessed data to be retrieved from fast memory rather than from slower backend databases and systems, which can dramatically speed up response times. But caching also adds complexity, since the cached data must be kept consistent with the source of truth in the databases. Stale or incorrect data in the cache can lead to serious issues. For an application operating at RevenueCat's scale, even small inefficiencies or inconsistencies in the caching layer can have a huge impact.

In this post, we will look at how RevenueCat overcame multiple challenges to build a truly reliable and scalable caching solution using Memcached.

The Three Key Goals of Caching

RevenueCat has three key goals for its caching infrastructure:

- Low latency, so that cache lookups keep API response times fast.
- High hit rates, which means keeping the cache servers warm even through failures and migrations.
- Data consistency, so that cached data does not drift from the source of truth in the primary store.
While these main goals are highly relevant to applications operating at scale, a robust caching solution also needs supporting features such as monitoring and observability, optimization, and some form of automated scaling. Let's look at each of these goals in more detail and how RevenueCat's engineering team achieved them.

Low Latency

There's no doubt that latency has a huge impact on user experience. According to a statistic attributed to Amazon, every 100 ms of latency costs them 1% in sales. Whether or not that number is exact, there's no denying that latency hurts user experience. Even delays of a few hundred milliseconds can make an application feel sluggish and unresponsive, and as latency increases, user engagement and satisfaction plummet. RevenueCat achieves low latency in its caching layer through two key techniques.

1 - Pre-established connections

Their cache client maintains a pool of open connections to the cache servers. When the application needs to make a cache request, it borrows a connection from the pool instead of establishing a new TCP connection, because a TCP handshake can nearly double the cache response time. Borrowing a connection avoids that overhead on every request.

But no decision comes without a trade-off. Keeping connections open consumes memory and other resources on both the client and the server, so it's important to carefully tune the number of connections to balance resource usage against the ability to handle traffic spikes.

2 - Fail-fast approach

If a cache server becomes unresponsive, the client immediately marks it as down for a few seconds and fails the request, treating it as a cache miss. In other words, the client does not retry the request or attempt to establish new connections to the problematic server during this period.

The key insight is that even brief retry delays of 100 ms can cause cascading failures under heavy load. Requests pile up, servers get overloaded, and the resulting "retry storm" can bring the whole system down. Though it might sound counterintuitive, failing fast is crucial for a stable system.

What's the trade-off here? There may be a slight increase in cache misses when servers have temporary issues, but this is far better than risking a system-wide outage. A 99.99% cache hit rate is meaningless if the remaining 0.01% of requests trigger cascading failures. Prioritizing stability over perfect efficiency is the right call.

One potential enhancement here could be circuit breaking, where requests to misbehaving servers are disabled based on error rates and latency measurements. This is something Uber uses in its integrated caching solution, CacheFront. However, aggressive timeouts and well-managed connection pools likely achieve similar results with far less complexity.

Keeping Cache Servers Warm

The next goal RevenueCat had was keeping the cache servers warm, and they employed several strategies to achieve this.

1 - Planning for Failure with Mirrored and Gutter Pools

RevenueCat uses fallback cache pools to handle cache server failures and maintain high availability. The two approaches they use are as follows:

- Mirrored pool: a secondary pool that receives the same writes as the primary pool and therefore holds the same data. If a primary server fails, reads fail over to the mirror, which is already warm.
- Gutter pool: a small, normally idle pool that temporarily absorbs the traffic of a failed server. Entries are stored with short TTLs, so the gutter pool keeps hot keys available during the outage without needing to be kept in sync afterward.
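To make the fail-fast and gutter-pool ideas concrete, here is a minimal sketch in Python. This is not RevenueCat's actual client: the class, the timing constants, and the `get`/`set`/`expire` interface are assumptions standing in for any memcached-style client.

```python
import time

class FailFastCache:
    """Sketch: fail fast on a dead server, fall back to a gutter pool.

    `primary` and `gutter` are any memcached-style clients exposing
    get(key) and set(key, value, expire=...). Timings are assumptions.
    """

    MARK_DOWN_SECONDS = 4  # how long to treat a failing server as down
    GUTTER_TTL = 30        # short TTL so gutter data ages out quickly

    def __init__(self, primary, gutter):
        self.primary = primary
        self.gutter = gutter
        self.down_until = 0.0  # epoch seconds until primary is considered down

    def _primary_up(self):
        return time.time() >= self.down_until

    def _mark_down(self):
        self.down_until = time.time() + self.MARK_DOWN_SECONDS

    def get(self, key):
        if self._primary_up():
            try:
                return self.primary.get(key)   # normal path
            except Exception:
                self._mark_down()              # fail fast: no retries
        # primary considered down: serve from the gutter pool instead
        try:
            return self.gutter.get(key)
        except Exception:
            return None                        # treat as a cache miss

    def set(self, key, value, ttl):
        if self._primary_up():
            try:
                self.primary.set(key, value, expire=ttl)
                return
            except Exception:
                self._mark_down()
        try:
            # while the primary is down, keep hot keys warm with a short TTL
            self.gutter.set(key, value, expire=self.GUTTER_TTL)
        except Exception:
            pass                               # fail fast, count as a miss
```

The important property is that a dead server costs each request almost nothing: the client marks it down once, and for the next few seconds every request routes straight to the gutter pool instead of waiting on timeouts.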
There are also trade-offs to consider concerning cache server size. Smaller servers provide benefits such as:

- A smaller blast radius: losing one server invalidates only a small fraction of the cached data, so the resulting load spike on the backend is easier to absorb.
- Easier failover: a mirror or gutter pool needs to cover less capacity when any single server goes down.
However, small servers also have drawbacks:

- More servers to provision, monitor, and connect to, which increases connection counts and operational overhead.
- Less memory and bandwidth per server, so a very hot key or a large working set can exceed the capacity of a single small server.
A diagram in RevenueCat's original article compares these trade-offs visually.
Bigger servers also have some trade-offs: they mean fewer machines to manage and fewer client connections, but a single failure takes out a much larger share of the cache, producing a bigger backend spike, and more hot keys end up concentrated on one machine.
This is where the strategy of using a mirrored pool for fast failover and a gutter pool for temporary caching strikes a balance between availability and cost. The mirrored pool ensures immediate availability, while the gutter pool provides a cost-effective way to handle failures temporarily.

Generally speaking, it's better to design the cache tier based on a solid understanding of the backend capacity. Also, when using sharding, the cache and backend sharding should be orthogonal, so that a single cache server going down translates into only a moderate load increase spread across the backend servers.

2 - Dedicated Pools

Another technique they employ to keep cache servers warm is to use dedicated cache pools for certain use cases. Here's how the strategy works:

- High-volume or especially important use cases get their own cache pool rather than sharing the general-purpose one.
- Each dedicated pool is sized and tuned for its specific workload and access pattern.
- Because the keyspaces are isolated, heavy traffic from one use case cannot evict another use case's entries.
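A small sketch of how dedicated pools might be wired into a client, assuming keys carry a namespace prefix (for example "subscriber:123"). The routing scheme and the names here are illustrative, not RevenueCat's implementation.

```python
class PooledCache:
    """Route each key to a dedicated pool by namespace, if one exists."""

    def __init__(self, default_pool, dedicated_pools):
        self.default_pool = default_pool
        # e.g. {"subscriber": subscriber_pool, "receipt": receipt_pool}
        self.dedicated = dedicated_pools

    def _pool_for(self, key):
        namespace = key.split(":", 1)[0]
        return self.dedicated.get(namespace, self.default_pool)

    def get(self, key):
        return self._pool_for(key).get(key)

    def set(self, key, value, ttl=0):
        return self._pool_for(key).set(key, value, expire=ttl)
```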
The dedicated pools strategy has several advantages:

- Predictable performance: hit rates for a critical use case are not affected by eviction pressure from unrelated workloads.
- Independent capacity planning: each pool can be scaled on its own schedule as its workload grows.
- Easier monitoring: each pool's metrics map cleanly to a single workload.
3 - Handling Hot Keys

Hot keys are a common challenge in caching systems. They are keys accessed far more frequently than others, concentrating a large share of requests on a single cache server. This can overload that server and degrade the overall system. There are two main strategies for handling hot keys.

Key Splitting

Key splitting works as follows:

- The hot key is split into multiple copies by appending a suffix, for example key#1 through key#N.
- Each copy hashes to a different cache server, so the load spreads across N servers instead of one.
- On a write, all N copies are updated; on a read, the client picks one of the copies, typically at random.
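A hedged sketch of key splitting, where the fan-out factor, helper names, and client interface are assumptions:

```python
import random

SPLITS = 8  # assumed fan-out factor for hot keys

def split_key(key: str, n: int = SPLITS) -> list[str]:
    """Fan one hot key out into n sub-keys that hash to different servers."""
    return [f"{key}#{i}" for i in range(n)]

def get_hot(cache, key: str):
    # read one randomly chosen copy; different clients hit different servers
    return cache.get(f"{key}#{random.randrange(SPLITS)}")

def set_hot(cache, key: str, value, ttl: int = 60):
    # writes update every copy so all readers see the new value
    for sub_key in split_key(key):
        cache.set(sub_key, value, expire=ttl)
```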
Local Caching

Local caching is simpler than key splitting. Here is how it works:

- The application caches its hottest keys directly in process memory.
- Reads for those keys skip the network entirely and add no load to the cache servers.
- Entries carry very short TTLs, because local copies cannot be invalidated centrally, so this approach suits keys that can tolerate slightly stale data.
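A minimal sketch of an in-process cache with short TTLs, assuming that a few seconds of staleness is acceptable for the keys cached this way:

```python
import time

class LocalCache:
    """Tiny in-process cache; entries expire after a short TTL."""

    def __init__(self, ttl_seconds: float = 5.0):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:
            del self.entries[key]  # expired: force a refresh
            return None
        return value

    def set(self, key, value):
        self.entries[key] = (time.time() + self.ttl, value)

def get_with_local(local, remote, key):
    # check process memory first, then fall back to the shared cache tier
    value = local.get(key)
    if value is None:
        value = remote.get(key)
        if value is not None:
            local.set(key, value)
    return value
```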
Avoiding Thundering Herds

When a popular key expires, all clients may request it from the backend simultaneously, causing a load spike. This is known as a "thundering herd". RevenueCat largely avoids this situation because it maintains cache consistency by updating the cache during writes rather than relying on expiry. However, when using low TTLs and invalidations driven by database changes, thundering herds can cause serious problems. Some other potential solutions to avoid them are as follows:

- Probabilistic early refresh: each client recomputes the value shortly before expiry with a small, increasing probability, so the refresh never lands on all clients at once.
- Locks or leases: on a miss, only one client recomputes the value while the others briefly wait or serve the stale copy.
- TTL jitter: adding randomness to expiry times so popular keys do not expire in synchronized waves.
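As an example of the first option, here is a hedged sketch of probabilistic early recomputation in the style of the published "XFetch" algorithm (not something the RevenueCat article describes). The cache interface and stored tuple format are assumptions; a client refreshes the value before expiry with a probability that rises as the deadline approaches.

```python
import math
import random
import time

BETA = 1.0  # >1 favors earlier recomputation

def get_with_early_refresh(cache, key, recompute, ttl=60):
    # entry format (assumed): (value, recompute_duration, expires_at)
    entry = cache.get(key)
    now = time.time()
    if entry is not None:
        value, delta, expires_at = entry
        # serve the cached value unless we probabilistically decide to
        # refresh early; -log(rand) grows, so refreshes spread out in time
        if now - delta * BETA * math.log(random.random()) < expires_at:
            return value
    # cache miss or early refresh: recompute and measure how long it took
    start = time.time()
    value = recompute()
    delta = time.time() - start
    cache.set(key, (value, delta, start + ttl), expire=ttl)
    return value
```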
Cache Server Migrations

Sometimes cache servers have to be replaced while minimizing the impact on hit rates and user experience. RevenueCat built a coordinated cache server migration system that, at a high level, proceeds in the following phases:

- Warm-up: the new servers are added as a destination pool, and clients mirror writes to it while all reads are still served from the old pool.
- Gradual cutover: once the destination pool is warm enough, clients shift reads to it, falling back to the old pool on misses and backfilling so the hit rate stays high.
- Completion: when the destination pool's hit rate has caught up, the old pool is retired and the migration state is cleaned up.
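A hypothetical sketch of client-coordinated migration phases follows. The phase names and fallback behavior are assumptions for illustration, not RevenueCat's exact protocol.

```python
from enum import Enum

class Phase(Enum):
    WARMUP = 1    # serve reads from old pool; mirror writes to new pool
    CUTOVER = 2   # serve reads from new pool; fall back to old pool on miss
    DONE = 3      # old pool retired; new pool serves everything

class MigratingCache:
    def __init__(self, old_pool, new_pool, phase=Phase.WARMUP):
        self.old, self.new, self.phase = old_pool, new_pool, phase

    def get(self, key):
        if self.phase == Phase.WARMUP:
            return self.old.get(key)
        value = self.new.get(key)
        if value is None and self.phase == Phase.CUTOVER:
            value = self.old.get(key)       # keep hit rate up during cutover
            if value is not None:
                self.new.set(key, value)    # backfill the new pool
        return value

    def set(self, key, value, ttl=0):
        # the new pool always receives writes so it warms up and stays current
        self.new.set(key, value, expire=ttl)
        if self.phase != Phase.DONE:
            # keep the old pool consistent until the migration completes
            self.old.set(key, value, expire=ttl)
```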
A diagram in the original article illustrates the entire migration process.

Maintaining Cache Consistency

Maintaining data consistency is one of the biggest challenges when using caching in distributed systems. The fundamental issue is that data lives in multiple places: the primary data store (such as a database) and the cache. Keeping the data in sync across these locations in the face of concurrent reads and writes is a non-trivial problem. Consider how a simple race condition can create an inconsistency between the database and the cache:

1. Client A gets a cache miss for a key and reads the current value from the database.
2. Before A populates the cache, client B updates the same row in the database and refreshes or invalidates the cache entry.
3. Client A now writes the value it read earlier into the cache.
4. The cache holds stale data even though the database is correct, and it will keep serving that stale value.
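A minimal sketch of where this race bites in a cache-aside read (hypothetical helper; the "add"-based population described in the next section is what prevents one variant of the stale overwrite):

```python
def cache_aside_read(cache, db, key):
    value = cache.get(key)
    if value is None:
        value = db.read(key)  # the value may already be outdated here...
        # ...if another client updated the row and refreshed the cache
        # between our db.read() and the write below. A plain set() would
        # clobber that newer cached value with our stale one.
        # add() only succeeds when the key is absent, so it can never
        # overwrite a fresher entry written by a concurrent updater.
        cache.add(key, value)
    return value
```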
RevenueCat uses two main strategies to maintain cache consistency.

1 - Write Failure Tracking

In RevenueCat's system, a cache write failure is a strong signal that the cache and the primary store may have diverged. Simply retrying the write is not a good option, because retries can cause the cascading failures and overload discussed earlier. Instead, RevenueCat's caching client records every write failure, deduplicates the records, and ensures that each affected key is invalidated in the cache at least once, retrying the invalidation as needed until it succeeds. This guarantees that the next read for those keys fetches fresh data from the primary store, resynchronizing the cache.

Write failure tracking lets them treat cache writes as if they always succeed, which significantly simplifies their consistency model: they can assume a write worked, and if it didn't, the tracker ensures eventual consistency.

2 - Consistent CRUD Operations

For each type of data operation (Create, Read, Update, Delete), they developed a strategy to keep the cache and the primary store in sync.

For reads, they use the standard cache-aside pattern: read from the cache and, on a miss, read from the primary store and populate the cache. They always populate with an "add" operation, which only succeeds if the key does not already exist, to avoid overwriting newer values.

For updates, they use a clever three-step strategy:

1. Reduce the TTL of the existing cached entry to a few seconds.
2. Write the new value to the primary store.
3. Write the new value to the cache.
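A hedged sketch of this update flow, using assumed client helpers rather than RevenueCat's actual code. Memcached's "touch" command changes an item's expiry without changing its value, which is what step 1 needs; the short TTL bounds staleness if the process crashes between the database write and the final cache write.

```python
SHORT_TTL = 5  # seconds; assumed "few seconds" window

def update(cache, db, key, new_value, failure_tracker, ttl=3600):
    # Step 1: shorten the TTL of whatever is currently cached.
    cache.touch(key, SHORT_TTL)

    # Step 2: make the write durable in the source of truth.
    db.write(key, new_value)

    # Step 3: refresh the cache with the new value and the normal TTL.
    try:
        cache.set(key, new_value, expire=ttl)
    except Exception:
        # Record the failed write so the key is invalidated at least once
        # later (the write failure tracking described above).
        failure_tracker.record(key)
```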
If a failure occurs between steps 1 and 2, the cache remains consistent because the update never reached the primary store. If a failure occurs between steps 2 and 3, the cache will be stale, but only until the reduced TTL expires. Any complete failures are also caught by the write failure tracker described earlier.

For deletes, they use a similar TTL-reduction strategy before deleting from the primary store. For creates, they rely on the primary store to generate unique IDs, which avoids key conflicts.

Conclusion

RevenueCat's approach illustrates the complexities of running caches at a massive scale. While some details are specific to their Memcached setup, the high-level lessons apply broadly. Here are some key takeaways from this case study:

- Fail fast instead of retrying: a brief rise in cache misses is far cheaper than a retry storm.
- Plan for failure: fallback capacity such as mirrored and gutter pools keeps a single server loss from overwhelming the backend.
- Watch for hot keys, and spread them across servers or cache them locally before they saturate one machine.
- Treat consistency as a first-class design problem: track cache write failures and bound staleness with TTLs rather than assuming writes always succeed.
- Invest in operational tooling such as warm migrations and observability; at this scale, the operations around the cache matter as much as the data path.
References:

- The original article on caching at scale, published on the RevenueCat Engineering Blog.