Elevate Your Security Strategy with Sumo Logic at the AWS Summit London
Sumo Logic
Join us at Stand B36 for insights on AI in defense strategies.
Dear Mohammad,
Don't miss out on the chance to connect with Sumo Logic at the AWS Summit London on the 24th of April at the ExCel London.
Swing by Stand B36 to delve into the future of defense strategies in the age of AI.
Let's explore and discuss the Applications of AI in a modern defense strategy, providing insights on defending against AI-enabled adversaries and crafting an AI strategy to empower your Security Operations teams.
Discover how to adapt to new threats and navigate the ever-changing cyber landscape. At our booth you can also enter a prize draw.
Join us at Stand B36 and elevate your security game. We look forward to seeing you there!
Best regards,
Sumo Logic
About Sumo Logic
Sumo Logic is the pioneer in continuous intelligence, a new category of software to address the data challenges presented by digital transformation, modern applications, and cloud computing.
Sumo Logic, Aviation House, 125 Kingsway, London WC2B 6NH, UK
© 2024 Sumo Logic. All rights reserved.
by "Sumo Logic" <marketing-info@sumologic.com> - 06:01 - 17 Apr 2024 -
How do business leaders view the world economy?
On Point
A fast-rising risk to growth
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 11:06 - 16 Apr 2024 -
How PayPal Serves 350 Billion Daily Requests with JunoDB
Stop releasing bugs with fully automated end-to-end test coverage (Sponsored)
Bugs sneak out when less than 80% of user flows are tested before shipping. But how do you get that kind of coverage? You either spend years scaling in-house QA — or you get there in just 4 months with QA Wolf.
How's QA Wolf different?
They don't charge hourly.
They guarantee results.
They provide all of the tooling and (parallel run) infrastructure needed to run a 15-minute QA cycle.
Have you ever seen a database that fails and comes up again in the blink of an eye?
PayPal’s JunoDB is a database capable of doing so. As per PayPal’s claim, JunoDB can run at 6 nines of availability (99.9999%). This comes to just 86.40 milliseconds of downtime per day.
For reference, our average eye blink takes around 100-150 milliseconds.
While the statistics are certainly impressive, they also suggest that there are many interesting lessons to pick up from JunoDB’s architecture and design.
In this post, we will cover the following topics:
JunoDB’s Architecture Breakdown
How JunoDB achieves scalability, availability, performance, and security
Use cases of JunoDB
Key Facts about JunoDB
Before we go further, here are some key facts about JunoDB that can help us develop a better understanding of it.
JunoDB is a distributed key-value store. Think of a key-value store as a dictionary where you look up a word (the “key”) to find its definition (the “value”).
JunoDB leverages a highly concurrent architecture implemented in Go to efficiently handle hundreds of thousands of connections.
At PayPal, JunoDB serves almost 350 billion daily requests and is used in every core backend service, including critical functionalities like login, risk management, and transaction processing.
PayPal primarily uses JunoDB for caching to reduce the load on the main source-of-truth database. However, there are also other use cases that we will discuss in a later section.
The diagram shows how JunoDB fits into the overall scheme of things at PayPal.
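To make the dictionary analogy concrete, here is a minimal Go sketch of the key-value model using a plain in-memory map. JunoDB layers distribution, replication, and TTLs on top of these same basic get/put/destroy operations.

```go
package main

import "fmt"

func main() {
	// A key-value store behaves like a dictionary: you store a value
	// under a key, then look it up by that key later.
	store := map[string]string{}

	store["session:123"] = "user=alice" // put
	value, ok := store["session:123"]   // get
	fmt.Println(value, ok)              // user=alice true

	delete(store, "session:123") // destroy
	_, ok = store["session:123"]
	fmt.Println(ok) // false
}
```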
Why the Need for JunoDB?
One common question surrounding the creation of something like JunoDB is this:
“Why couldn’t PayPal just use something off-the-shelf like Redis?”
The reason is that PayPal wanted multi-core support for the database, and Redis is not designed to benefit from multiple CPU cores. It is single-threaded by nature and utilizes only one core; typically, you need to launch several Redis instances to scale out across several cores.
Incidentally, JunoDB started as a single-threaded C++ program and the initial goal was to use it as an in-memory short TTL data store.
For reference, TTL stands for Time to Live. It specifies the maximum duration a piece of data should be retained or the maximum time it is considered valid.
However, the goals for JunoDB evolved with time.
First, PayPal wanted JunoDB to work as a persistent data store supporting long TTLs.
Second, JunoDB was also expected to provide improved data security via on-disk encryption and TLS in transit by default.
These goals meant that JunoDB had to be CPU-bound rather than memory-bound.
For reference, “memory-bound” and “CPU-bound” refer to different performance aspects in computer programs. As the name suggests, the performance of memory-bound programs is limited by the amount of available memory. On the other hand, CPU-bound programs depend on the processing power of the CPU.
For example, Redis is memory-bound. It primarily stores the data in RAM and everything about it is optimized for quick in-memory access. The limiting factor for the performance of Redis is memory rather than CPU.
However, requirements like encryption are CPU-intensive because many cryptographic algorithms require raw processing power to carry out complex mathematical calculations.
As a result, PayPal decided to rewrite the earlier version of JunoDB in Go to make it multi-core friendly and support high concurrency.
The Architecture of JunoDB
The below diagram shows the high-level architecture of JunoDB.
Let’s look at the main components of the overall design.
1 - JunoDB Client Library
The client library is part of the client application and provides an API for storing and retrieving data via the JunoDB proxy.
It is implemented in several programming languages such as Java, C++, Python, and Golang to make it easy to use across different application stacks.
For developers, it’s just a matter of picking the library for their respective programming language and including it in the application to carry out the various operations.
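As a rough illustration, here is a Go sketch of what using such a library can look like from the application side. The interface and method names below are hypothetical assumptions, not JunoDB's actual client API; a toy in-memory implementation is included only so the example runs end to end.

```go
package main

import (
	"fmt"
	"time"
)

// Client mirrors the general shape of a key-value client library.
// These method names and signatures are illustrative assumptions,
// not JunoDB's real Go API.
type Client interface {
	Set(key, value string, ttl time.Duration) error
	Get(key string) (string, error)
}

// memClient is a toy in-memory stand-in so the sketch is runnable.
type memClient struct {
	data    map[string]string
	expires map[string]time.Time
}

func newMemClient() *memClient {
	return &memClient{data: map[string]string{}, expires: map[string]time.Time{}}
}

func (m *memClient) Set(key, value string, ttl time.Duration) error {
	m.data[key] = value
	m.expires[key] = time.Now().Add(ttl)
	return nil
}

func (m *memClient) Get(key string) (string, error) {
	if exp, ok := m.expires[key]; !ok || time.Now().After(exp) {
		return "", fmt.Errorf("key %q not found or expired", key)
	}
	return m.data[key], nil
}

func main() {
	var c Client = newMemClient()
	// Cache a short-lived token; the TTL tells the store when the
	// record may be expired automatically.
	c.Set("token:42", "abc123", 30*time.Second)
	if v, err := c.Get("token:42"); err == nil {
		fmt.Println("got", v)
	}
}
```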
2 - JunoDB Proxy with Load Balancer
JunoDB utilizes a proxy-based design where the proxy connects to all JunoDB storage server instances.
This design has a few important advantages:
The complexity of determining which storage server should handle a query is kept out of the client libraries. Since JunoDB is a distributed data store, the data is spread across multiple servers. The proxy handles the job of directing the requests to the correct server.
The proxy is also aware of the JunoDB cluster configuration (such as shard mappings) stored in the ETCD key-value store.
But can the JunoDB proxy turn into a single point of failure?
To prevent this possibility, the proxy runs on multiple instances downstream to a load balancer. The load balancer receives incoming requests from the client applications and routes the requests to the appropriate proxy instance.
3 - JunoDB Storage Servers
The last major component in the JunoDB architecture is the storage servers.
These are instances that accept operation requests from the proxy and store data in memory or persistent storage.
Each storage server is responsible for a set of partitions or shards for an efficient distribution of data.
Internally, JunoDB uses RocksDB as the storage engine. Using an off-the-shelf storage engine like RocksDB is common in the database world to avoid building everything from the ground up. For reference, RocksDB is an embedded key-value storage engine that is optimized for high read and write throughput.
Key Priorities of JunoDB
Now that we have looked at the overall design and architecture of JunoDB, it’s time to understand a few key priorities for JunoDB and how it achieves them.
Scalability
Several years ago, PayPal transitioned to a horizontally scalable microservice-based architecture to support the rapid growth in active customers and payment rates.
While microservices solve many problems for them, they also have some drawbacks.
One important drawback is the increased number of persistent connections to key-value stores due to scaling out the application tier. JunoDB handles this scaling requirement in two primary ways.
1 - Scaling for Client Connections
As discussed earlier, JunoDB uses a proxy-based architecture.
If client connections to the database reach a limit, additional proxies can be added to support more connections.
There is an acceptable trade-off with latency in this case.
2 - Scaling for Data Volume and Throughput
The second type of scaling requirement is related to the growth in data size.
To ensure efficient storage and data fetching, JunoDB supports partitioning based on the consistent hashing algorithm. Partitions (or shards) are distributed to physical storage nodes using a shard map.
Consistent hashing is very useful in this case because when the nodes in a cluster change due to additions or removals, only a minimal number of shards require reassignment to different storage nodes.
PayPal uses a fixed number of shards (1024 shards, to be precise), and the shard map is pre-generated and stored in ETCD storage.
Any change to the shard mapping triggers an automatic data redistribution process, making it easy to scale your JunoDB cluster depending on the need.
The below diagram shows the process in more detail.
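A small Go sketch of this fixed-shard scheme is shown below. The FNV hash and the round-robin shard-to-node assignment are illustrative stand-ins; in JunoDB the shard map is pre-generated and kept in ETCD.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 1024 // JunoDB uses a fixed, pre-generated set of 1024 shards

// shardOf maps a key to a shard. Because the shard count is fixed,
// this mapping never changes; only the shard-to-node assignment does.
func shardOf(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	// A toy shard map: shard index -> storage node, assigned round-robin
	// over three hypothetical nodes.
	nodes := []string{"node-a", "node-b", "node-c"}
	shardMap := make([]string, numShards)
	for s := 0; s < numShards; s++ {
		shardMap[s] = nodes[s%len(nodes)]
	}

	key := "user:1001"
	shard := shardOf(key)
	fmt.Printf("key %q -> shard %d -> %s\n", key, shard, shardMap[shard])

	// Adding or removing a node only requires reassigning entries in
	// shardMap (and moving those shards' data); keys are never re-hashed.
}
```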
Availability
High availability is critical for PayPal. You can’t have a global payment platform going down without a big loss of reputation.
However, outages can and will occur due to various reasons such as software bugs, hardware failures, power outages, and even human error. Failures can lead to data loss, slow response times, or complete unavailability.
To mitigate these challenges, JunoDB relies on replication and failover strategies.
1 - Within-Cluster Replication
In a cluster, JunoDB storage nodes are logically organized into a grid. Each column represents a zone, and each row signifies a storage group.
Data is partitioned into shards and assigned to storage groups. Within a storage group, each shard is synchronously replicated across various zones based on the quorum protocol.
The quorum-based protocol is the key to reaching consensus on a value within a distributed database. There are two quorums:
The Read Quorum: When a client wants to read data, it needs to receive responses from a certain number of zones (known as the read quorum). This is to make sure that it gets the most up-to-date data.
The Write Quorum: When the client wants to write data, it must receive acknowledgment from a certain number of zones to make sure that the data is written to a majority of the zones.
There are two important rules when it comes to quorum.
The sum of the read quorum and write quorum must be greater than the number of zones. If that’s not the case, the client may end up reading outdated data. For example, if there are 5 zones with read quorum as 2 and write quorum as 3, a client can write data to 3 zones but another client may read from the 2 zones that have not yet received the updated data.
The write quorum must be more than half the number of zones to prevent two concurrent write operations on the same key. For example, if there is a JunoDB cluster with 5 zones and a write quorum of 2, client A may write value X to key K and is considered successful when 2 zones acknowledge the request. Similarly, client B may write value Y to the same key K and is also successful when two different zones acknowledge the request. Ultimately, the data for key K is in an inconsistent state.
In production, PayPal has a configuration with 5 zones, a read quorum of 3, and a write quorum of 3.
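The two rules can be checked mechanically. This Go sketch validates a (zones, read quorum, write quorum) configuration against both rules, using the examples from the text:

```go
package main

import "fmt"

// validQuorum checks both rules for a cluster with n zones,
// read quorum r, and write quorum w.
func validQuorum(n, r, w int) bool {
	readsSeeWrites := r+w > n // rule 1: every read overlaps the latest write
	singleWriter := 2*w > n   // rule 2: two concurrent writes cannot both succeed
	return readsSeeWrites && singleWriter
}

func main() {
	fmt.Println(validQuorum(5, 2, 3)) // false: 2+3 is not greater than 5
	fmt.Println(validQuorum(5, 2, 2)) // false: fails both rules
	fmt.Println(validQuorum(5, 3, 3)) // true: PayPal's production configuration
}
```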
Lastly, the failover process in JunoDB is automatic and instantaneous, with no need for leader re-election or data redistribution. Proxies detect a node failure through a lost connection or a timed-out read request.
2 - Cross-data center replication
Cross-data center replication is implemented by asynchronously replicating data between the proxies of each cluster across different data centers.
This is important to make sure that the system continues to operate even if there’s a catastrophic failure at one data center.
Performance
One of the critical goals of JunoDB is to deliver high performance at scale.
This translates to maintaining single-digit millisecond response times while providing a great user experience.
The below graphs shared by PayPal show the benchmark results demonstrating JunoDB’s performance in the case of persistent connections and high throughput.
Source: PayPal Engineering Blog
Security
As a trusted payment processor, PayPal treats security as paramount.
Therefore, it’s no surprise that JunoDB has been designed to secure data both in transit and at rest.
For transmission security, TLS is enabled between the client and proxy as well as proxies in different data centers used for replication.
Payload encryption is performed at the client or proxy level to prevent multiple encryptions of the same data. The ideal approach is to encrypt the data on the client side but if it’s not done, the proxy figures it out through a metadata flag and carries out the encryption.
All data received by the storage server and stored in the engine is also encrypted to maintain security at rest.
A key management module manages certificates for TLS and sessions, and distributes encryption keys to facilitate key rotation.
The below diagram shows JunoDB’s security setup in more detail.
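Here is a hedged Go sketch of the encrypt-once rule described above. The payload structure and flag name are illustrative assumptions rather than JunoDB's actual wire format, and the cipher is a placeholder.

```go
package main

import "fmt"

// payload carries data plus a metadata flag saying whether the client
// library already encrypted it. Both names are hypothetical.
type payload struct {
	data      []byte
	encrypted bool
}

// atProxy applies the rule from the text: encrypt exactly once, at the
// client if possible, otherwise at the proxy.
func atProxy(p payload) payload {
	if p.encrypted {
		return p // already encrypted by the client; avoid double encryption
	}
	p.data = encrypt(p.data)
	p.encrypted = true
	return p
}

// encrypt is a placeholder; a real system would use an AEAD cipher with
// keys supplied by the key management module.
func encrypt(b []byte) []byte { return append([]byte("enc:"), b...) }

func main() {
	fmt.Printf("%s\n", atProxy(payload{data: []byte("amount=10")}).data)                // proxy encrypts
	fmt.Printf("%s\n", atProxy(payload{data: []byte("enc:..."), encrypted: true}).data) // left as-is
}
```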
Use Cases of JunoDB
Since PayPal has made JunoDB open source, you can also use it within your own projects.
There are various use cases where JunoDB can help. Let’s look at a few important ones:
1 - Caching
You can use JunoDB as a temporary cache to store data that doesn’t change frequently.
Since JunoDB supports both short and long-lived TTLs, you can store data from a few seconds to a few days. For example, a use case is to store short-lived tokens in JunoDB instead of fetching them from the database.
Other items you can cache in JunoDB are user preferences, account details, and API responses.
2 - Idempotency
You can also use JunoDB to implement idempotency.
An operation is idempotent when it produces the same result even when applied multiple times. With idempotency, repeating an operation is safe, and you don’t need to worry about things like duplicate payments being applied.
PayPal uses JunoDB to ensure they don’t process a particular payment multiple times due to retries. JunoDB’s high availability makes it an ideal data store to keep track of processing details without overloading the main database.
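A minimal Go sketch of the idea: a check-and-set against a seen-set guards the operation, so a retried payment becomes a no-op. The in-process map below stands in for a store like JunoDB, where the check would be a single conditional write, typically with a TTL.

```go
package main

import (
	"fmt"
	"sync"
)

// processor remembers which payment IDs it has already handled.
type processor struct {
	mu   sync.Mutex
	seen map[string]bool
}

// processOnce runs fn only the first time a payment ID is seen.
func (p *processor) processOnce(paymentID string, fn func()) {
	p.mu.Lock()
	already := p.seen[paymentID]
	p.seen[paymentID] = true
	p.mu.Unlock()

	if already {
		fmt.Println("skipping duplicate:", paymentID) // safe: retry is a no-op
		return
	}
	fn()
}

func main() {
	p := &processor{seen: map[string]bool{}}
	charge := func() { fmt.Println("charging card once") }

	p.processOnce("pay-001", charge) // first attempt: charges
	p.processOnce("pay-001", charge) // retry: skipped
}
```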
3 - Counters
Let’s say you have resources with limited availability or a usage cap. For example, these could be database connections, API rate limits, or user authentication attempts.
You can use JunoDB to store counters for these resources and track whether their usage exceeds the threshold.
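As a sketch, counter-based limiting reduces to increment-and-compare. The map below stands in for counter keys in a store like JunoDB, which would typically carry a TTL so the usage window resets on its own.

```go
package main

import "fmt"

// allow increments the usage counter for a resource and rejects the
// call once the threshold is exceeded.
func allow(counters map[string]int, resource string, limit int) bool {
	counters[resource]++
	return counters[resource] <= limit
}

func main() {
	counters := map[string]int{}
	for i := 0; i < 5; i++ {
		// Allow at most 3 authentication attempts for this user.
		fmt.Println("attempt allowed:", allow(counters, "login:alice", 3))
	}
	// Prints: true, true, true, false, false
}
```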
4 - Latency Bridging
As we discussed earlier, JunoDB provides fast inter-cluster replication. This can help you deal with slow replication in a more traditional setup.
For example, in PayPal’s case, they run Oracle in Active-Active mode, but its replication usually isn’t as fast as they would like.
This means there is a chance of inconsistent reads if records written in one data center have not yet been replicated to the second data center when the first one goes down.
JunoDB can bridge this latency: you write to Data Center A (both Oracle and JunoDB), and even if it goes down, you can still read the updates consistently from the JunoDB instance in Data Center B.
See the below diagram for a better understanding of this concept.
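The following Go sketch mimics that dual-write pattern under simplified assumptions: the stores are plain maps, and cross-data-center replication is simulated by a single copy that arrives for JunoDB before it arrives for the database.

```go
package main

import "fmt"

type store map[string]string

// writeDCA writes to both the source-of-truth database and the
// fast-replicating key-value store in data center A.
func writeDCA(oracleA, junoA store, key, value string) {
	oracleA[key] = value // replicates to DC B slowly
	junoA[key] = value   // replicates to DC B quickly via JunoDB
}

// readDCB prefers the database replica in data center B and falls back
// to the JunoDB replica when the database copy has not arrived yet.
func readDCB(oracleB, junoB store, key string) (string, bool) {
	if v, ok := oracleB[key]; ok {
		return v, true
	}
	v, ok := junoB[key] // bridge the replication latency
	return v, ok
}

func main() {
	oracleA, junoA := store{}, store{}
	oracleB, junoB := store{}, store{} // replicas in DC B

	writeDCA(oracleA, junoA, "order:7", "paid")
	junoB["order:7"] = junoA["order:7"] // JunoDB replication arrives first

	v, ok := readDCB(oracleB, junoB, "order:7")
	fmt.Println(v, ok) // "paid true" even though oracleB is still empty
}
```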
Conclusion
JunoDB is a distributed key-value store playing a crucial role in various PayPal applications. It provides efficient data storage for fast access to reduce the load on costly database solutions.
While doing so, it also fulfills critical requirements such as scalability, high availability with performance, consistency, and security.
Due to its advantages, PayPal has started using JunoDB in multiple use cases and patterns. For us, it provides a great opportunity to learn about an exciting new database system.
© 2024 ByteByteGo
548 Market Street PMB 72296, San Francisco, CA 94104
by "ByteByteGo" <bytebytego@substack.com> - 11:36 - 16 Apr 2024 -
What’s on CEO agendas as 2024 unfolds?
On Point
The issues that matter most
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
•
Building geopolitical muscle. An endless series of disruptions is making it much harder to lead, Homayoun Hatami, McKinsey’s managing partner for client capabilities, reflects on a recent episode of The McKinsey Podcast. What should leaders focus on as 2024 unfolds? As geopolitical risks rise, leaders must strengthen their abilities to address issues such as investing in new markets. They should also consider establishing processes and systems that enable geopolitically informed decision making, McKinsey’s chief client officer Liz Hilton Segel says.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 01:25 - 16 Apr 2024 -
A leader’s guide to digital virtualization
The real thing
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
“You stay in Wonderland, and I show you how deep the rabbit hole goes,” says the character Morpheus in the classic sci-fi thriller The Matrix, which showcases the extent to which the lines between the virtual and the real can blur. Today, digital technologies have enabled organizations to simulate, experiment, and innovate in imaginative ways and at speeds that would be almost impossible in a solely physical environment. This week, join us as we go down the rabbit hole—to a world where fantasy and reality can meet.
“The line between the digital and the physical has dissolved completely” for some companies, according to McKinsey senior partner Kimberly Borden and colleagues in a new article on digital innovation. “It’s now possible to create digital representations of components, products, and processes that precisely replicate the behavior of their real-world counterparts,” they note. For example, by creating virtual prototypes to simulate thousands of different racing conditions, a sailboat team won two major competitions; a heart pump maker deployed high-performance virtualization to reduce design optimization time by a factor of more than 1,000. To be successful at digital virtualization, “build it to a code that verifiably matches digital results to the physical world, within an acceptable error range,” suggest the authors. “At that point, the virtualization can be trusted as authoritative and safe to use.”
That’s the potential size of the opportunity for the virtual travel industry by 2030, McKinsey research shows. The metaverse—a collective space where physical and digital worlds converge to deliver immersive, interactive user experiences—is already transforming many sectors: “You can attend concerts, shop, test products, visit attractions, and take workshops, all without physically traveling anywhere,” observe McKinsey partner Margaux Constantin and colleagues. To gain first-mover advantage in the metaverse, travel companies may need to focus on four key steps, including finding the right talent. “Developing any offer will likely require new skills—not just to make your immersive world look good but to ensure that it’s smooth and exhilarating to use,” say the McKinsey experts.
That’s Camille Kroely, chief metaverse and Web3 officer at L’Oréal, on how the global cosmetics company trains its own executives to become proficient in digital skills. L’Oréal has launched several initiatives to blend the physical, digital, and virtual worlds in its product marketing—for example, by working with avatar developers to offer virtual makeup looks. “Upskilling is now part of our road map,” says Kroely. “One of the first things that we did with our executive committee members was to train and upskill them so that they can keep learning by themselves—such as opening their own wallets, customizing their avatars, testing the platform, and going on [the online game platform] Roblox.”
Creating a dedicated digital unit can sometimes be the most efficient way to quickly spread innovative digital practices throughout the organization. That strategy has worked for Saudi Arabia’s Alinma Bank, according to chief digital officer Sami Al-Rowaithey. Recently, the bank established a “digital factory,” says Al-Rowaithey in a conversation with McKinsey partner Sonia Wedrychowicz. “It’s responsible for managing the entire digital structure and its associated strategies, business development, product innovation, experience management, performance management, and digital profitability,” he says. While that may seem like a daunting task, it helps to act fast, adds Al-Rowaithey. “The first thing I would say is to not spend a lot of time trying to figure out all the details of the journey. It’s important to start quick, learn as you go, be flexible, modify, and move on.”
Clones may seem creepy in science fiction, but they can be highly practical in the real world. Digital twins—virtual replicas of physical objects, systems, or processes—are increasingly popular in product development and manufacturing. “Interacting with or modifying a product in a virtual space can be quicker, easier, and safer than doing so in the real world,” note McKinsey partners Roberto Argolini and Johannes Deichmann and colleagues. For example, a factory digital twin can model floor layouts, optimize the footprint, and estimate inventory size. “Eventually, we may see the emergence of digital twins capable of learning from their own experiences, identifying opportunities and offering product improvement suggestions entirely autonomously,” say the McKinsey experts.
Lead by virtualizing.
– Edited by Rama Ramaswami, senior editor, New York
Share these insights
Did you enjoy this newsletter? Forward it to colleagues and friends so they can subscribe too. Was this issue forwarded to you? Sign up for it and sample our 40+ other free email subscriptions here.
This email contains information about McKinsey’s research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this email because you subscribed to the Leading Off newsletter.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "McKinsey Leading Off" <publishing@email.mckinsey.com> - 04:41 - 15 Apr 2024 -
Do you know how healthy your organization is?
On Point
New measures of organizational health
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
•
Power practices. “The research is very clear: organizational health propels long-term performance,” McKinsey partner Brooke Weddle shares on a recent episode of McKinsey Talks Talent. To support organizational health, companies have to get four power practices right, adds partner Bryan Hancock, who joined Weddle on the podcast. These are strategic clarity—which means having clear and measurable goals—role clarity, personal ownership, and competitive insights (that is, understanding where businesses fit in versus the competition).
•
Decisive leadership. As leaders grapple with disruptions such as generative AI and persistent economic uncertainty, they’re discovering the need to be both decisive and empowering. McKinsey’s recently updated Organizational Health Index highlights decisive leadership as one of the best predictors of organizational health. Decisive leaders are able to quickly make and follow through on decisions, rather than leaning on their authority to get things done, Weddle says. Explore new measures of organizational health and three steps leaders can take to get started.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 01:08 - 15 Apr 2024 -
The week in charts
The Week in Charts
Latino representation in Hollywood, gen AI risk assessments, and more
Share these insights
Did you enjoy this newsletter? Forward it to colleagues and friends so they can subscribe too. Was this issue forwarded to you? Sign up for it and sample our 40+ other free email subscriptions here.
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this email because you subscribed to The Week in Charts newsletter.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "McKinsey Week in Charts" <publishing@email.mckinsey.com> - 03:38 - 13 Apr 2024 -
EP107: Top 9 Architectural Patterns for Data and Communication Flow
This week’s system design refresher:
Top 9 Architectural Patterns for Data and Communication Flow
How Netflix Really Uses Java?
Top 6 Cloud Messaging Patterns
What Are the Most Important AWS Services To Learn?
SPONSOR US
Register for POST/CON 24 | Save 20% Off (Sponsored)
Postman’s annual user conference will be one of 2024’s biggest developer events and it features a great mix of high-level talks from tech executives and hands-on training from some of the best developers in the world.
Learn: Get first-hand knowledge from Postman experts and global tech leaders.
Level up: Attend 8-hour workshops to leave with new skills (and badges!)
Become the first to know: See the latest API platform innovations, including advancements in AI.
Help shape the future of Postman: Give direct feedback to the Postman leadership team.
Network with fellow API practitioners and global tech leaders — including speakers from OpenAI, Heroku, and more.
Have fun: Enjoy cocktails, dinner, 360° views of the city, and a live performance from multi-platinum recording artist T-Pain!
Use code PCBYTEBYTEGO20 and save 20% off your ticket!
Top 9 Architectural Patterns for Data and Communication Flow
Peer-to-Peer
The Peer-to-Peer pattern involves direct communication between two components without the need for a central coordinator.
API Gateway
An API Gateway acts as a single entry point for all client requests to the backend services of an application.
Pub-Sub
The Pub-Sub pattern decouples the producers of messages (publishers) from the consumers of messages (subscribers) through a message broker (a small sketch follows this list).
Request-Response
This is one of the most fundamental integration patterns, where a client sends a request to a server and waits for a response.
Event Sourcing
Event Sourcing involves storing the state changes of an application as a sequence of events.
ETL
ETL is a data integration pattern used to gather data from multiple sources, transform it into a structured format, and load it into a destination database.
Batching
Batching involves accumulating data over a period or until a certain threshold is met before processing it as a single group.
Streaming Processing
Streaming Processing allows for the continuous ingestion, processing, and analysis of data streams in real time.
Orchestration
Orchestration involves a central coordinator (an orchestrator) managing the interactions between distributed components or services to achieve a workflow or business process.
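As a small illustration of the Pub-Sub pattern above, here is a minimal in-process broker in Go. Real systems put a dedicated message broker between the two sides, but the decoupling idea is the same: publishers and subscribers only know the broker and a topic name, never each other.

```go
package main

import (
	"fmt"
	"sync"
)

// broker fans each published message out to every subscriber of a topic.
type broker struct {
	mu   sync.Mutex
	subs map[string][]chan string
}

func newBroker() *broker { return &broker{subs: map[string][]chan string{}} }

func (b *broker) subscribe(topic string) <-chan string {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan string, 8) // buffered so publish does not block
	b.subs[topic] = append(b.subs[topic], ch)
	return ch
}

func (b *broker) publish(topic, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs[topic] {
		ch <- msg // fan out to every subscriber of the topic
	}
}

func main() {
	b := newBroker()
	orders := b.subscribe("orders")
	audit := b.subscribe("orders")

	b.publish("orders", "order-42 created")
	fmt.Println(<-orders, "|", <-audit) // both subscribers receive the event
}
```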
Latest articles
If you’re not a paid subscriber, here’s what you missed.
To receive all the full articles and support ByteByteGo, consider subscribing:
How Netflix Really Uses Java?
Netflix is predominantly a Java shop.
Every backend application (including internal apps, streaming, and movie production apps) at Netflix is a Java application.
However, the Java stack is not static and has gone through multiple iterations over the years. Here are the details of those iterations:
API Gateway
Netflix follows a microservices architecture. Every piece of functionality and data is owned by a microservice built using Java (initially version 8).
This means that rendering one screen (such as the List of Lists of Movies, or LOLOMO) involved fetching data from tens of microservices. But making all these calls from the client created a performance problem.
Netflix initially used the API Gateway pattern using Zuul to handle the orchestration.
BFFs with Groovy & RxJava
Using a single gateway for multiple clients was a problem for Netflix because each client (such as TV, mobile apps, or web browser) had subtle differences.
To handle this, Netflix used the Backend-for-Frontend (BFF) pattern. Zuul was moved to the role of a proxy.
In this pattern, every frontend or UI gets its own mini backend that performs the request fanout and orchestration for multiple services.
The BFFs were built using Groovy scripts, and the service fanout was done using RxJava for thread management.
GraphQL Federation
The Groovy and RxJava approach required more work from the UI developers in creating the Groovy scripts. Also, reactive programming is generally hard.
Recently, Netflix moved to GraphQL Federation. With GraphQL, a client can specify exactly what set of fields it needs, thereby solving the problem of overfetching and underfetching with REST APIs.
The GraphQL Federation takes care of calling the necessary microservices to fetch the data.
These microservices are called Domain Graph Service (DGS) and are built using Java 17, Spring Boot 3, and Spring Boot Netflix OSS packages. The move from Java 8 to Java 17 resulted in 20% CPU gains.
More recently, Netflix has started to migrate to Java 21 to take advantage of features like virtual threads.
Top 6 Cloud Messaging Patterns
How do services communicate with each other? The diagram below shows 6 cloud messaging patterns.
🔹 Asynchronous Request-Reply
This pattern aims at providing determinism for long-running backend tasks. It decouples backend processing from frontend clients.
In the diagram below, the client makes a synchronous call to the API, triggering a long-running operation on the backend. The API returns an HTTP 202 (Accepted) status code, acknowledging that the request has been received for processing.
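A minimal Go sketch of that flow is below: one handler returns 202 Accepted with a status URL while the work runs in the background, and a second handler lets the client poll for the result. The paths and the fixed job ID are illustrative only.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

var (
	mu      sync.Mutex
	results = map[string]string{} // job ID -> result, an illustrative store
)

func main() {
	// POST /jobs starts a long-running task and immediately returns
	// 202 Accepted plus a status URL instead of blocking the client.
	http.HandleFunc("/jobs", func(w http.ResponseWriter, r *http.Request) {
		id := "job-1" // a real service would generate a unique ID
		go func() {   // the long-running work happens asynchronously
			mu.Lock()
			results[id] = "done"
			mu.Unlock()
		}()
		w.Header().Set("Location", "/jobs/"+id)
		w.WriteHeader(http.StatusAccepted) // 202: accepted for processing
	})

	// GET /jobs/job-1 lets the client poll until the result is ready.
	http.HandleFunc("/jobs/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		res, ok := results["job-1"]
		mu.Unlock()
		if ok {
			fmt.Fprintln(w, res)
			return
		}
		w.WriteHeader(http.StatusAccepted) // still processing
	})

	http.ListenAndServe(":8080", nil)
}
```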
🔹 Publisher-Subscriber
This pattern decouples senders from consumers and avoids blocking the sender while it waits for a response.
🔹 Claim Check
This pattern solves the transmission of large messages. It stores the whole message payload in a database and transmits only a reference to the message, which is used later to retrieve the payload from the database.
🔹 Priority Queue
This pattern prioritizes requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.
🔹 Saga
Saga is used to manage data consistency across multiple services in distributed systems, especially in microservices architectures where each service manages its own database.
The saga pattern addresses the challenge of maintaining data consistency without relying on distributed transactions, which are difficult to scale and can negatively impact system performance.
🔹 Competing Consumers
This pattern enables multiple concurrent consumers to process messages received on the same messaging channel. There is no need to configure complex coordination between the consumers. However, this pattern cannot guarantee message ordering.
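The pattern maps naturally onto Go channels: several goroutines range over one shared channel, and whichever consumer is free takes the next message. A minimal sketch:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	queue := make(chan int) // one shared messaging channel
	var wg sync.WaitGroup

	// Three competing consumers read from the same queue. No explicit
	// coordination is needed, but processing order is not guaranteed.
	for c := 1; c <= 3; c++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for msg := range queue {
				fmt.Printf("consumer %d processed message %d\n", id, msg)
			}
		}(c)
	}

	for msg := 1; msg <= 9; msg++ {
		queue <- msg
	}
	close(queue) // closing the channel lets the consumers exit
	wg.Wait()
}
```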
Reference: Azure Messaging patterns
What Are the Most Important AWS Services To Learn?
Since its inception in 2006, AWS has rapidly evolved from simple offerings like S3 and EC2 to an expansive, versatile cloud ecosystem.
Today, AWS provides a highly reliable, scalable infrastructure platform with over 200 services in the cloud, powering hundreds of thousands of businesses in 190 countries around the world.
For both newcomers and seasoned professionals, navigating the broad set of AWS services is no small feat.
From computing power, storage options, and networking capabilities to database management, analytics, and machine learning, AWS provides a wide array of tools that can be daunting to understand and master.
Each service is tailored to specific needs and use cases, requiring a deep understanding of not just the services themselves, but also how they interact and integrate within an IT ecosystem.
This attached illustration can serve as both a starting point and a quick reference for anyone looking to demystify AWS and focus their efforts on the services that matter most.
It provides a visual roadmap, outlining the foundational services that underpin cloud computing essentials, as well as advanced services catering to specific needs like serverless architectures, DevOps, and machine learning.
Over to you: What has your journey been like with AWS so far?
SPONSOR US
Get your product in front of more than 500,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing hi@bytebytego.com.
© 2024 ByteByteGo
548 Market Street PMB 72296, San Francisco, CA 94104
by "ByteByteGo" <bytebytego@substack.com> - 11:35 - 13 Apr 2024 -
Embarking on a generative AI reset
Plus, the role of generative AI in the creative design of physical products
Although generative AI has been one of the top items on many companies' agendas this past year, capturing its value is proving harder than expected. 2024 is the year for generative AI to prove its value, which will require companies to make fundamental changes and reconfigure the business, write Eric Lamarre, Alex Singla, Alexander Sukharevsky, and Rodney Zemmel in our featured article. Other relevant topics include:
•
the still-untapped value of offering more opportunities to Latino talent in Hollywood
•
the power of collaboration between the CEO and the CMO
•
how to harness the full potential of generative AI in the design of physical products
•
how financial institutions can future-proof themselves against growing cyber risks
Our editors' picks
THIS MONTH'S HIGHLIGHTS
Latinos in Hollywood: Amplifying voices, expanding horizons
Improving the representation of Latinos, their stories, and their culture in front of and behind the camera could generate billions of dollars in new revenue for the industry, according to a new McKinsey study.
Close the gap
An analysis of the CEO–CMO relationship and its effect on growth
CEOs recognize the expertise and importance of chief marketing officers, as well as their role in the company's growth, but a strategic disconnect remains in the C-suite. Here is how to close the gap.
Take a holistic approach
Generative AI powers the creative design of physical products, but it is not a magic wand
Generative AI tools can significantly shorten the design life cycles of physical products and spur innovation, but the knowledge and discretion of design experts are needed to mitigate potential pitfalls.
Unlock creativity
How generative AI can help banks manage risk and regulatory compliance
Over the next five years, generative AI could fundamentally change financial institutions' risk management by automating, accelerating, and improving everything from regulatory compliance to climate risk control.
Winning strategies for generative AI
The cyber clock is ticking: Reducing the risk of emerging technologies in financial services
As financial institutions actively adopt emerging technologies, they should act now to future-proof themselves against growing cyber risks.
Act now
The growing complexity of serving on a board of directors
Board experts explain how directors can cope with the demands of increasingly packed agendas.
Balance responsibilities
We hope you enjoy this month's selection of articles in Spanish and invite you to explore the following articles in English as well.
McKinsey Explainers
Find direct answers to complex questions, backed by McKinsey’s expert insights.
Learn more
McKinsey Themes
Browse our essential reading on the topics that matter.
Get up to speed
McKinsey on Books
Explore this month’s best-selling business books prepared exclusively for McKinsey Publishing by Circana.
See the lists
McKinsey Chart of the Day
See our daily chart that helps explain a changing world—as we strive for sustainable, inclusive growth.
Dive in
McKinsey Classics
Significant improvements in risk management can be gained quickly through selective digitization—but capabilities must be test hardened before release. Read our 2017 classic “Digital risk: Transforming risk management for the 2020s” to learn more.
Rewind
Leading Off
Our Leading Off newsletter features revealing research and inspiring interviews to empower you—and those you lead.
Subscribe now
— Edited by Joyce Yoo, editor, New York
SHARE THESE INSIGHTS
Did you enjoy this newsletter? Forward it to colleagues and friends so they can subscribe too. Was this issue forwarded to you? Sign up and sample our 40+ free email subscriptions here.
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this email because you are a registered member of our Destacados newsletter.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Destacados de McKinsey" <publishing@email.mckinsey.com> - 08:19 - 13 Apr 2024 -
RE: Seeking an agent in Saudi Arabia
Dear agent,
I trust you are doing well.
I sincerely invite you to join the GLA platform and cover the increasing volume of requests out of Saudi Arabia from our GLA members.
Now, at the beginning of 2024, is the best time to join the GLA family.
Last but not least, I wish you, your family, and your company a prosperous 2024! See you at the GLA Dubai Conference for a great talk if possible!
I look forward to your feedback regarding your GLA membership application.
Best regards,
Ella Liu
GLA Overseas Department
Mobile:
(86) 199 2879 7478
Email:
Company:
GLA Co.,Ltd
Website:
Address:
No. 2109, 21st Floor, HongChang, Plaza, No. 2001, Road Shenzhen, China
- The 10th GLA Global Logistics Conference will be held from 29 April to 2 May 2024 in Dubai, UAE.
Registration Link: https://www.glafamily.com/meeting/registration/Ella
- The 8th GLA Conference in Bangkok, Thailand, online album: https://live.photoplus.cn/live/76884709?accessFrom=qrcode&userType=null&state=STATE
- The 9th GLA Conference in Hainan, China, online album: https://live.photoplus.cn/live/61916679?accessFrom=qrcode&userType=null&state=STATE&code=0717A0200FZ46R10fJ200EJbI337A02u#/live
【Notice Agreement No 7】
7. The GLA president reserves the right to cancel or reject any membership or application.
A company shall cease to be a member of GLA if:
a) the Member does not adhere to GLA terms and conditions;
b) the Member gives notice of resignation in writing to the GLA;
c) the Member does not have a good reputation in the market; or
d) the Member has bad debt records on the GLA platform or in the market.
by "Ella Liu" <member139@glafamily.com> - 02:08 - 12 Apr 2024 -
Want to know how much the space sector could be worth by 2035?
On Point
A new WEF–McKinsey report
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
•
Space economy boom. Activity in space is accelerating. In 2023, “backbone” applications for space (such as satellites and services like GPS) made up $330 billion of the global space economy. “Reach” applications, which describe space technologies that companies across industries use to help generate revenues, represented another $300 billion. Over the next decade, space applications are expected to grow at an annual rate that is double the projected rate of GDP growth, McKinsey senior partner Ryan Brukardt and coauthors reveal.
•
Four key findings. To learn what will shape the future space economy, the World Economic Forum and McKinsey spoke with more than 60 leaders and experts across the globe. The team also used market forecasts and other data to develop original estimates for the size of each part of the space economy, including commercial services and state-sponsored civil and defense applications. To read our key findings on the space economy, download the full report, Space: The $1.8 trillion opportunity for economic growth.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 01:13 - 12 Apr 2024 -
White Label Fuel Monitoring Software
A versatile fuel monitoring system that can be tailored to fit any brand or company's fuel sensor, regardless of type or model.
Catch a glimpse of what our software has to offer
Uffizio Technologies Pvt. Ltd., 4th Floor, Metropolis, Opp. S.T Workshop, Valsad, Gujarat, 396001, India
by "Sunny Thakur" <sunny.thakur@uffizio.com> - 08:00 - 11 Apr 2024 -
Embracing Chaos to Improve System Resilience: Chaos Engineering
Latest articles
If you’re not a subscriber, here’s what you missed this month.
To receive all the full articles and support ByteByteGo, consider subscribing:
Imagine it's the early 2000s, and you're a developer with a bold idea. You want to test your software not in a safe, controlled environment but right where the action is: the production environment. This is where real users interact with your system. Back then, suggesting something like this might have gotten you some strange looks from your bosses. But now, testing in the real world is not just okay; it's often recommended.
Why the big change? A few reasons stand out. Systems today are more complex than ever, pushing us to innovate faster and ensure our services are both reliable and strong. The rise of cloud technology, microservices, and distributed systems has changed the game. We've had to adapt our methods and mindsets accordingly.
Our goal now is to make systems that can handle anything—be it a slowdown or a full-blown outage. Enter Chaos Engineering.
In this issue, we dive into what chaos engineering is all about. We'll break down its key principles, how it's practiced, and examples from the real world. You'll learn how causing a bit of controlled chaos can actually help find and fix weaknesses before they become major problems.
Prepare to see how embracing chaos can lead to stronger, more reliable systems. Let's get started!
What is Chaos Engineering?
So, what exactly is chaos engineering? It's a way to deal with unexpected issues in software development and keep systems up and running. Some folks might think that a server running an app will continue without a hitch forever. Others believe that problems are just part of the deal and that downtime is inevitable.
Chaos engineering strikes a balance between these views. It recognizes that things can go wrong but asserts that we can take steps to prevent these issues from impacting our systems and the performance of our apps.
This approach involves experimenting on our live, production systems to identify weak spots and areas that aren't as reliable as they should be. It's about measuring how much we trust our system's resilience and working to boost that confidence.
However, it's important to understand that being 100% sure nothing will go wrong is unrealistic. Through chaos engineering, we intentionally introduce unexpected events to uncover vulnerabilities. These events can vary widely, such as taking down a server randomly, disrupting a data center, or tampering with load balancers and application replicas.
In short, chaos engineering is about designing experiments that rigorously test our systems' robustness.
Defining Chaos Engineering
There are many ways to describe chaos engineering, but here's a definition that captures its essence well, sourced from https://principlesofchaos.org/.
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
This definition highlights the core objective of chaos engineering: to ensure our systems can handle the unpredictable nature of real-world operations.
Performance Engineering vs. Chaos Engineering
When we talk about ensuring our systems run smoothly, two concepts often come up: performance engineering and chaos engineering. Let's discuss what sets these two apart and how they might overlap.
Many developers are already familiar with performance engineering, which is in the same family as DevOps. It involves using a combination of tools, processes, and technologies to monitor our system's performance and make continuous improvements. This includes conducting various types of testing, such as load, stress, and endurance tests, all aimed at boosting the performance of our applications.
On the flip side, chaos engineering is about intentionally breaking things. Yes, this includes stress testing, but it's more about observing how systems respond under unexpected stress. Stress testing could be seen as a form of chaos experiment. So, one way to look at it is to consider performance engineering as a subset of chaos engineering or the other way around, depending on how you apply these practices.
Another way to view these two is as distinct disciplines within an organization. One team might focus solely on conducting chaos experiments and learning from the failures, while another might immerse itself in performance engineering tasks like testing and monitoring. Depending on the structure of the organization, the skill sets of the team, and various other factors, we might have separate teams for each discipline or one team that tackles both.
Chaos Engineering in Practice
Let's consider an example to better understand chaos engineering. Imagine we have a system with a load balancer that directs requests to web servers. These servers then connect to a payment service, which, in turn, interacts with a third-party API and a cache service, all located in Availability Zone A. If the payment service fails to communicate with the third-party API or the cache, requests need to be rerouted to Availability Zone B to maintain high availability.
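To make the scenario concrete, here is a heavily simplified Go sketch of such an experiment: we deliberately break Zone A's dependency and verify that requests are rerouted to Zone B. Every type and name here is an illustrative stand-in for real infrastructure, not a chaos engineering tool.

```go
package main

import (
	"errors"
	"fmt"
)

// zone stands in for the payment path through one availability zone.
type zone struct {
	name   string
	broken bool // the fault we inject during the experiment
}

// callAPI fails when a fault has been injected, simulating the
// third-party API or cache becoming unreachable.
func (z zone) callAPI() error {
	if z.broken {
		return errors.New(z.name + ": dependency unreachable")
	}
	return nil
}

// processPayment tries Zone A first and falls back to Zone B, which is
// exactly the behavior the chaos experiment is meant to verify.
func processPayment(a, b zone) (string, error) {
	if err := a.callAPI(); err == nil {
		return a.name, nil
	}
	if err := b.callAPI(); err == nil {
		return b.name, nil
	}
	return "", errors.New("both zones failed")
}

func main() {
	// Steady state: traffic flows through Zone A.
	fmt.Println(processPayment(zone{name: "zone-a"}, zone{name: "zone-b"}))

	// Chaos experiment: break Zone A's dependency on purpose and check
	// that requests are served from Zone B instead of failing.
	served, err := processPayment(zone{name: "zone-a", broken: true}, zone{name: "zone-b"})
	fmt.Println(served, err) // expect: zone-b <nil>
}
```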
Continue reading this post for free, courtesy of Alex Xu.
A subscription gets you:
An extra deep dive on Thursdays
Full archive
Many expense it with their team's learning budget
© 2024 ByteByteGo
548 Market Street PMB 72296, San Francisco, CA 94104
by "ByteByteGo" <bytebytego@substack.com> - 11:37 - 11 Apr 2024 -
Maximise ROI: Insights from the 2023 Observability Forecast Report
New Relic
Our latest report unveils pivotal insights for businesses considering observability. The 2023 Observability Forecast, with data collected from 1,700 technology professionals across 15 countries, dives into observability practices and their impacts on costs and revenue.
The report highlights the ROI of observability according to respondents, including driving business efficiency, security, and profitability.
Access the blog now that rounds up key insights to understand the potential impact of observability on your business.
Read Now Need help? Let's get in touch.
This email is sent from an account used for sending messages only. Please do not reply to this email to contact us—we will not get your response.
This email was sent to info@learn.odoo.com. Update your email preferences.
For information about our privacy practices, see our Privacy Policy.
Need to contact New Relic? You can chat or call us at +44 20 3859 9190.
Strand Bridge House, 138-142 Strand, London WC2R 1HH
© 2024 New Relic, Inc. All rights reserved. The New Relic logo is a trademark of New Relic, Inc.
by "New Relic" <emeamarketing@newrelic.com> - 07:11 - 11 Apr 2024 -
How can the world jump-start productivity?
On Point
Why economies need productivity growth
Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
•
Productivity slowdown. Over the past 25 years, strong growth in productivity has helped raise the world’s living standards, with median economy productivity jumping sixfold, McKinsey Global Institute chair Sven Smit and coauthors reveal. More recently, though, productivity growth has faded, with a near-universal slowdown since the 2008 global financial crisis. In advanced economies, productivity growth fell to less than 1% in the decade between 2012 and 2022. Over the same period, productivity growth also fell in emerging economies.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 11:05 - 10 Apr 2024 -
Unlock Data Transformation at Sumo Logic's Booth - AWS Summit London
Sumo Logic
Experience our cutting-edge solutions at Booth B36.
Dear Mohammad,
I'm thrilled to extend an invitation for you to visit the Sumo Logic booth at the AWS Summit London on the 24th of April at the ExCel London.
Discover how Sumo Logic can transform your approach to data management, ensuring security, scalability, and efficiency. Our experts will showcase our latest innovations and discuss tailored solutions to address your specific needs.
Here are compelling reasons to visit Booth B36:
- Find out how you can power DevSecOps from only one cloud native platform
- Learn more about our $0 data ingest approach
- Explore configuration management and compliance reporting
- Discover how we leverage AI/ML to uncover and mitigate threats
Claim your free entry ticket using this link.
We eagerly anticipate your presence at Booth B36!
Best regards,
Sumo Logic
About Sumo Logic
Sumo Logic is the pioneer in continuous intelligence, a new category of software to address the data challenges presented by digital transformation, modern applications, and cloud computing.
Sumo Logic, Aviation House, 125 Kingsway, London WC2B 6NH, UK
© 2024 Sumo Logic, All rights reserved.
Unsubscribe
by "Sumo Logic" <marketing-info@sumologic.com> - 06:01 - 10 Apr 2024 -
What does it take to become a CFO?
On Point
5 priorities for CFO hopefuls Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
Take career risks. What matters most for securing the CFO job? To find out, McKinsey senior partner Andy West and coauthors spoke with former CFOs, such as Arun Nayar, Tyco International’s former EVP and CFO and PepsiCo’s former CFO of global operations. Earlier in his career, Nayar realized that to rise higher, he would need to gain operational know-how. After obtaining a role overseeing finance in PepsiCo’s global operations division, an area he knew nothing about, he formed the No Fear Club, through which he mentors others in finance roles.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 01:11 - 10 Apr 2024 -
Reddit's Architecture: The Evolutionary Journey
Reddit's Architecture: The Evolutionary Journey
The comprehensive developer resource for B2B User Management (Sponsored)
Building an enterprise-ready, resilient B2B auth is one of the more complex tasks developers face these days. Today, even smaller startups are demanding security features like SSO that used to be the purview of the Fortune 500.
The latest guide from WorkOS covers the essentials of modern day user management for B2B apps — from 101 topics to more advanced concepts that include:
→ SSO, MFA, and sessions
→ Bot policies, org auth policies, and UI considerations
→ Identity linking, email verification, and just-in-time provisioning
This resource also presents an easier alternative for supporting user management.
Reddit's Architecture: The Evolutionary Journey
Reddit was founded in 2005 with the vision to become “the front page of the Internet”.
Over the years, it has evolved into one of the most popular social networks on the planet fostering tens of thousands of communities built around the passions and interests of its members. With over a billion monthly users, Reddit is where people come to participate in conversations on a vast array of topics.
Some interesting numbers that convey Reddit’s incredible popularity are as follows:
Reddit has 1.2 billion unique monthly visitors, turning it into a virtual town square.
Reddit’s monthly active user base has grown by 366% since 2018, demonstrating the demand for online communities.
In 2023 alone, an astonishing 469 million posts flooded Reddit’s servers, resulting in 2.84 billion comments and interactions.
Reddit ranked as the 18th most visited website globally in 2023, raking in $804 million in revenue.
Looking at the stats, it’s no surprise that their recent IPO launch was a huge success, propelling Reddit to a valuation of around $6.4 billion.
While the monetary success might be attributed to the leadership team, it wouldn’t have been possible without the fascinating journey of architectural evolution that helped Reddit achieve such popularity.
In this post, we will go through this journey and look at some key architectural steps that have transformed Reddit.
The Early Days of Reddit
Reddit was originally written in Lisp but was rewritten in Python in December 2005.
Lisp was great but the main issue at the time was the lack of widely used and tested libraries. There was rarely more than one library choice for any task and the libraries were not properly documented.
Steve Huffman (one of the founders of Reddit) expressed this problem in his blog:
“Since we're building a site largely by standing on the shoulders of others, this made things a little tougher. There just aren't as many shoulders on which to stand.”
When it came to Python, they initially used a web framework named web.py, developed by Aaron Swartz (another Reddit co-founder). Later, in 2009, Reddit started to use Pylons as its web framework.
The Core Components of Reddit’s Architecture
The below diagram shows the core components of Reddit’s high-level architecture.
While Reddit has many moving parts and things have also evolved over the years, this diagram represents the overall scaffolding that supports Reddit.
The main components are as follows:
Content Delivery Network: Reddit uses a CDN from Fastly as a front for the application. The CDN handles a lot of decision logic at the edge to figure out how a particular request will be routed based on the domain and path.
Front-End Applications: Reddit started using jQuery in early 2009. Later on, they also started using Typescript to redesign their UI and moved to Node.js-based frameworks to embrace a modern web development approach.
The R2 Monolith: In the middle is the giant box known as r2. This is the original monolithic application, built using Python, and it consists of functionalities like Search and entities such as Things and Listings. We will look at the architecture of r2 in more detail in the next section.
From an infrastructure point of view, Reddit decommissioned the last of its physical servers in 2009 and moved the entire website to AWS. They had been one of the early adopters of S3 and were using it to host thumbnails and store logs for quite some time.
However, in 2008, they decided to move batch processing to AWS EC2 to free up more machines to work as application servers. The system worked quite well and in 2009 they completely migrated to EC2.
R2 Deep Dive
As mentioned earlier, r2 is the core of Reddit.
It is a giant monolithic application and has its own internal architecture as shown below:
For scalability reasons, the same application code is deployed and run on multiple servers.
The load balancer sits in the front and performs the task of routing the request to the appropriate server pool. This makes it possible to route different request paths such as comments, the front page, or the user profile.
Expensive operations, such as a user voting or submitting a link, are deferred to an asynchronous job queue via RabbitMQ. The messages are placed in the queue by the application servers and are handled by the job processors.
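To make the deferral pattern concrete, here is a minimal sketch of an application server publishing a vote event to RabbitMQ with the Python pika client. The queue name and payload fields are illustrative assumptions, not Reddit's actual code.

```python
import json

import pika  # RabbitMQ client library

# Connection, queue, and payload names here are illustrative assumptions.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="vote_queue", durable=True)

# Instead of updating vote counts synchronously, the web request just
# enqueues the event; a job processor consumes it later.
vote_event = {"user_id": 42, "link_id": "t3_abc123", "direction": 1}
channel.basic_publish(
    exchange="",
    routing_key="vote_queue",
    body=json.dumps(vote_event),
    properties=pika.BasicProperties(delivery_mode=2),  # persist across broker restarts
)
connection.close()
```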
From a data storage point of view, Reddit relies on Postgres for its core data model. To reduce the load on the database, they place memcache clusters in front of Postgres. Also, they use Cassandra quite heavily for new features mainly because of its resiliency and availability properties.
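A cache-aside read against this kind of setup might look like the following sketch, using pymemcache and psycopg2 for illustration; the table, column, and key names are assumptions.

```python
import json

import psycopg2
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
db = psycopg2.connect("dbname=reddit")  # hypothetical DSN

def get_link(link_id):
    """Cache-aside: serve from memcache when possible, else hit Postgres."""
    key = f"link:{link_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with db.cursor() as cur:
        cur.execute("SELECT title, url FROM links WHERE id = %s", (link_id,))
        title, url = cur.fetchone()
    link = {"title": title, "url": url}
    cache.set(key, json.dumps(link), expire=300)  # short TTL bounds staleness
    return link
```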
The Expansion Phase
As Reddit has grown in popularity, its user base has skyrocketed. To keep users engaged, Reddit has added many new features, and both the scale and the complexity of the application have gone up.
These changes have created a need to evolve the design in multiple areas. While design and architecture are an ever-changing process and small changes continue to occur daily, there have been concrete developments in several key areas.
Let’s look at them in more detail to understand the direction Reddit has taken when it comes to architecture.
GraphQL Federation with Golang Microservices
Reddit started its GraphQL journey in 2017. Within 4 years, the clients of the monolithic application had fully adopted GraphQL.
GraphQL is an API specification that allows clients to request only the data they want. This makes it a great choice for a multi-client system where each client has slightly different data needs.
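For example, a client that only renders post previews can ask for exactly the fields it needs and nothing more. The endpoint, query shape, and field names below are assumptions for illustration.

```python
import requests

# A hypothetical GraphQL query: the client declares exactly which fields
# it needs, and the server returns nothing more.
query = """
query PostPreview($id: ID!) {
  post(id: $id) {
    title
    score
    commentCount
  }
}
"""

resp = requests.post(
    "https://gql.example.com/query",  # placeholder endpoint
    json={"query": query, "variables": {"id": "t3_abc123"}},
    timeout=5,
)
print(resp.json()["data"]["post"])
```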
In early 2021, they also started moving to GraphQL Federation with a few major goals:
Retiring the monolith
Improving concurrency
Encouraging separation of concerns
GraphQL Federation is a way to combine multiple smaller GraphQL APIs (also known as subgraphs) into a single, large GraphQL API (called the supergraph). The supergraph acts as a central point for client applications to send queries and receive data.
When a client sends a query to the supergraph, the supergraph figures out which subgraphs have the data needed to answer that query. It routes the relevant parts of the query to those subgraphs, collects the responses, and sends the combined response back to the client.
In 2022, the GraphQL team at Reddit added several new Go subgraphs for core Reddit entities like Subreddits and Comments. These subgraphs took over ownership of existing parts of the overall schema.
During the transition phase, the Python monolith and new Go subgraphs work together to fulfill queries. As more and more functionalities are extracted to individual Go subgraphs, the monolith can be eventually retired.
The below diagram shows this gradual transition.
One major requirement for Reddit was to handle the migration of functionality from the monolith to a new Go subgraph incrementally.
They wanted to ramp up traffic gradually to evaluate error rates and latencies, while retaining the ability to switch back to the monolith in case of any issues.
Unfortunately, the GraphQL Federation specification doesn’t offer a way to support this gradual migration of traffic. Therefore, Reddit went for a Blue/Green subgraph deployment as shown below:
In this approach, the Python monolith and Go subgraph share ownership of the schema. A load balancer sits between the gateway and the subgraphs to send traffic to the new subgraph or the monolith based on a configuration.
With this setup, they could also control the percentage of traffic handled by the monolith or the new subgraph, resulting in better stability of Reddit during the migration journey.
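Conceptually, the weighted routing decision boils down to something like the sketch below. Reddit's version lives in load balancer configuration rather than application code; the backend names and percentage are illustrative.

```python
import random

# Percentage of traffic sent to the new Go subgraph; ramped up gradually
# as error rates and latencies stay healthy. Illustrative value only.
SUBGRAPH_WEIGHT = 5

def pick_backend():
    """Weighted blue/green choice between the new subgraph and the monolith."""
    return random.choices(
        ["go-subgraph", "python-monolith"],
        weights=[SUBGRAPH_WEIGHT, 100 - SUBGRAPH_WEIGHT],
    )[0]

print(pick_backend())
```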
As of the last update, the migration is still ongoing with minimal disruption to the Reddit experience.
Data Replication with CDC and Debezium
In the early stages, Reddit supported data replication for their databases using WAL segments created by the primary database.
WAL or write-ahead log is a file that records all changes made to a database before they are committed. It ensures that if there’s a failure during a write operation, the changes can be recovered from the log.
To prevent this replication from bogging down the primary database, Reddit used a special tool to continuously archive PostgreSQL WAL files to S3 from where the replica could read the data.
However, there were a few issues with this approach:
Since the daily snapshots ran at night, there were inconsistencies in the data during the day.
Frequent schema changes to databases caused issues with snapshotting the database and replication.
The primary and replica databases ran on EC2 instances, making the replication process fragile. There were multiple failure points such as a failing backup to S3 or the primary node going down.
To make data replication more reliable, Reddit decided to use a streaming Change Data Capture (CDC) solution using Debezium and Kafka Connect.
Debezium is an open-source project that provides a low-latency data streaming platform for CDC.
Whenever a row is added, deleted, or modified in Postgres, Debezium listens to these changes and writes them to a Kafka topic. A downstream connector reads from the Kafka topic and updates the destination table with the changes.
The below diagram shows this process.
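On the consuming side, a downstream connector is conceptually a Kafka consumer that applies each change event. The sketch below uses the kafka-python client with an assumed topic name and the standard Debezium message envelope.

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Topic name is an assumption; Debezium conventionally names topics
# after the server, schema, and table.
consumer = KafkaConsumer(
    "postgres.public.links",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    payload = message.value["payload"]
    op = payload["op"]      # "c" = create, "u" = update, "d" = delete
    row = payload["after"]  # row state after the change (None on delete)
    # A real connector would upsert or delete this row in the target table.
    print(op, row)
```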
Moving to CDC with Debezium has been a great move for Reddit since they can now perform real-time replication of data to multiple target systems.
Also, instead of bulky EC2 instances, the entire process can be managed by lightweight Debezium pods.
Managing Media Metadata at Scale
Reddit hosts billions of posts containing various media content such as images, videos, gifs, and embedded third-party media.
Over the years, users have been uploading media content at an accelerating pace. Therefore, media metadata became crucial for enhancing searchability and organization for these assets.
There were multiple challenges with Reddit’s old approach to managing media metadata:
The data was distributed and scattered across multiple systems.
There were inconsistent storage formats and varying query patterns for different asset types. In some cases, they were even querying S3 buckets for the metadata information which is extremely inefficient at Reddit scale.
No proper mechanism for auditing changes, analyzing content, and categorizing metadata.
To overcome these challenges, Reddit built a brand new media metadata store with some high-level requirements:
Move all existing media metadata from different systems under a common roof.
Support data retrieval at the scale of 100K read requests per second with less than 50 ms latency.
Support data creation and updates.
The choice of the data store was between Postgres and Cassandra. Reddit finally went with AWS Aurora Postgres due to the challenges of running ad-hoc debugging queries in Cassandra and the risk that any data not denormalized up front would be unsearchable.
The below diagram shows a simplified overview of Reddit’s media metadata storage system.
As you can see, there’s just a simple service interfacing with the database, handling reads and writes through APIs.
Though the design was not complicated, the challenge lay in transferring several terabytes of data from various sources to the new database while ensuring that the system continued to operate correctly.
The migration process consisted of multiple steps:
Enable dual writes into the metadata APIs from clients of media metadata.
Backfill data from older databases to the new metadata store.
Enable dual reads on media metadata from the service clients.
Monitor data comparison for every read request and fix any data issues.
Ramp up read traffic to the new metadata store.
Check out the below diagram that shows this setup in more detail.
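The dual-write and dual-read phases can be sketched as follows. The store clients and function names are hypothetical stand-ins, not Reddit's actual services.

```python
# Hypothetical in-memory stand-ins for the old and new metadata stores.
class Store:
    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)

legacy_store = Store()
new_metadata_store = Store()

def write_media_metadata(asset_id, metadata):
    """Step 1: dual writes, so every write lands in both stores."""
    legacy_store.write(asset_id, metadata)
    new_metadata_store.write(asset_id, metadata)

def read_media_metadata(asset_id):
    """Steps 3-4: dual reads, comparing results while still serving old data."""
    old = legacy_store.read(asset_id)
    new = new_metadata_store.read(asset_id)
    if old != new:
        print(f"mismatch for {asset_id}: {old!r} != {new!r}")  # fix before ramp-up
    return old
```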
After the migration was successful, Reddit adopted some scaling strategies for the media metadata store.
Table partitioning using range-based partitioning.
Serving reads from a denormalized JSONB field in Postgres.
Ultimately, they achieved an impressive read latency of 2.6 ms at p50, 4.7 ms at p90, and 17 ms at p99. Also, the data store was generally more available and 50% faster than the previous data system.
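A sketch of what those two strategies might look like in Postgres follows; the schema, table, and column names are assumptions.

```python
import psycopg2

conn = psycopg2.connect("dbname=media_metadata")  # hypothetical DSN
with conn.cursor() as cur:
    # Range partitioning by post ID keeps individual partitions small
    # and lets the planner prune partitions on lookups.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS media_metadata (
            post_id  BIGINT NOT NULL,
            metadata JSONB  NOT NULL,
            PRIMARY KEY (post_id)
        ) PARTITION BY RANGE (post_id)
    """)
    # Reads are served from the denormalized JSONB field in one lookup,
    # avoiding joins across normalized tables.
    cur.execute(
        "SELECT metadata->>'width', metadata->>'format' "
        "FROM media_metadata WHERE post_id = %s",
        (123456,),
    )
conn.commit()
```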
Just-in-time Image Optimization
Within the media space, Reddit also serves billions of images per day.
Users upload images for their posts, comments, and profiles. Since these images are consumed on different types of devices, they need to be available in several resolutions and formats. Therefore, Reddit transforms these images for different use cases such as post previews, thumbnails, and so on.
Since 2015, Reddit has relied on third-party vendors to perform just-in-time image optimization. Image handling wasn’t their core competency and therefore, this approach served them well over the years.
However, with an increasing user base and traffic, they decided to move this functionality in-house to manage costs and control the end-to-end user experience.
The below diagram shows the high-level architecture for image optimization setup.
They built two backend services for transforming the images:
The Gif2Vid service resizes and transcodes GIFs to MP4s on-the-fly. Though users love the GIF format, it’s an inefficient choice for animated assets due to its larger file sizes and higher computational resource demands.
The image optimizer service deals with all other image types. It uses govips which is a wrapper around the libvips image manipulation library. The service handles the majority of cache-miss traffic and handles image transformations like blurring, cropping, resizing, overlaying images, and format conversions.
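For illustration, the same libvips operations can be sketched with pyvips, the Python binding to libvips (Reddit's actual service uses the Go wrapper govips); the file names are placeholders.

```python
import pyvips  # Python binding to libvips

# Load an uploaded image once, then produce the renditions a post needs.
image = pyvips.Image.new_from_file("upload.png")

thumbnail = image.thumbnail_image(140)  # small preview unit
blurred = image.gaussblur(8)            # e.g., a spoiler or NSFW blur
half_res = image.resize(0.5)            # half-resolution variant

# Format conversion is driven by the output file extension.
thumbnail.write_to_file("thumb.webp")
```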
Overall, moving the image optimization in-house was quite successful:
Costs for Gif2Vid conversion were reduced to a mere 0.9% of the original cost.
The p99 cache-miss latency for encoding animated GIFs was down from 20s to 4s.
The bytes served for static images were down by approximately 20%.
Real-Time Protection for Users at Reddit’s Scale
A critical functionality for Reddit is moderating content that violates the policies of the platform. This is essential to keep Reddit safe as a website for the billions of users who see it as a community.
In 2016, they developed a rules engine named Rule-Executor-V1 (REV1) to curb policy-violating content on the site in real time. REV1 enabled the safety team to create rules that would automatically take action based on activities like users posting new content or leaving comments.
For reference, a rule is just a Lua script that is triggered on specific configured events. In practice, a rule can be as simple as the illustrative sketch below:
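```lua
-- Illustrative reconstruction: the field names, the actions helper, and
-- the topic name are assumptions based on the description, not Reddit's API.
if post.body_text == "some bad text" then
    actions.publish("safety_actions_topic", {
        action  = "report_user",
        user_id = post.author_id,
    })
end
```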
In this example, the rule checks whether a post’s text body matches a string “some bad text”. If yes, it performs an asynchronous action on the user by publishing an action to an output Kafka topic.
However, REV1 needed some major improvements:
It ran on a legacy infrastructure of raw EC2 instances. This wasn’t in line with all modern services on Reddit that were running on Kubernetes.
Each rule ran as a separate process in a REV1 node and required vertical scaling as more rules were launched. Beyond a certain point, vertical scaling became expensive and unsustainable.
REV1 used Python 2.7, which had been deprecated.
Rules weren't version-controlled, and it was difficult to understand the history of changes.
There was no staging environment for testing rules in a sandboxed manner.
In 2021, the Safety Engineering team within Reddit developed a new streaming infrastructure called Snooron. It was built on top of Flink Stateful Functions to modernize REV1’s architecture. The new system was known as REV2.
The below diagram shows both REV1 and REV2.
Some of the key differences between REV1 and REV2 are as follows:
In REV1, all configuration of rules was done via a web interface. With REV2, the configuration primarily happens through code. However, there are UI utilities to make the process simpler.
In REV1, ZooKeeper was used as the store for rules. With REV2, rules are stored in GitHub for better version control and are also persisted to S3 for backup and periodic updates.
In REV1, each rule had its own process that would load the latest code when triggered. However, this caused performance issues when too many rules were running at the same time. REV2 follows a different approach that uses Flink Stateful Functions for handling the stream of events and a separate Baseplate application that executes the Lua code.
In REV1, the actions triggered by rules were handled by the main R2 application. However, REV2 works differently. When a rule is triggered, it sends out structured Protobuf actions to multiple action topics. A new application called the Safety Actioning Worker, built using Flink Statefun, receives and processes these instructions to carry out the actions.
Reddit’s Feed Architecture
Feeds are the backbone of social media and community-based websites.
Millions of people use Reddit’s feeds every day and it’s a critical component of the website’s overall usability. There were some key goals when it came to developing the feed architecture:
The architecture should enable high development velocity and scale well. Since many teams integrate with the feeds, they need to be able to understand, build, and test them quickly.
TTI (Time to Interactive) and scroll performance should be satisfactory since they are critical to user engagement and the overall Reddit experience.
Feeds should be consistent across different platforms such as iOS, Android, and the website.
To support these goals, Reddit built an entirely new, server-driven feeds platform. Some major changes were made in the backend architecture for feeds.
Earlier, each post was represented by a Post object that contained all the information a post may have. It was like sending the kitchen sink over the wire, and with new post types, the Post object grew quite large over time.
This was also a burden on the client. Each client app contained a bunch of cumbersome logic to determine what should be shown on the UI. Most of the time, this logic was out of sync across platforms.
With the changes to the architecture, they moved away from the big object and instead sent only a description of the exact UI elements the client would render. The backend controlled the type of elements and their order. This approach is also known as Server-Driven UI.
For example, the post unit is represented by a generic Group object that contains an array of Cell objects. The below image shows the change in response structure for the Announcement item and the first post in the feed.
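A rough shape of such a response, modeled here as Python dataclasses; the real contract is defined in Reddit's GraphQL schema, and these names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    """One UI element within a post unit."""
    type: str                 # e.g. "title", "thumbnail", "vote_bar"
    content: dict = field(default_factory=dict)

@dataclass
class Group:
    """A post unit: the server decides which cells appear and in what order."""
    id: str
    cells: List[Cell] = field(default_factory=list)

post_unit = Group(
    id="t3_abc123",
    cells=[
        Cell(type="title", content={"text": "An interesting post"}),
        Cell(type="thumbnail", content={"url": "https://example.com/thumb.webp"}),
        Cell(type="vote_bar", content={"score": 128}),
    ],
)
```

Because clients simply render the cells in order, the same backend change shows up consistently on iOS, Android, and the web.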
Reddit’s Move from Thrift to gRPC
In the initial days, Reddit had adopted Thrift to build its microservices.
Thrift enables developers to define a common interface (or API) that enables different services to communicate with each other, even if they are written in different programming languages. Thrift takes the language-independent interface and generates code bindings for each specific language.
This way, developers can make API calls from their code using syntax that looks natural for their programming language, without having to worry about the underlying cross-language communication details.
Over the years, the engineering teams at Reddit built hundreds of Thrift-based microservices, and though it served them quite well, Reddit’s growing needs made it costly to continue using Thrift.
gRPC came on the scene in 2016 and achieved significant adoption within the Cloud-native ecosystem.
Some of the advantages of gRPC are as follows:
It provided native support for HTTP/2 as a transport protocol
There was native support for gRPC in several service mesh technologies such as Istio and Linkerd
Public cloud providers also support gRPC-native load balancers
While gRPC had several benefits, the cost of switching was non-trivial. However, it was a one-time cost whereas building feature parity in Thrift would have been an ongoing maintenance activity.
Reddit decided to make the transition to gRPC. The below diagram shows the design they used to start the migration process:
The main component is the Transitional Shim. Its job is to act as a bridge between the new gRPC protocol and the existing Thrift-based services.
When a gRPC request comes in, the shim converts it into an equivalent Thrift message format and passes it on to the existing code just like native Thrift. When the service returns a response object, the shim converts it back into the gRPC format.
There are three main parts to this design:
The interface definition language (IDL) converter that translates the Thrift service definitions into the corresponding gRPC interface. This component also takes care of adapting framework idioms and differences as appropriate.
A code-generated gRPC servicer that handles the message conversions for incoming and outgoing messages between Thrift and gRPC.
A pluggable module for the services to support both Thrift and gRPC.
This design allowed Reddit to gradually transition to gRPC by reusing its existing Thrift-based service code while controlling the costs and effort required for the migration.
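Stripped of the code generation, the shim boils down to a translate-call-translate wrapper. The sketch below uses hypothetical message types and method names to show the idea; it is not Reddit's generated code.

```python
# Hypothetical message classes standing in for generated Thrift/gRPC types.
class ThriftGetLinkRequest:
    def __init__(self, link_id):
        self.link_id = link_id

class ExistingThriftService:
    """The existing, unchanged Thrift-based service code."""
    def get_link(self, request):
        return {"title": f"Link {request.link_id}"}

class LinkServiceShim:
    """Code-generated in reality: converts gRPC <-> Thrift at the boundary."""
    def __init__(self, thrift_handler):
        self.thrift_handler = thrift_handler

    def GetLink(self, grpc_request):
        # 1. Translate the incoming gRPC message to its Thrift equivalent.
        thrift_request = ThriftGetLinkRequest(grpc_request.link_id)
        # 2. Invoke the existing handler exactly as native Thrift would.
        thrift_reply = self.thrift_handler.get_link(thrift_request)
        # 3. Translate the Thrift reply back into the gRPC response format.
        return thrift_reply  # stands in for building a gRPC response message
```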
Conclusion
Reddit’s architectural journey has been one of continual evolution, driven by its meteoric growth and changing needs over the years. What began as a monolithic Lisp application was rewritten in Python, but this monolithic approach couldn’t keep pace as Reddit’s popularity exploded.
The company then embarked on an ambitious transition to a service-based architecture. Each new feature and each issue they faced prompted changes to the overall design in areas such as user protection, media metadata management, communication channels, data replication, API management, and so on.
In this post, we’ve attempted to capture the evolution of Reddit’s architecture from the early days to the latest changes based on the available information.
SPONSOR US
Get your product in front of more than 500,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing hi@bytebytego.com.
© 2024 ByteByteGo
548 Market Street PMB 72296, San Francisco, CA 94104
Unsubscribe
by "ByteByteGo" <bytebytego@substack.com> - 11:35 - 9 Apr 2024 -
What’s the state of private markets around the world?
On Point
2024 Global Private Markets Review Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
Fundraising decline. Macroeconomic headwinds persisted throughout 2023, with rising financing costs and an uncertain growth outlook taking a toll on private markets. Last year, fundraising fell 22% across private market asset classes globally to just over $1 trillion, as of year-end reported data—the lowest total since 2017, McKinsey senior partner Fredrik Dahlqvist and coauthors reveal. As private market managers look to boost performance, a deeper focus on revenue growth and margin expansion will be needed now more than ever.
—Edited by Belinda Yu, editor, Atlanta
This email contains information about McKinsey's research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this newsletter because you subscribed to the Only McKinsey newsletter, formerly called On Point.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "Only McKinsey" <publishing@email.mckinsey.com> - 01:08 - 9 Apr 2024 -
Feeling good: A guide to empathetic leadership
I feel for you Brought to you by Liz Hilton Segel, chief client officer and managing partner, global industry practices, & Homayoun Hatami, managing partner, global client capabilities
An “empath” is defined as a person who is highly attuned to the feelings of others, often to the point of experiencing those emotions themselves. This may sound like a quality that’s more helpful in someone’s personal life than in their work life, where emotions should presumably be kept at bay. Yet it turns out that the opposite may be true: displaying empathy in appropriate ways can be a powerful component of effective leadership. And when it’s combined with other factors such as technology and innovation, it can lead to exceptional business performance, such as breakthroughs in customer service. This week, we explore some strategies for leaders to enhance their empathy quotient.
“When I train leaders in empathy, one of the first hurdles I need to get over is this stereotype that empathy is too soft and squishy for the work environment,” says Stanford University research psychologist Jamil Zaki in an episode of our McKinsey Talks Talent podcast. Speaking with McKinsey partners Bryan Hancock and Brooke Weddle, Zaki suggests that, contrary to the stereotype, empathy can be a workplace superpower, helping to alleviate employee burnout and spurring greater productivity and creativity. Boosting managers’ empathetic skills is more important now than ever: “There’s evidence that during the time that social media has taken over so much of our lives, people’s empathy has also dropped,” says Zaki. Ways to improve leaders’ empathy in the workplace include infusing more empathy into regular conversations, rewarding empathic behavior, and automating some non-human-centric responsibilities so that managers can focus on mentorship.
That’s business strategist and author John Horn on why it’s important for leaders to try to understand the mindset of their competitors. “When people become more senior in business, they’ll need to think about competitors directly,” he says in a conversation with McKinsey on cognitive empathy. While it may not be possible to ask competitors about their strategies, technology such as AI, machine learning, and big data may be able to provide competitive-insight tools based on a predictive factor. “You could possibly use that data to look across various industries and learn [for example] what has been the response when prices were raised? That’s thinking about the likely behavior your competitors will have,” says Horn.
The qualities that make a leader empathetic can be found within oneself and in the situations of daily life. “Try to unlock the energy and the aspiration that lies within us,” suggests McKinsey senior partner and chief legal officer Pierre Gentin in a McKinsey Legal Podcast interview. What Gentin calls “arrows of inspiration”—whether they come from art, music, literature, or people—may be available to leaders every day. He believes that “a lot of this is just opening our eyes and really recognizing what’s in front of us and transitioning from the small gauge and the negative to the big gauge and the transformative.” For example, in an organizational context, people may need to look beyond short-term concerns and consider “what they can do outside of the four corners of the definition of their role,” he says. “How can we collaborate together to do creative things and to do exciting things and to do worthwhile things?”
There are times when empathy can be too much of a good thing, or leaders can show it in the wrong way. An obvious mistake is to pay lip service to the idea but take limited or no action. In one case, an IT group in a company worked overtime for many weeks during the pandemic. Leaders praised the group lavishly, but the only action they took was to buy everyone on the team a book on time management. Another error may be to avoid forming the strong relationships that lead to good communication, understanding, and compassion. And empathy by itself—if it’s not balanced by other aspects of emotional intelligence such as self-regulation and composure—can actually derail leadership effectiveness.
Lead empathetically.
– Edited by Rama Ramaswami, senior editor, New York
Share these insights
Did you enjoy this newsletter? Forward it to colleagues and friends so they can subscribe too. Was this issue forwarded to you? Sign up for it and sample our 40+ other free email subscriptions here.
This email contains information about McKinsey’s research, insights, services, or events. By opening our emails or clicking on links, you agree to our use of cookies and web tracking technology. For more information on how we use and protect your information, please review our privacy policy.
You received this email because you subscribed to the Leading Off newsletter.
Copyright © 2024 | McKinsey & Company, 3 World Trade Center, 175 Greenwich Street, New York, NY 10007
by "McKinsey Leading Off" <publishing@email.mckinsey.com> - 04:37 - 8 Apr 2024