How Slack Built a Distributed Cron Execution System for Scale
Have you ever stretched a simple tool to its absolute limits before you upgraded?

We do it all the time. And so do the big companies that operate on a massive scale. This is because simple things can sometimes take you much further than you initially think. Of course, you may have to pay with some toil and tears.

This is exactly what Slack, a $28 billion company with 35+ million users, did for its cron execution workflow that handles critical functionalities. Instead of moving to some other new-age solutions, they rebuilt their cron execution system from the ground up to run jobs reliably at their scale.

In today’s post, we’ll look at how Slack architected a distributed cron execution system and the choices made in the overall design.

The Role of Cron Jobs at Slack

As you already know, Slack is one of the most popular platforms for team collaboration. Due to its primary utility as a communication tool, Slack is super dependent on the right notification reaching the right person at the right time.

However, as the platform witnessed user and feature growth, Slack faced challenges in maintaining the reliability of its notification system, which largely depended on cron jobs.

For reference, cron jobs are used to automate repetitive tasks. You can configure a cron job to ensure that specific scripts or commands run at predefined intervals without manual intervention.
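To make "predefined intervals" concrete, here is a minimal Python sketch of how a five-field cron expression can be checked against a point in time. The function names `field_matches` and `cron_matches` are made up for illustration, and the sketch simplifies real cron: it requires all five fields to match and does not handle names such as `MON` or shortcuts such as `@daily`. It is not the Go-based cron library that Slack uses.

```python
from datetime import datetime


def field_matches(field: str, value: int, low: int) -> bool:
    """Return True if one cron field (e.g. '*/15', '1-5', '0,30') matches value.

    `low` is the smallest legal value for the field (0 for minutes, 1 for days).
    """
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, value  # '*' covers the whole range
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Check 'minute hour day-of-month month day-of-week' against a datetime.

    Simplification (assumption): every field must match. Real cron treats the
    day-of-month and day-of-week fields specially when both are restricted.
    """
    minute, hour, dom, month, dow = expr.split()
    return (
        field_matches(minute, when.minute, 0)
        and field_matches(hour, when.hour, 0)
        and field_matches(dom, when.day, 1)
        and field_matches(month, when.month, 1)
        and field_matches(dow, when.isoweekday() % 7, 0)  # cron: 0 = Sunday
    )
```

With this sketch, the expression `*/15 9 * * 1-5` matches 09:00, 09:15, 09:30, and 09:45 on weekdays and nothing else.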
Cron jobs play a crucial role in Slack's notification system by making sure that messages and reminders reach users on time. A lot of the critical functionality at Slack relies on these cron scripts. For example,
As Slack grew, so did the number of cron scripts and the amount of data they processed. Ultimately, this caused a dip in the overall reliability of the execution environment.

The Issues with Cron Jobs

Some of the challenges and issues Slack faced with their original cron execution approach were as follows:
To solve these issues, Slack decided to build a brand new cron execution service that was more reliable and scalable than the original approach.

The High-Level Cron Execution Architecture

The diagram below shows the high-level cron execution architecture.

There are 3 main components in this design: the Scheduled Job Conductor, the Job Queue, and a Vitess database table. Let’s look at each of them in more detail.

Scheduled Job Conductor

Slack developed a new execution service. It was written in Go and deployed on Bedrock.

Bedrock is Slack’s in-house platform that wraps around Kubernetes, providing an additional abstraction layer and functionality for Slack’s specific needs. It builds upon Kubernetes and adds some key features such as:
The new service mimics the behavior of cron by utilizing a Go-based cron library while benefiting from the scalability and reliability provided by Kubernetes.

The diagram below shows how the Scheduled Job Conductor works in practice.

It has some key properties that we should consider.

1 - Scalability through Kubernetes Deployment

By deploying the cron execution service on Bedrock, Slack gains the ability to easily scale up multiple pods as needed. As you might be aware, Kubernetes provides a flexible infrastructure for containerized applications. You can dynamically adjust the number of pods based on the workload.

2 - Leader Follower Architecture

Interestingly, Slack's cron execution service does not process requests on multiple pods simultaneously. Instead, they adopt a leader-follower architecture, where only one pod (the leader) is responsible for scheduling jobs, while the other pods remain in standby mode.

This design decision may seem counterintuitive, as it appears to introduce a single point of failure. However, the Slack team determined that synchronizing the nodes would be a more significant challenge than the potential risk of having a single leader.

A couple of advantages of the leader-follower architecture are as follows:
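The leader-follower arrangement itself can be illustrated with a small lease-based sketch in Python. The article does not say how Slack elects its leader, so everything here is an assumption: the `Lease` and `Pod` classes are hypothetical, the lease lives in memory instead of in a shared coordination store, and timestamps are passed in by hand so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Hypothetical shared lock record (assumption, not Slack's mechanism).

    A real system would keep this in an external store and update it with an
    atomic compare-and-swap; this in-memory version only shows the logic.
    """
    holder: Optional[str] = None
    expires_at: float = 0.0


class Pod:
    def __init__(self, name: str, lease: Lease, ttl: float = 10.0):
        self.name = name
        self.lease = lease
        self.ttl = ttl  # seconds a lease stays valid without renewal

    def try_lead(self, now: float) -> bool:
        """Acquire or renew the lease; return True if this pod is the leader."""
        free = self.lease.holder is None or now >= self.lease.expires_at
        if free or self.lease.holder == self.name:
            self.lease.holder = self.name
            self.lease.expires_at = now + self.ttl
            return True
        return False

    def tick(self, now: float) -> str:
        """The leader schedules jobs; every other pod stays on standby."""
        return "schedule" if self.try_lead(now) else "standby"


lease = Lease()
pod_a = Pod("pod-a", lease)
pod_b = Pod("pod-b", lease)

first = pod_a.tick(now=0.0)      # pod-a takes the lease and schedules
second = pod_b.tick(now=1.0)     # pod-b sees a valid lease and stands by
takeover = pod_b.tick(now=11.0)  # pod-a stopped renewing, so pod-b takes over
```

Only the lease holder schedules, so there is nothing to synchronize between pods, and a standby pod takes over once the leader stops renewing its lease.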
3 - Offloading Resource-Intensive Tasks

The job conductor service is only responsible for job scheduling. The actual execution is handled by worker nodes.

This separation of concerns allows the cron execution service to focus on job scheduling while the job queue handles resource-intensive tasks.

The Job Queue

Slack's cron execution service relies on a powerful asynchronous compute platform called the Job Queue to handle the resource-intensive task of running scripts.

The Job Queue is a critical component of Slack's infrastructure, processing a whopping 9 billion jobs per day.

The Job Queue consists of a series of so-called theoretical “queues” through which various types of jobs flow. Each script triggered by a cron job is treated as a single job within the Job Queue.

See the diagram below for reference:

The key components of the job queue architecture are as follows:
Slack achieves several important benefits by using the Job Queue:
Vitess Database Table for Job Tracking

To boost the reliability of their cron execution service, Slack also employed a Vitess table for deduplication and job tracking.

Vitess is a database clustering system for horizontal scaling of MySQL. It provides a scalable and highly available solution for managing large-scale data.

A couple of important requirements handled by Vitess are as follows:

1 - Deduplication

Within Slack’s original cron system, they used flock, a Linux utility for managing locking in scripts so that only one copy of a script runs at a time.

While this approach worked fine, there were cases where a script’s execution time exceeded its recurrence interval, leading to the possibility of two copies running concurrently.

To address this issue, Slack introduced a Vitess table for deduplication. Here’s how it works:
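As a rough illustration of table-based deduplication, the Python sketch below records each run as a row and refuses to enqueue a script that still has an active run. This is not Slack's schema or code: the table name `cron_runs`, its columns, the state names, and the helper functions are all assumptions, and an in-memory SQLite database stands in for Vitess.

```python
import sqlite3

# Hypothetical state names; the article only says the table tracks job state.
ACTIVE_STATES = ("enqueued", "in_progress")


def open_tracking_table() -> sqlite3.Connection:
    """Create an in-memory table that stands in for the Vitess table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cron_runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " script TEXT NOT NULL,"
        " state TEXT NOT NULL)"
    )
    return conn


def try_enqueue(conn: sqlite3.Connection, script: str) -> bool:
    """Record a new run unless an earlier run of the same script is active."""
    with conn:  # commits on success, rolls back on error
        active = conn.execute(
            "SELECT COUNT(*) FROM cron_runs WHERE script = ? AND state IN (?, ?)",
            (script, *ACTIVE_STATES),
        ).fetchone()[0]
        if active:
            return False  # a copy is still running, so skip this run
        conn.execute(
            "INSERT INTO cron_runs (script, state) VALUES (?, 'enqueued')",
            (script,),
        )
        return True


def mark_done(conn: sqlite3.Connection, script: str) -> None:
    """Mark the active run of a script as finished."""
    with conn:
        conn.execute(
            "UPDATE cron_runs SET state = 'done'"
            " WHERE script = ? AND state IN (?, ?)",
            (script, *ACTIVE_STATES),
        )
```

Because the run state lives in a table instead of a file lock on one machine, any pod that becomes the leader can make the same check, and the same rows can double as a record of each run.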
The diagram below shows some sample data stored in Vitess.

2 - Job Tracking and Monitoring

The Vitess table also helps with job tracking and monitoring since it contains information about the state of each job.

The job tracking functionality is exposed through a simple web page that displays the execution details. Developers can easily look up the state of their script runs and any errors encountered during execution.

Conclusion

One can debate whether using cron was the right decision for Slack when it came to job scheduling.

But it can also be said that organizations that choose the simplest solution for critical functionalities are more likely to grow into organizations of Slack’s size. On the other hand, companies that deploy solutions that try to solve all the problems when they have 0 customers never become that big. The difference is an unrelenting focus on solving the business problem rather than building fancy solutions from the beginning.

Slack’s journey from a single cron box to a sophisticated, distributed cron execution service shows how simple solutions can be used to build large-scale systems.
© 2024 ByteByteGo
by "ByteByteGo" <bytebytego@substack.com> - 11:36 - 21 May 2024