Automated Bug Fixing at Facebook Scale
If there’s one thing that a majority of developers truly hate, it’s debugging. While debugging small programs isn’t fun, it can get incredibly irritating when you have to debug millions of lines of code on a Friday evening to find that elusive bug.

To make things worse, bugs (software or otherwise) are tenacious. You get rid of one and two more show up. Just when you think you’ve finally fixed the issue and started testing things out, you realize that the patch you just made is causing another crash somewhere else within those million lines. Before you know it, you are trudging your way through another bug hunt.

This is where SapFix projects itself as a game-changing tool in the field of automated bug fixing. It’s a new AI hybrid tool created by Facebook with the goal of reducing the time engineers spend on debugging. SapFix makes debugging easier by automatically generating fixes for specific issues and proposing those fixes to engineers for approval and deployment to production.

The diagram below shows the SapFix workflow at a high level. In a later section, we will see the entire process in even more detail.

It would be an understatement to say that SapFix has shown promise. Here are some facts worth considering:
If you think about it, those are 6 multi-million line code-bases and it’s still early development days for SapFix! At this point, you might wonder how SapFix is able to generate fixes for so many diverse apps with wildly different uses ranging from communication to social media to building communities.

The Role of Sapienz and Infer

The secret sauce of SapFix is the adoption of automated program repair techniques. These techniques are based on algorithms to identify, analyze and patch known software bugs without human intervention. One of the widely used approaches relies on software testing to direct the repair process. This is where Facebook leverages its automated test case design system known as Sapienz.

Sapienz uses Search-based Software Engineering (SBSE) to automatically design system-level test cases for mobile apps. Executing those test cases allows Sapienz to find hundreds of crashes per month even before they can be discovered by Facebook’s internal human testers.

Think of SBSE as having a super smart helper that explores the space of possible test inputs, trying different combinations until it finds one that exposes a problem. It's a lot like when you try different pieces of a puzzle until they fit just right.

As an estimate, Facebook’s engineers have been able to fix 75% of crashes reported by Sapienz. This indicates a very high signal-to-noise ratio for bug reports generated by Sapienz. However, to improve this figure even further, Facebook also uses Infer.

Infer is an open-source tool that helps with localization and static analysis of the fixes proposed. Like Sapienz, Infer is also deployed directly onto Facebook’s internal continuous integration system and has access to the majority of Facebook’s code base.

Sapienz and Infer collaborate with each other to provide information to developers about potential bugs such as:
However, Sapienz and Infer only provide information; they do not save the developer’s time in actually fixing the issue. Sure, their collaboration helps identify bugs and their location within the code, but most of the work involved in fixing these bugs still falls to a developer. This is where SapFix comes along and combines three important components to provide an end-to-end automated repair system:
From picking the test cases that detect the crash to fixing the issue and re-testing, SapFix takes care of the entire process as part of Facebook’s continuous integration and deployment system.

The SapFix Workflow

How does SapFix actually work? There are four types of fixes that are performed by SapFix:
- Template fix
- Mutation fix
- Full diff revert
- Partial diff revert
Below is a diagram that shows the entire workflow of how SapFix handles the process of fixing an issue based on these types. At its core, the process is simple to understand. The fix creation process receives the following input:
Based on this input, SapFix goes ahead and generates a list of revisions that can fix the crash. This list is created after SapFix has tested those revisions thoroughly. From the input to output, there are several steps involved:
The above flow may appear simple, but there are some additional nuances to it, and understanding those makes things clearer.

Template Fix and Mutation Fix

As the name suggests, the template fix and mutation fix strategies choose between template and mutation-based fixes. Template-based fixes are favored when all other parameters are equal.

But where do these templates come from? Template fixes come from another tool known as Getafix that generates patches similar to the ones human developers produced in the past. From the perspective of SapFix, Getafix is a black box that contains a bunch of template fix patterns harvested from previous successful fixes.

As far as the mutation fix strategy is concerned, SapFix currently only supports fixing Null Pointer Exception (NPE) crashes. Though Facebook has a plan to cover more mutation strategies, just focusing on NPE has also provided a good amount of success.

High Firing Crashes

If neither template-based nor mutation-based strategies produce a patch that passes all tests, SapFix attempts to revert Diffs that result in high-firing crashes. A high-firing crash is a software bug that occurs frequently or affects a large number of users. There are a couple of reasons for reverting the diff instead of trying to patch:
The revert strategies (full and partial) basically delete the change made in the Diff. In practice, reverting can mean deletion, addition, or replacement of code in the current version of the system.

Between the two types of revert strategies, SapFix generally prefers full diff revert because partial diff revert has a higher probability of knock-on adverse effects. However, new Diffs are generated every few seconds and full diff reverts can also fail due to merge conflicts with other revisions. In those cases, SapFix attempts to go for partial diff revert since the changes produced are smaller and less prone to merge conflicts.

SapFix Adoption Results

Over a period of 3 months, after SapFix was adopted, it tackled 57 crashes related to Null-Pointer Exceptions (NPE). To handle these crashes, 165 patches were created (roughly half from template and half from mutation-based repair). Out of these 165 patches, 131 were correctly built and passed all tests. Finally, 55 were reported to the developers.

Also, initial reactions from the developers were quite positive. When going through the very first SapFix-proposed patch, the developers had the feeling of “living in the future”.

However, the time taken to generate a fix presented a slightly different issue. The median time from fault detection to publishing a fix to the developer came out to be 69 minutes. The worst case was approximately 1.5 hours and the fastest one was 37 minutes after the crash was first detected.

As you can also see, the overall range of observed values is pretty wide. The main reason for this is the computational complexity of fixing an issue and the variation in workloads on the CI/CD system. Since SapFix is deployed in a highly parallel, asynchronous environment, the time from detection to publication is influenced by the current demand on the system and the availability of computing resources.
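To make the strategy preferences described earlier more concrete (template fixes before mutation fixes, and reverts only as a fallback for high-firing crashes, with full revert tried before partial revert), here is a minimal Python sketch. It is not Facebook's implementation: `Patch`, `select_patch`, the strategy functions, and the test predicate are all hypothetical names made up for illustration, and a real system would build and test candidates on CI rather than call a simple function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Patch:
    strategy: str  # which strategy produced the candidate
    summary: str   # human-readable description of the change


# Hypothetical strategy signature: given a crash identifier, return a
# candidate patch, or None if the strategy cannot produce one.
Strategy = Callable[[str], Optional[Patch]]


def select_patch(
    crash_id: str,
    patch_strategies: List[Strategy],
    revert_strategies: List[Strategy],
    passes_all_tests: Callable[[Patch], bool],
    is_high_firing: bool,
) -> Optional[Patch]:
    """Return the first candidate that passes testing, in preference order."""
    # The caller lists template fixes before mutation fixes.
    ordered = list(patch_strategies)
    # Reverts are only considered as a fallback for high-firing crashes;
    # the caller lists full revert before partial revert.
    if is_high_firing:
        ordered += revert_strategies
    for strategy in ordered:
        candidate = strategy(crash_id)
        if candidate is not None and passes_all_tests(candidate):
            return candidate
    return None


# Illustrative stand-ins for the four strategies (assumed behavior).
def template_fix(crash_id: str) -> Optional[Patch]:
    return None  # no matching template for this crash


def mutation_fix(crash_id: str) -> Optional[Patch]:
    return Patch("mutation", "add a null check before the dereference")


def full_revert(crash_id: str) -> Optional[Patch]:
    return None  # e.g. the full revert hit a merge conflict


def partial_revert(crash_id: str) -> Optional[Patch]:
    return Patch("partial_revert", "revert only the offending change")


chosen = select_patch(
    "crash-1",
    [template_fix, mutation_fix],
    [full_revert, partial_revert],
    passes_all_tests=lambda p: p.strategy != "mutation",  # mutation patch fails tests
    is_high_firing=True,
)
print(chosen.strategy if chosen else "no patch")
```

In this run the template strategy finds nothing, the mutation patch fails testing, and the full revert is unavailable, so the sketch falls through to the partial revert. Keeping the preference order in plain lists makes it easy to see, and to change, which strategy is tried first.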
Lessons Learned from SapFix

Facebook’s main philosophy behind SapFix was to focus on the industrial deployment of an automated repair system rather than academic research. Therefore, most of the decisions were focused on this goal. Though much remains to be done, Facebook also learned a lot of lessons from SapFix that they have shared. Here are a few important ones:
© 2024 ByteByteGo |
by "ByteByteGo" <bytebytego@substack.com> - 11:40 - 5 Mar 2024