Inside Airbnb’s AI-Powered Pipeline to Migrate Tests: Months of Work in Days
Disclaimer: The details in this post have been derived from the articles/videos shared online by the Airbnb Engineering Team. All credit for the technical details goes to the Airbnb Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Code migrations are usually a slow affair. Dependencies change, frameworks evolve, and teams get stuck rewriting thousands of lines that don’t even change product behavior.

That was the situation at Airbnb. Thousands of React test files still relied on Enzyme, a tool that hadn’t kept up with modern React patterns. The goal was clear: move everything to React Testing Library (RTL). However, with over 3,500 files in scope, the effort appeared to be a year-long grind of manual rewrites.

Instead, the team finished it in six weeks. The turning point was the use of AI, specifically Large Language Models (LLMs), not just as assistants but as core agents in an automated migration pipeline. By breaking the work into structured, per-file steps, injecting rich context into prompts, and systematically tuning feedback loops, the team transformed what looked like a long, manual slog into a fast, scalable process.

This article unpacks how that migration happened. It covers the structure of the automation pipeline, the trade-offs between prompt engineering and brute-force retries, the methods used to handle complex edge cases, and the results that followed.
The Need for Migration

Enzyme, adopted in 2015, provided fine-grained access to the internal structure of React components. This approach matched earlier versions of React, where testing internal state and component hierarchy was a common pattern.

By 2020, Airbnb had shifted all new test development to React Testing Library (RTL). RTL encourages testing components from the perspective of how users interact with them, focusing on rendered output and behavior, not implementation details. This shift reflects modern React testing practices, which prioritize maintainability and resilience to refactoring.

However, thousands of existing test files at Airbnb were still using Enzyme. Migrating them introduced several challenges:

- Enzyme and RTL embody different testing philosophies, so there was no simple one-to-one mapping between the two styles of test.
- With over 3,500 files in scope, a manual rewrite was estimated to take roughly 18 months of engineering effort.
- Every migrated file had to preserve the original test intent and code coverage.
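To make the philosophical difference concrete, here is an illustrative before-and-after for a hypothetical Counter component (this example is ours, not Airbnb's): the Enzyme version inspects internal state, while the RTL version drives the component the way a user would.

```tsx
import React from "react";
import { render, screen, fireEvent } from "@testing-library/react";
import "@testing-library/jest-dom";
import Counter from "./Counter"; // hypothetical component under test

// Enzyme style (before): reaches into implementation details.
//   const wrapper = shallow(<Counter />);
//   wrapper.find("button").simulate("click");
//   expect(wrapper.state("count")).toEqual(1);

// RTL style (after): interacts via the rendered UI and asserts on
// what the user actually sees.
test("increments the displayed count on click", () => {
  render(<Counter />);
  fireEvent.click(screen.getByRole("button", { name: /increment/i }));
  expect(screen.getByText(/count: 1/i)).toBeInTheDocument();
});
```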
The migration was necessary to standardize testing across the codebase and support future React versions, but it had to be automated to be feasible.

Migration Strategy and Proof of Concept

The first indication that LLMs could handle this kind of migration came during a 2023 internal hackathon. A small team tested whether a large language model could convert Enzyme-based test files to RTL. Within days, the prototype successfully migrated hundreds of files. The results were promising in terms of both accuracy and speed.

That early success laid the groundwork for a full-scale solution. In 2024, the engineering team formalized the approach into a scalable migration pipeline. The goal was clear: automate the transformation of thousands of test files, with minimal manual intervention, while preserving test intent and coverage.

To get there, the team broke the migration process into discrete, per-file steps that could be run independently and in parallel. Each step handled a specific task, like replacing Enzyme syntax, fixing Jest assertions, or resolving lint and TypeScript errors. When a step failed, the system invoked an LLM to rewrite the file using contextual information.

This modular structure made the pipeline easy to debug, retry, and extend. More importantly, it made it possible to run migrations across hundreds of files concurrently, accelerating throughput without sacrificing quality.

Pipeline Design and Techniques

Here are the key components of the pipeline design and the various techniques involved.

1 - Step-Based Workflow

To scale migration reliably, the team treated each test file as an independent unit moving through a step-based state machine. This structure enforced validation at every stage, ensuring that transformations passed real checks before advancing. Each file advanced through the pipeline only if the current step succeeded. If a step failed, the system paused progression, invoked an LLM to refactor the file based on the failure context, and then re-validated before continuing.

Key stages in the workflow included:

- Refactoring Enzyme calls into RTL equivalents
- Fixing Jest assertions and test setup
- Resolving lint and TypeScript errors
- Running the tests to validate the final result
This approach worked for the following reasons:

- Each file moved through the pipeline independently, so hundreds of migrations could run in parallel.
- Validation gates at every step kept broken transformations from advancing.
- Failures were localized to a specific step in a specific file, which made debugging and retrying straightforward.

A minimal sketch of such a per-file step runner follows.
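The sketch below shows the general shape of a per-file, step-based runner. It is a simplified illustration under assumed names (MigrationStep, fixWithLLM), not Airbnb's actual code.

```ts
// A minimal sketch of a per-file step runner. All names are illustrative;
// fixWithLLM stands in for whatever model call the real pipeline made.
type StepResult = { ok: boolean; errors: string[] };

interface MigrationStep {
  name: string;
  run(filePath: string): Promise<StepResult>;
}

async function migrateFile(
  filePath: string,
  steps: MigrationStep[],
  fixWithLLM: (file: string, step: string, errors: string[]) => Promise<void>,
  maxRetries = 5
): Promise<boolean> {
  for (const step of steps) {
    let result = await step.run(filePath);
    // On failure, ask the model to rewrite the file using the failure
    // context, then re-validate before advancing to the next step.
    for (let attempt = 0; !result.ok && attempt < maxRetries; attempt++) {
      await fixWithLLM(filePath, step.name, result.errors);
      result = await step.run(filePath);
    }
    if (!result.ok) return false; // park the file for a later targeted rerun
  }
  return true; // every step validated
}
```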
This structured approach provided a foundation for automation to succeed at scale. It also set up the necessary hooks for advanced retry logic, context injection, and real-time debugging later in the pipeline.

2 - Retry Loops and Dynamic Prompting

Initial experiments showed that deep prompt engineering only went so far. Instead of obsessing over the perfect prompt, the team leaned into a more pragmatic solution: automated retries with incremental context updates. The idea was simple: if a migration step failed, try again with better feedback until it passed or hit a retry limit.

At each failed step, the system fed the LLM:

- The latest version of the file being migrated
- The concrete validation errors produced by the failed step
This dynamic prompting approach allowed the model to refine its output based on concrete failures, not just static instructions. Instead of guessing at improvements, the model had specific reasons why the last version didn’t pass. Each step ran inside a loop runner, which retried the operation up to a configurable maximum. This was especially effective for simple to mid-complexity files, where small tweaks (like fixing an import, renaming a variable, or adjusting test structure) often resolved the issue.

This worked for the following reasons:

- Most failures were shallow: a missing import, a renamed variable, or a small structural mismatch that one more attempt, informed by the error output, could fix.
- Retries cost compute, not engineering time, so the loop could run unattended across the whole file set.
- A configurable retry cap kept costs bounded and flagged genuinely hard files for different treatment.

A sketch of such a retry loop follows.
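Below is a minimal sketch of retry-with-feedback, under assumed names (validate, callLLM); it illustrates the feedback mechanism rather than Airbnb's implementation.

```ts
import { readFile, writeFile } from "node:fs/promises";

// A minimal sketch of a retry loop with dynamic prompting. validate and
// callLLM are placeholders for the pipeline's real checks and model API.
async function retryWithFeedback(
  filePath: string,
  validate: (file: string) => Promise<{ ok: boolean; errors: string[] }>,
  callLLM: (prompt: string) => Promise<string>,
  maxRetries = 10
): Promise<boolean> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const { ok, errors } = await validate(filePath);
    if (ok) return true;
    const source = await readFile(filePath, "utf8");
    // Each retry includes the latest file plus the concrete failure
    // output, so the model fixes real errors instead of guessing.
    const prompt = [
      "Migrate this Enzyme test to React Testing Library.",
      "The previous attempt failed with these errors:",
      errors.join("\n"),
      "Current file contents:",
      source,
    ].join("\n\n");
    await writeFile(filePath, await callLLM(prompt), "utf8");
  }
  return false; // retry budget exhausted
}
```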
Retrying with context turned out to be a better investment than engineering the “ideal” prompt up front. It allowed the pipeline to adapt without human intervention and pushed a large portion of files through successfully with minimal effort.

3 - Rich Prompt Context

Retry loops handled the bulk of test migrations, but they started to fall short when dealing with more complex files: tests with deep indirection, custom utilities, or tightly coupled setups. These cases needed more than brute-force retries. They needed contextual understanding.

To handle these, the team significantly expanded prompt inputs, pushing token counts into the 40,000 to 100,000 range. Instead of a minimal diff, the model received a detailed picture of the surrounding codebase, testing patterns, and architectural intent. Each rich prompt included:

- The source code of the component under test
- Related tests from the same directory, showing team-specific conventions
- Examples of well-written, already-migrated RTL tests with a similar structure
- General migration guidelines and common patterns

A sketch of how such a prompt might be assembled follows.
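As an illustration, a rich prompt might be assembled along these lines; the file-selection heuristics and names here are our assumptions, not Airbnb's code.

```ts
import { readFileSync } from "node:fs";

// A minimal sketch of rich-prompt assembly. The key idea from the post:
// include context that structurally resembles the file under migration,
// not simply more tokens.
function buildRichPrompt(opts: {
  testFile: string;       // the Enzyme test being migrated
  componentFile: string;  // source of the component under test
  siblingTests: string[]; // nearby tests showing team conventions
  rtlExemplars: string[]; // well-migrated RTL files of similar shape
  guidelines: string;     // general Enzyme-to-RTL migration rules
}): string {
  const read = (p: string) => `// ${p}\n${readFileSync(p, "utf8")}`;
  return [
    opts.guidelines,
    "Component under test:",
    read(opts.componentFile),
    "Related tests in the same area:",
    ...opts.siblingTests.map(read),
    "Examples of well-migrated RTL tests:",
    ...opts.rtlExemplars.map(read),
    "File to migrate:",
    read(opts.testFile),
  ].join("\n\n");
}
```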
[Image: composition of a rich migration prompt. Source: Airbnb Engineering Blog]

The key insight was choosing the right context files: pulling in examples that matched the structure and logic of the file being migrated. Adding more tokens didn’t help unless those tokens carried meaningful, relevant information. By layering rich, targeted context, the LLM could infer project-specific conventions, replicate nuanced testing styles, and generate outputs that passed validations even for the hardest edge cases. This approach bridged the final complexity gap, especially in files that reused abstractions, mocked behavior indirectly, or followed non-standard test setups.

4 - Systematic Cleanup: From 75% to 97%

The first bulk migration pass handled 75% of the test files in under four hours. That left around 900 files stuck. These were too complex for basic retries and too inconsistent for a generic fix. Handling this long tail required targeted tools and a feedback-driven cleanup loop. Two capabilities made this possible.

Migration Status Annotations

Each file was automatically stamped with a machine-readable comment that recorded its migration progress. These markers helped identify exactly where a file had failed, whether in the Enzyme refactor, Jest fixes, or final validation.
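The exact annotation format appears only as a screenshot in the original post; a hypothetical marker might look like this.

```ts
// Hypothetical example of a machine-readable migration marker; the real
// format is shown only in the screenshot referenced below.
/* MIGRATION-STATUS
 * step: fix-jest            // last step attempted
 * state: failed             // passed | failed
 * attempts: 8
 * lastError: "wrapper.find is not a function"
 */
```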
[Image: example of a migration status annotation. Source: Airbnb Engineering Blog]

This gave the team visibility into patterns: common failure points, repeat offenders, and areas where LLM-generated code needed help.

Step-Specific File Reruns

A CLI tool allowed engineers to reprocess subsets of files filtered by failure step and path pattern:
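A hypothetical invocation (the real tool's name and flags are not public beyond the screenshot below):

```
$ migration-tool rerun --step fix-jest --match "projects/payments/**"
```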
[Image: step-specific rerun command. Source: Airbnb Engineering Blog]

This made it easy to focus on fixes without rerunning the full pipeline, accelerating feedback and isolating scope.

Structured Feedback Loop

To convert failure patterns into working migrations, the team used a tight iterative loop:

- Sample a handful of files stuck at the same step and inspect the common failure.
- Tune the prompts, context selection, or step scripts to address that pattern.
- Re-run the step across all files matching the pattern.
- Repeat until the remaining failures no longer share a cause.

A sketch of this loop follows.
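A minimal sketch of the cleanup loop over "stuck" files, grouped by the step recorded in their status annotations; all names here are illustrative.

```ts
// Sweep each step's stuck files repeatedly, handing off to humans
// once a rerun stops making progress.
async function cleanupLoop(
  listStuckFiles: (step: string) => Promise<string[]>,
  // rerunStep returns the files that are still failing after the sweep
  rerunStep: (step: string, files: string[]) => Promise<string[]>,
  steps: string[]
): Promise<void> {
  for (const step of steps) {
    let failing = await listStuckFiles(step);
    while (failing.length > 0) {
      // In between sweeps, an engineer inspects a sample of failures
      // and tunes the prompt or context selection for this step.
      const stillFailing = await rerunStep(step, failing);
      if (stillFailing.length === failing.length) break; // no progress: manual cleanup
      failing = stillFailing;
    }
  }
}
```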
This method wasn’t theoretical. In practice, it pushed the migration from 75% to 97% completion in just four days. For the remaining ~100 files, the system had already done most of the work. LLM outputs weren’t usable as-is, but they served as solid baselines. Manual cleanup on those final files wrapped up the migration in a matter of days, not months.

The takeaway was that brute force handled the bulk, but targeted iteration finished the job. Without instrumentation and repeatable tuning, the migration would have plateaued far earlier.

Conclusion

The results validated both the tooling and the strategy. The first bulk run completed 75% of the migration in under four hours, covering thousands of test files with minimal manual involvement. Over the next four days, targeted prompt tuning and iterative retries pushed completion to 97%. The remaining ~100 files, representing the final 3%, were resolved manually using LLM-generated outputs as starting points, cutting down the time and effort typically required for handwritten migrations.

Throughout the process, the original test intent and code coverage were preserved. The transformed tests passed validation, matched behavioral expectations, and aligned with the structural patterns encouraged by RTL. Even for complex edge cases, the baseline quality of LLM-generated code reduced the manual burden to cleanup and review, not full rewrites.

In total, the entire migration was completed in six weeks, with only six engineers involved and modest LLM API usage. Compared to the original 18-month estimate for a manual migration, the savings in time and cost were substantial.

The project also highlighted where LLMs excel:

- Repetitive, pattern-based transformations applied across thousands of similar files
- Following concrete examples and conventions when given rich, relevant context
- Producing strong baselines that reduce human work to review and cleanup, even when outputs aren’t perfect
Airbnb now plans to extend this framework to other large-scale code transformations, such as library upgrades, testing strategy shifts, or language migrations. The broader conclusion is clear: AI-assisted development can reduce toil, accelerate modernization, and improve consistency when structured properly, instrumented well, and paired with domain knowledge.

References:

- Accelerate Large-Scale Test Migration with LLMs - Airbnb Engineering Blog (Medium)
by "ByteByteGo" <bytebytego@substack.com> - 11:39 - 24 Jun 2025