Types-First: A Scalable New Architecture for Flow

 
TL;DR: The types-first architecture unlocks Flow’s potential at scale by leveraging fully typed module boundaries. We plan to migrate to the new architecture over the coming months.
Flow is designed to type check extremely large projects. Over the past several years, we’ve introduced some big changes to help Flow scale to codebases of 10M+ lines of code, just barely keeping pace with the growth of Facebook’s codebases. For example, lazy mode only checks files affected by local changes, instead of the entire codebase. However, Flow still has to run type analysis on the dependencies of the changed files, a fundamental inefficiency that multiplies the cost of each change.
We are introducing a new architecture known as “types-first”, which allows us to avoid checking dependencies. It is critical to making Flow faster and allowing us to continue to scale. The benefits extend to other problematic areas, such as error messages and debuggability.
Types-first exploits full type annotations at module boundaries to make launching IDE services and running rechecks multiple times faster.

Overview

Consider a codebase with the following dependency graph. Here, dependencies are formed whenever a file imports a value or a type from another file.
Each dependency is represented with a directed edge. For example, file f depends on files u2 and u3, and files u1 and u2 are in a dependency cycle. Files d1 and d2 are “downstream” dependents of f. These downstream files may in turn depend on a larger set of “upstream” files; in our example, d2 also depends on u4.
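For concreteness, here is roughly what such edges look like in source code. The file names and contents below are hypothetical, chosen only to match the graph; the point is that both value imports and type imports create dependency edges.

```js
// u2.js (hypothetical upstream file)
export type User = { id: number, name: string };
export function getUser(id: number): User {
  return { id, name: 'example' };
}

// u3.js (hypothetical upstream file)
export const DEFAULT_ID: number = 0;

// f.js (hypothetical): imports a type and a value from u2.js and a
// value from u3.js, so f depends on both u2 and u3.
import type { User } from './u2';
import { getUser } from './u2';
import { DEFAULT_ID } from './u3';

export function defaultUser(): User {
  return getUser(DEFAULT_ID);
}
```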
Let’s say that we’re in lazy mode and file f changes. We need to recheck not only f, but also its downstream files d1 and d2. To unblock these rechecks, we need to compute the types of the upstream files too. Let's compare how this worked in the “classic” architecture (the current default mode), and how the nature of this work changes in types-first.
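As a rough sketch (not Flow’s actual implementation), the set of files that must be rechecked when f changes is f itself plus its transitive dependents, which can be read off a reverse dependency graph. The map below is a hypothetical encoding of the example graph.

```js
// Toy reverse dependency graph: dependents.get(x) lists the files
// that directly import x. Illustrative only, not Flow's internals.
const dependents: Map<string, Array<string>> = new Map([
  ['f', ['d1', 'd2']],
  ['u1', ['u2']],
  ['u2', ['f', 'u1']],
  ['u3', ['f']],
  ['u4', ['d2']],
]);

// The recheck set is the changed file plus everything that
// transitively depends on it (a simple graph traversal).
function filesToRecheck(changed: string): Set<string> {
  const result: Set<string> = new Set([changed]);
  const queue: Array<string> = [changed];
  for (let i = 0; i < queue.length; i++) {
    for (const dep of dependents.get(queue[i]) || []) {
      if (!result.has(dep)) {
        result.add(dep);
        queue.push(dep);
      }
    }
  }
  return result;
}

filesToRecheck('f'); // Set { 'f', 'd1', 'd2' }
```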
In classic mode, computing the types of the upstream files meant that we needed to check their code! Moreover, if these files were involved in dependency cycles, we needed to check all of those files together before we could ever get to f or the downstream files that depend on it. We say that files in dependency cycles, for example u1 and u2, form “components”.
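A cycle can arise from nothing more than two files importing types from each other. The files below are hypothetical:

```js
// u1.js (hypothetical): imports a type from u2.js ...
import type { B } from './u2';
export type A = { b: ?B };

// u2.js (hypothetical): ... which imports a type back from u1.js.
// u1 and u2 therefore form a cycle: a single "component" that has
// to be checked together.
import type { A } from './u1';
export type B = { a: ?A };
```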
Each component needs to be checked within a single process, so large components hurt parallelism. This architecture led to a variety of performance problems, including peak memory blowups, long garbage collection pauses, amplification of non-linear time complexity, and poor utilization of the available parallelism.