Bazel, Haskell, and Build-System Joy

2020-09-10

As part of my day job at GitHub, I work on the semantic program analysis toolkit. It’s a lot of fun and a lot of Haskell: we clock around 17,000 lines of Haskell source, though at times it’s been as high as 29,000. The project in total has around a hundred direct dependencies, and several hundred resulting indirect ones. Though initially we used Stack to build semantic, we switched to Cabal when it gained the ability to build within a sandbox.

With cabal and a clean build environment, building semantic takes between twenty and forty-five minutes. With optimizations enabled, it can take on the order of hours, especially if the dependencies must themselves be rebuilt with optimizations. And this is fair! Haskell is a complicated language that takes serious computational power to compile. And though Cabal is a versatile and fast-improving tool, it isn’t perfect, especially when applied to large monorepos containing many subprojects. Issues we ran into include:

  • Little compositionality: though common stanzas in .cabal files cut down on repetition within a single file, it’s not possible to share settings or configurations across multiple files.
  • Suboptimal caching: due to the vagaries of Template Haskell, cabal is very conservative about caching files containing TH splices. This is correct behavior on cabal’s part, but grows tedious when innocuous changes cause rebuilds that we know are unnecessary. The third-party cabal-cache ameliorated some of this on our CI boxes, but did not address the problem fully. Additionally, any modification to a .cabal file throws out its associated caches, even when doing something that shouldn’t invalidate existing caches, such as adding a new module.
  • Phasing restrictions: our build process involved generating Haskell data types from text files containing grammar descriptions, and it’s not generally possible to access project-local files during Template Haskell splices.
  • Scriptability: when dealing with large projects, it’s often very convenient to do some limited forms of iteration when specifying build configurations, and the pure-text format of .cabal files precludes this.
  • Reliable REPLs: in order to yield a REPL capable of loading files without first compiling the entire project, we required a complicated and brittle bash script.
  • Convenience: compared to other languages in which you can just drop new source files and have them picked up by the compiler, cabal requires you to list them explicitly. While explicit is very often better than implicit, the case of adding a new module to a project is so common that editing .cabal files every time becomes tedious.

As you can imagine, living with hour-long CI cycles was not an indefinitely tenable situation. But I think the issue of long CI cycles goes deeper than mere inconvenience: I believe that we as software developers have an ethical duty to keep build times and CI times down. We live in a world where carbon emissions have created, at least in part, a climate so troubled that the west coast of the United States is a literal hellscape, where island populations are being displaced due to rising sea levels, and where cloud computing consumes more energy than some entire nations; I’ve grown to find the thought of letting corporations’ Travis or CircleCI instances recompute thousands of pointless builds upsetting, upsetting in a way similar to the thought of said companies contaminating groundwater with industrial byproducts. To meaningfully change the way companies consume resources requires broad labor action; however, irrespective of when this action is taken, we as engineers have an opportunity and responsibility to be the agents of change.

I also believe that very few people (excluding Rust programmers, who get to use the truly excellent cargo, and who seem to be very happy with it) are truly happy with their build tool. I certainly haven’t been, anyway. To name but a few examples: the shortcomings of make have been documented for longer than I’ve been alive; xcodebuild only works on macOS/iOS targets; CMake is powerful but has no interoperability with cabal. It’s hardly controversial to suggest that, given a large project, no build system will be perfect for all people. This fact doesn’t make the quest for a better build system useless, but it should inform the engineer-voice that demands perfection.

Wearied by hour-long builds, both on CI and locally, I looked around for alternative build solutions that might ease that weariness. I settled on Bazel, with the rules_haskell toolkit designed by the fine folks at Tweag. I’m happy to report that the experience was brilliant: if it’s possible for a build system to produce joy, Bazel and rules_haskell do. What follows is a brief overview of the Bazel experience, and the process of porting a large, multi-project repository to support either cabal build or bazel build.

The Process

The biggest thing to wrap your head around when coming to Bazel from other build systems is that everything on which your build depends—source files, data files, package dependencies, vendored git repositories—must be specified explicitly in Bazel. It does not suffice for your build process to just look at a given file you know is present in the repository; if a target depends on that file, the file must be listed explicitly as one of its dependencies. Dependencies, in Bazel parlance, are more than libraries or repositories: the set of dependencies, and the hashed contents of all those dependencies, are what tell Bazel when builds can be cached and when they can’t. This also goes for test targets: if your tests need to read from some corpus of fixtures, you’ll have to specify those fixtures as an explicit dependency so that they’re available at runtime. This can be an involved process, as sometimes you just want to access a file (come on, it’s right there, I found myself whispering), but the benefits also show up in testing: if your tests are deterministic (as they should be), Bazel is capable of caching your test results and rerunning them only when the sources or the fixtures change. Given that the full tests for semantic parsing take several minutes to run, this is a profound improvement, especially on CI.
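
For instance, a test target with explicit fixtures might look like the following minimal sketch; the target name, fixture paths, and dependency list here are hypothetical stand-ins, not semantic’s actual build code:

load("@rules_haskell//haskell:defs.bzl", "haskell_test")

haskell_test(
    name = "parse-examples",
    srcs = glob(["test/**/*.hs"]),
    # Fixtures must be declared so that Bazel makes them available at
    # runtime, and so that it reruns the test only when they change.
    data = glob(["test/fixtures/**"]),
    deps = [
        "@stackage//:base",
        "@stackage//:tasty",
    ],
)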

Beginning a new Bazel project, or converting from Cabal, requires starting with the WORKSPACE file. The WORKSPACE specifies the root of the current project, downloads and sets up GHC, and is the only place where external dependencies—such as those downloaded from Hackage—are specified. Targets are specified per-project in BUILD.bazel files, each of which refers as needed to external dependencies defined in the WORKSPACE. These build files, along with any shared .bzl modules, are written in Starlark, a subset of Python that discourages mutability and iteration—though I’m hardly the world’s biggest fan of Python, I found Starlark very pleasant to use, as its chosen subset of Python is strict enough to disallow most of the things I find egregious. Once the WORKSPACE file is set up, the process of conversion becomes specifying per-project library and executable targets in the BUILD.bazel file present in each project within the monorepo.
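
To give a flavor of this, here is a minimal WORKSPACE sketch, following the shape of the rules_haskell documentation; the rules_haskell release and GHC version below are illustrative placeholders, not the ones semantic actually pins:

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Fetch the rules_haskell ruleset itself.
http_archive(
    name = "rules_haskell",
    strip_prefix = "rules_haskell-0.13",
    urls = ["https://github.com/tweag/rules_haskell/archive/v0.13.tar.gz"],
)

load("@rules_haskell//haskell:repositories.bzl", "rules_haskell_dependencies")

# Pull in rules_haskell's own dependencies.
rules_haskell_dependencies()

load("@rules_haskell//haskell:toolchain.bzl", "rules_haskell_toolchains")

# Download and register a GHC toolchain, independent of any system GHC.
rules_haskell_toolchains(version = "8.8.4")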

Whence Cabal Dependencies?

As I mentioned earlier, there are several hundred direct and indirect dependencies across all subprojects in the semantic monorepo. Each of these dependencies has to be declared and made available as a build target, specified in the WORKSPACE. There are three options for specifying dependencies on Hackage projects:

  • Specify them all manually, downloading each with http_archive and building it with haskell_cabal_library; doing so would be tedious beyond words, especially given that we’d have to declare dependencies for each package.
  • Use the Nix expression language, in combination with the rules_nixpkgs ruleset, and transform Nix derivations into Bazel targets.
  • Pin to a particular Stackage release, specifying non-Stackage dependencies with a YAML file in the project root.

Though Nix has considerable merit, especially when corralling system dependencies, it’s still an unconventional choice in industry, and I deemed it politically infeasible to introduce not just one but two new frameworks for builds. As such, I chose to build against a Stackage release, especially given that we have no real system-level dependencies and that ninety percent of our dependencies are already present in Stackage snapshots.
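
In rules_haskell, that pinning is done with the stack_snapshot rule in the WORKSPACE: you name a snapshot and the Hackage packages you need, and each becomes addressable as a target like @stackage//:aeson. Here is a minimal sketch (the snapshot and package list are illustrative, not semantic’s real ones):

load("@rules_haskell//haskell:cabal.bzl", "stack_snapshot")

stack_snapshot(
    name = "stackage",
    # Each listed package is exposed as an @stackage//:<name> target.
    packages = [
        "aeson",
        "base",
        "containers",
    ],
    # Pin the entire dependency set to a single Stackage release.
    snapshot = "lts-16.10",
)

This rule also generates the @stackage//:packages.bzl file that the shared rule later in this post consults for per-package dependency lists.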

Code Generation: It Matters

Because maintaining syntax trees by hand was much too onerous, my coworker Ayman swooped in and wrote Template Haskell splices that generate syntax types from a tree-sitter JSON description of the grammar. This works well, but hinges on the ability to read said grammar descriptions from the filesystem. This was a fraught process in Cabal, relying on autogenerated Paths_ modules providing access to files specified in the data-files setting in each project’s .cabal file, and only happened to work by accident: were semantic uploaded to Hackage, no one would be able to use it as a dependency, as cabal would be unable to find the required file. As it is, this happened to work because our downstream clients use a pinned Git hash in their cabal.project to pull in semantic as a dependency; because cabal checks out the whole repository in this case, the tree-sitter files happen to be in the correct place.

Bazel and rules_haskell take a more principled approach to this. Rather than calling pre-provided functions to determine the locations of these JSON files, we make the build system take care of finding them, by declaring that each language package has an explicit dependency on said file. We can pass the location of this file as a preprocessor flag to the build process, which is then substituted using the CPP extension to Haskell. This doesn’t work perfectly—there’s an incorrect interaction when invoking a REPL on the language package in question—but it suffices in almost all cases, and we were able to use it in tandem with the cabal methodology (since not all of our systems are as yet migrated).
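
In BUILD-file terms, the pattern looks roughly like the following sketch. The target names are hypothetical, and I’m assuming rules_haskell’s extra_srcs attribute plus Bazel’s $(location ...) expansion to smuggle the file’s path in as a CPP define:

load("@rules_haskell//haskell:defs.bzl", "haskell_library")

haskell_library(
    name = "language-go",
    srcs = glob(["src/**/*.hs"]),
    # The grammar description is an explicit input to compilation...
    extra_srcs = ["@tree-sitter-go//:src/node-types.json"],
    # ...and CPP learns its location through a -D flag.
    compiler_flags = [
        '-DNODE_TYPES_PATH="$(location @tree-sitter-go//:src/node-types.json)"',
    ],
    deps = ["@stackage//:base"],
)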

A Script of One’s Own

As I mentioned above, the ability to generate code, as well as the result of said generation, hinges on the availability and the specific version of the language description in question. Though these JSON files are available in the Hackage packages we’ve uploaded, Bazel needs to know more about them: we have to provide not only a Haskell library target but also a target for the JSON file itself. Since our code is very sensitive to grammar changes, we need to specify grammar versions precisely, each time declaring both the Haskell library target and the JSON target. These target specifications are complicated when written out by hand, but vary very little between language packages. As such, it was easy to specify a shared rule:

tree_sitter_node_types_hackage(
    name = "tree-sitter-go",
    sha256 = "72a1d3bdb2883ace3f2de3a0f754c680908489e984503f1a66243ad74dc2887e",
    version = "0.5.0.2",
)

This function compiles down to an http_archive call targeting a particular grammar version, performing the required string interpolation to instruct the package in question how to expose both its Haskell and JSON targets.

load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

_tree_sitter_language_build = """
load("@rules_haskell//haskell:cabal.bzl", "haskell_cabal_library")
load("@stackage//:packages.bzl", "packages")

package(default_visibility = ["//visibility:public"])

exports_files(glob(["**/node-types.json"]))

alias(
    name = "src/node-types.json",
    actual = "{node_types_path}",
)

haskell_cabal_library(
    name = "{name}",
    version = "{version}",
    srcs = glob(["**"]),
    deps = packages["{name}"].deps,
    visibility = ["//visibility:public"],
)

filegroup(name = "corpus", srcs = glob(["**/corpus/*.txt"]))
"""

def tree_sitter_node_types_hackage(name, version, sha256, node_types_path = ""):
    """Download a tree-sitter language package from Hackage and build/expose its library and corpus."""

    if node_types_path == "":
        node_types_path = ":vendor/{}/src/node-types.json".format(name)

    info = {
        "name": name,
        "version": version,
        "node_types_path": node_types_path,
    }
    http_archive(
        name = name,
        build_file_content = _tree_sitter_language_build.format(**info),
        urls = ["https://hackage.haskell.org/package/{name}-{version}/{name}-{version}.tar.gz".format(**info)],
        strip_prefix = "{name}-{version}".format(**info),
        sha256 = sha256,
    )

My Conclusions

The process of enabling Bazel on semantic took about two person-weeks of work. At least half, if not more, of that time was spent refining and tweaking the setup: the core artifacts (the WORKSPACE and all relevant BUILD files) came together within a day’s work. I’m tremendously pleased with how well Bazel and rules_haskell worked out for semantic. Here is an inexhaustive list of the things that have brought me joy:

  • Content-based caching is superlative. It’s so refreshing to be able to mess around on feature branches and then, when popping back to the main branch, avoid having to rebuild a project I’ve already encountered in the past.
  • ghcide tooling worked straight out of the box. I had a hard time believing that it could, but it does: IDE support across the project is now significantly more reliable.
  • Being able to share functions and constants across different BUILD.bazel files is revelatory. No longer do I have to remember the correct set of GHC options for executables/libraries/tests: I define them once in a common .bzl file, load them explicitly, and pass them as compiler flags to rules_haskell rules (see the sketch just after this list).
  • Bazel tooling is excellent, particularly its editor integration, the buildifier formatter, and the buildozer tool for batch edits to BUILD files. buildifier in particular is very opinionated, happily keeping your imports list alphabetized and deduped. As someone who’s spent more time than I’d like hand-formatting .cabal files, I find this truly pleasant.
  • The documentation, both for rules_haskell and for Bazel itself, is consistently excellent. Indeed, DuckDuckGo contains a bang that allows searching Bazel docs directly from your browser’s address bar.
  • As developers, we no longer need to track GHC versioning ourselves: it’s specified in the WORKSPACE and exists independently from whatever ghc or cabal you might have installed.
  • Lastly, but most importantly: the rules_haskell team are kind and generous with their time. They’ve been tremendously responsive to our needs, and are quick to fix bugs we’ve encountered. In a world where many open-source maintainers are burnt out (understandably so, given the thankless and ill-funded task that is maintaining any open source project of note), it’s a pleasure to interact with people whose approach is grounded in patience and empathy.
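
To make that build-file sharing concrete, here is a minimal sketch of the pattern; the common.bzl file name and the flag list are hypothetical stand-ins for semantic’s actual definitions:

# common.bzl (at the repository root): constants shared across BUILD files.
STANDARD_GHC_WARNINGS = [
    "-Weverything",
    "-Wno-missing-import-lists",
    "-Wno-unsafe",
]

Any BUILD.bazel file can then load and reuse those flags:

load("//:common.bzl", "STANDARD_GHC_WARNINGS")
load("@rules_haskell//haskell:defs.bzl", "haskell_library")

haskell_library(
    name = "example-library",
    srcs = glob(["src/**/*.hs"]),
    compiler_flags = STANDARD_GHC_WARNINGS,
    deps = ["@stackage//:base"],
)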

If you’re curious about how rules_haskell looks in practice, you can check the build files in semantic, or repositories like Digital Asset’s daml and TreeTide’s underhood. Consulting existing projects was, in my experience, the best way to get a sense for preferred idioms and approaches.

Though I don’t use Bazel and rules_haskell everywhere—for simple one-off projects with one or two targets and few dependencies, I still use cabal—I’m fully on board, and truly enamored, with the power and composability they provide. Though the setup isn’t perfect (there are some issues on our CI machines holding us back, and fixing per-language REPLs depends on the as-yet-unimplemented th_deps feature; EDIT: the Bazel team fixed the first bug, and we obviated the second: thanks so much to them for tackling the former!), I don’t know of any other build system that balances a sensible execution model with the extensibility and customizability that large projects’ builds always end up needing. Simply put, it’s joy-inducing, and if you work on a large Haskell codebase you owe it to yourself to try the Bazel life.

Thanks to Joe Kachmar and Phillip Bowden for reviewing drafts of this post.