Hermit: Deterministic Linux for Controlled Testing and Software Bug-finding

November 22, 2022ByRyan Rhodes Newton

If you've used emulators for older platforms, you probably experienced a level of precise control over software execution that we lack on contemporary platforms. For example, if you play 8-bit video games emulated on your modern console, you are able to suspend and rewind gameplay, and when you resume, that incoming creature or projectile will predictably appear in the same spot because any randomness plays out deterministically within the emulator.

Yet, as a software engineer, when your multithreaded service crashes under transient conditions, or when your test is flaky, you don't have these luxuries. Everything the processor and operating system contributes to your program's execution—thread scheduling, random numbers, virtual memory addresses, unique identifiers—constitutes an unrepeatable, unique set of implicit inputs. Standard test-driven methodologies control for explicit program inputs, but they don't attempt to control these implicit ones.

Since 2020, our team within DevInfra has worked to tackle this hard problem at its root: the pervasive nondeterminism in the interface between applications and the operating system. We've built the first practical deterministic operating system called Hermit (see endnote on prior art). Hermit is not a new kernel—instead it's an emulation layer on top of the Linux kernel. In the same way that Wine translates Windows system calls to POSIX ones, Hermit intercepts system calls and translates them from the Deterministic Linux abstraction to the underlying vanilla Linux OS.

Details on sources of and solutions for nondeterminism can be found in our paper, “Reproducible Containers,” published in ASPLOS '20, which showcased an earlier version of our system. We've open-sourced the new Hermit system and the underlying program-instrumentation framework named Reverie.

Example Applications

Now we explore some of the applications Hermit can be used for, and the role [non]determinism plays. In the next section, we go deeper into how Hermit works.

Flaky tests

First, flaky tests. They're a problem for every company. Google, Apple, Microsoft and Meta have all published their experiences with flaky tests at scale. Fundamentally, the cause of flakiness is that test functions don't really have the signatures that appear in the source code. An engineer might think they're testing a function from a certain input type to output type, for example:

test : Fn(Input) -> Output;

Indeed, unless we're doing property-based testing, then for unit tests it’s even simpler. (The input is empty, and the output is boolean.) Unfortunately, in reality, most tests may be affected by system conditions and even external network interactions, so test functions have a true signature more like the following:

test : Fn(Input, ThreadSchedule, RNG, NetworkResponses) -> Output;

The problem is that most of these parameters are outside of engineers’ control. The test harness and test code, running on a host machine, is at the mercy of the operating system and any external services.

Caption: Irreproducible, implicit inputs from the operating system can affect test outcomes.

That's where Hermit comes in. Hermit’s job is to create a containerized software environment where every one of the implicit inputs (pictured above) is a repeatable function of the container state or the container configuration, including command line flags. For example, when the application requests the time, we provide a deterministic time that is a function of program progress only. When an application thread blocks on I/O, it resumes at a deterministic point relative to other threads.

Hermit’s guarantee is that any program run by Hermit (without external networking) runs an identical execution—irrespective of the time and place it is run—yielding an identical stream of instructions and complete memory states at the time of each instruction. This means if you run your network-free regression test under Hermit, it is guaranteed not to be flaky:

hermit run ./testprog

Further, Hermit allows us to not merely explore a single repeatable execution of a program, but to systematically navigate the landscape of possible executions. Let’s look at how to control one specific feature: pseudo-random number generation (PRNG). Of course, for determinism, when the application requests random bytes from the operating system, we provide repeatable pseudo-random ones. To run a program with different PRNG seeds, we simply use a different --rng-seed parameter:

hermit run --rng-seed=1 prog
hermit run --rng-seed=2 prog

In this case, it doesn't matter what language prog is written in, or what random number generator library it uses, it must ultimately ask the operating system for entropy, at which point it will receive repeatable pseudo-random inputs.

It is the same for thread scheduling: Hermit takes a command line seed to control thread interleaving. Hermit is unique in being able to reproducibly generate schedules for full Linux programs, not merely record ones that happen in nature. It generates schedules via established randomized strategies, designed to exercise concurrency bugs. Alternatively, full thread schedules can be passed explicitly in as input files, which can be derived by capturing and modifying a schedule from a previous run. We'll return to this in the next section.

There is an upshot to making all these implicit influences explicit. Engineers dealing with flaky programs can steer execution as they desire, to achieve the following goals:

Regression testing: Use settings that keep the test passing.
Stress testing: Randomize settings to find bugs more effectively.
Diagnosis: Vary inputs systematically to find which implicit inputs cause flakiness.

Pinpointing a general class of concurrency bugs

As mentioned above, we can vary inputs to find what triggers a failure, and we can control schedules explicitly. Hermit builds on these basic capabilities to analyze a concurrency bug and pinpoint the root cause. First, Hermit searches through random schedules, similar to rr chaos mode. Once Hermit finds failing and passing schedules, it can analyze the schedules further to identify pairs of critical events that run in parallel and where flipping the order of those events causes the program to fail. These bugs are sometimes called ordering violations or race conditions (including but not limited to data races).

Many engineers use race detectors such as ThreadSanitizer or go run -race. Normally, however, a race detector requires compile-time support, is language-specific and works only to detect data races, specifically data races on memory (not files, pipes, etc.). What if, instead, we have a race between a client program written in Python, connecting to a server written in C++, where the client connects before the server has bound the socket? This nondeterministic failure is an instance of the "Async Await" flakiness category, as defined by an empirical analysis and classification of flaky tests.

By building on a deterministic operating system abstraction, we can explicitly vary the schedule to empirically find those critical events and print their stack traces. We start by using randomized scheduling approaches to generate a cloud of samples within the space of possible schedules:

Caption: A visualization of the (exponential) space of possible thread schedules, generated by different Hermit seeds. With the space organized by edit distance, the closest red and green dots correspond to the minimum edit distance observed between a passing and failing schedule.

We can organize this space by treating the thread schedules literally as strings, representing sequential scheduling histories. For example, with two threads A & B, "AABBA" could be an event history. The distance between points is the edit distance between strings (actually, a weighted edit distance preferring swaps over insertion or deletion). We can then take the closest pair of passing and failing schedules and then study it further. In particular, we can bisect between those schedules, following the minimum edit distance path between them, as visualized below.

Caption: A binary search between passing and failing schedules, probing points in between until it finds adjacent schedules that differ by a single transposition, a Damerau-Levenshtein distance of one.

At this point, we've reduced the bug to adjacent events in the thread schedule, where flipping their order makes the difference between passing and failing. We report the stack traces of these events as a cause of flakiness. (Indeed, it is only a single cause because there may be others if flakiness is overdetermined.)

Challenges and how it works

Here, we'll cover a bit about how Hermit works, emphasizing pieces that are different from our prototype from ASPLOS ’20. The basic scenario is the same, which is that we set out to build a user space determinization layer, not allowing ourselves the liberty of modifying the Linux kernel or using any privileged instructions.

Challenge 1: Interposing between operating system and application

Unfortunately, on Linux there is not a standard, efficient, complete method to interpose between user-space applications and the operating system. So we've built a completely new program sandboxing framework in Rust, called Reverie, that abstracts away the backend—how program sandboxing is implemented. Reverie provides a high-level Rust API to register handlers: callbacks on the desired events. The user writes a Reverie tool that observes guest events and maintains its own state.

Reverie is not just for observing events. When you write a Reverie handler, you are writing a snippet of operating system code. You intercept a syscall, and you become the operating system, updating the guest and tool state as you like, injecting zero or more system calls to the underlying Linux operating system and finally returning control to the guest. These handlers are async, and they run on the popular Rust tokio framework, interleaving with each other as multiple guest threads block and continue.

The reference Reverie backend uses the ptrace system call for guest event interception, and a more advanced backend uses in-memory program instrumentation. In fact, Reverie is the only program instrumentation library that abstracts away whether instrumentation code is running in a central place (its own process), or inside the guest processes themselves via injected code.

Challenge 2: Inter-thread synchronization

Consider communication through sockets and pipes within the reproducible container. This is an area where our earlier prototype mainly used the strategy of converting blocking operations to non-blocking ones, and then polling them at deterministic points in time within a sequential execution. Because we run in user space, we don't have a direct way to ask the kernel whether a blocking operation has completed, so attempting a non-blocking version of the syscall serves as our polling mechanism.

Hermit builds on this strategy and includes a sophisticated scheduler with thread priorities and multiple randomization points for exploring "chaos mode" paths through the code. This same scheduler implements a back-off strategy for polling operations to make it more efficient.

Hermit also goes beyond polling, implementing some inter-thread communication entirely inside Hermit. By including features like a built-in futex implementation, Hermit takes a small step closer to behaving like an operating system kernel. But Hermit is still vastly simpler than Linux and passes most of the heavy lifting on to Linux itself.

For specific features that Hermit implements directly, it never passes those system calls through to Linux. In the case of futexes, for example, it is hard or impossible to come up with a series of raw futex syscalls to issue to the kernel and achieve a deterministic outcome. Subtleties include spurious wake-ups (without the futex value changing), the nondeterministic selection of threads to wake, and the ineradicable moment in time after Hermit issues a blocking syscall to Linux, but before we know for sure if Linux has acted on it.

These issues are avoided entirely by intercepting each futex call and updating the Hermit scheduler’s own state precisely and deterministically. The underlying Linux scheduler still runs everything, physically, but Hermit’s own scheduler takes precedence, deciding which thread to unblock next.

Challenge 3: Large and complex binaries

Meta has no shortage of large and challenging binaries that use recent features of the Linux kernel. Nevertheless, after a couple of years of development, Hermit runs thousands of different programs deterministically today. This includes more than 90 percent of test binaries we run it on.

Other applications

While flaky tests and concurrency bugs are where we've invested most heavily, there are many other applications which we will briefly outline below:

Reproducible Builds: which are part of a secure software supply chain that removes the need to trust binary packages and their servers. Our earlier prototype was used for this purpose and presented at the Reproducible Builds Summit in 2019.
Reproducible Debugging: which has the same benefits as traditional replay debugging . Indeed Hermit supports attaching GDB or LLDB to any deterministic execution.
Replicated state machines: which provide fault tolerance in distributed systems, and require determinism. There are even microservices frameworks that completely abstract away machine failures, provided that determinism can be ensured or assumed.
Smart contracts: which are normally written in specialized languages that are constrained to be deterministic. Reproducible containers could make it possible to run traditional software “on chain” with the same assurance of distributed replicability.

There's more than we can explore on our own! That is why we're excited to open up these possibilities for the community at large.

Conclusion

Repeatable execution of software is an important capability for debugging, auditability and the other applications described above. Nevertheless, it's been treated as an afterthought in the technology stacks we all depend on—left as the developer’s responsibility to create "good enough" repeatability for stable tests or a reproducible bug report.

Hermit, as a reproducible container, provides a glimpse of what it would be like if the system stack provided repeatability as an abstraction: a guarantee the developer could rely upon, just like memory isolation or type safety. We've only scratched the surface of what is possible with this foundational technology. We hope you’ll check out the open source GitHub repo and help us apply and evolve Hermit.

Note on prior Art: Earlier academic research on Determinator and dOS represents exploratory work in this area over a decade ago. But these systems were, respectively, a small educational OS and an experimental fork of Linux. Neither was designed as a maintainable mechanism for people to run real software deterministically in practice.

To learn more about Meta Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter, Facebook and LinkedIn.

Tags:

Developer Tools Open Source 2022