In this blog post we introduce below, an Apache 2.0 licensed resource monitor for modern Linux systems. below was designed and developed by the resource control team at Facebook to view and record historical Linux system data. The resource control team, as the name suggests, is responsible for resource management of Linux systems at scale.
One of the kernel’s primary responsibilities is mediating access to resources. Sometimes this might mean parceling out physical memory such that multiple processes can share the same host. Other times it might mean ensuring equitable distribution of CPU time. In all these contexts, the kernel provides the mechanism and leaves the policy to a runtime like systemd or dockerd. The runtime takes input from a scheduler or end user — along the lines of what to run and how to run it — and turns the right knobs and pulls the right levers on the kernel such that the workload can start executing.
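To make the split between mechanism and policy concrete, here is a minimal sketch of what "turning the knobs" can look like with cgroup v2. The interface files `memory.max` and `cpu.max` are real cgroup v2 files; the `apply_limits` helper and its parameters are hypothetical names for illustration, not how systemd or dockerd actually do it.

```python
import tempfile
from pathlib import Path
from typing import Optional


def apply_limits(cgroup_dir: Path,
                 memory_max_bytes: Optional[int] = None,
                 cpu_quota_usec: Optional[int] = None,
                 cpu_period_usec: int = 100_000) -> None:
    """Translate a policy into writes to cgroup v2 interface files.

    None means "no limit", which cgroup v2 spells as the literal string "max".
    """
    memory = "max" if memory_max_bytes is None else str(memory_max_bytes)
    (cgroup_dir / "memory.max").write_text(memory + "\n")

    quota = "max" if cpu_quota_usec is None else str(cpu_quota_usec)
    # cpu.max holds "<quota> <period>", both in microseconds.
    (cgroup_dir / "cpu.max").write_text(f"{quota} {cpu_period_usec}\n")


if __name__ == "__main__":
    # A temporary directory stands in for /sys/fs/cgroup/<name> so the sketch
    # runs without root; on a real host the kernel provides these files.
    with tempfile.TemporaryDirectory() as tmp:
        group = Path(tmp)
        apply_limits(group, memory_max_bytes=512 * 1024 * 1024,
                     cpu_quota_usec=50_000)
        print((group / "memory.max").read_text().strip())
        print((group / "cpu.max").read_text().strip())
```

The runtime decides the numbers (policy); the kernel enforces whatever ends up in the files (mechanism).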
In a perfect world this would be the end of the story. However, the reality is that resource management is a complex and opaque amalgam of technologies that has evolved over decades of computing. Despite some of this technology having various warts and dead ends, the end result — a container — works well. While the user does not usually need to concern themselves with the details, it is crucial for infrastructure operators to have visibility into this stack. Visibility and debuggability are essential for detecting and investigating misconfigurations, bugs and systemic issues.
To make matters more complicated, resource outages are often difficult to reproduce. It is not unusual to spend weeks waiting for an issue to recur so that the root cause can be investigated. Scale compounds the problem: one cannot run a custom script on every host in the hope of logging bits of crucial state if the bug happens again. Therefore, more sophisticated tooling like below is required.
Historically Facebook has been a heavy user of atop. atop is a performance monitor for Linux that can report the activity of all processes as well as various pieces of system level activity. One of the most compelling features atop has to offer over tools like htop is the ability to record historical data as a daemon. This sounds like a simple feature, but in practice this has enabled debugging countless production issues. With long enough data retention, it is possible to go backwards in time and look at the host state before, during, and after the outage.
Unfortunately, it became clear over the years that atop had certain deficiencies. First, cgroups emerged as the de facto way to control and monitor resources on a Linux machine. atop still lacks support for this fundamental building block. Second, atop stores data on disk with custom delta compression. This works fine under normal circumstances, but under heavy resource pressure the host is likely to lose data points. Since delta compression is in use, huge swaths of data can be lost for periods of time where the data is most important. Third, the user experience has a steep learning curve. We frequently heard from atop power users that they love the dense layout and numerous keybindings. However, this is a double-edged sword. When someone new to atop wants to debug a production issue, they’re solving two problems at once now: the issue at hand and how to use atop.
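To illustrate the second deficiency, the toy model below contrasts a delta-encoded log with a log of full samples when one record goes missing. It is a deliberately simplified sketch: the record layout, the function names, and the absence of periodic full records are all assumptions for illustration, not atop's actual on-disk format.

```python
from typing import Dict, List, Optional

Sample = Dict[str, int]


def encode_deltas(samples: List[Sample]) -> List[Sample]:
    """First record is a full sample; later records keep only changed keys."""
    records: List[Sample] = []
    prev: Sample = {}
    for sample in samples:
        records.append({k: v for k, v in sample.items() if prev.get(k) != v})
        prev = sample
    return records


def decode_deltas(records: List[Optional[Sample]]) -> List[Optional[Sample]]:
    """Rebuild samples; once a record is lost (None), later ones cannot be rebuilt."""
    out: List[Optional[Sample]] = []
    state: Optional[Sample] = {}
    for record in records:
        if record is None or state is None:
            state = None
            out.append(None)
        else:
            state = {**state, **record}
            out.append(state)
    return out


def recovered(samples: List[Optional[Sample]]) -> int:
    """Count the samples that survived."""
    return sum(1 for s in samples if s is not None)


if __name__ == "__main__":
    samples = [
        {"cpu": 10, "mem": 100},
        {"cpu": 12, "mem": 100},
        {"cpu": 12, "mem": 180},
        {"cpu": 90, "mem": 400},
    ]
    lost_index = 1  # pretend this write never reached the disk

    deltas: List[Optional[Sample]] = list(encode_deltas(samples))
    deltas[lost_index] = None
    full: List[Optional[Sample]] = list(samples)
    full[lost_index] = None

    print("delta log recovers:", recovered(decode_deltas(deltas)))
    print("full log recovers: ", recovered(full))
```

In this model a single lost record takes every later sample with it in the delta log, while the log of full samples loses only the one record.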
Recognizing the opportunity for a next-generation system monitor, we designed below with input from production atop users and the following in mind:
To install the package:
# dnf install -y below
To turn on the recording daemon:
# systemctl enable --now below
The most commonly used mode for below is the replay mode. As the name implies, replay mode replays previously recorded data. Assuming you’ve already started the recording daemon, start a session by running:
$ below replay --time "5 minutes ago"
You should then be greeted with the cgroup view:
If you ever get stuck or forget a keybinding, press ? to access the help menu.
The very top of the screen is the status bar. The status bar displays information about the current sample. You can move forwards and backwards through samples by pressing t and T, respectively. The middle section is the system overview. The system overview contains statistics about the whole system that are always useful to see. The third and lowest section is the multipurpose view. In addition to the cgroup view shown, there are process and system views, accessible by pressing p and s, respectively.
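Stepping through samples can be pictured as a cursor over a sorted list of sample timestamps. The sketch below is one way to model it; the `ReplayCursor` class is a hypothetical name, and the choice to land on the latest sample at or before the requested time is an assumption, not a description of below's implementation.

```python
import bisect
from typing import List


class ReplayCursor:
    """A cursor over recorded sample timestamps (seconds since the epoch)."""

    def __init__(self, timestamps: List[int]) -> None:
        if not timestamps:
            raise ValueError("no samples recorded")
        self.timestamps = sorted(timestamps)
        self.index = 0

    def seek(self, when: int) -> int:
        """Jump to the latest sample at or before `when`, else the first one."""
        pos = bisect.bisect_right(self.timestamps, when) - 1
        self.index = max(pos, 0)
        return self.timestamps[self.index]

    def forward(self) -> int:
        """Advance one sample, stopping at the newest one."""
        self.index = min(self.index + 1, len(self.timestamps) - 1)
        return self.timestamps[self.index]

    def backward(self) -> int:
        """Go back one sample, stopping at the oldest one."""
        self.index = max(self.index - 1, 0)
        return self.timestamps[self.index]


if __name__ == "__main__":
    cursor = ReplayCursor([1000, 1005, 1010, 1015])
    print(cursor.seek(1012))
    print(cursor.forward())
    print(cursor.backward())
```

Keeping the timestamps sorted makes a seek a binary search, which matters once a host has weeks of samples on disk.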
Press ↑ and ↓ to move the list selection. Press <Enter> to collapse and expand cgroups. Now suppose you’ve found an interesting cgroup and want to see which processes are running inside it. You can zoom into the process view by pressing z:
Press z again to return to the cgroup view. The cgroup view can be long at times. If you have a vague idea of what you’re looking for, you can filter by cgroup name by pressing / and entering a filter:
At this point, you may have noticed a tab system we haven’t explored yet. To cycle forwards and backwards through tabs, press <Tab> and <Shift> + <Tab>, respectively. We’ll leave this as an exercise for the reader.
Under the hood, below has a powerful design and architecture. Facebook is constantly upgrading to newer kernels, so we never assume a data source is always available. This design principle enables full backwards and forwards compatibility between kernels and below versions. Furthermore, each data point is compressed with Zstandard (zstd) and stored in full. This avoids the problems with delta compression that we’ve seen atop have at scale. In our tests, per-sample compression achieves a 5x compression ratio on average.
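The two ideas in this paragraph, optional data sources and self-contained compressed samples, can be sketched as follows. This is an illustration under stated assumptions rather than below's code: the `CgroupSample` type and the helper names are hypothetical, JSON is an arbitrary serialization choice, and the standard library's `zlib` stands in for zstd so that the example has no third-party dependencies. The files `memory.current` and `memory.swap.current` are real cgroup v2 interface files.

```python
import json
import tempfile
import zlib
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


def read_int(path: Path) -> Optional[int]:
    """Return the integer stored in `path`, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # data source missing on this kernel: record "unknown"


@dataclass
class CgroupSample:
    timestamp: int
    memory_current: Optional[int]
    memory_swap_current: Optional[int]


def collect(cgroup_dir: Path, timestamp: int) -> CgroupSample:
    """Gather one sample; every field tolerates an absent data source."""
    return CgroupSample(
        timestamp=timestamp,
        memory_current=read_int(cgroup_dir / "memory.current"),
        memory_swap_current=read_int(cgroup_dir / "memory.swap.current"),
    )


def pack(sample: CgroupSample) -> bytes:
    """Compress one full sample on its own (zlib here, standing in for zstd)."""
    return zlib.compress(json.dumps(asdict(sample)).encode("utf-8"))


def unpack(blob: bytes) -> CgroupSample:
    """Decode a sample without needing any other record."""
    return CgroupSample(**json.loads(zlib.decompress(blob)))


if __name__ == "__main__":
    # A temporary directory stands in for a cgroup that lacks swap accounting.
    with tempfile.TemporaryDirectory() as tmp:
        group = Path(tmp)
        (group / "memory.current").write_text("4096\n")
        sample = collect(group, timestamp=1000)
        print(sample)
        print(unpack(pack(sample)) == sample)
```

Because each blob decodes independently, a dropped or corrupted record costs exactly one sample, and a field that reads as `None` on one kernel can carry a value on another without changing the format.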
below also uses eBPF to collect information about short-lived processes (processes whose lifetime is shorter than the data collection interval). In contrast, atop implements this feature with BSD process accounting, a kernel interface known to be slow and prone to priority inversion.
For the user, below also supports a live mode and a dump interface. Live mode combines the recording daemon and the TUI session into one process. This is convenient for browsing system state without committing to a long-running daemon or disk space for data storage. The dump interface is a scriptable interface to all the data below stores. Dump is both powerful and flexible: detailed data is available in CSV, JSON, and human-readable formats.
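As a rough picture of what offering one data set in several output formats involves, the sketch below renders the same records as JSON, CSV, and an aligned text table. The field names (`cgroup`, `cpu_pct`) and the functions are hypothetical; this is not below's dump schema or command-line interface.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, object]


def to_json(records: List[Record]) -> str:
    """Machine-readable output that keeps value types."""
    return json.dumps(records)


def to_csv(records: List[Record]) -> str:
    """Spreadsheet-friendly output with a header row."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_human(records: List[Record]) -> str:
    """Column-aligned output for reading in a terminal."""
    if not records:
        return ""
    fields = list(records[0].keys())
    widths = {f: max(len(f), *(len(str(r[f])) for r in records)) for f in fields}
    rows = [fields] + [[str(r[f]) for f in fields] for r in records]
    lines = [
        "  ".join(cell.ljust(widths[f]) for cell, f in zip(row, fields)).rstrip()
        for row in rows
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records: List[Record] = [
        {"cgroup": "system.slice", "cpu_pct": 12.5},
        {"cgroup": "user.slice", "cpu_pct": 3.0},
    ]
    print(to_json(records))
    print(to_csv(records), end="")
    print(to_human(records), end="")
```

JSON and CSV suit scripts and dashboards, while the aligned table is the one a person would read directly.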
below offers compelling advantages over existing tools in the resource monitoring space. We (the below developers) have spent a great deal of effort preparing below for open source use. We are excited for readers and the community to get a chance to try below, and we hope that it will provide you with an interactive and easy-to-use system monitor. If you have any feedback, feature requests, or bugs, please let us know on the GitHub issue tracker.