fs-verity support in btrfs: Detect tampering in critical read-only files

October 19, 2021 By Boris Burkov

fs-verity is a Linux filesystem feature for ensuring the integrity and authenticity of a read-only file in an otherwise writable filesystem. It was originally built to authenticate the contents of app package (apk) files on Android devices using ext4 and f2fs filesystems. Facebook plans to use it for authenticating system base images, container images, and individual binaries on production servers using btrfs.

Motivation

Without getting bogged down in excessive mathematics or security jargon, it is useful to sketch how file authentication is typically done. This will illustrate the tradeoffs inherent to the design of fs-verity and contrast it with other similar tools.

Fundamentally, to verify the authenticity of some data, a system needs to check that the data itself hasn't changed and confirm that the metadata used for checking the data hasn't been tampered with. This is typically done by hashing the data and securely authenticating the hash.

For example, using openssl, we can sign a sha256 hash of a file in userspace and detect tampering:

          # generate a private key and public key certificate
          $ openssl req -newkey rsa:4096 -nodes -keyout key.pem -x509 -out cert.pem

          # reformat the certificate as a raw public key file
          $ openssl x509 -pubkey -out pubkey.pem -in cert.pem

          # generate a 1M file
          $ dd if=/dev/urandom of=foo bs=1M count=1
          1+0 records in
          1+0 records out
          1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00651344 s, 161 MB/s

          # create a signed sha256 hash of the file
          $ openssl dgst -out foo.sig -sign key.pem foo

          # verify the signed hash with the public key
          $ openssl dgst -verify pubkey.pem -signature foo.sig foo
          Verified OK

          # tamper with the file
          $ dd if=/dev/zero of=foo conv=notrunc bs=4k count=1 seek=100
          1+0 records in
          1+0 records out
          4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000141911 s, 28.9 MB/s

          # try verifying again
          $ openssl dgst -verify pubkey.pem -signature foo.sig foo
          Verification Failure

          # reading the file is clearly still OK
          $ sha256sum foo
          68db4587ffb4162e4ff2027c6d629521fb79796dfd8e530504618fbb2dd36750 foo

Though signing a hash is the basic form, the designer of an authentication tool must make a few high-level decisions about exactly how the hashing and signing are done. Here are a few design choices that form a taxonomy we can use to categorize and compare different authentication tools:

device vs. file: Should we authenticate an entire block device, or individual files in a file system?
kernel vs. user space: Should the userspace application be responsible for authentication, or can the kernel do it transparently while performing other file operations?
writeable vs. unwritable: Should we be able to re-write and re-authenticate a file after doing so once?
whole data vs. chunks: Should we operate on and store hashes for all the data at once, or on chunks of it?
hash (Merkle) tree vs. hash list: If we break the data up into chunks, should we store the hashes in a flat list or recursively hash the hashes and form a tree?
lazy vs. eager verification: When verifying authenticity, should we process the entire data up front, or lazily as we use it?
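To make two of these choices concrete (whole data vs. chunks, and lazy vs. eager), here is a minimal userspace sketch in Python. It is not how any of the tools discussed here are implemented; the 4K chunk size, the function names, and the in-memory "trusted" hashes are assumptions made for illustration, and the signing step is left out entirely.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size for this sketch


def whole_file_digest(data: bytes) -> bytes:
    """Whole-data scheme: a single hash over everything."""
    return hashlib.sha256(data).digest()


def chunk_digests(data: bytes) -> list:
    """Chunked scheme: one hash per fixed-size chunk (a flat hash list)."""
    return [
        hashlib.sha256(data[off:off + CHUNK_SIZE]).digest()
        for off in range(0, len(data), CHUNK_SIZE)
    ]


def read_chunk_verified(data: bytes, index: int, trusted: list) -> bytes:
    """Lazy verification: hash and check only the chunk being read."""
    chunk = data[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]
    if hashlib.sha256(chunk).digest() != trusted[index]:
        raise IOError(f"chunk {index} failed verification")
    return chunk


original = bytes(range(256)) * 64          # 16 KiB, i.e. four chunks
trusted_whole = whole_file_digest(original)
trusted_list = chunk_digests(original)

tampered = bytearray(original)
tampered[2 * CHUNK_SIZE] ^= 0xFF           # flip one byte inside chunk 2
tampered = bytes(tampered)

# Eager check: must hash all the data before anything can be used.
eager_ok = whole_file_digest(tampered) == trusted_whole

# Lazy check: chunk 0 is still readable, chunk 2 is rejected.
chunk0 = read_chunk_verified(tampered, 0, trusted_list)
try:
    read_chunk_verified(tampered, 2, trusted_list)
    chunk2_ok = True
except IOError:
    chunk2_ok = False
```

The eager check rejects the whole file, while the lazy check only fails when the tampered chunk is actually read, at the cost of having to protect a list of hashes instead of just one.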

From the above taxonomy, fs-verity operates on files, verifies transparently in the kernel, makes the file unwritable, builds a Merkle tree of hashed blocks, and verifies lazily as pages of the file are read. This is well suited to its goals of authenticating binaries in normal filesystems and has the following advantages:

allows for fast launches
doesn't hash parts of the huge binary that don't get used
doesn't add too much overhead to reads
doesn't affect the rest of the filesystem
detects tampering even during a long running execution

The following chart categorizes some notable tools in the space based on design choices:

In this chart, we can see (among other things) that:

fs-verity is most similar to dm-verity, differing only by operating on files rather than a whole device.
IMA is similar to the userspace offerings, but in-kernel.
dm-integrity and btrfs HMACs are set apart from the rest by leaving the data writeable; a hash list is more amenable to a writeable solution than a hash tree.

Now that we understand a bit more about how verity is different from openssl, we can follow an example using fs-verity to generate a Merkle tree with a signed root hash for the file we signed above and demonstrate detecting a corruption:

          # convert our certificate to the DER format
          $ openssl x509 -in cert.pem -out cert.der -outform der

          # add the certificate to the special fs-verity kernel keyring
          $ sudo keyctl padd asymmetric '' %keyring:.fs-verity < cert.der
          392921890

          # compute Merkle tree and sign the root hash
          $ fsverity sign foo foo.sig.verity --key=key.pem --cert=cert.pem
          Signed file 'foo' (sha256:cb6772f68fb81873b1497b871060b81be336d83543a7aa5c2c88418ca875576f)

          # enable fs-verity on the file
          $ sudo fsverity enable foo --signature foo.sig.verity

          # naive corruption no longer works (file is unwritable)
          $ dd if=/dev/zero of=foo conv=notrunc bs=4k count=1 seek=101
          dd: failed to open 'foo': Operation not permitted

          # but we can be clever and modify the blocks directly..
          # look up data extents (glossing over some details of btrfs chunk mapping)
          $ sudo xfs_io -r -c fiemap foo
          foo:
          0: [0..799]: 27016..27815
          1: [800..807]: 26632..26639
          2: [808..2047]: 27824..29063

          # write directly to the file data on the device (in this case an LVM volume)
          $ sudo dd if=/dev/zero of=/dev/vg0/lv0 bs=1 count=4k seek=$((27824*512)) conv=notrunc
          4096+0 records in
          4096+0 records out
          4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00427386 s, 958 kB/s

          # drop the page cache since we have read the file recently
          $ echo 3 | sudo tee /proc/sys/vm/drop_caches
          3

          # verity lets us read untampered blocks (because of lazy checking)
          $ dd if=foo bs=1 count=4k | sha256sum
          4096+0 records in
          4096+0 records out
          3ccba2f124ad2fd2c6fdecded66c19eaf4f698696c718ff0e978c9b0517b9900 -
          4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00780672 s, 525 kB/s

          # but fails on the tampered block (in the kernel, reading the block with a regular tool)
          $ dd if=foo bs=1 count=4k skip=$((512*808)) | sha256sum
          dd: error reading 'foo': Input/output error

fs-verity design

It is worthwhile to discuss the Merkle tree data structure to inform the rest of the design of fs-verity. The Merkle tree is a tree of hashes. Consider an example file of 4MiB consisting of 1024 4K blocks. We can think of it as a progression from computing a hash of the entire file contents:

to computing a separate hash for each block, allowing for lazy checking:

to recursively aggregating hashes into blocks and hashing those again until there is only one:

fs-verity stores this tree on-disk, so when we read a block, we can hash it, then read the hashes of the other blocks it aggregates with and hash those together, and keep reading and combining hashes going up the tree. Critically, fs-verity does not have to read every block in the file to get to the root hash, only the blocks of hashes on the path from the block being read to the root. If the root hash is signed, we can be sure to detect a tampered block, even if the attacker also tampers with the on-disk hashes. To achieve the same with a hash list, we would have to sign each hash separately.
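The walk up the tree can be sketched in a few lines of Python. This is a simplified model rather than the kernel's implementation: it assumes 4K blocks and unsalted SHA-256, keeps the tree in memory as a list of levels, and ignores the fs-verity descriptor that the real file digest also covers. The names `build_tree` and `verify_block` are made up for this sketch.

```python
import hashlib

BLOCK_SIZE = 4096
HASH_SIZE = 32
HASHES_PER_BLOCK = BLOCK_SIZE // HASH_SIZE  # 128 hashes fit in one block


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(data: bytes) -> list:
    """Return the tree as a list of levels (data must be non-empty).

    levels[0] holds one hash per data block; each higher level holds one
    hash per zero-padded block of hashes from the level below; the last
    level holds only the root hash.
    """
    level = [
        sha256(data[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0"))
        for off in range(0, len(data), BLOCK_SIZE)
    ]
    levels = [level]
    while len(level) > 1:
        level = [
            sha256(b"".join(level[i:i + HASHES_PER_BLOCK]).ljust(BLOCK_SIZE, b"\0"))
            for i in range(0, len(level), HASHES_PER_BLOCK)
        ]
        levels.append(level)
    return levels


def verify_block(block: bytes, index: int, levels: list, trusted_root: bytes) -> bool:
    """Recompute the path from one data block up to the root.

    Only the stored hash block containing this block's hash is read at
    each level; the rest of the file is never touched.
    """
    h = sha256(block.ljust(BLOCK_SIZE, b"\0"))
    for level in levels[:-1]:
        start = (index // HASHES_PER_BLOCK) * HASHES_PER_BLOCK
        siblings = list(level[start:start + HASHES_PER_BLOCK])
        siblings[index - start] = h  # use the hash we computed, not the stored one
        h = sha256(b"".join(siblings).ljust(BLOCK_SIZE, b"\0"))
        index //= HASHES_PER_BLOCK
    return h == trusted_root


# The 4MiB example file: 1024 blocks, each filled with its own block number.
data = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(1024))
levels = build_tree(data)
root = levels[-1][0]                      # this is the hash that gets signed
level_sizes = [len(level) for level in levels]

good = verify_block(data[808 * BLOCK_SIZE:809 * BLOCK_SIZE], 808, levels, root)
bad = verify_block(b"\0" * BLOCK_SIZE, 808, levels, root)
```

For the 1024-block file, the levels hold 1024, 8, and 1 hashes, so checking a single block takes three hash computations instead of rehashing all 4MiB. Overwriting one of the stored hashes does not help an attacker either, since the recomputed root then no longer matches the trusted one.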

Like many filesystem operations, most of this procedure is generically implemented by the Virtual File System (VFS) with some specific hooks into the filesystem itself. Just as the VFS provides a generic directory walk with help from the filesystem for fetching the contents of each directory, it also provides a generic Merkle tree with help from the filesystem for storing the Merkle tree data. The diagram below is a simplified view of the relationship between the major components involved in fs-verity.

The fs-verity code in the VFS layer handles everything to do with the Merkle tree itself, as well as integration with the fs-verity kernel keyring for using certificates to validate signatures (not pictured). It also implements an optimization where it caches verified subtrees and doesn’t walk up them redundantly (not pictured). However, VFS relies on the support of the particular filesystem for the following:

storing fs-verity data on disk (one verity descriptor per-file, and the Merkle tree nodes themselves).
ensuring every data page gets a generic fs-verity verify function called on it before it is marked Uptodate (and thus, visible to userspace).
implementing transaction semantics to roll back a partially built Merkle tree in the event of a failure.

fs-verity on btrfs

Each of the three responsibilities of a filesystem implementing fs-verity (storage, verification, and rollback) resulted in some interesting design choices for the btrfs implementation.

storage

Almost all of the functionality of btrfs is implemented in terms of a forest of generic on-disk b-trees, keyed by special triples (objectid, type, offset), all of which have context specific semantics. The b-trees lay out the data sorted by key, with objectid as most significant, then type, and then offset.

This makes storage for fs-verity a matter of choosing which b-tree to store it in and how to organize the keys. Each filesystem (and subvolume) already has a separate b-tree for file metadata like directory entries, inodes, and references to disk extents with the file data. These entries are keyed with the inode number as the objectid so that the entries are laid out by inode. We introduce a few new types for fs-verity items which results in storing fs-verity data along with this other per-file data. This is illustrated in a simplified manner below. Note the sorting by inode (most significant part of the key), then type, and then offset resulting in helpful colocation on disk.
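The effect of the key ordering can be modeled with plain tuple sorting in Python, which compares element by element just like the (objectid, type, offset) comparison described above. The type numbers below are believed to match the kernel's btrfs headers but should be treated as illustrative, and the inode numbers and offsets are made up for this sketch.

```python
import bisect

# Item type values (assumed to mirror the kernel's definitions).
INODE_ITEM = 1
INODE_REF = 12
VERITY_DESC_ITEM = 36
VERITY_MERKLE_ITEM = 37
EXTENT_DATA = 108

# Keys are (objectid, type, offset); objectid is the inode number here.
items = [
    (258, EXTENT_DATA, 0),
    (257, VERITY_MERKLE_ITEM, 4096),
    (257, EXTENT_DATA, 0),
    (258, INODE_ITEM, 0),
    (257, INODE_ITEM, 0),
    (257, VERITY_MERKLE_ITEM, 0),
    (257, VERITY_DESC_ITEM, 0),
    (257, EXTENT_DATA, 4096),
]

# Sorting tuples orders by objectid, then type, then offset.
ordered = sorted(items)

# All fs-verity items of inode 257 form one contiguous range of keys.
lo = bisect.bisect_left(ordered, (257, VERITY_DESC_ITEM, 0))
hi = bisect.bisect_right(ordered, (257, VERITY_MERKLE_ITEM, 2**64 - 1))
verity_items = ordered[lo:hi]
```

In the sorted order, the verity items for inode 257 sit between its inode item and its extent data items, and before anything belonging to inode 258, so they can be found with a single range lookup.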

verification

fs-verity expects the filesystem to call back into it whenever a page is read by readpages(), before the filesystem marks the page Uptodate. In btrfs, this is complicated by the existence of several types of fancy file extents, which perform optimizations or implement useful features. These include:

inline extents: small files can be stored inline with the file's inode.
preallocated extents: if a file is grown with truncate or fallocate(), btrfs allocates space for its data, but just marks the extent as zeroed, rather than writing zeros to disk.
holes: for fallocate() hole punching, btrfs does the same as preallocation, except without allocating the extent.
compressed extents: file data compressed using btrfs transparent compression.

Corruption in the metadata that encodes these sorts of extents also effectively changes the file's contents, so fs-verity must be used to verify this metadata accordingly. This requires carefully ensuring that any path that artificially fills out a page for a read and marks it up to date has an fs-verity check. With some recent, unrelated refactoring for subpage reads, this luckily became natural to do in the btrfs readpage code path.

rollback

With typical settings of 4K pages, 4K Merkle tree blocks, and SHA-256 hashes (32 bytes), you can fit 128 hashes in a block, so each layer of the tree is 1/128th the size of the previous layer. This adds up to the whole tree being ~1/127th the size of the file. For a large file, this is still quite large. As a result, it is important to handle errors or interruptions while enabling fs-verity. In the happy error case, the VFS fs-verity code lets the filesystem know something went wrong and tells it to drop the partial fs-verity data. It is also possible that the rollback itself could fail, or that the system could crash at any arbitrary point while enabling fs-verity.
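The arithmetic is easy to check with a short Python sketch. The function name is made up for illustration, and the defaults are the typical settings mentioned above; each layer shrinks by a factor of 128, and the geometric series 1/128 + 1/128² + ... sums to 1/127.

```python
def merkle_level_blocks(file_size, block_size=4096, hash_size=32):
    """Number of hash blocks at each level of the tree, lowest level first."""
    hashes_per_block = block_size // hash_size
    n = -(-file_size // block_size)        # data blocks, rounded up
    levels = []
    while n > 1:
        n = -(-n // hashes_per_block)      # blocks needed to hold n hashes
        levels.append(n)
    return levels


levels_1g = merkle_level_blocks(1 << 30)   # a 1GiB file
tree_bytes = sum(levels_1g) * 4096
ratio = tree_bytes / (1 << 30)
```

For a 1GiB file this gives 2048, 16, and 1 hash blocks, or about 8MiB of Merkle tree in total, which is a lot of data to write out in one go.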

While btrfs does have a transaction mechanism at the base of its copy-on-write semantics, that mechanism is designed for quick transactions rather than long operations like writing out a whole file. Luckily, there is another tool we can use: "orphans".

When we start enabling fs-verity on a file, we also create a special orphan item in the b-tree. We remove this item only when we finish fs-verity enabling. That way, the orphan exists if and only if we are in the process of enabling. If we detect an orphan while mounting (after a crash, for example), we can safely delete the partial items. Normal btrfs transactions are used to ensure atomicity of operations like finishing fs-verity enabling and deleting the orphan. Since it is quite finicky to get right, this setup can be extensively tested using dm-log-writes (https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/log-writes.html) and dm-snapshot.
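The orphan protocol can be modeled with a toy in-memory "filesystem" in Python. Everything here is hypothetical (the `ToyFs` class, its key layout, and the simulated crash); it only illustrates the invariant that the orphan exists exactly while enabling is in progress, not the actual btrfs code.

```python
class SimulatedCrash(Exception):
    """Stands in for a crash or power loss in the middle of enabling."""


class ToyFs:
    def __init__(self):
        self.items = {}  # key tuple -> payload, a stand-in for the b-tree

    def enable_verity(self, ino, tree_blocks, crash_after=None):
        # 1. Record the orphan before writing any fs-verity data.
        self.items[("orphan", ino)] = b""
        # 2. Long-running part: write the Merkle tree block by block.
        for n, block in enumerate(tree_blocks):
            if crash_after is not None and n == crash_after:
                raise SimulatedCrash()
            self.items[(ino, "merkle", n)] = block
        # 3. Short final step (one transaction in the real filesystem):
        #    write the descriptor and drop the orphan together.
        self.items[(ino, "desc", 0)] = b"descriptor"
        del self.items[("orphan", ino)]

    def mount(self):
        # An orphan found at mount time means enabling never finished,
        # so any fs-verity items for that inode are partial: delete them.
        for key in [k for k in self.items if k[0] == "orphan"]:
            ino = key[1]
            for k in [k for k in self.items
                      if k[0] == ino and k[1] in ("merkle", "desc")]:
                del self.items[k]
            del self.items[key]


fs = ToyFs()
blocks = [b"a", b"b", b"c", b"d"]
try:
    fs.enable_verity(257, blocks, crash_after=2)
except SimulatedCrash:
    pass
items_after_crash = dict(fs.items)   # orphan plus two partial tree blocks
fs.mount()
items_after_mount = dict(fs.items)   # cleaned up: nothing left

fs.enable_verity(257, blocks)        # a complete run leaves no orphan
fs.mount()
items_after_success = dict(fs.items)
```

After the simulated crash, the orphan and two partial Merkle items are present; the next mount removes all three. After a successful run there is no orphan, so mounting leaves the completed fs-verity items alone.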

Why not use native checksums?

By default, btrfs already checksums each file extent, so it seems odd to not make use of that capability for supporting authentication. In fact, a well thought out patch series by Johannes Thumshirn implemented HMAC using btrfs checksums quite elegantly, though it hasn't been merged yet. There's no overwhelming reason to insist on using fs-verity over native btrfs HMAC, but some benefits of fs-verity are:

It's generic in VFS, so it's more compelling to support in userspace. For example, rpm already has support for it.
fs-verity makes the file irrevocably unwritable, while using authenticated checksums wouldn't do that.
btrfs checksums do not use a Merkle tree, so each extent would have to be authenticated separately. There is no single signed hash for the whole file.

As discussed earlier, these two solutions don't necessarily exclude each other and can happily coexist.

Conclusion

fs-verity is one of many options for authenticating files on Linux, and one we think is well suited to production binaries and images. It is already available in ext4 and f2fs (and critical on your Android phone) and will be available in btrfs starting with the 5.15 kernel series.
