fs-verity is a Linux filesystem feature that ensures the integrity and authenticity of a read-only file in an otherwise writable filesystem. It was originally built to authenticate the contents of app package (APK) files on Android devices using the ext4 and f2fs filesystems. Facebook plans to use it to authenticate system base images, container images, and individual binaries on production servers using btrfs.
Without getting bogged down in excessive mathematics or security jargon, it is useful to sketch how file authentication is typically done. This illustrates the tradeoffs inherent in the design of fs-verity and contrasts it with other, similar tools.
Fundamentally, to verify the authenticity of some data, a system needs to check that the data itself hasn't changed and confirm that the metadata used for checking the data hasn't been tampered with. This is typically done by hashing the data and securely authenticating the hash.
For example, using openssl, we can sign a sha256 hash of a file in userspace and detect tampering:
# generate a private key and public key certificate
$ openssl req -newkey rsa:4096 -nodes -keyout key.pem -x509 -out cert.pem

# reformat the certificate as a raw public key file
$ openssl x509 -pubkey -out pubkey.pem -in cert.pem

# generate a 1M file
$ dd if=/dev/urandom of=foo bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00651344 s, 161 MB/s

# create a signed sha256 hash of the file
$ openssl dgst -out foo.sig -sign key.pem foo

# verify the signed hash with the public key
$ openssl dgst -verify pubkey.pem -signature foo.sig foo
Verified OK

# tamper with the file
$ dd if=/dev/zero of=foo conv=notrunc bs=4k count=1 seek=100
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000141911 s, 28.9 MB/s

# try verifying again
$ openssl dgst -verify pubkey.pem -signature foo.sig foo
Verification Failure

# reading the file is clearly still OK
$ sha256sum foo
68db4587ffb4162e4ff2027c6d629521fb79796dfd8e530504618fbb2dd36750  foo
Though signing a hash is the basic form, the designer of an authentication tool must make a few high-level decisions about exactly how the hashing and signing are done. These choices form a taxonomy we can use to categorize and compare different authentication tools: what unit the tool operates on (a file, or something larger like a whole block device); where verification happens (transparently in the kernel, or explicitly in userspace); whether the protected data remains writable; how hashes are structured (a single hash of everything, a list of per-block hashes, or a Merkle tree); and whether verification is eager (done up front) or lazy (done as data is read).
From the above taxonomy, fs-verity operates on files, verifies transparently in the kernel, makes the file unwritable, builds a Merkle tree of hashed blocks, and verifies lazily as pages of the file are read. This is well suited to its goals of authenticating binaries in normal filesystems and has the following advantages:
The following chart categorizes some notable tools in the space based on design choices:
In this chart, we can see (among other things) that:
Now that we understand a bit more about how fs-verity differs from signing with openssl, we can follow an example using fs-verity to generate a Merkle tree with a signed root hash for the file we signed above, and demonstrate detecting a corruption:
# convert our certificate to the DER format
$ openssl x509 -in cert.pem -out cert.der -outform der

# add the certificate to the special fs-verity kernel keyring
$ sudo keyctl padd asymmetric '' %keyring:.fs-verity < cert.der
392921890

# compute Merkle tree and sign the root hash
$ fsverity sign foo foo.sig.verity --key=key.pem --cert=cert.pem
Signed file 'foo' (sha256:cb6772f68fb81873b1497b871060b81be336d83543a7aa5c2c88418ca875576f)

# enable fs-verity on the file
$ sudo fsverity enable foo --signature foo.sig.verity

# naive corruption no longer works (file is unwritable)
$ dd if=/dev/zero of=foo conv=notrunc bs=4k count=1 seek=101
dd: failed to open 'foo': Operation not permitted

# but we can be clever and modify the blocks directly..
# look up data extents (glossing over some details of btrfs chunk mapping)
$ sudo xfs_io -r -c fiemap foo
foo:
    0: [0..799]: 27016..27815
    1: [800..807]: 26632..26639
    2: [808..2047]: 27824..29063

# write directly to the file data on the device (in this case an LVM volume)
$ sudo dd if=/dev/zero of=/dev/vg0/lv0 bs=1 count=4k seek=$((27824*512)) conv=notrunc
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00427386 s, 958 kB/s

# drop the page cache since we have read the file recently
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3

# verity lets us read untampered blocks (because of lazy checking)
$ dd if=foo bs=1 count=4k | sha256sum
4096+0 records in
4096+0 records out
3ccba2f124ad2fd2c6fdecded66c19eaf4f698696c718ff0e978c9b0517b9900  -
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00780672 s, 525 kB/s

# but reading the tampered block fails (checked in the kernel, using a regular tool)
$ dd if=foo bs=1 count=4k skip=$((512*808)) | sha256sum
dd: error reading 'foo': Input/output error
It is worthwhile to discuss the Merkle tree data structure, as it informs the rest of the design of fs-verity. A Merkle tree is a tree of hashes. Consider an example file of 4MiB consisting of 1024 4K blocks. We can think of it as a progression from computing a hash of the entire file contents:
to computing a separate hash for each block, allowing for lazy checking:
to recursively aggregating hashes into blocks and hashing those again until there is only one:
fs-verity stores this tree on-disk, so when we read a block, we can hash it, then read the hashes for the other blocks it aggregates with and hash those, then keep reading and combining hashes going up the tree. Critically, fs-verity does not have to read every block in the file to get to the root hash, only the hash blocks on the path from that block to the root. If the root hash is signed, we can be sure to detect a tampered block, even if the attacker also tampers with the on-disk hashes. To achieve the same with a flat hash list, we would have to sign each hash separately.
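The progression above can be sketched in code. The following is a minimal Python model (not the on-disk fs-verity format) that builds a hash tree from 4K blocks and verifies a single block by walking only its path up to the root, as described:

```python
import hashlib

BLOCK_SIZE = 4096                            # data and tree block size (4K)
HASH_SIZE = 32                               # SHA-256 digest size
HASHES_PER_BLOCK = BLOCK_SIZE // HASH_SIZE   # 128 hashes fit in one tree block

def sha256(data):
    return hashlib.sha256(data).digest()

def build_merkle_tree(data):
    """Return a list of tree levels: per-block hashes first, root hash last."""
    # Level 0: one hash per 4K data block.
    level = [sha256(data[i:i + BLOCK_SIZE])
             for i in range(0, len(data), BLOCK_SIZE)]
    levels = [level]
    # Aggregate hashes into blocks and hash those, until one hash remains.
    while len(level) > 1:
        packed = b"".join(level)
        level = [sha256(packed[i:i + BLOCK_SIZE])
                 for i in range(0, len(packed), BLOCK_SIZE)]
        levels.append(level)
    return levels

def verify_block(data_block, index, levels):
    """Verify one data block lazily: hash it, compare against the stored
    (untrusted) on-disk hashes on its path, and anchor trust at the root."""
    h = sha256(data_block)
    for level in levels[:-1]:
        if level[index] != h:
            return False
        # Hash the tree block containing this hash to move up one level.
        parent = index // HASHES_PER_BLOCK
        packed = b"".join(level)
        h = sha256(packed[parent * BLOCK_SIZE:(parent + 1) * BLOCK_SIZE])
        index = parent
    # Only the root hash needs to be separately trusted (e.g. signed).
    return h == levels[-1][0]
```

Note that verifying one block touches only one hash per level, not the whole file, which is what makes lazy per-read checking cheap.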
Like many filesystem operations, most of this procedure is generically implemented by the Virtual File System (VFS) with some specific hooks into the filesystem itself. Just as the VFS provides a generic directory walk with help from the filesystem for fetching the contents of each directory, it also provides a generic Merkle tree with help from the filesystem for storing the Merkle tree data. The diagram below is a simplified view of the relationship between the major components involved in fs-verity.
The fs-verity code in the VFS layer handles everything to do with the Merkle tree itself, as well as integration with the fs-verity kernel keyring for using certificates to validate signatures (not pictured). It also implements an optimization where it caches verified subtrees and doesn't walk up them redundantly (not pictured). However, VFS relies on the support of the particular filesystem for the following: storing the Merkle tree data, hooking verification into the read path, and rolling back if enabling fs-verity fails.
Each of these three responsibilities (storage, verification, and rollback) resulted in some interesting design choices for the btrfs implementation.
Almost all of the functionality of btrfs is implemented in terms of a forest of generic on-disk b-trees, keyed by special triples (objectid, type, offset), all of which have context-specific semantics. The b-trees lay out the data sorted by key, with objectid as most significant, then type, and then offset.
This makes storage for fs-verity a matter of choosing which b-tree to store it in and how to organize the keys. Each filesystem (and subvolume) already has a separate b-tree for file metadata like directory entries, inodes, and references to disk extents with the file data. These entries are keyed with the inode number as the objectid so that the entries are laid out by inode. We introduce a few new types for fs-verity items which results in storing fs-verity data along with this other per-file data. This is illustrated in a simplified manner below. Note the sorting by inode (most significant part of the key), then type, and then offset resulting in helpful colocation on disk.
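As a rough illustration of that colocation, here is a toy Python model of the key sort order (not real btrfs code; the numeric type codes are illustrative, though they mirror the relative ordering btrfs uses, with inode items first and extent data last):

```python
# Illustrative item type codes; ordering, not the exact values, is the point.
INODE_ITEM = 1
VERITY_DESC = 36      # fs-verity descriptor item
VERITY_MERKLE = 37    # fs-verity Merkle tree data item
EXTENT_DATA = 108     # reference to a file data extent

# Items are keyed by (objectid = inode number, type, offset).
items = [
    (257, EXTENT_DATA, 0),
    (258, INODE_ITEM, 0),
    (257, VERITY_MERKLE, 4096),
    (257, INODE_ITEM, 0),
    (257, VERITY_MERKLE, 0),
    (257, VERITY_DESC, 0),
]

# The b-tree lays items out sorted by key, so all of inode 257's items,
# including its fs-verity data, end up colocated on disk.
for key in sorted(items):
    print(key)
```

Sorting by inode first means a file's verity items live next to its inode and extent items, so reads that need both touch nearby b-tree leaves.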
fs-verity expects the filesystem to call back into it whenever a page is read by readpages(), before the filesystem marks the page Uptodate. In btrfs, this is complicated by the existence of several types of fancy file extents, which perform optimizations or implement useful features. These include:
Corruption in the metadata that encodes these sorts of extents also effectively changes the file's contents, so fs-verity must verify this metadata accordingly. This requires carefully ensuring that any path that artificially fills out a page for a read and marks it Uptodate performs an fs-verity check. Thanks to some recent, unrelated refactoring for subpage reads, this luckily became natural to do in the btrfs readpage code path.
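The requirement can be sketched as follows. This is a toy Python model (an assumed shape, not kernel code) that uses a flat list of per-block hashes instead of a real Merkle tree; the point is that even synthesized pages, such as holes, get verified before they would be marked Uptodate:

```python
import hashlib

BLOCK_SIZE = 4096

def read_block(extent, verity_leaf_hashes, index):
    """Fill a page from an extent, then verify before 'marking it Uptodate'.

    extent is None for a hole, or a dict with 'inline' and 'data' keys;
    this shape is purely illustrative.
    """
    if extent is None:
        data = b"\x00" * BLOCK_SIZE                    # hole: page of zeroes
    elif extent["inline"]:
        data = extent["data"].ljust(BLOCK_SIZE, b"\x00")  # inline extent, padded
    else:
        data = extent["data"]                          # regular extent from disk
    # fs-verity hook: verify even synthesized pages, because tampering with
    # the metadata that encodes a hole or inline extent changes file contents.
    if hashlib.sha256(data).digest() != verity_leaf_hashes[index]:
        raise IOError("fs-verity: block %d failed verification" % index)
    return data  # only now would the page be marked Uptodate
```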
With typical settings of 4K pages, 4K Merkle tree blocks, and SHA-256 hashes (32 bytes), 128 hashes fit in a block, so each layer of the tree is 1/128th the size of the layer below it. This adds up to the whole tree being ~1/127th the size of the file, which is still quite large for a large file. As a result, it is important to handle errors or interruptions while enabling fs-verity. In the simple error case, the VFS fs-verity code lets the filesystem know something went wrong and tells it to drop the partial fs-verity data. But that rollback could itself fail, or the system could crash at any arbitrary point while enabling fs-verity.
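The ~1/127 figure can be checked with a little arithmetic. This sketch computes the total tree size for a given file size under the stated assumptions (4K blocks, 32-byte SHA-256 hashes):

```python
BLOCK = 4096             # data and tree block size
HASH = 32                # SHA-256 digest size
FANOUT = BLOCK // HASH   # 128 hashes per tree block

def merkle_tree_bytes(file_size):
    """Total on-disk size of the hash tree for a file of file_size bytes."""
    blocks = -(-file_size // BLOCK)      # ceil division: number of data blocks
    total_tree_blocks = 0
    # Each level needs one tree block per 128 hashes of the level below,
    # so each level is ~1/128th the previous; the geometric sum
    # 1/128 + 1/128^2 + ... converges to 1/127.
    while blocks > 1:
        blocks = -(-blocks // FANOUT)
        total_tree_blocks += blocks
    return total_tree_blocks * BLOCK
```

For a 1GiB file this works out to a bit over 8MiB of hash tree, matching the ~1/127 estimate.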
While btrfs does have a transaction mechanism at the base of its copy-on-write semantics, that mechanism is designed for quick transactions rather than long operations like writing out a whole file. Luckily, there is another tool we can use: "orphans".
When we start enabling fs-verity on a file, we also create a special orphan item in the b-tree, and we remove it only when enabling finishes. That way, the orphan exists if and only if we are in the process of enabling. If we detect an orphan while mounting (after a crash, for example), we can safely delete the partial items. Normal btrfs transactions ensure the atomicity of operations like finishing fs-verity enabling and deleting the orphan. Since this is quite finicky to get right, the setup can be extensively tested using dm-log-writes and dm-snapshot: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/log-writes.html
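To make the invariant concrete, here is a toy Python model of the orphan protocol (not kernel code): the orphan item exists exactly while enabling is in progress, so a mount after a crash can find and drop the partial Merkle tree items.

```python
class Filesystem:
    """Toy model of orphan-based cleanup for interrupted fs-verity enabling."""

    def __init__(self):
        self.orphans = set()
        self.verity_items = {}   # inode -> list of written tree items

    def enable_verity(self, inode, tree_blocks, crash_after=None):
        self.orphans.add(inode)              # 1. record that enabling started
        self.verity_items[inode] = []
        for i, block in enumerate(tree_blocks):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("crash")  # simulated crash mid-enable
            self.verity_items[inode].append(block)
        # 2. finishing and dropping the orphan happen in one transaction,
        # so the orphan exists iff enabling is unfinished.
        self.orphans.discard(inode)

    def mount(self):
        # 3. on mount, any surviving orphan means enabling never finished,
        # so the partial items can be safely deleted.
        for inode in list(self.orphans):
            del self.verity_items[inode]
            self.orphans.discard(inode)
```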
By default, btrfs already checksums each file extent, so it seems odd not to make use of that capability for supporting authentication. In fact, a well-thought-out patch series by Johannes Thumshirn implemented HMAC using btrfs checksums quite elegantly, though it hasn't been merged yet. There's no overwhelming reason to insist on using fs-verity over native btrfs HMAC, but some benefits of fs-verity are:
As discussed earlier, these two solutions don't necessarily exclude each other and can happily coexist.
fs-verity is one of many options for authenticating files on Linux, and one we think is well suited to production binaries and images. It is already available in ext4 and f2fs (and critical on your Android phone) and will be available in btrfs starting with the 5.15 kernel series.