Kernel Log: Coming in 2.6.38 (Part 2) – File systems
by Thorsten Leemhuis
Linux 2.6.38 contains patches to improve the scalability of VFS that have been the topic of much discussion for the past six months and that Torvalds himself was waiting for. Ext3 and XFS now support batched discard, which is interesting for SSDs, while Btrfs and SquashFS support additional compression technologies.
All the parts in this Kernel Log mini-series can be found by referring to the 2.6.38 tracking page.
On Wednesday, Linus Torvalds released the fifth pre-release of kernel version 2.6.38 saying that some regressions have been fixed and other changes are "pretty spread out and small". The Kernel Log therefore takes the opportunity to continue the overview of the major changes in Linux 2.6.38 with the second part of the mini-series "Coming in 2.6.38". Part one discussed the main changes pertaining to graphics drivers, and in the next few weeks we will be discussing network support, storage hardware, drivers, and code for architecture and infrastructure.
Some of the VFS (Virtual File System) optimisations in 2.6.38 were especially important to Torvalds, who could not hide his excitement in a detailed description in his email on the first pre-release version of 2.6.38 (see 1, 2, 3, 4, 5, 6, 7). VFS, which provides basic functions for all file systems, now uses finer-grained locking with RCU (read-copy-update). Thanks to changes developed under such names as "RCU-based name lookup", a number of operations in the resolution of file names are considerably faster. Large servers with many processor cores will not be the only ones to benefit; off-the-shelf systems will too. In his release email, Torvalds says the performance improvements range from 30 to 50 per cent in certain tests of file and name resolution. In an earlier email on the adoption of the far-reaching optimisations of VFS, a subsystem that is very important for the reliability of the kernel, he wrote that a find command executed with filled caches in his home directory was around 35 per cent faster, even though find works with only a single thread.
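Readers who want to try a comparable measurement on their own systems could start from something like the following Python sketch. It is not Torvalds' test: the function names are made up for illustration, and it simply walks a directory tree once to fill the caches and then times a second, single-threaded walk, which exercises mostly name lookups.

```python
import os
import time


def count_entries(root):
    """Walk the tree under root and return the number of files and directories seen."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def time_warm_walk(root):
    """Walk once to fill the caches, then time a second walk.

    Returns (number of entries, seconds taken by the second walk).
    """
    count_entries(root)  # first pass: populates the kernel's dentry and inode caches
    start = time.perf_counter()
    entries = count_entries(root)
    elapsed = time.perf_counter() - start
    return entries, elapsed
```

Calling `time_warm_walk(os.path.expanduser("~"))` under an older kernel and under 2.6.38 on the same machine would give a rough comparison; the absolute numbers depend heavily on the hardware and on the size of the tree.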
Nick Piggin headed the work on these optimisations for more than a year. Torvalds wanted to integrate predecessors of the patches, which went under such names as "vfs-scale" and "RCU-walk", as early as 2.6.36, but the developers involved agreed at the time to integrate only a few of the changes, those that laid the foundations and were not dangerous. Others followed in 2.6.37; during that cycle another developer presented similar optimisations, partly based on Piggin's ideas and patches for VFS, which led to a discussion over which approach would be best. LWN.net provided details on the debates and explanations of how the current and previously released changes to VFS work in the articles "VFS scalability patches in 2.6.36", "Resolving the inode scalability discussion" and "Dcache scalability and RCU-walk". Piggin explains some of the information that is important for developers of file systems in an LKML email.
Ext family, XFS and Btrfs
In 2.6.38, Ext3 and XFS now support batched discard, which was integrated in the 2.6.37 kernel and is especially interesting for SSDs with a slow TRIM function (1, 2, 3). In the commit message for the change, Christoph Hellwig explicitly points out that batched discard requests should not be issued during normal workloads because the search for free space drains performance. In the "XFS status update for January 2011", the kernel hacker mentions some of the XFS changes in 2.6.37 and 2.6.38, including some optimisations to the log subsystem's locking code which considerably improve scalability.
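User space triggers a batched discard through the FITRIM ioctl, which takes a `struct fstrim_range` (start, len, minlen) as defined in `<linux/fs.h>`. The following Python sketch shows roughly what such a call looks like; the helper names are invented for illustration, the request number is computed with the ioctl encoding used on x86 and most other architectures, and the call itself needs root privileges and a file system and device that support discard.

```python
import fcntl
import os
import struct

# FITRIM is _IOWR('X', 121, struct fstrim_range) in <linux/fs.h>. Python's fcntl
# module does not export it, so it is computed here using the ioctl number layout
# of x86 and most other architectures (an assumption; some architectures differ).
FITRIM = (3 << 30) | (24 << 16) | (ord("X") << 8) | 121

U64_MAX = 2**64 - 1


def pack_fstrim_range(start=0, length=U64_MAX, minlen=0):
    """Pack struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }."""
    return struct.pack("=QQQ", start, length, minlen)


def batched_discard(mountpoint, start=0, length=U64_MAX, minlen=0):
    """Ask the file system mounted at mountpoint to discard its free space.

    Returns the number of bytes the kernel reports as trimmed.
    """
    buf = bytearray(pack_fstrim_range(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        # The kernel writes the number of trimmed bytes back into the len field.
        fcntl.ioctl(fd, FITRIM, buf)
    finally:
        os.close(fd)
    _start, trimmed, _minlen = struct.unpack("=QQQ", buf)
    return trimmed
```

In line with Hellwig's warning, a call such as `batched_discard("/home")` is something to run occasionally during idle periods, for example from a scheduled job, rather than during normal workloads.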
In the merge window of 2.6.37, the kernel hackers integrated a number of patches to increase the performance of Ext4; however, they had to disable "Multiple Page-IO Submission", which promised the biggest improvements, just before this kernel version was completed because some problems remained unsolved. The flaws were later corrected, but some of them were not fixed until the merge window of 2.6.38 had closed. The technology, which can be optionally switched on, therefore remains switched off in 2.6.38 and is to be enabled by default in the merge window of 2.6.39.
The still experimental CoW (Copy on Write) file system Btrfs can now use LZO, in addition to Zlib, for transparent compression. LZO is generally much faster, but does not compress quite as efficiently; the commit comment provides some of the measurements supporting that claim along with a comparison with an uncompressed file system (1, 2). Btrfs now also supports write-protected snapshots; in his main git pull request, Btrfs developer Chris Mason also mentions some corrections in the code to support multiple Btrfs file systems across multiple storage media.
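The trade-off between speed and compression ratio can be explored with a small measurement harness like the Python sketch below. Note that it does not measure LZO, which is not part of the Python standard library, and it has nothing to do with the Btrfs code itself: as a stand-in it compares a fast and a thorough Zlib level, and all function names are hypothetical.

```python
import time
import zlib


def measure(data, level):
    """Compress data with zlib at the given level; return (ratio, seconds).

    The ratio is compressed size divided by original size, so smaller is better.
    """
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Sanity check: compression must be lossless.
    assert zlib.decompress(compressed) == data
    return len(compressed) / len(data), elapsed


def compare(data, levels=(1, 9)):
    """Return {level: (ratio, seconds)} for each zlib level in levels."""
    return {level: measure(data, level) for level in levels}
```

Feeding `compare()` the contents of a typical file from the data set in question gives a first impression of how much extra time a better ratio costs; results vary strongly with the kind of data.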