Kernel Log: Coming in 3.9 (part 1)
Filesystems and storage
by Thorsten Leemhuis
The Linux kernel can now be set up to use SSDs as cache for hard drives; Btrfs has native RAID 5 and 6 support. The kernel development team has also resolved two performance problems caused by previous changes.
On Sunday, Linus Torvalds released the fourth pre-release version of Linux kernel 3.9. In his release notes, he noted that development has not yet settled down and called for testing of the RC.
As usual, Torvalds and the other kernel developers merged all of the major changes planned for Linux 3.9 into the kernel in the two weeks following the release of version 3.8. Development is now in the stabilisation phase, during which major changes are rare, allowing the Kernel Log to provide a comprehensive overview of the major new features to be expected in the new Linux version, which is due to be released in late April.
This overview will be provided by a series of articles dealing with various facets of the kernel. The series opens with a description of new features in the areas of storage technology and filesystems. Over the next few weeks, further articles will deal with graphics drivers, kernel infrastructure, networking, processor/platform support and drivers for other hardware.
Device mapper, which is used by the logical volume manager (LVM) but can also be used independently, now includes a cache target called "dm-cache" (1, 2, 3). This option enables a drive to be set up as a cache for another storage device, for example, an SSD as a cache for a hard drive. This feature is able to speed up data writes, as it allows the faster SSD to first cache data and then, in a quiet moment, transfer it to the slower hard drive. The cache target is also able to store frequently read data from the hard drive on the SSD in order to speed up access to it. How exactly the cache target approaches this task is not programmed into dm-cache, but is determined by policy modules. Details of how it all works can be found in the cache target documentation and in the two currently available policy modules, "multiqueue" and "cleaner".
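The cache target described above is assembled with dmsetup. The following is a minimal sketch based on the table format in the cache target documentation; the device paths, sizes and policy arguments are placeholders to be adapted to a real setup:

```shell
# Sketch of assembling a dm-cache device with dmsetup.
# Table format: start length cache <metadata dev> <cache dev> <origin dev>
#               <block size> <#feature args> [features] <policy> <#policy args>
#
# Here: a 4 GiB origin disk (8388608 512-byte sectors), 512-sector cache
# blocks, writeback mode, and the default policy with no extra arguments.
# /dev/sdc1 (metadata), /dev/sdb1 (SSD cache) and /dev/sda1 (origin disk)
# are placeholder devices.
dmsetup create cached-disk --table \
  '0 8388608 cache /dev/sdc1 /dev/sdb1 /dev/sda1 512 1 writeback default 0'

# The combined device then appears as /dev/mapper/cached-disk.
```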
This feature, which is classed as experimental, is a new development that, by taking an alternative starting point from within the kernel, achieves much the same as the more venerable flashcache and bcache caching solutions, both of which are maintained outside the kernel. The developer behind the latter has been working towards merging it into the kernel for some time and has recently added some of the basic features required to do so to Linux's block layer. Following the merger of dm-cache, it was for a while unclear whether the kernel developers would now merge bcache. Jens Axboe, maintainer of the block I/O code in the Linux kernel, has, however, merged bcache into his git development tree, in which he is collecting changes that he plans to merge into Linux 3.10.
Things have gone a bit quiet in the flashcache world recently, but, prior to the merger of dm-cache, Axboe had expressed his support for merging EnhanceIO, an SSD caching driver derived from flashcache, into the kernel's staging area. The code originates from STEC, a company specialising in SSD-related hardware and software.
In addition to RAID 0 and 1, the Btrfs filesystem now includes experimental native support for RAID 5 and 6, as unveiled in February. Embedding RAID capabilities within the filesystem allows implementation of features that are difficult to realise using the layer model, in which the filesystem addresses the RAID array as if it were just a large disk and does not concern itself with the underlying complexity of the array.
RAID functionality embedded in the filesystem means that, for example, in the event of the failure and replacement of a disk forming part of a Btrfs RAID array, Btrfs need only restore areas containing data, since it is able to determine which areas are occupied. However, abstraction means that a Linux software RAID array administered using mdadm is not able to access this information and therefore has to restore the RAID volume in its entirety, which is time-consuming.
Native filesystem RAID support also offers benefits in the event of data errors, as the filesystem is able to address the disks making up the array directly. With a RAID 1 array, Btrfs is thus (ideally at least) able to use the checksums stored in the filesystem to determine which disks in an array are delivering correct, and which incorrect, data. Btrfs is also able to combine different RAID levels within a single filesystem, for example by storing metadata using RAID 1 (mirroring), whilst using RAID 0 (striping) for the filesystem's payload data.
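Mixing RAID levels as described above is selected at filesystem creation time. A hypothetical example using mkfs.btrfs, with placeholder device names:

```shell
# Create a Btrfs filesystem across two disks, mirroring metadata (RAID 1)
# while striping data (RAID 0); /dev/sdb and /dev/sdc are placeholders.
mkfs.btrfs -m raid1 -d raid0 /dev/sdb /dev/sdc

# With the RAID 5/6 code merged in 3.9, data can instead be striped with
# parity across three or more disks (still experimental):
mkfs.btrfs -m raid1 -d raid5 /dev/sdb /dev/sdc /dev/sdd
```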
The development team behind the still-experimental filesystem have also merged a number of other changes, including changes aimed at further improving the filesystem's fsync performance, which is considered problematic (1 and others). When defragmenting Btrfs filesystems containing snapshots, data segments shared by multiple snapshots are now preserved and are no longer subject to space-wasting splitting. A SUSE developer has also improved the send/receive code so that, if required, it will now send metadata only – this is supposed to improve the efficiency of SUSE's "snapper".
- Among other work, ext filesystem developers have fixed a performance problem, introduced in Linux 3.0, in the JBD2 journalling layer used by ext4.
- Support for user namespaces has been added to CIFS, NFS and various other filesystems. This change has not, however, permeated through to XFS, meaning that user namespaces can still only be activated in the kernel configuration if XFS is deactivated. As a result, many distributions are likely to continue to omit support for user namespaces in their kernels.
- Changes to Fuse include optimisation of scatter-gather direct IO and support for the readdirplus API. The latter has already been used in NFS and can speed up certain file handling operations.
- The sysfs filesystem now has a directory (/sys/fs/pstore/) for mounting the pstore filesystem. In the event of a system crash, this can be used to store data useful for analysing the cause of the crash after rebooting.
- The cgroup controller for regulating disk read and write speeds now correctly supports hierarchical control groups where CFQ is used as the I/O scheduler. This does not yet apply to I/O throttling, however (1, 2 and others).
- Changes to the kernel's memory management can reduce latencies produced by "stable pages" (1 and others). Since Linux 3.0, stable pages protect data already delivered to the kernel for writing but not yet written from further modification. This is important for processes such as checksum calculation and filesystem-implemented compression. More detail can be found in this LWN.net article.
- The libata drivers now support zero power optical device drives (ZPODD), optical drives that are able to power down almost completely to save energy when there is no CD or DVD in the drive (1, 2 and others).
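The pstore mount point mentioned above is used as follows; a minimal sketch, assuming pstore support is compiled into the kernel and a backend (such as EFI variables or ACPI ERST) is available:

```shell
# Mount the pstore filesystem at its new fixed mount point (requires root):
mount -t pstore pstore /sys/fs/pstore

# After a crash and reboot, any captured records appear as files here
# and can be read and deleted like ordinary files:
ls /sys/fs/pstore
```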