In association with heise online

20 November 2009, 17:40

Kernel Log: Coming in 2.6.32 (Part 3) - Storage

by Thorsten Leemhuis

The kernel development team has enhanced various aspects of Btrfs, one effect of which is a significant improvement in the experimental file system's write performance. A number of changes to the block layer promise better data throughput and responsiveness. There are also several new drivers for storage hardware.

Linux kernel 2.6.32-rc7 hit the streets late last week, but when 2.6.32 will finally be released remains anyone's guess. There is likely to be at least one, and more probably two, further pre-release versions in this development cycle, a cycle which has been slightly disrupted by typos and the Kernel Summit. Following on from our reviews of the changes in the networking and graphics hardware subsystems, this instalment of the Kernel Log 2.6.32 series looks at file systems and storage.


There have been a whole heap of changes to Btrfs. The experimental "next generation file system for Linux" can now write at more than 1 GB per second on fast hardware and now matches XFS for speed on our test system. Btrfs' previous maximum data transfer speed was 400 MB per second, because CPU usage was maxed out at that rate. Snapshots and subvolumes can now be renamed in Btrfs and can be deleted much more rapidly. Thanks to various enhancements, RPM and Yum now also work faster.

For delayed allocations, Btrfs is now more reliable in reserving enough space for metadata, ensuring that sufficient capacity is available for subsequent writes. There is some new experimental code for 'discard operations' which should eventually allow the file system to tell SSDs which blocks have been freed by deleting data – however the requisite support in the SCSI and Libata subsystems is still a work in progress.

A list of further Btrfs-related changes can be found at the end of this article and in various git pull requests in which Chris Mason, the main Btrfs developer, briefly explains the major changes (1, 2, 3). Mason, who works for Oracle, developed many of the changes himself, with many more contributed by other developers on the payrolls of various different companies. The importance of this kind of distribution of developer resources and know-how for a successful, resilient open source project was recently emphasised in a talk given at the Linux Kongress by Theodore Ts'o ('tytso'), who is well known for his work on the Ext file systems.

Ext3, Ext4, XFS, etc.

Other significant changes to the Linux kernel file system code:

  • Numerous changes to Ext3 and Ext4 have made their way into the main development tree via Theodore Ts'o (1, 2). One of them speeds up the fs_mark benchmark by around fifty per cent under certain set-ups. 'Data=guarded' mode in Ext3, long under development, has once more been left out in the cold.
  • Sysfs now supports security labels, allowing security frameworks such as SELinux to monitor access to the virtual file system.
  • By default, the kernel's VFAT code will no longer mount FAT drives with the behaviour activated by the "shortname=lower" option, but will instead use "shortname=mixed". This should mean that upper and lower case in file names will no longer be changed when copying using Linux.
  • The 9p file system now supports Fscache.
  • A change to XFS should make finding free inodes three to four times faster in certain situations. An overview of further developments in XFS can be found in the XFS status updates for September and October.
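Anyone who relies on the old VFAT behaviour can still request it explicitly at mount time. The following Python sketch only assembles the corresponding mount command line; the device `/dev/sdb1`, the mount point `/mnt/usb` and the helper name `vfat_mount_command` are hypothetical, and the list of accepted values is an assumption based on the kernel's VFAT documentation.

```python
# Assumed set of values accepted by the VFAT "shortname" mount option.
VALID_SHORTNAME_MODES = ("lower", "win95", "winnt", "mixed")


def vfat_mount_command(device, mountpoint, shortname="mixed"):
    """Return the argument list for mounting a FAT volume with a given shortname mode."""
    if shortname not in VALID_SHORTNAME_MODES:
        raise ValueError(f"unknown shortname mode: {shortname!r}")
    return ["mount", "-t", "vfat", "-o", f"shortname={shortname}", device, mountpoint]


if __name__ == "__main__":
    # Placeholder device and mount point; prints the command instead of running it.
    print(" ".join(vfat_mount_command("/dev/sdb1", "/mnt/usb", shortname="lower")))
```

The command is printed rather than executed because mounting requires root privileges and a real device.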

Block layer

The CFQ (Completely Fair Queuing) I/O scheduler used by many distributions now optimises requests for short response times. This should mean that when programs which process large volumes of data are running in the background, desktop applications running in the foreground will no longer be slowed to the same extent and will consequently feel faster. Background on the changes can be found in a piece by block subsystem maintainer Jens Axboe. The new behaviour does, however, cause some access patterns to run a little slower, for which reason CFQ's low latency mode can also be deactivated via sysfs.
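As a rough illustration of the sysfs switch, the following Python sketch reads and sets the low latency tunable. The path `/sys/block/<device>/queue/iosched/low_latency` and the device name `sda` are assumptions about a typical system with CFQ active; the helper names are hypothetical, and writing the file requires root privileges.

```python
from pathlib import Path


def low_latency_path(device, sysfs_root="/sys"):
    """Build the assumed sysfs path of CFQ's low_latency tunable for a block device."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / "low_latency"


def get_low_latency(device, sysfs_root="/sys"):
    """Return True if the tunable currently reads as '1'."""
    return low_latency_path(device, sysfs_root).read_text().strip() == "1"


def set_low_latency(device, enabled, sysfs_root="/sys"):
    """Write '1' or '0' to the tunable (needs root on a real system)."""
    low_latency_path(device, sysfs_root).write_text("1" if enabled else "0")


if __name__ == "__main__":
    # "sda" is a placeholder; the file only exists while CFQ is the active scheduler.
    path = low_latency_path("sda")
    if path.exists():
        print("low_latency enabled:", get_low_latency("sda"))
    else:
        print("tunable not present at", path)
```

The `sysfs_root` parameter exists only so that the helpers can be tried against a scratch directory instead of the live `/sys` tree.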

[Figure: Benchmarks: Before...]
Axboe has also introduced a major rewrite of the writeback infrastructure, the fruit of several months' work, as a result of which each device is now dealt with by its own thread. This and other changes should significantly increase data throughput for writeback-intensive access scenarios and cause them to run more evenly. Axboe includes two benchmark graphics and various measurements in his commit comments to underline the point. More measurements are detailed in an email from Chris Mason. Some background information on the changes can be found in a presentation by Axboe (PDF). There is also an article on the recently merged blk-iopoll, which describes the NAPI-like approach to accessing storage devices that aims to increase maximum throughput by reducing IRQs.

[Figure: ...and after.]
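The idea behind a NAPI-like approach such as blk-iopoll can be illustrated with a toy model. The Python sketch below is not kernel code and makes simplifying assumptions: completions arrive in bursts, the first completion of a burst raises one interrupt, further interrupts stay masked while a poll routine reaps up to `budget` completions per pass, and interrupts are re-enabled once a pass reaps fewer than `budget`. The function name and the budget value are arbitrary choices for illustration.

```python
def count_interrupts(bursts, budget=32):
    """Compare IRQ counts for per-completion interrupts and budgeted polling.

    bursts: list with the number of completions pending in each burst.
    Returns (irqs_without_polling, irqs_with_polling, poll_passes).
    """
    irqs_without_polling = sum(bursts)  # one interrupt per completed request
    irqs_with_polling = 0
    poll_passes = 0
    for pending in bursts:
        if pending == 0:
            continue
        irqs_with_polling += 1  # a single IRQ starts polling; further IRQs are masked
        while True:
            reaped = min(pending, budget)
            pending -= reaped
            poll_passes += 1
            if reaped < budget:
                break  # queue drained: interrupts would be re-enabled here
    return irqs_without_polling, irqs_with_polling, poll_passes


if __name__ == "__main__":
    print(count_interrupts([100, 5, 64], budget=32))
```

In this model the number of interrupts depends on the number of bursts rather than the number of requests, which is the effect the approach is after under heavy load.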

Axboe's main git pull request lists a number of further changes in the block subsystem. Following a long discussion of its pros and cons, replication solution DRBD (Distributed Replicated Block Device) has not made it into Linux 2.6.32, but Torvalds has signalled his willingness to merge it into 2.6.33.

Libata, drivers, etc.

  • It will in future be possible to read certain information on AHCI capabilities – such as whether a port is hot-pluggable or is an eSATA connector – via sysfs. Userspace applications should be able to use this information to make better decisions on optimal behaviours, e.g. when configuring ALPM (Aggressive Link Power Management).
  • The kernel development team has merged the pata_atp867x driver for the ARTOP/Acard ATP867X PATA adapter and the pata_rdc driver for RDC PATA adapters. These now include support for AMD's SB900 southbridge, which, as things stand at present, still looks several months away from a release date.
  • A 1.4 MB patch for the bfa driver for Brocade FC and FCOE host adapters has been merged into the SCSI subsystem. Much more lightweight are the new be2iscsi driver for iSCSI functionality for ServerEngines' 10Gbps BladeEngine 2 storage adapter and pmcraid driver for PMC Sierra's MaxRAID series 6Gb/s SAS adapters. There have also been numerous changes to the SCSI subsystem's FCOE driver.
  • There have been a number of changes to the MD code and other kernel subsystems to improve the ability to offload calculations for RAID 6 to dedicated hardware. There have also been changes to the still fresh MD code for supporting the various options for modifying and converting software RAID arrays – e.g. for converting a RAID 1 to a RAID 5, to a RAID 6 and back again. Such conversions can be carried out by mdadm versions 3.1.x. The developer responsible for the kernel MD code and mdadm has withdrawn the first such version, but has indicated that 3.1.1 should be out shortly. He has also developed several enhancements to the kernel code responsible for modifying and converting RAID arrays, which should find their way into 2.6.33.
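A userspace tool could evaluate the new AHCI information along the following lines. This Python sketch assumes that the port command register is exposed as a hexadecimal string in an attribute named `ahci_port_cmd` below `/sys/class/scsi_host/hostN/`, and that bits 18 and 21 mark a hot-plug capable port and an external SATA port respectively, as in the AHCI specification's PxCMD register. The attribute location, the host name `host0` and the helper names are assumptions, not taken from the article.

```python
from pathlib import Path

# Bit positions assumed from the AHCI PxCMD register layout.
PORT_CMD_HPCP = 1 << 18  # hot plug capable port
PORT_CMD_ESP = 1 << 21   # external SATA port


def decode_port_cmd(value):
    """Extract the two capability flags discussed above from a PxCMD value."""
    return {
        "hot_pluggable": bool(value & PORT_CMD_HPCP),
        "esata": bool(value & PORT_CMD_ESP),
    }


def read_port_cmd(host, sysfs_root="/sys"):
    """Read the assumed ahci_port_cmd attribute and parse it as hexadecimal."""
    path = Path(sysfs_root) / "class" / "scsi_host" / host / "ahci_port_cmd"
    return int(path.read_text().strip(), 16)


if __name__ == "__main__":
    try:
        print(decode_port_cmd(read_port_cmd("host0")))  # "host0" is a placeholder
    except OSError as err:
        print("attribute not available:", err)
```

A power management tool might, for example, leave ALPM disabled on ports reported as hot-pluggable so that newly attached devices are still detected.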

Minor Gems

Many further minor, but by no means insignificant, changes can be found in the list below. Like many of the references in the text above, the links point to the relevant commits in the web front end of the Git branch that Linus Torvalds uses for maintaining the kernel sources. There, the commit comments and the patches themselves provide extensive further information on the respective changes.



Ext3, Ext4

For other articles on 2.6.32 and links to the rest of the "Coming in 2.6.32" series, see The H's Kernel Log - 2.6.32 Tracking page. (thl/c't).

