In association with heise online

27 January 2010, 18:01

Kernel Log: Coming in 2.6.33 (Part 2) - Storage

Kernel Log by Thorsten Leemhuis

Extended discard support means that Linux now supports ATA TRIM, which can increase SSD lifespan and throughput. New additions to the Linux kernel include HA solution DRBD and drivers for HP, LSI and VMware storage hardware. The new kernel version, expected in early March, also includes many minor improvements to the code for the Btrfs, Ext4 and ReiserFS file systems.

At the end of last week, Linus Torvalds released the fifth pre-release version of Linux 2.6.33, with the final release expected in four to five weeks' time. At this stage in the development cycle, it is mostly minor changes and fixes that find their way into the main development tree. Kernel hackers do, however, make occasional exceptions for drivers: Linux 2.6.33-rc5 adds a V4L/DVB driver to the kernel which supports the Mantis chipset used in a range of TV cards. The changes to the V4L/DVB subsystem in Linux 2.6.33 will, however, be the subject of a future article in the "Coming in 2.6.33" series.

Following on from the first article in the 2.6.33 series, which looked at networking changes, this article examines changes to the file systems and to the kernel's storage subsystem.

Trimming

For several months, some sections of the kernel have included a rudimentary discard infrastructure, which allows drivers for mass storage adaptors to learn which storage areas on a disk have become free, for example as a result of deleting a file or formatting a partition. This infrastructure has been revised and extended in 2.6.33. As a result, the Libata subsystem now also supports discards and can forward information on free storage areas to mass storage devices using the ATA TRIM command. This is especially useful for SSDs (Solid State Drives), as information on free storage areas allows the drive's internal controller to optimise its garbage collection. This increases both SSD performance and SSD lifespan.

For the discard infrastructure to achieve maximum impact, the file systems need to pass information on freed areas to the storage subsystem. The Btrfs file system has been able to do this since Linux 2.6.32 and appropriate code has now also been added for Ext4. Since the Ext4 code has not yet been fully tested, the function remains deactivated by default for now. The new discard support in the code for the FAT file system is also optional.
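Because discard support in the file systems is opt-in, it can be handy to check which mounted file systems actually have it switched on. The following is a minimal sketch rather than an official tool: it assumes that the option appears as `discard` in the options column of a mount table in the format used by /proc/mounts, and the function name and the sample table are made up for illustration.

```python
def mounts_with_discard(mount_table, fstypes=("ext4", "btrfs", "vfat")):
    """Return the mount points whose option list contains 'discard'.

    mount_table is text in /proc/mounts layout:
    device  mount-point  fstype  options  dump  pass
    """
    result = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        # Compare whole options so that e.g. 'nodiscard' does not match.
        if fstype in fstypes and "discard" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample table, not taken from a real system.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,discard 0 0
/dev/sda2 /home ext4 rw,relatime 0 0
/dev/sdb1 /data btrfs rw,discard 0 0
"""

print(mounts_with_discard(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mounts instead of the sample string.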

Replicated

After leaving DRBD (Distributed Replicated Block Device), largely developed by Vienna-based company Linbit, out in the cold in 2.6.32, kernel hackers have finally merged the replication solution, which is used predominantly in high availability environments, into Linux 2.6.33. DRBD can be roughly understood as a network-based RAID 1 device: the drive or drives on one system, designated as the master, are mirrored in real time on a slave system. Should the master fail, the slave takes over with no downtime. To ensure that data remains synchronised at all times, the master considers a write to be complete only when the slave has also completed it. A detailed explanation of DRBD can be found in an LWN.net article and in the documentation on the DRBD website.
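The "complete only when the slave has also completed the write" rule can be illustrated with a toy model. This is a sketch of the idea only, not DRBD's code or interface: the class and attribute names are invented, both "systems" are plain in-memory lists, and a write to an unreachable slave is simply reported as incomplete, which is a simplification.

```python
class ToyReplicatedDevice:
    """Toy model of synchronous mirroring between a master and a slave."""

    def __init__(self, num_blocks):
        self.master = [b""] * num_blocks
        self.slave = [b""] * num_blocks
        self.slave_online = True  # flip to False to simulate a lost peer

    def _write_slave(self, block, data):
        """Store the block on the slave; return True as acknowledgement."""
        if not self.slave_online:
            return False
        self.slave[block] = data
        return True

    def write(self, block, data):
        """Write locally, then report completion only if the slave acked."""
        self.master[block] = data
        return self._write_slave(block, data)

    def in_sync(self):
        """True if the slave holds exactly the same data as the master."""
        return self.master == self.slave


dev = ToyReplicatedDevice(num_blocks=4)
print(dev.write(1, b"payload"), dev.in_sync())
```

Because the caller only sees a completed write once both copies exist, the slave can take over at any moment without losing acknowledged data; the price is that every write waits for a network round trip.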

Over recent months and years, a number of groups have worked on solutions for limiting the maximum amount of data which individual processes or groups of processes can write to or read from drives. These solutions hook into various points within the kernel. The 'blkio controller cgroup interface' (blkio for short), which links into the CFQ (Completely Fair Queuing) I/O scheduler, has come out with its nose in front. This does not represent the failure of the other approaches, however; rather, it is intended to offer a foundation for further enhancements and functions of the kind made possible by some of the competing solutions. Background information on this topic can be found in a LWN.net article and in the documentation for the blkio framework.
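As a rough illustration of what a weight-based controller of this kind does, the sketch below computes the share of disk time each group would receive if time were divided strictly in proportion to per-group weights. It is an idealised model, not the kernel's algorithm: it assumes that every group keeps the disk busy all the time, the permitted weight range of 100 to 1000 is an assumption based on the blkio documentation of the time, and the group names are invented.

```python
def disk_time_shares(weights, min_weight=100, max_weight=1000):
    """Map each group name to its proportional fraction of disk time."""
    for name, weight in weights.items():
        if not min_weight <= weight <= max_weight:
            raise ValueError(
                f"weight for {name!r} must be between "
                f"{min_weight} and {max_weight}, got {weight}"
            )
    total = sum(weights.values())
    # Each group's share is its weight divided by the sum of all weights.
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical groups: a backup job and a database.
print(disk_time_shares({"backup": 100, "database": 300}))
```

In practice an idle group does not use its share, so the remaining groups would receive more than this simple calculation suggests.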

Optimised

Developers have removed the anticipatory I/O scheduler (AS), which, according to the commit comments, offers only a subset of the functions offered by the CFQ scheduler. The latter, which has long been the standard in many distributions, is now also described as being suitable for both desktop and server environments. As with the process scheduler, almost every Linux version contains numerous changes optimising the CFQ I/O scheduler for specific application scenarios. Details can be found in the links in the 'Minor gems' section at the end of this article and in the main git pull request from block subsystem maintainer Jens Axboe.

Improvements have been made to the code for migrating software RAIDs managed using mdadm to a different RAID level. The MD subsystem now also supports write barriers. These ensure that data and file system journals are written in the sequence expected by other parts of the kernel. This should ensure better file system integrity in the event of a crash, but can palpably reduce throughput, as MD maintainer Neil Brown notes in his main git pull request. Support for write barriers in the device mapper (DM) has also been extended (git pull request). The device mapper now also offers a 'merge target' (e.g. 1, 2), which can restore systems to a previous snapshot following, for example, a problematic system update (LWN.net article).

Drivers

The IDE subsystem drivers are now officially classed as deprecated. Users are advised to switch to the Libata subsystem PATA drivers, which have long been in the kernel and are no longer classed as experimental (1, 2). In 2.6.33, these contain numerous minor enhancements and corrections, some originating from Bartlomiej Zolnierkiewicz, who, until a few months ago, maintained the IDE subsystem.

The SCSI subsystem contains two new drivers: 3w-sas for the LSI 3ware 9750 and vmw_pvscsi for the virtual hardware seen by guest systems under some VMware hypervisors. New Smart Array controllers from HP can now be addressed both by the block subsystem's cciss driver, which has seen various enhancements in 2.6.33, and by the new hpsa driver. The latter, being part of the SCSI subsystem, provides a standard device (/dev/sdx) for access, like all other SCSI and Libata drivers. Other new entries in this kernel version include the pm8001 driver for SAS/SATA HBAs containing the PMC Sierra SPC 8001 chip.

Miscellaneous

A few further file system and storage code related changes:

  • According to main developer Chris Mason's main git pull request, Btrfs, the experimental 'next generation file system for Linux'-elect, saw primarily minor enhancements and fixes in 2.6.33.
  • Ext4, which is based on the Ext2 and Ext3 file system code, can now mount Ext2 and Ext3 file systems. This allows environments which require the smallest possible kernel image to save a little space.
  • The distributed file system Ceph was put forward for merging into 2.6.33. Torvalds has, however, chosen to omit it for now, explaining that this was partly due to time constraints and partly because there was not enough noise from kernel developers and distributors in favour of it. Background information can be found in this article on LWN.net.
  • Although the ReiserFS code has long lacked an official maintainer, one developer has still taken the trouble to significantly reduce use of the big kernel lock (BKL) in the ReiserFS code. This should make the file system more scalable and in some cases a little more fleet of foot.
  • Some of the more important changes in nilfs2 are described in the git pull request from nilfs2 maintainer Ryusuke Konishi.
  • The Virtual File System (VFS) layer now correctly implements O_SYNC. Once again, further details can be found on LWN.net.
  • There has been major restructuring work on the XFS file system code to replace XFS' own tracing code with code which uses the kernel's own tracing structure. This is itself relatively new, but has been developed substantially over the last year.
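For readers unfamiliar with the O_SYNC flag mentioned in the list above: a program sets it when opening a file to request that each write returns only once the data has reached the storage device. The following minimal sketch shows how an application might request this from user space; the function name and file name are made up for illustration, and the code demonstrates the flag rather than the kernel change itself.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write data to path with O_SYNC set; return the bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        # With O_SYNC, os.write() should not return before the data
        # has been handed to the storage device.
        return os.write(fd, data)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "journal.bin")  # hypothetical file name
    print(write_synchronously(target, b"committed\n"))
```

Synchronous writes of this kind are considerably slower than buffered ones, so applications typically reserve them for data that must survive a crash.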

Minor gems

Many further minor, but by no means insignificant, changes can be found in the list below, which contains the commit headers referring to the respective change. Like many of the references in the text above, the links point to the relevant commit in the web front end of the Git branch for the kernel sources maintained by Linus Torvalds at kernel.org. The commit comments and the patches themselves provide extensive further information on the respective changes.

File systems

Btrfs

Ext[234]

Various others

Storage

Block

DM

Libata

MD

MFD/MMC/MTD

SCSI

Various others

For other articles on 2.6.33 and links to the rest of the "Coming in 2.6.33" series, see The H's Kernel Log - 2.6.33 Tracking page. (thl)

(crve)

Permalink: http://h-online.com/-914669