Kernel Log: Coming in 2.6.39 (Part 2) – Storage and file systems
by Thorsten Leemhuis
Various internal changes to the block layer that were specifically mentioned by Linus Torvalds are designed to enhance performance and scalability. The Ext4 file system is also said to offer improvements in this respect. Still classified as experimental, Btrfs now offers Batched Discard functionality, and LIO (Linux-Iscsi.org) includes a loop-back function.
At the end of last week, Andi Kleen published a series of patches that fixes a performance problem between the Virtual File System (VFS) and the security infrastructure; the problem was an unwanted side effect of the VFS optimisations that were merged in Linux 2.6.38. Torvalds showed great interest in the patches and merged one of them into the main development tree; a second fix, which is based on Kleen's three patches and addresses the problem in the SELinux code base, followed on Monday. It is still undecided whether these or similar patches will make it into a 2.6.38 stable kernel, something Torvalds mentioned at one point that he was considering.
Torvalds has not yet released a fifth release candidate for Linux 2.6.39, but it should emerge in the next few days, as RC4 is already more than a week old.
The Kernel Log is taking the ongoing development of kernel version 2.6.39 as an opportunity to continue its "Coming in 2.6.39" mini series with a discussion of the kernel's storage infrastructure and file systems. The series started with an overview of the changes to the kernel's network drivers and infrastructure; in the coming weeks, further articles will discuss the changes in such areas as the kernel's graphics drivers, architecture code, infrastructure and other hardware drivers.
Plugging in storage media
Jens Axboe has done considerable internal restructuring work in the block subsystem. The changes move some of the tasks involved in writing data away from a device-specific buffer and closer to the code that needs to write the data. The measure is designed to improve scalability, and therefore the kernel's performance with the fastest storage media currently available.
This "new block device plugging model" was the only change Linus Torvalds specifically pointed out in his release email for the first release candidate of 2.6.39; Torvalds said that the approach avoids locks in busy code paths, cleans up the code and "should generally be a really good idea". Axboe provides some background information about his motivation and about the functional details in his main Git-Pull request; he gives a general description of the block layer's internal structure in "Explicit block device plugging" on LWN.net.
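The locking benefit Torvalds refers to can be pictured with a small toy model. The following Python sketch is not kernel code and does not reproduce Axboe's implementation; the names DeviceQueue, Plug, write_unplugged and write_plugged are hypothetical and exist only for this illustration. It assumes a single lock per device queue and shows that collecting requests in a list private to the submitter, and handing them over in one batch, takes the shared lock once instead of once per request.

```python
import threading


class DeviceQueue:
    """Toy stand-in for a per-device request queue guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = []
        self.lock_acquisitions = 0  # how often the shared lock was taken

    def add(self, batch):
        # Each call takes the shared lock once, whatever the batch size.
        with self._lock:
            self.lock_acquisitions += 1
            self.requests.extend(batch)


class Plug:
    """Toy per-submitter plug: collects requests privately, flushes once."""

    def __init__(self, queue):
        self._queue = queue
        self._pending = []  # private to the submitter, so no lock is needed

    def submit(self, request):
        self._pending.append(request)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # "Unplug": hand the whole batch to the device queue in one step.
        if self._pending:
            self._queue.add(self._pending)
            self._pending = []
        return False


def write_unplugged(queue, sectors):
    """Submit every request directly to the shared device queue."""
    for sector in sectors:
        queue.add([sector])


def write_plugged(queue, sectors):
    """Submit the same requests through a submitter-local plug."""
    with Plug(queue) as plug:
        for sector in sectors:
            plug.submit(sector)
```

In this model, both functions leave the same requests in the queue in the same order; only the number of times the shared lock is taken differs, which is the part that matters on busy code paths.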
In his RC1 release email, Torvalds also mentioned that a flaw in the new device plugging code temporarily caused the kernel to "eat XFS file systems for breakfast". Some further problems still existed even after the release of RC1, but they were less serious – most of them were fixed with RC4. This demonstrates that data losses aren't exactly unlikely when testing the current development release of Linux during the merge window. However, the kernel hackers typically fix such serious problems very quickly; in the stabilisation phase, and particularly after the second release candidate, such bugs are very rare, as the kernel hackers make an effort not to scare away the already small number of testers.
In brief: Block layer
- The kernel hackers have extended the LIO (Linux-Iscsi.org) target implementation they integrated with 2.6.38 to include the tcm_loop module; this module makes local SPC-4 SCSI emulations possible with arbitrary raw devices. Another change allows users to query various general details via sysfs, as well as information that is particularly relevant for statistical evaluations.
- Offering functionality that is similar to RAID 0, the "striped" Device Mapper target now supports "merge methods", which will, in certain situations, enhance performance with XFS and Ext4. The Device Mapper (DM) now offers the flakey target, which functions like the linear target but periodically returns errors for testing purposes.
- A new addition is the bnx2fc driver that delegates various FCoE (Fibre Channel over Ethernet) tasks to Broadcom's NetXtreme II 57712 FCoE controller.
- The aacraid driver now supports a new, "SRC-based" family of controllers from PMC-Sierra which offer a "0x28b" interface; according to a patch description, these are offered by Series-6 chips that are used on 6 Gb/s RAID controllers with "Advanced ROC (RAID-on-chip)".
- The SCSI subsystem now offers improved support for the Logical Block Provisioning interfaces defined in the T10 SBC-3 specification (SBC3r26). It can inform SSDs and network storage solutions with Thin Provisioning about storage areas that have become free via the SCSI WRITE SAME command and its UNMAP bit; some background information on this can be found in the "Solid-State Disk Deployment Guidelines" in the RHEL 6.1 Storage Administration Guide.
- A driver that allows MTD devices to be used for swapping has been added to the subsystem for Memory Technology Devices (MTD), which is mainly used in the embedded area.
- The developers have made various changes to the DRBD replication solution that bring the kernel implementation up to DRBD 8.3.10 level in 2.6.39.
- The ahci and ata_piix drivers now support Intel's "Panther Point" chipsets in the 2.6.39 kernel. Intel is expected to introduce these chipsets together with new processors in early 2012.
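The behaviour of the flakey Device Mapper target mentioned in the list above can be pictured with a short simulation. The Python sketch below is only a model and not the kernel code: the class FlakeyTarget and its parameters are hypothetical, and it assumes that "periodically" means the target alternates between a fixed "up" interval, in which it maps sectors like a linear target, and a fixed "down" interval, in which every request fails. The current time is passed in explicitly so that the behaviour is deterministic.

```python
class FlakeyTarget:
    """Toy model: linear mapping that fails during periodic down intervals."""

    def __init__(self, backing, offset, up_interval, down_interval):
        self.backing = backing              # list standing in for a device
        self.offset = offset                # start sector on the backing device
        self.up_interval = up_interval      # seconds of normal operation
        self.down_interval = down_interval  # seconds in which requests fail

    def is_up(self, now):
        # Position inside the repeating up/down cycle, counted from time 0.
        phase = now % (self.up_interval + self.down_interval)
        return phase < self.up_interval

    def read(self, sector, now):
        if not self.is_up(now):
            # Simulated I/O error, as a test of error handling would expect.
            raise OSError("simulated I/O error at t=%s" % now)
        # Linear mapping: shift the sector by the configured offset.
        return self.backing[self.offset + sector]


if __name__ == "__main__":
    device = list(range(100))
    target = FlakeyTarget(device, offset=10, up_interval=3, down_interval=2)
    for t in range(6):
        try:
            print(t, target.read(5, now=t))
        except OSError as err:
            print(t, "error:", err)
```

A model like this is useful for exercising the error paths of code that sits on top of a block device, which is the testing purpose the target is intended for.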