Trimming, DRBD, Block Layer, Virtualisation and Tracing

In association with heise online

Trimming

For several months, some sections of the kernel have included a rudimentary discard infrastructure, which allows drivers for mass storage adaptors to determine whether storage areas on a disk are free, for example as a result of deleting a file or formatting a partition. This infrastructure has been revised and extended in 2.6.33. The result is that the Libata subsystem now also supports discards and can forward information on free storage areas to mass storage devices using the ATA TRIM command. This is especially useful for SSDs (Solid State Drives), because knowing which storage areas are free allows the drive's internal controller to optimise its garbage collection. This increases both SSD performance and SSD lifespan.

For the discard infrastructure to achieve maximum impact, the file systems need to pass the information on free areas to the storage subsystem. The Btrfs file system has been able to do this since Linux 2.6.32, and appropriate code has now also been added for Ext4. Since the Ext4 code has not yet been fully tested, this function will remain deactivated by default for now. The new discard support in the code for the FAT file system is also optional.
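To illustrate the kind of information that travels from the file system towards the device, the following Python sketch merges freed areas into as few contiguous ranges as possible before they would be reported. It is a toy model under stated assumptions: the function name coalesce_discards and the (start sector, sector count) representation are invented for this example and do not reflect how Btrfs, Ext4 or the block layer actually batch discards.

```python
from typing import List, Tuple

# Hypothetical representation of a freed area: (start_sector, sector_count).
Extent = Tuple[int, int]


def coalesce_discards(freed: List[Extent]) -> List[Extent]:
    """Merge overlapping or adjacent freed extents into larger ranges.

    Illustrative only: real file systems and the block layer use their
    own data structures and policies for this.
    """
    merged: List[Extent] = []
    for start, count in sorted(freed):
        if count <= 0:
            continue  # ignore empty or invalid extents
        if merged:
            last_start, last_count = merged[-1]
            last_end = last_start + last_count
            if start <= last_end:
                # Extent touches or overlaps the previous one: extend it.
                new_end = max(last_end, start + count)
                merged[-1] = (last_start, new_end - last_start)
                continue
        merged.append((start, count))
    return merged


if __name__ == "__main__":
    freed = [(2048, 8), (2056, 8), (4096, 16), (2060, 12)]
    for start, count in coalesce_discards(freed):
        print(f"discard sectors {start}..{start + count - 1}")
```

Fewer, larger ranges mean fewer commands for the device to process, which is one plausible reason for a file system to collect freed areas before reporting them.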

File systems, Block Layer and RAID

After leaving DRBD (Distributed Replicated Block Device) out in the cold in 2.6.32, kernel hackers have finally merged the replication solution, used predominantly in high availability environments, into Linux 2.6.33. DRBD can be roughly understood as a network-based RAID 1 device. The drive or drives on one system, designated as the master, are mirrored on a slave system in real time. Should the master fail, the slave takes over with no downtime. In order to ensure that data remains synchronised at all times, the master considers a write access to be complete only when the slave has also completed the write. A detailed explanation of DRBD can be found in an LWN.net article and in the documentation on the DRBD website.
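The write rule described above can be sketched as a small Python model. This is a simplification for illustration, not DRBD's implementation: the classes Node and ReplicatedDevice are hypothetical, the "network" is a plain method call, and the handling of an unreachable peer (here simply an exception) is far more involved in practice.

```python
class Node:
    """Toy block device: maps block numbers to data (hypothetical)."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.blocks = {}

    def write(self, block, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[block] = data
        return True  # acknowledgement that the block has been stored


class ReplicatedDevice:
    """Reports a write as complete only after both nodes acknowledged it."""

    def __init__(self, master, slave):
        self.master = master
        self.slave = slave

    def write(self, block, data):
        self.master.write(block, data)
        # The caller is not told "done" until the slave has also stored
        # the block; if the slave is unreachable, the exception propagates
        # and the write is never reported as complete.
        self.slave.write(block, data)
        return True


if __name__ == "__main__":
    master, slave = Node("master"), Node("slave")
    device = ReplicatedDevice(master, slave)
    device.write(0, b"superblock")
    print(master.blocks == slave.blocks)
```

Because completion is only signalled after the second acknowledgement, every write the application believes to be finished is present on both systems, which is what allows the slave to take over without losing acknowledged data.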

About the source code management system

Many of the links in this article point to the relevant commits in the web front end of Linus Torvalds' Git source code management system for Linux, because these commits tend to contain a lot more information about the respective changes. The commit comment in the mid section of the web page displayed by the Git web front end is often a particularly helpful source of further information. This is where the author of a patch usually describes the background and intended effects of the changes.

The bottom section of the Git web front end lists the files that are affected by the patch. The "diff" link behind each file name shows how the patch modifies the respective file; if you want to view the complete patch in its raw form, click on the "commitdiff" link. Even if you don't have any programming skills, the patches are often a good source of information, because they also contain changes to the documentation and comments within the code.

Developers have removed the anticipatory I/O scheduler (AS), which, according to the commit comments, offers only a subset of the functions offered by the CFQ scheduler. The latter, which has long been the standard in many distributions, is now also described as being suitable for desktop and server environments.

Improvements have been made to the code for migrating software RAIDs managed using mdadm to a different level. The MD subsystem now also supports write barriers. These ensure that data and file system journals are written in the sequence expected by other parts of the kernel. This should ensure better file system integrity in the event of a crash, but can noticeably reduce throughput, as MD maintainer Neil Brown notes in his main git pull request. Support for write barriers in the device mapper (DM) has also been extended (git pull request). The device mapper now also offers a 'merge target' (e.g. 1, 2), which can restore systems to a previous snapshot following, for example, a problematic system update (LWN.net article).
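The ordering guarantee that write barriers provide can be illustrated with a short Python model. It assumes a device that is free to reorder queued writes, except that nothing may move across a barrier. The BARRIER marker and the dispatch function are inventions for this sketch; in the kernel a barrier is a property of a request rather than a separate queue entry, and the mechanics differ.

```python
import random

# Hypothetical marker separating groups of writes in the queue.
BARRIER = object()


def dispatch(requests, rng):
    """Return one order in which the device might complete the writes.

    Writes between two barriers may be reordered freely (modelled by a
    shuffle); no write may overtake, or be overtaken across, a barrier.
    """
    completed = []
    segment = []
    for request in requests:
        if request is BARRIER:
            rng.shuffle(segment)       # free reordering inside a segment
            completed.extend(segment)  # everything before the barrier first
            segment = []
        else:
            segment.append(request)
    rng.shuffle(segment)
    completed.extend(segment)
    return completed


if __name__ == "__main__":
    queue = ["journal-1", "journal-2", "journal-3", BARRIER,
             "commit", BARRIER, "data-1", "data-2"]
    print(dispatch(queue, random.Random()))
```

Whatever order the shuffle produces, the commit record only reaches the disk after all three journal blocks, which is the property a journalling file system relies on after a crash. The price is that the device can no longer reorder across these points, which is where the throughput reduction comes from.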

Virtualisation and Tracing

Numerous improvements have been made to the KVM (Kernel-based Virtual Machine) virtualisation solution, which is mainly developed by Red Hat. Some changes to the KVM code, in combination with modifications to the x86 code, reduce the management effort required for context switches, which is intended to improve performance (1, 2). KVM now only claims the virtualisation functions of modern CPUs when it actually needs them. Furthermore, the kernel can now move memory areas that have been merged by the KSM (Kernel Samepage Merging) feature, which was created in the KVM area and introduced in Linux 2.6.32, into swap memory if required.

The developers have made further changes to the tracing infrastructure around Ftrace and to the performance events previously called performance counters (1, 2, 3). The new "kprobe-based event tracer" allows probe points to be added to almost any kernel area at run-time (documentation); users can access this infrastructure via the "perf" program, which is included in Linux and offers a new "probe" subcommand. Several kernel changes allow multiple processes to be monitored simultaneously or improve Big Kernel Lock (BKL) diagnostics. The analysed data can now be filtered via regular expressions; the kernel developers have also extended the perf command's Perl script support (for example 1, 2, 3, documentation). Its "bench" subcommand, which is new in 2.6.33, offers several speed measuring functions (for example 1, 2, 3, 4, documentation).
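Besides "perf probe", the kprobe-based event tracer can be driven by writing probe definitions to a control file in debugfs. The Python sketch below builds such a definition string in the form described in the kernel's kprobetrace documentation (p or r, an optional group and event name, the symbol, then optional fetch arguments). The helper names kprobe_definition and register_probe are hypothetical, the path assumes debugfs is mounted at /sys/kernel/debug, and the exact fetch argument syntax depends on the kernel version, so the documentation for the running kernel remains the authority.

```python
# Assumed location of the control file; requires debugfs to be mounted
# at /sys/kernel/debug and a kernel built with the kprobe event tracer.
KPROBE_EVENTS = "/sys/kernel/debug/tracing/kprobe_events"


def kprobe_definition(event, symbol, fetch_args=(), is_return=False, group=None):
    """Build a probe definition line (hypothetical helper).

    'p' places the probe at the symbol, 'r' at the function's return.
    """
    kind = "r" if is_return else "p"
    name = f"{group}/{event}" if group else event
    return " ".join([f"{kind}:{name}", symbol, *fetch_args])


def register_probe(definition, path=KPROBE_EVENTS):
    """Append the definition to the control file (needs root privileges).

    Append mode is used so that existing probe definitions are kept.
    """
    with open(path, "a") as control:
        control.write(definition + "\n")


if __name__ == "__main__":
    # Only prints the definitions; nothing is written to the kernel.
    print(kprobe_definition("myprobe", "do_sys_open"))
    print(kprobe_definition("myretprobe", "do_sys_open",
                            fetch_args=("$retval",), is_return=True))
```

Keeping the string construction separate from the write makes it possible to inspect a definition before handing it to the kernel, which is useful because a malformed line is rejected when written.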