27 April 2009, 16:11

Kernel Log: What's coming in 2.6.30 - File systems: New and revamped file systems

The patches adopted in Linux 2.6.30 introduce many significant changes affecting data security and Ext3 and Ext4 performance. Support for the EXOFS and NILFS2 file systems is new, as is the cache for the AFS and NFS network file systems. There are also a few fixes for the almost forgotten ReiserFS file system.

Released mid-week, as is normal for the second phase of the development cycle, the third pre-release version of Linux 2.6.30 included mostly minor enhancements and fixes, although it also contained two code restructurings.

The interminable discussions of the Ext3 and Ext4 file systems and the way they interact with other kernel subsystems have largely subsided. The H Open has reported on the early stages of these discussions – the occasionally abrasive discussion on the LKML (Linux Kernel Mailing List) continued for a further week, with a total of 650 emails, not counting other threads triggered by the discussion.

The debate has been far from fruitless and has led to the development of various modifications which Torvalds has integrated, in some cases immediately, into the main development tree leading to Linux 2.6.30. This part of the "What's coming in 2.6.30" Kernel Log series gives an overview of these and many other changes to the code for the various file systems supported by Linux.

Access time

At a relatively early stage of the above discussion, an old, previously much discussed issue affecting all file systems once more reared its head – when and how frequently should the kernel update a file's atime (last access time)? This information is of importance to only a handful of applications, and each update of the atime requires a write operation. This not only costs time, it is also undesirable for SSDs and for laptops running on battery.

Spurred on by this, Matthew Garrett has produced a number of patches which change the relative atime (relatime) behaviour: the kernel now updates the last access time of a file that has not been modified since it was last read no more than once a day. It took Linus Torvalds just a few hours to make this one of the first patches to be incorporated into the main development tree following the release of 2.6.29. A further patch from Garrett makes relatime the default. The old-style behaviour can be restored using the strictatime mount option.
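
The relatime decision can be sketched in a few lines. The following is a simplified user-space model of the rule, not the kernel's code; the function name and the use of plain second counts as timestamps are assumptions made for illustration.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at time `now` should refresh atime under relatime.

    All arguments are timestamps in seconds (an assumption of this sketch).
    """
    # The file content was modified at or after the last recorded access.
    if mtime >= atime:
        return True
    # The inode metadata changed at or after the last recorded access.
    if ctime >= atime:
        return True
    # The recorded access time is at least one day old.
    if now - atime >= DAY_SECONDS:
        return True
    return False
```

With strictatime the check is skipped and every read access counts as needing an update.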

But even these changes, which many kernel hackers have long been calling for, did not satisfy everyone – Valerie Aurora (formerly Henson) has listed various criticisms on her blog. Expect this one to run and run.

Latencies

A user's report of long latencies when applications call fsync() to flush the Ext3 write buffer while the kernel is working through large read operations prompted a discussion on LKML. The problem has been known about for several months, but the available workarounds were somewhat controversial.

Ext[2/3/4] file system developer Ted Ts'o put the blame squarely on application developers who, he opined, could save the file system a deal of work with a little more prudence. Other kernel developers disagreed. Ts'o has, however, already developed a number of less controversial patches for Ext3 and Ext4 which, according to his measurements, reduce latency and which have subsequently been incorporated into 2.6.30.

Subsequent tests by Torvalds, however, determined that some of the blame for latencies must be placed on the block layer's CFQ scheduler. Jens Axboe analysed the problem and quickly developed more changes which further reduce latencies. This will in some cases increase the speed of desktop systems not just measurably, but tangibly.
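
The fsync() latencies discussed in this section can be observed from user space. The following is a minimal sketch, assuming a writable directory on the file system of interest; the function name and the 4 KB default payload are illustrative choices, and the result depends heavily on what else the system is doing at the time.

```python
import os
import tempfile
import time


def time_fsync(directory, payload=b"x" * 4096):
    """Write payload to a temporary file in directory; return seconds spent in fsync()."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        os.fsync(fd)  # blocks until the kernel reports the data as written out
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling time_fsync() repeatedly while another process generates heavy I/O on the same file system gives a rough impression of the delays users were reporting.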

Latency II

Debate and details

This article only describes the most critical points and outcomes of the discussions on Ext3 and Ext4 mentioned above and their interaction with other kernel subsystems such as the block layer. Linux Weekly News (LWN.net) has taken a more detailed look at the discussion and the changes arising from it in the articles That massive filesystem thread and Solving the ext3 latency problem.

The articles "Linux Storage and Filesystem workshop" Day 1 and Day 2 also in part explore the issues discussed. The article ext4 and data loss looks at the problem of potential data loss in Ext4.

Another major factor in the problem of latencies is Ext3's habit of mounting the file system with 'data=ordered' by default. Ted Ts'o has even publicly expressed regret at having decided several years ago to make this mode the default.

Though it initially looked as if the debate would be fruitless, Torvalds adopted some of Ts'o's patches a few days later. These included one patch through which the kernel mounts Ext3 file systems with 'data=writeback' unless the user explicitly states otherwise during kernel configuration or mounting. This should improve performance, but increases the risk of data loss in the event of a crash or if the computer is turned off without shutting down. There is also a risk that data from previously deleted files belonging to other users could find its way into new files incompletely written to disk before a crash.
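
With the default changing, it can be useful to check which journalling mode a file system is actually using. The following sketch extracts the data= option from a line in the format used by /proc/mounts. The function name is hypothetical, and whether the kernel lists the data= option there at all depends on the file system and kernel version, so a result of None does not prove that a particular mode is in use.

```python
def journal_data_mode(mounts_line):
    """Return the value of the data= option in a /proc/mounts-style line, or None."""
    fields = mounts_line.split()
    # Expected fields: device, mount point, fs type, options, dump, pass
    if len(fields) < 4:
        return None
    for option in fields[3].split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None
```

For a hypothetical line such as "/dev/sda1 / ext3 rw,relatime,data=writeback 0 0" the function returns "writeback".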

The 'data=guarded' mode developed by chief Btrfs developer Chris Mason should resolve some of these problems. Two of the enhancements coded as part of this development have already been incorporated into the main development tree. The rest have been put on hold, with the development cycle already entering the stabilisation phase.

Data security

In the tumult of the discussion, the risk of data loss in Ext4 as a result of delayed allocation once more reared its ugly head. This risk should be significantly reduced by a number of patches for the Ext4 code which have now, as planned, made their way into the main development tree. The changes do, however, have a negative effect on performance in certain situations.
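
On the application side, one widely recommended way of replacing a file without risking an empty or truncated result after a crash is to write the new content to a temporary file, flush it with fsync() and only then rename it over the original. The following is a minimal sketch of that pattern, not something taken from the kernel patches; the function name is hypothetical, and the sketch does not fsync the containing directory, which stricter durability requirements may call for.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace the file at path with data (bytes) via temp file, fsync and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays
    # within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data out
        os.replace(tmp_path, path)  # atomically swap in the new content
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The fsync() call is exactly the step that costs time on Ext3, which is why this pattern sits at the centre of the trade-off between latency and data security.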

The discussion on the risk of data loss led to a debate on precisely what guarantees a file system should be offering anyway. This led to the question of whether and how kernel and file systems should ensure that data does not just end up in a disk's write cache, but actually gets written in the correct sequence. This and other performance tuning questions led to a further discussion on where the role of the kernel developers in configuration stops and where fine tuning issues are better left to the Linux distributors.

Two new file systems

Following the adoption of Btrfs and SquashFS in Linux 2.6.29, the kernel development team have once more integrated two new file systems into 2.6.30 in the form of NILFS2 and EXOFS.

NILFS2 (New Implementation of a Log-structured Filesystem Version 2) is a log-structured file system (LFS) with continuous snapshotting, optimised for the needs of solid state discs (SSDs). A detailed description of how it works can be found on the NILFS2 website and in the kernel documentation on NILFS2. Further details can be found in a presentation (PDF) given as part of the Linux Storage & File system Workshop 2008 (LSF'08) in February, which includes a comparison between NILFS2 and Btrfs, Ext2 through 4, ReiserFS and XFS when running on an SSD. The presentation by Dongjun Shin, which is already somewhat long in the tooth, also takes a close look at some of the particularities of file systems for SSDs.

EXOFS stands for Extended Object File System and used to be known as OSDFS (Object-Based Storage Devices File System). As the old name suggests, it is intended for the somewhat exotic OSDs (object-based storage devices), which will be supported by the SCSI subsystem for the first time in 2.6.30. Users wanting further information on this kind of storage and the file system can find details in an article on OSDs by Sun, the kernel documentation on EXOFS, the EXOFS developer website and an LWN.net article on EXOFS/OSDFS.

Cache for the network, infusion for Btrfs

After several years of development, the kernel development team have now adopted the FS-Cache patch developed primarily by Red Hat developer David Howells (kernel documentation). This extension allows a file system cache to be set up to reduce network traffic when using network file systems such as AFS and NFS. This is, for instance, of interest for thin clients with no hard drive or flash media which obtain their root file system and all other data over a network.

The Btrfs development team have also been busy and have enhanced the file system code to cope better with 4k stacks – further improvements along these lines remain on the to-do list. There are also enhancements to improve write performance in general and for SSDs.

ReiserFS - forgotten but not gone

The kernel's ReiserFS code is still officially supported, but has long been without an official maintainer. Consequently, in recent months there have been only minor changes to ensure that the once popular file system keeps working.

Novell developer Jeff Mahoney has now added various patches developed as part of SLED/SLES, some of which are already more than two years old. They should resolve irregularities or bugs in ReiserFS (also referred to as Reiser3). According to Mahoney's Git pull request, after incorporation of these patches ReiserFS should be considered to be in "deep maintenance-only mode". As part of the discussions on Git pull requests, Frederic Weisbecker indicated that he is working on changes to reduce the use of the Big Kernel Lock (BKL) in ReiserFS.

Minor gems

The kernel development team have extended DFS support in CIFS to support access to remote servers.

The changes described are just some of the more significant changes recently undertaken by kernel hackers on the code for the various file systems. Numerous further changes can be found in the list of commit headers from the main development tree below. The links take you directly to the changes in the main development tree web interface, where the commit comments and the patches themselves provide further information on these perhaps less major, but by no means insignificant, changes.

Further background and information about developments in the Linux kernel and its environment can also be found in previous issues of the Kernel Log at The H Open Source:

Older Kernel Logs can be found in the archives or by using the search function at The H Open Source.

(thl/c't)

File systems:

Relevant Git-Pull-Requests