Kernel Log: What's coming in 2.6.30 - File systems: New and revamped file systems
The patches adopted for Linux 2.6.30 introduce many significant changes affecting data safety and the performance of Ext3 and Ext4. Support for the EXOFS and NILFS2 file systems is new, as is a local cache (FS-Cache) for the AFS and NFS network file systems. There are also a few fixes for the almost forgotten ReiserFS file system.
Released mid-week, as is normal for the second phase of the development cycle, the third pre-release version of Linux 2.6.30 included mostly minor enhancements and fixes, although there were two code restructures.
The interminable discussions of the Ext3 and Ext4 file systems and the way they interact with other kernel subsystems have largely subsided. The H Open has reported on the early stages of these discussions – the occasionally abrasive discussion on the LKML (Linux Kernel Mailing List) continued for a further week, running to a total of 650 emails, not counting the other threads it triggered.
The debate has been far from fruitless and has led to the development of various modifications which Torvalds has integrated, in some cases immediately, into the main development tree leading to Linux 2.6.30. This part of the "What's coming in 2.6.30" Kernel Log series gives an overview of these and many other changes to the code for the various file systems supported by Linux.
At a relatively early stage of the above discussion, an old, previously much discussed issue affecting all file systems once more reared its head – when and how frequently should the kernel update a file's atime (last access time)? This information is of importance to only a handful of applications, yet every atime update requires a write. This not only costs time, it is also largely superfluous for SSDs and an unnecessary drain on laptops running on battery.
Spurred on by this, Matthew Garrett produced a number of patches which mean the kernel now updates the last access time at most once a day ("relative atime", or relatime). It took Linus Torvalds just a few hours to make this one of the first patches to be incorporated into the main development tree following the release of 2.6.29. A further patch from Garrett makes relatime the default. The old-style behaviour can be restored with the strictatime mount option.
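In practice the new behaviour is selected with ordinary mount options – the device name and paths below are purely illustrative:

```shell
# Remount a file system with relatime, the new default in 2.6.30;
# atime is then written at most once a day rather than on every access:
mount -o remount,relatime /

# Restore the traditional behaviour, updating atime on every read:
mount -o remount,strictatime /

# The choice can also be made persistent via /etc/fstab, for example:
# /dev/sda1  /  ext3  defaults,relatime  0  1
```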
But even these changes, which many kernel hackers have long been calling for, did not satisfy everyone – Valerie Aurora (formerly Henson) has listed various criticisms on her blog. Expect this one to run and run.
A user report of long latencies when applications use fsync() to flush the Ext3 write buffer while the kernel is working through large read operations prompted a discussion on the LKML. The problem has been known about for several months, but the available workarounds were somewhat controversial.
Ext[2/3/4] file system developer Ted Ts'o put the blame squarely on application developers who, he opined, could save the file system a great deal of work with a little more prudence. Other kernel developers disagreed. Ts'o has, however, already developed a number of less controversial patches for Ext3 and Ext4 which, according to his measurements, reduce latency and which have subsequently been incorporated into 2.6.30.
Subsequent tests by Torvalds, however, determined that some of the blame for latencies must be placed on the block layer's CFQ scheduler. Jens Axboe analysed the problem and quickly developed more changes which further reduce latencies. This will in some cases increase the speed of desktop systems not just measurably, but tangibly.
Debate and details
This article only describes the most critical points and outcomes of the discussions on Ext3 and Ext4 mentioned above and their interaction with other kernel subsystems such as the block layer. Linux Weekly News (LWN.net) has taken a more detailed look at the discussion and the changes arising from it in the articles That massive filesystem thread and Solving the ext3 latency problem.
The articles "Linux Storage and Filesystem workshop" Day 1 and Day 2 also in part explore the issues discussed. The article ext4 and data loss looks at the problem of potential data loss in Ext4.
Another major factor in the latency problem is that Ext3 mounts with 'data=ordered' by default. Ted Ts'o has even publicly repented of the decision, taken several years ago, to make this mode the default.
Though it initially looked as if the debate would be fruitless, Torvalds adopted some of Ts'o's patches a few days later. These included one patch through which the kernel mounts Ext3 file systems with 'data=writeback' unless the user explicitly specifies otherwise in the kernel configuration or at mount time. This should improve performance, but increases the risk of data loss in the event of a crash or if the computer is turned off without shutting down. There is also a risk that data from previously deleted files belonging to other users could find its way into new files incompletely written to disk before a crash.
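Users who would rather pin the behaviour down explicitly than rely on the new default can select the journalling mode themselves – the device and mount point here are purely illustrative:

```shell
# Mount an ext3 volume with the old default, which writes data blocks
# to disk before committing the metadata that refers to them:
mount -t ext3 -o data=ordered /dev/sdb1 /mnt/data

# Or opt for the faster, but less crash-safe, writeback mode:
mount -t ext3 -o data=writeback /dev/sdb1 /mnt/data

# The build-time default is chosen via the new kernel config option
# introduced by "ext3: make default data ordering mode configurable":
# CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
```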
The 'data=guarded' mode developed by chief Btrfs developer Chris Mason should resolve some of these problems. Two of the enhancements coded as part of this development have already been incorporated into the main development tree. The rest have been put on hold, with the development cycle already entering the stabilisation phase.
In the tumult of the discussion, the risk of data loss in Ext4 as a result of delayed allocation once more reared its ugly head. This risk should be significantly reduced by a number of patches for the Ext4 code which have now, as planned, made their way into the main development tree. The changes do, however, have a negative effect on performance in certain situations.
The discussion on the risk of data loss led to a debate on precisely what guarantees a file system should be offering anyway. This led to the question of whether and how kernel and file systems should ensure that data does not just end up in a disk's write cache, but actually gets written in the correct sequence. This and other performance tuning questions led to a further discussion on where the role of the kernel developers in configuration stops and where fine tuning issues are better left to the Linux distributors.
Two new file systems
NILFS2 (New Implementation of a Log-structured Filesystem Version 2) is a log-structured file system (LFS) with continuous snapshotting, optimised for the needs of solid-state disks (SSDs). A detailed description of how it works can be found on the NILFS2 website and in the kernel documentation on NILFS2. Further details can be found in a presentation given as part of the Linux Storage & Filesystem Workshop 2008 (LSF'08) in February, which includes a comparison between NILFS2 and Btrfs, Ext2 through 4, ReiserFS and XFS when running on an SSD. The presentation by Dongjun Shin, which is already somewhat long in the tooth, also takes a close look at some of the particularities of file systems for SSDs.
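As a rough sketch of how the continuous snapshotting is used in practice – the device names are illustrative, and the chcp tool comes from the separate nilfs-utils userspace package, whose exact invocation may vary between versions:

```shell
# Mount a NILFS2 volume; checkpoints are then created automatically
# as the file system is modified:
mount -t nilfs2 /dev/sdc1 /mnt/nilfs

# Turn an existing checkpoint (here number 7) into a persistent
# snapshot so the garbage collector will not reclaim it:
chcp ss /dev/sdc1 7

# Mount that snapshot read-only alongside the live file system:
mount -t nilfs2 -r -o cp=7 /dev/sdc1 /mnt/snapshot
```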
EXOFS stands for Extended Object File System and used to be known as OSDFS (Object-Based Storage Devices File System). As the old name suggests, it is intended for the somewhat exotic OSDs (object-based storage devices), which will be supported by the SCSI subsystem for the first time in 2.6.30. Users wanting further information on this kind of storage and the file system can find details in an article on OSDs by Sun, the kernel documentation on EXOFS, the EXOFS developer website and an LWN.net article on EXOFS/OSDFS.
Cache for the network, infusion for Btrfs
After several years of development, the kernel development team have now adopted the FS-Cache patch developed primarily by Red Hat developer David Howells (kernel documentation). This extension allows a file system cache to be set up to reduce network traffic when using network file systems such as AFS and NFS. This is, for instance, of interest for thin clients with no hard drive or flash media which obtain their root file system and all other data over a network.
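In practical terms, enabling the cache for an NFS mount looks roughly like this – the server name and export path are illustrative, and the cachefilesd daemon that manages the on-disk cache comes from a separate userspace package:

```shell
# Start cachefilesd, which provides the local on-disk cache
# backing FS-Cache (configured in /etc/cachefilesd.conf):
service cachefilesd start

# Mount an NFS export with the new 'fsc' option so that data read
# over the network is also kept in the local cache:
mount -t nfs -o fsc server:/export /mnt/nfs
```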
The Btrfs development team have also been busy and have enhanced the file system code to cope better with 4k stacks – further improvements along these lines remain on the to-do list. There are also enhancements to improve write performance in general and for SSDs.
ReiserFS - forgotten but not gone
The kernel's ReiserFS code is still officially supported, but has long been without an official maintainer. Consequently, in recent months there have been only minor changes to ensure that the once popular file system keeps working.
Novell developer Jeff Mahoney has now added various patches developed as part of SLED/SLES, some of which are already more than two years old. They should resolve irregularities or bugs in ReiserFS (also referred to as Reiser3). According to Mahoney's Git pull request, after incorporation of these patches ReiserFS should be considered to be in "deep maintenance-only mode". As part of the discussions on Git pull requests, Frederic Weisbecker indicated that he is working on changes to reduce the use of the Big Kernel Lock (BKL) in ReiserFS.
The kernel development team have also extended the DFS support in CIFS for access to remote servers.
The changes described are just some of the more significant changes recently undertaken by kernel hackers on the code for the various file systems. Numerous further major changes can be found in the list of commit headers from the main development tree below. The links take you directly to the changes in the web interface of the main development tree, where the commit comments and the patches themselves provide further information on these, perhaps less major, but in no way insignificant changes.
Further background and information about developments in the Linux kernel and its environment can also be found in previous issues of the Kernel Log at The H Open Source:
- Kernel Log: 3D support for the new Radeon driver; new Intel drivers
- Kernel Log: What's coming in 2.6.30 - Network: New Wi-Fi drivers and other network novelties
- Kernel Log: Linux 2.6.30 is taking shape
- Kernel Log: Development of 2.6.30 is under way
- Steady Growth: What's new in Linux 2.6.29
- Kernel Log: Tasmanian devil to be Linux's temporary mascot, new Radeon drivers
- Btrfs: 1, 2, 3
- Ext3 and Ext4: 1, 2, 3
- FS-Cache: 1, 2
- GFS2: 1
- NFS: 1
- OCFS2: 1
- Quota: 1
- UBIFS: 1
- VFS: 1, 2
- XFS: 1
- Btrfs: add a priority queue to the async thread helpers
- Btrfs: add extra flushing for renames and truncates
- Btrfs: add flushoncommit mount option
- Btrfs: do extent allocation and reference count updates in the background
- Btrfs: notreelog mount option
- Btrfs: Optimize locking in btrfs_next_leaf()
- Btrfs: rework allocation clustering
- Btrfs: stop spinning on mutex_trylock and let the adaptive code spin for us
- Btrfs: use WRITE_SYNC for synchronous writes
- CacheFiles: A cache that backs onto a mounted filesystem
- CacheFiles: Export things for CacheFiles
- CacheFiles: Permit the page lock state to be monitored
- Create a dynamically sized pool of threads for doing very slow work items
- Document the slow work thread pool
- FS-Cache: Add and document asynchronous operation handling
- FS-Cache: Add cache management
- FS-Cache: Add cache tag handling
- FS-Cache: Add main configuration option, module entry points and debugging
- FS-Cache: Add netfs registration
- FS-Cache: Add the FS-Cache cache backend API and documentation
- FS-Cache: Add the FS-Cache netfs API and documentation
- FS-Cache: Add use of /proc and presentation of statistics
- FS-Cache: Bit waiting helpers
- FS-Cache: Implement data I/O part of netfs API
- FS-Cache: Implement the cookie management part of the netfs API
- FS-Cache: Make kAFS use FS-Cache
- FS-Cache: Object management state machine
- FS-Cache: Provide a slab for cookie allocation
- FS-Cache: Recruit a page flags for cache management
- FS-Cache: Release page->private after failed readahead
- FS-Cache: Root index definition
- Make slow-work thread pool actually dynamic
- Make the slow work pool configurable
- NFS: Add comment banners to some NFS functions
- NFS: Add FS-Cache option bit and debug bit
- NFS: Add mount options to enable local caching on NFS
- NFS: Add read context retention for FS-Cache to call back with
- NFS: Add some new I/O counters for FS-Cache doing things for NFS
- NFS: Define and create inode-level cache objects
- NFS: Define and create server-level objects
- NFS: Define and create superblock-level objects
- NFS: Display local caching state
- NFS: FS-Cache page management
- NFS: Invalidate FsCache page flags when cache removed
- NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
- NFS: Permit local filesystem caching to be enabled for NFS
- NFS: Read pages from FS-Cache into an NFS inode
- NFS: Register NFS for caching and retrieve the top-level index
- NFS: Store pages from an NFS inode into a local cache
- NFS: Use local disk inode cache
- exofs: address_space_operations
- exofs: dir_inode and directory operations
- exofs: Documentation
- exofs: export_operations
- exofs: file and file_inode operations
- exofs: Kbuild, Headers and osd utils
- exofs: super_operations and file_system_type
- exofs: symlink_inode and fast_symlink_inode operations
- fs: Add exofs to Kernel build
- ext3: Add replace-on-rename hueristics for data=writeback mode
- ext3: Add replace-on-truncate hueristics for data=writeback mode
- ext3: Avoid starting a transaction in writepage when not necessary
- ext3: make default data ordering mode configurable
- ext3: Try to avoid starting a transaction in writepage for data=writepage
- ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
- ext4: Add fine print for the 32000 subdirectory limit
- ext4: Add sysfs support
- ext4: Automatically allocate delay allocated blocks on close
- ext4: Automatically allocate delay allocated blocks on rename
- ext4: Fix discard of inode prealloc space with delayed allocation.
- ext4: Regularize mount options
- ext4: remove /proc tuning knobs
- ext4: Track lifetime disk writes
- trivial: document ext3 semantics of 'ro' option a bit better
- Nilfs2: add document
- Nilfs2: add inode and other major structures
- Nilfs2: add maintainer
- Nilfs2: another dat for garbage collection
- Nilfs2: avoid double error caused by nilfs_transaction_end
- Nilfs2: block cache for garbage collection
- Nilfs2: B-tree based block mapping
- Nilfs2: B-tree node cache
- Nilfs2: buffer and page operations
- Nilfs2: checkpoint file
- Nilfs2: clean up indirect function calling conventions
- Nilfs2: cleanup nilfs_clear_inode
- Nilfs2: clean up sketch file
- Nilfs2: direct block mapping
- Nilfs2: directory entry operations
- Nilfs2: disk address translator
- Nilfs2: disk format and userland interface
- Nilfs2: extend nilfs_sustat ioctl struct
- Nilfs2: file operations
- Nilfs2: fix buggy behavior seen in enumerating checkpoints
- Nilfs2: fix gc failure on volumes keeping numerous snapshots
- Nilfs2: fix improper return values of nilfs_get_cpinfo ioctl
- Nilfs2: fix missed-sync issue for do_sync_mapping_range()
- Nilfs2: fix problems of memory allocation in ioctl
- Nilfs2: inode map file
- Nilfs2: inode operations
- Nilfs2: insert explanations in gcinode file
- Nilfs2: integrated block mapping
- Nilfs2: introduce secondary super block
- Nilfs2: ioctl operations
- Nilfs2: mark minor flag for checkpoint created by internal operation
- Nilfs2: meta data file
- Nilfs2: operations for the_nilfs core object
- Nilfs2: pathname operations
- Nilfs2: persistent object allocator
- Nilfs2: recovery functions
- Nilfs2: remove compat ioctl code
- Nilfs2: remove timedwait ioctl command
- Nilfs2: replace BUG_ON and BUG calls triggerable from ioctl
- Nilfs2: segment buffer
- Nilfs2: segment constructor
- Nilfs2: segment usage file
- Nilfs2: simplify handling of active state of segments
- Nilfs2: super block operations
- Nilfs2: super block operations fix endian bug
- Nilfs2: support nanosecond timestamp
- Nilfs2: update makefile and Kconfig
- Nilfs2: use fixed sized types for ioctl structures
- Nilfs2: use unlocked_ioctl
- CIFS: Add support for posix open during lookup
- Documentation/filesystems: remove out of date reference to BKL being held
- documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls
- Document /proc/fs/nfsd/pool_stats
- filesystem freeze: allow SysRq emergency thaw to thaw frozen filesystems
- GFS2: Merge lock_dlm module into GFS2
- nfs41: common protocol definitions
- nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
- nfsd41: Documentation/filesystems/nfs41-server.txt
- ocfs2: Add a name indexed b-tree to directory inodes
- ocfs2: Introduce dir free space list
- ocfs2: Introduce dir lookup helper struct
- ocfs2: Store dir index records inline
- quota: Add quota reservation support
- ramfs: add support for "mode=" mount option
- reiserfs: rework reiserfs_panic
- reiserfs: rework reiserfs_warning
- reiserfs: use generic readdir for operations across all xattrs
- UBIFS: add R/O compatibility
- udf: implement mode and dmode mounting options
- vfat: Note the NLS requirement
- xfs: Update maintainers