Kernel Log – Coming in 2.6.31 - Part 3: Storage and file systems
by Thorsten Leemhuis
The experimental file system Btrfs, billed as the "next generation file system for Linux", should now be even faster. Libata drivers for IDE/PATA adaptors are pushing aside the IDE subsystem. The first components for defragmenting Ext4 file systems have been merged into the main development tree. Systems with Intel ATA chipsets now boot faster thanks to parallel hardware scanning.
Last Thursday, Linus Torvalds released the sixth pre-release version of Linux 2.6.31. As usual at this late phase of the development cycle, most of the changes from rc5 are minor. In his release email, Torvalds indicates that he expects 2.6.31 to be complete after the eighth pre-release version, probably in two to three weeks.
The Kernel Log is taking the opportunity to continue its series of reports on the major changes in Linux 2.6.31 compared to the current 2.6.30 kernel with an overview of storage and file systems. Previous parts covered networking, and graphics, audio and video.
Btrfs upgraded
Major changes, in the form of a 350 KB patch, should significantly improve the experimental file system's performance by using 'mixed back references' in many areas. The patch does, however, involve a change in the structure of the file system on the storage media ('on disk format'). Kernel versions containing the new Btrfs code deal with the requisite conversion from old to new format automatically the first time the file system is mounted. However, Linux versions with older Btrfs code will thereafter no longer be able to mount file systems which have been modified by the new code.
This is flagged up clearly in the commit comments and Git pull request. Kernel developers usually try to avoid such situations, even with experimental file systems – the result of this change is that hardy users who choose to use Btrfs as their root file system will find themselves unable to start older kernel versions to deal with errors should the need arise. Indeed this is precisely the misfortune that befell Linus Torvalds, who was distinctly unimpressed.
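Why such a format change is a one-way street can be illustrated with the kind of feature-flag check file systems commonly perform at mount time. The following is only a sketch of the principle: the flag name, its value and both functions are hypothetical and are not taken from the Btrfs code.

```python
# Hypothetical feature bit; name and value are for illustration only.
FEATURE_MIXED_BACKREF = 1 << 0


def can_mount(disk_flags: int, kernel_flags: int) -> bool:
    """True if the kernel knows every incompatible feature recorded on disk."""
    unknown = disk_flags & ~kernel_flags
    return unknown == 0


def mount(disk_flags: int, kernel_flags: int, upgrade_flags: int = 0) -> int:
    """Simulate a mount and return the feature flags left on disk afterwards.

    A kernel with newer code may convert the file system on first mount,
    which sets additional feature bits (upgrade_flags).
    """
    if not can_mount(disk_flags, kernel_flags):
        unknown = disk_flags & ~kernel_flags
        raise OSError("unsupported on-disk features: %#x" % unknown)
    return disk_flags | upgrade_flags
```

In this model, a kernel that supports `FEATURE_MIXED_BACKREF` mounts an old-format file system and sets the new bit; a later call to `can_mount()` with an older kernel's flag set of `0` then returns `False`, which mirrors the situation described above.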
Very late in the development cycle, the Btrfs maintainer submitted, via a Git pull request, a number of major changes that had been in the works for some time. These should make Btrfs less memory hungry during long periods of high load. Btrfs developers have also improved support for SSDs.
For other articles on 2.6.31 and links to the rest of the "Coming in 2.6.31" series, see The H's Kernel Log - 2.6.31 Tracking page.
Adieu IDE
Kernel developer David Miller, known for his work as maintainer of the networking subsystem and for SPARC support, has now also taken over the IDE subsystem from Bartlomiej Zolnierkiewicz. The reason for this was a clash over a bug occurring on SPARC systems, in the course of which Miller suggested that Zolnierkiewicz had introduced a number of changes to the IDE subsystem without adequate testing.
Miller has intimated that he is not planning on implementing any major changes to the IDE subsystem: "I'm going to treat IDE as pure legacy." The future therefore now looks definitively to belong to the PATA drivers in the Libata subsystem, which were merged into Linux 2.6.19 in late 2006. While they may not be able to control quite as many IDE/PATA adaptors as the IDE subsystem, they are able to deal with almost all common modern adaptors.
Many developers always viewed the Libata drivers as a replacement for the drivers in the older IDE subsystem, which has been a source of repeated strife between kernel developers for more than a decade. Following a period of almost complete inactivity, over the last eighteen months to two years Zolnierkiewicz had substantially revised the IDE subsystem and added a number of new drivers, so that instead of the anticipated lingering death, the two systems had started to look like competitors – a situation that has now been resolved.
Body search
The new Fsnotify infrastructure replaces the in-kernel back ends of Dnotify and Inotify and can be used to monitor changes to the file system, such as the creation, deletion or modification of files. The actual goal of Red Hat employee Eric Paris, who developed Fsnotify, is Fanotify, which builds on Fsnotify and offers virus and malware scanners operating in userspace a hook for checking files for malware before they are actually accessed. Paris recently invited discussion on the concept and design of Fanotify.
The idea arose from long discussions on TALPA, which set out to achieve the same purpose, but failed to win over kernel developers. Background information can be found in the LWN.net articles "Kernel-based malware scanning", "The TALPA molehill" and "The fanotify API".
In Brief
The changes described above are just some of the more significant of those recently undertaken by kernel hackers in the file system and storage field. A short overview of further changes:
File systems:
- The Ext4 file system now contains code for defragmenting the file system while in use (online defrag). This is not, however, finished, as was recently emphasised elsewhere by Ext file system developer Theodore Tso (tytso). He goes on to say that further patches for this function still need to be evaluated and that there is still outstanding work to be done on the associated userspace program.
- Support for NFS 4.1 has been extended, but further changes are still planned for 2.6.32.
Storage:
- Thanks to a change merged prior to the handover from Zolnierkiewicz to Miller, the IDE subsystem now by default respects HPAs (Host Protected Areas) – users who deploy an HPA and are still using IDE subsystem drivers should not be surprised if their drive is a little smaller under 2.6.31.
- The IDE/PATA driver ata_piix for Intel controllers, which forms part of the Libata subsystem, now scans for drives in parallel, halving initialisation time on the developer's Eee PC.
- As a result of one of many changes to the block layer, the latter now exports information on I/O topology using data supplied by the SCSI subsystem – this includes a drive's physical sector size. This is, for example, of interest when allocating storage media with sector sizes other than 512 bytes or for optimal arrangement of data in RAID arrays. The developer behind this code explores some of the issues involved in a recently released presentation (see pages 235-238). The MD code responsible for software RAIDs is already able to make use of this topology information.
- There have been major improvements to barrier support in the device mapper (delay, mpath, snapshot).
- Following the merger of generic support for OSDs (Object-Based Storage Devices) and a file system based on them into 2.6.30, kernel hackers have now merged the osdblk driver, which allows OSD objects to be used as block devices.
- The MMC subsystem now includes a driver by the name of via-sdmmc for VIA SD/MMC card readers and a platform driver for SDHCI.
- An Emulex developer has contributed a nearly 340 KB patch which adds support for recent Emulex LightPulse fibre channel host adaptors to the lpfc (Light Pulse Fibre Channel) driver; this was followed by a further update which adds support for target reset handler entry points. The same programmer is also responsible for FC (Fibre Channel) pass-thru support.
- There's a new iSCSI driver for Broadcom's BNX2 chips: bnx2i. It can, if required, operate in conjunction with the new Cnic driver, which has previously been mentioned in the Kernel Log article on changes in the networking field.
- As well as including various enhancements to existing features, the SAS/SATA driver mvsas now also supports 94xx series Marvell chips.
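The I/O topology information mentioned in the list above reaches userspace as plain text files in sysfs. The following sketch reads a few of them and checks whether a partition would start on a physical sector boundary. It assumes the attribute names `logical_block_size`, `physical_block_size`, `minimum_io_size` and `optimal_io_size` under `/sys/block/<device>/queue/`; the function names and the `sysfs_root` parameter are chosen for this example, and the check deliberately ignores any alignment offset a drive may report.

```python
from pathlib import Path

# Topology attributes expected below /sys/block/<device>/queue/.
TOPOLOGY_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_topology(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the topology values the kernel exports for a block device.

    Attributes that do not exist (for example on older kernels) are
    simply left out of the result.
    """
    queue_dir = Path(sysfs_root) / device / "queue"
    topology = {}
    for attr in TOPOLOGY_ATTRS:
        path = queue_dir / attr
        if path.is_file():
            topology[attr] = int(path.read_text().strip())
    return topology


def starts_on_physical_boundary(start_byte: int, topology: dict) -> bool:
    """True if start_byte is a multiple of the physical sector size.

    Falls back to 512 bytes when the kernel reports no physical size.
    """
    physical = topology.get("physical_block_size", 512)
    return start_byte % physical == 0
```

On a drive with 4096-byte physical sectors, a partition starting at the traditional sector 63 (byte offset 63 × 512) fails this check, whereas one starting at sector 2048 passes.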
Minor gems
Many additional minor, but by no means insignificant, changes can be found in the list below. Like many of the references in the text above, the links lead to the relevant commits in the web front end of the main Linux development branch, where the commit comments and the patches themselves provide extensive further information on the respective changes.
File systems
Btrfs:
- Btrfs: Add mount -o nossd
- Btrfs: honor nodatacow/sum mount options for new files
- Btrfs: reduce mount -o ssd CPU usage
- Btrfs: update backrefs while dropping snapshot
CIFS:
- CIFS: add addr= mount option alias for ip=
- CIFS: Add mention of new mount parm (forceuid) to cifs readme
- CIFS: reinstate original behavior when uid=/gid= options are specified
- CIFS: show noforceuid/noforcegid mount options (try #2)
- CIFS: Update readme to indicate change to default mount (serverino)
- CIFS: Update readme to reflect forceuid mount parms
- CIFS: Updates fs/cifs/CHANGES
Ext[2,3&4]:
- Doc fix: ext2 can only have 32,000 subdirs, not 32,768
- ext4: Avoid races caused by on-line resizing and SMP memory reordering
- ext4: Change all super.c messages to print the device
- ext4: document the "abort" mount option
NFS:
- nfs41: Add ability to read RPC call direction on TCP stream.
- nfs41: Add backchannel processing support to RPC state machine
- nfs41: Add Kconfig symbols for NFSv4.1
- nfs41: add mount command option minorversion
- nfs41: add session reset to state manager
- nfs41: add session setup to the state manager
- nfs41: create_session operation
- nfs41: Setup the backchannel
- nfs41: Use mount minorversion option
- NFS: Add separate mountd status code decoders for each mountd version
- NFS: add support for splice writes
- nfsd: support ext4 i_version
- NFS: Invalid mount option values should always fail, even with "sloppy"
- NFS: More "sloppy" parsing problems
- update Documentation/filesystems/00-INDEX with new nfsd related docs.
Various:
- add caching of ACLs in struct inode
- documentation: register ioctl entry of nilfs2
- FAT: add 'errors' mount option
- GFS2: Add commit= mount option
- GFS2: Add tracepoints
- GFS2: Update docs
- hostfs: set maximum filesize in superblock for proper LFS support
- isofs: let mode and dmode mount options override rock ridge mode setting
- nilfs2: allow future expansion of metadata read out via get info ioctl
- nilfs2: modify list of unsupported features in caveats
- ocfs2: Add statistics for the checksum and ecc operations.
- partitions: warn about the partition exceeding device capacity
- proc: export statistics for softirq to /proc
- proc.txt: update kernel filesystem/proc.txt documentation
- splice: implement pipe to pipe splicing
- update Documentation/filesystems/Locking
- VFS: Add VFS helper functions for setting up private namespaces
- xfs: use generic Posix ACL code
Storage
Block Layer:
- Add serial number support for virtio_blk, V4a
- block: enable by default support for large devices and files on 32-bit archs
- block: rename CONFIG_LBD to CONFIG_LBDAF
- block: Update topology documentation
- Make SCSI SG v4 driver enabled by default and remove EXPERIMENTAL dependency, since udev depends on BSG
- ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
Device Mapper:
- dm ioctl: support cookies for udev
- dm mpath: add queue length load balancer
- dm mpath: add service time load balancer
- dm raid1: add userspace log
- dm: sysfs add suspended attribute
Libata:
- ahci: add device ID for 82801JI sata controller
- ahci: add device IDs for Ibex Peak ahci controllers
- ata_piix: Add new laptop short cable IDs
- ata_piix: Add new short cable ID
- ata_piix: Turn on hotplugging support for older chips
- libata: accept late unlocking of HPA
- libata: ahci: Restore SB600 SATA controller 64 bit DMA
- libata: beautify module parameters
- libata: PATA driver for CF interface on AT91SAM9260 SoC
- sata_fsl: Add power mgmt support
- sata_sil: enable 32-bit PIO
MMC:
- sdhci: Add support for hosts that are only capable of 1-bit transfers
- sdhci-s3c: Samsung S3C based SDHCI controller glue
- sdhci: Specific quirk vor VIA SDHCI controller in VX855ES
MTD:
- Documentation: add MTD sysfs docs
- mtd: add MEMERASE64 ioctl for >4GiB devices
- mtd: add on-flash BBT support for Atmel NAND driver
- mtd: add OOB ioctls for >4GiB devices
- mtd: add SST39SF040 chip to jedec_probe
- mtd: CFI 1.0 and CFI 1.1
- mtd: Flex-OneNAND support
- mtd: m25p80: add support for Macronix MX25L12805D
- mtd: nand: add OMAP2/OMAP3 NAND driver
- mtd: OneNAND: add support for OneNAND manufactured by Numonyx
- mtd: physmap_of: Add multiple regions and concatenation support
- mtd: Restore suspend/resume support for mtd devices
SCSI:
- explain the hidden scsi_wait_scan Kconfig variable
- fcoe: Add runtime debug logging with module parameter debug_logging
- libfc: Add runtime debugging with debug_logging module parameter
- libfcoe: Add runtime debugging with module param debug_logging
- SCSI: mpt2sas: add query task support for MPT2COMMAND ioctl
- SCSI: mpt2sas: LUN Reset Support
- SCSI: mpt2sas: T10 DIF Support
- SCSI: mpt fusion: RAID device handling and Dual port Raid support is added
- SCSI: net, libfcoe: Add the FCoE Initialization Protocol ethertype
- SCSI: qla2xxx: Add 10Gb iiDMA support.
- SCSI: qla2xxx: Add CPU affinity support.
- SCSI: qla2xxx: Add QoS support.
- SCSI: zfcp: Add FC pass-through support
- sd: Block limits VPD support
- sd: Detect non-rotational devices
(djwm)