Kernel Log: Coming in 2.6.32 (Part 3) - Storage
by Thorsten Leemhuis
The kernel development team have enhanced various aspects of Btrfs, one effect of which is to significantly improve the experimental file system's write performance. A number of changes to the block layer promise better data throughput and responsiveness. There are also several new drivers for storage hardware.
Linux kernel 2.6.32-rc7 hit the streets late last week, but when 2.6.32 will finally be released remains anyone's guess. There are likely to be at least one, more likely two, further pre-release versions in this development cycle, a cycle which has been slightly disrupted by typos and the Kernel Summit. Following on from our reviews of the changes in the networking and graphics hardware subsystems, this instalment of the Kernel Log 2.6.32 series looks at file systems and storage.
Btrfs
There have been a whole heap of changes to Btrfs. The experimental "next generation file system for Linux" can now write at more than 1 GB per second on fast hardware and now matches XFS for speed on our test system. Btrfs' previous maximum data transfer speed was 400 MB per second, as this maxed out CPU usage. Snapshots and subvolumes can now be renamed in Btrfs and can be deleted much more rapidly. Thanks to various enhancements, RPM and Yum now also work faster.
For delayed allocations, Btrfs is now more reliable in reserving enough space for metadata, ensuring that sufficient capacity is available for subsequent writes. There is some new experimental code for 'discard operations' which should eventually allow the file system to tell SSDs which blocks have been freed by deleting data; however, the requisite support in the SCSI and Libata subsystems is still a work in progress.
A list of further Btrfs-related changes can be found at the end of this article and in various git pull requests in which Chris Mason, the main Btrfs developer, briefly explains the major changes (1, 2, 3). Mason, who works for Oracle, developed many of the changes himself, with many more contributed by other developers on the payrolls of various different companies. The importance of this kind of distribution of developer resources and know-how for a successful, resilient open source project was recently emphasised in a talk given at the Linux Kongress by Theodore Ts'o ('tytso') (well-known for his work on the Ext file systems).
Ext3, Ext4, XFS, etc.
Other significant changes to the Linux kernel file system code:
- Numerous changes to Ext3 and Ext4 have made their way into the main development tree via Theodore Ts'o (1, 2). One of them speeds up the fs_mark benchmark by around fifty per cent under certain set-ups. 'Data=guarded' mode in Ext3, long under development, has once more been left out in the cold.
- Sysfs now supports security labels, allowing security frameworks such as SELinux to monitor access to the virtual file system.
- The kernel's VFAT code will no longer mount FAT drives by default with the behaviour activated by the "shortname=lower" option, but will instead default to "shortname=mixed". This should mean that upper and lower case in file names will no longer be changed when copying using Linux.
- Support for the 9p file system has been added in Fscache.
- A change to XFS should make finding free inodes three to four times faster in certain situations. An overview of further developments in XFS can be found in the XFS status updates for September and October.
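Anyone who relies on the old VFAT behaviour can still request it explicitly at mount time. The following is a minimal sketch, not taken from the kernel sources: the helper name `vfat_mount_command` is hypothetical, and it only builds the argument list for the standard `mount` command rather than running it. The set of accepted `shortname` values is an assumption based on the kernel's VFAT documentation.

```python
# Values the vfat "shortname" mount option is assumed to accept.
VALID_SHORTNAME_MODES = ("lower", "win95", "winnt", "mixed")


def vfat_mount_command(device, mountpoint, shortname="mixed"):
    """Build (but do not run) a mount command for a FAT volume.

    The default mirrors the new kernel default described above;
    pass shortname="lower" to ask for the pre-2.6.32 behaviour.
    """
    if shortname not in VALID_SHORTNAME_MODES:
        raise ValueError(f"unknown shortname mode: {shortname!r}")
    return ["mount", "-t", "vfat", "-o", f"shortname={shortname}",
            device, mountpoint]


if __name__ == "__main__":
    # Example device and mount point are placeholders.
    print(" ".join(vfat_mount_command("/dev/sdb1", "/mnt/usb", "lower")))
```

The resulting list could be handed to `subprocess.run()` by a script running with root privileges.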
Block layer
The CFQ (Completely Fair Queuing) I/O scheduler used by many distributions now optimises queries for short response times. This should mean that when programs are running in the background which process large volumes of data, desktop applications running in the foreground will no longer be slowed to the same extent and will consequently feel faster. Background on the changes can be found in a piece by block subsystem maintainer Jens Axboe on LWN.net. The new behaviour does, however, cause some patterns of access to work a little slower, for which reason CFQ's low latency mode can also be deactivated via sysfs.
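As a rough illustration of the sysfs switch mentioned above, the sketch below reads and writes the CFQ tunable for one block device. It assumes the knob is exposed as `queue/iosched/low_latency` beneath the device's directory in `/sys/block` and that it accepts `0` or `1`; the function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def low_latency_path(device, sysfs_root="/sys"):
    """Return the assumed location of CFQ's low_latency knob."""
    return (Path(sysfs_root) / "block" / device
            / "queue" / "iosched" / "low_latency")


def get_low_latency(device, sysfs_root="/sys"):
    """True if the low latency mode is currently switched on."""
    return low_latency_path(device, sysfs_root).read_text().strip() == "1"


def set_low_latency(device, enabled, sysfs_root="/sys"):
    """Switch the low latency mode on or off (needs root on a real system)."""
    low_latency_path(device, sysfs_root).write_text("1" if enabled else "0")


if __name__ == "__main__":
    # "sda" is a placeholder; the file only exists while CFQ is the
    # active scheduler for that device.
    print(get_low_latency("sda"))
```

Changes made this way do not survive a reboot, so a throughput-oriented system would need to apply the setting again at boot time.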
Axboe has also introduced a major rewrite of the writeback infrastructure, the fruit of several months' work, as a result of which each device is now dealt with by its own thread. This and other changes should significantly increase data throughput for writeback-intensive access scenarios and cause them to run more evenly. Axboe includes two benchmark graphics and various measurements in his commit comments to underline the point. More measurements are detailed in an email from Chris Mason. Some background information on the changes can be found in a presentation by Axboe and the April LWN.net also has an article on the recently merged blk-iopoll, which describes the NAPI-like approach to accessing storage devices which aims to increase maximum throughput by reducing IRQs.
Axboe's main git pull request lists a number of further changes in the block subsystem. Following a long discussion of its pros and cons, replication solution DRBD (Distributed Replicated Block Device) has not made it into Linux 2.6.32, but Torvalds has signalled his willingness to merge it into 2.6.33.
Libata, drivers, etc.
- It will in future be possible to read certain information on AHCI capabilities, such as whether a port is hot-pluggable or is an eSATA connector, via sysfs. Userspace applications should be able to use this information to make better decisions on optimal behaviours, e.g. when configuring ALPM (Aggressive Link Power Management).
- The kernel development team has merged the pata_atp867x driver for the ARTOP/Acard ATP867X PATA adapter and the pata_rdc driver for RDC PATA adapters. These now include support for AMD's SB900 southbridge, which, as things stand at present, still looks several months away from a release date.
- A 1.4 MB patch for the bfa driver for Brocade FC and FCOE host adapters has been merged into the SCSI subsystem. Much more lightweight are the new be2iscsi driver for iSCSI functionality for ServerEngines' 10Gbps BladeEngine 2 storage adapter and pmcraid driver for PMC Sierra's MaxRAID series 6Gb/s SAS adapters. There have also been numerous changes to the SCSI subsystem's FCOE driver.
- There have been a number of changes to the MD code and other kernel subsystems to improve the ability to offload calculations for RAID 6 to dedicated hardware. There have also been changes to the still fresh MD code for supporting the various options for modifying and converting software RAID arrays, e.g. for converting a RAID 1 to a RAID 5, to a RAID 6 and back again. Such conversions can be carried out by mdadm versions 3.1.x. The developer responsible for the kernel MD code and mdadm has withdrawn the first such version, but has indicated that 3.1.1 should be out shortly. He has also developed several enhancements to the kernel code responsible for modifying and converting RAID arrays, which should find their way into 2.6.33.
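To give an idea of how a userspace tool might use the AHCI information exported via sysfs, here is a minimal sketch that decodes a port command register value. Several things are assumptions rather than facts from this article: that the value appears as a hexadecimal string in an attribute such as `/sys/class/scsi_host/host0/ahci_port_cmd`, and that the hot-plug and eSATA flags sit at bits 18 and 21 as laid out in the AHCI specification. The function name `decode_port_cmd` is hypothetical.

```python
# Bit positions assumed from the AHCI specification's PxCMD register.
PORT_CMD_HPCP = 1 << 18  # hot plug capable port
PORT_CMD_ESP = 1 << 21   # external SATA (eSATA) port


def decode_port_cmd(raw):
    """Decode a hexadecimal port command value read from sysfs."""
    value = int(raw.strip(), 16)
    return {
        "hotplug_capable": bool(value & PORT_CMD_HPCP),
        "external_sata": bool(value & PORT_CMD_ESP),
    }


if __name__ == "__main__":
    # Assumed attribute path; adjust the host number as needed.
    with open("/sys/class/scsi_host/host0/ahci_port_cmd") as attr:
        print(decode_port_cmd(attr.read()))
```

A power management daemon could, for example, avoid enabling ALPM on ports that report either flag, since aggressive link power saving can interfere with hot-plug detection.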
Minor Gems
Many further minor, but by no means insignificant, changes can be found in the list below. Like many of the references in the text above, the links point to the relevant commits in the web front end of the Git branch at kernel.org that Linus Torvalds uses for maintaining the kernel sources. There, the commit comments and the patches themselves provide extensive further information on the respective changes.
Filesystems
Btrfs
- Btrfs: add snapshot/subvolume destroy ioctl
- Btrfs: always pin metadata in discard mode
- Btrfs: cache values for locking extents
- Btrfs: change how subvolumes are organized
- Btrfs: check size of inode backref before adding hardlink
- Btrfs: find ideal block group for caching
- Btrfs: improve async block group caching
- Btrfs: only write one super copy during fsync
- Btrfs: optimize fsync for the single writer case
- Btrfs: reduce CPU usage in the extent_state tree
- Btrfs: streamline tree-log btree block writeout
- Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code
Ext3, Ext4
- ext3: Add locking to ext3_do_update_inode
- ext3: Update documentation about ext3 quota mount options
- ext3: Update MAINTAINERS for ext3 and JBD
- ext4: Add configurable run-time mballoc debugging
- ext4: Add new tracepoint: trace_ext4_da_write_pages()
- ext4: async direct IO for holes and fallocate support
- ext4: drop ext4dev compat
- ext4: Fix memory leak fix when mounting an ext4 filesystem
- ext4: limit block allocations for indirect-block files to < 2^32
- ext4: Split uninitialized extents for direct I/O
- ext4: Update documentation about quota mount options
- ext4: Use tracepoints for mb_history trace file
- jbd2: Use tracepoints for history file
Others
- 9p: Update documentation to add fscache related bits
- CIFS: Re-enable Lanman security
- doc/filesystems: more mount cleanups
- doc/filesystems: remove smount program
- Documentation: update stale definition of file-nr in fs.txt
- fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
- fs/Kconfig: move nilfs2 outside misc filesystems
- GFS2: Add a document explaining GFS2's uevents
- GFS2: Add "-o errors=panic
- GFS2: Add sysfs link to device
- inotify: deprecate the inotify kernel interface
- NFS: Add a dns resolver for use with NFSv4 referrals and migration
- NFS: Allow the "nfs" file system type to support NFSv4
- nfsd41: sunrpc: Added rpc server-side backchannel handling
- nfsd: revise 4.1 status documentation
- NFS: Fix port and mountport display in /proc/self/mountinfo
- NFSv4: Disallow 'mount -t nfs4 -overs=2' and 'mount -t nfs4 -overs=3'
- ocfs2: Add CoW support.
- ocfs2: Add CoW support for xattr.
- ocfs2: Add ioctl for reflink.
- ocfs2: Add preserve to reflink.
- ocfs2: Add reflink support for xattr.
- ocfs2: Add support for incrementing refcount in the tree.
- SUNRPC: convert some sysctls into module parameters
- UBIFS: kill BKL
- vfat: change the default from shortname=lower to shortname=mixed
- vfs: allow file truncations when both suid and write permissions set
- xfs: Record new maintainer information
Storage
Block
- block: enable rq CPU completion affinity by default
- cfq-iosched: add a knob for desktop interactiveness
- cfq-iosched: drain device queue before switching to a sync queue
Libata
- ahci: Add the AHCI controller Linux Device ID for NVIDIA chipsets.
- ahci / atiixp / pci quirks: rename AMD SB900 into Hudson-2
- ahci: display all AHCI 1.3 HBA capability flags (v2)
- ahci: Enable SB600 64bit DMA on MSI K9A2 Platinum v2
- ahci: filter FPDMA non-zero offset enable for Aspire 3810T
- ahci: Gigabyte GA-MA69VM-S2 can't do 64bit DMA
- ahci: make ahci_asus_m2a_vm_32bit_only() quirk more generic
- libata: add command name parsing for error output
- libata: add DMA setup FIS auto-activate feature
- libata: implement more acpi filtering options
- libata: remove spindown skipping and warning
- pata_amd: do not filter out valid modes in nv_mode_filter
- pata_atp867x: add Power Management support
- pata_cs5535: add pci id for AMD based CS5535 controllers
- sata_promise: disable hotplug on 1st gen chips
- sata_promise: update reset code
MD
- async_tx: add sum check flags
- async_tx: add support for asynchronous GF multiplication
- async_tx: add support for asynchronous RAID6 recovery operations
- async_tx: kill ASYNC_TX_DEP_ACK flag
- async_tx: raid6 recovery self test
- async_tx: rename zero_sum to val
- async_tx: structify submission arguments, add scribble
- async_xor: permit callers to pass in a 'dma/page scribble' region
- dmaengine: add fence support
- dmaengine, async_tx: add a "no channel switch" allocator
- dmaengine: sh: Add Support SuperH DMA Engine driver
- dmatest: add pq support
- fsldma: Add DMA_SLAVE support
- ioat2+: add fence support
- ioat3: split ioat3 support to its own file, add memset
- ioat: add 'ioat' sysfs attributes
- iop-adma: P+Q support for iop13xx adma engines
- md/raid456: distribute raid processing over multiple cores
- md/raid5,6: add percpu scribble region for buffer lists
- md/raid5: make sure curr_sync_completes is uptodate when reshape starts
- md/raid6: asynchronous raid6 operations
MFD
- mfd: Add basic WM831x OTP support
- mfd: Add Freescale MC13783 driver
- mfd: Add support for TWL4030/5030 dynamic power switching
- mfd: Add twl4030-pwrbutton as a twl4030 child
- mfd: Add WM831x AUXADC support
- mfd: Add WM831x interrupt support
- mfd: Conditionally add WM831x backlight subdevice
- mfd: Hook WM831x into build system
- mfd: Initial core support for WM831x series devices
MMC
- mmc: add ability to save power by powering off cards
- mmc: add 'enable' and 'disable' methods to mmc host
- mmc: add MMC_CAP_NONREMOVABLE host capability
- mmc: add mmc card sleep and awake support
- mmc: core SDIO suspend/resume support
- mmc: msm_sdccc: driver for HTC Dream
- sdhci: support for ADMA only hosts
MTD
- mtd: add nand support for w90p910 (v2)
- mtd: Enable Open Firmware initialisation of MTD devices and maps for MicroBlaze
- mtd/maps: gpio-addr-flash: new driver for GPIO assisted flash addressing
- mtd: nand: driver for Nomadik 8815 SoC (on NHK8815 board)
- mtd: omap: adding DMA mode support in nand prefetch/post-write
- mtd: omap: add support for nand prefetch-read and post-write
- mtd: SST25L (non JEDEC) SPI Flash driver
SCSI
- SCSI: fcoe: Introduce and allocate fcoe_interface structure, 1:1 with net_device
- SCSI: hptiop: Add RR44xx adapter support
- SCSI: lpfc 8.3.4: Add bsg (SGIOv4) support for ELS/CT support
- SCSI: mpt2sas: Added mpi2_history.txt for MPI2 headers.
- SCSI: mpt2sas: Added SCSIIO, Internal and high priority memory pools to support multiple TM
- SCSI: mpt2sas: Target Reset will be issued from Interrupt context.
- SCSI: mpt2sas: Update driver to MPI2 REV K headers.
- SCSI: mvsas: Support Areca SAS/SATA HBA, ARC-1300/1320
- SCSI: qla2xxx: Add asynchronous-login support.
- SCSI: scsi_dh_rdac: add support for next generation of Dell PV array
- SCSI: sd: Detach DIF from block integrity infrastructure
- SCSI: sd: Support disks formatted with DIF Type 2
- SCSI: ses: add support for enclosure component hot removal
- SCSI: update MAINTAINERS with new email
Others
- atmel-mci: unified Atmel MCI drivers
- cciss: Add a "raid_level" attribute to each logical drive in /sys
- cciss: Add cciss_allow_hpsa module parameter
- cciss: Add lunid attribute to each logical drive in /sys
- cciss: Add usage_count attribute to each logical drive in /sys
- cciss: Allow triggering of rescan of logical drive topology via sysfs entry
- MAINTAINERS: InfiniBand/RDMA mailing list transition to vger
- omap4: mmc driver support on OMAP4
For other articles on 2.6.32 and links to the rest of the "Coming in 2.6.32" series, see The H's Kernel Log - 2.6.32 Tracking page. (thl /c't).
(crve)