Kernel Log: Coming in 2.6.33 (Part 2) - Storage
by Thorsten Leemhuis
Extended discard support means that Linux now supports ATA TRIM, which can increase SSD lifespan and throughput. New additions to the Linux kernel include HA solution DRBD and drivers for HP, LSI and VMware storage hardware. The new kernel version, expected in early March, also includes many minor improvements to the code for the Btrfs, Ext4 and ReiserFS file systems.
At the end of last week, Linus Torvalds released the fifth pre-release version of Linux 2.6.33, with the final release expected in 4 to 5 weeks' time. At this stage in the development cycle, it is mostly minor changes and fixes that find their way into the main development tree. Kernel hackers do, however, make occasional exceptions for drivers: Linux 2.6.33-rc5 sees the addition of a V4L/DVB driver that supports the Mantis chipset used in a range of TV cards. The changes to the V4L/DVB subsystem in Linux 2.6.33 will, however, be covered in a future article in the "Coming in 2.6.33" series.
Following on from the first article in the 2.6.33 series, which looked at networking changes, this article examines file system changes and changes to the kernel's storage subsystem.
Trimming
For several months, some sections of the kernel have included a rudimentary discard infrastructure, which allows drivers for mass storage adaptors to determine which storage areas on a disk are free, for example as a result of deleting a file or formatting a partition. This has been revised and extended in 2.6.33. The result is that the Libata subsystem now also supports discards and can forward information on free storage areas to mass storage devices using the ATA TRIM command. This is especially useful for SSDs (Solid State Drives), as information on free storage areas allows the drive's internal controller to optimise its garbage collection. This can increase both SSD performance and SSD lifespan.
For the discard infrastructure to achieve maximum impact, the storage subsystem needs to receive the free area information from other parts of the kernel, in particular from the file systems. The Btrfs file system has been able to provide this since Linux 2.6.32, and appropriate code has now also been added for Ext4. Since the Ext4 code has not yet been fully tested, this function will remain deactivated by default for now. The new discard support in the code for the FAT file system is also optional.
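How a file system might prepare free area information before handing it to the storage subsystem can be illustrated with a small sketch. The following Python model is not the kernel's implementation; the function names and the extent representation are assumptions made purely for illustration. It merges neighbouring freed extents into larger ranges and then trims each range inward to a device's discard granularity, since a partial granule cannot usefully be discarded.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (start_sector, sector_count)
Extent = Tuple[int, int]


def coalesce_extents(freed: List[Extent]) -> List[Extent]:
    """Merge adjacent or overlapping freed extents into larger ranges."""
    merged: List[Extent] = []
    for start, count in sorted(freed):
        if count <= 0:
            continue  # ignore empty extents
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


def align_to_granularity(extent: Extent, granularity: int) -> Optional[Extent]:
    """Shrink an extent so both ends sit on granularity boundaries.

    Returns None if no whole granule remains inside the extent.
    """
    start, count = extent
    aligned_start = -(-start // granularity) * granularity  # round up
    aligned_end = ((start + count) // granularity) * granularity  # round down
    if aligned_end <= aligned_start:
        return None
    return (aligned_start, aligned_end - aligned_start)


if __name__ == "__main__":
    freed = [(8, 8), (0, 8), (35, 20)]
    for extent in coalesce_extents(freed):
        print(extent, "->", align_to_granularity(extent, 8))
```

Fewer, larger discard requests are generally cheaper for a device to process than many small ones, which is one reason a sketch like this merges ranges before aligning them.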
Replicated
After leaving DRBD (Distributed Replicated Block Device), largely developed by Vienna-based company Linbit, out in the cold in 2.6.32, kernel hackers have finally merged the replication solution, which is used predominantly in high availability environments, into Linux 2.6.33. DRBD can be roughly understood as a network-based RAID 1 device. The drive or drives on one system, designated as the master, are mirrored on a slave system in real time. Should the master fail, the slave takes over with no downtime. In order to ensure that data remains synchronised at all times, the master considers write access to be complete only when the slave has also completed the write. A detailed explanation of DRBD can be found in an LWN.net article and in the documentation on the DRBD website.
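The synchronous behaviour described above, where a write counts as complete only once the slave has also stored it, can be modelled in a few lines. The following Python sketch is a toy model and not DRBD code; the class names `Node` and `MirroredDevice` are hypothetical, and the handling of an unreachable peer is deliberately simplistic.

```python
class Node:
    """One machine holding a copy of the block device (toy model)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, block_no: int, data: bytes) -> None:
        if not self.online:
            raise IOError(f"{self.name} is unreachable")
        self.blocks[block_no] = data


class MirroredDevice:
    """Network RAID 1 in miniature: every write goes to both nodes."""

    def __init__(self, master: Node, slave: Node) -> None:
        self.master = master
        self.slave = slave

    def write(self, block_no: int, data: bytes) -> bool:
        """Return True only if both master and slave stored the block."""
        self.master.write(block_no, data)
        try:
            self.slave.write(block_no, data)
        except IOError:
            # Assumption: this sketch simply reports the write as
            # unacknowledged; a real system needs a policy here.
            return False
        return True

    def fail_over(self) -> None:
        """Promote the slave to master, e.g. after the master has failed."""
        self.master, self.slave = self.slave, self.master


if __name__ == "__main__":
    device = MirroredDevice(Node("alpha"), Node("beta"))
    print("acknowledged:", device.write(0, b"payload"))
    device.master.online = False
    device.fail_over()
    print("new master:", device.master.name, device.master.blocks)
```

Because every acknowledged write already exists on the slave, the promoted node holds all acknowledged data at the moment of fail-over, which is the property the article attributes to DRBD.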
Over recent months and years, a number of groups have worked on solutions for limiting the maximum amount of data which individual processes or groups of processes can write to or read from drives. These solutions affect various points within the kernel. The 'blkio controller cgroup interface' (blkio for short), which links into the CFQ (Completely Fair Queuing) I/O scheduler, has come out with its nose in front. This does not represent the failure of the other approaches, however; rather, it is intended to offer a foundation for further enhancements and functions made possible by some of the competing solutions. Background information on this topic can be found in an LWN.net article and in the documentation for the blkio framework.
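The general idea of sharing disk time between groups according to weights can be sketched with a simple simulation. The following Python code is a simplified model of weighted fair sharing and is not taken from the CFQ or blkio code; the function name, the fixed slice length and the tie-breaking rule are assumptions. Each group accumulates a virtual disk time that grows more slowly the higher its weight, and the group with the smallest virtual time is served next.

```python
from typing import Dict


def simulate(weights: Dict[str, int], slices: int, slice_ms: int = 100) -> Dict[str, int]:
    """Hand out `slices` time slices and return the disk time (ms) per group."""
    vdisktime = {group: 0.0 for group in weights}
    served = {group: 0 for group in weights}
    for _ in range(slices):
        # Serve the group that is furthest behind; break ties by name.
        group = min(vdisktime, key=lambda g: (vdisktime[g], g))
        served[group] += slice_ms
        # A higher weight means virtual time advances more slowly,
        # so the group is picked more often.
        vdisktime[group] += slice_ms / weights[group]
    return served


if __name__ == "__main__":
    print(simulate({"backup": 100, "database": 200}, slices=300))
```

In this model a group with twice the weight ends up with roughly twice the disk time, provided both groups always have requests pending.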
Optimised
Developers have removed the anticipatory I/O scheduler (AS), which, according to the commit comments, offers only a subset of the functions offered by the CFQ scheduler. The latter, which has long been the standard in many distributions, is now described as being suitable for both desktop and server environments. As with the process scheduler, almost every Linux version brings numerous changes that optimise the CFQ I/O scheduler for specific application scenarios. Details can be found in the links in the 'Minor gems' section at the end of this article and in the main git pull request from block subsystem maintainer Jens Axboe.
Improvements have been made to the code for migrating software RAIDs managed using mdadm to a different level. The MD subsystem now also supports write barriers. These ensure that data and file system journals are written in the sequence expected by other parts of the kernel. This should ensure better file system integrity in the event of a crash, but can palpably reduce throughput, as MD maintainer Neil Brown notes in his main git pull request. Support for write barriers in the device mapper (DM) has also been extended (git pull request). The device mapper now also offers a 'merge target', which can restore systems to a previous snapshot following, for example, a problematic system update (LWN.net article).
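The ordering guarantee that write barriers provide can be illustrated with a toy request queue. The following Python sketch is not how the MD or DM code works internally; the `BARRIER` sentinel and the function name are hypothetical. It assumes a scheduler that is free to sort writes by sector number, but only within the stretches of the queue that lie between two barriers.

```python
from typing import List, Union

BARRIER = "barrier"  # hypothetical sentinel marking a write barrier

Request = Union[int, str]  # a sector number to write, or BARRIER


def dispatch_order(queue: List[Request]) -> List[Request]:
    """Reorder writes by sector, but never move a write across a barrier."""
    ordered: List[Request] = []
    batch: List[int] = []
    for item in queue:
        if item == BARRIER:
            ordered.extend(sorted(batch))  # flush everything queued so far
            ordered.append(BARRIER)
            batch = []
        else:
            batch.append(item)
    ordered.extend(sorted(batch))
    return ordered


if __name__ == "__main__":
    # Journal blocks first, then a barrier, then the commit record.
    print(dispatch_order([90, 10, 40, BARRIER, 5]))
```

The sketch also hints at the throughput cost mentioned above: each barrier limits how freely requests can be reordered, so the scheduler has fewer opportunities to optimise.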
Drivers
The IDE subsystem drivers are now officially classed as deprecated. Users are advised to switch to the Libata subsystem PATA drivers, which have long been in the kernel and are no longer classed as experimental. In 2.6.33, these contain numerous minor enhancements and corrections, some originating from Bartlomiej Zolnierkiewicz, who, until a few months ago, maintained the IDE subsystem.
The SCSI subsystem contains two new drivers: 3w-sas for the LSI 3ware 9750 and vmw_pvscsi for the virtual hardware seen by guest systems under some VMware hypervisors. New Smart Array controllers from HP can now be addressed both by the block subsystem's cciss driver, which has seen various enhancements in 2.6.33, and by the new hpsa driver. The latter, being part of the SCSI subsystem, provides a standard device (/dev/sdx) for access, like all other SCSI and Libata drivers. Other new entries in this kernel version include the pm8001 driver for SAS/SATA HBAs containing the PMC Sierra SPC 8001 chip.
Miscellaneous
A few further file system and storage code related changes:
- According to the main git pull request from lead developer Chris Mason, Btrfs, the experimental 'next generation file system for Linux'-elect, saw primarily minor enhancements and fixes in 2.6.33.
- Ext4, which is based on the Ext2 and Ext3 file system code, can now mount Ext2 and Ext3 file systems. This allows environments which require the smallest possible kernel image to save a little space.
- Distributed file system Ceph was put forward for merger into 2.6.33. Torvalds has, however, chosen to omit it for now, explaining that this was partly due to time constraints and partly because there was not enough noise from kernel developers and distributors in favour of it. Background information can be found in this article on LWN.net.
- Although the ReiserFS code has long lacked an official maintainer, one developer has still taken the trouble to significantly reduce use of the big kernel lock (BKL) in the ReiserFS code. This should make the file system more scalable and in some cases a little more fleet of foot.
- Some of the more important changes in nilfs2 are described in the git pull request from nilfs2 maintainer Ryusuke Konishi.
- The Virtual File System (VFS) now correctly implements O_SYNC. Once again, further details can be found on LWN.net.
- There has been major restructuring work on the XFS file system code to replace XFS' own tracing code with code which uses the kernel's own tracing infrastructure. This is itself relatively new, but has been developed substantially over the last year.
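For readers unfamiliar with the O_SYNC flag mentioned in the list above, the following Python sketch shows how an application might request synchronous writes. It is a minimal example that assumes a Unix-like system where `os.O_SYNC` is available; the helper name `write_synchronously` is hypothetical. With O_SYNC set, each write call is expected to return only once the data has reached the storage device rather than just the page cache.

```python
import os
import tempfile


def write_synchronously(path: str, payload: bytes) -> int:
    """Write payload to path with O_SYNC and return the bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        # os.write may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "record.bin")
        print(write_synchronously(target, b"commit record"), "bytes written")
```

Synchronous writes trade throughput for durability, so applications typically reserve them for data such as journals or commit records.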
Minor gems
Many further minor, but by no means insignificant, changes can be found in the list below, which contains the commit headers referring to the respective change. Like many of the references in the text above, the links point to the relevant commit in the web front end of the Git branch for the kernel sources maintained by Linus Torvalds at kernel.org. The commit comments and the patches themselves provide extensive further information on the respective changes.
File systems
Btrfs
- Btrfs: Avoid superfluous tree-log writeout
- Btrfs: fail mount on bad mount options
- Btrfs: Make fallocate(2) more ENOSPC friendly
- Btrfs: make metadata chunks smaller
- Btrfs: Make truncate(2) more ENOSPC friendly
- Btrfs: Show discard option in /proc/mounts
Ext[234]
- ext3: make "norecovery" an alias for "noload"
- ext3: Support for vfsv1 quota format
- ext3: Unify log messages in ext3
- ext4: add tracepoint for ext4_forget()
- ext4: Do not override ext2 or ext3 if built they are built as modules
- ext4, jbd2: Add barriers for file systems with exernal journals
- ext4: make "norecovery" an alias for "noload"
- ext4: make trim/discard optional (and off by default)
- ext4: Support for 64-bit quota format
- ext4: Update documentation to correct the inode_readahead_blks option name
Various others
- aio: implement request batching
- CIFS: Enable mmap on forcedirectio mounts
- direct-io: cleanup blockdev_direct_IO locking
- exofs: Multi-device mirror support
- fat: make discard a mount option
- fiemap: Add new extent flag FIEMAP_EXTENT_SHARED
- GFS2: add barrier/nobarrier mount options
- GFS2: Add cached ACLs support
- GFS2: Add get_xquota support
- GFS2: Add get_xstate quota function
- GFS2: Add set_xquota support
- GFS2: Fix up system xattrs
- GFS2: Improve statfs and quota usability
- kill-the-BKL/reiserfs: add reiserfs_cond_resched()
- nfs41: add support for callback with RPC version number 4
- nfs41: add support for the exclusive create flags
- NFS: Display compressed (shorthand) IPv6 in /proc/mounts
- nfs: new subdir Documentation/filesystems/nfs
- NFS: Revert default r/wsize behavior
- nilfs2: add cache framework for persistent object allocator
- nilfs2: add norecovery mount option
- nilfs2: update mailing list address
- nilfs2: Using nobarrier option instead of barrier=off
- ocfs2: Always include ACL support
- procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm
- proc: partially revert "procfs: provide stack information for threads"
- proc: remove docbook and example
- quota: Implement quota format with 64-bit space and inode limits
- reiserfs: kill-the-BKL
- reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock
- reiserfs: remove /proc/fs/reiserfs/version
- sanitize xattr handler prototypes
- seq_file: use proc_create() in documentation
- UBIFS: support mounting of UBI volume character devices
- ufs: NFS support
- VFS: Export dquot_send_warning
- xfs: event tracing support
- xfs: improve metadata I/O merging in the elevator
- xfs: use DECLARE_EVENT_CLASS
Storage
Block
- Add a tracepoint for block request remapping
- blkio: Export disk time and sectors used by a group to user space
- blkio: Introduce per cfq group weights and vdisktime calculations
- blkio: Introduce the notion of cfq groups
- blkio: Introduce the root service tree for cfq groups
- blkio: Some debugging aids for CFQ
- block: add helpers to run flush_dcache_page() against a bio and a request's pages
- block: Allow devices to indicate whether discarded blocks are zeroed
- block: allow large discard requests
- block: Expose discard granularity
- block: use normal I/O path for discard requests
- cfq-iosched: enable idling for last queue on priority class
- cfq-iosched: fairness for sync no-idle queues
- cfq-iosched: fix no-idle preemption logic
- cfq-iosched: reimplement priorities using different service trees
- cfq: merge cooperating cfq_queues
DM
- dm: add request based barrier support
- dm exception store: add merge specific methods
- dm raid1: add framework to hold bios during suspend
- dm raid1: support flush
- dm: simplify request based suspend
- dm snapshot: add allocated metadata to snapshot status
- dm snapshot: permit only one merge at once
Libata
- ahci: let users know that Promise PDC42819 support is limited to SATA devices
- ata_piix: enable 32bit PIO on SATA piix
- libata: add comment documenting PIO latency issues on UP
- libata/drivers: Add driver for Apple "MacIO" IDE controller
- libata: MWDMA0 is unsupported on PIIX-like PATA controllers
- libata: Report zeroed read after TRIM and max discard size
- pata_cs5520: remove dead VDMA support
- pata_hpt37x: add proper cable detection methods
- pata_it8213: MWDMA0 is unsupported
- pata_legacy: fix QDI6580DP support
- pata_piccolo: Driver for old Toshiba chipsets
- sata_fsl: Add asynchronous notification support
- sata_mv: add power management support for the PCI controllers.
- sata_mv: add power management support for the platform driver
- sata_mv: support clkdev framework
- sata_sil24: MSI support, disabled by default
MD
- md: add honouring of suspend_{lo,hi} to raid1.
- md: add MODULE_DESCRIPTION for all md related modules.
- md: add 'recovery_start' per-device sysfs attribute
- md/raid1: add takeover support for raid5->raid1
- md: revise Kconfig help for MD_MULTIPATH
- md: support bitmap offset appropriate for external-metadata arrays.
- md: support updating bitmap parameters via sysfs.
- md: Support write-intent bitmaps with externally managed metadata.
- raid: improve MD/raid10 handling of correctable read errors.
MFD/MMC/MTD
- DaVinci: MMC: MMC/SD controller driver for DaVinci family
- mfd: Add 88PM8607 driver
- mfd: add AB4500 driver
- mfd: Add ADP5520/ADP5501 driver
- mfd: Add all twl4030 regulators to the twl4030 mfd driver
- mfd: Add power control platform data to SDHI driver
- mfd: Add SuperH Mobile SDHI platform driver
- mfd: Add support for remapping twl4030-power power states
- mfd: Add support for twl6030 irq framework
- mfd: Add support for WM8320 PMICs
- mfd: Add twl6030 regulator subdevices
- mfd: Initial support for twl5031
- mmc: add module parameter to set whether cards are assumed removable
- mmc: atmel-mci: new MCI2 module support in atmel-mci driver
- mmc: Blackfin SD Host Controller Driver
- mtd: add ARM pismo support
- mtd: Add bad block table overrides to Davinci NAND driver
- mtd: add bcmring nand driver
- mtd: add lock fixup for AT49BV640D and AT49BV640DT chips
- mtd: add nand_ecc test module
- mtd: add support for switching old SST chips into QRY mode
- mtd: m25p80: Add support for CAT25xxx serial EEPROMs
- mtd: m25p80: add support for Macronix MX25L4005A
- mtd: maps: remove obsolete ipaq-flash driver
- mtd: mtdoops: make record size configurable
- mtd: mtdoops: refactor as a kmsg_dumper
- mtd: nand: add option to quieten off the no device found messgae
- mtd: nandsim: add support for 4KiB pages
- mtd: OneNAND: multiblock erase support
- mtd: Really add ARM pismo support
- mtd: tests: fix read, speed and stress tests on NOR flash
- mxc_nand: Add NFC V2 support
- sdhci: add support for the SysKonnect CardBus2SDIO adapter
- sdhci-of: add support for the wii sdhci controller
- sdhci-of: reorganize driver to support additional hardware
SCSI
- SCSI: add scsi target reset support to scsi ioctl
- SCSI: be2iscsi: Adding msix and mcc_rings V3
- SCSI: be2iscsi: Adding support for various Async messages from chip
- SCSI: bnx2i: Add 5771E device support to bnx2i driver
- SCSI: bnx2i: update CQ arming algorith for 5771x chipsets
- SCSI: fcoe: add a separate scsi transport template for NPIV vports
- SCSI: fcoe, libfc: adds enable/disable for fcoe interface
- SCSI: fcoe: vport symbolic name support
- SCSI: fnic: Add FIP support to the fnic driver
- SCSI: ibmvfc: Add FC Passthru support
- SCSI: libfc: Add libfc/fc_libfc.[ch] for libfc internal routines
- SCSI: libfc: add some generic NPIV support routines to libfc
- SCSI: libfc: add support of receiving ELS_RLS
- SCSI: libfc, fcoe: Add FC passthrough support
- SCSI: libfcoe, fcoe: libfcoe NPIV support
- SCSI: libiscsi: add warm target reset tmf support
- SCSI: lpfc 8.3.5: Add AER support
- SCSI: lpfc 8.3.5: fix fcp command polling, add FIP mode, performance optimisations and devloss timout fixes
- SCSI: lpfc 8.3.5: fix reset path, ELS ordering and discovery issues
- SCSI: lpfc 8.3.5: fix sysfs parameters, vport creation and other bugs and update logging
- SCSI: lpfc 8.3.6 : FC Protocol Fixes
- SCSI: megaraid_sas: Add new megaraid SAS 2 controller support to the driver
- SCSI: megaraid_sas: Add poll mechanism to megaraid sas driver
- SCSI: megaraid_sas: add sysfs for AEN polling
- SCSI: megaraid_sas: add the IEEE SGE support to SAS2 controller
- SCSI: megaraid_sas: Add the support for updating the OS after adding/removing the devices from FW
- SCSI: megaraid_sas: Update version number and documentation
- SCSI: mpt2sas: Added command line option diag_buffer_enable.
- SCSI: mpt2sas: Add Extended Type for Diagnostic Buffer support
- SCSI: mpt2sas: Adding MPI Headers - revision L
- SCSI: mpt2sas : Add support for RAID Action System Shutdown Initiated at OS shutdown
- SCSI: mpt2sas: Add support in the driver to check for valid response info
- SCSI: mpt2sas: New device SAS2208 support is added
- SCSI: mpt2sas: Support for stopping driver when Firmware encounters
- SCSI: mvsas: add support for Adaptec ASC-1045/1405 SAS/SATA HBA
- SCSI: pm8001: enhance IOMB process modules
- SCSI: pmcraid: support SMI-S object model of storage pool
- SCSI: qla2xxx: Add firmware-dump kobject uevent notification.
- SCSI: scsi: Add missing command definitions
- SCSI: scsi_debug: Thin provisioning support
- SCSI: scsi_dh_rdac: Add two new IBM devices to rdac_dev_list
- SCSI: sd: WRITE SAME(16) / UNMAP support
- SCSI: stex: add small dma buffer support
- SCSI: stex: add support for reset request from firmware
Various others
- Add COH 901 318 DMA block driver v5
- cs5535: add pci id for AMD based CS5535 controllers
- IB: Fix typo in ipoib.txt
- MFD: twl4030: add twl4030_codec MFD as a new child to the core
- ppc440spe-adma: adds updated ppc440spe adma driver
- RDMA/nes: Add additional SFP+ PHY uC status check and PHY reset
- RDMA/nes: Add support for IB_WR_*INV
- RDMA/ucma: Add option to manually set IB path
For other articles on 2.6.33 and links to the rest of the "Coming in 2.6.33" series, see The H's Kernel Log - 2.6.33 Tracking page. (thl)
(crve)