Enhanced to within an inch of its life?

Further enhancements to the file system ensure that, irrespective of persistent preallocation, Ext4 files are wherever possible stored in one piece. Write access is initially buffered, so that the block allocator, which reserves data blocks for write operations in both Ext3 and Ext4, no longer needs to be called immediately for each 4 KB block of data (delayed allocation). Instead it allows multiple blocks to be allocated simultaneously. During large writes, this means that many blocks can be allocated in one go and ideally as a single extent (multi-block allocation).
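The difference between per-block and delayed, multi-block allocation can be illustrated with a toy model. This is only a sketch: `ToyAllocator`, `write_immediate` and `write_delayed` are hypothetical names invented here, and the real Ext4 allocator is considerably more involved.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # 4 KB blocks, as in the text


@dataclass
class ToyAllocator:
    """Hypothetical allocator: hands out contiguous blocks and counts its calls."""
    next_free: int = 0
    calls: int = 0

    def allocate(self, n_blocks):
        """Reserve n_blocks contiguous blocks and return (start, length)."""
        self.calls += 1
        start = self.next_free
        self.next_free += n_blocks
        return (start, n_blocks)


def blocks_needed(n_bytes):
    """Number of 4 KB blocks needed for n_bytes (ceiling division)."""
    return -(-n_bytes // BLOCK_SIZE)


def write_immediate(alloc, n_bytes):
    """No delayed allocation: the allocator is called once per block."""
    return [alloc.allocate(1) for _ in range(blocks_needed(n_bytes))]


def write_delayed(alloc, n_bytes):
    """Delayed allocation: buffer the write, then request all blocks at once."""
    return [alloc.allocate(blocks_needed(n_bytes))]
```

In this model a 1 MB write costs 256 allocator calls in the first case and a single call, returning one extent of 256 blocks, in the second.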

This change reduces file system overhead – and with it both system load and I/O bottlenecks – when an application writes large volumes of data, and prevents file fragmentation. Temporary files created for short periods only can spend their whole brief lives in the cache, never getting written to the drive. When mounting Ext4, the Ext4 code generates a list of free extents for each block group which remains in memory and is used by the block allocator to optimise distribution of files on the drive.

Delayed allocation is, however, a very aggressive caching strategy which increases the risk of data loss in the event of a system crash or power cut. It is not just that data remains in the cache for longer; delayed allocation also uncouples the writing of data and metadata. A newly created file can be recorded in the file system before the data has actually been written to the hard drive, albeit – until the data is actually written to the drive – with a size of 0 bytes and no data blocks allocated to it.

This has an unwelcome side effect on applications which take extra precautions when overwriting an existing file with new data. Many programmers first write the new data to a temporary file, which they then rename to the old name using rename(). The reasoning behind this is that, until the new data has actually been written, you are at least left with the previous version of the file.

With delayed allocation, however, the overwritten file may be completely empty following a crash, since the entry in the file system points to the newly created file which has not yet been loaded with data – rename() is a purely metadata operation. The same thing can occur if an application calls ftruncate() before writing the new data – truncation of the old data occurs much faster than writing the new data.

This behaviour under Ext4 is entirely in conformity with the POSIX standard and also occurs in other file systems, such as XFS. In its default "data=ordered" mode, Ext3 is of a sweeter temperament and only undertakes changes to metadata once the data has been written to the disc. This is a happy accident rather than a design decision, but nonetheless many Linux applications have come to rely on this behaviour.
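An application that does not want to depend on the file system's ordering can force the new data to disk before renaming. The following is a minimal sketch of that approach, not something taken from the article: the explicit fsync() calls are an addition here, and `replace_file_durably` is a hypothetical helper name.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace the contents of `path` with `data` (bytes) via temp file + rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the kernel to write the new data to the drive *before*
            # the rename makes the new file visible under the old name.
            os.fsync(tmp.fileno())
        # os.replace() is rename() on POSIX systems: a metadata-only operation.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory entry itself (works on Linux).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The price is that fsync() blocks until the data has reached the drive, which is exactly the cost delayed allocation was meant to avoid. Note also that mkstemp() creates the file with restrictive permissions, which a real application may need to adjust.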

The "0-byte problem" has led to heated discussions between kernel developers, with a clash developing between two opposing points of view – the view taken by file system developers, who are concerned with maximising performance and ensuring file system consistency, and the more pragmatic view taken by Linus Torvalds among others, that file systems should fulfil users' expectations, and that what they want above all else is to avoid data loss. In the forthcoming 2.6.30 kernel, Ext4 attempts to detect problem situations and, where they arise, to behave as Ext3 and write data before changing metadata. The problem does not arise in Ubuntu 9.04, as these patches have already been applied to the 2.6.28 kernel it uses.

In addition to the enhanced allocation strategy which is able to prevent most file fragmentation, Ext4 can also be defragmented online. A defragmenter should allow users to defragment either individual files or complete file systems, with the program essentially doing nothing more than copying data from one place to another. A defragmenter is particularly important where an existing Ext3 file system is to be migrated to Ext4, as it is able to convert files stored in Ext3 format into Ext4 format. Currently, however, the patches required for online defragmentation have not yet been integrated into the Linux kernel, and the defragmenter is not yet complete.

Better reliability

Some of the new features are aimed at improving file system reliability. The journal now adds a checksum to each transaction. This both allows detection of data incorrectly written to the journal and simplifies commits for completed transactions within the journal. Checksums are also used in block group descriptors.
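The idea of a per-transaction checksum can be sketched in a few lines. This is an illustration only: the record layout below is invented for the example and is not the real journal format, and CRC32 from Python's zlib module merely stands in for whatever checksum the journal uses.

```python
import struct
import zlib


def commit_record(blocks):
    """Build a toy commit record: block count plus a CRC32 over all blocks."""
    crc = 0
    for block in blocks:
        # zlib.crc32 accepts a running value, so blocks are chained together.
        crc = zlib.crc32(block, crc)
    return struct.pack("<II", len(blocks), crc)


def transaction_is_intact(blocks, commit):
    """At replay time, recompute the checksum and compare it with the commit."""
    return commit_record(blocks) == commit
```

A transaction whose blocks were only partly or incorrectly written no longer matches its commit record and can be skipped during journal replay.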

By default Ext4 uses the barrier mechanism offered by newer hard drives. Barriers affect the way write access is cached and sorted. The drive controller performs all writes issued before a barrier before starting on the writes issued after it. This makes it possible, for example, to ensure that all writes associated with a single transaction are performed on the file system before the commit is written to the journal. This mechanism can be disabled using the barrier=0 mount option.
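The ordering guarantee can be pictured with a toy model of a drive's write queue. This is a sketch under simplifying assumptions: `BARRIER` and `drive_write_order` are hypothetical names, writes are represented by bare block numbers, and "reordering" is modelled as sorting by block number.

```python
BARRIER = object()  # hypothetical marker separating groups of queued writes


def drive_write_order(queue):
    """Return the order in which a toy drive performs queued block writes.

    The drive may reorder writes freely (modelled here as sorting by block
    number) but never moves a write across a barrier.
    """
    order, batch = [], []
    for item in queue:
        if item is BARRIER:
            order.extend(sorted(batch))  # finish everything before the barrier
            batch = []
        else:
            batch.append(item)
    order.extend(sorted(batch))  # writes queued after the last barrier
    return order
```

With the queue `[900, 17, 450, BARRIER, 3]`, the write to block 3 (think of it as the commit) is performed last, even though sorting alone would have put it first.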

Ext4's extents allow more extensive consistency checking than Ext3's block lists. For example, it is possible to check whether a file's extents overlap. Extent headers in an extent tree also record the tree depth, which must be consistent across the entire tree, and the extents recorded in an extent index must cover the same part of a file as the index. By contrast, the indirect block addressing used in Ext3 means that it is not possible to distinguish a block containing block numbers from random data, so that only very rudimentary consistency checking is possible.
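An overlap check of this kind is simple to express. The sketch below is not how e2fsck implements it; `Extent` is a simplified, hypothetical representation, and only the logical (file offset) ranges are compared, although the same test could be applied to the physical block ranges.

```python
from collections import namedtuple

# logical: first file block covered, physical: first disk block,
# length: number of blocks in the extent
Extent = namedtuple("Extent", "logical physical length")


def find_overlaps(extents):
    """Return pairs of extents whose logical block ranges overlap."""
    overlaps = []
    furthest = None  # extent seen so far that reaches furthest into the file
    for cur in sorted(extents, key=lambda e: e.logical):
        if furthest is not None and cur.logical < furthest.logical + furthest.length:
            overlaps.append((furthest, cur))
        if furthest is None or cur.logical + cur.length > furthest.logical + furthest.length:
            furthest = cur
    return overlaps
```

An empty result means that no two extents claim the same part of the file. No comparable test exists for an Ext3 indirect block, because any 32-bit value in it could be a plausible block number.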

Ext4 performs a complete fsck significantly faster than Ext3. If the uninit_bg option is set (as it is by default in Ubuntu 9.04), mkfs.ext4 does not initialise all block groups. This not only accelerates file system creation, it also ensures that e2fsck only needs to check initialised inodes. Fsck time is thus solely dependent on the number of files and not on the total number of inodes present (and thus file system size).

Limits and performance

Ext4 breaks a number of barriers. It allows unlimited numbers of sub-directories within a directory – in Ext3 this was limited to 32,000. Inodes now have a default size of 256 bytes, compared to 128 bytes in Ext3. The uses Ext4 makes of the extra space include recording access times in nanoseconds rather than seconds and recording extended attributes directly in the inode.
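The finer timestamp resolution is visible from user space. As a small illustration, Python's `os.stat()` exposes the modification time in nanoseconds as `st_mtime_ns`; `mtime_parts` is a hypothetical helper name, and whether the nanosecond part is actually populated depends on the file system holding the file.

```python
import os


def mtime_parts(path):
    """Split a file's modification time into whole seconds and nanoseconds."""
    ns = os.stat(path).st_mtime_ns
    return divmod(ns, 1_000_000_000)
```

On a file system that only stores whole seconds, the second value is always 0.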

A patch for the current kernel version 2.6.29 from the Google developers makes it possible to run Ext4 without journaling, which, according to Google's measurements, can increase speed by up to two per cent. In view of the advantages journaling offers in the event of a crash or power cut, however, this should only be considered when you really need to squeeze out every last drop of I/O performance. Userland tools do not yet offer the option of running without journaling.

Double-entry bookkeeping

Although Ext4 developers like to stress compatibility with Ext3, this should be taken with a pinch of salt. It is indeed possible to mount an Ext3 file system as Ext4, but doing so has no effect on the file system – Ext4 is familiar with the old Ext3 block addressing schema and simply addresses the file system in exactly the same way as Ext3. It's possible to read from and write to an Ext3 file system mounted as Ext4 and subsequently re-use it as Ext3.

Only when extents are activated by setting the extents file system feature in Ext3 using

tune2fs -O extents

does the Ext4 code actually treat the file system as Ext4. Existing data is stored unchanged in Ext3 format, with a flag in the inode indicating whether the inode contains block numbers or extents. Only files which are created after converting to and mounting as Ext4 enjoy the benefits offered by the new data structures.
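The per-inode flag can be pictured as a single bit in the inode's flags field. The sketch below only decodes an integer that is assumed to have been read from an inode already. The constant is the `EXT4_EXTENTS_FL` value from the kernel's Ext4 headers, while `inode_uses_extents` is a hypothetical helper name.

```python
EXT4_EXTENTS_FL = 0x00080000  # inode flag: this inode's block map holds extents


def inode_uses_extents(i_flags):
    """Return True if the given inode flags word marks the inode as extent-mapped."""
    return bool(i_flags & EXT4_EXTENTS_FL)
```

On a running system the same information is shown by lsattr, which marks extent-mapped files with an 'e'.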

Converting a file system in this way does not therefore produce a 'true' Ext4 file system, rather it produces a hybrid of Ext3 and Ext4 structures. For users who wish to fully migrate a file system from Ext3 to Ext4, there is currently no getting around backing up the data and creating a new file system. The defragmentation tool discussed above may eventually offer an alternative: let loose on files saved in Ext3 format, it creates defragmented files in Ext4 format using extents.

Once the extents feature has been turned on, there's no easy way back to Ext3. Likewise a file system created directly as Ext4 using

mkfs -t ext4

can no longer be mounted as Ext3. This can lead to nasty surprises on old rescue systems with no Ext4 support!

