30 October 2012, 11:00

Comment: Ext4 bug - No need to panic!

by Oliver Diedrich

Much ado about an ultimately irrelevant issue: a bug in the ext4 filesystem turns out to be triggered only when several exotic options are combined. Apparently, the problem has only affected a single user.

A bug report from a user called "Nix" caused a big stir last week; the user had lost data on his ext4 filesystem. Although the problem soon turned out to be an isolated case that involved a combination of several risky options, the public search for its cause generated a lot of publicity and considerable commotion in the Linux world. The following article will, therefore, attempt to provide some clarification of the issue.

Ext4 is the current standard filesystem for Linux, and is considered robust, mature and well tested (see also "The Ext4 Linux filesystem" from The H Open). Robust, because the journal that was introduced with its predecessor, Ext3, guarantees the filesystem's integrity even if there is a power cut during a write operation; mature because Ext4 is the result of many years of development that began with Ext2 almost 20 years ago; and well tested because since then, the vast majority of Linux systems have been installed on Ext2, Ext3 or Ext4 – mass deployment and practical use are still the best way of testing.

Lazy unmount

Normally, a filesystem that's mounted under Linux can't be unmounted while a process is accessing a file within the filesystem, even if it's only a shell whose working directory is located in the filesystem. In this case, umount will report that the "device is busy". The usual solution is to use fuser or lsof to find the process that is blocking the unmount operation and then terminate that process.
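As a rough illustration of what such a lookup involves, the following Python sketch walks a /proc-style directory and reports the processes whose working directory or open files lie inside a given mount point. It is a simplified sketch under stated assumptions, not a replacement for fuser or lsof: the function names is_inside and blocking_pids are hypothetical, only the cwd and fd links are examined (the real tools consider more kinds of references), and processes belonging to other users can usually only be inspected as root.

```python
import os


def is_inside(path, mount_point):
    """Return True if path is the mount point itself or lies below it."""
    mount_point = os.path.normpath(mount_point)
    path = os.path.normpath(path)
    prefix = mount_point.rstrip("/") + "/"
    return path == mount_point or path.startswith(prefix)


def blocking_pids(mount_point, proc_root="/proc"):
    """Collect PIDs whose cwd or open files are inside mount_point.

    Hypothetical helper: it only follows the cwd link and the entries
    in fd/, so it can miss references that fuser or lsof would find.
    """
    pids = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process has exited or may not be inspected
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # unreadable or vanished link
            if is_inside(target, mount_point):
                pids.add(int(entry))
                break
    return sorted(pids)


if __name__ == "__main__":
    # Example mount point; replace with the filesystem that is "busy".
    print(blocking_pids("/mnt"))
```

Comparing against the mount point plus a trailing slash avoids false matches such as /mnt/database when the filesystem in question is mounted at /mnt/data.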

However, Linux processes can get stuck in such a way that they can no longer be terminated; or the unmount operation might be blocked by some other problem (such as an unresponsive NFS server) that can't easily be fixed. In such cases, a "lazy unmount" can be requested using the -l option of umount: the filesystem will be unmounted immediately, and the system will attempt to fix the resulting chaos (file handles without files, etc.) at a later stage. It should be quite obvious that umount -l is an option for special circumstances and may cause data loss.

Has all of that now gone down the drain because of the recent Ext4 bug? Of course not – quite the opposite in fact.

Any program code as complex as the code that's required for a modern, powerful filesystem will contain bugs – the Ext4 code consists of approximately 40,000 lines of source code. That triggering the error requires a combination of several mount options (nobarrier, journal_checksum and journal_async_commit), none of which are used by default, plus an added "lazy unmount" (see box) is an argument for, not against, the robustness of Ext4 and the quality of its code testing. That the Ext4 developers immediately began to investigate the bug even in a situation such as this is further testament to Ext4's maturity – otherwise, the developers would have had more important things to do than check out some esoteric bug that only seems to have affected a single user. Even Ext4 lead developer Theodore Ts'o has so far been unable to reproduce the bug on his systems.
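Readers who want to know whether any of their own filesystems use the options named above can check the mount table. The following Python sketch parses text in the format of /proc/mounts and lists the ext4 mounts that carry one or more of the three options. It is only an illustration: the names RISKY_OPTIONS and risky_ext4_mounts are hypothetical, treating barrier=0 as equivalent to nobarrier follows the documented ext4 option syntax, and the exact spelling a given kernel prints in /proc/mounts is an assumption.

```python
# Options named in the article; none of them is enabled by default.
RISKY_OPTIONS = {"nobarrier", "journal_checksum", "journal_async_commit"}


def risky_ext4_mounts(mounts_text):
    """Return (mount_point, [options]) for ext4 mounts using any risky option.

    mounts_text is expected to look like /proc/mounts:
    device, mount point, filesystem type, comma-separated options, ...
    """
    findings = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue  # malformed line or a different filesystem
        options = set(fields[3].split(","))
        if "barrier=0" in options:
            options.add("nobarrier")  # alternative spelling of the same option
        hits = sorted(options & RISKY_OPTIONS)
        if hits:
            findings.append((fields[1], hits))
    return findings


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        for mount_point, hits in risky_ext4_mounts(mounts.read()):
            print(mount_point, ", ".join(hits))
```

An empty result means that none of the mounted ext4 filesystems use these options. According to the report, all three were in use together, along with a lazy unmount, when the data loss occurred.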

We can, therefore, confidently assume that our data is safe on Ext4 – or at least that it is safer than it would be on other Linux filesystems: the development of ReiserFS/Reiser4 has fizzled out, the Btrfs "Next Generation Filesystem" hasn't really become suitable for production use yet, and XFS, which was originally developed for Irix, has become stuck on the sidelines. Filesystem bugs also exist in other operating systems – but they aren't such public discussion topics.

It is commendable that, despite this, the Ext4 developers want to draw useful conclusions from their handling of the Ext4 bug, although they could have done this much earlier: a year ago, Ted Ts'o had already warned at LinuxCon Europe in Prague that some mount options may cause problems with Ext4 because they haven't been extensively tested. At least these options – some of which were only ever intended for developers – will now disappear from the production code, or they will trigger a warning. That, too, is a kind of bug fix ...

Permalink: http://h-online.com/-1738995