Kernel developers squabble over Ext3 and Ext4
A number of senior kernel developers, including Linus Torvalds, Ted Ts'o, Alan Cox and Ingo Molnar, have been squabbling over the merits of journaling and delayed allocation in Ext3 and Ext4.
The trigger for the discussion was a response from Jesper Krogh to Torvalds' announcement of kernel version 2.6.29, in which he described massive delays in writing out the file system cache on Ext3 file systems despite fast RAID arrays on computers with lots of RAM.
Ingo Molnar identified the maximum dirty ratio – the proportion of memory that may hold modified ("dirty") data still waiting to be written to the hard drive – as a major part of the problem. The dirty ratio is still, according to Molnar, five percent of the RAM, which in the past, on systems with, say, 1 GB of RAM, didn't really matter.
On modern servers with 32 GB RAM and eight or 16 processors producing dirty pages in parallel, the amount of data to be written between two file system synchronisations can swell to as much as 1.6 GB. Since, however, the data transfer rate of hard drives has not increased at the same rate as the available RAM, synchronisation can now, under unfavourable circumstances, take longer than it used to. Ted Ts'o points out that synchronisation in Ext3 is carried out every five seconds and that consequently, on the whole, less data is written.
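The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the five percent ratio and the RAM sizes come from the discussion above, while the function names and the sustained disk throughput of 100 MB/s are assumptions chosen for the example, not figures quoted by the developers.

```python
GB = 10**9  # decimal gigabyte, used for simplicity
MB = 10**6

def max_dirty_bytes(ram_bytes: int, dirty_ratio_percent: float) -> float:
    """Upper bound on dirty data for a given RAM size and dirty ratio."""
    return ram_bytes * dirty_ratio_percent / 100

def flush_seconds(dirty_bytes: float, disk_bytes_per_second: float) -> float:
    """Time needed to write the dirty data at a constant throughput."""
    return dirty_bytes / disk_bytes_per_second

# Assumption for illustration only: the disk sustains 100 MB/s.
ASSUMED_DISK_THROUGHPUT = 100 * MB

for ram_gb in (1, 32):
    dirty = max_dirty_bytes(ram_gb * GB, 5)
    seconds = flush_seconds(dirty, ASSUMED_DISK_THROUGHPUT)
    print(f"{ram_gb} GB RAM: up to {dirty / GB:.2f} GB dirty, "
          f"about {seconds:.1f} s to flush")
```

Under that assumed throughput, the 1.6 GB of dirty data on the 32 GB machine would take roughly 16 seconds to write out, compared with about half a second for the 0.05 GB on a 1 GB machine.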
Linus Torvalds took up Ts'o's point about five second intervals and described the default Ext4 behaviour of writing normal data from the cache only every 120 seconds whilst writing metadata more rapidly as "insane". Ext3 mounted with the "data=writeback" option has the same behaviour, but, complains Torvalds, this is at least not the default setting (Ext3 is usually mounted with "data=ordered", so that changes to metadata only become valid after writing the payload data).
This procedure is, in Torvalds' opinion, "braindamage", and he professes incomprehension at how file system programmers can accept "clean fsck, but data loss," as is typically the case with writeback caching.
Responding to Ted Ts'o's suggestion of simply switching to Ext4 on single user machines to avoid synchronisation problems, Torvalds comments that Ext4 defaults to the "crappy", "insane" writeback cache. "Sure, it makes things much smoother, since now the actual data is no longer in the critical path for any journal writes, but anybody who thinks that's a solution is just incompetent," he adds, going on to observe that one could just as well go back to Ext2. He rounds off by noting that if, as in Ext4, payload data hits the disk long after the metadata, all sorts of problems are going to arise if the computer crashes.
Ts'o responds that the delayed writing of payload data achieves very good performance boosts and that the method is covered by the POSIX standard. In Ext2, he continues, a lengthy fsck check to identify incompletely written data was always required following a crash. In Ext4 this is no longer necessary, as a consistent state is always guaranteed after writing out the journal – even if the payload data has not made it onto the disk. In addition, it is possible to individually adjust the frequency with which payload data is written from the cache to the hard drive and in extremis to omit this altogether. System administrators can, he adds, also optimise their configuration for best performance, depending on the hardware used.
For Torvalds, the issue of whether fsck needs to be run following a crash is of secondary importance – if a file system has been damaged by a crash and it has not been possible to write data, the file system is no longer sound anyway. He opines that silent corruption, which in the absence of fsck is not immediately apparent, is worse than the situation, as in Ext2, where problems with the file system following a crash are unmistakable. He fails to understand how file system developers can see fsck as the real problem, adding that this is "obviously bogus".
The point, he goes on, is that a longer delay between writing metadata and writing real data makes it more likely that a file will be damaged or incomplete after a crash than if the two are written directly one after another. If payload data is written first, no damage to the file system will occur.
This is the reason, he continues, why he detests the idiotic writeback cache in Ext3, which does everything the wrong way round, writing real data after the metadata that points to it. "Whoever came up with this solution", he concludes, "was a moron, no ifs, buts or maybes about it."
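The "data first, metadata second" ordering can be illustrated at application level. The following Python sketch is not how Ext3 or Ext4 work internally and is not taken from the discussion; it merely shows one common way a program can ensure that payload data is on disk before the directory entry pointing to it changes. The function name is hypothetical.

```python
import os
import tempfile

def write_data_before_metadata(path: str, payload: bytes) -> None:
    """Write payload to a temporary file, force it to disk, then rename.

    The rename (a metadata change) only happens after the payload
    has been flushed, so the name never points at unwritten data.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()              # empty Python's userspace buffer
            os.fsync(tmp.fileno())   # ask the kernel to write the data out
        os.replace(tmp_path, path)   # metadata update comes last
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A fully durable version would typically also sync the containing directory after the rename; that step is omitted here to keep the sketch short.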