In association with heise online

Indirect

How is it possible to fit the millions of data block numbers required for gigabyte-sized files into a static data structure of 128 bytes? It isn't - one Ext3 inode stores exactly 15 block numbers. The first twelve point directly to data blocks, number 13 points to a block containing block numbers (indirectly addressed blocks), number 14 to a block pointing to blocks with block numbers (double indirect), and number 15 to triple indirect blocks. Therefore, at a block size of 4 KB (that is, 1024 block numbers of 4 bytes each per indirect block) one inode can handle 12 + 1024 + 1024² + 1024³ block numbers, around a billion.

Figure 2: Indirect addressing allows several Tbytes to be addressed using 15 block numbers.
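
The arithmetic behind the 15 block numbers can be checked with a short sketch. The function names and constants here are illustrative and not taken from the Ext3 sources; the only assumptions are the ones the text states (twelve direct block numbers, 4 bytes per block number).

```python
DIRECT_POINTERS = 12   # the first twelve block numbers point straight to data
POINTER_SIZE = 4       # bytes per block number, as stated in the text


def addressable_blocks(block_size):
    """Data blocks reachable through the 15 block numbers of one inode."""
    per_block = block_size // POINTER_SIZE   # block numbers per indirect block
    return DIRECT_POINTERS + per_block + per_block**2 + per_block**3


def indirection_level(index, block_size):
    """Return 0 (direct) to 3 (triple indirect) for a file's logical block index."""
    per_block = block_size // POINTER_SIZE
    limit = DIRECT_POINTERS
    for level in range(4):
        if index < limit:
            return level
        limit += per_block ** (level + 1)
    raise ValueError("block index beyond what one inode can address")


blocks = addressable_blocks(4096)
print(blocks)                        # addressable 4 KB blocks
print(blocks * 4096 / 2**40)         # theoretical file size in TiB
print(indirection_level(12, 4096))   # level of the first block past the direct ones
```

With 4 KB blocks the sum comes to slightly more than 1024³ blocks, which is where the "just over 4 TB" figure comes from.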

The resulting maximum file size of just over 4 TB, however, is a theoretical proposition since the inodes save the number of 512-byte hard disk sectors that belong to a file - stat and the stat command in debugfs output this value as "block count". As this block count is a 32-bit value, the maximum file size of Ext3 is actually 2 TB.
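
As a rough check of the 2 TB figure, the following sketch multiplies the largest value of a 32-bit counter by the 512-byte sector size. The function name and its parameters are made up for illustration.

```python
def max_file_size(counter_bits=32, sector_size=512):
    """Largest file size (in bytes) that an unsigned sector counter can describe."""
    max_sectors = 2**counter_bits - 1
    return max_sectors * sector_size


limit = max_file_size()
print(limit)            # bytes
print(limit / 2**40)    # in TiB, just under 2
```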

Incidentally, the 2 GB maximum file size that some older applications struggle with isn't caused by Ext3 but by the system calls for accessing files. The functions and data structures for file access traditionally use a signed 32-bit file offset (a pointer which can address any byte within a file), which results in a maximum file size of 2³¹-1 bytes. However, this restriction has been removed with Large File Support (LFS), which uses 64-bit file offsets.
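
The effect of the offset width can be illustrated with Python's struct module, which refuses to pack a value that does not fit into a signed integer of the chosen size. This is only a sketch of the number ranges involved, not of the actual system call interface; the helper name is hypothetical.

```python
import struct

MAX_OFFSET_32 = 2**31 - 1   # largest signed 32-bit file offset
MAX_OFFSET_64 = 2**63 - 1   # largest signed 64-bit file offset


def fits_in_signed(offset, bits):
    """True if offset can be stored in a signed integer of the given width."""
    fmt = {32: "<i", 64: "<q"}[bits]
    try:
        struct.pack(fmt, offset)
        return True
    except struct.error:
        return False


print(MAX_OFFSET_32)
print(fits_in_signed(2**31 - 1, 32))   # last byte offset a 32-bit API can express
print(fits_in_signed(2**31, 32))       # one byte further no longer fits
print(fits_in_signed(2**31, 64))       # but fits easily into 64 bits
```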

Grouped

Aside from the directories, inode tables and data blocks themselves, Ext3 also uses several other data structures. Two bitmaps keep a record of which data blocks are in use and which are free. To ensure efficient access in this area, Ext3 organises the file system in block groups - a kind of partitioning within the file system.

Hard disk heads jump between data blocks, directories, inode tables and the inode and block bitmaps when reading, writing and creating files. However, every movement of the disk heads takes an average of several milliseconds - enough time to read a megabyte of data from a fast disk. Assuming that head movements take longer as the distance between the respective disk sectors increases, it is advantageous to store related parts of the file system structure and data in close proximity to each other.

Ext3 does this with block groups: Apart from a block group descriptor with some statistics about how much of the block group is in use, every block group contains a part of the inode table as well as the part of the inode and block bitmap which maps this section of the inode table and the data blocks of the block group. Ext3 tries to create files in such a way that the metadata - inodes and directories - and their related data blocks are located within one block group.
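
Because every block group holds an equally sized slice of the inode table and of the data blocks, the group that an inode or block belongs to follows from a simple division. The sketch below assumes the conventional ext2/ext3 layout (inode numbers start at 1, groups are numbered from 0); the function names and the example value of 8192 inodes per group are illustrative and not taken from the text.

```python
def locate_inode(inode, inodes_per_group):
    """Return (block group, index within that group's part of the inode table)."""
    if inode < 1:
        raise ValueError("inode numbers start at 1")
    return divmod(inode - 1, inodes_per_group)


def group_of_block(block, blocks_per_group, first_data_block=0):
    """Return the block group that contains the given block number."""
    return (block - first_data_block) // blocks_per_group


print(locate_inode(2, 8192))          # root directory inode
print(locate_inode(8193, 8192))       # first inode of the second group
print(group_of_block(40000, 32768))   # a block in the second group
```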

The number of created block groups depends on the size of the file system. Although this parameter can be set with option -g in mke2fs, Ext3 developers advise not to customise it as mke2fs already chooses the optimum value. Since only one data block is reserved for the block group's block bitmap, a block group can only contain a maximum of 32,768 blocks or 128 MB at a block size of 4 KB.
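
The upper limit follows from the bitmap having one bit per block. A minimal sketch of that calculation, with hypothetical function names; the group count is a plain ceiling division and may differ from what mke2fs actually creates for a given device:

```python
def max_blocks_per_group(block_size):
    """A one-block bitmap has block_size * 8 bits, one per block in the group."""
    return block_size * 8


def max_group_bytes(block_size):
    """Largest possible block group in bytes."""
    return max_blocks_per_group(block_size) * block_size


def group_count(fs_blocks, block_size):
    """Number of maximum-sized groups needed to cover fs_blocks (rounded up)."""
    per_group = max_blocks_per_group(block_size)
    return -(-fs_blocks // per_group)


print(max_blocks_per_group(4096))            # blocks per group at 4 KB
print(max_group_bytes(4096) // 2**20)        # group size in MB
print(group_count(100 * 2**30 // 4096, 4096))  # groups in a 100 GB file system
```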

Super

Ext3's last data structure is the superblock describing the file system itself. It contains all the information necessary for correctly interpreting the data within the file system: Block size, number of blocks and inodes, block groups, inode size, first valid inode (some inodes are reserved for internal purposes, for example inode 2 for the root directory and inode 8 for the journal in Ext3). This (and more) information is returned by the debugfs command stats and by the dumpe2fs -h command.
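
To give an impression of how these fields are laid out, the following sketch decodes a handful of them from raw superblock bytes. The field offsets follow the commonly documented ext2/ext3 on-disk format (superblock 1024 bytes into the device, little-endian fields, magic number 0xEF53) and should be verified against the kernel headers before being relied on. The byte string used here is synthetic, built by a helper so that the example runs without a real device.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # where the primary superblock starts on a device
EXT_MAGIC = 0xEF53         # signature shared by ext2 and ext3


def parse_superblock(raw):
    """Decode a few fields from the raw bytes of a superblock."""
    (inodes, blocks, reserved, free_blocks, free_inodes,
     first_data_block, log_block_size) = struct.unpack_from("<7I", raw, 0)
    (blocks_per_group,) = struct.unpack_from("<I", raw, 32)
    (inodes_per_group,) = struct.unpack_from("<I", raw, 40)
    mount_count, max_mount_count, magic, state = struct.unpack_from("<HhHH", raw, 52)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "reserved_blocks": reserved,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "first_data_block": first_data_block,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "mount_count": mount_count,
        "max_mount_count": max_mount_count,
        "clean": bool(state & 1),
    }


def synthetic_superblock():
    """Made-up superblock bytes for demonstration purposes only."""
    raw = bytearray(1024)
    struct.pack_into("<7I", raw, 0, 65536, 262144, 13107, 200000, 65000, 0, 2)
    struct.pack_into("<I", raw, 32, 32768)
    struct.pack_into("<I", raw, 40, 8192)
    struct.pack_into("<HhHH", raw, 52, 7, 100, EXT_MAGIC, 1)
    return bytes(raw)


info = parse_superblock(synthetic_superblock())
print(info["block_size"], info["blocks"], info["mount_count"], info["clean"])
```

On a real system the 1024 bytes would be read from the device or an image file after seeking to SUPERBLOCK_OFFSET, which normally requires root privileges; debugfs and dumpe2fs -h remain the more convenient way to inspect these values.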

Blocks marked as reserved may only be allocated by root. By default, five percent of all blocks are reserved, though this can be changed using the mke2fs parameter -m. The reservation is meant to ensure that the system still has a little breathing space if a user completely fills the file system.
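
How much space the reservation takes can be estimated as follows. This is a simplified sketch with hypothetical function names; the exact rounding used by mke2fs is not modelled.

```python
def reserved_blocks(total_blocks, percent=5.0):
    """Approximate number of blocks set aside for root."""
    return int(total_blocks * percent / 100)


def free_for_users(free_blocks, reserved):
    """Blocks that ordinary users can still allocate."""
    return max(free_blocks - reserved, 0)


total = 100 * 2**30 // 4096          # a 100 GB file system with 4 KB blocks
reserved = reserved_blocks(total)
print(reserved * 4096 / 2**30)       # reserved space in GB
print(free_for_users(1000, reserved))  # users see a full disk before root does
```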

In addition, the superblock stores status information (for example, the number of free inodes and blocks), when the file system was last mounted and when it was last checked by e2fsck, as well as its current state (clean, if everything is ok). This is also where the settings can be found that cause a file system check to be carried out just when users need to "just quickly boot" the system. Ext3 stores both a maximum number of mounts and a maximum time span since the last e2fsck run. When one of these limits is reached, the system forces a check to be carried out. The limits can safely be increased for computers which are booted often:

tune2fs -c 100 -i 180 device

allows 100 mounts or half a year between e2fsck checks. A value of 0 disables automatic testing completely. While the advice never to turn this feature off can be found all over the internet, it has lost some of its relevance: disk defects can be detected faster with smartmontools, which directly test the disk status, and memory or chipset problems are likely to show earlier, in places other than damaged Ext3 data structures.
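
The two limits can be thought of as a simple either/or test. The sketch below is a model of the rule as described here, not the actual e2fsck logic; the function name, the parameter names and the use of seconds as the time unit are assumptions.

```python
DAY = 24 * 60 * 60   # seconds


def check_due(mount_count, max_mounts, last_check, interval, now):
    """True if either limit has been reached; a limit of 0 is disabled."""
    too_many_mounts = max_mounts > 0 and mount_count >= max_mounts
    too_long_ago = interval > 0 and now - last_check >= interval
    return too_many_mounts or too_long_ago


# Limits corresponding to -c 100 -i 180 (180 days)
print(check_due(42, 100, last_check=0, interval=180 * DAY, now=30 * DAY))
print(check_due(100, 100, last_check=0, interval=180 * DAY, now=30 * DAY))
print(check_due(42, 100, last_check=0, interval=180 * DAY, now=200 * DAY))
print(check_due(500, 0, last_check=0, interval=0, now=900 * DAY))
```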

Secure

Because the superblock is vital for file system access, mke2fs plays it safe and writes it to disk in several different places. Should e2fsck refuse to work with a damaged file system because the superblock is corrupted, the tool can be instructed to use one of the backup superblocks instead:

e2fsck -b superblock device

Mke2fs returns the block numbers for the alternative superblocks after creating the file system, but users often forget to record them. Here's a useful mke2fs option which can help:

mke2fs -n device

just pretends to create the file system, but returns all its parameters - including the positions of the superblocks. Of course, this only works if mke2fs is called using the same options which were used when the file system was first set up. Normally, however, the formatting routine doesn't need to be given any options at all - mke2fs automatically chooses the values suitable for most applications.
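
If the file system parameters are known, the backup positions can also be estimated by hand. The sketch below assumes the sparse_super feature, under which backups are conventionally kept only in block group 1 and in groups whose number is a power of 3, 5 or 7, and it assumes that each backup sits in the first block of its group. Both assumptions should be checked against the output of mke2fs -n; the function names are hypothetical.

```python
def is_power_of(n, base):
    """True if n is base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_groups(group_count):
    """Block groups holding a backup superblock under sparse_super."""
    return [g for g in range(1, group_count)
            if any(is_power_of(g, base) for base in (3, 5, 7))]


def backup_superblocks(group_count, blocks_per_group, first_data_block=0):
    """Block numbers that could be passed to e2fsck -b."""
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count)]


# 130 groups of 32,768 blocks each (4 KB block size)
print(backup_superblocks(130, 32768))
```

For file systems with a 1 KB block size the first data block is usually block 1 rather than 0, which shifts every position by one.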

For the worst case scenario, mke2fs also provides option -S – it only rewrites the superblocks and block group descriptors, while directories as well as inode and bitmap tables remain intact. An e2fsck run is required after this procedure. In the best case scenario, all files are accessible again afterwards, but there is no guarantee - at worst, all the data is lost. Of course, mke2fs -S also has to be called with exactly the same options that were used when the file system was created.

Intricacies

Among the other information the superblock can hold are the mount options to be included when mounting the file system without the user explicitly stating them. For example, the command

tune2fs -o acl device

ensures that the file system is always mounted with support for Access Control Lists (ACLs). The file system's volume name, which some distributions use in /etc/fstab, is also stored in the superblock. It can be set with mke2fs when the file system is created, or later with the tune2fs parameter -L.

The "file system features" define various file system properties - whether it needs to be checked for errors (the needs_recovery flag is set when mounting and cleared when unmounting the file system, and so shows whether the file system was unmounted properly); whether the file system can be enlarged with resize2fs (resize_inode); whether it supports file sizes larger than 2 GB (large_file); whether only a limited number of backup superblocks were created (sparse_super); whether directory entries contain information about the file type (filetype); or whether directory entries are stored in tree structures (dir_index). Some of these properties can be adjusted via tune2fs -O (the ^ prefix disables a set feature); for safety reasons, this should be followed by a forced e2fsck run using the -f option. However, these features are best included via mke2fs -O when the file system is created.
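
The ^ notation can be modelled as a set operation: names without a prefix are added, names with a ^ prefix are removed. The sketch below only illustrates the notation with a comma-separated list; it does not reflect which features tune2fs actually allows to be changed on an existing file system, and the function name is made up.

```python
def apply_feature_spec(features, spec):
    """Apply a comma-separated feature list such as '^dir_index,large_file'."""
    result = set(features)
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("^"):
            result.discard(item[1:])   # ^ prefix: switch the feature off
        else:
            result.add(item)           # no prefix: switch the feature on
    return result


current = {"sparse_super", "filetype", "resize_inode", "dir_index"}
print(sorted(apply_feature_spec(current, "^dir_index,large_file")))
```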

Mke2fs retrieves the features to be set as defaults from the /etc/mke2fs.conf file. Most distributions set sparse_super, filetype, resize_inode and dir_index as defaults. If the operating system supports large files, mke2fs also includes the large_file feature.

Permalink: http://h-online.com/-746480