Kernel Log: Coming in 3.5 (Part 5) - Infrastructure
by Thorsten Leemhuis
The kernel now better isolates containers and suspicious code. Event logging has been optimised and two important Android functions were merged.
When releasing Linux 3.5-rc6, Linus Torvalds gave no indication as to when he might publish Linux 3.5, but he had previously pointed out some of the implications the holiday period in the northern hemisphere might have for the main development phase of Linux 3.6. The sixth pre-release version may well have been the last rc for Linux 3.5. As this kernel version will likely appear soon, this article describes the changes to the infrastructure of Linux 3.5 and so concludes the "Coming in 3.5" Kernel Log mini-series.
The first four parts of the series already discussed the improvements in networking, filesystems and storage, architecture, and drivers. In the networking area, however, there was one latecomer that should get a mention: the ipheth driver in Linux 3.5 not only supports tethering with various iPhone models, but now also supports iPads.
Enhancements to user namespaces, a feature for improving the isolation of Linux containers, have been merged into Linux 3.5; they were developed primarily by Eric W. Biederman. The developer describes the patch collection as a "course correction for the user namespace", with the result being that the kernel implementation is now "inexpensive, maintainable and reasonably complete".
The changes mean a clean separation of user and group IDs between host and container. As a result, users who have root privileges in a container no longer, for example, have full access to all files in the directories /proc/ and /sys/. This had previously allowed root users within a container to influence the behaviour of the host system.
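The separation of IDs can be pictured as a range mapping between IDs inside the container and IDs on the host. The following is a minimal sketch, not kernel code: it assumes the three-column format (first ID inside, first ID outside, range length) that the user namespace documentation describes for /proc/<pid>/uid_map, and the sample mapping values are hypothetical.

```python
from typing import List, Optional, Tuple

# Each entry: (first ID inside the namespace, first ID outside, range length),
# mirroring the three columns assumed for a /proc/<pid>/uid_map line.
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse uid_map-style text into a list of mapping ranges."""
    entries: UidMap = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_host_uid(uid_map: UidMap, container_uid: int) -> Optional[int]:
    """Translate a UID seen inside the container to the host UID, if mapped."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # no mapping: the ID has no meaning on the host


# Hypothetical mapping: container IDs 0..65535 -> host IDs 100000..165535.
SAMPLE = "         0     100000      65536\n"

if __name__ == "__main__":
    uid_map = parse_uid_map(SAMPLE)
    print(to_host_uid(uid_map, 0))      # container root
    print(to_host_uid(uid_map, 1000))   # an ordinary container user
    print(to_host_uid(uid_map, 70000))  # outside the mapped range
```

With such a mapping, root inside the container corresponds to an unprivileged ID on the host, which is why it no longer has full access to host-owned files.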
Background information on the new approach can be found in an LWN.net article. In an email, Biederman explains that he has successfully booted an unmodified Debian in a container secured with the user namespace enhancements, but that further changes to the kernel code are necessary before this feature is sufficiently mature to be used in distribution kernels.
The kernel contains a number of major changes to logging and logging-related functions aimed at improving the reliability of logging information output and allowing automatic analysis of logs (1, 2, 3, and others). Output now runs through a record buffer aimed at preventing the mixing of different output flows. Each log record also carries meta information – such as the time, device context and a sequence number; this makes it easier to, for example, filter log data for events involving specific devices. More detail can be found in an article on LWN.net. The new approach initially buffered output too aggressively, causing problems; these were fixed in 3.5-rc5.
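To illustrate how such structured records lend themselves to automatic analysis, here is a minimal sketch of a parser. It assumes the record layout described in the kernel's /dev/kmsg ABI documentation (a comma-separated prefix of priority, sequence number and timestamp in microseconds, then ";" and the message, followed by optional " KEY=value" lines); the sample record and device name are hypothetical.

```python
from typing import Dict, NamedTuple


class KmsgRecord(NamedTuple):
    facility: int
    level: int
    seq: int
    timestamp_us: int
    message: str
    context: Dict[str, str]


def parse_kmsg_record(raw: str) -> KmsgRecord:
    """Parse one record in the assumed /dev/kmsg text layout."""
    lines = raw.rstrip("\n").split("\n")
    header, _, message = lines[0].partition(";")
    # Only the first three prefix fields are used; extra fields are ignored.
    fields = header.split(",")
    prio, seq, timestamp_us = int(fields[0]), int(fields[1]), int(fields[2])
    context: Dict[str, str] = {}
    for line in lines[1:]:
        # Meta information follows as lines that start with a single space.
        if line.startswith(" ") and "=" in line:
            key, _, value = line[1:].partition("=")
            context[key] = value
    # The priority value combines the syslog facility and level.
    return KmsgRecord(prio >> 3, prio & 7, seq, timestamp_us, message, context)


# Hypothetical record, made up for illustration.
SAMPLE = (
    "4,1204,98765432,-;e1000e: eth0 link is down\n"
    " SUBSYSTEM=pci\n"
    " DEVICE=+pci:0000:00:19.0\n"
)

if __name__ == "__main__":
    record = parse_kmsg_record(SAMPLE)
    # Filtering by device context becomes a simple dictionary lookup.
    if record.context.get("DEVICE") == "+pci:0000:00:19.0":
        print(record.seq, record.timestamp_us, record.message)
```

Because the sequence number and device context are separate fields, a log analyser does not need to guess them from the free-form message text.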
The seccomp filters mechanism now enables programs to set up filters (using Berkeley Packet Filter syntax) that regulate which system calls software started by this program can use (1, 2, 3, 4 and others). This could be useful for providing added security in the context of virtualisation or sandbox solutions, for example. One area of use is browsers which need to execute untrusted code. The version of Chrome in Ubuntu 12.04 LTS uses seccomp to provide additional security for the Flash plugin. Further background information can be found in the documentation and on LWN.net.
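As a rough illustration of what such a filter looks like, the sketch below builds a small allowlist program from classic BPF instructions and evaluates it with a tiny simulator. It is an assumption-laden sketch rather than a definitive implementation: the opcode and return values are taken from the kernel headers, the syscall numbers are those of x86-64, the helper names are made up for this example, and the filter is deliberately not installed into the kernel.

```python
import struct
from typing import Iterable, List, Tuple

# Classic BPF opcodes (values as in <linux/filter.h>).
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load 32-bit word at offset k
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: jump if accumulator == k
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return the constant k

SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E

# Offsets of the first two fields in struct seccomp_data.
OFF_NR, OFF_ARCH = 0, 4

Insn = Tuple[int, int, int, int]  # (code, jt, jf, k) as in struct sock_filter


def build_allowlist(syscalls: Iterable[int]) -> List[Insn]:
    """Build a filter that allows only the given syscall numbers."""
    allowed = list(syscalls)
    if len(allowed) > 255:
        raise ValueError("jump offsets are 8 bits wide in classic BPF")
    prog: List[Insn] = [
        (BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # skip the kill on a match
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL),
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
    ]
    n = len(allowed)
    for i, nr in enumerate(allowed):
        # On a match, jump over the remaining checks and the kill to ALLOW.
        prog.append((BPF_JEQ_K, n - i, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return prog


def pack(prog: List[Insn]) -> bytes:
    """Serialise to the 8-byte struct sock_filter layout."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def simulate(prog: List[Insn], nr: int, arch: int = AUDIT_ARCH_X86_64) -> int:
    """Evaluate the three opcodes used above; a test aid, not the kernel."""
    data = struct.pack("=iI", nr, arch)
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError("opcode not supported by this sketch")


if __name__ == "__main__":
    # x86-64 numbers: read=0, write=1, exit=60, exit_group=231.
    prog = build_allowlist([0, 1, 60, 231])
    print(simulate(prog, 1) == SECCOMP_RET_ALLOW)  # write is allowed
    print(simulate(prog, 2) == SECCOMP_RET_KILL)   # open is not
```

To actually apply such a program, a process would pass the packed instructions to prctl() with PR_SET_SECCOMP and SECCOMP_MODE_FILTER, normally after setting PR_SET_NO_NEW_PRIVS; the filter is then inherited by any software the process starts.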
Autosleep and some associated extensions have been merged into the power management subsystem (1, 2, 3, 4, 5, 6). This "opportunistic sleep" feature allows a system to enter sleep mode autonomously if it remains inactive for a while. This will likely be of more interest for devices such as smartphones, which wake from sleep more often and more quickly than, say, notebooks. The Android kernel has long offered a similar mechanism in the shape of wakelocks (aka suspend blockers), without which Android devices would have poor battery life. This function has been a frequent bone of contention between Linux and Android developers. It is not known whether the Android development team plans to switch to the new Linux infrastructure at some point.
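From user space, the feature is driven through sysfs. The sketch below assumes the interface described in the kernel's sysfs power documentation: /sys/power/autosleep takes a target state such as "mem" or "off", and /sys/power/wake_lock and /sys/power/wake_unlock take the name of a wakeup source. The helper names are made up for this example, writing to the real files needs root privileges and a kernel built with the corresponding options, and the demonstration therefore runs against a temporary directory instead.

```python
import contextlib
import os
import tempfile
from typing import Iterator


def set_autosleep(state: str, power_dir: str = "/sys/power") -> None:
    """Select the state to enter when idle, e.g. "mem", or "off" to disable."""
    with open(os.path.join(power_dir, "autosleep"), "w") as f:
        f.write(state)


@contextlib.contextmanager
def wake_lock(name: str, power_dir: str = "/sys/power") -> Iterator[None]:
    """Keep the system awake while the body of the with-block runs."""
    with open(os.path.join(power_dir, "wake_lock"), "w") as f:
        f.write(name)
    try:
        yield
    finally:
        # Always release, even if the protected work raised an exception.
        with open(os.path.join(power_dir, "wake_unlock"), "w") as f:
            f.write(name)


if __name__ == "__main__":
    # A temporary directory stands in for /sys/power in this demonstration.
    with tempfile.TemporaryDirectory() as fake_sysfs:
        set_autosleep("mem", power_dir=fake_sysfs)
        with wake_lock("download", power_dir=fake_sysfs):
            pass  # work that must not be interrupted by suspend goes here
        with open(os.path.join(fake_sysfs, "wake_unlock")) as f:
            print(f.read())
```

The pattern mirrors how Android applications use wakelocks: the system is free to suspend whenever no named source is holding it awake.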
- Following a lengthy period of development, frontswap has now been merged into the kernel (1, 2, 3, 4, 5). This frontend for the kernel's transcendent memory (TM) infrastructure tries to hand memory areas that would otherwise be swapped out to relatively slow swap devices over to TM instead. A TM backend can then retain the data in a more rapidly accessible location, e.g. compressed in the local system memory using zcache, or on another system within a cluster with the help of RAMster.
- Primarily of interest in the embedded field, the contiguous memory allocator (CMA) has, after a long development period, now been merged into the kernel. It tells the kernel to restrict its use of some physically contiguous memory areas to movable data only, so that drivers can claim larger contiguous areas for DMA tasks when required.
- Mel Gorman and Greg Kroah-Hartman have modified the stable kernel merge rules. The aim was to clarify the circumstances under which stable kernels can accept changes that, for example, fix performance problems affecting a large number of users but are somewhat riskier than typical stable patches.