Kernel Log: Coming in 2.6.32 (Part 5) - Architecture code, memory management, virtualisation and tracing
by Thorsten Leemhuis
The forthcoming kernel version will support Intel's Moorestown platform, SFI - the alternative to ACPI, and the Trusted Execution Technology, which used to be called "LaGrande Technology". If required, the new KSM can reduce memory usage by merging identical memory content in virtual machines. The new kernel also includes Timechart, a new tool for visualising what's going on in the system and kernel.
Since Linus Torvalds has been travelling this week, the main development branch has seen no changes in the past few days. However, numerous developers have already sent emails to the LKML asking him to merge fixes. As the number of pending bug fixes grows, so does the likelihood that Torvalds will issue another release candidate of 2.6.32 when he returns, although when releasing rc8 he had still hoped it might be the last one.
Therefore, the Kernel Log will wrap up its report on the new features of 2.6.32 with parts 5 and 6. This penultimate part of the "Coming in 2.6.32" series covers the changes to architecture code, memory management, virtualisation and tracing. The previous four parts discussed the advancements in the network subsystem as well as those in the fields of graphics hardware, storage hardware, filesystems and further drivers.
A new addition to the kernel is support for Intel's Trusted Execution Technology (TXT), formerly known as LaGrande Technology. Together with the components of the Trusted Boot (tboot) project, TXT systems can verify that a kernel has not been tampered with before it is executed. Details about this technology can be found in the kernel documentation and in an article on LWN.net.
The Linux kernel can now be optimised for Atom CPUs during compilation. The kernel hackers have also integrated code to support Intel's Simple Firmware Interface (SFI) – an alternative to ACPI developed by Intel and intended for use with the Moorestown platform Intel plans to introduce next year. Moorestown is heavily geared towards Linux use and is intended for smartphones, Mobile Internet Devices (MIDs) and embedded environments. Details about SFI can be found in a presentation given by the ACPI subsystem's maintainer, Intel employee Len Brown, at the Linux Symposium 2009. Further components to support Moorestown have also been integrated into the kernel – they are based on Thomas Gleixner's partial overhaul of the x86 support that allows an improved abstraction of x86 platforms like Moorestown.
KVM now supports the "unrestricted guest" mode of Intel's next generation of desktop and notebook processors (Westmere), which is scheduled for release under product names such as Core i3 or Core i5 in early 2010. The main Git pull request by KVM developer Avi Kivity discusses further improvements to KVM, including improved tracing facilities and the eventfd-based irqfd and ioeventfd mechanisms for integrating user software and kernel software with guest systems.
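Both irqfd and ioeventfd build on the ordinary eventfd primitive: a file descriptor wrapping a counter that one side increments and the other side reads and resets. The following minimal sketch shows only that underlying primitive from user space, not KVM's own interfaces. It assumes Linux and Python 3.10 or later, where `os.eventfd` is available; the function name `eventfd_roundtrip` is purely illustrative.

```python
import os


def eventfd_roundtrip(signals):
    """Write each value to an eventfd, then read the accumulated counter once.

    Returns None where os.eventfd is unavailable (non-Linux or Python < 3.10).
    """
    if not hasattr(os, "eventfd"):
        return None
    # Non-blocking, so a read on an empty counter raises instead of hanging.
    fd = os.eventfd(0, os.EFD_NONBLOCK)
    try:
        for value in signals:
            os.eventfd_write(fd, value)   # each write adds to the counter
        return os.eventfd_read(fd)        # returns the sum and resets it to 0
    finally:
        os.close(fd)
```

Several notifications that arrive before the reader gets around to them collapse into a single read, which is what makes the primitive cheap as a signalling channel.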
A change introduced by VMware developers indicates that paravirtualisation is losing importance: it announces that the Virtual Machine Interface (VMI), which VMware itself used to advocate, will no longer be supported from Linux 2.6.37 onwards. The developers say the reason for this extensively discussed step is that the virtualisation techniques offered by modern CPUs have become so sophisticated that VMI-based paravirtualisation often adds no significant further performance improvement.
Another new addition is KSM, which originates from the KVM developer circle. The acronym is short for "Kernel Shared Memory" or "Kernel SamePage Merging" and describes a framework which scans the memory of multiple userland processes for identical areas; if it finds matching areas, it merges them and reduces memory usage by freeing the redundant copies. This is, for instance, interesting for KVM hosts on which multiple similar guest operating systems that share the same software libraries and programs run, causing large areas of the data stored in the guests' memory to be identical.
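KSM only scans memory that a process has explicitly offered to it with `madvise(MADV_MERGEABLE)`, and it reports its progress through files under `/sys/kernel/mm/ksm`. The sketch below is a hedged illustration of that opt-in from user space; it assumes Linux and Python 3.8 or later, and the helper names `mark_mergeable` and `ksm_counters` are made up for this example.

```python
import mmap
import os


def mark_mergeable(size):
    """Map anonymous memory and ask the kernel to consider it for KSM.

    Returns (region, accepted); accepted is False when the platform or
    kernel does not offer MADV_MERGEABLE.
    """
    region = mmap.mmap(-1, size)
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        return region, False
    try:
        region.madvise(advice)
    except OSError:
        return region, False  # e.g. kernel built without KSM support
    return region, True


def ksm_counters(base="/sys/kernel/mm/ksm"):
    """Read the sharing counters KSM exposes, skipping files that are absent."""
    counters = {}
    for name in ("pages_shared", "pages_sharing"):
        try:
            with open(os.path.join(base, name)) as handle:
                counters[name] = int(handle.read())
        except (OSError, ValueError):
            pass
    return counters
```

Marking a region is only a hint: nothing is merged until the KSM scanner has been switched on and has found identical pages.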
An article about the Linux Symposium 2009 in The H Open, and the text version of an OLS 2009 presentation it links to, describe the technology in detail and explain how KSM ensures that there is no chaos when a process modifies a shared memory segment. The PDF document also mentions how CERN used KSM to reduce its hardware requirements when processing the data generated by the LHC (Large Hadron Collider).
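The idea of merging identical pages and separating them again on a write can be illustrated with a small toy model. This is a conceptual sketch only, not the kernel's algorithm: the class `ToyKsm` and its methods are hypothetical, and a dictionary keyed on page content stands in for whatever comparison structures the kernel really uses.

```python
PAGE_SIZE = 4096


class ToyKsm:
    """Toy model: frames are deduplicated by content, with copy-on-write."""

    def __init__(self):
        self.frames = {}     # frame id -> page content (bytes)
        self.refcount = {}   # frame id -> number of mappings using it
        self.mappings = {}   # (process, page number) -> frame id
        self._next_id = 0

    def map_page(self, process, page_no, content):
        """Give a process a fresh private frame holding the given content."""
        frame = self._next_id
        self._next_id += 1
        self.frames[frame] = bytes(content)
        self.refcount[frame] = 1
        self.mappings[(process, page_no)] = frame

    def scan_and_merge(self):
        """Merge frames with identical content; return how many were freed."""
        seen = {}   # content -> frame id that all duplicates will share
        freed = 0
        for key, frame in list(self.mappings.items()):
            canonical = seen.setdefault(self.frames[frame], frame)
            if canonical != frame:
                self.mappings[key] = canonical
                self.refcount[canonical] += 1
                self.refcount[frame] -= 1
                if self.refcount[frame] == 0:
                    del self.frames[frame], self.refcount[frame]
                    freed += 1
        return freed

    def write(self, process, page_no, offset, data):
        """Modify a page, first taking a private copy if the frame is shared."""
        key = (process, page_no)
        frame = self.mappings[key]
        if self.refcount[frame] > 1:
            self.refcount[frame] -= 1
            self.map_page(process, page_no, self.frames[frame])
            frame = self.mappings[key]
        page = bytearray(self.frames[frame])
        page[offset:offset + len(data)] = data
        self.frames[frame] = bytes(page)

    def read(self, process, page_no):
        return self.frames[self.mappings[(process, page_no)]]
```

If two guests each map a zero-filled page, one scan leaves a single shared frame; a later write by one guest gives that guest its own copy again, so the other guest never sees the change.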
The new HWPOISON, which was introduced by Intel developer Andi Kleen, adds some techniques for handling and avoiding memory errors. The hardware features these techniques rely on are planned for Intel's Nehalem EX series of server processors, which is expected in early 2010. Details can be found in the commit comment, the kernel documentation, and in an article on LWN.net.
The emerging Performance Counters have been renamed Performance Events, because this apparently better describes the technology, which has seen major advancements in the past few months. The extent to which the kernel's performance and runtime analysis infrastructure is being transformed is also indicated by the lengthy Git pull requests submitted for the Performance Counters/Events, for the tracing subsystem and for Oprofile, which briefly mention the most important changes in the respective subsystems. Among them is the ring buffer, which is now completely lockless and which the tracing code uses heavily. Once again, background information about this can be found on LWN.net.
Also new is the "perf sched" sub-program, which makes it easier to analyse the process scheduler. Another new addition is the Timechart tool mainly developed by Arjan van de Ven, which allows traces recorded via "perf record" to be visualised as SVGs for easy analysis. In his blog, van de Ven explains the whole procedure in detail and gives various examples of use.
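The record-then-render workflow can be sketched as two perf invocations. The snippet below only assembles and optionally runs them; it assumes a `perf` binary built from the 2.6.32 sources is on the path, that `perf timechart` accepts `-o` for the SVG file name, and that the user is allowed to record system-wide traces. The helper names are illustrative.

```python
import shutil
import subprocess


def timechart_commands(workload, svg_path="output.svg"):
    """Return the two perf invocations: record a trace, then render it."""
    record = ["perf", "timechart", "record", *workload]
    render = ["perf", "timechart", "-o", svg_path]
    return record, render


def run_timechart(workload, svg_path="output.svg"):
    """Run both steps; returns False without doing anything if perf is missing."""
    if shutil.which("perf") is None:
        return False
    for command in timechart_commands(workload, svg_path):
        subprocess.run(command, check=True)
    return True
```

Calling `run_timechart(["sleep", "5"])` would, under these assumptions, trace the system for five seconds and leave an SVG that can be opened in a vector graphics viewer.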
Many further minor, but by no means insignificant, changes can be found in the list below. Like many of the references in the text above, the links point to the relevant commits in the web front end of the Git branch at kernel.org that Linus Torvalds uses for maintaining the kernel sources. There, the commit comments and the patches themselves provide extensive further information on the respective changes.
Architecture code
- Add i.MX25 support
- Add support for Eukrea's CPUIMX27
- Add support for Eukrea's MBIMX27
- ARM: 5570/1: at91: Support for at91sam9g10: core chip board support
- ARM: 5572/1: at91: Support for at91sam9g45 series: core chip board support
- ARM: 5580/2: ARM TCM (Tightly-Coupled Memory) support v3
- ARM: 5590/1: Add basic support for ST Nomadik 8815 SoC and evaluation board
- ARM: 5629/1: Add support for Eukrea's CPU9260 CPU9G20
- ARM: 5630/1: Add support for Eukrea's CPUAT91
- ARM: 5641/1: bcmring: add Kconfig and Makefile entries in arch/arm
- ARM: 5667/3: U300 SSP/SPI board setup and test
- ARM: BAST: CPUFREQ: Add board support
- ARM: implement highpte
- ARM: Kirkwood: Marvell OpenRD-Base board support
- ARM: orion5x: Add LaCie NAS 2Big Network support
- ARM: OSIRIS: CPUFREQ: Add CPU frequency scaling support
- ARM: pxa: balloon3 (http://balloonboard.org/) base machine support
- ARM: S3C2410: CPUFREQ: Add core support.
- ARM: S3C2412: CPUFREQ: Add core support.
- ARM: S3C2440: CPUFREQ: Add core support.
- ARM: S3C24XX: CPUFREQ: Add core support.
- ARM: S3C6410: airgoo hmt board support
- ARM: S3C: CPUFREQ: Add debugfs support for cpufreq
- ARM: S3C: CPUFREQ: Add documentation for system
- ARM: S5PC100: Board and configuration file
- ARM: S5PC100: Clock and PLL support
- ARM: S5PC100: CPU initialization
- ARM: S5PC100: Kconfigs and Makefiles
- ep93xx video driver platform support
- Freescale i.MX25 PDK (3ds) board support
- MAINTAINERS: move ARM lists to infradead
- mx27: add support for phytec pca100 (phyCARD-s) board
- MXC: add basic MXC91231 support
- MXC: add iomux pins configuration support for MXC91231
- nommu: Add MMU-less support for Integrator platforms
- nommu: Add MMU-less support for the RealView boards
- nommu: ptrace support
- OMAP2: add board file for Nokia N800 and N810
- OMAP3: Zoom2: Add TWL4030 support
- mpc5200: support for the MAN mpc5200 based board mucmc52
- PCI: document PCIe fundamental reset interfaces
- PCI/powerpc: support PCIe fundamental reset
- powerpc/40x: Add support for the ESTeem 195E (PPC405EP) SBC
- powerpc/44x: Add Eiger AMCC (AppliedMicro) PPC460SX evaluation board support.
- powerpc/83xx: Add support for MPC8377E-WLAN boards
- powerpc/85xx: Add support for P2020RDB board
- powerpc: Enable GCOV
- powerpc: Fix some late PowerMac G5 with PCIe ATI graphics
- powerpc: introduce and document sdhci,wp-inverted property for eSDHC
- powerpc/powermac: Thermal control turns system off too eagerly
- powerpc: Remaining 64-bit Book3E support
- ACPI, x86: expose some IO-APIC routines when CONFIG_ACPI=n
- intel_txt: Force IOMMU on for Intel TXT launch
- Revert "x86, timers: Check for pending timers after (device) interrupts"
- x86: Add early platform detection
- x86: Add hardware_subarch ID for Moorestown
- x86: Add Moorestown early detection
- x86: Add Phoenix/MSC BIOSes to lowmem corruption list
- x86: Add reboot quirk for 3 series Mac mini
- x86/amd-iommu: replace "AMD IOMMU" by "AMD-Vi"
- x86/amd-iommu: Workaround for erratum 63
- x86, EDAC: Provide function to return NodeId of a CPU
- x86, intel_txt: Intel TXT reboot/halt shutdown support
- x86, intel_txt: Intel TXT Sx shutdown support
- x86, mce: fix reporting of Thermal Monitoring mechanism enabled
- x86, mce: Support specifying context for software mce injection
- x86, mce: Support specifying raise mode for software MCE injection
- x86: mce: Update X86_MCE description in x86/Kconfig
- x86, msr: Export the register-setting MSR functions via /dev/*/msr
- x86/oprofile: Enable multiplexing only if the model supports it
- x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
- x86, perf_counter, bts: Add BTS support to perfcounters
- x86: Remove STACKPROTECTOR_ALL
- x86, timers: Check for pending timers after (device) interrupts
- x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
- x86: Provide an alternative() based cmpxchg64()
- core generic GPIO support for Freescale Coldfire processors.
- davinci: Adding DM365 SOC Support
- davinci: Add support for DA850/OMAP-L138 EVM board
- davinci: da8xx: Add base DA830/OMAP-L137 SoC support
- davinci: da8xx: Add support for DA830/OMAP-L137 EVM board
- IA64: implement ticket locks for Itanium
- m32r: bzip2/lzma kernel compression support
- MAINTAINERS: add entry for TI DaVinci machine support
- microblaze: Add architectural support for USB EHCI host controllers
- microblaze: Add checking mechanism for MSR instruction
- MIPS: BCM63xx: Add integrated ethernet mac support.
- MIPS: BCM63xx: Add PCMCIA Cardbus support.
- MIPS: BCM63xx: Add serial driver for bcm63xx integrated UART.
- MIPS: BCM63xx: Add support for the Broadcom BCM63xx family of SOCs.
- MIPS: Loongson: Add GCC 4.4 support for Loongson2E
- MIPS: Loongson: Add oprofile support
- OMAP2/3/4 core: create omap_device layer
- OMAP2/3/4: create omap_hwmod layer
- OMAP: PM counter infrastructure.
- S390: 64-bit register support for 31-bit processes
- S390: add call home support
- S390: Enable guest page hinting by default.
- score: add maintainers for score architecture
- score: Add support for Sunplus S+core architecture
- sh: Add CEU support for EcoVec24
- sh: Add EcoVec (SH7724) board support
- sh: add FSI driver support for ms7724se
- sh: Add ftrace syscall tracing support
- sh: Add initial support for SH7757 CPU subtype
- sh: Add support DMA Engine to SH7722
- sh: Add support DMA Engine to SH7780
- sh: bzip2/lzma zImage support.
- sh: Function graph tracer support
- sh: kfr2r09 board support - mach-type and defconfig
- sparc: add basic support for 'perf'
- sparc: Add CONFIG_DMA_API_DEBUG support
- sparc,leon: CONFIG_SPARC_LEON option and leon specific files.
- sparc,leon: Introduce the sparc-leon CPU type.
- sparc: Niagara1 perf event support.
- sparc: Support all ultra3 and ultra4 derivatives.
Memory Management (MM)
- cgroups: update documentation of cgroups tasks and procs files
- Documentation/memory.txt: remove some very outdated recommendations
- hugetlb: add MAP_HUGETLB example
- hugetlb: clean up and update huge pages documentation
- HWPOISON: Add basic support for poisoned pages in fault handler v3
- HWPOISON: Add new SIGBUS error codes for hardware poison signals
- HWPOISON: Add page flag for poisoned pages
- HWPOISON: Add poison check to page fault handling
- HWPOISON: Add support for poison swap entries v2
- HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
- kmemcheck: update documentation
- ksm: add mmu_notifier set_pte_at_notify()
- ksm: add some documentation
- ksm: change default values to better fit into mainline kernel
- ksm: more on default values
- ksm: sysfs and defaults
- ksm: the mm interface to ksm
- mm: allow memory hotplug and hibernation in the same kernel
- mm: fix NUMA accounting in numastat.txt
- mm: fix sparsemem configuration
- mm: oom analysis: add buffer cache information to show_free_areas()
- mm: oom analysis: add per-zone statistics to show_free_areas()
- mm: oom analysis: add shmem vmstat
- mm: vmstat: add isolate pages
- nommu: add support for Memory Protection Units (MPU)
- oom: make oom_score to per-process value
- pagemap clear_refs: modify to specify anon or mapped vma clearing
- pagemap: document KPF_KSM and show it in page-types
- pagemap: export KPF_HWPOISON
- page-types: add feature for walking process address space
- page-types: add hwpoison/unpoison feature
- page-types: introduce checked_open()
- page-types: introduce kpageflags_flags()
- page-types: make standalone pagemap/kpageflags read routines
- page-types: make voffset local variables
- proc: document `guest' column in /proc/stat
- slub: add option to disable higher order debugging slabs
- tracing, documentation: Add a document on the kmem tracepoints
- truncate: new helpers
- truncate: use new helpers
- vm: document that setting vfs_cache_pressure to 0 isn't a good idea
Tracing
- Add a tracepoint for block request remapping
- drm/i915: Add tracepoints
- ext4: Add a tracepoint for ext4_alloc_da_blocks()
- ftrace: document function and function graph implementation
- hrtimer: Add tracepoint for hrtimers
- itimers: Add tracepoints for itimer
- oprofile: Implement performance counter multiplexing
- perf: Add a SVG helper library file
- perf: Add a timestamp to fork events
- perf: Add timechart help text and add timechart to "perf help"
- perf_counter: powerpc: Add callchain support
- perf_counter, sched: Add sched_stat_runtime tracepoint
- perf report: Add raw displaying of per-thread counters
- perf report: Fix and improve the displaying of per-thread event counters
- perf sched: Account for lost events, increase default buffering
- perf sched: Add --input=file option to builtin-sched.c
- perf sched: Add involuntarily sleeping task in work atoms
- perf sched: Add 'perf sched latency' and 'perf sched replay'
- perf sched: Add 'perf sched map' scheduling event map printout
- perf sched: Add 'perf sched trace', improve documentation
- perf sched: Add runtime stats
- perf sched: Add sched latency profiling
- perf sched: Add support for sched:sched_stat_runtime events
- perf sched: Display time in milliseconds, reorganize output
- perf sched: Implement the 'perf sched record' subcommand
- perf sched: Implement the scheduling workload replay engine
- perf sched: Import schedbench.c
- perf sched: Make it easier to plug in new sub profilers
- perf sched: Output runtime and context switch totals
- perf sched: Print PIDs too
- perf: Tidy up after the big rename
- perf timechart: Add a power-only mode
- perf timechart: Add "perf timechart record"
- perf timechart: Show the duration of scheduler delays in the SVG
- perf timechart: Show the name of the waker/wakee in timechart
- perf tools: Add an option to multiplex counters in a single channel
- perf tools: Add missing parameters documentation
- perf tools: Add perf trace
- perf tools: Add trace event debugfs IO handler
- perf tools: Add trace event information parser
- perf tools: Allow the specification of all tracepoints at once
- perf tools: Complete support for dynamic strings
- perf tools: Factorize the thread code in a dedicated file
- perf tools: Implement counter output multiplexing
- perf util: Make the timechart SVG width dynamic
- perf util: SVG performance improvements
- powerpc/sputrace: Use the generic event tracer
- ring-buffer: add design document
- ring-buffer: make lockless
- sched: Add wait, sleep and iowait accounting tracepoints
- sched: Provide iowait counters
- timers: Add tracepoints for timer_list timers
- tracing: Add individual syscalls tracepoint id support
- tracing: add latency format to function_graph tracer
- tracing: add lock depth to entries
- tracing: Add more namespace area to 'perf list' output
- tracing: Add perf counter support for syscalls tracing
- tracing: Add syscall tracepoints
- tracing: Add trace events for each syscall entry/exit
- tracing: Add vim script to enable folding for function_graph traces
- tracing: create generic trace parser
- tracing, documentation: add a document describing how to do some performance analysis with tracepoints
- tracing/events: Add module tracepoints
- tracing/events: Add trace_event boot option
- tracing/filters: add filter Documentation
- tracing/filters: improve subsystem filter
- tracing: make testing syscall events a separate configuration
- tracing, page-allocator: add a postprocessing script for page-allocator-related ftrace events
- tracing: pass around ring buffer instead of tracer
- tracing, perf: Convert the power tracer into an event tracer
- tracing: Remove markers
- tracing: Remove mentioning of legacy latency_trace file from documentation
- tracing: Rename FTRACE_SYSCALLS for tracepoints
- tracing: Support for syscall events raw records in perfcounters
- tracing/syscalls: Add fields format for exit events
- tracing/syscalls: Add filtering support
- tracing: trace parser support for function and graph
- tracing: trace parser support for set_event
- x86, perf_counter, bts: Add BTS support to perfcounters
Virtualisation
- Documentation: Update KVM list email address
- KVM: Add Directed EOI support to APIC emulation
- KVM: Add MCE support
- KVM: add module parameters documentation
- KVM: add support for change_pte mmu notifiers
- KVM: Add trace points in irqchip code
- KVM: Cache pdptrs
- KVM: Document basic API
- KVM: Document KVM_CAP_IRQCHIP
- KVM: Implement MSRs used by Hyper-V
- KVM: introduce module parameter for ignoring unknown MSRs accesses
- KVM: MMU: enable gbpages by increasing nr of pagesizes
- KVM: MMU: shadow support for 1gb pages
- KVM: Move common KVM Kconfig items to new file virt/kvm/Kconfig
- KVM: PIT support for HPET legacy mode
- KVM: powerpc: convert marker probes to event trace
- KVM: report 1GB page support to userspace
- KVM: SVM: enable nested svm by default
- KVM: SVM: Improve nested interrupt injection
- KVM: x86 emulator: Add missing EFLAGS bit definitions
- KVM: x86 emulator: add syscall emulation
- KVM: x86 emulator: Add sysenter emulation
- KVM: x86 emulator: Add sysexit emulation
- virtio: add virtio IDs file
- xen: make -fstack-protector work under Xen