Processor Whispers: About incorrectness and inequity
by Andreas Stiller
While Intel presented the Xeon Phi coprocessor with much hoopla at the ISC12, the probably second-last Mohican of the Itanium line, the Poulson, is being released inconspicuously. Even Itanium partner Hewlett-Packard is increasingly looking in the other direction, toward micro-servers.
Modern supercomputers calculate sextillions of floating point operations – and one has to wonder how reliable the results can still be. Apart from possible soft errors, it's like groping in the dark, because calculations in the usual double precision, with only 53 bits for the mantissa, now get out of hand a lot more often than many users would believe. Even if only the last bit of each calculation wobbles, the error can, depending on the stability of the algorithm, quickly build up and eventually result in complete nonsense. This is all the more true since x86 processors stopped using the FPU with its 80-bit data format and started to rely on the much less precise SIMD units instead.
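How a wobbling last bit builds up can be seen even on a desktop machine. The following is a minimal sketch, not taken from any supercomputer code: it adds the double precision value of 0.1 a million times with a plain loop and compares the result with Python's `math.fsum`, which tracks the lost low-order bits, and with an exact rational reference. The input value and the count are arbitrary choices for illustration.

```python
import math
from fractions import Fraction

def naive_sum(values):
    """Add the values one after another; every addition rounds to 53 bits."""
    total = 0.0
    for v in values:
        total += v
    return total

n = 1_000_000                      # arbitrary count, chosen for illustration
values = [0.1] * n                 # 0.1 is not exactly representable in binary

naive = naive_sum(values)          # rounding errors pile up step by step
compensated = math.fsum(values)    # keeps track of the lost low-order bits

# Exact reference: the stored double for 0.1, multiplied exactly by n.
exact = Fraction(0.1) * n
naive_err = abs(Fraction(naive) - exact)
compensated_err = abs(Fraction(compensated) - exact)

print("naive sum      :", repr(naive))
print("math.fsum      :", repr(compensated))
print("naive error    :", float(naive_err))
print("math.fsum error:", float(compensated_err))
```

The plain loop drifts away from the exact sum, while the compensated sum stays within rounding distance of it. In an unstable algorithm the same effect is amplified rather than merely accumulated.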
As early as 1989, the IEEE Standards Committee introduced a format for 128-bit floating point (113-bit mantissa, counting the implicit leading bit), which is supported by various compilers. Some can even reinterpret double precision as quad precision automatically (flag -r16 with Intel Fortran). But that makes everything run terribly slowly, and so hardly anyone uses it. The processor companies could long since have implemented higher precision or interval arithmetic as an option in their SSE, AVX or other units – apparently, though, they don't want to. It's only when the financial world complains about incorrectly rounded figures that something happens, at least at IBM, where the decimal floating point format has been standard since Power 6 and z6. While Intel offers powerful libraries for this purpose, one would have expected a fast hardware solution by now, for instance in the new Itanium Poulson, which already features many new functional units and instructions.
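Why the financial world cares about decimal formats can be illustrated in software. The sketch below uses Python's standard `decimal` module as a stand-in. It is a software library and says nothing about how IBM's hardware or Intel's libraries work internally, and the amount 2.675 is an arbitrary example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary doubles cannot represent most decimal fractions exactly.
binary_total = 0.1 + 0.2
decimal_total = Decimal("0.1") + Decimal("0.2")

print(binary_total == 0.3)                 # False: the sum is slightly off
print(decimal_total == Decimal("0.3"))     # True: exact in decimal

# Rounding an amount to cents: the double nearest to 2.675 lies just below it,
# so binary rounding goes down where an accountant would expect it to go up.
amount = "2.675"                           # arbitrary example amount
binary_rounded = round(float(amount), 2)
decimal_rounded = Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(binary_rounded)                      # 2.67
print(decimal_rounded)                     # 2.68
```

A software implementation like this one is correct but slow, which is exactly the argument for doing the same thing in hardware.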
Though the first Poulson processors are not yet commercially available, their product names and clock speeds have appeared in Intel's product change notifications as well as in its MDDS database. Alongside numerous test versions, four product versions are listed under Poulson-8, the fastest of which is the Itanium 9560 (SR0T1) with 2.53 GHz, 8 cores and 32 MB of L3 cache. The previously most powerful Itanium, the Tukwila-9350, had four cores with a clock speed of 1.73 GHz, so core count and clock speed alone signify a performance increase of almost a factor of 3. Additionally, the width is doubled from two to up to four simultaneously executable instruction bundles. And so, theoretically, the Poulson is capable of up to 12 instructions per clock cycle – no other processor manages that. With suitably optimised software, this allows the Poulson to put almost another factor of two on top of the last generation.
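The figures in this paragraph can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a benchmark: it simply multiplies cores by clock speed and assumes three instruction slots per bundle, as in the IA-64 bundle format. The function name is made up for illustration.

```python
def core_ghz(cores, clock_ghz):
    """Crude throughput proxy: number of cores times clock speed in GHz."""
    return cores * clock_ghz

poulson = core_ghz(cores=8, clock_ghz=2.53)    # Itanium 9560
tukwila = core_ghz(cores=4, clock_ghz=1.73)    # Itanium 9350

ratio = poulson / tukwila
print(f"cores x clock: {ratio:.2f}x")           # just under a factor of 3

SLOTS_PER_BUNDLE = 3                            # assumption: IA-64 bundle format
tukwila_ipc = 2 * SLOTS_PER_BUNDLE              # two bundles per cycle
poulson_ipc = 4 * SLOTS_PER_BUNDLE              # up to four bundles per cycle

print("peak instructions per cycle:", tukwila_ipc, "->", poulson_ipc)
print(f"theoretical combined gain: {ratio * poulson_ipc / tukwila_ipc:.1f}x")
```

The combined figure is a theoretical ceiling. Real code rarely fills every slot of every bundle, which is why the text speaks of "almost" another factor of two.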
Wheelbarrows to the moon
Although the Poulson brings significant improvements in terms of energy efficiency, there are some tasks in which it can't keep up with other data transporters: at times, many wheelbarrows have clear advantages over a single heavy truck. Among the wheelbarrow servers is Hewlett-Packard's pilot project Moonshot, a concept for highly efficient micro-servers with many small computing nodes that share all their resources. It is aimed at dedicated web servers, content delivery, and data analytics with the current trend-setter Apache Hadoop and the like. The first space project, announced in autumn 2011, goes by the name Redstone. In this undertaking, HP is collaborating with the start-up Calxeda, which has designed a plug-in card with four Calxeda SoCs called "Energy Core ECX-1000", each of which has four ARM Cortex-A9 cores. According to HP, in comparison to large x86 servers, the Moonshot architecture is supposed to require 94% less space and 89% less energy and result in 63% lower costs.
But while the Redstone rocket hasn't even left the launch pad yet, HP has already rung the bell for the next round: the Gemini project. Its partner for this project is Intel, which plans to – and probably will be able to – stand up to the ARM processors with its new dual-core Atom Centerton, announced at the IDF in Beijing. The Centerton, scheduled for the second half of 2012, is designed for a TDP of just 6 watts. Intel intends to work with HP to develop a special cartridge with an as yet unknown number of Centerton processors for HP's micro-server family. In this context, nice features like virtualisation and 64 bits are consistently emphasised – something ARM can't offer yet.
The start-up Calxeda, of which ARM owns about 20%, showed off the first working systems at the Ubuntu Developer Summit in May and at the recent ISC12, both times running Ubuntu 12.04. Not only HP but also Supermicro intends to launch Calxeda micro-servers. Meanwhile, the small company has cheekily drawn attention with a weird benchmark comparison: at a clock speed of 1.1 GHz, its 5-watt web server is supposed to be 15 times more energy efficient in ApacheBench than a Xeon E3-1240 system. 4 GB of memory here, 16 GB there, measured watts here, TDP calculation there, maximum load here, only 15% CPU load there – a lot of weak spots to be addressed, but, in the end, a factor of five would probably still remain in a fair comparison. The comparison as a whole is still lopsided, however. Things will only become really interesting once the Centerton comes into play here.
ARM against x86 – Linux creator Linus Torvalds has accompanied this constant struggle, at times with drastic comments. Just recently, when he was angry at NVIDIA, he didn't shy away from openly making good use of his middle finger and saying "NVIDIA, fuck you". Although NVIDIA joined the Linux Foundation in spring, it's still being rather tight-lipped regarding the disclosure of Linux drivers – for the hybrid graphics technology Optimus, for instance. Who knows, maybe infringed patents are involved and the sources are under lock and key for a good reason. In any case, Nebojša Novaković of Bright Side of News reports that NVIDIA's reticence thwarted a huge deal in China, which involved 10 million school PCs. Supposedly, AMD has now pocketed the deal – and it really needs it.