Processor Whispers - About Harmless Plants and Canny Zombies
by Andreas Stiller
Intel gets off lightly in the dispute with the US Federal Trade Commission (FTC), AMD makes the Atom competitor Ontario its highest priority and postpones other processors, and the digits of Pi have finally been calculated to 5 trillion decimal places.
The FTC has fired a few harmless peas at the canny Intel zombies. Comparing the original charges with the settlement, the FTC's plant started out as an aggressive tiger lily but ended up as a harmless ornamental lawn. Intel was not fined for anti-competitive conduct – although, according to FTC chairman Jon Leibowitz, it was definitely guilty of such conduct. Intel's amicable settlement with AMD, in which Intel paid $1.25 billion, and the similarly expensive blood-letting by the EU surely played an important role. And so the FTC was content with the semiconductor market leader committing itself to refrain from such schemes in the future.
The FTC has also strongly backed down concerning the compilers. In accordance with article VII.C.2, Intel can still exclude compatible x86 microprocessors from optimisations as it sees fit – as long as this is properly disclosed. When publishing benchmark results – in particular SPEC CPU2006 results, whether from Intel or its partners – this circumstance has to be indicated "Clearly and Prominently".
Intel now only has to compensate companies for losses resulting from their ignorance of the artificial disadvantage caused by the "defective" compilers, for the purchase of a replacement compiler from a third party, and for the costs of recompilation and validation. The total sum for all aggrieved parties is capped at $10 million, though. Still, the Portland Group (PGI), which belongs to the European STMicroelectronics, will surely be happy to receive some Intel-funded orders. Its compilers – like Intel's, and unlike Microsoft's – offer everything the high-performance scene needs (Windows and Linux, Fortran, C99, complete libraries as well as special Opteron optimisations, CUDA support etc.).
Pi in the fast lane
Our Nehalem-EX test computer with its 64 logical cores, 256 GB of memory and a fast attached JBOD (a SuperTrak EX 8768 controller from Promise with 16 x 0.6-TB Seagate Cheetah 15K.7 drives) laboured a good 20 days just to calculate Pi to a petty trillion decimal places. Just then, bad news came in from Japan: there, the same Windows program, y-cruncher – written by 22-year-old Californian Alexander J. Yee, a student at Northwestern University in Illinois – calculated Pi to five trillion places on a Xeon Westmere system in only 90 days. So much for my plan to surpass the former record by Fabrice Bellard (2.7 trillion) and set a new one at 3.14 trillion places – which would at least have sufficed to fill the issues of c't magazine for the next 100,000 years ...
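The record programs rely on highly optimised arbitrary-precision arithmetic of the y-cruncher kind; just to illustrate the principle, here is a toy sketch in Python that computes Pi with fixed-point integer arithmetic using Machin's formula (the function names are ours, and this is many orders of magnitude slower than anything used for record runs):

```python
def arctan_recip(x, scale):
    """atan(1/x) in fixed point, scaled by `scale` (Gregory series)."""
    total = term = scale // x
    x2 = x * x
    n, sign = 3, -1
    while term:
        term //= x2
        total += sign * (term // n)  # alternating series terms
        sign, n = -sign, n + 2
    return total

def pi_digits(digits):
    """First `digits` decimal places of Pi via Machin's formula:
    pi = 16*atan(1/5) - 4*atan(1/239). Ten guard digits absorb the
    truncation error of the integer divisions."""
    scale = 10 ** (digits + 10)
    pi = 16 * arctan_recip(5, scale) - 4 * arctan_recip(239, scale)
    return str(pi)[: digits + 1]  # "3" followed by `digits` places

print(pi_digits(50))
```

For trillions of digits, programs like y-cruncher instead use fast series (Chudnovsky) together with FFT-based multiplication and disk-backed arithmetic – the series above converges fast enough for a few thousand places at best.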
The term "defective compiler", which the FTC had used in its strongly-phrased complaint, is nowhere to be found in the Consent Agreement. And apart from the questionable optimisation flags for SSE2 and SSE3, only the "library dispatching mechanisms" are mentioned – as an aside. Nonetheless, this is the real rub: programs built with OpenMP and/or the Math Kernel Library (MKL) do run on AMD processors, but without a proper binding of threads to their respective cores (affinity) they suffer a significant performance loss. And this is exactly what is denied to non-Intel processors, for instance when trying to use the environment variable KMP_AFFINITY to assign an explicit processor list. Under Linux it is still possible to fall back on the GNU variable GOMP_CPU_AFFINITY – under Windows, however, you are left in the lurch.
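In practice the Linux workaround looks roughly like this – a minimal sketch, assuming an OpenMP/MKL binary we will call ./mkl_bench (the binary name and the core list 0-3 are purely illustrative):

```shell
# Intel's OpenMP runtime reads KMP_AFFINITY – but on non-Intel CPUs
# an explicit processor list like this one may simply be refused:
export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3],explicit"

# Under Linux, the GNU OpenMP runtime offers thread pinning via
# GOMP_CPU_AFFINITY instead (there is no Windows equivalent):
export GOMP_CPU_AFFINITY="0-3"

./mkl_bench   # hypothetical OpenMP/MKL benchmark binary
```

Which variable actually takes effect depends on the OpenMP runtime the binary was linked against – which is precisely why Windows users of the Intel runtime have no fallback.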
Compilers, by the way, are often a good source of information about upcoming processors and new instruction sets. For instance, we once found new instructions for Intel's virtualisation (Vanderpool) in beta versions of Microsoft compilers. For in-depth architectural details, the GNU and Open64 source files represent a fertile Zen garden. As Dresdenboy blogged, they were an early source for data on cache sizes and latencies of AMD's next processor generation, called Bulldozer, and have now disclosed the number of functional units: 4 ALUs, 3 AGUs, 4 FPUs. The Bulldozer decoder might be able to decode up to eight instructions simultaneously.
As for the performance of upcoming processors, there is – apart from the usual leaks, mostly from the Far East – another fertile source: BOINC. The distributed computing projects not only find highly important things, like the new pulsar recently discovered by Einstein@home, but their statistics also offer very interesting insights into processor performance. Apparently, the developers use this opportunity to test their prototypes, whose thin disguise is probably intentional. And so, AMD_ProcVal is most likely the camouflage for the AMD chips Llano (family 18) and the Atom competitor Ontario with its Bobcat core (family 20). The latter is probably meant to make the developer colleagues at Intel sweat with its 1.351 Gflops per core – their Atom D510 only manages 0.721 Gflops.
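A quick back-of-the-envelope check of what those BOINC statistics imply per core (the two Gflops figures are the ones quoted above):

```python
# Per-core figures from the BOINC statistics, in Gflops:
bobcat_ontario = 1.351
atom_d510 = 0.721

# Bobcat's per-core advantage over the Atom D510:
ratio = bobcat_ontario / atom_d510
print(f"{ratio:.2f}x")  # roughly 1.87x
```

Almost a factor of two per core – enough to explain the sweating in Santa Clara, even before clock rates and the integrated DirectX 11 graphics enter the picture.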
In the past, AMD has mostly ignored the netbook sector, but now it plans to fly its flag high there. Ontario – two Bobcat cores and a DirectX 11 capable graphics processor on one chip – is now the highest priority, as AMD boss Dirk Meyer recently announced during a conference call with analysts, and is supposed to make its appearance this year. So Llano, which was originally intended to be the first Fusion chip, will have to wait. However, Llano's delay of a few months, Meyer made clear, is mainly due to manufacturing problems – the yield of the 32-nm process at Globalfoundries is still behind expectations. Ontario – to the disappointment of the Dresden fans – will not be manufactured by Globalfoundries but by TSMC, in the well-established 40-nm bulk process also used for the current Radeon graphics chips. However, TSMC initially had a lot of trouble with this process, too.