Processor Whispers: About joy and frustration
by Andreas Stiller
Look who's arrived: AMD's Bulldozer. Enthusiasm is limited, however, with "disappointing" being the adjective most frequently used to describe it. Nevertheless, numerous data centres are counting on this new architecture, in Oak Ridge, in Stuttgart and elsewhere.
Testers everywhere lament not only the somewhat weak performance and high power consumption of the FX-8150, but also that AMD did not give them nearly enough time. The finished chip was dispatched only a few days before its official release on 12 October, far too short a time for proper tests, never mind optimisations with AVX, FMA4 or XOP. An official test run with the SPEC CPU2006 suite alone takes two days – and that's when everything works the first time around.
And it's always a bad sign when a company is already talking about the successor to a product while presenting that same product. Intel, for instance, promised improvement in the days of the power-hungry NetBurst Pentium 4 and also pointed to future improvements when premiering the latest Atoms. "Yes, well, it's not great, but the next one surely will be ..." In this light, AMD's promises concerning the Orochi-Bulldozer successors Piledriver, Steamroller and Excavator leave a somewhat stale aftertaste.
AMD has already demonstrated a functioning Piledriver core inside the Trinity mobile processor with integrated graphics. It might launch at the beginning of next year, only a few months after Bulldozer. Its graphics performance is much better than that of Llano but, as for the computing cores, no world-shaking improvement is to be expected, just a few larger buffers here and there. Not until 2013 is Steamroller supposed to overhaul the internal architecture and eliminate bottlenecks, while power consumption is to be cut by 50 per cent by the time Excavator rolls out. And in the mobile sector, various new codenames have already popped up: the Kaveri APU with Steamroller cores, and Kabini and Samara with the Jaguar core as the Bobcat successor. Meanwhile, AMD's plans for PCI Express 3.0 are still unclear – HyperTransport is too slow for it and AMD wants to get rid of HyperTransport as soon as possible anyway.
The question that remains is: why can't AMD manage to present an all-round convincing product four years after the first announcement of Bulldozer? Some think it's partly Intel's fault, saying that the illegal methods Intel used to push AMD products out of the market, from shortly after the launch of the Athlon until 2004, eventually broke the smaller competitor's back and that the billion-dollar settlement didn't undo the damage.
But AMD is also struggling with its own mistakes: at $5.4 billion, the acquisition of ATI in 2006 is considered to have been much too expensive, and money is now lacking in product development. The spin-off of chip production to Globalfoundries raised concerns that separating manufacturing from CPU development might lead to a structural disadvantage compared to Intel. At the same time, the revolving door of top-level management isn't slowing down – important managers are leaving all the time. And ex-employees complain that AMD has been relying solely on automated design tools instead of the hand-tuning expertise of its developers. As a result, they say, AMD has wasted 20 per cent of performance.
And yet, even if AMD had much more money and personnel, not all of its problems would necessarily be resolved. Intel provides numerous examples of that: despite enormous net profits in the double-digit billions, Intel has for years failed to develop competitive 3D graphics drivers or to hold off the ARM competition with the Atom. And as recently as the beginning of this year, Intel had to deal with a design flaw in the 6 Series chipset. A fortune was also wasted on the Itanium. After all, CPU development is a risky undertaking and a bit of luck goes a long way.
Glimmers of hope
All in all, though, things don't look so bad for the Bulldozer processor. Sure, the design is optimised more for servers than for desktop PCs, where multithreading still hasn't fully arrived. In high-performance computing, the FX-8150, as the forerunner of the Interlagos processor, shows some weaknesses but also strengths. Take division, for instance, which is often underestimated and shouldn't be neglected – the Linpack benchmark only adds and multiplies. In this discipline, the FX benefits from its two floating-point division units per module and is about twice as fast as the Core i7-2600 when processing SSE3 operations.
In any case, the Interlagos prototypes must have convinced the scientists at Oak Ridge in Tennessee – where Linpack guru Professor Jack Dongarra has a say – as well as the financiers at the United States Department of Energy (DOE). In mid-October, the contract with Cray was finally signed: a Cray XK6 cluster called Titan with a total of 18,688 Interlagos processors and 600 terabytes of memory. The primary computing power, however, will be provided by 18,688 NVIDIA Tesla chips of the next GPU generation, Kepler. This GPU has been specially optimised for Linpack where, according to NVIDIA's chief scientist Bill Dally, Kepler manages an efficiency of over 90 per cent. Consequently, Titan could deliver up to 20 petaflops. Maybe the Bulldozers are particularly well suited to feeding these computing workhorses?
The high-performance computing centre in Stuttgart, Germany, is also betting on Interlagos, although without NVIDIA GPUs, in its Cray XE6 cluster, Hermit. The test phase is over and the installation is under way: 3552 nodes, each with two Interlagos processors (Opteron 6276, 2.3 GHz, 16 MB L3), which are supposed to provide a total peak performance of 1 petaflop, about one third of the pure processor performance of Titan. The next upgrade is scheduled for 2013: Cray Cascade with 4 to 5 petaflops.
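The Stuttgart figures can be roughly checked with a few lines of Python. This is only a back-of-the-envelope sketch: the node count, clock rate and Titan processor count come from the text above, whereas the 16 cores per processor and the 4 double-precision flops per cycle per core (one shared FPU per two-core module) are assumptions, and the comparison with Titan further assumes the same peak per processor in both machines.

```python
# Figures quoted in the article.
HERMIT_NODES = 3552
SOCKETS_PER_NODE = 2
CLOCK_HZ = 2.3e9              # Opteron 6276
TITAN_PROCESSORS = 18_688

# Assumptions, not stated in the article.
CORES_PER_PROCESSOR = 16      # assumed; consistent with a total of 113,664 cores
DP_FLOPS_PER_CYCLE_PER_CORE = 4   # assumed: shared FPU, 8 DP flops/cycle per 2-core module


def hermit_processors() -> int:
    """Number of Interlagos processors in the Stuttgart system."""
    return HERMIT_NODES * SOCKETS_PER_NODE


def hermit_cores() -> int:
    """Total core count under the 16-cores-per-processor assumption."""
    return hermit_processors() * CORES_PER_PROCESSOR


def hermit_peak_flops() -> float:
    """Theoretical peak: cores x clock x flops per cycle per core."""
    return hermit_cores() * CLOCK_HZ * DP_FLOPS_PER_CYCLE_PER_CORE


def cpu_ratio_to_titan() -> float:
    """Processor-count ratio, a proxy for CPU-only peak if per-chip peak is equal."""
    return hermit_processors() / TITAN_PROCESSORS


if __name__ == "__main__":
    print(f"processors: {hermit_processors()}")
    print(f"cores:      {hermit_cores()}")
    print(f"peak:       {hermit_peak_flops() / 1e15:.2f} petaflops")
    print(f"vs Titan:   {cpu_ratio_to_titan():.2f} of its CPU-only peak")
```

Under these assumptions the peak comes out slightly above 1 petaflop, and the processor-count ratio is a little under 0.4, which is in the neighbourhood of the "about one third" quoted above.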
Intel is already shipping its next server processor generation, Sandy Bridge EP, to data centres without having officially launched it. At the Leibniz-Rechenzentrum in Garching near Munich, federal and state representatives have already attended the ceremonial inauguration of the new building for the SuperMUC, which alone cost almost €50 million (about $70 million) – after all, the building has to meet special requirements to allow for the planned hot-water cooling of the specially designed IBM iDataPlex racks and to make use of the waste heat. In terms of core count, the SuperMUC (112,896) will be pretty much on the same level as the Hermit in Stuttgart (113,664), but the SuperMUC is supposed to deliver up to 3 petaflops of peak performance.
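What the nearly identical core counts hide is the difference in peak performance per core. The following sketch simply divides the peak figures by the core counts given above; since the 1 and 3 petaflops are rounded numbers from the text, the results are rough estimates rather than measured values.

```python
# Core counts and (rounded) peak performance as quoted in the article.
HERMIT = {"cores": 113_664, "peak_flops": 1e15}
SUPERMUC = {"cores": 112_896, "peak_flops": 3e15}


def gflops_per_core(system: dict) -> float:
    """Theoretical peak per core in gigaflops."""
    return system["peak_flops"] / system["cores"] / 1e9


def per_core_advantage(a: dict, b: dict) -> float:
    """How many times higher the per-core peak of system a is than that of b."""
    return gflops_per_core(a) / gflops_per_core(b)


if __name__ == "__main__":
    print(f"Hermit:   {gflops_per_core(HERMIT):.1f} gigaflops per core")
    print(f"SuperMUC: {gflops_per_core(SUPERMUC):.1f} gigaflops per core")
    print(f"ratio:    {per_core_advantage(SUPERMUC, HERMIT):.1f}x")
```

With these inputs, each SuperMUC core is credited with roughly three times the peak of a Hermit core.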
Be that as it may, none of the aforementioned supercomputers can hope to succeed in the next race for the top spot on the Top500 list of supercomputers, which is due in mid-November. The Japanese K Computer should remain unchallenged unless the Chinese pull a rabbit out of the hat.