Processor Whispers - About Launches and Corsairs
by Andreas Stiller
On the last day of the 37th International Symposium on Computer Architecture (ISCA 2010), the Intel developer team caused quite a commotion in the corsair stronghold Saint-Malo: they debunked the hundred-times-faster myth of GPUs compared to CPUs (Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU). Normally, this symposium is concerned with the fundamentals of future architectures, not performance comparisons of commercial hardware – and if such comparisons are made, they are presented by neutral scientists, rather than an involved company.
On the other hand, the Intel crew, under captain Victor W. Lee, cannot really be accused of technical flaws or an absurd selection of workloads. The fourteen benchmarks they used are classics, mostly taken from the scientific sector: SGEMM, FFT, Lattice Boltzmann (LBM), Ray Casting (RC), Search & Sort, Collision Detection (GJK), Constraint Solver (Solv) and so on.
Saint-Malo
Intel’s GPU vs. CPU comparison (Nvidia GTX 280 vs. Core i7 960) shows a factor of only 2.5 for the geometric average.
Source: Intel
The participants in Intel’s regatta were an Nvidia GTX-280 frigate and an Intel i7-960 corvette. So far, so good, but it seems they encountered problems maintaining the course when calculating the results – maybe an old Pentium cog with FDIV bug? In any case, my old Athlon-64 says that 1020 divided by 67 – the indicated values for the Collision Detection benchmark (GJK) – does not equal 14.9 (as listed in Intel’s chart), but rather 15.2 ...
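The division is easy to re-run on any machine without an FDIV bug. A minimal Python sketch, using only the two GJK values and the charted ratio quoted above (the variable names are ours, not Intel's):

```python
# The two values indicated for the Collision Detection (GJK) benchmark,
# and the ratio listed in Intel's chart, as quoted in the text.
gjk_values = (1020, 67)
charted_ratio = 14.9

# Recompute the ratio and round it to one decimal place, like the chart.
ratio = gjk_values[0] / gjk_values[1]
rounded = round(ratio, 1)

print(f"computed: {rounded}, charted: {charted_ratio}")
```

The recomputed value rounds to 15.2, not the charted 14.9.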
The Intel benchmarkers even ignored the problematic issue of data transfer to and from the graphics card for all benchmarks, even though that’s exactly where – in practical operation – the resulting performance deviates strongly from the bare GPU performance. So, from this point of view, the numbers have actually been sugar-coated for Nvidia. The results for, say, Single-Precision General Matrix Multiply (SGEMM) pretty much coincide with the ones c’t magazine published a year ago, except that the hardware c’t compared was a bit more powerful: GTX 285 vs. Core i7 965.
And that’s one of the criticisms that Nvidia has brought to bear against Intel’s “evaluation”: totally outdated hardware. Indeed, at an event dedicated to future architectures, one would have expected a comparison between the newest technologies, such as Nvidia’s GTX 480 (about $500) or the Radeon 5970 ($670) vs. Intel’s six-core Core i7 980X ($1000). Nvidia goes on to quote numerous results from independent universities and research institutions that confirm a performance increase of at least 100 times – in comparison to any CPU – for their workloads.
Who knows, maybe the whole thing was just a tit-for-tat response directed against Nvidia’s head scientist Bill Dally. The renowned Stanford professor, who received the distinguished Eckert-Mauchly Award at the same symposium, had indirectly laid into Intel in the May edition of Forbes magazine by announcing the end of Moore’s law and declaring that multi-core processors, composed of many cores optimised for serial performance, are a dead end in the effort to design efficient parallel computers.
Lisbon
AMD is off the hook in any case, as it is the only company to offer both architectures. Under the name Fusion, AMD is even working on combining them; the first prototype was recently shown at Computex. The long-expected new GPU version for high-performance computing was also finally presented there. At about the same time, AMD launched the Opteron 4100 family with the codename Lisbon. On the inside, Lisbon is still an Istanbul processor with four or six cores, but it connects to the outside world through the new C32 socket via the so-called Direct Connect Architecture 2.0: HyperTransport 3.0 (two 16-bit links), DDR3-1333 (two channels, also in low power), extended virtualisation features (AMD-V, including I/O virtualisation with IOMMU) and significantly improved power management (AMD-P).
The little brothers of the Opteron 6100 (Magny-Cours) are not really designed for the HPC sector, but rather for the even bigger market of cloud computing, for web services as well as for small and medium business (SMB), where scalability, cost, density and especially energy efficiency are paramount. At 5.83 watts/core, the 4162EE/4164EE models claim the lowest per-core consumption worldwide, much lower than the 10 watts/core of Intel’s L5609. However, this comparison is rather curious, as AMD takes the Average CPU Power (ACP) as the basis for evaluating the Opterons while Intel uses the Thermal Design Power (TDP), which indicates the maximum.
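The per-core figures are plain quotients of a rated power and a core count. The sketch below assumes 35 W ACP for the six-core 4164EE and 40 W TDP for the four-core L5609. Neither the wattages nor the core counts of these specific models are stated above, so treat them as assumptions that happen to reproduce the quoted 5.83 and 10 watts/core.

```python
def watts_per_core(rated_watts: float, cores: int) -> float:
    """Divide a chip's rated power evenly across its cores."""
    return rated_watts / cores

# ASSUMED ratings and core counts (not given in the text):
# Opteron 4164EE: 35 W ACP, 6 cores; Xeon L5609: 40 W TDP, 4 cores.
opteron_4164ee = watts_per_core(35, 6)  # based on ACP (average)
xeon_l5609 = watts_per_core(40, 4)      # based on TDP (maximum)

print(f"Opteron 4164EE: {opteron_4164ee:.2f} W/core (ACP)")
print(f"Xeon L5609:     {xeon_l5609:.2f} W/core (TDP)")
```

Because the two numerators measure different things, an average on one side and a maximum on the other, the two quotients are not directly comparable.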
The smallest member of the family, the Opteron 4122 with four cores and 2.2GHz, is the first server processor for two-way systems to break the $100 mark with a $99 OEM price. Some of the other family members are also much cheaper than their respective Intel counterparts.
And AMD’s partners are now offering larger Opteron 6100 servers. Hewlett-Packard has launched the ProLiant DL585 G7 and the blade server BL685c G7 with four G34 sockets for a total of 48 Magny-Cours cores, and also the 2P blade server BL465c G7. At the same time, HP presented the long-expected systems with Intel’s Nehalem-EX. The ProLiant DL980 G7 with Intel’s 8-core processor now features up to 8 sockets, 128 logical cores and 128 DIMM slots.
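The core counts quoted for these servers can be reproduced with a little multiplication. A minimal sketch: the 12 cores per Magny-Cours socket are derived from the 48 cores on four sockets given above, and the factor of two threads per core for the Nehalem-EX assumes Hyper-Threading is enabled. The function name is ours.

```python
def logical_cores(sockets: int, cores_per_socket: int, threads_per_core: int = 1) -> int:
    """Number of logical cores the operating system would see."""
    return sockets * cores_per_socket * threads_per_core

# Four-socket Magny-Cours system: 48 cores on 4 sockets means 12 per socket.
magny_cours_cores_per_socket = 48 // 4
four_socket_opteron = logical_cores(4, magny_cours_cores_per_socket)

# ProLiant DL980 G7: 8 sockets, 8 cores each, 2 threads per core (assumed Hyper-Threading).
dl980 = logical_cores(8, 8, threads_per_core=2)

# 128 DIMM slots spread over 8 sockets.
dimm_slots_per_socket = 128 // 8

print(four_socket_opteron, dl980, dimm_slots_per_socket)
```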
The competition, including IBM and Dell, already had their Nehalem-EX systems ready in March, and everybody had something special to show off: Dell had its FlexBridge, which, connected to a processor socket, gives other processors access to the RAM connected to that socket, and IBM had the eX5 series, which shines with external memory and CPU extensions. HP has also come up with something special, the so-called PREMA architecture with “smart CPU caching” and a “redundant system fabric”. HP developed its own chipset for this, which most likely caused the three-month delay. It’s supposed to improve the cache performance of the Nehalem-EX, which we have categorised as rather slow until now, by smartly reducing the overhead for the so-called snooping, and to increase the QPI rate by about 20% in comparison to Intel’s 7500 chipset. There are no comparable STREAM values yet, but much is still uncertain concerning the Nehalem-EX anyway, like the mysterious “Hemisphere Mode”.
(djwm)



















