Processor Whispers: About 16- and 17-core processors
by Andreas Stiller
With the unusual number of 17 cores, IBM’s new BlueGene/Q processor draws attention. AMD’s 16-core processor Interlagos might arrive a bit earlier than expected and the Itanium shows new signs of life.
The “Heptakaideka-Core” processor BlueGene/Q, presented by IBM at the supercomputing conference SC10 in New Orleans, is intended to power the 20 Pflops computer Sequoia, which IBM is supposed to deliver to the Lawrence Livermore National Laboratory in about two years’ time. However, only 16 of its 17 cores are meant for computing; the extra core will handle control and I/O tasks. Strictly speaking, the BlueGene/Q has 18 cores, as there is a spare core that serves to improve the yield or the reliability during operation. Unlike its BlueGene predecessors, the Q version has been upgraded to 64-bit processing, and its SIMD unit has been widened so that it can now execute four double-precision fused multiply-add instructions, or eight floating-point operations, per clock. Accordingly, at a clock speed of 1.6 GHz, the 16 compute cores would manage 205 Gflops – but resourceful software engineers may still improve the performance even further by making the seventeenth core calculate too. Additionally, the processor supports four-way SMT and so, for instance, provides the operating systems (RHEL6 on the I/O nodes, a special compute OS on the computing nodes) with 64 “logical” cores, or threads.
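The peak figure can be retraced with a few lines of arithmetic. The following sketch uses only the numbers quoted above; the function name `peak_gflops` is made up for illustration, and the 17-core value is purely hypothetical.

```python
def peak_gflops(cores: int, flops_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak in Gflops: cores x flops per core and clock x GHz."""
    return cores * flops_per_clock * clock_ghz


# Four double-precision FMAs per clock, each counted as a multiply plus an add.
FLOPS_PER_CLOCK = 4 * 2

compute_only = peak_gflops(16, FLOPS_PER_CLOCK, 1.6)  # the 16 compute cores
with_17th = peak_gflops(17, FLOPS_PER_CLOCK, 1.6)     # hypothetical: control core computes too
logical_cores = 16 * 4                                # four-way SMT on the compute cores

print(f"{compute_only:.1f} Gflops, {with_17th:.1f} Gflops, {logical_cores} threads")
```

The quoted 205 Gflops is therefore the rounded value of 204.8 Gflops; a seventeenth computing core would add another 12.8 Gflops on paper.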
Thanks to the 64-bit support, the modules can now carry 8 or 16 GB of DDR3 memory. Five links (2 GB/s per direction) connect each module to its neighbours, making it possible to create different 5D topologies. Half a rack with 8,192 BlueGene/Q cores has already proven its capabilities in the Linpack benchmark. With 65.3 Tflops, the test system from the Thomas J. Watson Research Center took 115th place in the new Top500 list. Drawing 38.8 kW, it set a new record for energy efficiency at close to 1,700 Mflops/Watt. Sequoia is to get 96 fully equipped racks, which should deliver 20 Pflops of theoretical peak performance by the end of 2012.
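The efficiency and peak figures can be cross-checked from the quoted numbers. The sketch below is a back-of-the-envelope calculation, not an official breakdown: it assumes that the 8,192 cores count only the 16 compute cores per chip, takes the per-chip peak of 16 cores times eight operations per clock at 1.6 GHz from the processor description, and uses variable names chosen purely for illustration.

```python
# Linpack result and power draw of the half-rack test system (figures from the text).
linpack_tflops = 65.3
power_kw = 38.8

# Tflops -> Mflops is a factor of 1e6, kW -> W a factor of 1e3.
mflops_per_watt = (linpack_tflops * 1e6) / (power_kw * 1e3)

# Half a rack holds 8,192 cores, so a full rack holds 16,384.
# Assumption: only the 16 compute cores of each chip are counted.
chips_per_rack = 2 * 8192 // 16
chip_peak_gflops = 16 * 8 * 1.6

# Theoretical peak of the half rack and the share of it reached in Linpack.
half_rack_peak_tflops = (chips_per_rack // 2) * chip_peak_gflops / 1e3
linpack_efficiency = linpack_tflops / half_rack_peak_tflops

# 96 fully equipped racks for Sequoia.
sequoia_peak_pflops = 96 * chips_per_rack * chip_peak_gflops / 1e6

print(f"{mflops_per_watt:.0f} Mflops/W")
print(f"Linpack efficiency: {linpack_efficiency:.0%}")
print(f"Sequoia peak: {sequoia_peak_pflops:.2f} Pflops")
```

Under these assumptions, 96 racks with 1,024 chips each land just above the announced 20 Pflops.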
The True AVX Processor
By then, AMD’s 16-core Interlagos with the new Bulldozer architecture should already have been on the racetrack for quite a while. At SC10, AMD even raised the scientists’ hopes that the processor might be ready earlier than expected, which would mean before the third quarter of 2011. The widespread doubt in the HPC scene concerning the “halved” FPU – a Bulldozer module contains two integer cores, but only one FPU – was more or less coherently countered by AMD with the argument that the “Flex FP” is capable of executing two 128-bit instructions (SSE, AVX) simultaneously. In particular, this is true for the fused multiply-add instructions (FMA), which are much valued in HPC. These are not supported by Intel’s Sandy Bridge and will probably be missing from the feature list of its successor, Ivy Bridge, too. Only for the 256-bit AVX operations, which are still rarely used at present, does Bulldozer link both units.
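AMD’s argument can be put into numbers. The following sketch counts double-precision operations per clock for one Bulldozer module, assuming two 128-bit FMA units that are linked for a 256-bit operation as described; the helper name `fma_flops` is invented for this illustration, and the module count of eight follows from 16 cores at two cores per module.

```python
DOUBLE_BITS = 64


def fma_flops(vector_bits: int) -> int:
    """Flops of one packed double-precision FMA: a multiply and an add per lane."""
    lanes = vector_bits // DOUBLE_BITS
    return lanes * 2


# Mode 1: the two 128-bit units work independently, one FMA each per clock.
two_128bit = 2 * fma_flops(128)

# Mode 2: both units are linked to execute a single 256-bit AVX FMA.
one_256bit = 1 * fma_flops(256)

modules = 16 // 2  # two integer cores share one module and one Flex FP
chip_flops_per_clock = modules * two_128bit

print(two_128bit, one_256bit, chip_flops_per_clock)
```

In both modes the module tops out at the same number of operations per clock, so the shared FPU costs nothing in peak throughput as long as FMA instructions are used.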
Consequently, Interlagos with its eight modules or 16 cores manages 64 double-precision floating-point operations per clock, which makes 224 Gflops at 3.5 GHz. At this clock speed, Intel’s planned 8-core Sandy Bridge EP will achieve the same theoretical peak value: while it doesn’t support FMA, it is able to execute an AVX multiplication and an AVX addition in parallel, each in full 256-bit width. Bulldozer’s clock rate specification of 3.5 GHz and the number of transistors per module (213 million) can be found in the abstracts of the presentations for the next International Solid-State Circuits Conference (ISSCC) in February 2011. There, apart from some further details on Sandy Bridge and Westmere-EX, Intel above all intends to release the first specifications for the next Itanium generation, Poulson. The abstract already gives away some details: 32-nm technology, 8 cores with simultaneous multi-threading (SMT), 12-issue superscalar (4 bundles with 3 instructions each per clock; twice as many as before), 3.1 billion transistors on 544 mm², a total of 50 MB of cache, 128 GB/s of bandwidth between the processors and 45 GB/s of memory bandwidth. Now there are speculations that Poulson might feature fine-grained SMT – maybe even with different priorities, like the Power7. A further similarity with the Power7 architecture could be a possible switch to out-of-order execution. This was pointed out by David Kanter from realworldtech.com, in whose forum a certain Linus Torvalds provides some pithy lines against the “failed” Itanium architecture.
The Itanium knows modern vector extensions such as SSE only in its 32-bit emulation. It’s not known whether Poulson will have AVX or maybe even something better, but in any case its complement of transistors – after deducting the caches – would most likely be insufficient for that.
However, the current Advanced Vector Extensions (AVX) are not the same as those Intel presented almost three years ago. Since then, Intel has eliminated some permutation instructions and added 256-bit streaming instructions. Most importantly, the FMA operations (like “VFMADDPD”), originally planned as four-operand instructions, have been reduced to three operands. One source operand is consequently overwritten by the result; the instruction variant determines whether that is one of the factors or the summand. “VFMADD213PD”, for example, multiplies the second operand by the first, adds the third and overwrites the first with the result.
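The operand order is easiest to see in a small model. The sketch below imitates the three-operand variants on Python lists standing in for packed-double registers; the register contents and the helpers `fma3`, `fma4` and `fresh` are invented for illustration, the four-operand form is modelled as a non-destructive `dest = src1 * src2 + src3`, and the arithmetic is an ordinary multiply followed by an add rather than a truly fused operation with a single rounding.

```python
def fma3(variant: str, regs: dict, op1: str, op2: str, op3: str) -> None:
    """Model of VFMADD<variant>PD op1, op2, op3 (op1 is also the destination).

    The first two digits name the operands that are multiplied,
    the third digit names the operand that is added.
    """
    ops = {"1": regs[op1], "2": regs[op2], "3": regs[op3]}
    a, b, c = (ops[digit] for digit in variant)
    # Not fused: Python rounds after the multiply and again after the add.
    regs[op1] = [x * y + z for x, y, z in zip(a, b, c)]


def fma4(regs: dict, dest: str, src1: str, src2: str, src3: str) -> None:
    """Model of the four-operand form: dest = src1 * src2 + src3, sources kept."""
    regs[dest] = [x * y + z for x, y, z in
                  zip(regs[src1], regs[src2], regs[src3])]


def fresh() -> dict:
    """Return a new register file with two doubles per 128-bit register."""
    return {"xmm0": [0.0, 0.0], "xmm1": [2.0, 3.0],
            "xmm2": [4.0, 5.0], "xmm3": [10.0, 20.0]}


r = fresh()
fma3("213", r, "xmm1", "xmm2", "xmm3")   # xmm1 = xmm2 * xmm1 + xmm3
result_213 = r["xmm1"]

r = fresh()
fma3("231", r, "xmm1", "xmm2", "xmm3")   # xmm1 = xmm2 * xmm3 + xmm1
result_231 = r["xmm1"]

r = fresh()
fma4(r, "xmm0", "xmm1", "xmm2", "xmm3")  # xmm0 = xmm1 * xmm2 + xmm3
result_fma4, untouched = r["xmm0"], r["xmm1"]

print(result_213, result_231, result_fma4, untouched)
```

With three operands, keeping all inputs intact requires copying one of them beforehand; the four-operand form writes to a separate destination and leaves the sources as they were.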
It seems probable that these changes were encouraged by Ronak Singhal, under whose lead the Haswell processor – slated for 2012 – is being developed for the 22-nm process in Oregon (and, later on, Rockwell in 16-nm structures), in order to make the instruction set compatible with that of the future 512-bit-wide vector unit. That instruction set, which previously surfaced as the Larrabee New Instruction Set (LNI), has exactly the same syntax for FMA; it supports only three operands.
Meanwhile, in addition to the new features, AMD intends to offer the initially planned four-operand version (FMA4) for Bulldozer, although with a slightly changed encoding. The permutation instructions that Intel eliminated will probably be supported by Bulldozer as well. Seen this way, Bulldozer, not Sandy Bridge, is the processor that Intel originally had in mind for AVX.