Processor Whispers – About Cycles and Circuits
By Andreas Stiller
In Europe, close to Geneva, the protons are circling the Large Hadron Collider again; meanwhile in San Francisco at the International Solid-State Circuits Conference (ISSCC) the talks centred on the new chips from IBM, Intel, AMD and co.
IBM began the circle dance of the enterprise session with the z196, which it had presented last year and which was partially developed in the Swabian city of Böblingen. Celebrated as the fastest microprocessor, the quad-core chip now manages 5.2 GHz at about 260 watts. Six of these chips are mounted on a multi-chip module that requires a total power input of 1,800 watts. “We think there is still room for future improvements, but frequency increases won't go on forever”, warned Jim Warnock, IBM’s leading engineer.
Professor Weiwu Hu from Peking doesn’t need such a high clock rate for his chips. Last summer, he caused quite a stir with his presentation of the Godson (Loongson) 3B and 3C at the Hot Chips Conference. Now he announced that the Godson 3B, clocked at only 1.05 GHz and built in 65-nm technology, is in full production at STMicroelectronics. In spite of its comparatively slow clock speed, its eight cores are supposed to deliver 128 Gflops thanks to the vector unit, and that at only 40 watts of power consumption. The supercomputer Dawning 6000 with 3000 Godson 3B processors, which had been expected for the end of 2010, will supposedly be completed this summer and – with 300 teraflops – is expected to obtain one of the top places in the Top500 list of supercomputers. The Godson would be the first processor with MIPS architecture in the list since 2004. At the same Hot Chips Conference, Professor Hu had announced the successor Godson 3C with 16 cores, produced in 28 nm, for 2012, but now he said it will take at least two more years.
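The Godson figures can be cross-checked with a little arithmetic. The following Python sketch uses only the numbers quoted above; the function names are purely illustrative, and treating 128 Gflops as a theoretical peak value per chip is an assumption.

```python
# Back-of-the-envelope check of the quoted Godson 3B figures.
# Assumption: 128 Gflops is the theoretical peak of one eight-core chip.

def flops_per_core_per_cycle(chip_gflops, cores, clock_ghz):
    """Floating-point operations each core must finish per clock cycle."""
    return chip_gflops / (cores * clock_ghz)

def system_peak_tflops(chip_gflops, chips):
    """Sum of the per-chip peak values, converted to teraflops."""
    return chip_gflops * chips / 1000.0

def gflops_per_watt(chip_gflops, watts):
    """Peak throughput per watt of power consumption."""
    return chip_gflops / watts

per_cycle = flops_per_core_per_cycle(128, 8, 1.05)
peak = system_peak_tflops(128, 3000)
efficiency = gflops_per_watt(128, 40)

print(f"flops per core and cycle: {per_cycle:.1f}")
print(f"summed peak of 3000 chips: {peak:.0f} teraflops")
print(f"peak efficiency: {efficiency:.1f} Gflops per watt")
```

Under these assumptions each core would have to complete about 15 floating-point operations per cycle, and 3000 chips would add up to 384 teraflops of summed peak. The 300 teraflops quoted for the Dawning 6000 lies below that sum, so it may refer to a sustained or differently counted value.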
A 20 Thread Server CPU
Intel used the ISSCC to present two new processors. The Westmere-EX with 10 cores is almost finished and will probably launch soon – rumours say April. Fortunately, “A 20 Thread Server CPU” doesn’t refer to a multi-threaded A20 Gate, but to the fact that each Westmere-EX processor acts as a logical 20-core processor. Consequently, a four-socket system works with 80 logical cores – and using them as a whole represents a new challenge. Under Windows Server 2008 R2, for instance, applications will first have to learn to work with processor groups; otherwise they can only detect a maximum of 64 logical cores.
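The core count behind this limit is easy to reproduce. The sketch below is a simplified model in Python, not a call into the Windows API: it only counts logical processors and the minimum number of 64-entry groups needed to hold them. The function names are hypothetical, and how Windows actually distributes processors across groups depends on the system topology and is not modelled here.

```python
import math

# Upper limit of logical processors in one Windows processor group.
MAX_LOGICAL_PER_GROUP = 64

def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Total number of logical processors in the system."""
    return sockets * cores_per_socket * threads_per_core

def min_processor_groups(logical, group_size=MAX_LOGICAL_PER_GROUP):
    """Smallest number of groups that can hold all logical processors."""
    return math.ceil(logical / group_size)

# Four Westmere-EX sockets, 10 cores each, 2 threads per core.
total = logical_processors(sockets=4, cores_per_socket=10, threads_per_core=2)
groups = min_processor_groups(total)

# An application that ignores processor groups stays inside one group,
# so 64 is an upper bound on what it can see (possibly fewer in practice).
upper_bound_visible = min(total, MAX_LOGICAL_PER_GROUP)

print(total, groups, upper_bound_visible)
```

With 80 logical processors, at least two groups are required, which is why software that is unaware of groups cannot use the whole machine.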
The current Nehalem-EX didn’t make too many friends because of its high power consumption; in this respect, it couldn’t compete with AMD’s Magny-Cours. However, as Intel’s Senior Engineer Shankar Sawan from Bangalore, India explained, the Westmere-EX not only has more cores as well as the new crypto instruction set extension AES-NI, but also numerous power management improvements. The 256-bit wide ring bus that interconnects the cores is said to save half a watt on its own in comparison to its predecessor.
As far as the new power-saving modes are concerned, the Westmere-EX has the deep sleep state C6 (not supported by its predecessor Nehalem-EX) and the possibility of dynamically shutting down unused components to significantly reduce the power consumption. For example, the four power-hungry scalable memory interfaces (SMI) together with the buffers linked to them may be shut down. So, while the processor stays within the same TDP range (for example, the Xeon E7-4870 with 10 cores, 2.4 GHz clock rate, 30 MB L3 cache at 130 watts TDP), it also reduces the system’s power consumption. As it supports low voltage DIMMs, that’s another watt saved per DIMM, and when using the new low power modules the buffers will only consume 1.9 watts in standby, compared to between 6 and 7 watts per buffer in Nehalem-EX systems.
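To get a feeling for the size of the buffer savings, the quoted standby figures can be put into a small Python sketch. The per-buffer values come from the text; the assumption of one buffer per SMI link, and therefore four buffers per processor, is only illustrative and is not stated in the presentation.

```python
# Standby power of the memory buffers, using the quoted per-buffer values.
NEW_BUFFER_STANDBY_W = 1.9           # Westmere-EX with low power modules
OLD_BUFFER_STANDBY_W = (6.0, 7.0)    # quoted range for Nehalem-EX systems

def buffer_standby_saving(buffers, old_watts, new_watts=NEW_BUFFER_STANDBY_W):
    """Standby power saved across all buffers, in watts."""
    return buffers * (old_watts - new_watts)

# Assumption: one buffer on each of the four SMI links of a processor.
BUFFERS_PER_PROCESSOR = 4

low = buffer_standby_saving(BUFFERS_PER_PROCESSOR, OLD_BUFFER_STANDBY_W[0])
high = buffer_standby_saving(BUFFERS_PER_PROCESSOR, OLD_BUFFER_STANDBY_W[1])

print(f"standby saving per processor: {low:.1f} to {high:.1f} watts")
```

Under this assumption, the buffers alone would save roughly 16 to 20 watts per processor in standby, before counting the watt saved per low voltage DIMM.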
Intel hasn’t disclosed the processor’s number of transistors but it probably has around 2.9 billion. This will be exceeded by Intel’s other server processor: Poulson, the next offspring of the Itanium family, which is supposed to make its debut in the more or less up to date 32-nm technology with 3.1 billion transistors in 2012. With the new 8-core Itanium, Intel aims to finally play a leading role in the performance league: twice the number of cores, twice the amount of parallel executable instruction bundles per core, improved pipelines and probably a much higher clock speed – everything hints at a highly significant performance increase in comparison to Tukwila. In order to prevent the power consumption from going wild, Intel has also heavily optimized the power management. In comparison to Tukwila, the Poulson cores consume 70 per cent less power while idle and are 60 per cent more efficient under full load. As these values are based on a projected 32-nm Tukwila, the improvements have to be the result of the design. Who knows, maybe the Itanium will actually become popular again. Unfortunately, though, Intel’s own compiler people don’t seem to believe so: the newest Intel C++ Composer XE 2011 no longer supports the Itanium.
Bulldozer vs. Llano: Thanks to 30,000 individual clock enables, the clock lines “spark” much less frequently.
At the ISSCC, AMD presented some more detailed information about how the Bulldozer processor, planned for early summer, will distribute the instructions in the module consisting of two integer cores and a Floating Point Unit. All three units are fed by a front-end with a shared instruction cache (64 KB). The decoders can deliver four instructions per clock to the units’ three individual schedulers. All in all, this means that half a Bulldozer module, which AMD counts as a core, has fewer resources at its disposal than a K10 core, while – at 16 KB – the L1 data cache is also much smaller. In spite of this, according to AMD, the integer core reaches 90 per cent of the efficiency, expressed in integer instructions per clock, of a hypothetical single-core Bulldozer with a K10-like architecture that AMD didn’t specify any further.
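What the 90 per cent figure means for the clock speed can be estimated with a simple model in which single-thread performance is the product of instructions per clock and clock frequency. The Python sketch below rests on two loud assumptions: it applies the 90 per cent directly against a K10 core, which AMD did not state, and the K10 reference clock of 3.0 GHz is a purely hypothetical example value.

```python
# Simplified model: performance = instructions per clock * clock frequency.

def required_clock_ghz(reference_clock_ghz, relative_ipc):
    """Clock needed to match a reference core at a lower relative IPC."""
    if relative_ipc <= 0:
        raise ValueError("relative_ipc must be positive")
    return reference_clock_ghz / relative_ipc

RELATIVE_IPC = 0.90      # integer efficiency quoted by AMD
K10_CLOCK_GHZ = 3.0      # hypothetical reference clock, not from AMD

clock_ratio = 1.0 / RELATIVE_IPC
needed = required_clock_ghz(K10_CLOCK_GHZ, RELATIVE_IPC)

print(f"clock must rise by a factor of {clock_ratio:.2f}")
print(f"matching a {K10_CLOCK_GHZ} GHz reference takes {needed:.2f} GHz")
```

In this model, an increase in clock speed of about 11 per cent is enough to make up for the lower efficiency per clock.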
Consequently, it only requires a slightly higher clock speed – AMD says 3.5 GHz and is probably aiming at the server processor Interlagos – for single-threaded software to run at least as fast on the Bulldozer as it does on the K10. For servers, where multi-thread throughput matters much more, the Interlagos with its six or eight modules should really crank it up. In another presentation, about the new power management, AMD pointed out that the processor nonetheless doesn’t over-rev: Bulldozer has plenty of new clock gates to reduce the power consumption, and the number of individual clock enables is now supposed to exceed 30,000. AMD didn’t comment on the rumours that it might be taken over by Dell. Analysts regard the idea as far-fetched: AMD would be much too expensive, as its share price in relation to income is three times as high as Intel’s.
(djwm)