Processor Whispers - About Big Wheel Loaders and Mini Excavators
by Andreas Stiller
AMD says goodbye to old odds and ends like 3DNow!, drops the name ATI, and presents its upcoming hot chips in somewhat more detail at the conference of the same name.
Bulldozer, Bobcat and Llano – these are three rather different processor architectures that AMD plans to use for chips in the coming years. At the Hot Chips conference at Stanford University in Palo Alto, AMD developers shed considerable light on two of these architectures. Just like the Atom, the designated Atom competitor Bobcat has been “designed from scratch” – although, with its two-issue superscalar design, its integer and FPU pipelines, its two L1 caches with 32 KB each and its 512 KB L2 cache, it is strongly reminiscent of a good old acquaintance: the successful K6-2, now upgraded with 64 bit, the C6 sleep state, SSE1, 2, 3 and SSSE3 as well as a “High Performance Bus” that has not been specified any further. 3DNow!, the SIMD extension that was introduced with the K6-2, could most likely be implemented as well, but there seems to be no need; AMD has announced that it is going to drop the instruction set extension, which has fallen into disuse. A pity, considering that AMD had dedicated one instruction especially to me, for computing the Mandelbrot set: I had complained about a missing swap instruction – and who would have guessed, AMD added it in the next stepping.
Bobcat has now been optimised primarily for use in notebooks. AMD proudly points out that it employs an out-of-order (OoO) architecture – as the K6 did, and as the Intel Core and AMD K10 processors do. The relatively complex OoO architecture can bridge many idle periods by cleverly reordering instructions. The Atom, in contrast, only offers a simple in-order architecture, but it features hyper-threading, which – most of the time – manages to avoid idle periods by letting the other thread do the work in the meantime.
OoO technology usually involves a lot of speculation, which often means additional power consumption. But it seems that this does not have to be the case; IBM, for instance, has returned to OoO with the Power7 and z196 after a short in-order intermezzo with the Power6 and z9, and the new chips are much more power efficient at a lower clock speed.
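The difference between in-order and out-of-order issue can be illustrated with a toy model. The following is only a minimal sketch under strong simplifications: it issues a single instruction per cycle (Bobcat itself is two-issue), ignores speculation entirely, and uses made-up latencies. The names `Instr`, `run` and `PROGRAM` are hypothetical and have nothing to do with any AMD or Intel design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int       # cycles until the result is available (made-up values)
    deps: tuple = ()   # names of instructions whose results are needed


def run(program, out_of_order):
    """Issue at most one instruction per cycle.

    Returns the cycle in which the last result becomes available.
    """
    ready_at = {}              # name -> cycle in which the result is ready
    pending = list(program)
    cycle = 0
    while pending:
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: any pending instruction with ready operands may issue.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            if all(d in ready_at and ready_at[d] <= cycle for d in ins.deps):
                ready_at[ins.name] = cycle + ins.latency
                pending.remove(ins)
                break          # one issue slot per cycle
        cycle += 1
    return max(ready_at.values())


# Two independent load/add pairs; each add has to wait for its load.
PROGRAM = [
    Instr("load_a", latency=3),
    Instr("add_b", latency=1, deps=("load_a",)),
    Instr("load_c", latency=3),
    Instr("add_d", latency=1, deps=("load_c",)),
]

in_order_cycles = run(PROGRAM, out_of_order=False)
out_of_order_cycles = run(PROGRAM, out_of_order=True)
print(in_order_cycles, out_of_order_cycles)
```

In this model the in-order run stalls behind each load, whereas the out-of-order run starts the second load while the first add is still waiting for its operand.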
A single Bobcat core is supposed to be able to run on less than one watt. As for the power consumption of Ontario, the first planned Bobcat processor (two cores plus a DirectX 11 graphics processor plus a memory controller between the graphics core and the CPUs), there are still no estimates available. Fudzilla.com has heard a little bird say something about 18 watts; later low-power versions are supposed to manage with as little as 9 watts.
The other processor of the planned “Fusion” series with integrated graphics processors, Llano, draws on a tried and tested – although much more elaborate – old architecture, the K8 core. It is much smaller than the current K10, thanks to, among other things, narrower internal data buses. For the Llano, however, AMD has presented no new details. Maybe someone else will steal the show from AMD by being the first to release a real CPU/GPU combo chip – no, not Intel, but Microsoft. Together with its manufacturing partner IBM, Microsoft presented the new Xbox 360 processor for the 250 GB system at the conference. It is a SoC with an integrated GPU, and it is not only faster than its predecessor but is also supposed to consume 60 per cent less power. At least the integrated GPU is from ATI – or rather, from AMD, which apparently intends to refrain from using the name ATI in the future.
The Bulldozer architecture, which, for the time being, is intended for servers, is – as was already mentioned earlier in this column – a hybrid between dual-core and hyper-threading. The module, as AMD calls it, houses two separate integer units, each with its own small 16 KB L1 data cache, as well as a shared floating point unit consisting of two MMX and two 128-bit FPU units, which can be linked for Intel’s upcoming 256-bit SIMD extension AVX. But AMD had already announced all that at an Analyst Day in November 2009.
AVX as well as SSE4.1 and SSE4.2 have now been officially confirmed for the Bulldozer for the first time; regarding the crypto extension AES, it was said during the preliminary telephone conference that AMD is still negotiating with Intel over implementation details. The fused multiply-add instruction, which is still missing from Intel’s next processor Sandy Bridge, is already on board with the Bulldozer, but apparently in a proprietary version (AMD 4 Operand Form). That is most likely what remains of the once planned SSE5 extension – whether Bulldozer will also support the other SSE5 instructions, AMD couldn’t tell us yet.
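What makes a fused multiply-add numerically different from a separate multiply and add is that the intermediate product is not rounded. The sketch below only emulates this effect with exact rational arithmetic; it says nothing about the instruction encoding or about AMD’s 4 Operand Form, and the helper name `fused_multiply_add` is made up for this illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a*b + c with a single rounding at the very end.

    The product and sum are computed exactly as rationals and only the
    final result is converted (rounded) to a double.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply and add: the exact product 1 - 2**-60 is rounded
# to 1.0 before the addition, so the small term is lost.
separate = a * b + c

# Fused: the small term survives because rounding happens only once.
fused = fused_multiply_add(a, b, c)

print(separate, fused)
```

With these operands the separate version yields zero, while the fused version keeps the tiny negative remainder.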
The frontend of the Bulldozer pipeline, which is responsible for buffering (in the 64 KB L1 I-cache), fetching and decoding the instructions, as well as the L2 cache (2 MB, 16-way) used for instructions and data, are likewise shared by the whole module. These cache sizes apply to the first implementation of the architecture in 32-nm SOI technology, which goes by the name Orochi.
The frontend appears to be dimensioned rather tightly. For instance, it features only four x86 decoders (fast path) for the entire module – as many as the Nehalem has for a single core alone, even though the decoders have to feed two logical cores here. Even the current AMD K10 has three fast decoders per core. With its 8 modules – depending on the point of view, 8 to 16 cores – the Bulldozer server chip Interlagos is supposed to deliver about 70 per cent more integer performance (SPECint) than the 12-core Magny-Cours, so the frontend does not really appear to be “starving” the cores. As well as the big Interlagos with up to 8 MB of L3 cache for all modules on the chip, AMD intends to release chips half the size for servers (Valencia) and high-end desktop PCs (Zambezi).
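The figures quoted above invite a quick back-of-the-envelope calculation. The sketch below assumes that a module counts as two integer cores, that “about 70 per cent” is exactly 1.7, and that throughput scales linearly with the number of cores; clock speeds are ignored. None of these assumptions come from AMD.

```python
from fractions import Fraction

# Figures quoted in the column
MODULE_DECODERS = 4          # fast-path x86 decoders per Bulldozer module
CORES_PER_MODULE = 2         # integer cores fed by one module frontend
K10_DECODERS_PER_CORE = 3
INTERLAGOS_MODULES = 8
MAGNY_COURS_CORES = 12
SPECINT_RATIO = Fraction(17, 10)   # assumption: "about 70 per cent more" = 1.7

# Decoders available per integer core when both cores are busy
decoders_per_int_core = Fraction(MODULE_DECODERS, CORES_PER_MODULE)

# Integer throughput per core relative to a Magny-Cours core
interlagos_int_cores = INTERLAGOS_MODULES * CORES_PER_MODULE
per_core_ratio = SPECINT_RATIO * MAGNY_COURS_CORES / interlagos_int_cores

print(decoders_per_int_core, K10_DECODERS_PER_CORE)
print(interlagos_int_cores, float(per_core_ratio))
```

Under these assumptions each integer core would get two decoders instead of the K10’s three, yet deliver roughly 1.3 times the integer throughput of a Magny-Cours core.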
Although the Interlagos offers one third fewer computing cores for floating point calculations than the Magny-Cours, its SPECfp performance is supposed to be one third higher thanks to AVX, FMA and a superior memory interface – although the FPUs are not even linked to the small L1 cache. Intel’s less than successful Itanium had an L1 bypass for the FPUs, too. Hopefully, that’s not a bad omen ...
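The floating point claim can be put into numbers in the same way. This rough sketch assumes one shared FPU per module (8 in Interlagos), one FPU per core in the Magny-Cours (12), and takes “one third higher” literally as a factor of 4/3; clock speeds and memory effects are ignored.

```python
from fractions import Fraction

INTERLAGOS_FPUS = 8      # assumption: one shared FPU per module
MAGNY_COURS_FPUS = 12    # assumption: one FPU per core
SPECFP_RATIO = Fraction(4, 3)   # "one third higher"

# Share of FPUs that Interlagos gives up compared with Magny-Cours
fewer_fpus = 1 - Fraction(INTERLAGOS_FPUS, MAGNY_COURS_FPUS)

# Throughput each Interlagos FPU would have to deliver relative to
# a Magny-Cours FPU to reach the quoted SPECfp figure
per_fpu_ratio = SPECFP_RATIO * MAGNY_COURS_FPUS / INTERLAGOS_FPUS

print(fewer_fpus, per_fpu_ratio)
```

Taken at face value, each shared FPU would have to deliver about twice the throughput of a Magny-Cours FPU, which fits the doubling of vector width from 128 to 256 bits that AVX brings.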