Processor Whispers – About latencies and compilers
by Andreas Stiller
AMD releases details about the Bulldozer processor, Intel announces AVX compiler support for Bulldozer and Microsoft presents Windows 8 – not on Oak Trail, but on ARM.
While prototypes of the AMD Bulldozer processor – also called Family 15h – are being put to work in the B0 stepping in the benchmark departments of computer retailers everywhere, AMD has released the "Software Optimisation Guide for AMD Family 15h Processors", which contains lots of new information about the processor's inner workings and performance. For instance, it includes extensive lists with the latency of each instruction. However, the throughput figures, which are at least as important and were given as a matter of course in earlier optimisation guides, are missing. Admittedly, specifying throughput is not so simple for the Bulldozer's hybrid design – should AMD do it per module, per core or somewhere in between?
Besides, throughput specifications would draw attention to the fact that the Bulldozer's integer cores each have one pipeline fewer than the integer cores of its predecessor, the K10. AMD nevertheless boldly draws four pipelines into its block diagram, because the scheduler can now serve the two ALUs (EX0 and EX1) and the two address generation units (AG0 and AG1) separately, whereas before these units were jointly operated by glued-together micro-operations. Still, this is no serious compensation for the three capable ALUs that the K10 and the competition's Sandy Bridge feature, as the two AGUs can only offer very limited aid – apparently, they can only take part in calculations for the instructions CALL and LEA.
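The difference between latency and throughput, and why the number of ALUs matters, can be illustrated with a deliberately simplified model. The following Python sketch is not taken from AMD's guide: the function names, the one-cycle add latency and the assumption of fully pipelined ALUs with perfect scheduling are illustrative choices only.

```python
import math

def chain_cycles(n_ops, latency):
    """Cycles for n_ops where each op needs the previous result.

    Such a dependent chain is latency-bound: extra ALUs do not help.
    """
    return n_ops * latency

def independent_cycles(n_ops, latency, n_alus):
    """Cycles for n_ops independent ops on n_alus fully pipelined ALUs.

    Idealised assumption: n_alus ops can start every cycle, and the
    last batch needs `latency` cycles to finish.
    """
    if n_ops == 0:
        return 0
    return math.ceil(n_ops / n_alus) - 1 + latency

# Hypothetical workload: 12 integer adds with an assumed latency of 1 cycle.
dependent = chain_cycles(12, latency=1)
two_alus = independent_cycles(12, latency=1, n_alus=2)
three_alus = independent_cycles(12, latency=1, n_alus=3)

print("dependent chain:", dependent, "cycles")
print("independent, 2 ALUs:", two_alus, "cycles")
print("independent, 3 ALUs:", three_alus, "cycles")
```

In this toy model a latency table alone predicts the dependent chain, but says nothing about how the independent adds scale with the number of ALUs, which is exactly the information a throughput column would provide.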
The documentation also mentions an integer divider at EX0 but, in contrast to the Llano processor, it has no noticeable effect on the Bulldozer's latencies, which lag one or two clocks behind those of the K10.
Actually, many latency values had been available before, mostly from the source code files of the x86 Open64 Compiler. So most observers already knew what to expect – among other things, that the load latency of the integer core, with its L1 data cache of only 16 KB, would increase by one clock to four clocks, as with the Sandy Bridge. But the Bulldozer also often needs one or more clocks more than its predecessor or the Sandy Bridge for arithmetic instructions such as addition, multiplication and division. This is also true for the new AVX instructions – only the AES crypto instructions execute a bit faster on the Bulldozer.
So, to prevent the Bulldozer from lagging behind, clock speed and Turbo Core will have to be cranked up considerably. As for the much-discussed topic of IPC (instructions per clock), the Bulldozer will probably not be able to compete with its predecessor in spite of some architectural improvements, especially because each pair of cores has to share the front end with the decoders as well as the FPU.
FMAC in All Variations
As for the so-called fused multiply-add instructions, AMD still stands unrivalled. Here, AMD even scores three times over: compliance with the old SSE5 specification from 2007 (only 128 bits), with the first AVX specification with four operands, and with the up-to-date AVX version with three operands now supported by Intel. AMD claims up to four times the performance of the old Opteron, for instance in matrix multiplication.
For now, this performance can only be achieved with AMD's x86 Open64 Compiler. However, Intel's compiler group – as James Reinders, the "chief evangelist" for software products, assured at the company's software conference ISTEP in Dubrovnik – intends to do its best to make the compilers offer top performance for non-Intel processors as well.
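What "fused" means in these instructions is that the product and the sum are rounded only once, as IEEE 754-2008 defines for fused multiply-add. A minimal Python sketch can emulate that behaviour in software with exact rational arithmetic. The helper names are hypothetical, the operands are chosen purely for illustration, and the code does not execute any hardware FMA instruction.

```python
from fractions import Fraction

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum."""
    return a * b + c

def fma_emulated(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Illustrative operands: the exact product is 1 - 2**-60, which a
# double cannot represent and therefore rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_then_add(a, b, c)   # the tiny term is lost in the product
fused = fma_emulated(a, b, c)     # the tiny term survives

print("unfused:", unfused)
print("fused:  ", fused)
```

The inner loop of a matrix multiplication consists of exactly this multiply-accumulate pattern, which is why a single instruction for it pays off there.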
And Something Else
Only recently, the remake of the C64 was launched. Now, the talk of the town is the AmigaOne 1000 and its PowerPC dual-core PWRficient PA6T-1682M from PA Semi – now owned by Apple. The British armaments group Varisys had been selling a module with this processor for some kind of weapons system and apparently has plenty of leftover stock.
As soon as the specifications are released and test systems are available, Intel plans to introduce AVX support for Bulldozer; the Intel compilers will probably be limited to the compatible instructions. Whether the Intel compilers will support the fused multiply-add instructions in the future, or at all, Reinders didn't say. After all, Intel itself will only introduce this functionality with the Haswell processor at the end of 2012, one and a half years later than AMD.
For Linux, Intel already published SPEC results for the Opteron 6174 a few months ago, and they are roughly on a par with the ones for the AMD compiler. According to Reinders, the new version 12.0 Intel compilers will deliver significantly higher performance than the best currently available compiler mix, namely Microsoft Visual Studio 2010 and PGI 10.6. While the older 11.1 compiler was 7 per cent behind MSVC and PGI in SPECint_base2006, version 12.0 took the lead by a margin of 10 per cent. The results from the floating point benchmark SPECfp_base2006 are even more pronounced: version 11.1 leaves the other Windows compilers behind by 24 per cent, and version 12.0 already has a 42 per cent lead.
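How much version 12.0 gains over version 11.1 follows from these margins if both are read as ratios against the same MSVC/PGI reference score. That reading is an assumption, since "7 per cent behind" could also be interpreted differently. Under it, a short Python calculation (function name chosen for illustration) gives the implied improvement:

```python
def relative_gain(old_vs_ref, new_vs_ref):
    """Gain of the new compiler over the old one, both given as
    score ratios relative to the same reference compiler."""
    return new_vs_ref / old_vs_ref - 1.0

# Assumed reading: "7 per cent behind" = 0.93 x reference,
# "10 per cent lead" = 1.10 x reference, and so on.
int_gain = relative_gain(0.93, 1.10)   # SPECint_base2006
fp_gain = relative_gain(1.24, 1.42)    # SPECfp_base2006

print(f"SPECint_base2006: 12.0 vs 11.1 = {int_gain:+.1%}")
print(f"SPECfp_base2006:  12.0 vs 11.1 = {fp_gain:+.1%}")
```

Read this way, the step from 11.1 to 12.0 would be somewhat larger for integer code (roughly 18 per cent) than for floating point code (roughly 15 per cent), even though the absolute lead is bigger in floating point.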
At the same time as ISTEP, the Intel Developer Forum (IDF) took place in Beijing, where the Windows-compatible tablet processor Oak Trail was launched with only a few designs and is apparently slower than the two-year-old Atom Z530. Intel also presented the successor "Cloverview" in 32-nm technology, which is primarily aimed at Windows 8 and is supposed to roll out sometime next year. Coincidentally, on the same day, Microsoft presented Windows 8 at Mix11 – no, not on an Oak Trail, but on a 1 GHz ARM processor.