Processor Whispers: About many ways and branches
by Andreas Stiller
At the Hot Chips conference in Cupertino, there was a beautiful plethora of new designs. While there was no word on Apple's A6, new information was revealed about AMD Steamroller, Xeon Phi and – through unofficial sources – about Atom Silvermont.
Although some Hot Chips attendees surely had a close look inside the local brew pubs, there was no news of an iPhone 5 prototype being found this time. One therefore has to depend on French web sites like Nowhereelse.fr for details about the inner workings of the iPhone 5. Their pictures at least show the A6 label; whether the chip has two ARM cores, which is probable, or four, is still unknown.
AMD's new chief CPU architect, Jim Keller, should know, though. After all, until recently he was responsible for Apple's processors. He is going to be assisted by another of the scene's old pros: John L. Gustafson, originator of Gustafson's Law about the performance potential of large multi-processor systems. Gustafson has spent the past four years at Intel. Now, as chief graphics product architect at AMD, he will be concerned with Radeon and FirePro. AMD's HSA consortium, within which Keller and Gustafson will be working closely together on the heterogeneous system architecture, is also receiving an illustrious addition to its ranks. Next to Vivante and smaller companies like Apical, Arteris and Sonics, one of the biggest fish, no, the biggest fish, has taken the HSA bait: Samsung. This adds a lot of momentum to the whole idea; basically, only Intel, NVIDIA, Qualcomm and Apple are missing ...
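For readers who don't have Gustafson's Law at their fingertips, here is a minimal sketch in Python. It uses the commonly quoted form of the law, S(N) = N - s * (N - 1), and sets it against Amdahl's Law for comparison. The function names and the example figures (64 processors, 5 per cent serial share) are picked purely for illustration and come from no AMD or Intel material.

```python
def gustafson_speedup(n_processors: int, serial_fraction: float) -> float:
    """Scaled speedup per Gustafson's Law: S(N) = N - s * (N - 1).

    serial_fraction (s) is the share of run time spent in serial code,
    measured on the parallel machine; the problem size grows with N.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return n_processors - serial_fraction * (n_processors - 1)


def amdahl_speedup(n_processors: int, serial_fraction: float) -> float:
    """Fixed-size speedup per Amdahl's Law: S(N) = 1 / (s + (1 - s) / N).

    Here serial_fraction (s) is measured on a single processor and the
    problem size stays constant.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


if __name__ == "__main__":
    # Illustrative values only: 64 processors, 5 per cent serial share.
    n, s = 64, 0.05
    print(f"Gustafson: {gustafson_speedup(n, s):.2f}x")
    print(f"Amdahl:    {amdahl_speedup(n, s):.2f}x")
```

The two serial fractions are not the same quantity, so the numbers are not directly comparable. The sketch merely shows why Gustafson's view is the more optimistic one for large machines: if the workload grows with the processor count, the serial share hurts far less.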
During its presentations at Hot Chips, AMD also revealed a few details about the next processor generation, Steamroller. It will probably deliver what had been expected of Bulldozer in the first place: separate decoders for each of a module's two cores, improved branch prediction, and a micro-op queue (of unknown size). According to CTO Mark Papermaster, all this is supposed to increase the throughput per clock by 30 per cent. It's quite possible that AMD has not only enlarged Bulldozer's L1 instruction cache, which had been borrowed from the Athlon with hardly any changes, but also increased its associativity; with only two ways, the cache was truly unfit to serve two cores.
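Why two ways are tight for two cores can be illustrated with a toy simulation. The following Python sketch models a shared set-associative cache with LRU replacement and lets two cores alternate between code lines that happen to map to the same set. The 64 KB size, the 64-byte lines, the LRU policy and the addresses are assumptions made for illustration; this is not a model of AMD's actual fetch logic.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_sets: int, ways: int, line_size: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set; the oldest entry is the LRU victim.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Look up an address; return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict the least recently used tag
        entries[tag] = True
        return False


def run_interleaved(ways, streams, iterations=100,
                    cache_bytes=64 * 1024, line_size=64):
    """Interleave equally long per-core address streams; return the miss count."""
    num_sets = cache_bytes // (line_size * ways)
    cache = SetAssociativeCache(num_sets, ways, line_size)
    for _ in range(iterations):
        for step in range(len(streams[0])):
            for stream in streams:
                cache.access(stream[step])
    return cache.misses


if __name__ == "__main__":
    KB = 1024
    # Hypothetical addresses: each core loops over two code lines 32 KB
    # apart, so all four lines fall into set 0 in both configurations.
    core0 = [0 * KB, 32 * KB]
    core1 = [64 * KB, 96 * KB]
    print("one core, 2 ways: ", run_interleaved(2, [core0]))         # 2
    print("two cores, 2 ways:", run_interleaved(2, [core0, core1]))  # 400
    print("two cores, 4 ways:", run_interleaved(4, [core0, core1]))  # 4
```

In this deliberately worst-case pattern, a single core gets by with two compulsory misses, while two cores evict each other's lines on every access. Four ways absorb the conflict completely. Real code is far less regular, so the effect in practice is smaller, but the direction is the same.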
Somewhere halfway between Bulldozer and Steamroller, there is the Piledriver core. Reportedly, it is actually available for desktop systems now, in the form of the Trinity APU with integrated graphics. The leaked US prices range from $60 (A4-5300) to $131 (A10-5800K). As for the server versions, Delhi, Seoul and Abu Dhabi, there's still nothing in sight. In spite of all the Cassandras, however, there is much to suggest that at least the FX series for high-end desktop PCs will soon be extended by Piledriver chips: by Vishera, with four, six or eight cores. If these are indeed released toward the end of October, in the form of an eight-core AMD FX-8350 or a six-core FX-6350, they'll only be about one or two months late; we've become accustomed to worse. But maybe the Cassandras are right and Vishera will be cancelled in favour of a significantly more efficient Steamroller design, some day.
However, the competition is no stranger to the art of postponement either. Although Intel boss Otellini had promised a massive acceleration of the tick-tock cycle for the Atoms at the developer forum last year, the collapsing netbook market (Acer and Asus, among others, have announced their withdrawal) dictates a change of plans. In any case, the leaked roadmap slides from China show the Bay Trail platform with the Valleyview SoC and the Silvermont CPU listed under 2014. This quad-core Atom in 22nm technology is supposed to draw attention with a new out-of-order architecture and to merge the current D and N lines. The graphics inside the SoC are also supposed to be up to seven times faster than those of the current Medfield. Until recently, the chip was still believed to be arriving in early 2013, just like the special server Atom code-named Avoton.
Atomic Time Dilatation
If, by then, there are no new processors, why not make new variants of the old ones? Indeed, Intel has launched a flood of new desktop and mobile processors with one to four cores, most of them with (and in the case of the Core i5-3350P also without) graphics. The cheapest one is the Celeron G465 with a clock speed of 1.9GHz for $37. However, truly new chips, such as the Itanium Poulson or the long-delayed Larrabee successor Xeon Phi, are supposed to debut soon too. At the Hot Chips conference, Intel provided some further details about the latter's inner workings. Basic specifications, however, like clock speed, PCIe version and maximum core count, remain undisclosed. The number of cores had been indicated in the Knights Corner Instruction Set Reference Manual (62 physical, 244 logical cores) published at the start of summer, but only briefly. Shortly after the telltale entries had been mentioned in a previous edition of Processor Whispers, Intel revised the paper, replacing the information with */*/*.
To make sure that the Xeon Phi and its programmers will be kept busy, Intel released new compiler suites a few days ahead of this year's IDF developer summit. Parallel Studio XE 2013 and Cluster Studio XE 2013 now support the refinements of Sandy/Ivy Bridge (AVX), Haswell (AVX2, FMA3) and Xeon Phi (IMCI). At $2299, Intel isn't exactly giving Parallel Studio XE for Windows and Linux away, but you get highly optimised libraries, powerful profilers and a very useful guide (Intel Advisor XE) that allows you to easily simulate a variety of parallel scenarios. In this context, it is interesting to see Intel's comparison with the Microsoft and GNU compilers using the C/C++ programs of the SPEC CPU2006 suite. In the integer benchmark, Intel was able to achieve a gain of over 50%, and in the floating point benchmark 100% (over Visual Studio 2010) or even 164% (over gcc 4.7.1). However, instead of comparing the multi-processor values (rate), Intel used the speed values, which are really meant for single threads. The sophisticated auto-parallelisation of the Intel compilers can probably pack a real punch here, while the other compilers are capable of this only in a rudimentary way or not at all. So it appears that some sceptical remeasuring is in order.