Processor Whispers: About elisions and epentheses
by Andreas Stiller
Intel releases the instruction extension for transactional memory and gives numerous talks at the International Solid-State Circuits Conference (ISSCC). AMD equips the Piledriver with inductors and China delivers another new processor.
The last issue of Processor Whispers had barely been finished and sent to press when Intel announced the "transactional synchronization extensions" (TSX), mentioned in that issue, for Haswell, the processor generation after next. TSX even provides two interfaces to the technology. The first, hardware lock elision (HLE), meaning the omission of mostly unnecessary locks, employs two new prefixes, xacquire and xrelease, which turn pessimistic locks into optimistic ones in the sense of transactional memory. In the pessimistic case, only one thread at a time may run in a critical region as a precaution. All others have to wait, even if they would not get in each other's way at all.
Conversely, with the optimistic approach, all threads are allowed to continue, but there has to be a fallback option in case a conflict actually does arise. The pessimistic approach resembles a set of traffic lights that, to be on the safe side, shows green for only one lane at a time, while the optimistic one is like a crossing without any traffic lights – but with a repair shop to deal with any "conflicts" that occur.
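The difference can be played through in a few lines of Python. What follows is only a software toy model of the optimistic idea, not Intel's hardware mechanism (which has not been disclosed); the class names, the per-key version counters and the `increment` helper are all assumptions made for illustration. Each attempt buffers its writes, and at commit time it checks whether anyone else has changed the data it touched.

```python
class SharedMemory:
    """Toy shared memory: every key carries a version counter."""
    def __init__(self):
        self.values = {}
        self.versions = {}


class Transaction:
    """One optimistic attempt: buffer writes, validate at commit."""
    def __init__(self, memory):
        self.memory = memory
        self.seen = {}      # key -> version at first touch
        self.pending = {}   # buffered writes, invisible to others

    def _touch(self, key):
        self.seen.setdefault(key, self.memory.versions.get(key, 0))

    def read(self, key):
        self._touch(key)
        if key in self.pending:
            return self.pending[key]
        return self.memory.values.get(key, 0)

    def write(self, key, value):
        self._touch(key)
        self.pending[key] = value

    def commit(self):
        # Conflict: somebody else committed to a key we touched.
        for key, version in self.seen.items():
            if self.memory.versions.get(key, 0) != version:
                return False   # abort, buffered writes are dropped
        for key, value in self.pending.items():
            self.memory.values[key] = value
            self.memory.versions[key] = self.memory.versions.get(key, 0) + 1
        return True


def increment(tx, key):
    tx.write(key, tx.read(key) + 1)


memory = SharedMemory()

# Two "threads" on different data: both cross the junction unharmed.
a, b = Transaction(memory), Transaction(memory)
increment(a, "x")
increment(b, "y")
no_conflict = (a.commit(), b.commit())

# Two "threads" on the same data: the second one needs the repair shop.
c, d = Transaction(memory), Transaction(memory)
increment(c, "x")
increment(d, "x")
conflict = (c.commit(), d.commit())

# Fallback: redo the aborted work (a real system might take the lock now).
retry = Transaction(memory)
increment(retry, "x")
retried = retry.commit()
```

In this model the pessimistic variant would simply wrap every `increment` in one global lock, which costs waiting time even in the first case, where the two threads never touch the same data.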
The opcodes of the HLE prefixes aren't exactly new, though. For lack of room in the opcode space, Intel "abused" the existing REPNE/REP prefixes (0xF2/0xF3) for this purpose. So far, these have only played a role in connection with string operations and are otherwise ignored. Opcode recycling has the big advantage that the same binary code also runs on older architectures, albeit somewhat slower because of the pessimistic locking.
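The recycling trick is easiest to see at byte level: 0xF2 is the REPNE prefix and doubles as xacquire, 0xF3 is the REP prefix and doubles as xrelease. The following sketch is a deliberately naive stand-in for a decoder, not a model of any real one; the function `read_prefix` and the two example byte sequences (64-bit encodings chosen for illustration) are assumptions.

```python
# The two readings of the recycled prefix bytes.
LEGACY_NAMES = {0xF2: "repne", 0xF3: "rep"}
HLE_NAMES = {0xF2: "xacquire", 0xF3: "xrelease"}


def read_prefix(code, hle_capable):
    """Say how a (hypothetical) decoder treats the leading prefix byte."""
    prefix = code[0]
    if prefix not in LEGACY_NAMES:
        return "no prefix"
    if hle_capable:
        return HLE_NAMES[prefix]
    # Older CPUs: the prefix means nothing on a non-string instruction.
    return LEGACY_NAMES[prefix] + " (ignored)"


# 0xF2 in front of "xchg [rdi], eax": taking a lock.
acquire = bytes([0xF2, 0x87, 0x07])
# 0xF3 in front of "mov dword [rdi], 0": releasing it again.
release = bytes([0xF3, 0xC7, 0x07, 0x00, 0x00, 0x00, 0x00])

for code in (acquire, release):
    print(code.hex(" "), "| old CPU:", read_prefix(code, False),
          "| HLE CPU:", read_prefix(code, True))
```

The bytes are identical in both cases; only the interpretation differs, which is why one and the same binary runs on both generations.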
The other option provided by TSX, called restricted transactional memory (RTM), offers three new instructions, xbegin, xend and xabort, and is more powerful than HLE. Here, however, the binary code is no longer backward-compatible and will thus run only on Haswell and later processors. Intel hasn't disclosed how the transactional memory is implemented in hardware. Well-founded speculation from David Kanter can be found at www.realworldtech.com.
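The control flow that the three RTM instructions make possible can be sketched roughly as follows. This is a single-threaded Python model of the pattern only, not of the hardware: the names `run_transaction`, `TransactionAbort`, `fallback_lock` and the account example are assumptions, and copying a dictionary merely stands in for speculative execution. The point it tries to show is that the programmer supplies a separate fallback path, which is where RTM goes beyond HLE.

```python
import threading


class TransactionAbort(Exception):
    """Stands in for an abort; carries a status code like xabort's operand."""
    def __init__(self, status):
        super().__init__(status)
        self.status = status


def xabort(status):
    raise TransactionAbort(status)


fallback_lock = threading.Lock()   # hypothetical lock for the slow path


def run_transaction(state, speculative, fallback, max_attempts=3):
    """Try the speculative path a few times, then take the lock instead."""
    for _ in range(max_attempts):
        working_copy = dict(state)     # "xbegin": start speculating
        try:
            speculative(working_copy)
        except TransactionAbort:
            continue                   # speculative writes are discarded
        state.clear()                  # "xend": publish all writes together
        state.update(working_copy)
        return "committed"
    with fallback_lock:                # pessimistic path chosen by the programmer
        fallback(state)
    return "fallback"


def transfer_speculative(accounts):
    accounts["a"] -= 30
    if accounts["a"] < 0:
        xabort(0x01)                   # the half-done update must never show
    accounts["b"] += 30


def transfer_fallback(accounts):
    if accounts["a"] >= 30:
        accounts["a"] -= 30
        accounts["b"] += 30


accounts = {"a": 50, "b": 0}
first = run_transaction(accounts, transfer_speculative, transfer_fallback)
second = run_transaction(accounts, transfer_speculative, transfer_fallback)
```

The first transfer commits; the second aborts on every attempt and ends up on the fallback path, and at no point does the half-finished withdrawal become visible in `accounts`.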
Terahertz in Sight
Even at the ISSCC in San Francisco in mid-February, where Intel developers gave numerous talks, there was no mention of the Haswell processor, neither of its transactional memory nor of its cache design. Well, after all, it's Ivy Bridge that's next and, according to the latest rumours, its launch – at least in larger numbers – will be pushed back from Easter in the direction of Pentecost (from April towards May) because so many Sandy Bridge notebooks are still sitting on the shelves.
Intel developer Scott Siers explained that at first there will be four basic Ivy Bridge versions with different die sizes, the largest of which will pack around 1.4 billion transistors onto 160 square millimetres. The design provides for three different tri-gate transistor types: the fastest with normal leakage current, so-called "quarter leakage" types with medium speed, and slow transistors that have only one tenth of the leakage of the fast ones. The processors' fastest function blocks consist of fast and medium-speed transistors, about 70% of the former and 30% of the latter. In the less critical chip regions, about 75% of the transistors are of the especially economical type while the rest are medium-speed transistors. Also new with Ivy Bridge is that the voltage will no longer rise linearly, step by step, with the frequency but along a parabolic curve, which saves one or two additional milliwatts.
As for the clock speed, Intel doesn't plan to surpass 4GHz with its next processor generation either, even though, according to Siers, this could easily be accomplished with Ivy Bridge. Not much has happened with processor clock speeds during the last 10 years, but in his keynote, chief product officer Dadi Perlmutter boldly mentioned terahertz class clients, which he expects to run at only 20 watts toward the end of the decade. Actually "seeing" terahertz was another topic, although in a very different session: modern CMOS cameras that can capture images in the far infrared at 860GHz – great for body scanners. So CMOS still offers room for innovation.
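What the transistor mix means for leakage can be estimated with a little arithmetic. The sketch below uses only the figures quoted above and rests on two assumptions that do not come from the talk: that "quarter leakage" means one quarter of the fast type's leakage, and that all transistors are the same size, so leakage scales with the head count alone.

```python
# Relative leakage per transistor type, normalised to the fastest type.
# "Quarter leakage" is read here as 1/4 of the fast type (assumption).
LEAKAGE = {"fast": 1.0, "medium": 0.25, "slow": 0.10}


def relative_leakage(mix):
    """Weighted average leakage for a mix given as {type: share}."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shares must add up to 1")
    return sum(LEAKAGE[kind] * share for kind, share in mix.items())


# Fastest function blocks: 70% fast, 30% medium-speed transistors.
fast_blocks = relative_leakage({"fast": 0.70, "medium": 0.30})

# Less critical regions: 75% slow, 25% medium-speed transistors.
relaxed_blocks = relative_leakage({"slow": 0.75, "medium": 0.25})

# Density of the largest die: about 1.4 billion transistors on 160 mm^2.
density_per_mm2 = 1.4e9 / 160

print(f"fast blocks:    {fast_blocks:.3f} of all-fast leakage")
print(f"relaxed blocks: {relaxed_blocks:.3f} of all-fast leakage")
print(f"density:        {density_per_mm2 / 1e6:.2f} million transistors per mm^2")
```

Under these assumptions, the fast blocks leak about three quarters as much as an all-fast design would, and the less critical regions only about a seventh.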
AMD intends to exceed 4GHz with its Piledriver cores. For this purpose, it has licensed a resonant clock distribution technology from Cyclos Semiconductor and is the first to have a design based on it, which is supposed to reduce the power consumption of the clock distribution by 24%. Given that the Piledriver has five horizontal clock trees, each with 54 drivers, this should be quite significant. Cyclos' patented idea is to integrate inductors, which can be bypassed via switches, between the clock drivers. Thanks to LC resonance, these inductors can then recycle charge. How do the coils get into the chip? The windings have to be artfully woven into the upper two metal layers.
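How small such coils have to be follows from the textbook formula for an ideal LC circuit, f = 1 / (2π·√(L·C)). The sketch below applies it to a 4GHz clock; the capacitance of 100pF is a purely hypothetical value picked for illustration, not a figure from AMD or Cyclos, and losses in the real clock network are ignored.

```python
import math


def resonant_frequency(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC circuit in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def inductance_for(frequency_hz, capacitance_f):
    """Inductance in henry that resonates with C at the given frequency."""
    return 1.0 / ((2.0 * math.pi * frequency_hz) ** 2 * capacitance_f)


clock_hz = 4.0e9
clock_net_capacitance_f = 100e-12   # hypothetical load of one clock section

needed_inductance_h = inductance_for(clock_hz, clock_net_capacitance_f)
check_hz = resonant_frequency(needed_inductance_h, clock_net_capacitance_f)

print(f"needed inductance: {needed_inductance_h * 1e12:.1f} pH")
print(f"resonates at:      {check_hz / 1e9:.2f} GHz")
```

With the assumed capacitance, the result lands in the range of a few picohenry, which at least makes it plausible that a handful of windings in the metal layers is enough.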
While Intel and AMD explained some of the circuitry tricks of future processors, Oracle used the event to release more details about its already-launched SPARC T4 processor. The T4 comes with fewer but much more powerful cores than its T3 predecessor. It would be nice to have some benchmark results from SPECrate2006 to be able to directly compare the new 8-core chip with its 16-core predecessor, but apparently Oracle isn't willing to release them.
There were new processors to be admired, too, with a Chinese university yet again managing the feat of pulling a new one out of the hat. Last autumn, the Jiangnan Computing Research Lab delivered the ShenWei 1600 with 16 cores, which took a computer equipped with it to 14th place in the Top500 supercomputer list. Now, Fudan University in Shanghai has presented another interesting 16-core chip, which comes without caches and works with message passing as a cluster as well as with shared memory. The 16 SIMD RISC cores – supposedly MIPS32-compatible – are grouped in two clusters of 8 cores, each around a shared-memory node. The two clusters on the chip communicate with each other through three links. The processor, manufactured by TSMC in a 65nm process, manages a 3780-point FFT at 7MSamples/s. Clocked at 750MHz and with a voltage of 1.2 volts, it's supposed to have an operational power consumption of only 34mW per core. And so the Chinese have a head start here. It is just as well that at least ARM comes from Europe...