Processor Whispers: About slowing and scrambling
by Andreas Stiller
It's jinxed: a lot of things related to the Bulldozer go wrong, including the joint endeavour by Microsoft and AMD to improve the processor's performance under Windows 7.
Maybe Microsoft boss, Steve Ballmer, will address the company's various hits and misses in his keynote at the Consumer Electronics Show CES 2012 – the last of its kind, as, to the surprise of the industry, Microsoft is saying farewell to this show. The list of misses got a little longer in mid-December, with the “update to optimise the performance of AMD Bulldozer CPUs” (KB2592546). In the description, Microsoft explains that, up until now, the performance of the AMD Bulldozer CPU has been worse than expected.
However, the update was only online for one day before it was pulled and, for a few days, the link led to a page with the notice: “The code associated with this KB is incomplete and should not be used”. Eventually, and without further ado, this page was removed as well. So, do updates no longer get tested before they are released? Was the chaos a result of AMD firing the employees responsible for the collaboration with Microsoft as part of its mass layoffs?
Whatever the cause, the hotfix developed in collaboration with AMD was meant to take “simultaneous multithreading (SMT)” – which AMD calls core multithreading (CMT) – into account when scheduling threads on the Bulldozer, and thus deliver slightly higher performance.
AMD and Microsoft say they are continually working to improve – but then so do a lot of job references.
In spite of its very limited availability, numerous AMD FX owners rushed to get the hotfix, hoping for a noticeable performance increase. But they had to face bitter disappointment: no performance boost could be detected and, even worse, the result was often a slowdown. Many of them, however, hadn't really understood what the update was meant to accomplish. It was supposed to spread the threads across the modules before the scheduler starts using each module's second core. So, naturally, no performance gain can be expected for single-threaded applications or for applications that use all cores. A different assignment can only be advantageous when fewer threads are running than there are cores.
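The placement policy the hotfix was aiming for can be sketched in a few lines of Python. This is only an illustration, not the actual Windows scheduler: the function names are made up, and the sketch assumes that cores 2k and 2k+1 belong to module k, as on a four-module, eight-core FX chip.

```python
def module_first_order(modules=4, cores_per_module=2):
    """Core IDs in the order a module-aware scheduler would fill them.

    Assumption: cores 2k and 2k+1 form module k.
    """
    order = []
    for slot in range(cores_per_module):      # first core of every module, then the second
        for module in range(modules):
            order.append(module * cores_per_module + slot)
    return order


def place_threads(n_threads, modules=4, cores_per_module=2):
    """Pick cores for n_threads so that modules are shared as late as possible."""
    return module_first_order(modules, cores_per_module)[:n_threads]


def modules_used(cores, cores_per_module=2):
    """Return the sorted list of modules that a set of cores occupies."""
    return sorted({core // cores_per_module for core in cores})


if __name__ == "__main__":
    print(place_threads(2))                 # two threads, two different modules
    print(modules_used([0, 1]))             # placement without the hotfix: one shared module
    print(modules_used(place_threads(2)))   # module-first placement: two modules
```

With one thread or with eight threads, both strategies end up using the same cores, which is why only partially loaded systems can benefit.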
Nonetheless, we tried the questionable update on an FX-8150 under Windows 7 Ultimate, 64-bit, and logged which cores the two threads of a single process ran on when no affinities were defined. Without the hotfix, both threads usually start on cores 0 and 1, and thus in the same module, where they have to share many resources (frontend, instruction cache, FPU). Consequently, depending on the code, they run significantly slower than they would if they were spread across two modules. Once in a blue moon, they jump to other cores, but only to quickly return to one of the initial ones.
This behaviour changed with the update. The first thread started stays on the first core, 0; the second and following threads then jump back and forth between the three cores in the other modules. We also noted differences in the distribution of threads, depending on whether the threads included FPU instructions or not. The experiments also showed that Windows 7 almost always starts with the same core (0) when further processors are activated – with or without the update. It seems highly unlikely that this is a sensible use of multithreading.
One reason for the slowdown some users noticed was that the altered thread distribution strategy and the merry jumping back and forth between cores could interfere with the power management and Turbo Core.
Many of the underlying Turbo Core technologies and algorithms were developed by Sam Naffziger. The owner of dozens of important processor patents was involved in the development of the PA-RISC processors at Hewlett-Packard. Later on, as an Intel Fellow, he was responsible for the power management and the Foxton overclocking technology in the Itanium Montecito. In October 2009 – quite some time after he had moved on to competitor AMD – he received his US patent for the “sampling of chip activity for real-time power estimation”, a technology which he incorporated into the Bulldozer design. Via model-specific registers (MSRs), it's possible to read the current power consumption estimate (under Linux, for instance, via fam15h_power). On the Internet, though, you will find many complaints about significant deviations from the real values.
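Under Linux, the fam15h_power driver publishes its estimate through the hwmon sysfs interface, so reading it might look roughly like the following sketch. The helper name `read_fam15h_power_watts` is made up for this example, and it assumes the usual hwmon layout with a `name` file and a `power1_input` file holding microwatts; the hwmon numbering varies from system to system.

```python
from pathlib import Path


def read_fam15h_power_watts(hwmon_root="/sys/class/hwmon"):
    """Return the power estimates (in watts) of all fam15h_power hwmon devices.

    Assumption: each device directory has a 'name' file and a 'power1_input'
    file in microwatts, as the hwmon sysfs convention describes.
    """
    readings = []
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings                      # no hwmon support, e.g. not Linux
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        power_file = dev / "power1_input"
        if not (name_file.is_file() and power_file.is_file()):
            continue
        if name_file.read_text().strip() != "fam15h_power":
            continue                         # some other sensor chip
        microwatts = int(power_file.read_text().strip())
        readings.append(microwatts / 1_000_000)
    return readings


if __name__ == "__main__":
    for watts in read_fam15h_power_watts():
        print(f"estimated package power: {watts:.1f} W")
```

On machines without the driver the function simply returns an empty list. The value is the processor's own estimate, not a measurement, which fits the complaints about deviations.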
Naffziger's ex-employer, Intel, doesn't rely solely on clever algorithms for its Sandy Bridge generation, either. Via the PECI interface, the processor can read not only thermal but also power data. Intel has now released Power Gadget 2.0 for Windows, which allows the user to read the current clock and power consumption of Sandy Bridge cores. This “gadget” also includes libraries, so that developers can optimise their software's “power awareness”.
Alias Conflicts
Another performance problem of the Bulldozer processor is the alias conflicts in the L1 instruction cache, described in the test in c't magazine 25/2011. Basically, the instruction cache of the predecessor, K10, which was also virtually indexed and physically tagged (VIPT), had the same problem. However, with the K10, it's comparatively rare that two processes with the same physical, but different virtual addresses run on the same core. With the Bulldozer, on the other hand, it's much more likely that two cores get in each other's way, because the instruction cache has to serve two cores. This happens, for instance, when the operating system starts identical processes whose shared libraries are mapped at randomly selected, different virtual addresses (ASLR: address space layout randomisation). This kind of address scrambling has meanwhile become commonplace because it makes it harder for malware to settle in.
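How such an alias comes about can be illustrated with a little address arithmetic. The sketch below assumes a 64 KB, 2-way instruction cache with 64-byte lines and 4 KB pages; these sizes and the load addresses are assumptions chosen for illustration, not figures from the text above. With this geometry the set index uses address bits 6 to 14, so bits 12 to 14 come from the virtual page number and can differ between two mappings of the same physical line.

```python
LINE_BYTES = 64            # assumed cache line size
WAYS = 2                   # associativity
CACHE_BYTES = 64 * 1024    # assumed L1 instruction cache size
PAGE_BYTES = 4096          # assumed page size
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 512 sets -> index bits 6..14


def set_index(vaddr):
    """Cache set selected by a virtual address in a VIPT cache."""
    return (vaddr // LINE_BYTES) % SETS


def is_alias_conflict(vaddr_a, vaddr_b):
    """True if two virtual mappings of one physical line pick different sets.

    Both addresses must share the page offset, otherwise they cannot
    refer to the same physical byte.
    """
    if vaddr_a % PAGE_BYTES != vaddr_b % PAGE_BYTES:
        raise ValueError("addresses cannot map to the same physical location")
    return set_index(vaddr_a) != set_index(vaddr_b)


if __name__ == "__main__":
    offset = 0x240                     # same code location inside a shared library
    base_a = 0x7F00_0000_0000          # hypothetical load address in process A
    base_b = 0x7F00_0000_3000          # hypothetical load address in process B
    print(set_index(base_a + offset), set_index(base_b + offset))
    print(is_alias_conflict(base_a + offset, base_b + offset))
```

The two processes execute the same physical code, yet the cache looks for it in two different sets, and that is the situation in which the cores get in each other's way.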
As for Windows, programmers can choose at build time whether their application should be started with ASLR or not. Although it is enabled by default in recent versions of Visual Studio, the majority of the software in use still does without it. Besides, with ASLR, Windows doesn't scramble everything, but leaves the lower 16 bits of the address untouched. Fortunately, this sidesteps the Bulldozer cache's alias problem, so no hotfix is required for Windows.
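Why untouched lower 16 bits help can be checked with a short calculation. The following sketch again assumes a 64 KB, 2-way cache with 64-byte lines, so that the set index is taken from address bits 6 to 14; the cache geometry and the helper names are assumptions for illustration.

```python
LINE_BYTES = 64                      # assumed cache line size
SETS = (64 * 1024) // (LINE_BYTES * 2)   # assumed 64 KB, 2-way -> 512 sets


def set_index(vaddr):
    """Cache set selected by a virtual address (index bits 6..14)."""
    return (vaddr // LINE_BYTES) % SETS


def distinct_indices(bases, offset):
    """Set indices that one code location gets under different load addresses."""
    return {set_index(base + offset) for base in bases}


if __name__ == "__main__":
    offset = 0x1A40                  # some code location inside a library
    # Load addresses that differ only above bit 15 (64 KB steps).
    bases_64k = [k * 0x10000 for k in range(1, 50)]
    # Load addresses randomised in 4 KB page steps.
    bases_4k = [k * 0x1000 for k in range(8)]
    print(len(distinct_indices(bases_64k, offset)))   # one set for all mappings
    print(len(distinct_indices(bases_4k, offset)))    # several different sets
```

As long as the load address only changes above bit 15, every mapping of a code location selects the same set, so no alias arises under these assumptions. Randomising in page-sized steps, by contrast, spreads the same code over up to eight sets.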
Under Linux, something similar can be achieved through a minimal change in the kernel, so a redesign of the cache doesn't really seem necessary. However, AMD should give it a little more than just 2-way associativity in the future. After all, competitor Sandy Bridge has 8-way associativity for its I-cache, which significantly reduces the probability of processes kicking each other out of the cache (thrashing).
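The effect of associativity on thrashing can be demonstrated with a toy model of a single cache set. The sketch assumes LRU replacement, which is a simplification for illustration and not a statement about the replacement policy of either processor.

```python
from collections import OrderedDict


def count_misses(accesses, ways):
    """Count misses in one cache set with the given associativity (LRU assumed)."""
    lru = OrderedDict()                # least recently used entry comes first
    misses = 0
    for tag in accesses:
        if tag in lru:
            lru.move_to_end(tag)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)   # evict the least recently used line
            lru[tag] = True
    return misses


if __name__ == "__main__":
    # Three code lines that map to the same set, executed in a loop.
    pattern = ["A", "B", "C"] * 10
    print(count_misses(pattern, ways=2))   # every access misses
    print(count_misses(pattern, ways=8))   # only the first three accesses miss
```

With only two ways, three competing lines evict each other on every pass, while eight ways leave room for all of them.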
(djwm)



















