Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

With a sorted array, the condition data[c] >= 128 is first false for a streak of values, then becomes true for all later values. That’s easy to predict. With an unsorted array, you pay for the branching cost.

 

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial‘s answer.

Now, if we look at the code

we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.

In C, thus C++, the statement, which would compile directly (without any optimization) into the conditional move instruction in x86, is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:

While maintaining readability, we can check the speedup factor.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

x86

x64

The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.

Now let’s look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.

max1 uses the conditional branch if... else ...:

max2 uses the ternary operator ... ? ... : ...:

On a x86-64 machine, GCC -S generates the assembly below.

max2 uses much less code due to the usage of instruction cmovge. But the real gain is that max2does not involve branch jumps, jmp, which would have a significant performance penalty if the predicted result is not right.

So why does a conditional move perform better?

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.

In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode does not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction’s execution time. This is why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer’s Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can’t (the code in question is using Visual Studio’s native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.

 

More Read & Reference: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array

 

Leave a Reply

Your email address will not be published. Required fields are marked *