http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073



Jake Stine <jake.stine at gmail dot com> changed:



           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jake.stine at gmail dot com



--- Comment #16 from Jake Stine <jake.stine at gmail dot com> 2013-02-16 19:12:05 UTC ---

Hi,



I have done quite a bit of analysis on cmov performance across x86
architectures, so I will share here in case it helps:



Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."



History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
performance of cmov vs cmp+branch is determined more by the alignment of the
target of the branch than by the prediction rate of the branch.  The
instruction decoders would incur penalties on certain types of unaligned branch
targets (when taken), or when decoding sequences of instructions that contained
multiple branches within a 16-byte "fetch" window (taken or not).  cmov was
sometimes handy for avoiding those.



With regard to the more current Intel Core and AMD Bulldozer/Bobcat architectures:



I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mispredicted.  In my tests, the
cmov only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards and
rarely occurs in practice.  Above 97% prediction rates, cmov is typically
slower than cmp+branch.  Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck.  This holds true for both Core and Bulldozer.
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode.  Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.
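
To make the comparison concrete, here is a minimal sketch of the kind of loop
in question, written two ways.  The function names and the threshold parameter
are hypothetical and are not taken from the bug report.  Neither spelling
guarantees which instructions the compiler emits: depending on target and
flags, GCC may if-convert the first form or emit a branch for the second, so
the generated assembly (for example from gcc -S) has to be inspected to see
what was actually produced.

```c
#include <stddef.h>

/* Hypothetical example: sum the elements that are >= threshold.
   "Branch" spelling: the update is guarded by an if statement, which a
   compiler may lower to cmp + conditional jump. */
long sum_if_branch(const int *v, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold)
            sum += v[i];
    }
    return sum;
}

/* "Select" spelling: the update always executes and the condition only
   chooses the addend, which is the shape if-conversion produces and a
   natural candidate for cmov. */
long sum_if_select(const int *v, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int addend = (v[i] >= threshold) ? v[i] : 0;
        sum += addend;
    }
    return sum;
}
```

If the input is sorted or almost always on one side of the threshold, the
branch is predicted nearly perfectly, which is the regime described above as
favoring cmp+branch.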



Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces.  In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a controlled
environment.  Furthermore, to my knowledge the status of cmov vs. branch
performance on x86 will not be changing anytime soon.  cmov will continue to be
a liability well into the next couple of architecture releases from Intel and
AMD.  Piledriver will have added fetch/decode resources but should also have a
smaller mispredict penalty, so it's doubtful cmov will gain much advantage
there either.



Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
matching Intel Core and AMD Bulldozer/Bobcat families.





There is one good use case for cmov on x86: mispredicted conditions inside of
loops.  Currently there's no way to force that behavior in situations where I,
the programmer, am fully aware that the condition is chaotic/random.  A builtin
cmov or condition hint would be nice.  For now I'm forced to address those
(fortunately infrequent) situations via inline asm.
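
For reference, the inline asm workaround can look roughly like the sketch
below.  It is an illustration, not code from my tests: select_cmov and
max_unpredictable are hypothetical names, the asm assumes GCC-style extended
asm with the default AT&T syntax on x86-64, and other targets fall back to a
plain ternary.

```c
#include <stddef.h>

/* Hypothetical helper: returns a if cond is nonzero, otherwise b.
   On x86-64 with GCC-style extended asm this forces a cmov; elsewhere it
   falls back to a ternary and the compiler chooses the instructions. */
static inline long select_cmov(long cond, long a, long b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    long result = b;                  /* start with the "false" value */
    __asm__ ("test %1, %1\n\t"        /* ZF is set when cond == 0 */
             "cmovne %2, %0"          /* cond != 0: result = a */
             : "+r" (result)
             : "r" (cond), "r" (a)
             : "cc");
    return result;
#else
    return cond ? a : b;
#endif
}

/* Example use: running maximum over data assumed to be in random order,
   so the comparison is hard to predict.  Requires n >= 1. */
long max_unpredictable(const long *v, size_t n)
{
    long best = v[0];
    for (size_t i = 1; i < n; i++)
        best = select_cmov(v[i] > best, v[i], best);
    return best;
}
```

The asm statement hides the select from the optimizer, so this is only worth
doing where the condition is known to be unpredictable.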
