[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 Martin Liška changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2020-01-21 Ever confirmed|0 |1
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #7 from ncm at cantrip dot org --- (In reply to Richard Biener from comment #6) > (In reply to Andrew Pinski from comment #4) > > (In reply to Alexander Monakov from comment #3) > > > So perhaps an unpopular opinion, but I'd say a > > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout > > > optimization pipeline as a non-branchy COND_EXPR) is badly missing. > > > > I am going to say otherwise. Many of the time conditional move is faster > > than using a branch; even if the branch is predictable (there are a few > > exceptions) on most non-Intel/AMD targets. This is because the conditional > > move is just one cycle and only a "predictable" branch is one cy`le too. > > The issue with a conditional move is that it adds a data dependence while > branches are usually speculated and thus have zero overhead in the execution > stage. The extra dependence can easily slow things down dependent on the > (three!) instructions feeding the conditional move (condition, first and > second source). This is why well-predicted branches are often so much > faster. > > > It is even worse when doing things like: > > if (a && b) > > where on aarch64, this can be done using only one cmp followed by one ccmp. > > NOTE on PowerPC, you could use in theory crand/cror (though this is not done > > currently and I don't know if they are profitable in any recent design). > > > > Plus aarch64 has conditional add and a few other things which improve the > > idea of a conditional move. > > I can see conditional moves are almost always a win on less > pipelined/speculative implementations. Nobody wants a change that makes code slower on our pipelined/ speculative targets, but this is a concrete case where code is already made slower. If the code before optimization has no branch, as in the case of "a = (m & b)|(~m & c)", we can be certain that replacing it with a cmov does not introduce any new data dependence. Anyway, for the case of ?:, where cmov would replace a branch, Gcc is already happy to substitute a cmov instruction. Gcc just refuses to put in a second cmov, after it, for no apparent reason.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #6 from Richard Biener --- (In reply to Andrew Pinski from comment #4) > (In reply to Alexander Monakov from comment #3) > > So perhaps an unpopular opinion, but I'd say a > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout > > optimization pipeline as a non-branchy COND_EXPR) is badly missing. > > I am going to say otherwise. Many of the time conditional move is faster > than using a branch; even if the branch is predictable (there are a few > exceptions) on most non-Intel/AMD targets. This is because the conditional > move is just one cycle and only a "predictable" branch is one cyle too. The issue with a conditional move is that it adds a data dependence while branches are usually speculated and thus have zero overhead in the execution stage. The extra dependence can easily slow things down dependent on the (three!) instructions feeding the conditional move (condition, first and second source). This is why well-predicted branches are often so much faster. > It is even worse when doing things like: > if (a && b) > where on aarch64, this can be done using only one cmp followed by one ccmp. > NOTE on PowerPC, you could use in theory crand/cror (though this is not done > currently and I don't know if they are profitable in any recent design). > > Plus aarch64 has conditional add and a few other things which improve the > idea of a conditional move. I can see conditional moves are almost always a win on less pipelined/speculative implementations.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #5 from ncm at cantrip dot org --- (In reply to Alexander Monakov from comment #3) > The compiler has no way of knowing ahead of time that you will be evaluating > the result on random data; for mostly-sorted arrays branching is arguably > preferable. > > __builtin_expect_with_probability is a poor proxy for unpredictability: a > condition that is true every other time leads to a branch that is both very > predictable and has probability 0.5. If putting it in made my code slower, I would take it back out. The only value it has is if it changes something. If it doesn't improve matters, I need to try something else. For it to do nothing helps nobody. > I think what you really need is a way to express branchless selection in the > source code when you know you need it but the compiler cannot see that on > its own. Other algorithms like constant-time checks for security-sensitive > applications probably also need such computational primitive. > > So perhaps an unpopular opinion, but I'd say a > __builtin_branchless_select(c, a, b) (guaranteed to live throughout > optimization pipeline as a non-branchy COND_EXPR) is badly missing. We can quibble over whether the name of the intrinsic means anything with a value of 0.5, but no other meaning would be useful. But in general I would rather write portable code to get the semantics I need. I don't have a preference between the AND/OR notation and the indexing version, except that the former seems like a more generally useful optimization. Best would be both.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #4 from Andrew Pinski --- (In reply to Alexander Monakov from comment #3) > So perhaps an unpopular opinion, but I'd say a > __builtin_branchless_select(c, a, b) (guaranteed to live throughout > optimization pipeline as a non-branchy COND_EXPR) is badly missing. I am going to say otherwise. Many of the time conditional move is faster than using a branch; even if the branch is predictable (there are a few exceptions) on most non-Intel/AMD targets. This is because the conditional move is just one cycle and only a "predictable" branch is one cyle too. It is even worse when doing things like: if (a && b) where on aarch64, this can be done using only one cmp followed by one ccmp. NOTE on PowerPC, you could use in theory crand/cror (though this is not done currently and I don't know if they are profitable in any recent design). Plus aarch64 has conditional add and a few other things which improve the idea of a conditional move.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- The compiler has no way of knowing ahead of time that you will be evaluating the result on random data; for mostly-sorted arrays branching is arguably preferable. __builtin_expect_with_probability is a poor proxy for unpredictability: a condition that is true every other time leads to a branch that is both very predictable and has probability 0.5. I think what you really need is a way to express branchless selection in the source code when you know you need it but the compiler cannot see that on its own. Other algorithms like constant-time checks for security-sensitive applications probably also need such computational primitive. So perhaps an unpopular opinion, but I'd say a __builtin_branchless_select(c, a, b) (guaranteed to live throughout optimization pipeline as a non-branchy COND_EXPR) is badly missing.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #2 from ncm at cantrip dot org --- Created attachment 47593 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47593=edit a makefile This duplicates code found on the linked archive. E.g.: make all make CC=g++-9 INDEXED make CC=clang++-9 ANDANDOR make CC=clang++-9 INDEXED_PESSIMAL_ON_GCC make CC=g++-9 CHECK=CHECK BOG
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 --- Comment #1 from ncm at cantrip dot org --- Created attachment 47592 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47592=edit code demonstrating the failure