[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-21 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

Martin Liška  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-01-21
 Ever confirmed|0   |1

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-09 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #7 from ncm at cantrip dot org ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Alexander Monakov from comment #3)
> > > So perhaps an unpopular opinion, but I'd say a
> > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > > optimization pipeline as a non-branchy COND_EXPR) is badly missing.
> > 
> > I am going to say otherwise.  Many of the time conditional move is faster
> > than using a branch; even if the branch is predictable (there are a few
> > exceptions) on most non-Intel/AMD targets.  This is because the conditional
> > move is just one cycle and only a "predictable" branch is one cy`le too.
> 
> The issue with a conditional move is that it adds a data dependence while
> branches are usually speculated and thus have zero overhead in the execution
> stage.  The extra dependence can easily slow things down dependent on the
> (three!) instructions feeding the conditional move (condition, first and
> second source).  This is why well-predicted branches are often so much
> faster.
> 
> > It is even worse when doing things like:
> > if (a && b)
> > where on aarch64, this can be done using only one cmp followed by one ccmp.
> > NOTE on PowerPC, you could use in theory crand/cror (though this is not done
> > currently and I don't know if they are profitable in any recent design).
> > 
> > Plus aarch64 has conditional add and a few other things which improve the
> > idea of a conditional move.
> 
> I can see conditional moves are almost always a win on less
> pipelined/speculative implementations.

Nobody wants a change that makes code slower on our pipelined/
speculative targets, but this is a concrete case where code is 
already made slower. If the code before optimization has no 
branch, as in the case of "a = (m & b)|(~m & c)", we can be 
certain that replacing it with a cmov does not introduce any 
new data dependence.

Anyway, for the case of ?:, where cmov would replace a branch, 
Gcc is already happy to substitute a cmov instruction. Gcc just 
refuses to put in a second cmov, after it, for no apparent reason.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #6 from Richard Biener  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Alexander Monakov from comment #3)
> > So perhaps an unpopular opinion, but I'd say a
> > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > optimization pipeline as a non-branchy COND_EXPR) is badly missing.
> 
> I am going to say otherwise.  Many of the time conditional move is faster
> than using a branch; even if the branch is predictable (there are a few
> exceptions) on most non-Intel/AMD targets.  This is because the conditional
> move is just one cycle and only a "predictable" branch is one cyle too.

The issue with a conditional move is that it adds a data dependence while
branches are usually speculated and thus have zero overhead in the execution
stage.  The extra dependence can easily slow things down dependent on the
(three!) instructions feeding the conditional move (condition, first and
second source).  This is why well-predicted branches are often so much
faster.

> It is even worse when doing things like:
> if (a && b)
> where on aarch64, this can be done using only one cmp followed by one ccmp.
> NOTE on PowerPC, you could use in theory crand/cror (though this is not done
> currently and I don't know if they are profitable in any recent design).
> 
> Plus aarch64 has conditional add and a few other things which improve the
> idea of a conditional move.

I can see conditional moves are almost always a win on less
pipelined/speculative implementations.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-06 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #5 from ncm at cantrip dot org ---
(In reply to Alexander Monakov from comment #3)
> The compiler has no way of knowing ahead of time that you will be evaluating
> the result on random data; for mostly-sorted arrays branching is arguably
> preferable.
> 
> __builtin_expect_with_probability is a poor proxy for unpredictability: a
> condition that is true every other time leads to a branch that is both very
> predictable and has probability 0.5.

If putting it in made my code slower, I would take it back out. The only value
it has is if it changes something. If it doesn't improve matters, I need to try
something else. For it to do nothing helps nobody.

> I think what you really need is a way to express branchless selection in the
> source code when you know you need it but the compiler cannot see that on
> its own. Other algorithms like constant-time checks for security-sensitive
> applications probably also need such computational primitive.
> 
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> optimization pipeline as a non-branchy COND_EXPR) is badly missing.

We can quibble over whether the name of the intrinsic means anything with a
value of 0.5, but no other meaning would be useful.

But in general I would rather write portable code to get the semantics I need.
I don't have a preference between the AND/OR notation and the indexing version,
except that the former seems like a more generally useful optimization. Best
would be both.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-06 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #4 from Andrew Pinski  ---
(In reply to Alexander Monakov from comment #3)
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> optimization pipeline as a non-branchy COND_EXPR) is badly missing.

I am going to say otherwise.  Many of the time conditional move is faster than
using a branch; even if the branch is predictable (there are a few exceptions)
on most non-Intel/AMD targets.  This is because the conditional move is just
one cycle and only a "predictable" branch is one cyle too.

It is even worse when doing things like:
if (a && b)
where on aarch64, this can be done using only one cmp followed by one ccmp.
NOTE on PowerPC, you could use in theory crand/cror (though this is not done
currently and I don't know if they are profitable in any recent design).

Plus aarch64 has conditional add and a few other things which improve the idea
of a conditional move.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-06 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
The compiler has no way of knowing ahead of time that you will be evaluating
the result on random data; for mostly-sorted arrays branching is arguably
preferable.

__builtin_expect_with_probability is a poor proxy for unpredictability: a
condition that is true every other time leads to a branch that is both very
predictable and has probability 0.5.

I think what you really need is a way to express branchless selection in the
source code when you know you need it but the compiler cannot see that on its
own. Other algorithms like constant-time checks for security-sensitive
applications probably also need such computational primitive.

So perhaps an unpopular opinion, but I'd say a __builtin_branchless_select(c,
a, b) (guaranteed to live throughout optimization pipeline as a non-branchy
COND_EXPR) is badly missing.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #2 from ncm at cantrip dot org ---
Created attachment 47593
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47593=edit
a makefile

This duplicates code found on the linked archive.

E.g.:

make all
make CC=g++-9 INDEXED
make CC=clang++-9 ANDANDOR
make CC=clang++-9 INDEXED_PESSIMAL_ON_GCC
make CC=g++-9 CHECK=CHECK BOG

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #1 from ncm at cantrip dot org ---
Created attachment 47592
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47592=edit
code demonstrating the failure