[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2021-12-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-10-07 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

--- Comment #4 from Dmitrij Pochepko  ---
(In reply to Andrew Pinski from comment #3)
> ...

I haven't tracked the values deepsjeng passes to the logL function specifically;
I only measured totals. The slowdown might not be directly related to logL
execution time.

I also measured separate synthetic benchmarks with loop-based and
non-loop-based implementations (a simple logL calculation in a loop, adding
each result into an accumulator). For arguments 0 and 1 I see about 2% slower
numbers with the synthetic benchmark on T99. I hope this information helps
anyone who decides to work on this patch.
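
A synthetic benchmark of the kind described above could be sketched roughly as follows. This is only an illustration: the actual benchmark source is not shown in this thread, so the names logl_loop, logl_clz and run_accumulate are hypothetical, and the bodies are a guess at what "loop-based" and "non-loop-based" mean here. Timing is left out; wrap the calls in whatever timer you prefer.

```c
#include <limits.h>

/* Loop-based form (hypothetical reconstruction of logL):
   counts right shifts until the value becomes zero. */
static int logl_loop(unsigned a)
{
    int r = 0;
    while (a >>= 1)
        r++;
    return r;
}

/* Non-loop form using the GCC/Clang builtin. __builtin_clz(0) is
   undefined, so zero is handled explicitly to match the loop form. */
static int logl_clz(unsigned a)
{
    return a ? (int)(sizeof a * CHAR_BIT) - 1 - __builtin_clz(a) : 0;
}

/* Call f on the same argument iters times and sum the results.
   The volatile read keeps the argument opaque to the optimizer, and
   returning the accumulator keeps the calls from being discarded. */
static unsigned long run_accumulate(int (*f)(unsigned), unsigned arg,
                                    unsigned long iters)
{
    volatile unsigned input = arg;
    unsigned long acc = 0;
    for (unsigned long i = 0; i < iters; i++)
        acc += (unsigned long)f(input);
    return acc;
}
```

Calling run_accumulate(logl_loop, 1, n) and run_accumulate(logl_clz, 1, n) under a timer compares the two forms for the argument 1. Note that the indirect call through f adds its own overhead, so a real measurement might inline each variant instead.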

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-10-02 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #3 from Andrew Pinski  ---
(In reply to Dmitrij Pochepko from comment #2)
> aarch64 won't necessarily be faster with such a fix.
> 531.deepsjeng_r on ThunderX2 runs about 0.5% slower with 31-clz(a).

This sounds like we only pass 0 or 1 to this function in deepsjeng_r?
Have you figured out the values that deepsjeng_r uses for these loops?

Because 31-clz would be:
        clz     w0, w0
        mov     w1, 31
        sub     w0, w1, w0
--- CUT ---
While the loop version would be:
        asr     w1, w0, 1
        mov     w0, 0
        cbz     w1, .L3
        .p2align 2
.L5:
        add     w0, w0, 1
        asr     w1, w1, 1
        cbnz    w1, .L5
.L3:

If the first branch (the cbz) is predicted taken and is actually taken, that
is, the loop is skipped, the loop version would be a few cycles faster than the
non-loop one. That would also mean the argument in w0 is either 0 or 1.

Did you analyze why it was worse on TX2?
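
For readers following the assembly, the two forms can be written in C roughly as below. This is a sketch under assumptions: the source of the function is not quoted in this thread, so log2_loop, log2_clz and forms_agree are hypothetical names, and the argument is taken as unsigned (the assembly above uses asr, which suggests a signed int in the original). __builtin_clz is undefined for 0, so the closed form guards that case.

```c
#include <limits.h>

/* Loop form: shift right until the value is zero, counting the steps.
   Yields floor(log2(a)) for a > 0, and 0 for a == 0 or a == 1
   (in both of those cases the loop body never runs). */
static int log2_loop(unsigned a)
{
    int r = 0;
    while (a >>= 1)
        r++;
    return r;
}

/* Closed form: index of the highest set bit, i.e. 31 - clz(a) when
   unsigned is 32 bits wide. Zero is special-cased because
   __builtin_clz(0) has an undefined result. */
static int log2_clz(unsigned a)
{
    return a ? (int)(sizeof a * CHAR_BIT) - 1 - __builtin_clz(a) : 0;
}

/* Return 1 if both forms agree for every value in [0, limit]. */
static int forms_agree(unsigned limit)
{
    for (unsigned a = 0; ; a++) {
        if (log2_loop(a) != log2_clz(a))
            return 0;
        if (a == limit)
            break;
    }
    return 1;
}
```

The zero guard in log2_clz is part of why the replacement is not free: a transformation inside the compiler would need the same care for inputs where the count-leading-zeros result is not defined at the source level.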

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-10-02 Thread dpochepk at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #2 from Dmitrij Pochepko  ---
aarch64 won't necessarily be faster with such a fix.
531.deepsjeng_r on ThunderX2 runs about 0.5% slower with 31-clz(a).

[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)

2019-06-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-06-12
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener  ---
Popcount detection is done in niter analysis (number_of_iterations_popcount),
and I'd put this in there, too. It might be possible to generalize the
detection routine rather than creating another one, or at least to factor
out the common bits.
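
For context, the loop shape that popcount detection targets is the classic clear-lowest-set-bit idiom, sketched below next to the builtin it can be reduced to. The function names popcount_loop and popcount_builtin are hypothetical and only illustrate the idiom; this is not code taken from GCC or from this bug.

```c
/* Clear-lowest-set-bit loop: each iteration removes one set bit, so
   the iteration count equals the number of set bits in b. */
static int popcount_loop(unsigned b)
{
    int c = 0;
    while (b) {
        b &= b - 1;   /* drop the lowest set bit */
        c++;
    }
    return c;
}

/* The closed form the loop's iteration count corresponds to. */
static int popcount_builtin(unsigned b)
{
    return __builtin_popcount(b);
}
```

The shift-and-count loop discussed in this bug has the same flavour: its iteration count has a closed form in terms of a bit-counting builtin, which is why sharing the detection code looks plausible.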