[Bug tree-optimization/90839] Detect lsb ones counting loop (final value replacement?)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

--- Comment #4 from Dmitrij Pochepko ---
(In reply to Andrew Pinski from comment #3)
> ...

I haven't tracked which values deepsjeng passes to the logL function specifically; I only measured totals. The slowdown might not be directly related to the execution time of the logL code.

I also measured separate synthetic benchmarks with loop-based and non-loop-based implementations (a simple logL calculation in a loop, adding the result into an accumulator). For arguments 0 and 1, I see about 2% slower numbers with the synthetic benchmark on T99.

I hope this info helps anyone who decides to work on this patch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu.org

--- Comment #3 from Andrew Pinski ---
(In reply to Dmitrij Pochepko from comment #2)
> aarch64 won't be necessarily faster with such fix.
> 531.deepsjeng_r on ThunderX2 shows about 0.5% slower numbers with 31-clz(a).

This sounds like we only pass 0 or 1 to this function in deepsjeng_r? Have you figured out the values that deepsjeng_r uses for these loops?

Because 31-clz would be:
        clz     w0, w0
        mov     w1, 31
        sub     w0, w1, w0
--- CUT ---
While the loop version would be:
        asr     w1, w0, 1
        mov     w0, 0
        cbz     w1, .L3
        .p2align 2
.L5:
        add     w0, w0, 1
        asr     w1, w1, 1
        cbnz    w1, .L5
.L3:

If the first branch was predicted as taken (and it actually was taken, that is, the loop is skipped), the loop version would be a few cycles faster than the non-loop-based one. This would also mean the value of w0 is either 0 or 1. Did you analyze why it was worse on TX2?
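For readers following the comparison above, here is a minimal C sketch of the two forms being discussed. It is reconstructed from the quoted assembly, not copied from the deepsjeng sources, and the function names log2_loop and log2_clz are hypothetical. It assumes GCC or Clang (for __builtin_clz) and a positive argument for the closed form.

```c
#include <assert.h>
#include <limits.h>

/* Loop form (hypothetical name): counts how many more times a can be
   shifted right by one after the first shift before it reaches zero.
   This mirrors the asr/cbz/cbnz sequence quoted above.  */
static int log2_loop(int a)
{
    int i = 0;
    a >>= 1;
    while (a != 0) {
        i++;
        a >>= 1;
    }
    return i;
}

/* Closed form (hypothetical name): (bits in int) - 1 - clz(a), which is
   31 - clz(a) for a 32-bit int.  __builtin_clz(0) is undefined, so this
   sketch only accepts a > 0.  */
static int log2_clz(int a)
{
    assert(a > 0);
    return (int)(sizeof(int) * CHAR_BIT) - 1 - __builtin_clz((unsigned)a);
}
```

For an argument of 0 or 1 the loop form falls straight through the first branch and returns 0, which is the case the comment above singles out. The two forms agree only for positive arguments: the loop returns 0 for a == 0, where the builtin is undefined, and with an arithmetic right shift a negative argument never reaches zero.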
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Dmitrij Pochepko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpochepk at gmail dot com

--- Comment #2 from Dmitrij Pochepko ---
aarch64 won't be necessarily faster with such fix.
531.deepsjeng_r on ThunderX2 shows about 0.5% slower numbers with 31-clz(a).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90839

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-06-12
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
Popcount detection is done in niter analysis and I'd put this in there, too (number_of_iterations_popcount). It might be possible to generalize the detection routine rather than creating another one, or at least to factor out the common bits.
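As background for the popcount detection mentioned above, the loop shape it is understood to target is the clear-lowest-set-bit idiom, whose trip count equals the number of set bits. The sketch below is illustrative only: the function names are hypothetical and are not taken from the GCC sources or testsuite, and __builtin_popcountl assumes GCC or Clang.

```c
/* Loop form (hypothetical name): each iteration clears the lowest set
   bit of b, so the loop runs once per set bit.  */
static int popcount_loop(unsigned long b)
{
    int c = 0;
    while (b != 0) {
        b &= b - 1;
        c++;
    }
    return c;
}

/* Closed form (hypothetical name): the value the loop's final count
   can be replaced with.  */
static int popcount_builtin(unsigned long b)
{
    return __builtin_popcountl(b);
}
```

The shift-until-zero loop in this bug has the same property that its final value can be written as a single bit-counting operation, which is presumably why sharing the detection code is suggested.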