[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-11-25 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #22 from PeteVine  ---
> I don't know what exactly "fixed" this

That would be nice to know. This I can say for sure: gcc 7.2.1 20171116 still
produces slower profiled code on the target system. 

I've also discovered that compiling and profiling on a binary-compatible
Cortex-A17 system (same flags) produces binaries that don't run any slower on
the target system.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-11-23 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

Ramana Radhakrishnan  changed:

   What|Removed |Added

   Target Milestone|--- |8.0

--- Comment #21 from Ramana Radhakrishnan  ---
Though I don't know what exactly "fixed" this, marking it as fixed for GCC 8 as
per the reporter.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-11-23 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

PeteVine  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |FIXED

--- Comment #20 from PeteVine  ---
The bug doesn't reproduce in a recent GCC 8 build (profiling on a Cortex A5
system).

The generated assembly contains no __aeabi_idiv calls whatsoever. Well done.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-11-23 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #19 from PeteVine  ---
Created attachment 42694
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42694&action=edit
Better assembly after profiling

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-21 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #18 from PeteVine  ---
> Well that sounds like the same issue.

> Note -fprofile-generate simply inserts counters in the generated code. In 
> fact the generated code is practically identical between Cortex-A5 and 
> Cortex-A7.

As long as the gcda file is not present, -fprofile-use yields an equally good
binary (obviously!), so clearly it's about the profile data somehow. If you
have any ideas or debugging suggestions, go ahead, I'll gladly test them.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-21 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #16)
> Also, I'd like to repeat the fact using -mcpu=cortex-a7 fixes the issue (no
> library calls present). 

Cortex-A7 has hardware division so it doesn't emit library calls.

> Incidentally, having run that A7 profiled binary on a Cortex-A53, I'm seeing
> a 10% hit compared to a vanilla A7 binary. Hopefully that's just an artifact
> of profiling a different CPU architecture.

Well that sounds like the same issue.

Note -fprofile-generate simply inserts counters in the generated code. In fact
the generated code is practically identical between Cortex-A5 and Cortex-A7.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-21 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #16 from PeteVine  ---
Also, I'd like to repeat the fact using -mcpu=cortex-a7 fixes the issue (no
library calls present). 

Incidentally, having run that A7 profiled binary on a Cortex-A53, I'm seeing a
10% hit compared to a vanilla A7 binary. Hopefully that's just an artifact of
profiling a different CPU architecture.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-21 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #15 from PeteVine  ---
I don't have a cross-compiler built/installed.

If you're positive the bug doesn't reproduce on your end (targeting generic or
A5 codegen), then maybe it's about some interaction between gcc instrumentation
and the slightly dated system libraries.  

I think my little A5->A53 experiment shows once the instrumented binary is
built, it doesn't matter how the profile data is gathered.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-21 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #11)
> I've just retested gcc7 on both ARM platforms. 
> 
> AArch64 gets a 3% improvement now, while ARMv7 reproduces the issue, just as
> before. I'm compiling/profiling on a Cortex A5 which could be the main
> reason behind all this, as it doesn't have hard division.

Can you try comparing the .S outputs on both the Cortex-A5 and Cortex-A53
systems using the exact same options, i.e. -marm -mcpu=cortex-a5? Assuming
you're using the same GCC version, you should get identical .S files and the
same .gcda.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-20 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #13 from PeteVine  ---
Created attachment 41240
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41240&action=edit
Assembly files produced with -fverbose-asm

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-20 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #12 from PeteVine  ---
It even reproduces the following way:

I built an instrumented ARMv7 binary natively, ran it on a Cortex-A53, copied
the gcda file back, recompiled with -fprofile-use and got the same 20%
slowdown.

Surely, that must count (pun intended) for something, as both CPUs are
in-order designs.
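
The build/run/rebuild cycle described in this comment can be sketched as a small dry-run helper. This is only an illustration: the file names `sudoku.c` and `sudoku` and the base flags are assumptions, not the reporter's actual invocation. Only `-fprofile-generate` and `-fprofile-use` are taken from the thread.

```python
import shlex

def pgo_commands(source, output, base_flags):
    """Return the two gcc command lines of a profile-guided build.

    Step 1 builds an instrumented binary. Running it writes a .gcda profile.
    Step 2 rebuilds with -fprofile-use, which reads that profile back.
    """
    common = ["gcc", *base_flags]
    instrumented = [*common, "-fprofile-generate", source, "-o", output]
    optimized = [*common, "-fprofile-use", source, "-o", output]
    return instrumented, optimized

# Hypothetical names and flags, for illustration only.
step1, step2 = pgo_commands("sudoku.c", "sudoku", ["-O2", "-mcpu=cortex-a5"])
print(shlex.join(step1))
print("# run the instrumented binary here (possibly on another machine),")
print("# then copy the resulting .gcda file back next to the source")
print(shlex.join(step2))
```

The helper only prints the commands, so it can be read or adapted without a compiler installed.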

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-20 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #11 from PeteVine  ---
I've just retested gcc7 on both ARM platforms. 

AArch64 gets a 3% improvement now, while ARMv7 reproduces the issue, just as
before. I'm compiling/profiling on a Cortex A5 which could be the main reason
behind all this, as it doesn't have hard division.

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2017-04-19 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

wilco at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |WAITING
 CC||wilco at gcc dot gnu.org

--- Comment #10 from wilco at gcc dot gnu.org ---
I can't reproduce any of this. GCC6 and GCC7 always use smull for the divisions
on ARM, even with profile-use. I could only make GCC emit a library call by
using -Os on a CPU that doesn't have divide, but that is expected and correct.

On AArch64 I get > 20% speedup with -fprofile-use vs plain -O3, so it works as
expected. With -mcpu=cortex-a53 there are more uses of sdiv, but the profiled
version is still faster.

So without more details I don't see any issue here.
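
For readers unfamiliar with the smull approach mentioned above: a division by a compile-time constant can be replaced by a widening multiply and shifts, so no divide instruction or library call is needed. The sketch below models that arithmetic in Python. The divisor 9 is an assumption chosen because it suits a sudoku grid; the thread does not say which divisions the solver performs, and the exact sequence GCC emits may differ.

```python
MAGIC = 0x38E38E39  # ceil(2**33 / 9): multiplier for a signed 32-bit divide by 9
SHIFT = 1           # extra right shift applied to the high word of the product

def div9_via_mulhi(x):
    """Divide a signed 32-bit integer by 9 using a multiply and shifts only."""
    assert -2**31 <= x < 2**31
    high = (x * MAGIC) >> 32   # high 32 bits of the 64-bit product (what smull yields)
    q = high >> SHIFT          # arithmetic shift, rounds toward minus infinity
    if x < 0:
        q += 1                 # adjust to C's round-toward-zero result
    return q

def c_div(x, d):
    """Reference: C-style truncating division."""
    q = abs(x) // abs(d)
    return q if (x >= 0) == (d > 0) else -q

for value in (81, 80, -1, -81):
    print(value, div9_via_mulhi(value), c_div(value, 9))
```

A division by a value only known at run time cannot use this trick, which is where sdiv/udiv or a helper such as __aeabi_idiv comes in.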

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2016-10-22 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #9 from PeteVine  ---
It seems the LPATHBench exhibits the same issue.

https://raw.githubusercontent.com/logicchains/LPATHBench/master/c_fast.c

compiled the following way:

gcc -falign-functions=32 -std=gnu99 -O2 -mcpu=cortex-a5 -fomit-frame-pointer
-mfpu=neon -ftree-vectorize -ffast-math c_fast.c -o c_fast 

is faster than the profiled version (a 10-run average shows about a 4% slowdown).

Once again division is present in the profiled assembly:

bl  __aeabi_idiv
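
A check like the one above can be automated across several assembly files. The following is a minimal sketch, not a full assembly parser: it only looks at the first token of each line, and the list of ARM EABI helper names is an assumption that may not be exhaustive.

```python
from collections import Counter

HW_DIV = {"sdiv", "udiv"}  # hardware divide instructions
LIB_DIV = {"__aeabi_idiv", "__aeabi_uidiv",
           "__aeabi_idivmod", "__aeabi_uidivmod"}  # software-division helpers
BRANCHES = {"b", "bl", "blx"}

def count_divisions(asm_text):
    """Count hardware divides and calls to division helpers in GAS output."""
    counts = Counter()
    for line in asm_text.splitlines():
        # Drop comments: '@' in 32-bit ARM output, '//' in AArch64 output.
        code = line.split("@", 1)[0].split("//", 1)[0]
        tokens = code.split()
        if not tokens:
            continue
        mnemonic = tokens[0].lower()
        if mnemonic in HW_DIV:
            counts[mnemonic] += 1
        elif mnemonic in BRANCHES and len(tokens) > 1:
            target = tokens[1].split("(", 1)[0]  # strip a "(PLT)" suffix
            if target in LIB_DIV:
                counts[target] += 1
    return counts

sample = """
    mov r1, #9
    bl  __aeabi_idiv
    sdiv r0, r0, r1   @ hardware divide
    @ bl __aeabi_idiv (commented out)
    bl  __aeabi_idiv(PLT)
"""
print(dict(count_divisions(sample)))
```

Running it on the profiled and non-profiled .s files side by side shows at a glance whether profiling introduced extra divisions.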

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2016-10-04 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #8 from PeteVine  ---
Created attachment 39749
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39749&action=edit
aarch64 assembly

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2016-10-04 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #7 from PeteVine  ---
Even though it's probably a different issue (affecting GCC6/7), profiling
makes the solver about 2-3% slower on aarch64:

         profiled  non-profiled
gcc5.4   799       875
gcc6.2   790       773
gcc7.0   752       730

But guess what, if you grep for `sdiv`, there are 9 of them in the profiled asm
file versus just 6 in the non-profiled version. FWIW, I'm attaching the
files.
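
As a quick check of the "2-3% slower" figure, the relative change can be computed from the numbers quoted in this comment. The sketch assumes the values are run times (lower is better); the unit is not stated in the thread.

```python
# (profiled, non-profiled) timings copied from the comment; unit not stated.
timings = {
    "gcc5.4": (799, 875),
    "gcc6.2": (790, 773),
    "gcc7.0": (752, 730),
}

def profiled_change_percent(profiled, baseline):
    """Positive result: the profiled build is slower than the baseline."""
    return (profiled - baseline) / baseline * 100.0

for version, (profiled, baseline) in timings.items():
    print(f"{version}: {profiled_change_percent(profiled, baseline):+.1f}%")
```

Under that assumption the profiled build is slower only with GCC 6.2 and 7.0, while with GCC 5.4 profiling still helps.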

[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv

2016-09-16 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

Ramana Radhakrishnan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-09-16
 CC||ramana at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #6 from Ramana Radhakrishnan  ---
Confirmed then.



Ramana