[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #22 from PeteVine ---
> I don't know what exactly "fixed" this

That would be nice to know. This I can say for sure: gcc 7.2.1 20171116 still produces slower profiled code on the target system.

I've also discovered that compiling and profiling on a binary-compatible Cortex-A17 system (same flags) produces binaries that don't run any slower on the target system.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

Ramana Radhakrishnan changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.0

--- Comment #21 from Ramana Radhakrishnan ---
Though I don't know what exactly "fixed" this, marking it as fixed for GCC 8 as per the reporter.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

PeteVine changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |FIXED

--- Comment #20 from PeteVine ---
The bug doesn't reproduce in a recent GCC 8 build (profiling on a Cortex A5 system). The generated assembly contains no __aeabi_idiv calls whatsoever.

Well done.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #19 from PeteVine ---
Created attachment 42694
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42694&action=edit
Better assembly after profiling
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #18 from PeteVine ---
> Well that sounds like the same issue.
> Note -fprofile-generate simply inserts counters in the generated code. In
> fact the generated code is practically identical between Cortex-A5 and
> Cortex-A7.

As long as the gcda file is not present, -fprofile-use yields an equally good binary (obviously!), so clearly the slowdown comes from the profile data somehow.

If you have any ideas or debugging suggestions, go ahead, I'll gladly test them.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #16)
> Also, I'd like to repeat the fact using -mcpu=cortex-a7 fixes the issue (no
> library calls present).

Cortex-A7 has hardware division so it doesn't emit library calls.

> Incidentally, having run that A7 profiled binary on a Cortex-A53, I'm seeing
> a 10% hit compared to a vanilla A7 binary. Hopefully that's just an artifact
> of profiling a different CPU architecture.

Well that sounds like the same issue.
Note -fprofile-generate simply inserts counters in the generated code. In fact the generated code is practically identical between Cortex-A5 and Cortex-A7.
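To illustrate the remark about instrumentation, here is a minimal sketch in plain C of what counter insertion amounts to conceptually. The names `edge_count`, `is_even` and `run_workload` are hypothetical; GCC's real instrumentation is generated by the compiler and dumped to the .gcda file by its runtime, not written by hand like this.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for compiler-inserted profile counters. */
uint64_t edge_count[2];

/* Original logic: return 1 for even values, 0 otherwise.
   The counter increments are the only additions, so the counts depend
   on the input data, not on which CPU executes the binary. */
int is_even(int x)
{
    if (x % 2 == 0) {
        edge_count[0]++;   /* branch taken */
        return 1;
    }
    edge_count[1]++;       /* branch not taken */
    return 0;
}

/* Run a fixed workload (0..9) and return how often the branch was taken. */
uint64_t run_workload(void)
{
    edge_count[0] = edge_count[1] = 0;
    for (int i = 0; i < 10; i++)
        (void)is_even(i);
    return edge_count[0];
}
```

Under that model, the same workload should produce the same counts on a Cortex-A5 and a Cortex-A53, which is consistent with the expectation of identical .gcda data stated in the thread.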
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #16 from PeteVine ---
Also, I'd like to repeat the fact that using -mcpu=cortex-a7 fixes the issue (no library calls present).

Incidentally, having run that A7 profiled binary on a Cortex-A53, I'm seeing a 10% hit compared to a vanilla A7 binary. Hopefully that's just an artifact of profiling a different CPU architecture.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #15 from PeteVine ---
I don't have a cross-compiler built/installed. If you're positive the bug doesn't reproduce on your end (targeting generic or A5 codegen), then maybe it's about some interaction between gcc instrumentation and the slightly dated system libraries.

I think my little A5->A53 experiment shows that once the instrumented binary is built, it doesn't matter how the profile data is gathered.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #11)
> I've just retested gcc7 on both ARM platforms.
>
> AArch64 gets a 3% improvement now, while ARMv7 reproduces the issue, just as
> before. I'm compiling/profiling on a Cortex A5 which could be the main
> reason behind all this, as it doesn't have hard division.

Can you try comparing the .S outputs on both the Cortex-A5 and Cortex-A53 systems using the exact same options, i.e. -marm -mcpu=cortex-a5? Assuming you're using the same GCC version, you should get identical .S files and the same .gcda.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #13 from PeteVine ---
Created attachment 41240
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41240&action=edit
Assembly files produced with -fverbose-asm
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #12 from PeteVine ---
It even reproduces the following way: I built an instrumented ARMv7 binary natively, ran it on a Cortex-A53, copied the gcda file back, recompiled with -fprofile-use and got the same 20% slowdown.

Surely, that must count (pun intended) for something, as both CPUs are in-order designs.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #11 from PeteVine ---
I've just retested gcc7 on both ARM platforms.

AArch64 gets a 3% improvement now, while ARMv7 reproduces the issue, just as before. I'm compiling/profiling on a Cortex A5 which could be the main reason behind all this, as it doesn't have hard division.
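For context on why missing hardware division matters: on a core without sdiv/udiv, a division by a run-time value has to be carried out in software, which on ARM EABI means a helper call such as the __aeabi_idiv seen elsewhere in this thread. The sketch below is a textbook shift-and-subtract unsigned divide, shown only to give a feel for the per-bit work involved. It is not libgcc's actual implementation, and `soft_udiv` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Textbook restoring division: one compare/subtract step per bit.
   Illustrative only; real runtime helpers are far more optimised. */
uint32_t soft_udiv(uint32_t n, uint32_t d)
{
    assert(d != 0);                 /* division by zero is not handled here */
    uint32_t q = 0;                 /* quotient, built up bit by bit */
    uint32_t r = 0;                 /* running remainder */
    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> bit) & 1u);
        if (r >= d) {
            r -= d;
            q |= (1u << bit);
        }
    }
    return q;
}
```

A loop of this shape runs 32 iterations per division, which is why an extra library division in a hot path can plausibly show up in benchmark timings.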
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

wilco at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #10 from wilco at gcc dot gnu.org ---
I can't reproduce any of this. GCC6 and GCC7 always use smull for the divisions on ARM, even with profile-use. I could only make GCC emit a library call by using -Os on a CPU that doesn't have divide, but that is expected and correct.

On AArch64 I get > 20% speedup with -fprofile-use vs plain -O3, so it works as expected. With -mcpu=cortex-a53 there are more uses of sdiv, but the profiled version is still faster.

So without more details I don't see any issue here.
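The smull remark refers to the usual trick of replacing division by a compile-time constant with a widening multiply and a shift. Below is a minimal sketch of that technique in C for the divisor 9; the divisor is an assumption chosen for illustration (the thread does not say which divisions the solver performs), and `div9` and `div9_mismatches` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a signed 32-bit value by the constant 9 without a divide
   instruction: take the high 32 bits of n * ceil(2^33 / 9), shift right
   by one, then add 1 for negative n so the result truncates toward zero.
   Relies on >> of a negative value being an arithmetic shift, which GCC
   provides but ISO C leaves implementation-defined. */
int32_t div9(int32_t n)
{
    const int64_t magic = 0x38E38E39;            /* ceil(2^33 / 9) */
    int32_t hi = (int32_t)((magic * n) >> 32);   /* high word of the 64-bit product */
    return (hi >> 1) + (int32_t)((uint32_t)n >> 31);
}

/* Count mismatches against the built-in operator over [lo, hi]. */
int div9_mismatches(int32_t lo, int32_t hi)
{
    int bad = 0;
    for (int64_t n = lo; n <= hi; n++)
        if (div9((int32_t)n) != (int32_t)n / 9)
            bad++;
    return bad;
}
```

This only works when the divisor is known at compile time; a divisor that is only known at run time still needs either a hardware divide instruction or a library call.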
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #9 from PeteVine ---
It seems the LPATHBench exhibits the same issue.

https://raw.githubusercontent.com/logicchains/LPATHBench/master/c_fast.c

compiled the following way:

gcc -falign-functions=32 -std=gnu99 -O2 -mcpu=cortex-a5 -fomit-frame-pointer -mfpu=neon -ftree-vectorize -ffast-math c_fast.c -o c_fast

is faster than a profiled version. (10 runs avg. shows about 4% slowdown)

Once again division is present in the profiled assembly:

bl __aeabi_idiv
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #8 from PeteVine ---
Created attachment 39749
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39749&action=edit
aarch64 assembly
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

--- Comment #7 from PeteVine ---
Even though it's probably a different issue (affecting GCC6/7), profiling makes the solver about 2-3% slower on aarch64:

        profiled/non-profiled
gcc5.4  799/875
gcc6.2  790/773
gcc7.0  752/730

But guess what, if you grep for `sdiv`, there are 9 of them in the profiled asm file versus just 6 in the non-profiled version.

FWIW, I'm attaching the files.
[Bug middle-end/70773] Profiled sudoku solver slower due to lack of sdiv/udiv
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70773

Ramana Radhakrishnan changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-09-16
                 CC|                            |ramana at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from Ramana Radhakrishnan ---
Confirmed then.

Ramana