[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 Martin Liška changed: What|Removed |Added CC||marxin at gcc dot gnu.org --- Comment #28 from Martin Liška --- Vladimir: Can the bug be marked as resolved?
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #27 from Wilco ---
On AArch64 this patch improves codesize slightly and perf results are within noise. Typical changes are like:

1)
        mov     x1, x2
        mov     w2, 96
        ldr     x1, [x1, 8]
--->>
        ldr     x1, [x2, 8]
        mov     w2, 96

2)
        add     x0, x0, :lo12:board
        mov     x26, x0
--->>>
        add     x26, x0, :lo12:board

so basically avoiding redundant moves which is always good.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #26 from Andreas Krebbel --- (In reply to Andreas Krebbel from comment #21) > However, there appear to be performance regressions at least with cactusADM. I was not able to reproduce the regression in a non-shared environment. In fact the re-measurement shows a slight improvement instead.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #25 from Vladimir Makarov ---
Author: vmakarov
Date: Fri Aug 5 21:31:31 2016
New Revision: 239180

URL: https://gcc.gnu.org/viewcvs?rev=239180&root=gcc&view=rev
Log:
2016-08-05  Vladimir Makarov

	PR rtl-optimization/69847
	* lra-constraints.c (process_invariant_for_inheritance): Save
	pattern instead of src.
	(remove_inheritance_pseudos): Use the pattern.  Add assert.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-constraints.c
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #24 from mwahab at gcc dot gnu.org ---
Created attachment 39055 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39055&action=edit
Testcase for ICE on arm targets

Attached vshuf-v64qi.i for the ICE on arm targets. The failure is triggered with -mfloat-abi=hard -O1. The cc1 command line is:

    cc1 vshuf-v64qi.i -mfloat-abi=hard -O1

The float-abi option is suspicious but I get the same failure when -march=armv8-a -mfpu=neon-fp-armv8 is added so I don't think that it's the lack of FP registers that's the problem.

Matthew
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #23 from Vladimir Makarov ---
(In reply to mwahab from comment #22)
> I believe that this patch is the cause of compilation failures for a number
> of tests on arm-none-linux-gnueabihf and arm-none-eabi.
>
> E.g. arm-none-linux-gnueabihf-gcc -S testsuite/gcc.dg/torture/vshuf-v64qi.c
> -O1
> results in an ICE with
>
> testsuite/gcc.dg/torture/vshuf-main.inc: In function ‘test_2’:
> testsuite/gcc.dg/torture/vshuf-main.inc:28:1: error: insn does not satisfy
> its constraints:
> TESTS
> ^
> (insn 606 607 589 2 (set (reg:QI 3 r3 [326])
> (reg:QI 31 s15)) testsuite/gcc.dg/torture/vshuf-main.inc:28 182
> {*arm_movqi_insn}
> (expr_list:REG_EQUAL (const_int 13 [0xd])
> (nil)))
> testsuite/gcc.dg/torture/vshuf-main.inc:28:1: internal compiler error: in
> extract_constrain_insn, at recog.c:2211
> 0xaea3e8 _fatal_insn(char const*, rtx_def const*, char const*, int, char
> const*)
> /work/src/gnutools/armdsg/src/gcc/gcc/rtl-error.c:108
> 0xaea40f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
> /work/src/gnutools/armdsg/src/gcc/gcc/rtl-error.c:119
> 0xabcf1d extract_constrain_insn(rtx_insn*)
> /work/src/gnutools/armdsg/src/gcc/gcc/recog.c:2211
> 0x9d76af check_rtl
> /work/src/gnutools/armdsg/src/gcc/gcc/lra.c:2108
> 0x9dbfa9 lra(_IO_FILE*)
> /work/src/gnutools/armdsg/src/gcc/gcc/lra.c:2516
> 0x99106f do_reload
> /work/src/gnutools/armdsg/src/gcc/gcc/ira.c:5384
> 0x99106f execute
> /work/src/gnutools/armdsg/src/gcc/gcc/ira.c:5568
>
>
> The test compiles with trunk before the change.
>
> Let me know if there's anything else I should check.

I tried to reproduce it on my arm board (armv7l-unknown-linux-gnueabihf) and with an arm-none-eabi cross-compiler, without any success. Could you provide a preprocessed source file with all *cc1* options for arm-none-eabi? Thank you.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 mwahab at gcc dot gnu.org changed: What|Removed |Added CC||mwahab at gcc dot gnu.org --- Comment #22 from mwahab at gcc dot gnu.org ---
I believe that this patch is the cause of compilation failures for a number of tests on arm-none-linux-gnueabihf and arm-none-eabi.

E.g. arm-none-linux-gnueabihf-gcc -S testsuite/gcc.dg/torture/vshuf-v64qi.c -O1
results in an ICE with

testsuite/gcc.dg/torture/vshuf-main.inc: In function ‘test_2’:
testsuite/gcc.dg/torture/vshuf-main.inc:28:1: error: insn does not satisfy its constraints:
 TESTS
 ^
(insn 606 607 589 2 (set (reg:QI 3 r3 [326])
        (reg:QI 31 s15)) testsuite/gcc.dg/torture/vshuf-main.inc:28 182 {*arm_movqi_insn}
     (expr_list:REG_EQUAL (const_int 13 [0xd])
        (nil)))
testsuite/gcc.dg/torture/vshuf-main.inc:28:1: internal compiler error: in extract_constrain_insn, at recog.c:2211
0xaea3e8 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
        /work/src/gnutools/armdsg/src/gcc/gcc/rtl-error.c:108
0xaea40f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
        /work/src/gnutools/armdsg/src/gcc/gcc/rtl-error.c:119
0xabcf1d extract_constrain_insn(rtx_insn*)
        /work/src/gnutools/armdsg/src/gcc/gcc/recog.c:2211
0x9d76af check_rtl
        /work/src/gnutools/armdsg/src/gcc/gcc/lra.c:2108
0x9dbfa9 lra(_IO_FILE*)
        /work/src/gnutools/armdsg/src/gcc/gcc/lra.c:2516
0x99106f do_reload
        /work/src/gnutools/armdsg/src/gcc/gcc/ira.c:5384
0x99106f execute
        /work/src/gnutools/armdsg/src/gcc/gcc/ira.c:5568

The test compiles with trunk before the change.

Let me know if there's anything else I should check.

Matthew
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #21 from Andreas Krebbel --- I did an S/390 Spec measurement. All the testcases got smaller with the patch, which is definitely a good sign. However, there appear to be performance regressions at least with cactusADM.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #20 from Vladimir Makarov --- (In reply to Bill Schmidt from comment #17) > Vlad, the patch checks out very well on powerpc64le. 403.gcc no longer > degrades. We are seeing some very nice improvements from LRA over reload on > a few benchmarks (435.gromacs leads the way with +9.5%). Everything in > CPU2006 is positive or noise except for 401.bzip2, which shows about a 2% > degradation. We'll look into that separately to see if it's real and open > another bug if there is anything to report. > > Once the patch is in place, I think we will be ready to switch the POWER > server targets to use LRA by default. > > Thanks! Excellent results. Mike, Bill, thank you for benchmarking the patch. I've just submitted the patch into the trunk. As the patch affects a complicated inheritance code in LRA, there is a possibility of new regressions on some targets. So if this happens, another iteration on the patch might be done.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #19 from Vladimir Makarov ---
Author: vmakarov
Date: Tue Aug 2 16:07:36 2016
New Revision: 238991

URL: https://gcc.gnu.org/viewcvs?rev=238991&root=gcc&view=rev
Log:
2016-08-02  Vladimir Makarov

	PR rtl-optimization/69847
	* lra-int.h (struct lra-reg): Use restore_rtx instead of
	restore_regno.
	(lra_rtx_hash): New.
	* lra.c (initialize_lra_reg_info_element): Use restore_rtx
	instead of restore_regno.
	(lra_rtx_hash): Rename and move lra-remat.c::rtx_hash.
	* lra-remat.c (rtx_hash): Rename and Move to lra.c.
	* lra-spills.c (lra_final_code_change): Don't delete insn when
	the next insn is USE with the same reg as the current insn source.
	* lra-constraints.c (curr_insn_transform): Use restore_rtx
	instead of restore_regno.
	(lra_constraints_init): Call initiate_invariants.
	(lra_constraints_finish): Call finish_invariants.
	(struct invariant, invariant_t, invariant_ptr_t): New.
	(const_invariant_ptr_t, invariants, invariants_pool): New.
	(invariant_table, invariant_hash, invariant_eq_p): New.
	(insert_invariant, initiate_invariants, finish_invariants): New.
	(clear_invariants, invalid_invariant_regs): New.
	(inherit_reload_reg, split_reg, fix_bb_live_info): Use restore_rtx
	instead of restore_regno.
	(invariant_p, process_invariant_for_inheritance): New.
	(inherit_in_ebb): Implement invariant inheritance.
	(lra_inheritance): Initialize and finalize invalid_invariant_regs.
	(remove_inheritance_pseudos): Implement undoing invariant
	inheritance.
	(undo_optional_reloads, lra_undo_inheritance): Use restore_rtx
	instead of restore_regno.
	* lra-assigns.c (regno_live_length): New.
	(reload_pseudo_compare_func): Use regno_live_length.
	(assign_by_spills): Use restore_rtx instead of restore_regno.
	(lra_assign): Ditto.  Initiate regno_live_length.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lra-assigns.c
    trunk/gcc/lra-constraints.c
    trunk/gcc/lra-int.h
    trunk/gcc/lra-remat.c
    trunk/gcc/lra-spills.c
    trunk/gcc/lra.c
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #18 from Ramana Radhakrishnan --- (In reply to Bill Schmidt from comment #17) > Vlad, the patch checks out very well on powerpc64le. 403.gcc no longer > degrades. We are seeing some very nice improvements from LRA over reload on > a few benchmarks (435.gromacs leads the way with +9.5%). Everything in > CPU2006 is positive or noise except for 401.bzip2, which shows about a 2% > degradation. We'll look into that separately to see if it's real and open > another bug if there is anything to report. > > Once the patch is in place, I think we will be ready to switch the POWER > server targets to use LRA by default. > > Thanks! Excellent results. > > Bill We'll look at what this does on AArch64 and feedback on this thread.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #17 from Bill Schmidt --- Vlad, the patch checks out very well on powerpc64le. 403.gcc no longer degrades. We are seeing some very nice improvements from LRA over reload on a few benchmarks (435.gromacs leads the way with +9.5%). Everything in CPU2006 is positive or noise except for 401.bzip2, which shows about a 2% degradation. We'll look into that separately to see if it's real and open another bug if there is anything to report. Once the patch is in place, I think we will be ready to switch the POWER server targets to use LRA by default. Thanks! Excellent results. Bill
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #16 from Bill Schmidt --- Hi Vlad, I need to re-run my tests one more time because I goofed up the build on a few of them; however, I was able to verify that the degradation on 403.gcc has now gone away (I saw a slight improvement with lra over reload with your patch). I'll check back in tomorrow with full results, but for now it looks like you nailed the problem, and the overall results with lra are tentatively looking quite good. Thanks! Bill
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #15 from Michael Meissner --- Thanks for doing this. It looks like Bill will do spec testing, but if he doesn't I will fire off a run early next week.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #14 from Bill Schmidt --- Thanks, Vlad! I'll do some benchmarking with this patch in the next few days. Much obliged!
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #13 from Vladimir Makarov --- Hi, next week I am going to commit the patch I've just attached. The final version of the patch will have more comments. With the patch, LRA generates the same code for the test case as reload (the only difference is in which hard registers are used for some insns). So you can try the patch. Sorry for the long delay. It took more time than I originally planned.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #12 from Vladimir Makarov --- Created attachment 39029 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39029&action=edit A patch
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #11 from Michael Meissner --- Thanks for the update.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #10 from Vladimir Makarov ---
I've been working on this for about 2 weeks and I still don't see the problem being solved soon, so I've decided to write an update.

First of all, after analyzing the hot functions, I found that LRA generates more code to load high parts of addresses than the reload pass. Digging into the reload pass, I found that reload does inheritance beyond memory loads/stores and reuses values in pseudos containing an address high part where possible. This functionality is missing in LRA. I started to think about how best to implement it in LRA (inheritance or rematerialization) and came to the conclusion that inheritance is the only way to do this. Unfortunately, inheritance in LRA is a complicated thing and I need to make it even more complicated.

I have almost finished the implementation of inheritance beyond the spills. Unfortunately, I found that even though LRA generates fewer insns and fewer loads of address high parts for the code examples, there are still additional reg-reg moves in comparison with the code generated by the reload pass.

I think removing these additional moves will require changes in the LRA assignment sub-pass. Currently LRA assigns hard registers to reload pseudos first (to guarantee they will get hard registers in any case) and then to inheritance pseudos. To remove the moves, reload pseudos and the inheritance pseudos connected to them should be assigned at the same time (or one immediately after another). This means we need to be able to spill an inheritance pseudo (or assign it to another register) if another reload pseudo needs its register.

So I am going to try this new assignment scheme. ETA for this is June-July.
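For readers following along, the effect of inheriting address high parts described in comment #10 can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not LRA or reload code: each memory access is assumed to need the high part of one symbol in a register, a hypothetical "barrier" entry stands for anything that invalidates the values currently held, and the function name `count_high_loads` is made up for illustration.

```python
def count_high_loads(insns, inherit):
    """Count how many 'load high part' insns a toy allocator would emit.

    insns   -- list of (op, symbol) pairs; op is "mem" (an access that needs
               high(symbol) in a register) or "barrier" (hypothetical: all
               previously loaded high parts become invalid).
    inherit -- if True, reuse a high part already loaded for the same symbol.
    """
    available = set()   # symbols whose high part is currently in a register
    loads = 0
    for op, symbol in insns:
        if op == "barrier":
            available.clear()
        elif op == "mem":
            if inherit and symbol in available:
                continue            # value inherited, no new load needed
            loads += 1
            if inherit:
                available.add(symbol)
    return loads


if __name__ == "__main__":
    block = [("mem", "a"), ("mem", "a"), ("mem", "b"),
             ("barrier", None), ("mem", "a")]
    print("without inheritance:", count_high_loads(block, inherit=False))
    print("with inheritance:   ", count_high_loads(block, inherit=True))
```

In this toy block the second access to `a` reuses the already loaded high part, which is the kind of saving the comment attributes to reload's inheritance. The model ignores register pressure entirely, so it says nothing about the extra reg-reg moves discussed in the same comment.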
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 David Edelsohn changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2016-02-18 Ever confirmed|0 |1 --- Comment #9 from David Edelsohn --- Thanks for investigating, Vlad! This issue seems likely to occur on all RISC architectures, especially including ARM and SPARC in addition to PPC.
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #8 from Vladimir Makarov ---
(In reply to Michael Meissner from comment #7)
> The following options were used for LRA code generation:
> -DSPEC_CPU -DNDEBUG -I. -g -mlittle -save-temps=obj -ffast-math -O3
> -mveclibabi=mass -mcpu=power8 -mrecip=rsqrt -fpeel-loops -funroll-loops
> -ftree-vectorize -fvect-cost-model -msave-toc-indirect
> -fno-aggressive-loop-optimizations -mno-pointers-to-nested-functions -mlra
> -m64 -I/opt/at8.0/include
> -I/opt/at8.0/lib/gcc/powerpc64le-linux-gnu/4.9.4/include -fgnu89-inline
> -DSPEC_CPU_LP64 -DSPEC_CPU -DSPEC_CPU2006 -DINLINE=
>
> The following options were used for reload code generation:
> -DSPEC_CPU -DNDEBUG -I. -g -mlittle -save-temps=obj -ffast-math -O3
> -mveclibabi=mass -mcpu=power8 -mrecip=rsqrt -fpeel-loops -funroll-loops
> -ftree-vectorize -fvect-cost-model -msave-toc-indirect
> -fno-aggressive-loop-optimizations -mno-pointers-to-nested-functions -m64
> -I/opt/at8.0/include
> -I/opt/at8.0/lib/gcc/powerpc64le-linux-gnu/4.9.4/include -fgnu89-inline
> -DSPEC_CPU_LP64 -DSPEC_CPU -DSPEC_CPU2006 -DINLINE=

Mike, thanks for a good and detailed report.

After spending some time comparing the code generated by reload and LRA, I found at least that LRA generates unnecessary address reloads:

    Creating newreg=225, assigning class BASE_REGS to addr r225
    alt=0,overall=0,losers=0,rld_nregs=0
    Choosing alt 0 in insn 15: (0) =Y (1) r {*movdi_internal64}
    15: [r225:DI+low(unspec[`*.LANCHOR0',%2:DI] 47)]=r181:DI
      REG_DEAD r181:DI
      REG_DEAD %2:DI
    Inserting insn reload before:
    133: r225:DI=high(unspec[`*.LANCHOR0',%2:DI] 47)

After checking what the reload pass does in this situation, I found that reload transforms the memory address into a sum of high and low parts in the insn through *LEGITIMIZE_RELOAD_ADDRESS* before the actual reloading. LRA does not use LEGITIMIZE_RELOAD_ADDRESS at all, as the macro frequently uses push_reload, which is not part of LRA.

I'll think about how best to fix it. It will probably need the introduction of an analogous macro for LRA.
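As background for the high/low split mentioned in comment #8, here is a minimal arithmetic sketch. It assumes a 32-bit address and a signed 16-bit low part (the convention used by PowerPC-style `addis`/`addi` pairs); the function names are hypothetical and this is not how GCC represents `high`/`low` RTL internally.

```python
def split_high_low(addr):
    """Split a 32-bit address into (high, low) with a signed 16-bit low part.

    Because the low part is sign-extended when it is added back, the high
    part must be adjusted upward whenever bit 15 of the address is set.
    """
    low = addr & 0xFFFF
    if low >= 0x8000:
        low -= 0x10000                     # sign-extend the low 16 bits
    high = ((addr - low) >> 16) & 0xFFFF   # adjusted high 16 bits
    return high, low


def recombine(high, low):
    """Rebuild the address: (high << 16) + sign-extended low, modulo 2**32."""
    return ((high << 16) + low) & 0xFFFFFFFF


if __name__ == "__main__":
    for addr in (0x12340010, 0x12348004, 0xFFFFFFFF):
        high, low = split_high_low(addr)
        print(hex(addr), "-> high", hex(high), "low", low)
        assert recombine(high, low) == addr
```

The high part depends only on the symbol's address, not on the individual access, which is why a single loaded high part can in principle serve several nearby memory accesses with different low parts.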
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #7 from Michael Meissner --- The following options were used for LRA code generation: -DSPEC_CPU -DNDEBUG -I. -g -mlittle -save-temps=obj -ffast-math -O3 -mveclibabi=mass -mcpu=power8 -mrecip=rsqrt -fpeel-loops -funroll-loops -ftree-vectorize -fvect-cost-model -msave-toc-indirect -fno-aggressive-loop-optimizations -mno-pointers-to-nested-functions -mlra -m64 -I/opt/at8.0/include -I/opt/at8.0/lib/gcc/powerpc64le-linux-gnu/4.9.4/include -fgnu89-inline -DSPEC_CPU_LP64 -DSPEC_CPU -DSPEC_CPU2006 -DINLINE= The following options were used for reload code generation: -DSPEC_CPU -DNDEBUG -I. -g -mlittle -save-temps=obj -ffast-math -O3 -mveclibabi=mass -mcpu=power8 -mrecip=rsqrt -fpeel-loops -funroll-loops -ftree-vectorize -fvect-cost-model -msave-toc-indirect -fno-aggressive-loop-optimizations -mno-pointers-to-nested-functions -m64 -I/opt/at8.0/include -I/opt/at8.0/lib/gcc/powerpc64le-linux-gnu/4.9.4/include -fgnu89-inline -DSPEC_CPU_LP64 -DSPEC_CPU -DSPEC_CPU2006 -DINLINE=
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #6 from Michael Meissner --- Created attachment 37715 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37715&action=edit operf output showing hotspots for reload
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #5 from Michael Meissner --- Created attachment 37714 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37714&action=edit operf output showing hotspots for LRA
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #4 from Michael Meissner --- Created attachment 37713 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37713&action=edit Assembler code produced by reload of the two hot functions
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #3 from Michael Meissner --- Created attachment 37712 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37712&action=edit Assembler code produced by LRA of the two hot functions
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #2 from Michael Meissner --- Created attachment 37711 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37711&action=edit Reduced preprocessed file containing just the two hot functions
[Bug rtl-optimization/69847] Spec 2006 403.gcc slows down with -mlra vs. reload on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69847 --- Comment #1 from Michael Meissner --- Created attachment 37710 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37710&action=edit Bitmap.i file that contains the two hot functions.