[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 Roger Sayle changed: What|Removed |Added Status|ASSIGNED|RESOLVED Target Milestone|--- |14.0 Resolution|--- |FIXED --- Comment #19 from Roger Sayle --- This should now be fixed on mainline.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #18 from CVS Commits --- The master branch has been updated by Roger Sayle : https://gcc.gnu.org/g:650c36ec461a722d9c65e82512b4c3aeec2ffee1 commit r14-335-g650c36ec461a722d9c65e82512b4c3aeec2ffee1 Author: Roger Sayle Date: Fri Apr 28 14:21:53 2023 +0100 PR rtl-optimization/109476: Use ZERO_EXTEND instead of zeroing a SUBREG. This patch fixes PR rtl-optimization/109476, which is a code quality regression affecting AVR. The cause is that the lower-subreg pass is sometimes overly aggressive, lowering the LSHIFTRT below: (insn 7 4 8 2 (set (reg:HI 51) (lshiftrt:HI (reg/v:HI 49 [ b ]) (const_int 8 [0x8]))) "t.ii":4:36 557 {lshrhi3} (nil)) into a pair of QImode SUBREG assignments: (insn 19 4 20 2 (set (subreg:QI (reg:HI 51) 0) (reg:QI 54 [ b+1 ])) "t.ii":4:36 86 {movqi_insn_split} (nil)) (insn 20 19 8 2 (set (subreg:QI (reg:HI 51) 1) (const_int 0 [0])) "t.ii":4:36 86 {movqi_insn_split} (nil)) but this idiom, SETs of SUBREGs, interferes with combine's ability to associate/fuse instructions. The solution, on targets that have a suitable ZERO_EXTEND (i.e. where the lower-subreg pass wouldn't itself split a ZERO_EXTEND, so "splitting_zext" is false), is to split/lower LSHIFTRT to a ZERO_EXTEND. To answer Richard's question in comment #10 of the bugzilla PR, the function resolve_shift_zext is called with one of four RTX codes, ASHIFTRT, LSHIFTRT, ZERO_EXTEND and ASHIFT, but only with LSHIFTRT can the setting of low_part and high_part SUBREGs be replaced by a ZERO_EXTEND. For ASHIFTRT, we require a sign extension, so don't set the high_part to zero; if we're splitting a ZERO_EXTEND then it doesn't make sense to replace it with a ZERO_EXTEND, and for ASHIFT we've played games to swap the high_part and low_part SUBREGs, so that we assign the low_part to zero (for double word shifts by greater than word size bits). 2023-04-28 Roger Sayle gcc/ChangeLog PR rtl-optimization/109476 * lower-subreg.cc: Include explow.h for force_reg. (find_decomposable_shift_zext): Pass an additional SPEED_P argument. If decomposing a suitable LSHIFTRT and we're not splitting ZERO_EXTEND (based on the current SPEED_P), then use a ZERO_EXTEND instead of setting a high part SUBREG to zero, which helps combine. (decompose_multiword_subregs): Update call to resolve_shift_zext. gcc/testsuite/ChangeLog PR rtl-optimization/109476 * gcc.target/avr/mmcu/pr109476.c: New test case.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 Roger Sayle changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |roger at nextmovesoftware dot com --- Comment #17 from Roger Sayle --- I've submitted an improved version of my patch for review: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616527.html
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #16 from Wilhelm M --- (In reply to Roger Sayle from comment #14) > My apologies for the delay/issues. My bootstrap and regression testing of > this patch (on x86_64-pc-linux-gnu) revealed an issue or two (including the > reported ICE). My plan was to fix/resolve all these before posting a > concrete submission to gcc-patches. We all appreciate your great effort in this case! Please don't hesitate to send here some patches to test with. I'll be happy to test your patches!
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #15 from Wilhelm M --- Just checked actual gcc 13.0.1 without the patch: then no ICE accurs.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #14 from Roger Sayle --- My apologies for the delay/issues. My bootstrap and regression testing of this patch (on x86_64-pc-linux-gnu) revealed an issue or two (including the reported ICE). My plan was to fix/resolve all these before posting a concrete submission to gcc-patches. The general approach is solid (thanks to everyone that agreed this was the correct place to fix things) but it's the corner cases (such as RTL sharing) that all need to be addressed prior to a reviewable submission.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #13 from Wilhelm M --- Yes, the ICE is with the patch. I'm pretty sure that this does not happen without the patch, but I will that check again.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #12 from Segher Boessenkool --- With the modified compiler? Does it ICE with an unmodified compiler as well?
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #11 from Wilhelm M --- After testing some more code, I got this ICE: /home/lmeier/Projekte/wmucpp/boards/rcfoc/gimbal_sbus_01.cc: In static member function 'static void GlobalFsm::ratePeriodic() [with BusDevs =BusDevs > >]': /home/lmeier/Projekte/wmucpp/boards/rcfoc/gimbal_sbus_01.cc:426:5: error: unrecognizable insn: 426 | } | ^ (insn 1584 1583 25 3 (set (concatn:HI [ (reg:QI 800) (reg:QI 801 [+1 ]) ]) (reg:HI 826)) "../../include0/external/sbus/sbus.h":226:69 -1 (nil)) during RTL pass: subreg1 /home/lmeier/Projekte/wmucpp/boards/rcfoc/gimbal_sbus_01.cc:426:5: internal compiler error: in extract_insn, at recog.cc:2791 0x79f2cb _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) ../../gcc/rtl-error.cc:108 0x79f2e7 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*) ../../gcc/rtl-error.cc:116 0x79dc77 extract_insn(rtx_insn*) ../../gcc/recog.cc:2791 0x1a43d91 decompose_multiword_subregs ../../gcc/lower-subreg.cc:1651 0x1a447ca execute ../../gcc/lower-subreg.cc:1790 Please submit a full bug report, with preprocessed source (by using -freport-bug).
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #10 from Richard Biener --- (In reply to Roger Sayle from comment #7) > Created attachment 54843 [details] > proposed patch > > Proposed patch. Does this look reasonable? Yes, but doesn't this work for all GET_CODE (op) != ASHIFTRT?
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #9 from Segher Boessenkool --- That patch looks fine :-)
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #8 from Wilhelm M --- Yes. Looks like the patch does its job.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 Roger Sayle changed: What|Removed |Added CC||roger at nextmovesoftware dot com --- Comment #7 from Roger Sayle --- Created attachment 54843 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54843=edit proposed patch Proposed patch. Does this look reasonable?
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #6 from Wilhelm M --- The following "solution" constexpr uint16_t mul(const uint8_t a, const uint16_t b) { const auto aa = std::bit_cast>(b); return aa[1] * a; } or constexpr uint16_t mul(const uint8_t a, const uint16_t b) { uint8_t aa[2]; std::memcpy(aa, , 2); return aa[1] * a; } gives optimal result ;-)
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 --- Comment #5 from Segher Boessenkool --- Correct, this certainly can not be done by combine, it see two independent pseudos here. For hard registers it *can* do many tricks, but not for pseudos like this.
[Bug rtl-optimization/109476] Missing optimization for 8bit/8bit multiplication / regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109476 Richard Biener changed: What|Removed |Added Component|target |rtl-optimization Ever confirmed|0 |1 Last reconfirmed||2023-04-12 Status|UNCONFIRMED |NEW CC||iant at google dot com, ||law at gcc dot gnu.org, ||segher at gcc dot gnu.org --- Comment #4 from Richard Biener --- In the good case (with + 1) combine succeeds: Trying 9 -> 11: 9: r55:HI=zero_extend(r54:QI) REG_DEAD r54:QI 11: r52:HI=r55:HI*r56:HI REG_DEAD r56:HI REG_DEAD r55:HI Successfully matched this instruction: (set (reg:HI 52) (mult:HI (zero_extend:HI (reg:QI 54)) (reg:HI 56 [ a ]))) allowing combination of insns 9 and 11 original costs 4 + 28 = 32 replacement cost 20 deferring deletion of insn with uid = 9. modifying insn i311: r52:HI=zero_extend(r54:QI)*r56:HI REG_DEAD r54:QI REG_DEAD r56:HI deferring rescan insn with uid = 11. and then Trying 10 -> 11: 10: r56:HI=zero_extend(r61:QI) REG_DEAD r61:QI 11: r52:HI=zero_extend(r54:QI)*r56:HI REG_DEAD r54:QI REG_DEAD r56:HI Successfully matched this instruction: (set (reg:HI 52) (mult:HI (zero_extend:HI (reg:QI 54)) (zero_extend:HI (reg:QI 61 allowing combination of insns 10 and 11 original costs 4 + 20 = 24 replacement cost 12 deferring deletion of insn with uid = 10. modifying insn i311: r52:HI=zero_extend(r54:QI)*zero_extend(r61:QI) REG_DEAD r61:QI REG_DEAD r54:QI deferring rescan insn with uid = 11. in the bad case instead Trying 8 -> 9: 8: r52:HI=zero_extend(r55:QI) REG_DEAD r55:QI 9: r50:HI=r51:HI*r52:HI REG_DEAD r52:HI REG_DEAD r51:HI Successfully matched this instruction: (set (reg:HI 50) (mult:HI (zero_extend:HI (reg:QI 55)) (reg:HI 51 [ b+1 ]))) allowing combination of insns 8 and 9 original costs 4 + 28 = 32 replacement cost 20 deferring deletion of insn with uid = 8. modifying insn i3 9: r50:HI=zero_extend(r55:QI)*r51:HI REG_DEAD r55:QI REG_DEAD r51:HI deferring rescan insn with uid = 9. Trying 20 -> 9: 20: r51:HI#1=0 9: r50:HI=zero_extend(r55:QI)*r51:HI REG_DEAD r55:QI REG_DEAD r51:HI Can't combine i2 into i3 that's because the RTL into combine in the bad case is (insn 19 22 20 2 (set (subreg:QI (reg:HI 51 [ b+1 ]) 0) (reg:QI 54 [ b+1 ])) "t.ii":4:36 86 {movqi_insn_split} (expr_list:REG_DEAD (reg:QI 54 [ b+1 ]) (nil))) (insn 20 19 8 2 (set (subreg:QI (reg:HI 51 [ b+1 ]) 1) (const_int 0 [0])) "t.ii":4:36 86 {movqi_insn_split} (nil)) (insn 8 20 9 2 (set (reg:HI 52 [ a ]) (zero_extend:HI (reg/v:QI 48 [ a ]))) "t.ii":4:49 635 {zero_extendqihi2} (expr_list:REG_DEAD (reg/v:QI 48 [ a ]) (nil))) (insn 9 8 14 2 (set (reg:HI 50) (mult:HI (reg:HI 51 [ b+1 ]) (reg:HI 52 [ a ]))) "t.ii":4:47 328 {*mulhi3_enh_split} (expr_list:REG_DEAD (reg:HI 52 [ a ]) (expr_list:REG_DEAD (reg:HI 51 [ b+1 ]) (nil so the 'b' operand of the multiplication is now not a zero_extend:HI but instead a two instruction set. The first subreg pass produces this IL, turning (insn 3 2 4 2 (set (reg/v:HI 49 [ b ]) (reg:HI 22 r22 [ b ])) "t.ii":3:49 101 {*movhi_split} (nil)) (insn 7 4 8 2 (set (reg:HI 51) (lshiftrt:HI (reg/v:HI 49 [ b ]) (const_int 8 [0x8]))) "t.ii":4:36 557 {lshrhi3} (nil)) into (insn 17 2 18 2 (set (reg:QI 53 [ b ]) (reg:QI 22 r22 [ b ])) "t.ii":3:49 86 {movqi_insn_split} (nil)) (insn 18 17 4 2 (set (reg:QI 54 [ b+1 ]) (reg:QI 23 r23 [ b+1 ])) "t.ii":3:49 86 {movqi_insn_split} (nil)) (insn 19 4 20 2 (set (subreg:QI (reg:HI 51) 0) (reg:QI 54 [ b+1 ])) "t.ii":4:36 86 {movqi_insn_split} (nil)) (insn 20 19 8 2 (set (subreg:QI (reg:HI 51) 1) (const_int 0 [0])) "t.ii":4:36 86 {movqi_insn_split} (nil)) with -fno-split-wide-types the rotate doesn't get a zero_extend so the multiplication pattern doesn't match either. I think that possibly the lower subreg pass should more optimally handle the situation, creating (insn ... (set (zero_extend:HI (reg:QI 54 [ b + 1]))) here. I'm quite sure combine/forwprop cannot combine the seemingly unrelated subreg sets. resolve_shift_zext seems to be supposed to handle this and it receives (insn 7 4 8 2 (set (reg:HI 51) (lshiftrt:HI (concatn/v:HI [ (reg:QI 53 [ b ]) (reg:QI 54 [ b+1 ]) ]) (const_int 8 [0x8]))) "t.ii":4:36 557 {lshrhi3} (nil)) maybe for GET_CODE (op) != ASHIFTRT && offset1 == 0