[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 Jiu Fu Guo changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #12 from Jiu Fu Guo --- The generated code is a lot better on the trunk (both -O2 and -O3): .cfi_startproc dcmpuq 0,4,6 beq 0,.L4 blr .p2align 4,,15 .L4: fmr 5,7 fmr 4,6 fmr 3,7 fmr 2,6 blr And the expanded RTL like below, and cse/fwprop/dse optimize it as expected. 2: r119:TD=%2:TD 3: r120:TD=%4:TD 4: r121:TD=%6:TD 5: NOTE_INSN_FUNCTION_BEG 10: r122:CCFP=cmp(r120:TD,r121:TD) 11: pc={(r122:CCFP!=0)?L13:pc} REG_BR_PROB 708669604 12: NOTE_INSN_BASIC_BLOCK 4 6: r120:TD=r121:TD 7: r119:TD=r121:TD 13: L13: 14: NOTE_INSN_BASIC_BLOCK 5 15: r123:DI=r112:DI+0x10 16: [r123:DI]=r120:TD 17: [r112:DI]=r119:TD 18: r124:TD=[r112:DI] 19: r126:DI=r112:DI+0x10 20: r125:TD=[r126:DI] 21: r117:TD=r124:TD 22: r118:TD=r125:TD 26: %2:TD=r117:TD 27: %4:TD=r118:TD 28: use %2:TD 29: use %4:TD
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #11 from Segher Boessenkool --- Please open a separate bug for x86 problems.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 Michael Clark changed: What|Removed |Added CC||michaeljclark at mac dot com --- Comment #10 from Michael Clark --- another data point. I am seeing something similar on x86-64. SysV x86-64 ABI specifies that _Decimal128 is to be passed in xmm regs so I believe the stack stores here are redundant. ; cat > dec1.c << EOF _Decimal128 add_d(_Decimal128 a, _Decimal128 b) { return a + b; } EOF ; gcc -O2 -S -masm=intel dec1.c ; cat dec1.s add_d: .LFB0: .cfi_startproc endbr64 sub rsp, 40 .cfi_def_cfa_offset 48 movaps XMMWORD PTR [rsp], xmm0 movaps XMMWORD PTR 16[rsp], xmm1 call__bid_addtd3@PLT movaps XMMWORD PTR [rsp], xmm0 add rsp, 40 .cfi_def_cfa_offset 8 ret .cfi_endproc
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #9 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #8) > I see no conversion there? > > But, why does it it store to memory at all? Yes, no conversion for this case, only adjust_address to TImode. mem/c:TD means a MEM cannot trap. Reason of store to memory: D.2914 is a local struct variable here, seems we need do some optimization to sink the D.2914.td0 and D.2914.td1 from BB3 to BB5 to avoid store/load on stack? Or if there already exists some pass in Gimple? Or should this be optimized after expander by some new pass like store sink? O2/pr70053.c.234t.optimized: D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c) { struct TDx2_t D.2914; [local count: 1073741824]: if (b_4(D) == c_5(D)) goto ; [34.00%] else goto ; [66.00%] [local count: 365072224]: D.2914.td0 = c_5(D); D.2914.td1 = c_5(D); goto ; [100.00%] [local count: 708669601]: D.2914.td0 = a_3(D); D.2914.td1 = b_4(D); [local count: 1073741824]: return D.2914; }
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #8 from Segher Boessenkool --- I see no conversion there? But, why does it it store to memory at all?
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||segher at gcc dot gnu.org --- Comment #7 from luoxhu at gcc dot gnu.org --- When expanding "D.2914.td0 = c_5(D);" in expand_assignment (to=, from=, nontemporal=false) at ../../gcc-master/gcc/expr.c:5058 1) expr.c:5158: to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE); gdb pr to_rtx (mem/c:BLK (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S32 A128]) ... 2) expr.c:5167: to_rtx = adjust_address (to_rtx, mode1, 0); p mode1 $86 = E_TDmode (gdb) pr to_rtx (mem/c:TD (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S16 A128]) to_rtx is generated with address conversion from DImode to TDmode here. ... 3) expr.c:5374: result = store_field (to_rtx, bitsize, bitpos,bitregion_start, bitregion_end, mode1, from, get_alias_set (to), nontemporal, reversep); then the assignment instruction is generated as below: (insn 11 10 12 4 (set (mem/c:TD (reg/f:DI 112 virtual-stack-vars) [1 D.2914.td0+0 S16 A128]) (reg/v:TD 121 [ c ])) "pr70053.c":20:14 -1 (nil)) So if we need remove the redundant store/load in expand, the conversion from DImode to TDmode should be avoided for this case when using virtual-stack-vars registers. (For PR65421, there are similar DImode to DFmode conversion). pr70053.c.236r.expand with -O2: 1: NOTE_INSN_DELETED 6: NOTE_INSN_BASIC_BLOCK 2 2: r119:TD=%2:TD 3: r120:TD=%4:TD 4: r121:TD=%6:TD 5: NOTE_INSN_FUNCTION_BEG 8: r122:CCFP=cmp(r120:TD,r121:TD) 9: pc={(r122:CCFP!=0)?L16:pc} REG_BR_PROB 708669604 10: NOTE_INSN_BASIC_BLOCK 4 11: [r112:DI]=r121:TD 12: r123:DI=r112:DI+0x10 13: [r123:DI]=r121:TD 14: pc=L21 15: barrier 16: L16: 17: NOTE_INSN_BASIC_BLOCK 5 18: [r112:DI]=r119:TD 19: r124:DI=r112:DI+0x10 20: [r124:DI]=r120:TD 21: L21: 22: NOTE_INSN_BASIC_BLOCK 6 23: r125:TD=[r112:DI] 24: r127:DI=r112:DI+0x10 25: r126:TD=[r127:DI] 26: r117:TD=r125:TD 27: r118:TD=r126:TD 31: %2:TD=r117:TD 32: %4:TD=r118:TD 33: use %2:TD 34: use %4:TD
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #6 from luoxhu at gcc dot gnu.org --- "-O2 -ftree-slp-vectorize" could also generate the expected simple fmrs. Reason is pass_cselim will transform conditional stores into unconditional ones with PHI instructions when vectorization and if-conversion is enabled(gcc/tree-ssa-phiopt.c:2482). pr70053.c.108t.cdce: D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c) { struct TDx2_t D.2914; [local count: 1073741824]: if (b_4(D) == c_5(D)) goto ; [34.00%] else goto ; [66.00%] [local count: 365072224]: D.2914.td0 = c_5(D); D.2914.td1 = c_5(D); goto ; [100.00%] [local count: 708669601]: D.2914.td0 = a_3(D); D.2914.td1 = b_4(D); [local count: 1073741824]: return D.2914; } => pr70053.c.109t.cselim: D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c) { struct TDx2_t D.2914; _Decimal128 cstore_10; _Decimal128 cstore_11; [local count: 1073741824]: if (b_4(D) == c_5(D)) goto ; [34.00%] else goto ; [66.00%] [local count: 708669601]: [local count: 1073741824]: # cstore_10 = PHI # cstore_11 = PHI D.2914.td1 = cstore_11; D.2914.td0 = cstore_10; return D.2914; } Then at expand pass, the PHI instruction "cstore_10 = PHI " will be expanded to move for "-O2 -ftree-slp-vectorize". If no such PHI generated, bb3 and bb4 in pr70053.c.108t.cdce will be expanded to STORE/LOAD with TD->DI conversion, causing a lot st/ld conversion finally.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 Bill Schmidt changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2020-05-22 --- Comment #5 from Bill Schmidt --- I'd like to understand why the difference between -O2 and -O3 exists. We shouldn't generate this kind of nasty store-load at -O2. Confirmed, BTW. :)
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||bergner at gcc dot gnu.org, ||luoxhu at gcc dot gnu.org --- Comment #4 from luoxhu at gcc dot gnu.org --- Is this just the difference of O3 and O2? Since O3 is OK, maybe this bug is not effective? $ /opt/at10.0/bin/gcc -O3 -S pr70053.c $ cat pr70053.s .file "pr70053.c" .abiversion 2 .section".text" .align 2 .p2align 4,,15 .globl D256_add_finite .type D256_add_finite, @function D256_add_finite: dcmpuq 7,4,6 beq 7,.L3 fmr 7,3 fmr 6,2 fmr 3,7 fmr 2,6 blr .p2align 4,,15 .L3: fmr 5,7 fmr 4,6 fmr 3,7 fmr 2,6 blr .long 0 .byte 0,0,0,0,0,0,0,0 .size D256_add_finite,.-D256_add_finite .ident "GCC: (GNU) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10 branch, based on subversion id 250395." .section.note.GNU-stack,"",@progbits
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #3 from Andrew Pinski --- Related to bug 30271, bug 38532, bug 69493. There might be more bugs too.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 --- Comment #2 from Peter Bergner --- One interesting point is if we delete the "return result;" that is within the then clause and fall thru to the end "return result;" with is logically identical, then we get the code we want: D256_add_finite: dcmpuq 7,4,6 bnelr 7 fmr 5,7 fmr 4,6 fmr 3,7 fmr 2,6 blr
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053 Peter Bergner changed: What|Removed |Added Target||powerpc64le-linux CC||dje at gcc dot gnu.org, ||meissner at gcc dot gnu.org, ||munroesj at us dot ibm.com, ||wschmidt at gcc dot gnu.org Known to fail||4.9.4, 5.3.1 --- Comment #1 from Peter Bergner --- A quick check seems to show this happens on older compiler versions too.