[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2024-01-14 Thread LpSolit at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Jiu Fu Guo  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Jiu Fu Guo  ---
The generated code is a lot better on the trunk (both -O2 and -O3):
.cfi_startproc
dcmpuq 0,4,6
beq 0,.L4
blr
.p2align 4,,15
.L4:
fmr 5,7
fmr 4,6
fmr 3,7
fmr 2,6
blr


And the expanded RTL like below, and cse/fwprop/dse optimize it as expected. 
2: r119:TD=%2:TD
3: r120:TD=%4:TD
4: r121:TD=%6:TD
5: NOTE_INSN_FUNCTION_BEG
   10: r122:CCFP=cmp(r120:TD,r121:TD)
   11: pc={(r122:CCFP!=0)?L13:pc}
  REG_BR_PROB 708669604
   12: NOTE_INSN_BASIC_BLOCK 4
6: r120:TD=r121:TD
7: r119:TD=r121:TD
   13: L13:
   14: NOTE_INSN_BASIC_BLOCK 5
   15: r123:DI=r112:DI+0x10
   16: [r123:DI]=r120:TD
   17: [r112:DI]=r119:TD
   18: r124:TD=[r112:DI]
   19: r126:DI=r112:DI+0x10
   20: r125:TD=[r126:DI]
   21: r117:TD=r124:TD
   22: r118:TD=r125:TD
   26: %2:TD=r117:TD
   27: %4:TD=r118:TD
   28: use %2:TD
   29: use %4:TD

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2021-02-01 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #11 from Segher Boessenkool  ---
Please open a separate bug for x86 problems.

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2021-01-30 Thread michaeljclark at mac dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Michael Clark  changed:

   What|Removed |Added

 CC||michaeljclark at mac dot com

--- Comment #10 from Michael Clark  ---
another data point. I am seeing something similar on x86-64.
SysV x86-64 ABI specifies that _Decimal128 is to be passed in
xmm regs so I believe the stack stores here are redundant.

; cat > dec1.c << EOF
_Decimal128 add_d(_Decimal128 a, _Decimal128 b) { return a + b; }
EOF
; gcc -O2 -S -masm=intel dec1.c 
; cat dec1.s
add_d:
.LFB0:
.cfi_startproc
endbr64
sub rsp, 40
.cfi_def_cfa_offset 48
movaps  XMMWORD PTR [rsp], xmm0
movaps  XMMWORD PTR 16[rsp], xmm1
call__bid_addtd3@PLT
movaps  XMMWORD PTR [rsp], xmm0
add rsp, 40
.cfi_def_cfa_offset 8
ret
.cfi_endproc

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-31 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> I see no conversion there?
> 
> But, why does it it store to memory at all?

Yes, no conversion for this case, only adjust_address to TImode. mem/c:TD means
a MEM cannot trap.

Reason of store to memory:
D.2914 is a local struct variable here, seems we need do some optimization to
sink the D.2914.td0 and D.2914.td1 from BB3 to BB5 to avoid store/load on
stack? Or if there already exists some pass in Gimple? Or should this be
optimized after expander by some new pass like store sink?

O2/pr70053.c.234t.optimized:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

   [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

   [local count: 1073741824]:
  return D.2914;

}

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-25 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #8 from Segher Boessenkool  ---
I see no conversion there?

But, why does it it store to memory at all?

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-25 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
When expanding "D.2914.td0 = c_5(D);" in expand_assignment (to=, from=, nontemporal=false) at
../../gcc-master/gcc/expr.c:5058

1) expr.c:5158:   to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

gdb pr to_rtx
(mem/c:BLK (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S32 A128])

...

2) expr.c:5167:   to_rtx = adjust_address (to_rtx, mode1, 0);

p mode1
$86 = E_TDmode
(gdb) pr to_rtx
(mem/c:TD (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S16 A128])

to_rtx is generated with address conversion from DImode to TDmode here.

...

3) expr.c:5374:   result = store_field (to_rtx, bitsize,
bitpos,bitregion_start, bitregion_end, mode1, from, get_alias_set (to),
nontemporal, reversep);

then the assignment instruction is generated as below:

(insn 11 10 12 4 (set (mem/c:TD (reg/f:DI 112 virtual-stack-vars) [1
D.2914.td0+0 S16 A128]) (reg/v:TD 121 [ c ])) "pr70053.c":20:14 -1 (nil))


So if we need remove the redundant store/load in expand, the conversion from
DImode to TDmode should be avoided for this case when using virtual-stack-vars
registers. (For PR65421, there are similar DImode to DFmode conversion). 

pr70053.c.236r.expand with -O2:
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r119:TD=%2:TD
3: r120:TD=%4:TD
4: r121:TD=%6:TD
5: NOTE_INSN_FUNCTION_BEG
8: r122:CCFP=cmp(r120:TD,r121:TD)
9: pc={(r122:CCFP!=0)?L16:pc}
  REG_BR_PROB 708669604
   10: NOTE_INSN_BASIC_BLOCK 4
   11: [r112:DI]=r121:TD
   12: r123:DI=r112:DI+0x10
   13: [r123:DI]=r121:TD
   14: pc=L21
   15: barrier
   16: L16:
   17: NOTE_INSN_BASIC_BLOCK 5
   18: [r112:DI]=r119:TD
   19: r124:DI=r112:DI+0x10
   20: [r124:DI]=r120:TD
   21: L21:
   22: NOTE_INSN_BASIC_BLOCK 6
   23: r125:TD=[r112:DI]
   24: r127:DI=r112:DI+0x10
   25: r126:TD=[r127:DI]
   26: r117:TD=r125:TD
   27: r118:TD=r126:TD
   31: %2:TD=r117:TD
   32: %4:TD=r118:TD
   33: use %2:TD
   34: use %4:TD

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-25 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #6 from luoxhu at gcc dot gnu.org ---
"-O2 -ftree-slp-vectorize" could also generate the expected simple fmrs.

Reason is pass_cselim will transform conditional stores into unconditional ones
with PHI instructions when vectorization and if-conversion is
enabled(gcc/tree-ssa-phiopt.c:2482).

pr70053.c.108t.cdce:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

   [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

   [local count: 1073741824]:
  return D.2914;

}

=> pr70053.c.109t.cselim:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;
  _Decimal128 cstore_10;
  _Decimal128 cstore_11;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 708669601]:

   [local count: 1073741824]:
  # cstore_10 = PHI 
  # cstore_11 = PHI 
  D.2914.td1 = cstore_11;
  D.2914.td0 = cstore_10;
  return D.2914;

}

Then at expand pass, the PHI instruction "cstore_10 = PHI " will be expanded to move for "-O2 -ftree-slp-vectorize". If no such
PHI generated, bb3 and bb4 in pr70053.c.108t.cdce will be expanded to
STORE/LOAD with TD->DI conversion, causing a lot st/ld conversion finally.

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-22 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Bill Schmidt  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2020-05-22

--- Comment #5 from Bill Schmidt  ---
I'd like to understand why the difference between -O2 and -O3 exists.  We
shouldn't generate this kind of nasty store-load at -O2.

Confirmed, BTW. :)

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-20 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||bergner at gcc dot gnu.org,
   ||luoxhu at gcc dot gnu.org

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Is this just the difference of O3 and O2? Since O3 is OK, maybe this bug is not
effective?

$ /opt/at10.0/bin/gcc -O3 -S pr70053.c
$ cat pr70053.s
.file   "pr70053.c"
.abiversion 2
.section".text"
.align 2
.p2align 4,,15
.globl D256_add_finite
.type   D256_add_finite, @function
D256_add_finite:
dcmpuq 7,4,6
beq 7,.L3
fmr 7,3
fmr 6,2
fmr 3,7
fmr 2,6
blr
.p2align 4,,15
.L3:
fmr 5,7
fmr 4,6
fmr 3,7
fmr 2,6
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.size   D256_add_finite,.-D256_add_finite
.ident  "GCC: (GNU) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10
branch, based on subversion id 250395."
.section.note.GNU-stack,"",@progbits

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2016-03-02 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #3 from Andrew Pinski  ---
Related to bug 30271, bug 38532, bug 69493.  There might be more bugs too.

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2016-03-02 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #2 from Peter Bergner  ---
One interesting point is if we delete the "return result;" that is within the
then clause and fall thru to the end "return result;" with is logically
identical, then we get the code we want:

D256_add_finite:
dcmpuq 7,4,6
bnelr 7
fmr 5,7
fmr 4,6
fmr 3,7
fmr 2,6
blr

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2016-03-02 Thread bergner at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

Peter Bergner  changed:

   What|Removed |Added

 Target||powerpc64le-linux
 CC||dje at gcc dot gnu.org,
   ||meissner at gcc dot gnu.org,
   ||munroesj at us dot ibm.com,
   ||wschmidt at gcc dot gnu.org
  Known to fail||4.9.4, 5.3.1

--- Comment #1 from Peter Bergner  ---
A quick check seems to show this happens on older compiler versions too.