[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #29 from Bill Schmidt --- Author: wschmidt Date: Mon Feb 4 16:48:30 2019 New Revision: 268524 URL: https://gcc.gnu.org/viewcvs?rev=268524=gcc=rev Log: 2019-02-04 Bill Schmidt PR target/87064 Backport from mainline 2019-01-30 Bill Schmidt PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar): Disable for little-endian. 2019-01-22 Jakub Jelinek PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v2df_scalar): Disable for little endian. Modified: branches/gcc-7-branch/gcc/ChangeLog branches/gcc-7-branch/gcc/config/rs6000/vsx.md
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #28 from Bill Schmidt --- Author: wschmidt Date: Mon Feb 4 16:47:12 2019 New Revision: 268523 URL: https://gcc.gnu.org/viewcvs?rev=268523=gcc=rev Log: 2019-02-04 Bill Schmidt PR target/87064 Backport from mainline 2019-01-30 Bill Schmidt PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar): Disable for little-endian. 2019-01-22 Jakub Jelinek PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v2df_scalar): Disable for little endian. Modified: branches/gcc-8-branch/gcc/ChangeLog branches/gcc-8-branch/gcc/config/rs6000/vsx.md
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #27 from Bill Schmidt --- Author: wschmidt Date: Wed Jan 30 20:52:08 2019 New Revision: 268403 URL: https://gcc.gnu.org/viewcvs?rev=268403=gcc=rev Log: 2019-01-30 Bill Schmidt PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar): Disable for little-endian. Modified: trunk/gcc/ChangeLog trunk/gcc/config/rs6000/vsx.md
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #26 from Bill Schmidt --- I believe it's also incorrect (the assumption on the value being in element 3 is a big-endian statement) but latent because this is really hard to match. I'll take an internal note to clean this up. I will also look at backporting these fixes to earlier releases to avoid future rediscovery.
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 Jakub Jelinek changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #25 from Jakub Jelinek --- Testcase fixed, *vsx_reduc__v4sf_scalar hasn't been disabled for little endian (is it incorrect too or just non-optimal)?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #24 from Jakub Jelinek --- Author: jakub Date: Tue Jan 22 22:27:32 2019 New Revision: 268164 URL: https://gcc.gnu.org/viewcvs?rev=268164=gcc=rev Log: PR target/87064 * config/rs6000/vsx.md (*vsx_reduc__v2df_scalar): Disable for little endian. Modified: trunk/gcc/ChangeLog trunk/gcc/config/rs6000/vsx.md
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #23 from Jakub Jelinek --- Created attachment 45496 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45496=edit gcc9-pr87064.patch Patch I've so far tested on powerpc64le-linux only, where it fixed FAIL: libgomp.oacc-fortran/reduction-3.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -O1 execution test and didn't regress anything else. I can bootstrap/regtest even on powerpc64-linux (though I believe it is pointless, given that I know from the earlier statistics gathering that the pattern is never used on powerpc64-linux during bootstrap nor -m32/-m64 regtest). So, I'll post to gcc-patches. The v4sf_scalar I'll leave to you, ok?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #22 from Bill Schmidt --- (I'll test with both disabled for LE and report results.)
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #21 from Bill Schmidt --- We should probably disable the _v4sf_scalar one for LE also, as this seems to be doing a similar trick for V4SF.
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #20 from Bill Schmidt --- Oh, sorry, I missed that in all the commentary. I had looked at the code and seen the "obvious" problem in the expansion, and noted you had suggested that also. Should have read further. I think that's right, using this is wrong for LE. Jakub, do you want to push that patch, or shall I regstrap it once more and take care of it?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #19 from Segher Boessenkool --- The pattern makes no sense at all for LE. If LE, (vec_concat:V2DF (vec_select:DF (match_operand:V2DF 1 "vfloat_operand" "wd,wa,wd,wa") (parallel [(const_int 1)])) (vec_select:DF (match_dup 1) (parallel [(const_int 0)]))) means exactly the same as just (match_operand:V2DF 1 "vfloat_operand" "wd,wa,wd,wa") So how does this pattern ever match for LE anyway? Are there some serious missed simplifications?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #18 from Jakub Jelinek --- The comment on the define_insn_and_split says: ;; Combiner patterns with the vector reduction patterns that knows we can get ;; to the top element of the V2DF array without doing an extract. So, the question is, is that ever the case for little endian also? As I said, --- gcc/config/rs6000/vsx.md2019-01-19 09:39:13.224924063 +0100 +++ gcc/config/rs6000/vsx.md2019-01-21 22:04:15.372792335 +0100 @@ -4351,7 +4351,7 @@ (match_dup 1)) (parallel [(const_int 1)]))) (clobber (match_scratch:DF 2 "=0,0,,"))] - "VECTOR_UNIT_VSX_P (V2DFmode)" + "BYTES_BIG_ENDIAN && VECTOR_UNIT_VSX_P (V2DFmode)" "#" "" [(const_int 0)] generates much better code at least on this testcase, and from bootstrap/regtest I'm not away of this pattern ever trigging for something different.
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #17 from Bill Schmidt --- Actually I *think* the *vsx_reduc__v4sf_scalar code is probably okay. This is all being done with insns that should leave the reduction result in the right-hand element of the register (element 3 for BE, as is referenced in the xxsldwi insn). I'll regtest a patch with Jakub's second alternative from above, which matches my understanding of the current flaw. Bill
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #16 from Bill Schmidt --- (In reply to Jakub Jelinek from comment #13) > So, both the following patches should fix it IMHO, but no idea which one if > any is right. > With > --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100 > +++ gcc/config/rs6000/vsx.md 2019-01-18 18:07:37.194899062 +0100 > @@ -4356,7 +4356,9 @@ >"" >[(const_int 0)] > { > - rtx hi = gen_highpart (DFmode, operands[1]); > + rtx hi = (BYTES_BIG_ENDIAN > + ? gen_highpart (DFmode, operands[1]) > + : gen_lowpart (DFmode, operands[1])); >rtx lo = (GET_CODE (operands[2]) == SCRATCH) > ? gen_reg_rtx (DFmode) > : operands[2]; > > the assembly changes: > --- reduction-3.s12019-01-18 18:05:14.313229730 +0100 > +++ reduction-3.s22019-01-18 18:10:20.617233358 +0100 > @@ -27,7 +27,7 @@ MAIN__._omp_fn.0: > addi 9,9,16 > bdnz .L2 ># vec_extract to same register > - lfd 12,-8(1) > + lfd 12,-16(1) > xsmaxdp 0,12,0 > stfd 0,0(10) > blr > with: > --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100 > +++ gcc/config/rs6000/vsx.md 2019-01-18 18:16:30.680186709 +0100 > @@ -4361,7 +4361,9 @@ > ? gen_reg_rtx (DFmode) > : operands[2]; > > - emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx)); > + emit_insn (gen_vsx_extract_v2df (lo, operands[1], > +BYTES_BIG_ENDIAN > +? const1_rtx : const0_rtx)); >emit_insn (gen_df3 (operands[0], hi, lo)); >DONE; > } This is what looks right to me. This code all pre-dates little-endian support, and I think we missed changing the element to be extracted in this spot. There is probably something wrong with _v4sf_scalar also -- the gen_vsx_xxsldwi_v4sf probably needs to be adjusted also for little-endian, but I have a hard time following this code and I'm not certain. Bill > the assembly changes: > --- reduction-3.s12019-01-18 18:05:14.313229730 +0100 > +++ reduction-3.s32019-01-18 18:17:18.977397458 +0100 > @@ -26,7 +26,7 @@ MAIN__._omp_fn.0: > xxpermdi 0,0,0,2 > addi 9,9,16 > bdnz .L2 > - # vec_extract to same register > + xxpermdi 0,0,0,3 > lfd 12,-8(1) > xsmaxdp 0,12,0 > stfd 0,0(10) > > So just judging from this exact testcase, the first patch seems to be more > efficient, though still unsure about that, because it goes through memory in > either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that > swaps the two elements instead of loading it from memory?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 Jakub Jelinek changed: What|Removed |Added Priority|P3 |P1 --- Comment #15 from Jakub Jelinek --- I've gathered some statistics, the vsx_reduc__v2df define_insn_and_split is never used during powerpc64-linux bootstrap and -m32/-m64 testing and during powerpc64le-linux bootstrap and testing is only triggered on the miscompiled reduction-3.f90. That doesn't mean it can't be useful for big endian, I just don't know how to trigger it to verify if it is beneficial or not.
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #14 from Jakub Jelinek --- And, if I disable that define_insn_and_split altogether (add 0 && to the condition), the assembly change is: --- reduction-3.s2 2019-01-18 18:19:42.184057246 +0100 +++ reduction-3.s4 2019-01-18 18:26:23.079506011 +0100 @@ -9,26 +9,16 @@ MAIN__._omp_fn.0: .cfi_startproc ld 10,0(3) lxvdsx 0,0,10 - addi 9,1,-16 - xxpermdi 0,0,0,2 - stxvd2x 0,0,9 ld 9,8(3) li 8,5 mtctr 8 .L2: - lxvd2x 0,0,9 - addi 8,1,-16 - lxvd2x 12,0,8 - xxpermdi 12,12,12,2 - xvmaxdp 0,12,0 - xxpermdi 0,0,0,2 - stxvd2x 0,0,8 - xxpermdi 0,0,0,2 + lxvd2x 12,0,9 + xvmaxdp 0,0,12 addi 9,9,16 bdnz .L2 -# vec_extract to same register - lfd 12,-16(1) - xsmaxdp 0,12,0 + xxsldwi 12,0,0,2 + xvmaxdp 0,12,0 stfd 0,0(10) blr .long 0 which looks much better. So, what is the reason for this define_insn_and_split? Is it useful for BYTES_BIG_ENDIAN only perhaps?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #13 from Jakub Jelinek --- So, both the following patches should fix it IMHO, but no idea which one if any is right. With --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100 +++ gcc/config/rs6000/vsx.md2019-01-18 18:07:37.194899062 +0100 @@ -4356,7 +4356,9 @@ "" [(const_int 0)] { - rtx hi = gen_highpart (DFmode, operands[1]); + rtx hi = (BYTES_BIG_ENDIAN + ? gen_highpart (DFmode, operands[1]) + : gen_lowpart (DFmode, operands[1])); rtx lo = (GET_CODE (operands[2]) == SCRATCH) ? gen_reg_rtx (DFmode) : operands[2]; the assembly changes: --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100 +++ reduction-3.s2 2019-01-18 18:10:20.617233358 +0100 @@ -27,7 +27,7 @@ MAIN__._omp_fn.0: addi 9,9,16 bdnz .L2 # vec_extract to same register - lfd 12,-8(1) + lfd 12,-16(1) xsmaxdp 0,12,0 stfd 0,0(10) blr with: --- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100 +++ gcc/config/rs6000/vsx.md2019-01-18 18:16:30.680186709 +0100 @@ -4361,7 +4361,9 @@ ? gen_reg_rtx (DFmode) : operands[2]; - emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx)); + emit_insn (gen_vsx_extract_v2df (lo, operands[1], + BYTES_BIG_ENDIAN + ? const1_rtx : const0_rtx)); emit_insn (gen_df3 (operands[0], hi, lo)); DONE; } the assembly changes: --- reduction-3.s1 2019-01-18 18:05:14.313229730 +0100 +++ reduction-3.s3 2019-01-18 18:17:18.977397458 +0100 @@ -26,7 +26,7 @@ MAIN__._omp_fn.0: xxpermdi 0,0,0,2 addi 9,9,16 bdnz .L2 -# vec_extract to same register + xxpermdi 0,0,0,3 lfd 12,-8(1) xsmaxdp 0,12,0 stfd 0,0(10) So just judging from this exact testcase, the first patch seems to be more efficient, though still unsure about that, because it goes through memory in either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that swaps the two elements instead of loading it from memory?
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 --- Comment #12 from Segher Boessenkool --- Yes, I think so (just the vec_select arg?)
[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064 Jakub Jelinek changed: What|Removed |Added CC||dje at gcc dot gnu.org, ||meissner at gcc dot gnu.org, ||segher at gcc dot gnu.org Component|libgomp |target --- Comment #11 from Jakub Jelinek --- Seems to be a powerpc64le backend bug or RA bug. Reduced testcase for -fopenacc -O1: program reduction_3 implicit none integer, parameter:: n = 10, vl = 32 integer :: i double precision :: vresult, rv double precision, parameter :: e = 0.001 double precision, dimension (n) :: array do i = 1, n array(i) = i end do rv = 0 vresult = 0 !$acc parallel vector_length(vl) copy(rv) !$acc loop reduction(max:rv) vector do i = 1, n rv = max (rv, array(i)) end do !$acc end parallel do i = 1, n vresult = max (vresult, array(i)) end do if (abs (rv - vresult) .ge. e) STOP 11 end program reduction_3 In *.optimized it looks all correct: [local count: 437450368]: # vect_M.23_45 = PHI # ivtmp.34_3 = PHI _2 = (void *) ivtmp.34_3; vect__28.26_44 = MEM[base: _2, offset: 0B]; vect_M.27_34 = MAX_EXPR ; ivtmp.34_4 = ivtmp.34_3 + 16; if (ivtmp.34_4 != _25) goto ; [80.00%] else goto ; [20.00%] [local count: 437450371]: stmp_M.28_8 = .REDUC_MAX (vect_M.27_34); *_10 = stmp_M.28_8; and the loop indeed iterates properly and we end up with { 10.0, 9.0 } vector which REDUC_MAX ifn should reduce to 10.0. During early RTL opts it also looks correct: (insn 20 19 21 4 (parallel [ (set (reg:V2DF 134) (smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ]) (parallel [ (const_int 1 [0x1]) ])) (vec_select:DF (reg:V2DF 128 [ vect_M.23 ]) (parallel [ (const_int 0 [0]) ]))) (reg:V2DF 128 [ vect_M.23 ]))) (clobber (scratch:V2DF)) ]) 1330 {vsx_reduc_smax_v2df} (nil)) (insn 21 20 22 4 (set (reg:DF 123 [ stmp_M.28 ]) (vec_select:DF (reg:V2DF 134) (parallel [ (const_int 0 [0]) ]))) 1219 {vsx_extract_v2df} (nil)) Then combine turns that into: (insn 21 20 22 4 (parallel [ (set (reg:DF 123 [ stmp_M.28 ]) (vec_select:DF (smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [ vect_M.23 ]) (parallel [ (const_int 1 [0x1]) ])) (vec_select:DF (reg:V2DF 128 [ vect_M.23 ]) (parallel [ (const_int 0 [0]) ]))) (reg:V2DF 128 [ vect_M.23 ])) (parallel [ (const_int 1 [0x1]) ]))) (clobber (scratch:DF)) ]) 1336 {*vsx_reduc_smax_v2df_scalar} (expr_list:REG_DEAD (reg:V2DF 128 [ vect_M.23 ]) (nil))) That is then split into: (insn 34 20 35 4 (set (reg:DF 137) (vec_select:DF (reg:V2DF 128 [ vect_M.23 ]) (parallel [ (const_int 1 [0x1]) ]))) -1 (nil)) (insn 35 34 22 4 (set (reg:DF 123 [ stmp_M.28 ]) (smax:DF (subreg:DF (reg:V2DF 128 [ vect_M.23 ]) 8) (reg:DF 137))) -1 (nil)) at which point I'm already not sure if it is correct or not. As I said, at least in the debugger it shows that the input to this .REDUC_MAX contains the value { 10, 9 } is the vec_select extracting the second elt (i.e. 9.0) and (subreg 8) also the second one? In the end, that is what happens, the resulting assembly is: 0x186c <+32>:lxvd2x vs0,0,r9 0x1870 <+36>:addir8,r1,-16 0x1874 <+40>:lxvd2x vs12,0,r8 0x1878 <+44>:xxswapd vs12,vs12 0x187c <+48>:xvmaxdp vs0,vs12,vs0 0x1880 <+52>:xxswapd vs0,vs0 0x1884 <+56>:stxvd2x vs0,0,r8 0x1888 <+60>:xxswapd vs0,vs0 0x188c <+64>:addir9,r9,16 0x1890 <+68>:bdnz0x186c => 0x1894 <+72>:lfd f12,-8(r1) 0x1898 <+76>:xsmaxdp vs0,vs12,vs0 0x189c <+80>:stfdf0,0(r10) 0x18a0 <+84>:blr and at that point x/2fg $r1-16 0x3fffed90: 10 9 p $vs0.v2_double $6 = {10, 9} p $vs12.v2_double $7 = {8, 7} Now, the lfd loads into f12