[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-02-04 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #29 from Bill Schmidt  ---
Author: wschmidt
Date: Mon Feb  4 16:48:30 2019
New Revision: 268524

URL: https://gcc.gnu.org/viewcvs?rev=268524=gcc=rev
Log:
2019-02-04  Bill Schmidt  

PR target/87064
Backport from mainline

2019-01-30  Bill Schmidt  

PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar):
Disable for little-endian.

2019-01-22  Jakub Jelinek  

PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v2df_scalar):
Disable for little endian.


Modified:
branches/gcc-7-branch/gcc/ChangeLog
branches/gcc-7-branch/gcc/config/rs6000/vsx.md

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-02-04 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #28 from Bill Schmidt  ---
Author: wschmidt
Date: Mon Feb  4 16:47:12 2019
New Revision: 268523

URL: https://gcc.gnu.org/viewcvs?rev=268523=gcc=rev
Log:
2019-02-04  Bill Schmidt  

PR target/87064
Backport from mainline

2019-01-30  Bill Schmidt  

PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar):
Disable for little-endian.

2019-01-22  Jakub Jelinek  

PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v2df_scalar):
Disable for little endian.


Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/config/rs6000/vsx.md

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-30 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #27 from Bill Schmidt  ---
Author: wschmidt
Date: Wed Jan 30 20:52:08 2019
New Revision: 268403

URL: https://gcc.gnu.org/viewcvs?rev=268403=gcc=rev
Log:
2019-01-30  Bill Schmidt  

PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v4sf_scalar):
Disable for little-endian.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/vsx.md

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-24 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #26 from Bill Schmidt  ---
I believe it's also incorrect (the assumption on the value being in element 3
is a big-endian statement) but latent because this is really hard to match. 
I'll take an internal note to clean this up.  I will also look at backporting
these fixes to earlier releases to avoid future rediscovery.

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

Jakub Jelinek  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from Jakub Jelinek  ---
Testcase fixed, *vsx_reduc__v4sf_scalar hasn't been disabled
for little endian (is it incorrect too or just non-optimal)?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #24 from Jakub Jelinek  ---
Author: jakub
Date: Tue Jan 22 22:27:32 2019
New Revision: 268164

URL: https://gcc.gnu.org/viewcvs?rev=268164=gcc=rev
Log:
PR target/87064
* config/rs6000/vsx.md (*vsx_reduc__v2df_scalar):
Disable for little endian.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/vsx.md

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #23 from Jakub Jelinek  ---
Created attachment 45496
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45496=edit
gcc9-pr87064.patch

Patch I've so far tested on powerpc64le-linux only, where it fixed
FAIL: libgomp.oacc-fortran/reduction-3.f90 -DACC_DEVICE_TYPE_host=1
-DACC_MEM_SHARED=1  -O1  execution test
and didn't regress anything else.  I can bootstrap/regtest even on
powerpc64-linux (though I believe it is pointless, given that I know from the
earlier statistics gathering that the pattern is never used on powerpc64-linux
during bootstrap nor -m32/-m64 regtest).
So, I'll post to gcc-patches.  The v4sf_scalar I'll leave to you, ok?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-22 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #22 from Bill Schmidt  ---
(I'll test with both disabled for LE and report results.)

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-22 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #21 from Bill Schmidt  ---
We should probably disable the _v4sf_scalar one for LE also, as this seems to
be doing a similar trick for V4SF.

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-22 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #20 from Bill Schmidt  ---
Oh, sorry, I missed that in all the commentary.  I had looked at the code and
seen the "obvious" problem in the expansion, and noted you had suggested that
also.  Should have read further.

I think that's right, using this is wrong for LE.  Jakub, do you want to push
that patch, or shall I regstrap it once more and take care of it?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-21 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #19 from Segher Boessenkool  ---
The pattern makes no sense at all for LE.

If LE,

 (vec_concat:V2DF
  (vec_select:DF
   (match_operand:V2DF 1 "vfloat_operand" "wd,wa,wd,wa")
   (parallel [(const_int 1)]))
  (vec_select:DF
   (match_dup 1)
   (parallel [(const_int 0)])))

means exactly the same as just

 (match_operand:V2DF 1 "vfloat_operand" "wd,wa,wd,wa")


So how does this pattern ever match for LE anyway?  Are there some
serious missed simplifications?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-21 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #18 from Jakub Jelinek  ---
The comment on the define_insn_and_split says:
;; Combiner patterns with the vector reduction patterns that knows we can get
;; to the top element of the V2DF array without doing an extract.
So, the question is, is that ever the case for little endian also?
As I said,
--- gcc/config/rs6000/vsx.md2019-01-19 09:39:13.224924063 +0100
+++ gcc/config/rs6000/vsx.md2019-01-21 22:04:15.372792335 +0100
@@ -4351,7 +4351,7 @@
  (match_dup 1))
 (parallel [(const_int 1)])))
(clobber (match_scratch:DF 2 "=0,0,,"))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
+  "BYTES_BIG_ENDIAN && VECTOR_UNIT_VSX_P (V2DFmode)"
   "#"
   ""
   [(const_int 0)]
generates much better code at least on this testcase, and from
bootstrap/regtest I'm not away of this pattern ever trigging for something
different.

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-21 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #17 from Bill Schmidt  ---
Actually I *think* the *vsx_reduc__v4sf_scalar code is probably
okay.  This is all being done with insns that should leave the reduction result
in the right-hand element of the register (element 3 for BE, as is referenced
in the xxsldwi insn).

I'll regtest a patch with Jakub's second alternative from above, which matches
my understanding of the current flaw.

Bill

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-21 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #16 from Bill Schmidt  ---
(In reply to Jakub Jelinek from comment #13)
> So, both the following patches should fix it IMHO, but no idea which one if
> any is right.
> With
> --- gcc/config/rs6000/vsx.md.jj   2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md  2019-01-18 18:07:37.194899062 +0100
> @@ -4356,7 +4356,9 @@
>""
>[(const_int 0)]
>  {
> -  rtx hi = gen_highpart (DFmode, operands[1]);
> +  rtx hi = (BYTES_BIG_ENDIAN
> + ? gen_highpart (DFmode, operands[1])
> + : gen_lowpart (DFmode, operands[1]));
>rtx lo = (GET_CODE (operands[2]) == SCRATCH)
>   ? gen_reg_rtx (DFmode)
>   : operands[2];
> 
> the assembly changes:
> --- reduction-3.s12019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s22019-01-18 18:10:20.617233358 +0100
> @@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
>   addi 9,9,16
>   bdnz .L2
># vec_extract to same register
> - lfd 12,-8(1)
> + lfd 12,-16(1)
>   xsmaxdp 0,12,0
>   stfd 0,0(10)
>   blr
> with:
> --- gcc/config/rs6000/vsx.md.jj   2019-01-01 12:37:44.305529527 +0100
> +++ gcc/config/rs6000/vsx.md  2019-01-18 18:16:30.680186709 +0100
> @@ -4361,7 +4361,9 @@
>   ? gen_reg_rtx (DFmode)
>   : operands[2];
>  
> -  emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
> +  emit_insn (gen_vsx_extract_v2df (lo, operands[1],
> +BYTES_BIG_ENDIAN
> +? const1_rtx : const0_rtx));
>emit_insn (gen_df3 (operands[0], hi, lo));
>DONE;
>  }

This is what looks right to me.  This code all pre-dates little-endian support,
and I think we missed changing the element to be extracted in this spot.  There
is probably something wrong with _v4sf_scalar also -- the gen_vsx_xxsldwi_v4sf
probably needs to be adjusted also for little-endian, but I have a hard time
following this code and I'm not certain.

Bill

> the assembly changes:
> --- reduction-3.s12019-01-18 18:05:14.313229730 +0100
> +++ reduction-3.s32019-01-18 18:17:18.977397458 +0100
> @@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
>   xxpermdi 0,0,0,2
>   addi 9,9,16
>   bdnz .L2
> -  # vec_extract to same register
> + xxpermdi 0,0,0,3
>   lfd 12,-8(1)
>   xsmaxdp 0,12,0
>   stfd 0,0(10)
> 
> So just judging from this exact testcase, the first patch seems to be more
> efficient, though still unsure about that, because it goes through memory in
> either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that
> swaps the two elements instead of loading it from memory?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-18 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

Jakub Jelinek  changed:

   What|Removed |Added

   Priority|P3  |P1

--- Comment #15 from Jakub Jelinek  ---
I've gathered some statistics, the vsx_reduc__v2df
define_insn_and_split is never used during powerpc64-linux bootstrap and
-m32/-m64 testing and during powerpc64le-linux bootstrap and testing is only
triggered on the miscompiled reduction-3.f90.  That doesn't mean it can't be
useful for big endian, I just don't know how to trigger it to verify if it is
beneficial or not.

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-18 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #14 from Jakub Jelinek  ---
And, if I disable that define_insn_and_split altogether (add 0 && to the
condition), the assembly change is:
--- reduction-3.s2  2019-01-18 18:19:42.184057246 +0100
+++ reduction-3.s4  2019-01-18 18:26:23.079506011 +0100
@@ -9,26 +9,16 @@ MAIN__._omp_fn.0:
.cfi_startproc
ld 10,0(3)
lxvdsx 0,0,10
-   addi 9,1,-16
-   xxpermdi 0,0,0,2
-   stxvd2x 0,0,9
ld 9,8(3)
li 8,5
mtctr 8
 .L2:
-   lxvd2x 0,0,9
-   addi 8,1,-16
-   lxvd2x 12,0,8
-   xxpermdi 12,12,12,2
-   xvmaxdp 0,12,0
-   xxpermdi 0,0,0,2
-   stxvd2x 0,0,8
-   xxpermdi 0,0,0,2
+   lxvd2x 12,0,9
+   xvmaxdp 0,0,12
addi 9,9,16
bdnz .L2
-# vec_extract to same register
-   lfd 12,-16(1)
-   xsmaxdp 0,12,0
+   xxsldwi 12,0,0,2
+   xvmaxdp 0,12,0
stfd 0,0(10)
blr
.long 0

which looks much better.  So, what is the reason for this
define_insn_and_split?  Is it useful for BYTES_BIG_ENDIAN only perhaps?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-18 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #13 from Jakub Jelinek  ---
So, both the following patches should fix it IMHO, but no idea which one if any
is right.
With
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md2019-01-18 18:07:37.194899062 +0100
@@ -4356,7 +4356,9 @@
   ""
   [(const_int 0)]
 {
-  rtx hi = gen_highpart (DFmode, operands[1]);
+  rtx hi = (BYTES_BIG_ENDIAN
+   ? gen_highpart (DFmode, operands[1])
+   : gen_lowpart (DFmode, operands[1]));
   rtx lo = (GET_CODE (operands[2]) == SCRATCH)
? gen_reg_rtx (DFmode)
: operands[2];

the assembly changes:
--- reduction-3.s1  2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s2  2019-01-18 18:10:20.617233358 +0100
@@ -27,7 +27,7 @@ MAIN__._omp_fn.0:
addi 9,9,16
bdnz .L2
 # vec_extract to same register
-   lfd 12,-8(1)
+   lfd 12,-16(1)
xsmaxdp 0,12,0
stfd 0,0(10)
blr
with:
--- gcc/config/rs6000/vsx.md.jj 2019-01-01 12:37:44.305529527 +0100
+++ gcc/config/rs6000/vsx.md2019-01-18 18:16:30.680186709 +0100
@@ -4361,7 +4361,9 @@
? gen_reg_rtx (DFmode)
: operands[2];

-  emit_insn (gen_vsx_extract_v2df (lo, operands[1], const1_rtx));
+  emit_insn (gen_vsx_extract_v2df (lo, operands[1],
+  BYTES_BIG_ENDIAN
+  ? const1_rtx : const0_rtx));
   emit_insn (gen_df3 (operands[0], hi, lo));
   DONE;
 }
the assembly changes:
--- reduction-3.s1  2019-01-18 18:05:14.313229730 +0100
+++ reduction-3.s3  2019-01-18 18:17:18.977397458 +0100
@@ -26,7 +26,7 @@ MAIN__._omp_fn.0:
xxpermdi 0,0,0,2
addi 9,9,16
bdnz .L2
-# vec_extract to same register
+   xxpermdi 0,0,0,3
lfd 12,-8(1)
xsmaxdp 0,12,0
stfd 0,0(10)

So just judging from this exact testcase, the first patch seems to be more
efficient, though still unsure about that, because it goes through memory in
either case, wouldn't it be better to emit a xxpermdi from 0 to 12 that swaps
the two elements instead of loading it from memory?

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-18 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

--- Comment #12 from Segher Boessenkool  ---
Yes, I think so (just the vec_select arg?)

[Bug target/87064] [9 regression] libgomp.oacc-fortran/reduction-3.f90 fails starting with r263751

2019-01-18 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87064

Jakub Jelinek  changed:

   What|Removed |Added

 CC||dje at gcc dot gnu.org,
   ||meissner at gcc dot gnu.org,
   ||segher at gcc dot gnu.org
  Component|libgomp |target

--- Comment #11 from Jakub Jelinek  ---
Seems to be a powerpc64le backend bug or RA bug.
Reduced testcase for -fopenacc -O1:
program reduction_3
  implicit none
  integer, parameter:: n = 10, vl = 32
  integer   :: i
  double precision  :: vresult, rv
  double precision, parameter :: e = 0.001
  double precision, dimension (n) :: array
  do i = 1, n
 array(i) = i
  end do
  rv = 0
  vresult = 0
  !$acc parallel vector_length(vl) copy(rv)
  !$acc loop reduction(max:rv) vector
  do i = 1, n
 rv = max (rv, array(i))
  end do
  !$acc end parallel
  do i = 1, n
 vresult = max (vresult, array(i))
  end do
  if (abs (rv - vresult) .ge. e) STOP 11
end program reduction_3

In *.optimized it looks all correct:
   [local count: 437450368]:
  # vect_M.23_45 = PHI 
  # ivtmp.34_3 = PHI 
  _2 = (void *) ivtmp.34_3;
  vect__28.26_44 = MEM[base: _2, offset: 0B];
  vect_M.27_34 = MAX_EXPR ;
  ivtmp.34_4 = ivtmp.34_3 + 16;
  if (ivtmp.34_4 != _25)
goto ; [80.00%]
  else
goto ; [20.00%]

   [local count: 437450371]:
  stmp_M.28_8 = .REDUC_MAX (vect_M.27_34);
  *_10 = stmp_M.28_8;
and the loop indeed iterates properly and we end up with { 10.0, 9.0 } vector
which REDUC_MAX ifn should reduce to 10.0.
During early RTL opts it also looks correct:
(insn 20 19 21 4 (parallel [
(set (reg:V2DF 134)
(smax:V2DF (vec_concat:V2DF (vec_select:DF (reg:V2DF 128 [
vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 0 [0])
])))
(reg:V2DF 128 [ vect_M.23 ])))
(clobber (scratch:V2DF))
]) 1330 {vsx_reduc_smax_v2df}
 (nil))
(insn 21 20 22 4 (set (reg:DF 123 [ stmp_M.28 ])
(vec_select:DF (reg:V2DF 134)
(parallel [
(const_int 0 [0])
]))) 1219 {vsx_extract_v2df}
 (nil))
Then combine turns that into:
(insn 21 20 22 4 (parallel [
(set (reg:DF 123 [ stmp_M.28 ])
(vec_select:DF (smax:V2DF (vec_concat:V2DF (vec_select:DF
(reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 0 [0])
])))
(reg:V2DF 128 [ vect_M.23 ]))
(parallel [
(const_int 1 [0x1])
])))
(clobber (scratch:DF))
]) 1336 {*vsx_reduc_smax_v2df_scalar}
 (expr_list:REG_DEAD (reg:V2DF 128 [ vect_M.23 ])
(nil)))
That is then split into:
(insn 34 20 35 4 (set (reg:DF 137)
(vec_select:DF (reg:V2DF 128 [ vect_M.23 ])
(parallel [
(const_int 1 [0x1])
]))) -1
 (nil))
(insn 35 34 22 4 (set (reg:DF 123 [ stmp_M.28 ])
(smax:DF (subreg:DF (reg:V2DF 128 [ vect_M.23 ]) 8)
(reg:DF 137))) -1
 (nil))
at which point I'm already not sure if it is correct or not.  As I said, at
least
in the debugger it shows that the input to this .REDUC_MAX contains the value {
10, 9 }
is the vec_select extracting the second elt (i.e. 9.0) and (subreg 8) also the
second one?
In the end, that is what happens, the resulting assembly is:
   0x186c <+32>:lxvd2x  vs0,0,r9
   0x1870 <+36>:addir8,r1,-16
   0x1874 <+40>:lxvd2x  vs12,0,r8
   0x1878 <+44>:xxswapd vs12,vs12
   0x187c <+48>:xvmaxdp vs0,vs12,vs0
   0x1880 <+52>:xxswapd vs0,vs0
   0x1884 <+56>:stxvd2x vs0,0,r8
   0x1888 <+60>:xxswapd vs0,vs0
   0x188c <+64>:addir9,r9,16
   0x1890 <+68>:bdnz0x186c 
=> 0x1894 <+72>:lfd f12,-8(r1)
   0x1898 <+76>:xsmaxdp vs0,vs12,vs0
   0x189c <+80>:stfdf0,0(r10)
   0x18a0 <+84>:blr
and at that point
x/2fg $r1-16
0x3fffed90: 10  9
p $vs0.v2_double
$6 = {10, 9}
p $vs12.v2_double
$7 = {8, 7}
Now, the lfd loads into f12