[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2022-05-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Richard Biener changed:

               What|Removed |Added
----------------------------------
   Target Milestone|9.5     |---

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2021-06-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Richard Biener changed:

               What|Removed |Added
----------------------------------
   Target Milestone|9.4     |9.5

--- Comment #12 from Richard Biener  ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2020-03-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

               What|Removed |Added
----------------------------------
   Target Milestone|9.3     |9.4

--- Comment #11 from Jakub Jelinek  ---
GCC 9.3.0 has been released, adjusting target milestone.

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2019-08-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

               What|Removed |Added
----------------------------------
   Target Milestone|9.2     |9.3

--- Comment #10 from Jakub Jelinek  ---
GCC 9.2 has been released.

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2019-05-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

               What|Removed |Added
----------------------------------
   Target Milestone|9.0     |9.2

--- Comment #9 from Jakub Jelinek  ---
GCC 9.1 has been released.

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-12 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #8 from Bill Schmidt  ---
Looks like Peter was able to help you on the binutils forum over the weekend. 
Thanks, Peter!

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-10 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #7 from Jeffrey Walton  ---
(In reply to Bill Schmidt from comment #4)
> ...
> 
> The best performance will be achieved by writing this loop entirely using
> inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
> swaps), thus in "big-endian element order" (element 0 in the high-order
> position of the register).  Because of the big-endian nature of vshasigmaw,
> this is always going to be the best approach.

Thanks Bill.

We are working on your lxvd2x suggestion using inline assembly.

Related, see "GCC vec_xl_be replacement using inline assembly",
https://stackoverflow.com/q/49215090/608639.

-

I'm not sure if I am doing something wrong, or this is a new issue:

$ cat test.cxx
...

typedef __vector unsigned int  uint32x4_p8;

uint32x4_p8 VEC_XL_BE(const uint8_t* data, int offset)
{
#if defined(__xlc__) || defined(__xlC__)
  return (uint32x4_p8)vec_xl_be(offset, (uint8_t*)data);
#else
  uint32x4_p8 res;
  __asm(" lxvd2x  %x0, %1, %2\n\t"
        : "=wa" (res)
        : "g" (data), "g" (offset));
  return res;

#endif
}

When I use VEC_XL_BE in real life it results in:

$ g++ -DTEST_MAIN -g3 -O3 -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe
/home/noloader/tmp/ccbDnfFr.s: Assembler messages:
/home/noloader/tmp/ccbDnfFr.s:758: Error: operand out of range (32 is not between 0 and 31)
/home/noloader/tmp/ccbDnfFr.s:983: Error: operand out of range (48 is not between 0 and 31)

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-08 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

               What|Removed |Added
----------------------------------
   Target Milestone|---     |9.0

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-08 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

               What|Removed     |Added
-------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2018-03-08
     Ever confirmed|0           |1

--- Comment #6 from Bill Schmidt  ---
We should keep this issue open for a different problem, though.  When swap
optimization doesn't succeed on such a loop, we don't end up optimizing away
the xxlnand associated with the vperm on Power8.  Currently that is only done
as part of swap optimization.  It's a minor savings but worth doing.
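
For readers wondering where that xxlnand comes from: on little-endian Power8 a permute expressed in little-endian element numbering can be implemented with the big-endian vperm instruction by swapping the two inputs and complementing the selector (xxlnand x,x,x computes ~x, as at 18d4 in the disassembly in comment #3). The following is only a scalar C++ model of that identity, not GCC source; the names Bytes, vperm_be, perm_le_reference and perm_le_via_vperm are made up for illustration. If the selector is a known constant, the complement could in principle be precomputed rather than executed, which is the saving described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16 bytes held in big-endian register order: index 0 is the high-order byte.
using Bytes = std::array<std::uint8_t, 16>;

// vperm as the instruction defines it (big-endian byte numbering):
// result byte i is byte (c[i] & 31) of the 32-byte concatenation a || b.
Bytes vperm_be(const Bytes& a, const Bytes& b, const Bytes& c) {
    Bytes r{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::size_t idx = c[i] & 0x1F;
        r[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return r;
}

// Reference permute in little-endian element numbering.
// Little-endian element j of a register is big-endian byte 15 - j.
Bytes perm_le_reference(const Bytes& a, const Bytes& b, const Bytes& c) {
    Bytes r{};
    for (std::size_t j = 0; j < 16; ++j) {
        std::size_t m = c[15 - j] & 0x1F;            // selector element j
        r[15 - j] = m < 16 ? a[15 - m] : b[31 - m];  // element m of a || b
    }
    return r;
}

// The same permute via vperm: swap the inputs and complement the selector.
Bytes perm_le_via_vperm(const Bytes& a, const Bytes& b, const Bytes& c) {
    Bytes nc{};
    for (std::size_t i = 0; i < 16; ++i)
        nc[i] = static_cast<std::uint8_t>(~c[i]);    // models xxlnand c,c,c
    return vperm_be(b, a, nc);
}

// Small self-check on one arbitrary input pattern.
void check_le_permute() {
    Bytes a{}, b{}, c{};
    for (std::size_t i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint8_t>(i);
        b[i] = static_cast<std::uint8_t>(0x80 + i);
        c[i] = static_cast<std::uint8_t>(i * 5 + 2);
    }
    assert(perm_le_reference(a, b, c) == perm_le_via_vperm(a, b, c));
}
```

Calling check_le_permute() exercises the identity on one pattern; it holds for any selector because (~c) & 31 equals 31 - (c & 31).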

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-07 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #5 from Bill Schmidt  ---
s/this loop/this function

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-07 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

               What|Removed  |Added
-----------------------------------------
             Status|RESOLVED |UNCONFIRMED
         Resolution|INVALID  |---

--- Comment #4 from Bill Schmidt  ---
OK, I see.  We optimize swapped vperm in most cases as part of a general
swap-optimization algorithm.  However, this algorithm is defeated when there is
a mix of loads/stores accompanied by swaps and loads/stores that are not
accompanied by swaps.  The "big-endian" loads that are used with vshasigmaw and
friends are the problem here.  (This problem goes away with Power9, but that
doesn't help you here.)

There is a slight possibility we can address this in GCC 8, but it is unlikely,
as the code base is closed except for regression fixes.  In any case, a
solution would still keep some swap instructions in place, and thus would not
be ideal.  (I.e., we can fold a swap and a vperm when the result of the swap is
not used elsewhere, but other swaps associated with loads and stores will still
be present.)  So I don't think we should go this route.

The best performance will be achieved by writing this loop entirely using
inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
swaps), thus in "big-endian element order" (element 0 in the high-order
position of the register).  Because of the big-endian nature of vshasigmaw,
this is always going to be the best approach.

I am still poking the bushes for a reference implementation; I thought of
another person to ask while writing this note.  Will let you know what I find
out.
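
To illustrate what "folding a swap and a vperm" means, here is a scalar C++ sketch of the underlying byte arithmetic. It is not how GCC's swap optimization is written; Vec, xxswapd, vperm and fold_input_swaps are hypothetical names modelling the instructions on plain byte arrays. The point is that a doubleword swap on both vperm inputs can be absorbed by XORing each selector byte with 8, and a swap on the vperm result can be absorbed by swapping the selector itself, provided the swapped values have no other uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16 bytes in big-endian register order: index 0 is the high-order byte.
using Vec = std::array<std::uint8_t, 16>;

// Model of xxswapd: exchange the two 8-byte halves of the register.
Vec xxswapd(const Vec& v) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i)
        r[i] = v[i ^ 8];
    return r;
}

// Model of vperm: result byte i is byte (c[i] & 31) of a || b.
Vec vperm(const Vec& a, const Vec& b, const Vec& c) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::size_t idx = c[i] & 0x1F;
        r[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return r;
}

// Selector that absorbs a swap on both inputs: flipping bit 3 of an index
// moves it to the other doubleword of the same source register.
Vec fold_input_swaps(const Vec& c) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i)
        r[i] = static_cast<std::uint8_t>(c[i] ^ 8);
    return r;
}

// Self-check of both folds on one arbitrary input pattern.
void check_fold() {
    Vec a{}, b{}, c{};
    for (std::size_t i = 0; i < 16; ++i) {
        a[i] = static_cast<std::uint8_t>(i);
        b[i] = static_cast<std::uint8_t>(16 + i);
        c[i] = static_cast<std::uint8_t>((i * 7 + 3) & 31);
    }
    // Swaps feeding the vperm fold into the selector.
    assert(vperm(xxswapd(a), xxswapd(b), c) == vperm(a, b, fold_input_swaps(c)));
    // A swap of the vperm result folds into a swapped selector.
    assert(xxswapd(vperm(a, b, c)) == vperm(a, b, xxswapd(c)));
}
```

In the disassembly from comment #3, each lxvd2x / xxswapd / vperm / xxswapd / stxvd2x group is a candidate for this kind of rewrite; as noted above, swaps attached to other loads and stores would still remain.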

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-07 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #3 from Jeffrey Walton  ---
(In reply to Jeffrey Walton from comment #2)
> (In reply to Bill Schmidt from comment #1)
> > GCC 4.8.5 is out of service.  This is fixed in all in-service versions of
> > GCC (6.4 and later).
> 
> Interesting. I'm seeing it in GCC 7.2.0. Are you certain of this?

Here's an example to make sure we are on the same page.

$ /opt/cfarm/gcc-latest/bin/g++ --version
g++ (GCC) 7.2.0

$ /opt/cfarm/gcc-latest/bin/g++ -g3 -O3 -Wall -DTEST_MAIN -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe

$ objdump --disassemble sha256-p8.exe | c++filt

1880 :
1880:   03 10 40 3c lis r2,4099
1884:   00 81 42 38 addi    r2,r2,-32512
1888:   f0 ff c1 fb std r30,-16(r1)
188c:   f8 ff e1 fb std r31,-8(r1)
1890:   fe ff 22 3d addis   r9,r2,-2
1894:   10 00 c4 3b addi    r30,r4,16
1898:   70 8e 29 39 addi    r9,r9,-29072
189c:   10 00 e3 3b addi    r31,r3,16
18a0:   20 00 84 39 addi    r12,r4,32
18a4:   20 00 63 39 addi    r11,r3,32
18a8:   99 4e 00 7c lxvd2x  vs32,0,r9
18ac:   30 00 a3 38 addi    r5,r3,48
18b0:   40 00 23 39 addi    r9,r3,64
18b4:   c4 ff c0 38 li  r6,-60
18b8:   c0 ff e0 38 li  r7,-64
18bc:   99 26 20 7c lxvd2x  vs33,0,r4
18c0:   30 00 84 38 addi    r4,r4,48
18c4:   f8 ff 00 39 li  r8,-8
18c8:   e4 ff 40 39 li  r10,-28
18cc:   57 02 00 f0 xxswapd vs32,vs32
18d0:   57 0a 21 f0 xxswapd vs33,vs33
18d4:   97 05 00 f0 xxlnand vs32,vs32,vs32
18d8:   2b 08 21 10 vperm   v1,v1,v1,v0
18dc:   57 0a 21 f0 xxswapd vs33,vs33
18e0:   99 1f 20 7c stxvd2x vs33,0,r3
18e4:   18 00 60 38 li  r3,24
18e8:   a6 03 69 7c mtctr   r3
18ec:   99 f6 20 7c lxvd2x  vs33,0,r30
18f0:   57 0a 21 f0 xxswapd vs33,vs33
18f4:   2b 08 21 10 vperm   v1,v1,v1,v0
18f8:   57 0a 21 f0 xxswapd vs33,vs33
18fc:   99 ff 20 7c stxvd2x vs33,0,r31
1900:   99 66 20 7c lxvd2x  vs33,0,r12
1904:   57 0a 21 f0 xxswapd vs33,vs33
1908:   2b 08 21 10 vperm   v1,v1,v1,v0
190c:   57 0a 21 f0 xxswapd vs33,vs33
1910:   99 5f 20 7c stxvd2x vs33,0,r11
1914:   99 26 20 7c lxvd2x  vs33,0,r4
1918:   57 0a 21 f0 xxswapd vs33,vs33
191c:   2b 08 01 10 vperm   v0,v1,v1,v0
1920:   57 02 00 f0 xxswapd vs32,vs32
1924:   99 2f 00 7c stxvd2x vs32,0,r5
1928:   00 00 00 60 nop
192c:   00 00 42 60 ori r2,r2,0
1930:   99 36 09 7c lxvd2x  vs32,r9,r6
1934:   99 3e 89 7d lxvd2x  vs44,r9,r7
1938:   99 56 a9 7d lxvd2x  vs45,r9,r10
193c:   99 46 29 7c lxvd2x  vs33,r9,r8
1940:   82 06 00 10 vshasigmaw v0,v0,0,0
1944:   82 7e 21 10 vshasigmaw v1,v1,0,15
1948:   80 60 00 10 vadduwm v0,v0,v12
194c:   80 68 00 10 vadduwm v0,v0,v13
1950:   80 08 00 10 vadduwm v0,v0,v1
1954:   99 4f 00 7c stxvd2x vs32,0,r9
1958:   08 00 29 39 addi    r9,r9,8
195c:   d4 ff 00 42 bdnz    1930
1960:   f0 ff c1 eb ld  r30,-16(r1)
1964:   f8 ff e1 eb ld  r31,-8(r1)
1968:   20 00 80 4e blr
196c:   00 00 00 00 .long 0x0
1970:   00 09 00 00 .long 0x900
1974:   00 02 00 00 attn
1978:   00 00 00 60 nop
197c:   00 00 42 60 ori r2,r2,0

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-07 Thread noloader at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #2 from Jeffrey Walton  ---
(In reply to Bill Schmidt from comment #1)
> GCC 4.8.5 is out of service.  This is fixed in all in-service versions of
> GCC (6.4 and later).

Interesting. I'm seeing it in GCC 7.2.0. Are you certain of this?

[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm

2018-03-07 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

               What|Removed     |Added
------------------------------------------------------------
             Status|UNCONFIRMED |RESOLVED
                 CC|            |wschmidt at gcc dot gnu.org
         Resolution|---         |INVALID

--- Comment #1 from Bill Schmidt  ---
GCC 4.8.5 is out of service.  This is fixed in all in-service versions of GCC
(6.4 and later).