[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.5                         |---
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.4                         |9.5

--- Comment #12 from Richard Biener ---
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.3                         |9.4

--- Comment #11 from Jakub Jelinek ---
GCC 9.3.0 has been released, adjusting target milestone.
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.2                         |9.3

--- Comment #10 from Jakub Jelinek ---
GCC 9.2 has been released.
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.0                         |9.2

--- Comment #9 from Jakub Jelinek ---
GCC 9.1 has been released.
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #8 from Bill Schmidt ---
Looks like Peter was able to help you on the binutils forum over the weekend. Thanks, Peter!
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #7 from Jeffrey Walton ---
(In reply to Bill Schmidt from comment #4)
> ...
>
> The best performance will be achieved by writing this loop entirely using
> inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no
> swaps), thus in "big-endian element order" (element 0 in the high-order
> position of the register). Because of the big-endian nature of vshasigmaw,
> this is always going to be the best approach.

Thanks Bill.

We are working on your lxvd2x suggestion using inline assembly. Related, see "GCC vec_xl_be replacement using inline assembly", https://stackoverflow.com/q/49215090/608639.

I'm not sure if I am doing something wrong, or this is a new issue:

    $ cat test.cxx
    ...
    typedef __vector unsigned int uint32x4_p8;

    uint32x4_p8 VEC_XL_BE(const uint8_t* data, int offset)
    {
    #if defined(__xlc__) || defined(__xlC__)
        return (uint32x4_p8)vec_xl_be(offset, (uint8_t*)data);
    #else
        uint32x4_p8 res;
        __asm(" lxvd2x %x0, %1, %2\n\t"
              : "=wa" (res)
              : "g" (data), "g" (offset));
        return res;
    #endif
    }

When I use VEC_XL_BE in real life it results in:

    $ g++ -DTEST_MAIN -g3 -O3 -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe
    /home/noloader/tmp/ccbDnfFr.s: Assembler messages:
    /home/noloader/tmp/ccbDnfFr.s:758: Error: operand out of range (32 is not between 0 and 31)
    /home/noloader/tmp/ccbDnfFr.s:983: Error: operand out of range (48 is not between 0 and 31)
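A possible reading of the assembler errors in comment #7, offered as an assumption rather than as the answer given on the binutils list: the "g" constraint allows a register, a memory operand, or an immediate. Once VEC_XL_BE is inlined with a constant offset such as 32 or 48, GCC may substitute the constant directly, and lxvd2x only accepts register numbers 0 to 31 in that position. The sketch below uses "b" (a base register, never r0, since RA=0 means a literal zero) for the pointer and "r" for the offset. Widening the offset to long and adding a "memory" clobber are further changes not present in the original. The non-POWER branch is a hypothetical stand-in so the snippet compiles on other hosts; it does not reproduce the register byte order of lxvd2x.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__powerpc64__) && defined(__VSX__)
typedef __vector unsigned int uint32x4_p8;
#else
// Hypothetical stand-in type for hosts without VSX.
struct uint32x4_p8 { std::uint32_t w[4]; };
#endif

// Load 16 bytes from data + offset.
inline uint32x4_p8 VEC_XL_BE(const std::uint8_t* data, long offset)
{
    assert(data != nullptr);
    uint32x4_p8 res;
#if defined(__powerpc64__) && defined(__VSX__)
    // "b": base register (excludes r0); "r": any general register.
    // Neither constraint lets GCC substitute an immediate.
    // The "memory" clobber tells GCC the asm reads memory it cannot see.
    __asm__("lxvd2x %x0, %1, %2"
            : "=wa"(res)
            : "b"(data), "r"(offset)
            : "memory");
#else
    // Portable fallback: same bytes, host byte order.
    std::memcpy(&res, data + offset, sizeof(res));
#endif
    return res;
}
```

With register-only constraints, a call such as VEC_XL_BE(buf, 32) should make GCC materialize 32 in a general register first instead of emitting it as an operand of lxvd2x.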
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |9.0
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-03-08
     Ever confirmed|0                           |1

--- Comment #6 from Bill Schmidt ---
We should keep this issue open for a different problem, though. When swap optimization doesn't succeed on such a loop, we don't end up optimizing away the xxlnand associated with the vperm on Power8. Currently that is only done as part of swap optimization. It's a minor savings but worth doing.
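For background on why an xxlnand accompanies the vperm: vperm numbers bytes in big-endian order, so a permute written with little-endian element numbering can be expressed by swapping the two source operands and complementing the selector, since ~s & 31 equals 31 - s. The scalar model below illustrates that identity only. It is not GCC code, and every name in it is hypothetical. When the selector is a compile-time constant, the complement could in principle be applied to the constant itself, which appears to be the saving comment #6 refers to.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Index 0 is the high-order byte of the register (big-endian numbering).
using Reg = std::array<std::uint8_t, 16>;

// Scalar model of vperm: pick bytes from the 32-byte concatenation a:b.
Reg vperm_model(const Reg& a, const Reg& b, const Reg& sel) {
    Reg r{};
    for (std::size_t i = 0; i < 16; ++i) {
        unsigned idx = sel[i] & 0x1f;
        r[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return r;
}

// Scalar model of "xxlnand x,x,x": bitwise NOT of every byte.
Reg complement(const Reg& x) {
    Reg r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = static_cast<std::uint8_t>(~x[i]);
    return r;
}

// Reference permute with little-endian element numbering:
// element j of a source lives in register byte 15 - j.
Reg perm_le_reference(const Reg& a, const Reg& b, const Reg& sel) {
    Reg r{};
    for (std::size_t k = 0; k < 16; ++k) {
        unsigned j = sel[k] & 0x1f;
        r[k] = j < 16 ? a[15 - j] : b[31 - j];
    }
    return r;
}

// The same permute built from the big-endian instruction:
// swapped sources plus a complemented selector.
Reg perm_le_via_vperm(const Reg& a, const Reg& b, const Reg& sel) {
    return vperm_model(b, a, complement(sel));
}
```

Both functions should agree for every selector, which is why the complemented selector can be prepared once, ahead of the vperm that uses it.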
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #5 from Bill Schmidt ---
s/this loop/this function
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|INVALID                     |---

--- Comment #4 from Bill Schmidt ---
OK, I see. We optimize swapped vperm in most cases as part of a general swap-optimization algorithm. However, this algorithm is defeated when there is a mix of loads/stores accompanied by swaps and loads/stores that are not accompanied by swaps. The "big-endian" loads that are used with vshasigmaw and friends are the problem here. (This problem goes away with Power9, but doesn't help you here.)

There is a slight possibility we can address this in GCC 8, but it is unlikely, as the code base is closed except for regression fixes. In any case, a solution would still keep some swap instructions in place, and thus would not be ideal. (I.e., we can fold a swap and a vperm when the result of the swap is not used elsewhere, but other swaps associated with loads and stores will still be present.) So I don't think we should go this route.

The best performance will be achieved by writing this loop entirely using inline asm code, with all data loaded/stored using lxvd2x and stxvd2x (no swaps), thus in "big-endian element order" (element 0 in the high-order position of the register). Because of the big-endian nature of vshasigmaw, this is always going to be the best approach.

I am still poking the bushes for a reference implementation; I thought of another person to ask while writing this note. Will let you know what I find out.
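To make "fold a swap and a vperm" concrete, here is a scalar model under stated assumptions. It treats a register as 16 bytes, models xxswapd as exchanging the two 8-byte halves, and models the "vperm v1,v1,v1,v0" form where both sources are the same register. Under that model, a swap followed by a permute equals a single permute whose selector entries have bit 3 flipped. The function names are hypothetical and this is not how GCC's swap optimization is implemented.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec = std::array<std::uint8_t, 16>;

// Model of xxswapd: exchange the two 8-byte halves of the register.
Vec swap_doublewords(const Vec& x) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = x[i ^ 8];
    return r;
}

// Model of vperm with both sources equal: only the low 4 selector bits
// matter, because either source supplies the same byte.
Vec permute_self(const Vec& x, const Vec& sel) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = x[sel[i] & 0x0f];
    return r;
}

// Fold the preceding swap into the selector: byte j of the swapped value
// is byte j ^ 8 of the original, so flip bit 3 of each selector entry.
Vec fold_swap_into_selector(const Vec& sel) {
    Vec r{};
    for (std::size_t i = 0; i < 16; ++i)
        r[i] = static_cast<std::uint8_t>((sel[i] & 0x0f) ^ 8);
    return r;
}
```

In this model, permute_self(swap_doublewords(x), sel) and permute_self(x, fold_swap_into_selector(sel)) give the same result. The fold is only valid when nothing else consumes the swapped value, which matches the restriction stated in comment #4.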
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #3 from Jeffrey Walton ---
(In reply to Jeffrey Walton from comment #2)
> (In reply to Bill Schmidt from comment #1)
> > GCC 4.8.5 is out of service. This is fixed in all in-service versions of
> > GCC (6.4 and later).
>
> Interesting. I'm seeing it in GCC 7.2.0. Are you certain of this?

Here's an example to make sure we are on the same page.

    $ /opt/cfarm/gcc-latest/bin/g++ --version
    g++ (GCC) 7.2.0

    $ /opt/cfarm/gcc-latest/bin/g++ -g3 -O3 -Wall -DTEST_MAIN -mcpu=power8 sha256-p8.cxx -o sha256-p8.exe

    $ objdump --disassemble sha256-p8.exe | c++filt
    1880 :
    1880: 03 10 40 3c   lis     r2,4099
    1884: 00 81 42 38   addi    r2,r2,-32512
    1888: f0 ff c1 fb   std     r30,-16(r1)
    188c: f8 ff e1 fb   std     r31,-8(r1)
    1890: fe ff 22 3d   addis   r9,r2,-2
    1894: 10 00 c4 3b   addi    r30,r4,16
    1898: 70 8e 29 39   addi    r9,r9,-29072
    189c: 10 00 e3 3b   addi    r31,r3,16
    18a0: 20 00 84 39   addi    r12,r4,32
    18a4: 20 00 63 39   addi    r11,r3,32
    18a8: 99 4e 00 7c   lxvd2x  vs32,0,r9
    18ac: 30 00 a3 38   addi    r5,r3,48
    18b0: 40 00 23 39   addi    r9,r3,64
    18b4: c4 ff c0 38   li      r6,-60
    18b8: c0 ff e0 38   li      r7,-64
    18bc: 99 26 20 7c   lxvd2x  vs33,0,r4
    18c0: 30 00 84 38   addi    r4,r4,48
    18c4: f8 ff 00 39   li      r8,-8
    18c8: e4 ff 40 39   li      r10,-28
    18cc: 57 02 00 f0   xxswapd vs32,vs32
    18d0: 57 0a 21 f0   xxswapd vs33,vs33
    18d4: 97 05 00 f0   xxlnand vs32,vs32,vs32
    18d8: 2b 08 21 10   vperm   v1,v1,v1,v0
    18dc: 57 0a 21 f0   xxswapd vs33,vs33
    18e0: 99 1f 20 7c   stxvd2x vs33,0,r3
    18e4: 18 00 60 38   li      r3,24
    18e8: a6 03 69 7c   mtctr   r3
    18ec: 99 f6 20 7c   lxvd2x  vs33,0,r30
    18f0: 57 0a 21 f0   xxswapd vs33,vs33
    18f4: 2b 08 21 10   vperm   v1,v1,v1,v0
    18f8: 57 0a 21 f0   xxswapd vs33,vs33
    18fc: 99 ff 20 7c   stxvd2x vs33,0,r31
    1900: 99 66 20 7c   lxvd2x  vs33,0,r12
    1904: 57 0a 21 f0   xxswapd vs33,vs33
    1908: 2b 08 21 10   vperm   v1,v1,v1,v0
    190c: 57 0a 21 f0   xxswapd vs33,vs33
    1910: 99 5f 20 7c   stxvd2x vs33,0,r11
    1914: 99 26 20 7c   lxvd2x  vs33,0,r4
    1918: 57 0a 21 f0   xxswapd vs33,vs33
    191c: 2b 08 01 10   vperm   v0,v1,v1,v0
    1920: 57 02 00 f0   xxswapd vs32,vs32
    1924: 99 2f 00 7c   stxvd2x vs32,0,r5
    1928: 00 00 00 60   nop
    192c: 00 00 42 60   ori     r2,r2,0
    1930: 99 36 09 7c   lxvd2x  vs32,r9,r6
    1934: 99 3e 89 7d   lxvd2x  vs44,r9,r7
    1938: 99 56 a9 7d   lxvd2x  vs45,r9,r10
    193c: 99 46 29 7c   lxvd2x  vs33,r9,r8
    1940: 82 06 00 10   vshasigmaw v0,v0,0,0
    1944: 82 7e 21 10   vshasigmaw v1,v1,0,15
    1948: 80 60 00 10   vadduwm v0,v0,v12
    194c: 80 68 00 10   vadduwm v0,v0,v13
    1950: 80 08 00 10   vadduwm v0,v0,v1
    1954: 99 4f 00 7c   stxvd2x vs32,0,r9
    1958: 08 00 29 39   addi    r9,r9,8
    195c: d4 ff 00 42   bdnz    1930
    1960: f0 ff c1 eb   ld      r30,-16(r1)
    1964: f8 ff e1 eb   ld      r31,-8(r1)
    1968: 20 00 80 4e   blr
    196c: 00 00 00 00   .long 0x0
    1970: 00 09 00 00   .long 0x900
    1974: 00 02 00 00   attn
    1978: 00 00 00 60   nop
    197c: 00 00 42 60   ori     r2,r2,0
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

--- Comment #2 from Jeffrey Walton ---
(In reply to Bill Schmidt from comment #1)
> GCC 4.8.5 is out of service. This is fixed in all in-service versions of
> GCC (6.4 and later).

Interesting. I'm seeing it in GCC 7.2.0. Are you certain of this?
[Bug rtl-optimization/84753] GCC does not fold xxswapd followed by vperm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753

Bill Schmidt changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |wschmidt at gcc dot gnu.org
         Resolution|---                         |INVALID

--- Comment #1 from Bill Schmidt ---
GCC 4.8.5 is out of service. This is fixed in all in-service versions of GCC (6.4 and later).