https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753
Bug ID: 84753
Summary: GCC does not fold xxswapd followed by vperm
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: noloader at gmail dot com
Target Milestone: ---

I'm working on GCC112 from the compile farm. It is a ppc64le machine with both GCC 4.8.5 and GCC 7.2.0 installed. The issue is present on both.

We are trying to recover a missing 1 to 2 cycles-per-byte (cpb) of performance when using the Power8 SHA built-ins. Part of the code that loads a message into the message schedule looks like this:

    uint8_t msg[64] = {...};
    __vector unsigned char mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    __vector unsigned int t = vec_vsx_ld(0, msg);
    t = vec_perm(t, t, mask);

When I compile at -O3 and disassemble it, I see:

    100008bc:  99 26 20 7c   lxvd2x  vs33,0,r4
    ...
    100008d0:  57 0a 21 f0   xxswapd vs33,vs33
    100008d8:  2b 08 21 10   vperm   v1,v1,v1,v0

Calling xxswapd followed by vperm looks a lot like calling shuffle_epi32 followed by shuffle_epi8 on an x86 machine. It feels like the two permutes should be folded into one.

On x86 I would manually fold the two shuffles. On PPC I cannot, because the xxswapd is generated as part of the load, and then I call vperm. I have not figured out how to avoid the xxswapd (I even tried issuing my own xxswapd to cancel out the one generated by the compiler).

**********

Here's a minimal test case, but the optimizer removes the code of interest. The real code suffers from the problem; it can be found at https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-p8.cxx .
$ cat test.cxx
#include <stdint.h>

#if defined(__ALTIVEC__)
# include <altivec.h>
# undef vector
# undef pixel
# undef bool
#endif

typedef __vector unsigned char uint8x16_p8;
typedef __vector unsigned int  uint32x4_p8;

// Unaligned load
template <class T>
static inline uint32x4_p8 VectorLoad32x4u(const T* data, int offset)
{
    return vec_vsx_ld(offset, (uint32_t*)data);
}

// Unaligned store
template <class T>
static inline void VectorStore32x4u(const uint32x4_p8 val, T* data, int offset)
{
    vec_vsx_st(val, offset, (uint32_t*)data);
}

static inline uint32x4_p8 VectorPermute32x4(const uint32x4_p8 val, const uint8x16_p8 mask)
{
    return (uint32x4_p8)vec_perm(val, val, mask);
}

int main(int argc, char* argv[])
{
    uint8_t  M[64];
    uint32_t W[64];

    uint8_t*  m = M;
    uint32_t* w = W;

    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};

    // Each iteration handles one 16-byte block: advance m by 16 bytes
    // and w by 4 words.
    for (unsigned int i=0; i<16; i+=4, m+=16, w+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(m, 0), mask), w, 0);

    return 0;
}