https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319
Bug ID: 114319
Summary: htobe64-like function is not optimized on 32-bit x86
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pali at kernel dot org
Target Milestone: ---
Target: x86

Here is a very simple and straightforward implementation of an htobe64 function, which takes a 64-bit number stored in an unsigned long long variable and encodes it into a byte buffer (unsigned char[]).

    void test1(unsigned long long val, unsigned char *buf) {
        buf[0] = val >> 56;
        buf[1] = val >> 48;
        buf[2] = val >> 40;
        buf[3] = val >> 32;
        buf[4] = val >> 24;
        buf[5] = val >> 16;
        buf[6] = val >> 8;
        buf[7] = val;
    }

Compiling it for 64-bit x86 via "gcc -m64 -O2" produces optimized code:

    0000000000000000 <test1>:
       0:   48 0f cf                bswap  %rdi
       3:   48 89 3e                mov    %rdi,(%rsi)
       6:   c3                      retq

But compiling it for 32-bit x86 via "gcc -m32 -O2" produces less optimized code:

    00000000 <test1>:
       0:   8b 54 24 08             mov    0x8(%esp),%edx
       4:   8b 44 24 0c             mov    0xc(%esp),%eax
       8:   89 d1                   mov    %edx,%ecx
       a:   88 70 02                mov    %dh,0x2(%eax)
       d:   c1 e9 18                shr    $0x18,%ecx
      10:   88 50 03                mov    %dl,0x3(%eax)
      13:   88 08                   mov    %cl,(%eax)
      15:   89 d1                   mov    %edx,%ecx
      17:   8b 54 24 04             mov    0x4(%esp),%edx
      1b:   c1 e9 10                shr    $0x10,%ecx
      1e:   0f ca                   bswap  %edx
      20:   88 48 01                mov    %cl,0x1(%eax)
      23:   89 50 04                mov    %edx,0x4(%eax)
      26:   c3                      ret

I tried to compile it for 32-bit powerpc via "powerpc-linux-gnu-gcc -m32 -O2" and it produces optimized code:

    00000000 <test1>:
       0:   90 65 00 00     stw     r3,0(r5)
       4:   90 85 00 04     stw     r4,4(r5)
       8:   4e 80 00 20     blr

The same holds for 64-bit powerpc via "powerpc-linux-gnu-gcc -m64 -O2":

    0000000000000000 <.test1>:
       0:   f8 64 00 00     std     r3,0(r4)
       4:   4e 80 00 20     blr

As a next experiment I tried to rewrite the simple implementation to use gcc builtins:
    void test2(unsigned long long val, unsigned char *buf) {
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        val = __builtin_bswap64(val);
    #endif
        __builtin_memcpy(buf, &val, sizeof(val));
    }

If I compile it for 32-bit x86 then I get optimized code:

    00000030 <test2>:
      30:   8b 4c 24 0c             mov    0xc(%esp),%ecx
      34:   8b 44 24 04             mov    0x4(%esp),%eax
      38:   8b 54 24 08             mov    0x8(%esp),%edx
      3c:   0f c8                   bswap  %eax
      3e:   89 41 04                mov    %eax,0x4(%ecx)
      41:   0f ca                   bswap  %edx
      43:   89 11                   mov    %edx,(%ecx)
      45:   c3                      ret

If I compile it for 64-bit x86 then I get exactly the same code as for test1:

    0000000000000010 <test2>:
      10:   48 0f cf                bswap  %rdi
      13:   48 89 3e                mov    %rdi,(%rsi)
      16:   c3                      retq

I tried to compile it for powerpc too, and the results for test1 and test2 were the same.

So it looks like the issue is specific to 32-bit x86: gcc does not detect that the test1 function on x86 is doing a bswap64.

I did all tests with Debian's gcc on amd64; for the powerpc target I used Debian's powerpc-linux-gnu-gcc cross compiler.
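For anyone reproducing this, it may help to first confirm that test1 and test2 really produce the same bytes, so that any difference in the generated code is purely a missed optimization. The following is only a minimal sketch: encodings_match and check_known_vector are hypothetical helper names that are not part of the report, and it assumes a GCC-compatible compiler (for __builtin_bswap64 and __BYTE_ORDER__) and an 8-byte unsigned long long.

```c
#include <assert.h>
#include <string.h>

/* test1 from the report: byte-by-byte big-endian store. */
void test1(unsigned long long val, unsigned char *buf)
{
    buf[0] = val >> 56;
    buf[1] = val >> 48;
    buf[2] = val >> 40;
    buf[3] = val >> 32;
    buf[4] = val >> 24;
    buf[5] = val >> 16;
    buf[6] = val >> 8;
    buf[7] = val;
}

/* test2 from the report: explicit byte swap plus memcpy. */
void test2(unsigned long long val, unsigned char *buf)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    val = __builtin_bswap64(val);
#endif
    __builtin_memcpy(buf, &val, sizeof(val));
}

/* Hypothetical helper (not in the report): returns 1 when both
 * functions write identical bytes for the given value, else 0. */
int encodings_match(unsigned long long val)
{
    unsigned char a[8], b[8];

    test1(val, a);
    test2(val, b);
    return memcmp(a, b, sizeof(a)) == 0;
}

/* Hypothetical helper (not in the report): checks one known vector
 * against the expected big-endian byte order. */
void check_known_vector(void)
{
    static const unsigned char expected[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    unsigned char buf[8];

    test1(0x0102030405060708ULL, buf);
    assert(memcmp(buf, expected, sizeof(buf)) == 0);

    test2(0x0102030405060708ULL, buf);
    assert(memcmp(buf, expected, sizeof(buf)) == 0);
}
```

Calling check_known_vector() and encodings_match() on a few values from a small main, built once with -m32 and once with -m64, should pass on both if the two functions are equivalent. The disassembly of the same object file can then be compared as in the listings above.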