https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114319

            Bug ID: 114319
           Summary: htobe64-like function is not optimized on 32-bit x86
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pali at kernel dot org
  Target Milestone: ---
            Target: x86

Here is a very simple and straightforward implementation of an htobe64
function, which takes a 64-bit number stored in an unsigned long long variable
and encodes it into an unsigned char[] byte buffer in big-endian order.

void test1(unsigned long long val, unsigned char *buf) {
  buf[0] = val >> 56;
  buf[1] = val >> 48;
  buf[2] = val >> 40;
  buf[3] = val >> 32;
  buf[4] = val >> 24;
  buf[5] = val >> 16;
  buf[6] = val >> 8;
  buf[7] = val;
}

Compiling it for 64-bit x86 via "gcc -m64 -O2" produces optimized code:

0000000000000000 <test1>:
   0:   48 0f cf                bswap  %rdi
   3:   48 89 3e                mov    %rdi,(%rsi)
   6:   c3                      retq

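But compiling it for 32-bit x86 via "gcc -m32 -O2" produces less optimized
code. Only the low 32 bits are stored with a single bswap, while the high
32 bits are written out byte by byte: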

00000000 <test1>:
   0:   8b 54 24 08             mov    0x8(%esp),%edx
   4:   8b 44 24 0c             mov    0xc(%esp),%eax
   8:   89 d1                   mov    %edx,%ecx
   a:   88 70 02                mov    %dh,0x2(%eax)
   d:   c1 e9 18                shr    $0x18,%ecx
  10:   88 50 03                mov    %dl,0x3(%eax)
  13:   88 08                   mov    %cl,(%eax)
  15:   89 d1                   mov    %edx,%ecx
  17:   8b 54 24 04             mov    0x4(%esp),%edx
  1b:   c1 e9 10                shr    $0x10,%ecx
  1e:   0f ca                   bswap  %edx
  20:   88 48 01                mov    %cl,0x1(%eax)
  23:   89 50 04                mov    %edx,0x4(%eax)
  26:   c3                      ret


I tried to compile it for 32-bit powerpc via "powerpc-linux-gnu-gcc -m32 -O2"
and it produces optimized code:

00000000 <test1>:
   0:   90 65 00 00     stw     r3,0(r5)
   4:   90 85 00 04     stw     r4,4(r5)
   8:   4e 80 00 20     blr

Same for 64-bit powerpc via "powerpc-linux-gnu-gcc -m64 -O2":

0000000000000000 <.test1>:
   0:   f8 64 00 00     std     r3,0(r4)
   4:   4e 80 00 20     blr


As the next experiment, I rewrote the simple implementation to use gcc
builtins.

void test2(unsigned long long val, unsigned char *buf) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  val = __builtin_bswap64(val);
#endif
  __builtin_memcpy(buf, &val, sizeof(val));
}

If I compile it for 32-bit x86 then I get optimized code:

00000030 <test2>:
  30:   8b 4c 24 0c             mov    0xc(%esp),%ecx
  34:   8b 44 24 04             mov    0x4(%esp),%eax
  38:   8b 54 24 08             mov    0x8(%esp),%edx
  3c:   0f c8                   bswap  %eax
  3e:   89 41 04                mov    %eax,0x4(%ecx)
  41:   0f ca                   bswap  %edx
  43:   89 11                   mov    %edx,(%ecx)
  45:   c3                      ret
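
The 32-bit output above amounts to two independent 32-bit byte swaps, one per
half of the value. A minimal sketch of the same idea written out by hand
follows. The name test3 is hypothetical and not part of the original test
case, and the code assumes that unsigned int is 32 bits wide and that the
compiler provides gcc's __builtin_bswap32 and the __BYTE_ORDER__ macros:

```c
#include <string.h>

/* Hypothetical variant (not from the original report): encode val in
   big-endian order by handling the two 32-bit halves separately.
   Assumes unsigned int is exactly 32 bits wide. */
void test3(unsigned long long val, unsigned char *buf) {
  unsigned int hi = (unsigned int)(val >> 32); /* upper 32 bits */
  unsigned int lo = (unsigned int)val;         /* lower 32 bits */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  /* On little-endian hosts each half must be byte-swapped. */
  hi = __builtin_bswap32(hi);
  lo = __builtin_bswap32(lo);
#endif
  memcpy(buf, &hi, sizeof(hi));     /* buf[0..3] = most significant bytes */
  memcpy(buf + 4, &lo, sizeof(lo)); /* buf[4..7] = least significant bytes */
}
```

This is only meant to illustrate what the optimized 32-bit code is doing; I
have not checked what gcc generates for it.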

If I compile it for 64-bit x86 then I get exactly the same code as for test1:

0000000000000010 <test2>:
  10:   48 0f cf                bswap  %rdi
  13:   48 89 3e                mov    %rdi,(%rsi)
  16:   c3                      retq

I tried to compile it for powerpc too, and the results for test1 and test2
were the same.



So it looks like the issue is specific to 32-bit x86: there gcc does not
detect that the test1 function is doing a bswap64.
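
For completeness, the two functions should produce identical bytes, so the
difference is purely one of code generation. A small sketch of how this can be
checked is below. The helper same_output is a hypothetical name introduced
here for illustration, and test1 and test2 are copied from above:

```c
#include <string.h>

void test1(unsigned long long val, unsigned char *buf) {
  buf[0] = val >> 56;
  buf[1] = val >> 48;
  buf[2] = val >> 40;
  buf[3] = val >> 32;
  buf[4] = val >> 24;
  buf[5] = val >> 16;
  buf[6] = val >> 8;
  buf[7] = val;
}

void test2(unsigned long long val, unsigned char *buf) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  val = __builtin_bswap64(val);
#endif
  __builtin_memcpy(buf, &val, sizeof(val));
}

/* Hypothetical helper: returns 1 if test1 and test2 write the same
   8 bytes for val, 0 otherwise. */
int same_output(unsigned long long val) {
  unsigned char a[8], b[8];
  test1(val, a);
  test2(val, b);
  return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling same_output() with a few values from main() is enough to confirm that
the two implementations agree on a given target.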

All tests were done with Debian's gcc on amd64; for the powerpc target I used
Debian's powerpc-linux-gnu-gcc cross compiler.
