https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90262
--- Comment #3 from Liu Hao ---
This exists on x86_64 too: https://gcc.godbolt.org/z/z5MW4E4aE
```c
int xcopy(char* dst, const char* src)
{
  __builtin_memmove(dst, src, 32);
  return dst[0];
}
```
Clang generates this assembly:
```
xcopy(char*, char const*):              # @xcopy(char*, char const*)
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rdi], xmm0
        movups  xmmword ptr [rdi + 16], xmm1
        movsx   eax, byte ptr [rdi]
        ret
```
which comprises two XMM loads followed by two XMM stores. Because both loads
complete before either store, it works as expected whether or not `dst` and
`src` point to overlapping regions.
GCC, however, generates a call to `memmove()` instead, which is rather
inefficient for such a small, fixed-size copy:
```
xcopy(char*, char const*):
        sub     rsp, 8
        mov     edx, 32
        call    memmove
        movsx   eax, BYTE PTR [rax]
        add     rsp, 8
        ret
```
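For illustration, the load-everything-then-store pattern seen in the Clang output can be written out in portable C. This is only a sketch: `copy32_overlap_safe` is a hypothetical name that does not appear in the testcase, and it makes no claim about what code any particular compiler emits for it. It just shows why reading all 32 bytes before writing any of them is safe for overlapping buffers.

```c
#include <string.h>

/* Hypothetical helper (not part of the original testcase): copy 32 bytes
   by reading the entire source into a temporary before writing anything
   to the destination, so overlap between dst and src cannot corrupt the
   data being copied. */
void copy32_overlap_safe(char *dst, const char *src)
{
  char tmp[32];
  memcpy(tmp, src, sizeof tmp);  /* all loads happen first */
  memcpy(dst, tmp, sizeof tmp);  /* then all stores */
}
```

Each `memcpy` here has a local buffer on one side, so neither call has overlapping operands, yet the function as a whole behaves like `memmove(dst, src, 32)`.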