https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261
--- Comment #4 from Peter Cordes ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide.
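As a minimal sketch of that case (the function name shl64 is made up for illustration, not taken from GCC or this report): in 32-bit mode a uint64_t occupies two registers, so a variable-count shift like the one below is the kind of code where GCC typically uses SHLD for the high half. The exact instruction sequence depends on compiler version and options.

```c
#include <stdint.h>

/* Hypothetical example: variable-count left shift of a 64-bit value.
   Built with something like gcc -O2 -m32, the value is two registers wide,
   and the high half is typically computed with SHLD. */
uint64_t shl64(uint64_t x, unsigned count) {
    /* Mask the count so the shift is always well-defined in C. */
    return x << (count & 63);
}
```

The mask is only there to avoid undefined behaviour for counts of 64 or more.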
Hironori Bono proposed the following functions as a workaround for this missed
optimization (https://stackoverflow.com/a/71805063/224132):
#include <stdint.h>
#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}
uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                    >> 32);
}
uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}
---
The uint64_t functions (using __int128) compile cleanly in 64-bit mode
(https://godbolt.org/z/1j94Gcb4o), using 64-bit operand-size shld/shrd,
but the uint32_t functions compile to a total mess in 32-bit mode (GCC 11.2 -O3
-m32 -mregparm=3) before eventually using shld, including a totally insane
or dh, 0
GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's
also a slightly different mess.
Ironically, the uint32_t functions compile to quite a few instructions in
64-bit mode, actually doing the operations as written with shifts and ORs, and
having to manually mask the shift count to &31 because it uses a 64-bit
operand-size shift which masks with &63. 32-bit operand-size SHLD would be a
win here, at least for -mtune=intel or a specific Intel uarch.
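To make the count masking concrete, here is a rough sketch comparing the double-width formulation with a two-shift reference. shld_ref is a hypothetical helper written for this illustration, not part of the proposed workaround. With count = 33, a 32-bit SHLD would shift by 1 (33 & 31), whereas a 64-bit shift without the explicit mask would shift by 33.

```c
#include <stdint.h>

/* Hypothetical reference: 32-bit SHLD semantics written as two shifts. */
uint32_t shld_ref(uint32_t low, uint32_t high, uint32_t count) {
    count &= 31;                 /* 32-bit operand-size SHLD masks the count to 5 bits */
    if (count == 0)
        return high;             /* avoid an undefined shift by 32 below */
    return (high << count) | (low >> (32 - count));
}

/* Double-width formulation, as in the workaround above. */
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
    return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                      >> 32);
}
```

Both functions should agree for every count, including counts of 32 or more, because the mask is applied before the shift in each.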
I haven't looked at whether they still compile ok after inlining into
surrounding code, or whether operations would tend to combine with other things
in preference to becoming an SHLD.