https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
Bug ID: 80820
Summary: _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen
         should avoid store/reload, and generic should think about it
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc with -mtune=generic likes to bounce through memory when moving data from
integer registers to xmm for things like _mm_set_epi32.  There are 3 related
tuning issues here:

* -mtune=haswell -mno-sse4 still uses one store/reload for _mm_set_epi64x.

* -mtune=znver1 should definitely favour movd/movq instead of store/reload.
  (Ryzen has 1 m-op movd/movq between vector and integer with 3c latency,
  shorter than store-forwarding.  All the reasons to favour store/reload on
  other AMD uarches are gone.)

* -mtune=generic should probably favour movd/movq.  I think it's better for a
  weighted average of the CPUs we care about for -mtune=generic.  Most of the
  text below is an attempt to back up this claim, but I don't have hardware to
  test with, so all I can do is look at Agner Fog's tables and microarch pdf.
  movd is about break-even on Bulldozer, better on SnB-family, much better on
  Core2/Nehalem, and significantly worse only on AMD K8/K10.

Or maybe use a hybrid strategy that does half with movd and half with
store/reload, which can actually be better than either strategy alone on
Bulldozer and SnB-family.

-----------

The tune=haswell issue is maybe separate from the others, since gcc already
knows that bouncing through memory isn't the optimal strategy.

#include <immintrin.h>
__m128i combine64(long long a, long long b) { return _mm_set_epi64x(b,a); }

gcc8 -O3 -mtune=haswell emits:

        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        movhps  -16(%rsp), %xmm0

(See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819 for the wasted store
with -msse4 -mno-avx.)

I think what clang and ICC do is optimal for the SSE2-only case, for Intel
CPUs and Ryzen:

        movq    %rsi, %xmm1
        movq    %rdi, %xmm0
        punpcklqdq %xmm1, %xmm0

_mm_set_epi32(d,c,b,a) with -mtune=haswell gives us the expected movd/punpck
(without SSE4), with no store/reload.

-----

Using movd or movq instead of a store/reload is a code-size win:
movd %eax, %xmm0 is 4 bytes (or 5 with a REX prefix for movq or high
registers).  Store/reload to -0x10(%rsp) is 10, 11, or 12 bytes, depending on
operand size and high register(s).

movd int->xmm is lower latency than store/reload on most CPUs, especially
Intel SnB-family where it's 1c latency, and also AMD Ryzen.  On SnB-family,
store/reload's only advantage is rare cases where port 5 is a throughput
bottleneck and latency isn't important.

movd replaces a store uop and a load uop with 1 ALU uop on Intel Core2 and
later, and on Atom/Silvermont/KNL.  It's also 1 uop on VIA Nano.  movd
int->xmm is 2 ALU uops on AMD K10/Bulldozer-family and Jaguar, and on P4, and
3 on K8/Bobcat.  It never costs any more total uops for the front-end (since a
movd load is 2 uops on K8/Bobcat), but decoding a multi-uop instruction can
sometimes be a bottleneck (especially on K8, where a 3 m-op instruction is a
"vectorpath" (microcoded) instruction).

Store/reload has one per clock throughput on every CPU, AFAIK.  On most CPUs
that have much weight in -mtune=generic, movd's throughput is one per clock or
better.  (According to Agner Fog's tables, only Bobcat, K8/K10, and P4 have
throughput of one per 2 or 3 clocks for movd/movq int->xmm.)
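For reference in the sections below, a minimal sketch of the _mm_set_epi32
case, by analogy with combine64 above (the function name combine32 is my
invention, not something from gcc's testsuite):

#include <immintrin.h>
/* a goes in element 0, d in element 3 */
__m128i combine32(int a, int b, int c, int d)
{
    return _mm_set_epi32(d, c, b, a);
}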
The biggest problem is K10, with something like one per 2.8c throughput
(according to a couple of reports from http://users.atw.hu/instlatx64/, e.g.
http://users.atw.hu/instlatx64/AuthenticAMD0100FA0_K10_Thuban_InstLatX64.txt).
Agner Fog says 3, but none of these are measuring with other instructions
mixed in.

Some CPUs have better than one-per-clock throughput for movd/movq: Core2 is
0.5c, and Nehalem is 0.33c.  So do we hurt them a lot to help Phenom II?  I'd
guess that Core2+Nehalem have somewhat more weight in tune=generic than K10.
Some AMD Phenom II CPUs are still around, though.  (But we could exclude them
for code built with -mssse3.)

---------

Probably the deciding factor for tune=generic is whether it hurts AMD
Bulldozer-family significantly or at all.  It looks like there's not much
difference either way: similar throughput and latency.  However, store/reload
may have an advantage when two cores in a cluster are competing for their
shared vector unit.  Probably both of movd's macro-ops need to run on the
shared vector unit, but for store/reload maybe only the load needs the shared
resource.  IDK if this is correct or relevant, though.  Probably -mtune=bdver*
should keep using store/reload, but this might not be enough of a reason to
stop -mtune=generic from using movd.

Agner Fog's microarch pdf (Bulldozer section 18.11) says:

> Nevertheless, I cannot confirm that it is faster to move data from a
> general purpose register to a vector register through a memory
> intermediate, as recommended in AMD's optimization guide.

That AMD optimization-guide advice may have been left over from K8/K10, where
movd/movq from integer->vector has bad throughput.

As far as latency goes, scalar store -> vector reload is 10c on Bulldozer
according to Agner Fog's numbers, while movd/movq is 10c on
Bulldozer/Piledriver and 5c on Steamroller.  (Steamroller also appears to have
reduced the store-forwarding latency to 6c.  Agner's tables are supposed to
have the store + load latencies add up to the store-forwarding latency.)

Store/reload is 2 instructions / 2 m-ops, but movd or movq is 1 instruction /
2 m-ops.  This is mostly OK for the decoders, but bdver1 can't decode in a 2-2
pattern (bdver2/bdver3 can).  Scheduling instructions to avoid consecutive
multi-uop instructions may help decode throughput on bdver1.  But pairs of
2 m-op instructions are good on bdver2 and later.

With SSE4, pinsrd/q is probably good, because it's still only 2 m-ops on
Bulldozer-family.  Indeed, -mtune=bdver1 uses 2x store/reload and 2x pinsrd
for _mm_set_epi32(d,c,b,a):

        movl    %edx, -12(%rsp)
        movd    -12(%rsp), %xmm1
        movl    %edi, -12(%rsp)
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %ecx, %xmm1
        pinsrd  $1, %esi, %xmm0
        punpcklqdq %xmm1, %xmm0

Even better would probably be:

        movd    %edx, %xmm1
        movl    %edi, -12(%rsp)
        pinsrd  $1, %ecx, %xmm1   # for bdver2, schedule so it can decode in a
                                  # 2-2 pattern with the other pinsrd
        movd    -12(%rsp), %xmm0
        pinsrd  $1, %esi, %xmm0
        punpcklqdq %xmm1, %xmm0

The store/reload can happen in parallel with the direct movd int->xmm1.  This
would be pretty reasonable for tune=generic, and should run well on Intel
SnB-family CPUs.

-----

For -msse4 with -mtune=core2 or -mtune=nehalem, this is probably optimal:

        movd    %edi, %xmm0
        pinsrd  $1, %esi, %xmm0
        pinsrd  $2, %edx, %xmm0
        pinsrd  $3, %ecx, %xmm0

movd can run on any port, and pinsrd is only 1 uop, so this has a total
latency of 2 + 3*1 = 5c on Core2 Wolfdale.  (First-gen Core2 doesn't have
SSE4.1.)
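The movd + 3x pinsrd sequence above corresponds to writing the insertions out
explicitly with intrinsics.  A sketch (the function name is mine; compile with
-msse4.1, and note that gcc's tuning may still choose a different expansion):

#include <immintrin.h>
__m128i combine32_pinsr(int a, int b, int c, int d)
{
    __m128i v = _mm_cvtsi32_si128(a);   /* movd   %edi, %xmm0 */
    v = _mm_insert_epi32(v, b, 1);      /* pinsrd $1, %esi, %xmm0 */
    v = _mm_insert_epi32(v, c, 2);      /* pinsrd $2, %edx, %xmm0 */
    v = _mm_insert_epi32(v, d, 3);      /* pinsrd $3, %ecx, %xmm0 */
    return v;
}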
Front-end bottlenecks are more common on Core2/Nehalem since they don't have a
uop cache, so fewer instructions is probably a good bet even at the expense of
latency.  It might not be worth the effort to get gcc to emit this for
Core2/Nehalem, since they're old and getting less relevant all the time.

It may also be good for -mtune=silvermont or KNL, though, since they also have
1 uop pinsrd/q.  But with 3c latency for pinsrd, the lack of ILP may be a big
problem.  Also, decode on Silvermont without VEX will stall if the pinsrd
needs a REX prefix (too many prefixes).  KNL should always use VEX or EVEX to
avoid that.
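For completeness, the SSE2-only movq + movq + punpcklqdq sequence that clang
and ICC emit for combine64 (see the top of this report) can also be written
directly with intrinsics.  A sketch, with a function name of my choosing
(whether gcc actually keeps the direct movq form still depends on the tuning
decisions discussed above):

#include <immintrin.h>
__m128i combine64_punpck(long long a, long long b)
{
    __m128i lo = _mm_cvtsi64_si128(a);   /* movq %rdi, %xmm0 */
    __m128i hi = _mm_cvtsi64_si128(b);   /* movq %rsi, %xmm1 */
    return _mm_unpacklo_epi64(lo, hi);   /* punpcklqdq %xmm1, %xmm0 */
}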