https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101693
Bug ID: 101693
Summary: Terrible SIMD register allocation with a tight loop operating on 8 registers.
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ts.tomeksopel at gmail dot com
Target Milestone: ---

There are already a few reports about unnecessary register spilling, but this case also exhibits a lot of unnecessary juggling between registers.

See https://godbolt.org/z/da76fY1n7 and https://www.reddit.com/r/cpp_questions/comments/oui5tc/simd_what_to_do_when_your_compiler_forgets_how_to/

The gist is that there is a tight loop, executed a constant number of times (~64), that accumulates into 8 ymm registers, and only those 8 registers are used outside of the loop. The registers are zeroed before the loop, and a horizontal addition is performed after it.

GCC generates suboptimal code, whereas clang gets it right. GCC seems to emit unnecessary movs in a pattern following a -> b -> vpdpbusd to b -> a.

All versions on godbolt >= 8.1 seem to exhibit the issue, including trunk.