[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
   Target Milestone|---                         |8.0
         Resolution|---                         |FIXED

--- Comment #3 from Andrew Pinski ---
In GCC 8.1 and above we get:

        .cfi_startproc
        push    ebp
        .cfi_def_cfa_offset 8
        .cfi_offset 5, -8
        mov     ebp, esp
        .cfi_def_cfa_register 5
        mov     eax, DWORD PTR [ebp+8]
        pop     ebp
        .cfi_restore 5
        .cfi_def_cfa 4, 4
        vmovaps ymm0, YMMWORD PTR [eax]
        vaddsubps       ymm0, ymm0, ymm0
        ret

So fixed.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at
least for this testcase. Gcc's stack-alignment strategy seems to be improved
overall (not copying the return address when not needed), so probably it's
really fixed.

It's still present in 7.3.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling
AVX locals.

Gcc's stack-align sequence is over-complicated and ties up an extra register
for the whole function (add volatile to the local and see the -O3 code). Or at
least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP but
references 32-byte aligned locals from aligned RSP instead of unaligned RBP.

It used to copy the address of the return address to make a full copy of
ret-addr / saved-RBP for the aligned stack frame, which was super weird.
https://godbolt.org/g/RLJNtd. (With an alloca or something, gcc8 does the same
crazy stack-frame stuff as gcc7, otherwise it's much cleaner, like clang.)

The actual bug here is that it's not fully optimized away when it turns out
that no 32-byte spills / reloads from locals are left in the function.

gcc for x86-64 sometimes has a few leftover instructions like that in more
complex functions using __m256; this is not exclusively an i386 problem, but
it seems to happen more easily for 32-bit.