[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2021-12-14 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |8.0
 Resolution|--- |FIXED

--- Comment #3 from Andrew Pinski  ---
In GCC 8.1 and above we get:

.cfi_startproc
pushebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
mov ebp, esp
.cfi_def_cfa_register 5
mov eax, DWORD PTR [ebp+8]
pop ebp
.cfi_restore 5
.cfi_def_cfa 4, 4
vmovaps ymm0, YMMWORD PTR [eax]
vaddsubps   ymm0, ymm0, ymm0
ret

So fixed.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes  ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least
for this testcase.  Gcc's stack-alignment strategy seems to be improved overall
(not copying the return address when not needed), so probably it's really
fixed.

It's still present in 7.3.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling
AVX locals.

Gcc's stack-align sequence is over-complicated and ties up an extra register
for the whole function (add  volatile  to the local and see the -O3 code).  Or
at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP
but references 32-byte aligned locals from aligned RSP instead of unaligned
RBP.

It used to copy the address of the return address to make a full copy of
ret-addr / saved-RBP for the aligned stack frame, which was super weird.

https://godbolt.org/g/RLJNtd.  (With an alloca or something, gcc8 does the same
crazy stack-frame stuff as gcc7, otherwise it's much cleaner, like clang)



The actual bug here is that it's not fully optimized away when it turns out
that no 32-byte spills / reloads from locals are left in the function.

gcc for x86-64 sometimes has a few leftover instructions like that in more
complex functions using __m256; this is not exclusively an i386 problem, but
it's happens more easily for 32-bit it seems.