[Bug target/56676] unnecesary splitted load when using avx2

2023-12-17 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

Andrew Pinski  changed:

              What|Removed    |Added
          See Also|           |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98172
  Target Milestone|---        |11.0
            Status|UNCONFIRMED|RESOLVED
        Resolution|---        |FIXED

--- Comment #7 from Andrew Pinski  ---
The generic tuning was changed by r11-7115-gb80fefd626460f (PR 98172), so this is fixed.

[Bug target/56676] unnecesary splitted load when using avx2

2023-12-17 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

--- Comment #6 from Andrew Pinski  ---
GCC 11 produces:
```
_Z3fooPiS_:
.LFB0:
.cfi_startproc
vmovdqu (%rdi), %ymm2
vmovdqu 32(%rdi), %ymm3
vpmulld (%rsi), %ymm2, %ymm1
vpmulld 32(%rsi), %ymm3, %ymm0
vpaddd  %ymm0, %ymm1, %ymm1
vmovdqu 64(%rdi), %ymm4
vpmulld 64(%rsi), %ymm4, %ymm0
vpaddd  %ymm1, %ymm0, %ymm0
vmovdqu 96(%rdi), %ymm1
vpmulld 96(%rsi), %ymm1, %ymm1
vpaddd  %ymm0, %ymm1, %ymm1
vextracti128 $0x1, %ymm1, %xmm0
vpaddd  %xmm1, %xmm0, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vmovd   %xmm0, %eax
vzeroupper
ret
```


While GCC 10 produces:
```
_Z3fooPiS_:
.LFB0:
.cfi_startproc
vmovdqu (%rdi), %xmm3
vmovdqu (%rsi), %xmm4
vinserti128 $0x1, 16(%rdi), %ymm3, %ymm1
vinserti128 $0x1, 16(%rsi), %ymm4, %ymm0
vmovdqu 32(%rdi), %xmm5
vmovdqu 32(%rsi), %xmm6
vpmulld %ymm1, %ymm0, %ymm0
vmovdqu 64(%rdi), %xmm7
vmovdqu 64(%rsi), %xmm3
vinserti128 $0x1, 48(%rdi), %ymm5, %ymm2
vinserti128 $0x1, 48(%rsi), %ymm6, %ymm1
vmovdqu 96(%rsi), %xmm4
vmovdqu 96(%rdi), %xmm5
vpmulld %ymm2, %ymm1, %ymm1
vinserti128 $0x1, 80(%rdi), %ymm7, %ymm2
vpaddd  %ymm1, %ymm0, %ymm0
vinserti128 $0x1, 80(%rsi), %ymm3, %ymm1
vpmulld %ymm2, %ymm1, %ymm1
vinserti128 $0x1, 112(%rsi), %ymm4, %ymm2
vpaddd  %ymm0, %ymm1, %ymm0
vinserti128 $0x1, 112(%rdi), %ymm5, %ymm1
vpmulld %ymm2, %ymm1, %ymm1
vpaddd  %ymm0, %ymm1, %ymm1
vmovdqa %xmm1, %xmm0
vextracti128 $0x1, %ymm1, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vpsrldq $8, %xmm0, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vpsrldq $4, %xmm0, %xmm1
vpaddd  %xmm1, %xmm0, %xmm0
vmovd   %xmm0, %eax
vzeroupper
ret
```
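
For orientation, both listings look like the vectorized form of a fixed-length integer dot product. The mangled name `_Z3fooPiS_` demangles to `foo(int*, int*)`, and the four 32-byte loads per pointer cover 32 `int` elements. The original testcase is not quoted in this part of the thread, so the following is only a reconstruction under those assumptions; the trip count and the parameter names are hypothetical.

```cpp
// Hypothetical reconstruction of the testcase, not the reporter's source.
// The trip count of 32 is inferred from the 128 bytes loaded per pointer
// in the listings. To inspect the generated code, something like
//   g++ -O3 -mavx2 -S foo.cpp
// can be used (the exact flags of the original report are an assumption).
int foo(int *a, int *b)
{
    int sum = 0;
    for (int i = 0; i < 32; ++i)
        sum += a[i] * b[i];   // vectorized as vpmulld + vpaddd
    return sum;
}
```

The tail of both listings (`vextracti128`, `vpsrldq`, `vpaddd`, `vmovd`) is the horizontal reduction of the eight partial sums into the scalar return value; the listings differ only in how the inputs are loaded.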

[Bug target/56676] unnecesary splitted load when using avx2

2016-10-19 Thread rivanvx at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

Vedran Miletic  changed:

  What|Removed |Added
    CC|        |rivanvx at gmail dot com

--- Comment #5 from Vedran Miletic  ---
Confirmed still affecting GCC 6.2.1. Similar C++ example:

#include <numeric>
#include <vector>
float f(std::vector<float>& A, std::vector<float>& B)
{
  return std::inner_product(A.begin(), A.end(), B.begin(), 0.f);
}

[Bug target/56676] unnecesary splitted load when using avx2

2013-03-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

--- Comment #1 from Richard Biener rguenth at gcc dot gnu.org 2013-03-21 13:30:42 UTC ---
I believe we split unaligned loads by default because that's faster for generic
tuning.
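
To make "split" concrete, here is a minimal sketch of the two load forms written with AVX2 intrinsics. It is an illustration, not code from GCC: the helper names `load_whole`, `load_split` and `loads_agree` are hypothetical, and it assumes an x86 target with GCC or Clang (for `__attribute__((target))` and `__builtin_cpu_supports`). In GCC the behaviour itself is governed by the x86 option `-mavx256-split-unaligned-load`.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// One unaligned 32-byte load: a single "vmovdqu mem, %ymm".
__attribute__((target("avx2")))
static __m256i load_whole(const std::int32_t *p)
{
    return _mm256_loadu_si256(reinterpret_cast<const __m256i *>(p));
}

// The same 32 bytes fetched as two 16-byte halves and glued together,
// i.e. the "vmovdqu mem, %xmm" + "vinserti128 $0x1" pattern.
__attribute__((target("avx2")))
static __m256i load_split(const std::int32_t *p)
{
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p + 4));
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

__attribute__((target("avx2")))
static bool loads_agree_avx2(const std::int32_t *p)
{
    std::int32_t w[8], s[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(w), load_whole(p));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(s), load_split(p));
    return std::memcmp(w, s, sizeof w) == 0;
}

// Returns true if both forms yield the same 8 ints. The check is skipped
// (and true returned) when the running CPU has no AVX2.
bool loads_agree(const std::int32_t *p)
{
    if (!__builtin_cpu_supports("avx2"))
        return true;
    return loads_agree_avx2(p);
}
```

Both forms produce the same register contents; the difference is purely in instruction count and load-port usage. An optimizing compiler is free to merge the split form back into a single load, so the generated assembly should be inspected rather than assumed.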


[Bug target/56676] unnecesary splitted load when using avx2

2013-03-21 Thread neleai at seznam dot cz
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

--- Comment #2 from Ondrej Bilka neleai at seznam dot cz 2013-03-21 14:53:26 UTC ---
On Thu, Mar 21, 2013 at 01:30:42PM +, rguenth at gcc dot gnu.org wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676
>
> --- Comment #1 from Richard Biener rguenth at gcc dot gnu.org 2013-03-21 13:30:42 UTC ---
>
> I believe we split unaligned loads by default because that's faster for
> generic tuning.

I used AVX2, which is far from generic. Right now only Haswell supports it.
The documentation says Haswell can do two 32-byte loads per cycle. Unless
32-byte loads have higher latency, they will be more efficient.


[Bug target/56676] unnecesary splitted load when using avx2

2013-03-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

--- Comment #3 from Richard Biener rguenth at gcc dot gnu.org 2013-03-21 15:11:14 UTC ---
Well, while that's true, we don't adjust tuning based on it.  Use
-march=core-avx2 instead.


[Bug target/56676] unnecesary splitted load when using avx2

2013-03-21 Thread izamyatin at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56676

Igor Zamyatin izamyatin at gmail dot com changed:

  What|Removed |Added
    CC|        |izamyatin at gmail dot com

--- Comment #4 from Igor Zamyatin izamyatin at gmail dot com 2013-03-21 15:18:24 UTC ---
We (at Intel) tried removing the splitting for AVX2 but saw no reasonable
gains in general.