[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #21 from Michael_S --- (In reply to Mason from comment #20) > Doh! You're right. > I come from a background where overlapping/aliasing inputs are heresy, > thus got blindsided :( > > This would be the optimal code, right? > >

[Bug target/105617] [12/13/14 Regression] Slp is maybe too aggressive in some/many cases

2023-06-07 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #19 from Michael_S --- (In reply to Mason from comment #18) > Hello Michael_S, > > As far as I can see, massaging the source helps GCC generate optimal code > (in terms of instruction count, not convinced about scheduling). > >

[Bug libgcc/108279] Improved speed for float128 routines

2023-02-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #24 from Michael_S --- (In reply to Michael_S from comment #22) > (In reply to Michael_S from comment #8) > > (In reply to Thomas Koenig from comment #6) > > > And there will have to be a decision about 32-bit targets. > > > > > >

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #23 from Michael_S --- (In reply to Jakub Jelinek from comment #19) > So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 > sfp-machine.h > #define FP_INIT_ROUNDMODE \ > do {

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #22 from Michael_S --- (In reply to Michael_S from comment #8) > (In reply to Thomas Koenig from comment #6) > > And there will have to be a decision about 32-bit targets. > > > > IMHO, 32-bit targets should be left in their

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #16 from Michael_S --- (In reply to Jakub Jelinek from comment #15) > libquadmath is not needed nor useful on aarch64-linux, because long double > type there is already IEEE 754 quad. That's good to know. Thank you. If you are

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #12 from Michael_S --- (In reply to Thomas Koenig from comment #10) > What we would need for incorporation into gcc is to have several > functions, which would then called depending on which floating point > options are in force at

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #11 from Michael_S --- (In reply to Thomas Koenig from comment #9) > Created attachment 54273 [details] > matmul_r16.i > > Here is matmul_r16.i from a relatively recent trunk. Thank you. Unfortunately, I was not able to link it

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #8 from Michael_S --- (In reply to Thomas Koenig from comment #6) > (In reply to Michael_S from comment #5) > > Hi Thomas > > Are you in or out? > > Depends a bit on what exactly you want to do, and if there is > a chance that what

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #7 from Michael_S --- Either here or my yahoo e-mail

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #5 from Michael_S --- Hi Thomas Are you in or out? If you are still in, I can use your help on several issues. 1. Torture. See if Invalid Operand exception raised properly now. Also if there are still remaining problems with NaN.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-04 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279 --- Comment #4 from Michael_S --- (In reply to Jakub Jelinek from comment #2) > From what I can see, they are certainly not portable. > E.g. the relying on __int128 rules out various arches (basically all 32-bit > arches, > ia32, powerpc 32-bit

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #22 from Michael_S --- (In reply to Alexander Monakov from comment #21) > (In reply to Michael_S from comment #19) > > > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be > > > 'unlaminated' (turned to 2 uops before

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #20 from Michael_S --- (In reply to Richard Biener from comment #17) > (In reply to Michael_S from comment #16) > > On unrelated note, why loop overhead uses so many instructions? > > Assuming that I am as misguided as gcc about

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-26 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #19 from Michael_S --- (In reply to Alexander Monakov from comment #18) > The apparent 'bias' is introduced by instruction scheduling: haifa-sched > lifts a +64 increment over memory accesses, transforming +0 and +32 > displacements

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #16 from Michael_S --- On unrelated note, why loop overhead uses so many instructions? Assuming that I am as misguided as gcc about load-op combining, I would write it as: sub %rax, %rdx .L3: vmovupd (%rdx,%rax), %ymm1

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2022-11-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #14 from Michael_S --- I tested a smaller test bench from Comment 3 with gcc trunk on godbolt. Issue appears to be only partially fixed. -Ofast result is no longer a horror that it was before, but it is still not as good as -O3 or

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-07-29 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #15 from Michael_S --- (In reply to Richard Biener from comment #14) > (In reply to Michael_S from comment #12) > > On related note... > > One of the historical good features of gcc relatively to other popular > > compilers was

[Bug target/106220] x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220 --- Comment #3 from Michael_S --- -march=haswell is not very important. I added it only because, in the absence of the BMI extension, the issue is somewhat obscured by the need to keep the shift count in the CL register. -O2 is also not important; -O3 gives the same result.

[Bug c/106220] New: x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity

2022-07-06 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106220 Bug ID: 106220 Summary: x86-64 optimizer forgets about shrd peephole optimization pattern when faced with more than one in close proximity Product: gcc Version:

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #23 from Michael_S --- (In reply to jos...@codesourcery.com from comment #22) > On Mon, 13 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote: > > > > The function should be sqrtf128 (present in glibc

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-13 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #21 from Michael_S --- (In reply to jos...@codesourcery.com from comment #20) > On Sat, 11 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote: > > > On MSYS2 _Float128 and __float128 appears to be mostly th

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-11 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #19 from Michael_S --- (In reply to jos...@codesourcery.com from comment #18) > libquadmath is essentially legacy code. People working directly in C > should be using the C23 _Float128 interfaces and *f128 functions, as in >

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #17 from Michael_S --- (In reply to Jakub Jelinek from comment #15) > From what I can see, it is mostly integral implementation and we already > have one such in GCC, so the question is if we just shouldn't use it (most > of the

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-06-10 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #16 from Michael_S --- (In reply to Thomas Koenig from comment #14) > @Michael: Now that gcc 12 is out of the door, I would suggest we try to get > your code into the gcc tree for gcc 13. > > It should follow the gcc style

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #12 from Michael_S --- On related note... One of the historical good features of gcc relatively to other popular compilers was absence of auto-vectorization at -O2. When did you decide to change it and why?

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-17 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #11 from Michael_S --- (In reply to Richard Biener from comment #10) > (In reply to Hongtao.liu from comment #9) > > (In reply to Hongtao.liu from comment #8) > > > (In reply to Hongtao.liu from comment #7) > > > > Hmm, we have

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #6 from Michael_S --- (In reply to Michael_S from comment #5) > > Even scalar-to-scalar or vector-to-vector moves that are hoisted at renamer > does not have a zero cost, because quite often renamer itself constitutes > the

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #5 from Michael_S --- (In reply to Richard Biener from comment #3) > We are vectorizing the store it dst[] now at -O2 since that appears > profitable: > > t.c:10:10: note: Cost model analysis: > r0.0_12 1 times scalar_store costs

[Bug target/105617] [12/13 Regression] Slp is maybe too aggressive in some/many cases

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 --- Comment #4 from Michael_S --- (In reply to Andrew Pinski from comment #1) > This is just the vectorizer still being too aggressive right before a return. > It is a cost model issue and it might not really be an issue in the final > code

[Bug target/105617] New: Regression in code generation for _addcarry_u64()

2022-05-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105617 Bug ID: 105617 Summary: Regression in code generation for _addcarry_u64() Product: gcc Version: 12.1.0 Status: UNCONFIRMED Severity: normal Priority: P3

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #4 from Michael_S --- Created attachment 52925 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52925&action=edit build script

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #3 from Michael_S --- Created attachment 52924 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52924&action=edit Another test bench that shows lower impact on Zen3, but higher impact on some Intel CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #2 from Michael_S --- Created attachment 52923 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52923&action=edit test bench that shows lower impact on Zen3, but higher impact on some Intel CPUs

[Bug target/105468] Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 --- Comment #1 from Michael_S --- Created attachment 52922 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52922&action=edit test bench that demonstrates maximal impact on Zen3

[Bug target/105468] New: Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target.

2022-05-03 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105468 Bug ID: 105468 Summary: Suboptimal code generation for access of function parameters and return values of type __float128 on x86-64 Windows target. Product: gcc

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-21 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #13 from Michael_S --- It turned out that on all micro-architectures that I care about (and majority of those that I don't care) double precision floating point division is quite fast. It's so fast that it easily beats my clever

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-20 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #12 from Michael_S --- (In reply to Michael_S from comment #11) > (In reply to Michael_S from comment #10) > > BTW, the same ideas as in the code above could improve speed of division > > operation (on modern 64-bit HW) by factor of

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-18 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #11 from Michael_S --- (In reply to Michael_S from comment #10) > BTW, the same ideas as in the code above could improve speed of division > operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD). Did it. On Intel

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #10 from Michael_S --- BTW, the same ideas as in the code above could improve speed of division operation (on modern 64-bit HW) by factor of 3 (on Intel) or 2 (on AMD).

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 --- Comment #9 from Michael_S --- (In reply to Michael_S from comment #4) > If you want quick fix for immediate shipment then you can take that: > > #include > #include > > __float128 quick_and_dirty_sqrtq(__float128 x) > { > if

[Bug libquadmath/105101] incorrect rounding for sqrtq

2022-04-02 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101 Michael_S changed: What|Removed |Added CC||already5chosen at yahoo dot com ---

[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-19 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #10 from Michael_S --- I lost track of what you're talking about long time ago. But that's o.k.

[Bug target/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 --- Comment #3 from Michael_S --- (In reply to Richard Biener from comment #2) > It's again reassociation making a mess out of the natural SLP opportunity > (and thus SLP discovery fails miserably). > > One idea worth playing with would be to

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #9 from Michael_S --- Despite what I wrote above, I did take a look at the trunk on godbolt with the same old code from a year ago. Because it was so easy. And indeed the trunk looks A LOT better. But until it's released I wouldn't know if

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #8 from Michael_S --- (In reply to Jakub Jelinek from comment #7) > (In reply to Michael_S from comment #5) > > I agree with regard to "other targets", first of all, aarch64, but x86_64 > > variant of gcc already provides requested

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 --- Comment #6 from Michael_S --- (In reply to Marc Glisse from comment #1) > We could start with the simpler: > > void f(unsigned*__restrict__ r,unsigned*__restrict__ s,unsigned a,unsigned > b,unsigned c, unsigned d){ > *r=a+b; >
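
For readers unfamiliar with the idiom this bug is about, here is a minimal sketch of a two-limb addition with the carry detected in plain C. It is not the code from the truncated comment above; the type `u128_limbs` and the function `add128` are hypothetical names chosen for illustration. Whether a particular GCC version compiles the comparison into an add-with-carry instruction is the open question of the bug, so nothing is claimed here about the generated code.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit unsigned value held as two 64-bit limbs. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_limbs;

/* Add two 128-bit values limb by limb, propagating the carry by hand. */
u128_limbs add128(u128_limbs a, u128_limbs b)
{
    u128_limbs r;
    r.lo = a.lo + b.lo;               /* unsigned add wraps modulo 2^64 */
    uint64_t carry = (r.lo < a.lo);   /* 1 exactly when the low add wrapped */
    r.hi = a.hi + b.hi + carry;       /* carry out of the high limb is dropped */
    return r;
}
```

The comparison `r.lo < a.lo` works because an unsigned sum is smaller than either operand only when it has wrapped around.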

[Bug target/79173] add-with-carry and subtract-with-borrow support (x86_64 and others)

2020-11-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79173 Michael_S changed: What|Removed |Added CC||already5chosen at yahoo dot com --- Comment

[Bug target/97832] New: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

2020-11-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832 Bug ID: 97832 Summary: AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3 Product: gcc Version: 10.2.0 Status: UNCONFIRMED Severity:

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-16 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #9 from Michael_S --- Hopefully, you did regression tests for all main AoS<->SoA cases. I.e. typedef struct { double re, im; } dcmlx_t; void soa2aos(double* restrict dstRe, double* restrict dstIm, const dcmlx_t src[], int nq) {
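
As an illustration of the kind of AoS<->SoA conversion the comment refers to, here is a minimal sketch of both directions. Only the `dcmlx_t` layout is taken from the snippet; the function names `split_complex` and `merge_complex`, the `size_t` length parameter, and the loop bodies are assumptions, since the original code is cut off in this listing.

```c
#include <stddef.h>

/* Same layout as in the snippet: one interleaved complex value. */
typedef struct { double re, im; } dcmlx_t;

/* Hypothetical name: array-of-structures -> two separate planes. */
void split_complex(double *restrict dst_re, double *restrict dst_im,
                   const dcmlx_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst_re[i] = src[i].re;
        dst_im[i] = src[i].im;
    }
}

/* Hypothetical name: two separate planes -> array-of-structures. */
void merge_complex(dcmlx_t *restrict dst,
                   const double *src_re, const double *src_im, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i].re = src_re[i];
        dst[i].im = src_im[i];
    }
}
```

Loops of this shape are natural candidates for vectorizer regression tests because the strided accesses have to be turned into shuffles.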

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #6 from Michael_S --- (In reply to Richard Biener from comment #4) > > while the lack of cross-lane shuffles in AVX2 requires a > > .L3: > vmovupd (%rsi,%rax), %xmm5 > vmovupd 32(%rsi,%rax), %xmm6 >

[Bug tree-optimization/97428] -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-15 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 --- Comment #5 from Michael_S --- (In reply to Richard Biener from comment #4) > I have a fix that, with -mavx512f generates just > > .L3: > vmovupd (%rcx,%rax), %zmm0 > vpermpd (%rsi,%rax), %zmm1, %zmm2 > vpermpd %zmm0,

[Bug target/97428] New: -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic

2020-10-14 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428 Bug ID: 97428 Summary: -O3 is great for basic AoSoA packing of complex arrays, but horrible one step above the basic Product: gcc Version: 10.2.0 Status: UNCONFIRMED

[Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-09 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343 --- Comment #2 from Michael_S --- (In reply to Richard Biener from comment #1) > All below for Part 2. > > Without -ffast-math you are seeing GCC using in-order reductions now while > with -ffast-math the vectorizer gets a bit confused about

[Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

2020-10-08 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343 Bug ID: 97343 Summary: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product Product: gcc Version: 10.2.0 Status: UNCONFIRMED

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #15 from Michael_S --- (In reply to Hongtao.liu from comment #14) > > Still I don't understand why compiler does not compare the cost of full loop > > body after combining to the cost before combining and does not come to > >

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #13 from Michael_S --- (In reply to Hongtao.liu from comment #11) > (In reply to Michael_S from comment #10) > > (In reply to Hongtao.liu from comment #9) > > > (In reply to Michael_S from comment #8) > > > > What are values of gcc

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-24 Thread already5chosen at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127 --- Comment #10 from Michael_S --- (In reply to Hongtao.liu from comment #9) > (In reply to Michael_S from comment #8) > > What are values of gcc "loop" cost of the relevant instructions now? > > 1. AVX256 Load > > 2. FMA3 ymm,ymm,ymm > > 3.