[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2018-12-31 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510 Devin Hussey changed: Summary: "GCC generates inefficient U64x2 scalar multiply for NEON32" -> "GCC generates inefficient U64x2/v2di scalar multiply for NEON32"

[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.

2019-01-02 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605 --- Comment #4 from Devin Hussey --- I also want to note that LLVM is probably a good place to look. They have been pushing to remove as many intrinsic builtins as they can in favor of idiomatic code. This has multiple advantages: 1. You can
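
As a rough illustration of that direction (my own example, not code from the LLVM discussion): the generic-vector spelling lets one source line serve every target instead of a per-target builtin.

#include <stdint.h>

typedef uint32_t u32x4 __attribute__((vector_size(16)));

/* Rather than calling _mm_add_epi32() or vaddq_u32() directly, the operation
   is written generically and the backend picks the instruction. */
u32x4 add4(u32x4 a, u32x4 b) {
    return a + b;
}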

[Bug c/88698] Relax generic vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698 --- Comment #7 from Devin Hussey --- I mean, sure, but how about this? What about meeting in the middle? -fno-lax-vector-conversions generates errors like it does now. -flax-vector-conversions shuts GCC up. No flag causes warnings on
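
For illustration (my own example, not taken from the bug), this is the kind of conversion those flags gate:

#include <emmintrin.h>

typedef int v4si __attribute__((vector_size(16)));

v4si as_v4si(__m128i x) {
    return x;   /* same 16-byte size, different element type: an error by
                   default, accepted with -flax-vector-conversions */
}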

[Bug target/88705] New: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON

2019-01-04 Thread husseydevin at gmail dot com
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- For some reason, GCC scalarizes float32x4_t and float64x2_t on ARM32 NEON when using vector extensions
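
A minimal reproducer for the float32x4 case is presumably along these lines (my sketch, not taken from the report; built for ARM32 with -mfpu=neon):

typedef float f32x4 __attribute__((vector_size(16)));

f32x4 addf(f32x4 a, f32x4 b) {
    return a + b;   /* reportedly expands to four scalar float adds instead of
                       a single q-register vadd.f32 */
}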

[Bug target/88705] [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705 Devin Hussey changed: Status: RESOLVED -> UNCONFIRMED; Resolution: INVALID -> (cleared)

[Bug middle-end/88670] [meta-bug] generic vector extension issues

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88670 Bug 88670 depends on bug 88705, which changed state. Bug 88705 Summary: [ARM][Generic Vector Extensions] float32x4/float64x2 vector operator overloads scalarize on NEON https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88705

[Bug c/88698] Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698 --- Comment #5 from Devin Hussey --- Well, if we are aiming for strict compliance, we might as well throw out every GCC extension in existence (including vector extensions); those aren't strictly compliant with the C/C++ standard. /s The whole point

[Bug c/88698] Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698 --- Comment #2 from Devin Hussey --- What I am saying is that I think -flax-vector-conversions should be default, or we should only have minimal warnings instead of errors. That will make generic vectors much easier to use. It is to be noted

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052 --- Comment #6 from Devin Hussey --- The patch seems to be working.
typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));
u64x2 cvt(u32x2 in) { return
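
The snippet is cut off; presumably the test case continues with the new builtin, along these lines (reconstructed, not copied from the bug):

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 cvt(u32x2 in) {
    return __builtin_convertvector(in, u64x2);   /* zero-extend each 32-bit lane to 64 bits */
}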

[Bug c/88698] Relax generic vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88698 --- Comment #10 from Devin Hussey --- Well, what about a special type attribute or some kind of transparent_union-like thing for Intel's types? It seems that Intel's is the main (only) platform whose intrinsics use generic types.
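
For context, GCC's own headers already define Intel's SSE types as generic vectors, roughly as below (paraphrased from emmintrin.h; exact attributes vary by version), which is why they interconvert with user-defined vector types at all:

typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));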

[Bug c++/85052] Implement support for clang's __builtin_convertvector

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85052 --- Comment #7 from Devin Hussey --- Wait, silly me: this isn't about optimizations, this is about patterns. It does the same thing it was doing for this code:
typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long

[Bug target/85048] [missed optimization] vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048 Devin Hussey changed: CC: added husseydevin at gmail dot com

[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2019-01-14 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510 --- Comment #4 from Devin Hussey --- I am deciding to refer to goodmul as ssemul from now on. I think it is a better name. I am also wondering if Aarch64 gets a benefit from this vs. scalarizing if the value is already in a NEON register. I
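
The ssemul/goodmul_sse sequence itself is not quoted in this archive, but the usual SSE way to build a per-lane 64x64 multiply out of 32-bit multiplies, which is presumably what the name refers to, looks roughly like this (my reconstruction, not the reporter's code; needs -msse4.1):

#include <immintrin.h>

/* a*b per 64-bit lane = aL*bL + ((aL*bH + aH*bL) << 32) */
__m128i ssemul_v2di(__m128i a, __m128i b) {
    __m128i bswap  = _mm_shuffle_epi32(b, 0xB1);                    /* swap 32-bit halves in each lane */
    __m128i cross  = _mm_mullo_epi32(a, bswap);                     /* aL*bH, aH*bL (SSE4.1) */
    __m128i crossq = _mm_hadd_epi32(cross, _mm_setzero_si128());    /* aL*bH + aH*bL per lane (SSSE3) */
    __m128i hi     = _mm_shuffle_epi32(crossq, 0x73);               /* place each sum in the high half */
    __m128i lo     = _mm_mul_epu32(a, b);                           /* full 64-bit aL*bL */
    return _mm_add_epi64(lo, hi);
}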

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963 Devin Hussey changed: CC: added husseydevin at gmail dot com

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963 --- Comment #9 from Devin Hussey ---
(In reply to Andrew Pinski from comment #6)
> Try using 128 (or 256) and you might see that aarch64 falls down similarly.
yup. Oof.
test:
sub sp, sp, #560
stp x29, x30, [sp]
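
The test triggering that prologue is presumably just an oversized generic vector, something like the following (my guess at the reproducer, using vector_size(128)):

typedef unsigned v32su __attribute__((vector_size(128)));

v32su test(v32su a, v32su b) {
    return a + b;   /* no native 128-byte vectors, so the operation is split and spilled */
}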

[Bug tree-optimization/88605] New: vector extensions: Widening or conversion generates inefficient or scalar code.

2018-12-26 Thread husseydevin at gmail dot com
Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- If you want to, say, convert a u32x2 vector to a u64x2 while avoiding intrinsics, good luck. GCC doesn't
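
The code in question is presumably a hand-written widening like this (my sketch, not the report's own example):

typedef unsigned u32x2 __attribute__((vector_size(8)));
typedef unsigned long long u64x2 __attribute__((vector_size(16)));

u64x2 widen(u32x2 v) {
    /* Written lane by lane because the extensions have no widening operator;
       ideally this would become a single vmovl.u32 / pmovzxdq. */
    return (u64x2){ v[0], v[1] };
}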

[Bug tree-optimization/88605] vector extensions: Widening or conversion generates inefficient or scalar code.

2018-12-27 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88605 --- Comment #2 from Devin Hussey --- While __builtin_convertvector would improve the situation, the main issue here is the blindness to some obvious patterns. If I write this code, I want either pmovzxdq or vmovl. I don't want to waste time with

[Bug target/88510] New: GCC generates inefficient U64x2 scalar multiply for NEON32

2018-12-14 Thread husseydevin at gmail dot com
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- Note: I use these typedefs here for brevity.
typedef uint64x2_t U64x2;
typedef uint32x2_t U32x2;
typedef uint32x2x2_t U32x2x2;
typedef
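
The snippet breaks off in the typedef list; the kernel under discussion presumably reduces to a lane-wise 64-bit multiply along these lines (hypothetical function name, my sketch):

#include <arm_neon.h>
typedef uint64x2_t U64x2;

U64x2 mult(U64x2 a, U64x2 b) {
    return a * b;   /* v2di multiply: NEON32 has no 64-bit lane multiply, so GCC
                       scalarizes instead of synthesizing it from 32-bit multiplies */
}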

[Bug target/88255] New: Thumb-1: GCC too aggressive on mul->lsl/sub/add optimization

2018-11-28 Thread husseydevin at gmail dot com
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- I might be wrong, but it appears that GCC is too aggressive in its conversion from multiplication to shift+add when targeting Thumb-1
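
A sketch of the complaint (hypothetical constant, not one from the report; whether GCC synthesizes shifts depends on the constant and options):

unsigned scale(unsigned x) {
    /* On Thumb-1, expanding this into a long lsl/add/sub chain can easily be
       larger and slower than a constant load plus a single muls. */
    return x * 0x9E3779B1u;
}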

[Bug target/88510] GCC generates inefficient U64x2/v2di scalar multiply for NEON32

2019-01-03 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88510 --- Comment #2 from Devin Hussey --- Update: I did the calculations, and twomul has the same cycle count as goodmul_sse. vmul.i32 with 128-bit operands takes 4 cycles (I assumed it was two), so just like goodmul_sse, it takes 11 cycles.
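
For reference, one plausible NEON32 shape of the twomul idea, reconstructed rather than copied from the bug, is:

#include <arm_neon.h>

/* a*b per 64-bit lane = aL*bL + ((aL*bH + aH*bL) << 32) */
uint64x2_t twomul(uint64x2_t a, uint64x2_t b) {
    uint32x2_t a_lo   = vmovn_u64(a);                                  /* low 32 bits of each lane */
    uint32x2_t b_lo   = vmovn_u64(b);
    uint32x4_t b_swap = vrev64q_u32(vreinterpretq_u32_u64(b));         /* swap 32-bit halves per lane */
    uint32x4_t cross  = vmulq_u32(vreinterpretq_u32_u64(a), b_swap);   /* aL*bH, aH*bL */
    uint64x2_t sum    = vpaddlq_u32(cross);                            /* widening pairwise add */
    uint64x2_t hi     = vshlq_n_u64(sum, 32);
    return vmlal_u32(hi, a_lo, b_lo);                                  /* add full 64-bit aL*bL */
}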

[Bug c/88698] New: Relax generic vector conversions

2019-01-04 Thread husseydevin at gmail dot com
Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- GCC is far too strict about vector conversions. Currently, mixing generic vector extensions and platform-specific intrinsics almost always requires either a cast or -flax-vector-conversions
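
A typical example of the friction (my own, not from the report):

#include <arm_neon.h>

typedef unsigned u32x4 __attribute__((vector_size(16)));

uint32x4_t mix(u32x4 a, uint32x4_t b) {
    /* vaddq_u32 wants uint32x4_t; passing the generic vector needs an explicit
       cast (or -flax-vector-conversions) even though the layouts are identical. */
    return vaddq_u32((uint32x4_t)a, b);
}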

[Bug target/88963] gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-22 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963 --- Comment #10 from Devin Hussey --- I also want to add that aarch64 shouldn't even be spilling; it has 32 NEON registers, and with 128-byte vectors it should only use 24 (each 128-byte vector spans 8 q registers, so two inputs plus the result is 3 x 8 = 24).

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-27 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418 --- Comment #8 from Devin Hussey --- Seems to work.
~ $ ~/gcc-test/bin/x86_64-pc-cygwin-gcc.exe -mavx2 -O3 _mm_sllv_bug.c
~ $ ./a.exe
Without optimizations (correct result): 8000 fff8
With optimizations (incorrect

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418 --- Comment #3 from Devin Hussey --- I think I found the culprit commit. Haven't set up a GCC build tree yet, though. https://github.com/gcc-mirror/gcc/commit/a51c4926712307787d133ba50af8c61393a9229b

[Bug regression/93418] New: GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
Component: regression Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- Regression starting in GCC 9 Currently, GCC constant propagates the AVX2 _mm_sllv family with constant amounts to only shift by the first element instead
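
A reproducer is presumably along these lines (my sketch, not the reporter's test case; built with -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i v = _mm_set1_epi32(1);
    __m128i c = _mm_setr_epi32(0, 1, 2, 3);        /* per-lane shift counts */
    unsigned out[4];
    _mm_storeu_si128((__m128i *)out, _mm_sllv_epi32(v, c));
    /* expected: 1 2 4 8; the bad constant fold shifts every lane by c[0] = 0 */
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);
    return 0;
}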

[Bug target/93418] [9/10 Regression] GCC incorrectly constant propagates _mm_sllv/srlv/srav

2020-01-24 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93418 Devin Hussey changed: Build: set to 2020-01-24 0:00 --- Comment #5 from

[Bug middle-end/103781] New: [AArch64, 11 regr.] Failed partial vectorization of mulv2di3

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- As of GCC 11, the AArch64 backend is very greedy in trying to vectorize mulv2di3. However, there is no mulv2di3 routine so
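
A guess at the shape of code that trips this (not taken from the report):

#include <stdint.h>

void mul2(uint64_t *restrict dst, const uint64_t *restrict src) {
    dst[0] *= src[0];   /* reportedly SLP-vectorized into a v2di multiply, */
    dst[1] *= src[1];   /* which AArch64 then has to synthesize expensively */
}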

[Bug target/103781] Cost model for SLP for aarch64 is not so good still

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781 --- Comment #2 from Devin Hussey --- Yeah, my bad, I meant SLP; I get them mixed up all the time.

[Bug target/103781] generic/cortex-a53 cost model for SLP for aarch64 is good

2021-12-20 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103781 --- Comment #4 from Devin Hussey --- Makes sense because the multiplier is what, 5 cycles on an A53?

[Bug rtl-optimization/103641] New: [aarch64][11 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- Created attachment 51966 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51966&action=edit aarch64-linux-

[Bug middle-end/103641] [11/12 regression] Severe compile time regression in SLP vectorize step

2021-12-10 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103641 --- Comment #19 from Devin Hussey ---
> The new costs on AArch64 have a vector multiplication cost of 4, which is
> very reasonable.
Would this include mulv2di3 by any chance? Because another thing I noticed is that GCC is also trying to

[Bug target/110013] New: [i386] vector_size(8) on 32-bit ABI

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
Assignee: unassigned at gcc dot gnu.org Reporter: husseydevin at gmail dot com Target Milestone: --- Closely related to bug 86541, which was fixed on x64 only. On 32-bit, GCC passes any vector_size(8) vectors to external functions in MMX registers, similar to how it passes 16
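
A sketch of the situation as described (my own example, not the report's test case; built with -m32):

#include <stdint.h>

typedef uint32_t u32x2 __attribute__((vector_size(8)));

u32x2 callee(u32x2 x);        /* external function */

u32x2 caller(u32x2 x) {
    /* Per the report, on the 32-bit ABI the argument and return value travel
       in MMX registers (mm0), with no emms management around the call. */
    return callee(x + x);
}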

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013 --- Comment #1 from Devin Hussey --- As a side note, the official psABI does say that function call parameters use MM0-MM2; if Clang follows its own rules instead, then the supposed stability of the ABI is meaningless.

[Bug target/110013] [i386] vector_size(8) on 32-bit ABI emits broken MMX

2023-05-27 Thread husseydevin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110013 --- Comment #2 from Devin Hussey --- Scratch that. There is a somewhat easy way to fix this following psABI AND using MMX with SSE. Upon calling a function, we can have the following sequence
func:
movdq2q mm0, xmm0
movq mm1,