[Bug target/99195] Optimise away vec_concat of 64-bit AdvancedSIMD operations with zeroes in aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99195 ktkachov at gcc dot gnu.org changed: What|Removed |Added Known to work||14.0 Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #21 from ktkachov at gcc dot gnu.org --- I think all the straightforward cases are handled and the infrastructure for doing this is added. Any future improvements in the area should be tracked separately. Marking as fixed for GCC 14.1
[Bug rtl-optimization/113019] [NOT A BUG] Multi-architecture binaries for Linux
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113019 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #1 from ktkachov at gcc dot gnu.org --- GCC provides the Function Multiversioning feature that's supported on some architectures: https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html That seems to do what you want?
[Bug middle-end/111782] New: [11/12/13/14 Regression] Extra move in complex double multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111782 Bug ID: 111782 Summary: [11/12/13/14 Regression] Extra move in complex double multiplication Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64

The testcase:

__complex double foo (__complex double a, __complex double b)
{
  return a * b;
}

With GCC trunk at -Ofast I see on aarch64:

foo(double _Complex, double _Complex):
        fmov    d31, d1
        fmul    d1, d1, d2
        fmadd   d1, d0, d3, d1
        fmul    d31, d31, d3
        fnmsub  d0, d0, d2, d31
        ret

with GCC 10 the codegen used to be tighter:

foo(double _Complex, double _Complex):
        fmul    d4, d1, d3
        fmul    d5, d1, d2
        fmadd   d1, d0, d3, d5
        fnmsub  d0, d0, d2, d4
        ret

There's an extra fmov emitted on trunk. I noticed this regressed with the GCC 11 series.
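For reference, the scalar expansion both sequences implement is the textbook complex product; a minimal sketch (under -Ofast the NaN/Inf recovery path is dropped, which is why the whole operation reduces to multiplies plus one fused multiply-add and one fused multiply-subtract, matching the fmul/fmadd/fnmsub above):

```c
#include <complex.h>

/* Textbook expansion of (ar + ai*I) * (br + bi*I):
   real part ar*br - ai*bi (an fmul feeding an fnmsub),
   imag part ar*bi + ai*br (an fmul feeding an fmadd).  */
static double complex
cmul_expanded (double ar, double ai, double br, double bi)
{
  double re = ar * br - ai * bi;
  double im = ar * bi + ai * br;
  return re + im * I;
}
```

With four inputs and two independent fused chains, five floating-point registers suffice, so the GCC 10 sequence needs no extra move.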
[Bug target/111733] New: Emit inline SVE FSCALE instruction for ldexp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111733 Bug ID: 111733 Summary: Emit inline SVE FSCALE instruction for ldexp Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64 Having noticed https://github.com/llvm/llvm-project/pull/67552 in LLVM GCC should be able to emit the SVE fscale instruction [1] to implement the ldexp standard function. There is already an ldexpm3 optab defined so it should be a relatively simple matter of wiring up the expander for TARGET_SVE [1] https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FSCALE--Floating-point-adjust-exponent-by-vector--predicated--?lang=en
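For context, ldexp (x, n) scales x by 2^n purely by adjusting the floating-point exponent, which is exactly the per-lane operation FSCALE performs. A minimal sketch of the kind of loop that could then use the instruction per lane once the expander is wired up (the function name is illustrative, not from the report):

```c
#include <math.h>

/* Scalar reference: each ldexp call scales by a power of two without
   a general multiply.  With an SVE ldexpm3 expander for TARGET_SVE,
   a vectorized version of this loop is a natural FSCALE consumer.  */
void
scale_all (double *restrict out, const double *restrict x,
           const int *restrict n, int count)
{
  for (int i = 0; i < count; i++)
    out[i] = ldexp (x[i], n[i]);
}
```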
[Bug tree-optimization/111478] [12/13/14 regression] aarch64 SVE ICE: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111478 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Target Milestone|14.0|12.4 Priority|P3 |P1 --- Comment #3 from ktkachov at gcc dot gnu.org --- Marking as P1. We hit this with a Fortran reproducer:

      SUBROUTINE REPRODUCER( M, A, LDA )
      IMPLICIT NONE
      INTEGER LDA, M, I
      COMPLEX A( LDA, * )
      DO I = 2, M
         A( I, 1 ) = A( I, 1 ) / A( 1, 1 )
      END DO
      RETURN
      END

on aarch64 with -march=armv8-a+sve -O3. The ICE triggers on 12.3 but compiles fine with 12.2.
[Bug tree-optimization/111476] [14 regression] ICE when building Ruby 3.1.4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111476 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2023-09-19 CC||ktkachov at gcc dot gnu.org --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed. Reduced testcase:

int a, b, c, d;
void e() {
  int f, g, h;
  for (;;)
    switch (c) {
    case '-':
      if (!b) {
        if (a) {
          g = 0;
          goto i;
        }
        goto j;
      }
      for (; a;)
      i:
        g++;
      if (b)
        continue;
      f = 1;
      for (; f < g; f++) {
        b++;
        if (b)
          h *= 10;
      }
    }
j:
  d = h;
}
[Bug middle-end/111378] Missed optimization for comparing with exact_log2 constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111378 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2023-09-12 CC||ktkachov at gcc dot gnu.org --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. On aarch64 GCC generates:

test:
        mov     w2, 65535
        cmp     w1, w2
        bhi     .L2
        b       do_something
.L2:
        b       do_something_other

but LLVM generates the shorter:

test:                           // @test
        lsr     w8, w1, #16
        cbnz    w8, .LBB0_2
        b       do_something
.LBB0_2:
        b       do_something_other
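A minimal sketch of the equivalence LLVM exploits: for an unsigned 32-bit x, x > 0xffff holds exactly when some bit in the top 16 bits is set, so the range check can become a shift plus a branch on non-zero, with no constant to materialise:

```c
#include <stdint.h>

/* Comparison against the power-of-two-minus-one constant: needs
   65535 materialised in a register (mov + cmp + bhi).  */
static int
above_cmp (uint32_t x)
{
  return x > 0xffffu;
}

/* Equivalent form: any of the top 16 bits set?  (lsr + cbnz).  */
static int
above_shift (uint32_t x)
{
  return (x >> 16) != 0;
}
```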
[Bug web/111120] Rrrrr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111120 ktkachov at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |INVALID Status|UNCONFIRMED |RESOLVED --- Comment #1 from ktkachov at gcc dot gnu.org --- .
[Bug target/110280] internal compiler error: in const_unop, at fold-const.cc:1884
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110280 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Last reconfirmed||2023-06-16 Status|UNCONFIRMED |NEW Target|arm64 |aarch64 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed, reducing.
[Bug target/110235] [14 Regression] Wrong use of us_truncate in SSE and AVX RTL representation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110235 --- Comment #4 from ktkachov at gcc dot gnu.org --- (In reply to Hongtao.liu from comment #3) > (In reply to Hongtao.liu from comment #2) > > FAIL: gcc.target/i386/avx2-vpackssdw-2.c execution test > > > > This one is about sign saturation which should match rtl SS_TRUNCATE. > > I realize for 256-bit/512-bit vpackssdw, it's a 128-bit interleave of src1 > and src2, and then ss_truncate to the dest, not just vec_concat src1 and > src2. So the simplification exposed the bug. Thanks for looking at it. I think it'd make sense for someone with x86/sse/avx experience to rewrite the RTL representation of the patterns involved to match the correct semantics for saturation and lane behaviour. Alternatively, a quick solution would be to convert uses of us_truncate/ss_truncate in the problematic patterns to an x86-specific UNSPEC, which would make things work like they did before the simplification was added. That would be just a stop-gap solution, as it's better to use standard RTL operations where possible.
[Bug target/110235] New: Wrong use of us_truncate in SSE and AVX RTL representation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110235 Bug ID: 110235 Summary: Wrong use of us_truncate in SSE and AVX RTL representation Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org CC: uros at gcc dot gnu.org Target Milestone: --- Target: x86 After g:921b841350c4fc298d09f6c5674663e0f4208610 added constant-folding for SS_TRUNCATE and US_TRUNCATE some tests in i386.exp started failing:

FAIL: gcc.target/i386/avx-vpackuswb-1.c execution test
FAIL: gcc.target/i386/avx2-vpackssdw-2.c execution test
FAIL: gcc.target/i386/avx2-vpackusdw-2.c execution test
FAIL: gcc.target/i386/avx2-vpackuswb-2.c execution test
FAIL: gcc.target/i386/sse2-packuswb-1.c execution test

From what I can gather from the documentation for intrinsics like _mm_packus_epi16, the operation they perform is not what we model as us_truncate in RTL. That is, they don't perform a truncation while treating their input as an unsigned value. Rather, they treat the input as a signed value and saturate it to the unsigned min and max of the narrow mode before truncation. In that regard they seem similar to the SQMOVUN instructions in aarch64. I think it'd be best to change the representation of those instructions to a truncating clamp operation, similar to g:b747f54a2a930da55330c2861cd1e344f67a88d9 in aarch64.
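A scalar sketch of the element-wise semantics at issue (a reconstruction from the intrinsic documentation, not GCC code): the pack instructions read a *signed* element and clamp it into the unsigned range of the narrow type, whereas RTL us_truncate reads an *unsigned* element and saturates only against the narrow maximum, so the two disagree on inputs with the sign bit set:

```c
#include <stdint.h>

/* What _mm_packus_epi16 does per element: signed input, clamped
   to [0, 255], then truncated.  */
static uint8_t
packus_elem (int16_t v)
{
  if (v < 0)
    return 0;
  if (v > 255)
    return 255;
  return (uint8_t) v;
}

/* What RTL us_truncate models: unsigned input, saturated against
   the narrow-mode maximum only.  */
static uint8_t
us_truncate_elem (uint16_t v)
{
  return v > 255 ? 255 : (uint8_t) v;
}
```

The two agree on small non-negative inputs but diverge on a bit pattern like 0x8000 (0 versus 255), which is consistent with constant folding under us_truncate semantics breaking the execution tests above.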
[Bug target/110059] When SPEC is used to test the GCC (10.3.1), the test result of subitem 548 fluctuates abnormally.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110059 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #3 from ktkachov at gcc dot gnu.org --- 548.exchange2_r was improved in GCC 12 after PR98782 was fixed. I'd suggest you try out a later version of GCC
[Bug target/110039] [14 Regression] FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tw[0-9]+ 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110039 ktkachov at gcc dot gnu.org changed: What|Removed |Added Target Milestone|--- |14.0
[Bug target/110039] New: FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tw[0-9]+ 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110039 Bug ID: 110039 Summary: FAIL: gcc.target/aarch64/rev16_2.c scan-assembler-times rev16\\tw[0-9]+ 2 Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64 I think after g:d8545fb2c71683f407bfd96706103297d4d6e27b the test regresses on aarch64. We now generate:

__rev16_32_alt:
        rev     w0, w0
        ror     w0, w0, 16
        ret
__rev16_32:
        rev     w0, w0
        ror     w0, w0, 16
        ret

whereas before it was:

__rev16_32_alt:
        rev16   w0, w0
        ret
__rev16_32:
        rev16   w0, w0
        ret

I think the GIMPLE at expand time is better and the RTL that it tries to match is simpler:

Failed to match this instruction:
(set (reg:SI 95)
    (rotate:SI (bswap:SI (reg:SI 96))
        (const_int 16 [0x10])))

So maybe it's simply a matter of adding that pattern to aarch64.md. Anyway, filing this here to track the regression.
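The unmatched RTL really is rev16: a minimal sketch checking that a full byte-reverse followed by a rotate by 16 swaps the bytes within each 16-bit half of the word, i.e. the rev16 operation (__builtin_bswap32 stands in for the rev instruction, the rotate for ror):

```c
#include <stdint.h>

/* rev16 reference semantics: swap bytes within each halfword.  */
static uint32_t
rev16_masks (uint32_t x)
{
  return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}

/* The RTL form above: bswap (rev) then a rotate by 16 (ror).  */
static uint32_t
rev16_bswap_rot (uint32_t x)
{
  uint32_t b = __builtin_bswap32 (x);
  return (b << 16) | (b >> 16);
}
```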
[Bug target/109939] Invalid return type for __builtin_arm_ssat: Unsigned instead of signed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109939 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org --- Comment #5 from ktkachov at gcc dot gnu.org --- Fixed for GCC 14. It should be a very low risk patch to backport to the branches as it fixes an inconsistency with the spec. Will do so after some time for testing on trunk.
[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from ktkachov at gcc dot gnu.org --- Fixed, thanks for the report.
[Bug target/109939] Invalid return type for __builtin_arm_ssat: Unsigned instead of signed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109939 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|WAITING |NEW CC||ktkachov at gcc dot gnu.org --- Comment #3 from ktkachov at gcc dot gnu.org --- I think you're right, the qualifier for the return value of SAT_BINOP_UNSIGNED_IMM should be qualifier_none
[Bug c/109940] [14 Regression] ICE in decide_candidate_validity, bisected
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109940 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Target Milestone|--- |14.0 Known to fail||14.0 Known to work||13.1.0 Last reconfirmed||2023-05-23 Summary|ICE in |[14 Regression] ICE in |decide_candidate_validity, |decide_candidate_validity, |bisected|bisected CC||ktkachov at gcc dot gnu.org --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. A more cleaned-up testcase:

int a;
int *b;
void c (int *d) { *d = a; }
int e(int d, int f) {
  if (d <= 1)
    return 1;
  int g = d / 2;
  for (int h = 0; h < g; h++)
    if (f == (long int)b > b[h])
      c([h]);
  e(g, f);
  e(g, f);
}
[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855 ktkachov at gcc dot gnu.org changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #7 from ktkachov at gcc dot gnu.org --- I'll take it.
[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855 --- Comment #6 from ktkachov at gcc dot gnu.org --- (In reply to ktkachov from comment #5)
> (In reply to rsand...@gcc.gnu.org from comment #4)
> > I guess the problem is that the define_subst output template has:
> >
> > (match_operand: 0)
> >
> > which creates a new operand 0 with an empty predicate and constraint,
> > as opposed to a (match_dup 0), which would be substituted with the
> > original operand 0. Unfortunately
> >
> > (match_dup: 0)
> >
> > doesn't work as a way of inserting the original destination with
> > a different mode, since the : is ignored. Perhaps we should
> > “fix” that. Alternatively:
> >
> > (match_operand: 0 "register_operand" "=w")
> >
> > should work, but probably locks us into using patterns that have one
> > alternative only.
>
> I think this approach is the most promising and probably okay for the vast
> majority of cases we want to handle with these substs.

Interestingly, it does seem to do the right thing for multi-alternative patterns too. For example:

(define_insn ("aarch64_cmltv4hf_vec_concatz_le")
  [(set (match_operand:V8HI 0 ("register_operand") ("=w,w"))
        (vec_concat:V8HI
          (neg:V4HI
            (lt:V4HI (match_operand:V4HF 1 ("register_operand") ("w,w"))
                     (match_operand:V4HF 2 ("aarch64_simd_reg_or_zero") ("w,YDz"))))
          (match_operand:V4HI 3 ("aarch64_simd_or_scalar_imm_zero") (""))))]
  ("(!BYTES_BIG_ENDIAN) && ((TARGET_SIMD) && (TARGET_SIMD_F16INST))")
  ("@
   fcmgt\t%0.4h, %2.4h, %1.4h
   fcmlt\t%0.4h, %1.4h, 0")
  [(set_attr ("type") ("neon_fp_compare_s"))
   (set_attr ("add_vec_concat_subst_le") ("no"))])
[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855 --- Comment #5 from ktkachov at gcc dot gnu.org --- (In reply to rsand...@gcc.gnu.org from comment #4)
> I guess the problem is that the define_subst output template has:
>
> (match_operand: 0)
>
> which creates a new operand 0 with an empty predicate and constraint,
> as opposed to a (match_dup 0), which would be substituted with the
> original operand 0. Unfortunately
>
> (match_dup: 0)
>
> doesn't work as a way of inserting the original destination with
> a different mode, since the : is ignored. Perhaps we should
> “fix” that. Alternatively:
>
> (match_operand: 0 "register_operand" "=w")
>
> should work, but probably locks us into using patterns that have one
> alternative only.

I think this approach is the most promising and probably okay for the vast majority of cases we want to handle with these substs.
[Bug target/109855] [14 Regression] ICE: in curr_insn_transform, at lra-constraints.cc:4231 unable to generate reloads for {aarch64_mlav4hi_vec_concatz_le} at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109855 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2023-05-22 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed. The ICE in LRA happens very early on:

** Local #1: **
  Spilling non-eliminable hard regs: 31
         alt=0: Bad operand -- refuse

The pattern matches:

[(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
      (plus:VDQ_BHSI
        (mult:VDQ_BHSI (match_operand:VDQ_BHSI 2 "register_operand" "w")
                       (match_operand:VDQ_BHSI 3 "register_operand" "w"))
        (match_operand:VDQ_BHSI 1 "register_operand" "0")))]

I wonder whether the substitution breaks something on the constraint in operand 1, which is tied to operand 0. The define_subst rule adds another operand to the pattern to match the zero vector, but I would have expected the substitution machinery to handle it all transparently...
[Bug target/108140] ICE expanding __rbit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #9 from ktkachov at gcc dot gnu.org --- This should have been fixed for 12.3.
[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636 --- Comment #7 from ktkachov at gcc dot gnu.org --- (In reply to rsand...@gcc.gnu.org from comment #6) > Ugh. I guess we've got no option but to force the original > subreg into a fresh register, but that's going to pessimise > cases where arithmetic is done on tuple types. > > Perhaps we should just expose the SVE operation as a native > V2DI one. Handling predicated ops would be a bit more challenging > though. I did try a copy_to_mode_reg to a fresh V2DI register for non-REG_P arguments and that did progress, but (surprisingly?) still ICEd during fwprop:

during RTL pass: fwprop1
mulice.c: In function 'foom':
mulice.c:17:1: internal compiler error: in paradoxical_subreg_p, at rtl.h:3205
   17 | }
      | ^
0xe903b9 paradoxical_subreg_p(machine_mode, machine_mode)
        $SRC/gcc/rtl.h:3205
0xe903b9 simplify_context::simplify_subreg(machine_mode, rtx_def*, machine_mode, poly_int<2u, unsigned long>)
        $SRC/gcc/simplify-rtx.cc:7533
0xe1b5f7 insn_propagation::apply_to_rvalue_1(rtx_def**)
        $SRC/gcc/recog.cc:1176
0xe1b3d8 insn_propagation::apply_to_rvalue_1(rtx_def**)
        $SRC/gcc/recog.cc:1118
0xe1b7b7 insn_propagation::apply_to_rvalue_1(rtx_def**)
        $SRC/gcc/recog.cc:1254
0xe1babf insn_propagation::apply_to_pattern_1(rtx_def**)
        $SRC/gcc/recog.cc:1361
0xe1bae4 insn_propagation::apply_to_pattern(rtx_def**)
        $SRC/gcc/recog.cc:1383
0x1c22e5b try_fwprop_subst_pattern
        $SRC/gcc/fwprop.cc:454
0x1c22e5b try_fwprop_subst
        $SRC/gcc/fwprop.cc:627
0x1c239a9 forward_propagate_and_simplify
        $SRC/gcc/fwprop.cc:823
0x1c239a9 forward_propagate_into
        $SRC/gcc/fwprop.cc:886
0x1c23bc1 fwprop_insn
        $SRC/gcc/fwprop.cc:943
0x1c23d98 fwprop
        $SRC/gcc/fwprop.cc:995
0x1c240e1 execute
        $SRC/gcc/fwprop.cc:1033
Please submit a full bug report, with preprocessed source (by using -freport-bug). Please include the complete backtrace with any bug report. See <https://gcc.gnu.org/bugs/> for instructions.
fwprop ended up creating:

(mult:VNx2DI (subreg:VNx2DI (reg/v:V2DI 95 [ v ]) 0)
    (subreg:VNx2DI (subreg:V2DI (reg/v:OI 97 [ w ]) 16) 0))

and something blew up anyway, so it seems the RTL passes *really* don't like this kind of subreg ;) I'll look into expressing these ops as native V2DI patterns. I guess for the unpredicated SVE2 mul that's easy, but for the predicated forms perhaps we can have them consume a predicate register, generated at expand time, similar to the aarch64-sve.md expanders. Not super-pretty, but maybe it'll be enough.
[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636 ktkachov at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P1 Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #5 from ktkachov at gcc dot gnu.org --- The multiplication case also ICEs void foom (V v, W w) { bar (__builtin_shuffle (v, __builtin_shufflevector ((V){}, w, 4, 5) * v)); } as mulv2di3 was implemented with a similar trick for TARGET_SVE. I'll take this, once I figure out how to wire up the Neon modes through SVE...
[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2023-04-27 Status|UNCONFIRMED |NEW CC||rsandifo at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #4 from ktkachov at gcc dot gnu.org --- Confirmed. The operand that's blowing it up is: (subreg:V2DI (reg/v:OI 97 [ w ]) 16) at rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], mode, 0); simplify_gen_subreg, lowpart_subreg, copy_to_mode_reg and force_reg all ICE :(
[Bug target/109636] [14 Regression] ICE: in paradoxical_subreg_p, at rtl.h:3205 with -O -mcpu=a64fx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109636 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #2 from ktkachov at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #1) > Are you sure this is not a regression also in GCC 13.1.0. > The most obvious revision which caused this is r13-6620-gf23dc726875c26f2c3 . I'd expect it's g:c69db3ef7f7d82a50f46038aa5457b7c8cc2d643 but haven't looked deeper yet
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 109406, which changed state. Bug 109406 Summary: Missing use of aarch64 SVE2 unpredicated integer multiply https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/109406] Missing use of aarch64 SVE2 unpredicated integer multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Target Milestone|--- |14.0 Resolution|--- |FIXED --- Comment #3 from ktkachov at gcc dot gnu.org --- Fixed for GCC 14
[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Target Milestone|--- |14.0 Resolution|--- |FIXED --- Comment #10 from ktkachov at gcc dot gnu.org --- Implemented for GCC 14.
[Bug c/109553] New: Atomic operations vs const locations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109553 Bug ID: 109553 Summary: Atomic operations vs const locations Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- When reasoning about optimal sequences for atomic operations for various targets, the issue of read-only memory locations keeps coming up, particularly when talking about doing non-native larger-sized accesses locklessly. I wonder if the frontends in GCC should be more assertive with warnings on such constructs. Consider, for example:

#include <stdint.h>

uint32_t load_uint32_t (const uint32_t *a)
{
  return __atomic_load_n (a, __ATOMIC_ACQUIRE);
}

void casa_uint32_t (const uint32_t *a, uint32_t *b, uint32_t *c)
{
  __atomic_compare_exchange_n (a, b, 3, 0, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

Both of these functions compile fine with GCC. With Clang, casa_uint32_t gives a hard error:

error: address argument to atomic operation must be a pointer to non-const type ('const uint32_t *' (aka 'const unsigned int *') invalid) __atomic_compare_exchange_n (a, b, 3, 0, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);

I would argue that for both cases the compiler should emit something. I think an error is appropriate for the __atomic_compare_exchange_n case, but even for the atomic load we may want to hint to the user to avoid doing an atomic load from const types.
[Bug target/108840] Aarch64 doesn't optimize away shift counter masking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840 ktkachov at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED Target Milestone|--- |14.0 --- Comment #5 from ktkachov at gcc dot gnu.org --- Fixed for GCC 14.
[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154 ktkachov at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P1 --- Comment #43 from ktkachov at gcc dot gnu.org --- Indeed, thank you for the high quality analysis and improvements! Marking this as P1 as it's a regression on aarch64-linux in GCC 13 so we'd want to track this for the release, but of course it's up to RMs for the final say.
[Bug target/109406] Missing use of aarch64 SVE2 unpredicated integer multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406 ktkachov at gcc dot gnu.org changed: What|Removed |Added Severity|normal |enhancement
[Bug target/109406] New: Missing use of aarch64 SVE2 unpredicated integer multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109406 Bug ID: 109406 Summary: Missing use of aarch64 SVE2 unpredicated integer multiply Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64 For the testcase:

#define N 1024
long long res[N];
long long in1[N];
long long in2[N];

void mult (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in1[i] * in2[i];
}

With -O3 -march=armv8.5-a+sve2 we generate the loop:

        ptrue   p1.b, all
        whilelo p0.d, wzr, w2
.L2:
        ld1d    z0.d, p0/z, [x4, x0, lsl 3]
        ld1d    z1.d, p0/z, [x3, x0, lsl 3]
        mul     z0.d, p1/m, z0.d, z1.d
        st1d    z0.d, p0, [x1, x0, lsl 3]
        incd    x0
        whilelo p0.d, w0, w2
        b.any   .L2
        ret

SVE2 supports the MUL (vectors, unpredicated) instruction that would allow us to eliminate the use of p1. Clang manages to do this (though it has other inefficiencies) in https://godbolt.org/z/7xj6xEchx
[Bug tree-optimization/109401] New: Optimise max (a, b) + min (a, b) into a + b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109401 Bug ID: 109401 Summary: Optimise max (a, b) + min (a, b) into a + b Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- The testcase:

#include <algorithm>
#include <cstdint>

uint32_t foo (uint32_t a, uint32_t b)
{
  return std::max (a, b) + std::min (a, b);
}

uint32_t foom (uint32_t a, uint32_t b)
{
  return std::max (a, b) * std::min (a, b);
}

could optimise foo into a + b and foom into a * b. Should be a matter of some match.pd patterns?
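The identity itself is simple: {max(a,b), min(a,b)} is just a permutation of {a, b}, so their sum and product equal a + b and a * b, including under unsigned wraparound. A minimal C sketch of the reference semantics:

```c
#include <stdint.h>

static uint32_t max_u32 (uint32_t a, uint32_t b) { return a > b ? a : b; }
static uint32_t min_u32 (uint32_t a, uint32_t b) { return a < b ? a : b; }

/* max and min together yield each input exactly once, so these
   are equivalent to a + b and a * b respectively.  */
static uint32_t
foo_ref (uint32_t a, uint32_t b)
{
  return max_u32 (a, b) + min_u32 (a, b);
}

static uint32_t
foom_ref (uint32_t a, uint32_t b)
{
  return max_u32 (a, b) * min_u32 (a, b);
}
```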
[Bug target/109332] Bug in gcc (13.0.1) support for ARM SVE, which randomly ignore the predict register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109332 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #1 from ktkachov at gcc dot gnu.org --- That's expected. Please see https://github.com/ARM-software/acle/blob/main/main/acle.md#sve-naming-convention Since the input uses the _x form of the intrinsic svsub_n_s64_x the predication behaviour is left to the compiler and the ACLE specifies: "This form of predication removes the need to choose between zeroing and merging in cases where the inactive elements are unimportant. The code generator can then pick whichever form of instruction seems to give the best code. This includes using unpredicated instructions, where available and suitable." So using an unpredicated sub instruction is appropriate here and not a bug.
[Bug tree-optimization/109176] [13 Regression] internal compiler error: in to_constant, at poly-int.h:504
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176 --- Comment #10 from ktkachov at gcc dot gnu.org --- For the testcase, having it in gcc.target/aarch64/sve as:

/* { dg-options "-O2" } */
#include <arm_sve.h>

svbool_t foo (svint8_t a, svint8_t b, svbool_t c)
{
  svbool_t d = svcmplt_s8 (svptrue_pat_b8 (SV_ALL), a, b);
  return svsel_b (d, c, d);
}

would be fine.
[Bug tree-optimization/109176] [13 Regression] internal compiler error: in to_constant, at poly-int.h:504
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176 --- Comment #3 from ktkachov at gcc dot gnu.org --- Created attachment 54708 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54708&action=edit Reduced testcase Reduced testcase ICEs at -O2.
[Bug tree-optimization/109176] internal compiler error: in to_constant, at poly-int.h:504
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109176 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW CC||ktkachov at gcc dot gnu.org Target Milestone|--- |13.0 Ever confirmed|0 |1 Last reconfirmed||2023-03-17 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed. Running reduction
[Bug middle-end/109153] missed vector constructor optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109153 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2023-03-16 Status|UNCONFIRMED |NEW --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. Does the midend have a way of judging whether a constructor is cheaper?
[Bug c++/108967] internal compiler error: in expand_debug_expr, at cfgexpand.cc:5450
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108967 ktkachov at gcc dot gnu.org changed: What|Removed |Added Target||aarch64 Last reconfirmed||2023-02-28 Status|UNCONFIRMED |NEW CC||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Target Milestone|--- |13.0 Keywords||ice-on-valid-code Known to fail||13.0 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed
[Bug rtl-optimization/106594] [13 Regression] sign-extensions no longer merged into addressing mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #12 from ktkachov at gcc dot gnu.org --- (In reply to Tamar Christina from comment #11) > This patch seems to have stalled. CC'ing the maintainers as this is still a > large regression for us. Roger's latest updated patch was posted recently at https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612840.html
[Bug target/108840] Aarch64 doesn't optimize away shift counter masking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840 --- Comment #3 from ktkachov at gcc dot gnu.org --- Created attachment 54531 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54531&action=edit Candidate patch Candidate patch attached.
[Bug tree-optimization/108901] [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from ktkachov at gcc dot gnu.org --- Yes, they are fixed now. Thank you!
[Bug tree-optimization/108901] [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901 ktkachov at gcc dot gnu.org changed: What|Removed |Added Target Milestone|--- |13.0 Priority|P3 |P1
[Bug tree-optimization/108901] New: [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_*
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108901 Bug ID: 108901 Summary: [13 Regression] Testsuite failures in gcc.target/aarch64/sve/cond_* Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: testsuite-fail Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64 After g:3da77f217c8b2089ecba3eb201e727c3fcdcd19d we're seeing testsuite failures like:

gcc.target/aarch64/sve/cond_fmaxnm_7.c
gcc.target/aarch64/sve/cond_fminnm_7.c
gcc.target/aarch64/sve/cond_fmaxnm_8.c
gcc.target/aarch64/sve/cond_fminnm_8.c
gcc.target/aarch64/sve/cond_fminnm_6.c
gcc.target/aarch64/sve/fmla_2.c
gcc.target/aarch64/sve/cond_xorsign_2.c
gcc.target/aarch64/sve/cond_xorsign_1.c
gcc.target/aarch64/sve/cond_fmaxnm_6.c

on aarch64. I haven't looked into the cause, just reporting here for tracking.
[Bug target/108874] [10/11/12/13 Regression] Missing bswap detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874 --- Comment #3 from ktkachov at gcc dot gnu.org --- (In reply to Richard Biener from comment #2) > The regression is probably rtl-optimization/target specific since we never > had this kind of pattern detected on the tree/GIMPLE level and there's no > builtin or IFN for this shuffling on u32. FWIW a colleague reported that he bisected the failure to g:98e30e515f184bd63196d4d500a682fbfeb9635e though I haven't tried it myself. We do have patterns for these in aarch64 and arm, but combine would need to match about 5 insns to get there and that's beyond its current limit of 4
[Bug tree-optimization/108874] [10/11/12/13 Regression] Missing bswap detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874 --- Comment #1 from ktkachov at gcc dot gnu.org --- (In reply to ktkachov from comment #0) > If we look at the arm testcases in gcc.target/arm/rev16.c > typedef unsigned int __u32; > > __u32 > __rev16_32_alt (__u32 x) > { > return (((__u32)(x) & (__u32)0xff00ff00UL) >> 8) > | (((__u32)(x) & (__u32)0x00ff00ffUL) << 8); > } > > __u32 > __rev16_32 (__u32 x) > { > return (((__u32)(x) & (__u32)0x00ff00ffUL) << 8) > | (((__u32)(x) & (__u32)0xff00ff00UL) >> 8); > } > this isn't a simple __builtin_bswap16 as that returns a uint16_t, this is sort of a __builtin_swap16 in each of the half-words of the u32
[Bug tree-optimization/108874] New: [10/11/12/13 Regression] Missing bswap detection
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108874 Bug ID: 108874 Summary: [10/11/12/13 Regression] Missing bswap detection Product: gcc Version: unknown Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- If we look at the arm testcases in gcc.target/arm/rev16.c:

typedef unsigned int __u32;

__u32
__rev16_32_alt (__u32 x)
{
  return (((__u32)(x) & (__u32)0xff00ff00UL) >> 8)
       | (((__u32)(x) & (__u32)0x00ff00ffUL) << 8);
}

__u32
__rev16_32 (__u32 x)
{
  return (((__u32)(x) & (__u32)0x00ff00ffUL) << 8)
       | (((__u32)(x) & (__u32)0xff00ff00UL) >> 8);
}

we should be able to generate rev16 instructions for aarch64 (and arm), i.e. recognise a __builtin_bswap16 essentially. GCC fails to do so and generates:

__rev16_32_alt:
        lsr     w1, w0, 8
        lsl     w0, w0, 8
        and     w1, w1, 16711935
        and     w0, w0, -16711936
        orr     w0, w1, w0
        ret
__rev16_32:
        lsl     w1, w0, 8
        lsr     w0, w0, 8
        and     w1, w1, -16711936
        and     w0, w0, 16711935
        orr     w0, w1, w0
        ret

whereas clang manages to recognise it all into:

__rev16_32_alt:                         // @__rev16_32_alt
        rev16   w0, w0
        ret
__rev16_32:                             // @__rev16_32
        rev16   w0, w0
        ret

Does the bswap pass need some tweaking perhaps? Looks like this worked fine with GCC 5 but broke in the GCC 6 timeframe, so marking as a regression.
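As a cross-check of what the pattern computes (a sketch: rev16_32 is copied from the testcase, while rev16_ref and the per-half-word use of __builtin_bswap16 are my illustration, not testsuite code):

```c
#include <stdint.h>

typedef unsigned int __u32;

/* The shift-and-mask pattern from the testcase. */
static __u32 rev16_32 (__u32 x)
{
  return (((__u32)(x) & (__u32)0x00ff00ffUL) << 8)
       | (((__u32)(x) & (__u32)0xff00ff00UL) >> 8);
}

/* Illustrative reference: byte-swap each 16-bit half independently.
   This is the operation REV16 performs on a 32-bit register. */
static __u32 rev16_ref (__u32 x)
{
  __u32 lo = __builtin_bswap16 ((uint16_t) x);
  __u32 hi = __builtin_bswap16 ((uint16_t) (x >> 16));
  return (hi << 16) | lo;
}
```

This makes comment #1's point concrete: the result type stays 32 bits wide, so it is not a plain `__builtin_bswap16` call but a swap within each half-word.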
[Bug target/108840] Aarch64 doesn't optimize away shift counter masking
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108840 ktkachov at gcc dot gnu.org changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #2 from ktkachov at gcc dot gnu.org --- I have a patch to simplify and fix the aarch64 rtx costs for this case. I'll aim it for GCC 14 as it's not a regression.
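A sketch of the redundancy being costed here (the function name is mine, not from the patch): AArch64's variable 64-bit shifts already take the count modulo 64, so the explicit source-level mask should not cost an extra AND instruction.

```c
#include <stdint.h>

/* The (n & 63) mask mirrors what the hardware shift does anyway for
   64-bit operands, so on aarch64 the AND should be optimised away and
   the shift emitted directly. */
uint64_t shl_masked (uint64_t x, unsigned int n)
{
  return x << (n & 63);
}
```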
[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779 --- Comment #3 from ktkachov at gcc dot gnu.org --- Created attachment 54459 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54459&action=edit Candidate patch Patch that implements -mtp= similar to clang, if you have the capability to try it out
[Bug target/108779] AARCH64 should add an option to change TLS register location to support EL1/EL2/EL3 system registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108779 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org Last reconfirmed||2023-02-14 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed. I have a patch I'm testing for it. Since GCC 13 is in stage4 (regression and wrong-code fixes only) this would be GCC 14 material. Would that timeline be okay with you?
[Bug target/108659] Suboptimal 128 bit atomics codegen on AArch64 and x64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108659 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #2 from ktkachov at gcc dot gnu.org --- (In reply to Niall Douglas from comment #0) > Related: > - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878 > - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94649 > - https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 > > I got bitten by this again, latest GCC still does not emit single > instruction 128 bit atomics, even when the -march is easily new enough. Here > is a godbolt comparing latest MSVC, latest GCC and latest clang for the > skylake-avx512 architecture, which unquestionably supports cmpxchg16b. Only > clang emits the single instruction atomic: > > https://godbolt.org/z/EnbeeW4az > > I'm gathering from the issue comments and from the comments at > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 that you're going to > wait for AMD to guarantee atomicity of SSE instructions before changing the > codegen here, which makes sense. However I also wanted to raise potentially > suboptimal 128 bit atomic codegen by GCC for AArch64 as compared to clang: > > https://godbolt.org/z/oKv4o81nv > > GCC emits `dmb` to force a global memory fence, whereas clang does not. > > I think clang is in the right here, the seq_cst atomic semantics are not > supposed to globally memory fence. FWIW, the GCC codegen for aarch64 is at https://godbolt.org/z/qvx9484nY (arm and aarch64 are different targets). It emits a call to libatomic, which for GCC 13 will use a lockless implementation when possible at runtime, see g:d1288d850944f69a795e4ff444a427eba3fec11b
[Bug target/108495] [10/11/12/13 Regression] aarch64 ICE with __builtin_aarch64_rndr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108495 --- Comment #7 from ktkachov at gcc dot gnu.org --- Yes, GCC could be more helpful here. The intrinsics and their use are documented in the ACLE document: https://github.com/ARM-software/acle/blob/main/main/acle.md#random-number-generation-intrinsics There is ongoing work to augment it with more user-friendly information about compiler flags, but GCC could keep track of the options used to gate these builtins/intrinsics and report a hint.
[Bug target/108495] [10/11/12/13 Regression] aarch64 ICE with __builtin_aarch64_rndr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108495 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Keywords||ice-on-invalid-code Last reconfirmed||2023-01-23 Status|UNCONFIRMED |NEW --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. That said, __builtin_aarch64_rndr is not supposed to be used directly by the user. They should include <arm_acle.h> and use the __rndr intrinsic instead. That will give the appropriate error: inlining failed in call to 'always_inline' '__rndr': target specific option mismatch Still, I suppose the compiler shouldn't ICE.
[Bug tree-optimization/108446] New: GCC fails to elide udiv/msub when doing modulus by select of constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108446 Bug ID: 108446 Summary: GCC fails to elide udiv/msub when doing modulus by select of constants Product: gcc Version: unknown Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: ---

unsigned foo(int vl, unsigned len)
{
  unsigned pad = vl <= 256 ? 128 : 256;
  return len % pad;
}

At -O2 aarch64 gcc generates:

foo:
        cmp     w0, 256
        mov     w2, 256
        mov     w0, 128
        csel    w2, w2, w0, gt
        udiv    w0, w1, w2
        msub    w0, w0, w2, w1
        ret

clang, for example, can generate the cheaper:

foo:                                    // @foo
        cmp     w0, #256
        mov     w8, #127
        mov     w9, #255
        csel    w8, w9, w8, gt
        and     w0, w8, w1
        ret

Similar situation on x86. I suppose this could be a match.pd fix or otherwise something during expand time?
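Since both possible pad values are powers of two, `len % pad` can be rewritten as `len & (pad - 1)`, which is effectively what clang emits: the csel selects between the masks 127 and 255 directly. A C-level sketch of the equivalence (function names are mine):

```c
/* Original form from the report. */
unsigned foo_ref (int vl, unsigned len)
{
  unsigned pad = vl <= 256 ? 128 : 256;
  return len % pad;
}

/* Mask form: for a power-of-two pad, len % pad == len & (pad - 1),
   so the select can be done on the masks themselves and the
   udiv/msub pair disappears. */
unsigned foo_masked (int vl, unsigned len)
{
  unsigned mask = vl <= 256 ? 127 : 255;
  return len & mask;
}
```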
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Last reconfirmed||2023-01-17 --- Comment #12 from ktkachov at gcc dot gnu.org --- (In reply to Kito Cheng from comment #7) > We are hitting this issue on RISC-V, and got some complain from linux kernel > developers, but in different form as the original report, we found cold > function or any function is marked as cold by `-fguess-branch-probability` > are all not honor to the -falign-functions=N setting, that become problem on > some linux kernel feature since they want to control the minimal alignment > to make sure they can atomically update the instruction which require align > to 4 byte. > > However current GCC behavior can't guarantee that even -falign-functions=4 > is given, there is 3 option in my mind: > > 1. Fix -falign-functions=N, let it work as expect on -Os and all cold > functions > 2. Force align to 4 byte if -fpatchable-function-entry is given, that's > should be doable by adjust RISC-V's FUNCTION_BOUNDARY > 3. Adjust RISC-V's FUNCTION_BOUNDARY to let it honor to -falign-functions=N > 4. Adding a -malign-functions=N...Okay, I know that suck idea, x86 already > deprecated that. > > But I think ideally this should fixed by 1 option if possible. > > Testcase from RISC-V kernel guy: > ``` > /* { dg-do compile } */ > /* { dg-options "-march=rv64gc -mabi=lp64d -O1 -falign-functions=128" } */ > /* { dg-final { scan-assembler-times ".align 7" 2 } } */ > > // Using 128 byte align rather than 4 byte align since it easier to observe. > > __attribute__((__cold__)) void a() {} // This function isn't align to 128 > byte > void b() {} // This function align to 128 byte. 
> ``` > > Proposed fix: > ``` > diff --git a/gcc/varasm.c b/gcc/varasm.c > index 49d5cda122f..6f8ed85fea9 100644 > --- a/gcc/varasm.c > +++ b/gcc/varasm.c > @@ -1907,8 +1907,7 @@ assemble_start_function (tree decl, const char *fnname) > Note that we still need to align to DECL_ALIGN, as above, > because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at all. > */ >if (! DECL_USER_ALIGN (decl) > - && align_functions.levels[0].log > align > - && optimize_function_for_speed_p (cfun)) > + && align_functions.levels[0].log > align) > { > #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN >int align_log = align_functions.levels[0].log; > > ``` I think this patch makes sense given the extra information you and Mark have provided. Would you mind testing it and posting it to gcc-patches for review please?
[Bug rust/106072] [13 Regression] -Wnonnull warning breaks rust bootstrap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106072 --- Comment #18 from ktkachov at gcc dot gnu.org --- (In reply to Richard Biener from comment #17) > Fixed(?) Yes on aarch64, thanks!
[Bug target/102218] 128-bit atomic compare and exchange does not honor memory model on AArch64 and Arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102218 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #3 from ktkachov at gcc dot gnu.org --- Does this need to be backported to other release versions as it's a wrong-code bug?
[Bug target/95751] [aarch64] Consider using ldapr for __atomic_load_n(acquire) on ARMv8.3-RCPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95751 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2022-12-20 Status|UNCONFIRMED |NEW --- Comment #1 from ktkachov at gcc dot gnu.org --- I had not seen this report at the time, but LDAPR generation has now been implemented in GCC 13.1 for acquire loads with https://gcc.gnu.org/g:0431e8ae5bdb854bda5f9005e41c8c4d03f6d74e and follow-ups. Any testing/evaluation/feedback would be welcome
[Bug target/107209] [13 Regression] ICE: verify_gimple failed (error: statement marked for throw, but doesn't)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107209 --- Comment #5 from ktkachov at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #4) > Looking at other backends, rs6000 uses in *gimple_fold_builtin gsi_replace > (..., true); > all the time, ix86 gsi_replace (..., false); all the time, alpha with true, > aarch64 with true. But perhaps what is more important if the builtins > folded are declared nothrow or not, if they are nothrow, then they shouldn't > have any EH edges at the start already and so it shouldn't matter what is > used. The vmulx_f64 intrinsic is not marked "nothrow" by the logic: 1284 static tree 1285 aarch64_get_attributes (unsigned int f, machine_mode mode) 1286 { 1287 tree attrs = NULL_TREE; 1288 1289 if (!aarch64_modifies_global_state_p (f, mode)) 1290 { 1291 if (aarch64_reads_global_state_p (f, mode)) 1292 attrs = aarch64_add_attribute ("pure", attrs); 1293 else 1294 attrs = aarch64_add_attribute ("const", attrs); 1295 } 1296 1297 if (!flag_non_call_exceptions || !aarch64_could_trap_p (f, mode)) 1298 attrs = aarch64_add_attribute ("nothrow", attrs); 1299 1300 return aarch64_add_attribute ("leaf", attrs); 1301 } aarch64_could_trap_p returns true for it as it can raise an FP exception. Should that affect the nothrow attribute though? Shouldn't that be for C++ exceptions only?
[Bug middle-end/108140] ICE expanding __rbit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108140 ktkachov at gcc dot gnu.org changed: What|Removed |Added Target Milestone|--- |12.3 CC||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Last reconfirmed||2022-12-16 Status|UNCONFIRMED |ASSIGNED --- Comment #5 from ktkachov at gcc dot gnu.org --- Confirmed the ICE and I'm testing a patch to fix that, thanks for the report
[Bug rust/108084] New: AArch64 Linux bootstrap failure in rust
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108084 Bug ID: 108084 Summary: AArch64 Linux bootstrap failure in rust Product: gcc Version: unknown Status: UNCONFIRMED Keywords: build Severity: normal Priority: P3 Component: rust Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org CC: dkm at gcc dot gnu.org Target Milestone: --- Host: aarch64-none-linux-gnu Target: aarch64-none-linux-gnu Congratulations on getting the rust frontend committed! When trying a bootstrap on aarch64-none-linux with --enable-languages=c,c++,fortran,rust I get a -Werror=nonnull failure In file included from $SRC/gcc/rust/parse/rust-parse.h:730, from $SRC/gcc/rust/expand/rust-macro-builtins.cc:25: $SRC/gcc/rust/parse/rust-parse-impl.h: In member function 'Rust::AST::ClosureParam Rust::Parser::parse_closure_param() [with ManagedTokenSource = Rust::Lexer]': $SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null [-Werror=nonnull] 8916 | std::move (type), std::move (outer_attrs)); | ^ In file included from $SRC/gcc/rust/parse/rust-parse.h:730, from $SRC/gcc/rust/expand/rust-macro-expand.h:23, from $SRC/gcc/rust/expand/rust-macro-expand.cc:19: $SRC/gcc/rust/parse/rust-parse-impl.h: In member function 'Rust::AST::ClosureParam Rust::Parser::parse_closure_param() [with ManagedTokenSource = Rust::MacroInvocLexer]': $SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null [-Werror=nonnull] 8916 | std::move (type), std::move (outer_attrs)); | ^ In file included from $SRC/gcc/rust/parse/rust-parse.h:730, from $SRC/gcc/rust/rust-session-manager.cc:23: $SRC/gcc/rust/parse/rust-parse-impl.h: In member function 'Rust::AST::ClosureParam Rust::Parser::parse_closure_param() [with ManagedTokenSource = Rust::Lexer]': $SRC/gcc/rust/parse/rust-parse-impl.h:8916:70: error: 'this' pointer is null [-Werror=nonnull] 8916 | std::move (type), std::move (outer_attrs));
[Bug target/108006] [13 Regression] ICE in aarch64_move_imm building 502.gcc_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #1 from ktkachov at gcc dot gnu.org --- Wilco, is this something you've touched recently?
[Bug target/108006] New: [13 Regression] ICE in aarch64_move_imm building 502.gcc_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006 Bug ID: 108006 Summary: [13 Regression] ICE in aarch64_move_imm building 502.gcc_r Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Building 502.gcc_r from SPEC2017 with -O2 -mcpu=neoverse-v1 ICEs with trunk. Reduced testcase: void c(); short *foo; short *bar; void a() { for (bar; bar < foo; bar++) *bar = 999; c(); } backtrace is: during RTL pass: expand ice.c: In function a: ice.c:8:10: internal compiler error: in aarch64_move_imm, at config/aarch64/aarch64.cc:5692 8 | *bar = 999; | ~^ 0x129db4c aarch64_move_imm(unsigned long, machine_mode) $SRC/gcc/config/aarch64/aarch64.cc:5692 0x12c01cd aarch64_expand_sve_const_vector $SRC/gcc/config/aarch64/aarch64.cc:6516 0x12c63cb aarch64_expand_mov_immediate(rtx_def*, rtx_def*) $SRC/gcc/config/aarch64/aarch64.cc:6996 0x18c3248 gen_movvnx8hi(rtx_def*, rtx_def*) $SRC/gcc/config/aarch64/aarch64-sve.md:662 0xa09062 rtx_insn* insn_gen_fn::operator()(rtx_def*, rtx_def*) const $SRC/gcc/recog.h:407 0xa09062 emit_move_insn_1(rtx_def*, rtx_def*) $SRC/gcc/expr.cc:4172 0xa095bb emit_move_insn(rtx_def*, rtx_def*) $SRC/gcc/expr.cc:4342 0x9db8aa copy_to_mode_reg(machine_mode, rtx_def*) $SRC/gcc/explow.cc:654 0xd0607d maybe_legitimize_operand $SRC/gcc/optabs.cc:7809 0xd0607d maybe_legitimize_operands(insn_code, unsigned int, unsigned int, expand_operand*) $SRC/gcc/optabs.cc:7941 0xd06366 maybe_gen_insn(insn_code, unsigned int, expand_operand*) $SRC/gcc/optabs.cc:7960 0xd06592 maybe_expand_insn(insn_code, unsigned int, expand_operand*) $SRC/gcc/optabs.cc:8005 0xd05b17 expand_insn(insn_code, unsigned int, expand_operand*) $SRC/gcc/optabs.cc:8036 0xb53fb7 expand_partial_store_optab_fn $SRC/gcc/internal-fn.cc:2878 0xb54307 expand_MASK_STORE $SRC/gcc/internal-fn.def:141 0xb59960 expand_internal_call(internal_fn, gcall*) 
$SRC/gcc/internal-fn.cc:4436 0xb5997a expand_internal_call(gcall*) $SRC/gcc/internal-fn.cc: 0x8b6161 expand_call_stmt $SRC/gcc/cfgexpand.cc:2737 0x8b6161 expand_gimple_stmt_1
[Bug target/107988] [13 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) on aarch64-unknown-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107988 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2022-12-06 Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org, ||tnfchris at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed. Looks related to the recent div-by-special-constant changes but ICEs only at -O0
[Bug target/107830] [13 Regression] ICE in gen_aarch64_bitmask_udiv3, at ./insn-opinit.h:813
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107830 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||tnfchris at gcc dot gnu.org --- Comment #2 from ktkachov at gcc dot gnu.org --- I think it's more likely Tamar's recent patches for that optab
[Bug target/107102] SVE function fails to realize it doesn't need the frame-pointer in the tail call.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107102 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2022-10-04 --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed, clang tail-calls this:

bar:                                    // @bar
        ptrue   p1.b
        ptrue   p0.s
        and     p0.b, p1/z, p1.b, p0.b
        b       foo
[Bug target/107025] gas doesn't accept code produced by -mcpu=thunderx3t110
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107025 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2022-09-26 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #2 from ktkachov at gcc dot gnu.org --- In the Arm architecture this is FEAT_LRCPC2. LLVM does have an MC (essentially assembler-level) feature string for it called "rcpc-immo", so if we wanted to support this I guess we'd want to be compatible. That said, it may be cleaner to just remove support for thunderx3t110 if we think it's the right time. Unfortunately we do still have some cases where our features aren't fine-grained enough and are tied to architecture levels that some CPUs don't claim to support: https://godbolt.org/z/axbnd4c5o
[Bug target/106583] New: Suboptimal immediate generation on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106583 Bug ID: 106583 Summary: Suboptimal immediate generation on aarch64 Product: gcc Version: unknown Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- Target: aarch64 A simple codegen issue:

unsigned long long foo (void)
{
  return 0x7efefefefefefeff;
}

generates at -O2:

foo:
        mov     x0, 65279
        movk    x0, 0xfefe, lsl 16
        movk    x0, 0xfefe, lsl 32
        movk    x0, 0x7efe, lsl 48
        ret

whereas LLVM can do:

foo:                                    // @foo
        mov     x0, #-72340172838076674
        movk    x0, #65279
        movk    x0, #32510, lsl #48
        ret

Should be a matter of just making aarch64_internal_mov_immediate in aarch64.cc a bit smarter.
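A sketch of what the LLVM sequence computes (the movk helper models the instruction's semantics; it is my illustration, not GCC code): start from the replicated-byte bitmask immediate 0xfefefefefefefefe, which is what #-72340172838076674 encodes, then patch two 16-bit fields.

```c
#include <stdint.h>

/* Model of AArch64 MOVK: replace the 16-bit field of VAL at bit
   position SHIFT with IMM16, leaving all other bits intact. */
static uint64_t movk (uint64_t val, uint64_t imm16, unsigned shift)
{
  return (val & ~(0xffffULL << shift)) | (imm16 << shift);
}

/* LLVM's 3-instruction sequence for 0x7efefefefefefeff. */
uint64_t build_imm (void)
{
  uint64_t x = 0xfefefefefefefefeULL; /* mov  x0, #-72340172838076674 */
  x = movk (x, 65279, 0);             /* movk x0, #65279  (0xfeff)    */
  x = movk (x, 32510, 48);            /* movk x0, #32510, lsl #48     */
  return x;
}
```

The saving comes from the first MOV being a bitmask immediate (a replicated byte pattern), which covers two of the four 16-bit chunks in one instruction.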
[Bug middle-end/106568] -freorder-blocks-algorithm appears to causes a crash in stable code, no way to disable it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106568 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #1 from ktkachov at gcc dot gnu.org --- > We are fairly certain the problem is with the -freorder-blocks-algorithm > optimization. The problem we are now having is, we don't know how to disable > it. The following fails to compile: > > -fno-reorder-blocks-algorithm > -freorder-blocks-algorithm=none > -freorder-blocks-algorithm= > You should be able to use -fno-reorder-blocks to disable it. Alternatively, if you use -freorder-blocks-algorithm= you can only pass it the "simple" or "stc" options as per the documentation. This will pick one of the two available algorithms. That said, one major change that happened in GCC 12.1 was enabling auto-vectorisation by default at -O2. See https://gcc.gnu.org/gcc-12/changes.html The vectorisation at -O2 uses less aggressive heuristics than at -O3 so could trigger different behaviour than -O3 or lower options (where it doesn't vectorise at all). May be worth investigating.
[Bug tree-optimization/106343] Addition with constants is not vectorized by SLP when it includes zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106343 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2022-07-18 CC||ktkachov at gcc dot gnu.org, ||rguenth at gcc dot gnu.org Target|aarch64 |aarch64, x86_64 --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed, it's quite odd. x86_64 is also affected: https://godbolt.org/z/q46z3hh9Y
[Bug target/106324] ptrue not reused between vector instructions and predicate instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2022-07-18 Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Keywords||missed-optimization Status|UNCONFIRMED |NEW --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed.
[Bug tree-optimization/98138] BB vect fail to SLP one case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2022-07-06 CC||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #10 from ktkachov at gcc dot gnu.org --- Note that current clang does a pretty decent job on this now on aarch64 (in case it gives some inspiration on the approach) https://godbolt.org/z/EPvqMhh7v
[Bug tree-optimization/106064] Wrong code comparing two global zero-sized arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106064 ktkachov at gcc dot gnu.org changed: What|Removed |Added Keywords||wrong-code CC||ktkachov at gcc dot gnu.org --- Comment #1 from ktkachov at gcc dot gnu.org --- This seems to have changed in the GCC 9 series. GCC 8.5 generates: f(): mov w0, 1 ret g(): mov w0, 1 ret b: a: Tagging as a claimed wrong-code bug.
[Bug tree-optimization/105793] New: Missed vectorisation with conditional-select inside loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105793 Bug ID: 105793 Summary: Missed vectorisation with conditional-select inside loop Product: gcc Version: unknown Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- The code:

#define N 1024
float f(const float in[N], unsigned int n)
{
  float a = 0.0f;
  for (unsigned i = 0; i < N; ++i) {
    float b = in[i];
    if (b < 10.f)
      a += b;
    else
      a -= b;
  }
  return a;
}

with -Ofast does not vectorise (on aarch64, for example):

f:
        movi    v0.2s, #0
        add     x1, x0, 4096
        fmov    s3, 1.0e+1
.L5:
        ldr     s1, [x0], 4
        fsub    s2, s0, s1
        fcmpe   s1, s3
        fadd    s0, s0, s1
        fcsel   s0, s0, s2, mi
        cmp     x1, x0
        bne     .L5
        ret

whereas clang can and does. Commenting out the "else a -= b;" line allows GCC to vectorise it:

f:
        movi    v0.4s, 0
        add     x1, x0, 4096
        fmov    v3.4s, 1.0e+1
.L2:
        ldr     q2, [x0], 16
        fcmgt   v1.4s, v3.4s, v2.4s
        and     v1.16b, v1.16b, v2.16b
        fadd    v0.4s, v0.4s, v1.4s
        cmp     x1, x0
        bne     .L2
        faddp   v0.4s, v0.4s, v0.4s
        faddp   v0.4s, v0.4s, v0.4s
        ret

Examples at https://gcc.godbolt.org/z/qbn6T73qE
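A hand if-converted form shows the select the vectoriser would need to form: since `a -= b` is exactly `a += -b` in IEEE arithmetic, the rewrite below is semantically equivalent at the C level (function names are mine, not from the report):

```c
#define N 1024

/* Original branchy reduction from the report. */
static float f_branchy (const float in[N])
{
  float a = 0.0f;
  for (unsigned i = 0; i < N; ++i) {
    float b = in[i];
    if (b < 10.0f) a += b; else a -= b;
  }
  return a;
}

/* If-converted form: one unconditional fadd per iteration, with a
   select on the addend instead of on the accumulator. */
static float f_select (const float in[N])
{
  float a = 0.0f;
  for (unsigned i = 0; i < N; ++i) {
    float b = in[i];
    a += (b < 10.0f) ? b : -b;
  }
  return a;
}

/* Self-check on a deterministic input straddling the threshold. */
static int f_check (void)
{
  float in[N];
  for (unsigned i = 0; i < N; ++i)
    in[i] = (float)(i % 20);
  return f_branchy (in) == f_select (in);
}
```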
[Bug target/99037] Invalid representation of vector zero in aarch64-simd.md
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99037 ktkachov at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #10 from ktkachov at gcc dot gnu.org --- This has been fixed in all active branches.
[Bug target/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Target Milestone|--- |12.0 CC||ktkachov at gcc dot gnu.org Priority|P3 |P1 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed then.
[Bug target/105162] [AArch64] outline-atomics drops dmb ish barrier on __sync builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #6 from ktkachov at gcc dot gnu.org --- Can you please send the patch to gcc-patches for review? It'll get more eyes there.
[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2022-01-14 Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Target|amdgcn-amdhsa |amdgcn-amdhsa, aarch64 --- Comment #9 from ktkachov at gcc dot gnu.org --- We're also seeing this on aarch64-none-elf with:

#include <stdlib.h>

void execute(int *y);
void foo (int n) {
  int *b = (int *)malloc((n - 1) * sizeof(int));
  execute(b);
  int n1 = 1.0 / (n - 1);
  for (int i = 0; i < n - 1; i++) {
    b[i] *= n1;
  }
}

compiled with -O2 -march=armv8-a+sve
[Bug other/79469] Feature request: provide `__builtin_assume` builtin function to allow more aggressive optimizations and to match clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79469 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2021-11-24 CC||aldyh at gcc dot gnu.org, ||ktkachov at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #3 from ktkachov at gcc dot gnu.org --- We've received requests from some users for this builtin as well. Given the new ranger infrastructure, would it be able to make use of the semantics of such a builtin in a useful way? (It'd be good to see GCC eliminate some redundant extensions, maybe threading opportunities could be improved etc)
[Bug tree-optimization/102652] Unnecessary zeroing out of local ARM NEON arrays
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102652 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2021-10-08 --- Comment #1 from ktkachov at gcc dot gnu.org --- Confirmed on the GCC 11 release. There is an active effort to improve the code generation for these intrinsics and current trunk produces:

bug:
        ldr     q5, [x1]
        sshr    v4.16b, v5.16b, 7
        mov     v0.16b, v5.16b
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0], 64
        ldr     q4, [x1, 16]
        mov     v0.16b, v4.16b
        sshr    v4.16b, v4.16b, 7
        mov     v1.16b, v4.16b
        mov     v2.16b, v4.16b
        mov     v3.16b, v4.16b
        st4     {v0.16b - v3.16b}, [x0]
        ret

Not optimal yet, but moving in the right direction.
[Bug tree-optimization/102324] ICE in initialize_matrix_A, at tree-data-ref.c:3959
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102324 ktkachov at gcc dot gnu.org changed: What|Removed |Added Target Milestone|--- |10.4 Target||aarch64 Known to fail||10.3.1, 11.1.1, 12.0
[Bug tree-optimization/102324] New: ICE in initialize_matrix_A, at tree-data-ref.c:3959
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102324 Bug ID: 102324 Summary: ICE in initialize_matrix_A, at tree-data-ref.c:3959 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: --- The AArch64 SVE ACLE testcase below:

#include <arm_sve.h>

svint8_t doit(svbool_t ptrue, svint8_t m0)
{
  auto combine_low = []( svint8_t in) -> svint8_t {
    int8_t data [2000];
    svst1(ptrue, (int8_t *)data, in);
    for (int _i = (int)svcntb()/2; _i < (int)svcntb(); ++_i)
      data[_i] = data[_i-(int)svcntb()];
    in = svld1(ptrue, data);
    return in;
  };
  return combine_low(m0);
}

ICEs with -march=armv8-a+sve -O2: ice.cc: In lambda function: ice.cc:4:5: internal compiler error: in initialize_matrix_A, at tree-data-ref.c:3959 4 | []( svint8_t in) -> svint8_t { | ^ 0x1ed2988 initialize_matrix_A $SRC/gcc/tree-data-ref.c:3959 0x1ed2965 initialize_matrix_A $SRC/gcc/tree-data-ref.c:3929 0x1ed8454 analyze_subscript_affine_affine $SRC/gcc/tree-data-ref.c:4361 0x1edb8fd analyze_siv_subscript $SRC/gcc/tree-data-ref.c:4703 0x1edb8fd analyze_overlapping_iterations $SRC/gcc/tree-data-ref.c:4933 0x1edb8fd subscript_dependence_tester_1 $SRC/gcc/tree-data-ref.c:5487 0x1edc10c subscript_dependence_tester $SRC/gcc/tree-data-ref.c:5537 0x1edc10c compute_affine_dependence(data_dependence_relation*, loop*) $SRC/gcc/tree-data-ref.c:5597 0x118ea4d loop_distribution::get_data_dependence(graph*, data_reference*, data_reference*) $SRC/gcc/tree-loop-distribution.c:1379 0x118eaba loop_distribution::data_dep_in_cycle_p(graph*, data_reference*, data_reference*) $SRC/gcc/tree-loop-distribution.c:1398 0x118ed49 loop_distribution::update_type_for_merge(graph*, partition*, partition*) $SRC/gcc/tree-loop-distribution.c:1441 0x118f927 loop_distribution::build_rdg_partition_for_vertex(graph*, int) $SRC/gcc/tree-loop-distribution.c:1485 0x118fb51
loop_distribution::rdg_build_partitions(graph*, vec, vec*) $SRC/gcc/tree-loop-distribution.c:1938 0x1191c19 loop_distribution::distribute_loop(loop*, vec const&, control_dependences*, int*, bool*, bool) $SRC/gcc/tree-loop-distribution.c:2984 0x11940f8 loop_distribution::execute(function*) $SRC/gcc/tree-loop-distribution.c:3353 0x119508d execute $SRC/gcc/tree-loop-distribution.c:3441 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <https://gcc.gnu.org/bugs/> for instructions.
[Bug target/102252] svbool_t with SVE can generate invalid assembly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252 ktkachov at gcc dot gnu.org changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |ktkachov at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #4 from ktkachov at gcc dot gnu.org --- Testing a patch
[Bug target/102252] svbool_t with SVE can generate invalid assembly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252 --- Comment #3 from ktkachov at gcc dot gnu.org --- The RTL for the offending insn:

(insn 9 8 10 (set (reg:VNx16BI 68 p0)
        (mem:VNx16BI (plus:DI (mult:DI (reg:DI 1 x1 [93])
                    (const_int 8 [0x8]))
                (reg/f:DI 0 x0 [92])) [2 work_3(D)->array[offset_4(D)]+0 S8 A16])) "asm.c":29:29 4465 {*aarch64_sve_movvnx16bi}
     (nil))

That addressing mode isn't valid for predicate loads. In aarch64.c:aarch64_classify_address, if we set allow_reg_index_p to false when vec_flags & VEC_SVE_PRED, that fixes it, but it will need more testing.
[Bug target/102252] svbool_t with SVE can generate invalid assembly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102252 ktkachov at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2021-09-09 Status|UNCONFIRMED |NEW CC||ktkachov at gcc dot gnu.org Target||aarch64 Ever confirmed|0 |1 --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed.
[Bug target/102226] [12 Regression] ICE with -O3 -msve-vector-bits=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102226 ktkachov at gcc dot gnu.org changed: What|Removed |Added Priority|P3 |P1 Summary|ICE with -O3|[12 Regression] ICE with |-msve-vector-bits=128 |-O3 -msve-vector-bits=128 Target Milestone|--- |12.0 Known to fail||12.0 Known to work||11.0 --- Comment #4 from ktkachov at gcc dot gnu.org --- Works in GCC 11
[Bug target/102226] ICE with -O3 -msve-vector-bits=128
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102226 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Last reconfirmed||2021-09-07 Status|UNCONFIRMED |NEW Ever confirmed|0 |1 --- Comment #3 from ktkachov at gcc dot gnu.org --- Reduced testcase:

template struct b { using c = a; };
template class> using f = b;
template class g> using h = typename f::c;
struct i {
  template using k = typename j::l;
};
struct m : i {
  using l = h;
};
class n {
public:
  char operator[](long o) {
    m::l s;
    return s[o];
  }
} p;
n r;
int q() {
  long d;
  for (long e; e; e++)
    if (p[e] == r[e])
      d++;
  return d;
}
[Bug target/95969] Use of __builtin_aarch64_im_lane_boundsi in AArch64 arm_neon.h interferes with gimple optimisation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95969 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #4 from ktkachov at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #3) > Created attachment 51396 [details] > Patch > > Simple patch which adds both generic and gimple level folding for > __builtin_aarch64_im_lane_boundsi. > In this case (and most likely others), __builtin_aarch64_im_lane_boundsi is > removed during early inlining so it will fix the majority of the issue. looks like the wrong patch was attached?
[Bug target/102066] aarch64: Suboptimal addressing modes for SVE LD1W, ST1W
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102066 --- Comment #2 from ktkachov at gcc dot gnu.org --- (In reply to rsand...@gcc.gnu.org from comment #1)
> > I guess the predicates and constraints in @aarch64_pred_mov in
> > aarch64-sve.md should allow for the scaled address modes
> They already allow them. I'm guessing this is an ivopts problem,
> in that it doesn't realise it can promote the unsigned iterator
> to uint64_t for a svcntw() step.

ah indeed

#include
void foo(int n, float *x, float *y)
{
  for (uint64_t i=0; i
[Bug target/102066] New: aarch64: Suboptimal addressing modes for SVE LD1W, ST1W
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102066 Bug ID: 102066 Summary: aarch64: Suboptimal addressing modes for SVE LD1W, ST1W Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org CC: rsandifo at gcc dot gnu.org Target Milestone: --- Target: aarch64

For the code:

#include
void foo(int n, float *x, float *y)
{
  for (unsigned i=0; i

I guess the predicates and constraints in @aarch64_pred_mov in aarch64-sve.md should allow for the scaled address modes
[Bug tree-optimization/101637] #pragma omp for simd defeats VECT_COMPARE_COSTS optimisations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101637 ktkachov at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 CC||ktkachov at gcc dot gnu.org Last reconfirmed||2021-07-27 Status|UNCONFIRMED |NEW --- Comment #2 from ktkachov at gcc dot gnu.org --- Confirmed, though it also needs -fopenmp to trigger for me
[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390 --- Comment #3 from ktkachov at gcc dot gnu.org --- (In reply to Richard Biener from comment #2) > scalar patterns are the appropriate way to do this There may be parts of the compiler I'm not familiar here, so apologies... By scalar patterns do you mean something in match.pd?
[Bug tree-optimization/101390] New: Expand vector mod as vector div + multiply-subtract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390 Bug ID: 101390 Summary: Expand vector mod as vector div + multiply-subtract Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: ktkachov at gcc dot gnu.org Target Milestone: ---

When the target supports an sdiv/udiv pattern for vector modes we could synthesise a vector modulus operation using the division and a multiply-subtract operation.

#define N 128
extern signed int si_a[N], si_b[N], si_c[N];

void test_si ()
{
  for (int i = 0; i < N; i++)
    si_c[i] = si_a[i] % si_b[i];
}

On AArch64, SVE (but not Neon) has vector SDIV/UDIV instructions and so could generate:

.L2:
        ld1w    z2.s, p0/z, [x4, x0, lsl 2]
        ld1w    z1.s, p0/z, [x3, x0, lsl 2]
        movprfx z0, z2
        sdiv    z0.s, p1/m, z0.s, z1.s
        msb     z0.s, p1/m, z1.s, z2.s
        st1w    z0.s, p0, [x1, x0, lsl 2]
        incw    x0
        whilelo p0.s, w0, w2
        b.any   .L2

This can be achieved by implementing the smod and umod optabs in the aarch64 backend for SVE, but this is a generic transformation, so it could be handled more generally in vect_recog_divmod_pattern and/or the vector lowering code so that more targets can benefit.
[Bug target/100441] [8/9 Regression] ICE in output_constant_pool_2, at varasm.c:3955
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100441 --- Comment #12 from ktkachov at gcc dot gnu.org --- Should be fixed on GCC 8 and 9 branches now?
[Bug tree-optimization/96974] [10/11 Regression] ICE in vect_get_vector_types_for_stmt compiling for SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974 --- Comment #13 from ktkachov at gcc dot gnu.org --- Fixed now?
[Bug target/99820] aarch64: ICE (segfault) in aarch64_analyze_loop_vinfo with -moverride=tune=use_new_vector_costs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99820 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from ktkachov at gcc dot gnu.org --- Fixed.
[Bug target/99822] [11 Regression] Assembler messages: Error: integer register expected in the extended/shifted operand register at operand 3 -- `adds x1,xzr,#2'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99822 ktkachov at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #4 from ktkachov at gcc dot gnu.org --- Fixed.