[Bug tree-optimization/111882] [13 Regression] : internal compiler error: in get_expr_operand in ifcvt with Variable length arrays and bitfields inside a struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111882

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Summary |[13/14/15 Regression] :     |[13 Regression] : internal
                   |internal compiler error: in |compiler error: in
                   |get_expr_operand in ifcvt   |get_expr_operand in ifcvt
                   |with Variable length arrays |with Variable length arrays
                   |and bitfields inside a      |and bitfields inside a
                   |struct                      |struct
     Known to work |                            |14.0

--- Comment #5 from avieira at gcc dot gnu.org ---
Fixed on gcc-14 (when it was trunk), so removing the 14 and 15 tags. Still
needs a backport to gcc-13.
[Bug target/114801] [14/15 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801

--- Comment #18 from avieira at gcc dot gnu.org ---
Sorry, to be clear: the 'here' in the last sentence refers to supporting masks
such as 0x to control the writing of the output register as the ISA allows,
rather than interpreting 0x and 0x as the same mask.

I'll also see if I can propose a change to the ACLE specs to make this
clearer.
[Bug target/114801] [14/15 Regression] arm: ICE in find_cached_value, at rtx-vector-builder.cc:100 with MVE intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114801

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #17 from avieira at gcc dot gnu.org ---
Before anything, it might be worth redefining the testcase to something where
the predicate would have an effect on the result, for instance:

#include

uint32x4_t test_9()
{
  return vdupq_m_n_u32(vdupq_n_u32(0x), 0, 0x);
}

Next, it might be worth pointing out that the ISA does specify what happens
when a predicate mask does not have all bits set for a specific element.
Basically, the predicate mask operates on a per-byte basis: there are 16 bits
in the mask, controlling the 16 bytes of a vector register. So for the above,
the expected output would be {0x, 0x, 0x, 0x}.

Having said that, I can see how you'd interpret the ACLE specs as defining
such a mask to be 'UB', but I believe the intent was to make clear that all
bits needed to be set if you wanted to true-predicate the full {32,16}-bit
element. That is the most common use; I can't imagine many users will be
manipulating the mask in such ways.

clang seems to follow this behavior, generating an assembly sequence that
leads to the expected output, though it uses vpsel, probably due to some
canonicalization. And I'd prefer to be consistent with clang here.
[Bug target/112787] Codegen regression of large GCC vector extensions when enabling SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112787 --- Comment #13 from avieira at gcc dot gnu.org --- They have both been backported, @Eric the tests should be passing again now.
[Bug target/112787] Codegen regression of large GCC vector extensions when enabling SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112787 --- Comment #12 from avieira at gcc dot gnu.org --- Sorry, missed that comment, thanks! I'll test backporting both.
[Bug target/112787] Codegen regression of large GCC vector extensions when enabling SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112787

--- Comment #10 from avieira at gcc dot gnu.org ---
First of all, apologies for this! I don't know why I didn't test this on
x86_64 too; I usually do for such backports.

Anyway, I checked locally and backporting
r14-2821-gd1c072a1c3411a6fe29900750b38210af8451eeb seems to be enough for
gcc-12. I'm testing it on gcc-13 and running full regression tests on both
x86_64 and aarch64, and will get back to you.

@Andrew, what made you think we also needed r14-2985-g04aa0edcace22a? Not to
say we may not want to backport it, but I'm just trying to figure out why it's
needed for this particular case.
[Bug ipa/113359] [13/14 Regression] LTO miscompilation of ceph on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #19 from avieira at gcc dot gnu.org ---
Should we update target and summary to also include x86_64?
[Bug tree-optimization/111478] [12 Regression] aarch64 SVE ICE: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111478

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #10 from avieira at gcc dot gnu.org ---
This has now been backported to gcc-13 and gcc-12, so I think we should close;
I'll leave that to Richard.
[Bug target/113229] [14 Regression] gcc.dg/torture/pr70083.c ICEs when compiled with -march=armv9-a+sve2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113229

--- Comment #6 from avieira at gcc dot gnu.org ---
Oh, forgot to mention: this is triggering because of the div optimization in
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=c69db3ef7f7d82a50f46038aa5457b7c8cc2d643

But I suspect that too is just an enabler and not the root cause? Unless we
aren't supposed to use subregs for SVE modes...
[Bug target/113229] [14 Regression] gcc.dg/torture/pr70083.c ICEs when compiled with -march=armv9-a+sve2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113229

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-01-05
    Ever confirmed |0                           |1
            Status |UNCONFIRMED                 |NEW

--- Comment #4 from avieira at gcc dot gnu.org ---
So I can confirm this ICE; it was exposed rather than caused by my patch.

The problem arises because it seems we have never tried to simplify a:
  (subreg: (subreg:<...> () N) M)

This makes simplify_subreg enter the 'if (GET_CODE (op) == SUBREG)' branch,
which calls 'paradoxical_subreg_p (VNx4SImode, OImode)', and that seems to
assume the two mode sizes are ordered, with an assert.

I am not sure what the right fix is here. I did check that changing
paradoxical_subreg_p to return false if the mode sizes are not ordered leads
to a bizarre failure: it looks like simplify_gen_subreg then just returns 0
rather than the original nested subregs.

Before I dig deeper I'll get richi and Richard S to comment.
[Bug target/113040] [14 Regression] libmvec test failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113040

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org

--- Comment #4 from avieira at gcc dot gnu.org ---
Yeah, my bad. For cases where we don't expose the definition, the new code
sequence doesn't add multiple vector parameters when the vector length of the
single parameter is less than the simdclone's simdlen. Testing a patch now.
[Bug tree-optimization/113026] Bogus -Wstringop-overflow warning on simple memcpy type loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113026

--- Comment #4 from avieira at gcc dot gnu.org ---
Drive-by comments, as it's been a while since I looked at this. I'm also
surprised we didn't adjust the bounds. But why do we only subtract VF? Like
you say, if there's no loop-around edge, can't we guarantee the epilogue will
only need to iterate at most VF-1 times? This is assuming we didn't take an
early exit; if we do, then we can't assume anything, as the iterations
'reset'.
[Bug target/112787] Codegen regression of large GCC vector extensions when enabling SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112787

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
    Ever confirmed |0                           |1
            Status |UNCONFIRMED                 |ASSIGNED
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org
   Last reconfirmed|                            |2023-11-30
            Target |                            |aarch64

--- Comment #1 from avieira at gcc dot gnu.org ---
The problem is that veclower tries to find the largest vector type it can use
for a particular element type, which, when SVE is enabled without a specified
vector length, will always be a VLA type. This then fails the check that it
has fewer elements than the type being used to do the computation, given that
a VLA element count is never 'known_lt' a constant one.

I am currently testing a patch that makes sure the mode selected does not have
more elements than the type we are trying to compute, given that it wouldn't
be used anyway.
[Bug target/112787] New: Codegen regression of large GCC vector extensions when enabling SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112787

            Bug ID: 112787
           Summary: Codegen regression of large GCC vector extensions when
                    enabling SVE
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---

When compiling:

typedef int __attribute__((__vector_size__ (64))) vec;

vec fn (vec a, vec b)
{
  return a + b;
}

with '-O2 -march=armv8-a' vs '-O2 -march=armv8-a+sve' the codegen defaults to
scalar rather than using Advanced SIMD vectors.
[Bug tree-optimization/112282] [14 Regression] wrong code (generated code hangs) at -O3 on x86_64-linux-gnu since r14-4777-g88c27070c25309
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112282

--- Comment #11 from avieira at gcc dot gnu.org ---
So I had a look at that u_lsm.72_510 variable and it's only undefined if we
don't loop, but if we don't loop then u_lsm_flag is set to 0 and we don't use
u_lsm. So it's OK. I also checked, and the early exits are covered by the same
mechanism.

So really the question is: why does irange think the range is [-21, 0]?
Anyone have an idea of how to debug this?
[Bug tree-optimization/112282] [14 Regression] wrong code (generated code hangs) at -O3 on x86_64-linux-gnu since r14-4777-g88c27070c25309
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112282

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #9 from avieira at gcc dot gnu.org ---
So I had a look at this and this is as far as I got. It seems to get stuck in
the 'for (u = -22; u < 2; ++u)' loop. It looks like the loop IV never gets
updated and it keeps looping.

Looking at the codegen it seems that cunroll decides to remove A LOT of code
and there is now:

bb 4:
  ..
  # ivtmp_1055 = PHI
  ..
bb 24:
  ...
  ivtmp_1056 = ivtmp_1055 - 1;
  goto ; [100.00%]

I've not yet been able to figure out why this happens; the dumps weren't very
helpful. So I tried -fdisable-tree-cunroll, and it was still failing. So I
looked at the dumps to try and see what was turning this loop into an
infinite loop, and vrp2 shows me:

Global Exported: _19 = [irange] int [-21, 0]
Folding predicate _19 != 2 to 1

and in the dump before vrp2 we see:

  [local count: 7354175]:
  # u.13_485 = PHI <_19(105), -22(3)>
  # u_lsm.72_510 = PHI <_19(105), _497(D)(3)>
  # u_lsm_flag.73_235 = PHI <1(105), 0(3)>
  ...

  [local count: 6634488]:
  al ={v} {CLOBBER(eol)};
  _19 = u.13_485 + 1;
  if (_19 != 2)
    goto ; [96.34%]
  else
    goto ; [3.66%]

  [local count: 6391666]:
  goto ; [100.00%]

Something to point out here: that u_lsm.72_510 seems odd. It is used to set
global 'u', but it's initialized with _497(D), which is undefined... So that
itself seems wrong to me too... I'll try to find out what's causing that
codegen next. Maybe that can explain why the irange for _19 is so wrong here.
[Bug tree-optimization/111882] [13/14 Regression] : internal compiler error: in get_expr_operand in ifcvt with Variable length arrays and bitfields inside a struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111882

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org

--- Comment #3 from avieira at gcc dot gnu.org ---
Taking this; first time I see a SAVE_EXPR. It looks like it indicates
side-effects, so I'm gonna see if I can detect the presence of side-effects
and reject lowering if so. Does that sound OK?
[Bug plugins/110610] [14 Regression] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
        Resolution |---                         |FIXED
            Status |ASSIGNED                    |RESOLVED

--- Comment #11 from avieira at gcc dot gnu.org ---
This should fix it. David, please reopen if the problem still persists.
[Bug plugins/110610] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-07-10
    Ever confirmed |0                           |1
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org
            Status |UNCONFIRMED                 |ASSIGNED
[Bug plugins/110610] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

--- Comment #8 from avieira at gcc dot gnu.org ---
I'll try adding it to one of the header file lists in gcc's Makefile. Probably
the INTERNAL_FN_H one.
[Bug plugins/110610] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

--- Comment #7 from avieira at gcc dot gnu.org ---
> I guess you mean insn-opinit.h, not internal-fn.h. internal-fn.h is in the
> GCC Git repo.

Yeah sorry! I did mean insn-opinit.h.

> We are already installing insn-{addr,attr-common,attr,codes,...}.h anyway.

Fair!
[Bug plugins/110610] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

--- Comment #5 from avieira at gcc dot gnu.org ---
internal-fn.h is generated at gcc build-time. I'm not sure we want to
'install' it with a gcc install. It might make more sense to trigger the
generation of it when building this gcc-plugin. But I'm not sure... I'll ask
around the community and see what people think.
[Bug plugins/110610] File insn-opinit.h not installed ?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110610

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #2 from avieira at gcc dot gnu.org ---
I can't reproduce this, but it seems like the modula2 build also suffers from
the same issue, see PR110284.

David, what exactly are you trying to build? Can you give us the configure
command?
[Bug tree-optimization/110557] [13/14 Regression] Wrong code for x86_64-linux-gnu with -O3 -mavx2: vectorized loop mishandles signed bit-fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110557

--- Comment #5 from avieira at gcc dot gnu.org ---
Hi Xi,

Feel free to test your patch and submit it to the list for review. I had a
look over it and it looks correct to me.

I feel like it also addresses the cases where the bitfield is 'sandwiched',
like:

  int x : 7;
  ptrdiff_t y : 56;
  long long z : 1;

since you left-shift it; and it also addresses the case where you have both
sign-extension and widening, because you still transform the type into
signed. But it might be nice to add tests to cover those two, just in case
someone changes this.

In the future, if you do plan to work on something, it would be nice to let
people know on the bugzilla ticket (preferably by assigning it to yourself) so
that multiple people don't end up working on the same thing. I had started to
write a patch, but wasn't as far along as you, and I like your approach :)
[Bug tree-optimization/110557] [13/14 Regression] Wrong code for x86_64-linux-gnu with -O3 -mavx2: vectorized loop mishandles signed bit-fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110557

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org

--- Comment #2 from avieira at gcc dot gnu.org ---
I'll have a look.
[Bug tree-optimization/110436] [14 Regression] ICE in vectorizable_live_operation, at tree-vect-loop.cc:10170
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110436 --- Comment #4 from avieira at gcc dot gnu.org --- Meant to say I'll look at it ;)
[Bug tree-optimization/110436] [14 Regression] ICE in vectorizable_live_operation, at tree-vect-loop.cc:10170
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110436

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Assignee |unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org

--- Comment #3 from avieira at gcc dot gnu.org ---
I
[Bug tree-optimization/110310] vector epilogue handling is inefficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110310

--- Comment #4 from avieira at gcc dot gnu.org ---
> OK, so I take away from this that you don't think this is done the way it is
> on purpose.

I don't think so; I think I just found a place where it was safe to do so,
i.e. where we knew the vectorization factor would not change afterwards. I
have a vague recollection that vect_analyze_loop used to be somewhat more
complex, but given the clear separation between main loop and epilogue vinfo
selection we have now, we could probably do this as we analyze loop_vinfos for
the epilogue? That assumes that during analysis we've determined vf, peeling
and use of masks, which I'm pretty sure we have. Might be worth asking Richard
Sandiford if he can think of anything that we might not be 'fixing' during
analysis.
[Bug tree-optimization/110310] vector epilogue handling is inefficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110310

--- Comment #2 from avieira at gcc dot gnu.org ---
I can't remember the exact reason either, though I do vaguely remember niter
updating being something that we felt 'needed more future work' at the time.

Just a side question: AVX512 has predication, right? So how come you are
expecting an epilogue?

I'm also curious about the condition in that snippet of code. 'known_eq (vf,
lowest_vf)' seems odd... lowest_vf is by definition constant, so known_eq only
succeeds if vf is constant and the same as lowest_vf. But lowest_vf is the
constant lower bound of vf, i.e. that seems like a very convoluted way of
doing vf.is_constant (_vf)? Maybe this helper function wasn't around back
then.

Either way, it feels like we shouldn't be doing this if loop_vinfo is
predicated? But I also agree that we probably want to be doing all of this
during analysis; it seems odd to be ruling out loop_vinfos during
transformation.
[Bug middle-end/110142] [14 Regression] x264 from SPECCPU 2017 miscompares from g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110142

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Status |UNCONFIRMED                 |RESOLVED
        Resolution |---                         |FIXED

--- Comment #4 from avieira at gcc dot gnu.org ---
I believe that fixes the issue.
[Bug middle-end/110142] [14 Regression] x264 from SPECCPU 2017 miscompares from g:2f482a07365d9f4a94a56edd13b7f01b8f78b5a0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110142

--- Comment #1 from avieira at gcc dot gnu.org ---
Found the issue to be with passing a subtype to vect_recog_widen_op_pattern in
vect_recog_widen_{plus,minus}_pattern, where we didn't before. Removing those
and letting it default to a NULL pointer seems to fix the codegen issue. Will
test patches locally and send the patch in when done.
[Bug tree-optimization/109543] Avoid using BLKmode for unions with a non-BLKmode member when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109543

--- Comment #3 from avieira at gcc dot gnu.org ---
Err that should be 'double d[4];' so:

typedef struct { float __attribute__ ((vector_size(16))) v[2]; } STRUCT;
#ifdef GOOD
typedef STRUCT TYPE;
#else
typedef union { STRUCT s; double d[4]; } TYPE;
#endif
[Bug tree-optimization/109543] Avoid using BLKmode for unions with a non-BLKmode member when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109543

--- Comment #2 from avieira at gcc dot gnu.org ---
Sorry for the delay. Here's the typedefs with GNU vectors:

typedef struct { float __attribute__ ((vector_size(16))) v[2]; } STRUCT;
#ifdef GOOD
typedef STRUCT TYPE;
#else
typedef union { STRUCT s; double d[2]; } TYPE;
#endif

To be fair I suspect you could see similar behaviour with just 16-byte
vectors, but with aarch64 the backend will know to use 64-bit scalar moves for
128-bit BLKmodes, though even then, picking the vector mode would result in
more optimal (single vector move) code.
[Bug tree-optimization/109543] New: Avoid using BLKmode for unions with a non-BLKmode member when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109543

            Bug ID: 109543
           Summary: Avoid using BLKmode for unions with a non-BLKmode
                    member when possible
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---

Hi,

So with the following C-code:

$ cat t.c
#include

#ifdef GOOD
typedef float64x2x2_t TYPE;
#else
typedef union
{
  float64x2x2_t v;
  double d[4];
} TYPE;
#endif

void foo (TYPE *a, TYPE *b, TYPE *c, unsigned n)
{
  TYPE X = a[0];
  TYPE Y = b[0];
  TYPE Z = c[0];
  for (unsigned i = 0; i < n; ++i)
    {
      TYPE temp = X;
      X = Y;
      Y = Z;
      Z = temp;
    }
  a[0] = X;
  b[0] = Y;
  c[0] = Z;
}

If compiled for aarch64 with -DGOOD the compiler will use vector register
moves in the loop, whereas without -DGOOD it will use the stack with memmoves.

The reason for this is that, when picking the mode to address a UNION with,
gcc will always choose BLKmode as soon as any member of the UNION is BLKmode.
In such cases I think it would be safe to go with the non-BLKmode mode of a
member that has the same size as the entire UNION?
[Bug tree-optimization/108888] [13 Regression] error: definition in block 26 follows the use
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #6 from avieira at gcc dot gnu.org ---
After this patch, Andrew Stubbs' patch
(https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3da77f217c8b2089ecba3eb201e727c3fcdcd19d)
to use in-branch simd-clones for cases like the one in
gcc/testsuite/gcc.dg/vect/vect-simd-clone-16.c no longer works.

I believe this is because this patch changes the 'if (gimple_call ..)' into an
'else if (...is_gimple_call (stmt))', which doesn't work because stmt will be
0 (it's a dyn_cast of gassign).

I'm testing a patch locally to fix this.
[Bug target/98850] ICE in expand_debug_locations, at cfgexpand.c:5458
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98850 --- Comment #2 from avieira at gcc dot gnu.org --- I failed to reproduce it with a trunk build of arm-none-linux-gnueabihf.
[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #5 from avieira at gcc dot gnu.org ---
I'm slightly confused here: on entry to BB 5 we know the opposite of _1 < 0.0,
no? If we branch to BB 5 we know !(_1 < 0.0), so we can't fold _1 <= 1.0; we
just know that the range of _1 is >= 0.0. Or am I misreading? I've not tried
compiling it myself, just going off the code both of you posted here.
[Bug tree-optimization/109230] [13 Regression] Maybe wrong code for opus package on aarch64 since r13-4122-g1bc7efa948f751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109230

--- Comment #9 from avieira at gcc dot gnu.org ---
Hmm, I was seeing the change in opus_ifft, but that does look like different
codegen :/ I might not be looking at the right thing.

That transformation definitely looks wrong though, as the selection selects 3
values from the first vector (which is the result of the plus), and the fneg
would negate 2 values, right?
[Bug tree-optimization/109230] [13 Regression] Maybe wrong code for opus package on aarch64 since r13-4122-g1bc7efa948f751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109230

--- Comment #6 from avieira at gcc dot gnu.org ---
Thanks! My initial investigation has led me to think the change is being
caused at vrp2, which is the only time the pattern gets triggered with -O2.
The tree before the pass (at the place where the transformation happens):

  vect__83.466_787 = VEC_PERM_EXPR ;
  vect__87.467_786 = vect__81.462_791 * vect__83.466_787;
  vect__91.469_784 = vect__84.458_794 - vect__87.467_786;
  vect__88.468_785 = vect__84.458_794 + vect__87.467_786;
  _783 = VEC_PERM_EXPR ;
  ...
  vect__96.470_782 = vect__95.450_800 - _783;

after the pass:

  vect__83.466_787 = VEC_PERM_EXPR ;
  vect__87.467_786 = vect__83.466_787 * vect__81.462_791;
  vect__91.469_784 = vect__84.458_794 - vect__87.467_786;
  vect__88.468_785 = vect__87.467_786 + vect__84.458_794;
  _756 = VIEW_CONVERT_EXPR(vect__87.467_786);
  _755 = -_756;
  _739 = VIEW_CONVERT_EXPR(_755);
  _783 = _739 + vect__84.458_794;
  ...
  vect__96.470_782 = vect__95.450_800 - _783;

So before we had _783 = the first element of vect__88 and the second element
of vect__91. These are respectively:

  vect__88 = vect__84 + vect__87
  vect__91 = vect__84 - vect__87

so _783 = {vect__84[0] + vect__87[0], vect__84[1] - vect__87[1]}.

After the pass, _783 = _739 + vect__84. This is where I don't know if I'm
reading the optimization correctly, but it says all 'even' lanes are negated.
Does that mean we end up with:

  _739 = { -vect__87[0], vect__87[1] }

If so, then that's why we have a wrong result, as you want to negate lane 1,
not lane 0. Otherwise, if lane 1 is the one that gets negated, then it should
be OK, as you'd get:

  _783 = { vect__87[0] + vect__84[0], -vect__87[1] + vect__84[1] }

Now obviously that's assuming -a + b == b - a (not sure if that's true with
floating point errors etc).
[Bug tree-optimization/109230] [13 Regression] Maybe wrong code for opus package on aarch64 since r13-4122-g1bc7efa948f751
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109230

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC |                            |avieira at gcc dot gnu.org

--- Comment #3 from avieira at gcc dot gnu.org ---
Hi Martin, what options do you build these tests with?
[Bug tree-optimization/109005] [13 Regression] ICE during GIMPLE pass: ifcvt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109005

--- Comment #21 from avieira at gcc dot gnu.org ---
Something else that might be obvious: how do I create a minimal
ifcvt_demo.adb file that uses the .ads, so that I can add it as a testcase to
gcc? The testsuite seems to pick up .adb files only.
[Bug tree-optimization/109005] [13 Regression] ICE during GIMPLE pass: ifcvt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109005 --- Comment #20 from avieira at gcc dot gnu.org --- It's probably obvious to people that know Ada, so I just have to apologize for my ignorance in that area :)
[Bug tree-optimization/109005] [13 Regression] ICE during GIMPLE pass: ifcvt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109005

--- Comment #15 from avieira at gcc dot gnu.org ---
@richi: Yeah, and as I mentioned on IRC, I can confirm it fixes the issue. I
also bootstrapped and regression tested the change on
aarch64-unknown-linux-gnu.

Simon, I can't compile your minimal reproducer. First it complains about
missing the body keyword, so I added that, but then it complains about a
missing ifcvt_demo.ads; I tried adding an empty one, but that didn't work.
[Bug tree-optimization/109005] [13 Regression] ICE during GIMPLE pass: ifcvt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109005 --- Comment #8 from avieira at gcc dot gnu.org --- Oh nvm... you did.
[Bug tree-optimization/109005] [13 Regression] ICE during GIMPLE pass: ifcvt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109005

--- Comment #7 from avieira at gcc dot gnu.org ---
I'm still trying to build Ada to reproduce this. Could you try
'p debug_tree (var)'? If var is a SSA_NAME, debug won't print anything. If it
comes back as not 0, could you also do 'p debug_tree (TREE_TYPE (var))'?

Thank you! I'll keep trying to build Ada locally to see if I can debug this
too.
[Bug target/96342] [SVE] Add support for "omp declare simd"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96342

--- Comment #10 from avieira at gcc dot gnu.org ---
Yang, I assume you are no longer working on this?
[Bug target/107987] [12 Regression] MVE vcmpq vector-scalar can trigger ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107987

avieira at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
        Resolution |---                         |FIXED
            Status |UNCONFIRMED                 |RESOLVED

--- Comment #3 from avieira at gcc dot gnu.org ---
Fixed in GCC-13 and backported to GCC-12, closing.
[Bug target/108443] New: arm: MVE wrongly re-interprets predicate constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108443

            Bug ID: 108443
           Summary: arm: MVE wrongly re-interprets predicate constants
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---

Compiling:

$ cat t.c
#include

uint32x4_t foo (uint32_t *a)
{
  mve_pred16_t p = 0x00cc;
  return vldrwq_z_u32 (a, p);
}

with:

$ arm-none-eabi-gcc -march=armv8.1-m.main+mve -mfloat-abi=hard -O2 -S

will yield:

foo:
        mov     r3, #-4  @ movhi
        vmsr    p0, r3  @ movhi
        vpst
        vldrwt.32       q0, [r0]
        bx      lr

That leads to a P0 mask of 0xFFFC and not 0x00CC as it should be.
[Bug target/108442] arm: MVE's vld1* and vst1* do not work when __ARM_MVE_PRESERVE_USER_NAMESPACE is defined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108442

--- Comment #1 from avieira at gcc dot gnu.org ---
This fails equally for any vld1*/vst1* intrinsic.
[Bug target/108442] New: arm: MVE's vld1* and vst1* do not work when __ARM_MVE_PRESERVE_USER_NAMESPACE is defined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108442

            Bug ID: 108442
           Summary: arm: MVE's vld1* and vst1* do not work when
                    __ARM_MVE_PRESERVE_USER_NAMESPACE is defined
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---

When compiling:

$ cat t.c
#include

uint32x4_t foo (uint32_t *p)
{
  return __arm_vld1q_u32 (p);
}

with:

$ arm-none-eabi-gcc -march=armv8.1-m.main+mve -mfloat-abi=hard
-D__ARM_MVE_PRESERVE_USER_NAMESPACE

it will fail to compile, as __arm_vld1q_u32 is defined in arm_mve.h as calling
vldrwq_u32, which will not exist when __ARM_MVE_PRESERVE_USER_NAMESPACE is
defined.
[Bug target/108177] MVE predicated stores to same address get optimized away
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108177

--- Comment #3 from avieira at gcc dot gnu.org ---
The architecture describes it as only writing the true-predicated bytes and
leaving the others untouched. I guess reading and writing to the same memory
is the best we can do to 'mimic' that in RTL. SVE does the same as x86, so
I'll try that approach over unspec_volatile.
[Bug target/108177] MVE predicated stores to same address get optimized away
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108177

--- Comment #1 from avieira at gcc dot gnu.org ---
I noticed that SVE stores seem to be marked as volatile memory accesses; I
suspect it's because they are represented using masked stores, which probably
are by definition volatile (for this reason?). A fix for this for now, before
MVE starts using masked-store patterns, would be to use unspec_volatile for
such stores.
[Bug target/108177] New: MVE predicated stores to same address get optimized away
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108177

            Bug ID: 108177
           Summary: MVE predicated stores to same address get optimized
                    away
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avieira at gcc dot gnu.org
  Target Milestone: ---

GCC currently generates wrong code for predicated MVE stores to the same
address, like:

#include

uint8x16_t foo (uint8x16_t a, uint8_t *pa, mve_pred16_t p1, mve_pred16_t p2)
{
  vstrbq_p_u8 (pa, a, p1);
  vstrbq_p_u8 (pa, a, p2);
}

With 'gcc -mcpu=cortex-m55 -mfloat-abi=hard -O3' it will only generate the
second MVE store. Though if (p2 | p1) != p2, then the second store will not
fully overwrite the first.
[Bug target/107987] New: [12/13 Regression] MVE vcmpq vector-scalar can trigger ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107987 Bug ID: 107987 Summary: [12/13 Regression] MVE vcmpq vector-scalar can trigger ICE Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Using the following testcase $ cat t.c #include <arm_mve.h> uint32x4_t foo (uint32x4_t a, uint32x4_t b) { mve_pred16_t p = vcmpneq_n_u32 (vandq_u32 (a, b), 0); return vaddq_x_u32 (a, b, p); } and compiling with arm-none-eabi-gcc -mcpu=cortex-m55 -mfloat-abi=hard -O2 will trigger an ICE in combine. This was caused by g:d083fbf72d4533d2009c725524983e1184981e74, as removing the unspecs around the vcmps exposed the compiler to a comparison operator with a vector and a scalar operand.
[Bug tree-optimization/107808] gcc.dg/vect/vect-bitfield-write-2.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107808 --- Comment #2 from avieira at gcc dot gnu.org --- Hi Rainer, I suspect this means SPARC should be added to the list of targets that fail check_effective_target_vect_long_long. From the dump it looks like the target doesn't support a long long vectype.
[Bug tree-optimization/107326] [13 Regression] ICE: verify_gimple failed (error: type mismatch in binary expression) since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107326 avieira at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #7 from avieira at gcc dot gnu.org --- Closing this then.
[Bug libgcc/107678] New: [13 Regression] Segfault in aarch64_fallback_frame_state
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107678 Bug ID: 107678 Summary: [13 Regression] Segfault in aarch64_fallback_frame_state Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgcc Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Hi, We ran into a segfault when running SPEC 2017 Parest for aarch64-none-linux-gnu on a Neoverse V1 target after g:146e45914032 These are the relevant frames of the segfault: #0 0x8bd2dd04 in aarch64_fallback_frame_state (context=0xe11f6e10, fs=0xe11f71d0) at ./md-unwind-support.h:74 #1 uw_frame_state_for (context=context@entry=0xe11f6e10, fs=fs@entry=0xe11f71d0) at .../libgcc/unwind-dw2.c:1275 #2 0x8bd2f0ec in _Unwind_RaiseException (exc=0x36b105d0) at .../libgcc/unwind.inc:104 #3 0x8be8d6b4 in __cxxabiv1::__cxa_throw (obj=, tinfo=0x56bf58 , dest=0x468c00 ) at .../libstdc++-v3/libsupc++/eh_throw.cc:93 We do not see the same failure for a NEON only run, so the size of the vectors could be a hint? But I haven't confirmed this.
[Bug tree-optimization/107326] [13 Regression] ICE: verify_gimple failed (error: type mismatch in binary expression) since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107326 --- Comment #5 from avieira at gcc dot gnu.org --- It looks that way on my end, but I'll let Arseny confirm.
[Bug tree-optimization/107346] [13 Regression] gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 --- Comment #9 from avieira at gcc dot gnu.org --- Hi Eric, I realised the same, got a patch pending here: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604139.html
[Bug tree-optimization/107346] [13 Regression] gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 --- Comment #6 from avieira at gcc dot gnu.org --- > There are no differences between gnat1 and cc1/cc1plus as far as dumps are > concerned, e.g. -fdump-tree-optimized creates the .optimized dump. This was my bad: I'm not used to using cc1 directly, I usually go through the driver, so I didn't realize it was putting the dumps in the same place as the source file.
[Bug tree-optimization/107346] [13 Regression] gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 --- Comment #4 from avieira at gcc dot gnu.org --- Funnily enough, if I transform the Int24 into a 32-bit integer in the testcase and disable all bitfield lowering just to make sure, I get the same failure. I tried using __attribute__((packed)) in C to reproduce this, but I keep getting a 32-bit offset... Either way, I will test a patch where vect_check_gather_scatter bails out if pbitpos isn't a multiple of BITS_PER_UNIT.
[Bug tree-optimization/107346] [13 Regression] gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 --- Comment #3 from avieira at gcc dot gnu.org --- I am wondering whether I should try to support this, or bail out of vect_check_gather_scatter if pbitpos is not a multiple of BITS_PER_UNIT. The latter obviously feels safer.
[Bug testsuite/107338] new test case gcc.dg/vect/vect-bitfield-read-7.c in r13-3413-ge10ca9544632db fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107338 --- Comment #3 from avieira at gcc dot gnu.org --- Hi Kewen, I believe you are right. I was waiting for a powerpc machine in the board farm, but I suspect I can reproduce this with an aarch64 BE target and I should be able to confirm. But your reasoning seems valid to me. Because of the widening, shift_n becomes 32 - shift_n - mask_width, but the start of the bitfield didn't move by widening the container, so it is still 16 - shift_n - mask_width bits away from the start of the container. Moving the calculation before the widening seems like the neatest solution to me; there's no point in keeping the old type around I think. Do you want to produce a patch for this, seeing as you solved it?
[Bug tree-optimization/107346] gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 avieira at gcc dot gnu.org changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2022-10-21 --- Comment #1 from avieira at gcc dot gnu.org --- I've tracked this down to 'vect_check_gather_scatter's pbytepos calculation: poly_int64 pbytepos = exact_div (pbitpos, BITS_PER_UNIT); Where pbitpos is 4, and that triggers an assert in exact_div. I am not sure what the best fix would be here. The stmt this fails on is: _ifc__23 = (*x_7(D))[_1].b.D.3707; But I am having trouble debugging this as I can't seem to break on vect_recog_bit_insert_pattern and I haven't figured out how to get gnat1 to create dumps :(
[Bug tree-optimization/107346] New: gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107346 Bug ID: 107346 Summary: gnat.dg/loop_optimization23_pkg.ad failure after r13-3413-ge10ca9544632db Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- As reported by Eric in https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603356.html
[Bug tree-optimization/107326] [13 Regression] ICE: verify_gimple failed (error: type mismatch in binary expression) since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107326 --- Comment #2 from avieira at gcc dot gnu.org --- Hi Arseny, Apologies for this, I thought I had caught this with testing, but seems I had not. I am testing a fix right now.
[Bug tree-optimization/107275] [13 Regression] Recent ifcvt changes resulting in references to SSA_NAME on free list
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107275 --- Comment #3 from avieira at gcc dot gnu.org --- The prodding helped! The problem is that dce was indeed removing the ASM as it wasn't recognizing it as a stmt that was live. This is because ifcvt would have normally bailed out when encountering such an asm stmt when doing 'find_data_references_in_loop'. I have a patch that fixes this, will test it and post it upstream. My plan is to bring forward the references check, as we do not need to lower bitfields if that fails, given loop-vectorization will fail altogether anyway.
[Bug tree-optimization/107275] [13 Regression] Recent ifcvt changes resulting in references to SSA_NAME on free list
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107275 avieira at gcc dot gnu.org changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2022-10-17 Status|UNCONFIRMED |NEW --- Comment #2 from avieira at gcc dot gnu.org --- ifcvt's dce seems to be removing the asm, which is rather odd... Moving the 'struct device_link *link;' outside of the function, making it a global, seems to give a different ICE too, related to vdefs. So I suspect my vdef/vuse update is confusing things. I've never quite understood what and how the vdef/vuse update is supposed to happen; update_stmt used to be my go-to fix-all, but that doesn't seem to be helping. As a side-note, I also noticed that doing the gimple_move_vops after inserting seems to yield different results as well... Just to say I am nowhere yet; if anyone has an idea of what might be going wrong I welcome the suggestion, in the meantime I'll continue prodding this.
[Bug tree-optimization/107275] [13 Regression] Recent ifcvt changes resulting in references to SSA_NAME on free list
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107275 avieira at gcc dot gnu.org changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |avieira at gcc dot gnu.org --- Comment #1 from avieira at gcc dot gnu.org --- I'll have a look, thank you for the reduced testcase!
[Bug testsuite/107240] [13 Regression] FAIL: gcc.dg/vect/vect-bitfield-write-2.c since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107240 --- Comment #4 from avieira at gcc dot gnu.org --- Might be worth posting the output of -fdump-tree-vect-all; it might be failing to vectorize due to some specific missing feature that we can test for.
[Bug testsuite/107240] [13 Regression] FAIL: gcc.dg/vect/vect-bitfield-write-2.c since r13-3219-g25413fdb2ac249
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107240 --- Comment #2 from avieira at gcc dot gnu.org --- Hi Seurer, Peter, Adding something like: { xfail { powerpc*-*-* && { ! powerpc_vsx_ok } } } should xfail all powerpc architectures that don't support this, no?
[Bug tree-optimization/107226] [13 regression] r13-3219-g25413fdb2ac249 caused a lot of testcase failures
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107226 avieira at gcc dot gnu.org changed: What|Removed |Added Last reconfirmed||2022-10-12 Status|UNCONFIRMED |NEW Ever confirmed|0 |1 --- Comment #1 from avieira at gcc dot gnu.org --- So this is a regression because SLP is using the new patterns for BIT_FIELD_REFs of vectors. Seeing that I never actually found a good use of supporting non-integral container types, I will just remove that, and that will cause the pattern to not match BIT_FIELD_REFs of vectors. I'll go test those changes.
[Bug tree-optimization/107229] [13 Regression] ICE at -O1 and -Os with "-ftree-vectorize": verify_gimple failed since r13-3219-g25413fdb2ac24933
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107229 --- Comment #2 from avieira at gcc dot gnu.org --- So it seems I should have taken DECL_FIELD_OFFSET into account when computing the bitpos in get_bitfield_rep (tree-if-conv.cc). I am testing a patch for this whilst I also look at the failures in PR107226.
[Bug tree-optimization/105219] [12 Regression] SVE: Wrong code with -O3 -msve-vector-bits=128 -mtune=thunderx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105219 --- Comment #18 from avieira at gcc dot gnu.org --- (In reply to Richard Biener from comment #16) > (In reply to rsand...@gcc.gnu.org from comment #15) > > (In reply to Richard Biener from comment #14) > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > > index d7bc34636bd..3b63ab7b669 100644 > > > --- a/gcc/tree-vect-loop.cc > > > +++ b/gcc/tree-vect-loop.cc > > > @@ -9977,7 +9981,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, > > > gimple > > > *loop_vectorized_call) > > > lowest_vf) - 1 > > >: wi::udiv_floor (loop->nb_iterations_upper_bound + > > > bias_for_lowest, > > > lowest_vf) - 1); > > > - if (main_vinfo) > > > + if (main_vinfo && !main_vinfo->peeling_for_alignment) > > > { > > > unsigned int bound; > > > poly_uint64 main_iters > > It might be better to add the maximum peeling amount to main_iters. > > Maybe you'd prefer this anyway for GCC 12 though. > > > > I wonder if there's a similar problem for peeling for gaps, > > in cases where the epilogue doesn't need the same peeling. > > I don't quite understand the code in if (main_vinfo) but the point is > that for our case main_iters is zero (and so is prologue_iters if that > would exist). I'm not sure how the code can be adjusted with that > given it computes upper bounds and uses min() for the upper bound > of the epilogue - we'd need to adjust that with a max (2*vf-2, > old-upper-bound) > when there's prologue peeling and the short cut exists (I don't actually > compute that). > > peeling for gaps means we run the epilogue for main VF more iterations, > but that would just mean the vectorized epilogue executes one more time > and has peeling for gaps applied as well, so the scalar epilogue runs > for epilogue VF more iterations. > > I'm not sure what conditions prevent epilogue vectorization but I think > there were some at least. I think disabling this for peeling makes sense for now, but just to explain how the code works. 
The perhaps misnamed 'main_iters' represents the maximum number of iterations left to do after the main loop, either entered or not. The maximum number of iterations left to do after the main loop is the largest of the three: - the main loop's VF, in case we enter the main loop there are at most VF-1 iterations left (I see I didn't add a -1 there). - LOOP_VINFO_COST_MODEL_THRESHOLD or LOOP_VINFO_VERSIONING_THRESHOLD in case we don't enter the main loop because we don't have enough iterations to meet these (but do still have enough for the epilogue). Our problem is that this didn't take peeling into account, since skipping main -> skipping peeling, and thus really the number of iters we could be left with after skipping main are actually main_iters + to_peel. So I think the approach should be to add 'to_peel' to main_iters where 'to_peel' is either: VF - 1 if PEELING_FOR_GAPS or PEELING_FOR_ALIGNMENT = -1 PEELING_FOR_ALIGNMENT otherwise. But like I said first, disabling is probably the safest and easiest for gcc 12, and given the niche of this, I'm not even sure it's worth tightening it for gcc 13?
[Bug target/105157] [12 Regression] compile-time regressions with generic tuning since r12-7756-g27d8748df59fe6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105157 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #11 from avieira at gcc dot gnu.org --- The commit above should have fixed the issue. Let me know if you still observe the higher compile-time in your nightlies.
[Bug target/105157] [12 Regression] compile-time regressions with generic tuning since r12-7756-g27d8748df59fe6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105157 --- Comment #9 from avieira at gcc dot gnu.org --- Found the issue, it's due to the way we encode TARGET_CPU_DEFAULT in aarch64, it is only able to support 64 cores and we have 65 now. Testing a work around for now and we have plans to fix this properly in GCC 13.
[Bug rtl-optimization/104498] Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 avieira at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #10 from avieira at gcc dot gnu.org --- Should be fixed with latest patch.
[Bug rtl-optimization/104498] Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 --- Comment #7 from avieira at gcc dot gnu.org --- And I was thinking it didn't know how to handle anchor + offset... Anyway if I just record the swap and use it to invert the distance calculation that seems to 'work' for the testcase. I'm happy to go bootstrap it, or would you rather fix this some other way?
[Bug rtl-optimization/104498] Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 --- Comment #5 from avieira at gcc dot gnu.org --- You mean this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92294? It only works for direct symbols; I think it never enters the block under: if (GET_CODE (x) == SYMBOL_REF && GET_CODE (y) == SYMBOL_REF) which is where he made his changes. I'll go try to understand his changes better; I've only had a quick look over.
[Bug rtl-optimization/104498] Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 --- Comment #3 from avieira at gcc dot gnu.org --- Sorry some confusion there, I thought it was base_alias_check bailing out early, but that seems to return true, it is the memrefs_conflict_p that returns 0. I suspect rtx_equal_for_memref_p should have returned 1 for: x: (plus:DI (mult:DI (reg:DI 99 [ off.0_1 ]) (const_int 4 [0x4])) (const:DI (plus:DI (symbol_ref:DI ("*.LANCHOR0") [flags 0x182]) (const_int 16 [0x10] and y: (plus:DI (mult:DI (reg:DI 99 [ off.0_1 ]) (const_int 4 [0x4])) (symbol_ref:DI ("b") [flags 0x2] )) But it does not... must be because of that trailing (equivalence notes? that's what I assume they are?)
[Bug rtl-optimization/104498] Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 --- Comment #1 from avieira at gcc dot gnu.org --- Forgot to mention, this happens during the sched1 pass.
[Bug rtl-optimization/104498] New: Alias attribute being ignored by scheduler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104498 Bug ID: 104498 Summary: Alias attribute being ignored by scheduler Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: avieira at gcc dot gnu.org Target Milestone: --- Whilst working on a tuning structure I saw a correctness regression that I believe is a result of the alias attribute not working properly. You can reproduce it using an existing tuning for AArch64 using: gcc -O2 src/gcc/gcc/testsuite/gcc.c-torture/execute/alias-2.c -S -mtune=cortex-a34 This will lead to the 'a[off] = 2' store being moved after the b load in 'b[off] != 2'. In RTL: (insn 23 18 19 2 (set (reg:SI 110 [ b[off.0_1] ]) (mem:SI (plus:DI (mult:DI (reg:DI 99 [ off.0_1 ]) (const_int 4 [0x4])) (reg/f:DI 97)) [1 b[off.0_1]+0 S4 A32])) "gcc/gcc/testsuite/gcc.c-torture/execute/alias-2.c":10:6 52 {*movsi_aarch64} (expr_list:REG_DEAD (reg:DI 99 [ off.0_1 ]) (expr_list:REG_DEAD (reg/f:DI 97) (nil)))) (insn 19 23 24 2 (set (mem:SI (plus:DI (mult:DI (reg:DI 99 [ off.0_1 ]) (const_int 4 [0x4])) (reg/f:DI 104)) [1 a[off.0_1]+0 S4 A32]) (reg:SI 106)) "gcc/gcc/testsuite/gcc.c-torture/execute/alias-2.c":9:9 52 {*movsi_aarch64} (expr_list:REG_DEAD (reg:SI 106) (expr_list:REG_DEAD (reg/f:DI 104) (nil)))) After some debugging I found that true_dependence returns false for these two memory accesses because base_alias_check sees they have different base objects ('a' and 'b') and deduces they can't alias based on that, without realising 'b' isn't an actual base object but an alias to 'a'. I think we should make it so that at expand time pointers to 'b' get 'a' as a base object.
[Bug regression/103997] [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103997 --- Comment #12 from avieira at gcc dot gnu.org --- Right, and did you happen to see a perf increase on these benchmarks with any of the patches I mentioned the hash of in the previous comment? Just to explain a bit further what I think is going on. Before my initial patches the epilogue loop analysis would start at the mode_i + 1 of the first loop, in other words, the next mode in the list of modes. After the patch (1) we started this from mode_i = 1, so the first mode after VOIDmode; this caused some ICEs if the target didn't add any, not sure about your targets, but that was fixed in (2). In patch (3) Kewen added a fix to my check for potential use of partial vectors, to check the param_vect_partial_vector_usage since that can disable partial vectors even if the target supports them. So I suspect that either of these 3 patches inadvertently changed the vectorization strategy for the epilogue of some loop(s) in these benchmarks. So when I committed patch (4) f4ca0a53be18dfc7162fd5dcc1e73c4203805e14, the vectorization strategy went back to what it was previously. If this is indeed what happened then the regression you are seeing is just an indication that the original vectorization strategy was sub-optimal. This is something that should be looked at separately, as an optimization, probably by improving the cost modelling of the vectorizer for your target. 
Patch 1) commit d3ff7420e941931d32ce2e332e7968fe67ba20af
Author: Andre Vieira
Date: Thu Dec 2 14:34:15 2021 +
[vect] Re-analyze all modes for epilogues
Patch 2) commit 016bd7523131b645bca5b5530c81ab5149922743
Author: Andre Vieira
Date: Tue Jan 11 15:52:59 2022 +
[vect] PR103971, PR103977: Fix epilogue mode selection for autodetect only
Patch 3) commit 6d51a9c6447bace21f860e70aed13c6cd90971bd
Author: Kewen Lin
Date: Fri Jan 14 07:02:10 2022 -0600
vect: Check partial vector param for supports_partial_vectors [PR104015]
Patch 4) commit f4ca0a53be18dfc7162fd5dcc1e73c4203805e14
Author: Andre Vieira
Date: Wed Jan 19 14:11:32 2022 +
vect: Fix epilogue mode skipping
[Bug regression/103997] [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103997 --- Comment #10 from avieira at gcc dot gnu.org --- Hi Levy, I did a quick experiment, compiled exchange2_r with trunk and with trunk + all my epilogue and unroll vector patches reverted, with '-march=alderlake -Ofast -flto -funroll-loops', and the codegen is pretty much the same. Could it be that picking a different mode than we did before all of my patches was a better choice? If this is the case then this is something that should be fixed by an appropriate cost-model, picking the best mode for the specific loop's epilogue. The patches I reverted were: f4ca0a53be18dfc7162fd5dcc1e73c4203805e14 7ca1582ca60dc84cc3fc46b9cda620e2a0bed1bb 016bd7523131b645bca5b5530c81ab5149922743 d3ff7420e941931d32ce2e332e7968fe67ba20af What were you using as a baseline for that last regression?
[Bug target/104015] [12 regression] gcc.dg/vect/slp-perm-9.c fails on power 9 (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104015 --- Comment #7 from avieira at gcc dot gnu.org --- Yeah I'm with Richard on this one, I just checked and the generated assembly is the same before and after my patch, so this looks like a testism. And yeah I agree, if we were to decide to unroll this for instance then you'd likely see it being printed more too, since you would likely end up with the epilogue using the same mode. I'll suggest changing it to just testing the existence of that string, rather than requiring it N times. Having said that, the fail will go away for this particular case with the param change.
[Bug target/104015] [12 regression] gcc.dg/vect/slp-perm-9.c fails on power 9 (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104015 --- Comment #5 from avieira at gcc dot gnu.org --- Thanks Kewen, that seems worrying, I'll have a look.
[Bug target/104015] [12 regression] gcc.dg/vect/slp-perm-9.c fails on power 9 (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104015 --- Comment #3 from avieira at gcc dot gnu.org --- Hi Kewen, Thanks for the analysis. The param_vect_partial_vector_usage suggestion seems valid, but that shouldn't be the root cause. I would expect an unpredicated V8HI epilogue to fail for a V8HI main loop (unless the main loop was unrolled). That is what the following code in vect_analyze_loop_2 is responsible for: /* If we're vectorizing an epilogue loop, the vectorized loop either needs to be able to handle fewer than VF scalars, or needs to have a lower VF than the main loop. */ if (LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo))) return opt_result::failure_at (vect_location, "Vectorization factor too high for" " epilogue loop.\n"); So PR103997 is looking at fixing the skipping, because we skip too much now. You seem to be describing a case where it doesn't skip enough, but like I said that should be dealt with by the code above, so I have a feeling there may be some confusion here. I have a patch for the earlier bug at https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588330.html This is still under review whilst we work out a better way of dealing with the issue. Could you maybe check whether that fixes your failures? I'll start a cross build for powerpc in the meantime to see if I can check out these tests. As for why I don't use LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P on the first loop vinfo to skip epilogue modes, that's because it is possible to have a non-predicated main loop with a predicated epilogue. The test I added for aarch64 with that patch is a motivating case. 
On another note, unfortunately LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P only 'forces' the use of partial vectors; it doesn't tell us whether it is possible or not, AFAIU. Hence why I introduced that new function, which really only checks whether the target is at all capable of partial vector generation, since if we know it's not possible at all we can skip more modes and avoid unnecessary analysis.
[Bug regression/103997] [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103997 --- Comment #7 from avieira at gcc dot gnu.org --- Hmm thinking out loud here. As vector sizes (or ISAs) change vectorization strategies could indeed change. Best that I can think of is things like rounding, where you might need to do operations in higher precision, and some targets could potentially support instructions that widen, round and narrow again in the same instruction at some size + ISA combination and not in other, which means some would have a 'higher' element size mode in there where others don't. But that assumes the vectorizer would represent such 'widen + round + narrow' instructions in a single pattern, hiding the 'higher precision' elements. Which as far as I know don't exist right now. There may be other cases I can't think of ofc. We could always be even more conservative and only skip if the highest possible element size for the current vector size + ISA would lead to a mode with NUNITS greater or equal to the current vector mode. Or ... just never skip a mode, I don't have a good feeling for how much that would cost compile time wise though.
[Bug regression/103997] [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103997 avieira at gcc dot gnu.org changed: What|Removed |Added CC||avieira at gcc dot gnu.org --- Comment #5 from avieira at gcc dot gnu.org --- Yeah I made a mistake there using the vector_mode like that, since that vector mode really only determines vector size (and vector ISA for aarch64). I am almost finished testing a patch that instead goes through the 'used_vector_modes' to find the largest element for all used vector modes, then use related_vector_mode to get the vector mode for that element with the same size as the current vector_mode[mode_i]. That would give us the lowest possible VF for that loop and vector size. Should be posting the fix soon.
[Bug tree-optimization/103977] [12 Regression] ice in try_vectorize_loop_1 since r12-6420-gd3ff7420e941931d32ce2e332e7968fe67ba20af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103977 --- Comment #8 from avieira at gcc dot gnu.org --- The patch Jeff mentioned is this: [vect] PR103971, PR103977: Fix epilogue mode selection for autodetect only gcc/ChangeLog: * tree-vect-loop.c (vect-analyze-loop): Handle scenario where target does not add autovectorize_vector_modes. https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=016bd7523131b645bca5b5530c81ab5149922743 Should be OK to close this now?
[Bug tree-optimization/103971] [12 regression] build fails after r12-6420, ICE at libgfortran/generated/matmul_i1.c:2450
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103971 avieira at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #2 from avieira at gcc dot gnu.org --- Have been told powerpc is working again after: [vect] PR103971, PR103977: Fix epilogue mode selection for autodetect only gcc/ChangeLog: * tree-vect-loop.c (vect-analyze-loop): Handle scenario where target does not add autovectorize_vector_modes. https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=016bd7523131b645bca5b5530c81ab5149922743 Closing this PR.
[Bug tree-optimization/103977] [12 Regression] ice in try_vectorize_loop_1 since r12-6420-gd3ff7420e941931d32ce2e332e7968fe67ba20af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103977 --- Comment #7 from avieira at gcc dot gnu.org --- Thanks for confirming that Jeff :)
[Bug tree-optimization/103971] [12 regression] build fails after r12-6420, ICE at libgfortran/generated/matmul_i1.c:2450
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103971 --- Comment #1 from avieira at gcc dot gnu.org --- seurer, could you check whether https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588237.html fixes this? I don't have easy access to a powerpc target for bootstrap.
[Bug tree-optimization/103977] [12 Regression] ice in try_vectorize_loop_1 since r12-6420-gd3ff7420e941931d32ce2e332e7968fe67ba20af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103977 --- Comment #5 from avieira at gcc dot gnu.org --- Posted a fix on ML: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588237.html Sorry for the breakage, wrong assumption on my part :(
[Bug tree-optimization/100981] ICE in info_for_reduction, at tree-vect-loop.c:4897
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100981 --- Comment #6 from avieira at gcc dot gnu.org --- FYI Tamar asked me to make sure the instructions were being generated. I checked and they were, but not being used, as it decides to inline MAIN__ and inlining seems to break (as in not apply, a missed opportunity) the complex optimization. So for this specific test I'd use -fno-inline; it executes the fcmla instructions that way and it runs fine.
[Bug tree-optimization/100981] ICE in info_for_reduction, at tree-vect-loop.c:4897
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100981 --- Comment #5 from avieira at gcc dot gnu.org --- Yeah that works. Ran it as is, no abort, ran it with s/ne/eq/ and it aborts.
[Bug rtl-optimization/98791] [10 Regression] ICE in paradoxical_subreg_p (in ira) with SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791 avieira at gcc dot gnu.org changed: What|Removed |Added Status|REOPENED|RESOLVED Resolution|--- |FIXED --- Comment #10 from avieira at gcc dot gnu.org --- Closing now as backport is done.
[Bug rtl-optimization/98791] [10 Regression] ICE in paradoxical_subreg_p (in ira) with SVE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98791 --- Comment #8 from avieira at gcc dot gnu.org --- Aye, my bad there. Thanks for the change.