[Bug tree-optimization/102793] AArch64: sequential comparisons with equal conditional blocks don't use ccmp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102793 --- Comment #7 from Manolis Tsamis --- Also submitted on the lists: https://gcc.gnu.org/pipermail/gcc-patches/2024-April/648856.html I should note that I needed to modify the test uninit-pred-6_c.c and remove this check:

  if (l)
    if (n > 12)
      blah(v); /* { dg-bogus "uninitialized" "bogus warning" } */

because it was inconsistent: if the branch is merged to a simple if (l && (n > 12)) it fails (on master). With the proposed patch we get more ifcombining and better codegen, but this condition fails. I believe this would need an enhancement in tree-ssa-uninit.
[Bug tree-optimization/102793] AArch64: sequential comparisons with equal conditional blocks don't use ccmp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102793 Manolis Tsamis changed: What|Removed |Added CC||manolis.tsamis at vrull dot eu --- Comment #6 from Manolis Tsamis --- Created attachment 57886 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57886&action=edit Patch with proposed fix for the issue. Adding a copy of the ifcombine pass after pre solves this issue and improves code generation in some other cases too. I specifically chose after the pre pass because it deduplicates basic blocks and helps in maximizing the effect of ifcombine.
[Bug tree-optimization/114326] New: Missed optimization for A || B when !B implies A.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114326 Bug ID: 114326 Summary: Missed optimization for A || B when !B implies A. Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: --- The function below doesn't fold to return 0;

  int cmp1(uint64_t d1, uint64_t d2) {
    if (((d1 ^ d2) & 0xabcd) == 0 || d1 != d2)
      return 0;
    return foo();
  }

while the following function does:

  int cmp2(uint64_t d1, uint64_t d2) {
    if (d1 != d2 || ((d1 ^ d2) & 0xabcd) == 0)
      return 0;
    return foo();
  }

The functions are equivalent since the lhs and rhs of || don't have side effects. In general, the pattern here is that a side-effect-free expression a || b where !b implies a should be optimized to true. As in the testcase above, a doesn't necessarily imply !b. Something similar could be stated for && expressions. Complementary godbolt link: https://godbolt.org/z/qK5bYf36T
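The implication is easy to sanity-check exhaustively over narrow inputs. The sketch below (my own check, not part of the report) encodes the controlling expression of cmp1: since !(d1 != d2) means d1 == d2, the xor is zero and the mask test holds, so the whole disjunction is a tautology.

```c
#include <stdint.h>

/* Controlling expression of cmp1: a || b, where
   a = ((d1 ^ d2) & 0xabcd) == 0 and b = d1 != d2.
   If b is false then d1 == d2, hence d1 ^ d2 == 0 and a is true,
   so the disjunction should always evaluate to 1. */
static int cmp1_cond(uint64_t d1, uint64_t d2)
{
    return ((d1 ^ d2) & 0xabcd) == 0 || d1 != d2;
}

/* Exhaustive check over all pairs of 8-bit values: returns 1 iff
   the expression is true for every pair. */
static int always_true(void)
{
    for (uint64_t d1 = 0; d1 < 256; d1++)
        for (uint64_t d2 = 0; d2 < 256; d2++)
            if (!cmp1_cond(d1, d2))
                return 0;
    return 1;
}
```

The same exhaustive argument extends to full 64-bit inputs, since only equality and xor are involved.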
[Bug tree-optimization/114287] New: [13.2.1 Regression] 416.gamess in SPEC CPU 2006 miscompiled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114287 Bug ID: 114287 Summary: [13.2.1 Regression] 416.gamess in SPEC CPU 2006 miscompiled Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: --- On Linux/AArch64, 416.gamess from SPEC CPU 2006 is miscompiled.

Contents of h2ocu2+.gradient.err:
  Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG
  STOP IN ABRT

Contents of triazolium.err:
  STOP IN ABRT

When compiled with: -g -O2 -flto=32 -fno-strict-aliasing. It compiles fine with GCC 13.2.1. GCC master built from commit d1b241b9506cdc0ebd3f43d12cf77d7c33271342, configured with: --enable-shared --enable-threads=posix --enable-checking=release --enable-linker-build-id --enable-plugin --with-isl --enable-lto
[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010 --- Comment #10 from Manolis Tsamis --- (In reply to ptomsich from comment #9)
> (In reply to Manolis Tsamis from comment #0)
> > E.g. another loop, non canonicalized names:
> >
> > .L120:
> >         ldr     q30, [x0], 16
> >         movi    v29.2s, 0
> >         ld2     {v26.16b - v27.16b}, [x4], 32
> >         movi    v25.4s, 0
> >         zip1    v29.16b, v30.16b, v29.16b
> >         zip2    v30.16b, v30.16b, v25.16b
> >         umlal   v29.8h, v26.8b, v28.8b
> >         umlal2  v30.8h, v26.16b, v28.16b
> >         uaddw   v31.4s, v31.4s, v29.4h
> >         uaddw2  v31.4s, v31.4s, v29.8h
> >         uaddw   v31.4s, v31.4s, v30.4h
> >         uaddw2  v31.4s, v31.4s, v30.8h
> >         cmp     x5, x0
> >         bne     .L120
>
> Is it just me, or are the zip1 and zip2 instructions dead?
>
> Philipp.

They certainly look dead, but they're not, because the umlal/umlal2 (and the other accumulate instructions) also read from the destination register. There does look to be a missed optimization opportunity to use just a single `movi v25.4s, 0` here, though. Also, looking again at the generated code in the first example:

        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b

If I'm correct this could be folded into just

        mla     v18.16b, v17.16b, v25.16b

in which case most of the movs in the first and second example could be eliminated. To me it looks like the accumulate instructions are missing some optimizations.
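For reference, a scalar model of the accumulate semantics (my sketch, not GCC output): `mla vd, vn, vm` computes vd[i] += vn[i] * vm[i] per lane, i.e. the destination is also a source. That is why the zips above are live, and why the mov-before-mla can only be dropped when the old destination value is dead.

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };

/* Model of "mla vd, vn, vm": per-lane vd += vn * vm.
   Note that vd is read as well as written. */
static void mla(uint8_t vd[LANES], const uint8_t vn[LANES],
                const uint8_t vm[LANES])
{
    for (int i = 0; i < LANES; i++)
        vd[i] = (uint8_t)(vd[i] + vn[i] * vm[i]);
}

/* The emitted sequence "mov v23, v18; mla v23, v17, v25" and the
   proposed fold "mla v18, v17, v25" compute the same value; the fold
   is only safe if the old contents of v18 have no further uses. */
static int fold_is_equivalent(void)
{
    uint8_t v17[LANES], v18[LANES], v25[LANES], v23[LANES], v18f[LANES];
    for (int i = 0; i < LANES; i++) {
        v17[i] = (uint8_t)(3 * i + 1);
        v18[i] = (uint8_t)(7 * i + 2);
        v25[i] = (uint8_t)(5 * i + 3);
    }
    memcpy(v23, v18, LANES);   /* mov v23, v18         */
    mla(v23, v17, v25);        /* mla v23, v17, v25    */
    memcpy(v18f, v18, LANES);  /* folded alternative:  */
    mla(v18f, v17, v25);       /* mla v18, v17, v25    */
    return memcmp(v23, v18f, LANES) == 0;
}
```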
[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010 --- Comment #5 from Manolis Tsamis --- Also, I further investigated the codegen difference in the second example (zip + umlal vs umull) and it looks to be some sort of RTL ordering + combine issue. Specifically, when we expand the RTL for the example there are some very slight ordering differences where some non-dependent insns have swapped order. One of these happens to precede a relevant vector statement, and then in one case combine does the umlal transformation but in the other it does not. Afaik combine has some limits on the instruction window it looks at, so it seems feasible that ordering differences in RTL can later turn into major codegen differences in a number of ways. Other differences seem to come from register allocation, as you mentioned. This doesn't yet provide any useful insight into whether the vectorization improvements are accidental or not.
[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010 --- Comment #4 from Manolis Tsamis --- Hi Andrew, thanks for your insights on this. Let me reply to some of your points:

(In reply to Andrew Pinski from comment #1)
> > The most important case I have observed is that the vectorizer can fail or
> > create inferior code with more shuffles/moves when the SSA names aren't
> > monotonically increasing.
>
> That should not be true.

Indeed, after further cleaning up the dumps, some differences that I was considering were just due to the diff algorithm not doing a good job, and that confused me (sigh). So, for this example, while we're in tree form I observe only naming changes, but no different code or order of statements.

(In reply to Andrew Pinski from comment #2)
> Note what I had found in the past it is not exactly SSA_NAMEs that cause the
> difference but rather the RTL register pseudo # causes differences in
> register allocation which was exposed from the difference in operands
> canonicalization.

Yes, I have also observed this and it looks to be the main issue.

(In reply to Andrew Pinski from comment #3)
> The first example (of assembly here) in comment #0 is extra moves due to the
> RA not handling subreg that decent for the load/store lane. There are other
> bug reports dealing with that. Why the SSA_NAMES being monotonically help is
> just by an accident really.

Do you happen to recall the relevant ticket(s)? I would like to have a look but couldn't find them so far. Also, while I agree that in some cases changes like this 'just happen' to improve codegen in some particular case, across multiple experiments the vectorized code was consistently superior with sorted names and never worse. In most cases that I recall, the version that used unsorted names had additional shuffles of various sorts, or moves. So, while anecdotal, the effect doesn't look accidental to me in this case.
I feel like there may be some subtle difference due to the names that helps in this case?

> Also:
> > This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.
> Most use sparse bitmaps there so it is not a big deal.

Agreed, and that's probably why I couldn't measure any non-trivial difference in compilation times. I should just note that there are also places that create vectors or other data structures sized to the number of SSA names, so in theory this could still help in extreme cases.

> I think this should be split up in a few different bug reports really.
> One for each case where better optimizations happen.

Ok, the only cases that I found to be clearly better are the ones related to vectorization. Would it help to create a ticket just for that now, or should I wait for the discussion in this one to conclude first?

> Also:
> > I have seen two similar source files generating the exact same GIMPLE code
> > up to some optimization pass but then completely diverging due to different
> > freelists.
> The only case where I have seen this happen is expand will have different
> pseudo # really. Yes I noticed this effect while I did
> r14-569-g21e2ef2dc25de3 really.

Afaik the codegen differences that I observed were due to the same reason, but it nonetheless felt weird that the same GIMPLE could later produce two files that differ in name ordering just because the freelists were different (but invisible in the dumps). So I naturally questioned: why don't we just flush the freelists after every pass, if it's not a performance issue?
[Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010 Bug ID: 114010 Summary: Unwanted effects of using SSA free lists. Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: --- Created attachment 57469 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57469&action=edit Testcase and example compilations on AArch64 After some experiments, I have observed a number of unwanted effects due to re-using SSA names through free lists. Specifically, the issue is that the recycled SSA names have unpredictable ordering and the compiler is affected by this ordering. The effects include:

1) Worse code generation from some passes that explicitly or implicitly depend on SSA name ordering. The most important case I have observed is that the vectorizer can fail or create inferior code with more shuffles/moves when the SSA names aren't monotonically increasing.

2) Missed canonicalization opportunities. Functions that do the same thing but have different SSA ordering will usually end up diverging after various optimization passes. The generated assembly code is also very different when it shouldn't be. Since the SSA freelists are affected by things like moving around dead code, this also makes comparing GIMPLE or assembly code much harder than necessary.

3) Because freelists persist through optimization passes, their hidden state can affect optimizations/codegen in unexpected ways. I have seen two similar source files generating the exact same GIMPLE code up to some optimization pass but then completely diverging due to different freelists. Getting something completely different by starting from the same GIMPLE was mildly surprising at first.

4) Although I haven't observed something worthwhile on that front, in theory if a large chunk of names is free'd (e.g. by dead code elimination) then the pending free'd names will negatively affect performance and memory because they're not compacted. This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.

The easiest way to experiment with this behaviour is code like below:

  ...some code...
  if (condition that will be eliminated, but later)
    {
      ...some function call that will be inlined now...
      ...or some code...
    }
  ...some other code...

I have crafted and attached a testcase (canon.cpp) that can showcase some of these behaviors at different optimization levels. As part of the experiments I have a modified GCC compiler that calls `release_free_names_and_compact_live_names` after every optimization pass, which is enough to guarantee that the generated names are in monotonically increasing order. Together with the testcase I have attached the AArch64 assembly code generated in 8 different configurations: with or without the dead code in the source file, with or without free name canonicalization, with O2 or with O3.

The observations at O2: As per the previous notes, it can be seen that the code in with-dead-code.O2.txt is quite different from without-dead-code.O2.txt (different instruction mix, different instruction count, different register allocation etc.). This is due to the free'd up names causing different SSA name allocation. Compared to that, with-dead-code.canonicalized.O2.txt and without-dead-code.canonicalized.O2.txt are identical except for the stack offsets chosen for the variables. This can easily be seen, as they diff cleanly both at the GIMPLE and assembly level.

The observations at O3: At O3 the assembly no longer diffs cleanly in either case, but it is interesting to observe the better vectorization done due to the better ordering of names (in almost all places where there is vectorized code): E.g. 
non-canonicalized:

.L37:
        ld4     {v24.16b - v27.16b}, [x1], 64
        shl     v23.16b, v24.16b, 5
        add     v28.16b, v23.16b, v24.16b
        mov     v23.16b, v18.16b
        mla     v23.16b, v17.16b, v25.16b
        mov     v29.16b, v23.16b
        mov     v23.16b, v20.16b
        mla     v23.16b, v19.16b, v26.16b
        mov     v30.16b, v23.16b
        mov     v23.16b, v22.16b
        mla     v23.16b, v21.16b, v27.16b
        mov     v31.16b, v23.16b
        st4     {v28.16b - v31.16b}, [x0], 64
        cmp     x3, x1
        bne     .L37

Assembly for the same code with canonicalized names, 3 instructions less:

.L37:
        ld4     {v28.16b - v31.16b}, [x1], 64
        mov     v26.16b, v23.16b
        mov     v27.16b, v21.16b
        shl     v25.16b, v28.16b, 5
        mla     v26.16b, v24.16b, v29.16b
        mla     v27.16b, v22.16b, v30.16b
        add     v25.16b, v25.16b, v28.16b
        mov     v28.16b, v19.16b
        mla     v28.16b, v20.16b, v31.16b
        st4     {v25.16b - v28.16b}, [x0], 64
        cmp     x1, x3
        bne     .L37

E.g. another loop, non canonicalized names:

.L120:
        ldr
[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089 --- Comment #8 from Manolis Tsamis --- (In reply to Alex Coplan from comment #7) > (In reply to Manolis Tsamis from comment #6) > > (In reply to Alex Coplan from comment #5) > > > Also ICEs without -fno-strict-aliasing FWIW, so just -g -O2 -funroll-loops > > > seems to be enough. > > > > On my local build it doesn't reproduce without -fno-strict-aliasing; I just > > double-checked (GCC master @ 4e0a467302fea56d63b7a6d17f99c0f388960dc7). > > Hmm, that is odd, I definitely don't need that option even when built from > the same commit. How are you configuring GCC? I suppose it doesn't matter > since I can reproduce the problem as long as the fix works with/without > -fno-strict-aliasing. Odd indeed, but probably something with my configure flags as you mentioned. Here's what I use: --enable-shared \ --enable-threads=posix \ --enable-checking=release \ --with-system-zlib \ --enable-__cxa_atexit \ --disable-libunwind-exceptions \ --enable-linker-build-id \ --enable-libstdcxx-backtrace \ --enable-plugin \ --enable-initfini-array \ --enable-gnu-indirect-function \ --with-isl \ --enable-lto \ --with-cpu=neoverse-n1 \ --disable-multilib \ --disable-bootstrap (I think the most uncommon one here is --with-cpu=neoverse-n1 ?)
[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089 --- Comment #6 from Manolis Tsamis --- (In reply to Alex Coplan from comment #5) > Also ICEs without -fno-strict-aliasing FWIW, so just -g -O2 -funroll-loops > seems to be enough. On my local build it doesn't reproduce without -fno-strict-aliasing; I just double-checked (GCC master @ 4e0a467302fea56d63b7a6d17f99c0f388960dc7).
[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089 --- Comment #3 from Manolis Tsamis --- Created attachment 56908 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56908&action=edit Reduced testcase
[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089 Manolis Tsamis changed: What|Removed |Added CC||manolis.tsamis at vrull dot eu --- Comment #2 from Manolis Tsamis --- The following testcase, reduced from x264 intrapred_chroma_plane:

typedef unsigned short uint16;

void intrapred_chroma_plane(uint16 ***mb_preds, int* max_imgpel_values,
                            int crx, int cry, int px)
{
  for (int uv = 0; uv < 2; uv++)
    {
      uint16 **mb_pred = mb_preds[uv + 1];
      uint16 **predU2 = _pred[px - 2];
      uint16 *upPred = _pred[px][px];
      int max_imgpel_value = max_imgpel_values[uv];
      int ih = upPred[crx - 1];
      for (int i = 0; i < crx*3; ++i)
        ih += upPred[crx*3];
      int iv = (mb_pred[cry - 1][px+1]);
      for (int i = 0; i < cry - 1; ++i)
        iv += (i + 1) * (*(mb_preds[uv][0]) - *(*predU2--));
      for (int j = 0; j < cry; ++j)
        for (int i = 0; i < crx; ++i)
          mb_pred[j][i] = (uint16) (max_imgpel_value * ((i * ih + iv)));
    }
}

reproduces the same ICE when compiled with: gcc -g -O2 -funroll-loops -fno-strict-aliasing intra_chroma_pred.c
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #52 from Manolis Tsamis --- (In reply to Sam James from comment #51)
> manolis, did you have a chance to look at the remaining pass issue? You'll
> need to revert Dave's commit locally which made the issue latent for
> building Python.

Hi Sam, I had to work on some other things so I haven't found a fix yet, but I'll be working on that again now (in light of the new info from PR111601 too). Thanks for the ping, Manolis
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #48 from Manolis Tsamis --- (In reply to dave.anglin from comment #47) > On 2023-11-13 4:33 a.m., manolis.tsamis at vrull dot eu wrote: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 > > > > --- Comment #44 from Manolis Tsamis --- > > (In reply to John David Anglin from comment #39) > >> In the f-m-o pass, the following three insns that set call clobbered > >> registers r20-r22 are pulled from loop: > >> > >> (insn 186 183 190 29 (set (reg/f:SI 22 %r22 [478]) > >> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]) > >> (const_int 388 [0x184]))) "../Python/compile.c":5964:9 120 > >> {addsi3} > >> (nil)) > >> (insn 190 186 187 29 (set (reg/f:SI 21 %r21 [479]) > >> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]) > >> (const_int 392 [0x188]))) "../Python/compile.c":5964:9 120 > >> {addsi3} > >> (nil)) > >> (insn 194 191 195 29 (set (reg/f:SI 20 %r20 [480]) > >> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]) > >> (const_int 396 [0x18c]))) "../Python/compile.c":5964:9 120 > >> {addsi3} > >> (nil)) > >> > >> They are used in the following insns before call to compiler_visit_expr1: > >> > >> (insn 242 238 258 32 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int > >> *)prephit > >> mp_37 + 388B]+0 S4 A32]) > >> (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173])) > >> "../Python/compile.c" > >> :5968:22 42 {*pa.md:2193} > >> (expr_list:REG_DEAD (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173]) > >> (expr_list:REG_DEAD (reg/f:SI 22 %r22 [478]) > >> (nil > >> (insn 258 242 246 32 (set (reg:SI 26 %r26) > >> (reg/v/f:SI 5 %r5 [orig:198 c ] [198])) > >> "../Python/compile.c":5969:15 42 {*pa.md:2193} > >> (nil)) > >> (insn 246 258 250 32 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int > >> *)prephitmp_37 + 392B]+0 S4 A32]) > >> (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169])) > >> "../Python/compile.c":5968:22 42 {*pa.md:2193} > >> (expr_list:REG_DEAD (reg:SI 29 %r29 
[orig:169 vect__102.2443 ] [169]) > >> (expr_list:REG_DEAD (reg/f:SI 21 %r21 [479]) > >> (nil > >> (insn 250 246 254 32 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int > >> *)prephitmp_37 + 396B]+0 S4 A32]) > >> (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145])) > >> "../Python/compile.c":5968:22 42 {*pa.md:2193} > >> (expr_list:REG_DEAD (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145]) > >> (expr_list:REG_DEAD (reg/f:SI 20 %r20 [480]) > >> (nil > >> > >> After the call, we have: > >> > >> (insn 1241 269 273 30 (set (reg/f:SI 22 %r22 [478]) > >> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) > >> "../Python/compile.c":5970:20 -1 > >> (nil)) > >> (insn 273 1241 1242 30 (set (mem:SI (plus:SI (reg/f:SI 22 %r22 [478]) > >> (const_int 388 [0x184])) [4 MEM[(int *)_107 + 388B]+0 S4 > >> A32]) > >> (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167])) > >> "../Python/compile.c":5970:20 42 {*pa.md:2193} > >> (nil)) > >> (insn 1242 273 277 30 (set (reg/f:SI 21 %r21 [479]) > >> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) > >> "../Python/compile.c":5970:20 -1 > >> (nil)) > >> (insn 277 1242 1243 30 (set (mem:SI (plus:SI (reg/f:SI 21 %r21 [479]) > >> (const_int 392 [0x188])) [4 MEM[(int *)_107 + 392B]+0 S4 > >> A32]) > >> (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156])) > >> "../Python/compile.c":5970:20 42 {*pa.md:2193} > >> (nil)) > >> (insn 1243 277 281 30 (set (reg/f:SI 20 %r20 [480]) > >> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) > >> "../Python/compile.c":5970:20 -1 > >> (nil)) > >> (insn 281 1243 299 30 (set (mem:SI (plus:SI (reg/f:SI 20 %r20 [480]) > >> (const_int 396 [0x18c])) [4 MEM[(int *)_107 + 396B]+0 S4 > >> A32]) > >> (reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134])) > >> "../Python/compile.c":5970:20 42 {*pa.md:2193} > >> (nil)) > >> > >> We have lost the offsets that were added initially to r20, r21 and r22. 
> >> > >> Previous ce3 pass had: > >> > >> (insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478]) > >> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]) > >> (const_int 388 [0x184]))) "../Python/compile.c":5970:20 120 > >> {addsi3} > >> (nil)) > >> (insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int > >> *)_107 + 388B]+0 S4 A32]) > >> (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167])) > >> "../Python/compile.c":5970:20 42 {*pa.md:2193} > >> (nil)) > >> (insn 276 273 277 30 (set (reg/f:SI 21 %r21 [479]) > >> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]) > >> (const_int 392 [0x188]))) "../Python/compile.c":5970:20 120 > >> {addsi3} > >>
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #46 from Manolis Tsamis --- I have reproduced the segfault with f-m-o limited to only folding insn 272 from compiler_call_helper. The exact transformation is:

Memory offset changed from 0 to 388 for instruction:
(insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(intD.1 *)_107 + 388B]+0 S4 A32])
        (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2448D.32932 ] [167])) "Python/compile.c":5970:20 42 {*pa.md:2193}
     (nil))
deferring rescan insn with uid = 273.
Instruction folded:
(insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478])
        (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
            (const_int 388 [0x184]))) "Python/compile.c":5970:20 120 {addsi3}
     (nil))

This instruction is also among the ones that Dave mentioned. Again, if I'm missing something as to why this transformation is illegal, please tell me. Given that these are also consecutive instructions, what I'm seeing here is that

  %r22 = %r19 + 388
  [%r22] = %r14

is transformed to

  %r22 = %r19
  [%r22 + 388] = %r14

I haven't tracked all other uses of %r22 yet, but in theory if there were any non-foldable use of that register then the transformation wouldn't be made. Manolis
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #45 from Manolis Tsamis --- (In reply to Jeffrey A. Law from comment #41)
> I would agree. In fact, the whole point of the f-m-o pass is to bring those
> immediates into the memory reference. It'd be really useful to know why
> that isn't happening.
>
> The only thing I can think of would be if multiple instructions needed the
> %r20 in the RTL you attached. Which might point to a refinement we should
> make in f-m-o, specifically the transformation isn't likely profitable if we
> aren't able to fold away a term or fold a constant term into the actual
> memory reference.

Jeff, I'm confused about "It'd be really useful to know why that isn't happening." It can be seen in Dave's dumps that it *is* happening, e.g.:

Memory offset changed from 0 to 396 for instruction:
(insn 281 280 284 30 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int *)_107 + 396B]+0 S4 A32])
        (reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
     (nil))
Instruction folded:
(insn 280 277 281 30 (set (reg/f:SI 20 %r20 [480])
        (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
            (const_int 396 [0x18c]))) "../Python/compile.c":5970:20 120 {addsi3}
     (nil))

If you look at the RTL in f-m-o, all these offsets are indeed moved into the respective load/store. I don't know if cprop afterwards manages to eliminate the unwanted move, but f-m-o does what it's supposed to do in this case.
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #44 from Manolis Tsamis --- (In reply to John David Anglin from comment #39)
> In the f-m-o pass, the following three insns that set call clobbered
> registers r20-r22 are pulled from loop:
>
> (insn 186 183 190 29 (set (reg/f:SI 22 %r22 [478])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 388 [0x184]))) "../Python/compile.c":5964:9 120 {addsi3}
>      (nil))
> (insn 190 186 187 29 (set (reg/f:SI 21 %r21 [479])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 392 [0x188]))) "../Python/compile.c":5964:9 120 {addsi3}
>      (nil))
> (insn 194 191 195 29 (set (reg/f:SI 20 %r20 [480])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 396 [0x18c]))) "../Python/compile.c":5964:9 120 {addsi3}
>      (nil))
>
> They are used in the following insns before call to compiler_visit_expr1:
>
> (insn 242 238 258 32 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int *)prephitmp_37 + 388B]+0 S4 A32])
>         (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173])) "../Python/compile.c":5968:22 42 {*pa.md:2193}
>      (expr_list:REG_DEAD (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173])
>         (expr_list:REG_DEAD (reg/f:SI 22 %r22 [478])
>             (nil))))
> (insn 258 242 246 32 (set (reg:SI 26 %r26)
>         (reg/v/f:SI 5 %r5 [orig:198 c ] [198])) "../Python/compile.c":5969:15 42 {*pa.md:2193}
>      (nil))
> (insn 246 258 250 32 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int *)prephitmp_37 + 392B]+0 S4 A32])
>         (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169])) "../Python/compile.c":5968:22 42 {*pa.md:2193}
>      (expr_list:REG_DEAD (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169])
>         (expr_list:REG_DEAD (reg/f:SI 21 %r21 [479])
>             (nil))))
> (insn 250 246 254 32 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int *)prephitmp_37 + 396B]+0 S4 A32])
>         (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145])) "../Python/compile.c":5968:22 42 {*pa.md:2193}
>      (expr_list:REG_DEAD (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145])
>         (expr_list:REG_DEAD (reg/f:SI 20 %r20 [480])
>             (nil))))
>
> After the call, we have:
>
> (insn 1241 269 273 30 (set (reg/f:SI 22 %r22 [478])
>         (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) "../Python/compile.c":5970:20 -1
>      (nil))
> (insn 273 1241 1242 30 (set (mem:SI (plus:SI (reg/f:SI 22 %r22 [478])
>                 (const_int 388 [0x184])) [4 MEM[(int *)_107 + 388B]+0 S4 A32])
>         (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
>      (nil))
> (insn 1242 273 277 30 (set (reg/f:SI 21 %r21 [479])
>         (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) "../Python/compile.c":5970:20 -1
>      (nil))
> (insn 277 1242 1243 30 (set (mem:SI (plus:SI (reg/f:SI 21 %r21 [479])
>                 (const_int 392 [0x188])) [4 MEM[(int *)_107 + 392B]+0 S4 A32])
>         (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
>      (nil))
> (insn 1243 277 281 30 (set (reg/f:SI 20 %r20 [480])
>         (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])) "../Python/compile.c":5970:20 -1
>      (nil))
> (insn 281 1243 299 30 (set (mem:SI (plus:SI (reg/f:SI 20 %r20 [480])
>                 (const_int 396 [0x18c])) [4 MEM[(int *)_107 + 396B]+0 S4 A32])
>         (reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
>      (nil))
>
> We have lost the offsets that were added initially to r20, r21 and r22.
>
> Previous ce3 pass had:
>
> (insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 388 [0x184]))) "../Python/compile.c":5970:20 120 {addsi3}
>      (nil))
> (insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int *)_107 + 388B]+0 S4 A32])
>         (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
>      (nil))
> (insn 276 273 277 30 (set (reg/f:SI 21 %r21 [479])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 392 [0x188]))) "../Python/compile.c":5970:20 120 {addsi3}
>      (nil))
> (insn 277 276 280 30 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int *)_107 + 392B]+0 S4 A32])
>         (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156])) "../Python/compile.c":5970:20 42 {*pa.md:2193}
>      (nil))
> (insn 280 277 281 30 (set (reg/f:SI 20 %r20 [480])
>         (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
>             (const_int 396 [0x18c]))) "../Python/compile.c":5970:20 120 {addsi3}
>      (nil))
> (insn 281 280 284 30 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int *)_107 + 396B]+0 S4 A32])
>
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #15 from Manolis Tsamis --- (In reply to Sam James from comment #13)
> Created attachment 56527 [details]
> compile.c.323r.fold_mem_offsets.bad.xz
>
> Output from
> ```
> hppa2.0-unknown-linux-gnu-gcc -c -DNDEBUG -g -fwrapv -O3 -Wall -O2
> -std=c11 -Werror=implicit-function-declaration -fvisibility=hidden
> -I/home/sam/git/cpython/Include/internal -IObjects -IInclude -IPython -I.
> -I/home/sam/git/cpython/Include -DPy_BUILD_CORE -o Python/compile.o
> /home/sam/git/cpython/Python/compile.c -fdump-rtl-fold_mem_offsets-all
> ```
>
> If I instrument certain functions in compile.c with a no-optimisation
> attribute or build the file with -fno-fold-mem-offsets, Python works, so I'm
> reasonably sure this is the relevant object.

Thanks for the dump file! There are 66 folded/eliminated instructions in this object file; I looked at each case and there doesn't seem to be anything strange. In fact most of the transformations are straightforward:
- All except a couple of cases don't involve any arithmetic, so it's just moving a constant around.
- The majority of the transformations are 'trivial' and consist of a single add and then a memory operation: a sequence like X = Y + Const, R = MEM[X + 0] is folded to X = Y, R = MEM[X + Const]. I wonder why so many of these exist and are not optimized elsewhere.
- There are some cases with negative offsets, but the calculations look correct.
- There are a few more complicated cases, but I've done these on paper and they also look correct.

Of course I could be missing some more complicated effect, but what I want to say is that everything looks sensible in this particular file.

> Thanks! You are very welcome to have access to some HPPA machines for
> this kind of work. Please email me an SSH public key + desired username
> if that sounds helpful.

Yes, since I couldn't find anything interesting in the dump, that would definitely be helpful. Thanks! Manolis
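The 'trivial' fold described above can be modeled in a few lines (my illustration, not the pass itself): moving the constant from the address-forming add into the memory operand leaves the effective address unchanged, which is why the transformation is address-preserving whenever every use of X is foldable.

```c
#include <stdint.h>

/* Effective address of the two-instruction sequence:
     X = Y + add_const        (the addsi3-style instruction)
     ... MEM[X + mem_offset]  (the memory access)            */
static uintptr_t effective_address(uintptr_t y, uintptr_t add_const,
                                   uintptr_t mem_offset)
{
    uintptr_t x = y + add_const;
    return x + mem_offset;
}

/* f-m-o rewrites {X = Y + C; MEM[X + 0]} into {X = Y; MEM[X + C]}:
   both sequences access address Y + C. */
static int fold_preserves_address(uintptr_t y, uintptr_t c)
{
    return effective_address(y, c, 0) == effective_address(y, 0, c);
}
```

The correctness condition the comments above keep returning to is separate from this arithmetic: the fold is only safe if no other use of X expects the pre-fold value Y + C.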
[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415 --- Comment #11 from Manolis Tsamis --- Hi all, I will also go ahead and try to reproduce that, although it may take me some time due to my limited experience with HPPA. Once I manage to reproduce, most f-m-o issues are straightforward to locate by bisecting the transformed instructions. > I think the key object is Python/compile.o, but not certain yet. In this case the dump file of fold-mem-offsets (-fdump-rtl-fold_mem_offsets-all) could also be useful, as it contains all the information needed to see whether a transformation is valid. If it would be easy for anyone to provide the dump file, I could look at it and see if anything stands out (until I manage to reproduce this). Thanks, Manolis
[Bug c/109393] Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 --- Comment #9 from Manolis Tsamis --- Created attachment 55856 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55856&action=edit Address calculation pattern v1
[Bug c/109393] Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 --- Comment #7 from Manolis Tsamis --- After some attempts to improve on this, my current solution is:
1) Do not change pointer_int_sum in c-common (otherwise codegen regressions are observed)
2) Introduce a pattern that folds (unsigned type) (x + CONST_INT_1) * CONST_INT_2 to (unsigned type) (x * CONST_INT_2) + (unsigned type) (CONST_INT_1 * CONST_INT_2) under the right circumstances.

This combination improves codegen (including the testcase provided) and also doesn't cause any regressions on the GCC testsuite or benchmarks that I have tried so far. Given that a[(ulong) i +- C] in GIMPLE is quite common, I believe this change helps with folding / canonicalization in many cases. My current match.pd pattern to do that is below; any feedback or improvements are welcome.

(simplify
 (mult (convert@3 (plus @0 INTEGER_CST@1)) INTEGER_CST@2)
 (with { tree tt = TREE_TYPE (@3); }
  (if (INTEGRAL_TYPE_P (type)
       && !TYPE_SATURATING (type)
       && !TYPE_OVERFLOW_TRAPS (type)
       && !TYPE_OVERFLOW_SANITIZED (type)
       && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
   (plus (mult (convert:tt @0) @2) (mult (convert:tt @1) @2)))))
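The arithmetic identity behind the pattern can be sanity-checked in plain C. This is only a sketch: the no-overflow requirement that the pattern's TYPE_OVERFLOW_* checks enforce is assumed to hold for the sample inputs, and the function names are my own:

```c
#include <stdint.h>
#include <assert.h>

/* (unsigned type)(x + C1) * C2  ==  (unsigned type)x * C2 + (unsigned type)C1 * C2,
   provided x + C1 does not overflow in the narrow signed type.
   Here C1 = 3, C2 = 4, narrow type = int32_t, wide unsigned type = uint64_t.  */
uint64_t folded (int32_t x)
{
  return (uint64_t) (x + 3) * 4;                 /* convert, then multiply */
}

uint64_t distributed (int32_t x)
{
  return (uint64_t) x * 4 + (uint64_t) 3 * 4;    /* distributed form */
}
```

Note the identity holds even for negative x (sign extension followed by wrapping 64-bit arithmetic agrees on both sides) as long as x + 3 itself does not overflow int32_t.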
[Bug rtl-optimization/111267] New: Codegen regression from i386 argument passing changes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267 Bug ID: 111267 Summary: Codegen regression from i386 argument passing changes Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu CC: roger at nextmovesoftware dot com, ubizjak at gmail dot com Target Milestone: ---

Code generation for the following testcase:

typedef struct {
    float minx, miny;
    float maxx, maxy;
} AABB;

int TestOverlap(AABB a, AABB b) {
    return a.minx <= b.maxx && a.miny <= b.maxy
        && a.maxx >= b.minx && a.maxx >= b.minx;
}

int TestOverlap2(AABB a, AABB b) {
    return a.miny <= b.maxy && a.maxx >= b.minx;
}

has regressed due to commit https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=bdf2737cda53a83332db1a1a021653447b05a7e7

GCC 13.2:

TestOverlap:
        comiss  xmm3, xmm0
        movq    rdx, xmm0
        movq    rsi, xmm1
        movq    rax, xmm3
        jb      .L10
        shr     rdx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, edx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm1, esi
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        shufps  xmm0, xmm0, 85
        shufps  xmm3, xmm3, 85
        comiss  xmm3, xmm0
        jb      .L17
        xor     eax, eax
        comiss  xmm1, xmm2
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

GCC trunk:

TestOverlap:
        movq    rax, xmm1
        movq    rdx, xmm2
        movq    rsi, xmm0
        mov     rdi, rax
        movq    rax, xmm3
        mov     rcx, rsi
        xchg    rdx, rax
        movd    xmm1, edx
        mov     rsi, rax
        mov     rax, rdx
        comiss  xmm1, xmm0
        jb      .L10
        shr     rcx, 32
        shr     rax, 32
        movd    xmm0, eax
        movd    xmm4, ecx
        comiss  xmm0, xmm4
        jb      .L10
        movd    xmm0, esi
        movd    xmm1, edi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L10:
        xor     eax, eax
        ret
TestOverlap2:
        movq    rdx, xmm2
        movq    rax, xmm3
        movq    rsi, xmm0
        xchg    rdx, rax
        mov     rcx, rsi
        mov     rsi, rax
        mov     rax, rdx
        shr     rcx, 32
        shr     rax, 32
        movd    xmm4, ecx
        movd    xmm0, eax
        comiss  xmm0, xmm4
        jb      .L17
        movd    xmm0, esi
        xor     eax, eax
        comiss  xmm1, xmm0
        setnb   al
        ret
.L17:
        xor     eax, eax
        ret

(Can also be seen here https://godbolt.org/z/E4xrEn6KW)
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #9 from manolis.tsamis at vrull dot eu --- Hi, This commit is known to be an issue and I'm working on a fix, you can find more details on this ticket https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308. Would it be easy for you to test whether the patch in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308#c10 also fixes this issue? Thanks, Manolis
[Bug fortran/110311] [14 Regression] gfortran 14.0 regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110311 manolis.tsamis at vrull dot eu changed: What|Removed |Added CC||manolis.tsamis at vrull dot eu --- Comment #6 from manolis.tsamis at vrull dot eu --- Hi, Due to the time frame mentioned (June 12-19), could you please test if the offending commit is r14-1873-g6a2e8dcbbd4bab374b27abea375bf7a921047800 ? This commit is now known to cause general issues, as also described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308. Thanks, Manolis
[Bug debug/110308] [14 Regression] ICE on audiofile-0.3.6: RTL: vartrack: Segmentation fault in mode_to_precision(machine_mode)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308 --- Comment #10 from manolis.tsamis at vrull dot eu --- Created attachment 55369 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55369&action=edit ICE-fix-proposal-1
[Bug debug/110308] [14 Regression] ICE on audiofile-0.3.6: RTL: vartrack: Segmentation fault in mode_to_precision(machine_mode)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308 --- Comment #7 from manolis.tsamis at vrull dot eu --- Some context for the commit: This change is originally part of a late RTL pass to optimize memory accesses. There are a lot of cases (especially involving local arrays) that generate redundant stack pointer adds for address calculation which can get reduced to `mov reg, sp`, but without actually propagating these we don't gain anything. In general it should be good to allow propagation of the stack pointer if that is correct. Now for the actual issue, it looks like my change for that was a bit careless and I didn't properly understand the context of cprop-hardreg.

(In reply to Jakub Jelinek from comment #5)
> For debug info, propagation of the sp with the extra debug info related stuff
> (ORIGINAL_REGNO and REG_ATTRS) is I think very harmful, but I admit I haven't
> tried to understand why that change has been done.

Yes, the attachment of ORIGINAL_REGNO and REG_ATTRS to the stack pointer was accidental and is of course wrong. I have a proposed change below that fixes the segfault by not setting ORIGINAL_REGNO/REG_ATTRS for the stack pointer. My understanding is that this should be fine, but I'm still testing that.

> So I think there are 2 bugs here. First the loss of debugging info because of
> ch, and the latent segfault.

I'm still looking into the difference in the debug statements.

--cut here--
--- a/gcc/regcprop.cc
+++ b/gcc/regcprop.cc
@@ -423,7 +423,7 @@ maybe_mode_change (machine_mode orig_mode, machine_mode copy_mode,
      It's unclear if we need to do the same for other special registers.  */
   if (regno == STACK_POINTER_REGNUM)
     {
-      if (orig_mode == new_mode)
+      if (orig_mode == new_mode && new_mode == GET_MODE (stack_pointer_rtx))
	return stack_pointer_rtx;
      else
	return NULL_RTX;
     }
@@ -487,9 +487,14 @@ find_oldest_value_reg (enum reg_class cl, rtx reg, struct value_data *vd)
       new_rtx = maybe_mode_change (oldmode, vd->e[regno].mode, mode, i, regno);
       if (new_rtx)
	{
-	  ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (reg);
-	  REG_ATTRS (new_rtx) = REG_ATTRS (reg);
-	  REG_POINTER (new_rtx) = REG_POINTER (reg);
+	  if (new_rtx != stack_pointer_rtx)
+	    {
+	      ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (reg);
+	      REG_ATTRS (new_rtx) = REG_ATTRS (reg);
+	      REG_POINTER (new_rtx) = REG_POINTER (reg);
+	    }
+	  else if (REG_POINTER (new_rtx) != REG_POINTER (reg))
+	    return NULL_RTX;
	  return new_rtx;
	}
@@ -965,15 +970,27 @@ copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
       if (validate_change (insn, &SET_SRC (set), new_rtx, 0))
	{
-	  ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (src);
-	  REG_ATTRS (new_rtx) = REG_ATTRS (src);
-	  REG_POINTER (new_rtx) = REG_POINTER (src);
-	  if (dump_file)
-	    fprintf (dump_file,
-		     "insn %u: replaced reg %u with %u\n",
-		     INSN_UID (insn), regno, REGNO (new_rtx));
-	  changed = true;
-	  goto did_replacement;
+	  bool can_change;
+	  if (new_rtx != stack_pointer_rtx)
+	    {
+	      ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (src);
+	      REG_ATTRS (new_rtx) = REG_ATTRS (src);
+	      REG_POINTER (new_rtx) = REG_POINTER (src);
+	      can_change = true;
+	    }
+	  else
+	    can_change = (REG_POINTER (new_rtx) == REG_POINTER (src));
+
+	  if (can_change)
+	    {
+	      if (dump_file)
+		fprintf (dump_file,
+			 "insn %u: replaced reg %u with %u\n",
+			 INSN_UID (insn), regno, REGNO (new_rtx));
+	      changed = true;
+	      goto did_replacement;
+	    }
	}

       /* We need to re-extract as validate_change clobbers recog_data.  */
--cut here--
[Bug c/109393] Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 --- Comment #6 from manolis.tsamis at vrull dot eu --- (In reply to Richard Biener from comment #5) > (In reply to manolis.tsamis from comment #4) > > Given the original transform it should be valid to propagate the constant > > addition through the cast? > > Yes. Note doing so loses information, we know i + 1 doesn't overflow > (undefined behavior). Widening preserves this knowledge I think, but if > just an unsigned conversion would be propagated it would be lost. So, is that the reason that this transform isn't already implemented as an optimisation? But then again isn't this information also lost in the code currently produced by GCC, where the constant is already propagated? Although I can see how it is different to do the propagation of constants in the front-end only vs doing it anywhere while transforming the code; maybe that's the difference that matters. But hopefully doing better canonicalization/constant folding on address calculations and constants should also result in better optimisation opportunities overall.
[Bug c/109393] Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 --- Comment #4 from manolis.tsamis at vrull dot eu --- (In reply to Richard Biener from comment #3)
> It's probably a mismatch of GENERIC/GIMPLE folding. In this case it's
> pointer_int_sum prematurely distributing the multiplication:
>
> /* Return a tree for the sum or difference (RESULTCODE says which)
>    of pointer PTROP and integer INTOP. */
>
> tree
> pointer_int_sum (location_t loc, enum tree_code resultcode,
>                  tree ptrop, tree intop, bool complain)
> {
> ...
>   /* If what we are about to multiply by the size of the elements
>      contains a constant term, apply distributive law
>      and multiply that constant term separately.
>      This helps produce common subexpressions. */
>
> but this kind of stuff shouldn't be done by the frontends these days.
>
> Gating this fixes the issue. I think this piece of code should be axed
> (after careful evaluation of its effect)

I have been testing this change and by looking at some tests that it causes to fail, there is a regression on testcases like this (taken from copy-headers-5.c, there are some similar fails):

int is_sorted(int *a, int n)
{
    for (int i = 0; i < n - 1; i++)
        if (a[i] > a[i + 1])
            return 0;
    return 1;
}

The gimple code for gcc currently is (taken from dump-tree-ch2):

  _1 = (long unsigned int) i_18;
  _2 = _1 * 4;
  _3 = a_13(D) + _2;
  _4 = *_3;
  _5 = _1 + 1;
  _6 = _5 * 4;
  _7 = a_13(D) + _6;
  _8 = *_7;

While with this change the result is:

  _1 = (long unsigned int) i_11;
  _2 = _1 * 4;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = i_11 + 1;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _8 = a_14(D) + _7;
  _9 = *_8;

As a result the generated code has two loads per loop instead of one. The same holds if e.g. a[i + 1] > a[i + 2] is used because the constant addition is done before the cast to long unsigned int. I believe this is because the logic to distribute the constant is missing as a GIMPLE pattern? Given the original transform it should be valid to propagate the constant addition through the cast?
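The difference between the two GIMPLE shapes can be mimicked at the source level. This is a hypothetical sketch (my own function names, assuming 4-byte int), not GCC output: the first function mirrors cast-then-add, where both accesses share one widened index, while the second mirrors add-then-cast, where the +1 is trapped behind the conversion:

```c
#include <stdint.h>
#include <assert.h>

/* Cast first, then add: a[i] and a[i+1] share the widened index k,
   so the address of the second load is just "first address + 4".  */
int pair_cast_first (int *a, int i)
{
  uint64_t k = (uint64_t) i;                    /* _1 = (ulong) i       */
  int x = *(int *) ((char *) a + k * 4);        /* a[i]                 */
  int y = *(int *) ((char *) a + (k + 1) * 4);  /* a[i + 1], reuses k   */
  return x + y;
}

/* Add first, then cast: the second access needs its own int add and
   conversion, hiding the relationship between the two addresses.  */
int pair_add_first (int *a, int i)
{
  int x = *(int *) ((char *) a + (uint64_t) i * 4);
  int y = *(int *) ((char *) a + (uint64_t) (i + 1) * 4);
  return x + y;
}
```

Both compute a[i] + a[i+1] for non-negative i; the point is only how visible the constant offset is to later passes.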
[Bug tree-optimization/109393] Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 --- Comment #2 from manolis.tsamis at vrull dot eu --- (In reply to Andrew Pinski from comment #1)
> Note sometimes -fwrapv will optimize things because it can assume that
> overflow is defined as wrapping and this is one case that is true. Yes it
> sounds counter intuitive but it is true. Even re-association happens more
> with -fwrapv :).

Yes, it is truly counterintuitive :) Yet, leaving this wrapv behaviour aside, isn't this something that should always be optimized regardless of wrap/overflow flags?
[Bug tree-optimization/109393] New: Very trivial address calculation does not fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393 Bug ID: 109393 Summary: Very trivial address calculation does not fold Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: ---

The following function

int func(int *a, int j)
{
    int k = j - 1;
    return a[j - 1] == a[k];
}

surprisingly does not fold to `return 1;` at -O2 or higher (with any GCC version). It can also be seen here: https://godbolt.org/z/cqr43q7fq

There are a lot of variants of this behaviour but this is the most apparent. As can be seen in the godbolt link, the issue seems to be a combination of:
1) The -1 in a[j - 1] is turned into GIMPLE equivalent with *((a + (ulong) j) + (ulong) -1) but a[k] is turned into *(a + (ulong) (j - 1)).
2) The -1 is never propagated outside of the (long unsigned int) casts even if it's completely legal/possible.

I feel that I'm missing something here about pointer rules / the historical context of these choices and I would appreciate it if someone more knowledgeable could explain this combination to me. There are a lot of cases where this can lead to inefficient codegen, but most prominently this is the reason for an additional redundant load in a hot loop of SPEC2017's nab in the function downheap_pairs, and for similar missed optimizations in omnetpp's shiftup function. Hence this issue can both cause very unexpected missed optimizations (as in the example) and also decrease the performance of important benchmarks.

Note: The testcase is not optimized even with -fno-wrapv or -fstrict-overflow, but does optimize with -fwrapv, which is the reverse of what I would expect since -fno-wrapv should be more permissive.
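The two address computations described in points 1) and 2) are identical modulo 2^64 (with the element-size scaling written out), which is why the fold would be legal. A hand-written sketch of the byte offsets, with my own function names, not GCC's internal representation:

```c
#include <stdint.h>
#include <assert.h>

/* a[j - 1] as GIMPLE sees it:   a + (uint64_t) j * 4 + (uint64_t) -1 * 4
   a[k] with k = j - 1:          a + (uint64_t) (j - 1) * 4
   Both wrap identically in 64-bit unsigned arithmetic whenever
   j - 1 does not overflow int (which would be UB anyway).  */
uint64_t offset_form1 (int j)
{
  return (uint64_t) j * 4 + (uint64_t) -1 * 4;
}

uint64_t offset_form2 (int j)
{
  return (uint64_t) (j - 1) * 4;
}
```

Since the offsets agree for every valid j, *(a + offset_form1(j)) and *(a + offset_form2(j)) name the same object, and the comparison could fold to 1.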
[Bug tree-optimization/98138] BB vect fail to SLP one case
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 manolis.tsamis at vrull dot eu changed: What|Removed |Added CC||manolis.tsamis at vrull dot eu --- Comment #11 from manolis.tsamis at vrull dot eu --- > The full satd_8x4 looks like the following, the 2nd loop isn't to be > disregarded Regarding the second loop, this patch https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608827.html should result in improved vectorization and performance. Currently ((a>>15)&0x10001)*0x from abs2 is calculated using 4 vector operations (shift, bitand, shift+sub for the multiplication) whereas with the proposed patch this will be a single vector compare operation.
[Bug target/106346] Potential regression on vectorization of left shift with constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346 --- Comment #2 from manolis.tsamis at vrull dot eu --- I bisected the changes from GCC 10.3 onwards and the first commit that results in the "worse" version of the generated code is 9fc9573f9a5e9432e53c7de93985cfbd267f0309: [2/3] [vect] Add widening add, subtract patterns Add widening add, subtract patterns to tree-vect-patterns. Update the widened code of patterns that detect PLUS_EXPR to also detect WIDEN_PLUS_EXPR. These patterns take 2 vectors with N elements of size S and perform an add/subtract on the elements, storing the results as N elements of size 2*S (in 2 result vectors). This is implemented in the aarch64 backend as addl,addl2 and subl,subl2 respectively. Add aarch64 tests for patterns.
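A minimal loop of the kind these patterns target (my own sketch, not taken from the commit) widens 16-bit lanes into 32-bit lanes, which on aarch64 can map to the addl/addl2 (uaddl/uaddl2) pair mentioned in the commit message:

```c
#include <stdint.h>

/* Each iteration adds two 16-bit elements and stores a 32-bit result.
   Vectorized, N x 16-bit lanes become N x 32-bit lanes in two result
   vectors, matching the WIDEN_PLUS_EXPR pattern described above.  */
void widen_add (uint32_t *restrict dst, const uint16_t *restrict a,
                const uint16_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (uint32_t) a[i] + b[i];
}
```

The explicit cast makes the addition happen in the wide type, so no 16-bit wraparound is possible; that is what lets the vectorizer use a widening add instead of an add followed by an extend.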
[Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346 Bug ID: 106346 Summary: Potential regression on vectorization of left shift with constants Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: --- Target: aarch64 Created attachment 53317 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53317&action=edit Does not vectorize on GCC > 10.3

The following test case:

void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
{
    uint16_t diff_e0 = src1[0] - src2[0];
    uint16_t diff_e1 = src1[1] - src2[1];
    uint16_t diff_e2 = src1[2] - src2[2];
    uint16_t diff_e3 = src1[3] - src2[3];
    uint16_t diff_e4 = src1[4] - src2[4];
    uint16_t diff_e5 = src1[5] - src2[5];
    uint16_t diff_e6 = src1[6] - src2[6];
    uint16_t diff_e7 = src1[7] - src2[7];

    uint32_t a0 = diff_e0 << 1;
    uint32_t a1 = diff_e1 << 3;
    uint32_t a2 = diff_e2 << 4;
    uint32_t a3 = diff_e3 << 2;
    uint32_t a4 = diff_e4 << 12;
    uint32_t a5 = diff_e5 << 11;
    uint32_t a6 = diff_e6 << 9;
    uint32_t a7 = diff_e7 << 3;

    dst[0] = a0;
    dst[1] = a1;
    dst[2] = a2;
    dst[3] = a3;
    dst[4] = a4;
    dst[5] = a5;
    dst[6] = a6;
    dst[7] = a7;
}

Compiles at -O3 to nice vectorized code by loading the constants from memory in GCC 10.3:

        ldr     d0, [x1]
        adrp    x3, .LC0
        ldr     d1, [x2]
        adrp    x1, .LC1
        ldr     q3, [x3, #:lo12:.LC0]
        usubl   v0.8h, v0.8b, v1.8b
        ldr     q2, [x1, #:lo12:.LC1]
        uxtl    v1.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        sshl    v1.4s, v1.4s, v3.4s
        sshl    v0.4s, v0.4s, v2.4s
        stp     q1, q0, [x0]
        ret

But this has regressed in later releases, with GCC still loading the constants from memory but also emitting a lot of scalar code before that. For example GCC 13 produces:

        adrp    x3, .LC0
        ldrb    w6, [x1, 4]
        fmov    d0, x6
        ldrb    w7, [x1]
        ldr     q5, [x3, #:lo12:.LC0]
        fmov    d1, x7
        ldrb    w3, [x1, 5]
        ldrb    w4, [x1, 1]
        ldrb    w8, [x2, 4]
        ldrb    w5, [x2, 5]
        ins     v0.h[1], w3
        ldrb    w6, [x2]
        fmov    d2, x8
        ldrb    w3, [x2, 1]
        fmov    d3, x6
        ins     v2.h[1], w5
        ins     v1.h[1], w4
        ldrb    w9, [x1, 2]
        ins     v3.h[1], w3
        ldrb    w8, [x1, 6]
        ldrb    w7, [x2, 2]
        ldrb    w6, [x2, 6]
        ins     v1.h[2], w9
        ins     v0.h[2], w8
        ldrb    w5, [x1, 3]
        ins     v3.h[2], w7
        ldrb    w4, [x1, 7]
        ins     v2.h[2], w6
        ldrb    w1, [x2, 7]
        ldrb    w3, [x2, 3]
        ins     v1.h[3], w5
        ins     v0.h[3], w4
        ins     v2.h[3], w1
        ins     v3.h[3], w3
        adrp    x1, .LC1
        ldr     q4, [x1, #:lo12:.LC1]
        sub     v1.4h, v1.4h, v3.4h
        sub     v0.4h, v0.4h, v2.4h
        uxtl    v1.4s, v1.4h
        uxtl    v0.4s, v0.4h
        sshl    v1.4s, v1.4s, v5.4s
        sshl    v0.4s, v0.4s, v4.4s
        stp     q1, q0, [x0]
        ret

Interestingly, this happens only with left shift and not with right shift. GCC 10.3 vs trunk comparison: https://godbolt.org/z/xWbfGdfen
[Bug tree-optimization/106343] New: Addition with constants is not vectorized by SLP when it includes zero
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106343 Bug ID: 106343 Summary: Addition with constants is not vectorized by SLP when it includes zero Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: manolis.tsamis at vrull dot eu Target Milestone: --- Target: aarch64 Created attachment 53316 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53316&action=edit Does not vectorize

The following test case:

void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
{
    uint16_t diff_e0 = src1[0] - src2[0];
    uint16_t diff_e1 = src1[1] - src2[1];
    uint16_t diff_e2 = src1[2] - src2[2];
    uint16_t diff_e3 = src1[3] - src2[3];
    uint16_t diff_e4 = src1[4] - src2[4];
    uint16_t diff_e5 = src1[5] - src2[5];
    uint16_t diff_e6 = src1[6] - src2[6];
    uint16_t diff_e7 = src1[7] - src2[7];

    uint32_t a0 = diff_e0 + 1;
    uint32_t a1 = diff_e1 + 3;
    uint32_t a2 = diff_e2 + 4;
    uint32_t a3 = diff_e3 + 2;
    uint32_t a4 = diff_e4 + 12;
    uint32_t a5 = diff_e5 + 11;
    uint32_t a6 = diff_e6 + 9;
    uint32_t a7 = diff_e7 + 3;

    dst[0] = a0;
    dst[1] = a1;
    dst[2] = a2;
    dst[3] = a3;
    dst[4] = a4;
    dst[5] = a5;
    dst[6] = a6;
    dst[7] = a7;
}

Produces nice vectorized code on aarch64:

        ldr     d2, [x2]
        adrp    x3, .LC0
        ldr     d0, [x1]
        ldr     q1, [x3, #:lo12:.LC0]
        usubl   v0.8h, v0.8b, v2.8b
        uaddl   v2.4s, v0.4h, v1.4h
        uaddl2  v0.4s, v0.8h, v1.8h
        stp     q2, q0, [x0]
        ret

But if any of the constants is replaced with zero instead then scalar code is produced:

        ldrb    w4, [x2, 1]
        ldrb    w8, [x1, 1]
        ldrb    w3, [x2, 3]
        ldrb    w7, [x1, 3]
        sub     w8, w8, w4
        ldrb    w6, [x1, 4]
        and     w8, w8, 65535
        ldrb    w4, [x2, 4]
        sub     w7, w7, w3
        ldrb    w5, [x1, 5]
        and     w7, w7, 65535
        ldrb    w3, [x2, 5]
        sub     w6, w6, w4
        ldrb    w9, [x2, 6]
        and     w6, w6, 65535
        ldrb    w4, [x1, 6]
        sub     w5, w5, w3
        ldrb    w10, [x2, 7]
        and     w5, w5, 65535
        ldrb    w3, [x1, 7]
        sub     w4, w4, w9
        ldrb    w11, [x2]
        and     w4, w4, 65535
        ldrb    w9, [x1]
        sub     w3, w3, w10
        ldrb    w2, [x2, 2]
        add     w8, w8, 3
        ldrb    w10, [x1, 2]
        sub     w9, w9, w11
        and     w1, w3, 65535
        and     w9, w9, 65535
        sub     w10, w10, w2
        add     w3, w5, 11
        add     w2, w4, 9
        add     w7, w7, 2
        add     w6, w6, 12
        add     w1, w1, 3
        add     w4, w9, 1
        and     w5, w10, 65535
        stp     w4, w8, [x0]
        stp     w5, w7, [x0, 8]
        stp     w6, w3, [x0, 16]
        stp     w2, w1, [x0, 24]
        ret

It would be possible to produce the same vectorized code as above but with zero in the constants. If I understand correctly, the identity element of addition is not taken into consideration in the SLP vectorizer, which could be improved. The same happens with subtraction. I can reproduce this in any recent version of GCC (e.g. >= 10). Vectorized case: https://godbolt.org/z/5sbb1an89 Scalar case: https://godbolt.org/z/v8jPT9jEe