[Bug tree-optimization/102793] AArch64: sequential comparisons with equal conditional blocks don't use ccmp

2024-04-05 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102793

--- Comment #7 from Manolis Tsamis  ---
Also submitted in the lists:
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/648856.html

I should note that I needed to modify the test uninit-pred-6_c.c and remove
this check:

  if (l)
    if (n > 12)
      blah(v); /* { dg-bogus "uninitialized" "bogus warning" } */

because it was inconsistent. If the branches are merged into a simple
if (l && (n > 12)) it fails (on master). With the proposed patch we get more
ifcombining and better codegen, but this condition fails. I believe this would
need an enhancement in tree-ssa-uninit.

[Bug tree-optimization/102793] AArch64: sequential comparisons with equal conditional blocks don't use ccmp

2024-04-05 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102793

Manolis Tsamis  changed:

   What|Removed |Added

 CC||manolis.tsamis at vrull dot eu

--- Comment #6 from Manolis Tsamis  ---
Created attachment 57886
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57886&action=edit
Patch with proposed fix for the issue.

Adding a copy of the ifcombine pass after pre solves this issue and improves
code generation in some other cases too. I specifically chose after the pre
pass because it deduplicates basic blocks and helps in maximizing the effect of
ifcombine.
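
For reference, a minimal sketch of the idea (assuming the usual gcc/passes.def
mechanism and that the ifcombine pass can simply be cloned there; the attached
patch is the actual change):

      NEXT_PASS (pass_pre);
 +    NEXT_PASS (pass_tree_ifcombine);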

[Bug tree-optimization/114326] New: Missed optimization for A || B when !B implies A.

2024-03-13 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114326

Bug ID: 114326
   Summary: Missed optimization for A || B when !B implies A.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

The function below doesn't fold to return 0;

int cmp1(uint64_t d1, uint64_t d2) {
  if (((d1 ^ d2) & 0xabcd) == 0 || d1 != d2)
    return 0;
  return foo();
}

while the following function does: 

int cmp2(uint64_t d1, uint64_t d2) {
  if (d1 != d2 || ((d1 ^ d2) & 0xabcd) == 0)
    return 0;
  return foo();
}

The functions are equivalent since the lhs and rhs of || don't have side
effects.

In general, the pattern here is a side-effect-free expression a || b where !b
implies a, which should be optimized to true. As in the testcase above, a doesn't
necessarily imply !b. Something similar could be stated for && expressions.
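
A quick sanity check of the implication, as a stand-alone driver (hypothetical,
not part of the report): when the rhs d1 != d2 is false we have d1 == d2, so
d1 ^ d2 == 0 and the lhs is necessarily true, making the whole condition in
cmp1 a tautology.

#include <stdint.h>
#include <assert.h>

/* lhs of the || in cmp1 */
static int lhs_holds(uint64_t d1, uint64_t d2) {
  return ((d1 ^ d2) & 0xabcd) == 0;
}

int main(void) {
  /* the !(d1 != d2) case: d1 == d2 always makes the lhs true */
  for (uint64_t d = 0; d < (1u << 20); d++)
    assert(lhs_holds(d, d));
  return 0;
}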

Complementary godbolt link: https://godbolt.org/z/qK5bYf36T

[Bug tree-optimization/114287] New: [13.2.1 Regression] 416.gamess in SPEC CPU 2006 miscompiled

2024-03-08 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114287

Bug ID: 114287
   Summary: [13.2.1 Regression] 416.gamess in SPEC CPU 2006
miscompiled
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

On Linux/AArch64, 416.gamess from SPEC2006 is miscompiled:


Contents of h2ocu2+.gradient.err

Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG
STOP IN ABRT




Contents of triazolium.err

STOP IN ABRT



When compiled with:

  -g -O2 -flto=32 -fno-strict-aliasing

Compiles fine with GCC 13.2.1. 

GCC master built from commit d1b241b9506cdc0ebd3f43d12cf77d7c33271342,
configured with:

  --enable-shared, --enable-threads=posix, --enable-checking=release,
--enable-linker-build-id, --enable-plugin, --with-isl, --enable-lto

[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.

2024-02-23 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

--- Comment #10 from Manolis Tsamis  ---
(In reply to ptomsich from comment #9)
> (In reply to Manolis Tsamis from comment #0) 
> > E.g. another loop, non canonicalized names:
> > 
> > .L120:
> > ldr q30, [x0], 16
> > movi    v29.2s, 0
> > ld2 {v26.16b - v27.16b}, [x4], 32
> > movi    v25.4s, 0
> > zip1    v29.16b, v30.16b, v29.16b
> > zip2    v30.16b, v30.16b, v25.16b
> > umlal   v29.8h, v26.8b, v28.8b
> > umlal2  v30.8h, v26.16b, v28.16b
> > uaddw   v31.4s, v31.4s, v29.4h
> > uaddw2  v31.4s, v31.4s, v29.8h
> > uaddw   v31.4s, v31.4s, v30.4h
> > uaddw2  v31.4s, v31.4s, v30.8h
> > cmp x5, x0
> > bne .L120
> 
> Is it just me, or are the zip1 and zip2 instructions dead?
> 
> Philipp.

They certainly look dead, but they're not, because the umlal/umlal2 (and other
accumulate instructions) also read from the destination register.

There seems to be a missed optimization opportunity to use just a single `movi
v25.4s, 0` here, though.

Also, looking again at the generated code in the first example:

mov v23.16b, v18.16b
mla v23.16b, v17.16b, v25.16b

If I'm correct this could be folded into just

mla v18.16b, v17.16b, v25.16b

In which case most of the movs in the first and second example could be
eliminated. To me it looks like the accumulate instructions are missing some
optimizations.

[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.

2024-02-21 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

--- Comment #5 from Manolis Tsamis  ---
Also, I further investigated the codegen difference in the second example (zip
+ umlal vs umull) and it looks to be some sort of RTL ordering + combine issue.

Specifically, when we expand the RTL for the example there are some very
slight ordering differences where some non-dependent insns have swapped order.
One of these happens to precede a relevant vector statement, and then in one
case combine does the umlal transformation but not in the other.

Afaik combine has some limits on the instruction window it looks at, so it
seems plausible that ordering differences in RTL can later turn into major
codegen differences in a number of ways. Other differences seem to come from
register allocation, as you mentioned.

This doesn't yet provide any useful insight into whether the vectorization
improvements are accidental or not.

[Bug tree-optimization/114010] Unwanted effects of using SSA free lists.

2024-02-21 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

--- Comment #4 from Manolis Tsamis  ---
Hi Andrew,

Thanks for your insights on this. Let me reply to some of your points:

(In reply to Andrew Pinski from comment #1)
> >The most important case I have observed is that the vectorizer can fail or 
> >create inferior code with more shuffles/moves when the SSA names aren't 
> >monotonically increasing.
> 
> That should not be true.

Indeed, after further cleaning up the dumps, some differences that I was
considering were just due to the diff algorithm not doing a good job, and that
confused me (sigh).

So, for this example while we're in tree form I observe only naming changes,
but no different code or order of statements. 

(In reply to Andrew Pinski from comment #2)
> Note what I had found in the past it is not exactly SSA_NAMEs that cause the
> difference but rather the RTL register pesdu # causes differences in
> register allocation which was exposed from the different in operands
> canonicalization.

Yes, I have also observed this and it looks to be the main issue.

(In reply to Andrew Pinski from comment #3)
> The first example (of assembly here) in comment #0 is extra moves due to the
> RA not handling subreg that decent for the load/store lane. There are other
> bug reports dealing with that. Why the SSA_NAMES being monotonically help is
> just by an accident really. 
> 
> 

Do you happen to recall the relevant ticket(s)? I would like to have a look but
couldn't find them so far.

Also, while I agree that in some cases changes like this 'just happen' to
improve codegen in some particular case, across multiple experiments the
vectorized code was superior with sorted names and it was never worse with
sorted names. In most cases that I recall, the version that used unsorted names
had additional shuffles of different sorts or moves. So, while anecdotal, the
effect doesn't look accidental to me in this case. I feel like there may be
some subtle difference due to the names that helps in this case?

> 
> Also:
> > This mostly affects all the bitmaps that use SSA_NAME_VERSION as a key.
> 
> Most use sparse bitmaps there so it is not a big deal.
> 

Agreed and that's probably why I couldn't measure any non-trivial difference in
compilation times.

I should just note that there are also places that create vectors or other data
structures sized to the number of ssa_names, so in theory this could still help
in extreme cases.

> I think this should be split up in a few different bug reports really.
> One for each case where better optimizations happen.
> 
Ok, the only cases that I found to be clearly better are the ones related to
vectorization. Would it help to create a ticket just for that now, or should I
wait for the discussion in this one to conclude first?

> Also:
> >I have seen two similar source files generating the exact same GIMPLE code 
> >up to some optimization pass but then completely diverging due to different 
> >freelists.
> 
> The only case where I have seen this happen is expand will have different
> pesdu # really. Yes I noticed this effect while I did
> r14-569-g21e2ef2dc25de3 really.

Afaik, the codegen differences that I observed were due to the same reason, but
it nonetheless felt weird that the same GIMPLE could later produce two files
that differ w.r.t. name ordering just because the freelists were different
(but invisible in the dumps). So I naturally questioned 'why don't we just
flush the freelists after every pass if it's not a performance issue'?

[Bug tree-optimization/114010] New: Unwanted effects of using SSA free lists.

2024-02-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114010

Bug ID: 114010
   Summary: Unwanted effects of using SSA free lists.
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

Created attachment 57469
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57469&action=edit
Testcase and example compilations on AArch64

After some experiments, I have observed a number of unwanted effects due to
re-using SSA names through free lists. Specifically, the issue is that the
recycled SSA names have unpredictable ordering and the compiler is affected by
this ordering. The effects include:

  1) Worse code generation from some passes that explicitly or implicitly
depend on SSA name ordering. The most important case I have observed is that
the vectorizer can fail or create inferior code with more shuffles/moves when
the SSA names aren't monotonically increasing.
  2) Missed canonicalization opportunities. Functions that do the same thing
but have different SSA ordering will usually end up diverging after various
optimization passes. The generated assembly code also ends up very different
when it shouldn't be. Since the SSA freelists are affected by things like moving around
dead code this also makes comparing GIMPLE or assembly code much harder than
necessary.
  3) Because freelists persist through optimization passes, their hidden state
can affect optimizations/codegen in unexpected ways. I have seen two similar
source files generating the exact same GIMPLE code up to some optimization pass
but then completely diverging due to different freelists. Getting something
completely different by starting from the same GIMPLE was mildly surprising at
first.
  4) Although I haven't observed something worthwhile on that front, in theory
if a large chunk of names is free'd (e.g. dead code elimination) then the
pending free'd names will negatively affect performance and memory because
they're not compacted. This mostly affects all the bitmaps that use
SSA_NAME_VERSION as a key.

The easiest way to experiment with this behaviour is code like below:

...some code...
if (condition that will be eliminated, but later) {
  ...some function call that will be inlined now...
  ...or some code...
}
...some other code...
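
A concrete (hypothetical) instance of the pattern above; `debug` folds to
false, so the guarded call is inlined early and later removed as dead code,
releasing a chunk of SSA names back to the free list while the surrounding
code keeps its now non-contiguous names:

static int square(int x) { return x * x; }

int sum_with_dead_branch(const int *a, int n) {
  int sum = 0;
  const int debug = 0;           /* condition that will be eliminated, but later */
  if (debug)
    sum += square(n);            /* inlined now, dead later */
  for (int i = 0; i < n; i++)    /* ...some other code... */
    sum += a[i];
  return sum;
}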

I have crafted and attached a testcase (canon.cpp) that can showcase some of
these behaviors at different optimization levels.
As part of the experiments I have a modified GCC compiler that calls
`release_free_names_and_compact_live_names` after every optimization pass,
which is enough to guarantee that the generated names are in monotonically
increasing order.

Together with the testcase I have attached the AArch64 assembly code generated
at 8 different configurations: with or without the dead code in the source
file, with or without free name canonicalization, with O2 or with O3.

The observations at O2:

As per the previous notes, it can be seen that the code in with-dead-code.O2.txt
is quite different from without-dead-code.O2.txt (different instruction mix,
different instruction count, different register allocation etc.). This is due
to the free'd up names causing different SSA name allocation. In contrast,
with-dead-code.canonicalized.O2.txt and without-dead-code.canonicalized.O2.txt
are identical except for the stack offsets chosen for the variables. This is
easy to verify since they diff cleanly, both at the GIMPLE and assembly level.

The observations at O3:

At O3 the assembly no longer diffs cleanly in either case, but it is
interesting to observe the better vectorization done due to the better ordering
of names (in almost all places where there is vectorized code):

E.g. non-canonicalized:

.L37:
ld4 {v24.16b - v27.16b}, [x1], 64
shl v23.16b, v24.16b, 5
add v28.16b, v23.16b, v24.16b
mov v23.16b, v18.16b
mla v23.16b, v17.16b, v25.16b
mov v29.16b, v23.16b
mov v23.16b, v20.16b
mla v23.16b, v19.16b, v26.16b
mov v30.16b, v23.16b
mov v23.16b, v22.16b
mla v23.16b, v21.16b, v27.16b
mov v31.16b, v23.16b
st4 {v28.16b - v31.16b}, [x0], 64
cmp x3, x1
bne .L37

Assembly for the same code with canonicalized names, 3 instructions fewer:

.L37:
ld4 {v28.16b - v31.16b}, [x1], 64
mov v26.16b, v23.16b
mov v27.16b, v21.16b
shl v25.16b, v28.16b, 5
mla v26.16b, v24.16b, v29.16b
mla v27.16b, v22.16b, v30.16b
add v25.16b, v25.16b, v28.16b
mov v28.16b, v19.16b
mla v28.16b, v20.16b, v31.16b
st4 {v25.16b - v28.16b}, [x0], 64
cmp x1, x3
bne .L37

E.g. another loop, non canonicalized names:

.L120:
ldr   

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2023-12-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

--- Comment #8 from Manolis Tsamis  ---
(In reply to Alex Coplan from comment #7)
> (In reply to Manolis Tsamis from comment #6)
> > (In reply to Alex Coplan from comment #5)
> > > Also ICEs without -fno-strict-aliasing FWIW, so just -g -O2 -funroll-loops
> > > seems to be enough.
> > 
> > On my local build it doesn't reproduce without -fno-strict-aliasing; I just
> > double-checked (GCC master @ 4e0a467302fea56d63b7a6d17f99c0f388960dc7).
> 
> Hmm, that is odd, I definitely don't need that option even when built from
> the same commit.  How are you configuring GCC?  I suppose it doesn't matter
> since I can reproduce the problem as long as the fix works with/without
> -fno-strict-aliasing.

Odd indeed, but probably something with my configure flags as you mentioned.
Here's what I use:

--enable-shared \
--enable-threads=posix \
--enable-checking=release \
--with-system-zlib \
--enable-__cxa_atexit \
--disable-libunwind-exceptions \
--enable-linker-build-id \
--enable-libstdcxx-backtrace \
--enable-plugin \
--enable-initfini-array \
--enable-gnu-indirect-function \
--with-isl \
--enable-lto \
--with-cpu=neoverse-n1 \
--disable-multilib \
--disable-bootstrap

(I think the most uncommon one here is --with-cpu=neoverse-n1 ?)

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2023-12-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

--- Comment #6 from Manolis Tsamis  ---
(In reply to Alex Coplan from comment #5)
> Also ICEs without -fno-strict-aliasing FWIW, so just -g -O2 -funroll-loops
> seems to be enough.

On my local build it doesn't reproduce without -fno-strict-aliasing; I just
double-checked (GCC master @ 4e0a467302fea56d63b7a6d17f99c0f388960dc7).

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2023-12-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

--- Comment #3 from Manolis Tsamis  ---
Created attachment 56908
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56908&action=edit
Reduced testcase

[Bug target/113089] [14 Regression][aarch64] ICE in process_uses_of_deleted_def, at rtl-ssa/changes.cc:252 since r14-6605-gc0911c6b357ba9

2023-12-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113089

Manolis Tsamis  changed:

   What|Removed |Added

 CC||manolis.tsamis at vrull dot eu

--- Comment #2 from Manolis Tsamis  ---
The following testcase, reduced from x264 intrapred_chroma_plane:

typedef unsigned short uint16;

void intrapred_chroma_plane(uint16 ***mb_preds, int* max_imgpel_values, int crx, int cry, int px) {
  for (int uv = 0; uv < 2; uv++) {
    uint16 **mb_pred = mb_preds[uv + 1];
    uint16 **predU2 = &mb_pred[px - 2];
    uint16 *upPred = &mb_pred[px][px];
    int max_imgpel_value = max_imgpel_values[uv];

    int ih = upPred[crx - 1];
    for (int i = 0; i < crx*3; ++i)
      ih += upPred[crx*3];

    int iv = (mb_pred[cry - 1][px+1]);
    for (int i = 0; i < cry - 1; ++i) {
      iv += (i + 1) * (*(mb_preds[uv][0]) - *(*predU2--));
    }

    for (int j = 0; j < cry; ++j)
      for (int i = 0; i < crx; ++i)
        mb_pred[j][i] = (uint16) (max_imgpel_value * ((i * ih + iv)));
  }
}

reproduces the same ICE issue when compiled with:

gcc -g -O2 -funroll-loops -fno-strict-aliasing intra_chroma_pred.c

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-28 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #52 from Manolis Tsamis  ---
(In reply to Sam James from comment #51)
> manolis, did you have a chance to look at the remaining pass issue? You'll
> need to revert Dave's commit locally which made the issue latent for
> building Python.

Hi Sam, I had to work on some other things so I didn't get to find a fix yet,
but I'll be working on that again now (in light of the new info from PR111601
too). 

Thanks for the ping,
Manolis

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-13 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #48 from Manolis Tsamis  ---
(In reply to dave.anglin from comment #47)
> On 2023-11-13 4:33 a.m., manolis.tsamis at vrull dot eu wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415
> >
> > --- Comment #44 from Manolis Tsamis  ---
> > (In reply to John David Anglin from comment #39)
> >> In the f-m-o pass, the following three insns that set call clobbered
> >> registers r20-r22 are pulled from loop:
> >>
> >> (insn 186 183 190 29 (set (reg/f:SI 22 %r22 [478])
> >>  (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> >>  (const_int 388 [0x184]))) "../Python/compile.c":5964:9 120
> >> {addsi3}
> >>   (nil))
> >> (insn 190 186 187 29 (set (reg/f:SI 21 %r21 [479])
> >>  (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> >>  (const_int 392 [0x188]))) "../Python/compile.c":5964:9 120
> >> {addsi3}
> >>   (nil))
> >> (insn 194 191 195 29 (set (reg/f:SI 20 %r20 [480])
> >>  (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> >>  (const_int 396 [0x18c]))) "../Python/compile.c":5964:9 120
> >> {addsi3}
> >>   (nil))
> >>
> >> They are used in the following insns before call to compiler_visit_expr1:
> >>
> >> (insn 242 238 258 32 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int
> >> *)prephit
> >> mp_37 + 388B]+0 S4 A32])
> >>  (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173]))
> >> "../Python/compile.c"
> >> :5968:22 42 {*pa.md:2193}
> >>   (expr_list:REG_DEAD (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173])
> >>  (expr_list:REG_DEAD (reg/f:SI 22 %r22 [478])
> >>  (nil
> >> (insn 258 242 246 32 (set (reg:SI 26 %r26)
> >>  (reg/v/f:SI 5 %r5 [orig:198 c ] [198]))
> >> "../Python/compile.c":5969:15 42 {*pa.md:2193}
> >>   (nil))
> >> (insn 246 258 250 32 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int
> >> *)prephitmp_37 + 392B]+0 S4 A32])
> >>  (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169]))
> >> "../Python/compile.c":5968:22 42 {*pa.md:2193}
> >>   (expr_list:REG_DEAD (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169])
> >>  (expr_list:REG_DEAD (reg/f:SI 21 %r21 [479])
> >>  (nil
> >> (insn 250 246 254 32 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int
> >> *)prephitmp_37 + 396B]+0 S4 A32])
> >>  (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145]))
> >> "../Python/compile.c":5968:22 42 {*pa.md:2193}
> >>   (expr_list:REG_DEAD (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145])
> >>  (expr_list:REG_DEAD (reg/f:SI 20 %r20 [480])
> >>  (nil
> >>
> >> After the call, we have:
> >>
> >> (insn 1241 269 273 30 (set (reg/f:SI 22 %r22 [478])
> >>  (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> >> "../Python/compile.c":5970:20 -1
> >>   (nil))
> >> (insn 273 1241 1242 30 (set (mem:SI (plus:SI (reg/f:SI 22 %r22 [478])
> >>  (const_int 388 [0x184])) [4 MEM[(int *)_107 + 388B]+0 S4
> >> A32])
> >>  (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167]))
> >> "../Python/compile.c":5970:20 42 {*pa.md:2193}
> >>   (nil))
> >> (insn 1242 273 277 30 (set (reg/f:SI 21 %r21 [479])
> >>  (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> >> "../Python/compile.c":5970:20 -1
> >>   (nil))
> >> (insn 277 1242 1243 30 (set (mem:SI (plus:SI (reg/f:SI 21 %r21 [479])
> >>  (const_int 392 [0x188])) [4 MEM[(int *)_107 + 392B]+0 S4
> >> A32])
> >>  (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156]))
> >> "../Python/compile.c":5970:20 42 {*pa.md:2193}
> >>   (nil))
> >> (insn 1243 277 281 30 (set (reg/f:SI 20 %r20 [480])
> >>  (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> >> "../Python/compile.c":5970:20 -1
> >>   (nil))
> >> (insn 281 1243 299 30 (set (mem:SI (plus:SI (reg/f:SI 20 %r20 [480])
> >>  (const_int 396 [0x18c])) [4 MEM[(int *)_107 + 396B]+0 S4
> >> A32])
> >>  (reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134]))
> >> "../Python/compile.c":5970:20 42 {*pa.md:2193}
> >>   (nil))
> >>
> >> We have lost the offsets that were added initially to r20, r21 and r22.
> >>
> >> Previous ce3 pass had:
> >>
> >> (insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478])
> >>  (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> >>  (const_int 388 [0x184]))) "../Python/compile.c":5970:20 120
> >> {addsi3}
> >>   (nil))
> >> (insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int
> >> *)_107 + 388B]+0 S4 A32])
> >>  (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167]))
> >> "../Python/compile.c":5970:20 42 {*pa.md:2193}
> >>   (nil))
> >> (insn 276 273 277 30 (set (reg/f:SI 21 %r21 [479])
> >>  (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> >>  (const_int 392 [0x188]))) "../Python/compile.c":5970:20 120
> >> {addsi3}
> >>   

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-13 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #46 from Manolis Tsamis  ---
I have reproduced the segfault with f-m-o limited to only fold insn 272 from
compiler_call_helper. The exact transformation is:

Memory offset changed from 0 to 388 for instruction:
(insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(intD.1
*)_107 + 388B]+0 S4 A32])
(reg:SI 14 %r14 [orig:167 vect_pretmp_36.2448D.32932 ] [167]))
"Python/compile.c":5970:20 42 {*pa.md:2193}
 (nil))
deferring rescan insn with uid = 273.
Instruction folded:(insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478])
(plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
(const_int 388 [0x184]))) "Python/compile.c":5970:20 120 {addsi3}
 (nil))

This instruction is also included in the ones that Dave mentioned. Again, if
I'm missing something as to why this transformation is illegal please tell me.
Given these are also consecutive instructions, I'm just seeing here that

%r22 = %r19 + 388
[%r22] = %r14

is transformed to

%r22 = %r19
[%r22 + 388] = %r14
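
For what it's worth, a tiny C model of the rewrite (hypothetical names,
byte-addressed; not the actual RTL semantics) showing that the two forms have
the same memory effect as long as this is the only use of the register:

#include <assert.h>
#include <string.h>

static void store_before(unsigned char *r19, unsigned int r14) {
  unsigned char *r22 = r19 + 388;       /* %r22 = %r19 + 388 */
  memcpy(r22, &r14, sizeof r14);        /* [%r22] = %r14 */
}

static void store_after(unsigned char *r19, unsigned int r14) {
  unsigned char *r22 = r19;             /* %r22 = %r19 */
  memcpy(r22 + 388, &r14, sizeof r14);  /* [%r22 + 388] = %r14 */
}

int main(void) {
  unsigned char a[512] = {0}, b[512] = {0};
  store_before(a, 0x12345678u);
  store_after(b, 0x12345678u);
  assert(memcmp(a, b, sizeof a) == 0);  /* identical memory effect */
  return 0;
}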

I haven't tracked all other uses of %r22 yet, but in theory if there was any
non-foldable use of that register then the transformation wouldn't be made.

Manolis

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-13 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #45 from Manolis Tsamis  ---
(In reply to Jeffrey A. Law from comment #41)
> I would agree.  In fact,the whole point of the f-m-o pass is to bring those
> immediates into the memory reference.  It'd be really useful to know why
> that isn't happening.
> 
> The only thing I can think of would be if multiple instructions needed the
> %r20 in the RTL you attached.  Which might point to a refinement we should
> make in f-m-o, specifically the transformation isn't likely profitable if we
> aren't able to fold away a term or fold a constant term into the actual
> memory reference.

Jeff,

I'm confused about "It'd be really useful to know why that isn't happening.".
It can be seen in Dave's dumps that it *is* happening, e.g.:

Memory offset changed from 0 to 396 for instruction:
(insn 281 280 284 30 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int *)_107 +
396B]+0 S4 A32])
(reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134]))
"../Python/compile.c":5970:20 42 {*pa.md:2193}
 (nil))

Instruction folded:(insn 280 277 281 30 (set (reg/f:SI 20 %r20 [480])
(plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
(const_int 396 [0x18c]))) "../Python/compile.c":5970:20 120
{addsi3}
 (nil))

If you look at the RTL in f-m-o all these offsets are indeed moved into the
respective load/store. I don't know if cprop afterwards manages to eliminate
the unwanted move, but f-m-o does what it's supposed to do in this case.

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-13 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #44 from Manolis Tsamis  ---
(In reply to John David Anglin from comment #39)
> In the f-m-o pass, the following three insns that set call clobbered
> registers r20-r22 are pulled from loop:
> 
> (insn 186 183 190 29 (set (reg/f:SI 22 %r22 [478])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 388 [0x184]))) "../Python/compile.c":5964:9 120
> {addsi3}
>  (nil))
> (insn 190 186 187 29 (set (reg/f:SI 21 %r21 [479])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 392 [0x188]))) "../Python/compile.c":5964:9 120
> {addsi3}
>  (nil))
> (insn 194 191 195 29 (set (reg/f:SI 20 %r20 [480])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 396 [0x18c]))) "../Python/compile.c":5964:9 120
> {addsi3}
>  (nil))
> 
> They are used in the following insns before call to compiler_visit_expr1:
> 
> (insn 242 238 258 32 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int
> *)prephit
> mp_37 + 388B]+0 S4 A32])
> (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173]))
> "../Python/compile.c"
> :5968:22 42 {*pa.md:2193}
>  (expr_list:REG_DEAD (reg:SI 23 %r23 [orig:173 vect__102.2442 ] [173])
> (expr_list:REG_DEAD (reg/f:SI 22 %r22 [478])
> (nil
> (insn 258 242 246 32 (set (reg:SI 26 %r26)
> (reg/v/f:SI 5 %r5 [orig:198 c ] [198]))
> "../Python/compile.c":5969:15 42 {*pa.md:2193}
>  (nil))
> (insn 246 258 250 32 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int
> *)prephitmp_37 + 392B]+0 S4 A32])
> (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169]))
> "../Python/compile.c":5968:22 42 {*pa.md:2193}
>  (expr_list:REG_DEAD (reg:SI 29 %r29 [orig:169 vect__102.2443 ] [169])
> (expr_list:REG_DEAD (reg/f:SI 21 %r21 [479])
> (nil
> (insn 250 246 254 32 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int
> *)prephitmp_37 + 396B]+0 S4 A32])
> (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145]))
> "../Python/compile.c":5968:22 42 {*pa.md:2193}
>  (expr_list:REG_DEAD (reg:SI 31 %r31 [orig:145 vect__102.2444 ] [145])
> (expr_list:REG_DEAD (reg/f:SI 20 %r20 [480])
> (nil
> 
> After the call, we have:
> 
> (insn 1241 269 273 30 (set (reg/f:SI 22 %r22 [478])
> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> "../Python/compile.c":5970:20 -1
>  (nil))
> (insn 273 1241 1242 30 (set (mem:SI (plus:SI (reg/f:SI 22 %r22 [478])
> (const_int 388 [0x184])) [4 MEM[(int *)_107 + 388B]+0 S4
> A32])
> (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167]))
> "../Python/compile.c":5970:20 42 {*pa.md:2193}
>  (nil))
> (insn 1242 273 277 30 (set (reg/f:SI 21 %r21 [479])
> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> "../Python/compile.c":5970:20 -1
>  (nil))
> (insn 277 1242 1243 30 (set (mem:SI (plus:SI (reg/f:SI 21 %r21 [479])
> (const_int 392 [0x188])) [4 MEM[(int *)_107 + 392B]+0 S4
> A32])
> (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156]))
> "../Python/compile.c":5970:20 42 {*pa.md:2193}
>  (nil))
> (insn 1243 277 281 30 (set (reg/f:SI 20 %r20 [480])
> (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127]))
> "../Python/compile.c":5970:20 -1
>  (nil))
> (insn 281 1243 299 30 (set (mem:SI (plus:SI (reg/f:SI 20 %r20 [480])
> (const_int 396 [0x18c])) [4 MEM[(int *)_107 + 396B]+0 S4
> A32])
> (reg:SI 12 %r12 [orig:134 vect_pretmp_36.2452 ] [134]))
> "../Python/compile.c":5970:20 42 {*pa.md:2193}
>  (nil))
> 
> We have lost the offsets that were added initially to r20, r21 and r22.
> 
> Previous ce3 pass had:
> 
> (insn 272 269 273 30 (set (reg/f:SI 22 %r22 [478])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 388 [0x184]))) "../Python/compile.c":5970:20 120
> {addsi3}
>  (nil))
> (insn 273 272 276 30 (set (mem:SI (reg/f:SI 22 %r22 [478]) [4 MEM[(int
> *)_107 + 388B]+0 S4 A32])
> (reg:SI 14 %r14 [orig:167 vect_pretmp_36.2450 ] [167]))
> "../Python/compile.c":5970:20 42 {*pa.md:2193}
>  (nil))
> (insn 276 273 277 30 (set (reg/f:SI 21 %r21 [479])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 392 [0x188]))) "../Python/compile.c":5970:20 120
> {addsi3}
>  (nil))
> (insn 277 276 280 30 (set (mem:SI (reg/f:SI 21 %r21 [479]) [4 MEM[(int
> *)_107 + 392B]+0 S4 A32])
> (reg:SI 13 %r13 [orig:156 vect_pretmp_36.2451 ] [156]))
> "../Python/compile.c":5970:20 42 {*pa.md:2193}
>  (nil))
> (insn 280 277 281 30 (set (reg/f:SI 20 %r20 [480])
> (plus:SI (reg/f:SI 19 %r19 [orig:127 prephitmp_37 ] [127])
> (const_int 396 [0x18c]))) "../Python/compile.c":5970:20 120
> {addsi3}
>  (nil))
> (insn 281 280 284 30 (set (mem:SI (reg/f:SI 20 %r20 [480]) [4 MEM[(int
> *)_107 + 396B]+0 S4 A32])
>   

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-08 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #15 from Manolis Tsamis  ---
(In reply to Sam James from comment #13)
> Created attachment 56527 [details]
> compile.c.323r.fold_mem_offsets.bad.xz
> 
> Output from
> ```
> hppa2.0-unknown-linux-gnu-gcc -c  -DNDEBUG -g -fwrapv -O3 -Wall -O2  
> -std=c11 -Werror=implicit-function-declaration -fvisibility=hidden 
> -I/home/sam/git/cpython/Include/internal -IObjects -IInclude -IPython -I.
> -I/home/sam/git/cpython/Include -DPy_BUILD_CORE -o Python/compile.o
> /home/sam/git/cpython/Python/compile.c -fdump-rtl-fold_mem_offsets-all
> ```
> 
> If I instrument certain functions in compile.c with no optimisation
> attribute or build the file with -fno-fold-mem-offsets, Python works, so I'm
> reasonably sure this is the relevant object.

Thanks for the dump file! There are 66 folded/eliminated instructions in this
object file; I did look at each case and there doesn't seem to be anything
strange. In fact most of the transformations are straightforward:

 - All except a couple of cases don't involve any arithmetic, so it's just
moving a constant around.
 - The majority of the transformations are 'trivial' and consist of a single
add and then a memory operation: a sequence like X = Y + Const, R = MEM[X + 0]
is folded to X = Y, R = MEM[X + Const]. I wonder why so many of these exist and
are not optimized elsewhere.
 - There are some cases with negative offsets, but the calculations look
correct.
 - There are a few more complicated cases, but I've worked these out on paper and
they also look correct.

Of course I could be missing some more complicated effect, but what I want to
say is that everything looks sensible in this particular file.

> Thanks! You are very welcome to have access to some HPPA machines for
> this kind of work. Please email me an SSH public key + desired username
> if that sounds helpful.

Yes, since I couldn't find anything interesting in the dump, that would
definitely be helpful. Thanks!

Manolis

[Bug rtl-optimization/112415] [14 regression] Python 3.11 miscompiled on HPPA with new RTL fold mem offset pass, since r14-4664-g04c9cf5c786b94

2023-11-07 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112415

--- Comment #11 from Manolis Tsamis  ---
Hi all,

I will also go ahead and try to reproduce that, although it may take me some
time due to my limited experience with HPPA. Once I manage to reproduce, most
f-m-o issues are straightforward to locate by bisecting the transformed
instructions.

> I think the key object is Python/compile.o, but not certain yet.

In this case the dump file of fold-mem-offsets
(-fdump-rtl-fold_mem_offsets-all) could also be useful, as it contains all the
information needed to see whether a transformation is valid. If it would be
easy for anyone to provide the dump file, I could look at it and see if
anything stands out (until I manage to reproduce this).

Thanks,
Manolis

[Bug c/109393] Very trivial address calculation does not fold

2023-09-08 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

--- Comment #9 from Manolis Tsamis  ---
Created attachment 55856
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55856&action=edit
Address calculation pattern v1

[Bug c/109393] Very trivial address calculation does not fold

2023-09-04 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

--- Comment #7 from Manolis Tsamis  ---
After some attempts to improve on this, my current solution is:
  1) Do not change pointer_int_sum in c-common (otherwise codegen regressions
are observed)
  2) Introduce a pattern that folds (unsigned type) (x + CONST_INT_1) *
CONST_INT_2 to (unsigned type) (x * CONST_INT_2) + (unsigned type) (CONST_INT_1
* CONST_INT_2) under the right circumstances.

This combination improves codegen (including the testcase provided) and also
doesn't cause any regressions on the GCC testsuite or benchmarks that I have
tried so far.

Given that a[(ulong) i +/- C] in GIMPLE is quite common, I believe this change
helps with folding / canonicalization in many cases.

My current match.pd pattern to do that is below; any feedback or improvements
are welcome.

(simplify
 (mult (convert@3 (plus @0 INTEGER_CST@1)) INTEGER_CST@2)
  (with { tree tt = TREE_TYPE (@3); }
   (if (INTEGRAL_TYPE_P (type)
        && !TYPE_SATURATING (type)
        && !TYPE_OVERFLOW_TRAPS (type)
        && !TYPE_OVERFLOW_SANITIZED (type)
        && !TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
    (plus (mult (convert:tt @0) @2) (mult (convert:tt @1) @2)))))
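
For context, a hedged illustration (not from the patch) of the source-level
shape this targets; with the redistribution, the address of a[i + 1] becomes
((unsigned long) i * 4) + 4 and can share the scaled index with a[i] instead
of recomputing (unsigned long) (i + 1) * 4 from scratch:

int adjacent_sum(int *a, int i) {
  return a[i] + a[i + 1];   /* both accesses can share the (ulong) i * 4 term */
}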

[Bug rtl-optimization/111267] New: Codegen regression from i386 argument passing changes

2023-09-01 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111267

Bug ID: 111267
   Summary: Codegen regression from i386 argument passing changes
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
CC: roger at nextmovesoftware dot com, ubizjak at gmail dot com
  Target Milestone: ---

Code generation for the following testcase:

typedef struct {
  float minx, miny;
  float maxx, maxy;
} AABB;

int TestOverlap(AABB a, AABB b) {
  return a.minx <= b.maxx
  && a.miny <= b.maxy
  && a.maxx >= b.minx
  && a.maxx >= b.minx;
}

int TestOverlap2(AABB a, AABB b) {
  return a.miny <= b.maxy
  && a.maxx >= b.minx;
}

has regressed due to commit
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=bdf2737cda53a83332db1a1a021653447b05a7e7

GCC13.2:

TestOverlap:
comiss  xmm3, xmm0
movq    rdx, xmm0
movq    rsi, xmm1
movq    rax, xmm3
jb  .L10
shr rdx, 32
shr rax, 32
movd    xmm0, eax
movd    xmm4, edx
comiss  xmm0, xmm4
jb  .L10
movd    xmm1, esi
xor eax, eax
comiss  xmm1, xmm2
setnb   al
ret
.L10:
xor eax, eax
ret
TestOverlap2:
shufps  xmm0, xmm0, 85
shufps  xmm3, xmm3, 85
comiss  xmm3, xmm0
jb  .L17
xor eax, eax
comiss  xmm1, xmm2
setnb   al
ret
.L17:
xor eax, eax
ret

GCC trunk:

TestOverlap:
movq    rax, xmm1
movq    rdx, xmm2
movq    rsi, xmm0
mov rdi, rax
movq    rax, xmm3
mov rcx, rsi
xchg    rdx, rax
movd    xmm1, edx
mov rsi, rax
mov rax, rdx
comiss  xmm1, xmm0
jb  .L10
shr rcx, 32
shr rax, 32
movd    xmm0, eax
movd    xmm4, ecx
comiss  xmm0, xmm4
jb  .L10
movd    xmm0, esi
movd    xmm1, edi
xor eax, eax
comiss  xmm1, xmm0
setnb   al
ret
.L10:
xor eax, eax
ret
TestOverlap2:
movq    rdx, xmm2
movq    rax, xmm3
movq    rsi, xmm0
xchg    rdx, rax
mov rcx, rsi
mov rsi, rax
mov rax, rdx
shr rcx, 32
shr rax, 32
movd    xmm4, ecx
movd    xmm0, eax
comiss  xmm0, xmm4
jb  .L17
movd    xmm0, esi
xor eax, eax
comiss  xmm1, xmm0
setnb   al
ret
.L17:
xor eax, eax
ret

(Can also be seen here https://godbolt.org/z/E4xrEn6KW)

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-21 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #9 from manolis.tsamis at vrull dot eu ---
Hi,

This commit is known to be an issue and I'm working on a fix; you can find more
details in this ticket: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308.

Would it be easy for you to test whether the patch in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308#c10 also fixes this issue?

Thanks,
Manolis

[Bug fortran/110311] [14 Regression] gfortran 14.0 regression

2023-06-21 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110311

manolis.tsamis at vrull dot eu changed:

   What|Removed |Added

 CC||manolis.tsamis at vrull dot eu

--- Comment #6 from manolis.tsamis at vrull dot eu ---
Hi,

Due to the time frame mentioned (June 12-19), could you please test if the
offending commit is r14-1873-g6a2e8dcbbd4bab374b27abea375bf7a921047800 ? This
commit is now known to cause general issues, as also described in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308.

Thanks,
Manolis

[Bug debug/110308] [14 Regression] ICE on audiofile-0.3.6: RTL: vartrack: Segmentation fault in mode_to_precision(machine_mode)

2023-06-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308

--- Comment #10 from manolis.tsamis at vrull dot eu ---
Created attachment 55369
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55369&action=edit
ICE-fix-proposal-1

[Bug debug/110308] [14 Regression] ICE on audiofile-0.3.6: RTL: vartrack: Segmentation fault in mode_to_precision(machine_mode)

2023-06-20 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110308

--- Comment #7 from manolis.tsamis at vrull dot eu ---
Some context for the commit:

This change is originally part of a late RTL pass to optimize memory accesses.
There are a lot of cases (especially involving local arrays) that generate
redundant stack pointer adds for address calculation which can get reduced to
`mov reg, sp`, but without actually propagating these we don't gain anything.
In general it should be good to allow propagation of the stack pointer when
that is correct.

Now for the actual issue: it looks like my change for that was a bit careless
and I didn't properly understand the context of cprop-hardreg.

(In reply to Jakub Jelinek from comment #5)
> For debug info, propagation of the sp with the extra debug info related stuff 
> (ORIGINAL_REGNO and REG_ATTRS) is I think very harmful, but I admit I haven't 
> tried to understand why that change has been done.

Yes, the attachment of ORIGINAL_REGNO and REG_ATTRS to the stack pointer has
been accidental and is of course wrong.

I have a proposed change below that fixes the segfault by not setting
ORIGINAL_REGNO/REG_ATTRS for the stack pointer. My understanding is that this
should be fine, but I'm still testing that.

> So I think there are 2 bugs here. First the lost of debugging info because of 
> ch, and the latent segfault.

I'm still looking into the difference in the debug statements.

--cut here--
--- a/gcc/regcprop.cc
+++ b/gcc/regcprop.cc
@@ -423,7 +423,7 @@ maybe_mode_change (machine_mode orig_mode, machine_mode
copy_mode,
  It's unclear if we need to do the same for other special registers.  */
   if (regno == STACK_POINTER_REGNUM)
 {
-  if (orig_mode == new_mode)
+  if (orig_mode == new_mode && new_mode == GET_MODE (stack_pointer_rtx))
return stack_pointer_rtx;
   else
return NULL_RTX;
@@ -487,9 +487,14 @@ find_oldest_value_reg (enum reg_class cl, rtx reg, struct
value_data *vd)
   new_rtx = maybe_mode_change (oldmode, vd->e[regno].mode, mode, i,
regno);
   if (new_rtx)
{
- ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (reg);
- REG_ATTRS (new_rtx) = REG_ATTRS (reg);
- REG_POINTER (new_rtx) = REG_POINTER (reg);
+ if (new_rtx != stack_pointer_rtx)
+   {
+ ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (reg);
+ REG_ATTRS (new_rtx) = REG_ATTRS (reg);
+ REG_POINTER (new_rtx) = REG_POINTER (reg);
+   }
+ else if (REG_POINTER (new_rtx) != REG_POINTER (reg))
+   return NULL_RTX;
  return new_rtx;
}
 }
@@ -965,15 +970,27 @@ copyprop_hardreg_forward_1 (basic_block bb, struct
value_data *vd)

   if (validate_change (insn, &SET_SRC (set), new_rtx, 0))
{
- ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (src);
- REG_ATTRS (new_rtx) = REG_ATTRS (src);
- REG_POINTER (new_rtx) = REG_POINTER (src);
- if (dump_file)
-   fprintf (dump_file,
-"insn %u: replaced reg %u with %u\n",
-INSN_UID (insn), regno, REGNO (new_rtx));
- changed = true;
- goto did_replacement;
+ bool can_change;
+ if (new_rtx != stack_pointer_rtx)
+   {
+ ORIGINAL_REGNO (new_rtx) = ORIGINAL_REGNO (src);
+ REG_ATTRS (new_rtx) = REG_ATTRS (src);
+ REG_POINTER (new_rtx) = REG_POINTER (src);
+ can_change = true;
+   }
+ else
+   can_change
+ = (REG_POINTER (new_rtx) == REG_POINTER (src));
+
+ if (can_change)
+   {
+ if (dump_file)
+   fprintf (dump_file,
+"insn %u: replaced reg %u with %u\n",
+INSN_UID (insn), regno, REGNO (new_rtx));
+ changed = true;
+ goto did_replacement;
+   }
}
  /* We need to re-extract as validate_change clobbers
  recog_data.  */
--cut here--

[Bug c/109393] Very trivial address calculation does not fold

2023-05-11 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

--- Comment #6 from manolis.tsamis at vrull dot eu ---
(In reply to Richard Biener from comment #5)
> (In reply to manolis.tsamis from comment #4)
> > Given the original transform it should be valid to propagate the constant
> > addition through the cast?
> 
> Yes.  Note doing so loses information, we know i + 1 doesn't overflow
> (undefined behavior).  Widening preserves this knowledge I think, but if
> just an unsigned conversion would be propagated it would be lost.

So, is that the reason that this transform isn't already implemented as an
optimisation?

But then again isn't this information also lost in the code currently produced
by GCC, where the constant is already propagated? Although I can see how it is
different to do the propagation of constants in the front-end only vs doing it
anywhere while transforming the code; maybe that's the difference that matters.
But hopefully doing better canonicalization/constant folding on address
calculations and constants should also result in better optimisation
opportunities overall.

[Bug c/109393] Very trivial address calculation does not fold

2023-05-10 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

--- Comment #4 from manolis.tsamis at vrull dot eu ---
(In reply to Richard Biener from comment #3)
> It's probably a mismatch of GENERIC/GIMPLE folding.  In this case it's
> pointer_int_sum prematurely distributing the multiplication:
> 
> /* Return a tree for the sum or difference (RESULTCODE says which)
>of pointer PTROP and integer INTOP.  */
> 
> tree  
> pointer_int_sum (location_t loc, enum tree_code resultcode,
>  tree ptrop, tree intop, bool complain)
> { 
> ...
>   /* If what we are about to multiply by the size of the elements
>  contains a constant term, apply distributive law
>  and multiply that constant term separately.
>  This helps produce common subexpressions.  */
> 
> but this kind of stuff shouldn't be done by the frontends these days.
> 
> Gating this fixes the issue.  I think this piece of code should be axed
> (after careful evaluation of its effect)

I have been testing this change and, looking at some tests that it causes to
fail, there is a regression on testcases like this one (taken from
copy-headers-5.c; there are some similar fails):

int is_sorted(int *a, int n)
{
  for (int i = 0; i < n - 1; i++)
if (a[i] > a[i + 1])
  return 0;
  return 1;
}

The gimple code for gcc currently is (taken from dump-tree-ch2):
  _1 = (long unsigned int) i_18;
  _2 = _1 * 4;
  _3 = a_13(D) + _2;
  _4 = *_3;
  _5 = _1 + 1;
  _6 = _5 * 4;
  _7 = a_13(D) + _6;
  _8 = *_7;

While with this change the result is:
  _1 = (long unsigned int) i_11;
  _2 = _1 * 4;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = i_11 + 1;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _8 = a_14(D) + _7;
  _9 = *_8;

As a result the generated code has two loads per loop instead of one. 

The same holds if e.g. a[i + 1] > a[i + 2] is used because the constant
addition is done before the cast to long unsigned int. I believe this is
because the logic to distribute the constant is missing as a GIMPLE pattern?
Given the original transform it should be valid to propagate the constant
addition through the cast?
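
To make the overflow subtlety concrete, a small stand-alone example
(hypothetical, mine): the fold (long unsigned int) (i + 1) ->
(long unsigned int) i + 1 is only valid because i + 1 cannot wrap; the program
below simulates the wrapping case (what -fwrapv would define) and shows the
two sides diverge exactly there (assuming LP64).

#include <limits.h>
#include <stdio.h>

int main(void) {
  int i = INT_MAX;
  int wrapped = (int) ((unsigned) i + 1u);     /* models i + 1 under -fwrapv: INT_MIN */
  unsigned long lhs = (unsigned long) wrapped; /* sign-extends: 0xffffffff80000000 */
  unsigned long rhs = (unsigned long) i + 1;   /* 0x80000000 */
  printf("%lx vs %lx\n", lhs, rhs);            /* differ only in the overflow case */
  return 0;
}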

[Bug tree-optimization/109393] Very trivial address calculation does not fold

2023-04-04 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

--- Comment #2 from manolis.tsamis at vrull dot eu ---
(In reply to Andrew Pinski from comment #1)
> Note sometimes -fwrapv will optimize things because it can assume that
> overflow is defined as wrapping and this is one case that is true. Yes it
> sounds counter intuitive but it is true. Even re-association happens more
> with -fwrapv :).

Yes it is truly counter intuitive :)

Yet, leaving this wrapv behaviour aside, isn't this something that should
always be optimized regardless of wrap/overflow flags?

[Bug tree-optimization/109393] New: Very trivial address calculation does not fold

2023-04-03 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109393

Bug ID: 109393
   Summary: Very trivial address calculation does not fold
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---

The following function

int func(int *a, int j) {
  int k = j - 1;
  return a[j - 1] == a[k];
}

surprisingly does not fold to `return 1;` at -O2 or higher (with any GCC
version). It can also be seen here: https://godbolt.org/z/cqr43q7fq

There are a lot of variants for this behaviour but this is the most apparent.
As can be seen in the godbolt link, the issue seems to be a combination of:

  1) The -1 in a[j - 1] is turned into GIMPLE equivalent with *((a + (ulong) j)
+ (ulong) -1) but a[k] is turned into *(a + (ulong) (j - 1)).
  2) The -1 is never propagated outside of the (long unsigned int) casts even
if it's completely legal/possible.

I feel that I'm missing something here about pointer rules / the historical
context of these choices and I would appreciate it if someone more
knowledgeable could explain this combination to me.

There are a lot of cases where this can lead to inefficient codegen but most
prominently this is the reason for an additional redundant load in a hot loop of
SPEC2017's nab in the function downheap_pairs and similar missed optimizations
in omnetpp's shiftup function.

Hence this issue can both cause very unexpected missed optimization (as in the
example) and also decreases the performance of important benchmarks.

Note: The testcase is not optimized even with -fno-wrapv or -fstrict-overflow,
but does optimize with -fwrapv which is the reverse of what I would expect
since -fno-wrapv should be more permissive?

[Bug tree-optimization/98138] BB vect fail to SLP one case

2023-02-01 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

manolis.tsamis at vrull dot eu changed:

   What|Removed |Added

 CC||manolis.tsamis at vrull dot eu

--- Comment #11 from manolis.tsamis at vrull dot eu ---
> The full satd_8x4 looks like the following, the 2nd loop isn't to be
> disregarded

Regarding the second loop, this patch
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608827.html should
result in improved vectorization and performance.

Currently ((a>>15)&0x10001)*0x from abs2 is calculated using 4 vector
operations (shift, bitand, shift+sub for the multiplication) whereas with the
proposed patch this will be a single vector compare operation.

[Bug target/106346] Potential regression on vectorization of left shift with constants

2022-07-19 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #2 from manolis.tsamis at vrull dot eu ---
I bisected the changes from GCC 10.3 onwards and the first commit that results
in the "worse" version of the generated code is
9fc9573f9a5e9432e53c7de93985cfbd267f0309:

[2/3] [vect] Add widening add, subtract patterns

Add widening add, subtract patterns to tree-vect-patterns. Update the
widened code of patterns that detect PLUS_EXPR to also detect
WIDEN_PLUS_EXPR. These patterns take 2 vectors with N elements of size
S and perform an add/subtract on the elements, storing the results as N
elements of size 2*S (in 2 result vectors). This is implemented in the
aarch64 backend as addl,addl2 and subl,subl2 respectively. Add aarch64
tests for patterns.

[Bug tree-optimization/106346] New: Potential regression on vectorization of left shift with constants

2022-07-18 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

Bug ID: 106346
   Summary: Potential regression on vectorization of left shift
with constants
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---
Target: aarch64

Created attachment 53317
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53317&action=edit
Does not vectorize on GCC > 10.3

The following test case:

  void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
  {
uint16_t diff_e0 = src1[0] - src2[0];
uint16_t diff_e1 = src1[1] - src2[1];
uint16_t diff_e2 = src1[2] - src2[2];
uint16_t diff_e3 = src1[3] - src2[3];
uint16_t diff_e4 = src1[4] - src2[4];
uint16_t diff_e5 = src1[5] - src2[5];
uint16_t diff_e6 = src1[6] - src2[6];
uint16_t diff_e7 = src1[7] - src2[7];

uint32_t a0 = diff_e0 << 1;
uint32_t a1 = diff_e1 << 3;
uint32_t a2 = diff_e2 << 4;
uint32_t a3 = diff_e3 << 2;
uint32_t a4 = diff_e4 << 12;
uint32_t a5 = diff_e5 << 11;
uint32_t a6 = diff_e6 << 9;
uint32_t a7 = diff_e7 << 3;

dst[0] = a0;
dst[1] = a1;
dst[2] = a2;
dst[3] = a3;
dst[4] = a4;
dst[5] = a5;
dst[6] = a6;
dst[7] = a7;
  }

Compiles at -O3 to nice vectorized code by loading the constants from memory in
GCC 10.3:

  ldr d0, [x1]
  adrp x3, .LC0
  ldr d1, [x2]
  adrp x1, .LC1
  ldr q3, [x3, #:lo12:.LC0]
  usubl   v0.8h, v0.8b, v1.8b
  ldr q2, [x1, #:lo12:.LC1]
  uxtl v1.4s, v0.4h
  uxtl2   v0.4s, v0.8h
  sshl v1.4s, v1.4s, v3.4s
  sshl v0.4s, v0.4s, v2.4s
  stp q1, q0, [x0]
  ret

But this has regressed in later releases, with GCC still loading the constants
from memory but also emitting a lot of scalar code before that. For example GCC
13 produces:

  adrp x3, .LC0
  ldrb w6, [x1, 4]
  fmov d0, x6
  ldrb w7, [x1]
  ldr q5, [x3, #:lo12:.LC0]
  fmov d1, x7
  ldrb w3, [x1, 5]
  ldrb w4, [x1, 1]
  ldrb w8, [x2, 4]
  ldrb w5, [x2, 5]
  ins v0.h[1], w3
  ldrb w6, [x2]
  fmov d2, x8
  ldrb w3, [x2, 1]
  fmov d3, x6
  ins v2.h[1], w5
  ins v1.h[1], w4
  ldrb w9, [x1, 2]
  ins v3.h[1], w3
  ldrb w8, [x1, 6]
  ldrb w7, [x2, 2]
  ldrb w6, [x2, 6]
  ins v1.h[2], w9
  ins v0.h[2], w8
  ldrb w5, [x1, 3]
  ins v3.h[2], w7
  ldrb w4, [x1, 7]
  ins v2.h[2], w6
  ldrb w1, [x2, 7]
  ldrb w3, [x2, 3]
  ins v1.h[3], w5
  ins v0.h[3], w4
  ins v2.h[3], w1
  ins v3.h[3], w3
  adrp x1, .LC1
  ldr q4, [x1, #:lo12:.LC1]
  sub v1.4h, v1.4h, v3.4h
  sub v0.4h, v0.4h, v2.4h
  uxtl v1.4s, v1.4h
  uxtl v0.4s, v0.4h
  sshl v1.4s, v1.4s, v5.4s
  sshl v0.4s, v0.4s, v4.4s
  stp q1, q0, [x0]
  ret

Interestingly, this happens only with left shift and not with right shift.

GCC 10.3 vs trunk comparison: https://godbolt.org/z/xWbfGdfen

[Bug tree-optimization/106343] New: Addition with constants is not vectorized by SLP when it includes zero

2022-07-18 Thread manolis.tsamis at vrull dot eu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106343

Bug ID: 106343
   Summary: Addition with constants is not vectorized by SLP when
it includes zero
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: manolis.tsamis at vrull dot eu
  Target Milestone: ---
Target: aarch64

Created attachment 53316
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53316&action=edit
Does not vectorize

The following test case:

  void foo (uint32_t dst[8], uint8_t src1[8], uint8_t src2[8])
  {
uint16_t diff_e0 = src1[0] - src2[0];
uint16_t diff_e1 = src1[1] - src2[1];
uint16_t diff_e2 = src1[2] - src2[2];
uint16_t diff_e3 = src1[3] - src2[3];
uint16_t diff_e4 = src1[4] - src2[4];
uint16_t diff_e5 = src1[5] - src2[5];
uint16_t diff_e6 = src1[6] - src2[6];
uint16_t diff_e7 = src1[7] - src2[7];

uint32_t a0 = diff_e0 + 1;
uint32_t a1 = diff_e1 + 3;
uint32_t a2 = diff_e2 + 4;
uint32_t a3 = diff_e3 + 2;
uint32_t a4 = diff_e4 + 12;
uint32_t a5 = diff_e5 + 11;
uint32_t a6 = diff_e6 + 9;
uint32_t a7 = diff_e7 + 3;

dst[0] = a0;
dst[1] = a1;
dst[2] = a2;
dst[3] = a3;
dst[4] = a4;
dst[5] = a5;
dst[6] = a6;
dst[7] = a7;
  }

Produces nice vectorized code on aarch64:

  ldr d2, [x2]
  adrp x3, .LC0
  ldr d0, [x1]
  ldr q1, [x3, #:lo12:.LC0]
  usubl   v0.8h, v0.8b, v2.8b
  uaddl   v2.4s, v0.4h, v1.4h
  uaddl2  v0.4s, v0.8h, v1.8h
  stp q2, q0, [x0]
  ret

But if any of the constants is replaced with zero instead then scalar code is
produced:

ldrb w4, [x2, 1]
ldrb w8, [x1, 1]
ldrb w3, [x2, 3]
ldrb w7, [x1, 3]
sub w8, w8, w4
ldrb w6, [x1, 4]
and w8, w8, 65535
ldrb w4, [x2, 4]
sub w7, w7, w3
ldrb w5, [x1, 5]
and w7, w7, 65535
ldrb w3, [x2, 5]
sub w6, w6, w4
ldrb w9, [x2, 6]
and w6, w6, 65535
ldrb w4, [x1, 6]
sub w5, w5, w3
ldrb w10, [x2, 7]
and w5, w5, 65535
ldrb w3, [x1, 7]
sub w4, w4, w9
ldrb w11, [x2]
and w4, w4, 65535
ldrb w9, [x1]
sub w3, w3, w10
ldrb w2, [x2, 2]
add w8, w8, 3
ldrb w10, [x1, 2]
  sub w9, w9, w11
  and w1, w3, 65535
  and w9, w9, 65535
  sub w10, w10, w2
  add w3, w5, 11
  add w2, w4, 9
  add w7, w7, 2
  add w6, w6, 12
  add w1, w1, 3
  add w4, w9, 1
  and w5, w10, 65535
  stp w4, w8, [x0]
  stp w5, w7, [x0, 8]
  stp w6, w3, [x0, 16]
  stp w2, w1, [x0, 24]
  ret

It would be possible to produce the same vectorized code as above but with zero
in the constants. If I understand correctly, the identity element of addition
is not taken into consideration in the SLP vectorizer, which could be improved.
The same happens with subtraction.

I can reproduce this in any recent version of GCC (e.g. >= 10).

Vectorized case: https://godbolt.org/z/5sbb1an89
Scalar case: https://godbolt.org/z/v8jPT9jEe