[Bug target/114288] [14 regression] ICE when building binutils-2.41 on hppa (extract_constrain_insn, at recog.cc:2713)

2024-03-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114288

Richard Biener  changed:

   What|Removed |Added

   Keywords||ice-on-valid-code
   Target Milestone|--- |14.0
 Target||hppa

[Bug analyzer/114285] Use of uninitialized value when copying a struct field by field

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114285

--- Comment #1 from Richard Biener  ---
Given that GCC considers memory to be initialized when you write to it, and
copying from A to B involves a write to B, the uninit info would be lost if
A is uninitialized.  So IMO it's reasonable to diagnose a copy from
uninitialized, at least unless you can fully analyze all possible uses
of B (which, when B is memory, is unlikely).

Note that's not the analyzer-specific opinion but viewed from the
-Wuninitialized implementation point of view.

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #18 from Richard Biener  ---
r14-9391-g018ddc86b92851 doesn't seem to make a difference for this aarch64
IVOPTs case.  It might be that tree-affine.cc needs similar handling.  I'm
going to dig into that on Monday.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 114238, which changed state.

Bug 114238 Summary: [14 regression] Multiple 554.roms_r run-time regressions (4%-20%) since r14-9193-ga0b1798042d033
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114238

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/114238] [14 regression] Multiple 554.roms_r run-time regressions (4%-20%) since r14-9193-ga0b1798042d033

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114238

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #1 from Richard Biener  ---
r14-9391-g018ddc86b92851 fixed this on Zen2 for me as well.

[Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Richard Biener  ---
For me this is fixed on Zen2 with -Ofast -march=native.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 114269, which changed state.

Bug 114269 Summary: [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Richard Biener  changed:

   What|Removed |Added

 CC||ebotcazou at gcc dot gnu.org

--- Comment #14 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #13)
> (In reply to Richard Biener from comment #12)
> 
> > > But I think, we could do better. Adding CC.
> > 
> > We sure could, but I doubt it's too important?  Maybe for Go/Ada.
> 
> Preloading stuff is simply loading from the same DImode address, so I'd
> think that EH_NOTE should be moved from the original insn to the new insn
> without much problems.
> 
> Please note that on x86_32 split pass is later splitting DImode memory
> access to two SImode loads, this looks somehow harder problem as far as EH
> notes are concerned, as the one above.
> 
> I'm not versed in this area, so I'll leave the fix to someone else.

On RTL I'd defer to Eric here.

Note for the correctness issue on branches I'd probably prefer the
"simple" approach (unless a true solution turns out equally simple).

[Bug target/114284] [14 Regression] arm: Load of volatile short gets miscompiled (loaded twice)

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114284

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
   Priority|P3  |P1

--- Comment #1 from Richard Biener  ---
fold-mem-offsets pass?

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #12 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #11)
> (In reply to Richard Biener from comment #10)
> > The easiest fix would be to refuse applying STV to a insn that
> > can_throw_internal () (that's an insn that has associated EH info).  
> > Updating
> > in this case would require splitting the BB or at least moving the now
> > no longer throwing insn to the next block (along the fallthru edge).
> 
> This would be simply:
> 
> --cut here--
> diff --git a/gcc/config/i386/i386-features.cc
> b/gcc/config/i386/i386-features.cc
> index 1de2a07ed75..90acb33db49 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -437,6 +437,10 @@ scalar_chain::add_insn (bitmap candidates, unsigned int
> insn_uid,
>&& !HARD_REGISTER_P (SET_DEST (def_set)))
>  bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>  
> +  if (cfun->can_throw_non_call_exceptions

that part shouldn't be necessary, can_throw_internal is cheap enough
(but yes, unless STV handles calls it's correct)

> +  && can_throw_internal (insn))
> +return false;
> +
>/* ???  The following is quadratic since analyze_register_chain
>   iterates over all refs to look for dual-mode regs.  Instead this
>   should be done separately for all regs mentioned in the chain once.  */
> --cut here--
> 
> But I think, we could do better. Adding CC.

We sure could, but I doubt it's too important?  Maybe for Go/Ada.

[Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

--- Comment #4 from Richard Biener  ---
The following is a C testcase for a case where ranges will not help:

void foo (int *a, long js, long je, long is, long ie, long ks, long ke,
          long xi, long xj)
{
  for (long j = js; j < je; ++j)
    for (long i = is; i < ie; ++i)
      for (long k = ks; k < ke; ++k)
        a[i + j*xi + k*xi*xj] = 5;
}

SCEV analysis result before/after shows issues.  When you re-order the loops
so the fast increment goes innermost this doesn't make a difference for
vectorization though.  In the order above we now require (emulated) gather
which with SSE didn't work out and previously we used strided stores.

The reason seems to be that when analyzing k*xi*xj the first multiply
yields

(long int) {(unsigned long) ks_21(D) * (unsigned long) xi_24(D), +, (unsigned long) xi_24(D)}_3

but when then asking to fold the multiply by xj we fail as we run into

tree
chrec_fold_multiply (tree type,
                     tree op0,
                     tree op1)
{
...
    CASE_CONVERT:
      if (tree_contains_chrecs (op0, NULL))
        return chrec_dont_know;
      /* FALLTHRU */

but this case is somewhat odd as all other unhandled cases simply run into
fold_build2.  This possibly means we'd never build other ops with
CHREC operands.  This was added for PR42326.

I think we can handle sign-conversions from unsigned just fine; chrec_fold_plus
already does such a thing (but it misses one case).

Doing this restores things to some extent.

I'm testing this as an intermediate step before considering reversion of the
change.
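
To illustrate why sign-conversions from unsigned should be harmless here: in
unsigned (wrapping) arithmetic, { a, +, b } * c and { a * c, +, b * c } agree
at every iteration.  The following is only a sketch; the helper names are
made up for illustration and are not GCC functions.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the chrec {a, +, b} at iteration n, computed modulo 2^64.  */
static uint64_t chrec_at (uint64_t a, uint64_t b, uint64_t n)
{
  return a + n * b;
}

/* Evaluate {a, +, b} first, then multiply by c.  */
static uint64_t mul_unfolded (uint64_t a, uint64_t b, uint64_t c, uint64_t n)
{
  return chrec_at (a, b, n) * c;
}

/* Evaluate the folded form {a * c, +, b * c}.  */
static uint64_t mul_folded (uint64_t a, uint64_t b, uint64_t c, uint64_t n)
{
  return chrec_at (a * c, b * c, n);
}

/* Returns 1 if both forms agree for iterations 0 .. iters-1.  */
static int forms_agree (uint64_t a, uint64_t b, uint64_t c, uint64_t iters)
{
  for (uint64_t n = 0; n < iters; ++n)
    if (mul_unfolded (a, b, c, n) != mul_folded (a, b, c, n))
      return 0;
  return 1;
}

/* Hypothetical self-check; values chosen so that the products wrap.  */
void self_check (void)
{
  assert (forms_agree (UINT64_MAX - 5, 17689, 133, 1000));
  assert (forms_agree (3, UINT64_MAX, 7, 100));
}
```

Calling self_check () from a small main exercises both cases; the agreement
follows from multiplication distributing over addition modulo 2^64.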

[Bug tree-optimization/114074] [11/12/13 Regression] wrong code at -O1 and above on x86_64-linux-gnu since r8-343

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114074

--- Comment #10 from Richard Biener  ---
Some thoughts on the CHREC folding, in the context of the many reported
optimization regressions.

We try to handle { a, +, b } * c as { a * c, +, b * c } and the issue is
cases of undefined overflow this exposes.

We have to make sure to not introduce undefined behavior that isn't there:

 - at the first evaluation (the first iteration) both expressions behave
   the same with respect to overflow since it's a * c without any increment
 - further evaluations will do ((a + b) + ... + b) * c before and
   (a * c + b * c) + ... + b * c after the transform
 - (a + b) + b doesn't necessarily behave the same as a + 2*b

I'm not sure we can, say, rely on a * c not invoking undefined overflow
since we do not know whether the expression will be evaluated at runtime
and whether SCEV analysis properly handles conditional execution in this
regard (it just follows the data dependence graph).

For the fix I've looked at the simplest part: when does (a + b) * c possibly
not overflow but a * c or b * c does?  Only when a + b has smaller magnitude
than a or b, which should mean a and b have to have opposite sign.  Without
proving it, I think (a + b + b) * c vs. a * c + b * c + b * c doesn't add
anything, thus the addition can be ignored and just the multiplication matters.

If in the future we add 'assumptions' to SCEV results, we could
unconditionally do the simplification but register an appropriate
assumption that needs to hold (and which we could check at runtime).
At runtime it's probably enough to verify that b * c does not overflow;
the evaluation of a * c should be guarded.
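
A concrete instance of the hazard, as a sketch only: with a and b of opposite
sign, (a + b) * c can be fine in signed int while a * c is not.  The helper
names are hypothetical; __builtin_add_overflow and __builtin_mul_overflow are
the GCC overflow-checking builtins.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if (a + b) * c can be computed in int without overflow.  */
static bool unfolded_ok (int a, int b, int c)
{
  int sum, prod;
  return !__builtin_add_overflow (a, b, &sum)
         && !__builtin_mul_overflow (sum, c, &prod);
}

/* True if a * c + b * c can be computed in int without overflow.  */
static bool folded_ok (int a, int b, int c)
{
  int ac, bc, sum;
  return !__builtin_mul_overflow (a, c, &ac)
         && !__builtin_mul_overflow (b, c, &bc)
         && !__builtin_add_overflow (ac, bc, &sum);
}

/* Hypothetical self-check: a = INT_MAX, b = -INT_MAX, c = 2.  */
void self_check (void)
{
  /* (INT_MAX + -INT_MAX) * 2 is 0, no overflow.  */
  assert (unfolded_ok (INT_MAX, -INT_MAX, 2));
  /* INT_MAX * 2 already overflows, so the folded form is not safe.  */
  assert (!folded_ok (INT_MAX, -INT_MAX, 2));
}
```

For small operands such as (3, 4, 5) both predicates hold, which matches the
observation that only the multiplications matter.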

[Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

--- Comment #3 from Richard Biener  ---
good (base) vs. bad (peak) on Zen2 with -Ofast -march=native shows

Samples: 654K of event 'cycles', Event count (approx.): 743149709374
Overhead  Samples  Command          Shared Object               Symbol
  16.71%   109793  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] hsmoc_
  14.37%    94016  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] hsmoc_
   8.82%    57979  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] lorentz_
   8.48%    55451  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] lorentz_
   4.84%    31575  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] momx3_
   4.68%    30456  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] momx3_
   4.08%    26675  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] tranx3_
   3.56%    23145  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] tranx3_

for hsmoc_ it looks like a difference in transformations done:

-hsmoc.f:826:19: optimized: loop vectorized using 32 byte vectors

(there are a lot more missed vectorizations).

   subroutine hsmoc ( emf1, emf2, emf3 )

   integer is, ie, js, je, ks, ke
   common /gridcomi/
 &   is, ie, js, je, ks, ke
   integer in, jn, kn, ijkn
   integer  i   , j   , k
   parameter(in =   128+5
 &, jn =   128+5
 &, kn =   128+5)
   parameter(ijkn =   128+5)
   real*8 emf1(  in,  jn,  kn), emf2(  in,  jn,  kn)
   real*8 vint(ijkn), bint(ijkn)

   do 199 j=js,je+1
 do 59 i=is,ie
  do 858 k=ks,ke+1
 vint(k)= k
 bint(k)= k
 858  continue
  do 58 k=ks,ke+1
 emf1(i,j,k) = vint(k)
 emf2(i,j,k) = bint(k)
 58   continue
 59  continue
 199   continue

   return
   end

doesn't reproduce it though.  The actual difference for the whole testcase
is of course failed data-ref analysis:

 Creating dr for (*emf2_1966(D))[_402]
-analyze_innermost: success.
-   base_address: emf2_1966(D)
-   offset from base address: (ssizetype) ((((sizetype) _1928 * 17689 + (sizetype) j_2705 * 133) + (sizetype) i_2672) * 8)
-   constant offset from base address: -142584
-   step: 141512
-   base alignment: 8
+analyze_innermost: hsmoc.f:828:72: missed:  failed: evolution of offset is not affine.
+   base_address: 
+   offset from base address: 
+   constant offset from base address: 
+   step: 
+   base alignment: 0

and then

 hsmoc.f:826:19: note:   === vect_analyze_data_ref_accesses ===
-hsmoc.f:826:19: missed:   not consecutive access (*emf1_1964(D))[_402] = _403;
-hsmoc.f:826:19: note:   using strided accesses
-hsmoc.f:826:19: missed:   not consecutive access (*emf2_1966(D))[_402] = _404;
-hsmoc.f:826:19: note:   using strided accesses

and we use gather and fail because of costs.

I suspect that relying on global ranges (that could save us here) is quite
fragile when there's a lot of other code around and thus opportunity for
random transforms "trashing" them.

Using the patch from PR114151 and enabling ranger during vectorization oddly
enough doesn't help (even when wiping the SCEV cache).

The odd thing is with the testcase above we get

Access function 0: (integer(kind=8)) {(((unsigned long) _30 * 17689 + (unsigned long) _10) + (unsigned long) _66) + 18446744073709533793, +, 17689}_4;

where you can see some of the unsigned promotion being done, but we
still succeed.

As I'm lacking a smaller testcase right now it's difficult to understand why
we fail in one case but not the other.
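
For reference, the constants in the dumps appear to follow from the 133x133x133
real*8 arrays and Fortran's 1-based, column-major indexing.  This reading is an
assumption on my side, and the names below are made up for illustration; the
sketch only checks the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the testcase: each dimension is 128+5, real*8 is 8 bytes.  */
enum { DIM = 133, ELEM_SIZE = 8 };

/* Element adjustment for the lower bounds (1,1,1) in column-major order:
   -(1 + DIM + DIM*DIM).  */
static int64_t lower_bound_adjust (void)
{
  return -(int64_t) (1 + DIM + DIM * DIM);
}

/* The same adjustment in bytes.  */
static int64_t constant_byte_offset (void)
{
  return lower_bound_adjust () * ELEM_SIZE;
}

/* Byte distance between consecutive k for fixed i and j.  */
static int64_t k_step_bytes (void)
{
  return (int64_t) DIM * DIM * ELEM_SIZE;
}

/* The element adjustment after conversion to a 64-bit unsigned type.  */
static uint64_t adjust_as_unsigned (void)
{
  return (uint64_t) lower_bound_adjust ();
}

/* Hypothetical self-check against the numbers shown in the dumps.  */
void self_check (void)
{
  assert (DIM * DIM == 17689);
  assert (constant_byte_offset () == -142584);
  assert (k_step_bytes () == 141512);
  assert (adjust_as_unsigned () == UINT64_C (18446744073709533793));
}
```

So the large constant in the access function is just -17823 after the
unsigned promotion, and the step 17689 is the k stride in elements.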

[Bug tree-optimization/114277] [11/12/13/14 Regression] Missed optimization: x*(x||b) => x

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114277

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #10 from Richard Biener  ---
The easiest fix would be to refuse applying STV to an insn that
can_throw_internal () (that's an insn that has associated EH info).  Updating
in this case would require splitting the BB or at least moving the now
no longer throwing insn to the next block (along the fallthru edge).

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #16 from Richard Biener  ---
(In reply to Andrew Macleod from comment #12)
> 
> all VRP passes are the same now. so just schedule EVRP.   in theory, you
> could schedule the fast vrp pass I added, but its not heavily tested... but
> you could try it.  It doesnt do any back edges or switches (iirc), but does
> basic calculations in DOM order and exports/updates globals.
> 
> NEXT_PASS (pass_fast_vrp)

When I just want to update global ranges what do I do?  It looks like
VRP first and foremost calls range_of_stmt on each PHI and stmt in the
pre-fold hook.  Does that update global ranges?  It should at least
fill the cache so SCEV would pick up ranges, right?

So doing in the vectorizer sth like the following should get us the best
possible ranges?  Ah, probably only global ranges since the SCEV query
itself would still lack context sensitive info (but as said we don't have
a good context we can easily use).

Would doing sth like below gain anything in addition to your proposed
patch (for context-less queries like those done in SCEV)?

@@ -1240,6 +1241,37 @@ pass_vectorize::execute (function *fun)
   if (vect_loops_num <= 1)
 return 0;

+  scev_reset ();
+  auto ranger = enable_ranger (fun);
+
+  {
+    basic_block bb;
+    FOR_EACH_BB_FN (bb, fun)
+      {
+        for (auto gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+             gsi_next (&gsi))
+          {
+            tree name = gimple_range_ssa_p (PHI_RESULT (*gsi));
+            if (name)
+              {
+                Value_Range vr(TREE_TYPE (name));
+                ranger->range_of_stmt (vr, *gsi, name);
+              }
+          }
+        for (auto gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+          {
+            gimple *s = *gsi;
+            if (is_gimple_debug (s))
+              continue;
+            tree type = gimple_range_type (s);
+            if (type)
+              {
+                Value_Range vr(type);
+                ranger->range_of_stmt (vr, s);
+              }
+          }
+      }
+  }
+

[Bug middle-end/114270] Integer multiplication on floating point constant with conversion back to integer is not optimized

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114270

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-03-08
 Ever confirmed|0   |1

--- Comment #2 from Richard Biener  ---
I think it makes sense to optimize for 1/power-of-two only.  Whether
an actual integer division instruction we could replace x * FP_CST with
would be faster than int->FP, FP multiply, FP->int is questionable.
But a shift very likely is.

Special-casing just * 0.5 might also be an option.
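
As a sketch of what the * 0.5 special case would have to preserve (helper
names are hypothetical): the FP->int conversion truncates toward zero, so the
integer replacement is x / 2.  A plain arithmetic right shift rounds toward
negative infinity instead, so it would need a fix-up for negative x.  Right
shift of a negative int is implementation-defined in C; GCC implements it as
an arithmetic shift.

```c
#include <assert.h>
#include <limits.h>

/* The form from the report: int -> FP, FP multiply, FP -> int.  */
static int half_via_fp (int x)
{
  return (int) (x * 0.5);
}

/* Integer replacement; C division truncates toward zero like the
   FP -> int conversion does.  */
static int half_via_div (int x)
{
  return x / 2;
}

/* Plain shift; only guaranteed to match for non-negative x.  */
static int half_via_shift (int x)
{
  return x >> 1;
}

/* Number of x in [lo, hi] where the FP form and the division differ.  */
static int count_mismatches (int lo, int hi)
{
  int n = 0;
  for (int x = lo; x <= hi; ++x)
    if (half_via_fp (x) != half_via_div (x))
      ++n;
  return n;
}

/* Hypothetical self-check over a small range plus the extremes.  */
void self_check (void)
{
  assert (count_mismatches (-1000, 1000) == 0);
  assert (half_via_fp (INT_MIN) == half_via_div (INT_MIN));
  assert (half_via_fp (INT_MAX) == half_via_div (INT_MAX));
  assert (half_via_shift (7) == half_via_fp (7));
}
```

The multiplication by 0.5 is exact in double for every int, which is why the
comparison against x / 2 holds across the whole range.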

[Bug tree-optimization/114269] [14 Regression] Multiple 3-27% exec time regressions of 434.zeusmp since r14-9193-ga0b1798042d033

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
   Target Milestone|--- |14.0
   Last reconfirmed||2024-03-08
 Ever confirmed|0   |1

--- Comment #1 from Richard Biener  ---
I will look if I can find a nice testcase for x86_64 here.

[Bug tree-optimization/114268] [14 Regression] 5% exec time regression in 454.calculix on Aarch64

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114268

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Richard Biener  changed:

   What|Removed |Added

   Keywords||wrong-code
  Component|rtl-optimization|target

--- Comment #8 from Richard Biener  ---
I think it's split1 doing wrong.  We end up with

;; basic block 3, loop depth 0, count 118111600 (estimated locally, freq 1.), maybe hot
;;  prev block 2, next block 4, flags: (NEW, HOT_PARTITION, RTL, MODIFIED)
;;  pred:   2 [always]  count:118111600 (estimated locally, freq 1.) (FALLTHRU)
;; bb 3 artificial_defs: { }
;; bb 3 artificial_uses: { u-1(6){ }u-1(7){ }u-1(16){ }u-1(19){ }}
;; lr  in
;; lr  use
;; lr  def
;; live  in
;; live  gen
;; live  kill
(note 124 10 126 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(jump_insn 126 124 127 3 (set (pc)
(label_ref 125)) -1
 (nil)
 -> 125)
;;  succ:   6 [always]  count:118111600 (estimated locally, freq 1.)
;; lr  out
;; live  out

(barrier 127 126 84)
;; basic block 4, loop depth 0, count 0 (precise, freq 0.), probably never executed
;;  prev block 3, next block 5, flags: (REACHABLE, HOT_PARTITION, RTL, MODIFIED)
;;  pred:
;; bb 4 artificial_defs: { d-1(0){ }d-1(1){ }}
;; bb 4 artificial_uses: { u-1(6){ }u-1(7){ }u-1(16){ }u-1(19){ }}
;; lr  in6 [bp] 7 [sp] 16 [argp] 19 [frame]
;; lr  use   6 [bp] 7 [sp] 16 [argp] 19 [frame]
;; lr  def   0 [ax] 1 [dx] 114 115
;; live  in  6 [bp] 7 [sp] 16 [argp] 19 [frame]
;; live  gen 0 [ax] 1 [dx] 114 115
;; live  kill
(code_label/s 84 127 86 4 13 (nil) [1 uses])
(note 86 84 93 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 93 86 85 4 (set (reg:SI 115)
(reg:SI 0 ax)) "t.ii":22:42 -1
 (expr_list:REG_DEAD (reg:SI 0 ax)

so block 4 is unreachable.  split1 does

   102: r122:DI#0=vec_concat([r98:SI],0)
10: r102:DI#0=r122:DI#0
-  REG_EH_REGION 0xd
   124: NOTE_INSN_BASIC_BLOCK 3

that looks spurious, so possibly some other pass leaves around the dead EH.
Earlier this was

   10: r102:DI=[r98:SI]
  REG_EH_REGION 0xd
  ; pc falls through to BB 5

and STV2 changes this like

-   10: r102:DI=[r98:SI]
+  102: r122:DI#0=vec_concat([r98:SI],0)
+   10: r102:DI#0=r122:DI#0
   REG_EH_REGION 0xd
   ; pc falls through to BB 5

failing to move EH (or refuse the lowering).

Thus a target issue, even wrong-code I think as we now fail to catch
a trap by the [r98:SI] load.

[Bug rtl-optimization/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g0

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

--- Comment #7 from Richard Biener  ---
IMO verify_flow_info on RTL should ICE with unreachable blocks.

[Bug rtl-optimization/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g0

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Richard Biener  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #6 from Richard Biener  ---
Ah, -march=x86-64 was it.  The ICE means that the entry block wasn't reachable
from EXIT_BLOCK which means there are unreachable blocks.

This usually means some pass lacks CFG cleanup or delete_unreachable_blocks ().

A simple fix is the following, but the proper thing to do is track down who
leaves unreachable blocks around in the IL.

diff --git a/gcc/sched-rgn.cc b/gcc/sched-rgn.cc
index eb75d1bdb26..ff455ddd12e 100644
--- a/gcc/sched-rgn.cc
+++ b/gcc/sched-rgn.cc
@@ -65,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "dbgcnt.h"
 #include "pretty-print.h"
 #include "print-rtl.h"
+#include "cfgcleanup.h"

 /* Disable warnings about quoting issues in the pp_xxx calls below
that (intentionally) don't follow GCC diagnostic conventions.  */
@@ -3707,6 +3708,7 @@ rest_of_handle_live_range_shrinkage (void)
 #ifdef INSN_SCHEDULING
   int saved;

+  delete_unreachable_blocks ();
   initialize_live_range_shrinkage ();
   saved = flag_schedule_interblock;
   flag_schedule_interblock = false;
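
To make the "unreachable blocks" condition concrete, here is a sketch on a toy
CFG.  The struct and function names are hypothetical and are not GCC's
basic_block or cfgcleanup API; the graph is only loosely modeled on a block
that is left without predecessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 8
#define MAX_SUCCS 2

/* Toy CFG: block 0 is ENTRY, block 1 is EXIT; -1 means no edge.  */
struct toy_cfg
{
  int n_blocks;
  int succ[MAX_BLOCKS][MAX_SUCCS];
};

/* Mark every block reachable from ENTRY with an iterative DFS.  Each
   block is pushed at most once, so the stack cannot overflow.  */
static void mark_reachable (const struct toy_cfg *cfg,
                            bool reachable[MAX_BLOCKS])
{
  int stack[MAX_BLOCKS];
  int top = 0;

  memset (reachable, 0, MAX_BLOCKS * sizeof (bool));
  reachable[0] = true;
  stack[top++] = 0;
  while (top > 0)
    {
      int bb = stack[--top];
      for (int i = 0; i < MAX_SUCCS; ++i)
        {
          int s = cfg->succ[bb][i];
          if (s >= 0 && !reachable[s])
            {
              reachable[s] = true;
              stack[top++] = s;
            }
        }
    }
}

/* Count the blocks a cleanup step would have to delete.  */
static int count_unreachable (const struct toy_cfg *cfg)
{
  bool reachable[MAX_BLOCKS];
  int n = 0;

  mark_reachable (cfg, reachable);
  for (int bb = 0; bb < cfg->n_blocks; ++bb)
    if (!reachable[bb])
      ++n;
  return n;
}

/* ENTRY -> 2 -> 3 -> 6 -> EXIT; block 4 (and its successor 5) has no
   predecessor on any path from ENTRY.  */
static struct toy_cfg example_cfg (void)
{
  struct toy_cfg cfg = {
    .n_blocks = 7,
    .succ = { {2, -1}, {-1, -1}, {3, -1}, {6, -1},
              {5, -1}, {6, -1}, {1, -1} }
  };
  return cfg;
}

/* Hypothetical self-check: blocks 4 and 5 are unreachable.  */
void self_check (void)
{
  struct toy_cfg cfg = example_cfg ();
  assert (count_unreachable (&cfg) == 2);
}
```

A pass that assumes every block is reachable would index per-block data for
blocks 4 and 5 that was never set up, which is the kind of failure a cleanup
call before the pass avoids.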

[Bug tree-optimization/108355] [13/14 Regression] Dead Code Elimination Regression at -O2 since r13-2772-g9baee6181b4e42

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108355

--- Comment #12 from Richard Biener  ---
(In reply to Andrew Pinski from comment #11)
> (In reply to Hans-Peter Nilsson from comment #10)
> > (In reply to Richard Biener from comment #9)
> > > gcc.dg/tree-ssa/ssa-fre-104.c has been XFAILed.
> > 
> > Any obvious target-specific reason for this to XPASS on cris-elf, m68k-linux
> > and pru-elf?
> > 
> > (per recent gcc-testresults posts)
> 
> Most likely because
> int e[][1] = {0, 0, 0, 0, 0, 1};
> 
> Is done as a copy from a const static decl vs done via stores to e[i][0].
> 
> Maybe do s/5/2/ and change the number of elements down to 3 for the array
> and you will hit the issue again on those targets.

Huh, most likely, but I don't see how that would help ... that should
make it _fail_ to optimize this ...

So checking cris-elf I see

   :
  e[0][0] = 0;
  e[1][0] = 0;
  e[2][0] = 0;
  e[3][0] = 0;
  e[4][0] = 0;
  e[5][0] = 1;

   :
  bar25_ ();
  a.0_1 = a;
  _2 = e[5][a.0_1];
  if (_2 != 0)
goto ; [INV]
  else
goto ; [INV]

   :
  a.1_3 = a;
  e[a.1_3][0] = 0;
  foo ();
  goto ; [INV]

before FRE, the same IL as on x86_64.  A FRE dump diff reveals

 Setting value number of a.0_1 to a.0_1 (changed)
 Making available beyond BB3 a.0_1 for value a.0_1
 Value numbering stmt = _2 = e[5][a.0_1];
-Skipping possible redundant definition e[5][0] = 1;
-Setting value number of _2 to _2 (changed)
-Using extra use virtual operand .MEM_5
-Making available beyond BB3 _2 for value _2
+Setting value number of _2 to 1 (changed)

that's exactly the reason for the regression - we're now trying to skip
the definition on x86_64.  And we can do so there because of the alignment
of 'e' which on cris seems to be less than the size of 'int' (int is
aligned to 1 byte but its size is still 4 bytes).

So

int a;
int *b = 
int **c = 
int d;
void bar25_(void);
void foo(void);
int main() {
  int __attribute__((aligned(sizeof(int)))) e[][1] = {0, 0, 0, 0, 0, 1};
  for (;;) {
    bar25_();
    /* We should optimistically treat a == 0 because of the bounds of
       the object.  */
    if (e[5][a])
      break;
    e[a][0] = 0;
    foo();
  }
  *c = 
}

also fails on cris.  Let me update the testcase.
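
A small sketch of what forcing the alignment guarantees, independent of the
target's natural alignment of int.  The helper names are hypothetical;
__attribute__((aligned)) is the GCC variable attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when P is aligned to at least sizeof (int).  */
static int int_size_aligned (const void *p)
{
  return ((uintptr_t) p % sizeof (int)) == 0;
}

/* Declare the array with the forced alignment and report whether the
   address is int-size aligned and the last element is the 1.  */
static int forced_alignment_holds (void)
{
  int __attribute__((aligned (sizeof (int)))) e[][1] = {0, 0, 0, 0, 0, 1};
  return int_size_aligned (e) && e[5][0] == 1;
}

/* Number of rows the brace-elided initializer produces.  */
static int row_count (void)
{
  int e[][1] = {0, 0, 0, 0, 0, 1};
  return (int) (sizeof (e) / sizeof (e[0]));
}

/* Hypothetical self-check.  */
void self_check (void)
{
  assert (forced_alignment_holds ());
  assert (row_count () == 6);
}
```

With the attribute in place the alignment of 'e' no longer depends on whether
the target aligns int to 1 byte or to 4.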

[Bug rtl-optimization/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g0

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |WAITING
   Priority|P1  |P2

--- Comment #4 from Richard Biener  ---
We've released with the bug so this cannot be P1.  Note the revision it was
bisected to likely just made this latent issue show up.

Btw, I can't reproduce - any implicit options missing?

> ./cc1plus -quiet t.ii -O2 -m32 -flive-range-shrinkage -fno-dce 
> -fnon-call-exceptions

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #13 from Richard Biener  ---
(In reply to Georg-Johann Lay from comment #12)
> (In reply to Richard Biener from comment #10)
> > I think the target controls the "libcall" ABI that's used for calls to
> > libgcc,
> 
> You have a pointer how to do it or an example? IIRC I looked into it quite a
> while ago, and it didn't allow to specify/adjust call_used_regs[] etc.
> 
> > I think the target should implement an inline bswap, possibly via a
> > define_insn_and_split or define_split so the byte ops are only exposed
> > at a desired point;  important points being lower_subreg (split-wide-types)
> > and register allocation - possibly lower_subreg should itself know
> > how to handle bswap (though the degenerate AVR case is quite special).
> 
> That would result in SUBREGs all over the place.  As Vladimir pointed out in 
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110093#c5
> 
> DFA doesn't handle subregs properly, and register alloc then uses extra
> reloads, bloating the code (not only in PR110093 but also 114243.  Unlikely
> any pass will untangle the mess of four (set (subreg:QI (SI)) (subreg:QI
> (SI)))

Yep.  Which is why I was toying with the idea of having (bswap:SI ..) handled
during reload itself ...

The alternative would be to have SImode hardregs by using consecutive
registers and special constraints.  That reduces RA freedom but it would
allow bswap:SI to be split after reload.  Or not split at all but
emitted directly as a sequence of those eor's - of course then making
the assembly quite big, not taking advantage of the fact that we can
probably elide most reg-reg moves.  So splitting after reload might
allow the moves to be eliminated and avoiding the subreg DF.

That said, it probably needs (a lot of) experimenting.

What I've tried to communicate with the store-merging patch attempt is
that GIMPLE optimizations do not have enough information to decide whether
a bswap replacement is profitable or not.  Or at least there's no
sophisticated way I can think of that would work for AVR and other targets?

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #11 from Richard Biener  ---
diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
index 42b68abf61b..c9d4662656f 100644
--- a/gcc/gimple-ssa-store-merging.cc
+++ b/gcc/gimple-ssa-store-merging.cc
@@ -170,6 +170,7 @@
 #include "optabs-tree.h"
 #include "dbgcnt.h"
 #include "selftest.h"
+#include "regs.h"

 /* The maximum size (in bits) of the stores this pass should generate.  */
 #define MAX_STORE_BITSIZE (BITS_PER_WORD)
@@ -1484,7 +1485,8 @@ maybe_optimize_vector_constructor (gimple *cur_stmt)
   break;
 case 32:
   if (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
- && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing)
+ && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
+ && have_regs_of_mode[SImode])
{
  load_type = uint32_type_node;
  fndecl = builtin_decl_explicit (BUILT_IN_BSWAP32);
@@ -1545,7 +1547,8 @@ pass_optimize_bswap::execute (function *fun)
   tree bswap32_type = NULL_TREE, bswap64_type = NULL_TREE;

   bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
-  && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing);
+  && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
+  && have_regs_of_mode[SImode]);
   bswap64_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP64)
   && (optab_handler (bswap_optab, DImode) != CODE_FOR_nothing
   || (bswap32_p && word_mode == SImode)));


doesn't work.  AVR has regs of SImode.  There doesn't seem to be a way to
query the (maximum?) number of hardregs used for a mode.  Using

  bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
   && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
   && have_regs_of_mode[SImode]
   && hard_regno_nregs (0, SImode) == 1);

"works" but is surely wrong (whatever hardreg zero corresponds to).
Looking only at word_mode, requiring word_mode size >= SImode size like with

  bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
   && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
   && known_ge (GET_MODE_SIZE (word_mode), GET_MODE_SIZE (SImode)));

"works" but would affect many more targets.  Maybe && word_mode != QImode
is better.

Note that this will cut off _all_ bswap detection.  Thus my question on
profitability of detecting cases like those in libgcc2.c which then produces

__bswapsi2:
push r12
push r13
push r14
push r15
push r16
push r17
/* prologue: function */
/* frame size = 0 */
/* stack size = 6 */
.L__stack_usage = 6
mov r16,r22
mov r17,r23
mov r18,r24
mov r19,r25
mov r22,r19
clr r23
clr r24
clr r25
mov r15,r16
clr r14
clr r13
clr r12
or r22,r12
or r23,r13
or r24,r14
or r25,r15
mov r12,r17
mov r13,r18
mov r14,r19
clr r15
clr r12
clr r14
clr r15
or r22,r12
or r23,r13
or r24,r14
or r25,r15
mov r19,r18
mov r18,r17
mov r17,r16
clr r16
clr r16
clr r17
clr r19
or r22,r16
or r23,r17
or r24,r18
or r25,r19
/* epilogue start */
pop r17
pop r16
pop r15
pop r14
pop r13
pop r12
ret

then.

bswap detection does not try to do any sophisticated evaluation of costs.
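
For illustration, this is the kind of open-coded pattern the detection
recognizes, next to the builtin it gets replaced with.  A sketch only; the
function names are hypothetical, __builtin_bswap32 is the GCC builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Open-coded 32-bit byte swap using masks and shifts.  */
static uint32_t bswap32_open_coded (uint32_t x)
{
  return ((x & 0xff000000u) >> 24)
         | ((x & 0x00ff0000u) >> 8)
         | ((x & 0x0000ff00u) << 8)
         | ((x & 0x000000ffu) << 24);
}

/* The replacement; on a target without a cheap bswapsi this may end up
   as a call or a longer sequence.  */
static uint32_t bswap32_builtin (uint32_t x)
{
  return __builtin_bswap32 (x);
}

/* Hypothetical self-check: both forms compute the same value.  */
void self_check (void)
{
  assert (bswap32_open_coded (0x12345678u) == 0x78563412u);
  assert (bswap32_open_coded (0x12345678u) == bswap32_builtin (0x12345678u));
}
```

Both forms are semantically identical, so whether the replacement pays off is
purely a question of how the target expands each of them.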

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener  changed:

   What|Removed |Added

 CC||sayle at gcc dot gnu.org

--- Comment #10 from Richard Biener  ---
(In reply to Georg-Johann Lay from comment #8)
> (In reply to Richard Biener from comment #7)
> > Note I do understand what you are saying, just the middle-end in detecting
> > and using __builtin_bswap32 does what it does everywhere else - it checks
> > whether the target implements the operation.
> > 
> > The middle-end doesn't try to actually compare costs (it has no idea of the
> > bswapsi costs),
> 
> But even when the bswapsi insn costs nothing, the v14 code has these
> additional 6 movqi insns 32...37 compared to v13 code.  In order to have the
> same performance like v13 code, a bswapsi would have to cost negative 6
> insns.  And an optimizer that assumes negative costs is not reasonable, in
> particular because the recognition of bswap opportunities serves
> optimization -- or is supposed to serve it as far as I understand.
> 
> > and it most definitely doesn't see how AVR is special in
> > having only QImode registers and thus the created SImode load (which the
> > target supports!) will end up as four registers.
> 
> Even when the bswap insn would cost nothing the code is worse.

Yes, I know.

> > The only thing that maybe would make sense with AVR exposing bswapsi is
> > users calling __builtin_bswap but since it always expands as a libcall
> > even that makes no sense.
> 
> It makes perfect sense when C/C++ code uses __builtin_bswap32:
> 
> * With current bswapsi insn, the code does a call that performs SI:22 =
> bswap(SI:22) with NO additional register pressure.
> 
> * Without bswap insn, the code does a real ABI call that performs SI:22 =
> bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30 and r31;
> which are the most powerful GPRs.

I think the target controls the "libcall" ABI that's used for calls to
libgcc, but somehow we fail to go that path (but I can see __bswapsi
and __bswapdi even in the x86_64 libgcc).  In particular

OPTAB_NC(bswap_optab, "bswap$a2", BSWAP)

doesn't list bswap as having a libfunc ...

> > So my preferred fix would be to remove bswapsi from avr.md?
> 
> Is there a way that the backend can fold a call to an insn that performs
> better than a call? Like in TARGET_FOLD_BUILTIN?  As far as I know, the
> backend can only fold target builtins, but not common builtins?  Tree fold
> cannot fold to an insn obviously, but it could fold to inline asm, no?
> 
> Or can the target change an optabs entry so it expands to an insn that's
> more profitable than a respective call? (like avr.md's bswap insn with
> transparent call is more profitable than a real call).

I think the target should implement an inline bswap, possibly via a
define_insn_and_split or define_split so the byte ops are only exposed
at a desired point;  important points being lower_subreg (split-wide-types)
and register allocation - possibly lower_subreg should itself know
how to handle bswap (though the degenerate AVR case is quite special).

I've CCed Roger who might know the traps with "implementing" an SImode
bswap on a target with just QImode regs but multi-reg operations not
decomposed during most of the RTL pipeline(?)

> The avr backend does this for many other stuff, too:
> 
> divmod, SI and PSI multiplications, parity, popcount, clz, ffs, 

Indeed.  Maybe it's never the case that a loop implementing clz is
better than a libcall or separate div/mod are better than divmod
(oddly divmod also lacks the libcall entry for the optabs...).

> > Does it benefit from recognizing bswap done with shifts on an int?
> 
> I don't fully understand that question. You mean to write code that shifts
> bytes around like in
> uint32_t res = 0;
> res |= ((uint32_t) buf[0]) << 24;
> res |= ((uint32_t) buf[1]) << 16;
> res |= (uint32_t) buf[2] << 8;
> res |= buf[3];
> return res;
> is better than a bswapsi call?

Yeah.  Or comparing to open-coding the bswap without going through the call.
I don't have an AVR libgcc around, but libgcc2.s has

#ifdef L_bswapsi2
SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000u) >> 24)
          | (((u) & 0x00ff0000u) >>  8)
          | (((u) & 0x0000ff00u) <<  8)
          | (((u) & 0x000000ffu) << 24));
}
#endif 

and that's compiled to

__bswapsi2:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
rcall __bswapsi2
/* epilogue start */
ret

so this can't be it ;)
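
For reference, the shift-and-mask idiom under discussion can be written as
standalone C.  This is only a minimal sketch, not the libgcc source: the
names bswap32_shift and load_be32 are made up for illustration.  The first
function reverses the bytes of a 32-bit value with masks and shifts, the
second assembles a big-endian value from a byte buffer in the style of the
snippet quoted from comment #8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value using
   only masks and shifts.  */
static uint32_t
bswap32_shift (uint32_t u)
{
  return ((u & 0xff000000u) >> 24)
         | ((u & 0x00ff0000u) >> 8)
         | ((u & 0x0000ff00u) << 8)
         | ((u & 0x000000ffu) << 24);
}

/* Hypothetical helper: read four bytes as one big-endian 32-bit value.
   BUF must point to at least four readable bytes.  */
static uint32_t
load_be32 (const uint8_t *buf)
{
  assert (buf != NULL);
  uint32_t res = 0;
  res |= (uint32_t) buf[0] << 24;
  res |= (uint32_t) buf[1] << 16;
  res |= (uint32_t) buf[2] << 8;
  res |= (uint32_t) buf[3];
  return res;
}
```

Both forms only move whole bytes around, which is why on a target with
just byte registers they should in principle reduce to register moves.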

[Bug middle-end/114258] 2 stores happen when copying from a const union (array) to an union

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114258

--- Comment #2 from Richard Biener  ---
Huh, very strange RTL we emit for the union assignment.

void func_1(union U6 *a) {
  g_13 = *a;
}

works OK though.

[Bug ipa/114254] [11/12/13/14 regression] Indirect inlining through C++ member pointers fails if the underlying class has a virtual function

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114254

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P2
 Status|UNCONFIRMED |NEW
   Keywords||missed-optimization
   Target Milestone|--- |11.5
   Last reconfirmed||2024-03-07
 Ever confirmed|0   |1

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193

2024-03-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #14 from Richard Biener  ---
(In reply to Andrew Macleod from comment #13)
> Created attachment 57638 [details]
> patch
> 
> Ok, there were 2 issues with simply invoking range_of_stmt, which this new
> patch resolves.  If we aren't looking to fix this in GCC 14 right now
> anyway, this is the way to go.
> 
> 1) The cache has always tried to provide a global range by pre-folding a
> stmt for an estimate using global values.  This is a bad idea for PHIs when
> SCEV is invoked AND SCEV is calling ranger. This changes it to not
> pre-evaluate PHIs, which also saves time when functions have a lot of edges.
> It's mostly pointless for PHIs anyway since we're about to do a real
> evaluation.
> 
> 2) The cache's entry range propagator was not re-entrant.  We didn't
> previously need this, but with SCEV (and possibly other places) invoking
> range_of_expr without context and having range_of_stmt being called, we can
> occasionally get layered calls for cache filling (of different ssa-names).
> 
> With those 2 changes, we can now safely invoke range_of_stmt from a
> contextless range_of_expr call.
> 
> We would have tripped over this earlier if SCEV or one of those other places
> using range_of_expr without context had instead invoked range_of_stmt.  That
> would have been perfectly reasonable, and would have resulted in these same
> issues.  We never tripped over it because range_of_stmt is not used much
> outside of ranger.  That is the primary reason I wanted to track this down. 
> There were alternative paths to the same end result that would have
> triggered these issues.

It sounds like this part is a bugfix?

> Give this patch a try. it also bootstraps with no regressions.  I will queue
> it up for stage 1 instead assuming all is good.

It seems to work well, it now computes a lot of additional ranges and
causes a minor code generation change on the testcase (it doesn't fix the
observed regression though).

Thanks for working on this.

One thing still unexplored is whether, with better range info, we can lift the
constraint on the folding some more.  We're turning (A + i * B) * C into
(A * C + i * (B * C)) and need to avoid any additional intermediate undefined
overflow with this association for i in [0, n] (with n being the number of
iterations of the loop where i varies).

As said, if the regression is too important to ignore we could choose to
leave the bug unfixed for all but the case with A, B and C constant which
was the case for the testcase in the original PR.
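
To make the overflow concern concrete, here is a minimal C sketch.  It is
not GCC code; the helper names are hypothetical.  It evaluates every
intermediate of both forms in a wider type and reports whether each one
would still fit in int, assuming long long is wider than int.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if V is representable as int.  */
static bool
fits_int (long long v)
{
  assert (sizeof (long long) > sizeof (int));
  return v >= INT_MIN && v <= INT_MAX;
}

/* True if every intermediate of (a + i * b) * c fits in int.  */
static bool
original_form_ok (int a, int b, int c, int i)
{
  long long ib = (long long) i * b;
  if (!fits_int (ib))
    return false;
  long long sum = a + ib;
  if (!fits_int (sum))
    return false;
  return fits_int (sum * c);
}

/* True if every intermediate of a * c + i * (b * c) fits in int.  */
static bool
reassociated_form_ok (int a, int b, int c, int i)
{
  long long ac = (long long) a * c;
  long long bc = (long long) b * c;
  if (!fits_int (ac) || !fits_int (bc))
    return false;
  long long ibc = i * bc;
  if (!fits_int (ibc))
    return false;
  return fits_int (ac + ibc);
}
```

With 32-bit int, a = -1000000, b = 1000000, c = 3000 and i = 1 the
original form is fine (the sum is zero) while A * C alone already
overflows in the reassociated form, which is the kind of newly introduced
intermediate overflow the folding has to rule out.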

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #7 from Richard Biener  ---
Note I do understand what you are saying, just the middle-end in detecting and
using __builtin_bswap32 does what it does everywhere else - it checks whether
the target implements the operation.

The middle-end doesn't try to actually compare costs (it has no idea of the
bswapsi costs), and it most definitely doesn't see how AVR is special in
having only QImode registers and thus the created SImode load (which the
target supports!) will end up as four registers.  To me a 'bswap' on
AVR never makes sense since whatever is swapped will be _always_ available
as a set of byte registers.

That's why I question AVR exposing bswapsi to the middle-end rather than
suggesting the middle-end should maybe see whether AVR has any regs of
HImode or larger.  Note that would break for targets that could eventually
do a load-multiple byteswapped to a set of QImode regs (guess there's no
such one in GCC at least), but it's the only heuristic that might work here.

The only thing that maybe would make sense with AVR exposing bswapsi is
users calling __builtin_bswap but since it always expands as a libcall
even that makes no sense.

So my preferred fix would be to remove bswapsi from avr.md?

Does it benefit from recognizing bswap done with shifts on an int?

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
Yes, the missing "detail" for the middle-end is that the uint32_t is actually
4 separate byte registers.  And the 'int' argument to bswap32 requires
4 registers as well.

So bswap on a value is just register shuffling, right?  And thus this
libcall will never be better than doing it inline as you probably
cannot expect the incoming arguments and the outgoing return registers to be
allocated in a way so no reg-reg moves remain?

Of course since it's still SImode pseudos on RTL you might want to write
an expander that populates 4 QImode pseudos from the SImode one and
composes that back to a byte-swapped SImode one.  Hoping register allocation
can then elide everything again?

I'd try to avoid using subregs if possible though using those would be easiest
I think (but you might fall foul of RA similar to -fsplit-wide-types).
Shifts and truncates/zero_extends are possibly superior.  Who knows.  Or
split it only after reload and have the pattern consume one scratch you
need for the register-register moves.

Hey, maybe the RA itself can know how to allocate a bswap:SI optimally
and "reload" it to be reg-reg moves ...
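
As an illustration of the "just register shuffling" view, here is a
minimal C sketch that models an SImode value as four separate byte
registers.  The struct and function names are hypothetical and this is
not how the AVR backend represents anything; it only shows that the swap
needs two exchanges and one scratch byte once the value is already split.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an SImode value living in four QImode registers, least
   significant byte first.  Purely illustrative.  */
struct si_in_qi_regs
{
  uint8_t r[4];
};

/* Populate four byte "registers" from a 32-bit value.  */
static struct si_in_qi_regs
split_si (uint32_t x)
{
  struct si_in_qi_regs v;
  for (size_t i = 0; i < 4; i++)
    v.r[i] = (uint8_t) (x >> (8 * i));
  return v;
}

/* Compose the four byte "registers" back into a 32-bit value.  */
static uint32_t
join_si (struct si_in_qi_regs v)
{
  uint32_t x = 0;
  for (size_t i = 0; i < 4; i++)
    x |= (uint32_t) v.r[i] << (8 * i);
  return x;
}

/* The byte swap itself: two register exchanges through one scratch.  */
static struct si_in_qi_regs
bswap_regs (struct si_in_qi_regs v)
{
  uint8_t scratch;
  scratch = v.r[0]; v.r[0] = v.r[3]; v.r[3] = scratch;
  scratch = v.r[1]; v.r[1] = v.r[2]; v.r[2] = scratch;
  return v;
}

/* Split, shuffle, and compose again.  */
static uint32_t
bswap_via_regs (uint32_t x)
{
  struct si_in_qi_regs v = split_si (x);
  assert (join_si (v) == x);	/* Splitting loses no information.  */
  return join_si (bswap_regs (v));
}
```

If the register allocator is free to pick where the result bytes live,
even the exchanges could disappear, which is the hope expressed above.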

[Bug tree-optimization/114239] [14 regression] ice: error: definition in block does not dominate use in block

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114239

Richard Biener  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
*** Bug 114234 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/114234] [14 Regression] verify_ssa failure with early-break vectorisation

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114234

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE
   Target Milestone|--- |14.0

--- Comment #4 from Richard Biener  ---
Verified the PR114239 fix fixes this as well, dup.

*** This bug has been marked as a duplicate of bug 114239 ***

[Bug tree-optimization/114239] [14 regression] ice: error: definition in block does not dominate use in block

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114239

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Richard Biener  ---
Fixed.

[Bug tree-optimization/114249] [14 regression] ICE when building lvm2-2.03.22 (error: invalid types in nop conversion)

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114249

--- Comment #9 from Richard Biener  ---
*** Bug 114251 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/114251] [14 regression] ICE when building python-3.12.2's Hacl_Hash_SHA2.c (tree check: expected class ‘type’, have ‘exceptional’ (error_mark) in useless_type_conversion_p, at g

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114251

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Richard Biener  ---
Another duplicate.

*** This bug has been marked as a duplicate of bug 114249 ***

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #2 from Richard Biener  ---
Looking at avr.md there's no bswap implementation, only the libcall.  Why
expose it this way?

I suppose the pattern was added to get bswap recognition, so this is what you
get if you do that?

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-03-06
 Status|UNCONFIRMED |NEW
  Component|tree-optimization   |target
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
Confirmed.  It looks like we do no cost evaluation in
maybe_optimize_vector_constructor but checking that there's an optab
for bswap with SImode.

insn-flags.h:#define HAVE_bswapsi2 1

but somehow we end up doing a libcall?

We expand as

;; bswapdst_10 = __builtin_bswap32 (load_dst_9);

(insn 6 5 7 (set (reg:SI 47)
(mem:SI (reg/v/f:HI 46 [ buf ]) [0 MEM  [(const uint8_t *)buf_5(D)]+0 S4 A8])) -1
 (nil))

(insn 7 6 8 (set (reg:SI 22 r22)
(reg:SI 47)) -1
 (nil))

(insn 8 7 9 (set (reg:SI 22 r22)
(bswap:SI (reg:SI 22 r22))) -1
 (nil))

^^^

so why does that turn into a library call?

I think this is mis-communication between the middle-end and the target.

[Bug tree-optimization/114253] False positive maybe-uninitialized with std::optional and ternary

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114253

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
  Known to work||7.5.0
 Blocks||24639
   Last reconfirmed||2024-03-06
  Known to fail||12.3.1, 13.2.1, 14.0

--- Comment #3 from Richard Biener  ---
Confirmed.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=24639
[Bug 24639] [meta-bug] bug to track all Wuninitialized issues

[Bug tree-optimization/114246] [11/12/13 Regression] ICE: verify_gimple failed: invalid argument to gimple call with __builtin_memcpy()

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114246

Richard Biener  changed:

   What|Removed |Added

Summary|[11/12/13/14 Regression]|[11/12/13 Regression] ICE:
   |ICE: verify_gimple failed:  |verify_gimple failed:
   |invalid argument to gimple  |invalid argument to gimple
   |call with   |call with
   |__builtin_memcpy()  |__builtin_memcpy()
   Priority|P3  |P2
  Known to work||14.0
  Known to fail|14.0|
   Keywords|needs-bisection |ice-checking

--- Comment #5 from Richard Biener  ---
Fixed on trunk so far.

[Bug tree-optimization/114249] [14 regression] ICE when building lvm2-2.03.22 (error: invalid types in nop conversion)

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114249

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Richard Biener  ---
Fixed.

[Bug tree-optimization/114249] [14 regression] ICE when building lvm2-2.03.22 (error: invalid types in nop conversion)

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114249

--- Comment #6 from Richard Biener  ---
*** Bug 114250 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/114250] [14 regression] ICE when building glslang-1.3.275 (error: invalid types in nop conversion)

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114250

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #3 from Richard Biener  ---
Yes, duplicate.

*** This bug has been marked as a duplicate of bug 114249 ***

[Bug tree-optimization/114249] [14 regression] ICE when building lvm2-2.03.22 (error: invalid types in nop conversion)

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114249

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-03-06
 Ever confirmed|0   |1

--- Comment #5 from Richard Biener  ---
Mine.

[Bug ipa/114247] RISC-V: miscompile at -O3 and IPA SRA

2024-03-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114247

Richard Biener  changed:

   What|Removed |Added

 CC||jamborm at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
So there's a type mismatch of the formal argument type with the actual call
argument as inserted by IPA SRA, possibly confused by the union.

Martin?

[Bug tree-optimization/114246] [11/12/13/14 Regression] ICE: verify_gimple failed: invalid argument to gimple call with __builtin_memcpy()

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114246

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener  ---
Mine.

[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org
 Ever confirmed|0   |1
  Known to fail||13.2.1
   Last reconfirmed||2024-03-06
 Status|UNCONFIRMED |NEW

--- Comment #1 from Richard Biener  ---
Confirmed also with GCC 13.  I guess we're optimizing things in a strange
(invalid) way before WPA and get confused because of that.

Honza?

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #11 from Richard Biener  ---
(In reply to Richard Biener from comment #10)
> (In reply to Andrew Macleod from comment #9)
> > Created attachment 57620 [details]
> > proposed patch
> > 
> > Does this solve your problem if there is an active ranger?  it bootstraps
> > with no regressions
> 
> I'll check what it does.

So it does seem to help, not on the testcase's ultimate outcome, but for the
important bits of the analysis.  With adding an active ranger around IVOPTs
with

diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 7cae5bdefea..626fc5bf5d7 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -132,6 +132,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "dbgcnt.h"
 #include "cfganal.h"
+#include "gimple-range.h"

 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -8280,6 +8281,8 @@ tree_ssa_iv_optimize (void)
   tree_ssa_iv_optimize_init ();
   mark_ssa_maybe_undefs ();

+  enable_ranger (cfun);
+
   /* Optimize the loops starting with the innermost ones.  */
   for (auto loop : loops_list (cfun, LI_FROM_INNERMOST))
 {
@@ -8292,6 +8295,8 @@ tree_ssa_iv_optimize (void)
   tree_ssa_iv_optimize_loop (&data, loop, toremove);
 }

+  disable_ranger (cfun);
+
   /* Remove eliminated IV defs.  */
   release_defs_bitset (toremove);


I then see the following difference with a ranger-debug dump during IVOPTs:

 11   range_of_expr(_12)
- TRUE : (11) range_of_expr (_12) [irange] int VARYING
+ TRUE : (11) range_of_expr (_12) [irange] int [0, +INF]
...
   Base:(long unsigned int) (int) ((unsigned int) _12 + 1) * 2
   Step:2
   Biv: N
-  Overflowness wrto loop niter:Overflow
+  Overflowness wrto loop niter:No-overflow
...
-74   range_of_expr(_103)
- TRUE : (74) range_of_expr (_103) [irange] int VARYING
+64   range_of_expr(_103)
+ TRUE : (64) range_of_expr (_103) [irange] int [-INF, 0]
 Analyzing # of iterations of loop 1
   exit condition [1, + , 1](no_overflow) <= _103
-  bounds on difference of bases: -2147483649 ... 2147483646
+  bounds on difference of bases: -2147483649 ... -1
   result:
 zero if _103 < 0
-# of iterations (unsigned int) _103, bounded by 2147483647
+# of iterations (unsigned int) _103, bounded by 0

So the important part is that it got the fact that _12 is positive.  As
analyzed in earlier comments I think that's all we can do, we don't know
anything about the other variable involved and thus can't avoid the
unsigned punning during SCEV analysis.

I think it's a good change, let's keep it queued for stage1 at this point
unless we really know a case where it helps to avoid a regression with
r14-9193-ga0b1798042d033.

For testing, what's the "easiest" pass/thing to do to recompute global
ranges now?  In the past I'd schedule EVRP but is there now a ranger
API to do this?  Just to see if full global range compute before IVOPTs
would help.
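
The effect of the better range on the niter bound in the dump above can be
reproduced with a small C sketch.  The loop shape is an assumption read
off the dump (IV {1, +, 1}, exit test "iv <= n"), and all function names
are hypothetical; this is not the SCEV or ranger code.

```c
#include <assert.h>
#include <limits.h>

/* Count iterations of a hypothetical loop whose IV is {1, +, 1} and
   whose exit test is "iv <= n".  The IV is kept in a wider type so it
   cannot overflow.  */
static unsigned int
count_iterations (int n)
{
  unsigned int count = 0;
  for (long long iv = 1; iv <= n; iv++)
    count++;
  return count;
}

/* Closed form matching the dump: zero if n < 0, else (unsigned int) n.  */
static unsigned int
niter_closed_form (int n)
{
  return n < 0 ? 0u : (unsigned int) n;
}

/* Upper bound on the iteration count when n is known to lie in
   [lo, hi].  The closed form is monotonic in n, so only hi matters.  */
static unsigned int
niter_bound (int lo, int hi)
{
  assert (lo <= hi);
  return niter_closed_form (hi);
}
```

With a VARYING range the bound is INT_MAX; once the range is known to be
[-INF, 0] the bound collapses to 0, as in the second dump.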

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since r14-9193

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #10 from Richard Biener  ---
(In reply to Andrew Macleod from comment #9)
> Created attachment 57620 [details]
> proposed patch
> 
> Does this solve your problem if there is an active ranger?  it bootstraps
> with no regressions

I'll check what it does.

> It's pretty minimal, and basically we invoke the cache's version of
> range_of_expr if there is no context.   I tweaked it such that if there is
> no context, and the def has not been calculated yet, it calls range_of_def,
> and combines it with any SSA_NAME_RANGE_INFO that may have pre-existed.  All
> without invoking any new lookups.
> 
> This seems relatively harmless and does not spawn new dynamic lookups.   As
> long as it resolves your issue...   If it still does not work, and we
> require the def to actually be evaluated, I will look into that. We probably
> should do that anyway.  There appears to be a cycle when this is invoked
> from the loop analysis, probably because folding of PHIs uses loop info...
> and back and forth we go.

Yeah, I ran into this as well.

> I'd probably need to add a flag to the ranger
> instantiation to tell it to avoid using loop info.

I've quickly tried to detect active discovery in SCEV but it wasn't as easy
as I thought.

> Are we looking to fix this in this release?

I think the full evaluation has to wait for stage1 because of that recursion
issue.  I'm also sure we're going to need ways to _not_ do this, so maybe
a clearer separation in the API is warranted.  As I see it when you call
range_of_expr without context you get the same result as if using the
global range query so maybe it should be a different API from the start
(the one that is now without context) and range_of_expr without context
using a conservative default (the definition point).

[Bug tree-optimization/112307] Segmentation fault with -O1 -fcode-hoisting

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112307

--- Comment #6 from Richard Biener  ---
Thanks, so keeping this open but it will likely end up INVALID.

[Bug tree-optimization/114239] [14 regression] ice: error: definition in block does not dominate use in block

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114239

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Priority|P3  |P1
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
   Target Milestone|--- |14.0
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-03-05

--- Comment #2 from Richard Biener  ---
Confirmed.  This is the peeled early exit reduction epilog case.  I will have a
look tomorrow.

[Bug tree-optimization/114236] introduce unnecessary store operation when unrolling a loop

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114236

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |WONTFIX
 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
Executing store motion of MEM[(short int *)_16] from loop 2
Re-issueing dependent store of *g_70.2_4 from loop 2 on exit 4 -> 5
Moving statement

The extra store is required to enable sinking of the store to g_16 as
we don't know whether g_70 points to aliased memory.

I think we fail to realize that *g_70 is loop invariant as well, but
what you observe is a feature - it allows reducing the number of stores
to g_16.
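
A minimal C sketch of the transformation, with a hypothetical loop (not
the testcase from the report): the store through p is sunk out of the
loop, and the possibly aliasing store through q is issued once more on
the exit so the final memory contents do not change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop: both stores happen on every iteration.  */
static void
stores_in_loop (short *p, short *q, short w, int n)
{
  assert (p != NULL && q != NULL);
  for (int i = 0; i < n; i++)
    {
      *p = (short) i;
      *q = w;
    }
}

/* Hand-written equivalent after store motion of *p.  */
static void
stores_after_motion (short *p, short *q, short w, int n)
{
  assert (p != NULL && q != NULL);
  short p_val = 0;
  for (int i = 0; i < n; i++)
    {
      p_val = (short) i;	/* Value kept in a register.  */
      *q = w;
    }
  if (n > 0)
    {
      *p = p_val;		/* Sunk store, executed once.  */
      *q = w;			/* Re-issued store; needed if q == p.  */
    }
}
```

Without the re-issued store the aliasing case (q == p) would end with the
last loop value instead of w, so the extra store is the price for doing
only one store through p.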

[Bug tree-optimization/112307] Segmentation fault with -O1 -fcode-hoisting

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112307

Richard Biener  changed:

   What|Removed |Added

 CC||redi at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
I don't quite see how

  return EnumeratorRange(Enumerator(std::views::single(Intersection(;

makes CompositeMesh::Intersections() return an object that refers to itself
but points-to doesn't consider it could:

callarg(12) = 
...
ESCAPED = 
intersections = NONLOCAL
intersections = callarg(12)

debugging at -O0 shows

(gdb) p intersections
$1 = {enumerator_ = {
range_ =
{ >> = {}, _M_value = {_M_value = {is_boundary_ = true}}}, 
begin_ = std::optional = {
  [contained value] = 0x7fffde80}}, end_reached_ = false}
(gdb) p 
$2 = (EnumeratorRange *)
0x7fffde80

so it contains a reference to itself.  As said this feels like those other
issues which are maybe invalid because of object lifetimes.

But maybe there's something special here.

[Bug tree-optimization/114231] [12/13 Regression] ICE when building libjxl

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114231

Richard Biener  changed:

   What|Removed |Added

Summary|[12/13/14 regression] ICE   |[12/13 Regression] ICE when
   |when building libjxl|building libjxl
  Known to work||14.0

--- Comment #14 from Richard Biener  ---
Fixed on trunk so far.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #44 from Richard Biener  ---
(In reply to Richard Sandiford from comment #42)
> Created attachment 57605 [details]
> proof-of-concept patch to suppress peeling for gaps
> 
> How about the attached?  It records whether all accesses that require
> peeling for gaps could instead have used gathers, and only retries when
> that's true.  It means that we retry for only 0.034% of calls to
> vect_analyze_loop_1 in a build of SPEC2017 with -mcpu=neoverse-v1 -Ofast
> -fomit-frame-pointer.

I guess this idea would work, but as said full re-analysis shouldn't be
required, instead "just" the updated cost on the affected loads/stores
need to be recomputed?  Of course this would require quite some
implementation work.  If we want to just fix this regression the approach
looks sensible but it would be also applied to x86 which doesn't want to
compare costs, right?  I'm not sure the gather vs. permute costing there
makes this a good idea for stage4?

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #11 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #10)
> Created attachment 57612 [details]
> Prototype patch
> 
> Let's try this approach.

Yeah, I guess !TARGET_PARTIAL_REG_STALL || optimize_function_for_size_p (cfun)
is best elided (or also avoid the pattern when optimizing for size).

[Bug tree-optimization/114231] [12/13/14 regression] ICE when building libjxl

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114231

--- Comment #12 from Richard Biener  ---
So the immediate reason is that between analysis and transform whether we
consider the shift vectorizable changes.  That's because we code generated
a live lane which ended up changing operands in stmts we will vectorize
(there's some odd broken handling there I think).  That's because we
match up the scalar from the SLP node with the scalar in the scalar GIMPLE
IL (which changed) here:

  stmt_vec_info op1_def_stmt_info;
  slp_tree slp_op1;
  if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 1, &op1, &slp_op1,
                           &dt[1], &op1_vectype, &op1_def_stmt_info))
...
  FOR_EACH_VEC_ELT (stmts, k, slpstmt_info)
    {
      gassign *slpstmt = as_a <gassign *> (slpstmt_info->stmt);
      if (!operand_equal_p (gimple_assign_rhs2 (slpstmt), op1, 0))
        scalar_shift_arg = false;
    }

which is a bit fragile.

But the underlying issue seems to be the live lane stuff.

Ah, and that's because we do the reduction discovery on the original
scalar stmt while live lane extraction honors patterns when checking
whether the stmt is vectorized ...

I have a patch, not sure how big the fallout might be though.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #8 from Richard Biener  ---
> grep optimize_ insn-flags.h | wc -l
14

so it's not very many standard patterns that would be affected.  I'd say
using this kind of flag on standard patterns is at least fragile?

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 CC||hubicka at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener  ---
I think the question is more whether it's stable between the optab queries the
vectorizer or veclower does and RTL expansion.  Whether it's LTO or not
shouldn't play a role there (it might for the actual testcase of course).

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #6 from Richard Biener  ---
It's possibly on a cold path (yes, optimize_function_for_size_p should be
stable).  Note though that optimize_function_for_size_p might in theory
change between vectorization and RTL expansion, so maybe optab queries
in the vectorizer are broken by this if we ever re-check that condition
after optab initialization.

[Bug tree-optimization/114231] [12/13/14 regression] ICE when building libjxl

2024-03-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114231

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Status|NEW |ASSIGNED
   Priority|P3  |P2

--- Comment #11 from Richard Biener  ---
I will have a look.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #40 from Richard Biener  ---
So I wonder if we can use "local costing" to decide a gather is always OK
compared to the alternative with peeling for gaps.  On x86 gather tends
to be slow compared to open-coding it.

In the future we might want to explore whether we can re-do costing for
alternatives without re-running all of the analysis at least for decisions
we know have only "local" effect.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #37 from Richard Biener  ---
(In reply to Richard Sandiford from comment #36)
> Created attachment 57602 [details]
> proof-of-concept patch to suppress peeling for gaps
> 
> This patch does what I suggested in the previous comment: if the loop needs
> peeling for gaps, try again without that, and pick the better loop.  It
> seems to restore the original style of code for SVE.
> 
> A more polished version would be a bit smarter about when to retry.  E.g.
> it's pointless if the main loop already operates on full vectors (i.e. if
> peeling 1 iteration is natural in any case).  Perhaps the condition should
> be that either (a) the number of epilogue iterations is known to be equal to
> the VF of the main loop or (b) the target is known to support partial
> vectors for the loop's vector_mode.
> 
> Any thoughts?

Even more iteration looks bad.  I do wonder why load-lanes cannot avoid
peeling for gaps when gather can.  Also, for the stores we seem to use
elementwise stores rather than store-lanes.

To me the most obvious thing to try optimizing in this testcase is DR
analysis.  With -march=armv8.3-a I still see

t.c:26:22: note:   === vect_analyze_data_ref_accesses ===
t.c:26:22: note:   Detected single element interleaving array1[0][_8] step 4
t.c:26:22: note:   Detected single element interleaving array1[1][_8] step 4
t.c:26:22: note:   Detected single element interleaving array1[2][_8] step 4
t.c:26:22: note:   Detected single element interleaving array1[3][_8] step 4
t.c:26:22: note:   Detected single element interleaving array1[0][_1] step 4
t.c:26:22: note:   Detected single element interleaving array1[1][_1] step 4
t.c:26:22: note:   Detected single element interleaving array1[2][_1] step 4
t.c:26:22: note:   Detected single element interleaving array1[3][_1] step 4
t.c:26:22: missed:   not consecutive access array2[_4][_8] = _69;
t.c:26:22: note:   using strided accesses
t.c:26:22: missed:   not consecutive access array2[_4][_1] = _67;
t.c:26:22: note:   using strided accesses

so we don't figure out that

Creating dr for array1[0][_1]
base_address: 
offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2)
constant offset from base address: 0
step: 4
base alignment: 16
base misalignment: 0
offset alignment: 4
step alignment: 4
base_object: array1
Access function 0: {m_111 * 2, +, 2}_4
Access function 1: 0
Creating dr for array1[0][_8]
analyze_innermost: success.
base_address: 
offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * 2)
constant offset from base address: 0
step: 4
base alignment: 16
base misalignment: 0
offset alignment: 2
step alignment: 4
base_object: array1
Access function 0: {m_111 * 2 + 1, +, 2}_4
Access function 1: 0

belong to the same group (but the access functions tell us it worked out).
Above we fail to split the + 1 out into the constant offset.

See my hint to use int32_t m instead of uint32_t yielding

t.c:26:22: note:   Detected interleaving load of size 2
t.c:26:22: note:   _2 = array1[0][_1];
t.c:26:22: note:   _9 = array1[0][_8];
t.c:26:22: note:   Detected interleaving load of size 2
t.c:26:22: note:   _18 = array1[1][_1];
t.c:26:22: note:   _23 = array1[1][_8];
t.c:26:22: note:   Detected interleaving load of size 2
t.c:26:22: note:   _32 = array1[2][_1];
t.c:26:22: note:   _37 = array1[2][_8];
t.c:26:22: note:   Detected interleaving load of size 2
t.c:26:22: note:   _46 = array1[3][_1];
t.c:26:22: note:   _51 = array1[3][_8];
t.c:26:22: note:   Detected interleaving store of size 2
t.c:26:22: note:   array2[_4][_1] = _67;
t.c:26:22: note:   array2[_4][_8] = _69;

(and SLP being thrown away because we can use load/store lanes)
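
One way to read the int32_t hint: unsigned 32-bit arithmetic is defined to
wrap, so widening a sum to a 64-bit offset does not in general equal
widening the operand first and adding the constant afterwards.  Signed
overflow is undefined, so for a signed index the compiler may assume the two
forms agree.  The sketch below only illustrates that general rule; the
helper names are hypothetical and this is not GCC's data-reference analysis.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not GCC internals.  Offsets are in elements.  */

/* Offset as written with an unsigned 32-bit index: the "+ 1" is done
   in 32 bits (and may wrap) before the value is widened.  */
static uint64_t offset_add_then_widen(uint32_t idx)
{
    return (uint64_t)(uint32_t)(idx + 1u);
}

/* Offset with the "+ 1" split out as a constant applied after widening.  */
static uint64_t offset_widen_then_add(uint32_t idx)
{
    return (uint64_t)idx + 1u;
}

/* Returns 1 when splitting the constant out preserves the value.  */
static int split_is_safe(uint32_t idx)
{
    return offset_add_then_widen(idx) == offset_widen_then_add(idx);
}

/* Small self-check showing where the two forms diverge.  */
void check_split_examples(void)
{
    assert(split_is_safe(41u));
    assert(!split_is_safe(UINT32_MAX));  /* the 32-bit sum wraps to 0 */
}
```

The two forms differ only when the 32-bit addition wraps, which is the case
an analysis has to rule out before moving the constant out of the
conversion.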

[Bug middle-end/114197] [14 regression] ICE in verify_dominators

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114197

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Richard Biener  ---
Fixed both issues.

[Bug middle-end/114197] [14 regression] ICE in verify_dominators

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114197

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug target/114187] [14 regression] bizarre register dance on x86_64 for pass-by-value struct since r14-2526

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Richard Biener  ---
Fixed I assume.

[Bug rtl-optimization/114190] [14 regression] Wrong code with -O2 -fno-dce -fharden-compares -mvpclmulqdq --param=max-rtl-if-conversion-unpredictable-cost=136

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114190

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug tree-optimization/114108] [14 regression] ICE when building opencv-4.8.1 (error: type mismatch in binary expression) since r14-1833

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114108

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug ada/113536] [14 regression] valid reduction expression rejected

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113536

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4

[Bug ada/113979] [11/12/13/14 regression] bogus error on allocator for type with Dynamic_Predicate

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113979

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P4

[Bug ada/113979] [11/12/13/14 regression] bogus error on allocator for type with Dynamic_Predicate

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113979

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.5

[Bug analyzer/113619] [14 Regression] -Wanalyzer-tainted-divisor false positive seen in Linux kernel's fs/ceph/ioctl.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113619

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/113606] [14 Regression] -Wanalyzer-infinite-recursion false positive on code involving strstr, memset, strnlen and -D_FORTIFY_SOURCE

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113606

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/113505] [14 Regression] ICE: SIGSEGV in tree_class_check (tree.h:3766) with -O -fdump-analyzer -fanalyzer

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113505

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/113496] [12/13/14 Regression] ICE: in cmp, at analyzer/constraint-manager.cc:782 with -fanalyzer -fdump-analyzer

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113496

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |12.4

[Bug analyzer/113314] [14 Regression] -Wanalyzer-infinite-loop false positive seen on haproxy's fd.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113314

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/113150] [14 Regression] FAIL: c-c++-common/analyzer/fd-glibc-byte-stream-socket.c -std=c++98 (test for excess errors)

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113150

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/112975] [14 Regression] -Wanalyzer-tainted-allocation-size false positive seen in Linux kernel's drivers/xen/privcmd.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112975

Richard Biener  changed:

   What|Removed |Added

Version|unknown |14.0
   Target Milestone|--- |14.0

[Bug target/112548] [14 regression] 5% exec time regression in 429.mcf on AMD zen4 CPU (since r14-5076-g01c18f58d37865)

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112548

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/112974] [14 Regression] -Wanalyzer-tainted-array-index false positive seen on Linux kernel drivers/platform/x86/intel/speed_select_if/isst_tpmi_core.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112974

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
Version|unknown |14.0

[Bug analyzer/111441] [14 Regression] ICE generating access diagram, in fold_binary_loc, at fold-const.cc:11580

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111441

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug c++/111075] [14 Regression] ICE on g++.dg/torture/tail-padding1.C on darwin

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111075

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/111099] [13/14 Regression] -fanalyzer -Os segmentation fault due to infinite recursion in ana::constraint_manager::eval_condition

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111099

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/110285] [13/14 Regression] -Wanalyzer-infinite-recursion false positive involving floating-point values

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110285

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/111305] [13/14 Regression] GCC Static Analyzer -Wanalyzer-out-of-bounds FP and ICE problem

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111305

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/110928] [14 Regression] ICE with -fanalyzer on -Wanalyzer-out-of-bounds checker

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110928

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug analyzer/108722] [13/14 Regression] gcc.dg/analyzer/file-CWE-1341-example.c fails on power 9 BE

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108722

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/109851] [13/14 Regression] False positive va_arg when iterating through format string with for-loop

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109851

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/109251] [13/14 Regression] -Wanalyzer-deref-before-check false positives seen in Linux kernel due to check in macros

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109251

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/109131] [13/14 Regression] -Wanalyzer-deref-before-check false positive seen in git's builtin/show-ref.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109131

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/108400] [12/13/14 Regression] -Wanalyzer-null-dereference false positive on SoftEtherVPN's src/Cedar/WebUI.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108400

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |12.4

[Bug analyzer/109014] [13/14 Regression] -Wanalyzer-use-of-uninitialized-value seen in pcre2-10.42's pcre2test.c

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109014

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug analyzer/108708] [13/14 Regression] __analyzer_dump_named_constant fails with derived values

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108708

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |13.3

[Bug debug/92387] [11/12/13 Regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387

Richard Biener  changed:

   What|Removed |Added

Summary|[11/12/13/14 regression]|[11/12/13 Regression] gcc
   |gcc generates wrong debug   |generates wrong debug
   |information at -O1 since|information at -O1 since
   |r10-1907-ga20f263ba1a76a|r10-1907-ga20f263ba1a76a
  Known to fail||13.2.1
   Keywords||needs-bisection
  Known to work||14.0

--- Comment #4 from Richard Biener  ---
Confirmed on the 13 branch.  Confirmed fixed on trunk, not sure why, would need
bisection.

[Bug debug/92387] [11/12/13/14 regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.5

[Bug tree-optimization/114164] simdclone vectorization creates unsupported IL

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114164

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from Richard Biener  ---
The instance I spotted is fixed now.

[Bug middle-end/114197] [14 regression] ICE in verify_dominators

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114197

--- Comment #5 from Richard Biener  ---
The issue is really unexpected if-conversion which ends up putting the
vector copy of an inner loop outside of the enclosing loop of a scalar
loop.  Such mishaps usually happen because of simplifications.

In this case we see bitfield lowering touching a volatile access which
it shouldn't do, even removing the volatile marking and then value-numbering
concluding the lowered d.b is 5:

Value numbering stmt = _ifc__21 = d.D.2770;
RHS d.D.2770 simplified to 5
Setting value number of _ifc__21 to 5 (changed)
Replaced d.D.2770 with 5 in all uses of _ifc__21 = d.D.2770;
Value numbering stmt = _ifc__22 = BIT_FIELD_REF <_ifc__21, 8, 0>;
Match-and-simplified BIT_FIELD_REF <_ifc__21, 8, 0> to 5
RHS BIT_FIELD_REF <_ifc__21, 8, 0> simplified to 5
Setting value number of _ifc__22 to 5 (changed)
Replaced BIT_FIELD_REF <_ifc__21, 8, 0> with 5 in all uses of _ifc__22 = BIT_FIELD_REF <_ifc__21, 8, 0>;
Value numbering stmt = _1 = _ifc__22;
Setting value number of _1 to 5 (changed)
Value numbering stmt = if (_1 == 0)
marking known outgoing edge 4 -> 5 executable
gimple_simplified to if (0 != 0)
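
For context, here is a reduced shape of the kind of source that could lead
to a dump like the one above, assuming d is a volatile struct whose 8-bit
bit-field b is initialized to 5.  The struct layout, the names and the loop
are guesses for illustration, not the testcase from the report.  Because d
is volatile, every read of d.b has to be performed at run time, so replacing
it with the initializer value is not a valid simplification.

```c
#include <assert.h>

/* Hypothetical reduction; not the testcase from the bug report.  */
struct flags {
    unsigned b : 8;
    unsigned rest : 24;
};

volatile struct flags d = { 5, 0 };

/* Counts how many of n reads of the volatile bit-field see zero.
   Each iteration must reload d.b: the object is volatile, so the
   value may differ from the initializer.  */
int count_zero_reads(int n)
{
    int zeros = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        if (d.b == 0)
            zeros++;
    return zeros;
}
```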

[Bug middle-end/114197] [14 regression] ICE in verify_dominators

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114197

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-03-04

--- Comment #4 from Richard Biener  ---
I will have a look.

[Bug tree-optimization/114203] [13 Regression] Miscompilation: A possible miscompilation in GCC 13 and 14 with option -Os

2024-03-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114203

Richard Biener  changed:

   What|Removed |Added

  Known to work||14.0
Summary|[13/14 Regression]  |[13 Regression]
   |Miscompilation: A possible  |Miscompilation: A possible
   |miscompilation in GCC 13|miscompilation in GCC 13
   |and 14 with option -Os  |and 14 with option -Os
  Known to fail||13.2.0
   Priority|P3  |P2
