[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #37 from rguenther at suse dot de  ---
On Tue, 17 Sep 2024, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> 
> --- Comment #36 from Aldy Hernandez  ---
> (In reply to Richard Biener from comment #33)
> > Can we just sort m_paths after the path entry BB and fix the lookup that 
> > way?
> 
> This seemed promising, especially because the adjust_paths_after_duplication()
> gets called from the backwards threader in order, and always starting at
> position 0, so searching for the first item would be super fast.  However,
> this function will rewire the edges in the path (rewire_first_differing_edge),
> thus messing up the sort.

Bah.  Remove those "fixed" candidates from the set (dropping 2nd-level
opportunities that way)?  Alternatively sort m_paths into a better
data structure (IIRC we only have splay-tree as a binary tree) for
the purpose of the lookup and re-place fixed-up paths (but keep
m_paths for the purpose of processing things).
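
A rough sketch of the lookup-structure idea (illustrative C++ stand-ins,
not the actual back_jt_path_registry types): keep an ordered multimap
keyed by the path-entry BB and re-key an entry whenever rewiring changes
its entry block, while m_paths keeps the processing order.

#include <map>
#include <vector>

struct jt_path_stub { int entry_bb; /* edges ...  */ };

std::vector<jt_path_stub *> m_paths;            /* processing order */
std::multimap<int, jt_path_stub *> m_by_entry;  /* O(log n) lookup  */

/* Re-place P after its entry BB changed, e.g. after
   rewire_first_differing_edge.  */
void
rekey (jt_path_stub *p, int new_entry_bb)
{
  auto range = m_by_entry.equal_range (p->entry_bb);
  for (auto it = range.first; it != range.second; ++it)
    if (it->second == p)
      {
        m_by_entry.erase (it);
        break;
      }
  p->entry_bb = new_entry_bb;
  m_by_entry.emplace (new_entry_bb, p);
}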

[Bug tree-optimization/116352] [15 regression] ICE when building opencv-4.9.0 (error: definition in block 208 does not dominate use in block 188) since r15-2820-gab18785840d7b8

2024-09-12 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116352

--- Comment #14 from rguenther at suse dot de  ---
On Wed, 11 Sep 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116352
> 
> Andrew Pinski  changed:
> 
>What|Removed |Added
> 
>Keywords||needs-bisection
> 
> --- Comment #13 from Andrew Pinski  ---
> (In reply to Konstantinos Eleftheriou from comment #12)
> > How can I reproduce this on aarch64? I tried using the code in comment 3,
> > using `-O3 -fno-vect-cost-model` with SVE disabled as Andrew mentioned and
> > GCC on commit 7a970bd03f1d8eed7703db8a8db3c753ea68899f, but couldn't manage
> > to reproduce it.
> 
> Looks like it was fixed ...

I don't think so but it might have become latent.

[Bug middle-end/114855] ICE: Segfault when compiling large autogenerated C source file

2024-09-04 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #30 from rguenther at suse dot de  ---
On Wed, 4 Sep 2024, amacleod at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855
> 
> --- Comment #29 from Andrew Macleod  ---
> Huh. Do we calculate *all* paths ahead of time?

Yes, we first identify all profitable threading candidates in a function
and then perform the threading.

> I tried running valgrind, which died, but up to that point it showed 77% of
> the time spent in the body of
> back_jt_path_registry::adjust_paths_after_duplication ()

Hmm, I can have a look there if you don't beat me to it (though the
profiles didn't show this part, AFAIR).

[Bug tree-optimization/116460] [14 Regression] LTO ICE with -g during GIMPLE pass: forwprop

2024-09-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116460

--- Comment #26 from rguenther at suse dot de  ---
On Tue, 3 Sep 2024, ales.astone at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116460
> 
> --- Comment #25 from Alessandro Astone  ---
> Thanks!
> 
> What's the process for getting this to the stable branch?

It will be picked up after some soaking and when time permits.

[Bug tree-optimization/36010] Loop interchange not performed

2024-09-02 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36010

--- Comment #8 from rguenther at suse dot de  ---
On Mon, 2 Sep 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36010
> 
> Tamar Christina  changed:
> 
>What|Removed |Added
> 
>  CC||tnfchris at gcc dot gnu.org
> 
> --- Comment #7 from Tamar Christina  ---
> It looks like today at -Ofast this is due to the full unrolling
> 
> https://godbolt.org/z/3K4hPbWfG
> 
> i.e. at -Ofast we fail due to the inner loop being fully unrolled.
> 
> Would it make sense to perform loop distribution before cunrolli?
> 
> in principle it should make any potential vect and SLP simpler no?

cunrolli is to remove abstraction penalty and thus has to be done early.
The option is to integrate that with value-numbering somehow to allow
the "abstraction" to remain and still allow followup optimization.

[Bug tree-optimization/116573] [15 Regression] Recent SLP work appears to generate significantly worse code on RISC-V

2024-09-02 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573

--- Comment #1 from rguenther at suse dot de  ---
> On 02.09.2024 at 18:48, law at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116573
> 
> Jeffrey A. Law  changed:
> 
>   What|Removed |Added
> 
> CC||rguenth at gcc dot gnu.org

Can you explain the setvli difference?  I think some of the folks adding the
with-Len-and-mask IV support have to look, it’s probably going wrong there for
unknown reasons.



[Bug lto/116535] LTO partitioning vs. offloading compilation

2024-09-02 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535

--- Comment #5 from rguenther at suse dot de  ---
On Mon, 2 Sep 2024, burnus at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116535
> 
> --- Comment #4 from Tobias Burnus  ---
> (In reply to Richard Biener from comment #3)
> > The forked processes may not write to any "global" data because forking
> > makes that data not global ... instead any such "global" data has to be
> > computed before forking.
> 
> Well, the data is "computed" before - just a means to propagate the
> information whether the offload tables should be written or not is
> missing. The current call stack is:
> 
> With WPA, the code is run as follows:
> 
>  output_offload_tables
>   stream_out_partitions_1
>stream_out
> ipa_write_optimization_summaries
>  write_lto
>   lto_output
>output_offload_tables
> 
> * * *
> 
> I see two possibilities, the quick one:
> 
> (A)
> 
> --- a/gcc/lto-streamer.h
> +++ b/gcc/lto-streamer.h
> @@ -463,0 +464 @@ struct lto_symtab_encoder_d
> +  bool output_offload_tables;
> 
> setting it - and propagating it through.

Or set a global flag - all forked processes should see that.
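
A minimal illustration of why that works (the flag name is made up): each
forked child gets a snapshot of the parent's globals as of fork time.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int offload_tables_p;   /* hypothetical flag */

int
main (void)
{
  offload_tables_p = 1;        /* computed in the parent before forking */
  if (fork () == 0)
    {
      printf ("child sees %d\n", offload_tables_p);  /* prints 1 */
      return 0;
    }
  wait (NULL);
  return 0;
}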

> (B)
> Change how we output the tables. Currently, those bypass normal WPA and IPA
> handling. Making them visible should avoid the 'force_output = 1' for all
> table entries (and only require it for the tables themselves) - and presumably
> fix this issue as well as fixing PR 95622.

Yeah, but this sounds like a lot of work and we might want to fix
this on branches.

[Bug middle-end/116510] [15 Regression] ice in decompose, at wide-int.h:1049

2024-08-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116510

--- Comment #8 from rguenther at suse dot de  ---
On Wed, 28 Aug 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116510
> 
> --- Comment #6 from Andrew Pinski  ---
> #6  0x0187822c in gimple_simplify_226 (res_op=0x7fffcd00,
> seq=0x7fffced0, valueize=0xd13880 ,
> type=0x7741cb28, captures=0x7fffa160, cmp=EQ_EXPR) at
> gimple-match-9.cc:1994
> 1994  if (wi::bit_and_not (wi::to_wide (captures[1]), get_nonzero_bits
> (captures[0])) != 0
> (gdb) p debug_tree(captures[1])
>  
> constant 92>
> $1 = void
> (gdb) p debug_tree(captures[0])
>   type  size 
> unit-size 
> align:32 warn_if_not_align:0 symtab:0 alias-set 1 canonical-type
> 0x7741c5e8 precision:32 min  max
> 
> pointer_to_this >
> visited
> def_stmt _8 = gg_strescape_i.2_11 & 255;
> version:8
> ptr-info 0x7733ddb0>
> 
> ...
> (gdb) p debug_tree(switch_cond)
>   type  size 
> unit-size 
> align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
> 0x7741cb28 precision:1 min  max  0x7741f2a0 1>>
> 
> arg:0  _Bool>
> 
> arg:0 
> 
> arg:0  int>
> visited
> def_stmt _8 = gg_strescape_i.2_11 & 255;
> version:8
> ptr-info 0x7733ddb0>
> arg:1 
> t5.c:6:5 start: t5.c:6:5 finish: t5.c:6:10>
> arg:1 
> arg:0 
> arg:1 
> t5.c:6:5 start: t5.c:6:5 finish: t5.c:6:10>
> t5.c:6:5 start: t5.c:6:5 finish: t5.c:6:10>
> t5.c:6:5 start: t5.c:6:5 finish: t5.c:6:10>
> 
> (gdb) p debug_tree((tree)0x7730fb40)
>  
> constant 92>
> 
> 
> CASE_HIGH/CASE_LOW have a type of `unsigned char` but index is type of `int`.
> 
> 
> There is a missing fold_convert in the predicate_bbs on CASE_HIGH/CASE_LOW.
> 
> Note one more thing:
> + tree low = build2_loc (loc, GE_EXPR,
> +boolean_type_node,
> +index, CASE_LOW (label));
> + tree high = build2_loc (loc, LE_EXPR,
> + boolean_type_node,
> + index, CASE_HIGH (label));
> + case_cond = build2_loc (loc, TRUTH_AND_EXPR,
> + boolean_type_node,
> + low, high);
> 
> Why use TRUTH_AND_EXPR here, just use AND_EXPR .
> 
> Likewise, use TRUTH_NOT_EXPR.
> 
> + switch_cond = build1_loc (loc, TRUTH_NOT_EXPR, boolean_type_node,
> +   unshare_expr (switch_cond));
> 
> Also why not do something like what is done in convert_single_case_switch ?

Note this is what ifcvt requires internally, so I'd rather not change
this just in this isolated place.
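
For reference, a hedged sketch of the fold_convert Andrew is asking for,
assuming the surrounding code matches the snippet quoted above: bring
CASE_LOW/CASE_HIGH to the index type before building the range test.

  tree idx_type = TREE_TYPE (index);
  tree low = build2_loc (loc, GE_EXPR, boolean_type_node, index,
                         fold_convert_loc (loc, idx_type, CASE_LOW (label)));
  tree high = build2_loc (loc, LE_EXPR, boolean_type_node, index,
                          fold_convert_loc (loc, idx_type, CASE_HIGH (label)));
  case_cond = build2_loc (loc, TRUTH_AND_EXPR, boolean_type_node, low, high);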

[Bug tree-optimization/116348] [15 regression] ICE when building gavl-1.4.0 with -O3 -march=znver4 since r15-2791-g2083389a18d366

2024-08-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116348

--- Comment #11 from rguenther at suse dot de  ---
On Thu, 22 Aug 2024, xry111 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116348
> 
> --- Comment #10 from Xi Ruoyao  ---
> I've tested the change and it fixes PR116314 case as well.
> 
> Richard: do you want me to send your change as a patch like before (the
> PR116142 fix)?

Yes please.

[Bug tree-optimization/101390] Expand vector mod as vector div + multiply-subtract

2024-08-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390

--- Comment #8 from rguenther at suse dot de  ---
On Wed, 7 Aug 2024, jschmitz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101390
> 
> --- Comment #7 from Jennifer Schmitz  ---
> Thank you for the reply. Seems like I have been looking in the right places.
> I'm a new member of the GCC community, so I'm still getting familiar with many
> parts of the code base. I have been trying to find out where the related
> case with the division operator is implemented, as this seems a natural
> place to also implement the modulo operator. This does not seem to happen in
> vect_recog_divmod_pattern. Do you still think vect_recog_divmod_pattern is the
> right location to implement this or can you point me to the implementation of
> the same case with division?

Division by non-constant is not emulated in any way, so there is nothing
to find.  I would suggest you add a vect_recog_mod_var_pattern
function implementing modulo by non-constant via division and 
multiply/sub.
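
In scalar form the rewrite such a pattern would perform is simply the
following (the vectorizer would emit the vector division, multiply and
subtract):

int
mod_var (int a, int b)
{
  int q = a / b;
  return a - q * b;   /* == a % b whenever a / b is defined */
}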

[Bug tree-optimization/115278] [13 Regression] -ftree-vectorize optimizes away volatile write on x86_64 since r13-3219

2024-08-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115278

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 6 Aug 2024, amylaar at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115278
> 
> Jorn Wolfgang Rennecke  changed:
> 
>What|Removed |Added
> 
>  CC||amylaar at gcc dot gnu.org
> 
> --- Comment #12 from Jorn Wolfgang Rennecke  ---
> On a newlib based toolchain with gcc 14.2, the test fails:
> 
> Excess errors:
> /home/amylaar/riscy-ip/gcc14/gcc/trunk/gcc/gcc/testsuite/g++.dg/vect/pr115278.cc:24:28:
> error: invalid conversion from 'volatile unsigned int*' to 'volatile
> uint32_t*' {aka 'volatile long unsigned int*'} [-fpermissive]
> 
> g++.dg/vect/pr115278.cc  -std=c++14 : dump file does not exist
> 
> excerpt from the preprocessed file:
> 
> # 37
> "/home/amylaar/riscy-ip/gcc14/gcc/trunk/gcc/newlib/libc/include/machine/_default_types.h"
> 3 4
> extern "C" {
> ...
> # 77
> "/home/amylaar/riscy-ip/gcc14/gcc/trunk/gcc/newlib/libc/include/machine/_default_types.h"
> 3 4
> typedef long int __int32_t;
> 
> typedef long unsigned int __uint32_t;
> ...
> # 15
> "/home/amylaar/riscy-ip/gcc14/gcc/trunk/gcc/newlib/libc/include/sys/_stdint.h"
> 3 4
> extern "C" {
> ...
> typedef __uint32_t uint32_t ;
> 
> 
> I tried to add -fpermissive, but that still gives a warning that ends up as
> excess errors. 
> There doesn't seem to be any specific option to suppress this warning, just -w.
> 
> So, should we add:
> // { dg-additional-options "-fpermissive -w"  { target newlib } }
> 
> 
> or should we do something like:
> 
> Index: g++.dg/vect/pr115278.cc
> ===================================================================
> --- g++.dg/vect/pr115278.cc (revision 6673)
> +++ g++.dg/vect/pr115278.cc (working copy)
> @@ -1,6 +1,7 @@
>  // { dg-do compile }
>  // { dg-require-effective-target c++11 }
>  // { dg-additional-options "-fdump-tree-optimized" }
> +// { dg-additional-options "-fpermissive"  { target newlib } }
> 
>  #include <cstdint>
> 
> @@ -21,7 +22,7 @@
>  BitfieldStructUnion(uint32_t value_low, uint32_t value_high) :
> value_low(value_low), value_high(value_high) {}
>  };
> 
> -volatile uint32_t *WRITE = (volatile unsigned*)0x42;
> +volatile uint32_t *WRITE = (volatile unsigned*)0x42; // { dg-warning "invalid
> conversion" { target newlib } }

Why not just write (volatile uint32_t *)?

OK with just that.

> 
>  void buggy() {
>  for (int i = 0; i < runs; i++) {

[Bug tree-optimization/116142] vec_widen_smult_{odd,even}_M ineffective for a simple widening dot product (vect_used_by_reduction is not set?)

2024-08-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142

--- Comment #12 from rguenther at suse dot de  ---
On Tue, 6 Aug 2024, xry111 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142
> 
> --- Comment #11 from Xi Ruoyao  ---
> (In reply to rguent...@suse.de from comment #10)
> > On Thu, 1 Aug 2024, xry111 at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142
> > > 
> > > --- Comment #9 from Xi Ruoyao  ---
> > > In the following call stack:
> > > 
> > > #0  single_imm_use (var=0x7564a3a8, use_p=0x7fffc9d8, 
> > > stmt=0x7fffc9e0) at ../../gcc/gcc/ssa-iterators.h:427
> > > #1  0x01aabcde in vec_info::lookup_single_use (this=0x33a1b80, 
> > > lhs=0x7564a3a8) at ../../gcc/gcc/tree-vectorizer.cc:581
> > > #2  0x01a1a40a in supportable_widening_operation 
> > > (vinfo=0x33a1b80, 
> > > code=..., stmt_info=0x3482b50, vectype_out=0x7595a1f8, 
> > > vectype_in=0x7595a000, code1=0x7fffcb90, 
> > > code2=0x7fffcb94, 
> > > multi_step_cvt=0x7fffcba0, interm_types=0x7fffcc48)
> > > at ../../gcc/gcc/tree-vect-stmts.cc:14196
> > > #3  0x019f7c2d in vectorizable_conversion (vinfo=0x33a1b80, 
> > > stmt_info=0x3482b50, gsi=0x0, vec_stmt=0x0, slp_node=0x33b4cc8, 
> > > cost_vec=0x7fffd190) at ../../gcc/gcc/tree-vect-stmts.cc:5438
> > > #4  0x01a16ecc in vect_analyze_stmt (vinfo=0x33a1b80, 
> > > stmt_info=0x3482b50, need_to_vectorize=0x7fffcf73, 
> > > node=0x33b4cc8, 
> > > node_instance=0x32fb6e0, cost_vec=0x7fffd190)
> > > at ../../gcc/gcc/tree-vect-stmts.cc:13291
> > > 
> > > "ptr == ptr->next" is satisfied (the comment says it means "there aren't 
> > > any
> > > uses whatsoever" but I don't really understand it: I knows almost nothing 
> > > about
> > > tree vectorization) and single_imm_use directly returns false.
> > 
> > So likely 
> > 
> >   tree lhs = gimple_assign_lhs (stmt_info->stmt);
> >   stmt_vec_info use_stmt_info = loop_info->lookup_single_use 
> > (lhs);
> >   if (use_stmt_info
> > 
> > needs to be
> > 
> >   tree lhs = gimple_assign_lhs (vect_orig_stmt (stmt_info)->stmt);
> >   stmt_vec_info use_stmt_info = loop_info->lookup_single_use 
> > (lhs);
> 
> Works for me!

Can you produce, test and post a patch?

[Bug tree-optimization/116142] vec_widen_smult_{odd,even}_M ineffective for a simple widening dot product (vect_used_by_reduction is not set?)

2024-08-02 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142

--- Comment #10 from rguenther at suse dot de  ---
On Thu, 1 Aug 2024, xry111 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116142
> 
> --- Comment #9 from Xi Ruoyao  ---
> In the following call stack:
> 
> #0  single_imm_use (var=0x7564a3a8, use_p=0x7fffc9d8, 
> stmt=0x7fffc9e0) at ../../gcc/gcc/ssa-iterators.h:427
> #1  0x01aabcde in vec_info::lookup_single_use (this=0x33a1b80, 
> lhs=0x7564a3a8) at ../../gcc/gcc/tree-vectorizer.cc:581
> #2  0x01a1a40a in supportable_widening_operation (vinfo=0x33a1b80, 
> code=..., stmt_info=0x3482b50, vectype_out=0x7595a1f8, 
> vectype_in=0x7595a000, code1=0x7fffcb90, code2=0x7fffcb94, 
> multi_step_cvt=0x7fffcba0, interm_types=0x7fffcc48)
> at ../../gcc/gcc/tree-vect-stmts.cc:14196
> #3  0x019f7c2d in vectorizable_conversion (vinfo=0x33a1b80, 
> stmt_info=0x3482b50, gsi=0x0, vec_stmt=0x0, slp_node=0x33b4cc8, 
> cost_vec=0x7fffd190) at ../../gcc/gcc/tree-vect-stmts.cc:5438
> #4  0x01a16ecc in vect_analyze_stmt (vinfo=0x33a1b80, 
> stmt_info=0x3482b50, need_to_vectorize=0x7fffcf73, node=0x33b4cc8, 
> node_instance=0x32fb6e0, cost_vec=0x7fffd190)
> at ../../gcc/gcc/tree-vect-stmts.cc:13291
> 
> "ptr == ptr->next" is satisfied (the comment says it means "there aren't any
> uses whatsoever" but I don't really understand it: I know almost nothing
> about tree vectorization) and single_imm_use directly returns false.

So likely 

  tree lhs = gimple_assign_lhs (stmt_info->stmt);
  stmt_vec_info use_stmt_info = loop_info->lookup_single_use 
(lhs);
  if (use_stmt_info

needs to be

  tree lhs = gimple_assign_lhs (vect_orig_stmt (stmt_info)->stmt);
  stmt_vec_info use_stmt_info = loop_info->lookup_single_use 
(lhs);

[Bug c/116130] Implement C23 N2956 paper - [[unsequenced]] and [[reproducible]] function type arguments

2024-07-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116130

--- Comment #10 from rguenther at suse dot de  ---
On Tue, 30 Jul 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116130
> 
> --- Comment #9 from Jakub Jelinek  ---
> (In reply to rguent...@suse.de from comment #8)
> > IMO this asks for metadata on each (standard) attribute whether it
> > merges conservatively by union or by intersection since there are
> > two classes of attributes.
> 
> Well, doing so would violate the standard, which added a clear:
> "If one of the types has a standard attribute, the composite type also has 
> that
> attribute."

We do violate the standard in other places - IMO QOI/safety should be
a priority.  I view the above as a defect.

I'd at least emit an unconditional diagnostic when doing this kind
of merging in any place (also for GCC type/decl attributes).

> > And then there's things like [[gnu::warn_unused_result]] where
> > it's not at all clear what is "conservative" (drop valid diagnostics
> > or introduce broken ones?).
> 
> gnu::warn_unused_result is not a standard attribute (but currently we merge
> all of them), but more importantly it isn't a function type attribute, but a
> function decl attribute.  Those aren't merged.

Huh, that's an odd distinction.  I hope we document this somewhere
in a way understandable by users since IIRC it's not possible to
distinguish those syntactically when placed in a function declaration?
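
A hedged illustration of the syntactic point (per the N2956 paper under
discussion; placement semantics simplified): both attributes below sit in
the same position, yet one appertains to the function type and the other
to the declaration.

[[unsequenced]] int f (int);              /* C23: function type attribute */
[[gnu::warn_unused_result]] int g (int);  /* GNU: function decl attribute */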

[Bug c/116130] Implement C23 N2956 paper - [[unsequenced]] and [[reproducible]] function type arguments

2024-07-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116130

--- Comment #8 from rguenther at suse dot de  ---
On Mon, 29 Jul 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116130
> 
> --- Comment #7 from Jakub Jelinek  ---
> Seems we actually implement it that way already, or at least sometimes.
> build_conditional_expr has
>   /* Quickly detect the usual case where op1 and op2 have the same type
>  after promotion.  */
>   if (TYPE_MAIN_VARIANT (type1) == TYPE_MAIN_VARIANT (type2))
> {
>   if (type1 == type2)
> result_type = type1;
>   else
> result_type = TYPE_MAIN_VARIANT (type1);
> }
> which would do the wrong thing and
>   else if (code1 == POINTER_TYPE && code2 == POINTER_TYPE)
> {
> ...
>   if (comp_target_types (colon_loc, type1, type2))
> result_type = common_pointer_type (type1, type2);
> which would merge them.
> All I want to say is that there are big chances of this breaking real-world
> code if somebody decides to add the new standard attributes to decls.

IMO this asks for metadata on each (standard) attribute whether it
merges conservatively by union or by intersection since there are
two classes of attributes.

And then there's things like [[gnu::warn_unused_result]] where
it's not at all clear what is "conservative" (drop valid diagnostics
or introduce broken ones?).

Possibly a program with ?: on types with differing attributes should
be ill-formed and instead allow

  ?[[...]]x:[[...]]y

to fixup?  Is there a way to "drop" an attribute, like cast it
away? [[no-...]] or [[!...]]?

[Bug tree-optimization/116098] [14/15 Regression] _Bool value from tagged union is incorrect when built with optimization since r14-1597-g64d90d06d2db43

2024-07-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116098

--- Comment #9 from rguenther at suse dot de  ---
> On 27.07.2024 at 02:38, pinskia at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116098
> 
> --- Comment #8 from Andrew Pinski  ---
> (In reply to Richard Biener from comment #7)
>> Ah, it's the code that makes SRA widen bit-precision values to
>> size-precision integers.  That said, I don't see why the V_C_E should not be
>> moveable.
>> 
>>  value$8_11 = MEM  [(struct Value *)&s_value + 8B];
>>  _8 = VIEW_CONVERT_EXPR<_Bool>(value$8_11);
>> 
>> is exactly what RTL expansion will create - we're not truncating the
>> result to bit-precision on _Bool loads.
> 
> The problem is the bits outside of bool are set and we don't truncate them out
> and since we turn it into:
> 
>  _6 = VIEW_CONVERT_EXPR<_Bool>(value$8_8);
>  _12 = value_7 != 0;
>  _13 = _6 | _12;
> 
> We get the wrong answer. This is very similar to if _6 had been an
> uninitialized which we fixed during the development of GCC 14.
> That is the range of _6 (or _8) is only conditionally `[0,1]`, otherwise it
> should be considered similar to an uninitialized variable.
> 
> The other thing which we might be able to solve this is change:
> ```
> /* For integral conversions with the same precision or pointer
>   conversions use a NOP_EXPR instead.  */
> (simplify
>  (view_convert @0)
>  (if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
>   && (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE (@0)))
>   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
>   (convert @0)))
> ```
> 
> to change bool convert to be convert too. In a similar way how we handle
> `zero_one != 0` into a convert instead of a VCE (in VRP and other passes).
> This will fix the above too.
> 
> Maybe I will test that to see what code generation will look like.

That’s a possibility though it will affect all uses and cause a lot of extra
truncation I fear.
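
In scalar terms the trade-off is (hedged illustration): a
VIEW_CONVERT_EXPR reinterprets the bits while a conversion truncates to a
valid _Bool.

unsigned char x = 2;              /* bit 1 set, bit 0 clear */
_Bool vce;
__builtin_memcpy (&vce, &x, 1);   /* reinterpret: invalid _Bool value 2 */
_Bool cvt = (_Bool) x;            /* convert: truncates/normalizes to 1 */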

Richard 


[Bug middle-end/116065] [13/14/15 Regression] Function attribute optimize() might make ISA target attribute broken

2024-07-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116065

--- Comment #8 from rguenther at suse dot de  ---
On Thu, 25 Jul 2024, hongyuw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116065
> 
> --- Comment #7 from Hongyu Wang  ---
> (In reply to Andrew Pinski from comment #6)
> > (In reply to Andrew Pinski from comment #5)
> > > then if that is the case then aarch64 started with 
> > > r14-6290-g9f0f7d802482a8
> > > (which added OPT_mearly_ra_ to aarch_option_optimization_table).
> > > 
> > > What happens if you mark -munroll-only-small-loops as Optimization ?
> > > if that works, then aarch64 fix is to mark -mearly-ra= as Optimization 
> > > too.
> > 
> > Yes this fixes aarch64 testcase:
> > ```
> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> > index 2f90f10352a..6229bcb371e 100644
> > --- a/gcc/config/aarch64/aarch64.opt
> > +++ b/gcc/config/aarch64/aarch64.opt
> > @@ -256,7 +256,7 @@ EnumValue
> >  Enum(early_ra_scope) String(none) Value(AARCH64_EARLY_RA_NONE)
> > 
> >  mearly-ra=
> > -Target RejectNegative Joined Enum(early_ra_scope) Var(aarch64_early_ra)
> > Init(AARCH64_EARLY_RA_NONE) Save
> > +Target RejectNegative Joined Enum(early_ra_scope) Var(aarch64_early_ra)
> > Init(AARCH64_EARLY_RA_NONE) Optimization
> >  Specify when to enable an early register allocation pass.  The 
> > possibilities
> >  are: all functions, functions that have access to strided multi-register
> >  instructions, and no functions.
> > 
> > ```
> > 
> > So yes adding Optimization to -munroll-only-small-loops should fix that too.
> 
> Confirmed, appending Optimization fixed this issue on x86.
> I'm quite confused by how the unmarked Optimization seems to reset the flags:
> the target attribute was overridden and the error reports an ISA flag
> mismatch.

It's because the change in optimization level enables a target flag,
and as the target doesn't allow a difference in that flag, inlining fails.
When marked as 'Optimization' it's not considered a "target" flag.

[Bug middle-end/45215] Tree-optimization misses a trick with bit tests

2024-07-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45215

--- Comment #4 from rguenther at suse dot de  ---
On Fri, 19 Jul 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45215
> 
> Andrew Pinski  changed:
> 
>What|Removed |Added
> 
>  CC||pinskia at gcc dot gnu.org
>   Component|tree-optimization   |middle-end
>  Status|NEW |ASSIGNED
> 
> --- Comment #3 from Andrew Pinski  ---
>   _1 = t_3(D) & 256;
>   if (_1 != 0)
> goto ; [1.04%]
>   else
> goto ; [98.96%]
> 
>[local count: 1062574912]:
> 
>[local count: 1073741824]:
>   # _2 = PHI <-26(2), 0(3)>
> 
> 
> So the trick here is that 256 is `0x1<<8` so we want to shift that bit up to
> the sign bit and then arthimetic shift down to get 0/-1 and then and with -26.

So .BIT_SPLAT (t_3(D), 8) & -26; there's nothing special in x86 to help
.BIT_SPLAT though, and back-to-back shifts might be throughput constrained.
I think x86 can do a -1 vs. 0 set from the flags of the AND though.

I'm not sure whether two shifts and an AND are a good way to recover an
optimal non-branch insn sequence later?
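
For concreteness, a well-defined branchless scalar form of the PHI above
(one way to spell the .BIT_SPLAT idea, going through unsigned to avoid
shifting into the sign bit):

int
splat_and (unsigned t)
{
  int mask = -(int) ((t >> 8) & 1);   /* 0 if bit 8 clear, -1 if set */
  return mask & -26;                  /* matches the 0 / -26 PHI */
}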

[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-07-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 17 Jul 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966
> 
> --- Comment #5 from Hongtao Liu  ---
> I saw pass_eras optimize BIT_FIELD_REF of big memory into load from small
> memory
> 
> 
> Created a replacement for D.161366 offset: 0, size: 64: SR.20D.170101
> Created a replacement for D.161366 offset: 64, size: 64: SR.21D.170102
> Created a replacement for D.161366 offset: 128, size: 64: SR.22D.170103
> Created a replacement for D.161547 offset: 0, size: 256: SR.23D.170104
> 
> 
>   _8 = BIT_FIELD_REF  *)&D.159286].D.158970._M_data, 64, 0>;
> _9 = BIT_FIELD_REF  *)&D.159286].D.158970._M_data, 64, 64>;
> _10 = BIT_FIELD_REF  *)&D.159286].D.158970._M_data, 64, 128>;
>   _11 = {0, _8, _9, _10};
> 
> to 
> 
>   SR.20_3 = MEM  [(struct simd *)&data];
>   SR.21_13 = MEM  [(struct simd *)&data + 8B];
>   SR.22_14 = MEM  [(struct simd *)&data + 16B];
>   _7 = SR.20_3;
>   _8 = SR.21_13;
>   _9 = SR.22_14;
>   _10 = {0, _7, _8, _9};
> 
> 
> So I guess for the latter GCC somehow can't be sure the whole 256-bit memory
> is valid and fails to optimize it with vec_perm_expr?

I think the above would be a candidate for SLP vectorization of the
vector CTOR.  A specific example we don't handle right now, of course.
Or alternatively handle it by instruction combination in
simplify_vector_constructor.

[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-07-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #11 from rguenther at suse dot de  ---
On Wed, 17 Jul 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
> 
> --- Comment #10 from Matthias Kretz (Vir)  ---
> (In reply to Richard Biener from comment #9)
> > One issue with
> > 
> > V load3(const unsigned long* ptr)
> > {
> >   V ret = {};
> >   __builtin_memcpy(&ret, ptr, 3 * sizeof(unsigned long));
> > 
> > is that we cannot load a vector worth of data from ptr because that might
> > trap
> 
> Unless the target has a masked load instruction (e.g. AVX512) or ptr is known
> to be aligned to at least 16 Bytes (in which case we know there cannot be a
> page boundary at ptr + 24 Bytes). No? In this specific example, ptr is
> pointing to a 32-Byte vector object.

Sure but here we have no alignment info available (at most 8 byte 
alignment from the pointer type).  I don't think introducing a .MASK_LOAD
for the purpose of eliding a memcpy is a good thing to do (locally,
just taking into account the memcpy on its own).

> The library can do this and it makes a difference:
> 
> if (__builtin_object_size(ptr, 0) >= 4 * sizeof(T))
>   __builtin_memcpy(&ret, ptr, 4 * sizeof(T));
> else
>   __builtin_memcpy(&ret, ptr, 3 * sizeof(T));

I see, but that's then of course after inlining.

In my former C++ times I've used template metaprogramming to implement
this as an unrolled element-by-element copy (emitting a loop would
be possible as well, of course).
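
A sketch of that element-by-element copy in current C++ (an
index_sequence instead of classic metaprogramming; names are
illustrative):

#include <cstddef>
#include <utility>

template <typename T, std::size_t... I>
void copy_elems (T *dst, const T *src, std::index_sequence<I...>)
{
  ((dst[I] = src[I]), ...);   /* fully unrolled, touches exactly N elements */
}

template <std::size_t N, typename T>
void load_n (T *dst, const T *src)   /* e.g. load_n<3> (ret_data, ptr) */
{
  copy_elems (dst, src, std::make_index_sequence<N> ());
}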

[Bug target/114189] Target implements obsolete vcond{,u,eq} expanders

2024-07-12 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114189

--- Comment #9 from rguenther at suse dot de  ---
On Fri, 12 Jul 2024, ebotcazou at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114189
> 
> Eric Botcazou  changed:
> 
>What|Removed |Added
> 
>  CC||ebotcazou at gcc dot gnu.org
> 
> --- Comment #8 from Eric Botcazou  ---
> There are 2 entries for vcond_mask in the internal manual:
> 
> 'vcond_mask_MN'
>  Similar to 'vcondMN' but operand 3 holds a pre-computed result of
>  vector comparison.
> 
> 'vcond_mask_MN'

This should be vcond_mask_len_MN I think.

>  Set each element of operand 0 to the corresponding element of
>  operand 2 or operand 3.  Choose operand 2 if both the element index
>  is less than operand 4 plus operand 5 and the corresponding element
>  of operand 1 is nonzero:
> 
>   for (i = 0; i < GET_MODE_NUNITS (M); i++)
> op0[i] = i < op4 + op5 && op1[i] ? op2[i] : op3[i];
> 
>  Operands 0, 2 and 3 have mode M.  Operand 1 has mode N.  Operands 4
>  and 5 have a target-dependent scalar integer mode

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-01 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

--- Comment #5 from rguenther at suse dot de  ---
> On 01.07.2024 at 12:10, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629
> 
> --- Comment #4 from Tamar Christina  ---
> (In reply to Richard Biener from comment #3)
>> So we now tail-merge the two b[i] loading blocks.  Can you check SVE
>> code-gen with this?  If that fixes the PR consider adding a SVE testcase.
> 
> Thanks, the codegen is much better now, but shows some other missing mask
> tracking in the vectorizer.
> 
> Atm we generate:
> 
> .L3:
>ld1wz31.s, p6/z, [x0, x6, lsl 2] <-- load a
>cmpeq   p7.s, p6/z, z31.s, #0<-- a == 0, !a
>ld1wz0.s, p7/z, [x2, x6, lsl 2]  <-- load c conditionally on !a
>cmpeq   p7.s, p7/z, z0.s, #0 <-- !a && !c
>orr z0.d, z31.d, z0.d<-- a || c
>ld1wz29.s, p7/z, [x3, x6, lsl 2] <--- load d where !a && !c
>cmpne   p5.s, p6/z, z0.s, #0 <--- (a || c) & loop_mask
>and p7.b, p6/z, p7.b, p7.b   <--- ((!a && !c) && (!a && !c)) &
> loop_mask
>ld1wz30.s, p5/z, [x1, x6, lsl 2] <-- load b conditionally on (a ||
> c)
>sel z30.s, p7, z29.s, z30.s  <-- select (!a && !c, d, b)
>st1wz30.s, p6, [x4, x6, lsl 2]
>add x6, x6, x7
>whilelo p6.s, w6, w5
>b.any   .L3
> 
> which corresponds to:
> 
>  # loop_mask_63 = PHI 
>  vect__4.10_64 = .MASK_LOAD (vectp_a.8_53, 32B, loop_mask_63);
>  mask__31.11_66 = vect__4.10_64 != { 0, ... };
>  mask__56.12_67 = ~mask__31.11_66;
>  vec_mask_and_70 = mask__56.12_67 & loop_mask_63;
>  vect__7.15_71 = .MASK_LOAD (vectp_c.13_68, 32B, vec_mask_and_70);
>  mask__22.16_73 = vect__7.15_71 == { 0, ... };
>  mask__34.17_75 = vec_mask_and_70 & mask__22.16_73;
>  vect_iftmp.20_78 = .MASK_LOAD (vectp_d.18_76, 32B, mask__34.17_75);
>  vect__61.21_79 = vect__4.10_64 | vect__7.15_71;
>  mask__35.22_81 = vect__61.21_79 != { 0, ... };
>  vec_mask_and_84 = mask__35.22_81 & loop_mask_63;
>  vect_iftmp.25_85 = .MASK_LOAD (vectp_b.23_82, 32B, vec_mask_and_84);
>  _86 = mask__34.17_75 & loop_mask_63;
>  vect_iftmp.26_87 = VEC_COND_EXPR <_86, vect_iftmp.20_78, vect_iftmp.25_85>;
>  .MASK_STORE (vectp_res.27_88, 32B, loop_mask_63, vect_iftmp.26_87);
> 
> it looks like what's missing is that the mask tracking doesn't know that other
> masked operations naturally perform an AND when combined.  We do some of this
> in the backend but I feel like it may be better to do it in the vectorizer.
> 
> In this case, the second load is conditional on the first load mask,  which
> means it's already done an AND.
> And crucially inverting it means you also inverted both conditions.
> 
> So there are some superflous masking operations happening.  But I guess that's
> a separate bug.  Shall I just add some tests here and close it and open a new
> PR?

Not sure if that helps - do we fully understand this is a separate issue and
not related to how we if-convert?

Adding a testcase is nevertheless OK of course.


[Bug tree-optimization/115710] [11/12/13/14/15 Regression] complex abs does not vectorise

2024-06-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115710

--- Comment #11 from rguenther at suse dot de  ---
On Sun, 30 Jun 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115710
> 
> --- Comment #8 from Andrew Pinski  ---
> I wonder if we could also add:
> +/* cabs(x+0i) or cabs(0+xi) -> abs(x).  */
> +(simplify
> + (CABS (complex:c @0 real_zerop@1))
> + (abs @0))
> 
> 
> + /* cabs(x+xi) -> fabs(x)*sqrt(2).  */
> + (simplify
> +  (CABS (complex @0 @0))
> +  (mult (abs @0) { build_real_truncate (type, dconst_sqrt2 ()); })))
> 
> 
> to the lowering part.
> 
> let me look into doing that but it will be 3rd part to the patch series. Since
> we should have this info already ...

Complex lowering should use its lattice to do the usual simplifications,
sure.  Having those simplification patterns would be OK as well
of course.
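
For reference, the identities behind the two patterns:

  |x + 0i| = \sqrt{x^2 + 0^2} = |x|
  |x + xi| = \sqrt{x^2 + x^2} = |x|\,\sqrt{2}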

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 26 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #14 from Andrew Stubbs  ---
> On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> > 
> > --- Comment #13 from Richard Biener  ---
> > (In reply to Richard Biener from comment #12)
> >> (In reply to Andrew Stubbs from comment #10)
> >>> GFX10 has more limited permutation capabilities than GFX9 because it
> >>> only has 32-lane vectors natively, even though we're using the 64-lane
> >>> "compatibility" mode.
> >>>
> >>> However, in theory, the permutation capabilities on V32 and below should
> >>> be the same, and some permutations on V64 are allowed, so I don't know
> >>> why it doesn't use it. It's possible I broke the logic in
> >>> gcn_vectorize_vec_perm_const:
> >>>
> >>> /* RDNA devices can only do permutations within each group of 
> >>> 32-lanes.
> >>>Reject permutations that cross the boundary.  */
> >>> if (TARGET_RDNA2_PLUS)
> >>>   for (unsigned int i = 0; i < nelt; i++)
> >>> if (i < 31 ? perm[i] > 31 : perm[i] < 32)
> >>>   return false;
> >>>
> >>> It looks right to me though?
> >>
> >> nelt == 32 so I think the last element has the wrong check applied?
> >>
> >> It should be
> >>
> >>> if (i < 32 ? perm[i] > 31 : perm[i] < 32)
> >>
> >> I think.  With that the vectorization happens in a similar way but the
> >> failure still doesn't reproduce (without the patch, of course).
> 
> Oops, I think you're right.
> 
> > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> > two vectors src0 and src1 into one 32 element dst vector (it's no longer
> > required that src0 and src1 line up with the dst vector size btw, they
> > might have different nelt).  So the loop would reject interleaving
> > the low parts of two 32 element vectors, a permute that would look like
> > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> > mean you can never mix the two vector inputs?  Or does GCN not have
> > a two-to-one vector permute instruction?
> 
> GCN does not have two-to-one vector permute in hardware, so we do two 
> permutes and a vec_merge to get the same effect.
> 
> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
> 
> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but 
> no value may cross the boundary. AFAIK there's no way to do that via any 
> vector instruction (i.e. without writing to memory, or extracting values 
> element-wise).

I see - so it cannot even swap low-32 and high-32?  I'm thinking of
what sub-part of permutes would be possible by extending the two-to-one
vec_merge trick.

OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
restriction should be fine.
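
A scalar sketch of the two-permute-plus-vec_merge emulation described
above (illustrative only, not the GCN expander):

/* Two-input permute of V0/V1 under SEL, emulated with two single-input
   permutes and a merge on sel[i] < n.  */
void
perm2 (int *dst, const int *v0, const int *v1, const unsigned *sel,
       unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    {
      int from0 = v0[sel[i] % n];            /* permute of first input  */
      int from1 = v1[sel[i] % n];            /* permute of second input */
      dst[i] = sel[i] < n ? from0 : from1;   /* vec_merge               */
    }
}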

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #6 from rguenther at suse dot de  ---
On Tue, 25 Jun 2024, tschwinge at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> Thomas Schwinge  changed:
> 
>What|Removed |Added
> 
> Summary|GCN: FAIL:  |[15 Regression] GCN: FAIL:
>|gfortran.dg/vect/pr115528.f |gfortran.dg/vect/pr115528.f
>|  -O  execution test|  -O  execution test
>See Also||https://gcc.gnu.org/bugzill
>||a/show_bug.cgi?id=114107
> 
> --- Comment #5 from Thomas Schwinge  ---
> Turns out, this is a regression after all: before commit
> r15-1238-g1fe55a1794863b5ad9eeca5062782834716016b2 "tree-optimization/114107 -
> avoid peeling for gaps in more cases", this test case
> 'gfortran.dg/vect/pr115528.f' (which, of course, wasn't in-tree back then) 
> does
> PASS its execution test for GCN target (tested '-march=gfx908').

Yes, the testcase outer loop wasn't vectorized before due to the
"gap" access in the inner loop.  With this rev. we can now vectorize
the inner loop without touching the gap.  Note I'm no longer sure
the loop mask use is what's wrong.  Instead it looks like the IV for
'aa' is off (just like I remember what happens when there is a grouped
access in the inner loop):

  # vectp_aa.21_73 = PHI 
...
  vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77);
  vectp_aa.21_70 = vectp_aa.21_73 + 18446744073709551360; // -256
...
  vectp_aa.21_72 = vectp_aa.21_70 + 32;

for the inner loop - but the inner loop should advance by 32.

So I think the issue is a latent one, but maybe one that couldn't be
triggered.  _Possibly_ it could be triggered with V2mode vectors.

For an inner loop with a grouped access this case is still disqualified
but IIRC it has the same issue.

[Bug target/115640] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #4 from rguenther at suse dot de  ---
On Tue, 25 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #3 from Andrew Stubbs  ---
> (In reply to Richard Biener from comment #2)
> > If you force GCN to use fixed length vectors (how?), does it work?  How's
> > it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
> > enable SVE)
> 
> I believe "--param vect-partial-vector-usage=0" will disable the use of
> WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked
> the value is inherited from the host in the offload toolchain; the default for
> x86_64 was "1", meaning approximately "only use partial vectors in epilogue
> loops".

For unknown reasons x86-64 refuses to use partial vectors.

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #13 from rguenther at suse dot de  ---
On Thu, 20 Jun 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> 
> Andrew Pinski  changed:
> 
>What|Removed |Added
> 
>  CC||pinskia at gcc dot gnu.org
> 
> --- Comment #12 from Andrew Pinski  ---
> Note when I am adding V4QI support to the aarch64 backend (emulating it via
> V8QI), I am getting a failure in slp-gap-1.c but it is different from the
> others.
> 
> Without V4QI, the pattern matched `\{_\[0-9\]\+, 0` was able to match 6 times.
> 
> we got in the IR:
> ```
>   unsigned int _50;
>   vector(2) unsigned int _49;
> ...
>   _50 = MEM  [(uint8_t *)vectp_pix1.5_58];
>   _49 = {_50, 0};
> ```
> 
> 
> But afterwards we now get:
> ```
>   vector(4) unsigned char _50;
>   vector(8) unsigned char vect__34.9;
> ...
>   _50 = MEM  [(uint8_t *)vectp_pix1.5_58];
>   vect__34.9_49 = {_50, { 0, 0, 0, 0 }};
> ```
> 
> Which produces the exact same code. I am trying to figure out the best way to
> change the testcase pattern to make sure we don't match:
>   vect__37.23_6 = VEC_PERM_EXPR  8, 9, 10, 11 }>;
> 
> too.
> 
> `\{_\[0-9\]\+, { 0, 0` I think that will work but should I just do an
> alternative for the scan-tree-dump-times or should I put it as a separate one
> with some target selection here?

Maybe match \{_\[0-9\]\+, (0\|{ 0(, 0)+ })?  (with proper quoting)

[Bug tree-optimization/115537] [15 Regression] vectorizable_reduction ICEs after g:d66b820f392aa9a7c34d3cddaf3d7c73bf23f82d

2024-06-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115537

--- Comment #6 from rguenther at suse dot de  ---
> On 18.06.2024 at 16:11, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115537
> 
> --- Comment #5 from Tamar Christina  ---
> Thanks for the fix!
> 
> I think the testcase needs SVE enabled to ICE no?
> shouldn't that be -mcpu=neoverse-v1 and not -mcpu=neoverse-n1?

I intended to copy from the PR options; in case I mistyped, feel free to correct.


[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #5 from rguenther at suse dot de  ---
On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> 
> --- Comment #4 from Hongtao Liu  ---
> (In reply to rguent...@suse.de from comment #3)
> > On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> > > 
> > > --- Comment #2 from Hongtao Liu  ---
> > > (In reply to Richard Biener from comment #1)
> > > > Btw, I had opened PR115490 with my results for this already.  Some 
> > > > mitigation
> > > > should be from optimizing ISEL expansion to vcond_mask and I'd start 
> > > > with
> > > > looking at some of the fallout from that side (note that might require
> > > > the backend reject not natively implemented vec_cmp via its operand 1
> > > > predicate)
> > > 
> > > w/o AVX512, vector integer comparison only supports EQ/GT; other
> > > comparisons' rtx_cost is transformed to that (i.e. GTU is emulated with
> > > us_minus + eq + negating the vector mask).
> > > If we restrict the predicate of operand 1, would middle-end reject
> > > vectorization (or lower it to scalar version)?
> > 
> > Richard suggests that we implement the "obvious" transforms like
> > inversion in the middle-end but if for example unsigned compares
> > are not supported the us_minus + eq + negative trick isn't on
> > that list.
> > 
> > The main reason to restrict vec_cmp would be to avoid
> > a <= b ? c : d going with an unsupported vec_cmp but instead
> > do a > b ? d : c - the alternative is trying to fix this
> > on the RTL side via combine.  I understand the non-native
> 
> Yes, I have a patch which can fix most regressions via pattern match in
> combine.
> Still there is a situation that is difficult to deal with, mainly the
> optimization w/o sse4.1 . Because pblendvb/blendvps/blendvpd only exists under
> sse4.1, w/o sse4.1, it takes 3 instructions (pand,pandn,por) to simulate the
> vcond_mask, and the combine matches up to 4 instructions, which makes it
> currently impossible to use the combine to recover those optimizations in the
> vcond{,u,eq}, i.e. min/max.
> In the case of sse 4.1 and above, there is basically no regression anymore.

Maybe it's possible to use a define_insn_and_split for blends w/o SSE 4.1?
That would allow combine matching the high-level blend operation and
we'd only lower it afterwards?  The question is what we lose in
combinations of/into the lowered pand/pandn/por of course.

Maybe it's possible to catch the higher-level optimization (min/max)
on the GIMPLE level instead?

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-18 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #3 from rguenther at suse dot de  ---
On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> 
> --- Comment #2 from Hongtao Liu  ---
> (In reply to Richard Biener from comment #1)
> > Btw, I had opened PR115490 with my results for this already.  Some 
> > mitigation
> > should be from optimizing ISEL expansion to vcond_mask and I'd start with
> > looking at some of the fallout from that side (note that might require
> > the backend reject not natively implemented vec_cmp via its operand 1
> > predicate)
> 
> w/o AVX512, vector integer comparison only supports EQ/GT; other comparisons'
> rtx_cost is transformed to that (i.e. GTU is emulated with us_minus + eq +
> negating the vector mask).
> If we restrict the predicate of operand 1, would middle-end reject
> vectorization (or lower it to scalar version)?

Richard suggests that we implement the "obvious" transforms like
inversion in the middle-end but if for example unsigned compares
are not supported the us_minus + eq + negative trick isn't on
that list.

The main reason to restrict vec_cmp would be to avoid
a <= b ? c : d going with an unsupported vec_cmp but instead
do a > b ? d : c - the alternative is trying to fix this
on the RTL side via combine.  I understand the non-native
compares are already expanded to supported form and we
don't use a split after combine to make combinations to
a supported form easier?

I don't have a good feeling for which approach is going to be more
maintainable here.  But for example even for the unsigned compare
"lowering" the middle-end would have range info while RTL does
not (to some extent it's available at RTL expansion time).
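
The inversion in question, in scalar form: both functions below compute
the same value (for integer operands), so an unsupported LE compare can
become a supported GT compare with the select arms swapped.

int sel_le (int a, int b, int c, int d) { return a <= b ? c : d; }
int sel_gt (int a, int b, int c, int d) { return a >  b ? d : c; }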

[Bug tree-optimization/115427] fallback for interclass mathfn bifs like isinf, isfinite, isnormal

2024-06-11 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115427

--- Comment #4 from rguenther at suse dot de  ---
On Tue, 11 Jun 2024, linkw at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115427
> 
> --- Comment #3 from Kewen Lin  ---
> (In reply to Richard Biener from comment #2)
> > The canonical way would be to handle these in the ISEL pass and remove
> > the (fallback) expansion.  But then we can see whether the expander FAILs
> > (ideally expanders would never be allowed to FAIL, and for FAILing expanders
> > we'd have a way to query the target like we have the vec_perm_const hook).
> > 
> > But I'll note that currently the expanders may FAIL but then we expand to
> > a call rather than the inline-expansion (and for example AVR relies on this
> > now to avoid early folding of isnan).
> > 
> > So - for the cases of isfininte and friends without a fallback call I
> > would suggest to expand from ISEL to see if it FAILs and throw away
> > the result (similar as how IVOPTs probes things).  Or make those _not_
> > allowed to FAIL?  Why would they fail to expand anyway?
> 
> Thanks for the suggestion! IIUC considering the AVR example we still want
> *isinf* to fall back to the library call (so not falling back to the
> inline-expansion then).  Currently at least for the rs6000 port there is no
> case that we want to make it FAIL, but not sure some other targets will have
> such need in future.  From the review comment[1], we don't note it's not
> allowed to FAIL so we probably need to ensure there is some handling for FAIL
> in case some future FAIL cause some unexpected failure. Do you prefer not
> allowing it to FAIL? then re-open this and go with ISEL if some port wants it
> to FAIL?

I think it would be cleaner to not allow it FAIL since there's no library
fallback.  FAILing patterns are a hassle when it comes to GIMPLE
optimizations.

As said, there should be a good reason why patterns FAIL - what's
the idea behind this feature anyway?

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #14 from rguenther at suse dot de  ---
On Thu, 6 Jun 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> 
> --- Comment #13 from Tamar Christina  ---
> (In reply to rguent...@suse.de from comment #12) 
> > > since we don't care about overflow here, it looks like the stripping 
> > > should
> > > be recursive as long as it's a NOP expression between two integral types.
> > > 
> > > That would get them to hash to the same IV expression.  Trying now..
> > 
> > Note tree-affine is a tool that's used for this kind of "weak" equalities.
> > Convert both to affine, subtract them and if that's zero they are equal.
> 
> Hmm that's useful, though in this case this becomes the actual expression that
> IVOpts uses.
> 
> For instance this is done in alloc_iv and add_iv_candidate_for_use when
> determining the uses for the IV.
> 
> It looks like it's trying to force a canonical representation with as little
> casting as possible.
> 
> would the "affine"'ed tree be safe to use for this context?

No, I'd use that just for the comparison.

> What I've done currently is make a STRIP_ALL_NOPS that recursively strips NOPs
> for PLUS/MULT/MINUS.

But the stripped expression isn't necessarily valid to use either because
of possible undefined overflow.  It's probably safe to pick any of the
existing expressions (all are evaluated at each iteration), but if you
strip all NOPs from all of them you might end up with new undefined
behavior.

At least if that stripped expression is inserted somewhere or other
new expressions are built upon it.

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2024-06-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #12 from rguenther at suse dot de  ---
On Wed, 5 Jun 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932
> 
> --- Comment #11 from Tamar Christina  ---
> (In reply to Richard Biener from comment #10)
> > I think the question is why IVOPTs ends up using both the signed and
> > unsigned variant of the same IV instead of expressing all uses of both with
> > one IV?
> > 
> > That's where I'd look into.
> 
> It looks like this is because of a subtle difference in the expressions.
> 
> In get_loop_invariant_expr IVOPTs first tries to strip away all casts with
> STRIP_NOPS.
> 
> The first expression is (unsigned long) (stride.3_27 * 4) and the second
> expression is ((unsigned long) stride.3_27) * 4 (The pretty printing here is
> pretty bad...)
> 
> So the first one becomes:
>   (unsigned long) (stride.3_27 * 4) -> stride.3_27 * 4
> 
> and second one:
>   ((unsigned long) stride.3_27) * 4 -> ((unsigned long) stride.3_27) * 4
> 
> since we don't care about overflow here, it looks like the stripping should
> be recursive as long as it's a NOP expression between two integral types.
> 
> That would get them to hash to the same IV expression.  Trying now..

Note tree-affine is a tool that's used for this kind of "weak" equalities.
Convert both to affine, subtract them and if that's zero they are equal.
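
A hedged sketch of that comparison using the tree-affine API (assuming
both expressions have been brought to a common TYPE):

  aff_tree aff1, aff2;
  tree_to_aff_combination (expr1, type, &aff1);
  tree_to_aff_combination (expr2, type, &aff2);
  aff_combination_scale (&aff2, -1);     /* negate the second ...  */
  aff_combination_add (&aff1, &aff2);    /* ... and subtract       */
  bool equal_p = aff_combination_zero_p (&aff1);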

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #10 from rguenther at suse dot de  ---
On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> 
> --- Comment #9 from Andrew Stubbs  ---
> (In reply to Richard Biener from comment #6)
> > The best strathegy for GCN would be to gather V4QImode aka SImode into the
> > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > doing consecutive loads isn't a good strategy here.
> 
> I don't fully understand what you're trying to say here, so apologies if you
> knew all this already and I missed the point.
> 
> In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> values are in registers). The vector registers are not one single string of
> re-interpretable bits.
> 
> For the same reason, you can't load a value as V64QImode and then try to
> interpret it as V16SImode. GCN vector registers just don't work like
> SSE/Neon/etc.
> 
> When you load a V64QImode vector, each lane is extended to 32 bits, so what 
> you
> actually get in hardware is a V64SImode vector.
> 
> Likewise, when you load a V4QImode vector the hardware representation is
> actually V4SImode (which in itself is just V64SImode with undefined values in
> the unused lanes).

I see.  I wonder if there aren't one or two latent wrong-code issues because
of this and the vectorizer's assumptions ;)  I suppose modes_tieable_p
will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
And V64QImode really V64PSImode?

Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33], 
c[34], c[35], ... } it's probably best to use a single V64QImode gather 
with GCN then rather than four "consecutive" V64QImode loads and then
element swizzling.

[Bug c++/95349] Using std::launder(p) produces unexpected behavior where (p) produces expected behavior

2024-06-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95349

--- Comment #50 from rguenther at suse dot de  ---
On Mon, 3 Jun 2024, Christopher.Nerz at de dot bosch.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95349
> 
> --- Comment #49 from Christopher Nerz  
> ---
> Ah, misunderstood and therefore forgot optimization to a constant.
> 
> In the current code example, we have the problem that the (second)
> initialization does not initialize with a value (ergo undefined behaviour due
> to an uninitialized variable, although it is created in initialized memory).
> In the more complex code - which I am trying to simplify, but so far get the
> behaviour only in quite special situations which are hard to reduce to an
> example - we have that the buffer is explicitly initialized via `new T{}`
> (calling an inline initialization, i.e. not a trivial default constructor, but
> T is trivially copy & move constructible), then the buffer is copied (as
> std::byte-array) and then std::launder is applied to the resulting byte-array.
> To my understanding [basic.types.general].$3
> 
> > For two distinct objects obj1 and obj2 of trivially copyable type T, where 
> > neither obj1 nor obj2 is a potentially-overlapping subobject, if the 
> > underlying bytes (6.7.1) making up obj1 are copied into obj2, obj2 shall 
> > subsequently hold the same value as obj1.
> 
> guarantees that the copied buffer can be used as `T` via
> `*std::launder(second_buffer)`. Clang, MSVC agree, gcc does not.
> 
> Again: Still trying to construct the minimal example for that behavior :-/

Hmm, copying as std::byte-array should make the store use alias-set zero
so a followup load as T should be OK.  Unless "then the buffer is copied 
(as std::byte-array)" is really doing something more advanced with regard
to type-based aliasing rules (thus not a plain memcpy).
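
For reference, a condensed sketch of the pattern being described
(illustrative only; whether this exact form is strictly conforming is
part of the question):

  #include <cstddef>
  #include <cstring>
  #include <new>

  struct T { int v; };                     // trivially copyable

  int
  use_copied_buffer ()
  {
    alignas (T) std::byte buf1[sizeof (T)];
    T *p = new (buf1) T{42};               // placement-new, inline init
    alignas (T) std::byte buf2[sizeof (T)];
    std::memcpy (buf2, buf1, sizeof (T));  // copy the underlying bytes
    // Per [basic.types.general] p3 buf2 now holds the same value; the
    // question is whether launder yields a usable T* here.
    return p->v + std::launder (reinterpret_cast<T *> (buf2))->v;
  }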

[Bug tree-optimization/115298] [15 Regression] Various targets failing DSE tests after recent changes due to default of -fno-fdelete-null-pointer-checks on those targets

2024-06-03 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115298

--- Comment #5 from rguenther at suse dot de  ---
On Fri, 31 May 2024, law at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115298
> 
> --- Comment #4 from Jeffrey A. Law  ---
> Agh.  I was looking in the main config directory, not common/config.  So it 
> all
> makes sense now.
> 
> So if we go back to your original analysis, I think we can say things are
> behaving correctly and we just need to adjust the testcase to either skip on
> these targets or add the flag.

Yes.  It's still bad now with a memset(p, 1) having the same effect on
other archs.  But it's still "correct" given in the caller one could do

  if (q == (void *)0x0101010101010101)
foo (p, q);

so I don't really see how we can avoid this without magic hand-waving
and muttering "unlikely", aka not what a compiler should do.

Of course it now means that every time an INTEGER escapes (which is
always, since a literal INTEGER would always have to be considered escaped)
everything breaks down.  So do we need to consider

void bar () { *(int *)0xdeadbeef = 1; }

int foo ()
{
  int x = 0;
  bar ();
  return x;
}

and consider carefully laid out stack so bar () clobbers 'x'?  We
do have to assume 0xdeadbeef is a valid address as otherwise UB.

We do close bugs as INVALID that use the address of 'a' to get to
'b' for 'int a, b;' based on pointer arithmetic rules but we
have bugs with conditional copy propagation wrecking things.

That said, I'm somewhat considering to change INTEGER address
behavior in points-to, making them _not_ alias any named object.
Would that work in practice?  We currently have

ANYTHING = &ANYTHING
ESCAPED = *ESCAPED
ESCAPED = ESCAPED + UNKNOWN
*ESCAPED = NONLOCAL
NONLOCAL = &NONLOCAL
NONLOCAL = &ESCAPED
INTEGER = &ANYTHING

and I'd change default INTEGER constraints to

INTEGER = &INTEGER
INTEGER = NONLOCAL

'INTEGER' are objects at literal addresses, they can initially
refer to other 'INTEGER' objects and to global (named) objects.
That excludes pointing to the stack top for example (we'd be
back to ANYTHING if that's OK).

Anyway, we'll see if any concrete performance issues arise from
the fixed ANYTHING handling.

[Bug tree-optimization/115298] [15 Regression] Various targets failing DSE tests after recent changes

2024-05-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115298

--- Comment #3 from rguenther at suse dot de  ---
On Fri, 31 May 2024, law at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115298
> 
> --- Comment #2 from Jeffrey A. Law  ---
> What still doesn't make sense is why nds32 would be special here.  It doesn't
> do anything special with flag_delete_null_pointer_checks and I don't think it
> uses any of the address space hooks.  So why does nds32 behave differently 
> than
> x86?

/* Implement TARGET_OPTION_OPTIMIZATION_TABLE.  */ 
static const struct default_options nds32_option_optimization_table[] =
{  
#if TARGET_LINUX_ABI == 0
  /* Disable -fdelete-null-pointer-checks by default in ELF toolchain.  */
  { OPT_LEVELS_ALL,   OPT_fdelete_null_pointer_checks,
   NULL, 0 },
#endif

that's in common/config/nds32/nds32-common.cc, a place to easily
overlook.  So it should be nds32-elf only.  And yeah, points-to
should use TARGET_ADDR_SPACE_ZERO_ADDRESS_VALID but since it tracks
pointers across integer conversions we have to decide on integer
zero as well where we do not know what address-space is affected.
I guess we could use the default address space and try to be clever
with ADDR_SPACE_CONVERT.  So points-to only checks 
flag_delete_null_pointer_checks at the moment.

[Bug middle-end/28831] [11/12/13/14/15 Regression] Aggregate copy not elided when using a return value as a pass-by-value parameter

2024-05-23 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28831

--- Comment #44 from rguenther at suse dot de  ---
On Thu, 23 May 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28831
> 
> --- Comment #43 from Matthias Kretz (Vir)  ---
> I see this issue in SIMD programming. Example (on x86_64 with only '-O2', i.e.
> without AVX512) https://compiler-explorer.com/z/K64djP356:
> 
> typedef int V __attribute__((vector_size(64)));
> 
> V gen();
> 
> void g0(V const&, V const&);
> void g1(V, V);
> 
> void constref()
> {
>   g0(gen(), gen());
> }
> 
> void byvalue()
> {
>   g1(gen(), gen());
> }
> 
> Both the 'constref' and 'byvalue' cases copy every V argument before calling
> g0/g1.

The copy on GIMPLE is due to IL constraints:

  _10 = gen ();

  <bb ...> :
  D.2805 = _10;
  g0 (&D.2805, &D.2806);

when the call has a register type return value the LHS of the call
statement has to be a register (SSA name).  But the argument to
g0 has to be memory, so we get the extra copy.  Now, w/o AVX512
that "register" doesn't work out and we allocate it to memory
causing a memory-to-memory copy.

That's also because in vector lowering we do not lower those
register-to-memory stores (doing that would possibly help a bit,
as would more clever expansion of the copy or more clever
expanding of _10)

[Bug tree-optimization/114072] gcc.dg/vect/vect-pr111779.c FAILs

2024-05-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 22 May 2024, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072
> 
> --- Comment #5 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> > --- Comment #4 from rguenther at suse dot de  ---
> [...]
> >> I think the best we can do then is
> >> 
> >> /* { dg-skip-if "PR tree-optimization/114072" { be && { ! vect_shift_char 
> >> } } }
> >> */
> >> 
> >> Lets the test become UNSUPPORTED on 32 and 64-bit SPARC, but still PASS
> >> as before on 32 and 64-bit x86.
> >
> > Can we instead guard the scan-tree-dump?  This way the correctness
> > execute part still is exercised?
> 
> Sure, even if the result is somewhat hard to read with all those levels
> of braces:
> 
> /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { vect_int &&
> { le || { be && vect_shift_char } } } } } } */
> 
> This way, all of compile, execute, and scan are run on x86, while on
> sparc it's only compile, execute.

Looks good, pre-approved.

[Bug tree-optimization/114072] gcc.dg/vect/vect-pr111779.c FAILs

2024-05-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072

--- Comment #4 from rguenther at suse dot de  ---
On Wed, 22 May 2024, ro at CeBiTec dot Uni-Bielefeld.DE wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114072
> 
> --- Comment #3 from ro at CeBiTec dot Uni-Bielefeld.DE ---
> > --- Comment #2 from Richard Biener  ---
> > Hmm, is solaris-sparc big-endian?  It seems so.  That makes the bitfield
> 
> It is indeed.
> 
> > access require a VnQImode logical right shift (but little-endian doesn't
> > require it - it's actually bitfield endianess that matters).
> >
> > There is vect_shift_char you could use and somehow conditionalize that
> > on big-endianess.
> 
> I think the best we can do then is
> 
> /* { dg-skip-if "PR tree-optimization/114072" { be && { ! vect_shift_char } } 
> }
> */
> 
> Lets the test become UNSUPPORTED on 32 and 64-bit SPARC, but still PASS
> as before on 32 and 64-bit x86.

Can we instead guard the scan-tree-dump?  This way the correctness
execute part still is exercised?

[Bug tree-optimization/115138] [15 Regression] Bootstrap compare-debug fail after r15-580-gf3e5f4c58591f5

2024-05-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115138

--- Comment #2 from rguenther at suse dot de  ---
> On 17.05.2024 at 16:20, iains at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115138
> 
>Bug ID: 115138
>   Summary: [15 Regression] Bootstrap compare-debug fail after
>r15-580-gf3e5f4c58591f5
>   Product: gcc
>   Version: 14.0
>Status: UNCONFIRMED
>  Severity: normal
>  Priority: P3
> Component: tree-optimization
>  Assignee: unassigned at gcc dot gnu.org
>  Reporter: iains at gcc dot gnu.org
>CC: rguenth at gcc dot gnu.org
>  Target Milestone: ---
>Target: x86_64-darwin
> 
> The compare fail is a symbol name mismatch (AFAICT) the code is otherwise
> identical.
> 
> a single fail fails gcc/d/opover.o
> 
> There's:
> 
>.const
> _CSWTCH.155:
>.byte   38
>.byte   37
>.byte   40
>.byte   39
> 
> where the stage1 compiler (and x86_64 Linux) produces _CSWTCH.154
> 
> At present, still trying to figure out how to debug this further .. it's D so
> no preprocessed output - I guess will have to try tree dumps.

Did the followup fix maybe resolve this?

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug rtl-optimization/101523] Huge number of combine attempts

2024-05-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523

--- Comment #64 from rguenther at suse dot de  ---
On Sat, 4 May 2024, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523
> 
> --- Comment #61 from Segher Boessenkool  ---
> We used to do the wrong thing in combine.  Now that my fix was reverted, we
> still do.  This should be undone soonish,

As promised I'm going to revert the revert after 14.1 is released 
(hopefully tomorrow).  As for distros I have decided to include my
hack posted in 
https://gcc.gnu.org/pipermail/gcc-patches/2024-April/648725.html
for SUSE based distros in GCC 13 and 14 as that seems to improve
the problematical memory uses in our build farm.

[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-05-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #4 from rguenther at suse dot de  ---
On Mon, 6 May 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
> 
> --- Comment #3 from Matthias Kretz (Vir)  ---
> The stdx::simd implementation in this area is old and mainly tuned to be
> correct. I can rewrite the split and concat implementation to use
> __builtin_shufflevector (which wasn't available in GCC at the time when I
> originally implemented it). Doing so I can resolve this issue.
> 
> How do you want to handle this? Because it would certainly be nice if the
> compiler can optimize this in the same way as Clang can. Should I try to come
> up with a testcase that doesn't need stdx::simd and then improve stdx::simd
> independently?

Yes, that would be nice (ideally the testcase would even be a plain C
testcase, without the stdx::simd C++ abstraction).

I do think stdx::simd should be improved to use __builtin_shufflevector
though.
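
For reference, the kind of split/concat __builtin_shufflevector
(available since GCC 12) expresses directly; the types and function
names here are only illustrative:

  typedef int v4si __attribute__ ((vector_size (16)));
  typedef int v8si __attribute__ ((vector_size (32)));

  /* Split: extract the low/high halves of a.  */
  v4si lo (v8si a) { return __builtin_shufflevector (a, a, 0, 1, 2, 3); }
  v4si hi (v8si a) { return __builtin_shufflevector (a, a, 4, 5, 6, 7); }

  /* Concat: indices address the concatenation of both operands.  */
  v8si concat (v4si a, v4si b)
  { return __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7); }

Chains of such shuffles lower to VEC_PERM_EXPRs, which the compiler can
fold, and that should make this form easier to optimize than
element-wise copies.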

[Bug tree-optimization/114774] Missed DSE in simple code due to interleaving stores

2024-04-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774

--- Comment #4 from rguenther at suse dot de  ---
On Fri, 19 Apr 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774
> 
> --- Comment #3 from Jan Hubicka  ---
> > Yes, DSE walking doesn't "branch" but goes to some length handling some 
> > trivial
> > branches only.  Mainly to avoid compile-time issues.  It needs larger
> > re-structuring to fix that, but in principle it shouldn't be difficult. 
> 
> Looking into it: instead of having a simple outer loop, it needs to
> maintain a worklist of defs to process, each annotated with a live bitmap,
> right?

Yeah, I have some patch on some branch somewhere ... IIRC it was broken
and miscompiled things and I got distracted ...

I will eventually get back to DSE for stage1 also because of some other
PRs.

> With Martin we noticed it while looking into push_back codegen.  In case an
> aggregate is passed to a function call but does not escape, we now SRA
> it, removing all uses and corresponding clobbers, but we do not remove the
> stores (Martin was expecting DSE to do it).  As a result the store stays
> around, which causes a partial memory stall (storing in pieces, loading
> as a whole).  It seems easy to remove this specific store during SRA, but it
> seems improving DSE here would be desirable.

Yes.
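
As a toy illustration of the worklist shape discussed above (not GCC's
DSE; a visited set would additionally be needed for cyclic graphs):

  #include <bitset>
  #include <utility>
  #include <vector>

  struct node { std::vector<int> succs; std::bitset<16> reads, writes; };

  /* Which of the candidate store's LIVE bytes are read on some path
     from START before being overwritten.  */
  std::bitset<16>
  bytes_used (const std::vector<node> &g, int start, std::bitset<16> live)
  {
    std::vector<std::pair<int, std::bitset<16>>> work{{start, live}};
    std::bitset<16> used;                // union over all paths
    while (!work.empty ())
      {
        auto [n, l] = work.back ();
        work.pop_back ();
        used |= g[n].reads & l;          // read before killed
        l &= ~g[n].writes;               // killed bytes go dead
        if (l.any ())
          for (int s : g[n].succs)
            work.push_back ({s, l});     // fork: per-path liveness
      }
    return used;
  }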

[Bug tree-optimization/114749] [13 Regression] RISC-V rv64gcv ICE: in vectorizable_load, at tree-vect-stmts.cc

2024-04-17 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749

--- Comment #5 from rguenther at suse dot de  ---
On Wed, 17 Apr 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114749
> 
> --- Comment #4 from JuzheZhong  ---
> Hi, Patrick.
> 
> It seems that Richard didn't append the testcase in the patch.
> Could you send a patch to add the testcase for RISC-V port ?

I did not because of the extra riscv-specific options needed; you
may have a better idea how to correctly put that into the riscv
target-specific harness (which eventually iterates over extra options?)

[Bug target/111231] [12/13/14 regression] armhf: Miscompilation with -O2/-fno-exceptions level (-fno-tree-vectorize is working)

2024-04-16 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111231

--- Comment #35 from rguenther at suse dot de  ---
On Tue, 16 Apr 2024, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111231
> 
> --- Comment #34 from Richard Earnshaw  ---
> To be honest, I'm more concerned that we aren't eliminating a lot of these
> copies during the gimple optimization phase.  The memcpy is really a type
> punning step (that's strictly ISO C compliant, rather than using the GCC union
> extension), so ideally we'd recognize that and eliminate as many of the copies
> as possible (perhaps using some form of view_convert or whatever gimple is
> appropriate for changing the view without changing the contents).

Yeah, there's currently no way to represent a change just in the
effective type that wouldn't generate code in the end but still
serves as a barrier for these type-related optimizations.

When modifying the earlier store is an option then another possibility
would be to attach multiple effective types to it in some way.  Of course
that's pessimizing as well.

That said, the choice has been made to prune those "invalid" redundant
store removals but as we see here the implemented checks are not working
as intended.

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #17 from rguenther at suse dot de  ---
On Mon, 15 Apr 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #16 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #14)
> > I think
> > 
> >   if (safelen)
> > {
> >   poly_uint64 val;
> >   safelen = OMP_CLAUSE_SAFELEN_EXPR (safelen);
> >   if (!poly_int_tree_p (safelen, &val))
> > safelen_int = 0;
> >   else
> > safelen_int = MIN (constant_lower_bound (val), INT_MAX);
> > 
> > should simply become
> > 
> > safelen_int = constant_upper_bound_with_limit (val, INT_MAX);
> > 
> > ?  Usually targets do have a limit on the actual length but I see
> > constant_upper_bound_with_limit doesn't query such.  But it would
> > be a more appropriate way to say there might be an actual target limit here?
> 
> OMP_CLAUSE_SAFELEN_EXPR is always an INTEGER_CST, the FEs verify that and 
> error
> if it is not.  So, I must say I don't really understand parts of the
> r8-5649-g9d2f08ab97be
> changes.  I can understand the intent to make max_vf a poly_int, but don't
> understand why safelen should be, what would it mean and when it would be set
> that way?

It would be only to "better" encode "infinity".  But I see loop->safelen
is 'int' and only [0, INT_MAX] is specified, so we'd conveniently have
say -1 to encode "infinity".  We'd have to special-case that value
anyway?

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #15 from rguenther at suse dot de  ---
On Mon, 15 Apr 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #13 from Jakub Jelinek  ---
> (In reply to kugan from comment #12)
> > > Why?
> > > Then it just is INT_MAX value, which is a magic value that says that it is
> > > infinity.
> > > No need to say it is a poly_int infinity.
> > 
> > For this test case, omp_max_vf gets [16, 16] from the backend. This then
> > becomes 16. If we keep it as poly_int, it would pass maybe_lt (max_vf,
> > min_vf)) after applying safelen?
> 
> No.  You should just special case loop->safelen == INT_MAX to mean infinity in
> the comparisons where it currently doesn't work as infinity.

But then an actual safelen(INT_MAX) would need to be adjusted.

Maybe using a poly-int safelen internally is cleaner.
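
For comparison, a minimal sketch of the INT_MAX-as-infinity
special-casing (hypothetical helper; ordered_min is the usual poly-int
routine, the usual GCC includes assumed):

  /* Clamp MAX_VF by SAFELEN, treating safelen == INT_MAX as infinity.  */
  static poly_uint64
  clamp_max_vf (poly_uint64 max_vf, int safelen)
  {
    if (safelen == INT_MAX)
      return max_vf;                 /* safelen("infinity"): no clamp  */
    return ordered_min (max_vf, poly_uint64 (safelen));
  }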

[Bug tree-optimization/114635] OpenMP reductions fail dependency analysis

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635

--- Comment #7 from rguenther at suse dot de  ---
> On 08.04.2024 at 16:55, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
> 
> --- Comment #6 from Tamar Christina  ---
> (In reply to Jakub Jelinek from comment #4)
>> Now, with SVE/RISCV vectors the actual vectorization factor is a poly_int
>> rather than constant.  One possibility would be to use VLA arrays in those
>> cases, but then it will be hard to undo that later, or allow both shrinking
>> and growing the arrays and even turning them into VLA-like ones.
> 
> I think they are already VLAs of some kind going into vect, unless I've
> misunderstood the declaration:
> 
>  float D.28295[0:POLY_INT_CST [15, 16]];
>  float D.28293[0:POLY_INT_CST [15, 16]];
>  float D.28290[0:POLY_INT_CST [15, 16]];
> 
> it looks like during vectorization they are lowered though.

Maybe it’s about upper vs lower bound when setting loop->safelen from
‚unlimited‘

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #26 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, douglas.boffey at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
> 
> --- Comment #25 from Douglas Boffey  ---
> (In reply to rguent...@suse.de from comment #24)
> > dumpbin /headers executable.exe
> 
> ...
>   C00000 size of stack reserve

OK, so that's the expected 12MB we adjust the stack reserve to.  It's
odd that you run into stack exhaustion (or maybe you don't and instead
run out of other memory).

I've now tried GCC 11.4 as you reportedly used on x86_64-linux and
can compile the testcase successfully with that using 2GB memory
and 75s compile-time.  Stack usage itself isn't measured but 8MB
were enough.

GCC 13 runs out of (8MB) stack for me.

[Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

--- Comment #24 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, douglas.boffey at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480
> 
> --- Comment #23 from Douglas Boffey  ---
> (In reply to Richard Biener from comment #22)
> > Note we're using -Wl,--stack,12582912 when linking the GCC executables, I
> > wonder
> > if the reporter can verify the segfaulting executables have the correct 
> > stack
> > size set?
> 
> How do I find that out?

googling tells me

dumpbin /headers executable.exe

which should show a "size of stack reserve".  Alternatively if you
have objdump from binutils that might also show this info.  Basically
it's encoded in the binary (you want to check cc1plus.exe here).

There's also

editbin /stack:N executable.exe

where you can alter the stack reserve of the binary to N

[Bug libfortran/114304] [13/14 Regression] libgfortran I/O – bogus "Semicolon not allowed as separator with DECIMAL='point'"

2024-04-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304

--- Comment #29 from rguenther at suse dot de  ---
On Mon, 8 Apr 2024, burnus at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304
> 
> --- Comment #28 from Tobias Burnus  ---
> Created attachment 57896
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57896&action=edit
> Testcase
> 
> It seems as if 'tabs' cause problems, e.g. for:
> 
>  profile_single_file= .true.
> 
> where there are two tabs before '='.
> 
> * * *
> 
> The problem seems to be that the new code uses:
> 
> -  eat_spaces (dtp);
>dtp->u.p.comma_flag = 0;
> +  c = next_char (dtp);
> +  if (c == ' ')
> +{
> +  eat_spaces (dtp);
> 
> Thus, it explicitly checks for ' ' while eat_spaces handles:
> 
>   while (c != EOF && (c == ' ' || c == '\r' || c == '\t'));
> 
> Testcase attached.
> 
> I think we need at least an "|| c == '\t'"; I guess '\r' isn't really required
> here, or is it?

Might be for \r\n line endings?  I'd keep it for the sake of preserving
previous behavior.  isspace(3) tests for \f, \n, \r, \t, \v and space
(but of course all depends on the locale, not sure whether libgfortran
needs to care for locales)
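
A standalone illustration of the separator scanning under discussion
(hypothetical helper, not the libgfortran code itself):

  #include <cassert>

  /* Skip the same characters eat_spaces accepts: space, tab and (for
     \r\n line endings) carriage return.  */
  static char
  skip_separator_spaces (const char *&p)
  {
    while (*p == ' ' || *p == '\t' || *p == '\r')
      ++p;
    return *p;
  }

  int
  main ()
  {
    const char *line = "\t\t= .true.";
    assert (skip_separator_spaces (line) == '=');
    return 0;
  }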

[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e

2024-03-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683

--- Comment #25 from rguenther at suse dot de  ---
On Fri, 22 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683
> 
> --- Comment #24 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #23)
> > It looks like this could go to suitable_reference_p instead?
> 
> You mean return false for those making them not suitable at all?
> I thought without a write such references would act more like RS_ANY, but
> a reason I didn't just treat such references as RS_ANY rather than RS_NONZERO
> in suitable_reference_p was because of the assert that all refs in a component
> have the same ref_step_type but nothing actually comparing it before the
> assertion.

Hmm, true.

> But if you think I should just return false from suitable_reference_p if the
> step isn't a multiple of the sizes, I can surely try that.
> 
> > That said, I do wonder why with my patch split_data_refs_to_components
> > doesn't fixup.  I think it's supposed to handle the case where
> > dependences are unknown conservatively...
> 
> I'm afraid I'm not familiar with data refs enough to know what was going on.

I tried to follow what happens there and I'm also somewhat lost.

Anyway, I think fixing this in predcom in a convenient place even if
it might be not the true fix is OK.  You might want to put a comment
before any such fix indicating there might be more latent issues
in predcom or dependence analysis in general.

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2024-03-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #17 from rguenther at suse dot de  ---
On Thu, 21 Mar 2024, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
> 
> --- Comment #16 from Uroš Bizjak  ---
> (In reply to Richard Biener from comment #13)
> > The original testcase is fixed, appearantly slapping 'extern' on the int
> > makes it not effective.
> > 
> > Possibly better amend the
> > 
> >   if (VAR_P (inner) && DECL_HARD_REGISTER (inner))
> > return;
> > 
> > check though.  As indicated my fix fixed only VAR_DECL cases, there's
> > still pointer-based accesses (MEM_REF) to consider.  So possibly even
> > the following is necessary
> 
> I must admit that to create the patch from Comment #11 I just mindlessly
> searched for DECL_THREAD_LOCAL_P in asan.cc and amended the location with
> ADDR_SPACE_GENERIC_P check.

It might be that the DECL_THREAD_LOCAL_P handling is similarly broken
in that

int foo (int *p)
{
  return *p;
}

is instrumented even when p is a pointer to thread-local storage?  But
maybe that works fine and is handled in the runtime.  But of course
the runtime can't handle non-generic address-spaces at all unless
we can convert all those addresses to ones in the generic address-space
(given suitable overlap of course).

> However, ASAN should back off from annotating *any* gs: prefixed address. 
> 
> I'll test your patch from Comment #13 ASAP.

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

--- Comment #8 from rguenther at suse dot de  ---
On Thu, 21 Mar 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
> 
> Hongtao Liu  changed:
> 
>What|Removed |Added
> 
>  CC||liuhongt at gcc dot gnu.org
> 
> --- Comment #7 from Hongtao Liu  ---
> Another simple case is 
> 
> typedef int v4si __attribute__((vector_size(16)));
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi a;
> v4si b;
> void
> foo ()
> {
>b = __extension__(v4si){0, 0, 0, 0};
>a = __extension__(v8hi){0, 0, 0, 0, 0, 0, 0, 0};
> }
> 
> GCC generates 2 pxor
> 
> foo():
> vpxor   xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR b[rip], xmm0
> vpxor   xmm0, xmm0, xmm0
> vmovdqa XMMWORD PTR a[rip], xmm0
> ret

If we were to expose that vpxor before postreload we'd likely CSE but
we have

5: xmm0:V4SI=const_vector
  REG_EQUIV const_vector
6: [`b']=xmm0:V4SI
7: xmm0:V8HI=const_vector
  REG_EQUIV const_vector
8: [`a']=xmm0:V8HI

until the very end.  But since we have the same mode size on the xmm0
sets CSE could easily handle (integral) constants by hashing/comparing
on their byte representation rather than by using the RTX structure.
OTOH as we mostly have special constants allowed in the IL like this
treating all-zeros and all-ones specially might be good enough ...
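
A toy illustration of hashing constants by their byte image rather than
by their IL structure (not GCC's CSE; the names are made up):

  #include <cstring>
  #include <functional>
  #include <string_view>

  typedef int v4si __attribute__ ((vector_size (16)));
  typedef short v8hi __attribute__ ((vector_size (16)));

  template <typename V>
  static size_t
  byte_hash (const V &v)
  {
    char buf[sizeof (V)];
    std::memcpy (buf, &v, sizeof (V));  // byte representation
    return std::hash<std::string_view> () (std::string_view (buf, sizeof (V)));
  }

  int
  main ()
  {
    v4si a = {0, 0, 0, 0};
    v8hi b = {0, 0, 0, 0, 0, 0, 0, 0};
    return byte_hash (a) == byte_hash (b) ? 0 : 1;  // expect 0
  }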

[Bug tree-optimization/114345] FRE missing knowledge of semantics of IFN loads

2024-03-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345

--- Comment #6 from rguenther at suse dot de  ---
On Fri, 15 Mar 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114345
> 
> --- Comment #5 from Tamar Christina  ---
> (In reply to Richard Biener from comment #4)
> > Well, the shuffling in .LOAD_LANES will be a bit awkward to do, but sure.  
> > We
> > basically lack "constant folding" of .LOAD_LANES and similarly of course
> > we can't see through .STORE_LANES of a constant when later folding a scalar
> > load from the same memory.
> 
> I guess it becomes harder with the 3 and 4 lane ones, but the 2 lanes one is
> just a single VEC_PERM_EXPR no?

It's all about constant folding and thus "shuffling" properly.  But if
you consider that the vector type might be punned, a later "long" load
of a .STORE_LANES with "int" lanes will make it interesting to
follow non-consecutive bits... (read: that's not implemented).  That said,
some careful set of testcases should accompany support for .LOAD_LANES
and .STORE_LANES handling in VN.

I suppose it should be possible to leverage the GIMPLE FE for this.

[Bug libfortran/114304] [13/14 Regression] libgfortran I/O – bogus "Semicolon not allowed as separator with DECIMAL='point'"

2024-03-11 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304

--- Comment #10 from rguenther at suse dot de  ---
> On 11.03.2024 at 20:03, jvdelisle at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114304
> 
> --- Comment #9 from Jerry DeLisle  ---
> Patch on comment #8 breaks all sorts of things. Not so obvious. I will try
> reverting the original hunk from pr105473 and then work from there.

Just to add: I think rejecting something we accepted before, when this
doesn't fix a rejects-valid, shouldn't be done on branches.  And given it
affects the standard library while its SONAME is not altered, it might affect
programs compiled with an older libgfortran (maybe there's an argument for
something like a LIBGFORTRAN_STRICT environment variable to control such
behavior if really needed?)

> --
> You are receiving this mail because:
> You reported the bug.

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #16 from rguenther at suse dot de  ---
On Thu, 7 Mar 2024, gjl at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #14 from Georg-Johann Lay  ---
> The code in the example is not a perfect bswap, it needs additional shuffling
> of bytes.  The tree passes must know that bswap is not a perfect fit.  There
> must be *some* criterion that depends on the permutation, and when a bswap is
> closer to the bswapped permutation than a non-bswapped permutation is to the
> original one.

I think we only support bswap + rotate as replacement (but just use that
if it matches).

[Bug middle-end/105533] UBSAN: gcc/expmed.cc:3272:26: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long int'

2024-03-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105533

--- Comment #10 from rguenther at suse dot de  ---
On Wed, 6 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105533
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||rguenth at gcc dot gnu.org
> 
> --- Comment #7 from Jakub Jelinek  ---
> The second UB is on
> #2  ao_ref_init_from_vn_reference (ref=, set=1, base_set=1,
> type=, ops=...) at ../../gcc/tree-ssa-sccvn.cc:1224
> 1224offset += op->off << LOG2_BITS_PER_UNIT;
> where op->off is negative.
> Isn't this just an unnecessary optimization?  I mean can't we just do
> offset += op->off * BITS_PER_UNIT;
> BITS_PER_UNIT is a constant 8 on all targets we support...

It's a habit from dealing with offset_int (but this is poly_int64)
where the multiply is possibly a lot more costly than a shift.

So yeah, a multiply is fine I guess.

[Bug target/114252] Introducing bswapsi reduces code performance

2024-03-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #6 from rguenther at suse dot de  ---
> On 06.03.2024 at 17:12, gjl at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #5 from Georg-Johann Lay  ---
> (In reply to Richard Biener from comment #4)
>> So bswap on a value is just register shuffling, right?
> 
> The point is that there is no need for bswap in the first place, just have a
> look at the code that v13 generates.  It's 4 QI loads and that's it, no
> shuffling required at all.
> 
> But v14 dropped that, and the bswapsi (presumably due to previous flawed tree
> optmizations) is introduced by some tree pass.
> 
> There's nothing the backend can do about it.  So would you explain why you
> think it's a "target" issue?
> 
> Maybe the PR title I used is confusing and does not hit the point?

Why does the target say it has bswapsi then? In which case is that profitable?

> --
> You are receiving this mail because:
> You are on the CC list for the bug.

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #22 from rguenther at suse dot de  ---
On Tue, 5 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232
> 
> --- Comment #17 from Jakub Jelinek  ---
> Either change those too, or the splitter needs some variant of what to do
> if there is a mismatch.
> Though, optimize_size implies optimize_function_for_size_p (cfun), so if a
> named pattern uses && optimize_size and the insn it splits into uses
> optimize_function_for_size_p (cfun), it shouldn't fail.  The other direction 
> is
> not always true, optimize_function_for_size_p (cfun) can be true just because
> the function
> is cold, not just because it is -Os.

I think optimize_function_for_size_p (cfun) isn't always true when
optimize_size is, since it looks at the function-specific setting
of that flag, so you'd have to use opt_for_fn (cfun, optimize_size).

[Bug target/114232] [14 regression] ICE when building rr-5.7.0 with LTO on x86

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232

--- Comment #14 from rguenther at suse dot de  ---
On Tue, 5 Mar 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114232
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||jakub at gcc dot gnu.org
> 
> --- Comment #12 from Jakub Jelinek  ---
> Still, it would be nice to understand what changed 
> optimize_function_for_size_p
> (cfun)
> after IPA.  Is something adjusting node->count or node->frequency?
> Otherwise it should just depend on the optimize_size flag which should not
> change...

Maybe we share the TREE optimization node (it gets re-materialized
during LTO streaming) between hot and cold functions and the "first"
one getting in set_cfun will overwrite TREE_OPTIMIZATION_OPTABS in
init_tree_optimization_optabs (though it seems to overwrite things).

In any case I think the tree node sharing only looks at options, not
at function frequency so having TREE_OPTIMIZATION_OPTABS part of
the optimization node looks a bit fragile in the light of sharing.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #43 from rguenther at suse dot de  ---
On Mon, 4 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #41 from Richard Sandiford  ---
> (In reply to Richard Biener from comment #40)
> > So I wonder if we can use "local costing" to decide a gather is always OK
> > compared to the alternative with peeling for gaps.  On x86 gather tends
> > to be slow compared to open-coding it.
> Yeah, on SVE gathers are generally "enabling" instructions rather than
> something to use for their own sake.
> 
> I suppose one problem is that we currently only try to use gathers for
> single-element groups.  If we make a local decision to use gathers while
> keeping that restriction, we could end up using gathers ?unnecessarily? while
> still needing to peel for gaps for (say) a two-element group.
> 
> That is, it's only better to use gathers than contiguous loads if by doing 
> that
> we avoid all need to peel for gaps (and if the cost of peeling for gaps was
> high enough to justify the cost of using gathers over consecutive loads).

Yep.  I do want to experiment with a way to have vectorizable_* register
multiple variants of vectorization and have ways to stitch together and 
cost the overall vectorization as a cheaper (and more flexible) alternative
to "iteration".  It will to some extent blow up the combinations to try, but
there might be a way to use greedy relaxation techniques to converge to
a lowest cost variant.

> One of the things on the list to do (once everything is SLP!) is to support
> loads with gaps directly via predication, so that we never load elements that
> aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
> 32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
> group with a group size of 2 could be done cheaply with just consecutive 
> loads,
> without peeling for gaps.

Yep.  Gap handling leaves something to be desired (also when no predication
is available); I also plan to address some shortcomings in that area early
in stage1.

Note that generally the idea is that gap peeling is very cheap - unless
that is the only reason to have an epilogue at all.  The exception might
be small round-trip loops but those are best handled with predication
where there's no good reason to do peeling for gaps at all.

[Bug other/114191] Flags "Warning" and "Target" don't mix well in target.opt files

2024-03-04 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114191

--- Comment #4 from rguenther at suse dot de  ---
On Mon, 4 Mar 2024, gjl at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114191
> 
> --- Comment #2 from Georg-Johann Lay  ---
> (In reply to Richard Biener from comment #1)
> > Wmisspelled-isr
> > Target C C++ Var(avr_warn_misspelled_isr) Init(1)
> > Warn if the ISR is misspelled, ...
> > 
> > should eventually work?
> 
> With that, the warnings appear as they should, but the option is not
> recognized:
> 
> $ avr-gcc signal.c -S -Wmisspelled-isr
> error: unrecognized command-line option '-Wmisspelled-isr'
> 
> $ avr-gcc signal.c -S -Wno-misspelled-isr
> error: unrecognized command-line option '-Wno-misspelled-isr'
> 
> $ avr-gcc signal.c -S -Werror=misspelled-isr
> error: '-Werror=misspelled-isr': '-Wmisspelled-isr' is not an option that
> controls warnings

Hmm.  This is how some other targets do it, grepping

grep ^W config/*/*.opt

(a bit noisy because of descriptions).

config/gcn/gcn.opt:Wopenacc-dims
config/gcn/gcn.opt:Warn about invalid OpenACC dimensions.
config/i386/mingw.opt:Wpedantic-ms-format
config/i386/mingw.opt:Warn about none ISO msvcrt scanf/printf width 
extensions.

those are not Target, but

config/rs6000/darwin.opt:Waltivec-long-deprecated

uses Target (and thus is dysfunctional)

So it seems we're not set up for 'Target' warnings.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-03-01 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #34 from rguenther at suse dot de  ---
On Fri, 1 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #33 from Richard Sandiford  ---
> Can you give me a chance to look at it a bit when I back?  This doesn't feel
> like the way to go to me.

Sure.

[Bug tree-optimization/114151] [14 Regression] weird and inefficient codegen and addressing modes since g:a0b1798042d033fd2cc2c806afbb77875dd2909b

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

--- Comment #4 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151
> 
> --- Comment #3 from Tamar Christina  ---
> > 
> > This was a correctness fix btw, so I'm not sure we can easily recover - we
> > could try using niter information for CHREC_VARIABLE but then there's
> > variable niter here so I don't see a chance.
> > 
> 
> It looks like it's mostly SVE suffering here. Adv. SIMD an scalar codegen 
> seems
> to have vastly improved with it. we see ~10% improvements due to better
> addressing for scalar.
> 
> It also only happens at -O3 but -O2 is fine, which is weird as I was expected
> IVopts to be enabled at -O2 too.

Note the underlying issue (analyzing { a, +, b } * c differently and
thus eventually having dependent CHRECs fail to analyze) should be
independent of the architecture.

What was definitely missing is consideration of POLY_INT_CSTs (and
variable polys, as I think there's no range info for those).

We do eventually want to improve how ranger behaves here.  I'm not sure
why, when we do not provide a context 'stmt', it can't compute
a range valid at the SSA name's point of definition (so basically
compute the global range).

Maybe there's another API to do that.  But I thought it was convenient
to use range_of_expr as that also handles GENERIC expression trees
to some extent and those are common within SCEV.
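
If I recall the value-query API correctly, the global query is meant
for exactly that; a sketch (hypothetical helper, the usual GCC includes
assumed):

  /* Range of NAME valid at its point of definition; no context
     statement required.  */
  static bool
  global_range_known_p (tree name, int_range_max &r)
  {
    return (get_global_range_query ()->range_of_expr (r, name)
            && !r.varying_p ()
            && !r.undefined_p ());
  }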

[Bug tree-optimization/114041] wrong code with _BitInt() and -O -fgraphite-identity

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041

--- Comment #14 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041
> 
> --- Comment #12 from Jakub Jelinek  ---
> I can change the comparison into assert, or defer that for stage1?

Defer I think, if you want to bother ...

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 28 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #14 from Hongtao Liu  ---
> (In reply to rguent...@suse.de from comment #13)
> > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> > > 
> > > --- Comment #11 from Hongtao Liu  ---
> > > 
> > > >Loop body is likely going to simplify further, this is difficult
> > > >to guess, we just decrease the result by 1/3.  */
> > > > 
> > > 
> > > This is introduced by r0-68074-g91a01f21abfe19
> > > 
> > > /* Estimate number of insns of completely unrolled loop.  We assume
> > > +   that the size of the unrolled loop is decreased in the
> > > +   following way (the numbers of insns are based on what
> > > +   estimate_num_insns returns for appropriate statements):
> > > +
> > > +   1) exit condition gets removed (2 insns)
> > > +   2) increment of the control variable gets removed (2 insns)
> > > +   3) All remaining statements are likely to get simplified
> > > +  due to constant propagation.  Hard to estimate; just
> > > +  as a heuristics we decrease the rest by 1/3.
> > > +
> > > +   NINSNS is the number of insns in the loop before unrolling.
> > > +   NUNROLL is the number of times the loop is unrolled.  */
> > > +
> > > +static unsigned HOST_WIDE_INT
> > > +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +unsigned HOST_WIDE_INT nunroll)
> > > +{
> > > +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  if (unr_insns <= 0)
> > > +unr_insns = 1;
> > > +  unr_insns *= (nunroll + 1);
> > > +
> > > +  return unr_insns;
> > > +}
> > > 
> > > And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> > > likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> > > keeped.
> > > 
> > > +/* Estimate number of insns of completely unrolled loop.
> > > +   It is (NUNROLL + 1) * size of loop body with taking into account
> > > +   the fact that in last copy everything after exit conditional
> > > +   is dead and that some instructions will be eliminated after
> > > +   peeling.
> > > 
> > > -   NINSNS is the number of insns in the loop before unrolling.
> > > -   NUNROLL is the number of times the loop is unrolled.  */
> > > +   Loop body is likely going to simplify futher, this is difficult
> > > +   to guess, we just decrease the result by 1/3.  */
> > > 
> > >  static unsigned HOST_WIDE_INT
> > > -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> > > +estimated_unrolled_size (struct loop_size *size,
> > >  unsigned HOST_WIDE_INT nunroll)
> > >  {
> > > -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> > > +  HOST_WIDE_INT unr_insns = ((nunroll)
> > > +* (HOST_WIDE_INT) (size->overall
> > > +   -
> > > size->eliminated_by_peeling));
> > > +  if (!nunroll)
> > > +unr_insns = 0;
> > > +  unr_insns += size->last_iteration -
> > > size->last_iteration_eliminated_by_peeling;
> > > +
> > > +  unr_insns = unr_insns * 2 / 3;
> > >if (unr_insns <= 0)
> > >  unr_insns = 1;
> > > -  unr_insns *= (nunroll + 1);
> > > 
> > > It looks to me 1 / 3 overestimates the instructions that can be optimised 
> > > away,
> > > especially if we've subtracted eliminated_by_peeling
> > 
> > Yes, that 1/3 reduction is a bit odd - you could have the same effect
> > by increasing the instruction limit by 1/3, but that means it doesn't
> > really matter, does it?  It would be interesting to see if increasing
> > the limit by 1/3 and removing the above is neutral on SPEC?
> 
> Remove 1/3 reduction get ~2% improvement for 525.x264_r on SPR with
> -march=native -O3, no big impact on other integer benchmark.

454.calculix was always the benchmark to cross check as that benefits
from much unrolling.

I'm all for removing the 1/3 for innermost loop handling (in cunroll
the unrolled loop is then innermost).  I'm more concerned about
unrolling more than one level which is exactly what's required for
454.calculix.

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #11 from Hongtao Liu  ---
> 
> >Loop body is likely going to simplify further, this is difficult
> >to guess, we just decrease the result by 1/3.  */
> > 
> 
> This is introduced by r0-68074-g91a01f21abfe19
> 
> /* Estimate number of insns of completely unrolled loop.  We assume
> +   that the size of the unrolled loop is decreased in the
> +   following way (the numbers of insns are based on what
> +   estimate_num_insns returns for appropriate statements):
> +
> +   1) exit condition gets removed (2 insns)
> +   2) increment of the control variable gets removed (2 insns)
> +   3) All remaining statements are likely to get simplified
> +  due to constant propagation.  Hard to estimate; just
> +  as a heuristics we decrease the rest by 1/3.
> +
> +   NINSNS is the number of insns in the loop before unrolling.
> +   NUNROLL is the number of times the loop is unrolled.  */
> +
> +static unsigned HOST_WIDE_INT
> +estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +unsigned HOST_WIDE_INT nunroll)
> +{
> +  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  if (unr_insns <= 0)
> +unr_insns = 1;
> +  unr_insns *= (nunroll + 1);
> +
> +  return unr_insns;
> +}
> 
> And r0-93444-g08f1af2ed022e0 try do it more accurately by marking
> likely_eliminated stmt and minus that from total insns, But 2 / 3 is still
> keeped.
> 
> +/* Estimate number of insns of completely unrolled loop.
> +   It is (NUNROLL + 1) * size of loop body with taking into account
> +   the fact that in last copy everything after exit conditional
> +   is dead and that some instructions will be eliminated after
> +   peeling.
> 
> -   NINSNS is the number of insns in the loop before unrolling.
> -   NUNROLL is the number of times the loop is unrolled.  */
> +   Loop body is likely going to simplify futher, this is difficult
> +   to guess, we just decrease the result by 1/3.  */
> 
>  static unsigned HOST_WIDE_INT
> -estimated_unrolled_size (unsigned HOST_WIDE_INT ninsns,
> +estimated_unrolled_size (struct loop_size *size,
>  unsigned HOST_WIDE_INT nunroll)
>  {
> -  HOST_WIDE_INT unr_insns = 2 * ((HOST_WIDE_INT) ninsns - 4) / 3;
> +  HOST_WIDE_INT unr_insns = ((nunroll)
> +* (HOST_WIDE_INT) (size->overall
> +   -
> size->eliminated_by_peeling));
> +  if (!nunroll)
> +unr_insns = 0;
> +  unr_insns += size->last_iteration -
> size->last_iteration_eliminated_by_peeling;
> +
> +  unr_insns = unr_insns * 2 / 3;
>if (unr_insns <= 0)
>  unr_insns = 1;
> -  unr_insns *= (nunroll + 1);
> 
> It looks to me 1 / 3 overestimates the instructions that can be optimised 
> away,
> especially if we've subtracted eliminated_by_peeling

Yes, that 1/3 reduction is a bit odd - you could have the same effect
by increasing the instruction limit by 1/3, but that means it doesn't
really matter, does it?  It would be interesting to see if increasing
the limit by 1/3 and removing the above is neutral on SPEC?

Note this kind of "simplification guessing" is most important for
the 2nd-stage unrolling of an outer loop with an unrolled inner loop,
as there are 2nd-level recurrences to be optimized that the "eliminated by
peeling" heuristics do not get (but value-numbering would).  So another
thing to do would be to not do the 1/3 reduction for innermost loops
but only for loops up from that.
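
A sketch of that variant, based on the estimated_unrolled_size quoted
above (the added innermost_p parameter is hypothetical):

  static unsigned HOST_WIDE_INT
  estimated_unrolled_size (struct loop_size *size,
                           unsigned HOST_WIDE_INT nunroll, bool innermost_p)
  {
    HOST_WIDE_INT unr_insns
      = nunroll * (HOST_WIDE_INT) (size->overall
                                   - size->eliminated_by_peeling);
    if (!nunroll)
      unr_insns = 0;
    unr_insns += size->last_iteration
                 - size->last_iteration_eliminated_by_peeling;
    /* Guess at further simplification only above the innermost level,
       where 2nd-level recurrences are not covered by the peeling
       estimate.  */
    if (!innermost_p)
      unr_insns = unr_insns * 2 / 3;
    if (unr_insns <= 0)
      unr_insns = 1;
    return unr_insns;
  }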

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325

--- Comment #12 from rguenther at suse dot de  ---
On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> 
> --- Comment #10 from Hongtao Liu  ---
> (In reply to Hongtao Liu from comment #9)
> > The original case is a little different from the one in PR.
> But the issue is similar, after cunrolli, GCC failed to vectorize the outer
> loop.
> 
> The interesting thing is in estimated_unrolled_size, the original unr_insns is
> 288, which is bigger than param_max_completely_peeled_insns (200), but
> unr_insns is decreased by 1/3 due to
> 
>Loop body is likely going to simplify further, this is difficult
>to guess, we just decrease the result by 1/3.  */
> 
> In practice, this loop body is not simplified for 1/3 of the instructions.
> 
> Considering the unroll factor is 16 and the resulting unr_insns is large
> (192), I was wondering if we could add some heuristic to avoid complete loop
> unrolling, because usually for such a big loop both the loop and BB
> vectorizers may not perform well.

There were several attempts at making the unroller guess less (that 1/3
reduction) but work out what actually will be simplified to be able to
shrink those numbers.

My favorite (but never implemented) idea was to code-generate
optimistically while running value-numbering on-the-fly on the
code and cost the so-simplified unrolled code, stopping when we
reach a limit (and scrapping the so-far accumulated code).  While
reasonably "easy" for unrolled code that ends up without branches,
it gets complicated for branches.

My most recent attempt at improving this was only about tracking what
unrolling estimates will end up constant.

I think what might be the least controversial thing to do is to split
the instruction limit between the early cunrolli and the late cunroll
passes and lower the limit for cunrolli a lot.

[Bug tree-optimization/113441] [14 Regression] Fail to fold the last element with multiple loop since g:2efe3a7de0107618397264017fb045f237764cc7

2024-02-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #28 from rguenther at suse dot de  ---
On Mon, 26 Feb 2024, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #27 from Tamar Christina  ---
> Created attachment 57538
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57538&action=edit
> proposed1.patch
> 
> proposed patch, this gets the gathers and scatters back. doing regression run.

I don't think this will fly.

[Bug tree-optimization/114041] wrong code with _BitInt() and -O -fgraphite-identity

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041

--- Comment #4 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114041
> 
> --- Comment #3 from Jakub Jelinek  ---
> I thought graphite is done only after bitint lowering.
> At that point, there should be just <= MAX_FIXED_MODE_SIZE BITINT_TYPEs around
> in the IL, with the exception of loading SSA_NAMEs from memory in order to 
> pass
> them to function calls, or undefined SSA_NAMEs, or usually TREE_ADDRESSABLE
> PARM_DECLs being converted to something else.
> Would be probably wise to punt on graphite optimizations on loops which have
> those larger SSA_NAMEs anywhere in the loop bodies.
> That said, ub4_0_17(D) is unsigned _BitInt(4).

I think we want to understand what goes wrong, not just disable
graphite on any BitInt.

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #23 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #22 from Jakub Jelinek  ---
> Yeah, I was worried about partial ints.  Or it could be punt if TYPE_MODEs are
> different and at least one of them is BLKmode.

Well, then you can also check whether one mode is BLKmode.

Btw, I think forwprop is lucky if it really changes

  _2 = VIEW_CONVERT_EXPR<uint256_t>(x);

to

  _2 = (uint256_t) x_1(D);

because match on its own would create

  _2 = (uint256_t) x;

which would be invalid GIMPLE.  So are you sure it's not
update-address-taken first rewriting x to a register?

For 'foo' I see x becomes a register already during into-SSA so
another fix might be to re-consider and make x a non-register
because of that V_C_E?  (and hopefully the match pattern would
then not be applied, but I'm not 100% sure on that)

In principle there's nothing wrong with the transform; it's
the use of uint256_t that makes this problematic.

So maybe we should bite the bullet and try the memcpy folding
fix even at this stage ...

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-22 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #21 from rguenther at suse dot de  ---
On Thu, 22 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #20 from Jakub Jelinek  ---
> (In reply to rguent...@suse.de from comment #19)
> > I think the usual BLKmode check would be better here?  Apart from
> > that this looks correct, we shouldn't use a regular convert on
> > a non-register type.  In fact, it looks like all bitint types are
> > register types because we want SSA names for them.  A bit of a
> > "bad" design ...
> > 
> > We've used BLKmode checks elsewhere so I think it would be appropriate
> > here, too.
> 
> But we don't want to disallow turning VIEW_CONVERT_EXPR from one _BitInt to
> another _BitInt of the same size, even if they both have BLKmode.
> So, if we'd use BLKmode check, it would need to be one of the types is
> INTEGER_TYPE and the other type is BLKmode BITINT_TYPE (or vice versa).
> Or shall the test be TYPE_MODE (type) == TYPE_MODE (TREE_TYPE (@0)) regardless
> of precision?
> Or it could be the mode check && at least one of the involved types is
> BITINT_TYPE,
> that would maintain existing behavior when _BitInt isn't involved.
> 
> The #c18 patch passed bootstrap/regtest, the following:
> 2024-02-22  Jakub Jelinek  
> 
> PR tree-optimization/113988
> * match.pd (view_convert -> convert): Punt for VCEs involving
> BITINT_TYPEs if there is mode mismatch.
> 
> * gcc.dg/bitint-91.c: New test.
> 
> --- gcc/match.pd.jj 2024-02-19 09:42:16.583617451 +0100
> +++ gcc/match.pd    2024-02-22 09:00:34.302420089 +0100
> @@ -4679,7 +4679,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(view_convert @0)
>(if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE
> (@0)))
> -   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
> +   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
> +   /* Punt for conversions with mode mismatches if BITINT_TYPEs are
> + involved.  */
> +   && (TYPE_MODE (type) == TYPE_MODE (TREE_TYPE (@0))

I like the mode check, and I think we should avoid the transform
also for non-_BitInt types in case the mode differs; I can only think
of odd partial-int vs. int mode cases here with pointer vs.
integer?

> +  || (TREE_CODE (type) != BITINT_TYPE
> +  && TREE_CODE (TREE_TYPE (@0)) != BITINT_TYPE)))
> (convert @0)))
> 
>  /* Strip inner integral conversions that do not change precision or size, or
> --- gcc/testsuite/gcc.dg/bitint-91.c.jj 2024-02-21 13:47:55.244885020 +0100
> +++ gcc/testsuite/gcc.dg/bitint-91.c    2024-02-21 12:51:16.900026979 +0100
> @@ -0,0 +1,38 @@
> +/* PR tree-optimization/113988 */
> +/* { dg-do compile { target bitint } } */
> +/* { dg-options "-O2" } */
> +/* { dg-additional-options "-mavx512f" { target i?86-*-* x86_64-*-* } } */
> +
> +int i;
> +
> +#if __BITINT_MAXWIDTH__ >= 256
> +void
> +foo (void *p, _BitInt(256) x)
> +{
> +  __builtin_memcpy (p, &x, sizeof x);
> +}
> +
> +_BitInt(256)
> +bar (void *p, _BitInt(256) x)
> 
> passed
> make check-gcc check-g++ -j32 -k GCC_TEST_RUN_EXPENSIVE=1
> RUNTESTFLAGS="GCC_TEST_RUN_EXPENSIVE=1 dg.exp='*bitint* pr112673.c
> builtin-stdc-bit-*.c pr112566-2.c pr112511.c' dg-torture.exp=*bitint*
> dfp.exp=*bitint*"
> 
>

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 21 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #17 from Jakub Jelinek  ---
> So, either we could somehow handle that case during expansion (treat it
> basically as VCE), or tweak the
> /* For integral conversions with the same precision or pointer
>conversions use a NOP_EXPR instead.  */
> (simplify
>   (view_convert @0)
>   (if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
>&& (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE 
> (@0)))
>&& TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
>(convert @0)))
> match.pd rule not to do that for INTEGER_TYPEs with PRECISION >
> MAX_FIXED_TYPE_PRECISION (then we don't need the gimple-lower-bitint.cc 
> changes
> either).
> --- gcc/match.pd.jj 2024-02-19 09:42:16.583617451 +0100
> +++ gcc/match.pd    2024-02-21 13:32:06.567816298 +0100
> @@ -4679,7 +4679,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(view_convert @0)
>(if ((INTEGRAL_TYPE_P (type) || POINTER_TYPE_P (type))
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE
> (@0)))
> -   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0)))
> +   && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (@0))
> +   /* Punt for conversions from or to barely supported huge
> + INTEGER_TYPEs.  Those can handle just loads/stores/moves but
> + nothing else.  */
> +   && (TYPE_PRECISION (type) <= MAX_FIXED_MODE_SIZE
> +  || (TREE_CODE (type) != INTEGER_TYPE
> +  && TREE_CODE (TREE_TYPE (@0)) != INTEGER_TYPE)))
> (convert @0)))
> 
>  /* Strip inner integral conversions that do not change precision or size, or

I think the usual BLKmode check would be better here?  Apart from
that this looks correct, we shouldn't use a regular convert on
a non-register type.  In fact, it looks like all bitint types are
register types because we want SSA names for them.  A bit of a
"bad" design ...

We've used BLKmode checks elsewhere so I think it would be appropriate
here, too.

[Bug ipa/113476] [14 Regression] irange::maybe_resize leaks memory via IPA VRP

2024-02-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113476

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 21 Feb 2024, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113476
> 
> --- Comment #8 from Aldy Hernandez  ---
> (In reply to Richard Biener from comment #5)
> > (In reply to Martin Jambor from comment #4)
> > > The right place where to free stuff in lattices post-IPA would be in
> > > ipa_node_params::~ipa_node_params() where we should iterate over lattices
> > > and deinitialize them or perhaps destruct the array because since
> > > ipcp_vr_lattice directly contains Value_Range which AFAIU directly 
> > > contains
> > > int_range_max which has a virtual destructor... does not look like a POD
> > > anymore.  This has escaped me when I was looking at the IPA-VR changes but
> > > hopefully it should not be too difficult to deal with.
> > 
> > OK, that might work for the IPA side.
> > 
> > It's quite unusual to introduce a virtual DTOR in the middle of the class
> > hierarchy though.  Grepping I do see quite some direct uses of 'irange'
> > and also 'vrange' which do not have the DTOR visible but 'irange' already
> > exposes and uses 'maybe_resize'.  I think those should only be introduced
> > in the class exposing the virtual DTOR (but why virtual?!).
> > 
> > Would be nice to have a picture of the range class hierarchies with
> > pointers on which types to use in which circumstances ...
> 
> For reference, you should always use the most base class you can.  If
> you can get the work done with the vrange API, use that.  If you're
> dealing with an integer, use irange.  The int_range<> derived class
> should only be used for defining a variable, so:
> 
> int_range<123> foobar; // Defined with storage.
> 
> if (is_a <irange> (foobar))
> {
>   irange &p = as_a <irange> (foobar);
>   ...
> }
> 
> // Use an irange reference here, not an int_range reference.
> void somefunc (const irange &r)
> {
> }
> 
> Also, the reason irange does not have a destructor is because it has
> no storage.  Only int_range<> has storage associated with it, so it is
> the only one able to see if the range grew:

But when I do

 irange *r = new int_range<>;
 delete r;

then it will fail to release memory?  Are virtual DTORs not exactly
invented for this, when you delete on a reference/pointer to a base
class but the object is really a derived one?

That's what I was referring to with "introducing a virtual DTOR
in a non-base class".

I still think that's bad OO design, no?
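
To make the concern concrete, a minimal generic C++ sketch (made-up
names, not the actual range classes) of deleting through a base-class
pointer:

  struct base
  {
    ~base () {}                         // non-virtual destructor
  };

  struct derived : base
  {
    int *storage = new int[123];
    ~derived () { delete[] storage; }   // not invoked by the delete below
  };

  int
  main ()
  {
    base *r = new derived;
    delete r;  // undefined behavior without a virtual ~base():
               // ~derived() never runs and storage leaks
  }

With a virtual ~base() the delete dispatches to ~derived() and the
storage is released.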

[Bug middle-end/113988] during GIMPLE pass: bitintlower: internal compiler error: in lower_stmt, at gimple-lower-bitint.cc:5470

2024-02-19 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988

--- Comment #12 from rguenther at suse dot de  ---
On Mon, 19 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113988
> 
> --- Comment #10 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Jakub Jelinek from comment #7)
> > > > I think I can handle it like the VIEW_CONVERT_EXPR case, bet with
> > > > _BitInt(511) it would actually be a VCE, but when it is same size
> > > > BITINT_TYPE to INTEGER_TYPE we choose NOP_EXPR.
> > > > That said, I think it would be better if the memcpy folding used say 
> > > > vector
> > > > types instead of these extra large integer types.
> > > 
> > > Hmm.  Maybe we want a target hook to specify the "move mode" for a given
> > > byte size and then we can use bitwise_type_for_mode to get a type?
> > > 
> > > Maybe we can even get rid of that large integer mode requirement that way 
> > > ...
> > 
> > Or we refuse to use integer types for > MAX_FIXED_MODE_SIZE sizes and
> > instead always try VNQImode?  For memcpy folding I mean.
> 
> Yeah, and punt if that mode isn't supported.

Or simply use mode_for_size with MODE_VECTOR_INT (though that doesn't
check targetm.vector_mode_supported_p for vector modes).  Just
in case a target doesn't have QImode component support.
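
As a rough sketch of that direction (the exact calls here are an
assumption on my side, not a tested patch; 'size' stands for the
number of bytes to copy):

  /* Prefer an integer vector mode of the requested size over a
     too-large scalar integer mode; punt if the target has no
     supported vector mode of that size.  */
  machine_mode mode;
  tree type = NULL_TREE;
  if (mode_for_size (size * BITS_PER_UNIT, MODE_VECTOR_INT, 0).exists (&mode)
      && targetm.vector_mode_supported_p (mode))
    type = bitwise_type_for_mode (mode);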

[Bug middle-end/111156] [14 Regression] aarch64 aarch64/sve/mask_struct_store_4.c failures

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111156

--- Comment #22 from rguenther at suse dot de  ---
> On 15.02.2024 at 19:53, tnfchris at gcc dot gnu.org wrote:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111156
> 
> --- Comment #21 from Tamar Christina  ---
> (In reply to Richard Biener from comment #18)
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index 7cf9504398c..8deeecfd4aa 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -1280,8 +1280,11 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
>> *swap,
>>&& rhs_code.is_tree_code ()
>>&& (TREE_CODE_CLASS (tree_code (first_stmt_code))
>>== tcc_comparison)
>> -   && (swap_tree_comparison (tree_code (first_stmt_code))
>> -   == tree_code (rhs_code)))
>> +   && ((swap_tree_comparison (tree_code (first_stmt_code))
>> +== tree_code (rhs_code))
>> +   || ((TREE_CODE_CLASS (tree_code (alt_stmt_code))
>> +== tcc_comparison)
>> +   && rhs_code == alt_stmt_code)))
>>   && !(STMT_VINFO_GROUPED_ACCESS (stmt_info)
>>&& (first_stmt_code == ARRAY_REF
>>|| first_stmt_code == BIT_FIELD_REF
>> 
>> should get you SLP but:
>> 
>> t.c:8:26: note:   === vect_slp_analyze_operations ===
>> t.c:8:26: note:   ==> examining statement: pretmp_29 = *_28;
>> t.c:8:26: missed:   unsupported load permutation
>> t.c:10:30: missed:   not vectorized: relevant stmt not supported: pretmp_29
>> = *_28;
>> 
>> t.c:8:26: note:   op template: pretmp_29 = *_28;
>> t.c:8:26: note: stmt 0 pretmp_29 = *_28;
>> t.c:8:26: note: stmt 1 pretmp_29 = *_28;
>> t.c:8:26: note: load permutation { 0 0 }
> 
> hmm with that applied I get:
> 
> sve-mis.c:8:26: note:   ==> examining statement: pretmp_29 = *_28;
> sve-mis.c:8:26: note:   Vectorizing an unaligned access.
> sve-mis.c:8:26: note:   vect_model_load_cost: unaligned supported by hardware.
> sve-mis.c:8:26: note:   vect_model_load_cost: inside_cost = 1, prologue_cost =
> 0 .
> 
> but it bails out at:
> 
> sve-mis.c:8:26: missed:   Not using elementwise accesses due to variable
> vectorization factor.
> sve-mis.c:10:25: missed:   not vectorized: relevant stmt not supported:
> .MASK_STORE (_5, 8B, _27, pretmp_29);
> sve-mis.c:8:26: missed:  bad operation or unsupported loop bound.
> 
> for me

I’ve used -fno-vect-cost-model and looked at the SVE variant only.


[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #35 from rguenther at suse dot de  ---
On Thu, 15 Feb 2024, rguenther at suse dot de wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #34 from rguenther at suse dot de  ---
> On Thu, 15 Feb 2024, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> > 
> > --- Comment #32 from Jakub Jelinek  ---
> > (In reply to Jan Hubicka from comment #31)
> [...]
> > > Not streaming value ranges is an omission on my side (I mistakenly assumed
> > > we do stream them).  We ought to stream them, since otherwise we will lose
> > > propagated return value ranges in partitioned programs, which is a pity.
> > 
> > Not sure if it is a good idea to start streaming them now in stage4, but 
> > sure,
> 
> Please don't do that now ;)
> 
> (we've not had ranges early originally, and even nonzero bits were not
> computed)
> 
> But yes, IPA CP jump functions with value-ranges might be a problem.
> I think it does instantiate them as local ranges, does it?  We
> have streaming support for ranges, we just don't stream the auxiliary
> data for all SSA names (like also not points-to info).

Also remember we like to have a fix that's easily backportable, and
that's probably going to be resetting the info.  We can do something
more fancy for GCC 15.

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-15 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #34 from rguenther at suse dot de  ---
On Thu, 15 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #32 from Jakub Jelinek  ---
> (In reply to Jan Hubicka from comment #31)
[...]
> > Not streaming value ranges is an omission on my side (I mistakenly assumed we
> > do stream them).  We ought to stream them, since otherwise we will lose
> > propagated return value ranges in partitioned programs, which is a pity.
> 
> Not sure if it is a good idea to start streaming them now in stage4, but sure,

Please don't do that now ;)

(we've not had ranges early originally, and even nonzero bits were not
computed)

But yes, IPA CP jump functions with value-ranges might be a problem.
I think it does instantiate them as local ranges, does it?  We
have streaming support for ranges, we just don't stream the auxiliary
data for all SSA names (like also not points-to info).

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #18 from rguenther at suse dot de  ---
On Wed, 14 Feb 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #17 from Jan Hubicka  ---
> > > I guess PTA gets around by tracking points-to set also for non-pointer
> > > types and consequently it also gives up on any such addition.
> > 
> > It does.  But note it does _not_ for POINTER_PLUS where it treats
> > the offset operand as non-pointer.
> > 
> > > I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> > > pointer_plus expression, but does not look through POINTER_PLUS.
> > > We can restrict it further, but tracking base pointer is quite useful,
> > > so it would be nice to not give up completely.
> > 
> > It looks like that function might treat that
> > 
> >  ADDR_EXPR <TARGET_MEM_REF ... >
> > 
> > as integer_zerop base.  It does
> > 
> >   if (TREE_CODE (op) == ADDR_EXPR) 
> > {
> >   poly_int64 extra_offset = 0; 
> >   tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
> >  &offset);
> >   if (!base)
> > {
> >   base = get_base_address (TREE_OPERAND (op, 0));
> >   if (TREE_CODE (base) != MEM_REF)
> > break;
> >   offset_known = false;
> > }
> >   else
> > {
> >   if (TREE_CODE (base) != MEM_REF)
> > break;
> > 
> > with a variable offset we fall to the TREE_CODE (base) != MEM_REF
> > and will have offset_known == true.  Not sure what it does with
> > the result though (it's not the address of a decl).
> > 
> > This function seems to oddly special-case != MEM_REF ... (maybe
> > it wants to handle DECL_P () as finishing?
> 
> Hmm the function was definitely not written with TARGET_MEM_REF in mind,
> since it was originally used for IPA passes only.
> We basically want to handle stuff like
>  &decl->foo
> or
>  &(ptr->foo)
> In the second case we want to continue the SSA walk to hopefully work
> out the origin of PTR.
> ipa-modref then looks if the base pointer is derived from function
> parameter or points to local or readonly memory to produce its summary.
> > 
> > Note get_addr_base_and_unit_offset will return NULL for
> > a TARGET_MEM_REF <&decl, ..., offset> but TARGET_MEM_REF
> > itself if the base isn't an ADDR_EXPR, irrespective of whether
> > the offset within it is constant or not.
> 
> Hmm, interesting.  I would expect it to interpret the semantics of TMR
> and return base.
> > 
> > Not sure if the above is a problem, but it seems the only caller
> > will just call points_to_local_or_readonly_memory_p on the
> > ADDR_EXPR where refs_local_or_readonly_memory_p via
> > points_to_local_or_readonly_memory_p will eventually do
> > 
> >   /* See if memory location is clearly invalid.  */
> >   if (integer_zerop (t))
> > return flag_delete_null_pointer_checks;
> > 
> > and that might be a problem.  As said, we rely on
> > ADDR_EXPR <TARGET_MEM_REF ... > to be an address computation
> > that's not subject to strict interpretation to allow IVOPTs
> > doing this kind of optimization w/o introducing some kind of
> > INTEGER_LEA <...>.  I know that's a bit awkward but we should
> > make sure this is honored by IPA as well.
> > 
> > I'd say
> > 
> > diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> > index 74c9b4e1d1e..45a770cf940 100644
> > --- a/gcc/ipa-fnsummary.cc
> > +++ b/gcc/ipa-fnsummary.cc
> > @@ -2642,7 +2642,8 @@ points_to_local_or_readonly_memory_p (tree t)
> > return true;
> >return !ptr_deref_may_alias_global_p (t, false);
> >  }
> > -  if (TREE_CODE (t) == ADDR_EXPR)
> > +  if (TREE_CODE (t) == ADDR_EXPR
> > +  && TREE_CODE (TREE_OPERAND (t, 0)) != TARGET_MEM_REF)
> >  return refs_local_or_readonly_memory_p (TREE_OPERAND (t, 0));
> >return false;
> >  }
> > 
> > might eventually work?  Alternatively a bit less aggressive like
> > the following.
> > 
> > diff --git a/gcc/ipa-fnsummary.cc b/gcc/ipa-fnsummary.cc
> > index 74c9b4e1d1e..7c79adf6440 100644
> > --- a/gcc/ipa-fnsummary.cc
> > +++ b/gcc/ipa-fnsummary.cc
> > @@ -2642,7 +2642,9 @@ points_to_local_or_readonly_memory_p (tree t)
> > return true;
> > 

[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907

--- Comment #28 from rguenther at suse dot de  ---
On Wed, 14 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907
> 
> --- Comment #27 from Jakub Jelinek  ---
> So:
> --- gcc/ipa-icf.cc.jj   2024-02-10 11:25:09.645478952 +0100
> +++ gcc/ipa-icf.cc  2024-02-14 10:44:27.906216458 +0100
> @@ -1244,6 +1244,29 @@ sem_function::merge (sem_item *alias_ite
>else
>  create_alias = true;
> 
> +  unsigned i;
> +  tree name;
> +  FOR_EACH_SSA_NAME (i, name, original->get_fun ())
> +{
> +  /* We need to either merge or reset SSA_NAME_*_INFO.
> +For merging we don't preserve the mapping between
> +original and alias SSA_NAMEs from successful equals
> +calls.  */
> +  if (POINTER_TYPE_P (TREE_TYPE (name)))
> +{
> + if (SSA_NAME_PTR_INFO (name))
> +   {
> + gcc_assert (!flag_wpa);
> + SSA_NAME_PTR_INFO (name) = NULL;
> +   }
> +}
> +  else if (SSA_NAME_RANGE_INFO (name))
> +   {
> + gcc_assert (!flag_wpa);
> + SSA_NAME_RANGE_INFO (name) = NULL;
> +   }
> +}
> +
>if (redirect_callers)
>  {
>int nredirected = redirect_all_callers (alias, local_original);
> then?

Yeah, though can't we do this in the caller and thus only once for the
target when we merge more than 2 nodes?

> For the merging, I guess we'd need to move one of the 2 vec vectors
> from m_checker to the sem_function instead of throwing it away in
> sem_function::equals
> if returning true.

It might require quite some amount of memory though when N is big;
I'm not sure it's worth it.

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-14 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #16 from rguenther at suse dot de  ---
On Tue, 13 Feb 2024, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #15 from Jan Hubicka  ---
> > 
> > IVOPTs does the above but it does it (or should) as
> > 
> >   offset = (uintptr)&base2 - (uintptr)&base1;
> >   val = *((T *)((uintptr)base1 + i + offset))
> > 
> > which is OK for points-to as no POINTER_PLUS_EXPR is involved so the
> > resulting pointer points to both base1 and base2 (which isn't optimal
> > but correct).
> > 
> > If we somehow get back a POINTER_PLUS that's where things go wrong.
> > 
> > Doing the above in C code would be valid input so we have to treat
> > it correctly (OK, the standard only allows back-and-forth
> > pointer-to-integer casts w/o any adjustment, but of course we relax
> > this).
> 
> OK. Modref tracks base pointers for accesses and tries to prove that
> they are function parameters.  This should imitate ivopts:
> void
> __attribute__ ((noinline))
> set(int *a, unsigned long off)
> {
>   *(int *)((unsigned long)a + off) = 1;
> }
> int
> test ()
> {
>   int a;
>   int b = 0;
>   set (&a, (unsigned long)&b - (unsigned long)&a);
>   return b;
> }
> 
> Here set gets following gimple at modref2 time:
> __attribute__((noinline)) 
> void set (int * a, long unsigned int off)
> {
>   long unsigned int a.0_1;
>   long unsigned int _2;
>   int * _3; 
> 
>[local count: 1073741824]:
>   a.0_1 = (long unsigned int) a_4(D);
>   _2 = a.0_1 + off_5(D); 
>   _3 = (int *) _2; 
>   *_3 = 1; 
>   return;
> 
> }
> 
> This is not pattern matched so modref does not think the access has a as
> a base:
> 
>   stores:
>   Base 0: alias set 1
> Ref 0: alias set 1
>   Every access
> 
> While for:
> 
> void
> __attribute__ ((noinline))
> set(int *a, unsigned long off)
> {
>   *(a+off/sizeof(int))=1;
> }
> 
> we produce:
> 
> __attribute__((noinline))
> void set (int * a, long unsigned int off)
> {
>   sizetype _1;
>   int * _2;
> 
>[local count: 1073741824]:
>   _1 = off_3(D) & 18446744073709551612;
>   _2 = a_4(D) + _1;
>   *_2 = 1;
>   return;
> }
> 
> And this is understood:
> 
>   stores:
>   Base 0: alias set 1
> Ref 0: alias set 1
>   access: Parm 0
> 
> If we consider it correct to optimize out the conversion from and to
> pointer type, then I suppose any addition of pointer and integer which
> we do not see means that we need to give up on tracking base completely.
> 
> I guess PTA gets around by tracking points-to set also for non-pointer
> types and consequently it also gives up on any such addition.
> 
> But what we really get from relaxing this?
> > 
> > IVOPTs then in putting all of the stuff into 'offset' gets at
> > trying a TARGET_MEM_REF based on a NULL base but that's invalid.
> > We then resort to a LEA (ADDR_EXPR of TARGET_MEM_REF) to compute
> > the address which gets us into some phishy argument that it's
> > not valid to decompose ADDR_EXPR of TARGET_MEM_REF to
> > POINTER_PLUS of the TARGET_MEM_REF base and the offset.  But
> > that's how it is (points-to treats (address of) TARGET_MEM_REF
> > as pointing to anything ...).
> > 
> > > A quick fix would be to run IPA modref before ivopts, but I do not see 
> > > how such
> > > transformation can work with rest of alias analysis (PTA etc)
> > 
> > It does.  Somewhere IPA modref interprets things wrongly, I didn't figure
> > out here though.
> 
> 
> I guess PTA gets around by tracking points-to set also for non-pointer
> types and consequently it also gives up on any such addition.

It does.  But note it does _not_ for POINTER_PLUS where it treats
the offset operand as non-pointer.

> I think it is ipa-prop.c::unadjusted_ptr_and_unit_offset. It accepts
> pointer_plus expression, but does not look through POINTER_PLUS.
> We can restrict it further, but tracking base pointer is quite useful,
> so it would be nice to not give up completely.

It looks like that function might treat that

 ADDR_EXPR <TARGET_MEM_REF ... >

as integer_zerop base.  It does

  if (TREE_CODE (op) == ADDR_EXPR) 
{
  poly_int64 extra_offset = 0; 
  tree base = get_addr_base_and_unit_offset (TREE_OPERAND (op, 0),
 &offset);
  if (!base)
{
  base = get_base_address (TREE_OPERAND (op, 0));
  if (TREE_CODE (ba

[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64

2024-02-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787

--- Comment #14 from rguenther at suse dot de  ---
On Tue, 13 Feb 2024, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787
> 
> --- Comment #13 from Jan Hubicka  ---
> So my understanding is that ivopts does something like
> 
>  offset = &base2 - &base1
> 
> and then translate
>  val = base2[i]
> to
>  val = *((base1+i)+offset)
> 
> Where (base1+i) is then an iv variable.
> 
> I wonder if we consider doing memory reference with base changed via offset a
> valid transformation. Is there way to tell when this happens?

IVOPTs does the above but it does it (or should) as

  offset = (uintptr)&base2 - (uintptr)&base1;
  val = *((T *)((uintptr)base1 + i + offset))

which is OK for points-to as no POINTER_PLUS_EXPR is involved so the
resulting pointer points to both base1 and base2 (which isn't optimal
but correct).

If we somehow get back a POINTER_PLUS that's where things go wrong.

Doing the above in C code would be valid input so we have to treat
it correctly (OK, the standard only allows back-and-forth
pointer-to-integer casts w/o any adjustment, but of course we relax
this).
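
Spelled out as plain C (a sketch with made-up names, not code from the
PR), the input we have to treat correctly looks like:

  /* The pointer round-trips through an integer, so points-to must
     assume the result can point to either base.  */
  double
  load_rebased (double *base1, double *base2, unsigned long i)
  {
    unsigned long off = (unsigned long) base2 - (unsigned long) base1;
    /* i is a byte offset here.  */
    return *(double *) ((unsigned long) base1 + i + off);
  }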

IVOPTs then in putting all of the stuff into 'offset' gets at
trying a TARGET_MEM_REF based on a NULL base but that's invalid.
We then resort to a LEA (ADDR_EXPR of TARGET_MEM_REF) to compute
the address which gets us into some phishy argument that it's
not valid to decompose ADDR_EXPR of TARGET_MEM_REF to
POINTER_PLUS of the TARGET_MEM_REF base and the offset.  But
that's how it is (points-to treats (address of) TARGET_MEM_REF
as pointing to anything ...).

> A quick fix would be to run IPA modref before ivopts, but I do not see how 
> such
> transformation can work with rest of alias analysis (PTA etc)

It does.  Somewhere IPA modref interprets things wrongly, I didn't figure
out here though.

[Bug c++/113852] -Wsign-compare doesn't warn on unsigned result types

2024-02-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113852

--- Comment #8 from rguenther at suse dot de  ---
On Mon, 12 Feb 2024, admin at computerquip dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113852
> 
> --- Comment #7 from Zachary L  ---
> (In reply to Richard Biener from comment #6)
> > Well, given athat a1 * a2 is carried out in 'int' you are invoking undefined
> > behavior if it overflows.  GCC assumes that doesn't happen so it's correct
> > to elide the diagnostic.  Unless you make overflow well-defined with 
> > -fwrapv.
> > 
> > I think that errors on the right side for the purpose of -Wsign-compare.
> 
> Removing a diagnostic because the program could be ill-formed seems backwards
> to me, especially since it seems like there's already logic in performing this
> diagnostic. Perhaps I've misunderstood the situation?

Well, we could, for each int * int multiplication, diagnose a "may invoke
undefined behavior if overflow happens at runtime" but that's hardly useful.
So we want to diagnose cases where it's _likely_ that there's an
issue.

For the case at hand it's not likely that ushort * ushort invokes
undefined behavior (because invoking undefined should be unlikely,
very much so).  So we choose to not diagnose it as "maybe".

Yes, with constexpr evaluation we actually _see_ we invoke undefined
behavior and diagnose that.  The subsequent diagnostic of
-Wsign-compare is then spurious though and IMO should be not given.
The error is the integer overflow invoking undefined behavior even
- if that didn't happen there would be no issue with comparing 
signed/unsigned values.
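
As a minimal illustration (hypothetical values, not the original
testcase):

  unsigned short a1 = 65535, a2 = 65535;
  unsigned long b = 1;

  int
  cmp (void)
  {
    /* a1 * a2 is carried out in int after integral promotion;
       65535 * 65535 = 4294836225 > INT_MAX, so this overflows
       before the signed/unsigned comparison even happens.  */
    return a1 * a2 < b;
  }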

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #16 from JuzheZhong  ---
> The FMA is generated in widening_mul PASS:
> 
> Before widening_mul (fab1):
> 
>   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
>   _6 = _5 * 1.229982236431605997495353221893310546875e-1;
>   _8 = _4 + _6;

So this is x + (CST1 - x) * CST2 which we might fold/associate to
x * (1. - CST2) + CST1 * CST2

this looks like something for reassociation (it knows some rules,
like what it does in undistribute_ops_list; I'm not sure if that
comes into play here already, since this would be doing the reverse
before).  A match.pd pattern also works, but it wouldn't be
general enough to handle more complicated but similar cases.
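
For illustration, the algebra as two C functions (a sketch; with
constant c1 and c2 both products in the second form fold at compile
time, but for floating point the rewrite is only valid under
reassociation-permitting flags):

  double f_orig (double x, double c1, double c2)
  { return x + (c1 - x) * c2; }            /* x + (CST1 - x) * CST2 */

  double f_reassoc (double x, double c1, double c2)
  { return x * (1.0 - c2) + c1 * c2; }     /* x * (1 - CST2) + CST1 * CST2 */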

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-02-07 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #14 from rguenther at suse dot de  ---
On Wed, 7 Feb 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #13 from JuzheZhong  ---
> Ok. I found the optimized tree:
> 
> 
>   _5 = 3.33314829616256247390992939472198486328125e-1 - _4;
>   _8 = .FMA (_5, 1.229982236431605997495353221893310546875e-1, 
> _4);
> 
> Let CST0 = 3.33314829616256247390992939472198486328125e-1,
> CST1 = 1.229982236431605997495353221893310546875e-1
> 
> The expression is equivalent to the following:
> 
> _5 = CST0 - _4;
> _8 = _5 * CST1 + _4;
> 
> That is:
> 
> _8 = (CST0 - _4) * CST1 + _4;
> 
> So, We should be able to re-associate it like Clang:
> 
> _8 = CST0 * CST1 - _4 * CST1 + _4; ---> _8 = CST0 * CST1 + _4 * (1 - CST1);
> 
> Since both CST0 * CST1 and 1 - CST1 can be pre-computed during compilation
> time.
> 
> Let's say CST2 = CST0 * CST1 and CST3 = 1 - CST1; then we can re-associate as
> Clang:
> 
> _8 = FMA (_4, CST3, CST2).
> 
> Any suggestions for this re-association?  Is match.pd the right place to do it?

You need to look at the IL before we do .FMA forming, specifically 
before/after the late reassoc pass.  There pass applying match.pd
patterns everywhere is forwprop.

I also wonder which compilation flags you are using (note clang
has different defaults for example for -ftrapping-math)

[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64

2024-02-06 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359

--- Comment #13 from rguenther at suse dot de  ---
On Tue, 6 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359
> 
> --- Comment #12 from Jakub Jelinek  ---
> Just going from the demangled name of
> std::pair std::chrono::duration > > const,
> Context*>
> it would surprise me if it was ODR violation in the testcase, because class
> Context
> is only defined in Timer.ii, not in the other preprocessed source,
> ceph::mono_clock is defined in both but looks the same (and it is empty class
> anyway), and the pair uses Context* as second type anyway, so it is unclear 
> how
> it could become something smaller than pointer.
> But, admittedly I haven't looked up at this under the debugger and not even
> sure where to look at that.

Might be also an interaction with IPA ICF in case there's a pointer to
the pair involved?

[Bug tree-optimization/113736] ICE: verify_gimple failed: incompatible types in 'PHI' argument 0 with _BitInt() struct copy to __seg_fs/gs

2024-02-05 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113736

--- Comment #3 from rguenther at suse dot de  ---
On Sat, 3 Feb 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113736
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>Last reconfirmed||2024-02-03
>  CC||rguenth at gcc dot gnu.org
>  Ever confirmed|0   |1
>  Status|UNCONFIRMED |NEW
> 
> --- Comment #1 from Jakub Jelinek  ---
> struct S { _BitInt(710) a; };
> struct T { struct S b[2]; };
> 
> void
> foo (__seg_gs struct T *p)
> {
>   struct S s;
>   p->b[0] = s;
> }
> 
> Bitint lowering changes here
>   MEM < _BitInt(768)> [(<address-space-1> struct T *)p_2(D)] = s_4(D);
> to
>   VIEW_CONVERT_EXPR(MEM < _BitInt(768)>
> [(<address-space-1> struct T *)p_2(D)])[_5] = s_7(D);
> accesses in a loop.  Is that invalid and should have <address-space-1> also in
> the VCE type?  Or is this just a vectorizer bug?

I think that's OK, I will have a look.

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #18 from JuzheZhong  ---
> (In reply to rguent...@suse.de from comment #17)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #16 from JuzheZhong  ---
> > > (In reply to rguent...@suse.de from comment #15)
> > > > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > > > 
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > > > 
> > > > > --- Comment #14 from JuzheZhong  ---
> > > > > Thanks Richard.
> > > > > 
> > > > > It seems that we can't fix this issue for now. Is that right ?
> > > > > 
> > > > > If I understand correctly, do you mean we should wait after SLP 
> > > > > representations
> > > > > are finished and then revisit this PR?
> > > > 
> > > > Yes.
> > > 
> > > It seems to be a big refactor work.
> > 
> > It's not too bad if people wouldn't continue to add features not 
> > implementing SLP ...
> > 
> > I wonder if I can do anything to help with SLP representations?
> > 
> > I hope to get back to this before stage1 re-opens and will post
> > another request for testing.  It's really mostly going to be making
> > sure all paths have coverage which means testing all the various
> > architectures - I can only easily test x86.  There's a branch
> > I worked on last year, refs/users/rguenth/heads/vect-force-slp,
> > which I use to hunt down cases not supporting SLP (it's a bit
> > overeager to trigger, and it has known holes so it's not really
> > a good starting point yet for folks to try other archs).
> 
> Ok. It seems that you are almost done with that, but it needs more testing on
> various targets.
> 
> So, if I want to work on optimizing vectorization (start with TSVC),
> I should avoid touching the cases that fail to vectorize due to data reference/dependence
> analysis (e.g. this PR case, s116).

It depends on the actual case - the one in this bug at least looks like
half of it might be dealt with by the refactoring.

> and avoid adding new features into the loop vectorizer, e.g. min/max
> reduction with index (s315).

It's fine to add features if they work with SLP as well ;)  Note that
in the future SLP will also do the "single lane" case but it doesn't
do that on trunk.  Some features are difficult with multi-lane SLP
and probably not important in practice for that case; still, handling
single-lane SLP will be important as otherwise the feature is lost.

> To not to make your SLP refactoring work heavier.
> 
> Am I right ?

Yes.  I've got early break vectorization to chase now, I was "finished"
with the parts I could exercise on x86_64 in autumn ...

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #17 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #16 from JuzheZhong  ---
> (In reply to rguent...@suse.de from comment #15)
> > On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> > > 
> > > --- Comment #14 from JuzheZhong  ---
> > > Thanks Richard.
> > > 
> > > It seems that we can't fix this issue for now. Is that right ?
> > > 
> > > If I understand correctly, do you mean we should wait after SLP 
> > > representations
> > > are finished and then revisit this PR?
> > 
> > Yes.
> 
> It seems to be a big refactor work.

It's not too bad if people wouldn't continue to add features not 
implementing SLP ...

> I wonder if I can do anything to help with SLP representations?

I hope to get back to this before stage1 re-opens and will post
another request for testing.  It's really mostly going to be making
sure all paths have coverage which means testing all the various
architectures - I can only easily test x86.  There's a branch
I worked on last year, refs/users/rguenth/heads/vect-force-slp,
which I use to hunt down cases not supporting SLP (it's a bit
overeager to trigger, and it has known holes so it's not really
a good starting point yet for folks to try other archs).

[Bug tree-optimization/99395] s116 benchmark of TSVC is vectorized by clang and not by gcc

2024-01-31 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395

--- Comment #15 from rguenther at suse dot de  ---
On Wed, 31 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99395
> 
> --- Comment #14 from JuzheZhong  ---
> Thanks Richard.
> 
> It seems that we can't fix this issue for now. Is that right ?
> 
> If I understand correctly, do you mean we should wait after SLP 
> representations
> are finished and then revisit this PR?

Yes.

[Bug tree-optimization/113622] [11/12/13 Regression] ICE with vectors in named registers

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113622

--- Comment #23 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, xry111 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113622
> 
> --- Comment #22 from Xi Ruoyao  ---
> On x86_64:
> 
> $ cat t.c
> typedef float __attribute__ ((vector_size (16))) vec;
> typedef int __attribute__ ((vector_size (16))) ivec;
> register vec a asm("xmm0"), b asm("xmm1");
> register ivec c asm("xmm2");
> 
> void
> test (void)
> {
>   for (int i = 0; i < 4; i++)
> c[i] = a[i] < b[i] ? -1 : 1;
> }
> $ gcc/cc1 -msse2 t.c -O2 -fno-vect-cost-model -nostdinc -ffixed-xmm{0,1,2}
> t.c: In function 'test':
> t.c:7:1: internal compiler error: in expand_expr_addr_expr_1, at expr.cc:9139
> 7 | test (void)
>   | ^~~~
> 0x10e6d6e expand_expr_addr_expr_1
> ../../gcc/gcc/expr.cc:9139
> 0x10e76e2 expand_expr_addr_expr
> ../../gcc/gcc/expr.cc:9252
> 0x10f73a7 expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
> ../../gcc/gcc/expr.cc:12585
> 0x10e7dc8 expand_expr_real(tree_node*, rtx_def*, machine_mode, 
> expand_modifier,
> rtx_def**, bool)
> ../../gcc/gcc/expr.cc:9440
> 0xef7346 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> ../../gcc/gcc/expr.h:316
> 0x10e91fa expand_expr_real_2(separate_ops*, rtx_def*, machine_mode,
> expand_modifier)
> ../../gcc/gcc/expr.cc:9762
> 0x10ef77d expand_expr_real_gassign(gassign*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
> ../../gcc/gcc/expr.cc:11096
> 0xf2db31 expand_gimple_stmt_1
> ../../gcc/gcc/cfgexpand.cc:4010
> 0xf2ddd4 expand_gimple_stmt
> ../../gcc/gcc/cfgexpand.cc:4071
> 0xf36844 expand_gimple_basic_block
> ../../gcc/gcc/cfgexpand.cc:6127
> 0xf38ff8 execute
> ../../gcc/gcc/cfgexpand.cc:6866
> 
> Should I open a new ticket or add back 14 Regression to the subject?

Please open a new ticket - this seems to be another vectorizer issue.

We end up with the invalid

_28 = (sizetype) &a;

[Bug target/113059] [14 regression] fftw fails tests for -O3 -m32 -march=znver2 since r14-6210-ge44ed92dbbe9d4

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059

--- Comment #20 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113059
> 
> --- Comment #18 from Jakub Jelinek  ---
> (In reply to Richard Biener from comment #17)
> > (In reply to Jakub Jelinek from comment #16)
> > > The question is revert what exactly?
> > > If we revert r14-6210, we get back the other P1.  Or do you mean revert
> > > r14-5355?
> > > I guess another option is move the vzeroupper pass one pass later, i.e.
> > > after pass_gcse.
> > 
> > I think moving mdreorg passes as late as possible esp. when they don't play
> > well with DF/notes is a good thing.  Maybe even after pass_rtl_dse2 and
> > thus after shrink-wrapping?
> 
> The thing is that the vzeroupper pass actually plays well with DF notes, the
> problem is that it now (in GCC 14) asks for them to be computed.
> The first issue was that vzeroupper before postreload_cse computed the notes,
> then
> postreload_cse CSEd something and made the REG_UNUSED invalid without killing
> them and then later passes went wrong because of the incorrect notes.
> This issue is that vzeroupper now after postreload_cse but before gcse2
> computes notes, then gcse2 CSEs something and makes REG_UNUSED invalid, rest 
> is
> the same.
> But, I believe gcse2 is the last CSE-ish pass.
> I wouldn't move it too much further, because I don't remember the interactions
> between vzeroupper, splitting and peepholes.

OK, so the "real" revert would then simply kill the notes actively
again after vzeroupper?  Btw, DSE also uses cselib, but I'm not sure
whether it uses REG_UNUSED but IIRC it does "redundant store" removal
and it does replace reads in some cases ... (not sure if after reload
though).

So for maximum safety if we'd have a way to kill off REG_UNUSED maybe
we should do that instead?  OTOH any "stray" valid REG_UNUSED
notes not causing issues with gcse or postreload_cse might not be
preserved and cause missed optimizations later ...

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #31 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #30 from Richard Sandiford  ---
> (In reply to Richard Biener from comment #29)
> > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> > VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> > masking for VECTOR_CST expansion as well, instead of trying to fixup
> > in do_compare_and_jump?
> But then how would ~ be implemented for things like 4-bit masks?
> If we use notqi2 then I assume the upper bits could be 1 rather than 0.

Yeah, I guess it's similar to expand_expr_real_1 'reduce_bit_field'
handling - we'd need to insert fixup code in strategic places
(or for ~ use xor with the proper mask).

The difficulty is that we can't make the backend do this unless
there are insn operands that allows it to infer the real precision
of the mode.  And for most insns the excess bits are irrelevant
anyway.

Still the CTOR case showed wrong-code issues with GCN, which possibly
means it has the same issue with VECTOR_CSTs as well.  IIRC that
was that all vectors are 1024 bits, and its "fake" V4SImode insns
rely on accurate masked out upper bits.  That might hint that
compares are not enough here (but for non-compares the backend
might have a chance to fixup by infering the max. number of
active elements).

If we think that compares (but that would also be compares without
jump, aka a == b | c == d) are the only problematical case we can
also fixup at the uses rather than at the defs as 'reduce_bit_field'
tries to do.
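
As a generic illustration in plain C (not GCC internals), for a 4-bit
mask kept in the low bits of a byte:

  #include <stdint.h>

  /* A full byte NOT would set the unused upper bits, so either xor
     with the in-range mask or mask the result of the NOT.  */
  uint8_t mask4_not_xor (uint8_t m) { return m ^ 0x0f; }
  uint8_t mask4_not_and (uint8_t m) { return (uint8_t) (~m & 0x0f); }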

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #10 from rguenther at suse dot de  ---
On Fri, 26 Jan 2024, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #9 from Robin Dapp  ---
> (In reply to rguent...@suse.de from comment #6)
> 
> > t.c:47:21: missed:   the size of the group of accesses is not a power of 2 
> > or not equal to 3
> > t.c:47:21: missed:   not falling back to elementwise accesses
> > t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = 
> > *_3;
> > t.c:47:21: missed:  bad operation or unsupported loop bound.
> > 
> > where we don't consider using gather because we have a known constant
> > stride (20).  Since the stores are really scatters we don't attempt
> > to SLP either.
> > 
> > Disabling the above heuristic we get this vectorized as well, avoiding
> > gather/scatter by manually implementing them and using a quite high
> > VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> > faster code in the end).
> 
> I suppose you're referring to this?
> 
>   /* FIXME: At the moment the cost model seems to underestimate the
>  cost of using elementwise accesses.  This check preserves the
>  traditional behavior until that can be fixed.  */
>   stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
>   if (!first_stmt_info)
> first_stmt_info = stmt_info;
>   if (*memory_access_type == VMAT_ELEMENTWISE
>   && !STMT_VINFO_STRIDED_P (first_stmt_info)
>   && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
>&& !DR_GROUP_NEXT_ELEMENT (stmt_info)
>&& !pow2p_hwi (DR_GROUP_SIZE (stmt_info
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "not falling back to elementwise accesses\n");
>   return false;
> }
> 
> 
> I did some more tests on my laptop.  As said above the whole loop in lbm is
> larger and contains two ifs.  The first one prevents clang and GCC from
> vectorizing the loop, the second one
> 
> if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
> ux = 0.005;
> uy = 0.002;
> uz = 0.000;
> }
> 
> seems to be if-converted? by clang or at least doesn't inhibit vectorization.
> 
> Now if I comment out the first, larger if clang does vectorize the loop.  With
> the return false commented out in the above GCC snippet GCC also vectorizes,
> but only when both ifs are commented out.
> 
> Results (with both ifs commented out), -march=native (resulting in avx2), best
> of 3 as lbm is notoriously fickle:
> 
> gcc trunk vanilla: 156.04s
> gcc trunk with elementwise: 132.10s
> clang 17: 143.06s
> 
> Of course even the comment already said that costing is difficult and the
> change will surely cause regressions elsewhere.  However the 15% improvement
> with vectorization (or the 9% improvement of clang) IMHO show that it's surely
> useful to look into this further.  On top, the riscv clang seems to not care
> about the first if either and still vectorize.  I haven't looked closer what
> happens there, though.

Yes.  I think this shows we should remove the above hack and instead
try to fix the costing next stage1.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #19 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #18 from Richard Sandiford  ---
> (In reply to Tamar Christina from comment #17)
> > Well the mid-end has generated the right precision. The type it generates is
> > vector(4)  vexit_reduc_67;
> > so it does say it's a single bit boolean.
> > 
> > Isn't this just an expand problem?
> That's what I meant.  expand is using a QImode comparison to compare things
> with 4-bit precision, so I think the masking should happen at that point.
> 
> How about doing the masking in do_compare_and_jump?

That sounds sensible.

Note that I wonder how to eliminate redundant maskings?  I suppose
eventually combine tracking nonzero bits where obvious would do
that?  For example for cmp:V4SI we know the bits will be zero but
I wonder if the RTL IL is obvious enough to derive this (or whether
there's a target hook for extra nonzero bit discovery, say for
unspecs).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #12 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #7 from Hongtao Liu  ---
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..33a8d539b4d 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned 
> char
> *ptr, int len,
> 
>unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
>unsigned int first_elt = off * elts_per_byte;
> -  unsigned int extract_elts = extract_bytes * elts_per_byte;
> +  unsigned int extract_elts = count;
>for (unsigned int i = 0; i < extract_elts; ++i)
> {
>   tree elt = VECTOR_CST_ELT (expr, first_elt + i);
> 
> Shouldn't we use count here? (It also fixed the hang.)

extract_bytes is capped by the buffer 'len':

  int total_bytes = CEIL (elt_bits * count, BITS_PER_UNIT);
..
  int extract_bytes = MIN (len, total_bytes - off);

we'd still need to effectively do that.  But yeah, using CEIL
makes extract_elts off.  Maybe we should simply calculate
extract_bits instead (but then use uint64 for that).
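
One possible shape of that, as an untested sketch (my assumption of the
intended fix, reusing the names from the quoted code):

  /* Work in bits to avoid the CEIL rounding: the number of bits we can
     extract is bounded both by the byte window and by the elements
     actually left in the vector.  */
  uint64_t extract_bits
    = MIN ((uint64_t) extract_bytes * BITS_PER_UNIT,
           (uint64_t) elt_bits * (count - first_elt));
  unsigned int extract_elts = extract_bits / elt_bits;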

[Bug tree-optimization/113583] Main loop in 519.lbm not vectorized.

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #6 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
> 
> --- Comment #5 from JuzheZhong  ---
> Both ICC and Clang X86 can vectorize SPEC 2017 lbm:
> 
> https://godbolt.org/z/MjbTbYf1G
> 
> But I am not sure X86 ICC is better or X86 Clang is better.

gather/scatter are possibly slow (and gather now has that Intel
security issue).  The reason is a "cost" one:

t.c:47:21: note:   ==> examining statement: _4 = *_3;
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   no array mode for V8DF[20]
t.c:47:21: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
t.c:47:21: missed:   not falling back to elementwise accesses
t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
t.c:47:21: missed:  bad operation or unsupported loop bound.

where we don't consider using gather because we have a known constant
stride (20).  Since the stores are really scatters we don't attempt
to SLP either.

Disabling the above heuristic we get this vectorized as well, avoiding
gather/scatter by manually implementing them and using a quite high
VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
faster code in the end).  But yes, I doubt that any of ICC or clang
vectorized codes are faster anywhere (but without specifying an
uarch you get some generic cost modelling applied).  Maybe SPR doesn't
have the gather bug and it does have reasonable gather and scatter
(zen4 scatter sucks).

.L3:
vmovsd  952(%rax), %xmm0
vmovsd  -8(%rax), %xmm2
addq    $1280, %rsi
addq    $1280, %rax
vmovhpd -168(%rax), %xmm0, %xmm1
vmovhpd -1128(%rax), %xmm2, %xmm2
vmovsd  -648(%rax), %xmm0
vmovhpd -488(%rax), %xmm0, %xmm0
vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm0
vmovsd  -968(%rax), %xmm1
vmovhpd -808(%rax), %xmm1, %xmm1
vinsertf32x4    $0x1, %xmm1, %ymm2, %ymm2
vinsertf64x4    $0x1, %ymm0, %zmm2, %zmm2
vmovsd  -320(%rax), %xmm0
vmovhpd -160(%rax), %xmm0, %xmm1
vmovsd  -640(%rax), %xmm0
vmovhpd -480(%rax), %xmm0, %xmm0
vinsertf32x4    $0x1, %xmm1, %ymm0, %ymm1
vmovsd  -960(%rax), %xmm0
vmovhpd -800(%rax), %xmm0, %xmm8
vmovsd  -1280(%rax), %xmm0
vmovhpd -1120(%rax), %xmm0, %xmm0
vinsertf32x4    $0x1, %xmm8, %ymm0, %ymm0
vinsertf64x4    $0x1, %ymm1, %zmm0, %zmm0
vmovsd  -312(%rax), %xmm1
vmovhpd -152(%rax), %xmm1, %xmm8
vmovsd  -632(%rax), %xmm1
vmovhpd -472(%rax), %xmm1, %xmm1
vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm8
vmovsd  -952(%rax), %xmm1
vmovhpd -792(%rax), %xmm1, %xmm9
vmovsd  -1272(%rax), %xmm1
vmovhpd -1112(%rax), %xmm1, %xmm1
vinsertf32x4    $0x1, %xmm9, %ymm1, %ymm1
vinsertf64x4    $0x1, %ymm8, %zmm1, %zmm1
vaddpd  %zmm1, %zmm0, %zmm0
vaddpd  %zmm7, %zmm2, %zmm1
vfnmadd132pd    %zmm3, %zmm2, %zmm1
vfmadd132pd %zmm6, %zmm5, %zmm0
valignq $3, %ymm1, %ymm1, %ymm2
vmovlpd %xmm1, -1280(%rsi)
vextractf64x2   $1, %ymm1, %xmm8
vmovhpd %xmm1, -1120(%rsi)
vextractf64x4   $0x1, %zmm1, %ymm1
vmovlpd %xmm1, -640(%rsi)
vmovhpd %xmm1, -480(%rsi)
vmovsd  %xmm2, -800(%rsi)
vextractf64x2   $1, %ymm1, %xmm2
vmovsd  %xmm8, -960(%rsi)
valignq $3, %ymm1, %ymm1, %ymm1
vmovsd  %xmm2, -320(%rsi)
vmovsd  %xmm1, -160(%rsi)
vmovsd  -320(%rax), %xmm1
vmovhpd -160(%rax), %xmm1, %xmm2
vmovsd  -640(%rax), %xmm1
vmovhpd -480(%rax), %xmm1, %xmm1
vinsertf32x4    $0x1, %xmm2, %ymm1, %ymm2
vmovsd  -960(%rax), %xmm1
vmovhpd -800(%rax), %xmm1, %xmm8
vmovsd  -1280(%rax), %xmm1
vmovhpd -1120(%rax), %xmm1, %xmm1
vinsertf32x4    $0x1, %xmm8, %ymm1, %ymm1
vinsertf64x4    $0x1, %ymm2, %zmm1, %zmm1
vfnmadd132pd    %zmm3, %zmm1, %zmm0
vaddpd  %zmm4, %zmm0, %zmm0
valignq $3, %ymm0, %ymm0, %ymm1
vmovlpd %xmm0, 14728(%rsi)
vextractf64x2   $1, %ymm0, %xmm2
vmovhpd %xmm0, 14888(%rsi)
vextractf64x4   $0x1, %zmm0, %ymm0
vmovlpd %xmm0, 15368(%rsi)
vmovhpd %xmm0, 15528(%rsi)
vmovsd  %xmm1, 15208(%rsi)
vextractf64x2   $1, %ymm0, %xmm1
vmovsd  %xmm2, 15048(%rsi)
valignq $3, %ymm0, %ymm0, %ymm0
vmovsd  %xmm1, 15688(%rsi)
vmovsd  %xmm0, 15848(%rsi)
cmpq    %rdx, %rsi
jne .L3

[Bug tree-optimization/113281] [14 Regression] Wrong code due to vectorization of shift reduction and missing promotions since r14-3027

2024-01-24 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281

--- Comment #19 from rguenther at suse dot de  ---
On Wed, 24 Jan 2024, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
> 
> --- Comment #17 from Andrew Pinski  ---
> (In reply to rguent...@suse.de from comment #16)
> > On Wed, 24 Jan 2024, pinskia at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113281
> > > 
> > > --- Comment #15 from Andrew Pinski  ---
> > > (In reply to Richard Biener from comment #14)
> > > > a >= 16 ? 0 : 32872 >> MIN (a, 15) (the MIN is still required to
> > > > avoid requiring masking).
> > > 
> > > Note maybe instead of MIN here we use `a & 0xf` since that will more 
> > > likely be
> > > cheaper than a MIN.
> > 
> > But it's incorrect (that's what I did originally).
> 
> But `(a>= 16)? 0: (32872>> (a&0xf))` is correct.
> 
> So is `(a>=16 ? 0 : 32872) >> ( a & 0xf)` .
> 
> Or is it you want to avoid the conditional here.

Yeah - at some point not trying to optimize the widening/shortening
is going to be cheaper, no?  Esp. if it is "just" because of
GIMPLE being limited with no way to express the desired semantics
for "out-of-bound" shifts (and no way to query target behavior for
them of course).

The widening/shortening might or might not have an effect on the
VF (we don't know) and we don't (yet) compute if the target
supports it.  We could force a smaller vector size (if the target
supports it) to avoid increasing the VF.
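
For reference, the two correct variants quoted above written out (a
sketch; 32872 is the constant from the discussion, the rest is made up):

  #include <stdint.h>

  /* When a < 16, (a & 0xf) == a; when a >= 16 the selected value is 0
     and 0 >> k is still 0, so both forms avoid out-of-range shifts.  */
  uint16_t variant1 (unsigned a) { return a >= 16 ? 0 : 32872u >> (a & 0xf); }
  uint16_t variant2 (unsigned a) { return (a >= 16 ? 0 : 32872u) >> (a & 0xf); }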
