[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #57 from Hongtao Liu  ---
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

Makes sense.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #56 from Uroš Bizjak  ---
The testcase is fixed with g:430c772be3382134886db33133ed466c02efc71c

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #55 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #53)
> Comment on attachment 57424 [details]
> Proposed testsuite patch
> 
> As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f
> effective target and testing it at runtime IMHO isn't enough.
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

I think that addition of

+# if defined(__AVX512VL__)
+want_level = 7, want_b = bit_AVX512VL;
+# elif defined(__AVX512F__)
+want_level = 7, want_b = bit_AVX512F;
+# elif defined(__AVX2__)

to check_vect solves all current uses in gcc.dg/vect

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #54 from Richard Biener  ---
Please also verify the bug reproduced with the altered set of options.

What's the reason to have avx512-check.h in addition to tree-vect.h?
At least for the vectorizer testsuite the latter is the canonical one,
can we please merge AVX512 support therein?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #53 from Jakub Jelinek  ---
Comment on attachment 57424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57424
Proposed testsuite patch

As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f effective
target and testing it at runtime IMHO isn't enough.
For dg-do run testcases I really think we should avoid those -march= options,
because it means a lot of other stuff, BMI, LZCNT, ...

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #52 from Uroš Bizjak  ---
Created attachment 57424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57424&action=edit
Proposed testsuite patch

This patch fixes the failure for me (+ some other dg.exp/vect inconsistencies).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #51 from Jakub Jelinek  ---
From the -mavx* options I think -march=skylake-avx512 implies
-mavx512{f,cd,vl,bw,dq} but -mavx512f is implied by any of the latter 4.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #50 from Jakub Jelinek  ---
(In reply to Richard Biener from comment #49)
> (In reply to Uroš Bizjak from comment #48)
> > The runtime testcase fails on non-AVX512F x86 targets due to:
> > 
> > /* { dg-do run } */
> > /* { dg-options "-O3" } */
> > /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* i?86-*-* } } } */
> > 
> > but check_vect() only checks runtime support up to AVX2.
> 
> Hmm, can we fix that?  We could change the above to { target avx512f_runtime }
> but that really only checks for AVX512F, not say AVX512VL ...
> 
> I do remember using -mavx512vl wasn't enough to trigger the miscompile
> nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/

It is certainly preferable to add -mavx512{bw,dq,vl} or whatever the testcase
actually needs and then one can
#define AVX512BW
#define AVX512DQ
#define AVX512VL
#include "avx512-check.h"
and get checks for all those.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #49 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #48)
> The runtime testcase fails on non-AVX512F x86 targets due to:
> 
> /* { dg-do run } */
> /* { dg-options "-O3" } */
> /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* i?86-*-* } } } */
> 
> but check_vect() only checks runtime support up to AVX2.

Hmm, can we fix that?  We could change the above to { target avx512f_runtime }
but that really only checks for AVX512F, not say AVX512VL ...

I do remember using -mavx512vl wasn't enough to trigger the miscompile
nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #48 from Uroš Bizjak  ---
The runtime testcase fails on non-AVX512F x86 targets due to:

/* { dg-do run } */
/* { dg-options "-O3" } */
/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* i?86-*-* } } } */

but check_vect() only checks runtime support up to AVX2.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener  changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |FIXED

--- Comment #47 from Richard Biener  ---
This should now be fixed, but I think the issue might be latent on branches for
GCN or for AVX512 via intrinsics (fully masked loops for AVX512 is also only
available in GCC 14).

Let's close this P1, we have to consider backporting when we manage to create a
miscompiled testcase for older releases.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-14 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #46 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:5352ede92483b949e811cbdcdfaec5378f3e06d6

commit r14-8975-g5352ede92483b949e811cbdcdfaec5378f3e06d6
Author: Richard Biener 
Date:   Fri Feb 9 08:15:44 2024 +0100

middle-end/113576 - zero padding of vector bools when expanding compares

The following zeros paddings of vector bools when expanding compares
and the mode used for the compare is an integer mode.  In that case
targets cannot distinguish between a 4 element and 8 element vector
compare (both get to the QImode compare optab) so we have to do the
job in the middle-end.

PR middle-end/113576
* expr.cc (do_store_flag): For vector bool compares of vectors
with padding zero that.
* dojump.cc (do_compare_and_jump): Likewise.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #45 from Hongtao Liu  ---

> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
> 
> reasonable enough for me.
I mean we only handle it at consumers where the upper bits matter.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #44 from Hongtao Liu  ---
> 
> Note the AND is removed by combine if I add it:
> 
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
> (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
> (const_int 15 [0xf]))
> (const_int 0 [0])))
> 
> (*testhi_not)
> 
> -9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
> +  REG_DEAD r99:QI
> +9: NOTE_INSN_DELETED
> +   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
>REG_DEAD r102:QI
> -  REG_UNUSED flags:CC
> -   12: flags:CCZ=cmp(r103:QI,0xf)
> -  REG_DEAD r103:QI
> 
> and we get
> 
> foo:
> .LFB0:
> .cfi_startproc
> notl%esi
> orl %esi, %edi
> notl%edi
> testb   $15, %dil
> je  .L6
> ret
> 
> which I'm not sure is OK?
> 

Yes, I think it's on purpose

;; Split and;cmp (as optimized by combine) into not;test
;; Except when TARGET_BMI provides andn (*andn_<mode>_ccno).
(define_insn_and_split "*test<mode>_not"
  [(set (reg:CCZ FLAGS_REG)
	(compare:CCZ
	  (and:SWI
	    (not:SWI (match_operand:SWI 0 "register_operand"))
	    (match_operand:SWI 1 ""))
	  (const_int 0)))]
  "ix86_pre_reload_split ()
   && (!TARGET_BMI || !REG_P (operands[1]))"
  "#"
  "&& 1"
  [(set (match_dup 2) (not:SWI (match_dup 0)))
   (set (reg:CCZ FLAGS_REG)
	(compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
		     (const_int 0)))]
  "operands[2] = gen_reg_rtx (<MODE>mode);")

;; Split and;cmp (as optimized by combine) into andn;cmp $0
(define_insn_and_split "*test<mode>_not_doubleword"

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #43 from Hongtao Liu  ---

> Well, yes, the discussion in this bug was whether to do this at consumers
> (that's sth new) or with all mask operations (that's how we handle
> bit-precision integer operations, so it might be relatively easy to
> do that - specifically spot the places eventually needing adjustment).
> 
> There's do_store_flag to fixup for uses not in branches and
> do_compare_and_jump for conditional jumps.

reasonable enough for me.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #42 from Richard Biener  ---
And the do_store_flag part:

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fc5e998e329..44d64274071 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -13693,6 +13693,19 @@ do_store_flag (sepops ops, rtx target, machine_mode mode)
     subtarget = 0;
 
   expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL);
+  unsigned HOST_WIDE_INT nunits;
+  if (VECTOR_BOOLEAN_TYPE_P (type)
+      && operand_mode == QImode
+      && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+      && nunits < BITS_PER_UNIT)
+{
+  op0 = expand_binop (mode, and_optab, op0,
+ GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ true, OPTAB_WIDEN);
+  op1 = expand_binop (mode, and_optab, op1,
+ GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ true, OPTAB_WIDEN);
+}

   if (target == 0)
 target = gen_reg_rtx (mode);


for the testcase

typedef long v4si __attribute__((vector_size(4*sizeof(long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));
_Bool x;
void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;
  _Bool _7;

__BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
_Literal (sbool1) -1, _Literal (sbool1) -1 };
  _7 = tem_3 == tem_4;
  x = _7;
  return;
}

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #41 from Richard Biener  ---
(In reply to Hongtao Liu from comment #38)
> > I think we should also mask off the upper bits of variable mask?
> > 
> > notl%esi
> > orl %esi, %edi
> > notl%edi
> > andl$15, %edi
> > je  .L3
> 
> with -mbmi, it's 
> 
> andn%esi, %edi, %edi
> andl$15, %edi
> je  .L3

Well, yes, the discussion in this bug was whether to do this at consumers
(that's sth new) or with all mask operations (that's how we handle
bit-precision integer operations, so it might be relatively easy to
do that - specifically spot the places eventually needing adjustment).

There's do_store_flag to fixup for uses not in branches and
do_compare_and_jump for conditional jumps.

Note the AND is removed by combine if I add it:

Successfully matched this instruction:
(set (reg:CCZ 17 flags)
(compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
(const_int 15 [0xf]))
(const_int 0 [0])))

(*testhi_not)

-9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
+  REG_DEAD r99:QI
+9: NOTE_INSN_DELETED
+   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
   REG_DEAD r102:QI
-  REG_UNUSED flags:CC
-   12: flags:CCZ=cmp(r103:QI,0xf)
-  REG_DEAD r103:QI

and we get

foo:
.LFB0:
.cfi_startproc
notl%esi
orl %esi, %edi
notl%edi
testb   $15, %dil
je  .L6
ret

which I'm not sure is OK?

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index e2d2b3cb111..784707c1e55 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   machine_mode mode;
   int unsignedp;
   enum rtx_code code;
+  unsigned HOST_WIDE_INT nunits;

   /* Don't crash if the comparison was erroneous.  */
   op0 = expand_normal (treeop0);
@@ -1308,6 +1309,18 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, op1));
   op1 = new_op1;
 }
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+	   && mode == QImode
+	   && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+	   && nunits < BITS_PER_UNIT)
+{
+  op0 = expand_binop (mode, and_optab, op0,
+ GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ true, OPTAB_WIDEN);
+  op1 = expand_binop (mode, and_optab, op1,
+ GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ true, OPTAB_WIDEN);
+}

   do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
   ((mode == BLKmode)

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #40 from Jakub Jelinek  ---
For unsigned _BitInt(4) or unsigned _BitInt(2) we mask it whenever loading from
memory or function argument or whatever other ABI specific spot (and also when
storing because that is how RTL expects it; because of that we don't mask it
when using it from say automatic variables where we know we've initialized it
ourselves).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #39 from Hongtao Liu  ---
> > the question is whether that matches the semantics of GIMPLE (the padding
> > is inverted, too), whether it invokes undefined behavior (don't do it - it
> > seems for people using intrinsics that's what it is?)
For the intrinsic, the instructions only care about the lower bits, so it's not a big
issue? And it sounds like a similar issue to _BitInt(4)/_BitInt(2); I assume
there's garbage in the upper bits.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #38 from Hongtao Liu  ---

> I think we should also mask off the upper bits of variable mask?
> 
> notl%esi
> orl %esi, %edi
> notl%edi
> andl$15, %edi
> je  .L3

with -mbmi, it's 

andn%esi, %edi, %edi
andl$15, %edi
je  .L3

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #37 from Hongtao Liu  ---
(In reply to Richard Biener from comment #36)
> For example with AVX512VL and the following, using -O -fgimple -mavx512vl
> we get simply
> 
> notl%esi
> orl %esi, %edi
> cmpb$15, %dil
> je  .L6
> 
> typedef long v4si __attribute__((vector_size(4*sizeof(long))));
> typedef v4si v4sib __attribute__((vector_mask));
> typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));
> 
> void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
> {
>   v4sib tem;
> 
> __BB(2):
>   tem_5 = ~v2_2(D);
>   tem_3 = v1_1(D) | tem_5;
>   tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
> _Literal (sbool1) -1, _Literal (sbool1) -1 };
>   if (tem_3 == tem_4)
> goto __BB3;
>   else
> goto __BB4;
> 
> __BB(3):
>   __builtin_abort ();
> 
> __BB(4):
>   return;
> }
> 
> 
> the question is whether that matches the semantics of GIMPLE (the padding
> is inverted, too), whether it invokes undefined behavior (don't do it - it
> seems for people using intrinsics that's what it is?) or whether we
> should avoid affecting padding.
> 
> Note after the patch I proposed on the mailing list the constant mask is
> now expanded with zero padding.

I think we should also mask off the upper bits of variable mask?

notl%esi
orl %esi, %edi
notl%edi
andl$15, %edi
je  .L3

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #36 from Richard Biener  ---
For example with AVX512VL and the following, using -O -fgimple -mavx512vl
we get simply

notl%esi
orl %esi, %edi
cmpb$15, %dil
je  .L6

typedef long v4si __attribute__((vector_size(4*sizeof(long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));

void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;

__BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
_Literal (sbool1) -1, _Literal (sbool1) -1 };
  if (tem_3 == tem_4)
goto __BB3;
  else
goto __BB4;

__BB(3):
  __builtin_abort ();

__BB(4):
  return;
}


the question is whether that matches the semantics of GIMPLE (the padding
is inverted, too), whether it invokes undefined behavior (don't do it - it
seems for people using intrinsics that's what it is?) or whether we
should avoid affecting padding.

Note after the patch I proposed on the mailing list the constant mask is
now expanded with zero padding.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org

--- Comment #35 from Richard Biener  ---
(In reply to Richard Sandiford from comment #34)
> (In reply to Richard Biener from comment #32)
> > Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> > the lowest 4 or 2 bits.
> > 
> > It all feels like disaster waiting to happen ;)
> Yes :)
> 
> > For example BIT_NOT_EXPR is RTL expanded like
> > 
> > case BIT_NOT_EXPR:
> >   op0 = expand_expr (treeop0, subtarget,
> >  VOIDmode, EXPAND_NORMAL);
> >   if (modifier == EXPAND_STACK_PARM)
> > target = 0;
> >   /* In case we have to reduce the result to bitfield precision
> >  for unsigned bitfield expand this as XOR with a proper constant
> >  instead.  */
> >   if (reduce_bit_field && TYPE_UNSIGNED (type))
> > {
> >   int_mode = SCALAR_INT_TYPE_MODE (type);
> >   wide_int mask = wi::mask (TYPE_PRECISION (type),
> > false, GET_MODE_PRECISION (int_mode));
> >   
> >   temp = expand_binop (int_mode, xor_optab, op0,
> >immed_wide_int_const (mask, int_mode),
> >target, 1, OPTAB_LIB_WIDEN);
> > 
> > so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> > effective bit-precision set reduce_bit_field and fixup the fallout
> > (not sure why the above is only for TYPE_UNSIGNED).
> >
> > At least it feels similar and doing things the opposite for vectors
> > (fixing up at uses) would be odd?
> Do you know why we take this approach for integers?  Is it for
> correctness?  Or is it supposed to be more optimal?

It's done for correctness.  The main thing was bitfields > int which
end up as bit-precision types in GIMPLE.

> I can imagine that, for arithmetic types, there are going to many
> more instances where upper bits matter (division, right shifts,
> MIN/MAX, etc.).  So perhaps reducing every result is a good
> trade-off there.

I think it was the easiest place to fix up and to make sure later
RTL opts and backends do not interfere.

> But there's an argument that it should be rare for the padding
> bits in a vector to matter, since very few things would look at the
> padding bits anyway.  So perhaps the cost should be borne by the
> operations that need canonical integers.

But what happens when an operation not needing canonical integers
is transformed, say by combine or simplify-rtx to one that needs?
I think the reduce_bitfield code was trying to be safe here.

Actual define_insns not needing canonical padding could do like
we now require scalar shifts - they could match a variant with
the mask canonicalization op and hope for combine eliminating that.
Of course that will explode in case it's the majority of cases ...

> Not a strong opinion though, more just devil's advocate.

Yeah.  The important thing seems to be that the info this is a
bit-precision QImode isn't lost ... which points to that using
plain integer modes was a very bad choice ...

> There again, if e.g. the x86 API guarantees memcmp equality between
> two masks whose significant bits are equal, then we probably have
> no choice.

That's a good question, possibly most relevant for the OpenMP SIMD ABI.

The AVX512 APIs use integer types everywhere, there's no intrinsic
for ktest itself, but _mm512_kortestz and kortestc and also _mm512_knot
(oddly only for HImode).  So it at least seems to - from a quick look -
be "broken" in the intrinsic API as well.  At least we have
_mm_cmp_sd_mask producing a QImode mask, _mm512_knot inverting too
many bits.  So the user must be aware of the "padding".

One could argue it's the vectorizers job to fixup then, but of course
it doesn't get to see the correct vector types here.

Given we have the fixup in CTOR expansion the issue at hand could be
fixed by mirroring that in VECTOR_CST expansion for now.

Andrew - I suppose GCN also has cbranch, can you try to check what happens
for < 8 lane vector modes there?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-01 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #34 from Richard Sandiford  ---
(In reply to Richard Biener from comment #32)
> Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> the lowest 4 or 2 bits.
> 
> It all feels like disaster waiting to happen ;)
Yes :)

> For example BIT_NOT_EXPR is RTL expanded like
> 
> case BIT_NOT_EXPR:
>   op0 = expand_expr (treeop0, subtarget,
>  VOIDmode, EXPAND_NORMAL);
>   if (modifier == EXPAND_STACK_PARM)
> target = 0;
>   /* In case we have to reduce the result to bitfield precision
>  for unsigned bitfield expand this as XOR with a proper constant
>  instead.  */
>   if (reduce_bit_field && TYPE_UNSIGNED (type))
> {
>   int_mode = SCALAR_INT_TYPE_MODE (type);
>   wide_int mask = wi::mask (TYPE_PRECISION (type),
> false, GET_MODE_PRECISION (int_mode));
>   
>   temp = expand_binop (int_mode, xor_optab, op0,
>immed_wide_int_const (mask, int_mode),
>target, 1, OPTAB_LIB_WIDEN);
> 
> so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> effective bit-precision set reduce_bit_field and fixup the fallout
> (not sure why the above is only for TYPE_UNSIGNED).
>
> At least it feels similar and doing things the opposite for vectors
> (fixing up at uses) would be odd?
Do you know why we take this approach for integers?  Is it for
correctness?  Or is it supposed to be more optimal?

I can imagine that, for arithmetic types, there are going to be many
more instances where upper bits matter (division, right shifts,
MIN/MAX, etc.).  So perhaps reducing every result is a good
trade-off there.

But there's an argument that it should be rare for the padding
bits in a vector to matter, since very few things would look at the
padding bits anyway.  So perhaps the cost should be borne by the
operations that need canonical integers.

Not a strong opinion though, more just devil's advocate.

There again, if e.g. the x86 API guarantees memcmp equality between
two masks whose significant bits are equal, then we probably have
no choice.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-01 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #33 from Jakub Jelinek  ---
I guess inverting just 2 or 4 bits can be done with kxorb or knotb + kandb, but
we need to have the mask forced in some mask register.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-01 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #32 from Richard Biener  ---
Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
the lowest 4 or 2 bits.

It all feels like disaster waiting to happen ;)

For example BIT_NOT_EXPR is RTL expanded like

case BIT_NOT_EXPR:
  op0 = expand_expr (treeop0, subtarget,
 VOIDmode, EXPAND_NORMAL);
  if (modifier == EXPAND_STACK_PARM)
target = 0;
  /* In case we have to reduce the result to bitfield precision
 for unsigned bitfield expand this as XOR with a proper constant
 instead.  */
  if (reduce_bit_field && TYPE_UNSIGNED (type))
{
  int_mode = SCALAR_INT_TYPE_MODE (type);
  wide_int mask = wi::mask (TYPE_PRECISION (type),
false, GET_MODE_PRECISION (int_mode));

  temp = expand_binop (int_mode, xor_optab, op0,
   immed_wide_int_const (mask, int_mode),
   target, 1, OPTAB_LIB_WIDEN);

so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
effective bit-precision set reduce_bit_field and fixup the fallout
(not sure why the above is only for TYPE_UNSIGNED).

At least it feels similar and doing things the opposite for vectors
(fixing up at uses) would be odd?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #31 from rguenther at suse dot de  ---
On Tue, 30 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #30 from Richard Sandiford  ---
> (In reply to Richard Biener from comment #29)
> > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> > VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> > masking for VECTOR_CST expansion as well, instead of trying to fixup
> > in do_compare_and_jump?
> But then how would ~ be implemented for things like 4-bit masks?
> If we use notqi2 then I assume the upper bits could be 1 rather than 0.

Yeah, I guess it's similar to expand_expr_real_1 'reduce_bit_field'
handling - we'd need to insert fixup code in strategic places
(or for ~ use xor with the proper mask).

The difficulty is that we can't make the backend do this unless
there are insn operands that allows it to infer the real precision
of the mode.  And for most insns the excess bits are irrelevant
anyway.

Still the CTOR case showed wrong-code issues with GCN, which possibly
means it has the same issue with VECTOR_CSTs as well.  IIRC that
was that all vectors are 1024bits, and its "fake" V4SImode insns
rely on accurate masked out upper bits.  That might hint that
compares are not enough here (but for non-compares the backend
might have a chance to fixup by infering the max. number of
active elements).

If we think that compares (but that would also be compares without
jump, aka a == b | c == d) are the only problematical case we can
also fixup at the uses rather than at the defs as 'reduce_bit_field'
tries to do.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #30 from Richard Sandiford  ---
(In reply to Richard Biener from comment #29)
> But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> masking for VECTOR_CST expansion as well, instead of trying to fixup
> in do_compare_and_jump?
But then how would ~ be implemented for things like 4-bit masks?
If we use notqi2 then I assume the upper bits could be 1 rather than 0.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #29 from Richard Biener  ---
(In reply to Hongtao Liu from comment #28)
> I saw we already maskoff integral modes for vector mask in store_constructor
> 
>   /* Use sign-extension for uniform boolean vectors with
>  integer modes and single-bit mask entries.
>  Effectively "vec_duplicate" for bitmasks.  */
>   if (elt_size == 1
>   && !TREE_SIDE_EFFECTS (exp)
>   && VECTOR_BOOLEAN_TYPE_P (type)
>   && SCALAR_INT_MODE_P (TYPE_MODE (type))
>   && (elt = uniform_vector_p (exp))
>   && !VECTOR_TYPE_P (TREE_TYPE (elt)))
> {
>   rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
>expand_normal (elt));
>   rtx tmp = gen_reg_rtx (mode);
>   convert_move (tmp, op0, 0);
> 
>   /* Ensure no excess bits are set.
>  GCN needs this for nunits < 64.
>  x86 needs this for nunits < 8.  */
>   auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
>   if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
> tmp = expand_binop (mode, and_optab, tmp,
> GEN_INT ((1 << nunits) - 1), target,
> true, OPTAB_WIDEN);
>   if (tmp != target)
> emit_move_insn (target, tmp);
>   break;
> }

But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
masking for VECTOR_CST expansion as well, instead of trying to fixup
in do_compare_and_jump?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #28 from Hongtao Liu  ---
I saw we already maskoff integral modes for vector mask in store_constructor

/* Use sign-extension for uniform boolean vectors with
   integer modes and single-bit mask entries.
   Effectively "vec_duplicate" for bitmasks.  */
if (elt_size == 1
&& !TREE_SIDE_EFFECTS (exp)
&& VECTOR_BOOLEAN_TYPE_P (type)
&& SCALAR_INT_MODE_P (TYPE_MODE (type))
&& (elt = uniform_vector_p (exp))
&& !VECTOR_TYPE_P (TREE_TYPE (elt)))
  {
rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
 expand_normal (elt));
rtx tmp = gen_reg_rtx (mode);
convert_move (tmp, op0, 0);

/* Ensure no excess bits are set.
   GCN needs this for nunits < 64.
   x86 needs this for nunits < 8.  */
auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
  tmp = expand_binop (mode, and_optab, tmp,
  GEN_INT ((1 << nunits) - 1), target,
  true, OPTAB_WIDEN);
if (tmp != target)
  emit_move_insn (target, tmp);
break;
  }

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #27 from Richard Biener  ---
(In reply to Hongtao Liu from comment #25)
> (In reply to Tamar Christina from comment #24)
> > Just to avoid confusion, are you still working on this one Richi?
> 
> I'm working on a patch to add a target hook as #c18 mentioned.

Not sure a target hook was suggested - I think it was suggested that
do_compare_and_jump always masks excess bits for integer mode vector masks?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #26 from Tamar Christina  ---
Ah great, just checking it wasn't left unattended :)

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #25 from Hongtao Liu  ---
(In reply to Tamar Christina from comment #24)
> Just to avoid confusion, are you still working on this one Richi?

I'm working on a patch to add a target hook as #c18 mentioned.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #24 from Tamar Christina  ---
Just to avoid confusion, are you still working on this one Richi?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Tamar Christina  changed:

   What|Removed |Added

 CC||acoplan at gcc dot gnu.org

--- Comment #23 from Tamar Christina  ---
*** Bug 113661 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #22 from Hongtao Liu  ---
typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;

typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;

#define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)

#define GMP_LIMB_MAX (~ (mp_limb_t) 0)

mp_bitcnt_t
__attribute__((noipa))
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up, mp_size_t un,
 mp_limb_t ux)
{
  unsigned cnt;

  while (limb == 0)
{
  i++;
  if (i == un)
return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
  limb = ux ^ up[i];
}
  return limb;
}

int main ()
{
  mp_limb_t up[1];
  for (int i = 0; i != 1; i++)
up[i] = 1 << 8;
  up[2000] = 1;
  mp_bitcnt_t res = mpn_common_scan (0, 0, up, 1, 1 << 8);
  if (res != 257)
__builtin_abort ();
  return 1;
}


The testcase aborts with -O3 -march=skylake-avx512.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #21 from Hongtao Liu  ---
typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;

typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;

#define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)

#define GMP_LIMB_MAX (~ (mp_limb_t) 0)

mp_bitcnt_t
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up, mp_size_t un,
 mp_limb_t ux)
{
  unsigned cnt;

  while (limb == 0)
{
  i++;
  if (i == un)
return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
  limb = ux ^ up[i];
}
  return limb;
}

This one is miscompiled in 502.gcc_r

123   [local count: 862990464]:
124  _34 = ivtmp.20_20 * 32;
125  vect__5.15_59 = MEM <vector(4) long unsigned int> [(const mp_limb_t
*)vectp.14_53 + _34 * 1];
126  mask_patt_9.16_61 = vect__5.15_59 == vect_cst__60;
127  ivtmp.20_32 = ivtmp.20_20 + 1;
128  if (mask_patt_9.16_61 == { -1, -1, -1, -1 })
129goto ; [94.50%]
130  else
131goto ; [5.50%]


is expanded to

 30.L18:
 31movq%rdi, %rdx
 32incq%rdi
 33salq$5, %rdx
 34vpcmpeqq(%rax,%rdx), %ymm3, %k0
 35kmovb   %k0, %edx
 36cmpb$-1, %dl
 37jne .L21

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #20 from Hongtao Liu  ---

> Note that I wonder how to eliminate redundant maskings?  I suppose
> eventually combine tracking nonzero bits where obvious would do
> that?  For example for cmp:V4SI we know the bits will be zero but
> I wonder if the RTL IL is obvious enough to derive this (or whether
> there's a target hook for extra nonzero bit discovery, say for
> unspecs).

I guess we need extra patterns to make combine aware of this; we already have
those for zero_extend.

3970;; Since vpcmpd implicitly clear the upper bits of dest, transform
 3971;; vpcmpd + zero_extend to vpcmpd since the instruction
 3972(define_insn_and_split
"*<avx512>_cmp<mode>3_zero_extend<SWI248x:mode>"
 3973  [(set (match_operand:SWI248x 0 "register_operand")
 3974(zero_extend:SWI248x
 3975  (unspec:<avx512fmaskmode>
 3976[(match_operand:V48H_AVX512VL 1 "nonimmediate_operand")
 3977 (match_operand:V48H_AVX512VL 2 "nonimmediate_operand")
 3978 (match_operand:SI 3 "const_0_to_7_operand")]
 3979UNSPEC_PCMP)))]

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #19 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #18 from Richard Sandiford  ---
> (In reply to Tamar Christina from comment #17)
> > Well the mid-end has generated the right precision. The type it generates is
> > vector(4) <signed-boolean:1> vexit_reduc_67;
> > so it does say it's a single bit boolean.
> > 
> > Isn't this just an expand problem?
> That's what I meant.  expand is using a QImode comparison to compare things
> with 4-bit precision, so I think the masking should happen at that point.
> 
> How about doing the masking in do_compare_and_jump?

That sounds sensible.

Note that I wonder how to eliminate redundant maskings?  I suppose
eventually combine tracking nonzero bits where obvious would do
that?  For example for cmp:V4SI we know the bits will be zero but
I wonder if the RTL IL is obvious enough to derive this (or whether
there's a target hook for extra nonzero bit discovery, say for
unspecs).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #18 from Richard Sandiford  ---
(In reply to Tamar Christina from comment #17)
> Well the mid-end has generated the right precision. The type it generates is
> vector(4) <signed-boolean:1> vexit_reduc_67;
> so it does say it's a single bit boolean.
> 
> Isn't this just an expand problem?
That's what I meant.  expand is using a QImode comparison to compare things
with 4-bit precision, so I think the masking should happen at that point.

How about doing the masking in do_compare_and_jump?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #17 from Tamar Christina  ---
Well the mid-end has generated the right precision. The type it generates is
vector(4) <signed-boolean:1> vexit_reduc_67;
so it does say it's a single bit boolean.

Isn't this just an expand problem?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #16 from Richard Sandiford  ---
(In reply to Richard Biener from comment #15)
> I think the problem is the cbranch pattern which looks at all of the
> QImode mask - but of course it doesn't know it's really V4BImode it's
> working on ...
Yeah.  Currently building an x86_64 toolchain to have a look, but I think
whatever code uses a cbranch with a higher precision than the inputs should
mask off the excess bits beforehand.  cbranch can also be used for
comparing two variable masks.

I suppose we could add a hook to say that padding bits of an integer mask must
always be zero (and so it's expand's job to ensure that that holds for any mask
operation).  But it feels dangerously close to TRULY_NOOP_TRUNCATION for
integers.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #15 from Richard Biener  ---
(In reply to Richard Sandiford from comment #13)
> I don't think there's any principle that upper bits must be zero.
> How do we end up with a pattern that depends on that being the case?

I think the problem is the cbranch pattern which looks at all of the
QImode mask - but of course it doesn't know it's really V4BImode it's
working on ...

If there's no principle that the upper bits should be zero I think we
need a way for the target to say so.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #14 from Hongtao Liu  ---
The testcase attached is ok now, but 502.gcc_r is miscompiled after the
installed patch.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #13 from Richard Sandiford  ---
I don't think there's any principle that upper bits must be zero.
How do we end up with a pattern that depends on that being the case?

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #12 from rguenther at suse dot de  ---
On Thu, 25 Jan 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #7 from Hongtao Liu  ---
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..33a8d539b4d 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned 
> char
> *ptr, int len,
> 
>unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
>unsigned int first_elt = off * elts_per_byte;
> -  unsigned int extract_elts = extract_bytes * elts_per_byte;
> +  unsigned int extract_elts = count;
>for (unsigned int i = 0; i < extract_elts; ++i)
> {
>   tree elt = VECTOR_CST_ELT (expr, first_elt + i);
> 
> Shouldn't we use count here? (It also fixes the hang.)

extract_bytes is capped by the buffer 'len':

  int total_bytes = CEIL (elt_bits * count, BITS_PER_UNIT);
..
  int extract_bytes = MIN (len, total_bytes - off);

we'd still need to effectively do that.  But yeah, using CEIL
makes extract_elts off.  Maybe we should simply calculate
extract_bits instead (but then use uint64 for that)

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener  changed:

   What|Removed |Added

 CC||rsandifo at gcc dot gnu.org

--- Comment #11 from Richard Biener  ---
(In reply to Hongtao Liu from comment #8)
> maybe 
> 
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..6d321f9baef 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned
> char *ptr, int len,
>unsigned int extract_elts = extract_bytes * elts_per_byte;
>for (unsigned int i = 0; i < extract_elts; ++i)
> {
> + /* Don't encode any bit beyond the range of the vector.  */
> + if (first_elt + i > count)
> +   break;

Hmm.  I think that VECTOR_CST_ELT should have ICEd for out-of-bound
element queries but it seems to make up elements for us here.  Richard?

But yes, we do

  unsigned int extract_elts = extract_bytes * elts_per_byte;

and since native_encode_* and native_interpret_* operate on bytes we have
difficulties dealing with bit-precision entities with padding.

There's either the possibility to fail encoding when that happens or
do something else.  Note that RTL expansion will do

case VECTOR_CST:
  {
tree tmp = NULL_TREE; 
if (VECTOR_MODE_P (mode))
  return const_vector_from_tree (exp);
scalar_int_mode int_mode;
if (is_int_mode (mode, &int_mode))
  {
tree type_for_mode = lang_hooks.types.type_for_mode (int_mode, 1);
if (type_for_mode)
  tmp = fold_unary_loc (loc, VIEW_CONVERT_EXPR,
type_for_mode, exp);

which I think should always succeed (otherwise it falls back to expanding
a CTOR).  That means failing to encode/interpret might get into
store_constructor which I think will zero a register destination and thus
fill padding with zeros.

So yeah, something like this looks OK, but I think instead of only
testing against 'count' we should also test against TYPE_VECTOR_SUBPARTS
(that might be variable, so with known_gt).

Would be interesting to see whether this fixes the issue without the
now installed patch.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from Hongtao Liu  ---
Fixed.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #9 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:578c7b91f418ebbef1bf169117815409e06f5197

commit r14-8413-g578c7b91f418ebbef1bf169117815409e06f5197
Author: Richard Biener 
Date:   Wed Jan 24 14:55:49 2024 +0100

tree-optimization/113576 - non-empty latch and may_be_zero vectorization

We can't support niters with may_be_zero when we end up with a
non-empty latch due to early exit peeling.  At least not in
the simplistic way the vectorizer handles this now.  Disallow
it again for exits that are not the last one.

PR tree-optimization/113576
* tree-vect-loop.cc (vec_init_loop_exit_info): Only allow
exits with may_be_zero niters when its the last one.

* gcc.dg/vect/pr113576.c: New testcase.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #8 from Hongtao Liu  ---

maybe 

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..6d321f9baef 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned char
*ptr, int len,
   unsigned int extract_elts = extract_bytes * elts_per_byte;
   for (unsigned int i = 0; i < extract_elts; ++i)
{
+ /* Don't encode any bit beyond the range of the vector.  */
+ if (first_elt + i > count)
+   break;

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #7 from Hongtao Liu  ---
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..33a8d539b4d 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned char
*ptr, int len,

   unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
   unsigned int first_elt = off * elts_per_byte;
-  unsigned int extract_elts = extract_bytes * elts_per_byte;
+  unsigned int extract_elts = count;
   for (unsigned int i = 0; i < extract_elts; ++i)
{
  tree elt = VECTOR_CST_ELT (expr, first_elt + i);

Shouldn't we use count here? (It also fixes the hang.)

Also, even though the vector boolean type has only 4 elements, VECTOR_CST_ELT
(expr, 5) still returns -1; not sure whether that's reasonable.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #6 from Hongtao Liu  ---
Another potential buggy place is 

240  vexit_reduc_67 = mask_patt_43.28_62 & mask_patt_43.28_63;
241  if (vexit_reduc_67 == { -1, -1, -1, -1 })
242goto ; [94.50%]
243  else


is expanded to 

 319(insn 69 68 70 8 (set (reg:CCZ 17 flags)
 320(compare:CCZ (reg:QI 189 [ vexit_reduc_67 ])
 321(const_int -1 [0xffffffffffffffff]))) "test.c":83:18 discrim 1
9 {*cmpqi_1}

But it should only test the lower 4 bits; the higher part is zeroed by the
AVX512 comparison instructions.

 293(insn 65 64 66 8 (set (reg:QI 186 [ mask_patt_43.28_62 ])
 294(unspec:QI [
 295(reg:V4DI 124 [ vect__29.26 ])
 296(reg:V4DI 185)
 297(const_int 0 [0])
 298] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811
{avx512vl_cmpv4di3}
 299 (nil))
 300(insn 66 65 67 8 (set (reg:V4DI 187)
 301(const_vector:V4DI [
 302(const_int 0 [0]) repeated x4
 303])) "test.c":83:18 discrim 1 2021 {movv4di_internal}
 304 (nil))
 305(insn 67 66 68 8 (set (reg:QI 188 [ mask_patt_43.28_63 ])
 306(unspec:QI [
 307(reg:V4DI 125 [ vect__29.27 ])
 308(reg:V4DI 187)
 309(const_int 0 [0])
 310] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811
{avx512vl_cmpv4di3}
 311 (nil))
 312(insn 68 67 69 8 (parallel [
 313(set (reg:QI 189 [ vexit_reduc_67 ])
 314(and:QI (reg:QI 186 [ mask_patt_43.28_62 ])
 315(reg:QI 188 [ mask_patt_43.28_63 ])))
 316(clobber (reg:CC 17 flags))
 317]) "test.c":83:18 discrim 1 618 {*andqi_1}
 318 (nil))

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #5 from Richard Biener  ---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index fe631252dc2..28ad03e0b8a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -991,8 +991,12 @@ vec_init_loop_exit_info (class loop *loop)
{
  tree may_be_zero = niter_desc.may_be_zero;
  if ((integer_zerop (may_be_zero)
-  || integer_nonzerop (may_be_zero)
-  || COMPARISON_CLASS_P (may_be_zero))
+  /* As we are handling may_be_zero that's not false by
+ rewriting niter to may_be_zero ? 0 : niter we require
+ an empty latch.  */
+  || (exit->src == single_pred (loop->latch)
+  && (integer_nonzerop (may_be_zero)
+  || COMPARISON_CLASS_P (may_be_zero))))
  && (!candidate
  || dominated_by_p (CDI_DOMINATORS, exit->src,
 candidate->src)))

fixes it, I'm testing this.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org
 Status|NEW |ASSIGNED
 CC||rguenth at gcc dot gnu.org,
   ||tnfchris at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
So with the niter analysis part, what's different is how we deal with the
case of 'may_be_zero': for a loop with a non-empty latch, a zero means
we do not execute the latch, but the traditional 'niter' means the number
of latch executions.

That means the vectorizer simply implementing may_be_zero as
niter = may_be_zero ? 0 : niter is flawed (if it were that easy we
wouldn't have this extra field).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #3 from Richard Biener  ---
So the change enables early exit vectorization since may_be_zero is _10 == 0
here, resulting in an overall

number_of_iterationsm1 = _10 != 0 ? _10 + 4294967295 : 0

and

number_of_iterations = MAX_EXPR <_10, 1>

We're vectorizing the induction to re-start the iteration after exit which
I think is all OK (but it must be broken...).  But we're stuck in a
not vectorized loop,

94/* Skip bits that are zero.  */
95for (; (i->word & 1) == 0; i->word >>= 1)
96  i->bit_num++;

   0x00400890 <+528>:   shr%rdx
   0x00400893 <+531>:   mov%ecx,%eax
   0x00400895 <+533>:   inc%ecx
=> 0x00400897 <+535>:   test   $0x1,%dl
   0x0040089a <+538>:   je 0x400890 

and %rdx is zero.

I think the vector IL is sound.

Disabling cunroll allows the testcase to pass, so it might be an error on
the upper bound of its iterations (but I think that's OK, too).

What looks a bit odd is the condition for skipping the epilogue, which
we should never do for LOOP_VINFO_EARLY_BREAKS_VECT_PEELED; we're using

  /* If we have a peeled vector iteration we will never skip the epilog
loop
 and we can simplify the cfg a lot by not doing the edge split.  */
  if (skip_epilog || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
{
  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
niters, niters_vector_mult_vf);

here, but I think niters_vector_mult_vf is wrong (but it doesn't matter
here, still it looks bogus).

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 Ever confirmed|0   |1
   Last reconfirmed||2024-01-24
   Priority|P3  |P1
 Status|UNCONFIRMED |NEW
   Keywords||wrong-code

--- Comment #2 from Richard Biener  ---
Confirmed.  Let me have a look.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #1 from Hongtao Liu  ---
int
__attribute__((noinline))
sbitmap_first_set_bit (const_sbitmap bmap)
{
  unsigned int n = 0;
  sbitmap_iterator sbi;

  EXECUTE_IF_SET_IN_SBITMAP (bmap, 0, n, sbi)
return n;
  return -1;
}

hangs on this function, it's a vect early break case.

 78.L10:
 79shrq%rdx
 80movl%ecx, %eax
 81incl%ecx
 82testb   $1, %dl
 83je  .L10
 84ret

hangs on this scalar part.