[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #57 from Hongtao Liu ---
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

Makes sense.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #56 from Uroš Bizjak ---
The testcase is fixed with g:430c772be3382134886db33133ed466c02efc71c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #55 from Uroš Bizjak ---
(In reply to Jakub Jelinek from comment #53)
> Comment on attachment 57424 [details]
> Proposed testsuite patch
>
> As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f
> effective target and testing it at runtime IMHO isn't enough.
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

I think that addition of

+# if defined(__AVX512VL__)
+  want_level = 7, want_b = bit_AVX512VL;
+# elif defined(__AVX512F__)
+  want_level = 7, want_b = bit_AVX512F;
+# elif defined(__AVX2__)

to check_vect solves all current uses in gcc.dg/vect.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #54 from Richard Biener ---
Please also verify the bug still reproduces with the altered set of options.

What's the reason to have avx512-check.h in addition to tree-vect.h?  At
least for the vectorizer testsuite the latter is the canonical one; can we
please merge AVX512 support therein?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #53 from Jakub Jelinek ---
Comment on attachment 57424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57424
Proposed testsuite patch

As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f
effective target and testing it at runtime IMHO isn't enough.
For dg-do run testcases I really think we should avoid those -march=
options, because it means a lot of other stuff, BMI, LZCNT, ...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #52 from Uroš Bizjak ---
Created attachment 57424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57424&action=edit
Proposed testsuite patch

This patch fixes the failure for me (+ some other dg.exp/vect
inconsistencies).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #51 from Jakub Jelinek ---
From the -mavx* options I think -march=skylake-avx512 implies
-mavx512{f,cd,vl,bw,dq} but -mavx512f is implied by any of the latter 4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #50 from Jakub Jelinek ---
(In reply to Richard Biener from comment #49)
> (In reply to Uroš Bizjak from comment #48)
> > The runtime testcase fails on non-AVX512F x86 targets due to:
> >
> > /* { dg-do run } */
> > /* { dg-options "-O3" } */
> > /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
> > i?86-*-* } } } */
> >
> > but check_vect() only checks runtime support up to AVX2.
>
> Hmm, can we fix that?  We could change the above to { target avx512f_runtime }
> but that really only checks for AVX512F, not say AVX512VL ...
>
> I do remember using -mavx512vl wasn't enough to trigger the miscompile
> nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/

It is certainly preferable to add -mavx512{bw,dq,vl} or whatever the testcase
actually needs and then one can

#define AVX512BW
#define AVX512DQ
#define AVX512VL
#include "avx512-check.h"

and get checks for all those.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #49 from Richard Biener ---
(In reply to Uroš Bizjak from comment #48)
> The runtime testcase fails on non-AVX512F x86 targets due to:
>
> /* { dg-do run } */
> /* { dg-options "-O3" } */
> /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
> i?86-*-* } } } */
>
> but check_vect() only checks runtime support up to AVX2.

Hmm, can we fix that?  We could change the above to { target avx512f_runtime }
but that really only checks for AVX512F, not say AVX512VL ...

I do remember using -mavx512vl wasn't enough to trigger the miscompile
nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #48 from Uroš Bizjak ---
The runtime testcase fails on non-AVX512F x86 targets due to:

/* { dg-do run } */
/* { dg-options "-O3" } */
/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
i?86-*-* } } } */

but check_vect() only checks runtime support up to AVX2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|REOPENED    |RESOLVED
         Resolution|---         |FIXED

--- Comment #47 from Richard Biener ---
This should now be fixed, but I think the issue might be latent on branches
for GCN or for AVX512 via intrinsics (fully masked loops for AVX512 are also
only available in GCC 14).

Let's close this P1; we have to consider backporting when we manage to create
a miscompiled testcase for older releases.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #46 from GCC Commits ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:5352ede92483b949e811cbdcdfaec5378f3e06d6

commit r14-8975-g5352ede92483b949e811cbdcdfaec5378f3e06d6
Author: Richard Biener
Date:   Fri Feb 9 08:15:44 2024 +0100

    middle-end/113576 - zero padding of vector bools when expanding compares

    The following zeros the padding of vector bools when expanding compares
    and the mode used for the compare is an integer mode.  In that case
    targets cannot distinguish between a 4-element and an 8-element vector
    compare (both get to the QImode compare optab), so we have to do the
    job in the middle-end.

            PR middle-end/113576
            * expr.cc (do_store_flag): For vector bool compares of vectors
            with padding zero that.
            * dojump.cc (do_compare_and_jump): Likewise.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #45 from Hongtao Liu ---
> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
>
> reasonable enough for me.

I mean we only handle it at the consumers where the upper bits matter.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #44 from Hongtao Liu ---
> Note the AND is removed by combine if I add it:
>
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
>     (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
>             (const_int 15 [0xf]))
>         (const_int 0 [0])))
>
> (*testhi_not)
>
> -    9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
> +      REG_DEAD r99:QI
> +    9: NOTE_INSN_DELETED
> +   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
>       REG_DEAD r102:QI
> -      REG_UNUSED flags:CC
> -   12: flags:CCZ=cmp(r103:QI,0xf)
> -      REG_DEAD r103:QI
>
> and we get
>
> foo:
> .LFB0:
>         .cfi_startproc
>         notl    %esi
>         orl     %esi, %edi
>         notl    %edi
>         testb   $15, %dil
>         je      .L6
>         ret
>
> which I'm not sure is OK?

Yes, I think it's on purpose:

;; Split and;cmp (as optimized by combine) into not;test
;; Except when TARGET_BMI provides andn (*andn_<mode>_ccno).
(define_insn_and_split "*test<mode>_not"
  [(set (reg:CCZ FLAGS_REG)
        (compare:CCZ
          (and:SWI
            (not:SWI (match_operand:SWI 0 "register_operand"))
            (match_operand:SWI 1 ""))
          (const_int 0)))]
  "ix86_pre_reload_split ()
   && (!TARGET_BMI || !REG_P (operands[1]))"
  "#"
  "&& 1"
  [(set (match_dup 2) (not:SWI (match_dup 0)))
   (set (reg:CCZ FLAGS_REG)
        (compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
                     (const_int 0)))]
  "operands[2] = gen_reg_rtx (<MODE>mode);")

;; Split and;cmp (as optimized by combine) into andn;cmp $0
(define_insn_and_split "*test<dwi>_not_doubleword"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #43 from Hongtao Liu ---
> Well, yes, the discussion in this bug was whether to do this at consumers
> (that's sth new) or with all mask operations (that's how we handle
> bit-precision integer operations, so it might be relatively easy to
> do that - specifically spot the places eventually needing adjustment).
>
> There's do_store_flag to fixup for uses not in branches and
> do_compare_and_jump for conditional jumps.

reasonable enough for me.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #42 from Richard Biener ---
And the do_store_flag part:

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fc5e998e329..44d64274071 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -13693,6 +13693,19 @@ do_store_flag (sepops ops, rtx target, machine_mode mode)
     subtarget = 0;

   expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL);

+  unsigned HOST_WIDE_INT nunits;
+  if (VECTOR_BOOLEAN_TYPE_P (type)
+      && operand_mode == QImode
+      && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+      && nunits < BITS_PER_UNIT)
+    {
+      op0 = expand_binop (mode, and_optab, op0,
+                          GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                          true, OPTAB_WIDEN);
+      op1 = expand_binop (mode, and_optab, op1,
+                          GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                          true, OPTAB_WIDEN);
+    }
   if (target == 0)
     target = gen_reg_rtx (mode);

for the testcase

typedef long v4si __attribute__((vector_size (4 * sizeof (long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision (1)));

_Bool x;

void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;
  _Bool _7;

  __BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
                             _Literal (sbool1) -1, _Literal (sbool1) -1 };
  _7 = tem_3 == tem_4;
  x = _7;
  return;
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #41 from Richard Biener ---
(In reply to Hongtao Liu from comment #38)
> > I think we should also mask off the upper bits of variable mask?
> >
> > notl    %esi
> > orl     %esi, %edi
> > notl    %edi
> > andl    $15, %edi
> > je      .L3
>
> with -mbmi, it's
>
> andn    %esi, %edi, %edi
> andl    $15, %edi
> je      .L3

Well, yes, the discussion in this bug was whether to do this at consumers
(that's sth new) or with all mask operations (that's how we handle
bit-precision integer operations, so it might be relatively easy to
do that - specifically spot the places eventually needing adjustment).

There's do_store_flag to fixup for uses not in branches and
do_compare_and_jump for conditional jumps.

Note the AND is removed by combine if I add it:

Successfully matched this instruction:
(set (reg:CCZ 17 flags)
    (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
            (const_int 15 [0xf]))
        (const_int 0 [0])))

(*testhi_not)

-    9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
+      REG_DEAD r99:QI
+    9: NOTE_INSN_DELETED
+   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
      REG_DEAD r102:QI
-      REG_UNUSED flags:CC
-   12: flags:CCZ=cmp(r103:QI,0xf)
-      REG_DEAD r103:QI

and we get

foo:
.LFB0:
        .cfi_startproc
        notl    %esi
        orl     %esi, %edi
        notl    %edi
        testb   $15, %dil
        je      .L6
        ret

which I'm not sure is OK?

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index e2d2b3cb111..784707c1e55 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   machine_mode mode;
   int unsignedp;
   enum rtx_code code;
+  unsigned HOST_WIDE_INT nunits;

   /* Don't crash if the comparison was erroneous.  */
   op0 = expand_normal (treeop0);
@@ -1308,6 +1309,18 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
       emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, op1));
       op1 = new_op1;
     }
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+           && mode == QImode
+           && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+           && nunits < BITS_PER_UNIT)
+    {
+      op0 = expand_binop (mode, and_optab, op0,
+                          GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                          true, OPTAB_WIDEN);
+      op1 = expand_binop (mode, and_optab, op1,
+                          GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                          true, OPTAB_WIDEN);
+    }

   do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
                            ((mode == BLKmode)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #40 from Jakub Jelinek ---
For unsigned _BitInt(4) or unsigned _BitInt(2) we mask it whenever loading
from memory or a function argument or whatever other ABI-specific spot (and
also when storing, because that is how RTL expects it; because of that we
don't mask it when using it from, say, automatic variables where we know
we've initialized it ourselves).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #39 from Hongtao Liu ---
> > the question is whether that matches the semantics of GIMPLE (the padding
> > is inverted, too), whether it invokes undefined behavior (don't do it - it
> > seems for people using intrinsics that's what it is?)

For the intrinsics, the instructions only care about the lower bits, so it's
not a big issue?  And it sounds like a similar issue as _BitInt(4)/_BitInt(2);
I assume there's garbage in the upper bits.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #38 from Hongtao Liu ---
> I think we should also mask off the upper bits of variable mask?
>
> notl    %esi
> orl     %esi, %edi
> notl    %edi
> andl    $15, %edi
> je      .L3

with -mbmi, it's

andn    %esi, %edi, %edi
andl    $15, %edi
je      .L3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #37 from Hongtao Liu ---
(In reply to Richard Biener from comment #36)
> For example with AVX512VL and the following, using -O -fgimple -mavx512vl
> we get simply
>
> notl    %esi
> orl     %esi, %edi
> cmpb    $15, %dil
> je      .L6
>
> typedef long v4si __attribute__((vector_size (4 * sizeof (long))));
> typedef v4si v4sib __attribute__((vector_mask));
> typedef _Bool sbool1 __attribute__((signed_bool_precision (1)));
>
> void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
> {
>   v4sib tem;
>
>   __BB(2):
>   tem_5 = ~v2_2(D);
>   tem_3 = v1_1(D) | tem_5;
>   tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
>                              _Literal (sbool1) -1, _Literal (sbool1) -1 };
>   if (tem_3 == tem_4)
>     goto __BB3;
>   else
>     goto __BB4;
>
>   __BB(3):
>   __builtin_abort ();
>
>   __BB(4):
>   return;
> }
>
> the question is whether that matches the semantics of GIMPLE (the padding
> is inverted, too), whether it invokes undefined behavior (don't do it - it
> seems for people using intrinsics that's what it is?) or whether we
> should avoid affecting padding.
>
> Note after the patch I proposed on the mailing list the constant mask is
> now expanded with zero padding.

I think we should also mask off the upper bits of variable mask?

notl    %esi
orl     %esi, %edi
notl    %edi
andl    $15, %edi
je      .L3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #36 from Richard Biener ---
For example with AVX512VL and the following, using -O -fgimple -mavx512vl
we get simply

notl    %esi
orl     %esi, %edi
cmpb    $15, %dil
je      .L6

typedef long v4si __attribute__((vector_size (4 * sizeof (long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision (1)));

void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;

  __BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
                             _Literal (sbool1) -1, _Literal (sbool1) -1 };
  if (tem_3 == tem_4)
    goto __BB3;
  else
    goto __BB4;

  __BB(3):
  __builtin_abort ();

  __BB(4):
  return;
}

the question is whether that matches the semantics of GIMPLE (the padding
is inverted, too), whether it invokes undefined behavior (don't do it - it
seems for people using intrinsics that's what it is?) or whether we
should avoid affecting padding.

Note after the patch I proposed on the mailing list the constant mask is
now expanded with zero padding.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |ams at gcc dot gnu.org

--- Comment #35 from Richard Biener ---
(In reply to Richard Sandiford from comment #34)
> (In reply to Richard Biener from comment #32)
> > Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> > the lowest 4 or 2 bits.
> >
> > It all feels like disaster waiting to happen ;)
> Yes :)
>
> > For example BIT_NOT_EXPR is RTL expanded like
> >
> >     case BIT_NOT_EXPR:
> >       op0 = expand_expr (treeop0, subtarget,
> >                          VOIDmode, EXPAND_NORMAL);
> >       if (modifier == EXPAND_STACK_PARM)
> >         target = 0;
> >       /* In case we have to reduce the result to bitfield precision
> >          for unsigned bitfield expand this as XOR with a proper constant
> >          instead.  */
> >       if (reduce_bit_field && TYPE_UNSIGNED (type))
> >         {
> >           int_mode = SCALAR_INT_TYPE_MODE (type);
> >           wide_int mask = wi::mask (TYPE_PRECISION (type),
> >                                     false, GET_MODE_PRECISION (int_mode));
> >
> >           temp = expand_binop (int_mode, xor_optab, op0,
> >                                immed_wide_int_const (mask, int_mode),
> >                                target, 1, OPTAB_LIB_WIDEN);
> >
> > so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> > effective bit-precision set reduce_bit_field and fixup the fallout
> > (not sure why the above is only for TYPE_UNSIGNED).
> >
> > At least it feels similar and doing things the opposite for vectors
> > (fixing up at uses) would be odd?
> Do you know why we take this approach for integers?  Is it for
> correctness?  Or is it supposed to be more optimal?

It's done for correctness.  The main thing was bitfields > int which end up
as bit-precision types in GIMPLE.

> I can imagine that, for arithmetic types, there are going to be many
> more instances where upper bits matter (division, right shifts,
> MIN/MAX, etc.).  So perhaps reducing every result is a good
> trade-off there.

I think it was the easiest place to fix up and to make sure later RTL opts
and backends do not interfere.

> But there's an argument that it should be rare for the padding
> bits in a vector to matter, since very few things would look at the
> padding bits anyway.  So perhaps the cost should be borne by the
> operations that need canonical integers.

But what happens when an operation not needing canonical integers is
transformed, say by combine or simplify-rtx, to one that needs them?  I
think the reduce_bitfield code was trying to be safe here.

Actual define_insns not needing canonical padding could do like we now
require scalar shifts - they could match a variant with the mask
canonicalization op and hope for combine eliminating that.  Of course that
will explode in case it's the majority of cases ...

> Not a strong opinion though, more just devil's advocate.

Yeah.  The important thing seems to be that the info this is a bit-precision
QImode isn't lost ... which points to that using plain integer modes was a
very bad choice ...

> There again, if e.g. the x86 API guarantees memcmp equality between
> two masks whose significant bits are equal, then we probably have
> no choice.

That's a good question, possibly most relevant for the OpenMP SIMD ABI.
The AVX512 APIs use integer types everywhere, there's no intrinsic for
ktest itself, but _mm512_kortestz and kortestc and also _mm512_knot (oddly
only for HImode).  So it at least seems to - from a quick look - be
"broken" in the intrinsic API as well.  At least we have _mm_cmp_sd_mask
producing a QImode mask, _mm512_knot inverting too many bits.  So the user
must be aware of the "padding".  One could argue it's the vectorizer's job
to fixup then, but of course it doesn't get to see the correct vector types
here.

Given we have the fixup in CTOR expansion the issue at hand could be fixed
by mirroring that in VECTOR_CST expansion for now.

Andrew - I suppose GCN also has cbranch, can you try to check what happens
for < 8 lane vector modes there?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #34 from Richard Sandiford ---
(In reply to Richard Biener from comment #32)
> Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> the lowest 4 or 2 bits.
>
> It all feels like disaster waiting to happen ;)
Yes :)

> For example BIT_NOT_EXPR is RTL expanded like
>
>     case BIT_NOT_EXPR:
>       op0 = expand_expr (treeop0, subtarget,
>                          VOIDmode, EXPAND_NORMAL);
>       if (modifier == EXPAND_STACK_PARM)
>         target = 0;
>       /* In case we have to reduce the result to bitfield precision
>          for unsigned bitfield expand this as XOR with a proper constant
>          instead.  */
>       if (reduce_bit_field && TYPE_UNSIGNED (type))
>         {
>           int_mode = SCALAR_INT_TYPE_MODE (type);
>           wide_int mask = wi::mask (TYPE_PRECISION (type),
>                                     false, GET_MODE_PRECISION (int_mode));
>
>           temp = expand_binop (int_mode, xor_optab, op0,
>                                immed_wide_int_const (mask, int_mode),
>                                target, 1, OPTAB_LIB_WIDEN);
>
> so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> effective bit-precision set reduce_bit_field and fixup the fallout
> (not sure why the above is only for TYPE_UNSIGNED).
>
> At least it feels similar and doing things the opposite for vectors
> (fixing up at uses) would be odd?

Do you know why we take this approach for integers?  Is it for
correctness?  Or is it supposed to be more optimal?

I can imagine that, for arithmetic types, there are going to be many
more instances where upper bits matter (division, right shifts,
MIN/MAX, etc.).  So perhaps reducing every result is a good
trade-off there.

But there's an argument that it should be rare for the padding
bits in a vector to matter, since very few things would look at the
padding bits anyway.  So perhaps the cost should be borne by the
operations that need canonical integers.

Not a strong opinion though, more just devil's advocate.

There again, if e.g. the x86 API guarantees memcmp equality between
two masks whose significant bits are equal, then we probably have
no choice.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Jakub Jelinek changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |jakub at gcc dot gnu.org

--- Comment #33 from Jakub Jelinek ---
I guess inverting just 2 or 4 bits can be done with kxorb or knotb + kandb,
but we need to have the mask forced in some mask register.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #32 from Richard Biener ---
Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
the lowest 4 or 2 bits.

It all feels like disaster waiting to happen ;)

For example BIT_NOT_EXPR is RTL expanded like

    case BIT_NOT_EXPR:
      op0 = expand_expr (treeop0, subtarget,
                         VOIDmode, EXPAND_NORMAL);
      if (modifier == EXPAND_STACK_PARM)
        target = 0;
      /* In case we have to reduce the result to bitfield precision
         for unsigned bitfield expand this as XOR with a proper constant
         instead.  */
      if (reduce_bit_field && TYPE_UNSIGNED (type))
        {
          int_mode = SCALAR_INT_TYPE_MODE (type);
          wide_int mask = wi::mask (TYPE_PRECISION (type),
                                    false, GET_MODE_PRECISION (int_mode));

          temp = expand_binop (int_mode, xor_optab, op0,
                               immed_wide_int_const (mask, int_mode),
                               target, 1, OPTAB_LIB_WIDEN);

so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
effective bit-precision set reduce_bit_field and fixup the fallout
(not sure why the above is only for TYPE_UNSIGNED).

At least it feels similar and doing things the opposite for vectors
(fixing up at uses) would be odd?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #31 from rguenther at suse dot de ---
On Tue, 30 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
>
> --- Comment #30 from Richard Sandiford ---
> (In reply to Richard Biener from comment #29)
> > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> > VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> > masking for VECTOR_CST expansion as well, instead of trying to fixup
> > in do_compare_and_jump?
> But then how would ~ be implemented for things like 4-bit masks?
> If we use notqi2 then I assume the upper bits could be 1 rather than 0.

Yeah, I guess it's similar to expand_expr_real_1 'reduce_bit_field'
handling - we'd need to insert fixup code in strategic places (or for ~
use xor with the proper mask).  The difficulty is that we can't make the
backend do this unless there are insn operands that allow it to infer the
real precision of the mode.  And for most insns the excess bits are
irrelevant anyway.

Still the CTOR case showed wrong-code issues with GCN, which possibly
means it has the same issue with VECTOR_CSTs as well.  IIRC that was that
all vectors are 1024 bits, and its "fake" V4SImode insns rely on
accurately masked out upper bits.

That might hint that compares are not enough here (but for non-compares
the backend might have a chance to fixup by inferring the max. number of
active elements).  If we think that compares (but that would also be
compares without jump, aka a == b | c == d) are the only problematical
case, we can also fixup at the uses rather than at the defs as
'reduce_bit_field' tries to do.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #30 from Richard Sandiford ---
(In reply to Richard Biener from comment #29)
> But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> masking for VECTOR_CST expansion as well, instead of trying to fixup
> in do_compare_and_jump?

But then how would ~ be implemented for things like 4-bit masks?
If we use notqi2 then I assume the upper bits could be 1 rather than 0.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #29 from Richard Biener ---
(In reply to Hongtao Liu from comment #28)
> I saw we already mask off integral modes for vector mask in
> store_constructor:
>
>   /* Use sign-extension for uniform boolean vectors with
>      integer modes and single-bit mask entries.
>      Effectively "vec_duplicate" for bitmasks.  */
>   if (elt_size == 1
>       && !TREE_SIDE_EFFECTS (exp)
>       && VECTOR_BOOLEAN_TYPE_P (type)
>       && SCALAR_INT_MODE_P (TYPE_MODE (type))
>       && (elt = uniform_vector_p (exp))
>       && !VECTOR_TYPE_P (TREE_TYPE (elt)))
>     {
>       rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
>                            expand_normal (elt));
>       rtx tmp = gen_reg_rtx (mode);
>       convert_move (tmp, op0, 0);
>
>       /* Ensure no excess bits are set.
>          GCN needs this for nunits < 64.
>          x86 needs this for nunits < 8.  */
>       auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
>       if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
>         tmp = expand_binop (mode, and_optab, tmp,
>                             GEN_INT ((1 << nunits) - 1), target,
>                             true, OPTAB_WIDEN);
>       if (tmp != target)
>         emit_move_insn (target, tmp);
>       break;
>     }

But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
masking for VECTOR_CST expansion as well, instead of trying to fixup
in do_compare_and_jump?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #28 from Hongtao Liu ---
I saw we already mask off integral modes for vector mask in
store_constructor:

  /* Use sign-extension for uniform boolean vectors with
     integer modes and single-bit mask entries.
     Effectively "vec_duplicate" for bitmasks.  */
  if (elt_size == 1
      && !TREE_SIDE_EFFECTS (exp)
      && VECTOR_BOOLEAN_TYPE_P (type)
      && SCALAR_INT_MODE_P (TYPE_MODE (type))
      && (elt = uniform_vector_p (exp))
      && !VECTOR_TYPE_P (TREE_TYPE (elt)))
    {
      rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
                           expand_normal (elt));
      rtx tmp = gen_reg_rtx (mode);
      convert_move (tmp, op0, 0);

      /* Ensure no excess bits are set.
         GCN needs this for nunits < 64.
         x86 needs this for nunits < 8.  */
      auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
      if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
        tmp = expand_binop (mode, and_optab, tmp,
                            GEN_INT ((1 << nunits) - 1), target,
                            true, OPTAB_WIDEN);
      if (tmp != target)
        emit_move_insn (target, tmp);
      break;
    }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #27 from Richard Biener ---
(In reply to Hongtao Liu from comment #25)
> (In reply to Tamar Christina from comment #24)
> > Just to avoid confusion, are you still working on this one Richi?
>
> I'm working on a patch to add a target hook as #c18 mentioned.

Not sure a target hook was suggested - I think it was suggested that
do_compare_and_jump always masks excess bits for integer mode vector masks?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #26 from Tamar Christina ---
Ah great, just checking it wasn't left unattended :)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #25 from Hongtao Liu ---
(In reply to Tamar Christina from comment #24)
> Just to avoid confusion, are you still working on this one Richi?

I'm working on a patch to add a target hook as #c18 mentioned.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

--- Comment #24 from Tamar Christina ---
Just to avoid confusion, are you still working on this one Richi?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Tamar Christina changed:

           What        |Removed     |Added
-------------------------------------------
                     CC|            |acoplan at gcc dot gnu.org

--- Comment #23 from Tamar Christina --- *** Bug 113661 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #22 from Hongtao Liu ---

typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;
typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;
#define GMP_LIMB_BITS (sizeof (mp_limb_t) * 8)
#define GMP_LIMB_MAX (~(mp_limb_t) 0)

mp_bitcnt_t __attribute__((noipa))
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up,
		 mp_size_t un, mp_limb_t ux)
{
  unsigned cnt;
  while (limb == 0)
    {
      i++;
      if (i == un)
	return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
      limb = ux ^ up[i];
    }
  return limb;
}

int
main ()
{
  mp_limb_t up[1];
  for (int i = 0; i != 1; i++)
    up[i] = 1 << 8;
  up[2000] = 1;
  mp_bitcnt_t res = mpn_common_scan (0, 0, up, 1, 1 << 8);
  if (res != 257)
    __builtin_abort ();
  return 1;
}

aborted with -O3 -march=skylake-avx512.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #21 from Hongtao Liu ---

typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;
typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;
#define GMP_LIMB_BITS (sizeof (mp_limb_t) * 8)
#define GMP_LIMB_MAX (~(mp_limb_t) 0)

mp_bitcnt_t
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up,
		 mp_size_t un, mp_limb_t ux)
{
  unsigned cnt;
  while (limb == 0)
    {
      i++;
      if (i == un)
	return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
      limb = ux ^ up[i];
    }
  return limb;
}

This one is miscompiled in 502.gcc_r:

  [local count: 862990464]:
  _34 = ivtmp.20_20 * 32;
  vect__5.15_59 = MEM [(const mp_limb_t *)vectp.14_53 + _34 * 1];
  mask_patt_9.16_61 = vect__5.15_59 == vect_cst__60;
  ivtmp.20_32 = ivtmp.20_20 + 1;
  if (mask_patt_9.16_61 == { -1, -1, -1, -1 })
    goto ; [94.50%]
  else
    goto ; [5.50%]

is expanded to

.L18:
	movq	%rdi, %rdx
	incq	%rdi
	salq	$5, %rdx
	vpcmpeqq	(%rax,%rdx), %ymm3, %k0
	kmovb	%k0, %edx
	cmpb	$-1, %dl
	jne	.L21
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #20 from Hongtao Liu ---

> Note that I wonder how to eliminate redundant maskings? I suppose
> eventually combine tracking nonzero bits where obvious would do
> that? For example for cmp:V4SI we know the bits will be zero but
> I wonder if the RTL IL is obvious enough to derive this (or whether
> there's a target hook for extra nonzero bit discovery, say for
> unspecs).

I guess we need extra patterns to make combine know; we already have those for zero_extend:

;; Since vpcmpd implicitly clear the upper bits of dest, transform
;; vpcmpd + zero_extend to vpcmpd since the instruction
(define_insn_and_split "*_cmp3_zero_extend"
  [(set (match_operand:SWI248x 0 "register_operand")
	(zero_extend:SWI248x
	  (unspec:
	    [(match_operand:V48H_AVX512VL 1 "nonimmediate_operand")
	     (match_operand:V48H_AVX512VL 2 "nonimmediate_operand")
	     (match_operand:SI 3 "const_0_to_7_operand")]
	    UNSPEC_PCMP)))]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #19 from rguenther at suse dot de --- On Thu, 25 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> --- Comment #18 from Richard Sandiford ---
> (In reply to Tamar Christina from comment #17)
> > Well the mid-end has generated the right precision. The type it
> > generates is vector(4) vexit_reduc_67;
> > so it does say it's a single bit boolean.
> >
> > Isn't this just an expand problem?
> That's what I meant. expand is using a QImode comparison to compare
> things with 4-bit precision, so I think the masking should happen at
> that point.
>
> How about doing the masking in do_compare_and_jump?

That sounds sensible.

Note that I wonder how to eliminate redundant maskings? I suppose eventually combine tracking nonzero bits where obvious would do that? For example for cmp:V4SI we know the bits will be zero but I wonder if the RTL IL is obvious enough to derive this (or whether there's a target hook for extra nonzero bit discovery, say for unspecs).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #18 from Richard Sandiford --- (In reply to Tamar Christina from comment #17) > Well the mid-end has generated the right precision. The type it generates is > vector(4) vexit_reduc_67; > so it does say it's a single bit boolean. > > Isn't this just an expand problem? That's what I meant. expand is using a QImode comparison to compare things with 4-bit precision, so I think the masking should happen at that point. How about doing the masking in do_compare_and_jump?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #17 from Tamar Christina --- Well the mid-end has generated the right precision. The type it generates is vector(4) vexit_reduc_67; so it does say it's a single bit boolean. Isn't this just an expand problem?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #16 from Richard Sandiford --- (In reply to Richard Biener from comment #15) > I think the problem is the cbranch pattern which looks at all of the > QImode mask - but of course it doesn't know it's really V4BImode it's > working on ... Yeah. Currently building an x86_64 toolchain to have a look, but I think whatever code uses a cbranch with a higher precision than the inputs should mask off the excess bits beforehand. cbranch can also be used for comparing two variable masks. I suppose we could add a hook to say that padding bits of an integer mask must always be zero (and so it's expand's job to ensure that that holds for any mask operation). But it feels dangerously close to TRULY_NOOP_TRUNCATION for integers.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #15 from Richard Biener --- (In reply to Richard Sandiford from comment #13) > I don't think there's any principle that upper bits must be zero. > How do we end up with a pattern that depends on that being the case? I think the problem is the cbranch pattern which looks at all of the QImode mask - but of course it doesn't know it's really V4BImode it's working on ... If there's no principle that the upper bits should be zero I think we need a way for the target to say so.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Hongtao Liu changed:

           What        |Removed     |Added
-------------------------------------------
             Resolution|FIXED       |---
                 Status|RESOLVED    |REOPENED

--- Comment #14 from Hongtao Liu --- The testcase attached is ok now, but 502.gcc_r is miscompiled after the installed patch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #13 from Richard Sandiford --- I don't think there's any principle that upper bits must be zero. How do we end up with a pattern that depends on that being the case?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #12 from rguenther at suse dot de --- On Thu, 25 Jan 2024, liuhongt at gcc dot gnu.org wrote:

> --- Comment #7 from Hongtao Liu ---
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..33a8d539b4d 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
>
>    unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
>    unsigned int first_elt = off * elts_per_byte;
> -  unsigned int extract_elts = extract_bytes * elts_per_byte;
> +  unsigned int extract_elts = count;
>    for (unsigned int i = 0; i < extract_elts; ++i)
>      {
>        tree elt = VECTOR_CST_ELT (expr, first_elt + i);
>
> Shouldn't we use count here? (it also fixed the hanged issue).

extract_bytes is capped by the buffer 'len':

  int total_bytes = CEIL (elt_bits * count, BITS_PER_UNIT);
  ..
  int extract_bytes = MIN (len, total_bytes - off);

we'd still need to effectively do that. But yeah, using CEIL makes extract_elts off. Maybe we should simply calculate extract_bits instead (but then use uint64 for that)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener changed:

           What        |Removed     |Added
-------------------------------------------
                     CC|            |rsandifo at gcc dot gnu.org

--- Comment #11 from Richard Biener --- (In reply to Hongtao Liu from comment #8)

> maybe
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..6d321f9baef 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
>    unsigned int extract_elts = extract_bytes * elts_per_byte;
>    for (unsigned int i = 0; i < extract_elts; ++i)
>      {
> +      /* Don't encode any bit beyond the range of the vector. */
> +      if (first_elt + i > count)
> +	break;

Hmm. I think that VECTOR_CST_ELT should have ICEd for out-of-bound element queries but it seems to make up elements for us here. Richard?

But yes, we do

  unsigned int extract_elts = extract_bytes * elts_per_byte;

and since native_encode_* and native_interpret_* operate on bytes we have difficulties dealing with bit-precision entities with padding. There's either the possibility to fail encoding when that happens or do something else. Note that RTL expansion will do

    case VECTOR_CST:
      {
	tree tmp = NULL_TREE;
	if (VECTOR_MODE_P (mode))
	  return const_vector_from_tree (exp);
	scalar_int_mode int_mode;
	if (is_int_mode (mode, &int_mode))
	  {
	    tree type_for_mode
	      = lang_hooks.types.type_for_mode (int_mode, 1);
	    if (type_for_mode)
	      tmp = fold_unary_loc (loc, VIEW_CONVERT_EXPR,
				    type_for_mode, exp);

which I think should always succeed (otherwise it falls back to expanding a CTOR). That means failing to encode/interpret might get into store_constructor which I think will zero a register destination and thus fill padding with zeros.

So yeah, something like this looks OK, but I think instead of only testing against 'count' we should also test against TYPE_VECTOR_SUBPARTS (that might be variable, so with known_gt). Would be interesting to see whether this fixes the issue without the now installed patch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Hongtao Liu changed:

           What        |Removed     |Added
-------------------------------------------
                 Status|ASSIGNED    |RESOLVED
             Resolution|---         |FIXED

--- Comment #10 from Hongtao Liu --- Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #9 from GCC Commits --- The master branch has been updated by Richard Biener:

https://gcc.gnu.org/g:578c7b91f418ebbef1bf169117815409e06f5197

commit r14-8413-g578c7b91f418ebbef1bf169117815409e06f5197
Author: Richard Biener
Date:   Wed Jan 24 14:55:49 2024 +0100

    tree-optimization/113576 - non-empty latch and may_be_zero vectorization

    We can't support niters with may_be_zero when we end up with a
    non-empty latch due to early exit peeling.  At least not in the
    simplistic way the vectorizer handles this now.  Disallow it again
    for exits that are not the last one.

            PR tree-optimization/113576
            * tree-vect-loop.cc (vec_init_loop_exit_info): Only allow
            exits with may_be_zero niters when its the last one.

            * gcc.dg/vect/pr113576.c: New testcase.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #8 from Hongtao Liu --- maybe

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..6d321f9baef 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
   unsigned int extract_elts = extract_bytes * elts_per_byte;
   for (unsigned int i = 0; i < extract_elts; ++i)
     {
+      /* Don't encode any bit beyond the range of the vector.  */
+      if (first_elt + i > count)
+	break;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #7 from Hongtao Liu ---

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..33a8d539b4d 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
   unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
   unsigned int first_elt = off * elts_per_byte;
-  unsigned int extract_elts = extract_bytes * elts_per_byte;
+  unsigned int extract_elts = count;
   for (unsigned int i = 0; i < extract_elts; ++i)
     {
       tree elt = VECTOR_CST_ELT (expr, first_elt + i);

Shouldn't we use count here? (It also fixes the hang.)

Also, even though the vector boolean type has only 4 elements, VECTOR_CST_ELT (expr, 5) still returns -1; not sure whether that's reasonable.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #6 from Hongtao Liu --- Another potentially buggy place:

  vexit_reduc_67 = mask_patt_43.28_62 & mask_patt_43.28_63;
  if (vexit_reduc_67 == { -1, -1, -1, -1 })
    goto ; [94.50%]
  else

is expanded to

(insn 69 68 70 8 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 189 [ vexit_reduc_67 ])
            (const_int -1 [0xffffffffffffffff]))) "test.c":83:18 discrim 1 9
 {*cmpqi_1}

But it should only test the lower 4 bits; the upper bits are zeroed by the AVX512 comparison instructions:

(insn 65 64 66 8 (set (reg:QI 186 [ mask_patt_43.28_62 ])
        (unspec:QI [
                (reg:V4DI 124 [ vect__29.26 ])
                (reg:V4DI 185)
                (const_int 0 [0])
            ] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811 {avx512vl_cmpv4di3}
   (nil))
(insn 66 65 67 8 (set (reg:V4DI 187)
        (const_vector:V4DI [
                (const_int 0 [0]) repeated x4
            ])) "test.c":83:18 discrim 1 2021 {movv4di_internal}
   (nil))
(insn 67 66 68 8 (set (reg:QI 188 [ mask_patt_43.28_63 ])
        (unspec:QI [
                (reg:V4DI 125 [ vect__29.27 ])
                (reg:V4DI 187)
                (const_int 0 [0])
            ] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811 {avx512vl_cmpv4di3}
   (nil))
(insn 68 67 69 8 (parallel [
            (set (reg:QI 189 [ vexit_reduc_67 ])
                (and:QI (reg:QI 186 [ mask_patt_43.28_62 ])
                    (reg:QI 188 [ mask_patt_43.28_63 ])))
            (clobber (reg:CC 17 flags))
        ]) "test.c":83:18 discrim 1 618 {*andqi_1}
   (nil))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #5 from Richard Biener ---

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index fe631252dc2..28ad03e0b8a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -991,8 +991,12 @@ vec_init_loop_exit_info (class loop *loop)
     {
       tree may_be_zero = niter_desc.may_be_zero;
       if ((integer_zerop (may_be_zero)
-	   || integer_nonzerop (may_be_zero)
-	   || COMPARISON_CLASS_P (may_be_zero))
+	   /* As we are handling may_be_zero that's not false by
+	      rewriting niter to may_be_zero ? 0 : niter we require
+	      an empty latch.  */
+	   || (exit->src == single_pred (loop->latch)
+	       && (integer_nonzerop (may_be_zero)
+		   || COMPARISON_CLASS_P (may_be_zero))))
	  && (!candidate
	      || dominated_by_p (CDI_DOMINATORS, exit->src,
				 candidate->src)))

fixes it, I'm testing this.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener changed:

           What        |Removed     |Added
-------------------------------------------
               Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
                 Status|NEW         |ASSIGNED
                     CC|            |rguenth at gcc dot gnu.org,
                       |            |tnfchris at gcc dot gnu.org

--- Comment #4 from Richard Biener --- So with the niter analysis part, what's different is how we deal with the case of 'may_be_zero': for a loop with a non-empty latch, a zero means we do not execute the latch, but the traditional 'niter' means the number of latch executions. That means the vectorizer simply implementing may_be_zero as niter = may_be_zero ? 0 : niter is flawed (if it were that easy we wouldn't have this extra field).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #3 from Richard Biener --- So the change enables early exit vectorization since may_be_zero is _10 == 0 here, resulting in an overall

  number_of_iterationsm1 == _10 != 0 ? _10 + 4294967295 : 0

and

  number_of_iterations = MAX_EXPR <_10, 1>

We're vectorizing the induction to re-start the iteration after exit, which I think is all OK (but it must be broken...). But we're stuck in a non-vectorized loop:

      /* Skip bits that are zero.  */
      for (; (i->word & 1) == 0; i->word >>= 1)
	i->bit_num++;

   0x00400890 <+528>:	shr    %rdx
   0x00400893 <+531>:	mov    %ecx,%eax
   0x00400895 <+533>:	inc    %ecx
=> 0x00400897 <+535>:	test   $0x1,%dl
   0x0040089a <+538>:	je     0x400890

and %rdx is zero. I think the vector IL is sound. Disabling cunroll allows the testcase to pass, so it might be an error on the upper bound of its iterations (but I think that's OK, too).

What looks a bit odd is the condition for skipping the epilogue, which we should never do for LOOP_VINFO_EARLY_BREAKS_VECT_PEELED; we're using

  /* If we have a peeled vector iteration we will never skip the epilog
     loop and we can simplify the cfg a lot by not doing the edge split.  */
  if (skip_epilog || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
    {
      guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
				niters, niters_vector_mult_vf);

here, but I think niters_vector_mult_vf is wrong (it doesn't matter here, still it looks bogus).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576

Richard Biener changed:

           What        |Removed     |Added
-------------------------------------------
       Target Milestone|---         |14.0
         Ever confirmed|0           |1
       Last reconfirmed|            |2024-01-24
               Priority|P3          |P1
                 Status|UNCONFIRMED |NEW
               Keywords|            |wrong-code

--- Comment #2 from Richard Biener --- Confirmed. Let me have a look.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #1 from Hongtao Liu ---

int __attribute__((noinline))
sbitmap_first_set_bit (const_sbitmap bmap)
{
  unsigned int n = 0;
  sbitmap_iterator sbi;
  EXECUTE_IF_SET_IN_SBITMAP (bmap, 0, n, sbi)
    return n;
  return -1;
}

hangs on this function; it's a vect early break case.

.L10:
	shrq	%rdx
	movl	%ecx, %eax
	incl	%ecx
	testb	$1, %dl
	je	.L10
	ret

It hangs on this scalar part.