RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> > So he was responding for how to do it for the vectorizer and scalar parts.
> >
> > Remember that the goal is not to introduce new gimple IL that can block other
> > optimizations.
> >
> > The vectorizer already introduces new IL (various IFN) but this is fine as we
> > don't track things like ranges for vector instructions.  So we don't lose any
> > information here.
> >
> > Now for the scalar, if we do an early replacement like in match.pd we prevent
> > a lot of other optimizations because they don't know what IFN_SAT_ADD does.
> > gimple-isel runs pretty late, and so at this point we don't expect many more
> > optimizations to happen, so it's a safe spot to insert more IL with "unknown
> > semantics".
> >
> > Was that your intention Richi?
>
> Thanks Tamar for the clear explanation.  Does that mean both the scalar and the
> vector will go the isel approach?  If so I may have misunderstood previously
> that it is only for vectorize.

No, the isel would only be for the scalar.  The vectorizer will still use the
vect_pattern.  It needs to so we can cost the operation correctly, and in some
cases, depending on how the saturation is described, you are unable to vectorize.
The pattern allows us to catch these cases and still vectorize.

But you should be able to use the same match.pd predicate for both the vectorizer
pattern and isel.

> Understand the point that we would like to put the pattern match late, but I may
> have a question here.
> Given the SAT_ADD related patterns are sort of complicated, it is possible that
> the sub-expression of SAT_ADD is optimized in an early pass by others and we can
> hardly catch the shapes later.
>
> For example, there is a plus expression in SAT_ADD, and in an early pass it may
> be optimized to .ADD_OVERFLOW, and then the pattern is quite different to
> recognize in a later pass.

Yeah, it looks like this transformation is done in widening_mul, which is the
other place Richi suggested to recognize SAT_ADD.
widening_mul already runs quite late as well, so it's also OK.  If you put it
there before the code that transforms the sequence to overflow it should work.

Eventually we do need to recognize this variant since:

uint64_t add_sat(uint64_t x, uint64_t y) noexcept
{
  uint64_t z;
  if (!__builtin_add_overflow(x, y, &z))
    return z;
  return -1u;
}

is a valid and common way to do saturation too.  But for now, it's fine.

Cheers,
Tamar

> Sorry, not sure if my understanding is correct, feel free to correct me.
>
> Pan
>
> -----Original Message-----
> From: Tamar Christina
> Sent: Thursday, May 2, 2024 11:26 AM
> To: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
>
> > -----Original Message-----
> > From: Li, Pan2
> > Sent: Thursday, May 2, 2024 4:11 AM
> > To: Tamar Christina ; gcc-patches@gcc.gnu.org
> > Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> > Liu, Hongtao
> > Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> >
> > Thanks Tamar
> >
> > > Could you also split off the vectorizer change from the scalar recog one?
> > > Typically I would structure a change like this as:
> > >
> > > 1. create types/structures + scalar recog
> > > 2. Vector recog code
> > > 3. Backend changes
> >
> > Sure thing, will rearrange the patch like this.
> >
> > > Is ECF_NOTHROW correct here?  At least on most targets I believe the scalar
> > > version can set flags/throw exceptions if the saturation happens?
> >
> > I see, will remove that.
> >
> > > Hmm I believe Richi mentioned that he wanted the recognition done in isel?
> > >
> > > The problem with doing it in match.pd is that it replaces the operations
> > > quite early in the pipeline.  Did I miss an email perhaps?  The early
> > > replacement means we lose optimizations and things such as range
> > > calculations etc., since e.g.
> > > ranger doesn't know these internal functions.
> > >
> > > I think Richi will want this in isel or mult widening but I'll continue
> > > with match.pd review just in case.
> >
> > If I understand correctly, Richard suggested trying vectorizer patterns first
> > and then possibly isel.
> > Thus, I don't have a try for SAT_ADD in ISEL as vectorizer patterns work well
> > for SAT_ADD.
> > Let's wait
RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> -----Original Message-----
> From: Li, Pan2
> Sent: Thursday, May 2, 2024 4:11 AM
> To: Tamar Christina ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
>
> Thanks Tamar
>
> > Could you also split off the vectorizer change from the scalar recog one?
> > Typically I would structure a change like this as:
> >
> > 1. create types/structures + scalar recog
> > 2. Vector recog code
> > 3. Backend changes
>
> Sure thing, will rearrange the patch like this.
>
> > Is ECF_NOTHROW correct here?  At least on most targets I believe the scalar
> > version can set flags/throw exceptions if the saturation happens?
>
> I see, will remove that.
>
> > Hmm I believe Richi mentioned that he wanted the recognition done in isel?
> >
> > The problem with doing it in match.pd is that it replaces the operations
> > quite early in the pipeline.  Did I miss an email perhaps?  The early
> > replacement means we lose optimizations and things such as range
> > calculations etc., since e.g. ranger doesn't know these internal functions.
> >
> > I think Richi will want this in isel or mult widening but I'll continue
> > with match.pd review just in case.
>
> If I understand correctly, Richard suggested trying vectorizer patterns first
> and then possibly isel.
> Thus, I don't have a try for SAT_ADD in ISEL as vectorizer patterns work well
> for SAT_ADD.
> Let's wait for the confirmation from Richard.  Below are the original words
> from the previous mail for reference.

I think the comment he made was this:

> > Given we have saturating integer alu like below, could you help to coach me
> > the most reasonable way to represent it in scalar as well as the vectorize
> > part?  Sorry, not familiar with this part and still digging into how it
> > works...
>
> As in your v2, .SAT_ADD for both sat_uadd and sat_sadd, similar for
> the other cases.
> As I said, use vectorizer patterns and possibly do instruction
> selection at ISEL/widen_mult time.

So he was responding for how to do it for the vectorizer and scalar parts.

Remember that the goal is not to introduce new gimple IL that can block other
optimizations.

The vectorizer already introduces new IL (various IFN) but this is fine as we
don't track things like ranges for vector instructions.  So we don't lose any
information here.

Now for the scalar, if we do an early replacement like in match.pd we prevent a
lot of other optimizations because they don't know what IFN_SAT_ADD does.
gimple-isel runs pretty late, and so at this point we don't expect many more
optimizations to happen, so it's a safe spot to insert more IL with "unknown
semantics".

Was that your intention Richi?

Thanks,
Tamar

> >> As I said, use vectorizer patterns and possibly do instruction
> >> selection at ISEL/widen_mult time.
>
> > The optimize checks in the match.pd file are weird as it seems to check if
> > we have optimizations enabled?
> >
> > We don't typically need to do this.
>
> Sure, will remove this.
>
> > The function has only one caller, you should just inline it into the
> > pattern.
>
> Sure thing.
>
> > Once you inline vect_sat_add_build_call you can do the check for
> > vtype here, which is the cheaper check so perform it early.
>
> Sure thing.
>
> Thanks again and will send the v4 with all comments addressed, as well as the
> test results.
>
> Pan
>
> -----Original Message-----
> From: Tamar Christina
> Sent: Thursday, May 2, 2024 1:06 AM
> To: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
>
> Hi,
>
> > From: Pan Li
> >
> > Update in v3:
> > * Rebase upstream for conflict.
> >
> > Update in v2:
> > * Fix one failure for x86 bootstrap.
> >
> > Original log:
> >
> > This patch would like to add the middle-end presentation for the
> > saturation add.  Aka set the result of add to the max when overflow.
> > It will take the pattern similar as below.
> >
> > SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> >
> > Take uint8_t as example, we will have:
> >
> > * SAT_ADD (1, 254) => 255.
> > * SAT_ADD (1, 255) => 255.
> > * SAT_ADD (2, 255) => 255.
> > * SAT_ADD (255, 255) => 255.
> >
> > The p
RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
Hi,

> From: Pan Li
>
> Update in v3:
> * Rebase upstream for conflict.
>
> Update in v2:
> * Fix one failure for x86 bootstrap.
>
> Original log:
>
> This patch would like to add the middle-end representation for the
> saturation add, aka set the result of the add to the max when it overflows.
> It will take a pattern similar to the one below.
>
> SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
>
> Take uint8_t as example, we will have:
>
> * SAT_ADD (1, 254) => 255.
> * SAT_ADD (1, 255) => 255.
> * SAT_ADD (2, 255) => 255.
> * SAT_ADD (255, 255) => 255.
>
> The patch also implements the SAT_ADD in the riscv backend as
> the sample for both the scalar and vector.  Given the below example:
>
> uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> {
>   return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> }
>
> Before this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   _Bool _2;
>   long unsigned int _3;
>   long unsigned int _4;
>   uint64_t _7;
>   long unsigned int _10;
>   __complex__ long unsigned int _11;
>
>   ;; basic block 2, loop depth 0
>   ;;   pred:     ENTRY
>   _11 = .ADD_OVERFLOW (x_5(D), y_6(D));
>   _1 = REALPART_EXPR <_11>;
>   _10 = IMAGPART_EXPR <_11>;
>   _2 = _10 != 0;
>   _3 = (long unsigned int) _2;
>   _4 = -_3;
>   _7 = _1 | _4;
>   return _7;
>   ;;   succ:     EXIT
>
> }
>
> After this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   uint64_t _7;
>
>   ;; basic block 2, loop depth 0
>   ;;   pred:     ENTRY
>   _7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
>   return _7;
>   ;;   succ:     EXIT
> }
>
> For vectorize, we leverage the existing vect pattern recog to find
> the pattern similar to scalar and let the vectorizer perform
> the rest for the standard name usadd<mode>3 in vector mode.
> The riscv vector backend has the insn "Vector Single-Width Saturating
> Add and Subtract" which can be leveraged when expanding usadd<mode>3
> in vector mode.
> For example:
>
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   unsigned i;
>
>   for (i = 0; i < n; i++)
>     out[i] = (x[i] + y[i]) | (- (uint64_t)((uint64_t)(x[i] + y[i]) < x[i]));
> }
>
> Before this patch:
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   ...
>   _80 = .SELECT_VL (ivtmp_78, POLY_INT_CST [2, 2]);
>   ivtmp_58 = _80 * 8;
>   vect__4.7_61 = .MASK_LEN_LOAD (vectp_x.5_59, 64B, { -1, ... }, _80, 0);
>   vect__6.10_65 = .MASK_LEN_LOAD (vectp_y.8_63, 64B, { -1, ... }, _80, 0);
>   vect__7.11_66 = vect__4.7_61 + vect__6.10_65;
>   mask__8.12_67 = vect__4.7_61 > vect__7.11_66;
>   vect__12.15_72 = .VCOND_MASK (mask__8.12_67, { 18446744073709551615, ... },
>                                 vect__7.11_66);
>   .MASK_LEN_STORE (vectp_out.16_74, 64B, { -1, ... }, _80, 0, vect__12.15_72);
>   vectp_x.5_60 = vectp_x.5_59 + ivtmp_58;
>   vectp_y.8_64 = vectp_y.8_63 + ivtmp_58;
>   vectp_out.16_75 = vectp_out.16_74 + ivtmp_58;
>   ivtmp_79 = ivtmp_78 - _80;
>   ...
> }
>
> vec_sat_add_u64:
>   ...
>   vsetvli    a5,a3,e64,m1,ta,ma
>   vle64.v    v0,0(a1)
>   vle64.v    v1,0(a2)
>   slli       a4,a5,3
>   sub        a3,a3,a5
>   add        a1,a1,a4
>   add        a2,a2,a4
>   vadd.vv    v1,v0,v1
>   vmsgtu.vv  v0,v0,v1
>   vmerge.vim v1,v1,-1,v0
>   vse64.v    v1,0(a0)
>   ...
>
> After this patch:
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   ...
>   _62 = .SELECT_VL (ivtmp_60, POLY_INT_CST [2, 2]);
>   ivtmp_46 = _62 * 8;
>   vect__4.7_49 = .MASK_LEN_LOAD (vectp_x.5_47, 64B, { -1, ... }, _62, 0);
>   vect__6.10_53 = .MASK_LEN_LOAD (vectp_y.8_51, 64B, { -1, ... }, _62, 0);
>   vect__12.11_54 = .SAT_ADD (vect__4.7_49, vect__6.10_53);
>   .MASK_LEN_STORE (vectp_out.12_56, 64B, { -1, ... }, _62, 0, vect__12.11_54);
>   ...
> }
>
> vec_sat_add_u64:
>   ...
>   vsetvli    a5,a3,e64,m1,ta,ma
>   vle64.v    v1,0(a1)
>   vle64.v    v2,0(a2)
>   slli       a4,a5,3
>   sub        a3,a3,a5
>   add        a1,a1,a4
>   add        a2,a2,a4
>   vsaddu.vv  v1,v1,v2
>   vse64.v    v1,0(a0)
>   ...
> To limit the patch size for review, only the unsigned version of
> usadd<mode>3 is involved here.  The signed version will be covered
> in follow-up patch(es).
>
> The below test suites are passed for this patch.
> * The riscv fully regression tests.
> * The aarch64 fully regression tests.
> * The x86 bootstrap tests.
> * The x86 fully regression tests.
>
>	PR target/51492
>	PR target/112600
>
> gcc/ChangeLog:
>
>	* config/riscv/autovec.md (usadd<mode>3): New pattern expand
>	for unsigned SAT_ADD vector.
>	* config/riscv/riscv-protos.h (riscv_expand_usadd): New func
>	decl to expand usadd<mode>3 pattern.
>	(expand_vec_usadd): Ditto but for vector.
>	* config/riscv/riscv-v.cc (emit_vec_saddu): New func impl to
>	emit the vsadd insn.
>	(expand_vec_usadd): New func impl to expand usadd<mode>3 for
>	vector.
>	* config/riscv/riscv.cc (riscv_expand_usadd): New func impl
>	to
[PATCH]middle-end: refactor vect_recog_absolute_difference to simplify flow [PR114769]
Hi All,

As the reporter in PR114769 points out, the control flow for the abd detection is
hard to follow.  This is because vect_recog_absolute_difference has two different
ways it can return true.

1. It can return true when the widening operation is matched, in which case
   unprom is set, half_type is not NULL and diff_stmt is not set.

2. It can return true when the widening operation is not matched, but the stmt
   being checked is a minus.  In this case unprom is not set, half_type is set
   to NULL and diff_stmt is set.  This is because to get to diff_stmt you have
   to dig through the abs statement and any possible promotions.

This however leads to complicated uses of the function at the call sites as the
exact semantics need to be known to use it safely.

vect_recog_absolute_difference has two callers:

1. vect_recog_sad_pattern, where if you return true with unprom not set, then
   *half_type will be NULL.  The call to vect_supportable_direct_optab_p will
   always reject it since there's no vector mode for NULL.  Note that, looking
   at the dump files, the convention has always been that we first indicate
   that a pattern could possibly be recognized and then check that it's
   supported.  This change somewhat incorrectly makes the diagnostic message
   get printed for "invalid" patterns.

2. vect_recog_abd_pattern, where if half_type is NULL, it then uses diff_stmt
   to set them.

So while the note in the dump file is misleading, the code is safe.

This refactors the code: it now has only one success condition, and diff_stmt is
always set to the minus statement in the abs if there is one.  The function now
only returns success if the widening minus is found, in which case unprom and
half_type are set.

This then leaves it up to the caller to decide if they want to do anything with
diff_stmt.

Bootstrapped and regtested on aarch64-none-linux-gnu and no issues.

Ok for master?
Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/114769
	* tree-vect-patterns.cc:
	(vect_recog_absolute_difference): Have only one success condition.
	(vect_recog_abd_pattern): Handle further checks if
	vect_recog_absolute_difference fails.

---
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4f491c6b8336f8710c3519dec1fa7e0f49387d2b..87c2acff386d91d22a3b2d6e6443d1f2f2326ea6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -797,8 +797,7 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs,
    HALF_TYPE and UNPROM will be set should the statement be found to
    be a widened operation.  DIFF_STMT will be set to the MINUS_EXPR
-   statement that precedes the ABS_STMT unless vect_widened_op_tree
-   succeeds.
+   statement that precedes the ABS_STMT if it is a MINUS_EXPR.
  */
 static bool
 vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
@@ -843,6 +842,12 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
   if (!diff_stmt_vinfo)
     return false;

+  gassign *diff = dyn_cast <gassign *> (STMT_VINFO_STMT (diff_stmt_vinfo));
+  if (diff_stmt && diff
+      && gimple_assign_rhs_code (diff) == MINUS_EXPR
+      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
+    *diff_stmt = diff;
+
   /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
      inside the loop (in case we are analyzing an outer-loop).  */
   if (vect_widened_op_tree (vinfo, diff_stmt_vinfo,
@@ -850,17 +855,6 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
			    false, 2, unprom, half_type))
     return true;

-  /* Failed to find a widen operation so we check for a regular MINUS_EXPR.  */
-  gassign *diff = dyn_cast <gassign *> (STMT_VINFO_STMT (diff_stmt_vinfo));
-  if (diff_stmt && diff
-      && gimple_assign_rhs_code (diff) == MINUS_EXPR
-      && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
-    {
-      *diff_stmt = diff;
-      *half_type = NULL_TREE;
-      return true;
-    }
-
   return false;
 }

@@ -1499,27 +1493,22 @@ vect_recog_abd_pattern (vec_info *vinfo,
   tree out_type = TREE_TYPE (gimple_assign_lhs (last_stmt));

   vect_unpromoted_value unprom[2];
-  gassign *diff_stmt;
-  tree half_type;
-  if (!vect_recog_absolute_difference (vinfo, last_stmt, &half_type,
+  gassign *diff_stmt = NULL;
+  tree abd_in_type;
+  if (!vect_recog_absolute_difference (vinfo, last_stmt, &abd_in_type,
				       unprom, &diff_stmt))
-    return NULL;
-
-  tree abd_in_type, abd_out_type;
-
-  if (half_type)
-    {
-      abd_in_type = half_type;
-      abd_out_type = abd_in_type;
-    }
-  else
     {
+      /* We cannot try further without having a non-widening MINUS.  */
+      if (!diff_stmt)
+	return NULL;
+
       unprom[0].op = gimple_assign_rhs1 (diff_stmt);
       unprom[1].op = gimple_assign_rhs2 (diff_stmt);
       abd_in_type = signed_type_for (out_type);
-      abd_out_type = abd_in_type;
     }
+
[PATCH]AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741]
Hi All,

In PR114741 we see that we have a regression in codegen when SVE is enabled,
where the simple testcase:

void foo (unsigned v, unsigned *p)
{
  *p = v & 1;
}

generates

foo:
	fmov	s31, w0
	and	z31.s, z31.s, #1
	str	s31, [x1]
	ret

instead of:

foo:
	and	w0, w0, 1
	str	w0, [x1]
	ret

This has an impact on not just code size but also performance.  This is caused
by the use of the ^ constraint modifier in the pattern <optab><mode>3.

The documentation states that this modifier should only have an effect on the
alternative costing, in that a particular alternative is to be preferred unless
a non-pseudo reload is needed.

The pattern was trying to convey that whenever both r and w are required, it
should prefer r unless a reload is needed.  This is because if a reload is
needed then we can construct the constants more flexibly on the SIMD side.

We were using this to simplify the implementation and to get generic cases such
as:

double negabs (double x)
{
  unsigned long long y;
  memcpy (&y, &x, sizeof (double));
  y = y | (1UL << 63);
  memcpy (&x, &y, sizeof (double));
  return x;
}

which don't go through an expander.

However the implementation of ^ in the register allocator is not according to
the documentation in that it also has an effect during coloring.  During
initial register class selection it applies a penalty to a class, similar to
how ? does.  In this example the penalty makes the use of GP regs expensive
enough that it no longer considers them:

  r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
  ;;  3--> b  0: i   9 r106=r105&0x1
      :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
      PR_HI_REGS+0(0):model 4

which is not the expected behavior.  For GCC 14 this is a conservative fix:

1. We remove the ^ modifier from the logical optabs.

2. In order not to regress copysign we then move the copysign expansion to
   directly use the SIMD variant.  Since copysign only supports floating point
   modes this is fine and no longer relies on the register allocator to select
   the right alternative.
It once again regresses the general case, but this case wasn't optimized in
earlier GCCs either so it's not a regression in GCC 14.  This change gives
strictly better codegen than earlier GCCs and still optimizes the important
cases.

Bootstrapped and regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/114741
	* config/aarch64/aarch64.md (<optab><mode>3): Remove ^ from alt 2.
	(copysign<mode>3): Use SIMD version of IOR directly.

gcc/testsuite/ChangeLog:

	PR target/114741
	* gcc.target/aarch64/fneg-abs_2.c: Update codegen.
	* gcc.target/aarch64/fneg-abs_4.c: xfail for now.
	* gcc.target/aarch64/pr114741.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 385a669b9b3c31cc9108a660e881b9091c71fc7c..dbde066f7478bec51a8703b017ea553aa98be309 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4811,7 +4811,7 @@ (define_insn "<optab><mode>3"
   ""
   {@ [ cons: =0 , 1  , 2 ; attrs: type , arch ]
      [ r  , %r , r ; logic_reg  , *    ] \t%0, %1, %2
-     [ rk , ^r ,   ; logic_imm  , *    ] \t%0, %1, %2
+     [ rk , r  ,   ; logic_imm  , *    ] \t%0, %1, %2
      [ w  , 0  ,   ; *          , sve  ] \t%Z0., %Z0., #%2
      [ w  , w  , w ; neon_logic , simd ] \t%0., %1., %2.
   }
@@ -7192,22 +7192,29 @@ (define_expand "copysign<mode>3"
    (match_operand:GPF 2 "nonmemory_operand")]
   "TARGET_SIMD"
 {
-  machine_mode int_mode = mode;
-  rtx bitmask = gen_reg_rtx (int_mode);
-  emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U
-				    << (GET_MODE_BITSIZE (mode) - 1)));
+  rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U
+			       << (GET_MODE_BITSIZE (mode) - 1));
   /* copysign (x, -1) should instead be expanded as orr with the sign bit.
      */
   rtx op2_elt = unwrap_const_vec_duplicate (operands[2]);
   if (GET_CODE (op2_elt) == CONST_DOUBLE
       && real_isneg (CONST_DOUBLE_REAL_VALUE (op2_elt)))
     {
-      emit_insn (gen_ior3 (
-	lowpart_subreg (int_mode, operands[0], mode),
-	lowpart_subreg (int_mode, operands[1], mode), bitmask));
+      rtx v_bitmask
+	= force_reg (V2mode,
+		     gen_const_vec_duplicate (V2mode,
+					      signbit_const));
+
+      emit_insn (gen_iorv23 (
+	lowpart_subreg (V2mode, operands[0], mode),
+	lowpart_subreg (V2mode, operands[1], mode),
+	v_bitmask));
       DONE;
     }
+
+  machine_mode int_mode = mode;
+  rtx bitmask = gen_reg_rtx (int_mode);
+  emit_move_insn (bitmask, signbit_const);
   operands[2] = force_reg (mode, operands[2]);
   emit_insn (gen_copysign3_insn (operands[0], operands[1], operands[2],
RE: [PATCH]middle-end: skip vectorization check on ilp32 on vect-early-break_124-pr114403.c
> On Tue, Apr 16, 2024 at 09:00:53AM +0200, Richard Biener wrote:
> > >	PR tree-optimization/114403
> > >	* gcc.dg/vect/vect-early-break_124-pr114403.c: Skip in ilp32.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > index 1751296ab813fe85eaab1f58dc674bac10f6eb7a..db8e00556f116ca81c5a6558ec6ecd3b222ec93d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > @@ -2,11 +2,11 @@
> > >  /* { dg-require-effective-target vect_early_break_hw } */
> > >  /* { dg-require-effective-target vect_long_long } */
> > >
> > > -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! ilp32 } } } } */
> > >
> > >  #include "tree-vect.h"
> > >
> > > -typedef unsigned long PV;
> > > +typedef unsigned long long PV;
> > >  typedef struct _buff_t {
> > >    int foo;
> > >    PV Val;
>
> As discussed on IRC, I think we want
> --- gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c.jj	2024-04-16 08:43:36.001729192 +0200
> +++ gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c	2024-04-16 08:55:11.296214959 +0200
> @@ -64,8 +64,8 @@ int main ()
>
>    int store_size = sizeof(PV);
>  #pragma GCC novector
> -  for (int i = 0; i < NUM - 1; i+=store_size)
> -    if (0 != __builtin_memcmp (buffer+i, (char*)[i].Val, store_size))
> +  for (int i = 0; i < NUM - 1; i++)
> +    if (0 != __builtin_memcmp (buffer+i*store_size, (char*)[i].Val, store_size))
>       __builtin_abort ();
>
>    return 0;
>
> instead (and then I think there is no need to switch PV from unsigned long
> to unsigned long long, nor disabling on ilp32).

Regtested on x86_64-pc-linux-gnu with -m32, -m64 and no issues.  Ok for master?
Thanks,
Tamar

gcc/testsuite/ChangeLog:

	PR tree-optimization/114403
	* gcc.dg/vect/vect-early-break_124-pr114403.c: Fix check loop.

-- inline copy of patch --

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
index 1751296ab813fe85eaab1f58dc674bac10f6eb7a..51abf245ccb51b85f06916a8a0238698911ab551 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
@@ -68,8 +68,8 @@ int main ()

   int store_size = sizeof(PV);
 #pragma GCC novector
-  for (int i = 0; i < NUM - 1; i+=store_size)
-    if (0 != __builtin_memcmp (buffer+i, (char*)[i].Val, store_size))
+  for (int i = 0; i < NUM - 1; i++)
+    if (0 != __builtin_memcmp (buffer+(i*store_size), (char*)[i].Val, store_size))
      __builtin_abort ();

   return 0;
[PATCH]middle-end: skip vectorization check on ilp32 on vect-early-break_124-pr114403.c
Hi all,

The testcase seems to fail vectorization on -m32 since the access pattern is
determined as too complex.  This skips the vectorization check on ilp32 systems
as I couldn't find a better proxy for being able to do strided 64-bit loads and
I suspect it would fail on all 32-bit targets.

Regtested on x86_64-pc-linux-gnu with -m32 and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

	PR tree-optimization/114403
	* gcc.dg/vect/vect-early-break_124-pr114403.c: Skip in ilp32.

---
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
index 1751296ab813fe85eaab1f58dc674bac10f6eb7a..db8e00556f116ca81c5a6558ec6ecd3b222ec93d 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
@@ -2,11 +2,11 @@
 /* { dg-require-effective-target vect_early_break_hw } */
 /* { dg-require-effective-target vect_long_long } */

-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! ilp32 } } } } */

 #include "tree-vect.h"

-typedef unsigned long PV;
+typedef unsigned long long PV;
 typedef struct _buff_t {
   int foo;
   PV Val;
docs: document early break support and pragma novector
docs: document early break support and pragma novector

---
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index b4c602a523717c1d64333e44aefb60ba0ed02e7a..aceecb86f17443cfae637e90987427b98c42f6eb 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -200,6 +200,34 @@ a work-in-progress.
      for indicating parameters that are expected to be null-terminated
      strings.
+
+    The vectorizer now supports vectorizing loops which contain any number of
+    early breaks.  This means loops such as:
+
+      int z[100], y[100], x[100];
+      int foo (int n)
+      {
+        int res = 0;
+        for (int i = 0; i < n; i++)
+          {
+            y[i] = x[i] * 2;
+            res += x[i] + y[i];
+
+            if (x[i] > 5)
+              break;
+
+            if (z[i] > 5)
+              break;
+          }
+        return res;
+      }
+
+    can now be vectorized on a number of targets.  In this first version any
+    input data sources must either have a statically known size at compile
+    time or the vectorizer must be able to determine based on auxiliary
+    information that the accesses are aligned.
+
 New Languages and Language specific improvements
@@ -231,6 +259,9 @@ a work-in-progress.
      previous options -std=c2x, -std=gnu2x and -Wc11-c2x-compat,
      which are deprecated but remain supported.
+    GCC supports a new pragma pragma GCC novector to
+    indicate to the vectorizer not to vectorize the loop annotated with the
+    pragma.
 C++
@@ -400,6 +431,9 @@ a work-in-progress.
      warnings are enabled for C++ as well
      The DR 2237 code no longer gives an error, it emits a
      -Wtemplate-id-cdtor warning instead
+    GCC supports a new pragma pragma GCC novector to
+    indicate to the vectorizer not to vectorize the loop annotated with the
+    pragma.
 Runtime Library (libstdc++)
[PATCH]middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].
Hi All,

This is a story all about how the peeling for gaps introduces a bug in the
upper bounds.

Before I go further, I'll first explain how I understand this to work for loops
with a single exit.

When peeling for gaps we peel N < VF iterations to scalar.  This happens by
removing N iterations from the calculation of niters such that
vect_iters * VF == niters is always false.

In other words, when we exit the vector loop we always fall to the scalar loop.
The loop bounds adjustment guarantees this.  Because of this we potentially
execute one vector loop iteration less.  That is, if you're at the boundary
condition where niters is a multiple of VF, by peeling one or more scalar
iterations the vector loop executes one less.

This is accounted for by the adjustments in vect_transform_loop.  This
adjustment happens differently based on whether the vector loop can be partial
or not:

Peeling for gaps sets the bias to 0 and then:

  when not partial: we take the floor of (scalar_upper_bound / VF) - 1 to get
  the vector latch iteration count.

  when the loop is partial: for a single exit this means the loop is masked,
  so we take the ceil to account for the fact that the loop can handle the
  final partial iteration using masking.

Note that there's no difference between ceil and floor on the boundary
condition.  There is a difference however when you're slightly above it.  I.e.
if the scalar loop iterates 14 times, VF = 4 and we peel 1 iteration for gaps,
the partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations, and in effect
the partial iteration is ignored and it's done as scalar.

This is fine because the niters modification has capped the vector iteration
count at 2.  So when we reduce the induction values you end up entering the
scalar code with ind_var.2 = ind_var.1 + 2 * VF.

Now let's look at early breaks.
To make it easier I'll focus on the specific testcase: char buffer[64]; __attribute__ ((noipa)) buff_t *copy (buff_t *first, buff_t *last) { char *buffer_ptr = buffer; char *const buffer_end = &buffer[SZ-1]; int store_size = sizeof(first->Val); while (first != last && (buffer_ptr + store_size) <= buffer_end) { const char *value_data = (const char *)(&first->Val); __builtin_memcpy(buffer_ptr, value_data, store_size); buffer_ptr += store_size; ++first; } if (first == last) return 0; return first; } Here the first (early) exit is on the condition: (buffer_ptr + store_size) <= buffer_end and the main exit is on the condition: first != last This is important, as this bug only manifests itself when the first exit has a known constant iteration count that's lower than the latch exit count. Because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16 bytes per iteration. So the exit has a known bound of 8 + 1. The vectorizer correctly analyzes this: Statement (exit)if (ivtmp_21 != 0) is executed at most 8 (bounded by 8) + 1 times in loop 1. and as a consequence the IV is bound by 9: # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)> ... vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 18446744073709551615, 18446744073709551615, 18446744073709551615 }; mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 }; if (mask_patt_22.17_126 == { -1, -1, -1, -1 }) goto ; [88.89%] else goto ; [11.11%] The important bits are these: In this example the value of last - first = 416. The calculated vector iteration count is: x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27 and the bounds generated, adjusting for gaps: x == (((x - 1) >> 2) << 2) which means we'll always fall through to the scalar code, as intended. Here are two key things to note: 1. In this loop, the early exit will always be the one taken. When it's taken we enter the scalar loop with the correct induction value to apply the gap peeling. 2.
If the main exit is taken, the induction value assumes you've finished all vector iterations, i.e. it assumes you have completed 24 iterations, as we treat the main exit the same for normal loop vect and early break when not PEELED. This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 * VF; So what's going wrong? The vectorizer's codegen is correct and efficient; however, the code that adjusts the upper bounds assumes that the loop's upper bound is based on the early exit, i.e. 8 latch iterations, or in other words, it thinks the loop iterates once. This is incorrect, as the vector loop iterates twice: it has set up the induction value such that it exits at the early exit, so in effect it iterates 2.5x. Because the upper bound is incorrect, when we unroll it now exits from the main exit, which uses the incorrect induction value. So there are three ways to fix this: 1. If we take the position that the main exit should support both premature exits and final exits then vect_update_ivs_after_vectorizer
[PATCH]middle-end vect: adjust loop upper bounds when peeling for gaps and early break [PR114403]
Hi All, The report shows that we end up in a situation where the code has been peeled for gaps and we have an early break. The code for peeling for gaps assumes that a scalar loop needs to perform at least one iteration. However, this doesn't take into account early break, where the scalar loop may not need to be executed. That the early break loop can be partial is not accounted for in this scenario. Loop partiality is normally handled by setting bias_for_lowest to 1, but when peeling for gaps we end up with 0, which when the loop upper bounds are calculated means that a partial loop iteration loses the final partial iter: Analyzing # of iterations of loop 1 exit condition [8, + , 18446744073709551615] != 0 bounds on difference of bases: -8 ... -8 result: # of iterations 8, bounded by 8 and a VF=4 calculating: Loop 1 iterates at most 1 times. Loop 1 likely iterates at most 1 times. Analyzing # of iterations of loop 1 exit condition [1, + , 1](no_overflow) < bnd.5505_39 bounds on difference of bases: 0 ... 4611686018427387902 Matching expression match.pd:2011, generic-match-8.cc:27 Applying pattern match.pd:2067, generic-match-1.cc:4813 result: # of iterations bnd.5505_39 + 18446744073709551615, bounded by 4611686018427387902 Estimating sizes for loop 1 ... Induction variable computation will be folded away. size: 2 if (ivtmp_312 < bnd.5505_39) Exit condition will be eliminated in last copy. size: 24-3, last_iteration: 24-5 Loop size: 24 Estimated size after unrolling: 26 ;; Guessed iterations of loop 1 is 0.858446. New upper bound 1. The upper bound should be 2, not 1. This patch forces bias_for_lowest to be 1 even when peeling for gaps. I have, however, not been able to write a standalone reproducer for this, so there are no new tests, but bootstrap and an LLVM build are fine now.
The testcase: #define COUNT 9 #define SIZE COUNT * 4 #define TYPE unsigned long TYPE x[SIZE], y[SIZE]; void __attribute__((noipa)) loop (TYPE val) { for (int i = 0; i < COUNT; ++i) { if (x[i * 4] > val || x[i * 4 + 1] > val) return; x[i * 4] = y[i * 2] + 1; x[i * 4 + 1] = y[i * 2] + 2; x[i * 4 + 2] = y[i * 2 + 1] + 3; x[i * 4 + 3] = y[i * 2 + 1] + 4; } } does perform the peeling for gaps and early break; however, it creates a hybrid loop which works fine. Adjusting the indices to non-linear also works. So I'd like to submit the fix and work on a testcase separately if needed. Bootstrapped Regtested on x86_64-pc-linux-gnu no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/114403 * tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when peeling for gaps and early break. --- diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -12139,7 +12139,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) /* The minimum number of iterations performed by the epilogue. This is 1 when peeling for gaps because we always need a final scalar iteration. */ - int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0; + int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0; /* +1 to convert latch counts to loop iteration counts, -min_epilogue_iters to remove iterations that cannot be performed by the vector code. */ --
Summary: [PATCH][committed]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552][GCC 13/12/11 backport]
Hi All, This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07. The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Committed to GCC 13,12,11 branches as previously approved. Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. --- diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index f546c48ae2d2bad2e34c6b72e5e3e30aba3c3bd6..d19a9c16cc97ae75afd4e29f4339d65d39cfb73a 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -27027,7 +27027,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, bool explicit_p) { tree t, ret_type; - unsigned int elt_bits, count; + unsigned int elt_bits, count = 0; unsigned HOST_WIDE_INT const_simdlen; poly_uint64 vec_bits; @@ -27104,7 +27104,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, vec_bits = (num == 0 ? 
64 : 128); clonei->simdlen = exact_div (vec_bits, elt_bits); } - else + else if (maybe_ne (clonei->simdlen, 1U)) { count = 1; vec_bits = clonei->simdlen * elt_bits; diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index ..9c96b061ed2b4fcc57e58925277f74d14f79c51f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 95f6a6803e889c02177ef10972962ed62d2095eb..c6dac6b104c94c9de89ed88dc5a73e185d2be125 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,7 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ +/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */ --
RE: [PATCH] vect: Do not peel epilogue for partial vectors [PR114196].
> -Original Message- > From: Richard Biener > Sent: Thursday, March 7, 2024 8:47 AM > To: Robin Dapp > Cc: gcc-patches ; Tamar Christina > > Subject: Re: [PATCH] vect: Do not peel epilogue for partial vectors > [PR114196]. > > On Wed, Mar 6, 2024 at 9:21 PM Robin Dapp wrote: > > > > Hi, > > > > r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early > > break but PR114196 shows that we also run into the problem without early > > break. Therefore remove early break from the conditions. > > > > gcc/ChangeLog: > > > > PR middle-end/114196 > > > > * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Remove > > early break check from guards. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/pr114196.c: New test. > > * gcc.target/riscv/rvv/autovec/pr114196.c: New test. > > --- > > gcc/testsuite/gcc.target/aarch64/pr114196.c | 19 +++ > > .../gcc.target/riscv/rvv/autovec/pr114196.c | 19 +++ > > gcc/tree-vect-loop-manip.cc | 6 +++--- > > 3 files changed, 41 insertions(+), 3 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/pr114196.c > > create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/pr114196.c > b/gcc/testsuite/gcc.target/aarch64/pr114196.c > > new file mode 100644 > > index 000..15e4b0e31b8 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/pr114196.c > > @@ -0,0 +1,19 @@ > > +/* { dg-do compile } */ > > +/* { dg-options { -O3 -fno-vect-cost-model -march=armv9-a -msve-vector- > bits=256 } } */ > > + > > +unsigned a; > > +int b; > > +long *c; > > + > > +int > > +main () > > +{ > > + for (int d = 0; d < 22; d += 4) { > > + b = ({ > > + int e = c[d]; > > + e; > > + }) > > + ? 
0 : -c[d]; > > + a *= 3; > > + } > > +} > > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c > > new file mode 100644 > > index 000..7ba9cbbed70 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c > > @@ -0,0 +1,19 @@ > > +/* { dg-do compile } */ > > +/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv_zvl256b - > mabi=lp64d -mrvv-vector-bits=zvl } } */ > > + > > +unsigned a; > > +int b; > > +long *c; > > + > > +int > > +main () > > +{ > > + for (int d = 0; d < 22; d += 4) { > > + b = ({ > > + int e = c[d]; > > + e; > > + }) > > + ? 0 : -c[d]; > > + a *= 3; > > + } > > +} > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > > index f72da915103..c3cd20eef70 100644 > > --- a/gcc/tree-vect-loop-manip.cc > > +++ b/gcc/tree-vect-loop-manip.cc > > @@ -2183,9 +2183,9 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info > loop_vinfo, > > perform the peeling. The below condition mirrors that of > > vect_gen_vector_loop_niters where niters_vector_mult_vf_var then sets > > step_vector to VF rather than 1. This is what creates the nonlinear > > - IV. PR113163. */ > > - if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo) > > - && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant () > > + IV. PR113163. > > + This also happens without early breaks, see PR114196. */ > > Can you instead reword to not mention early breaks, maybe instead > say PR113163 (with early breaks), PR114196 (without)? > > The dump message also needs adjustments, it mentions early breaks as > well. > > The comment says it matches a condition in vect_gen_vector_loop_niters > but I can't see what that means ... Tamar? 
> The comment was trying to say that this case is when you manage to get here: https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2847 because that makes you fall into https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L3528 which creates the nonlinear IV variable. The vect_step_op_neg exception is because vect_update_ivs_after_vectorizer can deal with that case specifically https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2398 which is what the previous check is also explaining https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2133 If this also happens for non-early breaks it's just better to merge the check into the earlier one at github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2133 Tamar > > + if (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant () > >&& LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > >&& induction_type != vect_step_op_neg) > > { > > -- > > 2.43.2
RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS
> Thanks Tamar. > > > Those two cases also *completely* stop vectorization because of either the > > control flow or the fact the vectorizer can't handle complex types. > > Yes, we eventually would like to vectorize the SAT ALU but we start with > scalar part > first. > I tried the DEF_INTERNAL_SIGNED_OPTAB_EXT_FN as your suggestion. It works > well with some additions as below. > Feel free to correct me if any misunderstandings. > > 1. usadd$Q$a3 are restricted to fixed point and we need to change it to > usadd$a3(as well as gen_int_libfunc) for int. > 2. We need to implement a default implementation of SAT_ADD if > direct_binary_optab_supported_p is false. > It looks like the default implementation is difficult to make every > backend happy. > That is why you suggest just normal > DEF_INTERNAL_SIGNED_OPTAB_FN in another thread. > > Thanks Richard. > > > But what I'd like to see is that we do more instruction selection on GIMPLE > > but _late_ (there's the pass_optimize_widening_mul and pass_gimple_isel > > passes doing what I'd call instruction selection). But that means not > > adding > > match.pd patterns for that or at least have a separate isel-match.pd > > machinery for that. > > > So as a start I would go for a direct optab and see to recognize it during > > ISEL? > > Looks we have sorts of SAT alu like PLUS/MINUS/MULT/DIV/SHIFT/NEG/ABS, good > to know isel and I am happy to > try that once we have conclusion. > So after a lively discussion on IRC, the conclusion is that before we proceed Richi would like to see some examples of various operations. The problem is that unsigned saturating addition is the simplest example and it may lead to an implementation strategy that doesn't scale. So I'd suggest writing some examples of both signed and unsigned saturating add and multiply, because signed addition will likely require a branch and signed multiplication would require a larger type.
This would allow us to better understand what kind of gimple we would have to deal with in ISEL and VECT if we decide not to lower early. Thanks, Tamar > Pan > > -Original Message- > From: Tamar Christina > Sent: Tuesday, February 27, 2024 5:57 PM > To: Richard Biener > Cc: Li, Pan2 ; gcc-patches@gcc.gnu.org; > juzhe.zh...@rivai.ai; > Wang, Yanzhang ; kito.ch...@gmail.com; > richard.sandiford@arm.com2; jeffreya...@gmail.com > Subject: RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation > US_PLUS > > > -Original Message- > > From: Richard Biener > > Sent: Tuesday, February 27, 2024 9:44 AM > > To: Tamar Christina > > Cc: pan2...@intel.com; gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; > > yanzhang.w...@intel.com; kito.ch...@gmail.com; > > richard.sandiford@arm.com2; jeffreya...@gmail.com > > Subject: Re: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation > > US_PLUS > > > > On Sun, Feb 25, 2024 at 10:01 AM Tamar Christina > > wrote: > > > > > > Hi Pan, > > > > > > > From: Pan Li > > > > > > > > Hi Richard & Tamar, > > > > > > > > Try the DEF_INTERNAL_INT_EXT_FN as your suggestion. By mapping > > > > us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def. > > > > And then expand_US_PLUS in internal-fn.cc. Not very sure if my > > > > understanding is correct for DEF_INTERNAL_INT_EXT_FN. > > > > > > > > I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given > > > > the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already. > > > > > > > > > > I think a couple of things are being confused here. So lets break it > > > down: > > > > > > The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is because in GIMPLE > > > we only want one internal function for both signed and unsigned SAT_ADD. > > > with this definition we don't need SAT_UADD and SAT_SADD but instead > > > we will only have SAT_ADD, which will expand to us_plus or ss_plus. > > > > > > Now the downside of this is that this is a direct internal optab.
This > > > means > > > that for the representation to be used the target *must* have the optab > > > implemented. This is a bit annoying because it doesn't allow us to > > > generically > > > assume that all targets use SAT_ADD for saturating add and thus only have > > > to > > > write optimization for this representation. > > > > > > This is why Richi said we may need to use a new tree_code because we c
RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> Am 19.02.24 um 08:36 schrieb Richard Biener: > > On Sat, Feb 17, 2024 at 11:30 AM wrote: > >> > >> From: Pan Li > >> > >> This patch would like to add the middle-end presentation for the > >> unsigned saturation add. Aka set the result of add to the max > >> when overflow. It will take the pattern similar as below. > >> > >> SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x)) > > Does this even try to work out the costs? > > For example, with the following example > > > #define T __UINT16_TYPE__ > > T sat_add1 (T x, T y) > { >return (x + y) | (- (T)((T)(x + y) < x)); > } > > T sat_add2 (T x, T y) > { > T z = x + y; > if (z < x) > z = (T) -1; > return z; > } > > And then "avr-gcc -S -Os -dp" the code is > > > sat_add1: > add r22,r24 ; 7 [c=8 l=2] *addhi3/0 > adc r23,r25 > ldi r18,lo8(1) ; 8 [c=4 l=2] *movhi/4 > ldi r19,0 > cp r22,r24 ; 9 [c=8 l=2] cmphi3/2 > cpc r23,r25 > brlo .L2 ; 10 [c=16 l=1] branch > ldi r19,0; 31 [c=4 l=1] movqi_insn/0 > ldi r18,0; 32 [c=4 l=1] movqi_insn/0 > .L2: > clr r24 ; 13 [c=12 l=4] neghi2/1 > clr r25 > sub r24,r18 > sbc r25,r19 > or r24,r22 ; 29 [c=4 l=1] iorqi3/0 > or r25,r23 ; 30 [c=4 l=1] iorqi3/0 > ret ; 35 [c=0 l=1] return > > sat_add2: > add r22,r24 ; 8 [c=8 l=2] *addhi3/0 > adc r23,r25 > cp r22,r24 ; 9 [c=8 l=2] cmphi3/2 > cpc r23,r25 > brsh .L3 ; 10 [c=16 l=1] branch > ldi r22,lo8(-1) ; 5 [c=4 l=2] *movhi/4 > ldi r23,lo8(-1) > .L3: > mov r25,r23 ; 21 [c=4 l=1] movqi_insn/0 > mov r24,r22 ; 22 [c=4 l=1] movqi_insn/0 > ret ; 25 [c=0 l=1] return > > i.e. the conditional jump is better than overly smart arithmetic > (smaller and faster code with less register pressure). > With larger types the difference is even more pronounced. *On AVR*. https://godbolt.org/z/7jaExbTa8 shows the branchless code is better.
And the branchy code will vectorize worse if at all https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492 But looking at that output it just seems like it's your expansion that's inefficient. But fair point, perhaps it should be just a normal DEF_INTERNAL_SIGNED_OPTAB_FN so that we provide the additional optimization only for targets that want it. Tamar > >> Take uint8_t as example, we will have: > >> > >> * SAT_ADDU (1, 254) => 255. > >> * SAT_ADDU (1, 255) => 255. > >> * SAT_ADDU (2, 255) => 255. > >> * SAT_ADDU (255, 255) => 255. > >> > >> The patch also implement the SAT_ADDU in the riscv backend as > >> the sample. Given below example: > >> > >> uint64_t sat_add_u64 (uint64_t x, uint64_t y) > >> { > >>return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x)); > >> } > >> > >> Before this patch: > >> > >> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y) > >> { > >>long unsigned int _1; > >>_Bool _2; > >>long unsigned int _3; > >>long unsigned int _4; > >>uint64_t _7; > >>long unsigned int _10; > >>__complex__ long unsigned int _11; > >> > >> ;; basic block 2, loop depth 0 > >> ;;pred: ENTRY > >>_11 = .ADD_OVERFLOW (x_5(D), y_6(D)); > >>_1 = REALPART_EXPR <_11>; > >>_10 = IMAGPART_EXPR <_11>; > >>_2 = _10 != 0; > >>_3 = (long unsigned int) _2; > >>_4 = -_3; > >>_7 = _1 | _4; > >>return _7; > >> ;;succ: EXIT > >> > >> } > >> > >> After this patch: > >> > >> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y) > >> { > >>uint64_t _7; > >> > >> ;; basic block 2, loop depth 0 > >> ;;pred: ENTRY > >>_7 = .SAT_ADDU (x_5(D), y_6(D)); [tail call] > >>return _7; > >> ;;succ: EXIT > >> > >> } > >> > >> Then we will have the middle-end representation like .SAT_ADDU after > >> this patch. > > > > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and > > the corresponding ssadd/usadd optabs. 
There's not much documentation > > unfortunately besides the use of gen_*_fixed_libfunc usage where the comment > > suggests this is used for fixed-point operations. It looks like arm uses > > fractional/accumulator modes for this but for example bfin has ssaddsi3. > > > > So the question is whether the fixed-point case can be distinguished from > > the integer case based on mode. > > > > There's also FIXED_POINT_TYPE on the GENERIC/GIMPLE side and > > no special tree operator codes for them. So compared to what appears > > to be the case on RTL we'd need a way to represent saturating integer > > operations on GIMPLE. > > > > The natural thing is to use direct optab internal functions (that's what you > > basically did, but you added a new optab, IMO without good reason). > > More
RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS
> -Original Message- > From: Richard Biener > Sent: Tuesday, February 27, 2024 9:44 AM > To: Tamar Christina > Cc: pan2...@intel.com; gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; > yanzhang.w...@intel.com; kito.ch...@gmail.com; > richard.sandiford@arm.com2; jeffreya...@gmail.com > Subject: Re: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation > US_PLUS > > On Sun, Feb 25, 2024 at 10:01 AM Tamar Christina > wrote: > > > > Hi Pan, > > > > > From: Pan Li > > > > > > Hi Richard & Tamar, > > > > > > Try the DEF_INTERNAL_INT_EXT_FN as your suggestion. By mapping > > > us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def. > > > And then expand_US_PLUS in internal-fn.cc. Not very sure if my > > > understanding is correct for DEF_INTERNAL_INT_EXT_FN. > > > > > > I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given > > > the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already. > > > > > > > I think a couple of things are being confused here. So lets break it down: > > > > The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is because in GIMPLE > > we only want one internal function for both signed and unsigned SAT_ADD. > > with this definition we don't need SAT_UADD and SAT_SADD but instead > > we will only have SAT_ADD, which will expand to us_plus or ss_plus. > > > > Now the downside of this is that this is a direct internal optab. This > > means > > that for the representation to be used the target *must* have the optab > > implemented. This is a bit annoying because it doesn't allow us to > > generically > > assume that all targets use SAT_ADD for saturating add and thus only have to > > write optimization for this representation. > > > > This is why Richi said we may need to use a new tree_code because we can > > override tree code expansions. However the same can be done with the > > _EXT_FN > > internal functions. > > > > So what I meant was that we want to have a combination of the two. i.e. 
a > > DEF_INTERNAL_SIGNED_OPTAB_EXT_FN. > > Whether we want/need _EXT or only direct depends mainly on how we want to > leverage support. If it's only during vectorization and possibly instruction > selection a direct optab is IMO the way to go. Generic optimization only > marginally improves when you explode the number of basic operations you > expose - in fact it gets quite unwieldy to support all of them in > simplifications > and/or canonicalization and you possibly need to translate them back to what > the target CPU supports. > > We already do have too many (IMO) "special" operations exposed "early" > in the GIMPLE pipeline. > > But what I'd like to see is that we do more instruction selection on GIMPLE > but _late_ (there's the pass_optimize_widening_mul and pass_gimple_isel > passes doing what I'd call instruction selection). But that means not adding > match.pd patterns for that or at least have a separate isel-match.pd > machinery for that. > > So as a start I would go for a direct optab and see to recognize it during > ISEL? > The problem with ISEL, and the reason I suggested an indirect IFN, is that there are benefits to be had from recognizing it early. Saturating arithmetic can be optimized differently from non-saturating arithmetic. But additionally, a common way of specifying it decomposes to branches and/or uses COMPLEX_EXPR (see the various PRs on saturating arithmetic). These two representations can be detected in PHI-opts and it's beneficial to all targets to canonicalize them to the branchless code. Those two cases also *completely* stop vectorization because of either the control flow or the fact the vectorizer can't handle complex types. So really, gimple ISEL would fix just 1 of the 3 very common cases, and then we'd still need to hack the vectorizer cost models for targets with saturating vector instructions. I of course defer to you, but it seems quite suboptimal to do it this way and it doesn't get us first-class saturation support.
Additionally there have been discussions whether both clang and gcc should provide __builtin_saturate_* methods, which the non-direct IFN would help support. Tamar. > > If Richi agrees, the below is what I meant. It creates the infrastructure > > for this > > and for now only allows a default fallback for unsigned saturating add and > > makes > > it easier for us to add the rest later > > > > Also, unless I'm wrong (and Richi can correct me here), us_plus and ss_plus > > are > the > > RTL expressi
RE: [PATCH]middle-end: delay updating of dominators until later during vectorization. [PR114081]
> > The testcase shows an interesting case where we have multiple loops sharing > > a > > live value and have an early exit that goes to the same location. The > > additional > > complication is that on x86_64 with -mavx we seem to also do prologue > > peeling > > on the loops. > > > > We correctly identify which BBs need their dominators updated, but we > > do > > so too early. > > > > Instead of adding more dominator updates, we can solve this by not verifying > > dominators at the end of peeling, for the cases with multiple exits, when > > peeling for vectorization. > > > > We can then perform the final dominator updates just before vectorization > > when > > all loop transformations are done. > > What's the actual CFG transform that happens between the old and the new > place? I see a possible edge splitting but where is the one that makes > this patch work? It's not one but two. 1. Loop 1 is prologue peeled. This ICEs because the dominator update is only happening for epilogue peeling. Note that loop 1 here dominates 21 and the ICE is: ice.c: In function 'void php_zval_filter(int, int)': ice.c:7:6: error: dominator of 14 should be 21, not 3 7 | void php_zval_filter(int filter, int id1) { | ^~~ ice.c:7:6: error: dominator of 10 should be 21, not 3 during GIMPLE pass: vect dump file: a-ice.c.179t.vect This can be simply fixed by just moving the dom update code down: diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index a5202f32e27..e88948370c6 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1845,13 +1845,7 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, to the original function exit we recorded. Other exits are already correct.
*/ if (multiple_exits_p) - { - update_loop = new_loop; - doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header); - for (unsigned i = 0; i < doms.length (); ++i) - if (flow_bb_inside_loop_p (loop, doms[i])) - doms.unordered_remove (i); - } + update_loop = new_loop; } else /* Add the copy at entry. */ { @@ -1906,6 +1900,11 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, if (multiple_exits_p) { + doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header); + for (unsigned i = 0; i < doms.length (); ++i) + if (flow_bb_inside_loop_p (loop, doms[i])) + doms.unordered_remove (i); + for (edge e : get_loop_exit_edges (update_loop)) { edge ex; With that done, the next ICE comes along. Loop 1 is peeled again, but this time for the epilogue. However, loop 1 no longer dominates the exits; the prologue-peeled loop does. So we don't find anything to update and ICE with the second ICE: ice.c: In function 'void php_zval_filter(int, int)': ice.c:7:6: error: dominator of 14 should be 2, not 21 7 | void php_zval_filter(int filter, int id1) { | ^~~ ice.c:7:6: error: dominator of 10 should be 2, not 21 during GIMPLE pass: vect dump file: a-ice.c.179t.vect because the prologue loop no longer dominates them due to the skip edge. This is why delaying works: we know we have to update the dominators of 14 and 10, but we don't yet know to what. Tamar > > > This also means we reduce the number of dominator updates needed by at least > > 50% and fixes the ICE. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and > > x86_64-pc-linux-gnu no issues. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR tree-optimization/114081 > > PR tree-optimization/113290 > > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): > > Skip dominator update when multiple exit. > > (vect_do_peeling): Remove multiple exit dominator update.
> > * tree-vect-loop.cc (vect_transform_loop): Update dominators when > > multiple exits. > > * tree-vectorizer.h (LOOP_VINFO_DOMS_NEED_UPDATE, > > dominators_needing_update): New. > > > > gcc/testsuite/ChangeLog: > > > > PR tree-optimization/114081 > > PR tree-optimization/113290 > > * gcc.dg/vect/vect-early-break_120-pr114081.c: New test. > > * gcc.dg/vect/vect-early-break_121-pr114081.c: New test. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c > > new file mode 100644 > > index > ..2cd4ce1e4ac573ba6e4173 > 0fd2216f0ec8061376 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c > > @@ -0,0 +1,38 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > +/* {
[PATCH]middle-end: delay updating of dominators until later during vectorization. [PR114081]
Hi All,

The testcase shows an interesting case where we have multiple loops
sharing a live value and have an early exit that goes to the same
location.  The additional complication is that on x86_64 with -mavx we
seem to also do prologue peeling on the loops.

We correctly identify which BBs we need their dominators updated for,
but we do so too early.

Instead of adding more dominator updates we can solve this, for the
cases with multiple exits, by not verifying dominators at the end of
peeling if peeling for vectorization.

We can then perform the final dominator updates just before
vectorization, when all loop transformations are done.

This also means we reduce the number of dominator updates needed by at
least 50% and fixes the ICE.

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu, no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/114081
	PR tree-optimization/113290
	* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
	Skip dominator update when multiple exits.
	(vect_do_peeling): Remove multiple exit dominator update.
	* tree-vect-loop.cc (vect_transform_loop): Update dominators when
	multiple exits.
	* tree-vectorizer.h (LOOP_VINFO_DOMS_NEED_UPDATE,
	dominators_needing_update): New.

gcc/testsuite/ChangeLog:

	PR tree-optimization/114081
	PR tree-optimization/113290
	* gcc.dg/vect/vect-early-break_120-pr114081.c: New test.
	* gcc.dg/vect/vect-early-break_121-pr114081.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c new file mode 100644 index ..2cd4ce1e4ac573ba6e41730fd2216f0ec8061376 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c @@ -0,0 +1,38 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +typedef struct filter_list_entry { + const char *name; + int id; + void (*function)(); +} filter_list_entry; + +static const filter_list_entry filter_list[9] = {0}; + +void php_zval_filter(int filter, int id1) { + filter_list_entry filter_func; + + int size = 9; + for (int i = 0; i < size; ++i) { +if (filter_list[i].id == filter) { + filter_func = filter_list[i]; + goto done; +} + } + +#pragma GCC novector + for (int i = 0; i < size; ++i) { +if (filter_list[i].id == 0x0204) { + filter_func = filter_list[i]; + goto done; +} + } +done: + if (!filter_func.id) +filter_func.function(); +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c new file mode 100644 index ..feebdb7a6c9b8981d7be31dd1c741f9e36738515 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c @@ -0,0 +1,37 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +typedef struct filter_list_entry { + const char *name; + int id; + void (*function)(); +} filter_list_entry; + +static const filter_list_entry filter_list[9] = {0}; + +void php_zval_filter(int filter, int id1) { + 
filter_list_entry filter_func; + + int size = 9; + for (int i = 0; i < size; ++i) { +if (filter_list[i].id == filter) { + filter_func = filter_list[i]; + goto done; +} + } + + for (int i = 0; i < size; ++i) { +if (filter_list[i].id == 0x0204) { + filter_func = filter_list[i]; + goto done; +} + } +done: + if (!filter_func.id) +filter_func.function(); +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 3f974d6d839e32516ae316f28ca25316e43d7d86..b5e158bc5cfb5107d5ff461e489d306f81e090d0 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1917,7 +1917,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, doms.safe_push (e->dest); } - iterate_fix_dominators (CDI_DOMINATORS, doms, false); if (updated_doms) updated_doms->safe_splice (doms); } @@ -1925,7 +1924,9 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, free (new_bbs); free (bbs); - checking_verify_dominators (CDI_DOMINATORS); + /* If we're peeling for vectorization then delay verifying dominators. */ + if (!flow_loops || !multiple_exits_p) +checking_verify_dominators (CDI_DOMINATORS); return
RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS
Hi Pan,

> From: Pan Li
>
> Hi Richard & Tamar,
>
> Try the DEF_INTERNAL_INT_EXT_FN as your suggestion.  By mapping
> us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def,
> and then expand_US_PLUS in internal-fn.cc.  Not very sure if my
> understanding is correct for DEF_INTERNAL_INT_EXT_FN.
>
> I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given
> the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already.

I think a couple of things are being confused here.  So let's break it
down:

The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is that in GIMPLE we only
want one internal function for both signed and unsigned SAT_ADD.  With
this definition we don't need SAT_UADD and SAT_SADD; instead we will
only have SAT_ADD, which will expand to us_plus or ss_plus.

Now the downside of this is that it is a direct internal optab.  This
means that for the representation to be used, the target *must* have
the optab implemented.  This is a bit annoying because it doesn't allow
us to generically assume that all targets use SAT_ADD for saturating
add, and thus we can't only write optimizations for this
representation.

This is why Richi said we may need to use a new tree code, because we
can override tree-code expansions.  However the same can be done with
the _EXT_FN internal functions.

So what I meant was that we want to have a combination of the two,
i.e. a DEF_INTERNAL_SIGNED_OPTAB_EXT_FN.

If Richi agrees, the below is what I meant.  It creates the
infrastructure for this and for now only allows a default fallback for
unsigned saturating add, and makes it easier for us to add the rest
later.

Also, unless I'm wrong (and Richi can correct me here), us_plus and
ss_plus are the RTL expressions, but the optabs for saturation are
ssadd and usadd.  So you don't need to make new us_plus and ss_plus
ones.
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc index a07f25f3aee..aaf9f8991b3 100644 --- a/gcc/internal-fn.cc +++ b/gcc/internal-fn.cc @@ -4103,6 +4103,17 @@ direct_internal_fn_supported_p (internal_fn fn, tree_pair types, return direct_##TYPE##_optab_supported_p (which_optab, types, \ opt_type);\ } +#define DEF_INTERNAL_SIGNED_OPTAB_EXT_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \ +UNSIGNED_OPTAB, TYPE) \ +case IFN_##CODE: \ + { \ + optab which_optab = (TYPE_UNSIGNED (types.SELECTOR) \ +? UNSIGNED_OPTAB ## _optab \ +: SIGNED_OPTAB ## _optab); \ + return direct_##TYPE##_optab_supported_p (which_optab, types, \ + opt_type) \ + || internal_##CODE##_fn_supported_p (types.SELECTOR, opt_type); \ + } #include "internal-fn.def" case IFN_LAST: @@ -4303,6 +4314,8 @@ set_edom_supported_p (void) optab which_optab = direct_internal_fn_optab (fn, types); \ expand_##TYPE##_optab_fn (fn, stmt, which_optab); \ } +#define DEF_INTERNAL_SIGNED_OPTAB_EXT_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \ +UNSIGNED_OPTAB, TYPE) #include "internal-fn.def" /* Routines to expand each internal function, indexed by function number. @@ -5177,3 +5190,45 @@ expand_POPCOUNT (internal_fn fn, gcall *stmt) emit_move_insn (plhs, cmp); } } + +void +expand_SAT_ADD (internal_fn fn, gcall *stmt) +{ + /* Check if the target supports the expansion through an IFN. */ + tree_pair types = direct_internal_fn_types (fn, stmt); + optab which_optab = direct_internal_fn_optab (fn, types); + if (direct_binary_optab_supported_p (which_optab, types, + insn_optimization_type ())) +{ + expand_binary_optab_fn (fn, stmt, which_optab); + return; +} + + /* Target does not support the optab, but we can de-compose it. */ + /* + ... decompose to a canonical representation ... + if (TYPE_UNSIGNED (types.SELECTOR)) +{ + ... + decompose back to (X + Y) | - ((X + Y) < X) +} + else +{ + ... 
+} + */ +} + +bool internal_SAT_ADD_fn_supported_p (tree type, optimization_type /* optype */) +{ + /* For now, don't support decomposing vector ops. */ + if (VECTOR_TYPE_P (type)) +return false; + + /* Signed saturating arithmetic is harder to do since we'll so for now + lets ignore. */ + if (!TYPE_UNSIGNED (type)) +return false; + + return TREE_CODE (type) == INTEGER_TYPE; +} \ No newline at end of file diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index c14d30365c1..5a2491228d5 100644 --- a/gcc/internal-fn.def +++ b/gcc/internal-fn.def @@ -92,6 +92,10 @@ along with GCC; see the file
[PATCH]middle-end: update vuses out of loop which use a vdef that's moved [PR114068]
Hi All,

In certain cases we can have a situation where the merge block has a
vUSE virtual PHI and the exits do not.  In this case, for instance, the
exits lead to an abort and so have no virtual PHIs.

If we have a store before the first exit and we move it to a later
block during vectorization, we update the vUSE chain.  However the
merge block is not an exit and is not visited by the update code.

This patch fixes it by checking during the move whether there are any
out-of-loop uses of the vDEF that is the last_seen_vuse.  Normally
there wouldn't be any and things are skipped, but if there is one then
update it to the last vDEF in the exit block.

Bootstrapped and regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu, no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/114068
	* tree-vect-loop.cc (move_early_exit_stmts): Update vUSE chain in
	merge block.

gcc/testsuite/ChangeLog:

	PR tree-optimization/114068
	* gcc.dg/vect/vect-early-break_118-pr114068.c: New test.
	* gcc.dg/vect/vect-early-break_119-pr114068.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c new file mode 100644 index ..b462a464b6603e718c5a283513ea586fc13e37ce --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c @@ -0,0 +1,23 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +struct h { + int b; + int f; +} k; + +void n(int m) { + struct h a = k; + for (int o = m; o; ++o) { +if (a.f) + __builtin_unreachable(); +if (o > 1) + __builtin_unreachable(); +*( + o) = 1; + } +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c new file mode 100644 index ..a65ef7b8c4901b2ada585f38fda436dc07d1e1de --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c @@ -0,0 +1,25 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +struct h { + int b; + int c; + int f; +} k; + +void n(int m) { + struct h a = k; + for (int o = m; o; ++o) { +if (a.f) + __builtin_unreachable(); +if (o > 1) + __builtin_unreachable(); +*( + o) = 1; +*( + o*m) = 2; + } +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 35f1f8c7d4245135ace740ff9be548919587..44bd8032b55b1ef84fdf4fa9d6117304b7709d6f 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -11837,6 +11837,27 @@ move_early_exit_stmts (loop_vec_info loop_vinfo) update_stmt (p); } + /* last_seen_vuse should now be the PHI in the loop header. 
Check for + any out of loop uses and update them to the vUSE on the loop latch. */ + auto vuse_stmt = loop_vinfo->lookup_def (last_seen_vuse); + gphi *vuse_def; + if (vuse_stmt + && (vuse_def = dyn_cast (STMT_VINFO_STMT (vuse_stmt +{ + imm_use_iterator iter; + use_operand_p use_p; + gimple *use_stmt; + auto loop = LOOP_VINFO_LOOP (loop_vinfo); + tree vuse = PHI_ARG_DEF_FROM_EDGE (vuse_def, loop_latch_edge (loop)); + FOR_EACH_IMM_USE_STMT (use_stmt, iter, last_seen_vuse) + { + if (flow_bb_inside_loop_p (loop, use_stmt->bb)) + continue; + FOR_EACH_IMM_USE_ON_STMT (use_p, iter) + SET_USE (use_p, vuse); + } +} + /* And update the LC PHIs on exits. */ for (edge e : get_loop_exit_edges (LOOP_VINFO_LOOP (loop_vinfo))) if (!dominated_by_p (CDI_DOMINATORS, e->src, dest_bb)) -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c new file mode 100644 index ..b462a464b6603e718c5a283513ea586fc13e37ce --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c @@ -0,0 +1,23 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +struct h { + int b; + int f; +} k; + +void n(int m) { + struct h a = k; + for (int o = m; o; ++o) { +if (a.f) + __builtin_unreachable(); +if (o > 1) + __builtin_unreachable(); +*( + o) = 1; + } +} diff --git
RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> -Original Message- > From: Li, Pan2 > Sent: Monday, February 19, 2024 12:59 PM > To: Tamar Christina ; Richard Biener > > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang > ; kito.ch...@gmail.com > Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU > > Thanks Tamar for comments and explanations. > > > I think we should actually do an indirect optab here, because the IFN can > > be used > > to replace the general representation of saturating arithmetic. > > > e.g. the __builtin_add_overflow case in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600 > > is inefficient on all targets and so the IFN can always expand to something > > that's > more > > efficient like the branchless version add_sat2. > > > I think this is why you suggested a new tree code below, but we don't > > really need > > tree-codes for this. It can be done cleaner using the same way as > DEF_INTERNAL_INT_EXT_FN > > Yes, the backend could choose a branchless(of course we always hate branch for > performance) code-gen or even better there is one saturation insn. > Good to learn DEF_INTERNAL_INT_EXT_FN, and will have a try for it. > > > Richard means that there shouldn't be .SAT_ADDU and .SAT_ADDS and that the > sign > > should be determined by the types at expansion time. i.e. there should > > only be > > .SAT_ADD. > > Got it, my initial idea comes from that we may have two insns for saturation > add, > mostly these insns need to be signed or unsigned. > For example, slt/sltu in riscv scalar. But I am not very clear about a > scenario like this. > During define_expand in backend, we hit the standard name > sat_add_3 but can we tell it is signed or not here? AFAIK, we only have > QI, HI, > SI and DI. Yeah, the way DEF_INTERNAL_SIGNED_OPTAB_FN works is that you give it two optabs, one for when it's signed and one for when it's unsigned, and the right one is picked automatically during expansion. But in GIMPLE you'd only have one IFN. 
> Maybe I will have the answer after trying DEF_INTERNAL_SIGNED_OPTAB_FN,
> will keep you posted.

Awesome, Thanks!
Tamar

> Pan
>
> -----Original Message-----
> From: Tamar Christina
> Sent: Monday, February 19, 2024 4:55 PM
> To: Li, Pan2 ; Richard Biener
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com
> Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
>
> Thanks for doing this!
>
> > -----Original Message-----
> > From: Li, Pan2
> > Sent: Monday, February 19, 2024 8:42 AM
> > To: Richard Biener
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> > ; kito.ch...@gmail.com; Tamar Christina
> >
> > Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> >
> > Thanks Richard for comments.
> >
> > > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and
> > > the corresponding ssadd/usadd optabs.  There's not much documentation
> > > unfortunately besides the use of gen_*_fixed_libfunc usage where the
> > > comment suggests this is used for fixed-point operations.  It looks
> > > like arm uses fractional/accumulator modes for this but for example
> > > bfin has ssaddsi3.
> >
> > I find the related description about the plus family in the GCC
> > internals doc but it doesn't mention anything about mode m here.
> >
> > (plus:m x y)
> > (ss_plus:m x y)
> > (us_plus:m x y)
> > These three expressions all represent the sum of the values
> > represented by x and y carried out in machine mode m.  They differ in
> > their behavior on overflow of integer modes.  plus wraps round modulo
> > the width of m; ss_plus saturates at the maximum signed value
> > representable in m; us_plus saturates at the maximum unsigned value.
> >
> > > The natural thing is to use direct optab internal functions (that's
> > > what you basically did, but you added a new optab, IMO without good
> > > reason).
> > I think we should actually do an indirect optab here, because the IFN can be > used > to replace the general representation of saturating arithmetic. > > e.g. the __builtin_add_overflow case in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600 > is inefficient on all targets and so the IFN can always expand to something > that's > more > efficient like the branchless version add_sat2. > > I think this is why you suggested a new tree code below, but we don't really > need > tree-codes for this. It can be done cleaner using the sam
RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071]
> -Original Message- > From: Tamar Christina > Sent: Thursday, February 15, 2024 11:05 AM > To: Richard Earnshaw (lists) ; gcc- > patc...@gcc.gnu.org > Cc: nd ; Marcus Shawcroft ; Kyrylo > Tkachov ; Richard Sandiford > > Subject: RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071] > > > -Original Message- > > From: Richard Earnshaw (lists) > > Sent: Thursday, February 15, 2024 11:01 AM > > To: Tamar Christina ; gcc-patches@gcc.gnu.org > > Cc: nd ; Marcus Shawcroft ; > Kyrylo > > Tkachov ; Richard Sandiford > > > > Subject: Re: [PATCH]AArch64: xfail modes_1.f90 [PR107071] > > > > On 15/02/2024 10:57, Tamar Christina wrote: > > > Hi All, > > > > > > This test has never worked on AArch64 since the day it was committed. It > > > has > > > a number of issues that prevent it from working on AArch64: > > > > > > 1. IEEE does not require that FP operations raise a SIGFPE for FP > > > operations, > > > only that an exception is raised somehow. > > > > > > 2. Most Arm designed cores don't raise SIGFPE and instead set a status > > > register > > > and some partner cores raise a SIGILL instead. > > > > > > 3. The way it checks for feenableexcept doesn't really work for AArch64. > > > > > > As such this test doesn't seem to really provide much value on AArch64 so > > > we > > > should just xfail it. > > > > > > Regtested on aarch64-none-linux-gnu and no issues. > > > > > > Ok for master? > > > > Wouldn't it be better to just skip the test. XFAIL just adds clutter to > > verbose > output > > and suggests that someday the tools might be fixed for this case. > > > > Better still would be a new dg-requires fp_exceptions_raise_sigfpe as a > > guard for > > the test. > It looks like this is similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 so I'll just similarly skip it. 
--- inline copy of patch ---
diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
index 205c47f38007d06116289c19d6b23cf3bf83bd48..e29d8c678e6e51c3f2e5dac53c7703bb18a99ac4 100644
--- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
+++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
@@ -1,5 +1,5 @@
 ! { dg-do run }
-!
+! { dg-skip-if "PR libfortran/78314" { aarch64*-*-gnu* arm*-*-gnueabi arm*-*-gnueabihf } }
 ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

	PR fortran/107071
	* gfortran.dg/ieee/modes_1.f90: Skip aarch64, arm.
RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
Thanks for doing this!

> -----Original Message-----
> From: Li, Pan2
> Sent: Monday, February 19, 2024 8:42 AM
> To: Richard Biener
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com; Tamar Christina
>
> Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
>
> Thanks Richard for comments.
>
> > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and
> > the corresponding ssadd/usadd optabs.  There's not much documentation
> > unfortunately besides the use of gen_*_fixed_libfunc usage where the
> > comment suggests this is used for fixed-point operations.  It looks
> > like arm uses fractional/accumulator modes for this but for example
> > bfin has ssaddsi3.
>
> I find the related description about the plus family in the GCC
> internals doc but it doesn't mention anything about mode m here.
>
> (plus:m x y)
> (ss_plus:m x y)
> (us_plus:m x y)
> These three expressions all represent the sum of the values represented
> by x and y carried out in machine mode m.  They differ in their
> behavior on overflow of integer modes.  plus wraps round modulo the
> width of m; ss_plus saturates at the maximum signed value representable
> in m; us_plus saturates at the maximum unsigned value.
>
> > The natural thing is to use direct optab internal functions (that's
> > what you basically did, but you added a new optab, IMO without good
> > reason).

I think we should actually do an indirect optab here, because the IFN
can be used to replace the general representation of saturating
arithmetic.

e.g. the __builtin_add_overflow case in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
is inefficient on all targets, and so the IFN can always expand to
something that's more efficient, like the branchless version add_sat2.

I think this is why you suggested a new tree code below, but we don't
really need tree codes for this.  It can be done cleaner using the same
way as DEF_INTERNAL_INT_EXT_FN.
> > That makes sense to me, I will try to leverage US_PLUS instead here. > > > More GIMPLE-like would be to let the types involved decide whether > > it's signed or unsigned saturation. That's actually what I'd prefer here > > and if we don't map 1:1 to optabs then instead use tree codes like > > S_PLUS_EXPR (mimicing RTL here). > > Sorry I don't get the point here for GIMPLE-like way. For the .SAT_ADDU, I > add one > restriction > like unsigned_p (type) in match.pd. Looks we have a better way here. > Richard means that there shouldn't be .SAT_ADDU and .SAT_ADDS and that the sign should be determined by the types at expansion time. i.e. there should only be .SAT_ADD. i.e. instead of this +DEF_INTERNAL_OPTAB_FN (SAT_ADDU, ECF_CONST | ECF_NOTHROW, sat_addu, binary) You should use DEF_INTERNAL_SIGNED_OPTAB_FN. Regards, Tamar > > Any other opinions? Anyone knows more about fixed-point and RTL/modes? > > AFAIK, the scalar of the riscv backend doesn't have fixed-point but the > vector does > have. They > share the same mode as vector integer. For example, RVVM1SI in vector- > iterators.md. Kito > and Juzhe can help to correct me if any misunderstandings. > > Pan > > -Original Message- > From: Richard Biener > Sent: Monday, February 19, 2024 3:36 PM > To: Li, Pan2 > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang > ; kito.ch...@gmail.com; tamar.christ...@arm.com > Subject: Re: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU > > On Sat, Feb 17, 2024 at 11:30 AM wrote: > > > > From: Pan Li > > > > This patch would like to add the middle-end presentation for the > > unsigned saturation add. Aka set the result of add to the max > > when overflow. It will take the pattern similar as below. > > > > SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x)) > > > > Take uint8_t as example, we will have: > > > > * SAT_ADDU (1, 254) => 255. > > * SAT_ADDU (1, 255) => 255. > > * SAT_ADDU (2, 255) => 255. > > * SAT_ADDU (255, 255) => 255. 
> > > > The patch also implement the SAT_ADDU in the riscv backend as > > the sample. Given below example: > > > > uint64_t sat_add_u64 (uint64_t x, uint64_t y) > > { > > return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x)); > > } > > > > Before this patch: > > > > uint64_t sat_add_uint64_t (uint64_t x, uint64_t y) > > { > > long unsigned int _1; > > _Bool _2; > > long unsigned int _3; > > long unsigned int _4; > > uint64_t _7; > > long unsigned int _10; > > __complex__ long unsigned int _11
RE: [PATCH] aarch64: Improve PERM<{0}, a, ...> (64bit) by adding whole vector shift right [PR113872]
> -----Original Message-----
> From: Richard Sandiford
> Sent: Thursday, February 15, 2024 2:56 PM
> To: Andrew Pinski
> Cc: gcc-patches@gcc.gnu.org; Tamar Christina
> Subject: Re: [PATCH] aarch64: Improve PERM<{0}, a, ...> (64bit) by adding
> whole vector shift right [PR113872]
>
> Andrew Pinski writes:
> > The backend currently defines a whole vector shift left for 64bit
> > vectors; adding the shift right can also improve code for some PERMs
> > too.  So this adds that pattern.
>
> Is this reversed?  It looks like we have the shift right and the patch
> is adding the shift left (at least in GCC internal and little-endian
> terms).
>
> But on many Arm cores, EXT has a higher throughput than SHL, so I don't
> think we should do this unconditionally.

Yeah, on most (if not all) Arm cores the EXT has higher throughput than
SHL, and on Cortex-A75 the EXT has both higher throughput and lower
latency.

I guess the expected gain here is that we wouldn't need to create the
zero vector.  However, on modern Arm cores the zero-vector creation is
free using movi, and EXT being a three-operand instruction also means
we only need one copy if used e.g. in a loop.

Kind Regards,
Tamar

> Thanks,
> Richard
>
> >
> > I added a testcase for the shift left also.  I also fixed the
> > instruction template there, which was using a space instead of a tab
> > after the instruction.
> >
> > Built and tested on aarch64-linux-gnu.
> >
> > PR target/113872
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md (vec_shr_):
> > 	Use tab instead of space after the instruction in the template.
> > 	(vec_shl_): New pattern.
> > 	* config/aarch64/iterators.md (unspec): Add UNSPEC_VEC_SHL.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	* gcc.target/aarch64/perm_zero-1.c: New test.
> > 	* gcc.target/aarch64/perm_zero-2.c: New test.
> > > > Signed-off-by: Andrew Pinski > > --- > > gcc/config/aarch64/aarch64-simd.md | 18 -- > > gcc/config/aarch64/iterators.md| 1 + > > gcc/testsuite/gcc.target/aarch64/perm_zero-1.c | 15 +++ > > gcc/testsuite/gcc.target/aarch64/perm_zero-2.c | 15 +++ > > 4 files changed, 47 insertions(+), 2 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/perm_zero-1.c > > create mode 100644 gcc/testsuite/gcc.target/aarch64/perm_zero-2.c > > > > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > > index f8bb973a278..0d2f1ea3902 100644 > > --- a/gcc/config/aarch64/aarch64-simd.md > > +++ b/gcc/config/aarch64/aarch64-simd.md > > @@ -1592,9 +1592,23 @@ (define_insn "vec_shr_" > >"TARGET_SIMD" > >{ > > if (BYTES_BIG_ENDIAN) > > - return "shl %d0, %d1, %2"; > > + return "shl\t%d0, %d1, %2"; > > else > > - return "ushr %d0, %d1, %2"; > > + return "ushr\t%d0, %d1, %2"; > > + } > > + [(set_attr "type" "neon_shift_imm")] > > +) > > +(define_insn "vec_shl_" > > + [(set (match_operand:VD 0 "register_operand" "=w") > > +(unspec:VD [(match_operand:VD 1 "register_operand" "w") > > + (match_operand:SI 2 "immediate_operand" "i")] > > + UNSPEC_VEC_SHL))] > > + "TARGET_SIMD" > > + { > > +if (BYTES_BIG_ENDIAN) > > + return "ushr\t%d0, %d1, %2"; > > +else > > + return "shl\t%d0, %d1, %2"; > >} > >[(set_attr "type" "neon_shift_imm")] > > ) > > diff --git a/gcc/config/aarch64/iterators.md > > b/gcc/config/aarch64/iterators.md > > index 99cde46f1ba..3aebe9cf18a 100644 > > --- a/gcc/config/aarch64/iterators.md > > +++ b/gcc/config/aarch64/iterators.md > > @@ -758,6 +758,7 @@ (define_c_enum "unspec" > > UNSPEC_PMULL; Used in aarch64-simd.md. > > UNSPEC_PMULL2 ; Used in aarch64-simd.md. > > UNSPEC_REV_REGLIST ; Used in aarch64-simd.md. > > +UNSPEC_VEC_SHL ; Used in aarch64-simd.md. > > UNSPEC_VEC_SHR ; Used in aarch64-simd.md. > > UNSPEC_SQRDMLAH ; Used in aarch64-simd.md. > > UNSPEC_SQRDMLSH ; Used in aarch64-simd.md. 
> > diff --git a/gcc/testsuite/gcc.target/aarch64/perm_zero-1.c > b/gcc/testsuite/gcc.target/aarch64/perm_zero-1.c > > new file mode 100644 > >
RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071]
> -Original Message- > From: Richard Earnshaw (lists) > Sent: Thursday, February 15, 2024 11:01 AM > To: Tamar Christina ; gcc-patches@gcc.gnu.org > Cc: nd ; Marcus Shawcroft ; Kyrylo > Tkachov ; Richard Sandiford > > Subject: Re: [PATCH]AArch64: xfail modes_1.f90 [PR107071] > > On 15/02/2024 10:57, Tamar Christina wrote: > > Hi All, > > > > This test has never worked on AArch64 since the day it was committed. It > > has > > a number of issues that prevent it from working on AArch64: > > > > 1. IEEE does not require that FP operations raise a SIGFPE for FP > > operations, > > only that an exception is raised somehow. > > > > 2. Most Arm designed cores don't raise SIGFPE and instead set a status > > register > > and some partner cores raise a SIGILL instead. > > > > 3. The way it checks for feenableexcept doesn't really work for AArch64. > > > > As such this test doesn't seem to really provide much value on AArch64 so we > > should just xfail it. > > > > Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master? > > Wouldn't it be better to just skip the test. XFAIL just adds clutter to > verbose output > and suggests that someday the tools might be fixed for this case. > > Better still would be a new dg-requires fp_exceptions_raise_sigfpe as a guard > for > the test. There seems to be check_effective_target_fenv_exceptions which seems to test for if the target can raise FP exceptions. I'll see if that works. Thanks, Tamar > > R. > > > > > Thanks, > > Tamar > > > > gcc/testsuite/ChangeLog: > > > > PR fortran/107071 > > * gfortran.dg/ieee/modes_1.f90: xfail aarch64. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 > b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 > > index > 205c47f38007d06116289c19d6b23cf3bf83bd48..3667571969427ae7b2b9668 > 4ec1af8b3fdd4985f 100644 > > --- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 > > +++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 > > @@ -1,4 +1,4 @@ > > -! 
{ dg-do run } > > +! { dg-do run { xfail { aarch64*-*-* } } } > > ! > > ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES

--
[PATCH]AArch64: xfail modes_1.f90 [PR107071]
Hi All,

This test has never worked on AArch64 since the day it was committed. It has a number of issues that prevent it from working on AArch64:

1. IEEE does not require that FP operations raise a SIGFPE, only that an exception is raised somehow.

2. Most Arm designed cores don't raise SIGFPE and instead set a status register, and some partner cores raise a SIGILL instead.

3. The way it checks for feenableexcept doesn't really work for AArch64.

As such this test doesn't seem to provide much value on AArch64, so we should just xfail it.

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

	PR fortran/107071
	* gfortran.dg/ieee/modes_1.f90: xfail aarch64.

--- inline copy of patch --

diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
index 205c47f38007d06116289c19d6b23cf3bf83bd48..3667571969427ae7b2b96684ec1af8b3fdd4985f 100644
--- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
+++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
@@ -1,4 +1,4 @@
-! { dg-do run }
+! { dg-do run { xfail { aarch64*-*-* } } }
 !
 ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES

--
RE: [PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..
Hi,

This is a new version of the patch updating some additional tests, because some of the LTO tests required a newer binutils than my distro had.

---

The Arm Architecture Reference Manual (Version J.a, section A2.9 on FEAT_LS64) shows that ls64 is an optional extension and should not be enabled by default for Armv8.7-a.

This drops it from the mandatory bits for the architecture and brings GCC in line with LLVM and the architecture.

Note that we will not be changing binutils, to preserve compatibility with older released compilers.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? And backport to GCC 13, 12, 11?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from
	Armv8.7-a.

gcc/testsuite/ChangeLog:

	* g++.target/aarch64/acle/ls64.C: Add +ls64.
	* g++.target/aarch64/acle/ls64_lto.C: Likewise.
	* gcc.target/aarch64/acle/ls64_lto.c: Likewise.
	* gcc.target/aarch64/acle/pr110100.c: Likewise.
	* gcc.target/aarch64/acle/pr110132.c: Likewise.
	* gcc.target/aarch64/options_set_28.c: Drop check for nols64.
	* gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks.
--- inline copy of patch --- diff --git a/gcc/config/aarch64/aarch64-arches.def b/gcc/config/aarch64/aarch64-arches.def index b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e 100644 --- a/gcc/config/aarch64/aarch64-arches.def +++ b/gcc/config/aarch64/aarch64-arches.def @@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a, V8_3A, 8, (V8_2A, PAUTH, R AARCH64_ARCH("armv8.4-a", generic_armv8_a, V8_4A, 8, (V8_3A, F16FML, DOTPROD, FLAGM)) AARCH64_ARCH("armv8.5-a", generic_armv8_a, V8_5A, 8, (V8_4A, SB, SSBS, PREDRES)) AARCH64_ARCH("armv8.6-a", generic_armv8_a, V8_6A, 8, (V8_5A, I8MM, BF16)) -AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A, LS64)) +AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A)) AARCH64_ARCH("armv8.8-a", generic_armv8_a, V8_8A, 8, (V8_7A, MOPS)) AARCH64_ARCH("armv8.9-a", generic_armv8_a, V8_9A, 8, (V8_8A)) AARCH64_ARCH("armv8-r", generic_armv8_a, V8R , 8, (V8_4A)) diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C b/gcc/testsuite/g++.target/aarch64/acle/ls64.C index d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d 100644 --- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C +++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ #include int main() { diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C b/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C index 274a4771e1c1d13bcb1a7bdc77c2e499726f024c..0198fe2a1b78627b873bf22e3d8416dbdcc77078 100644 --- a/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C +++ b/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C @@ -1,5 +1,5 @@ /* { dg-do link { target aarch64_asm_ls64_ok } } */ -/* { dg-additional-options "-march=armv8.7-a -flto" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64 -flto" } */ #include int main() { diff --git 
a/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c b/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c index 8b4f24277717675badc39dd145d365f75f5ceb27..0e5ae0b052b50b08d35151f4bc113617c1569bd3 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c @@ -1,5 +1,5 @@ /* { dg-do link { target aarch64_asm_ls64_ok } } */ -/* { dg-additional-options "-march=armv8.7-a -flto" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64 -flto" } */ #include int main(void) { diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c index f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-march=armv8.7-a -O2" } */ +/* { dg-options "-march=armv8.7-a+ls64 -O2" } */ #include void do_st64b(data512_t data) { __arm_st64b((void*)0x1000, data); diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c index fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ /* Check that ls64 builtins can be invoked using a preprocesed testcase without triggering bogus builtin warnings,
RE: [PATCH]AArch64: update vget_set_lane_1.c test output
> -Original Message- > From: Richard Sandiford > Sent: Thursday, February 1, 2024 4:42 PM > To: Tamar Christina > Cc: Andrew Pinski ; gcc-patches@gcc.gnu.org; nd > ; Richard Earnshaw ; Marcus > Shawcroft ; Kyrylo Tkachov > > Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output > > Tamar Christina writes: > >> -Original Message- > >> From: Richard Sandiford > >> Sent: Thursday, February 1, 2024 2:24 PM > >> To: Andrew Pinski > >> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd > >> ; Richard Earnshaw ; Marcus > >> Shawcroft ; Kyrylo Tkachov > >> > >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output > >> > >> Andrew Pinski writes: > >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina > >> wrote: > >> >> > >> >> Hi All, > >> >> > >> >> In the vget_set_lane_1.c test the following entries now generate a zip1 > instead > >> of an INS > >> >> > >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0) > >> >> BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0) > >> >> BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0) > >> >> > >> >> This is because the non-Q variant for indices 0 and 1 are just > >> >> shuffling values. > >> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such > >> >> just > >> update the > >> >> test file. > >> > Hmm, is this true on all cores? I suspect there is a core out there > >> > where INS is implemented with a much lower latency than ZIP. > >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles > >> > while ZIP is 6 cycles (3/7 for q versions). > >> > Now I don't have any invested interest in that core any more but I > >> > just wanted to point out that is not exactly true for all cores. > >> > >> Thanks for the pointer. In that case, perhaps we should prefer > >> aarch64_evpc_ins over aarch64_evpc_zip in > aarch64_expand_vec_perm_const_1? > >> That's enough to fix this failure, but it'll probably require other > >> tests to be adjusted... 
> > I think given that ThunderX is a 10 year old micro-architecture that has
> > several cases where often-used instructions have very high latencies,
> > generic codegen should not be blocked from progressing because of it.
> >
> > We use zips in many things and if thunderx codegen is really of that much
> > importance then I think the old codegen should be gated behind
> > -mcpu=thunderx rather than preventing generic changes.
>
> But you said there was no perf difference between INS and ZIP, so it
> sounds like for all known cases, using INS rather than ZIP is either
> neutral or better.
>
> There's also the possible secondary benefit that the INS patterns use
> standard RTL operations whereas the ZIP patterns use unspecs.
>
> Keeping ZIP seems OK if there's a specific reason to prefer it over INS
> for more modern cores though.

Ok, that's a fair point. Doing some due diligence, the Neoverse-E1 and
Cortex-A65 SWoGs seem to imply that their ZIPs have better throughput than
INSs. However the entries are inconsistent and I can't measure the
difference, so I believe this to be a documentation bug.

That said, switching the operands seems to show one issue: preferring INS
degrades the code in cases where we are inserting the top bits of the first
parameter into the bottom of the second parameter and returning. ZIP, being
a three-operand instruction, allows us to put the result into the final
destination register with one operation, whereas INS requires an fmov:

foo_uzp1_s32:
	ins	v0.s[1], v1.s[0]
	fmov	d0, d0
	ret
foo_uzp2_s32:
	ins	v1.s[0], v0.s[1]
	fmov	d0, d1
	ret

I've posted uzp but zip has the same issue. So I guess it's not better to
flip the order, but perhaps I should add a case to the zip/unzip RTL
patterns for when op0 == op1?

Thanks,
Tamar

> Thanks,
> Richard
[PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..
Hi All,

The Arm Architecture Reference Manual (Version J.a, section A2.9 on FEAT_LS64) shows that ls64 is an optional extension and should not be enabled by default for Armv8.7-a.

This drops it from the mandatory bits for the architecture and brings GCC in line with LLVM and the architecture.

Note that we will not be changing binutils, to preserve compatibility with older released compilers.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? And backport to GCC 13, 12, 11?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from
	Armv8.7-a.

gcc/testsuite/ChangeLog:

	* g++.target/aarch64/acle/ls64.C: Add +ls64.
	* gcc.target/aarch64/acle/pr110100.c: Likewise.
	* gcc.target/aarch64/acle/pr110132.c: Likewise.
	* gcc.target/aarch64/options_set_28.c: Drop check for nols64.
	* gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks.

--- inline copy of patch --

diff --git a/gcc/config/aarch64/aarch64-arches.def b/gcc/config/aarch64/aarch64-arches.def
index b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a, V8_3A, 8, (V8_2A, PAUTH, R
 AARCH64_ARCH("armv8.4-a", generic_armv8_a, V8_4A, 8, (V8_3A, F16FML, DOTPROD, FLAGM))
 AARCH64_ARCH("armv8.5-a", generic_armv8_a, V8_5A, 8, (V8_4A, SB, SSBS, PREDRES))
 AARCH64_ARCH("armv8.6-a", generic_armv8_a, V8_6A, 8, (V8_5A, I8MM, BF16))
-AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A, LS64))
+AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A))
 AARCH64_ARCH("armv8.8-a", generic_armv8_a, V8_8A, 8, (V8_7A, MOPS))
 AARCH64_ARCH("armv8.9-a", generic_armv8_a, V8_9A, 8, (V8_8A))
 AARCH64_ARCH("armv8-r", generic_armv8_a, V8R , 8, (V8_4A))
diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
index
d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d 100644 --- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C +++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ #include int main() { diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c index f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-march=armv8.7-a -O2" } */ +/* { dg-options "-march=armv8.7-a+ls64 -O2" } */ #include void do_st64b(data512_t data) { __arm_st64b((void*)0x1000, data); diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c index fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ /* Check that ls64 builtins can be invoked using a preprocesed testcase without triggering bogus builtin warnings, see PR110132. 
diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_28.c b/gcc/testsuite/gcc.target/aarch64/options_set_28.c index 9e63768581e9d429e9408863942051b1b04761ac..d5b15f8bc5831de56fe667179d83d9c853529aaf 100644 --- a/gcc/testsuite/gcc.target/aarch64/options_set_28.c +++ b/gcc/testsuite/gcc.target/aarch64/options_set_28.c @@ -1,9 +1,9 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv9.3-a+nopredres+nols64+nomops" } */ +/* { dg-additional-options "-march=armv9.3-a+nopredres+nomops" } */ int main () { return 0; } -/* { dg-final { scan-assembler-times {\.arch armv9\.3\-a\+crc\+nopredres\+nols64\+nomops\n} 1 } } */ +/* { dg-final { scan-assembler-times {\.arch armv9\.3\-a\+crc\+nopredres\+nomops\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c index 2d76bfc23dfdcd78a74ec0e4845a3bd8d110b010..d8fc86d1557895f91ffe8be2f65d6581abe51568 100644 --- a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c +++ b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c @@ -242,8 +242,8 @@ #pragma GCC push_options #pragma GCC target ("arch=armv8.7-a") -#ifndef __ARM_FEATURE_LS64 -#error "__ARM_FEATURE_LS64 is not defined but should be!" +#ifdef __ARM_FEATURE_LS64 +#error
RE: [PATCH]middle-end: inspect all exits for additional annotations for loop.
> > I think this isn't entirely good. For simple cases for > do {} while the condition ends up in the latch while for while () {} > loops it ends up in the header. In your case the latch isn't empty > so it doesn't end up with the conditional. > > I think your patch is OK to the point of looking at all loop exit > sources but you should elide the special-casing of header and > latch since it's really only exit conditionals that matter. > That makes sense, since in both cases the edges are in the respective blocks. Should have thought about it more. So how about this one. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * tree-cfg.cc (replace_loop_annotate): Inspect loop edges for annotations. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-novect_gcond.c: New test. --- inline copy of patch --- diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c new file mode 100644 index ..01e69cbef9d51b234c08a400c78dc078d53252f1 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c @@ -0,0 +1,39 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break_hw } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" + +#define N 306 +#define NEEDLE 136 + +int table[N]; + +__attribute__ ((noipa)) +int foo (int i, unsigned short parse_tables_n) +{ + parse_tables_n >>= 9; + parse_tables_n += 11; +#pragma GCC novector + while (i < N && parse_tables_n--) +table[i++] = 0; + + return table[NEEDLE]; +} + +int main () +{ + check_vect (); + +#pragma GCC novector + for (int j = 0; j < N; j++) +table[j] = -1; + + if (foo (0, 0x) != 0) +__builtin_abort (); + + return 0; +} diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13 
100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -320,12 +320,9 @@ replace_loop_annotate (void)
 
   for (auto loop : loops_list (cfun, 0))
     {
-      /* First look into the header.  */
-      replace_loop_annotate_in_block (loop->header, loop);
-
-      /* Then look into the latch, if any.  */
-      if (loop->latch)
-	replace_loop_annotate_in_block (loop->latch, loop);
+      /* Check all exit source blocks for annotations.  */
+      for (auto e : get_loop_exit_edges (loop))
+	replace_loop_annotate_in_block (e->src, loop);
 
       /* Push the global flag_finite_loops state down to individual
	 loops.  */
       loop->finite_p = flag_finite_loops;
[PATCH]middle-end: inspect all exits for additional annotations for loop.
Hi All,

Attaching a pragma to a loop which has a complex condition often gets the pragma dropped. e.g.

#pragma GCC novector
  while (i < N && parse_tables_n--)

Before lowering this is represented as:

  if (ANNOTATE_EXPR ) ...

But after lowering the condition is broken apart and attached to the final component of the expression:

  if (parse_tables_n.2_2 != 0) goto ; else goto ;
  :
  iftmp.1D.4452 = 1;
  goto ;
  :
  iftmp.1D.4452 = 0;
  :
  D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0);
  if (D.4451 != 0) goto ; else goto ;
  :

and it's never heard from again, because during replace_loop_annotate we only inspect the loop header and latch for annotations.

Since annotations were supposed to apply to the loop as a whole, this fixes it by also checking the loop exit src blocks for annotations.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for
	annotations.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-novect_gcond.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c new file mode 100644 index ..01e69cbef9d51b234c08a400c78dc078d53252f1 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c @@ -0,0 +1,39 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break_hw } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" + +#define N 306 +#define NEEDLE 136 + +int table[N]; + +__attribute__ ((noipa)) +int foo (int i, unsigned short parse_tables_n) +{ + parse_tables_n >>= 9; + parse_tables_n += 11; +#pragma GCC novector + while (i < N && parse_tables_n--) +table[i++] = 0; + + return table[NEEDLE]; +} + +int main () +{ + check_vect (); + +#pragma GCC novector + for (int j = 0; j < N; j++) +table[j] = -1; + + if (foo (0, 0x) != 0) +__builtin_abort (); + + return 0; +} diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc 100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -327,6 +327,10 @@ replace_loop_annotate (void) if (loop->latch) replace_loop_annotate_in_block (loop->latch, loop); + /* Then also check all other exits. */ + for (auto e : get_loop_exit_edges (loop)) + replace_loop_annotate_in_block (e->src, loop); + /* Push the global flag_finite_loops state down to individual loops. 
*/ loop->finite_p = flag_finite_loops; }

--
[PATCH]middle-end: update vector loop upper bounds when early break vect [PR113734]
Hi All,

When doing early break vectorization we should treat the final iteration as possibly being partial. This is so that when we calculate the vector loop upper bounds we take into account that the final iteration could have done some work.

The attached testcase shows that if we don't, then cunroll may unroll the loop and if the upper bound is wrong we lose a vector iteration.

This is similar to how we adjust the scalar loop bounds for the PEELED case.

Bootstrapped Regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu, no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/113734
	* tree-vect-loop.cc (vect_transform_loop): Treat the final iteration of
	an early break loop as partial.

gcc/testsuite/ChangeLog:

	PR tree-optimization/113734
	* gcc.dg/vect/vect-early-break_117-pr113734.c: New test.

--- inline copy of patch --

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c
new file mode 100644
index ..36ae09483dfd426f977a3d92cf24a78d76de6961
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c
@@ -0,0 +1,37 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+  while (i < N && parse_tables_n--)
+    table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+  for (int j = 0; j < N; j++)
+    table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+    __builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 854e9d78bc71721e6559a6bc5dff78c813603a78..0b1656fef2fed83f30295846c382ad9fb318454a 100644
---
a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12171,7 +12171,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* True if the final iteration might not handle a full vector's worth of
      scalar iterations.  */
   bool final_iter_may_be_partial
-    = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo);
+    = LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This is
      1 when peeling for gaps because we always need a final scalar
      iteration.  */

--
RE: [PATCH]middle-end: add two debug counters for early-break vectorization debugging
> -Original Message- > From: Richard Biener > Sent: Thursday, February 8, 2024 2:16 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: add two debug counters for early-break > vectorization debugging > > On Thu, 8 Feb 2024, Tamar Christina wrote: > > > Hi All, > > > > This adds two new debug counter to aid in debugging early break code. > > > > - vect_force_last_exit: when reached will always force the final loop exit. > > - vect_skip_exit: when reached will skip selecting the current candidate > > exit > > as the loop exit. > > > > The first counter essentially allows you to turn off the PEELED case and the > > second counter to pick a different exit, which may mean you pick no exit at > > all. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > * dbgcnt.def (vect_force_last_exit, vect_skip_exit): New. > > * tree-vect-loop.cc (vec_init_loop_exit_info): Use them. > > > > --- inline copy of patch -- > > diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def > > index > ed9f062eac2c28c52df76b39d4312dd9fde1c800..8f7bebf93fceabdf6ae86c2df5 > 91eae4848b8a5c 100644 > > --- a/gcc/dbgcnt.def > > +++ b/gcc/dbgcnt.def > > @@ -213,5 +213,7 @@ DEBUG_COUNTER (stv_conversion) > > DEBUG_COUNTER (tail_call) > > DEBUG_COUNTER (tree_sra) > > DEBUG_COUNTER (treepre_insert) > > +DEBUG_COUNTER (vect_force_last_exit) > > DEBUG_COUNTER (vect_loop) > > +DEBUG_COUNTER (vect_skip_exit) > > DEBUG_COUNTER (vect_slp) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index > eed2268e9bae7e7ad36d13da03e0b54eab26ef6f..854e9d78bc71721e6559a6bc > 5dff78c813603a78 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -58,6 +58,7 @@ along with GCC; see the file COPYING3. 
If not see > > #include "tree-eh.h" > > #include "case-cfn-macros.h" > > #include "langhooks.h" > > +#include "dbgcnt.h" > > > > /* Loop Vectorization Pass. > > > > @@ -977,6 +978,20 @@ vec_init_loop_exit_info (class loop *loop) > >if (exits.length () == 1) > > return exits[0]; > > > > + /* Check to see if we've been asked to force the last exit. */ > > + if (!dbg_cnt (vect_force_last_exit)) > > +{ > > + basic_block bb = ip_normal_pos (loop); > > + if (!bb) > > + return NULL; > > + > > + edge exit = EDGE_SUCC (bb, 0); > > + if (exit->dest == loop->latch) > > + return EDGE_SUCC (bb, 1); > > + > > + return exit; > > Err, that's quite odd. Why not just below do > > > +} > > + > >/* If we have multiple exits we only support counting IV at the moment. > > Analyze all exits and return the last one we can analyze. */ > >class tree_niter_desc niter_desc; > > @@ -998,6 +1013,7 @@ vec_init_loop_exit_info (class loop *loop) > >&& exit->src == single_pred (loop->latch) > >&& (integer_nonzerop (may_be_zero) > >|| COMPARISON_CLASS_P (may_be_zero > > + && dbg_cnt (vect_skip_exit) > > && (dbg_cnt (vect_force_last_exit) > || exit->src == single_pred (loop->latch)) > > (also computed above already)? It's also oddly named, it's more like > vect_allow_peeled_exit or so. Because this isn't deterministic. If a loop has n exits the above always forces you to pick the final one regardless of n, rather than just skip consideration of an exit. And in that case is there a point in analyzing all the exits just to throw away the information? Doing in inside the consideration check would only skip one exit unless I'm misunderstanding. > > It's also seemingly redundant with vect_skip_exit, no? > > Note the counter gets incremented even if we'd not consider the exit > because we have a later candidate already. > > I fear it's going to be quite random even with the debug counter. It is, I think the first counter is more useful. 
But in general, the reason I kept the second counter (which kinda does what was suggested in the RFC I sent before) was that it should in theory at least allow us to test forcing of a PEELED case, since we generally prefer the non-PEELED case if possible. At least that was the intention.

Thanks,
Tamar

> Can you see whether it really helps you?
>
> >   && (!candidate
> > 	 || dominated_by_p (CDI_DOMINATORS, exit->src, candidate->src)))
>
> --
> Richard Biener
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
[PATCH]middle-end: add two debug counters for early-break vectorization debugging
Hi All,

This adds two new debug counters to aid in debugging early break code.

- vect_force_last_exit: when reached will always force the final loop exit.
- vect_skip_exit: when reached will skip selecting the current candidate exit
  as the loop exit.

The first counter essentially allows you to turn off the PEELED case, and the second counter lets you pick a different exit, which may mean you pick no exit at all.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* dbgcnt.def (vect_force_last_exit, vect_skip_exit): New.
	* tree-vect-loop.cc (vec_init_loop_exit_info): Use them.

--- inline copy of patch --

diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index ed9f062eac2c28c52df76b39d4312dd9fde1c800..8f7bebf93fceabdf6ae86c2df591eae4848b8a5c 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -213,5 +213,7 @@ DEBUG_COUNTER (stv_conversion)
 DEBUG_COUNTER (tail_call)
 DEBUG_COUNTER (tree_sra)
 DEBUG_COUNTER (treepre_insert)
+DEBUG_COUNTER (vect_force_last_exit)
 DEBUG_COUNTER (vect_loop)
+DEBUG_COUNTER (vect_skip_exit)
 DEBUG_COUNTER (vect_slp)
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index eed2268e9bae7e7ad36d13da03e0b54eab26ef6f..854e9d78bc71721e6559a6bc5dff78c813603a78 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -58,6 +58,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "case-cfn-macros.h"
 #include "langhooks.h"
+#include "dbgcnt.h"
 
 /* Loop Vectorization Pass.
 
@@ -977,6 +978,20 @@ vec_init_loop_exit_info (class loop *loop)
   if (exits.length () == 1)
     return exits[0];
 
+  /* Check to see if we've been asked to force the last exit.  */
+  if (!dbg_cnt (vect_force_last_exit))
+    {
+      basic_block bb = ip_normal_pos (loop);
+      if (!bb)
+	return NULL;
+
+      edge exit = EDGE_SUCC (bb, 0);
+      if (exit->dest == loop->latch)
+	return EDGE_SUCC (bb, 1);
+
+      return exit;
+    }
+
   /* If we have multiple exits we only support counting IV at the moment.
Analyze all exits and return the last one we can analyze. */ class tree_niter_desc niter_desc; @@ -998,6 +1013,7 @@ vec_init_loop_exit_info (class loop *loop) && exit->src == single_pred (loop->latch) && (integer_nonzerop (may_be_zero) || COMPARISON_CLASS_P (may_be_zero + && dbg_cnt (vect_skip_exit) && (!candidate || dominated_by_p (CDI_DOMINATORS, exit->src, candidate->src)))
RE: [PATCH]middle-end: don't cache restart_loop in vectorizable_live_operations [PR113808]
> Please either drop lastprivate(k) clause or use linear(k:1) > The iteration var of simd loop without collapse or with > collapse(1) is implicitly linear with the step, and even linear > means the value from the last iteration can be used after the > simd construct. Overriding the data sharing to something different > has been only added recently to OpenMP and isn't really needed here. > Sorry, I know very little about Fortran; is this ok? Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/113808 * gfortran.dg/vect/vect-early-break_1-PR113808.f90: Moved to... * gfortran.dg/vect/vect-early-break_1-pr113808.f90: ...here. --- inline copy of patch --- diff --git a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90 similarity index 93% rename from gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 rename to gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90 index 5c339fa7a348fac5527bbbf456a535da96b5c1ed..6f92e9095bdee08a5a9db2816f57da6c14d91b11 100644 --- a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 +++ b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90 @@ -9,7 +9,7 @@ program main integer :: n, i,k n = 11 do i = 1, n,2 -!$omp simd lastprivate(k) +!$omp simd do k = 1, i + 41 if (k > 11 + 41 .or. k < 1) error stop end do
[PATCH]middle-end: don't cache restart_loop in vectorizable_live_operations [PR113808]
Hi All, There's a bug in vectorizable_live_operation where restart_loop is defined outside the loop. This variable is supposed to indicate whether we are doing a first or last index reduction. The problem is that by defining it outside the loop it becomes dependent on the order we visit the USE/DEFs. In the given example, the loop isn't PEELED, but we visit the early exit uses first. This then sets the boolean to true and it can't get back to false again. So when we visit the main exit we still treat it as an early exit for that SSA name. This cleans it up and renames the variables to something that hopefully makes their intention clearer. Bootstrapped Regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113808 * tree-vect-loop.cc (vectorizable_live_operation): Don't cache the value cross iterations. gcc/testsuite/ChangeLog: PR tree-optimization/113808 * gfortran.dg/vect/vect-early-break_1-PR113808.f90: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 new file mode 100644 index ..5c339fa7a348fac5527bbbf456a535da96b5c1ed --- /dev/null +++ b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 @@ -0,0 +1,21 @@ +! { dg-add-options vect_early_break } +! { dg-require-effective-target vect_early_break } +! { dg-require-effective-target vect_long_long } +! { dg-additional-options "-fopenmp-simd" } + +! { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } + +program main + integer :: n, i,k + n = 11 + do i = 1, n,2 +!$omp simd lastprivate(k) +do k = 1, i + 41 + if (k > 11 + 41 .or.
k < 1) error stop +end do + end do + if (k /= 53) then +print *, k, 53 +error stop + endif +end diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 190df9ec7741fd05aa0b9abe150baf06b2ca9a57..eed2268e9bae7e7ad36d13da03e0b54eab26ef6f 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -10950,7 +10950,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, did. For the live values we want the value at the start of the iteration rather than at the end. */ edge main_e = LOOP_VINFO_IV_EXIT (loop_vinfo); - bool restart_loop = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo); + bool all_exits_as_early_p = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo); FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs) if (!is_gimple_debug (use_stmt) && !flow_bb_inside_loop_p (loop, gimple_bb (use_stmt))) @@ -10966,8 +10966,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, /* For early exit where the exit is not in the BB that leads to the latch then we're restarting the iteration in the scalar loop. So get the first live value. */ - restart_loop = restart_loop || !main_exit_edge; - if (restart_loop + if ((all_exits_as_early_p || !main_exit_edge) && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def) { tmp_vec_lhs = vec_lhs0;
[PATCH][committed]middle-end: fix pointer conversion error in testcase vect-early-break_110-pr113467.c
Hi All, I had missed a conversion from unsigned long to uint64_t. This fixes the failing test on -m32. Regtested on x86_64-pc-linux-gnu with -m32 and no issues. Committed as obvious. Thanks, Tamar gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-early-break_110-pr113467.c: Change unsigned long * to uint64_t *. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c index 1e2c47be5fdf1e1fed88e4b5f45d7eda6c3b85d1..12d0ea1e871b51742c040c909ea5741bc820206e 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c @@ -10,7 +10,7 @@ typedef struct gcry_mpi *gcry_mpi_t; struct gcry_mpi { int nlimbs; - unsigned long *d; + uint64_t *d; }; long gcry_mpi_add_ui_up;
RE: [PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]
> It looks like LOOP_VINFO_EARLY_BRK_STORES is "reverse"? Is that > why you are doing gsi_move_before + gsi_prev? Why do gsi_prev > at all? > As discussed on IRC, then how about this one. Incremental building passed all tests and bootstrap is running. Ok for master if bootstrap and regtesting clean? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113731 * gimple-iterator.cc (gsi_move_before): Take new parameter for update method. * gimple-iterator.h (gsi_move_before): Default new param to GSI_SAME_STMT. * tree-vect-loop.cc (move_early_exit_stmts): Call gsi_move_before with GSI_NEW_STMT. gcc/testsuite/ChangeLog: PR tree-optimization/113731 * gcc.dg/vect/vect-early-break_111-pr113731.c: New test. --- inline copy of patch --- diff --git a/gcc/gimple-iterator.cc b/gcc/gimple-iterator.cc index 517c53376f0511af59e124f52ec7be566a6c4789..f67bcfbfdfdd7c6cb0ad0130972f5b1dc4429bcf 100644 --- a/gcc/gimple-iterator.cc +++ b/gcc/gimple-iterator.cc @@ -666,10 +666,11 @@ gsi_move_after (gimple_stmt_iterator *from, gimple_stmt_iterator *to) /* Move the statement at FROM so it comes right before the statement - at TO. */ + at TO using method M. */ void -gsi_move_before (gimple_stmt_iterator *from, gimple_stmt_iterator *to) +gsi_move_before (gimple_stmt_iterator *from, gimple_stmt_iterator *to, +gsi_iterator_update m = GSI_SAME_STMT) { gimple *stmt = gsi_stmt (*from); gsi_remove (from, false); @@ -677,7 +678,7 @@ gsi_move_before (gimple_stmt_iterator *from, gimple_stmt_iterator *to) /* For consistency with gsi_move_after, it might be better to have GSI_NEW_STMT here; however, that breaks several places that expect that TO does not change. 
*/ - gsi_insert_before (to, stmt, GSI_SAME_STMT); + gsi_insert_before (to, stmt, m); } diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c new file mode 100644 index ..2d6db91df97625a7f11609d034e89af0461129b2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +char* inet_net_pton_ipv4_bits; +char inet_net_pton_ipv4_odst; +void __errno_location(); +void inet_net_pton_ipv4(); +void inet_net_pton() { inet_net_pton_ipv4(); } +void inet_net_pton_ipv4(char *dst, int size) { + while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) { +if (size-- <= 0) + goto emsgsize; +*dst++ = '\0'; + } +emsgsize: + __errno_location(); +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 30b90d99925bea74caf14833d8ab1695607d0fe9..9aba94bd6ca2061a19487ac4a2735a16d03bcbee 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -11800,8 +11800,7 @@ move_early_exit_stmts (loop_vec_info loop_vinfo) dump_printf_loc (MSG_NOTE, vect_location, "moving stmt %G", stmt); gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt); - gsi_move_before (&stmt_gsi, &dest_gsi); - gsi_prev (&dest_gsi); + gsi_move_before (&stmt_gsi, &dest_gsi, GSI_NEW_STMT);
RE: [PATCH]middle-end: add additional runtime test for [PR113467]
> > Ok for master? > > I think you need a lp64 target check for the large constants or > alternatively use uint64_t? > Ok, how about this one. Regtested on x86_64-pc-linux-gnu with -m32,-m64 and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/113467 * gcc.dg/vect/vect-early-break_110-pr113467.c: New test. --- inline copy of patch --- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c new file mode 100644 index ..1e2c47be5fdf1e1fed88e4b5f45d7eda6c3b85d1 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c @@ -0,0 +1,52 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_long_long } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" +#include <stdint.h> + +typedef struct gcry_mpi *gcry_mpi_t; +struct gcry_mpi { + int nlimbs; + unsigned long *d; +}; + +long gcry_mpi_add_ui_up; +void gcry_mpi_add_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned v) { + gcry_mpi_add_ui_up = *w->d; + if (u) { +uint64_t *res_ptr = w->d, *s1_ptr = w->d; +int s1_size = u->nlimbs; +unsigned s2_limb = v, x = *s1_ptr++; +s2_limb += x; +*res_ptr++ = s2_limb; +if (x) + while (--s1_size) { +x = *s1_ptr++ + 1; +*res_ptr++ = x; +if (x) { + break; +} + } + } +} + +int main() +{ + check_vect (); + + static struct gcry_mpi sv; + static uint64_t vals[] = {4294967288ULL, 191ULL,4160749568ULL, 4294963263ULL, +127ULL,4294950912ULL, 255ULL, 4294901760ULL, +534781951ULL, 33546240ULL, 4294967292ULL, 4294960127ULL, +4292872191ULL, 4294967295ULL, 4294443007ULL, 3ULL}; + gcry_mpi_t v = &sv; + v->nlimbs = 16; + v->d = vals; + + gcry_mpi_add_ui(v, v, 8); + if (v->d[1] != 192) +__builtin_abort(); +}
RE: [PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]
> -Original Message- > From: Richard Biener > Sent: Monday, February 5, 2024 1:22 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: fix ICE when moving statements to empty BB > [PR113731] > > On Mon, 5 Feb 2024, Tamar Christina wrote: > > > Hi All, > > > > We use gsi_move_before (&stmt_gsi, &dest_gsi); to request that the new > statement > > be placed before any other statement. Typically this then moves the current > > pointer to be after the statement we just inserted. > > > > However it looks like when the BB is empty, this does not happen and the CUR > > pointer stays NULL. There's a comment in the source of gsi_insert_before > > that > > explains: > > > > /* If CUR is NULL, we link at the end of the sequence (this case happens > > > > so it adds it to the end instead of start like you asked. This means that > > in > > this case there's nothing to move and so we shouldn't move the pointer if > > we're > > already at the HEAD. > > The issue is that a gsi_end_p () is ambiguous, it could be the start > or the end. gsi_insert_before treats it as "end" while gsi_insert_after > treats it as "start" since you can't really insert "after" the "end". > > gsi_move_before doesn't update the insertion pointer (using > GSI_SAME_STMT), so with a gsi_end_p () you get what you ask for. > > Btw, > > /* Move all stmts that need moving. */ > basic_block dest_bb = LOOP_VINFO_EARLY_BRK_DEST_BB (loop_vinfo); > gimple_stmt_iterator dest_gsi = gsi_start_bb (dest_bb); > > should probably use gsi_after_labels (dest_bb) just in case. See next patch. > > It looks like LOOP_VINFO_EARLY_BRK_STORES is "reverse"? Is that > why you are doing gsi_move_before + gsi_prev? Why do gsi_prev > at all? > Yes, it stores them in reverse because we record them from the latch on up. So we either have to iterate backwards, insert them to the front or move gsi. I guess I could remove it by removing the for-each loop and iterating in reverse.
Is that preferred? Tamar. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR tree-optimization/113731 > > * tree-vect-loop.cc (move_early_exit_stmts): Conditionally move pointer. > > > > gcc/testsuite/ChangeLog: > > > > PR tree-optimization/113731 > > * gcc.dg/vect/vect-early-break_111-pr113731.c: New test. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c > > new file mode 100644 > > index > ..2d6db91df97625a7f1160 > 9d034e89af0461129b2 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c > > @@ -0,0 +1,21 @@ > > +/* { dg-do compile } */ > > +/* { dg-add-options vect_early_break } */ > > +/* { dg-require-effective-target vect_early_break } */ > > +/* { dg-require-effective-target vect_int } */ > > + > > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ > > + > > +char* inet_net_pton_ipv4_bits; > > +char inet_net_pton_ipv4_odst; > > +void __errno_location(); > > +void inet_net_pton_ipv4(); > > +void inet_net_pton() { inet_net_pton_ipv4(); } > > +void inet_net_pton_ipv4(char *dst, int size) { > > + while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) { > > +if (size-- <= 0) > > + goto emsgsize; > > +*dst++ = '\0'; > > + } > > +emsgsize: > > + __errno_location(); > > +} > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index > 30b90d99925bea74caf14833d8ab1695607d0fe9..e2587315020a35a7d4ebd3e > 7a9842caa36bb5d3c 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -11801,7 +11801,8 @@ move_early_exit_stmts (loop_vec_info loop_vinfo) > > > >gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt); > >gsi_move_before (&stmt_gsi, &dest_gsi); > > - gsi_prev (&dest_gsi); > > + if (!gsi_end_p (dest_gsi)) > > + gsi_prev (&dest_gsi); > > } > > > >/* Update all the stmts with their
new reaching VUSES. */ > > > > > > > > > > > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
[PATCH]middle-end: fix ICE when destination BB for stores starts with a label [PR113750]
Hi All, The report shows that if the FE leaves a label as the first thing in the dest BB then we ICE because we move the stores before the label. This is easy to fix if we know that there's still only one way into the BB. We would have already rejected the loop if there were multiple paths into the BB; however, I added an additional check (with an explanation) just for early break in case the other constraints are relaxed later. After that we fix the issue just by getting the GSI after the labels, and I add a bunch of testcases for different positions the label can be added. Only the vect-early-break_112-pr113750.c one results in the label being kept. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113750 * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Check for single predecessor when doing early break vect. * tree-vect-loop.cc (move_early_exit_stmts): Get gsi at the start but after labels. gcc/testsuite/ChangeLog: PR tree-optimization/113750 * gcc.dg/vect/vect-early-break_112-pr113750.c: New test. * gcc.dg/vect/vect-early-break_113-pr113750.c: New test. * gcc.dg/vect/vect-early-break_114-pr113750.c: New test. * gcc.dg/vect/vect-early-break_115-pr113750.c: New test. * gcc.dg/vect/vect-early-break_116-pr113750.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c new file mode 100644 index ..559ebd84d5c39881e694e7c8c31be29d846866ed --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i] != x) + break; +foo: + vect_a[i] = x; + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c new file mode 100644 index ..ba85780a46b1378aaec238ff9eb5f906be9a44dd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; + if (vect_a[i] != x) + break; + vect_a[i] = x; +foo: + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c new file mode 100644 index ..37af2998688f5d60e2cdb372ab43afcaa52a3146 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { 
dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { + vect_b[i] = x + i; +foo: + if (vect_a[i] != x) + break; + vect_a[i] = x; + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c new file mode 100644 index ..502686d308e298cd84e9e3b74d7b4ad1979602a9 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c @@ -0,0 +1,26 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +#ifndef N +#define N 800 +#endif +unsigned vect_a[N]; +unsigned vect_b[N]; + +unsigned test4(unsigned x) +{ + unsigned ret = 0; + for (int i = 0; i < N; i++) + { +foo: + vect_b[i] = x + i; + if (vect_a[i] != x) + break; + vect_a[i] = x; + } + return ret; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_116-pr113750.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_116-pr113750.c new file mode
[PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]
Hi All, We use gsi_move_before (&stmt_gsi, &dest_gsi); to request that the new statement be placed before any other statement. Typically this then moves the current pointer to be after the statement we just inserted. However it looks like when the BB is empty, this does not happen and the CUR pointer stays NULL. There's a comment in the source of gsi_insert_before that explains: /* If CUR is NULL, we link at the end of the sequence (this case happens so it adds it to the end instead of start like you asked. This means that in this case there's nothing to move and so we shouldn't move the pointer if we're already at the HEAD. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113731 * tree-vect-loop.cc (move_early_exit_stmts): Conditionally move pointer. gcc/testsuite/ChangeLog: PR tree-optimization/113731 * gcc.dg/vect/vect-early-break_111-pr113731.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c new file mode 100644 index ..2d6db91df97625a7f11609d034e89af0461129b2 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +char* inet_net_pton_ipv4_bits; +char inet_net_pton_ipv4_odst; +void __errno_location(); +void inet_net_pton_ipv4(); +void inet_net_pton() { inet_net_pton_ipv4(); } +void inet_net_pton_ipv4(char *dst, int size) { + while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) { +if (size-- <= 0) + goto emsgsize; +*dst++ = '\0'; + } +emsgsize: + __errno_location(); +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index
30b90d99925bea74caf14833d8ab1695607d0fe9..e2587315020a35a7d4ebd3e7a9842caa36bb5d3c 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -11801,7 +11801,8 @@ move_early_exit_stmts (loop_vec_info loop_vinfo) gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt); gsi_move_before (&stmt_gsi, &dest_gsi); - gsi_prev (&dest_gsi); + if (!gsi_end_p (dest_gsi)) + gsi_prev (&dest_gsi); } /* Update all the stmts with their new reaching VUSES. */
[PATCH]middle-end: add additional runtime test for [PR113467]
Hi All, This just adds an additional runtime testcase for the fixed issue. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/113467 * gcc.dg/vect/vect-early-break_110-pr113467.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c new file mode 100644 index ..2d8a071c0e922ccfd5fa8c7b2704852dbd95 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c @@ -0,0 +1,51 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" + +typedef struct gcry_mpi *gcry_mpi_t; +struct gcry_mpi { + int nlimbs; + unsigned long *d; +}; + +long gcry_mpi_add_ui_up; +void gcry_mpi_add_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned v) { + gcry_mpi_add_ui_up = *w->d; + if (u) { +unsigned long *res_ptr = w->d, *s1_ptr = w->d; +int s1_size = u->nlimbs; +unsigned s2_limb = v, x = *s1_ptr++; +s2_limb += x; +*res_ptr++ = s2_limb; +if (x) + while (--s1_size) { +x = *s1_ptr++ + 1; +*res_ptr++ = x; +if (x) { + break; +} + } + } +} + +int main() +{ + check_vect (); + + static struct gcry_mpi sv; + static unsigned long vals[] = {4294967288, 191,4160749568, 4294963263, + 127,4294950912, 255, 4294901760, + 534781951, 33546240, 4294967292, 4294960127, + 4292872191, 4294967295, 4294443007, 3}; + gcry_mpi_t v = &sv; + v->nlimbs = 16; + v->d = vals; + + gcry_mpi_add_ui(v, v, 8); + if (v->d[1] != 192) +__builtin_abort(); +}
RE: [PATCH]middle-end: check memory accesses in the destination block [PR113588].
> > > > If the above is correct then I think I understand what you're saying and > > will update the patch and do some checks. > > Yes, I think that's what I wanted to say. > As discussed: Bootstrapped Regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu no issues. Also checked both with --enable-lto --with-build-config='bootstrap-O3 bootstrap-lto' --enable-multilib and --enable-lto --with-build-config=bootstrap-O3 --enable-checking=release,yes,rtl,extra; and checked the libgcrypt testsuite as reported on PR113467. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113588 PR tree-optimization/113467 * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence): Choose correct dest and fix checks. (vect_analyze_early_break_dependences): Update comments. gcc/testsuite/ChangeLog: PR tree-optimization/113588 PR tree-optimization/113467 * gcc.dg/vect/vect-early-break_108-pr113588.c: New test. * gcc.dg/vect/vect-early-break_109-pr113588.c: New test. --- inline copy of patch --- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c new file mode 100644 index ..e488619c9aac41fafbcf479818392a6bb7c6924f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +int foo (const char *s, unsigned long n) +{ + unsigned long len = 0; + while (*s++ && n--) + ++len; + return len; +} + diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c new file mode 100644 index ..488c19d3ede809631d1a7ede0e7f7bcdc7a1ae43 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c @@ -0,0 +1,44 @@ +/* { dg-add-options vect_early_break } */ +/* {
dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target mmap } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include +#include + +#include "tree-vect.h" + +__attribute__((noipa)) +int foo (const char *s, unsigned long n) +{ + unsigned long len = 0; + while (*s++ && n--) + ++len; + return len; +} + +int main() +{ + + check_vect (); + + long pgsz = sysconf (_SC_PAGESIZE); + void *p = mmap (NULL, pgsz * 3, PROT_READ|PROT_WRITE, + MAP_ANONYMOUS|MAP_PRIVATE, 0, 0); + if (p == MAP_FAILED) +return 0; + mprotect (p, pgsz, PROT_NONE); + mprotect (p+2*pgsz, pgsz, PROT_NONE); + char *p1 = p + pgsz; + p1[0] = 1; + p1[1] = 0; + foo (p1, 1000); + p1 = p + 2*pgsz - 2; + p1[0] = 1; + p1[1] = 0; + foo (p1, 1000); + return 0; +} + diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index f592aeb8028afd4fd70e2175104efab2a2c0d82e..53fdfc25d7dc2deb7788176252697d2e45fc 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -619,10 +619,10 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, return opt_result::success (); } -/* Funcion vect_analyze_early_break_dependences. +/* Function vect_analyze_early_break_dependences. - Examime all the data references in the loop and make sure that if we have - mulitple exits that we are able to safely move stores such that they become + Examine all the data references in the loop and make sure that if we have + multiple exits that we are able to safely move stores such that they become safe for vectorization. The function also calculates the place where to move the instructions to and computes what the new vUSE chain should be. @@ -639,7 +639,7 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, - Multiple loads are allowed as long as they don't alias. NOTE: - This implemementation is very conservative. 
Any overlappig loads/stores + This implementation is very conservative. Any overlapping loads/stores that take place before the early break statement gets rejected aside from WAR dependencies. @@ -668,7 +668,6 @@ vect_analyze_early_break_dependences (loop_vec_info loop_vinfo) auto_vec bases; basic_block dest_bb = NULL; - hash_set visited; class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); class loop *loop_nest = loop_outer (loop); @@ -677,19 +676,33 @@ vect_analyze_early_break_dependences (loop_vec_info loop_vinfo) "loop contains multiple exits, analyzing" " statement dependencies.\n"); + if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)) +if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, +
RE: [PATCH]AArch64: update vget_set_lane_1.c test output
> -Original Message- > From: Richard Sandiford > Sent: Thursday, February 1, 2024 2:24 PM > To: Andrew Pinski > Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd > ; Richard Earnshaw ; Marcus > Shawcroft ; Kyrylo Tkachov > > Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output > > Andrew Pinski writes: > > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina > wrote: > >> > >> Hi All, > >> > >> In the vget_set_lane_1.c test the following entries now generate a zip1 > >> instead > of an INS > >> > >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0) > >> BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0) > >> BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0) > >> > >> This is because the non-Q variant for indices 0 and 1 are just shuffling > >> values. > >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just > update the > >> test file. > > Hmm, is this true on all cores? I suspect there is a core out there > > where INS is implemented with a much lower latency than ZIP. > > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles > > while ZIP is 6 cycles (3/7 for q versions). > > Now I don't have any invested interest in that core any more but I > > just wanted to point out that is not exactly true for all cores. > > Thanks for the pointer. In that case, perhaps we should prefer > aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1? > That's enough to fix this failure, but it'll probably require other > tests to be adjusted... I think given that ThunderX is a 10-year-old micro-architecture with several cases where often-used instructions have very high latencies, generic codegen should not be blocked from progressing because of it. We use zips in many things, and if thunderx codegen is really of that much importance then I think the old codegen should be gated behind -mcpu=thunderx rather than preventing generic changes. Regards, Tamar. > > Richard
[PATCH]AArch64: update vget_set_lane_1.c test output
Hi All, In the vget_set_lane_1.c test the following entries now generate a zip1 instead of an INS BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0) BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0) BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0) This is because the non-Q variants for indices 0 and 1 are just shuffling values. There is no perf difference between INS SIMD to SIMD and ZIP, so just update the test file. Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: * gcc.target/aarch64/vget_set_lane_1.c: Update test output. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c index 07a77de319206c5c6dad1c0d2d9bcc998583f9c1..a3978f68e4ff5899f395a98615a5e86c3b1389cb 100644 --- a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c +++ b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c @@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2) BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0) BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0) BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0) -/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } } */ +/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */ BUILD_TEST (poly8x8_t, poly8x16_t, , q, p8, 7, 15) BUILD_TEST (int8x8_t, int8x16_t, , q, s8, 7, 15)
[PATCH 2/2][libsanitizer] hwasan: Remove testsuite check for a complaint message [PR112644]
Hi All, With recent updates to hwasan runtime libraries, the error reporting for this particular check has been reworked. I would question why it has lost this message. To me it looks strange that num_descriptions_printed is incremented whenever we call PrintHeapOrGlobalCandidate whether that function prints anything or not. (See PrintAddressDescription in libsanitizer/hwasan/hwasan_report.cpp). The message is no longer printed because we increment this num_descriptions_printed variable indicating that we have found some description. I would like to question this upstream, but it doesn't look like much of a problem, and if pressed for time we should just change our testsuite. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR sanitizer/112644 * c-c++-common/hwasan/hwasan-thread-clears-stack.c: Update testcase. --- inline copy of patch -- diff --git a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c index 09c72a56f0f50a8c301d89217aa8c7df70087e6c..6c70684d72a887c49b02ecb17ca097da81a9168f 100644 --- a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c +++ b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c @@ -52,5 +52,4 @@ main (int argc, char **argv) /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" } */ /* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: \[\[:xdigit:\]\]\[\[:xdigit:\]\]/00 \\(ptr/mem\\) in thread T0.*" } */ -/* { dg-output "HWAddressSanitizer can not describe address in more detail\..*" } */ /* { dg-output "SUMMARY: HWAddressSanitizer: tag-mismatch \[^\n\]*.*" } */
[PATCH 1/2][libsanitizer] hwasan: Remove testsuite check for a complaint message [PR112644]
Hi All, Recent libhwasan updates[1] intercept various string and memory functions. These functions have checking in them, which means there's no need to inline the checking. This patch marks said functions as intercepted, and adjusts a testcase to handle the difference. It also looks for HWASAN in a check in expand_builtin. This check is originally there to avoid using expand to inline the behaviour of builtins like memset which are intercepted by ASAN and for which we therefore rely on the call staying a function call. With the new reliance on function calls in HWASAN we need to do the same thing for HWASAN too. However, HWASAN and ASAN don't seem to instrument the same functions. Looking into libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc it looks like the common ones are memset, memmove and memcpy. The rest of the routines for asan seem to be defined in compiler-rt/lib/asan/asan_interceptors.h however compiler-rt/lib/hwasan/ does not have such a file but it does have compiler-rt/lib/hwasan/hwasan_platform_interceptors.h which appears to force off everything but memset, memmove, memcpy, memcmp and bcmp. As such I've taken those as the final list that hwasan currently supports. This also means that on future updates this list should be cross-checked. [1] https://discourse.llvm.org/t/hwasan-question-about-the-recent-interceptors-being-added/75351 Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR sanitizer/112644 * asan.h (asan_intercepted_p): Intercept memset, memmove, memcpy and memcmp. * builtins.cc (expand_builtin): Include HWASAN when checking for builtin inlining. gcc/testsuite/ChangeLog: PR sanitizer/112644 * c-c++-common/hwasan/builtin-special-handling.c: Update testcase.
Co-Authored-By: Matthew Malcomson --- inline copy of patch -- diff --git a/gcc/asan.h b/gcc/asan.h index 82811bdbe697665652aba89f2ee1c3ac07970df9..d1bf8b1e701b15525c6a900d324f2aebfb778cba 100644 --- a/gcc/asan.h +++ b/gcc/asan.h @@ -185,8 +185,13 @@ extern hash_set *asan_handled_variables; inline bool asan_intercepted_p (enum built_in_function fcode) { + /* This list should be kept up-to-date with upstream's version at + compiler-rt/lib/hwasan/hwasan_platform_interceptors.h. */ if (hwasan_sanitize_p ()) -return false; +return fcode == BUILT_IN_MEMCMP +|| fcode == BUILT_IN_MEMCPY +|| fcode == BUILT_IN_MEMMOVE +|| fcode == BUILT_IN_MEMSET; return fcode == BUILT_IN_INDEX || fcode == BUILT_IN_MEMCHR diff --git a/gcc/builtins.cc b/gcc/builtins.cc index a0bd82c7981c05caf2764de70c62fe83bef9ad29..12cc7a54e99555d0f4b21fa2cc32ffa7bb548f18 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -7792,7 +7792,8 @@ expand_builtin (tree exp, rtx target, rtx subtarget, machine_mode mode, default: break; } - if (sanitize_flags_p (SANITIZE_ADDRESS) && asan_intercepted_p (fcode)) + if (sanitize_flags_p (SANITIZE_ADDRESS | SANITIZE_HWADDRESS) + && asan_intercepted_p (fcode)) return expand_call (exp, target, ignore); /* When not optimizing, generate calls to library functions for a certain diff --git a/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c b/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c index a7a6d91693ae48c20f33ab28f28d27b01af4722c..f975b1cc397bc0d6fd475dbfed5ccc8ac386 100644 --- a/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c +++ b/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c @@ -8,24 +8,24 @@ /* { dg-skip-if "" { *-*-* } { "-flto" } { "-flto-partition=none" } } */ typedef __SIZE_TYPE__ size_t; -/* Functions to observe that HWASAN instruments memory builtins in the expected - manner. */ +/* HWASAN used to instrument calls to memset, memcpy, and memmove. It no + longer does this. 
Many other string and memory builtins are intercepted by + the runtime (and hence the codegen need not do anything). */ void * __attribute__((noinline)) memset_builtin (void *dest, int value, size_t len) { return __builtin_memset (dest, value, len); } -/* HWASAN avoids strlen because it doesn't know the size of the memory access - until *after* the function call. */ size_t __attribute__ ((noinline)) strlen_builtin (char *element) { return __builtin_strlen (element); } -/* First test ensures that the HWASAN_CHECK was emitted before the - memset. Second test ensures there was only HWASAN_CHECK (which demonstrates - that strlen was not instrumented). */ -/* { dg-final { scan-tree-dump-times "HWASAN_CHECK.*memset" 1 "asan1" } } */ -/* { dg-final { scan-tree-dump-times "HWASAN_CHECK" 1 "asan1" } } */ +/* First check here ensures there is no inline instrumentation generated for + these builtins. Second checks that we end up calling memset (i.e. that it's + not optimised
RE: [PATCH][libsanitizer]: Sync fixes for asan interceptors from upstream [PR112644]
> -Original Message- > From: Andrew Pinski > Sent: Monday, January 29, 2024 9:55 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; ja...@redhat.com; > do...@redhat.com; k...@google.com; dvyu...@google.com > Subject: Re: [PATCH][libsanitizer]: Sync fixes for asan interceptors from > upstream > [PR112644] > > On Mon, Jan 29, 2024 at 7:04 AM Tamar Christina > wrote: > > > > Hi All, > > > > This cherry-picks and squashes the differences between commits > > > > > d3e5c20ab846303874a2a25e5877c72271fc798b..76e1e45922e6709392fb82aa > c44bebe3dbc2ea63 > > from LLVM upstream from compiler-rt/lib/hwasan/ to GCC on the changes > relevant > > for GCC. > > > > This is required to fix the linked PR. > > > > As mentioned in the PR the last sync brought in a bug from upstream[1] where > > operations became non-recoverable and as such the tests in AArch64 started > > failing. This cherry picks the fix and there are minor updates needed to > > GCC > > after this to fix the cases. > > > > [1] https://github.com/llvm/llvm-project/pull/74000 > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master? > > Thanks for handling this; though I wonder how this slipped through > testing upstream in LLVM. I see they added some new testcases for > this. I know GCC's testsuite for sanitizer is slightly different from > LLVM's. Is it the case, GCC has more tests in this area? Is someone > adding the testcases that GCC has in this area upstream to LLVM; > basically so merging won't bring in regressions like this in the > future? There were two parts here. The first one is that their testsuite didn't have any test for the recovery case, which they've now added. But the second part (which I'm not posting patches for) is that the change in hwasan means that the runtime can now instrument some additional library methods which it couldn't before. And GCC now needs to not inline these anymore.
This does mean that on future updates one needs to take a look at the Instrumentation list and make sure to keep it in sync with GCC's otherwise we'll lose instrumentation. Regards, Tamar > > Thanks, > Andrew > > > > > Thanks, > > Tamar > > > > libsanitizer/ChangeLog: > > > > PR sanitizer/112644 > > * hwasan/hwasan_interceptors.cpp (ACCESS_MEMORY_RANGE, > > HWASAN_READ_RANGE, HWASAN_WRITE_RANGE, > COMMON_SYSCALL_PRE_READ_RANGE, > > COMMON_SYSCALL_PRE_WRITE_RANGE, > COMMON_INTERCEPTOR_WRITE_RANGE, > > COMMON_INTERCEPTOR_READ_RANGE): Make recoverable. > > > > --- inline copy of patch -- > > diff --git a/libsanitizer/hwasan/hwasan_interceptors.cpp > b/libsanitizer/hwasan/hwasan_interceptors.cpp > > index > d9237cf9b8e3bf982cf213123ef22e73ec027c9e..96df4dd0c24d7d3db28fa2557 > cf63da0f295e33f 100644 > > --- a/libsanitizer/hwasan/hwasan_interceptors.cpp > > +++ b/libsanitizer/hwasan/hwasan_interceptors.cpp > > @@ -36,16 +36,16 @@ struct HWAsanInterceptorContext { > >const char *interceptor_name; > > }; > > > > -# define ACCESS_MEMORY_RANGE(ctx, offset, size, access) > > \ > > -do { > > \ > > - __hwasan::CheckAddressSized > access>((uptr)offset, \ > > - size); > > \ > > +# define ACCESS_MEMORY_RANGE(offset, size, access) > >\ > > +do { > >\ > > + __hwasan::CheckAddressSized > access>((uptr)offset, \ > > +size); > >\ > > } while (0) > > > > -# define HWASAN_READ_RANGE(ctx, offset, size) \ > > -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Load) > > -# define HWASAN_WRITE_RANGE(ctx, offset, size) \ > > -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Store) > > +# define HWASAN_READ_RANGE(offset, size) \ > > +ACCESS_MEMORY_RANGE(offset, size, AccessType::Load) > > +# define HWASAN_WRITE_RANGE(offset, size) \ > > +ACCESS_MEMORY_RANGE(offset, size, AccessType::Store) > > > > # if !SANITIZER_APPLE > > #define HWASAN_INTERCEPT_FUNC(name) > > \ > > @@ -74,9 +74,8 @@ struct HWAsanInterceptorContext { > > > > # if HWASAN_WITH_INTERCEPTORS > > > > -#define COMMON_SYSC
RE: [PATCH]middle-end: check memory accesses in the destination block [PR113588].
> -Original Message- > From: Richard Biener > Sent: Tuesday, January 30, 2024 9:51 AM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: check memory accesses in the destination block > [PR113588]. > > On Mon, 29 Jan 2024, Tamar Christina wrote: > > > Hi All, > > > > When analyzing loads for early break it was always the intention that > > for the exit where things get moved to we only check the loads that can > > be reached from the condition. > > Looking at the code I'm a bit confused that we always move to > single_pred (loop->latch) - IIRC that was different at some point? > > Shouldn't we move stores after the last early exit condition instead? Yes it was changed during another PR fix. The rationale at that time didn't take into account the peeled case. It used to be that we would "search" for the exit to place it in. At that time the rationale was, well it doesn't make sense. It has to go in the block that is the last to be executed. With the non-peeled case it's always the one before the latch. Or put differently, I think the destination should be the main IV block. I am not quite sure I'm following why you want to put the peeled cases inside the latch block. Ah, is it because the latch block is always going to only be executed when you make a full iteration? That makes sense, but then I think we should also analyze the stores in all blocks (which your change maybe already does, let me check) since we're also lifting past the final block and need to update the vuses there too. If the above is correct then I think I understand what you're saying and will update the patch and do some checks. Thanks, Tamar > > In particular for the peeled case single_pred (loop->latch) is the > block with the actual early exit condition? So for that case we'd > need to move to the latch itself instead? For non-peeled we move > to the block with the IV condition which looks OK.
> > > However the main loop checks all loads and we skip the destination BB. > > As such we never actually check the loads reachable from the COND in the > > last BB unless this BB was also the exit chosen by the vectorizer. > > > > This leads us to incorrectly vectorize the loop in the PR and in doing so > > access > > out of bounds. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master? > > The patch ends up with a worklist and another confusing comment > > + /* For the destination BB we need to only analyze loads reachable from > the early > + break statement itself. */ > > But I think it's a downstream issue from the issue above. That said, > even for the non-peeled case we need to check ref_within_array_bound, > no? > > So what about re-doing that initial loop like the following instead > (and also fix dest_bb, but I'd like clarification here). Basically > walk all blocks, do the ref_within_array_bound first and only > after we've seen 'dest_bb' do the checks required for moving > stores for all upstream BBs. > > And dest_bb should be > > /* Move side-effects to the in-loop destination of the last early > exit. */ > if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)) > dest_bb = loop->latch; > else > dest_bb = single_pred (loop->latch); > > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc > index f592aeb8028..d6c8910dd6c 100644 > --- a/gcc/tree-vect-data-refs.cc > +++ b/gcc/tree-vect-data-refs.cc > @@ -668,7 +668,6 @@ vect_analyze_early_break_dependences (loop_vec_info > loop_vinfo) >auto_vec bases; >basic_block dest_bb = NULL; > > - hash_set visited; >class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); >class loop *loop_nest = loop_outer (loop); > > @@ -681,15 +680,11 @@ vect_analyze_early_break_dependences > (loop_vec_info loop_vinfo) > side-effects to is always the latch connected exit. When we support > general control flow we can do better but for now this is fine. 
*/ >dest_bb = single_pred (loop->latch); > - basic_block bb = dest_bb; > + basic_block bb = loop->latch; > + bool check_deps = false; > >do > { > - /* If the destination block is also the header then we have nothing to > do. */ > - if (!single_pred_p (bb)) > - continue; > - > - bb = single_pred (bb); >gimple_stmt_iterator gsi = gsi_last_bb (bb); > >/* Now analyze all the remaining statements and try to determine which > @@ -707,6 +702,25 @@ vect_analyze_early_break_dependences (loop_vec_info > loop_vi
[PATCH]middle-end: check memory accesses in the destination block [PR113588].
Hi All, When analyzing loads for early break it was always the intention that for the exit where things get moved to we only check the loads that can be reached from the condition. However the main loop checks all loads and we skip the destination BB. As such we never actually check the loads reachable from the COND in the last BB unless this BB was also the exit chosen by the vectorizer. This leads us to incorrectly vectorize the loop in the PR and in doing so access out of bounds. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113588 * tree-vect-data-refs.cc (vect_analyze_early_break_dependences_1): New. (vect_analyze_data_ref_dependence): Use it. (vect_analyze_early_break_dependences): Update comments. gcc/testsuite/ChangeLog: PR tree-optimization/113588 * gcc.dg/vect/vect-early-break_108-pr113588.c: New test. * gcc.dg/vect/vect-early-break_109-pr113588.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c new file mode 100644 index ..e488619c9aac41fafbcf479818392a6bb7c6924f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +int foo (const char *s, unsigned long n) +{ + unsigned long len = 0; + while (*s++ && n--) + ++len; + return len; +} + diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c new file mode 100644 index ..488c19d3ede809631d1a7ede0e7f7bcdc7a1ae43 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c @@ -0,0 +1,44 @@ +/* { dg-add-options vect_early_break } */ +/* { 
dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target mmap } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include +#include + +#include "tree-vect.h" + +__attribute__((noipa)) +int foo (const char *s, unsigned long n) +{ + unsigned long len = 0; + while (*s++ && n--) + ++len; + return len; +} + +int main() +{ + + check_vect (); + + long pgsz = sysconf (_SC_PAGESIZE); + void *p = mmap (NULL, pgsz * 3, PROT_READ|PROT_WRITE, + MAP_ANONYMOUS|MAP_PRIVATE, 0, 0); + if (p == MAP_FAILED) +return 0; + mprotect (p, pgsz, PROT_NONE); + mprotect (p+2*pgsz, pgsz, PROT_NONE); + char *p1 = p + pgsz; + p1[0] = 1; + p1[1] = 0; + foo (p1, 1000); + p1 = p + 2*pgsz - 2; + p1[0] = 1; + p1[1] = 0; + foo (p1, 1000); + return 0; +} + diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index f592aeb8028afd4fd70e2175104efab2a2c0d82e..52cef242a7ce5d0e525bff639fa1dc2f0a6f30b9 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -619,10 +619,69 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, return opt_result::success (); } -/* Funcion vect_analyze_early_break_dependences. +/* Function vect_analyze_early_break_dependences_1 - Examime all the data references in the loop and make sure that if we have - mulitple exits that we are able to safely move stores such that they become + Helper function of vect_analyze_early_break_dependences which performs safety + analysis for load operations in an early break. */ + +static opt_result +vect_analyze_early_break_dependences_1 (data_reference *dr_ref, gimple *stmt) +{ + /* We currently only support statically allocated objects due to + not having first-faulting loads support or peeling for + alignment support. Compute the size of the referenced object + (it could be dynamically allocated). 
*/ + tree obj = DR_BASE_ADDRESS (dr_ref); + if (!obj || TREE_CODE (obj) != ADDR_EXPR) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"early breaks only supported on statically" +" allocated objects.\n"); + return opt_result::failure_at (stmt, +"can't safely apply code motion to " +"dependencies of %G to vectorize " +"the early exit.\n", stmt); +} + + tree refop = TREE_OPERAND (obj, 0); + tree refbase = get_base_address (refop); + if (!refbase || !DECL_P (refbase) || !DECL_SIZE (refbase) + || TREE_CODE (DECL_SIZE (refbase)) != INTEGER_CST) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"early
[PATCH]AArch64: relax cbranch tests to accepted inverted branches [PR113502]
Hi All, Recently something in the mid-end started inverting the branches by inverting the condition and swapping the branch targets. While this is fine, it makes the output hard to test. In RTL I disable scheduling and BB reordering to prevent this. But in GIMPLE there seems to be nothing I can do. __builtin_expect seems to have no impact on the change since I suspect this is happening during expand where conditions can be flipped regardless of probability during compare_and_branch. Since the mid-end has plenty of correctness tests, this weakens the backend tests to just check that a correct-looking sequence is emitted. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR testsuite/113502 * gcc.target/aarch64/sve/vect-early-break-cbranch.c: Ignore exact branch. * gcc.target/aarch64/vect-early-break-cbranch.c: Likewise. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c index d15053553f94e7dce3540e21f0c1f0d39ea4f289..d7cef1105410be04ed67d1d3b800746267f205a8 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c @@ -9,7 +9,7 @@ int b[N] = {0}; ** ... ** cmpgt p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any \.L[0-9]+ +** b.(any|none)\.L[0-9]+ ** ... */ void f1 () @@ -26,7 +26,7 @@ void f1 () ** ... ** cmpge p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any \.L[0-9]+ +** b.(any|none)\.L[0-9]+ ** ... */ void f2 () @@ -43,7 +43,7 @@ void f2 () ** ... ** cmpeq p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any \.L[0-9]+ +** b.(any|none)\.L[0-9]+ ** ... */ void f3 () @@ -60,7 +60,7 @@ void f3 () ** ... ** cmpne p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any \.L[0-9]+ +** b.(any|none)\.L[0-9]+ ** ...
*/ void f4 () @@ -77,7 +77,7 @@ void f4 () ** ... ** cmplt p[0-9]+.s, p7/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any .L[0-9]+ +** b.(any|none).L[0-9]+ ** ... */ void f5 () @@ -94,7 +94,7 @@ void f5 () ** ... ** cmple p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0 ** ptest p[0-9]+, p[0-9]+.b -** b.any \.L[0-9]+ +** b.(any|none)\.L[0-9]+ ** ... */ void f6 () diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c b/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c index a5e7b94827dd70240d754a834f1d11750a9c27a9..673b781eb6d092f6311409797b20a971f4fae247 100644 --- a/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c +++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c @@ -15,7 +15,7 @@ int b[N] = {0}; ** cmgtv[0-9]+.4s, v[0-9]+.4s, #0 ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... */ void f1 () @@ -34,7 +34,7 @@ void f1 () ** cmgev[0-9]+.4s, v[0-9]+.4s, #0 ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... */ void f2 () @@ -53,7 +53,7 @@ void f2 () ** cmeqv[0-9]+.4s, v[0-9]+.4s, #0 ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... */ void f3 () @@ -72,7 +72,7 @@ void f3 () ** cmtst v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... */ void f4 () @@ -91,7 +91,7 @@ void f4 () ** cmltv[0-9]+.4s, v[0-9]+.4s, #0 ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... */ void f5 () @@ -110,7 +110,7 @@ void f5 () ** cmlev[0-9]+.4s, v[0-9]+.4s, #0 ** umaxp v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s ** fmovx[0-9]+, d[0-9]+ -** cbnzx[0-9]+, \.L[0-9]+ +** cbn?z x[0-9]+, \.L[0-9]+ ** ... 
*/ void f6 ()
[PATCH][libsanitizer]: Sync fixes for asan interceptors from upstream [PR112644]
Hi All, This cherry-picks and squashes the differences between commits d3e5c20ab846303874a2a25e5877c72271fc798b..76e1e45922e6709392fb82aac44bebe3dbc2ea63 from LLVM upstream from compiler-rt/lib/hwasan/ to GCC on the changes relevant for GCC. This is required to fix the linked PR. As mentioned in the PR the last sync brought in a bug from upstream[1] where operations became non-recoverable and as such the tests in AArch64 started failing. This cherry picks the fix and there are minor updates needed to GCC after this to fix the cases. [1] https://github.com/llvm/llvm-project/pull/74000 Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar libsanitizer/ChangeLog: PR sanitizer/112644 * hwasan/hwasan_interceptors.cpp (ACCESS_MEMORY_RANGE, HWASAN_READ_RANGE, HWASAN_WRITE_RANGE, COMMON_SYSCALL_PRE_READ_RANGE, COMMON_SYSCALL_PRE_WRITE_RANGE, COMMON_INTERCEPTOR_WRITE_RANGE, COMMON_INTERCEPTOR_READ_RANGE): Make recoverable. --- inline copy of patch -- diff --git a/libsanitizer/hwasan/hwasan_interceptors.cpp b/libsanitizer/hwasan/hwasan_interceptors.cpp index d9237cf9b8e3bf982cf213123ef22e73ec027c9e..96df4dd0c24d7d3db28fa2557cf63da0f295e33f 100644 --- a/libsanitizer/hwasan/hwasan_interceptors.cpp +++ b/libsanitizer/hwasan/hwasan_interceptors.cpp @@ -36,16 +36,16 @@ struct HWAsanInterceptorContext { const char *interceptor_name; }; -# define ACCESS_MEMORY_RANGE(ctx, offset, size, access)\ -do {\ - __hwasan::CheckAddressSized((uptr)offset, \ - size);\ +# define ACCESS_MEMORY_RANGE(offset, size, access) \ +do { \ + __hwasan::CheckAddressSized((uptr)offset, \ +size);\ } while (0) -# define HWASAN_READ_RANGE(ctx, offset, size) \ -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Load) -# define HWASAN_WRITE_RANGE(ctx, offset, size) \ -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Store) +# define HWASAN_READ_RANGE(offset, size) \ +ACCESS_MEMORY_RANGE(offset, size, AccessType::Load) +# define HWASAN_WRITE_RANGE(offset, size) \ 
+ACCESS_MEMORY_RANGE(offset, size, AccessType::Store) # if !SANITIZER_APPLE #define HWASAN_INTERCEPT_FUNC(name) \ @@ -74,9 +74,8 @@ struct HWAsanInterceptorContext { # if HWASAN_WITH_INTERCEPTORS -#define COMMON_SYSCALL_PRE_READ_RANGE(p, s) __hwasan_loadN((uptr)p, (uptr)s) -#define COMMON_SYSCALL_PRE_WRITE_RANGE(p, s) \ - __hwasan_storeN((uptr)p, (uptr)s) +#define COMMON_SYSCALL_PRE_READ_RANGE(p, s) HWASAN_READ_RANGE(p, s) +#define COMMON_SYSCALL_PRE_WRITE_RANGE(p, s) HWASAN_WRITE_RANGE(p, s) #define COMMON_SYSCALL_POST_READ_RANGE(p, s) \ do { \ (void)(p); \ @@ -91,10 +90,10 @@ struct HWAsanInterceptorContext { #include "sanitizer_common/sanitizer_syscalls_netbsd.inc" #define COMMON_INTERCEPTOR_WRITE_RANGE(ctx, ptr, size) \ - HWASAN_WRITE_RANGE(ctx, ptr, size) + HWASAN_WRITE_RANGE(ptr, size) #define COMMON_INTERCEPTOR_READ_RANGE(ctx, ptr, size) \ - HWASAN_READ_RANGE(ctx, ptr, size) + HWASAN_READ_RANGE(ptr, size) #define COMMON_INTERCEPTOR_ENTER(ctx, func, ...) \ HWAsanInterceptorContext _ctx = {#func}; \
[PATCH]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]
Hi All, The AArch64 vector PCS does not allow simd calls with simdlen 1, however due to a bug we currently do allow it for num == 0. This causes us to emit a symbol that doesn't exist and we fail to link. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? and for backport to GCC 13,12,11? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113552 * config/aarch64/aarch64.cc (aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1. gcc/testsuite/ChangeLog: PR tree-optimization/113552 * gcc.target/aarch64/pr113552.c: New test. * gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check. --- inline copy of patch -- diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index e6bd3fd0bb42c70603d5335402b89c9deeaf48d8..a2fc1a5d9d27e9d837e4d616e3feaf38f7272b4f 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -28620,7 +28620,8 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *node, if (known_eq (clonei->simdlen, 0U)) { simdlen = exact_div (poly_uint64 (64), nds_elt_bits); - simdlens.safe_push (simdlen); + if (known_ne (simdlen, 1U)) + simdlens.safe_push (simdlen); simdlens.safe_push (simdlen * 2); } else diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c b/gcc/testsuite/gcc.target/aarch64/pr113552.c new file mode 100644 index ..9c96b061ed2b4fcc57e58925277f74d14f79c51f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c @@ -0,0 +1,17 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8-a" } */ + +__attribute__ ((__simd__ ("notinbranch"), const)) +double cos (double); + +void foo (float *a, double *b) +{ +for (int i = 0; i < 12; i+=3) + { +b[i] = cos (5.0 * a[i]); +b[i+1] = cos (5.0 * a[i+1]); +b[i+2] = cos (5.0 * a[i+2]); + } +} + +/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c index 
95f6a6803e889c02177ef10972962ed62d2095eb..661764b3d4a89e08951a7a3c0495d5b7ba7f0871 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c +++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c @@ -18,7 +18,5 @@ double foo(double x) } /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */ -/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */ /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */
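For illustration only, the candidate-simdlen selection that the hunk above changes can be modelled as follows. `candidate_simdlens` is a hypothetical stand-in for the relevant part of aarch64_simd_clone_compute_vecsize_and_simdlen, not actual GCC code; it is a minimal sketch of the logic, assuming 64-bit Advanced SIMD vectors and plain integers in place of poly_uint64.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: when no simdlen is specified, one candidate is
// derived from 64 bits divided by the narrowest element size, plus its
// double.  The fix is the guard that drops the degenerate simdlen-1
// candidate, which the AArch64 vector PCS does not allow (PR113552).
static std::vector<unsigned> candidate_simdlens (unsigned nds_elt_bits)
{
  std::vector<unsigned> simdlens;
  unsigned simdlen = 64 / nds_elt_bits;
  if (simdlen != 1)            // the fix: block simdlen 1
    simdlens.push_back (simdlen);
  simdlens.push_back (simdlen * 2);
  return simdlens;
}
```

With 64-bit elements (e.g. double, as in the cos test above) only the simdlen-2 clone remains, matching the testcase that now expects only _ZGVnN2v_cos calls.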
[PATCH]AArch64: Fix expansion of Advanced SIMD div and mul using SVE [PR109636]
Hi All, As suggested in the ticket this replaces the expansion by converting the Advanced SIMD types to SVE types by simply printing out an SVE register for these instructions. This fixes the subreg issues since there are no subregs involved anymore. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR target/109636 * config/aarch64/aarch64-simd.md (div3, mulv2di3): Remove. * config/aarch64/iterators.md (VQDIV): Remove. (SVE_FULL_SDI_SIMD, SVE_FULL_SDI_SIMD_DI, SVE_FULL_HSDI_SIMD_DI, SVE_I_SIMD_DI): New. (VPRED, sve_lane_con): Add V4SI and V2DI. * config/aarch64/aarch64-sve.md (3, @aarch64_pred_): Support Advanced SIMD types. (mul3): New, split from 3. (@aarch64_pred_, *post_ra_3): New. * config/aarch64/aarch64-sve2.md (@aarch64_mul_lane_, *aarch64_mul_unpredicated_): Change SVE_FULL_HSDI to SVE_FULL_HSDI_SIMD_DI. gcc/testsuite/ChangeLog: PR target/109636 * gcc.target/aarch64/sve/pr109636_1.c: New test. * gcc.target/aarch64/sve/pr109636_2.c: New test. * gcc.target/aarch64/sve2/pr109636_1.c: New test. --- inline copy of patch -- diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 6f48b4d5f21da9f96a376cd6b34110c2a39deb33..556d0cf359fedf2c28dfe1e0a75e1c12321be68a 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -389,26 +389,6 @@ (define_insn "mul3" [(set_attr "type" "neon_mul_")] ) -;; Advanced SIMD does not support vector DImode MUL, but SVE does. -;; Make use of the overlap between Z and V registers to implement the V2DI -;; optab for TARGET_SVE. The mulvnx2di3 expander can -;; handle the TARGET_SVE2 case transparently. 
-(define_expand "mulv2di3" - [(set (match_operand:V2DI 0 "register_operand") -(mult:V2DI (match_operand:V2DI 1 "register_operand") - (match_operand:V2DI 2 "aarch64_sve_vsm_operand")))] - "TARGET_SVE" - { -machine_mode sve_mode = VNx2DImode; -rtx sve_op0 = simplify_gen_subreg (sve_mode, operands[0], V2DImode, 0); -rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], V2DImode, 0); -rtx sve_op2 = simplify_gen_subreg (sve_mode, operands[2], V2DImode, 0); - -emit_insn (gen_mulvnx2di3 (sve_op0, sve_op1, sve_op2)); -DONE; - } -) - (define_insn "bswap2" [(set (match_operand:VDQHSD 0 "register_operand" "=w") (bswap:VDQHSD (match_operand:VDQHSD 1 "register_operand" "w")))] @@ -2678,27 +2658,6 @@ (define_insn "*div3" [(set_attr "type" "neon_fp_div_")] ) -;; SVE has vector integer divisions, unlike Advanced SIMD. -;; We can use it with Advanced SIMD modes to expose the V2DI and V4SI -;; optabs to the midend. -(define_expand "div3" - [(set (match_operand:VQDIV 0 "register_operand") - (ANY_DIV:VQDIV - (match_operand:VQDIV 1 "register_operand") - (match_operand:VQDIV 2 "register_operand")))] - "TARGET_SVE" - { -machine_mode sve_mode - = aarch64_full_sve_mode (GET_MODE_INNER (mode)).require (); -rtx sve_op0 = simplify_gen_subreg (sve_mode, operands[0], mode, 0); -rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], mode, 0); -rtx sve_op2 = simplify_gen_subreg (sve_mode, operands[2], mode, 0); - -emit_insn (gen_div3 (sve_op0, sve_op1, sve_op2)); -DONE; - } -) - (define_insn "neg2" [(set (match_operand:VHSDF 0 "register_operand" "=w") (neg:VHSDF (match_operand:VHSDF 1 "register_operand" "w")))] diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md index e1e3c1bd0b7d12eefe43dc95a10716c24e3a48de..eca8623e587af944927a9459e29d5f8af170d347 100644 --- a/gcc/config/aarch64/aarch64-sve.md +++ b/gcc/config/aarch64/aarch64-sve.md @@ -3789,16 +3789,35 @@ (define_expand "3" [(set (match_operand:SVE_I 0 "register_operand") (unspec:SVE_I [(match_dup 3) - 
(SVE_INT_BINARY_IMM:SVE_I + (SVE_INT_BINARY_MULTI:SVE_I (match_operand:SVE_I 1 "register_operand") (match_operand:SVE_I 2 "aarch64_sve__operand"))] UNSPEC_PRED_X))] "TARGET_SVE" + { +operands[3] = aarch64_ptrue_reg (mode); + } +) + +;; Unpredicated integer binary operations that have an immediate form. +;; Advanced SIMD does not support vector DImode MUL, but SVE does. +;; Make use of the overlap between Z and V registers to implement the V2DI +;; optab for TARGET_SVE. The mulvnx2di3 expander can +;; handle the TARGET_SVE2 case transparently. +(define_expand "mul3" + [(set (match_operand:SVE_I_SIMD_DI 0 "register_operand") + (unspec:SVE_I_SIMD_DI + [(match_dup 3) + (mult:SVE_I_SIMD_DI +(match_operand:SVE_I_SIMD_DI 1 "register_operand") +(match_operand:SVE_I_SIMD_DI 2 "aarch64_sve_vsm_operand"))] + UNSPEC_PRED_X))] + "TARGET_SVE" { /* SVE2 supports
[PATCH]middle-end: rename main_exit_p in reduction code.
Hi All, This renamed main_exit_p to last_val_reduc_p to more accurately reflect what the value is calculating. Ok for master if bootstrap passes? Incremental build shows it's fine. Thanks, Tamar gcc/ChangeLog: * tree-vect-loop.cc (vect_get_vect_def, vect_create_epilog_for_reduction): Rename main_exit_p to last_val_reduc_p. --- inline copy of patch -- diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 4da1421c8f09746ef4b293573e4f861b642349e1..21a997599f397ba6c2cd15c3b9c8b04513bc0c83 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -5892,25 +5892,26 @@ vect_create_partial_epilog (tree vec_def, tree vectype, code_helper code, } /* Retrieves the definining statement to be used for a reduction. - For MAIN_EXIT_P we use the current VEC_STMTs and otherwise we look at - the reduction definitions. */ + For LAST_VAL_REDUC_P we use the current VEC_STMTs which correspond to the + final value after vectorization and otherwise we look at the reduction + definitions to get the first. */ tree vect_get_vect_def (stmt_vec_info reduc_info, slp_tree slp_node, - slp_instance slp_node_instance, bool main_exit_p, unsigned i, - vec _stmts) + slp_instance slp_node_instance, bool last_val_reduc_p, + unsigned i, vec _stmts) { tree def; if (slp_node) { - if (!main_exit_p) + if (!last_val_reduc_p) slp_node = slp_node_instance->reduc_phis; def = vect_get_slp_vect_def (slp_node, i); } else { - if (!main_exit_p) + if (!last_val_reduc_p) reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (reduc_info)); vec_stmts = STMT_VINFO_VEC_STMTS (reduc_info); def = gimple_get_lhs (vec_stmts[0]); @@ -5982,8 +5983,8 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, loop-closed PHI of the inner loop which we remember as def for the reduction PHI generation. 
*/ bool double_reduc = false; - bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit -&& !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo); + bool last_val_reduc_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit + && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo); stmt_vec_info rdef_info = stmt_info; if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def) { @@ -6233,7 +6234,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, { gimple_seq stmts = NULL; def = vect_get_vect_def (rdef_info, slp_node, slp_node_instance, - main_exit_p, i, vec_stmts); + last_val_reduc_p, i, vec_stmts); for (j = 0; j < ncopies; j++) { tree new_def = copy_ssa_name (def);
[PATCH]middle-end: fix epilog reductions when vector iters peeled [PR113364]
Hi All, This fixes a bug where vect_create_epilog_for_reduction does not handle the case where all exits are early exits. In this case we should do like induction handling code does and not have a main exit. Bootstrapped Regtested on x86_64-pc-linux-gnu with --enable-checking=release --enable-lto --with-arch=native --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. This shows that some new miscompiles are happening (stage3 is likely miscompiled) but that's unrelated to this patch and I'll look at it next. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113364 * tree-vect-loop.cc (vect_create_epilog_for_reduction): If all exits all early exits then we must reduce from the first offset for all of them. gcc/testsuite/ChangeLog: PR tree-optimization/113364 * gcc.dg/vect/vect-early-break_107-pr113364.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c new file mode 100644 index ..f489265dbfe5eb8fe302dcc34901abaf6e6d5c14 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-w" } */ + +typedef const unsigned char *It; +It DecodeSLEB128(It begin, It end, int *v) { + int value = 0; + unsigned shift = 0; + unsigned char byte; + do + { +if (begin == end) + return begin; +byte = *(begin++); +int slice = byte & 0x7f; +value |= slice << shift; + } while (byte >= 128); + *v = value; + return begin; +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index fe631252dc2258e8ea42179b4ba068a480be9e38..4da1421c8f09746ef4b293573e4f861b642349e1 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -5982,7 +5982,8 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, 
loop-closed PHI of the inner loop which we remember as def for the reduction PHI generation. */ bool double_reduc = false; - bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit; + bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit +&& !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo); stmt_vec_info rdef_info = stmt_info; if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def) {
[PATCH]middle-end: remove more usages of single_exit
Hi All, This replaces two more usages of single_exit that I had missed before. They both seem to happen when we re-use the ifcvt scalar loop for versioning. The condition in versioning is the same as the one for when we don't re-use the scalar loop. I hit these during an LTO enabled bootstrap now. Bootstrapped Regtested on aarch64-none-linux-gnu with lto enabled and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_loop_versioning): Replace single_exit. * tree-vect-loop.cc (vect_transform_loop): Likewise. --- inline copy of patch -- diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 0931b18404856f6c33dcae1ffa8d5a350dbd0f8f..0d8c90f69e9693d5d25095e799fbc17a9910779b 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -4051,7 +4051,16 @@ vect_loop_versioning (loop_vec_info loop_vinfo, basic_block preheader = loop_preheader_edge (loop_to_version)->src; preheader->count = preheader->count.apply_probability (prob * prob2); scale_loop_frequencies (loop_to_version, prob * prob2); - single_exit (loop_to_version)->dest->count = preheader->count; + /* When the loop has multiple exits then we can only version itself. + This is denoted by loop_to_version == loop. In this case we can + do the versioning by selecting the exit edge the vectorizer is + currently using. 
*/ + edge exit_edge; + if (loop_to_version == loop) + exit_edge = LOOP_VINFO_IV_EXIT (loop_vinfo); + else + exit_edge = single_exit (loop_to_version); + exit_edge->dest->count = preheader->count; LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo) = (prob * prob2).invert (); nloop = scalar_loop; diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index da2dfa176ecd457ebc11d1131302ca15d77d779d..eccf0953bbae2a0e95efba0966c85492e5057b14 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -11910,8 +11910,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) (LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo)); scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo), LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo)); - single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))->dest->count - = preheader->count; + LOOP_VINFO_SCALAR_IV_EXIT (loop_vinfo)->dest->count = preheader->count; } if (niters_vector == NULL_TREE)
[PATCH]middle-end testsuite: remove -save-temps from many tests [PR113319]
Hi All, This removes -save-temps from the tests I've introduced to fix the LTO mismatches. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR testsuite/113319 * gcc.dg/bic-bitmask-13.c: Remove -save-temps. * gcc.dg/bic-bitmask-14.c: Likewise. * gcc.dg/bic-bitmask-15.c: Likewise. * gcc.dg/bic-bitmask-16.c: Likewise. * gcc.dg/bic-bitmask-17.c: Likewise. * gcc.dg/bic-bitmask-18.c: Likewise. * gcc.dg/bic-bitmask-19.c: Likewise. * gcc.dg/bic-bitmask-20.c: Likewise. * gcc.dg/bic-bitmask-21.c: Likewise. * gcc.dg/bic-bitmask-22.c: Likewise. * gcc.dg/bic-bitmask-7.c: Likewise. * gcc.dg/vect/vect-early-break-run_1.c: Likewise. * gcc.dg/vect/vect-early-break-run_10.c: Likewise. * gcc.dg/vect/vect-early-break-run_2.c: Likewise. * gcc.dg/vect/vect-early-break-run_3.c: Likewise. * gcc.dg/vect/vect-early-break-run_4.c: Likewise. * gcc.dg/vect/vect-early-break-run_5.c: Likewise. * gcc.dg/vect/vect-early-break-run_6.c: Likewise. * gcc.dg/vect/vect-early-break-run_7.c: Likewise. * gcc.dg/vect/vect-early-break-run_8.c: Likewise. * gcc.dg/vect/vect-early-break-run_9.c: Likewise.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-13.c b/gcc/testsuite/gcc.dg/bic-bitmask-13.c index bac86c2cfcebb4fd83eef1ea276026af97bcb096..141b03d6df772e9bdfaaf832287a1e91ebc6be0d 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-13.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-13.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O0 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O0 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-14.c b/gcc/testsuite/gcc.dg/bic-bitmask-14.c index ec3bd6a7e04de93e60b0a606ec4cabf5bb90af22..59a008c01e22b21cbe4b8d15e411046d7940a7cf 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-14.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-14.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-15.c b/gcc/testsuite/gcc.dg/bic-bitmask-15.c index 8bdf1ea4eb2e5117c6d84b0d6cdf95798c4b8e2c..c28d9b13f4eb300414cdf19ab0550a888b8edeec 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-15.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-15.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-16.c b/gcc/testsuite/gcc.dg/bic-bitmask-16.c index cfea925b59104ad5c84beea90cea5e6ec9b1e787..f93912f0cc579b3c56e24577b36d755ec3737ed6 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-16.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-16.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-17.c b/gcc/testsuite/gcc.dg/bic-bitmask-17.c index 86873b97f27c5fe6e1495ac0cf3471b7782a8067..f8d651b829b4f3c771bc2db056f15aa385c8302e 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-17.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-17.c @@ -1,5 +1,5 @@ 
/* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-18.c b/gcc/testsuite/gcc.dg/bic-bitmask-18.c index 70bab0c520321ba13c6dd7969d1b51708dc3c71f..d6242fe3c19b8e958e4eca5ae8a633c376f09794 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-18.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-18.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-19.c b/gcc/testsuite/gcc.dg/bic-bitmask-19.c index c4620dfaad3b8fdbb0ba214bbd69b975f37c68db..aa139da5c1ede2aa422c7e56956051c3b854f983 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-19.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-19.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-20.c b/gcc/testsuite/gcc.dg/bic-bitmask-20.c index a114122e075eab6be651b4e0954f084a2fd427c9..849eca4e51489b7f68f6695de3389ed5a0697ef2 100644 --- a/gcc/testsuite/gcc.dg/bic-bitmask-20.c +++ b/gcc/testsuite/gcc.dg/bic-bitmask-20.c @@ -1,5 +1,5 @@ /* { dg-do run } */ -/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */ +/* { dg-options "-O1 -fdump-tree-dce" } */ #include diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-21.c b/gcc/testsuite/gcc.dg/bic-bitmask-21.c index bd12a58da1ed5868b78b18742ed9d69289e58a37..9aecd7228523be5d7c4fd13c16833986ada79615
[PATCH]middle-end: make memory analysis for early break more deterministic [PR113135]
Hi All, Instead of searching for where to move stores to, they should always be moved to the exit belonging to the latch. We can only ever delay stores, and even if we pick a different exit than the latch one as the main one, effects still happen in program order when vectorized. If we don't move the stores to the latch exit but instead to whichever exit we pick as the "main" exit, then we can perform incorrect memory accesses (luckily these are trapped by verify_ssa). We used to iterate over the conds and check the loads and stores inside them. However, this relies on the conds being ordered in program order. Additionally, if there is a basic block between two conds we would not have analyzed it. Instead this now walks from the preds of the destination basic block up to the loop header and analyzes every block along the way. As a later optimization we could stop as soon as we've seen all the BBs we have conds for. For now the header will always contain the first cond, but this can change when we support arbitrary control flow. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues normally and with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113135 * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Rework dependency analysis. gcc/testsuite/ChangeLog: PR tree-optimization/113135 * gcc.dg/vect/vect-early-break_103-pr113135.c: New test.
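The block walk described above can be sketched as follows. This is an illustrative model only: the `block` type and `walk_to_header` function are hypothetical stand-ins, not GCC's CFG data structures or the actual vect_analyze_early_break_dependences code.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Illustrative model of the reworked dependence walk: starting from the
// predecessors of the destination block (the exit belonging to the latch),
// visit every block up to and including the loop header, so a basic block
// sitting between two conditions can no longer be skipped.
struct block { int id; std::vector<block *> preds; };

static std::set<int> walk_to_header (block *dest_bb, block *header)
{
  std::vector<block *> workset;
  std::set<int> seen;
  for (block *p : dest_bb->preds)
    workset.push_back (p);
  while (!workset.empty ())
    {
      block *bb = workset.back ();
      workset.pop_back ();
      if (!seen.insert (bb->id).second)
	continue;		/* already analyzed */
      if (bb == header)
	continue;		/* stop at the loop header */
      for (block *p : bb->preds)
	workset.push_back (p);	/* keep walking towards the header */
    }
  return seen;
}
```

In a chain header -> cond1 -> bb -> cond2 -> dest, the walk visits the intermediate bb even though no condition lives in it, which the old cond-based iteration would have missed.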
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c new file mode 100644 index ..bbad7ee2cb18086e470f4a2a2dc0a2b345bbdd71 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-w" } */ + +char UnpackReadTables_BitLength[20]; +int UnpackReadTables_ZeroCount; +void UnpackReadTables() { + for (unsigned I = 0; I < 20;) +while (UnpackReadTables_ZeroCount-- && + I < sizeof(UnpackReadTables_BitLength)) + UnpackReadTables_BitLength[I++] = 0; +} diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index 3d9673fb0b580ff21ff151dc5c199840df41a1cd..6b76eee72cb7d09de5f443589b4fc3a0e8c2584f 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -671,13 +671,18 @@ vect_analyze_early_break_dependences (loop_vec_info loop_vinfo) "loop contains multiple exits, analyzing" " statement dependencies.\n"); - for (gimple *c : LOOP_VINFO_LOOP_CONDS (loop_vinfo)) -{ - stmt_vec_info loop_cond_info = loop_vinfo->lookup_stmt (c); - if (STMT_VINFO_TYPE (loop_cond_info) != loop_exit_ctrl_vec_info_type) - continue; + /* Since we don't support general control flow, the location we'll move the + side-effects to is always the latch connected exit. When we support + general control flow we can do better but for now this is fine. */ + dest_bb = single_pred (loop->latch); + auto_vec<edge> workset; + for (auto e: dest_bb->preds) +workset.safe_push (e); - gimple_stmt_iterator gsi = gsi_for_stmt (c); + while (!workset.is_empty ()) +{ + basic_block bb = workset.pop ()->src; + gimple_stmt_iterator gsi = gsi_last_bb (bb); /* Now analyze all the remaining statements and try to determine which instructions are allowed/needed to be moved.
*/ @@ -705,10 +710,10 @@ vect_analyze_early_break_dependences (loop_vec_info loop_vinfo) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "early breaks only supported on statically" " allocated objects.\n"); - return opt_result::failure_at (c, + return opt_result::failure_at (stmt, "can't safely apply code motion to " "dependencies of %G to vectorize " -"the early exit.\n", c); +"the early exit.\n", stmt); } tree refop = TREE_OPERAND (obj, 0); @@ -720,10 +725,10 @@ vect_analyze_early_break_dependences (loop_vec_info loop_vinfo) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "early breaks only supported on" " statically allocated objects.\n"); - return opt_result::failure_at (c, + return opt_result::failure_at (stmt, "can't safely apply code motion to " "dependencies of %G to vectorize
[PATCH]middle-end: fill in reduction PHI for all alt exits [PR113144]
Hi All, When we have a loop with more than 2 exits and a reduction, I forgot to fill in the PHI value for all alternate exits. All alternate exits use the same PHI value so we should loop over the new PHI elements and copy the value across since we call the reduction calculation code only once for all exits. This was normally covered up by earlier parts of the compiler rejecting loops incorrectly (which has been fixed now). Note that while I can use the loop in all cases, the reason I separated out the main and alt exit is so that if you pass the wrong edge the macro will assert. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113178 * tree-vect-loop.cc (vect_create_epilog_for_reduction): Fill in all alternate exits. gcc/testsuite/ChangeLog: PR tree-optimization/113178 * g++.dg/vect/vect-early-break_6-pr113178.cc: New test. * gcc.dg/vect/vect-early-break_101-pr113178.c: New test. * gcc.dg/vect/vect-early-break_102-pr113178.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc b/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc new file mode 100644 index ..da008759a72dd563bf4930decd74470ae35cb98e --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +struct PixelWeight { + int m_SrcStart; + int m_Weights[]; +}; +struct CWeightTable { + int *GetValueFromPixelWeight(PixelWeight *, int) const; +}; +char ContinueStretchHorz_dest_scan; +struct CStretchEngine { + bool ContinueStretchHorz(); + CWeightTable m_WeightTable; +}; +int *CWeightTable::GetValueFromPixelWeight(PixelWeight *pWeight, + int index) const { + long __trans_tmp_1; + if (index < pWeight->m_SrcStart) +return __trans_tmp_1 ? &pWeight->m_Weights[pWeight->m_SrcStart] : nullptr; +} +bool CStretchEngine::ContinueStretchHorz() { + { +PixelWeight pPixelWeights; +int dest_g_m; +for (int j; j; j++) { + int pWeight = *m_WeightTable.GetValueFromPixelWeight(&pPixelWeights, j); + dest_g_m += pWeight; +} +ContinueStretchHorz_dest_scan = dest_g_m; + } +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c new file mode 100644 index ..8b91112133f0522270bb4d92664355838a405aaf --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +struct PixelWeight { + int m_SrcStart; + int m_Weights[16]; +}; +char h; +void f(struct PixelWeight *pPixelWeights) { +int dest_g_m; +long tt; +for (int j = 0; j < 16; j++) { + int *p = 0; + if (j < pPixelWeights->m_SrcStart) +p = tt ? 
&pPixelWeights->m_Weights[0] : 0; + int pWeight = *p; + dest_g_m += pWeight; +} +h = dest_g_m; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c new file mode 100644 index ..ad7582e440720e50a2769239c88b1e07517e4c10 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-std=gnu99 -fpermissive -fgnu89-inline -Ofast -fprofile-generate -w" } */ + +extern int replace_reg_with_saved_mem_i, replace_reg_with_saved_mem_nregs, +replace_reg_with_saved_mem_mem_1; +replace_reg_with_saved_mem_mode() { + if (replace_reg_with_saved_mem_i) +return; + while (++replace_reg_with_saved_mem_i < replace_reg_with_saved_mem_nregs) +if (replace_reg_with_saved_mem_i) + break; + if (replace_reg_with_saved_mem_i) +if (replace_reg_with_saved_mem_mem_1) + adjust_address_1(); + replace_reg_with_saved_mem_mem_1 ? fancy_abort() : 0; +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 27bb28365936978013a576b64b72d9e92375f361..da2dfa176ecd457ebc11d1131302ca15d77d779d 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -6223,7 +6223,13 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo, phi = create_phi_node (new_def, exit_bb); if (j) def = gimple_get_lhs (vec_stmts[j]); - SET_PHI_ARG_DEF (phi, loop_exit->dest_idx, def); + if (LOOP_VINFO_IV_EXIT
RE: [PATCH][testsuite]: Make bitint early vect test more accurate
> But I'm afraid I have no idea how is this supposed to work on > non-bitint targets or where __BITINT_MAXWIDTH__ is smaller than 9020. > There is no loop at all there, so what should be vectorized? > Yeah, it was giving an UNRESOLVED result and I didn't notice it in the diff. > I'd say introduce > # Return 1 if the target supports _BitInt(65535), 0 otherwise. > > proc check_effective_target_bitint65535 { } { > return [check_no_compiler_messages bitint65535 object { > _BitInt (2) a = 1wb; > unsigned _BitInt (65535) b = 0uwb; > } "-std=c23"] > } > > after bitint575 effective target and use it in the test. > Sure, how's: -- This changes the tests I committed for PR113287 to also run on targets that don't support bitint. Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues and tests run on both. Ok for master? Thanks, Tamar gcc/ChangeLog: * doc/sourcebuild.texi (check_effective_target_bitint65535): New. gcc/testsuite/ChangeLog: PR tree-optimization/113287 * gcc.dg/vect/vect-early-break_100-pr113287.c: Support non-bitint. * gcc.dg/vect/vect-early-break_99-pr113287.c: Likewise. * lib/target-supports.exp (bitint, bitint128, bitint575, bitint65535): Document them. ---inline copy of patch --- diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi index bd62b21f3b725936eae34c22159ccbc9db40873f..6fbb102f9971d54d66d77dcee8f10a1b57aa6e5a 100644 --- a/gcc/doc/sourcebuild.texi +++ b/gcc/doc/sourcebuild.texi @@ -2864,6 +2864,18 @@ Target supports Graphite optimizations. @item fixed_point Target supports fixed-point extension to C. +@item bitint +Target supports _BitInt(N). + +@item bitint128 +Target supports _BitInt(128). + +@item bitint575 +Target supports _BitInt(575). + +@item bitint65535 +Target supports _BitInt(65535). + @item fopenacc Target supports OpenACC via @option{-fopenacc}. 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c index f908e5bc60779c148dc95bda3e200383d12b9e1e..05fb84e1d36d4d05f39e48e41fc70703074ecabd 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c @@ -1,28 +1,29 @@ /* { dg-add-options vect_early_break } */ /* { dg-require-effective-target vect_early_break } */ -/* { dg-require-effective-target vect_int } */ -/* { dg-require-effective-target bitint } */ +/* { dg-require-effective-target vect_long_long } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ __attribute__((noipa)) void -bar (unsigned long *p) +bar (unsigned long long *p) { - __builtin_memset (p, 0, 142 * sizeof (unsigned long)); - p[17] = 0x500UL; + __builtin_memset (p, 0, 142 * sizeof (unsigned long long)); + p[17] = 0x500ULL; } __attribute__((noipa)) int foo (void) { - unsigned long r[142]; + unsigned long long r[142]; bar (r); - unsigned long v = ((long) r[0] >> 31); + unsigned long long v = ((long) r[0] >> 31); if (v + 1 > 1) return 1; - for (unsigned long i = 1; i <= 140; ++i) + for (unsigned long long i = 1; i <= 140; ++i) if (r[i] != v) return 1; - unsigned long w = r[141]; - if ((unsigned long) (((long) (w << 60)) >> 60) != v) + unsigned long long w = r[141]; + if ((unsigned long long) (((long) (w << 60)) >> 60) != v) return 1; return 0; } diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c index b92a8a268d803ab1656b4716b1a319ed4edc87a3..e141e8a9277f89527e8aff809fe101fdd91a4c46 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c @@ -1,7 +1,8 @@ /* { dg-add-options vect_early_break } */ /* { dg-require-effective-target vect_early_break } */ -/* { dg-require-effective-target vect_int } */ -/* { 
dg-require-effective-target bitint } */ +/* { dg-require-effective-target bitint65535 } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ _BitInt(998) b; char c; diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index a9c76e0b290b19fd07574805bb2b87c86a5e9cf7..1ddcb3926a8d549b6a17b61e29e1d9836ecce897 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -3850,6 +3850,15 @@ proc check_effective_target_bitint575 { } { } "-std=c23"] } +# Return 1 if the target supports _BitInt(65535), 0 otherwise. + +proc check_effective_target_bitint65535 { } { +return [check_no_compiler_messages bitint65535 object { +_BitInt (2) a = 1wb; +unsigned _BitInt (65535) b = 0uwb; +} "-std=c23"] +} + # Return 1 if the target supports compiling decimal floating point, # 0 otherwise.
[PATCH][testsuite]: Make bitint early vect test more accurate
Hi All, This changes the tests I committed for PR113287 to also run on targets that don't support bitint. Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues and tests run on both. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: PR tree-optimization/113287 * gcc.dg/vect/vect-early-break_100-pr113287.c: Support non-bitint. * gcc.dg/vect/vect-early-break_99-pr113287.c: Likewise. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c index f908e5bc60779c148dc95bda3e200383d12b9e1e..05fb84e1d36d4d05f39e48e41fc70703074ecabd 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c @@ -1,28 +1,29 @@ /* { dg-add-options vect_early_break } */ /* { dg-require-effective-target vect_early_break } */ -/* { dg-require-effective-target vect_int } */ -/* { dg-require-effective-target bitint } */ +/* { dg-require-effective-target vect_long_long } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ __attribute__((noipa)) void -bar (unsigned long *p) +bar (unsigned long long *p) { - __builtin_memset (p, 0, 142 * sizeof (unsigned long)); - p[17] = 0x500UL; + __builtin_memset (p, 0, 142 * sizeof (unsigned long long)); + p[17] = 0x500ULL; } __attribute__((noipa)) int foo (void) { - unsigned long r[142]; + unsigned long long r[142]; bar (r); - unsigned long v = ((long) r[0] >> 31); + unsigned long long v = ((long) r[0] >> 31); if (v + 1 > 1) return 1; - for (unsigned long i = 1; i <= 140; ++i) + for (unsigned long long i = 1; i <= 140; ++i) if (r[i] != v) return 1; - unsigned long w = r[141]; - if ((unsigned long) (((long) (w << 60)) >> 60) != v) + unsigned long long w = r[141]; + if ((unsigned long long) (((long) (w << 60)) >> 60) != v) return 1; return 0; } diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c index b92a8a268d803ab1656b4716b1a319ed4edc87a3..fb99ef39402ee7b3c6c564e7db5f5543a5f0c2e0 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c @@ -1,9 +1,18 @@ /* { dg-add-options vect_early_break } */ /* { dg-require-effective-target vect_early_break } */ -/* { dg-require-effective-target vect_int } */ -/* { dg-require-effective-target bitint } */ +/* { dg-require-effective-target vect_long_long } */ -_BitInt(998) b; +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +#if __BITINT_MAXWIDTH__ >= 9020 +typedef _BitInt(9020) B9020; +typedef _BitInt(998) B998; +#else +typedef long long B998; +typedef long long B9020; +#endif + +B998 b; char c; char d; char e; @@ -14,7 +23,7 @@ char i; char j; void -foo(char y, _BitInt(9020) a, char *r) +foo(char y, B9020 a, char *r) { char x = __builtin_mul_overflow_p(a << sizeof(a), y, 0); x += c + d + e + f + g + h + i + j + b; @@ -26,7 +35,12 @@ main(void) { char x; foo(5, 5, &x); +#if __BITINT_MAXWIDTH__ >= 9020 if (x != 1) __builtin_abort(); +#else + if (x != 0) +__builtin_abort(); +#endif return 0; } -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c index f908e5bc60779c148dc95bda3e200383d12b9e1e..05fb84e1d36d4d05f39e48e41fc70703074ecabd 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c @@ -1,28 +1,29 @@ /* { dg-add-options vect_early_break } */ /* { dg-require-effective-target vect_early_break } */ -/* { dg-require-effective-target vect_int } */ -/* { dg-require-effective-target bitint } */ +/* { dg-require-effective-target vect_long_long } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ __attribute__((noipa)) void -bar (unsigned long *p) +bar (unsigned long long *p) { - __builtin_memset (p, 0, 
142 * sizeof (unsigned long)); - p[17] = 0x500UL; + __builtin_memset (p, 0, 142 * sizeof (unsigned long long)); + p[17] = 0x500ULL; } __attribute__((noipa)) int foo (void) { - unsigned long r[142]; + unsigned long long r[142]; bar (r); - unsigned long v = ((long) r[0] >> 31); + unsigned long long v = ((long) r[0] >> 31); if (v + 1 > 1) return 1; - for (unsigned long i = 1; i <= 140; ++i) + for (unsigned long long i = 1; i <= 140; ++i) if (r[i] != v) return 1; - unsigned long w = r[141]; - if ((unsigned long) (((long) (w << 60)) >> 60) != v) + unsigned long long w = r[141]; + if ((unsigned long long) (((long) (w << 60)) >> 60) != v) return 1; return 0; } diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c index
RE: [PATCH]middle-end: correctly identify the edge taken when condition is true. [PR113287]
> -Original Message- > From: Jakub Jelinek > Sent: Wednesday, January 10, 2024 2:42 PM > To: Tamar Christina ; Richard Biener > > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: correctly identify the edge taken when > condition > is true. [PR113287] > > Hi! > > Thanks for fixing it, just testsuite nits. > > On Wed, Jan 10, 2024 at 03:22:53PM +0100, Richard Biener wrote: > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c > > > @@ -0,0 +1,35 @@ > > > +/* { dg-add-options vect_early_break } */ > > > +/* { dg-require-effective-target vect_early_break } */ > > > +/* { dg-require-effective-target vect_int } */ > > > +/* { dg-require-effective-target bitint } */ > > This test doesn't need bitint effective target. > But relies on long being 64-bit, otherwise e.g. > 0x500UL doesn't need to fit or shifting it by 60 is invalid. > So, maybe use lp64 effective target instead. I was thinking about it. Would using effective-target longlong and changing the constant to ULL instead work? Thanks, Tamar
[PATCH]middle-end: correctly identify the edge taken when condition is true. [PR113287]
Hi All, The vectorizer needs to know during early break vectorization whether the edge that will be taken if the condition is true stays or leaves the loop. This is because the code assumes that if you take the true branch you exit the loop. If you don't exit the loop it has to generate a different condition. Basically it uses this information to decide whether it's generating a "any element" or an "all element" check. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues with --enable-lto --with-build-config=bootstrap-O3 --enable-checking=release,yes,rtl,extra. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113287 * tree-vect-stmts.cc (vectorizable_early_exit): Check the flags on edge instead of using BRANCH_EDGE to determine true edge. gcc/testsuite/ChangeLog: PR tree-optimization/113287 * gcc.dg/vect/vect-early-break_100-pr113287.c: New test. * gcc.dg/vect/vect-early-break_99-pr113287.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c new file mode 100644 index ..f908e5bc60779c148dc95bda3e200383d12b9e1e --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c @@ -0,0 +1,35 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target bitint } */ + +__attribute__((noipa)) void +bar (unsigned long *p) +{ + __builtin_memset (p, 0, 142 * sizeof (unsigned long)); + p[17] = 0x500UL; +} + +__attribute__((noipa)) int +foo (void) +{ + unsigned long r[142]; + bar (r); + unsigned long v = ((long) r[0] >> 31); + if (v + 1 > 1) +return 1; + for (unsigned long i = 1; i <= 140; ++i) +if (r[i] != v) + return 1; + unsigned long w = r[141]; + if ((unsigned long) (((long) (w << 60)) >> 60) != v) +return 1; + return 0; +} + +int +main () +{ + if (foo () != 1) +__builtin_abort 
(); +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c new file mode 100644 index ..b92a8a268d803ab1656b4716b1a319ed4edc87a3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c @@ -0,0 +1,32 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target bitint } */ + +_BitInt(998) b; +char c; +char d; +char e; +char f; +char g; +char h; +char i; +char j; + +void +foo(char y, _BitInt(9020) a, char *r) +{ + char x = __builtin_mul_overflow_p(a << sizeof(a), y, 0); + x += c + d + e + f + g + h + i + j + b; + *r = x; +} + +int +main(void) +{ + char x; + foo(5, 5, &x); + if (x != 1) +__builtin_abort(); + return 0; +} diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 1333d8934783acdb5277e3a03c2b4021fec4777b..da004b0e9e2696cd2ce358d3b221851c7b60b448 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -12870,13 +12870,18 @@ vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info, rewrite conditions to always be a comparison against 0. To do this it sometimes flips the edges. This is fine for scalar, but for vector we then have to flip the test, as we're still assuming that if you take the - branch edge that we found the exit condition. */ + branch edge that we found the exit condition. i.e. we need to know whether + we are generating a `forall` or an `exist` condition. 
*/ auto new_code = NE_EXPR; auto reduc_optab = ior_optab; auto reduc_op = BIT_IOR_EXPR; tree cst = build_zero_cst (vectype); + edge exit_true_edge = EDGE_SUCC (gimple_bb (cond_stmt), 0); + if (exit_true_edge->flags & EDGE_FALSE_VALUE) +exit_true_edge = EDGE_SUCC (gimple_bb (cond_stmt), 1); + gcc_assert (exit_true_edge->flags & EDGE_TRUE_VALUE); if (flow_bb_inside_loop_p (LOOP_VINFO_LOOP (loop_vinfo), -BRANCH_EDGE (gimple_bb (cond_stmt))->dest)) +exit_true_edge->dest)) { new_code = EQ_EXPR; reduc_optab = and_optab; -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c new file mode 100644 index ..f908e5bc60779c148dc95bda3e200383d12b9e1e --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c @@ -0,0 +1,35 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target bitint } */ + +__attribute__((noipa)) void +bar (unsigned long *p) +{
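The `forall`/`exist` distinction from the comment above can be illustrated with plain scalar reductions. This is a hedged sketch on a fixed 4-lane mask, not the vectorizer's actual IL: when the true edge leaves the loop, the vectorized exit test asks "did any lane trigger the break?" (an IOR reduction compared NE against zero); when the edges were flipped so the true edge stays inside, the test inverts into "did all lanes pass?" (an AND reduction compared EQ).

```cpp
#include <array>

// "exist" form: the exit fires if ANY lane satisfies the condition.
bool any_lane (const std::array<bool, 4> &cond)
{
  bool r = false;
  for (bool c : cond)
    r |= c;            // IOR reduction, tested NE against zero
  return r;
}

// "forall" form: the loop continues only if ALL lanes satisfy the
// (inverted) condition; one failing lane means the exit was taken.
bool all_lanes (const std::array<bool, 4> &cond)
{
  bool r = true;
  for (bool c : cond)
    r &= c;            // AND reduction, tested in the EQ form
  return r;
}
```

Picking the wrong one of these two shapes is exactly the bug the patch fixes by checking the EDGE_TRUE_VALUE flag instead of assuming BRANCH_EDGE is the true edge.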
[PATCH][committed][c++ frontend]: initialize ivdep value
Hi All, Should control enter the switch from one of the cases other than the IVDEP one then the variable remains uninitialized. This fixes it by initializing it to false. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues Committed as obvious. Thanks, Tamar gcc/cp/ChangeLog: * parser.cc (cp_parser_pragma): Initialize to false. --- inline copy of patch -- diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 379aeb56b152b9b29606ba4d75ad4c49dfe92aac..1b4ce1497e893d6463350eecf5ef4e88957f5f00 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -50625,7 +50625,7 @@ cp_parser_pragma (cp_parser *parser, enum pragma_context context, bool *if_p) case PRAGMA_UNROLL: case PRAGMA_NOVECTOR: { - bool ivdep; + bool ivdep = false; tree unroll = NULL_TREE; bool novector = false; const char *pragma_str; -- diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc index 379aeb56b152b9b29606ba4d75ad4c49dfe92aac..1b4ce1497e893d6463350eecf5ef4e88957f5f00 100644 --- a/gcc/cp/parser.cc +++ b/gcc/cp/parser.cc @@ -50625,7 +50625,7 @@ cp_parser_pragma (cp_parser *parser, enum pragma_context context, bool *if_p) case PRAGMA_UNROLL: case PRAGMA_NOVECTOR: { - bool ivdep; + bool ivdep = false; tree unroll = NULL_TREE; bool novector = false; const char *pragma_str;
RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]
ping > -Original Message- > From: Tamar Christina > Sent: Friday, January 5, 2024 1:31 PM > To: Xi Ruoyao ; Palmer Dabbelt > Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de; Jeff Law > > Subject: RE: [PATCH]middle-end: Don't apply copysign optimization if target > does > not implement optab [PR112468] > > > On Fri, 2024-01-05 at 11:02 +, Tamar Christina wrote: > > > Ok, so something like: > > > > > > > > ([istarget loongarch*-*-*] && > > > > > ([check_effective_target_loongarch_sx] || > > > > > [check_effective_target_hard_float])) > > > ? > > > > We don't need "[check_effective_target_loongarch_sx] ||" because SIMD > > requires hard float. > > > > Cool, thanks! > > -- > > Hi All, > > currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr. > The latter has a libcall fallback and the IFN can only do optabs. > > Because of this the change I made to optimize copysign only works if the > target has implemented the optab, but it should work for those that have the > libcall too. > > More annoyingly if a target has vector versions of ABS and NEG but not > COPYSIGN > then the change made them lose vectorization. > > The proper fix for this is to treat the IFN the same as the tree EXPR and to > enhance expand_COPYSIGN to also support vector calls. > > I have such a patch for GCC 15 but it's quite big and too invasive for > stage-4. > As such this is a minimal fix, just don't apply the transformation and leave > targets which don't have the optab unoptimized. > > Targets list for check_effective_target_ifn_copysign was gotten by grepping > for > copysign and looking at the optab. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > Tests ran in x86_64-pc-linux-gnu -m32 and tests no longer fail. > > Ok for master? > > Thanks, > Tamar > > gcc/ChangeLog: > > PR tree-optimization/112468 > * doc/sourcebuild.texi: Document ifn_copysign. > * match.pd: Only apply transformation if target supports the IFN. 
> > gcc/testsuite/ChangeLog: > > PR tree-optimization/112468 > * gcc.dg/fold-copysign-1.c: Modify tests based on if target supports > IFN_COPYSIGN. > * gcc.dg/pr55152-2.c: Likewise. > * gcc.dg/tree-ssa/abs-4.c: Likewise. > * gcc.dg/tree-ssa/backprop-6.c: Likewise. > * gcc.dg/tree-ssa/copy-sign-2.c: Likewise. > * gcc.dg/tree-ssa/mult-abs-2.c: Likewise. > * lib/target-supports.exp (check_effective_target_ifn_copysign): New. > > --- inline copy of patch --- > > diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi > index > 4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de3490 > 5f614ef6957658b4 100644 > --- a/gcc/doc/sourcebuild.texi > +++ b/gcc/doc/sourcebuild.texi > @@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a > SIMD instruction set. > @item xorsign > Target supports the xorsign optab expansion. > > +@item ifn_copysign > +Target supports the IFN_COPYSIGN optab expansion for both scalar and vector > +types. > + > @end table > > @subsubsection Environment attributes > diff --git a/gcc/match.pd b/gcc/match.pd > index > d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b6 > 14890dc4729b521 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (simplify >(copysigns @0 REAL_CST@1) >(if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) > - (abs @0 > + (abs @0) > +#if GIMPLE > + (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type, > + OPTIMIZE_FOR_BOTH)) > +(negate (abs @0))) > +#endif > + ))) > > +#if GIMPLE > /* Transform fneg (fabs (X)) -> copysign (X, -1). */ > (simplify > (negate (abs @0)) > - (IFN_COPYSIGN @0 { build_minus_one_cst (type); })) > - > + (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type, > + OPTIMIZE_FOR_BOTH)) > + (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))) > +#endif > /* copysign(copysign(x, y), z) -> copysign(x, z). 
*/ > (for copysigns (COPYSIGN_ALL) > (simplify > diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c > b/gcc/testsuite/gcc.dg/fold-copysign-1.c > index > f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef39cc8f6e442ce 100644 > --- a/gcc/testsuite/gcc.dg/fold-copysign-1.c > +++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c > @@ -1,5 +1,6 @@ > /* { dg-do compile }
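The identity behind the match.pd rule discussed in this thread can be sanity-checked with plain C++: -fabs (x) is the same value as copysign (x, -1) for any x, including signed zeros. Illustrative only; whether GCC emits IFN_COPYSIGN for this rewrite depends on the target providing the copysign optab, which is exactly what the patch gates on.

```cpp
#include <cmath>

// fneg (fabs (X)): always produces the negative-signed magnitude of x.
double neg_abs (double x) { return -std::fabs (x); }

// copysign (X, -1): same magnitude as x, sign bit forced negative.
// The two are interchangeable, which is why match.pd can rewrite one
// into the other when IFN_COPYSIGN is supported.
double via_copysign (double x) { return std::copysign (x, -1.0); }
```

On targets without the optab, the IFN form would previously fall over because, unlike the COPYSIGN tree expression, it has no libcall fallback.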
[PATCH][committed]middle-end: removed unused variable in vectorizable_live_operation_1
Hi All, It looks like the previous patch had an unused variable. It's odd that my bootstrap didn't catch it (I'm assuming -Werror is still on for O3 bootstraps) but this fixes it. Committed to fix bootstrap. Thanks, Tamar gcc/ChangeLog: * tree-vect-loop.cc (vectorizable_live_operation_1): Drop unused restart_loop. (vectorizable_live_operation): Likewise. --- inline copy of patch -- diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 39b1161309d8ff8bfe88ee26df9147df0af0a58c..c218d514fe4be57fca97a85a36be7240d3e84edf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -10575,13 +10575,12 @@ vectorizable_induction (loop_vec_info loop_vinfo, helper function for vectorizable_live_operation. */ -tree +static tree vectorizable_live_operation_1 (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, basic_block exit_bb, tree vectype, int ncopies, slp_tree slp_node, tree bitsize, tree bitstart, tree vec_lhs, - tree lhs_type, bool restart_loop, - gimple_stmt_iterator *exit_gsi) + tree lhs_type, gimple_stmt_iterator *exit_gsi) { gcc_assert (single_pred_p (exit_bb) || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)); @@ -10597,7 +10596,7 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, if (integer_zerop (bitstart)) { tree scalar_res = gimple_build (, BIT_FIELD_REF, TREE_TYPE (vectype), - vec_lhs_phi, bitsize, bitstart); + vec_lhs_phi, bitsize, bitstart); /* Convert the extracted vector element to the scalar type. 
*/ new_tree = gimple_convert (, lhs_type, scalar_res); @@ -10958,8 +10957,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, dest, vectype, ncopies, slp_node, bitsize, tmp_bitstart, tmp_vec_lhs, -lhs_type, restart_loop, -_gsi); +lhs_type, _gsi); if (gimple_phi_num_args (use_stmt) == 1) { -- diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 39b1161309d8ff8bfe88ee26df9147df0af0a58c..c218d514fe4be57fca97a85a36be7240d3e84edf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -10575,13 +10575,12 @@ vectorizable_induction (loop_vec_info loop_vinfo, helper function for vectorizable_live_operation. */ -tree +static tree vectorizable_live_operation_1 (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, basic_block exit_bb, tree vectype, int ncopies, slp_tree slp_node, tree bitsize, tree bitstart, tree vec_lhs, - tree lhs_type, bool restart_loop, - gimple_stmt_iterator *exit_gsi) + tree lhs_type, gimple_stmt_iterator *exit_gsi) { gcc_assert (single_pred_p (exit_bb) || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)); @@ -10597,7 +10596,7 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, if (integer_zerop (bitstart)) { tree scalar_res = gimple_build (, BIT_FIELD_REF, TREE_TYPE (vectype), - vec_lhs_phi, bitsize, bitstart); + vec_lhs_phi, bitsize, bitstart); /* Convert the extracted vector element to the scalar type. */ new_tree = gimple_convert (, lhs_type, scalar_res); @@ -10958,8 +10957,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info, dest, vectype, ncopies, slp_node, bitsize, tmp_bitstart, tmp_vec_lhs, -lhs_type, restart_loop, -_gsi); +lhs_type, _gsi); if (gimple_phi_num_args (use_stmt) == 1) {
RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]
Hmm, I'm confused as to why it didn't break mine... just did one again... anyway I'll remove the unused variable. > -Original Message- > From: Rainer Orth > Sent: Tuesday, January 9, 2024 4:06 PM > To: Richard Biener > Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd > ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: check if target can do extract first for > early breaks > [PR113199] > > Richard Biener writes: > > > On Tue, 9 Jan 2024, Tamar Christina wrote: > > > >> > > - > >> > > - gimple_seq_add_seq (, tem); > >> > > - > >> > > - scalar_res = gimple_build (, CFN_EXTRACT_LAST, > >> > > scalar_type, > >> > > - mask, vec_lhs_phi); > >> > > + scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE > >> > (vectype), > >> > > + vec_lhs_phi, bitstart); > >> > > >> > So bitstart is always zero? I wonder why using CFN_VEC_EXTRACT over > >> > BIT_FIELD_REF here which wouldn't need any additional target support. > >> > > >> > >> Ok, how about... > >> > >> --- > >> > >> I was generating the vector reverse mask without checking if the target > >> actually supported such an operation. > >> > >> This patch changes it to if the bitstart is 0 then use BIT_FIELD_REF > >> instead > >> to extract the first element since this is supported by all targets. > >> > >> This is good for now since masks always come from whilelo. But in the > >> future > >> when masks can come from other sources we will need the old code back. > >> > >> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu > >> and no issues with --enable-checking=release --enable-lto > >> --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. > >> tested on cross cc1 for amdgcn-amdhsa and issue fixed. > >> > >> Ok for master? > > > > OK. > > > >> Thanks, > >> Tamar > >> > >> gcc/ChangeLog: > >> > >>PR tree-optimization/113199 > >>* tree-vect-loop.cc (vectorizable_live_operation_1): Use > >>BIT_FIELD_REF. 
> > This patch broke bootstrap (everywhere, it seems; seen on > i386-pc-solaris2.11 and sparc-sun-solaris2.11): > > /vol/gcc/src/hg/master/local/gcc/tree-vect-loop.cc: In function 'tree_node* > vectorizable_live_operation_1(loop_vec_info, stmt_vec_info, basic_block, > tree, int, > slp_tree, tree, tree, tree, tree, bool, gimple_stmt_iterator*)': > /vol/gcc/src/hg/master/local/gcc/tree-vect-loop.cc:10598:52: error: unused > parameter 'restart_loop' [-Werror=unused-parameter] > 10598 |tree lhs_type, bool restart_loop, > | ~^~~~ > > Rainer > > -- > - > Rainer Orth, Center for Biotechnology, Bielefeld University
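The BIT_FIELD_REF-at-offset-0 idea discussed in this thread can be seen from plain C: GCC's generic vector extensions let any target subscript a vector, and extracting lane 0 needs no special vec_extract support. A minimal sketch (the function name is illustrative, not from the patch):

```c
#include <assert.h>

/* Four 32-bit ints in a 128-bit generic vector.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Extracting lane 0 is what the patch emits as BIT_FIELD_REF <v, 32, 0>;
   unlike .VEC_EXTRACT it requires no additional target support.  */
static int
extract_first (v4si v)
{
  return v[0];
}
```

This is why the patch special-cases bitstart == 0: every target can do a plain lane-0 subscript, while .VEC_EXTRACT needs an optab.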
RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]
> -Original Message- > From: Richard Biener > Sent: Tuesday, January 9, 2024 1:51 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with > multiple exits [PR113144] > > On Tue, 9 Jan 2024, Richard Biener wrote: > > > On Tue, 9 Jan 2024, Tamar Christina wrote: > > > > > > > > > > > > -Original Message- > > > > From: Richard Biener > > > > Sent: Tuesday, January 9, 2024 12:26 PM > > > > To: Tamar Christina > > > > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > > > > Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with > > > > multiple exits [PR113144] > > > > > > > > On Tue, 9 Jan 2024, Tamar Christina wrote: > > > > > > > > > > This makes it quadratic in the number of vectorized early exit loops > > > > > > in a function. The vectorizer CFG manipulation operates in a local > > > > > > enough bubble that programmatic updating of dominators should be > > > > > > possible (after all we manage to produce correct SSA form!), the > > > > > > proposed change gets us too far off to a point where re-computating > > > > > > dominance info is likely cheaper (but no, we shouldn't do this > > > > > > either). > > > > > > > > > > > > Can you instead give manual updating a try again? I think > > > > > > versioning should produce up-to-date dominator info, it's only > > > > > > when you redirect branches during peeling that you'd need > > > > > > adjustments - but IIRC we're never introducing new merges? > > > > > > > > > > > > IIRC we can't wipe dominators during transform since we query them > > > > > > during code generation. We possibly could code generate all > > > > > > CFG manipulations of all vectorized loops, recompute all dominators > > > > > > and then do code generation of all vectorized loops. 
> > > > > > > > > > > > But then we're doing a loop transform and the exits will > > > > > > ultimatively > > > > > > end up in the same place, so the CFG and dominator update is bound > > > > > > to > > > > > > where the original exits went to. > > > > > > > > > > Yeah that's a fair point, the issue is specifically with at_exit. So > > > > > how about: > > > > > > > > > > When we peel at_exit we are moving the new loop at the exit of the > previous > > > > > loop. This means that the blocks outside the loop dat the previous > > > > > loop > used to > > > > > dominate are no longer being dominated by it. > > > > > > > > Hmm, indeed. Note this does make the dominator update O(function-size) > > > > and when vectorizing multiple loops in a function this becomes > > > > quadratic. That's quite unfortunate so I wonder if we can delay the > > > > update to the parts we do not need up-to-date dominators during > > > > vectorization (of course it gets fragile with having only partly > > > > correct dominators). > > > > > > Fair, I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113290 and > > > will > > > tackle it when I add SLP support in GCC 15. > > > > > > I think the problem is, and the reason we do early dominator correction > > > and > > > validation is because the same function is used by loop distribution. > > > > > > But you're right that during vectorization we perform dominators update > > > twice > > > now. > > > > We're performing it at least once per multi-exit loop that is vectorized, > > covering all downstream blocks. 
> > That is, consider sth like > > int a[77]; > > int bar (); > void foo () > { > int val; > #define LOOP \ > val = bar (); \ > for (int i = 0; i < 77; ++i) \ > { \ > if (a[i] == val) \ > break; \ > a[i]++; \ > } > #define LOOP10 LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP > #define LOOP100 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10 > LOOP10 > LOOP10 LOOP10 > #define LOOP1000 LOOP100 LOOP100 LOOP100 LOOP100 LOOP100 LOOP100 > LOOP100 > LOOP100 LOOP100 LOOP100 > LOOP1000 > } >
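Each LOOP instance in the quoted example is a multi-exit search loop of the kind being peeled. A standalone, hypothetical reduction (names invented) showing the two exits whose destination blocks the dominator update has to fix up, with effects happening in program order:

```c
#include <assert.h>

/* Search for VAL; every non-matching element is also stored to, so the
   loop has an early exit and a latch exit with a side effect on the way.  */
static int
find_or_bump (int *a, int n, int val)
{
  for (int i = 0; i < n; ++i)
    {
      if (a[i] == val)
        return i;   /* early exit */
      a[i]++;       /* store that must not move past the break */
    }
  return -1;        /* latch exit */
}
```

When many such loops are vectorized in one function, updating dominators for all blocks downstream of each loop's exits is what makes the update quadratic.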
RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]
> -Original Message- > From: Richard Biener > Sent: Tuesday, January 9, 2024 12:26 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with > multiple exits [PR113144] > > On Tue, 9 Jan 2024, Tamar Christina wrote: > > > > This makes it quadratic in the number of vectorized early exit loops > > > in a function. The vectorizer CFG manipulation operates in a local > > > enough bubble that programmatic updating of dominators should be > > > possible (after all we manage to produce correct SSA form!), the > > > proposed change gets us too far off to a point where re-computating > > > dominance info is likely cheaper (but no, we shouldn't do this either). > > > > > > Can you instead give manual updating a try again? I think > > > versioning should produce up-to-date dominator info, it's only > > > when you redirect branches during peeling that you'd need > > > adjustments - but IIRC we're never introducing new merges? > > > > > > IIRC we can't wipe dominators during transform since we query them > > > during code generation. We possibly could code generate all > > > CFG manipulations of all vectorized loops, recompute all dominators > > > and then do code generation of all vectorized loops. > > > > > > But then we're doing a loop transform and the exits will ultimatively > > > end up in the same place, so the CFG and dominator update is bound to > > > where the original exits went to. > > > > Yeah that's a fair point, the issue is specifically with at_exit. So how > > about: > > > > When we peel at_exit we are moving the new loop at the exit of the previous > > loop. This means that the blocks outside the loop dat the previous loop > > used to > > dominate are no longer being dominated by it. > > Hmm, indeed. Note this does make the dominator update O(function-size) > and when vectorizing multiple loops in a function this becomes > quadratic. 
That's quite unfortunate so I wonder if we can delay the > update to the parts we do not need up-to-date dominators during > vectorization (of course it gets fragile with having only partly > correct dominators). Fair, I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113290 and will tackle it when I add SLP support in GCC 15. I think the problem, and the reason we do the early dominator correction and validation, is that the same function is used by loop distribution. But you're right that during vectorization we perform the dominator update twice now. So maybe we should have a parameter to indicate whether dominators should be updated? Thanks, Tamar > > > The new dominators however are hard to predict since if the loop has > > multiple > > exits and all the exits are an "early" one then we always execute the scalar > > loop. In this case the scalar loop can completely dominate the new loop. > > > > If we later have skip_vector then there's an additional skip edge added that > > might change the dominators. > > > > The previous patch would force an update of all blocks reachable from the > > new > > exits. This one updates *only* blocks that we know the scalar exits > > dominated. > > > > For the examples this reduces the blocks to update from 18 to 3. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu > > and no issues normally and with --enable-checking=release --enable-lto > > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. > > > > Ok for master? > > See below. > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR tree-optimization/113144 > > PR tree-optimization/113145 > > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): > > Update all BB that the original exits dominated. > > > > gcc/testsuite/ChangeLog: > > > > PR tree-optimization/113144 > > PR tree-optimization/113145 > > * gcc.dg/vect/vect-early-break_94-pr113144.c: New test. 
> > > > --- inline copy of patch --- > > > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c > b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c > > new file mode 100644 > > index > ..903fe7be6621e81db6f294 > 41e4309fa213d027c5 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c > > @@ -0,0 +1,41 @@ > > +/* { dg-do compile
RE: [PATCH]Arm: Update early-break tests to accept thumb output too.
> > 3f40b2a241953 100644 > > --- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c > > +++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c > > @@ -16,8 +16,12 @@ int b[N] = {0}; > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > If we want to be a bit fancy, I think the scan syntax allows to add a target > selector, > you should be able to do > ** | { target_thumb } > ** cbnz... > I tried, but it looks like this doesn't work because the | is not a TCL feature, so the contents of the conditional match gets interpreted as regexpr: body: .*\tvcgt.s32 q[0-9]+, q[0-9]+, #0 \tvpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ \tvpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ \tvmov r[0-9]+, s[0-9]+@ int (?:\tcmpr[0-9]+, #0 \tbne \.L[0-9]+ \t| { target_thumb } \tcbnz r[0-9]+, \.L.+ ).* > Ok for trunk with or without that change. Will commit without, Thanks, Tamar > Thanks, > Kyrill > > > ** ... > > */ > > void f1 () > > @@ -37,8 +41,12 @@ void f1 () > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > ** ... > > */ > > void f2 () > > @@ -58,8 +66,12 @@ void f2 () > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > ** ... > > */ > > void f3 () > > @@ -80,8 +92,12 @@ void f3 () > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > ** ... 
> > */ > > void f4 () > > @@ -101,8 +117,12 @@ void f4 () > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > ** ... > > */ > > void f5 () > > @@ -122,8 +142,12 @@ void f5 () > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ > > ** vmovr[0-9]+, s[0-9]+@ int > > +** ( > > ** cmp r[0-9]+, #0 > > ** bne \.L[0-9]+ > > +** | > > +** cbnzr[0-9]+, \.L.+ > > +** ) > > ** ... > > */ > > void f6 () > > > > > > > > > > --
[PATCH]Arm: Update early-break tests to accept thumb output too.
Hi All, The tests I recently added for early break fail in thumb mode because in thumb mode `cbz/cbnz` exist and so the cmp+branch is fused. This updates the testcases to accept either output. Tested on arm-none-linux-gnueabihf with -mthumb/-marm. Ok for master? Thanks, Tamar gcc/testsuite/ChangeLog: * gcc.target/arm/vect-early-break-cbranch.c: Accept thumb output. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c index f57bbd8be428d75dcf35aa194b5892fe04124cf6..d5c6d56ec869b8fa868acb78d4c3f40b2a241953 100644 --- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c +++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c @@ -16,8 +16,12 @@ int b[N] = {0}; ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... */ void f1 () @@ -37,8 +41,12 @@ void f1 () ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... */ void f2 () @@ -58,8 +66,12 @@ void f2 () ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... */ void f3 () @@ -80,8 +92,12 @@ void f3 () ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... */ void f4 () @@ -101,8 +117,12 @@ void f4 () ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... 
*/ void f5 () @@ -122,8 +142,12 @@ void f5 () ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+ ** vmovr[0-9]+, s[0-9]+@ int +** ( ** cmp r[0-9]+, #0 ** bne \.L[0-9]+ +** | +** cbnzr[0-9]+, \.L.+ +** ) ** ... */ void f6 () --
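The f1-style functions being tested reduce to a loop like the following (a hypothetical reduction; N and the condition mirror the testcase shape). On A32 the vectorized early break ends in cmp+bne, while Thumb-2 fuses the compare-with-zero and branch into cbnz; the C semantics are identical either way, which is why the test accepts both:

```c
#include <assert.h>

#define N 640
int a[N];   /* zero-initialized, as in the testcase */

/* Early-break reduction: does any element satisfy the condition?  */
static int
any_positive (void)
{
  for (int i = 0; i < N; i++)
    if (a[i] > 0)
      return 1;
  return 0;
}
```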
RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]
> This makes it quadratic in the number of vectorized early exit loops > in a function. The vectorizer CFG manipulation operates in a local > enough bubble that programmatic updating of dominators should be > possible (after all we manage to produce correct SSA form!), the > proposed change gets us too far off to a point where re-computing > dominance info is likely cheaper (but no, we shouldn't do this either). > > Can you instead give manual updating a try again? I think > versioning should produce up-to-date dominator info, it's only > when you redirect branches during peeling that you'd need > adjustments - but IIRC we're never introducing new merges? > > IIRC we can't wipe dominators during transform since we query them > during code generation. We possibly could code generate all > CFG manipulations of all vectorized loops, recompute all dominators > and then do code generation of all vectorized loops. > > But then we're doing a loop transform and the exits will ultimately > end up in the same place, so the CFG and dominator update is bound to > where the original exits went to. Yeah that's a fair point, the issue is specifically with at_exit. So how about: When we peel at_exit we are moving the new loop to the exit of the previous loop. This means that the blocks outside the loop that the previous loop used to dominate are no longer being dominated by it. The new dominators however are hard to predict since if the loop has multiple exits and all the exits are "early" ones then we always execute the scalar loop. In this case the scalar loop can completely dominate the new loop. If we later have skip_vector then there's an additional skip edge added that might change the dominators. The previous patch would force an update of all blocks reachable from the new exits. This one updates *only* blocks that we know the scalar exits dominated. For the examples this reduces the blocks to update from 18 to 3. 
Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues normally and with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113144 PR tree-optimization/113145 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Update all BB that the original exits dominated. gcc/testsuite/ChangeLog: PR tree-optimization/113144 PR tree-optimization/113145 * gcc.dg/vect/vect-early-break_94-pr113144.c: New test. --- inline copy of patch --- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c new file mode 100644 index ..903fe7be6621e81db6f29441e4309fa213d027c5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c @@ -0,0 +1,41 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +long tar_atol256_max, tar_atol256_size, tar_atosl_min; +char tar_atol256_s; +void __errno_location(); + + +inline static long tar_atol256(long min) { + char c; + int sign; + c = tar_atol256_s; + sign = c; + while (tar_atol256_size) { +if (c != sign) + return sign ? min : tar_atol256_max; +c = tar_atol256_size--; + } + if ((c & 128) != (sign & 128)) +return sign ? 
min : tar_atol256_max; + return 0; +} + +inline static long tar_atol(long min) { + return tar_atol256(min); +} + +long tar_atosl() { + long n = tar_atol(-1); + if (tar_atosl_min) { +__errno_location(); +return 0; + } + if (n > 0) +return 0; + return n; +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 76d4979c0b3b374dcaacf6825a95a8714114a63b..9bacaa182a3919cae1cb99dfc5ae4923e1f93376 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1719,8 +1719,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, /* Now link the alternative exits. */ if (multiple_exits_p) { - set_immediate_dominator (CDI_DOMINATORS, new_preheader, - main_loop_exit_block); for (auto gsi_from = gsi_start_phis (loop->header), gsi_to = gsi_start_phis (new_preheader); !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to); @@ -1776,7 +1774,14 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, { update_loop = new_loop; for (edge e : get_loop_exit_edges (loop)) - doms.safe_push (e->dest); + { + /* Basic blocks that the old loop dominated are now dominated by +the new loop and so we have to update those. */
RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]
> > - > > - gimple_seq_add_seq (&stmts, tem); > > - > > - scalar_res = gimple_build (&stmts, CFN_EXTRACT_LAST, scalar_type, > > - mask, vec_lhs_phi); > > + scalar_res = gimple_build (&stmts, CFN_VEC_EXTRACT, TREE_TYPE > (vectype), > > + vec_lhs_phi, bitstart); > > So bitstart is always zero? I wonder why use CFN_VEC_EXTRACT over > BIT_FIELD_REF here, which wouldn't need any additional target support. > Ok, how about... --- I was generating the vector reverse mask without checking if the target actually supported such an operation. This patch changes it so that if the bitstart is 0 we use BIT_FIELD_REF instead to extract the first element, since this is supported by all targets. This is good for now since masks always come from whilelo. But in the future when masks can come from other sources we will need the old code back. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. Tested on cross cc1 for amdgcn-amdhsa and the issue is fixed. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113199 * tree-vect-loop.cc (vectorizable_live_operation_1): Use BIT_FIELD_REF. gcc/testsuite/ChangeLog: PR tree-optimization/113199 * gcc.target/gcn/pr113199.c: New test. 
--- inline copy of patch --- diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c b/gcc/testsuite/gcc.target/gcn/pr113199.c new file mode 100644 index ..8a641e5536e80e207ca0163cac66c0f4f6ca93f7 --- /dev/null +++ b/gcc/testsuite/gcc.target/gcn/pr113199.c @@ -0,0 +1,44 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2" } */ + +typedef long unsigned int size_t; +typedef int wchar_t; +struct tm +{ + int tm_mon; + int tm_year; +}; +int abs (int); +struct lc_time_T { const char *month[12]; }; +struct __locale_t * __get_current_locale (void) { } +const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { } +const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) { return buf; } +size_t +__strftime (wchar_t *s, size_t maxsize, const wchar_t *format, + const struct tm *tim_p, struct __locale_t *locale) +{ + size_t count = 0; + const wchar_t *ctloc; + wchar_t ctlocbuf[256]; + size_t i, ctloclen; + const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale); +{ + switch (*format) + { + case L'B': + (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon], )); + for (i = 0; i < ctloclen; i++) + { + if (count < maxsize - 1) + s[count++] = ctloc[i]; + else + return 0; + { + int century = tim_p->tm_year >= 0 +? 
tim_p->tm_year / 100 + 1900 / 100 +: abs (tim_p->tm_year + 1900) / 100; + } + } + } +} +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 37f1be1101ffae779214056a0886411e0683e887..39b1161309d8ff8bfe88ee26df9147df0af0a58c 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -10592,7 +10592,17 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, gimple_seq stmts = NULL; tree new_tree; - if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + + /* If bitstart is 0 then we can use a BIT_FIELD_REF */ + if (integer_zerop (bitstart)) +{ + tree scalar_res = gimple_build (, BIT_FIELD_REF, TREE_TYPE (vectype), + vec_lhs_phi, bitsize, bitstart); + + /* Convert the extracted vector element to the scalar type. */ + new_tree = gimple_convert (, lhs_type, scalar_res); +} + else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { /* Emit: @@ -10618,12 +10628,6 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, tree last_index = gimple_build (, PLUS_EXPR, TREE_TYPE (len), len, bias_minus_one); - /* This needs to implement extraction of the first index, but not sure -how the LEN stuff works. At the moment we shouldn't get here since -there's no LEN support for early breaks. But guard this so there's -no incorrect codegen. */ - gcc_assert (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo)); - /* SCALAR_RES = VEC_EXTRACT . */ tree scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE (vectype), @@ -10648,32 +10652,6 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, _VINFO_MASKS (loop_vinfo), 1, vectype, 0); tree scalar_res; - - /* For an inverted control flow with early breaks we want EXTRACT_FIRST -instead of EXTRACT_LAST. Emulate by reversing the vector and mask. */ - if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) - { - /* First create the permuted mask. */ - tree perm_mask =
RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]
> -Original Message- > From: Richard Biener > Sent: Monday, January 8, 2024 12:48 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: check if target can do extract first for > early breaks > [PR113199] > > On Tue, 2 Jan 2024, Tamar Christina wrote: > > > Hi All, > > > > I was generating the vector reverse mask without checking if the target > > actually supported such an operation. > > > > It also seems like more targets implement VEC_EXTRACT than permute on mask > > registers. > > > > So this adds a check for IFN_VEC_EXTRACT support when required and changes > > the select first code to use it. > > > > This is good for now since masks always come from whilelo. But in the > > future > > when masks can come from other sources we will need the old code back. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu > > and no issues with --enable-checking=release --enable-lto > > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. > > tested on cross cc1 for amdgcn-amdhsa and issue fixed. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR tree-optimization/113199 > > * tree-vect-loop.cc (vectorizable_live_operation_1): Use > > IFN_VEC_EXTRACT. > > (vectorizable_live_operation): Check for IFN_VEC_EXTRACT support. > > > > gcc/testsuite/ChangeLog: > > > > PR tree-optimization/113199 > > * gcc.target/gcn/pr113199.c: New test. 
> > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c > b/gcc/testsuite/gcc.target/gcn/pr113199.c > > new file mode 100644 > > index > ..8a641e5536e80e207ca01 > 63cac66c0f4f6ca93f7 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/gcn/pr113199.c > > @@ -0,0 +1,44 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O2" } */ > > + > > +typedef long unsigned int size_t; > > +typedef int wchar_t; > > +struct tm > > +{ > > + int tm_mon; > > + int tm_year; > > +}; > > +int abs (int); > > +struct lc_time_T { const char *month[12]; }; > > +struct __locale_t * __get_current_locale (void) { } > > +const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { } > > +const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) { > return buf; } > > +size_t > > +__strftime (wchar_t *s, size_t maxsize, const wchar_t *format, > > + const struct tm *tim_p, struct __locale_t *locale) > > +{ > > + size_t count = 0; > > + const wchar_t *ctloc; > > + wchar_t ctlocbuf[256]; > > + size_t i, ctloclen; > > + const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale); > > +{ > > + switch (*format) > > + { > > + case L'B': > > + (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon], > )); > > + for (i = 0; i < ctloclen; i++) > > + { > > + if (count < maxsize - 1) > > + s[count++] = ctloc[i]; > > + else > > + return 0; > > + { > > + int century = tim_p->tm_year >= 0 > > +? 
tim_p->tm_year / 100 + 1900 / 100 > > +: abs (tim_p->tm_year + 1900) / 100; > > + } > > + } > > + } > > +} > > +} > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > index > 37f1be1101ffae779214056a0886411e0683e887..5aa92e67444e7aacf458fffa14 > 28f1983c482374 100644 > > --- a/gcc/tree-vect-loop.cc > > +++ b/gcc/tree-vect-loop.cc > > @@ -10648,36 +10648,18 @@ vectorizable_live_operation_1 (loop_vec_info > loop_vinfo, > > _VINFO_MASKS (loop_vinfo), > > 1, vectype, 0); > >tree scalar_res; > > + gimple_seq_add_seq (, tem); > > > >/* For an inverted control flow with early breaks we want > > EXTRACT_FIRST > > -instead of EXTRACT_LAST. Emulate by reversing the vector and mask. */ > > +instead of EXTRACT_LAST. For now since the mask always comes from a > > +WHILELO we can get the first element ignoring the mask since CLZ of the > > +mask will always be zero. */ > >if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) > > - { > > - /* First create the permuted mask. */ >
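The "CLZ of the mask will always be zero" argument above relies on the shape of a WHILELO predicate: its active lanes always form a prefix. A hedged scalar model (an illustration, not SVE intrinsics) makes the invariant checkable — whenever the mask is nonzero, lane 0 is active, so extracting element 0 while ignoring the mask is safe:

```c
#include <assert.h>

/* Model of a WHILELO predicate over LANES lanes: lane l is active
   while i + l < n, so active lanes are always a contiguous low prefix.  */
static unsigned
whilelo_mask (unsigned i, unsigned n, unsigned lanes)
{
  unsigned m = 0;
  for (unsigned l = 0; l < lanes; ++l)
    if (i + l < n)
      m |= 1u << l;
  return m;
}
```

Once masks can come from other sources (e.g. arbitrary comparisons), this prefix property no longer holds, which is why the thread notes the old permuted-mask code would be needed back.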
RE: [PATCH]middle-end: maintain LCSSA form when peeled vector iterations have virtual operands
> -Original Message- > From: Richard Biener > Sent: Monday, January 8, 2024 12:38 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: maintain LCSSA form when peeled vector > iterations have virtual operands > > On Fri, 29 Dec 2023, Tamar Christina wrote: > > > Hi All, > > > > This patch fixes several interconnected issues. > > > > 1. When picking an exit we wanted to check for niter_desc.may_be_zero not > true. > >i.e. we want to pick an exit which we know will iterate at least once. > >However niter_desc.may_be_zero is not a boolean. It is a tree that > > encodes > >a boolean value. !niter_desc.may_be_zero is just checking if we have > > some > >information, not what the information is. This leads us to pick a more > >difficult to vectorize exit more often than we should. > > > > 2. Because we had this bug, we used to pick an alternative exit much more > > often > >which showed one issue, when the loop accesses memory and we "invert it" > > we > >would corrupt the VUSE chain. This is because on a peeled vector > > iteration > >every exit restarts the loop (i.e. they're all early) BUT since we may > > have > >performed a store, the vUSE would need to be updated. This version > > maintains > >virtual PHIs correctly in these cases. Note that we can't simply > > remove all > >of them and recreate them because we need the PHI nodes still in the > > right > >order for if skip_vector. > > > > 3. Since we're moving the stores to a safe location I don't think we > > actually > >need to analyze whether the store is in range of the memref, because if > > we > >ever get there, we know that the loads must be in range, and if the > > loads are > >in range and we get to the store we know the early breaks were not taken > > and > >so the scalar loop would have done the VF stores too. > > > > 4. Instead of searching for where to move stores to, they should always be > > in > >the exit belonging to the latch. 
We can only ever delay stores and even if > > we >pick a different exit than the latch one as the main one, effects still >happen in program order when vectorized. If we don't move the stores to > the >latch exit but instead to wherever we pick as the "main" exit then we can >perform incorrect memory accesses (luckily these are trapped by > > verify_ssa). > > > > 5. We only used to analyze loads inside the same BB as an early break, and > > also >we'd never analyze the ones inside the block where we'd be moving memory >references to. This is obviously bogus and to fix it this patch splits > > apart >the two constraints. We first validate that all load memory references > > are >in bounds and only after that do we perform the alias checks for the > > writes. >This makes the code simpler to understand and more trivially correct. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu > > and no issues with --enable-checking=release --enable-lto > > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR tree-optimization/113137 > > PR tree-optimization/113136 > > PR tree-optimization/113172 > > * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): > > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): > > (vect_do_peeling): Maintain virtual PHIs on inverted loops. > > * tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closest to > > latch. > > (vect_create_loop_vinfo): Record all conds instead of only alt ones. > > * tree-vectorizer.h: Fix comment > > > > gcc/testsuite/ChangeLog: > > > > PR tree-optimization/113137 > > PR tree-optimization/113136 > > PR tree-optimization/113172 > > * g++.dg/vect/vect-early-break_4-pr113137.cc: New test. > > * g++.dg/vect/vect-early-break_5-pr113137.cc: New test. > > * gcc.dg/vect/vect-early-break_95-pr113137.c: New test. 
> > * gcc.dg/vect/vect-early-break_96-pr113136.c: New test. > > * gcc.dg/vect/vect-early-break_97-pr113172.c: New test. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/g++.dg/vec
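Issue 1 above is easy to reproduce in miniature: a pointer/tree-valued field is truthy whenever it is present, so negating it tests "do we have a condition at all" rather than what the condition says. The sketch below is a hypothetical stand-alone analogue — the struct and helpers are stand-ins, not GCC's real niter_desc or integer_zerop:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a tree-valued may_be_zero: NULL means "no information",
   otherwise it points at an encoded 0/1 condition.  */
struct niter_sim
{
  const int *may_be_zero;
};

/* The buggy check from the patch description: only tests presence,
   so a known-false condition is treated the same as a known-true one.  */
static int
buggy_iterates_once (const struct niter_sim *d)
{
  return !d->may_be_zero;
}

/* The intent: the exit is known to iterate at least once only when the
   condition is present AND known false (compare integer_zerop in GCC).  */
static int
fixed_iterates_once (const struct niter_sim *d)
{
  return d->may_be_zero != NULL && *d->may_be_zero == 0;
}
```

With this model, the buggy predicate rejects exits whose may_be_zero is a known-false condition, which is exactly the "pick a more difficult to vectorize exit more often than we should" behaviour described above.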
RE: [PATCH]middle-end: rejects loops with nonlinear inductions and early breaks [PR113163]
> -Original Message- > From: Richard Biener > Sent: Monday, January 8, 2024 12:07 PM > To: Tamar Christina > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com > Subject: Re: [PATCH]middle-end: rejects loops with nonlinear inductions and > early > breaks [PR113163] > > On Fri, 29 Dec 2023, Tamar Christina wrote: > > > Hi All, > > > > We can't support nonlinear inductions other than neg when vectorizing > > early breaks and iteration count is known. > > > > For early break we currently require a peeled epilog but in these cases > > we can't compute the remaining values. > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > tested on cross cc1 for amdgcn-amdhsa and issue fixed. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > PR middle-end/113163 > > * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): > > Misses sth. > > > gcc/testsuite/ChangeLog: > > > > PR middle-end/113163 > > * gcc.target/gcn/pr113163.c: New test. > > > > --- inline copy of patch -- > > diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c > b/gcc/testsuite/gcc.target/gcn/pr113163.c > > new file mode 100644 > > index > ..99b0fdbaf3a3152ca008b5 > 109abf6e80d8cb3d6a > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/gcn/pr113163.c > > @@ -0,0 +1,30 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O2 -ftree-vectorize" } */ > > + > > +struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; }; > > +static const char R64_ARRAY[] = > "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" > ; > > +char * > > +_l64a_r (struct _reent *rptr, > > + long value) > > +{ > > + char *ptr; > > + char *result; > > + int i, index; > > + unsigned long tmp = (unsigned long)value & 0x; > > + result = > > + (( > > + rptr > > + )->_new._reent._l64a_buf) > > + ; > > + ptr = result; > > + for (i = 0; i < 6; ++i) > > +{ > > + if (tmp == 0) > > + { > > + *ptr = '\0'; > > + break; > > + } > > + *ptr++ = 
R64_ARRAY[index]; > > + tmp >>= 6; > > +} > > +} > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > > index > 3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c18 > 6302fbeaf515be8cf 100644 > > --- a/gcc/tree-vect-loop-manip.cc > > +++ b/gcc/tree-vect-loop-manip.cc > > @@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info > loop_vinfo, > >return false; > > } > > > > + /* We can't support partial vectors and early breaks with an induction > > + type other than add or neg since we require the epilog and can't > > + perform the peeling. PR113163. */ > > + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo) > > + && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant () > > But why's that only for constant VF? We might never end up here > with variable VF but the check looks odd ... It's mirroring the condition in vect_gen_vector_loop_niters where we create step_vector which is not 1. This is the case which causes niters_vector_mult_vf_var to become a tree var instead. I'll update the comment to say this. Thanks, Tamar > > OK with that clarified and/or the test removed. > > Thanks, > Richard. > > > + && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > > + && induction_type != vect_step_op_neg) > > +{ > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > +"Peeling for epilogue is not supported" > > +" for nonlinear induction except neg" > > +" when iteration count is known and early breaks.\n"); > > + return false; > > +} > > + > >return true; > > } > > > > > > > > > > > > > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
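The shift induction in the testcase (`tmp >>= 6`) illustrates why only certain induction kinds survive peeling: a linear IV has an affine closed form, so the epilogue's starting value is directly computable, while a shift IV is not affine in the iteration count. The toy scalars below are only an illustration — the real decision lives in vect_can_peel_nonlinear_iv_p:

```c
#include <assert.h>

/* A linear IV after n steps has the affine closed form
   start + n * step, which is what the peeling arithmetic relies on.  */
static unsigned long
linear_iv_after (unsigned long start, unsigned long step, unsigned n)
{
  return start + (unsigned long) n * step;
}

/* A shift IV like the testcase's `tmp >>= 6` has no start + n * step
   form; its value after n steps must be derived differently.  */
static unsigned long
shift_iv_after (unsigned long start, unsigned n)
{
  while (n--)
    start >>= 6;
  return start;
}
```
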
RE: [PATCH] tree-optimization/113026 - avoid vector epilog in more cases
> -Original Message- > From: Richard Biener > Sent: Monday, January 8, 2024 11:29 AM > To: gcc-patches@gcc.gnu.org > Cc: Tamar Christina > Subject: [PATCH] tree-optimization/113026 - avoid vector epilog in more cases > > The following avoids creating a niter peeling epilog more consistently, > matching what peeling later uses for the skip_vector condition, in > particular when versioning is required which then also ensures the > vector loop is entered unless the epilog is vectorized. This should > ideally match LOOP_VINFO_VERSIONING_THRESHOLD which is only computed > later, some refactoring could make that better matching. > > The patch also makes sure to adjust the upper bound of the epilogues > when we do not have a skip edge around the vector loop. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Tamar, does > that look OK wrt early-breaks? Yeah the value looks correct, I did find a few cases where the niters should actually be higher for skip_vector, namely when one of the breaks forces ncopies > 1 and we have a break condition that requires all values to be true to continue. The code is not wrong in that case, it just executes a completely useless vector iteration. But that's unrelated, this looks correct because it means bound_scalar is not set, in which case there's no difference between one and multiple exits. Thanks, Tamar > > Thanks, > Richard. > > PR tree-optimization/113026 > * tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p): > Avoid an epilog in more cases. > * tree-vect-loop-manip.cc (vect_do_peeling): Adjust the > epilogues niter upper bounds and estimates. > > * gcc.dg/torture/pr113026-1.c: New testcase. > * gcc.dg/torture/pr113026-2.c: Likewise. 
> --- > gcc/testsuite/gcc.dg/torture/pr113026-1.c | 11 > gcc/testsuite/gcc.dg/torture/pr113026-2.c | 18 + > gcc/tree-vect-loop-manip.cc | 32 +++ > gcc/tree-vect-loop.cc | 6 - > 4 files changed, 66 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.dg/torture/pr113026-1.c > create mode 100644 gcc/testsuite/gcc.dg/torture/pr113026-2.c > > diff --git a/gcc/testsuite/gcc.dg/torture/pr113026-1.c > b/gcc/testsuite/gcc.dg/torture/pr113026-1.c > new file mode 100644 > index 000..56dfef3b36c > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/torture/pr113026-1.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-Wall" } */ > + > +char dst[16]; > + > +void > +foo (char *src, long n) > +{ > + for (long i = 0; i < n; i++) > +dst[i] = src[i]; /* { dg-bogus "" } */ > +} > diff --git a/gcc/testsuite/gcc.dg/torture/pr113026-2.c > b/gcc/testsuite/gcc.dg/torture/pr113026-2.c > new file mode 100644 > index 000..b9d5857a403 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/torture/pr113026-2.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-Wall" } */ > + > +char dst1[17]; > +void > +foo1 (char *src, long n) > +{ > + for (long i = 0; i < n; i++) > +dst1[i] = src[i]; /* { dg-bogus "" } */ > +} > + > +char dst2[18]; > +void > +foo2 (char *src, long n) > +{ > + for (long i = 0; i < n; i++) > +dst2[i] = src[i]; /* { dg-bogus "" } */ > +} > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > index 9330183bfb9..927f76a0947 100644 > --- a/gcc/tree-vect-loop-manip.cc > +++ b/gcc/tree-vect-loop-manip.cc > @@ -3364,6 +3364,38 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > niters, tree nitersm1, > bb_before_epilog->count = single_pred_edge (bb_before_epilog)->count > (); > bb_before_epilog = loop_preheader_edge (epilog)->src; > } > + else > + { > + /* When we do not have a loop-around edge to the epilog we know > + the vector loop covered at least VF scalar iterations unless > + we have early breaks and the 
epilog will cover at most > + VF - 1 + gap peeling iterations. > + Update any known upper bound with this knowledge. */ > + if (! LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) > + { > + if (epilog->any_upper_bound) > + epilog->nb_iterations_upper_bound -= lowest_vf; > + if (epilog->any_likely_upper_bound) > + epilog->nb_iterations_likely_upper_bound -= lowest_vf; > + if (epilog->any_estimate) > + epilog->nb_iterations_estimate -= lowest_vf; > + } > + unsigned HOST_WIDE_INT const_vf; > + if
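The bound adjustment in the hunk above can be sanity-checked with scalar arithmetic: once the vector loop is known to run (no loop-around edge), the epilogue executes n mod VF iterations, which is bounded both by VF - 1 and by any previously known bound minus VF. The helper below is a toy model only — it ignores the early-break and gap-peeling cases the patch also handles:

```c
#include <assert.h>

/* Scalar model of the epilogue bound: with the vector loop entered,
   the epilogue handles the remainder iterations.  */
static unsigned
epilogue_iters (unsigned n, unsigned vf)
{
  return n % vf;
}
```
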
[PATCH][frontend]: don't ice with pragma NOVECTOR if loop in C has no condition [PR113267]
Hi All, In C you can have loops without a condition. The original version of the patch was rejecting the use of #pragma GCC novector on them, however during review it was changed to not do this, with the reason that we didn't want to give a compile error in such cases. However because annotations seem to only be allowed on conditions (unless I'm mistaken?) the attached example ICEs because there's no condition. This will have it ignore the pragma instead of ICEing. I don't know if this is the best solution, but as far as I can tell we can't attach the annotation to anything else. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/c/ChangeLog: PR c/113267 * c-parser.cc (c_parser_for_statement): Skip the pragma if there is no cond. gcc/testsuite/ChangeLog: PR c/113267 * gcc.dg/pr113267.c: New test. --- inline copy of patch -- diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index c3724304580cf54f52655e10d2697c68966b9a17..e8300cea8ef7cedead5871e40c2a9ba5333bf839 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc @@ -8442,7 +8442,7 @@ c_parser_for_statement (c_parser *parser, bool ivdep, unsigned short unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), build_int_cst (integer_type_node, unroll)); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) cond = build3 (ANNOTATE_EXPR, TREE_TYPE (cond), cond, build_int_cst (integer_type_node, annot_expr_no_vector_kind), diff --git a/gcc/testsuite/gcc.dg/pr113267.c b/gcc/testsuite/gcc.dg/pr113267.c new file mode 100644 index ..8b6fa08324eb12ad6493291cca8e80bd3a072ba8 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr113267.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +} -- diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc index c3724304580cf54f52655e10d2697c68966b9a17..e8300cea8ef7cedead5871e40c2a9ba5333bf839 100644 --- a/gcc/c/c-parser.cc +++ b/gcc/c/c-parser.cc 
@@ -8442,7 +8442,7 @@ c_parser_for_statement (c_parser *parser, bool ivdep, unsigned short unroll, build_int_cst (integer_type_node, annot_expr_unroll_kind), build_int_cst (integer_type_node, unroll)); - if (novector && cond != error_mark_node) + if (novector && cond && cond != error_mark_node) cond = build3 (ANNOTATE_EXPR, TREE_TYPE (cond), cond, build_int_cst (integer_type_node, annot_expr_no_vector_kind), diff --git a/gcc/testsuite/gcc.dg/pr113267.c b/gcc/testsuite/gcc.dg/pr113267.c new file mode 100644 index ..8b6fa08324eb12ad6493291cca8e80bd3a072ba8 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr113267.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ + +void f (char *a, int i) +{ +#pragma GCC novector + for (;;i++) +a[i] *= 2; +}
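The shape of the one-line fix is the classic "guard a possibly-absent value" idiom: `for (;; i++)` parses with a NULL condition, and the pre-patch check only excluded error_mark_node. The toy model below uses strings as hypothetical stand-ins for GCC's trees and ANNOTATE_EXPR building:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for GCC's error_mark_node.  */
static const char *error_marker = "<error>";

/* Toy analogue of the parser fix: annotate the condition only when one
   exists.  Pre-patch the check was just `cond != error_marker`, which
   then "annotated" the NULL cond produced by `for (;; i++)`.  */
static const char *
maybe_annotate (const char *cond)
{
  if (cond != NULL && cond != error_marker)
    return "ANNOTATE(cond)";  /* simplified: would build ANNOTATE_EXPR */
  return cond;                /* no (valid) condition: pragma is ignored */
}
```
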
Re: [PATCH]middle-end: thread through existing LCSSA variable for alternative exits too [PR113237]
No, that error is fixed by some earlier patches sent early last week that are awaiting review :) From: Toon Moene Sent: Sunday, January 7, 2024 7:11 PM To: gcc-patches@gcc.gnu.org Subject: Re: [PATCH]middle-end: thread through existing LCSSA variable for alternative exits too [PR113237] On 1/7/24 18:29, Tamar Christina wrote: > gcc/ChangeLog: > >PR tree-optimization/113237 >* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use >existing LCSSA variable for exit when all exits are early break. Might that be the same error as I got here when building with bootstrap-lto and bootstrap-O3: https://gcc.gnu.org/pipermail/gcc-testresults/2024-January/804807.html ? -- Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
[PATCH]middle-end: thread through existing LCSSA variable for alternative exits too [PR113237]
Hi All, Building on top of the previous patch: similar to when we have a single exit, if we have a case where all exits are considered early exits and there are existing non-virtual PHIs, then in order to maintain LCSSA we have to use the existing PHI variables. We can't simply clear them and just rebuild them because the order of the PHIs in the main exit must match the original exit for when we add the skip_epilog guard. But the infrastructure is already in place to maintain them, we just have to use the right value. Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues normally and with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/113237 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use existing LCSSA variable for exit when all exits are early break. gcc/testsuite/ChangeLog: PR tree-optimization/113237 * gcc.dg/vect/vect-early-break_98-pr113237.c: New test. 
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c new file mode 100644 index ..e6d150b571f753e9eb3859f06f62b371817494a3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +long Perl_pp_split_limit; +int Perl_block_gimme(); +int Perl_pp_split() { + char strend; + long iters; + int gimme = Perl_block_gimme(); + while (--Perl_pp_split_limit) { +if (gimme) + iters++; +if (strend) + break; + } + if (iters) +return 0; +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 7fd6566341b4893a1e209d1f8ff65d6d180f1190..77649b84f45b9e5dacec2809e0c854c8fcc17ce1 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1700,7 +1700,12 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, if (peeled_iters && !virtual_operand_p (new_arg)) { tree tmp_arg = gimple_phi_result (from_phi); - if (!new_phi_args.get (tmp_arg)) + /* Similar to the single exit case, If we have an existing +LCSSA variable thread through the original value otherwise +skip it and directly use the final value. 
*/ + if (tree *res = new_phi_args.get (tmp_arg)) + new_arg = *res; + else new_arg = tmp_arg; } -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c new file mode 100644 index ..e6d150b571f753e9eb3859f06f62b371817494a3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +long Perl_pp_split_limit; +int Perl_block_gimme(); +int Perl_pp_split() { + char strend; + long iters; + int gimme = Perl_block_gimme(); + while (--Perl_pp_split_limit) { +if (gimme) + iters++; +if (strend) + break; + } + if (iters) +return 0; +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 7fd6566341b4893a1e209d1f8ff65d6d180f1190..77649b84f45b9e5dacec2809e0c854c8fcc17ce1 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1700,7 +1700,12 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, if (peeled_iters && !virtual_operand_p (new_arg)) { tree tmp_arg = gimple_phi_result (from_phi); - if (!new_phi_args.get (tmp_arg)) + /* Similar to the single exit case, If we have an existing +LCSSA variable thread through the original value otherwise +skip it and directly use the final value. */ + if (tree *res = new_phi_args.get (tmp_arg)) + new_arg = *res; + else new_arg = tmp_arg; }
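The hunk's logic — reuse an existing LCSSA name when the map already has one, otherwise fall back to the value itself — can be modelled without any GCC infrastructure. The flat-array map below is a hypothetical stand-in for GCC's hash_map keyed on trees:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct map_entry
{
  const char *key;  /* original SSA name, e.g. the PHI result */
  const char *val;  /* its existing LCSSA copy at the exit */
};

/* Toy model of the fix: thread through the existing LCSSA variable
   when the map has one, else use the final value directly.  */
static const char *
pick_phi_arg (const struct map_entry *map, size_t n, const char *tmp_arg)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (map[i].key, tmp_arg) == 0)
      return map[i].val;  /* existing LCSSA var: reuse it */
  return tmp_arg;         /* no LCSSA copy: use the value itself */
}
```

The pre-patch code effectively discarded the looked-up value (it only tested whether the entry existed), which is what broke LCSSA in PR113237.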
RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]
> On Fri, 2024-01-05 at 11:02 +0000, Tamar Christina wrote: > > Ok, so something like: > > > > > > ([istarget loongarch*-*-*] && > > > > ([check_effective_target_loongarch_sx] || > > > > [check_effective_target_hard_float])) > > ? > > We don't need "[check_effective_target_loongarch_sx] ||" because SIMD > requires hard float. > Cool, thanks! -- Hi All, currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr. The latter has a libcall fallback and the IFN can only do optabs. Because of this the change I made to optimize copysign only works if the target has implemented the optab, but it should work for those that have the libcall too. More annoyingly if a target has vector versions of ABS and NEG but not COPYSIGN then the change made them lose vectorization. The proper fix for this is to treat the IFN the same as the tree EXPR and to enhance expand_COPYSIGN to also support vector calls. I have such a patch for GCC 15 but it's quite big and too invasive for stage-4. As such this is a minimal fix, just don't apply the transformation and leave targets which don't have the optab unoptimized. The target list for check_effective_target_ifn_copysign was obtained by grepping for copysign and looking at the optab. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Tests ran in x86_64-pc-linux-gnu -m32 and tests no longer fail. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/112468 * doc/sourcebuild.texi: Document ifn_copysign. * match.pd: Only apply transformation if target supports the IFN. gcc/testsuite/ChangeLog: PR tree-optimization/112468 * gcc.dg/fold-copysign-1.c: Modify tests based on if target supports IFN_COPYSIGN. * gcc.dg/pr55152-2.c: Likewise. * gcc.dg/tree-ssa/abs-4.c: Likewise. * gcc.dg/tree-ssa/backprop-6.c: Likewise. * gcc.dg/tree-ssa/copy-sign-2.c: Likewise. * gcc.dg/tree-ssa/mult-abs-2.c: Likewise. * lib/target-supports.exp (check_effective_target_ifn_copysign): New. 
--- inline copy of patch --- diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi index 4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de34905f614ef6957658b4 100644 --- a/gcc/doc/sourcebuild.texi +++ b/gcc/doc/sourcebuild.texi @@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a SIMD instruction set. @item xorsign Target supports the xorsign optab expansion. +@item ifn_copysign +Target supports the IFN_COPYSIGN optab expansion for both scalar and vector +types. + @end table @subsubsection Environment attributes diff --git a/gcc/match.pd b/gcc/match.pd index d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b614890dc4729b521 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (copysigns @0 REAL_CST@1) (if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) - (abs @0 + (abs @0) +#if GIMPLE + (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type, +OPTIMIZE_FOR_BOTH)) +(negate (abs @0))) +#endif + ))) +#if GIMPLE /* Transform fneg (fabs (X)) -> copysign (X, -1). */ (simplify (negate (abs @0)) - (IFN_COPYSIGN @0 { build_minus_one_cst (type); })) - + (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type, + OPTIMIZE_FOR_BOTH)) + (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))) +#endif /* copysign(copysign(x, y), z) -> copysign(x, z). 
*/ (for copysigns (COPYSIGN_ALL) (simplify diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c b/gcc/testsuite/gcc.dg/fold-copysign-1.c index f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef39cc8f6e442ce 100644 --- a/gcc/testsuite/gcc.dg/fold-copysign-1.c +++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c @@ -1,5 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-O -fdump-tree-cddce1" } */ +/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* x86_64-*-* } && ilp32 } } } */ double foo (double x) { @@ -12,5 +13,7 @@ double bar (double x) return __builtin_copysign (x, minuszero); } -/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" } } */ -/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" } } */ +/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" { target ifn_copysign } } } */ +/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" { target ifn_copysign } } } */ +/* { dg-final { scan-tree-dump-times "= -" 1 "cddce1" { target { ! ifn_copysign } } } } */ +/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 2 "cddce1" { target { ! ifn_copysign } } } } */ diff --git a/gc
RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]
> -Original Message- > From: Xi Ruoyao > Sent: Thursday, January 4, 2024 10:39 PM > To: Palmer Dabbelt ; Tamar Christina > > Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de; Jeff Law > > Subject: Re: [PATCH]middle-end: Don't apply copysign optimization if target > does > not implement optab [PR112468] > > On Thu, 2024-01-04 at 14:32 -0800, Palmer Dabbelt wrote: > > > +proc check_effective_target_ifn_copysign { } { > > > + return [check_cached_effective_target_indexed ifn_copysign { > > > + expr { > > > + (([istarget i?86-*-*] || [istarget x86_64-*-*]) > > > + && [is-effective-target sse]) > > > + || ([istarget loongarch*-*-*] && [check_effective_target_loongarch_sx]) > > LoongArch has [scalar FP copysign][1] too. > > [1]:https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1- > EN.html#_fscaleblogbcopysign_sd Ok, so something like: || ([istarget loongarch*-*-*] && ([check_effective_target_loongarch_sx] || [check_effective_target_hard_float])) ? > > > > + || ([istarget powerpc*-*-*] > > > + && ![istarget powerpc-*-linux*paired*]) > > > + || [istarget alpha*-*-*] > > > + || [istarget aarch64*-*-*] > > > + || [is-effective-target arm_neon] > > > + || ([istarget s390*-*-*] > > > + && [check_effective_target_s390_vx]) > > > + || ([istarget riscv*-*-*] > > > + && [check_effective_target_riscv_v]) > > > > Unless I'm missing something, we have copysign in the scalar > > floating-point ISAs as well. So I think this should be > > > > || ([istarget riscv*-*-*] > > && [check_effective_target_hard_float]) > Ah cool, will update it in next version. Thanks, Tamar > -- > Xi Ruoyao > School of Aerospace Science and Technology, Xidian University
[PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]
Hi All, currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr. The latter has a libcall fallback and the IFN can only do optabs. Because of this the change I made to optimize copysign only works if the target has implemented the optab, but it should work for those that have the libcall too. More annoyingly if a target has vector versions of ABS and NEG but not COPYSIGN then the change made them lose vectorization. The proper fix for this is to treat the IFN the same as the tree EXPR and to enhance expand_COPYSIGN to also support vector calls. I have such a patch for GCC 15 but it's quite big and too invasive for stage-4. As such this is a minimal fix, just don't apply the transformation and leave targets which don't have the optab unoptimized. The target list for check_effective_target_ifn_copysign was obtained by grepping for copysign and looking at the optab. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Tests ran in x86_64-pc-linux-gnu -m64/-m32 and tests no longer fail. Ok for master? Thanks, Tamar gcc/ChangeLog: PR tree-optimization/112468 * doc/sourcebuild.texi: Document ifn_copysign. * match.pd: Only apply transformation if target supports the IFN. gcc/testsuite/ChangeLog: PR tree-optimization/112468 * gcc.dg/fold-copysign-1.c: Modify tests based on if target supports IFN_COPYSIGN. * gcc.dg/pr55152-2.c: Likewise. * gcc.dg/tree-ssa/abs-4.c: Likewise. * gcc.dg/tree-ssa/backprop-6.c: Likewise. * gcc.dg/tree-ssa/copy-sign-2.c: Likewise. * gcc.dg/tree-ssa/mult-abs-2.c: Likewise. * lib/target-supports.exp (check_effective_target_ifn_copysign): New. --- inline copy of patch -- diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi index 4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de34905f614ef6957658b4 100644 --- a/gcc/doc/sourcebuild.texi +++ b/gcc/doc/sourcebuild.texi @@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a SIMD instruction set. 
@item xorsign Target supports the xorsign optab expansion. +@item ifn_copysign +Target supports the IFN_COPYSIGN optab expansion for both scalar and vector +types. + @end table @subsubsection Environment attributes diff --git a/gcc/match.pd b/gcc/match.pd index d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b614890dc4729b521 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (simplify (copysigns @0 REAL_CST@1) (if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1))) - (abs @0 + (abs @0) +#if GIMPLE + (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type, +OPTIMIZE_FOR_BOTH)) +(negate (abs @0))) +#endif + ))) +#if GIMPLE /* Transform fneg (fabs (X)) -> copysign (X, -1). */ (simplify (negate (abs @0)) - (IFN_COPYSIGN @0 { build_minus_one_cst (type); })) - + (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type, + OPTIMIZE_FOR_BOTH)) + (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))) +#endif /* copysign(copysign(x, y), z) -> copysign(x, z). 
*/ (for copysigns (COPYSIGN_ALL) (simplify diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c b/gcc/testsuite/gcc.dg/fold-copysign-1.c index f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef39cc8f6e442ce 100644 --- a/gcc/testsuite/gcc.dg/fold-copysign-1.c +++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c @@ -1,5 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-O -fdump-tree-cddce1" } */ +/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* x86_64-*-* } && ilp32 } } } */ double foo (double x) { @@ -12,5 +13,7 @@ double bar (double x) return __builtin_copysign (x, minuszero); } -/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" } } */ -/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" } } */ +/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" { target ifn_copysign } } } */ +/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" { target ifn_copysign } } } */ +/* { dg-final { scan-tree-dump-times "= -" 1 "cddce1" { target { ! ifn_copysign } } } } */ +/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 2 "cddce1" { target { ! ifn_copysign } } } } */ diff --git a/gcc/testsuite/gcc.dg/pr55152-2.c b/gcc/testsuite/gcc.dg/pr55152-2.c index 605f202ed6bc7aa8fe921457b02ff0b88cc63ce6..24068cffa4a8e2807ba7d16c4ed3def4f736e797 100644 --- a/gcc/testsuite/gcc.dg/pr55152-2.c +++ b/gcc/testsuite/gcc.dg/pr55152-2.c @@ -1,5 +1,6 @@ /* { dg-do compile } */ /* { dg-options "-O -ffinite-math-only -fno-signed-zeros -fstrict-overflow -fdump-tree-optimized" } */ +/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* x86_64-*-* } && ilp32 } } } */ double g (double a) { @@ -10,5 +11,6 @@ int f(int a) return (a<-a)?a:-a; } -/* { dg-final {
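The match.pd hunks above both rest on the identity that fneg (fabs (x)) and copysign (x, -1) compute the same value, including for signed zeros. A quick numeric check of that identity (libm calls only; nothing here is GCC's internal API):

```c
#include <assert.h>
#include <math.h>

/* The two forms the pattern converts between.  */
static double via_neg_abs (double x)  { return -fabs (x); }
static double via_copysign (double x) { return copysign (x, -1.0); }
```

Both always produce x with the sign bit set, which is why the transformation is valid in either direction and the patch can simply pick whichever form the target can expand.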
RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
> -Original Message- > From: Kyrylo Tkachov > Sent: Thursday, January 4, 2024 11:12 AM > To: Tamar Christina ; gcc-patches@gcc.gnu.org > Cc: nd ; Ramana Radhakrishnan > ; Richard Earnshaw > ; ni...@redhat.com > Subject: RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation > > Hi Tamar, > > > -----Original Message- > > From: Tamar Christina > > Sent: Thursday, January 4, 2024 11:06 AM > > To: Tamar Christina ; gcc-patches@gcc.gnu.org > > Cc: nd ; Ramana Radhakrishnan > > ; Richard Earnshaw > > ; ni...@redhat.com; Kyrylo Tkachov > > > > Subject: RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation > > > > Ping, > > > > --- > > > > Hi All, > > > > This adds an implementation for conditional branch optab for AArch32. > > The previous version only allowed operand 0 but it looks like cbranch > > expansion does not check with the target and so we have to implement all. > > > > I therefore did not commit it. This is a larger version. I've also dropped > > the MVE > > version because the mid-end can rewrite the comparison into comparing two > > predicates without checking with the backend. Since MVE only has 1 > > predicate > > register this would need to go through memory and two MRS calls. It's > > unlikely > > to be beneficial and so that's for GCC 15 when I can fix the middle-end. > > > > The cases where AArch32 is skipped in the testsuite are all > > missed-optimizations > > due to AArch32 missing some optabs. > > Does the testsuite have vect_* checks that can be used instead of target arm*? > If so let's use those. Unfortunately not, a lot of them center around handling of complex doubles. Some tests work and some fail, which makes it hard to disable based on a target effective test. They are things that look easy to fix so I may file some tickets for them. Cheers, Tamar > Otherwise it's okay as is. > Thanks, > Kyrill > > > > > For e.g. 
> > > > void f1 () > > { > > for (int i = 0; i < N; i++) > > { > > b[i] += a[i]; > > if (a[i] > 0) > > break; > > } > > } > > > > For 128-bit vectors we generate: > > > > vcgt.s32q8, q9, #0 > > vpmax.u32 d7, d16, d17 > > vpmax.u32 d7, d7, d7 > > vmovr3, s14 @ int > > cmp r3, #0 > > > > and of 64-bit vector we can omit one vpmax as we still need to compress to > > 32-bits. > > > > Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues. > > > > Ok for master? > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > * config/arm/neon.md (cbranch4): New. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/vect/vect-early-break_2.c: Skip Arm. > > * gcc.dg/vect/vect-early-break_7.c: Likewise. > > * gcc.dg/vect/vect-early-break_75.c: Likewise. > > * gcc.dg/vect/vect-early-break_77.c: Likewise. > > * gcc.dg/vect/vect-early-break_82.c: Likewise. > > * gcc.dg/vect/vect-early-break_88.c: Likewise. > > * lib/target-supports.exp (add_options_for_vect_early_break, > > check_effective_target_vect_early_break_hw, > > check_effective_target_vect_early_break): Support AArch32. > > * gcc.target/arm/vect-early-break-cbranch.c: New test. > > > > --- inline version of patch --- > > > > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md > > index > > > d213369ffc38fb88ad0357d848cc7da5af73bab7..ed659ab736862da416d1ff624 > 1d > > 0d3e6c6b96ff1 100644 > > --- a/gcc/config/arm/neon.md > > +++ b/gcc/config/arm/neon.md > > @@ -408,6 +408,55 @@ (define_insn "vec_extract" > >[(set_attr "type" "neon_store1_one_lane,neon_to_gp")] > > ) > > > > +;; Patterns comparing two vectors and conditionally jump. > > +;; Avdanced SIMD lacks a vector != comparison, but this is a quite common > > +;; operation. To not pay the penalty for inverting == we can map our any > > +;; comparisons to all i.e. any(~x) => all(x). 
> > +;; > > +;; However unlike the AArch64 version, we can't optimize this further as > > the > > +;; chain is too long for combine due to these being unspecs so it doesn't > > fold > > +;; the operation to something simpler. > > +(define_expand "cbranch4" > > + [(set (pc) (if_then_else > > + (match_operator 0 "expandable_comparison_operator" > >
RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
Ping, --- Hi All, This adds an implementation of the conditional branch optab for AArch32. The previous version only allowed operand 0 but it looks like cbranch expansion does not check with the target and so we have to implement all. I therefore did not commit it. This is a larger version. I've also dropped the MVE version because the mid-end can rewrite the comparison into comparing two predicates without checking with the backend. Since MVE only has 1 predicate register this would need to go through memory and two MRS calls. It's unlikely to be beneficial and so that's for GCC 15 when I can fix the middle-end. The cases where AArch32 is skipped in the testsuite are all missed-optimizations due to AArch32 missing some optabs. For example: void f1 () { for (int i = 0; i < N; i++) { b[i] += a[i]; if (a[i] > 0) break; } } For 128-bit vectors we generate: vcgt.s32q8, q9, #0 vpmax.u32 d7, d16, d17 vpmax.u32 d7, d7, d7 vmovr3, s14 @ int cmp r3, #0 and for 64-bit vectors we can omit one vpmax as we still need to compress to 32 bits. Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * config/arm/neon.md (cbranch4): New. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-early-break_2.c: Skip Arm. * gcc.dg/vect/vect-early-break_7.c: Likewise. * gcc.dg/vect/vect-early-break_75.c: Likewise. * gcc.dg/vect/vect-early-break_77.c: Likewise. * gcc.dg/vect/vect-early-break_82.c: Likewise. * gcc.dg/vect/vect-early-break_88.c: Likewise. * lib/target-supports.exp (add_options_for_vect_early_break, check_effective_target_vect_early_break_hw, check_effective_target_vect_early_break): Support AArch32. * gcc.target/arm/vect-early-break-cbranch.c: New test. 
--- inline version of patch --- diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md index d213369ffc38fb88ad0357d848cc7da5af73bab7..ed659ab736862da416d1ff6241d0d3e6c6b96ff1 100644 --- a/gcc/config/arm/neon.md +++ b/gcc/config/arm/neon.md @@ -408,6 +408,55 @@ (define_insn "vec_extract" [(set_attr "type" "neon_store1_one_lane,neon_to_gp")] ) +;; Patterns comparing two vectors and conditionally jump. +;; Avdanced SIMD lacks a vector != comparison, but this is a quite common +;; operation. To not pay the penalty for inverting == we can map our any +;; comparisons to all i.e. any(~x) => all(x). +;; +;; However unlike the AArch64 version, we can't optimize this further as the +;; chain is too long for combine due to these being unspecs so it doesn't fold +;; the operation to something simpler. +(define_expand "cbranch4" + [(set (pc) (if_then_else + (match_operator 0 "expandable_comparison_operator" + [(match_operand:VDQI 1 "register_operand") + (match_operand:VDQI 2 "reg_or_zero_operand")]) + (label_ref (match_operand 3 "" "")) + (pc)))] + "TARGET_NEON" +{ + rtx mask = operands[1]; + + /* If comparing against a non-zero vector we have to do a comparison first + so we can have a != 0 comparison with the result. */ + if (operands[2] != CONST0_RTX (mode)) +{ + mask = gen_reg_rtx (mode); + emit_insn (gen_xor3 (mask, operands[1], operands[2])); +} + + /* For 128-bit vectors we need an additional reductions. */ + if (known_eq (128, GET_MODE_BITSIZE (mode))) +{ + /* Always reduce using a V4SI. 
*/ + mask = gen_reg_rtx (V2SImode); + rtx low = gen_reg_rtx (V2SImode); + rtx high = gen_reg_rtx (V2SImode); + rtx op1 = lowpart_subreg (V4SImode, operands[1], mode); + emit_insn (gen_neon_vget_lowv4si (low, op1)); + emit_insn (gen_neon_vget_highv4si (high, op1)); + emit_insn (gen_neon_vpumaxv2si (mask, low, high)); +} + + rtx op1 = lowpart_subreg (V2SImode, mask, GET_MODE (mask)); + emit_insn (gen_neon_vpumaxv2si (op1, op1, op1)); + + rtx val = gen_reg_rtx (SImode); + emit_move_insn (val, gen_lowpart (SImode, mask)); + emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, operands[3])); + DONE; +}) + ;; This pattern is renamed from "vec_extract" to ;; "neon_vec_extract" and this pattern is called ;; by define_expand in vec-common.md file. diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c index 5c32bf94409e9743e72429985ab3bf13aab8f2c1..dec0b492ab883de6e02944a95fd554a109a68a39 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c @@ -5,7 +5,7 @@ /* { dg-additional-options "-Ofast" } */ -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! "arm*-*-*" } } } } */ #include diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
[PATCH]middle-end: check if target can do extract first for early breaks [PR113199]
Hi All,

I was generating the vector reverse mask without checking whether the target actually supports such an operation. It also seems that more targets implement VEC_EXTRACT than permutes on mask registers. So this adds a check for IFN_VEC_EXTRACT support when required and changes the select-first code to use it.

This is good for now since masks always come from whilelo. But in the future, when masks can come from other sources, we will need the old code back.

Bootstrapped and regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra. Also tested with a cross cc1 for amdgcn-amdhsa; the issue is fixed.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/113199
	* tree-vect-loop.cc (vectorizable_live_operation_1): Use
	IFN_VEC_EXTRACT.
	(vectorizable_live_operation): Check for IFN_VEC_EXTRACT support.

gcc/testsuite/ChangeLog:

	PR tree-optimization/113199
	* gcc.target/gcn/pr113199.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c b/gcc/testsuite/gcc.target/gcn/pr113199.c new file mode 100644 index ..8a641e5536e80e207ca0163cac66c0f4f6ca93f7 --- /dev/null +++ b/gcc/testsuite/gcc.target/gcn/pr113199.c @@ -0,0 +1,44 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2" } */ + +typedef long unsigned int size_t; +typedef int wchar_t; +struct tm +{ + int tm_mon; + int tm_year; +}; +int abs (int); +struct lc_time_T { const char *month[12]; }; +struct __locale_t * __get_current_locale (void) { } +const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { } +const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) { return buf; } +size_t +__strftime (wchar_t *s, size_t maxsize, const wchar_t *format, + const struct tm *tim_p, struct __locale_t *locale) +{ + size_t count = 0; + const wchar_t *ctloc; + wchar_t ctlocbuf[256]; + size_t i, ctloclen; + const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale); +{ + switch (*format) + { + case L'B': + (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon], )); + for (i = 0; i < ctloclen; i++) + { + if (count < maxsize - 1) + s[count++] = ctloc[i]; + else + return 0; + { + int century = tim_p->tm_year >= 0 +? tim_p->tm_year / 100 + 1900 / 100 +: abs (tim_p->tm_year + 1900) / 100; + } + } + } +} +} diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 37f1be1101ffae779214056a0886411e0683e887..5aa92e67444e7aacf458fffa1428f1983c482374 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -10648,36 +10648,18 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo, _VINFO_MASKS (loop_vinfo), 1, vectype, 0); tree scalar_res; + gimple_seq_add_seq (, tem); /* For an inverted control flow with early breaks we want EXTRACT_FIRST -instead of EXTRACT_LAST. Emulate by reversing the vector and mask. */ +instead of EXTRACT_LAST. 
For now since the mask always comes from a +WHILELO we can get the first element ignoring the mask since CLZ of the +mask will always be zero. */ if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) - { - /* First create the permuted mask. */ - tree perm_mask = perm_mask_for_reverse (TREE_TYPE (mask)); - tree perm_dest = copy_ssa_name (mask); - gimple *perm_stmt - = gimple_build_assign (perm_dest, VEC_PERM_EXPR, mask, - mask, perm_mask); - vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt, - ); - mask = perm_dest; - - /* Then permute the vector contents. */ - tree perm_elem = perm_mask_for_reverse (vectype); - perm_dest = copy_ssa_name (vec_lhs_phi); - perm_stmt - = gimple_build_assign (perm_dest, VEC_PERM_EXPR, vec_lhs_phi, - vec_lhs_phi, perm_elem); - vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt, - ); - vec_lhs_phi = perm_dest; - } - - gimple_seq_add_seq (, tem); - - scalar_res = gimple_build (, CFN_EXTRACT_LAST, scalar_type, -mask, vec_lhs_phi); + scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE (vectype), + vec_lhs_phi, bitstart); + else + scalar_res = gimple_build (, CFN_EXTRACT_LAST, scalar_type, + mask, vec_lhs_phi); /* Convert the extracted vector element to the scalar type. */ new_tree = gimple_convert (, lhs_type, scalar_res); @@ -10852,9
RE: skip vector profiles multiple exits
> -----Original Message-----
> From: Jan Hubicka
> Sent: Friday, December 29, 2023 10:32 PM
> To: Tamar Christina
> Cc: rguent...@suse.de; GCC Patches ; nd
> Subject: Re: skip vector profiles multiple exits
>
> > Hi Honza,
> Hi,
> >
> > I wasn't sure what to do here so I figured I'd ask.
> >
> > In adding support for multiple exits to the vectorizer I didn't know how
> > to update this bit:
> >
> > https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363
> >
> > Essentially, if skip_vector (i.e. not enough iterations to enter the
> > vector loop) then the previous code would update the new probability to
> > be the same as that of the exit edge. This made sense because that's the
> > only edge which could bring you to the next loop preheader.
> >
> > With multiple exits this is no longer the case since any exit can bring
> > you to the preheader node. I figured the new counts should simply be the
> > sum of all exit edges. But that gives quite large count values compared
> > to the rest of the loop.
> The sum of all exit counts (not probabilities) relative to the header count
> should give you the estimated probability that the loop iterates at any
> given iteration. I am not sure how good an estimate this is for loop
> preconditioning to be true (without profile histograms it is really hard
> to tell).

Happy new year! Ah, so I need to subtract the loop header from the sum? I'll try

> > I then thought I would need to scale the counts by the probability of the
> > edge being taken. The problem here is that the probabilities don't add up
> > to 100%
> So you are summing exit_edge->count ()?
> I am not sure how useful summing probabilities would be since they are
> conditional (relative to the probability of entering the BB you go to).
> How complicated a CFG do we now handle with vectorization?

Yeah, I was trying to sum the edge counts.
The CFG can get quite complicated because we allow vectorization of an arbitrary number of exits, as long as each exit leaves the loop body. In the current version we force everything to the scalar epilog, so the merge block can now get any number of incoming edges. Aside from this we still support versioning and skip_epilog, so you have additional edges coming in from there too.

Regards,
Tamar

> Honza
> > so the scaled counts also looked kinda wonky. Any suggestions?
> >
> > If you want some small examples to look at, testcases
> > ./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to
> > ./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
> > should be relevant here.
> >
> > Thanks,
> > Tamar
skip vector profiles multiple exits
Hi Honza,

I wasn't sure what to do here so I figured I'd ask.

In adding support for multiple exits to the vectorizer I didn't know how to update this bit:

https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363

Essentially, if skip_vector (i.e. not enough iterations to enter the vector loop) then the previous code would update the new probability to be the same as that of the exit edge. This made sense because that's the only edge which could bring you to the next loop preheader.

With multiple exits this is no longer the case, since any exit can bring you to the preheader node. I figured the new counts should simply be the sum of all exit edges. But that gives quite large count values compared to the rest of the loop.

I then thought I would need to scale the counts by the probability of the edge being taken. The problem here is that the probabilities don't add up to 100%, so the scaled counts also looked kinda wonky. Any suggestions?

If you want some small examples to look at, testcases ./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to ./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c should be relevant here.

Thanks,
Tamar
[PATCH]middle-end: maintain LCSSA form when peeled vector iterations have virtual operands
Hi All,

This patch fixes several interconnected issues.

1. When picking an exit we wanted to check that niter_desc.may_be_zero is not true, i.e. we want to pick an exit which we know will iterate at least once. However niter_desc.may_be_zero is not a boolean; it is a tree that encodes a boolean value. !niter_desc.may_be_zero just checks whether we have some information, not what the information is. This led us to pick a more-difficult-to-vectorize exit more often than we should.

2. Because we had this bug, we used to pick an alternative exit much more often, which exposed another issue: when the loop accesses memory and we "invert it", we would corrupt the VUSE chain. This is because on a peeled vector iteration every exit restarts the loop (i.e. they're all early), BUT since we may have performed a store, the VUSE would need to be updated. This version maintains virtual PHIs correctly in these cases. Note that we can't simply remove all of them and recreate them, because we need the PHI nodes still in the right order for skip_vector.

3. Since we're moving the stores to a safe location, I don't think we actually need to analyze whether the store is in range of the memref, because if we ever get there we know that the loads must be in range, and if the loads are in range and we get to the store we know the early breaks were not taken, so the scalar loop would have done the VF stores too.

4. Instead of searching for where to move stores to, they should always be in the exit belonging to the latch. We can only ever delay stores, and even if we pick a different exit than the latch one as the main one, effects still happen in program order when vectorized. If we don't move the stores to the latch exit but instead to whichever exit we pick as the "main" one, then we can perform incorrect memory accesses (luckily these are trapped by verify_ssa).

5.
We only used to analyze loads inside the same BB as an early break, and we'd never analyze the ones inside the block where we'd be moving memory references to. This is obviously bogus, and to fix it this patch splits apart the two constraints: we first validate that all load memory references are in bounds, and only after that do we perform the alias checks for the writes. This makes the code simpler to understand and more trivially correct.

Bootstrapped and regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues with --enable-checking=release --enable-lto --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR tree-optimization/113137
	PR tree-optimization/113136
	PR tree-optimization/113172
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
	* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
	(vect_do_peeling): Maintain virtual PHIs on inverted loops.
	* tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closest
	to latch.
	(vect_create_loop_vinfo): Record all conds instead of only alt ones.
	* tree-vectorizer.h: Fix comment.

gcc/testsuite/ChangeLog:

	PR tree-optimization/113137
	PR tree-optimization/113136
	PR tree-optimization/113172
	* g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
	* g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
	* gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
	* gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
	* gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
--- inline copy of patch -- diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc new file mode 100644 index ..f78db8669dcc65f1b45ea78f4433d175e1138332 --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +int b; +void a() __attribute__((__noreturn__)); +void c() { + char *buf; + int bufsz = 64; + while (b) { +!bufsz ? a(), 0 : *buf++ = bufsz--; +b -= 4; + } +} diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc new file mode 100644 index ..dcd19fa2d2145e09de18279479b3f20fc27336ba --- /dev/null +++ b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +char UnpackReadTables_BitLength[20]; +int UnpackReadTables_ZeroCount; +void UnpackReadTables() { + for (unsigned I = 0; I < 20;) +while
[PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]
Hi All, Only trying to update certain dominators doesn't seem to work very well because as the loop gets versioned, peeled, or skip_vector then we end up with very complicated control flow. This means that the final merge blocks for the loop exit are not easy to find or update. Instead of trying to pick which exits to update, this changes it to update all the blocks reachable by the new exits. This is because they'll contain common blocks with e.g. the versioned loop. It's these blocks that need an update most of the time. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: PR middle-end/113144 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Update all dominators reachable from exit. gcc/testsuite/ChangeLog: PR middle-end/113144 * gcc.dg/vect/vect-early-break_94-pr113144.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c new file mode 100644 index ..903fe7be6621e81db6f29441e4309fa213d027c5 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c @@ -0,0 +1,41 @@ +/* { dg-do compile } */ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break } */ +/* { dg-require-effective-target vect_int } */ + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ + +long tar_atol256_max, tar_atol256_size, tar_atosl_min; +char tar_atol256_s; +void __errno_location(); + + +inline static long tar_atol256(long min) { + char c; + int sign; + c = tar_atol256_s; + sign = c; + while (tar_atol256_size) { +if (c != sign) + return sign ? min : tar_atol256_max; +c = tar_atol256_size--; + } + if ((c & 128) != (sign & 128)) +return sign ? 
min : tar_atol256_max; + return 0; +} + +inline static long tar_atol(long min) { + return tar_atol256(min); +} + +long tar_atosl() { + long n = tar_atol(-1); + if (tar_atosl_min) { +__errno_location(); +return 0; + } + if (n > 0) +return 0; + return n; +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 1066ea17c5674e03412b3dcd8a62ddf4dd54cf31..3810983a80c8b989be9fd9a9993642069fd39b99 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -1716,8 +1716,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, /* Now link the alternative exits. */ if (multiple_exits_p) { - set_immediate_dominator (CDI_DOMINATORS, new_preheader, - main_loop_exit_block); for (auto gsi_from = gsi_start_phis (loop->header), gsi_to = gsi_start_phis (new_preheader); !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to); @@ -1751,12 +1749,26 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, edge loop_exit, /* Finally after wiring the new epilogue we need to update its main exit to the original function exit we recorded. Other exits are already -correct. */ +correct. Because of versioning, skip vectors and others we must update +the dominators of every node reachable by the new exits. */ if (multiple_exits_p) { update_loop = new_loop; - for (edge e : get_loop_exit_edges (loop)) - doms.safe_push (e->dest); + hash_set visited; + auto_vec workset; + edge ev; + edge_iterator ei; + workset.safe_splice (get_loop_exit_edges (loop)); + while (!workset.is_empty ()) + { + auto bb = workset.pop ()->dest; + if (visited.add (bb)) + continue; + doms.safe_push (bb); + FOR_EACH_EDGE (ev, ei, bb->succs) + workset.safe_push (ev); + } + visited.empty (); doms.safe_push (exit_dest); /* Likely a fall-through edge, so update if needed. 
*/
--
[PATCH]middle-end: rejects loops with nonlinear inductions and early breaks [PR113163]
Hi All, We can't support nonlinear inductions other than neg when vectorizing early breaks and iteration count is known. For early break we currently require a peeled epilog but in these cases we can't compute the remaining values. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. tested on cross cc1 for amdgcn-amdhsa and issue fixed. Ok for master? Thanks, Tamar gcc/ChangeLog: PR middle-end/113163 * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): gcc/testsuite/ChangeLog: PR middle-end/113163 * gcc.target/gcn/pr113163.c: New test. --- inline copy of patch -- diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c b/gcc/testsuite/gcc.target/gcn/pr113163.c new file mode 100644 index ..99b0fdbaf3a3152ca008b5109abf6e80d8cb3d6a --- /dev/null +++ b/gcc/testsuite/gcc.target/gcn/pr113163.c @@ -0,0 +1,30 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2 -ftree-vectorize" } */ + +struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; }; +static const char R64_ARRAY[] = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; +char * +_l64a_r (struct _reent *rptr, + long value) +{ + char *ptr; + char *result; + int i, index; + unsigned long tmp = (unsigned long)value & 0x; + result = + (( + rptr + )->_new._reent._l64a_buf) + ; + ptr = result; + for (i = 0; i < 6; ++i) +{ + if (tmp == 0) + { + *ptr = '\0'; + break; + } + *ptr++ = R64_ARRAY[index]; + tmp >>= 6; +} +} diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc index 3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c186302fbeaf515be8cf 100644 --- a/gcc/tree-vect-loop-manip.cc +++ b/gcc/tree-vect-loop-manip.cc @@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo, return false; } + /* We can't support partial vectors and early breaks with an induction + type other than add or neg since we require the epilog and can't + perform the peeling. PR113163. 
*/ + if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo) + && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant () + && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) + && induction_type != vect_step_op_neg) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"Peeling for epilogue is not supported" +" for nonlinear induction except neg" +" when iteration count is known and early breaks.\n"); + return false; +} + return true; } --
[PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
Hi All, This adds an implementation for conditional branch optab for AArch32. The previous version only allowed operand 0 but it looks like cbranch expansion does not check with the target and so we have to implement all. I therefore did not commit it. This is a larger version. For e.g. void f1 () { for (int i = 0; i < N; i++) { b[i] += a[i]; if (a[i] > 0) break; } } For 128-bit vectors we generate: vcgt.s32q8, q9, #0 vpmax.u32 d7, d16, d17 vpmax.u32 d7, d7, d7 vmovr3, s14 @ int cmp r3, #0 and of 64-bit vector we can omit one vpmax as we still need to compress to 32-bits. Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * config/arm/neon.md (cbranch4): New. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-early-break_2.c: Skip Arm. * gcc.dg/vect/vect-early-break_7.c: Likewise. * gcc.dg/vect/vect-early-break_75.c: Likewise. * gcc.dg/vect/vect-early-break_77.c: Likewise. * gcc.dg/vect/vect-early-break_82.c: Likewise. * gcc.dg/vect/vect-early-break_88.c: Likewise. * lib/target-supports.exp (add_options_for_vect_early_break, check_effective_target_vect_early_break_hw, check_effective_target_vect_early_break): Support AArch32. * gcc.target/arm/vect-early-break-cbranch.c: New test. --- inline copy of patch -- diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md index d213369ffc38fb88ad0357d848cc7da5af73bab7..0f088a51d31e6882bc0fabbad99862b8b465dd22 100644 --- a/gcc/config/arm/neon.md +++ b/gcc/config/arm/neon.md @@ -408,6 +408,54 @@ (define_insn "vec_extract" [(set_attr "type" "neon_store1_one_lane,neon_to_gp")] ) +;; Patterns comparing two vectors and conditionally jump. +;; Avdanced SIMD lacks a vector != comparison, but this is a quite common +;; operation. To not pay the penalty for inverting == we can map our any +;; comparisons to all i.e. any(~x) => all(x). 
+;; +;; However unlike the AArch64 version, we can't optimize this further as the +;; chain is too long for combine due to these being unspecs so it doesn't fold +;; the operation to something simpler. +(define_expand "cbranch4" + [(set (pc) (if_then_else + (match_operator 0 "expandable_comparison_operator" + [(match_operand:VDQI 1 "register_operand") + (match_operand:VDQI 2 "reg_or_zero_operand")]) + (label_ref (match_operand 3 "" "")) + (pc)))] + "TARGET_NEON" +{ + rtx mask = operands[1]; + + /* If comparing against a non-zero vector we have to do a comparison first + so we can have a != 0 comparison with the result. */ + if (operands[2] != CONST0_RTX (mode)) +{ + mask = gen_reg_rtx (mode); + emit_insn (gen_xor3 (mask, operands[1], operands[2])); +} + + /* For 128-bit vectors we need an additional reductions. */ + if (known_eq (128, GET_MODE_BITSIZE (mode))) +{ + /* Always reduce using a V4SI. */ + mask = gen_reg_rtx (V2SImode); + rtx low = gen_reg_rtx (V2SImode); + rtx high = gen_reg_rtx (V2SImode); + rtx op1 = simplify_gen_subreg (V4SImode, operands[1], mode, 0); + emit_insn (gen_neon_vget_lowv4si (low, op1)); + emit_insn (gen_neon_vget_highv4si (high, op1)); + emit_insn (gen_neon_vpumaxv2si (mask, low, high)); +} + + emit_insn (gen_neon_vpumaxv2si (mask, mask, mask)); + + rtx val = gen_reg_rtx (SImode); + emit_move_insn (val, gen_lowpart (SImode, mask)); + emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, operands[3])); + DONE; +}) + ;; This pattern is renamed from "vec_extract" to ;; "neon_vec_extract" and this pattern is called ;; by define_expand in vec-common.md file. 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c index 5c32bf94409e9743e72429985ab3bf13aab8f2c1..dec0b492ab883de6e02944a95fd554a109a68a39 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c @@ -5,7 +5,7 @@ /* { dg-additional-options "-Ofast" } */ -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! "arm*-*-*" } } } } */ #include diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c index 8c86c5034d7522b3733543fb384a23c5d6ed0fcf..d218a0686719fee4c167684dcf26402851b53260 100644 --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c @@ -5,7 +5,7 @@ /* { dg-additional-options "-Ofast" } */ -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! "arm*-*-*" } } } } */ #include diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_75.c
[PATCH]AArch64 Update costing for vector conversions [PR110625]
Hi All,

In gimple the operation

short _8;
double _9;
_9 = (double) _8;

denotes two operations: first we have to widen from short to long, and then convert this integer to a double. Currently however we only count the widen/truncate operations:

(double) _5 6 times vec_promote_demote costs 12 in body
(double) _5 12 times vec_promote_demote costs 24 in body

but not the actual conversion operation, which needs an additional 12 instructions in the attached testcase. Without this the attached testcase ends up incorrectly thinking that it's beneficial to vectorize the loop at a very high VF = 8 (4x unrolled).

Because we can't change the mid-end to account for this, the costing code in the backend now keeps track of whether the previous operation was a promotion/demotion and adjusts the expected number of instructions as follows:

1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs are different, double it, since we need to convert and promote.
2. If the previous operation was a demotion/promotion, then reduce the cost of the current operation by the amount we added extra in the last one.

With the patch we get:

(double) _5 6 times vec_promote_demote costs 24 in body
(double) _5 12 times vec_promote_demote costs 36 in body

which correctly accounts for 30 operations.

This fixes the regression reported on Neoverse N2 and with the new generic Armv9-a cost model.

Bootstrapped and regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
	Adjust throughput and latency calculations for vector conversions.
	(class aarch64_vector_costs): Add m_num_last_promote_demote.

gcc/testsuite/ChangeLog:

	PR target/110625
	* gcc.target/aarch64/pr110625_4.c: New test.
	* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add --param
	aarch64-sve-compare-costs=0.
	* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.

--- inline copy of patch --
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index f9850320f61c5ddccf47e6583d304e5f405a484f..561413e52717974b96f79cc83008f237c536 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16077,6 +16077,15 @@ private:
      leaving a vectorization of { elts }.  */
   bool m_stores_to_vector_load_decl = false;
 
+  /* Non-zero if the last operation we costed is a vector promotion or
+     demotion.  In this case the value is the number of insns in the last
+     operation.
+
+     On AArch64 vector promotion and demotions require us to first widen or
+     narrow the input and only after that emit conversion instructions.  For
+     costing this means we need to emit the cost of the final conversions as
+     well.  */
+  unsigned int m_num_last_promote_demote = 0;
+
   /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
      - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
        SIMD code.
@@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
     stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
					       vectype, stmt_cost);
 
+  /* Vector promotion and demotion require us to widen the operation first
+     and only after that perform the conversion.  Unfortunately the mid-end
+     expects this to be doable as a single operation and doesn't pass on
+     enough context here for us to tell which operation is happening.  To
+     account for this we count every promote-demote operation twice and if
+     the previously costed operation was also a promote-demote we reduce
+     the cost of the currently being costed operation to simulate the final
+     conversion cost.  Note that for SVE we can do better here if the
+     converted value comes from a load since the widening load would consume
+     the widening operations.  However since we're in stage 3 we can't change
+     the helper vect_is_extending_load and duplicating the code seems not
+     useful.  */
+  gassign *assign = NULL;
+  if (kind == vec_promote_demote
+      && (assign = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
+      && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
+    {
+      auto new_count = count * 2 - m_num_last_promote_demote;
+      m_num_last_promote_demote = count;
+      count = new_count;
+    }
+  else
+    m_num_last_promote_demote = 0;
+
   if (stmt_info && aarch64_use_new_vector_costs_p ())
     {
       /* Account for any extra "embedded" costs that apply additively
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_4.c b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
new file mode 100644
index ..34dac19d81a85d63706d54f4cb0c738ce592d5d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
@@ -0,0 +1,18 @@
+/* { dg-do compile }
[PATCH][committed]middle-end: explicitly initialize vec_stmts [PR113132]
Hi All,

When configured with --enable-checking=release we get a false positive on the use of vec_stmts, as the compiler seems unable to notice it gets initialized through the pass-by-reference.  This explicitly initializes the local.

Bootstrapped Regtested on x86_64-pc-linux-gnu and no issues.

Committed under the obvious rule.

Thanks,
Tamar

gcc/ChangeLog:

	PR bootstrap/113132
	* tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize
	vec_stmts.

--- inline copy of patch --
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 88261a3a4f57d5e2124939b069b0e92c57d9abba..f51ae3e719e753059389cf9495b6d65b3b1191cb 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6207,7 +6207,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   exit_bb = loop_exit->dest;
   exit_gsi = gsi_after_labels (exit_bb);
   reduc_inputs.create (slp_node ? vec_num : ncopies);
-  vec vec_stmts;
+  vec vec_stmts = vNULL;
   for (unsigned i = 0; i < vec_num; i++)
     {
       gimple_seq stmts = NULL;
[PATCH][testsuite]: Add more pragma novector to new tests
Hi All,

This patch was pre-approved by Richi.

This updates the testsuite and adds more #pragma GCC novector to various tests that would otherwise vectorize the vector result checking code.  This cleans out the testsuite since the last rebase and prepares for the landing of the early break patch.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues.

Pushed to master.

Thanks,
Tamar

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/no-scevccp-slp-30.c: Add pragma GCC novector to abort
	loop.
	* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
	* gcc.dg/vect/no-section-anchors-vect-69.c: Likewise.
	* gcc.target/aarch64/vect-xorsign_exec.c: Likewise.
	* gcc.target/i386/avx512er-vrcp28ps-3.c: Likewise.
	* gcc.target/i386/avx512er-vrsqrt28ps-3.c: Likewise.
	* gcc.target/i386/avx512er-vrsqrt28ps-5.c: Likewise.
	* gcc.target/i386/avx512f-ceil-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-ceil-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-ceilf-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-ceilf-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-floor-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-floor-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-floorf-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-floorf-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-rint-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-rintf-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-round-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-roundf-sfix-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-trunc-vec-1.c: Likewise.
	* gcc.target/i386/avx512f-truncf-vec-1.c: Likewise.
	* gcc.target/i386/vect-alignment-peeling-1.c: Likewise.
	* gcc.target/i386/vect-alignment-peeling-2.c: Likewise.
	* gcc.target/i386/vect-pack-trunc-1.c: Likewise.
	* gcc.target/i386/vect-pack-trunc-2.c: Likewise.
	* gcc.target/i386/vect-perm-even-1.c: Likewise.
	* gcc.target/i386/vect-unpack-1.c: Likewise.
--- inline copy of patch --
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
index 00d0eca56eeca6aee6f11567629dc955c0924c74..534bee4a1669a7cbd95cf6007f28dafd23bab8da 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
@@ -24,9 +24,9 @@ main1 ()
  }
 
   /* check results:  */
-#pragma GCC novector
   for (j = 0; j < N; j++)
     {
+#pragma GCC novector
       for (i = 0; i < N; i++)
         {
           if (out[i*4] != 8
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
index 48b6a9b0681cf1fe410755c3e639b825b27895b0..22817a57ef81398cc018a78597755397d20e0eb9 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
@@ -27,6 +27,7 @@ main1 ()
 #pragma GCC novector
   for (i = 0; i < N; i++)
     {
+#pragma GCC novector
       for (j = 0; j < N; j++)
         {
           if (a[i][j] != 8)
diff --git a/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c b/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
index a0e53d5fef91868dfdbd542dd0a98dff92bd265b..0861d488e134d3f01a2fa83c56eff7174f36ddfb 100644
--- a/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
+++ b/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
@@ -83,9 +83,9 @@ int main1 ()
  }
 
   /* check results:  */
-#pragma GCC novector
   for (i = 0; i < N; i++)
     {
+#pragma GCC novector
       for (j = 0; j < N; j++)
         {
           if (tmp1[2].e.n[1][i][j] != 8)
@@ -103,9 +103,9 @@ int main1 ()
  }
 
   /* check results:  */
-#pragma GCC novector
   for (i = 0; i < N - NINTS; i++)
     {
+#pragma GCC novector
       for (j = 0; j < N - NINTS; j++)
         {
           if (tmp2[2].e.n[1][i][j] != 8)
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c b/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
index cfa22115831272cb1d4e1a38512f10c3a1c6ad77..84f33d3f6cce9b0017fd12ab961019041245ffae 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
@@ -33,6 +33,7 @@ main (void)
     r[i] = a[i] * __builtin_copysignf (1.0f, b[i]);
 
   /* check results:  */
+#pragma GCC novector
   for (i = 0; i < N; i++)
     if (r[i] != a[i] * __builtin_copysignf (1.0f, b[i]))
       abort ();
@@ -41,6 +42,7 @@ main (void)
     rd[i] = ad[i] * __builtin_copysign (1.0d, bd[i]);
 
   /* check results:  */
+#pragma GCC novector
   for (i = 0; i < N; i++)
     if (rd[i] != ad[i] * __builtin_copysign (1.0d, bd[i]))
       abort ();
diff --git a/gcc/testsuite/gcc.target/i386/avx512er-vrcp28ps-3.c b/gcc/testsuite/gcc.target/i386/avx512er-vrcp28ps-3.c
index c0b1f7b31027f9438ab1641d3002887eabd34efa..1e68926a3180fffc6cbc8c6eed639a567fc32566 100644
---
RE: [PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks
> > +  /* If we've moved a VDEF, extract the defining MEM and update
> > +     usages of it.  */
> > +  tree vdef;
> > +  /* This statement is to be moved.  */
> > +  if ((vdef = gimple_vdef (stmt)))
> > +    LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS (loop_vinfo).safe_push (stmt);
>
> I'm also unsure why you need 'chain' at all given you have the vector
> of stores to be moved?
>

Yeah, so originally I wanted to move statements other than stores.  While stores are needed for correctness, the other statements would be moved so we didn't extend the live range too much for intermediate values.  This proved difficult, but eventually I got it to work; as you saw though, it was meh code.  Instead I guess the better approach is to teach sched1 in GCC 15 to schedule across branches in loops.

With that in mind, I changed it to move only stores.  Since stores never produce a value and are sinks, I don't really need fixed nor chain.  So here's a much cleaned up patch.

Bootstrapped Regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu, no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-if-conv.cc (ref_within_array_bound): Expose.
	* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
	(vect_analyze_data_ref_dependences): Use it.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
	early_breaks.
	(move_early_exit_stmts): New.
	(vect_transform_loop): Use it.
	* tree-vect-stmts.cc (vect_is_simple_use): Use vect_early_exit_def.
	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
	(ref_within_array_bound): New.
	(class _loop_vec_info): Add early_breaks, early_break_conflict,
	early_break_vuses.
	(LOOP_VINFO_EARLY_BREAKS): New.
	(LOOP_VINFO_EARLY_BRK_STORES): New.
	(LOOP_VINFO_EARLY_BRK_DEST_BB): New.
	(LOOP_VINFO_EARLY_BRK_VUSES): New.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-early-break_57.c: Update.
	* gcc.dg/vect/vect-early-break_79.c: New test.
	* gcc.dg/vect/vect-early-break_80.c: New test.
	* gcc.dg/vect/vect-early-break_81.c: New test.
	* gcc.dg/vect/vect-early-break_83.c: New test.

--- inline copy of patch ---
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
index be4a0c7426093059ce37a9f824defb7ae270094d..9a4e795f92b7a8577ac71827f5cb0bd15d88ebe1 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
@@ -5,6 +5,7 @@
 /* { dg-additional-options "-Ofast" } */
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "epilog loop required" "vect" } } */
 
 void abort ();
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c
new file mode 100644
index ..a26011ef1ba5aa000692babc90d46621efc2f8b5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#undef N
+#define N 32
+
+unsigned vect_a[N];
+unsigned vect_b[N];
+
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < 1024; i++)
+ {
+   vect_b[i] = x + i;
+   if (vect_a[i] > x)
+     break;
+   vect_a[i] = x;
+
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c
new file mode 100644
index ..ddf504e0c8787ae33a0e98045c1c91f2b9f533a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c
@@ -0,0 +1,43 @@
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+extern void abort ();
+
+int x;
+__attribute__ ((noinline, noipa))
+void foo (int *a, int *b)
+{
+  int local_x = x;
+  for (int i = 0; i < 1024; ++i)
+    {
+      if (i + local_x == 13)
+	break;
+      a[i] = 2 * b[i];
+    }
+}
+
+int main ()
+{
+  int a[1024] = {0};
+  int b[1024] = {0};
+
+  for (int i = 0; i < 1024; i++)
+    b[i] = i;
+
+  x = -512;
+  foo (a, b);
+
+  if (a[524] != 1048)
+    abort ();
+
+  if (a[525] != 0)
+    abort ();
+
+  if (a[1023] != 0)
+    abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c
new file mode 100644
index ..c38e394ad87863f0702d422cb58018b979c9fba6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c
@@ -0,0
RE: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests
> Do you mean for ARM SVE, these tests need to be specified as only ARM SVE ?

I think that would be the right thing to do.  I think these tests are checking if we support VLA SLP.  Changing it to a PASS unconditionally means that if someone runs the testsuite in SVE-only mode they'll fail.

> The difference between RVV and ARM is that: variable-length and fixed-length
> vectors are both valid on RVV, using the same RVV ISA.
> Whereas, for ARM, variable-length vectors use the SVE ISA but fixed-length
> vectors use the NEON ISA.

Ah, that makes sense why you want to remove the check.  I guess whoever added the vect_variable_length intended it to fail when VLA though.  Perhaps these tests need a dg-add-options?  Since I think other tests already test fixed-length vectors.

But let's see what Richi says.

Thanks,
Tamar

From: 钟居哲
Sent: Tuesday, December 19, 2023 1:02 PM
To: Tamar Christina; gcc-patches
Cc: rguenther
Subject: Re: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests

Do you mean for ARM SVE, these tests need to be specified as only ARM SVE ?

Actually, for RVV, it is the same situation as ARM.  We are using VLS modes (fixed-length vectors) to vectorize these cases so that they are XPASS.

The difference between RVV and ARM is that: variable-length and fixed-length vectors are both valid on RVV, using the same RVV ISA.  Whereas, for ARM, variable-length vectors use the SVE ISA but fixed-length vectors use the NEON ISA.
juzhe.zh...@rivai.ai

From: Tamar Christina
Date: 2023-12-19 20:29
To: Juzhe-Zhong; gcc-patches@gcc.gnu.org
CC: rguent...@suse.de
Subject: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests

Hi Juzhe,

> -Original Message-
> From: Juzhe-Zhong
> Sent: Tuesday, December 19, 2023 11:19 AM
> To: gcc-patches@gcc.gnu.org
> Cc: rguent...@suse.de; Tamar Christina; Juzhe-Zhong
> Subject: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some
> tests
>
> Hi, this patch fixes these following regression FAILs on RVV:
>
> XPASS: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> XPASS: gcc.dg/vect/bb-slp-43.c -flto -ffat-lto-objects scan-tree-dump-not slp2
> "vector operands from scalars"
> XPASS: gcc.dg/vect/bb-slp-43.c scan-tree-dump-not slp2 "vector operands from
> scalars"
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c -flto -ffat-lto-objects
> scan-tree-dump-times slp2 "optimized: basic block" 2
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c scan-tree-dump-times slp2 "optimized:
> basic block" 2
>
> Since vect_variable_length are available for ARM SVE and RVV, I just used
> compiler explorer to confirm ARM SVE is the same as RVV.
>
> Hi, @Tamar. Could you double check whether this patch fix is reasonable to
> you ?
>

Hmm, I would be surprised if this is working correctly for RVV, since as far as I know we don't have variable length support in SLP, i.e. SLP can't predicate operations during build, so the current vectorizer only supports fixed-length vector SLP, unless Richi did some magic?

For SVE the reason this XPASSes is that the compiler will fall back to NEON unless it's told it can't.  But that's not actually testing VLA SLP, i.e.
https://godbolt.org/z/5n5fWahxh — just using `+sve` isn't enough and it has to be told it can only use SVE.

Is it perhaps something similar for RVV?  If RVV has a similar param, perhaps the correct fix is to append it to the tests so they XFAIL correctly?

Regards,
Tamar

> And.
>
> Hi, @Richard. Is this patch Ok for trunk if this patch fixes the regression
> for both RVV and ARM SVE.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.dg/tree-ssa/pr84512.c: Remove vect_variable_length XFAIL.
>	* gcc.dg/vect/bb-slp-43.c: Ditto.
>	* gcc.dg/vect/bb-slp-subgroups-3.c: Ditto.
>
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr84512.c        | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-43.c          | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> index 496c78b28dc..3c027012670 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> @@ -13,4 +13,4 @@ int foo()
>  }
>
>  /* Listed targets xfailed due to PR84958.  */
> -/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { amdgcn*-*-* || vect_variable_length } } } } */
> +/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { amdgcn*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c b/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> index dad2d24262d..40bd2e0dfbf 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> @@ -14,4 +14,4 @@ f (int *restrict x, short *restrict y)
>  }
>
>  /* { dg-final { scan-tree-dump-not "mixed mask and nonmask" "slp2" } } */
> -/* { dg-final { scan-tree-dump-not "vector operands from scalars" "slp2" { target { { vect_int && vect_bool_cmp } && { vect_unpack && vect_hw_misalign } } xfail { vect_variable_length && { !
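If the route taken is to pin the targets to genuine VLA codegen so the xfail stays meaningful, the directive could look something like the following.  The option and effective-target spellings here are illustrative assumptions, not a tested change:

```c
/* Force scalable (VLA) SVE codegen so the test cannot fall back to NEON.  */
/* { dg-additional-options "-msve-vector-bits=scalable" { target aarch64_sve } } */
```

An analogous option pinning RVV to variable-length vectors would be needed for the riscv targets.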
RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests
Hi Juzhe,

> -Original Message-
> From: Juzhe-Zhong
> Sent: Tuesday, December 19, 2023 11:19 AM
> To: gcc-patches@gcc.gnu.org
> Cc: rguent...@suse.de; Tamar Christina; Juzhe-Zhong
> Subject: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some
> tests
>
> Hi, this patch fixes these following regression FAILs on RVV:
>
> XPASS: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> XPASS: gcc.dg/vect/bb-slp-43.c -flto -ffat-lto-objects scan-tree-dump-not slp2
> "vector operands from scalars"
> XPASS: gcc.dg/vect/bb-slp-43.c scan-tree-dump-not slp2 "vector operands from
> scalars"
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c -flto -ffat-lto-objects
> scan-tree-dump-times slp2 "optimized: basic block" 2
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c scan-tree-dump-times slp2 "optimized:
> basic block" 2
>
> Since vect_variable_length are available for ARM SVE and RVV, I just used
> compiler explorer to confirm ARM SVE is the same as RVV.
>
> Hi, @Tamar. Could you double check whether this patch fix is reasonable to
> you ?
>

Hmm, I would be surprised if this is working correctly for RVV, since as far as I know we don't have variable length support in SLP, i.e. SLP can't predicate operations during build, so the current vectorizer only supports fixed-length vector SLP, unless Richi did some magic?

For SVE the reason this XPASSes is that the compiler will fall back to NEON unless it's told it can't.  But that's not actually testing VLA SLP, i.e.
https://godbolt.org/z/5n5fWahxh — just using `+sve` isn't enough and it has to be told it can only use SVE.

Is it perhaps something similar for RVV?  If RVV has a similar param, perhaps the correct fix is to append it to the tests so they XFAIL correctly?

Regards,
Tamar

> And.
>
> Hi, @Richard. Is this patch Ok for trunk if this patch fixes the regression
> for both RVV and ARM SVE.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.dg/tree-ssa/pr84512.c: Remove vect_variable_length XFAIL.
>	* gcc.dg/vect/bb-slp-43.c: Ditto.
>	* gcc.dg/vect/bb-slp-subgroups-3.c: Ditto.
>
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr84512.c        | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-43.c          | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> index 496c78b28dc..3c027012670 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> @@ -13,4 +13,4 @@ int foo()
>  }
>
>  /* Listed targets xfailed due to PR84958.  */
> -/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { amdgcn*-*-* || vect_variable_length } } } } */
> +/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { amdgcn*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c b/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> index dad2d24262d..40bd2e0dfbf 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> @@ -14,4 +14,4 @@ f (int *restrict x, short *restrict y)
>  }
>
>  /* { dg-final { scan-tree-dump-not "mixed mask and nonmask" "slp2" } } */
> -/* { dg-final { scan-tree-dump-not "vector operands from scalars" "slp2" { target { { vect_int && vect_bool_cmp } && { vect_unpack && vect_hw_misalign } } xfail { vect_variable_length && { ! vect256 } } } } } */
> +/* { dg-final { scan-tree-dump-not "vector operands from scalars" "slp2" { target { { vect_int && vect_bool_cmp } && { vect_unpack && vect_hw_misalign } } } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> index fb719915db7..3f0d45ce4a1 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -42,7 +42,7 @@ main (int argc, char **argv)
>  /* Because we disable the cost model, targets with variable-length
>     vectors can end up vectorizing the store to a[0..7] on its own.
>     With the cost model we do something sensible.  */
> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 2 "slp2" { target { ! amdgcn-*-* } xfail vect_variable_length } } } */
> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 2 "slp2" { target { ! amdgcn-*-* } } } } */
>
>  /* amdgcn can do this in one vector.  */
>  /* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp2" { target amdgcn-*-* } } } */
> --
> 2.36.3
[PATCH]middle-end: Handle hybrid SLP induction vectorization with early breaks.
Hi All,

While we don't support SLP for early break vectorization, we can land in the situation where the induction was vectorized through hybrid SLP.  This means when vectorizing the early break live operation we need to get the results of the SLP operation.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-vect-loop.cc (vectorizable_live_operation): Handle SLP.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/vect-early-break_82.c: New test.

--- inline copy of patch --
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
new file mode 100644
index ..f2a6d640f9c0c381cc2af09bd824e272bcfee0b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include <complex.h>
+
+#define N 1024
+complex double vect_a[N];
+complex double vect_b[N];
+
+complex double test4(complex double x, complex double t)
+{
+ complex double ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_a[i] = t + i;
+   if (vect_a[i] == x)
+     return i;
+   vect_a[i] += x * vect_a[i];
+
+ }
+ return ret;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 85b81d30c5ab869cb1f7323caabd9fe4648bdc50..0993d184afe068784474ac225768d9f38d76c040 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10856,8 +10856,8 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
   bitsize = vector_element_bits_tree (vectype);
 
   /* Get the vectorized lhs of STMT and the lane to use (counted in bits).  */
-  tree vec_lhs, bitstart;
-  gimple *vec_stmt;
+  tree vec_lhs, vec_lhs0, bitstart;
+  gimple *vec_stmt, *vec_stmt0;
   if (slp_node)
     {
       gcc_assert (!loop_vinfo
@@ -10868,6 +10868,10 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
       vec_stmt = SSA_NAME_DEF_STMT (vec_lhs);
 
+      /* In case we need to early break vectorize also get the first stmt.  */
+      vec_lhs0 = SLP_TREE_VEC_DEFS (slp_node)[0];
+      vec_stmt0 = SSA_NAME_DEF_STMT (vec_lhs0);
+
       /* Get entry to use.  */
       bitstart = bitsize_int (vec_index);
       bitstart = int_const_binop (MULT_EXPR, bitsize, bitstart);
@@ -10878,6 +10882,10 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info).last ();
       vec_lhs = gimple_get_lhs (vec_stmt);
 
+      /* In case we need to early break vectorize also get the first stmt.  */
+      vec_stmt0 = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+      vec_lhs0 = gimple_get_lhs (vec_stmt0);
+
       /* Get the last lane in the vector.  */
       bitstart = int_const_binop (MULT_EXPR, bitsize, bitsize_int (nunits - 1));
     }
@@ -10917,7 +10925,6 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
	 so use ->src.  For main exit the merge block is the destination.  */
      basic_block dest = main_exit_edge ? main_e->dest : e->src;
 
-      gimple *tmp_vec_stmt = vec_stmt;
       tree tmp_vec_lhs = vec_lhs;
       tree tmp_bitstart = bitstart;
@@ -10928,8 +10935,7 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       if (restart_loop
	   && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def)
	 {
-	  tmp_vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
-	  tmp_vec_lhs = gimple_get_lhs (tmp_vec_stmt);
+	  tmp_vec_lhs = vec_lhs0;
	   tmp_bitstart = build_zero_cst (TREE_TYPE (bitstart));
	 }