RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD

2024-05-02 Thread Tamar Christina
> > So he was responding for how to do it for the vectorizer and scalar parts.
> > Remember that the goal is not to introduce new gimple IL that can block
> > other optimizations.
> > The vectorizer already introduces new IL (various IFNs) but this is fine as
> > we don't track things like ranges for vector instructions.  So we don't
> > lose any information here.
> 
> > Now for the scalar, if we do an early replacement like in match.pd we
> > prevent a lot of other optimizations because they don't know what
> > IFN_SAT_ADD does.  gimple-isel runs pretty late, and so at this point we
> > don't expect many more optimizations to happen, so it's a safe spot to
> > insert more IL with "unknown semantics".
> 
> > Was that your intention Richi?
> 
> Thanks Tamar for the clear explanation.  Does that mean both the scalar and
> the vector side will go through the isel approach?  If so I may have
> misunderstood earlier that it was only for the vectorizer.

No, the isel would only be for the scalar; the vectorizer will still use the
vect_pattern.
It needs to so that we can cost the operation correctly, and in some cases,
depending on how the saturation is described, you are unable to vectorize at
all.  The pattern allows us to catch these cases and still vectorize.

But you should be able to use the same match.pd predicate for both the
vectorizer pattern and isel.

> 
> Understood on the point that we would like to put the pattern match late, but
> I have a question here.
> Given that the SAT_ADD related pattern is fairly complicated, it is possible
> that a sub-expression of SAT_ADD gets optimized in an early pass by other
> patterns, and then we can hardly catch the shape later.
> 
> For example, there is a plus expression in SAT_ADD, and in an early pass it
> may be optimized to .ADD_OVERFLOW, and then the pattern is quite different
> and hard to recognize in a later pass.
> 

Yeah, it looks like this transformation is done in widening_mul, which is the
other place Richi suggested recognizing SAT_ADD.  widening_mul already runs
quite late as well so it's also ok.

If you put it there before the code that transforms the sequence to overflow it
should work.

Eventually we do need to recognize this variant since:

uint64_t
add_sat (uint64_t x, uint64_t y) noexcept
{
  uint64_t z;
  if (!__builtin_add_overflow (x, y, &z))
    return z;
  return -1;
}

is a valid and common way to do saturation too.

But for now, it's fine.

Cheers,
Tamar

> Sorry not sure if my understanding is correct, feel free to correct me.
> 
> Pan
> 
> -Original Message-
> From: Tamar Christina 
> Sent: Thursday, May 2, 2024 11:26 AM
> To: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao 
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> 
> > -Original Message-
> > From: Li, Pan2 
> > Sent: Thursday, May 2, 2024 4:11 AM
> > To: Tamar Christina ; gcc-patches@gcc.gnu.org
> > Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> > Liu, Hongtao 
> > Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> >
> > Thanks Tamar
> >
> > > Could you also split off the vectorizer change from the scalar recog one?
> > > Typically I would structure a change like this as:
> >
> > > 1. create types/structures + scalar recog
> > > 2. Vector recog code
> > > 3. Backend changes
> >
> > Sure thing, will rearrange the patch like this.
> >
> > > Is ECF_NOTHROW correct here? At least on most targets I believe the scalar
> > > version can set flags/throw exceptions if the saturation happens?
> >
> > I see, will remove that.
> >
> > > Hmm I believe Richi mentioned that he wanted the recognition done in isel?
> >
> > > The problem with doing it in match.pd is that it replaces the operations
> > > quite early in the pipeline. Did I miss an email perhaps? The early
> > > replacement means we lose optimizations and things such as range
> > > calculations etc, since e.g. ranger doesn't know these internal functions.
> >
> > > I think Richi will want this in isel or mult widening but I'll continue
> > > with the match.pd review just in case.
> >
> > If I understand correctly, Richard suggested trying vectorizer patterns
> > first and then possibly isel.
> > Thus, I haven't tried SAT_ADD in ISEL as the vectorizer patterns work well
> > for SAT_ADD.
> > Let's wait 

RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD

2024-05-01 Thread Tamar Christina
> -Original Message-
> From: Li, Pan2 
> Sent: Thursday, May 2, 2024 4:11 AM
> To: Tamar Christina ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao 
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> 
> Thanks Tamar
> 
> > Could you also split off the vectorizer change from the scalar recog one?
> > Typically I would structure a change like this as:
> 
> > 1. create types/structures + scalar recog
> > 2. Vector recog code
> > 3. Backend changes
> 
> Sure thing, will rearrange the patch like this.
> 
> > Is ECF_NOTHROW correct here? At least on most targets I believe the scalar
> > version can set flags/throw exceptions if the saturation happens?
> 
> I see, will remove that.
> 
> > Hmm I believe Richi mentioned that he wanted the recognition done in isel?
> 
> > The problem with doing it in match.pd is that it replaces the operations
> > quite early in the pipeline. Did I miss an email perhaps? The early
> > replacement means we lose optimizations and things such as range
> > calculations etc, since e.g. ranger doesn't know these internal functions.
> 
> > I think Richi will want this in isel or mult widening but I'll continue
> > with the match.pd review just in case.
> 
> If I understand correctly, Richard suggested trying vectorizer patterns first
> and then possibly isel.
> Thus, I haven't tried SAT_ADD in ISEL as the vectorizer patterns work well
> for SAT_ADD.
> Let's wait for the confirmation from Richard.  Below are the original words
> from the previous mail for reference.
> 

I think the comment he made was this

> > Given we have saturating integer alu like below, could you help to coach me
> > on the most reasonable way to represent it in the scalar as well as the
> > vectorize part? Sorry, not familiar with this part and still digging into
> > how it works...
> 
> As in your v2, .SAT_ADD for both sat_uadd and sat_sadd, similar for
> the other cases.
>
> As I said, use vectorizer patterns and possibly do instruction
> selection at ISEL/widen_mult time.

So he was responding for how to do it for the vectorizer and scalar parts.
Remember that the goal is not to introduce new gimple IL that can block other
optimizations.
The vectorizer already introduces new IL (various IFNs) but this is fine as we
don't track things like ranges for vector instructions.  So we don't lose any
information here.

Now for the scalar, if we do an early replacement like in match.pd we prevent
a lot of other optimizations because they don't know what IFN_SAT_ADD does.
gimple-isel runs pretty late, and so at this point we don't expect many more
optimizations to happen, so it's a safe spot to insert more IL with "unknown
semantics".

Was that your intention Richi?

Thanks,
Tamar

> >> As I said, use vectorizer patterns and possibly do instruction
> >> selection at ISEL/widen_mult time.
> 
> > The optimize checks in the match.pd file are weird as it seems to check if 
> > we have
> > optimizations enabled?
> 
> > We don't typically need to do this.
> 
> Sure, will remove this.
> 
> > The function has only one caller, you should just inline it into the 
> > pattern.
> 
> Sure thing.
> 
> > Once you inline vect_sat_add_build_call you can do the check for
> > vtype here, which is the cheaper check so perform it early.
> 
> Sure thing.
> 
> Thanks again and will send the v4 with all comments addressed, as well as the 
> test
> results.
> 
> Pan
> 
> -Original Message-
> From: Tamar Christina 
> Sent: Thursday, May 2, 2024 1:06 AM
> To: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; richard.guent...@gmail.com;
> Liu, Hongtao 
> Subject: RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD
> 
> Hi,
> 
> > From: Pan Li 
> >
> > Update in v3:
> > * Rebase upstream for conflict.
> >
> > Update in v2:
> > * Fix one failure for x86 bootstrap.
> >
> > Original log:
> >
> > This patch would like to add the middle-end representation for the
> > saturating add, i.e. set the result of the add to the max when it
> > overflows.  It will take a pattern similar to the one below.
> >
> > SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> >
> > Take uint8_t as example, we will have:
> >
> > * SAT_ADD (1, 254)   => 255.
> > * SAT_ADD (1, 255)   => 255.
> > * SAT_ADD (2, 255)   => 255.
> > * SAT_ADD (255, 255) => 255.
> >
> > The p

RE: [PATCH v3] Internal-fn: Introduce new internal function SAT_ADD

2024-05-01 Thread Tamar Christina
Hi,

> From: Pan Li 
> 
> Update in v3:
> * Rebase upstream for conflict.
> 
> Update in v2:
> * Fix one failure for x86 bootstrap.
> 
> Original log:
> 
> This patch would like to add the middle-end representation for the
> saturating add, i.e. set the result of the add to the max when it
> overflows.  It will take a pattern similar to the one below.
> 
> SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> 
> Take uint8_t as example, we will have:
> 
> * SAT_ADD (1, 254)   => 255.
> * SAT_ADD (1, 255)   => 255.
> * SAT_ADD (2, 255)   => 255.
> * SAT_ADD (255, 255) => 255.
> 
> The patch also implements SAT_ADD in the riscv backend as a sample for
> both the scalar and vector cases.  Given the below example:
> 
> uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> {
>   return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> }
> 
> Before this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   long unsigned int _1;
>   _Bool _2;
>   long unsigned int _3;
>   long unsigned int _4;
>   uint64_t _7;
>   long unsigned int _10;
>   __complex__ long unsigned int _11;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _11 = .ADD_OVERFLOW (x_5(D), y_6(D));
>   _1 = REALPART_EXPR <_11>;
>   _10 = IMAGPART_EXPR <_11>;
>   _2 = _10 != 0;
>   _3 = (long unsigned int) _2;
>   _4 = -_3;
>   _7 = _1 | _4;
>   return _7;
> ;;succ:   EXIT
> 
> }
> 
> After this patch:
> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> {
>   uint64_t _7;
> 
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
>   return _7;
> ;;succ:   EXIT
> }
> 
> For vectorization, we leverage the existing vect pattern recog to find
> a pattern similar to the scalar one and let the vectorizer perform the
> rest for the standard name usadd3 in vector mode.
> The riscv vector backend has the insn "Vector Single-Width Saturating
> Add and Subtract" which can be leveraged when expanding usadd3
> in vector mode.  For example:
> 
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   unsigned i;
> 
>   for (i = 0; i < n; i++)
> out[i] = (x[i] + y[i]) | (- (uint64_t)((uint64_t)(x[i] + y[i]) < x[i]));
> }
> 
> Before this patch:
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   ...
>   _80 = .SELECT_VL (ivtmp_78, POLY_INT_CST [2, 2]);
>   ivtmp_58 = _80 * 8;
>   vect__4.7_61 = .MASK_LEN_LOAD (vectp_x.5_59, 64B, { -1, ... }, _80, 0);
>   vect__6.10_65 = .MASK_LEN_LOAD (vectp_y.8_63, 64B, { -1, ... }, _80, 0);
>   vect__7.11_66 = vect__4.7_61 + vect__6.10_65;
>   mask__8.12_67 = vect__4.7_61 > vect__7.11_66;
>   vect__12.15_72 = .VCOND_MASK (mask__8.12_67, { 18446744073709551615,
> ... }, vect__7.11_66);
>   .MASK_LEN_STORE (vectp_out.16_74, 64B, { -1, ... }, _80, 0, vect__12.15_72);
>   vectp_x.5_60 = vectp_x.5_59 + ivtmp_58;
>   vectp_y.8_64 = vectp_y.8_63 + ivtmp_58;
>   vectp_out.16_75 = vectp_out.16_74 + ivtmp_58;
>   ivtmp_79 = ivtmp_78 - _80;
>   ...
> }
> 
> vec_sat_add_u64:
>   ...
>   vsetvli a5,a3,e64,m1,ta,ma
>   vle64.v v0,0(a1)
>   vle64.v v1,0(a2)
>   sllia4,a5,3
>   sub a3,a3,a5
>   add a1,a1,a4
>   add a2,a2,a4
>   vadd.vv v1,v0,v1
>   vmsgtu.vv   v0,v0,v1
>   vmerge.vim  v1,v1,-1,v0
>   vse64.v v1,0(a0)
>   ...
> 
> After this patch:
> void vec_sat_add_u64 (uint64_t *out, uint64_t *x, uint64_t *y, unsigned n)
> {
>   ...
>   _62 = .SELECT_VL (ivtmp_60, POLY_INT_CST [2, 2]);
>   ivtmp_46 = _62 * 8;
>   vect__4.7_49 = .MASK_LEN_LOAD (vectp_x.5_47, 64B, { -1, ... }, _62, 0);
>   vect__6.10_53 = .MASK_LEN_LOAD (vectp_y.8_51, 64B, { -1, ... }, _62, 0);
>   vect__12.11_54 = .SAT_ADD (vect__4.7_49, vect__6.10_53);
>   .MASK_LEN_STORE (vectp_out.12_56, 64B, { -1, ... }, _62, 0, vect__12.11_54);
>   ...
> }
> 
> vec_sat_add_u64:
>   ...
>   vsetvli a5,a3,e64,m1,ta,ma
>   vle64.v v1,0(a1)
>   vle64.v v2,0(a2)
>   sllia4,a5,3
>   sub a3,a3,a5
>   add a1,a1,a4
>   add a2,a2,a4
>   vsaddu.vv   v1,v1,v2
>   vse64.v v1,0(a0)
>   ...
> 
> To limit the patch size for review, only the unsigned version of
> usadd3 is involved here.  The signed version will be covered
> in follow-up patch(es).
> 
> The below test suites have passed for this patch.
> * The riscv full regression tests.
> * The aarch64 full regression tests.
> * The x86 bootstrap tests.
> * The x86 full regression tests.
> 
>   PR target/51492
>   PR target/112600
> 
> gcc/ChangeLog:
> 
>   * config/riscv/autovec.md (usadd3): New pattern expand
>   for unsigned SAT_ADD vector.
>   * config/riscv/riscv-protos.h (riscv_expand_usadd): New func
>   decl to expand usadd3 pattern.
>   (expand_vec_usadd): Ditto but for vector.
>   * config/riscv/riscv-v.cc (emit_vec_saddu): New func impl to
>   emit the vsadd insn.
>   (expand_vec_usadd): New func impl to expand usadd3 for
>   vector.
>   * config/riscv/riscv.cc (riscv_expand_usadd): New func impl
>   to 

[PATCH]middle-end: refactory vect_recog_absolute_difference to simplify flow [PR114769]

2024-04-19 Thread Tamar Christina
Hi All,

As the reporter in PR114769 points out the control flow for the abd detection
is hard to follow.  This is because vect_recog_absolute_difference has two
different ways it can return true.

1. It can return true when the widening operation is matched, in which case
   unprom is set, half_type is not NULL and diff_stmt is not set.

2. It can return true when the widening operation is not matched, but the stmt
   being checked is a minus.  In this case unprom is not set, half_type is set
   to NULL and diff_stmt is set.  This is because to get to diff_stmt you have
   to dig through the abs statement and any possible promotions.

This however leads to complicated uses of the function at the call sites as the
exact semantic needs to be known to use it safely.

vect_recog_absolute_difference has two callers:

1. vect_recog_sad_pattern where if you return true with unprom not set, then
   *half_type will be NULL.  The call to vect_supportable_direct_optab_p will
   always reject it since there's no vector mode for NULL.  Note that if looking
   at the dump files, the convention in the dump files has always been that we
   first indicate that a pattern could possibly be recognized and then check that
   it's supported.

   This change somewhat incorrectly makes the diagnostic message get printed for
   "invalid" patterns.

2. vect_recog_abd_pattern, where if half_type is NULL, it then uses diff_stmt to
   set them.

So while the note in the dump file is misleading, the code is safe.

This refactors the code so that it now only has one success condition, and
diff_stmt is always set to the minus statement in the abs if there is one.

The function now only returns success if the widening minus is found, in which
case unprom and half_type are set.

This then leaves it up to the caller to decide if they want to do anything with
diff_stmt.
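
For context, a rough illustration (my own, not part of the patch) of the kind
of scalar loops the two callers are looking for; the function and variable
names below are made up:

#include <stdint.h>

/* vect_recog_abd_pattern: element-wise absolute difference (ABD).
   The subtraction under the ABS is what diff_stmt would point at.  */
void
abd (uint8_t *restrict out, uint8_t *restrict a, uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      int d = a[i] - b[i];        /* possibly a widened subtraction */
      out[i] = d < 0 ? -d : d;    /* ABS over the difference */
    }
}

/* vect_recog_sad_pattern: sum of absolute differences (SAD), i.e. the same
   absolute difference feeding a reduction.  */
int
sad (uint8_t *a, uint8_t *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      int d = a[i] - b[i];
      sum += d < 0 ? -d : d;
    }
  return sum;
}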

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/114769
* tree-vect-patterns.cc:
(vect_recog_absolute_difference): Have only one success condition.
(vect_recog_abd_pattern): Handle further checks if
vect_recog_absolute_difference fails.

---
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 
4f491c6b8336f8710c3519dec1fa7e0f49387d2b..87c2acff386d91d22a3b2d6e6443d1f2f2326ea6
 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -797,8 +797,7 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info 
stmt2_info, tree new_rhs,
HALF_TYPE and UNPROM will be set should the statement be found to
be a widened operation.
DIFF_STMT will be set to the MINUS_EXPR
-   statement that precedes the ABS_STMT unless vect_widened_op_tree
-   succeeds.
+   statement that precedes the ABS_STMT if it is a MINUS_EXPR..
  */
 static bool
 vect_recog_absolute_difference (vec_info *vinfo, gassign *abs_stmt,
@@ -843,6 +842,12 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign 
*abs_stmt,
   if (!diff_stmt_vinfo)
 return false;
 
+  gassign *diff = dyn_cast  (STMT_VINFO_STMT (diff_stmt_vinfo));
+  if (diff_stmt && diff
+  && gimple_assign_rhs_code (diff) == MINUS_EXPR
+  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
+*diff_stmt = diff;
+
   /* FORNOW.  Can continue analyzing the def-use chain when this stmt in a phi
  inside the loop (in case we are analyzing an outer-loop).  */
   if (vect_widened_op_tree (vinfo, diff_stmt_vinfo,
@@ -850,17 +855,6 @@ vect_recog_absolute_difference (vec_info *vinfo, gassign 
*abs_stmt,
false, 2, unprom, half_type))
 return true;
 
-  /* Failed to find a widen operation so we check for a regular MINUS_EXPR.  */
-  gassign *diff = dyn_cast  (STMT_VINFO_STMT (diff_stmt_vinfo));
-  if (diff_stmt && diff
-  && gimple_assign_rhs_code (diff) == MINUS_EXPR
-  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (abs_oprnd)))
-{
-  *diff_stmt = diff;
-  *half_type = NULL_TREE;
-  return true;
-}
-
   return false;
 }
 
@@ -1499,27 +1493,22 @@ vect_recog_abd_pattern (vec_info *vinfo,
   tree out_type = TREE_TYPE (gimple_assign_lhs (last_stmt));
 
   vect_unpromoted_value unprom[2];
-  gassign *diff_stmt;
-  tree half_type;
-  if (!vect_recog_absolute_difference (vinfo, last_stmt, _type,
+  gassign *diff_stmt = NULL;
+  tree abd_in_type;
+  if (!vect_recog_absolute_difference (vinfo, last_stmt, _in_type,
   unprom, _stmt))
-return NULL;
-
-  tree abd_in_type, abd_out_type;
-
-  if (half_type)
-{
-  abd_in_type = half_type;
-  abd_out_type = abd_in_type;
-}
-  else
 {
+  /* We cannot try further without having a non-widening MINUS.  */
+  if (!diff_stmt)
+   return NULL;
+
   unprom[0].op = gimple_assign_rhs1 (diff_stmt);
   unprom[1].op = gimple_assign_rhs2 (diff_stmt);
   abd_in_type = signed_type_for (out_type);
-  abd_out_type = abd_in_type;
 }
 
+ 

[PATCH]AArch64: remove reliance on register allocator for simd/gpreg costing. [PR114741]

2024-04-18 Thread Tamar Christina
Hi All,

In PR114741 we see that we have a regression in codegen when SVE is enabled
where the simple testcase:

void foo(unsigned v, unsigned *p)
{
*p = v & 1;
}

generates

foo:
fmovs31, w0
and z31.s, z31.s, #1
str s31, [x1]
ret

instead of:

foo:
and w0, w0, 1
str w0, [x1]
ret

This impacts not just code size but also performance.  It is caused
by the use of the ^ constraint modifier in the pattern 3.

The documentation states that this modifier should only have an effect on the
alternative costing in that a particular alternative is to be preferred unless
a non-pseudo reload is needed.

The pattern was trying to convey that whenever both r and w are required, that
it should prefer r unless a reload is needed.  This is because if a reload is
needed then we can construct the constants more flexibly on the SIMD side.

We were using this to simplify the implementation and to get generic cases
such as:

double negabs (double x)
{
   unsigned long long y;
   memcpy (&y, &x, sizeof(double));
   y = y | (1UL << 63);
   memcpy (&x, &y, sizeof(double));
   return x;
}

which don't go through an expander.
However the implementation of ^ in the register allocator is not according to
the documentation in that it also has an effect during coloring.  During initial
register class selection it applies a penalty to a class, similar to how ? does.

In this example the penalty makes the use of GP regs expensive enough that it no
longer considers them:

r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
;;3--> b  0: i   9 r106=r105&0x1
:cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
 PR_HI_REGS+0(0):model 4

which is not the expected behavior.  For GCC 14 this is a conservative fix.

1. we remove the ^ modifier from the logical optabs.

2. In order not to regress copysign we then move the copysign expansion to
   directly use the SIMD variant.  Since copysign only supports floating point
   modes this is fine and no longer relies on the register allocator to select
   the right alternative.

It once again regresses the general case, but this case wasn't optimized in
earlier GCCs either so it's not a regression in GCC 14.  This change gives
strictly better codegen than earlier GCCs and still optimizes the important cases.
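
As a side note, a tiny illustration (my own, not from the patch) of the
copysign special case handled in the expander: when the second operand is a
negative constant only its sign bit matters, so it can be emitted as an IOR
with the sign-bit mask instead of the general copysign sequence:

#include <math.h>

/* copysign (x, -1.0) is -fabs (x): only the sign bit of the second operand
   is used, so an OR with the sign-bit mask is enough.  */
double
force_negative (double x)
{
  return copysign (x, -1.0);
}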

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:


PR target/114741
* config/aarch64/aarch64.md (3): Remove ^ from alt 2.
(copysign3): Use SIMD version of IOR directly.

gcc/testsuite/ChangeLog:

PR target/114741
* gcc.target/aarch64/fneg-abs_2.c: Update codegen.
* gcc.target/aarch64/fneg-abs_4.c: xfail for now.
* gcc.target/aarch64/pr114741.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 
385a669b9b3c31cc9108a660e881b9091c71fc7c..dbde066f7478bec51a8703b017ea553aa98be309
 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4811,7 +4811,7 @@ (define_insn "3"
   ""
   {@ [ cons: =0 , 1  , 2; attrs: type , arch  ]
  [ r, %r , r; logic_reg   , * ] \t%0, 
%1, %2
- [ rk   , ^r ,  ; logic_imm   , * ] \t%0, 
%1, %2
+ [ rk   , r  ,  ; logic_imm   , * ] \t%0, 
%1, %2
  [ w, 0  ,  ; *   , sve   ] \t%Z0., 
%Z0., #%2
  [ w, w  , w; neon_logic  , simd  ] 
\t%0., %1., %2.
   }
@@ -7192,22 +7192,29 @@ (define_expand "copysign3"
(match_operand:GPF 2 "nonmemory_operand")]
   "TARGET_SIMD"
 {
-  machine_mode int_mode = mode;
-  rtx bitmask = gen_reg_rtx (int_mode);
-  emit_move_insn (bitmask, GEN_INT (HOST_WIDE_INT_M1U
-   << (GET_MODE_BITSIZE (mode) - 1)));
+  rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U
+  << (GET_MODE_BITSIZE (mode) - 1));
   /* copysign (x, -1) should instead be expanded as orr with the sign
  bit.  */
   rtx op2_elt = unwrap_const_vec_duplicate (operands[2]);
   if (GET_CODE (op2_elt) == CONST_DOUBLE
   && real_isneg (CONST_DOUBLE_REAL_VALUE (op2_elt)))
 {
-  emit_insn (gen_ior3 (
-   lowpart_subreg (int_mode, operands[0], mode),
-   lowpart_subreg (int_mode, operands[1], mode), bitmask));
+  rtx v_bitmask
+   = force_reg (V2mode,
+gen_const_vec_duplicate (V2mode,
+ signbit_const));
+
+  emit_insn (gen_iorv23 (
+   lowpart_subreg (V2mode, operands[0], mode),
+   lowpart_subreg (V2mode, operands[1], mode),
+   v_bitmask));
   DONE;
 }
 
+  machine_mode int_mode = mode;
+  rtx bitmask = gen_reg_rtx (int_mode);
+  emit_move_insn (bitmask, signbit_const);
   operands[2] = force_reg (mode, operands[2]);
   emit_insn (gen_copysign3_insn (operands[0], operands[1], operands[2],
 

RE: [PATCH]middle-end: skip vectorization check on ilp32 on vect-early-break_124-pr114403.c

2024-04-16 Thread Tamar Christina
> On Tue, Apr 16, 2024 at 09:00:53AM +0200, Richard Biener wrote:
> > >   PR tree-optimization/114403
> > >   * gcc.dg/vect/vect-early-break_124-pr114403.c: Skip in ilp32.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > index
> 1751296ab813fe85eaab1f58dc674bac10f6eb7a..db8e00556f116ca81c5a6558e
> c6ecd3b222ec93d 100644
> > > --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
> > > @@ -2,11 +2,11 @@
> > >  /* { dg-require-effective-target vect_early_break_hw } */
> > >  /* { dg-require-effective-target vect_long_long } */
> > >
> > > -/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
> > > ilp32 } } }
> } */
> > >
> > >  #include "tree-vect.h"
> > >
> > > -typedef unsigned long PV;
> > > +typedef unsigned long long PV;
> > >  typedef struct _buff_t {
> > >  int foo;
> > >  PV Val;
> 
> As discussed on IRC, I think we want
> --- gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c.jj  2024-04-
> 16 08:43:36.001729192 +0200
> +++ gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c 2024-04-
> 16 08:55:11.296214959 +0200
> @@ -64,8 +64,8 @@ int main ()
> 
>int store_size = sizeof(PV);
>  #pragma GCC novector
> -  for (int i = 0; i < NUM - 1; i+=store_size)
> -if (0 != __builtin_memcmp (buffer+i, (char*)[i].Val, store_size))
> +  for (int i = 0; i < NUM - 1; i++)
> +if (0 != __builtin_memcmp (buffer+i*store_size, (char*)[i].Val, 
> store_size))
>__builtin_abort ();
> 
>return 0;
> 
> instead (and then I think there is no need to switch PV from unsigned long
> to unsigned long long, nor to disable on ilp32).
> 

Regtested on x86_64-pc-linux-gnu with -m32,-m64 and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/114403
* gcc.dg/vect/vect-early-break_124-pr114403.c: Fix check loop.

-- inline copy of patch --

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
index 
1751296ab813fe85eaab1f58dc674bac10f6eb7a..51abf245ccb51b85f06916a8a0238698911ab551
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
@@ -68,8 +68,8 @@ int main ()
 
   int store_size = sizeof(PV);
 #pragma GCC novector
-  for (int i = 0; i < NUM - 1; i+=store_size)
-if (0 != __builtin_memcmp (buffer+i, (char*)[i].Val, store_size))
+  for (int i = 0; i < NUM - 1; i++)
+if (0 != __builtin_memcmp (buffer+(i*store_size), (char*)[i].Val, 
store_size))
   __builtin_abort ();
 
   return 0;





[PATCH]middle-end: skip vectorization check on ilp32 on vect-early-break_124-pr114403.c

2024-04-15 Thread Tamar Christina
Hi all,

The testcase seems to fail vectorization on -m32 since the access pattern is
determined as too complex.  This skips the vectorization check on ilp32 systems
as I couldn't find a better proxy for being able to do strided 64-bit loads and
I suspect it would fail on all 32-bit targets.

Regtested on x86_64-pc-linux-gnu with -m32 and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/114403
* gcc.dg/vect/vect-early-break_124-pr114403.c: Skip in ilp32.

---
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
index 
1751296ab813fe85eaab1f58dc674bac10f6eb7a..db8e00556f116ca81c5a6558ec6ecd3b222ec93d
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_124-pr114403.c
@@ -2,11 +2,11 @@
 /* { dg-require-effective-target vect_early_break_hw } */
 /* { dg-require-effective-target vect_long_long } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! ilp32 } } 
} } */
 
 #include "tree-vect.h"
 
-typedef unsigned long PV;
+typedef unsigned long long PV;
 typedef struct _buff_t {
 int foo;
 PV Val;









docs: document early break support and pragma novector

2024-04-15 Thread Tamar Christina
docs: document early break support and pragma novector

---
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 
b4c602a523717c1d64333e44aefb60ba0ed02e7a..aceecb86f17443cfae637e90987427b98c42f6eb
 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -200,6 +200,34 @@ a work-in-progress.
 for indicating parameters that are expected to be null-terminated
 strings.
   
+  
+The vectorizer now supports vectorizing loops which contain any number of 
early breaks.
+This means loops such as:
+
+   int z[100], y[100], x[100];
+   int foo (int n)
+   {
+ int res = 0;
+ for (int i = 0; i < n; i++)
+   {
+  y[i] = x[i] * 2;
+  res += x[i] + y[i];
+
+  if (x[i] > 5)
+break;
+
+  if (z[i] > 5)
+break;
+
+   }
+ return res;
+   }
+
+can now be vectorized on a number of targets.  In this first version any
+input data sources must either have a statically known size at compile time
+or the vectorizer must be able to determine based on auxillary information
+that the accesses are aligned.
+  
 
 
 New Languages and Language specific improvements
@@ -231,6 +259,9 @@ a work-in-progress.
   previous options -std=c2x, -std=gnu2x
   and -Wc11-c2x-compat, which are deprecated but remain
   supported.
+  GCC supports a new pragma pragma GCC novector to
+  indicate to the vectorizer not to vectorize the loop annotated with the
+  pragma.
 
 
 C++
@@ -400,6 +431,9 @@ a work-in-progress.
   warnings are enabled for C++ as well
   The DR 2237 code no longer gives an error, it emits
   a -Wtemplate-id-cdtor warning instead
+  GCC supports a new pragma pragma GCC novector to
+  indicate to the vectorizer not to vectorize the loop annotated with the
+  pragma.
 
 
 Runtime Library (libstdc++)









[PATCH]middle-end: adjust loop upper bounds when peeling for gaps and early break [PR114403].

2024-04-12 Thread Tamar Christina
Hi All,

This is a story all about how the peeling for gaps introduces a bug in the upper
bounds.

Before I go further, I'll first explain how I understand this to work for loops
with a single exit.

When peeling for gaps we peel N < VF iterations to scalar.
This happens by removing N iterations from the calculation of niters such that
vect_iters * VF == niters is always false.

In other words, when we exit the vector loop we always fall to the scalar loop.
The loop bounds adjustment guarantees this. Because of this we potentially
execute a vector loop iteration less.  That is, if you're at the boundary
condition where niters % VF by peeling one or more scalar iterations the vector
loop executes one less.

This is accounted for by the adjustments in vect_transform_loop.  This
adjustment happens differently based on whether the vector loop can be
partial or not:

Peeling for gaps sets the bias to 0 and then:

when not partial:  we take the floor of (scalar_upper_bound / VF) - 1 to get the
   vector latch iteration count.

when loop is partial:  For a single exit this means the loop is masked, we take
   the ceil to account for the fact that the loop can handle
   the final partial iteration using masking.

Note that there's no difference between ceil and floor on the boundary condition.
There is a difference however when you're slightly above it. i.e. if scalar
iterates 14 times and VF = 4 and we peel 1 iteration for gaps.

The partial loop does ((13 + 0) / 4) - 1 == 2 vector iterations. and in effect
the partial iteration is ignored and it's done as scalar.

This is fine because the niters modification has capped the vector iteration at
2.  So that when we reduce the induction values you end up entering the scalar
code with ind_var.2 = ind_var.1 + 2 * VF.

Now let's look at early breaks.  To make it easier I'll focus on the specific
testcase:

char buffer[64];

__attribute__ ((noipa))
buff_t *copy (buff_t *first, buff_t *last)
{
  char *buffer_ptr = buffer;
  char *const buffer_end = &buffer[SZ-1];
  int store_size = sizeof(first->Val);
  while (first != last && (buffer_ptr + store_size) <= buffer_end)
{
  const char *value_data = (const char *)(&first->Val);
  __builtin_memcpy(buffer_ptr, value_data, store_size);
  buffer_ptr += store_size;
  ++first;
}

  if (first == last)
return 0;

  return first;
}

Here the first, early exit is on the condition:

  (buffer_ptr + store_size) <= buffer_end

and the main exit is on condition:

  first != last

This is important, as this bug only manifests itself when the first exit has a
known constant iteration count that's lower than the latch exit count.

because buffer holds 64 bytes, and VF = 4, unroll = 2, we end up processing 16
bytes per iteration.  So the exit has a known bound of 8 + 1.

The vectorizer correctly analyzes this:

Statement (exit)if (ivtmp_21 != 0)
 is executed at most 8 (bounded by 8) + 1 times in loop 1.

and as a consequence the IV is bound by 9:

  # vect_vec_iv_.14_117 = PHI <_118(9), { 9, 8, 7, 6 }(20)>
  ...
  vect_ivtmp_21.16_124 = vect_vec_iv_.14_117 + { 18446744073709551615, 
18446744073709551615, 18446744073709551615, 18446744073709551615 };
  mask_patt_22.17_126 = vect_ivtmp_21.16_124 != { 0, 0, 0, 0 };
  if (mask_patt_22.17_126 == { -1, -1, -1, -1 })
goto ; [88.89%]
  else
goto ; [11.11%]

The important bits are these:

In this example the value of last - first = 416.

the calculated vector iteration count, is:

x = (((ptr2 - ptr1) - 16) / 16) + 1 = 27

the bounds generated, adjusting for gaps:

   x == (((x - 1) >> 2) << 2)

which means we'll always fall through to the scalar code, as intended.

Here are two key things to note:

1. In this loop, the early exit will always be the one taken.  When it's taken
   we enter the scalar loop with the correct induction value to apply the gap
   peeling.

2. If the main exit is taken, the induction values assumes you've finished all
   vector iterations.  i.e. it assumes you have completed 24 iterations, as we
   treat the main exit the same for normal loop vect and early break when not
   PEELED.
   This means the induction value is adjusted to ind_var.2 = ind_var.1 + 24 * 
VF;

So what's going wrong?  The vectorizer's codegen is correct and efficient;
however, when we adjust the upper bounds, that code thinks that the loop's
upper bound is based on the early exit, i.e. 8 latch iterations, or in other
words, that the loop iterates once.

This is incorrect as the vector loop iterates twice, since it has set up the
induction value such that it exits at the early exit.  So in effect it
iterates 2.5x.

Because the upper bound is incorrect, when we unroll, it now exits from the
main exit, which uses the incorrect induction value.

So there are three ways to fix this:

1.  If we take the position that the main exit should support both premature
exits and final exits then vect_update_ivs_after_vectorizer 

[PATCH]middle-end vect: adjust loop upper bounds when peeling for gaps and early break [PR114403]

2024-04-04 Thread Tamar Christina
Hi All,

The report shows that we end up in a situation where the code has been peeled
for gaps and we have an early break.

The code for peeling for gaps assumes that a scalar loop needs to perform at
least one iteration.  However this doesn't take into account early breaks,
where the scalar loop may not need to be executed.

That the early break loop can be partial is not accounted for in this scenario.
Loop partiality is normally handled by setting bias_for_lowest to 1, but when
peeling for gaps we end up with 0, which, when the loop upper bounds are
calculated, means that a partial loop iteration loses the final partial iter:

Analyzing # of iterations of loop 1
  exit condition [8, + , 18446744073709551615] != 0
  bounds on difference of bases: -8 ... -8
  result:
# of iterations 8, bounded by 8

and a VF=4 calculating:

Loop 1 iterates at most 1 times.
Loop 1 likely iterates at most 1 times.
Analyzing # of iterations of loop 1
  exit condition [1, + , 1](no_overflow) < bnd.5505_39
  bounds on difference of bases: 0 ... 4611686018427387902
Matching expression match.pd:2011, generic-match-8.cc:27
Applying pattern match.pd:2067, generic-match-1.cc:4813
  result:
# of iterations bnd.5505_39 + 18446744073709551615, bounded by 
4611686018427387902
Estimating sizes for loop 1
...
   Induction variable computation will be folded away.
  size:   2 if (ivtmp_312 < bnd.5505_39)
   Exit condition will be eliminated in last copy.
size: 24-3, last_iteration: 24-5
  Loop size: 24
  Estimated size after unrolling: 26
;; Guessed iterations of loop 1 is 0.858446. New upper bound 1.

upper bound should be 2 not 1.

This patch forces bias_for_lowest to be 1 even when peeling for gaps.

I have however not been able to write a standalone reproducer for this, so I
have no tests, but bootstrap and an LLVM build are fine now.

The testcase:

#define COUNT 9
#define SIZE COUNT * 4
#define TYPE unsigned long

TYPE x[SIZE], y[SIZE];

void __attribute__((noipa))
loop (TYPE val)
{
  for (int i = 0; i < COUNT; ++i)
{
  if (x[i * 4] > val || x[i * 4 + 1] > val)
return;
  x[i * 4] = y[i * 2] + 1;
  x[i * 4 + 1] = y[i * 2] + 2;
  x[i * 4 + 2] = y[i * 2 + 1] + 3;
  x[i * 4 + 3] = y[i * 2 + 1] + 4;
}
}

does perform the peeling for gaps and early break, however it creates a hybrid
loop which works fine.  Adjusting the indices to be non-linear also works.  So
I'd like to submit the fix and work on a testcase separately if needed.

Bootstrapped Regtested on x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/114403
* tree-vect-loop.cc (vect_transform_loop): Adjust upper bounds for when
peeling for gaps and early break.

---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
4375ebdcb493a90fd0501cbb4b07466077b525c3..bf1bb9b005c68fbb13ee1b1279424865b237245a
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12139,7 +12139,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
   /* The minimum number of iterations performed by the epilogue.  This
  is 1 when peeling for gaps because we always need a final scalar
  iteration.  */
-  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  int min_epilogue_iters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+  && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo) ? 1 : 0;
   /* +1 to convert latch counts to loop iteration counts,
  -min_epilogue_iters to remove iterations that cannot be performed
by the vector code.  */









Summary: [PATCH][committed]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552][GCC 13/12/11 backport]

2024-03-12 Thread Tamar Christina
Hi All,

This is a backport of g:306713c953d509720dc394c43c0890548bb0ae07.

The AArch64 vector PCS does not allow simd calls with simdlen 1,
however due to a bug we currently do allow it for num == 0.

This causes us to emit a symbol that doesn't exist and we fail to link.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Committed to GCC 13,12,11 branches as previously approved.

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113552
* config/aarch64/aarch64.cc
(aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1.

gcc/testsuite/ChangeLog:

PR tree-optimization/113552
* gcc.target/aarch64/pr113552.c: New test.
* gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check.

---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
f546c48ae2d2bad2e34c6b72e5e3e30aba3c3bd6..d19a9c16cc97ae75afd4e29f4339d65d39cfb73a
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -27027,7 +27027,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
bool explicit_p)
 {
   tree t, ret_type;
-  unsigned int elt_bits, count;
+  unsigned int elt_bits, count = 0;
   unsigned HOST_WIDE_INT const_simdlen;
   poly_uint64 vec_bits;
 
@@ -27104,7 +27104,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
   vec_bits = (num == 0 ? 64 : 128);
   clonei->simdlen = exact_div (vec_bits, elt_bits);
 }
-  else
+  else if (maybe_ne (clonei->simdlen, 1U))
 {
   count = 1;
   vec_bits = clonei->simdlen * elt_bits;
diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c 
b/gcc/testsuite/gcc.target/aarch64/pr113552.c
new file mode 100644
index 
..9c96b061ed2b4fcc57e58925277f74d14f79c51f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8-a" } */
+
+__attribute__ ((__simd__ ("notinbranch"), const))
+double cos (double);
+
+void foo (float *a, double *b)
+{
+for (int i = 0; i < 12; i+=3)
+  {
+b[i] = cos (5.0 * a[i]);
+b[i+1] = cos (5.0 * a[i+1]);
+b[i+2] = cos (5.0 * a[i+2]);
+  }
+}
+
+/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c 
b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
index 
95f6a6803e889c02177ef10972962ed62d2095eb..c6dac6b104c94c9de89ed88dc5a73e185d2be125
 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
@@ -18,7 +18,7 @@ double foo(double x)
 }
 
 /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */
-/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */
+/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnM1v_foo} } } */
 /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */
-/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */
+/* { dg-final { scan-assembler-not {\.variant_pcs\t_ZGVnN1v_foo} } } */
 /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */





RE: [PATCH] vect: Do not peel epilogue for partial vectors [PR114196].

2024-03-07 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Thursday, March 7, 2024 8:47 AM
> To: Robin Dapp 
> Cc: gcc-patches ; Tamar Christina
> 
> Subject: Re: [PATCH] vect: Do not peel epilogue for partial vectors 
> [PR114196].
> 
> On Wed, Mar 6, 2024 at 9:21 PM Robin Dapp  wrote:
> >
> > Hi,
> >
> > r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
> > break but PR114196 shows that we also run into the problem without early
> > break.  Therefore remove early break from the conditions.
> >
> > gcc/ChangeLog:
> >
> > PR middle-end/114196
> >
> > * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Remove
> > early break check from guards.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/pr114196.c: New test.
> > * gcc.target/riscv/rvv/autovec/pr114196.c: New test.
> > ---
> >  gcc/testsuite/gcc.target/aarch64/pr114196.c   | 19 +++
> >  .../gcc.target/riscv/rvv/autovec/pr114196.c   | 19 +++
> >  gcc/tree-vect-loop-manip.cc   |  6 +++---
> >  3 files changed, 41 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr114196.c
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr114196.c
> b/gcc/testsuite/gcc.target/aarch64/pr114196.c
> > new file mode 100644
> > index 000..15e4b0e31b8
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr114196.c
> > @@ -0,0 +1,19 @@
> > +/* { dg-do compile } */
> > +/* { dg-options { -O3 -fno-vect-cost-model -march=armv9-a -msve-vector-
> bits=256 } } */
> > +
> > +unsigned a;
> > +int b;
> > +long *c;
> > +
> > +int
> > +main ()
> > +{
> > +  for (int d = 0; d < 22; d += 4) {
> > +  b = ({
> > +   int e = c[d];
> > +   e;
> > +   })
> > +  ? 0 : -c[d];
> > +  a *= 3;
> > +  }
> > +}
> > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
> > new file mode 100644
> > index 000..7ba9cbbed70
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr114196.c
> > @@ -0,0 +1,19 @@
> > +/* { dg-do compile } */
> > +/* { dg-options { -O3 -fno-vect-cost-model -march=rv64gcv_zvl256b -
> mabi=lp64d -mrvv-vector-bits=zvl } } */
> > +
> > +unsigned a;
> > +int b;
> > +long *c;
> > +
> > +int
> > +main ()
> > +{
> > +  for (int d = 0; d < 22; d += 4) {
> > +  b = ({
> > +   int e = c[d];
> > +   e;
> > +   })
> > +  ? 0 : -c[d];
> > +  a *= 3;
> > +  }
> > +}
> > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > index f72da915103..c3cd20eef70 100644
> > --- a/gcc/tree-vect-loop-manip.cc
> > +++ b/gcc/tree-vect-loop-manip.cc
> > @@ -2183,9 +2183,9 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info
> loop_vinfo,
> >   perform the peeling.  The below condition mirrors that of
> >   vect_gen_vector_loop_niters  where niters_vector_mult_vf_var then sets
> >   step_vector to VF rather than 1.  This is what creates the nonlinear
> > - IV.  PR113163.  */
> > -  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > -  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
> > + IV.  PR113163.
> > + This also happens without early breaks, see PR114196.  */
> 
> Can you instead reword to not mention early breaks, maybe instead
> say PR113163 (with early breaks), PR114196 (without)?
> 
> The dump message also needs adjustments, it mentions early breaks as
> well.
> 
> The comment says it matches a condition in vect_gen_vector_loop_niters
> but I can't see what that means ... Tamar?
> 

The comment was trying to say that this case is when you manage to get here:
https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2847

because that makes you fall into 
https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L3528
 which creates the nonlinear IV variable.

The vect_step_op_neg exception is because vect_update_ivs_after_vectorizer can 
deal with that case specifically
https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2398

which is what the previous check is also explaining 
https://github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2133

If this also happens for non-early breaks it's just better to merge the check 
into the earlier one at 
github.com/gcc-mirror/gcc/blob/95b6ee96348041eaee9133f082b57f3e57ef0b11/gcc/tree-vect-loop-manip.cc#L2133

Tamar

> > +  if (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
> >&& LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> >&& induction_type != vect_step_op_neg)
> >  {
> > --
> > 2.43.2


RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS

2024-02-27 Thread Tamar Christina
> Thanks Tamar.
> 
> > Those two cases also *completely* stop vectorization because of either the
> > control flow or the fact the vectorizer can't handle complex types.
> 
> Yes, we eventually would like to vectorize the SAT ALU but we start with 
> scalar part
> first.
> I tried the DEF_INTERNAL_SIGNED_OPTAB_EXT_FN as your suggestion. It works
> well with some additions as below.
> Feel free to correct me if any misunderstandings.
> 
> 1. usadd$Q$a3 are restricted to fixed point and we need to change it to
> usadd$a3(as well as gen_int_libfunc) for int.
> 2. We need to implement a default implementation of SAT_ADD if
> direct_binary_optab_supported_p is false.
> It looks like the default implementation is difficult to make every 
> backend happy.
> That is why you suggest just normal
> DEF_INTERNAL_SIGNED_OPTAB_FN in another thread.
> 
> Thanks Richard.
> 
> > But what I'd like to see is that we do more instruction selection on GIMPLE
> > but _late_ (there's the pass_optimize_widening_mul and pass_gimple_isel
> > passes doing what I'd call instruction selection).  But that means not 
> > adding
> > match.pd patterns for that or at least have a separate isel-match.pd
> > machinery for that.
> 
> > So as a start I would go for a direct optab and see to recognize it during
> > ISEL?
> 
> Looks we have sorts of SAT alu like PLUS/MINUS/MULT/DIV/SHIFT/NEG/ABS, good
> to know isel and I am happy to
> try that once we have conclusion.
> 

So after a lively discussion on IRC, the conclusion is that before we proceed
Richi would like to see some examples of various operations.  The problem is
that unsigned saturating addition is the simplest example and it may lead to
an implementation strategy that doesn't scale.

So I'd suggest writing some examples of both signed and unsigned saturating
add and multiply, because signed addition will likely require a branch and
signed multiplication would require a larger type.

This would allow us to better understand what kind of gimple we would have to
deal with in ISEL and VECT if we decide not to lower early.
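
For the record, a rough scalar sketch of what I mean (my own illustration,
not agreed semantics; the names are made up):

#include <stdint.h>

/* Unsigned saturating add: branchless, the open-coded form the vectorizer
   pattern already recognizes.  */
uint32_t
sat_uadd (uint32_t x, uint32_t y)
{
  uint32_t sum = x + y;
  return sum | -(uint32_t)(sum < x);
}

/* Signed saturating add: the overflow check naturally wants a branch (or a
   select), and the saturation value depends on the sign of the operands.  */
int32_t
sat_sadd (int32_t x, int32_t y)
{
  int32_t sum;
  if (__builtin_add_overflow (x, y, &sum))
    return x < 0 ? INT32_MIN : INT32_MAX;
  return sum;
}

/* Signed saturating multiply: one obvious scalar form computes in a wider
   type and clamps, which is why a larger type comes into play.  */
int32_t
sat_smul (int32_t x, int32_t y)
{
  int64_t prod = (int64_t) x * y;
  if (prod > INT32_MAX)
    return INT32_MAX;
  if (prod < INT32_MIN)
    return INT32_MIN;
  return prod;
}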

Thanks,
Tamar

> Pan
> 
> -Original Message-
> From: Tamar Christina 
> Sent: Tuesday, February 27, 2024 5:57 PM
> To: Richard Biener 
> Cc: Li, Pan2 ; gcc-patches@gcc.gnu.org; 
> juzhe.zh...@rivai.ai;
> Wang, Yanzhang ; kito.ch...@gmail.com;
> richard.sandiford@arm.com2; jeffreya...@gmail.com
> Subject: RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation
> US_PLUS
> 
> > -Original Message-
> > From: Richard Biener 
> > Sent: Tuesday, February 27, 2024 9:44 AM
> > To: Tamar Christina 
> > Cc: pan2...@intel.com; gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai;
> > yanzhang.w...@intel.com; kito.ch...@gmail.com;
> > richard.sandiford@arm.com2; jeffreya...@gmail.com
> > Subject: Re: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation
> > US_PLUS
> >
> > On Sun, Feb 25, 2024 at 10:01 AM Tamar Christina
> >  wrote:
> > >
> > > Hi Pan,
> > >
> > > > From: Pan Li 
> > > >
> > > > Hi Richard & Tamar,
> > > >
> > > > Try the DEF_INTERNAL_INT_EXT_FN as your suggestion.  By mapping
> > > > us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def.
> > > > And then expand_US_PLUS in internal-fn.cc.  Not very sure if my
> > > > understanding is correct for DEF_INTERNAL_INT_EXT_FN.
> > > >
> > > > I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given
> > > > the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already.
> > > >
> > >
> > > I think a couple of things are being confused here.  So lets break it 
> > > down:
> > >
> > > The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is because in GIMPLE
> > > we only want one internal function for both signed and unsigned SAT_ADD.
> > > with this definition we don't need SAT_UADD and SAT_SADD but instead
> > > we will only have SAT_ADD, which will expand to us_plus or ss_plus.
> > >
> > > Now the downside of this is that this is a direct internal optab.  This 
> > > means
> > > that for the representation to be used the target *must* have the optab
> > > implemented.   This is a bit annoying because it doesn't allow us to 
> > > generically
> > > assume that all targets use SAT_ADD for saturating add and thus only have 
> > > to
> > > write optimization for this representation.
> > >
> > > This is why Richi said we may need to use a new tree_code because we c

RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU

2024-02-27 Thread Tamar Christina
> Am 19.02.24 um 08:36 schrieb Richard Biener:
> > On Sat, Feb 17, 2024 at 11:30 AM  wrote:
> >>
> >> From: Pan Li 
> >>
> >> This patch would like to add the middle-end presentation for the
> >> unsigned saturation add.  Aka set the result of add to the max
> >> when overflow.  It will take the pattern similar as below.
> >>
> >> SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> 
> Does this even try to work out the costs?
> 
> For example, with the following example
> 
> 
> #define T __UINT16_TYPE__
> 
> T sat_add1 (T x, T y)
> {
>return (x + y) | (- (T)((T)(x + y) < x));
> }
> 
> T sat_add2 (T x, T y)
> {
>  T z = x + y;
>  if (z < x)
>  z = (T) -1;
>  return z;
> }
> 
> And then "avr-gcc -S -Os -dp" the code is
> 
> 
> sat_add1:
>   add r22,r24  ;  7   [c=8 l=2]  *addhi3/0
>   adc r23,r25
>   ldi r18,lo8(1)   ;  8   [c=4 l=2]  *movhi/4
>   ldi r19,0
>   cp r22,r24   ;  9   [c=8 l=2]  cmphi3/2
>   cpc r23,r25
>   brlo .L2 ;  10  [c=16 l=1]  branch
>   ldi r19,0;  31  [c=4 l=1]  movqi_insn/0
>   ldi r18,0;  32  [c=4 l=1]  movqi_insn/0
> .L2:
>   clr r24  ;  13  [c=12 l=4]  neghi2/1
>   clr r25
>   sub r24,r18
>   sbc r25,r19
>   or r24,r22   ;  29  [c=4 l=1]  iorqi3/0
>   or r25,r23   ;  30  [c=4 l=1]  iorqi3/0
>   ret  ;  35  [c=0 l=1]  return
> 
> sat_add2:
>   add r22,r24  ;  8   [c=8 l=2]  *addhi3/0
>   adc r23,r25
>   cp r22,r24   ;  9   [c=8 l=2]  cmphi3/2
>   cpc r23,r25
>   brsh .L3 ;  10  [c=16 l=1]  branch
>   ldi r22,lo8(-1)  ;  5   [c=4 l=2]  *movhi/4
>   ldi r23,lo8(-1)
> .L3:
>   mov r25,r23  ;  21  [c=4 l=1]  movqi_insn/0
>   mov r24,r22  ;  22  [c=4 l=1]  movqi_insn/0
>   ret  ;  25  [c=0 l=1]  return
> 
> i.e. the conditional jump is better than overly smart arithmetic
> (smaller and faster code with less register pressure).
> With larger types the difference is even more pronounced.
> 

*on AVR. https://godbolt.org/z/7jaExbTa8  shows the branchless code is better.
And the branchy code will vectorize worse if at all 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492

But looking at that output it just seems like it's your expansion that's 
inefficient.
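
To illustrate the vectorization point, a small sketch (not taken from either
PR) of a saturating-add loop in the branchless form; written with a branch
instead, the control flow blocks the vectorizer as in the PRs above:

#include <stdint.h>

void
sat_add_loop (uint8_t *restrict dst, uint8_t *restrict a,
              uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      uint8_t sum = a[i] + b[i];       /* wraps modulo 256 */
      dst[i] = sum | -(sum < a[i]);    /* all-ones mask saturates to 255 */
    }
}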

But fair point, perhaps it should be just a normal DEF_INTERNAL_SIGNED_OPTAB_FN 
so that we
provide the additional optimization only for targets that want it.

Tamar

> >> Take uint8_t as example, we will have:
> >>
> >> * SAT_ADDU (1, 254)   => 255.
> >> * SAT_ADDU (1, 255)   => 255.
> >> * SAT_ADDU (2, 255)   => 255.
> >> * SAT_ADDU (255, 255) => 255.
> >>
> >> The patch also implement the SAT_ADDU in the riscv backend as
> >> the sample.  Given below example:
> >>
> >> uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> >> {
> >>return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> >> }
> >>
> >> Before this patch:
> >>
> >> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> >> {
> >>long unsigned int _1;
> >>_Bool _2;
> >>long unsigned int _3;
> >>long unsigned int _4;
> >>uint64_t _7;
> >>long unsigned int _10;
> >>__complex__ long unsigned int _11;
> >>
> >> ;;   basic block 2, loop depth 0
> >> ;;pred:   ENTRY
> >>_11 = .ADD_OVERFLOW (x_5(D), y_6(D));
> >>_1 = REALPART_EXPR <_11>;
> >>_10 = IMAGPART_EXPR <_11>;
> >>_2 = _10 != 0;
> >>_3 = (long unsigned int) _2;
> >>_4 = -_3;
> >>_7 = _1 | _4;
> >>return _7;
> >> ;;succ:   EXIT
> >>
> >> }
> >>
> >> After this patch:
> >>
> >> uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> >> {
> >>uint64_t _7;
> >>
> >> ;;   basic block 2, loop depth 0
> >> ;;pred:   ENTRY
> >>_7 = .SAT_ADDU (x_5(D), y_6(D)); [tail call]
> >>return _7;
> >> ;;succ:   EXIT
> >>
> >> }
> >>
> >> Then we will have the middle-end representation like .SAT_ADDU after
> >> this patch.
> >
> > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and
> > the corresponding ssadd/usadd optabs.  There's not much documentation
> > unfortunately besides the use of gen_*_fixed_libfunc usage where the comment
> > suggests this is used for fixed-point operations.  It looks like arm uses
> > fractional/accumulator modes for this but for example bfin has ssaddsi3.
> >
> > So the question is whether the fixed-point case can be distinguished from
> > the integer case based on mode.
> >
> > There's also FIXED_POINT_TYPE on the GENERIC/GIMPLE side and
> > no special tree operator codes for them.  So compared to what appears
> > to be the case on RTL we'd need a way to represent saturating integer
> > operations on GIMPLE.
> >
> > The natural thing is to use direct optab internal functions (that's what you
> > basically did, but you added a new optab, IMO without good reason).
> > More 

RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS

2024-02-27 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, February 27, 2024 9:44 AM
> To: Tamar Christina 
> Cc: pan2...@intel.com; gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai;
> yanzhang.w...@intel.com; kito.ch...@gmail.com;
> richard.sandiford@arm.com2; jeffreya...@gmail.com
> Subject: Re: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation
> US_PLUS
> 
> On Sun, Feb 25, 2024 at 10:01 AM Tamar Christina
>  wrote:
> >
> > Hi Pan,
> >
> > > From: Pan Li 
> > >
> > > Hi Richard & Tamar,
> > >
> > > Try the DEF_INTERNAL_INT_EXT_FN as your suggestion.  By mapping
> > > us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def.
> > > And then expand_US_PLUS in internal-fn.cc.  Not very sure if my
> > > understanding is correct for DEF_INTERNAL_INT_EXT_FN.
> > >
> > > I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given
> > > the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already.
> > >
> >
> > I think a couple of things are being confused here.  So lets break it down:
> >
> > The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is because in GIMPLE
> > we only want one internal function for both signed and unsigned SAT_ADD.
> > with this definition we don't need SAT_UADD and SAT_SADD but instead
> > we will only have SAT_ADD, which will expand to us_plus or ss_plus.
> >
> > Now the downside of this is that this is a direct internal optab.  This 
> > means
> > that for the representation to be used the target *must* have the optab
> > implemented.   This is a bit annoying because it doesn't allow us to 
> > generically
> > assume that all targets use SAT_ADD for saturating add and thus only have to
> > write optimization for this representation.
> >
> > This is why Richi said we may need to use a new tree_code because we can
> > override tree code expansions.  However the same can be done with the 
> > _EXT_FN
> > internal functions.
> >
> > So what I meant was that we want to have a combination of the two. i.e. a
> > DEF_INTERNAL_SIGNED_OPTAB_EXT_FN.
> 
> Whether we want/need _EXT or only direct depends mainly on how we want to
> leverage support.  If it's only during vectorization and possibly instruction
> selection a direct optab is IMO the way to go.  Generic optimization only
> marginally improves when you explode the number of basic operations you
> expose - in fact it gets quite unwieldly to support all of them in
> simplifications
> and/or canonicalization and you possibly need to translate them back to what
> the target CPU supports.
> 
> We already do have too many (IMO) "special" operations exposed "early"
> in the GIMPLE pipeline.
> 
> But what I'd like to see is that we do more instruction selection on GIMPLE
> but _late_ (there's the pass_optimize_widening_mul and pass_gimple_isel
> passes doing what I'd call instruction selection).  But that means not adding
> match.pd patterns for that or at least have a separate isel-match.pd
> machinery for that.
> 
> So as a start I would go for a direct optab and see to recognize it during
> ISEL?
> 

The problem with ISEL and the reason I suggested an indirect IFN is that there
are benefits to be had from recognizing it early.  Saturating arithmetic can be
optimized differently from non-saturating arithmetic.

But additionally a common way of specifying them decomposes to branches
and/or using COMPLEX_EXPR (see the various PRs on saturating arithmetic).

These two representations can be detected in PHI-opts and it's beneficial to all
targets to canonicalize them to the branchless code.

Those two cases also *completely* stop vectorization because of either the
control flow or the fact the vectorizer can't handle complex types.

So really, gimple ISEL would fix just 1 of the 3 very common cases, and then
we'd still need to hack the vectorizer cost models for targets with saturating
vector instructions.

I of course defer to you, but it seems quite suboptimal to do it this way and
doesn't get us first class saturation support.

Additionally there have been discussions whether both clang and gcc should
provide __builtin_saturate_* methods, which the non-direct IFN would help
support.

Tamar.

> > If Richi agrees, the below is what I meant. It creates the infrastructure 
> > for this
> > and for now only allows a default fallback for unsigned saturating add and 
> > makes
> > it easier for us to add the rest later
> >
> > Also, unless I'm wrong (and Richi can correct me here), us_plus and ss_plus 
> > are
> the
> > RTL expressi

RE: [PATCH]middle-end: delay updating of dominators until later during vectorization. [PR114081]

2024-02-26 Thread Tamar Christina
> > The testcase shows an interesting case where we have multiple loops sharing 
> > a
> > live value and have an early exit that goes to the same location.  The 
> > additional
> > complication is that on x86_64 with -mavx we seem to also do prologue 
> > peeling
> > on the loops.
> >
> > We correctly identify which BB we need their dominators updated for, but we 
> > do
> > so too early.
> >
> > Instead of adding more dominator updates we can solve this by not verifying
> > dominators at the end of peeling for the cases with multiple exits when
> > peeling for vectorization.
> >
> > We can then perform the final dominator updates just before vectorization 
> > when
> > all loop transformations are done.
> 
> What's the actual CFG transform that happens between the old and the new
> place?  I see a possible edge splitting but where is the one that makes
> this patch work?

It's not one but two.
1. loop 1 is prologue peeled. This ICEs because the dominator update is only 
happening
for epilogue peeling.  Note that loop 1 here dominates 21 and the ICE is:

ice.c: In function 'void php_zval_filter(int, int)':
ice.c:7:6: error: dominator of 14 should be 21, not 3
7 | void php_zval_filter(int filter, int id1) {
  |  ^~~
ice.c:7:6: error: dominator of 10 should be 21, not 3
during GIMPLE pass: vect
dump file: a-ice.c.179t.vect

This can be simply fixed by just moving the dom update code down:

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index a5202f32e27..e88948370c6 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1845,13 +1845,7 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
 to the original function exit we recorded.  Other exits are already
 correct.  */
   if (multiple_exits_p)
-   {
- update_loop = new_loop;
- doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header);
- for (unsigned i = 0; i < doms.length (); ++i)
-   if (flow_bb_inside_loop_p (loop, doms[i]))
- doms.unordered_remove (i);
-   }
+   update_loop = new_loop;
 }
   else /* Add the copy at entry.  */
 {
@@ -1906,6 +1900,11 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,

   if (multiple_exits_p)
 {
+  doms = get_all_dominated_blocks (CDI_DOMINATORS, loop->header);
+  for (unsigned i = 0; i < doms.length (); ++i)
+   if (flow_bb_inside_loop_p (loop, doms[i]))
+ doms.unordered_remove (i);
+
   for (edge e : get_loop_exit_edges (update_loop))
{
  edge ex;

With that done, the next ICE comes along.  Loop 1 is peeled again, but this
time for the epilogue.  However, loop 1 no longer dominates the exits as the
prologue peeled loop does.

So we don't find anything to update and we hit the second ICE:

ice.c: In function 'void php_zval_filter(int, int)':
ice.c:7:6: error: dominator of 14 should be 2, not 21
7 | void php_zval_filter(int filter, int id1) {
  |  ^~~
ice.c:7:6: error: dominator of 10 should be 2, not 21
during GIMPLE pass: vect
dump file: a-ice.c.179t.vect

because the prologue loop no longer dominates them due to the skip edge.  This
is why delaying works: we know we have to update the dominators of 14 and 10,
but not yet what the new dominator should be.

Tamar

> 
> > This also means we reduce the number of dominator updates needed by at least
> > 50% and fixes the ICE.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and
> > x86_64-pc-linux-gnu no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/114081
> > PR tree-optimization/113290
> > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
> > Skip dominator update when multiple exit.
> > (vect_do_peeling): Remove multiple exit dominator update.
> > * tree-vect-loop.cc (vect_transform_loop): Update dominators when
> > multiple exits.
> > * tree-vectorizer.h (LOOP_VINFO_DOMS_NEED_UPDATE,
> >  dominators_needing_update): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/114081
> > PR tree-optimization/113290
> > * gcc.dg/vect/vect-early-break_120-pr114081.c: New test.
> > * gcc.dg/vect/vect-early-break_121-pr114081.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c
> > new file mode 100644
> > index
> ..2cd4ce1e4ac573ba6e4173
> 0fd2216f0ec8061376
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c
> > @@ -0,0 +1,38 @@
> > +/* { dg-do compile } */
> > +/* { dg-add-options vect_early_break } */
> > +/* { dg-require-effective-target vect_early_break } */
> > +/* { dg-require-effective-target vect_int } */
> > +/* { 

[PATCH]middle-end: delay updating of dominators until later during vectorization. [PR114081]

2024-02-25 Thread Tamar Christina
Hi All,

The testcase shows an interesting case where we have multiple loops sharing a
live value and have an early exit that goes to the same location.  The additional
complication is that on x86_64 with -mavx we seem to also do prologue peeling
on the loops.

We correctly identify which BB we need their dominators updated for, but we do
so too early.

Instead of adding more dominator updates we can solve this by not verifying
dominators at the end of peeling for the cases with multiple exits when peeling
for vectorization.

We can then perform the final dominator updates just before vectorization when
all loop transformations are done.

This also means we reduce the number of dominator updates needed by at least
50% and fixes the ICE.

Bootstrapped Regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/114081
PR tree-optimization/113290
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Skip dominator update when multiple exit.
(vect_do_peeling): Remove multiple exit dominator update.
* tree-vect-loop.cc (vect_transform_loop): Update dominators when
multiple exits.
* tree-vectorizer.h (LOOP_VINFO_DOMS_NEED_UPDATE,
 dominators_needing_update): New.

gcc/testsuite/ChangeLog:

PR tree-optimization/114081
PR tree-optimization/113290
* gcc.dg/vect/vect-early-break_120-pr114081.c: New test.
* gcc.dg/vect/vect-early-break_121-pr114081.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c
new file mode 100644
index 
..2cd4ce1e4ac573ba6e41730fd2216f0ec8061376
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_120-pr114081.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+typedef struct filter_list_entry {
+  const char *name;
+  int id;
+  void (*function)();
+} filter_list_entry;
+
+static const filter_list_entry filter_list[9] = {0};
+
+void php_zval_filter(int filter, int id1) {
+  filter_list_entry filter_func;
+
+  int size = 9;
+  for (int i = 0; i < size; ++i) {
+if (filter_list[i].id == filter) {
+  filter_func = filter_list[i];
+  goto done;
+}
+  }
+
+#pragma GCC novector
+  for (int i = 0; i < size; ++i) {
+if (filter_list[i].id == 0x0204) {
+  filter_func = filter_list[i];
+  goto done;
+}
+  }
+done:
+  if (!filter_func.id)
+filter_func.function();
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c
new file mode 100644
index 
..feebdb7a6c9b8981d7be31dd1c741f9e36738515
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_121-pr114081.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+typedef struct filter_list_entry {
+  const char *name;
+  int id;
+  void (*function)();
+} filter_list_entry;
+
+static const filter_list_entry filter_list[9] = {0};
+
+void php_zval_filter(int filter, int id1) {
+  filter_list_entry filter_func;
+
+  int size = 9;
+  for (int i = 0; i < size; ++i) {
+if (filter_list[i].id == filter) {
+  filter_func = filter_list[i];
+  goto done;
+}
+  }
+
+  for (int i = 0; i < size; ++i) {
+if (filter_list[i].id == 0x0204) {
+  filter_func = filter_list[i];
+  goto done;
+}
+  }
+done:
+  if (!filter_func.id)
+filter_func.function();
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
3f974d6d839e32516ae316f28ca25316e43d7d86..b5e158bc5cfb5107d5ff461e489d306f81e090d0
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1917,7 +1917,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, 
edge loop_exit,
  doms.safe_push (e->dest);
}
 
-  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
   if (updated_doms)
updated_doms->safe_splice (doms);
 }
@@ -1925,7 +1924,9 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, 
edge loop_exit,
   free (new_bbs);
   free (bbs);
 
-  checking_verify_dominators (CDI_DOMINATORS);
+  /* If we're peeling for vectorization then delay verifying dominators.  */
+  if (!flow_loops || !multiple_exits_p)
+checking_verify_dominators (CDI_DOMINATORS);
 
   return 

RE: [PATCH v2] Draft|Internal-fn: Introduce internal fn saturation US_PLUS

2024-02-25 Thread Tamar Christina
Hi Pan,

> From: Pan Li 
> 
> Hi Richard & Tamar,
> 
> Try the DEF_INTERNAL_INT_EXT_FN as your suggestion.  By mapping
> us_plus$a3 to the RTL representation (us_plus:m x y) in optabs.def.
> And then expand_US_PLUS in internal-fn.cc.  Not very sure if my
> understanding is correct for DEF_INTERNAL_INT_EXT_FN.
> 
> I am not sure if we still need DEF_INTERNAL_SIGNED_OPTAB_FN here, given
> the RTL representation has (ss_plus:m x y) and (us_plus:m x y) already.
> 

I think a couple of things are being confused here.  So let's break it down:

The reason for DEF_INTERNAL_SIGNED_OPTAB_FN is because in GIMPLE
we only want one internal function for both signed and unsigned SAT_ADD.
with this definition we don't need SAT_UADD and SAT_SADD but instead
we will only have SAT_ADD, which will expand to us_plus or ss_plus.

Now the downside of this is that this is a direct internal optab.  This means
that for the representation to be used the target *must* have the optab
implemented.   This is a bit annoying because it doesn't allow us to generically
assume that all targets use SAT_ADD for saturating add and thus only have to
write optimization for this representation.

This is why Richi said we may need to use a new tree_code because we can
override tree code expansions.  However the same can be done with the _EXT_FN
internal functions.

So what I meant was that we want to have a combination of the two. i.e. a
DEF_INTERNAL_SIGNED_OPTAB_EXT_FN.

If Richi agrees, the below is what I meant. It creates the infrastructure for 
this
and for now only allows a default fallback for unsigned saturating add and makes
it easier for us to add the rest later

Also, unless I'm wrong (and Richi can correct me here), us_plus and ss_plus are
the RTL expressions, but the optabs for saturation are ssadd and usadd.  So you
don't need to make new us_plus and ss_plus ones.

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index a07f25f3aee..aaf9f8991b3 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -4103,6 +4103,17 @@ direct_internal_fn_supported_p (internal_fn fn, 
tree_pair types,
return direct_##TYPE##_optab_supported_p (which_optab, types,   \
  opt_type);\
   }
+#define DEF_INTERNAL_SIGNED_OPTAB_EXT_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+UNSIGNED_OPTAB, TYPE)  \
+case IFN_##CODE:   \
+  {
\
+   optab which_optab = (TYPE_UNSIGNED (types.SELECTOR) \
+? UNSIGNED_OPTAB ## _optab \
+: SIGNED_OPTAB ## _optab); \
+   return direct_##TYPE##_optab_supported_p (which_optab, types,   \
+ opt_type) \
+  || internal_##CODE##_fn_supported_p (types.SELECTOR, opt_type); \
+  }
 #include "internal-fn.def"
 
 case IFN_LAST:
@@ -4303,6 +4314,8 @@ set_edom_supported_p (void)
 optab which_optab = direct_internal_fn_optab (fn, types);  \
 expand_##TYPE##_optab_fn (fn, stmt, which_optab);  \
   }
+#define DEF_INTERNAL_SIGNED_OPTAB_EXT_FN(CODE, FLAGS, SELECTOR, SIGNED_OPTAB, \
+UNSIGNED_OPTAB, TYPE)
 #include "internal-fn.def"
 
 /* Routines to expand each internal function, indexed by function number.
@@ -5177,3 +5190,45 @@ expand_POPCOUNT (internal_fn fn, gcall *stmt)
   emit_move_insn (plhs, cmp);
 }
 }
+
+void
+expand_SAT_ADD (internal_fn fn, gcall *stmt)
+{
+  /* Check if the target supports the expansion through an IFN.  */
+  tree_pair types = direct_internal_fn_types (fn, stmt);
+  optab which_optab = direct_internal_fn_optab (fn, types);
+  if (direct_binary_optab_supported_p (which_optab, types,
+  insn_optimization_type ()))
+{
+  expand_binary_optab_fn (fn, stmt, which_optab);
+  return;
+}
+
+  /* Target does not support the optab, but we can de-compose it.  */
+  /*
+  ... decompose to a canonical representation ...
+  if (TYPE_UNSIGNED (types.SELECTOR))
+{
+  ...
+  decompose back to (X + Y) | - ((X + Y) < X)
+}
+  else
+{
+  ...
+}
+  */
+}
+
+bool internal_SAT_ADD_fn_supported_p (tree type, optimization_type /* optype 
*/)
+{
+  /* For now, don't support decomposing vector ops.  */
+  if (VECTOR_TYPE_P (type))
+return false;
+
+  /* Signed saturating arithmetic is harder to do, so for now let's
+     ignore it.  */
+  if (!TYPE_UNSIGNED (type))
+return false;
+
+  return TREE_CODE (type) == INTEGER_TYPE;
+}
\ No newline at end of file
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c14d30365c1..5a2491228d5 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -92,6 +92,10 @@ along with GCC; see the file 

[PATCH]middle-end: update vuses out of loop which use a vdef that's moved [PR114068]

2024-02-23 Thread Tamar Christina
Hi All,

In certain cases we can have a situation where the merge block has a vUSE
virtual PHI and the exits do not.  In this case for instance the exits lead
to an abort so they have no virtual PHIs.  If we have a store before the first
exit and we move it to a later block during vectorization we update the vUSE
chain.

However the merge block is not an exit and is not visited by the update code.

This patch fixes it by checking during the move whether there are any
out-of-loop uses of the vDEF that is the last_seen_vuse.  Normally there
wouldn't be any and things are skipped, but if there are then we update them to
the last vDEF in the exit block.

Bootstrapped Regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimizations/114068
* tree-vect-loop.cc (move_early_exit_stmts): Update vUSE chain in merge
block.

gcc/testsuite/ChangeLog:

PR tree-optimizations/114068
* gcc.dg/vect/vect-early-break_118-pr114068.c: New test.
* gcc.dg/vect/vect-early-break_119-pr114068.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c
new file mode 100644
index 
..b462a464b6603e718c5a283513ea586fc13e37ce
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+struct h {
+  int b;
+  int f;
+} k;
+
+void n(int m) {
+  struct h a = k;
+  for (int o = m; o; ++o) {
+if (a.f)
+  __builtin_unreachable();
+if (o > 1)
+  __builtin_unreachable();
+*( + o) = 1;
+  }
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c
new file mode 100644
index 
..a65ef7b8c4901b2ada585f38fda436dc07d1e1de
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_119-pr114068.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+struct h {
+  int b;
+  int c;
+  int f;
+} k;
+
+void n(int m) {
+  struct h a = k;
+  for (int o = m; o; ++o) {
+if (a.f)
+  __builtin_unreachable();
+if (o > 1)
+  __builtin_unreachable();
+*( + o) = 1;
+*( + o*m) = 2;
+  }
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
35f1f8c7d4245135ace740ff9be548919587..44bd8032b55b1ef84fdf4fa9d6117304b7709d6f
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11837,6 +11837,27 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
   update_stmt (p);
 }
 
+  /* last_seen_vuse should now be the PHI in the loop header.  Check for
+ any out of loop uses and update them to the vUSE on the loop latch.  */
+  auto vuse_stmt =  loop_vinfo->lookup_def (last_seen_vuse);
+  gphi *vuse_def;
+  if (vuse_stmt
+      && (vuse_def = dyn_cast <gphi *> (STMT_VINFO_STMT (vuse_stmt))))
+{
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  gimple *use_stmt;
+  auto loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vuse = PHI_ARG_DEF_FROM_EDGE (vuse_def, loop_latch_edge (loop));
+  FOR_EACH_IMM_USE_STMT (use_stmt, iter, last_seen_vuse)
+   {
+ if (flow_bb_inside_loop_p (loop, use_stmt->bb))
+   continue;
+ FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+   SET_USE (use_p, vuse);
+   }
+}
+
   /* And update the LC PHIs on exits.  */
   for (edge e : get_loop_exit_edges (LOOP_VINFO_LOOP  (loop_vinfo)))
 if (!dominated_by_p (CDI_DOMINATORS, e->src, dest_bb))




-- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c
new file mode 100644
index 
..b462a464b6603e718c5a283513ea586fc13e37ce
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_118-pr114068.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+struct h {
+  int b;
+  int f;
+} k;
+
+void n(int m) {
+  struct h a = k;
+  for (int o = m; o; ++o) {
+if (a.f)
+  __builtin_unreachable();
+if (o > 1)
+  __builtin_unreachable();
+*( + o) = 1;
+  }
+}
diff --git 

RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU

2024-02-19 Thread Tamar Christina
> -Original Message-
> From: Li, Pan2 
> Sent: Monday, February 19, 2024 12:59 PM
> To: Tamar Christina ; Richard Biener
> 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com
> Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> 
> Thanks Tamar for comments and explanations.
> 
> > I think we should actually do an indirect optab here, because the IFN can 
> > be used
> > to replace the general representation of saturating arithmetic.
> 
> > e.g. the __builtin_add_overflow case in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
> > is inefficient on all targets and so the IFN can always expand to something 
> > that's
> more
> > efficient like the branchless version add_sat2.
> 
> > I think this is why you suggested a new tree code below, but we don't 
> > really need
> > tree-codes for this. It can be done cleaner using the same way as
> DEF_INTERNAL_INT_EXT_FN
> 
> Yes, the backend could choose a branchless(of course we always hate branch for
> performance) code-gen or even better there is one saturation insn.
> Good to learn DEF_INTERNAL_INT_EXT_FN, and will have a try for it.
> 
> > Richard means that there shouldn't be .SAT_ADDU and .SAT_ADDS  and that the
> sign
> > should be determined by the types at expansion time.  i.e. there should 
> > only be
> > .SAT_ADD.
> 
> Got it, my initial idea comes from that we may have two insns for saturation 
> add,
> mostly these insns need to be signed or unsigned.
> For example, slt/sltu in riscv scalar. But I am not very clear about a 
> scenario like this.
> During define_expand in backend, we hit the standard name
> sat_add_3 but can we tell it is signed or not here? AFAIK, we only have 
> QI, HI,
> SI and DI.

Yeah, the way DEF_INTERNAL_SIGNED_OPTAB_FN works is that you give it two optabs,
one for when it's signed and one for when it's unsigned, and the right one is 
picked
automatically during expansion.  But in GIMPLE you'd only have one IFN.

> Maybe I will have the answer after try DEF_INTERNAL_SIGNED_OPTAB_FN, will
> keep you posted.

Awesome, Thanks!

Tamar
> 
> Pan
> 
> -Original Message-
> From: Tamar Christina 
> Sent: Monday, February 19, 2024 4:55 PM
> To: Li, Pan2 ; Richard Biener 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com
> Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> 
> Thanks for doing this!
> 
> > -Original Message-
> > From: Li, Pan2 
> > Sent: Monday, February 19, 2024 8:42 AM
> > To: Richard Biener 
> > Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> > ; kito.ch...@gmail.com; Tamar Christina
> > 
> > Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> >
> > Thanks Richard for comments.
> >
> > > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and
> > > the corresponding ssadd/usadd optabs.  There's not much documentation
> > > unfortunately besides the use of gen_*_fixed_libfunc usage where the
> comment
> > > suggests this is used for fixed-point operations.  It looks like arm uses
> > > fractional/accumulator modes for this but for example bfin has ssaddsi3.
> >
> > I find the related description about plus family in GCC internals doc but 
> > it doesn't
> > mention
> > anything about mode m here.
> >
> > (plus:m x y)
> > (ss_plus:m x y)
> > (us_plus:m x y)
> > These three expressions all represent the sum of the values represented by x
> and y carried out in machine mode m. They differ in their behavior on 
> > overflow
> > of integer modes. plus wraps round modulo the width of m; ss_plus saturates
> > at the maximum signed value representable in m; us_plus saturates at the
> > maximum unsigned value.
> >
> > > The natural thing is to use direct optab internal functions (that's what 
> > > you
> > > basically did, but you added a new optab, IMO without good reason).
> 
> I think we should actually do an indirect optab here, because the IFN can be 
> used
> to replace the general representation of saturating arithmetic.
> 
> e.g. the __builtin_add_overflow case in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
> is inefficient on all targets and so the IFN can always expand to something 
> that's
> more
> efficient like the branchless version add_sat2.
> 
> I think this is why you suggested a new tree code below, but we don't really 
> need
> tree-codes for this. It can be done cleaner using the sam

RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071]

2024-02-19 Thread Tamar Christina
> -Original Message-
> From: Tamar Christina
> Sent: Thursday, February 15, 2024 11:05 AM
> To: Richard Earnshaw (lists) ; gcc-
> patc...@gcc.gnu.org
> Cc: nd ; Marcus Shawcroft ; Kyrylo
> Tkachov ; Richard Sandiford
> 
> Subject: RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071]
> 
> > -Original Message-
> > From: Richard Earnshaw (lists) 
> > Sent: Thursday, February 15, 2024 11:01 AM
> > To: Tamar Christina ; gcc-patches@gcc.gnu.org
> > Cc: nd ; Marcus Shawcroft ;
> Kyrylo
> > Tkachov ; Richard Sandiford
> > 
> > Subject: Re: [PATCH]AArch64: xfail modes_1.f90 [PR107071]
> >
> > On 15/02/2024 10:57, Tamar Christina wrote:
> > > Hi All,
> > >
> > > This test has never worked on AArch64 since the day it was committed.  It 
> > > has
> > > a number of issues that prevent it from working on AArch64:
> > >
> > > 1.  IEEE does not require that FP operations raise a SIGFPE for FP 
> > > operations,
> > >     only that an exception is raised somehow.
> > >
> > > 2. Most Arm designed cores don't raise SIGFPE and instead set a status 
> > > register
> > >    and some partner cores raise a SIGILL instead.
> > >
> > > 3. The way it checks for feenableexcept doesn't really work for AArch64.
> > >
> > > As such this test doesn't seem to really provide much value on AArch64 so 
> > > we
> > > should just xfail it.
> > >
> > > Regtested on aarch64-none-linux-gnu and no issues.
> > >
> > > Ok for master?
> >
> > Wouldn't it be better to just skip the test.  XFAIL just adds clutter to 
> > verbose
> output
> > and suggests that someday the tools might be fixed for this case.
> >
> > Better still would be a new dg-requires fp_exceptions_raise_sigfpe as a 
> > guard for
> > the test.
> 

It looks like this is similar to 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 so
I'll just similarly skip it.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 
b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
index 
205c47f38007d06116289c19d6b23cf3bf83bd48..e29d8c678e6e51c3f2e5dac53c7703bb18a99ac4
 100644
--- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
+++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
@@ -1,5 +1,5 @@
 ! { dg-do run }
-!
+! { dg-skip-if "PR libfortran/78314" { aarch64*-*-gnu* arm*-*-gnueabi 
arm*-*-gnueabihf } }
 ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES
 
Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR fortran/107071
* gfortran.dg/ieee/modes_1.f90: skip aarch64, arm.


rb18274.patch
Description: rb18274.patch


RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU

2024-02-19 Thread Tamar Christina
Thanks for doing this!

> -Original Message-
> From: Li, Pan2 
> Sent: Monday, February 19, 2024 8:42 AM
> To: Richard Biener 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com; Tamar Christina
> 
> Subject: RE: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> 
> Thanks Richard for comments.
> 
> > I'll note that on RTL we already have SS_PLUS/US_PLUS and friends and
> > the corresponding ssadd/usadd optabs.  There's not much documentation
> > unfortunately besides the use of gen_*_fixed_libfunc usage where the comment
> > suggests this is used for fixed-point operations.  It looks like arm uses
> > fractional/accumulator modes for this but for example bfin has ssaddsi3.
> 
> I find the related description about plus family in GCC internals doc but it 
> doesn't
> mention
> anything about mode m here.
> 
> (plus:m x y)
> (ss_plus:m x y)
> (us_plus:m x y)
> These three expressions all represent the sum of the values represented by x
> and y carried out in machine mode m. They differ in their behavior on 
> overflow
> of integer modes. plus wraps round modulo the width of m; ss_plus saturates
> at the maximum signed value representable in m; us_plus saturates at the
> maximum unsigned value.
> 
> > The natural thing is to use direct optab internal functions (that's what you
> > basically did, but you added a new optab, IMO without good reason).

I think we should actually do an indirect optab here, because the IFN can be 
used
to replace the general representation of saturating arithmetic.

e.g. the __builtin_add_overflow case in 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112600
is inefficient on all targets and so the IFN can always expand to something 
that's more
efficient like the branchless version add_sat2. 

I think this is why you suggested a new tree code below, but we don't really
need tree codes for this.  It can be done more cleanly in the same way as
DEF_INTERNAL_INT_EXT_FN.

> 
> That makes sense to me, I will try to leverage US_PLUS instead here.
> 
> > More GIMPLE-like would be to let the types involved decide whether
> > it's signed or unsigned saturation.  That's actually what I'd prefer here
> > and if we don't map 1:1 to optabs then instead use tree codes like
> > S_PLUS_EXPR (mimicing RTL here).
> 
> Sorry I don't get the point here for GIMPLE-like way. For the .SAT_ADDU, I 
> add one
> restriction
> like unsigned_p (type) in match.pd. Looks we have a better way here.
> 

Richard means that there shouldn't be .SAT_ADDU and .SAT_ADDS  and that the sign
should be determined by the types at expansion time.  i.e. there should only be
.SAT_ADD. 

i.e. instead of this

+DEF_INTERNAL_OPTAB_FN (SAT_ADDU, ECF_CONST | ECF_NOTHROW, sat_addu, binary)

You should use DEF_INTERNAL_SIGNED_OPTAB_FN.
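
Concretely, a sketch of what the internal-fn.def entry could look like (the
flags simply mirror the SAT_ADDU line above and are only illustrative;
ssadd/usadd are the existing saturating optabs and "first" picks which operand
type decides the signedness):

DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST | ECF_NOTHROW, first,
                              ssadd, usadd, binary)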

Regards,
Tamar

> > Any other opinions?  Anyone knows more about fixed-point and RTL/modes?
> 
> AFAIK, the scalar of the riscv backend doesn't have fixed-point but the 
> vector does
> have. They
> share the same mode as vector integer. For example, RVVM1SI in vector-
> iterators.md. Kito
> and Juzhe can help to correct me if any misunderstandings.
> 
> Pan
> 
> -Original Message-
> From: Richard Biener 
> Sent: Monday, February 19, 2024 3:36 PM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang
> ; kito.ch...@gmail.com; tamar.christ...@arm.com
> Subject: Re: [PATCH v1] Internal-fn: Add new internal function SAT_ADDU
> 
> On Sat, Feb 17, 2024 at 11:30 AM  wrote:
> >
> > From: Pan Li 
> >
> > This patch would like to add the middle-end presentation for the
> > unsigned saturation add.  Aka set the result of add to the max
> > when overflow.  It will take the pattern similar as below.
> >
> > SAT_ADDU (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
> >
> > Take uint8_t as example, we will have:
> >
> > * SAT_ADDU (1, 254)   => 255.
> > * SAT_ADDU (1, 255)   => 255.
> > * SAT_ADDU (2, 255)   => 255.
> > * SAT_ADDU (255, 255) => 255.
> >
> > The patch also implement the SAT_ADDU in the riscv backend as
> > the sample.  Given below example:
> >
> > uint64_t sat_add_u64 (uint64_t x, uint64_t y)
> > {
> >   return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
> > }
> >
> > Before this patch:
> >
> > uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
> > {
> >   long unsigned int _1;
> >   _Bool _2;
> >   long unsigned int _3;
> >   long unsigned int _4;
> >   uint64_t _7;
> >   long unsigned int _10;
> >   __complex__ long unsigned int _11

RE: [PATCH] aarch64: Improve PERM<{0}, a, ...> (64bit) by adding whole vector shift right [PR113872]

2024-02-15 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Thursday, February 15, 2024 2:56 PM
> To: Andrew Pinski 
> Cc: gcc-patches@gcc.gnu.org; Tamar Christina 
> Subject: Re: [PATCH] aarch64: Improve PERM<{0}, a, ...> (64bit) by adding 
> whole
> vector shift right [PR113872]
> 
> Andrew Pinski  writes:
> > The backend currently defines a whole vector shift left for 64bit vectors, 
> > adding
> the
> > shift right can also improve code for some PERMs too. So this adds that 
> > pattern.
> 
> Is this reversed?  It looks like we have the shift right and the patch is
> adding the shift left (at least in GCC internal and little-endian terms).
> 
> But on many Arm cores, EXT has a higher throughput than SHL, so I don't think
> we should do this unconditionally.

Yeah, on most (if not all) Arm cores the EXT has higher throughput than SHL
and on Cortex-A75 the EXT has both higher throughput and lower latency.

I guess the expected gain here is that we wouldn't need to create the zero
vector.  However on modern Arm cores the zero vector creation is free using
movi, and EXT being a three-operand instruction also means we only need one
copy if it is e.g. in a loop.
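
As an illustration (not the PR's testcase), a permute of this kind with a
zeroed operand at the intrinsics level would be:

#include <arm_neon.h>

/* Pull a zero element in from a zeroed vector: { 0, a[0], a[1], a[2] }.
   With EXT the zero operand comes from a movi and the three-operand form
   lets the result go straight into the destination register.  */
int16x4_t
shift_in_zero (int16x4_t a)
{
  return vext_s16 (vdup_n_s16 (0), a, 3);
}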

Kind Regards,
Tamar

> 
> Thanks,
> Richard
> 
> >
> > I added a testcase for the shift left also. I also fixed the instruction 
> > template
> > there which was using a space instead of a tab after the instruction.
> >
> > Built and tested on aarch64-linux-gnu.
> >
> > PR target/113872
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md (vec_shr_):
> Use tab instead of space after
> > the instruction in the template.
> > (vec_shl_): New pattern
> > * config/aarch64/iterators.md (unspec): Add UNSPEC_VEC_SHL
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/perm_zero-1.c: New test.
> > * gcc.target/aarch64/perm_zero-2.c: New test.
> >
> > Signed-off-by: Andrew Pinski 
> > ---
> >  gcc/config/aarch64/aarch64-simd.md | 18 --
> >  gcc/config/aarch64/iterators.md|  1 +
> >  gcc/testsuite/gcc.target/aarch64/perm_zero-1.c | 15 +++
> >  gcc/testsuite/gcc.target/aarch64/perm_zero-2.c | 15 +++
> >  4 files changed, 47 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/perm_zero-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/perm_zero-2.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> > index f8bb973a278..0d2f1ea3902 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1592,9 +1592,23 @@ (define_insn "vec_shr_"
> >"TARGET_SIMD"
> >{
> >  if (BYTES_BIG_ENDIAN)
> > -  return "shl %d0, %d1, %2";
> > +  return "shl\t%d0, %d1, %2";
> >  else
> > -  return "ushr %d0, %d1, %2";
> > +  return "ushr\t%d0, %d1, %2";
> > +  }
> > +  [(set_attr "type" "neon_shift_imm")]
> > +)
> > +(define_insn "vec_shl_"
> > +  [(set (match_operand:VD 0 "register_operand" "=w")
> > +(unspec:VD [(match_operand:VD 1 "register_operand" "w")
> > +   (match_operand:SI 2 "immediate_operand" "i")]
> > +  UNSPEC_VEC_SHL))]
> > +  "TARGET_SIMD"
> > +  {
> > +if (BYTES_BIG_ENDIAN)
> > +  return "ushr\t%d0, %d1, %2";
> > +else
> > +  return "shl\t%d0, %d1, %2";
> >}
> >[(set_attr "type" "neon_shift_imm")]
> >  )
> > diff --git a/gcc/config/aarch64/iterators.md 
> > b/gcc/config/aarch64/iterators.md
> > index 99cde46f1ba..3aebe9cf18a 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -758,6 +758,7 @@ (define_c_enum "unspec"
> >  UNSPEC_PMULL; Used in aarch64-simd.md.
> >  UNSPEC_PMULL2   ; Used in aarch64-simd.md.
> >  UNSPEC_REV_REGLIST  ; Used in aarch64-simd.md.
> > +UNSPEC_VEC_SHL  ; Used in aarch64-simd.md.
> >  UNSPEC_VEC_SHR  ; Used in aarch64-simd.md.
> >  UNSPEC_SQRDMLAH ; Used in aarch64-simd.md.
> >  UNSPEC_SQRDMLSH ; Used in aarch64-simd.md.
> > diff --git a/gcc/testsuite/gcc.target/aarch64/perm_zero-1.c
> b/gcc/testsuite/gcc.target/aarch64/perm_zero-1.c
> > new file mode 100644
> > 

RE: [PATCH]AArch64: xfail modes_1.f90 [PR107071]

2024-02-15 Thread Tamar Christina
> -Original Message-
> From: Richard Earnshaw (lists) 
> Sent: Thursday, February 15, 2024 11:01 AM
> To: Tamar Christina ; gcc-patches@gcc.gnu.org
> Cc: nd ; Marcus Shawcroft ; Kyrylo
> Tkachov ; Richard Sandiford
> 
> Subject: Re: [PATCH]AArch64: xfail modes_1.f90 [PR107071]
> 
> On 15/02/2024 10:57, Tamar Christina wrote:
> > Hi All,
> >
> > This test has never worked on AArch64 since the day it was committed.  It 
> > has
> > a number of issues that prevent it from working on AArch64:
> >
> > 1.  IEEE does not require that FP operations raise a SIGFPE for FP 
> > operations,
> >     only that an exception is raised somehow.
> >
> > 2. Most Arm designed cores don't raise SIGFPE and instead set a status 
> > register
> >    and some partner cores raise a SIGILL instead.
> >
> > 3. The way it checks for feenableexcept doesn't really work for AArch64.
> >
> > As such this test doesn't seem to really provide much value on AArch64 so we
> > should just xfail it.
> >
> > Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> 
> Wouldn't it be better to just skip the test.  XFAIL just adds clutter to 
> verbose output
> and suggests that someday the tools might be fixed for this case.
> 
> Better still would be a new dg-requires fp_exceptions_raise_sigfpe as a guard 
> for
> the test.

There seems to be check_effective_target_fenv_exceptions which tests whether
the target can raise FP exceptions.  I'll see if that works.

Thanks,
Tamar

> 
> R.
> 
> >
> > Thanks,
> > Tamar
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR fortran/107071
> >     * gfortran.dg/ieee/modes_1.f90: xfail aarch64.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
> b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
> > index
> 205c47f38007d06116289c19d6b23cf3bf83bd48..3667571969427ae7b2b9668
> 4ec1af8b3fdd4985f 100644
> > --- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
> > +++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
> > @@ -1,4 +1,4 @@
> > -! { dg-do run }
> > +! { dg-do run { xfail { aarch64*-*-* } } }
> >  !
> >  ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES
> >
> >
> >
> >
> >
> > --



[PATCH]AArch64: xfail modes_1.f90 [PR107071]

2024-02-15 Thread Tamar Christina
Hi All,

This test has never worked on AArch64 since the day it was committed.  It has
a number of issues that prevent it from working on AArch64:

1.  IEEE does not require that FP operations raise SIGFPE, only that an
    exception is raised somehow.

2. Most Arm designed cores don't raise SIGFPE and instead set a status register
   and some partner cores raise a SIGILL instead.

3. The way it checks for feenableexcept doesn't really work for AArch64.

As such this test doesn't seem to really provide much value on AArch64 so we
should just xfail it.

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR fortran/107071
* gfortran.dg/ieee/modes_1.f90: xfail aarch64.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 
b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
index 
205c47f38007d06116289c19d6b23cf3bf83bd48..3667571969427ae7b2b96684ec1af8b3fdd4985f
 100644
--- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
+++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
@@ -1,4 +1,4 @@
-! { dg-do run }
+! { dg-do run { xfail { aarch64*-*-* } } }
 !
 ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES
 




-- 
diff --git a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90 
b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
index 
205c47f38007d06116289c19d6b23cf3bf83bd48..3667571969427ae7b2b96684ec1af8b3fdd4985f
 100644
--- a/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
+++ b/gcc/testsuite/gfortran.dg/ieee/modes_1.f90
@@ -1,4 +1,4 @@
-! { dg-do run }
+! { dg-do run { xfail { aarch64*-*-* } } }
 !
 ! Test IEEE_MODES_TYPE, IEEE_GET_MODES and IEEE_SET_MODES
 





RE: [PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..

2024-02-15 Thread Tamar Christina
Hi,  this I a new version of the patch updating some additional tests
because some of the LTO tests required a newer binutils than my distro had.

---

The Arm Architecture Reference Manual (Version J.a, section A2.9 on FEAT_LS64)
shows that ls64 is an optional extension and should not be enabled by default
for Armv8.7-a.

This drops it from the mandatory bits for the architecture and brings GCC in
line with LLVM and the architecture.

Note that we will not be changing binutils to preserve compatibility with older
released compilers.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? and backport to GCC 13,12,11?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from
Armv8.7-a.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/acle/ls64.C: Add +ls64.
* g++.target/aarch64/acle/ls64_lto.C: Likewise.
* gcc.target/aarch64/acle/ls64_lto.c: Likewise.
* gcc.target/aarch64/acle/pr110100.c: Likewise.
* gcc.target/aarch64/acle/pr110132.c: Likewise.
* gcc.target/aarch64/options_set_28.c: Drop check for nols64.
* gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-arches.def 
b/gcc/config/aarch64/aarch64-arches.def
index 
b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e
 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a,   V8_3A, 
8,  (V8_2A, PAUTH, R
 AARCH64_ARCH("armv8.4-a", generic_armv8_a,   V8_4A, 8,  (V8_3A, 
F16FML, DOTPROD, FLAGM))
 AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 8,  (V8_4A, SB, 
SSBS, PREDRES))
 AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, I8MM, 
BF16))
-AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A, LS64))
+AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A))
 AARCH64_ARCH("armv8.8-a", generic_armv8_a,   V8_8A, 8,  (V8_7A, MOPS))
 AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A))
 AARCH64_ARCH("armv8-r",   generic_armv8_a,   V8R  , 8,  (V8_4A))
diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C 
b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
index 
d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d
 100644
--- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C
+++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 #include 
 int main()
 {
diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C 
b/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C
index 
274a4771e1c1d13bcb1a7bdc77c2e499726f024c..0198fe2a1b78627b873bf22e3d8416dbdcc77078
 100644
--- a/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C
+++ b/gcc/testsuite/g++.target/aarch64/acle/ls64_lto.C
@@ -1,5 +1,5 @@
 /* { dg-do link { target aarch64_asm_ls64_ok } } */
-/* { dg-additional-options "-march=armv8.7-a -flto" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64 -flto" } */
 #include 
 int main()
 {
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c 
b/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c
index 
8b4f24277717675badc39dd145d365f75f5ceb27..0e5ae0b052b50b08d35151f4bc113617c1569bd3
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/ls64_lto.c
@@ -1,5 +1,5 @@
 /* { dg-do link { target aarch64_asm_ls64_ok } } */
-/* { dg-additional-options "-march=armv8.7-a -flto" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64 -flto" } */
 #include 
 int main(void)
 {
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
index 
f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-march=armv8.7-a -O2" } */
+/* { dg-options "-march=armv8.7-a+ls64 -O2" } */
 #include 
 void do_st64b(data512_t data) {
   __arm_st64b((void*)0x1000, data);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
index 
fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 
 /* Check that ls64 builtins can be invoked using a preprocesed testcase
without triggering bogus builtin warnings, 

RE: [PATCH]AArch64: update vget_set_lane_1.c test output

2024-02-15 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Thursday, February 1, 2024 4:42 PM
> To: Tamar Christina 
> Cc: Andrew Pinski ; gcc-patches@gcc.gnu.org; nd
> ; Richard Earnshaw ; Marcus
> Shawcroft ; Kyrylo Tkachov
> 
> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> 
> Tamar Christina  writes:
> >> -Original Message-
> >> From: Richard Sandiford 
> >> Sent: Thursday, February 1, 2024 2:24 PM
> >> To: Andrew Pinski 
> >> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd
> >> ; Richard Earnshaw ; Marcus
> >> Shawcroft ; Kyrylo Tkachov
> >> 
> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> >>
> >> Andrew Pinski  writes:
> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina 
> >> wrote:
> >> >>
> >> >> Hi All,
> >> >>
> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1
> instead
> >> of an INS
> >> >>
> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> >> >>
> >> >> This is because the non-Q variant for indices 0 and 1 are just 
> >> >> shuffling values.
> >> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such 
> >> >> just
> >> update the
> >> >> test file.
> >> > Hmm, is this true on all cores? I suspect there is a core out there
> >> > where INS is implemented with a much lower latency than ZIP.
> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> >> > while ZIP is 6 cycles (3/7 for q versions).
> >> > Now I don't have any invested interest in that core any more but I
> >> > just wanted to point out that is not exactly true for all cores.
> >>
> >> Thanks for the pointer.  In that case, perhaps we should prefer
> >> aarch64_evpc_ins over aarch64_evpc_zip in
> aarch64_expand_vec_perm_const_1?
> >> That's enough to fix this failure, but it'll probably require other
> >> tests to be adjusted...
> >
> > I think given that Thundex-X is a 10 year old micro-architecture that is 
> > several
> cases where
> > often used instructions have very high latencies that generic codegen 
> > should not
> be blocked
> > from progressing because of it.
> >
> > we use zips in many things and if thunderx codegen is really of that much
> importance then I
> > think the old codegen should be gated behind -mcpu=thunderx rather than
> preventing generic
> > changes.
> 
> But you said there was no perf difference between INS and ZIP, so it
> sounds like for all known cases, using INS rather than ZIP is either
> neutral or better.
> 
> There's also the possible secondary benefit that the INS patterns use
> standard RTL operations whereas the ZIP patterns use unspecs.
> 
> Keeping ZIP seems OK there's a specific reason to prefer it over INS for
> more modern cores though.

Ok, that's a fair point.  Doing some due diligence, Neoverse-E1 and
Cortex-A65 SWoGs seem to imply that their ZIPs have better throughput
than INSs. However the entries are inconsistent and I can't measure the
difference so I believe this to be a documentation bug.

That said, switching the operands seems to show one issue in that preferring
INS degenerates code in cases where we are inserting the top bits of the first
parameter into the bottom of the second parameter and returning,

Zip being a Three operand instruction allows us to put the result into the final
destination register with one operation whereas INS requires an fmov:

foo_uzp1_s32:
ins v0.s[1], v1.s[0]
fmovd0, d0
ret
foo_uzp2_s32:
ins v1.s[0], v0.s[1]
fmovd0, d1
ret

I've posted uzp but zip has the same issue.
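
For reference, a hypothetical source-level reproducer of that shape (not the
actual vget_set_lane_1.c code) would be:

#include <arm_neon.h>

/* The result takes the odd lanes of the two inputs, i.e. UZP2 semantics:
   with the ZIP/UZP patterns this is a single uzp2, with INS preferred it
   becomes the ins + fmov sequence shown above.  */
int32x2_t
foo_uzp2_s32 (int32x2_t a, int32x2_t b)
{
  return __builtin_shufflevector (a, b, 1, 3);
}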

So I guess it's not better to flip the order but perhaps I should add a case to
the zip/unzip RTL patterns for when op0 == op1?

Thanks,
Tamar
> 
> Thanks,
> Richard



[PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..

2024-02-14 Thread Tamar Christina
Hi All,

The Arm Architecture Reference Manual (Version J.a, section A2.9 on FEAT_LS64)
shows that ls64 is an optional extension and should not be enabled by default
for Armv8.7-a.

This drops it from the mandatory bits for the architecture and brings GCC in
line with LLVM and the architecture.

Note that we will not be changing binutils to preserve compatibility with older
released compilers.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? and backport to GCC 13,12,11?

Thanks,
Tamar

gcc/ChangeLog:

* config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from
Armv8.7-a.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/acle/ls64.C: Add +ls64.
* gcc.target/aarch64/acle/pr110100.c: Likewise.
* gcc.target/aarch64/acle/pr110132.c: Likewise.
* gcc.target/aarch64/options_set_28.c: Drop check for nols64.
* gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-arches.def 
b/gcc/config/aarch64/aarch64-arches.def
index 
b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e
 100644
--- a/gcc/config/aarch64/aarch64-arches.def
+++ b/gcc/config/aarch64/aarch64-arches.def
@@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a,   V8_3A, 
8,  (V8_2A, PAUTH, R
 AARCH64_ARCH("armv8.4-a", generic_armv8_a,   V8_4A, 8,  (V8_3A, 
F16FML, DOTPROD, FLAGM))
 AARCH64_ARCH("armv8.5-a", generic_armv8_a,   V8_5A, 8,  (V8_4A, SB, 
SSBS, PREDRES))
 AARCH64_ARCH("armv8.6-a", generic_armv8_a,   V8_6A, 8,  (V8_5A, I8MM, 
BF16))
-AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A, LS64))
+AARCH64_ARCH("armv8.7-a", generic_armv8_a,   V8_7A, 8,  (V8_6A))
 AARCH64_ARCH("armv8.8-a", generic_armv8_a,   V8_8A, 8,  (V8_7A, MOPS))
 AARCH64_ARCH("armv8.9-a", generic_armv8_a,   V8_9A, 8,  (V8_8A))
 AARCH64_ARCH("armv8-r",   generic_armv8_a,   V8R  , 8,  (V8_4A))
diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C 
b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
index 
d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d
 100644
--- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C
+++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 #include 
 int main()
 {
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
index 
f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-march=armv8.7-a -O2" } */
+/* { dg-options "-march=armv8.7-a+ls64 -O2" } */
 #include 
 void do_st64b(data512_t data) {
   __arm_st64b((void*)0x1000, data);
diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c 
b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
index 
fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711
 100644
--- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
+++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv8.7-a" } */
+/* { dg-additional-options "-march=armv8.7-a+ls64" } */
 
 /* Check that ls64 builtins can be invoked using a preprocesed testcase
without triggering bogus builtin warnings, see PR110132.
diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_28.c 
b/gcc/testsuite/gcc.target/aarch64/options_set_28.c
index 
9e63768581e9d429e9408863942051b1b04761ac..d5b15f8bc5831de56fe667179d83d9c853529aaf
 100644
--- a/gcc/testsuite/gcc.target/aarch64/options_set_28.c
+++ b/gcc/testsuite/gcc.target/aarch64/options_set_28.c
@@ -1,9 +1,9 @@
 /* { dg-do compile } */
-/* { dg-additional-options "-march=armv9.3-a+nopredres+nols64+nomops" } */
+/* { dg-additional-options "-march=armv9.3-a+nopredres+nomops" } */
 
 int main ()
 {
   return 0;
 }
 
-/* { dg-final { scan-assembler-times {\.arch 
armv9\.3\-a\+crc\+nopredres\+nols64\+nomops\n} 1 } } */
+/* { dg-final { scan-assembler-times {\.arch 
armv9\.3\-a\+crc\+nopredres\+nomops\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c 
b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
index 
2d76bfc23dfdcd78a74ec0e4845a3bd8d110b010..d8fc86d1557895f91ffe8be2f65d6581abe51568
 100644
--- a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c
@@ -242,8 +242,8 @@
 
 #pragma GCC push_options
 #pragma GCC target ("arch=armv8.7-a")
-#ifndef __ARM_FEATURE_LS64
-#error "__ARM_FEATURE_LS64 is not defined but should be!"
+#ifdef __ARM_FEATURE_LS64
+#error 

RE: [PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Tamar Christina
> 
> I think this isn't entirely good.  For simple cases for
> do {} while the condition ends up in the latch while for while () {}
> loops it ends up in the header.  In your case the latch isn't empty
> so it doesn't end up with the conditional.
> 
> I think your patch is OK to the point of looking at all loop exit
> sources but you should elide the special-casing of header and
> latch since it's really only exit conditionals that matter.
> 

That makes sense, since in both cases the edges are in the respective
blocks.  Should have thought about it more.

So how about this one.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for 
annotations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-novect_gcond.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
new file mode 100644
index 
..01e69cbef9d51b234c08a400c78dc078d53252f1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
@@ -0,0 +1,39 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+#pragma GCC novector
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+#pragma GCC novector
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -320,12 +320,9 @@ replace_loop_annotate (void)
 
   for (auto loop : loops_list (cfun, 0))
 {
-  /* First look into the header.  */
-  replace_loop_annotate_in_block (loop->header, loop);
-
-  /* Then look into the latch, if any.  */
-  if (loop->latch)
-   replace_loop_annotate_in_block (loop->latch, loop);
+  /* Check all exit source blocks for annotations.  */
+  for (auto e : get_loop_exit_edges (loop))
+   replace_loop_annotate_in_block (e->src, loop);
 
   /* Push the global flag_finite_loops state down to individual loops.  */
   loop->finite_p = flag_finite_loops;




[PATCH]middle-end: inspect all exits for additional annotations for loop.

2024-02-14 Thread Tamar Christina
Hi All,

Attaching a pragma to a loop which has a complex condition often gets the pragma
dropped. e.g.

#pragma GCC novector
  while (i < N && parse_tables_n--)

before lowering this is represented as:

 if (ANNOTATE_EXPR ) ...

But after lowering the condition is broken apart and attached to the final
component of the expression:

  if (parse_tables_n.2_2 != 0) goto ; else goto ;
  :
iftmp.1D.4452 = 1;
goto ;
  :
iftmp.1D.4452 = 0;
  :
D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0);
if (D.4451 != 0) goto ; else goto ;
  :

and it's never heard from again because during replace_loop_annotate we only
inspect the loop header and latch for annotations.

Since annotations were supposed to apply to the loop as a whole this fixes it
by also checking the loop exit src blocks for annotations.
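
For contrast, a sketch (not part of the patch; the function name is made up)
of the simple shape where the existing header/latch scan was already enough,
since the single exit test, and with it the .ANNOTATE call, stays in the loop
header:

/* A single-condition loop keeps its annotated exit test in the loop header,
   so replace_loop_annotate already found it before this change.  The
   compound condition above is what pushes the annotated test into a
   separate exit source block.  */
void
clear_first_n (int *table, int n)
{
#pragma GCC novector
  for (int i = 0; i < n; i++)
    table[i] = 0;
}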

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for 
annotations.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-novect_gcond.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c 
b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
new file mode 100644
index 
..01e69cbef9d51b234c08a400c78dc078d53252f1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c
@@ -0,0 +1,39 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+#pragma GCC novector
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+#pragma GCC novector
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index 
cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc
 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -327,6 +327,10 @@ replace_loop_annotate (void)
   if (loop->latch)
replace_loop_annotate_in_block (loop->latch, loop);
 
+  /* Then also check all other exits.  */
+  for (auto e : get_loop_exit_edges (loop))
+   replace_loop_annotate_in_block (e->src, loop);
+
   /* Push the global flag_finite_loops state down to individual loops.  */
   loop->finite_p = flag_finite_loops;
 }






[PATCH]middle-end: update vector loop upper bounds when early break vect [PR113734]

2024-02-13 Thread Tamar Christina
Hi All,

When doing early break vectorization we should treat the final iteration as
possibly being partial.  This is so that when we calculate the vector loop
upper bounds we take into account that the final iteration could have done
some work.

The attached testcase shows that if we don't, then cunroll may unroll the loop,
and if the upper bound is wrong we lose a vector iteration.

This is similar to how we adjust the scalar loop bounds for the PEELED case.
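
As a hand-written illustration (not from the patch) of what "partial" means
here: with a vectorization factor of 4, the loop below can take its early
break partway through a group of four elements, so the final vector iteration
still did some useful work even though it did not cover a full vector's worth
of scalar iterations.

int a[16], b[16];

/* Illustration only: the first mismatch can occur mid-vector, so the last
   vector iteration handles a partial group; an upper bound that only counts
   full iterations would be off by one.  */
int
first_mismatch (void)
{
  for (int i = 0; i < 16; i++)
    if (a[i] != b[i])
      return i;
  return -1;
}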

Bootstrapped Regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113734
* tree-vect-loop.cc (vect_transform_loop): Treat the final iteration of
an early break loop as partial.

gcc/testsuite/ChangeLog:

PR tree-optimization/113734
* gcc.dg/vect/vect-early-break_117-pr113734.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c
new file mode 100644
index 
..36ae09483dfd426f977a3d92cf24a78d76de6961
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_117-pr113734.c
@@ -0,0 +1,37 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break_hw } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-O3" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+#define N 306
+#define NEEDLE 136
+
+int table[N];
+
+__attribute__ ((noipa))
+int foo (int i, unsigned short parse_tables_n)
+{
+  parse_tables_n >>= 9;
+  parse_tables_n += 11;
+  while (i < N && parse_tables_n--)
+table[i++] = 0;
+
+  return table[NEEDLE];
+}
+
+int main ()
+{
+  check_vect ();
+
+  for (int j = 0; j < N; j++)
+table[j] = -1;
+
+  if (foo (0, 0x) != 0)
+__builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
854e9d78bc71721e6559a6bc5dff78c813603a78..0b1656fef2fed83f30295846c382ad9fb318454a
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -12171,7 +12171,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
   /* True if the final iteration might not handle a full vector's
  worth of scalar iterations.  */
   bool final_iter_may_be_partial
-= LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo);
+= LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+  || LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
  is 1 when peeling for gaps because we always need a final scalar
  iteration.  */







RE: [PATCH]middle-end: add two debug counters for early-break vectorization debugging

2024-02-08 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Thursday, February 8, 2024 2:16 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: add two debug counters for early-break
> vectorization debugging
> 
> On Thu, 8 Feb 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This adds two new debug counter to aid in debugging early break code.
> >
> > - vect_force_last_exit: when reached will always force the final loop exit.
> > - vect_skip_exit: when reached will skip selecting the current candidate 
> > exit
> >   as the loop exit.
> >
> > The first counter essentially allows you to turn off the PEELED case and the
> > second counter to pick a different exit, which may mean you pick no exit at
> > all.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * dbgcnt.def (vect_force_last_exit, vect_skip_exit): New.
> > * tree-vect-loop.cc (vec_init_loop_exit_info): Use them.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
> > index
> ed9f062eac2c28c52df76b39d4312dd9fde1c800..8f7bebf93fceabdf6ae86c2df5
> 91eae4848b8a5c 100644
> > --- a/gcc/dbgcnt.def
> > +++ b/gcc/dbgcnt.def
> > @@ -213,5 +213,7 @@ DEBUG_COUNTER (stv_conversion)
> >  DEBUG_COUNTER (tail_call)
> >  DEBUG_COUNTER (tree_sra)
> >  DEBUG_COUNTER (treepre_insert)
> > +DEBUG_COUNTER (vect_force_last_exit)
> >  DEBUG_COUNTER (vect_loop)
> > +DEBUG_COUNTER (vect_skip_exit)
> >  DEBUG_COUNTER (vect_slp)
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> eed2268e9bae7e7ad36d13da03e0b54eab26ef6f..854e9d78bc71721e6559a6bc
> 5dff78c813603a78 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -58,6 +58,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "tree-eh.h"
> >  #include "case-cfn-macros.h"
> >  #include "langhooks.h"
> > +#include "dbgcnt.h"
> >
> >  /* Loop Vectorization Pass.
> >
> > @@ -977,6 +978,20 @@ vec_init_loop_exit_info (class loop *loop)
> >if (exits.length () == 1)
> >  return exits[0];
> >
> > +  /* Check to see if we've been asked to force the last exit.  */
> > +  if (!dbg_cnt (vect_force_last_exit))
> > +{
> > +  basic_block bb = ip_normal_pos (loop);
> > +  if (!bb)
> > +   return NULL;
> > +
> > +  edge exit = EDGE_SUCC (bb, 0);
> > +  if (exit->dest == loop->latch)
> > +   return EDGE_SUCC (bb, 1);
> > +
> > +  return exit;
> 
> Err, that's quite odd.  Why not just below do
> 
> > +}
> > +
> >/* If we have multiple exits we only support counting IV at the moment.
> >   Analyze all exits and return the last one we can analyze.  */
> >class tree_niter_desc niter_desc;
> > @@ -998,6 +1013,7 @@ vec_init_loop_exit_info (class loop *loop)
> >&& exit->src == single_pred (loop->latch)
> >&& (integer_nonzerop (may_be_zero)
> >|| COMPARISON_CLASS_P (may_be_zero
> > + && dbg_cnt (vect_skip_exit)
> 
>   && (dbg_cnt (vect_force_last_exit)
>   || exit->src == single_pred (loop->latch))
> 
> (also computed above already)?  It's also oddly named, it's more like
> vect_allow_peeled_exit or so.

Because this isn't deterministic: if a loop has n exits, the above always forces
you to pick the final one regardless of n, rather than just skipping
consideration of an exit.

And in that case is there a point in analyzing all the exits just to throw away
the information?

Doing it inside the consideration check would only skip one exit, unless I'm
misunderstanding.

> 
> It's also seemingly redundant with vect_skip_exit, no?
> 
> Note the counter gets incremented even if we'd not consider the exit
> because we have a later candidate already.
> 
> I fear it's going to be quite random even with the debug counter.

It is; I think the first counter is more useful.  But in general the reason I
kept the second counter, which kinda does what was suggested in the RFC I sent
before, was that it should in theory at least allow us to test forcing of a
PEELED case, since we generally prefer the non-PEELED case if possible.

At least that was the intention.

Thanks,
Tamar

> 
> Can you see whether it really helps you?
> 
> >   && (!candidate
> >   || dominated_by_p (CDI_DOMINATORS, exit->src,
> >  candidate->src)))
> >
> >
> >
> >
> >
> 
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH]middle-end: add two debug counters for early-break vectorization debugging

2024-02-08 Thread Tamar Christina
Hi All,

This adds two new debug counters to aid in debugging early break code.

- vect_force_last_exit: when reached will always force the final loop exit.
- vect_skip_exit: when reached will skip selecting the current candidate exit
  as the loop exit.

The first counter essentially allows you to turn off the PEELED case and the
second counter to pick a different exit, which may mean you pick no exit at
all.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* dbgcnt.def (vect_force_last_exit, vect_skip_exit): New.
* tree-vect-loop.cc (vec_init_loop_exit_info): Use them.

--- inline copy of patch -- 
diff --git a/gcc/dbgcnt.def b/gcc/dbgcnt.def
index 
ed9f062eac2c28c52df76b39d4312dd9fde1c800..8f7bebf93fceabdf6ae86c2df591eae4848b8a5c
 100644
--- a/gcc/dbgcnt.def
+++ b/gcc/dbgcnt.def
@@ -213,5 +213,7 @@ DEBUG_COUNTER (stv_conversion)
 DEBUG_COUNTER (tail_call)
 DEBUG_COUNTER (tree_sra)
 DEBUG_COUNTER (treepre_insert)
+DEBUG_COUNTER (vect_force_last_exit)
 DEBUG_COUNTER (vect_loop)
+DEBUG_COUNTER (vect_skip_exit)
 DEBUG_COUNTER (vect_slp)
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
eed2268e9bae7e7ad36d13da03e0b54eab26ef6f..854e9d78bc71721e6559a6bc5dff78c813603a78
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -58,6 +58,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "case-cfn-macros.h"
 #include "langhooks.h"
+#include "dbgcnt.h"
 
 /* Loop Vectorization Pass.
 
@@ -977,6 +978,20 @@ vec_init_loop_exit_info (class loop *loop)
   if (exits.length () == 1)
 return exits[0];
 
+  /* Check to see if we've been asked to force the last exit.  */
+  if (!dbg_cnt (vect_force_last_exit))
+{
+  basic_block bb = ip_normal_pos (loop);
+  if (!bb)
+   return NULL;
+
+  edge exit = EDGE_SUCC (bb, 0);
+  if (exit->dest == loop->latch)
+   return EDGE_SUCC (bb, 1);
+
+  return exit;
+}
+
   /* If we have multiple exits we only support counting IV at the moment.
  Analyze all exits and return the last one we can analyze.  */
   class tree_niter_desc niter_desc;
@@ -998,6 +1013,7 @@ vec_init_loop_exit_info (class loop *loop)
   && exit->src == single_pred (loop->latch)
   && (integer_nonzerop (may_be_zero)
   || COMPARISON_CLASS_P (may_be_zero
+ && dbg_cnt (vect_skip_exit)
  && (!candidate
  || dominated_by_p (CDI_DOMINATORS, exit->src,
 candidate->src)))









RE: [PATCH]middle-end: don't cache restart_loop in vectorizable_live_operations [PR113808]

2024-02-08 Thread Tamar Christina
> Please either drop lastprivate(k) clause or use linear(k:1)
> The iteration var of simd loop without collapse or with
> collapse(1) is implicitly linear with the step, and even linear
> means the value from the last iteration can be used after the
> simd construct.  Overriding the data sharing to something different
> has been only added recently to OpenMP and isn't really needed here.
> 

Sorry, I know very little about Fortran; is this ok?

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/113808
* gfortran.dg/vect/vect-early-break_1-PR113808.f90: Moved to...
* gfortran.dg/vect/vect-early-break_1-pr113808.f90: ...here.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 
b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90
similarity index 93%
rename from gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90
rename to gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90
index 
5c339fa7a348fac5527bbbf456a535da96b5c1ed..6f92e9095bdee08a5a9db2816f57da6c14d91b11
 100644
--- a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90
+++ b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-pr113808.f90
@@ -9,7 +9,7 @@ program main
   integer :: n, i,k
   n = 11
   do i = 1, n,2
-!$omp simd lastprivate(k)
+!$omp simd
 do k = 1, i + 41
   if (k > 11 + 41 .or. k < 1) error stop
 end do




[PATCH]middle-end: don't cache restart_loop in vectorizable_live_operations [PR113808]

2024-02-08 Thread Tamar Christina
Hi All,

There's a bug in vectorizable_live_operation in that restart_loop is defined
outside the loop.

This variable is supposed to indicate whether we are doing a first or last
index reduction.  The problem is that by defining it outside the loop it becomes
dependent on the order we visit the USE/DEFs.

In the given example, the loop isn't PEELED, but we visit the early exit uses
first.  This then sets the boolean to true and it can't get to false again.

So when we visit the main exit we still treat it as an early exit for that
SSA name.

This cleans it up and renames the variables to something that's hopefully
clearer to their intention.
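
To make the order dependence concrete, here is a small self-contained sketch
(illustrative names only, not GCC code) of the sticky-flag pattern being
removed:

#include <stdbool.h>
#include <stdio.h>

/* A flag hoisted out of the use-walking loop can only go from false to true,
   so visiting an early-exit use first makes every later use, including the
   main-exit one, look like an early-exit use too.  */
int
main (void)
{
  bool use_is_early_exit[3] = { true, false, false };
  bool sticky = false;   /* defined outside the loop, like restart_loop was */

  for (int i = 0; i < 3; i++)
    {
      sticky = sticky || use_is_early_exit[i];
      printf ("use %d handled as early exit: %d\n", i, sticky);
    }
  /* Prints 1 for all three uses even though only the first one is early;
     computing the condition per use, as the patch does, avoids this.  */
  return 0;
}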

Bootstrapped Regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113808
* tree-vect-loop.cc (vectorizable_live_operation): Don't cache the 
value cross iterations.

gcc/testsuite/ChangeLog:

PR tree-optimization/113808
* gfortran.dg/vect/vect-early-break_1-PR113808.f90: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90 
b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90
new file mode 100644
index 
..5c339fa7a348fac5527bbbf456a535da96b5c1ed
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/vect/vect-early-break_1-PR113808.f90
@@ -0,0 +1,21 @@
+! { dg-add-options vect_early_break }
+! { dg-require-effective-target vect_early_break }
+! { dg-require-effective-target vect_long_long }
+! { dg-additional-options "-fopenmp-simd" }
+
+! { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } }
+
+program main
+  integer :: n, i,k
+  n = 11
+  do i = 1, n,2
+!$omp simd lastprivate(k)
+do k = 1, i + 41
+  if (k > 11 + 41 .or. k < 1) error stop
+end do
+  end do
+  if (k /= 53) then
+print *, k, 53
+error stop
+  endif
+end
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
190df9ec7741fd05aa0b9abe150baf06b2ca9a57..eed2268e9bae7e7ad36d13da03e0b54eab26ef6f
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10950,7 +10950,7 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
 did.  For the live values we want the value at the start of the 
iteration
 rather than at the end.  */
   edge main_e = LOOP_VINFO_IV_EXIT (loop_vinfo);
-  bool restart_loop = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
+  bool all_exits_as_early_p = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED 
(loop_vinfo);
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
if (!is_gimple_debug (use_stmt)
&& !flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
@@ -10966,8 +10966,7 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
  /* For early exit where the exit is not in the BB that leads
 to the latch then we're restarting the iteration in the
 scalar loop.  So get the first live value.  */
- restart_loop = restart_loop || !main_exit_edge;
- if (restart_loop
+ if ((all_exits_as_early_p || !main_exit_edge)
  && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def)
{
  tmp_vec_lhs = vec_lhs0;





[PATCH][committed]middle-end: fix pointer conversion error in testcase vect-early-break_110-pr113467.c

2024-02-08 Thread Tamar Christina
Hi All,

I had missed a conversion from unsigned long to uint64_t.
This fixes the failing test on -m32.

Regtested on x86_64-pc-linux-gnu with -m32 and no issues.

Committed as obvious.

Thanks,
Tamar

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_110-pr113467.c: Change unsigned long *
to uint64_t *.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
index 
1e2c47be5fdf1e1fed88e4b5f45d7eda6c3b85d1..12d0ea1e871b51742c040c909ea5741bc820206e
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
@@ -10,7 +10,7 @@
 typedef struct gcry_mpi *gcry_mpi_t;
 struct gcry_mpi {
   int nlimbs;
-  unsigned long *d;
+  uint64_t *d;
 };
 
 long gcry_mpi_add_ui_up;









RE: [PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]

2024-02-05 Thread Tamar Christina
> It looks like LOOP_VINFO_EARLY_BRK_STORES is "reverse"?  Is that
> why you are doing gsi_move_before + gsi_prev?  Why do gsi_prev
> at all?
> 

As discussed on IRC, how about this one.
Incremental building passed all tests and bootstrap is running.

Ok for master if bootstrap and regtesting clean?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113731
* gimple-iterator.cc (gsi_move_before): Take new parameter for update
method.
* gimple-iterator.h (gsi_move_before): Default new param to
GSI_SAME_STMT.
* tree-vect-loop.cc (move_early_exit_stmts): Call gsi_move_before with
GSI_NEW_STMT.

gcc/testsuite/ChangeLog:

PR tree-optimization/113731
* gcc.dg/vect/vect-early-break_111-pr113731.c: New test.

--- inline copy of patch ---

diff --git a/gcc/gimple-iterator.cc b/gcc/gimple-iterator.cc
index 
517c53376f0511af59e124f52ec7be566a6c4789..f67bcfbfdfdd7c6cb0ad0130972f5b1dc4429bcf
 100644
--- a/gcc/gimple-iterator.cc
+++ b/gcc/gimple-iterator.cc
@@ -666,10 +666,11 @@ gsi_move_after (gimple_stmt_iterator *from, 
gimple_stmt_iterator *to)
 
 
 /* Move the statement at FROM so it comes right before the statement
-   at TO.  */
+   at TO using method M.  */
 
 void
-gsi_move_before (gimple_stmt_iterator *from, gimple_stmt_iterator *to)
+gsi_move_before (gimple_stmt_iterator *from, gimple_stmt_iterator *to,
+gsi_iterator_update m = GSI_SAME_STMT)
 {
   gimple *stmt = gsi_stmt (*from);
   gsi_remove (from, false);
@@ -677,7 +678,7 @@ gsi_move_before (gimple_stmt_iterator *from, 
gimple_stmt_iterator *to)
   /* For consistency with gsi_move_after, it might be better to have
  GSI_NEW_STMT here; however, that breaks several places that expect
  that TO does not change.  */
-  gsi_insert_before (to, stmt, GSI_SAME_STMT);
+  gsi_insert_before (to, stmt, m);
 }
 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
new file mode 100644
index 
..2d6db91df97625a7f11609d034e89af0461129b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+char* inet_net_pton_ipv4_bits;
+char inet_net_pton_ipv4_odst;
+void __errno_location();
+void inet_net_pton_ipv4();
+void inet_net_pton() { inet_net_pton_ipv4(); }
+void inet_net_pton_ipv4(char *dst, int size) {
+  while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) {
+if (size-- <= 0)
+  goto emsgsize;
+*dst++ = '\0';
+  }
+emsgsize:
+  __errno_location();
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
30b90d99925bea74caf14833d8ab1695607d0fe9..9aba94bd6ca2061a19487ac4a2735a16d03bcbee
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11800,8 +11800,7 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
dump_printf_loc (MSG_NOTE, vect_location, "moving stmt %G", stmt);
 
   gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt);
-  gsi_move_before (&stmt_gsi, &dest_gsi);
-  gsi_prev (&dest_gsi);
+  gsi_move_before (&stmt_gsi, &dest_gsi, GSI_NEW_STMT);
 }
 
   /* Update all the stmts with their new reaching VUSES.  */




RE: [PATCH]middle-end: add additional runtime test for [PR113467]

2024-02-05 Thread Tamar Christina
> > Ok for master?
> 
> I think you need a lp64 target check for the large constants or
> alternatively use uint64_t?
> 

Ok, how about this one.

Regtested on x86_64-pc-linux-gnu with -m32,-m64 and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/113467
* gcc.dg/vect/vect-early-break_110-pr113467.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
new file mode 100644
index 
..1e2c47be5fdf1e1fed88e4b5f45d7eda6c3b85d1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
@@ -0,0 +1,52 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_long_long } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+#include 
+
+typedef struct gcry_mpi *gcry_mpi_t;
+struct gcry_mpi {
+  int nlimbs;
+  unsigned long *d;
+};
+
+long gcry_mpi_add_ui_up;
+void gcry_mpi_add_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned v) {
+  gcry_mpi_add_ui_up = *w->d;
+  if (u) {
+uint64_t *res_ptr = w->d, *s1_ptr = w->d;
+int s1_size = u->nlimbs;
+unsigned s2_limb = v, x = *s1_ptr++;
+s2_limb += x;
+*res_ptr++ = s2_limb;
+if (x)
+  while (--s1_size) {
+x = *s1_ptr++ + 1;
+*res_ptr++ = x;
+if (x) {
+  break;
+}
+  }
+  }
+}
+
+int main()
+{
+  check_vect ();
+
+  static struct gcry_mpi sv;
+  static uint64_t vals[] = {4294967288ULL, 191ULL,4160749568ULL, 
4294963263ULL,
+127ULL,4294950912ULL, 255ULL,
4294901760ULL,
+534781951ULL,  33546240ULL,   4294967292ULL, 
4294960127ULL,
+4292872191ULL, 4294967295ULL, 4294443007ULL, 3ULL};
+  gcry_mpi_t v = &sv;
+  v->nlimbs = 16;
+  v->d = vals;
+
+  gcry_mpi_add_ui(v, v, 8);
+  if (v->d[1] != 192)
+__builtin_abort();
+}




RE: [PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]

2024-02-05 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, February 5, 2024 1:22 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: fix ICE when moving statements to empty BB
> [PR113731]
> 
> On Mon, 5 Feb 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > We use gsi_move_before (_gsi, _gsi); to request that the new
> statement
> > be placed before any other statement.  Typically this then moves the current
> > pointer to be after the statement we just inserted.
> >
> > However it looks like when the BB is empty, this does not happen and the CUR
> > pointer stays NULL.   There's a comment in the source of gsi_insert_before 
> > that
> > explains:
> >
> > /* If CUR is NULL, we link at the end of the sequence (this case happens
> >
> > so it adds it to the end instead of start like you asked.  This means that 
> > in
> > this case there's nothing to move and so we shouldn't move the pointer if 
> > we're
> > already at the HEAD.
> 
> The issue is that a gsi_end_p () is ambiguous, it could be the start
> or the end.  gsi_insert_before treats it as "end" while gsi_insert_after
> treats it as "start" since you can't really insert "after" the "end".
> 
> gsi_move_before doesn't update the insertion pointer (using
> GSI_SAME_STMT), so with a gsi_end_p () you get what you ask for.
> 
> Btw,
> 
>   /* Move all stmts that need moving.  */
>   basic_block dest_bb = LOOP_VINFO_EARLY_BRK_DEST_BB (loop_vinfo);
>   gimple_stmt_iterator dest_gsi = gsi_start_bb (dest_bb);
> 
> should probably use gsi_after_labels (dest_bb) just in case.

See next patch.

> 
> It looks like LOOP_VINFO_EARLY_BRK_STORES is "reverse"?  Is that
> why you are doing gsi_move_before + gsi_prev?  Why do gsi_prev
> at all?
> 

Yes, it stores them in reverse because we record them from the latch on up.
So we either have to iterate backwards, insert them at the front, or move the gsi.

I guess I could remove it by removing the for-each loop and iterating in
reverse.  Is that preferred?

Tamar.

> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/113731
> > * tree-vect-loop.cc (move_early_exit_stmts): Conditionally move pointer.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/113731
> > * gcc.dg/vect/vect-early-break_111-pr113731.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
> > new file mode 100644
> > index
> ..2d6db91df97625a7f1160
> 9d034e89af0461129b2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
> > @@ -0,0 +1,21 @@
> > +/* { dg-do compile } */
> > +/* { dg-add-options vect_early_break } */
> > +/* { dg-require-effective-target vect_early_break } */
> > +/* { dg-require-effective-target vect_int } */
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +
> > +char* inet_net_pton_ipv4_bits;
> > +char inet_net_pton_ipv4_odst;
> > +void __errno_location();
> > +void inet_net_pton_ipv4();
> > +void inet_net_pton() { inet_net_pton_ipv4(); }
> > +void inet_net_pton_ipv4(char *dst, int size) {
> > +  while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) {
> > +if (size-- <= 0)
> > +  goto emsgsize;
> > +*dst++ = '\0';
> > +  }
> > +emsgsize:
> > +  __errno_location();
> > +}
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> 30b90d99925bea74caf14833d8ab1695607d0fe9..e2587315020a35a7d4ebd3e
> 7a9842caa36bb5d3c 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -11801,7 +11801,8 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
> >
> >gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt);
> >gsi_move_before (_gsi, _gsi);
> > -  gsi_prev (_gsi);
> > +  if (!gsi_end_p (dest_gsi))
> > +   gsi_prev (_gsi);
> >  }
> >
> >/* Update all the stmts with their new reaching VUSES.  */
> >
> >
> >
> >
> >
> 
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH]middle-end: fix ICE when destination BB for stores starts with a label [PR113750]

2024-02-05 Thread Tamar Christina
Hi All,

The report shows that if the FE leaves a label as the first thing in the dest
BB then we ICE because we move the stores before the label.

This is easy to fix if we know that there's still only one way into the BB.
We would have already rejected the loop if there were multiple paths into the BB;
however, I added an additional check (with an explanation) just for early break,
in case the other constraints are relaxed later.

After that we fix the issue just by getting the GSI after the labels and I add
a bunch of testcases for different positions the label can be added.  Only the
vect-early-break_112-pr113750.c one results in the label being kept.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113750
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Check
for single predecessor when doing early break vect.
* tree-vect-loop.cc (move_early_exit_stmts): Get gsi at the start but
after labels.

gcc/testsuite/ChangeLog:

PR tree-optimization/113750
* gcc.dg/vect/vect-early-break_112-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_113-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_114-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_115-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_116-pr113750.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c
new file mode 100644
index 
..559ebd84d5c39881e694e7c8c31be29d846866ed
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_112-pr113750.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#ifndef N
+#define N 800
+#endif
+unsigned vect_a[N];
+unsigned vect_b[N];
+
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_b[i] = x + i;
+   if (vect_a[i] != x)
+ break;
+foo:
+   vect_a[i] = x;
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c
new file mode 100644
index 
..ba85780a46b1378aaec238ff9eb5f906be9a44dd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_113-pr113750.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#ifndef N
+#define N 800
+#endif
+unsigned vect_a[N];
+unsigned vect_b[N];
+
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_b[i] = x + i;
+   if (vect_a[i] != x)
+ break;
+   vect_a[i] = x;
+foo:
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c
new file mode 100644
index 
..37af2998688f5d60e2cdb372ab43afcaa52a3146
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_114-pr113750.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#ifndef N
+#define N 800
+#endif
+unsigned vect_a[N];
+unsigned vect_b[N];
+
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_b[i] = x + i;
+foo:
+   if (vect_a[i] != x)
+ break;
+   vect_a[i] = x;
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c
new file mode 100644
index 
..502686d308e298cd84e9e3b74d7b4ad1979602a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_115-pr113750.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#ifndef N
+#define N 800
+#endif
+unsigned vect_a[N];
+unsigned vect_b[N];
+
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+foo:
+   vect_b[i] = x + i;
+   if (vect_a[i] != x)
+ break;
+   vect_a[i] = x;
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_116-pr113750.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_116-pr113750.c
new file mode 

[PATCH]middle-end: fix ICE when moving statements to empty BB [PR113731]

2024-02-05 Thread Tamar Christina
Hi All,

We use gsi_move_before (&stmt_gsi, &dest_gsi); to request that the new statement
be placed before any other statement.  Typically this then moves the current
pointer to be after the statement we just inserted.

However it looks like when the BB is empty, this does not happen and the CUR
pointer stays NULL.   There's a comment in the source of gsi_insert_before that
explains:

/* If CUR is NULL, we link at the end of the sequence (this case happens

so it adds it to the end instead of start like you asked.  This means that in
this case there's nothing to move and so we shouldn't move the pointer if we're
already at the HEAD.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113731
* tree-vect-loop.cc (move_early_exit_stmts): Conditionally move pointer.

gcc/testsuite/ChangeLog:

PR tree-optimization/113731
* gcc.dg/vect/vect-early-break_111-pr113731.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
new file mode 100644
index 
..2d6db91df97625a7f11609d034e89af0461129b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_111-pr113731.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+char* inet_net_pton_ipv4_bits;
+char inet_net_pton_ipv4_odst;
+void __errno_location();
+void inet_net_pton_ipv4();
+void inet_net_pton() { inet_net_pton_ipv4(); }
+void inet_net_pton_ipv4(char *dst, int size) {
+  while ((inet_net_pton_ipv4_bits > dst) & inet_net_pton_ipv4_odst) {
+if (size-- <= 0)
+  goto emsgsize;
+*dst++ = '\0';
+  }
+emsgsize:
+  __errno_location();
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
30b90d99925bea74caf14833d8ab1695607d0fe9..e2587315020a35a7d4ebd3e7a9842caa36bb5d3c
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11801,7 +11801,8 @@ move_early_exit_stmts (loop_vec_info loop_vinfo)
 
   gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt);
   gsi_move_before (&stmt_gsi, &dest_gsi);
-  gsi_prev (&dest_gsi);
+  if (!gsi_end_p (dest_gsi))
+   gsi_prev (&dest_gsi);
 }
 
   /* Update all the stmts with their new reaching VUSES.  */









[PATCH]middle-end: add additional runtime test for [PR113467]

2024-02-05 Thread Tamar Christina
Hi All,

This just adds an additional runtime testcase for the fixed issue.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/113467
* gcc.dg/vect/vect-early-break_110-pr113467.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
new file mode 100644
index 
..2d8a071c0e922ccfd5fa8c7b2704852dbd95
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_110-pr113467.c
@@ -0,0 +1,51 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include "tree-vect.h"
+
+typedef struct gcry_mpi *gcry_mpi_t;
+struct gcry_mpi {
+  int nlimbs;
+  unsigned long *d;
+};
+
+long gcry_mpi_add_ui_up;
+void gcry_mpi_add_ui(gcry_mpi_t w, gcry_mpi_t u, unsigned v) {
+  gcry_mpi_add_ui_up = *w->d;
+  if (u) {
+unsigned long *res_ptr = w->d, *s1_ptr = w->d;
+int s1_size = u->nlimbs;
+unsigned s2_limb = v, x = *s1_ptr++;
+s2_limb += x;
+*res_ptr++ = s2_limb;
+if (x)
+  while (--s1_size) {
+x = *s1_ptr++ + 1;
+*res_ptr++ = x;
+if (x) {
+  break;
+}
+  }
+  }
+}
+
+int main()
+{
+  check_vect ();
+
+  static struct gcry_mpi sv;
+  static unsigned long vals[] = {4294967288, 191,4160749568, 
4294963263,
+ 127,4294950912, 255,
4294901760,
+ 534781951,  33546240,   4294967292, 
4294960127,
+ 4292872191, 4294967295, 4294443007, 3};
+  gcry_mpi_t v = &sv;
+  v->nlimbs = 16;
+  v->d = vals;
+
+  gcry_mpi_add_ui(v, v, 8);
+  if (v->d[1] != 192)
+__builtin_abort();
+}









RE: [PATCH]middle-end: check memory accesses in the destination block [PR113588].

2024-02-01 Thread Tamar Christina
> >
> > If the above is correct then I think I understand what you're saying and
> > will update the patch and do some Checks.
> 
> Yes, I think that's what I wanted to say.
> 

As discussed:

Bootstrapped Regtested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu no 
issues.
Also checked both with --enable-lto --with-build-config='bootstrap-O3 
bootstrap-lto' --enable-multilib
and --enable-lto --with-build-config=bootstrap-O3 
--enable-checking=release,yes,rtl,extra;
and checked the libcrypt testsuite as reported on PR113467.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113588
PR tree-optimization/113467
	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence): Choose
	correct dest and fix checks.
(vect_analyze_early_break_dependences): Update comments.

gcc/testsuite/ChangeLog:

PR tree-optimization/113588
PR tree-optimization/113467
* gcc.dg/vect/vect-early-break_108-pr113588.c: New test.
* gcc.dg/vect/vect-early-break_109-pr113588.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c
new file mode 100644
index 
..e488619c9aac41fafbcf479818392a6bb7c6924f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+int foo (const char *s, unsigned long n)
+{
+ unsigned long len = 0;
+ while (*s++ && n--)
+   ++len;
+ return len;
+}
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c
new file mode 100644
index 
..488c19d3ede809631d1a7ede0e7f7bcdc7a1ae43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c
@@ -0,0 +1,44 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target mmap } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include 
+#include 
+
+#include "tree-vect.h"
+
+__attribute__((noipa))
+int foo (const char *s, unsigned long n)
+{
+ unsigned long len = 0;
+ while (*s++ && n--)
+   ++len;
+ return len;
+}
+
+int main()
+{
+
+  check_vect ();
+
+  long pgsz = sysconf (_SC_PAGESIZE);
+  void *p = mmap (NULL, pgsz * 3, PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
+  if (p == MAP_FAILED)
+return 0;
+  mprotect (p, pgsz, PROT_NONE);
+  mprotect (p+2*pgsz, pgsz, PROT_NONE);
+  char *p1 = p + pgsz;
+  p1[0] = 1;
+  p1[1] = 0;
+  foo (p1, 1000);
+  p1 = p + 2*pgsz - 2;
+  p1[0] = 1;
+  p1[1] = 0;
+  foo (p1, 1000);
+  return 0;
+}
+
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 
f592aeb8028afd4fd70e2175104efab2a2c0d82e..53fdfc25d7dc2deb7788176252697d2e45fc
 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -619,10 +619,10 @@ vect_analyze_data_ref_dependence (struct 
data_dependence_relation *ddr,
   return opt_result::success ();
 }
 
-/* Funcion vect_analyze_early_break_dependences.
+/* Function vect_analyze_early_break_dependences.
 
-   Examime all the data references in the loop and make sure that if we have
-   mulitple exits that we are able to safely move stores such that they become
+   Examine all the data references in the loop and make sure that if we have
+   multiple exits that we are able to safely move stores such that they become
safe for vectorization.  The function also calculates the place where to 
move
the instructions to and computes what the new vUSE chain should be.
 
@@ -639,7 +639,7 @@ vect_analyze_data_ref_dependence (struct 
data_dependence_relation *ddr,
  - Multiple loads are allowed as long as they don't alias.
 
NOTE:
- This implemementation is very conservative. Any overlappig loads/stores
+ This implementation is very conservative. Any overlapping loads/stores
  that take place before the early break statement gets rejected aside from
  WAR dependencies.
 
@@ -668,7 +668,6 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
   auto_vec bases;
   basic_block dest_bb = NULL;
 
-  hash_set  visited;
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   class loop *loop_nest = loop_outer (loop);
 
@@ -677,19 +676,33 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
 "loop contains multiple exits, analyzing"
 " statement dependencies.\n");
 
+  if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
+if (dump_enabled_p ())
+  dump_printf_loc (MSG_NOTE, vect_location,
+

RE: [PATCH]AArch64: update vget_set_lane_1.c test output

2024-02-01 Thread Tamar Christina
> -Original Message-
> From: Richard Sandiford 
> Sent: Thursday, February 1, 2024 2:24 PM
> To: Andrew Pinski 
> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd
> ; Richard Earnshaw ; Marcus
> Shawcroft ; Kyrylo Tkachov
> 
> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
> 
> Andrew Pinski  writes:
> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina 
> wrote:
> >>
> >> Hi All,
> >>
> >> In the vget_set_lane_1.c test the following entries now generate a zip1 
> >> instead
> of an INS
> >>
> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
> >>
> >> This is because the non-Q variant for indices 0 and 1 are just shuffling 
> >> values.
> >> There is no perf difference between INS SIMD to SIMD and ZIP, as such just
> update the
> >> test file.
> > Hmm, is this true on all cores? I suspect there is a core out there
> > where INS is implemented with a much lower latency than ZIP.
> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
> > while ZIP is 6 cycles (3/7 for q versions).
> > Now I don't have any invested interest in that core any more but I
> > just wanted to point out that is not exactly true for all cores.
> 
> Thanks for the pointer.  In that case, perhaps we should prefer
> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
> That's enough to fix this failure, but it'll probably require other
> tests to be adjusted...

I think given that ThunderX is a 10-year-old micro-architecture where several
often-used instructions have very high latencies, generic codegen should not be
blocked from progressing because of it.

We use zips in many places, and if ThunderX codegen really is that important
then I think the old codegen should be gated behind -mcpu=thunderx rather than
preventing generic changes.

Regards,
Tamar.

> 
> Richard


[PATCH]AArch64: update vget_set_lane_1.c test output

2024-02-01 Thread Tamar Christina
Hi All,

In the vget_set_lane_1.c test the following entries now generate a zip1 instead 
of an INS

BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)

This is because the non-Q variant for indices 0 and 1 are just shuffling values.
There is no perf difference between INS SIMD to SIMD and ZIP, as such just 
update the
test file.
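
For reference, the operation these BUILD_TEST entries boil down to is a lane
copy along the lines of the following (a rough sketch using ACLE intrinsics,
not the exact testsuite macro):

  #include <arm_neon.h>

  /* Copy lane 0 of b into lane 1 of a.  For the 64-bit (non-Q) modes with
     lane indices 0 and 1 this is just a shuffle of the two inputs, so both
     ins and zip1 are valid implementations.  */
  float32x2_t
  copy_lane (float32x2_t a, float32x2_t b)
  {
    return vset_lane_f32 (vget_lane_f32 (b, 0), a, 1);
  }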

Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vget_set_lane_1.c: Update test output.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c 
b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
index 
07a77de319206c5c6dad1c0d2d9bcc998583f9c1..a3978f68e4ff5899f395a98615a5e86c3b1389cb
 100644
--- a/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/vget_set_lane_1.c
@@ -22,7 +22,7 @@ BUILD_TEST (uint16x4_t, uint16x4_t, , , u16, 3, 2)
 BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
 BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
 BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
-/* { dg-final { scan-assembler-times "ins\\tv0.s\\\[1\\\], v1.s\\\[0\\\]" 3 } 
} */
+/* { dg-final { scan-assembler-times "zip1\\tv0.2s, v0.2s, v1.2s" 3 } } */
 
 BUILD_TEST (poly8x8_t, poly8x16_t, , q, p8, 7, 15)
 BUILD_TEST (int8x8_t,  int8x16_t,  , q, s8, 7, 15)









[PATCH 2/2][libsanitizer] hwasan: Remove testsuite check for a complaint message [PR112644]

2024-01-31 Thread Tamar Christina
Hi All,

With recent updates to hwasan runtime libraries, the error reporting for
this particular check has been reworked.

I would question why it has lost this message.  To me it looks strange
that num_descriptions_printed is incremented whenever we call
PrintHeapOrGlobalCandidate whether that function prints anything or not.
(See PrintAddressDescription in libsanitizer/hwasan/hwasan_report.cpp).

The message is no longer printed because we increment this
num_descriptions_printed variable indicating that we have found some
description.
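
In other words the reporting flow now looks roughly like the sketch below
(hypothetical, simplified names; not the actual libsanitizer sources):

  static int num_descriptions_printed;

  static void
  print_heap_or_global_candidate (void)
  {
    /* May print nothing at all for some addresses.  */
  }

  static void
  print_address_description (void)
  {
    print_heap_or_global_candidate ();
    /* Incremented whether or not anything was printed above...  */
    num_descriptions_printed++;
    /* ...so this fallback can never be reached any more.  */
    if (!num_descriptions_printed)
      __builtin_printf ("HWAddressSanitizer can not describe address "
                        "in more detail.\n");
  }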

I would like to question this upstream, but it doesn't look like that much of
a problem, and if pressed for time we should just change our testsuite.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR sanitizer/112644
* c-c++-common/hwasan/hwasan-thread-clears-stack.c: Update testcase.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c 
b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c
index 
09c72a56f0f50a8c301d89217aa8c7df70087e6c..6c70684d72a887c49b02ecb17ca097da81a9168f
 100644
--- a/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c
+++ b/gcc/testsuite/c-c++-common/hwasan/hwasan-thread-clears-stack.c
@@ -52,5 +52,4 @@ main (int argc, char **argv)
 
 /* { dg-output "HWAddressSanitizer: tag-mismatch on address 0x\[0-9a-f\]*.*" } 
*/
 /* { dg-output "READ of size 4 at 0x\[0-9a-f\]* tags: 
\[\[:xdigit:\]\]\[\[:xdigit:\]\]/00 \\(ptr/mem\\) in thread T0.*" } */
-/* { dg-output "HWAddressSanitizer can not describe address in more 
detail\..*" } */
 /* { dg-output "SUMMARY: HWAddressSanitizer: tag-mismatch \[^\n\]*.*" } */









[PATCH 1/2][libsanitizer] hwasan: Remove testsuite check for a complaint message [PR112644]

2024-01-31 Thread Tamar Christina
Hi All,

Recent libhwasan updates[1] intercept various string and memory functions.
These functions have checking in them, which means there's no need to
inline the checking.

This patch marks said functions as intercepted, and adjusts a testcase
to handle the difference.  It also looks for HWASAN in a check in
expand_builtin.  This check is originally there to avoid using expand to
inline the behaviour of builtins like memset which are intercepted by
ASAN and for which we hence rely on the call staying a function call.
With the new reliance on function calls in HWASAN we need to do the
same thing for HWASAN too.

However, HWASAN and ASAN don't seem to instrument the same functions.

Looking into 
libsanitizer/sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc
it looks like the common ones are memset, memmove and memcpy.

The rest of the routines for asan seem to be defined in
compiler-rt/lib/asan/asan_interceptors.h however compiler-rt/lib/hwasan/
does not have such a file but it does have
compiler-rt/lib/hwasan/hwasan_platform_interceptors.h which it looks like is
forcing off everything but memset, memmove, memcpy, memcmp and bcmp.

As such I've taken those as the final list that hwasan currently supports.
This also means that on future updates this list should be cross checked.
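
Concretely, with -fsanitize=hwaddress something like the function below (a
minimal sketch, not one of the testsuite files) should now keep its call to
memset so that the intercepted library routine does the checking, instead of
GCC expanding the builtin inline:

  typedef __SIZE_TYPE__ size_t;

  void
  clear_buffer (void *p, size_t n)
  {
    __builtin_memset (p, 0, n);
  }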

[1] 
https://discourse.llvm.org/t/hwasan-question-about-the-recent-interceptors-being-added/75351

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR sanitizer/112644
* asan.h (asan_intercepted_p): Intercept memset, memmove, memcpy and
memcmp.
* builtins.cc (expand_builtin): Include HWASAN when checking for
builtin inlining.

gcc/testsuite/ChangeLog:

PR sanitizer/112644
* c-c++-common/hwasan/builtin-special-handling.c: Update testcase.

Co-Authored-By: Matthew Malcomson 

--- inline copy of patch -- 
diff --git a/gcc/asan.h b/gcc/asan.h
index 
82811bdbe697665652aba89f2ee1c3ac07970df9..d1bf8b1e701b15525c6a900d324f2aebfb778cba
 100644
--- a/gcc/asan.h
+++ b/gcc/asan.h
@@ -185,8 +185,13 @@ extern hash_set *asan_handled_variables;
 inline bool
 asan_intercepted_p (enum built_in_function fcode)
 {
+  /* This list should be kept up-to-date with upstream's version at
+ compiler-rt/lib/hwasan/hwasan_platform_interceptors.h.  */
   if (hwasan_sanitize_p ())
-return false;
+return fcode == BUILT_IN_MEMCMP
+|| fcode == BUILT_IN_MEMCPY
+|| fcode == BUILT_IN_MEMMOVE
+|| fcode == BUILT_IN_MEMSET;
 
   return fcode == BUILT_IN_INDEX
 || fcode == BUILT_IN_MEMCHR
diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 
a0bd82c7981c05caf2764de70c62fe83bef9ad29..12cc7a54e99555d0f4b21fa2cc32ffa7bb548f18
 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -7792,7 +7792,8 @@ expand_builtin (tree exp, rtx target, rtx subtarget, 
machine_mode mode,
   default:
break;
   }
-  if (sanitize_flags_p (SANITIZE_ADDRESS) && asan_intercepted_p (fcode))
+  if (sanitize_flags_p (SANITIZE_ADDRESS | SANITIZE_HWADDRESS)
+ && asan_intercepted_p (fcode))
 return expand_call (exp, target, ignore);
 
   /* When not optimizing, generate calls to library functions for a certain
diff --git a/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c 
b/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c
index 
a7a6d91693ae48c20f33ab28f28d27b01af4722c..f975b1cc397bc0d6fd475dbfed5ccc8ac386
 100644
--- a/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c
+++ b/gcc/testsuite/c-c++-common/hwasan/builtin-special-handling.c
@@ -8,24 +8,24 @@
 /* { dg-skip-if "" { *-*-* }  { "-flto" } { "-flto-partition=none" } } */
 
 typedef __SIZE_TYPE__ size_t;
-/* Functions to observe that HWASAN instruments memory builtins in the expected
-   manner.  */
+/* HWASAN used to instrument calls to memset, memcpy, and memmove.  It no
+   longer does this.  Many other string and memory builtins are intercepted by
+   the runtime (and hence the codegen need not do anything).  */
 void * __attribute__((noinline))
 memset_builtin (void *dest, int value, size_t len)
 {
   return __builtin_memset (dest, value, len);
 }
 
-/* HWASAN avoids strlen because it doesn't know the size of the memory access
-   until *after* the function call.  */
 size_t __attribute__ ((noinline))
 strlen_builtin (char *element)
 {
   return __builtin_strlen (element);
 }
 
-/* First test ensures that the HWASAN_CHECK was emitted before the
-   memset.  Second test ensures there was only HWASAN_CHECK (which demonstrates
-   that strlen was not instrumented).  */
-/* { dg-final { scan-tree-dump-times "HWASAN_CHECK.*memset" 1 "asan1" } } */
-/* { dg-final { scan-tree-dump-times "HWASAN_CHECK" 1 "asan1" } } */
+/* First check here ensures there is no inline instrumentation generated for
+   these builtins.  Second checks that we end up calling memset (i.e. that it's
+   not optimised 

RE: [PATCH][libsanitizer]: Sync fixes for asan interceptors from upstream [PR112644]

2024-01-31 Thread Tamar Christina
> -Original Message-
> From: Andrew Pinski 
> Sent: Monday, January 29, 2024 9:55 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; ja...@redhat.com;
> do...@redhat.com; k...@google.com; dvyu...@google.com
> Subject: Re: [PATCH][libsanitizer]: Sync fixes for asan interceptors from 
> upstream
> [PR112644]
> 
> On Mon, Jan 29, 2024 at 7:04 AM Tamar Christina 
> wrote:
> >
> > Hi All,
> >
> > This cherry-picks and squashes the differences between commits
> >
> >
> d3e5c20ab846303874a2a25e5877c72271fc798b..76e1e45922e6709392fb82aa
> c44bebe3dbc2ea63
> > from LLVM upstream from compiler-rt/lib/hwasan/ to GCC on the changes
> relevant
> > for GCC.
> >
> > This is required to fix the linked PR.
> >
> > As mentioned in the PR the last sync brought in a bug from upstream[1] where
> > operations became non-recoverable and as such the tests in AArch64 started
> > failing.  This cherry picks the fix and there are minor updates needed to 
> > GCC
> > after this to fix the cases.
> >
> > [1] https://github.com/llvm/llvm-project/pull/74000
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> 
> Thanks for handling this; though I wonder how this slipped through
> testing upstream in LLVM. I see they added some new testcases for
> this. I Know GCC's testsuite for sanitizer is slightly different from
> LLVM's. Is it the case, GCC has more tests in this area? Is someone
> adding the testcases that GCC has in this area upstream to LLVM;
> basically so merging won't bring in regressions like this in the
> future?

There were two parts here.  The first one is that their testsuite didn't have
any test for the recovery case, which they've now added.

But the second part (which I'm not posting patches for) is that the change
in hwasan means that the runtime can now instrument some additional
library methods which it couldn't before.  And GCC now needs to not inline
these anymore.

This does mean that on future updates one needs to take a look at the
Instrumentation list and make sure to keep it in sync with GCC's otherwise
we'll lose instrumentation.

Regards,
Tamar
> 
> Thanks,
> Andrew
> 
> >
> > Thanks,
> > Tamar
> >
> > libsanitizer/ChangeLog:
> >
> > PR sanitizer/112644
> > * hwasan/hwasan_interceptors.cpp (ACCESS_MEMORY_RANGE,
> > HWASAN_READ_RANGE, HWASAN_WRITE_RANGE,
> COMMON_SYSCALL_PRE_READ_RANGE,
> > COMMON_SYSCALL_PRE_WRITE_RANGE,
> COMMON_INTERCEPTOR_WRITE_RANGE,
> > COMMON_INTERCEPTOR_READ_RANGE): Make recoverable.
> >
> > --- inline copy of patch --
> > diff --git a/libsanitizer/hwasan/hwasan_interceptors.cpp
> b/libsanitizer/hwasan/hwasan_interceptors.cpp
> > index
> d9237cf9b8e3bf982cf213123ef22e73ec027c9e..96df4dd0c24d7d3db28fa2557
> cf63da0f295e33f 100644
> > --- a/libsanitizer/hwasan/hwasan_interceptors.cpp
> > +++ b/libsanitizer/hwasan/hwasan_interceptors.cpp
> > @@ -36,16 +36,16 @@ struct HWAsanInterceptorContext {
> >const char *interceptor_name;
> >  };
> >
> > -#  define ACCESS_MEMORY_RANGE(ctx, offset, size, access)   
> >  \
> > -do {   
> >  \
> > -  __hwasan::CheckAddressSized > access>((uptr)offset, \
> > -  size);   
> >  \
> > +#  define ACCESS_MEMORY_RANGE(offset, size, access)
> >\
> > +do {   
> >\
> > +  __hwasan::CheckAddressSized > access>((uptr)offset, \
> > +size); 
> >\
> >  } while (0)
> >
> > -#  define HWASAN_READ_RANGE(ctx, offset, size) \
> > -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Load)
> > -#  define HWASAN_WRITE_RANGE(ctx, offset, size) \
> > -ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Store)
> > +#  define HWASAN_READ_RANGE(offset, size) \
> > +ACCESS_MEMORY_RANGE(offset, size, AccessType::Load)
> > +#  define HWASAN_WRITE_RANGE(offset, size) \
> > +ACCESS_MEMORY_RANGE(offset, size, AccessType::Store)
> >
> >  #  if !SANITIZER_APPLE
> >  #define HWASAN_INTERCEPT_FUNC(name)
> > \
> > @@ -74,9 +74,8 @@ struct HWAsanInterceptorContext {
> >
> >  #  if HWASAN_WITH_INTERCEPTORS
> >
> > -#define COMMON_SYSC

RE: [PATCH]middle-end: check memory accesses in the destination block [PR113588].

2024-01-30 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, January 30, 2024 9:51 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: check memory accesses in the destination block
> [PR113588].
> 
> On Mon, 29 Jan 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > When analyzing loads for early break it was always the intention that
> > for the exit where things get moved to we only check the loads that can
> > be reached from the condition.
> 
> Looking at the code I'm a bit confused that we always move to
> single_pred (loop->latch) - IIRC that was different at some point?
> 
> Shouldn't we move stores after the last early exit condition instead?

Yes, it was changed during another PR fix.  The rationale at that time didn't
take into account the peeled case.  It used to be that we would "search" for
the exit to place it in.

At that time the rationale was: well, that doesn't make sense.  It has to go
in the block that is the last to be executed.  With the non-peeled case it's
always the one before the latch.

Or put differently, I think the destination should be the main IV block.  I am
not quite sure I'm following why you want to put the peeled cases inside the
latch block.

Ah, is it because the latch block is only going to be executed when you make a
full iteration?  That makes sense, but then I think we should also analyze the
stores in all blocks (which your change maybe already does, let me check),
since we're also lifting past the final block and need to update the vUSEs
there too.
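
To make sure we're talking about the same shape, a toy early-break loop (a
sketch, not one of the actual testcases):

  void
  toy (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        if (a[i] == 42)    /* early exit, in its own block */
          break;
        b[i] = a[i] + 1;   /* store that has to be sunk */
      }
  }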

If the above is correct then I think I understand what you're saying and will
update the patch and do some checks.

Thanks,
Tamar

> 
> In particular for the peeled case single_pred (loop->latch) is the
> block with the actual early exit condition?  So for that case we'd
> need to move to the latch itself instead?  For non-peeled we move
> to the block with the IV condition which looks OK.
> 
> > However the main loop checks all loads and we skip the destination BB.
> > As such we never actually check the loads reachable from the COND in the
> > last BB unless this BB was also the exit chosen by the vectorizer.
> >
> > This leads us to incorrectly vectorize the loop in the PR and in doing so 
> > access
> > out of bounds.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> 
> The patch ends up with a worklist and another confusing comment
> 
> +  /* For the destination BB we need to only analyze loads reachable from
> the early
> + break statement itself.  */
> 
> But I think it's a downstream issue from the issue above.  That said,
> even for the non-peeled case we need to check ref_within_array_bound,
> no?
> 
> So what about re-doing that initial loop like the following instead
> (and also fix dest_bb, but I'd like clarification here).  Basically
> walk all blocks, do the ref_within_array_bound first and only
> after we've seen 'dest_bb' do the checks required for moving
> stores for all upstream BBs.
> 
> And dest_bb should be
> 
>   /* Move side-effects to the in-loop destination of the last early
>  exit.  */
>   if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> dest_bb = loop->latch;
>   else
> dest_bb = single_pred (loop->latch);
> 
> 
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index f592aeb8028..d6c8910dd6c 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -668,7 +668,6 @@ vect_analyze_early_break_dependences (loop_vec_info
> loop_vinfo)
>auto_vec bases;
>basic_block dest_bb = NULL;
> 
> -  hash_set  visited;
>class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>class loop *loop_nest = loop_outer (loop);
> 
> @@ -681,15 +680,11 @@ vect_analyze_early_break_dependences
> (loop_vec_info loop_vinfo)
>   side-effects to is always the latch connected exit.  When we support
>   general control flow we can do better but for now this is fine.  */
>dest_bb = single_pred (loop->latch);
> -  basic_block bb = dest_bb;
> +  basic_block bb = loop->latch;
> +  bool check_deps = false;
> 
>do
>  {
> -  /* If the destination block is also the header then we have nothing to 
> do.  */
> -  if (!single_pred_p (bb))
> - continue;
> -
> -  bb = single_pred (bb);
>gimple_stmt_iterator gsi = gsi_last_bb (bb);
> 
>/* Now analyze all the remaining statements and try to determine which
> @@ -707,6 +702,25 @@ vect_analyze_early_break_dependences (loop_vec_info
> loop_vi

[PATCH]middle-end: check memory accesses in the destination block [PR113588].

2024-01-29 Thread Tamar Christina
Hi All,

When analyzing loads for early break it was always the intention that for the
exit where things get moved to we only check the loads that can be reached from
the condition.

However the main loop checks all loads and we skip the destination BB.  As such
we never actually check the loads reachable from the COND in the last BB unless
this BB was also the exit chosen by the vectorizer.

This leads us to incorrectly vectorize the loop in the PR and in doing so access
out of bounds.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113588
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences_1): New.
(vect_analyze_data_ref_dependence):  Use it.
(vect_analyze_early_break_dependences): Update comments.

gcc/testsuite/ChangeLog:

PR tree-optimization/113588
* gcc.dg/vect/vect-early-break_108-pr113588.c: New test.
* gcc.dg/vect/vect-early-break_109-pr113588.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c
new file mode 100644
index 
..e488619c9aac41fafbcf479818392a6bb7c6924f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_108-pr113588.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+int foo (const char *s, unsigned long n)
+{
+ unsigned long len = 0;
+ while (*s++ && n--)
+   ++len;
+ return len;
+}
+
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c
new file mode 100644
index 
..488c19d3ede809631d1a7ede0e7f7bcdc7a1ae43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_109-pr113588.c
@@ -0,0 +1,44 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target mmap } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#include 
+#include 
+
+#include "tree-vect.h"
+
+__attribute__((noipa))
+int foo (const char *s, unsigned long n)
+{
+ unsigned long len = 0;
+ while (*s++ && n--)
+   ++len;
+ return len;
+}
+
+int main()
+{
+
+  check_vect ();
+
+  long pgsz = sysconf (_SC_PAGESIZE);
+  void *p = mmap (NULL, pgsz * 3, PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
+  if (p == MAP_FAILED)
+return 0;
+  mprotect (p, pgsz, PROT_NONE);
+  mprotect (p+2*pgsz, pgsz, PROT_NONE);
+  char *p1 = p + pgsz;
+  p1[0] = 1;
+  p1[1] = 0;
+  foo (p1, 1000);
+  p1 = p + 2*pgsz - 2;
+  p1[0] = 1;
+  p1[1] = 0;
+  foo (p1, 1000);
+  return 0;
+}
+
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 
f592aeb8028afd4fd70e2175104efab2a2c0d82e..52cef242a7ce5d0e525bff639fa1dc2f0a6f30b9
 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -619,10 +619,69 @@ vect_analyze_data_ref_dependence (struct 
data_dependence_relation *ddr,
   return opt_result::success ();
 }
 
-/* Funcion vect_analyze_early_break_dependences.
+/* Function vect_analyze_early_break_dependences_1
 
-   Examime all the data references in the loop and make sure that if we have
-   mulitple exits that we are able to safely move stores such that they become
+   Helper function of vect_analyze_early_break_dependences which performs 
safety
+   analysis for load operations in an early break.  */
+
+static opt_result
+vect_analyze_early_break_dependences_1 (data_reference *dr_ref, gimple *stmt)
+{
+  /* We currently only support statically allocated objects due to
+ not having first-faulting loads support or peeling for
+ alignment support.  Compute the size of the referenced object
+ (it could be dynamically allocated).  */
+  tree obj = DR_BASE_ADDRESS (dr_ref);
+  if (!obj || TREE_CODE (obj) != ADDR_EXPR)
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"early breaks only supported on statically"
+" allocated objects.\n");
+  return opt_result::failure_at (stmt,
+"can't safely apply code motion to "
+"dependencies of %G to vectorize "
+"the early exit.\n", stmt);
+}
+
+  tree refop = TREE_OPERAND (obj, 0);
+  tree refbase = get_base_address (refop);
+  if (!refbase || !DECL_P (refbase) || !DECL_SIZE (refbase)
+  || TREE_CODE (DECL_SIZE (refbase)) != INTEGER_CST)
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"early 

[PATCH]AArch64: relax cbranch tests to accepted inverted branches [PR113502]

2024-01-29 Thread Tamar Christina
Hi All,

Recently something in the midend started inverting the branches by inverting
the condition and swapping the branch targets.

While this is fine, it makes it hard to actually test.  In RTL I disable
scheduling and BB reordering to prevent this.  But in GIMPLE there seems to be
nothing I can do.  __builtin_expect seems to have no impact on the change since
I suspect this is happening during expand where conditions can be flipped
regardless of probability during compare_and_branch.

Since the mid-end has plenty of correctness tests, this weakens the backend
tests to just check that a correct looking sequence is emitted.
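
For reference, the loops in these tests have roughly the following shape (a
sketch; the actual f1..f6 bodies are in the test files), for which either
branch polarity is correct:

  #define N 640
  extern int a[N], b[N];

  void
  f1 (void)
  {
    for (int i = 0; i < N; i++)
      {
        b[i] += a[i];
        if (a[i] > 0)
          break;
      }
  }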

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR testsuite/113502
* gcc.target/aarch64/sve/vect-early-break-cbranch.c: Ignore exact 
branch.
* gcc.target/aarch64/vect-early-break-cbranch.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c 
b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c
index 
d15053553f94e7dce3540e21f0c1f0d39ea4f289..d7cef1105410be04ed67d1d3b800746267f205a8
 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch.c
@@ -9,7 +9,7 @@ int b[N] = {0};
 ** ...
 ** cmpgt   p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   \.L[0-9]+
+** b.(any|none)\.L[0-9]+
 ** ...
 */
 void f1 ()
@@ -26,7 +26,7 @@ void f1 ()
 ** ...
 ** cmpge   p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   \.L[0-9]+
+** b.(any|none)\.L[0-9]+
 ** ...
 */
 void f2 ()
@@ -43,7 +43,7 @@ void f2 ()
 ** ...
 ** cmpeq   p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   \.L[0-9]+
+** b.(any|none)\.L[0-9]+
 ** ...
 */
 void f3 ()
@@ -60,7 +60,7 @@ void f3 ()
 ** ...
 ** cmpne   p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   \.L[0-9]+
+** b.(any|none)\.L[0-9]+
 ** ...
 */
 void f4 ()
@@ -77,7 +77,7 @@ void f4 ()
 ** ...
 ** cmplt   p[0-9]+.s, p7/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   .L[0-9]+
+** b.(any|none).L[0-9]+
 ** ...
 */
 void f5 ()
@@ -94,7 +94,7 @@ void f5 ()
 ** ...
 ** cmple   p[0-9]+.s, p[0-9]+/z, z[0-9]+.s, #0
 ** ptest   p[0-9]+, p[0-9]+.b
-** b.any   \.L[0-9]+
+** b.(any|none)\.L[0-9]+
 ** ...
 */
 void f6 ()
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c 
b/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c
index 
a5e7b94827dd70240d754a834f1d11750a9c27a9..673b781eb6d092f6311409797b20a971f4fae247
 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect-early-break-cbranch.c
@@ -15,7 +15,7 @@ int b[N] = {0};
 ** cmgtv[0-9]+.4s, v[0-9]+.4s, #0
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f1 ()
@@ -34,7 +34,7 @@ void f1 ()
 ** cmgev[0-9]+.4s, v[0-9]+.4s, #0
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f2 ()
@@ -53,7 +53,7 @@ void f2 ()
 ** cmeqv[0-9]+.4s, v[0-9]+.4s, #0
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f3 ()
@@ -72,7 +72,7 @@ void f3 ()
 ** cmtst   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f4 ()
@@ -91,7 +91,7 @@ void f4 ()
 ** cmltv[0-9]+.4s, v[0-9]+.4s, #0
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f5 ()
@@ -110,7 +110,7 @@ void f5 ()
 ** cmlev[0-9]+.4s, v[0-9]+.4s, #0
 ** umaxp   v[0-9]+.4s, v[0-9]+.4s, v[0-9]+.4s
 ** fmovx[0-9]+, d[0-9]+
-** cbnzx[0-9]+, \.L[0-9]+
+** cbn?z   x[0-9]+, \.L[0-9]+
 ** ...
 */
 void f6 ()




[PATCH][libsanitizer]: Sync fixes for asan interceptors from upstream [PR112644]

2024-01-29 Thread Tamar Christina
Hi All,

This cherry-picks and squashes the differences between commits

d3e5c20ab846303874a2a25e5877c72271fc798b..76e1e45922e6709392fb82aac44bebe3dbc2ea63
from LLVM upstream from compiler-rt/lib/hwasan/ to GCC on the changes relevant
for GCC.

This is required to fix the linked PR.

As mentioned in the PR the last sync brought in a bug from upstream[1] where
operations became non-recoverable and as such the tests in AArch64 started
failing.  This cherry picks the fix and there are minor updates needed to GCC
after this to fix the cases.

[1] https://github.com/llvm/llvm-project/pull/74000

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

libsanitizer/ChangeLog:

PR sanitizer/112644
* hwasan/hwasan_interceptors.cpp (ACCESS_MEMORY_RANGE,
HWASAN_READ_RANGE, HWASAN_WRITE_RANGE, COMMON_SYSCALL_PRE_READ_RANGE,
COMMON_SYSCALL_PRE_WRITE_RANGE, COMMON_INTERCEPTOR_WRITE_RANGE,
COMMON_INTERCEPTOR_READ_RANGE): Make recoverable.

--- inline copy of patch -- 
diff --git a/libsanitizer/hwasan/hwasan_interceptors.cpp 
b/libsanitizer/hwasan/hwasan_interceptors.cpp
index 
d9237cf9b8e3bf982cf213123ef22e73ec027c9e..96df4dd0c24d7d3db28fa2557cf63da0f295e33f
 100644
--- a/libsanitizer/hwasan/hwasan_interceptors.cpp
+++ b/libsanitizer/hwasan/hwasan_interceptors.cpp
@@ -36,16 +36,16 @@ struct HWAsanInterceptorContext {
   const char *interceptor_name;
 };
 
-#  define ACCESS_MEMORY_RANGE(ctx, offset, size, access)\
-do {\
-  __hwasan::CheckAddressSized((uptr)offset, \
-  size);\
+#  define ACCESS_MEMORY_RANGE(offset, size, access)   \
+do {  \
+  __hwasan::CheckAddressSized((uptr)offset, \
+size);\
 } while (0)
 
-#  define HWASAN_READ_RANGE(ctx, offset, size) \
-ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Load)
-#  define HWASAN_WRITE_RANGE(ctx, offset, size) \
-ACCESS_MEMORY_RANGE(ctx, offset, size, AccessType::Store)
+#  define HWASAN_READ_RANGE(offset, size) \
+ACCESS_MEMORY_RANGE(offset, size, AccessType::Load)
+#  define HWASAN_WRITE_RANGE(offset, size) \
+ACCESS_MEMORY_RANGE(offset, size, AccessType::Store)
 
 #  if !SANITIZER_APPLE
 #define HWASAN_INTERCEPT_FUNC(name)
\
@@ -74,9 +74,8 @@ struct HWAsanInterceptorContext {
 
 #  if HWASAN_WITH_INTERCEPTORS
 
-#define COMMON_SYSCALL_PRE_READ_RANGE(p, s) __hwasan_loadN((uptr)p, 
(uptr)s)
-#define COMMON_SYSCALL_PRE_WRITE_RANGE(p, s) \
-  __hwasan_storeN((uptr)p, (uptr)s)
+#define COMMON_SYSCALL_PRE_READ_RANGE(p, s) HWASAN_READ_RANGE(p, s)
+#define COMMON_SYSCALL_PRE_WRITE_RANGE(p, s) HWASAN_WRITE_RANGE(p, s)
 #define COMMON_SYSCALL_POST_READ_RANGE(p, s) \
   do {   \
 (void)(p);   \
@@ -91,10 +90,10 @@ struct HWAsanInterceptorContext {
 #include "sanitizer_common/sanitizer_syscalls_netbsd.inc"
 
 #define COMMON_INTERCEPTOR_WRITE_RANGE(ctx, ptr, size) \
-  HWASAN_WRITE_RANGE(ctx, ptr, size)
+  HWASAN_WRITE_RANGE(ptr, size)
 
 #define COMMON_INTERCEPTOR_READ_RANGE(ctx, ptr, size) \
-  HWASAN_READ_RANGE(ctx, ptr, size)
+  HWASAN_READ_RANGE(ptr, size)
 
 #define COMMON_INTERCEPTOR_ENTER(ctx, func, ...) \
   HWAsanInterceptorContext _ctx = {#func};   \




[PATCH]AArch64: Do not allow SIMD clones with simdlen 1 [PR113552]

2024-01-24 Thread Tamar Christina
Hi All,

The AArch64 vector PCS does not allow simd calls with simdlen 1,
however due to a bug we currently do allow it for num == 0.

This causes us to emit a symbol that doesn't exist and we fail to link.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master? and for backport to GCC 13,12,11?

Thanks,
Tamar



gcc/ChangeLog:

PR tree-optimization/113552
* config/aarch64/aarch64.cc
(aarch64_simd_clone_compute_vecsize_and_simdlen): Block simdlen 1.

gcc/testsuite/ChangeLog:

PR tree-optimization/113552
* gcc.target/aarch64/pr113552.c: New test.
* gcc.target/aarch64/simd_pcs_attribute-3.c: Remove bogus check.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
e6bd3fd0bb42c70603d5335402b89c9deeaf48d8..a2fc1a5d9d27e9d837e4d616e3feaf38f7272b4f
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28620,7 +28620,8 @@ aarch64_simd_clone_compute_vecsize_and_simdlen (struct 
cgraph_node *node,
   if (known_eq (clonei->simdlen, 0U))
 {
   simdlen = exact_div (poly_uint64 (64), nds_elt_bits);
-  simdlens.safe_push (simdlen);
+  if (known_ne (simdlen, 1U))
+   simdlens.safe_push (simdlen);
   simdlens.safe_push (simdlen * 2);
 }
   else
diff --git a/gcc/testsuite/gcc.target/aarch64/pr113552.c 
b/gcc/testsuite/gcc.target/aarch64/pr113552.c
new file mode 100644
index 
..9c96b061ed2b4fcc57e58925277f74d14f79c51f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr113552.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8-a" } */
+
+__attribute__ ((__simd__ ("notinbranch"), const))
+double cos (double);
+
+void foo (float *a, double *b)
+{
+for (int i = 0; i < 12; i+=3)
+  {
+b[i] = cos (5.0 * a[i]);
+b[i+1] = cos (5.0 * a[i+1]);
+b[i+2] = cos (5.0 * a[i+2]);
+  }
+}
+
+/* { dg-final { scan-assembler-times {bl\t_ZGVnN2v_cos} 6 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c 
b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
index 
95f6a6803e889c02177ef10972962ed62d2095eb..661764b3d4a89e08951a7a3c0495d5b7ba7f0871
 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd_pcs_attribute-3.c
@@ -18,7 +18,5 @@ double foo(double x)
 }
 
 /* { dg-final { scan-assembler-not {\.variant_pcs\tfoo} } } */
-/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM1v_foo} 1 } } */
 /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnM2v_foo} 1 } } */
-/* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN1v_foo} 1 } } */
 /* { dg-final { scan-assembler-times {\.variant_pcs\t_ZGVnN2v_foo} 1 } } */









[PATCH]AArch64: Fix expansion of Advanced SIMD div and mul using SVE [PR109636]

2024-01-24 Thread Tamar Christina
Hi All,

As suggested in the ticket, this replaces the expansion: rather than
converting the Advanced SIMD operands to SVE types through subregs, the SVE
patterns now accept the Advanced SIMD modes and simply print out an SVE
register for these instructions.

This fixes the subreg issues since there are no subregs involved anymore.
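
The kind of code this affects is, roughly, a fixed-width integer division like
the sketch below (not one of the new testcases), which Advanced SIMD has no
instruction for but SVE can handle by treating the Neon register as the low
128 bits of an SVE register:

  void
  div4 (int *restrict a, int *restrict b, int *restrict c)
  {
    for (int i = 0; i < 4; i++)
      a[i] = b[i] / c[i];
  }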

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/109636
* config/aarch64/aarch64-simd.md (div3,
mulv2di3): Remove.
* config/aarch64/iterators.md (VQDIV): Remove.
(SVE_FULL_SDI_SIMD, SVE_FULL_SDI_SIMD_DI, SVE_FULL_HSDI_SIMD_DI,
SVE_I_SIMD_DI): New.
(VPRED, sve_lane_con): Add V4SI and V2DI.
* config/aarch64/aarch64-sve.md (3,
@aarch64_pred_): Support Advanced SIMD types.
(mul3): New, split from 3.
(@aarch64_pred_, *post_ra_3): New.
* config/aarch64/aarch64-sve2.md (@aarch64_mul_lane_,
*aarch64_mul_unpredicated_): Change SVE_FULL_HSDI to
SVE_FULL_HSDI_SIMD_DI.

gcc/testsuite/ChangeLog:

PR target/109636
* gcc.target/aarch64/sve/pr109636_1.c: New test.
* gcc.target/aarch64/sve/pr109636_2.c: New test.
* gcc.target/aarch64/sve2/pr109636_1.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
6f48b4d5f21da9f96a376cd6b34110c2a39deb33..556d0cf359fedf2c28dfe1e0a75e1c12321be68a
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -389,26 +389,6 @@ (define_insn "mul3"
   [(set_attr "type" "neon_mul_")]
 )
 
-;; Advanced SIMD does not support vector DImode MUL, but SVE does.
-;; Make use of the overlap between Z and V registers to implement the V2DI
-;; optab for TARGET_SVE.  The mulvnx2di3 expander can
-;; handle the TARGET_SVE2 case transparently.
-(define_expand "mulv2di3"
-  [(set (match_operand:V2DI 0 "register_operand")
-(mult:V2DI (match_operand:V2DI 1 "register_operand")
-  (match_operand:V2DI 2 "aarch64_sve_vsm_operand")))]
-  "TARGET_SVE"
-  {
-machine_mode sve_mode = VNx2DImode;
-rtx sve_op0 = simplify_gen_subreg (sve_mode, operands[0], V2DImode, 0);
-rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], V2DImode, 0);
-rtx sve_op2 = simplify_gen_subreg (sve_mode, operands[2], V2DImode, 0);
-
-emit_insn (gen_mulvnx2di3 (sve_op0, sve_op1, sve_op2));
-DONE;
-  }
-)
-
 (define_insn "bswap2"
   [(set (match_operand:VDQHSD 0 "register_operand" "=w")
 (bswap:VDQHSD (match_operand:VDQHSD 1 "register_operand" "w")))]
@@ -2678,27 +2658,6 @@ (define_insn "*div3"
   [(set_attr "type" "neon_fp_div_")]
 )
 
-;; SVE has vector integer divisions, unlike Advanced SIMD.
-;; We can use it with Advanced SIMD modes to expose the V2DI and V4SI
-;; optabs to the midend.
-(define_expand "div3"
-  [(set (match_operand:VQDIV 0 "register_operand")
-   (ANY_DIV:VQDIV
- (match_operand:VQDIV 1 "register_operand")
- (match_operand:VQDIV 2 "register_operand")))]
-  "TARGET_SVE"
-  {
-machine_mode sve_mode
-  = aarch64_full_sve_mode (GET_MODE_INNER (mode)).require ();
-rtx sve_op0 = simplify_gen_subreg (sve_mode, operands[0], mode, 0);
-rtx sve_op1 = simplify_gen_subreg (sve_mode, operands[1], mode, 0);
-rtx sve_op2 = simplify_gen_subreg (sve_mode, operands[2], mode, 0);
-
-emit_insn (gen_div3 (sve_op0, sve_op1, sve_op2));
-DONE;
-  }
-)
-
 (define_insn "neg2"
  [(set (match_operand:VHSDF 0 "register_operand" "=w")
(neg:VHSDF (match_operand:VHSDF 1 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 
e1e3c1bd0b7d12eefe43dc95a10716c24e3a48de..eca8623e587af944927a9459e29d5f8af170d347
 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -3789,16 +3789,35 @@ (define_expand "3"
   [(set (match_operand:SVE_I 0 "register_operand")
(unspec:SVE_I
  [(match_dup 3)
-  (SVE_INT_BINARY_IMM:SVE_I
+  (SVE_INT_BINARY_MULTI:SVE_I
 (match_operand:SVE_I 1 "register_operand")
 (match_operand:SVE_I 2 "aarch64_sve__operand"))]
  UNSPEC_PRED_X))]
   "TARGET_SVE"
+  {
+operands[3] = aarch64_ptrue_reg (mode);
+  }
+)
+
+;; Unpredicated integer binary operations that have an immediate form.
+;; Advanced SIMD does not support vector DImode MUL, but SVE does.
+;; Make use of the overlap between Z and V registers to implement the V2DI
+;; optab for TARGET_SVE.  The mulvnx2di3 expander can
+;; handle the TARGET_SVE2 case transparently.
+(define_expand "mul3"
+  [(set (match_operand:SVE_I_SIMD_DI 0 "register_operand")
+   (unspec:SVE_I_SIMD_DI
+ [(match_dup 3)
+  (mult:SVE_I_SIMD_DI
+(match_operand:SVE_I_SIMD_DI 1 "register_operand")
+(match_operand:SVE_I_SIMD_DI 2 "aarch64_sve_vsm_operand"))]
+ UNSPEC_PRED_X))]
+  "TARGET_SVE"
   {
 /* SVE2 supports 

[PATCH]middle-end: rename main_exit_p in reduction code.

2024-01-23 Thread Tamar Christina
Hi All,

This renames main_exit_p to last_val_reduc_p to more accurately
reflect what the value is calculating.

Ok for master if bootstrap passes? Incremental build shows it's fine.

Thanks,
Tamar

gcc/ChangeLog:

* tree-vect-loop.cc (vect_get_vect_def,
vect_create_epilog_for_reduction): Rename main_exit_p to
last_val_reduc_p.

--- inline copy of patch -- 
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
4da1421c8f09746ef4b293573e4f861b642349e1..21a997599f397ba6c2cd15c3b9c8b04513bc0c83
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5892,25 +5892,26 @@ vect_create_partial_epilog (tree vec_def, tree vectype, 
code_helper code,
 }
 
 /* Retrieves the definining statement to be used for a reduction.
-   For MAIN_EXIT_P we use the current VEC_STMTs and otherwise we look at
-   the reduction definitions.  */
+   For LAST_VAL_REDUC_P we use the current VEC_STMTs which correspond to the
+   final value after vectorization and otherwise we look at the reduction
+   definitions to get the first.  */
 
 tree
 vect_get_vect_def (stmt_vec_info reduc_info, slp_tree slp_node,
-  slp_instance slp_node_instance, bool main_exit_p, unsigned i,
-  vec  _stmts)
+  slp_instance slp_node_instance, bool last_val_reduc_p,
+  unsigned i, vec  _stmts)
 {
   tree def;
 
   if (slp_node)
 {
-  if (!main_exit_p)
+  if (!last_val_reduc_p)
 slp_node = slp_node_instance->reduc_phis;
   def = vect_get_slp_vect_def (slp_node, i);
 }
   else
 {
-  if (!main_exit_p)
+  if (!last_val_reduc_p)
reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (reduc_info));
   vec_stmts = STMT_VINFO_VEC_STMTS (reduc_info);
   def = gimple_get_lhs (vec_stmts[0]);
@@ -5982,8 +5983,8 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
  loop-closed PHI of the inner loop which we remember as
  def for the reduction PHI generation.  */
   bool double_reduc = false;
-  bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit
-&& !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
+  bool last_val_reduc_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit
+ && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
   stmt_vec_info rdef_info = stmt_info;
   if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
 {
@@ -6233,7 +6234,7 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
 {
   gimple_seq stmts = NULL;
   def = vect_get_vect_def (rdef_info, slp_node, slp_node_instance,
-  main_exit_p, i, vec_stmts);
+  last_val_reduc_p, i, vec_stmts);
   for (j = 0; j < ncopies; j++)
{
  tree new_def = copy_ssa_name (def);





[PATCH]middle-end: fix epilog reductions when vector iters peeled [PR113364]

2024-01-23 Thread Tamar Christina
Hi All,

This fixes a bug where vect_create_epilog_for_reduction does not handle the
case where all exits are early exits.  In this case we should do what the
induction handling code does and not treat any exit as the main exit.

Bootstrapped Regtested on x86_64-pc-linux-gnu
with --enable-checking=release --enable-lto --with-arch=native
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

This shows that some new miscompiles are happening (stage3 is likely 
miscompiled)
but that's unrelated to this patch and I'll look at it next.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113364
* tree-vect-loop.cc (vect_create_epilog_for_reduction): If all exits are
early exits then we must reduce from the first offset for all of them.

gcc/testsuite/ChangeLog:

PR tree-optimization/113364
* gcc.dg/vect/vect-early-break_107-pr113364.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c
new file mode 100644
index 
..f489265dbfe5eb8fe302dcc34901abaf6e6d5c14
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_107-pr113364.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-w" } */
+
+typedef const unsigned char *It;
+It DecodeSLEB128(It begin, It end, int *v) {
+  int value = 0;
+  unsigned shift = 0;
+  unsigned char byte;
+  do
+  {
+if (begin == end)
+  return begin;
+byte = *(begin++);
+int slice = byte & 0x7f;
+value |= slice << shift;
+  } while (byte >= 128);
+  *v = value;
+  return begin;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
fe631252dc2258e8ea42179b4ba068a480be9e38..4da1421c8f09746ef4b293573e4f861b642349e1
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5982,7 +5982,8 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
  loop-closed PHI of the inner loop which we remember as
  def for the reduction PHI generation.  */
   bool double_reduc = false;
-  bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit;
+  bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit
+&& !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
   stmt_vec_info rdef_info = stmt_info;
   if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
 {









[PATCH]middle-end: remove more usages of single_exit

2024-01-12 Thread Tamar Christina
Hi All,

This replaces two more usages of single_exit that I had missed before.
They both seem to happen when we re-use the ifcvt scalar loop for versioning.

The condition in versioning is the same as the one for when we don't re-use the
scalar loop.

I hit these during an LTO enabled bootstrap now.

Bootstrapped Regtested on aarch64-none-linux-gnu with lto enabled and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-vect-loop-manip.cc (vect_loop_versioning): Replace single_exit.
* tree-vect-loop.cc (vect_transform_loop): Likewise.

--- inline copy of patch -- 
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
0931b18404856f6c33dcae1ffa8d5a350dbd0f8f..0d8c90f69e9693d5d25095e799fbc17a9910779b
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -4051,7 +4051,16 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
   basic_block preheader = loop_preheader_edge (loop_to_version)->src;
   preheader->count = preheader->count.apply_probability (prob * prob2);
   scale_loop_frequencies (loop_to_version, prob * prob2);
-  single_exit (loop_to_version)->dest->count = preheader->count;
+  /* When the loop has multiple exits then we can only version itself.
+   This is denoted by loop_to_version == loop.  In this case we can
+   do the versioning by selecting the exit edge the vectorizer is
+   currently using.  */
+  edge exit_edge;
+  if (loop_to_version == loop)
+   exit_edge = LOOP_VINFO_IV_EXIT (loop_vinfo);
+  else
+   exit_edge = single_exit (loop_to_version);
+  exit_edge->dest->count = preheader->count;
   LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo) = (prob * prob2).invert ();
 
   nloop = scalar_loop;
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
da2dfa176ecd457ebc11d1131302ca15d77d779d..eccf0953bbae2a0e95efba0966c85492e5057b14
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -11910,8 +11910,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple 
*loop_vectorized_call)
  (LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo));
   scale_loop_frequencies (LOOP_VINFO_SCALAR_LOOP (loop_vinfo),
  LOOP_VINFO_SCALAR_LOOP_SCALING (loop_vinfo));
-  single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))->dest->count
-   = preheader->count;
+  LOOP_VINFO_SCALAR_IV_EXIT (loop_vinfo)->dest->count = preheader->count;
 }
 
   if (niters_vector == NULL_TREE)









[PATCH]middle-end testsuite: remove -save-temps from many tests [PR113319]

2024-01-11 Thread Tamar Christina
Hi All,

This removes -save-temps from the tests I've introduced to fix the LTO
mismatches.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issue

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR testsuite/113319
* gcc.dg/bic-bitmask-13.c: Remove -save-temps.
* gcc.dg/bic-bitmask-14.c: Likewise.
* gcc.dg/bic-bitmask-15.c: Likewise.
* gcc.dg/bic-bitmask-16.c: Likewise.
* gcc.dg/bic-bitmask-17.c: Likewise.
* gcc.dg/bic-bitmask-18.c: Likewise.
* gcc.dg/bic-bitmask-19.c: Likewise.
* gcc.dg/bic-bitmask-20.c: Likewise.
* gcc.dg/bic-bitmask-21.c: Likewise.
* gcc.dg/bic-bitmask-22.c: Likewise.
* gcc.dg/bic-bitmask-7.c: Likewise.
* gcc.dg/vect/vect-early-break-run_1.c: Likewise.
* gcc.dg/vect/vect-early-break-run_10.c: Likewise.
* gcc.dg/vect/vect-early-break-run_2.c: Likewise.
* gcc.dg/vect/vect-early-break-run_3.c: Likewise.
* gcc.dg/vect/vect-early-break-run_4.c: Likewise.
* gcc.dg/vect/vect-early-break-run_5.c: Likewise.
* gcc.dg/vect/vect-early-break-run_6.c: Likewise.
* gcc.dg/vect/vect-early-break-run_7.c: Likewise.
* gcc.dg/vect/vect-early-break-run_8.c: Likewise.
* gcc.dg/vect/vect-early-break-run_9.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-13.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
index 
bac86c2cfcebb4fd83eef1ea276026af97bcb096..141b03d6df772e9bdfaaf832287a1e91ebc6be0d
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-13.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-13.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O0 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O0 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-14.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
index 
ec3bd6a7e04de93e60b0a606ec4cabf5bb90af22..59a008c01e22b21cbe4b8d15e411046d7940a7cf
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-14.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-14.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-15.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
index 
8bdf1ea4eb2e5117c6d84b0d6cdf95798c4b8e2c..c28d9b13f4eb300414cdf19ab0550a888b8edeec
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-15.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-15.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-16.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
index 
cfea925b59104ad5c84beea90cea5e6ec9b1e787..f93912f0cc579b3c56e24577b36d755ec3737ed6
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-16.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-16.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-17.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
index 
86873b97f27c5fe6e1495ac0cf3471b7782a8067..f8d651b829b4f3c771bc2db056f15aa385c8302e
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-17.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-17.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-18.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
index 
70bab0c520321ba13c6dd7969d1b51708dc3c71f..d6242fe3c19b8e958e4eca5ae8a633c376f09794
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-18.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-18.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-19.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
index 
c4620dfaad3b8fdbb0ba214bbd69b975f37c68db..aa139da5c1ede2aa422c7e56956051c3b854f983
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-19.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-19.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-20.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
index 
a114122e075eab6be651b4e0954f084a2fd427c9..849eca4e51489b7f68f6695de3389ed5a0697ef2
 100644
--- a/gcc/testsuite/gcc.dg/bic-bitmask-20.c
+++ b/gcc/testsuite/gcc.dg/bic-bitmask-20.c
@@ -1,5 +1,5 @@
 /* { dg-do run } */
-/* { dg-options "-O1 -save-temps -fdump-tree-dce" } */
+/* { dg-options "-O1 -fdump-tree-dce" } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/bic-bitmask-21.c 
b/gcc/testsuite/gcc.dg/bic-bitmask-21.c
index 
bd12a58da1ed5868b78b18742ed9d69289e58a37..9aecd7228523be5d7c4fd13c16833986ada79615

[PATCH]middle-end: make memory analysis for early break more deterministic [PR113135]

2024-01-11 Thread Tamar Christina
Hi All,

Instead of searching for where to move stores to, they should always be in the
exit belonging to the latch.  We can only ever delay stores, and even if we
pick a different exit than the latch one as the main one, effects still
happen in program order when vectorized.  If we don't move the stores to the
latch exit but instead to wherever we pick as the "main" exit then we can
perform incorrect memory accesses (luckily these are trapped by verify_ssa).

We used to iterate over the conds and check the loads and stores inside them.
However, this relies on the conds being ordered in program order.  Additionally,
if there is a basic block between two conds we would not have analyzed it.

Instead, this now walks from the preds of the destination basic block up to the
loop header and analyzes every block along the way.  As a later optimization we
could stop as soon as we've seen all the BBs we have conds for.  For now the
header will always contain the first cond, but this can change when we support
arbitrary control flow.
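
As a rough illustration (a hypothetical example, not one of the testcases
below), the kind of loop this walk has to handle is one where a store sits
between two early-exit conditions, possibly in its own basic block:

int a[1024], b[1024];

int
f (int x, int y)
{
  for (int i = 0; i < 1024; i++)
    {
      if (a[i] == x)      /* First early exit.  */
        return i;
      b[i] = a[i] + 1;    /* Store that must be sunk to the latch exit.  */
      if (a[i] == y)      /* Second early exit.  */
        return -i;
    }
  return 0;
}

Walking the blocks upwards from the predecessor of the latch-connected exit
sees the store no matter how the conds happen to be ordered internally.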

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues normally and with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113135
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Rework
dependency analysis.

gcc/testsuite/ChangeLog:

PR tree-optimization/113135
* gcc.dg/vect/vect-early-break_103-pr113135.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c
new file mode 100644
index 
..bbad7ee2cb18086e470f4a2a2dc0a2b345bbdd71
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_103-pr113135.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-w" } */
+
+char UnpackReadTables_BitLength[20];
+int UnpackReadTables_ZeroCount;
+void UnpackReadTables() {
+  for (unsigned I = 0; I < 20;)
+while (UnpackReadTables_ZeroCount-- &&
+   I < sizeof(UnpackReadTables_BitLength))
+  UnpackReadTables_BitLength[I++] = 0;
+}
diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 
3d9673fb0b580ff21ff151dc5c199840df41a1cd..6b76eee72cb7d09de5f443589b4fc3a0e8c2584f
 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -671,13 +671,18 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
 "loop contains multiple exits, analyzing"
 " statement dependencies.\n");
 
-  for (gimple *c : LOOP_VINFO_LOOP_CONDS (loop_vinfo))
-{
-  stmt_vec_info loop_cond_info = loop_vinfo->lookup_stmt (c);
-  if (STMT_VINFO_TYPE (loop_cond_info) != loop_exit_ctrl_vec_info_type)
-   continue;
+  /* Since we don't support general control flow, the location we'll move the
+ side-effects to is always the latch connected exit.  When we support
+ general control flow we can do better but for now this is fine.  */
+  dest_bb = single_pred (loop->latch);
+  auto_vec<edge> workset;
+  for (auto e: dest_bb->preds)
+workset.safe_push (e);
 
-  gimple_stmt_iterator gsi = gsi_for_stmt (c);
+  while (!workset.is_empty ())
+{
+  basic_block bb = workset.pop ()->src;
+  gimple_stmt_iterator gsi = gsi_last_bb (bb);
 
   /* Now analyze all the remaining statements and try to determine which
 instructions are allowed/needed to be moved.  */
@@ -705,10 +710,10 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "early breaks only supported on statically"
 " allocated objects.\n");
- return opt_result::failure_at (c,
+ return opt_result::failure_at (stmt,
 "can't safely apply code motion to "
 "dependencies of %G to vectorize "
-"the early exit.\n", c);
+"the early exit.\n", stmt);
}
 
  tree refop = TREE_OPERAND (obj, 0);
@@ -720,10 +725,10 @@ vect_analyze_early_break_dependences (loop_vec_info 
loop_vinfo)
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "early breaks only supported on"
 " statically allocated objects.\n");
- return opt_result::failure_at (c,
+ return opt_result::failure_at (stmt,
 "can't safely apply code motion to "
 "dependencies of %G to vectorize 

[PATCH]middle-end: fill in reduction PHI for all alt exits [PR113144]

2024-01-10 Thread Tamar Christina
Hi All,

When we have a loop with more than 2 exits and a reduction, I forgot to fill in
the PHI value for all alternate exits.

All alternate exits use the same PHI value so we should loop over the new
PHI elements and copy the value across since we call the reduction calculation
code only once for all exits.  This was normally covered up by earlier parts of
the compiler rejecting loops incorrectly (which has been fixed now).

Note that while I can use the loop in all cases, the reason I separated out the
main and alt exit is so that if you pass the wrong edge the macro will assert.
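
As a sketch of the affected shape (a hypothetical example, not one of the
testcases below), a reduction that is live after a loop with two alternate
exits needs the same PHI value filled in on every one of those exits:

int a[1024];

int
f (int x, int y)
{
  int sum = 0;
  for (int i = 0; i < 1024; i++)
    {
      if (a[i] == x)    /* Alternate exit 1.  */
        break;
      if (a[i] == y)    /* Alternate exit 2.  */
        break;
      sum += a[i];      /* Reduction whose value is live on every exit.  */
    }
  return sum;
}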

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113178
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Fill in all
alternate exits.

gcc/testsuite/ChangeLog:

PR tree-optimization/113178
* g++.dg/vect/vect-early-break_6-pr113178.cc: New test.
* gcc.dg/vect/vect-early-break_101-pr113178.c: New test.
* gcc.dg/vect/vect-early-break_102-pr113178.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc 
b/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc
new file mode 100644
index 
..da008759a72dd563bf4930decd74470ae35cb98e
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/vect-early-break_6-pr113178.cc
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+struct PixelWeight {
+  int m_SrcStart;
+  int m_Weights[];
+};
+struct CWeightTable {
+  int *GetValueFromPixelWeight(PixelWeight *, int) const;
+};
+char ContinueStretchHorz_dest_scan;
+struct CStretchEngine {
+  bool ContinueStretchHorz();
+  CWeightTable m_WeightTable;
+};
+int *CWeightTable::GetValueFromPixelWeight(PixelWeight *pWeight,
+   int index) const {
+  long __trans_tmp_1;
+  if (index < pWeight->m_SrcStart)
+return __trans_tmp_1 ? &pWeight->m_Weights[pWeight->m_SrcStart] : nullptr;
+}
+bool CStretchEngine::ContinueStretchHorz() {
+  {
+PixelWeight pPixelWeights;
+int dest_g_m;
+for (int j; j; j++) {
+  int pWeight = *m_WeightTable.GetValueFromPixelWeight(&pPixelWeights, j);
+  dest_g_m += pWeight;
+}
+ContinueStretchHorz_dest_scan = dest_g_m;
+  }
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c
new file mode 100644
index 
..8b91112133f0522270bb4d92664355838a405aaf
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_101-pr113178.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+struct PixelWeight {
+  int m_SrcStart;
+  int m_Weights[16];
+};
+char h;
+void f(struct PixelWeight *pPixelWeights) {
+int dest_g_m;
+long tt;
+for (int j = 0; j < 16; j++) {
+  int *p = 0;
+  if (j < pPixelWeights->m_SrcStart)
+p = tt ? &pPixelWeights->m_Weights[0] : 0;
+  int pWeight = *p;
+  dest_g_m += pWeight;
+}
+h = dest_g_m;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c
new file mode 100644
index 
..ad7582e440720e50a2769239c88b1e07517e4c10
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_102-pr113178.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "-std=gnu99 -fpermissive -fgnu89-inline -Ofast 
-fprofile-generate -w" } */
+
+extern int replace_reg_with_saved_mem_i, replace_reg_with_saved_mem_nregs,
+replace_reg_with_saved_mem_mem_1;
+replace_reg_with_saved_mem_mode() {
+  if (replace_reg_with_saved_mem_i)
+return;
+  while (++replace_reg_with_saved_mem_i < replace_reg_with_saved_mem_nregs)
+if (replace_reg_with_saved_mem_i)
+  break;
+  if (replace_reg_with_saved_mem_i)
+if (replace_reg_with_saved_mem_mem_1)
+  adjust_address_1();
+  replace_reg_with_saved_mem_mem_1 ? fancy_abort() : 0;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
27bb28365936978013a576b64b72d9e92375f361..da2dfa176ecd457ebc11d1131302ca15d77d779d
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6223,7 +6223,13 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
  phi = create_phi_node (new_def, exit_bb);
  if (j)
def = gimple_get_lhs (vec_stmts[j]);
- SET_PHI_ARG_DEF (phi, loop_exit->dest_idx, def);
+ if (LOOP_VINFO_IV_EXIT 

RE: [PATCH][testsuite]: Make bitint early vect test more accurate

2024-01-10 Thread Tamar Christina
> But I'm afraid I have no idea how is this supposed to work on
> non-bitint targets or where __BITINT_MAXWIDTH__ is smaller than 9020.
> There is no loop at all there, so what should be vectorized?
> 

Yeah, it was giving an UNRESOLVED result and I didn't notice it in the diff.

> I'd say introduce
> # Return 1 if the target supports _BitInt(65535), 0 otherwise.
> 
> proc check_effective_target_bitint65535 { } {
> return [check_no_compiler_messages bitint65535 object {
> _BitInt (2) a = 1wb;
> unsigned _BitInt (65535) b = 0uwb;
> } "-std=c23"]
> }
> 
> after bitint575 effective target and use it in the test.
>

Sure, how's:

--

This changes the tests I committed for PR113287 to also
run on targets that don't support bitint.

Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no issues and
tests run on both.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* doc/sourcebuild.texi (check_effective_target_bitint65535): New.

gcc/testsuite/ChangeLog:

PR tree-optimization/113287
* gcc.dg/vect/vect-early-break_100-pr113287.c: Support non-bitint.
* gcc.dg/vect/vect-early-break_99-pr113287.c: Likewise.
* lib/target-supports.exp (bitint, bitint128, bitint575, bitint65535):
Document them.

---inline copy of patch ---

diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 
bd62b21f3b725936eae34c22159ccbc9db40873f..6fbb102f9971d54d66d77dcee8f10a1b57aa6e5a
 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -2864,6 +2864,18 @@ Target supports Graphite optimizations.
 @item fixed_point
 Target supports fixed-point extension to C.
 
+@item bitint
+Target supports _BitInt(N).
+
+@item bitint128
+Target supports _BitInt(128).
+
+@item bitint575
+Target supports _BitInt(575).
+
+@item bitint65535
+Target supports _BitInt(65535).
+
 @item fopenacc
 Target supports OpenACC via @option{-fopenacc}.
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
index 
f908e5bc60779c148dc95bda3e200383d12b9e1e..05fb84e1d36d4d05f39e48e41fc70703074ecabd
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
@@ -1,28 +1,29 @@
 /* { dg-add-options vect_early_break } */
 /* { dg-require-effective-target vect_early_break } */
-/* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target bitint } */
+/* { dg-require-effective-target vect_long_long } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
 
 __attribute__((noipa)) void
-bar (unsigned long *p)
+bar (unsigned long long *p)
 {
-  __builtin_memset (p, 0, 142 * sizeof (unsigned long));
-  p[17] = 0x500UL;
+  __builtin_memset (p, 0, 142 * sizeof (unsigned long long));
+  p[17] = 0x500ULL;
 }
 
 __attribute__((noipa)) int
 foo (void)
 {
-  unsigned long r[142];
+  unsigned long long r[142];
   bar (r);
-  unsigned long v = ((long) r[0] >> 31);
+  unsigned long long v = ((long) r[0] >> 31);
   if (v + 1 > 1)
 return 1;
-  for (unsigned long i = 1; i <= 140; ++i)
+  for (unsigned long long i = 1; i <= 140; ++i)
 if (r[i] != v)
   return 1;
-  unsigned long w = r[141];
-  if ((unsigned long) (((long) (w << 60)) >> 60) != v)
+  unsigned long long w = r[141];
+  if ((unsigned long long) (((long) (w << 60)) >> 60) != v)
 return 1;
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
index 
b92a8a268d803ab1656b4716b1a319ed4edc87a3..e141e8a9277f89527e8aff809fe101fdd91a4c46
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
@@ -1,7 +1,8 @@
 /* { dg-add-options vect_early_break } */
 /* { dg-require-effective-target vect_early_break } */
-/* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target bitint } */
+/* { dg-require-effective-target bitint65535 } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
 
 _BitInt(998) b;
 char c;
diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 
a9c76e0b290b19fd07574805bb2b87c86a5e9cf7..1ddcb3926a8d549b6a17b61e29e1d9836ecce897
 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -3850,6 +3850,15 @@ proc check_effective_target_bitint575 { } {
 } "-std=c23"]
 }
 
+# Return 1 if the target supports _BitInt(65535), 0 otherwise.
+
+proc check_effective_target_bitint65535 { } {
+return [check_no_compiler_messages bitint65535 object {
+_BitInt (2) a = 1wb;
+unsigned _BitInt (65535) b = 0uwb;
+} "-std=c23"]
+}
+
 # Return 1 if the target supports compiling decimal floating point,
 # 0 otherwise.





[PATCH][testsuite]: Make bitint early vect test more accurate

2024-01-10 Thread Tamar Christina
Hi All,

This changes the tests I committed for PR113287 to also
run on targets that don't support bitint.

Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues and tests run on both.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

PR tree-optimization/113287
* gcc.dg/vect/vect-early-break_100-pr113287.c: Support non-bitint.
* gcc.dg/vect/vect-early-break_99-pr113287.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
index 
f908e5bc60779c148dc95bda3e200383d12b9e1e..05fb84e1d36d4d05f39e48e41fc70703074ecabd
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
@@ -1,28 +1,29 @@
 /* { dg-add-options vect_early_break } */
 /* { dg-require-effective-target vect_early_break } */
-/* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target bitint } */
+/* { dg-require-effective-target vect_long_long } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
 
 __attribute__((noipa)) void
-bar (unsigned long *p)
+bar (unsigned long long *p)
 {
-  __builtin_memset (p, 0, 142 * sizeof (unsigned long));
-  p[17] = 0x500UL;
+  __builtin_memset (p, 0, 142 * sizeof (unsigned long long));
+  p[17] = 0x500ULL;
 }
 
 __attribute__((noipa)) int
 foo (void)
 {
-  unsigned long r[142];
+  unsigned long long r[142];
   bar (r);
-  unsigned long v = ((long) r[0] >> 31);
+  unsigned long long v = ((long) r[0] >> 31);
   if (v + 1 > 1)
 return 1;
-  for (unsigned long i = 1; i <= 140; ++i)
+  for (unsigned long long i = 1; i <= 140; ++i)
 if (r[i] != v)
   return 1;
-  unsigned long w = r[141];
-  if ((unsigned long) (((long) (w << 60)) >> 60) != v)
+  unsigned long long w = r[141];
+  if ((unsigned long long) (((long) (w << 60)) >> 60) != v)
 return 1;
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
index 
b92a8a268d803ab1656b4716b1a319ed4edc87a3..fb99ef39402ee7b3c6c564e7db5f5543a5f0c2e0
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
@@ -1,9 +1,18 @@
 /* { dg-add-options vect_early_break } */
 /* { dg-require-effective-target vect_early_break } */
-/* { dg-require-effective-target vect_int } */
-/* { dg-require-effective-target bitint } */
+/* { dg-require-effective-target vect_long_long } */
 
-_BitInt(998) b;
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#if __BITINT_MAXWIDTH__ >= 9020
+typedef _BitInt(9020) B9020;
+typedef _BitInt(998) B998;
+#else
+typedef long long B998;
+typedef long long B9020;
+#endif
+
+B998 b;
 char c;
 char d;
 char e;
@@ -14,7 +23,7 @@ char i;
 char j;
 
 void
-foo(char y, _BitInt(9020) a, char *r)
+foo(char y, B9020 a, char *r)
 {
   char x = __builtin_mul_overflow_p(a << sizeof(a), y, 0);
   x += c + d + e + f + g + h + i + j + b;
@@ -26,7 +35,12 @@ main(void)
 {
   char x;
   foo(5, 5, &x);
+#if __BITINT_MAXWIDTH__ >= 9020
   if (x != 1)
 __builtin_abort();
+#else
+  if (x != 0)
+__builtin_abort();
+#endif
   return 0;
 }





RE: [PATCH]middle-end: correctly identify the edge taken when condition is true. [PR113287]

2024-01-10 Thread Tamar Christina
> -Original Message-
> From: Jakub Jelinek 
> Sent: Wednesday, January 10, 2024 2:42 PM
> To: Tamar Christina ; Richard Biener
> 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: correctly identify the edge taken when 
> condition
> is true. [PR113287]
> 
> Hi!
> 
> Thanks for fixing it, just testsuite nits.
> 
> On Wed, Jan 10, 2024 at 03:22:53PM +0100, Richard Biener wrote:
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
> > > @@ -0,0 +1,35 @@
> > > +/* { dg-add-options vect_early_break } */
> > > +/* { dg-require-effective-target vect_early_break } */
> > > +/* { dg-require-effective-target vect_int } */
> > > +/* { dg-require-effective-target bitint } */
> 
> This test doesn't need bitint effective target.
> But relies on long being 64-bit, otherwise e.g.
> 0x500UL doesn't need to fit or shifting it by 60 is invalid.
> So, maybe use lp64 effective target instead.

I was thinking about it. Would using effective-target longlong and
changing the constant to ULL instead work?

Thanks,
Tamar


[PATCH]middle-end: correctly identify the edge taken when condition is true. [PR113287]

2024-01-10 Thread Tamar Christina
Hi All,

The vectorizer needs to know during early break vectorization whether the edge
that is taken when the condition is true stays in the loop or leaves it.

This is because the code assumes that if you take the true branch you exit the
loop.  If you don't exit the loop it has to generate a different condition.

Basically it uses this information to decide whether it's generating an
"any element" or an "all elements" check.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues with --enable-lto --with-build-config=bootstrap-O3
--enable-checking=release,yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113287
* tree-vect-stmts.cc (vectorizable_early_exit): Check the flags on edge
instead of using BRANCH_EDGE to determine true edge.

gcc/testsuite/ChangeLog:

PR tree-optimization/113287
* gcc.dg/vect/vect-early-break_100-pr113287.c: New test.
* gcc.dg/vect/vect-early-break_99-pr113287.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
new file mode 100644
index 
..f908e5bc60779c148dc95bda3e200383d12b9e1e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_100-pr113287.c
@@ -0,0 +1,35 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target bitint } */
+
+__attribute__((noipa)) void
+bar (unsigned long *p)
+{
+  __builtin_memset (p, 0, 142 * sizeof (unsigned long));
+  p[17] = 0x500UL;
+}
+
+__attribute__((noipa)) int
+foo (void)
+{
+  unsigned long r[142];
+  bar (r);
+  unsigned long v = ((long) r[0] >> 31);
+  if (v + 1 > 1)
+return 1;
+  for (unsigned long i = 1; i <= 140; ++i)
+if (r[i] != v)
+  return 1;
+  unsigned long w = r[141];
+  if ((unsigned long) (((long) (w << 60)) >> 60) != v)
+return 1;
+  return 0;
+}
+
+int
+main ()
+{
+  if (foo () != 1)
+__builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
new file mode 100644
index 
..b92a8a268d803ab1656b4716b1a319ed4edc87a3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_99-pr113287.c
@@ -0,0 +1,32 @@
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target bitint } */
+
+_BitInt(998) b;
+char c;
+char d;
+char e;
+char f;
+char g;
+char h;
+char i;
+char j;
+
+void
+foo(char y, _BitInt(9020) a, char *r)
+{
+  char x = __builtin_mul_overflow_p(a << sizeof(a), y, 0);
+  x += c + d + e + f + g + h + i + j + b;
+  *r = x;
+}
+
+int
+main(void)
+{
+  char x;
+  foo(5, 5, &x);
+  if (x != 1)
+__builtin_abort();
+  return 0;
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 
1333d8934783acdb5277e3a03c2b4021fec4777b..da004b0e9e2696cd2ce358d3b221851c7b60b448
 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12870,13 +12870,18 @@ vectorizable_early_exit (vec_info *vinfo, 
stmt_vec_info stmt_info,
  rewrite conditions to always be a comparison against 0.  To do this it
  sometimes flips the edges.  This is fine for scalar,  but for vector we
  then have to flip the test, as we're still assuming that if you take the
- branch edge that we found the exit condition.  */
+ branch edge that we found the exit condition.  i.e. we need to know 
whether
+ we are generating a `forall` or an `exist` condition.  */
   auto new_code = NE_EXPR;
   auto reduc_optab = ior_optab;
   auto reduc_op = BIT_IOR_EXPR;
   tree cst = build_zero_cst (vectype);
+  edge exit_true_edge = EDGE_SUCC (gimple_bb (cond_stmt), 0);
+  if (exit_true_edge->flags & EDGE_FALSE_VALUE)
+exit_true_edge = EDGE_SUCC (gimple_bb (cond_stmt), 1);
+  gcc_assert (exit_true_edge->flags & EDGE_TRUE_VALUE);
   if (flow_bb_inside_loop_p (LOOP_VINFO_LOOP (loop_vinfo),
-BRANCH_EDGE (gimple_bb (cond_stmt))->dest))
+exit_true_edge->dest))
 {
   new_code = EQ_EXPR;
   reduc_optab = and_optab;





[PATCH][committed][c++ frontend]: initialize ivdep value

2024-01-10 Thread Tamar Christina
Hi All,

Should control enter the switch through one of the cases other than
the IVDEP one, the variable remains uninitialized.

This fixes it by initializing it to false.
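
A minimal sketch of the bug pattern (hypothetical code, not the parser
itself): several case labels share one block, but the flag is only assigned
on one of the paths, so entering through the other labels reads an
indeterminate value.

enum pragma_kind { K_IVDEP, K_UNROLL, K_NOVECTOR };

int
shared_handler (enum pragma_kind k)
{
  int ivdep;                /* Was uninitialized; the fix initializes it.  */
  switch (k)
    {
    case K_IVDEP:
      ivdep = 1;
      /* FALLTHRU */
    case K_UNROLL:
    case K_NOVECTOR:
      return ivdep;         /* Indeterminate when entered via K_UNROLL etc.  */
    }
  return 0;
}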

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues

Committed as obvious.

Thanks,
Tamar

gcc/cp/ChangeLog:

* parser.cc (cp_parser_pragma): Initialize to false.

--- inline copy of patch -- 
diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 
379aeb56b152b9b29606ba4d75ad4c49dfe92aac..1b4ce1497e893d6463350eecf5ef4e88957f5f00
 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -50625,7 +50625,7 @@ cp_parser_pragma (cp_parser *parser, enum 
pragma_context context, bool *if_p)
 case PRAGMA_UNROLL:
 case PRAGMA_NOVECTOR:
   {
-   bool ivdep;
+   bool ivdep = false;
tree unroll = NULL_TREE;
bool novector = false;
const char *pragma_str;









RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]

2024-01-10 Thread Tamar Christina
ping

> -Original Message-
> From: Tamar Christina 
> Sent: Friday, January 5, 2024 1:31 PM
> To: Xi Ruoyao ; Palmer Dabbelt 
> Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de; Jeff Law
> 
> Subject: RE: [PATCH]middle-end: Don't apply copysign optimization if target 
> does
> not implement optab [PR112468]
> 
> > On Fri, 2024-01-05 at 11:02 +, Tamar Christina wrote:
> > > Ok, so something like:
> > >
> > > > > ([istarget loongarch*-*-*] &&
> > > > > ([check_effective_target_loongarch_sx] ||
> > > > > [check_effective_target_hard_float]))
> > > ?
> >
> > We don't need "[check_effective_target_loongarch_sx] ||" because SIMD
> > requires hard float.
> >
> 
> Cool, thanks!
> 
> --
> 
> Hi All,
> 
> currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr.
> The latter has a libcall fallback and the IFN can only do optabs.
> 
> Because of this the change I made to optimize copysign only works if the
> target has implemented the optab, but it should work for those that have the
> libcall too.
> 
> More annoyingly if a target has vector versions of ABS and NEG but not 
> COPYSIGN
> then the change made them lose vectorization.
> 
> The proper fix for this is to treat the IFN the same as the tree EXPR and to
> enhance expand_COPYSIGN to also support vector calls.
> 
> I have such a patch for GCC 15 but it's quite big and too invasive for 
> stage-4.
> As such this is a minimal fix, just don't apply the transformation and leave
> targets which don't have the optab unoptimized.
> 
> Targets list for check_effective_target_ifn_copysign was gotten by grepping 
> for
> copysign and looking at the optab.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Tests ran in x86_64-pc-linux-gnu -m32 and tests no longer fail.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   PR tree-optimization/112468
>   * doc/sourcebuild.texi: Document ifn_copysign.
>   * match.pd: Only apply transformation if target supports the IFN.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/112468
>   * gcc.dg/fold-copysign-1.c: Modify tests based on if target supports
>   IFN_COPYSIGN.
>   * gcc.dg/pr55152-2.c: Likewise.
>   * gcc.dg/tree-ssa/abs-4.c: Likewise.
>   * gcc.dg/tree-ssa/backprop-6.c: Likewise.
>   * gcc.dg/tree-ssa/copy-sign-2.c: Likewise.
>   * gcc.dg/tree-ssa/mult-abs-2.c: Likewise.
>   * lib/target-supports.exp (check_effective_target_ifn_copysign): New.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index
> 4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de3490
> 5f614ef6957658b4 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a
> SIMD instruction set.
>  @item xorsign
>  Target supports the xorsign optab expansion.
> 
> +@item ifn_copysign
> +Target supports the IFN_COPYSIGN optab expansion for both scalar and vector
> +types.
> +
>  @end table
> 
>  @subsubsection Environment attributes
> diff --git a/gcc/match.pd b/gcc/match.pd
> index
> d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b6
> 14890dc4729b521 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (simplify
>(copysigns @0 REAL_CST@1)
>(if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
> -   (abs @0
> +   (abs @0)
> +#if GIMPLE
> +   (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type,
> +  OPTIMIZE_FOR_BOTH))
> +(negate (abs @0)))
> +#endif
> +   )))
> 
> +#if GIMPLE
>  /* Transform fneg (fabs (X)) -> copysign (X, -1).  */
>  (simplify
>   (negate (abs @0))
> - (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))
> -
> + (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type,
> +   OPTIMIZE_FOR_BOTH))
> +   (IFN_COPYSIGN @0 { build_minus_one_cst (type); })))
> +#endif
>  /* copysign(copysign(x, y), z) -> copysign(x, z).  */
>  (for copysigns (COPYSIGN_ALL)
>   (simplify
> diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c 
> b/gcc/testsuite/gcc.dg/fold-
> copysign-1.c
> index
> f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef
> 39cc8f6e442ce 100644
> --- a/gcc/testsuite/gcc.dg/fold-copysign-1.c
> +++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } 

[PATCH][committed]middle-end: removed unused variable in vectorizable_live_operation_1

2024-01-09 Thread Tamar Christina
Hi All,

It looks like the previous patch had an unused variable.
It's odd that my bootstrap didn't catch it (I'm assuming
-Werror is still on for O3 bootstraps) but this fixes it.

Committed to fix bootstrap.

Thanks,
Tamar

gcc/ChangeLog:

* tree-vect-loop.cc (vectorizable_live_operation_1): Drop unused
restart_loop.
(vectorizable_live_operation): Likewise.

--- inline copy of patch -- 
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
39b1161309d8ff8bfe88ee26df9147df0af0a58c..c218d514fe4be57fca97a85a36be7240d3e84edf
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10575,13 +10575,12 @@ vectorizable_induction (loop_vec_info loop_vinfo,
 
helper function for vectorizable_live_operation.  */
 
-tree
+static tree
 vectorizable_live_operation_1 (loop_vec_info loop_vinfo,
   stmt_vec_info stmt_info, basic_block exit_bb,
   tree vectype, int ncopies, slp_tree slp_node,
   tree bitsize, tree bitstart, tree vec_lhs,
-  tree lhs_type, bool restart_loop,
-  gimple_stmt_iterator *exit_gsi)
+  tree lhs_type, gimple_stmt_iterator *exit_gsi)
 {
   gcc_assert (single_pred_p (exit_bb) || LOOP_VINFO_EARLY_BREAKS (loop_vinfo));
 
@@ -10597,7 +10596,7 @@ vectorizable_live_operation_1 (loop_vec_info loop_vinfo,
   if (integer_zerop (bitstart))
 {
   tree scalar_res = gimple_build (&stmts, BIT_FIELD_REF, TREE_TYPE 
(vectype),
-  vec_lhs_phi, bitsize, bitstart);
+ vec_lhs_phi, bitsize, bitstart);
 
   /* Convert the extracted vector element to the scalar type.  */
   new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
@@ -10958,8 +10957,7 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
 dest, vectype, ncopies,
 slp_node, bitsize,
 tmp_bitstart, tmp_vec_lhs,
-lhs_type, restart_loop,
-&exit_gsi);
+lhs_type, &exit_gsi);
 
  if (gimple_phi_num_args (use_stmt) == 1)
{









RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]

2024-01-09 Thread Tamar Christina
Hmm, I'm confused as to why it didn't break my bootstrap; I just did one again.
Anyway, I'll remove the unused variable.

> -Original Message-
> From: Rainer Orth 
> Sent: Tuesday, January 9, 2024 4:06 PM
> To: Richard Biener 
> Cc: Tamar Christina ; gcc-patches@gcc.gnu.org; nd
> ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: check if target can do extract first for 
> early breaks
> [PR113199]
> 
> Richard Biener  writes:
> 
> > On Tue, 9 Jan 2024, Tamar Christina wrote:
> >
> >> > > -
> >> > > -  gimple_seq_add_seq (, tem);
> >> > > -
> >> > > -  scalar_res = gimple_build (, CFN_EXTRACT_LAST, 
> >> > > scalar_type,
> >> > > -   mask, vec_lhs_phi);
> >> > > +  scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE
> >> > (vectype),
> >> > > + vec_lhs_phi, bitstart);
> >> >
> >> > So bitstart is always zero?  I wonder why using CFN_VEC_EXTRACT over
> >> > BIT_FIELD_REF here which wouldn't need any additional target support.
> >> >
> >>
> >> Ok, how about...
> >>
> >> ---
> >>
> >> I was generating the vector reverse mask without checking if the target
> >> actually supported such an operation.
> >>
> >> This patch changes it to if the bitstart is 0 then use BIT_FIELD_REF 
> >> instead
> >> to extract the first element since this is supported by all targets.
> >>
> >> This is good for now since masks always come from whilelo.  But in the 
> >> future
> >> when masks can come from other sources we will need the old code back.
> >>
> >> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> >> and no issues with --enable-checking=release --enable-lto
> >> --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
> >> tested on cross cc1 for amdgcn-amdhsa and issue fixed.
> >>
> >> Ok for master?
> >
> > OK.
> >
> >> Thanks,
> >> Tamar
> >>
> >> gcc/ChangeLog:
> >>
> >>PR tree-optimization/113199
> >>* tree-vect-loop.cc (vectorizable_live_operation_1): Use
> >>BIT_FIELD_REF.
> 
> This patch broke bootstrap (everywhere, it seems; seen on
> i386-pc-solaris2.11 and sparc-sun-solaris2.11):
> 
> /vol/gcc/src/hg/master/local/gcc/tree-vect-loop.cc: In function 'tree_node*
> vectorizable_live_operation_1(loop_vec_info, stmt_vec_info, basic_block, 
> tree, int,
> slp_tree, tree, tree, tree, tree, bool, gimple_stmt_iterator*)':
> /vol/gcc/src/hg/master/local/gcc/tree-vect-loop.cc:10598:52: error: unused
> parameter 'restart_loop' [-Werror=unused-parameter]
> 10598 |tree lhs_type, bool restart_loop,
>   |   ~^~~~
> 
>   Rainer
> 
> --
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University


RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]

2024-01-09 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, January 9, 2024 1:51 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with
> multiple exits [PR113144]
> 
> On Tue, 9 Jan 2024, Richard Biener wrote:
> 
> > On Tue, 9 Jan 2024, Tamar Christina wrote:
> >
> > >
> > >
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Tuesday, January 9, 2024 12:26 PM
> > > > To: Tamar Christina 
> > > > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> > > > Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with
> > > > multiple exits [PR113144]
> > > >
> > > > On Tue, 9 Jan 2024, Tamar Christina wrote:
> > > >
> > > > > > This makes it quadratic in the number of vectorized early exit loops
> > > > > > in a function.  The vectorizer CFG manipulation operates in a local
> > > > > > enough bubble that programmatic updating of dominators should be
> > > > > > possible (after all we manage to produce correct SSA form!), the
> > > > > > proposed change gets us too far off to a point where re-computating
> > > > > > dominance info is likely cheaper (but no, we shouldn't do this 
> > > > > > either).
> > > > > >
> > > > > > Can you instead give manual updating a try again?  I think
> > > > > > versioning should produce up-to-date dominator info, it's only
> > > > > > when you redirect branches during peeling that you'd need
> > > > > > adjustments - but IIRC we're never introducing new merges?
> > > > > >
> > > > > > IIRC we can't wipe dominators during transform since we query them
> > > > > > during code generation.  We possibly could code generate all
> > > > > > CFG manipulations of all vectorized loops, recompute all dominators
> > > > > > and then do code generation of all vectorized loops.
> > > > > >
> > > > > > But then we're doing a loop transform and the exits will 
> > > > > > ultimatively
> > > > > > end up in the same place, so the CFG and dominator update is bound 
> > > > > > to
> > > > > > where the original exits went to.
> > > > >
> > > > > Yeah that's a fair point, the issue is specifically with at_exit.  So 
> > > > > how about:
> > > > >
> > > > > When we peel at_exit we are moving the new loop at the exit of the
> previous
> > > > > loop.  This means that the blocks outside the loop that the previous 
> > > > > loop
> used to
> > > > > dominate are no longer being dominated by it.
> > > >
> > > > Hmm, indeed.  Note this does make the dominator update O(function-size)
> > > > and when vectorizing multiple loops in a function this becomes
> > > > quadratic.  That's quite unfortunate so I wonder if we can delay the
> > > > update to the parts we do not need up-to-date dominators during
> > > > vectorization (of course it gets fragile with having only partly
> > > > correct dominators).
> > >
> > > Fair, I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113290 and 
> > > will
> > > tackle it when I add SLP support in GCC 15.
> > >
> > > I think the problem is, and the reason we do early dominator correction 
> > > and
> > > validation is because the same function is used by loop distribution.
> > >
> > > But you're right that during vectorization we perform dominators update 
> > > twice
> > > now.
> >
> > We're performing it at least once per multi-exit loop that is vectorized,
> > covering all downstream blocks.
> 
> That is, consider sth like
> 
> int a[77];
> 
> int bar ();
> void foo ()
> {
>   int val;
> #define LOOP \
>   val = bar (); \
>   for (int i = 0; i < 77; ++i) \
> { \
>   if (a[i] == val) \
> break; \
>   a[i]++; \
> }
> #define LOOP10 LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP LOOP
> #define LOOP100 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10 LOOP10
> LOOP10
> LOOP10 LOOP10
> #define LOOP1000 LOOP100 LOOP100 LOOP100 LOOP100 LOOP100 LOOP100
> LOOP100
> LOOP100 LOOP100 LOOP100
>   LOOP1000
> }
>

RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]

2024-01-09 Thread Tamar Christina



> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, January 9, 2024 12:26 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: RE: [PATCH]middle-end: Fix dominators updates when peeling with
> multiple exits [PR113144]
> 
> On Tue, 9 Jan 2024, Tamar Christina wrote:
> 
> > > This makes it quadratic in the number of vectorized early exit loops
> > > in a function.  The vectorizer CFG manipulation operates in a local
> > > enough bubble that programmatic updating of dominators should be
> > > possible (after all we manage to produce correct SSA form!), the
> > > proposed change gets us too far off to a point where re-computating
> > > dominance info is likely cheaper (but no, we shouldn't do this either).
> > >
> > > Can you instead give manual updating a try again?  I think
> > > versioning should produce up-to-date dominator info, it's only
> > > when you redirect branches during peeling that you'd need
> > > adjustments - but IIRC we're never introducing new merges?
> > >
> > > IIRC we can't wipe dominators during transform since we query them
> > > during code generation.  We possibly could code generate all
> > > CFG manipulations of all vectorized loops, recompute all dominators
> > > and then do code generation of all vectorized loops.
> > >
> > > But then we're doing a loop transform and the exits will ultimatively
> > > end up in the same place, so the CFG and dominator update is bound to
> > > where the original exits went to.
> >
> > Yeah that's a fair point, the issue is specifically with at_exit.  So how 
> > about:
> >
> > When we peel at_exit we are moving the new loop at the exit of the previous
> > loop.  This means that the blocks outside the loop that the previous loop 
> > used to
> > dominate are no longer being dominated by it.
> 
> Hmm, indeed.  Note this does make the dominator update O(function-size)
> and when vectorizing multiple loops in a function this becomes
> quadratic.  That's quite unfortunate so I wonder if we can delay the
> update to the parts we do not need up-to-date dominators during
> vectorization (of course it gets fragile with having only partly
> correct dominators).

Fair, I created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113290 and will
tackle it when I add SLP support in GCC 15.

I think the problem is, and the reason we do early dominator correction and
validation is because the same function is used by loop distribution.

But you're right that during vectorization we perform dominators update twice
now.

So Maybe we should have a parameter to indicate whether dominators should
be updated?

Thanks,
Tamar

> 
> > The new dominators however are hard to predict since if the loop has 
> > multiple
> > exits and all the exits are an "early" one then we always execute the scalar
> > loop.  In this case the scalar loop can completely dominate the new loop.
> >
> > If we later have skip_vector then there's an additional skip edge added that
> > might change the dominators.
> >
> > The previous patch would force an update of all blocks reachable from the 
> > new
> > exits.  This one updates *only* blocks that we know the scalar exits 
> > dominated.
> >
> > For the examples this reduces the blocks to update from 18 to 3.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no issues normally and with --enable-checking=release --enable-lto
> > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
> >
> > Ok for master?
> 
> See below.
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/113144
> > PR tree-optimization/113145
> > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
> > Update all BB that the original exits dominated.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/113144
> > PR tree-optimization/113145
> > * gcc.dg/vect/vect-early-break_94-pr113144.c: New test.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
> b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
> > new file mode 100644
> > index
> ..903fe7be6621e81db6f294
> 41e4309fa213d027c5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
> > @@ -0,0 +1,41 @@
> > +/* { dg-do compile 

RE: [PATCH]Arm: Update early-break tests to accept thumb output too.

2024-01-09 Thread Tamar Christina
> > 3f40b2a241953 100644
> > --- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> > +++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
> > @@ -16,8 +16,12 @@ int b[N] = {0};
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> 
> If we want to be a bit fancy, I think the scan syntax allows to add a target 
> selector,
> you should be able to do
> ** | { target_thumb }
> **   cbnz...
> 

I tried, but it looks like this doesn't work because the | is not a TCL
feature, so the contents of the conditional match get interpreted as a
regexp:

body: .*\tvcgt.s32  q[0-9]+, q[0-9]+, #0
\tvpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+
\tvpmax.u32 d[0-9]+, d[0-9]+, d[0-9]+
\tvmov  r[0-9]+, s[0-9]+@ int
(?:\tcmpr[0-9]+, #0
\tbne   \.L[0-9]+
\t| { target_thumb }
\tcbnz  r[0-9]+, \.L.+
).*


> Ok for trunk with or without that change.

Will commit without,

Thanks,
Tamar

> Thanks,
> Kyrill
> 
> >  ** ...
> >  */
> >  void f1 ()
> > @@ -37,8 +41,12 @@ void f1 ()
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> >  ** ...
> >  */
> >  void f2 ()
> > @@ -58,8 +66,12 @@ void f2 ()
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> >  ** ...
> >  */
> >  void f3 ()
> > @@ -80,8 +92,12 @@ void f3 ()
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> >  ** ...
> >  */
> >  void f4 ()
> > @@ -101,8 +117,12 @@ void f4 ()
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> >  ** ...
> >  */
> >  void f5 ()
> > @@ -122,8 +142,12 @@ void f5 ()
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
> >  ** vmovr[0-9]+, s[0-9]+@ int
> > +** (
> >  ** cmp r[0-9]+, #0
> >  ** bne \.L[0-9]+
> > +** |
> > +** cbnzr[0-9]+, \.L.+
> > +** )
> >  ** ...
> >  */
> >  void f6 ()
> >
> >
> >
> >
> > --


[PATCH]Arm: Update early-break tests to accept thumb output too.

2024-01-09 Thread Tamar Christina
Hi All,

The tests I recently added for early break fail in thumb mode because
`cbz`/`cbnz` exist there, so the cmp+branch sequence gets fused.  This
updates the testcases to accept either output.

Tested on arm-none-linux-gnueabihf with -mthumb/-marm.

Ok for master?

Thanks,
Tamar

gcc/testsuite/ChangeLog:

* gcc.target/arm/vect-early-break-cbranch.c: Accept thumb output.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c 
b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
index 
f57bbd8be428d75dcf35aa194b5892fe04124cf6..d5c6d56ec869b8fa868acb78d4c3f40b2a241953
 100644
--- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
+++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
@@ -16,8 +16,12 @@ int b[N] = {0};
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f1 ()
@@ -37,8 +41,12 @@ void f1 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f2 ()
@@ -58,8 +66,12 @@ void f2 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f3 ()
@@ -80,8 +92,12 @@ void f3 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f4 ()
@@ -101,8 +117,12 @@ void f4 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f5 ()
@@ -122,8 +142,12 @@ void f5 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f6 ()




-- 
diff --git a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c 
b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
index 
f57bbd8be428d75dcf35aa194b5892fe04124cf6..d5c6d56ec869b8fa868acb78d4c3f40b2a241953
 100644
--- a/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
+++ b/gcc/testsuite/gcc.target/arm/vect-early-break-cbranch.c
@@ -16,8 +16,12 @@ int b[N] = {0};
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f1 ()
@@ -37,8 +41,12 @@ void f1 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f2 ()
@@ -58,8 +66,12 @@ void f2 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f3 ()
@@ -80,8 +92,12 @@ void f3 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f4 ()
@@ -101,8 +117,12 @@ void f4 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f5 ()
@@ -122,8 +142,12 @@ void f5 ()
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vpmax.u32   d[0-9]+, d[0-9]+, d[0-9]+
 ** vmovr[0-9]+, s[0-9]+@ int
+** (
 ** cmp r[0-9]+, #0
 ** bne \.L[0-9]+
+** |
+** cbnzr[0-9]+, \.L.+
+** )
 ** ...
 */
 void f6 ()





RE: [PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]

2024-01-09 Thread Tamar Christina
> This makes it quadratic in the number of vectorized early exit loops
> in a function.  The vectorizer CFG manipulation operates in a local
> enough bubble that programmatic updating of dominators should be
> possible (after all we manage to produce correct SSA form!), the
> proposed change gets us too far off to a point where re-computating
> dominance info is likely cheaper (but no, we shouldn't do this either).
> 
> Can you instead give manual updating a try again?  I think
> versioning should produce up-to-date dominator info, it's only
> when you redirect branches during peeling that you'd need
> adjustments - but IIRC we're never introducing new merges?
> 
> IIRC we can't wipe dominators during transform since we query them
> during code generation.  We possibly could code generate all
> CFG manipulations of all vectorized loops, recompute all dominators
> and then do code generation of all vectorized loops.
> 
> But then we're doing a loop transform and the exits will ultimatively
> end up in the same place, so the CFG and dominator update is bound to
> where the original exits went to.

Yeah that's a fair point, the issue is specifically with at_exit.  So how about:

When we peel at_exit we are moving the new loop to the exit of the previous
loop.  This means that the blocks outside the loop that the previous loop used
to dominate are no longer dominated by it.

The new dominators however are hard to predict since if the loop has multiple
exits and all the exits are "early" ones then we always execute the scalar
loop.  In this case the scalar loop can completely dominate the new loop.

If we later have skip_vector then there's an additional skip edge added that
might change the dominators.

The previous patch would force an update of all blocks reachable from the new
exits.  This one updates *only* blocks that we know the scalar exits dominated.

For the examples this reduces the blocks to update from 18 to 3.
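
A rough sketch of the idea using the existing dominance.h helpers
(illustrative only, the real hunk is in the diff below):

  /* While the old dominator info is still valid, gather only the blocks
     that the original exits dominated ...  */
  auto_vec<basic_block> doms;
  for (edge e : get_loop_exit_edges (loop))
    for (basic_block bb : get_all_dominated_blocks (CDI_DOMINATORS, e->dest))
      doms.safe_push (bb);

  /* ... and after the peeled copy has been wired up, recompute just their
     immediate dominators instead of everything reachable from the new
     exits.  */
  iterate_fix_dominators (CDI_DOMINATORS, doms, false);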

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues normally and with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113144
PR tree-optimization/113145
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Update all BB that the original exits dominated.

gcc/testsuite/ChangeLog:

PR tree-optimization/113144
PR tree-optimization/113145
* gcc.dg/vect/vect-early-break_94-pr113144.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
new file mode 100644
index 
..903fe7be6621e81db6f29441e4309fa213d027c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+long tar_atol256_max, tar_atol256_size, tar_atosl_min;
+char tar_atol256_s;
+void __errno_location();
+
+
+inline static long tar_atol256(long min) {
+  char c;
+  int sign;
+  c = tar_atol256_s;
+  sign = c;
+  while (tar_atol256_size) {
+if (c != sign)
+  return sign ? min : tar_atol256_max;
+c = tar_atol256_size--;
+  }
+  if ((c & 128) != (sign & 128))
+return sign ? min : tar_atol256_max;
+  return 0;
+}
+
+inline static long tar_atol(long min) {
+  return tar_atol256(min);
+}
+
+long tar_atosl() {
+  long n = tar_atol(-1);
+  if (tar_atosl_min) {
+__errno_location();
+return 0;
+  }
+  if (n > 0)
+return 0;
+  return n;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
76d4979c0b3b374dcaacf6825a95a8714114a63b..9bacaa182a3919cae1cb99dfc5ae4923e1f93376
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1719,8 +1719,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, 
edge loop_exit,
  /* Now link the alternative exits.  */
  if (multiple_exits_p)
{
- set_immediate_dominator (CDI_DOMINATORS, new_preheader,
-  main_loop_exit_block);
  for (auto gsi_from = gsi_start_phis (loop->header),
   gsi_to = gsi_start_phis (new_preheader);
   !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
@@ -1776,7 +1774,14 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
{
  update_loop = new_loop;
  for (edge e : get_loop_exit_edges (loop))
-   doms.safe_push (e->dest);
+   {
+ /* Basic blocks that the old loop dominated are now dominated by
+the new loop and so we have to update those.  */

RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]

2024-01-09 Thread Tamar Christina
> > -
> > -  gimple_seq_add_seq (, tem);
> > -
> > -  scalar_res = gimple_build (, CFN_EXTRACT_LAST, scalar_type,
> > -mask, vec_lhs_phi);
> > +   scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE
> (vectype),
> > +  vec_lhs_phi, bitstart);
> 
> So bitstart is always zero?  I wonder why using CFN_VEC_EXTRACT over
> BIT_FIELD_REF here which wouldn't need any additional target support.
> 

Ok, how about...

---

I was generating the vector reverse mask without checking if the target
actually supported such an operation.

This patch changes it so that if bitstart is 0 we use BIT_FIELD_REF instead
to extract the first element, since that is supported by all targets.

This is good for now since masks always come from whilelo.  But in the future
when masks can come from other sources we will need the old code back.
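
The extraction itself is nothing more than reading the first lane, which every
target can do; a C-level sketch using the GCC vector extension (illustrative
only, not part of the patch):

typedef int v4si __attribute__ ((vector_size (16)));

int
first_lane (v4si v)
{
  /* The first element; the BIT_FIELD_REF <vec, bitsize, 0> emitted by the
     patch below computes the same value.  */
  return v[0];
}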

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
tested on cross cc1 for amdgcn-amdhsa and issue fixed.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113199
* tree-vect-loop.cc (vectorizable_live_operation_1): Use
BIT_FIELD_REF.

gcc/testsuite/ChangeLog:

PR tree-optimization/113199
* gcc.target/gcn/pr113199.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c 
b/gcc/testsuite/gcc.target/gcn/pr113199.c
new file mode 100644
index 
..8a641e5536e80e207ca0163cac66c0f4f6ca93f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/pr113199.c
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2" } */
+
+typedef long unsigned int size_t;
+typedef int wchar_t;
+struct tm
+{
+  int tm_mon;
+  int tm_year;
+};
+int abs (int);
+struct lc_time_T { const char *month[12]; };
+struct __locale_t * __get_current_locale (void) { }
+const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { }
+const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) { 
return buf; }
+size_t
+__strftime (wchar_t *s, size_t maxsize, const wchar_t *format,
+ const struct tm *tim_p, struct __locale_t *locale)
+{
+  size_t count = 0;
+  const wchar_t *ctloc;
+  wchar_t ctlocbuf[256];
+  size_t i, ctloclen;
+  const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale);
+{
+  switch (*format)
+ {
+ case L'B':
+   (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon], 
));
+   for (i = 0; i < ctloclen; i++)
+ {
+   if (count < maxsize - 1)
+  s[count++] = ctloc[i];
+   else
+  return 0;
+   {
+  int century = tim_p->tm_year >= 0
+? tim_p->tm_year / 100 + 1900 / 100
+: abs (tim_p->tm_year + 1900) / 100;
+   }
+   }
+ }
+}
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
37f1be1101ffae779214056a0886411e0683e887..39b1161309d8ff8bfe88ee26df9147df0af0a58c
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10592,7 +10592,17 @@ vectorizable_live_operation_1 (loop_vec_info 
loop_vinfo,
 
   gimple_seq stmts = NULL;
   tree new_tree;
-  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+
+  /* If bitstart is 0 then we can use a BIT_FIELD_REF  */
+  if (integer_zerop (bitstart))
+{
+  tree scalar_res = gimple_build (, BIT_FIELD_REF, TREE_TYPE 
(vectype),
+  vec_lhs_phi, bitsize, bitstart);
+
+  /* Convert the extracted vector element to the scalar type.  */
+  new_tree = gimple_convert (, lhs_type, scalar_res);
+}
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
 {
   /* Emit:
 
@@ -10618,12 +10628,6 @@ vectorizable_live_operation_1 (loop_vec_info 
loop_vinfo,
   tree last_index = gimple_build (, PLUS_EXPR, TREE_TYPE (len),
 len, bias_minus_one);
 
-  /* This needs to implement extraction of the first index, but not sure
-how the LEN stuff works.  At the moment we shouldn't get here since
-there's no LEN support for early breaks.  But guard this so there's
-no incorrect codegen.  */
-  gcc_assert (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo));
-
   /* SCALAR_RES = VEC_EXTRACT .  */
   tree scalar_res
= gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE (vectype),
@@ -10648,32 +10652,6 @@ vectorizable_live_operation_1 (loop_vec_info 
loop_vinfo,
  _VINFO_MASKS (loop_vinfo),
  1, vectype, 0);
   tree scalar_res;
-
-  /* For an inverted control flow with early breaks we want EXTRACT_FIRST
-instead of EXTRACT_LAST.  Emulate by reversing the vector and mask. */
-  if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
-   {
- /* First create the permuted mask.  */
- tree perm_mask = 

RE: [PATCH]middle-end: check if target can do extract first for early breaks [PR113199]

2024-01-08 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, January 8, 2024 12:48 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: check if target can do extract first for 
> early breaks
> [PR113199]
> 
> On Tue, 2 Jan 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > I was generating the vector reverse mask without checking if the target
> > actually supported such an operation.
> >
> > It also seems like more targets implement VEC_EXTRACT than permute on mask
> > registers.
> >
> > So this adds a check for IFN_VEC_EXTRACT support when required and changes
> > the select first code to use it.
> >
> > This is good for now since masks always come from whilelo.  But in the 
> > future
> > when masks can come from other sources we will need the old code back.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no issues with --enable-checking=release --enable-lto
> > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
> > tested on cross cc1 for amdgcn-amdhsa and issue fixed.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/113199
> > * tree-vect-loop.cc (vectorizable_live_operation_1): Use
> > IFN_VEC_EXTRACT.
> > (vectorizable_live_operation): Check for IFN_VEC_EXTRACT support.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/113199
> > * gcc.target/gcn/pr113199.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c
> b/gcc/testsuite/gcc.target/gcn/pr113199.c
> > new file mode 100644
> > index
> ..8a641e5536e80e207ca01
> 63cac66c0f4f6ca93f7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/gcn/pr113199.c
> > @@ -0,0 +1,44 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O2" } */
> > +
> > +typedef long unsigned int size_t;
> > +typedef int wchar_t;
> > +struct tm
> > +{
> > +  int tm_mon;
> > +  int tm_year;
> > +};
> > +int abs (int);
> > +struct lc_time_T { const char *month[12]; };
> > +struct __locale_t * __get_current_locale (void) { }
> > +const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { }
> > +const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) {
> return buf; }
> > +size_t
> > +__strftime (wchar_t *s, size_t maxsize, const wchar_t *format,
> > + const struct tm *tim_p, struct __locale_t *locale)
> > +{
> > +  size_t count = 0;
> > +  const wchar_t *ctloc;
> > +  wchar_t ctlocbuf[256];
> > +  size_t i, ctloclen;
> > +  const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale);
> > +{
> > +  switch (*format)
> > + {
> > + case L'B':
> > +   (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon],
> ));
> > +   for (i = 0; i < ctloclen; i++)
> > + {
> > +   if (count < maxsize - 1)
> > +  s[count++] = ctloc[i];
> > +   else
> > +  return 0;
> > +   {
> > +  int century = tim_p->tm_year >= 0
> > +? tim_p->tm_year / 100 + 1900 / 100
> > +: abs (tim_p->tm_year + 1900) / 100;
> > +   }
> > +   }
> > + }
> > +}
> > +}
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> 37f1be1101ffae779214056a0886411e0683e887..5aa92e67444e7aacf458fffa14
> 28f1983c482374 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -10648,36 +10648,18 @@ vectorizable_live_operation_1 (loop_vec_info
> loop_vinfo,
> >   _VINFO_MASKS (loop_vinfo),
> >   1, vectype, 0);
> >tree scalar_res;
> > +  gimple_seq_add_seq (, tem);
> >
> >/* For an inverted control flow with early breaks we want 
> > EXTRACT_FIRST
> > -instead of EXTRACT_LAST.  Emulate by reversing the vector and mask. */
> > +instead of EXTRACT_LAST.  For now since the mask always comes from a
> > +WHILELO we can get the first element ignoring the mask since CLZ of the
> > +mask will always be zero.  */
> >if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > -   {
> > - /* First create the permuted mask.  */
>

RE: [PATCH]middle-end: maintain LCSSA form when peeled vector iterations have virtual operands

2024-01-08 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, January 8, 2024 12:38 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: maintain LCSSA form when peeled vector
> iterations have virtual operands
> 
> On Fri, 29 Dec 2023, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This patch fixes several interconnected issues.
> >
> > 1. When picking an exit we wanted to check for niter_desc.may_be_zero not
> true.
> >i.e. we want to pick an exit which we know will iterate at least once.
> >However niter_desc.may_be_zero is not a boolean.  It is a tree that 
> > encodes
> >a boolean value.  !niter_desc.may_be_zero is just checking if we have 
> > some
> >information, not what the information is.  This leads us to pick a more
> >difficult to vectorize exit more often than we should.
> >
> > 2. Because we had this bug, we used to pick an alternative exit much more often
> >which showed one issue, when the loop accesses memory and we "invert it" 
> > we
> >would corrupt the VUSE chain.  This is because on a peeled vector iteration
> >every exit restarts the loop (i.e. they're all early) BUT since we may 
> > have
> >performed a store, the vUSE would need to be updated.  This version 
> > maintains
> >virtual PHIs correctly in these cases.   Note that we can't simply 
> > remove all
> >of them and recreate them because we need the PHI nodes still in the 
> > right
> >order for if skip_vector.
> >
> > 3. Since we're moving the stores to a safe location I don't think we 
> > actually
> >need to analyze whether the store is in range of the memref,  because if 
> > we
> >ever get there, we know that the loads must be in range, and if the 
> > loads are
> >in range and we get to the store we know the early breaks were not taken 
> > and
> >so the scalar loop would have done the VF stores too.
> >
> > 4. Instead of searching for where to move stores to, they should always be 
> > in
> >exit belonging to the latch.  We can only ever delay stores and even if 
> > we
> >pick a different exit than the latch one as the main one, effects still
> >happen in program order when vectorized.  If we don't move the stores to 
> > the
> >latch exit but instead to wherever we pick as the "main" exit then we can
> >perform incorrect memory accesses (luckily these are trapped by 
> > verify_ssa).
> >
> > 5. We only used to analyze loads inside the same BB as an early break, and 
> > also
> >we'd never analyze the ones inside the block where we'd be moving memory
> >references to.  This is obviously bogus and to fix it this patch splits 
> > apart
> >the two constraints.  We first validate that all load memory references 
> > are
> >in bounds and only after that do we perform the alias checks for the 
> > writes.
> >This makes the code simpler to understand and more trivially correct.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
> > and no issues with --enable-checking=release --enable-lto
> > --with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/113137
> > PR tree-optimization/113136
> > PR tree-optimization/113172
> > * tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
> > * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
> > (vect_do_peeling): Maintain virtual PHIs on inverted loops.
> > * tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closes to
> > latch.
> > (vect_create_loop_vinfo): Record all conds instead of only alt ones.
> > * tree-vectorizer.h: Fix comment
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/113137
> > PR tree-optimization/113136
> > PR tree-optimization/113172
> > * g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
> > * g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
> > * gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
> > * gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
> > * gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/g++.dg/vec

RE: [PATCH]middle-end: rejects loops with nonlinear inductions and early breaks [PR113163]

2024-01-08 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, January 8, 2024 12:07 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: rejects loops with nonlinear inductions and 
> early
> breaks [PR113163]
> 
> On Fri, 29 Dec 2023, Tamar Christina wrote:
> 
> > Hi All,
> >
> > We can't support nonlinear inductions other than neg when vectorizing
> > early breaks and iteration count is known.
> >
> > For early break we currently require a peeled epilog but in these cases
> > we can't compute the remaining values.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > tested on cross cc1 for amdgcn-amdhsa and issue fixed.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR middle-end/113163
> > * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
> 
> Misses sth.
> 
> > gcc/testsuite/ChangeLog:
> >
> > PR middle-end/113163
> > * gcc.target/gcn/pr113163.c: New test.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c
> b/gcc/testsuite/gcc.target/gcn/pr113163.c
> > new file mode 100644
> > index
> ..99b0fdbaf3a3152ca008b5
> 109abf6e80d8cb3d6a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/gcn/pr113163.c
> > @@ -0,0 +1,30 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O2 -ftree-vectorize" } */
> > +
> > +struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; };
> > +static const char R64_ARRAY[] =
> "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
> ;
> > +char *
> > +_l64a_r (struct _reent *rptr,
> > + long value)
> > +{
> > +  char *ptr;
> > +  char *result;
> > +  int i, index;
> > +  unsigned long tmp = (unsigned long)value & 0x;
> > +  result =
> > +  ((
> > +  rptr
> > +  )->_new._reent._l64a_buf)
> > +   ;
> > +  ptr = result;
> > +  for (i = 0; i < 6; ++i)
> > +{
> > +  if (tmp == 0)
> > + {
> > +   *ptr = '\0';
> > +   break;
> > + }
> > +  *ptr++ = R64_ARRAY[index];
> > +  tmp >>= 6;
> > +}
> > +}
> > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > index
> 3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c18
> 6302fbeaf515be8cf 100644
> > --- a/gcc/tree-vect-loop-manip.cc
> > +++ b/gcc/tree-vect-loop-manip.cc
> > @@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info
> loop_vinfo,
> >return false;
> >  }
> >
> > +  /* We can't support partial vectors and early breaks with an induction
> > + type other than add or neg since we require the epilog and can't
> > + perform the peeling.  PR113163.  */
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > +  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
> 
> But why's that only for constant VF?  We might never end up here
> with variable VF but the check looks odd ...

It's mirroring the condition in vect_gen_vector_loop_niters where we
create a step_vector that is not 1.  That is the case which causes
niters_vector_mult_vf_var to become a tree var instead.

I'll update the comment to say this.

Thanks,
Tamar
> 
> OK with that clarified and/or the test removed.
> 
> Thanks,
> Richard.
> 
> > +  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > +  && induction_type != vect_step_op_neg)
> > +{
> > +  if (dump_enabled_p ())
> > +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +"Peeling for epilogue is not supported"
> > +" for nonlinear induction except neg"
> > +" when iteration count is known and early breaks.\n");
> > +  return false;
> > +}
> > +
> >return true;
> >  }
> >
> >
> >
> >
> >
> >
> 
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


RE: [PATCH] tree-optimization/113026 - avoid vector epilog in more cases

2024-01-08 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, January 8, 2024 11:29 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Tamar Christina 
> Subject: [PATCH] tree-optimization/113026 - avoid vector epilog in more cases
> 
> The following avoids creating a niter peeling epilog more consistently,
> matching what peeling later uses for the skip_vector condition, in
> particular when versioning is required which then also ensures the
> vector loop is entered unless the epilog is vectorized.  This should
> ideally match LOOP_VINFO_VERSIONING_THRESHOLD which is only computed
> later, some refactoring could make that better matching.
> 
> The patch also makes sure to adjust the upper bound of the epilogues
> when we do not have a skip edge around the vector loop.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.  Tamar, does
> that look OK wrt early-breaks?

Yeah the value looks correct.  I did find a few cases where the niters should
actually be higher for skip_vector, namely when one of the breaks forces
ncopies > 1 and we have a break condition that requires all values to be true
to continue.

The code is not wrong in that case, it just executes a completely useless
vector iteration.

But that's unrelated, this looks correct because it means bound_scalar is not 
set, in
which case there's no difference between one and multiple exits.

Thanks,
Tamar

> 
> Thanks,
> Richard.
> 
>   PR tree-optimization/113026
>   * tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p):
>   Avoid an epilog in more cases.
>   * tree-vect-loop-manip.cc (vect_do_peeling): Adjust the
>   epilogues niter upper bounds and estimates.
> 
>   * gcc.dg/torture/pr113026-1.c: New testcase.
>   * gcc.dg/torture/pr113026-2.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/torture/pr113026-1.c | 11 
>  gcc/testsuite/gcc.dg/torture/pr113026-2.c | 18 +
>  gcc/tree-vect-loop-manip.cc   | 32 +++
>  gcc/tree-vect-loop.cc |  6 -
>  4 files changed, 66 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr113026-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr113026-2.c
> 
> diff --git a/gcc/testsuite/gcc.dg/torture/pr113026-1.c
> b/gcc/testsuite/gcc.dg/torture/pr113026-1.c
> new file mode 100644
> index 000..56dfef3b36c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr113026-1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Wall" } */
> +
> +char dst[16];
> +
> +void
> +foo (char *src, long n)
> +{
> +  for (long i = 0; i < n; i++)
> +dst[i] = src[i]; /* { dg-bogus "" } */
> +}
> diff --git a/gcc/testsuite/gcc.dg/torture/pr113026-2.c
> b/gcc/testsuite/gcc.dg/torture/pr113026-2.c
> new file mode 100644
> index 000..b9d5857a403
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr113026-2.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Wall" } */
> +
> +char dst1[17];
> +void
> +foo1 (char *src, long n)
> +{
> +  for (long i = 0; i < n; i++)
> +dst1[i] = src[i]; /* { dg-bogus "" } */
> +}
> +
> +char dst2[18];
> +void
> +foo2 (char *src, long n)
> +{
> +  for (long i = 0; i < n; i++)
> +dst2[i] = src[i]; /* { dg-bogus "" } */
> +}
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 9330183bfb9..927f76a0947 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -3364,6 +3364,38 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
>   bb_before_epilog->count = single_pred_edge (bb_before_epilog)->count
> ();
> bb_before_epilog = loop_preheader_edge (epilog)->src;
>   }
> +  else
> + {
> +   /* When we do not have a loop-around edge to the epilog we know
> +  the vector loop covered at least VF scalar iterations unless
> +  we have early breaks and the epilog will cover at most
> +  VF - 1 + gap peeling iterations.
> +  Update any known upper bound with this knowledge.  */
> +   if (! LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> + {
> +   if (epilog->any_upper_bound)
> + epilog->nb_iterations_upper_bound -= lowest_vf;
> +   if (epilog->any_likely_upper_bound)
> + epilog->nb_iterations_likely_upper_bound -= lowest_vf;
> +   if (epilog->any_estimate)
> + epilog->nb_iterations_estimate -= lowest_vf;
> + }
> +   unsigned HOST_WIDE_INT const_vf;
> +   if

[PATCH][frontend]: don't ice with pragma NOVECTOR if loop in C has no condition [PR113267]

2024-01-08 Thread Tamar Christina
Hi All,

In C you can have loops without a condition.  The original version of the patch
rejected the use of #pragma GCC novector on such loops, but during review it
was changed to not do this, the reasoning being that we didn't want to give a
compile error in such cases.

However, because annotations seem to only be allowed on conditions (unless
I'm mistaken?) the attached example ICEs because there's no condition.

This will have it ignore the pragma instead of ICEing.  I don't know if this is
the best solution,  but as far as I can tell we can't attach the annotation to
anything else.
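
For contrast, the normal case, where the loop does have a condition for the
annotation to attach to, looks like this (a sketch, not one of the testcases):

void g (char *a, int n)
{
#pragma GCC novector
  for (int i = 0; i < n; i++)	/* the annotation attaches to i < n.  */
    a[i] *= 2;
}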

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/c/ChangeLog:

PR c/113267
* c-parser.cc (c_parser_for_statement): Skip the pragma if no cond.

gcc/testsuite/ChangeLog:

PR c/113267
* gcc.dg/pr113267.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 
c3724304580cf54f52655e10d2697c68966b9a17..e8300cea8ef7cedead5871e40c2a9ba5333bf839
 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -8442,7 +8442,7 @@ c_parser_for_statement (c_parser *parser, bool ivdep, 
unsigned short unroll,
   build_int_cst (integer_type_node,
  annot_expr_unroll_kind),
   build_int_cst (integer_type_node, unroll));
- if (novector && cond != error_mark_node)
+ if (novector && cond && cond != error_mark_node)
cond = build3 (ANNOTATE_EXPR, TREE_TYPE (cond), cond,
   build_int_cst (integer_type_node,
  annot_expr_no_vector_kind),
diff --git a/gcc/testsuite/gcc.dg/pr113267.c b/gcc/testsuite/gcc.dg/pr113267.c
new file mode 100644
index 
..8b6fa08324eb12ad6493291cca8e80bd3a072ba8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr113267.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+
+void f (char *a, int i)
+{
+#pragma GCC novector
+  for (;;i++)
+a[i] *= 2;
+}




-- 
diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 
c3724304580cf54f52655e10d2697c68966b9a17..e8300cea8ef7cedead5871e40c2a9ba5333bf839
 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -8442,7 +8442,7 @@ c_parser_for_statement (c_parser *parser, bool ivdep, 
unsigned short unroll,
   build_int_cst (integer_type_node,
  annot_expr_unroll_kind),
   build_int_cst (integer_type_node, unroll));
- if (novector && cond != error_mark_node)
+ if (novector && cond && cond != error_mark_node)
cond = build3 (ANNOTATE_EXPR, TREE_TYPE (cond), cond,
   build_int_cst (integer_type_node,
  annot_expr_no_vector_kind),
diff --git a/gcc/testsuite/gcc.dg/pr113267.c b/gcc/testsuite/gcc.dg/pr113267.c
new file mode 100644
index 
..8b6fa08324eb12ad6493291cca8e80bd3a072ba8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr113267.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+
+void f (char *a, int i)
+{
+#pragma GCC novector
+  for (;;i++)
+a[i] *= 2;
+}





Re: [PATCH]middle-end: thread through existing LCSSA variable for alternative exits too [PR113237]

2024-01-08 Thread Tamar Christina
No, that error is fixed by some earlier patches sent early last week that are 
awaiting review :)


From: Toon Moene 
Sent: Sunday, January 7, 2024 7:11 PM
To: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH]middle-end: thread through existing LCSSA variable for 
alternative exits too [PR113237]

On 1/7/24 18:29, Tamar Christina wrote:

> gcc/ChangeLog:
>
>PR tree-optimization/113237
>* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use
>existing LCSSA variable for exit when all exits are early break.

Might that be the same error as I got here when building with
bootstrap-lto and bootstrap-O3:

https://gcc.gnu.org/pipermail/gcc-testresults/2024-January/804807.html

?

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands



[PATCH]middle-end: thread through existing LCSSA variable for alternative exits too [PR113237]

2024-01-07 Thread Tamar Christina
Hi All,

Building on top of the previous patch: similar to when we have a single exit,
if we have a case where all exits are considered early exits and there are
existing non-virtual PHIs, then in order to maintain LCSSA we have to use the
existing PHI variables.  We can't simply clear them and just rebuild them
because the order of the PHIs in the main exit must match the original exit
for when we add the skip_epilog guard.

But the infrastructure is already in place to maintain them, we just have to use
the right value.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues normally and with with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113237
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use
existing LCSSA variable for exit when all exits are early break.

gcc/testsuite/ChangeLog:

PR tree-optimization/113237
* gcc.dg/vect/vect-early-break_98-pr113237.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c
new file mode 100644
index 
..e6d150b571f753e9eb3859f06f62b371817494a3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+long Perl_pp_split_limit;
+int Perl_block_gimme();
+int Perl_pp_split() {
+  char strend;
+  long iters;
+  int gimme = Perl_block_gimme();
+  while (--Perl_pp_split_limit) {
+if (gimme)
+  iters++;
+if (strend)
+  break;
+  }
+  if (iters)
+return 0;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
7fd6566341b4893a1e209d1f8ff65d6d180f1190..77649b84f45b9e5dacec2809e0c854c8fcc17ce1
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1700,7 +1700,12 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
  if (peeled_iters && !virtual_operand_p (new_arg))
{
  tree tmp_arg = gimple_phi_result (from_phi);
- if (!new_phi_args.get (tmp_arg))
+ /* Similar to the single exit case, If we have an existing
+LCSSA variable thread through the original value otherwise
+skip it and directly use the final value.  */
+ if (tree *res = new_phi_args.get (tmp_arg))
+   new_arg = *res;
+ else
new_arg = tmp_arg;
}
 




-- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c
new file mode 100644
index 
..e6d150b571f753e9eb3859f06f62b371817494a3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_98-pr113237.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+long Perl_pp_split_limit;
+int Perl_block_gimme();
+int Perl_pp_split() {
+  char strend;
+  long iters;
+  int gimme = Perl_block_gimme();
+  while (--Perl_pp_split_limit) {
+if (gimme)
+  iters++;
+if (strend)
+  break;
+  }
+  if (iters)
+return 0;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
7fd6566341b4893a1e209d1f8ff65d6d180f1190..77649b84f45b9e5dacec2809e0c854c8fcc17ce1
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1700,7 +1700,12 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
  if (peeled_iters && !virtual_operand_p (new_arg))
{
  tree tmp_arg = gimple_phi_result (from_phi);
- if (!new_phi_args.get (tmp_arg))
+ /* Similar to the single exit case, If we have an existing
+LCSSA variable thread through the original value otherwise
+skip it and directly use the final value.  */
+ if (tree *res = new_phi_args.get (tmp_arg))
+   new_arg = *res;
+ else
new_arg = tmp_arg;
}
 





RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]

2024-01-05 Thread Tamar Christina
> On Fri, 2024-01-05 at 11:02 +0000, Tamar Christina wrote:
> > Ok, so something like:
> >
> > > > ([istarget loongarch*-*-*] &&
> > > > ([check_effective_target_loongarch_sx] ||
> > > > [check_effective_target_hard_float]))
> > ?
> 
> We don't need "[check_effective_target_loongarch_sx] ||" because SIMD
> requires hard float.
> 

Cool, thanks! 

--

Hi All,

Currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr.
The latter has a libcall fallback and the IFN can only do optabs.

Because of this the change I made to optimize copysign only works if the
target has implemented the optab, but it should work for those that have the
libcall too.

More annoyingly if a target has vector versions of ABS and NEG but not COPYSIGN
then the change made them lose vectorization.

The proper fix for this is to treat the IFN the same as the tree EXPR and to
enhance expand_COPYSIGN to also support vector calls.

I have such a patch for GCC 15 but it's quite big and too invasive for stage-4.
As such this is a minimal fix, just don't apply the transformation and leave
targets which don't have the optab unoptimized.
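
For reference, the transformation being guarded, sketched at the source level
(illustrative only, not one of the testcases):

double
fneg_fabs (double x)
{
  /* With IFN_COPYSIGN supported this folds to copysign (x, -1.0);
     without it we now simply keep the negate of the ABS.  */
  return -__builtin_fabs (x);
}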

Targets list for check_effective_target_ifn_copysign was gotten by grepping for
copysign and looking at the optab.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Tests ran in x86_64-pc-linux-gnu -m32 and tests no longer fail.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/112468
* doc/sourcebuild.texi: Document ifn_copysign.
* match.pd: Only apply transformation if target supports the IFN.

gcc/testsuite/ChangeLog:

PR tree-optimization/112468
* gcc.dg/fold-copysign-1.c: Modify tests based on if target supports
IFN_COPYSIGN.
* gcc.dg/pr55152-2.c: Likewise.
* gcc.dg/tree-ssa/abs-4.c: Likewise.
* gcc.dg/tree-ssa/backprop-6.c: Likewise.
* gcc.dg/tree-ssa/copy-sign-2.c: Likewise.
* gcc.dg/tree-ssa/mult-abs-2.c: Likewise.
* lib/target-supports.exp (check_effective_target_ifn_copysign): New.

--- inline copy of patch ---

diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 
4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de34905f614ef6957658b4
 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a SIMD 
instruction set.
 @item xorsign
 Target supports the xorsign optab expansion.
 
+@item ifn_copysign
+Target supports the IFN_COPYSIGN optab expansion for both scalar and vector
+types.
+
 @end table
 
 @subsubsection Environment attributes
diff --git a/gcc/match.pd b/gcc/match.pd
index 
d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b614890dc4729b521
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (simplify
   (copysigns @0 REAL_CST@1)
   (if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
-   (abs @0
+   (abs @0)
+#if GIMPLE
+   (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type,
+OPTIMIZE_FOR_BOTH))
+(negate (abs @0)))
+#endif
+   )))
 
+#if GIMPLE
 /* Transform fneg (fabs (X)) -> copysign (X, -1).  */
 (simplify
  (negate (abs @0))
- (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))
-
+ (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type,
+ OPTIMIZE_FOR_BOTH))
+   (IFN_COPYSIGN @0 { build_minus_one_cst (type); })))
+#endif
 /* copysign(copysign(x, y), z) -> copysign(x, z).  */
 (for copysigns (COPYSIGN_ALL)
  (simplify
diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c 
b/gcc/testsuite/gcc.dg/fold-copysign-1.c
index 
f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef39cc8f6e442ce
 100644
--- a/gcc/testsuite/gcc.dg/fold-copysign-1.c
+++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O -fdump-tree-cddce1" } */
+/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* 
x86_64-*-* } && ilp32 } } } */
 
 double foo (double x)
 {
@@ -12,5 +13,7 @@ double bar (double x)
   return __builtin_copysign (x, minuszero);
 }
 
-/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" } } */
-/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" { target 
ifn_copysign } } } */
+/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" { target 
ifn_copysign } } } */
+/* { dg-final { scan-tree-dump-times "= -" 1 "cddce1" { target { ! 
ifn_copysign } } } } */
+/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 2 "cddce1" { target { ! 
ifn_copysign } } } } */
diff --git a/gc

RE: [PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]

2024-01-05 Thread Tamar Christina
> -Original Message-
> From: Xi Ruoyao 
> Sent: Thursday, January 4, 2024 10:39 PM
> To: Palmer Dabbelt ; Tamar Christina
> 
> Cc: gcc-patches@gcc.gnu.org; nd ; rguent...@suse.de; Jeff Law
> 
> Subject: Re: [PATCH]middle-end: Don't apply copysign optimization if target 
> does
> not implement optab [PR112468]
> 
> On Thu, 2024-01-04 at 14:32 -0800, Palmer Dabbelt wrote:
> > > +proc check_effective_target_ifn_copysign { } {
> > > +    return [check_cached_effective_target_indexed ifn_copysign {
> > > +  expr {
> > > +  (([istarget i?86-*-*] || [istarget x86_64-*-*])
> > > +    && [is-effective-target sse])
> > > +  || ([istarget loongarch*-*-*] && [check_effective_target_loongarch_sx])
> 
> LoongArch has [scalar FP copysign][1] too.
> 
> [1]:https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-
> EN.html#_fscaleblogbcopysign_sd

Ok, so something like:

|| ([istarget loongarch*-*-*] && ([check_effective_target_loongarch_sx] ||  
[check_effective_target_hard_float]))
?

> 
> > > +  || ([istarget powerpc*-*-*]
> > > +  && ![istarget powerpc-*-linux*paired*])
> > > +  || [istarget alpha*-*-*]
> > > +  || [istarget aarch64*-*-*]
> > > +  || [is-effective-target arm_neon]
> > > +  || ([istarget s390*-*-*]
> > > +  && [check_effective_target_s390_vx])
> > > +  || ([istarget riscv*-*-*]
> > > +  && [check_effective_target_riscv_v])
> >
> > Unless I'm missing something, we have copysign in the scalar
> > floating-point ISAs as well.  So I think this should be
> >
> >   || ([istarget riscv*-*-*]
> >   && [check_effective_target_hard_float])
> 

Ah cool, will update it in next version. 

Thanks,
Tamar

> --
> Xi Ruoyao 
> School of Aerospace Science and Technology, Xidian University


[PATCH]middle-end: Don't apply copysign optimization if target does not implement optab [PR112468]

2024-01-04 Thread Tamar Christina
Hi All,

Currently GCC does not treat IFN_COPYSIGN the same as the copysign tree expr.
The latter has a libcall fallback and the IFN can only do optabs.

Because of this the change I made to optimize copysign only works if the
target has implemented the optab, but it should work for those that have the
libcall too.

More annoyingly if a target has vector versions of ABS and NEG but not COPYSIGN
then the change made them lose vectorization.

The proper fix for this is to treat the IFN the same as the tree EXPR and to
enhance expand_COPYSIGN to also support vector calls.

I have such a patch for GCC 15 but it's quite big and too invasive for stage-4.
As such this is a minimal fix, just don't apply the transformation and leave
targets which don't have the optab unoptimized.

Targets list for check_effective_target_ifn_copysign was gotten by grepping for
copysign and looking at the optab.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Tests ran in x86_64-pc-linux-gnu -m64/-m32 and tests no longer fail.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/112468
* doc/sourcebuild.texi: Document ifn_copysign.
* match.pd: Only apply transformation if target supports the IFN.

gcc/testsuite/ChangeLog:

PR tree-optimization/112468
* gcc.dg/fold-copysign-1.c: Modify tests based on if target supports
IFN_COPYSIGN.
* gcc.dg/pr55152-2.c: Likewise.
* gcc.dg/tree-ssa/abs-4.c: Likewise.
* gcc.dg/tree-ssa/backprop-6.c: Likewise.
* gcc.dg/tree-ssa/copy-sign-2.c: Likewise.
* gcc.dg/tree-ssa/mult-abs-2.c: Likewise.
* lib/target-supports.exp (check_effective_target_ifn_copysign): New.

--- inline copy of patch -- 
diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index 
4be67daedb20d394857c02739389cabf23c0d533..f4847dafe65cbbf8c9de34905f614ef6957658b4
 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -2664,6 +2664,10 @@ Target requires a command line argument to enable a SIMD 
instruction set.
 @item xorsign
 Target supports the xorsign optab expansion.
 
+@item ifn_copysign
+Target supports the IFN_COPYSIGN optab expansion for both scalar and vector
+types.
+
 @end table
 
 @subsubsection Environment attributes
diff --git a/gcc/match.pd b/gcc/match.pd
index 
d57e29bfe1d68afd4df4dda20fecc2405ff05332..87d13e7e3e1aa6d89119142b614890dc4729b521
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1159,13 +1159,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (simplify
   (copysigns @0 REAL_CST@1)
   (if (!REAL_VALUE_NEGATIVE (TREE_REAL_CST (@1)))
-   (abs @0
+   (abs @0)
+#if GIMPLE
+   (if (!direct_internal_fn_supported_p (IFN_COPYSIGN, type,
+OPTIMIZE_FOR_BOTH))
+(negate (abs @0)))
+#endif
+   )))
 
+#if GIMPLE
 /* Transform fneg (fabs (X)) -> copysign (X, -1).  */
 (simplify
  (negate (abs @0))
- (IFN_COPYSIGN @0 { build_minus_one_cst (type); }))
-
+ (if (direct_internal_fn_supported_p (IFN_COPYSIGN, type,
+ OPTIMIZE_FOR_BOTH))
+   (IFN_COPYSIGN @0 { build_minus_one_cst (type); })))
+#endif
 /* copysign(copysign(x, y), z) -> copysign(x, z).  */
 (for copysigns (COPYSIGN_ALL)
  (simplify
diff --git a/gcc/testsuite/gcc.dg/fold-copysign-1.c 
b/gcc/testsuite/gcc.dg/fold-copysign-1.c
index 
f9cafd14ab05f5e8ab2f6f68e62801d21c2df6a6..96b80c733794fffada1b08274ef39cc8f6e442ce
 100644
--- a/gcc/testsuite/gcc.dg/fold-copysign-1.c
+++ b/gcc/testsuite/gcc.dg/fold-copysign-1.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O -fdump-tree-cddce1" } */
+/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* 
x86_64-*-* } && ilp32 } } } */
 
 double foo (double x)
 {
@@ -12,5 +13,7 @@ double bar (double x)
   return __builtin_copysign (x, minuszero);
 }
 
-/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" } } */
-/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_copysign" 1 "cddce1" { target 
ifn_copysign } } } */
+/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 1 "cddce1" { target 
ifn_copysign } } } */
+/* { dg-final { scan-tree-dump-times "= -" 1 "cddce1" { target { ! 
ifn_copysign } } } } */
+/* { dg-final { scan-tree-dump-times "= ABS_EXPR" 2 "cddce1" { target { ! 
ifn_copysign } } } } */
diff --git a/gcc/testsuite/gcc.dg/pr55152-2.c b/gcc/testsuite/gcc.dg/pr55152-2.c
index 
605f202ed6bc7aa8fe921457b02ff0b88cc63ce6..24068cffa4a8e2807ba7d16c4ed3def4f736e797
 100644
--- a/gcc/testsuite/gcc.dg/pr55152-2.c
+++ b/gcc/testsuite/gcc.dg/pr55152-2.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-options "-O -ffinite-math-only -fno-signed-zeros -fstrict-overflow 
-fdump-tree-optimized" } */
+/* { dg-additional-options "-msse -mfpmath=sse" { target { { i?86-*-* 
x86_64-*-* } && ilp32 } } } */
 
 double g (double a)
 {
@@ -10,5 +11,6 @@ int f(int a)
   return (a<-a)?a:-a;
 }
 
-/* { dg-final { 

RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2024-01-04 Thread Tamar Christina
> -Original Message-
> From: Kyrylo Tkachov 
> Sent: Thursday, January 4, 2024 11:12 AM
> To: Tamar Christina ; gcc-patches@gcc.gnu.org
> Cc: nd ; Ramana Radhakrishnan
> ; Richard Earnshaw
> ; ni...@redhat.com
> Subject: RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
> 
> Hi Tamar,
> 
> > -----Original Message-
> > From: Tamar Christina 
> > Sent: Thursday, January 4, 2024 11:06 AM
> > To: Tamar Christina ; gcc-patches@gcc.gnu.org
> > Cc: nd ; Ramana Radhakrishnan
> > ; Richard Earnshaw
> > ; ni...@redhat.com; Kyrylo Tkachov
> > 
> > Subject: RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation
> >
> > Ping,
> >
> > ---
> >
> > Hi All,
> >
> > This adds an implementation for conditional branch optab for AArch32.
> > The previous version only allowed operand 0 but it looks like cbranch
> > expansion does not check with the target and so we have to implement all.
> >
> > I therefore did not commit it.  This is a larger version. I've also dropped 
> > the MVE
> > version because the mid-end can rewrite the comparison into comparing two
> > predicates without checking with the backend.  Since MVE only has 1 
> > predicate
> > register this would need to go through memory and two MRS calls.  It's 
> > unlikely
> > to be beneficial and so that's for GCC 15 when I can fix the middle-end.
> >
> > The cases where AArch32 is skipped in the testsuite are all 
> > missed-optimizations
> > due to AArch32 missing some optabs.
> 
> Does the testsuite have vect_* checks that can be used instead of target arm*?
> If so let's use those.

Unfortunately not, a lot of them center around handling of complex doubles.
Some tests work and some fail, which makes it hard to disable based on an
effective-target test.  They are things that look easy to fix so I may file
some tickets for them.

Cheers,
Tamar

> Otherwise it's okay as is.
> Thanks,
> Kyrill
> 
> >
> > For e.g.
> >
> > void f1 ()
> > {
> >   for (int i = 0; i < N; i++)
> > {
> >   b[i] += a[i];
> >   if (a[i] > 0)
> > break;
> > }
> > }
> >
> > For 128-bit vectors we generate:
> >
> > vcgt.s32q8, q9, #0
> > vpmax.u32   d7, d16, d17
> > vpmax.u32   d7, d7, d7
> > vmovr3, s14 @ int
> > cmp r3, #0
> >
> > and of 64-bit vector we can omit one vpmax as we still need to compress to
> > 32-bits.
> >
> > Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > * config/arm/neon.md (cbranch4): New.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/vect/vect-early-break_2.c: Skip Arm.
> > * gcc.dg/vect/vect-early-break_7.c: Likewise.
> > * gcc.dg/vect/vect-early-break_75.c: Likewise.
> > * gcc.dg/vect/vect-early-break_77.c: Likewise.
> > * gcc.dg/vect/vect-early-break_82.c: Likewise.
> > * gcc.dg/vect/vect-early-break_88.c: Likewise.
> > * lib/target-supports.exp (add_options_for_vect_early_break,
> > check_effective_target_vect_early_break_hw,
> > check_effective_target_vect_early_break): Support AArch32.
> > * gcc.target/arm/vect-early-break-cbranch.c: New test.
> >
> > --- inline version of patch ---
> >
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index
> >
> d213369ffc38fb88ad0357d848cc7da5af73bab7..ed659ab736862da416d1ff624
> 1d
> > 0d3e6c6b96ff1 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -408,6 +408,55 @@ (define_insn "vec_extract"
> >[(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
> >  )
> >
> > +;; Patterns comparing two vectors and conditionally jump.
> > +;; Avdanced SIMD lacks a vector != comparison, but this is a quite common
> > +;; operation.  To not pay the penalty for inverting == we can map our any
> > +;; comparisons to all i.e. any(~x) => all(x).
> > +;;
> > +;; However unlike the AArch64 version, we can't optimize this further as 
> > the
> > +;; chain is too long for combine due to these being unspecs so it doesn't 
> > fold
> > +;; the operation to something simpler.
> > +(define_expand "cbranch4"
> > +  [(set (pc) (if_then_else
> > + (match_operator 0 "expandable_comparison_operator"
> >

RE: [PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2024-01-04 Thread Tamar Christina
Ping,

---

Hi All,

This adds an implementation for conditional branch optab for AArch32.
The previous version only allowed operand 0 but it looks like cbranch
expansion does not check with the target and so we have to implement all.

I therefore did not commit it.  This is a larger version. I've also dropped the 
MVE
version because the mid-end can rewrite the comparison into comparing two
predicates without checking with the backend.  Since MVE only has 1 predicate
register this would need to go through memory and two MRS calls.  It's unlikely
to be beneficial and so that's for GCC 15 when I can fix the middle-end.

The cases where AArch32 is skipped in the testsuite are all missed-optimizations
due to AArch32 missing some optabs.

For e.g.

void f1 ()
{
  for (int i = 0; i < N; i++)
{
  b[i] += a[i];
  if (a[i] > 0)
break;
}
}

For 128-bit vectors we generate:

vcgt.s32q8, q9, #0
vpmax.u32   d7, d16, d17
vpmax.u32   d7, d7, d7
vmovr3, s14 @ int
cmp r3, #0

and of 64-bit vector we can omit one vpmax as we still need to compress to
32-bits.

Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/arm/neon.md (cbranch4): New.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_2.c: Skip Arm.
* gcc.dg/vect/vect-early-break_7.c: Likewise.
* gcc.dg/vect/vect-early-break_75.c: Likewise.
* gcc.dg/vect/vect-early-break_77.c: Likewise.
* gcc.dg/vect/vect-early-break_82.c: Likewise.
* gcc.dg/vect/vect-early-break_88.c: Likewise.
* lib/target-supports.exp (add_options_for_vect_early_break,
check_effective_target_vect_early_break_hw,
check_effective_target_vect_early_break): Support AArch32.
* gcc.target/arm/vect-early-break-cbranch.c: New test.

--- inline version of patch ---

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 
d213369ffc38fb88ad0357d848cc7da5af73bab7..ed659ab736862da416d1ff6241d0d3e6c6b96ff1
 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -408,6 +408,55 @@ (define_insn "vec_extract"
   [(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
 )
 
+;; Patterns comparing two vectors and conditionally jump.
+;; Advanced SIMD lacks a vector != comparison, but this is a quite common
+;; operation.  To not pay the penalty for inverting == we can map our any
+;; comparisons to all i.e. any(~x) => all(x).
+;;
+;; However unlike the AArch64 version, we can't optimize this further as the
+;; chain is too long for combine due to these being unspecs so it doesn't fold
+;; the operation to something simpler.
+(define_expand "cbranch4"
+  [(set (pc) (if_then_else
+ (match_operator 0 "expandable_comparison_operator"
+  [(match_operand:VDQI 1 "register_operand")
+   (match_operand:VDQI 2 "reg_or_zero_operand")])
+ (label_ref (match_operand 3 "" ""))
+ (pc)))]
+  "TARGET_NEON"
+{
+  rtx mask = operands[1];
+
+  /* If comparing against a non-zero vector we have to do a comparison first
+ so we can have a != 0 comparison with the result.  */
+  if (operands[2] != CONST0_RTX (mode))
+{
+  mask = gen_reg_rtx (mode);
+  emit_insn (gen_xor3 (mask, operands[1], operands[2]));
+}
+
+  /* For 128-bit vectors we need an additional reductions.  */
+  if (known_eq (128, GET_MODE_BITSIZE (mode)))
+{
+  /* Always reduce using a V4SI.  */
+  mask = gen_reg_rtx (V2SImode);
+  rtx low = gen_reg_rtx (V2SImode);
+  rtx high = gen_reg_rtx (V2SImode);
+  rtx op1 = lowpart_subreg (V4SImode, operands[1], mode);
+  emit_insn (gen_neon_vget_lowv4si (low, op1));
+  emit_insn (gen_neon_vget_highv4si (high, op1));
+  emit_insn (gen_neon_vpumaxv2si (mask, low, high));
+}
+
+  rtx op1 = lowpart_subreg (V2SImode, mask, GET_MODE (mask));
+  emit_insn (gen_neon_vpumaxv2si (op1, op1, op1));
+
+  rtx val = gen_reg_rtx (SImode);
+  emit_move_insn (val, gen_lowpart (SImode, mask));
+  emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, operands[3]));
+  DONE;
+})
+
 ;; This pattern is renamed from "vec_extract" to
 ;; "neon_vec_extract" and this pattern is called
 ;; by define_expand in vec-common.md file.
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
index 
5c32bf94409e9743e72429985ab3bf13aab8f2c1..dec0b492ab883de6e02944a95fd554a109a68a39
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
@@ -5,7 +5,7 @@
 
 /* { dg-additional-options "-Ofast" } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
"arm*-*-*" } } } } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c 

[PATCH]middle-end: check if target can do extract first for early breaks [PR113199]

2024-01-02 Thread Tamar Christina
Hi All,

I was generating the vector reverse mask without checking if the target
actually supported such an operation.

It also seems like more targets implement VEC_EXTRACT than permute on mask
registers.

So this adds a check for IFN_VEC_EXTRACT support when required and changes
the select first code to use it.

This is good for now since masks always come from whilelo.  But in the future
when masks can come from other sources we will need the old code back.
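
For reference, a scalar model of the two strategies (a hedged sketch with
made-up names, not GCC IL):

#include <stdint.h>
#define VF 4

/* Old approach: emulate "extract first active" by reversing both the vector
   and the mask and then doing an EXTRACT_LAST on the reversed copies.  */
int32_t
extract_first_via_reverse (const int32_t v[VF], const _Bool m[VF])
{
  int32_t rv[VF];
  _Bool rm[VF];
  for (int i = 0; i < VF; i++)
    {
      rv[i] = v[VF - 1 - i];
      rm[i] = m[VF - 1 - i];
    }
  int32_t res = 0;
  for (int i = 0; i < VF; i++)   /* last active lane of the reversed pair  */
    if (rm[i])
      res = rv[i];
  return res;
}

/* New approach: a WHILELO mask is a prefix mask, so the first active lane is
   always lane 0 and a plain lane-0 extract (IFN_VEC_EXTRACT) is enough.  */
int32_t
extract_first_prefix_mask (const int32_t v[VF], const _Bool m[VF])
{
  (void) m;
  return v[0];
}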

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.
tested on cross cc1 for amdgcn-amdhsa and issue fixed.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113199
* tree-vect-loop.cc (vectorizable_live_operation_1): Use
IFN_VEC_EXTRACT.
(vectorizable_live_operation): Check for IFN_VEC_EXTRACT support.

gcc/testsuite/ChangeLog:

PR tree-optimization/113199
* gcc.target/gcn/pr113199.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/gcn/pr113199.c 
b/gcc/testsuite/gcc.target/gcn/pr113199.c
new file mode 100644
index 
..8a641e5536e80e207ca0163cac66c0f4f6ca93f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/pr113199.c
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2" } */
+
+typedef long unsigned int size_t;
+typedef int wchar_t;
+struct tm
+{
+  int tm_mon;
+  int tm_year;
+};
+int abs (int);
+struct lc_time_T { const char *month[12]; };
+struct __locale_t * __get_current_locale (void) { }
+const struct lc_time_T * __get_time_locale (struct __locale_t *locale) { }
+const wchar_t * __ctloc (wchar_t *buf, const char *elem, size_t *len_ret) { 
return buf; }
+size_t
+__strftime (wchar_t *s, size_t maxsize, const wchar_t *format,
+ const struct tm *tim_p, struct __locale_t *locale)
+{
+  size_t count = 0;
+  const wchar_t *ctloc;
+  wchar_t ctlocbuf[256];
+  size_t i, ctloclen;
+  const struct lc_time_T *_CurrentTimeLocale = __get_time_locale (locale);
+{
+  switch (*format)
+ {
+ case L'B':
+   (ctloc = __ctloc (ctlocbuf, _CurrentTimeLocale->month[tim_p->tm_mon], 
));
+   for (i = 0; i < ctloclen; i++)
+ {
+   if (count < maxsize - 1)
+  s[count++] = ctloc[i];
+   else
+  return 0;
+   {
+  int century = tim_p->tm_year >= 0
+? tim_p->tm_year / 100 + 1900 / 100
+: abs (tim_p->tm_year + 1900) / 100;
+   }
+   }
+ }
+}
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
37f1be1101ffae779214056a0886411e0683e887..5aa92e67444e7aacf458fffa1428f1983c482374
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10648,36 +10648,18 @@ vectorizable_live_operation_1 (loop_vec_info 
loop_vinfo,
  _VINFO_MASKS (loop_vinfo),
  1, vectype, 0);
   tree scalar_res;
+  gimple_seq_add_seq (, tem);
 
   /* For an inverted control flow with early breaks we want EXTRACT_FIRST
-instead of EXTRACT_LAST.  Emulate by reversing the vector and mask. */
+instead of EXTRACT_LAST.  For now since the mask always comes from a
+WHILELO we can get the first element ignoring the mask since CLZ of the
+mask will always be zero.  */
   if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
-   {
- /* First create the permuted mask.  */
- tree perm_mask = perm_mask_for_reverse (TREE_TYPE (mask));
- tree perm_dest = copy_ssa_name (mask);
- gimple *perm_stmt
-   = gimple_build_assign (perm_dest, VEC_PERM_EXPR, mask,
-  mask, perm_mask);
- vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
-  );
- mask = perm_dest;
-
- /* Then permute the vector contents.  */
- tree perm_elem = perm_mask_for_reverse (vectype);
- perm_dest = copy_ssa_name (vec_lhs_phi);
- perm_stmt
-   = gimple_build_assign (perm_dest, VEC_PERM_EXPR, vec_lhs_phi,
-  vec_lhs_phi, perm_elem);
- vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
-  );
- vec_lhs_phi = perm_dest;
-   }
-
-  gimple_seq_add_seq (, tem);
-
-  scalar_res = gimple_build (, CFN_EXTRACT_LAST, scalar_type,
-mask, vec_lhs_phi);
+   scalar_res = gimple_build (, CFN_VEC_EXTRACT, TREE_TYPE (vectype),
+  vec_lhs_phi, bitstart);
+  else
+   scalar_res = gimple_build (, CFN_EXTRACT_LAST, scalar_type,
+  mask, vec_lhs_phi);
 
   /* Convert the extracted vector element to the scalar type.  */
   new_tree = gimple_convert (, lhs_type, scalar_res);
@@ -10852,9 

RE: skip vector profiles multiple exits

2024-01-02 Thread Tamar Christina
> -Original Message-
> From: Jan Hubicka 
> Sent: Friday, December 29, 2023 10:32 PM
> To: Tamar Christina 
> Cc: rguent...@suse.de; GCC Patches ; nd
> 
> Subject: Re: skip vector profiles multiple exits
> 
> > Hi Honza,
> Hi,
> >
> > I wasn't sure what to do here so I figured I'd ask.
> >
> > In adding support for multiple exits to the vectorizer I didn't know how to 
> > update
> this bit:
> >
> > https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-
> manip.cc#L3363
> >
> > Essentially, if skip_vector (i.e. not enough iteration to enter the vector 
> > loop) then
> the
> > previous code would update the new probability to be the same as that of the
> > exit edge.  This made sense because that's the only edge which could bring 
> > you to
> > the next loop preheader.
> >
> > With multiple exits this is no longer the case since any exit can bring you 
> > to the
> > Preaheader node.  I figured the new counts should simply be the sum of all 
> > exit
> > edges.  But that gives quite large count values compared to the rest of the 
> > loop.
> The sum of all exit counts (not probabilities) relative to header count should
> give you estimated probability that the loop iterates at any given
> iteration.  I am not sure how good estimate this is for loop
> preconditioning to be true (without profile histograms it is really hard
> to tell).
Happy new years!

Ah, so I need to subtract the loop header from the sum? I'll try 

> >
> > I then thought I would need to scale the counts by the probability of the 
> > edge
> > being taken.  The problem here is that the probabilities don't end up to 
> > 100%
> 
> So you are summing exit_edge->count ()?
> I am not sure how useful would be summit probabilities since they are
> conditional (relative to probability of entering BB you go to).
> How complicated CFG we now handle with vectorization?
> 

Yeah I was trying to sum the edge counts.  The CFG can get quite complicated
because we allow vectorization of any arbitrary number of exits as long as
that exit leaves the loop body.

In this current version we force everything to the scalar epilog, so the merge
block can get any number of incoming edges now.  Aside from this we still
support versioning and skip_epilog so you have the additional edges coming
in from there too.

Regards,
Tamar

> Honza
> >
> > so the scaled counts also looked kinda wonkey.   Any suggestions?
> >
> > If you want some small examples to look at, testcases
> > ./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to
> ./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
> > should be relevant here.
> >
> > Thanks,
> > Tamar


skip vector profiles multiple exits

2023-12-29 Thread Tamar Christina
Hi Honza,

I wasn't sure what to do here so I figured I'd ask.

In adding support for multiple exits to the vectorizer I didn't know how to 
update this bit:

https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363

Essentially, if skip_vector (i.e. not enough iteration to enter the vector 
loop) then the
previous code would update the new probability to be the same as that of the
exit edge.  This made sense because that's the only edge which could bring you 
to
the next loop preheader.

With multiple exits this is no longer the case since any exit can bring you to 
the
preheader node.  I figured the new counts should simply be the sum of all exit
edges.  But that gives quite large count values compared to the rest of the 
loop.

I then thought I would need to scale the counts by the probability of the edge
being taken.  The problem here is that the probabilities don't add up to 100%

so the scaled counts also looked kinda wonky.   Any suggestions?

If you want some small examples to look at, testcases
./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to 
./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
should be relevant here.

Thanks,
Tamar


[PATCH]middle-end: maintain LCSSA form when peeled vector iterations have virtual operands

2023-12-29 Thread Tamar Christina
Hi All,

This patch fixes several interconnected issues.

1. When picking an exit we wanted to check for niter_desc.may_be_zero not true.
   i.e. we want to pick an exit which we know will iterate at least once.
   However niter_desc.may_be_zero is not a boolean.  It is a tree that encodes
   a boolean value.  !niter_desc.may_be_zero is just checking if we have some
   information, not what the information is.  This leads us to pick a more
   difficult to vectorize exit more often than we should.

2. Because we had this bug, we used to pick an alternative exit much more often
   which showed one issue: when the loop accesses memory and we "invert it" we
   would corrupt the VUSE chain.  This is because on a peeled vector iteration
   every exit restarts the loop (i.e. they're all early) BUT since we may have
   performed a store, the vUSE would need to be updated.  This version maintains
   virtual PHIs correctly in these cases.   Note that we can't simply remove all
   of them and recreate them because we need the PHI nodes still in the right
   order for if skip_vector.

3. Since we're moving the stores to a safe location I don't think we actually
   need to analyze whether the store is in range of the memref,  because if we
   ever get there, we know that the loads must be in range, and if the loads are
   in range and we get to the store we know the early breaks were not taken and
   so the scalar loop would have done the VF stores too.

4. Instead of searching for where to move stores to, they should always be in
   the exit belonging to the latch.  We can only ever delay stores and even if we
   pick a different exit than the latch one as the main one, effects still
   happen in program order when vectorized.  If we don't move the stores to the
   latch exit but instead to wherever we pick as the "main" exit then we can
   perform incorrect memory accesses (luckily these are trapped by verify_ssa).

5. We only used to analyze loads inside the same BB as an early break, and also
   we'd never analyze the ones inside the block where we'd be moving memory
   references to.  This is obviously bogus and to fix it this patch splits apart
   the two constraints.  We first validate that all load memory references are
   in bounds and only after that do we perform the alias checks for the writes.
   This makes the code simpler to understand and more trivially correct.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
(vect_do_peeling): Maintain virtual PHIs on inverted loops.
* tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closes to
latch.
(vect_create_loop_vinfo): Record all conds instead of only alt ones.
* tree-vectorizer.h: Fix comment

gcc/testsuite/ChangeLog:

PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
* g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
* g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
* gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
* gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
* gcc.dg/vect/vect-early-break_97-pr113172.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc 
b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc
new file mode 100644
index 
..f78db8669dcc65f1b45ea78f4433d175e1138332
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+int b;
+void a() __attribute__((__noreturn__));
+void c() {
+  char *buf;
+  int bufsz = 64;
+  while (b) {
+!bufsz ? a(), 0 : *buf++ = bufsz--;
+b -= 4;
+  }
+}
diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc 
b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc
new file mode 100644
index 
..dcd19fa2d2145e09de18279479b3f20fc27336ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+char UnpackReadTables_BitLength[20];
+int UnpackReadTables_ZeroCount;
+void UnpackReadTables() {
+  for (unsigned I = 0; I < 20;)
+while 

[PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]

2023-12-29 Thread Tamar Christina
Hi All,

Only trying to update certain dominators doesn't seem to work very well
because as the loop gets versioned, peeled, or takes the skip_vector path we end up with
very complicated control flow.  This means that the final merge blocks for the
loop exit are not easy to find or update.

Instead of trying to pick which exits to update, this changes it to update all
the blocks reachable by the new exits.  This is because they'll contain common
blocks with e.g. the versioned loop.  It's these blocks that need an update
most of the time.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR middle-end/113144
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Update all dominators reachable from exit.

gcc/testsuite/ChangeLog:

PR middle-end/113144
* gcc.dg/vect/vect-early-break_94-pr113144.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
new file mode 100644
index 
..903fe7be6621e81db6f29441e4309fa213d027c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+long tar_atol256_max, tar_atol256_size, tar_atosl_min;
+char tar_atol256_s;
+void __errno_location();
+
+
+inline static long tar_atol256(long min) {
+  char c;
+  int sign;
+  c = tar_atol256_s;
+  sign = c;
+  while (tar_atol256_size) {
+if (c != sign)
+  return sign ? min : tar_atol256_max;
+c = tar_atol256_size--;
+  }
+  if ((c & 128) != (sign & 128))
+return sign ? min : tar_atol256_max;
+  return 0;
+}
+
+inline static long tar_atol(long min) {
+  return tar_atol256(min);
+}
+
+long tar_atosl() {
+  long n = tar_atol(-1);
+  if (tar_atosl_min) {
+__errno_location();
+return 0;
+  }
+  if (n > 0)
+return 0;
+  return n;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
1066ea17c5674e03412b3dcd8a62ddf4dd54cf31..3810983a80c8b989be9fd9a9993642069fd39b99
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1716,8 +1716,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, 
edge loop_exit,
  /* Now link the alternative exits.  */
  if (multiple_exits_p)
{
- set_immediate_dominator (CDI_DOMINATORS, new_preheader,
-  main_loop_exit_block);
  for (auto gsi_from = gsi_start_phis (loop->header),
   gsi_to = gsi_start_phis (new_preheader);
   !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
@@ -1751,12 +1749,26 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
 
   /* Finally after wiring the new epilogue we need to update its main exit
 to the original function exit we recorded.  Other exits are already
-correct.  */
+correct.  Because of versioning, skip vectors and others we must update
+the dominators of every node reachable by the new exits.  */
   if (multiple_exits_p)
{
  update_loop = new_loop;
- for (edge e : get_loop_exit_edges (loop))
-   doms.safe_push (e->dest);
+ hash_set  visited;
+ auto_vec  workset;
+ edge ev;
+ edge_iterator ei;
+ workset.safe_splice (get_loop_exit_edges (loop));
+ while (!workset.is_empty ())
+   {
+ auto bb = workset.pop ()->dest;
+ if (visited.add (bb))
+   continue;
+ doms.safe_push (bb);
+ FOR_EACH_EDGE (ev, ei, bb->succs)
+   workset.safe_push (ev);
+   }
+ visited.empty ();
  doms.safe_push (exit_dest);
 
  /* Likely a fall-through edge, so update if needed.  */




-- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
new file mode 100644
index 
..903fe7be6621e81db6f29441e4309fa213d027c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+long tar_atol256_max, tar_atol256_size, tar_atosl_min;
+char tar_atol256_s;
+void __errno_location();
+
+
+inline static long tar_atol256(long min) {
+  char c;
+  int sign;
+  c = tar_atol256_s;
+  sign = c;
+  while (tar_atol256_size) {
+if (c != sign)
+  return sign 

[PATCH]middle-end: rejects loops with nonlinear inductions and early breaks [PR113163]

2023-12-29 Thread Tamar Christina
Hi All,

We can't support nonlinear inductions other than neg when vectorizing
early breaks and the iteration count is known.

For early break we currently require a peeled epilog but in these cases
we can't compute the remaining values.
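
For example, a loop of roughly this shape (a hedged sketch, not the reduced
testcase from the PR) combines an early break with a nonlinear multiplicative
induction on x, which is what the new check rejects when partial vectors are
in use:

unsigned int a[1024];

unsigned int
f (unsigned int x)
{
  for (int i = 0; i < 1024; i++)
    {
      if (a[i] == 42)
        return x;   /* early break  */
      x *= 3;       /* nonlinear (vect_step_op_mul) induction  */
      a[i] = x;
    }
  return x;
}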

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
tested on cross cc1 for amdgcn-amdhsa and issue fixed.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR middle-end/113163
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):

gcc/testsuite/ChangeLog:

PR middle-end/113163
* gcc.target/gcn/pr113163.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c 
b/gcc/testsuite/gcc.target/gcn/pr113163.c
new file mode 100644
index 
..99b0fdbaf3a3152ca008b5109abf6e80d8cb3d6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/pr113163.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize" } */ 
+
+struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; };
+static const char R64_ARRAY[] = 
"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+char *
+_l64a_r (struct _reent *rptr,
+ long value)
+{
+  char *ptr;
+  char *result;
+  int i, index;
+  unsigned long tmp = (unsigned long)value & 0x;
+  result = 
+  ((
+  rptr
+  )->_new._reent._l64a_buf)
+   ;
+  ptr = result;
+  for (i = 0; i < 6; ++i)
+{
+  if (tmp == 0)
+ {
+   *ptr = '\0';
+   break;
+ }
+  *ptr++ = R64_ARRAY[index];
+  tmp >>= 6;
+}
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c186302fbeaf515be8cf
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
   return false;
 }
 
+  /* We can't support partial vectors and early breaks with an induction
+ type other than add or neg since we require the epilog and can't
+ perform the peeling.  PR113163.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+  && induction_type != vect_step_op_neg)
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"Peeling for epilogue is not supported"
+" for nonlinear induction except neg"
+" when iteration count is known and early breaks.\n");
+  return false;
+}
+
   return true;
 }
 




-- 
diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c 
b/gcc/testsuite/gcc.target/gcn/pr113163.c
new file mode 100644
index 
..99b0fdbaf3a3152ca008b5109abf6e80d8cb3d6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/pr113163.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize" } */ 
+
+struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; };
+static const char R64_ARRAY[] = 
"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+char *
+_l64a_r (struct _reent *rptr,
+ long value)
+{
+  char *ptr;
+  char *result;
+  int i, index;
+  unsigned long tmp = (unsigned long)value & 0x;
+  result = 
+  ((
+  rptr
+  )->_new._reent._l64a_buf)
+   ;
+  ptr = result;
+  for (i = 0; i < 6; ++i)
+{
+  if (tmp == 0)
+ {
+   *ptr = '\0';
+   break;
+ }
+  *ptr++ = R64_ARRAY[index];
+  tmp >>= 6;
+}
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c186302fbeaf515be8cf
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
   return false;
 }
 
+  /* We can't support partial vectors and early breaks with an induction
+ type other than add or neg since we require the epilog and can't
+ perform the peeling.  PR113163.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+  && induction_type != vect_step_op_neg)
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"Peeling for epilogue is not supported"
+" for nonlinear induction except neg"
+" when iteration count is known and early breaks.\n");
+  return false;
+}
+
   return true;
 }
 





[PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2023-12-29 Thread Tamar Christina
Hi All,

This adds an implementation for conditional branch optab for AArch32.
The previous version only allowed operand 0 but it looks like cbranch
expansion does not check with the target and so we have to implement all.

I therefore did not commit it.  This is a larger version. 

For e.g.

void f1 ()
{
  for (int i = 0; i < N; i++)
{
  b[i] += a[i];
  if (a[i] > 0)
break;
}
}

For 128-bit vectors we generate:

vcgt.s32q8, q9, #0
vpmax.u32   d7, d16, d17
vpmax.u32   d7, d7, d7
vmovr3, s14 @ int
cmp r3, #0

and for 64-bit vectors we can omit one vpmax as we still need to compress to
32-bits.

Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/arm/neon.md (cbranch4): New.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_2.c: Skip Arm.
* gcc.dg/vect/vect-early-break_7.c: Likewise.
* gcc.dg/vect/vect-early-break_75.c: Likewise.
* gcc.dg/vect/vect-early-break_77.c: Likewise.
* gcc.dg/vect/vect-early-break_82.c: Likewise.
* gcc.dg/vect/vect-early-break_88.c: Likewise.
* lib/target-supports.exp (add_options_for_vect_early_break,
check_effective_target_vect_early_break_hw,
check_effective_target_vect_early_break): Support AArch32.
* gcc.target/arm/vect-early-break-cbranch.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 
d213369ffc38fb88ad0357d848cc7da5af73bab7..0f088a51d31e6882bc0fabbad99862b8b465dd22
 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -408,6 +408,54 @@ (define_insn "vec_extract"
   [(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
 )
 
+;; Patterns comparing two vectors and conditionally jump.
+;; Advanced SIMD lacks a vector != comparison, but this is a quite common
+;; operation.  To not pay the penalty for inverting == we can map our any
+;; comparisons to all i.e. any(~x) => all(x).
+;;
+;; However unlike the AArch64 version, we can't optimize this further as the
+;; chain is too long for combine due to these being unspecs so it doesn't fold
+;; the operation to something simpler.
+(define_expand "cbranch4"
+  [(set (pc) (if_then_else
+ (match_operator 0 "expandable_comparison_operator"
+  [(match_operand:VDQI 1 "register_operand")
+   (match_operand:VDQI 2 "reg_or_zero_operand")])
+ (label_ref (match_operand 3 "" ""))
+ (pc)))]
+  "TARGET_NEON"
+{
+  rtx mask = operands[1];
+
+  /* If comparing against a non-zero vector we have to do a comparison first
+ so we can have a != 0 comparison with the result.  */
+  if (operands[2] != CONST0_RTX (mode))
+{
+  mask = gen_reg_rtx (mode);
+  emit_insn (gen_xor3 (mask, operands[1], operands[2]));
+}
+
+  /* For 128-bit vectors we need an additional reduction.  */
+  if (known_eq (128, GET_MODE_BITSIZE (mode)))
+{
+  /* Always reduce using a V4SI.  */
+  mask = gen_reg_rtx (V2SImode);
+  rtx low = gen_reg_rtx (V2SImode);
+  rtx high = gen_reg_rtx (V2SImode);
+  rtx op1 = simplify_gen_subreg (V4SImode, operands[1], mode, 0);
+  emit_insn (gen_neon_vget_lowv4si (low, op1));
+  emit_insn (gen_neon_vget_highv4si (high, op1));
+  emit_insn (gen_neon_vpumaxv2si (mask, low, high));
+}
+
+  emit_insn (gen_neon_vpumaxv2si (mask, mask, mask));
+
+  rtx val = gen_reg_rtx (SImode);
+  emit_move_insn (val, gen_lowpart (SImode, mask));
+  emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, operands[3]));
+  DONE;
+})
+
 ;; This pattern is renamed from "vec_extract" to
 ;; "neon_vec_extract" and this pattern is called
 ;; by define_expand in vec-common.md file.
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
index 
5c32bf94409e9743e72429985ab3bf13aab8f2c1..dec0b492ab883de6e02944a95fd554a109a68a39
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
@@ -5,7 +5,7 @@
 
 /* { dg-additional-options "-Ofast" } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
"arm*-*-*" } } } } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
index 
8c86c5034d7522b3733543fb384a23c5d6ed0fcf..d218a0686719fee4c167684dcf26402851b53260
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
@@ -5,7 +5,7 @@
 
 /* { dg-additional-options "-Ofast" } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
"arm*-*-*" } } } } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_75.c 

[PATCH]AArch64 Update costing for vector conversions [PR110625]

2023-12-29 Thread Tamar Christina
Hi All,

In gimple the operation

short _8;
double _9;
_9 = (double) _8;

denotes two operations.  First we have to widen from short to long and then
convert this integer to a double.
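
For reference, a loop of this shape (a minimal sketch, not the attached
testcase) is where the double operation shows up:

void
f (double *restrict d, short *restrict s, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (double) s[i];   /* widen the short elements, then convert to double  */
}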

Currently however we only count the widen/truncate operations:

(double) _5 6 times vec_promote_demote costs 12 in body
(double) _5 12 times vec_promote_demote costs 24 in body

but not the actual conversion operation, which needs an additional 12
instructions in the attached testcase.   Without this the attached testcase ends
up incorrectly thinking that it's beneficial to vectorize the loop at a very
high VF = 8 (4x unrolled).

Because we can't change the mid-end to account for this, the costing code in the
backend now keeps track of whether the previous operation was a
promotion/demotion and adjusts the expected number of instructions to:

1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs are
   different, double it, since we need to convert and promote.
2. If the previous operation was a demotion/promotion then reduce the
   cost of the current operation by the extra amount we added for the last one.

with the patch we get:

(double) _5 6 times vec_promote_demote costs 24 in body
(double) _5 12 times vec_promote_demote costs 36 in body

which correctly accounts for 30 operations.
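
(The arithmetic behind those numbers: with the adjustment below the first
statement is costed as 2 * 6 = 12 promote/demote operations and the second as
2 * 12 - 6 = 18, i.e. 12 + 18 = 30 operations, matching the costs of 24 and 36
at 2 units per operation.)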

This fixes the regression reported on Neoverse N2 and using the new generic
Armv9-a cost model.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
Adjust throughput and latency calculations for vector conversions.
(class aarch64_vector_costs): Add m_num_last_promote_demote.

gcc/testsuite/ChangeLog:

PR target/110625
* gcc.target/aarch64/pr110625_4.c: New test.
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
--param aarch64-sve-compare-costs=0.
* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
f9850320f61c5ddccf47e6583d304e5f405a484f..561413e52717974b96f79cc83008f237c536
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16077,6 +16077,15 @@ private:
  leaving a vectorization of { elts }.  */
   bool m_stores_to_vector_load_decl = false;
 
+  /* Non-zero if the last operation we costed is a vector promotion or 
demotion.
+ In this case the value is the number of insn in the last operation.
+
+ On AArch64 vector promotion and demotions require us to first widen or
+ narrow the input and only after that emit conversion instructions.  For
+ costing this means we need to emit the cost of the final conversions as
+ well.  */
+  unsigned int m_num_last_promote_demote = 0;
+
   /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
  - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
SIMD code.
@@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
 stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
  vectype, stmt_cost);
 
+  /*  Vector promotion and demotion requires us to widen the operation first
+  and only after that perform the conversion.  Unfortunately the mid-end
+  expects this to be doable as a single operation and doesn't pass on
+  enough context here for us to tell which operation is happening.  To
+  account for this we count every promote-demote operation twice and if
+  the previously costed operation was also a promote-demote we reduce
+  the cost of the currently being costed operation to simulate the final
+  conversion cost.  Note that for SVE we can do better here if the 
converted
+  value comes from a load since the widening load would consume the 
widening
+  operations.  However since we're in stage 3 we can't change the helper
+  vect_is_extending_load and duplicating the code seems not useful.  */
+  gassign *assign = NULL;
+  if (kind == vec_promote_demote
+  && (assign = dyn_cast  (STMT_VINFO_STMT (stmt_info)))
+  && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
+{
+  auto new_count = count * 2 - m_num_last_promote_demote;
+  m_num_last_promote_demote = count;
+  count = new_count;
+}
+  else
+m_num_last_promote_demote = 0;
+
   if (stmt_info && aarch64_use_new_vector_costs_p ())
 {
   /* Account for any extra "embedded" costs that apply additively
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_4.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
new file mode 100644
index 
..34dac19d81a85d63706d54f4cb0c738ce592d5d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } 

[PATCH][committed]middle-end: explicitly initialize vec_stmts [PR113132]

2023-12-25 Thread Tamar Christina
Hi All,

When configured with --enable-checking=release we get a false
positive on the use of vec_stmts, as the compiler seems unable
to notice that it gets initialized through pass-by-reference.

This explicitly initializes the local.
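
The pattern behind the warning is the usual pass-by-reference one; a hedged
standalone analogue (not the vectorizer code itself):

/* The local is only ever written through a pointer inside another function,
   which a release-checking build may fail to see as an initialization.  */
static void
init_by_ref (int *p)
{
  *p = 0;
}

int
use (void)
{
  int x;             /* may be reported as possibly uninitialized  */
  init_by_ref (&x);  /* but this call does initialize it  */
  return x;
}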

Bootstrapped Regtested on x86_64-pc-linux-gnu and no issues.

Committed under the obvious rule.

Thanks,
Tamar

gcc/ChangeLog:

PR bootstrap/113132
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize 
vec_stmts;

--- inline copy of patch -- 
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
88261a3a4f57d5e2124939b069b0e92c57d9abba..f51ae3e719e753059389cf9495b6d65b3b1191cb
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6207,7 +6207,7 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
   exit_bb = loop_exit->dest;
   exit_gsi = gsi_after_labels (exit_bb);
   reduc_inputs.create (slp_node ? vec_num : ncopies);
-  vec  vec_stmts;
+  vec  vec_stmts = vNULL;
   for (unsigned i = 0; i < vec_num; i++)
 {
   gimple_seq stmts = NULL;




-- 
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
88261a3a4f57d5e2124939b069b0e92c57d9abba..f51ae3e719e753059389cf9495b6d65b3b1191cb
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6207,7 +6207,7 @@ vect_create_epilog_for_reduction (loop_vec_info 
loop_vinfo,
   exit_bb = loop_exit->dest;
   exit_gsi = gsi_after_labels (exit_bb);
   reduc_inputs.create (slp_node ? vec_num : ncopies);
-  vec  vec_stmts;
+  vec  vec_stmts = vNULL;
   for (unsigned i = 0; i < vec_num; i++)
 {
   gimple_seq stmts = NULL;





[PATCH][testsuite]: Add more pragma novector to new tests

2023-12-24 Thread Tamar Christina
Hi All,

This patch was pre-appproved by Richi.

This updates the testsuite and adds more #pragma GCC novector to various tests
that would otherwise vectorize the vector result checking code.

This cleans out the testsuite since the last rebase and prepares for the landing
of the early break patch.

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu 
and no issues.

Pushed to master.

Thanks,
Tamar

gcc/testsuite/ChangeLog:

* gcc.dg/vect/no-scevccp-slp-30.c: Add pragma GCC novector to abort
loop.
* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
* gcc.dg/vect/no-section-anchors-vect-69.c: Likewise.
* gcc.target/aarch64/vect-xorsign_exec.c: Likewise.
* gcc.target/i386/avx512er-vrcp28ps-3.c: Likewise.
* gcc.target/i386/avx512er-vrsqrt28ps-3.c: Likewise.
* gcc.target/i386/avx512er-vrsqrt28ps-5.c: Likewise.
* gcc.target/i386/avx512f-ceil-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-ceil-vec-1.c: Likewise.
* gcc.target/i386/avx512f-ceilf-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-ceilf-vec-1.c: Likewise.
* gcc.target/i386/avx512f-floor-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-floor-vec-1.c: Likewise.
* gcc.target/i386/avx512f-floorf-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-floorf-vec-1.c: Likewise.
* gcc.target/i386/avx512f-rint-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-rintf-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-round-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-roundf-sfix-vec-1.c: Likewise.
* gcc.target/i386/avx512f-trunc-vec-1.c: Likewise.
* gcc.target/i386/avx512f-truncf-vec-1.c: Likewise.
* gcc.target/i386/vect-alignment-peeling-1.c: Likewise.
* gcc.target/i386/vect-alignment-peeling-2.c: Likewise.
* gcc.target/i386/vect-pack-trunc-1.c: Likewise.
* gcc.target/i386/vect-pack-trunc-2.c: Likewise.
* gcc.target/i386/vect-perm-even-1.c: Likewise.
* gcc.target/i386/vect-unpack-1.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c 
b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
index 
00d0eca56eeca6aee6f11567629dc955c0924c74..534bee4a1669a7cbd95cf6007f28dafd23bab8da
 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-30.c
@@ -24,9 +24,9 @@ main1 ()
}
 
   /* check results:  */
-#pragma GCC novector
for (j = 0; j < N; j++)
{
+#pragma GCC novector
 for (i = 0; i < N; i++)
   {
 if (out[i*4] != 8
diff --git a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c 
b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
index 
48b6a9b0681cf1fe410755c3e639b825b27895b0..22817a57ef81398cc018a78597755397d20e0eb9
 100644
--- a/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
+++ b/gcc/testsuite/gcc.dg/vect/no-scevccp-slp-31.c
@@ -27,6 +27,7 @@ main1 ()
 #pragma GCC novector
  for (i = 0; i < N; i++)
{
+#pragma GCC novector
 for (j = 0; j < N; j++) 
   {
 if (a[i][j] != 8)
diff --git a/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c 
b/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
index 
a0e53d5fef91868dfdbd542dd0a98dff92bd265b..0861d488e134d3f01a2fa83c56eff7174f36ddfb
 100644
--- a/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
+++ b/gcc/testsuite/gcc.dg/vect/no-section-anchors-vect-69.c
@@ -83,9 +83,9 @@ int main1 ()
 }
 
   /* check results:  */
-#pragma GCC novector
   for (i = 0; i < N; i++)
 {
+#pragma GCC novector
   for (j = 0; j < N; j++)
{
   if (tmp1[2].e.n[1][i][j] != 8)
@@ -103,9 +103,9 @@ int main1 ()
 }
 
   /* check results:  */
-#pragma GCC novector
   for (i = 0; i < N - NINTS; i++)
 {
+#pragma GCC novector
   for (j = 0; j < N - NINTS; j++)
{
   if (tmp2[2].e.n[1][i][j] != 8)
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c 
b/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
index 
cfa22115831272cb1d4e1a38512f10c3a1c6ad77..84f33d3f6cce9b0017fd12ab961019041245ffae
 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect-xorsign_exec.c
@@ -33,6 +33,7 @@ main (void)
 r[i] = a[i] * __builtin_copysignf (1.0f, b[i]);
 
   /* check results:  */
+#pragma GCC novector
   for (i = 0; i < N; i++)
 if (r[i] != a[i] * __builtin_copysignf (1.0f, b[i]))
   abort ();
@@ -41,6 +42,7 @@ main (void)
 rd[i] = ad[i] * __builtin_copysign (1.0d, bd[i]);
 
   /* check results:  */
+#pragma GCC novector
   for (i = 0; i < N; i++)
 if (rd[i] != ad[i] * __builtin_copysign (1.0d, bd[i]))
   abort ();
diff --git a/gcc/testsuite/gcc.target/i386/avx512er-vrcp28ps-3.c 
b/gcc/testsuite/gcc.target/i386/avx512er-vrcp28ps-3.c
index 
c0b1f7b31027f9438ab1641d3002887eabd34efa..1e68926a3180fffc6cbc8c6eed639a567fc32566
 100644
--- 

RE: [PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks

2023-12-20 Thread Tamar Christina
> > + /* If we've moved a VDEF, extract the defining MEM and update
> > +usages of it.   */
> > + tree vdef;
> > + /* This statement is to be moved.  */
> > + if ((vdef = gimple_vdef (stmt)))
> > +   LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS
> (loop_vinfo).safe_push (
> > +   stmt);
> 
> I'm also unsure why you need 'chain' at all given you have the vector
> of stores to be moved?
> 

Yeah, so originally I wanted to move statements other than stores.  While stores
are needed for correctness, moving the other statements was so we didn't extend
the live range too much for intermediate values.

This proved difficult but eventually I got it to work, but as you saw it was 
meh code.
Instead I guess the better approach is to teach sched1 in GCC 15 to schedule 
across
branches in loops.

With that in mind, I changed it to move only stores.  Since stores never
produce a value and are sinks, I don't really need fixed nor chain.

So here's a much cleaned up patch.

Bootstrapped Regtested on aarch64-none-linux-gnu and
x86_64-pc-linux-gnu no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-if-conv.cc (ref_within_array_bound): Expose.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
(vect_analyze_data_ref_dependences): Use them.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
early_breaks.
(move_early_exit_stmts): New.
(vect_transform_loop): use it/
* tree-vect-stmts.cc (vect_is_simple_use): Use vect_early_exit_def.
* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
(ref_within_array_bound): New.
(class _loop_vec_info): Add early_breaks, early_break_conflict,
early_break_vuses.
(LOOP_VINFO_EARLY_BREAKS): New.
(LOOP_VINFO_EARLY_BRK_STORES): New.
(LOOP_VINFO_EARLY_BRK_DEST_BB): New.
(LOOP_VINFO_EARLY_BRK_VUSES): New.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_57.c: Update.
* gcc.dg/vect/vect-early-break_79.c: New test.
* gcc.dg/vect/vect-early-break_80.c: New test.
* gcc.dg/vect/vect-early-break_81.c: New test.
* gcc.dg/vect/vect-early-break_83.c: New test.

--- inline copy of patch ---

diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
index 
be4a0c7426093059ce37a9f824defb7ae270094d..9a4e795f92b7a8577ac71827f5cb0bd15d88ebe1
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_57.c
@@ -5,6 +5,7 @@
 /* { dg-additional-options "-Ofast" } */
 
 /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "epilog loop required" "vect" } } */
 
 void abort ();
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c
new file mode 100644
index 
..a26011ef1ba5aa000692babc90d46621efc2f8b5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_79.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+
+#undef N
+#define N 32
+
+unsigned vect_a[N];
+unsigned vect_b[N];
+  
+unsigned test4(unsigned x)
+{
+ unsigned ret = 0;
+ for (int i = 0; i < 1024; i++)
+ {
+   vect_b[i] = x + i;
+   if (vect_a[i] > x)
+ break;
+   vect_a[i] = x;
+   
+ }
+ return ret;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c
new file mode 100644
index 
..ddf504e0c8787ae33a0e98045c1c91f2b9f533a9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_80.c
@@ -0,0 +1,43 @@
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+extern void abort ();
+
+int x;
+__attribute__ ((noinline, noipa))
+void foo (int *a, int *b)
+{
+  int local_x = x;
+  for (int i = 0; i < 1024; ++i)
+{
+  if (i + local_x == 13)
+break;
+  a[i] = 2 * b[i];
+}
+}
+
+int main ()
+{
+  int a[1024] = {0};
+  int b[1024] = {0};
+
+  for (int i = 0; i < 1024; i++)
+b[i] = i;
+
+  x = -512;
+  foo (a, b);
+
+  if (a[524] != 1048)
+abort ();
+
+  if (a[525] != 0)
+abort ();
+
+  if (a[1023] != 0)
+abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c
new file mode 100644
index 
..c38e394ad87863f0702d422cb58018b979c9fba6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_81.c
@@ -0,0 

RE: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests

2023-12-19 Thread Tamar Christina
> Do you mean for ARM SVE, these tests need to be specified as only ARM SVE ?

I think that would be the right thing to do.  I think these tests are checking 
if we support VLA SLP.
Changing it to a PASS unconditionally means that if someone runs the testsuite
in SVE only mode they’ll fail.

> The difference between RVV and ARM is that: variable-length and fixed-length 
> vectors are both valid on RVV, using same RVV ISA.
> Wheras, for ARM, variable-length vectors use SVE ISA but fixed-length vectors 
> use NEON ISA.

Ah, that makes sense why you want to remove the check.  I guess whoever added
the vect_variable_length intended
it to fail when VLA though. Perhaps these tests need a dg-add-options
? Since I think other tests already test fixed-length 
vectors.

But lets see what Richi says.

Thanks,
Tamar


From: 钟居哲 
Sent: Tuesday, December 19, 2023 1:02 PM
To: Tamar Christina ; gcc-patches 

Cc: rguenther 
Subject: Re: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from 
some tests

Do you mean for ARM SVE, these tests need to be specified as only ARM SVE ?

Actually, for RVV, it is the same situation as ARM. We are using VLS modes
(fixed-length vectors) to vectorize these cases so that they are XPASS.
(fixed-length vectors) to vectorize these cases so that they are XPASS.

The difference between RVV and ARM is that: variable-length and fixed-length 
vectors are both valid on RVV, using same RVV ISA.
Whereas, for ARM, variable-length vectors use SVE ISA but fixed-length vectors
use NEON ISA.



juzhe.zh...@rivai.ai

From: Tamar Christina <tamar.christ...@arm.com>
Date: 2023-12-19 20:29
To: Juzhe-Zhong <juzhe.zh...@rivai.ai>; gcc-patches@gcc.gnu.org
CC: rguent...@suse.de
Subject: RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from 
some tests
Hi Juzhe,

> -Original Message-
> From: Juzhe-Zhong <juzhe.zh...@rivai.ai>
> Sent: Tuesday, December 19, 2023 11:19 AM
> To: gcc-patches@gcc.gnu.org
> Cc: rguent...@suse.de; Tamar Christina <tamar.christ...@arm.com>; Juzhe-Zhong <juzhe.zh...@rivai.ai>
> Subject: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some
> tests
>
> Hi, this patch fixes these following regression FAILs on RVV:
>
> XPASS: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> XPASS: gcc.dg/vect/bb-slp-43.c -flto -ffat-lto-objects  scan-tree-dump-not 
> slp2
> "vector operands from scalars"
> XPASS: gcc.dg/vect/bb-slp-43.c scan-tree-dump-not slp2 "vector operands from
> scalars"
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c -flto -ffat-lto-objects  
> scan-tree-dump-
> times slp2 "optimized: basic block" 2
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c scan-tree-dump-times slp2 "optimized:
> basic block" 2
>
> Since vect_variable_length are available for ARM SVE and RVV, I just use 
> compiler
> explorer to confirm ARM SVE same as
> RVV.
>
> Hi, @Tamar. Could you double check whether this patch fix is reasonable to 
> you ?
>

Hmm I would be surprised if this is working correctly for RVV since as far as I 
know we don't have
variable length support in SLP i.e. SLP can't predicate operation during build 
so the
current vectorizer only supports fixed length vector SLP, unless Richi did some 
magic?

For SVE the reason this XPASS is because the compiler will fallback to NEON 
unless it's
told it can't.  But that's not actually testing VLA SLP.

i.e. https://godbolt.org/z/5n5fWahxh  just using `+sve` isn't enough and it has 
to be told
it can only use SVE.  Is it perhaps something similar for RVV?

If RVV has a similar param, perhaps the correct fix is to append it to the 
tests so they
XFAIL correctly?

Regards,
Tamar

> And.
>
> Hi, @Richard. Is this patch Ok for trunk if this patch fixes regression for 
> both RVV
> and ARM SVE.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/pr84512.c: Remove vect_variable_length XFAIL.
> * gcc.dg/vect/bb-slp-43.c: Ditto.
> * gcc.dg/vect/bb-slp-subgroups-3.c: Ditto.
>
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr84512.c| 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-43.c  | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c 
> b/gcc/testsuite/gcc.dg/tree-
> ssa/pr84512.c
> index 496c78b28dc..3c027012670 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> @@ -13,4 +13,4 @@ int foo()
>  }
>
>  /* Listed targets xfailed due to PR84958.  */
> -/* { dg-final { scan-tree-dump "ret

RE: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some tests

2023-12-19 Thread Tamar Christina
Hi Juzhe,

> -Original Message-
> From: Juzhe-Zhong 
> Sent: Tuesday, December 19, 2023 11:19 AM
> To: gcc-patches@gcc.gnu.org
> Cc: rguent...@suse.de; Tamar Christina ; Juzhe-
> Zhong 
> Subject: [PATCH] Regression FIX: Remove vect_variable_length XFAIL from some
> tests
> 
> Hi, this patch fixes these following regression FAILs on RVV:
> 
> XPASS: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> XPASS: gcc.dg/vect/bb-slp-43.c -flto -ffat-lto-objects  scan-tree-dump-not 
> slp2
> "vector operands from scalars"
> XPASS: gcc.dg/vect/bb-slp-43.c scan-tree-dump-not slp2 "vector operands from
> scalars"
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c -flto -ffat-lto-objects  
> scan-tree-dump-
> times slp2 "optimized: basic block" 2
> XPASS: gcc.dg/vect/bb-slp-subgroups-3.c scan-tree-dump-times slp2 "optimized:
> basic block" 2
> 
> Since vect_variable_length are available for ARM SVE and RVV, I just use 
> compiler
> explorer to confirm ARM SVE same as
> RVV.
> 
> Hi, @Tamar. Could you double check whether this patch fix is reasonable to 
> you ?
> 

Hmm I would be surprised if this is working correctly for RVV since as far as I 
know we don't have
variable length support in SLP, i.e. SLP can't predicate operations during build
so the
current vectorizer only supports fixed length vector SLP, unless Richi did some 
magic?

For SVE the reason this XPASSes is because the compiler will fall back to NEON
unless it's
told it can't.  But that's not actually testing VLA SLP.

i.e. https://godbolt.org/z/5n5fWahxh  just using `+sve` isn't enough and it has 
to be told
it can only use SVE.  Is it perhaps something similar for RVV?

If RVV has a similar param, perhaps the correct fix is to append it to the 
tests so they
XFAIL correctly?

Regards,
Tamar

> And.
> 
> Hi, @Richard. Is this patch Ok for trunk if this patch fixes regression for 
> both RVV
> and ARM SVE.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/tree-ssa/pr84512.c: Remove vect_variable_length XFAIL.
>   * gcc.dg/vect/bb-slp-43.c: Ditto.
>   * gcc.dg/vect/bb-slp-subgroups-3.c: Ditto.
> 
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr84512.c| 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-43.c  | 2 +-
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c 
> b/gcc/testsuite/gcc.dg/tree-
> ssa/pr84512.c
> index 496c78b28dc..3c027012670 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
> @@ -13,4 +13,4 @@ int foo()
>  }
> 
>  /* Listed targets xfailed due to PR84958.  */
> -/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { 
> amdgcn*-*-* ||
> vect_variable_length } } } } */
> +/* { dg-final { scan-tree-dump "return 285;" "optimized" { xfail { 
> amdgcn*-*-* } } }
> } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c 
> b/gcc/testsuite/gcc.dg/vect/bb-
> slp-43.c
> index dad2d24262d..40bd2e0dfbf 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-43.c
> @@ -14,4 +14,4 @@ f (int *restrict x, short *restrict y)
>  }
> 
>  /* { dg-final { scan-tree-dump-not "mixed mask and nonmask" "slp2" } } */
> -/* { dg-final { scan-tree-dump-not "vector operands from scalars" "slp2" { 
> target {
> { vect_int && vect_bool_cmp } && { vect_unpack && vect_hw_misalign } } xfail {
> vect_variable_length && { ! vect256 } } } } } */
> +/* { dg-final { scan-tree-dump-not "vector operands from scalars" "slp2" { 
> target {
> { vect_int && vect_bool_cmp } && { vect_unpack && vect_hw_misalign } } } } } 
> */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> index fb719915db7..3f0d45ce4a1 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -42,7 +42,7 @@ main (int argc, char **argv)
>  /* Because we disable the cost model, targets with variable-length
> vectors can end up vectorizing the store to a[0..7] on its own.
> With the cost model we do something sensible.  */
> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 2 "slp2" { 
> target { !
> amdgcn-*-* } xfail vect_variable_length } } } */
> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 2 "slp2" { 
> target { !
> amdgcn-*-* } } } } */
> 
>  /* amdgcn can do this in one vector.  */
>  /* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp2" { 
> target
> amdgcn-*-* } } } */
> --
> 2.36.3



[PATCH]middle-end: Handle hybrid SLP induction vectorization with early breaks.

2023-12-19 Thread Tamar Christina
Hi All,

While we don't support SLP for early break vectorization, we
can land in the situation where the induction was vectorized
through hybrid SLP.  This means when vectorizing the early
break live operation we need to get the results of the SLP
operation.


Bootstrapped Regtested on aarch64-none-linux-gnu,
x86_64-pc-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-vect-loop.cc (vectorizable_live_operation): Handle SLP.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_82.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
new file mode 100644
index 
..f2a6d640f9c0c381cc2af09bd824e272bcfee0b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include 
+
+#define N 1024
+complex double vect_a[N];
+complex double vect_b[N];
+  
+complex double test4(complex double x, complex double t)
+{
+ complex double ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_a[i] = t + i;
+   if (vect_a[i] == x)
+ return i;
+   vect_a[i] += x * vect_a[i];
+   
+ }
+ return ret;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 
85b81d30c5ab869cb1f7323caabd9fe4648bdc50..0993d184afe068784474ac225768d9f38d76c040
 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10856,8 +10856,8 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   bitsize = vector_element_bits_tree (vectype);
 
   /* Get the vectorized lhs of STMT and the lane to use (counted in bits).  */
-  tree vec_lhs, bitstart;
-  gimple *vec_stmt;
+  tree vec_lhs, vec_lhs0, bitstart;
+  gimple *vec_stmt, *vec_stmt0;
   if (slp_node)
 {
   gcc_assert (!loop_vinfo
@@ -10868,6 +10868,10 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
   vec_stmt = SSA_NAME_DEF_STMT (vec_lhs);
 
+  /* In case we need to early break vectorize also get the first stmt.  */
+  vec_lhs0 = SLP_TREE_VEC_DEFS (slp_node)[0];
+  vec_stmt0 = SSA_NAME_DEF_STMT (vec_lhs0);
+
   /* Get entry to use.  */
   bitstart = bitsize_int (vec_index);
   bitstart = int_const_binop (MULT_EXPR, bitsize, bitstart);
@@ -10878,6 +10882,10 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info).last ();
   vec_lhs = gimple_get_lhs (vec_stmt);
 
+  /* In case we need to early break vectorize also get the first stmt.  */
+  vec_stmt0 = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+  vec_lhs0 = gimple_get_lhs (vec_stmt0);
+
   /* Get the last lane in the vector.  */
   bitstart = int_const_binop (MULT_EXPR, bitsize, bitsize_int (nunits - 
1));
 }
@@ -10917,7 +10925,6 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
 so use ->src.  For main exit the merge block is the
 destination.  */
  basic_block dest = main_exit_edge ? main_e->dest : e->src;
- gimple *tmp_vec_stmt = vec_stmt;
  tree tmp_vec_lhs = vec_lhs;
  tree tmp_bitstart = bitstart;
 
@@ -10928,8 +10935,7 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
  if (restart_loop
  && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def)
{
- tmp_vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
- tmp_vec_lhs = gimple_get_lhs (tmp_vec_stmt);
+ tmp_vec_lhs = vec_lhs0;
  tmp_bitstart = build_zero_cst (TREE_TYPE (bitstart));
}
 




-- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
new file mode 100644
index 
..f2a6d640f9c0c381cc2af09bd824e272bcfee0b8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_82.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-additional-options "-Ofast" } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+#include 
+
+#define N 1024
+complex double vect_a[N];
+complex double vect_b[N];
+  
+complex double test4(complex double x, complex double t)
+{
+ complex double ret = 0;
+ for (int i = 0; i < N; i++)
+ {
+   vect_a[i] = t + i;
+   if (vect_a[i] == x)
+ return i;
+   vect_a[i] += x * vect_a[i];
+   
+ }
+ return ret;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 

  1   2   3   4   5   6   7   8   9   10   >