[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-28 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #17 from JuzheZhong  ---
(In reply to Tamar Christina from comment #16)
> > 
> > I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index =
> > 0.
> 
> Perhaps, I'll look into it thanks. though this is ofcourse only applicable
> when the mask comes from whilelo.
> 
> In the future when we get to loops such as:
> 
> for (int i = ..;;)
> {
>   if (a)
> {
>   
>   if (b)
> return i;
> }
> }
> 
> the reduction would come from the first active element of the mask created
> by the condition a and not the whilelo.

If the mask comes from a condition, VEC_EXTRACT approach is definitely
incorrect.

However, look into vectorizable_live_operation_1:
The mask should always come from whilo instruction (or say it is always loop
mask):

  tree mask = vect_get_loop_mask (loop_vinfo, ,
  _VINFO_MASKS (loop_vinfo),
  1, vectype, 0);

So I think it should be correct using VEC_EXTRACT with index = 0.

But if we look into vectorizable_early_break which will handle mask come from
condition, that is another story.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #16 from Tamar Christina  ---
> 
> I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index =
> 0.

Perhaps, I'll look into it thanks. though this is ofcourse only applicable when
the mask comes from whilelo.

In the future when we get to loops such as:

for (int i = ..;;)
{
  if (a)
{
  
  if (b)
return i;
}
}

the reduction would come from the first active element of the mask created by
the condition a and not the whilelo.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #15 from Tamar Christina  ---
(In reply to JuzheZhong from comment #14)
> > > > sure, but you can't use BIT_FIELD_REF on VLA vectors.
> > > 
> > > So, for length partial vector. We can use VEC_EXTRACT with index = 0 since
> > > VEC_EXTRACT optab allows VLA vectors now for length target.
> > 
> > Sounds good :)
> 
> I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index =
> 0.
> 
> I guess the only issue is that when mask = all zero. That is, there is no
> active elements, What behavior should be here for early break ?

That shouldn't happen, in that case you wouldn't have entered the loop. To
prevent this there's always a compare of n > 0 at the start of the loops to
skip the vector body entirely.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #14 from JuzheZhong  ---
(In reply to Tamar Christina from comment #13)
> (In reply to JuzheZhong from comment #12)
> > (In reply to Tamar Christina from comment #11)
> > > (In reply to JuzheZhong from comment #10)
> > > > (In reply to Tamar Christina from comment #9)
> > > > > (In reply to JuzheZhong from comment #8)
> > > > > > Suppose the loop mask is generated by whilelo instruction of ARM 
> > > > > > SVE.
> > > > > > 
> > > > > > Suppose we have 8 elements in a single whole vector.
> > > > > > 
> > > > > > mask = whilo (0, res) if res = 6, then mask = 1000.
> > > > > > data = 12345678
> > > > > > 
> > > > > > Then if it is early break. You are reversing both data and mask as 
> > > > > > follows:
> > > > > > 
> > > > > > new_mask = 0001
> > > > > > new_data = 87654321
> > > > > > 
> > > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break.
> > > > > > 
> > > > > > Am I right ?
> > > > > 
> > > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do 
> > > > > any
> > > > > side effects that we couldn't apply.
> > > > > 
> > > > > We went with this approach first because it works for non-masked
> > > > > architectures too. In GCC-15 we'll try to implement staying entirely 
> > > > > inside
> > > > > a vector loop by splitting the mask in elements until first active and
> > > > > element from first active so we can correctly mask the operations.
> > > > 
> > > > Ok. For the current approach. Isn't it the first element is always 
> > > > element 0
> > > > ?
> > > > 
> > > > Since for ARM SVE loop mask is generated by whilelo instructions, it 
> > > > always
> > > > set
> > > > mask bit from 0 to the last active element - 1.
> > > 
> > > sure, but you can't use BIT_FIELD_REF on VLA vectors.
> > 
> > So, for length partial vector. We can use VEC_EXTRACT with index = 0 since
> > VEC_EXTRACT optab allows VLA vectors now for length target.
> 
> Sounds good :)

I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index = 0.

I guess the only issue is that when mask = all zero. That is, there is no
active elements, What behavior should be here for early break ?

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #13 from Tamar Christina  ---
(In reply to JuzheZhong from comment #12)
> (In reply to Tamar Christina from comment #11)
> > (In reply to JuzheZhong from comment #10)
> > > (In reply to Tamar Christina from comment #9)
> > > > (In reply to JuzheZhong from comment #8)
> > > > > Suppose the loop mask is generated by whilelo instruction of ARM SVE.
> > > > > 
> > > > > Suppose we have 8 elements in a single whole vector.
> > > > > 
> > > > > mask = whilo (0, res) if res = 6, then mask = 1000.
> > > > > data = 12345678
> > > > > 
> > > > > Then if it is early break. You are reversing both data and mask as 
> > > > > follows:
> > > > > 
> > > > > new_mask = 0001
> > > > > new_data = 87654321
> > > > > 
> > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break.
> > > > > 
> > > > > Am I right ?
> > > > 
> > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any
> > > > side effects that we couldn't apply.
> > > > 
> > > > We went with this approach first because it works for non-masked
> > > > architectures too. In GCC-15 we'll try to implement staying entirely 
> > > > inside
> > > > a vector loop by splitting the mask in elements until first active and
> > > > element from first active so we can correctly mask the operations.
> > > 
> > > Ok. For the current approach. Isn't it the first element is always 
> > > element 0
> > > ?
> > > 
> > > Since for ARM SVE loop mask is generated by whilelo instructions, it 
> > > always
> > > set
> > > mask bit from 0 to the last active element - 1.
> > 
> > sure, but you can't use BIT_FIELD_REF on VLA vectors.
> 
> So, for length partial vector. We can use VEC_EXTRACT with index = 0 since
> VEC_EXTRACT optab allows VLA vectors now for length target.

Sounds good :)

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #12 from JuzheZhong  ---
(In reply to Tamar Christina from comment #11)
> (In reply to JuzheZhong from comment #10)
> > (In reply to Tamar Christina from comment #9)
> > > (In reply to JuzheZhong from comment #8)
> > > > Suppose the loop mask is generated by whilelo instruction of ARM SVE.
> > > > 
> > > > Suppose we have 8 elements in a single whole vector.
> > > > 
> > > > mask = whilo (0, res) if res = 6, then mask = 1000.
> > > > data = 12345678
> > > > 
> > > > Then if it is early break. You are reversing both data and mask as 
> > > > follows:
> > > > 
> > > > new_mask = 0001
> > > > new_data = 87654321
> > > > 
> > > > Then use the EXTRACT_LAST, we will get value = 1 for early break.
> > > > 
> > > > Am I right ?
> > > 
> > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any
> > > side effects that we couldn't apply.
> > > 
> > > We went with this approach first because it works for non-masked
> > > architectures too. In GCC-15 we'll try to implement staying entirely 
> > > inside
> > > a vector loop by splitting the mask in elements until first active and
> > > element from first active so we can correctly mask the operations.
> > 
> > Ok. For the current approach. Isn't it the first element is always element 0
> > ?
> > 
> > Since for ARM SVE loop mask is generated by whilelo instructions, it always
> > set
> > mask bit from 0 to the last active element - 1.
> 
> sure, but you can't use BIT_FIELD_REF on VLA vectors.

So, for length partial vector. We can use VEC_EXTRACT with index = 0 since
VEC_EXTRACT optab allows VLA vectors now for length target.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #11 from Tamar Christina  ---
(In reply to JuzheZhong from comment #10)
> (In reply to Tamar Christina from comment #9)
> > (In reply to JuzheZhong from comment #8)
> > > Suppose the loop mask is generated by whilelo instruction of ARM SVE.
> > > 
> > > Suppose we have 8 elements in a single whole vector.
> > > 
> > > mask = whilo (0, res) if res = 6, then mask = 1000.
> > > data = 12345678
> > > 
> > > Then if it is early break. You are reversing both data and mask as 
> > > follows:
> > > 
> > > new_mask = 0001
> > > new_data = 87654321
> > > 
> > > Then use the EXTRACT_LAST, we will get value = 1 for early break.
> > > 
> > > Am I right ?
> > 
> > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any
> > side effects that we couldn't apply.
> > 
> > We went with this approach first because it works for non-masked
> > architectures too. In GCC-15 we'll try to implement staying entirely inside
> > a vector loop by splitting the mask in elements until first active and
> > element from first active so we can correctly mask the operations.
> 
> Ok. For the current approach. Isn't it the first element is always element 0
> ?
> 
> Since for ARM SVE loop mask is generated by whilelo instructions, it always
> set
> mask bit from 0 to the last active element - 1.

sure, but you can't use BIT_FIELD_REF on VLA vectors.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #10 from JuzheZhong  ---
(In reply to Tamar Christina from comment #9)
> (In reply to JuzheZhong from comment #8)
> > Suppose the loop mask is generated by whilelo instruction of ARM SVE.
> > 
> > Suppose we have 8 elements in a single whole vector.
> > 
> > mask = whilo (0, res) if res = 6, then mask = 1000.
> > data = 12345678
> > 
> > Then if it is early break. You are reversing both data and mask as follows:
> > 
> > new_mask = 0001
> > new_data = 87654321
> > 
> > Then use the EXTRACT_LAST, we will get value = 1 for early break.
> > 
> > Am I right ?
> 
> Yeah, the idea being the scalar loop will then run from 1 to 6 to do any
> side effects that we couldn't apply.
> 
> We went with this approach first because it works for non-masked
> architectures too. In GCC-15 we'll try to implement staying entirely inside
> a vector loop by splitting the mask in elements until first active and
> element from first active so we can correctly mask the operations.

Ok. For the current approach. Isn't it the first element is always element 0 ?

Since for ARM SVE loop mask is generated by whilelo instructions, it always set
mask bit from 0 to the last active element - 1.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #9 from Tamar Christina  ---
(In reply to JuzheZhong from comment #8)
> Suppose the loop mask is generated by whilelo instruction of ARM SVE.
> 
> Suppose we have 8 elements in a single whole vector.
> 
> mask = whilo (0, res) if res = 6, then mask = 1000.
> data = 12345678
> 
> Then if it is early break. You are reversing both data and mask as follows:
> 
> new_mask = 0001
> new_data = 87654321
> 
> Then use the EXTRACT_LAST, we will get value = 1 for early break.
> 
> Am I right ?

Yeah, the idea being the scalar loop will then run from 1 to 6 to do any side
effects that we couldn't apply.

We went with this approach first because it works for non-masked architectures
too. In GCC-15 we'll try to implement staying entirely inside a vector loop by
splitting the mask in elements until first active and element from first active
so we can correctly mask the operations.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #8 from JuzheZhong  ---
(In reply to Tamar Christina from comment #7)
> You may be able to use the same approach as
> 
>   else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> 
> that is, reverse both the mask and the vector and using extract last.
> It's not going to be performance critical so it's more important to be
> correct rather than fast.

I just carefully read this code:

  /* For an inverted control flow with early breaks we want EXTRACT_FIRST
 instead of EXTRACT_LAST.  Emulate by reversing the vector and mask. */
  if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
{
  /* First create the permuted mask.  */
  tree perm_mask = perm_mask_for_reverse (TREE_TYPE (mask));
  tree perm_dest = copy_ssa_name (mask);
  gimple *perm_stmt
= gimple_build_assign (perm_dest, VEC_PERM_EXPR, mask,
   mask, perm_mask);
  vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
   );
  mask = perm_dest;

  /* Then permute the vector contents.  */
  tree perm_elem = perm_mask_for_reverse (vectype);
  perm_dest = copy_ssa_name (vec_lhs_phi);
  perm_stmt
= gimple_build_assign (perm_dest, VEC_PERM_EXPR, vec_lhs_phi,
   vec_lhs_phi, perm_elem);
  vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
   );
  vec_lhs_phi = perm_dest;
}


Suppose the loop mask is generated by whilelo instruction of ARM SVE.

Suppose we have 8 elements in a single whole vector.

mask = whilo (0, res) if res = 6, then mask = 1000.
data = 12345678

Then if it is early break. You are reversing both data and mask as follows:

new_mask = 0001
new_data = 87654321

Then use the EXTRACT_LAST, we will get value = 1 for early break.

Am I right ?

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #7 from Tamar Christina  ---
You may be able to use the same approach as

  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))

that is, reverse both the mask and the vector and using extract last.
It's not going to be performance critical so it's more important to be correct
rather than fast.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #6 from Tamar Christina  ---
(In reply to JuzheZhong from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to JuzheZhong from comment #3)
> > > I guess this code is just disabling partial vector for length for now.
> > > 
> > > And need me to test and port this part for length in the followup patches.
> > > 
> > > Am I right ?
> > 
> > Yeah, it needed to safely not allow it through for now. Once implemented 
> > you'll hit an assert in vectorizable_live_operations where you need to
> > provide a way to also get the first active element from a vector.
> 
> So for a length target, I enable cbranch optab but no vcond_mask_len optab.
> Will it behavior wrong ?
> 

You need both, if the operation requires a mask it'll reject it without
vcond_mask_len support.  Because I didn't know how to extract first element
using vcond_mask_len I had to disable it.

> Another question is could you give me more hints about
> vectorizable_live_operation?
> 
> I thought vectorizable_live_operation is doing extract last active element,
> I didn't see extract first active element.

Normally yes, but I added extract first active element for this patch.  This is
because when you hit and take an early exit we restrart the vector iteration
since there may be partial effects to perform between where the loop started
and where the element is found.

specifically look at vectorizable_live_operation_1 there's an assert under 
  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))

with a comment saying what's needed.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #5 from JuzheZhong  ---
(In reply to Tamar Christina from comment #4)
> (In reply to JuzheZhong from comment #3)
> > I guess this code is just disabling partial vector for length for now.
> > 
> > And need me to test and port this part for length in the followup patches.
> > 
> > Am I right ?
> 
> Yeah, it needed to safely not allow it through for now. Once implemented 
> you'll hit an assert in vectorizable_live_operations where you need to
> provide a way to also get the first active element from a vector.

So for a length target, I enable cbranch optab but no vcond_mask_len optab.
Will it behavior wrong ?

Another question is could you give me more hints about
vectorizable_live_operation?

I thought vectorizable_live_operation is doing extract last active element,
I didn't see extract first active element.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #4 from Tamar Christina  ---
(In reply to JuzheZhong from comment #3)
> I guess this code is just disabling partial vector for length for now.
> 
> And need me to test and port this part for length in the followup patches.
> 
> Am I right ?

Yeah, it needed to safely not allow it through for now. Once implemented 
you'll hit an assert in vectorizable_live_operations where you need to provide
a way to also get the first active element from a vector.

[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects

2023-12-27 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134

--- Comment #3 from JuzheZhong  ---
Thanks Tamar. And I agree with you. This is not supposed to let early break to
handle that.  It should be optimization on scalar IR instead of loop
vectorizer.

I believe it is GCC-15 topic.

Btw, I am trying to enable early break on RISC-V port. But I failed to do that
since this following codes look quite to length target:

  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
{
  if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype,
  OPTIMIZE_FOR_SPEED))
return false;
  else
vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL);
}

I guess this code is just disabling partial vector for length for now.

And need me to test and port this part for length in the followup patches.

Am I right ?