[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #17 from JuzheZhong --- (In reply to Tamar Christina from comment #16) > > > > I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index = > > 0. > > Perhaps, I'll look into it thanks. though this is ofcourse only applicable > when the mask comes from whilelo. > > In the future when we get to loops such as: > > for (int i = ..;;) > { > if (a) > { > > if (b) > return i; > } > } > > the reduction would come from the first active element of the mask created > by the condition a and not the whilelo. If the mask comes from a condition, VEC_EXTRACT approach is definitely incorrect. However, look into vectorizable_live_operation_1: The mask should always come from whilo instruction (or say it is always loop mask): tree mask = vect_get_loop_mask (loop_vinfo, , _VINFO_MASKS (loop_vinfo), 1, vectype, 0); So I think it should be correct using VEC_EXTRACT with index = 0. But if we look into vectorizable_early_break which will handle mask come from condition, that is another story.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #16 from Tamar Christina --- > > I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index = > 0. Perhaps, I'll look into it thanks. though this is ofcourse only applicable when the mask comes from whilelo. In the future when we get to loops such as: for (int i = ..;;) { if (a) { if (b) return i; } } the reduction would come from the first active element of the mask created by the condition a and not the whilelo.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #15 from Tamar Christina --- (In reply to JuzheZhong from comment #14) > > > > sure, but you can't use BIT_FIELD_REF on VLA vectors. > > > > > > So, for length partial vector. We can use VEC_EXTRACT with index = 0 since > > > VEC_EXTRACT optab allows VLA vectors now for length target. > > > > Sounds good :) > > I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index = > 0. > > I guess the only issue is that when mask = all zero. That is, there is no > active elements, What behavior should be here for early break ? That shouldn't happen, in that case you wouldn't have entered the loop. To prevent this there's always a compare of n > 0 at the start of the loops to skip the vector body entirely.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #14 from JuzheZhong --- (In reply to Tamar Christina from comment #13) > (In reply to JuzheZhong from comment #12) > > (In reply to Tamar Christina from comment #11) > > > (In reply to JuzheZhong from comment #10) > > > > (In reply to Tamar Christina from comment #9) > > > > > (In reply to JuzheZhong from comment #8) > > > > > > Suppose the loop mask is generated by whilelo instruction of ARM > > > > > > SVE. > > > > > > > > > > > > Suppose we have 8 elements in a single whole vector. > > > > > > > > > > > > mask = whilo (0, res) if res = 6, then mask = 1000. > > > > > > data = 12345678 > > > > > > > > > > > > Then if it is early break. You are reversing both data and mask as > > > > > > follows: > > > > > > > > > > > > new_mask = 0001 > > > > > > new_data = 87654321 > > > > > > > > > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > > > > > > > > > > > Am I right ? > > > > > > > > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do > > > > > any > > > > > side effects that we couldn't apply. > > > > > > > > > > We went with this approach first because it works for non-masked > > > > > architectures too. In GCC-15 we'll try to implement staying entirely > > > > > inside > > > > > a vector loop by splitting the mask in elements until first active and > > > > > element from first active so we can correctly mask the operations. > > > > > > > > Ok. For the current approach. Isn't it the first element is always > > > > element 0 > > > > ? > > > > > > > > Since for ARM SVE loop mask is generated by whilelo instructions, it > > > > always > > > > set > > > > mask bit from 0 to the last active element - 1. > > > > > > sure, but you can't use BIT_FIELD_REF on VLA vectors. > > > > So, for length partial vector. We can use VEC_EXTRACT with index = 0 since > > VEC_EXTRACT optab allows VLA vectors now for length target. > > Sounds good :) I wonder whether ARM SVE can also use this approach VEC_EXTRACT with index = 0. I guess the only issue is that when mask = all zero. That is, there is no active elements, What behavior should be here for early break ?
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #13 from Tamar Christina --- (In reply to JuzheZhong from comment #12) > (In reply to Tamar Christina from comment #11) > > (In reply to JuzheZhong from comment #10) > > > (In reply to Tamar Christina from comment #9) > > > > (In reply to JuzheZhong from comment #8) > > > > > Suppose the loop mask is generated by whilelo instruction of ARM SVE. > > > > > > > > > > Suppose we have 8 elements in a single whole vector. > > > > > > > > > > mask = whilo (0, res) if res = 6, then mask = 1000. > > > > > data = 12345678 > > > > > > > > > > Then if it is early break. You are reversing both data and mask as > > > > > follows: > > > > > > > > > > new_mask = 0001 > > > > > new_data = 87654321 > > > > > > > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > > > > > > > > > Am I right ? > > > > > > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any > > > > side effects that we couldn't apply. > > > > > > > > We went with this approach first because it works for non-masked > > > > architectures too. In GCC-15 we'll try to implement staying entirely > > > > inside > > > > a vector loop by splitting the mask in elements until first active and > > > > element from first active so we can correctly mask the operations. > > > > > > Ok. For the current approach. Isn't it the first element is always > > > element 0 > > > ? > > > > > > Since for ARM SVE loop mask is generated by whilelo instructions, it > > > always > > > set > > > mask bit from 0 to the last active element - 1. > > > > sure, but you can't use BIT_FIELD_REF on VLA vectors. > > So, for length partial vector. We can use VEC_EXTRACT with index = 0 since > VEC_EXTRACT optab allows VLA vectors now for length target. Sounds good :)
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #12 from JuzheZhong --- (In reply to Tamar Christina from comment #11) > (In reply to JuzheZhong from comment #10) > > (In reply to Tamar Christina from comment #9) > > > (In reply to JuzheZhong from comment #8) > > > > Suppose the loop mask is generated by whilelo instruction of ARM SVE. > > > > > > > > Suppose we have 8 elements in a single whole vector. > > > > > > > > mask = whilo (0, res) if res = 6, then mask = 1000. > > > > data = 12345678 > > > > > > > > Then if it is early break. You are reversing both data and mask as > > > > follows: > > > > > > > > new_mask = 0001 > > > > new_data = 87654321 > > > > > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > > > > > > > Am I right ? > > > > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any > > > side effects that we couldn't apply. > > > > > > We went with this approach first because it works for non-masked > > > architectures too. In GCC-15 we'll try to implement staying entirely > > > inside > > > a vector loop by splitting the mask in elements until first active and > > > element from first active so we can correctly mask the operations. > > > > Ok. For the current approach. Isn't it the first element is always element 0 > > ? > > > > Since for ARM SVE loop mask is generated by whilelo instructions, it always > > set > > mask bit from 0 to the last active element - 1. > > sure, but you can't use BIT_FIELD_REF on VLA vectors. So, for length partial vector. We can use VEC_EXTRACT with index = 0 since VEC_EXTRACT optab allows VLA vectors now for length target.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #11 from Tamar Christina --- (In reply to JuzheZhong from comment #10) > (In reply to Tamar Christina from comment #9) > > (In reply to JuzheZhong from comment #8) > > > Suppose the loop mask is generated by whilelo instruction of ARM SVE. > > > > > > Suppose we have 8 elements in a single whole vector. > > > > > > mask = whilo (0, res) if res = 6, then mask = 1000. > > > data = 12345678 > > > > > > Then if it is early break. You are reversing both data and mask as > > > follows: > > > > > > new_mask = 0001 > > > new_data = 87654321 > > > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > > > > > Am I right ? > > > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any > > side effects that we couldn't apply. > > > > We went with this approach first because it works for non-masked > > architectures too. In GCC-15 we'll try to implement staying entirely inside > > a vector loop by splitting the mask in elements until first active and > > element from first active so we can correctly mask the operations. > > Ok. For the current approach. Isn't it the first element is always element 0 > ? > > Since for ARM SVE loop mask is generated by whilelo instructions, it always > set > mask bit from 0 to the last active element - 1. sure, but you can't use BIT_FIELD_REF on VLA vectors.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #10 from JuzheZhong --- (In reply to Tamar Christina from comment #9) > (In reply to JuzheZhong from comment #8) > > Suppose the loop mask is generated by whilelo instruction of ARM SVE. > > > > Suppose we have 8 elements in a single whole vector. > > > > mask = whilo (0, res) if res = 6, then mask = 1000. > > data = 12345678 > > > > Then if it is early break. You are reversing both data and mask as follows: > > > > new_mask = 0001 > > new_data = 87654321 > > > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > > > Am I right ? > > Yeah, the idea being the scalar loop will then run from 1 to 6 to do any > side effects that we couldn't apply. > > We went with this approach first because it works for non-masked > architectures too. In GCC-15 we'll try to implement staying entirely inside > a vector loop by splitting the mask in elements until first active and > element from first active so we can correctly mask the operations. Ok. For the current approach. Isn't it the first element is always element 0 ? Since for ARM SVE loop mask is generated by whilelo instructions, it always set mask bit from 0 to the last active element - 1.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #9 from Tamar Christina --- (In reply to JuzheZhong from comment #8) > Suppose the loop mask is generated by whilelo instruction of ARM SVE. > > Suppose we have 8 elements in a single whole vector. > > mask = whilo (0, res) if res = 6, then mask = 1000. > data = 12345678 > > Then if it is early break. You are reversing both data and mask as follows: > > new_mask = 0001 > new_data = 87654321 > > Then use the EXTRACT_LAST, we will get value = 1 for early break. > > Am I right ? Yeah, the idea being the scalar loop will then run from 1 to 6 to do any side effects that we couldn't apply. We went with this approach first because it works for non-masked architectures too. In GCC-15 we'll try to implement staying entirely inside a vector loop by splitting the mask in elements until first active and element from first active so we can correctly mask the operations.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #8 from JuzheZhong --- (In reply to Tamar Christina from comment #7) > You may be able to use the same approach as > > else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > > that is, reverse both the mask and the vector and using extract last. > It's not going to be performance critical so it's more important to be > correct rather than fast. I just carefully read this code: /* For an inverted control flow with early breaks we want EXTRACT_FIRST instead of EXTRACT_LAST. Emulate by reversing the vector and mask. */ if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)) { /* First create the permuted mask. */ tree perm_mask = perm_mask_for_reverse (TREE_TYPE (mask)); tree perm_dest = copy_ssa_name (mask); gimple *perm_stmt = gimple_build_assign (perm_dest, VEC_PERM_EXPR, mask, mask, perm_mask); vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt, ); mask = perm_dest; /* Then permute the vector contents. */ tree perm_elem = perm_mask_for_reverse (vectype); perm_dest = copy_ssa_name (vec_lhs_phi); perm_stmt = gimple_build_assign (perm_dest, VEC_PERM_EXPR, vec_lhs_phi, vec_lhs_phi, perm_elem); vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt, ); vec_lhs_phi = perm_dest; } Suppose the loop mask is generated by whilelo instruction of ARM SVE. Suppose we have 8 elements in a single whole vector. mask = whilo (0, res) if res = 6, then mask = 1000. data = 12345678 Then if it is early break. You are reversing both data and mask as follows: new_mask = 0001 new_data = 87654321 Then use the EXTRACT_LAST, we will get value = 1 for early break. Am I right ?
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #7 from Tamar Christina --- You may be able to use the same approach as else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) that is, reverse both the mask and the vector and using extract last. It's not going to be performance critical so it's more important to be correct rather than fast.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #6 from Tamar Christina --- (In reply to JuzheZhong from comment #5) > (In reply to Tamar Christina from comment #4) > > (In reply to JuzheZhong from comment #3) > > > I guess this code is just disabling partial vector for length for now. > > > > > > And need me to test and port this part for length in the followup patches. > > > > > > Am I right ? > > > > Yeah, it needed to safely not allow it through for now. Once implemented > > you'll hit an assert in vectorizable_live_operations where you need to > > provide a way to also get the first active element from a vector. > > So for a length target, I enable cbranch optab but no vcond_mask_len optab. > Will it behavior wrong ? > You need both, if the operation requires a mask it'll reject it without vcond_mask_len support. Because I didn't know how to extract first element using vcond_mask_len I had to disable it. > Another question is could you give me more hints about > vectorizable_live_operation? > > I thought vectorizable_live_operation is doing extract last active element, > I didn't see extract first active element. Normally yes, but I added extract first active element for this patch. This is because when you hit and take an early exit we restrart the vector iteration since there may be partial effects to perform between where the loop started and where the element is found. specifically look at vectorizable_live_operation_1 there's an assert under if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) with a comment saying what's needed.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #5 from JuzheZhong --- (In reply to Tamar Christina from comment #4) > (In reply to JuzheZhong from comment #3) > > I guess this code is just disabling partial vector for length for now. > > > > And need me to test and port this part for length in the followup patches. > > > > Am I right ? > > Yeah, it needed to safely not allow it through for now. Once implemented > you'll hit an assert in vectorizable_live_operations where you need to > provide a way to also get the first active element from a vector. So for a length target, I enable cbranch optab but no vcond_mask_len optab. Will it behavior wrong ? Another question is could you give me more hints about vectorizable_live_operation? I thought vectorizable_live_operation is doing extract last active element, I didn't see extract first active element.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #4 from Tamar Christina --- (In reply to JuzheZhong from comment #3) > I guess this code is just disabling partial vector for length for now. > > And need me to test and port this part for length in the followup patches. > > Am I right ? Yeah, it needed to safely not allow it through for now. Once implemented you'll hit an assert in vectorizable_live_operations where you need to provide a way to also get the first active element from a vector.
[Bug c/113134] gcc does not version loops with early break conditions that don't have side-effects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113134 --- Comment #3 from JuzheZhong --- Thanks Tamar. And I agree with you. This is not supposed to let early break to handle that. It should be optimization on scalar IR instead of loop vectorizer. I believe it is GCC-15 topic. Btw, I am trying to enable early break on RISC-V port. But I failed to do that since this following codes look quite to length target: if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) { if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype, OPTIMIZE_FOR_SPEED)) return false; else vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL); } I guess this code is just disabling partial vector for length for now. And need me to test and port this part for length in the followup patches. Am I right ?