Re: [committed][vect]PR 88915: Vectorize epilogues when versioning loops (was Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops)
On Tue, 29 Oct 2019, Andre Vieira (lists) wrote: > Hi richi, > > > Committed the patch with the suggested change in revision r277569. > > I will look at the follow-up, from what I understood the idea was to add a > field to the dr_vec_info where we keep the "extra-offset" that represents the > position in the DR wrt the current loop/epilogue. So when we advance the DR, > we change that field and not the DR_OFFSET? That way we always keep the > "original state". Yes. That would apply to the adjustment done by prologue peeling as well. Still wonder how gather/scatter get by not having any adjustment ;) Ah, I guess they simply are not affected by peeling since they get their DR_OFFSET & friends computed in the loop (loaded from memory). But DR_REF needs adjustment because it tries to walk def-stmts (repeatedly in both analysis and transform phase). Richard. > Cheers, > Andre > > gcc/ChangeLog: > 2019-10-29 Andre Vieira > > PR 88915 > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. > (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and call update_epilogue_loop_vinfo to update it. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. > (vect_do_peeling, vect_update_inits_of_drs, > determine_peel_for_niter, vect_analyze_loop): Add or update declarations. > * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > created loop_vec_info's for epilogues when available. Otherwise > analyse > epilogue separately. > > > > Cheers, > Andre > > On 29/10/2019 11:48, Richard Biener wrote: > > On Mon, 28 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi, > >> > >> Reworked according to your comments, see inline for clarification. > >> > >> Is this OK for trunk? > > > > + gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo); > > + while (seq) > > + { > > + stmt_worklist.safe_push (seq); > > + seq = seq->next; > > + } > > > > you're supposed to do to the following, not access the ->next > > implementation detail: > > > > for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi); > > gsi_next (&gsi)) > >stmt_worklist.safe_push (gsi_stmt (gsi)); > > > > > > + /* Data references for gather loads and scatter stores do not use > > the > > +updated offset we set using ADVANCE. Instead we have to make > > sure > > the > > +reference in the data references point to the corresponding copy > > of > > +the original in the epilogue. */ > > + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) > > + { > > + DR_REF (dr) > > + = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE, > > +&find_in_mapping, &mapping); > > + DR_BASE_ADDRESS (dr) > > + = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE, > > NULL_TREE, > > +&find_in_mapping, &mapping); > > + } > > > > Hmm. So for other DRs we account for the previous vector loop > > by adjusting DR_OFFSET? But STMT_VINFO_GATHER_SCATTER_P ends up > > using (unconditionally) DR_REF here? In that case it seems > > best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and > > DR_OFFSET? I wonder how prologue peeling deals with > > STMT_VINFO_GATHER_SCATTER_P ... I see the caller of > > vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P. > > > > I wonder if (as followup to not delay this further) we can > > "offload" all the DR adjustment by storing ADVANCE in dr_vec_info > > and accounting for that when we create the dataref pointers in > > vecto
[committed][vect]PR 88915: Vectorize epilogues when versioning loops (was Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops)
mt_vinfo)) + { + int j; + if (TREE_CODE (DR_REF (dr)) == MEM_REF) + j = 0; + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF) + j = 1; + else + gcc_unreachable (); + + if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j))) + { + DR_REF (dr) = unshare_expr (DR_REF (dr)); + TREE_OPERAND (DR_REF (dr), j) = *new_op; + } huh, do you really only ever see MEM_REF or ARRAY_REF here? I would guess using simplify_replace_tree is safer. There's also DR_BASE_ADDRESS - we seem to leave the DRs partially updated, is that correct? Yeah can use simplify_replace_tree indeed. And I have changed it so it updates DR_BASE_ADDRESS. I think DR_BASE_ADDRESS never actually changed in the way we use data_references... Either way, replacing them if they do change is cleaner and more future proof. Otherwise looks OK to me. Thanks, Richard. gcc/ChangeLog: 2019-10-25 Andre Vieira PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre >From 3aba91fb4d1a0dcfa04b8c8ead1941679cd46ae3 Mon Sep 17 00:00:00 2001 From: Andre Simoes Dias Vieira Date: Fri, 18 Oct 2019 17:49:58 +0100 Subject: [PATCH 2/3] [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops --- gcc/tree-ssa-loop-niter.c | 6 +- gcc/tree-ssa-loop-niter.h | 4 +- gcc/tree-ssa-sccvn.c | 6 +- gcc/tree-vect-loop-manip.c | 203 --- gcc/tree-vect-loop.c | 332 ++--- gcc/tree-vectorizer.c | 25 ++- gcc/tree-vectorizer.h | 13 +- 7 files changed, 490 insertions(+), 99 deletions(-) diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644 --- a/gcc/tree-ssa-loop-niter.c +++ b/gcc/tree-ssa-loop-niter.c @@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop, tree simplify_replace_tree (tree expr, tree old, tree new_tree, - tree (*valueize) (tree)) + tree (*valueize) (tree, void*), void *context) { unsigned i, n; tree ret = NULL_TREE, e, se; @@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree, { if (TREE_CODE (expr) == SSA_NAME) { - new_tree = valueize (expr); + new_tree = valueize (expr, context); if (new_tree != expr) return new_tree; } @@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree, for (i = 0; i < n; i++) { e = TREE_OPERAND (expr, i); - se = simplify_replace_tree (e, old, new_tree, valueize); + se = simplify_replace_tree (e, old, new_tree, valueize, context); if (e == se) continue; diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644 --- a/gcc/tree-ssa-loop-niter.h +++ b/gcc/tree-ssa-loop-niter.h @@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *, class loop *, bool); extern void free_numbers_of_iterations_estimates (class loop *); ext
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Mon, 28 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > Reworked according to your comments, see inline for clarification. > > Is this OK for trunk? + gimple_seq seq = STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo); + while (seq) + { + stmt_worklist.safe_push (seq); + seq = seq->next; + } you're supposed to do to the following, not access the ->next implementation detail: for (gimple_stmt_iterator gsi = gsi_start (seq); !gsi_end_p (gsi); gsi_next (&gsi)) stmt_worklist.safe_push (gsi_stmt (gsi)); + /* Data references for gather loads and scatter stores do not use the +updated offset we set using ADVANCE. Instead we have to make sure the +reference in the data references point to the corresponding copy of +the original in the epilogue. */ + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) + { + DR_REF (dr) + = simplify_replace_tree (DR_REF (dr), NULL_TREE, NULL_TREE, +&find_in_mapping, &mapping); + DR_BASE_ADDRESS (dr) + = simplify_replace_tree (DR_BASE_ADDRESS (dr), NULL_TREE, NULL_TREE, +&find_in_mapping, &mapping); + } Hmm. So for other DRs we account for the previous vector loop by adjusting DR_OFFSET? But STMT_VINFO_GATHER_SCATTER_P ends up using (unconditionally) DR_REF here? In that case it seems best to adjust DR_REF only but NULL out DR_BASE_ADDRESS and DR_OFFSET? I wonder how prologue peeling deals with STMT_VINFO_GATHER_SCATTER_P ... I see the caller of vect_update_init_of_dr there does nothing for STMT_VINFO_GATHER_SCATTER_P. I wonder if (as followup to not delay this further) we can "offload" all the DR adjustment by storing ADVANCE in dr_vec_info and accounting for that when we create the dataref pointers in vectorizable_load/store? That way we could avoid saving/restoring DR_OFFSET as well. So, the patch is OK with the sequence iteration fixed. I think sorting out the above can be done as followup. Thanks, Richard. > gcc/ChangeLog: > 2019-10-28 Andre Vieira > > PR 88915 > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. > (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analysis done > on main loop and call update_epilogue_loop_vinfo to update it. > * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert > stmts on loop preheader edge. > (vect_do_peeling): Enable skip-vectors when doing loop versioning if > we decided to vectorize epilogues. Update epilogues NITERS and > construct ADVANCE to update epilogues data references where needed. > * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. > (vect_do_peeling, vect_update_inits_of_drs, > determine_peel_for_niter, vect_analyze_loop): Add or update declarations. > * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already > created loop_vec_info's for epilogues when available. Otherwise > analyse > epilogue separately. > > > > Cheers, > Andre > > On 28/10/2019 14:16, Richard Biener wrote: > > On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi, > >> > >> This is the reworked patch after your comments. > >> > >> I have moved the epilogue check into the analysis form disguised under > >> '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing > >> the > >> "lowest threshold" check there. > >> > >> The only place where we may reject an epilogue_vinfo is when we know the > >> number of scalar iterations and we realize the number of iterations left > >> after > >> the main loop are not enough to enter the vectorized epilogue so we > >> optimize > >> away that code-gen. The only way we know this to be true is if the number > >> of > >> scalar iterations are known and the peeling for alignment is known. So we > >> know > >> we will enter the main loop regardless, so whether the threshold we use is > >> for > >> a lower VF or not it shouldn't
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi, Reworked according to your comments, see inline for clarification. Is this OK for trunk? gcc/ChangeLog: 2019-10-28 Andre Vieira PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre On 28/10/2019 14:16, Richard Biener wrote: On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: Hi, This is the reworked patch after your comments. I have moved the epilogue check into the analysis form disguised under '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing the "lowest threshold" check there. The only place where we may reject an epilogue_vinfo is when we know the number of scalar iterations and we realize the number of iterations left after the main loop are not enough to enter the vectorized epilogue so we optimize away that code-gen. The only way we know this to be true is if the number of scalar iterations are known and the peeling for alignment is known. So we know we will enter the main loop regardless, so whether the threshold we use is for a lower VF or not it shouldn't matter as much, I would even like to think that check isn't done, but I am not sure... Might be worth checking as an optimization. Is this OK for trunk? + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) + { .. + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) + pattern_worklist.safe_push (stmt_vinfo); + + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { I think PHIs cannot have patterns. You can assert that STMT_VINFO_RELATED_STMT is NULL I think. Done. + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { + related_worklist.safe_push (related_vinfo); + /* Set BB such that the assert in + 'get_initial_def_for_reduction' is able to determine that + the BB of the related stmt is inside this loop. */ + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), +gimple_bb (new_stmt)); + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); + } do we really keep references to "nested" patterns? Thus, do you need this loop? Changed and added asserts. They didn't trigger so I suppose you are right, I didn't know at the time whether it was possible, so I just operated on the side of caution. Can remove the asserts and so on if you want. + /* The PATTERN_DEF_SEQs in the epilogue were constructed using the + original main loop and thus need to be updated to refer to the cloned + variables used in the epilogue. */ + for (unsigned i = 0; i < pattern_worklist.length (); ++i) +{ ... + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, +&find_in_mapping, &mapping); + gimple_set_op (seq, j, op); you do this for the pattern-def seq but not for the related one. I guess you ran into this for COND_EXPR conditions. I wondered to use a shared worklist for both the def-seq and the main pattern stmt or at l
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > This is the reworked patch after your comments. > > I have moved the epilogue check into the analysis form disguised under > '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing the > "lowest threshold" check there. > > The only place where we may reject an epilogue_vinfo is when we know the > number of scalar iterations and we realize the number of iterations left after > the main loop are not enough to enter the vectorized epilogue so we optimize > away that code-gen. The only way we know this to be true is if the number of > scalar iterations are known and the peeling for alignment is known. So we know > we will enter the main loop regardless, so whether the threshold we use is for > a lower VF or not it shouldn't matter as much, I would even like to think that > check isn't done, but I am not sure... Might be worth checking as an > optimization. > > > Is this OK for trunk? + for (epilogue_phi_gsi = gsi_start_phis (epilogue_bbs[i]); + !gsi_end_p (epilogue_phi_gsi); gsi_next (&epilogue_phi_gsi)) + { .. + if (STMT_VINFO_PATTERN_DEF_SEQ (stmt_vinfo)) + pattern_worklist.safe_push (stmt_vinfo); + + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { I think PHIs cannot have patterns. You can assert that STMT_VINFO_RELATED_STMT is NULL I think. + related_vinfo = STMT_VINFO_RELATED_STMT (stmt_vinfo); + while (related_vinfo && related_vinfo != stmt_vinfo) + { + related_worklist.safe_push (related_vinfo); + /* Set BB such that the assert in + 'get_initial_def_for_reduction' is able to determine that + the BB of the related stmt is inside this loop. */ + gimple_set_bb (STMT_VINFO_STMT (related_vinfo), +gimple_bb (new_stmt)); + related_vinfo = STMT_VINFO_RELATED_STMT (related_vinfo); + } do we really keep references to "nested" patterns? Thus, do you need this loop? + /* The PATTERN_DEF_SEQs in the epilogue were constructed using the + original main loop and thus need to be updated to refer to the cloned + variables used in the epilogue. */ + for (unsigned i = 0; i < pattern_worklist.length (); ++i) +{ ... + op = simplify_replace_tree (op, NULL_TREE, NULL_TREE, +&find_in_mapping, &mapping); + gimple_set_op (seq, j, op); you do this for the pattern-def seq but not for the related one. I guess you ran into this for COND_EXPR conditions. I wondered to use a shared worklist for both the def-seq and the main pattern stmt or at least to split out the replacement so you can share it. + /* Data references for gather loads and scatter stores do not use the +updated offset we set using ADVANCE. Instead we have to make sure the +reference in the data references point to the corresponding copy of +the original in the epilogue. */ + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) + { + int j; + if (TREE_CODE (DR_REF (dr)) == MEM_REF) + j = 0; + else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF) + j = 1; + else + gcc_unreachable (); + + if (tree *new_op = mapping.get (TREE_OPERAND (DR_REF (dr), j))) + { + DR_REF (dr) = unshare_expr (DR_REF (dr)); + TREE_OPERAND (DR_REF (dr), j) = *new_op; + } huh, do you really only ever see MEM_REF or ARRAY_REF here? I would guess using simplify_replace_tree is safer. There's also DR_BASE_ADDRESS - we seem to leave the DRs partially updated, is that correct? Otherwise looks OK to me. Thanks, Richard. > gcc/ChangeLog: > 2019-10-25 Andre Vieira > > PR 88915 > * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. > * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter > and make the valueize function pointer also take a void pointer. > * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap > around vn_valueize, to call it without a context. > (process_bb): Use vn_valueize_wrapper instead of vn_valueize. > * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. > (~_loop_vec_info): Release epilogue_vinfos. > (vect_analyze_loop_costing): Use knowledge of main VF to estimate > number of iterations of epilogue. > (vect_analyze_loop_2): Adapt to analyse main loop for all supported > vector sizes when vect-epilogues-nomask=1. Also keep track of lowest > versioning threshold needed for main loop. > (vect_analyze_loop): Likewise. > (find_in_mapping): New helper function. > (update_epilogue_loop_vinfo): New function. > (vect_transform_loop): When vectorizing epilogues re-use analys
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Fri, 25 Oct 2019, Andre Vieira (lists) wrote: > > > On 22/10/2019 14:56, Richard Biener wrote: > > On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi Richi, > >> > >> See inline responses to your comments. > >> > >> On 11/10/2019 13:57, Richard Biener wrote: > >>> On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > >>> > Hi, > > >> > >>> > >>> + > >>> + /* Keep track of vector sizes we know we can vectorize the epilogue > >>> with. */ > >>> + vector_sizes epilogue_vsizes; > >>>}; > >>> > >>> please don't enlarge struct loop, instead track this somewhere > >>> in the vectorizer (in loop_vinfo? I see you already have > >>> epilogue_vinfos there - so the loop_vinfo simply lacks > >>> convenient access to the vector_size?) I don't see any > >>> use that could be trivially adjusted to look at a loop_vinfo > >>> member instead. > >> > >> Done. > >>> > >>> For the vect_update_inits_of_drs this means that we'd possibly > >>> do less CSE. Not sure if really an issue. > >> > >> CSE of what exactly? You are afraid we are repeating a calculation here we > >> have done elsewhere before? > > > > All uses of those inits now possibly get the expression instead of > > just the SSA name we inserted code for once. But as said, we'll see. > > > > This code changed after some comments from Richard Sandiford. > > > + /* We are done vectorizing the main loop, so now we update the > > epilogues > > + stmt_vec_info's. At the same time we set the gimple UID of each > > + statement in the epilogue, as these are used to look them up in the > > + epilogues loop_vec_info later. We also keep track of what > > + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might > > + need updating and we construct a mapping between variables defined > > in > > + the main loop and their corresponding names in epilogue. */ > > + for (unsigned i = 0; i < epilogue->num_nodes; ++i) > > > > so for the following code I wonder if you can make use of the > > fact that loop copying also copies UIDs, so you should be able > > to match stmts via their UIDs and get at the other loop infos > > stmt_info by the copy loop stmt UID. > > > > I wonder why you need no modification for the SLP tree? > > > I checked with Tamar and the SLP tree works with the position of operands and > not SSA_NAMES. So we should be fine. There's now SLP_TREE_SCALAR_OPS but only for invariants so I guess we should indeed be fine here. Everything else is already stmt_infos which you patch with the new underlying stmts. Richard.
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi, This is the reworked patch after your comments. I have moved the epilogue check into the analysis form disguised under '!epilogue_vinfos.is_empty ()'. This because I realized that I am doing the "lowest threshold" check there. The only place where we may reject an epilogue_vinfo is when we know the number of scalar iterations and we realize the number of iterations left after the main loop are not enough to enter the vectorized epilogue so we optimize away that code-gen. The only way we know this to be true is if the number of scalar iterations are known and the peeling for alignment is known. So we know we will enter the main loop regardless, so whether the threshold we use is for a lower VF or not it shouldn't matter as much, I would even like to think that check isn't done, but I am not sure... Might be worth checking as an optimization. Is this OK for trunk? gcc/ChangeLog: 2019-10-25 Andre Vieira PR 88915 * tree-ssa-loop-niter.h (simplify_replace_tree): Change declaration. * tree-ssa-loop-niter.c (simplify_replace_tree): Add context parameter and make the valueize function pointer also take a void pointer. * gcc/tree-ssa-sccvn.c (vn_valueize_wrapper): New function to wrap around vn_valueize, to call it without a context. (process_bb): Use vn_valueize_wrapper instead of vn_valueize. * tree-vect-loop.c (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (find_in_mapping): New helper function. (update_epilogue_loop_vinfo): New function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and call update_epilogue_loop_vinfo to update it. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos. (vect_do_peeling, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already created loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre diff --git a/gcc/tree-ssa-loop-niter.h b/gcc/tree-ssa-loop-niter.h index 4454c1ac78e02228047511a9e0214c82946855b8..aec6225125ce42ab0e4dbc930fc1a93862e6e267 100644 --- a/gcc/tree-ssa-loop-niter.h +++ b/gcc/tree-ssa-loop-niter.h @@ -53,7 +53,9 @@ extern bool scev_probably_wraps_p (tree, tree, tree, gimple *, class loop *, bool); extern void free_numbers_of_iterations_estimates (class loop *); extern void free_numbers_of_iterations_estimates (function *); -extern tree simplify_replace_tree (tree, tree, tree, tree (*)(tree) = NULL); +extern tree simplify_replace_tree (tree, tree, + tree, tree (*)(tree, void *) = NULL, + void * = NULL); extern void substitute_in_loop_info (class loop *, tree, tree); #endif /* GCC_TREE_SSA_LOOP_NITER_H */ diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c index cd2ced369719c37afd4aac08ff360719d7702e42..db666f019808850ed3a4aeef1a454a7ae2c65ef2 100644 --- a/gcc/tree-ssa-loop-niter.c +++ b/gcc/tree-ssa-loop-niter.c @@ -1935,7 +1935,7 @@ number_of_iterations_cond (class loop *loop, tree simplify_replace_tree (tree expr, tree old, tree new_tree, - tree (*valueize) (tree)) + tree (*valueize) (tree, void*), void *context) { unsigned i, n; tree ret = NULL_TREE, e, se; @@ -1951,7 +1951,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree, { if (TREE_CODE (expr) == SSA_NAME) { - new_tree = valueize (expr); + new_tree = valueize (expr, context); if (new_tree != expr) return new_tree; } @@ -1967,7 +1967,7 @@ simplify_replace_tree (tree expr, tree old, tree new_tree, for (i = 0; i < n; i++) { e = TREE_OPERAND (expr, i); - se = simplify_replace_tree (e, old, new_tree, valueize); + se = simplify_replace_tree (e, old, new_tree, valueize, context); if (e == se) continue; diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c index 57331ab44dc78c16d97065cd28e8c4cdcbf8d96e..0abe3fb8453ecf2e25ff55c5c9846663f68f7c8c 100644 --- a/gcc/tree-ssa-sccvn.c +++ b/gcc/tree-ssa-sccvn.c @@ -309,6 +309,10 @@ static vn_tables_t valid_info; /* Valueization hook. Valueize NAME if it is an SSA name, otherwise just return it. */ tree (*vn_valueize) (tree); +tr
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On 22/10/2019 14:56, Richard Biener wrote: On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: Hi Richi, See inline responses to your comments. On 11/10/2019 13:57, Richard Biener wrote: On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: Hi, + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; please don't enlarge struct loop, instead track this somewhere in the vectorizer (in loop_vinfo? I see you already have epilogue_vinfos there - so the loop_vinfo simply lacks convenient access to the vector_size?) I don't see any use that could be trivially adjusted to look at a loop_vinfo member instead. Done. For the vect_update_inits_of_drs this means that we'd possibly do less CSE. Not sure if really an issue. CSE of what exactly? You are afraid we are repeating a calculation here we have done elsewhere before? All uses of those inits now possibly get the expression instead of just the SSA name we inserted code for once. But as said, we'll see. This code changed after some comments from Richard Sandiford. + /* We are done vectorizing the main loop, so now we update the epilogues + stmt_vec_info's. At the same time we set the gimple UID of each + statement in the epilogue, as these are used to look them up in the + epilogues loop_vec_info later. We also keep track of what + stmt_vec_info's have PATTERN_DEF_SEQ's and RELATED_STMT's that might + need updating and we construct a mapping between variables defined in + the main loop and their corresponding names in epilogue. */ + for (unsigned i = 0; i < epilogue->num_nodes; ++i) so for the following code I wonder if you can make use of the fact that loop copying also copies UIDs, so you should be able to match stmts via their UIDs and get at the other loop infos stmt_info by the copy loop stmt UID. I wonder why you need no modification for the SLP tree? I checked with Tamar and the SLP tree works with the position of operands and not SSA_NAMES. So we should be fine.
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On 22/10/2019 18:52, Richard Sandiford wrote: Thanks for doing this. Hope this message doesn't cover too much old ground or duplicate too much... "Andre Vieira (lists)" writes: @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, else niters_prolog = build_int_cst (type, 0); + loop_vec_info epilogue_vinfo = NULL; + if (vect_epilogues) +{ + /* Take the next epilogue_vinfo to vectorize for. */ + epilogue_vinfo = loop_vinfo->epilogue_vinfos[0]; + loop_vinfo->epilogue_vinfos.ordered_remove (0); + + /* Don't vectorize epilogues if this is not the most inner loop or if +the epilogue may need peeling for alignment as the vectorizer doesn't +know how to handle these situations properly yet. */ + if (loop->inner != NULL + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo)) + vect_epilogues = false; + +} Nit: excess blank line before "}". Sorry if this was discussed before, but what's the reason for delaying the check for "loop->inner" to this point, rather than doing it in vect_analyze_loop? Done. + + tree niters_vector_mult_vf; + unsigned int lowest_vf = constant_lower_bound (vf); + /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work + on niters already ajusted for the iterations of the prologue. */ Pre-existing typo: adjusted. But... + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && known_eq (vf, lowest_vf)) +{ + loop_vec_info orig_loop_vinfo; + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo); + else + orig_loop_vinfo = loop_vinfo; + vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES (orig_loop_vinfo); + unsigned next_size = 0; + unsigned HOST_WIDE_INT eiters + = (LOOP_VINFO_INT_NITERS (loop_vinfo) + - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); + + if (prolog_peeling > 0) + eiters -= prolog_peeling; ...is that comment still true? We're now subtracting the peeling amount here. It is not, "adjusted" the comment ;) Might be worth asserting prolog_peeling >= 0, just to emphasise that we can't get here for variable peeling amounts, and then subtract prolog_peeling unconditionally (assuming that's the right thing to do). Can't assert as LOOP_VINFO_NITERS_KNOWN_P can be true even with prolog_peeling < 0, since we still know the constant number of scalar iterations, we just don't know how many vector iterations will be performed due to the runtime peeling. I will however, not reject vectorizing the epilogue, when we don't know how much we are peeling. + eiters + = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); + + unsigned int ratio; + while (next_size < vector_sizes.length () +&& !(constant_multiple_p (current_vector_size, + vector_sizes[next_size], &ratio) + && eiters >= lowest_vf / ratio)) + next_size += 1; + + if (next_size == vector_sizes.length ()) + vect_epilogues = false; +} + /* Prolog loop may be skipped. */ bool skip_prolog = (prolog_peeling != 0); /* Skip to epilog if scalar loop may be preferred. It's only needed - when we peel for epilog loop and when it hasn't been checked with - loop versioning. */ + when we peel for epilog loop or when we loop version. */ bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), bound_prolog + bound_epilog) - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) +|| vect_epilogues)); The comment update looks wrong here: without epilogues, we don't need the skip when loop versioning, because loop versioning ensures that we have at least one vector iteration. (I think "it" was supposed to mean "skipping to the epilogue" rather than the epilogue loop itself, in case that's the confusion.) It'd be good to mention the epilogue condition in the comment too. Rewrote comment, hopefully this now better reflects reality. + + if (vect_epilogues) +{ + epilog->aux = epilogue_vinfo; + LOOP_VINFO_LOOP (epilogue_vinfo) = epilog; + + loop_constraint_clear (epilog, LOOP_C_INFINITE); + + /* We now must calculate the number of iterations for our epilogue. */ + tree cond_niters, niters; + + /* Depending on whether we peel for gaps we take niters or niters - 1, +we will refer to this as N - G, where N and G are the NITERS and +GAP for the original loop. */ + niters = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + ? LOOP_VINFO_NITERSM1 (loop_vinfo) + : LOOP_VINFO_NITERS (loop_vinfo); + + /* Here we build a vector factorization mask: +vf_mask = ~(VF
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Thanks for doing this. Hope this message doesn't cover too much old ground or duplicate too much... "Andre Vieira (lists)" writes: > @@ -2466,15 +2476,65 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > niters, tree nitersm1, >else > niters_prolog = build_int_cst (type, 0); > > + loop_vec_info epilogue_vinfo = NULL; > + if (vect_epilogues) > +{ > + /* Take the next epilogue_vinfo to vectorize for. */ > + epilogue_vinfo = loop_vinfo->epilogue_vinfos[0]; > + loop_vinfo->epilogue_vinfos.ordered_remove (0); > + > + /* Don't vectorize epilogues if this is not the most inner loop or if > + the epilogue may need peeling for alignment as the vectorizer doesn't > + know how to handle these situations properly yet. */ > + if (loop->inner != NULL > + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (epilogue_vinfo)) > + vect_epilogues = false; > + > +} Nit: excess blank line before "}". Sorry if this was discussed before, but what's the reason for delaying the check for "loop->inner" to this point, rather than doing it in vect_analyze_loop? > + > + tree niters_vector_mult_vf; > + unsigned int lowest_vf = constant_lower_bound (vf); > + /* Note LOOP_VINFO_NITERS_KNOWN_P and LOOP_VINFO_INT_NITERS work > + on niters already ajusted for the iterations of the prologue. */ Pre-existing typo: adjusted. But... > + if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > + && known_eq (vf, lowest_vf)) > +{ > + loop_vec_info orig_loop_vinfo; > + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) > + orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo); > + else > + orig_loop_vinfo = loop_vinfo; > + vector_sizes vector_sizes = LOOP_VINFO_EPILOGUE_SIZES > (orig_loop_vinfo); > + unsigned next_size = 0; > + unsigned HOST_WIDE_INT eiters > + = (LOOP_VINFO_INT_NITERS (loop_vinfo) > +- LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)); > + > + if (prolog_peeling > 0) > + eiters -= prolog_peeling; ...is that comment still true? We're now subtracting the peeling amount here. Might be worth asserting prolog_peeling >= 0, just to emphasise that we can't get here for variable peeling amounts, and then subtract prolog_peeling unconditionally (assuming that's the right thing to do). > + eiters > + = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo); > + > + unsigned int ratio; > + while (next_size < vector_sizes.length () > + && !(constant_multiple_p (current_vector_size, > +vector_sizes[next_size], &ratio) > + && eiters >= lowest_vf / ratio)) > + next_size += 1; > + > + if (next_size == vector_sizes.length ()) > + vect_epilogues = false; > +} > + >/* Prolog loop may be skipped. */ >bool skip_prolog = (prolog_peeling != 0); >/* Skip to epilog if scalar loop may be preferred. It's only needed > - when we peel for epilog loop and when it hasn't been checked with > - loop versioning. */ > + when we peel for epilog loop or when we loop version. */ >bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) > ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), > bound_prolog + bound_epilog) > - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); > + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) > + || vect_epilogues)); The comment update looks wrong here: without epilogues, we don't need the skip when loop versioning, because loop versioning ensures that we have at least one vector iteration. (I think "it" was supposed to mean "skipping to the epilogue" rather than the epilogue loop itself, in case that's the confusion.) It'd be good to mention the epilogue condition in the comment too. > @@ -2504,6 +2564,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > niters, tree nitersm1, > >dump_user_location_t loop_loc = find_loop_location (loop); >class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); > + if (vect_epilogues) > +/* Make sure to set the epilogue's epilogue scalar loop, such that we can > + we can use the original scalar loop as remaining epilogue if > + necessary. */ Double "we can". > +LOOP_VINFO_SCALAR_LOOP (epilogue_vinfo) > + = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); > + >if (prolog_peeling) > { >e = loop_preheader_edge (loop); > @@ -2584,14 +2651,22 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > niters, tree nitersm1, > "loop can't be duplicated to exit edge.\n"); > gcc_unreachable (); > } > - /* Peel epilog and put it on exit edge of loop. */ > - epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e); > + /* Peel epilog and put it on exit edge of loop. If we are vectorizing > + said epilog then we should use a copy of the main loop a
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Tue, 22 Oct 2019, Andre Vieira (lists) wrote: > Hi Richi, > > See inline responses to your comments. > > On 11/10/2019 13:57, Richard Biener wrote: > > On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > > > >> Hi, > >> > > > > > + > > + /* Keep track of vector sizes we know we can vectorize the epilogue > > with. */ > > + vector_sizes epilogue_vsizes; > > }; > > > > please don't enlarge struct loop, instead track this somewhere > > in the vectorizer (in loop_vinfo? I see you already have > > epilogue_vinfos there - so the loop_vinfo simply lacks > > convenient access to the vector_size?) I don't see any > > use that could be trivially adjusted to look at a loop_vinfo > > member instead. > > Done. > > > > For the vect_update_inits_of_drs this means that we'd possibly > > do less CSE. Not sure if really an issue. > > CSE of what exactly? You are afraid we are repeating a calculation here we > have done elsewhere before? All uses of those inits now possibly get the expression instead of just the SSA name we inserted code for once. But as said, we'll see. > > > > You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes > > LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to > > LOOP_VINFO_EPILOGUE_P. > > I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I > then use the resulting loop info. If there are cases you feel strongly about > let me know. Not too strongly, no. > > > > @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree > > niters, tree nitersm1, > > else > > niters_prolog = build_int_cst (type, 0); > > > > + loop_vec_info epilogue_vinfo = NULL; > > + if (vect_epilogues) > > +{ > > ... > > + vect_epilogues = false; > > +} > > + > > > > I don't understand what all this does - it clearly needs a comment. > > Maybe the overall comment of the function should be amended with > > an overview of how we handle [multiple] epilogue loop vectorization? > > I added more comments both here and on top of the function. Hopefully it is a > bit clearer now, but it might need some tweaking. > > > > > + > > + if (epilogue_any_upper_bound && prolog_peeling >= 0) > > + { > > + epilog->any_upper_bound = true; > > + epilog->nb_iterations_upper_bound = eiters + 1; > > + } > > + > > > > comment missing. How can prolog_peeling be < 0? We likely > > didn't set the upper bound because we don't know it in the > > case we skipped the vector loop (skip_vector)? So make sure > > to not introduce wrong-code issues here - maybe do this > > optimization as followup?n > > > > So the reason for this code wasn't so much an optimization as it was for > correctness. But I was mistaken, the failure I was seeing without this code > was not because of this code, but rather being hidden by it. The problem I was > seeing was that a prolog was being created using the original loop copy, > rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in > the scalar prolog, leading to expand ICEs. I have fixed that issue by making > sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or > SCALAR loop for epilogue peeling depending on whether we will be vectorizing > said epilogue. OK. > > > @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info > > loop_vinfo) > > return 0; > > } > > > > - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); > > + HOST_WIDE_INT estimated_niter = -1; > > + > > + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > > +estimated_niter > > + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; > > + if (estimated_niter == -1) > > +estimated_niter = estimated_stmt_executions_int (loop); > > if (estimated_niter == -1) > > estimated_niter = likely_max_stmt_executions_int (loop); > > if (estimated_niter != -1 > > > > it's clearer if the old code is completely in a else {} path > > even though vect_vf_for_cost - 1 should never be -1. > > > Done for the == -1 cases, need to keep the != -1 outside of course. > > - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) > > + if (LOOP_REQUIRES_VERSIONING (loop_vinfo) > > + || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) > > + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))) > > > > not sure why we need to do this for epilouges? > > > > This is because we want to compute the versioning threshold for epilogues such > that we can use the minimum versioning threshold when versioning the main > loop. The reason we need to ask we need to ask the original main loop is > partially because of code in 'vect_analyze_data_ref_dependences' that chooses > to not do DR dependence analysis and thus never fills > LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence > LOOP_VINFO_COMP_ALIAS_DDRS is always 0. > > The piece of code is preceded by this comment: > /* For epilogues we either
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi Richi, See inline responses to your comments. On 11/10/2019 13:57, Richard Biener wrote: On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: Hi, + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; please don't enlarge struct loop, instead track this somewhere in the vectorizer (in loop_vinfo? I see you already have epilogue_vinfos there - so the loop_vinfo simply lacks convenient access to the vector_size?) I don't see any use that could be trivially adjusted to look at a loop_vinfo member instead. Done. For the vect_update_inits_of_drs this means that we'd possibly do less CSE. Not sure if really an issue. CSE of what exactly? You are afraid we are repeating a calculation here we have done elsewhere before? You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to LOOP_VINFO_EPILOGUE_P. I checked and the points where I use LOOP_VINFO_ORIG_LOOP_INFO is because I then use the resulting loop info. If there are cases you feel strongly about let me know. @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, else niters_prolog = build_int_cst (type, 0); + loop_vec_info epilogue_vinfo = NULL; + if (vect_epilogues) +{ ... + vect_epilogues = false; +} + I don't understand what all this does - it clearly needs a comment. Maybe the overall comment of the function should be amended with an overview of how we handle [multiple] epilogue loop vectorization? I added more comments both here and on top of the function. Hopefully it is a bit clearer now, but it might need some tweaking. + + if (epilogue_any_upper_bound && prolog_peeling >= 0) + { + epilog->any_upper_bound = true; + epilog->nb_iterations_upper_bound = eiters + 1; + } + comment missing. How can prolog_peeling be < 0? We likely didn't set the upper bound because we don't know it in the case we skipped the vector loop (skip_vector)? So make sure to not introduce wrong-code issues here - maybe do this optimization as followup?n So the reason for this code wasn't so much an optimization as it was for correctness. But I was mistaken, the failure I was seeing without this code was not because of this code, but rather being hidden by it. The problem I was seeing was that a prolog was being created using the original loop copy, rather than the scalar loop, leading to MASK_LOAD and MASK_STORE being left in the scalar prolog, leading to expand ICEs. I have fixed that issue by making sure the SCALAR_LOOP is used for prolog peeling and either the loop copy or SCALAR loop for epilogue peeling depending on whether we will be vectorizing said epilogue. @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) return 0; } - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); + HOST_WIDE_INT estimated_niter = -1; + + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) +estimated_niter + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; + if (estimated_niter == -1) +estimated_niter = estimated_stmt_executions_int (loop); if (estimated_niter == -1) estimated_niter = likely_max_stmt_executions_int (loop); if (estimated_niter != -1 it's clearer if the old code is completely in a else {} path even though vect_vf_for_cost - 1 should never be -1. Done for the == -1 cases, need to keep the != -1 outside of course. - if (LOOP_REQUIRES_VERSIONING (loop_vinfo)) + if (LOOP_REQUIRES_VERSIONING (loop_vinfo) + || ((orig_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) + && LOOP_REQUIRES_VERSIONING (orig_loop_vinfo))) not sure why we need to do this for epilouges? This is because we want to compute the versioning threshold for epilogues such that we can use the minimum versioning threshold when versioning the main loop. The reason we need to ask we need to ask the original main loop is partially because of code in 'vect_analyze_data_ref_dependences' that chooses to not do DR dependence analysis and thus never fills LOOP_VINFO_MAY_ALIAS_DDRS for the epilogues loop_vinfo and as a consequence LOOP_VINFO_COMP_ALIAS_DDRS is always 0. The piece of code is preceded by this comment: /* For epilogues we either have no aliases or alias versioning was applied to original loop. Therefore we may just get max_vf using VF of original loop. */ I have added some comments to make it clearer. +static tree +replace_ops (tree op, hash_map &mapping) +{ I'm quite sure I've seen such beast elsewhere ;) simplify_replace_tree comes up first (not a 1:1 match but hints at a possible tree sharing issue in your variant). The reason I couldn't use simplify_replace_tree is because I didn't know what the "OLD" value is at the time I want to call it. Basically I want to check whether an SSA n
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Thu, 10 Oct 2019, Andre Vieira (lists) wrote: > Hi, > > After all the discussions and respins I now believe this patch is close to > what we envisioned. > > This patch achieves two things when vect-epilogues-nomask=1: > 1) It analyzes the original loop for each supported vector size and saves this > analysis per loop, as well as the vector sizes we know we can vectorize the > loop for. > 2) When loop versioning it uses the 'skip_vector' code path to vectorize the > epilogue, and uses the lowest versioning threshold between the main and > epilogue's. > > As side effects of this patch I also changed ensure_base_align to only update > the alignment if the new alignment is lower than the current one. This > function already did that if the object was a symbol, now it behaves this way > for any object. > > I bootstrapped this patch with both vect-epilogues-nomask turned on and off on > x86_64 (AVX512) and aarch64. Regression tests looked good. > > Is this OK for trunk? + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; please don't enlarge struct loop, instead track this somewhere in the vectorizer (in loop_vinfo? I see you already have epilogue_vinfos there - so the loop_vinfo simply lacks convenient access to the vector_size?) I don't see any use that could be trivially adjusted to look at a loop_vinfo member instead. For the vect_update_inits_of_drs this means that we'd possibly do less CSE. Not sure if really an issue. You use LOOP_VINFO_EPILOGUE_P sometimes and sometimes LOOP_VINFO_ORIG_LOOP_INFO, please change predicates to LOOP_VINFO_EPILOGUE_P. @@ -2466,15 +2461,62 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, else niters_prolog = build_int_cst (type, 0); + loop_vec_info epilogue_vinfo = NULL; + if (vect_epilogues) +{ ... + vect_epilogues = false; +} + I don't understand what all this does - it clearly needs a comment. Maybe the overall comment of the function should be amended with an overview of how we handle [multiple] epilogue loop vectorization? + + if (epilogue_any_upper_bound && prolog_peeling >= 0) + { + epilog->any_upper_bound = true; + epilog->nb_iterations_upper_bound = eiters + 1; + } + comment missing. How can prolog_peeling be < 0? We likely didn't set the upper bound because we don't know it in the case we skipped the vector loop (skip_vector)? So make sure to not introduce wrong-code issues here - maybe do this optimization as followup? class loop * -vect_loop_versioning (loop_vec_info loop_vinfo, - unsigned int th, bool check_profitability, - poly_uint64 versioning_threshold) +vect_loop_versioning (loop_vec_info loop_vinfo) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop; class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); @@ -2988,10 +3151,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo, bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo); bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo); bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo); + poly_uint64 versioning_threshold += LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); tree version_simd_if_cond = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo); + unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (check_profitability) + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && !ordered_p (th, versioning_threshold)) cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters, build_int_cst (TREE_TYPE (scalar_loop_iters), th - 1)); split out this refactoring - preapproved. @@ -1726,7 +1729,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) return 0; } - HOST_WIDE_INT estimated_niter = estimated_stmt_executions_int (loop); + HOST_WIDE_INT estimated_niter = -1; + + if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) +estimated_niter + = vect_vf_for_cost (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)) - 1; + if (estimated_niter == -1) +estimated_niter = estimated_stmt_executions_int (loop); if (estimated_niter == -1) estimated_niter = likely_max_stmt_executions_int (loop); if (estimated_niter != -1 it's clearer if the old code is completely in a else {} path even though vect_vf_for_cost - 1 should never be -1. +/* Decides whether we need to create an epilogue loop to handle + remaining scalar iterations and sets PEELING_FOR_NITERS accordingly. */ + +void +determine_peel_for_niter (loop_vec_info loop_vinfo) +{ + extra vertical space + unsigned HOST_WIDE_INT const_vf; + HOST_WIDE_INT max_niter if it's a 1:1 copy outlined then split it out - preapproved (so further rev
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi, After all the discussions and respins I now believe this patch is close to what we envisioned. This patch achieves two things when vect-epilogues-nomask=1: 1) It analyzes the original loop for each supported vector size and saves this analysis per loop, as well as the vector sizes we know we can vectorize the loop for. 2) When loop versioning it uses the 'skip_vector' code path to vectorize the epilogue, and uses the lowest versioning threshold between the main and epilogue's. As side effects of this patch I also changed ensure_base_align to only update the alignment if the new alignment is lower than the current one. This function already did that if the object was a symbol, now it behaves this way for any object. I bootstrapped this patch with both vect-epilogues-nomask turned on and off on x86_64 (AVX512) and aarch64. Regression tests looked good. Is this OK for trunk? gcc/ChangeLog: 2019-10-10 Andre Vieira PR 88915 * cfgloop.h (loop): Add epilogue_vsizes member. * cfgloop.c (flow_loop_free): Release epilogue_vsizes. (alloc_loop): Initialize epilogue_vsizes. * gentype.c (main): Add poly_uint64 type and vector_sizes to generator. * tree-vect-loop.c (vect_get_loop_niters): Make externally visible. (_loop_vec_info): Initialize epilogue_vinfos. (~_loop_vec_info): Release epilogue_vinfos. (vect_analyze_loop_costing): Use knowledge of main VF to estimate number of iterations of epilogue. (determine_peel_for_niter): New. Outlined code to re-use in two places. (vect_analyze_loop_2): Adapt to analyse main loop for all supported vector sizes when vect-epilogues-nomask=1. Also keep track of lowest versioning threshold needed for main loop. (vect_analyze_loop): Likewise. (replace_ops): New helper function. (vect_transform_loop): When vectorizing epilogues re-use analysis done on main loop and update necessary information. * tree-vect-loop-manip.c (vect_update_inits_of_drs): No longer insert stmts on loop preheader edge. (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. Update epilogues NITERS and construct ADVANCE to update epilogues data references where needed. (vect_loop_versioning): Moved decision to check_profitability based on cost model. * tree-vect-stmts.c (ensure_base_align): Only update alignment if new alignment is lower. * tree-vectorizer.h (_loop_vec_info): Add epilogue_vinfos member. (vect_loop_versioning, vect_do_peeling, vect_get_loop_niters, vect_update_inits_of_drs, determine_peel_for_niter, vect_analyze_loop): Add or update declarations. * tree-vectorizer.c (try_vectorize_loop_1): Make sure to use already create loop_vec_info's for epilogues when available. Otherwise analyse epilogue separately. Cheers, Andre diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644 --- a/gcc/cfgloop.h +++ b/gcc/cfgloop.h @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3. If not see #define GCC_CFGLOOP_H #include "cfgloopmanip.h" +#include "target.h" /* Structure to hold decision about unrolling/peeling. */ enum lpt_dec @@ -268,6 +269,9 @@ public: the basic-block from being collected but its index can still be reused. */ basic_block former_header; + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; /* Set if the loop is known to be infinite. */ diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -198,6 +198,7 @@ flow_loop_free (class loop *loop) exit->prev = exit; } + loop->epilogue_vsizes.release(); ggc_free (loop->exits); ggc_free (loop); } @@ -355,6 +356,7 @@ alloc_loop (void) loop->nb_iterations_upper_bound = 0; loop->nb_iterations_likely_upper_bound = 0; loop->nb_iterations_estimate = 0; + loop->epilogue_vsizes.create(8); return loop; } diff --git a/gcc/gengtype.c b/gcc/gengtype.c index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644 --- a/gcc/gengtype.c +++ b/gcc/gengtype.c @@ -5197,6 +5197,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("widest_int", &pos)); POS_HERE (do_scalar_typedef ("int64_t", &pos)); POS_HERE (do_scalar_typedef ("poly_int64", &pos)); + POS_HERE (do_scalar_typedef ("poly_uint64", &pos)); POS_HERE (do_scalar_typedef ("uint64_t", &pos)); POS_HERE (do_scalar_typedef ("uint8", &pos)); POS_HERE (do_scalar_typedef ("uintptr_t", &pos)); @@ -5206,6 +5207,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("machine_mode", &pos)); POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos));
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Tue, 8 Oct 2019, Andre Vieira (lists) wrote: > Hi Richard, > > As I mentioned in the IRC channel, I managed to get "most" of the regression > testsuite working for x86_64 (avx512) and aarch64. > > On x86_64 I get a failure that I can't explain, was hoping you might be able > to have a look with me: > "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test" > > vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first > iteration of the main loop. The tree dumps look alright, but I do notice the > stack usage seems to change between --param vect-epilogue-nomask={0,1}. So the issue is that we have => 0x00400778 <+72>:vmovdqa64 %zmm1,-0x40(%rax) but the memory accessed is not appropriately aligned. The vectorizer sets DECL_USER_ALIGN on the stack local but somehow later it downs it to 256: Old value = 640 New value = 576 ensure_base_align (dr_info=0x526f788) at /tmp/trunk/gcc/tree-vect-stmts.c:6294 6294 DECL_USER_ALIGN (base_decl) = 1; (gdb) l 6289 if (decl_in_symtab_p (base_decl)) 6290symtab_node::get (base_decl)->increase_alignment (align_base_to); 6291 else 6292{ 6293 SET_DECL_ALIGN (base_decl, align_base_to); 6294 DECL_USER_ALIGN (base_decl) = 1; 6295} this means vectorizing the epilogue modifies the DRs, in particular the base alignment? > Am I missing to update some field that may later lead to the amount of stack > being used? I am confused, it could very well be that I am missing something > obvious, I am not too familiar with x86's ISA. I will try to investigate > further. > > This patch needs further clean-up and more comments (or comment updates), but > I thought I'd share current state to see if you can help me unblock. > > Cheers, > Andre > -- Richard Biener SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi Richard, As I mentioned in the IRC channel, I managed to get "most" of the regression testsuite working for x86_64 (avx512) and aarch64. On x86_64 I get a failure that I can't explain, was hoping you might be able to have a look with me: "PASS->FAIL: gcc.target/i386/vect-perm-odd-1.c execution test" vect-perm-odd-1.exe segfaults and when I gdb it seems to be the first iteration of the main loop. The tree dumps look alright, but I do notice the stack usage seems to change between --param vect-epilogue-nomask={0,1}. Am I missing to update some field that may later lead to the amount of stack being used? I am confused, it could very well be that I am missing something obvious, I am not too familiar with x86's ISA. I will try to investigate further. This patch needs further clean-up and more comments (or comment updates), but I thought I'd share current state to see if you can help me unblock. Cheers, Andre diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644 --- a/gcc/cfgloop.h +++ b/gcc/cfgloop.h @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3. If not see #define GCC_CFGLOOP_H #include "cfgloopmanip.h" +#include "target.h" /* Structure to hold decision about unrolling/peeling. */ enum lpt_dec @@ -268,6 +269,9 @@ public: the basic-block from being collected but its index can still be reused. */ basic_block former_header; + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; /* Set if the loop is known to be infinite. */ diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -198,6 +198,7 @@ flow_loop_free (class loop *loop) exit->prev = exit; } + loop->epilogue_vsizes.release(); ggc_free (loop->exits); ggc_free (loop); } @@ -355,6 +356,7 @@ alloc_loop (void) loop->nb_iterations_upper_bound = 0; loop->nb_iterations_likely_upper_bound = 0; loop->nb_iterations_estimate = 0; + loop->epilogue_vsizes.create(8); return loop; } diff --git a/gcc/gengtype.c b/gcc/gengtype.c index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644 --- a/gcc/gengtype.c +++ b/gcc/gengtype.c @@ -5197,6 +5197,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("widest_int", &pos)); POS_HERE (do_scalar_typedef ("int64_t", &pos)); POS_HERE (do_scalar_typedef ("poly_int64", &pos)); + POS_HERE (do_scalar_typedef ("poly_uint64", &pos)); POS_HERE (do_scalar_typedef ("uint64_t", &pos)); POS_HERE (do_scalar_typedef ("uint8", &pos)); POS_HERE (do_scalar_typedef ("uintptr_t", &pos)); @@ -5206,6 +5207,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("machine_mode", &pos)); POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos)); POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos)); + POS_HERE (do_scalar_typedef ("vector_sizes", &pos)); POS_HERE (do_typedef ("PTR", create_pointer (resolve_typedef ("void", &pos)), &pos)); diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 5c25441c70a271f04730486e513437fffa75b7e3..189f7458b1b20be06a9a20d3ee05e74bc176434c 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -26,6 +26,7 @@ along with GCC; see the file COPYING3. If not see #include "tree.h" #include "gimple.h" #include "cfghooks.h" +#include "tree-if-conv.h" #include "tree-pass.h" #include "ssa.h" #include "fold-const.h" @@ -1724,7 +1725,7 @@ vect_update_init_of_dr (struct data_reference *dr, tree niters, tree_code code) Apply vect_update_inits_of_dr to all accesses in LOOP_VINFO. CODE and NITERS are as for vect_update_inits_of_dr. */ -static void +void vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters, tree_code code) { @@ -1736,19 +1737,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters, /* Adjust niters to sizetype and insert stmts on loop preheader edge. */ if (!types_compatible_p (sizetype, TREE_TYPE (niters))) -{ - gimple_seq seq; - edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo)); - tree var = create_tmp_var (sizetype, "prolog_loop_adjusted_niters"); - - niters = fold_convert (sizetype, niters); - niters = force_gimple_operand (niters, &seq, false, var); - if (seq) - { - basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq); - gcc_assert (!new_bb); - } -} +niters = fold_convert (sizetype, niters); FOR_EACH_VEC_ELT (datarefs, i, dr) { @@ -2401,14 +2390,18 @@ class loop * vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, tree *niters_vector, tree *step_vector, tree *niters_vector_mult_vf_var, int th, - boo
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi Richard, As I mentioned in the IRC channel, this is my current work in progress patch. It currently ICE's when vectorizing gcc/testsuite/gcc.c-torture/execute/nestfunc-2.c with '-O3' and '--param vect-epilogues-nomask=1'. It ICE's because the epilogue loop (after if conversion) and main loop (before vectorization) are not the same, there are a bunch of extra BBs and the epilogue loop seems to need some cleaning up too. Let me know if you see a way around this issue. Cheers, Andre diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h index 0b0154ffd7bf031a005de993b101d9db6dd98c43..d01512ea46467f1cf77793bdc75b48e71b0b9641 100644 --- a/gcc/cfgloop.h +++ b/gcc/cfgloop.h @@ -21,6 +21,7 @@ along with GCC; see the file COPYING3. If not see #define GCC_CFGLOOP_H #include "cfgloopmanip.h" +#include "target.h" /* Structure to hold decision about unrolling/peeling. */ enum lpt_dec @@ -268,6 +269,9 @@ public: the basic-block from being collected but its index can still be reused. */ basic_block former_header; + + /* Keep track of vector sizes we know we can vectorize the epilogue with. */ + vector_sizes epilogue_vsizes; }; /* Set if the loop is known to be infinite. */ diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index 4ad1f658708f83dbd8789666c26d4bd056837bc6..f3e81bcd00b3f125389aa15b12dc5201b3578d20 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -198,6 +198,7 @@ flow_loop_free (class loop *loop) exit->prev = exit; } + loop->epilogue_vsizes.release(); ggc_free (loop->exits); ggc_free (loop); } @@ -355,6 +356,7 @@ alloc_loop (void) loop->nb_iterations_upper_bound = 0; loop->nb_iterations_likely_upper_bound = 0; loop->nb_iterations_estimate = 0; + loop->epilogue_vsizes.create(8); return loop; } diff --git a/gcc/gengtype.c b/gcc/gengtype.c index 53317337cf8c8e8caefd6b819d28b3bba301e755..80fb6ef71465b24e034fa45d69fec56be6b2e7f8 100644 --- a/gcc/gengtype.c +++ b/gcc/gengtype.c @@ -5197,6 +5197,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("widest_int", &pos)); POS_HERE (do_scalar_typedef ("int64_t", &pos)); POS_HERE (do_scalar_typedef ("poly_int64", &pos)); + POS_HERE (do_scalar_typedef ("poly_uint64", &pos)); POS_HERE (do_scalar_typedef ("uint64_t", &pos)); POS_HERE (do_scalar_typedef ("uint8", &pos)); POS_HERE (do_scalar_typedef ("uintptr_t", &pos)); @@ -5206,6 +5207,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("machine_mode", &pos)); POS_HERE (do_scalar_typedef ("fixed_size_mode", &pos)); POS_HERE (do_scalar_typedef ("CONSTEXPR", &pos)); + POS_HERE (do_scalar_typedef ("vector_sizes", &pos)); POS_HERE (do_typedef ("PTR", create_pointer (resolve_typedef ("void", &pos)), &pos)); diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 5c25441c70a271f04730486e513437fffa75b7e3..b1c13dafdeeec8d95f00c232822d3ab9b11f4046 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -26,6 +26,7 @@ along with GCC; see the file COPYING3. If not see #include "tree.h" #include "gimple.h" #include "cfghooks.h" +#include "tree-if-conv.h" #include "tree-pass.h" #include "ssa.h" #include "fold-const.h" @@ -1730,6 +1731,7 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters, { unsigned int i; vec datarefs = LOOP_VINFO_DATAREFS (loop_vinfo); + vec datarefs_copy = loop_vinfo->shared->datarefs_copy; struct data_reference *dr; DUMP_VECT_SCOPE ("vect_update_inits_of_dr"); @@ -1756,6 +1758,12 @@ vect_update_inits_of_drs (loop_vec_info loop_vinfo, tree niters, if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt)) vect_update_init_of_dr (dr, niters, code); } + FOR_EACH_VEC_ELT (datarefs_copy, i, dr) +{ + dr_vec_info *dr_info = loop_vinfo->lookup_dr (dr); + if (!STMT_VINFO_GATHER_SCATTER_P (dr_info->stmt)) + vect_update_init_of_dr (dr, niters, code); +} } /* For the information recorded in LOOP_VINFO prepare the loop for peeling @@ -2409,6 +2417,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, profile_probability prob_prolog, prob_vector, prob_epilog; int estimated_vf; int prolog_peeling = 0; + bool vect_epilogues += loop_vinfo->epilogue_vinfos.length () > 0; /* We currently do not support prolog peeling if the target alignment is not known at compile time. 'vect_gen_prolog_loop_niters' depends on the target alignment being constant. */ @@ -2469,12 +2479,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, /* Prolog loop may be skipped. */ bool skip_prolog = (prolog_peeling != 0); /* Skip to epilog if scalar loop may be preferred. It's only needed - when we peel for epilog loop and when it hasn't been checked with - loop versioning. */ + when we peel for epilog loop or when we loop version. */ bool skip_vector = (LOOP_VINFO_NITERS_KN
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On 8/23/19 10:50 AM, Andre Vieira (lists) wrote: > Hi, > > This patch is an improvement on my last RFC. As you pointed out, we can > do the vectorization analysis of the epilogues before doing the > transformation, using the same approach as used by openmp simd. I have > not yet incorporated the cost tweaks for vectorizing the epilogue, I > would like to do this in a subsequent patch, to make it easier to test > the differences. > > I currently disable the vectorization of epilogues when versioning for > iterations. This is simply because I do not completely understand how > the assumptions are created and couldn't determine whether using > skip_vectors with this would work. If you don't think it is a problem > or have a testcase to show it work I would gladly look at it. > > Bootstrapped this and the next patch on x86_64 and > aarch64-unknown-linux-gnu, with no regressions (after test changes in > next patch). > > gcc/ChangeLog: > 2019-08-23 Andre Vieira > > PR 88915 > * gentype.c (main): Add poly_uint64 type to generator. > * tree-vect-loop.c (vect_analyze_loop_2): Make it determine > whether we vectorize epilogue loops. > (vect_analyze_loop): Idem. > (vect_transform_loop): Pass decision to vectorize epilogues > to vect_do_peeling. > * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors > when doing loop versioning if we decided to vectorize epilogues. > (vect-loop_versioning): Moved decision to check_profitability > based on cost model. > * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling, > vect_analyze_loop, vect_transform_loop): Update declarations. > * tree-vectorizer.c: Include params.h > (try_vectorize_loop_1): Initialize vect_epilogues_nomask > to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop > and vect_transform_loop. Also make sure vectorizing epilogues > does not count towards number of vectorized loops. Nit. In several places you use "epilog", proper spelling is "epilogue". > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c > index > 173e6b51652fd023893b38da786ff28f827553b5..25c3fc8ff55e017ae0b971fa93ce8ce2a07cb94c > 100644 > --- a/gcc/tree-vectorizer.c > +++ b/gcc/tree-vectorizer.c > @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table > *&simduid_to_vf_htab, > >/* Epilogue of vectorized loop must be vectorized too. */ >if (new_loop) > -ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, > - new_loop, loop_vinfo, NULL, NULL); > +{ > + /* Don't include vectorized epilogues in the "vectorized loops" count. > + */ > + unsigned dont_count = *num_vectorized_loops; > + ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count, > +new_loop, loop_vinfo, NULL, NULL); > +} Nit. Don't wrap a comment with just the closing */ on its own line. Instead wrap before "count" so that. This is fine for the trunk after fixing those nits. jeff
Re: [PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
On Fri, 23 Aug 2019, Andre Vieira (lists) wrote: > Hi, > > This patch is an improvement on my last RFC. As you pointed out, we can do > the vectorization analysis of the epilogues before doing the transformation, > using the same approach as used by openmp simd. I have not yet incorporated > the cost tweaks for vectorizing the epilogue, I would like to do this in a > subsequent patch, to make it easier to test the differences. > > I currently disable the vectorization of epilogues when versioning for > iterations. This is simply because I do not completely understand how the > assumptions are created and couldn't determine whether using skip_vectors with > this would work. If you don't think it is a problem or have a testcase to > show it work I would gladly look at it. I don't think there's any problem. Basically the versioning condition is if (can_we_compute_niter), most of the time it is an extra condition from niter analysis under which niter is for example zero. This should also be the same with all vector sizes. - delete loop_vinfo; + { + /* Set versioning threshold of the original LOOP_VINFO based +on the last vectorization of the epilog. */ + LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo) + = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); + delete loop_vinfo; + } I'm not sure this works reliably since the order we process vector sizes is under target control and not necessarily decreasing. I think you want to keep track of the minimum here? Preferably separately I guess. >From what I see vect_analyze_loop_2 doesn't need vect_epilogues_nomask and thus it doesn't change throughout the iteration. else - delete loop_vinfo; + { + /* Disable epilog vectorization if we can't determine the epilogs can +be vectorized. */ + *vect_epilogues_nomask &= vectorized_loops > 1; + delete loop_vinfo; + } and this is a bit premature and instead it should be done just before returning success? Maybe also storing the epilogue vector sizes we can handle in the loop_vinfo, thereby representing !vect_epilogues_nomask if there are no such sizes which also means that @@ -1013,8 +1015,13 @@ try_vectorize_loop_1 (hash_table *&simduid_to_vf_htab, /* Epilogue of vectorized loop must be vectorized too. */ if (new_loop) -ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, -new_loop, loop_vinfo, NULL, NULL); +{ + /* Don't include vectorized epilogues in the "vectorized loops" count. + */ + unsigned dont_count = *num_vectorized_loops; + ret |= try_vectorize_loop_1 (simduid_to_vf_htab, &dont_count, + new_loop, loop_vinfo, NULL, NULL); +} can be optimized to not re-check all smaller sizes (but even assert re-analysis succeeds to the original result for the actual transform). Otherwise this looks reasonable to me. Thanks, Richard. > > Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu, > with no regressions (after test changes in next patch). > > gcc/ChangeLog: > 2019-08-23 Andre Vieira > > PR 88915 > * gentype.c (main): Add poly_uint64 type to generator. > * tree-vect-loop.c (vect_analyze_loop_2): Make it determine > whether we vectorize epilogue loops. > (vect_analyze_loop): Idem. > (vect_transform_loop): Pass decision to vectorize epilogues > to vect_do_peeling. > * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors > when doing loop versioning if we decided to vectorize epilogues. > (vect-loop_versioning): Moved decision to check_profitability > based on cost model. > * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling, > vect_analyze_loop, vect_transform_loop): Update declarations. > * tree-vectorizer.c: Include params.h > (try_vectorize_loop_1): Initialize vect_epilogues_nomask > to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop > and vect_transform_loop. Also make sure vectorizing epilogues > does not count towards number of vectorized loops. > > -- Richard Biener SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
[PATCH 1/2][vect]PR 88915: Vectorize epilogues when versioning loops
Hi, This patch is an improvement on my last RFC. As you pointed out, we can do the vectorization analysis of the epilogues before doing the transformation, using the same approach as used by openmp simd. I have not yet incorporated the cost tweaks for vectorizing the epilogue, I would like to do this in a subsequent patch, to make it easier to test the differences. I currently disable the vectorization of epilogues when versioning for iterations. This is simply because I do not completely understand how the assumptions are created and couldn't determine whether using skip_vectors with this would work. If you don't think it is a problem or have a testcase to show it work I would gladly look at it. Bootstrapped this and the next patch on x86_64 and aarch64-unknown-linux-gnu, with no regressions (after test changes in next patch). gcc/ChangeLog: 2019-08-23 Andre Vieira PR 88915 * gentype.c (main): Add poly_uint64 type to generator. * tree-vect-loop.c (vect_analyze_loop_2): Make it determine whether we vectorize epilogue loops. (vect_analyze_loop): Idem. (vect_transform_loop): Pass decision to vectorize epilogues to vect_do_peeling. * tree-vect-loop-manip.c (vect_do_peeling): Enable skip-vectors when doing loop versioning if we decided to vectorize epilogues. (vect-loop_versioning): Moved decision to check_profitability based on cost model. * tree-vectorizer.h (vect_loop_versioning, vect_do_peeling, vect_analyze_loop, vect_transform_loop): Update declarations. * tree-vectorizer.c: Include params.h (try_vectorize_loop_1): Initialize vect_epilogues_nomask to PARAM_VECT_EPILOGUES_NOMASK and pass it to vect_analyze_loop and vect_transform_loop. Also make sure vectorizing epilogues does not count towards number of vectorized loops. diff --git a/gcc/gengtype.c b/gcc/gengtype.c index 53317337cf8c8e8caefd6b819d28b3bba301e755..56ffa08a7dee54837441f0c743f8c0faa285c74b 100644 --- a/gcc/gengtype.c +++ b/gcc/gengtype.c @@ -5197,6 +5197,7 @@ main (int argc, char **argv) POS_HERE (do_scalar_typedef ("widest_int", &pos)); POS_HERE (do_scalar_typedef ("int64_t", &pos)); POS_HERE (do_scalar_typedef ("poly_int64", &pos)); + POS_HERE (do_scalar_typedef ("poly_uint64", &pos)); POS_HERE (do_scalar_typedef ("uint64_t", &pos)); POS_HERE (do_scalar_typedef ("uint8", &pos)); POS_HERE (do_scalar_typedef ("uintptr_t", &pos)); diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 5c25441c70a271f04730486e513437fffa75b7e3..3b5f14c45b5b9b601120c6776734bbafefe1e178 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2401,7 +2401,8 @@ class loop * vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, tree *niters_vector, tree *step_vector, tree *niters_vector_mult_vf_var, int th, - bool check_profitability, bool niters_no_overflow) + bool check_profitability, bool niters_no_overflow, + bool vect_epilogues_nomask) { edge e, guard_e; tree type = TREE_TYPE (niters), guard_cond; @@ -2474,7 +2475,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, bool skip_vector = (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) ? maybe_lt (LOOP_VINFO_INT_NITERS (loop_vinfo), bound_prolog + bound_epilog) - : !LOOP_REQUIRES_VERSIONING (loop_vinfo)); + : (!LOOP_REQUIRES_VERSIONING (loop_vinfo) + || vect_epilogues_nomask)); /* Epilog loop must be executed if the number of iterations for epilog loop is known at compile time, otherwise we need to add a check at the end of vector loop and skip to the end of epilog loop. */ @@ -2966,9 +2968,7 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr) *COND_EXPR_STMT_LIST. */ class loop * -vect_loop_versioning (loop_vec_info loop_vinfo, - unsigned int th, bool check_profitability, - poly_uint64 versioning_threshold) +vect_loop_versioning (loop_vec_info loop_vinfo) { class loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop; class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); @@ -2988,10 +2988,15 @@ vect_loop_versioning (loop_vec_info loop_vinfo, bool version_align = LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT (loop_vinfo); bool version_alias = LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo); bool version_niter = LOOP_REQUIRES_VERSIONING_FOR_NITERS (loop_vinfo); + poly_uint64 versioning_threshold += LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo); tree version_simd_if_cond = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo); + unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo); - if (check_profitability) + if (th >= vect_vf_for_cost (loop_vinfo) + && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) + && !ordered_p (th, versioning_threshold))