[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

janus at gcc dot gnu.org changed:
           What    |Removed |Added
           CC      |        |janus at gcc dot gnu.org

--- Comment #39 from janus at gcc dot gnu.org ---
(In reply to uros from comment #37)
> Author: uros
> Date: Thu Feb  8 22:31:15 2018
> New Revision: 257505
>
> URL: https://gcc.gnu.org/viewcvs?rev=257505&root=gcc&view=rev
> Log:
>     PR target/83008
>     * config/i386/x86-tune-costs.h (skylake_cost): Fix cost of
>     storing integer register in SImode.  Fix cost of 256 and 512
>     byte aligned SSE register store.
>
>     * config/i386/i386.c (ix86_multiplication_cost): Fix
>     multiplication cost for TARGET_AVX512DQ.

This caused PR 86735.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

sergey.shalnov at intel dot com changed:
           What    |Removed |Added
       Status     |NEW     |RESOLVED
   Resolution     |---     |FIXED

--- Comment #38 from sergey.shalnov at intel dot com ---
Implemented.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #37 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu Feb  8 22:31:15 2018
New Revision: 257505

URL: https://gcc.gnu.org/viewcvs?rev=257505&root=gcc&view=rev
Log:
    PR target/83008
    * config/i386/x86-tune-costs.h (skylake_cost): Fix cost of
    storing integer register in SImode.  Fix cost of 256 and 512
    byte aligned SSE register store.

    * config/i386/i386.c (ix86_multiplication_cost): Fix
    multiplication cost for TARGET_AVX512DQ.

testsuite/ChangeLog:

    PR target/83008
    * gcc.target/i386/pr83008.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr83008.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/x86-tune-costs.h
    trunk/gcc/testsuite/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #36 from sergey.shalnov at intel dot com ---
The patch that fixes the issue for SKX is in
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html

I will close the PR after the patch has been merged.
Thank you very much to all involved.

Sergey
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #35 from Christophe Lyon ---
Author: clyon
Date: Wed Feb  7 09:12:48 2018
New Revision: 257438

URL: https://gcc.gnu.org/viewcvs?rev=257438&root=gcc&view=rev
Log:
[testsuite] Fix gcc.dg/cse_recip.c for AArch64 after r257181.

2018-02-07  Christophe Lyon

    PR tree-optimization/83008
    * gcc.dg/cse_recip.c: Add -fno-tree-slp-vectorize.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/cse_recip.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #34 from Richard Biener ---
Author: rguenth
Date: Tue Jan 30 11:19:47 2018
New Revision: 257181

URL: https://gcc.gnu.org/viewcvs?rev=257181&root=gcc&view=rev
Log:
2018-01-30  Richard Biener

    PR tree-optimization/83008
    * tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
    invariant and constant vector uses in stmts when they need
    more than one stmt.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-slp.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #33 from sergey.shalnov at intel dot com ---
Richard,
I'm not sure whether it is a regression or not. I can see the code was visibly
refactored in this commit back in 2013:
https://github.com/gcc-mirror/gcc/commit/ee6e9ba576099aed29f1097195c649fc796ecf5e

However, this fix is highly important for our workloads and really desirable
for GCC 8. For example, your patch plus the SKX cost model changes give us
~+1% in geomean on SPEC2017 intrate. It would be really good for us to get
this patch into GCC 8.

The cost model changes planned (they will be proposed separately):

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e943d13..d5e6ef6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},                /* cost of loading integer registers
                                in QImode, HImode and SImode.
                                Relative to reg-reg move (2).  */
-  {6, 6, 6},                /* cost of storing integer registers */
+  {6, 6, 3},                /* cost of storing integer registers */
   2,                        /* cost of reg,reg fld/fst */
   {6, 6, 8},                /* cost of loading fp registers
                                in SFmode, DFmode and XFmode */
@@ -1572,7 +1572,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},        /* cost of loading SSE registers
                                in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},        /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},         /* cost of storing SSE registers
+  {8, 8, 8, 14, 24},        /* cost of storing SSE registers
                                in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},         /* cost of unaligned stores.  */
   2, 2,                     /* SSE->integer and integer->SSE moves */

I know that you have some concerns about the costs above; I'm saving that for
the next discussion, because your patch is a good foundation to proceed with
cost tuning.

Sergey
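[Editorial note] The hunks above tweak rows of a plain per-size cost table that the i386 backend indexes by operand width. A minimal standalone sketch of that lookup, using the proposed SSE-store row {8, 8, 8, 14, 24} from the patch; the helper names are illustrative, not GCC's:

```c
#include <assert.h>

/* Proposed skylake_cost row: cost of storing SSE registers in
   32, 64, 128, 256 and 512-bit (values from the hunk above).  */
static const int sse_store_cost[5] = {8, 8, 8, 14, 24};

/* Map a vector bit width to its row index, mirroring how the i386
   cost tables are laid out: 32 -> 0, 64 -> 1, ..., 512 -> 4.  */
static int sse_cost_index (int bits)
{
  int idx = 0;
  while ((32 << idx) < bits)
    idx++;
  return idx;
}

static int aligned_sse_store_cost (int bits)
{
  return sse_store_cost[sse_cost_index (bits)];
}
```

Under this proposal a 512-bit aligned store is costed at 24 units instead of the previous 16, making wide stores less attractive relative to scalar code.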
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #32 from rguenther at suse dot de ---
On Fri, 26 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
>
> --- Comment #31 from sergey.shalnov at intel dot com ---
> Richard,
> Thank you for your latest patch. This patch is exactly what
> I've discussed in this issue request.
> I tested it with SPEC20[06|17] and see no performance/stability degradation.
>
> Could you please merge your latest patch into main trunk?

Is this bug a regression? If not, I think the change has to wait for GCC 9.

> I will provide Intel specific changes (in gcc/config/i386) that
> will leverage your patch to get the performance better
> for the provided testcase (step N2)

Can you attach those?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #31 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. This patch is exactly what I've discussed in
this issue request. I tested it with SPEC20[06|17] and see no
performance/stability degradation.

Could you please merge your latest patch into main trunk?

I will provide the Intel-specific changes (in gcc/config/i386) that will
leverage your patch to improve performance for the provided testcase (step N2).

Thank you
Sergey
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

Richard Biener changed:
           What            |Removed |Added
Attachment #43084 is obsolete|0      |1

--- Comment #30 from Richard Biener ---
Created attachment 43238
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43238&action=edit
updated patch for SLP costing

You are right, the patch contained several errors. The multiple_of_p is
supposed to handle the case where we know all vectors will be equal, like
when group_size is two and const_nunits is 4. The arguments were swapped.
Also the loop over the elements was bogus. I've corrected this with the
attached updated patch.

This now costs two vector constructions for the testcase as expected, but
still:

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 32
  Vector prologue cost: 64
  Vector epilogue cost: 0
  Scalar cost of basic block: 192
t.c:32:12: note: Basic block will be vectorized using SLP

That's for two aligned stores and two 8-element vector constructions. We're
offsetting 16 scalar stores after all... They each seem to cost 12 while an
aligned vector store costs 16. And the vector constructions cost 32 each
(8 times an SSE op costing 4, aka "element insert").

The only thing I notice is that

40240	ix86_vec_cost (machine_mode mode, int cost, bool parallel)
40241	{
40242	  if (!VECTOR_MODE_P (mode))
40243	    return cost;
40244	
40245	  if (!parallel)
40246	    return cost * GET_MODE_NUNITS (mode);
40247	  if (GET_MODE_BITSIZE (mode) == 128
40248	      && TARGET_SSE_SPLIT_REGS)
40249	    return cost * 2;
(gdb)
40250	  if (GET_MODE_BITSIZE (mode) > 128
40251	      && TARGET_AVX128_OPTIMAL)
40252	    return cost * GET_MODE_BITSIZE (mode) / 128;
40253	  return cost;
40254	}

all the pessimizing for TARGET_SSE_SPLIT_REGS/TARGET_AVX128_OPTIMAL isn't
applied to the !parallel case. But they wouldn't apply to AVX512 AFAICS.
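[Editorial note] The cost totals quoted in comment #30 follow from simple per-operation unit costs. A small sketch reproducing that arithmetic (the unit costs are taken from the comment; the function names are illustrative, not GCC's):

```c
#include <assert.h>

/* Unit costs stated in the comment: an aligned vector store costs 16,
   a scalar store 12, and a vec_construct of N elements is N SSE
   "element insert" ops at 4 each.  */
enum { VEC_STORE = 16, SCALAR_STORE = 12, ELT_INSERT = 4 };

/* Cost of constructing one vector from nelts scalar elements.  */
static int vec_construct_cost (int nelts) { return nelts * ELT_INSERT; }

/* "Vector inside of basic block cost": nstores aligned vector stores.  */
static int bb_vector_inside (int nstores) { return nstores * VEC_STORE; }

/* "Vector prologue cost": nconstructs vector constructions.  */
static int bb_vector_prologue (int nconstructs, int nelts)
{ return nconstructs * vec_construct_cost (nelts); }

/* "Scalar cost of basic block": the scalar stores being replaced.  */
static int bb_scalar_cost (int nscalar_stores)
{ return nscalar_stores * SCALAR_STORE; }
```

With two aligned stores, two 8-element constructions, and 16 scalar stores this reproduces the dump's 32 / 64 / 192, so the vectorized form (32 + 64 = 96) still wins against 192 under this model.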
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #29 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. I would like to clarify the multiple_p()
function usage in the if() clause.

First of all, I assume that architectures with a fixed size of HW registers
(x86) should go to the if(){} branch, but arches with unknown register size
should go to else{}. This is because the is_constant() function is used.

The problem is in the multiple_p() function. In the test case we have
group_size=16 and const_nunits=8. So multiple_p(16, 8) returns 1, and with
-march=skylake-avx512 (or -march=znver1) we go to the else{} branch, which is
not correct. I used the following change in your patch and the test works as
I expect:

--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1923,7 +1923,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
       unsigned HOST_WIDE_INT const_nunits;
       unsigned nelt_limit;
       if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
-          && ! multiple_p (group_size, const_nunits))
+          && multiple_p (group_size, const_nunits))
         {
           num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
           nelt_limit = const_nunits;
--

Another thing here is what we should do if group_size is, for example, 17.
In this case (after my changes) the wrong else{} branch will be taken.
I would propose to change this particular if() in the following way:

if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
  {
    if (multiple_p (group_size, const_nunits))
      {
        num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
        nelt_limit = const_nunits;
      }
    else
      {
        ...not clear what we should have here...
      }
  }
else
  {
    num_vects_to_check = 1;
    nelt_limit = group_size;
  }

Or perhaps multiple_p() should not be here at all?

Sergey
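[Editorial note] The arithmetic being debated above can be checked in isolation. A minimal sketch of Sergey's proposed variant (the `!` dropped), with `multiple_p` reduced to a plain-unsigned stand-in for GCC's poly-int helper; names and the fallback value are taken from the thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for GCC's multiple_p: is a an exact multiple of b?  */
static bool multiple_p (unsigned a, unsigned b)
{
  return a % b == 0;
}

/* Sergey's proposed branch: when the number of vector lanes is known
   and group_size divides evenly, derive the count of distinct vectors
   to cost; otherwise fall back to 1 as in the original else{} branch.  */
static unsigned num_vects_to_check (unsigned ncopies_for_cost,
                                    unsigned const_nunits,
                                    unsigned group_size)
{
  if (multiple_p (group_size, const_nunits))
    return ncopies_for_cost * const_nunits / group_size;
  return 1;
}
```

With the thread's numbers (ncopies_for_cost=2, const_nunits=8, group_size=16) the expression still evaluates to 2*8/16 = 1, which is exactly why comment #28 asks how to get 2 here, and why group_size=17 falls through to the unclear branch.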
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #28 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your comments. I see that TYPE_VECTOR_SUBPARTS is constant for
the test case, but multiple_p (group_size, const_nunits) returns 1 in the code:

if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
    && ! multiple_p (group_size, const_nunits))
  {
    num_vects_to_check = ncopies_for_cost * const_nunits / group_size;
    nelt_limit = const_nunits;
  }
else
  {
    num_vects_to_check = 1;
    nelt_limit = group_size;
  }

This is because group_size = 16 and const_nunits = 8 in the test case with
-march=skylake-avx512, and the control flow goes to "num_vects_to_check = 1".
Anyway, ncopies_for_cost = 2 and the equation
"ncopies_for_cost * const_nunits / group_size" will be 1.
Do you think we have any possibility to make a conditional clause that makes
num_vects_to_check = 2?

Sergey
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #27 from rguenther at suse dot de ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
>
> --- Comment #26 from sergey.shalnov at intel dot com ---
> Sorry, did you mean "arm_sve.h" on ARM?
> In this case we have machine specific code in a common part of the gcc code.
> Should we make it a machine dependent callback function, because having
> "num_vects_to_check = 1" is incorrect for SSE?

TYPE_VECTOR_SUBPARTS is never non-constant for SSE.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #26 from sergey.shalnov at intel dot com ---
Sorry, did you mean "arm_sve.h" on ARM?
In this case we have machine-specific code in a common part of the gcc code.
Should we make it a machine-dependent callback function, because having
"num_vects_to_check = 1" is incorrect for SSE?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #25 from rguenther at suse dot de ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
>
> --- Comment #24 from sergey.shalnov at intel dot com ---
> Richard,
> The latest "SLP costing for constants/externs improvement" patch generates
> the same code as baseline for the test example.
>
> Are you sure that "num_vects_to_check" should be 1 if the vector length is
> not constant? I would expect "num_vects_to_check = ncopies_for_cost;" here:
>
> 1863	      else
> 1864	        {
> 1865	          num_vects_to_check = 1;
> 1866	          nelt_limit = group_size;
> 1867	        }

Yes, that's correct for variable length vectors (SVE).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #24 from sergey.shalnov at intel dot com ---
Richard,
The latest "SLP costing for constants/externs improvement" patch generates the
same code as the baseline for the test example.

Are you sure that "num_vects_to_check" should be 1 if the vector length is not
constant? I would expect "num_vects_to_check = ncopies_for_cost;" here:

1863	      else
1864	        {
1865	          num_vects_to_check = 1;
1866	          nelt_limit = group_size;
1867	        }

Sergey
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #22 from rguenther at suse dot de ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> Thanks Richard for your comments.
> Based on our discussion I've produced the patch attached and
> run it on SPEC2017intrate/fprate on skylake server (with [-Ofast -flto
> -march=skylake-avx512 -mfpmath=sse -funroll-loops]).
> Please note, I used your last proposed patch and changed loop trip count
> calculation ("ncopies_for_cost * nunits / group_size" is always 1).
>
> I see the following performance changes:
> SPEC CPU 2017 intrate
> 500.perlbench_r  -0.7%
> 525.x264_r       +7.2%
> Geomean:         +0.8%
>
> SPEC CPU 2017 fprate
> 527.cam4_r       -1.1%
> 538.imagick_r    +4.7%
> 544.nab_r        +3.6%
> Geomean:         +0.6%
>
> I believe that after appropriate cost model tweaks for other targets a gain
> could be observed but I haven't checked it carefully.
> It provides a good performance gain for the original case and a few other
> improvements.
>
> Can you please take a look at the patch and comment (or might propose
> another method)?

It mixes several things, the last one (the > to >= change in cost evaluation)
clearly wrong. The skylake_cost changes look somewhat odd to me.

I'll attach my current SLP costing adjustment patch (after the SVE changes
the old one didn't build anymore).

> Sergey
>
> From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
> From: Sergey Shalnov
> Date: Tue, 9 Jan 2018 14:37:14 +0100
> Subject: [PATCH, SLP] SLP_common_algo_changes
>
> ---
>  gcc/config/i386/x86-tune-costs.h |  4 ++--
>  gcc/tree-vect-slp.c              | 41 ++--
>  2 files changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 312467d..3e0f904 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
>    {4, 4, 4},                /* cost of loading integer registers
>                                 in QImode, HImode and SImode.
>                                 Relative to reg-reg move (2).  */
> -  {6, 6, 6},                /* cost of storing integer registers */
> +  {6, 6, 4},                /* cost of storing integer registers.  */
>    2,                        /* cost of reg,reg fld/fst */
>    {6, 6, 8},                /* cost of loading fp registers
>                                 in SFmode, DFmode and XFmode */
> @@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
>    {6, 6, 6, 10, 20},        /* cost of loading SSE registers
>                                 in 32,64,128,256 and 512-bit */
>    {6, 6, 6, 10, 20},        /* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},         /* cost of storing SSE registers
> +  {8, 8, 8, 16, 32},        /* cost of storing SSE registers
>                                 in 32,64,128,256 and 512-bit */
>    {8, 8, 8, 8, 16},         /* cost of unaligned stores.  */
>    2, 2,                     /* SSE->integer and integer->SSE moves */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 0ca42b4..7e63a1c 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
>        enum vect_def_type dt;
>        if (!op || op == lhs)
>          continue;
> -      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
> +      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
> +          && (dt == vect_constant_def || dt == vect_external_def))
>          {
>            /* Without looking at the actual initializer a vector of
>               constants can be implemented as load from the constant pool.
> -             ???  We need to pass down stmt_info for a vector type
> -             even if it points to the wrong stmt.  */
> -          if (dt == vect_constant_def)
> -            record_stmt_cost (prologue_cost_vec, 1, vector_load,
> -                              stmt_info, 0, vect_prologue);
> -          else if (dt == vect_external_def)
> -            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
> -                              stmt_info, 0, vect_prologue);
> +             When all elements are the same we can use a splat.  */
> +          unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> +          tree elt = NULL_TREE;
> +          unsigned nelt = 0;
> +          for (unsigned j = 0; j < ncopies_for_cost; ++j)
> +            for (unsigned k = 0; k < group_size; ++k)
> +              {
> +                if (nelt == 0)
> +                  elt =
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #23 from Richard Biener ---
Created attachment 43084
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43084&action=edit
SLP costing for constants/externs improvement
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #21 from sergey.shalnov at intel dot com ---
Thanks Richard for your comments.
Based on our discussion I've produced the patch attached and run it on
SPEC2017intrate/fprate on a skylake server (with [-Ofast -flto
-march=skylake-avx512 -mfpmath=sse -funroll-loops]).
Please note, I used your last proposed patch and changed the loop trip count
calculation ("ncopies_for_cost * nunits / group_size" is always 1).

I see the following performance changes:
SPEC CPU 2017 intrate
500.perlbench_r  -0.7%
525.x264_r       +7.2%
Geomean:         +0.8%

SPEC CPU 2017 fprate
527.cam4_r       -1.1%
538.imagick_r    +4.7%
544.nab_r        +3.6%
Geomean:         +0.6%

I believe that after appropriate cost model tweaks for other targets a gain
could be observed, but I haven't checked it carefully.
It provides a good performance gain for the original case and a few other
improvements.

Can you please take a look at the patch and comment (or perhaps propose
another method)?

Sergey

From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
From: Sergey Shalnov
Date: Tue, 9 Jan 2018 14:37:14 +0100
Subject: [PATCH, SLP] SLP_common_algo_changes

---
 gcc/config/i386/x86-tune-costs.h |  4 ++--
 gcc/tree-vect-slp.c              | 41 ++--
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 312467d..3e0f904 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},                /* cost of loading integer registers
                                in QImode, HImode and SImode.
                                Relative to reg-reg move (2).  */
-  {6, 6, 6},                /* cost of storing integer registers */
+  {6, 6, 4},                /* cost of storing integer registers.  */
   2,                        /* cost of reg,reg fld/fst */
   {6, 6, 8},                /* cost of loading fp registers
                                in SFmode, DFmode and XFmode */
@@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},        /* cost of loading SSE registers
                                in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},        /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},         /* cost of storing SSE registers
+  {8, 8, 8, 16, 32},        /* cost of storing SSE registers
                                in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},         /* cost of unaligned stores.  */
   2, 2,                     /* SSE->integer and integer->SSE moves */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 0ca42b4..7e63a1c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
       enum vect_def_type dt;
       if (!op || op == lhs)
         continue;
-      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+          && (dt == vect_constant_def || dt == vect_external_def))
         {
           /* Without looking at the actual initializer a vector of
              constants can be implemented as load from the constant pool.
-             ???  We need to pass down stmt_info for a vector type
-             even if it points to the wrong stmt.  */
-          if (dt == vect_constant_def)
-            record_stmt_cost (prologue_cost_vec, 1, vector_load,
-                              stmt_info, 0, vect_prologue);
-          else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
-                              stmt_info, 0, vect_prologue);
+             When all elements are the same we can use a splat.  */
+          unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+          tree elt = NULL_TREE;
+          unsigned nelt = 0;
+          for (unsigned j = 0; j < ncopies_for_cost; ++j)
+            for (unsigned k = 0; k < group_size; ++k)
+              {
+                if (nelt == 0)
+                  elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+                /* ???  We're just tracking whether all operands of a single
+                   vector initializer are the same, ideally we'd check if
+                   we emitted the same one already.  */
+                else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+                                           i))
+                  elt = NULL_TREE;
+                nelt++;
+                if (nelt == group_size)
+                  {
+                    /* ???  We need to pass down stmt_info for a vector type
+                       even if it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #20 from sergey.shalnov at intel dot com ---
Richard,
I did a quick static analysis of your latest patch.
Using the command line "-g -Ofast -mfpmath=sse -funroll-loops -march=znver1",
your latest patch doesn't affect the issue I discussed, but it affects the
costs for the first loop. I thought the loop costs should be calculated
elsewhere (tree-vect-loop.c), but as I can see everything is interconnected.

The SLP block we discussed remains with the same statistics:

Vector inside of basic block cost: 64
Vector prologue cost: 32
Vector epilogue cost: 0
Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

The first loop was:

note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 48
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

and became:

note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 192
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 192
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

Sergey
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #19 from rguenther at suse dot de ---
On Sun, 24 Dec 2017, sergey.shalnov at intel dot com wrote:

> Yes, I agree that the vector_store stage has its own vectorization cost.
> And each vector_store has a vector_construction stage. These stages are
> different in gcc slp (as you know).
> To better illustrate my point of view I would like to propose a patch.
> I didn't submit the patch to the community yet because I think it is better
> to discuss it before that.
>
> index 0ca42b4..72e3123 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
>             record_stmt_cost (prologue_cost_vec, 1, vector_load,
>                               stmt_info, 0, vect_prologue);
>           else if (dt == vect_external_def)
> -           record_stmt_cost (prologue_cost_vec, 1, vec_construct,
> +           record_stmt_cost (prologue_cost_vec, ncopies_for_cost, vec_construct,
>                               stmt_info, 0, vect_prologue);
>         }
>     }

I'm not sure it affects the testcase in question but yes, there's imprecision
in this in that it assumes too much CSE happening. Using ncopies_for_cost
will over-estimate the cost though; it will match the code we generate
(CSE will clean up). As the comment says, this code should be re-engineered ;)
(it also misses making a splat cheap).

Sth like the following might work for most cases:

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 256070)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1815,18 +1815,41 @@ vect_analyze_slp_cost_1 (slp_instance in
       enum vect_def_type dt;
       if (!op || op == lhs)
         continue;
-      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+      if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+          && (dt == vect_constant_def || dt == vect_external_def))
         {
           /* Without looking at the actual initializer a vector of
              constants can be implemented as load from the constant pool.
-             ???  We need to pass down stmt_info for a vector type
-             even if it points to the wrong stmt.  */
-          if (dt == vect_constant_def)
-            record_stmt_cost (prologue_cost_vec, 1, vector_load,
-                              stmt_info, 0, vect_prologue);
-          else if (dt == vect_external_def)
-            record_stmt_cost (prologue_cost_vec, 1, vec_construct,
-                              stmt_info, 0, vect_prologue);
+             When all elements are the same we can use a splat.  */
+          tree vectype = get_vectype_for_scalar_type (TREE_TYPE (op));
+          unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
+          unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+          tree elt = NULL_TREE;
+          unsigned nelt = 0;
+          for (unsigned j = 0; j < ncopies_for_cost * nunits / group_size; ++j)
+            for (unsigned k = 0; k < group_size; ++k)
+              {
+                if (nelt == 0)
+                  elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+                /* ???  We're just tracking whether all operands of a single
+                   vector initializer are the same, ideally we'd check if
+                   we emitted the same one already.  */
+                else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+                                           i))
+                  elt = NULL_TREE;
+                nelt++;
+                if (nelt == group_size)
+                  {
+                    /* ???  We need to pass down stmt_info for a vector type
+                       even if it points to the wrong stmt.  */
+                    record_stmt_cost (prologue_cost_vec, 1,
+                                      dt == vect_external_def
+                                      ? (elt ? scalar_to_vec : vec_construct)
+                                      : vector_load,
+                                      stmt_info, 0, vect_prologue);
+                    nelt = 0;
+                  }
+              }
         }
     }
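[Editorial note] The core idea of the patch above (charge a cheap splat when all scalar operands of one vector initializer are identical, a full element-by-element construction otherwise) can be sketched standalone. This is an illustrative model, not GCC code; the function names and the "cost units" are assumptions:

```c
#include <assert.h>
#include <stdbool.h>

/* Are all scalar operands feeding one vector initializer identical?
   Models the elt/NULL_TREE tracking loop in the patch.  */
static bool all_elements_same (const int *ops, unsigned nelts)
{
  for (unsigned k = 1; k < nelts; ++k)
    if (ops[k] != ops[0])
      return false;
  return true;
}

/* Relative construction cost: 1 unit for a splat (scalar_to_vec),
   nelts units for a full vec_construct (one insert per element).  */
static unsigned construct_cost_units (const int *ops, unsigned nelts)
{
  return all_elements_same (ops, nelts) ? 1u : nelts;
}
```

So a vector built from eight copies of the same external scalar would be costed like a single broadcast rather than eight element inserts, which is the "misses to make a splat cheap" point from the comment.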
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #18 from sergey.shalnov at intel dot com --- Yes, I agree that the vector_store stage has its own vectorization cost, and each vector_store has a vector_construction stage. These stages are distinct in GCC's SLP (as you know). To better illustrate my point I would like to propose a patch. I haven't submitted it to the community yet because I think it is better to discuss it first.

index 0ca42b4..72e3123 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree node,
 	    record_stmt_cost (prologue_cost_vec, 1, vector_load,
 			      stmt_info, 0, vect_prologue);
 	  else if (dt == vect_external_def)
-	    record_stmt_cost (prologue_cost_vec, 1, vec_construct,
+	    record_stmt_cost (prologue_cost_vec, ncopies_for_cost, vec_construct,
 			      stmt_info, 0, vect_prologue);
 	}
     }
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #17 from rguenther at suse dot de --- On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:
> «it's one vec_construct operation - it's the task of the target to turn this
> into a cost comparable to vector_store»
> I agree that the vec_construct operation cost is based on the target cost model.
>
> I think I don't completely understand why we have 1 vec_construct for 3
> vector_store.

Each vector store is its own vectorization and thus has its own cost model summary.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #16 from sergey.shalnov at intel dot com --- «it's one vec_construct operation - it's the task of the target to turn this into a cost comparable to vector_store»
I agree that the vec_construct operation cost is based on the target cost model.

I don't completely understand, though, why we have 1 vec_construct for 3 vector_store operations; I would expect a vec_construct for each vector_store. Looking at the asm, I see that each "vmovaps %xmm, (%rsp)" has its own construction sequence (below I extracted only one of the 3 vector_stores generated):

vec_construct:
 2d3: c5 79 6e dd             vmovd  %ebp,%xmm11
 310: c4 41 79 6e d0          vmovd  %r8d,%xmm10
 2e0: c4 63 21 22 e2 01       vpinsrd $0x1,%edx,%xmm11,%xmm12
 31a: c4 43 29 22 e9 01       vpinsrd $0x1,%r9d,%xmm10,%xmm13
 329: c4 41 11 6c f4          vpunpcklqdq %xmm12,%xmm13,%xmm14

vector_store:
 340: c5 78 29 74 24 d8       vmovaps %xmm14,-0x28(%rsp)
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #15 from rguenther at suse dot de --- On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:
> " we have a basic-block vectorizer. Do you propose to remove it? "
> Definitely not! The SLP vectorizer is very good to have!
>
> "What's the rationale for not using vector registers"
> I just tried the "-fno-tree-slp-vectorize" option and found a performance
> gain for different -march= options.
>
> I see some misunderstanding here. Let me clarify the original question with
> -march=znver1. I use the "-Ofast -mfpmath=sse -funroll-loops -march=znver1"
> option set for my experiments.
>
> For the basic block we are discussing we have (in vect_analyze_slp_cost ()
> in tree-vect-slp.c:1897):
>
> tmp[i_220][0] = _150;
> tmp[i_220][2] = _147;
> tmp[i_220][1] = _144;
> tmp[i_220][3] = _141;
>
> tmp[i_139][0] = _447;
> tmp[i_139][2] = _450;
> tmp[i_139][1] = _453;
> tmp[i_139][3] = _456;
>
> tmp[i_458][0] = _54;
> tmp[i_458][2] = _56;
> tmp[i_458][1] = _58;
> tmp[i_458][3] = _60;
>
> this is si->stmt printed in the loop with the "vect_prologue" calculation.
>
> I see SLP statistics related to this BB:
> note: Cost model analysis:
> Vector inside of basic block cost: 64
> Vector prologue cost: 32
> Vector epilogue cost: 0
> Scalar cost of basic block: 256
> note: Basic block will be vectorized using SLP

So it looks like both vector and scalar stores are costed 64 by the target and the vector construction is costed 32. This is likely because

  COSTS_N_INSNS (1),	/* cost of cheap SSE instruction.  */

makes a vector construction of 4 elements cost 4, while

  {8, 8, 8},		/* cost of storing integer registers.  */
  ...
  {8, 8, 8, 8, 16},	/* cost of storing SSE registers
			   in 32,64,128,256 and 512-bit.  */

That cheap SSE instruction cost is especially odd when looking at

  2, 3, 6,		/* cost of moving XMM,YMM,ZMM register.  */

or

  COSTS_N_INSNS (3),	/* cost of ADDSS/SD SUBSS/SD insns.  */

so if a movq %xmm0, %xmm1 costs 2 I can't imagine anything being cheaper than that. Honza?

> Please correct me if I am wrong, but I think we should have count=3 in
> prologue_cost_vec.

No, it's one vec_construct operation - it's the task of the target to turn this into a cost comparable to vector_store and scalar_store in this case.

> And this could slightly change the "Vector prologue cost" and might
> influence the vectorizer's decision.
>
> Sergey
> PS
> Richard, I didn't catch your idea in the "but DOM isn't powerful enough"
> sentence. Could you please clarify it?

The BB vectorization prevents eliding 'tmp' from memory to registers; the CSE pass after vectorization would be responsible for this, but the memory references look too "complicated" to the implementation used.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #14 from sergey.shalnov at intel dot com --- " we have a basic-block vectorizer. Do you propose to remove it? "
Definitely not! The SLP vectorizer is very good to have!

"What's the rationale for not using vector registers"
I just tried the "-fno-tree-slp-vectorize" option and found a performance gain for different -march= options.

I see some misunderstanding here. Let me clarify the original question with -march=znver1. I use the "-Ofast -mfpmath=sse -funroll-loops -march=znver1" option set for my experiments.

For the basic block we are discussing we have (in vect_analyze_slp_cost () in tree-vect-slp.c:1897):

tmp[i_220][0] = _150;
tmp[i_220][2] = _147;
tmp[i_220][1] = _144;
tmp[i_220][3] = _141;

tmp[i_139][0] = _447;
tmp[i_139][2] = _450;
tmp[i_139][1] = _453;
tmp[i_139][3] = _456;

tmp[i_458][0] = _54;
tmp[i_458][2] = _56;
tmp[i_458][1] = _58;
tmp[i_458][3] = _60;

this is si->stmt printed in the loop with the "vect_prologue" calculation.

I see SLP statistics related to this BB:
note: Cost model analysis:
Vector inside of basic block cost: 64
Vector prologue cost: 32
Vector epilogue cost: 0
Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

I see 12 statements that are computed as 3 vector instructions with 4 elements each (4*int -> xmm):
group_size = 12
ncopies_for_cost = 3
nunits = 4

But I see "count" is 1 in the cost vector related to the prologue:
prologue_cost_vec = {m_vec = 0x3fc6e70 = {{count = 1, kind = vec_construct, stmt = , misalign = 0}}}
body_cost_vec = {m_vec = 0x3fc6f70 = {{count = 3, kind = vector_store, stmt = , misalign = 0}}}

Please correct me if I am wrong, but I think we should have count=3 in prologue_cost_vec. This could slightly change the "Vector prologue cost" and might influence the vectorizer's decision.

Sergey
PS
Richard, I didn't catch your idea in the "but DOM isn't powerful enough" sentence. Could you please clarify it? Thank you.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #13 from rguenther at suse dot de --- On Fri, 8 Dec 2017, sergey.shalnov at intel dot com wrote:
> And it uses xmm + vpbroadcastd to spill tmp[] to the stack
> ...
> 1e7: 62 d2 7d 08 7c c9       vpbroadcastd %r9d,%xmm1
> 1ed: c4 c1 79 7e c9          vmovd  %xmm1,%r9d

^^^ this is an odd instruction ...

> 1f2: 62 f1 fd 08 7f 8c 24    vmovdqa64 %xmm1,-0x38(%rsp)
> 1f9: c8 ff ff ff
> 1fd: 62 f2 7d 08 7c d7       vpbroadcastd %edi,%xmm2
> 203: c5 f9 7e d7             vmovd  %xmm2,%edi
> 207: 62 f1 fd 08 7f 94 24    vmovdqa64 %xmm2,-0x28(%rsp)
> 20e: d8 ff ff ff
> 212: 62 f2 7d 08 7c db       vpbroadcastd %ebx,%xmm3
> 218: c5 f9 7e de             vmovd  %xmm3,%esi
> 21c: 62 f1 fd 08 7f 9c 24    vmovdqa64 %xmm3,-0x18(%rsp)
> 223: e8 ff ff ff
> 227: 01 fe                   add    %edi,%esi
> 229: 45 01 c8                add    %r9d,%r8d
> 22c: 41 01 f0                add    %esi,%r8d
> 22f: 8b 5c 24 dc             mov    -0x24(%rsp),%ebx
> 233: 03 5c 24 ec             add    -0x14(%rsp),%ebx
> 237: 8b 6c 24 bc             mov    -0x44(%rsp),%ebp
> 23b: 03 6c 24 cc             add    -0x34(%rsp),%ebp
> ...
>
> I think this is better from a performance perspective but, as I said
> before, not using vector registers here is the best option if no loops
> are vectorized.

As I said, we have a basic-block vectorizer. Do you propose to remove it? What's the rationale for "not using vector registers ... if no loops [are] vectorized"? With AVX256/512 an additional cost when using vector registers is the vzeroupper required at function boundaries. Is this (one of) the reasons? If it is the case that the instruction sequence

  vpbroadcastd %ebx,%xmm3
  vmovdqa64 %xmm3,-0x18(%rsp)

is slower than doing

  vmovd %ebx,-0x18(%rsp)
  vmovd %ebx,-0x22(%rsp)
  vmovd %ebx,-0x26(%rsp)
  vmovd %ebx,-0x30(%rsp)

then the costing in the backend needs to reflect that.
I see that the vectorization prevents eliding 'tmp' on GIMPLE, but that's a completely separate issue (the value-numbering we run after vectorization is poor):

vect_cst__108 = {_48, _48, _48, _48};
vect_cst__106 = {_349, _349, _349, _349};
vect_cst__251 = {_97, _97, _97, _97};
MEM[(unsigned int *) + 16B] = vect_cst__251;
MEM[(unsigned int *) + 32B] = vect_cst__106;
MEM[(unsigned int *) + 48B] = vect_cst__108;
_292 = tmp[0][0];
_291 = tmp[1][0];
...

those loads could have been elided, but DOM isn't powerful enough to see that (and never will be).
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #12 from sergey.shalnov at intel dot com --- Richard, Your last proposal changed the generated code a bit. Currently it shows:

test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 62576
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 62576 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be profitable.

And it uses xmm + vpbroadcastd to spill tmp[] to the stack:

...
1e7: 62 d2 7d 08 7c c9       vpbroadcastd %r9d,%xmm1
1ed: c4 c1 79 7e c9          vmovd  %xmm1,%r9d
1f2: 62 f1 fd 08 7f 8c 24    vmovdqa64 %xmm1,-0x38(%rsp)
1f9: c8 ff ff ff
1fd: 62 f2 7d 08 7c d7       vpbroadcastd %edi,%xmm2
203: c5 f9 7e d7             vmovd  %xmm2,%edi
207: 62 f1 fd 08 7f 94 24    vmovdqa64 %xmm2,-0x28(%rsp)
20e: d8 ff ff ff
212: 62 f2 7d 08 7c db       vpbroadcastd %ebx,%xmm3
218: c5 f9 7e de             vmovd  %xmm3,%esi
21c: 62 f1 fd 08 7f 9c 24    vmovdqa64 %xmm3,-0x18(%rsp)
223: e8 ff ff ff
227: 01 fe                   add    %edi,%esi
229: 45 01 c8                add    %r9d,%r8d
22c: 41 01 f0                add    %esi,%r8d
22f: 8b 5c 24 dc             mov    -0x24(%rsp),%ebx
233: 03 5c 24 ec             add    -0x14(%rsp),%ebx
237: 8b 6c 24 bc             mov    -0x44(%rsp),%ebp
23b: 03 6c 24 cc             add    -0x34(%rsp),%ebp
...

I think this is better from a performance perspective but, as I said before, not using vector registers here is the best option if no loops are vectorized. In case of a static loop increment (the first test case), the first loop is vectorized as before.

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #11 from sergey.shalnov at intel dot com --- Richard,
"Is this about the "stupid" attempt to use as little AVX512 as possible"
No, it is not. I provided the asm listing at the beginning with zmm only to illustrate the issue more clearly; I used the "-mprefer-vector-width=512" option to get that asm output. I believe that if we remove these registers (any of them, including ymm and xmm) from the BB you mentioned, it will be profitable for all architectures.
"Sergey, can you test this?"
Yes, going to try this.
Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #10 from Richard Biener --- Just to note, this is _basic-block vectorization_ triggering. Of course we do vectorize basic blocks even when we do not vectorize any loop.

Is this about the "stupid" attempt to use as little AVX512 as possible because the CPU has to down-clock? We could add a hint to the preferred-vector-size target hook whether we are asking for BB or loop vectorization, and the target could decide not to do AVX512 for BB vectorization. Of course we also happily vectorize very short-running loops with AVX512. I think the compiler will have a _very_ hard job working around this CPU design limitation...

So - I do have a way to "fix" it for this case. What SLP detection does for the piecewise vector construction is to push that down as far as possible, even through binary operations that could otherwise be vectorized (the idea was to minimize the number of such build-ups). We could limit that to unary operations. Then we get a much larger SLP tree and

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 152
  Vector prologue cost: 512
  Vector epilogue cost: 0
  Scalar cost of basic block: 544
t.c:32:12: note: not vectorized: vectorization is not profitable.

but of course the real issue is that the target claims that replacing N scalar stores with a vector store is profitable. Hmm. Honza did

    case vec_construct:
      return ix86_vec_cost (mode, ix86_cost->sse_op, false);

but it was

    case vec_construct:
      return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);

before. That seems like a bogus change. I guess it should really be

    case vec_construct:
      return (ix86_vec_cost (mode, ix86_cost->sse_op, false)
	      * (TYPE_VECTOR_SUBPARTS (vectype) - 1));

Honza - what was the motivation for this change? Sergey, can you test this? For me it makes the thing vectorized using SSE, which means it can use vpbroadcastd.
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 255499)
+++ gcc/config/i386/i386.c (working copy)
@@ -44879,7 +44879,8 @@ ix86_builtin_vectorization_cost (enum ve
 			      ix86_cost->sse_op, true);
 
     case vec_construct:
-      return ix86_vec_cost (mode, ix86_cost->sse_op, false);
+      return (ix86_vec_cost (mode, ix86_cost->sse_op, false)
+	      * (TYPE_VECTOR_SUBPARTS (vectype) - 1));
 
     default:
       gcc_unreachable ();
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #9 from sergey.shalnov at intel dot com --- Created attachment 42813
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42813&action=edit
New reproducer

Slightly changed first loop
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #8 from sergey.shalnov at intel dot com --- Richard, These are great changes, and I see the first loop become vectorized for the test example I provided with the gcc-8.0 main trunk. But I think the issue is a bit more complicated: vectorization of the first loop just hides the issue I reported. Currently I see the following for the test loop "for (int i = 0; i < 4; i++, input1 += 4, input2 += 4)":

test_bugzilla.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 1136
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0
test_bugzilla.c:6:5: note: Runtime profitability threshold = 4
test_bugzilla.c:6:5: note: Static estimate profitability threshold = 4
test_bugzilla.c:6:5: note: loop vectorized

If I slightly change the loop (to be closer to the real application) to "for (int i = 0; i < 4; i++, input1 += stride1, input2 += stride2)":

test_bugzilla1.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 5232
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 5232 divided by the scalar iteration cost = 328 is greater or equal to the vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be profitable.

And the issue with extra vector operations remains the same. I'm not sure, but I think it is really profitable to avoid vector register usage if the loop is not vectorized. Do you agree?

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #7 from Richard Biener --- Note the first loop is now vectorized fine, thus the strange code is gone. -> fixed? (probably by the fix for PR83202)
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #6 from sergey.shalnov at intel dot com --- I found the bug report related to the vectorization issues in the second loop (reduction uint->int): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #5 from sergey.shalnov at intel dot com --- (In reply to Richard Biener from comment #2)
> The strange code is because we perform basic-block vectorization resulting in
>
> vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417,
> _417, _417, _417, _48, _48, _48, _48};
> MEM[(unsigned int *)] = vect_cst__249;
> _186 = tmp[0][0];
> _185 = tmp[1][0];
> ...
>
> which for some reason is deemed profitable:
>
> t.c:32:12: note: Cost model analysis:
>   Vector inside of basic block cost: 24
>   Vector prologue cost: 64
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 192
> t.c:32:12: note: Basic block will be vectorized using SLP
>
> what is odd is that the single vector store is costed 24 while the 16 scalar
> int stores are costed 192. The vector build from scalars costs 64.
>
> I guess Honza's cost-model tweaks might have gone wrong here or we're hitting
> an oddity in the SLP costing.
>
> Even if it looks strange maybe the sequence _is_ profitable?
>
> The second loop would be vectorized if 'sum' was unsigned.

Richard, No, the sequence is not profitable. If we don't use any vector registers here, the performance will be better for all architectures. I'm talking about vectorized code only here. I'm trying to look into the vect_stmt_relevant_p() function to implement additional limitations and avoid basic-block vectorization if the loop is not vectorized. If you have any idea how to avoid vectorization in this particular place, please let me know.

Sergey
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #4 from rguenther at suse dot de --- On Sun, 19 Nov 2017, hubicka at gcc dot gnu.org wrote:
> --- Comment #3 from Jan Hubicka ---
> I now get for the first loop:
>
> > a.c:6:5: note: Cost model analysis:
> > Vector inside of loop cost: 1120
> > Vector prologue cost: 0
> > Vector epilogue cost: 0
> > Scalar iteration cost: 328
> > Scalar outside cost: 0
> > Vector outside cost: 0
> > prologue iterations: 0
> > epilogue iterations: 0
> > Calculated minimum iters for profitability: 0
> > a.c:6:5: note: Runtime profitability threshold = 4
> > a.c:6:5: note: Static estimate profitability threshold = 4
>
> For the second loop I get:
>
> > a.c:20:5: note: type of def: internal
> > a.c:20:5: note: mark relevant 1, live 0: sum.0_60 = (unsigned int) sum_102;
> > a.c:20:5: note: worklist: examine stmt: sum.0_60 = (unsigned int) sum_102;
> > a.c:20:5: note: vect_is_simple_use: operand sum_102
> > a.c:20:5: note: def_stmt: sum_102 = PHI <0(6), sum_75(7)>
> > a.c:20:5: note: type of def: unknown
> > a.c:20:5: note: Unsupported pattern.
> > a.c:20:5: note: not vectorized: unsupported use in stmt.
> > a.c:20:5: note: unexpected pattern.
>
> Is it really that hard to sum values in a vector?

The issue is the vectorizer doesn't currently handle conversions:

t.c:20:5: note: Analyze phi: sum_102 = PHI <0(6), sum_75(7)>
t.c:20:5: note: reduction: not commutative/associative: sum_75 = (int) _61;

I have patches to fix that though. Sitting somewhere... see PR65930.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 Jan Hubicka changed:

           What           |Removed     |Added
  Status                  |UNCONFIRMED |NEW
  Last reconfirmed        |            |2017-11-19
  Ever confirmed          |0           |1

--- Comment #3 from Jan Hubicka --- I now get for the first loop:

a.c:6:5: note: Cost model analysis:
  Vector inside of loop cost: 1120
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0
a.c:6:5: note: Runtime profitability threshold = 4
a.c:6:5: note: Static estimate profitability threshold = 4

For the second loop I get:

a.c:20:5: note: type of def: internal
a.c:20:5: note: mark relevant 1, live 0: sum.0_60 = (unsigned int) sum_102;
a.c:20:5: note: worklist: examine stmt: sum.0_60 = (unsigned int) sum_102;
a.c:20:5: note: vect_is_simple_use: operand sum_102
a.c:20:5: note: def_stmt: sum_102 = PHI <0(6), sum_75(7)>
a.c:20:5: note: type of def: unknown
a.c:20:5: note: Unsupported pattern.
a.c:20:5: note: not vectorized: unsupported use in stmt.
a.c:20:5: note: unexpected pattern.

Is it really that hard to sum values in a vector? The code does not use AVX512 regs at all.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 Richard Biener changed:

           What           |Removed     |Added
  Keywords                |            |missed-optimization
  Target                  |            |x86_64-*-* i?86-*-*
  CC                      |            |hubicka at gcc dot gnu.org,
                          |            |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener --- The strange code is because we perform basic-block vectorization resulting in

vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417, _417, _417, _417, _48, _48, _48, _48};
MEM[(unsigned int *)] = vect_cst__249;
_186 = tmp[0][0];
_185 = tmp[1][0];
...

which for some reason is deemed profitable:

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 24
  Vector prologue cost: 64
  Vector epilogue cost: 0
  Scalar cost of basic block: 192
t.c:32:12: note: Basic block will be vectorized using SLP

what is odd is that the single vector store is costed 24 while the 16 scalar int stores are costed 192. The vector build from scalars costs 64.

I guess Honza's cost-model tweaks might have gone wrong here or we're hitting an oddity in the SLP costing.

Even if it looks strange, maybe the sequence _is_ profitable?

The second loop would be vectorized if 'sum' was unsigned.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #1 from sergey.shalnov at intel dot com --- Created attachment 42616
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42616&action=edit
reproducer