[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-07-30 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

janus at gcc dot gnu.org changed:

   What|Removed |Added

 CC||janus at gcc dot gnu.org

--- Comment #39 from janus at gcc dot gnu.org ---
(In reply to uros from comment #37)
> Author: uros
> Date: Thu Feb  8 22:31:15 2018
> New Revision: 257505
> 
> URL: https://gcc.gnu.org/viewcvs?rev=257505&root=gcc&view=rev
> Log:
>   PR target/83008
>   * config/i386/x86-tune-costs.h (skylake_cost): Fix cost of
>   storing integer register in SImode.  Fix cost of 256 and 512
>   byte aligned SSE register store.
> 
>   * config/i386/i386.c (ix86_multiplication_cost): Fix
>   multiplication cost for TARGET_AVX512DQ.

This caused PR 86735.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-09 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

sergey.shalnov at intel dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #38 from sergey.shalnov at intel dot com ---
Implemented

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-08 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #37 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu Feb  8 22:31:15 2018
New Revision: 257505

URL: https://gcc.gnu.org/viewcvs?rev=257505&root=gcc&view=rev
Log:
PR target/83008
* config/i386/x86-tune-costs.h (skylake_cost): Fix cost of
storing integer register in SImode.  Fix cost of 256 and 512
byte aligned SSE register store.

* config/i386/i386.c (ix86_multiplication_cost): Fix
multiplication cost for TARGET_AVX512DQ.

testsuite/ChangeLog:

PR target/83008
* gcc.target/i386/pr83008.c: New test.


Added:
trunk/gcc/testsuite/gcc.target/i386/pr83008.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/x86-tune-costs.h
trunk/gcc/testsuite/ChangeLog

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #36 from sergey.shalnov at intel dot com ---
The patch that fixes the issue for SKX is in
https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html

I will close the PR after the patch has been merged.

Thank you very much to all involved.
Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-07 Thread clyon at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #35 from Christophe Lyon  ---
Author: clyon
Date: Wed Feb  7 09:12:48 2018
New Revision: 257438

URL: https://gcc.gnu.org/viewcvs?rev=257438&root=gcc&view=rev
Log:
[testsuite] Fix gcc.dg/cse_recip.c for AArch64 after r257181.

2018-02-07  Christophe Lyon 

PR tree-optimization/83008
* gcc.dg/cse_recip.c: Add -fno-tree-slp-vectorize.



Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/cse_recip.c

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #34 from Richard Biener  ---
Author: rguenth
Date: Tue Jan 30 11:19:47 2018
New Revision: 257181

URL: https://gcc.gnu.org/viewcvs?rev=257181&root=gcc&view=rev
Log:
2018-01-30  Richard Biener  

PR tree-optimization/83008
* tree-vect-slp.c (vect_analyze_slp_cost_1): Properly cost
invariant and constant vector uses in stmts when they need
more than one stmt.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-slp.c

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-29 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #33 from sergey.shalnov at intel dot com ---
Richard,
I'm not sure whether it is a regression or not. I can see the code was visibly
refactored in this commit
https://github.com/gcc-mirror/gcc/commit/ee6e9ba576099aed29f1097195c649fc796ecf5e
back in 2013.

However, this fix is highly important for our workloads and really desirable
for GCC 8.
For example, your patch plus the SKX cost model changes give us ~+1% geomean
on SPEC2017 intrate.

It would be really good for us to get this patch into GCC 8.

The planned cost model changes are as follows (they will be proposed
separately):

diff --git a/gcc/config/i386/x86-tune-costs.h
b/gcc/config/i386/x86-tune-costs.h
index e943d13..d5e6ef6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1557,7 +1557,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 6},   /* cost of storing integer registers */
+  {6, 6, 3},   /* cost of storing integer registers */
   2,   /* cost of reg,reg fld/fst */
   {6, 6, 8},   /* cost of loading fp registers
   in SFmode, DFmode and XFmode */
@@ -1572,7 +1572,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},   /* cost of loading SSE registers
   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},   /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},/* cost of storing SSE registers
+  {8, 8, 8, 14, 24},   /* cost of storing SSE registers
   in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},/* cost of unaligned stores.  */
   2, 2,/* SSE->integer and
integer->SSE moves */

I know that you have some concerns about the costs above; I'm saving that for
the next discussion, because your patch is a good foundation to proceed with
cost tuning.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-29 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #32 from rguenther at suse dot de  ---
On Fri, 26 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #31 from sergey.shalnov at intel dot com ---
> Richard,
> Thank you for your latest patch. This patch is exactly what 
> I discussed in this issue report.
> I tested it with SPEC20[06|17] and see no performance/stability degradation.
> 
> Could you please merge your latest patch into main trunk?

Is this bug a regression?  If not I think the change has to wait for
GCC 9.

> I will provide Intel specific changes (in gcc/config/i386) that 
> will leverage your patch to get the performance better 
> for the provided testcase (step N2)

Can you attach those?

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-26 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #31 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. This patch is exactly what 
I discussed in this issue report.
I tested it with SPEC20[06|17] and see no performance/stability degradation.

Could you please merge your latest patch into main trunk?

I will provide Intel-specific changes (in gcc/config/i386) that 
will leverage your patch to improve performance further 
for the provided testcase (step N2).

Thank you
Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-25 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

Richard Biener  changed:

   What|Removed |Added

  Attachment #43084|0   |1
is obsolete||

--- Comment #30 from Richard Biener  ---
Created attachment 43238
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43238&action=edit
updated patch for SLP costing

You are right, the patch contained several errors.  The multiple_p check is
supposed to handle the case where we know all vectors will be equal, like when
group_size is two and const_nunits is 4.  The arguments were swapped.  Also the
loop over the elements was bogus.  I've corrected this with the attached
updated patch.

This now costs two vector constructions for the testcase as expected but still:

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 32
  Vector prologue cost: 64
  Vector epilogue cost: 0
  Scalar cost of basic block: 192
t.c:32:12: note: Basic block will be vectorized using SLP

that's for two aligned stores and two 8-element vector constructions.  We're
offsetting 16 scalar stores after all...  They each seem to cost 12, while
an aligned vector store costs 16.  And the vector constructions cost 32 each
(8 times an SSE op costing 4, aka "element insert").

The only thing I notice is that

40240   ix86_vec_cost (machine_mode mode, int cost, bool parallel)
40241   {
40242 if (!VECTOR_MODE_P (mode))
40243   return cost;
40244
40245 if (!parallel)
40246   return cost * GET_MODE_NUNITS (mode);
40247 if (GET_MODE_BITSIZE (mode) == 128
40248 && TARGET_SSE_SPLIT_REGS)
40249   return cost * 2;
(gdb) 
40250 if (GET_MODE_BITSIZE (mode) > 128
40251 && TARGET_AVX128_OPTIMAL)
40252   return cost * GET_MODE_BITSIZE (mode) / 128;
40253 return cost;
40254   }

all the pessimizing for TARGET_SSE_SPLIT_REGS/TARGET_AVX128_OPTIMAL isn't
applied to the !parallel case.  But it wouldn't apply to AVX512 AFAICS.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-19 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #29 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your latest patch. I would like to clarify 
the usage of the multiple_p() function in the if() clause.

First of all, I assume that architectures with a fixed HW register size (x86)
should take the if(){} branch, while architectures with an unknown register
size should take the else{} branch, because the is_constant() function is used.

The problem is in the multiple_p() function. In the test case 
we have group_size=16 and const_nunits=8, so multiple_p(16, 8) returns 1,
and with -march=skylake-avx512 (or -march=znver1) we take the else{} 
branch, which is not correct.

I used the following change in your patch and the test works as I expect.
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1923,7 +1923,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree
node,
  unsigned HOST_WIDE_INT const_nunits;
  unsigned nelt_limit;
  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
- && ! multiple_p (group_size, const_nunits))
+ && multiple_p (group_size, const_nunits))
{
  num_vects_to_check = ncopies_for_cost * const_nunits /
group_size;
  nelt_limit = const_nunits;
-- 

Another question is what we should do if group_size is, for example, 17. 
In that case (after my changes) the wrong else{} branch will be taken.

I would propose to change this particular if() in the following way:

  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
{
if (multiple_p (group_size, const_nunits))
  {
num_vects_to_check = ncopies_for_cost * const_nunits /
group_size;
nelt_limit = const_nunits;
  }
else
  {
...not clear what we should have here...
  }
}
  else
{
  num_vects_to_check = 1;
  nelt_limit = group_size;
}

Or perhaps multiple_p() should not be here at all?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-17 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #28 from sergey.shalnov at intel dot com ---
Richard,
Thank you for your comments. 
I see that TYPE_VECTOR_SUBPARTS is constant for the test case, but
multiple_p (group_size, const_nunits) returns 1 in the code:
  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits)
  && ! multiple_p (group_size, const_nunits))
{
  num_vects_to_check = ncopies_for_cost * const_nunits /
group_size;
  nelt_limit = const_nunits;
}
  else
{
  num_vects_to_check = 1;
  nelt_limit = group_size;
}
This is because group_size = 16 and const_nunits = 8 in the test case with
-march=skylake-avx512.
So control flow goes to "num_vects_to_check = 1".
In any case, ncopies_for_cost = 2, so the expression "ncopies_for_cost *
const_nunits / group_size" evaluates to 1.

Do you think there is any way to formulate the condition so that we get
num_vects_to_check = 2?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #27 from rguenther at suse dot de  ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #26 from sergey.shalnov at intel dot com ---
> Sorry, did you mean "arm_sve.h" on ARM?
> In that case we have machine-specific code in a common part of the GCC code.
> Should we make it a machine-dependent callback function, given that having
> "num_vects_to_check = 1" is incorrect for SSE?

TYPE_VECTOR_SUBPARTS is never non-constant for SSE.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #26 from sergey.shalnov at intel dot com ---
Sorry, did you mean "arm_sve.h" on ARM?
In that case we have machine-specific code in a common part of the GCC code.
Should we make it a machine-dependent callback function, given that having
"num_vects_to_check = 1" is incorrect for SSE?

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #25 from rguenther at suse dot de  ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #24 from sergey.shalnov at intel dot com ---
> Richard,
> The latest "SLP costing for constants/externs improvement" patch generates the
> same code as baseline for the test example.
> 
> Are you sure that "num_vects_to_check" should be 1 if the vector length is
> not constant? I would expect "num_vects_to_check = ncopies_for_cost;" here:
> 
> 1863  else
> 1864{
> 1865  num_vects_to_check = 1;
> 1866  nelt_limit = group_size;
> 1867}

Yes, that's correct for variable-length vectors (SVE).

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #24 from sergey.shalnov at intel dot com ---
Richard,
The latest "SLP costing for constants/externs improvement" patch generates the
same code as baseline for the test example.

Are you sure that "num_vects_to_check" should be 1 if the vector length is not
constant? I would expect "num_vects_to_check = ncopies_for_cost;" here:

1863  else
1864{
1865  num_vects_to_check = 1;
1866  nelt_limit = group_size;
1867}

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #22 from rguenther at suse dot de  ---
On Wed, 10 Jan 2018, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #21 from sergey.shalnov at intel dot com ---
> Thanks Richard for your comments. 
> Based on our discussion I've produced the patch attached and
> run it on SPEC2017intrate/fprate on skylake server (with [-Ofast -flto
> -march=skylake-avx512 -mfpmath=sse -funroll-loops]).
> Please note, I used your last proposed patch and changed loop trip count
> calculation ("ncopies_for_cost * nunits / group_size" is always 1).
> 
> I see the following performance changes:
> SPEC CPU 2017 intrate
> 500.perlbench_r -0.7%
> 525.x264_r +7.2%
> Geomean: +0.8%
> 
> SPEC CPU 2017 fprate
> 527.cam4_r -1.1%
> 538.imagick_r +4.7%
> 544.nab_r +3.6%
> Geomean: +0.6%
> 
> I believe that after appropriate cost model tweaks for other targets a gain
> could be observed but I haven't checked it carefully.
> It provides a good performance gain for the original case and a few other
> improvements.
> 
> Can you please take a look at the patch and comment (or might propose another
> method)?

It mixes several things, and the last one (the > to >= change in the cost
evaluation) is clearly wrong.  The skylake_cost changes look somewhat odd to
me.

I'll attach my current SLP costing adjustment patch (after the SVE
changes the old one didn't build anymore).

> Sergey
> 
> From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
> From: Sergey Shalnov 
> Date: Tue, 9 Jan 2018 14:37:14 +0100
> Subject: [PATCH, SLP] SLP_common_algo_changes
> 
> ---
>  gcc/config/i386/x86-tune-costs.h |  4 ++--
>  gcc/tree-vect-slp.c  | 41 
> ++--
>  2 files changed, 33 insertions(+), 12 deletions(-)
> 
> diff --git a/gcc/config/i386/x86-tune-costs.h
> b/gcc/config/i386/x86-tune-costs.h
> index 312467d..3e0f904 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
>{4, 4, 4},   /* cost of loading integer registers
>in QImode, HImode and SImode.
>Relative to reg-reg move (2).  */
> -  {6, 6, 6},   /* cost of storing integer registers 
> */
> +  {6, 6, 4},   /* cost of storing integer registers. 
> */
>2,   /* cost of reg,reg fld/fst */
>{6, 6, 8},   /* cost of loading fp registers
>in SFmode, DFmode and XFmode */
> @@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
>{6, 6, 6, 10, 20},   /* cost of loading SSE registers
>in 32,64,128,256 and 512-bit */
>{6, 6, 6, 10, 20},   /* cost of unaligned loads.  */
> -  {8, 8, 8, 8, 16},/* cost of storing SSE registers
> +  {8, 8, 8, 16, 32},   /* cost of storing SSE registers
>in 32,64,128,256 and 512-bit */
>{8, 8, 8, 8, 16},/* cost of unaligned stores.  */
>2, 2,/* SSE->integer and
> integer->SSE moves */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 0ca42b4..7e63a1c 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance,
> slp_tree node,
>enum vect_def_type dt;
>if (!op || op == lhs)
> continue;
> -  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
> +  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
> +&& (dt == vect_constant_def || dt == vect_external_def))
> {
>   /* Without looking at the actual initializer a vector of
>  constants can be implemented as load from the constant pool.
> -???  We need to pass down stmt_info for a vector type
> -even if it points to the wrong stmt.  */
> - if (dt == vect_constant_def)
> -   record_stmt_cost (prologue_cost_vec, 1, vector_load,
> - stmt_info, 0, vect_prologue);
> - else if (dt == vect_external_def)
> -   record_stmt_cost (prologue_cost_vec, 1, vec_construct,
> - stmt_info, 0, vect_prologue);
> +When all elements are the same we can use a splat.  */
> + unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> + tree elt = NULL_TREE;
> + unsigned nelt = 0;
> + for (unsigned j = 0; j < ncopies_for_cost; ++j)
> +   for (unsigned k = 0; k < group_size; ++k)
> + {
> +   if (nelt == 0)
> + elt = 

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #23 from Richard Biener  ---
Created attachment 43084
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43084&action=edit
SLP costing for constants/externs improvement

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #21 from sergey.shalnov at intel dot com ---
Thanks Richard for your comments. 
Based on our discussion I've produced the attached patch and
run it on SPEC2017 intrate/fprate on a Skylake server (with [-Ofast -flto
-march=skylake-avx512 -mfpmath=sse -funroll-loops]).
Please note that I used your last proposed patch and changed the loop trip
count calculation ("ncopies_for_cost * nunits / group_size" is always 1).

I see the following performance changes:
SPEC CPU 2017 intrate
500.perlbench_r -0.7%
525.x264_r +7.2%
Geomean: +0.8%

SPEC CPU 2017 fprate
527.cam4_r -1.1%
538.imagick_r +4.7%
544.nab_r +3.6%
Geomean: +0.6%

I believe that after appropriate cost model tweaks for other targets a gain
could be observed, but I haven't checked that carefully.
The patch provides a good performance gain for the original case and a few
other improvements.

Can you please take a look at the patch and comment (or perhaps propose
another method)?

Sergey

From 41e5094cbdce72d4cc5e04fc3d11c01c3c1adbb7 Mon Sep 17 00:00:00 2001
From: Sergey Shalnov 
Date: Tue, 9 Jan 2018 14:37:14 +0100
Subject: [PATCH, SLP] SLP_common_algo_changes

---
 gcc/config/i386/x86-tune-costs.h |  4 ++--
 gcc/tree-vect-slp.c  | 41 ++--
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/gcc/config/i386/x86-tune-costs.h
b/gcc/config/i386/x86-tune-costs.h
index 312467d..3e0f904 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1555,7 +1555,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
-  {6, 6, 6},   /* cost of storing integer registers */
+  {6, 6, 4},   /* cost of storing integer registers.  */
   2,   /* cost of reg,reg fld/fst */
   {6, 6, 8},   /* cost of loading fp registers
   in SFmode, DFmode and XFmode */
@@ -1570,7 +1570,7 @@ struct processor_costs skylake_cost = {
   {6, 6, 6, 10, 20},   /* cost of loading SSE registers
   in 32,64,128,256 and 512-bit */
   {6, 6, 6, 10, 20},   /* cost of unaligned loads.  */
-  {8, 8, 8, 8, 16},/* cost of storing SSE registers
+  {8, 8, 8, 16, 32},   /* cost of storing SSE registers
   in 32,64,128,256 and 512-bit */
   {8, 8, 8, 8, 16},/* cost of unaligned stores.  */
   2, 2,/* SSE->integer and
integer->SSE moves */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 0ca42b4..7e63a1c 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1815,18 +1815,39 @@ vect_analyze_slp_cost_1 (slp_instance instance,
slp_tree node,
   enum vect_def_type dt;
   if (!op || op == lhs)
continue;
-  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+&& (dt == vect_constant_def || dt == vect_external_def))
{
  /* Without looking at the actual initializer a vector of
 constants can be implemented as load from the constant pool.
-???  We need to pass down stmt_info for a vector type
-even if it points to the wrong stmt.  */
- if (dt == vect_constant_def)
-   record_stmt_cost (prologue_cost_vec, 1, vector_load,
- stmt_info, 0, vect_prologue);
- else if (dt == vect_external_def)
-   record_stmt_cost (prologue_cost_vec, 1, vec_construct,
- stmt_info, 0, vect_prologue);
+When all elements are the same we can use a splat.  */
+ unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+ tree elt = NULL_TREE;
+ unsigned nelt = 0;
+ for (unsigned j = 0; j < ncopies_for_cost; ++j)
+   for (unsigned k = 0; k < group_size; ++k)
+ {
+   if (nelt == 0)
+ elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+   /* ???  We're just tracking whether all operands of a single
+  vector initializer are the same, ideally we'd check if
+  we emitted the same one already.  */
+   else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+  i))
+ elt = NULL_TREE;
+   nelt++;
+   if (nelt == group_size)
+ {
+   /* ???  We need to pass down stmt_info for a vector type
+  even if it 

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-02 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #20 from sergey.shalnov at intel dot com ---
Richard,
I did a quick static analysis of your latest patch.
Using the command line "-g -Ofast -mfpmath=sse -funroll-loops -march=znver1",
your latest patch doesn't affect the issue I discussed, but it does affect the
costs for the first loop.
I thought the loop costs would be calculated elsewhere (tree-vect-loop.c),
but as I can see everything is interconnected.

The SLP block we discussed keeps the same statistics:
  Vector inside of basic block cost: 64
  Vector prologue cost: 32
  Vector epilogue cost: 0
  Scalar cost of basic block: 256
note: Basic block will be vectorized using SLP

First loop was:
note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 48
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 48
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

Became:
note: Cost model analysis:
  Vector inside of loop cost: 5392
  Vector prologue cost: 192
  Vector epilogue cost: 0
  Scalar iteration cost: 464
  Scalar outside cost: 0
  Vector outside cost: 192
  prologue iterations: 0
  epilogue iterations: 0
note: cost model: the vector iteration cost = 5392 divided by the scalar
iteration cost = 464 is greater or equal to the vectorization factor = 4.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-02 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #19 from rguenther at suse dot de  ---
On Sun, 24 Dec 2017, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #18 from sergey.shalnov at intel dot com ---
> Yes, I agree that the vector_store stage has its own vectorization cost,
> and each vector_store has a vector_construction stage. These stages are
> distinct in GCC SLP (as you know).
> To better illustrate my point of view I would like to propose a patch.
> I didn't submit the patch to the community yet because I think it is better
> to discuss it first.
> 
> index 0ca42b4..72e3123 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree
> node,
> record_stmt_cost (prologue_cost_vec, 1, vector_load,
>   stmt_info, 0, vect_prologue);
>   else if (dt == vect_external_def)
> -   record_stmt_cost (prologue_cost_vec, 1, vec_construct,
> +   record_stmt_cost (prologue_cost_vec, ncopies_for_cost,
> vec_construct,
>   stmt_info, 0, vect_prologue);
> }
>  }

I'm not sure it affects the testcase in question, but yes, there is
imprecision here in that it assumes too much CSE happening.
Using ncopies_for_cost will over-estimate the cost, though
it will match the code we generate (CSE will clean up).
As the comment says, this code should be re-engineered ;)  (it also
fails to make a splat cheap).  Sth like the following might work
for most cases:

Index: gcc/tree-vect-slp.c
===
--- gcc/tree-vect-slp.c (revision 256070)
+++ gcc/tree-vect-slp.c (working copy)
@@ -1815,18 +1815,41 @@ vect_analyze_slp_cost_1 (slp_instance in
   enum vect_def_type dt;
   if (!op || op == lhs)
continue;
-  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt))
+  if (vect_is_simple_use (op, stmt_info->vinfo, &def_stmt, &dt)
+ && (dt == vect_constant_def || dt == vect_external_def))
{
  /* Without looking at the actual initializer a vector of
 constants can be implemented as load from the constant pool.
-???  We need to pass down stmt_info for a vector type
-even if it points to the wrong stmt.  */
- if (dt == vect_constant_def)
-   record_stmt_cost (prologue_cost_vec, 1, vector_load,
- stmt_info, 0, vect_prologue);
- else if (dt == vect_external_def)
-   record_stmt_cost (prologue_cost_vec, 1, vec_construct,
- stmt_info, 0, vect_prologue);
+When all elements are the same we can use a splat.  */
+ tree vectype = get_vectype_for_scalar_type (TREE_TYPE (op));
+ unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
+ unsigned group_size = SLP_TREE_SCALAR_STMTS (node).length ();
+ tree elt = NULL_TREE;
+ unsigned nelt = 0;
+ for (unsigned j = 0; j < ncopies_for_cost * nunits / group_size; ++j)
+   for (unsigned k = 0; k < group_size; ++k)
+ {
+   if (nelt == 0)
+ elt = gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt], i);
+   /* ???  We're just tracking whether all operands of a single
+  vector initializer are the same, ideally we'd check if
+  we emitted the same one already.  */
+   else if (elt != gimple_op (SLP_TREE_SCALAR_STMTS (node)[nelt],
+  i))
+ elt = NULL_TREE;
+   nelt++;
+   if (nelt == group_size)
+ {
+   /* ???  We need to pass down stmt_info for a vector type
+  even if it points to the wrong stmt.  */
+   record_stmt_cost (prologue_cost_vec, 1,
+ dt == vect_external_def
+ ? (elt ? scalar_to_vec : vec_construct)
+ : vector_load,
+ stmt_info, 0, vect_prologue);
+   nelt = 0;
+ }
+ }
}
 }

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-24 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #18 from sergey.shalnov at intel dot com ---
Yes, I agree that the vector_store stage has its own vectorization cost.
And each vector_store has a vector_construction stage. These stages are different
in GCC SLP (as you know).
To better illustrate my point of view I would like to propose a patch.
I didn’t submit the patch to the community yet because I think it is better to
discuss it before that.

index 0ca42b4..72e3123 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1825,7 +1825,7 @@ vect_analyze_slp_cost_1 (slp_instance instance, slp_tree
node,
record_stmt_cost (prologue_cost_vec, 1, vector_load,
  stmt_info, 0, vect_prologue);
  else if (dt == vect_external_def)
-   record_stmt_cost (prologue_cost_vec, 1, vec_construct,
+   record_stmt_cost (prologue_cost_vec, ncopies_for_cost, vec_construct,
  stmt_info, 0, vect_prologue);
}
 }

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #17 from rguenther at suse dot de  ---
On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #16 from sergey.shalnov at intel dot com ---
> «it's one vec_construct operation - it's the task of the target to turn this
> into a cost comparable to vector_store»
> I agree that vec_construct operation cost is based on the target cost model.
> 
> I think I don't completely understand why we have 1 vec_construct for 3
> vector_stores.

Each vector store is its own vectorization and thus has its own
Cost Model summary.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #16 from sergey.shalnov at intel dot com ---
«it's one vec_construct operation - it's the task of the target to turn this
into a cost comparable to vector_store»
I agree that vec_construct operation cost is based on the target cost model.

I think I don't completely understand why we have 1 vec_construct for 3
vector_stores.
I would expect a vec_construct for each vector_store operation.
I was just looking at the asm and see that each "vmovaps %xmm, (%rsp)" has its
own construction procedure.

(In this asm I extracted only one vector_store from the set of 3 vector_stores
generated.)
 vec_construct:
 2d3:   c5 79 6e dd vmovd  %ebp,%xmm11
 310:   c4 41 79 6e d0  vmovd  %r8d,%xmm10
 2e0:   c4 63 21 22 e2 01   vpinsrd $0x1,%edx,%xmm11,%xmm12
 31a:   c4 43 29 22 e9 01   vpinsrd $0x1,%r9d,%xmm10,%xmm13
 329:   c4 41 11 6c f4  vpunpcklqdq %xmm12,%xmm13,%xmm14
vector_store:
 340:   c5 78 29 74 24 d8   vmovaps %xmm14,-0x28(%rsp)

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #15 from rguenther at suse dot de  ---
On Fri, 15 Dec 2017, sergey.shalnov at intel dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> --- Comment #14 from sergey.shalnov at intel dot com ---
> " we have a basic-block vectorizer.  Do you propose to remove it? "
> Definitely not! SLP vectorizer is very good to have!
> 
> “What's the rationale for not using vector registers”
> I just tried " -fno-tree-slp-vectorize" option and found the performance gain
> for different -march= options.
> 
> I see some misunderstanding here. Let me clarify the original question with
> –march=znver1.
> I use " -Ofast -mfpmath=sse -funroll-loops -march=znver1" options set for
> experiments.
> 
> For the basic block we are discussing we have (in vect_analyze_slp_cost() in
> tree-vect-slp.c:1897):
> 
> tmp[i_220][0] = _150;
> tmp[i_220][2] = _147;
> tmp[i_220][1] = _144;
> tmp[i_220][3] = _141;
> 
> tmp[i_139][0] = _447;
> tmp[i_139][2] = _450;
> tmp[i_139][1] = _453;
> tmp[i_139][3] = _456;
> 
> tmp[i_458][0] = _54;
> tmp[i_458][2] = _56;
> tmp[i_458][1] = _58;
> tmp[i_458][3] = _60;
> 
> this is si->stmt printed in the loop with "vect_prologue" calculation.
> 
> I see SLP statistic related to this BB:
> note: Cost model analysis:. 
>   Vector inside of basic block cost: 64 
>   Vector prologue cost: 32 
>   Vector epilogue cost: 0 
>   Scalar cost of basic block: 256 
> note: Basic block will be vectorized using SLP

So it looks like both vector and scalar stores are costed 64 by the
target and the vector construction is costed 32, this is likely
because

  COSTS_N_INSNS (1),		/* cost of cheap SSE instruction.  */

and vector construction of 4 elements is thus cost 4 while

  {8, 8, 8},			/* cost of storing integer registers.  */
...
  {8, 8, 8, 8, 16},		/* cost of storing SSE registers
				   in 32,64,128,256 and 512-bit.  */

That cheap SSE instruction cost is esp. odd when looking at

  2, 3, 6,			/* cost of moving XMM,YMM,ZMM register.  */

or

  COSTS_N_INSNS (3),		/* cost of ADDSS/SD SUBSS/SD insns.  */

so if a movq %xmm0, %xmm1 costs 2 I can't imagine anything being cheaper
than that.

Honza?

...
> Please correct me if I am wrong, but I think we have to have count=3 in
> prologue_cost_vec.

No, it's one vec_construct operation - it's the task of the target
to turn this into a cost comparable to vector_store and scalar_store
in this case.

> And this could slightly change costs for "Vector prologue cost" and might have
> an influence to vectorizer decision.
> 
> Sergey
> PS
> Richard,
> I didn't catch your idea in the "but DOM isn't powerful enough" sentence.
> Could you please clarify it slightly?

The BB vectorization prevents eliding 'tmp' from memory to registers,
the CSE pass after vectorization would be responsible for this but
the memory references look too "complicated" to the respective
implementation used.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #14 from sergey.shalnov at intel dot com ---
" we have a basic-block vectorizer.  Do you propose to remove it? "
Definitely not! SLP vectorizer is very good to have!

“What's the rationale for not using vector registers”
I just tried " -fno-tree-slp-vectorize" option and found the performance gain
for different -march= options.

I see some misunderstanding here. Let me clarify the original question with
–march=znver1.
I use " -Ofast -mfpmath=sse -funroll-loops -march=znver1" options set for
experiments.

For the basic block we are discussing we have (in vect_analyze_slp_cost() in
tree-vect-slp.c:1897):

tmp[i_220][0] = _150;
tmp[i_220][2] = _147;
tmp[i_220][1] = _144;
tmp[i_220][3] = _141;

tmp[i_139][0] = _447;
tmp[i_139][2] = _450;
tmp[i_139][1] = _453;
tmp[i_139][3] = _456;

tmp[i_458][0] = _54;
tmp[i_458][2] = _56;
tmp[i_458][1] = _58;
tmp[i_458][3] = _60;

this is si->stmt printed in the loop with "vect_prologue" calculation.

I see SLP statistic related to this BB:
note: Cost model analysis:. 
  Vector inside of basic block cost: 64 
  Vector prologue cost: 32 
  Vector epilogue cost: 0 
  Scalar cost of basic block: 256 
note: Basic block will be vectorized using SLP

I see 12 statements that are calculated into 3 vector instructions with 4 data
type each (4*int->xmm)
group_size = 12
ncopies_for_cost = 3
nunits = 4

But I see "count" is 1 in the cost vector related to the prologue.
prologue_cost_vec = {m_vec = 0x3fc6e70 = {{count = 1, kind = vec_construct,
stmt = , misalign = 0}}}
body_cost_vec = {m_vec = 0x3fc6f70 = {{count = 3, kind = vector_store, stmt =
, misalign = 0}}}

Please correct me if I am wrong, but I think we have to have count=3 in
prologue_cost_vec.
And this could slightly change costs for "Vector prologue cost" and might have
an influence to vectorizer decision.

Sergey
PS
Richard,
I didn't catch your idea in the "but DOM isn't powerful enough" sentence.
Could you please clarify it slightly?
Thank you.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #13 from rguenther at suse dot de  ---
On Fri, 8 Dec 2017, sergey.shalnov at intel dot com wrote:

> And it uses xmm+ vpbroadcastd to spill tmp[] to stack
> ...
> 1e7:   62 d2 7d 08 7c c9   vpbroadcastd %r9d,%xmm1
>  1ed:   c4 c1 79 7e c9  vmovd  %xmm1,%r9d

^^^
this is an odd instruction ...

>  1f2:   62 f1 fd 08 7f 8c 24vmovdqa64 %xmm1,-0x38(%rsp)
>  1f9:   c8 ff ff ff 
>  1fd:   62 f2 7d 08 7c d7   vpbroadcastd %edi,%xmm2
>  203:   c5 f9 7e d7 vmovd  %xmm2,%edi
>  207:   62 f1 fd 08 7f 94 24vmovdqa64 %xmm2,-0x28(%rsp)
>  20e:   d8 ff ff ff 
>  212:   62 f2 7d 08 7c db   vpbroadcastd %ebx,%xmm3
>  218:   c5 f9 7e de vmovd  %xmm3,%esi
>  21c:   62 f1 fd 08 7f 9c 24vmovdqa64 %xmm3,-0x18(%rsp)
>  223:   e8 ff ff ff 
>  227:   01 fe   add%edi,%esi
>  229:   45 01 c8add%r9d,%r8d
>  22c:   41 01 f0add%esi,%r8d
>  22f:   8b 5c 24 dc mov-0x24(%rsp),%ebx
>  233:   03 5c 24 ec add-0x14(%rsp),%ebx
>  237:   8b 6c 24 bc mov-0x44(%rsp),%ebp
>  23b:   03 6c 24 cc add-0x34(%rsp),%ebp
> ...
> 
> I think this is better from a performance perspective but, as I said
> before, not using vector registers here is the best option if no loops
> are vectorized.

As I said we have a basic-block vectorizer.  Do you propose to remove it?
What's the rationale for "not using vector registers ... if no loops [are]
vectorized"?

With AVX256/512 an additional cost when using vector registers is
a vzeroupper required at function boundaries.  Is this (one of) the
reason?

If it is the case that the instruction sequence

vpbroadcastd %ebx,%xmm3
vmovdqa64 %xmm3,-0x18(%rsp)

is slower than doing

  vmovd %ebx,-0x18(%rsp)
  vmovd %ebx,-0x22(%rsp)
  vmovd %ebx,-0x26(%rsp)
  vmovd %ebx,-0x30(%rsp)

then the costing in the backend needs to reflect that.

I see that the vectorization prevents eliding 'tmp' on GIMPLE
but that's a completely separate issue (the value-numbering we
run after vectorization is poor):

  vect_cst__108 = {_48, _48, _48, _48};
  vect_cst__106 = {_349, _349, _349, _349};
  vect_cst__251 = {_97, _97, _97, _97};
  MEM[(unsigned int *)&tmp + 16B] = vect_cst__251;
  MEM[(unsigned int *)&tmp + 32B] = vect_cst__106;
  MEM[(unsigned int *)&tmp + 48B] = vect_cst__108;
  _292 = tmp[0][0];
  _291 = tmp[1][0];
...

those loads could have been elided but DOM isn't powerful enough to
see that (and never will be).

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #12 from sergey.shalnov at intel dot com ---
Richard,
Your last proposal changed the generated code a bit.
Currently it shows:
test_bugzilla1.c:6:5: note: Cost model analysis:. 
  Vector inside of loop cost: 62576 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 62576
divided by the scalar iteration cost = 328 is greater or equal to the
vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be
profitable.

And it uses xmm+ vpbroadcastd to spill tmp[] to stack
...
1e7:   62 d2 7d 08 7c c9   vpbroadcastd %r9d,%xmm1
 1ed:   c4 c1 79 7e c9  vmovd  %xmm1,%r9d
 1f2:   62 f1 fd 08 7f 8c 24vmovdqa64 %xmm1,-0x38(%rsp)
 1f9:   c8 ff ff ff 
 1fd:   62 f2 7d 08 7c d7   vpbroadcastd %edi,%xmm2
 203:   c5 f9 7e d7 vmovd  %xmm2,%edi
 207:   62 f1 fd 08 7f 94 24vmovdqa64 %xmm2,-0x28(%rsp)
 20e:   d8 ff ff ff 
 212:   62 f2 7d 08 7c db   vpbroadcastd %ebx,%xmm3
 218:   c5 f9 7e de vmovd  %xmm3,%esi
 21c:   62 f1 fd 08 7f 9c 24vmovdqa64 %xmm3,-0x18(%rsp)
 223:   e8 ff ff ff 
 227:   01 fe   add%edi,%esi
 229:   45 01 c8add%r9d,%r8d
 22c:   41 01 f0add%esi,%r8d
 22f:   8b 5c 24 dc mov-0x24(%rsp),%ebx
 233:   03 5c 24 ec add-0x14(%rsp),%ebx
 237:   8b 6c 24 bc mov-0x44(%rsp),%ebp
 23b:   03 6c 24 cc add-0x34(%rsp),%ebp
...

I think this is better from a performance perspective but, as I said
before, not using vector registers here is the best option if no loops are
vectorized.

In the case of a static loop increment (the first test case), the first loop
is vectorized as before.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #11 from sergey.shalnov at intel dot com ---
Richard,
“Is this about the "stupid" attempt to use as little AVX512 as possible”
No, it is not.
I provided the asm listing at the beginning with zmm only to illustrate the
issue more clearly. I used the "-mprefer-vector-width=512" option to get that
asm output.
I believe that if we remove these registers (any, including ymm and xmm) from
the BB you mentioned, it will be profitable for all architectures.

" Sergey, can you test this?"
Yes. Going to try this.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #10 from Richard Biener  ---
Just to note this is _basic block vectorization_ triggering.  Of course we do
vectorize basic blocks even when we do not vectorize any loop.

Is this about the "stupid" attempt to use as little AVX512 as possible because
the CPU has to down-clock?  We could add a hint to the prefered-vector-size
target hook whether we are asking for BB or loop vectorization and the target
could decide to not do AVX512 for BB vectorization.

Of course we also happily vectorize very short running loops with AVX512.

I think the compiler will have a _very_ hard job working around this CPU design
limitation...

So - I do have a way to "fix" it for this case.  What SLP detection does
for the piecewise vector construction is to push that down as far as possible,
even through binary operations that could otherwise be vectorized (the idea
was to minimize the number of such build-ups).  We could limit that to
unary operations.  Then we get a much larger SLP tree and

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 152
  Vector prologue cost: 512
  Vector epilogue cost: 0
  Scalar cost of basic block: 544
t.c:32:12: note: not vectorized: vectorization is not profitable.

but of course the real issue is that the target claims that replacing
N stores with a vector store is profitable.

Hmm.  Honza did

  case vec_construct:
return ix86_vec_cost (mode, ix86_cost->sse_op, false);

but it was

  case vec_construct:
return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);


before.  That seems like a bogus change.  I guess it should really be

  case vec_construct:
return (ix86_vec_cost (mode, ix86_cost->sse_op, false)
* (TYPE_VECTOR_SUBPARTS (vectype) - 1));

Honza - what was the motivation for this change?

Sergey, can you test this?  For me it makes the thing vectorized using
SSE which means it can use vpbroadcastd.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 255499)
+++ gcc/config/i386/i386.c  (working copy)
@@ -44879,7 +44879,8 @@ ix86_builtin_vectorization_cost (enum ve
  ix86_cost->sse_op, true);

   case vec_construct:
-   return ix86_vec_cost (mode, ix86_cost->sse_op, false);
+   return (ix86_vec_cost (mode, ix86_cost->sse_op, false)
+   * (TYPE_VECTOR_SUBPARTS (vectype) - 1));

   default:
 gcc_unreachable ();

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #9 from sergey.shalnov at intel dot com ---
Created attachment 42813
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42813&action=edit
New reproducer

Slightly changed first loop

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #8 from sergey.shalnov at intel dot com ---
Richard,
These are great changes, and I see the first loop became vectorized for the
test example I provided with the gcc-8.0 main trunk.
But I think the issue is a bit more complicated. Vectorization of the first
loop just hides the issue I reported.

Currently I see following:
test loop: “for (int i = 0; i < 4; i++, input1 += 4, input2 += 4)”
test_bugzilla.c:6:5: note: Cost model analysis:. 
  Vector inside of loop cost: 1136 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
  Calculated minimum iters for profitability: 0 
test_bugzilla.c:6:5: note:   Runtime profitability threshold = 4 
test_bugzilla.c:6:5: note:   Static estimate profitability threshold = 4 
test_bugzilla.c:6:5: note: loop vectorized

If I slightly change the loop (to be closer to the real application): “for (int i =
0; i < 4; i++, input1 += stride1, input2 += stride2)”
test_bugzilla1.c:6:5: note: Cost model analysis:. 
  Vector inside of loop cost: 5232 
  Vector prologue cost: 0 
  Vector epilogue cost: 0 
  Scalar iteration cost: 328 
  Scalar outside cost: 0 
  Vector outside cost: 0 
  prologue iterations: 0 
  epilogue iterations: 0 
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 5232
divided by the scalar iteration cost = 328 is greater or equal to the
vectorization factor = 4. 
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable. 
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be
profitable.

And the issue with extra vector operations remains the same.

I’m not sure, but I think it is really profitable to avoid vector register
usage if the loop is not vectorized.
Do you agree?

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #7 from Richard Biener  ---
Note the first loop is now vectorized fine thus the strange code is gone.

-> fixed? (probably by the fix for PR83202)

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #6 from sergey.shalnov at intel dot com ---
I found the issue report related to the vectorization issues in the second loop
(reduction uint->int).

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #5 from sergey.shalnov at intel dot com ---
(In reply to Richard Biener from comment #2)
> The strange code is because we perform basic-block vectorization resulting in
> 
>   vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417,
> _417, _417, _417, _48, _48, _48, _48};
>   MEM[(unsigned int *)&tmp] = vect_cst__249;
>   _186 = tmp[0][0];
>   _185 = tmp[1][0];
> ...
> 
> which for some reason is deemed profitable:
> 
> t.c:32:12: note: Cost model analysis:
>   Vector inside of basic block cost: 24
>   Vector prologue cost: 64
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 192
> t.c:32:12: note: Basic block will be vectorized using SLP
> 
> what is odd is that the single vector store is costed 24 while the 16 scalar
> int stores are costed 192.  The vector build from scalar costs 64.
> 
> I guess Honzas cost-model tweaks might have gone wrong here or we're hitting
> an oddity in the SLP costing.
> 
> Even if it looks strange maybe the sequence _is_ profitable?
> 
> The second loop would be vectorized if 'sum' was unsigned.

Richard,
No, the sequence is not profitable. If we don't use any vector registers here
the performance will be better for all architectures.
I'm talking about vectorized code only here.

I'm trying to look into the vect_stmt_relevant_p() function to implement
additional limitations and avoid block vectorization if the loop is not
vectorized.

If you have any idea how to avoid vectorization in this particular place -
please let me know.

Sergey

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-20 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #4 from rguenther at suse dot de  ---
On Sun, 19 Nov 2017, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
> 
> Jan Hubicka  changed:
> 
>What|Removed |Added
> 
>  Status|UNCONFIRMED |NEW
>Last reconfirmed||2017-11-19
>  Ever confirmed|0   |1
> 
> --- Comment #3 from Jan Hubicka  ---
> I now get for the first loop:
> 
> 
> a.c:6:5: note: Cost model analysis:   
>   
>   Vector inside of loop cost: 1120
>   
>   Vector prologue cost: 0 
>   
>   Vector epilogue cost: 0 
>   
>   Scalar iteration cost: 328  
>   
>   Scalar outside cost: 0  
>   
>   Vector outside cost: 0  
>   
>   prologue iterations: 0  
>   
>   epilogue iterations: 0  
>   
>   Calculated minimum iters for profitability: 0   
>   
> a.c:6:5: note:   Runtime profitability threshold = 4  
>   
> a.c:6:5: note:   Static estimate profitability threshold = 4  
>   
> 
> For the second loop I get:
> 
> a.c:20:5: note: type of def: internal 
>   
> a.c:20:5: note: mark relevant 1, live 0: sum.0_60 = (unsigned int) sum_102;   
>   
> a.c:20:5: note: worklist: examine stmt: sum.0_60 = (unsigned int) sum_102;
>   
> a.c:20:5: note: vect_is_simple_use: operand sum_102   
>   
> a.c:20:5: note: def_stmt: sum_102 = PHI <0(6), sum_75(7)> 
>   
> a.c:20:5: note: type of def: unknown  
>   
> a.c:20:5: note: Unsupported pattern.  
>   
> a.c:20:5: note: not vectorized: unsupported use in stmt.  
>   
> a.c:20:5: note: unexpected pattern.   
> 
> Is it really that hard to sum values in vector?  

The issue is the vectorizer doesn't currently handle conversions:

t.c:20:5: note: Analyze phi: sum_102 = PHI <0(6), sum_75(7)>
t.c:20:5: note: reduction: not commutative/associative: sum_75 = (int) 
_61;

I have patches to fix that though.  Sitting somewhere...  see PR65930.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-11-19
 Ever confirmed|0   |1

--- Comment #3 from Jan Hubicka  ---
I now get for the first loop:


a.c:6:5: note: Cost model analysis: 
  Vector inside of loop cost: 1120  
  Vector prologue cost: 0   
  Vector epilogue cost: 0   
  Scalar iteration cost: 328
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0 
a.c:6:5: note:   Runtime profitability threshold = 4
a.c:6:5: note:   Static estimate profitability threshold = 4

For the second loop I get:

a.c:20:5: note: type of def: internal   
a.c:20:5: note: mark relevant 1, live 0: sum.0_60 = (unsigned int) sum_102; 
a.c:20:5: note: worklist: examine stmt: sum.0_60 = (unsigned int) sum_102;  
a.c:20:5: note: vect_is_simple_use: operand sum_102 
a.c:20:5: note: def_stmt: sum_102 = PHI <0(6), sum_75(7)>   
a.c:20:5: note: type of def: unknown
a.c:20:5: note: Unsupported pattern.
a.c:20:5: note: not vectorized: unsupported use in stmt.
a.c:20:5: note: unexpected pattern.   

Is it really that hard to sum values in vector?  

The code does not use AVX512 regs at all.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-* i?86-*-*
 CC||hubicka at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
The strange code is because we perform basic-block vectorization resulting in

  vect_cst__249 = {_251, _251, _251, _251, _334, _334, _334, _334, _417, _417,
_417, _417, _48, _48, _48, _48};
  MEM[(unsigned int *)&tmp] = vect_cst__249;
  _186 = tmp[0][0];
  _185 = tmp[1][0];
...

which for some reason is deemed profitable:

t.c:32:12: note: Cost model analysis:
  Vector inside of basic block cost: 24
  Vector prologue cost: 64
  Vector epilogue cost: 0
  Scalar cost of basic block: 192
t.c:32:12: note: Basic block will be vectorized using SLP

what is odd is that the single vector store is costed 24 while the 16 scalar
int stores are costed 192.  The vector build from scalar costs 64.

I guess Honzas cost-model tweaks might have gone wrong here or we're hitting an
oddity in the SLP costing.

Even if it looks strange maybe the sequence _is_ profitable?

The second loop would be vectorized if 'sum' was unsigned.

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008

--- Comment #1 from sergey.shalnov at intel dot com ---
Created attachment 42616
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42616&action=edit
reproducer