[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2023-01-31 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Martin Jambor  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |INVALID

--- Comment #11 from Martin Jambor  ---
Probably just weirdness of the universe we live in rather than a bug.  At
least the LNT graph looks good now too.

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-10-14 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #10 from Martin Jambor  ---
Looking at the LNT graph, I guess this bug should be either closed or suspended
(not sure what the suspended state means for the blocked metabug, so probably
closed).

Yeah, it's weird.

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-10-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener  changed:

   What|Removed |Added

   Assignee|rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
 Status|ASSIGNED|NEW

--- Comment #9 from Richard Biener  ---
433.milc on that specific LNT instance seems to jump up and down: it
recovered from the originally reported regression but is now worse than
ever, having regressed between Sep. 27 and 28.

But as said, on Zen2, while the changes are reproducible, perf is almost
useless there, pointing to code that's exactly the same :/

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-10-07 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #8 from Jan Hubicka  ---
so smarter merging in modref is now implemented ;)

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-08-22 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #7 from Jan Hubicka  ---
"every access" means that we no longer track individual bases+offsets+sizes and
everything matching the base/ref alias set will be considered conflicting.

I planned to implement smarter merging of accesses so we do not run out of
limits for such sequential case.  Will look into it.
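
(For illustration, a minimal C sketch of the sequential case being described;
the types and function name are made up, not the actual milc source.  A fully
unrolled 3x3 complex matrix multiply in the style of mult_su3_nn performs 18
distinct double-sized loads from each input parameter and 18 distinct stores
through the output, all beyond the default --param modref-max-accesses=16, at
which point modref stops tracking offsets/sizes and records "Every access".)

typedef struct { double real, imag; } complex_t;
typedef struct { complex_t e[3][3]; } matrix_t;

/* After full unrolling, the 18 loads from a, 18 loads from b and 18
   stores to c each overflow the default per-ref access limit of 16.  */
void
mult_nn_sketch (const matrix_t *a, const matrix_t *b, matrix_t *c)
{
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      {
        double re = 0.0, im = 0.0;
        for (int k = 0; k < 3; k++)
          {
            re += a->e[i][k].real * b->e[k][j].real
                  - a->e[i][k].imag * b->e[k][j].imag;
            im += a->e[i][k].real * b->e[k][j].imag
                  + a->e[i][k].imag * b->e[k][j].real;
          }
        c->e[i][j].real = re;
        c->e[i][j].imag = im;
      }
}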

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #6 from Richard Biener  ---
Btw, there's no effect of the change visible on Haswell.

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org

--- Comment #5 from Richard Biener  ---
OK, so one interesting difference is (these are all of the -fopt-info-vec
differences):

-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
+m_mat_nn.c:90:17: optimized: basic block part vectorized using 16 byte vectors

The +m_mat_nn.c:90:17 is mult_su3_nn while the -s_m_a_mat.c:18:18 is
scalar_mult_add_su3_matrix, which is inlined at all call sites.  The
missing cases are all inlined into the function update_u.

The odd thing is that we're seeing changes in .vect of update_u like

@@ -3426,46 +3334,40 @@
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   # DEBUG BEGIN_STMT
-  _918 = MEM  [(struct su3_matrix *)s_103].link[dir_67].e[0][0].real;
   _919 = temp1.e[0][0].real;
   _920 = t5_12 * _919;
-  _921 = _918 + _920;
+  _921 = _920 + _1023;
   temp2.e[0][0].real = _921;
   # DEBUG BEGIN_STMT
-  _923 = MEM  [(struct su3_matrix *)s_103].link[dir_67].e[0][0].imag;
   _924 = temp1.e[0][0].imag;
   _925 = t5_12 * _924;
-  _926 = _923 + _925;
+  _926 = _925 + _1028;
...

which in the end results in fewer DRs going into SLP and thus a different
outcome there.  This difference starts in the cunrolli dump!?  Dump
differences are like

+ipa-modref: call stmt mult_su3_nn (, link_24, );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: temp2 alias sets: 6->5
...
 Value numbering stmt = _938 = link_24->e[i_915][2].real;
-Setting value number of _938 to _938 (changed)
-Making available beyond BB152 _938 for value _938
+ipa-modref: call stmt mult_su3_nn (, , );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM [(struct su3_matrix *)s_5] alias sets: 6->5
+ipa-modref: call stmt mult_su3_nn (, link_24, );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM [(struct su3_matrix *)s_5] alias sets: 6->5
+Setting value number of _938 to _1043 (changed)
+_1043 is available for _1043
+Replaced link_24->e[i_915][2].real with _1043 in all uses of _938 = link_24->e[i_915][2].real;

It's really odd: the WPA and LTRANS modref dumps do not show any difference,
but the above looks like the IPA summary is once available and once not.  Ah,
the late modref pass results spill over, and it looks like we "improve" here:

   loads:
 Limits: 32 bases, 16 refs
-  Base 0: alias set 6
+  Base 0: alias set 5
+Ref 0: alias set 5
+  Every access
+  Base 1: alias set 6
 Ref 0: alias set 5
   Every access
   stores:
 Limits: 32 bases, 16 refs
-  Base 0: alias set 6
+  Base 0: alias set 5
 Ref 0: alias set 5
-  Every access
+  access: Parm 2 param offset:0 offset:0 size:128 max_size:128
+  access: Parm 2 param offset:16 offset:0 size:128 max_size:128
+  access: Parm 2 param offset:48 offset:0 size:128 max_size:128
+  access: Parm 2 param offset:64 offset:0 size:128 max_size:128
+  access: Parm 2 param offset:112 offset:0 size:128 max_size:128
+  Base 1: alias set 6
+Ref 0: alias set 5
+  access: Parm 2 param offset:0 offset:256 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:320 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:640 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:704 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:768 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:832 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:1024 size:64 max_size:64
+  access: Parm 2 param offset:0 offset:1088 size:64 max_size:64
   parm 0 flags: nodirectescape
   parm 1 flags: nodirectescape
   parm 2 flags: direct noescape nodirectescape
 void mult_su3_nn (struct su3_matrix * a, struct su3_matrix * b, struct su3_matrix * c)

I'm not sure what "Every access" means but I suppose it's "bad" here.  Maybe
it's

  - Analyzing load: b_10(D)->e[2][1].real
- Recording base_set=6 ref_set=5 parm=1
---param param=modref-max-accesses limit reached
  - Analyzing load: b_10(D)->e[2][1].imag
- Recording base_set=6 ref_set=5 parm=1
... (a lot) ...
+--param param=modref-max-accesses limit reached
  - Analyzing load: a_7(D)->e[1][1].imag
- Recording base_set=6 ref_set=5 parm=0
  - ECF_CONST | ECF_NOVOPS, ignoring all stores and all loads except for args.

so eventually vectorizing helps reduce the number of accesses and 

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #4 from Richard Biener  ---
Disabling vectorization for mult_su3_nn (the one with the vaddsubpd
instructions) still reproduces the regression:

433.milc    9180    126    73.1 *    9180    133    69.2 *

and thus a 5% slowdown.

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #3 from Richard Biener  ---
Created attachment 51104
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51104&action=edit
mult_su3_nn testcase

This is the function that contains nearly all of the (many) vaddsubpd
instructions.

With the addsub pattern we have 15 addsub, 33 fma, 51 mul, 14 add and 3 sub,
while without the pattern we have zero addsub, 54 fma, 54 mul, 32 add and 9
sub.  Detecting fmaddsub directly in the vectorizer might be worthwhile.
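
(A minimal sketch of the shape direct fmaddsub detection would target;
hypothetical arrays and function name, in the style of the foo example in
comment #1 below.  Even lanes compute a*b - c and odd lanes a*b + c, which
is exactly the lane pattern of vfmaddsubpd, so recognizing it in the
vectorizer could emit one instruction instead of an FMA/FMS pair plus a
blend.)

double a[4], b[4], c[4], d[4];

void fmaddsub_sketch (void)
{
  d[0] = a[0] * b[0] - c[0];   /* even lane: fused multiply-subtract */
  d[1] = a[1] * b[1] + c[1];   /* odd lane:  fused multiply-add */
  d[2] = a[2] * b[2] - c[2];
  d[3] = a[3] * b[3] + c[3];
}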

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-05 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #2 from Richard Biener  ---
Samples: 884K of event 'cycles:u', Event count (approx.): 96751841
Overhead  Samples  Command          Shared Object              Symbol
  13.76%   119196  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] u_shift_fermion
  10.08%    87085  milc_base.amd64  milc_base.amd64-m64-mine   [.] add_force_to_mom
   9.93%    85891  milc_base.amd64  milc_base.amd64-m64-mine   [.] u_shift_fermion
   9.38%    81331  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] add_force_to_mom
   9.03%    82570  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] mult_su3_na
   8.55%    77803  milc_base.amd64  milc_base.amd64-m64-mine   [.] mult_su3_na
   7.41%    65641  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] mult_su3_nn
   6.26%    55314  milc_base.amd64  milc_base.amd64-m64-mine   [.] mult_su3_nn
   1.48%    12876  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] mult_su3_an
   1.42%    12625  milc_base.amd64  milc_base.amd64-m64-mine   [.] imp_gauge_force.constprop.0
   1.18%    10602  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] imp_gauge_force.constprop.0
   1.00%     8853  milc_base.amd64  milc_base.amd64-m64-mine   [.] mult_su3_mat_vec_sum_4dir
   0.94%     8343  milc_peak.amd64  milc_peak.amd64-m64-mine   [.] mult_su3_mat_vec_sum_4dir
   0.94%     8156  milc_base.amd64  milc_base.amd64-m64-mine   [.] mult_su3_an

The odd thing is that, for example, mult_su3_an reports a vastly different
number of cycles even though the assembly is 1:1 identical.

There are in total 16 vaddsubpd instructions in the new variant, in the
symbols add_force_to_mom (1) and mult_su3_nn (15), but that doesn't
explain the difference seen above.

More ADDSUB patterns are detected but they do not materialize in the end;
still, there's some effect on RA and scheduling in functions like
u_shift_fermion.  The vectorizer dumps do not reveal anything interesting
for this example either.

I was using the following to disable the added pattern (the appended "|| 1"
makes the condition always true, so the pattern recognizer bails out
immediately):

diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2671f91972d..388b185dc7b 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1510,7 +1510,7 @@ addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
 {
   slp_tree node = *node_;
   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      || SLP_TREE_CHILDREN (node).length () != 2)
+      || SLP_TREE_CHILDREN (node).length () != 2 || 1)
     return NULL;

   /* Match a blend of a plus and a minus op with the same number of plus and


To sum up - I have no idea why performance has regressed.

[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto

2021-07-02 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
   Last reconfirmed||2021-07-02
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Richard Biener  ---
I will have a look next week.  A quick look shows FMAs being used, and addsub
can break FMA detection until we get general optab support for fmaddsub
and friends.  So it might be { fma, fms } + blend compared to addsub + mul,
where the former maybe has lower latency, though Agner says FMA (5c) +
blend (1c) vs. ADDSUB (3c) + MUL (3c).  As said, I have to look into this in
more detail.

double a[4], b[4], c[4];

void foo ()
{
  c[0] = a[0] - b[0] * c[0];
  c[1] = a[1] + b[1] * c[1];
  c[2] = a[2] - b[2] * c[2];
  c[3] = a[3] + b[3] * c[3];
}

vmovapd a(%rip), %ymm2
vmovapd b(%rip), %ymm1
vmovapd b(%rip), %ymm0
vfmadd132pd c(%rip), %ymm2, %ymm1
vfnmadd132pd c(%rip), %ymm2, %ymm0
vshufpd $10, %ymm1, %ymm0, %ymm0
vmovapd %ymm0, c(%rip)

vs.

vmovapd b(%rip), %ymm1
vmovapd a(%rip), %ymm2
vmulpd  c(%rip), %ymm1, %ymm0
vaddsubpd   %ymm0, %ymm2, %ymm0
vmovapd %ymm0, c(%rip)