[Bug target/101296] Addition of x86 addsub SLP pattern slowed down 433.milc by 12% on znver2 with -Ofast -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Martin Jambor changed:

           What      |Removed |Added
           Status    |NEW     |RESOLVED
           Resolution|---     |INVALID

--- Comment #11 from Martin Jambor ---
Probably just weirdness of the universe we live in rather than a bug. At least
the LNT graph looks good now too.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #10 from Martin Jambor ---
Looking at the LNT graph, I guess this bug should be either closed or suspended
(not sure what the suspended state means for the blocked metabug, so probably
closed). Yeah, it's weird.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener changed:

           What    |Removed                   |Added
           Assignee|rguenth at gcc dot gnu.org|unassigned at gcc dot gnu.org
           Status  |ASSIGNED                  |NEW

--- Comment #9 from Richard Biener ---
433.milc on that specific LNT instance seems to jump up and down: it recovered
from the originally reported regression but is now worse than ever, regressing
between Sep. 27 and 28. But as said, while the changes are reproducible on
Zen2, perf is almost useless there, pointing to code that's exactly the same :/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #8 from Jan Hubicka ---
So smarter merging in modref is now implemented ;)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #7 from Jan Hubicka ---
"Every access" means that we no longer track individual bases+offsets+sizes;
everything matching the base/ref alias set will be considered conflicting. I
planned to implement smarter merging of accesses so we do not run out of the
limits for such sequential cases. Will look into it.
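The "smarter merging" idea — collapsing runs of adjacent accesses into one widened record instead of spending a slot per access until the limit is hit — can be sketched roughly as below. This is a hypothetical illustration, not GCC's actual modref implementation; the type and function names are made up.

```c
#include <stddef.h>

/* Hypothetical modref-style access record: byte offset and size
   relative to a parameter base.  */
struct access_entry { long offset; long size; };

/* Try to fold a new access into an existing record.  If the two ranges
   touch or overlap, widen the record in place and report success, so a
   long sequential access pattern consumes one slot instead of many.  */
int try_merge_access (struct access_entry *e, long offset, long size)
{
  long e_end = e->offset + e->size;
  long n_end = offset + size;
  if (offset > e_end || n_end < e->offset)
    return 0;                  /* disjoint: would need a new slot */
  if (offset < e->offset)
    e->offset = offset;
  if (n_end > e_end)
    e_end = n_end;
  e->size = e_end - e->offset;
  return 1;
}
```

A summary builder would try merging into existing records first and only allocate a fresh slot (counted against the limit) when no merge succeeds.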
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #6 from Richard Biener ---
Btw, there's no effect of the change visible on Haswell.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener changed:

           What|Removed |Added
           CC  |        |hubicka at gcc dot gnu.org

--- Comment #5 from Richard Biener ---
OK, so one interesting difference is (these are all of the -fopt-info-vec
differences):

-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
-s_m_a_mat.c:18:18: optimized: basic block part vectorized using 32 byte vectors
+m_mat_nn.c:90:17: optimized: basic block part vectorized using 16 byte vectors

The +m_mat_nn.c:90:17 site is mult_su3_nn, while the -s_m_a_mat.c:18:18 site is
scalar_mult_add_su3_matrix, which is inlined at all call sites. The missing
cases are all inlined into the function update_u. The odd thing is that we're
seeing changes in the .vect dump of update_u like

@@ -3426,46 +3334,40 @@
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   # DEBUG BEGIN_STMT
-  _918 = MEM [(struct su3_matrix *)s_103].link[dir_67].e[0][0].real;
   _919 = temp1.e[0][0].real;
   _920 = t5_12 * _919;
-  _921 = _918 + _920;
+  _921 = _920 + _1023;
   temp2.e[0][0].real = _921;
   # DEBUG BEGIN_STMT
-  _923 = MEM [(struct su3_matrix *)s_103].link[dir_67].e[0][0].imag;
   _924 = temp1.e[0][0].imag;
   _925 = t5_12 * _924;
-  _926 = _923 + _925;
+  _926 = _925 + _1028;
...

which in the end results in fewer DRs going into SLP and thus a different
outcome there. This difference starts in the cunrolli dump!? Dump differences
look like

+ipa-modref: call stmt mult_su3_nn (, link_24, );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: temp2 alias sets: 6->5
...
 Value numbering stmt = _938 = link_24->e[i_915][2].real;
-Setting value number of _938 to _938 (changed)
-Making available beyond BB152 _938 for value _938
+ipa-modref: call stmt mult_su3_nn (, , );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM [(struct su3_matrix *)s_5] alias sets: 6->5
+ipa-modref: call stmt mult_su3_nn (, link_24, );
+ipa-modref: call to mult_su3_nn/1705 does not clobber base: MEM [(struct su3_matrix *)s_5] alias sets: 6->5
+Setting value number of _938 to _1043 (changed)
+_1043 is available for _1043
+Replaced link_24->e[i_915][2].real with _1043 in all uses of _938 = link_24->e[i_915][2].real;

It's really odd: the WPA and LTRANS modref dumps do not show any difference,
but the above looks like the IPA summary is once available and once not. Ah,
the late modref pass results spill over, and it looks like we "improve" here:

 loads:
   Limits: 32 bases, 16 refs
-  Base 0: alias set 6
+  Base 0: alias set 5
+    Ref 0: alias set 5
+      Every access
+  Base 1: alias set 6
     Ref 0: alias set 5
       Every access
 stores:
   Limits: 32 bases, 16 refs
-  Base 0: alias set 6
+  Base 0: alias set 5
     Ref 0: alias set 5
-      Every access
+      access: Parm 2 param offset:0 offset:0 size:128 max_size:128
+      access: Parm 2 param offset:16 offset:0 size:128 max_size:128
+      access: Parm 2 param offset:48 offset:0 size:128 max_size:128
+      access: Parm 2 param offset:64 offset:0 size:128 max_size:128
+      access: Parm 2 param offset:112 offset:0 size:128 max_size:128
+  Base 1: alias set 6
+    Ref 0: alias set 5
+      access: Parm 2 param offset:0 offset:256 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:320 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:640 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:704 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:768 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:832 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:1024 size:64 max_size:64
+      access: Parm 2 param offset:0 offset:1088 size:64 max_size:64
 parm 0 flags: nodirectescape
 parm 1 flags: nodirectescape
 parm 2 flags: direct noescape nodirectescape

for void mult_su3_nn (struct su3_matrix * a, struct su3_matrix * b,
struct su3_matrix * c).

I'm not sure what "Every access" means, but I suppose it's "bad" here. Maybe
it's

 - Analyzing load: b_10(D)->e[2][1].real
 - Recording base_set=6 ref_set=5 parm=1
 --param param=modref-max-accesses limit reached
 - Analyzing load: b_10(D)->e[2][1].imag
 - Recording base_set=6 ref_set=5 parm=1
 ... (a lot) ...
 --param param=modref-max-accesses limit reached
 - Analyzing load: a_7(D)->e[1][1].imag
 - Recording base_set=6 ref_set=5 parm=0
 - ECF_CONST | ECF_NOVOPS, ignoring all stores and all loads except for args.

so eventually vectorizing helps reducing the number of accesses and
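The "limit reached" messages above correspond to a tunable GCC parameter; raising it is one way to experiment with whether the lost access lists matter. A sketch of such an experiment — the chosen value (64) is arbitrary, and the set of modref params and their defaults varies by GCC version, so verify locally first.

```shell
# Hypothetical experiment: raise the per-ref access limit so long
# sequential access lists are kept instead of degrading to "Every access".
# Check what your GCC offers with: gcc --help=params | grep modref
gcc -Ofast -flto --param modref-max-accesses=64 \
    -fdump-ipa-modref -c m_mat_nn.c
```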
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #4 from Richard Biener ---
Disabling vectorization for mult_su3_nn (the one with the vaddsubpd
instructions) still reproduces

  433.milc    9180126    73.1 *    9180133    69.2 *

and thus a 5% slowdown.
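This kind of single-function experiment can be done with GCC's optimize attribute, which turns off loop and SLP vectorization for just one function while the rest of the file is compiled normally. A sketch: the types here are simplified stand-ins for milc's 3x3 complex su3_matrix, not the benchmark sources.

```c
#include <string.h>

/* Simplified stand-ins for milc's su3_matrix (3x3 complex matrix);
   the real definitions live in 433.milc.  */
typedef struct { double real, imag; } dcomplex;
typedef struct { dcomplex e[3][3]; } su3_matrix;

/* c = a * b, with loop and SLP vectorization disabled for this single
   function via the GCC optimize attribute.  */
__attribute__ ((optimize ("no-tree-vectorize", "no-tree-slp-vectorize")))
void mult_su3_nn (const su3_matrix *a, const su3_matrix *b, su3_matrix *c)
{
  memset (c, 0, sizeof *c);
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      for (int k = 0; k < 3; k++)
        {
          /* Complex multiply-accumulate: c[i][j] += a[i][k] * b[k][j].  */
          c->e[i][j].real += a->e[i][k].real * b->e[k][j].real
                             - a->e[i][k].imag * b->e[k][j].imag;
          c->e[i][j].imag += a->e[i][k].real * b->e[k][j].imag
                             + a->e[i][k].imag * b->e[k][j].real;
        }
}
```

The attribute only affects this definition, so the comparison isolates the vectorizer's effect on one hot function without rebuilding with different flags.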
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #3 from Richard Biener ---
Created attachment 51104
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51104&action=edit
mult_su3_nn testcase

This is the function containing nearly all of the (many) vaddsubpd
instructions. With the addsub pattern we have 15 addsub, 33 fma, 51 mul,
14 add and 3 sub, while without the pattern we have zero addsub and 54 fma,
54 mul, 32 add and 9 sub. Detecting fmaddsub directly in the vectorizer
might be worthwhile.
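For reference, the per-lane semantics of the x86 vfmaddsubpd family (the fused operation a vectorizer fmaddsub pattern would target) can be modeled in scalar C as below. The function name is made up for illustration; the multiply here is not actually fused, whereas the hardware instruction fuses it without an intermediate rounding step.

```c
/* Scalar model of a 4-lane (256-bit) fmaddsub on doubles:
   even lanes compute a*b - c, odd lanes compute a*b + c.  */
void fmaddsub4 (const double *a, const double *b, const double *c,
                double *r)
{
  for (int i = 0; i < 4; i++)
    r[i] = a[i] * b[i] + ((i & 1) ? c[i] : -c[i]);
}
```

Recognizing this whole expression as a single node would let the vectorizer emit one fused instruction instead of separate { fma, fms } + blend or mul + addsub sequences.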
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

--- Comment #2 from Richard Biener ---

Samples: 884K of event 'cycles:u', Event count (approx.): 96751841
Overhead  Samples  Command          Shared Object             Symbol
  13.76%  119196   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] u_shift_fermion
  10.08%   87085   milc_base.amd64  milc_base.amd64-m64-mine  [.] add_force_to_mom
   9.93%   85891   milc_base.amd64  milc_base.amd64-m64-mine  [.] u_shift_fermion
   9.38%   81331   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] add_force_to_mom
   9.03%   82570   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_na
   8.55%   77803   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_na
   7.41%   65641   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_nn
   6.26%   55314   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_nn
   1.48%   12876   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_an
   1.42%   12625   milc_base.amd64  milc_base.amd64-m64-mine  [.] imp_gauge_force.constprop.0
   1.18%   10602   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] imp_gauge_force.constprop.0
   1.00%    8853   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
   0.94%    8343   milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
   0.94%    8156   milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_an

The odd thing is that, for example, mult_su3_an reports a vastly different
number of cycles even though its assembly is 1:1 identical in both builds.
There are in total 16 vaddsubpd instructions in the new variant, in the symbols
add_force_to_mom (1) and mult_su3_nn (15), but that doesn't explain the
differences seen above. More ADDSUB patterns are detected but they do not
materialize in the end; still, there is some effect on RA and scheduling in
functions like u_shift_fermion, though the vectorizer dumps do not reveal
anything interesting for this example either.
I was using the following to disable the added pattern:

diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2671f91972d..388b185dc7b 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1510,7 +1510,7 @@ addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
 {
   slp_tree node = *node_;
   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      || SLP_TREE_CHILDREN (node).length () != 2)
+      || SLP_TREE_CHILDREN (node).length () != 2 || 1)
     return NULL;

   /* Match a blend of a plus and a minus op with the same number of plus and

To sum up: I have no idea why performance has regressed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296

Richard Biener changed:

           What            |Removed                      |Added
           Ever confirmed  |0                            |1
           Assignee        |unassigned at gcc dot gnu.org|rguenth at gcc dot gnu.org
           Last reconfirmed|                             |2021-07-02
           Status          |UNCONFIRMED                  |ASSIGNED

--- Comment #1 from Richard Biener ---
I will have a look next week. A quick look shows FMAs being used, and the
addsub pattern can break FMA detection until we get general optab support for
fmaddsub and friends. So it might be { fma, fms } + blend compared to
addsub + mul, where the former may have lower latency, though Agner says
FMA (5c) + blend (1c) vs. ADDSUB (3c) + MUL (3c). As said, I have to look into
this in more detail.

double a[4], b[4], c[4];
void foo ()
{
  c[0] = a[0] - b[0] * c[0];
  c[1] = a[1] + b[1] * c[1];
  c[2] = a[2] - b[2] * c[2];
  c[3] = a[3] + b[3] * c[3];
}

        vmovapd a(%rip), %ymm2
        vmovapd b(%rip), %ymm1
        vmovapd b(%rip), %ymm0
        vfmadd132pd c(%rip), %ymm2, %ymm1
        vfnmadd132pd c(%rip), %ymm2, %ymm0
        vshufpd $10, %ymm1, %ymm0, %ymm0
        vmovapd %ymm0, c(%rip)

vs.

        vmovapd b(%rip), %ymm1
        vmovapd a(%rip), %ymm2
        vmulpd c(%rip), %ymm1, %ymm0
        vaddsubpd %ymm0, %ymm2, %ymm0
        vmovapd %ymm0, c(%rip)