https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895
Bug ID: 101895
Summary: [11/12 Regression] SLP Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further optimization opportunities
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---

Consider this code:

void foo(int * restrict a, int b, int *c)
{
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguent...@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.  It currently uses
    graphds to compute a postorder on the reverse SLP graph and it
    handles all cases vect_attempt_slp_rearrange_stmts did (hopefully
    - I've adjusted most testcases that triggered it a few days ago).
    It restricts itself to move around bijective permutations to
    simplify things for now, mainly around constant nodes.

    As a prerequesite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus
    should get cycles more obviously correct.
[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;

This is good.  Note how the VEC_PERM_EXPR happens before the vector
multiply and how the vector multiply directly feeds the vector add.
On our target we have a vector multiply-add which would be generated
and all is good.

After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;

Note how we have the vmul, then the permute, then the vadd.  This
spoils our ability to generate a vmadd.  This behavior is still seen
on the trunk as well.

Conceptually it seems to me that having a permute at the start or end
of a chain of vector operations is better than moving the permute
into the middle of a chain of dependent vector operations.  We could
probably fix this in the backend with some special patterns, but ISTM
that getting it right in SLP would be better.
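
For what it's worth, the reason the permute is free to sit on either
side of the multiply here is that the multiply's second operand is a
uniform splat of b, so a lane permute commutes with the multiply.  A
minimal, self-contained sketch of that identity using GCC's generic
vector extensions (the values and variable names below are
illustrative only, not taken from the dumps above):

#include <assert.h>

typedef int v4si __attribute__ ((vector_size (16)));

int
main (void)
{
  v4si c = { 10, 20, 30, 40 };   /* stands in for the load from c[]   */
  v4si b = { 7, 7, 7, 7 };       /* splat of the scalar b             */
  v4si mask = { 0, 2, 1, 3 };    /* the { 0, 2, 1, 3 } permute        */

  /* After commit 126ed72b: multiply first, then permute the
     product, which separates the multiply from the add.  */
  v4si perm_after_mul = __builtin_shuffle (c * b, mask);

  /* Before the commit: permute the loaded operand, then multiply,
     leaving the multiply feeding the add directly and enabling a
     fused multiply-add on targets that have one.  */
  v4si perm_before_mul = __builtin_shuffle (c, mask) * b;

  /* Element-wise identical because b is uniform across lanes.  */
  for (int i = 0; i < 4; i++)
    assert (perm_after_mul[i] == perm_before_mul[i]);

  return 0;
}

So pushing the permute back onto the loaded operand, as the old code
did, changes nothing semantically; it only restores the mul -> add
chain that a multiply-add pattern can match.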