https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895

            Bug ID: 101895
           Summary: [11/12 Regression] SLP Vectorizer change pushes
                    VEC_PERM_EXPR into bad location spoiling further
                    optimization opportunities
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Consider this code:


void foo(int * restrict a, int b, int *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguent...@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.

    It currently uses graphds to compute a postorder on the reverse
    SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
    did (hopefully - I've adjusted most testcases that triggered it
    a few days ago).  It restricts itself to move around bijective
    permutations to simplify things for now, mainly around constant nodes.

    As a prerequisite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus should
    get cycles more obviously correct.

[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;



This is good.  Note how the VEC_PERM_EXPR happens before the vector multiply
and how the vector multiply directly feeds the vector add.  Our target has a
vector multiply-add instruction, which would be generated here, and all is good.
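
For reference, a rough GNU C vector-extension equivalent of this "good" shape
(the function and variable names below are only illustrative; they are not
taken from the dump) would be:

typedef int v4si __attribute__ ((vector_size (16)));

/* Permute the loaded c vector first; the multiply then feeds the add
   directly, so a target multiply-add pattern can pick up the pair.  */
v4si foo_good (v4si c, v4si a, int b)
{
  v4si mask = { 0, 2, 1, 3 };
  v4si bvec = { b, b, b, b };
  v4si t = __builtin_shuffle (c, mask);  /* VEC_PERM_EXPR before the multiply.  */
  return t * bvec + a;                   /* multiply feeds the add directly.  */
}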


After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;


Note the ordering: the vmul, then the permute, then the vadd.  With the permute
sitting between the multiply and the add we can no longer generate a vmadd.
This behavior is still seen on trunk as well.

Conceptually it seems to me that having a permute at the start or end of a
chain of vector operations is better than moving the permute into the middle of
a chain of dependent vector operations.
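
To make that concrete, here is a counterpart sketch of the placement the new
code picks (again with illustrative names, not taken from the dumps).  Since b
is splatted, the permute commutes with the element-wise multiply, so both
placements compute the same values; the difference is only where the permute
sits in the dependence chain:

typedef int v4si __attribute__ ((vector_size (16)));

/* Same computation as the earlier sketch, but with the permute sunk
   between the multiply and the add, mirroring the post-commit GIMPLE.
   The shuffle now separates the multiply from the add, so they cannot
   be fused into a multiply-add.  */
v4si foo_bad (v4si c, v4si a, int b)
{
  v4si mask = { 0, 2, 1, 3 };
  v4si bvec = { b, b, b, b };
  v4si t = c * bvec;                 /* multiply first ...  */
  t = __builtin_shuffle (t, mask);   /* ... then the VEC_PERM_EXPR ...  */
  return t + a;                      /* ... then the add.  */
}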

We could probably fix this in the backend with some special patterns, but ISTM
that getting it right in SLP would be better.
