https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895
Bug ID: 101895
Summary: [11/12 Regression] SLP Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further optimization opportunities
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: law at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---

Consider this code:

void foo(int * restrict a, int b, int *c)
{
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguent...@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.  It currently uses
    graphds to compute a postorder on the reverse SLP graph and it
    handles all cases vect_attempt_slp_rearrange_stmts did (hopefully
    - I've adjusted most testcases that triggered it a few days ago).
    It restricts itself to move around bijective permutations to
    simplify things for now, mainly around constant nodes.

    As a prerequesite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus
    should get cycles more obviously correct.
[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;

This is good.  Note how the VEC_PERM_EXPR happens before the vector
multiply and how the vector multiply directly feeds the vector add.
On our target we have a vector multiply-add which would be generated
and all is good.

After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;

Note how we have the vmul, then the permute, then the vadd.  This
spoils our ability to generate a vmadd.  This behavior is still seen
on the trunk as well.

Conceptually it seems to me that having a permute at the start or end
of a chain of vector operations is better than moving the permute
into the middle of a chain of dependent vector operations.  We could
probably fix this in the backend with some special patterns, but ISTM
that getting it right in SLP would be better.
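
For what it's worth, the reason the permute is free to sit on either
side of the multiply here is that the multiply's second operand is a
uniform splat of b, so a lane permute commutes with the multiply.  A
minimal, self-contained sketch of that identity using GCC's generic
vector extensions (the values and variable names below are
illustrative only, not taken from the dumps above):

#include <assert.h>

typedef int v4si __attribute__ ((vector_size (16)));

int
main (void)
{
  v4si c = { 10, 20, 30, 40 };   /* stands in for the load from c[]   */
  v4si b = { 7, 7, 7, 7 };       /* splat of the scalar b             */
  v4si mask = { 0, 2, 1, 3 };    /* the { 0, 2, 1, 3 } permute        */

  /* After commit 126ed72b: multiply first, then permute the
     product, which separates the multiply from the add.  */
  v4si perm_after_mul = __builtin_shuffle (c * b, mask);

  /* Before the commit: permute the loaded operand, then multiply,
     leaving the multiply feeding the add directly and enabling a
     fused multiply-add on targets that have one.  */
  v4si perm_before_mul = __builtin_shuffle (c, mask) * b;

  /* Element-wise identical because b is uniform across lanes.  */
  for (int i = 0; i < 4; i++)
    assert (perm_after_mul[i] == perm_before_mul[i]);

  return 0;
}

So pushing the permute back onto the loaded operand, as the old code
did, changes nothing semantically; it only restores the mul -> add
chain that a multiply-add pattern can match.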