https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69848
Bug ID: 69848
Summary: poor vectorization of a loop from SPEC2006 464.h264ref
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wilson at gcc dot gnu.org
Target Milestone: ---

This is a continuation of bug 69282, which reported an ICE on the same loop, which has since been fixed. There is still the problem that the code is poorly optimized. These problems can be seen on both armhf and aarch64. There are multiple problems here.

The testcase is

#include <stdlib.h>

int fn1 (int) __attribute__ ((noinline));

int a[32];

int fn1(int d)
{
  int c = 1;
  for (int b = 0; b < 32; b++)
    if (a[b])
      c = 0;
  return c;
}

int main (void)
{
  int i;
  for (i = 0; i < 32; i++)
    a[i] = 0;
  if (fn1(10) != 1)
    abort ();
  a[3] = 2;
  a[24] = 1;
  if (fn1(10) != 0)
    abort ();
  return 0;
}

Compiled with -O2 -ftree-vectorize, the inner loop of fn1 is

.L2:
        ldr     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 128
        cmeq    v0.4s, v0.4s, #0
        not     v0.16b, v0.16b
        cmlt    v0.4s, v0.4s, #0
        bit     v1.16b, v2.16b, v0.16b
        bic     v3.16b, v3.16b, v0.16b
        add     v2.4s, v2.4s, v4.4s
        bne     .L2

The cmlt instruction serves no useful purpose, as the output is the same as the input. This can be fixed by adding the missing vcond_mask* patterns to the armhf and aarch64 ports.

The not instruction is unnecessary. It can be eliminated by changing the bit/bic instructions into bif/and. This might be possible via combine, and might require rewriting some aarch64/armhf patterns to use vector rtl instead of unspecs.

The v2 iterator is computing the index in the array as a vector, which is info we don't need. We only need the info in v3. We can eliminate the instructions setting v1 and v2, plus the instructions before the loop setting v1, v2, and v4, and the instructions after the loop using v1.

Also, related to that, after the loop, we have two reductions.
        umaxv   s0, v1.4s
        dup     v0.4s, v0.s[0]
        cmeq    v1.4s, v1.4s, v0.4s
        and     v1.16b, v3.16b, v1.16b
        umaxv   s1, v1.4s
        umov    w0, v1.s[0]

We only need one reduction here, and we only need the info in v3. This can be simplified to

        uminv   s1, v3.4s
        umov    w0, v1.s[0]

I don't know offhand what vectorizer changes are necessary to make these last two transformations.

I verified that these transformations work on aarch64. Before the transformations, we have 8 instructions before the loop, 10 instructions inside the loop, and 6 instructions after the loop. After the transformations, we have 4 instructions before the loop, 6 instructions inside the loop, and 2 instructions after the loop. So it is half the size statically, and roughly 60% of the original size dynamically.
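
For illustration, the simplified loop and single reduction described above can be modelled lane by lane in plain C. This is only a sketch under stated assumptions: fn1_scalar and fn1_lanes are hypothetical names that do not appear in the testcase, the array is passed as a parameter instead of being a global, and the mapping of each statement to cmeq, and, and uminv is an assumed reading of the proposal, not compiler output.

```c
#include <stdint.h>

#define N     32   /* array length used by the testcase */
#define LANES 4    /* a 128-bit vector holds four 32-bit lanes */

/* Scalar reference with the same logic as fn1 in the testcase
   (hypothetical name; takes the array as a parameter). */
int fn1_scalar(const int32_t *a)
{
    int c = 1;
    for (int b = 0; b < N; b++)
        if (a[b])
            c = 0;
    return c;
}

/* Lane-wise model of the proposed simplified code (hypothetical name).
   Only the accumulator that plays the role of v3 is kept; the index
   vector and its separate reduction are gone. */
int fn1_lanes(const int32_t *a)
{
    uint32_t v3[LANES];

    /* Before the loop: every lane starts with c = 1. */
    for (int l = 0; l < LANES; l++)
        v3[l] = 1;

    for (int b = 0; b < N; b += LANES) {
        for (int l = 0; l < LANES; l++) {
            /* cmeq #0: all-ones where the element is zero, else zero. */
            uint32_t is_zero = (a[b + l] == 0) ? UINT32_MAX : 0;
            /* and: a lane keeps its value only while elements are zero. */
            v3[l] &= is_zero;
        }
    }

    /* uminv: one unsigned minimum across lanes; any cleared lane
       makes the result 0. */
    uint32_t m = v3[0];
    for (int l = 1; l < LANES; l++)
        if (v3[l] < m)
            m = v3[l];
    return (int)m;
}
```

With all elements zero both functions return 1, and after setting a[3] = 2 and a[24] = 1 as in the testcase both return 0. A single minimum reduction suffices because a cleared lane is never set again.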