[Bug tree-optimization/69848] New: poor vectorization of a loop from SPEC2006 464.h264ref

wilson at gcc dot gnu.org Tue, 16 Feb 2016 17:35:57 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69848


            Bug ID: 69848
           Summary: poor vectorization of a loop from SPEC2006 464.h264ref
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

This is a continuation of bug 69282, which reported an ICE on the same loop.
The ICE has since been fixed, but the code is still poorly optimized.  The
problems can be seen on both armhf and aarch64, and there are several of them.

The testcase is

#include <stdlib.h>

int fn1 (int) __attribute__ ((noinline));

int a[32];
int fn1(int d) {
  int c = 1;
  for (int b = 0; b < 32; b++)
    if (a[b])
      c = 0;
  return c;
}

int
main (void)
{
  int i;
  for (i = 0; i < 32; i++)
    a[i] = 0;
  if (fn1(10) != 1)
    abort ();
  a[3] = 2;
  a[24] = 1;
  if (fn1(10) != 0)
    abort ();
  return 0;
}

Compiled with -O2 -ftree-vectorize for aarch64, the vectorized loop in fn1 is
.L2:
        ldr     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 128
        cmeq    v0.4s, v0.4s, #0
        not     v0.16b, v0.16b
        cmlt    v0.4s, v0.4s, #0
        bit     v1.16b, v2.16b, v0.16b
        bic     v3.16b, v3.16b, v0.16b
        add     v2.4s, v2.4s, v4.4s
        bne     .L2

The cmlt instruction serves no useful purpose, as the output is the same as the
input.  This can be fixed by adding the missing vcond_mask* patterns to the
armhf and aarch64 ports.

The not instruction is unnecessary.  It can be eliminated by changing the
bit/bic instructions into bif/and.  This might be possible via combine, and
might require rewriting some aarch64/armhf patterns to use vector rtl instead
of unspecs.
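
The two claims above (cmlt is an identity on a compare mask, and not + bit/bic
is equivalent to bif/and) can be checked with a scalar model of one 32-bit
lane.  This is only a sketch: the helper names are illustrative, and the models
follow the architectural descriptions of the instructions rather than anything
in the gcc sources.

```c
#include <stdint.h>

/* Scalar models of one 32-bit lane.  All names here are illustrative.  */

/* cmlt Vd, Vn, #0: all-ones if the signed lane is negative, else zero.  */
static uint32_t cmlt_zero (uint32_t n) { return (n >> 31) ? UINT32_MAX : 0; }

/* bit Vd, Vn, Vm: copy bits of n into d where m has a 1 bit.  */
static uint32_t bit (uint32_t d, uint32_t n, uint32_t m) { return (d & ~m) | (n & m); }

/* bif Vd, Vn, Vm: copy bits of n into d where m has a 0 bit.  */
static uint32_t bif (uint32_t d, uint32_t n, uint32_t m) { return (d & m) | (n & ~m); }

/* bic Vd, Vn, Vm: n AND NOT m.  */
static uint32_t bic (uint32_t n, uint32_t m) { return n & ~m; }

/* Return 1 if the rewritten sequence matches the original for this lane.
   d and n stand for the v1 and v2 lanes, acc for the v3 lane.  */
int lane_equivalent (uint32_t d, uint32_t n, uint32_t acc, int32_t elem)
{
  uint32_t eq = (elem == 0) ? UINT32_MAX : 0;  /* cmeq #0 */
  uint32_t ne = ~eq;                           /* not */
  uint32_t m = cmlt_zero (ne);                 /* cmlt #0 */

  return m == ne                               /* cmlt changed nothing */
         && bit (d, n, m) == bif (d, n, eq)    /* not + bit == bif */
         && bic (acc, m) == (acc & eq);        /* not + bic == and */
}
```

lane_equivalent returns 1 for both zero and nonzero elements, because a compare
mask is always all-zeros or all-ones in each lane.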

The v2 iterator is computing the index in the array as a vector, which is info
we don't need.  We only need the info in v3.  We can eliminate the instructions
setting v1 and v2, plus the instructions before the loop setting v1, v2, and
v4, and the instructions after the loop using v1.

Also, related to that, after the loop, we have two reductions.
        umaxv   s0, v1.4s
        dup     v0.4s, v0.s[0]
        cmeq    v1.4s, v1.4s, v0.4s
        and     v1.16b, v3.16b, v1.16b
        umaxv   s1, v1.4s
        umov    w0, v1.s[0]
We only need one reduction here, and we only need the info in v3.  This can be
simplified to
        uminv   s1, v3.4s
        umov    w0, v1.s[0]

I don't know offhand what vectorizer changes are necessary to make these last
two transformations.
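
For reference, the simplified code corresponds roughly to the following scalar
model of the four lanes.  This is a hand-written sketch, not compiler output.
It assumes v3 starts with the initial value of c (1) in every lane, and
fn1_lanes is a hypothetical name.

```c
#include <stdint.h>

int a[32];

/* Hypothetical lane-by-lane model of the simplified vector code.  */
int fn1_lanes (void)
{
  uint32_t v3[4] = {1, 1, 1, 1};   /* assumed: one copy of c per lane */

  for (int b = 0; b < 32; b += 4)
    for (int lane = 0; lane < 4; lane++)
      {
        uint32_t eq = (a[b + lane] == 0) ? UINT32_MAX : 0;  /* cmeq #0 */
        v3[lane] &= eq;                                     /* and */
      }

  /* uminv: a single unsigned-minimum reduction across the lanes.  */
  uint32_t min = v3[0];
  for (int lane = 1; lane < 4; lane++)
    if (v3[lane] < min)
      min = v3[lane];
  return (int) min;
}
```

A lane is cleared as soon as it sees a nonzero element and never set again, so
the minimum across the lanes is 0 exactly when some a[b] was nonzero.  No index
vector is involved.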

I verified that these transformations work on aarch64.  Before the
transformations, we have 8 instructions before the loop, 10 instructions inside
the loop, and 6 instructions after the loop.  After the transformations, we
have 4 instructions before the loop, 6 instructions inside the loop, and 2
instructions after the loop.  So the code is half the size statically, and
executes roughly 60% as many instructions dynamically.
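
The dynamic figure can be reproduced from the counts above.  The loop steps by
16 bytes up to 128, so it runs 8 iterations.  A small helper (dynamic_count is
an illustrative name) makes the arithmetic explicit:

```c
/* Instructions executed: before the loop, plus the loop body once per
   iteration, plus after the loop.  */
int dynamic_count (int before, int inside, int after, int iterations)
{
  return before + inside * iterations + after;
}
```

With 8 iterations this gives 8 + 10*8 + 6 = 94 instructions before the
transformations and 4 + 6*8 + 2 = 54 after, a ratio of about 57%.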
