https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113813

            Bug ID: 113813
           Summary: Reduction of xor/and/ior of 16 bytes can be improved
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pinskia at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Take:
```
#define SIGN unsigned
#define TYPE char
#define SIZE 16

void sor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for(int i = 0; i < SIZE; i++)
    b |= a[i];
  *r = b;
}

void sxor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for(int i = 0; i < SIZE; i++)
    b ^= a[i];
  *r = b;
}

void sand(SIGN TYPE *a, SIGN TYPE *r)
{

  SIGN TYPE b = -1;
  for(int i = 0; i < SIZE; i++)
    b &= a[i];
  *r = b;
}
```

Currently for sor GCC (at `-O3 -march=armv9-a+sve2 -fno-vect-cost-model`)
produces:
```
        ptrue   p7.b, vl16
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        mov     z30.b, #0
        sel     z30.b, p7, z31.b, z30.b
        orv     b30, p6, z30.b
        str     b30, [x1]
```

But this could be improved to just:
```
        ptrue   p7.b, vl16
        ld1b    z31.b, p7/z, [x0]
        orv     b30, p7, z31.b
        str     b30, [x1]
```
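
As a rough illustration of why the `mov`/`sel` pair and the second `ptrue` are redundant, here is a portable C model of the two sequences above. It is only a sketch. The helper names (`model_orv`, `model_sel`, `reduce_current`, `reduce_proposed`) and the `MAX_LANES` bound are hypothetical and are not GCC or ACLE APIs. The model assumes a predicated OR reduction ignores inactive lanes, so zeroing them first and reducing under an all-true predicate gives the same result as reducing directly under the `vl16` predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 256 /* assumed upper bound: 2048-bit vector of bytes */

/* Model of a predicated OR reduction: only active lanes contribute. */
static uint8_t model_orv(const bool *pred, const uint8_t *lanes, size_t nlanes)
{
  uint8_t acc = 0;
  for (size_t i = 0; i < nlanes; i++)
    if (pred[i])
      acc |= lanes[i];
  return acc;
}

/* Model of a select: out[i] = pred[i] ? a[i] : b[i]. */
static void model_sel(const bool *pred, const uint8_t *a, const uint8_t *b,
                      uint8_t *out, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    out[i] = pred[i] ? a[i] : b[i];
}

/* Current sequence: zero the inactive lanes, then reduce over all lanes. */
static uint8_t reduce_current(const uint8_t *lanes, size_t nlanes, size_t active)
{
  bool p_vl[MAX_LANES], p_all[MAX_LANES];
  uint8_t zero[MAX_LANES] = {0}, sel[MAX_LANES];

  assert(nlanes <= MAX_LANES && active <= nlanes);
  for (size_t i = 0; i < nlanes; i++) {
    p_vl[i] = i < active; /* ptrue p7.b, vl16 */
    p_all[i] = true;      /* ptrue p6.b, all  */
  }
  model_sel(p_vl, lanes, zero, sel, nlanes);
  return model_orv(p_all, sel, nlanes);
}

/* Proposed sequence: reduce directly under the vl16 predicate. */
static uint8_t reduce_proposed(const uint8_t *lanes, size_t nlanes, size_t active)
{
  bool p_vl[MAX_LANES];

  assert(nlanes <= MAX_LANES && active <= nlanes);
  for (size_t i = 0; i < nlanes; i++)
    p_vl[i] = i < active;
  return model_orv(p_vl, lanes, nlanes);
}
```

In this model both functions return the same value for any lane contents, including arbitrary data in the lanes beyond the first 16. For XOR the same argument holds with identity 0, and for AND with identity all-ones.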

The same applies to sxor and sand.
It also applies to short and int (with SIZE of 8 and 4 respectively).

Note that without -fno-vect-cost-model, the generated code is much worse (on
the trunk only).

Note that we should be able to use the SVE instruction when preferring NEON
auto-vectorization too.
