https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77568

            Bug ID: 77568
           Summary: [7 regression] CSE/PRE/Hoisting blocks common
                    instruction contractions
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wdijkstr at arm dot com
  Target Milestone: ---

The recently introduced code hoisting pass aggressively hoists common
subexpressions that could otherwise be merged with other operations into a
single instruction. This caused a large regression in one benchmark. A simple
reduced test shows the issue:

float f(float x, float y, float z, int a)
{
   if (a > 100)
     x += y * z;
   else
     x -= y * z;
   return x;
}

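This now produces the following on AArch64, with a separate fmul that can no
longer be contracted into fused multiply-add/subtract instructions: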

f:
        fmul    s2, s1, s2
        cmp     w0, 100
        fadd    s1, s0, s2
        fsub    s0, s0, s2
        fcsel   s0, s0, s1, le
        ret
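
As a rough source-level illustration (a sketch, not actual compiler output),
hoisting effectively rewrites the test as shown below. The names f_original
and f_hoisted and the temporary t are hypothetical and chosen only for this
example. Once y * z lives in its own temporary with two uses, neither the add
nor the subtract can absorb the multiply.

```c
/* The reduced test from the report: each branch has its own y * z,
   so each branch could be contracted into one fused multiply-add/subtract. */
float f_original(float x, float y, float z, int a)
{
    if (a > 100)
        x += y * z;
    else
        x -= y * z;
    return x;
}

/* Hypothetical source-level view of the same function after hoisting:
   the multiply is computed once, ahead of the branch, and shared. */
float f_hoisted(float x, float y, float z, int a)
{
    float t = y * z;   /* hoisted common subexpression: a standalone multiply */
    if (a > 100)
        x += t;        /* plain add, nothing left to fuse with */
    else
        x -= t;        /* plain subtract */
    return x;
}
```

Both functions compute the same value for inputs where every intermediate is
exactly representable; the difference is only in which instructions the back
end is able to form.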

Note the issue is not limited to hoisting; CSE/PRE cause similar issues:

void g(int, int);
int f2(int x)
{
  g(x, x+1);
  g(x, x+1);
  return x+1;
}

f2:
        stp     x29, x30, [sp, -32]!
        add     x29, sp, 0
        stp     x19, x20, [sp, 16]
        add     w19, w0, 1
        mov     w20, w0
        mov     w1, w19
        bl      g
        mov     w1, w19
        mov     w0, w20
        bl      g
        mov     w0, w19
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret

Given x+1 is used as a function argument, there is no benefit in making it
available as a CSE after each call - repeating the addition is cheaper than
using an extra callee-save and copying it several times.
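
The two alternatives can be sketched at source level as follows. This is only
an illustration: the stub g and its counters, and the names f2_cse and
f2_recompute, are hypothetical, and an optimizing compiler may well turn both
functions into the same code. The point is which values must stay live across
the calls.

```c
/* Hypothetical stub standing in for the external g() from the report.
   It only counts calls and sums the second argument so that the two
   variants below can be compared. */
static int g_calls;
static int g_sum;

void g(int a, int b)
{
    (void)a;
    g_calls++;
    g_sum += b;
}

/* Roughly what CSE makes of f2: x + 1 is computed once, so both x and
   the temporary t must survive the calls (two callee-saved registers). */
int f2_cse(int x)
{
    int t = x + 1;
    g(x, t);
    g(x, t);
    return t;
}

/* The form argued for in the report: recompute x + 1 at each use, so
   only x has to survive the calls. */
int f2_recompute(int x)
{
    g(x, x + 1);
    g(x, x + 1);
    return x + 1;
}
```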

The following shows a similar issue for bit tests. Most targets support ANDS or
bit test as a single instruction (or even bit test+branch), so CSEing the
(x & C) actually makes things worse:

void f3(char *p, int x)
{
  if (x & 1) p[0] = 0;
  if (x & 2) p[1] = 0;
  if (x & 4) p[2] = 0;
  if (x & 8) p[2] = 0;
  g(0,0);
  if (x & 1) p[3] = 0;
  if (x & 2) p[4] = 0;
  if (x & 4) p[5] = 0;
  if (x & 8) p[6] = 0;
}

This uses 4 callee-saves to hold the (x & C) CSEs. With several such bit tests
in a more complex function, you quickly run out of registers.

Given it would be much harder to undo these CSEs at RTL level (combine does
most contractions but can only do it intra-block), would it be reasonable to
block CSEs for these special cases?
