https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434

            Bug ID: 81434
           Summary: AArch64 instruction fusing and pipeline scheduling
                    problem
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

Consider this testcase, extracted from the OpenSSL sources:

extern int Te0[256];
extern int Te1[256];
extern int Te2[256];
extern int Te3[256];

void
sub (unsigned int s0, unsigned int s1, unsigned int s2, unsigned int s3,
     unsigned int *rk, int *result)
{
  unsigned int t0;

  t0 =
    Te0[(s0 >> 24)       ] ^
    Te1[(s1 >> 16) & 0xff] ^
    Te2[(s2 >>  8) & 0xff] ^
    Te3[(s3      ) & 0xff] ^
    rk[4];

  *result = t0;
}

If I compile with -O2 -mcpu=cortex-a57 -fsched-verbose=9
-fdump-rtl-sched2 -S and look at the tmp.c.295r.sched2 file, I see

;;        3--> b  0: i  17 x8=x8+low(`Te3')                       :ca57_sx1|ca57_sx2
;;        4--> b  0: i  23 x7=high(`Te1')                         :ca57_sx1|ca57_sx2
;;        4--> b  0: i  24 x7=x7+low(`Te1')                       :ca57_sx1|ca57_sx2
;;        5--> b  0: i  27 x1=zxt(x1,0x8,0x10)                    :ca57_sx1|ca57_sx2
;;        5--> b  0: i  28 x6=high(`Te0')                         :ca57_sx1|ca57_sx2
;;        6--> b  0: i  29 x6=x6+low(`Te0')                       :ca57_sx1|ca57_sx2
;;        7--> b  0: i  21 x8=zxn([x3*0x4+x8])                    :ca57_load_model

The first thing to notice here is that we only scheduled one instruction in
cycle 3, even though we have two ALUs, an issue rate of 3, and there are other
ALU insns available to schedule.  The second thing is that the load was not
scheduled until cycle 7, even though it was ready in cycle 4, and there was an
available issue slot for it.

Part of the problem here is that the AArch64 port uses SCHED_GROUP for
instruction fusing.  The other part is that in the scheduler, when we have a
SCHED_GROUP, all non-sched-group instructions are forced to issue in the next
cycle.  This is OK if you can only issue one instruction per cycle, or if a
sched group insn can't issue in this cycle.  It is unnecessary and wrong if we
can issue multiple insns per cycle, and sched group insns can all issue in the
same cycle.  This is the case for AArch64 cortex-a57 instruction fusing.  I can
also see the same issue for falkor, except that falkor is 4 issue, so there is
more of an effect on the schedule.

So considering the testcase above, we can't issue a second instruction in cycle
3 because the low() is a sched group insn.  We can't issue the load in the
fourth cycle because high/low are both sched group insns.  We can't issue the
load in the fifth cycle because high is a sched group insn.  We can't issue the
load in the sixth cycle because low is a sched group insn.  So the load finally
issues in cycle 7 when we have no sched group insns left.
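
The cycle-by-cycle reasoning above can be checked with a toy model.  The
sketch below is not GCC's scheduler and not the patch; it is a minimal
greedy issue loop written only for these seven insns.  The priority order
of the list, the ready cycles (other than the load's cycle 4), the single
load pipe, and the names simulate, strict, and group are all assumptions
made for illustration.  Here group marks only the low() half of each fused
pair, which must issue directly after its high().

```c
#include <assert.h>
#include <stdbool.h>

#define N_INSNS 7
#define ISSUE_RATE 3   /* insns per cycle, as stated in the report */
#define N_ALUS 2       /* ca57_sx1 and ca57_sx2 */
#define N_LOADS 1      /* assumed: one load pipe */

enum unit { ALU, LOAD };

struct insn
{
  const char *name;
  enum unit unit;
  bool group;  /* second half of a fused pair: must follow insn i-1 */
  int ready;   /* earliest issue cycle (assumed, except the load's 4) */
};

/* Assumed priority order.  The high(Te3) that pairs with i17 issued
   before this window, so i17 is treated as having its first half done.  */
static const struct insn insns[N_INSNS] = {
  { "i17 low(Te3)",  ALU,  true,  3 },
  { "i23 high(Te1)", ALU,  false, 3 },
  { "i24 low(Te1)",  ALU,  true,  3 },
  { "i27 zxt",       ALU,  false, 3 },
  { "i28 high(Te0)", ALU,  false, 3 },
  { "i29 low(Te0)",  ALU,  true,  3 },
  { "i21 load",      LOAD, false, 4 },
};

/* Fill cycle_of[i] with the issue cycle of insns[i].  With STRICT set,
   once a group insn has issued, every non-group insn waits for the next
   cycle (the behavior described above).  Without it, the remaining
   issue slots of the cycle may still be filled.  */
static void
simulate (bool strict, int cycle_of[N_INSNS])
{
  bool done[N_INSNS] = { false };
  int remaining = N_INSNS;

  for (int cycle = 3; remaining > 0; cycle++)
    {
      int issued = 0, alus = 0, loads = 0;
      bool group_issued = false;

      assert (cycle < 64);  /* guard: the toy input always terminates */
      for (int i = 0; i < N_INSNS && issued < ISSUE_RATE; i++)
        {
          const struct insn *in = &insns[i];

          if (done[i] || in->ready > cycle)
            continue;
          /* A second half cannot issue before its first half.  */
          if (in->group && i > 0 && !done[i - 1])
            continue;
          if (strict && group_issued && !in->group)
            continue;

          bool has_unit = (in->unit == ALU) ? alus < N_ALUS
                                            : loads < N_LOADS;
          if (!has_unit)
            {
              /* A pending second half must issue next, so nothing
                 later in the list may overtake it in this cycle.  */
              if (in->group)
                break;
              continue;
            }

          if (in->unit == ALU)
            alus++;
          else
            loads++;
          done[i] = true;
          cycle_of[i] = cycle;
          issued++;
          remaining--;
          if (in->group)
            group_issued = true;
        }
    }
}
```

Under these assumptions, simulate (true, c) fills c with 3, 4, 4, 5, 5, 6, 7
in list order, so the load lands in cycle 7 as in the dump above, and
simulate (false, c) gives 3, 3, 4, 4, 5, 5, 4, so the load moves to cycle 4.
Adjacency of a fused pair is approximated by list order here, which is
enough for this trace but not a general model.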

I have a patch to fix this.  With the patch, I instead get

;;        3--> b  0: i  17 x8=x8+low(`Te3')                       :ca57_sx1|ca57_sx2
;;        3--> b  0: i  23 x7=high(`Te1')                         :ca57_sx1|ca57_sx2
;;        4--> b  0: i  24 x7=x7+low(`Te1')                       :ca57_sx1|ca57_sx2
;;        4--> b  0: i  27 x1=zxt(x1,0x8,0x10)                    :ca57_sx1|ca57_sx2
;;        4--> b  0: i  21 x8=zxn([x3*0x4+x8])                    :ca57_load_model
;;        5--> b  0: i  28 x6=high(`Te0')                         :ca57_sx1|ca57_sx2
;;        5--> b  0: i  29 x6=x6+low(`Te0')                       :ca57_sx1|ca57_sx2

which looks better.  Without the patch, the testcase takes 22 cycles according
to the scheduler.  With the patch, the testcase takes 19 cycles according to
the scheduler.
