https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81434
            Bug ID: 81434
           Summary: AArch64 instruction fusing and pipeline scheduling problem
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilson at gcc dot gnu.org
  Target Milestone: ---

Consider this testcase extracted from OpenSSL sources

extern int Te0[256];
extern int Te1[256];
extern int Te2[256];
extern int Te3[256];

void
sub (unsigned int s0, unsigned int s1, unsigned int s2, unsigned int s3,
     unsigned int *rk, int *result)
{
  unsigned int t0;

  t0 = Te0[(s0 >> 24)       ] ^
       Te1[(s1 >> 16) & 0xff] ^
       Te2[(s2 >>  8) & 0xff] ^
       Te3[(s3      ) & 0xff] ^
       rk[4];
  *result = t0;
}

If I compile with
  -O2 -mcpu=cortex-a57 -fsched-verbose=9 -fdump-rtl-sched2 -S
and look at the tmp.c.295r.sched2 file, I see

;;  3--> b 0: i 17 x8=x8+low(`Te3')      :ca57_sx1|ca57_sx2
;;  4--> b 0: i 23 x7=high(`Te1')        :ca57_sx1|ca57_sx2
;;  4--> b 0: i 24 x7=x7+low(`Te1')      :ca57_sx1|ca57_sx2
;;  5--> b 0: i 27 x1=zxt(x1,0x8,0x10)   :ca57_sx1|ca57_sx2
;;  5--> b 0: i 28 x6=high(`Te0')        :ca57_sx1|ca57_sx2
;;  6--> b 0: i 29 x6=x6+low(`Te0')      :ca57_sx1|ca57_sx2
;;  7--> b 0: i 21 x8=zxn([x3*0x4+x8])   :ca57_load_model

The first thing to notice here is that we only scheduled one instruction
in cycle 3, even though we have two ALUs, an issue rate of 3, and there
are other ALU insns available to schedule. The second thing is that the
load was not scheduled until cycle 7, even though it was ready in cycle 4,
and there was an available issue slot for it.

Part of the problem here is that the AArch64 port uses SCHED_GROUP for
instruction fusing. The other part is that in the scheduler, when we have
a SCHED_GROUP, all non-sched-group instructions are forced to issue in the
next cycle. This is OK if you can only issue one instruction per cycle, or
if a sched group insn can't issue in this cycle. It is unnecessary and
wrong if we can issue multiple insns per cycle, and sched group insns can
all issue in the same cycle.
This is the case for AArch64 cortex-a57 instruction fusing. I can also
see the same issue for falkor, except that falkor is 4 issue, so there is
more of an effect on the schedule.

So considering the testcase above, we can't issue a second instruction in
cycle 3 because the low() is a sched group insn. We can't issue the load
in the fourth cycle because high/low are both sched group insns. We can't
issue the load in the fifth cycle because high is a sched group insn. We
can't issue the load in the sixth cycle because low is a sched group insn.
So the load finally issues in cycle 7 when we have no sched group insns
left.

I have a patch to fix this. With the patch, I instead get

;;  3--> b 0: i 17 x8=x8+low(`Te3')      :ca57_sx1|ca57_sx2
;;  3--> b 0: i 23 x7=high(`Te1')        :ca57_sx1|ca57_sx2
;;  4--> b 0: i 24 x7=x7+low(`Te1')      :ca57_sx1|ca57_sx2
;;  4--> b 0: i 27 x1=zxt(x1,0x8,0x10)   :ca57_sx1|ca57_sx2
;;  4--> b 0: i 21 x8=zxn([x3*0x4+x8])   :ca57_load_model
;;  5--> b 0: i 28 x6=high(`Te0')        :ca57_sx1|ca57_sx2
;;  5--> b 0: i 29 x6=x6+low(`Te0')      :ca57_sx1|ca57_sx2

which looks better. Without the patch, the testcase takes 22 cycles
according to the scheduler. With the patch, the testcase takes 19 cycles
according to the scheduler.
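
The difference between the two schedules can be illustrated with a small
toy model. This is only a sketch under stated assumptions: it is not GCC's
scheduler code and not the patch. The names (simulate, insns,
exclusive_groups) are hypothetical, and the two "old rule" steps are a
reading of the narrative above rather than the real implementation. The
two ALUs and the issue rate of 3 come from the report; the single load
unit, the one-cycle latency from i17 to the load, and the same-cycle issue
of a fused high/low pair are assumptions. Only the seven insns shown in
the excerpts are modelled.

```c
/* Toy model of the two issue policies described above.  NOT GCC code;
   every name and rule here is hypothetical.  */

enum unit { ALU, LOAD };
enum kind { PLAIN, HIGH, LOW };  /* HIGH/LOW are a fused pair, LOW directly after HIGH */

struct insn {
  const char *name;
  enum unit unit;
  enum kind kind;
  int dep;  /* index of the producing insn, or -1 if already available */
  int lat;  /* cycles that must pass after the producer issues */
};

enum { N = 7, ISSUE_RATE = 3, NUM_ALU = 2, NUM_LOAD = 1, FIRST_CYCLE = 3 };

/* Priority order follows the dump; latencies are assumptions.  */
static const struct insn insns[N] = {
  { "i17 low(Te3)",  ALU,  LOW,   -1, 0 },  /* assumed: its HIGH issued before cycle 3 */
  { "i23 high(Te1)", ALU,  HIGH,  -1, 0 },
  { "i24 low(Te1)",  ALU,  LOW,    1, 0 },  /* fused: may share a cycle with i23 */
  { "i27 zxt",       ALU,  PLAIN, -1, 0 },
  { "i28 high(Te0)", ALU,  HIGH,  -1, 0 },
  { "i29 low(Te0)",  ALU,  LOW,    4, 0 },
  { "i21 load",      LOAD, PLAIN,  0, 1 },  /* reads x8 written by i17 */
};

static int
ready (int i, int cycle, const int issue[])
{
  int d = insns[i].dep;
  return d < 0 || (issue[d] >= 0 && issue[d] + insns[i].lat <= cycle);
}

/* Fill issue[] with the issue cycle of each insn and return the last
   cycle used.  exclusive_groups != 0 selects the old behaviour.  */
int
simulate (int exclusive_groups, int issue[])
{
  int done = 0, cycle = FIRST_CYCLE - 1;

  for (int i = 0; i < N; i++)
    issue[i] = -1;

  while (done < N)
    {
      int slots = ISSUE_RATE, group_open = 0, pending = -1;
      int free_units[2] = { NUM_ALU, NUM_LOAD };  /* indexed by enum unit */

      cycle++;

      /* Old rule, step 1: a LOW whose HIGH issued in an earlier cycle
         issues alone; everything else waits for the next cycle.  */
      for (int i = 0; exclusive_groups && i < N && pending < 0; i++)
        if (insns[i].kind == LOW && issue[i] < 0 && ready (i, cycle, issue))
          pending = i;
      if (pending >= 0)
        {
          issue[pending] = cycle;
          done++;
          continue;
        }

      for (int i = 0; i < N && slots > 0; i++)
        {
          int issued_now = 0;

          if (issue[i] >= 0)
            continue;
          if (ready (i, cycle, issue) && free_units[insns[i].unit] > 0)
            {
              issue[i] = cycle;
              free_units[insns[i].unit]--;
              slots--;
              done++;
              issued_now = 1;
            }
          if (!exclusive_groups)
            continue;
          /* Old rule, step 2: once a HIGH has issued, only its LOW may
             follow, and the cycle ends whether or not the LOW fit.  */
          if (group_open)
            break;
          group_open = issued_now && insns[i].kind == HIGH;
        }
    }
  return cycle;
}
```

Under these assumptions, simulate (1, issue) places the load in cycle 7
and simulate (0, issue) places it in cycle 4, with the other six insns
landing in the same cycles as in the two excerpts. Since the model covers
only this excerpt, it says nothing about the 22 and 19 cycle totals.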