https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90292

            Bug ID: 90292
           Summary: GCC Fails to hoist loop invariant in nested loops
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: giuliano.belinassi at usp dot br
  Target Milestone: ---

Both GCC 8.3.0 and 9.0.1 fail to hoist the index calculations involving the
'i' and 'k' variables, which are invariant in the innermost loop of the
following function (matrix multiplication with blocking):

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
double *restrict _B)
{
    #define BLOCK_I 128
    #define BLOCK_K 8

    #define A(i, j) _A[n*(i) + (j)]
    #define B(i, j) _B[n*(i) + (j)]
    #define C(i, j) _C[n*(i) + (j)]
    unsigned ii, kk, i, j, k;

    for (ii = 0; ii < n; ii += BLOCK_I)
        for (kk = 0; kk < n; kk += BLOCK_K)
            for (i = ii; i < ii + BLOCK_I; ++i)
                for (k = kk; k < kk + BLOCK_K; ++k)
                    for (j = 0; j < n; ++j)
                       C(i, j) += A(i, k) * B(k, j);
}

which generates the following GIMPLE code for the innermost block (-O2):

  <bb 4> [local count: 955630225]:
  # ivtmp.3_88 = PHI <_1(6), ivtmp.3_87(4)>
  _3 = (long unsigned int) ivtmp.3_88;
  _4 = _3 * 8;
  _5 = _C_34(D) + _4; 
  _6 = *_5;
  _13 = ivtmp.14_81 + ivtmp.3_88;
  _14 = (long unsigned int) _13;
  _15 = _14 * 8;
  _16 = _B_36(D) + _15;
  _17 = *_16;
  _18 = _11 * _17;
  _19 = _6 + _18;
  *_5 = _19;
  ivtmp.3_87 = ivtmp.3_88 + 1;
  if (ivtmp.25_71 != ivtmp.3_87)
    goto <bb 4>; [89.00%]
  else
    goto <bb 5>; [11.00%]
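
Read back as C, the block above does roughly the following (a sketch; start,
end, off and a are illustrative names for the SSA values in the dump): every
iteration rebuilds both addresses from the induction variable with a multiply
and an add, instead of just bumping two pointers.

/* Rough source-level reading of <bb 4>; start/end/off/a are illustrative
   stand-ins for the SSA values above (a = A(i, k), already hoisted as _11). */
for (unsigned long iv = start; iv != end; ++iv) {
    double *cp = _C + iv;       /* _4 = iv * 8;  _5 = _C + _4            */
    double  bv = _B[off + iv];  /* _15 = (off + iv) * 8;  _16 = _B + _15 */
    *cp += a * bv;              /* _19 = *_5 + _11 * _17;  *_5 = _19     */
}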

If I modify the function to hoist the invariant parts by hand and turn the
addressing into pointer increments:

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
double *restrict _B) 
{
    #define BLOCK_I 128
    #define BLOCK_K 8

    #define A(i, j) _A[n*(i) + (j)]
    #define B(i, j) _B[n*(i) + (j)]
    #define C(i, j) _C[n*(i) + (j)]
    unsigned ii, kk, i, j, k;

    for (ii = 0; ii < n; ii += BLOCK_I)
        for (kk = 0; kk < n; kk += BLOCK_K)
            for (i = ii; i < ii + BLOCK_I; ++i)
                for (k = kk; k < kk + BLOCK_K; ++k)
                {
                    double  a =  A(i, k); 
                    double* b = &B(k, 0); 
                    double* c = &C(i, 0); 
                    for (j = 0; j < n; ++j)
                    {
                        *c = a * (*b);
                        c++; b++;
                    }
                }
}

Then GCC generates the following for the innermost block (-O2):

  <bb 6> [local count: 955630224]:
  # ivtmp.1_47 = PHI <0(5), ivtmp.1_46(6)>
  _11 = MEM[base: b_32, index: ivtmp.1_47, step: 8, offset: 0B];
  _12 = _11 * a_30;
  MEM[base: c_34, index: ivtmp.1_47, step: 8, offset: 0B] = _12;
  ivtmp.1_46 = ivtmp.1_47 + 1;
  if (_44 != ivtmp.1_47)
    goto <bb 6>; [89.00%]
  else
    goto <bb 7>; [11.00%]

Here both memory accesses are indexed MEM references driven by a single
induction variable, and therefore GCC generates faster final code. With a
2048x2048 matrix at -O2 this modification gives only about 0.3s of speedup,
but at -O3 -march=native on a Core(TM) i5-8250U it provides a 2x speedup,
arguably due to vectorization.
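
For reference, a driver along these lines reproduces the comparison (a
sketch: the matrix contents and timing method are my own assumptions; note
that n must be a multiple of BLOCK_I, which 2048 is):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void matrix_dgemm_2(unsigned n, double *restrict _C, double *restrict _A,
double *restrict _B);

int main(void)
{
    const unsigned n = 2048;  /* must be a multiple of BLOCK_I (128) */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C)
        return 1;

    for (size_t i = 0; i < (size_t)n * n; ++i) {
        A[i] = (double)(i % 7);
        B[i] = (double)(i % 13);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matrix_dgemm_2(n, C, A, B);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("n=%u: %.3fs (C[1]=%g)\n", n,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           C[1]);

    free(A); free(B); free(C);
    return 0;
}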
