https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114151

            Bug ID: 114151
           Summary: [14 Regression] weird and inefficient codegen and
                    addressing modes since
                    g:a0b1798042d033fd2cc2c806afbb77875dd2909b
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

Created attachment 57559
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57559&action=edit
testcase

The attached C++ testcase compiled with: -O3 -mcpu=neoverse-n2

used to compile to a nice and simple loop.  But after
g:a0b1798042d033fd2cc2c806afbb77875dd2909b
the codegen is weird and it uses horrible addressing modes.

The first odd part is that it has decided to split the loop: the "main" loop
has a guard after it that branches to the exit if the iteration count is 1.

If not, instead of just looping again, it falls through to a copy of the main
loop, but with the addressing modes destroyed.
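
For illustration, the control flow now looks roughly like this (a hand-written
sketch of the shape only; the names and the iteration accounting are invented,
not taken from the actual dump):

  // Sketch of the split-loop shape; everything here is illustrative.
  void split_loop_shape(long iterations) {
    long remaining = iterations;
  main_loop:
    /* ... vectorized body, compact addressing ... */
    if (--remaining <= 1)
      return;                 // guard after the "main" loop -> exit
    /* ... copy of the body, with unshared address computations ... */
    --remaining;
    goto main_loop;
  }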

The copy of the loop seems to have unshared the address calculations. Before we
had:

  _128 = (void *) ivtmp.11_20;
  _54 = MEM <__SVFloat16_t> [(__fp16 *)_128];
  _10 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [16B, 16B]];
  _75 = MEM <__SVFloat16_t> [(__fp16 *)_128 + POLY_INT_CST [32B, 32B]];

etc., i.e. every access is just an offset from the single base _128.  Now we
have:

  col_i_61 = (int) ivtmp.11_100;
  _60 = (long unsigned int) col_i_61;
  _59 = _60 * 2;
  _58 = a_j_69 + _59;
  _54 = MEM <__SVFloat16_t> [(__fp16 *)_58];
  _53 = _59 + POLY_INT_CST [16, 16];
  _13 = a_j_69 + _53;
  _10 = MEM <__SVFloat16_t> [(__fp16 *)_13];
  _74 = _59 + POLY_INT_CST [32, 32];
  _19 = a_j_69 + _74;
  _75 = MEM <__SVFloat16_t> [(__fp16 *)_19];

and similarly for the stores.
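
For reference, the old form is what lets the backend use the SVE
[base, #imm, mul vl] addressing mode.  A rough ACLE-intrinsics equivalent of
three such loads (my own illustration, not code from the testcase) is:

  #include <arm_sve.h>

  // Illustration only: with one shared base pointer, consecutive
  // vector loads can use the [base, #imm, mul vl] addressing mode,
  // matching the old GIMPLE's POLY_INT_CST offsets from _128.
  svfloat16_t sum_three_vectors(svbool_t pg, const float16_t *base) {
    svfloat16_t v0 = svld1_f16(pg, base);          // [base]
    svfloat16_t v1 = svld1_vnum_f16(pg, base, 1);  // [base, #1, mul vl]
    svfloat16_t v2 = svld1_vnum_f16(pg, base, 2);  // [base, #2, mul vl]
    return svadd_f16_x(pg, svadd_f16_x(pg, v0, v1), v2);
  }

The new form instead recomputes each address from the scalar IV, so every load
is preceded by its own multiply/add chain and ends up as a plain [reg] load.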

It also weirdly creates some very complicated address computations.  Before we
had:

  _144 = p_mat_16(D) + 6; 
  _64 = MEM <__SVFloat16_t> [(__fp16 *)_144 + ivtmp.10_100 * 2];
  _143 = p_mat_16(D) + 4;
  _84 = MEM <__SVFloat16_t> [(__fp16 *)_143 + ivtmp.10_100 * 2];

and after:

  ivtmp.23_130 = (unsigned long) p_mat_16(D);
  _123 = 2 - ivtmp.23_130;
  _124 = &MEM <__SVFloat16_t> [(__fp16 *)0B + _123 + ivtmp.12_109 * 2];
  _64 = MEM <__SVFloat16_t> [(__fp16 *)_124];

  _122 = -ivtmp.23_130;
  _120 = &MEM <__SVFloat16_t> [(__fp16 *)0B + _122 + ivtmp.12_109 * 2];
  _84 = MEM <__SVFloat16_t> [(__fp16 *)_120];
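
Written out in scalar form the two computations are equivalent, but only the
old one fits a load addressing mode.  A sketch of the arithmetic (the relation
between the two IVs is inferred from the dumps, and all names are invented):

  // Both functions load from p_mat + 6 bytes + 2*iv bytes.
  __fp16 load_before(const __fp16 *p_mat, unsigned long iv) {
    return p_mat[3 + iv];                      // one [base, index, lsl #1] load
  }
  __fp16 load_after(const __fp16 *p_mat, unsigned long iv) {
    unsigned long base = (unsigned long)p_mat; // like ivtmp.23_130
    unsigned long iv2  = base + 2 + iv;        // base folded into ivtmp.12_109
    unsigned long addr = (2 - base) + iv2 * 2; // like _123 + ivtmp.12_109 * 2
    return *(const __fp16 *)addr;              // extra neg/add before each load
  }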

This results in quite a large code-size increase and a 7-10% performance loss.
