[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #15 from Richard Biener  ---
So the issue is really that for

  for (int i = 0; i < n; ++i)
    {
      double tem1 = a4[i*4] + a4[i*4+n*4] (**);
      double tem2 = a4[i*4+2*n*4+1];
      a4[i*4+n*4+1] = tem1;
      a4[i*4+1] = tem2;
      double tem3 = a4[i*4] - tem2;
      double tem4 = tem3 + a4[i*4+n*4];
      a4[i*4+n*4+1] = tem3 + a4[i*4+n*4+1] (**);
    }

we detect an interleaving load for (**) and emit it before the
later strided store to a4[i*4+n*4+1].
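
To make the reordering concrete, here is a minimal scalar model of one
iteration.  It is only a sketch: the function names are made up for
illustration and this is not what the vectorizer emits.  It assumes the
group consists of two loads, a[p] and a[p + 1], with a store to a[p + 1]
between them, as in the loop above.

```c
/* Scalar order: the last group member is loaded after the store,
   so it observes the stored value.  */
static double
scalar_order (double *a, int p)
{
  double first = a[p];       /* first member of the load group */
  a[p + 1] = first + 1.0;    /* intervening store to the second slot */
  double last = a[p + 1];    /* last member, sees the store above */
  return first + last;
}

/* Interleaving order: the whole group is loaded at the position of the
   first member, i.e. before the store, so the second slot is stale.  */
static double
hoisted_group_load (double *a, int p)
{
  double first = a[p];
  double last = a[p + 1];    /* hoisted above the store: old value */
  a[p + 1] = first + 1.0;
  return first + last;
}
```

With a = {1.0, 10.0} and p = 0, scalar_order returns 3.0 while
hoisted_group_load returns 11.0, because the hoisted load reads the old
value of a[1].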

The issue is that vect_preserves_scalar_order_p expects that the
vectorization will happen via SLP, but we end up doing interleaving,
which performs the load not in place of the last load but in place
of ->first_element.  Unfortunately SLP analysis is done _after_
dependence analysis.  That means we have to conservatively assume both
paths may happen.
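
A conservative check along these lines could look like the following
sketch.  It is a simplified model, not GCC's implementation: statements
are reduced to integer positions in the loop body, and the struct and
function names are hypothetical.  It assumes an SLP load group is
emitted at the position of its last scalar load and an interleaving
group at the position of its first, and it only makes sense for a load
and a store that access the same location.

```c
#include <stdbool.h>

/* Hypothetical model of a load group: positions of its first and last
   scalar loads within the loop body.  */
struct load_group
{
  int first_pos;
  int last_pos;
};

/* True if emitting the load at EMIT_POS keeps the same relative order
   to the store at STORE_POS as the scalar load at MEMBER_POS had.  */
static bool
order_kept (int member_pos, int emit_pos, int store_pos)
{
  return (member_pos < store_pos) == (emit_pos < store_pos);
}

/* Conservative variant: accept only if both possible placements
   (SLP at the last load, interleaving at the first) keep the order.  */
static bool
preserves_order_conservative (const struct load_group *g,
                              int member_pos, int store_pos)
{
  return order_kept (member_pos, g->last_pos, store_pos)
         && order_kept (member_pos, g->first_pos, store_pos);
}
```

For the loop above, counting statements from 0, the first group load is
at position 0, the store to a4[i*4+n*4+1] at position 2 and the last
group load at position 6.  The SLP placement keeps the order of the last
load and the store, the interleaving placement does not, so the
conservative check rejects the pair.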

Fixed testcase:

void __attribute__((noinline,noclone))
foo (double *a4, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a4[i*4] + a4[i*4+n*4];
      double tem2 = a4[i*4+2*n*4+1];
      a4[i*4+n*4+1] = tem1;
      a4[i*4+1] = tem2;
      double tem3 = a4[i*4] - tem2;
      double tem4 = tem3 + a4[i*4+n*4];
      a4[i*4+n*4+1] = tem4 + a4[i*4+n*4+1];
    }
}

int main(int argc, char **argv)
{
  int n = 11;
  double a4[4 * n * 8];
  double a42[4 * n * 8];
  for (int i = 0; i < 4 * n * 8; ++i)
    a4[i] = a42[i] = i;
  foo (a4, n);
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a42[i*4] + a42[i*4+n*4];
      double tem2 = a42[i*4+2*n*4+1];
      a42[i*4+n*4+1] = tem1;
      a42[i*4+1] = tem2;
      double tem3 = a42[i*4] - tem2;
      double tem4 = tem3 + a42[i*4+n*4];
      a42[i*4+n*4+1] = tem4 + a42[i*4+n*4+1];
      __asm__ volatile ("": : : "memory");
    }
  for (int i = 0; i < 4 * n * 8; ++i)
    if (a4[i] != a42[i])
      __builtin_abort ();
  return 0;
}

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #14 from Richard Biener  ---
So the important difference when comparing patched/unpatched is that the
unpatched compiler rejected vectorization with

mapz_module.fppized.f90:730:0: note: dependence distance == 0 between *a4_627(D)[_196] and *a4_627(D)[_196]
mapz_module.fppized.f90:730:0: note: READ_WRITE dependence in interleaving.
mapz_module.fppized.f90:730:0: note: bad data dependence.

while the patched compiler is happy.  That points to the patched function
and its call here:

static bool
vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
                                  loop_vec_info loop_vinfo,
                                  unsigned int *max_vf)
{
...
  if (dist == 0)
    {
...
      if (!vect_preserves_scalar_order_p (DR_STMT (dra), DR_STMT (drb)))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "READ_WRITE dependence in interleaving.\n");
          return true;

It's probably a failure to factor in unrolling that breaks this case.

The unvectorized loop body looks like (all but relevant loads/stores elided):

   [local count: 118111594]:
  # i_313 = PHI <_1(19), i_293(24)>
  _146 = *a4_255(D)[_145];
  _152 = *a4_255(D)[_151];
  _165 = *a4_255(D)[_164];
  *a4_255(D)[_194] = _195;
  *a4_255(D)[_201] = _202;
  _203 = *a4_255(D)[_145];
  _290 = *a4_255(D)[_151];
  _291 = *a4_255(D)[_194];
  *a4_255(D)[_194] = M.42_316;
  i_293 = i_313 + 1;
  if (_2 < i_293)

final runtime alias checks are:

create runtime check for data references *a4_255(D)[_151] and *a4_255(D)[_201]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_164]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_145]
create runtime check for data references *a4_255(D)[_164] and *a4_255(D)[_201]

and groups are

note: Detected interleaving load *a4_255(D)[_151] and *a4_255(D)[_194]
note: Detected interleaving load of size 4 starting with _152 = *a4_255(D)[_151];
note: There is a gap of 2 elements after the group
note: Detected single element interleaving *a4_255(D)[_151] step 32
note: not consecutive access *a4_255(D)[_194] = _195;
note: using strided accesses
note: not consecutive access *a4_255(D)[_194] = M.42_316;
note: using strided accesses
note: Detected single element interleaving *a4_255(D)[_164] step 32
note: Detected single element interleaving *a4_255(D)[_145] step 32
note: Detected single element interleaving *a4_255(D)[_145] step 32
note: not consecutive access *a4_255(D)[_201] = _202;
note: using strided accesses

so there's no SLP involved.
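
The group numbers in the dump can be reproduced with a little
arithmetic.  The sketch below is illustrative only: the struct and
function are hypothetical, not GCC code, and it assumes the group
members occupy consecutive element slots starting at offset 0 of each
step.

```c
#include <stddef.h>

/* Hypothetical description of an interleaving group.  */
struct group_shape
{
  size_t size;       /* element slots covered by one step */
  size_t gap_after;  /* unused slots after the last member */
};

/* STEP_BYTES is the access step per iteration, ELEM_BYTES the element
   size, LAST_MEMBER_OFFSET the element offset of the last member.  */
static struct group_shape
interleaving_group_shape (size_t step_bytes, size_t elem_bytes,
                          size_t last_member_offset)
{
  struct group_shape s;
  s.size = step_bytes / elem_bytes;
  s.gap_after = s.size - (last_member_offset + 1);
  return s;
}
```

For the load group above, a step of 32 bytes over 8-byte doubles with
members at element offsets 0 and 1 gives a size of 4 and a trailing gap
of 2, matching the dump.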

The respective loop doesn't involve a reduction so -ffast-math shouldn't be
required here, only -ffinite-math-only for min/max recognition.

A C testcase mimicking the memory accesses and failing is

void __attribute__((noinline,noclone))
foo (double *a4, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a4[i*4] + a4[i*4+n];
      double tem2 = a4[i*4+2*n+1];
      a4[i*4+n+1] = tem1;
      a4[i*4+1] = tem2;
      double tem3 = a4[i*4] - a4[i*4+1];
      double tem4 = tem3 + a4[i*4+n];
      a4[i*4+n+1] = tem3 + a4[i*4+n+1];
    }
}

int main()
{
  const int n = 5;
  double a4[4 * n * 8];
  double a42[4 * n * 8];
  for (int i = 0; i < 4 * n * 8; ++i)
    a4[i] = a42[i] = i;
  foo (a4, n);
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a42[i*4] + a4[i*4+n];
      double tem2 = a42[i*4+2*n+1];
      a42[i*4+n+1] = tem1;
      a42[i*4+1] = tem2;
      double tem3 = a42[i*4] - a42[i*4+1];
      double tem4 = tem3 + a42[i*4+n];
      a42[i*4+n+1] = tem3 + a42[i*4+n+1];
      __asm__ volatile ("": : : "memory");
    }
  for (int i = 0; i < 4 * n * 8; ++i)
    if (a4[i] != a42[i])
      __builtin_abort ();
  return 0;
}

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-10 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #13 from Martin Liška  ---
Can be reproduced also on e.g. a Haswell machine:
-Ofast -march=haswell -g -funroll-loops

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-10 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #12 from Martin Liška  ---
Theoretically similar to PR87214, but that patch was backported and this
issue is still present in 8.3.1.

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-10 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #11 from Martin Liška  ---
Created attachment 46124
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46124&action=edit
Vectorizer and optimized dumps

So I can confirm the problematic file is mapz_module.fppized.f90. The
problematic vectorization happens with:

-Ofast -march=native -funroll-loops -fdbg-cnt=vect_loop:10

this is OK:

-Ofast -march=native -funroll-loops -fdbg-cnt=vect_loop:9

I'm attaching dump files; however, the vectorized loop is huge and it is
hard to spot anything suspicious.

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #10 from Richard Biener  ---
Looking at the revision and its context, I figured the original caller was
added for a case that can no longer happen (SAME_DR_STMT set, which
can never happen since we rewrote interleaving chain detection for GCC 4.9).

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #9 from Martin Liška  ---
However, '--size=test' helps here: it fails quickly. With the revision, 2
files differ: mapz_module.fppized.o.s and optics_lib.o.s.
I suspect the latter one.

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #8 from Martin Liška  ---
> 
> Please use GCC 8 branch, not trunk.  The problem only shows up on GCC 8
> branch.

I can confirm that with r265453 I see:

*** Miscompare of cam4_validate.txt; for details see
    /home/mliska/Programming/cpu2017/benchspec/CPU/527.cam4_r/run/run_peak_refrate_gcc7-m64./cam4_validate.txt.mis
0001:   PASS:  4  points. 
Failure at Step:2   1   1   1
^
'cam4_validate.txt' long

But it's not immediate; it takes a couple of minutes to see it.
I'm reducing that.

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-09 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marxin at gcc dot gnu.org

--- Comment #3 from Martin Liška  ---
Richi, do you want help with a test-case reduction? Or is it a known issue
that has been fixed on trunk?

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-08 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

--- Comment #2 from H.J. Lu  ---
On trunk, r265457 fixed 527.cam4_r in SPEC CPU 2017 with:

-march=native -Ofast -funroll-loops

[Bug tree-optimization/90018] [8 Regression] r265453 miscompiled 527.cam4_r in SPEC CPU 2017

2019-04-08 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018

H.J. Lu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-04-09
   Target Milestone|---                         |8.4
     Ever confirmed|0                           |1

--- Comment #1 from H.J. Lu  ---
r265452, which was backported to GCC 8 as r265453, caused the same failure
on trunk.