[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2

2019-01-21 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835

--- Comment #4 from Tom de Vries  ---
This minimized test-case (rewritten to avoid the kernels construct, by setting
the num_gangs as libgomp would have chosen it for kernels, and making the loop
a  gang loop):
...
/* { dg-do run } */
/* { dg-additional-options "-lcuda" { target openacc_nvidia_accel_selected } }
*/

#include 
#include 
#include "cuda.h"

#include 

#define n 128

int
main (void)
{
  CUresult r;
  CUstream stream1;
  int N = n;
  int a[n];
  int b[n];
  int c[n];

  acc_init (acc_device_nvidia);

  r = cuStreamCreate (, CU_STREAM_NON_BLOCKING);
  if (r != CUDA_SUCCESS)
{
  fprintf (stderr, "cuStreamCreate failed: %d\n", r);
  abort ();
}

  acc_set_cuda_stream (1, stream1);

  for (int i = 0; i < n; i++)
{
  a[i] = 3;
  c[i] = 0;
}

#pragma acc data copy (a, b, c) copyin (N)
  {
#pragma acc parallel async (1)
;

#pragma acc parallel async (1) num_gangs (320)
#pragma loop gang
for (int ii = 0; ii < N; ii++)
  c[ii] = (a[ii] + a[N - ii - 1]);

#pragma acc parallel async (1)
#pragma acc loop seq
for (int ii = 0; ii < n; ii++)
  a[ii] = 6;

#pragma acc wait (1)
  }

  unsigned sum = 0;
  for (int i = 0; i < n; i++)
if (c[i] != 6)
  {
printf ("%d@%d ", c[i], i);
sum++;
  }

  if (sum > 0)
{
  printf ("mismatches: %u\n", sum);
  abort ();
}

  return 0;
}
...

reproduces 100 out of 100 for me at -O2:
...
nr=100; sum=0; for n in $(seq 1 $nr); do ./run.sh ./asyncwait-1.exe ; if [ $?
-eq 0 ]; then sum=$(($sum + 1)); fi; done; echo ; echo "$sum/$nr"
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
15@125 27@126 39@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
12@125 24@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 30@127 mismatches: 3
Aborted (core dumped)
18@126 30@127 mismatches: 2
Aborted (core dumped)
21@126 30@127 mismatches: 2
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 24@127 mismatches: 2
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
15@125 24@126 39@127 mismatches: 3
Aborted (core dumped)
18@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 21@126 30@127 mismatches: 3
Aborted (core dumped)
18@125 27@126 39@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 21@126 36@127 mismatches: 3
Aborted (core dumped)
15@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
18@126 27@127 mismatches: 2
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
18@126 27@127 mismatches: 2
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
15@126 30@127 mismatches: 2
Aborted (core dumped)
18@126 30@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 33@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
18@126 30@127 mismatches: 2
Aborted (core dumped)
9@125 24@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
15@126 30@127 mismatches: 2
Aborted (core dumped)
15@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
9@126 21@127 mismatches: 2
Aborted (core dumped)
15@126 27@127 mismatches: 2
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 33@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 30@127 mismatches: 3
Aborted (core dumped)
12@125 24@126 36@127 mismatches: 3
Aborted (core dumped)
9@125 21@126 30@127 mismatches: 3
Aborted (core dumped)
9@125 18@126 30@127 mismatches: 3
Aborted (core dumped)
9@125 18@126 33@127 mismatches: 3
Aborted (core dumped)
15@126 27@127 

[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2

2019-01-18 Thread tschwinge at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835

--- Comment #3 from Thomas Schwinge  ---
Created attachment 45457
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45457=edit
[WIP] libgomp.oacc-c-c++-common/asyncwait-1.c debug

(In reply to Tom de Vries from comment #2)
> (In reply to Tom de Vries from comment #1)
> > (In reply to Thomas Schwinge from comment #0)
> > > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm
> > > seeing (intermittently only, and only on some systems):
> > 
> > I see the failure reproduced consistently with a Quadro M1200.

Oh, good -- in a way ;-) -- that it's consistently reproducable for you.  For
me, the failure is rather rare.

> > > I have not yet analyzed what's causing this, but I have some ideas about
> > > pending patches that might cure it.

Unfortunately, the patches I've been thinking of either are on trunk already,
or can't possibly be related to this problem.

The 'async'/'wait' clauses/directives in the test case look correct.

> do you intend to address this before stage4 closes?

I'd like to, yes.


Here is my current status.


With "-O2":

[...]
  nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1,
vectors=32
  nvptx_exec: kernel main$_omp_fn$37: finished
  GOACC_data_end: restore mappings
  GOACC_data_end: mappings restored
[abort]

In addition to "main$_omp_fn$37", sometimes also seen with "main$_omp_fn$25",
"main$_omp_fn$29", "main$_omp_fn$33".

So far only seen with OpenACC 'kernels' constructs, but not with the very
similar 'parallel' ones earlier in the file.

For example, without "DEBUG_K":

[...]
  nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1,
vectors=32
  nvptx_exec: kernel main$_omp_fn$37: finished
GOACC_wait -2 1
goacc_wait -2 1
goacc_wait   1
  GOACC_data_end: restore mappings
  GOACC_data_end: mappings restored
1007 c[64] 0
1019 e[64] 13
1007 c[65] 0
1019 e[65] 13
1007 c[66] 0
1019 e[66] 13
[...]
1007 c[125] 0
1019 e[125] 13
1007 c[126] 0
1019 e[126] 13
1007 c[127] 0
1019 e[127] 13

With "DEBUG_K":

[...]
  nvptx_exec: kernel main$_omp_fn$37: launch gangs=1, workers=1, vectors=32
  nvptx_exec: kernel main$_omp_fn$37: finished
GOACC_wait -2 1
goacc_wait -2 1
goacc_wait   1
966 c[64] 0
966 c[65] 0
966 c[66] 0
[...]
966 c[125] 0
966 c[126] 0
966 c[127] 0

So, the compute kernel ("main$_omp_fn$37") doesn't find the "c" array properly
initialized, even though they're enqueued on the same 'async', so have to
execute in proper order by definition.

I've only ever seen this with the "c" array.

Sometimes that's starting already with index 0 (often seen with
"main$_omp_fn$29"), or as late as index 100 (rarely).


When running under "valgrind", repeatedly until there's an "abort", that
doesn't print anything suspicious.


Might this perhaps be a latent issue in OpenACC 'kernels' plus 'async', now
uncovered by the r264397 "[nvptx] Remove use of CUDA unified memory in libgomp"
commit?

[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2

2019-01-12 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835

--- Comment #2 from Tom de Vries  ---
(In reply to Tom de Vries from comment #1)
> (In reply to Thomas Schwinge from comment #0)
> > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm
> > seeing (intermittently only, and only on some systems):
> > 
> 
> I see the failure reproduced consistently with a Quadro M1200.
> 
> > I have not yet analyzed what's causing this, but I have some ideas about
> > pending patches that might cure it.
> 
> OK, let's see if those make it. If not, we may want to investigate and
> decide if we want to revert the patch.

Hi Thomas,

do you intend to address this before stage4 closes?

Thanks,
- Tom

[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2

2018-12-14 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835

--- Comment #1 from Tom de Vries  ---
(In reply to Thomas Schwinge from comment #0)
> After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm
> seeing (intermittently only, and only on some systems):
> 

I see the failure reproduced consistently with a Quadro M1200.

> I have not yet analyzed what's causing this, but I have some ideas about
> pending patches that might cure it.

OK, let's see if those make it. If not, we may want to investigate and decide
if we want to revert the patch.