[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835 --- Comment #4 from Tom de Vries --- This minimized test-case (rewritten to avoid the kernels construct, by setting the num_gangs as libgomp would have chosen it for kernels, and making the loop a gang loop): ... /* { dg-do run } */ /* { dg-additional-options "-lcuda" { target openacc_nvidia_accel_selected } } */ #include #include #include "cuda.h" #include #define n 128 int main (void) { CUresult r; CUstream stream1; int N = n; int a[n]; int b[n]; int c[n]; acc_init (acc_device_nvidia); r = cuStreamCreate (, CU_STREAM_NON_BLOCKING); if (r != CUDA_SUCCESS) { fprintf (stderr, "cuStreamCreate failed: %d\n", r); abort (); } acc_set_cuda_stream (1, stream1); for (int i = 0; i < n; i++) { a[i] = 3; c[i] = 0; } #pragma acc data copy (a, b, c) copyin (N) { #pragma acc parallel async (1) ; #pragma acc parallel async (1) num_gangs (320) #pragma loop gang for (int ii = 0; ii < N; ii++) c[ii] = (a[ii] + a[N - ii - 1]); #pragma acc parallel async (1) #pragma acc loop seq for (int ii = 0; ii < n; ii++) a[ii] = 6; #pragma acc wait (1) } unsigned sum = 0; for (int i = 0; i < n; i++) if (c[i] != 6) { printf ("%d@%d ", c[i], i); sum++; } if (sum > 0) { printf ("mismatches: %u\n", sum); abort (); } return 0; } ... reproduces 100 out of 100 for me at -O2: ... nr=100; sum=0; for n in $(seq 1 $nr); do ./run.sh ./asyncwait-1.exe ; if [ $? -eq 0 ]; then sum=$(($sum + 1)); fi; done; echo ; echo "$sum/$nr" 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 15@125 27@126 39@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 12@125 24@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 30@127 mismatches: 3 Aborted (core dumped) 18@126 30@127 mismatches: 2 Aborted (core dumped) 21@126 30@127 mismatches: 2 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 24@127 mismatches: 2 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 15@125 24@126 39@127 mismatches: 3 Aborted (core dumped) 18@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 21@126 30@127 mismatches: 3 Aborted (core dumped) 18@125 27@126 39@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 21@126 36@127 mismatches: 3 Aborted (core dumped) 15@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 18@126 27@127 mismatches: 2 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 18@126 27@127 mismatches: 2 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 15@126 30@127 mismatches: 2 Aborted (core dumped) 18@126 30@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 33@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 18@126 30@127 mismatches: 2 Aborted (core dumped) 9@125 24@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 15@126 30@127 mismatches: 2 Aborted (core dumped) 15@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 9@126 21@127 mismatches: 2 Aborted (core dumped) 15@126 27@127 mismatches: 2 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 33@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 30@127 mismatches: 3 Aborted (core dumped) 12@125 24@126 36@127 mismatches: 3 Aborted (core dumped) 9@125 21@126 30@127 mismatches: 3 Aborted (core dumped) 9@125 18@126 30@127 mismatches: 3 Aborted (core dumped) 9@125 18@126 33@127 mismatches: 3 Aborted (core dumped) 15@126 27@127
[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835 --- Comment #3 from Thomas Schwinge --- Created attachment 45457 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45457=edit [WIP] libgomp.oacc-c-c++-common/asyncwait-1.c debug (In reply to Tom de Vries from comment #2) > (In reply to Tom de Vries from comment #1) > > (In reply to Thomas Schwinge from comment #0) > > > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm > > > seeing (intermittently only, and only on some systems): > > > > I see the failure reproduced consistently with a Quadro M1200. Oh, good -- in a way ;-) -- that it's consistently reproducable for you. For me, the failure is rather rare. > > > I have not yet analyzed what's causing this, but I have some ideas about > > > pending patches that might cure it. Unfortunately, the patches I've been thinking of either are on trunk already, or can't possibly be related to this problem. The 'async'/'wait' clauses/directives in the test case look correct. > do you intend to address this before stage4 closes? I'd like to, yes. Here is my current status. With "-O2": [...] nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32 nvptx_exec: kernel main$_omp_fn$37: finished GOACC_data_end: restore mappings GOACC_data_end: mappings restored [abort] In addition to "main$_omp_fn$37", sometimes also seen with "main$_omp_fn$25", "main$_omp_fn$29", "main$_omp_fn$33". So far only seen with OpenACC 'kernels' constructs, but not with the very similar 'parallel' ones earlier in the file. For example, without "DEBUG_K": [...] nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32 nvptx_exec: kernel main$_omp_fn$37: finished GOACC_wait -2 1 goacc_wait -2 1 goacc_wait 1 GOACC_data_end: restore mappings GOACC_data_end: mappings restored 1007 c[64] 0 1019 e[64] 13 1007 c[65] 0 1019 e[65] 13 1007 c[66] 0 1019 e[66] 13 [...] 1007 c[125] 0 1019 e[125] 13 1007 c[126] 0 1019 e[126] 13 1007 c[127] 0 1019 e[127] 13 With "DEBUG_K": [...] nvptx_exec: kernel main$_omp_fn$37: launch gangs=1, workers=1, vectors=32 nvptx_exec: kernel main$_omp_fn$37: finished GOACC_wait -2 1 goacc_wait -2 1 goacc_wait 1 966 c[64] 0 966 c[65] 0 966 c[66] 0 [...] 966 c[125] 0 966 c[126] 0 966 c[127] 0 So, the compute kernel ("main$_omp_fn$37") doesn't find the "c" array properly initialized, even though they're enqueued on the same 'async', so have to execute in proper order by definition. I've only ever seen this with the "c" array. Sometimes that's starting already with index 0 (often seen with "main$_omp_fn$29"), or as late as index 100 (rarely). When running under "valgrind", repeatedly until there's an "abort", that doesn't print anything suspicious. Might this perhaps be a latent issue in OpenACC 'kernels' plus 'async', now uncovered by the r264397 "[nvptx] Remove use of CUDA unified memory in libgomp" commit?
[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835 --- Comment #2 from Tom de Vries --- (In reply to Tom de Vries from comment #1) > (In reply to Thomas Schwinge from comment #0) > > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm > > seeing (intermittently only, and only on some systems): > > > > I see the failure reproduced consistently with a Quadro M1200. > > > I have not yet analyzed what's causing this, but I have some ideas about > > pending patches that might cure it. > > OK, let's see if those make it. If not, we may want to investigate and > decide if we want to revert the patch. Hi Thomas, do you intend to address this before stage4 closes? Thanks, - Tom
[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835 --- Comment #1 from Tom de Vries --- (In reply to Thomas Schwinge from comment #0) > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm > seeing (intermittently only, and only on some systems): > I see the failure reproduced consistently with a Quadro M1200. > I have not yet analyzed what's causing this, but I have some ideas about > pending patches that might cure it. OK, let's see if those make it. If not, we may want to investigate and decide if we want to revert the patch.