First, double-check that you initialize MPI with MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...) and that the provided level it returns is MPI_THREAD_MULTIPLE, as you requested.
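A minimal sketch of that check, in Fortran to match the application. Frame #5 of the backtrace (`main_loop._omp_fn.0`) suggests that `mpi_probe` is called from inside an OpenMP region, which is why the thread level matters here. The program name `check_thread_level` is hypothetical, and whether the application really needs MPI_THREAD_MULTIPLE (rather than a lower level) depends on whether several threads can call MPI at the same time, which is an assumption here.

```fortran
! Hypothetical stand-alone check, not taken from the application.
program check_thread_level
  use mpi
  implicit none
  integer :: provided, ierr

  ! Request full multi-threaded support; the library reports the level
  ! it actually grants in "provided".
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)

  ! The thread-level constants are ordered, so a plain comparison works.
  if (provided < MPI_THREAD_MULTIPLE) then
     print *, 'Requested MPI_THREAD_MULTIPLE, but provided level is ', provided
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  print *, 'MPI_THREAD_MULTIPLE is available'
  call MPI_Finalize(ierr)
end program check_thread_level
```

Built with the same compiler wrapper as the application (for example `mpifort`) and run under `mpirun -np 1`, this prints which case applies. The same test on `provided` can be placed right after the initialization call in the real code.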
Cheers,

Gilles

On Fri, Apr 22, 2022, 21:45 Angel de Vicente via users <users@lists.open-mpi.org> wrote:

> Hello,
>
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
>
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
>
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
>
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times, and I don't see
> anything suspicious), I now get a segmentation fault sometimes, but only
> when I run with "--bind-to none" and only in my workstation. It is not
> always with the same running configuration, but I can see some pattern,
> and the problem shows up only if I run the version compiled with OpenMP
> support and most of the time only when the number of rank*threads goes
> above 4 or so. If I run it with "--bind-to socket" all looks good all
> the time.
>
> If I run it in another server, "--bind-to none" doesn't seem to be any
> issue (I submitted the jobs many many times and not a single
> segmentation fault), but in my workstation it fails almost every time if
> using MPI+OpenMP with a handful of threads and with "--bind-to none". In
> this other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
>
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
>
> ,----
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to none ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> | Reading control file: Fe13_NL3.params
> | ... Control file parameters broadcasted
> |
> | [...]
> |
> | Starting calculation loop on the line of sight
> | Receiving results from: 2
> | Receiving results from: 1
> |
> | Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> |
> | Backtrace for this error:
> | Receiving results from: 3
> | #0 0x7fd747e7555f in ???
> | #1 0x7fd7488778e1 in ???
> | #2 0x7fd7488667a4 in ???
> | #3 0x7fd7486fe84c in ???
> | #4 0x7fd7489aa9ce in ???
> | #5 0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> |    at src/pcorona_main.f90:627
> | #6 0x7fd74813ec75 in ???
> | #7 0x412bb0 in pcorona
> |    at src/pcorona.f90:49
> | #8 0x40361c in main
> |    at src/pcorona.f90:17
> |
> | [...]
> |
> | --------------------------------------------------------------------------
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on signal 11 (Segmentation fault).
> | -------------------------------------------------------
> `----
>
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
>
> ,----
> | call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `----
>
> Any ideas/suggestions what could be going on or how to try and get some
> more clues about the possible causes of this?
>
> Many thanks,
> --
> Ángel de Vicente
>
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/
>
> ---------------------------------------------------------------------------------------------
> DISCLAIMER: This message may contain confidential and / or privileged
> information. If you are not the final recipient or have received it in
> error, please notify the sender immediately. Any unauthorized use of the
> content of this message is strictly prohibited. More information:
> https://www.iac.es/en/disclaimer