Dear Angel,

You're using MPI_Probe() with threads; that is not thread-safe, since another thread may receive the probed message between the probe and the matching receive. Please consider using MPI_Mprobe() together with MPI_Mrecv() instead.
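In case it helps, below is a minimal sketch of what the receiving side could look like with the matched-probe calls (Fortran, 'mpi' module bindings); the subroutine name, buffer name and datatype are placeholders for illustration, not taken from your code:

,----
| subroutine receive_results(master)
|   use mpi
|   implicit none
|   integer, intent(in) :: master
|   integer :: message, nvals, mpierror
|   integer :: stat(MPI_STATUS_SIZE)
|   double precision, allocatable :: results(:)   ! placeholder buffer/type
|
|   ! The matched probe removes the message from the matching queue, so no
|   ! other thread can intercept it before the mrecv below.
|   call mpi_mprobe(master, mpi_any_tag, mpi_comm_world, message, stat, mpierror)
|   call mpi_get_count(stat, mpi_double_precision, nvals, mpierror)
|   allocate(results(nvals))
|
|   ! Receive exactly the message that was matched above.
|   call mpi_mrecv(results, nvals, mpi_double_precision, message, stat, mpierror)
| end subroutine receive_results
`----

With a single thread calling into MPI this behaves like your current probe/recv pair, but it stays correct if several threads end up probing and receiving on the same communicator.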
However, you mention running with only one thread (OMP_NUM_THREADS=1), assuming you don't override that with omp_set_num_threads() or a num_threads() clause. In that case, running under valgrind may help, possibly after recompiling Open MPI with valgrind checking and debugging options enabled.

Best regards,
Rainer

> On 22. Apr 2022, at 14:40, Angel de Vicente via users <users@lists.open-mpi.org> wrote:
> 
> Hello,
> 
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this
> point I'm not leaning strongly either way].
> 
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
> 
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
> 
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times, and I don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same running configuration, but I can see some pattern:
> the problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads
> goes above 4 or so. If I run it with "--bind-to socket" all looks good
> all the time.
> 
> If I run it on another server, "--bind-to none" doesn't seem to cause
> any issue (I submitted the jobs many, many times and not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none".
> On this other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
> 
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
> 
> ,----
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to none ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> | Reading control file: Fe13_NL3.params
> | ... Control file parameters broadcasted
> |
> | [...]
> |
> | Starting calculation loop on the line of sight
> | Receiving results from: 2
> | Receiving results from: 1
> |
> | Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> |
> | Backtrace for this error:
> | Receiving results from: 3
> | #0 0x7fd747e7555f in ???
> | #1 0x7fd7488778e1 in ???
> | #2 0x7fd7488667a4 in ???
> | #3 0x7fd7486fe84c in ???
> | #4 0x7fd7489aa9ce in ???
> | #5 0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> |       at src/pcorona_main.f90:627
> | #6 0x7fd74813ec75 in ???
> | #7 0x412bb0 in pcorona
> |       at src/pcorona.f90:49
> | #8 0x40361c in main
> |       at src/pcorona.f90:17
> |
> | [...]
> |
> | --------------------------------------------------------------------------
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on signal 11 (Segmentation fault).
> | --------------------------------------------------------------------------
> `----
> 
> I cannot see inside the MPI library (I don't really know if that would
> be helpful), but line 627 in pcorona_main.f90 is:
> 
> ,----
> | call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `----
> 
> Any ideas/suggestions on what could be going on, or how to try and get
> some more clues about the possible causes of this?
> 
> Many thanks,
> --
> Ángel de Vicente
> 
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/

---------------------------------------------------------------------
Prof. Dr.-Ing. Rainer Keller, HS Esslingen
Programme Coordinator, Master Applied Computer Science
Professor of Operating Systems, Distributed and Parallel Systems
Faculty of Computer Science and Information Technology
Flandernstr. 101, Raum F01.320
73732 Esslingen
T.: +49 (0)711 397-4165
F.: +49 (0)711 397-48 4165