Hello, I'm running out of ideas, and I wonder whether someone here has tips on how to debug a segmentation fault I'm having with my application. [Given the nature of the problem, I wonder whether the fault lies in OpenMPI itself rather than in my app, though at this point I'm not leaning strongly either way.]
The code is hybrid MPI+OpenMP, and I compile it with gcc 10.3.0 and OpenMPI 4.1.3. I usually run it with "mpirun -np X --bind-to none [...]" so that the threads created by OpenMP don't get bound to a single core and I actually get proper speedup out of OpenMP.

Since I introduced some changes to the code this week (though I have read the changes carefully a number of times and don't see anything suspicious), I sometimes get a segmentation fault, but only when I run with "--bind-to none", and only on my workstation. It doesn't always happen with the same run configuration, but I can see a pattern: the problem shows up only in the version compiled with OpenMP support, and most of the time only when ranks*threads goes above 4 or so. If I run with "--bind-to socket", all looks good all the time.

If I run it on another server, "--bind-to none" doesn't seem to be an issue (I submitted the jobs many, many times and got not a single segmentation fault), but on my workstation it fails almost every time when using MPI+OpenMP with a handful of threads and "--bind-to none". On that other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.

For example, with OMP_NUM_THREADS set to 1, I run the code as follows and get the segmentation fault below:

,----
| angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to none ../../../../../pcorona+openmp~gauss Fe13_NL3.params
| Reading control file: Fe13_NL3.params
| ... Control file parameters broadcasted
|
| [...]
|
| Starting calculation loop on the line of sight
| Receiving results from: 2
| Receiving results from: 1
|
| Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
|
| Backtrace for this error:
| Receiving results from: 3
| #0  0x7fd747e7555f in ???
| #1  0x7fd7488778e1 in ???
| #2  0x7fd7488667a4 in ???
| #3  0x7fd7486fe84c in ???
| #4  0x7fd7489aa9ce in ???
| #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
|        at src/pcorona_main.f90:627
| #6  0x7fd74813ec75 in ???
| #7  0x412bb0 in pcorona
|        at src/pcorona.f90:49
| #8  0x40361c in main
|        at src/pcorona.f90:17
|
| [...]
|
| --------------------------------------------------------------------------
| mpirun noticed that process rank 3 with PID 0 on node sieladon exited on signal 11 (Segmentation fault).
| --------------------------------------------------------------------------
`----

I cannot see inside the MPI library (I don't really know whether that would be helpful), but line 627 in pcorona_main.f90 is:

,----
| call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
`----

Any ideas/suggestions on what could be going on, or on how to try and get some more clues about the possible causes?

Many thanks,
--
Ángel de Vicente
Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/
---------------------------------------------------------------------------------------------
DISCLAIMER: This message may contain confidential and/or privileged information. If you are not the final recipient or have received it in error, please notify the sender immediately. Any unauthorized use of the content of this message is strictly prohibited. More information: https://www.iac.es/en/disclaimer
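
[Editorial note appended for readers of the archive: one thing worth ruling out in a hybrid code like this is the thread-support level MPI was initialised with. The backtrace points at mpi_probe inside an OpenMP-generated frame (pcorona_main.f90:627), and MPI calls from OpenMP threads are only guaranteed safe if the library was initialised with MPI_THREAD_MULTIPLE; plain mpi_init makes no such promise. A minimal Fortran sketch of the check (hypothetical program name; the real initialisation in pcorona.f90 may differ):]

,----
| ! Hypothetical standalone check: request full thread support and see
| ! what the MPI library actually provides.
| program check_mpi_threads
|   use mpi
|   implicit none
|   integer :: provided, ierr
|
|   ! Use mpi_init_thread instead of plain mpi_init
|   call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
|
|   if (provided < MPI_THREAD_MULTIPLE) then
|      print *, 'MPI only provides thread level ', provided
|      ! Concurrent MPI calls from OpenMP threads are then unsafe
|   end if
|
|   call mpi_finalize(ierr)
| end program check_mpi_threads
`----

[If `provided` comes back lower than MPI_THREAD_MULTIPLE, MPI calls from inside parallel regions can fail intermittently, which would be consistent with the sporadic, binding-dependent crashes described above — though it would not by itself explain why OMP_NUM_THREADS=1 also crashes.]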