Dear Angel,
You’re using MPI_Probe() with threads; that’s not thread-safe: with
multiple threads, another thread may match and receive the message
between the probe and your subsequent receive. Please consider using
MPI_Mprobe() together with MPI_Mrecv(), which remove the matched
message from the queue so no other thread can intercept it.
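
A minimal sketch of the matched-probe pattern (the payload type and the
count handling are my assumptions, adapted to the mpi_probe call quoted
below):

,----
| ! Hedged sketch: matched probe/receive; assumes the payload is
| ! double precision and `master` holds the sender's rank, as in
| ! the mpi_probe call at pcorona_main.f90:627.
| subroutine recv_results(master)
|   use mpi
|   implicit none
|   integer, intent(in) :: master
|   integer :: msg, nelem, mpierror, stat(MPI_STATUS_SIZE)
|   double precision, allocatable :: buf(:)
| 
|   ! mpi_mprobe removes the matched message from the queue, so
|   ! another thread cannot steal it before the receive.
|   call mpi_mprobe(master, mpi_any_tag, mpi_comm_world, msg, stat, mpierror)
|   call mpi_get_count(stat, mpi_double_precision, nelem, mpierror)
|   allocate(buf(nelem))
|   ! mpi_mrecv receives exactly the message matched above.
|   call mpi_mrecv(buf, nelem, mpi_double_precision, msg, stat, mpierror)
| end subroutine recv_results
`----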

However, you mention running with only one thread, i.e. setting
OMP_NUM_THREADS=1 (assuming you didn’t override that with
omp_set_num_threads() or a num_threads() clause)…
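
Also, if MPI calls happen inside OpenMP regions once you go beyond one
thread, it is worth verifying the thread support level Open MPI
actually provides; a minimal sketch:

,----
| ! Sketch: request full thread support and check what is provided.
| program check_thread_level
|   use mpi
|   implicit none
|   integer :: provided, mpierror
|   call mpi_init_thread(MPI_THREAD_MULTIPLE, provided, mpierror)
|   if (provided < MPI_THREAD_MULTIPLE) then
|      print *, 'MPI provides only thread support level ', provided
|   end if
|   call mpi_finalize(mpierror)
| end program check_thread_level
`----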

So digging in with Valgrind may be of help, possibly after recompiling
Open MPI with Valgrind checking and debugging options enabled.
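
For example (a sketch: these are Open MPI’s standard configure options;
the install prefix is a placeholder and the command line is taken from
your mail):

,----
| # Rebuild Open MPI with debugging and the Valgrind memchecker:
| ./configure --enable-debug --enable-memchecker --prefix=$HOME/ompi-dbg
| make -j && make install
| # Then run each rank under Valgrind:
| mpirun -np 4 --bind-to none valgrind --track-origins=yes \
|        ../../../../../pcorona+openmp~gauss Fe13_NL3.params
`----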

Best regards,
Rainer


> On 22. Apr 2022, at 14:40, Angel de Vicente via users 
> <users@lists.open-mpi.org> wrote:
> 
> Hello,
> 
> I'm running out of ideas, and wonder if someone here could have some
> tips on how to debug a segmentation fault I'm having with my
> application [due to the nature of the problem I'm wondering if the
> problem is with OpenMPI itself rather than my app, though at this point
> I'm not leaning strongly either way].
> 
> The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
> OpenMPI 4.1.3.
> 
> Usually I was running the code with "mpirun -np X --bind-to none [...]"
> so that the threads created by OpenMP don't get bound to a single core
> and I actually get proper speedup out of OpenMP.
> 
> Now, since I introduced some changes to the code this week (though I
> have read the changes carefully a number of times and don't see
> anything suspicious), I sometimes get a segmentation fault, but only
> when I run with "--bind-to none" and only on my workstation. It is not
> always with the same run configuration, but I can see a pattern: the
> problem shows up only if I run the version compiled with OpenMP
> support, and most of the time only when the number of ranks*threads
> goes above 4 or so. If I run it with "--bind-to socket", all looks
> good all the time.
> 
> If I run it on another server, "--bind-to none" doesn't seem to cause
> any issue (I submitted the jobs many, many times and not a single
> segmentation fault), but on my workstation it fails almost every time
> when using MPI+OpenMP with a handful of threads and "--bind-to none".
> On this other server I'm running gcc 9.3.0 and OpenMPI 4.1.3.
> 
> For example, setting OMP_NUM_THREADS to 1, I run the code like the
> following, and get the segmentation fault as below:
> 
> ,----
> | angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to none ../../../../../pcorona+openmp~gauss Fe13_NL3.params
> |  Reading control file: Fe13_NL3.params
> |   ... Control file parameters broadcasted
> | 
> | [...]
> |  
> |  Starting calculation loop on the line of sight
> |  Receiving results from:            2
> |  Receiving results from:            1
> | 
> | Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
> | 
> | Backtrace for this error:
> |  Receiving results from:            3
> | #0  0x7fd747e7555f in ???
> | #1  0x7fd7488778e1 in ???
> | #2  0x7fd7488667a4 in ???
> | #3  0x7fd7486fe84c in ???
> | #4  0x7fd7489aa9ce in ???
> | #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
> |         at src/pcorona_main.f90:627
> | #6  0x7fd74813ec75 in ???
> | #7  0x412bb0 in pcorona
> |         at src/pcorona.f90:49
> | #8  0x40361c in main
> |         at src/pcorona.f90:17
> | 
> | [...]
> | 
> | --------------------------------------------------------------------------
> | mpirun noticed that process rank 3 with PID 0 on node sieladon exited on signal 11 (Segmentation fault).
> | -------------------------------------------------------
> `----
> 
> I cannot see inside the MPI library (I don't really know if that would
> be helpful) but line 627 in pcorona_main.f90 is:
> 
> ,----
> |              call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
> `----
> 
> Any ideas/suggestions about what could be going on, or how to try and
> get some more clues about the possible causes of this?
> 
> Many thanks,
> -- 
> Ángel de Vicente
> 
> Tel.: +34 922 605 747
> Web.: http://research.iac.es/proyecto/polmag/

---------------------------------------------------------------------
Prof. Dr.-Ing. Rainer Keller, HS Esslingen
Degree Program Coordinator, M.Sc. Applied Computer Science
Professor of Operating Systems, Distributed and Parallel Systems
Faculty of Computer Science and Information Technology
Flandernstr. 101, Room F01.320
73732 Esslingen
T.: +49 (0)711 397-4165
F.: +49 (0)711 397-48 4165
