Hello,

I'm running out of ideas and wondering whether someone here has some
tips on how to debug a segmentation fault I'm getting with my
application [given the nature of the problem, I wonder whether the
issue is with OpenMPI itself rather than with my app, though at this
point I'm not leaning strongly either way].

The code is hybrid MPI+OpenMP and I compile it with gcc 10.3.0 and
OpenMPI 4.1.3.

I usually run the code with "mpirun -np X --bind-to none [...]" so
that the threads created by OpenMP don't all get bound to a single
core and I actually get a proper speedup out of OpenMP.

Since I introduced some changes to the code this week (I have read
through the changes carefully a number of times and don't see
anything suspicious), I now sometimes get a segmentation fault, but
only when I run with "--bind-to none" and only on my workstation. It
does not always happen with the same run configuration, but there is
a pattern: the problem shows up only if I run the version compiled
with OpenMP support, and most of the time only when the number of
ranks times threads goes above 4 or so. If I run with "--bind-to
socket", everything looks fine all the time.

If I run it on another server, "--bind-to none" doesn't seem to cause
any issue (I have submitted the jobs many, many times without a
single segmentation fault), but on my workstation it fails almost
every time when using MPI+OpenMP with a handful of threads and
"--bind-to none". On that other server I'm running gcc 9.3.0 and
OpenMPI 4.1.3.

For example, setting OMP_NUM_THREADS to 1, I run the code as follows
and get the segmentation fault shown below:

,----
| angelv@sieladon:~/.../Fe13_NL3/t~gauss+isat+istim$ mpirun -np 4 --bind-to none  ../../../../../pcorona+openmp~gauss Fe13_NL3.params
|  Reading control file: Fe13_NL3.params
|   ... Control file parameters broadcasted
| 
| [...]
|  
|  Starting calculation loop on the line of sight
|  Receiving results from:            2
|  Receiving results from:            1
| 
| Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
| 
| Backtrace for this error:
|  Receiving results from:            3
| #0  0x7fd747e7555f in ???
| #1  0x7fd7488778e1 in ???
| #2  0x7fd7488667a4 in ???
| #3  0x7fd7486fe84c in ???
| #4  0x7fd7489aa9ce in ???
| #5  0x414959 in __pcorona_main_MOD_main_loop._omp_fn.0
|         at src/pcorona_main.f90:627
| #6  0x7fd74813ec75 in ???
| #7  0x412bb0 in pcorona
|         at src/pcorona.f90:49
| #8  0x40361c in main
|         at src/pcorona.f90:17
| 
| [...]
| 
| --------------------------------------------------------------------------
| mpirun noticed that process rank 3 with PID 0 on node sieladon exited on signal 11 (Segmentation fault).
| --------------------------------------------------------------------------
`----

I cannot see inside the MPI library (I don't really know whether that
would be helpful), but line 627 in pcorona_main.f90 is:

,----
|              call mpi_probe(master,mpi_any_tag,mpi_comm_world,stat,mpierror)
`----
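
For context, that probe is issued from inside an OpenMP parallel
region (that is what the __pcorona_main_MOD_main_loop._omp_fn.0 frame
in the backtrace corresponds to). A stripped-down, self-contained
sketch of the same pattern, with placeholder names rather than the
real pcorona code, would be roughly:

,----
| program probe_sketch
|   ! Simplified stand-in for the structure around pcorona_main.f90:627;
|   ! everything except the MPI/OpenMP calls themselves is a placeholder.
|   use mpi
|   implicit none
|   integer, parameter :: master = 0
|   integer :: required, provided, mpierror
|   integer :: rank, nprocs, i, payload
|   integer :: stat(mpi_status_size)
|
|   ! Ask for full thread support; 'provided' reports what the library grants.
|   required = mpi_thread_multiple
|   call mpi_init_thread(required, provided, mpierror)
|   call mpi_comm_rank(mpi_comm_world, rank, mpierror)
|   call mpi_comm_size(mpi_comm_world, nprocs, mpierror)
|
|   if (rank == master) then
|      print *, 'thread support requested/provided:', required, provided
|      ! The master hands one piece of work to every other rank.
|      do i = 1, nprocs - 1
|         payload = 100 + i
|         call mpi_send(payload, 1, mpi_integer, i, 0, mpi_comm_world, mpierror)
|      end do
|   else
|      ! Workers run an OpenMP region; one thread probes for work and
|      ! receives it, mirroring the call at pcorona_main.f90:627.
|      !$omp parallel default(shared) private(stat, mpierror, payload)
|      !$omp master
|      call mpi_probe(master, mpi_any_tag, mpi_comm_world, stat, mpierror)
|      call mpi_recv(payload, 1, mpi_integer, stat(mpi_source), stat(mpi_tag), &
|                    mpi_comm_world, mpi_status_ignore, mpierror)
|      !$omp end master
|      !$omp end parallel
|   end if
|
|   call mpi_finalize(mpierror)
| end program probe_sketch
`----

(In the sketch I request mpi_thread_multiple just to show where the
thread-support level gets negotiated; I don't know yet whether the
level OpenMPI actually provides on my workstation has anything to do
with the crash.)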

Any ideas or suggestions about what could be going on, or how to try
to get some more clues about the possible causes of this?

Many thanks,
-- 
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/polmag/
