I once had a crash in libpthread much like the one below.  The
very un-obvious cause was a stack overflow on subroutine entry, caused
by a large automatic array.
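
As a rough illustration (the names here are made up, not taken from the
trace below), a subroutine like this puts its work array on the stack,
so a large n can overflow the stack at the moment of entry, before any
executable statement runs:

    ! Hypothetical sketch only: the automatic array lives on the stack,
    ! so the overflow happens at subroutine entry, not at an obvious line.
    subroutine big_work(n)
      implicit none
      integer, intent(in) :: n
      complex*16 :: work(n, n)   ! automatic array: 16*n*n bytes of stack
      work = (0.d0, 0.d0)
      print *, 'first element:', work(1, 1)
    end subroutine big_work

If that turns out to be the cause here, raising the stack limit (e.g.
"ulimit -s unlimited" in the shell that launches mpirun) or compiling
with ifort's -heap-arrays option usually makes the crash go away.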

HTH,
Douglas.

On Wed, Mar 04, 2009 at 03:04:20PM -0500, Jeff Squyres wrote:
> On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:
> 
> >I am using Intel lc_prof-11 (and its own MKL) and have built
> >openmpi-1.3.1 with the configure options "FC=ifort F77=ifort CC=icc
> >CXX=icpc". Then I built my application.
> >The Linux box has two AMD64 quad-core processors. In the middle of a
> >run of my application (after some 15 iterations), I receive the
> >message below and it stops.
> >I tried configuring openmpi with "--disable-mpi-threads", but it
> >automatically assumes "posix".
> 
> This doesn't sound like a threading problem, thankfully.  Open MPI has  
> two levels of threading issues:
> 
> - whether MPI_THREAD_MULTIPLE is supported or not (which is what
> --enable|disable-mpi-threads does)
> - whether thread support is present at all on the system (e.g.,  
> solaris or posix threads)
> 
> You see "posix" in the configure output mainly because OMPI still  
> detects that posix threads are available on the system.  It doesn't  
> necessarily mean that threads will be used in your application's run.
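
For what it's worth, an application can also check at run time which
thread level the library actually provides, independent of what
configure detected.  A minimal sketch (not from the original code):

    program check_thread_level
      implicit none
      include 'mpif.h'
      integer :: provided, ierr
      ! Ask for the highest level; the library reports what it can give.
      call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
      if (provided < MPI_THREAD_MULTIPLE) then
         print *, 'MPI_THREAD_MULTIPLE not available, provided =', provided
      end if
      call MPI_Finalize(ierr)
    end program check_thread_level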
> 
> >This problem does not happen in openmpi-1.2.9.
> >Any comment is highly appreciated.
> >Best regards,
> >                    mahmoud payami
> >
> >
> >[hpc1:25353] *** Process received signal ***
> >[hpc1:25353] Signal: Segmentation fault (11)
> >[hpc1:25353] Signal code: Address not mapped (1)
> >[hpc1:25353] Failing at address: 0x51
> >[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
> >[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae350d96]
> >[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae3514a8]
> >[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2aaaaeb7c72a]
> >[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2aaaab42b7d9]
> >[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae34d27c]
> >[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2aaaaaf46010]
> >[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2aaaaacd6af4]
> >[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
> >[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
> >[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
> >[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
> >[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
> >[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
> >[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
> >[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
> >[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
> >[hpc1:25353] *** End of error message ***
> >--------------------------------------------------------------------------
> >mpirun noticed that process rank 6 with PID 25353 on node hpc1  
> >exited on signal 11 (Segmentation fault).
> >--------------------------------------------------------------------------
> 
> What this stack trace tells us is that Open MPI crashed somewhere  
> while trying to use shared memory for message passing, but it doesn't  
> really tell us much else.  It's not clear, either, whether this is  
> OMPI's fault or your app's fault (or something else).
> 
> Can you run your application through a memory-checking debugger to see  
> if anything obvious pops out?
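
For example, assuming valgrind is installed (the input file name below
is just a placeholder), something along the lines of

    mpirun -np 8 valgrind --track-origins=yes /opt/QE131_cc/bin/pw.x < input.in

will report invalid reads/writes and uses of uninitialized memory in
each rank, at the cost of a much slower run.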
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
