I once had a crash in libpthread much like the one below. The very un-obvious cause was a stack overflow on subroutine entry, due to a large automatic array.
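
For what it's worth, the pattern looked roughly like the sketch below (hypothetical names, not code from pw.x): an automatic array sized by a dummy argument is carved out of the stack when the subroutine is entered, so a large enough size segfaults before the first executable statement runs. If that turns out to be the culprit here, raising the stack limit (e.g. "ulimit -s unlimited") or asking ifort to put automatic arrays on the heap (its -heap-arrays option) is worth a try.

  ! Hypothetical illustration only, not code from pw.x.
  subroutine work(n)
    implicit none
    integer, intent(in) :: n
    ! Automatic array: carved out of the stack when work() is entered.
    double precision :: tmp(n, n)
    tmp = 0.0d0
    print *, sum(tmp)
  end subroutine work

  program demo
    implicit none
    ! 4000*4000*8 bytes is ~122 MB of stack, far past a typical 8-10 MB
    ! limit, so the crash happens at entry to work(), not in its body.
    call work(4000)
  end program demo
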
HTH, Douglas.

On Wed, Mar 04, 2009 at 03:04:20PM -0500, Jeff Squyres wrote:
> On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:
>
> >I am using intel lc_prof-11 (and its own mkl) and have built
> >openmpi-1.3.1 with connfigure options: "FC=ifort F77=ifort CC=icc
> >CXX=icpc". Then I have built my application.
> >The linux box is 2Xamd64 quad. In the middle of running of my
> >application (after some 15 iterations), I receive the message and
> >stops.
> >I tried to configure openmpi using "--disable-mpi-threads" but it
> >automatically assumes "posix".
>
> This doesn't sound like a threading problem, thankfully. Open MPI has
> two levels of threading issues:
>
> - whether MPI_THREAD_MULTIPLE is supported or not (which is what --
> enable|disable-mpi-threads does)
> - whether thread support is present at all on the system (e.g.,
> solaris or posix threads)
>
> You see "posix" in the configure output mainly because OMPI still
> detects that posix threads are available on the system. It doesn't
> necessarily mean that threads will be used in your application's run.
>
> >This problem does not happen in openmpi-1.2.9.
> >Any comment is highly appreciated.
> >Best regards,
> > mahmoud payami
> >
> >
> >[hpc1:25353] *** Process received signal ***
> >[hpc1:25353] Signal: Segmentation fault (11)
> >[hpc1:25353] Signal code: Address not mapped (1)
> >[hpc1:25353] Failing at address: 0x51
> >[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
> >[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/
> >openmpi/mca_pml_ob1.so [0x2aaaae350d96]
> >[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/
> >openmpi/mca_pml_ob1.so [0x2aaaae3514a8]
> >[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so
> >[0x2aaaaeb7c72a]
> >[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.
> >0(opal_progress+0x89) [0x2aaaab42b7d9]
> >[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so
> >[0x2aaaae34d27c]
> >[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv
> >+0x210) [0x2aaaaaf46010]
> >[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv
> >+0xa4) [0x2aaaaacd6af4]
> >[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_
> >+0x13da) [0x513d8a]
> >[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
> >[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
> >[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
> >[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
> >[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
> >[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
> >[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4)
> >[0x303b21d8a4]
> >[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
> >[hpc1:25353] *** End of error message ***
> >--------------------------------------------------------------------------
> >mpirun noticed that process rank 6 with PID 25353 on node hpc1
> >exited on signal 11 (Segmentation fault).
> >--------------------------------------------------------------------------
>
> What this stack trace tells us is that Open MPI crashed somewhere
> while trying to use shared memory for message passing, but it doesn't
> really tell us much else. It's not clear, either, whether this is
> OMPI's fault or your app's fault (or something else).
>
> Can you run your application through a memory-checking debugger to see
> if anything obvious pops out?
>
> --
> Jeff Squyres
> Cisco Systems
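
P.S. On Jeff's point about the two thread levels: if it helps, a tiny standalone test along the lines of the (hypothetical) sketch below shows which thread level the installed library actually grants at run time, and the memory-debugger run he asks for can usually be done by launching each rank under valgrind via mpirun.

  ! Hypothetical standalone check, not part of pw.x.
  program thread_level
    implicit none
    include 'mpif.h'
    integer :: provided, ierr
    ! Request the highest level; 'provided' reports what the library
    ! grants (MPI_THREAD_SINGLE / FUNNELED / SERIALIZED / MULTIPLE).
    call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
    print *, 'provided thread level =', provided
    call MPI_Finalize(ierr)
  end program thread_level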