On Feb 27, 2009, at 1:56 PM, Mahmoud Payami wrote:

I am using Intel lc_prof-11 (and its own MKL) and have built openmpi-1.3.1 with the configure options "FC=ifort F77=ifort CC=icc CXX=icpc". Then I built my application. The Linux box is a 2x AMD64 quad-core. In the middle of a run of my application (after some 15 iterations), I receive the message below and it stops. I tried to configure openmpi using "--disable-mpi-threads", but it automatically assumes "posix".

This doesn't sound like a threading problem, thankfully. Open MPI has two levels of threading issues:

- whether MPI_THREAD_MULTIPLE is supported or not (which is what --enable|disable-mpi-threads does)
- whether thread support is present at all on the system (e.g., solaris or posix threads)

You see "posix" in the configure output mainly because OMPI still detects that posix threads are available on the system. It doesn't necessarily mean that threads will be used in your application's run.
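If you want to check the first level directly, the standard way is to ask the library at init time. A minimal sketch in C (this is plain MPI, nothing Open MPI-specific, and not taken from your application):

    /* Query the thread support level the MPI library actually grants. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request MPI_THREAD_MULTIPLE; 'provided' reports what we really got. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            printf("MPI_THREAD_MULTIPLE not available (provided=%d)\n", provided);
        } else {
            printf("MPI_THREAD_MULTIPLE available\n");
        }
        MPI_Finalize();
        return 0;
    }

Unless your application calls MPI_Init_thread and asks for something above MPI_THREAD_SINGLE, the detected posix threads simply sit unused.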

This problem does not happen in openmpi-1.2.9.
Any comment is highly appreciated.
Best regards,
                    mahmoud payami


[hpc1:25353] *** Process received signal ***
[hpc1:25353] Signal: Segmentation fault (11)
[hpc1:25353] Signal code: Address not mapped (1)
[hpc1:25353] Failing at address: 0x51
[hpc1:25353] [ 0] /lib64/libpthread.so.0 [0x303be0dd40]
[hpc1:25353] [ 1] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae350d96]
[hpc1:25353] [ 2] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae3514a8]
[hpc1:25353] [ 3] /opt/openmpi131_cc/lib/openmpi/mca_btl_sm.so [0x2aaaaeb7c72a]
[hpc1:25353] [ 4] /opt/openmpi131_cc/lib/libopen-pal.so.0(opal_progress+0x89) [0x2aaaab42b7d9]
[hpc1:25353] [ 5] /opt/openmpi131_cc/lib/openmpi/mca_pml_ob1.so [0x2aaaae34d27c]
[hpc1:25353] [ 6] /opt/openmpi131_cc/lib/libmpi.so.0(PMPI_Recv+0x210) [0x2aaaaaf46010]
[hpc1:25353] [ 7] /opt/openmpi131_cc/lib/libmpi_f77.so.0(mpi_recv+0xa4) [0x2aaaaacd6af4]
[hpc1:25353] [ 8] /opt/QE131_cc/bin/pw.x(parallel_toolkit_mp_zsqmred_+0x13da) [0x513d8a]
[hpc1:25353] [ 9] /opt/QE131_cc/bin/pw.x(pcegterg_+0x6c3f) [0x6667ff]
[hpc1:25353] [10] /opt/QE131_cc/bin/pw.x(diag_bands_+0xb9e) [0x65654e]
[hpc1:25353] [11] /opt/QE131_cc/bin/pw.x(c_bands_+0x277) [0x6575a7]
[hpc1:25353] [12] /opt/QE131_cc/bin/pw.x(electrons_+0x53f) [0x58a54f]
[hpc1:25353] [13] /opt/QE131_cc/bin/pw.x(MAIN__+0x1fb) [0x458acb]
[hpc1:25353] [14] /opt/QE131_cc/bin/pw.x(main+0x3c) [0x4588bc]
[hpc1:25353] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x303b21d8a4]
[hpc1:25353] [16] /opt/QE131_cc/bin/pw.x(realloc+0x1b9) [0x4587e9]
[hpc1:25353] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 25353 on node hpc1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

What this stack trace tells us is that Open MPI crashed somewhere while trying to use shared memory for message passing, but it doesn't tell us much else. It's also not clear whether this is OMPI's fault or your app's fault (or something else).

Can you run your application through a memory-checking debugger to see if anything obvious pops out?
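For example, valgrind can be run underneath mpirun; something along these lines (the process count, binary path, and input redirection here are just placeholders for your actual run):

    mpirun -np 8 valgrind --leak-check=full --track-origins=yes \
        --log-file=valgrind.%p.log ./pw.x < your_input

The %p in --log-file makes valgrind write one log per process, so you can look at just the rank that segfaults.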

--
Jeff Squyres
Cisco Systems
