Apologies for the vagueness of the problem I'm about to describe, but
I only understand it vaguely myself. Any pointers on the best
direction for further investigation would be appreciated. Lengthy
details follow:

So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
into some weird behaviour. When run under mpiexec, a segmentation
fault is thrown:

% mpiexec -n 2 ./omegamip
[...]
main.cpp:52: Finished.
Completed 20 of 20 in 0.0695 minutes
[queen:23560] *** Process received signal ***
[queen:23560] Signal: Segmentation fault (11)
[queen:23560] Signal code:  (128)
[queen:23560] Failing at address: (nil)
[queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
[queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
[queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
[queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
[queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
[queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
[queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
[queen:23560] *** End of error message ***
mpiexec noticed that job rank 1 with PID 23560 on node
queen.bioinformatics exited on signal 11 (Segmentation fault).
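
For what it's worth, demangling frame 3 of that trace points at the
destructor of one of the program's own classes, so the free() in
frames 1-2 is presumably one of that destructor's deletes being
serviced through Open MPI's libopen-pal:

% c++filt _ZN12omegaMapBaseD2Ev
omegaMapBase::~omegaMapBase()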

Right, so I've got a memory overrun or something. Except that when the
program is run in standalone mode, it works fine:

% ./omegamip
[...]
main.cpp:52: Finished.
Completed 20 of 20 in 0.05970 minutes

Right, so there's a difference between my standalone and MPI modes.
Except that the only difference between the two versions is currently
the calls to MPI_Init and MPI_Finalize, plus some exploratory calls
to MPI_Comm_size and MPI_Comm_rank; I haven't gotten as far as coding
the actual division of work. (See the sketch after the next bit of
output.) Also, calling mpiexec with 1 process always works:

% mpiexec -n 1 ./omegamip
[...]
main.cpp:52: Finished.
Completed 20 of 20 in 0.05801 minutes
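
For concreteness, the MPI scaffolding amounts to no more than this
sketch (runAnalysis is just a placeholder name for the pre-existing
code, which I haven't touched):

#include <mpi.h>

// placeholder for the pre-existing program's entry point
void runAnalysis (int argc, char* argv[]);

int main (int argc, char* argv[]) {
   MPI_Init (&argc, &argv);

   // exploratory calls; the results aren't used for anything yet
   int worldSize = 0, myRank = 0;
   MPI_Comm_size (MPI_COMM_WORLD, &worldSize);
   MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

   runAnalysis (argc, argv);   // the original program, unchanged

   MPI_Finalize ();
   return 0;
}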

So there's still this segmentation fault. Running valgrind over the
program doesn't show any obvious problems: there was some quirky
pointer arithmetic and some huge blocks of leaked memory, but those
were only leaked at program termination (i.e. the original programmer
didn't bother cleaning up at exit), and I've since caught most of
them. Yet the segmentation fault still occurs, and only under mpiexec
with 2 or more processes. From diagnostic printfs and logging, I can
see that it happens at the very end of the program, at the very end
of main, possibly while destructors are being run automatically. But
again, this cleanup causes no problems in standalone or 1-process
mode.
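
(In case it's relevant: I ran valgrind over the standalone binary. My
assumption is that the MPI case could be checked directly with
something like the line below, with each rank's output interleaved on
stderr, but I haven't dug into that yet.)

% mpiexec -n 2 valgrind ./omegamip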

So, any ideas for where to start looking?

Technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
Red Hat 4.1.2-42

----
Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
Bioinformatics, Centre for Infections, Health Protection Agency
