Apologies for the vague details of the problem I'm about to describe, but I only understand it vaguely myself. Any pointers on the best directions for further investigation would be appreciated. Lengthy details follow:
So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run into some weird behaviour. When run under mpiexec, a segmentation fault is thrown:

    % mpiexec -n 2 ./omegamip
    [...]
    main.cpp:52: Finished. Completed 20 of 20 in 0.0695 minutes
    [queen:23560] *** Process received signal ***
    [queen:23560] Signal: Segmentation fault (11)
    [queen:23560] Signal code: (128)
    [queen:23560] Failing at address: (nil)
    [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
    [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
    [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
    [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
    [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
    [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
    [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
    [queen:23560] *** End of error message ***
    mpiexec noticed that job rank 1 with PID 23560 on node queen.bioinformatics exited on signal 11 (Segmentation fault).

Right, so I've got a memory overrun or something. Except that when the program is run in standalone mode, it works fine:

    % ./omegamip
    [...]
    main.cpp:52: Finished. Completed 20 of 20 in 0.05970 minutes

Right, so there's a difference between my standalone and MPI modes. Except that the difference between my standalone and MPI versions is currently nothing but the calls to MPI_Init and MPI_Finalize, plus some exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't gotten as far as coding the problem division; see the sketch in the P.S. below.) Also, calling mpiexec with 1 process always works:

    % mpiexec -n 1 ./omegamip
    [...]
    main.cpp:52: Finished. Completed 20 of 20 in 0.05801 minutes

So there's still this segmentation fault. Running valgrind across the program doesn't show any obvious problems: there was some quirky pointer arithmetic and some huge blocks of memory that were never freed, but the latter were only leaked at program termination (i.e. the original programmer didn't bother cleaning up on exit). I've fixed most of those. But the segmentation fault still occurs, and only when run under mpiexec with 2 or more processes.

By means of diagnostic printfs and logging, I can see that the fault occurs only at the very end of the program, at the very end of main, presumably while destructors are being called automatically: frame [3] of the trace, _ZN12omegaMapBaseD2Ev, demangles to omegaMapBase::~omegaMapBase(). (I also notice that the free() in the trace resolves into OpenMPI's libopen-pal.so.0 rather than libc, though I don't know whether that's significant.) But again, this cleanup causes no problems in standalone or 1-process mode.

So, any ideas for where to start looking?

Technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64, Red Hat 4.1.2-42.

----
Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
Bioinformatics, Centre for Infections, Health Protection Agency
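P.S. For concreteness, here is roughly all the MPI-specific code currently in the program. This is a sketch rather than a verbatim extract: runOmegaMap() is a placeholder name for the pre-existing program logic (the real code constructs an object of the omegaMapBase family, whose destructor is the one appearing in the trace above), and the rank/size variables are my own names.

    #include <mpi.h>

    void runOmegaMap (int argc, char* argv[]);   // placeholder for the untouched original program

    int main (int argc, char* argv[]) {
        MPI_Init (&argc, &argv);

        // exploratory calls only; the results aren't used for anything yet
        int world_size, world_rank;
        MPI_Comm_size (MPI_COMM_WORLD, &world_size);
        MPI_Comm_rank (MPI_COMM_WORLD, &world_rank);

        runOmegaMap (argc, argv);

        MPI_Finalize ();
        return 0;
    }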
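P.P.S. For reference, the valgrind check mentioned above was something along the lines of the following (the exact flags may have differed):

    % valgrind --leak-check=full ./omegamip

and the demangling of the destructor frame can be confirmed with:

    % c++filt _ZN12omegaMapBaseD2Ev
    omegaMapBase::~omegaMapBase()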