Ouch. These are the worst kinds of bugs to find. :-( If you attach a debugger to these processes and step through their final death throes, does that provide any additional insight? I have not infrequently done stuff like this:
{
  int i = 0;
  printf("Process %d ready to attach\n", getpid());
  while (i == 0) sleep(5);
}

Then you get a message indicating which pid to attach to. When you attach, set the variable i to nonzero and you can continue stepping through the process. (A sample attach session is sketched after the quoted message below.)

On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:

> Apologies for the vague details of the problem I'm about to describe,
> but then I only understand it vaguely. Any pointers about the best
> directions for further investigation would be appreciated. Lengthy
> details follow:
>
> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
> into some weird behaviour. When run under mpiexec, a segmentation
> fault is thrown:
>
> % mpiexec -n 2 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.0695 minutes
> [queen:23560] *** Process received signal ***
> [queen:23560] Signal: Segmentation fault (11)
> [queen:23560] Signal code: (128)
> [queen:23560] Failing at address: (nil)
> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
> [queen:23560] *** End of error message ***
> mpiexec noticed that job rank 1 with PID 23560 on node
> queen.bioinformatics exited on signal 11 (Segmentation fault).
>
> Right, so I've got a memory overrun or something. Except that when the
> program is run in standalone mode, it works fine:
>
> % ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05970 minutes
>
> Right, so there's a difference between my standalone and MPI modes.
> Except that the difference between my standalone and MPI versions is
> currently nothing but the calls to MPI_Init, MPI_Finalize and some
> exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't
> gotten as far as coding the problem division.) Also, calling mpiexec
> with 1 process always works:
>
> % mpiexec -n 1 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05801 minutes
>
> So there's still this segmentation fault. Running valgrind across the
> program doesn't show any obvious problems: there was some quirky
> pointer arithmetic and some huge blocks of dangling memory, but these
> were only leaked at the end of the program (i.e. the original
> programmer didn't bother cleaning up at program termination). I've
> caught most of those. But the segmentation fault still occurs only
> when run under mpiexec with 2 or more processes. And by use of
> diagnostic printfs and logging, I can see that it only occurs at the
> very end of the program, the very end of main, possibly when
> destructors are being automatically called. But again this cleanup
> doesn't cause any problems with the standalone or 1 process modes.
>
> So, any ideas for where to start looking?
>
> technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
> Red Hat 4.1.2-42
>
> ----
> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
> Bioinformatics, Centre for Infections, Health Protection Agency

--
Jeff Squyres
jsquy...@cisco.com
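
A rough sketch of how the attach trick above plays out in practice. This is illustrative only: it assumes gdb is available on the node and the program is built with -g, and it reuses the pid from the trace above; the exact frames you land in will differ.

% mpiexec -n 2 ./omegamip
Process 23560 ready to attach
...

# in another terminal, on the node running that rank
% gdb ./omegamip 23560
(gdb) up                 # repeat until the frame containing i is selected
(gdb) set var i = 1      # break out of the holding loop
(gdb) next               # then single-step through the teardown

Note that the snippet itself needs <stdio.h> and <unistd.h> (for printf, getpid, and sleep) if they are not already included.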
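For context, a minimal sketch of the kind of MPI bracketing the original poster describes: MPI_Init and MPI_Finalize plus exploratory MPI_Comm_size / MPI_Comm_rank calls around otherwise unchanged serial code. The run_omegamip() function is hypothetical and merely stands in for the existing omegaMap logic; the real program's structure may differ.

#include <mpi.h>

// Hypothetical stand-in for the existing, unchanged serial omegaMap code.
void run_omegamip() { /* ... existing serial program ... */ }

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // exploratory calls only;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // no work division yet

    run_omegamip();                         // same computation as the standalone build

    MPI_Finalize();
    return 0;   // the reported crash is at the very end of main, as
                // automatic destructors run, and only with 2+ ranks
}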