Ouch.  These are the worst kinds of bugs to find.  :-(

If you attach a debugger to these processes and step through the final death 
throes of the process, does it provide any additional insight?  I have not 
infrequently done stuff like this:

  /* needs <stdio.h> and <unistd.h> */
  {
     volatile int i = 0;   /* volatile so the debugger's write to i isn't optimized away */
     printf("Process %d ready to attach\n", (int) getpid());
     while (i == 0) sleep(5);
  }

Then you get a message indicating which pid to attach to.  When you attach, set 
the variable i to nonzero and you can continue stepping through the process.
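
Once you've attached (gdb shown here; other debuggers have equivalents), 
something like this will get you moving again -- a sketch, the exact 
commands will vary:

  % gdb ./omegamip <pid>     # <pid> is the value printed by the process
  (gdb) set var i = 1        # so the while loop will exit
  (gdb) finish               # step back out of sleep() (repeat if needed)
  (gdb) next                 # then step through the rest of main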



On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:

> Apologies for the vague details of the problem I'm about to describe,
> but then I only understand it vaguely. Any pointers about the best
> directions for further investigation would be appreciated. Lengthy
> details follow:
> 
> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
> into some weird behaviour. When run under mpiexec, a segmentation
> fault is thrown:
> 
> % mpiexec -n 2 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.0695 minutes
> [queen:23560] *** Process received signal ***
> [queen:23560] Signal: Segmentation fault (11)
> [queen:23560] Signal code:  (128)
> [queen:23560] Failing at address: (nil)
> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40)
> [0x2afb1fa43460]
> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) 
> [0x2afb1fa439ad]
> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
> [queen:23560] *** End of error message ***
> mpiexec noticed that job rank 1 with PID 23560 on node
> queen.bioinformatics exited on signal 11 (Segmentation fault).
> 
> Right, so I've got a memory overrun or something. Except that when the
> program is run in standalone mode, it works fine:
> 
> % ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05970 minutes
> 
> Right, so there's a difference between my standalone and MPI modes.
> Except the difference between my standalone and MPI versions is
> currently nothing but the calls to MPI_Init, MPI_Finalize and some
> exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't
> gotten as far as coding the problem division.) Also, calling mpiexec
> with 1 process always works:
> 
> % mpiexec -n 1 ./omegamip
> [...]
> main.cpp:52: Finished.
> Completed 20 of 20 in 0.05801 minutes
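> 
> For reference, the MPI-related additions amount to roughly this skeleton
> (a sketch only; the surrounding code is unchanged and the variable names
> here are just illustrative, not the actual omegamip code):
> 
>   #include <mpi.h>
> 
>   int main(int argc, char* argv[]) {
>       MPI_Init(&argc, &argv);
>       int size = 0, rank = 0;
>       MPI_Comm_size(MPI_COMM_WORLD, &size);   /* exploratory only */
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* no work division yet */
>       /* ... existing program logic runs here, unchanged ... */
>       MPI_Finalize();
>       return 0;
>   }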
> 
> So there's still this segmentation fault. Running valgrind across the
> program doesn't show any obvious problems: there was some quirky
> pointer arithmetic and some huge blocks of dangling memory, but these
> were only leaked at the end of the program (i.e. the original
> programmer didn't bother cleaning up at program termination). I've
> caught most of those. But the segmentation fault still occurs only
> when run under mpiexec with 2 or more processes. And by use of
> diagnostic printfs and logging, I can see that it only occurs at the
> very end of the program, the very end of main, possibly when
> destructors are being automatically called. But again this cleanup
> doesn't cause any problems with the standalone or 1 process modes.
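> 
> (For completeness: valgrind can also be interposed under mpiexec rather
> than run standalone, e.g.
> 
>   % mpiexec -n 2 valgrind --leak-check=full ./omegamip
> 
> in case the multi-process runs need the same check.)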
> 
> So, any ideas for where to start looking?
> 
> technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
> Red Hat 4.1.2-42
> 
> ----
> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
> Bioinformatics, Centre for Infections, Health Protection Agency
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

