Hi Everyone.

I've been trying to figure out an issue with ompi-checkpoint/BLCR. The symptoms seem to depend on which filesystem the snapc_base_global_snapshot_dir is located on.

I wrote a simple MPI program in which rank 0 sends to rank 1, rank 1 to rank 2, and so on, with the highest rank sending back to rank 0; then it waits one second and repeats.
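
The loop is essentially this (a minimal sketch from memory, not the exact source, but the real program does the same ring-plus-sleep):

/* mpiloop.c -- rough sketch of the test program */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    while (1) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        if (rank == 0) {
            /* rank 0 starts the ring and waits for the token to come back */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }

        sleep(1);   /* wait one second and repeat */
    }

    MPI_Finalize();  /* never reached; the job runs until it is killed or checkpointed */
    return 0;
}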

I'm using openmpi-1.4.3. When I run "ompi-checkpoint --term <pidofmpirun>" with the snapshot directory on the shared filesystem, ompi-checkpoint returns a checkpoint reference and the worker processes go away, but mpirun remains and is stuck. (It dies right away if I kill it, so it is still responding to SIGTERM.) If I attach strace to the mpirun, I get the following forever:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)

I'm running with:
mpirun -machinefile machines -am ft-enable-cr ./mpiloop
the "machines" file simply has the local hostname listed a few times. I've tried 2 and 8. I can try up to 24 as this node is a pretty big one if it's deemed useful. Also, there's 256Gb of RAM. And it's Opteron 6 core, 4 socket if that helps.


I initially installed this on a test system with only local hard disks and standard NFS on CentOS 5.6, where everything worked as expected. When I moved over to the production system, things started breaking. The filesystem is the major software difference: the shared filesystems are Ibrix, and that is where the above symptoms started to appear.

I haven't even moved on to multi-node MPI runs, because I can't get this to work for any number of processes on the local machine unless I set the checkpoint directory to /tmp, which is on a local XFS hard disk. If I put the checkpoints on any shared directory, things fail.

I've tried a number of *_verbose MCA parameters, and none of them emit any messages at the point of checkpoint; only when I give up and run kill `pidof mpirun` do any further messages appear.
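
For reference, the kind of invocation I've been trying looks like this (parameter names from memory, so they may not be the exact ones that matter here):

mpirun -machinefile machines -am ft-enable-cr -mca snapc_base_verbose 20 -mca crs_base_verbose 20 ./mpiloop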

openmpi is compiled with:
./configure --prefix=/global/software/openmpi-blcr --with-blcr=/global/software/blcr --with-blcr-libdir=/global/software/blcr/lib/ --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-openib --with-tm

and BLCR is configured only with a prefix to put it in /global/software/blcr; otherwise it's vanilla. Both are compiled with the default gcc.

One final note: occasionally the checkpoint does succeed and mpirun terminates, but it seems completely random.

What I'm wondering is whether anyone else has seen symptoms like this -- especially mpirun not quitting after a checkpoint with --term even though the worker processes do.

Also, is there some somewhat unusual filesystem semantic that ompi/ompi-checkpoint needs but that our shared filesystem may not support?

Thanks for any insights you may have.

-Dave
