After some additional testing I believe that I have been able to
reproduce the problem. I suspect that there is a bug in the
coordination protocol that is causing an occasional hang in the
system. Since it only happens occasionally (though slightly more often
on a fully loaded machine), that is probably why I missed it in my
testing.
I'll work on a patch, and let you know when it is ready. Unfortunately
it probably won't be for a couple weeks. :(
You can increase the verbosity level for all of the fault tolerance
frameworks and components through MCA parameters. They are referenced
in the FT C/R User Doc on the Open MPI wiki, and you can query them
with 'ompi_info'. Look for the following frameworks/components (an
example of setting them follows the list):
- crs/blcr
- snapc/full
- crcp/bkmrk
- opal_cr_verbose
- orte_cr_verbose
- ompi_cr_verbose
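For example (just a sketch; the exact MCA parameter names can vary a
bit between versions, so please verify them with 'ompi_info' first),
something along these lines should dump the framework/component
activity to mpirun's stderr:
-------------------------------------------------------
$ ompi_info --param crs blcr | grep verbose
$ mpirun -np 6 -am ft-enable-cr \
    -mca crs_blcr_verbose 10 \
    -mca snapc_full_verbose 10 \
    -mca crcp_bkmrk_verbose 10 \
    -mca opal_cr_verbose 10 \
    ./mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
-------------------------------------------------------
That extra output should make it easier to see where the checkpoint
request gets stuck.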
Thanks for the bug report. I filed a ticket in our bug tracker, and
CC'ed you on it. The ticket is:
http://svn.open-mpi.org/trac/ompi/ticket/1619
Cheers,
Josh
On Oct 31, 2008, at 10:51 AM, Matthias Hovestadt wrote:
Hi Tim!
First of all: thanks a lot for answering! :-)
> Could you try running your two MPI jobs with fewer procs each,
> say 2 or 3 each instead of 4, so that there are a few extra cores
> available.
This problem occurs with any number of procs.
> Also, what happens to the checkpointing of one MPI job if you kill
> the other MPI job after the first "hangs"?
Nothing, it keeps hanging.
> (It may not be a true hang, but very very slow progress that you
> are observing.)
I already waited for more than 12 hours, but the ompi-checkpoint
did not return. So if it's slow, it must be very slow.
I continued testing and just observed a case where the problem
occurred with only one job running on the compute node:
-------------------------------------------------------
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
-------------------------------------------------------
The resource management system tried to checkpoint this job using the
command "ompi-checkpoint -v --term 7706". This is the output of that
command:
-------------------------------------------------------
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
-------------------------------------------------------
If I look at the activity on the node, I see that the processes
are still computing:
-------------------------------------------------------
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 7710 ccs    25   0  327m 6936 4052 R  102  0.7 4:14.17  mpi-x-povray
 7712 ccs    25   0  327m 6884 4000 R  102  0.7 3:34.06  mpi-x-povray
 7708 ccs    25   0  327m 6896 4012 R   66  0.7 2:42.10  mpi-x-povray
 7707 ccs    25   0  331m  10m 3736 R   54  1.0 3:08.62  mpi-x-povray
 7709 ccs    25   0  327m 6940 4056 R   48  0.7 1:48.24  mpi-x-povray
 7711 ccs    25   0  327m 6724 4032 R   36  0.7 1:29.34  mpi-x-povray
-------------------------------------------------------
Now I killed the hanging ompi-checkpoint operation and tried
to execute a checkpoint manually:
-------------------------------------------------------
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
-------------------------------------------------------
Is there perhaps a way of increasing the level of debug output?
Please let me know if I can support you in any way...
Best,
Matthias