After some additional testing I believe that I have been able to
reproduce the problem. I suspect that there is a bug in the
coordination protocol that is causing an occasional hang in the
system. Since it only happens occasionally (though slightly more often
on a fully loaded machine), that is probably why I missed it in my
testing.
I'll work on a patch, and let you know when it is ready. Unfortunately
it probably won't be for a couple weeks. :(
You can increase the verbosity level for all of the fault tolerance
frameworks and components through MCA parameters. They are referenced
in the FT C/R User Doc on the Open MPI wiki, and you can query them
with 'ompi_info'. Look for the following frameworks/components (an
example of setting them follows the list):
- crs/blcr
- snapc/full
- crcp/bkmrk
- opal_cr_verbose
- orte_cr_verbose
- ompi_cr_verbose
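For example (just a sketch; the exact MCA parameter names can vary a
bit between versions, so please verify them with 'ompi_info' first),
something along these lines should dump the framework/component
activity to mpirun's stderr:
-------------------------------------------------------
$ ompi_info --param crs blcr | grep verbose
$ mpirun -np 6 -am ft-enable-cr \
    -mca crs_blcr_verbose 10 \
    -mca snapc_full_verbose 10 \
    -mca crcp_bkmrk_verbose 10 \
    -mca opal_cr_verbose 10 \
    ./mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
-------------------------------------------------------
That extra output should make it easier to see where the checkpoint
request gets stuck.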
Thanks for the bug report. I filed a ticket in our bug tracker, and
CC'ed you on it. The ticket is:
http://svn.open-mpi.org/trac/ompi/ticket/1619
Cheers,
Josh
On Oct 31, 2008, at 10:51 AM, Matthias Hovestadt wrote:
Hi Tim!
First of all: thanks a lot for answering! :-)
> Could you try running your two MPI jobs with fewer procs each,
> say 2 or 3 each instead of 4, so that there are a few extra cores
> available.
This problem occurs with any number of procs.
> Also, what happens to the checkpointing of one MPI job if you kill
> the other MPI job after the first "hangs"?
Nothing, it keeps hanging.
> (It may not be a true hang, but very very slow progress that you
> are observing.)
I already waited for more than 12 hours, but the ompi-checkpoint
did not return. So if it's slow, it must be very slow.
I continued testing and just observed a case where the problem
occurred with only one job running on the compute node:
-------------------------------------------------------
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
-------------------------------------------------------
The resource management system tried to checkpoint this job using the
command "ompi-checkpoint -v --term 7706". This is the output of that
command:
-------------------------------------------------------
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
-------------------------------------------------------
If I look at the activity on the node, I see that the processes
are still computing:
-------------------------------------------------------
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 7710 ccs    25   0  327m 6936 4052 R  102  0.7 4:14.17  mpi-x-povray
 7712 ccs    25   0  327m 6884 4000 R  102  0.7 3:34.06  mpi-x-povray
 7708 ccs    25   0  327m 6896 4012 R   66  0.7 2:42.10  mpi-x-povray
 7707 ccs    25   0  331m  10m 3736 R   54  1.0 3:08.62  mpi-x-povray
 7709 ccs    25   0  327m 6940 4056 R   48  0.7 1:48.24  mpi-x-povray
 7711 ccs    25   0  327m 6724 4032 R   36  0.7 1:29.34  mpi-x-povray
-------------------------------------------------------
Now I killed the hanging ompi-checkpoint operation and tried
to execute a checkpoint manually:
-------------------------------------------------------
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
-------------------------------------------------------
Is there perhaps a way of increasing the level of debug output?
Please let me know if I can support you in any way...
Best,
Matthias