Hi Tim!

First of all: thanks a lot for answering! :-)


> Could you try running your two MPI jobs with fewer procs each,
> say 2 or 3 each instead of 4, so that there are a few extra cores available.

This problem occurs with any number of procs.

> Also, what happens to the checkpointing of one MPI job if you kill the
> other MPI job after the first "hangs"?

Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you
> are observing.)

I already waited for more than 12 hours, but the ompi-checkpoint
did not return. So if it's slow, it must be very slow.


I continued testing and just observed a case where the problem
occurred with only one job running on the compute node:

-------------------------------------------------------
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
-------------------------------------------------------

The resource management system tried to checkpoint this job using the
command "ompi-checkpoint -v --term 7706". This is the output of that
command:

-------------------------------------------------------
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178]     PID 7706
[grid-demo-1.cit.tu-berlin.de:08178]     Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178]     Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
-------------------------------------------------------

If I look at the activity on the node, I see that the processes
are still computing:

-------------------------------------------------------
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7710 ccs       25   0  327m 6936 4052 R  102  0.7   4:14.17 mpi-x-povray
 7712 ccs       25   0  327m 6884 4000 R  102  0.7   3:34.06 mpi-x-povray
 7708 ccs       25   0  327m 6896 4012 R   66  0.7   2:42.10 mpi-x-povray
 7707 ccs       25   0  331m  10m 3736 R   54  1.0   3:08.62 mpi-x-povray
 7709 ccs       25   0  327m 6940 4056 R   48  0.7   1:48.24 mpi-x-povray
 7711 ccs       25   0  327m 6724 4032 R   36  0.7   1:29.34 mpi-x-povray
-------------------------------------------------------

I then killed the hanging ompi-checkpoint operation and tried
to run a checkpoint manually:

-------------------------------------------------------
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224]     PID 7706
[grid-demo-1.cit.tu-berlin.de:08224]     Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224]     Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
-------------------------------------------------------
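
For repeated tests, a hung checkpoint like the one above could be detected automatically instead of by waiting. The following is only a rough sketch, assuming bash and the coreutils "timeout" utility are available on the node; the function name checkpoint_with_timeout and the 600 second limit in the usage line are made up for illustration and are not part of Open MPI:

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI): run a command under a time
# limit and report when it did not return in time.
# Assumes the coreutils "timeout" utility is installed.
checkpoint_with_timeout() {
    local limit="$1"   # time limit in seconds
    shift              # remaining arguments are the command to run

    timeout "$limit" "$@"
    local status=$?

    # coreutils timeout exits with status 124 when the limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "command did not return within ${limit}s: $*" >&2
    fi
    return "$status"
}
```

It would be called as "checkpoint_with_timeout 600 ompi-checkpoint -v --term 7706". A return status of 124 then means the checkpoint request was still pending when the limit expired; it does not tell a true hang apart from very slow progress.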

Is there perhaps a way of increasing the level of debug output?
Please let me know if I can support you in any way...


Best,
Matthias
