Hello Matthias,

Hopefully Josh will chime in shortly, but I have one suggestion to help diagnose this. Could you try running your two MPI jobs with fewer procs each, say 2 or 3 instead of 4, so that a few extra cores remain available? I know that isn't a solution, but it may help us diagnose what is going on. (It may not be a true hang, but very, very slow progress that you are observing.)
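For example, something along these lines (this just adapts the mpi-x-povray command from your mail below; the output file names and the $JOB_A/$JOB_B variables are only placeholders):

  # start two reduced-size jobs in the background, keeping their mpirun PIDs
  mpirun -np 2 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 +O planet-a.tga &
  JOB_A=$!
  mpirun -np 2 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 +O planet-b.tga &
  JOB_B=$!

  # checkpoint (and terminate) the first job while the second keeps running
  ompi-checkpoint -v --term $JOB_A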
Also, what happens to the checkpointing of one MPI job if you kill the other MPI job after the first one "hangs"?

On Fri, Oct 31, 2008 at 8:18 AM, Matthias Hovestadt <m...@cs.tu-berlin.de> wrote:
> Hi!
>
> I'm using the development version of OMPI from SVN (rev. 19857)
> for executing MPI jobs on my cluster system. In particular, I'm using
> the checkpoint and restart feature, based on the most current version
> of BLCR.
>
> The checkpointing works fine as long as I only execute a single job
> on a node. If more than one MPI application is executing on a system,
> ompi-checkpoint sometimes does not return, hanging forever.
>
>
> Example: checkpointing with a single running application
>
> I'm using the MPI-enabled flavor of Povray as demo application, so I'm
> starting it on a node using the following command:
>
>   mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
>          +O planet.tga
>
> This gives me 4 MPI processes, all running on the local node.
> Checkpointing it with
>
>   ompi-checkpoint -v --term 7022
>
> (where 7022 is the PID of the mpirun process) gives me a checkpoint
> dataset ompi_global_snapshot_7022.ckpt that can be used for restarting
> the job.
>
> The ompi-checkpoint command gives the following output:
>
> -------------------------------------------------------
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:07480]   PID 7022
> [grid-demo-1.cit.tu-berlin.de:07480]   Connected to Mpirun [[2899,0],0]
> [grid-demo-1.cit.tu-berlin.de:07480]   Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7022
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480]     Requested - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480]     Pending (Termination) - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480]     Running - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480]     File Transfer - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480]     Finished - Global Snapshot Reference: ompi_global_snapshot_7022.ckpt
> Snapshot Ref.:   0 ompi_global_snapshot_7022.ckpt
> -------------------------------------------------------
>
>
> Example: checkpointing with two running applications
>
> As in the first example, I'm again using the MPI-enabled flavor of
> Povray as demo application. But now I'm starting not just a single
> Povray computation, but a second one in parallel as well. This gives me
> 8 MPI processes (4 processes for each MPI job), so that the 8 cores of
> my system are fully utilized.
>
> Without checkpointing, these two jobs execute without any problem, each
> resulting in a Povray image. However, if I use the ompi-checkpoint
> command to checkpoint one of these two jobs, the ompi-checkpoint call
> may never return.
>
> Again I'm executing
>
>   ompi-checkpoint -v --term 13572
>
> (where 13572 is the PID of the mpirun process). This command gives the
> following output and then does not return to the user:
>
> -------------------------------------------------------
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:14252]   PID 13572
> [grid-demo-1.cit.tu-berlin.de:14252]   Connected to Mpirun [[9529,0],0]
> [grid-demo-1.cit.tu-berlin.de:14252]   Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13572
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:14252]     Requested - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:14252]     Pending (Termination) - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:14252]     Running - Global Snapshot Reference: (null)
> -------------------------------------------------------
>
> I want to underline that ompi-checkpoint does not hang every time I
> execute it while more than one job is running, but in approx. 50% of
> all cases. I don't see any difference between successful and failing
> calls...
>
>
> Is there perhaps a way of increasing the debug output?
>
>
> Best,
> Matthias
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/
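(For reference, restarting from a snapshot such as the ompi_global_snapshot_7022.ckpt mentioned above is done with ompi-restart; roughly, and assuming the snapshot directory is reachable from the node where the restart is launched:

  ompi-restart ompi_global_snapshot_7022.ckpt

ompi-restart reads the global snapshot metadata and launches a new mpirun from the saved state.)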