On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:
Hi All,
I ran into a problem when trying to checkpoint an MPI job.
I would really appreciate it if you could help me fix it.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the following flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
The installation looks correct. The Open MPI version is 1.3.3.
I got the following output when issuing ompi_info:
root@hec:/export/home/hjin/test# ompi_info | grep ft
MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root@hec:/export/home/hjin/test# ompi_info | grep crs
MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the crs MCA component is missing, but I have no idea how to
get it back.
This is an artifact of the way ompi_info searches for components. This
came up before on the users list:
http://www.open-mpi.org/community/lists/users/2009/09/10667.php
I filed a bug about this, if you want to track its progress:
https://svn.open-mpi.org/trac/ompi/ticket/2097
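In the meantime, a quick way to confirm that the BLCR checkpointer was
actually built is to look for the component file on disk instead of
relying on ompi_info. This is just a sketch, and it assumes Open MPI
was installed under the default /usr/local prefix (adjust the path to
your install):
shell$ ls /usr/local/lib/openmpi/ | grep crs
If the build succeeded, you should see a mca_crs_blcr component
(e.g., mca_crs_blcr.so) listed there.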
To run a checkpointable application, I run:
mpirun -np 2 --host hec -am ft-enable-cr test_mpi
However, when trying to checkpoint from another terminal on the same
host, I get the following:
root@hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243] PID 29234
[hec:29243] Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process
PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of
jobid [INVALID]
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Requested - Global Snapshot Reference:
(null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Pending - Global Snapshot Reference:
(null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Running - Global Snapshot Reference:
(null)
There is an error message at the terminal of the running application:
--------------------------------------------------------------------------
Error: The process with PID 29236 is not checkpointable.
This could be due to one of the following:
- An application with this PID doesn't currently exist
- The application with this PID isn't checkpointable
- The application with this PID isn't an OPAL application.
We were looking for the named files:
/tmp/opal_cr_prog_write.29236
/tmp/opal_cr_prog_read.29236
--------------------------------------------------------------------------
[hec:29234] local) Error: Unable to initiate the handshake with peer
[[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
snapc_full_global.c at line 1054
This means that either the MPI application did not respond to the
checkpoint request in time, or that the application was not
checkpointable for some other reason.
Some options to try:
- Set the 'snapc_full_max_wait_time' MCA parameter to, say, 60; the
default is 20 seconds before giving up. You can also set it to 0,
which tells the runtime to wait indefinitely (see the full command
sketch after this list):
shell$ mpirun -mca snapc_full_max_wait_time 60
- Try cleaning out the /tmp directory on all of the nodes; this may
have something to do with disks being full (though usually we would
see other symptoms).
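Putting those together, based on your original command line (just a
sketch; the snapshot reference shown below is the default naming
pattern, and yours may differ):
shell$ mpirun -np 2 --host hec -am ft-enable-cr \
    -mca snapc_full_max_wait_time 60 test_mpi
Then, from another terminal on the same host:
shell$ ompi-checkpoint -v <PID of mpirun>
shell$ ompi-restart ompi_global_snapshot_<PID>.ckpt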
If that doesn't help, can you send me the config.log from your build
of Open MPI? If none of those options work, I would suspect that
something in the Open MPI configure step went wrong.
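It can also be worth confirming that BLCR itself works outside of
Open MPI. A sketch, assuming the BLCR utilities are in your PATH and
the kernel modules are installed:
shell$ lsmod | grep blcr          # are the kernel modules loaded?
shell$ cr_run sleep 120 &         # start a checkpointable process
shell$ cr_checkpoint <PID>        # writes context.<PID> by default
shell$ cr_restart context.<PID>
If cr_checkpoint fails here too, the problem is below Open MPI.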
-- Josh
Does anyone have a hint on how to fix this problem?
Thanks,
Hui Jin