On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:

Hi All,
I ran into a problem when trying to checkpoint an MPI job.
I would really appreciate any help fixing it.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the following flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
The installation looks correct. The Open MPI version is 1.3.3.

I got the following output when issuing ompi_info:

root@hec:/export/home/hjin/test# ompi_info | grep ft
               MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root@hec:/export/home/hjin/test# ompi_info | grep crs
               MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the MCA crs component is missing, but I have no idea how to get it.

This is an artifact of the way ompi_info searches for components. This came up before on the users list:
  http://www.open-mpi.org/community/lists/users/2009/09/10667.php

I filed a bug about this, if you want to track its progress:
  https://svn.open-mpi.org/trac/ompi/ticket/2097
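In the meantime, one way to double-check that the crs/snapc components were actually built is to look for the component libraries on disk (a sketch: OMPI_PREFIX here is an assumption for whatever --prefix you gave configure, and the mca_<framework>_<component> layout under lib/openmpi is the standard one):

```shell
# Components are installed as mca_<framework>_<component> shared objects
# under the prefix; adjust OMPI_PREFIX to match your build.
OMPI_PREFIX=/usr/local
ls "$OMPI_PREFIX"/lib/openmpi/mca_crs_blcr* \
   "$OMPI_PREFIX"/lib/openmpi/mca_snapc_full* 2>/dev/null \
  || echo "crs/snapc components not found -- check config.log"
```

If the files are there, the build is fine and the "none" from ompi_info is just the display artifact described above.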


To run a checkpointable application, I run:
mpirun -np 2 --host hec -am ft-enable-cr test_mpi

However, when I try to checkpoint from another terminal on the same host, I get the following:
root@hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243]      PID 29234
[hec:29243]      Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Requested - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Pending - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Running - Global Snapshot Reference: (null)

There is an error message at the terminal of the running application:
--------------------------------------------------------------------------
Error: The process with PID 29236 is not checkpointable.
     This could be due to one of the following:
      - An application with this PID doesn't currently exist
      - The application with this PID isn't checkpointable
      - The application with this PID isn't an OPAL application.
     We were looking for the named files:
       /tmp/opal_cr_prog_write.29236
       /tmp/opal_cr_prog_read.29236
--------------------------------------------------------------------------
[hec:29234] local) Error: Unable to initiate the handshake with peer [[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

This means that either the MPI application did not respond to the checkpoint request in time, or that the application was not checkpointable for some other reason.

Some options to try:
- Set the 'snapc_full_max_wait_time' MCA parameter to, say, 60; the default is 20 seconds before giving up. You can also set it to 0, which tells the runtime to wait indefinitely.
   shell$ mpirun -mca snapc_full_max_wait_time 60
- Try cleaning out the /tmp directory on all of the nodes, maybe this has something to do with disks being full (though usually we would see other symptoms).
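As a quick sanity check while the job is running, you can also look for the named pipes the error message says it was looking for (a small sketch; check_cr_pipes is just an illustrative helper name, and the /tmp paths are the ones from the error output above):

```shell
# Report whether the OPAL C/R named pipes exist for a given application PID.
# These are the files the "not checkpointable" error says it was looking for.
check_cr_pipes() {
    pid=$1
    for f in /tmp/opal_cr_prog_write.$pid /tmp/opal_cr_prog_read.$pid; do
        if [ -e "$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
        fi
    done
}

# Example: check the PID from the error message while the app is alive
check_cr_pipes 29236
```

If both pipes are missing while the application is running, the FT thread never set them up, which again points back at the build configuration.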

If that doesn't help, can you send me the config.log from your build of Open MPI? If none of the above works, I would suspect that something went wrong in the configure step of Open MPI.
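For completeness, once a checkpoint does succeed, the full round trip looks roughly like this (a sketch: <pid_of_mpirun> is a placeholder, and the snapshot name is whatever "Global Snapshot Reference" ompi-checkpoint -v prints; ompi_global_snapshot_<pid>.ckpt is the usual default, but confirm it against your own output):

```shell
# Terminal 1: launch with C/R enabled and a generous handshake timeout
# (0 would mean wait indefinitely)
mpirun -np 2 --host hec -am ft-enable-cr \
       -mca snapc_full_max_wait_time 60 test_mpi

# Terminal 2: checkpoint mpirun's PID; note the printed
# "Global Snapshot Reference" in the -v output
ompi-checkpoint -v <pid_of_mpirun>

# Later, restart from that snapshot reference
ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt
```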

-- Josh

Does anyone have a hint for fixing this problem?

Thanks,
Hui Jin

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
