Hi All,
I ran into a problem when trying to checkpoint an MPI job.
I would really appreciate any help in fixing it.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the following flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
The installation looks correct. The Open MPI version is 1.3.3.

I get the following output when running ompi_info:

root@hec:/export/home/hjin/test# ompi_info | grep ft
                MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root@hec:/export/home/hjin/test# ompi_info | grep crs
                MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the MCA crs component is missing (only "none" is listed), but I have no idea how to get it.
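
To make that check repeatable after a rebuild, a small shell helper could scan the ompi_info output for a BLCR-backed crs component. This is only a sketch: has_blcr_crs is a name I made up (it is not part of Open MPI), and it assumes a build with BLCR support would list a line of the form "MCA crs: blcr" where mine shows "none".

```shell
# Sketch: check ompi_info output for a BLCR-backed crs component.
# has_blcr_crs is a hypothetical helper, not an Open MPI command.
has_blcr_crs() {
    # Reads ompi_info output on stdin.
    # Succeeds only if some line reads "MCA crs: blcr ..." (assumed format).
    grep -Eq '^[[:space:]]*MCA crs:[[:space:]]*blcr([[:space:]]|$)'
}

# Usage, on a machine with Open MPI installed:
#   if ompi_info | has_blcr_crs; then
#       echo "blcr crs component found"
#   else
#       echo "blcr crs component missing"
#   fi
```

On my current build this would report the component as missing, since only "none" is listed.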

To run a checkpointable application, I run:
mpirun -np 2 --host hec -am ft-enable-cr test_mpi

However, when I try to checkpoint it from another terminal on the same host, I get the following:
root@hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243]      PID 29234
[hec:29243]      Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243]                 Requested - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243]                   Pending - Global Snapshot Reference: (null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243]                   Running - Global Snapshot Reference: (null)

An error message appears in the terminal of the running application:
--------------------------------------------------------------------------
Error: The process with PID 29236 is not checkpointable.
      This could be due to one of the following:
       - An application with this PID doesn't currently exist
       - The application with this PID isn't checkpointable
       - The application with this PID isn't an OPAL application.
      We were looking for the named files:
        /tmp/opal_cr_prog_write.29236
        /tmp/opal_cr_prog_read.29236
--------------------------------------------------------------------------
[hec:29234] local) Error: Unable to initiate the handshake with peer [[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

Does anyone have a hint on how to fix this problem?

Thanks,
Hui Jin