On Oct 30, 2009, at 1:35 PM, Hui Jin wrote:
Hi All,
I ran into a problem when trying to checkpoint an MPI job.
I would really appreciate it if you could help me fix it.
The BLCR package was installed successfully on the cluster.
I configured Open MPI with the following flags:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local --with-blcr-libdir=/usr/local/lib/
The installation looks correct. The Open MPI version is 1.3.3.
I got the following output when issuing ompi_info:
root@hec:/export/home/hjin/test# ompi_info | grep ft
MCA rml: ftrm (MCA v2.0, API v2.0, Component v1.3.3)
root@hec:/export/home/hjin/test# ompi_info | grep crs
MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
It seems the crs MCA component is missing, but I have no idea how to
get it back.
This is an artifact of the way ompi_info searches for components. This
came up before on the users list:
http://www.open-mpi.org/community/lists/users/2009/09/10667.php
I filed a bug about this, if you want to track its progress:
https://svn.open-mpi.org/trac/ompi/ticket/2097
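In the meantime, a quick way to confirm that the BLCR checkpointer was
actually built is to look for the component file on disk instead of
relying on ompi_info. This is just a sketch, and it assumes Open MPI
was installed under the default /usr/local prefix (adjust the path to
your install):
shell$ ls /usr/local/lib/openmpi/ | grep crs
If the build succeeded, you should see a mca_crs_blcr component
(e.g., mca_crs_blcr.so) listed there.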
To run a checkpointable application, I run:
mpirun -np 2 --host hec -am ft-enable-cr test_mpi
However, when trying to checkpoint from another terminal on the same
host, I get the following:
root@hec:~# ompi-checkpoint -v 29234
[hec:29243] orte_checkpoint: Checkpointing...
[hec:29243] PID 29234
[hec:29243] Connected to Mpirun [[46621,0],0]
[hec:29243] orte_checkpoint: notify_hnp: Contact Head Node Process
PID 29234
[hec:29243] orte_checkpoint: notify_hnp: Requested a checkpoint of
jobid [INVALID]
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Requested - Global Snapshot Reference:
(null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Pending - Global Snapshot Reference:
(null)
[hec:29243] orte_checkpoint: hnp_receiver: Receive a command message.
[hec:29243] orte_checkpoint: hnp_receiver: Status Update.
[hec:29243] Running - Global Snapshot Reference:
(null)
There is an error message at the terminal of the running application:
--------------------------------------------------------------------------
Error: The process with PID 29236 is not checkpointable.
This could be due to one of the following:
- An application with this PID doesn't currently exist
- The application with this PID isn't checkpointable
- The application with this PID isn't an OPAL application.
We were looking for the named files:
/tmp/opal_cr_prog_write.29236
/tmp/opal_cr_prog_read.29236
--------------------------------------------------------------------------
[hec:29234] local) Error: Unable to initiate the handshake with peer
[[46621,1],1]. -1
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
snapc_full_global.c at line 567
[hec:29234] [[46621,0],0] ORTE_ERROR_LOG: Error in file
snapc_full_global.c at line 1054
This means that either the MPI application did not respond to the
checkpoint request in time, or that the application was not
checkpointable for some other reason.
Some options to try:
- Set the 'snapc_full_max_wait_time' MCA parameter to, say, 60; the
default is 20 seconds before giving up. You can also set it to 0,
which tells the runtime to wait indefinitely (see the full command
sketch after this list):
shell$ mpirun -mca snapc_full_max_wait_time 60
- Try cleaning out the /tmp directory on all of the nodes; this may
have something to do with disks being full (though usually we would
see other symptoms).
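Putting those together, based on your original command line (just a
sketch; the snapshot reference shown below is the default naming
pattern, and yours may differ):
shell$ mpirun -np 2 --host hec -am ft-enable-cr \
    -mca snapc_full_max_wait_time 60 test_mpi
Then, from another terminal on the same host:
shell$ ompi-checkpoint -v <PID of mpirun>
shell$ ompi-restart ompi_global_snapshot_<PID>.ckpt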
If that doesn't help, can you send me the config.log from your build
of Open MPI? If none of those options work, I would suspect that
something in the Open MPI configure step went wrong.
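It can also be worth confirming that BLCR itself works outside of
Open MPI. A sketch, assuming the BLCR utilities are in your PATH and
the kernel modules are installed:
shell$ lsmod | grep blcr          # are the kernel modules loaded?
shell$ cr_run sleep 120 &         # start a checkpointable process
shell$ cr_checkpoint <PID>        # writes context.<PID> by default
shell$ cr_restart context.<PID>
If cr_checkpoint fails here too, the problem is below Open MPI.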
-- Josh
Does anyone have a hint on how to fix this problem?
Thanks,
Hui Jin