Hello!
I have run into some problems.
This is my environment:
   BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04
   I have 2 nodes: cuda05 (master, hosts the NFS file system) and cuda07
   (slave, mounts the master's file system)

   I have also set the following in ~/.openmpi/mca-params.conf:
     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

  My configure command:
./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread \
    --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib \
    --enable-mpirun-prefix-by-default \
    --enable-static --enable-shared --enable-opal-multi-threads

Problem 1: ompi-restart on multiple nodes
  Command 01: mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST
  Command 02: ompi-restart ompi_global_snapshot_2892.ckpt
      -> I can checkpoint 2 processes running on multiple nodes, but when I
         restart, they are restarted only on the master node.

  Command 03: ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt
      -> This fails with the error message below. I have made sure that BLCR
         itself is OK.
################################################################################################
root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
--------------------------------------------------------------------------
Error: BLCR was not able to restart the process because exec failed.
       Check the installation of BLCR on all of the machines in your
       system. The following information may be of help:
  Return Code : -1
  BLCR Restart Command : cr_restart
  Restart Command Line : cr_restart /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_1.ckpt). Returned -1.
       Check the installation of the blcr checkpoint/restart service
       on all of the machines in your system.
####################################################################################################
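
Since the message asks to check BLCR on all of the machines, here is a minimal sketch of the kind of check I mean. It is only an illustration under assumptions: passwordless ssh to every host in the hostfile, and a hostfile in which each non-comment line starts with the host name. The function names hosts_in_file and check_blcr are my own and are not part of Open MPI or BLCR.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Open MPI or BLCR.

# Print the host names found in a hostfile: the first field of every
# line that is neither blank nor a comment.
hosts_in_file() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# For every host in the hostfile, report whether cr_restart can be found
# on the PATH of a non-interactive ssh session (assumes passwordless ssh).
check_blcr() {
    for h in $(hosts_in_file "$1"); do
        if ssh -n "$h" 'command -v cr_restart' >/dev/null 2>&1; then
            echo "$h: cr_restart found"
        else
            echo "$h: cr_restart NOT found"
        fi
    done
}
```

It would be run as `check_blcr Hosts` from the directory that holds the hostfile. The check goes through ssh on purpose, because a non-interactive ssh session can have a different PATH from a login shell.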
Problem 2: I cannot find ompi-migrate. How do I use ompi-migrate?
  Please help me, thanks.
