+1

You probably also have to start the job with OMPI's mpirun command. I don't 
believe it will work when the processes are launched directly via srun, since 
I believe checkpoint/restart requires the OMPI daemons to be present.
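
For reference, the mpirun-based sequence would look roughly like the sketch 
below. It only prints the commands (a dry run) so nothing is launched by 
accident. It assumes an Open MPI 1.6 build with checkpoint/restart support 
enabled and BLCR available on the nodes, which I have not verified for your 
install. The PID 12345 and the helper names snapshot_ref and cr_plan are 
placeholders of mine, not anything from your setup.

```shell
#!/bin/sh
# Dry-run sketch: prints the checkpoint/restart command sequence, runs nothing.
# Assumption: Open MPI was built with C/R (fault tolerance) support and BLCR.

NP=16                              # rank count from the original srun_cr line
APP=./cavity3d                     # application from the original post
MPIRUN_PID=${MPIRUN_PID:-12345}    # hypothetical PID of the running mpirun

# Global snapshot reference that ompi-checkpoint reports for a given mpirun PID.
snapshot_ref() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Print the three steps in order: launch, checkpoint, restart.
cr_plan() {
    # 1. Launch through mpirun with the checkpoint/restart AMCA parameter set.
    echo "mpirun -am ft-enable-cr -np $NP $APP"
    # 2. Checkpoint by passing the PID of mpirun (not of srun or a rank).
    echo "ompi-checkpoint $MPIRUN_PID"
    # 3. Restart from the global snapshot reference instead of cr_restart.
    echo "ompi-restart $(snapshot_ref "$MPIRUN_PID")"
}

cr_plan
```

The point is that ompi-checkpoint is given the PID of mpirun itself, and the 
restart goes through ompi-restart on the global snapshot rather than through 
cr_restart on an individual BLCR context file.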


On Apr 16, 2013, at 10:43 AM, Paul Hargrove <[email protected]> wrote:

> Yann,
> 
> I don't know what might be the specific cause of your error, but I do know 
> that to checkpoint and restart Open MPI jobs with BLCR one should be using 
> ompi-checkpoint and ompi-restart.  You can find some more information at 
> http://osl.iu.edu/research/ft/ompi-cr/
> 
> -Paul
> 
> 
> On Tue, Apr 16, 2013 at 1:51 AM, Yann Sagon <[email protected]> wrote:
> Hello,
> 
> I'm trying to checkpoint an MPI application using BLCR and 
> slurm.
> 
> I'm using openmpi_gcc-1.6.3, slurm 2.5.4 and blcr 0.8.5. I have run the blcr 
> test suite without any errors.
> 
> How I'm proceeding:
> 
> srun_cr -n 16 ./cavity3d
> 
> squeue -u sagon
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    5653     debug cavity3d    sagon   R       2:09      1 node01
> 
> ps -U sagon | grep srun
> 195549 pts/4    00:00:00 srun_cr
> 195551 pts/4    00:00:00 srun
> 195556 pts/4    00:00:00 srun
> 
> cr_checkpoint 195549
> 
> scancel 5653
> 
> cr_restart context.195549
> 
> - cr_regenerate returned -5
> - cr_rstrt_child [84861]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84810]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84808]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84829]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84827]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84805]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84831]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> srun: error: node01: tasks 3,5,7,9,11,13,15: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84847]:  Unable to load mmap()ed data!  (err=-5)
> ...
> srun: error: node01: tasks 0,2,4,6,8,10,12,14: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84833]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> srun: error: node01: task 1: Exited with exit code 5
> 
> Do you have any clue?
> 
> Thanks a lot
> 
> 
> 
> 
> -- 
> Paul H. Hargrove                          [email protected]
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>  
