Yann,

I don't know the specific cause of your error, but I do know that to
checkpoint and restart Open MPI jobs with BLCR you should use ompi-checkpoint
and ompi-restart rather than calling cr_checkpoint and cr_restart directly.
You can find more information at
http://osl.iu.edu/research/ft/ompi-cr/
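
As a rough sketch of that workflow (not a tested recipe for your cluster): it
assumes an Open MPI 1.6.x build configured with checkpoint/restart support
(--with-ft=cr plus BLCR) and that mpirun is started inside an existing Slurm
allocation. The helper names and the CR_DEMO_RUN switch are hypothetical and
only there for illustration. By default the functions just print the command
they would run.

```shell
#!/bin/sh
# Sketch of the Open MPI checkpoint/restart sequence.
# Assumption: Open MPI was built with --with-ft=cr and BLCR support.

APP=./cavity3d   # application name taken from the original post
NP=16            # process count taken from the original post

# Print the command; execute it only when CR_DEMO_RUN=1 (hypothetical switch).
step() {
    echo "$*"
    if [ "${CR_DEMO_RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Launch the job with the checkpoint/restart parameter set enabled.
launch() {
    step mpirun -np "$NP" -am ft-enable-cr "$APP"
}

# Checkpoint the whole job; $1 is the PID of mpirun (not of a single rank).
checkpoint() {
    step ompi-checkpoint "$1"
}

# Restart from a global snapshot reference, e.g. the
# ompi_global_snapshot_<PID>.ckpt name reported by ompi-checkpoint.
restart() {
    step ompi-restart "$1"
}
```

For example, `launch` in one shell, then `checkpoint <PID of mpirun>` from
another, and later `restart <snapshot reference>`. The point is that the
Open MPI tools coordinate all ranks, which a cr_checkpoint of the launcher
process alone does not do.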

-Paul


On Tue, Apr 16, 2013 at 1:51 AM, Yann Sagon <[email protected]> wrote:

>  Hello,
>
> I'm trying to checkpoint an MPI application using BLCR and
> Slurm.
>
> I'm using openmpi_gcc-1.6.3, slurm 2.5.4 and blcr 0.8.5. I have run the
> BLCR test suite without any errors.
>
> How I'm proceeding:
>
> srun_cr -n 16 ./cavity3d
>
> squeue -u sagon
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
>    5653     debug cavity3d    sagon   R       2:09      1 node01
>
> ps -U sagon | grep srun
> 195549 pts/4    00:00:00 srun_cr
> 195551 pts/4    00:00:00 srun
> 195556 pts/4    00:00:00 srun
>
> cr_checkpoint 195549
>
> scancel 5653
>
> cr_restart context.195549
>
> - cr_regenerate returned -5
> - cr_rstrt_child [84861]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84810]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84808]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84829]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84827]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84805]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84831]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> srun: error: node01: tasks 3,5,7,9,11,13,15: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84847]:  Unable to load mmap()ed data!  (err=-5)
> ...
> srun: error: node01: tasks 0,2,4,6,8,10,12,14: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84833]:  Unable to load mmap()ed data!  (err=-5)
> Restart failed: Input/output error
> srun: error: node01: task 1: Exited with exit code 5
>
> Do you have any clue?
>
> Thanks a lot
>



-- 
Paul H. Hargrove                          [email protected]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
