Yann,

I don't know the specific cause of your error, but I do know that to checkpoint and restart Open MPI jobs with BLCR you should use ompi-checkpoint and ompi-restart. You can find more information at http://osl.iu.edu/research/ft/ompi-cr/
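A minimal sketch of that sequence for the Open MPI 1.6 series follows. It only prints the commands rather than running them. It assumes Open MPI was built with checkpoint/restart support (configure options such as --with-ft=cr and --with-blcr), that the job is launched with mpirun rather than srun_cr, and that the snapshot keeps its default name; the PID 12345 and the helper names snapshot_ref and cr_plan are placeholders invented for illustration.

```shell
#!/bin/sh
# Sketch only: prints the Open MPI C/R command sequence instead of running it.
# Assumption: Open MPI was configured with --with-ft=cr and --with-blcr=<path>.

# Default global snapshot reference written by ompi-checkpoint for a given
# mpirun PID (assumes no custom snapshot name was requested).
snapshot_ref() {
    echo "ompi_global_snapshot_$1.ckpt"
}

# Print the three steps for an application, a rank count, and the mpirun PID.
cr_plan() {
    app=$1
    np=$2
    pid=$3
    # 1. Launch with the fault-tolerance parameter set enabled.
    echo "mpirun -am ft-enable-cr -np $np $app"
    # 2. Checkpoint by naming the PID of mpirun, not of an application rank.
    echo "ompi-checkpoint $pid"
    # 3. Restart from the global snapshot once the original job is gone.
    echo "ompi-restart $(snapshot_ref "$pid")"
}

# 12345 is a placeholder for the PID of the running mpirun process.
cr_plan ./cavity3d 16 12345
```

The point of the sketch is that ompi-checkpoint is given the PID of mpirun and coordinates all ranks itself, whereas cr_checkpoint acts only on the process tree it is pointed at.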
-Paul

On Tue, Apr 16, 2013 at 1:51 AM, Yann Sagon <[email protected]> wrote:
> Hello,
>
> I'm trying to checkpoint an MPI application using BLCR and
> slurm.
>
> I'm using openmpi_gcc-1.6.3, slurm 2.5.4 and blcr 0.8.5. I have run the
> blcr test suite without any errors.
>
> How I'm proceeding:
>
> srun_cr -n 16 ./cavity3d
>
> squeue -u sagon
> JOBID PARTITION NAME     USER  ST TIME NODES NODELIST(REASON)
> 5653  debug     cavity3d sagon R  2:09 1     node01
>
> ps -U sagon | grep srun
> 195549 pts/4 00:00:00 srun_cr
> 195551 pts/4 00:00:00 srun
> 195556 pts/4 00:00:00 srun
>
> cr_checkpoint 195549
>
> scancel 5653
>
> cr_restart context.195549
>
> - cr_regenerate returned -5
> - cr_rstrt_child [84861]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84810]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84808]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84829]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84827]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84805]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> - cr_regenerate returned -5
> - cr_rstrt_child [84831]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> srun: error: node01: tasks 3,5,7,9,11,13,15: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84847]: Unable to load mmap()ed data! (err=-5)
> ...
> srun: error: node01: tasks 0,2,4,6,8,10,12,14: Exited with exit code 5
> - cr_regenerate returned -5
> - cr_rstrt_child [84833]: Unable to load mmap()ed data! (err=-5)
> Restart failed: Input/output error
> srun: error: node01: task 1: Exited with exit code 5
>
> Do you have any clue?
>
> Thanks a lot
>

--
Paul H. Hargrove                          [email protected]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
