Danny and Husen,

I saw your recent posts on the slurm-dev list about using BLCR with MPI.
It sounded like you both ran into some problems.  I'd like to help you
work through them and get you up and running.

It sounds like the MPI checkpoint/restart support isn't working correctly
for either of you in batch.

Can you check whether it works manually, outside of SLURM?
For Open MPI, you should be able to run ompi-checkpoint <pid-of-mpirun>
to take a checkpoint, and ompi-restart <snapshot-id> to restart with BLCR.
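
If it helps to script the manual test, here is a minimal Python sketch of
the two steps above.  It only builds and runs the two commands named in
this mail.  The helper names are mine, the snapshot id is whatever
ompi-checkpoint reports on your system, and it assumes your Open MPI was
built with checkpoint/restart support and that mpirun was started with it
enabled (as far as I recall, "mpirun -am ft-enable-cr ..." on the 1.x
series, but please check your version's documentation).

```python
import shutil
import subprocess


def checkpoint_cmd(mpirun_pid):
    """Command line to checkpoint a running job, given the PID of mpirun."""
    return ["ompi-checkpoint", str(mpirun_pid)]


def restart_cmd(snapshot_id):
    """Command line to restart from a snapshot reported by ompi-checkpoint."""
    return ["ompi-restart", snapshot_id]


def run(cmd):
    """Run one of the commands above and return its stdout.

    Raises FileNotFoundError if the tool is not on PATH, which is a quick
    hint that the Open MPI checkpoint tools are not installed.
    """
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " not found on PATH")
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


# Example (hypothetical PID; replace <snapshot-id> with the reported id):
#   print(run(checkpoint_cmd(12345)))
#   print(run(restart_cmd("<snapshot-id>")))
```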

For MVAPICH2, try the manual checkpoint/restart instructions in the user
guide first, again outside of SLURM if possible.  I want to make sure your
MPI is working with BLCR first.
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html

SLURM provides support for checkpointing MVAPICH jobs through a
wrapper. http://slurm.schedmd.com/checkpoint_blcr.html
I'm guessing that the wrapper may not be configured correctly.

We'll need to poke around.  In your thread, there were a few different
problems.
(1) In one case, the checkpoint files weren't being created.

(2) This message is odd.  It looks like something had the root directory
open as a normal file.  Any idea what/why that is?  What was process 28534
at the time of the checkpoint?

- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
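
To dig into (2), here is a rough Python sketch (Linux only, helper names
are mine).  The err=-21 in the log looks like a negated errno, and 21 is
EISDIR on Linux, which matches the "Is a directory" text.  The second
helper reads /proc/<pid>/fd so you can see what each descriptor of a
process points at before you checkpoint it.  I'm assuming you can catch
the process while it is still running and that you have permission to
read its /proc entry.

```python
import errno
import os


def describe_err(code):
    """Map a (possibly negated) errno value to its name and message."""
    n = abs(code)
    return errno.errorcode.get(n, "UNKNOWN"), os.strerror(n)


def open_fds(pid):
    """Return {fd: target} for a process by reading /proc/<pid>/fd."""
    fd_dir = "/proc/%d/fd" % pid
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and reading it.
            continue
    return targets


if __name__ == "__main__":
    print(describe_err(-21))
    # Inspect this script's own descriptors; substitute the PID of the
    # application process you are about to checkpoint.
    for fd, target in sorted(open_fds(os.getpid()).items()):
        print(fd, target)
```

If fd 3 of the process shows up as "/", that would tell us which process
is holding the root directory open.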

Eric

