Danny and Husen, I saw your posts recently on using BLCR w/ MPI on the slurm-dev list. It sounded like you both ran into some problems. I want to try to help you through these and see if we can get you up and running.
It sounds like the MPI checkpoint/restart support isn't working correctly for either of you in batch. Can you check whether it works manually, outside of SLURM if possible? I want to make sure your MPI is working with BLCR first.

For OpenMPI, you should be able to run ompi-checkpoint <pid-of-mpirun> to take a checkpoint, and ompi-restart <snapshot-id> to restart with BLCR.

For MVAPICH2, try the manual checkpoint/restart instructions in the MVAPICH user guide:
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html

SLURM provides support for checkpointing MVAPICH jobs through a wrapper:
http://slurm.schedmd.com/checkpoint_blcr.html

I'm guessing that the wrapper may not be configured correctly. We'll need to poke around.

In your thread, there were a few different problems.

(1) In one case, the checkpoint files weren't being created.

(2) This message is odd. It looks like something had the root directory open as a normal file. Any idea what/why that is? What was process 28534 at the time of the checkpoint?

- Failed to open file '/'
- cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]: Unable to restore files! (err=-21)
Restart failed: Is a directory

Eric
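
P.S. On (2): err=-21 is EISDIR on Linux, which matches the "Is a directory" text. If you can reproduce the run, one way to see what a process has open just before the checkpoint is to read /proc/<pid>/fd. Below is a rough Python sketch of that idea. It is not part of BLCR or SLURM; describe_fds is a name I made up for illustration, and it assumes a Linux /proc filesystem and permission to inspect the target process.

```python
import errno
import os
import stat


def describe_fds(pid):
    """List (fd, target, is_directory) for each open descriptor of a live process.

    Assumes Linux: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    results = []
    for name in sorted(os.listdir(fd_dir), key=int):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)  # e.g. "/", "/dev/pts/0", "socket:[1234]"
            # os.stat follows the link, so this reports the type of the open file.
            is_dir = stat.S_ISDIR(os.stat(link).st_mode)
        except OSError:
            # The descriptor was closed between listdir() and the lookup.
            continue
        results.append((int(name), target, is_dir))
    return results


def format_report(pid):
    """Render the descriptor list as text, flagging any open directories."""
    lines = [f"errno 21 is {errno.errorcode.get(21)}"]
    for fd, target, is_dir in describe_fds(pid):
        flag = "  <-- directory" if is_dir else ""
        lines.append(f"fd {fd}: {target}{flag}")
    return "\n".join(lines)
```

Running print(format_report(pid)) against the application's PID while the job is still alive, and looking for an entry whose target is '/', would tell us which descriptor it is and might hint at what opened it. It only works on a live process, so it can't tell us anything about 28534 after the fact.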
