Hello Eric, first of all, thank you for your help!
On 20.04.2016 at 01:43, Eric Roman wrote:
> Danny and Husen,
>
> I saw your posts recently on using BLCR w/ MPI on the slurm-dev list. It sounded like you both ran into some problems. I want to try to help you through these and see if we can get you up and running.
>
> It sounds like the MPI checkpoint/restart support isn't working correctly for either of you in batch. Can you check to see if it works manually? Outside of SLURM?
>
> For OpenMPI, you should be able to run ompi-checkpoint <pid-of-mpirun>, and ompi-restart <snapshot-id> to restart with BLCR.

I have just checked it and it works, but we have BLCR and OpenMPI installed as modules. Could that cause path problems?
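One way to narrow down the path question is to print which binaries the environment actually resolves, once in the login shell and once inside a batch job. Below is a minimal Python sketch, not a definitive procedure. It assumes the tools are called mpirun, ompi-checkpoint, ompi-restart, cr_checkpoint and cr_restart; the helper names are hypothetical, the pid and snapshot name are placeholders, and the two commands from the quoted text are only built, never executed.

```python
import shutil

# Tool names to look up; adjust to whatever your modules provide (assumption).
TOOLS = ["mpirun", "ompi-checkpoint", "ompi-restart", "cr_checkpoint", "cr_restart"]


def resolve_tools(names=TOOLS):
    """Map each tool name to the path found on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def checkpoint_cmd(mpirun_pid):
    """Command line for checkpointing a running mpirun by its pid."""
    return ["ompi-checkpoint", str(mpirun_pid)]


def restart_cmd(snapshot_id):
    """Command line for restarting from a snapshot reference."""
    return ["ompi-restart", str(snapshot_id)]


if __name__ == "__main__":
    for name, path in resolve_tools().items():
        print(f"{name:16s} -> {path or 'NOT FOUND on PATH'}")
    # 12345 and the snapshot name are placeholders, not real values.
    print(" ".join(checkpoint_cmd(12345)))
    print(" ".join(restart_cmd("ompi_global_snapshot_12345.ckpt")))
```

If the two environments print different paths, or one of them reports a missing tool, the module setup inside the batch job is a likely suspect.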
> Try the manual checkpoint/restart instructions in the MVAPICH2 documentation first. Outside of SLURM if possible. I want to make sure your MPI is working with BLCR first. There are instructions in the MVAPICH user guide for how to do this.
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html

I will now check manual checkpointing with MVAPICH2; so far I have only tested it in the context of Slurm.
> SLURM provides support for checkpointing MVAPICH jobs through a wrapper.
>
> http://slurm.schedmd.com/checkpoint_blcr.html
>
> I'm guessing that the wrapper may not be configured correctly. We'll need to poke around.
>
> In your thread, there were a few different problems.
>
> (1) In one case, the checkpoint files weren't being created.
>
> (2) This message is odd. It looks like something had the root directory open as a normal file. Any idea what/why that is? What was process 28534 at the time of the checkpoint?
>
> - Failed to open file '/'
> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
> Restart failed: Is a directory
>
> Eric

I will check this, but so far I have no idea which process causes the problem.
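The err=-21 in the log matches the final "Is a directory" message: errno 21 is EISDIR on Linux. To find out which process holds '/' open on fd 3, the open descriptors of the job's processes can be listed before the checkpoint is taken. Below is a minimal Python sketch, assuming Linux with /proc mounted and permission to read the target process; the helper names are hypothetical, and the script inspects its own pid as a stand-in for the real job pid.

```python
import errno
import os


def decode_blcr_err(err):
    """Translate a negative error code such as -21 into (symbolic name, message)."""
    code = abs(err)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def open_fds(pid):
    """Return {fd: target} for a live process by reading /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return result  # no /proc, no such pid, or no permission
    for entry in entries:
        try:
            result[int(entry)] = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            pass  # descriptor was closed between listdir and readlink
    return result


if __name__ == "__main__":
    print(decode_blcr_err(-21))
    # Replace os.getpid() with the pid of the MPI rank or wrapper to inspect.
    for fd, target in sorted(open_fds(os.getpid()).items()):
        print(fd, target)
```

A descriptor whose target is '/' in this listing would point at the process that triggers the restore failure.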
Kind regards,
Danny

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
