Hello Eric,

First of all, thank you for your help!

On 20.04.2016 at 01:43, Eric Roman wrote:
Danny and Husen,

I saw your posts recently on using BLCR w/ MPI on the slurm-dev list.
It sounded like you both ran into some problems.  I want to try to help you
through these and see if we can get you up and running.

It sounds like the MPI checkpoint/restart support isn't working correctly
for either of you in batch.

Can you check to see if it works manually?  Outside of SLURM?
For OpenMPI, you should be able to run ompi-checkpoint <pid-of-mpirun>,
and ompi-restart <snapshot-id> to restart with BLCR.
I just checked it and it works, but we have BLCR and OpenMPI installed as modules. Could that cause path problems?
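
One way to narrow down a possible path problem is to compare which binaries are visible in the interactive shell and inside the batch job. The following is only a minimal sketch: the tool list is an assumption (the two OpenMPI commands named above, plus `mpirun` and the BLCR utilities `cr_checkpoint` and `cr_restart`), and `resolve_tools` is a hypothetical helper name.

```python
import shutil

# Assumed tool list: adjust it to match the modules loaded on your system.
TOOLS = ["mpirun", "ompi-checkpoint", "ompi-restart", "cr_checkpoint", "cr_restart"]


def resolve_tools(names):
    """Map each tool name to the absolute path found on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in resolve_tools(TOOLS).items():
        print(f"{name:16} {path or 'NOT FOUND on PATH'}")
```

Running it once on the login node and once from within a batch script should show whether the module environment differs between the two.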

Try the manual checkpoint/restart instructions in the MVAPICH2
documentation first.
Outside of SLURM if possible.  I want to make sure your MPI is working with
BLCR first.

There are instructions in the MVAPICH user guide for how to do this.
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html
I will now check manual checkpointing with MVAPICH2; so far I have only tested it in the context of Slurm.
SLURM provides support for checkpointing MVAPICH jobs through a
wrapper. http://slurm.schedmd.com/checkpoint_blcr.html
I'm guessing that the wrapper may not be configured correctly.

We'll need to poke around.  In your thread, there were a few different
problems.
(1) In one case, the checkpoint files weren't being created.

(2) This message is odd.  It looks like something had the root directory
open as a normal file.  Any idea what/why that is?  What was process 28534
at the time of the checkpoint?
I will check this, but so far I have no idea which process caused the problem.

- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
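
The `err=-21` in the log above looks like a negated errno value, which would be consistent with the final "Is a directory" message. A minimal sketch for decoding such values, assuming BLCR reports a kernel-style negative errno here (`describe_err` is a hypothetical helper name, not part of BLCR):

```python
import errno
import os


def describe_err(err):
    """Return (symbolic name, message) for a possibly negated errno value."""
    code = abs(err)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_err(-21)
    print(name, "-", message)
```

If the value decodes to EISDIR, that would fit the earlier guess that fd 3 referred to '/' opened as a regular file.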

Eric


Kind regards,
Danny

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~