If you do not see any job steps created in slurm when mpirun is
executed, then your openmpi installation is configured to launch its
tasks through some mechanism other than slurm (e.g. rsh). In that case
slurm has no control over those processes and cannot checkpoint them.
Openmpi can instead be configured to launch its tasks using slurm's
srun command, in which case slurm has control over the processes and
accounts for them.
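One way to check which situation you are in is sketched below. It lists the job steps slurm currently knows about and asks openmpi which slurm-related components it was built with. The helper name "have" is only illustrative, and the exact output of both tools depends on the installed versions, so treat this as a rough diagnostic rather than a definitive test.

```shell
#!/bin/sh
# Sketch only: "have" is an illustrative helper that reports whether a
# command is available before we try to call it.
have() { command -v "$1" >/dev/null 2>&1; }

# List the job steps slurm currently knows about. Run this while mpirun is
# active; if the job has no step listed, its tasks were not started via slurm.
if have squeue; then
    squeue --steps
else
    echo "squeue not found in PATH"
fi

# Show the slurm-related components this openmpi build reports, if any.
if have ompi_info; then
    ompi_info | grep -i slurm || echo "ompi_info reports no slurm components"
else
    echo "ompi_info not found in PATH"
fi
```

If ompi_info reports no slurm components at all, openmpi was most likely built without slurm support and would need to be rebuilt or reconfigured.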
Ideally you can upgrade to openmpi version 1.5 or higher and use srun
to launch the parallel job in a very scalable fashion as described here:
http://www.schedmd.com/slurmdocs/mpi_guide.html#open_mpi
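As a minimal sketch of what such a batch script might look like: the task count and job name below are placeholders, and ./myapp is the application from the original message. Depending on the slurm and openmpi versions, direct launch with srun may need extra setup on the cluster or extra srun options, so the guide linked above remains the authoritative description.

```shell
#!/bin/bash
#SBATCH --job-name=myapp   # placeholder job name
#SBATCH --ntasks=4         # placeholder task count

# Launch the MPI tasks with srun instead of mpirun, so that slurm creates a
# job step for them and can control and account for the processes.
if command -v srun >/dev/null 2>&1; then
    srun ./myapp
else
    echo "srun not found in PATH; submit this script with sbatch on a slurm cluster" >&2
fi
```

Submitted with sbatch, the srun line is what creates the job step that later shows up as JOBID.0.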
I personally can't say how well slurm's openmpi checkpoint plugin
works compared to blcr, but other people probably can.
Moe Jette
SchedMD
Quoting Staffan Ronnås <suhbat...@gmail.com>:
Hello,
I am trying to enable checkpoint/restart functionality on a Debian
Squeeze cluster with SLURM 2.1.11 and OpenMPI 1.4.3. My goal is to be
able to preempt low-priority jobs with higher priority ones. From the
documentation I have understood that it should be possible to do this
using one of the "checkpoint" plugins, and I am currently looking at
the "ompi" plugin (activated with CheckpointType=checkpoint/ompi).
As OpenMPI has an interface for checkpointing with BLCR, I am already
able to checkpoint/restart when the job is not launched via SLURM. For
this I run the MPI job as follows:
    mpirun -np X -am ft-enable-cr ./myapp
and can then create a checkpoint with ompi-checkpoint that can be
restarted with ompi-restart.
In SLURM, I would now like to be able to call something like
    scontrol checkpoint create JOBID.JOBSTEP
to create a checkpoint. From earlier messages to this list, I have
understood that for the ompi plugin, it is necessary to specify a
jobstep that is to be checkpointed, instead of just a job id. It seems,
however, that when I call mpirun within a simple batch script (a
one-liner calling "mpirun"), no job steps are created, and hence it is
not possible to checkpoint them. Trying something like
    scontrol checkpoint create 4.0
gives me the error message "scontrol_checkpoint error: Invalid job id
specified".
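Since the command above only makes sense when step 4.0 actually exists, a small wrapper can check for the step first. The following is a sketch under assumptions: the helper names job_of and step_of are made up for illustration, and the step id format printed by squeue may differ between SLURM versions.

```shell
#!/bin/sh
# Illustrative helpers (not part of SLURM): split a JOBID.STEPID string.
job_of() { printf '%s\n' "${1%%.*}"; }
step_of() {
    case "$1" in
        *.*) printf '%s\n' "${1#*.}" ;;
        *)   printf '\n' ;;            # no step id was given
    esac
}

spec="4.0"
echo "job=$(job_of "$spec") step=$(step_of "$spec")"

# Only request a checkpoint if SLURM currently lists the step.
if command -v squeue >/dev/null 2>&1; then
    if squeue --steps --noheader --format=%i | grep -qx "$spec"; then
        scontrol checkpoint create "$spec"
    else
        echo "step $spec is not listed by squeue; was the job launched with srun?" >&2
    fi
fi
```

For the example value this prints "job=4 step=0" before attempting the lookup.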
Is there a way around this, or an alternative way to set up
checkpointing? Would it be better to use the "blcr" plugin? Have there
been large improvements in version 2.3 over 2.1 in this area?
Thank you,
Staffan Ronnas