Hello,

I am trying to enable checkpoint/restart functionality on a Debian Squeeze cluster with SLURM 2.1.11 and OpenMPI 1.4.3. My goal is to be able to preempt low-priority jobs with higher-priority ones. From the documentation I understand that this should be possible using one of the "checkpoint" plugins, and I am currently looking at the "ompi" plugin (activated with CheckpointType=checkpoint/ompi).

As OpenMPI has an interface for checkpointing with BLCR, I am already able to checkpoint and restart a job when it is not launched via SLURM. For this I run the MPI job as follows:

    mpirun -np X -am ft-enable-cr ./myapp

I can then create a checkpoint with ompi-checkpoint and restart from it with ompi-restart.

In SLURM, I would now like to be able to call something like

    scontrol checkpoint create JOBID.JOBSTEP

to create a checkpoint. From earlier messages to this list, I have understood that the ompi plugin requires a job step to be specified for checkpointing, rather than just a job ID. It seems, however, that when I call mpirun within a simple batch script (a one-liner calling "mpirun"), no job steps are created, and hence there is no step to checkpoint. Trying something like

    scontrol checkpoint create 4.0

gives me the error message "scontrol_checkpoint error: Invalid job id specified".

Is there a way around this, or an alternative way to set up checkpointing? Would it be better to use the "blcr" plugin? Have there been large improvements in version 2.3 over 2.1 in this area?

Thank you,
Staffan Ronnas