Manuel Rodríguez Pascual <manuel.rodriguez.pasc...@gmail.com> writes:

> After working with the developers of the DMTCP checkpoint library, we
> have a nice working version of Slurm+DMTCP. We are able to checkpoint any
> batch job (well, most of them) and restart it anywhere else in the
> cluster. We are testing it thoroughly, and will let you know in a few
> weeks in case any of you are interested in testing/using it.

I'd be very interested in this!

> My question is, what should happen to these non-checkpointable jobs
> whenever one with higher priority arrives? One alternative would be to
> preempt only jobs with checkpoint support, so no computation is lost; the
> other would be to preempt whatever is necessary to run the job as soon as
> possible, not caring about being able to restore it later.

In our setup, at least, the second option is to be preferred: a
low-priority job will be checkpointed if possible, and then requeued
(even if it couldn't be checkpointed).  We use low-priority jobs to let
projects run on more CPUs than they normally have access to when those
CPUs are idle, and the users know that the jobs can be requeued at any
time.

Since in other setups the first option might be preferable, perhaps you
could make it configurable?
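
To make the two preemption policies concrete, here is a rough sketch of
what such a switch could look like.  All names (Job, select_victims,
checkpointable_only) are made up for illustration and are not part of
Slurm or DMTCP; the real scheduler logic is certainly more involved.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    checkpointable: bool


def select_victims(running, cpus_needed, checkpointable_only):
    """Pick low-priority jobs to preempt until cpus_needed CPUs are freed.

    checkpointable_only is a hypothetical site setting: if True, jobs
    without checkpoint support are never preempted.
    """
    # Try checkpointable jobs first, so as little computation as
    # possible is lost (sorted() is stable, so ties keep their order).
    candidates = sorted(running, key=lambda j: not j.checkpointable)
    if checkpointable_only:
        candidates = [j for j in candidates if j.checkpointable]

    victims, freed = [], 0
    for job in candidates:
        if freed >= cpus_needed:
            break
        victims.append(job)
        freed += job.cpus

    # If the policy cannot free enough CPUs, preempt nothing.
    return victims if freed >= cpus_needed else []
```

With checkpointable_only=False this behaves like our setup: checkpointable
jobs are taken first, but any low-priority job may be requeued if needed.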

> Also, the next question is what happens with the job to be restarted. With
> the current Slurm implementation it goes back to the queue. The problem
> with this is that, if there are many jobs in the queue, this
> partially-completed one will have to wait a lot before restarting. From my
> point of view it would make sense to put it on top of the queue, so it
> restarts as soon as there is a free slot. This can be easily changed in the
> code, but I'd love to hear your point of view before modifying anything.

In our setup, the first option is preferable: just putting it back on
the queue and letting it wait until its turn.  But of course, there are
other setups where the second option would be best.  Could you perhaps
make it configurable, so a site can choose?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
