Manuel Rodríguez Pascual <manuel.rodriguez.pasc...@gmail.com> writes:
> After working with the developers of the DMTCP checkpoint library, we have a
> nice working version of Slurm+DMTCP. We are able to checkpoint any batch
> job (well, most of them) and restart it anywhere else in the cluster. We
> are testing it thoroughly, and will let you know in a few weeks in case any
> of you is interested in testing/using it.

I'd be very interested in this!

> My question is, what should happen to these non-checkpointable jobs whenever
> one with higher priority comes? One alternative should be to preempt only
> jobs with checkpoint support, so no computation is lost; the other would be
> to preempt whatever necessary to run the job as soon as possible, not
> caring about being able to restore it later.

In our setup, at least, the second option is to be preferred: a
low-priority job will be checkpointed if possible, and then requeued
(even if it couldn't be checkpointed). We use low-priority jobs to let
projects run on more CPUs than they normally have access to if those
CPUs are idle, and the users know that the jobs can be requeued at any
time.

Since in other setups the first option might be preferable, perhaps you
could make it configurable?

> Also, the next question is what happens with the job to be restarted. With
> current Slurm implementation it goes back to the queue. The problem with this
> is that, if there are many jobs in the queue, this partially-completed one
> will have to wait a lot before restarting. From my point of view it would
> make sense to put it on top of the queue, so it restarts as soon as there
> is a free slot. This can be easily changed in the code, but I'd love to
> hear your point of view before modifying anything.

In our setup, the first option is preferable: just putting it back on
the queue and letting it wait until its turn. But of course, there are
other setups where the second option would be best. Could you perhaps
make it configurable, so a site can choose?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
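
To make the two configurable choices discussed above concrete, here is a minimal Python sketch. It is not Slurm code and does not use any Slurm or DMTCP API. All names (`PreemptPolicy`, `RequeuePolicy`, `Job`, `select_victims`, `requeue`) are hypothetical and chosen only for illustration. It assumes a site picks one value for each policy.

```python
from dataclasses import dataclass
from enum import Enum


class PreemptPolicy(Enum):
    # Hypothetical setting: which running jobs may be preempted.
    CHECKPOINTABLE_ONLY = "checkpointable_only"  # no computation is lost
    ANY = "any"                                  # free resources as fast as possible


class RequeuePolicy(Enum):
    # Hypothetical setting: where a preempted job re-enters the queue.
    BACK_OF_QUEUE = "back"
    FRONT_OF_QUEUE = "front"


@dataclass
class Job:
    name: str
    priority: int
    cpus: int
    checkpointable: bool


def select_victims(running, needed_cpus, incoming_priority, policy):
    """Pick lowest-priority jobs to preempt; return [] if not enough CPUs can be freed."""
    candidates = [
        job for job in running
        if job.priority < incoming_priority
        and (policy is PreemptPolicy.ANY or job.checkpointable)
    ]
    candidates.sort(key=lambda job: job.priority)  # stable: ties keep input order
    victims, freed = [], 0
    for job in candidates:
        if freed >= needed_cpus:
            break
        victims.append(job)
        freed += job.cpus
    return victims if freed >= needed_cpus else []


def requeue(queue, job, policy):
    """Return a new queue with the preempted job placed according to the policy."""
    new_queue = list(queue)
    if policy is RequeuePolicy.FRONT_OF_QUEUE:
        new_queue.insert(0, job)
    else:
        new_queue.append(job)
    return new_queue
```

With `PreemptPolicy.ANY` and `RequeuePolicy.BACK_OF_QUEUE` this mirrors the behaviour preferred in the reply above. The other combination corresponds to the alternatives raised in the quoted message.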