[slurm-dev] Time limit exhausted for JobId=
One user has recently started to see their jobs killed after roughly 40 minutes, even though they have asked for four hours. 40 minutes is the partition's default, but this user has #SBATCH --time=04:00:00 in their sbatch file.

I found this: https://bugs.schedmd.com/show_bug.cgi?id=2353 and we are using the affected 16.05.0, but I've not run scontrol reconfigure for a while, and we don't run NHC. I'm confused.

This is from the slurmctld log:

[2017-05-22T16:51:53.577] _slurm_rpc_submit_batch_job JobId=723118 usec=303
[2017-05-22T16:51:54.271] sched: Allocate JobID=723118 NodeList=papr-res-compute01 #CPUs=1 Partition=prod
[2017-05-22T16:51:58.252] _pick_step_nodes: Configuration for job 723118 is complete
[2017-05-22T17:32:09.641] Time limit exhausted for JobId=723118
[2017-05-22T17:32:09.749] job_complete: JobID=723118 State=0x8006 NodeCnt=1 WTERMSIG 15

This is from the relevant node's slurmd log:

[2017-05-22T16:51:54.289] _run_prolog: prolog with lock for job 723118 ran for 0 seconds
[2017-05-22T16:51:54.309] Launching batch job 723118 for UID 1514
[2017-05-22T16:51:58.259] launch task 723118.0 request from 1514.1514@10.126.19.15 (port 11938)
[2017-05-22T17:32:09.644] [723118] error: *** JOB 723118 ON papr-res-compute01 CANCELLED AT 2017-05-22T17:32:09 DUE TO TIME LIMIT ***
[2017-05-22T17:32:09.644] [723118.0] error: *** STEP 723118.0 ON papr-res-compute01 CANCELLED AT 2017-05-22T17:32:09 DUE TO TIME LIMIT ***
[2017-05-22T17:32:09.747] [723118] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-05-22T17:32:09.759] [723118] done with job
[2017-05-22T17:32:09.793] [723118.0] done with job

The user has also run this sbatch with #SBATCH --time=0-04:00:00, with the same error.

Any ideas where to look? (The time on the cluster is managed, and was resync'd early last week.)

cheers
L.
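For what it's worth, the two --time formats tried (04:00:00 and 0-04:00:00) should be equivalent, so the spec format is unlikely to be the problem. A minimal bash sketch that converts a Slurm time spec to minutes (an illustration only: Slurm also accepts bare minutes and MM:SS forms, which this does not handle, and it ignores the seconds field), followed by the scontrol queries I would check first:

```shell
#!/usr/bin/env bash
# Convert a Slurm time spec (HH:MM:SS or D-HH:MM:SS) to whole minutes.
# Sketch only: bare-minute and MM:SS specs are not handled; seconds are ignored.
slurm_time_to_min() {
    local spec=$1 days=0 hms h=0 m=0 s=0
    case $spec in
        *-*) days=${spec%%-*}; hms=${spec#*-} ;;
        *)   hms=$spec ;;
    esac
    IFS=: read -r h m s <<< "$hms"
    # 10# forces base-10 so zero-padded fields like "08" don't parse as octal
    echo $(( days * 1440 + 10#$h * 60 + 10#$m ))
}

slurm_time_to_min 04:00:00     # prints 240
slurm_time_to_min 0-04:00:00   # prints 240 -- same limit either way

# On the cluster itself I'd compare the limit slurmctld actually applied
# against the partition settings:
#   scontrol show job 723118 | grep -i timelimit
#   scontrol show partition prod | grep -i time
```

If `scontrol show job` reports a 40-minute TimeLimit despite the sbatch directive, the directive is being ignored or overridden somewhere (e.g. stale config until `scontrol reconfigure` is run), which narrows the search considerably.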
-- "Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams." - Patrice Cullors, *Black Lives Matter founder*
[slurm-dev] Re: PartitionTimeLimit : what does that mean?
Gah. I just found MaxTime in the slurm.conf. My bad, sorry.

L.

On 23 May 2017 at 09:43, Lachlan Musicman wrote:
> Hola,
>
> One of my users has been given the PartitionTimeLimit reason for his jobs
> not running.
>
> He has requested 20 days for the job, but I don't remember setting a time
> limit on any partition?
>
> I do recall setting a default time, but not a time limit.
>
> The docs claim:
>
> https://slurm.schedmd.com/squeue.html
>
> *PartitionTimeLimit* The job's time limit exceeds its partition's
> current time limit.
>
> But I can't find anything else that might describe where a time limit was
> set or how I might go about configuring it out of the way.
>
> cheers
>
> L.
[slurm-dev] PartitionTimeLimit : what does that mean?
Hola,

One of my users has been given the PartitionTimeLimit reason for his jobs not running.

He has requested 20 days for the job, but I don't remember setting a time limit on any partition? I do recall setting a default time, but not a time limit.

The docs claim (https://slurm.schedmd.com/squeue.html):

*PartitionTimeLimit* The job's time limit exceeds its partition's current time limit.

But I can't find anything else that might describe where a time limit was set or how I might go about configuring it out of the way.

cheers

L.
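PartitionTimeLimit means exactly what the squeue docs say: the job's requested --time exceeds the partition's MaxTime, which is set per partition in slurm.conf and shown by `scontrol show partition`. A hypothetical partition line for illustration (the node names and values are examples, not taken from any real config):

```
PartitionName=prod Nodes=node[01-10] Default=YES DefaultTime=00:40:00 MaxTime=7-00:00:00 State=UP
```

With a line like this, a job asking for 20 days would sit pending with PartitionTimeLimit; raising MaxTime (or setting it to UNLIMITED) in slurm.conf and running `scontrol reconfigure` would let it through. DefaultTime, by contrast, only applies when a job does not request a time at all.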
[slurm-dev] thoughts on task preemption
Hi all,

After working with the developers of the DMTCP checkpoint library, we have a nice working version of Slurm+DMTCP. We are able to checkpoint any batch job (well, most of them) and restart it anywhere else in the cluster. We are testing it thoroughly, and will let you know in a few weeks in case any of you are interested in testing/using it.

Anyway, now that this is ready, we are working on some uses for the new functionality. An interesting one is job preemption: if a job is running and another with a higher priority comes, checkpoint the first one, cancel it, run the second, and restart the first one somewhere else. Slurm already has support for this, so it is fairly trivial from a technical point of view.

I am, however, not fully convinced about how this should work. If possible, I'd like to have your thoughts as expert system administrators.

A key point is that, while we are able to checkpoint/restart most jobs (both serial and MPI), we can checkpoint only batch jobs: no srun, no salloc. Also, jobs running on GPUs or Xeon Phi cannot currently be checkpointed. My question is: what should happen to these non-checkpointable jobs when one with higher priority comes? One alternative would be to preempt only jobs with checkpoint support, so no computation is lost; the other would be to preempt whatever is necessary to run the new job as soon as possible, without caring about being able to restore the preempted jobs later. I can imagine scenarios where each alternative is better than the other, but I am not sure how realistic they are. As system administrators, would you have a preference?

The next question is what happens to the job to be restarted. With the current Slurm implementation it goes back to the queue. The problem with this is that, if there are many jobs in the queue, this partially-completed job will have to wait a long time before restarting. From my point of view it would make sense to put it at the top of the queue, so it restarts as soon as there is a free slot. This can easily be changed in the code, but I'd love to hear your point of view before modifying anything.

So, any ideas/suggestions?

Thanks for your help.

Best regards,
Manuel
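For reference, the stock preempt-and-requeue behaviour described above is driven by a few slurm.conf knobs; a hypothetical fragment (the values are examples, not a recommendation):

```
PreemptType=preempt/partition_prio   # or preempt/qos: what defines "higher priority"
PreemptMode=REQUEUE                  # preempted jobs go back into the queue
JobRequeue=1                         # batch jobs are requeueable by default
```

On the "put it at the top of the queue" idea: `scontrol top <jobid>` already moves a pending job ahead of the owner's other pending jobs, which might get part of the way there without patching the scheduler, though it does not reorder the job relative to other users' work.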
[slurm-dev] Compute nodes going to drained/draining state
Hello,

I've recently started using Slurm v17.02.2, and something seems very odd. When jobs fail or exceed their walltime limit, I see that compute nodes are being placed in the drained or draining state. Does anyone understand what might be wrong? Is this a known bug or a new feature? I never saw this with v16.05.5.

If anyone could shed any light on this issue, that would be appreciated.

Best regards,
David
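When a node drains, the reason is recorded and `sinfo -R` will show it, so that is the place to start. One hypothetical cause worth ruling out (an assumption based on the "jobs exceeding walltime" symptom, not a diagnosis): if slurmd cannot kill a job's processes within the timeout, it drains the node with a "Kill task failed" reason, which is governed by UnkillableStepTimeout. The relevant slurm.conf knobs, with example values only:

```
ReturnToService=1          # a DOWN node may return to service when slurmd re-registers healthy
UnkillableStepTimeout=120  # seconds to wait for job-step cleanup before draining the node
```

Once the underlying cause is understood, `scontrol update NodeName=<node> State=RESUME` clears a drained node.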