[slurm-dev] Time limit exhausted for JobId=

2017-05-22 Thread Lachlan Musicman
One user has recently started to see their jobs killed after roughly 40
minutes, even though they have asked for four hours.

40 minutes is the partition's default, but this user has

#SBATCH --time=04:00:00

in their sbatch file?
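
As a sanity check (a sketch; this only works while the job is still known to
the controller), the following should show the limit Slurm actually applied
and the partition the job landed in, using the job ID from the logs below:

scontrol show job 723118 | grep -i TimeLimit
squeue -j 723118 -o "%.10i %.12l %.12L %.10P"    # JobID, TimeLimit, TimeLeft, Partition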

I have found this: https://bugs.schedmd.com/show_bug.cgi?id=2353 and we are
running the affected 16.05.0, but I haven't run scontrol reconfigure for a
while, and we don't run NHC.


I'm confused.

This is from the slurmctld log
[2017-05-22T16:51:53.577] _slurm_rpc_submit_batch_job JobId=723118 usec=303
[2017-05-22T16:51:54.271] sched: Allocate JobID=723118
NodeList=papr-res-compute01 #CPUs=1 Partition=prod
[2017-05-22T16:51:58.252] _pick_step_nodes: Configuration for job 723118 is
complete
[2017-05-22T17:32:09.641] Time limit exhausted for JobId=723118
[2017-05-22T17:32:09.749] job_complete: JobID=723118 State=0x8006 NodeCnt=1
WTERMSIG 15


This is from the relevant node's slurmd.log
[2017-05-22T16:51:54.289] _run_prolog: prolog with lock for job 723118 ran
for 0 seconds
[2017-05-22T16:51:54.309] Launching batch job 723118 for UID 1514
[2017-05-22T16:51:58.259] launch task 723118.0 request from
1514.1514@10.126.19.15 (port 11938)
[2017-05-22T17:32:09.644] [723118] error: *** JOB 723118 ON
papr-res-compute01 CANCELLED AT 2017-05-22T17:32:09 DUE TO TIME LIMIT ***
[2017-05-22T17:32:09.644] [723118.0] error: *** STEP 723118.0 ON
papr-res-compute01 CANCELLED AT 2017-05-22T17:32:09 DUE TO TIME LIMIT ***
[2017-05-22T17:32:09.747] [723118] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:0 status 15
[2017-05-22T17:32:09.759] [723118] done with job
[2017-05-22T17:32:09.793] [723118.0] done with job



The user has also run this sbatch script with

#SBATCH --time=0-04:00:00

and hit the same error.
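
For reference, both forms should parse to the same four hours, since sbatch
accepts both hours:minutes:seconds and days-hours:minutes:seconds:

#SBATCH --time=04:00:00     # hours:minutes:seconds
#SBATCH --time=0-04:00:00   # days-hours:minutes:seconds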

Any ideas where to look? (The time on the cluster is managed and was
resynced early last week.)

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrice Cullors, *Black Lives Matter founder*


[slurm-dev] Re: PartitionTimeLimit : what does that mean?

2017-05-22 Thread Lachlan Musicman
Gah. I just found the MaxTime in the slurm.conf.

My bad, sorry.
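
For anyone who finds this later: the setting lives on the partition
definition in slurm.conf. A sketch with made-up node names and times (only
DefaultTime and MaxTime matter here):

PartitionName=prod Nodes=papr-res-compute[01-10] Default=YES DefaultTime=00:40:00 MaxTime=14-00:00:00 State=UP

DefaultTime is what a job gets when it doesn't pass --time, and MaxTime is
the cap that produces the PartitionTimeLimit reason.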

L.

--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrice Cullors, *Black Lives Matter founder*

On 23 May 2017 at 09:43, Lachlan Musicman wrote:

> Hola,
>
> One of my users has been given the PartitionTimeLimit reason for his jobs
> not running.
>
> He has requested 20 days for the job, but I don't remember setting a time
> limit on any partition?
>
> I do recall setting a default time, but not a time limit.
>
> The docs claim:
>
> https://slurm.schedmd.com/squeue.html
>
> *PartitionTimeLimit* The job's time limit exceeds its partition's
> current time limit.
>
>
> But I can't find anything else that might describe where a time limit was
> set, or how I might go about configuring it out of the way?
>
>
> cheers
>
> L.
>
>
> --
> "Mission Statement: To provide hope and inspiration for collective action,
> to build collective power, to achieve collective transformation, rooted in
> grief and rage but pointed towards vision and dreams."
>
>  - Patrice Cullors, *Black Lives Matter founder*
>


[slurm-dev] PartitionTimeLimit : what does that mean?

2017-05-22 Thread Lachlan Musicman
Hola,

One of my users has been given the PartitionTimeLimit reason for his jobs
not running.

He has requested 20 days for the job, but I don't remember setting a time
limit on any partition?

I do recall setting a default time, but not a time limit.

The docs claim:

https://slurm.schedmd.com/squeue.html

*PartitionTimeLimit* The job's time limit exceeds its partition's current
time limit.


But I can't find anything else that might describe where a time limit was
set, or how I might go about configuring it out of the way?
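
A sketch of how to display each partition's current limit, in case that is
the right place to look (sinfo's %l column is the partition time limit):

sinfo -o "%P %l"
scontrol show partition | grep -Ei 'PartitionName|DefaultTime|MaxTime'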


cheers

L.


--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrice Cullors, *Black Lives Matter founder*


[slurm-dev] thoughts on task preemption

2017-05-22 Thread Manuel Rodríguez Pascual
Hi all,

After working with the developers of the DMTCP checkpoint library, we have a
nice working version of Slurm+DMTCP. We are able to checkpoint any batch
job (well, most of them) and restart it anywhere else in the cluster. We
are testing it thoroughly, and will let you know in a few weeks in case any
of you are interested in testing or using it.

Anyway, now that this is ready, we are working on some uses for the new
functionality. An interesting one is job preemption: if a job is running
and another with a higher priority arrives, checkpoint the first one,
cancel it, run the second, and restart the first one somewhere else.

Slurm already has support for this, so it is fairly trivial from a
technical point of view. I am, however, not fully convinced about how it
should work. If possible, I'd like to have your thoughts as expert system
administrators.

A key point is that, while we are able to checkpoint/restart most jobs
(both serial and MPI), we can only checkpoint batch jobs: no srun, no
salloc. Also, jobs running on GPUs or Xeon Phi cannot currently be
checkpointed.

My question is: what should happen to these non-checkpointable jobs when
one with a higher priority arrives? One alternative would be to preempt only
jobs with checkpoint support, so no computation is lost; the other would be
to preempt whatever is necessary to run the new job as soon as possible,
without caring about being able to restore the preempted jobs later. I can
imagine scenarios where one alternative is better than the other, but I am
not sure how realistic they are. As system administrators, would you have
any preference here?
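
To make the trade-off concrete, here is a hedged sketch of how the first
policy might be expressed with the existing slurm.conf knobs, assuming
partition-priority preemption and made-up partition and node names:
checkpointable batch work goes to a preemptable partition, everything else
to one that is never preempted.

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE                    # cluster-wide default for preempted jobs
PartitionName=ckpt   Nodes=node[01-10] PriorityTier=1  PreemptMode=REQUEUE   # checkpointable work, may be preempted
PartitionName=safe   Nodes=node[01-10] PriorityTier=1  PreemptMode=OFF       # non-checkpointable jobs, never preempted
PartitionName=urgent Nodes=node[01-10] PriorityTier=10                       # high-priority jobs do the preempting

The second policy (preempt whatever is necessary) would simply make every
lower-tier partition preemptable.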

The next question is what happens to the job that has to be restarted. With
the current Slurm implementation it goes back to the queue. The problem with
this is that, if there are many jobs in the queue, the partially-completed
job will have to wait a long time before restarting. From my point of view
it would make sense to put it at the top of the queue, so it restarts as
soon as there is a free slot. This can easily be changed in the code, but
I'd love to hear your point of view before modifying anything.
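
For what it's worth, something close to this can already be done by hand
today (a sketch; <jobid> is a placeholder, and scontrol top only reorders
jobs within the same user, account, partition and QOS, so it is not a
complete answer):

scontrol top <jobid>
scontrol update JobId=<jobid> Priority=<large value>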

So, any ideas/suggestions?

Thanks for your help. Best regards,

Manuel


[slurm-dev] Compute nodes going to drained/draining state

2017-05-22 Thread Baker D. J.
Hello,

I've recently started using Slurm v17.02.2, and something seems very odd.
When jobs fail or exceed their walltime limit, for example, I see that
compute nodes are being placed in the drained or draining state. Does
anyone understand what might be wrong? Is this a known bug or a new feature?
I never saw this with v16.05.5. If anyone could shed any light on this
issue, that would be appreciated.
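
For context, the Reason field that Slurm records when it drains a node is
usually the first clue, and clearing a node afterwards is a one-liner; a
sketch with a hypothetical node name:

sinfo -R
scontrol show node node01 | grep -i Reason
scontrol update NodeName=node01 State=RESUME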

Best regards,
David