[slurm-dev] Re: MaxTRESMins limit on a job kills a running job -- is it meant to?

2016-01-08 Thread Lennart Karlsson


Thank you Doug, for your suggestion.

But I really want the job to start.

The problem appears when the timelimit is later increased: the job is
killed when it reaches the MaxTRESMins limit, and we do not want that
to happen.

I would like to be able to prolong the job and have it continue
to run until it has finished or has reached the timelimit.
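
To make the arithmetic concrete, here is a minimal sketch of how a job that
fit under the cap at submission can cross it after its timelimit is raised.
It is only an illustrative model, not Slurm's enforcement code. The cap of 60
comes from the log line in the original report; the CPU count and both
timelimits are hypothetical values chosen for the example.

```python
# Illustrative model only; this is not how slurmctld is implemented.

def cpu_minutes(cpus: int, minutes: float) -> float:
    """CPU-minutes used by a job holding `cpus` CPUs for `minutes` minutes."""
    return cpus * minutes


def exceeds_cap(cpus: int, minutes: float, max_tres_mins: float) -> bool:
    """True once usage is at or above the per-job cap ("at or exceeds" in the log)."""
    return cpu_minutes(cpus, minutes) >= max_tres_mins


MAX_TRES_MINS = 60   # from the log: "max tres(cpu) minutes of 60"
CPUS = 1             # assumption: the log does not state the CPU count

original_limit = 50  # hypothetical timelimit at submission, in minutes
extended_limit = 90  # hypothetical timelimit after the admin raises it

# At submission the worst case (CPUs x timelimit) is under the cap.
print(exceeds_cap(CPUS, original_limit, MAX_TRES_MINS))
# After the extension the job can reach the cap before its new timelimit.
print(exceeds_cap(CPUS, extended_limit, MAX_TRES_MINS))
# Elapsed minutes at which the running job reaches the cap.
print(MAX_TRES_MINS / CPUS)
```

In this model the job would be stopped at minute 60, well before the extended
timelimit of 90, which is the behaviour we would like to avoid.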

Cheers,
-- Lennart Karlsson
UPPMAX, Uppsala University, Sweden
http://www.uppmax.uu.se


On 01/07/2016 05:38 PM, Douglas Jacobsen wrote:

I think you probably want to add "safe" to AccountingStorageEnforce in
slurm.conf; that should prevent it from starting jobs that would exceed
association limits.


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov



On Thu, Jan 7, 2016 at 7:15 AM, Lennart Karlsson wrote:



We have set the MaxTRESMins limit on accounts and users, to make it
impossible to start what we think are outrageously large jobs.

But we have found an unwanted side effect:
When the user asks for a longer timelimit, we often allow that, and
when we increase the timelimit, sometimes jobs run into the
MaxTRESMins limit and die:
Dec 28 17:20:18 milou-q slurmctld: [2015-12-28T17:20:09.072] Job 6574528
timed out, the job is at or exceeds assoc 10056(b2013086/ansgar/(null)) max
tres(cpu) minutes of 60 with 61

For us, this looks like a bug.

Please, we would prefer the MaxTRESMins limit not to kill already
running jobs.

Cheers,
-- Lennart Karlsson
UPPMAX, Uppsala University, Sweden
http://www.uppmax.uu.se





[slurm-dev] Re: MaxTRESMins limit on a job kills a running job -- is it meant to?

2016-01-07 Thread Douglas Jacobsen
I think you probably want to add "safe" to AccountingStorageEnforce in
slurm.conf; that should prevent it from starting jobs that would exceed
association limits.
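
As a rough sketch, the change would be a single line in slurm.conf. To my
understanding of the slurm.conf documentation, "safe" also turns on "limits"
and "associations", and is described there in terms of GrpTRESMins limits, so
whether it covers a per-job MaxTRESMins limit is something to verify on your
version. If the site already lists other values for this parameter (for
example qos), they would need to be kept alongside it.

```
# slurm.conf (fragment)
# Only start a job if it can run to completion within the association limits.
AccountingStorageEnforce=safe
```

The controller has to pick up the new setting afterwards, typically by
restarting or reconfiguring slurmctld.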


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov



On Thu, Jan 7, 2016 at 7:15 AM, Lennart Karlsson wrote:

>
> We have set the MaxTRESMins limit on accounts and users, to make it
> impossible to start what we think are outrageously large jobs.
>
> But we have found an unwanted side effect:
> When the user asks for a longer timelimit, we often allow that, and
> when we increase the timelimit, sometimes jobs run into the
> MaxTRESMins limit and die:
> Dec 28 17:20:18 milou-q slurmctld: [2015-12-28T17:20:09.072] Job 6574528
> timed out, the job is at or exceeds assoc 10056(b2013086/ansgar/(null)) max
> tres(cpu) minutes of 60 with 61
>
> For us, this looks like a bug.
>
> Please, we would prefer the MaxTRESMins limit not to kill already
> running jobs.
>
> Cheers,
> -- Lennart Karlsson
> UPPMAX, Uppsala University, Sweden
> http://www.uppmax.uu.se
>