I think you probably want to add "safe" to AccountingStorageEnforce in
slurm.conf; that should prevent Slurm from starting jobs that would exceed
association limits.
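
For example, a minimal sketch of the relevant slurm.conf line (assuming
you are not already setting other values there; as far as I know, "safe"
also turns on "limits" and "associations", so listing them explicitly is
only for readability):

```
# slurm.conf (fragment, illustrative)
AccountingStorageEnforce=associations,limits,safe
```

If you already have an AccountingStorageEnforce line, just append "safe"
to the existing comma-separated list.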

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
[email protected]

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Thu, Jan 7, 2016 at 7:15 AM, Lennart Karlsson <[email protected]>
wrote:

>
> We have set the MaxTRESMins limit on accounts and users, to make it
> impossible to start what we think are outrageously large jobs.
>
> But we have found an unwanted side effect:
> When a user asks for a longer time limit, we often allow it, and
> when we increase the time limit, jobs sometimes run into the
> MaxTRESMins limit and die:
> Dec 28 17:20:18 milou-q slurmctld: [2015-12-28T17:20:09.072] Job 6574528
> timed out, the job is at or exceeds assoc 10056(b2013086/ansgar/(null)) max
> tres(cpu) minutes of 600000 with 600001
>
> For us, this looks like a bug.
>
> Please, we would prefer the MaxTRESMins limit not to kill already
> running jobs.
>
> Cheers,
> -- Lennart Karlsson
>    UPPMAX, Uppsala University, Sweden
>    http://www.uppmax.uu.se
>
