I think you probably want to add "safe" to AccountingStorageEnforce in slurm.conf; that should prevent Slurm from starting jobs that would exceed association limits.
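
As a rough sketch, the line in slurm.conf might look something like the following. The other values on the line are assumptions on my part; keep whatever your site already enforces and just append "safe". As far as I recall from the slurm.conf man page, "safe" turns on "limits" (and with it "associations") automatically, so listing them is only for readability.

    # slurm.conf (sketch; "associations,limits" is an assumed existing setting)
    AccountingStorageEnforce=associations,limits,safe
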

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
[email protected]

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Thu, Jan 7, 2016 at 7:15 AM, Lennart Karlsson <[email protected]> wrote:

>
> We have set the MaxTRESMins limit on accounts and users, to make it
> impossible to start what we think is outrageously large jobs.
>
> But we have found an unwanted side effect:
> When the user asks for a longer timelimit, we often allow that, and
> when we increase the timelimit, sometimes jobs run into the
> MaxTRESMins limit and die:
> Dec 28 17:20:18 milou-q slurmctld: [2015-12-28T17:20:09.072] Job 6574528
> timed out, the job is at or exceeds assoc 10056(b2013086/ansgar/(null)) max
> tres(cpu) minutes of 600000 with 600001
>
> For us, this looks like a bug.
>
> Please, we would prefer the MaxTRESMins limit not to kill already
> running jobs.
>
> Cheers,
> -- Lennart Karlsson
> UPPMAX, Uppsala University, Sweden
> http://www.uppmax.uu.se
>
