[slurm-dev] Re: GrpCPUMins and GrpWall causing running jobs to be killed

Danny Auble Tue, 06 Nov 2012 11:33:06 -0800

This is at least partially fixed in 2.5.

https://github.com/SchedMD/slurm/commit/177f85e7f7695eea6336658ee89b69ce5cc0f839


The same kind of thing could be done for GrpWall; I am guessing it is
the same issue.  The patch should also work with 2.4 if you don't want
to wait for 2.5.

Danny

On 11/06/2012 10:35 AM, Paddy Doyle wrote:
> Hi again,
>
> I'd just like to raise the issue of GrpCPUMins and GrpWall causing running jobs
> to be killed when limits are reached.
>
> I personally think this is a bit heavy-handed.
>
> I would prefer the system to prevent the job from being started, rather than
> killing a running job.
>
> This obviously would require (much) more logic at the job launch stage
> to calculate requested time * allocated cpus, and check if that added to
> the current usage would bring it over the limit. If you take into account
> multiple users in an association submitting multiple jobs, I appreciate that
> this is a non-trivial issue. It has shades of GOLD pre-allocation of time, of
> which I don't have fond memories!
>
>
> Perhaps a compromise might be an additional slurm.conf boolean value, something
> like:
>
> AccountingStorageEnforceAllowFinish=true
>
> (that's a terrible name!)
>
> It could default to false, to preserve the current behaviour, but if set to
> true, it would allow running jobs to finish, even if they run over the limit.
>
> That way it's less cruel to users, but they still end up going over the limit,
> and it affects their future jobs, rather than their currently running jobs.
> Sure, a user could end up having multiple jobs go over the limit, but eventually
> they won't be able to run.
>
> To implement this, you'd need additional slurm.conf parsing logic, and then in
> the src/slurmctld/job_mgr.c:job_time_limit() function you'd have an additional
> boolean check in each of the usage checks, similar to my previously proposed
> patch.
>
>
> Any thoughts / comments?
>
> Thanks,
> Paddy
>
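
For illustration only, here is a minimal sketch of the two ideas in the quoted
message: a pre-launch check (requested time * allocated cpus added to current
usage, compared against the limit) and a boolean that lets running jobs finish.
It is not the code from the commit above or from job_time_limit(). The struct,
its fields, both function names, and the reserved_cpu_mins bookkeeping for
already-running jobs are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an association's GrpCPUMins state.
 * These fields are illustrative, not Slurm's real association record. */
struct assoc_usage {
    uint64_t grp_cpu_mins_limit; /* GrpCPUMins limit, in CPU-minutes */
    uint64_t used_cpu_mins;      /* CPU-minutes already consumed */
    uint64_t reserved_cpu_mins;  /* assumed: worst-case remainder of running jobs */
};

/* Pre-launch check: the worst-case cost of a job is its time limit
 * multiplied by its allocated CPUs. Returns true only if the job could
 * run to its full time limit without exceeding the group limit. */
static bool job_fits_grp_cpu_mins(const struct assoc_usage *a,
                                  uint32_t time_limit_mins,
                                  uint32_t alloc_cpus)
{
    assert(a != NULL);

    /* Widen before multiplying so the product cannot wrap in 32 bits. */
    uint64_t cost = (uint64_t)time_limit_mins * alloc_cpus;
    uint64_t committed = a->used_cpu_mins + a->reserved_cpu_mins;

    if (committed >= a->grp_cpu_mins_limit)
        return false; /* nothing left to hand out */

    return cost <= a->grp_cpu_mins_limit - committed;
}

/* Running-job check with the proposed opt-out: when allow_finish is
 * true, a running job is never killed for exceeding the group limit;
 * otherwise the job is killed once usage reaches the limit. */
static bool should_kill_running_job(const struct assoc_usage *a,
                                    bool allow_finish)
{
    assert(a != NULL);

    if (allow_finish)
        return false;

    return a->used_cpu_mins >= a->grp_cpu_mins_limit;
}
```

With a limit of 1000 CPU-minutes, 600 used and 200 reserved, a 4-cpu job with a
50 minute time limit (200 CPU-minutes) would be allowed to start, while the same
job with a 51 minute limit would be held. The reservation term is one way of
handling several jobs in the same association starting close together; without
it, each job would be checked only against usage already recorded.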
