This is at least partially fixed in 2.5. https://github.com/SchedMD/slurm/commit/177f85e7f7695eea6336658ee89b69ce5cc0f839
The same kind of thing could be done for GrpWall. I am guessing it is the same issue.

The patch should work with 2.4 if you don't want to wait for 2.5.

Danny

On 11/06/2012 10:35 AM, Paddy Doyle wrote:
> Hi again,
>
> I'd just like to raise the issue of GrpCPUMins and GrpWall causing running
> jobs to be killed when limits are reached.
>
> I personally think this is a bit heavy-handed.
>
> I would prefer the system to prevent the job from being started, rather than
> killing a running job.
>
> This obviously would require (much) more logic at the job launch stage
> to calculate requested time * allocated cpus, and check if that added to
> the current usage would bring it over the limit. If you take into account
> multiple users in an association submitting multiple jobs, I appreciate that
> this is a non-trivial issue. It has shades of GOLD pre-allocation of time, of
> which I don't have fond memories!
>
> Perhaps a compromise might be an additional slurm.conf boolean value,
> something like:
>
> AccountingStorageEnforceAllowFinish=true
>
> (that's a terrible name!)
>
> It could default to false, to preserve the current behaviour, but if set to
> true, it would allow running jobs to finish, even if they run over the limit.
>
> That way it's less cruel to users, but they still end up going over the limit,
> and it affects their future jobs, rather than their currently running jobs.
> Sure, a user could end up having multiple jobs go over the limit, but
> eventually they won't be able to run.
>
> To implement this, you'd need additional slurm.conf parsing logic, and then in
> the src/slurmctld/job_mgr.c:job_time_limit() function you'd have an additional
> boolean check in each of the usage checks, similar to my previously proposed
> patch.
>
> Any thoughts / comments?
>
> Thanks,
> Paddy
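For anyone following the thread, the two ideas in Paddy's mail (a launch-time check of requested time * allocated cpus against the remaining GrpCPUMins, and a boolean that lets running jobs finish) could look roughly like the sketch below. This is not the code from the commit above and not Slurm's real data structures: the struct, field names, and both functions are hypothetical and simplified, and the launch-time check ignores time already promised to other running jobs in the association, which is the non-trivial part Paddy mentions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an association's usage record.
 * Slurm's real association structures are different. */
typedef struct {
    uint64_t grp_cpu_mins;   /* GrpCPUMins limit for the association */
    uint64_t used_cpu_mins;  /* CPU minutes already charged */
} assoc_usage_t;

/* Launch-time check: would (requested time * allocated cpus), added to
 * the current usage, bring the association over its limit? */
bool job_fits_limit(const assoc_usage_t *assoc,
                    uint32_t time_limit_mins, uint32_t cpus)
{
    uint64_t requested = (uint64_t)time_limit_mins * cpus;

    if (assoc->used_cpu_mins >= assoc->grp_cpu_mins)
        return false;  /* already at or over the limit */
    return requested <= assoc->grp_cpu_mins - assoc->used_cpu_mins;
}

/* Run-time check in the spirit of the proposed boolean: when allow_finish
 * is true a running job is never killed for exceeding the limit; the
 * overrun only affects jobs submitted later. */
bool should_kill_running_job(const assoc_usage_t *assoc, bool allow_finish)
{
    if (allow_finish)
        return false;
    return assoc->used_cpu_mins >= assoc->grp_cpu_mins;
}
```

With allow_finish set to false the second function keeps the current behaviour (the job is killed once usage reaches the limit), so a default of false would preserve what sites see today.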
