A quick and dirty addition to job_submit.lua does the trick. Thanks.
Bill
--
-- Check for unlimited memory requests
--
if job_desc.pn_min_memory == 0 then
   log_info("slurm_job_submit: job from uid %d invalid memory request MaxMemPerNode",
            job_desc.user_id)
   return 2044 -- signal ESLURM_INVALID_TASK_MEMORY
end
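(For context, a sketch of how that fragment might sit inside the plugin,
assuming the standard job_submit/lua entry points; slurm.log_info and
slurm.SUCCESS come from the plugin API:)

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- Reject jobs submitted with --mem=MaxMemPerNode, which arrives as an
   -- unlimited (zero) per-node memory request.
   if job_desc.pn_min_memory == 0 then
      slurm.log_info("slurm_job_submit: job from uid %d invalid memory request MaxMemPerNode",
                     job_desc.user_id)
      return 2044 -- ESLURM_INVALID_TASK_MEMORY
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end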
On 05/21/2014 10:40 AM, [email protected] wrote:
Hi Bill,
Here are a couple of ideas:
At job end, compare each job's memory specification against actual use
and work with the offending users.
You might configure DefaultMemPerCPU and MaxMemPerCPU to match the CPU
and memory allocations (e.g., if a node has 8 CPUs and 8 GB of memory,
set both parameters to 1 GB). Then if someone requests all of the
memory on a node, they will get all of the CPUs as well. A sketch of
this follows.
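For example, a minimal slurm.conf sketch (values purely illustrative;
the parameters are named DefMemPerCPU and MaxMemPerCPU and are given in MB):

# Nodes with 8 CPUs and 8 GB of RAM -> 1 GB per allocated CPU.
# A job asking for all of a node's memory is then charged all of its CPUs too.
DefMemPerCPU=1024
MaxMemPerCPU=1024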
In a job_submit plugin, set a nice value for jobs requesting a lot of
memory (i.e. lower their scheduling priority).
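A rough sketch of what that could look like inside slurm_job_submit()
(assuming job_desc.nice is writable in your Slurm version, that
pn_min_memory is expressed in MB, and with an arbitrary 32 GB threshold):

if job_desc.pn_min_memory ~= nil and job_desc.pn_min_memory > 32768 then
   job_desc.nice = 1000  -- a higher nice value lowers scheduling priority
end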
You could configure MaxMemPerNode, but that would probably impact
users who really need a lot of memory.
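If you did go that route, it would just be a single line in slurm.conf,
e.g. a hypothetical 8 GB cap, which, as noted, would also reject jobs
that legitimately need more:

MaxMemPerNode=8192   # in MB; illustrative value only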
Moe Jette
SchedMD
Quoting Bill Wichser <[email protected]>:
In doing accounting on past jobs, we are trying to figure out how to
account for memory usage as well as core usage. What began as an
anomaly has now turned into something my users have found to work
quite effectively for their jobs, and that is to add the line:
#SBATCH --mem=MaxMemPerNode
We do share our nodes, so this is an unacceptable specification.
Before going down the path of adding yet another check in the
job_submit.lua script, I am wondering if there isn't a better way.
Currently I do not have this value configured, so when I do a
"scontrol show config" it comes up as UNLIMITED, which is not at all
what I want. Ideally I'd set it to some small value, but I suspect that
would have repercussions later on, when users who legitimately need a
lot of memory request an amount that exceeds the low MaxMemPerNode
value I'd set.
Yes, I could just inform my users that this is unacceptable
behavior. But we all know that without policing it will arise again, so
I'd much rather deal with this once and for all, either by adding the
"right" value to slurm.conf or by rejecting jobs that use this variable
altogether.
Thanks,
Bill