[slurm-dev] Re: Slurm, RHEL6, cgroups and not constraining memory

Bjørn-Helge Mevik Mon, 21 Jan 2013 01:11:18 -0800

Christopher Samuel <[email protected]> writes:

> On 18/01/13 19:53, Bjørn-Helge Mevik wrote:
>
>> I don't know if this is the reason in your case, but note that cgroup
>> in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
>
> Hmm, as a sysadmin that doesn't seem very useful,


Hmm, as a sysadmin I must say that I disagree. :)

> you want it to constrain how much memory the application can allocate
> so that it can learn it has hit a limit when malloc() fails (and
> hopefully gracefully report/recover).

The best way to constrain memory depends very much on how the cluster
is set up and what type of jobs are run on it, IMO.

A problem with limiting virtual memory allocations is that, with recent
versions of glibc, a threaded application allocates much, much more VMEM
than it is ever going to use.  For instance, on our master node,
slurmctld uses about 50 MiB RAM (resident), but the VMEM usage reported
by ps or top is 16 GiB(!).  This is the reason we switched to using
cgroups.
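
To make the resident vs. virtual distinction concrete, here is a minimal
sketch that pulls both numbers out of the text of a Linux
/proc/<pid>/status file.  The function names (parse_vm_fields,
read_vm_fields) are made up for illustration; only the VmSize and VmRSS
field names come from the kernel's status file format.

```python
import re


def parse_vm_fields(status_text):
    """Return (VmSize, VmRSS) in KiB from the text of /proc/<pid>/status.

    Either value is None if the field is missing (e.g. kernel threads).
    """
    fields = {}
    for line in status_text.splitlines():
        # Lines look like "VmRSS:\t   51200 kB"
        m = re.match(r"^(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields.get("VmSize"), fields.get("VmRSS")


def read_vm_fields(pid="self"):
    """Read and parse /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_fields(f.read())
```

Calling read_vm_fields() with the PID of a threaded daemon should show
the kind of gap described above: a VmSize far larger than VmRSS.  A
limit on VmSize would hit such a process long before it actually
touches that much RAM.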

As for letting cgroups notify the job instead of killing it, that is
probably hard to implement, because the cgroup limits are enforced by
the kernel itself, not by slurm, and I, at least, don't know of any
callback hooks or other features in cgroups that could be used for such
a thing.

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
