Perhaps an easy approach is to set RLIMIT_AS in the job itself or in its
wrapper, and then let the application handle the resulting ENOMEM error.
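
A minimal sketch of what I mean, in Python and assuming Linux. The helper
name run_with_as_limit, the 2 GiB limit and the 4 GiB test allocation are
made up for illustration; they have nothing to do with Slurm itself. The
wrapper lowers the soft RLIMIT_AS just before the job starts, and the job
sees the failed allocation (ENOMEM from the kernel, surfaced by Python as
MemoryError) and can report it instead of being killed:

```python
import resource
import subprocess
import sys

GIB = 1024 ** 3


def run_with_as_limit(cmd, limit_bytes):
    """Run cmd with its soft RLIMIT_AS set to limit_bytes (hypothetical helper)."""

    def set_limit():
        # Runs in the child between fork() and exec(). Only the soft limit
        # is lowered; the hard limit is left as it was.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

    return subprocess.run(cmd, preexec_fn=set_limit,
                          capture_output=True, text=True)


# Stand-in for the real job: try to allocate more than the limit allows
# and handle the failure gracefully.
JOB = """
try:
    buf = bytearray(4 * 1024 ** 3)
    print("allocated")
except MemoryError:
    print("allocation refused")
"""

result = run_with_as_limit([sys.executable, "-c", JOB], 2 * GIB)
print(result.stdout.strip())
```

On a system where RLIMIT_AS is enforced, the job should end up in the
MemoryError branch. A C program would see the same thing as malloc()
returning NULL with errno set to ENOMEM. Note that the limit covers the
whole address space, so it has to leave room for shared libraries and
thread stacks as well as the heap.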

On 01/21/2013 10:16 AM, Bjørn-Helge Mevik wrote:
>
> Christopher Samuel <[email protected]> writes:
>
>> On 18/01/13 19:53, Bjørn-Helge Mevik wrote:
>>
>>> I don't know if this is the reason in your case, but note that cgroup
>>> in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
>>
>> Hmm, as a sysadmin that doesn't seem very useful,
>
> Hmm, as a sysadmin I must say that I disagree. :)
>
>> you want it to constrain how much memory the application can allocate
>> so that it can learn it has hit a limit when malloc() fails (and
>> hopefully gracefully report/recover).
>
> What the best way to constrain memory is, is very much dependent on how
> the cluster is set up and what type of jobs are run on it, IMO.
>
> A problem with limiting virtual memory allocations is that, with
> recent versions of glibc, the amount of VMEM that a threaded application
> allocates is much, much bigger than what it is ever going to use.  For
> instance, on our master node, slurmctld uses about 50 MiB RAM
> (resident), but the VMEM usage reported by ps or top is 16 GiB(!).  This
> is the reason we switched to using cgroups.
>
> As for letting cgroups notify the job instead of killing it, that is
> probably hard to implement, because the cgroups limiting is done by the
> kernel itself, not slurm, and I at least don't know of any
> callback-hooks or other features in cgroups that could be used for such a
> thing.
>
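
To see the gap between resident and virtual memory described above for a
given process, something like the following can be used. This is only a
sketch, assuming Linux and its /proc/<pid>/status file; the function name
memory_status is my own:

```python
def memory_status(pid="self"):
    """Return VmSize and VmRSS in KiB from /proc/<pid>/status (Linux only)."""
    fields = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                # The value looks like "   123456 kB"; keep the number only.
                fields[key] = int(value.split()[0])
    return fields


usage = memory_status()
print(f"virtual: {usage['VmSize']} KiB, resident: {usage['VmRSS']} KiB")
```

VmSize is what RLIMIT_AS counts against; VmRSS is the resident figure.
Passing the PID of a threaded daemon instead of "self" shows how far apart
the two numbers can be.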

-- 

/David
