Christopher Samuel <[email protected]> writes:

> On 18/01/13 19:53, Bjørn-Helge Mevik wrote:
>
>> I don't know if this is the reason in your case, but note that cgroup
>> in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
>
> Hmm, as a sysadmin that doesn't seem very useful,

Hmm, as a sysadmin I must say that I disagree. :)

> you want it to constrain how much memory the application can allocate
> so that it can learn it has hit a limit when malloc() fails (and
> hopefully gracefully report/recover).

What the best way to constrain memory is, is very much dependent on how
the cluster is set up and what type of jobs are run on it, IMO.

A problem with limiting the virtual memory allocations, is that with
recent versions of glibc, the amount of VMEM that a threaded application
allocates is much, much bigger than what it is ever going to use.  For
instance, on our master node, slurmctld uses about 50 MiB RAM
(resident), but the VMEM usage reported by ps or top is 16 GiB(!).  This
is the reason we switched to using cgroups.

As for letting cgroups notify the job instead of killing it, that is
probably hard to implement, because the cgroups limiting is done by the
kernel itself, not slurm, and I at least don't know of any
callback-hooks or other features in cgroups that could be used for such
a thing.

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
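
To illustrate the resident/virtual distinction discussed above, here is a minimal Python sketch, assuming a Linux node with /proc mounted. It reads the VmSize (virtual) and VmRSS (resident) fields from /proc/self/status, and reserves anonymous memory without touching it. The function names parse_status, self_usage and reserve_and_measure are made up for this example and are not part of slurm or any library; the 1 GiB default is arbitrary.

```python
import mmap


def parse_status(text):
    """Extract VmSize and VmRSS (in KiB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Lines look like "VmRSS:      51200 kB"; keep the number only.
            fields[key] = int(rest.split()[0])
    return fields


def self_usage():
    """Return the current process's virtual and resident sizes (Linux only)."""
    with open("/proc/self/status") as fh:
        return parse_status(fh.read())


def reserve_and_measure(size=1 << 30):
    """Reserve `size` bytes without touching them; return growth in KiB.

    Assumption: the kernel's overcommit settings allow the reservation.
    """
    before = self_usage()
    region = mmap.mmap(-1, size)  # anonymous mapping, pages never written
    after = self_usage()
    region.close()
    return {key: after[key] - before[key] for key in ("VmSize", "VmRSS")}
```

On a typical Linux system, reserve_and_measure() should report VmSize growing by roughly the reserved amount while VmRSS barely moves, since untouched pages are not backed by RAM. A limit on virtual memory would count that reservation in full; a limit on resident memory would not.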
