[slurm-dev] Re: cgroups and memory accounting

Felip Moll Fri, 22 Jan 2016 02:34:20 -0800

In the end I solved the issue, in large part thanks to Carlos Fenoy's tips.

The issue was due to the NFS filesystem. This filesystem, as CF said, caches
data while other filesystems do not. Cgroups take cached data into account,
and our users' jobs use the NFS filesystem intensively.
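
For anyone who wants to check whether cached data is what fills a job's
cgroup, here is a minimal sketch that splits the usage into page cache and
RSS. It assumes cgroup v1, where memory.stat has "cache" and "rss" lines.
The function names are mine, and the path in the comment is only an
assumption about where Slurm puts the job cgroup; adjust it for your site.

```python
import sys


def parse_memory_stat(text):
    """Parse the contents of a cgroup v1 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Each line is expected to look like "<name> <integer>".
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cache_share(stats):
    """Return the fraction of cache + rss that is page cache."""
    cache = stats.get("cache", 0)
    rss = stats.get("rss", 0)
    total = cache + rss
    return cache / total if total else 0.0


if __name__ == "__main__":
    # Assumed location (not confirmed for every setup), e.g.:
    #   /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.stat
    with open(sys.argv[1]) as fh:
        stats = parse_memory_stat(fh.read())
    print("cache bytes:", stats.get("cache", 0))
    print("rss bytes:  ", stats.get("rss", 0))
    print("cache share: {:.1%}".format(cache_share(stats)))
```

A high cache share on a job that is close to its limit would be consistent
with what we saw on NFS.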


I switched from:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
TaskPluginParam=

To:
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
TaskPluginParam=Sched


In the 11 days since, I haven't received a single OOM kill and
everything is working perfectly.

Best regards and thanks to all of you.
Felip M




*--Felip Moll Marquès*
Computer Science Engineer
E-Mail - [email protected]
WebPage - http://lipix.ciutadella.es

2015-12-18 15:09 GMT+01:00 Bjørn-Helge Mevik <[email protected]>:

>
> Carlos Fenoy <[email protected]> writes:
>
> > Barbara, I don't think that is the issue here. The killer is the OOM not
> > Slurm, so Slurm is not accounting incorrectly the amount of memory, but it
> > seems that the cached memory is also accounted in the cgroup and it is what
> > is causing the OOM to kill gzip.
>
> I've seen cases where the job has copied a set of large files, which
> makes the cgroup memory usage go right up to the limit.  I guess that is
> cached data.  Then the job starts computing, without the job getting
> killed.  My interpretation is that the kernel will flush the cache when a
> process needs more memory instead of killing the process.  If I'm
> correct, oom will _not_ kill a job due to cached data.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
>
