I finally solved the issue, in large part thanks to Carlos Fenoy's tips. The issue was due to the NFS filesystem. As Carlos said, this filesystem caches data while other filesystems do not. Cgroups take cached data into account, and our users' jobs use the NFS filesystem intensively.
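For anyone who wants to check whether cached data is what fills a job's cgroup, here is a minimal sketch, not something I ran as part of this fix. It assumes the cgroup v1 memory controller, whose memory.stat file reports "cache" and "rss" counters in bytes. The function names are made up for illustration, and the path in the usage note is an assumption about where Slurm places the job cgroup on your nodes.

```python
from pathlib import Path


def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v1 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cache_share(stats):
    """Return the fraction of cache + rss that is page cache (0.0 if both are zero)."""
    cache = stats.get("cache", 0)
    rss = stats.get("rss", 0)
    total = cache + rss
    return cache / total if total else 0.0


def read_cache_share(stat_path):
    """Read a memory.stat file from disk and return its page cache share."""
    return cache_share(parse_memory_stat(Path(stat_path).read_text()))
```

On a compute node you would call read_cache_share() on the memory.stat file of the job's cgroup, for example something like /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.stat (assumed layout, check your own mount point). A share close to 1.0 while the job sits near its limit would point to cached data rather than the processes' own memory.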
I switched from:

    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    TaskPluginParam=

To:

    ProctrackType=proctrack/linuxproc
    TaskPlugin=task/affinity
    TaskPluginParam=Sched

In the 11 days since, I haven't received a single OOM kill and everything is working perfectly.

Best regards and thanks to all of you.

Felip M

*--Felip Moll Marquès*
Computer Science Engineer
E-Mail - [email protected]
WebPage - http://lipix.ciutadella.es

2015-12-18 15:09 GMT+01:00 Bjørn-Helge Mevik <[email protected]>:

> Carlos Fenoy <[email protected]> writes:
>
> > Barbara, I don't think that is the issue here. The killer is the OOM not
> > Slurm, so Slurm is not accounting incorrectly the amount of memory, but it
> > seems that the cached memory is also accounted in the cgroup and it is what
> > is causing the OOM to kill gzip.
>
> I've seen cases where the job has copied a set of large files, which
> makes the cgroup memory usage go right up to the limit. I guess that is
> cached data. Then the job starts computing, without the job getting
> killed. My interpretation is that the kernel will flush the cache when a
> process needs more memory instead of killing the process. If I'm
> correct, oom will _not_ kill a job due to cached data.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
