+1: I tried getting oom_notifierd working on CentOS 7 but was
unsuccessful.  I'd be greatly interested if anyone has gotten this to
work.  I've ported some of the other BYU cgroup fencing tools over to
CentOS 7 and added minor functionality improvements if anyone is
interested.

Thank you to Ryan Cox for these excellent tools.


-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Fri, 2017-03-17 at 08:59 -0700, Ryan Cox wrote:
> usage_in_bytes is not actually usage in bytes, by the way.  It's
> often close but I have seen wildly different values.  See
> https://lkml.org/lkml/2011/3/28/93 and
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
> 5.5.  memory.stat is what you want for accurate data.
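>
> For reference, a minimal sketch (not from the Slurm source; the cgroup
> path is only an example and depends on your mount point and hierarchy)
> of pulling total_rss out of memory.stat:
>
>     #include <stdio.h>
>     #include <string.h>
>
>     /* Read total_rss (bytes) from a cgroup's memory.stat.  memory.stat
>      * is a list of "key value" lines, so we just scan for the key.   */
>     static long long read_total_rss(const char *memstat_path)
>     {
>         FILE *fp = fopen(memstat_path, "r");
>         char key[64];
>         long long val;
>
>         if (!fp)
>             return -1;
>         while (fscanf(fp, "%63s %lld", key, &val) == 2) {
>             if (strcmp(key, "total_rss") == 0) {
>                 fclose(fp);
>                 return val;
>             }
>         }
>         fclose(fp);
>         return -1;
>     }
>
>     int main(void)
>     {
>         /* example path only; slurm's job cgroups typically sit under
>          * <mountpoint>/memory/slurm/uid_*/job_* on cgroup v1 setups */
>         long long rss = read_total_rss(
>             "/sys/fs/cgroup/memory/slurm/memory.stat");
>         printf("total_rss: %lld bytes\n", rss);
>         return 0;
>     }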
> 
> I wrote the code you referenced below.  Now that I know more about
> failcnt, it does have some corner cases that aren't ideal.  If I were
> to start over I would use cgroup.event_control to get OOM events,
> such as in
> https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c
> or https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> section 9.  At the time I didn't really feel like learning how to add
> and clean up a thread or something that would listen for those events.
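>
> For anyone who wants to experiment, the registration itself is small.
> A rough sketch of the memory.txt section 9 mechanism (simplified, no
> real error reporting; the cgroup path is whatever your job's memory
> cgroup directory is):
>
>     #include <stdio.h>
>     #include <stdint.h>
>     #include <string.h>
>     #include <fcntl.h>
>     #include <unistd.h>
>     #include <sys/eventfd.h>
>
>     /* Block until the kernel signals an OOM event in the given
>      * memory cgroup directory (cgroup v1).                       */
>     static int wait_for_oom(const char *cg)
>     {
>         char buf[512];
>         uint64_t count;
>         int efd = eventfd(0, 0);
>         int ofd, cfd;
>
>         snprintf(buf, sizeof(buf), "%s/memory.oom_control", cg);
>         ofd = open(buf, O_RDONLY);
>         snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
>         cfd = open(buf, O_WRONLY);
>         if (efd < 0 || ofd < 0 || cfd < 0)
>             return -1;
>
>         /* writing "<eventfd fd> <oom_control fd>" registers the listener */
>         snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
>         if (write(cfd, buf, strlen(buf)) < 0)
>             return -1;
>
>         /* read() blocks until an OOM event fires in the cgroup */
>         if (read(efd, &count, sizeof(count)) != sizeof(count))
>             return -1;
>         return 0;
>     }
>
>     int main(int argc, char **argv)
>     {
>         if (argc == 2 && wait_for_oom(argv[1]) == 0)
>             printf("OOM event in %s\n", argv[1]);
>         return 0;
>     }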
> 
> If someone wants to do the work that would be great :). I have no
> plans to do so myself for the time being.
> 
> Ryan
> 
> On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:
> > Hi,
> >  
> > I believe you can get that message ('Exceeded job memory limit at
> > some point') even if the job finishes fine.  When the cgroup is
> > created (by SLURM) it updates memory.limit_in_bytes with the job
> > memory request specified for the job.  During the life of the job the
> > kernel updates a number of files within the cgroup, one of which is
> > memory.usage_in_bytes - which is the current memory usage of the cgroup.
> > Periodically, SLURM will check if the cgroup has exceeded its limit
> > (i.e. memory.limit_in_bytes) - the frequency of the check is
> > probably set by JobAcctGatherFrequency.  It does this by checking
> > whether memory.failcnt is non-zero.  The memory.failcnt counter is
> > incremented by the kernel each time memory.usage_in_bytes reaches
> > the value set in memory.limit_in_bytes.
> >  
> > This is the code snippet that produces the error (found in
> > task_cgroup_memory.c) …
> > extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
> > {
> > ...
> >             else if (failcnt_non_zero(&step_memory_cg,
> >                           "memory.failcnt"))
> >                 /* reports the number of times that the
> >                  * memory limit has reached the value set
> >                  * in memory.limit_in_bytes.
> >                  */
> >                 error("Exceeded step memory limit at some point.");
> > ...
> >             else if (failcnt_non_zero(&job_memory_cg,
> >                           "memory.failcnt"))
> >                 error("Exceeded job memory limit at some point.");
> > ...
> > }
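> >
> > (For context, the failcnt check itself amounts to little more than
> > reading a single number; a hypothetical stand-alone version, not
> > Slurm's actual xcgroup-based helper, would be roughly:)
> >
> >     #include <stdio.h>
> >
> >     /* Read memory.failcnt and treat any non-zero value as "the
> >      * limit was hit at least once during the job's lifetime".  */
> >     static int failcnt_non_zero_path(const char *failcnt_path)
> >     {
> >         FILE *fp = fopen(failcnt_path, "r");
> >         unsigned long long failcnt = 0;
> >
> >         if (!fp)
> >             return 0;
> >         if (fscanf(fp, "%llu", &failcnt) != 1)
> >             failcnt = 0;
> >         fclose(fp);
> >         return failcnt > 0;
> >     }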
> >  
> > Anyway, back to the point.  You can see this message even though the
> > job does not fail, because the kernel counter (memory.failcnt) that
> > SLURM checks doesn't mean the memory limit has been exceeded; it
> > means the limit has been reached - a subtle but important
> > difference.  The OOM killer only terminates a job if it exceeds the
> > limit and the kernel cannot reclaim enough memory; merely reaching
> > the limit does not get the job killed.  Note: other cgroup files
> > like memory.memsw.* are also in play if you are using swap space.
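> >
> > If you want to see whether the cgroup is actually in an OOM state
> > (as opposed to merely having hit its limit and reclaimed cache), the
> > under_oom field of memory.oom_control (memory.txt section 10) can be
> > read the same way; a rough sketch:
> >
> >     #include <stdio.h>
> >     #include <string.h>
> >
> >     /* memory.oom_control contains lines such as
> >      *   oom_kill_disable 0
> >      *   under_oom 0
> >      * under_oom is 1 while the cgroup is in an OOM situation. */
> >     static int cgroup_under_oom(const char *oom_control_path)
> >     {
> >         FILE *fp = fopen(oom_control_path, "r");
> >         char key[64];
> >         long long val;
> >         int under = 0;
> >
> >         if (!fp)
> >             return -1;
> >         while (fscanf(fp, "%63s %lld", key, &val) == 2)
> >             if (strcmp(key, "under_oom") == 0)
> >                 under = (int) val;
> >         fclose(fp);
> >         return under;
> >     }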
> >  
> > As to how to manage this: you can either not use cgroup and use an
> > alternative plugin, you could try the JobAcctGatherParams parameter
> > NoOverMemoryKill (the documentation says to use this with caution,
> > see https://slurm.schedmd.com/slurm.conf.html), or you can try to
> > account for the cache by using jobacct_gather/cgroup.
> > Unfortunately, because of a bug this plugin doesn't report cache
> > usage either.  I've contributed a bug report and fix to address this
> > (https://bugs.schedmd.com/show_bug.cgi?id=3531).
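> >
> > For example, the NoOverMemoryKill route would look something like
> > this in slurm.conf (untested here, and the man page's caution still
> > applies):
> >
> >     JobAcctGatherType=jobacct_gather/cgroup
> >     JobAcctGatherFrequency=30
> >     JobAcctGatherParams=NoOverMemoryKill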
> >  
> > ---
> > Samuel Gallop
> > Computing infrastructure for Science
> > CiS Support & Development
> >  
> > From: Wensheng Deng [mailto:w...@nyu.edu] 
> > Sent: 17 March 2017 13:42
> > To: slurm-dev <slurm-dev@schedmd.com>
> > Subject: [slurm-dev] Re: Slurm & CGROUP
> >  
> > The file is copied fine. It is just the error message that is annoying.
> >  
> >  
> >  
> > On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist
> > <janne.blomqvist@aalto.fi> wrote:
> > On 2017-03-15 17:52, Wensheng Deng wrote:
> > > No, it does not help:
> > >
> > > $ scontrol show config |grep -i jobacct
> > >
> > > JobAcctGatherFrequency  = 30
> > >
> > > JobAcctGatherType       = jobacct_gather/cgroup
> > >
> > > JobAcctGatherParams     = NoShared
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu>
> > > wrote:
> > >
> > >     I think I tried that. let me try it again. Thank you!
> > >
> > >     On Wed, Mar 15, 2017 at 11:43 AM, Chris Read
> > >     <cr...@drw.com> wrote:
> > >
> > >
> > >         We explicitly exclude shared usage from our measurement:
> > >
> > >
> > >         JobAcctGatherType=jobacct_gather/cgroup
> > >         JobAcctGatherParams=NoShared
> > >
> > >         Chris
> > >
> > >
> > >         ________________________________
> > >         From: Wensheng Deng <w...@nyu.edu>
> > >         Sent: 15 March 2017 10:28
> > >         To: slurm-dev
> > >         Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
> > >
> > >         It should be (sorry):
> > >         we 'cp'ed a 5GB file from scratch to node local disk
> > >
> > >
> > >         On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng
> > >         <wd35@nyu.edu> wrote:
> > >         Hello experts:
> > >
> > >         We turn on TaskPlugin=task/cgroup. In one Slurm job, we
> > >         'cp'ed a 5GB job from scratch to node local disk, declared
> > >         5 GB memory for the job, and saw error message as below
> > >         although the file was copied okay:
> > >
> > >         slurmstepd: error: Exceeded job memory limit at some point.
> > >
> > >         srun: error: [nodenameXXX]: task 0: Out Of Memory
> > >
> > >         srun: Terminating job step 41.0
> > >
> > >         slurmstepd: error: Exceeded job memory limit at some point.
> > >
> > >
> > >         From the cgroup document
> > >         https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> > >         Features:
> > >         - accounting anonymous pages, file caches, swap caches
> > >           usage and limiting them.
> > >
> > >         It seems that cgroup charges memory "RSS + file caches" to
> > >         user processes like 'cp'; in our case it is charged to the
> > >         user's jobs. Swap is off in this case. The file cache can
> > >         be small or very big, and it should not be charged to
> > >         users' batch jobs in my opinion.  How do other sites
> > >         circumvent this issue? The Slurm version is 16.05.4.
> > >
> > >         Thank you and Best Regards.
> > >
> > >
> > >
> > >
> > 
> > Could you set AllowedRamSpace/AllowedSwapSpace in
> > /etc/slurm/cgroup.conf to some big number? That way the job memory
> > limit will be the cgroup soft limit, and the cgroup hard limit (the
> > point at which the kernel will OOM-kill the job) would be
> > "job_memory_limit * AllowedRamSpace", that is, some large value.
> > 
> > --
> > Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> > Aalto University School of Science, PHYS & NBE
> > +358503841576 || janne.blomqv...@aalto.fi
> > 
> >  
>  
