Thank you. I had some doubts about the accuracy of memory.stat. Sam, which
slurm.conf parameters do you recommend for trying your fix in bug #3531?
There are three places where a cgroup plugin could be used (a quick way to
check the current settings follows the list):

JobAcctGatherType       = jobacct_gather/cgroup

ProctrackType           = proctrack/cgroup

TaskPlugin              = task/cgroup
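
For reference, here is a quick way to see what is currently set on the
cluster (it just greps the scontrol output, so adjust the pattern as needed):

$ scontrol show config | egrep -i 'JobAcctGatherType|ProctrackType|TaskPlugin'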



On Fri, Mar 17, 2017 at 11:30 AM, Sam Gallop (NBI) <sam.gal...@nbi.ac.uk>
wrote:

> Yes, memory.usage_in_bytes is one sum, but in memory.stat the two
> figures are split …
>
>
>
> # cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat |
> grep -Ew "^rss|^cache"
>
> cache 16758034432
>
> rss 663552
>
>
>
> The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to
> address this by recording both.
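>
> If you want to sanity-check the figures yourself, something like the
> following should work (the uid/job/step path is just an example - point it
> at the job you're interested in):
>
> CG=/sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0
> # sum the split figures from memory.stat ...
> awk '/^rss /{r=$2} /^cache /{c=$2} END {print "rss+cache:", r+c}' "$CG/memory.stat"
> # ... and compare against the single total in memory.usage_in_bytes
> cat "$CG/memory.usage_in_bytes"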
>
>
>
> You can argue either way about whether the cache should be charged to a
> user's jobs.  Depending on your stance, you may wish to try …
>
> ProctrackType=proctrack/linuxproc
>
> TaskPlugin=task/affinity
>
> TaskPluginParam=Sched
>
>
>
> I've not tried this myself, and the documentation states that proctrack/linuxproc
> … can fail to identify all processes associated with a job, since processes
> can become a child of the init process (when the parent process terminates)
> or change their process group.
>
>
>
> My personal take is if the user used it, it should be accounted for.
>
>
>
> ---
>
> Sam Gallop
>
>
>
>
>
>
> From: Wensheng Deng [mailto:w...@nyu.edu]
> Sent: 17 March 2017 15:06
>
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Slurm & CGROUP
>
>
>
> For the case of the simple 'cp' test job that copies a 5 GB file, the
> underlying issue is how to distinguish the kinds of memory used: how much
> is RSS and how much is file cache. cgroup reports them as one sum:
> memory.memsw.* (we have swap turned off). The file cache can be small or
> very big depending on what is required and what is available at that point
> in time. In my opinion, the file cache should not be charged to users' jobs
> in the batch-job context. Thank you!
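>
> One way to see that the cache portion is reclaimable (and so arguably not
> "used" by the job in the RSS sense) is to drop the clean page cache on a
> test node and watch the cgroup's usage fall. A rough sketch - needs root,
> the cgroup path is only an example, and please only try it on a test node:
>
> cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/memory.usage_in_bytes
> sync && echo 1 > /proc/sys/vm/drop_caches   # ask the kernel to reclaim clean page cache
> cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/memory.usage_in_bytes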
>
>
>
>
>
>
>
> On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) <sam.gal...@nbi.ac.uk>
> wrote:
>
> Hi,
>
>
>
> I believe you can get that message ('Exceeded job memory limit at some
> point') even if the job finishes fine.  When the cgroup is created (by
> SLURM) it sets memory.limit_in_bytes to the memory request specified in the
> job.  During the life of the job the kernel updates a number of files
> within the cgroup, one of which is memory.usage_in_bytes - the current
> memory usage of the cgroup.  Periodically, SLURM checks whether the cgroup
> has exceeded its limit (i.e. memory.limit_in_bytes) - the frequency of the
> check is probably set by JobAcctGatherFrequency.  It does this by checking
> whether memory.failcnt is non-zero.  memory.failcnt is incremented by the
> kernel each time memory.usage_in_bytes reaches the value set in
> memory.limit_in_bytes.
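>
> You can watch these files directly while a job is running, e.g. (the cgroup
> path below is just an example):
>
> CG=/sys/fs/cgroup/memory/slurm/uid_11253/job_183
> # current usage, the configured limit, and how many times usage has hit the limit
> cat "$CG/memory.usage_in_bytes" "$CG/memory.limit_in_bytes" "$CG/memory.failcnt"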
>
>
>
> This is the code snippet that produces the error (found in
> task_cgroup_memory.c) …
>
> extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
> {
> ...
>             else if (failcnt_non_zero(&step_memory_cg,
>                           "memory.failcnt"))
>                 /* reports the number of times that the
>                  * memory limit has reached the value set
>                  * in memory.limit_in_bytes.
>                  */
>                 error("Exceeded step memory limit at some point.");
> ...
>             else if (failcnt_non_zero(&job_memory_cg,
>                           "memory.failcnt"))
>                 error("Exceeded job memory limit at some point.");
> ...
> }
>
>
>
> Anyway, back to the point.  You can see this message without the job
> failing because the operating-system counter that SLURM checks
> (memory.failcnt) doesn't actually mean the memory limit has been exceeded;
> it means the memory limit has been reached - a subtle but important
> difference.  Important because the OOM killer doesn't terminate jobs upon
> reaching the memory limit, only when they exceed it, so the job isn't
> terminated.  Note: other cgroup files like memory.memsw.xxx also come into
> play if you are using swap space.
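>
> If you want to confirm whether the kernel actually OOM-killed anything, as
> opposed to the counter merely being bumped, the kernel log is the place to
> look.  A rough check (cgroup path is just an example):
>
> dmesg | grep -i 'killed process'   # genuine OOM kills are logged here
> cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/memory.failcnt   # non-zero only means the limit was reached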
>
>
>
> As to how to manage this: you can avoid cgroup and use an alternative
> plugin, you could try the JobAcctGatherParams parameter NoOverMemoryKill
> (the documentation says to use this with caution, see
> https://slurm.schedmd.com/slurm.conf.html), or you can try to account
> for the cache by using jobacct_gather/cgroup.  Unfortunately, because of a
> bug this plugin doesn't report cache usage correctly either.  I've
> contributed a bug report/fix to address this
> (https://bugs.schedmd.com/show_bug.cgi?id=3531).
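>
> For example, the NoOverMemoryKill route would look roughly like this in
> slurm.conf (untested by me - please check the slurm.conf man page for your
> version before relying on it):
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoOverMemoryKill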
>
>
>
> ---
>
> Samuel Gallop
>
> Computing infrastructure for Science
>
> CiS Support & Development
>
>
>
> From: Wensheng Deng [mailto:w...@nyu.edu]
> Sent: 17 March 2017 13:42
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Slurm & CGROUP
>
>
>
> The file is copied fine. It is just the error message that is annoying.
>
>
>
>
>
>
>
> On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist <janne.blomqv...@aalto.fi>
> wrote:
>
> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep -i jobacct
> >
> > JobAcctGatherFrequency  = 30
> >
> > JobAcctGatherType       = jobacct_gather/cgroup
> >
> > JobAcctGatherParams     = NoShared
> >
> >
> >
> >
> >
> > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:
> >
> >     I think I tried that. let me try it again. Thank you!
> >
> >     On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com> wrote:
> >
> >
> >         We explicitly exclude shared usage from our measurement:
> >
> >
> >         JobAcctGatherType=jobacct_gather/cgroup
> >         JobAcctGatherParams=NoShared
> >
> >         Chris
> >
> >
> >         ________________________________
> >         From: Wensheng Deng <w...@nyu.edu>
> >         Sent: 15 March 2017 10:28
> >         To: slurm-dev
> >         Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
> >
> >         It should be (sorry):
> >         we 'cp'ed a 5GB file from scratch to node local disk
> >
> >
> >         On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> >         Hello experts:
> >
> >         We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> >         5GB job from scratch to node local disk, declared 5 GB memory
> >         for the job, and saw error message as below although the file
> >         was copied okay:
> >
> >         slurmstepd: error: Exceeded job memory limit at some point.
> >
> >         srun: error: [nodenameXXX]: task 0: Out Of Memory
> >
> >         srun: Terminating job step 41.0
> >
> >         slurmstepd: error: Exceeded job memory limit at some point.
> >
> >
> >         From the cgroup document
> >         https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> >         Features:
> >         - accounting anonymous pages, file caches, swap caches usage and
> >         limiting them.
> >
> >         It seems that cgroup charges "RSS + file cache" memory to user
> >         processes like 'cp' - in our case, it is charged to the user's
> >         job. Swap is off in this case. The file cache can be small or
> >         very big, and in my opinion it should not be charged to users'
> >         batch jobs. How do other sites circumvent this issue? The Slurm
> >         version is 16.05.4.
> >
> >         Thank you and Best Regards.
> >
> >
> >
> >
>
> Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf
> to some big number? That way the job memory limit will be the cgroup soft
> limit, and the cgroup hard limit (at which the kernel will OOM-kill the
> job) would be "job_memory_limit * AllowedRamSpace", i.e. some large value.
>
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576 || janne.blomqv...@aalto.fi
>
>
>
>
>
