Thank you. I had some doubts about the accuracy of memory.stat. Sam, which slurm.conf parameters do you recommend for trying your fix from bug #3531? There are three places where the cgroup plugin could be used:

JobAcctGatherType = jobacct_gather/cgroup
ProctrackType = proctrack/cgroup
TaskPlugin = task/cgroup
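For reference, a minimal slurm.conf sketch with the cgroup plugin enabled in all three of those places might look like the lines below; the combination and values are illustrative assumptions only, not a recommendation from this thread (JobAcctGatherFrequency=30 is simply the value reported later in the thread):

    # slurm.conf (illustrative sketch only)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    JobAcctGatherType=jobacct_gather/cgroup
    JobAcctGatherFrequency=30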
On Fri, Mar 17, 2017 at 11:30 AM, Sam Gallop (NBI) <sam.gal...@nbi.ac.uk> wrote:

Yes, memory.usage_in_bytes is one sum, but in memory.stat the two figures are split:

    # cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat | grep -Ew "^rss|^cache"
    cache 16758034432
    rss 663552

The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to address this by recording both.

You can argue either way about whether the cache should be charged to a user's jobs. Depending on your stance, you may wish to try:

    ProctrackType=proctrack/linuxproc
    TaskPlugin=task/affinity
    TaskPluginParam=Sched

I've not tried this myself, and the documentation states that proctrack/linuxproc "... can fail to identify all processes associated with a job since processes can become a child of the init process (when the parent process terminates) or change their process group."

My personal take is that if the user used it, it should be accounted for.

---
Sam Gallop

On 17 March 2017 at 15:06, Wensheng Deng <w...@nyu.edu> wrote:

For the case of the simple 'cp' test job which copies a 5 GB file, the underlying issue is how we distinguish the memory used: which part is RSS and which part is file cache. cgroup reports them as one sum in memory.memsw.* (we have swap turned off). The file cache can be small or very big, depending on what is required and what is available at that point in time. In my opinion, the file cache should not be charged to users' jobs in the batch job context. Thank you!
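For anyone who wants to see the split directly, here is a minimal standalone sketch (not Slurm code) that reads a step's memory.stat and reports rss and cache separately; the default path below is just the example path from Sam's message and will differ per job and step:

    #include <stdio.h>
    #include <string.h>

    /* Standalone illustration: report rss and cache from a cgroup memory.stat. */
    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] :
            "/sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat";
        FILE *fp = fopen(path, "r");
        char key[64];
        unsigned long long val, rss = 0, cache = 0;

        if (!fp) {
            perror(path);
            return 1;
        }
        /* memory.stat is a list of "<counter> <value>" lines */
        while (fscanf(fp, "%63s %llu", key, &val) == 2) {
            if (strcmp(key, "rss") == 0)
                rss = val;
            else if (strcmp(key, "cache") == 0)
                cache = val;
        }
        fclose(fp);

        printf("rss   = %llu bytes\n", rss);
        printf("cache = %llu bytes\n", cache);
        printf("rss + cache = %llu bytes\n", rss + cache);
        return 0;
    }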
On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) <sam.gal...@nbi.ac.uk> wrote:

Hi,

I believe you can get that message ('Exceeded job memory limit at some point') even if the job finishes fine. When the cgroup is created (by SLURM), it updates memory.limit_in_bytes with the job memory request coded in the job. During the life of the job the kernel updates a number of files within the cgroup, one of which is memory.usage_in_bytes - the current memory usage of the cgroup. Periodically, SLURM checks whether the cgroup has exceeded its limit (i.e. memory.limit_in_bytes) - the frequency of the check is probably set by JobAcctGatherFrequency. It does this by checking whether memory.failcnt is non-zero. memory.failcnt is incremented by the kernel each time memory.usage_in_bytes reaches the value set in memory.limit_in_bytes.

This is the code snippet that produces the error (found in task_cgroup_memory.c):

    extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
    {
        ...
        else if (failcnt_non_zero(&step_memory_cg, "memory.failcnt"))
            /* reports the number of times that the
             * memory limit has reached the value set
             * in memory.limit_in_bytes.
             */
            error("Exceeded step memory limit at some point.");
        ...
        else if (failcnt_non_zero(&job_memory_cg, "memory.failcnt"))
            error("Exceeded job memory limit at some point.");
        ...
    }

Anyway, back to the point. You can see this message without the job failing because the operating system counter (memory.failcnt) that SLURM checks doesn't actually mean the memory limit has been exceeded; it means the memory limit has been reached - a subtle but important difference. It matters because the OOM killer doesn't terminate jobs for merely reaching the memory limit, only for exceeding it, so the job isn't terminated. Note: other cgroup files like memory.memsw.* also come into play if you are using swap space.

As to how to manage this: you can avoid cgroup and use an alternative plugin, you can try the JobAcctGatherParams parameter NoOverMemoryKill (the documentation says to use this with caution, see https://slurm.schedmd.com/slurm.conf.html), or you can try to account for the cache by using jobacct_gather/cgroup. Unfortunately, because of a bug this plugin doesn't report cache usage either. I've contributed a bug report/fix to address this (https://bugs.schedmd.com/show_bug.cgi?id=3531).

---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development
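As a rough standalone approximation of what the failcnt_non_zero() check above looks at (this is not the actual Slurm code, and the default path is only an example), the reached-versus-exceeded distinction can be seen by reading memory.failcnt directly:

    #include <stdio.h>

    /* Standalone sketch: a non-zero memory.failcnt means the limit was
     * *reached* at some point; it does not by itself mean the OOM killer
     * terminated anything. */
    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] :
            "/sys/fs/cgroup/memory/slurm/uid_11253/job_183/memory.failcnt";
        FILE *fp = fopen(path, "r");
        unsigned long long failcnt = 0;

        if (!fp) {
            perror(path);
            return 1;
        }
        if (fscanf(fp, "%llu", &failcnt) != 1)
            failcnt = 0;
        fclose(fp);

        if (failcnt > 0)
            printf("memory limit reached %llu time(s)\n", failcnt);
        else
            printf("memory limit never reached\n");
        return 0;
    }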
On 17 March 2017 at 13:42, Wensheng Deng <w...@nyu.edu> wrote:

The file is copied fine. It is just the error message that is annoying.

On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist <janne.blomqv...@aalto.fi> wrote:

On 2017-03-15 17:52, Wensheng Deng wrote:

No, it does not help:

    $ scontrol show config | grep -i jobacct
    JobAcctGatherFrequency  = 30
    JobAcctGatherType       = jobacct_gather/cgroup
    JobAcctGatherParams     = NoShared

On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:

I think I tried that. Let me try it again. Thank you!

On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com> wrote:

We explicitly exclude shared usage from our measurement:

    JobAcctGatherType=jobacct_gather/cgroup
    JobAcctGatherParams=NoShared

Chris

On 15 March 2017 at 10:28, Wensheng Deng <w...@nyu.edu> wrote:

It should be (sorry): we 'cp'ed a 5GB file from scratch to node local disk.

On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:

Hello experts:

We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job from scratch to node local disk, declared 5 GB of memory for the job, and saw the error messages below although the file was copied okay:

    slurmstepd: error: Exceeded job memory limit at some point.
    srun: error: [nodenameXXX]: task 0: Out Of Memory
    srun: Terminating job step 41.0
    slurmstepd: error: Exceeded job memory limit at some point.

From the cgroup documentation (https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt):

    Features:
    - accounting anonymous pages, file caches, swap caches usage and limiting them.

It seems that cgroup charges "RSS + file caches" to user processes such as 'cp', and in our case that gets charged to the user's job; swap is off in this case. The file cache can be small or very big, and in my opinion it should not be charged to users' batch jobs. How do other sites circumvent this issue? The Slurm version is 16.05.4.

Thank you and Best Regards.

Janne Blomqvist replied:

Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to some big number? That way the job memory limit will be the cgroup soft limit, and the cgroup hard limit (at which the kernel will OOM-kill the job) would be job_memory_limit * AllowedRamSpace, i.e. some large value.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
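To make that last suggestion concrete, a cgroup.conf along the following lines would raise the cgroup hard limit well above the job's requested memory; the parameter names follow the cgroup.conf documentation, and the 300% figure is an arbitrary example rather than a value tested in this thread:

    # /etc/slurm/cgroup.conf (illustrative sketch only)
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedRAMSpace=300    # hard RAM limit = 300% of the job's requested memory
    AllowedSwapSpace=300   # extra swap allowed, as a percentage of the request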