Never mind - it appears to happen when puppet runs. I have no hand in that, so I'll kick it to those admins and report back with what I find.
I ruled out slurm by simply creating a non-slurm cgroup, e.g. with `cgcreate -g memory:test`, and that cgroup also disappeared unexpectedly.

--nate

On Mon, Apr 30, 2018 at 4:37 PM, Nate Coraor <n...@bx.psu.edu> wrote:

> Hi Shawn,
>
> I'm wondering if you're still seeing this. I've recently enabled
> task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs
> are escaping their cgroups. For me this results in a lot of jobs ending
> in OUT_OF_MEMORY that shouldn't, because slurmd appears to think the
> oom-killer has triggered when it hasn't. I'm not using GRES or devices,
> only:
>
> cgroup.conf:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> slurm.conf:
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherFrequency=task=15
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup
>
> The only log messages that seem to correspond at all are:
>
> [JOB_ID.batch] debug: Handling REQUEST_STATE
> debug: _fill_registration_msg: found apparently running job JOB_ID
>
> Thanks,
> --nate
>
> On Mon, Apr 23, 2018 at 4:41 PM, Kevin Manalo <kman...@jhu.edu> wrote:
>
>> Shawn,
>>
>> Just to give you a compare and contrast, these are the related entries
>> in our slurm.conf:
>>
>> JobAcctGatherType=jobacct_gather/linux   # will migrate to cgroup eventually
>> JobAcctGatherFrequency=30
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=task/affinity,task/cgroup
>>
>> cgroup_allowed_devices_file.conf:
>>
>> /dev/null
>> /dev/urandom
>> /dev/zero
>> /dev/sda*
>> /dev/cpu/*/*
>> /dev/pts/*
>> /dev/nvidia*
>>
>> gres.conf (4 K80s on a node with 24-core Haswell):
>>
>> Name=gpu File=/dev/nvidia0 CPUs=0-5
>> Name=gpu File=/dev/nvidia1 CPUs=12-17
>> Name=gpu File=/dev/nvidia2 CPUs=6-11
>> Name=gpu File=/dev/nvidia3 CPUs=18-23
>>
>> I also looked for multi-tenant jobs on our MARCC cluster running > 1 day
>> and they are still inside their cgroups, but again this is on CentOS 6
>> clusters.
>>
>> Are you still seeing cgroup escapes now, specifically for jobs > 1 day?
>>
>> Thanks,
>> Kevin
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>> Shawn Bobbin <sabob...@umiacs.umd.edu>
>> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Date: Monday, April 23, 2018 at 2:45 PM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] Jobs escaping cgroup device controls after
>> some amount of time.
>>
>> Hi,
>>
>> I attached our cgroup.conf and gres.conf.
>>
>> As for the cgroup_allowed_devices.conf file, I have it stubbed but
>> empty. In 17.02 slurm started fine without this file (as far as I
>> remember), and it being empty doesn't appear to actually impact
>> anything: device availability remains the same. Based on the behavior
>> explained in [0] I don't expect this file to affect specific GPU
>> containment.
>>
>> TaskPlugin = task/cgroup
>> ProctrackType = proctrack/cgroup
>> JobAcctGatherType = jobacct_gather/cgroup
>>
>> [0] https://bugs.schedmd.com/show_bug.cgi?id=4122
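
For reference, a minimal sketch of the non-slurm test described above, assuming libcgroup's cgcreate is installed and cgroups v1 are mounted under /sys/fs/cgroup (the group name "test" is arbitrary):

# create a throwaway memory cgroup that slurm knows nothing about
cgcreate -g memory:test

# confirm the cgroup directory exists
ls -d /sys/fs/cgroup/memory/test

# check again after the suspect event (e.g. the next puppet run);
# if the directory has vanished, something outside slurm removed it
ls -d /sys/fs/cgroup/memory/test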