Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-05-01 Thread Nate Coraor
Thanks Andy,

I've been able to confirm that in my case, any job that ran for at least
30 minutes (puppet's run interval) would lose its cgroups, and that the
times those cgroups disappear correspond exactly with puppet runs. I'm not
sure whether this move back to the root cgroup is what causes the oom event
that Slurm detects - I looked through
src/plugins/task/cgroup/task_cgroup_memory.c and the memory cgroup
documentation, and it's not clear to me what happens if you've created
the oom event listener on a specific cgroup and that cgroup disappears. But
since I disabled puppet overnight, jobs running longer than 30 minutes are
completing, and their cgroups are persisting, whereas before that they were not.
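
For anyone who wants to reproduce this, here is a minimal sketch of the kind of
check that lines the disappearance up with puppet runs (uid_1000/job_12345 is a
placeholder for a real running job, and it assumes the memory controller is
mounted under /sys/fs/cgroup/memory):

# log once a minute whether the job's memory cgroup directory still exists
while true; do
    if [ -d /sys/fs/cgroup/memory/slurm/uid_1000/job_12345 ]; then
        echo "$(date -Is) cgroup present"
    else
        echo "$(date -Is) cgroup GONE"
    fi
    sleep 60
done >> /tmp/cgroup_watch.log

Comparing the timestamps of the "GONE" lines against the puppet agent's run log
should show whether the two line up.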

--nate

On Mon, Apr 30, 2018 at 5:47 PM, Andy Georges  wrote:

>
>
> > On 30 Apr 2018, at 22:37, Nate Coraor  wrote:
> >
> > Hi Shawn,
> >
> > I'm wondering if you're still seeing this. I've recently enabled
> task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs
> are escaping their cgroups. For me this is resulting in a lot of jobs
> ending in OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks
> the oom-killer has triggered when it hasn't. I'm not using GRES or devices,
> only:
>
> I am not sure that you are making the correct conclusion here.
>
> There is a known cgroups issue, due to
>
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
>
> Relevant part:
>
> The memory controller has a long history. A request for comments for the
> memory
> controller was posted by Balbir Singh [1]. At the time the RFC was posted
> there were several implementations for memory control. The goal of the
> RFC was to build consensus and agreement for the minimal features required
> for memory control. The first RSS controller was posted by Balbir Singh[2]
> in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of
> the
> RSS controller. At OLS, at the resource management BoF, everyone suggested
> that we handle both page cache and RSS together. Another request was raised
> to allow user space handling of OOM. The current memory controller is
> at version 6; it combines both mapped (RSS) and unmapped Page
> Cache Control [11].
>
> Are the jobs killed prematurely? If not, then you ran into the above.
>
> Kind regards.
> — Andy
>


Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Andy Georges


> On 30 Apr 2018, at 22:37, Nate Coraor  wrote:
> 
> Hi Shawn,
> 
> I'm wondering if you're still seeing this. I've recently enabled task/cgroup 
> on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping 
> their cgroups. For me this is resulting in a lot of jobs ending in 
> OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks the oom-killer 
> has triggered when it hasn't. I'm not using GRES or devices, only:

I am not sure that you are making the correct conclusion here.

There is a known cgroups issue, due to

https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Relevant part:

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh[2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

Are the jobs killed prematurely? If not, then you ran into the above.
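
A quick way to tell the two cases apart (just a sketch; the uid/job path below is a
placeholder for a real job's memory cgroup under cgroup v1) is to check whether the
kernel's oom-killer actually fired and to look at the cgroup's own counters:

# a real oom kill leaves a trace in the kernel log
dmesg -T | grep -i 'killed process'
# state of the job's memory cgroup (oom_kill_disable / under_oom)
cat /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/memory.oom_control
# number of times the memory limit was hit
cat /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/memory.failcnt

If dmesg shows nothing around the time the job ended and the job's output is
complete, the OUT_OF_MEMORY state most likely came from a memory-controller event
rather than an actual kill.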

Kind regards.
— Andy




Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Nate Coraor
Nevermind - it appears to happen when puppet runs. I have no hand in that,
so I'll kick it to those admins and report back with what I find.

I ruled out slurm by simply creating a non-slurm cgroup, with e.g.
`cgcreate -g memory:test`, and that cgroup also disappeared unexpectedly.
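
For reference, the test was along these lines (a minimal sketch; it assumes the
libcgroup tools that provide cgcreate are installed and that the memory controller
is mounted at /sys/fs/cgroup/memory):

# create a throwaway memory cgroup that slurm knows nothing about
cgcreate -g memory:test
# confirm it exists right after creation
stat /sys/fs/cgroup/memory/test
# wait out one puppet run interval (~30 minutes), then check again;
# in my case the directory was gone by then
stat /sys/fs/cgroup/memory/test

Since slurmd has nothing to do with that cgroup, its disappearance points at
something else on the node.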

--nate

On Mon, Apr 30, 2018 at 4:37 PM, Nate Coraor <n...@bx.psu.edu> wrote:

> Hi Shawn,
>
> I'm wondering if you're still seeing this. I've recently enabled
> task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs
> are escaping their cgroups. For me this is resulting in a lot of jobs
> ending in OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks
> the oom-killer has triggered when it hasn't. I'm not using GRES or devices,
> only:
>
> cgroup.conf:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> slurm.conf:
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherFrequency=task=15
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup
>
> The only thing that maybe corresponds is this pair of log messages:
>
> [JOB_ID.batch] debug:  Handling REQUEST_STATE
> debug:  _fill_registration_msg: found apparently running job JOB_ID
>
> Thanks,
> --nate
>
> On Mon, Apr 23, 2018 at 4:41 PM, Kevin Manalo <kman...@jhu.edu> wrote:
>
>> Shawn,
>>
>>
>>
>> Just to give you a compare and contrast:
>>
>>
>>
>> We have the following related entries in slurm.conf:
>>
>>
>>
>> JobAcctGatherType=jobacct_gather/linux # will migrate to cgroup
>> eventually
>>
>> JobAcctGatherFrequency=30
>>
>> ProctrackType=proctrack/cgroup
>>
>> TaskPlugin=task/affinity,task/cgroup
>>
>>
>>
>> cgroup_allowed_devices_file.conf:
>>
>>
>>
>> /dev/null
>>
>> /dev/urandom
>>
>> /dev/zero
>>
>> /dev/sda*
>>
>> /dev/cpu/*/*
>>
>> /dev/pts/*
>>
>> /dev/nvidia*
>>
>>
>>
>> gres.conf (4 K80s on node with 24 core haswell):
>>
>>
>>
>> Name=gpu File=/dev/nvidia0 CPUs=0-5
>>
>> Name=gpu File=/dev/nvidia1 CPUs=12-17
>>
>> Name=gpu File=/dev/nvidia2 CPUs=6-11
>>
>> Name=gpu File=/dev/nvidia3 CPUs=18-23
>>
>>
>>
>>
>>
>> I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1
>> day and they are still inside of cgroups, but again this is on CentOS6
>> clusters.
>>
>>
>>
>> Are you still seeing  cgroup escapes now, specifically for jobs > 1 day?
>>
>>
>>
>> Thanks,
>>
>> Kevin
>>
>>
>>
>>
>>
>>
>>
>> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>> Shawn Bobbin <sabob...@umiacs.umd.edu>
>> *Reply-To: *Slurm User Community List <slurm-users@lists.schedmd.com>
>> *Date: *Monday, April 23, 2018 at 2:45 PM
>> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
>> *Subject: *Re: [slurm-users] Jobs escaping cgroup device controls after
>> some amount of time.
>>
>>
>>
>> Hi,
>>
>>
>>
>> I attached our cgroup.conf and gres.conf.
>>
>>
>>
>> As for the cgroup_allowed_devices.conf file, I have this file stubbed but
>> empty.  In 17.02 slurm started fine without this file (as far as I
>> remember) and it being empty doesn’t appear to actually impact anything…
>> device availability remains the same.  Based on the behavior explained in
>> [0] I don’t expect this file to impact specific GPU containment.
>>
>>
>>
>> TaskPlugin = task/cgroup
>>
>> ProctrackType = proctrack/cgroup
>>
>> JobAcctGatherType = jobacct_gather/cgroup
>>
>>
>>
>>
>>
>>
>>
>> [0] https://bugs.schedmd.com/show_bug.cgi?id=4122
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-30 Thread Nate Coraor
Hi Shawn,

I'm wondering if you're still seeing this. I've recently enabled
task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs
are escaping their cgroups. For me this is resulting in a lot of jobs
ending in OUT_OF_MEMORY that shouldn't, because it appears slurmd thinks
the oom-killer has triggered when it hasn't. I'm not using GRES or devices,
only:

cgroup.conf:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

slurm.conf:

JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=15
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

The only thing that maybe corresponds is this pair of log messages:

[JOB_ID.batch] debug:  Handling REQUEST_STATE
debug:  _fill_registration_msg: found apparently running job JOB_ID

Thanks,
--nate

On Mon, Apr 23, 2018 at 4:41 PM, Kevin Manalo <kman...@jhu.edu> wrote:

> Shawn,
>
>
>
> Just to give you a compare and contrast:
>
>
>
> We have the following related entries in slurm.conf:
>
>
>
> JobAcctGatherType=jobacct_gather/linux # will migrate to cgroup eventually
>
> JobAcctGatherFrequency=30
>
> ProctrackType=proctrack/cgroup
>
> TaskPlugin=task/affinity,task/cgroup
>
>
>
> cgroup_allowed_devices_file.conf:
>
>
>
> /dev/null
>
> /dev/urandom
>
> /dev/zero
>
> /dev/sda*
>
> /dev/cpu/*/*
>
> /dev/pts/*
>
> /dev/nvidia*
>
>
>
> gres.conf (4 K80s on node with 24 core haswell):
>
>
>
> Name=gpu File=/dev/nvidia0 CPUs=0-5
>
> Name=gpu File=/dev/nvidia1 CPUs=12-17
>
> Name=gpu File=/dev/nvidia2 CPUs=6-11
>
> Name=gpu File=/dev/nvidia3 CPUs=18-23
>
>
>
>
>
> I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1 day
> and they are still inside of cgroups, but again this is on CentOS6 clusters.
>
>
>
> Are you still seeing  cgroup escapes now, specifically for jobs > 1 day?
>
>
>
> Thanks,
>
> Kevin
>
>
>
>
>
>
>
> *From: *slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Shawn Bobbin <sabob...@umiacs.umd.edu>
> *Reply-To: *Slurm User Community List <slurm-users@lists.schedmd.com>
> *Date: *Monday, April 23, 2018 at 2:45 PM
> *To: *Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject: *Re: [slurm-users] Jobs escaping cgroup device controls after
> some amount of time.
>
>
>
> Hi,
>
>
>
> I attached our cgroup.conf and gres.conf.
>
>
>
> As for the cgroup_allowed_devices.conf file, I have this file stubbed but
> empty.  In 17.02 slurm started fine without this file (as far as I
> remember) and it being empty doesn’t appear to actually impact anything…
> device availability remains the same.  Based on the behavior explained in
> [0] I don’t expect this file to impact specific GPU containment.
>
>
>
> TaskPlugin = task/cgroup
>
> ProctrackType = proctrack/cgroup
>
> JobAcctGatherType = jobacct_gather/cgroup
>
>
>
>
>
>
>
> [0] https://bugs.schedmd.com/show_bug.cgi?id=4122
>
>
>
>
>
>
>
>


Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Kevin Manalo
Shawn,

Just to give you a compare and contrast:

We have the following related entries in slurm.conf:

JobAcctGatherType=jobacct_gather/linux # will migrate to cgroup eventually
JobAcctGatherFrequency=30
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

cgroup_allowed_devices_file.conf:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

gres.conf (4 K80s on node with 24 core haswell):

Name=gpu File=/dev/nvidia0 CPUs=0-5
Name=gpu File=/dev/nvidia1 CPUs=12-17
Name=gpu File=/dev/nvidia2 CPUs=6-11
Name=gpu File=/dev/nvidia3 CPUs=18-23


I also looked for multi-tenant jobs on our MARCC cluster with jobs > 1 day and 
they are still inside of cgroups, but again this is on CentOS6 clusters.

Are you still seeing  cgroup escapes now, specifically for jobs > 1 day?

Thanks,
Kevin



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn 
Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Monday, April 23, 2018 at 2:45 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Jobs escaping cgroup device controls after some 
amount of time.

Hi,

I attached our cgroup.conf and gres.conf.

As for the cgroup_allowed_devices.conf file, I have this file stubbed but 
empty.  In 17.02 slurm started fine without this file (as far as I remember) 
and it being empty doesn’t appear to actually impact anything… device 
availability remains the same.  Based on the behavior explained in [0] I don’t 
expect this file to impact specific GPU containment.

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup






[0] https://bugs.schedmd.com/show_bug.cgi?id=4122








Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Shawn Bobbin
Hi,

I attached our cgroup.conf and gres.conf.

As for the cgroup_allowed_devices.conf file, I have this file stubbed but empty.  In 17.02 slurm started fine without this file (as far as I remember) and it being empty doesn’t appear to actually impact anything… device availability remains the same.  Based on the behavior explained in [0] I don’t expect this file to impact specific GPU containment.

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup

[0] https://bugs.schedmd.com/show_bug.cgi?id=4122

slurm.conf
Description: Binary data


gres.conf
Description: Binary data
On Apr 13, 2018, at 9:25 AM, Kevin Manalo <kman...@jhu.edu> wrote:

I’m asking in the hopes that others will chime in (I’m curious why this is happening)

Could you share your related slurm.conf cgroup options

cgroup.conf
cgroup_allowed_devices_file.conf

TaskPlugin
ProctrackType
JobAcctGatherType

-Kevin

PS Looking for similar style jobs, We have >1 day gpu users inside of cgroup, but not multi-tenant currently. 17.11.5, CentOS6

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, April 12, 2018 at 9:25 AM
To: "slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

Hi,

We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs escaping their cgroup controls on GPU devices.

For example we have the following steps running:

# ps auxn | grep [s]lurmstepd
   0  2380  0.0  0.0 538436  3700 ?        Sl   07:22   0:02 slurmstepd: [46609.0]
   0  5714  0.0  0.0 472136  3952 ?        Sl   Apr11   0:03 slurmstepd: [46603.0]
   0 17202  0.0  0.0 538448  3724 ?        Sl   Apr11   0:03 slurmstepd: [46596.0]
   0 28673  0.0  0.0 538380  3696 ?        Sl   Apr10   0:39 slurmstepd: [46262.0]
   0 44832  0.0  0.0 538640  3964 ?        Sl   Apr11   1:12 slurmstepd: [46361.0]

But not all of those are reflected in the cgroup device hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0

This issue only seems to happen after a job has been running for a while, as when it is first started the cgroup controls work as expected.  In this example, the jobs that have escaped the controls (46361, 46262) have been running for over a day:

# squeue -j 46609,46603,46596,46262,46361
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
             46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
             46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
             46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
             46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14

So it seems that at some point slurm, or something else, comes in and modifies the cgroup hierarchy, but we haven’t had much luck in tracking down what.

Has anyone run into this, or have any pointers for troubleshooting this further?

Thanks,
—Shawn

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-13 Thread Kevin Manalo
I’m asking in the hopes that others will chime in (I’m curious why this is 
happening)

Could you share your related slurm.conf cgroup options

cgroup.conf
cgroup_allowed_devices_file.conf

TaskPlugin
ProctrackType
JobAcctGatherType

-Kevin

PS Looking for similar style jobs, We have >1 day gpu users inside of cgroup, 
but not multi-tenant currently. 17.11.5, CentOS6

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn 
Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, April 12, 2018 at 9:25 AM
To: "slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
Subject: [slurm-users] Jobs escaping cgroup device controls after some amount 
of time.

Hi,

We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs 
escaping their cgroup controls on GPU devices.


For example we have the following steps running:

# ps auxn | grep [s]lurmstepd
   0  2380  0.0  0.0 538436  3700 ?        Sl   07:22   0:02 slurmstepd: [46609.0]
   0  5714  0.0  0.0 472136  3952 ?        Sl   Apr11   0:03 slurmstepd: [46603.0]
   0 17202  0.0  0.0 538448  3724 ?        Sl   Apr11   0:03 slurmstepd: [46596.0]
   0 28673  0.0  0.0 538380  3696 ?        Sl   Apr10   0:39 slurmstepd: [46262.0]
   0 44832  0.0  0.0 538640  3964 ?        Sl   Apr11   1:12 slurmstepd: [46361.0]


But not all of those are reflected in the cgroup device hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0


This issue only seems to happen after a job has been running for a while, as 
when it is first started the cgroup controls work as expected.  In this 
example, the jobs that have escaped the controls (46361, 46262) have been 
running for over a day:

# squeue -j 46609,46603,46596,46262,46361
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
             46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
             46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
             46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
             46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14


So it seems that at some point slurm, or something else, comes in and modifies 
the cgroup hierarchy, but we haven’t had much luck in tracking down what.

Has anyone run into this, or have any pointers for troubleshooting this further?

Thanks,
—Shawn




[slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-12 Thread Shawn Bobbin
Hi,

We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs 
escaping their cgroup controls on GPU devices.


For example we have the following steps running:

# ps auxn | grep [s]lurmstepd
   0  2380  0.0  0.0 538436  3700 ?        Sl   07:22   0:02 slurmstepd: [46609.0]
   0  5714  0.0  0.0 472136  3952 ?        Sl   Apr11   0:03 slurmstepd: [46603.0]
   0 17202  0.0  0.0 538448  3724 ?        Sl   Apr11   0:03 slurmstepd: [46596.0]
   0 28673  0.0  0.0 538380  3696 ?        Sl   Apr10   0:39 slurmstepd: [46262.0]
   0 44832  0.0  0.0 538640  3964 ?        Sl   Apr11   1:12 slurmstepd: [46361.0]


But not all of those are reflected in the cgroup device hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0


This issue only seems to happen after a job has been running for a while, as 
when it is first started the cgroup controls work as expected.  In this 
example, the jobs that have escaped the controls (46361, 46262) have been 
running for over a day:

# squeue -j 46609,46603,46596,46262,46361
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
             46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
             46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
             46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
             46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14


So it seems that at some point slurm, or something else, comes in and modifies 
the cgroup hierarchy, but we haven’t had much luck in tracking down what.
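
One thing we may try next (a sketch only; it assumes auditd is running on the node and
an x86_64 kernel) is to audit rmdir calls so the audit log records which process is
removing the job directories:

# log every rmdir system-wide under a searchable key
auditctl -a always,exit -F arch=b64 -S rmdir -k cgroup-rmdir
# after the cgroups disappear again, see what removed them (comm=/exe= fields)
ausearch -k cgroup-rmdir --start recent
# drop the rule when finished
auditctl -d always,exit -F arch=b64 -S rmdir -k cgroup-rmdir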

Has anyone run into this, or have any pointers for troubleshooting this further?

Thanks,
—Shawn