Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Shawn Bobbin
Hi,

I attached our cgroup.conf and gres.conf.  As for the cgroup_allowed_devices.conf file, I have this file stubbed out but empty.  In 17.02, slurm started fine without this file (as far as I remember), and its being empty doesn’t appear to actually impact anything… device availability remains the same.  Based on the behavior explained in [0], I don’t expect this file to affect specific GPU containment.

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup

[0] https://bugs.schedmd.com/show_bug.cgi?id=4122
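
For reference, the device-containment side of cgroup.conf looks roughly like the following.  This is only an illustrative excerpt, not a verbatim copy of the attached file, and the AllowedDevicesFile path here is just an example:

CgroupAutomount=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices.conf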

slurm.conf
Description: Binary data


gres.conf
Description: Binary data
On Apr 13, 2018, at 9:25 AM, Kevin Manalo <kman...@jhu.edu> wrote:

I’m asking in the hopes that others will chime in (I’m curious why this is happening).

Could you share your related slurm.conf cgroup options?

cgroup.conf
cgroup_allowed_devices_file.conf

TaskPlugin
ProctrackType
JobAcctGatherType

-Kevin

PS: Looking for similar style jobs; we have >1 day GPU users inside of cgroup, but not multi-tenant currently.  17.11.5, CentOS 6.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, April 12, 2018 at 9:25 AM
To: "slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
Subject: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

[slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-12 Thread Shawn Bobbin
Hi,

We’re running slurm 17.11.5 on RHEL 7 and have been having issues with jobs
escaping their cgroup controls on GPU devices.


For example we have the following steps running:

# ps auxn | grep [s]lurmstepd
   0  2380  0.0  0.0 538436  3700 ?  Sl   07:22   0:02 slurmstepd: [46609.0]
   0  5714  0.0  0.0 472136  3952 ?  Sl   Apr11   0:03 slurmstepd: [46603.0]
   0 17202  0.0  0.0 538448  3724 ?  Sl   Apr11   0:03 slurmstepd: [46596.0]
   0 28673  0.0  0.0 538380  3696 ?  Sl   Apr10   0:39 slurmstepd: [46262.0]
   0 44832  0.0  0.0 538640  3964 ?  Sl   Apr11   1:12 slurmstepd: [46361.0]


But not all of those are reflected in the cgroup device hierarchy:

# lscgroup | grep devices | grep slurm
devices:/slurm
devices:/slurm/uid_2093
devices:/slurm/uid_2093/job_46609
devices:/slurm/uid_2093/job_46609/step_0
devices:/slurm/uid_11477
devices:/slurm/uid_11477/job_46603
devices:/slurm/uid_11477/job_46603/step_0
devices:/slurm/uid_11184
devices:/slurm/uid_11184/job_46596
devices:/slurm/uid_11184/job_46596/step_0
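
For what it’s worth, the same thing can be checked per step, assuming the usual cgroup v1 mount under /sys/fs/cgroup: compare a step’s device whitelist against where each slurmstepd actually sits, e.g.

# cat /sys/fs/cgroup/devices/slurm/uid_2093/job_46609/step_0/devices.list
# grep devices /proc/2380/cgroup     # stepd for 46609, still in the hierarchy above
# grep devices /proc/28673/cgroup    # stepd for 46262, no longer listed above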


This issue only seems to happen after a job has been running for a while; when
a job first starts, the cgroup controls work as expected.  In this example, the
jobs that have escaped the controls (46262 and 46361) have been running for
over a day:

# squeue -j 46609,46603,46596,46262,46361
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             46596     dpart     bash     yhng  R   10:56:00      1 vulcan14
             46609 scavenger     bash    yaser  R    1:52:37      1 vulcan14
             46603 scavenger     bash  jxzheng  R    9:47:26      1 vulcan14
             46361     dpart     bash  jxzheng  R 1-08:31:14      1 vulcan14
             46262     dpart Weighted  umahbub  R 1-18:07:07      1 vulcan14


So it seems that at some point slurm, or something else, comes in and modifies 
the cgroup hierarchy, but we haven’t had much luck in tracking down what.
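
One thing we may try next is a crude watcher that snapshots the devices hierarchy alongside the running stepds, so we can at least pin down when the job cgroups disappear.  This is only a sketch (untested, and the log path is just a placeholder):

# log the devices cgroups and slurmstepd processes once a minute
while true; do
    date
    lscgroup | grep '^devices:/slurm'
    ps -eo pid,etime,cmd | grep '[s]lurmstepd'
    echo '----'
    sleep 60
done >> /root/cgroup-devices-watch.log   # placeholder path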

Has anyone run into this, or have any pointers for troubleshooting this further?

Thanks,
—Shawn