Re: [slurm-users] wckey specification error

2018-05-01 Thread Mahmood Naderan
Thanks Trevor for pointing out that there is an option for such a thing in slurm.conf. Although I previously grepped for *wc* and found nothing, the correct name is TrackWCKey, which is set to "yes" by default. After setting that to "no", the error disappeared. About the comments on Rocks and the
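
For anyone hitting the same wckey error, the relevant setting is a single line in slurm.conf (a minimal sketch; slurmdbd.conf also has a TrackWCKey option, which is usually kept consistent with it):

    # slurm.conf -- stop requiring/tracking WCKeys on job submission
    TrackWCKey=no

After the change, an scontrol reconfigure (or a slurmctld restart) is needed for it to take effect.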

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 10:15, R. Paul Wiegand wrote: Yes, I am sure they are all the same. Typically, I just scontrol reconfig; however, I have also tried restarting all daemons. Understood. Any diagnostics in the slurmd logs when trying to start a GPU job on the node? We are moving to 7.4 in a few

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Yes, I am sure they are all the same. Typically, I just scontrol reconfig; however, I have also tried restarting all daemons. We are moving to 7.4 in a few weeks during our downtime. We had a QDR -> OFED version constraint -> Lustre client version constraint issue that delayed our upgrade.

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:31, R. Paul Wiegand wrote: Slurm 17.11.0 on CentOS 7.1 That's quite old (on both fronts; RHEL 7.1 is from 2015). We started on that same Slurm release but didn't do the GPU cgroup stuff until a later version (17.11.3 on RHEL 7.4). I don't see anything in the NEWS file about

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Kevin Manalo
Chris, Thanks for the correction there, that /dev/nvidia* isn’t needed in [cgroup_allowed_devices_file.conf] for constraining GPU devices. -Kevin

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Slurm 17.11.0 on CentOS 7.1 On Tue, May 1, 2018, 19:26 Christopher Samuel wrote: > On 02/05/18 09:23, R. Paul Wiegand wrote: > > > I thought including the /dev/nvidia* would whitelist those devices > > ... which seems to be the opposite of what I want, no? Or do I > >

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Chris. I do have the ConstrainDevices turned on. Are the differences in your cgroup_allowed_devices_file.conf relevant in this case? On Tue, May 1, 2018, 19:23 Christopher Samuel wrote: > On 02/05/18 09:00, Kevin Manalo wrote: > > > Also, I recall appending this to
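
For context, a minimal cgroup.conf with device constraint enabled might look like the following (a sketch only; the file path and the other Constrain* options depend on the site):

    ### cgroup.conf
    CgroupAutomount=yes
    ConstrainDevices=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf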

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Thanks Kevin! Indeed, nvidia-smi in an interactive job tells me that I can get access to the device when I should not be able to. I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no? Or do I misunderstand? Thanks, Paul On

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:00, Kevin Manalo wrote: Also, I recall appending this to the bottom of [cgroup_allowed_devices_file.conf] .. Same as yours ... /dev/nvidia* There was a SLURM bug issue that made this clear, not so much in the website docs. That shouldn't be necessary; all we have for this
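
For comparison, a cgroup_allowed_devices_file.conf without any GPU entries would typically list only generic system devices, along these lines (an illustrative sketch, not the exact file from either site; adjust device paths to the hardware):

    /dev/null
    /dev/urandom
    /dev/zero
    /dev/sda*
    /dev/cpu/*/*
    /dev/pts/*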

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Kevin Manalo
Paul, Having recently set this up, this was my test: when you make a single-GPU request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should only see the GPU assigned to you via 'nvidia-smi'. When gres is unset, nvidia-smi should show 'No devices were
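
A quick way to run that check (the salloc/srun options are the ones from Kevin's test; account/partition flags omitted):

    # one GPU requested: nvidia-smi inside the job should list exactly one device
    salloc --gres=gpu:1 srun --pty bash
    nvidia-smi

    # no gres requested: nvidia-smi should report "No devices were found"
    salloc srun --pty bash
    nvidia-smi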

[slurm-users] GPU / cgroup challenges

2018-05-01 Thread R. Paul Wiegand
Greetings, I am setting up our new GPU cluster, and I seem to have a problem configuring things so that the devices are properly walled off via cgroups. Our nodes each have two GPUs; however, if --gres is unset, or set to --gres=gpu:0, I can access both GPUs from inside a job. Moreover, if I ask
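
For a two-GPU node, the usual pieces are one gres.conf line per device plus the matching Gres= declaration in slurm.conf (node names and device paths below are made up for illustration):

    # gres.conf on the compute node
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

    # slurm.conf (other node parameters omitted)
    GresTypes=gpu
    NodeName=gpunode[01-02] Gres=gpu:2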

Re: [slurm-users] wckey specification error

2018-05-01 Thread Cooper, Trevor
> On May 1, 2018, at 2:58 AM, John Hearns wrote: > > Rocks 7 is now available, which is based on CentOS 7.4 > I hate to be uncharitable, but I am not a fan of Rocks. I speak from > experience, having installed my share of Rocks clusters. > The philosophy just does not

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-05-01 Thread Nate Coraor
Thanks Andy, I've been able to confirm that in my case, any jobs that ran for at least 30 minutes (puppet's run interval) would lose their cgroups, and that the time those cgroups disappear corresponds exactly with puppet runs. I am not sure whether this cgroup change to the root is what causes the oom
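
One way to spot this on a node is to check the cgroup membership of a running job step (a sketch; the exact hierarchy path varies with the cgroup plugin and Slurm release):

    # pick a slurmstepd or job process ID on the node, then:
    cat /proc/<PID>/cgroup
    # a contained job shows Slurm-specific paths such as .../slurm/uid_<uid>/job_<jobid>/step_<step>
    # a job that has escaped shows only the root cgroup (/) for the devices controller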

Re: [slurm-users] wckey specification error

2018-05-01 Thread John Hearns
I quickly downloaded that roll and unpacked the RPMs. I cannot quite see how Slurm is configured, so to my shame I gave up (I did say that Rocks was not my thing). On 1 May 2018 at 11:58, John Hearns wrote: > Rocks 7 is now available, which is based on CentOS 7.4 > I hate

Re: [slurm-users] wckey specification error

2018-05-01 Thread Chris Samuel
On Tuesday, 1 May 2018 2:45:21 PM AEST Mahmood Naderan wrote: > The wckey explanation in the manual [1] is not meaningful at the > moment. Can someone explain that? I've never used it, but it sounds like you've configured your system to require it (or perhaps Rocks has done that?).