Thanks Trevor for pointing out that there is an option for such a thing
in slurm.conf. Although I previously grepped for *wc* and found
nothing, the correct name is TrackWCKey, which is set to "yes" by
default. After setting it to "no", the error disappeared.
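For anyone else hitting this, a minimal sketch of the relevant slurm.conf fragment (surrounding options omitted; check your own accounting setup, since TrackWCKey interacts with slurmdbd):

```
# slurm.conf (fragment) -- disable workload characterization key tracking.
# With TrackWCKey=yes, submissions without a --wckey can trigger
# "invalid wckey" errors when accounting enforcement is on.
TrackWCKey=no
```

After changing it, run scontrol reconfig (or restart the daemons) so the new value takes effect.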
About the comments on Rocks and the
On 02/05/18 10:15, R. Paul Wiegand wrote:
Yes, I am sure they are all the same. Typically, I just scontrol
reconfig; however, I have also tried restarting all daemons.
Understood. Any diagnostics in the slurmd logs when trying to start
a GPU job on the node?
We are moving to 7.4 in a few weeks during our downtime. We had a QDR ->
OFED version constraint -> Lustre client version constraint issue that
delayed our upgrade.
On 02/05/18 09:31, R. Paul Wiegand wrote:
Slurm 17.11.0 on CentOS 7.1
That's quite old (on both fronts, RHEL 7.1 is from 2015), we started on
that same Slurm release but didn't do the GPU cgroup stuff until a later
version (17.11.3 on RHEL 7.4).
I don't see anything in the NEWS file about
Chris,
Thanks for the correction there, that /dev/nvidia* isn’t needed in
[cgroup_allowed_devices_file.conf] for constraining GPU devices.
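For readers following along, the device-confinement side lives in cgroup.conf; a minimal sketch (paths are illustrative for a typical install):

```
# cgroup.conf (fragment) -- restrict jobs to the devices they were allocated.
# With ConstrainDevices=yes, a job that requested no GPU has /dev/nvidia*
# blacklisted in its devices cgroup, so nvidia-smi finds nothing.
CgroupAutomount=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```

Note that this only takes effect when slurm.conf also sets TaskPlugin=task/cgroup.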
-Kevin
From: slurm-users on behalf of "R. Paul Wiegand"
Reply-To: "p...@tesseract.org"
Slurm 17.11.0 on CentOS 7.1
On Tue, May 1, 2018, 19:26 Christopher Samuel wrote:
> On 02/05/18 09:23, R. Paul Wiegand wrote:
>
> > I thought including the /dev/nvidia* would whitelist those devices
> > ... which seems to be the opposite of what I want, no? Or do I
> > misunderstand?
Thanks Chris. I do have the ConstrainDevices turned on. Are the
differences in your cgroup_allowed_devices_file.conf relevant in this case?
On Tue, May 1, 2018, 19:23 Christopher Samuel wrote:
> On 02/05/18 09:00, Kevin Manalo wrote:
>
> > Also, I recall appending this to the bottom of
> > [cgroup_allowed_devices_file.conf]
Thanks Kevin!
Indeed, nvidia-smi in an interactive job tells me that I can get access to
the device when I should not be able to.
I thought including the /dev/nvidia* would whitelist those devices ...
which seems to be the opposite of what I want, no? Or do I misunderstand?
Thanks,
Paul
On 02/05/18 09:00, Kevin Manalo wrote:
Also, I recall appending this to the bottom of
[cgroup_allowed_devices_file.conf]
..
Same as yours
...
/dev/nvidia*
There was a Slurm bug-tracker issue that made this clear, not so much in the
website docs.
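For concreteness, the example file shipped in Slurm's etc/ directory looks roughly like this (device paths vary by site); the point of Chris's correction is that /dev/nvidia* should not be added:

```
# cgroup_allowed_devices_file.conf -- devices every job may always access.
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
# Appending /dev/nvidia* here would whitelist the GPUs for every job,
# defeating per-job GPU confinement under ConstrainDevices=yes.
```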
That shouldn't be necessary, all we have for this
Paul,
Having recently set this up, this was my test: when you make a single-GPU
request from inside an interactive run (salloc ... --gres=gpu:1 srun --pty
bash), you should only see the GPU assigned to you via 'nvidia-smi'.
When gres is unset you should see:
nvidia-smi
No devices were found
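A sketch of that check as shell commands (requires a cluster with gres/gpu configured; the gres counts are just the two cases being compared):

```
# One GPU requested: only the allocated device should be visible.
salloc --gres=gpu:1 srun --pty bash
nvidia-smi    # expect exactly one GPU listed

# No GPU requested: the devices cgroup should block access.
salloc --gres=gpu:0 srun --pty bash
nvidia-smi    # expect "No devices were found"
```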
Greetings,
I am setting up our new GPU cluster, and I seem to have a problem
configuring things so that the devices are properly walled off via
cgroups. Our nodes each have two GPUs; however, if --gres is unset, or
set to --gres=gpu:0, I can access both GPUs from inside a job.
Moreover, if I ask
> On May 1, 2018, at 2:58 AM, John Hearns wrote:
>
> Rocks 7 is now available, which is based on CentOS 7.4
> I hate to be uncharitable, but I am not a fan of Rocks. I speak from
> experience, having installed my share of Rocks clusters.
> The philosophy just does not
Thanks Andy,
I've been able to confirm that in my case, any jobs that ran for at least
30 minutes (puppet's run interval) would lose their cgroups, and that the
time those cgroups disappear corresponds exactly with puppet runs. I am not
sure if this cgroup change to root is what causes the oom
I quickly downloaded that roll and unpacked the RPMs.
I cannot quite see how Slurm is configured, so to my shame I gave up (I did
say that Rocks was not my thing)
On 1 May 2018 at 11:58, John Hearns wrote:
> Rocks 7 is now available, which is based on CentOS 7.4
> I hate to be uncharitable, but I am not a fan of Rocks.
On Tuesday, 1 May 2018 2:45:21 PM AEST Mahmood Naderan wrote:
> The wckey explanation in the manual [1] is not meaningful at the
> moment. Can someone explain that?
I've never used it, but it sounds like you've configured your system to require
it (or perhaps Rocks has done that?).