Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 4:18 pm, Sajesh Singh wrote: Thank you for the tip. That works as expected. No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware! Best of luck, Chris -- Chris Samuel :
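
A rough illustration of the core-binding point above: in gres.conf each GPU device file can be tied to the cores on its local socket. The node name, device paths and core ranges below are assumptions and would need to match the actual hardware topology (e.g. as reported by nvidia-smi topo -m or lscpu):

  NodeName=gpunode01 Name=gpu File=/dev/nvidia0 Cores=0-11
  NodeName=gpunode01 Name=gpu File=/dev/nvidia1 Cores=12-23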

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Christopher, Thank you for the tip. That works as expected. -SS- -Original Message- From: slurm-users On Behalf Of Christopher Samuel Sent: Thursday, October 8, 2020 6:52 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] CUDA environment variable not being set

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
On 10/8/20 3:48 pm, Sajesh Singh wrote: Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that: it will allow access to the GPUs whether people have requested them or not. What you want is in gres.conf and looks
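
A minimal sketch of the combination being pointed at here, assuming two GPUs per node and that per-job device constraint is wanted (paths and values are illustrative, not taken from the thread): constrain devices through cgroups and enumerate the GPU device files in gres.conf rather than whitelisting them globally.

  # cgroup.conf
  ConstrainDevices=yes

  # gres.conf on the GPU node
  Name=gpu File=/dev/nvidia0
  Name=gpu File=/dev/nvidia1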

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Relu, Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf -SS- -Original Message- From: slurm-users On Behalf Of Christopher Samuel Sent: Thursday, October 8, 2020 6:10 PM To: slurm-users@lists.schedmd.com Subject: Re:

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel
Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: debug:  common_gres_set_env: unable to set env vars, no device files configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs? All the best, Chris -- Chris Samuel :
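
For anyone checking the same thing: the gres.conf entries need to point at device files that actually exist on the node. A quick sanity check (generic commands, nothing specific to this thread):

  ls -l /dev/nvidia*                  # device files the File= entries should reference
  grep -v '^#' /etc/slurm/gres.conf   # what slurmd will try to hand out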

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
Do you have a line like this in your cgroup_allowed_devices_file.conf: /dev/nvidia* ? Relu On 2020-10-08 16:32, Sajesh Singh wrote: It seems as though the modules are loaded, as when I run lsmod I get the following: nvidia_drm 43714  0 nvidia_modeset   1109636  1
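
For context, cgroup_allowed_devices_file.conf is a plain list of device paths that all jobs may access; commonly seen contents look like this (illustrative only, and note the reply further up the thread warns that whitelisting /dev/nvidia* here defeats per-job GPU constraint):

  /dev/null
  /dev/urandom
  /dev/zero
  /dev/cpu/*/*
  /dev/pts/*
  /dev/nvidia*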

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Yes. It is located in the /etc/slurm directory -- -SS- From: slurm-users On Behalf Of Brian Andrus Sent: Thursday, October 8, 2020 5:02 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] CUDA environment variable not being set do you have your gres.conf on the

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus
do you have your gres.conf on the nodes also? Brian Andrus On 10/8/2020 11:57 AM, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
I only get a line returned for “Gres=”, but this is the same behavior on another cluster that has GPUs and the variable gets set on that cluster. -Sajesh- -- _ Sajesh Singh Manager, Systems and Scientific Computing American Museum of Natural

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and “CfgTRES=”. From: slurm-users on behalf of Sajesh Singh Reply-To: Slurm User Community List Date: Thursday, October 8, 2020 at 3:33 PM To: Slurm
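
For comparison, on a node where the gres is picked up the output looks roughly like this (node name and counts are illustrative; gres/gpu only appears in CfgTRES if it has been added to AccountingStorageTRES):

  $ scontrol show node gpunode01 | grep -i gres
     Gres=gpu:2
     CfgTRES=cpu=24,mem=191000M,billing=24,gres/gpu=2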

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
It seems as though the modules are loaded, as when I run lsmod I get the following:
nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm
Also the nvidia-smi command returns the

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed. Relu On 2020-10-08 14:57, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster,
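
The standard checks for that suggestion, runnable on the compute node itself (generic commands, not specific to this thread):

  lsmod | grep nvidia   # kernel modules loaded?
  nvidia-smi            # driver working and devices visible?
  ls -l /dev/nvidia*    # device files present for gres.conf to reference?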

[slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following messages in the slurmd.log file: debug:
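
For readers following the thread, the behaviour can be reproduced with a request along these lines (node name and GPU count are assumptions); once the gres setup is correct, the echo prints the index of the allocated device instead of nothing:

  # slurm.conf (must match on controller and nodes)
  GresTypes=gpu
  NodeName=gpunode01 Gres=gpu:2 ...

  # quick test from a login node
  srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'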

Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 3-partition model as well once I'd sent my email. So, your comments helped to confirm that I was thinking on the right lines. Best regards, David From: slurm-users on behalf of Thomas M.

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread William Brown
R is single-threaded. On Thu, 8 Oct 2020, 07:44 Diego Zuccato wrote: > On 08/10/20 08:19, David Bellot wrote: > > good spot. At least, scontrol show job is now saying that each job only requires one "CPU", so it seems all the cores are treated the same way now. > > Though I still
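
Given that, the usual way to keep all cores busy with single-threaded R work is to submit many one-core tasks rather than one wide job, for example as a job array (script name, array size and memory below are placeholders):

  #!/bin/bash
  #SBATCH --array=1-64
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=2G
  srun Rscript analysis.R "$SLURM_ARRAY_TASK_ID"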

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread Diego Zuccato
On 08/10/20 08:19, David Bellot wrote: > good spot. At least, scontrol show job is now saying that each job only requires one "CPU", so it seems all the cores are treated the same way now. > Though I still have the problem of not using more than half the cores. > So I suppose it might be

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-08 Thread Diego Zuccato
On 06/10/20 13:45, Riebs, Andy wrote: Well, the cluster is quite heterogeneous, and node bl0-02 only has 24 threads available:
str957-bl0-02:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48

Re: [slurm-users] unable to run on all the logical cores

2020-10-08 Thread David Bellot
Hi Rodrigo, good spot. At least, scontrol show job is now saying that each job only requires one "CPU", so it seems all the cores are treated the same way now. Though I still have the problem of not using more than half the cores. So I suppose it might be due to the way I submit (batchtools in