I should also mention that even the PATH hack does not fully solve the problem. I can easily submit GPU jobs to node-2, but on the other node (which is also the one running slurmctld) I get

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: cuda unavilable)

which is also the usual error one gets when not allocating a GPU via --gres. This seems to be a different problem, but maybe you have some ideas.
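For concreteness, the hack plus the per-node test I ran look roughly like this (the CUDA prefix is an assumption, since that directory does not actually exist on the submit host; node names are as in my gres.conf below):

export PATH="$PATH:/usr/local/cuda/bin"            # the PATH hack
srun -w devbox1 --gres=gpu:1 sh theanoscript.sh    # pin the job to each node in turn
srun -w devbox2 --gres=gpu:1 sh theanoscript.sh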
Thanks in advance,

Hagen

2015-12-08 15:28 GMT+01:00 Hagen Kerzmann <[email protected]>:

> Hi,
>
> I have installed slurm-15.08.2 on a very small cluster of two machines,
> each featuring 4 NVIDIA GPUs. I want to submit jobs from another machine
> that has Slurm installed but no daemons running, so it is not part of the
> cluster. We mainly work with Theano, so to test GPU allocation in the
> cluster, I run a Theano script that does some calculations on the GPU if
> one is available. This works great for any job submitted on a node within
> the cluster, using
>
> srun --gres=gpu:1 sh theanoscript.sh
>
> Submitting non-GPU jobs from the remote machine also works fine, but when
> I try to allocate one of the GPUs, Theano throws the following error:
>
> ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your
> nvcc installation and try again.
>
> Of course, the GPU machines in the cluster have CUDA installed, so this
> error must come from the fact that the submitting machine does not.
> Therefore, I added the (non-existent) CUDA bin directory to my local
> PATH, and that actually fixed the problem, but of course that is no
> desirable solution.
>
> So, from my observations, srun somehow looks for the CUDA path on the
> submitting machine, even though the job has already started to execute on
> one of the cluster nodes. How is that possible, and how can I fix it
> without the hack mentioned above? Does this occur because I submit the
> job from outside the cluster, or because the submitting machine does not
> have CUDA installed?
>
> My gres.conf file is the same on all machines (including the one outside
> the cluster):
>
> # Configure support for Titan GPUs
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia0
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia1
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia2
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia3
>
> Best regards,
>
> Hagen
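P.S. In case it matters, theanoscript.sh is essentially just the following (simplified; gpu_test.py is a placeholder name for our actual Theano test script):

#!/bin/sh
# Select the old GPU backend via Theano flags and run the test script.
# Theano falls back to the CPU, printing the warnings quoted above,
# when no GPU is available.
THEANO_FLAGS=device=gpu,floatX=float32 python gpu_test.py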
