Hi,

I have installed slurm-15.08.2 on a very small cluster of two machines,
each featuring four NVIDIA GPUs. I want to submit jobs from another machine
that has Slurm installed but no daemons running, so it is not part of the
cluster. We mainly work with Theano, so to test GPU allocation in the
cluster I run a Theano script that does some calculations on the GPU if
one is available. This works great for any job submitted from nodes within
the cluster, using

srun --gres=gpu:1 sh theanoscript.sh

Submitting non-GPU jobs from the remote machine also works fine, but when I
try to allocate one of the GPUs, Theano throws the following error:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your
nvcc installation and try again.

Of course, the GPU machines in the cluster do have CUDA installed, so the
error must stem from the fact that the submitting machine does not.
Therefore, I added the (non-existent) CUDA bin directory to my local PATH,
and that actually fixed the problem; but of course that is not a desirable
solution.
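A more robust variant of that workaround might be to set the CUDA paths inside the job script itself, so they are resolved on the compute node instead of being inherited from the submit host. A minimal sketch, assuming CUDA lives under /usr/local/cuda on devbox[1-2] (adjust the prefix to your actual install location):

```shell
#!/bin/sh
# theanoscript.sh (sketch) -- prepend CUDA to the environment on the
# compute node, so nvcc is found no matter what the submit host exports.
# /usr/local/cuda is an assumption; use the real prefix on devbox[1-2].
CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# ... the original Theano invocation goes here ...
```

This keeps the submit host's PATH irrelevant without polluting it with directories that do not exist locally.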

So, from my observations, srun somehow looks up the CUDA path on the
submitting machine, even though the job has already started executing on
one of the cluster nodes. How is that possible, and how can I fix it
without the hack mentioned above? Does this occur because I submit the job
from outside the cluster, or because the submitting machine does not have
CUDA installed?
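One way to confirm that suspicion would be to print the environment the job step actually sees on the compute node and compare it with the submit host's (srun propagates the caller's environment to the job by default, which would explain the behavior). A diagnostic sketch, run from the submit host:

```shell
# Print the PATH and the nvcc location as seen inside the allocated job
# step; compare the output with `echo "$PATH"` run locally on the submit
# host. Identical PATHs would confirm the environment is being exported.
srun --gres=gpu:1 sh -c 'echo "$PATH"; command -v nvcc || echo "nvcc not on PATH"'
```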

My gres.conf file is the same on all machines (also on the one outside the
cluster):

# Configure support for Titan GPUs
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia0
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia1
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia2
NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia3

Best regards,

Hagen
