Hi,

I have installed slurm-15.08.2 on a very small cluster of two machines, each featuring 4 NVIDIA GPUs. I want to submit jobs from another machine that has Slurm installed but no daemons running, so it is not part of the cluster. We mainly work with Theano, so to test GPU allocation in the cluster, I run a Theano script that does some calculations on the GPU, if one is available. This works great for any jobs submitted on nodes within the cluster, using

    srun --gres=gpu:1 sh theanoscript.sh

Submitting non-GPU jobs from the remote machine also works fine, but when I try to allocate one of the GPUs, Theano throws the following error:

    ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.

Of course, the GPU machines in the cluster have CUDA installed, so this error must be coming from the fact that the submitting machine does not. Therefore, I added the (non-existent) CUDA bin directory to my local PATH, and that actually fixed the problem, but of course that is not a desirable solution.

So, from my observations, srun somehow looks for the CUDA path on the remote machine, even though the job has already started to execute on one of the cluster nodes. How is that possible, and how can I fix this without the hack mentioned above? Does this occur because I submit the job from outside the cluster, or because the submitting machine does not have CUDA installed?

My gres.conf file is the same on all machines (also on the one outside the cluster):

    # Configure support for Titan GPUs
    NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia0
    NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia1
    NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia2
    NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia3

Best regards,
Hagen
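P.S. For completeness, the kind of alternative I have been considering instead of padding my local PATH would be to set the CUDA path inside theanoscript.sh itself, so it is resolved on the compute node rather than on the submitting machine. A rough sketch of what I mean (the /usr/local/cuda location and the inner script name are just placeholders for illustration):

```shell
#!/bin/sh
# theanoscript.sh -- resolve CUDA on the compute node itself, so the
# submitting machine's environment no longer matters.
# Assumption: CUDA is installed under /usr/local/cuda on devbox[1-2].
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# run the actual Theano calculation (gpu_test.py is a placeholder name)
python gpu_test.py
```

I am not sure whether this is the intended way to handle it, or whether srun's environment propagation would still interfere, which is why I am asking here.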