I should also mention that even the PATH hack does not fully solve the
problem. I can easily submit GPU jobs to node-2, but on the other node
(which is also the one running slurmctld) I get

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not
available (error: cuda unavailable),

which is also the usual error one gets when not allocating a GPU via
--gres. Now, this seems to be a different problem, but maybe you have some
ideas.
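
A quick way to check whether the allocated GPU devices are even visible
inside a job on that node would be something like the following (assuming
devbox1 is the node running slurmctld; -w pins the job to that node):

srun --gres=gpu:1 -w devbox1 nvidia-smi
srun --gres=gpu:1 -w devbox1 sh -c 'ls -l /dev/nvidia*'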

Thanks in advance,

Hagen

2015-12-08 15:28 GMT+01:00 Hagen Kerzmann <[email protected]>:

> Hi,
>
> I have installed slurm-15.08.2 on a very small cluster of two machines,
> each featuring 4 NVIDIA GPUs. I want to submit jobs from another machine
> that has Slurm installed but no daemons running, so it is not part of the
> cluster. We mainly work with Theano, so to test GPU allocation in the
> cluster, I run a Theano script that does some calculations on the GPU if
> one is available. This works great for any job submitted on a node within
> the cluster, using
>
> srun --gres=gpu:1 sh theanoscript.sh
>
> Submitting non-GPU jobs from the remote machine also works fine, but when
> I try to allocate one of the GPUs, Theano throws the following error:
>
> ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your
> nvcc installation and try again.
>
> Of course, the GPU machines in the cluster have CUDA installed, so this
> error must be coming from the fact that the submitting machine does not.
> Therefore, I added the (non-existent) CUDA bin directory to my local PATH,
> and that actually fixed the problem, but of course that is not a desirable
> solution.
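>
> For completeness, the hack amounts to something like this on the
> submitting machine (the directory does not actually exist there):
>
> export PATH=$PATH:/usr/local/cuda/bin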
>
> So, from my observations, srun somehow looks for the CUDA path on the
> remote machine, even though the job has already started to execute on one
> of the cluster nodes. How is that possible, and how can I fix this without
> the mentioned hack? Does this occur because I submit the job from outside
> the cluster, or because the submitting machine does not have CUDA installed?
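>
> One workaround I am considering is to set the CUDA paths inside
> theanoscript.sh itself, so that the job no longer depends on the
> submitting machine's environment; a sketch (assuming the usual
> /usr/local/cuda prefix on the GPU nodes):
>
> #!/bin/sh
> # Set the CUDA paths on the execute node rather than inheriting them.
> export PATH=/usr/local/cuda/bin:$PATH
> export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
> # gpu_test.py stands in here for our actual Theano script.
> THEANO_FLAGS=device=gpu,floatX=float32 python gpu_test.py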
>
> My gres.conf file is the same on all machines (also on the one outside the
> cluster):
>
> # Configure support for Titan GPUs
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia0
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia1
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia2
> NodeName=devbox[1-2] Name=gpu Type=titan File=/dev/nvidia3
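>
> The matching GRES declarations in slurm.conf are along these lines
> (a simplified sketch; the real node definitions carry more attributes):
>
> GresTypes=gpu
> NodeName=devbox[1-2] Gres=gpu:titan:4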
>
> Best regards,
>
> Hagen
>
