I would guess that you do not have Slurm configured correctly for the
GPUs. Documentation is available here:

http://www.schedmd.com/slurmdocs/gres.html
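
As a rough sketch of what that configuration looks like (the node names and GPU count below are assumptions; adjust them to your cluster), GPU scheduling needs a gres.conf on each compute node plus matching GresTypes= and Gres= entries in slurm.conf:

```
# gres.conf on each compute node:
# one Name=gpu line per GPU device file the node exposes
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
# ... one line per device, up to /dev/nvidia7 for 8 GPUs

# slurm.conf additions (node names here are hypothetical):
GresTypes=gpu
NodeName=node[1-4] Gres=gpu:8
```

Note also that once Slurm sets CUDA_VISIBLE_DEVICES for a job, CUDA renumbers the visible devices starting from 0, so the value of CUDA_VISIBLE_DEVICES itself is not a valid argument to cudaSetDevice() inside the job.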


Quoting Sa Li <[email protected]>:

> Hello, Slurm team
>
> I have constructed an Ubuntu cluster consisting of a few nodes; each node is a
> machine equipped with 8 GeForce GTX 690 GPUs. On each node, I have
> successfully run my code on each GPU device without a Slurm command, like
>            ./myCode deviceID .....
>
> The command-line argument "deviceID" is used for cudaSetDevice(deviceID). This
> works perfectly to access all GPUs on a selected node.
>
> However, if I run Slurm so that it sets CUDA_VISIBLE_DEVICES, and my code
> takes CUDA_VISIBLE_DEVICES on the command line, it always fails with:
>  "CUDA Runtime API error 10: invalid device ordinal."
> Apparently, this error is caused by cudaSetDevice(deviceID); the script is
> test.sh:
>
> #!/bin/bash
> echo `hostname` - $CUDA_VISIBLE_DEVICES
> ./myCode $CUDA_VISIBLE_DEVICES ...
> sleep 10
>
> And run it with:
> srun --gres=gpu:1 ./test.sh &
>
> So my question is why I can run my code individually by passing the deviceID
> on the command line, but am unable to submit jobs through Slurm. I assume I
> may need to make some changes to the configuration file. I have searched
> online, but there seem to be no available answers. It would be highly
> appreciated if your team could provide some clues to solve the problem. I am
> looking forward to your reply.
>
>
> Thanks
>
> SL
>
