I would guess that you do not have Slurm configured correctly for the GPUs. Documentation is available here:
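As a rough sketch of what that configuration looks like (the node names, device paths, and CPU ranges below are assumptions for an 8-GPU node, not taken from your cluster), slurm.conf needs GresTypes=gpu plus a Gres= entry on each node definition, and each node needs a gres.conf listing the GPU device files:

    # slurm.conf (excerpt)
    GresTypes=gpu
    NodeName=node[1-4] Gres=gpu:8 State=UNKNOWN

    # gres.conf on each compute node
    # (a File= range like /dev/nvidia[0-7] is also accepted)
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=gpu File=/dev/nvidia2
    Name=gpu File=/dev/nvidia3
    Name=gpu File=/dev/nvidia4
    Name=gpu File=/dev/nvidia5
    Name=gpu File=/dev/nvidia6
    Name=gpu File=/dev/nvidia7

After changing these files, restart slurmctld and the slurmd daemons so the new gres definitions take effect.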
http://www.schedmd.com/slurmdocs/gres.html

Quoting Sa Li <[email protected]>:

> Hello, Slurm team
>
> I have constructed an Ubuntu cluster consisting of a few nodes; each node
> is a machine equipped with 8 GeForce GTX 690 GPUs. On each node, I have
> successfully run my code on each GPU device without any Slurm command,
> like
>
>     ./myCode deviceID .....
>
> The command-line argument "deviceID" is used for cudaSetDevice(deviceID).
> This works perfectly to access all GPUs on a selected node.
>
> However, if I run Slurm so that it sets CUDA_VISIBLE_DEVICES, and my code
> takes CUDA_VISIBLE_DEVICES on the command line, it always fails with:
>
>     "CUDA Runtime API error 10: invalid device ordinal."
>
> Apparently this error comes from cudaSetDevice(deviceID). The script is
> test.sh:
>
>     #!/bin/bash
>     echo `hostname` - $CUDA_VISIBLE_DEVICES
>     ./myCode $CUDA_VISIBLE_DEVICES ...
>     sleep 10
>
> and I run it with:
>
>     srun --gres=gpu:1 ./test.sh &
>
> So my question is why I can run my code individually by passing the
> deviceID on the command line, but cannot submit jobs through Slurm. I
> assume I may need to make some changes to the configuration file, but I
> have searched online and found no answers. It would be highly appreciated
> if your team could provide some clues to solve the problem. I look
> forward to your reply.
>
> Thanks
>
> SL
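One additional point worth checking, independent of the gres configuration: when CUDA_VISIBLE_DEVICES is set, the CUDA runtime renumbers the visible devices starting from 0. So if Slurm allocates physical GPU 3 and exports CUDA_VISIBLE_DEVICES=3, the job sees exactly one device whose ordinal is 0, and cudaSetDevice(3) will fail with "invalid device ordinal". A sketch of test.sh under that assumption (./myCode and its arguments are from your message, not something I can verify) would always pass ordinal 0 inside the job:

    #!/bin/bash
    # Slurm restricts visible GPUs via CUDA_VISIBLE_DEVICES, and CUDA
    # renumbers the visible devices from 0. With --gres=gpu:1 the single
    # allocated GPU is therefore always ordinal 0, regardless of which
    # physical GPU Slurm assigned.
    echo `hostname` - $CUDA_VISIBLE_DEVICES
    ./myCode 0 ...
    sleep 10

Equivalently, the code could call cudaSetDevice(0) itself, or iterate over cudaGetDeviceCount() rather than trusting an externally supplied ordinal.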
