This suggests that you do not have GPUs in compute-0-3 and compute-0-4, OR you don't have the CUDA/NVIDIA driver installed, OR you haven't 'initialized' the device entries.
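A minimal sketch of how these cases might be told apart on a node, assuming the usual /dev/nvidiaN device naming. The function names `count_nvidia_devices` and `diagnose` are hypothetical helpers for illustration; they are not part of Slurm or the NVIDIA tools.

```python
import glob
import os
import shutil


def count_nvidia_devices(dev_dir="/dev"):
    """Count entries named nvidia0, nvidia1, ... under dev_dir.

    Control nodes such as nvidiactl are not counted, because the
    pattern requires a digit right after 'nvidia'.
    """
    pattern = os.path.join(glob.escape(dev_dir), "nvidia[0-9]*")
    return len(glob.glob(pattern))


def diagnose(dev_dir="/dev", smi_found=None):
    """Return a short status string for the node this runs on.

    smi_found may be passed explicitly; by default the nvidia-smi
    binary is looked up on PATH.
    """
    if smi_found is None:
        smi_found = shutil.which("nvidia-smi") is not None
    if not smi_found:
        # Driver/CUDA tools are probably not installed.
        return "no-driver"
    n = count_nvidia_devices(dev_dir)
    if n == 0:
        # Either there is no GPU, or the device entries have not
        # been created yet; this check cannot tell the two apart.
        return "no-device-entries"
    return f"ok:{n}"


if __name__ == "__main__":
    print(diagnose())
```

Run on each compute node, a result of `no-device-entries` would point at the second or third case; running nvidia-smi once and checking again narrows it further.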
What happens when you log in to these nodes and run nvidia-smi? We run nvidia-smi at the end of our driver/CUDA installation to make the devices appear, so that when we start Slurm on the node it sees the GPUs. If you start Slurm BEFORE you install the CUDA/NVIDIA drivers AND run nvidia-smi, it fails. Look in the slurmd log on the nodes to see if that is the problem.

-- Trevor

On Oct 23, 2015, at 1:04 PM, Fany Pagés Díaz <[email protected]> wrote:

Now, when I send this command I get this error.

    [root@cluster bin]# srun -n 2 -N 2 --gres=gpu: mpirun cuda+mpi
    We have 2 processors
    Spawning from compute-0-4.local
    CUDA MPI
    Probing nodes...
    Node Psid CUDA Cards (devID)
    ----------- ----- ---- ----------
    We have 2 processors
    Spawning from compute-0-3.local
    CUDA MPI
    Probing nodes...
    Node Psid CUDA Cards (devID)
    ----------- ----- ---- ----------
    - compute-0-3.local 1 0 NONE
    - compute-0-4.local 1 0 NONE
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process that caused that situation.

But when I send srun -n 2 -N 2 --gres=gpu:2 mpirun cuda+mpi I get:

    [root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
    srun: Force Terminated job 408
    srun: error: Unable to allocate resources: Requested node configuration is not available

I posted in the slurm-roll discussion on SourceForge, but there is no reply yet. Can anyone please help me? Thanks

From: Werner Saar [mailto:[email protected]]
Sent: Friday, October 23, 2015 12:11
To: slurm-dev
Subject: [slurm-dev] RE: I can´t send job for several nodes with gpus

Hi,

You are in the wrong place. This may be a problem with the slurm-roll for the Rocks cluster.
Please use the slurm-roll discussion on SourceForge.

Best regards
Werner (maintainer of the slurm-roll)

On 10/23/2015 05:45 PM, Fany Pagés Díaz wrote:

I did all the configuration again and nothing changed. This is my output when I run scontrol show node:

    [root@cluster bin]# scontrol show node
    NodeName=cluster CoresPerSocket=1
       CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
       Gres=gpu:2
       NodeAddr=10.8.52.254 NodeHostName=cluster Version=(null)
       RealMemory=1 AllocMem=0 Sockets=1 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
       BootTime=None SlurmdStartTime=None
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
       Reason=Not responding [root@2015-10-23T10:10:25]

    NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
       CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=rack-0,8CPUs
       Gres=gpu:4
       NodeAddr=10.8.52.253 NodeHostName=compute-0-0 Version=14.03
       OS=Linux RealMemory=5968 AllocMem=0 Sockets=8 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488100
       BootTime=2015-10-22T15:05:08 SlurmdStartTime=2015-10-23T09:33:45
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=1
       CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.08 Features=rack-0,8CPUs
       Gres=gpu:2
       NodeAddr=10.8.52.252 NodeHostName=compute-0-1 Version=14.03
       OS=Linux RealMemory=5972 AllocMem=0 Sockets=8 Boards=1
       State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488101
       BootTime=2015-10-22T15:06:09 SlurmdStartTime=2015-10-22T15:06:40
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I don't know why node compute-0-0 reports gpu:4 when I configured gpu:2. My cluster is completely homogeneous, and I need to send jobs to several nodes with GPUs, but the only node that works is compute-0-0. Any idea?
I sent the files slurm.conf and gres.conf (the files are on each node in /etc/slurm). I don't know what I missed in the configuration.

From: Jared David Baker [mailto:[email protected]]
Sent: Wednesday, October 21, 2015 12:28
To: slurm-dev
Subject: [slurm-dev] RE: I can´t send job for several nodes with gpus

Fany,

Run `scontrol show node` and post the output. Something may look strange in your node configuration. Also, your gres.conf file may be nicer if it looks similar to this:

--
NodeName=compute-0-[0,3-4] Name=gpu Type=gtx260 File=/dev/nvidia[0-1]
--

Your file may be valid, but I would generally prefer the above. Or, if the system is completely homogeneous, you can use the form:

--
Name=gpu Type=gtx260 File=/dev/nvidia0
Name=gpu Type=gtx260 File=/dev/nvidia1
--

I would not use both at the same time though. That's my quick two cents right now.

-Jared

From: Fany Pagés Díaz [mailto:[email protected]]
Sent: Wednesday, October 21, 2015 9:12 AM
To: slurm-dev
Subject: [slurm-dev] I can´t send job for several nodes with gpus

I configured the cluster to send jobs to GPUs, but it does not work correctly. When I send a job to one node it works, but I get a small error (I can only send to node compute-0-0, not to the others). This is the output.

    [root@cluster bin]# srun -n 2 -N 1 --gres=gpu:2 mpirun cudampi
    We have 2 processors
    Spawning from compute-0-0.local
    CUDA MPI
    Probing nodes...
    Node Psid CUDA Cards (devID)
    ----------- ----- ---- ----------
    We have 2 processors
    Spawning from compute-0-0.local
    CUDA MPI
    Probing nodes...
    Node Psid CUDA Cards (devID)
    ----------- ----- ---- ----------
    + compute-0-0.local 1 2 GeForce GTX 260 (0) GeForce GTX 260 (1)
    + compute-0-0.local 1 2 GeForce GTX 260 (0) GeForce GTX 260 (1)
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
    --------------------------------------------------------------------------
    srun: error: compute-0-0: tasks 0-1: Exited with exit code 1
    [root@cluster bin]#

But when I send to several nodes I get the following error.

    [root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
    srun: Force Terminated job 408
    srun: error: Unable to allocate resources: Requested node configuration is not available
    [root@cluster bin]#

I don't know what I missed, because I have the same configuration on all nodes.

This is the file /etc/slurm/slurm.conf:

    NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
    GresTypes=gpu
    SelectType=select/cons_res

This is the file /etc/slurm/gres.conf (this file is on each node):

    # gres configuration on the nodes
    NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]
    # Configuration of two GPUs
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

Any idea? Can anyone please help me? Thanks
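Jared's advice not to mix the two gres.conf forms can be illustrated with a toy model. The sketch below assumes that every matching `Name=gpu` line adds its `File=` entries to a node's GPU count, and that a line without `NodeName=` applies to all nodes. This is an assumption for illustration, not Slurm's actual parser, and the helper names `expand_brackets` and `gpus_per_node` are hypothetical. Under that assumption, the posted file yields the same counts that `scontrol show node` reported.

```python
import re

# The three non-comment lines of the gres.conf posted in this thread.
POSTED_GRES_CONF = """\
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
"""


def expand_brackets(expr):
    """Expand a single bracket range, e.g. 'compute-0-[0,3-4]'."""
    m = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, body, suffix = m.groups()
    names = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a lone number is a range of one
        for i in range(int(lo), int(hi) + 1):
            names.append(f"{prefix}{i}{suffix}")
    return names


def gpus_per_node(gres_conf_text, node):
    """Count GPU device files that apply to `node` (toy model)."""
    count = 0
    for line in gres_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = dict(t.split("=", 1) for t in line.split() if "=" in t)
        if fields.get("Name") != "gpu" or "File" not in fields:
            continue
        nodes = fields.get("NodeName")
        if nodes is not None and node not in expand_brackets(nodes):
            continue  # line is restricted to other nodes
        count += len(expand_brackets(fields["File"]))
    return count


if __name__ == "__main__":
    for node in ("compute-0-0", "compute-0-1"):
        print(node, gpus_per_node(POSTED_GRES_CONF, node))
```

In this model compute-0-0 matches all three lines and gets 4, while compute-0-1 matches only the two generic lines and gets 2. Keeping just one of the two forms would give every listed node 2.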
