Fany,

Run `scontrol show node` and post the output. Something may look off in
your node configuration.
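
If the full output is long, a small filter along these lines can pull out just the node name and the Gres field for each node. This is only a sketch: `gres_summary` is a made-up helper name, and it assumes the usual `Key=Value` layout that `scontrol show node` prints.

```shell
# Hypothetical helper: read `scontrol show node` output on stdin and
# print one "node gres" pair per node that reports a Gres= field.
gres_summary() {
  awk '
    # Remember the node name from the line that starts each record.
    /^NodeName=/ { split($1, a, "="); node = a[2] }
    # Print the value of the Gres= field wherever it appears.
    /Gres=/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^Gres=/) { sub(/^Gres=/, "", $i); print node, $i }
    }
  '
}
# Usage on the cluster (needs Slurm installed):
#   scontrol show node | gres_summary
```

A node that shows `(null)` or nothing for Gres here has no GPUs registered with the controller, whatever its local gres.conf says.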

Also, your gres.conf file may be nicer if it looks similar to this:

--
NodeName=compute-0-[0,3-4] Name=gpu Type=gtx260 File=/dev/nvidia[0-1]
--

Your file may be valid, but I would generally prefer the above. Or, if
the system is completely homogeneous, you can use this form:

--
Name=gpu Type=gtx260 File=/dev/nvidia0
Name=gpu Type=gtx260 File=/dev/nvidia1
--

I would not use both at the same time though. That’s my quick two cents right 
now.
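
For reference, here is a rough sketch of how the matching slurm.conf entries usually look. It assumes the GPU nodes are compute-0-0, compute-0-3 and compute-0-4, as in your gres.conf. The `CPUs=8` value is a placeholder, not something taken from your cluster. The point is that every node with GPUs needs `Gres=gpu:2` on its own NodeName line in slurm.conf. If only the head node carries it, that may explain the "Requested node configuration is not available" error.

```
# slurm.conf (sketch; CPUs=8 is a placeholder, adjust to your hardware)
GresTypes=gpu
NodeName=compute-0-[0,3-4] CPUs=8 Gres=gpu:2 State=UNKNOWN
```

After changing slurm.conf, the same file has to be on every node and the daemons restarted before the new Gres values are picked up.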

-Jared




From: Fany Pagés Díaz
Sent: Wednesday, October 21, 2015 9:12 AM
To: slurm-dev
Subject: [slurm-dev] I can´t send job for several nodes with gpus

I configured the cluster to run jobs on GPUs, but it does not work properly.
When I submit a job to one node it runs, but I get a small error (and I can
only submit to node compute-0-0, not to the others). This is the output:


[root@cluster bin]# srun -n 2 -N 1 --gres=gpu:2 mpirun cudampi
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
srun: error: compute-0-0: tasks 0-1: Exited with exit code 1
[root@cluster bin]#





But when I submit to several nodes I get the following error:

[root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
srun: Force Terminated job 408
srun: error: Unable to allocate resources: Requested node configuration is not available
[root@cluster bin]#

I don’t know what I missed, because I have the same configuration on all nodes.

This is the file /etc/slurm/slurm.conf

NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
GresTypes=gpu
SelectType=select/cons_res

This is the file /etc/slurm/gres.conf (this file is on each node)

#Gres configuration on the nodes
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]

#Configuration of two GPUs
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

Any ideas? Can anyone help me, please? Thanks.
