I configured the cluster to submit GPU jobs, but it is not working properly.
When I submit a job to a single node it runs, but it ends with an error. Also,
I can only submit to node compute-0-0; submitting to the other nodes fails.
This is the output:
[root@cluster bin]# srun -n 2 -N 1 --gres=gpu:2 mpirun cudampi
We have 2 processors
Spawning from compute-0-0.local
CUDA MPI
Probing nodes...
Node Psid CUDA Cards (devID)
----------- ----- ---- ----------
We have 2 processors
Spawning from compute-0-0.local
CUDA MPI
Probing nodes...
Node Psid CUDA Cards (devID)
----------- ----- ---- ----------
+ compute-0-0.local 1 2 GeForce GTX 260 (0) GeForce GTX 260 (1)
+ compute-0-0.local 1 2 GeForce GTX 260 (0) GeForce GTX 260 (1)
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
srun: error: compute-0-0: tasks 0-1: Exited with exit code 1
[root@cluster bin]#
But when I submit to several nodes I get the following error:
[root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
srun: Force Terminated job 408
srun: error: Unable to allocate resources: Requested node configuration is
not available
[root@cluster bin]#
I don't know what I missed, because I have the same configuration on all
nodes.
This is the file /etc/slurm/slurm.conf:
NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
GresTypes=gpu
SelectType=select/cons_res
This is the file /etc/slurm/gres.conf (the same file is on each node):
#gres configuration on the nodes
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]
#Configuration of two GPUs
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
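
For comparison, here is a minimal sketch of how the two files are usually expected to line up. It assumes the GPUs are in the compute nodes compute-0-[0,3-4] (names taken from the gres.conf above) and that each of them has two devices. The State value is an assumption for illustration, not a verified setup. Since --gres is requested per node, "-N 2 --gres=gpu:2" can only be satisfied if at least two nodes in slurm.conf advertise gpu:2.

```
# /etc/slurm/slurm.conf (sketch, assumed values)
GresTypes=gpu
SelectType=select/cons_res
# Gres is declared on the nodes that actually hold the GPUs
NodeName=compute-0-[0,3-4] Gres=gpu:2 State=UNKNOWN

# /etc/slurm/gres.conf on each compute node (sketch)
# each device is listed once; the count should match Gres=gpu:2 above
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```

In this sketch the devices are declared either as a range (File=/dev/nvidia[0-1]) or one per line, not both, because listing the same device files twice may make the GPU count in gres.conf disagree with the count in slurm.conf.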
Any ideas? Can anyone help me? Thanks.