[slurm-dev] RE: I can't send job for several nodes with gpus

Werner Saar Fri, 23 Oct 2015 09:10:44 -0700

Hi,

This is the wrong place for this question.
It may be a problem with the slurm-roll for the Rocks cluster.
Please use the slurm-roll discussion on SourceForge.


Best regards
Werner
(maintainer of the slurm-roll)




On 10/23/2015 05:45 PM, Fany Pagés Díaz wrote:

I redid the whole configuration and nothing changed. This is the output when I run scontrol show node:


[root@cluster bin]# scontrol show node
NodeName=cluster CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
   Gres=gpu:2
   NodeAddr=10.8.52.254 NodeHostName=cluster Version=(null)
   RealMemory=1 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [root@2015-10-23T10:10:25]

NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=rack-0,8CPUs
   Gres=gpu:4
   NodeAddr=10.8.52.253 NodeHostName=compute-0-0 Version=14.03
   OS=Linux RealMemory=5968 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488100
   BootTime=2015-10-22T15:05:08 SlurmdStartTime=2015-10-23T09:33:45
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.08 Features=rack-0,8CPUs
   Gres=gpu:2
   NodeAddr=10.8.52.252 NodeHostName=compute-0-1 Version=14.03
   OS=Linux RealMemory=5972 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488101
   BootTime=2015-10-22T15:06:09 SlurmdStartTime=2015-10-22T15:06:40
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I don't know why node compute-0-0 reports gpu:4 when I configured it with gpu:2.
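
One way to compare the Gres value across nodes is to reduce the `scontrol show node` output to a few fields per node. The following is a minimal sketch, assuming the key=value layout shown above; `parse_scontrol_nodes` and `summarize` are hypothetical helper names, and `SAMPLE` is a trimmed excerpt of the output above.

```python
import re

# Trimmed excerpt of the `scontrol show node` output posted above.
SAMPLE = """\
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   Gres=gpu:4
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488100

NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=1
   Gres=gpu:2
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488101
"""


def parse_scontrol_nodes(text):
    """Return one dict of key=value fields per blank-line-separated node record."""
    nodes = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Values containing spaces (e.g. Reason=...) are cut at the first space.
        fields = dict(re.findall(r"(\w+)=(\S+)", block))
        if "NodeName" in fields:
            nodes.append(fields)
    return nodes


def summarize(text):
    """List (NodeName, Gres, State) for every node record in the text."""
    return [
        (node["NodeName"], node.get("Gres", "(none)"), node.get("State", "?"))
        for node in parse_scontrol_nodes(text)
    ]


for name, gres, state in summarize(SAMPLE):
    print(f"{name:12} {gres:8} {state}")
```

On a live system the same functions could be fed the text captured from `scontrol show node` instead of the excerpt.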

My cluster is completely homogeneous, and I need to send jobs to several nodes with GPUs, but the only node that works is compute-0-0.

Any idea? I sent the slurm.conf and gres.conf files (they are on each node in /etc/slurm). I don't know what I missed in the configuration.


*From:* Jared David Baker [mailto:[email protected]]
*Sent:* Wednesday, October 21, 2015 12:28
*To:* slurm-dev
*Subject:* [slurm-dev] RE: I can't send job for several nodes with gpus

Fany,

Run `scontrol show node` and post the output. Something may look strange in your node configuration.


Also, your gres.conf file may be nicer if it looks similar to this:

--

NodeName=compute-0-[0,3-4] Name=gpu Type=gtx260 File=/dev/nvidia[0-1]

--

Your file may be valid, but I would generally prefer the above, or, if the system is completely homogeneous, you can use the form:


--

Name=gpu Type=gtx260 File=/dev/nvidia0

Name=gpu Type=gtx260 File=/dev/nvidia1

--

I would not use both at the same time though. That’s my quick two cents right now.
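
Why mixing both forms can matter is easy to sketch in Python. This is only an illustration, assuming that every gres.conf line applying to a node adds its device files to that node's GPU count (an assumption about Slurm's behavior, not something confirmed in this thread). `expand_brackets` and `gpus_per_node` are hypothetical helpers, and the parser handles only the simple single-bracket syntax of the gres.conf posted in the original message.

```python
import re

# The gres.conf from the original message: a NodeName line plus two global lines.
GRES_CONF = """\
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
"""


def expand_brackets(expr):
    """Expand one bracket range, e.g. 'compute-0-[0,3-4]' into three names."""
    match = re.fullmatch(r"(.*)\[([\d,\-]+)\](.*)", expr)
    if match is None:
        return [expr]  # no bracket: a single name
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        for index in range(int(low), int(high or low) + 1):
            names.append(f"{prefix}{index}{suffix}")
    return names


def gpus_per_node(gres_lines, nodes):
    """Count GPU device files per node, treating every matching line as additive."""
    counts = {node: 0 for node in nodes}
    for line in gres_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = dict(token.split("=", 1) for token in line.split())
        if fields.get("Name") != "gpu":
            continue
        devices = expand_brackets(fields["File"])
        # ASSUMPTION: a line without NodeName applies to every node.
        targets = expand_brackets(fields["NodeName"]) if "NodeName" in fields else nodes
        for node in targets:
            if node in counts:
                counts[node] += len(devices)
    return counts


print(gpus_per_node(GRES_CONF.splitlines(), ["compute-0-0", "compute-0-1"]))
# {'compute-0-0': 4, 'compute-0-1': 2}
```

Under that assumption the combined file gives compute-0-0 four devices and compute-0-1 two, which would be consistent with the Gres values `scontrol show node` reports for those nodes in this thread, although the thread does not confirm the cause.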


-Jared

*From:*Fany Pagés Díaz [mailto:[email protected]]
*Sent:* Wednesday, October 21, 2015 9:12 AM
*To:* slurm-dev
*Subject:* [slurm-dev] I can't send job for several nodes with gpus

I configured the cluster to send jobs to GPUs, but it does not work properly. When I send a job to one node it works, but I get a small error (I can only send to node compute-0-0; for the others I can't). This is the output:


[root@cluster bin]# srun -n 2 -N 1 --gres=gpu:2 mpirun cudampi
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
*srun: error: compute-0-0: tasks 0-1: Exited with exit code 1*
[root@cluster bin]#

*But when I send to several nodes I get the following error.*
[root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
srun: Force Terminated job 408

srun: error: Unable to allocate resources: Requested node configuration is not available

[root@cluster bin]#

I don’t know what I missed, because I have the same configuration on all nodes.


*This is the file /etc/slurm/slurm.conf*

NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2

GresTypes=gpu

SelectType=select/cons_res

*This is the file /etc/slurm/gres.conf (this file is on each node)*

# gres configuration on the nodes
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]

#Configuration of two GPUs
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

Any ideas? Can anyone please help me? Thanks.
