[slurm-dev] Re: I can´t send job for several nodes with gpus

Cooper, Trevor Tue, 27 Oct 2015 11:55:21 -0700

Is slurmd running on these nodes? Are there any errors related to GRES and/or 
GPU in the slurmd log?


-- Trevor

On Oct 27, 2015, at 11:19 AM, Fany Pagés Díaz 
<[email protected]<mailto:[email protected]>> wrote:

I have two GPUs (GTX 260) in each node. I have installed the driver 
NVIDIA-Linux-x86_64-319.49.run, cudatoolkit_3.2.16_linux_64_rhel5.5.run and 
gpucomputingsdk_3.2.16_linux. This is the output of nvidia-smi on nodes 
compute-0-3 and compute-0-4.

[root@compute-0-3 ~]# nvidia-smi
Tue Oct 27 09:47:01 2015
+------------------------------------------------------+
| NVIDIA-SMI 5.319.49   Driver Version: 319.49         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 260     Off  | 0000:07:00.0     N/A |                  N/A |
| 40%   46C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 260     Off  | 0000:08:00.0     N/A |                  N/A |
| 40%   46C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+
[root@compute-0-3 ~]#
--------------------------------------------------------
[root@compute-0-4 ~]# nvidia-smi
Tue Oct 27 09:50:45 2015
+------------------------------------------------------+
| NVIDIA-SMI 5.319.49   Driver Version: 319.49         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 260     Off  | 0000:07:00.0     N/A |                  N/A |
| 40%   46C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 260     Off  | 0000:08:00.0     N/A |                  N/A |
| 40%   47C  N/A     N/A /  N/A |        3MB /   895MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+
[root@compute-0-4 ~]#


From: Cooper, Trevor [mailto:[email protected]]
Sent: Friday, October 23, 2015 4:39 PM
To: slurm-dev
Subject: [slurm-dev] Re: I can´t send job for several nodes with gpus

This suggests you do not have GPUs in compute-0-3 and compute-0-4 OR you don't 
have the CUDA/NVIDIA driver installed OR you haven't 'initialized' the device 
entries.

What happens when you login to these nodes and run nvidia-smi?

We run nvidia-smi at the end of our driver/cuda installation to make the 
devices appear so that when we start slurm on the node it sees the GPUs.

If you try to start slurm BEFORE you install cuda/nvidia drivers AND run 
nvidia-smi it fails.

Look in your slurmd log on the nodes to see if that is the problem.

-- Trevor
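
A minimal sketch of the check described above, assuming two GPUs per node and 
the usual /dev/nvidiaN device files. check_gpu_devices is a hypothetical helper 
name, not part of Slurm or the NVIDIA driver; it only tests that the files exist.

```shell
# check_gpu_devices DIR COUNT
# Hypothetical helper: succeeds if DIR/nvidia0 .. DIR/nvidia(COUNT-1) all exist.
# It does not verify that the driver is loaded or that the devices work.
check_gpu_devices() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ ! -e "$dir/nvidia$i" ]; then
            # Report the first missing device file and stop.
            echo "missing $dir/nvidia$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $count device files present in $dir"
    return 0
}
```

On a compute node this could be used as `check_gpu_devices /dev 2 || nvidia-smi` 
before slurmd is started, so that the device entries exist when slurmd reads 
gres.conf.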

On Oct 23, 2015, at 1:04 PM, Fany Pagés Díaz 
<[email protected]<mailto:[email protected]>> wrote:

Now, when I send this command I get this error.

[root@cluster bin]# srun -n 2 -N 2 --gres=gpu: mpirun cuda+mpi
  We have 2 processors
  Spawning from compute-0-4.local
  CUDA MPI




  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
  We have 2 processors
  Spawning from compute-0-3.local
  CUDA MPI




  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
- compute-0-3.local     1    0 NONE

- compute-0-4.local     1    0 NONE

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

But when I run srun -n 2 -N 2 --gres=gpu:2 mpirun cuda+mpi I get:

[root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
srun: Force Terminated job 408
srun: error: Unable to allocate resources: Requested node configuration is not 
available

I posted in the slurm-roll discussion forum on SourceForge, but have had no 
reply yet.

Can anyone please help me? Thanks.


From: Werner Saar [mailto:[email protected]]
Sent: Friday, October 23, 2015 12:11 PM
To: slurm-dev
Subject: [slurm-dev] RE: I can´t send job for several nodes with gpus

Hi,

You are at the wrong place.
This may be a problem of the slurm-roll for the rocks cluster.
Please use the discussion about the slurm-roll at sourceforge.

Best regards
Werner
(maintainer of the slurm-roll)




On 10/23/2015 05:45 PM, Fany Pagés Díaz wrote:
I redid the whole configuration and nothing changed. This is the output I get 
when I run scontrol show node:

[root@cluster bin]# scontrol show node
NodeName=cluster CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=N/A Features=(null)
   Gres=gpu:2
   NodeAddr=10.8.52.254 NodeHostName=cluster Version=(null)
   RealMemory=1 AllocMem=0 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=None SlurmdStartTime=None
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Not responding [root@2015-10-23T10:10:25]

NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00 Features=rack-0,8CPUs
   Gres=gpu:4
   NodeAddr=10.8.52.253 NodeHostName=compute-0-0 Version=14.03
   OS=Linux RealMemory=5968 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488100
   BootTime=2015-10-22T15:05:08 SlurmdStartTime=2015-10-23T09:33:45
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


NodeName=compute-0-1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.08 Features=rack-0,8CPUs
   Gres=gpu:2
   NodeAddr=10.8.52.252 NodeHostName=compute-0-1 Version=14.03
   OS=Linux RealMemory=5972 AllocMem=0 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=447278 Weight=20488101
   BootTime=2015-10-22T15:06:09 SlurmdStartTime=2015-10-22T15:06:40
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


I don't know why node compute-0-0 reports gpu:4 when I configured gpu:2.
My cluster is completely homogeneous, and I need to send jobs to several nodes 
with GPUs, but the only node that works is compute-0-0.
Any idea? I sent the files slurm.conf and gres.conf (the files are on each node 
in /etc/slurm). I don't know what I missed in the configuration.
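
To see at a glance which GRES count each node reports, the scontrol output can 
be reduced to one line per node. This is only a sketch: list_node_gres is a 
hypothetical helper name, and it assumes the NodeName=... and Gres=... fields 
are spelled and laid out as in the output above.

```shell
# list_node_gres
# Hypothetical helper: reads `scontrol show node` output on stdin and prints
# "<node> <gres>" for every Gres= field, using the most recent NodeName= seen.
list_node_gres() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NodeName=/) {
                node = substr($i, 10)        # text after "NodeName="
            } else if ($i ~ /^Gres=/) {
                print node, substr($i, 6)    # text after "Gres="
            }
        }
    }'
}
```

Run as `scontrol show node | list_node_gres`. For the output above it would 
print cluster gpu:2, compute-0-0 gpu:4 and compute-0-1 gpu:2, one per line, 
which makes the odd gpu:4 entry easy to spot.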


From: Jared David Baker [mailto:[email protected]]
Sent: Wednesday, October 21, 2015 12:28 PM
To: slurm-dev
Subject: [slurm-dev] RE: I can´t send job for several nodes with gpus

Fany,

Run `scontrol show node` and post the output. Something may look strange in 
your nodes configuration.

Also, your gres.conf file may be nicer if it looks similar to this:

--
NodeName=compute-0-[0,3-4] Name=gpu Type=gtx260 File=/dev/nvidia[0-1]
--

Your file may be valid, but I guess I would generally prefer the above or if 
the system is completely homogeneous, you can use the form:

--
Name=gpu Type=gtx260 File=/dev/nvidia0
Name=gpu Type=gtx260 File=/dev/nvidia1
--

I would not use both at the same time though. That’s my quick two cents right 
now.

-Jared
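
One reading that is consistent with the scontrol output elsewhere in this 
thread: if a gres.conf contains both a NodeName-scoped line and unscoped 
Name=gpu lines, a node that matches both may be counted twice (gpu:4 instead of 
gpu:2). This is an assumption, not something confirmed in the thread. The sketch 
below only detects the mix; gres_conf_style is a hypothetical helper name, and 
it expects the exact spellings NodeName= and Name=gpu.

```shell
# gres_conf_style
# Hypothetical helper: reads a gres.conf on stdin and prints one of
# "mixed", "node-scoped", "unscoped" or "none" for its Name=gpu lines.
gres_conf_style() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            is_gpu = 0; has_node = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "Name=gpu") is_gpu = 1
                if ($i ~ /^NodeName=/) has_node = 1
            }
            if (is_gpu && has_node) scoped++
            else if (is_gpu) unscoped++
        }
        END {
            if (scoped && unscoped) print "mixed"
            else if (scoped) print "node-scoped"
            else if (unscoped) print "unscoped"
            else print "none"
        }
    '
}
```

Run as `gres_conf_style < /etc/slurm/gres.conf`. The gres.conf posted in this 
thread would be reported as mixed; keeping only one of the two forms gives 
node-scoped or unscoped.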




From: Fany Pagés Díaz [mailto:[email protected]]
Sent: Wednesday, October 21, 2015 9:12 AM
To: slurm-dev
Subject: [slurm-dev] I can´t send job for several nodes with gpus

I configured the cluster to send jobs to GPUs, but it does not work properly. 
When I send a job to one node it runs, but I get a small error (I can only send 
to node compute-0-0; the other nodes do not work). This is the output.

[root@cluster bin]# srun -n 2 -N 1 --gres=gpu:2 mpirun cudampi
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
  We have 2 processors
  Spawning from compute-0-0.local
  CUDA MPI

  Probing nodes...
     Node        Psid  CUDA Cards (devID)
     ----------- ----- ---- ----------
+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

+ compute-0-0.local     1    2 GeForce GTX 260 (0)  GeForce GTX 260 (1)

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
srun: error: compute-0-0: tasks 0-1: Exited with exit code 1
[root@cluster bin]#


But when I send a job to several nodes I get the following error.

[root@cluster bin]# srun -n 2 -N 2 --gres=gpu:2 mpirun cudampi
srun: Force Terminated job 408
srun: error: Unable to allocate resources: Requested node configuration is not 
available
[root@cluster bin]#

I don’t know what I missed, because I have the same configuration on all nodes.

This is the file /etc/slurm/slurm.conf

NodeName=cluster NodeAddr=10.8.52.254 gres=gpu:2
GresTypes=gpu
SelectType=select/cons_res

This is the file /etc/slurm/gres.conf (this file is in each node)

#gres configuration on the nodes
NodeName=compute-0-[0,3-4] Name=gpu File=/dev/nvidia[0-1]

#Configuration of two GPUs
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

Any ideas? Can anyone please help me? Thanks.
