Dear all,

We have a small cluster consisting of:
- 1 frontend
- 4 compute nodes (2 of them with 4 NVIDIA K20Xm GPUs, 16 cores each)
- InfiniBand interconnect

Slurm 2.6.6-2 is configured on all nodes.
On the GPU nodes, I have configured the gres file (gres.conf) as follows:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
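
For context, these four lines only matter if slurm.conf also declares
GresTypes=gpu and a matching Gres=gpu:4 on the NodeName line of the GPU nodes.
A small sketch for checking the device count on a node, assuming a
hypothetical helper name (count_gres_gpus) and one File= line per device as
above:

```shell
# Hypothetical helper (not part of Slurm): count the GPU device lines
# declared in a gres.conf-style file, so the result can be compared
# with the Gres=gpu:N value on the node's line in slurm.conf.
count_gres_gpus() {
    # $1: path to the gres.conf file
    grep -cE '^[[:space:]]*Name=gpu[[:space:]]+File=' "$1"
}
```

It would be called as count_gres_gpus /etc/slurm/gres.conf, where the path is
an assumption and depends on how Slurm was installed.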

Attached you will find the slurm.conf file as well as the output of
scontrol show config.

Our problem is that when jobs are submitted to the gpu_nodes partition (the
one containing the GPU nodes), the scheduler does not seem to take the gres
specification into account.
For example, if I submit an MPI job with 2 processes requesting 2 GPUs,
followed by further jobs of the same type, the scheduler only starts placing
jobs on the other node once all the cores of the first node are busy, not
once all its GPUs are busy (even when specifying #BATCH --gres=gpu:2).
So I suppose the scheduler does not see the GPUs correctly.
The script is:

#SBATCH --partition=gpu_nodes
#SBATCH --job-name=test_shoc
#BATCH --gres=gpu:2
#SBATCH -N 1
#SBATCH --ntasks-per-node=2

export CUDA_VISIBLE_DEVICES=0,1,2,3
mpirun -np 2 /opt/shared/shoc_gpu/bin/EP/CUDA/Triad
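
One thing worth ruling out: in the script as pasted above, the gres line
starts with #BATCH rather than #SBATCH. sbatch only reads options from lines
beginning with #SBATCH, so if the real file has the same spelling, the GPU
request would be treated as a plain comment. A minimal sketch of a check,
assuming a hypothetical helper name (find_ignored_directives):

```shell
# Hypothetical helper (not part of Slurm): list lines that look like batch
# directives but do not begin with the exact "#SBATCH" prefix, which means
# sbatch would silently skip them as ordinary comments.
find_ignored_directives() {
    # $1: path to the batch script
    # First grep: "#", optional spaces, a word ending in BATCH, then an option.
    # Second grep: drop the correctly spelled "#SBATCH" lines.
    grep -nE '^#[[:space:]]*[A-Za-z]*BATCH[[:space:]]+-' "$1" \
        | grep -vE '^[0-9]+:#SBATCH[[:space:]]'
}
```

Running find_ignored_directives job.sh (job.sh being a placeholder file name)
prints each suspicious line prefixed with its line number, and prints nothing
when every directive is spelled correctly.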


The only error I get when some jobs overlap with others is:

MPI Task 0/1 starting....
MPI Task 1/1 starting....
Chose device: name='Tesla K20Xm' index=0
error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
  Chose device: name='Tesla K20Xm' index=1
error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
  -------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[226,1],1]
  Exit code:    255
--------------------------------------------------------------------------
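
One detail in the script that may be related to error 46: the export line
overwrites CUDA_VISIBLE_DEVICES with all four devices, whereas my
understanding is that Slurm's gres/gpu plugin sets this variable itself for
jobs that were granted GPUs. If the cards are in an exclusive compute mode
(an assumption, I have not confirmed it here), two jobs that both pick index 0
and 1 would collide as in the output above. A hedged sketch of a variant that
keeps the scheduler's value, with pick_gpus being a hypothetical helper name:

```shell
# Hypothetical helper: prefer the device list exported by the scheduler,
# and fall back to all four GPUs only when none was provided
# (for example when the script is run by hand outside Slurm).
pick_gpus() {
    printf '%s\n' "${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
}

CUDA_VISIBLE_DEVICES=$(pick_gpus)
export CUDA_VISIBLE_DEVICES
echo "Using GPUs: $CUDA_VISIBLE_DEVICES"
```

These lines would replace the unconditional export in the batch script, so
that each job only sees the devices it was actually assigned.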

Do you have any idea regarding this behavior?

Thanks in advance!

Ing. Francesca Tartaglione
HPC Team, System Engineer
E4 Computer Engineering Spa
Switchboard: +39 0522 991811 . Fax: +39 0522 991803
Website: http://www.e4company.com


Attachment: slurm.conf
Description: slurm.conf

Attachment: slurm_conf.rtf
Description: slurm_conf.rtf
