Hello,

Why are you setting the environment variable "CUDA_VISIBLE_DEVICES" yourself?
SLURM should set it when it allocates the GPUs to the job.

Try removing the export command from the script and submitting the job again.
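
As a rough sketch of how to check what the job actually receives, something like the following could go in the job script in place of the export line. The helper name gpu_allocation_report is made up for illustration and is not part of SLURM; the only assumption is that SLURM sets CUDA_VISIBLE_DEVICES to a comma-separated list of device indices for the job.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): report the GPUs visible to the job.
# It only reads CUDA_VISIBLE_DEVICES; it never sets or overrides it.
gpu_allocation_report() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is not set"
        return 1
    fi
    # Count the comma-separated device indices.
    count=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
    echo "SLURM allocated $count GPU(s): $CUDA_VISIBLE_DEVICES"
}
```

Calling gpu_allocation_report just before the mpirun line shows in the job output whether the gres request was honoured. Note also that sbatch only reads directives on lines that start with #SBATCH, so if the gres line in the real script is spelled "#BATCH --gres=gpu:2" as in the quoted text, it is treated as a plain comment and no GPUs are requested.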

Regards!



2014-05-15 14:05 GMT+02:00 Francesca Tartaglione <[email protected]>:

> Dear all,
>
>
>
> we have a small cluster made of:
>
> - 1 frontend
>
> - 4 compute nodes (2 of them with 4 NVIDIA K20Xm GPUs, 16 cores each)
>
> - InfiniBand interconnect
>
>
>
> slurm-2.6.6-2 is configured on all nodes.
>
> On the GPU nodes, I’ve configured the gres file as follows:
>
> Name=gpu File=/dev/nvidia0
>
> Name=gpu File=/dev/nvidia1
>
> Name=gpu File=/dev/nvidia2
>
> Name=gpu File=/dev/nvidia3
>
>
>
> Attached you can find the slurm.conf file as well as the output of
> scontrol show config.
>
>
>
> Our problem is that when the gpu_node partition (the one containing the
> GPU nodes) is used, the scheduler does not seem to consider the gres
> specification.
>
> For example, if I submit an MPI job of 2 processes requesting 2 GPUs,
> and afterwards other jobs of the same type, the SLURM scheduler waits to
> schedule jobs on the other node until all the cores are busy, rather than
> all the GPUs (even when specifying #BATCH --gres=gpu:2).
>
> So I suppose the scheduler doesn’t see the GPUs correctly.
>
> The script is:
>
>
>
> #SBATCH --partition=gpu_nodes
>
> #SBATCH --job-name=test_shoc
>
> #BATCH --gres=gpu:2
>
> #SBATCH -N 1
>
> #SBATCH --ntasks-per-node=2
>
>
>
> export CUDA_VISIBLE_DEVICES=0,1,2,3
>
> mpirun -np 2 /opt/shared/shoc_gpu/bin/EP/CUDA/Triad
>
>
>
>
>
> The only error I get when some jobs overlap with others is:
>
>
>
> MPI Task 0/1 starting....
>
> MPI Task 1/1 starting....
>
> Chose device: name='Tesla K20Xm' index=0
>
> error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
>
>   Chose device: name='Tesla K20Xm' index=1
>
> error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
>
>   -------------------------------------------------------
>
> Primary job  terminated normally, but 1 process returned
>
> a non-zero exit code.. Per user-direction, the job has been aborted.
>
> -------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
>
> the job to be terminated. The first process to do so was:
>
>
>
>   Process name: [[226,1],1]
>
>   Exit code:    255
>
> --------------------------------------------------------------------------
>
>
>
> Do you have any idea regarding this behavior?
>
>
>
> Thanks in advance!
>
>
>
> *Ing. Francesca Tartaglione*
>
> HPC Team, System Engineer
>
> *E4 Computer Engineering Spa*
>
> Switchboard: +39 0522 991811 . Fax: +39 0522 991803
>
> Email: [email protected]
>
> Website: http://www.e4company.com
>



-- 
*Sergio Iserte Agut, research assistant,*
*High Performance Computing & Architecture*
*Jaume I University (Castellón, Spain)*
