Hello,

Why are you setting the environment variable "CUDA_VISIBLE_DEVICES" yourself? SLURM should set it as part of scheduling the GPU resources.
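As a rough sketch of what I mean (not tested on your cluster), the job script could leave the variable alone and simply print what the job receives. I am assuming here that SLURM exports CUDA_VISIBLE_DEVICES for jobs that request gres=gpu; the show_visible_gpus helper is hypothetical and only there for diagnosis. Note also that every directive below is spelled "#SBATCH": sbatch treats a line starting with "#BATCH" as an ordinary comment, although that may just be a typo in your mail.

```shell
#!/bin/bash
#SBATCH --partition=gpu_nodes
#SBATCH --job-name=test_shoc
#SBATCH --gres=gpu:2
#SBATCH -N 1
#SBATCH --ntasks-per-node=2

# Deliberately no "export CUDA_VISIBLE_DEVICES=...": the assumption is that
# SLURM sets it for the job from the --gres allocation.

# Hypothetical helper (not part of SLURM or SHOC): report what the job sees.
show_visible_gpus() {
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<not set>}"
}
show_visible_gpus

# Launch line from the original script. It is skipped when mpirun or the
# benchmark binary is missing, so the script can be dry-run off the cluster.
triad=/opt/shared/shoc_gpu/bin/EP/CUDA/Triad
if command -v mpirun >/dev/null 2>&1 && [ -x "$triad" ]; then
    mpirun -np 2 "$triad"
else
    echo "mpirun or $triad not found; skipping launch" >&2
fi
```

If the first line of the job output reads "<not set>" for a job that requested GPUs, that would suggest the gres configuration is not being picked up.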
Try removing the export command and run the job again.

Regards!

2014-05-15 14:05 GMT+02:00 Francesca Tartaglione <[email protected]>:

> Dear all,
>
> we have a small cluster made of:
> - 1 frontend
> - 4 computing nodes (2 nodes with 4 NVIDIA K20Xm GPUs, 16 cores each)
> - Infiniband interconnect
>
> and slurm-2.6.6-2 configured on all nodes.
> On the GPU nodes, I've configured the gres file as follows:
>
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1
> Name=gpu File=/dev/nvidia2
> Name=gpu File=/dev/nvidia3
>
> Attached you can find the slurm.conf file as well as the scontrol show
> config output.
>
> Our problem is that when we use the gpu_node partition (the one for the
> GPU nodes), the scheduler does not seem to consider the gres
> specification.
> So, for example, if I submit an MPI job of 2 processes requesting 2 GPUs
> and afterwards other jobs of the same type, the slurm scheduler waits
> until all the cores are busy, rather than all the GPUs, before scheduling
> jobs on the other node (even when specifying #BATCH --gres=gpu:2).
> So I suppose the scheduler doesn't see the GPUs correctly.
> The script is:
>
> #SBATCH --partition=gpu_nodes
> #SBATCH --job-name=test_shoc
> #BATCH --gres=gpu:2
> #SBATCH -N 1
> #SBATCH --ntasks-per-node=2
>
> export CUDA_VISIBLE_DEVICES=0,1,2,3
> mpirun -np 2 /opt/shared/shoc_gpu/bin/EP/CUDA/Triad
>
> The only error I get when some jobs overlap with others is:
>
> MPI Task 0/1 starting....
> MPI Task 1/1 starting....
> Chose device: name='Tesla K20Xm' index=0
> error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
> Chose device: name='Tesla K20Xm' index=1
> error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[226,1],1]
> Exit code: 255
> --------------------------------------------------------------------------
>
> Do you have any idea regarding this behavior?
>
> Thanks in advance!
>
> Ing. Francesca Tartaglione
> HPC Team, System Engineer
> E4 Computer Engineering Spa
--
Sergio Iserte Agut, research assistant,
High Performance Computing & Architecture
Jaume I University (Castellón, Spain)
