Dear all,

we have a small cluster made of:

- 1 frontend
- 4 computing nodes (2 nodes with 4 NVIDIA K20Xm GPUs, 16 cores each)
- Infiniband interconnect

slurm-2.6.6-2 is configured on all nodes. On the GPU nodes, I've configured the gres file as follows:

    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=gpu File=/dev/nvidia2
    Name=gpu File=/dev/nvidia3

In attachment you can find the slurm.conf file as well as the scontrol show config output.

Our problem is that when we use the gpu_node partition (the one for the GPU nodes), the scheduler does not seem to consider the gres specification. For example, if I submit an MPI job of 2 processes requesting 2 GPUs, followed by other jobs of the same type, the Slurm scheduler does not start placing jobs on the other node until all the cores of the first node are busy, rather than when all of its GPUs are busy (even when specifying #BATCH --gres=gpu:2). So I suppose the scheduler doesn't see the GPUs correctly.

The script is:

    #SBATCH --partition=gpu_nodes
    #SBATCH --job-name=test_shoc
    #BATCH --gres=gpu:2
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=2
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    mpirun -np 2 /opt/shared/shoc_gpu/bin/EP/CUDA/Triad

The only error I get, when some jobs overlap with others, is:

    MPI Task 0/1 starting....
    MPI Task 1/1 starting....
    Chose device: name='Tesla K20Xm' index=0
    error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
    Chose device: name='Tesla K20Xm' index=1
    error=46 name=all CUDA-capable devices are busy or unavailable at ln: 103
    -------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code.. Per user-direction, the job has been aborted.
    -------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:

      Process name: [[226,1],1]
      Exit code:    255
    --------------------------------------------------------------------------

Do you have any idea regarding this behavior?

Thanks in advance!

Ing.
Francesca Tartaglione
HPC Team, System Engineer
E4 Computer Engineering Spa
Website: http://www.e4company.com
Attachments: slurm.conf, slurm_conf.rtf
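One detail in the script as posted may be worth checking: the gres line begins with `#BATCH`, while sbatch only treats lines beginning with `#SBATCH` as directives, so that line would be read as an ordinary comment. The sketch below is a small check for such lines. The function name `find_suspect_directives` is hypothetical (it is not part of Slurm), and it assumes the posted script matches what was actually submitted.

```shell
#!/bin/sh
# Hypothetical helper, not a Slurm tool: print the numbered lines of a batch
# script that look like scheduler directives but do not start with the exact
# "#SBATCH" prefix (for example "#BATCH --gres=gpu:2").
find_suspect_directives() {
    # $1: path to the batch script to inspect
    # First grep: comment lines whose first word ends in "BATCH" and is
    # followed by an option dash. Second grep: drop the correctly spelled ones.
    # "|| true" keeps the exit status at 0 when nothing suspicious is found.
    grep -n -E '^#[[:space:]]*[A-Za-z]*BATCH[[:space:]]+-' "$1" \
        | grep -v -E '^[0-9]+:#SBATCH[[:space:]]' || true
}
```

Running `find_suspect_directives job.sh` (with `job.sh` standing in for the real script name) prints nothing when every directive is spelled `#SBATCH`. Separately, the `export CUDA_VISIBLE_DEVICES=0,1,2,3` line exposes all four GPUs to every job regardless of what the scheduler allocated, which could by itself explain overlapping jobs contending for the same devices.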
