Hi Marcus,

This may depend on ConstrainDevices in cgroup.conf. I guess it is set to "no" in your case.
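For illustration, a minimal cgroup.conf sketch with device constraining enabled
(assumptions: TaskPlugin in slurm.conf includes task/cgroup, and the other
Constrain* lines are only examples, not a recommendation):

    CgroupAutomount=yes
    ConstrainDevices=yes    # jobs only see the GPUs allocated to them
    ConstrainCores=yes
    ConstrainRAMSpace=yes

That would fit both observations in this thread: with ConstrainDevices=yes a job
only sees its own GPUs, so CUDA_VISIBLE_DEVICES is renumbered from 0 inside the
job; with "no" the job sees all devices and the variable keeps the host indices.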
Best regards,
Taras

On Tue, Jun 23, 2020 at 4:02 PM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:
> Hi Kota,
>
> thanks for the hint.
>
> Yet, I'm still a little bit astonished, as if I remember right,
> CUDA_VISIBLE_DEVICES in a cgroup always starts from zero. That was already
> the case years ago, when we were still using LSF.
>
> But SLURM_JOB_GPUS seems to be the right thing:
>
> same node, two different users (and therefore jobs)
>
> $> xargs --null --max-args=1 echo < /proc/32719/environ | egrep "GPU|CUDA"
> SLURM_JOB_GPUS=0
> CUDA_VISIBLE_DEVICES=0
> GPU_DEVICE_ORDINAL=0
>
> $> xargs --null --max-args=1 echo < /proc/109479/environ | egrep "GPU|CUDA"
> SLURM_MEM_PER_GPU=6144
> SLURM_JOB_GPUS=1
> CUDA_VISIBLE_DEVICES=0
> GPU_DEVICE_ORDINAL=0
> CUDA_ROOT=/usr/local_rwth/sw/cuda/10.1.243
> CUDA_PATH=/usr/local_rwth/sw/cuda/10.1.243
> CUDA_VERSION=101
>
> SLURM_JOB_GPUS differs:
>
> $> scontrol show -d job 14658274
> ...
>    Nodes=nrg02 CPU_IDs=24 Mem=8192 GRES_IDX=gpu:volta(IDX:1)
>
> $> scontrol show -d job 14673550
> ...
>    Nodes=nrg02 CPU_IDs=0 Mem=8192 GRES_IDX=gpu:volta(IDX:0)
>
> Is there anyone out there who can confirm this besides me?
>
> Best
> Marcus
>
> On 23.06.2020 at 04:51, Kota Tsuyuzaki wrote:
> >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> >> starts from zero. So this is NOT the index of the GPU.
> >
> > Thanks. Just FYI, when I tested the environment variables with Slurm
> > 19.05.2 + proctrack/cgroup configuration, it looks like
> > CUDA_VISIBLE_DEVICES matches the indices of the host devices (i.e. it
> > does not start from zero). I'm not sure if the behavior has changed in
> > newer Slurm versions, though.
> >
> > I also found that SLURM_JOB_GPUS and GPU_DEVICE_ORDINAL are set in the
> > environment, which can be useful. In my current tests, those variables
> > had the same values as CUDA_VISIBLE_DEVICES.
> >
> > Any advice on what I should look for is always welcome.
> >
> > Best,
> > Kota
> >
> >> -----Original Message-----
> >> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcus Wagner
> >> Sent: Tuesday, June 16, 2020 9:17 PM
> >> To: slurm-users@lists.schedmd.com
> >> Subject: Re: [slurm-users] How to view GPU indices of the completed jobs?
> >>
> >> Hi David,
> >>
> >> if I remember right, if you use cgroups, CUDA_VISIBLE_DEVICES always
> >> starts from zero. So this is NOT the index of the GPU.
> >>
> >> Just verified it:
> >> $> nvidia-smi
> >> Tue Jun 16 13:28:47 2020
> >> +-----------------------------------------------------------------------------+
> >> | NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
> >> ...
> >> +-----------------------------------------------------------------------------+
> >> | Processes:                                                       GPU Memory |
> >> |  GPU       PID   Type   Process name                             Usage      |
> >> |=============================================================================|
> >> |    0     17269      C   gmx_mpi                                      679MiB |
> >> |    1     19246      C   gmx_mpi                                      513MiB |
> >> +-----------------------------------------------------------------------------+
> >>
> >> $> squeue -w nrg04
> >>    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >> 14560009  c18g_low     egf5 bk449967  R 1-00:17:48      1 nrg04
> >> 14560005  c18g_low     egf1 bk449967  R 1-00:20:23      1 nrg04
> >>
> >> $> scontrol show job -d 14560005
> >> ...
> >>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> >>      Nodes=nrg04 CPU_IDs=0-23 Mem=93600 GRES_IDX=gpu(IDX:0)
> >>
> >> $> scontrol show job -d 14560009
> >> JobId=14560009 JobName=egf5
> >> ...
> >>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> >>      Nodes=nrg04 CPU_IDs=24-47 Mem=93600 GRES_IDX=gpu(IDX:1)
> >>
> >> From the PIDs in the nvidia-smi output:
> >>
> >> $> xargs --null --max-args=1 echo < /proc/17269/environ | grep CUDA_VISIBLE
> >> CUDA_VISIBLE_DEVICES=0
> >>
> >> $> xargs --null --max-args=1 echo < /proc/19246/environ | grep CUDA_VISIBLE
> >> CUDA_VISIBLE_DEVICES=0
> >>
> >> So this is only a way to see how MANY devices were used, not which.
> >>
> >> Best
> >> Marcus
> >>
> >> On 10.06.2020 at 20:49, David Braun wrote:
> >>> Hi Kota,
> >>>
> >>> This is from the job template that I give to my users:
> >>>
> >>> # Collect some information about the execution environment that may
> >>> # be useful should we need to do some debugging.
> >>>
> >>> echo "CREATING DEBUG DIRECTORY"
> >>> echo
> >>>
> >>> mkdir .debug_info
> >>> module list > .debug_info/environ_modules 2>&1
> >>> ulimit -a > .debug_info/limits 2>&1
> >>> hostname > .debug_info/environ_hostname 2>&1
> >>> env | grep SLURM > .debug_info/environ_slurm 2>&1
> >>> env | grep OMP | grep -v OMPI > .debug_info/environ_omp 2>&1
> >>> env | grep OMPI > .debug_info/environ_openmpi 2>&1
> >>> env > .debug_info/environ 2>&1
> >>>
> >>> if [ ! -z ${CUDA_VISIBLE_DEVICES+x} ]; then
> >>>     echo "SAVING CUDA ENVIRONMENT"
> >>>     echo
> >>>     env | grep CUDA > .debug_info/environ_cuda 2>&1
> >>> fi
> >>>
> >>> You could add something like this to one of the SLURM prologs to save
> >>> the GPU list of jobs.
> >>>
> >>> Best,
> >>>
> >>> David
> >>>
> >>> On Thu, Jun 4, 2020 at 4:02 AM Kota Tsuyuzaki
> >>> <kota.tsuyuzaki...@hco.ntt.co.jp> wrote:
> >>>
> >>>     Hello Guys,
> >>>
> >>>     We are running GPU clusters with Slurm and SlurmDBD (version 19.05
> >>>     series), and some of the GPUs seem to run into trouble with the
> >>>     jobs attached to them. To investigate whether the trouble happens
> >>>     on the same GPUs, I'd like to get the GPU indices of completed jobs.
> >>>
> >>>     In my understanding, `scontrol show job` can show the indices (as
> >>>     IDX in the gres info) but cannot be used for completed jobs. And
> >>>     `sacct -j` is available for completed jobs but won't print the
> >>>     indices.
> >>>
> >>>     Is there any way (commands, configurations, etc...) to see the
> >>>     allocated GPU indices for completed jobs?
> >>>
> >>>     Best regards,
> >>>
> >>>     --------------------------------------------
> >>>     露崎 浩太 (Kota Tsuyuzaki)
> >>>     kota.tsuyuzaki...@hco.ntt.co.jp
> >>>     NTTソフトウェアイノベーションセンタ
> >>>     分散処理基盤技術プロジェクト
> >>>     0422-59-2837
> >>>     ---------------------------------------------
> >>>
> >>
> >> --
> >> Dipl.-Inf. Marcus Wagner
> >>
> >> IT Center
> >> Gruppe: Systemgruppe Linux
> >> Abteilung: Systeme und Betrieb
> >> RWTH Aachen University
> >> Seffenter Weg 23
> >> 52074 Aachen
> >> Tel: +49 241 80-24383
> >> Fax: +49 241 80-624383
> >> wag...@itc.rwth-aachen.de
> >> www.itc.rwth-aachen.de
> >>
> >> Social Media Kanäle des IT Centers:
> >> https://blog.rwth-aachen.de/itc/
> >> https://www.facebook.com/itcenterrwth
> >> https://www.linkedin.com/company/itcenterrwth
> >> https://twitter.com/ITCenterRWTH
> >> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
>
> --
> Dipl.-Inf. Marcus Wagner
>
> IT Center
> Gruppe: Systemgruppe Linux
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>
> Social Media Kanäle des IT Centers:
> https://blog.rwth-aachen.de/itc/
> https://www.facebook.com/itcenterrwth
> https://www.linkedin.com/company/itcenterrwth
> https://twitter.com/ITCenterRWTH
> https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ
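Following up on David's prolog suggestion quoted above, here is a minimal,
untested sketch of a prolog snippet that records the allocated GPU indices per
job in a per-node log, so they can still be looked up after a job has
completed. The log path is hypothetical, and whether SLURM_JOB_GPUS is exported
to the prolog environment depends on your Slurm version and configuration:

    #!/bin/bash
    # Prolog sketch: append this job's GPU indices to a per-node log.
    LOG=/var/log/slurm/gpu_usage.log   # hypothetical location

    if [ -n "$SLURM_JOB_GPUS" ]; then
        # SLURM_JOB_GPUS carries the host GPU indices (see the discussion above)
        IDX="SLURM_JOB_GPUS=$SLURM_JOB_GPUS"
    else
        # Fall back to the GRES_IDX field of "scontrol show job -d"
        IDX=$(scontrol show job -d "$SLURM_JOB_ID" | grep -o 'GRES_IDX=[^ ]*' | head -n 1)
    fi

    echo "$(date +%F_%T) $(hostname) job=$SLURM_JOB_ID $IDX" >> "$LOG"

Afterwards, something like "grep job=14658274 /var/log/slurm/gpu_usage.log" on
the node (or on a collected copy of the logs) would show which GPU indices a
completed job had been given.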