Also, the "--accel-bind=g" option to srun is supposed to do this, but we are observing that it is broken and causes jobs to hang.
Can anyone confirm this?

-----Original Message-----
From: Kilian Cavalotti [mailto:kilian.cavalotti.w...@gmail.com]
Sent: Friday, October 27, 2017 8:13 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working

Hi Michael,

On Fri, Oct 27, 2017 at 4:44 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> as an aside, is there some tool which provides the optimal mapping of
> CPU id's to GPU cards?

We use nvidia-smi:

-- 8< -----------------------------------------------------------------------
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV1     NV1     NV2     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU1    NV1      X      NV2     NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU2    NV1     NV2      X      NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU3    NV2     NV1     NV1      X      PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
mlx5_0  PHB     PHB     PHB     PHB      X

Legend:

  X    = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
-- 8< -----------------------------------------------------------------------

and hwloc (https://www.open-mpi.org/projects/hwloc/):

-- 8< -----------------------------------------------------------------------
# hwloc-ls --ignore misc
Machine (256GB total)
  NUMANode L#0 (P#0 128GB)
    Package L#0 + L3 L#0 (25MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    HostBridge L#0
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 10de:1b02
              GPU L#0 "card1"
              GPU L#1 "renderD128"
          PCIBridge
            PCI 10de:1b02
              GPU L#2 "card2"
              GPU L#3 "renderD129"
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 10de:1b02
              GPU L#4 "card3"
              GPU L#5 "renderD130"
          PCIBridge
            PCI 10de:1b02
              GPU L#6 "card4"
              GPU L#7 "renderD131"
      PCI 8086:8d62
        Block(Disk) L#8 "sda"
      PCIBridge
        PCIBridge
          PCI 1a03:2000
            GPU L#9 "card0"
            GPU L#10 "controlD64"
      PCI 8086:8d02
  NUMANode L#1 (P#1 128GB)
    Package L#1 + L3 L#1 (25MB)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
    HostBridge L#11
      PCIBridge
        PCI 8086:1521
          Net L#11 "enp129s0f0"
        PCI 8086:1521
          Net L#12 "enp129s0f1"
      PCIBridge
        PCI 15b3:1013
          Net L#13 "ib0"
          OpenFabrics L#14 "mlx5_0"
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 10de:1b02
              GPU L#15 "card5"
              GPU L#16 "renderD132"
          PCIBridge
            PCI 10de:1b02
              GPU L#17 "card6"
              GPU L#18 "renderD133"
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 10de:1b02
              GPU L#19 "card7"
              GPU L#20 "renderD134"
          PCIBridge
            PCI 10de:1b02
              GPU L#21 "card8"
              GPU L#22 "renderD135"
-- 8< -----------------------------------------------------------------------

Both will show which CPU ids are associated with which GPUs.

Cheers,
--
Kilian
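For scripting the mapping rather than reading it by eye, the "CPU Affinity" column of `nvidia-smi topo -m` can be parsed directly. A minimal sketch (not from the original mail; the helper name `gpu_cpu_affinity` and the trimmed two-GPU sample matrix are hypothetical, assuming output shaped like the matrix above):

```python
import re

def gpu_cpu_affinity(topo_output):
    """Return {"GPUn": set of CPU ids} from `nvidia-smi topo -m` text.

    Assumes each GPU row starts at column 0 with "GPUn" and that its last
    field is a comma-separated CPU list of ranges/ids, e.g. "0-0,2-2,4-4".
    """
    affinity = {}
    for line in topo_output.splitlines():
        # Skip the header (indented), the legend, and non-GPU rows (mlx5_0).
        if not re.match(r"GPU\d+\s", line):
            continue
        fields = line.split()
        cpus = set()
        for part in fields[-1].split(","):
            lo, _, hi = part.partition("-")          # "2-2" or bare "2"
            cpus.update(range(int(lo), int(hi or lo) + 1))
        affinity[fields[0]] = cpus
    return affinity

# Trimmed sample in the shape of the matrix shown earlier in the thread.
sample = """\
        GPU0    GPU1    mlx5_0  CPU Affinity
GPU0     X      NV1     PHB     0-0,2-2,4-4
GPU1    NV1      X      PHB     0-0,2-2,4-4
mlx5_0  PHB     PHB      X
"""

print(gpu_cpu_affinity(sample))
```

In a job script, the same function could be fed the live output of `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout` to pick the right CPUs to pin to.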