Also, adding the "--accel-bind=g" option to srun is supposed to do this, 
but we are observing that it is broken and causes jobs to hang.  

Can anyone confirm this?
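
For reference, here's the kind of invocation we've been testing (a rough 
sketch; the partition name, GPU count, and application are placeholders):

    # bind each task to the GPU(s) closest to its allocated CPUs
    srun -p gpu --gres=gpu:2 --ntasks=2 --cpus-per-task=4 --accel-bind=g ./gpu_app

With "--accel-bind=g" each task should be bound to the GPUs closest to its 
allocated CPUs, but in our tests the job simply hangs.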

-----Original Message-----
From: Kilian Cavalotti [mailto:kilian.cavalotti.w...@gmail.com] 
Sent: Friday, October 27, 2017 8:13 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: CPU/GPU Affinity Not Working


Hi Michael,

On Fri, Oct 27, 2017 at 4:44 AM, Michael Di Domenico <mdidomeni...@gmail.com> 
wrote:
> as an aside, is there some tool which provides the optimal mapping of 
> CPU id's to GPU cards?

We use nvidia-smi:

-- 8< -----------------------------------------------------------------------------------------
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV1     NV1     NV2     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU1    NV1      X      NV2     NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU2    NV1     NV2      X      NV1     PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
GPU3    NV2     NV1     NV1      X      PHB     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18
mlx5_0  PHB     PHB     PHB     PHB      X

Legend:

 X   = Self
 SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
 PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
 PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
 PIX  = Connection traversing a single PCIe switch
 NV#  = Connection traversing a bonded set of # NVLinks
-- 8< -----------------------------------------------------------------------------------------
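
As a cross-check, the per-GPU CPU lists can also be read straight from sysfs. 
A rough sketch (assuming nvidia-smi reports bus ids with an 8-digit PCI 
domain, as recent drivers do):

    # print the CPUs local to each GPU's PCIe root complex
    nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader | \
    while IFS=', ' read -r idx busid; do
        # sysfs uses lowercase hex and a 4-digit domain
        dev=$(echo "$busid" | tr 'A-F' 'a-f' | sed 's/^00000000:/0000:/')
        echo "GPU$idx -> CPUs $(cat /sys/bus/pci/devices/$dev/local_cpulist)"
    done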

We also use hwloc (https://www.open-mpi.org/projects/hwloc/):
-- 8< -----------------------------------------------------------------------------------------
# hwloc-ls --ignore misc
Machine (256GB total)
 NUMANode L#0 (P#0 128GB)
   Package L#0 + L3 L#0 (25MB)
     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
     L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
     L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
   HostBridge L#0
     PCIBridge
       PCIBridge
         PCIBridge
           PCI 10de:1b02
             GPU L#0 "card1"
             GPU L#1 "renderD128"
         PCIBridge
           PCI 10de:1b02
             GPU L#2 "card2"
             GPU L#3 "renderD129"
     PCIBridge
       PCIBridge
         PCIBridge
           PCI 10de:1b02
             GPU L#4 "card3"
             GPU L#5 "renderD130"
         PCIBridge
           PCI 10de:1b02
             GPU L#6 "card4"
             GPU L#7 "renderD131"
     PCI 8086:8d62
       Block(Disk) L#8 "sda"
     PCIBridge
       PCIBridge
         PCI 1a03:2000
           GPU L#9 "card0"
           GPU L#10 "controlD64"
     PCI 8086:8d02
 NUMANode L#1 (P#1 128GB)
   Package L#1 + L3 L#1 (25MB)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
   HostBridge L#11
     PCIBridge
       PCI 8086:1521
         Net L#11 "enp129s0f0"
       PCI 8086:1521
         Net L#12 "enp129s0f1"
     PCIBridge
       PCI 15b3:1013
         Net L#13 "ib0"
         OpenFabrics L#14 "mlx5_0"
     PCIBridge
       PCIBridge
         PCIBridge
           PCI 10de:1b02
             GPU L#15 "card5"
             GPU L#16 "renderD132"
         PCIBridge
           PCI 10de:1b02
             GPU L#17 "card6"
             GPU L#18 "renderD133"
     PCIBridge
       PCIBridge
         PCIBridge
           PCI 10de:1b02
             GPU L#19 "card7"
             GPU L#20 "renderD134"
         PCIBridge
           PCI 10de:1b02
             GPU L#21 "card8"
             GPU L#22 "renderD135"
-- 8< -----------------------------------------------------------------------------------------

Both will show which CPU ids are associated with which GPUs.
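
Once you have the mapping, the binding can also be done by hand outside of 
Slurm, e.g. (hypothetical CPU list taken from the topo output above, 
hypothetical binary name):

    # pin to GPU0 and the CPUs reported as local to it
    CUDA_VISIBLE_DEVICES=0 taskset -c 0,2,4,6,8,10,12,14,16,18 ./gpu_app

taskset pins the process to the listed CPU ids and CUDA_VISIBLE_DEVICES 
selects the matching GPU, which is roughly the per-task binding that 
"--accel-bind=g" is meant to automate.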

Cheers,
--
Kilian

