Hi Tim,
We have MIG configured and integrated with Slurm using the slurm-mig-discovery
tools:

https://gitlab.com/nvidia/hpc/slurm-mig-discovery

The mig-parted tool is great for setting up MIG itself:

https://github.com/NVIDIA/mig-parted

Once set up, MIG instances work fine with Slurm, although the output from
nvidia-smi is a little different: one sees both GPUs, and the “visible device”
is the MIG instance:

$ salloc -p interactive -n 1 -c 8 --gres=gpu:1
salloc: Granted job allocation 5235
salloc: Waiting for resource configuration
salloc: Nodes gpu001 are ready for job

$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0

$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-c1976541-7b00-3f9f-f557-a17f45b879e9)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-c1976541-7b00-3f9f-f557-a17f45b879e9/1/0)
GPU 1: A100-PCIE-40GB (UUID: GPU-83f9ff5b-09c3-8de1-b3eb-adaadb1cda9f)
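
To check from inside a job which MIG instance is visible, the listing above can
be parsed. This is a minimal sketch, assuming the line format is exactly the one
shown in this session (other driver versions may print MIG lines differently);
the function name parse_listing and the regular expressions are illustrative
and not part of any NVIDIA or Slurm tool.

```python
import re

# Sample `nvidia-smi -L` output, copied from the session above.
SAMPLE = """\
GPU 0: A100-PCIE-40GB (UUID: GPU-c1976541-7b00-3f9f-f557-a17f45b879e9)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-c1976541-7b00-3f9f-f557-a17f45b879e9/1/0)
GPU 1: A100-PCIE-40GB (UUID: GPU-83f9ff5b-09c3-8de1-b3eb-adaadb1cda9f)
"""

# Assumed line formats, matching only the output shown in this message.
GPU_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (\S+)\)$")
MIG_RE = re.compile(r"^\s+MIG (\S+)\s+Device\s+(\d+): \(UUID: (\S+)\)$")


def parse_listing(text):
    """Return one dict per GPU, with the MIG instances listed under it."""
    gpus = []
    for line in text.splitlines():
        gpu = GPU_RE.match(line)
        if gpu:
            gpus.append({
                "index": int(gpu.group(1)),
                "name": gpu.group(2),
                "uuid": gpu.group(3),
                "mig": [],
            })
            continue
        mig = MIG_RE.match(line)
        if mig and gpus:
            # A MIG line belongs to the most recently seen GPU line.
            gpus[-1]["mig"].append({
                "profile": mig.group(1),
                "device": int(mig.group(2)),
                "uuid": mig.group(3),
            })
    return gpus


if __name__ == "__main__":
    for gpu in parse_listing(SAMPLE):
        profiles = [m["profile"] for m in gpu["mig"]] or ["no MIG instance visible"]
        print(f"GPU {gpu['index']} ({gpu['name']}): {', '.join(profiles)}")
```

On a real node, the same function could be fed the output of
subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout.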


The main caveat is that MIG and its Slurm integration are rather static for the
moment, so it’s not really possible to change the profiles dynamically.

The other slight issue is that all combinations of MIG instances waste some
compute or memory capacity. We have divided each A100 into two 3g.20gb devices,
so all the memory is used, but 1/7 of the compute capacity is lost.
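
The 1/7 figure follows from the slice counts: an A100 40GB exposes 7 compute
slices and 8 memory slices of 5 GB, and a 3g.20gb instance takes 3 compute and
4 memory slices. The sketch below redoes that arithmetic. The names PROFILES
and unused_fraction are illustrative, and the function only adds up slices; it
does not check NVIDIA's placement rules, so it should not be used to decide
whether a layout is actually valid.

```python
from fractions import Fraction

# Slice totals for an A100 40GB: 7 compute slices, 8 memory slices of 5 GB.
TOTAL_COMPUTE = 7
TOTAL_MEMORY = 8

# (compute slices, memory slices) used by each A100 40GB MIG profile.
PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}


def unused_fraction(layout):
    """Return (unused compute, unused memory) as fractions of one GPU.

    Only sums slices; placement constraints are deliberately ignored.
    """
    compute = sum(PROFILES[name][0] for name in layout)
    memory = sum(PROFILES[name][1] for name in layout)
    if compute > TOTAL_COMPUTE or memory > TOTAL_MEMORY:
        raise ValueError(f"layout {layout} needs more slices than one GPU has")
    return (
        Fraction(TOTAL_COMPUTE - compute, TOTAL_COMPUTE),
        Fraction(TOTAL_MEMORY - memory, TOTAL_MEMORY),
    )


if __name__ == "__main__":
    compute, memory = unused_fraction(["3g.20gb", "3g.20gb"])
    print(f"compute unused: {compute}, memory unused: {memory}")
    # prints "compute unused: 1/7, memory unused: 0"
```

By the same arithmetic, seven 1g.5gb instances use every compute slice but
leave one of the eight memory slices (5 GB) idle.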

Thanks

Ewan Roche

Division Calcul et Soutien à la Recherche
UNIL | Université de Lausanne


On 21 Apr 2021, at 09:14, Timothy Carr wrote:

Dear Community,

I trust everyone is well and keeping safe.

We are considering the purchase of nodes with the Nvidia A100 GPUs and enabling 
the MIG feature which allows for the creation of instance resource profiles. 
The creation of these profiles seems to be straightforward as per the 
documentation. Have any of you had the opportunity to implement the A100 MIG 
with SLURM and have you found any caveats you are willing to share?

Kind Regards

--
Tim



