[Yahoo-eng-team] [Bug 1922264] Re: On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load second group

Sylvain Bauza Tue, 06 Apr 2021 03:00:53 -0700

*** This bug is a duplicate of bug 1900006 ***
    https://bugs.launchpad.net/bugs/1900006


Marking this bug report as a duplicate, so we can directly backport the
change down to stable/victoria.

** This bug has been marked a duplicate of bug 1900006
   Asking for different vGPU types is racey

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1922264

Title:
  On a compute node with 3 GPUs and 2 vgpu groups, nova fails to load
  second group

Status in OpenStack Compute (nova):
  Confirmed

Bug description:
  Description
  ===========
  We have multiple compute nodes with multiple NVIDIA GPU cards
  (RTX8000/RTX6000).
  Nodes with a mix of RTX8000 and RTX6000 cards have 2 GPU groups configured
  in nova.conf, but nova-compute only creates resource providers for the
  first GPU group.

  Steps to reproduce
  ==================

  For example, on a node with 2 RTX8000 and 1 RTX6000:

  $ lspci | grep -i nvidia
  21:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
  81:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
  e2:00.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

  $ nvidia-smi
  Thu Apr  1 17:22:53 2021
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 460.32.04    Driver Version: 460.32.04    CUDA Version: N/A      |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Quadro RTX 8000     On   | 00000000:21:00.0 Off |                    0 |
  | N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  |   1  Quadro RTX 8000     On   | 00000000:81:00.0 Off |                    0 |
  | N/A   30C    P8    27W / 250W |    285MiB / 46079MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  |   2  Quadro RTX 6000     On   | 00000000:E2:00.0 Off |                    0 |
  | N/A   30C    P8    24W / 250W |    150MiB / 23039MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  Extract from nova.conf:
  ...
  [devices]
  enabled_vgpu_types = nvidia-428, nvidia-387

  [vgpu_nvidia-428]
  device_addresses = 0000:21:00.0,0000:81:00.0

  [vgpu_nvidia-387]
  device_addresses = 0000:e2:00.0

  
  When nova-compute starts, the log shows:
  2021-04-01 17:15:25.454 7 WARNING nova.virt.libvirt.driver [req-bebc8637-d231-435c-a6cc-4613e14e2f76 - - - - -] The vGPU type 'nvidia-428' was listed in '[devices] enabled_vgpu_types' but no corresponding '[vgpu_nvidia-428]' group or '[vgpu_nvidia-428] device_addresses' option was defined. Only the first type 'nvidia-428' will be used.

  A listing of the resource providers on this node shows that only the
  nvidia-428 GPUs were used:
  $ openstack resource provider list --os-placement-api-version 1.14 --in-tree f5d35bdc-b4b7-4764-a9d0-41f67fd95385
  +--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
  | uuid                                 | name                               | generation | root_provider_uuid                   | parent_provider_uuid                 |
  +--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+
  | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | cloud-lyse-cmp-02                  |         32 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | None                                 |
  | 21a4a16e-8d33-4a23-a924-b00f8c31f0d0 | cloud-lyse-cmp-02_pci_0000_81_00_0 |          4 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
  | 76e1ee94-fbf2-410e-9711-fba71c709388 | cloud-lyse-cmp-02_pci_0000_21_00_0 |          2 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 | f5d35bdc-b4b7-4764-a9d0-41f67fd95385 |
  +--------------------------------------+------------------------------------+------------+--------------------------------------+--------------------------------------+

  In nova.conf, if I swap nvidia-428 & nvidia-387 in enabled_vgpu_types,
  only nvidia-387 is loaded.
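
  One way to confirm that the nova.conf extract above is internally
  consistent is to parse it and check that every type listed in
  enabled_vgpu_types has a matching [vgpu_<type>] group with a non-empty
  device_addresses option. The sketch below is illustrative only and is not
  nova's own code: it uses Python's configparser rather than oslo.config, and
  the names NOVA_CONF_EXTRACT and check_vgpu_groups are hypothetical.

```python
import configparser

# The extract quoted in this report (assumed to be the relevant part of nova.conf).
NOVA_CONF_EXTRACT = """
[devices]
enabled_vgpu_types = nvidia-428, nvidia-387

[vgpu_nvidia-428]
device_addresses = 0000:21:00.0,0000:81:00.0

[vgpu_nvidia-387]
device_addresses = 0000:e2:00.0
"""


def check_vgpu_groups(conf_text):
    """Return (mapping, missing) for the enabled vGPU types.

    mapping: vGPU type -> list of PCI addresses from its [vgpu_<type>] group.
    missing: enabled types with no group or an empty device_addresses option.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)

    raw_types = parser.get("devices", "enabled_vgpu_types", fallback="")
    enabled = [t.strip() for t in raw_types.split(",") if t.strip()]

    mapping, missing = {}, []
    for vgpu_type in enabled:
        group = f"vgpu_{vgpu_type}"
        # fallback covers both a missing section and a missing option
        raw = parser.get(group, "device_addresses", fallback="")
        addresses = [a.strip() for a in raw.split(",") if a.strip()]
        if addresses:
            mapping[vgpu_type] = addresses
        else:
            missing.append(vgpu_type)
    return mapping, missing


mapping, missing = check_vgpu_groups(NOVA_CONF_EXTRACT)
print(mapping)
print(missing)
```

  With the extract from this report, missing comes back empty: both groups
  are defined and each has device addresses, which suggests the warning is
  not caused by an incomplete configuration.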

  
  Expected result
  ===============
  All GPU groups should be loaded, as stated in the docs.

  Actual result
  =============
  Only the first GPU group is loaded.

  Environment
  ===========
  OpenStack Victoria was deployed with kolla-ansible.
  NVIDIA GRID KVM drivers: 12.1 (latest)
  System: Ubuntu 20.04.2
  nova-compute version: 22.2.1

  Hypervisor: libvirt+KVM (libvirt 6.0.0, QEMU/KVM 4.2.1)
  Storage: Dell EMC Storage Center (7.3.20.19)
  Network: neutron with OVN/OVS

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1922264/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
