[Yahoo-eng-team] [Bug 1901170] [NEW] nova boot GPU instance will attach more one GPU pci device when reschedule happened

guolei Fri, 23 Oct 2020 03:06:30 -0700

Public bug reported:

Description
===========
When we boot a GPU instance, nova-compute's instance_claim
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/resource_tracker.py#L197-L217
updates the 'pci_devices' attribute of the input instance object from [] to
[PciDevice]. The list now includes the GPU PCI device object chosen for the
instance.


Now look at the claim's code flow:
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L64
The claim clones the input instance object and stores the clone in self.instance.
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L78-L84
The abort function aborts the instance's claim using self.instance, which is
the clone, not the original input instance object.

So if spawning the instance fails, claim.abort is called. It reverts the
cloned instance object's 'pci_devices' attribute to [], and the pci_device
record in the DB is reverted from allocated to free as well. The original
input instance object is not reverted: its 'pci_devices' is still [PciDevice].
That object is then sent to nova-conductor for rescheduling, and on the next
node, after the claim, instance.pci_devices becomes [PciDevice, PciDevice].

The spawned instance then ends up with two GPU PCI devices, or the spawn
raises a LibvirtError, "Device xxx is in use".
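
The flow described above can be illustrated with a small, self-contained
sketch. This is not nova code: Instance, Claim, instance_claim and
simulate_reschedule are hypothetical toy stand-ins, and the device strings
are made up. It only assumes what the report states: the claim step updates
the caller's instance, the claim keeps a clone, and abort reverts only the
clone.

```python
import copy


class Instance:
    """Toy stand-in for the instance object (hypothetical, not nova code)."""

    def __init__(self):
        self.pci_devices = []


class Claim:
    """Toy claim that keeps a clone of the instance, as the report describes."""

    def __init__(self, instance):
        self.instance = copy.deepcopy(instance)

    def abort(self):
        # Only the clone is reverted; the caller's object is left untouched.
        self.instance.pci_devices = []


def instance_claim(instance, host):
    """Toy claim step: record one GPU device from `host` on the caller's instance."""
    instance.pci_devices.append(f"gpu-on-{host}")
    return Claim(instance)


def simulate_reschedule():
    """Fail the spawn on node1, then reschedule the same object to node2."""
    instance = Instance()
    first_claim = instance_claim(instance, "node1")
    first_claim.abort()  # spawn failed on node1
    instance_claim(instance, "node2")  # reschedule reuses the original object
    return first_claim.instance.pci_devices, instance.pci_devices


aborted_clone_devices, rescheduled_devices = simulate_reschedule()
print(aborted_clone_devices)  # []
print(rescheduled_devices)    # ['gpu-on-node1', 'gpu-on-node2']
```

In this toy model the stale entry from node1 survives on the original object
because nothing resets its list before the second claim.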

Steps to reproduce
==================
1. Induce a libvirt error on all compute nodes so that spawn fails
2. Boot a GPU instance with nova boot
3. Check the guest XML in nova-compute.log

Expected result
===============
On the reschedule node, the guest XML has just one GPU PCI device.

Actual result
=============
On the reschedule node, the guest XML has more than one GPU PCI device.
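
To compare the expected and actual results, the passthrough devices in the
guest XML can be counted. The sketch below assumes the XML has been copied
out of nova-compute.log into a string and that the GPU appears as a libvirt
<hostdev type='pci'> element under <devices>. count_pci_hostdevs and
SAMPLE_XML are hypothetical names, and the PCI addresses are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative guest XML with two PCI hostdev entries (addresses are made up).
SAMPLE_XML = """<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x86' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>"""


def count_pci_hostdevs(guest_xml):
    """Return the number of <hostdev type='pci'> elements under <devices>."""
    root = ET.fromstring(guest_xml)
    return len(root.findall("./devices/hostdev[@type='pci']"))


print(count_pci_hostdevs(SAMPLE_XML))  # 2
```

For an instance that requested a single GPU, a count above 1 on the
reschedule node would match the actual result reported here.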

** Affects: nova
     Importance: Undecided
     Assignee: guolei (guolei-5)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => guolei (guolei-5)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1901170

Title:
  nova boot GPU instance will attach more one GPU pci device when
  reschedule happened

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1901170/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
