Public bug reported:

Description
===========
When we boot a GPU instance, nova-compute's instance_claim
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/resource_tracker.py#L197-L217
updates the input instance object's 'pci_devices' attribute from [] to [PciDevice]. The list now includes the computed GPU PCI device object.

Now look at the claim's code flow:
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L64
The claim clones the input instance object and stores the clone in self.instance.
https://github.com/openstack/nova/blob/7020196aaa2fedd537806fe229e237c91e4f0ca5/nova/compute/claims.py#L78-L84
The abort function aborts the instance's claim using self.instance, which is the clone, not the original input instance object.

So if spawning the instance fails, claim.abort is called. It reverts the cloned instance object's 'pci_devices' attribute to [], and the pci_device record in the DB is reverted from allocated to free as well. The original input instance object is not reverted: its 'pci_devices' is still [PciDevice]. That object is sent to nova-conductor for rescheduling, and on the next node, after the claim, instance.pci_devices is [PciDevice, PciDevice].

The spawned instance then either has two GPU PCI devices or raises a LibvirtError, "Device xxx is in used".

Steps to reproduce
==================
1. Force a libvirt error on all compute nodes
2. nova boot a GPU instance
3. Inspect the guest XML in nova-compute.log

Expected result
===============
On the rescheduled node, the guest XML has just one GPU PCI device.

Actual result
=============
On the rescheduled node, the guest XML has more than one GPU PCI device.

** Affects: nova
     Importance: Undecided
     Assignee: guolei (guolei-5)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => guolei (guolei-5)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1901170

Title:
  nova boot GPU instance will attach more one GPU pci device when reschedule happened

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1901170/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
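
The clone-versus-original behaviour described in this report can be illustrated with a small standalone sketch. It does not use nova's code: FakeInstance, FakeClaim and fake_instance_claim are hypothetical stand-ins, the device names are made up, and the order of operations inside the claim is an assumption. It only shows why reverting a clone leaves a stale entry on the object that gets rescheduled.

```python
# Simplified stand-ins, NOT nova's real classes: every name here is hypothetical.
import copy


class FakeInstance:
    """Stand-in for the instance object; it only tracks pci_devices."""

    def __init__(self):
        self.pci_devices = []

    def clone(self):
        return copy.deepcopy(self)


class FakeClaim:
    """Stand-in for a claim that keeps a clone, as the report describes."""

    def __init__(self, instance):
        self.instance = instance.clone()

    def abort(self):
        # Only the clone is reverted; the caller's object is untouched.
        self.instance.pci_devices = []


def fake_instance_claim(instance, device):
    """Record the device on the caller's object, then build the claim (assumed order)."""
    instance.pci_devices.append(device)
    return FakeClaim(instance)


instance = FakeInstance()

# First compute node: claim, spawn fails, abort.
claim = fake_instance_claim(instance, "gpu-on-node-1")
claim.abort()
print(claim.instance.pci_devices)  # [] : the clone was reverted
print(instance.pci_devices)        # ['gpu-on-node-1'] : the original was not

# Rescheduled node: the stale entry is still present when the next claim adds one.
fake_instance_claim(instance, "gpu-on-node-2")
print(instance.pci_devices)        # ['gpu-on-node-1', 'gpu-on-node-2']
```

In this sketch, having abort() also reset pci_devices on the caller's object, or having the claim operate on the caller's object instead of a clone, would leave a single device after the reschedule. Whether either is the right fix for nova is not something this sketch can decide.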

