Public bug reported:

We saw in the field that the pci_devices table can end up in an inconsistent 
state after a compute node HW failure and re-deployment. There can be dependent 
devices where the parent PF is in the available state while its child VFs are 
in the unavailable state. (Before the HW fault the PF was allocated, hence the 
VFs were marked unavailable.)
    
In this state the PF is still schedulable, but during the PCI claim the 
handling of dependent devices in the PCI tracker will fail with the error: 
"Attempt to consume PCI device XXX from empty pool".
    
The reason for the failure is that when the PF is claimed, all of its child VFs 
are marked unavailable; if a VF is already unavailable, that step fails.
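The failing sequence can be sketched as a toy model. This is not nova's actual PCI tracker code; the class and function names below are hypothetical, and only the underflow behaviour is meant to mirror the reported error:

```python
# Toy model (not nova code) of how claiming a PF with an inconsistent
# pci_devices state underflows the VF pool.

class PciPool:
    """A counted pool of identical PCI devices."""

    def __init__(self, count):
        self.count = count

    def consume(self, addr):
        # Removing a device from an already-empty pool is the failure
        # mode seen in the bug.
        if self.count <= 0:
            raise RuntimeError(
                "Attempt to consume PCI device %s from empty pool" % addr)
        self.count -= 1


def claim_pf(vf_pool, vf_addrs):
    # Claiming a PF marks every child VF unavailable, which consumes it
    # from the VF pool. A VF that is already unavailable was never
    # counted into the pool, so the pool runs out before the loop ends.
    for addr in vf_addrs:
        vf_pool.consume(addr)


# Consistent state: both child VFs are available, so the pool holds 2.
claim_pf(PciPool(2), ["0000:81:00.1", "0000:81:00.2"])

# Inconsistent state: the PF is available but one child VF is already
# unavailable, so the pool holds only 1 device for 2 children.
try:
    claim_pf(PciPool(1), ["0000:81:00.1", "0000:81:00.2"])
except RuntimeError as e:
    print(e)  # Attempt to consume PCI device 0000:81:00.2 from empty pool
```

In the consistent case the pool count matches the number of child VFs and the claim succeeds; in the inconsistent case the second consume hits an empty pool, which matches the error reported above.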

No reproducer has been found so far that generates the inconsistent
state (we tried whitelist reconfiguration, evacuation, and VM deletion
while the compute was down), but recovering from the inconsistency
should be possible.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1969496

Title:
  booting with PCI device fails: Attempt to consume PCI device xxx from
  empty pool

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1969496/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp