Reviewed:  https://review.opendev.org/c/openstack/nova/+/694802
Committed: https://opendev.org/openstack/nova/commit/a8492e88783b40f6dc61888fada232f0d00d6acf
Submitter: "Zuul (22348)"
Branch:    master
commit a8492e88783b40f6dc61888fada232f0d00d6acf
Author: Mark Goddard <[email protected]>
Date:   Mon Nov 18 12:06:47 2019 +0000

    Prevent deletion of a compute node belonging to another host

    There is a race condition in nova-compute with the ironic virt driver
    as nodes get rebalanced. It can lead to compute nodes being removed
    from the DB and not repopulated. Ultimately this prevents these nodes
    from being scheduled to.

    The main race condition involved is in update_available_resource in
    the compute manager. When the list of compute nodes is queried, there
    is a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it (in
    the resource tracker). However, the node is still deleted, as the
    first host is unaware of the ownership change.

    This change prevents this from occurring by filtering on the host
    when deleting a compute node. If another compute host has taken
    ownership of a node, it will have updated the host field, and this
    will prevent deletion from occurring. The first host sees that this
    has happened via the ComputeHostNotFound exception, and avoids
    deleting the node's resource provider.

    Co-Authored-By: melanie witt <[email protected]>
    Closes-Bug: #1853009
    Related-Bug: #1841481
    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02

** Changed in: nova
   Status: In Progress => Fix Released
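The guarded delete described in the commit message can be sketched as follows. This is an illustrative toy model, not nova's actual code: `FakeDB`, `compute_node_delete`, and `delete_orphan` are hypothetical names standing in for the DB API and resource tracker logic; only the exception name `ComputeHostNotFound` comes from the commit message.

```python
class ComputeHostNotFound(Exception):
    """Raised when no compute node record matches the given host."""


class FakeDB:
    """Toy stand-in for the compute_nodes table (hypothetical)."""

    def __init__(self):
        # Records keyed by node UUID; 'host' is the current owner.
        self.nodes = {}

    def compute_node_delete(self, node_uuid, host):
        # The fix: filter on host, so the delete fails if another
        # compute service has taken ownership (updated the host field)
        # since the orphan was identified.
        record = self.nodes.get(node_uuid)
        if record is None or record["host"] != host:
            raise ComputeHostNotFound(host)
        del self.nodes[node_uuid]


def delete_orphan(db, node_uuid, my_host):
    """Return True if we deleted the orphan (and may also remove its
    resource provider); False if another host now owns the node."""
    try:
        db.compute_node_delete(node_uuid, my_host)
        return True
    except ComputeHostNotFound:
        # Another host rebalanced the node under us: leave the node
        # and its resource provider alone.
        return False
```

With this guard, the first host's stale view of "its" orphans can no longer destroy a record that a second host has legitimately claimed in the meantime.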
https://bugs.launchpad.net/bugs/1853009

Title:
  Ironic node rebalance race can lead to missing compute nodes in DB

Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) ocata series: New
Status in OpenStack Compute (nova) pike series: New
Status in OpenStack Compute (nova) queens series: New
Status in OpenStack Compute (nova) rocky series: New
Status in OpenStack Compute (nova) stein series: New
Status in OpenStack Compute (nova) train series: New
Status in OpenStack Compute (nova) ussuri series: In Progress

Bug description:
  There is a race condition in nova-compute with the ironic virt driver
  as nodes get rebalanced. It can lead to compute nodes being removed
  from the DB and not repopulated. Ultimately this prevents these nodes
  from being scheduled to.

  Steps to reproduce
  ==================
  * Deploy nova with multiple nova-compute services managing ironic.
  * Create some bare metal nodes in ironic, and make them 'available'
    (the issue does not reproduce if they are 'active').
  * Stop all nova-compute services.
  * Wait for all nova-compute services to be DOWN in 'openstack compute
    service list'.
  * Simultaneously start all nova-compute services.

  Expected results
  ================
  All ironic nodes appear as hypervisors in 'openstack hypervisor list'.

  Actual results
  ==============
  One or more nodes may be missing from 'openstack hypervisor list'.
  This is most easily checked via 'openstack hypervisor list | wc -l'.

  Environment
  ===========
  OS: CentOS 7.6
  Hypervisor: ironic
  Nova: 18.2.0, plus a handful of backported patches

  Logs
  ====
  I grabbed some relevant logs from one incident of this issue. They are
  split between two compute services, and I have tried to make that
  clear, including a summary of what happened at each point.

  http://paste.openstack.org/show/786272/

  tl;dr:

  c3: 19:14:55 Finds no compute record in RT. Tries to create one
               (_init_compute_node). Shows a traceback with an SQL
               rollback, but appears to succeed.
  c1: 19:14:56 Finds no compute record in RT, 'moves' the existing node
               from c3.
  c1: 19:15:54 Begins periodic update, queries compute nodes for this
               host, finds the node.
  c3: 19:15:54 Finds no compute record in RT, 'moves' the existing node
               from c1.
  c1: 19:15:55 Deletes the orphan compute node (which now belongs to c3).
  c3: 19:16:56 Creates a resource provider.
  c3: 19:17:56 Uses the existing resource provider.

  There are two major problems here:

  * c1 deletes the orphan node after c3 has taken ownership of it.
  * c3 assumes that another compute service will not delete its nodes.
    Once a node is in rt.compute_nodes, it is not removed again unless
    the node is orphaned.

