Reviewed:  https://review.openstack.org/508555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Submitter: Zuul
Branch:    master
commit e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Author: John Garbutt <j...@johngarbutt.com>
Date:   Fri Sep 29 15:48:54 2017 +0100

    Re-use existing ComputeNode on ironic rebalance

    When a nova-compute service dies that is one of several ironic based
    nova-compute services running, a node rebalance occurs to ensure there
    is still an active nova-compute service dealing with requests for the
    instances running on that node.

    Today, when this occurs, we create a new ComputeNode entry. This
    change alters that logic to detect the ironic node rebalance case and,
    in that case, re-use the existing ComputeNode entry, simply updating
    the host field to match the new host it has been rebalanced onto.

    Previously we hit problems with placement when we got a new
    ComputeNode.uuid for the same ironic_node.uuid. Re-using the existing
    entry keeps the ComputeNode.uuid the same when the rebalance of the
    ComputeNode occurs.

    Without keeping the same ComputeNode.uuid, placement errors out with a
    409 because we attempt to create a ResourceProvider that has the same
    name as an existing ResourceProvider. Had that worked, we would have
    hit the race that occurs after we create the ResourceProvider but
    before we add back the existing allocations for existing instances.
    Keeping the ComputeNode.uuid the same means we simply look up the
    existing ResourceProvider in placement, avoiding all this pain and
    tears.

    Closes-Bug: #1714248

    Co-Authored-By: Dmitry Tantsur <dtant...@redhat.com>
    Change-Id: I4253cffca3dbf558c875eed7e77711a31e9e3406


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1714248

Title:
  Compute node HA for ironic doesn't work due to the name duplication
  of Resource Provider

Status in Ironic:
  Invalid
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  In Progress

Bug description:
  Description
  ===========
  In an environment with multiple compute nodes using the ironic driver,
  when a compute node goes down, another compute node cannot take over
  its ironic nodes.

  Steps to reproduce
  ==================
  1. Start multiple compute nodes with the ironic driver.
  2. Register one node to ironic.
  3. Stop the compute node which manages the ironic node.
  4. Create an instance.

  Expected result
  ===============
  The instance is created.

  Actual result
  =============
  The instance creation fails.

  Environment
  ===========
  1. Exact version of OpenStack you are running.
  openstack-nova-scheduler-15.0.6-2.el7.noarch
  openstack-nova-console-15.0.6-2.el7.noarch
  python2-novaclient-7.1.0-1.el7.noarch
  openstack-nova-common-15.0.6-2.el7.noarch
  openstack-nova-serialproxy-15.0.6-2.el7.noarch
  openstack-nova-placement-api-15.0.6-2.el7.noarch
  python-nova-15.0.6-2.el7.noarch
  openstack-nova-novncproxy-15.0.6-2.el7.noarch
  openstack-nova-api-15.0.6-2.el7.noarch
  openstack-nova-conductor-15.0.6-2.el7.noarch

  2. Which hypervisor did you use?
  ironic

  Details
  =======
  When a nova-compute service goes down, another nova-compute takes over
  the ironic nodes managed by the failed service by re-balancing the
  hash ring. The newly active nova-compute then tries to create a new
  resource provider with a new ComputeNode object UUID and the
  hypervisor name (the ironic node UUID) [1][2][3]. This creation fails
  with a conflict (409), since a resource provider with the same name
  was already created by the failed nova-compute.
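  To make the conflict concrete, here is a minimal sketch against the
  placement REST API, not taken from nova; the endpoint, token,
  microversion and compute-node UUIDs are illustrative assumptions:

  # Sketch: two nova-compute services try to create a provider with the
  # same name (the ironic node UUID) but different provider UUIDs.
  import requests

  PLACEMENT = "http://placement.example.com/placement"      # assumed endpoint
  HEADERS = {"X-Auth-Token": "ADMIN_TOKEN",                  # assumed token
             "OpenStack-API-Version": "placement 1.14"}      # assumed microversion

  IRONIC_NODE_UUID = "8904aeeb-a35b-4ba3-848a-73269fdde4d3"  # illustrative

  def create_resource_provider(compute_node_uuid):
      # The provider is named after the ironic node, but each service uses
      # its own ComputeNode.uuid as the provider uuid.
      return requests.post(
          f"{PLACEMENT}/resource_providers",
          headers=HEADERS,
          json={"name": IRONIC_NODE_UUID, "uuid": compute_node_uuid})

  # The first nova-compute created the provider before it died: 201 Created.
  print(create_resource_provider("11111111-1111-4111-8111-111111111111").status_code)
  # After the hash-ring rebalance, the new nova-compute generates a fresh
  # ComputeNode.uuid; placement rejects the duplicate name: 409 Conflict.
  print(create_resource_provider("22222222-2222-4222-8222-222222222222").status_code)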
  When a new instance is requested, the scheduler gets only an old
  resource provider for the ironic node [4]. Then, the ironic node is
  not selected:

  WARNING nova.scheduler.filters.compute_filter
  [req-a37d68b5-7ab1-4254-8698-502304607a90
  7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -]
  (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB
  disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while

  [1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
  [2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
  [3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
  [4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/1714248/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
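For reference, a minimal standalone sketch of the idea in the commit above
(hypothetical helper and in-memory store, not the actual nova resource
tracker code): on rebalance, the existing ComputeNode record for the ironic
node is re-used and only its host field is updated, so ComputeNode.uuid, and
with it the resource provider in placement, stays the same.

# Sketch only: an in-memory stand-in for the cell database keyed by the
# ironic node UUID (hypervisor_hostname).
import uuid
from dataclasses import dataclass

@dataclass
class ComputeNode:
    uuid: str                 # also used as the resource provider uuid
    host: str                 # nova-compute service currently managing the node
    hypervisor_hostname: str  # the ironic node uuid

nodes = {}  # hypervisor_hostname -> ComputeNode

def get_or_create_compute_node(host, ironic_node_uuid):
    existing = nodes.get(ironic_node_uuid)
    if existing is not None:
        # Rebalance case: re-use the record and keep ComputeNode.uuid stable,
        # so the matching resource provider is simply looked up in placement.
        existing.host = host
        return existing
    # First time this ironic node is seen: create a record with a new uuid.
    node = ComputeNode(uuid=str(uuid.uuid4()), host=host,
                       hypervisor_hostname=ironic_node_uuid)
    nodes[ironic_node_uuid] = node
    return node

first = get_or_create_compute_node(
    "compute01", "8904aeeb-a35b-4ba3-848a-73269fdde4d3")
taken_over = get_or_create_compute_node(
    "compute02", "8904aeeb-a35b-4ba3-848a-73269fdde4d3")
assert taken_over.uuid == first.uuid and taken_over.host == "compute02"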