Reviewed:  https://review.opendev.org/747746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc9c7a5ebf11253f86127238d33dff7401465155
Submitter: Zuul
Branch:    master
commit dc9c7a5ebf11253f86127238d33dff7401465155
Author: Stephen Finucane <[email protected]>
Date:   Fri Aug 21 17:43:36 2020 +0100

    Move revert resize under semaphore

    As discussed in change I26b050c402f5721fc490126e9becb643af9279b4, the
    resource tracker's periodic task relies on the status of migrations
    to decide whether to include usage from those migrations in the
    total, so races between setting the migration status and
    decrementing resource usage via 'drop_move_claim' can result in
    incorrect usage. That change tackled the confirm resize operation;
    this one handles the revert resize operation, which is a little
    trickier due to kinks in how both the same-cell and cross-cell
    resize revert operations work.

    For a same-cell resize revert, 'ComputeManager.revert_resize',
    running on the destination host, sets the migration status to
    'reverted' before dropping the move claim. This exposes the same
    race that we previously saw with the confirm resize operation. It
    then calls back to 'ComputeManager.finish_revert_resize' on the
    source host to boot up the instance itself. This is kind of weird
    because, even ignoring the race, we mark the migration as 'reverted'
    before we have done any of the necessary work on the source host.

    The cross-cell resize revert splits dropping of the move claim and
    setting of the migration status between the destination and source
    host tasks. Specifically, we do cleanup on the destination and drop
    the move claim first, via
    'ComputeManager.revert_snapshot_based_resize_at_dest', before
    resuming the instance and setting the migration status on the source
    via 'ComputeManager.finish_revert_snapshot_based_resize_at_source'.
    This would appear to avoid the weird quirk of the same-cell case;
    however, in typical weird cross-cell fashion, these are actually
    different instances and different migration records.
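The race and the fix can be illustrated with a small standalone sketch. All names here (SketchTracker, revert_resize, update_available_resource) are simplified stand-ins, not nova's real code; the actual fix guards these steps with oslo.concurrency's synchronized decorator on COMPUTE_RESOURCE_SEMAPHORE in the resource tracker.

```python
# Illustrative sketch only: the real nova code guards both steps with
# @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE); here a plain Lock
# stands in for that semaphore.
import threading

COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()


class SketchTracker:
    """Models one in-progress revert and the usage its claim holds."""

    def __init__(self):
        self.migration_status = 'reverting'
        self.claim_dropped = False

    def revert_resize(self):
        # The fix: the status flip and the claim drop happen as one
        # atomic step with respect to the periodic task. Nova publishes
        # the 'reverted' status first; under the lock that ordering is
        # invisible to readers.
        with COMPUTE_RESOURCE_SEMAPHORE:
            self.migration_status = 'reverted'  # status published
            self.claim_dropped = True           # drop_move_claim analogue

    def update_available_resource(self):
        # Periodic-task analogue: counts usage from a migration only
        # while its status says it is still in progress.
        with COMPUTE_RESOURCE_SEMAPHORE:
            if self.migration_status == 'reverted' and not self.claim_dropped:
                # This is the window the unsynchronized code exposed:
                # status says "done" while the claim (and its CPU pins)
                # is still held, so usage gets freed twice.
                raise RuntimeError('inconsistent state observed')
            return self.migration_status
```

With both methods serialized on the same lock, the periodic task can only ever observe a fully in-progress or fully reverted migration, never the half-updated state that led to double-freeing pinned CPUs.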
    The solution is once again to move both the setting of the
    migration status and the dropping of the claim under
    'COMPUTE_RESOURCE_SEMAPHORE'. This introduces the weird setting of
    migration status before completion to the cross-cell resize case
    and perpetuates it in the same-cell case, but this seems like a
    suitable compromise to avoid attempts to do things like unplugging
    already unplugged PCI devices or unpinning already unpinned CPUs.
    From an end-user perspective, instance state changes are what
    really matter, and once a revert has completed on the destination
    host and the instance has been marked as having returned to the
    source host, hard reboots can help us resolve any remaining issues.

    Change-Id: I29d6f4a78c0206385a550967ce244794e71cef6d
    Signed-off-by: Stephen Finucane <[email protected]>
    Closes-Bug: #1879878

** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1879878

Title:
  VM becomes ERROR after confirming resize, with error info
  CPUUnpinningInvalid on the source node

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  Confirmed
Status in OpenStack Compute (nova) ussuri series:
  Confirmed

Bug description:
  Description
  ===========
  In my environment, it takes some time to clean up the VM on the
  source node while confirming a resize. During the confirm, the
  periodic task update_available_resource may update resource usage at
  the same time, which can cause an error like:

      CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a
      subset of pinned CPU set []

  Steps to reproduce
  ==================
  * Set "update_resources_interval" in /etc/nova/nova.conf to a small
    value, say 30 seconds, on the compute nodes. This increases the
    probability of hitting the error.
  * Create a "dedicated" VM; the flavor can be:

    +----------------------------+--------------------------------------+
    | Property                   | Value                                |
    +----------------------------+--------------------------------------+
    | OS-FLV-DISABLED:disabled   | False                                |
    | OS-FLV-EXT-DATA:ephemeral  | 0                                    |
    | disk                       | 80                                   |
    | extra_specs                | {"hw:cpu_policy": "dedicated"}       |
    | id                         | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
    | name                       | 4vcpu.4mem.80ssd.0eph.numa           |
    | os-flavor-access:is_public | True                                 |
    | ram                        | 4096                                 |
    | rxtx_factor                | 1.0                                  |
    | swap                       |                                      |
    | vcpus                      | 4                                    |
    +----------------------------+--------------------------------------+

  * Resize the VM to another node with a new flavor.
  * Confirm the resize. Make sure it takes some time to undefine the VM
    on the source node; 30 seconds will lead to inevitable results.
  * You will then see the ERROR notice on the dashboard, and the VM
    becomes ERROR.

  Expected result
  ===============
  The VM is resized successfully and its state is active.

  Actual result
  =============
  * The VM becomes ERROR.
  * On the dashboard you can see this notice: Please try again later
    [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned
    CPU set []].

  Environment
  ===========
  1. Exact version of OpenStack you are running:
     Newton, with patch https://review.opendev.org/#/c/641806/21
     applied. I am sure it will also happen on newer versions that
     include https://review.opendev.org/#/c/641806/21, such as Train
     and Ussuri.
  2. Which hypervisor did you use?
     Libvirt + KVM
  3. Which storage type did you use?
     Local disk
  4. Which networking type did you use?
     Neutron with Open vSwitch

  Logs & Configs
  ==============
  ERROR log on the source node:

  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_exception
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     yield
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3444, in _confirm_resize
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     prefix='old_')
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     return f(*args, **kwargs)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 379, in drop_move_claim
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     self._update_usage(usage, sign=-1)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 724, in _update_usage
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     self.compute_node, usage, free)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1542, in get_host_numa_usage_from_instance
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     host_numa_topology, instance_numa_topology, free=free))
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1409, in numa_usage_from_instances
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     newcell.unpin_cpus(pinned_cpus)
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 95, in unpin_cpus
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c]     pinned=list(self.pinned_cpus))
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1879878/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
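For reference, the CPUUnpinningInvalid at the bottom of the traceback comes from a subset check on the cell's pinned CPU set. The following is a simplified sketch of that bookkeeping, loosely modelled on nova/objects/numa.py but not the real code (NUMACellSketch is an invented name):

```python
# Simplified sketch: unpinning is only valid for CPUs that are
# currently pinned in the cell. A racing second decrement therefore
# fails exactly as in the log above.
class CPUUnpinningInvalid(Exception):
    pass


class NUMACellSketch:
    def __init__(self, pinned_cpus):
        self.pinned_cpus = set(pinned_cpus)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.pinned_cpus):
            raise CPUUnpinningInvalid(
                'CPU set to unpin %s must be a subset of pinned CPU set %s'
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus
```

In the bug scenario, the periodic task's usage recalculation effectively unpins [1, 2, 17, 18] first; when drop_move_claim then tries to unpin the same CPUs against the now-empty set, this check raises and the instance is put into ERROR.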

