Public bug reported:

A race occurs if the _poll_unconfirmed_resizes periodic task runs while nova/compute/manager.py:ComputeManager._finish_resize() is in progress, after the migration record has been updated in the database but before the instance has been updated.

2014-09-30 16:15:00.897 112868 INFO nova.compute.manager [-] Automatically confirming migration 207 for instance 799f9246-bc05-4ae8-8737-4f358240f586
2014-09-30 16:15:01.109 112868 WARNING nova.compute.manager [-] [instance: 799f9246-bc05-4ae8-8737-4f358240f586] Setting migration 207 to error: In states stopped/resize_finish, not RESIZED/None

This causes _poll_unconfirmed_resizes to see that the VM task_state is still 'resize_finish' instead of None and to set the migration record to the error state, which in turn leaves the VM stuck in resizing forever.

Two fixes have been proposed for this issue so far, but both were reverted because they caused other race conditions. See the following two bugs for more details:

https://bugs.launchpad.net/nova/+bug/1321298
https://bugs.launchpad.net/nova/+bug/1326778

This timing issue still exists in Juno today, in an environment with periodic tasks set to run once every 60 seconds and a resize_confirm_window of 1 second.

Would a possible solution be to change _poll_unconfirmed_resizes() to ignore any VM with a task_state of 'resize_finish' instead of setting the corresponding migration record to error? This is the task_state the VM should have right before it is changed to None in finish_resize(). The next time _poll_unconfirmed_resizes() is called, the migration record will still be fetched and the VM will be checked again against the updated vm_state/task_state.

Add the following in _poll_unconfirmed_resizes():

    # This removes a race condition
    if task_state == 'resize_finish':
        continue

prior to:

    elif vm_state != vm_states.RESIZED or task_state is not None:
        reason = (_("In states %(vm_state)s/%(task_state)s, not "
                    "RESIZED/None") %
                  {'vm_state': vm_state,
                   'task_state': task_state})
        _set_migration_to_error(migration, reason,
                                instance=instance)
        continue

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1376933

Title:
  _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever

Status in OpenStack Compute (Nova):
  New
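The timing window and the proposed guard can be illustrated with a small stand-alone simulation. This is only a sketch, not nova code: the Instance and Migration classes, the poll_unconfirmed_resizes() and run() functions, the skip_resize_finish flag, and the 'finished'/'confirmed' status strings are all hypothetical stand-ins chosen for illustration. Only the state check and the 'resize_finish' guard mirror the snippets quoted in the report.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RESIZED = "resized"              # stand-in for vm_states.RESIZED
RESIZE_FINISH = "resize_finish"  # task_state named in the report


@dataclass
class Instance:          # hypothetical, not the nova object
    uuid: str
    vm_state: str
    task_state: Optional[str]


@dataclass
class Migration:         # hypothetical, not the nova object
    id: int
    instance_uuid: str
    status: str = "finished"   # assumed "awaiting confirmation" marker


def poll_unconfirmed_resizes(migrations: List[Migration],
                             instances: Dict[str, Instance],
                             skip_resize_finish: bool) -> None:
    """Toy polling loop; only the state checks are modelled."""
    for migration in migrations:
        if migration.status != "finished":
            continue  # errored or confirmed records are not polled again
        instance = instances[migration.instance_uuid]
        vm_state, task_state = instance.vm_state, instance.task_state
        if skip_resize_finish and task_state == RESIZE_FINISH:
            # Proposed guard: the resize is still finishing, retry next period.
            continue
        if vm_state != RESIZED or task_state is not None:
            migration.status = "error"
            continue
        migration.status = "confirmed"


def run(skip_resize_finish: bool) -> str:
    """Poll once inside the window and once after the instance is saved."""
    inst = Instance("799f9246", "stopped", RESIZE_FINISH)
    mig = Migration(207, inst.uuid)
    instances = {inst.uuid: inst}
    # First poll: migration record is updated, instance record is not yet.
    poll_unconfirmed_resizes([mig], instances, skip_resize_finish)
    # The resize then finishes updating the instance.
    inst.vm_state, inst.task_state = RESIZED, None
    # Second poll, one period later.
    poll_unconfirmed_resizes([mig], instances, skip_resize_finish)
    return mig.status
```

In this model, run(False) leaves the migration in 'error' because the first poll lands inside the window and the record is never looked at again, while run(True) skips the first poll and confirms on the second. The sketch says nothing about the other race conditions described in the two linked bugs.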

