Public bug reported:

This is based on code inspection but let's say I have configured my computes to set resize_confirm_window=3600 to automatically confirm a resized server after 1 hour. Within that hour, let's say the source compute service is down.

The periodic task gets the unconfirmed migrations with status='finished' which have been updated some time older than the given configurable window:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/manager.py#L8793

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/db/sqlalchemy/api.py#L4342

The periodic task then calls the compute API code to confirm the resize:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7160

which changes the migration status to 'confirming':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3684

And casts off to the source compute:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/rpcapi.py#L600

Now if the source compute is down and that fails, the compute manager task code will handle it and say it will retry later:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7163

However, because the migration status was changed from 'finished' to 'confirming' the task will not retry because it won't find the migration given the DB query.

And trying to confirm the resize via the API will fail as well because we'll get MigrationNotFoundByStatus since the migration status is no longer 'finished':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3681

The compute manager code should probably mark the migration status as 'finished' again if it's really going to try later, or mark the migration status as 'error'.

Note that the confirm_resize method in the compute manager doesn't mark the migration status as 'error' if something fails there either:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L3807

** Affects: nova
     Importance: Low
         Status: New

** Tags: error-handling migrate resize

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1855927

Title:
  _poll_unconfirmed_resizes may not retry later if confirm_resize fails in API

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1855927/+subscriptions
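
The stuck state described in the report can be illustrated with a small standalone sketch. This is not nova code: the Migration class, the helper functions, the SourceComputeDown exception and the reset_on_failure flag are all hypothetical names made up for illustration, the resize_confirm_window time filter is left out, and the failure is modelled as a plain exception as the report assumes. It only shows why a migration left in 'confirming' is never picked up again, and how resetting the status to 'finished' (one of the two options suggested in the report) would allow a later retry.

```python
import dataclasses


@dataclasses.dataclass
class Migration:
    # Hypothetical stand-in for a migration record; not the nova object.
    id: int
    status: str = "finished"


class SourceComputeDown(Exception):
    """Hypothetical stand-in for the failure when the source compute is down."""


def find_unconfirmed(migrations):
    # Models the DB query in the report: only status == 'finished' matches.
    return [m for m in migrations if m.status == "finished"]


def api_confirm_resize(migration, source_up):
    # Models the API step: look up by status 'finished', then flip to
    # 'confirming' before contacting the source compute.
    if migration.status != "finished":
        raise LookupError("no migration with status 'finished'")
    migration.status = "confirming"
    if not source_up:
        raise SourceComputeDown()
    migration.status = "confirmed"


def poll_unconfirmed_resizes(migrations, source_up, reset_on_failure):
    # Models the periodic task: on failure it intends to retry later.
    for m in find_unconfirmed(migrations):
        try:
            api_confirm_resize(m, source_up)
        except SourceComputeDown:
            if reset_on_failure:
                # Suggested fix: make the migration visible to the query again.
                m.status = "finished"


# Without the reset, the migration stays in 'confirming' and is skipped
# by every later run, even once the source compute is back up.
stuck = [Migration(id=1)]
poll_unconfirmed_resizes(stuck, source_up=False, reset_on_failure=False)
poll_unconfirmed_resizes(stuck, source_up=True, reset_on_failure=False)

# With the reset, the second run finds the migration and confirms it.
retried = [Migration(id=2)]
poll_unconfirmed_resizes(retried, source_up=False, reset_on_failure=True)
poll_unconfirmed_resizes(retried, source_up=True, reset_on_failure=True)
```

Marking the migration as 'error' instead, the other option mentioned in the report, would also take it out of 'confirming' but would not lead to an automatic retry in this model, since the query only matches 'finished'.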

