Reviewed:  https://review.opendev.org/c/openstack/nova/+/828516
Committed: https://opendev.org/openstack/nova/commit/9eb116b99ce32bc69c4abf8ec3b0179ef89a8860
Submitter: "Zuul (22348)"
Branch:    master
commit 9eb116b99ce32bc69c4abf8ec3b0179ef89a8860
Author: Felix Huettner <[email protected]>
Date:   Wed Feb 9 12:03:15 2022 +0100

    Graceful recovery when attaching volume fails

    When trying to attach a volume to an already running instance,
    nova-api requests that the nova-compute service create a
    BlockDeviceMapping. If nova-api does not receive a response within
    `rpc_response_timeout`, it treats the request as failed and raises an
    exception. There are multiple cases where nova-compute has actually
    already processed the request and only the reply did not reach
    nova-api in time (see the bug report).

    After the failed request, the database contains a BlockDeviceMapping
    entry for the volume + instance combination that is never cleaned up
    again. This entry also causes nova-api to reject all future
    attachments of this volume to this instance (as it assumes the volume
    is already attached).

    To work around this, we check whether a BlockDeviceMapping has
    already been created when we see a messaging timeout. If so, we can
    safely delete it, since the compute node has already finished
    processing and we will no longer pick it up. This allows users to
    try the request again.

    A previous fix was abandoned without a clear reason ([1]).

    [1]: https://review.opendev.org/c/openstack/nova/+/731804

    Closes-Bug: 1960401
    Change-Id: I17f4d7d2cb129c4ec1479cc4e5d723da75d3a527

** Changed in: nova
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1960401

Title:
  Missing graceful recovery when attaching volume fails

Status in OpenStack Compute (nova):
  Fix Released

Bug description:

  Description
  ===========

  When trying to attach a volume to an already running instance,
  nova-api requests that the nova-compute service create a
  BlockDeviceMapping.
  If nova-api does not receive a response within `rpc_response_timeout`,
  it treats the request as failed and raises an exception. There are
  cases where nova-compute has actually already processed the request
  and only the reply did not reach nova-api in time. This can happen,
  for example, in the following cases (or combinations of them):

  * nova-compute crashes or is unable to send the message reply back
  * nova-api is handling too many other requests and does not get
    processing time to receive the message
  * a configuration error in rabbitmq causes the message to be dropped
    before it can be read
  * rabbitmq fails over to another node before the message is read
    (reply queues are non-persistent)

  The state after the failed request is the same in all cases: the
  database contains a BlockDeviceMapping entry for the volume + instance
  combination that is never cleaned up again. This entry also causes
  nova-api to reject all future attachments of this volume to this
  instance (as it assumes the volume is already attached). Manual
  intervention (deleting the offending db entry) is then required to
  allow the volume to be attached again.

  There was already a proposal for a fix, which was abandoned
  (https://review.opendev.org/c/openstack/nova/+/731804). I will propose
  a new fix based on the same idea.

  Steps to reproduce
  ==================

  This issue is not reliably reproducible. The rough steps are (with
  non-production changes to make reproducing the issue more likely):

  * create an instance and a volume of your choice
  * create unrelated high load on nova-api
  * configure a policy in rabbitmq to drop all messages in reply queues
    after 1ms
  * try to attach the volume to the instance (you should hopefully get a
    messaging timeout)
  * try to attach the volume again.
  It will fail, as the volume is already attached.

  Expected result
  ===============

  The second volume attach call should make an additional attempt to
  attach the volume.

  Actual result
  =============

  The second volume attach call fails because nova-api assumes the
  volume is already attached.

  Environment
  ===========

  stable/queens (the issue is also present on master)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1960401/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
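
The recovery idea in the commit above (on a messaging timeout, check whether the BlockDeviceMapping was created anyway and delete it so a retry can succeed) can be sketched as follows. This is a minimal, hypothetical illustration, not Nova's actual code: `MessagingTimeout`, `FakeDB`, and `attach_volume` here are stand-ins invented for this sketch.

```python
class MessagingTimeout(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout (hypothetical)."""


class FakeDB:
    """Toy replacement for the Nova database layer (hypothetical)."""

    def __init__(self):
        # (instance_id, volume_id) -> BlockDeviceMapping-like dict
        self.bdms = {}

    def get_bdm(self, instance_id, volume_id):
        return self.bdms.get((instance_id, volume_id))

    def delete_bdm(self, instance_id, volume_id):
        self.bdms.pop((instance_id, volume_id), None)


def attach_volume(db, rpc_call, instance_id, volume_id):
    """Attach a volume; clean up the orphaned BDM if the RPC times out."""
    if db.get_bdm(instance_id, volume_id):
        # This is the rejection described in the bug report.
        raise RuntimeError("volume already attached to this instance")
    try:
        # Ask the (simulated) compute service to create the BDM.
        rpc_call(instance_id, volume_id)
    except MessagingTimeout:
        # The compute node may have finished the work even though the
        # reply never arrived. If a BDM exists now, delete it so the
        # user can simply retry the attach instead of needing manual
        # database surgery.
        if db.get_bdm(instance_id, volume_id):
            db.delete_bdm(instance_id, volume_id)
        raise


if __name__ == "__main__":
    db = FakeDB()

    def flaky_rpc(instance_id, volume_id):
        # Compute does create the mapping...
        db.bdms[(instance_id, volume_id)] = {"created": True}
        # ...but the reply is lost, so the caller sees a timeout.
        raise MessagingTimeout()

    try:
        attach_volume(db, flaky_rpc, "inst-1", "vol-1")
    except MessagingTimeout:
        pass

    # Without the cleanup, this entry would linger and block retries.
    assert db.get_bdm("inst-1", "vol-1") is None
```

Without the `except MessagingTimeout` cleanup, the stale entry would make every later `attach_volume` call fail with "already attached", which is exactly the behavior the bug describes.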

