[Yahoo-eng-team] [Bug 1886418] Re: Enabled compute service still have COMPUTE_STATUS_DISABLED trait and therefore ignored by the scheduler

OpenStack Infra Tue, 21 Jul 2020 19:31:25 -0700

Reviewed:  https://review.opendev.org/704866
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1b661c2669d9d75b3d4622418e84a5c4046c2072
Submitter: Zuul
Branch:    master


commit 1b661c2669d9d75b3d4622418e84a5c4046c2072
Author: Balazs Gibizer <[email protected]>
Date:   Wed Jan 29 19:48:12 2020 +0100

    Reduce gen conflict in COMPUTE_STATUS_DISABLED handling
    
    The COMPUTE_STATUS_DISABLED trait is supposed to be added to the compute
    RP when the compute service is disabled, and the trait is supposed to be
    removed when the service is enabled again. However, adding and removing
    traits is prone to generation conflicts in placement. The original
    implementation of blueprint pre-filter-disabled-computes was aware of
    this and logs a detailed warning message while the API operation still
    succeeds. The conflict can be ignored this way because the periodic
    update_available_resource() call will re-sync the traits later.
    
    Still, this leaves a human-noticeable time window in which the trait and
    the service state are not in sync.
    
    Setting the compute service to disabled is the smaller problem, as the
    scheduler still uses the ComputeFilter, which filters the computes based
    on the service API. So during the enable->disable race window we only
    lose scheduling performance, as the placement pre-filter is not yet
    effective.
    
    When setting the compute service to enabled, the race is more visible,
    as the placement pre_filter will filter out the compute that was
    enabled by the admin until the re-sync happens. If the conflict only
    happened due to high load on the given compute, then such a delay could
    be explained by the load itself. However, a conflict can happen simply
    due to a new instance boot on the compute.
    
    Fortunately the solution is easy and cheap. The current service state
    handler code path has already queried placement about the existing traits
    on the compute RP and therefore it receives the current RP generation as
    well. However, it does not use this information and instead relies on
    the potentially stale provider_tree cache.
    
    This patch uses the much fresher RP generation known by the service state
    handling code instead of the potentially stale provider_tree cache.
    
    The change in the notification test is also due to the fixed behavior.
    The test disables the compute. Until now, this caused the
    FilterScheduler to detect that there is no valid host. Now this is
    already detected by the scheduler manager based on the empty placement
    response. As a result, the FilterScheduler is not called and therefore
    the select_destination.start notification is not sent.
    
    Closes-Bug: #1886418
    
    Change-Id: Ib3c455bf21f33923bb82e3f5c53035f6722480d3
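
The generation handling described in the commit message can be illustrated
with a small, self-contained sketch. This is not nova or placement code:
FakePlacement, GenerationConflict and both enable_* helpers are hypothetical
names made up for this example, and the model assumes only what the message
states, namely that a trait write must carry the provider's current
generation and that an unrelated write (such as an instance boot) bumps it.

```python
class GenerationConflict(Exception):
    """Raised when a write carries a generation that is out of date."""


class FakePlacement:
    """Toy stand-in for placement holding a single resource provider."""

    def __init__(self):
        self.generation = 0
        self.traits = set()

    def get_traits(self):
        # A read returns the traits together with the current generation.
        return set(self.traits), self.generation

    def put_traits(self, traits, generation):
        # A write succeeds only if the caller knows the current generation.
        if generation != self.generation:
            raise GenerationConflict(
                f"expected generation {self.generation}, got {generation}")
        self.traits = set(traits)
        self.generation += 1

    def allocate(self):
        # Any other write, such as booting an instance, bumps the generation.
        self.generation += 1


def enable_with_cached_generation(placement, cached_generation):
    # Old behaviour (simplified): read the traits, but write them back
    # with the generation remembered in a local cache.
    traits, _fresh_generation = placement.get_traits()
    traits.discard("COMPUTE_STATUS_DISABLED")
    placement.put_traits(traits, cached_generation)


def enable_with_fresh_generation(placement):
    # Fixed behaviour (simplified): reuse the generation that came back
    # with the trait query made just before the write.
    traits, fresh_generation = placement.get_traits()
    traits.discard("COMPUTE_STATUS_DISABLED")
    placement.put_traits(traits, fresh_generation)


placement = FakePlacement()
placement.put_traits({"COMPUTE_STATUS_DISABLED"}, placement.generation)
cached_generation = placement.generation  # what the local cache remembers
placement.allocate()                      # instance boot makes the cache stale

try:
    enable_with_cached_generation(placement, cached_generation)
    stale_result = "removed"
except GenerationConflict:
    stale_result = "conflict"

enable_with_fresh_generation(placement)
trait_removed = "COMPUTE_STATUS_DISABLED" not in placement.traits
```

In this toy model the write with the cached generation hits a conflict and
leaves the trait in place, while the write that reuses the freshly read
generation removes it. A conflict is still possible if another write lands
between the read and the write, but that window is much shorter.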


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1886418

Title:
  Enabled compute service still have COMPUTE_STATUS_DISABLED trait and
  therefore ignored by the scheduler

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Steps to reproduce
  ==================
  * Create a single host devstack
  In quick succession (in less time than [compute]/update_resources_interval config):
  * Make sure that you can boot a server on the single compute
  * Set the compute service status to disabled
  * Make sure that the COMPUTE_STATUS_DISABLED trait is added to the compute RP in placement and therefore you cannot create servers any more
  * Set the compute service status to enabled
  * Create a new server

  Expected result
  ===============
  The server is created successfully

  Actual result
  =============
  The server creation fails with NoValidHost because the COMPUTE_STATUS_DISABLED trait wasn't removed from the compute RP due to a generation conflict.
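
  Why a stale trait leads to NoValidHost can be sketched with a toy model of
  the two filtering stages. This is only an illustration under simplifying
  assumptions, not nova code: candidates() and schedule() are hypothetical
  helpers, providers are plain dicts of trait sets, and LookupError stands in
  for nova's NoValidHost.

```python
DISABLED = "COMPUTE_STATUS_DISABLED"


def candidates(providers, forbidden_traits):
    """Return names of providers that carry none of the forbidden traits."""
    return [name for name, traits in providers.items()
            if not (traits & forbidden_traits)]


def schedule(providers, service_enabled):
    # Stage 1: placement-side pre-filter on the forbidden trait.
    hosts = candidates(providers, {DISABLED})
    # Stage 2: service-state check, in the spirit of the ComputeFilter.
    hosts = [h for h in hosts if service_enabled.get(h, False)]
    if not hosts:
        raise LookupError("NoValidHost")
    return hosts[0]


# The service was re-enabled, but the trait was not removed yet.
providers = {"compute1": {DISABLED}}
service_enabled = {"compute1": True}

try:
    schedule(providers, service_enabled)
    before_resync = "scheduled"
except LookupError:
    before_resync = "NoValidHost"

# After the trait is removed (by the fix or by the periodic re-sync).
providers["compute1"].discard(DISABLED)
after_resync = schedule(providers, service_enabled)
```

  In this sketch the host is rejected by the first stage even though the
  service is enabled, and it becomes schedulable again once the trait is gone.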

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1886418/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
