[Yahoo-eng-team] [Bug 1974070] [NEW] Ironic builds fail when landing on a cleaning node, it doesn't try to reschedule

John Garbutt Wed, 18 May 2022 10:36:25 -0700

Public bug reported:

In a happy world, the placement reserved value gets updated when a node
is no longer available, so the scheduler doesn't pick that node and
everyone is happy.


However, as is fairly well known, it takes a while for Nova to notice
if a node has been marked as in maintenance or if it has started
cleaning because its instance has just been deleted, so the scheduler
can still land on a node in a bad state.

This actually fails hard when setting the instance uuid, as expected here:
https://github.com/openstack/nova/blob/4939318649650b60dd07d161b80909e70d0e093e/nova/virt/ironic/driver.py#L378

You get a conflict error, as the ironic node is in a transitioning
state (i.e. it's not actually available any more).

When people are busy rebuilding large numbers of nodes, they tend to hit
this problem: even if you only build when you know there are available
nodes, you sometimes pick the ones you just deleted.

In an ideal world this would trigger a re-schedule, a bit like when you
hit errors in the resource tracker such as ComputeResourcesUnavailable.
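
To make the desired behaviour concrete, here is a minimal sketch of the
idea, not the actual nova code. Every class and function below is a
hypothetical stand-in written for this illustration: NodeConflict models
the conflict error returned while a node is transitioning, the local
ComputeResourcesUnavailable class only borrows the name mentioned above,
and the loop over candidates is a simplification of what a real
re-schedule through the scheduler would do.

```python
# Illustrative stand-ins only; these are NOT the real nova or ironic classes.
class NodeConflict(Exception):
    """Hypothetical: the node update was refused because the node is busy."""


class ComputeResourcesUnavailable(Exception):
    """Local stand-in that borrows the name of the error in the report."""


def claim_node(set_instance_uuid, node_id, instance_uuid):
    """Try to claim a node, turning a conflict into a reschedulable error."""
    try:
        set_instance_uuid(node_id, instance_uuid)
    except NodeConflict as exc:
        # Assumption: a conflict means the node is no longer available,
        # so report it the same way as any other "no resources" failure.
        raise ComputeResourcesUnavailable(
            f"node {node_id} is no longer available: {exc}") from exc


def build_with_reschedule(candidates, set_instance_uuid, instance_uuid):
    """Return the first candidate node that can actually be claimed."""
    for node_id in candidates:
        try:
            claim_node(set_instance_uuid, node_id, instance_uuid)
            return node_id
        except ComputeResourcesUnavailable:
            # Simplified "re-schedule": move on to the next candidate.
            continue
    raise RuntimeError("no candidate node could be claimed")
```

With a fake set_instance_uuid callable that raises NodeConflict for a
node that is still cleaning, build_with_reschedule skips that node and
returns the next candidate instead of failing the whole build.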

** Affects: nova
     Importance: Low
     Assignee: John Garbutt (johngarbutt)
         Status: In Progress

** Changed in: nova
       Status: New => In Progress

** Changed in: nova
     Assignee: (unassigned) => John Garbutt (johngarbutt)

** Changed in: nova
   Importance: Undecided => Low

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1974070

