Public bug reported:

Description
===========

Sometimes when a baremetal instance is terminated, some VIFs are not
detached from the node. This can leave the node unusable: subsequent
attempts to provision it fail during VIF attachment because there are
not enough free ironic ports to attach the VIF to.

Steps to reproduce
==================

No reproduction procedure has been identified yet, but it will be
something like:

* boot one baremetal instance
* do something to trigger the bug
* delete the instance
* boot a second instance on the same ironic node

Expected results
================

The second instance should boot successfully.

Actual results
==============

The second instance fails to boot, and the following error message is
emitted by nova-compute:

VirtualInterfacePlugException: Cannot attach VIF
409830a5-b4de-4d1d-be22-5e6fe4ccd65b to the node
3aaaf79e-99fb-42a3-b22e-b1a7fae44272 due to error: Unable to attach VIF
409830a5-b4de-4d1d-be22-5e6fe4ccd65b, not enough free physical ports.
(HTTP 400)

The neutron port has been deleted:

$ openstack port show 7e567468-53a2-4fad-8bc9-a30a0e7218a0
ResourceNotFound: No Port found for 7e567468-53a2-4fad-8bc9-a30a0e7218a0

However, the corresponding VIF is still attached to the ironic node:

$ openstack baremetal node vif list <node>
+--------------------------------------+
| ID                                   |
+--------------------------------------+
| 7e567468-53a2-4fad-8bc9-a30a0e7218a0 |
+--------------------------------------+

Workaround
==========

The VIF can be manually detached via ironic:

$ openstack baremetal node vif detach <node> 7e567468-53a2-4fad-8bc9-a30a0e7218a0

This allows instances to be deployed on the node.

Environment
===========

RDO Pike, deployed on CentOS 7 using kayobe & kolla-ansible.

openstack-nova-api-16.0.0-1.el7.noarch

Notes
=====

I've seen this happen on a number of occasions, and have spent some time
investigating a few of them. Although they all have similarities, no two
have been the same, so far as I can tell.

Some things I've worked out along the way:

* the VIF detach code in ironic is very simple, and just removes the
tenant_vif_port_id field from the internal_info attribute of the ironic
port to which the VIF is attached. This leads me to believe that nova is
*not* calling this API during instance termination.

* the nova ironic virt driver's terminate method always ends up calling
_unplug_vifs, so either terminate has not been called, it has not
completed successfully, or the VIF was not present in the provided
network_info object. So far my investigations point to the last of
these - network_info does not contain the VIF.

* there seems to be some level of raciness when deleting instances and
their ports (VIFs) at similar times. The neutron vif unplugged event may
not always call detach_interface[1] on the virt driver, but will remove
the port from the instance info cache. This would cause the VIF to be
absent from network_info during terminate.

Given that there seem to be multiple causes for this issue, one way to
avoid the node becoming unusable would be to query the attached VIFs
from ironic when terminating an instance, in addition to using those in
network_info. Any VIF that is attached to the node but missing from
network_info could then be detached as well.
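
The idea can be sketched roughly as follows. This is only an
illustration of the set difference involved, not nova or ironic code:
the function names are made up, and list_attached and detach are
hypothetical callables standing in for whatever the driver would use
to list and detach VIFs on a node.

```python
from typing import Callable, Iterable, List


def find_unexpected_vifs(attached: Iterable[str],
                         expected: Iterable[str]) -> List[str]:
    """Return VIF IDs that ironic reports as attached but that are
    missing from network_info, in sorted order."""
    return sorted(set(attached) - set(expected))


def detach_leftover_vifs(node_uuid: str,
                         network_info_vif_ids: Iterable[str],
                         list_attached: Callable[[str], Iterable[str]],
                         detach: Callable[[str, str], None]) -> List[str]:
    """Detach VIFs that terminate would otherwise miss.

    list_attached(node_uuid) and detach(node_uuid, vif_id) are
    hypothetical stand-ins for the ironic "list VIFs" and "detach VIF"
    calls. Returns the IDs that were detached.
    """
    leftovers = find_unexpected_vifs(list_attached(node_uuid),
                                     network_info_vif_ids)
    for vif_id in leftovers:
        detach(node_uuid, vif_id)
    return leftovers
```

In the case described above, network_info would be empty and ironic
would still report 7e567468-53a2-4fad-8bc9-a30a0e7218a0, so that VIF
would be picked up and detached. VIFs that are present in network_info
are left to the existing unplug path.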

References
==========

[1]
https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L1481

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1733861

Title:
  VIFs not always detached from ironic nodes during termination

Status in OpenStack Compute (nova):
  New
