[Yahoo-eng-team] [Bug 1426524] [NEW] race condition prevents instance deletion

Evgeniy Afonichev Fri, 27 Feb 2015 11:34:20 -0800

Public bug reported:

Version: Icehouse, though from reading the code on master I believe the bug is
still there.
Hypervisor: libvirt
Frequency: very rare, under heavy load (stress tests)
Steps to reproduce: as an operator, I issue the "nova delete" command. Instead
of being deleted, the VM goes into the ERROR state.

I couldn't reproduce this issue myself, but there are some nova-compute logs:
http://paste.openstack.org/show/183111/
Here's why it happens:
It's a race condition. There are two threads (coroutines if eventlet-patched):
thread-1, which handles the termination request
(nova.compute.manager.ComputeManager.terminate_instance), and thread-2, which
dispatches events from the hypervisor.
1) thread-1: the manager clears (deletes) all queued events for that VM and
switches to thread-2:
https://github.com/openstack/nova/blob/983f755562cb87a0b498af5d62be9bd2010bc999/nova/compute/manager.py#L2526
2) thread-2: the hypervisor emits one more event, which is stored in
manager.instance_events; control then switches back to thread-1.
3) thread-1: the manager deletes the image files and marks the instance as
deleted in the DB. The thread finishes and exits normally.
4) thread-2: the manager tries to dispatch that last event but fails because
the instance no longer exists. To be more precise, there is no
InstanceInfoCache for that VM.
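
The four steps above can be replayed deterministically with a toy model. This
is only a sketch of the interleaving, not nova's code: ToyComputeManager, its
methods, the exception name and the event names are all hypothetical, and the
"thread switches" are simply the order of the calls in replay_race().

```python
class InstanceInfoCacheNotFound(Exception):
    """Raised when an event targets an instance whose cache entry is gone."""


class ToyComputeManager:
    """Hypothetical stand-in for the compute manager (not nova's real API)."""

    def __init__(self):
        self.instance_events = {}  # instance uuid -> list of queued events
        self.info_caches = {}      # instance uuid -> cached instance info

    def spawn(self, uuid):
        self.info_caches[uuid] = {"network_info": []}

    def clear_events_for_instance(self, uuid):
        # Drop everything queued so far and return it.
        return self.instance_events.pop(uuid, [])

    def store_event(self, uuid, event):
        self.instance_events.setdefault(uuid, []).append(event)

    def delete_instance(self, uuid):
        # Stands in for "delete image files, mark deleted in the DB".
        del self.info_caches[uuid]

    def dispatch_events(self, uuid):
        handled = []
        for event in self.instance_events.pop(uuid, []):
            if uuid not in self.info_caches:
                raise InstanceInfoCacheNotFound(f"{uuid}: {event}")
            handled.append(event)
        return handled


def replay_race():
    """Run steps 1-4 in the problematic order and report what happened."""
    manager = ToyComputeManager()
    uuid = "vm-1"
    manager.spawn(uuid)
    manager.store_event(uuid, "lifecycle-started")

    cleared = manager.clear_events_for_instance(uuid)  # step 1 (thread-1)
    manager.store_event(uuid, "lifecycle-stopped")     # step 2 (thread-2)
    manager.delete_instance(uuid)                      # step 3 (thread-1)
    try:
        manager.dispatch_events(uuid)                  # step 4 (thread-2)
    except InstanceInfoCacheNotFound as exc:
        return cleared, str(exc)
    return cleared, None
```

In this model the event queued in step 2 survives the clear in step 1, so the
dispatch in step 4 runs against an instance whose cache entry was already
removed in step 3.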

** Affects: nova
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1426524

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1426524/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
