Public bug reported:

Version: icehouse. Looking at the code in master, though, I believe the bug is still there.
Hypervisor: libvirt
Frequency: very rare, under heavy load (stress tests)

Steps to reproduce: as an operator, I issue the "nova delete" command. Instead of being deleted, the VM goes into the ERROR state.

I couldn't reproduce this issue myself, but there are some nova-compute logs:
http://paste.openstack.org/show/183111/

Here's why it happens: it's a race condition. There are two threads (coroutines, if eventlet-patched): thread-1, which handles the termination request (nova.compute.manager.ComputeManager.terminate_instance), and thread-2, which dispatches events from the hypervisor.

1) thread-1: the manager clears (deletes) all queued events for that VM and switches to thread-2
   https://github.com/openstack/nova/blob/983f755562cb87a0b498af5d62be9bd2010bc999/nova/compute/manager.py#L2526
2) thread-2: the hypervisor emits one more event, which is stored in manager.instance_events, and control switches back to thread-1
3) thread-1: the manager deletes the image files and marks the instance as deleted in the db. The thread finishes and exits normally
4) thread-2: the manager tries to dispatch the extra event, but fails because the instance no longer exists. To be more precise, there is no InstanceInfoCache for that VM.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1426524

Title:
  race condition prevents instance deletion

Status in OpenStack Compute (Nova):
  New
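
The four-step interleaving described in this report can be replayed deterministically with a toy model. This is only a minimal sketch, not nova code: ToyManager, InstanceNotFound, replay_race and all method names are hypothetical stand-ins, and the two "threads" are simulated by calling the steps in the reported order rather than by using eventlet. Only the ordering of the steps is taken from the report.

```python
class InstanceNotFound(Exception):
    """Raised when an event is dispatched for an instance that is gone."""


class ToyManager:
    """Hypothetical stand-in for the compute manager; not nova's API."""

    def __init__(self):
        self.instance_events = {}  # instance uuid -> list of queued events
        self.info_cache = {}       # instance uuid -> toy InstanceInfoCache

    def boot(self, uuid):
        self.info_cache[uuid] = {"network_info": []}
        self.instance_events[uuid] = []

    def clear_events(self, uuid):
        # Step 1: the terminate path drops everything queued so far.
        return self.instance_events.pop(uuid, [])

    def queue_event(self, uuid, event):
        # Step 2: a late hypervisor event is queued after the clear.
        self.instance_events.setdefault(uuid, []).append(event)

    def finish_delete(self, uuid):
        # Step 3: the instance, and with it the info cache, is removed.
        self.info_cache.pop(uuid, None)

    def dispatch_events(self, uuid):
        # Step 4: dispatching needs the info cache, which may be gone.
        for event in self.instance_events.pop(uuid, []):
            if uuid not in self.info_cache:
                raise InstanceNotFound(
                    f"no info cache for {uuid} while handling {event!r}")


def replay_race():
    """Run the four steps in the reported order; return the error text."""
    manager = ToyManager()
    manager.boot("vm-1")
    manager.clear_events("vm-1")                      # 1) thread-1
    manager.queue_event("vm-1", "lifecycle-stopped")  # 2) thread-2
    manager.finish_delete("vm-1")                     # 3) thread-1
    try:
        manager.dispatch_events("vm-1")               # 4) thread-2
    except InstanceNotFound as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_race())
```

In this toy model, if the late event is queued before the clear in step 1 rather than after it, the clear discards it and the dispatch in step 4 finds nothing to handle. That makes the window between steps 1 and 3 visible, but it says nothing about how nova itself should close it.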

