Public bug reported: The nova-live-migration job has 2 nodes and in addition to running live migration tests it also runs evacuate tests for image-backed and volume- backed servers.
Seeing intermittent failures, with debug in this change: https://review.openstack.org/#/c/571325/ It looks like the network-vif-plugged event is coming to the API before nova-compute has registered a callback for that event, so nova-compute does not process the event and then times out waiting for the event to complete by the time it registered. The API processes the network-vif-plugged for that server here: http://logs.openstack.org/25/571325/10/gate/nova-live- migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004 Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req- 3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede- 83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif- plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}} Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758 Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}} Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393 Which is weird because that's before the vif plugging timeout in the compute logs. From the compute logs, we plugged the vif here: Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova- compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d- b4e0ab87601c demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br- int',has_traffic_filtering=True,id=e241f79f- fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name ='tape241f79f-fb') So it looks like maybe we're getting the network-vif-plugged event before we're ready for it and so we miss the event and then timeout? That's pretty weird though because the libvirt driver's spawn method is what should be registering and waiting for the vif plugged event: https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673 Looks like the compute does register the event callback for network-vif- plugged here: Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova- compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-4ff3 -aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd- ce175adbdc67] Preparing to wait for external event network-vif-plugged- e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:325}} Which is too late: Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req- 3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede- 83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif- plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}} http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B ('network-vif- plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d 4 hits in the last 7 days, check and gate, all failures. ** Affects: nova Importance: Medium Status: Confirmed ** Tags: evacuate gate-failure networking -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1813789 Title: Evacuate test intermittently fails with network-vif-plugged timeout exception Status in OpenStack Compute (nova): Confirmed Bug description: The nova-live-migration job has 2 nodes and in addition to running live migration tests it also runs evacuate tests for image-backed and volume-backed servers. Seeing intermittent failures, with debug in this change: https://review.openstack.org/#/c/571325/ It looks like the network-vif-plugged event is coming to the API before nova-compute has registered a callback for that event, so nova- compute does not process the event and then times out waiting for the event to complete by the time it registered. The API processes the network-vif-plugged for that server here: http://logs.openstack.org/25/571325/10/gate/nova-live- migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004 Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req- 3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede- 83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif- plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}} Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758 Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}} Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393 Which is weird because that's before the vif plugging timeout in the compute logs. From the compute logs, we plugged the vif here: Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova- compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d- b4e0ab87601c demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br- int',has_traffic_filtering=True,id=e241f79f- fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name ='tape241f79f-fb') So it looks like maybe we're getting the network-vif-plugged event before we're ready for it and so we miss the event and then timeout? That's pretty weird though because the libvirt driver's spawn method is what should be registering and waiting for the vif plugged event: https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673 Looks like the compute does register the event callback for network- vif-plugged here: Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova- compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff- 4ff3-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd- ce175adbdc67] Preparing to wait for external event network-vif- plugged-e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:325}} Which is too late: Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req- 3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede- 83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling method: <bound method ServerExternalEventsController.create of <nova.api.openstack.compute.server_external_events.ServerExternalEventsController object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed", "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif- plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]} {{(pid=22322) _process_stack /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}} http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B ('network-vif- plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d 4 hits in the last 7 days, check and gate, all failures. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1813789/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp