Public bug reported:

Due to a mistake in the SDK patches, the wait_for_active loop tries to
check a random node instead of the target node. While this happens to
work in the default job, which has only one node, it fails in the
multinode grenade job.

The failure [1] looks a bit cryptic:

Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall [-] Fixed interval looping call 
'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: 
InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be 
found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall Traceback (most recent call last):
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall   File 
"/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 150, 
in _run_loop
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall   File 
"/opt/stack/nova/nova/virt/ironic/driver.py", line 542, in _wait_for_active
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall     raise 
exception.InstanceNotFound(instance_id=instance.uuid)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall InstanceNotFound: Instance 
09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR oslo.service.loopingcall
Sep 05 12:08:12.288474 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: 
ERROR nova.virt.ironic.driver [None req-fcb58bd7-803c-46ce-bc7e-3b0cfbac2a77 
tempest-BaremetalBasicOps-911057241 tempest-BaremetalBasicOps-911057241] Error 
deploying instance 09957ef9-d006-4328-8342-03a712888996 on baremetal node 
3040f681-72ff-4bcc-b0f6-129cfd146cf1.: InstanceNotFound: Instance 
09957ef9-d006-4328-8342-03a712888996 could not be found.
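
For context, the polling check that raises this error behaves roughly
like the sketch below; the names and structure are stand-ins, not the
actual code at nova/virt/ironic/driver.py line 542:

class InstanceNotFound(Exception):
    # Stand-in for nova's exception.InstanceNotFound.
    pass

def check_deploy_progress(node, instance_uuid):
    # One iteration of the wait-for-active polling loop (paraphrased).
    if node['provision_state'] == 'active':
        return True   # deployment finished
    if node['provision_state'] in ('available', None):
        # The node looks unprovisioned. Normally that means the deploy
        # was undone; in this bug it happens because an unrelated,
        # never-deployed node was fetched instead of the target node.
        raise InstanceNotFound(instance_uuid)
    return False      # still deploying, keep polling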

This code path is usually hit when a node suddenly goes into AVAILABLE,
which clearly shouldn't be the case here. The mystery was solved by
looking at the ironic-api logs [2]:

[Thu Sep  5 12:08:11 2019] GET /baremetal/v1/nodes/detail => generated
41705 bytes in 350 msecs (HTTP/1.1 200) 7 headers in 286 bytes (1
switches on core 0)

This does not match the expected query: GET
/baremetal/v1/nodes/?instance_uuid=<uuid>&fields=<blah-blah>. After
looking at the code more carefully, I noticed that we're passing
instance_id and fields inside a dict as a positional argument, while
they are expected as regular keyword arguments. The dict therefore
binds to the 'details' parameter, and since a non-empty dict is truthy,
'details' ends up being True. So in reality we're listing all nodes in
detail and picking the first one.
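
To make this concrete, here is a minimal standalone sketch of the
mistake. The nodes() function below is only a stand-in for the SDK's
node-listing call; its signature is assumed to be roughly
nodes(details=False, **query), and the filter values are placeholders:

def nodes(details=False, **query):
    # Stand-in for the SDK baremetal node listing: print the request it
    # would issue instead of calling the API.
    if details:
        print("GET /baremetal/v1/nodes/detail")
    else:
        print("GET /baremetal/v1/nodes/?"
              + "&".join("%s=%s" % item for item in query.items()))

filters = {'instance_id': '<uuid>', 'fields': '<blah-blah>'}

# Buggy call: the dict is passed positionally, so it binds to 'details'.
# A non-empty dict is truthy, so this lists *all* nodes in detail --
# exactly the GET /baremetal/v1/nodes/detail seen in the ironic-api log.
nodes(filters)

# Intended call: expand the dict into keyword arguments so the values
# become query parameters and only the target node is returned.
nodes(**filters)

The fix is simply to pass these values as keyword arguments (or expand
the dict with **) so they reach the query string.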

[1] 
https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-n-cpu.txt.gz
[2] 
https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-ir-api.txt.gz

** Affects: nova
     Importance: Critical
     Assignee: Dmitry Tantsur (divius)
         Status: In Progress


** Tags: ironic

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843050

Title:
  Regression in the ironic driver: wait_for_active checks a random node

Status in OpenStack Compute (nova):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1843050/+subscriptions
