Update: the root cause was HAProxy, which had reached about 4 billion tasks
in its run queue. As a result, the health checks of the nova services kept
passing, but the nova-compute service was in a coma-like state because its
HTTP calls to update resource usage went unanswered.
It seems strange to me that nova-compute hangs when HTTP calls get no
response; it should time out or fail in some way.
To be honest, I don't have enough knowledge or time to dig further into this,
and my problem is more or less solved, so for now I am changing the status of
this bug to Invalid. If anyone is interested in digging deeper, I will be
happy to help reproduce the issue.
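
To illustrate the "it should time out" expectation, here is a minimal sketch
using only the Python standard library. It does not reflect how nova-compute
actually issues its calls; the silent server and the helper names
(start_silent_server, call_with_timeout) are hypothetical stand-ins for a
proxy that accepts connections but never answers.

```python
import socket
import threading
import urllib.error
import urllib.request


def start_silent_server():
    """Listen on a free local port, accept connections, and never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    held = []  # keep accepted sockets open so the client is not reset

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def call_with_timeout(url, timeout):
    """Return 'ok' on a response, 'timeout' if none arrives in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except socket.timeout:
        # Raised when the read of the response exceeds the timeout.
        return "timeout"
    except urllib.error.URLError as exc:
        # Timeouts during connect/send are wrapped in URLError.
        return "timeout" if isinstance(exc.reason, socket.timeout) else "error"


server, port = start_silent_server()
result = call_with_timeout(f"http://127.0.0.1:{port}/", timeout=0.5)
print(result)
```

Without the timeout argument the same call would block for as long as the
peer stays silent, which matches the hang described above.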

** Changed in: nova
       Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1936720

Title:
  new instance gets stuck indefinitely at build state with task_state
  none

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===========
  nova-compute service is up but does not work.

  New instances scheduled on that compute node get stuck in the build state
  with task_state None, and they do not go to ERROR state even after the
  instance build timeout threshold is reached.

  (openstack) server show 9299bee1-633d-4233-9f2b-9a7d1871d51b
  +-------------------------------------+------------------------------------------------+
  | Field                               | Value                                          |
  +-------------------------------------+------------------------------------------------+
  | OS-DCF:diskConfig                   | AUTO                                           |
  | OS-EXT-AZ:availability_zone         | nova                                           |
  | OS-EXT-SRV-ATTR:host                | None                                           |
  | OS-EXT-SRV-ATTR:hypervisor_hostname | None                                           |
  | OS-EXT-SRV-ATTR:instance_name       | instance-0000bfb6                              |
  | OS-EXT-STS:power_state              | NOSTATE                                        |
  | OS-EXT-STS:task_state               | None                                           |
  | OS-EXT-STS:vm_state                 | building                                       |
  | OS-SRV-USG:launched_at              | None                                           |
  | OS-SRV-USG:terminated_at            | None                                           |
  | accessIPv4                          |                                                |
  | accessIPv6                          |                                                |
  | addresses                           |                                                |
  | config_drive                        |                                                |
  | created                             | 2021-07-17T11:49:35Z                           |
  | flavor                              | i1.mini (75253a8f-eb7c-4473-9874-884a01a524a7) |
  | hostId                              |                                                |
  | id                                  | 9299bee1-633d-4233-9f2b-9a7d1871d51b           |
  | image                               |                                                |
  | key_name                            | Sia-KP                                         |
  | name                                | qwerty-17                                      |
  | progress                            | 0                                              |
  | project_id                          | c4a93f6c1c194bf78bd98ee0f4d51978               |
  | properties                          |                                                |
  | status                              | BUILD                                          |
  | updated                             | 2021-07-17T11:49:41Z                           |
  | user_id                             | 042131e0784b46218521eee7963022bf               |
  | volumes_attached                    |                                                |
  +-------------------------------------+------------------------------------------------+

  
  I have two OpenStack setups (staging and production). The issue happens on
  both of them, but randomly on different compute nodes. Both setups run the
  stable/ussuri release and were deployed using openstack-ansible.

  There were no errors in the nova logs. After I enabled debug logging on the
  nova services, I noticed that on the affected compute node the logs had
  stopped some time before the problem occurred.
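
  Since the silent log was the only visible symptom, one way to spot it is to
  measure how long a service has gone without writing a timestamped line. The
  sketch below is an illustration only: the helper name silence_seconds and
  the chosen "now" value are hypothetical, and the two sample lines are
  abbreviated from the nova-compute log quoted later in this report.

```python
import re
from datetime import datetime

# Matches the timestamp format seen in the nova-compute log lines.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def silence_seconds(lines, now):
    """Seconds between the newest timestamped line and `now`.

    Returns None when no line carries a timestamp.
    """
    stamps = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")
            )
    if not stamps:
        return None
    return (now - max(stamps)).total_seconds()


# Abbreviated sample lines; the trailing "..." marks text left out here.
sample = [
    "2021-07-17 11:49:41.086 3857910 DEBUG oslo_concurrency.lockutils ...",
    "2021-07-17 11:49:41.731 3857910 DEBUG nova.compute.manager ...",
]
now = datetime(2021, 7, 17, 12, 0, 0)  # hypothetical check time
quiet = silence_seconds(sample, now)
print(quiet is not None and quiet > 300)
```

  A check like this only shows that the log is quiet; on an idle node without
  debug logging a long silence can be perfectly normal.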

  Output of compute service list while the issue is occurring (SHN-CP-12 is
  the affected compute node):

  (openstack) compute service list
  +-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
  |  ID | Binary         | Host                                  | Zone     | Status  | State | Updated At                 |
  +-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
  |   7 | nova-conductor | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
  |  34 | nova-scheduler | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up    | 2021-07-17T14:23:43.000000 |
  |  85 | nova-conductor | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up    | 2021-07-17T14:23:41.000000 |
  |  91 | nova-conductor | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
  | 109 | nova-scheduler | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up    | 2021-07-17T14:23:41.000000 |
  | 157 | nova-scheduler | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
  | 199 | nova-compute   | SHN-CP-72                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
  .
  .
  .
  | 232 | nova-compute   | SHN-CP-18                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
  | 235 | nova-compute   | SHN-CP-12                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
  | 238 | nova-compute   | SHN-CP-20                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
  | 241 | nova-compute   | SHN-CP-22                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
  +-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+

  
  restarting nova-compute will resolve the issue until it happens again.

  Steps to reproduce
  ==================
  - This does not happen every time, only occasionally.
  - Create multiple instances to increase the probability of hitting the issue.

  Expected result
  ===============
  Either the nova-compute service goes to the down state, or the instance goes
  to ERROR state, or a warning or error appears in the nova logs.

  Actual result
  =============
  Instances scheduled on the affected compute node (which, by the way, is hit
  at random) get stuck indefinitely in BUILD state with task_state None.
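
  Because the affected node never moves the instance to ERROR itself, an
  external check can flag such instances. The following is a rough sketch
  under stated assumptions: it works on plain dictionaries shaped like the
  "server show" output above rather than on any real API client, and the
  function name stuck_in_build, the 600-second threshold and the check time
  are all hypothetical.

```python
from datetime import datetime, timezone


def stuck_in_build(servers, now, max_build_seconds):
    """Return IDs of servers that have been in BUILD longer than the limit."""
    stuck = []
    for server in servers:
        if server["status"] != "BUILD":
            continue
        created = datetime.strptime(
            server["created"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if (now - created).total_seconds() > max_build_seconds:
            stuck.append(server["id"])
    return stuck


# Field values copied from the "server show" output in this report.
servers = [
    {
        "id": "9299bee1-633d-4233-9f2b-9a7d1871d51b",
        "status": "BUILD",
        "created": "2021-07-17T11:49:35Z",
    },
]
# Hypothetical check time and threshold.
now = datetime(2021, 7, 17, 12, 0, 0, tzinfo=timezone.utc)
flagged = stuck_in_build(servers, now, max_build_seconds=600)
print(flagged)
```

  The threshold would need to be chosen to suit the deployment; images that
  take long to download can legitimately keep an instance in BUILD for a
  while.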

  Environment
  ===========
  OSA deployment of stable/ussuri on Ubuntu, with install_method=source.
  This problem started after I separated the RPC RabbitMQ cluster from the
  notify RabbitMQ cluster (not sure whether this is related, but that is when
  it started happening).
  It is also worth mentioning that the issue happens on both of my setups.

  Logs & Configs
  ==============
  this is the log before nova-compute service stops logging:
  https://paste.opendev.org/show/807547/

  this is the nova-compute log from when the instance gets scheduled on the node:

  # journalctl -u nova-compute.service --since '2021-07-17 11:49:00' --until '2021-07-17 12:00:00' --no-pager
  -- Logs begin at Mon 2021-05-31 04:36:00 UTC, end at Sat 2021-07-17 16:23:38 UTC. --
  Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.086 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "6a148ea3-6793-4e26-acb2-9dd1214a666d" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
  Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.105 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 6a148ea3-6793-4e26-acb2-9dd1214a666d] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173
  Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.711 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "9299bee1-633d-4233-9f2b-9a7d1871d51b" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
  Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.731 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 9299bee1-633d-4233-9f2b-9a7d1871d51b] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1936720/+subscriptions

