This isn't a Nova bug; it looks more like an oslo.messaging problem. In any
case, once the nova-compute service stops reporting, the servicegroup API
marks it as down and the scheduler won't pick that host, so this shouldn't
cause a scheduling problem. A rough sketch of that liveness check follows.
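
Simplified sketch (an assumption about how the DB servicegroup driver behaves,
not Nova's actual code): a compute service counts as "up" only while its last
heartbeat is newer than service_down_time (60 seconds by default), so a node
that stops reporting is filtered out of scheduling decisions.

    # Sketch only; service_is_up() here is a stand-in for the servicegroup
    # driver's check, not Nova's real implementation.
    from datetime import datetime, timedelta, timezone

    SERVICE_DOWN_TIME = timedelta(seconds=60)  # nova's default service_down_time

    def service_is_up(last_heartbeat: datetime) -> bool:
        """True while the service has reported in within service_down_time."""
        return datetime.now(timezone.utc) - last_heartbeat <= SERVICE_DOWN_TIME

    # A compute node whose last heartbeat is 5 minutes old is skipped.
    stale = datetime.now(timezone.utc) - timedelta(minutes=5)
    print(service_is_up(stale))  # False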


** Also affects: oslo.messaging
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054502

Title:
  shutting down rabbitmq causes nova-compute.service to go down

Status in OpenStack Compute (nova):
  Invalid
Status in oslo.messaging:
  New

Bug description:
  Description
  ===========
  We have an OpenStack deployment with a RabbitMQ cluster of 3 nodes and
  dozens of nova-compute nodes.
  When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that
  nova-compute.service was down on 2 of the nova-compute nodes.

  Upon checking, we found that nova-compute.service was actually still running:

  nova-compute.service - OpenStack Compute
       Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; 
vendor preset: enabled)
       Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
     Main PID: 10130 (nova-compute)
        Tasks: 32 (limit: 463517)
       Memory: 248.2M
          CPU: 55min 5.217s
       CGroup: /system.slice/nova-compute.service
               ├─10130 /usr/bin/python3 /usr/bin/nova-compute 
--config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf 
--log-file=/var/log/nova/nova-compute.log
               ├─11527 /usr/bin/python3 /bin/privsep-helper --config-file 
/etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context 
vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
               └─11702 /usr/bin/python3 /bin/privsep-helper --config-file 
/etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context 
nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock

  Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened 
for user root(uid=0) by (uid=64060)
  Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed 
for user root
  Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call 
last):
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
  Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
  Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
  Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
  Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to 
a different thread

  I suspect that when a RabbitMQ node is shut down, nova-compute hits
  contention or an inconsistent state while handling connection recovery;
  the traceback points at a greenlet being resumed from the wrong OS thread
  (see the sketch below). Restarting nova-compute.service resolves the
  problem.
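
  A minimal standalone reproduction of that greenlet.error (an assumption
  about the failure mode, not Nova's actual code path): greenlets are bound
  to the OS thread that created them, so resuming a greenlet from a different
  native thread raises exactly this error.

  # Sketch only: demonstrates the cross-thread switch failure, not Nova's code.
  import threading
  import greenlet

  g = greenlet.greenlet(lambda: None)  # bound to the main thread

  def resume_from_other_thread():
      try:
          g.switch()
      except greenlet.error as exc:
          print("greenlet.error:", exc)  # cannot switch to a different thread

  t = threading.Thread(target=resume_from_other_thread)
  t.start()
  t.join()

  My guess (unverified) is that during connection recovery something in the
  oslo.messaging heartbeat path touches an eventlet semaphore from a native
  thread, which would produce the traceback above.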

  Logs & Configs
  ==============
  The nova-compute.log:

  2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] 
[0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is 
unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: 
ConnectionResetError: [Errno 104] Connection reset by peer
  2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] 
[0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is 
unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: 
ConnectionRefusedError: [Errno 111] ECONNREFUSED
  2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] 
[0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 
10.10.10.52:5672 via [amqp] client with port 35346.
  2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A 
recoverable connection/channel error occurred, trying to reconnect: [Errno 104] 
Connection reset by peer
  Then systemctl status nova-compute shows:
  Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call 
last):
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
  Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
  Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
  Feb 20 04:55:31 node002 nova-compute[10130]:   File 
"/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
  Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
  Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to 
a different thread


  Environment: Ubuntu 22.04 (Jammy) + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)

  nova.conf:

  [oslo_messaging_rabbit]

  
  [oslo_messaging_notifications]
  driver = messagingv2
  transport_url = *********

  [notifications]
  notification_format = unversioned
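
  For reference (not values from our deployment): the [oslo_messaging_rabbit]
  section above is empty, so everything runs on defaults. The oslo.messaging
  options that govern heartbeating and reconnection look roughly like this,
  with what I understand to be the upstream defaults:

  [oslo_messaging_rabbit]
  # Seconds without a heartbeat before the connection is considered dead.
  heartbeat_timeout_threshold = 60
  # Heartbeat checks performed per timeout interval.
  heartbeat_rate = 2
  # Pause before reconnecting after a broker failure.
  kombu_reconnect_delay = 1.0
  # Initial retry interval on connection errors, increased by
  # rabbit_retry_backoff after each failure, capped at rabbit_interval_max.
  rabbit_retry_interval = 1
  rabbit_retry_backoff = 2
  rabbit_interval_max = 30
  # heartbeat_in_pthread controls whether the AMQP heartbeat runs in a
  # native thread instead of a green thread; its default has changed
  # across releases.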

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054502/+subscriptions

