Public bug reported:

Environment:
- OpenStack 2024.02, deployed via Kolla-Ansible
- nova-compute communicating with RabbitMQ
- [oslo_messaging_rabbit]
heartbeat_in_pthread = false
ssl = true
ssl_ca_file = /etc/ssl/certs/ca-certificates.crt
rabbit_quorum_queue = true

- There are 3 compute nodes. The affected node hosts about 53 instances; 
each of the other two hosts about 32-35 instances.
- Each compute is using less than 30% of its resources.
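
For anyone trying to reproduce this, a minimal Python sketch of how the 
settings above could be checked programmatically. The fragment is copied 
from this report; the fallback of 60 for heartbeat_timeout_threshold is an 
assumption on my side (the option is not set in the report, and 60 matches 
the "timeout: 60s" in the RabbitMQ log), and summarize() is a hypothetical 
helper, not part of nova or oslo.messaging.

```python
import configparser

# Fragment copied from the report. A real check would read the nova.conf
# that the nova-compute service actually loads (path depends on deployment).
FRAGMENT = """
[oslo_messaging_rabbit]
heartbeat_in_pthread = false
ssl = true
ssl_ca_file = /etc/ssl/certs/ca-certificates.crt
rabbit_quorum_queue = true
"""

# Assumption: used only when heartbeat_timeout_threshold is not set.
ASSUMED_TIMEOUT_DEFAULT = 60


def summarize(text: str) -> dict:
    """Return the heartbeat-related settings found in an INI fragment."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["oslo_messaging_rabbit"]
    return {
        "heartbeat_in_pthread": section.getboolean("heartbeat_in_pthread"),
        "ssl": section.getboolean("ssl"),
        "rabbit_quorum_queue": section.getboolean("rabbit_quorum_queue"),
        "heartbeat_timeout_threshold": section.getint(
            "heartbeat_timeout_threshold", fallback=ASSUMED_TIMEOUT_DEFAULT
        ),
    }


summary = summarize(FRAGMENT)
for key, value in summary.items():
    print(f"{key} = {value}")
```

Comparing this output across the three compute nodes would show whether the 
affected node is configured differently from the other two.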

Observed:
- Unexpectedly frequent reconnects/recoverable channel errors on nova-compute.
- Compute node occasionally marked as down or delayed in reporting state, 
causing scheduling delays.
- No kernel/syslog error during the time window.

Log entries from the most recent connection loss:

- RabbitMQ
2025-07-21 03:05:27.312 <0.127012395.1> missed heartbeats from client, timeout: 60s
2025-07-21 03:05:27.312 <0.127012395.1> closing AMQP connection <0.127012395.1> (compute-node:45166 -> controller-node:5671 - nova-compute:...)
...
2025-07-21 03:05:40.605 <0.153316717.1> missed heartbeats from client, timeout: 60s
2025-07-21 03:05:40.605 <0.153316717.1> closing AMQP connection <0.153316717.1> (compute-node:34520 -> controller-node1:5671 - nova-compute:...)

- Compute

2025-07-21 03:05:44.397 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] AMQP server on controller-node:5671 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2025-07-21 03:05:44.398 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] A recoverable connection/channel error occurred, trying to reconnect: Too many heartbeats missed
2025-07-21 03:05:44.398 A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2406)
2025-07-21 03:05:45.457 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] Reconnected to AMQP server on controller-node:5671 via [amqp] client with port 43046.
2025-07-21 03:05:45.459 [9a2a48b0-4c3b-471e-8980-20eab5e55e0b] Reconnected to AMQP server on controller-node1:5671 via [amqp] client with port 41370.
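
The timestamps in the two excerpts can be lined up to see how long 
nova-compute took to notice that the broker had already closed its 
connections. A minimal Python sketch, assuming both logs use the same clock 
and the timestamp format shown here; parse_ts() is a hypothetical helper and 
the log lines are copied from this report.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], TS_FORMAT)


# Broker side: the moments RabbitMQ gave up on each connection.
broker_lines = [
    "2025-07-21 03:05:27.312 <0.127012395.1> missed heartbeats from client, timeout: 60s",
    "2025-07-21 03:05:40.605 <0.153316717.1> missed heartbeats from client, timeout: 60s",
]

# Client side: the first error nova-compute logged for this incident.
client_line = (
    "2025-07-21 03:05:44.397 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] "
    "AMQP server on controller-node:5671 is unreachable: "
    "Server unexpectedly closed connection. Trying again in 1 seconds.: "
    "OSError: Server unexpectedly closed connection"
)

client_ts = parse_ts(client_line)

# Seconds between each broker-side close and the first client-side error.
delays = [
    round((client_ts - parse_ts(line)).total_seconds(), 3)
    for line in broker_lines
]

for line, delay in zip(broker_lines, delays):
    print(f"broker closed at {line[:23]}, client noticed {delay} s later")
```

With the lines above this gives delays of 17.085 s and 3.792 s, i.e. the 
client was not reading from (or writing to) those sockets for several 
seconds after the broker had closed them.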

A similar issue is reported in another bug:
https://bugs.launchpad.net/kolla-ansible/+bug/2091975

** Affects: kolla-ansible
     Importance: Undecided
         Status: New


** Tags: nova-compute

** Changed in: kolla-ansible
       Status: New => Invalid

** Changed in: ubuntu
       Status: Invalid => New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2117454

Title:
  Frequent RabbitMQ heartbeat timeouts cause intermittent nova-compute
  reconnect loops in OpenStack 2024.02

To manage notifications about this bug go to:
https://bugs.launchpad.net/kolla-ansible/+bug/2117454/+subscriptions

