Public bug reported:

Environment:
- OpenStack 2024.02, deployed via Kolla-Ansible
- nova-compute communicating with RabbitMQ
- [oslo_messaging_rabbit]
  heartbeat_in_pthread = false
  ssl = true
  ssl_ca_file = /etc/ssl/certs/ca-certificates.crt
  rabbit_quorum_queue = true
- There are 3 compute nodes: the one showing the error hosts about 53 instances, and each of the other two hosts about 32-35 instances.
- Each compute node is using less than 30% of its resources.

Observed:
- Unexpectedly frequent reconnects/recoverable channel errors on nova-compute.
- The compute node is occasionally marked as down or is delayed in reporting its state, causing scheduling delays.
- No kernel/syslog errors during the time window.

Log errors from the most recent lost connection:

- RabbitMQ
2025-07-21 03:05:27.312 <0.127012395.1> missed heartbeats from client, timeout: 60s
2025-07-21 03:05:27.312 <0.127012395.1> closing AMQP connection <0.127012395.1> (compute-node:45166 -> controller-node:5671 - nova-compute:...)
...
2025-07-21 03:05:40.605 <0.153316717.1> missed heartbeats from client, timeout: 60s
2025-07-21 03:05:40.605 <0.153316717.1> closing AMQP connection <0.153316717.1> (compute-node:34520 -> controller-node1:5671 - nova-compute:...)

- Compute
2025-07-21 03:05:44.397 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] AMQP server on controller-node:5671 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2025-07-21 03:05:44.398 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] A recoverable connection/channel error occurred, trying to reconnect: Too many heartbeats missed
2025-07-21 03:05:44.398 A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2406)
2025-07-21 03:05:45.457 [43b1a1ae-54c0-4a27-994c-dc0a885e0897] Reconnected to AMQP server on controller-node:5671 via [amqp] client with port 43046.
2025-07-21 03:05:45.459 [9a2a48b0-4c3b-471e-8980-20eab5e55e0b] Reconnected to AMQP server on controller-node1:5671 via [amqp] client with port 41370.
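
For triage, the broker-side entries quoted above can be pulled out of a longer RabbitMQ log with a small script. This is only a minimal sketch and not part of the original report: the regular expression is modeled solely on the two "missed heartbeats" lines shown above, the function name `parse_missed_heartbeats` is hypothetical, and other RabbitMQ versions or log formats may need a different pattern.

```python
import re
from datetime import datetime

# Assumption: lines look like the RabbitMQ excerpts quoted in this report:
# "<date> <time>.<ms> <pid> missed heartbeats from client, timeout: <N>s"
MISSED_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"<(?P<pid>[\d.]+)> missed heartbeats from client, timeout: (?P<timeout>\d+)s"
)


def parse_missed_heartbeats(lines):
    """Return one dict per 'missed heartbeats' line; other lines are ignored."""
    events = []
    for line in lines:
        match = MISSED_RE.match(line.strip())
        if match:
            events.append({
                "time": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
                "pid": match["pid"],
                "timeout_s": int(match["timeout"]),
            })
    return events


# Sample input copied from the log excerpt in this report.
sample = [
    "2025-07-21 03:05:27.312 <0.127012395.1> missed heartbeats from client, timeout: 60s",
    "2025-07-21 03:05:27.312 <0.127012395.1> closing AMQP connection <0.127012395.1> "
    "(compute-node:45166 -> controller-node:5671 - nova-compute:...)",
    "2025-07-21 03:05:40.605 <0.153316717.1> missed heartbeats from client, timeout: 60s",
]

events = parse_missed_heartbeats(sample)
# Seconds between the two broker-side heartbeat timeouts in the sample.
gap_seconds = (events[1]["time"] - events[0]["time"]).total_seconds()
print(len(events), "missed-heartbeat events;", gap_seconds, "s apart")
```

Run against a full broker log, the same function would show how often and on which connections the 60s timeout fires, which may help correlate the drops with load on the affected compute node.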

A similar phenomenon is described in another bug report:
https://bugs.launchpad.net/kolla-ansible/+bug/2091975

** Affects: kolla-ansible
     Importance: Undecided
         Status: New

** Tags: nova-compute

** Changed in: kolla-ansible
       Status: New => Invalid

** Changed in: ubuntu
       Status: Invalid => New

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2117454

Title:
  Frequent RabbitMQ heartbeat timeouts cause intermittent nova-compute reconnect loops in OpenStack 2024.02

To manage notifications about this bug go to:
https://bugs.launchpad.net/kolla-ansible/+bug/2117454/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
