Public bug reported:

In MOS 6.1:

When the RabbitMQ cluster recovers from a failure (whatever the cause), the following OpenStack services also had to be restarted to get our environment stable again:

- nova-conductor
- nova-scheduler
- nova-compute (on compute nodes)
- ceilometer-collector

As of now I believe the issue might be due to heartbeats being disabled. If heartbeats are disabled and RabbitMQ goes down ungracefully, the nova services have no way of knowing that RabbitMQ went down. When a connection is cut off completely, the receiving side doesn't know that the connection has dropped, so you can end up with a half-open connection. The general solution for this on Linux is to turn on TCP keepalives, but given that RabbitMQ has a heartbeat feature built in, I think enabling that would be the way to go.

Perhaps building upon this bug would be a wise idea:
https://bugs.launchpad.net/fuel/+bug/1447559

Alternatively, I have found a solution provided by an escalations engineer as a custom patch for a customer. This patch would be applied to the oslo.utils library (on which oslo.messaging depends) on compute nodes and controllers.

File to update:
/usr/lib/python2.7/dist-packages/oslo.utils/excutils.py

This can be done by going to /usr/lib/python2.7/dist-packages/oslo.utils/ and running

  patch < oslo_utils2.diff

After that, restart the nova-compute service by running

  /etc/init.d/nova-compute restart

My proposal is that we investigate the reasoning behind the first solution. Additionally, I think this patch should make its way to MOS 6.1 MU5 or another MU.

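For reference, here is a minimal sketch of what the TCP keepalive approach mentioned above looks like at the socket level. It is not the attached patch and not how oslo.messaging handles connections; the function name `enable_keepalive` and the timing values are assumptions chosen for illustration. The three tuning options are Linux-specific, so they are applied only when the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive probes on sock (illustrative helper, not an oslo API).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    All three defaults are assumed values, not recommendations.
    """
    # Ask the kernel to probe an idle connection so a dead peer is noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific tuning knobs; skipped where the constant is missing.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


# Usage: configure a socket before connecting it to the broker.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With these assumed values, a peer that vanished without closing the connection would be detected after roughly idle + interval * count seconds (about 110 seconds here), which is the kind of half-open connection described above. An application-level heartbeat serves the same purpose without relying on kernel settings.
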
** Affects: mos
     Importance: Undecided
         Status: New

** Tags: nova oslo.messaging

** Attachment added: "oslo-patch"
   https://bugs.launchpad.net/bugs/1538759/+attachment/4557961/+files/oslo_utils2.diff

** Description changed:

- When RabbitMQ cluster recovers from a failure (whatever the case may be), other OpenStack services like the following had to be restarted as well to get our environment stable again:
- - nova-conductor
- - nova-scheduler
- - nova-compute (on compute nodes)
+ When RabbitMQ cluster recovers from a failure (whatever the case may be), other OpenStack services like the following had to be restarted as well to get our environment stable again:
+ - nova-conductor
+ - nova-scheduler
+ - nova-compute (on compute nodes)
  - ceilometer-collector

- My proposal is that we investigate the reasoning behind the first solution.
- Additionally, I think this patch should make its way to MOS 6.1 MU5 or other.
+ My proposal is that we investigate the reasoning behind the first
+ solution. Additionally, I think this patch should make its way to MOS
+ 6.1 MU5 or other.

** Also affects: oslo.messaging (Ubuntu)
   Importance: Undecided
       Status: New

** No longer affects: oslo.messaging (Ubuntu)

** Description changed:

+ In MOS 6.1:
+

--
You received this bug notification because you are a member of Ubuntu Server Team, which is subscribed to oslo.messaging in Ubuntu.
https://bugs.launchpad.net/bugs/1538759

Title:
  When RabbitMQ cluster service restarts, other OpenStack services do not gracefully recover