Re: [openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?
On Nov 29, 2013, at 9:24 PM, Chris Friesen chris.frie...@windriver.com wrote:

> On 11/29/2013 06:37 PM, David Koo wrote:
>> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
>>> We're currently running Grizzly (going to Havana soon) and we're running into an issue where if the active controller is ungracefully killed then nova-compute on the compute node doesn't properly connect to the new rabbitmq server on the newly-active controller node.
>>>
>>> Interestingly, killing and restarting nova-compute on the compute node seems to work, which implies that the retry code is doing something less effective than the initial startup.
>>>
>>> Has anyone doing HA controller setups run into something similar?
>
> As a followup, it looks like if I wait for 9 minutes or so I see a message in the compute logs:
>
>     2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
>
> It then reconnects to the AMQP server and everything is fine after that. However, any instances that I tried to boot during those 9 minutes stay stuck in the BUILD status.
>
>> So the rabbitmq server and the controller are on the same node?
>
> Yes, they are.
>
>> My guess is that it's related to bug 856764 ("RabbitMQ connections lack heartbeat or TCP keepalives"). The gist of it is that since there are no heartbeats between the MQ and nova-compute, if the MQ goes down ungracefully then nova-compute has no way of knowing. If the MQ goes down gracefully then the MQ clients are notified and so the problem doesn't arise.
>
> Sounds about right.
>
>> We got bitten by the same bug a while ago when our controller node got hard reset without any warning! It came down to this bug (which, unfortunately, doesn't have a fix yet). We worked around it with our own crude fix - we wrote a simple app to periodically check whether the MQ was alive (write a short message into the MQ, then read it out again). When this fails N times in a row we restart nova-compute. Very ugly, but it worked!
>
> Sounds reasonable. I did notice a kombu heartbeat change that was submitted and then backed out again because it was buggy. I guess we're still waiting on the real fix?

Hi Chris,

This general problem comes up a lot, and one fix is to use keepalives. Note that more is needed if you are using multi-master rabbitmq, but for failover I have had great success with the following (also posted to the bug):

When a connection to a socket is cut off completely, the receiving side doesn't know that the connection has dropped, so you can end up with a half-open connection. The general solution for this in Linux is to turn on TCP keepalives. Kombu will enable keepalives if its version number is high enough (1.0, IIRC), but rabbit needs to be specially configured to send keepalives on the connections that it creates. So solving the HA issue generally involves a rabbit config with a section like the following:

    [
      {rabbit, [{tcp_listen_options, [binary,
                                      {packet, raw},
                                      {reuseaddr, true},
                                      {backlog, 128},
                                      {nodelay, true},
                                      {exit_on_close, false},
                                      {keepalive, true}]}
      ]}
    ].

Then you should also shorten the keepalive sysctl settings, or it will still take ~2 hours to terminate the connections:

    echo 5 > /proc/sys/net/ipv4/tcp_keepalive_time
    echo 5 > /proc/sys/net/ipv4/tcp_keepalive_probes
    echo 1 > /proc/sys/net/ipv4/tcp_keepalive_intvl

Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else at the defaults, you can use an LD_PRELOAD library to do so.
For example you could use: https://github.com/meebey/force_bind/blob/master/README

Vish
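As an illustration of the per-socket equivalent of the settings above, here is a minimal Python sketch (not part of the original thread's tooling; the host, port, and timing values are placeholders). It enables SO_KEEPALIVE on a single connection and overrides the three Linux keepalive parameters for that socket only, mirroring the sysctl values Vish suggests:

    import socket

    # Connect to the broker (host and port are placeholders).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("rabbit-host", 5672))

    # Turn on TCP keepalives for this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific per-socket overrides, mirroring the sysctl values above:
    # start probing after 5s idle, send up to 5 probes, one every 1s.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 1)

Setting the options per socket avoids changing keepalive behaviour for every other connection on the host, which is the same concern the LD_PRELOAD suggestion addresses.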
Re: [openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?
We had the same problem in our deployment. Here is a brief description of what we saw and how we fixed it:

http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html

On Mon, Dec 2, 2013 at 10:37 AM, Vishvananda Ishaya vishvana...@gmail.com wrote:

> [...]

--
Ravi
[openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?
Hi,

We're currently running Grizzly (going to Havana soon) and we're running into an issue where if the active controller is ungracefully killed then nova-compute on the compute node doesn't properly connect to the new rabbitmq server on the newly-active controller node.

I saw a bugfix in Folsom (https://bugs.launchpad.net/nova/+bug/718869) to retry the connection to rabbitmq if it's lost, but it doesn't seem to be properly handling this case. Interestingly, killing and restarting nova-compute on the compute node seems to work, which implies that the retry code is doing something less effective than the initial startup.

Has anyone doing HA controller setups run into something similar?

Chris
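For reference, the reconnect path being discussed is roughly what kombu's ensure_connection provides; a minimal sketch, assuming a reachable broker (the URL and retry parameters here are illustrative, not the actual nova defaults):

    import kombu

    conn = kombu.Connection("amqp://guest:guest@controller:5672//")

    # Keep retrying the connection with linear backoff: wait 2s before the
    # first retry, add 2s per attempt, cap at 30s, and report each failure.
    conn.ensure_connection(
        errback=lambda exc, interval: print(
            "connect failed (%s), retrying in %ss" % (exc, interval)),
        max_retries=None,  # retry indefinitely
        interval_start=2,
        interval_step=2,
        interval_max=30,
    )

The catch described later in the thread is that a half-open socket never raises an error in the first place, so this retry machinery doesn't kick in until the kernel finally gives up on the dead connection.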
Re: [openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?
On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:

> We're currently running Grizzly (going to Havana soon) and we're running into an issue where if the active controller is ungracefully killed then nova-compute on the compute node doesn't properly connect to the new rabbitmq server on the newly-active controller node.
>
> I saw a bugfix in Folsom (https://bugs.launchpad.net/nova/+bug/718869) to retry the connection to rabbitmq if it's lost, but it doesn't seem to be properly handling this case. Interestingly, killing and restarting nova-compute on the compute node seems to work, which implies that the retry code is doing something less effective than the initial startup.
>
> Has anyone doing HA controller setups run into something similar?

So the rabbitmq server and the controller are on the same node?

My guess is that it's related to bug 856764 ("RabbitMQ connections lack heartbeat or TCP keepalives"). The gist of it is that since there are no heartbeats between the MQ and nova-compute, if the MQ goes down ungracefully then nova-compute has no way of knowing. If the MQ goes down gracefully then the MQ clients are notified and so the problem doesn't arise.

We got bitten by the same bug a while ago when our controller node got hard reset without any warning! It came down to this bug (which, unfortunately, doesn't have a fix yet). We worked around it with our own crude fix - we wrote a simple app to periodically check whether the MQ was alive (write a short message into the MQ, then read it out again). When this fails N times in a row we restart nova-compute. Very ugly, but it worked!

--
Koo
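Koo's workaround is easy to picture; here is a crude Python sketch of that kind of watchdog, assuming kombu is available (the broker URL, queue name, thresholds, and restart command are all illustrative assumptions, not the actual tool):

    import subprocess
    import time

    import kombu

    BROKER = "amqp://guest:guest@controller:5672//"  # placeholder URL
    MAX_FAILURES = 3                                 # "N times in a row"

    failures = 0
    while True:
        try:
            # Round-trip a short message through the MQ as a liveness probe.
            with kombu.Connection(BROKER, connect_timeout=5) as conn:
                queue = conn.SimpleQueue("mq-watchdog")
                queue.put("ping")
                msg = queue.get(block=True, timeout=5)
                msg.ack()
                queue.close()
            failures = 0
        except Exception:
            failures += 1
            if failures >= MAX_FAILURES:
                # The crude part: bounce nova-compute so it reconnects cleanly.
                subprocess.call(["service", "nova-compute", "restart"])
                failures = 0
        time.sleep(30)

Because each probe opens a fresh connection, it fails fast even when the service's own long-lived connection is stuck half-open, which is exactly the failure mode being discussed.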