Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
On 05/08/2014 02:22 PM, Bogdan Dobrelya wrote:
> On 05/06/2014 10:42 PM, Roman Sokolkov wrote:
>> Hello, fuelers.
>>
>> I'm using Fuel 4.1A + Havana in HA mode. I constantly observe (on other
>> deployments as well) an issue with a stuck nova-compute service. But I
>> think the problem is more fundamental and relates to HA RabbitMQ and the
>> OpenStack AMQP driver implementation.
>>
>> Symptoms:
>> * From time to time, a random nova-compute is marked as XXX for a while.
>> * The service itself appears to work properly. Its logs show it sending
>>   status updates to the conductor, but nothing is actually sent.
>> * netstat shows all connections to/from rabbit as ESTABLISHED.
>> * rabbitmqctl shows the compute.node-x queue synced to all slaves.
>> * Nothing was broken beforehand (the rabbitmq cluster, etc.).
>>
>> Axe-style solution:
>> * /etc/init.d/openstack-nova-compute restart
>>
>> I've found a lot of interesting material (and solutions) here:
>> https://bugs.launchpad.net/oslo.messaging/+bug/856764
>>
>> My questions are:
>> * Are there any Fuel-specific ideas for solving or working around this issue?
>> * Is there a quick fix for this in 4.1, such as adjusting TCP keep-alive
>>   timeouts?
>
> I submitted an issue for Fuel https://bugs.launchpad.net/fuel/+bug/1317488
> and assigned it to the Fuel hardening team. Feel free to update it as
> appropriate.

For some reason, issue #1317488 was marked as a duplicate of
https://bugs.launchpad.net/fuel/+bug/1289200 (perhaps handling sessions that
disappeared and became half-open is the generic case behind both of them?).

A patch (not the final one, I believe) was suggested here:
https://review.openstack.org/#/c/93411/
Please feel free to test it on any affected environments. Any feedback would
be greatly appreciated, thank you.

>> --
>> Roman Sokolkov, Deployment Engineer, Mirantis, Inc.
>> Skype rsokolkov, rsokol...@mirantis.com

--
Best regards, Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
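On the TCP keep-alive question raised above, here is a minimal Python sketch of per-socket keep-alive tuning. It is not the change proposed in the reviews linked in this thread. The function names and the timeout values are hypothetical and chosen for illustration only. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are Linux-specific, so the sketch sets them only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle=30, interval=5, probes=3):
    """Enable TCP keep-alive on an existing socket (hypothetical helper).

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    probes:   unanswered probes before the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tunables; only set them where the platform provides them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


def worst_case_detection_seconds(idle, interval, probes):
    """Rough upper bound on how long a dead idle peer goes unnoticed."""
    return idle + interval * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(worst_case_detection_seconds(30, 5, 3))
    s.close()
```

Keep-alive probes are only sent on an otherwise idle connection, so this may help with the half-open case but is not guaranteed to cover every variant of the symptom described in this thread.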
Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
On 05/06/2014 10:42 PM, Roman Sokolkov wrote:
> My questions are:
> * Are there any Fuel-specific ideas for solving or working around this issue?
> * Is there a quick fix for this in 4.1, such as adjusting TCP keep-alive
>   timeouts?

I submitted an issue for Fuel https://bugs.launchpad.net/fuel/+bug/1317488
and assigned it to the Fuel hardening team. Feel free to update it as
appropriate.

--
Best regards, Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando
Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
Bogdan, thank you.

On Thu, May 8, 2014 at 6:22 AM, Bogdan Dobrelya bdobre...@mirantis.com wrote:
> I submitted an issue for Fuel https://bugs.launchpad.net/fuel/+bug/1317488
> and assigned it to the Fuel hardening team. Feel free to update it as
> appropriate.

--
Roman Sokolkov, Deployment Engineer, Mirantis, Inc.
Skype rsokolkov, rsokol...@mirantis.com
Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
Roman,

The current stable/4.1 has some fixes that make this less likely to occur,
and it is the version most likely to recover.

That said, I've done some tracing, and there are some issues with
nova-conductor processing those messages. Sometimes I've seen the compute
node be the issue; other times I've seen nova-conductor be the issue. As of
stable/4.1 I've been able to track it down to nova-conductor. AFAICT it
receives the message from nova-compute, takes it from the queue, acks it,
and selects the object from the DB.

However, after moving nova-compute and nova-conductor to trace-level logging
for amqp and sqlalchemy, the issue appears to stop. I've yet to confirm
whether the cluster state of rabbit changed, whether the change in logging
level changed it, or whether it was something else.

On Tue, May 6, 2014 at 12:42 PM, Roman Sokolkov rsokol...@mirantis.com wrote:
> My questions are:
> * Are there any Fuel-specific ideas for solving or working around this issue?
> * Is there a quick fix for this in 4.1, such as adjusting TCP keep-alive
>   timeouts?
>
> --
> Mailing list: https://launchpad.net/~fuel-dev
> Post to     : fuel-...@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~fuel-dev
> More help   : https://help.launchpad.net/ListHelp

--
Andrew
Mirantis
Ceph community
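The logging change described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the logger names `amqp` and `sqlalchemy.engine` are guesses at what the affected libraries log under in a given deployment, and both helper functions are hypothetical. How nova itself is configured for this is not shown in the thread.

```python
import logging

# Assumed logger names; adjust to whatever the deployed libraries use.
TRACED_LOGGERS = ("amqp", "sqlalchemy.engine")


def set_trace_level(names=TRACED_LOGGERS, level=logging.DEBUG):
    """Raise verbosity for the named loggers and return their old levels."""
    previous = {}
    for name in names:
        logger = logging.getLogger(name)
        previous[name] = logger.level
        logger.setLevel(level)
    return previous


def restore_levels(previous):
    """Put the loggers back to the levels saved by set_trace_level()."""
    for name, level in previous.items():
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    saved = set_trace_level()
    logging.getLogger("amqp").debug("verbose AMQP tracing is now visible")
    restore_levels(saved)
```

Saving and restoring the previous levels makes it easier to check whether the logging change itself alters the behaviour, which is the open question in the message above.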
Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
On 05/06/2014 10:42 PM, Roman Sokolkov wrote:
> My questions are:
> * Are there any Fuel-specific ideas for solving or working around this issue?
> * Is there a quick fix for this in 4.1, such as adjusting TCP keep-alive
>   timeouts?

Perhaps the solution is to apply https://review.openstack.org/#/c/34949 and
check the results with rabbitmq and nova. If it is OK, we could submit a task
for the OSCI team to patch our internal repos and update the MOS packages
targeted for 4.1.1 / 5.0.

--
Best regards, Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando
Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)
On 05/07/2014 04:12 PM, Bogdan Dobrelya wrote:
> Perhaps the solution is to apply https://review.openstack.org/#/c/34949 and
> check the results with rabbitmq and nova. If it is OK, we could submit a
> task for the OSCI team to patch our internal repos and update the MOS
> packages targeted for 4.1.1 / 5.0.

Sorry, I meant to sync all Oslo patches from
https://bugs.launchpad.net/oslo.messaging/+bug/856764 for the nova packages
in MOS and check the results with rabbitmq.

--
Best regards, Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando