Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-13 Thread Bogdan Dobrelya
On 05/08/2014 02:22 PM, Bogdan Dobrelya wrote:
 On 05/06/2014 10:42 PM, Roman Sokolkov wrote:
 Hello, fuelers.

 I'm using Fuel 4.1A + Havana in HA mode.

 I constantly observe (on other deployments too) an issue with a stuck
 nova-compute service. But I think the problem is more fundamental and
 relates to HA RabbitMQ and the OpenStack AMQP driver implementation.

 Symptoms:

   * From time to time a random nova-compute is marked as XXX for a while.
   * The service itself appears to work properly. Its logs show that it
 sends status updates to conductor, but actually nothing is sent.
   * netstat shows that all connections to/from rabbit are ESTABLISHED.
   * rabbitmqctl shows that the compute.node-x queue is synced to all slaves.
   * Nothing had been broken beforehand (the rabbitmq cluster, etc.).

 Axe style solution:

   * /etc/init.d/openstack-nova-compute restart

 Here I've found a lot of interesting stuff (and solutions):

 https://bugs.launchpad.net/oslo.messaging/+bug/856764


 My questions are:

   * Are there any thoughts specific to Fuel on how to solve or work around
 this issue?
   * Is there any fast solution for this in 4.1, like adjusting TCP
 keep-alive timeouts?


 
 I submitted an issue for Fuel
 https://bugs.launchpad.net/fuel/+bug/1317488 and assigned it to Fuel
 hardening team. Feel free to update it as appropriate.

For some reason, issue #1317488 was marked as a duplicate of
https://bugs.launchpad.net/fuel/+bug/1289200 (perhaps handling sessions
that disappeared and became half-open is the generic case behind both
of them?).

The patch (I believe not the final one) was suggested here
https://review.openstack.org/#/c/93411/
Please feel free to test it on any affected environments. Any feedback
would be greatly appreciated, thank you.
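
To make the TCP keep-alive idea from the quoted question concrete, here
is a minimal sketch of enabling keep-alive probes on a plain Python
socket, so that a half-open connection is eventually noticed by the
kernel. It is an illustration only and is not taken from the patch
above: the helper name enable_keepalive and the values 30/5/3 are
assumptions, and how (or whether) the AMQP driver exposes its socket is
not covered here.

```python
import socket


def enable_keepalive(sock, idle=30, interval=5, probes=3):
    """Turn on TCP keep-alive for sock and return the options applied.

    idle, interval and probes are illustrative values, not tuned
    recommendations for RabbitMQ or nova.
    """
    applied = {}
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = 1
    # The per-socket tuning options are platform specific (available on
    # Linux), so each one is set only if this Python build exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


# Usage: no connection is needed just to set the options.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(enable_keepalive(sock))
sock.close()
```

With these example values an idle connection to a vanished peer would be
dropped after roughly idle + interval * probes = 45 seconds, instead of
staying ESTABLISHED in netstat indefinitely.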

 
 -- 
 Roman Sokolkov,
 Deployment Engineer,
 Mirantis, Inc.
 Skype rsokolkov,
 rsokol...@mirantis.com


 
 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-08 Thread Bogdan Dobrelya
I submitted an issue for Fuel
https://bugs.launchpad.net/fuel/+bug/1317488 and assigned it to Fuel
hardening team. Feel free to update it as appropriate.

 
 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando



Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-08 Thread Roman Sokolkov
Bogdan,

thank you.


-- 
Roman Sokolkov,
Deployment Engineer,
Mirantis, Inc.
Skype rsokolkov,
rsokol...@mirantis.com


Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-07 Thread Andrew Woodward
Roman,

The current stable/4.1 has some fixes that make this less likely to
occur and make recovery more likely when it does.

That said, I've done some tracing and there are some issues with
nova-conductor processing those messages. Some of the times I've seen
the compute-node be the issue, other times I've seen nova-conductor be
the issue. As of stable/4.1 I've been able to track it down to
nova-conductor. AFAICT it receives the message from nova-compute,
takes it from the queue, acks the queue, and selects the object from
the DB. However, after switching nova-compute and nova-conductor to
trace-level logging for amqp and sqlalchemy, the issue appears to stop.
I have yet to confirm whether the cluster state of rabbit changed, the
change in logging level made the difference, or something else did.
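
For anyone who wants to repeat the tracing, here is a minimal sketch of
raising the verbosity of individual loggers with the Python standard
library. The helper name apply_log_levels and the logger names "amqp"
and "sqlalchemy.engine" are assumptions for illustration; the thread
does not state which loggers were changed. The standard library has no
TRACE level, so DEBUG stands in as the most verbose built-in level.

```python
import logging


def apply_log_levels(spec):
    """Apply a comma-separated 'logger=LEVEL' spec to stdlib loggers.

    Returns a dict mapping each logger name to the numeric level set.
    """
    applied = {}
    for pair in spec.split(","):
        name, _, level_name = pair.strip().partition("=")
        # Resolve names such as "DEBUG" to logging.DEBUG; an unknown
        # level name raises AttributeError.
        level = getattr(logging, level_name.upper())
        logging.getLogger(name).setLevel(level)
        applied[name] = level
    return applied


# Hypothetical spec: the logger names are assumed, not from the thread.
apply_log_levels("amqp=DEBUG,sqlalchemy.engine=DEBUG")
```

Setting levels per logger keeps the extra output limited to the
messaging and database layers rather than the whole service.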




-- 
Andrew
Mirantis
Ceph community



Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-07 Thread Bogdan Dobrelya
Perhaps the solution is to apply https://review.openstack.org/#/c/34949
and check the results with rabbitmq and nova. If it is OK, we could submit
a task for the OSCI team to patch our internal repos and update the MOS
packages targeted for 4.1.1 / 5.0.

 
 
 
 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando



Re: [openstack-dev] [Fuel-dev] [Fuel][RabbitMQ] nova-compute stuck for a while (AMQP)

2014-05-07 Thread Bogdan Dobrelya
On 05/07/2014 04:12 PM, Bogdan Dobrelya wrote:
 Perhaps the solution is to apply https://review.openstack.org/#/c/34949
 and check the results with rabbitmq and nova. If it is OK, we could submit
 a task for the OSCI team to patch our internal repos and update the MOS
 packages targeted for 4.1.1 / 5.0.

Sorry, I meant to sync all Oslo patches from
https://bugs.launchpad.net/oslo.messaging/+bug/856764 for the nova
packages in MOS and check the results with rabbitmq.

 




 
 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando
