Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-20 Thread Vladimir Kuklin
Actually, we are not skipping the 'Started' state - we just consider a resource
as started when beam is powered up and the rabbitmq start_app/stop_app actions
succeed. Such a node is considered a good one that can be marked as 'Master',
to which the other nodes should connect; all the cluster join/leave actions are
then handled using the multi-state notification mechanism.
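Roughly, that 'started' check boils down to something like the following
rabbitmqctl sequence (a simplified sketch only, not the actual OCF code, which
adds retries, timeouts and mnesia resets):

    # is the Erlang VM (beam) up and reachable?
    rabbitmqctl status >/dev/null 2>&1
    # can the RabbitMQ application be started and stopped cleanly,
    # while the beam process itself keeps running?
    rabbitmqctl start_app
    rabbitmqctl stop_app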

On Wed, May 20, 2015 at 5:05 AM, Andrew Beekhof abeek...@redhat.com wrote:


  On 20 May 2015, at 6:05 am, Andrew Woodward xar...@gmail.com wrote:
 
 
 
  On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com
 wrote:
 
   On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 
 zhengsh...@awcloud.com wrote:
  
   Thank you Andrew.
  
   on 2015/05/05 08:03, Andrew Beekhof wrote:
   On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com
 wrote:
  
   Hello,
   Hello, Zhou
  
   I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long
   after a power failure. I have a running HA environment, then I reset the
   power of all the machines at the same time. I observe that after reboot it
   usually takes 10 minutes for the RabbitMQ cluster to appear running in
   master-slave mode in pacemaker. If I power off all 3 controllers and only
   start 2 of them, the downtime can sometimes be as long as 20 minutes.
   Yes, this is a known issue [0]. Note, there were many bugfixes, like
   [1],[2],[3], merged for the MQ OCF script, so you may want to backport
   them as well by following the guide [4]
  
   [0] https://bugs.launchpad.net/fuel/+bug/1432603
   [1] https://review.openstack.org/#/c/175460/
   [2] https://review.openstack.org/#/c/175457/
   [3] https://review.openstack.org/#/c/175371/
   [4] https://review.openstack.org/#/c/170476/
   Is there a reason you’re using a custom OCF script instead of the
 upstream[a] one?
   Please have a chat with David (the maintainer, in CC) if there is
 something you believe is wrong with it.
  
   [a]
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
  
   I'm using the OCF script from the Fuel project, specifically from the
   6.0 stable branch [alpha].
 
  Ah, I’m still learning who is who... I thought you were part of that
 project :-)
 
  
   Comparing with the upstream OCF code, the main difference is that the Fuel
   RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
   bookkeeping, for example, blocking client access when the RabbitMQ cluster
   is not ready. I believe the upstream OCF should be OK to use as well after
   reading the code, but it might not fit into the Fuel project. As far as I
   have tested, the Fuel OCF script is good, except that sometimes the full
   reassembly time is long, and as I found out, that is mostly because the
   Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
   resource, as I mentioned in the previous emails.
  
   Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
   master-slave RabbitMQ.
 
  That would be good to know.
  Browsing the agent, promote seems to be a no-op if rabbit is already
 running.
 
 
  As to the master/slave question: it is due to how the OCF script is
 structured to deal with rabbit's poor ability to handle itself in some
 scenarios. Hopefully the state transition diagram [5] is enough to clarify
 what's going on.
 
  [5] http://goo.gl/PPNrw7

 Not really.
 It seems to be under the impression you can skip started and go directly
 from stopped to master.




-- 
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
35bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com http://www.mirantis.ru/
www.mirantis.ru
vkuk...@mirantis.com


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-19 Thread Andrew Beekhof

 On 20 May 2015, at 6:05 am, Andrew Woodward xar...@gmail.com wrote:
 
 
 
 On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com wrote:
 
  On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
  wrote:
 
  Thank you Andrew.
 
  on 2015/05/05 08:03, Andrew Beekhof wrote:
  On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com 
  wrote:
 
  Hello,
  Hello, Zhou
 
  I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
  a power failure. I have a running HA environment, then I reset the power of
  all the machines at the same time. I observe that after reboot it
  usually takes 10 minutes for the RabbitMQ cluster to appear running in
  master-slave mode in pacemaker. If I power off all 3 controllers and
  only start 2 of them, the downtime can sometimes be as long as 20 minutes.
  Yes, this is a known issue [0]. Note, there were many bugfixes, like
  [1],[2],[3], merged for the MQ OCF script, so you may want to backport
  them as well by following the guide [4]
 
  [0] https://bugs.launchpad.net/fuel/+bug/1432603
  [1] https://review.openstack.org/#/c/175460/
  [2] https://review.openstack.org/#/c/175457/
  [3] https://review.openstack.org/#/c/175371/
  [4] https://review.openstack.org/#/c/170476/
  Is there a reason you’re using a custom OCF script instead of the 
  upstream[a] one?
  Please have a chat with David (the maintainer, in CC) if there is 
  something you believe is wrong with it.
 
  [a] 
  https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
 
  I'm using the OCF script from the Fuel project, specifically from the
  6.0 stable branch [alpha].
 
 Ah, I’m still learning who is who... I thought you were part of that project
 :-)
 
 
  Comparing with the upstream OCF code, the main difference is that the Fuel
  RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
  bookkeeping, for example, blocking client access when the RabbitMQ cluster
  is not ready. I believe the upstream OCF should be OK to use as well after
  reading the code, but it might not fit into the Fuel project. As far as I
  have tested, the Fuel OCF script is good, except that sometimes the full
  reassembly time is long, and as I found out, that is mostly because the
  Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
  resource, as I mentioned in the previous emails.
 
  Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
  master-slave RabbitMQ.
 
 That would be good to know.
 Browsing the agent, promote seems to be a no-op if rabbit is already running.
 
 
 As to the master/slave question: it is due to how the OCF script is structured
 to deal with rabbit's poor ability to handle itself in some scenarios. Hopefully
 the state transition diagram [5] is enough to clarify what's going on.
 
 [5] http://goo.gl/PPNrw7

Not really.
It seems to be under the impression you can skip started and go directly from 
stopped to master.


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-19 Thread Andrew Woodward
On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com wrote:


  On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 
 zhengsh...@awcloud.com wrote:
 
  Thank you Andrew.
 
  on 2015/05/05 08:03, Andrew Beekhof wrote:
  On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com
 wrote:
 
  Hello,
  Hello, Zhou
 
  I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
  a power failure. I have a running HA environment, then I reset the power of
  all the machines at the same time. I observe that after reboot it
  usually takes 10 minutes for the RabbitMQ cluster to appear running in
  master-slave mode in pacemaker. If I power off all 3 controllers
  and only start 2 of them, the downtime can sometimes be as long as 20
  minutes.
  Yes, this is a known issue [0]. Note, there were many bugfixes, like
  [1],[2],[3], merged for the MQ OCF script, so you may want to backport
  them as well by following the guide [4]
 
  [0] https://bugs.launchpad.net/fuel/+bug/1432603
  [1] https://review.openstack.org/#/c/175460/
  [2] https://review.openstack.org/#/c/175457/
  [3] https://review.openstack.org/#/c/175371/
  [4] https://review.openstack.org/#/c/170476/
  Is there a reason you’re using a custom OCF script instead of the
 upstream[a] one?
  Please have a chat with David (the maintainer, in CC) if there is
 something you believe is wrong with it.
 
  [a]
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
 
  I'm using the OCF script from the Fuel project, specifically from the
  6.0 stable branch [alpha].

 Ah, I’m still learning who is who... I thought you were part of that
 project :-)

 
  Comparing with the upstream OCF code, the main difference is that the Fuel
  RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
  bookkeeping, for example, blocking client access when the RabbitMQ cluster
  is not ready. I believe the upstream OCF should be OK to use as well after
  reading the code, but it might not fit into the Fuel project. As far as I
  have tested, the Fuel OCF script is good, except that sometimes the full
  reassembly time is long, and as I found out, that is mostly because the
  Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
  resource, as I mentioned in the previous emails.
 
  Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
  master-slave RabbitMQ.

 That would be good to know.
 Browsing the agent, promote seems to be a no-op if rabbit is already
 running.


As to the master/slave question: it is due to how the OCF script is structured
to deal with rabbit's poor ability to handle itself in some scenarios.
Hopefully the state transition diagram [5] is enough to clarify what's
going on.

[5] http://goo.gl/PPNrw7


  I see Vladimir and Sergey worked on the original
  Fuel blueprint, RabbitMQ cluster [beta].
 
  [alpha]
 
 https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
  [beta]
 
 https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker
 
  I did a little investigation and found out there are some possible
 causes.
 
  1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
  Pacemaker
 
  The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
  MySQL-wss fails to start after a power failure, and pacemaker waits 475s
  before retrying to start it. The problem is that pacemaker divides
  resource state transitions into batches. Since RabbitMQ is a master-slave
  resource, I assume that starting all the slaves and promoting the master
  are put into two different batches. If, unfortunately, starting all the
  RabbitMQ slaves is put in the same batch as starting MySQL, then even if
  the RabbitMQ slaves and all other resources are ready, pacemaker will not
  continue but just wait for the MySQL timeout.
  Could you please elaborate on what the same/different batches are for MQ
  and DB? Note, there are MQ clustering logic flow charts available here
  [5] and we're planning to release a dedicated technical bulletin for
  this.
 
  [5] http://goo.gl/PPNrw7
 
  I can reproduce this by hard powering off all the controllers and starting
  them again. It's more likely to trigger a MySQL failure this way. Then
  I observe that if there is one cloned mysql instance not starting, the
  whole pacemaker cluster gets stuck and does not emit any log. On the
  host of the failed instance, I can see a mysql resource agent process
  calling the sleep command. If I kill that process, pacemaker comes
  back alive and the RabbitMQ master gets promoted. In fact this long
  timeout blocks every resource state transition in pacemaker.
 
  This may be a known problem of pacemaker and there are some discussions
  on the Linux-HA mailing list [2]. It might not be fixed in the near
  future.
  It seems that in general it's bad to have long timeouts in state
  transition
  actions (start/stop/promote/demote). There may be another way to
  implement 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-07 Thread Andrew Beekhof

 On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 
 Thank you Andrew.
 
 on 2015/05/05 08:03, Andrew Beekhof wrote:
 On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:
 
 Hello,
 Hello, Zhou
 
 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
 a power failure. I have a running HA environment, then I reset the power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for the RabbitMQ cluster to appear running in
 master-slave mode in pacemaker. If I power off all 3 controllers and
 only start 2 of them, the downtime can sometimes be as long as 20 minutes.
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for the MQ OCF script, so you may want to backport
 them as well by following the guide [4]
 
 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/
 Is there a reason you’re using a custom OCF script instead of the 
 upstream[a] one?
 Please have a chat with David (the maintainer, in CC) if there is something 
 you believe is wrong with it.
 
 [a] 
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster
 
 I'm using the OCF script from the Fuel project, specifically from the
 6.0 stable branch [alpha].

Ah, I’m still learning who is who... I thought you were part of that project
:-)

 
 Comparing with the upstream OCF code, the main difference is that the Fuel
 RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
 bookkeeping, for example, blocking client access when the RabbitMQ cluster
 is not ready. I believe the upstream OCF should be OK to use as well after
 reading the code, but it might not fit into the Fuel project. As far as I
 have tested, the Fuel OCF script is good, except that sometimes the full
 reassembly time is long, and as I found out, that is mostly because the
 Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
 resource, as I mentioned in the previous emails.
 
 Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
 master-slave RabbitMQ.

That would be good to know.
Browsing the agent, promote seems to be a no-op if rabbit is already running.

 I see Vladimir and Sergey worked on the original
 Fuel blueprint, RabbitMQ cluster [beta].
 
 [alpha]
 https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
 [beta]
 https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker
 
 I did a little investigation and found out there are some possible causes.
 
 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
 Pacemaker
 
 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker waits 475s
 before retrying to start it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is a master-slave
 resource, I assume that starting all the slaves and promoting the master
 are put into two different batches. If, unfortunately, starting all the
 RabbitMQ slaves is put in the same batch as starting MySQL, then even if
 the RabbitMQ slaves and all other resources are ready, pacemaker will not
 continue but just wait for the MySQL timeout.
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there are MQ clustering logic flow charts available here
 [5] and we're planning to release a dedicated technical bulletin for this.
 
 [5] http://goo.gl/PPNrw7
 
 I can reproduce this by hard powering off all the controllers and starting
 them again. It's more likely to trigger a MySQL failure this way. Then
 I observe that if there is one cloned mysql instance not starting, the
 whole pacemaker cluster gets stuck and does not emit any log. On the
 host of the failed instance, I can see a mysql resource agent process
 calling the sleep command. If I kill that process, pacemaker comes
 back alive and the RabbitMQ master gets promoted. In fact this long timeout
 blocks every resource state transition in pacemaker.
 
 This may be a known problem of pacemaker and there are some discussions
 on the Linux-HA mailing list [2]. It might not be fixed in the near future.
 It seems that in general it's bad to have long timeouts in state transition
 actions (start/stop/promote/demote). There may be another way to
 implement the MySQL-wss resource agent: use a short start timeout and
 monitor the wss cluster state using the monitor action.
 This is very interesting, thank you! I believe all commands in the MySQL RA
 OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL,
 as we did for the MQ RA OCF. And there should not be any sleep calls. I
 created a bug for this [6].
 
 [6] https://bugs.launchpad.net/fuel/+bug/1449542
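For illustration, the kind of wrapping described above looks roughly like this
(a sketch only; the command and timeout values are placeholders, not the
actual resource agent code):

    # give the wrapped command 30s, send SIGTERM, escalate to SIGKILL 10s later
    timeout -k 10 -s TERM 30 mysql -e 'SELECT 1' \
        || echo "mysql check timed out or failed" >&2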
 
 I also 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-07 Thread Andrew Beekhof

 On 5 May 2015, at 9:30 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 
 Thank you Andrew. Sorry for misspelling your name in the previous email.
 
 on 2015/05/05 14:25, Andrew Beekhof wrote:
 On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 
 Thank you Bogdan for clarifying the pacemaker promotion process for me.
 
 on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 
 zhengsh...@awcloud.com wrote:
 [snip]
 
 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a "batch-limit: 30" in the output of
 "pcs property list --all". The official pacemaker documentation explains
 it as "The number of jobs that the TE is allowed to execute in parallel".
 From my understanding, pacemaker maintains cluster states, and when we
 start/stop/promote/demote a resource, it triggers a state transition.
 Pacemaker puts as many transition jobs as possible into a batch and
 processes them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:
 
  
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
 
 There is a more complex one which includes promotion and demotion on the 
 next page.
 
  The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between 
 them (eg. rsc{1,2,3} in the above example)  
 
 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others 
 complete.
 
 Processing of the graph stops the moment any action returns a value that 
 was not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 So can I infer the following statement? In a big cluster with many
 resources, chances are some resource agent actions return unexpected
 values,
 The size of the cluster shouldn’t increase the chance of this happening 
 unless you’ve set the timeouts too aggressively.
 
  If there are many types of resource agents, and any one of them is not
 well written, it might cause trouble, right?

Yes, but really only for the things that depend on it.

For example if resources B, C, D, E all depend (in some way) on A, then their 
startup is going to be delayed.
But F, G, H and J will be able to start while we wait around for A to time out.

 
  and if any of the in-flight action timeouts is long, it would
 block pacemaker from recalculating a new transition graph?
  Yes, but it's actually an argument for making the timeouts longer, not
 shorter.
  Setting the timeouts too aggressively actually increases downtime because of
 all the extra delays and recovery it induces.
  So set them to be long enough that there is unquestionably a problem if you
 hit them.
 
 But we absolutely recognise that starting/stopping a database can take a 
 very long time comparatively and that it shouldn’t block recovery of other 
 unrelated services.
 I would expect to see this land in Pacemaker 1.1.14
 
 It will be great to see this in Pacemaker 1.1.14. From my experience
 using Pacemaker, I think customized resource agents are possibly the
 weakest part.

This is why we encourage people wanting new agents to get involved with the 
upstream resource-agents project :-)

 This feature should improve the handling for resource
 action timeouts.
 
 I see the
 current batch-limit is 30 and I tried to increase it to 100, but did not
 help.
  Correct.  It only puts an upper limit on the number of in-flight actions;
 actions still need to wait for all their dependencies to complete before
 executing.
 
 I'm sure that the cloned MySQL Galera resource is not related to
 master-slave RabbitMQ resource. I don't find any dependency, order or
 rule connecting them in the cluster deployed by Fuel [1].
 In general it should not have needed to wait, but if you send me a 
 crm_report covering the period you’re talking about I’ll be able to comment 
 specifically about the behaviour you saw.
 
 You are very nice, thank you. I uploaded the file generated by
 crm_report to google drive.
 
 https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Hmmm... there are no logs included here for some reason.
I suspect it's a bug on my part; can you apply this patch to report.collector on
the machine you’re running crm_report from and retry?

   https://github.com/ClusterLabs/pacemaker/commit/96427ec
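For reference, regenerating the report with an explicit time window usually
looks something like the following (option names can differ slightly between
pacemaker versions; the dates and output name are only an example):

    crm_report -f "2015-05-04 00:00:00" -t "2015-05-05 00:00:00" /tmp/rabbit-recovery-report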


 
  Is there anything I can do to make sure all the resource actions return
 expected values during a full reassembly?
 In general, if we say ‘start’, do your best to start or return ‘0’ if you 
 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-07 Thread Andrew Beekhof

 On 5 May 2015, at 7:52 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:
 
 On 05.05.2015 04:32, Andrew Beekhof wrote:
 
 
 [snip]
 
 
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:
 
   
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
 
 There is a more complex one which includes promotion and demotion on the 
 next page.
 
  The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between them 
 (eg. rsc{1,2,3} in the above example)  
 
 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.
 
 Processing of the graph stops the moment any action returns a value that was 
 not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 
 
 First we do a non-recurring monitor (*_monitor_0) to check what state the 
 resource is in.
  We can’t assume it's off because a) we might have crashed, b) the admin might 
 have accidentally configured it to start at boot or c) the admin may have 
 asked us to re-check everything.
 
 
 Also important to know, the order of actions is:

I should clarify something here:

   s/actions is/actions for each resource is/

 
 1. any necessary demotions
 2. any necessary stops
 3. any necessary starts
 4. any necessary promotions
 
 
 
 Thank you for explaining this, Andrew!
 
 So, in the context of the given two example DB(MySQL) and
 messaging(RabbitMQ) resources:
 
  The problem is that pacemaker can only promote a resource after it
  detects the resource is started. During a full reassembly, in the first
  transition batch, pacemaker starts all the resources including MySQL and
  RabbitMQ. Pacemaker issues resource agent start invocations in parallel
  and reaps the results.
  For a multi-state resource agent like RabbitMQ, pacemaker needs the
  start result reported in the first batch; then the transition engine and
  policy engine decide whether it has to retry starting or promote, and put
  this new transition job into a new batch.
 
  So, for the given example, it looks like we currently have:
  _batch start_
  ...
  3. DB and messaging resources start in one batch

Since there is no dependency between them, yes.

 4. messaging resource promote blocked by the step 3 completion
 _batch end_

Not quite, I wasn’t as clear as I could have been in my previous email.

We won't promote Rabbit instances until they have all been started.
However we don’t need to wait for all the DBs to finish starting (again,
because there is no dependency between them) before we begin promoting Rabbit.

So a single transition that did this is totally possible:

t0.  Begin transition
t1.  Rabbit start node1(begin)
t2.  DB start node 3   (begin)
t3.  DB start node 2   (begin)
t4.  Rabbit start node2(begin)
t5.  Rabbit start node3(begin)
t6.  DB start node 1   (begin)
t7.  Rabbit start node2(complete)
t8.  Rabbit start node1(complete)
t9.  DB start node 3   (complete)
t10. Rabbit start node3(complete)
t11. Rabbit promote node 1 (begin)
t12. Rabbit promote node 3 (begin)
t13. Rabbit promote node 2 (begin)
... etc etc ...

For something like cinder, however, these are some of the dependencies we define:

pcs constraint order start keystone-clone then cinder-api-clone
pcs constraint order start cinder-api-clone then cinder-scheduler-clone
pcs constraint order start galera-master then keystone-clone

So first all the galera instances must be started. Then we can begin to promote 
some.
Once all the promotions complete, then we can start the keystone instances.
Once all the keystone instances are up, then we can bring up the cinder API 
instances, which allows us to start the scheduler, etc etc.

And assuming nothing fails, this can all happen in one transition.

Bottom line: Pacemaker will do as much as it can as soon as it can.  
The only restrictions are ordering constraints you specify, the batch-limit, 
and each master/slave (or clone) resource’s _internal_ 
demote-stop-start-promote ordering.
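For comparison, the kind of explicit ordering Bogdan asks about below could be
expressed in the same pcs style; the resource names here are assumptions about
a Fuel-like deployment, not taken from this thread:

    # hypothetical: promote the RabbitMQ master before starting the Galera clone
    pcs constraint order promote master_p_rabbitmq-server then start clone_p_mysql
    # or the other way around, if the DB should settle first
    pcs constraint order start clone_p_mysql then promote master_p_rabbitmq-server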

Am I making it better or worse?

 
  Does this mean that an artificial ordering constraint between DB and
  messaging could help them get into separate transition batches, like:
 
 ...
 3. messaging multistate clone resource start
 4. messaging multistate clone resource promote
 _batch end_
 
 _next batch start_
 ...
 3. DB simple clone resource start
 
 ?
 
 -- 
 Best regards,
 Bogdan Dobrelya,
 Skype #bogdando_at_yahoo.com
 Irc #bogdando
 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-05 Thread Zhou Zheng Sheng / 周征晟
Thank you Andrew. Sorry for misspelling your name in the previous email.

on 2015/05/05 14:25, Andrew Beekhof wrote:
 On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:

 Thank you Bogdan for clarifying the pacemaker promotion process for me.

 on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 
 zhengsh...@awcloud.com wrote:
 [snip]

 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a "batch-limit: 30" in the output of
 "pcs property list --all". The official pacemaker documentation explains
 it as "The number of jobs that the TE is allowed to execute in parallel".
 From my understanding, pacemaker maintains cluster states, and when we
 start/stop/promote/demote a resource, it triggers a state transition.
 Pacemaker puts as many transition jobs as possible into a batch and
 processes them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:

   
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html

 There is a more complex one which includes promotion and demotion on the 
 next page.

 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between 
 them (eg. rsc{1,2,3} in the above example)  

 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.

 Processing of the graph stops the moment any action returns a value that 
 was not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 So can I infer the following statement? In a big cluster with many
 resources, chances are some resource agent actions return unexpected
 values,
 The size of the cluster shouldn’t increase the chance of this happening 
 unless you’ve set the timeouts too aggressively.

If there are many types of resource agents, and any one of them is not
well written, it might cause trouble, right?

  and if any of the in-flight action timeouts is long, it would
  block pacemaker from recalculating a new transition graph?
  Yes, but it's actually an argument for making the timeouts longer, not shorter.
 Setting the timeouts too aggressively actually increases downtime because of 
 all the extra delays and recovery it induces.
 So set them to be long enough that there is unquestionably a problem if you 
 hit them.

 But we absolutely recognise that starting/stopping a database can take a very 
 long time comparatively and that it shouldn’t block recovery of other 
 unrelated services.
 I would expect to see this land in Pacemaker 1.1.14

It will be great to see this in Pacemaker 1.1.14. From my experience
using Pacemaker, I think customized resource agents are possibly the
weakest part. This feature should improve the handling for resource
action timeouts.

 I see the
 current batch-limit is 30 and I tried to increase it to 100, but did not
 help.
 Correct.  It only puts an upper limit on the number of in-flight actions;
 actions still need to wait for all their dependencies to complete before
 executing.

 I'm sure that the cloned MySQL Galera resource is not related to
 master-slave RabbitMQ resource. I don't find any dependency, order or
 rule connecting them in the cluster deployed by Fuel [1].
 In general it should not have needed to wait, but if you send me a crm_report 
 covering the period you’re talking about I’ll be able to comment specifically 
 about the behaviour you saw.

You are very nice, thank you. I uploaded the file generated by
crm_report to google drive.

https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

 Is there anything I can do to make sure all the resource actions return
 expected values during a full reassembly?
 In general, if we say ‘start’, do your best to start or return ‘0’ if you 
 already were started.
 Likewise for stop.

 Otherwise its really specific to your agent.
 For example an IP resource just needs to add itself to an interface - it can't
 do much differently; if it times out then the system must be very, very busy.

 The only other thing I would say is:
 - avoid blocking calls where possible
 - have empathy for the machine (do as little as is needed)


+1 for the empathy :)
 Is it because node-1 and node-2
 happen to boot faster than node-3 and form a cluster, and when node-3 joins,
 it triggers a new state transition? Or maybe because some resources are
 already started, so pacemaker needs to stop them first?
 We only stop them if they shouldn’t yet be running (ie. a colocation or
 ordering dependency 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-05 Thread Andrew Beekhof

 On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 
 Thank you Bogdan for clarifying the pacemaker promotion process for me.
 
 on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 [snip]
 
 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a "batch-limit: 30" in the output of
 "pcs property list --all". The official pacemaker documentation explains
 it as "The number of jobs that the TE is allowed to execute in parallel".
 From my understanding, pacemaker maintains cluster states, and when we
 start/stop/promote/demote a resource, it triggers a state transition.
 Pacemaker puts as many transition jobs as possible into a batch and
 processes them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:
 
   
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
 
 There is a more complex one which includes promotion and demotion on the 
 next page.
 
 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between them 
 (eg. rsc{1,2,3} in the above example)  
 
 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.
 
 Processing of the graph stops the moment any action returns a value that was 
 not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 So can I infer the following statement? In a big cluster with many
 resources, chances are some resource agent actions return unexpected
 values,

The size of the cluster shouldn’t increase the chance of this happening unless 
you’ve set the timeouts too aggressively.

 and if any of the in-flight action timeouts is long, it would
 block pacemaker from recalculating a new transition graph?

Yes, but it's actually an argument for making the timeouts longer, not shorter.
Setting the timeouts too aggressively actually increases downtime because of 
all the extra delays and recovery it induces.
So set them to be long enough that there is unquestionably a problem if you hit 
them.

But we absolutely recognise that starting/stopping a database can take a very 
long time comparatively and that it shouldn’t block recovery of other unrelated 
services.
I would expect to see this land in Pacemaker 1.1.14


 I see the
 current batch-limit is 30 and I tried to increase it to 100, but did not
 help.

Correct.  It only puts an upper limit on the number of in-flight actions;
actions still need to wait for all their dependencies to complete before
executing.

 I'm sure that the cloned MySQL Galera resource is not related to
 master-slave RabbitMQ resource. I don't find any dependency, order or
 rule connecting them in the cluster deployed by Fuel [1].

In general it should not have needed to wait, but if you send me a crm_report 
covering the period you’re talking about I’ll be able to comment specifically 
about the behaviour you saw.

 
 Is there anything I can do to make sure all the resource actions return
 expected values during a full reassembly?

In general, if we say ‘start’, do your best to start, or return ‘0’ if you
were already started.
Likewise for stop.

Otherwise its really specific to your agent.
For example an IP resource just needs to add itself to an interface - it can't
do much differently; if it times out then the system must be very, very busy.

The only other thing I would say is:
- avoid blocking calls where possible
- have empathy for the machine (do as little as is needed)

 Is it because node-1 and node-2
 happen to boot faster than node-3 and form a cluster, and when node-3 joins,
 it triggers a new state transition? Or maybe because some resources are
 already started, so pacemaker needs to stop them first?

We only stop them if they shouldn’t yet be running (ie. a colocation or
ordering dependency is not yet started also).


 Does setting
 default-resource-stickiness to 1 help?

From 0 or INFINITY?

 
 I also tried crm history XXX commands in a live and correct cluster,

I’m not familiar with that tool anymore.

 but didn't find much information. I can see there are many log entries
 like run_graph: Transition 7108  Next I'll inspect the pacemaker
 log to see which resource action returns the unexpected value or which
 thing triggers new state transition.
 
 [1] http://paste.openstack.org/show/214919/

I’d not recommend mixing the two CLI tools.
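A pacemaker-native way to replay one of those run_graph transitions is
crm_simulate against the saved policy-engine input (the path and file number
below are only an example; the pe-input number does not necessarily match the
transition number):

    # show scores and the actions pacemaker decided on for a saved transition input
    crm_simulate -s -x /var/lib/pacemaker/pengine/pe-input-7108.bz2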

 
 The problem is that pacemaker can only promote a resource after it
 detects the resource is 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-05 Thread Bogdan Dobrelya
On 05.05.2015 04:32, Andrew Beekhof wrote:
 
 
 [snip]
 
 
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:
 

 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html
 
 There is a more complex one which includes promotion and demotion on the next 
 page.
 
 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between them 
 (eg. rsc{1,2,3} in the above example)  
 
 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.
 
 Processing of the graph stops the moment any action returns a value that was 
 not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 
 
 First we do a non-recurring monitor (*_monitor_0) to check what state the 
 resource is in.
 We can’t assume it's off because a) we might have crashed, b) the admin might 
 have accidentally configured it to start at boot or c) the admin may have 
 asked us to re-check everything.
 
 
 Also important to know, the order of actions is:
 
 1. any necessary demotions
 2. any necessary stops
 3. any necessary starts
 4. any necessary promotions
 
 

Thank you for explaining this, Andrew!

So, in the context of the given two example DB(MySQL) and
messaging(RabbitMQ) resources:

The problem is that pacemaker can only promote a resource after it
detects the resource is started. During a full reassembly, in the first
transition batch, pacemaker starts all the resources including MySQL and
RabbitMQ. Pacemaker issues resource agent start invocations in parallel
and reaps the results.
For a multi-state resource agent like RabbitMQ, pacemaker needs the
start result reported in the first batch; then the transition engine and
policy engine decide whether it has to retry starting or promote, and put
this new transition job into a new batch.

So, for the given example, it looks like we currently have:
_batch start_
...
3. DB and messaging resources start in one batch
4. messaging resource promote blocked by the step 3 completion
_batch end_

Does this mean that an artificial ordering constraint between DB and
messaging could help them get into separate transition batches, like:

...
3. messaging multistate clone resource start
4. messaging multistate clone resource promote
_batch end_

_next batch start_
...
3. DB simple clone resource start

?

-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando



Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Zhou Zheng Sheng / 周征晟
Thank you Andrew.

on 2015/05/05 08:03, Andrew Beekhof wrote:
 On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:

 Hello,
 Hello, Zhou

 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
 a power failure. I have a running HA environment, then I reset the power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for the RabbitMQ cluster to appear running in
 master-slave mode in pacemaker. If I power off all 3 controllers and
 only start 2 of them, the downtime can sometimes be as long as 20 minutes.
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for the MQ OCF script, so you may want to backport
 them as well by following the guide [4]

 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/
 Is there a reason you’re using a custom OCF script instead of the upstream[a] 
 one?
 Please have a chat with David (the maintainer, in CC) if there is something 
 you believe is wrong with it.

 [a] 
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

I'm using the OCF script from the Fuel project, specifically from the
6.0 stable branch [alpha].

Comparing with the upstream OCF code, the main difference is that the Fuel
RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more
bookkeeping, for example, blocking client access when the RabbitMQ cluster
is not ready. I believe the upstream OCF should be OK to use as well after
reading the code, but it might not fit into the Fuel project. As far as I
have tested, the Fuel OCF script is good, except that sometimes the full
reassembly time is long, and as I found out, that is mostly because the
Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ
resource, as I mentioned in the previous emails.

Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
master-slave RabbitMQ. I see Vladimir and Sergey worked on the original
Fuel blueprint, RabbitMQ cluster [beta].

[alpha]
https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
[beta]
https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker

 I did a little investigation and found out there are some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker waits 475s
 before retrying to start it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is a master-slave
 resource, I assume that starting all the slaves and promoting the master
 are put into two different batches. If, unfortunately, starting all the
 RabbitMQ slaves is put in the same batch as starting MySQL, then even if
 the RabbitMQ slaves and all other resources are ready, pacemaker will not
 continue but just wait for the MySQL timeout.
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there are MQ clustering logic flow charts available here
 [5] and we're planning to release a dedicated technical bulletin for this.

 [5] http://goo.gl/PPNrw7

 I can reproduce this by hard powering off all the controllers and starting
 them again. It's more likely to trigger a MySQL failure this way. Then
 I observe that if there is one cloned mysql instance not starting, the
 whole pacemaker cluster gets stuck and does not emit any log. On the
 host of the failed instance, I can see a mysql resource agent process
 calling the sleep command. If I kill that process, pacemaker comes
 back alive and the RabbitMQ master gets promoted. In fact this long timeout
 blocks every resource state transition in pacemaker.

 This may be a known problem of pacemaker and there are some discussions
 on the Linux-HA mailing list [2]. It might not be fixed in the near future.
 It seems that in general it's bad to have long timeouts in state transition
 actions (start/stop/promote/demote). There may be another way to
 implement the MySQL-wss resource agent: use a short start timeout and
 monitor the wss cluster state using the monitor action.
 This is very interesting, thank you! I believe all commands in the MySQL RA
 OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL,
 as we did for the MQ RA OCF. And there should not be any sleep calls. I
 created a bug for this [6].

 [6] https://bugs.launchpad.net/fuel/+bug/1449542

 I also found a fix to improve the MySQL start timeout [3]. It shortens the
 timeout to 300s. At the time of sending this email, I cannot find it in the
 stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
 stable/6.0?

 [1] https://bugs.launchpad.net/fuel/+bug/1441885
 [2] 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Zhou Zheng Sheng / 周征晟
Thank you Bogdan for clarifying the pacemaker promotion process for me.

on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 [snip]

 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a "batch-limit: 30" in the output of
 "pcs property list --all". The official pacemaker documentation explains
 it as "The number of jobs that the TE is allowed to execute in parallel".
 From my understanding, pacemaker maintains cluster states, and when we
 start/stop/promote/demote a resource, it triggers a state transition.
 Pacemaker puts as many transition jobs as possible into a batch and
 processes them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:


 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html

 There is a more complex one which includes promotion and demotion on the next 
 page.

 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between them 
 (eg. rsc{1,2,3} in the above example)  

 So in the above example, if batch-limit = 3, the monitor_0 actions will 
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.

 Processing of the graph stops the moment any action returns a value that was 
 not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
So can I infer the following statement? In a big cluster with many
resources, chances are some resource agent actions return unexpected
values, and if any of the in-flight action timeouts is long, it would
block pacemaker from recalculating a new transition graph? I see the
current batch-limit is 30 and I tried to increase it to 100, but it did not
help. I'm sure that the cloned MySQL Galera resource is not related to
the master-slave RabbitMQ resource. I don't find any dependency, order, or
rule connecting them in the cluster deployed by Fuel [1].

Is there anything I can do to make sure all the resource actions return
expected values during a full reassembly? Is it because node-1 and node-2
happen to boot faster than node-3 and form a cluster, and when node-3 joins,
it triggers a new state transition? Or maybe because some resources are
already started, so pacemaker needs to stop them first? Does setting
default-resource-stickiness to 1 help?

I also tried crm history XXX commands in a live and correct cluster,
but didn't find much information. I can see there are many log entries
like run_graph: Transition 7108  Next I'll inspect the pacemaker
log to see which resource action returns the unexpected value or which
thing triggers new state transition.

[1] http://paste.openstack.org/show/214919/

 The problem is that pacemaker can only promote a resource after it
 detects the resource is started.
 First we do a non-recurring monitor (*_monitor_0) to check what state the 
 resource is in.
 We can’t assume its off because a) we might have crashed, b) the admin might 
 have accidentally configured it to start at boot or c) the admin may have 
 asked us to re-check everything.

 During a full reassembly, in the first
 transition batch, pacemaker starts all the resources including MySQL and
 RabbitMQ. Pacemaker issues resource agent start invocations in parallel
 and reaps the results.

 For a multi-state resource agent like RabbitMQ, pacemaker needs the
 start result reported in the first batch; then the transition engine and
 policy engine decide whether it has to retry starting or promote, and put
 this new transition job into a new batch.
 Also important to know, the order of actions is:

 1. any necessary demotions
 2. any necessary stops
 3. any necessary starts
 4. any necessary promotions




-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.






Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Andrew Beekhof

 On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:
 
 Hello,
 
 Hello, Zhou
 
 
 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after
 a power failure. I have a running HA environment, then I reset the power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for the RabbitMQ cluster to appear running in
 master-slave mode in pacemaker. If I power off all 3 controllers and
 only start 2 of them, the downtime can sometimes be as long as 20 minutes.
 
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for the MQ OCF script, so you may want to backport
 them as well by following the guide [4]
 
 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/

Is there a reason you’re using a custom OCF script instead of the upstream[a] 
one?
Please have a chat with David (the maintainer, in CC) if there is something you 
believe is wrong with it.

[a] 
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

 
 
 I did a little investigation and found out there are some possible causes.
 
 1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in
 Pacemaker
 
 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker waits 475s
 before retrying to start it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is a master-slave
 resource, I assume that starting all the slaves and promoting the master
 are put into two different batches. If, unfortunately, starting all the
 RabbitMQ slaves is put in the same batch as starting MySQL, then even if
 the RabbitMQ slaves and all other resources are ready, pacemaker will not
 continue but just wait for the MySQL timeout.
 
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there are MQ clustering logic flow charts available here
 [5] and we're planning to release a dedicated technical bulletin for this.
 
 [5] http://goo.gl/PPNrw7
 
 
 I can reproduce this by hard powering off all the controllers and starting
 them again. It's more likely to trigger a MySQL failure this way. Then
 I observe that if there is one cloned mysql instance not starting, the
 whole pacemaker cluster gets stuck and does not emit any log. On the
 host of the failed instance, I can see a mysql resource agent process
 calling the sleep command. If I kill that process, pacemaker comes
 back alive and the RabbitMQ master gets promoted. In fact this long timeout
 blocks every resource state transition in pacemaker.
 
 This may be a known problem of pacemaker and there are some discussions
 on the Linux-HA mailing list [2]. It might not be fixed in the near future.
 It seems that in general it's bad to have long timeouts in state transition
 actions (start/stop/promote/demote). There may be another way to
 implement the MySQL-wss resource agent: use a short start timeout and
 monitor the wss cluster state using the monitor action.
 
 This is very interesting, thank you! I believe all commands in the MySQL RA
 OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL,
 as we did for the MQ RA OCF. And there should not be any sleep calls. I
 created a bug for this [6].
 
 [6] https://bugs.launchpad.net/fuel/+bug/1449542
 
 
 I also found a fix to improve the MySQL start timeout [3]. It shortens the
 timeout to 300s. At the time of sending this email, I cannot find it in the
 stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
 stable/6.0?
 
 [1] https://bugs.launchpad.net/fuel/+bug/1441885
 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
 [3] https://review.openstack.org/#/c/171333/
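On an already-deployed cluster, a similar change can probably be applied by
hand with something like the following (p_mysql is the resource name used
earlier in this thread; exact pcs syntax may vary by version):

    # shorten the start timeout of the mysql resource from 475s to 300s
    pcs resource update p_mysql op start timeout=300s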
 
 
 2. RabbitMQ Resource Agent Breaks Existing Cluster
 
 Reading the code of the RabbitMQ resource agent, I find it does the
 following to start the RabbitMQ master-slave cluster.
 On all the controllers:
 (1) Start Erlang beam process
 (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (3) Stop RabbitMQ App but do not stop the beam process
 
 Then in pacemaker, all the RabbitMQ instances are in slave state. After
 pacemaker determines the master, it does the following.
 On the to-be-master host:
 (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 On the slaves hosts:
 (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (6) Join RabbitMQ cluster of the master host
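For comparison, the bare rabbitmqctl sequence for joining a node to an
already-running RabbitMQ 3.x cluster, without the extra bookkeeping the OCF
agent does around steps (5)-(6), is:

    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@node-1   # node-1 = the host picked as master (example name)
    rabbitmqctl start_app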
 
 
 Yes, something like that. As I mentioned, there were several bug fixes
 in the 6.1 dev, and you can also check the MQ clustering flow charts.
 
 As far as I can understand, this process is to make sure the master
 determined by pacemaker is the same as the master determined in the RabbitMQ
 cluster. If there is no existing 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Andrew Beekhof

 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:

[snip]

 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a "batch-limit: 30" in the output of
 "pcs property list --all". The official pacemaker documentation explains
 it as "The number of jobs that the TE is allowed to execute in parallel".
 From my understanding, pacemaker maintains cluster states, and when we
 start/stop/promote/demote a resource, it triggers a state transition.
 Pacemaker puts as many transition jobs as possible into a batch and
 processes them in parallel.

Technically it calculates an ordered graph of actions that need to be performed 
for a set of related resources.
You can see an example of the kinds of graphs it produces at:

   
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html

There is a more complex one which includes promotion and demotion on the next 
page.
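
If you want to look at the graph pacemaker computed for your own cluster rather
than the documentation example, crm_simulate can dump it (a rough sketch; option
spellings may vary between pacemaker versions, and the output paths are just
examples):

  # dump the transition graph for the current live cluster state
  crm_simulate --live-check --save-dotfile /tmp/transition.dot
  # render it with graphviz, if installed
  dot -Tsvg /tmp/transition.dot -o /tmp/transition.svg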

The number of actions that can run at any one time is therefore limited by
- the value of batch-limit (the total number of in-flight actions)
- the number of resources that do not have ordering constraints between them
(e.g. rsc{1,2,3} in the above example)

So in the above example, if batch-limit = 3, the monitor_0 actions will still 
all execute in parallel.
If batch-limit == 2, one of them will be deferred until the others complete.
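
For reference, the limit itself can be inspected and changed with pcs (a sketch
only; whether raising it actually helps depends on what else is in the graph):

  # show the current value (30 in the deployment discussed in this thread)
  pcs property list --all | grep batch-limit
  # allow more in-flight actions per transition
  pcs property set batch-limit=100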

Processing of the graph stops the moment any action returns a value that was 
not expected.
If that happens, we wait for currently in-flight actions to complete, 
re-calculate a new graph based on the new information and start again.

 
 The problem is that pacemaker can only promote a resource after it
 detects the resource is started.

First we do a non-recurring monitor (*_monitor_0) to check what state the 
resource is in.
We can't assume it's off because a) we might have crashed, b) the admin might
have accidentally configured it to start at boot, or c) the admin may have asked
us to re-check everything.
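
To make that concrete, a probe is just the agent's monitor action run once; a
minimal OCF-style sketch (not the Fuel agent, and the process check here is a
made-up placeholder) would be:

  # standard OCF return codes
  OCF_SUCCESS=0
  OCF_NOT_RUNNING=7

  monitor() {
      # hypothetical liveness check: is the Erlang VM for this resource up?
      if pgrep -f beam.smp >/dev/null 2>&1; then
          return $OCF_SUCCESS      # probe reports: already running
      else
          return $OCF_NOT_RUNNING  # probe reports: cleanly stopped
      fi
  }

Whatever the probe reports is what the policy engine uses to decide whether a
start (and later a promote) still needs to be scheduled.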

 During a full reassemble, in the first
 transition batch, pacemaker starts all the resources including MySQL and
 RabbitMQ. Pacemaker issues resource agent start invocation in parallel
 and reaps the results.
 
 For a multi-state resource agent like RabbitMQ, pacemaker needs the
 start result reported in the first batch, then the transition engine and
 policy engine decide whether to retry the start or to promote, and put
 this new transition job into a new batch.

Also important to know, the order of actions is:

1. any necessary demotions
2. any necessary stops
3. any necessary starts
4. any necessary promotions



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-03 Thread Zhou Zheng Sheng / 周征晟
Hello Sergii,

Thank you for the great explanation of the Galera OCF script. I replied to your
question inline.

on 2015/05/03 04:49, Sergii Golovatiuk wrote:
 Hi Zhou,

 Galera OCF script is a bit special. Since MySQL keeps the most
 important data, we should find the most recent data among all nodes across
 the cluster. check_if_galera_pc is specially designed for that. Every
 server registers its latest status from grastate.dat to the CIB. Once all
 nodes are registered, the one with the most recent data will be
 selected as the Primary Component. All others should join that node. 5
 minutes is the time for all nodes to appear and register their position from
 grastate.dat to the CIB. Usually, it happens much faster. Though there are
 cases when a node is stuck on fsck or grub or a power outlet or some other
 issue. If all nodes are registered, there shouldn't be a 5-minute penalty
 timeout. If one node is stuck (but is at least present in the CIB), then all
 other nodes will wait for 5 minutes and then assemble the cluster
 without it.
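
If I read this right, the mechanism is roughly the following (a sketch of the
idea only, not the actual OCF code; the attribute name below is made up and the
grastate.dat path assumes the default MySQL datadir):

  # read the local Galera commit position and publish it to the CIB
  seqno=$(awk '/^seqno:/ {print $2}' /var/lib/mysql/grastate.dat)
  crm_attribute --node "$(hostname)" --lifetime reboot \
      --name galera-seqno --update "$seqno"
  # the node advertising the highest seqno is then picked as the Primary Component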

 Concerning dependencies, I agree that RabbitMQ may start in parallel
 with the Galera cluster assembly procedure. It makes no sense to start other
 services, as they are dependent on Galera and RabbitMQ.

 Also, I have a quick question for you. Shutting down all three
 controllers is a unique case, like a whole power outage in an entire
 datacenter (DC). In this case, a 5-minute delay is very small compared
 to the DC recovery procedure. Rebooting one controller is a more optimistic
 scenario. What's the special case for restarting all 3-5 at once?

Sorry, I am not very clear about what 3-5 refers to. Is the question
about why we want to make the full reassemble time short, and why this
case is important for us?

We have some small customers forming a long tail in the local market. They
have neither dedicated datacenter facilities nor dual power supplies. Some of
them would even shut down all the machines when they go home, and start
all of the machines when they start to work. Out of data privacy
concerns, they are not willing to put their virtual machines on the
public cloud. Usually, this kind of customer doesn't have the IT skills to
troubleshoot a full reassemble process. We want to make this process as
simple as turning on all the machines roughly at the same time and waiting
several minutes, so they don't call our service team.


 Also, I would like to say a big thank you for digging this out. It's very
 useful to use your findings in our next steps.


 --
 Best regards,
 Sergii Golovatiuk,
 Skype #golserge
 IRC #holser

 On Wed, Apr 29, 2015 at 9:38 AM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com mailto:zhengsh...@awcloud.com wrote:

 Hi!

 Thank you very much Vladimir and Bogdan! Thanks for the fast response
 and rich information.

 I backported the MySQL and RabbitMQ ocf patches from stable/6.0 and tested
 again. A full reassemble takes about 5 minutes, which is a big improvement.
 Adding the force_load trick I mentioned in the previous email, it takes
 about 4 minutes.

 I get that there is not really a RabbitMQ master instance because queue
 masters spread across all the RabbitMQ instances. The pacemaker master is
 an abstract one. However there is still an mnesia node from which other
 mnesia nodes sync the table schema. The exception
 timeout_waiting_for_tables in the log is actually reported by mnesia. By
 default, it places a mark on the last alive mnesia node, and other nodes
 have to sync tables from it
 (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477).
 RabbitMQ clustering inherits this behavior, and the last RabbitMQ
 instance shut down must be the first instance to start. Otherwise it
 produces timeout_waiting_for_tables
 (http://www.rabbitmq.com/clustering.html#transcript search for the last
 node to go down).

 The 1 minute difference is because without force_load, the abstract
 master determined by pacemaker during a promote action may not be the
 last RabbitMQ instance to shut its app down during the preceding start
 action. So there is a chance for rabbitmqctl start_app to wait 30s and
 trigger a RabbitMQ exception timeout_waiting_for_tables. We may be able
 to see a table timeout and mnesia resetting once during a reassemble
 process on some of the RabbitMQ instances, but it only introduces 30s of
 wait, which is acceptable for me.

 I also inspected the RabbitMQ resource agent code in the latest master branch.
 There are timeout wrappers and other improvements, which are great. It
 does not change the master promotion process much, so it may still run
 into the problems I described.

 Please see the inline reply below.

 on 2015/04/28/ 21:15, Bogdan Dobrelya wrote:
  Hello,
  Hello, Zhou
 
  I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a
  power failure. I have a running HA environment, then I reset the power of
  all the machines at the 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-29 Thread Zhou Zheng Sheng / 周征晟
Hi!

Thank you very much Vladimir and Bogdan! Thanks for the fast response and
rich information.

I backported the MySQL and RabbitMQ ocf patches from stable/6.0 and tested
again. A full reassemble takes about 5 minutes, which is a big improvement.
Adding the force_load trick I mentioned in the previous email, it takes
about 4 minutes.

I get that there is not really a RabbitMQ master instance because queue
masters spread across all the RabbitMQ instances. The pacemaker master is
an abstract one. However there is still an mnesia node from which other
mnesia nodes sync the table schema. The exception
timeout_waiting_for_tables in the log is actually reported by mnesia. By
default, it places a mark on the last alive mnesia node, and other nodes
have to sync tables from it
(http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477).
RabbitMQ clustering inherits this behavior, and the last RabbitMQ
instance shut down must be the first instance to start. Otherwise it
produces timeout_waiting_for_tables
(http://www.rabbitmq.com/clustering.html#transcript search for the last
node to go down).

The 1 minute difference is because without force_load, the abstract
master determined by pacemaker during a promote action may not be the
last RabbitMQ instance to shut its app down during the preceding start
action. So there is a chance for rabbitmqctl start_app to wait 30s and
trigger a RabbitMQ exception timeout_waiting_for_tables. We may be able
to see a table timeout and mnesia resetting once during a reassemble
process on some of the RabbitMQ instances, but it only introduces 30s of
wait, which is acceptable for me.
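
For reference, the force_load trick is roughly this (the path is an assumption
based on the default RABBITMQ_MNESIA_DIR; recent RabbitMQ versions expose a
similar effect as a command):

  # on the node we want to boot first without waiting for its old peers,
  # while the rabbit app is stopped:
  touch /var/lib/rabbitmq/mnesia/rabbit@$(hostname -s)/force_load
  # or, on recent RabbitMQ versions:
  rabbitmqctl force_boot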

I also inspected the RabbitMQ resource agent code in the latest master branch.
There are timeout wrappers and other improvements, which are great. It
does not change the master promotion process much, so it may still run
into the problems I described.

Please see the inline reply below.

on 2015/04/28/ 21:15, Bogdan Dobrelya wrote:
 Hello,
 Hello, Zhou

 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a
 power failure. I have a running HA environment, then I reset the power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for the RabbitMQ cluster to appear running in
 master-slave mode in pacemaker. If I power off all 3 controllers and
 only start 2 of them, the downtime can sometimes be as long as 20 minutes.
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for MQ OCF script, so you may want to try to
 backport them as well by the following guide [4]

 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/

 I did a little investigation and found some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker would wait
 475s before retrying to start it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is a master-slave
 resource, I assume that starting all the slaves and promoting the master are
 put into two different batches. If, unfortunately, starting all the RabbitMQ
 slaves is put in the same batch as the MySQL start, even if the RabbitMQ
 slaves and all other resources are ready, pacemaker will not continue
 but just wait for the MySQL timeout.
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there are MQ clustering logic flow charts available here
 [5] and we're planning to release a dedicated technical bulletin for this.

 [5] http://goo.gl/PPNrw7

Batch is a pacemaker concept I found when I was reading its
documentation and code. There is a batch-limit: 30 in the output of
pcs property list --all. The official pacemaker documentation describes
it as "the number of jobs that the TE is allowed to
execute in parallel". From my understanding, pacemaker maintains cluster
states, and when we start/stop/promote/demote a resource, it triggers a
state transition. Pacemaker puts as many transition jobs as possible
into a batch, and processes them in parallel.

The problem is that pacemaker can only promote a resource after it
detects the resource is started. During a full reassemble, in the first
transition batch, pacemaker starts all the resources including MySQL and
RabbitMQ. Pacemaker issues resource agent start invocation in parallel
and reaps the results.

For a multi-state resource agent like RabbitMQ, pacemaker needs the
start result reported in the first batch, then the transition engine and
policy engine decide whether to retry the start or to promote, and put
this new transition job into a new batch.

I see improvements to put individual commands inside a timeout wrapper
in RabbitMQ resource agent, and a bug created yesterday 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-28 Thread Bogdan Dobrelya
On 28.04.2015 15:15, Bogdan Dobrelya wrote:
 
 Hello, Zhou
 
 
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for MQ OCF script, so you may want to try to
 backport them as well by the following guide [4]
 
 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/
 
 
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there are MQ clustering logic flow charts available here
 [5] and we're planning to release a dedicated technical bulletin for this.
 
 [5] http://goo.gl/PPNrw7
 
 
 This is very interesting, thank you! I believe all commands in the MySQL RA
 OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as well,
 as we did for the MQ RA OCF. And there should not be any sleep calls. I
 created a bug for this [6].
 
 [6] https://bugs.launchpad.net/fuel/+bug/1449542
 
 
 Yes, something like that. As I mentioned, there were several bug fixes
 in the 6.1 dev, and you can also check the MQ clustering flow charts.
 
 after
 
 Not exactly. There is no master in a mirrored MQ cluster. We define the
 rabbit_hosts configuration option from Oslo.messaging, which ensures all
 queue masters will be spread around all of the MQ nodes in the long run. And
 we use a master abstraction only for the Pacemaker RA clustering layer.
 Here, a master is the MQ node that the rest of the MQ nodes join.
 
 
 We do erase the node master attribute in CIB for such cases. This should
 not bring problems into the master election logic.
 
 
 (Note, the RabbitMQ documentation mentions *queue* masters and slaves,
 which are not the case for the Pacemaker RA clustering abstraction layer.)
 
 
 We made an assumption that the node with the highest MQ uptime should
 know the most about the recent cluster state, so other nodes must join it.
 The RA OCF does not work with queue masters directly.
 
 
 The full MQ cluster reassemble logic is far from perfect,
 indeed. This might erase all mnesia files, hence any custom entities,
 like users or vhosts, would be removed as well. Note, we do not
 configure durable queues for OpenStack, so there is nothing to care about
 here - the full cluster downtime assumes there will be no AMQP messages
 stored at all.
 
 
 Yes, this option is only supported in the newest RabbitMQ versions. But we
 should definitely look at how this could help.
 
 
 Indeed, there are cases when MQ's autoheal can do nothing with existing
 partitions and remains partitioned forever, for example:
 
 Masters: [ node-1 ]
 Slaves: [ node-2 node-3 ]
 root@node-1:~# rabbitmqctl cluster_status
 Cluster status of node 'rabbit@node-1' ...
 [{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
 {running_nodes,['rabbit@node-1']},
 {cluster_name,rabbit@node-2},
 {partitions,[]}]
 ...done.
 root@node-2:~# rabbitmqctl cluster_status
 Cluster status of node 'rabbit@node-2' ...
 [{nodes,[{disc,['rabbit@node-2']}]}]
 ...done.
 root@node-3:~# rabbitmqctl cluster_status
 Cluster status of node 'rabbit@node-3' ...
 [{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3']},
 {cluster_name,rabbit@node-2},
 {partitions,[]}]

Sorry, here is the correct one [0] !

[0] http://pastebin.com/m3fDdMA6

 
 So we should test the pause-minority value as well.
 But I strongly believe we should make the MQ resource a multi-state clone
 supporting many masters; related bp [7]
 
 [7]
 https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone
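
For anyone who wants to experiment with this, the knob is
cluster_partition_handling in rabbitmq.config (a sketch; the eval call just
shows what the running broker currently uses):

  # check the currently active partition-handling mode
  rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'
  # to test pause_minority, set this in rabbitmq.config and restart the broker:
  #   [{rabbit, [{cluster_partition_handling, pause_minority}]}].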
 
 
 Well, we should not mix up the queue masters and the multi-state clone master
 for the MQ resource in pacemaker.
 As I said, the pacemaker RA has nothing to do with queue masters. And we
 introduced this master mostly in order to support the full cluster
 reassemble case - there must be a node promoted, and the other nodes should join it.
 
 
 This is a very good point, thank you.
 
 
 Thank you for the thorough feedback! This was a really great job.
 
 
 


-- 
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-28 Thread Bogdan Dobrelya
 Hello,

Hello, Zhou


 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a
 power failure. I have a running HA environment, then I reset the power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for the RabbitMQ cluster to appear running in
 master-slave mode in pacemaker. If I power off all 3 controllers and
 only start 2 of them, the downtime can sometimes be as long as 20 minutes.

Yes, this is a known issue [0]. Note, there were many bugfixes, like
[1],[2],[3], merged for MQ OCF script, so you may want to try to
backport them as well by the following guide [4]

[0] https://bugs.launchpad.net/fuel/+bug/1432603
[1] https://review.openstack.org/#/c/175460/
[2] https://review.openstack.org/#/c/175457/
[3] https://review.openstack.org/#/c/175371/
[4] https://review.openstack.org/#/c/170476/


 I did a little investigation and found some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker would wait
 475s before retrying to start it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is a master-slave
 resource, I assume that starting all the slaves and promoting the master are
 put into two different batches. If, unfortunately, starting all the RabbitMQ
 slaves is put in the same batch as the MySQL start, even if the RabbitMQ
 slaves and all other resources are ready, pacemaker will not continue
 but just wait for the MySQL timeout.

Could you please elaborate on what the same/different batches are for MQ
and DB? Note, there are MQ clustering logic flow charts available here
[5] and we're planning to release a dedicated technical bulletin for this.

[5] http://goo.gl/PPNrw7


 I can reproduce this by hard powering off all the controllers and starting
 them again. It's more likely to trigger MySQL failure in this way. Then
 I observe that if there is one cloned mysql instance not starting, the
 whole pacemaker cluster gets stuck and does not emit any log. On the
 host of the failed instance, I can see a mysql resource agent process
 calling the sleep command. If I kill that process, the pacemaker comes
 back alive and RabbitMQ master gets promoted. In fact this long timeout
 is blocking every resource from state transition in pacemaker.

 This may be a known problem of pacemaker and there are some discussions
 in the Linux-HA mailing list [2]. It might not be fixed in the near future.
 It seems that in general it's bad to have long timeouts in state transition
 actions (start/stop/promote/demote). There may be another way to
 implement the MySQL-wss resource agent with a short start timeout and
 monitor the wss cluster state using the monitor action.

This is very interesting, thank you! I believe all commands in the MySQL RA
OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as well,
as we did for the MQ RA OCF. And there should not be any sleep calls. I
created a bug for this [6].

[6] https://bugs.launchpad.net/fuel/+bug/1449542
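
Something along these lines is what I mean by wrapping (a sketch only; the real
call and the timeout values would of course differ per command):

  # give the command 30s, then SIGTERM, then SIGKILL 10s later if it ignores that
  timeout -k 10 -s TERM 30 mysql -e 'SELECT 1'
  rc=$?   # 124 here means the command was killed by timeout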


 I also found a fix to improve the MySQL start timeout [3]. It shortens the
 timeout to 300s. At the time I am sending this email, I cannot find it in
 the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
 stable/6.0?

 [1] https://bugs.launchpad.net/fuel/+bug/1441885
 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
 [3] https://review.openstack.org/#/c/171333/


 2. RabbitMQ Resource Agent Breaks Existing Cluster

 Reading the code of the RabbitMQ resource agent, I find it does the
 following to start the RabbitMQ master-slave cluster.
 On all the controllers:
 (1) Start Erlang beam process
 (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (3) Stop RabbitMQ App but do not stop the beam process

 Then in pacemaker, all the RabbitMQ instances are in slave state. After
 pacemaker determines the master, it does the following.
 On the to-be-master host:
 (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 On the slave hosts:
 (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (6) Join RabbitMQ cluster of the master host


Yes, something like that. As I mentioned, there were several bug fixes
in the 6.1 dev, and you can also check the MQ clustering flow charts.

 As far as I can understand, this process is to make sure the master
 determined by pacemaker is the same as the master determined in the RabbitMQ
 cluster. If there is no existing cluster, it's fine. If it is run
after

Not exactly. There is no master in a mirrored MQ cluster. We define the
rabbit_hosts configuration option from Oslo.messaging, which ensures all
queue masters will be spread around all of the MQ nodes in the long run. And
we use a master abstraction only for the Pacemaker RA clustering layer.
Here, a master is the MQ node that the rest of the MQ nodes join.

 power failure 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-28 Thread Vladimir Kuklin
Hi, Zhou

Thank you for writing these awesome recommendations.

We will look into them and see whether they provide a significant impact.
BTW, we have found a bunch of issues with our 5.1 and 6.0 RabbitMQ OCF
script and fixed them in the current master. Would you be so kind as to check
out the newest version and say whether any of the issues you mentioned are gone?

On Tue, Apr 28, 2015 at 9:03 AM, Zhou Zheng Sheng / 周征晟 
zhengsh...@awcloud.com wrote:

  Hello,

 I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a
 power failure. I have a running HA environment, then I reset the power of all
 the machines at the same time. I observe that after reboot it usually takes 10
 minutes for the RabbitMQ cluster to appear running in master-slave mode in
 pacemaker. If I power off all 3 controllers and only start 2 of them,
 the downtime can sometimes be as long as 20 minutes.

 I did a little investigation and found some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after a power failure, and pacemaker would wait 475s
 before retrying to start it. The problem is that pacemaker divides resource
 state transitions into batches. Since RabbitMQ is a master-slave resource, I
 assume that starting all the slaves and promoting the master are put into two
 different batches. If, unfortunately, starting all the RabbitMQ slaves is put
 in the same batch as the MySQL start, even if the RabbitMQ slaves and all
 other resources are ready, pacemaker will not continue but just wait for the
 MySQL timeout.

 I can reproduce this by hard powering off all the controllers and starting
 them again. It's more likely to trigger MySQL failure in this way. Then I
 observe that if there is one cloned mysql instance not starting, the whole
 pacemaker cluster gets stuck and does not emit any log. On the host of the
 failed instance, I can see a mysql resource agent process calling the sleep
 command. If I kill that process, the pacemaker comes back alive and
 RabbitMQ master gets promoted. In fact this long timeout is blocking every
 resource from state transition in pacemaker.

 This may be a known problem of pacemaker and there are some discussions in
 the Linux-HA mailing list [2]. It might not be fixed in the near future. It
 seems that in general it's bad to have long timeouts in state transition
 actions (start/stop/promote/demote). There may be another way to implement
 the MySQL-wss resource agent with a short start timeout and monitor the wss
 cluster state using the monitor action.

 I also found a fix to improve the MySQL start timeout [3]. It shortens the
 timeout to 300s. At the time I am sending this email, I cannot find it in
 the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
 stable/6.0?

 [1] https://bugs.launchpad.net/fuel/+bug/1441885
 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
 [3] https://review.openstack.org/#/c/171333/


 2. RabbitMQ Resource Agent Breaks Existing Cluster

 Reading the code of the RabbitMQ resource agent, I find it does the following
 to start the RabbitMQ master-slave cluster.
 On all the controllers:
 (1) Start Erlang beam process
 (2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (3) Stop RabbitMQ App but do not stop the beam process

 Then in pacemaker, all the RabbitMQ instances are in slave state. After
 pacemaker determines the master, it does the following.
 On the to-be-master host:
 (4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 On the slave hosts:
 (5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
 (6) Join RabbitMQ cluster of the master host

 As far as I can understand, this process is to make sure the master
 determined by pacemaker is the same as the master determined in the RabbitMQ
 cluster. If there is no existing cluster, it's fine. If it is run after
 power failure and recovery, it introduces a new problem.

 After power recovery, if some of the RabbitMQ instances reach step (2)
 roughly at the same time (within 30s, which is hard coded in RabbitMQ) as
 the original RabbitMQ master instance, they form the original cluster again
 and then shut down. The other instances would have to wait for 30s before
 they report a failure waiting for tables, and are reset to standalone clusters.

 In the RabbitMQ documentation [4], it is also mentioned that if we shut down
 the RabbitMQ master, a new master is elected from the rest of the slaves. If we
 continue to shut down nodes in step (3), we reach a point where the last node
 is the RabbitMQ master, and pacemaker is not aware of it. I can see there
 is code bookkeeping a rabbit-start-time attribute in pacemaker to
 record the longest-lived instance to help pacemaker determine the master,
 but it does not cover the case mentioned above. A recent patch [5] checks
 the existing rabbit-master attribute but it does not cover the above case either.

 So 

[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-28 Thread Zhou Zheng Sheng / 周征晟
Hello,

I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a
power failure. I have a running HA environment, then I reset the power of
all the machines at the same time. I observe that after reboot it
usually takes 10 minutes for the RabbitMQ cluster to appear running in
master-slave mode in pacemaker. If I power off all 3 controllers and
only start 2 of them, the downtime can sometimes be as long as 20 minutes.

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
MySQL-wss fails to start after a power failure, and pacemaker would wait
475s before retrying to start it. The problem is that pacemaker divides
resource state transitions into batches. Since RabbitMQ is a master-slave
resource, I assume that starting all the slaves and promoting the master are
put into two different batches. If, unfortunately, starting all the RabbitMQ
slaves is put in the same batch as the MySQL start, even if the RabbitMQ
slaves and all other resources are ready, pacemaker will not continue
but just wait for the MySQL timeout.

I can reproduce this by hard powering off all the controllers and starting
them again. It's more likely to trigger MySQL failure in this way. Then
I observe that if there is one cloned mysql instance not starting, the
whole pacemaker cluster gets stuck and does not emit any log. On the
host of the failed instance, I can see a mysql resource agent process
calling the sleep command. If I kill that process, the pacemaker comes
back alive and RabbitMQ master gets promoted. In fact this long timeout
is blocking every resource from state transition in pacemaker.

This may be a known problem of pacemaker and there are some discussions
in the Linux-HA mailing list [2]. It might not be fixed in the near future.
It seems that in general it's bad to have long timeouts in state transition
actions (start/stop/promote/demote). There may be another way to
implement the MySQL-wss resource agent with a short start timeout and
monitor the wss cluster state using the monitor action.

I also found a fix to improve the MySQL start timeout [3]. It shortens the
timeout to 300s. At the time I am sending this email, I cannot find it in
the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
stable/6.0?

[1] https://bugs.launchpad.net/fuel/+bug/1441885
[2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html
[3] https://review.openstack.org/#/c/171333/
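
For anyone who wants the shorter timeout before that fix lands in stable/6.0,
something like this should work (a sketch; the resource name is as it appears
in my deployment, and pcs syntax may differ slightly per version):

  # inspect the currently configured operations of the MySQL resource
  pcs resource show p_mysql
  # shorten the start operation timeout to match the proposed fix
  pcs resource update p_mysql op start timeout=300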


2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the
following to start the RabbitMQ master-slave cluster.
On all the controllers:
(1) Start Erlang beam process
(2) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
(3) Stop RabbitMQ App but do not stop the beam process

Then in pacemaker, all the RabbitMQ instances are in slave state. After
pacemaker determines the master, it does the following.
On the to-be-master host:
(4) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
On the slave hosts:
(5) Start RabbitMQ App (If failed, reset mnesia DB and cluster state)
(6) Join RabbitMQ cluster of the master host
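
Roughly, in terms of plain commands, that sequence maps to something like this
(my reading of the agent, not its literal code; the reset-on-failure branch is
simplified):

  # (1)-(3) on every controller: bring up beam, prove the app starts, stop the app
  rabbitmq-server -detached
  rabbitmqctl start_app || { rabbitmqctl stop_app; rabbitmqctl force_reset; }
  rabbitmqctl stop_app
  # (4) on the node pacemaker promotes:
  rabbitmqctl start_app
  # (5)-(6) on the remaining nodes:
  rabbitmqctl start_app || { rabbitmqctl stop_app; rabbitmqctl force_reset; }
  rabbitmqctl stop_app
  rabbitmqctl join_cluster rabbit@<master-hostname>
  rabbitmqctl start_app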

As far as I can understand, this process is to make sure the master
determined by pacemaker is the same as the master determined in the RabbitMQ
cluster. If there is no existing cluster, it's fine. If it is run after
power failure and recovery, it introduces a new problem.

After power recovery, if some of the RabbitMQ instances reach step (2)
roughly at the same time (within 30s, which is hard coded in RabbitMQ) as
the original RabbitMQ master instance, they form the original cluster
again and then shut down. The other instances would have to wait for 30s
before they report a failure waiting for tables, and are reset to
standalone clusters.

In the RabbitMQ documentation [4], it is also mentioned that if we shut down
the RabbitMQ master, a new master is elected from the rest of the slaves. If we
continue to shut down nodes in step (3), we reach a point where the last
node is the RabbitMQ master, and pacemaker is not aware of it. I can see
there is code bookkeeping a rabbit-start-time attribute in
pacemaker to record the longest-lived instance to help pacemaker
determine the master, but it does not cover the case mentioned above. A
recent patch [5] checks the existing rabbit-master attribute but it
does not cover the above case either.
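
That bookkeeping can be inspected directly if you want to see what pacemaker
thinks (a sketch; the attribute names are the ones used by the OCF script, the
lifetime is my assumption):

  # show the start-time attribute the resource agent records for a node
  crm_attribute --node node-1 --lifetime reboot --name rabbit-start-time --query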

So in step (4), pacemaker determines a different master which was a
RabbitMQ slave last time. It would wait for its original RabbitMQ master
for 30s and fail, then it gets reset to a standalone cluster. Here we
get several different clusters, so in steps (5) and (6), it is likely to
report errors in the log saying timeout waiting for tables, or to fail to
merge the mnesia database schema, and then those instances get reset. You can
easily reproduce this case by hard resetting the power of all the controllers.

As you can see, if you