Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Actually, we are not skipping 'Started' state - we just consider resource as started when beam is powered up and rabbitmq start_app/stop_app action succeeds. Such a node is considered as a good one that can be marked as 'Master' to which the nodes should connect and then all the cluster join/leave actions are handled using multi-state notification mechanism. On Wed, May 20, 2015 at 5:05 AM, Andrew Beekhof abeek...@redhat.com wrote: On 20 May 2015, at 6:05 am, Andrew Woodward xar...@gmail.com wrote: On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com wrote: On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Andrew. on 2015/05/05 08:03, Andrew Beekhof wrote: On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Ah, I’m still learning who is who... i thought you were part of that project :-) Comparing with upstream OCF code, the main difference is that Fuel RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when RabbitMQ cluster is not ready. I beleive the upstream OCF should be OK to use as well after I read the code, but it might not fit into the Fuel project. As far as I test, the Fuel OCF script is good except sometimes the full reassemble time is long, and as I find out, it is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. That would be good to know. Browsing the agent, promote seems to be a no-op if rabbit is already running. To the master / slave reason due to how the ocf script is structured to deal with rabbit's poor ability to handle its self in some scenarios. Hopefully the state transition diagram [5] is enough to clarify what's going on. [5] http://goo.gl/PPNrw7 Not really. It seems to be under the impression you can skip started and go directly from stopped to master. __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev -- Yours Faithfully, Vladimir Kuklin, Fuel Library Tech Lead, Mirantis, Inc. 
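For readers unfamiliar with how such a multi-state resource is declared, a rough pcs sketch follows. The resource and agent names, clone parameters, and timeouts are illustrative placeholders, not the exact Fuel definitions:

    # Sketch: RabbitMQ as a pacemaker master/slave (multi-state) resource with
    # cluster-wide notifications; names and values are hypothetical.
    pcs resource create p_rabbitmq-server ocf:fuel:rabbitmq-server \
        op monitor interval=30s timeout=60s \
        op start timeout=120s op stop timeout=60s
    pcs resource master master_p_rabbitmq-server p_rabbitmq-server \
        master-max=1 master-node-max=1 notify=true ordered=false interleave=true

The notify=true meta attribute is what enables the post-start/post-promote notifications that the cluster join/leave handling described above relies on.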
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 20 May 2015, at 6:05 am, Andrew Woodward xar...@gmail.com wrote: On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com wrote: On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Andrew. on 2015/05/05 08:03, Andrew Beekhof wrote: On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Ah, I’m still learning who is who... i thought you were part of that project :-) Comparing with upstream OCF code, the main difference is that Fuel RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when RabbitMQ cluster is not ready. I beleive the upstream OCF should be OK to use as well after I read the code, but it might not fit into the Fuel project. As far as I test, the Fuel OCF script is good except sometimes the full reassemble time is long, and as I find out, it is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. That would be good to know. Browsing the agent, promote seems to be a no-op if rabbit is already running. To the master / slave reason due to how the ocf script is structured to deal with rabbit's poor ability to handle its self in some scenarios. Hopefully the state transition diagram [5] is enough to clarify what's going on. [5] http://goo.gl/PPNrw7 Not really. It seems to be under the impression you can skip started and go directly from stopped to master. __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On Thu, May 7, 2015 at 5:01 PM Andrew Beekhof abeek...@redhat.com wrote: On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Andrew. on 2015/05/05 08:03, Andrew Beekhof wrote: On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Ah, I’m still learning who is who... i thought you were part of that project :-) Comparing with upstream OCF code, the main difference is that Fuel RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when RabbitMQ cluster is not ready. I beleive the upstream OCF should be OK to use as well after I read the code, but it might not fit into the Fuel project. As far as I test, the Fuel OCF script is good except sometimes the full reassemble time is long, and as I find out, it is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. That would be good to know. Browsing the agent, promote seems to be a no-op if rabbit is already running. To the master / slave reason due to how the ocf script is structured to deal with rabbit's poor ability to handle its self in some scenarios. Hopefully the state transition diagram [5] is enough to clarify what's going on. [5] http://goo.gl/PPNrw7 I see Vladimir and Sergey works on the original Fuel blueprint RabbitMQ cluster [beta]. [alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq [beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker I have a little investigation and find out there are some possible causes. 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in Pacemaker The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker would wait 475s before retry starting it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is master-slave resource, I assume that starting all the slaves and promoting master are put into two different batches. 
If, unfortunately, starting all the RabbitMQ slaves lands in the same batch as the MySQL start, then even when the RabbitMQ slaves and all other resources are ready, pacemaker will not continue; it just waits for the MySQL timeout. Could you please elaborate on what ends up in the same or different batches for MQ and DB? Note, there are MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7

I can reproduce this by hard powering off all the controllers and starting them again; MySQL failure is more likely to be triggered this way. I then observe that if one cloned mysql instance fails to start, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In effect this long timeout blocks every resource from state transition in pacemaker. This may be a known problem of pacemaker; there are some discussions on the Linux-HA mailing list [2], and it might not be fixed in the near future. In general it seems bad to have a long timeout on state transition actions (start/stop/promote/demote). There may be another way to implement
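As a side note, the 475s p_mysql start timeout discussed above can be inspected and shortened with pcs, roughly as below. The resource name is taken from the thread; the exact syntax may differ between pcs versions and the 300s value simply mirrors the review mentioned later:

    # Show the resource's configured operations, then shorten the start timeout.
    pcs resource show p_mysql
    pcs resource update p_mysql op start timeout=300s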
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 5 May 2015, at 1:19 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Andrew. on 2015/05/05 08:03, Andrew Beekhof wrote: On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Ah, I’m still learning who is who... i thought you were part of that project :-) Comparing with upstream OCF code, the main difference is that Fuel RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when RabbitMQ cluster is not ready. I beleive the upstream OCF should be OK to use as well after I read the code, but it might not fit into the Fuel project. As far as I test, the Fuel OCF script is good except sometimes the full reassemble time is long, and as I find out, it is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. That would be good to know. Browsing the agent, promote seems to be a no-op if rabbit is already running. I see Vladimir and Sergey works on the original Fuel blueprint RabbitMQ cluster [beta]. [alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq [beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker I have a little investigation and find out there are some possible causes. 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in Pacemaker The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker would wait 475s before retry starting it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is master-slave resource, I assume that starting all the slaves and promoting master are put into two different batches. If unfortunately starting all RabbitMQ slaves are put in the same batch as MySQL starting, even if RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for MySQL timeout. Could you please elaborate the what is the same/different batches for MQ and DB? 
Note, there are MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7

I can reproduce this by hard powering off all the controllers and starting them again; MySQL failure is more likely to be triggered this way. I then observe that if one cloned mysql instance fails to start, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In effect this long timeout blocks every resource from state transition in pacemaker. This may be a known problem of pacemaker; there are some discussions on the Linux-HA mailing list [2], and it might not be fixed in the near future. In general it seems bad to have a long timeout on state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent so that it uses a short start timeout and monitors the wss cluster state from the monitor action.

This is very interesting, thank you! I believe all commands in the MySQL RA OCF script should likewise be wrapped with timeout -SIGTERM or -SIGKILL, as we did for the MQ RA OCF, and there should be no sleep calls. I created a bug for this [6]. [6] https://bugs.launchpad.net/fuel/+bug/1449542 I also
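A minimal sketch of the kind of wrapping Bogdan describes, assuming a hypothetical health-check call inside the MySQL resource agent (the query, the 30s deadline, and the option file path are made up for illustration; ocf_log and OCF_ERR_GENERIC come from the standard ocf-shellfuncs sourced by resource agents):

    # Bound a potentially hanging call with coreutils timeout instead of sleeping,
    # and fail the action explicitly if the deadline is hit.
    if ! timeout -s KILL 30 \
         mysql --defaults-file=/etc/mysql/conf.d/wsrep.cnf -e 'SELECT 1;' >/dev/null 2>&1; then
        ocf_log err "health query did not complete within 30s"
        return "$OCF_ERR_GENERIC"
    fi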
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 5 May 2015, at 9:30 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Andrew. Sorry for misspell your name in the previous email. on 2015/05/05 14:25, Andrew Beekhof wrote: On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Bogdan for clearing the pacemaker promotion process for me. on 2015/05/05 10:32, Andrew Beekhof wrote: On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip] Batch is a pacemaker concept I found when I was reading its documentation and code. There is a batch-limit: 30 in the output of pcs property list --all. The pacemaker official documentation explanation is that it's The number of jobs that the TE is allowed to execute in parallel. From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many as possible transition jobs into a batch, and process them in parallel. Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values, The size of the cluster shouldn’t increase the chance of this happening unless you’ve set the timeouts too aggressively. If there are many types of resource agents, and anyone of them is not well written, it might cause trouble, right? Yes, but really only for the things that depend on it. For example if resources B, C, D, E all depend (in some way) on A, then their startup is going to be delayed. But F, G, H and J will be able to start while we wait around for B to time out. and if any of the in-flight action timeout is long, it would block pacemaker from re-calculating a new transition graph? Yes, but its actually an argument for making the timeouts longer, not shorter. Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces. So set them to be long enough that there is unquestionably a problem if you hit them. But we absolutely recognise that starting/stopping a database can take a very long time comparatively and that it shouldn’t block recovery of other unrelated services. I would expect to see this land in Pacemaker 1.1.14 It will be great to see this in Pacemaker 1.1.14. From my experience using Pacemaker, I think customized resource agents are possibly the weakest part. 
This is why we encourage people wanting new agents to get involved with the upstream resource-agents project :-) This feature should improve the handling of resource action timeouts.

I see the current batch-limit is 30 and I tried to increase it to 100, but it did not help. Correct. It only puts an upper limit on the number of in-flight actions; actions still need to wait for everything they depend on to complete before executing. I'm sure the cloned MySQL Galera resource is not related to the master-slave RabbitMQ resource. I don't find any dependency, order or rule connecting them in the cluster deployed by Fuel [1]. In general it should not have needed to wait, but if you send me a crm_report covering the period you're talking about I'll be able to comment specifically on the behaviour you saw.

You are very nice, thank you. I uploaded the file generated by crm_report to Google Drive. https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing Hmmm... there are no logs included here for some reason. I suspect it's a bug on my part; can you apply this patch to report.collector on the machine you're running crm_report from and retry? https://github.com/ClusterLabs/pacemaker/commit/96427ec

Is there anything I can do to make sure all the resource actions return expected values during a full reassemble? In general, if we say 'start', do your best to start or return '0' if you
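For reference, a crm_report invocation covering such a window looks roughly like the following; the timestamps and the destination path are placeholders:

    # Collect cluster logs, the CIB, and transition history for the reassemble window.
    crm_report -f "2015-05-05 10:00:00" -t "2015-05-05 11:00:00" /tmp/reassemble-report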
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 5 May 2015, at 7:52 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: On 05.05.2015 04:32, Andrew Beekhof wrote: [snip] Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in. We can’t assume its off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything. Also important to know, the order of actions is: I should clarify something here: s/actions is/actions for each resource is/ 1. any necessary demotions 2. any necessary stops 3. any necessary starts 4. any necessary promotions Thank you for explaining this, Andrew! So, in the context of the given two example DB(MySQL) and messaging(RabbitMQ) resources: The problem is that pacemaker can only promote a resource after it detects the resource is started. During a full reassemble, in the first transition batch, pacemaker starts all the resources including MySQL and RabbitMQ. Pacemaker issues resource agent start invocation in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch, then transition engine and policy engine decide if it has to retry starting or promote, and put this new transition job into a new batch. So, for given example, it looks like we currently have: _batch start_ ... 3. DB, messaging resources start in a one batch Since there is no dependancy between them, yes. 4. messaging resource promote blocked by the step 3 completion _batch end_ Not quite, I wasn’t as clear as I could have been in my previous email. We wont promote Rabbit instances until all they have all been started. However we don’t need to wait for all the DBs to finish starting (again, because there is no dependancy between them) before we begin promoting Rabbit. So a single transition that did this is totally possible: t0. Begin transition t1. Rabbit start node1(begin) t2. DB start node 3 (begin) t3. DB start node 2 (begin) t4. Rabbit start node2(begin) t5. Rabbit start node3(begin) t6. DB start node 1 (begin) t7. Rabbit start node2(complete) t8. Rabbit start node1(complete) t9. DB start node 3 (complete) t10. Rabbit start node3(complete) t11. Rabbit promote node 1 (begin) t12. Rabbit promote node 3 (begin) t13. Rabbit promote node 2 (begin) ... etc etc ... 
For something like cinder however, these are some of the dependencies we define:

    pcs constraint order start keystone-clone then cinder-api-clone
    pcs constraint order start cinder-api-clone then cinder-scheduler-clone
    pcs constraint order start galera-master then keystone-clone

So first all the galera instances must be started. Then we can begin to promote some. Once all the promotions complete, then we can start the keystone instances. Once all the keystone instances are up, then we can bring up the cinder API instances, which allows us to start the scheduler, etc etc. And assuming nothing fails, this can all happen in one transition.

Bottom line: Pacemaker will do as much as it can as soon as it can. The only restrictions are the ordering constraints you specify, the batch-limit, and each master/slave (or clone) resource's _internal_ demote-stop-start-promote ordering. Am I making it better or worse?

Does this mean that an artificial ordering constraint between DB and messaging could help them get into separate transition batches, like:

    ...
    3. messaging multistate clone resource start
    4. messaging multistate clone resource promote
    _batch end_
    _next batch start_
    ...
    3. DB simple clone resource start

? -- Best regards, Bogdan Dobrelya, Skype #bogdando_at_yahoo.com Irc #bogdando
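Expressed in pcs syntax, the "artificial ordering" Bogdan is asking about would look roughly like the line below. The resource names are hypothetical and this is only meant to make the question concrete, not to recommend adding the constraint:

    # Make the DB clone start only after the messaging resource has been promoted.
    pcs constraint order promote master_p_rabbitmq-server then start clone_p_mysql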
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Andrew. Sorry for misspell your name in the previous email. on 2015/05/05 14:25, Andrew Beekhof wrote: On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Bogdan for clearing the pacemaker promotion process for me. on 2015/05/05 10:32, Andrew Beekhof wrote: On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip] Batch is a pacemaker concept I found when I was reading its documentation and code. There is a batch-limit: 30 in the output of pcs property list --all. The pacemaker official documentation explanation is that it's The number of jobs that the TE is allowed to execute in parallel. From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many as possible transition jobs into a batch, and process them in parallel. Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values, The size of the cluster shouldn’t increase the chance of this happening unless you’ve set the timeouts too aggressively. If there are many types of resource agents, and anyone of them is not well written, it might cause trouble, right? and if any of the in-flight action timeout is long, it would block pacemaker from re-calculating a new transition graph? Yes, but its actually an argument for making the timeouts longer, not shorter. Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces. So set them to be long enough that there is unquestionably a problem if you hit them. But we absolutely recognise that starting/stopping a database can take a very long time comparatively and that it shouldn’t block recovery of other unrelated services. I would expect to see this land in Pacemaker 1.1.14 It will be great to see this in Pacemaker 1.1.14. From my experience using Pacemaker, I think customized resource agents are possibly the weakest part. This feature should improve the handling for resource action timeouts. I see the current batch-limit is 30 and I tried to increase it to 100, but did not help. Correct. It only puts an upper limit on the number of in-flight actions, actions still need to wait for all their dependants to complete before executing. I'm sure that the cloned MySQL Galera resource is not related to master-slave RabbitMQ resource. 
I don't find any dependency, order or rule connecting them in the cluster deployed by Fuel [1]. In general it should not have needed to wait, but if you send me a crm_report covering the period you're talking about I'll be able to comment specifically on the behaviour you saw. You are very nice, thank you. I uploaded the file generated by crm_report to Google Drive. https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Is there anything I can do to make sure all the resource actions return expected values during a full reassemble? In general, if we say 'start', do your best to start or return '0' if you already were started. Likewise for stop. Otherwise it's really specific to your agent. For example an IP resource just needs to add itself to an interface - it can't do much differently; if it times out then the system must be very, very busy. The only other thing I would say is:
- avoid blocking calls where possible
- have empathy for the machine (do as little as is needed)
+1 for the empathy :)

Is it because node-1 and node-2 happen to boot faster than node-3 and form a cluster, and when node-3 joins it triggers a new state transition? Or maybe because some resources are already started, so pacemaker needs to stop them first? We only stop them if they shouldn't yet be running (i.e. a colocation or ordering dependency
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Bogdan for clearing the pacemaker promotion process for me. on 2015/05/05 10:32, Andrew Beekhof wrote: On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip] Batch is a pacemaker concept I found when I was reading its documentation and code. There is a batch-limit: 30 in the output of pcs property list --all. The pacemaker official documentation explanation is that it's The number of jobs that the TE is allowed to execute in parallel. From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many as possible transition jobs into a batch, and process them in parallel. Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values, The size of the cluster shouldn’t increase the chance of this happening unless you’ve set the timeouts too aggressively. and if any of the in-flight action timeout is long, it would block pacemaker from re-calculating a new transition graph? Yes, but its actually an argument for making the timeouts longer, not shorter. Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces. So set them to be long enough that there is unquestionably a problem if you hit them. But we absolutely recognise that starting/stopping a database can take a very long time comparatively and that it shouldn’t block recovery of other unrelated services. I would expect to see this land in Pacemaker 1.1.14 I see the current batch-limit is 30 and I tried to increase it to 100, but did not help. Correct. It only puts an upper limit on the number of in-flight actions, actions still need to wait for all their dependants to complete before executing. I'm sure that the cloned MySQL Galera resource is not related to master-slave RabbitMQ resource. I don't find any dependency, order or rule connecting them in the cluster deployed by Fuel [1]. In general it should not have needed to wait, but if you send me a crm_report covering the period you’re talking about I’ll be able to comment specifically about the behaviour you saw. Is there anything I can do to make sure all the resource actions return expected values in a full reassembling? In general, if we say ‘start’, do your best to start or return ‘0’ if you already were started. 
Likewise for stop. Otherwise it's really specific to your agent. For example an IP resource just needs to add itself to an interface - it can't do much differently; if it times out then the system must be very, very busy. The only other thing I would say is:
- avoid blocking calls where possible
- have empathy for the machine (do as little as is needed)

Is it because node-1 and node-2 happen to boot faster than node-3 and form a cluster, and when node-3 joins it triggers a new state transition? Or maybe because some resources are already started, so pacemaker needs to stop them first? We only stop them if they shouldn't yet be running (i.e. a colocation or ordering dependency is not yet started also). Does setting default-resource-stickiness to 1 help? From 0 or INFINITY?

I also tried crm history XXX commands in a live and correct cluster, but didn't find much information. (I'm not familiar with that tool anymore.) I can see there are many log entries like run_graph: Transition 7108. Next I'll inspect the pacemaker log to see which resource action returns the unexpected value or which thing triggers a new state transition. [1] http://paste.openstack.org/show/214919/ I'd not recommend mixing the two CLI tools. The problem is that pacemaker can only promote a resource after it detects the resource is
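For completeness, checking and changing the stickiness default discussed above would look roughly like this; whether 1, a larger value, or INFINITY is appropriate depends on the deployment:

    # Show current resource defaults, then set a small stickiness so healthy
    # resources are not moved around gratuitously.
    pcs resource defaults
    pcs resource defaults resource-stickiness=1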
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 05.05.2015 04:32, Andrew Beekhof wrote: [snip] Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in. We can’t assume its off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything. Also important to know, the order of actions is: 1. any necessary demotions 2. any necessary stops 3. any necessary starts 4. any necessary promotions Thank you for explaining this, Andrew! So, in the context of the given two example DB(MySQL) and messaging(RabbitMQ) resources: The problem is that pacemaker can only promote a resource after it detects the resource is started. During a full reassemble, in the first transition batch, pacemaker starts all the resources including MySQL and RabbitMQ. Pacemaker issues resource agent start invocation in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch, then transition engine and policy engine decide if it has to retry starting or promote, and put this new transition job into a new batch. So, for given example, it looks like we currently have: _batch start_ ... 3. DB, messaging resources start in a one batch 4. messaging resource promote blocked by the step 3 completion _batch end_ Does this mean what an artificial constraints ordering between DB and messaging could help them to get into the separate transition batches, like: ... 3. messaging multistate clone resource start 4. messaging multistate clone resource promote _batch end_ _next batch start_ ... 3. DB simple clone resource start ? -- Best regards, Bogdan Dobrelya, Skype #bogdando_at_yahoo.com Irc #bogdando __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Andrew. on 2015/05/05 08:03, Andrew Beekhof wrote: On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Comparing with upstream OCF code, the main difference is that Fuel RabbitMQ OCF is a master-slave resource. Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when RabbitMQ cluster is not ready. I beleive the upstream OCF should be OK to use as well after I read the code, but it might not fit into the Fuel project. As far as I test, the Fuel OCF script is good except sometimes the full reassemble time is long, and as I find out, it is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. I see Vladimir and Sergey works on the original Fuel blueprint RabbitMQ cluster [beta]. [alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq [beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker I have a little investigation and find out there are some possible causes. 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in Pacemaker The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker would wait 475s before retry starting it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is master-slave resource, I assume that starting all the slaves and promoting master are put into two different batches. If unfortunately starting all RabbitMQ slaves are put in the same batch as MySQL starting, even if RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for MySQL timeout. Could you please elaborate the what is the same/different batches for MQ and DB? Note, there is a MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7 I can re-produce this by hard powering off all the controllers and start them again. It's more likely to trigger MySQL failure in this way. 
Then I observe that if one cloned mysql instance fails to start, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In effect this long timeout blocks every resource from state transition in pacemaker. This may be a known problem of pacemaker; there are some discussions on the Linux-HA mailing list [2], and it might not be fixed in the near future. In general it seems bad to have a long timeout on state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent so that it uses a short start timeout and monitors the wss cluster state from the monitor action.

This is very interesting, thank you! I believe all commands in the MySQL RA OCF script should likewise be wrapped with timeout -SIGTERM or -SIGKILL, as we did for the MQ RA OCF, and there should be no sleep calls. I created a bug for this [6]. [6] https://bugs.launchpad.net/fuel/+bug/1449542

I also find a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time I am sending this email, I cannot find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0? [1] https://bugs.launchpad.net/fuel/+bug/1441885 [2]
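A hedged sketch of how one might confirm the "resource agent stuck in sleep" symptom described above on the affected controller; the grep patterns are indicative only and the kill is the manual workaround from the thread, not a fix:

    # Look for resource-agent children sitting in long sleeps, with elapsed time.
    ps -eo pid,ppid,etime,args | grep -E '[m]ysql-wss|[s]leep'
    kill <pid>   # <pid> is the sleeping child; killing it lets the transition engine move on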
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Bogdan for clearing the pacemaker promotion process for me. on 2015/05/05 10:32, Andrew Beekhof wrote: On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip] Batch is a pacemaker concept I found when I was reading its documentation and code. There is a batch-limit: 30 in the output of pcs property list --all. The pacemaker official documentation explanation is that it's The number of jobs that the TE is allowed to execute in parallel. From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many as possible transition jobs into a batch, and process them in parallel. Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values, and if any of the in-flight action timeout is long, it would block pacemaker from re-calculating a new transition graph? I see the current batch-limit is 30 and I tried to increase it to 100, but did not help. I'm sure that the cloned MySQL Galera resource is not related to master-slave RabbitMQ resource. I don't find any dependency, order or rule connecting them in the cluster deployed by Fuel [1]. Is there anything I can do to make sure all the resource actions return expected values in a full reassembling? Is it because node-1 and node-2 happen to boot faster than node-3 and form a cluster, when node-3 joins, it triggers new state transition? Or may because some resources are already started, so pacemaker needs to stop them firstly? Does setting default-resource-stickiness to 1 help? I also tried crm history XXX commands in a live and correct cluster, but didn't find much information. I can see there are many log entries like run_graph: Transition 7108 Next I'll inspect the pacemaker log to see which resource action returns the unexpected value or which thing triggers new state transition. [1] http://paste.openstack.org/show/214919/ The problem is that pacemaker can only promote a resource after it detects the resource is started. First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in. We can’t assume its off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything. During a full reassemble, in the first transition batch, pacemaker starts all the resources including MySQL and RabbitMQ. 
Pacemaker issues the resource agent start invocations in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch; then the transition engine and policy engine decide whether to retry the start or to promote, and put this new transition job into a new batch.

Also important to know, the order of actions is:
1. any necessary demotions
2. any necessary stops
3. any necessary starts
4. any necessary promotions

-- Best wishes! Zhou Zheng Sheng / 周征晟 Software Engineer Beijing AWcloud Software Co., Ltd.
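One way to see the action graph pacemaker computes, without waiting for a real failure, is to run crm_simulate against the live CIB. This is illustrative only; option spellings vary a little between pacemaker versions:

    # Show allocation scores and the actions pacemaker would schedule right now.
    crm_simulate -sL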
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote: Hello, Hello, Zhou I using Fuel 6.0.1 and find that RabbitMQ recover time is long after power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for RabittMQ cluster to appear running master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for MQ OCF script, so you may want to try to backport them as well by the following guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Is there a reason you’re using a custom OCF script instead of the upstream[a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it. [a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster I have a little investigation and find out there are some possible causes. 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in Pacemaker The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker would wait 475s before retry starting it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is master-slave resource, I assume that starting all the slaves and promoting master are put into two different batches. If unfortunately starting all RabbitMQ slaves are put in the same batch as MySQL starting, even if RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for MySQL timeout. Could you please elaborate the what is the same/different batches for MQ and DB? Note, there is a MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7 I can re-produce this by hard powering off all the controllers and start them again. It's more likely to trigger MySQL failure in this way. Then I observe that if there is one cloned mysql instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a mysql resource agent process calling the sleep command. If I kill that process, the pacemaker comes back alive and RabbitMQ master gets promoted. In fact this long timeout is blocking every resource from state transition in pacemaker. This maybe a known problem of pacemaker and there are some discussions in Linux-HA mailing list [2]. It might not be fixed in the near future. It seems in generally it's bad to have long timeout in state transition actions (start/stop/promote/demote). There maybe another way to implement MySQL-wss resource agent to use a short start timeout and monitor the wss cluster state using monitor action. This is very interesting, thank you! I believe all commands for MySQL RA OCF script should be as well wrapped with timeout -SIGTERM or -SIGKILL as we did for MQ RA OCF. And there should no be any sleep calls. I created a bug for this [6]. [6] https://bugs.launchpad.net/fuel/+bug/1449542 I also find a fix to improve MySQL start timeout [3]. It shortens the timeout to 300s. 
At the time I am sending this email, I cannot find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0? [1] https://bugs.launchpad.net/fuel/+bug/1441885 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html [3] https://review.openstack.org/#/c/171333/

2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the following to start the RabbitMQ master-slave cluster. On all the controllers:
(1) Start the Erlang beam process
(2) Start the RabbitMQ App (if this fails, reset the mnesia DB and cluster state)
(3) Stop the RabbitMQ App but do not stop the beam process

Then in pacemaker all the RabbitMQ instances are in slave state. After pacemaker determines the master, it does the following. On the to-be-master host:
(4) Start the RabbitMQ App (if this fails, reset the mnesia DB and cluster state)
On the slave hosts:
(5) Start the RabbitMQ App (if this fails, reset the mnesia DB and cluster state)
(6) Join the RabbitMQ cluster of the master host

Yes, something like that. As I mentioned, there were several bug fixes in the 6.1 dev, and you can also check the MQ clustering flow charts. As far as I can understand, this process is to make sure the master determined by pacemaker is the same as the master determined in the RabbitMQ cluster. If there is no existing
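The numbered start sequence above maps onto roughly the following rabbitmqctl calls, shown bare for clarity; the OCF agent wraps each step in checks and timeouts, and node-1 stands in for whichever node pacemaker elects as master:

    rabbitmq-server -detached               # (1) start the Erlang beam process
    rabbitmqctl start_app                   # (2) start the RabbitMQ application
    rabbitmqctl stop_app                    # (3) stop the app, keep beam running
    # later, on a slave being joined to the elected master:
    rabbitmqctl join_cluster rabbit@node-1  # join the master's cluster
    rabbitmqctl start_app                   # bring the app back up as a cluster member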
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip] Batch is a pacemaker concept I found when I was reading its documentation and code. There is a batch-limit: 30 in the output of pcs property list --all. The pacemaker official documentation explanation is that it's The number of jobs that the TE is allowed to execute in parallel. From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many as possible transition jobs into a batch, and process them in parallel. Technically it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html There is a more complex one which includes promotion and demotion on the next page. The number of actions that can run at any one time is therefor limited by - the value of batch-limit (the total number of in-flight actions) - the number of resources that do not have ordering constraints between them (eg. rsc{1,2,3} in the above example) So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit == 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for currently in-flight actions to complete, re-calculate a new graph based on the new information and start again. The problem is that pacemaker can only promote a resource after it detects the resource is started. First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in. We can’t assume its off because a) we might have crashed, b) the admin might have accidentally configured it to start at boot or c) the admin may have asked us to re-check everything. During a full reassemble, in the first transition batch, pacemaker starts all the resources including MySQL and RabbitMQ. Pacemaker issues resource agent start invocation in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch, then transition engine and policy engine decide if it has to retry starting or promote, and put this new transition job into a new batch. Also important to know, the order of actions is: 1. any necessary demotions 2. any necessary stops 3. any necessary starts 4. any necessary promotions __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
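For anyone wanting to poke at the knob being discussed, the property can be inspected and changed as below. The value 100 is the one Zhou tried; as the thread notes, raising it does not remove waits caused by ordering, only the cap on parallel in-flight actions:

    pcs property list --all | grep batch-limit
    pcs property set batch-limit=100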
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hello Sergii, Thank you for the great explanation of the Galera OCF script. I replied to your question inline.

on 2015/05/03 04:49, Sergii Golovatiuk wrote: Hi Zhou, The Galera OCF script is a bit special. Since MySQL keeps the most important data, we should find the most recent data on all nodes across the cluster. check_if_galera_pc is specially designed for that. Every server registers the latest status from grastate.dat to the CIB. Once all nodes are registered, the one with the most recent data will be selected as the Primary Component. All others should join that node. 5 minutes is the time for all nodes to appear and register their position from grastate.dat in the CIB. Usually it happens much faster, though there are cases when a node is stuck on fsck or grub, or on a power outlet, or some other issue. If all nodes are registered, there shouldn't be a 5-minute penalty timeout. If one node is stuck (but at least present in the CIB), then all other nodes will wait for 5 minutes and then assemble the cluster without it. Concerning dependencies, I agree that RabbitMQ may start in parallel with the Galera cluster assemble procedure. It makes no sense to start other services, as they are dependent on Galera and RabbitMQ. Also, I have a quick question for you. Shutting down all three controllers is a unique case, like a whole power outage in a whole datacenter (DC). In this case, a 5-minute delay is very small compared to the DC recovery procedure. Reboot of one controller is a more optimistic scenario. What's the special case for restarting all 3-5 at once?

Sorry, I am not very clear about what 3-5 refers to. Is the question about why we want to make the full reassemble time short, and why this case is important for us? We have some small customers forming a long tail in the local market. They have neither dedicated datacenter housing nor dual power supplies. Some of them would even shut down all the machines when they go home, and start all of the machines when they start to work. Considering data privacy, they are not willing to put their virtual machines on the public cloud. Usually, these customers don't have the IT skills to troubleshoot a full reassemble process. We want to make this process as simple as turning on all the machines roughly at the same time and waiting several minutes, so they don't call our service team.

Also, I would like to say a big thank-you for digging it out. It's very useful to use your findings in our next steps. -- Best regards, Sergii Golovatiuk, Skype #golserge IRC #holser

On Wed, Apr 29, 2015 at 9:38 AM, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Hi! Thank you very much Vladimir and Bogdan! Thanks for the fast response and rich information. I backported the MySQL and RabbitMQ OCF patches from stable/6.0 and tested again. A full reassemble takes about 5 minutes, which is a big improvement. Adding the force_load trick I mentioned in the previous email, it takes about 4 minutes. I get that there is not really a RabbitMQ master instance, because queue masters spread across all the RabbitMQ instances. The pacemaker master is an abstract one. However, there is still an mnesia node from which other mnesia nodes sync the table schema. The exception timeout_waiting_for_tables in the log is actually reported by mnesia. By default, it places a mark on the last alive mnesia node, and other nodes have to sync tables from it (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477). RabbitMQ clustering inherits this behavior, and the last RabbitMQ instance shut down must be the first instance to start. Otherwise it produces timeout_waiting_for_tables (http://www.rabbitmq.com/clustering.html#transcript, search for "the last node to go down"). The 1 minute difference is because, without force_load, the abstract master determined by pacemaker during a promote action may not be the RabbitMQ instance that was shut down last. So there is a chance for rabbitmqctl start_app to wait 30s and trigger the RabbitMQ exception timeout_waiting_for_tables. We may see the table timeout and mnesia resetting once during a reassemble process on some of the RabbitMQ instances, but it only introduces a 30s wait, which is acceptable for me.

I also inspected the RabbitMQ resource agent code in the latest master branch. There are timeout wrappers and other improvements, which are great. It does not change the master promotion process much, so it may still run into the problems I described. Please see the inline reply below.

on 2015/04/28 21:15, Bogdan Dobrelya wrote: Hello, Hello, Zhou I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment, then I reset power of all the machines at the
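As a side note on the force_load trick discussed above, a minimal, hedged sketch of what it amounts to on one node follows. The mnesia directory path is an assumption for a default package install, and rabbitmqctl force_boot is only available in newer RabbitMQ releases; this is illustrative, not the exact Fuel procedure.

    # Option 1 (newer RabbitMQ releases): mark this node to boot without
    # waiting for the other cluster members.  Run while the app is stopped.
    rabbitmqctl force_boot

    # Option 2: create the force_load marker file by hand before starting
    # the app, so mnesia loads its tables without waiting for the node
    # that went down last.  Path is an assumption for a default install.
    MNESIA_DIR="/var/lib/rabbitmq/mnesia/rabbit@$(hostname -s)"
    touch "${MNESIA_DIR}/force_load"
    rabbitmqctl start_app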
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hi! Thank you very much Vladimir and Bogdan! Thanks for the fast response and rich information. I backported the MySQL and RabbitMQ OCF patches from stable/6.0 and tested again. A full reassemble takes about 5 minutes, which is a big improvement. Adding the force_load trick I mentioned in the previous email, it takes about 4 minutes. I get that there is not really a RabbitMQ master instance, because queue masters spread across all the RabbitMQ instances. The pacemaker master is an abstract one. However, there is still an mnesia node from which other mnesia nodes sync the table schema. The exception timeout_waiting_for_tables in the log is actually reported by mnesia. By default, it places a mark on the last alive mnesia node, and other nodes have to sync tables from it (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477). RabbitMQ clustering inherits this behavior, and the last RabbitMQ instance shut down must be the first instance to start. Otherwise it produces timeout_waiting_for_tables (http://www.rabbitmq.com/clustering.html#transcript, search for "the last node to go down"). The 1 minute difference is because, without force_load, the abstract master determined by pacemaker during a promote action may not be the RabbitMQ instance that was shut down last. So there is a chance for rabbitmqctl start_app to wait 30s and trigger the RabbitMQ exception timeout_waiting_for_tables. We may see the table timeout and mnesia resetting once during a reassemble process on some of the RabbitMQ instances, but it only introduces a 30s wait, which is acceptable for me.

I also inspected the RabbitMQ resource agent code in the latest master branch. There are timeout wrappers and other improvements, which are great. It does not change the master promotion process much, so it may still run into the problems I described. Please see the inline reply below.

on 2015/04/28 21:15, Bogdan Dobrelya wrote: Hello, Hello, Zhou I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes. Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for the MQ OCF script, so you may want to try to backport them as well by following the guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after a power failure, and pacemaker would wait 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all RabbitMQ slaves is put in the same batch as starting MySQL, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for the MySQL timeout.
Could you please elaborate on what the same/different batches are for MQ and DB? Note, there are MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7

Batch is a pacemaker concept I found when I was reading its documentation and code. There is a "batch-limit: 30" in the output of "pcs property list --all". The official pacemaker documentation explains it as "the number of jobs that the TE (transition engine) is allowed to execute in parallel". From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many transition jobs as possible into a batch and processes them in parallel. The problem is that pacemaker can only promote a resource after it detects the resource is started. During a full reassemble, in the first transition batch, pacemaker starts all the resources, including MySQL and RabbitMQ. Pacemaker issues the resource agent start invocations in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch; then the transition engine and policy engine decide whether it has to retry starting or promote, and put this new transition job into a new batch. I see improvements that put individual commands inside a timeout wrapper in the RabbitMQ resource agent, and a bug created yesterday
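To make the batch-limit property discussed above concrete, here is how it can be inspected and, if desired, tuned with pcs. The value shown is only an example for illustration, not a recommendation.

    # Show the current transition-engine batch limit (defaults are listed
    # by --all even when the property was never set explicitly).
    pcs property list --all | grep batch-limit

    # Example only: raise the limit so more transition jobs may run in
    # parallel within a single batch.
    pcs property set batch-limit=60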
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
On 28.04.2015 15:15, Bogdan Dobrelya wrote: Hello, Zhou Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for the MQ OCF script, so you may want to try to backport them as well by following the guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/ Could you please elaborate on what the same/different batches are for MQ and DB? Note, there are MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7

This is very interesting, thank you! I believe all commands for the MySQL RA OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL, as we did for the MQ RA OCF. And there should not be any sleep calls. I created a bug for this [6]. [6] https://bugs.launchpad.net/fuel/+bug/1449542

Yes, something like that. As I mentioned, there were several bug fixes in the 6.1 dev, and you can also check the MQ clustering flow charts.

Not exactly. There is no master in a mirrored MQ cluster. We define the rabbit_hosts configuration option from Oslo.messaging, which ensures all queue masters will be spread across all of the MQ nodes in the long run. And we use a master abstraction only for the Pacemaker RA clustering layer. Here, a master is the MQ node that the rest of the MQ nodes join. We do erase the node master attribute in the CIB for such cases. This should not bring problems into the master election logic. (Note, the RabbitMQ documentation mentions *queue* masters and slaves, which are not the case for the Pacemaker RA clustering abstraction layer.) We made an assumption that the node with the highest MQ uptime should know the most about the recent cluster state, so other nodes must join it. The RA OCF does not work with queue masters directly. The full MQ cluster reassemble logic is far from perfect, indeed.

This might erase all mnesia files, hence any custom entities, like users or vhosts, would be removed as well. Note, we do not configure durable queues for OpenStack, so there is nothing to care about here - the full cluster downtime assumes there will be no AMQP messages stored at all.

Yes, this option is only supported for the newest RabbitMQ versions. But we definitely should look at how this could help. Indeed, there are cases when MQ's autoheal can do nothing with existing partitions and the cluster remains partitioned forever, for example:

Masters: [ node-1 ] Slaves: [ node-2 node-3 ]

root@node-1:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2']}]},
 {running_nodes,['rabbit@node-1']},
 {cluster_name,rabbit@node-2},
 {partitions,[]}]
...done.

root@node-2:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2']}]}]
...done.

root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-3']},
 {cluster_name,rabbit@node-2},
 {partitions,[]}]

Sorry, here is the correct one [0]! [0] http://pastebin.com/m3fDdMA6

So we should test the pause-minority value as well.
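For reference, testing the pause-minority behaviour mentioned above boils down to one partition-handling option. Fuel templates its own RabbitMQ configuration, so the snippet below is only illustrative of the setting and of how to verify it on a running broker; the file path assumes a default install.

    # /etc/rabbitmq/rabbitmq.config (classic Erlang-term format) would carry:
    #
    #   [
    #     {rabbit, [
    #       {cluster_partition_handling, pause_minority}
    #     ]}
    #   ].
    #
    # With pause_minority, nodes on the minority side of a partition pause
    # themselves instead of relying on autoheal.  After restarting the
    # broker, verify the active setting:
    rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'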
But I strongly believe we should make the MQ resource a multi-state clone to support many masters, related bp [7] [7] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-pacemaker-multimaster-clone

Well, we should not mix up queue masters and the multi-state clone master for the MQ resource in pacemaker. As I said, the pacemaker RA has nothing to do with queue masters. And we introduced this master mostly in order to support the full cluster reassemble case - there must be a node promoted, and other nodes should join it.

This is a very good point, thank you. Thank you for the thorough feedback! This was a really great job. -- Best regards, Bogdan Dobrelya, Skype #bogdando_at_yahoo.com Irc #bogdando

__ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
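On the point raised above about wrapping every RA command with timeout -SIGTERM or -SIGKILL: the idea, sketched very roughly in bash below, is that no single external call can ever hang a pacemaker action. The helper name and the timeout values are made up for illustration; the actual MQ RA implementation may differ.

    # Run a command with a hard upper bound: send SIGTERM after $1 seconds,
    # then SIGKILL 5 seconds later if the command is still alive.
    run_bounded() {
        local secs="$1"; shift
        timeout -k 5 -s TERM "${secs}" "$@"
    }

    # Examples: never let a status query hang the resource agent.
    run_bounded 10 rabbitmqctl cluster_status
    run_bounded 30 mysql -e "SHOW STATUS LIKE 'wsrep_ready'"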
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hello, Hello, Zhou I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes.

Yes, this is a known issue [0]. Note, there were many bugfixes, like [1],[2],[3], merged for the MQ OCF script, so you may want to try to backport them as well by following the guide [4] [0] https://bugs.launchpad.net/fuel/+bug/1432603 [1] https://review.openstack.org/#/c/175460/ [2] https://review.openstack.org/#/c/175457/ [3] https://review.openstack.org/#/c/175371/ [4] https://review.openstack.org/#/c/170476/

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after a power failure, and pacemaker would wait 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all RabbitMQ slaves is put in the same batch as starting MySQL, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for the MySQL timeout.

Could you please elaborate on what the same/different batches are for MQ and DB? Note, there are MQ clustering logic flow charts available here [5] and we're planning to release a dedicated technical bulletin for this. [5] http://goo.gl/PPNrw7

I can reproduce this by hard powering off all the controllers and starting them again. It's more likely to trigger a MySQL failure this way. Then I observe that if there is one cloned mysql instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In fact this long timeout is blocking every resource from state transition in pacemaker. This may be a known problem of pacemaker, and there are some discussions in the Linux-HA mailing list [2]. It might not be fixed in the near future. It seems that, in general, it's bad to have a long timeout in state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent: use a short start timeout and track the wss cluster state in the monitor action.

This is very interesting, thank you! I believe all commands for the MySQL RA OCF script should also be wrapped with timeout -SIGTERM or -SIGKILL, as we did for the MQ RA OCF. And there should not be any sleep calls. I created a bug for this [6]. [6] https://bugs.launchpad.net/fuel/+bug/1449542

I also found a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time I am sending this email, I can not find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0? [1] https://bugs.launchpad.net/fuel/+bug/1441885 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html [3] https://review.openstack.org/#/c/171333/

2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the following to start the RabbitMQ master-slave cluster. On all the controllers:
(1) Start the Erlang beam process
(2) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(3) Stop the RabbitMQ app but do not stop the beam process
Then in pacemaker, all the RabbitMQ instances are in slave state. After pacemaker determines the master, it does the following.
On the to-be-master host:
(4) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
On the slave hosts:
(5) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(6) Join the RabbitMQ cluster of the master host

Yes, something like that. As I mentioned, there were several bug fixes in the 6.1 dev, and you can also check the MQ clustering flow charts.

As far as I can understand, this process is to make sure the master determined by pacemaker is the same as the master determined in the RabbitMQ cluster. If there is no existing cluster, it's fine. If it is run after

Not exactly. There is no master in a mirrored MQ cluster. We define the rabbit_hosts configuration option from Oslo.messaging, which ensures all queue masters will be spread across all of the MQ nodes in the long run. And we use a master abstraction only for the Pacemaker RA clustering layer. Here, a master is the MQ node that the rest of the MQ nodes join.

power failure
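A condensed, illustrative bash rendering of the start/promote flow described in steps (1)-(6) above. This is not the actual OCF agent, only the rough sequence of rabbitmqctl calls it corresponds to; the reset-on-failure branch is simplified, and MASTER_HOSTNAME is a placeholder for whichever node pacemaker promoted.

    # --- start action, executed on every controller (steps 1-3) ---
    rabbitmq-server -detached            # (1) start the Erlang beam process
    if ! rabbitmqctl start_app; then     # (2) try to start the RabbitMQ app
        rabbitmqctl force_reset          #     on failure: wipe mnesia / cluster state
        rabbitmqctl start_app
    fi
    rabbitmqctl stop_app                 # (3) stop the app, keep beam running

    # --- promote action, on the node pacemaker picked as master (step 4) ---
    rabbitmqctl start_app

    # --- on the remaining (slave) nodes (steps 5-6) ---
    rabbitmqctl start_app                                  # (5)
    rabbitmqctl stop_app                                   # join_cluster needs the app stopped
    rabbitmqctl join_cluster "rabbit@${MASTER_HOSTNAME}"   # (6) join the pacemaker master's cluster
    rabbitmqctl start_app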
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hi, Zhou Thank you for writing these awesome recommendations. We will look into them and see whether they provide significant impact. BTW, we have found a bunch of issues with our 5.1 and 6.0 RabbitMQ OCF script and fixed them in the current master. Would you be so kind as to check out the newest version and say if any of the issues mentioned by you are gone?

On Tue, Apr 28, 2015 at 9:03 AM, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Hello, I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes.

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after a power failure, and pacemaker would wait 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all RabbitMQ slaves is put in the same batch as starting MySQL, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for the MySQL timeout.

I can reproduce this by hard powering off all the controllers and starting them again. It's more likely to trigger a MySQL failure this way. Then I observe that if there is one cloned mysql instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In fact this long timeout is blocking every resource from state transition in pacemaker. This may be a known problem of pacemaker, and there are some discussions in the Linux-HA mailing list [2]. It might not be fixed in the near future. It seems that, in general, it's bad to have a long timeout in state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent: use a short start timeout and track the wss cluster state in the monitor action.

I also found a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time I am sending this email, I can not find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0? [1] https://bugs.launchpad.net/fuel/+bug/1441885 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html [3] https://review.openstack.org/#/c/171333/

2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the following to start the RabbitMQ master-slave cluster. On all the controllers:
(1) Start the Erlang beam process
(2) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(3) Stop the RabbitMQ app but do not stop the beam process
Then in pacemaker, all the RabbitMQ instances are in slave state. After pacemaker determines the master, it does the following.
On the to-be-master host:
(4) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
On the slave hosts:
(5) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(6) Join the RabbitMQ cluster of the master host

As far as I can understand, this process is to make sure the master determined by pacemaker is the same as the master determined in the RabbitMQ cluster. If there is no existing cluster, it's fine. If it is run after power failure and recovery, it introduces a new problem. After power recovery, if some of the RabbitMQ instances reach step (2) roughly at the same time (within 30s, which is hard-coded in RabbitMQ) as the original RabbitMQ master instance, they form the original cluster again and then shut down. The other instances would have to wait for 30s before they report a failure waiting for tables, and be reset to a standalone cluster. In the RabbitMQ documentation [4], it is also mentioned that if we shut down the RabbitMQ master, a new master is elected from the rest of the slaves. If we continue to shut down nodes in step (3), we reach a point where the last node is the RabbitMQ master, and pacemaker is not aware of it. I can see there is code bookkeeping a rabbit-start-time attribute in pacemaker to record the longest-lived instance to help pacemaker determine the master, but it does not cover the case mentioned above. A recent patch [5] checks the existing rabbit-master attribute, but it does not cover the above case either. So
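The rabbit-start-time bookkeeping mentioned above relies on pacemaker node attributes. A hedged sketch of what recording and reading such an attribute looks like follows; the attribute name is taken from the mail, but the exact lifetime and semantics used by the Fuel RA may differ.

    # Record when this node's RabbitMQ app came up (epoch seconds),
    # stored as a transient (reboot-lifetime) node attribute.
    crm_attribute --node "$(hostname -s)" --lifetime reboot \
        --name rabbit-start-time --update "$(date +%s)"

    # Read it back, e.g. when comparing candidates for promotion.
    crm_attribute --node "$(hostname -s)" --lifetime reboot \
        --name rabbit-start-time --query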
[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hello, I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after a power failure. I have a running HA environment, then I reset power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all the 3 controllers and only start 2 of them, the downtime sometimes can be as long as 20 minutes.

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after a power failure, and pacemaker would wait 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all RabbitMQ slaves is put in the same batch as starting MySQL, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just wait for the MySQL timeout.

I can reproduce this by hard powering off all the controllers and starting them again. It's more likely to trigger a MySQL failure this way. Then I observe that if there is one cloned mysql instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a mysql resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In fact this long timeout is blocking every resource from state transition in pacemaker. This may be a known problem of pacemaker, and there are some discussions in the Linux-HA mailing list [2]. It might not be fixed in the near future. It seems that, in general, it's bad to have a long timeout in state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent: use a short start timeout and track the wss cluster state in the monitor action.

I also found a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time I am sending this email, I can not find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0? [1] https://bugs.launchpad.net/fuel/+bug/1441885 [2] http://lists.linux-ha.org/pipermail/linux-ha/2014-March/047989.html [3] https://review.openstack.org/#/c/171333/

2. RabbitMQ Resource Agent Breaks Existing Cluster

Reading the code of the RabbitMQ resource agent, I find it does the following to start the RabbitMQ master-slave cluster. On all the controllers:
(1) Start the Erlang beam process
(2) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(3) Stop the RabbitMQ app but do not stop the beam process
Then in pacemaker, all the RabbitMQ instances are in slave state. After pacemaker determines the master, it does the following.
On the to-be-master host:
(4) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
On the slave hosts:
(5) Start the RabbitMQ app (if failed, reset the mnesia DB and cluster state)
(6) Join the RabbitMQ cluster of the master host

As far as I can understand, this process is to make sure the master determined by pacemaker is the same as the master determined in the RabbitMQ cluster. If there is no existing cluster, it's fine. If it is run after power failure and recovery, it introduces a new problem. After power recovery, if some of the RabbitMQ instances reach step (2) roughly at the same time (within 30s, which is hard-coded in RabbitMQ) as the original RabbitMQ master instance, they form the original cluster again and then shut down. The other instances would have to wait for 30s before they report a failure waiting for tables, and be reset to a standalone cluster. In the RabbitMQ documentation [4], it is also mentioned that if we shut down the RabbitMQ master, a new master is elected from the rest of the slaves. If we continue to shut down nodes in step (3), we reach a point where the last node is the RabbitMQ master, and pacemaker is not aware of it. I can see there is code bookkeeping a rabbit-start-time attribute in pacemaker to record the longest-lived instance to help pacemaker determine the master, but it does not cover the case mentioned above. A recent patch [5] checks the existing rabbit-master attribute, but it does not cover the above case either. So in step (4), pacemaker determines a different master, which was a RabbitMQ slave last time. It would wait for its original RabbitMQ master for 30s and fail, then it gets reset to a standalone cluster. Here we get some different clusters, so in steps (5) and (6), it is likely to report errors in the log saying timeout waiting for tables, or to fail to merge the mnesia database schema, and then those instances get reset. You can easily reproduce the case by hard resetting power of all the controllers. As you can see, if you
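Coming back to the idea raised earlier of giving MySQL-wss a short start timeout and letting the monitor action track Galera state: a rough, hypothetical sketch of such a monitor check is below. The OCF return codes and the wsrep_ready status variable are real, but the helper itself is illustrative; credentials and the full OCF action dispatch are omitted.

    #!/bin/bash
    # Hypothetical monitor-style check for a Galera node.
    OCF_SUCCESS=0
    OCF_ERR_GENERIC=1
    OCF_NOT_RUNNING=7

    galera_monitor() {
        local ready
        # -N skips the column header, -s gives silent tab-separated output.
        ready=$(mysql -N -s -e "SHOW STATUS LIKE 'wsrep_ready'" 2>/dev/null | awk '{print $2}')
        if [ -z "$ready" ]; then
            return $OCF_NOT_RUNNING      # mysqld is not answering at all
        elif [ "$ready" = "ON" ]; then
            return $OCF_SUCCESS          # node is synced with the cluster
        else
            return $OCF_ERR_GENERIC      # running but not (yet) part of the cluster
        fi
    }

    galera_monitor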