Re: [openstack-dev] [Fuel] Ceph Public Network Setting

2015-06-24 Thread Zhou Zheng Sheng /
Thank you guys!

I'm reading the advanced networking spec and think it's good. I'll try to
comment on it, though it's a bit too epic for me for now. I wrote a patch to
move the Ceph public network to the storage network. It's based on Fuel 6.0,
so I'm trying to rebase it onto the latest master and then maybe ask you for
a review.

on 2015/06/23 19:21, Igor Kalnitsky wrote:
 Hello,

 That makes sense to me. Still, I want to point that we're going to
 implement advanced networking and with this feature you'll be able to
 assign every single network role to any network.

 That means, you'll be able to assign ceph network role to storage,
 management or  whatever-you-want network. Sounds cool, ha? :)

 Feel free to read a design spec [1].

 Thanks,
 Igor

 [1]: https://review.openstack.org/#/c/115340/

 On Tue, Jun 23, 2015 at 1:13 PM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com wrote:
 Hi!

 I notice that in OpenStack deployed by Fuel, the Ceph public network is on
 the management network. In some environments, not all NICs of a physical
 server are 10Gb; sometimes one or two of the NICs on a machine may be
 1Gb. Usually on this type of machine we assign the management network to a
 1Gb NIC and the storage network to a 10Gb NIC. If the Ceph public network
 shares the management network, QEMU accesses Ceph over the management
 network, and the performance is not optimal.

 In a small deployment, the cloud controller and Ceph OSDs may be assigned to
 the same machine, so it would be more effective to keep Ceph client
 traffic separated from MySQL, RabbitMQ and Pacemaker traffic. Maybe it's
 better to place the Ceph public network on the storage network. Agree?

 --
 Best wishes!
 Zhou Zheng Sheng, Software Developer
 Beijing AWcloud Software Co., Ltd.





-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Developer
Beijing AWcloud Software Co., Ltd.




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Ceph Public Network Setting

2015-06-24 Thread Zhou Zheng Sheng /
Thank you Sergii,

I read the release notes but didn't find a specific 'Advanced
Networking' part. It seems advanced networking is the spec [1] Igor
mentioned in a previous mail. Maybe I overlooked something? Anyway, I think
6.1 is a great release, as it solves some major problems I ran into with
6.0.

[1] https://blueprints.launchpad.net/fuel/+spec/granular-network-functions

on 2015/06/24 17:04, Sergii Golovatiuk wrote:
 Hi Zhou,

 Try Fuel 6.1 where we have a lot of very nice features including
 'advanced networking'. Feel free to read release notes [1]. It's one
 of the most significant releases with many many features and
 improvements. I am very proud of it.

 [1] https://docs.mirantis.com/fuel/fuel-master/release-notes.html

 --
 Best regards,
 Sergii Golovatiuk,
 Skype #golserge
 IRC #holser

 On Wed, Jun 24, 2015 at 11:42 AM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com wrote:

 Thank you guys!

 I'm reading the advanced networking spec and think it's good. I'll try
 to comment it. It's too epic for me for now. I wrote a patch to move
 Ceph public network to storage network. It's based on Fuel 6.0, so I'm
 trying to rebase it to the latest master then maybe ask you for a review.

 on 2015/06/23 19:21, Igor Kalnitsky wrote:
  Hello,
 
  That makes sense to me. Still, I want to point that we're going to
  implement advanced networking and with this feature you'll be
 able to
  assign every single network role to any network.
 
  That means, you'll be able to assign ceph network role to storage,
  management or  whatever-you-want network. Sounds cool, ha? :)
 
  Feel free to read a design spec [1].
 
  Thanks,
  Igor
 
  [1]: https://review.openstack.org/#/c/115340/
 
  On Tue, Jun 23, 2015 at 1:13 PM, Zhou Zheng Sheng / 周征晟
  zhengsh...@awcloud.com wrote:
  Hi!
 
  I notice that in OpenStack deployed by Fuel, Ceph public
 network is on
  management network. In some environments, not all NICs of a
 physical
  server are 10Gb. Sometimes 1 or 2 among the NICs on a machine
 may be
  1Gb. Usually on this type of machine we assign management
 network to 1Gb
  NIC, and storage network to 10Gb NIC. If Ceph public network is
 with
  management network, the QEMU accesses Ceph using management
 network, and
  the performance is not optimal.
 
  In a small deployment, cloud controller and Ceph OSD may be
 assigned to
  the same machine, so it would be more effective to keep Ceph client
  traffic separated from MySQL, RabbitMQ, Pacemaker traffic.
 Maybe it's
  better to place Ceph public network on the storage network. Agree?
 
  --
  Best wishes!
  Zhou Zheng Sheng, Software Developer
  Beijing AWcloud Software Co., Ltd.
 
 
 
 
 

 --
 Best wishes!
 Zhou Zheng Sheng / 周征晟  Software Developer
 Beijing AWcloud Software Co., Ltd.









-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Developer
Beijing AWcloud Software Co., Ltd.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Ceph Public Network Setting

2015-06-24 Thread Zhou Zheng Sheng /
Hello Andrew,

I did some modifications in Fuel 6.0 similar to [1][2]. However, [3] says
the Ceph monitors have to run on the public network.

In the file deployment/puppet/ceph/manifests/conf.pp, we can see it
calls 'ceph-deploy new ${::hostname}:${::internal_address}'.
In deployment/puppet/ceph/manifests/mon.pp, it calls 'ceph-deploy mon
create ${::hostname}:${::internal_address}'.

After we set the Ceph public network to the OpenStack storage network, the
Ceph monitors still run on the management network. I can also see from the
QEMU command line that it connects to Ceph via a management network IP
address. Though this setup does not follow the rule suggested in [3], it
probably works, because the Ceph monitors only provide PG location
information, and QEMU then talks to the OSDs directly using their storage
network addresses.

That's why I didn't take your patches as-is. I meant to set up a Fuel 6.1
environment and test the modifications before giving an accurate comment.

[3]
http://ceph.com/docs/master/rados/configuration/network-config-ref/#monitor-ip-tables
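
A rough way to double-check which addresses the monitors and QEMU actually
use is something like the following sketch (stock Ceph/Linux commands, not
taken from the Fuel manifests; process names and log paths may differ in a
Fuel deployment):

# Show the monitor addresses the cluster knows about (currently management IPs).
ceph mon dump

# The rbd drive string passed to QEMU embeds the monitor list; check which
# addresses a running VM was started with.
ps -ef | grep qemu | grep -o 'mon_host=[^ ]*'

# Confirm where the data connections actually go (mon on 6789, OSDs on 6800+).
ss -tnp | grep qemu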

on 2015/06/25 03:20, Andrew Woodward wrote:
 Zhou,

 As mentioned, Please review [1][2]. This is the interface we will
 support as we implement the advanced networking parts. In your case,
 just flip the role map in your patch to nailgun and the library
 interface will remain the same.

 [1] https://review.openstack.org/#/c/194434/
 [2] https://review.openstack.org/#/c/194438/

 On Wed, Jun 24, 2015 at 3:21 AM Stanislav Makar sma...@mirantis.com wrote:

 Hello
 My five cents :)
 I am also very proud of our 6.1 release
 Unfortunately we still do not have this separation for ceph public
 and cluster networks in 6.1.

 One little hint
 In 6.1 we have granular deployment
 
 (https://docs.mirantis.com/fuel/fuel-master/release-notes.html#granular-deployment-based-on-pluggable-tasks)
 and all tasks which are connected with ceph deployment are in

 
 https://github.com/stackforge/fuel-library/tree/stable/6.1/deployment/puppet/osnailyfacter/modular/ceph

 To have your patch working in 6.1 you should hack all these files now

 If you have questions, feel free to ask.

 Thanks.


 -- 
 All the best,
 Stanislav Makar
 skype: makar_stanislav 
 irc: stamak

 On Wed, Jun 24, 2015 at 12:04 PM, Sergii Golovatiuk
 sgolovat...@mirantis.com wrote:

 Hi Zhou,

 Try Fuel 6.1 where we have a lot of very nice features
 including 'advanced networking'. Feel free to read release
 notes [1]. It's one of the most significant releases with many
 many features and improvements. I am very proud of it.

 [1] https://docs.mirantis.com/fuel/fuel-master/release-notes.html

 --
 Best regards,
 Sergii Golovatiuk,
 Skype #golserge
 IRC #holser

 On Wed, Jun 24, 2015 at 11:42 AM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com wrote:

 Thank you guys!

 I'm reading the advanced networking spec and think it's
 good. I'll try
 to comment it. It's too epic for me for now. I wrote a
 patch to move
 Ceph public network to storage network. It's based on Fuel
 6.0, so I'm
 trying to rebase it to the latest master then maybe ask you
 for a review.

 on 2015/06/23 19:21, Igor Kalnitsky wrote:
  Hello,
 
  That makes sense to me. Still, I want to point that
 we're going to
  implement advanced networking and with this feature
 you'll be able to
  assign every single network role to any network.
 
  That means, you'll be able to assign ceph network role
 to storage,
  management or  whatever-you-want network. Sounds cool,
 ha? :)
 
  Feel free to read a design spec [1].
 
  Thanks,
  Igor
 
  [1]: https://review.openstack.org/#/c/115340/
 
  On Tue, Jun 23, 2015 at 1:13 PM, Zhou Zheng Sheng / 周征晟
  zhengsh...@awcloud.com wrote:
  Hi!
 
  I notice that in OpenStack deployed by Fuel, Ceph
 public network is on
  management network. In some environments, not all NICs
 of a physical
  server are 10Gb. Sometimes 1 or 2 among the NICs on a
 machine may be
  1Gb. Usually on this type of machine we assign
 management network to 1Gb
  NIC, and storage network to 10Gb NIC. If Ceph public
 network is with
  management network, the QEMU accesses Ceph using
 management network

Re: [openstack-dev] [Fuel] Ceph Public Network Setting

2015-06-24 Thread Zhou Zheng Sheng /
Hi,

I also notice that the latest ceph-deploy source code allows us to define
the public_network and cluster_network settings when calling
'ceph-deploy new'. It also checks whether the monitor IP address is in
either of the two networks.

https://github.com/ceph/ceph-deploy/blob/master/ceph_deploy/new.py#L158

I'm not sure whether running the Ceph monitors in the public network is
just an artificial requirement, or an inherent precondition of the Ceph
implementation, where breaking it might lead to some strange problems. If
it's the latter, then when we change the Ceph public network, we have to
change the Ceph monitor IP address configuration at the same time.
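
If we go this way, the deployment tooling could pass both networks
explicitly. A sketch of what the call might look like (the subnets and node
names below are made up for illustration, not taken from any Fuel manifest):

# Hypothetical subnets: 192.168.1.0/24 = storage network used as the Ceph
# public network, 192.168.2.0/24 = a dedicated Ceph cluster/replication network.
ceph-deploy new \
  --public-network 192.168.1.0/24 \
  --cluster-network 192.168.2.0/24 \
  node-1:192.168.1.11 node-2:192.168.1.12 node-3:192.168.1.13
# ceph-deploy writes both settings into the generated ceph.conf and checks
# that the monitor addresses fall inside one of the two networks.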

on 2015/06/25 11:01, Zhou Zheng Sheng / 周征晟 wrote:
 Hello Andrew,

 I did some similar modifications in Fuel 6.0 as [1][2]. However in [3]
 it says Ceph monitor have to run on the public network.

 In the file deployment/puppet/ceph/manifests/conf.pp, we can see it
 calls 'ceph-deploy new ${::hostname}:${::internal_address}'.
 In deployment/puppet/ceph/manifests/mon.pp, it calls 'ceph-deploy
 mon create ${::hostname}:${::internal_address}'.

 After we set Ceph public network to OpenStack storage network, Ceph
 monitor still runs on management network. I can see in QEMU command
 line, it connects to Ceph via management network IP address. Though
 this setup does not follow the rule suggested in [3], it probably
 works because Ceph monitor only provides PG location information, and
 then QEMU should talk to OSDs directly using the storage network
 addresses.

 That's why I didn't take your patches as-is. I meant to setup a Fuel
 6.1 environment and test the modifications before I can give an
 accurate comment.

 [3]
 http://ceph.com/docs/master/rados/configuration/network-config-ref/#monitor-ip-tables

 on 2015/06/25 03:20, Andrew Woodward wrote:
 Zhou,

 As mentioned, Please review [1][2]. This is the interface we will
 support as we implement the advanced networking parts. In your case,
 just flip the role map in your patch to nailgun and the library
 interface will remain the same.

 [1] https://review.openstack.org/#/c/194434/
 [2] https://review.openstack.org/#/c/194438/

 On Wed, Jun 24, 2015 at 3:21 AM Stanislav Makar sma...@mirantis.com wrote:

 Hello
 My five cents :)
 I am also very proud of our 6.1 release
 Unfortunately we still do not have this separation for ceph
 public and cluster networks in 6.1.

 One little hint
 In 6.1 we have granular deployment
 
 (https://docs.mirantis.com/fuel/fuel-master/release-notes.html#granular-deployment-based-on-pluggable-tasks)
 and all tasks which are connected with ceph deployment are in

 
 https://github.com/stackforge/fuel-library/tree/stable/6.1/deployment/puppet/osnailyfacter/modular/ceph

 To have your patch working in 6.1 you should hack all these files now

 If you have questions, feel free to ask.

 Thanks.


 -- 
 All the best,
 Stanislav Makar
 skype: makar_stanislav 
 irc: stamak

 On Wed, Jun 24, 2015 at 12:04 PM, Sergii Golovatiuk
 sgolovat...@mirantis.com wrote:

 Hi Zhou,

 Try Fuel 6.1 where we have a lot of very nice features
 including 'advanced networking'. Feel free to read release
 notes [1]. It's one of the most significant releases with
 many many features and improvements. I am very proud of it.

 [1] https://docs.mirantis.com/fuel/fuel-master/release-notes.html

 --
 Best regards,
 Sergii Golovatiuk,
 Skype #golserge
 IRC #holser

 On Wed, Jun 24, 2015 at 11:42 AM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com wrote:

 Thank you guys!

 I'm reading the advanced networking spec and think it's
 good. I'll try
 to comment it. It's too epic for me for now. I wrote a
 patch to move
 Ceph public network to storage network. It's based on
 Fuel 6.0, so I'm
 trying to rebase it to the latest master then maybe ask you
 for a review.

 on 2015/06/23 19:21, Igor Kalnitsky wrote:
  Hello,
 
  That makes sense to me. Still, I want to point that
 we're going to
  implement advanced networking and with this feature
 you'll be able to
  assign every single network role to any network.
 
  That means, you'll be able to assign ceph network role
 to storage,
  management or  whatever-you-want network. Sounds cool,
 ha? :)
 
  Feel free to read a design spec [1].
 
  Thanks,
  Igor
 
  [1]: https://review.openstack.org/#/c/115340/
 
  On Tue, Jun 23, 2015 at 1:13 PM, Zhou Zheng

[openstack-dev] [Fuel] Ceph Public Network Setting

2015-06-23 Thread Zhou Zheng Sheng /
Hi!

I notice that in OpenStack deployed by Fuel, the Ceph public network is on
the management network. In some environments, not all NICs of a physical
server are 10Gb; sometimes one or two of the NICs on a machine may be
1Gb. Usually on this type of machine we assign the management network to a
1Gb NIC and the storage network to a 10Gb NIC. If the Ceph public network
shares the management network, QEMU accesses Ceph over the management
network, and the performance is not optimal.

In a small deployment, the cloud controller and Ceph OSDs may be assigned to
the same machine, so it would be more effective to keep Ceph client traffic
separated from MySQL, RabbitMQ and Pacemaker traffic. Maybe it's better to
place the Ceph public network on the storage network. Agree?
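
Concretely, what I have in mind would end up as something like this on every
node (a sketch only; the subnets are placeholders, not the addresses Fuel
actually assigns):

# Hypothetical subnets: 192.168.1.0/24 = Fuel storage network,
# 192.168.0.0/24 = management network.
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
public_network  = 192.168.1.0/24   # QEMU/librbd client traffic (and monitors)
cluster_network = 192.168.1.0/24   # OSD replication; could also be a separate subnet
EOF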

-- 
Best wishes!
Zhou Zheng Sheng, Software Developer
Beijing AWcloud Software Co., Ltd.




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-05 Thread Zhou Zheng Sheng /
Thank you, Andrew. Sorry for misspelling your name in the previous email.

on 2015/05/05 14:25, Andrew Beekhof wrote:
 On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:

 Thank you Bogdan for clearing the pacemaker promotion process for me.

 on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 
 zhengsh...@awcloud.com wrote:
 [snip]

 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a batch-limit: 30 in the output of
 pcs property list --all. The pacemaker official documentation
 explanation is that it's The number of jobs that the TE is allowed to
 execute in parallel. From my understanding, pacemaker maintains cluster
 states, and when we start/stop/promote/demote a resource, it triggers a
 state transition. Pacemaker puts as many as possible transition jobs
 into a batch, and process them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:

   
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html

 There is a more complex one which includes promotion and demotion on the 
 next page.

 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between 
 them (eg. rsc{1,2,3} in the above example)  

 So in the above example, if batch-limit >= 3, the monitor_0 actions will
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.

 Processing of the graph stops the moment any action returns a value that 
 was not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
 So can I infer the following statement? In a big cluster with many
 resources, chances are some resource agent actions return unexpected
 values,
 The size of the cluster shouldn’t increase the chance of this happening 
 unless you’ve set the timeouts too aggressively.

If there are many types of resource agents, and any one of them is not
well written, it might cause trouble, right?

 and if any of the in-flight action timeout is long, it would
 block pacemaker from re-calculating a new transition graph?
 Yes, but its actually an argument for making the timeouts longer, not shorter.
 Setting the timeouts too aggressively actually increases downtime because of 
 all the extra delays and recovery it induces.
 So set them to be long enough that there is unquestionably a problem if you 
 hit them.

 But we absolutely recognise that starting/stopping a database can take a very 
 long time comparatively and that it shouldn’t block recovery of other 
 unrelated services.
 I would expect to see this land in Pacemaker 1.1.14

It will be great to see this in Pacemaker 1.1.14. From my experience
using Pacemaker, I think customized resource agents are possibly the
weakest part. This feature should improve the handling for resource
action timeouts.

 I see the
 current batch-limit is 30 and I tried to increase it to 100, but did not
 help.
 Correct.  It only puts an upper limit on the number of in-flight actions, 
 actions still need to wait for all their dependants to complete before 
 executing.

 I'm sure that the cloned MySQL Galera resource is not related to
 master-slave RabbitMQ resource. I don't find any dependency, order or
 rule connecting them in the cluster deployed by Fuel [1].
 In general it should not have needed to wait, but if you send me a crm_report 
 covering the period you’re talking about I’ll be able to comment specifically 
 about the behaviour you saw.

That's very kind of you, thank you. I uploaded the file generated by
crm_report to Google Drive.

https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing
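
(In case it helps reproduce it, the archive was generated with something
along these lines; the time window below is just an example:)

# Sketch: collect pacemaker/corosync logs and PE inputs for the reassemble window.
crm_report --from "2015-05-04 10:00:00" --to "2015-05-04 11:00:00" /tmp/reassemble-report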

 Is there anything I can do to make sure all the resource actions return
 expected values in a full reassembling?
 In general, if we say ‘start’, do your best to start or return ‘0’ if you 
 already were started.
 Likewise for stop.

 Otherwise its really specific to your agent.
 For example an IP resource just needs to add itself to an interface - it can't
 do much differently; if it times out then the system must be very very busy.

 The only other thing I would say is:
 - avoid blocking calls where possible
 - have empathy for the machine (do as little as is needed)


+1 for the empathy :)
 Is it because node-1 and node-2
 happen to boot faster than node-3 and form a cluster, when node-3 joins,
 it triggers new state transition? Or may because some resources are
 already started, so pacemaker needs to stop them firstly?
 We only stop them if they shouldn’t yet be running (ie. a colocation or 
 ordering dependancy

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Zhou Zheng Sheng /
Thank you Andrew.

on 2015/05/05 08:03, Andrew Beekhof wrote:
 On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:

 Hello,
 Hello, Zhou

 I using Fuel 6.0.1 and find that RabbitMQ recover time is long after
 power failure. I have a running HA environment, then I reset power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for RabbitMQ cluster to appear running
 master-slave mode in pacemaker. If I power off all the 3 controllers and
 only start 2 of them, the downtime sometimes can be as long as 20 minutes.
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for MQ OCF script, so you may want to try to
 backport them as well by the following guide [4]

 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/
 Is there a reason you’re using a custom OCF script instead of the upstream[a] 
 one?
 Please have a chat with David (the maintainer, in CC) if there is something 
 you believe is wrong with it.

 [a] 
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

I'm using the OCF script from the Fuel project, specifically from the
6.0 stable branch [alpha].

Compared with the upstream OCF code, the main difference is that the Fuel
RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF also does
more bookkeeping, for example blocking client access while the RabbitMQ
cluster is not ready. Having read the code, I believe the upstream OCF
should be OK to use as well, but it might not fit into the Fuel project.
As far as I have tested, the Fuel OCF script is good, except that
sometimes the full reassemble time is long, and as I found out, that is
mostly because the Fuel MySQL Galera OCF script keeps pacemaker from
promoting the RabbitMQ resource, as I mentioned in the previous emails.
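
To make the structural difference concrete, this is roughly how the two
shapes would be declared with pcs (the Fuel-side resource and agent names
below are only illustrative; the upstream agent is the one linked above):

# Upstream ClusterLabs agent: a plain clone, no promote/demote.
pcs resource create rabbitmq ocf:heartbeat:rabbitmq-cluster \
  op monitor interval=30s --clone

# Fuel-style agent: a multi-state (master/slave) resource, so pacemaker also
# runs promote/demote and tracks an abstract "master" node.
pcs resource create p_rabbitmq-server ocf:fuel:rabbitmq-server \
  op monitor interval=30s --master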

Maybe Vladimir and Sergey can give us more insight on why Fuel needs a
master-slave RabbitMQ. I see that Vladimir and Sergey worked on the
original Fuel blueprint for the RabbitMQ cluster [beta].

[alpha]
https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
[beta]
https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker

 I have a little investigation and find out there are some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after power failure, and pacemaker would wait
 475s before retry starting it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is master-slave
 resource, I assume that starting all the slaves and promoting master are
 put into two different batches. If unfortunately starting all RabbitMQ
 slaves are put in the same batch as MySQL starting, even if RabbitMQ
 slaves and all other resources are ready, pacemaker will not continue
 but just wait for MySQL timeout.
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there is an MQ clustering logic flow chart available here
 [5] and we're planning to release a dedicated technical bulletin for this.

 [5] http://goo.gl/PPNrw7

 I can re-produce this by hard powering off all the controllers and start
 them again. It's more likely to trigger MySQL failure in this way. Then
 I observe that if there is one cloned mysql instance not starting, the
 whole pacemaker cluster gets stuck and does not emit any log. On the
 host of the failed instance, I can see a mysql resource agent process
 calling the sleep command. If I kill that process, the pacemaker comes
 back alive and RabbitMQ master gets promoted. In fact this long timeout
 is blocking every resource from state transition in pacemaker.

 This maybe a known problem of pacemaker and there are some discussions
 in Linux-HA mailing list [2]. It might not be fixed in the near future.
 It seems in generally it's bad to have long timeout in state transition
 actions (start/stop/promote/demote). There maybe another way to
 implement MySQL-wss resource agent to use a short start timeout and
 monitor the wss cluster state using monitor action.
 This is very interesting, thank you! I believe all commands for MySQL RA
 OCF script should be as well wrapped with timeout -SIGTERM or -SIGKILL
 as we did for the MQ RA OCF. And there should not be any sleep calls. I
 created a bug for this [6].

 [6] https://bugs.launchpad.net/fuel/+bug/1449542

 I also find a fix to improve MySQL start timeout [3]. It shortens the
 timeout to 300s. At the time I sending this email, I can not find it in
 stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to
 stable/6.0 ?

 [1] https://bugs.launchpad.net/fuel/+bug/1441885
 [2] 

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-04 Thread Zhou Zheng Sheng /
Thank you, Bogdan, for clarifying the pacemaker promotion process for me.

on 2015/05/05 10:32, Andrew Beekhof wrote:
 On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com 
 wrote:
 [snip]

 Batch is a pacemaker concept I found when I was reading its
 documentation and code. There is a batch-limit: 30 in the output of
 pcs property list --all. The pacemaker official documentation
 explanation is that it's The number of jobs that the TE is allowed to
 execute in parallel. From my understanding, pacemaker maintains cluster
 states, and when we start/stop/promote/demote a resource, it triggers a
 state transition. Pacemaker puts as many as possible transition jobs
 into a batch, and process them in parallel.
 Technically it calculates an ordered graph of actions that need to be 
 performed for a set of related resources.
 You can see an example of the kinds of graphs it produces at:


 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html

 There is a more complex one which includes promotion and demotion on the next 
 page.

 The number of actions that can run at any one time is therefore limited by
 - the value of batch-limit (the total number of in-flight actions)
 - the number of resources that do not have ordering constraints between them 
 (eg. rsc{1,2,3} in the above example)  

 So in the above example, if batch-limit >= 3, the monitor_0 actions will
 still all execute in parallel.
 If batch-limit == 2, one of them will be deferred until the others complete.

 Processing of the graph stops the moment any action returns a value that was 
 not expected.
 If that happens, we wait for currently in-flight actions to complete, 
 re-calculate a new graph based on the new information and start again.
So can I infer the following? In a big cluster with many resources,
chances are that some resource agent actions return unexpected values, and
if any of the in-flight actions has a long timeout, it would block
pacemaker from re-calculating a new transition graph. I see the current
batch-limit is 30 and I tried to increase it to 100, but it did not help.
I'm sure that the cloned MySQL Galera resource is not related to the
master-slave RabbitMQ resource; I don't find any dependency, order or rule
connecting them in the cluster deployed by Fuel [1].

Is there anything I can do to make sure all the resource actions return
expected values during a full reassemble? Is it because node-1 and node-2
happen to boot faster than node-3 and form a cluster, so that when node-3
joins it triggers a new state transition? Or maybe because some resources
are already started, so pacemaker needs to stop them first? Does setting
default-resource-stickiness to 1 help?

I also tried the "crm history XXX" commands on a live and healthy cluster,
but didn't find much information. I can see there are many log entries
like "run_graph: Transition 7108 ...". Next I'll inspect the pacemaker log
to see which resource action returns an unexpected value or what triggers
the new state transition.

[1] http://paste.openstack.org/show/214919/
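
(Roughly what I plan to run on the controllers while digging; the log path
depends on the distribution, so treat this as a sketch:)

# Spot failed or unexpected action results and aborted transitions.
crm_mon -1 --operations --failcounts
grep -E 'Transition aborted|failed|error' /var/log/daemon.log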

 The problem is that pacemaker can only promote a resource after it
 detects the resource is started.
 First we do a non-recurring monitor (*_monitor_0) to check what state the 
 resource is in.
 We can’t assume its off because a) we might have crashed, b) the admin might 
 have accidentally configured it to start at boot or c) the admin may have 
 asked us to re-check everything.

 During a full reassemble, in the first
 transition batch, pacemaker starts all the resources including MySQL and
 RabbitMQ. Pacemaker issues resource agent start invocation in parallel
 and reaps the results.

 For a multi-state resource agent like RabbitMQ, pacemaker needs the
 start result reported in the first batch, then transition engine and
 policy engine decide if it has to retry starting or promote, and put
 this new transition job into a new batch.
 Also important to know, the order of actions is:

 1. any necessary demotions
 2. any necessary stops
 3. any necessary starts
 4. any necessary promotions




-- 
Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-05-03 Thread Zhou Zheng Sheng /
Hello Sergii,

Thank you for the great explanation of the Galera OCF script. I replied to
your question inline.

on 2015/05/03 04:49, Sergii Golovatiuk wrote:
 Hi Zhou,

 Galera OCF script is a bit special. Since MySQL keeps the most
 important data we should find the most recent data on all nodes across
 the cluster. check_if_galera_pc is specially designed for that. Every
 server registers the latest status from grastate.dat to CIB. Once all
 nodes are registered, the one with the most recent data will be
 selected as Primary Component. All others should join to that node. 5
 minutes is a time for all nodes to appear and register position from
 grastate.dat to CIB. Usually, it takes much faster. Though there are
 cases when node is stuck on fsck or grub or power outlet or some other
 cases. If all nodes are registered there shouldn't be 5 minute penalty
 timeout. If one node is stuck (at least present in CIB), then all
 other nodes will be waiting for 5 minutes then will assemble cluster
 without it.

 Concerning dependencies, I agree that RabbitMQ may start in parallel
 to Galera cluster assemble procedure. It makes no sense to start other
 services as they are dependent on Galera and RabbitMQ.

 Also, I have a quick question to you. Shutting down all three
 controllers is a unique case, like whole power outage in whole
 datacenter (DC). In this case, 5 minute delay is very small comparing
 to DC recovery procedure. Reboot of one controller is more optimistic
 scenario. What's a special case to restart all 3-5 at once?

Sorry, I am not quite clear about what "3-5" refers to. Is the question
about why we want to make the full reassemble time short, and why this
case is important for us?

We have some small customers forming a long tail in the local market. They
have neither dedicated datacenter rooms nor dual power supplies. Some of
them even shut down all the machines when they go home and start them
again when they come back to work. Out of concern for data privacy, they
are not willing to put their virtual machines on a public cloud. Usually
this kind of customer doesn't have the IT skills to troubleshoot a full
reassemble, so we want to make the process as simple as turning on all the
machines at roughly the same time and waiting several minutes, so that
they don't have to call our service team.


 Also, I would like to say a big thank for digging it out. It's very
 useful to use your findings in our next steps.


 --
 Best regards,
 Sergii Golovatiuk,
 Skype #golserge
 IRC #holser

 On Wed, Apr 29, 2015 at 9:38 AM, Zhou Zheng Sheng / 周征晟
 zhengsh...@awcloud.com wrote:

 Hi!

 Thank you very much Vladimir and Bogdan! Thanks for the fast
 respond and
 rich information.

 I backported MySQL and RabbitMQ ocf patches from stable/6.0 and tested
 again. A full reassemble takes about 5mins, this improves a lot.
 Adding
 the force_load trick I mentioned in the previous email, it takes
 about
 4mins.

 I get that there is not really a RabbitMQ master instance because
 queue
 masters spreads to all the RabbitMQ instances. The pacemaker master is
 an abstract one. However there is still an mnesia node from which
 other
 mnesia nodes sync table schema. The exception
 timeout_waiting_for_tables in log is actually reported by mnesia. By
 default, it places a mark on the last alive mnesia node, and other
 nodes
 have to sync table from it
 (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477).
 RabbitMQ clustering inherits the behavior, and the last RabbitMQ
 instance shutdown must be the first instance to start. Otherwise it
 produces timeout_waiting_for_tables
 (http://www.rabbitmq.com/clustering.html#transcript search for
 the last
 node to go down).

 The 1 minute difference is because without force_load, the abstract
 master determined by pacemaker during a promote action may not be the
 last RabbitMQ instance shut down in the last start action. So
 there is
 chance for rabbitmqctl start_app to wait 30s and trigger a RabbitMQ
 exception timeout_waiting_for_tables. We may able to see table
 timeout
 and mnesa resetting for once during a reassemble process on some
 of the
 RabbitMQ instances, but it only introduces 30s of wait, which is
 acceptable for me.

 I also inspect the RabbitMQ resource agent code in latest master
 branch.
 There are timeout wrapper and other improvements which are great. It
 does not change the master promotion process much, so it may still run
 into the problems I described.

 Please see the inline reply below.

 on 2015/04/28/ 21:15, Bogdan Dobrelya wrote:
  Hello,
  Hello, Zhou
 
  I using Fuel 6.0.1 and find that RabbitMQ recover time is long
 after
  power failure. I have a running HA environment, then I reset
 power of
  all the machines

Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-29 Thread Zhou Zheng Sheng /
Hi!

Thank you very much Vladimir and Bogdan! Thanks for the fast response and
the rich information.

I backported the MySQL and RabbitMQ OCF patches from stable/6.0 and tested
again. A full reassemble takes about 5 minutes, which is a big
improvement. Adding the force_load trick I mentioned in the previous
email, it takes about 4 minutes.

I get that there is not really a RabbitMQ master instance, because queue
masters spread across all the RabbitMQ instances; the pacemaker master is
an abstract one. However, there is still an mnesia node from which the
other mnesia nodes sync the table schema. The timeout_waiting_for_tables
exception in the log is actually reported by mnesia. By default, mnesia
places a mark on the last alive mnesia node, and other nodes have to sync
tables from it
(http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477).
RabbitMQ clustering inherits this behavior, and the last RabbitMQ instance
shut down must be the first instance to start. Otherwise it produces
timeout_waiting_for_tables
(http://www.rabbitmq.com/clustering.html#transcript, search for "the last
node to go down").

The 1 minute difference is because, without force_load, the abstract
master determined by pacemaker during a promote action may not be the
RabbitMQ instance that was shut down last. So there is a chance for
rabbitmqctl start_app to wait 30s and trigger a RabbitMQ
timeout_waiting_for_tables exception. We may see the table timeout and an
mnesia reset once on some of the RabbitMQ instances during a reassemble,
but it only introduces 30s of waiting, which is acceptable for me.

I also inspected the RabbitMQ resource agent code in the latest master
branch. There are timeout wrappers and other improvements, which is great.
It does not change the master promotion process much, though, so it may
still run into the problems I described.

Please see the inline reply below.

on 2015/04/28/ 21:15, Bogdan Dobrelya wrote:
 Hello,
 Hello, Zhou

 I using Fuel 6.0.1 and find that RabbitMQ recover time is long after
 power failure. I have a running HA environment, then I reset power of
 all the machines at the same time. I observe that after reboot it
 usually takes 10 minutes for RabbitMQ cluster to appear running
 master-slave mode in pacemaker. If I power off all the 3 controllers and
 only start 2 of them, the downtime sometimes can be as long as 20 minutes.
 Yes, this is a known issue [0]. Note, there were many bugfixes, like
 [1],[2],[3], merged for MQ OCF script, so you may want to try to
 backport them as well by the following guide [4]

 [0] https://bugs.launchpad.net/fuel/+bug/1432603
 [1] https://review.openstack.org/#/c/175460/
 [2] https://review.openstack.org/#/c/175457/
 [3] https://review.openstack.org/#/c/175371/
 [4] https://review.openstack.org/#/c/170476/

 I have a little investigation and find out there are some possible causes.

 1. MySQL Recovery Takes Too Long [1] and Blocking RabbitMQ Clustering in
 Pacemaker

 The pacemaker resource p_mysql start timeout is set to 475s. Sometimes
 MySQL-wss fails to start after power failure, and pacemaker would wait
 475s before retry starting it. The problem is that pacemaker divides
 resource state transitions into batches. Since RabbitMQ is master-slave
 resource, I assume that starting all the slaves and promoting master are
 put into two different batches. If unfortunately starting all RabbitMQ
 slaves are put in the same batch as MySQL starting, even if RabbitMQ
 slaves and all other resources are ready, pacemaker will not continue
 but just wait for MySQL timeout.
 Could you please elaborate on what the same/different batches are for MQ
 and DB? Note, there is an MQ clustering logic flow chart available here
 [5] and we're planning to release a dedicated technical bulletin for this.

 [5] http://goo.gl/PPNrw7

"Batch" is a pacemaker concept I found when I was reading its
documentation and code. There is a "batch-limit: 30" in the output of
"pcs property list --all". The official pacemaker documentation explains
it as "the number of jobs that the TE is allowed to execute in parallel".
From my understanding, pacemaker maintains cluster state, and when we
start/stop/promote/demote a resource, it triggers a state transition.
Pacemaker puts as many transition jobs as possible into a batch and
processes them in parallel.
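
(For reference, the property can be inspected and changed like this; the
value is cluster-wide:)

# Inspect and adjust the transition-engine batch limit.
pcs property list --all | grep batch-limit   # batch-limit: 30 by default here
pcs property set batch-limit=100             # raises the cap on in-flight actions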

The problem is that pacemaker can only promote a resource after it
detects that the resource is started. During a full reassemble, in the
first transition batch, pacemaker starts all the resources, including
MySQL and RabbitMQ. Pacemaker issues the resource agent start invocations
in parallel and reaps the results.

For a multi-state resource agent like RabbitMQ, pacemaker needs the start
result reported in the first batch; then the transition engine and policy
engine decide whether it has to retry starting or to promote, and put this
new transition job into a new batch.

I see improvements that put individual commands inside a timeout wrapper
in the RabbitMQ resource agent, and a bug created yesterday

[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering

2015-04-28 Thread Zhou Zheng Sheng /
are unlucky, there would be several 30s timeouts and resets before you
finally get a healthy RabbitMQ cluster.

I find three possible solutions.

A. Use the rabbitmqctl force_boot option [6]
It skips the 30s wait and the cluster reset, and just assumes the current
node is the master and continues to operate. This is feasible because the
original RabbitMQ master discards its local state and syncs with the new
master after it joins a new cluster [7]. So we can be sure that after
steps (4) and (6), the pacemaker-determined master instance is started
unconditionally, it will be the same as the RabbitMQ master, and all
operations run without the 30s timeout. I find this option is only
available in newer RabbitMQ releases, and updating RabbitMQ might
introduce other compatibility problems.
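
(Roughly what option A amounts to when done by hand on the node pacemaker
picks as master; in the OCF script it would go into the start/promote path:)

# Sketch: skip the 30s wait for the other nodes' tables on the chosen node.
rabbitmqctl stop_app
rabbitmqctl force_boot   # only available in newer RabbitMQ releases, as noted above
rabbitmqctl start_app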

B. Turn RabbitMQ into a cloned resource and use pause_minority instead of
autoheal [8]
This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with
partitions in a manner similar to the pacemaker quorum mechanism. When
there is a network partition, instances in the minority partition pause
themselves automatically. Pacemaker does not have to track who the
RabbitMQ master is, who lived longest, or whom to promote; it just starts
all the clones, done. This requires a huge change in the RabbitMQ resource
agent, and the stability and other impacts are yet to be tested.
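
(The RabbitMQ side of option B is only a configuration change; a sketch,
assuming the stock config path:)

# Sketch: let the minority partition pause itself instead of autohealing.
cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
EOF
# The larger part of option B is reworking the pacemaker OCF agent into a plain clone.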

C. Create a force_load file
After reading the RabbitMQ source code, I find that all solution A
actually does is create an empty file named force_load in the mnesia
database dir; mnesia then thinks this node was the last one shut down and
boots itself as the master. This implementation has stayed the same from
v3.1.4 to the latest RabbitMQ master branch, so I think we can make use of
this little trick. The change is adding just one line to the
try_to_start_rmq_app() function.

# Mark this node as bootable so start_app does not wait for the other nodes' tables.
touch ${MNESIA_FILES}/force_load && \
  chown rabbitmq:rabbitmq ${MNESIA_FILES}/force_load

[4] http://www.rabbitmq.com/ha.html
[5] https://review.openstack.org/#/c/169291/
[6] https://www.rabbitmq.com/clustering.html
[7] http://www.rabbitmq.com/partitions.html#recovering
[8] http://www.rabbitmq.com/partitions.html#automatic-handling

Maybe you have better ideas on this. Please share your thoughts.


Best wishes!
Zhou Zheng Sheng / 周征晟  Software Engineer
Beijing AWcloud Software Co., Ltd.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev