Re: [openstack-dev] [Fuel] Ceph Public Network Setting
Thank you guys! I'm reading the advanced networking spec and think it's good. I'll try to comment on it, though it's a bit too epic for me for now. I wrote a patch to move the Ceph public network to the storage network. It's based on Fuel 6.0, so I'm trying to rebase it to the latest master, and then maybe ask you for a review.

on 2015/06/23 19:21, Igor Kalnitsky wrote:

Hello, That makes sense to me. Still, I want to point out that we're going to implement advanced networking, and with this feature you'll be able to assign every single network role to any network. That means you'll be able to assign the Ceph network role to the storage, management, or whatever-you-want network. Sounds cool, ha? :) Feel free to read the design spec [1]. Thanks, Igor

[1]: https://review.openstack.org/#/c/115340/

-- Best wishes! Zhou Zheng Sheng / 周征晟, Software Developer, Beijing AWcloud Software Co., Ltd.
Re: [openstack-dev] [Fuel] Ceph Public Network Setting
Thank you Sergii. I read the release notes but didn't find a specific 'Advanced Networking' part. It seems advanced networking is the spec [1] mentioned by Igor in a previous mail. Maybe I overlooked something? Anyway, I think 6.1 is a great release, for it solves some major problems from my previous experience using 6.0.

[1] https://blueprints.launchpad.net/fuel/+spec/granular-network-functions

on 2015/06/24 17:04, Sergii Golovatiuk wrote:

Hi Zhou, Try Fuel 6.1, where we have a lot of very nice features, including 'advanced networking'. Feel free to read the release notes [1]. It's one of the most significant releases, with many, many features and improvements. I am very proud of it.

[1] https://docs.mirantis.com/fuel/fuel-master/release-notes.html

-- Best regards, Sergii Golovatiuk, Skype #golserge IRC #holser

-- Best wishes! Zhou Zheng Sheng / 周征晟, Software Developer, Beijing AWcloud Software Co., Ltd.
Re: [openstack-dev] [Fuel] Ceph Public Network Setting
Hello Andrew, I did some similar modifications in Fuel 6.0 to those in [1][2]. However, [3] says the Ceph monitors have to run on the public network. In the file deployment/puppet/ceph/manifests/conf.pp, we can see it calls 'ceph-deploy new ${::hostname}:${::internal_address}'. In deployment/puppet/ceph/manifests/mon.pp, it calls 'ceph-deploy mon create ${::hostname}:${::internal_address}'. After we set the Ceph public network to the OpenStack storage network, the Ceph monitor still runs on the management network. I can also see from the QEMU command line that it connects to Ceph via the management network IP address.

Though this setup does not follow the rule suggested in [3], it probably works, because the Ceph monitor only provides PG location information, and QEMU should then talk to the OSDs directly using the storage network addresses. That's why I didn't take your patches as-is. I meant to set up a Fuel 6.1 environment and test the modifications before I can give an accurate comment.

[3] http://ceph.com/docs/master/rados/configuration/network-config-ref/#monitor-ip-tables

on 2015/06/25 03:20, Andrew Woodward wrote:

Zhou, As mentioned, please review [1][2]. This is the interface we will support as we implement the advanced networking parts. In your case, just flip the role map in your patch to nailgun, and the library interface will remain the same.

[1] https://review.openstack.org/#/c/194434/
[2] https://review.openstack.org/#/c/194438/

On Wed, Jun 24, 2015 at 3:21 AM, Stanislav Makar sma...@mirantis.com wrote:

Hello. My five cents :) I am also very proud of our 6.1 release. Unfortunately, we still do not have this separation of Ceph public and cluster networks in 6.1. One little hint: in 6.1 we have granular deployment (https://docs.mirantis.com/fuel/fuel-master/release-notes.html#granular-deployment-based-on-pluggable-tasks), and all tasks connected with Ceph deployment are in https://github.com/stackforge/fuel-library/tree/stable/6.1/deployment/puppet/osnailyfacter/modular/ceph — to have your patch working in 6.1, you should now hack on all these files. If you have questions, feel free to ask. Thanks.

-- All the best, Stanislav Makar skype: makar_stanislav irc: stamak
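P.S. To make the discussion above concrete, the change I'm testing amounts to something like this at the ceph-deploy level. STORAGE_ADDR is a placeholder for the node's storage-network IP; the real manifests pass Puppet's ${::hostname} and ${::internal_address}, so this is a sketch, not the actual patch:

    HOST=$(hostname -s)
    STORAGE_ADDR=192.168.1.11    # assumed storage-network IP of this node

    # conf.pp today: ceph-deploy new ${::hostname}:${::internal_address}
    ceph-deploy new "${HOST}:${STORAGE_ADDR}"
    # mon.pp today: ceph-deploy mon create ${::hostname}:${::internal_address}
    ceph-deploy mon create "${HOST}:${STORAGE_ADDR}"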
Re: [openstack-dev] [Fuel] Ceph Public Network Setting
Hi, I also notice that the latest ceph-deploy source code allows us to pass public_network and cluster_network arguments when calling 'ceph-deploy new'. It also checks whether the monitor IP address is in either of the two networks. https://github.com/ceph/ceph-deploy/blob/master/ceph_deploy/new.py#L158

I'm not sure whether Ceph monitors operating on the public network is just an artificial requirement, or an inherent precondition of the Ceph implementation, where breaking the precondition might lead to some strange problems. If it's the latter case, when we change the Ceph public network, we have to change the Ceph monitor IP address configuration at the same time.
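For example, an invocation along these lines should satisfy that check, assuming a ceph-deploy version that accepts these options (the subnets and node name are made-up examples):

    # ceph-deploy validates that each monitor IP falls inside --public-network
    # (see the new.py code linked above).
    ceph-deploy new \
        --public-network 192.168.1.0/24 \
        --cluster-network 192.168.2.0/24 \
        node-1:192.168.1.11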
[openstack-dev] [Fuel] Ceph Public Network Setting
Hi! I notice that in OpenStack deployed by Fuel, the Ceph public network is on the management network. In some environments, not all NICs of a physical server are 10Gb; sometimes 1 or 2 of the NICs on a machine may be 1Gb. Usually on this type of machine we assign the management network to a 1Gb NIC and the storage network to a 10Gb NIC. If the Ceph public network sits on the management network, QEMU accesses Ceph over the management network, and the performance is not optimal. In a small deployment, the cloud controller and Ceph OSD may be assigned to the same machine, so it would be more effective to keep Ceph client traffic separated from MySQL, RabbitMQ, and Pacemaker traffic. Maybe it's better to place the Ceph public network on the storage network. Agree?

-- Best wishes! Zhou Zheng Sheng, Software Developer, Beijing AWcloud Software Co., Ltd.
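P.S. In ceph.conf terms, the proposal boils down to something like the following. The subnets are just examples (storage network on the 10Gb NICs, plus a separate replication network), not Fuel defaults:

    # Assumed example subnets; with this split, clients such as QEMU reach
    # the monitors and OSDs over the storage network.
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [global]
    public network = 192.168.1.0/24
    cluster network = 192.168.2.0/24
    EOF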
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Andrew. Sorry for misspelling your name in the previous email.

on 2015/05/05 14:25, Andrew Beekhof wrote:

On 5 May 2015, at 2:31 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: Thank you Bogdan for clearing up the pacemaker promotion process for me. on 2015/05/05 10:32, Andrew Beekhof wrote: On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip]

Batch is a pacemaker concept I found when reading its documentation and code. There is a "batch-limit: 30" in the output of pcs property list --all. The official pacemaker documentation explains it as "The number of jobs that the TE is allowed to execute in parallel." From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many transition jobs as possible into a batch and processes them in parallel.

Technically, it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html — there is a more complex one, which includes promotion and demotion, on the next page. The number of actions that can run at any one time is therefore limited by: the value of batch-limit (the total number of in-flight actions), and the number of resources that do not have ordering constraints between them (e.g. rsc{1,2,3} in the above example). So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit = 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for the currently in-flight actions to complete, re-calculate a new graph based on the new information, and start again.

So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values.

The size of the cluster shouldn't increase the chance of this happening unless you've set the timeouts too aggressively.

If there are many types of resource agents, and any one of them is not well written, it might cause trouble, right? And if any in-flight action's timeout is long, it would block pacemaker from re-calculating a new transition graph?

Yes, but it's actually an argument for making the timeouts longer, not shorter. Setting the timeouts too aggressively actually increases downtime because of all the extra delays and recovery it induces. So set them to be long enough that there is unquestionably a problem if you hit them. But we absolutely recognise that starting/stopping a database can take a very long time comparatively, and that it shouldn't block recovery of other unrelated services. I would expect to see this land in Pacemaker 1.1.14.

It will be great to see this in Pacemaker 1.1.14. From my experience using pacemaker, I think customized resource agents are possibly the weakest part. This feature should improve the handling of resource action timeouts.

I see the current batch-limit is 30, and I tried to increase it to 100, but it did not help.

Correct. It only puts an upper limit on the number of in-flight actions; actions still need to wait for all their dependencies to complete before executing.
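For reference, the batch-limit experiment was simply:

    pcs property list --all | grep batch-limit   # shows the default of 30
    pcs property set batch-limit=100             # raise the in-flight action cap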
I'm sure that the cloned MySQL Galera resource is not related to the master-slave RabbitMQ resource. I don't find any dependency, order, or rule connecting them in the cluster deployed by Fuel [1].

[1] http://paste.openstack.org/show/214919/

In general it should not have needed to wait, but if you send me a crm_report covering the period you're talking about, I'll be able to comment specifically on the behaviour you saw.

You are very nice, thank you. I uploaded the file generated by crm_report to Google Drive: https://drive.google.com/file/d/0B_vDkYRYHPSIZ29NdzV3NXotYU0/view?usp=sharing

Is there anything I can do to make sure all the resource actions return expected values during a full reassembling?

In general, if we say 'start', do your best to start, or return '0' if you were already started. Likewise for stop. Otherwise it's really specific to your agent. For example, an IP resource just needs to add itself to an interface; it can't do much differently, and if it times out then the system must be very, very busy. The only other things I would say are: avoid blocking calls where possible, and have empathy for the machine (do as little as is needed).

+1 for the empathy :)

Is it because node-1 and node-2 happen to boot faster than node-3 and form a cluster, so when node-3 joins, it triggers a new state transition? Or maybe because some resources are already started, pacemaker needs to stop them first?

We only stop them if they shouldn't yet be running (i.e. a colocation or ordering dependency…
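P.S. For reference, the report linked above was produced with an invocation of this shape (the time window is a placeholder covering the reassemble period):

    # Collect logs, the CIB, and PE inputs for the window in question.
    crm_report -f "2015-05-04 14:00" -t "2015-05-04 15:00" /tmp/rabbitmq-reassemble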
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Andrew.

on 2015/05/05 08:03, Andrew Beekhof wrote:

On 28 Apr 2015, at 11:15 pm, Bogdan Dobrelya bdobre...@mirantis.com wrote:

Hello, Zhou. I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after power failure. I have a running HA environment, then I reset the power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all 3 controllers and only start 2 of them, the downtime can sometimes be as long as 20 minutes.

Yes, this is a known issue [0]. Note, there were many bugfixes, like [1], [2], [3], merged for the MQ OCF script, so you may want to backport them as well by following the guide [4].

[0] https://bugs.launchpad.net/fuel/+bug/1432603
[1] https://review.openstack.org/#/c/175460/
[2] https://review.openstack.org/#/c/175457/
[3] https://review.openstack.org/#/c/175371/
[4] https://review.openstack.org/#/c/170476/

Is there a reason you're using a custom OCF script instead of the upstream [a] one? Please have a chat with David (the maintainer, in CC) if there is something you believe is wrong with it.

[a] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/rabbitmq-cluster

I'm using the OCF script from the Fuel project, specifically from the 6.0 stable branch [alpha]. Compared with the upstream OCF code, the main difference is that the Fuel RabbitMQ OCF is a master-slave resource. The Fuel RabbitMQ OCF does more bookkeeping, for example, blocking client access when the RabbitMQ cluster is not ready. I believe the upstream OCF should be OK to use as well, after reading the code, but it might not fit into the Fuel project. As far as I have tested, the Fuel OCF script is good, except that sometimes the full reassemble time is long, and as I found out, that is mostly because the Fuel MySQL Galera OCF script keeps pacemaker from promoting the RabbitMQ resource, as I mentioned in the previous emails. Maybe Vladimir and Sergey can give us more insight on why Fuel needs a master-slave RabbitMQ. I see Vladimir and Sergey worked on the original Fuel blueprint "RabbitMQ cluster" [beta].

[alpha] https://github.com/stackforge/fuel-library/blob/stable/6.0/deployment/puppet/nova/files/ocf/rabbitmq
[beta] https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker waits 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all the RabbitMQ slaves is put in the same batch as the MySQL start, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just waits for the MySQL timeout.

Could you please elaborate on what the same/different batches are for MQ and DB? Note, there is a MQ clustering logic flow chart available here [5], and we're planning to release a dedicated technical bulletin for this.

[5] http://goo.gl/PPNrw7

I can reproduce this by hard powering off all the controllers and starting them again. It's more likely to trigger the MySQL failure this way.
Then I observe that if there is one cloned MySQL instance not starting, the whole pacemaker cluster gets stuck and does not emit any log. On the host of the failed instance, I can see a MySQL resource agent process calling the sleep command. If I kill that process, pacemaker comes back alive and the RabbitMQ master gets promoted. In fact, this long timeout is blocking every resource's state transition in pacemaker. This may be a known problem of pacemaker, and there are some discussions on the Linux-HA mailing list [2]. It might not be fixed in the near future. It seems that, in general, it's bad to have long timeouts in state transition actions (start/stop/promote/demote). There may be another way to implement the MySQL-wss resource agent: use a short start timeout, and monitor the wss cluster state using the monitor action.

This is very interesting, thank you! I believe all commands in the MySQL RA OCF script should be wrapped with timeout -SIGTERM or -SIGKILL as well, as we did for the MQ RA OCF. And there should not be any sleep calls. I created a bug for this [6].

[6] https://bugs.launchpad.net/fuel/+bug/1449542

I also found a fix to improve the MySQL start timeout [3]. It shortens the timeout to 300s. At the time of sending this email, I cannot find it in the stable/6.0 branch. Maybe the maintainer needs to cherry-pick it to stable/6.0?

[1] https://bugs.launchpad.net/fuel/+bug/1441885
[2] …
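P.S. The wrapping Bogdan suggests would look roughly like this with coreutils timeout. The durations and the sample mysql command are illustrative, not taken from the actual resource agent:

    # SIGTERM after 10s; SIGKILL 5s later if the command ignores it, so a hung
    # call can never pin a pacemaker transition for the full 475s.
    timeout -k 5 10 mysql --defaults-file=/etc/mysql/debian.cnf \
        -e 'SHOW STATUS LIKE "wsrep_local_state_comment"'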
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Thank you Bogdan for clearing up the pacemaker promotion process for me.

on 2015/05/05 10:32, Andrew Beekhof wrote:

On 29 Apr 2015, at 5:38 pm, Zhou Zheng Sheng / 周征晟 zhengsh...@awcloud.com wrote: [snip]

Batch is a pacemaker concept I found when reading its documentation and code. There is a "batch-limit: 30" in the output of pcs property list --all. The official pacemaker documentation explains it as "The number of jobs that the TE is allowed to execute in parallel." From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many transition jobs as possible into a batch and processes them in parallel.

Technically, it calculates an ordered graph of actions that need to be performed for a set of related resources. You can see an example of the kinds of graphs it produces at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-config-testing-changes.html — there is a more complex one, which includes promotion and demotion, on the next page. The number of actions that can run at any one time is therefore limited by: the value of batch-limit (the total number of in-flight actions), and the number of resources that do not have ordering constraints between them (e.g. rsc{1,2,3} in the above example). So in the above example, if batch-limit = 3, the monitor_0 actions will still all execute in parallel. If batch-limit = 2, one of them will be deferred until the others complete. Processing of the graph stops the moment any action returns a value that was not expected. If that happens, we wait for the currently in-flight actions to complete, re-calculate a new graph based on the new information, and start again.

So can I infer the following statement? In a big cluster with many resources, chances are some resource agent actions return unexpected values, and if any in-flight action's timeout is long, it would block pacemaker from re-calculating a new transition graph? I see the current batch-limit is 30, and I tried to increase it to 100, but it did not help. I'm sure that the cloned MySQL Galera resource is not related to the master-slave RabbitMQ resource. I don't find any dependency, order, or rule connecting them in the cluster deployed by Fuel [1]. Is there anything I can do to make sure all the resource actions return expected values during a full reassembling? Is it because node-1 and node-2 happen to boot faster than node-3 and form a cluster, so when node-3 joins, it triggers a new state transition? Or maybe because some resources are already started, pacemaker needs to stop them first? Does setting default-resource-stickiness to 1 help? I also tried the crm history XXX commands in a live and correct cluster, but didn't find much information. I can see there are many log entries like "run_graph: Transition 7108". Next I'll inspect the pacemaker log to see which resource action returns the unexpected value, or what triggers the new state transition.

[1] http://paste.openstack.org/show/214919/
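A quick way to double-check this on a live cluster is to list all constraints and search for both resources (the grep pattern is deliberately loose, since only p_mysql is named explicitly here):

    # An empty result means no ordering/colocation rule links the Galera
    # clone to the RabbitMQ multi-state resource.
    pcs constraint --full | grep -iE 'mysql|rabbit'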
The problem is that pacemaker can only promote a resource after it detects the resource is started.

First we do a non-recurring monitor (*_monitor_0) to check what state the resource is in. We can't assume it's off, because a) we might have crashed, b) the admin might have accidentally configured it to start at boot, or c) the admin may have asked us to re-check everything.

During a full reassemble, in the first transition batch, pacemaker starts all the resources, including MySQL and RabbitMQ. Pacemaker issues the resource agent start invocations in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch; then the transition engine and policy engine decide whether it has to retry starting or promote, and put this new transition job into a new batch.

Also important to know, the order of actions is:
1. any necessary demotions
2. any necessary stops
3. any necessary starts
4. any necessary promotions

-- Best wishes! Zhou Zheng Sheng / 周征晟, Software Engineer, Beijing AWcloud Software Co., Ltd.
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hello Sergii, thank you for the great explanation of the Galera OCF script. I replied to your question inline.

on 2015/05/03 04:49, Sergii Golovatiuk wrote:

Hi Zhou, the Galera OCF script is a bit special. Since MySQL keeps the most important data, we should find the most recent data among all nodes across the cluster. check_if_galera_pc is specially designed for that. Every server registers the latest status from grastate.dat to the CIB. Once all nodes are registered, the one with the most recent data will be selected as the Primary Component. All others should join that node. 5 minutes is the time for all nodes to appear and register their position from grastate.dat to the CIB. Usually it happens much faster, though there are cases when a node is stuck on fsck, or GRUB, or a power outage, or something else. If all nodes are registered, there shouldn't be a 5-minute penalty timeout. If one node is stuck (while still present in the CIB), then all other nodes will wait for 5 minutes and then assemble the cluster without it.

Concerning dependencies, I agree that RabbitMQ may start in parallel with the Galera cluster assembly procedure. It makes no sense to start other services, as they are dependent on Galera and RabbitMQ.

Also, I have a quick question for you. Shutting down all three controllers is a unique case, like a whole power outage in a whole datacenter (DC). In this case, a 5-minute delay is very small compared to the DC recovery procedure. A reboot of one controller is a more optimistic scenario. What's the special case for restarting all 3-5 at once?

Sorry, I am not very clear about what 3-5 refers to. Is the question about why we want to make the full reassemble time short, and why this case is important for us? We have some small customers forming a long tail in the local market. They have neither dedicated datacenter houses nor dual power supplies. Some of them would even shut down all the machines when they go home, and start all of them when they get back to work. Considering data privacy, they are not willing to put their virtual machines on the public cloud. Usually, this kind of customer doesn't have the IT skills to troubleshoot a full reassemble process. We want to make this process as simple as turning on all the machines at roughly the same time and waiting several minutes, so they don't call our service team.

Also, I would like to say a big thank you for digging this out. It's very useful to use your findings in our next steps.

-- Best regards, Sergii Golovatiuk, Skype #golserge IRC #holser
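P.S. The registration step Sergii describes reads the Galera saved-state file on each node. A typical look at it (the path is MySQL's default datadir; the values are example output):

    cat /var/lib/mysql/grastate.dat
    # # GALERA saved state
    # version: 2.1
    # uuid:    6a1d82fe-...     (cluster UUID; example output)
    # seqno:   1234             (last committed seqno; the node with the
    #                            highest seqno becomes the Primary Component)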
Re: [openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
Hi! Thank you very much, Vladimir and Bogdan! Thanks for the fast response and rich information.

I backported the MySQL and RabbitMQ OCF patches from stable/6.0 and tested again. A full reassemble takes about 5 minutes, which is a big improvement. Adding the force_load trick I mentioned in the previous email, it takes about 4 minutes.

I get that there is not really a RabbitMQ master instance, because queue masters spread across all the RabbitMQ instances; the pacemaker master is an abstract one. However, there is still an mnesia node from which the other mnesia nodes sync the table schema. The timeout_waiting_for_tables exception in the log is actually reported by mnesia. By default, mnesia places a mark on the last alive mnesia node, and other nodes have to sync tables from it (http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html#id78477). RabbitMQ clustering inherits this behavior, and the last RabbitMQ instance shut down must be the first instance to start. Otherwise it produces timeout_waiting_for_tables (http://www.rabbitmq.com/clustering.html#transcript — search for "the last node to go down").

The 1 minute difference is because, without force_load, the abstract master determined by pacemaker during a promote action may not be the RabbitMQ instance that was shut down last. So there is a chance for rabbitmqctl start_app to wait 30s and trigger the RabbitMQ timeout_waiting_for_tables exception. We may be able to see a table timeout and mnesia resetting once during a reassemble process on some of the RabbitMQ instances, but it only introduces 30s of waiting, which is acceptable for me.

I also inspected the RabbitMQ resource agent code in the latest master branch. There are timeout wrappers and other improvements, which are great. It does not change the master promotion process much, so it may still run into the problems I described. Please see the inline reply below.

on 2015/04/28 21:15, Bogdan Dobrelya wrote:

Hello, Zhou. I am using Fuel 6.0.1 and find that the RabbitMQ recovery time is long after power failure. I have a running HA environment, then I reset the power of all the machines at the same time. I observe that after reboot it usually takes 10 minutes for the RabbitMQ cluster to appear running in master-slave mode in pacemaker. If I power off all 3 controllers and only start 2 of them, the downtime can sometimes be as long as 20 minutes.

Yes, this is a known issue [0]. Note, there were many bugfixes, like [1], [2], [3], merged for the MQ OCF script, so you may want to backport them as well by following the guide [4].

[0] https://bugs.launchpad.net/fuel/+bug/1432603
[1] https://review.openstack.org/#/c/175460/
[2] https://review.openstack.org/#/c/175457/
[3] https://review.openstack.org/#/c/175371/
[4] https://review.openstack.org/#/c/170476/

I did a little investigation and found some possible causes.

1. MySQL Recovery Takes Too Long [1] and Blocks RabbitMQ Clustering in Pacemaker

The pacemaker resource p_mysql start timeout is set to 475s. Sometimes MySQL-wss fails to start after power failure, and pacemaker waits 475s before retrying to start it. The problem is that pacemaker divides resource state transitions into batches. Since RabbitMQ is a master-slave resource, I assume that starting all the slaves and promoting the master are put into two different batches. If, unfortunately, starting all the RabbitMQ slaves is put in the same batch as the MySQL start, then even if the RabbitMQ slaves and all other resources are ready, pacemaker will not continue but just waits for the MySQL timeout.
Could you please elaborate on what the same/different batches are for MQ and DB? Note, there is a MQ clustering logic flow chart available here [5], and we're planning to release a dedicated technical bulletin for this.

[5] http://goo.gl/PPNrw7

Batch is a pacemaker concept I found when reading its documentation and code. There is a "batch-limit: 30" in the output of pcs property list --all. The official pacemaker documentation explains it as "The number of jobs that the TE is allowed to execute in parallel." From my understanding, pacemaker maintains cluster states, and when we start/stop/promote/demote a resource, it triggers a state transition. Pacemaker puts as many transition jobs as possible into a batch and processes them in parallel.

The problem is that pacemaker can only promote a resource after it detects the resource is started. During a full reassemble, in the first transition batch, pacemaker starts all the resources, including MySQL and RabbitMQ. Pacemaker issues the resource agent start invocations in parallel and reaps the results. For a multi-state resource agent like RabbitMQ, pacemaker needs the start result reported in the first batch; then the transition engine and policy engine decide whether it has to retry starting or promote, and put this new transition job into a new batch.

I see improvements to put individual commands inside a timeout wrapper in the RabbitMQ resource agent, and a bug created yesterday…
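P.S. One way to watch the transition graph and batching being discussed, without waiting for a real failover, is to replay the live CIB through the policy engine (crm_simulate ships with pacemaker; this is a hedged example):

    # -L reads the live cluster state; -S simulates the transition and prints
    # the actions (starts, promotes, ...) the policy engine computes.
    crm_simulate -SL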
[openstack-dev] [Fuel] Speed Up RabbitMQ Recovering
…are unlucky, there would be several 30s timeouts and resets before you finally get a healthy RabbitMQ cluster. I see three possible solutions.

A. Use the rabbitmqctl force_boot option [6]

It skips the 30s wait and the cluster reset, and just assumes the current node is the master and continues to operate. This is feasible because the original RabbitMQ master discards its local state and syncs with the new master after it joins a new cluster [7]. So we can be sure that after steps (4) and (6), the pacemaker-determined master instance is started unconditionally, it will be the same as the RabbitMQ master, and all operations run without the 30s timeout. I find this option is only available in newer RabbitMQ releases, and updating RabbitMQ might introduce other compatibility problems.

B. Turn RabbitMQ into a cloned instance and use pause_minority instead of autoheal [8]

This works like MySQL-wss. It lets the RabbitMQ cluster itself deal with partitions, in a manner similar to the pacemaker quorum mechanism. When there is a network partition, instances in the minority partition pause themselves automatically. Pacemaker does not have to track who the RabbitMQ master is, who lives longest, who to promote... it just starts all the clones, done. This leads to a huge change in the RabbitMQ resource agent, and the stability and other impacts are yet to be tested.

C. Create a force_load file

After reading the RabbitMQ source code, I find that what solution A actually does is just create an empty file named force_load in the mnesia database dir; mnesia then thinks it was the last node shut down last time and boots itself as the master. This implementation stays the same from v3.1.4 to the latest RabbitMQ master branch. I think we can make use of this little trick. The change is adding just one line in the try_to_start_rmq_app() function:

    touch "${MNESIA_FILES}/force_load" && \
        chown rabbitmq:rabbitmq "${MNESIA_FILES}/force_load"

[4] http://www.rabbitmq.com/ha.html
[5] https://review.openstack.org/#/c/169291/
[6] https://www.rabbitmq.com/clustering.html
[7] http://www.rabbitmq.com/partitions.html#recovering
[8] http://www.rabbitmq.com/partitions.html#automatic-handling

Maybe you have better ideas on this. Please share your thoughts.

-- Best wishes! Zhou Zheng Sheng / 周征晟, Software Engineer, Beijing AWcloud Software Co., Ltd.
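P.S. For completeness, solution A in command form, assuming a RabbitMQ release new enough to provide force_boot, run on the node we want to boot first while its app is stopped:

    rabbitmqctl stop_app     # ensure the app is stopped on this node
    rabbitmqctl force_boot   # mark it bootable without waiting for its peers
    rabbitmqctl start_app    # boots without timeout_waiting_for_tables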