Re: [Pacemaker] pacemaker-remote debian wheezy
Hi Alexis,

Sorry, I haven't worked with DRBD. Try looking here: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/

Thank you,
Kostya

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration
Hi Kristoffer,

Thank you for the help. I will let you know if I see the same with the latest version.

On Feb 9, 2015 10:07 AM, Kristoffer Grönlund kgronl...@suse.com wrote:
> Hi,
>
> Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com writes:
>> Hi guys,
>> I saw this while applying the configuration using a script with crmsh commands:
>
> The CIB patching performed by crmsh has been a bit too sensitive to CIB version mismatches, which can cause errors like the one you are seeing. This should be fixed in the latest released version of crmsh (2.1.2), and I would recommend upgrading to that version if you can. If this problem still occurs with 2.1.2, please let me know [1]. Thanks!
>
> [1]: http://github.com/crmsh/crmsh/issues
>
>> + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
>> + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params delay=10
>> + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
>> + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0
>> Call cib_apply_diff failed (-205): Update was older than existing configuration
>> ERROR: could not patch cib (rc=205)
>> INFO: offending xml diff: [...]
>> After that pacemaker stopped on the node on which the script was run.
Thank you,
Kostya

--
// Kristoffer Grönlund
// kgronl...@suse.com
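Besides upgrading crmsh, the per-command patching that races against CIB version bumps can be avoided by applying the whole configuration in one transaction. A sketch of that workaround (not from the thread; `crm configure load update` is a real crmsh subcommand, but the file name and the guard around a missing `crm` binary are illustrative):

```shell
#!/bin/sh
# Write the whole configuration to one file and load it in a single
# CIB transaction, so crmsh computes one diff instead of four.
set -e

CFG=$(mktemp /tmp/cluster-cfg.XXXXXX)
cat > "$CFG" <<'EOF'
primitive STONITH_node-1 stonith:fence_avid_sbb_hw
primitive STONITH_node-0 stonith:fence_avid_sbb_hw params delay=10
location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0
EOF

# Only attempt the load where crmsh is actually installed.
if command -v crm >/dev/null 2>&1; then
    crm configure load update "$CFG"
else
    echo "crm not found; configuration written to $CFG"
fi
```

One `load update` call patches the CIB once, so the epoch can only move forward between read and write by an external change, not by the script itself.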
Re: [Pacemaker] authentication in the cluster
Hi Chrissie,

On Thu, Jan 29, 2015 at 11:44 AM, Christine Caulfield ccaul...@redhat.com wrote:
> as corosync rejects the messages with the wrong authkey

And I suppose that it is enough just to use the crypto_hash option. Am I right?

Thank you,
Kostya
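For reference, the totem options in question look like this (a sketch based on corosync 2.x; with crypto_cipher left at none, totem messages are authenticated against /etc/corosync/authkey but not encrypted):

```
totem {
    # HMAC-authenticate totem messages using /etc/corosync/authkey.
    crypto_hash: sha256
    # none = authentication only; set e.g. aes256 to also encrypt payloads.
    crypto_cipher: none
}
```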
Re: [Pacemaker] authentication in the cluster
Chrissie, one more thing.

I also suppose that the key is read only once, at Corosync start-up. In other words, there is no check for the presence of the key, or for updates to it, while Corosync is running. Am I right?

Thank you,
Kostya

On Thu, Feb 5, 2015 at 2:42 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:
> Hi Chrissie,
>
>> as corosync rejects the messages with the wrong authkey
>
> And I suppose that it is enough just to use the crypto_hash option. Am I right?
Re: [Pacemaker] authentication in the cluster
Chrissie, thank you! =) It helps!

Thank you,
Kostya

On Thu, Feb 5, 2015 at 3:59 PM, Christine Caulfield ccaul...@redhat.com wrote:
> On 02/05/2015 01:47 PM, Kostiantyn Ponomarenko wrote:
>> I also suppose that the key is read only once, at Corosync start-up. In other words, there is no check for the presence of the key, or for updates to it, while Corosync is running. Am I right?
>
> That's correct, yes :)
>
> Chrissie
[Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration
Hi guys,

I saw this while applying the configuration using a script with crmsh commands:

  + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
  + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params delay=10
  + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
  + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0
  Call cib_apply_diff failed (-205): Update was older than existing configuration
  ERROR: could not patch cib (rc=205)
  INFO: offending xml diff:
  <diff format="2">
    <version>
      <source admin_epoch="0" epoch="26" num_updates="3"/>
      <target admin_epoch="0" epoch="27" num_updates="3"/>
    </version>
    <change operation="modify" path="/cib">
      <change-list>
        <change-attr name="epoch" operation="set" value="27"/>
      </change-list>
      <change-result>
        <cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="27" num_updates="3" admin_epoch="0" cib-last-written="Thu Feb 5 14:56:09 2015" have-quorum="1" dc-uuid="1"/>
      </change-result>
    </change>
    <change operation="create" path="/cib/configuration/constraints" position="1">
      <rsc_location id="dont_run_STONITH_node-0_on_node-0" rsc="STONITH_node-0" score="-INFINITY" node="node-0"/>
    </change>
  </diff>

After that, pacemaker stopped on the node on which the script was run.

Thank you,
Kostya
Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration
Here is a part of my /var/log/cluster/corosync.log:

  Feb 05 16:23:36 [3184] isis-seth943f cib: info: cib_process_ping: Reporting our current digest to node-1: 71d36e21c5df77a4e8b10b29f2a349e2 for 0.70.16 (0x2474c50 0)
  Feb 05 16:23:41 [3185] isis-seth943f stonithd: notice: remote_op_done: Operation reboot of node-0 by node-1 for crmd.3201@node-1.58027587: OK
  Feb 05 16:23:41 [3189] isis-seth943f crmd: crit: tengine_stonith_notify: We were alegedly just fenced by node-1 for node-1!
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: Diff: --- 0.70.16 2
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: Diff: +++ 0.70.17 (null)
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: + /cib: @num_updates=17
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: + /cib/status/node_state[@id='1']: @crm-debug-origin=send_stonith_update, @join=down, @expected=down
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node-1/crmd/91, version=0.70.17)
  Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: info: cancel_recurring_action: Cancelling operation dmdh_monitor_3
  Feb 05 16:23:41 [3182] isis-seth943f pacemakerd: error: pcmk_child_exit: Child process crmd (3189) exited: Network is down (100)
  Feb 05 16:23:41 [3182] isis-seth943f pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
  Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: warning: qb_ipcs_event_sendv: new_event_notification (3186-3189-7): Bad file descriptor (9)
  Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: warning: send_client_notify: Notification of client crmd/437207d3-a709-48d0-b544-94e2158ea191 failed
  Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: info: cancel_recurring_action: Cancelling operation sm0dh_monitor_3
  Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: warning: send_client_notify: Notification of client crmd/437207d3-a709-48d0-b544-94e2158ea191 failed
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: Diff: --- 0.70.17 2
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: Diff: +++ 0.70.18 (null)
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: -- /cib/status/node_state[@id='1']/lrm[@id='1']
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_perform_op: + /cib: @num_updates=18
  Feb 05 16:23:41 [3184] isis-seth943f cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='node-0']/lrm: OK (rc=0, origin=node-1/crmd/92, version=0.70.18)
  Feb 05 16:23:41 [3182] isis-seth943f pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker

Thank you,
Kostya

On Thu, Feb 5, 2015 at 6:15 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:
> Hi guys,
> I saw this while applying the configuration using a script with crmsh commands:
> + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
> + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params delay=10
> + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
> + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0
> Call cib_apply_diff failed (-205): Update was older than existing configuration
> ERROR: could not patch cib (rc=205)
> [...]
> After that pacemaker stopped on the node on which the script was run.
[Pacemaker] rrp_mode in corosync.conf
Hi all,

I've been looking for a good answer to my question, but all the information I have found is ambiguous. I hope to get a good answer here =)

The only description of the active and passive modes I found is:

  Active: both rings are active and in use.
  Passive: only one of the two rings is in use; the second one is used only if the first one fails.

There is no description of how this works or what the impact is. So my general question is: how are the rings used in active and passive modes?

Thank you,
Kostya
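For context, the mode is selected in the totem section of corosync.conf, next to the per-node ringX_addr definitions. As I understand the totem RRP implementation (hedged sketch, not an authoritative answer to the question above):

```
totem {
    version: 2
    # passive: messages go over one ring at a time; on a ring fault,
    #          traffic fails over to the other ring.
    # active:  messages are sent over all rings in parallel, which
    #          costs bandwidth but avoids a failover pause.
    rrp_mode: passive
}
```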
[Pacemaker] authentication in the cluster
Hi all,

Here is the situation: there are two two-node clusters. They have completely identical configuration. Nodes in the clusters are connected directly, without any switches. Here is a part of the corosync.conf file:

  totem {
      version: 2
      cluster_name: mycluster
      transport: udpu
      crypto_hash: sha256
      crypto_cipher: none
      rrp_mode: passive
  }

  nodelist {
      node {
          name: node-a
          nodeid: 1
          ring0_addr: 169.254.0.2
          ring1_addr: 169.254.1.2
      }
      node {
          name: node-b
          nodeid: 2
          ring0_addr: 169.254.0.3
          ring1_addr: 169.254.1.3
      }
  }

The only difference between the two clusters is the authentication key (/etc/corosync/authkey); it is different for each cluster.

QUESTION: what will the behavior be if the connections get mixed up as follows:

  ring1_addr of node-a (cluster-A) is connected to ring1_addr of node-b (cluster-B)
  ring1_addr of node-a (cluster-B) is connected to ring1_addr of node-b (cluster-A)

I attached a picture which shows the connections. My actual goal: do not let the clusters work in such a case. To achieve it, I decided to use the authentication key mechanism. But I don't know what happens in the situation I described.

Thank you,
Kostya
Re: [Pacemaker] authentication in the cluster
Hi Chrissie,

I know that this setup is a crazy thing =) First of all I need to say: think about each two-node cluster as one box with two nodes.

> You can't connect clusters together like that.

I know that.

> All nodes in the cluster have just 1 authkey file.

That is true. But in this example there are two clusters, each of which has its own auth key.

> What you have there is not a ring, it's err, a linked-cross?!

Yep, I showed the wrong way of connecting two clusters.

> Why do you need to connect the two clusters together - is it for failover?

No, it is not. I really don't (and won't) connect them that way. It is wrong. But in real life those two clusters will be standing pretty close to each other (physically, in the same room, in the same rack). And my concern is: what if someone makes that connection by mistake? What happens then?

What I would like to have in that situation is something which prevents two nodes of one cluster from working simultaneously, because that would cause data corruption.

The situation is pretty simple when there is only one ring_addr defined per node. In that case, if someone cross-links two separate clusters, it leads to four clusters, each of which is missing one node, because the two connected nodes have different auth keys and therefore will not see each other even when there is a connection. STONITH always works within the same cluster, so STONITH will reboot the other node in the cluster. That prevents simultaneous access to the data.

I tried to do my best in describing the situation, the problem, and the question. Looking forward to hearing any suggestions =)

Thank you,
Kostya
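The per-cluster key isolation that this scheme relies on can be sanity-checked before deployment. A sketch (corosync-keygen is the real tool for producing /etc/corosync/authkey; the `dd` stand-in and the file names here are only so the example runs anywhere):

```shell
#!/bin/sh
# Generate a distinct 128-byte authkey per cluster and refuse to
# proceed if the two clusters accidentally ended up with the same key.
set -e

gen_key() {
    # In production: run `corosync-keygen` on one node of each cluster.
    # Here we draw 128 random bytes ourselves so the sketch is standalone.
    dd if=/dev/urandom of="$1" bs=128 count=1 2>/dev/null
    chmod 0400 "$1"
}

gen_key cluster-A.authkey
gen_key cluster-B.authkey

# A shared key would let cross-linked nodes authenticate each other's
# totem traffic, defeating the isolation.
if cmp -s cluster-A.authkey cluster-B.authkey; then
    echo "ERROR: both clusters share one authkey" >&2
    exit 1
fi
echo "keys differ; safe to deploy one per cluster"
```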
Re: [Pacemaker] Corosync fails to start when NIC is absent
Got it. Thank you =) I just thought about the possibility of a NIC burning out.

Thank you,
Kostya

On Tue, Jan 20, 2015 at 10:50 AM, Jan Friesse jfrie...@redhat.com wrote:
> Kostiantyn,
>
>> One more thing to clarify. You said "rebind can be avoided" - what does it mean?
>
> By that I mean that as long as you don't shut down the interface, everything will work as expected. Interface shutdown is an administrator decision; the system doesn't do it automagically :)
>
> Regards,
> Honza

On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:
> Thank you. Now I am aware of it.

On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:
>> Honza, thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system?
>
> You are right. There is no defined behavior.

On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote:
>> According to https://access.redhat.com/solutions/638843, the interface that is defined in corosync.conf must be present in the system (see the bottom of the article, section ROOT CAUSE). To confirm that, I made a couple of tests. Here is a part of the corosync.conf file (in free-write form; the original config file is also attached):
>>
>>   rrp_mode: passive
>>   ring0_addr is defined in corosync.conf
>>   ring1_addr is defined in corosync.conf
>>
>> --- Two-node cluster ---
>>
>> Test #1: the IP for ring0 is not defined in the system.
>> Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs:
>>   Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined
>>   Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343.
>> Result: Corosync and Pacemaker are not running.
>>
>> Test #2: the IP for ring1 is not defined in the system.
>> Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from corosync:
>>   Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY
>>   Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically recovered ring 0
>> Result: Corosync and Pacemaker are not running.
>>
>> Test #3: rrp_mode: active leads to the same result, except the Corosync and Pacemaker init scripts return status "running". But /var/log/cluster/corosync.log still shows a lot of errors like:
>>   Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>> Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database, and half of Pacemaker's services are not running (including the Cluster Information Base (CIB)).
>>
>> --- Single-node mode ---
>>
>> The IP for ring0 is not defined in the system: Corosync fails to start.
>> The IP for ring1 is not defined in the system: Corosync and Pacemaker start. The configuration may be applied successfully (roughly 50% of the time); the cluster may not run any resources; the node may fail to be put into standby mode (shows: communication error); or the cluster may run all resources, but the applied configuration is not guaranteed to be fully loaded (some rules can be missing).
>>
>> --- Conclusions ---
>>
>> In some rare cases (see the comments to the bug) the cluster will work, but its working state is unstable and the cluster can stop working at any moment. So, is this correct? Do my assumptions make any sense? I didn't find any other explanation on the net.
>
> Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (that would make corosync unusable for physical NIC/switch or cable failure), but they must be up and have a correct IP. When this is not the case, corosync rebinds to localhost and weird things happen. Removal of this rebinding is a long-time TODO, but there are still more important bugs (especially because the rebind can be avoided).
>
> Regards,
> Honza
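Since corosync has no defined behavior when a ring address is absent, a pre-flight check before starting it can catch the condition Honza describes. A sketch (the use of `ip -o addr` and the sed pattern are assumptions; the sample file below stands in for a real /etc/corosync/corosync.conf):

```shell
#!/bin/sh
# Pre-flight check: extract every ringX_addr from a corosync.conf-style
# file and warn if an address is not configured on a local interface.
set -e

CONF=sample-corosync.conf

# Stand-in config so the sketch is self-contained; point CONF at
# /etc/corosync/corosync.conf for real use.
cat > "$CONF" <<'EOF'
nodelist {
    node {
        ring0_addr: 169.254.0.2
        ring1_addr: 169.254.1.2
    }
}
EOF

# Keep only the address part of each "ringX_addr: <ip>" line.
ADDRS=$(sed -n 's/^[[:space:]]*ring[0-9]_addr:[[:space:]]*//p' "$CONF")
echo "ring addresses found: $ADDRS"

for a in $ADDRS; do
    # `ip -o addr` prints one address per line; -w avoids partial matches.
    if command -v ip >/dev/null 2>&1 && ! ip -o addr 2>/dev/null | grep -qw "$a"; then
        echo "WARNING: $a is not configured on any local interface" >&2
    fi
done
```

Running this from an init wrapper before `corosync` starts turns the undefined rebind-to-localhost behavior into an explicit, loggable warning.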
Re: [Pacemaker] Corosync fails to start when NIC is absent
One more thing to clarify. You said "rebind can be avoided" - what does it mean?

Thank you,
Kostya

On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:
> Thank you. Now I am aware of it.

On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:
>> Honza, thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system?
>
> You are right. There is no defined behavior.
>
> [...]
Re: [Pacemaker] pacemaker-remote debian wheezy
Thomas,

There was a need for me to run the latest cluster stuff on Debian 7, so I created a document for myself to use. I don't claim this doc to be the best way to go, but it works for me. I hope it will work for you as well. Here is the doc's content:

Software:
  - libqb 0.17.0
  - corosync 2.3.3
  - cluster-glue 1.0.12
  - resource-agents 3.9.5
  - pacemaker 1.1.12
  - crmsh 2.1.0

IMPORTANT: do this installation step-by-step as written here; the order is significant.

Pre-configuration:
  $ sudo apt-get install build-essential
  $ sudo apt-get install automake autoconf
  $ sudo apt-get install libtool
  $ sudo apt-get install pkg-config

LIBQB (needed by Corosync)
https://github.com/ClusterLabs/libqb/releases
  $ echo 0.17.0 > .tarball-version
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

COROSYNC
https://github.com/corosync/corosync/releases
  $ sudo apt-get install libnss3-dev
  $ echo 2.3.3 > .tarball-version
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

CLUSTER-GLUE (node fencing plugins, an error reporting utility, and other reusable cluster components from the Linux-HA project)
http://hg.linux-ha.org/glue/archive/glue-1.0.12.tar.bz2
  $ sudo apt-get install libaio-dev
  (!) install the dependencies for pacemaker (below) before proceeding
  $ ./autogen.sh
  $ ./configure --enable-fatal-warnings=no
  $ make
  $ sudo make install

RESOURCE-AGENTS (combined repository of OCF agents from the RHCS and Linux-HA projects)
https://github.com/ClusterLabs/resource-agents/releases
  $ echo 3.9.5 > .tarball-version
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

PACEMAKER
https://github.com/ClusterLabs/pacemaker/releases
  $ sudo apt-get install uuid-dev
  $ sudo apt-get install libglib2.0-dev
  $ sudo apt-get install libxml2-dev
  $ sudo apt-get install libxslt1-dev
  $ sudo apt-get install libbz2-dev
  $ sudo apt-get install libncurses5-dev
  $ sudo addgroup --system haclient
  $ sudo adduser --system --no-create-home --ingroup haclient hacluster
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

CRMSH
https://github.com/crmsh/crmsh/releases
  $ sudo apt-get install python-lxml
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

Thank you,
Kostya

On Thu, Jan 15, 2015 at 5:44 PM, Thomas Manninger dbgtmas...@gmx.at wrote:
> Hi,
>
> I also compiled pacemaker-mgmt. I can start hb_gui, but I have no server daemon? I used git://github.com/ClusterLabs/pacemaker-mgmt.git as the source. Is the server in another repo?
> I used:
>   ./ConfigureMe configure
>   ./ConfigureMe make
>   checkinstall --fstrans=no ./ConfigureMe install
>
> regards
> thomas
>
> Sent: Thursday, 15 January 2015 at 15:16
> From: Ken Gaillot kgail...@redhat.com
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] pacemaker-remote debian wheezy
>
> On 01/15/2015 08:18 AM, Kristoffer Grönlund wrote:
>> Thomas Manninger dbgtmas...@gmx.at writes:
>>> Hi,
>>> I compiled the latest libqb, corosync and pacemaker from source. Now there is no crm command available? Is there another standard shell? Should I use crmadmin?
>>> Thanks!
>>> Regards, Thomas
>>
>> You can get crmsh and build it from source at crmsh.github.io, or try the .rpm packages for various distributions here:
>> https://build.opensuse.org/package/show/network:ha-clustering:Stable/crmsh
>
> Congratulations on getting that far, that's probably the hardest part :-)
>
> The crm shell was part of the pacemaker packages in Debian squeeze. It was going to be separated into its own package for jessie, but that hasn't made it out of sid/unstable yet, so it might not make it into the final release. Since you've built everything else from source, that's probably easiest, but if you want to try ...
>
> For the rpm mentioned above, have a look at alien (https://wiki.debian.org/Alien). crmsh is a standalone package, so hopefully it would work; I wouldn't try alien for something as complicated as all the rpms that go into a pacemaker install.
>
> You could try backporting the sid package https://packages.debian.org/source/sid/crmsh but I suspect the dependencies would get you.
>
> In theory the crm binary from the squeeze packages should work with the newer pacemaker, if you can straighten out the library dependencies. Or you can use the crm*/cib* command-line tools that come with pacemaker if you don't mind the lower-level approach.
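The apt-get dependencies scattered through the build doc above can be gathered into one helper. A sketch (package names are taken from the doc itself; the script only prints the command unless you pass --run, so it is safe to inspect first):

```shell
#!/bin/sh
# Collect the build dependencies from the Debian 7 build doc into one
# apt-get invocation. Dry-run by default: prints the command it would run.
set -e

COMMON="build-essential automake autoconf libtool pkg-config"
COROSYNC_DEPS="libnss3-dev"
GLUE_DEPS="libaio-dev"
PACEMAKER_DEPS="uuid-dev libglib2.0-dev libxml2-dev libxslt1-dev libbz2-dev libncurses5-dev"
CRMSH_DEPS="python-lxml"

CMD="sudo apt-get install -y $COMMON $COROSYNC_DEPS $GLUE_DEPS $PACEMAKER_DEPS $CRMSH_DEPS"

if [ "$1" = "--run" ]; then
    $CMD
else
    echo "dry run; would execute:"
    echo "$CMD"
fi
```

Installing everything up front also avoids the cluster-glue pitfall noted in the doc (its configure step wants the pacemaker dependencies already present).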
Re: [Pacemaker] pacemaker-remote debian wheezy
Hi Thomas,

I don't remember starting from which version of Pacemaker crmsh is no longer included in it; it now ships as an independent project. You can get it back. Here is the link: https://github.com/crmsh/crmsh/ . Build and install =)

Thank you,
Kostya

On Thu, Jan 15, 2015 at 3:18 PM, Kristoffer Grönlund kgronl...@suse.com wrote:
> Thomas Manninger dbgtmas...@gmx.at writes:
>> Hi,
>> I compiled the latest libqb, corosync and pacemaker from source. Now there is no crm command available? Is there another standard shell? Should I use crmadmin?
>> Thanks!
>> Regards, Thomas
>
> You can get crmsh and build it from source at crmsh.github.io, or try the .rpm packages for various distributions here:
> https://build.opensuse.org/package/show/network:ha-clustering:Stable/crmsh
>
> Best regards,
> Kristoffer
Re: [Pacemaker] Corosync fails to start when NIC is absent
Thank you. Now I am aware of it. Thank you, Kostya On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, Honza, Thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system? You are right. There is no defined behavior. Regards, Honza Thank you, Kostya On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, According to the https://access.redhat.com/solutions/638843 , the interface, that is defined in the corosync.conf, must be present in the system (see at the bottom of the article, section ROOT CAUSE). To confirm that I made a couple of tests. Here is a part of the corosync.conf file (in a free-write form) (also attached the origin config file): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defines in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from the corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except Corosync and Pacemaker init scripts return status running. 
But /var/log/cluster/corosync.log still shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of Pacemaker's services are not running (including the Cluster Information Base (CIB)). --- For a single-node mode --- IP for ring0 is not defined in the system: Corosync fails to start. IP for ring1 is not defined in the system: Corosync and Pacemaker are started. It is possible that the configuration will be applied successfully (50%); it is possible that the cluster is not running any resources; it is possible that the node cannot be put in standby mode (shows: communication error); and it is possible that the cluster is running all resources but the applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working at any moment. So, is that correct? Do my assumptions make any sense? I couldn't find any other explanation on the net ... . Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (that would make corosync unusable for physical NIC/switch or cable failure), but they must be up and have the correct IP. When this is not the case, corosync rebinds to localhost and weird things happen. Removal of this rebinding has long been a TODO, but there are still more important bugs (especially because the rebind can be avoided). Regards, Honza Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if a network interface defined in corosync.conf is not configured in the system. 
Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya
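For readers following the thread, the kind of redundant-ring setup being tested looks roughly like the fragment below. This is a sketch, not the poster's attached file: node IDs and addresses are placeholders, and the exact sections depend on the Corosync version.

```
# Sketch of a corosync.conf redundant-ring (RRP) setup as discussed
# above; addresses and node IDs are illustrative placeholders.
totem {
    version: 2
    rrp_mode: passive
}
nodelist {
    node {
        nodeid: 1
        ring0_addr: 192.168.1.1
        ring1_addr: 192.168.2.1
    }
    node {
        nodeid: 2
        ring0_addr: 192.168.1.2
        ring1_addr: 192.168.2.2
    }
}
```

Per Honza's explanation in the thread, every interface named here must be up with the correct IP on each node at start and during runtime, even with rrp_mode: passive.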
Re: [Pacemaker] Corosync fails to start when NIC is absent
Honza, Thank you for helping me. So, there is no defined behavior in case one of the interfaces is not in the system? Thank you, Kostya On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote: Kostiantyn, According to https://access.redhat.com/solutions/638843 , the interface that is defined in corosync.conf must be present in the system (see the bottom of the article, section ROOT CAUSE). To confirm that I made a couple of tests. Here is a part of the corosync.conf file (in free-write form) (the original config file is also attached): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except the Corosync and Pacemaker init scripts return status running. 
But /var/log/cluster/corosync.log still shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of Pacemaker's services are not running (including the Cluster Information Base (CIB)). --- For a single-node mode --- IP for ring0 is not defined in the system: Corosync fails to start. IP for ring1 is not defined in the system: Corosync and Pacemaker are started. It is possible that the configuration will be applied successfully (50%); it is possible that the cluster is not running any resources; it is possible that the node cannot be put in standby mode (shows: communication error); and it is possible that the cluster is running all resources but the applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working at any moment. So, is that correct? Do my assumptions make any sense? I couldn't find any other explanation on the net ... . Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (that would make corosync unusable for physical NIC/switch or cable failure), but they must be up and have the correct IP. When this is not the case, corosync rebinds to localhost and weird things happen. Removal of this rebinding has long been a TODO, but there are still more important bugs (especially because the rebind can be avoided). Regards, Honza Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if a network interface defined in corosync.conf is not configured in the system. 
Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya
Re: [Pacemaker] Corosync fails to start when NIC is absent
According to https://access.redhat.com/solutions/638843 , the interface that is defined in corosync.conf must be present in the system (see the bottom of the article, section ROOT CAUSE). To confirm that I made a couple of tests. Here is a part of the corosync.conf file (in free-write form) (the original config file is also attached): === rrp_mode: passive ring0_addr is defined in corosync.conf ring1_addr is defined in corosync.conf === --- Two-node cluster --- Test #1: -- IP for ring0 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync fails to start. From the logs: Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343. Result: Corosync and Pacemaker are not running. Test #2: -- IP for ring1 is not defined in the system: -- Start Corosync simultaneously on both nodes. Corosync starts. Start Pacemaker simultaneously on both nodes. Pacemaker fails to start. From the logs, the last writes from corosync: Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0 Result: Corosync and Pacemaker are not running. Test #3: rrp_mode: active leads to the same result, except the Corosync and Pacemaker init scripts return status running. But /var/log/cluster/corosync.log still shows a lot of errors like: Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2) Result: Corosync and Pacemaker show their statuses as running, but crm_mon cannot connect to the cluster database. And half of Pacemaker's services are not running (including the Cluster Information Base (CIB)). --- For a single-node mode --- IP for ring0 is not defined in the system: Corosync fails to start. 
IP for ring1 is not defined in the system: Corosync and Pacemaker are started. It is possible that the configuration will be applied successfully (50%); it is possible that the cluster is not running any resources; it is possible that the node cannot be put in standby mode (shows: communication error); and it is possible that the cluster is running all resources but the applied configuration is not guaranteed to be fully loaded (some rules can be missed). --- Conclusions: --- It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working at any moment. So, is that correct? Do my assumptions make any sense? I couldn't find any other explanation on the net ... . Thank you, Kostya On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Corosync fails to start if a network interface defined in corosync.conf is not configured in the system. Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya
[Pacemaker] Corosync fails to start when NIC is absent
Hi guys, Corosync fails to start if a network interface defined in corosync.conf is not configured in the system. Even with rrp_mode: passive the problem is the same when at least one network interface is not configured in the system. Is this the expected behavior? I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong? Thank you, Kostya
[Pacemaker] Only build crm_uuid when supporting heartbeat
Hi guys, Could you help me understand the reason to build crm_uuid only when supporting heartbeat? My situation is the following: I am trying to build Pacemaker's sources for Debian. I built the sources from the Debian Wheezy repo (removed --with-heartbeat from it). Then I did the same using sources from Debian Sid, and I noticed that crm_uuid is missing. What is the reason to build Pacemaker --with-heartbeat? I thought that having Corosync is sufficient for Pacemaker. Thank you, Kostya
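For context, the Corosync-only build described here amounts to a configure invocation along these lines. The exact flag spellings vary between Pacemaker releases, so treat this as a sketch and check `./configure --help` for your source tree.

```
# Sketch: configuring a Pacemaker source build against Corosync only,
# without heartbeat support (flag names may differ per release --
# verify with ./configure --help).
./autogen.sh
./configure --with-corosync --without-heartbeat
make
sudo make install
```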
Re: [Pacemaker] Only build crm_uuid when supporting heartbeat
Fair enough, thank you! Thank you, Kostya On Mon, Sep 22, 2014 at 8:38 PM, Andrew Beekhof and...@beekhof.net wrote: On 22 Sep 2014, at 11:50 pm, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, Could you help me understand the reason to build crm_uuid only when supporting heartbeat? Because only heartbeat cares about /var/lib/heartbeat/hb_uuid, which is what crm_uuid is designed to read/write. My situation is the following: I am trying to build Pacemaker's sources for Debian. I built the sources from the Debian Wheezy repo (removed --with-heartbeat from it). Then I did the same using sources from Debian Sid, and I noticed that crm_uuid is missing. What is the reason to build Pacemaker --with-heartbeat? I thought that having Corosync is sufficient for Pacemaker. Thank you, Kostya
[Pacemaker] testquorum module
Hi Chrissie, You mentioned testquorum in your doc (Whatever happened to cman) as a good place to look if you are thinking about writing your own quorum module. The only file I found in the corosync code is in the /test folder, and I think it's not the module. Could you please point me to where I can find this module? Thank you, Kostya
Re: [Pacemaker] What is the cman package for ubuntu 13.10
Hi Vijay B, I have 2 Debian machines with the latest Corosync and Pacemaker. I wanted the latest versions of these packages, so I didn't use apt-get install corosync pacemaker. Instead I downloaded the sources, built them and installed them. I have a document with all the steps I did to get it working. If you still need some help here, just write me back. Thank you, Kostya On Wed, Jun 25, 2014 at 3:33 AM, Digimer li...@alteeve.ca wrote: I can't speak to the installation bits, I don't use Ubuntu/Debian myself, but once installed, this guide should apply: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html Cheers On 24/06/14 07:40 PM, Vijay B wrote: Hi Digimer, Thanks for the info - do you happen to have a cheat sheet or instructions for corosync/pacemaker setup on ubuntu 14.04? I can give it a try.. Regards, Vijay On Tue, Jun 24, 2014 at 3:52 PM, Digimer li...@alteeve.ca wrote: I don't know about 13, but on Ubuntu 14.04 LTS, they're up to corosync 2 and pacemaker 1.1.10, which do not need cman at all. On 24/06/14 06:17 PM, Vijay B wrote: Hi Emmanuel! Thanks for getting back to me on this. Turns out that the corosync package is already installed on both nodes: user@pmk1:~$ sudo apt-get install corosync [sudo] password for user: Reading package lists... Done Building dependency tree Reading state information... Done corosync is already the newest version. corosync set to manually installed. 0 upgraded, 0 newly installed, 0 to remove and 338 not upgraded. user@pmk1:~$ So I think it's something else that needs to be done.. any ideas? 
Thanks, Regards, Vijay On Tue, Jun 24, 2014 at 3:02 PM, emmanuel segura emi2f...@gmail.com wrote: did you try apt-get install corosync pacemaker 2014-06-24 23:50 GMT+02:00 Vijay B os.v...@gmail.com: Hi, This is my first time using/trying to set up pacemaker with corosync and I'm trying to set it up on 2 ubuntu 13.10 VMs. I'm following the instructions in this link - http://clusterlabs.org/quickstart-ubuntu.html But when I attempt to install the cman package, I see this error: user@pmk2:~$ sudo apt-get install cman Reading package lists... Done Building dependency tree Reading state information... Done Package cman is not available, but is referred to by another package. This may mean that the package is missing, has been obsoleted, or is only available from another source However the following packages replace it: fence-agents:i386 dlm:i386 fence-agents dlm E: Package 'cman' has no installation candidate user@pmk2:~$ So it says that fence-agents replaces the obsolete cman package, but then I don't know what the new equivalent service for cman is, or what the config files are. I don't see an /etc/default/cman file, and even if I add one manually with a single line in it: QUORUM_TIMEOUT=0 (for a two node cluster) it doesn't seem to be used anywhere. Also, since there is no cman service as such, this is what happens: user@pmk2:~$ sudo service cman status cman: unrecognized service user@pmk2:~$ If I try to ignore the above error and simply attempt to start up the pacemaker service, I see this: user@pmk2:~$ sudo crm status Could not establish cib_ro connection: Connection refused (111) Connection to cluster failed: Transport endpoint is not connected user@pmk2:~$
Re: [Pacemaker] configuration variants for 2 node cluster
Hi Chrissie, But wait_for_all doesn't help when there is no connection between the nodes. Because in case I need to reboot the remaining working node, I won't get a working cluster after that - both nodes will be waiting for connection between them. That's why I am looking for a solution which could help me get one node working in this situation (after reboot). I've been thinking about some kind of marker which could help a node determine the state of the other node. Like an external disk and a SCSI reservation command. Maybe you could suggest another kind of marker? I am not sure whether we can use the presence of a file on an external SSD as the marker. Kind of: if there is a file - the other node is alive; if not - the node is dead. Digimer, Thanks for the links and information. Anyway, if I go this way, I will write my own daemon to determine the state of the other node. Also the information about the fence loop is new to me, thanks =) Thank you, Kostya On Tue, Jun 24, 2014 at 10:55 AM, Christine Caulfield ccaul...@redhat.com wrote: On 23/06/14 15:49, Digimer wrote: Hi Kostya, I'm having a little trouble understanding your question, sorry. On boot, the node will not start anything, so after booting it, you log in, check that it can talk to the peer node (a simple ping is generally enough), then start the cluster. It will join the peer's existing cluster (even if it's a cluster on just itself). If you booted both nodes, say after a power outage, you will check the connection (again, a simple ping is fine) and then start the cluster on both nodes at the same time. wait_for_all helps with most of these situations. If a node goes down then it won't start services until it's seen the non-failed node, because wait_for_all prevents a newly rebooted node from doing anything on its own. This also takes care of the case where both nodes are rebooted together, of course, because that's the same as a new start. 
Chrissie If one of the nodes needs to be shut down, say for repairs or upgrades, you migrate the services off of it and over to the peer node, then you stop the cluster (which tells the peer that the node is leaving the cluster). After that, the remaining node operates by itself. When you turn it back on, you rejoin the cluster and migrate the services back. I think, maybe, you are looking at things more complicated than you need to. Pacemaker and corosync will handle most of this for you, once set up properly. What operating system do you plan to use, and what cluster stack? I suspect it will be corosync + pacemaker, which should work fine. digimer On 23/06/14 10:36 AM, Kostiantyn Ponomarenko wrote: Hi Digimer, Suppose I disabled the cluster on start-up, but what about the remaining node, if I need to reboot it? So, even in case of connection loss between these two nodes I need to have one node working and providing resources. How did you solve this situation? Should it be a separate daemon which somehow checks the connection between the two nodes and decides to run corosync and pacemaker or to keep them down? Thank you, Kostya On Mon, Jun 23, 2014 at 4:34 PM, Digimer li...@alteeve.ca wrote: On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote: Hi guys, I want to gather all possible configuration variants for a 2-node cluster, because it has a lot of pitfalls and there is not a lot of information across the internet about it. And also I have some questions about configurations and their specific problems. VARIANT 1: - We can use the two_node and wait_for_all options from Corosync's votequorum, and set up fencing agents with a delay on one of them. Here is a workflow (diagram) of this configuration: 1. Node start. 2. Cluster start (Corosync and Pacemaker) at boot time. 3. Wait for all nodes. All nodes joined? No. Go to step 3. Yes. Go to step 4. 4. Start resources. 5. Split-brain situation (something with connection between nodes). 6. 
Fencing agent on one of the nodes reboots the other node (there is a configured delay on one of the fencing agents). 7. Rebooted node goes to step 1. There are two (or more?) important things in this configuration: 1. The rebooted node remains waiting for all nodes to be visible (connection should be restored). 2. Suppose the connection problem still exists and the node which rebooted the other guy has to be rebooted also (for some reason). After reboot he is also stuck on step 3 because of the connection problem. QUESTION: - Is it possible somehow to assign to the guy who won the reboot race (rebooted other guy) a status like a primary and allow him
Re: [Pacemaker] configuration variants for 2 node cluster
Chrissie, I don't want to reinvent a quorum disk =) I know about its complexity. That's why I think the most reasonable decision for me is to wait till Corosync 2 gets a quorum disk :) But meanwhile I need to deal with my situation somehow. So, the possible solution for me is creating a daemon which will start the cluster stack based on some circumstances. Here is how I see it (any improvements are appreciated): The marker: SCSI reservation of an SSD IMPORTANT: The daemon should distinguish which node the marker belongs to. QUESTION: What other markers could be used? -- Main workflow: -- 1. Node start 2. Daemon start 2.1. Check the marker. Is the marker present? NO: 2.1.1. Set the marker. Successful? NO: Do nothing. (Go to 2.1 and repeat it a few times). YES: Start the cluster stack. YES: 2.1.2. Ping the other node. Successful? NO: Do nothing: the other node is probably (99%) on. YES: Remove the marker. Start the cluster stack.[*] P.S.: In case the cluster can't establish a connection with the other node, the fencing agent on this node is triggered and will fence the other node (there can be a fence loop, but we can minimize the possibility of it[1]). -- Split-brain situation: -- 1. The fencing agent tries to set the marker. Successful? NO: Do nothing: this node is going to be fenced. Meanwhile this node can be put in standby mode while waiting for fencing. YES: STONITH (reboot) the other node. The marker is kept. - Benefits: - Even after reboot, one of the nodes still starts the cluster stack - the one that the marker belongs to. -- Possible problems: -- If the node that the marker belongs to is not working, we need to force-run the cluster stack on the other node. It requires human interaction. = * In case the ping is successful but the cluster doesn't see the other node (is it even possible?) we can do the following: a. The daemon starts Corosync. b. Gets a list of nodes and ensures that the other node is present there. This is the guarantee that the nodes see each other in the cluster. c. Starts Pacemaker. 
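The start-up decision tree in the workflow above can be sketched as a small pure function. This is only an illustration of the proposed logic, not real code: the inputs stand in for the SCSI-reservation check, the reservation attempt, and the ping, all of which would be external commands in a real daemon.

```python
# Hypothetical sketch of the marker-daemon decision logic described in
# the workflow above. The three inputs stand in for external checks
# (SCSI reservation state, reservation attempt, ping) -- names are
# illustrative, not a real API.

def decide(marker_present, reserve_ok, peer_up):
    """Return the action the daemon should take on start-up.

    marker_present -- a marker (SCSI reservation) already exists
    reserve_ok     -- whether our own attempt to set the marker succeeded
    peer_up        -- whether the other node answers a ping
    """
    if not marker_present:
        if reserve_ok:
            return "start-cluster"      # step 2.1.1: we own the marker now
        return "retry"                  # lost the race; recheck 2.1 a few times
    # Marker already present (step 2.1.2): ping the other node.
    if peer_up:
        return "release-and-start"      # peer alive: remove marker, start stack
    return "wait"                       # peer probably on; do nothing
```

Walking the original workflow through this function reproduces its branches one-to-one, which makes the one risky case visible: when the marker's owner is dead and unreachable, the daemon waits and human intervention is required, exactly as noted under "Possible problems".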
Thank you, Kostya On Tue, Jun 24, 2014 at 11:44 AM, Christine Caulfield ccaul...@redhat.com wrote: On 24/06/14 09:36, Kostiantyn Ponomarenko wrote: Hi Chrissie, But wait_for_all doesn't help when there is no connection between the nodes. Because in case I need to reboot the remaining working node, I won't get a working cluster after that - both nodes will be waiting for connection between them. That's why I am looking for a solution which could help me get one node working in this situation (after reboot). I've been thinking about some kind of marker which could help a node determine the state of the other node. Like an external disk and a SCSI reservation command. Maybe you could suggest another kind of marker? I am not sure whether we can use the presence of a file on an external SSD as the marker. Kind of: if there is a file - the other node is alive; if not - the node is dead. More seriously, that solution is harder than it might seem - which is one reason qdiskd was as complex as it became, and why votequorum is as conservative as it is when it comes to declaring a workable cluster. If someone is there to manually reboot nodes then it might be as well for a human decision to be made about which one is capable of running services. Chrissie Digimer, Thanks for the links and information. Anyway, if I go this way, I will write my own daemon to determine the state of the other node. Also the information about the fence loop is new to me, thanks =) Thank you, Kostya On Tue, Jun 24, 2014 at 10:55 AM, Christine Caulfield ccaul...@redhat.com wrote: On 23/06/14 15:49, Digimer wrote: Hi Kostya, I'm having a little trouble understanding your question, sorry. On boot, the node will not start anything, so after booting it, you log in, check that it can talk to the peer node (a simple ping is generally enough), then start the cluster. It will join the peer's existing cluster (even if it's a cluster on just itself). 
If you booted both nodes, say after a power outage, you will check the connection (again, a simple ping is fine) and then start the cluster on both nodes at the same time. wait_for_all helps with most of these situations. If a node goes down then it won't start services until it's seen the non-failed node because wait_for_all prevents a newly rebooted node from doing anything on its own. This also takes care of the case where both nodes are rebooted together of course, because that's the same as a new start. Chrissie
Re: [Pacemaker] configuration variants for 2 node cluster
Digimer, Yes, wait_for_all is a part of votequorum in Corosync v2. Thank you, Kostya On Tue, Jun 24, 2014 at 6:47 PM, Digimer li...@alteeve.ca wrote: On 24/06/14 03:55 AM, Christine Caulfield wrote: On 23/06/14 15:49, Digimer wrote: Hi Kostya, I'm having a little trouble understanding your question, sorry. On boot, the node will not start anything, so after booting it, you log in, check that it can talk to the peer node (a simple ping is generally enough), then start the cluster. It will join the peer's existing cluster (even if it's a cluster on just itself). If you booted both nodes, say after a power outage, you will check the connection (again, a simple ping is fine) and then start the cluster on both nodes at the same time. wait_for_all helps with most of these situations. If a node goes down then it won't start services until it's seen the non-failed node because wait_for_all prevents a newly rebooted node from doing anything on its own. This also takes care of the case where both nodes are rebooted together of course, because that's the same as a new start. Chrissie This isn't available on RHEL 6, is it? iirc, it's a Corosync v2 feature? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
[Pacemaker] configuration variants for 2 node cluster
Hi guys, I want to gather all possible configuration variants for a 2-node cluster, because it has a lot of pitfalls and there is not a lot of information across the internet about it. And also I have some questions about configurations and their specific problems. VARIANT 1: - We can use the two_node and wait_for_all options from Corosync's votequorum, and set up fencing agents with a delay on one of them. Here is a workflow (diagram) of this configuration: 1. Node start. 2. Cluster start (Corosync and Pacemaker) at boot time. 3. Wait for all nodes. All nodes joined? No. Go to step 3. Yes. Go to step 4. 4. Start resources. 5. Split-brain situation (something with connection between nodes). 6. Fencing agent on one of the nodes reboots the other node (there is a configured delay on one of the fencing agents). 7. Rebooted node goes to step 1. There are two (or more?) important things in this configuration: 1. The rebooted node remains waiting for all nodes to be visible (connection should be restored). 2. Suppose the connection problem still exists and the node which rebooted the other guy has to be rebooted also (for some reason). After reboot he is also stuck on step 3 because of the connection problem. QUESTION: - Is it possible somehow to assign to the guy who won the reboot race (rebooted the other guy) a status like a primary and allow him not to wait for all nodes after reboot? And neglect this status after the other node joins this one. So is it possible? Right now that's the only configuration I know for a 2-node cluster. Other variants are very appreciated =) VARIANT 2 (not implemented, just a suggestion): - I've been thinking about using an external SSD drive (or other external drive). So for example a fencing agent can reserve the SSD using a SCSI command and after that reboot the other node. The main idea: the first node, as soon as a cluster starts on it, reserves the SSD till the other node joins the cluster; after that the SCSI reservation is removed. 1. Node start 2. 
Cluster start (Corosync and Pacemaker) at boot time. 3. Reserve the SSD. Did it manage to reserve? No. Don't start resources (Wait for all). Yes. Go to step 4. 4. Start resources. 5. Remove the SCSI reservation when the other node has joined. 5. Split-brain situation (something with connection between nodes). 6. Fencing agent tries to reserve the SSD. Did it manage to reserve? No. Maybe put the node in standby mode ... Yes. Reboot the other node. 7. Optional: a single node can keep the SSD reservation till it is alone in the cluster or till its shutdown. I am really looking forward to finding the best solution (or a couple of them =)). Hope I am not the only person who is interested in this topic. Thank you, Kostya
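For VARIANT 1, the two_node and wait_for_all votequorum options mentioned above translate into a corosync.conf quorum section roughly like this. A sketch for a Corosync 2 setup; two_node implicitly enables wait_for_all, but spelling it out keeps the intent visible:

```
# Sketch of the votequorum settings for a two-node cluster as
# described in VARIANT 1 (Corosync 2.x).
quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}
```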
Re: [Pacemaker] configuration variants for 2 node cluster
Hi Digimer, Suppose I disabled the cluster on start-up, but what about the remaining node, if I need to reboot it? So, even in case of connection loss between these two nodes I need to have one node working and providing resources. How did you solve this situation? Should it be a separate daemon which somehow checks the connection between the two nodes and decides to run corosync and pacemaker or to keep them down? Thank you, Kostya On Mon, Jun 23, 2014 at 4:34 PM, Digimer li...@alteeve.ca wrote: On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote: Hi guys, I want to gather all possible configuration variants for a 2-node cluster, because it has a lot of pitfalls and there is not a lot of information across the internet about it. And also I have some questions about configurations and their specific problems. VARIANT 1: - We can use the two_node and wait_for_all options from Corosync's votequorum, and set up fencing agents with a delay on one of them. Here is a workflow (diagram) of this configuration: 1. Node start. 2. Cluster start (Corosync and Pacemaker) at boot time. 3. Wait for all nodes. All nodes joined? No. Go to step 3. Yes. Go to step 4. 4. Start resources. 5. Split-brain situation (something with connection between nodes). 6. Fencing agent on one of the nodes reboots the other node (there is a configured delay on one of the fencing agents). 7. Rebooted node goes to step 1. There are two (or more?) important things in this configuration: 1. The rebooted node remains waiting for all nodes to be visible (connection should be restored). 2. Suppose the connection problem still exists and the node which rebooted the other guy has to be rebooted also (for some reason). After reboot he is also stuck on step 3 because of the connection problem. QUESTION: - Is it possible somehow to assign to the guy who won the reboot race (rebooted the other guy) a status like a primary and allow him not to wait for all nodes after reboot? And neglect this status after the other node joins this one. 
So is it possible? Right now that's the only configuration I know of for a 2-node cluster. Other variants are very much appreciated =)
VARIANT 2 (not implemented, just a suggestion): - I've been thinking about using an external SSD drive (or another external drive). For example, a fencing agent could reserve the SSD using a SCSI command and after that reboot the other node. The main idea is that the first node, as soon as the cluster starts on it, reserves the SSD until the other node joins the cluster; after that the SCSI reservation is removed.
1. Node starts.
2. Cluster starts (Corosync and Pacemaker) at boot time.
3. Reserve the SSD. Did it manage to reserve? No: don't start resources (wait for all). Yes: go to step 4.
4. Start resources.
5. Remove the SCSI reservation when the other node has joined.
6. Split-brain situation (something wrong with the connection between the nodes).
7. The fencing agent tries to reserve the SSD. Did it manage to reserve? No: maybe put the node in standby mode ... Yes: reboot the other node.
8. Optional: a single node can keep the SSD reservation while it is alone in the cluster, or until its shutdown.
I am really looking forward to finding the best solution (or a couple of them =)). Hope I am not the only person who is interested in this topic. Thank you, Kostya
Hi Kostya, I only build 2-node clusters, and I've not had problems with this going back to 2009, over dozens of clusters. The tricks I found are:
* Disable quorum (of course).
* Set up good fencing, and add a delay to the node you prefer (or pick one at random, if they're of equal value) to avoid dual fences.
* Disable the cluster on start-up, to prevent fence loops.
That's it. With this, your 2-node cluster will be just fine. As for your question: once a node is fenced successfully, the resource manager (Pacemaker) will take over any services lost on the fenced node, if that is how you configured it. A node that either gracefully leaves, or dies and is fenced, should not interfere with the remaining node. The problem is when a node vanishes and fencing fails.
Then, not knowing what the other node might be doing, the only safe option is to block, otherwise you risk a split-brain. This is why fencing is so important. Cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
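Digimer's second trick (a fence delay on the preferred node) can be sketched in crmsh configure syntax as below; the agent name, host names, and the 15-second delay are illustrative assumptions, not taken from the thread:

```
# Hypothetical fence devices; each is banned from running on its own target.
primitive st_node0 stonith:fence_ipmilan \
    params pcmk_host_list=node-0 delay=15
primitive st_node1 stonith:fence_ipmilan \
    params pcmk_host_list=node-1
location st_node0_not_on_node0 st_node0 -inf: node-0
location st_node1_not_on_node1 st_node1 -inf: node-1
```

In a split, node-1's attempt to fence node-0 has to wait out the delay, while node-0 fences node-1 immediately, so node-0 wins the race and a dual fence is avoided.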
Re: [Pacemaker] configuration variants for 2 node cluster
Digimer, I am using Debian as the OS and Corosync + Pacemaker as the cluster stack. I understand your suggestion. I don't have any questions about it. My main question is: how do I do this automatically, so that it can work without human intervention for a while (nodes could be rebooted, but not repaired)? That is my question. The only way I see to do it is: write a daemon which will run Corosync and Pacemaker. But before that, this daemon will check the connection between the two nodes and then decide whether to start the cluster or not, based on that check. Maybe you have some thoughts on how I could do it another way? Instead of pinging, the daemon could start Corosync, get the number of nodes in the cluster, and based on that decide whether to run Pacemaker or not. Thank you, Kostya On Mon, Jun 23, 2014 at 5:49 PM, Digimer li...@alteeve.ca wrote: Hi Kostya, I'm having a little trouble understanding your question, sorry. On boot, the node will not start anything, so after booting it, you log in, check that it can talk to the peer node (a simple ping is generally enough), then start the cluster. It will join the peer's existing cluster (even if it's a cluster of just itself). If you booted both nodes, say after a power outage, you check the connection (again, a simple ping is fine) and then start the cluster on both nodes at the same time. If one of the nodes needs to be shut down, say for repairs or upgrades, you migrate the services off of it and over to the peer node, then you stop the cluster (which tells the peer that the node is leaving the cluster). After that, the remaining node operates by itself. When you turn it back on, you rejoin the cluster and migrate the services back. I think, maybe, you are making things more complicated than they need to be. Pacemaker and Corosync will handle most of this for you, once set up properly. What operating system do you plan to use, and what cluster stack? I suspect it will be Corosync + Pacemaker, which should work fine.
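The daemon idea above can be sketched with the decision logic separated from the side effects. This is only an illustration, not the thread's actual implementation: the peer hostname, the retry counts, and the use of Debian init scripts via `service` are all assumptions.

```python
"""Sketch of a start-up gate: start the cluster stack only once the peer
answers a ping, to avoid a fence loop after reboot. Host and service
names are assumed, not taken from the mailing-list thread."""
import subprocess
import time


def peer_reachable(host):
    # One ping with a 2-second timeout; exit code 0 means the peer answered.
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def start_stack():
    # Assumed init-script names on Debian wheezy.
    subprocess.check_call(["service", "corosync", "start"])
    subprocess.check_call(["service", "pacemaker", "start"])


def gate(is_up, start, tries=30, wait=10.0):
    """Poll `is_up` up to `tries` times; call `start` once it succeeds.

    Returns True if the cluster stack was started, False if we gave up.
    """
    for _ in range(tries):
        if is_up():
            start()
            return True
        time.sleep(wait)
    return False


if __name__ == "__main__":
    gate(lambda: peer_reachable("node-1"), start_stack)
```

Keeping `gate` free of hard-coded commands makes the "start or stay down" decision testable without a real cluster, and the same shape would accommodate Kostya's alternative (start Corosync first, count members, then decide on Pacemaker).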
digimer
[Pacemaker] API documentation
I took a look at the include folders of Pacemaker and Corosync and I didn't find any explanation of the functions there. Did I look in the wrong place? My goal is to manage the cluster from my app, so I don't want to use crmsh or pcs. Any ideas are appreciated.
Re: [Pacemaker] API documentation
Andrew, many thanks! On Jun 18, 2014 2:11 AM, Andrew Beekhof and...@beekhof.net wrote: On 17 Jun 2014, at 8:01 pm, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: I took a look at the include folders of Pacemaker and Corosync and I didn't find any explanation of the functions there. Did I look in the wrong place? My goal is to manage the cluster from my app, so I don't want to use crmsh or pcs. Any ideas are appreciated. For reading cluster state, try: crm_mon --as-xml Or, for the raw config, cibadmin -Q and the relax-ng (.rng) schema files. If calling a binary isn't something you want to do, try looking at the source of those two tools to see how they do it. For making changes, including stopping/starting/moving a resource, you also want cibadmin (or its C API) and the relax-ng schema files.
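Building on Andrew's suggestion, cluster state can be read by parsing the XML from `crm_mon --as-xml` rather than scraping its text output. A minimal sketch follows; the embedded XML is a trimmed, hypothetical sample, and real output should be checked against the relax-ng schemas shipped with Pacemaker:

```python
"""Parse node availability out of (sample) crm_mon XML output."""
import xml.etree.ElementTree as ET


def nodes_online(xml_text):
    """Return the names of nodes that crm_mon reports as online."""
    root = ET.fromstring(xml_text)
    return [n.get("name")
            for n in root.iter("node")
            if n.get("online") == "true"]


# Trimmed, hypothetical fragment of `crm_mon --as-xml` output.
SAMPLE = """<crm_mon>
  <nodes>
    <node name="node-0" online="true" standby="false"/>
    <node name="node-1" online="false" standby="false"/>
  </nodes>
</crm_mon>"""

if __name__ == "__main__":
    # On a live cluster one would instead feed in, e.g.:
    #   subprocess.check_output(["crm_mon", "--as-xml"]).decode()
    print(nodes_online(SAMPLE))
```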
Re: [Pacemaker] votequorum for 2 node cluster
Thank you for the explanation. I got the point. But just to be sure (and maybe someone will find this info helpful), I want to clarify the behavior of these two options. From http://manpages.ubuntu.com/manpages/saucy/man5/votequorum.5.html about the last_man_standing option: NOTES: In order for the cluster to downgrade automatically from a 2-node to a 1-node cluster, the auto_tie_breaker feature must also be enabled. If auto_tie_breaker is not enabled, and one more failure occurs, the remaining node will not be quorate. But this is still roulette: you can only afford to lose the node which doesn't have the lowest nodeid? Am I right? Thank you, Kostya On Thu, Jun 12, 2014 at 12:37 PM, Christine Caulfield ccaul...@redhat.com wrote: On 12/06/14 00:51, Andrew Beekhof wrote: Chrissy? Can you shed some light here? On 11 Jun 2014, at 11:26 pm, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote: Hi guys, I am trying to deal with the split-brain situation in a 2-node cluster using votequorum. Here is the quorum section of my corosync.conf:
    provider: corosync_votequorum
    expected_votes: 2
    wait_for_all: 1
    last_man_standing: 1
    auto_tie_breaker: 1
My question is about the behavior of the remaining node after I shut down the node with the lowest nodeid. My expectation is that after last_man_standing_window this node should be back to working. Or is this not a solution in the case of a two-node cluster? If you want symmetric failure handling in a 2-node cluster, then the two_node option might be more appropriate. auto_tie_breaker and last_man_standing are more useful for larger clusters, where a network split leaves more than one node in a partition.
Chrissie
[Pacemaker] votequorum for 2 node cluster
Hi guys, I am trying to deal with the split-brain situation in a 2-node cluster using votequorum. Here is the quorum section of my corosync.conf:
    provider: corosync_votequorum
    expected_votes: 2
    wait_for_all: 1
    last_man_standing: 1
    auto_tie_breaker: 1
My question is about the behavior of the remaining node after I shut down the node with the lowest nodeid. My expectation is that after last_man_standing_window this node should be back to working. Or is this not a solution in the case of a two-node cluster? Thank you, Kostya
Re: [Pacemaker] votequorum for 2 node cluster
The two_node option is another question of mine; I think it's not for this thread.
last_man_standing: 1
auto_tie_breaker: 1
So, anyway, the only node that will keep working in a split-brain (or one-node-shut-down) situation is the one with the lowest id. And that is like roulette: if we lose the node with the lowest nodeid, we lose everything. So I can only afford to lose the node which doesn't have the lowest nodeid? And that's not useful in a 2-node cluster. Am I correct? Thank you, Kostya On Wed, Jun 11, 2014 at 4:33 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote: 11.06.2014 16:26, Kostiantyn Ponomarenko wrote: Hi guys, I am trying to deal with the split-brain situation in a 2-node cluster using votequorum. Here is the quorum section of my corosync.conf: provider: corosync_votequorum expected_votes: 2 Just a side note, not an answer to your question: you'd add 'two_node: 1' here, as two-node clusters are very special in terms of quorum. wait_for_all: 1 last_man_standing: 1 auto_tie_breaker: 1 My question is about the behavior of the remaining node after I shut down the node with the lowest nodeid. My expectation is that after last_man_standing_window this node should be back to working. Or is this not a solution in the case of a two-node cluster?
Thank you, Kostya
Re: [Pacemaker] auto_tie_breaker in two node cluster
Honza, can you please explain what "network based" means? Thank you, Kostya On Wed, May 21, 2014 at 10:54 AM, Jan Friesse jfrie...@redhat.com wrote: I don't quite understand how auto_tie_breaker works. Say we have a cluster with 2 nodes and the auto_tie_breaker feature enabled. Each node has 2 NICs: one NIC is used for cluster communication and the other one is used for providing services from the cluster. So the question is how the nodes will distinguish between two possible situations: 1) the connection between the nodes is lost, but both nodes keep working; 2) the power supply on node 1 (which has the lowest node-id) breaks down, and node 2 keeps working. In the 1st case, according to the description of auto_tie_breaker, the node with the lowest node-id in the cluster will keep working. In that particular situation this is a good result, because both nodes are in a good state (both can keep working). In the 2nd case the only working node is #2, and the node-id of that node is not the lowest one. So what will happen in this case? What logic will apply, given that we have lost the node with the lowest node-id in a 2-node cluster? there is no qdiskd for votequorum yet Are there plans to implement it? Kostya, yes, there are plans to implement qdisk (a network-based one).
Regards, Honza Many thanks, Kostya
[Pacemaker] auto_tie_breaker in two node cluster
I don't quite understand how auto_tie_breaker works. Say we have a cluster with 2 nodes and the auto_tie_breaker feature enabled. Each node has 2 NICs: one NIC is used for cluster communication and the other one is used for providing services from the cluster. So the question is how the nodes will distinguish between two possible situations: 1) the connection between the nodes is lost, but both nodes keep working; 2) the power supply on node 1 (which has the lowest node-id) breaks down, and node 2 keeps working. In the 1st case, according to the description of auto_tie_breaker, the node with the lowest node-id in the cluster will keep working. In that particular situation this is a good result, because both nodes are in a good state (both can keep working). In the 2nd case the only working node is #2, and the node-id of that node is not the lowest one. So what will happen in this case? What logic will apply, given that we have lost the node with the lowest node-id in a 2-node cluster? there is no qdiskd for votequorum yet Are there plans to implement it? Many thanks, Kostya