Re: [Pacemaker] can't get external/xen0 fencing to work on debian wheezy
Just answering myself quickly so people don't waste their time reading the long logs and config. I simply forgot to define a location constraint for my fencing resource. I _have to_ do so because I am using an opt-in cluster. Sorry for the noise.
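For reference, the kind of constraint meant here would look something like the line below in crm syntax; the stonith resource and node names (st_xen0_node1, node2) are made up for illustration. In an opt-in cluster (symmetric-cluster=false) every resource, including a fencing resource, needs at least one such location constraint before it is allowed to run anywhere:

    location LOC_FENCE_NODE1 st_xen0_node1 inf: node2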
[Pacemaker] ordering cloned resources
Hi, I am setting up a cluster on Debian Wheezy. I have installed Pacemaker using the Debian-provided packages (so I am running 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff). I have roughly 10 nodes, some of which act as SANs (exporting block devices using the AoE protocol) while the other nodes act as initiators (they are actually mail servers, storing email on the exported devices). Below are the resources defined for those nodes:

xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat" type="AoEtarget"> \
      <instance_attributes id="pri_aoe1.1-instance_attributes"> \
        <rule id="node-sanaoe01" score="1"> \
          <expression attribute="#uname" id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
        </rule> \
        <nvpair id="pri_aoe1.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-nic" name="nic" value="eth0"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-shelf" name="shelf" value="1"/> \
        <nvpair id="pri_aoe1.1-instance_attributes-slot" name="slot" value="1"/> \
      </instance_attributes> \
      <instance_attributes id="pri_aoe2.1-instance_attributes"> \
        <rule id="node-sanaoe02" score="2"> \
          <expression attribute="#uname" id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
        </rule> \
        <nvpair id="pri_aoe2.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-nic" name="nic" value="eth1"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-shelf" name="shelf" value="2"/> \
        <nvpair id="pri_aoe2.1-instance_attributes-slot" name="slot" value="1"/> \
      </instance_attributes> \
    </primitive>
primitive pri_dovecot lsb:dovecot \
    op start interval=0 timeout=20 \
    op stop interval=0 timeout=30 \
    op monitor interval=5 timeout=10
primitive pri_spamassassin lsb:spamassassin \
    op start interval=0 timeout=50 \
    op stop interval=0 timeout=60 \
    op monitor interval=5 timeout=20
group grp_aoe pri_aoe1
group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
clone cln_mailstore grp_mailstore \
    meta ordered=false interleave=true clone-max=2
clone cln_san grp_aoe \
    meta ordered=true interleave=true clone-max=2

As this is an opt-in cluster (symmetric-cluster=false), I have the location constraints below for those hosts:

location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
location LOC_MAIL_STORE_2 cln_mailstore inf: ms02

So far so good. I want to make sure the initiators won't try to look for exported devices before the targets have actually exported them. To do so, I thought I could use the following ordering constraint:

order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore

Unfortunately, if I add this constraint the clone set cln_mailstore never starts (and even stops, if it was already started when I add the constraint). Is there something wrong with this ordering rule? Where can I find information on what's going on?
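On the last question (where to look for what is going on): a quick way to see what the cluster intends to do with the clones, and with what scores, is to ask the policy engine directly. This is only a diagnostic sketch; it assumes the stock Wheezy packaging, where crm_simulate reads the live CIB:

    crm_simulate -sL        # print allocation scores and the actions the PE would take now

The pengine messages in syslog (typically /var/log/syslog on Wheezy, depending on how corosync/pacemaker logging is set up) show the same decisions as they are made.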
Re: [Pacemaker] ordering cloned resources
Hi Andrew, I have tried stopping and starting the first resource of the ordering constraint (cln_san), hoping it would trigger a start attempt of the second resource of the ordering constraint (cln_mailstore). I tailed syslog on the node where I was expecting the second resource to start, but nothing at all appeared in those logs (I grepped for 'pengine' as per your suggestion). I did another test where I replaced the first resource of the ordering constraint with a very simple primitive (an lsb resource), and in that case it worked. I am wondering whether the issue comes from the rather complicated first resource: it is a cloned group containing a primitive with conditional (rule-based) instance attributes... Are you aware of any specific issue in pacemaker 1.1.7 with this kind of resource? I will try to simplify the resources by getting rid of the conditional instance attributes and try again. In the meantime I'd be delighted to hear what you guys think about this. Regards, Alex.

2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net: On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote: [original post and configuration quoted in full -- snipped] ...Is there something wrong with this ordering rule? Where can I find information on what's going on?

No errors in the logs? If you grep for 'pengine' does it want to start them or just leave them stopped?
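Beyond grepping for pengine, each transition the policy engine calculates is saved to a pe-input file that can be replayed offline. A sketch, assuming the default /var/lib/pacemaker/pengine directory and substituting the transition number reported in the logs for NNN:

    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-NNN.bz2

This prints the actions the cluster decided on for that transition, which makes it easier to tell whether cln_mailstore was deliberately left stopped or simply never considered.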
Re: [Pacemaker] ordering cloned resources
So... it appears the problem doesn't come from the primitive but from the cloned resource. If I use the primitive instead of the clone in the order constraint (thus deleting the clone and the group), the second resource of the constraint starts up as expected. Any idea why? Should I upgrade this pretty old version of pacemaker?

2014-03-08 10:36 GMT+01:00 Alexandre alxg...@gmail.com: [previous messages and configuration quoted in full -- snipped]
Re: [Pacemaker] ordering cloned resources
[10989]: notice: te_rsc_command: Initiating action 39: start pri_aoe1_start_0 on sanaoe02 (local)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: process_pe_message: Calculated Transition 381: /var/lib/pacemaker/pengine/pe-input-104.bz2
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: INFO: Exporting device /dev/xvdb on eth1 as shelf 2, slot 1
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: DEBUG: pri_aoe1 start : 0
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: process_lrm_event: LRM operation pri_aoe1_start_0 (call=198, rc=0, cib-update=1027, confirmed=true) ok
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 25: start pri_dovecot_start_0 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 26: monitor pri_dovecot_monitor_5000 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: run_graph: Transition 381 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

and here is where the second resource starts:

Mar 22 22:37:50 ms02 crmd[89496]: notice: process_lrm_event: LRM operation pri_dovecot_start_0 (call=151, rc=0, cib-update=197, confirmed=true) ok
Mar 22 22:37:50 ms02 dovecot: master: Dovecot v2.1.7 starting up
Mar 22 22:37:50 ms02 dovecot: master: Warning: /home is no longer mounted. If this is intentional, remove it with doveadm mount
Mar 22 22:37:50 ms02 crmd[89496]: notice: process_lrm_event: LRM operation pri_dovecot_monitor_5000 (call=152, rc=0, cib-update=198, confirmed=false) ok

I can't find anything useful in those logs, but if you think something is or could be relevant, please feel free to point it out.

2014-03-11 2:13 GMT+01:00 Andrew Beekhof and...@beekhof.net: On 9 Mar 2014, at 10:36 pm, Alexandre alxg...@gmail.com wrote: So... it appears the problem doesn't come from the primitive but from the cloned resource. If I use the primitive instead of the clone in the order constraint (thus deleting the clone and the group), the second resource of the constraint starts up as expected. Any idea why?

Not without logs

Should I upgrade this pretty old version of pacemaker?

Yes :)

[earlier messages and configuration quoted in full -- snipped]
[Pacemaker] collocating a set of resources with crmsh
Hi, I am configuring a cluster on nodes that don't have pcs installed (pacemaker 1.1.7 with crmsh). I would like to configure collocated sets of resources (as shown here: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/#s-resource-sets-collocation) in that cluster, but I can't find the proper way to do it with crm. I have tried the command below but it just fails:

sudo crm configure xml constraintsrsc_colocation id=coloc-1 score=INFINITY/resource_set id=collocated-set-example sequential=trueresource_ref id=pri_apache2/resource_ref id=pri_iscsi//resource_set/rsc_colocation/constraints
ERROR: not well-formed (invalid token): line 1, column 32

What is the way to proceed with crm? Regards.
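Two possible approaches, sketched under the assumption that the resource IDs really are pri_apache2 and pri_iscsi as in the failed command. The "not well-formed" error suggests the attribute values were not quoted in the XML that reached the parser. For just two resources, a plain crmsh colocation is equivalent to a sequential set:

    crm configure colocation coloc-1 inf: pri_apache2 pri_iscsi

If the raw-XML route is preferred, the fragment has to be well-formed (quoted attribute values) and quoted so the shell passes it through untouched; whether a crmsh of that era accepts a bare rsc_colocation element via the xml sub-command may vary:

    crm configure xml '<rsc_colocation id="coloc-1" score="INFINITY"><resource_set id="collocated-set-example" sequential="true"><resource_ref id="pri_apache2"/><resource_ref id="pri_iscsi"/></resource_set></rsc_colocation>'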
Re: [Pacemaker] no-quorum-policy = demote?
Have you tried to patch the monitor action of your RA so that it sets a temporary location constraint on the node to prevent it from becoming master? Something like:

location loc_splited_cluster -inf: MsRsc:Master $node

Not sure about the above crm syntax, but that's the idea.

On 8 Apr 2014 at 02:52, Andrew Beekhof and...@beekhof.net wrote: On 7 Apr 2014, at 5:54 pm, Christian Ciach derein...@gmail.com wrote: Hello, I am using Corosync 2.0 with Pacemaker 1.1 on Ubuntu Server 14.04 (daily builds until final release). My problem is as follows: I have a 2-node (plus a quorum-node) cluster to manage a multistate resource. One node should be the master and the other one the slave. It is absolutely not allowed to have two masters at the same time. To prevent a split-brain situation, I am also using a third node as a quorum-only node (set to standby). There is no redundant connection because the nodes are connected over the internet. If one of the two nodes managing the resource becomes disconnected, it loses quorum. In this case, I want this resource to become a slave, but the resource should never be stopped completely!

Ever? Including when you stop pacemaker? If so, maybe the path of least resistance is to delete the contents of the stop action in that OCF agent...

This leaves me with a problem: no-quorum-policy=stop will stop the resource, while no-quorum-policy=ignore will keep this resource in the master state. I already tried to demote the resource manually inside the monitor action of the OCF agent, but pacemaker will promote the resource again immediately. I am aware that I am trying to manage a multi-site cluster and there is something like the booth daemon, which sounds like the solution to my problem. But unfortunately I need the location constraints of pacemaker based on the score of the OCF agent. As far as I know location constraints are not possible when using booth, because the 2-node cluster is essentially split into two 1-node clusters. Is this correct? To conclude: Is it possible to demote a resource on quorum loss instead of stopping it? Is booth an option if I need to manage the location of the master based on the score returned by the OCF agent?
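In crm syntax the kind of temporary ban meant above would look roughly like this (resource and node names are placeholders; the constraint would be added when quorum is lost and removed again when it returns):

    crm configure location ban-master-on-node1 ms_resource \
        rule $role=Master -inf: #uname eq node1

This only forbids the Master role on node1; the resource can keep running there as a slave.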
Re: [Pacemaker] no-quorum-policy = demote?
On 10 Apr 2014 at 15:44, Christian Ciach derein...@gmail.com wrote: I don't really like the idea of periodically polling crm_node -q for the current quorum state. No matter how frequently the monitor function gets called, there will always be a small time frame where both nodes are in the master state at the same time. Is there a way to get a notification to the OCF agent whenever the quorum state changes?

You should probably look for something like this in the ocf-shellfuncs file. But also take a look at the page below; it describes a lot of multi-state-specific variables that are most definitely useful in your case. http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_multi_state_proper_interpretation_of_notification_environment_variables.html

2014-04-08 10:14 GMT+02:00 Christian Ciach derein...@gmail.com: Interesting idea! I can confirm that this works. So I need to monitor the output of crm_node -q to check if the current partition has quorum. If the partition doesn't have quorum, I need to set the location constraint according to your example. If the partition gets quorum again, I need to remove the constraint. This seems almost a bit hacky, but it should work okay. Thank you! It's almost a shame that pacemaker doesn't have demote as a no-quorum-policy, but supports demote as a loss-policy for tickets. Yesterday I had another idea: maybe I won't use a multistate resource agent but a primitive instead. This way, I will start the resource outside of pacemaker and let the start action of the OCF agent set the resource to master and the stop action set it to slave. Then I will just use no-quorum-policy=stop. The downside of this is that I cannot distinguish between a stopped resource and a resource in a slave state using crm_mon.
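A minimal sketch of the quorum check discussed above, as it might appear inside the agent's monitor (or notify) action; crm_node -q prints 1 when the local partition has quorum, and crm_master adjusts this node's promotion preference (the score values are arbitrary):

    if [ "$(crm_node -q)" = "1" ]; then
        crm_master -l reboot -v 100          # partition has quorum: allow promotion here
    else
        crm_master -l reboot -v -INFINITY    # no quorum: refuse to be promoted here
    fi

Note this is still polling from within the monitor action, so the window Christian mentions does not disappear; it only shrinks to the monitor interval.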
Re: [Pacemaker] Possible to colocate an ms resource with standard ones?
Why did you hide the resource agent provider? Is it a custom one? If so, you probably need to make sure it is able to handle m/s resources in pacemaker properly.

On 30 Apr 2014 at 01:10, Andrew Beekhof and...@beekhof.net wrote: On 29 Apr 2014, at 11:06 pm, Sékine Coulibaly scoulib...@gmail.com wrote: Hi, let me explain my use case. I'm using RHEL 6.3

fwiw, there are updates to pacemaker 1.1.10 in 6.4 and 6.5. It's even supported now.

with Corosync + Pacemaker + PostgreSQL 9.2 + repmgr 2.0. I have two nodes named clustera and clusterb. I have a total of 3 resources: APACHE, BOUM and MS_POSTGRESQL. They are defined as follows:

sudo crm configure primitive APACHE ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf \ op monitor interval=5s timeout=10s \ op start interval=0 timeout=10s \ op stop interval=0 timeout=10s
sudo crm configure primitive BOUM ocf:heartbeat:anything \ params binfile=/usr/local/boum/current/bin/boum \ workdir=/var/boum \ logfile=/var/log/boum/boum_STDOUT \ errlogfile=/var/log/boum/boum_STDERR \ pidfile=/var/run/boum.pid \ op monitor interval=5s timeout=10s \ op start interval=0 timeout=10s \ op stop interval=0 timeout=10s
sudo crm configure primitive POSTGRESQL ocf:xx:postgresql \ params repmgr_conf=/var/lib/pgsql/repmgr/repmgr.conf pgctl=/usr/pgsql-9.2/bin/pg_ctl pgdata=/opt/pgdata \ op start interval=0 timeout=90s \ op stop interval=0 timeout=60s \ op promote interval=0 timeout=120s \ op monitor interval=53s role=Master \ op monitor interval=60s role=Slave

Since PostgreSQL is in streaming replication, I need to have a master and a slave constantly running. Hence, I created a Master/Slave resource called MS_POSTGRESQL. I want APACHE, BOUM and the PostgreSQL master to all run on the same node. It looks like as soon as I add a colocation, the PostgreSQL slave doesn't start anymore. I end up with:

Online: [ clusterb clustera ]
Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
Masters: [ clustera ]
Stopped: [ POSTGRESQL:1 ]
APACHE (ocf::heartbeat:apache): Started clustera
BOUM (ocf::heartbeat:anything): Started clustera

My configuration is as follows:

node clustera \ attributes standby=off
node clusterb \ attributes standby=off
primitive APACHE ocf:heartbeat:apache \ params configfile=/etc/httpd/conf/httpd.conf \ op monitor interval=5s timeout=10s \ op start interval=0 timeout=10s \ op stop interval=0 timeout=10s \ meta target-role=Started
primitive BOUM ocf:heartbeat:anything \ params binfile=/usr/local/boum/current/bin/boum workdir=/var/boum logfile=/var/log/boum/boum_STDOUT errlogfile=/var/log/boum/boum_STDERR pidfile=/var/run/boum.pid \ op monitor interval=5s timeout=10s \ op start interval=0 timeout=10s \ op stop interval=0 timeout=10s
primitive POSTGRESQL ocf:xxx:postgresql \ params repmgr_conf=/var/lib/pgsql/repmgr/repmgr.conf pgctl=/usr/pgsql-9.2/bin/pg_ctl pgdata=/opt/pgdata \ op start interval=0 timeout=90s \ op stop interval=0 timeout=60s \ op promote interval=0 timeout=120s \ op monitor interval=53s role=Master \ op monitor interval=60s role=Slave
ms MS_POSTGRESQL POSTGRESQL \ meta clone-max=2 target-role=Started resource-stickiness=100 notify=true
colocation link-resources inf: ZK UFO BOUM APACHE MS_POSTGRESQL

Could you send the raw xml (cibadmin -Ql) please?
I've never gotten used to crmsh's colocation syntax and don't have it installed locally (pcs is the supplied tool for configuring pacemaker on RHEL).

property $id=cib-bootstrap-options \ dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ default-resource-stickiness=10 \ start-failure-is-fatal=false \ last-lrm-refresh=1398775386

Is this normal behaviour? If it is, is there a workaround I haven't thought of?
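One pattern that often avoids the stopped slave (a sketch using only the resources shown above, not necessarily what Andrew would recommend): colocate the other resources with the Master role of the clone instead of putting the whole master/slave resource into a plain set, so the slave instance remains free to run on the other node:

    colocation apache_with_pg_master inf: APACHE MS_POSTGRESQL:Master
    colocation boum_with_pg_master inf: BOUM MS_POSTGRESQL:Master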
Re: [Pacemaker] Pacemaker with Xen 4.3 problem
IIRC the Xen RA uses 'xm'. However, fixing the RA is trivial and worked for me (if you're using the same RA).

On 8 Jul 2014 at 21:39, Tobias Reineck tobias.rein...@hotmail.de wrote: Hello, I am trying to build a Xen HA cluster with pacemaker/corosync. Xen 4.3 works on all nodes and Xen live migration also works fine. Pacemaker also works with the cluster virtual IP. But when I try to bring a Xen OCF heartbeat resource online, an error appears:

## Failed actions:
xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31, status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms, exec=555ms
xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10, status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms, exec=706ms
##

I added the resource with the command:

crm configure primitive xen_dns_ha ocf:heartbeat:Xen \ params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \ op monitor interval=10s \ op start interval=0s timeout=30s \ op stop interval=0s timeout=300s

In /var/log/messages the following error is printed:

2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]: notice: operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to connect to xend: No such file or directory. Is xend running? ]

I use Xen 4.3 with the XL toolstack, without xend. Is it possible to use pacemaker with Xen 4.3? Can anybody please help me? Best regards, T. Reineck
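A rough sketch of what "fixing the RA" can mean in practice: copy the shipped agent into a custom provider directory and switch the xm calls to xl. The paths and the provider name "custom" are assumptions, and as the follow-up below notes, this was not enough in every case for the Xen agent itself:

    mkdir -p /usr/lib/ocf/resource.d/custom
    cp /usr/lib/ocf/resource.d/heartbeat/Xen /usr/lib/ocf/resource.d/custom/Xen
    sed -i 's/\bxm /xl /g' /usr/lib/ocf/resource.d/custom/Xen
    # then define the resource as ocf:custom:Xen instead of ocf:heartbeat:Xen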
Re: [Pacemaker] Pacemaker with Xen 4.3 problem
Actually I did it for the stonith resource agent external/xen0. xm and xl are supposed to be semantically very close, and as far as I can see the ocf:heartbeat:Xen agent doesn't seem to use any xm command that shouldn't work with xl. What error do you get when using xl instead of xm? Regards.

2014-07-09 8:39 GMT+02:00 Tobias Reineck tobias.rein...@hotmail.de: Hello, do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ? I also tried this, replacing all xm with xl, with no success. Is it possible that you can show me your RA resource for Xen? Best regards, T. Reineck

-- Date: Tue, 8 Jul 2014 22:27:59 +0200 From: alxg...@gmail.com To: pacemaker@oss.clusterlabs.org Subject: Re: [Pacemaker] Pacemaker with Xen 4.3 problem [previous messages quoted in full -- snipped]
Re: [Pacemaker] help deciphering output
I have seen this behaviour in several virtualised environments: when the VM backup starts, the VM actually freezes for a (short?) period of time. I guess it then stops responding to the other cluster nodes, thus triggering unexpected failover and/or fencing. I have seen this kind of behaviour in a VMware environment using Veeam backup, as well as on Proxmox (I don't know which backup tool was used there). That's actually an interesting topic I never thought about raising here: how can we avoid that? Increasing timeouts? I am afraid we would have to reach unacceptably high timeout values, and I am not even sure that would fix the problem. I don't think every VM snapshot strategy triggers this problem; do you guys have any feedback on which backup/snapshot method best suits corosync clusters? Regards

On 9 Oct 2014 at 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com wrote: One of my nodes died in a 2-node cluster. I gather something went wrong and it fenced/killed itself, but I am not sure what happened. I think maybe around that time the VM backups happened and a snapshot of the VM could have been taken, but there is nothing for me to put my finger on. Output from messages around that time; this is on devrp1:

Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.
Oct 8 23:31:40 devrp1 corosync[1670]: [CMAN ] quorum lost, blocking activity
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] Members[1]: 1
Oct 8 23:31:40 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 8 23:31:40 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:2 left:1)
Oct 8 23:31:40 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 424: quorum lost
Oct 8 23:31:42 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] This node is within the primary component and will provide service.
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
Oct 8 23:31:42 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:1 left:0)
Oct 8 23:31:42 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now lost (was member)
Oct 8 23:31:42 devrp1 crmd[2350]: warning: reap_dead_nodes: Our DC node (devrp2) left the cluster
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 428: quorum acquired
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now member (was lost)
Oct 8 23:31:42 devrp1 crmd[2350]: notice: do_state_transition: State transition S_NOT_DC - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Oct 8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were killed by cman_tool or other application
Oct 8 23:31:42 devrp1 pacemakerd[2339]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 stonith-ng[2346]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 crmd[2350]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 crmd[2350]: error: crmd_cs_destroy: connection terminated
Oct 8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
Oct 8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
Oct 8 23:31:43 devrp1 attrd[2348]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Exiting...
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Disconnecting client 0x18cf240, pid=2350...
Oct 8 23:31:43 devrp1 pacemakerd[2339]: error: mcp_cpg_destroy: Connection destroyed
Oct 8 23:31:43 devrp1 cib[2345]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:43 devrp1 cib[2345]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Oct 8 23:31:43 devrp1 stonith-ng[2346]: error: stonith_peer_cs_destroy: Corosync connection terminated
Oct 8 23:31:43 devrp1 dlm_controld[1752]:
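On the timeout question raised above: the knob usually discussed for short VM freezes is the corosync totem token timeout. A corosync.conf fragment with an illustrative value only (on a cman-based stack like the one in these logs, the equivalent is set in cluster.conf, e.g. via a totem token attribute):

    totem {
        version: 2
        token: 10000                              # ms of silence before a node is declared failed
        token_retransmits_before_loss_const: 10
    }

Whether raising it is acceptable depends on how long a real failure may go undetected, which is exactly the trade-off questioned above.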
Re: [Pacemaker] colocate three resources
I think you can use a single colocation constraint with a set of resources. crmsh allows you to create such a colocation with:

crm colocation vm_with_disks inf: vm_srv ( ms_disk_R:Master ms_disk_S:Master )

This forces the cluster to place the two master instances on the same host, starting them without any specific ordering between them, and then place the VM along with them (see also the order sketch below).

On 9 Nov 2014 at 11:31, Matthias Teege matthias-gm...@mteege.de wrote: Hallo, on a cluster I have to place three resources on the same node:

ms ms_disk_R p_disk_R
ms ms_disk_S p_disk_S
primitive vm_srv ocf:heartbeat:VirtualDomain

The colocation constraints look like this:

colocation vm_with_disk_R inf: vm_srv ms_disk_R:Master
colocation vm_with_disk_S inf: vm_srv ms_disk_S:Master

Do I have to add another colocation constraint to define a colocation between disk_R and disk_S? I'm not sure, because the documentation says: "with-rsc: The colocation target. The cluster will decide where to put this resource first and then decide where to put the resource in the rsc field." In my case the colocation targets are ms_disk_R and ms_disk_S. If pacemaker decides to put disk_R on node A and disk_S on node B, vm_srv would not start. I use order constraints to start the disks before the vm resource:

order disk_R_before_vm inf: ms_disk_R:promote vm_srv:start
order disk_S_before_vm inf: ms_disk_S:promote vm_srv:start

Thanks, Matthias
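If desired, the two order constraints from the original post could also be collapsed into a single one using a set, assuming a crmsh version that accepts parenthesised (unordered) sets in order constraints; otherwise the two explicit constraints already shown are perfectly fine:

    order disks_before_vm inf: ( ms_disk_R:promote ms_disk_S:promote ) vm_srv:start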
Re: [Pacemaker] Daemon Start attempt on wrong Server
You should use an opt-in cluster: set the cluster property symmetric-cluster=false. This tells Pacemaker not to place a resource anywhere in the cluster unless a location rule explicitly says where it may run. Pacemaker will still probe the sql resources on the www hosts (the probes return rc 5, not installed), but this is expected and harmless.

On 11 Nov 2014 at 13:22, Hauke Homburg hhomb...@w3-creative.de wrote: Hello, I am installing a 6-node Pacemaker cluster: 3 nodes for Apache, 3 nodes for Postgres. My cluster config is:

node kvm-node1
node sql-node1
node sql-node2
node sql-node3
node www-node1
node www-node2
node www-node3
primitive pri_kvm_ip ocf:heartbeat:IPaddr2 \ params ip=10.0.6.41 cidr_netmask=255.255.255.0 \ op monitor interval=10s timeout=20s
primitive pri_sql_ip ocf:heartbeat:IPaddr2 \ params ip=10.0.6.31 cidr_netmask=255.255.255.0 \ op monitor interval=10s timeout=20s
primitive pri_www_ip ocf:heartbeat:IPaddr2 \ params ip=10.0.6.21 cidr_netmask=255.255.255.0 \ op monitor interval=10s timeout=20s
primitive res_apache ocf:heartbeat:apache \ params configfile=/etc/apache2/apache2.conf \ op start interval=0 timeout=40 \ op stop interval=0 timeout=60 \ op monitor interval=60 timeout=120 start-delay=0 \ meta target-role=Started
primitive res_pgsql ocf:heartbeat:pgsql \ params pgctl=/usr/lib/postgresql/9.1/bin/pg_ctl psql=/usr/bin/psql start_opt= pgdata=/var/lib/postgresql/9.1/main config=/etc/postgresql/9.1/main/postgresql.conf pgdba=postgres \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=30s timeout=30s depth=0
location loc_kvm_ip_node1 pri_kvm_ip 10001: kvm-node1
location loc_sql_ip_node1 pri_sql_ip inf: sql-node1
location loc_sql_ip_node2 pri_sql_ip inf: sql-node2
location loc_sql_ip_node3 pri_sql_ip inf: sql-node3
location loc_sql_srv_node1 res_pgsql inf: sql-node1
location loc_sql_srv_node2 res_pgsql inf: sql-node2
location loc_sql_srv_node3 res_pgsql inf: sql-node3
location loc_www_ip_node1 pri_www_ip inf: www-node1
location loc_www_ip_node2 pri_www_ip inf: www-node2
location loc_www_ip_node3 pri_www_ip inf: www-node3
location loc_www_srv_node1 res_apache inf: www-node1
location loc_www_srv_node2 res_apache inf: www-node2
location loc_www_srv_node3 res_apache inf: www-node3
property $id=cib-bootstrap-options \ dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \ cluster-infrastructure=openais \ expected-quorum-votes=7 \ stonith-enabled=false

Why do I see the following output in crm_mon?

Failed actions:
res_pgsql_start_0 (node=www-node1, call=16, rc=5, status=complete): not installed
res_pgsql_start_0 (node=www-node2, call=13, rc=5, status=complete): not installed
pri_www_ip_monitor_1 (node=www-node3, call=22, rc=7, status=complete): not running
res_pgsql_start_0 (node=www-node3, call=13, rc=5, status=complete): not installed
res_apache_start_0 (node=sql-node2, call=18, rc=5, status=complete): not installed
res_pgsql_start_0 (node=sql-node2, call=12, rc=5, status=complete): not installed
res_apache_start_0 (node=sql-node3, call=12, rc=5, status=complete): not installed
res_pgsql_start_0 (node=sql-node3, call=10, rc=5, status=complete): not installed
res_apache_start_0 (node=kvm-node1, call=12, rc=5, status=complete): not installed
res_pgsql_start_0 (node=kvm-node1, call=20, rc=5, status=complete): not installed

I set infinity for pgsql on all 3 sql nodes, but not on the www nodes. Why does Pacemaker try to start the PostgreSQL server on the www nodes, for example?
Thanks for your help. Greetings, Hauke
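For reference, the property from the advice above, in the same crm syntax as the rest of the configuration:

    crm configure property symmetric-cluster=false

With that set, nothing runs anywhere without an explicit location constraint, which is exactly what the per-node location rules in this configuration already provide.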
Re: [Pacemaker] Reset failcount for resources
On 13 Nov 2014 at 12:09, Arjun Pandey apandepub...@gmail.com wrote: Hi, I am running a 2-node cluster with this config:

Master/Slave Set: foo-master [foo]
Masters: [ bharat ]
Slaves: [ ram ]
AC_FLT (ocf::pw:IPaddr): Started bharat
CR_CP_FLT (ocf::pw:IPaddr): Started bharat
CR_UP_FLT (ocf::pw:IPaddr): Started bharat
Mgmt_FLT (ocf::pw:IPaddr): Started bharat

where the IPaddr RA is just a modified IPaddr2 RA. Additionally, I have a colocation constraint for the IP addresses to be colocated with the master. I have set migration-threshold to 2 for the VIPs and failure-timeout to 15s. Initially I bring down the interface on bharat to force a switch-over to ram. After this I fail the interfaces on bharat again. Now I bring the interface up again on ram. However, the virtual IPs are now in the stopped state, and I don't get out of this unless I use crm_resource -C to reset the state of the resources. Even after that, if I check the fail-count of the resources it is still set to INFINITY. Based on the documentation, the fail-count on a node should have expired after the failure-timeout; that doesn't happen.

Expiration probably does happen, in the sense that the failure is marked as expired. However, expired failures are only actually removed when the recheck timer pops, which is governed by the cluster-recheck-interval (15 minutes by default).

However, why don't we reset the count after the crm_resource -C command too? Is there any other command to actually reset the failcount? Thanks in advance. Regards, Arjun
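A sketch of the two manual knobs mentioned above, using the resource and node names from this cluster; exact option spellings differ slightly between 1.1.x releases, so check the local --help output:

    # have expired failures purged sooner than the 15-minute default
    crm configure property cluster-recheck-interval=60s

    # clear the fail-count for one resource on one node by hand
    crm resource failcount AC_FLT delete bharat
    crm_resource --cleanup --resource AC_FLT --node bharat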
[Pacemaker] postgresql never promoted
Hi list, I am facing a very strange issue. I have set up a PostgreSQL cluster (with streaming replication). The replication works fine when started manually, but the RA never seems to promote any host where the resource is started. My config is below:

node pp-obm-sgbd.upond.fr
node pp-obm-sgbd2.upond.fr \ attributes pri_pgsql-data-status=DISCONNECT
primitive pri_obm-locator lsb:obm-locator \ params \ op start interval=0s timeout=60s \ op stop interval=0s timeout=60s \ op monitor interval=10s timeout=20s
primitive pri_pgsql pgsql \ params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql pgdata=/var/lib/pgsql/9.1/data/ node_list="pp-obm-sgbd.upond.fr pp-obm-sgbd2.upond.fr" repuser=replication rep_mode=sync restart_on_promote=true restore_command="cp /var/lib/pgsql/replication/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip=193.50.151.200 \ op start interval=0 on-fail=restart timeout=120s \ op monitor interval=20s on-fail=restart timeout=60s \ op monitor interval=15s on-fail=restart role=Master timeout=60s \ op promote interval=0 on-fail=restart timeout=120s \ op demote interval=0 on-fail=stop timeout=120s \ op notify interval=0s timeout=60s \ op stop interval=0 on-fail=block timeout=120s
primitive pri_vip IPaddr2 \ params ip=193.50.151.200 nic=eth1 cidr_netmask=32 \ op start interval=0s timeout=60s \ op monitor interval=10s timeout=60s \ op stop interval=0s timeout=60s
ms ms_pgsql pri_pgsql \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
colocation clc_vip-ms_pgsql inf: pri_vip:Started ms_pgsql:Master
order ord_dm_pgsql-vip 0: ms_pgsql:demote pri_vip:stop
order ord_pm_pgsql-vip 0: ms_pgsql:promote pri_vip:start symmetrical=false
property cib-bootstrap-options: \ dc-version=1.1.11-97629de \ cluster-infrastructure=cman \ last-lrm-refresh=1424459378 \ no-quorum-policy=ignore \ stonith-enabled=false \ maintenance-mode=false
rsc_defaults rsc_defaults-options: \ resource-stickiness=1000 \ migration-threshold=5

crm_mon shows both hosts as slaves, and neither is ever promoted:

Master/Slave Set: ms_pgsql [pri_pgsql]
Slaves: [ pp-obm-sgbd.upond.fr pp-obm-sgbd2.upond.fr ]
Node Attributes:
* Node pp-obm-sgbd.upond.fr:
+ master-pri_pgsql : 1000
+ pri_pgsql-status : HS:alone
+ pri_pgsql-xlog-loc : 2D78
* Node pp-obm-sgbd2.upond.fr:
+ master-pri_pgsql : -INFINITY
+ pri_pgsql-data-status : DISCONNECT
+ pri_pgsql-status : HS:alone
+ pri_pgsql-xlog-loc : 2D00

On the host where I am expecting promotion I see the following when doing cleanups:

Feb 20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]: INFO: Master does not exist.
Feb 20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]: INFO: My data status=.

And on the other node I see the following logs, which look interesting:

Feb 20 20:16:10 pp-obm-sgbd2 crmd[19626]: notice: print_synapse: [Action 18]: Pending pseudo op ms_pgsql_promoted_0 on N/A (priority: 100, waiting: 11)
Feb 20 20:16:10 pp-obm-sgbd2 crmd[19626]: notice: print_synapse: [Action 17]: Pending pseudo op ms_pgsql_promote_0 on N/A (priority: 0, waiting: 21)

The N/A part seems to tell me the cluster doesn't know where to promote the resource, but I can't understand why. Below are my constraint rules:

pcs constraint show
Location Constraints:
Ordering Constraints:
demote ms_pgsql then stop pri_vip (score:0)
promote ms_pgsql then start pri_vip (score:0) (non-symmetrical)
Colocation Constraints:
pri_vip with ms_pgsql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

I am now out of ideas, so any help is very much appreciated. Regards.
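Two things that are often worth checking in this kind of "both slaves, HS:alone" situation, offered as a diagnostic sketch rather than a fix: dump the promotion scores the policy engine computed, and inspect the permanent data-status attribute the pgsql agent keeps (visible above on pp-obm-sgbd2.upond.fr). Deleting that attribute should only be done after verifying the node's data, so the delete form is commented out:

    crm_simulate -sL        # shows allocation and promotion scores for ms_pgsql
    crm_attribute -l forever -N pp-obm-sgbd2.upond.fr -n pri_pgsql-data-status -G
    # crm_attribute -l forever -N pp-obm-sgbd2.upond.fr -n pri_pgsql-data-status -D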
Re: [Pacemaker] postgresql never promoted
Thanks, I was already on my way to do it; it's done now.

On 20 Feb 2015 at 20:50, Digimer li...@alteeve.ca wrote: Do you mind asking this on the new mailing list? http://clusterlabs.org/mailman/listinfo/users This list is scheduled to be closed and all users are encouraged to switch. :)

On 20/02/15 02:18 PM, Alexandre wrote: [original question and configuration quoted in full -- snipped]

-- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
[Pacemaker] Cannot clean history
Hi, I have a pacemaker / corosync / cman cluster running on Red Hat 6.6. Although the cluster is working as expected, I have traces of old failures (from several months ago) that I can't get rid of. Basically I have set cluster-recheck-interval=300 and failure-timeout=600 (in rsc_defaults), as shown below:

property $id=cib-bootstrap-options \ dc-version=1.1.10-14.el6-368c726 \ cluster-infrastructure=cman \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ stonith-enabled=false \ last-lrm-refresh=1429702408 \ maintenance-mode=false \ cluster-recheck-interval=300
rsc_defaults $id=rsc-options \ failure-timeout=600

So I would expect old failures to have been purged from the CIB long ago, but I actually get the following when issuing crm_mon -frA1:

Migration summary:
* Node host1:
etc_ml_drbd: migration-threshold=100 fail-count=244 last-failure='Sat Feb 14 17:04:05 2015'
spool_postfix_drbd_msg: migration-threshold=100 fail-count=244 last-failure='Sat Feb 14 17:04:05 2015'
lib_ml_drbd: migration-threshold=100 fail-count=244 last-failure='Sat Feb 14 17:04:05 2015'
lib_imap_drbd: migration-threshold=100 fail-count=244 last-failure='Sat Feb 14 17:04:05 2015'
spool_imap_drbd: migration-threshold=100 fail-count=11654 last-failure='Sat Feb 14 17:04:05 2015'
spool_ml_drbd: migration-threshold=100 fail-count=244 last-failure='Sat Feb 14 17:04:05 2015'
documents_drbd: migration-threshold=100 fail-count=248 last-failure='Sat Feb 14 17:58:55 2015'
* Node host2:
documents_drbd: migration-threshold=100 fail-count=548 last-failure='Sat Feb 14 16:26:33 2015'

I have tried to crm_failcount -D the resources and also tried cleanup... but it's still there! How can I get rid of those records (so my monitoring tools stop complaining)? Regards.
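Since crm_failcount -D and a cleanup did not remove them, the stale values can be deleted straight from the transient (status-section) attributes. A sketch using the names visible above; double-check the exact attribute names with cibadmin first, as they follow the fail-count-<resource> / last-failure-<resource> pattern:

    cibadmin -Q | grep -e fail-count -e last-failure
    crm_attribute -t status -N host1 -n fail-count-documents_drbd -D
    crm_attribute -t status -N host1 -n last-failure-documents_drbd -D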
[Pacemaker] resource dependency
Hi list, I'm building a 4-node cluster where 2 nodes will export DRBD devices via the ietd iSCSI target (storage nodes) and the other 2 nodes will run Xen VMs (app nodes) stored on LVM partitions accessed via the open-iscsi initiator, using multipath for failover. While configuring the cluster resource ordering I came up against a situation for which I can't find a solution. The Xen VM resources depend on an iSCSI initiator resource to run, and I have two iSCSI initiator resources, one for each storage node. How can I make the VM resources depend on *any* of the iSCSI initiator resources? I was thinking of creating a clone of the iSCSI initiator resource and using rules to change the clone options, so that I can have two clones per app node with a different portal parameter; this way I could make the VM resources depend on this clone. Is this possible? I'm using debian-lenny with the packages described at http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo. Excuse me for the bad English. Best Regards, Alexandre
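A rough sketch of the cloned-initiator idea described above, with made-up portal and target values and names; it expresses a dependency on one storage node's export, while the "any of the two" dependency is what the rest of the thread concludes the cluster could not yet express at the time:

    primitive pri_iscsi_san1 ocf:heartbeat:iscsi \
        params portal="192.168.1.11:3260" target="iqn.2009-11.example:storage1" \
        op monitor interval="30s"
    clone cln_iscsi_san1 pri_iscsi_san1
    order vm1_after_iscsi inf: cln_iscsi_san1 vm1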
Re: [Pacemaker] resource dependency
On Fri, Nov 20, 2009 at 2:53 PM, Matthew Palmer mpal...@hezmatt.org wrote: On Fri, Nov 20, 2009 at 02:42:29PM -0200, Alexandre Biancalana wrote: [original question quoted in full -- snipped]

Personally, I think you've got the wrong design. I'd prefer to loosely couple the storage and VM clusters, with the storage cluster exporting iSCSI initiators which the VM cluster then attaches to the VMs as required. Put the error handling for the case where the iSCSI initiator isn't available for a VM into the resource agent for the VM. To me, this seems like a more robust solution. Tying everything up together feels like you're asking for trouble whenever any failover happens -- everything gets recalculated and the cluster spends the next several minutes jiggling resources around before everything settles back down again.

Hi Matt, thank you for the reply. OK, but if I go with your suggestion I end up with the same question. With the 2-node storage cluster exporting the block devices via iSCSI, how can I make the VM resources on the VM cluster depend on *any* of the exported iSCSI targets? The standard order configuration only allows a dependency on *one* resource. The only way I see is to configure an IP resource on the storage cluster and use it as the portal in the iSCSI initiator resource of the VM cluster. I don't want to do it this way because I want to use multipath, for quicker failover. Alexandre
Re: [Pacemaker] resource dependency
On Fri, Nov 20, 2009 at 4:35 PM, Andrew Beekhof and...@beekhof.net wrote: On Fri, Nov 20, 2009 at 5:42 PM, Alexandre Biancalana biancal...@gmail.com wrote: [original question quoted in full -- snipped] ...how can I make the vm resources dependent on any iscsi initiator resources?

The cluster can't express this case yet. But it's on the to-do list.

Thank you for the answer Andrew, and congratulations on this great piece of software. Best Regards, Alexandre