Re: [Pacemaker] can't get external/xen0 fencing to work on debian wheezy

2014-02-06 Thread Alexandre
Just answering myself quickly so people don't waste their time reading
long logs and config.
Actually I had simply forgotten to define a location constraint for my
fencing resource. I _have to_ do that since I am using an opt-in cluster.
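For the record, the missing piece was just a location constraint along these
lines (the resource and node names here are placeholders, not my real config):

location loc_fence_xen1 st_xen1 inf: xen-host2

In an opt-in cluster even the stonith resource needs an explicit location
before the cluster will start it anywhere.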

Sorry for the noise.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] ordering cloned resources

2014-03-02 Thread Alexandre
Hi,

I am setting up a cluster on debian wheezy.
I have installed pacemaker using the Debian-provided packages (so I am
running 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).

I have roughly 10 nodes, among which some act as SAN nodes
(exporting block devices using the AoE protocol) and the others act
as initiators (they are actually mail servers, storing emails on the
exported devices).
Below are the defined resources for those nodes:

xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat" type="AoEtarget"> \
        <instance_attributes id="pri_aoe1.1-instance_attributes"> \
            <rule id="node-sanaoe01" score="1"> \
                <expression attribute="#uname" id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
            </rule> \
            <nvpair id="pri_aoe1.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
            <nvpair id="pri_aoe1.1-instance_attributes-nic" name="nic" value="eth0"/> \
            <nvpair id="pri_aoe1.1-instance_attributes-shelf" name="shelf" value="1"/> \
            <nvpair id="pri_aoe1.1-instance_attributes-slot" name="slot" value="1"/> \
        </instance_attributes> \
        <instance_attributes id="pri_aoe2.1-instance_attributes"> \
            <rule id="node-sanaoe02" score="2"> \
                <expression attribute="#uname" id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
            </rule> \
            <nvpair id="pri_aoe2.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
            <nvpair id="pri_aoe2.1-instance_attributes-nic" name="nic" value="eth1"/> \
            <nvpair id="pri_aoe2.1-instance_attributes-shelf" name="shelf" value="2"/> \
            <nvpair id="pri_aoe2.1-instance_attributes-slot" name="slot" value="1"/> \
        </instance_attributes> \
    </primitive>
primitive pri_dovecot lsb:dovecot \
op start interval=0 timeout=20 \
op stop interval=0 timeout=30 \
op monitor interval=5 timeout=10
primitive pri_spamassassin lsb:spamassassin \
op start interval=0 timeout=50 \
op stop interval=0 timeout=60 \
op monitor interval=5 timeout=20
group grp_aoe pri_aoe1
group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
clone cln_mailstore grp_mailstore \
meta ordered=false interleave=true clone-max=2
clone cln_san grp_aoe \
meta ordered=true interleave=true clone-max=2

As I am running the cluster in opt-in mode (symmetric-cluster=false), I
have the location constraints below for those hosts:

location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
location LOC_MAIL_STORE_2 cln_mailstore inf: ms02

So far so good. I want to make sure the initiators won't try to discover
the exported devices before the targets have actually exported them. To do
so, I thought I could use the following ordering constraint:

order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore

Unfortunately, if I add this constraint the clone set cln_mailstore
never starts (or even stops, if it was already started when I add the constraint).

Is there something wrong with this ordering rule?
Where can I find information on what's going on?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ordering cloned resources

2014-03-08 Thread Alexandre
Hi Andrew,

I have tried to stop and start the first resource of the ordering
constraint (cln_san), hoping it would trigger a start attempt of the
second resource of the ordering constraint (cln_mailstore).
I tailed the syslog on the node where I was expecting the second
resource to start, but nothing relevant appeared in those logs (I grepped
for 'pengine' as per your suggestion).
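Concretely, the kind of command I used was roughly the following (the log
file path may of course differ on your setup):

grep pengine /var/log/syslog | grep -i cln_mailstore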

I have done another test where I replaced the first resource of the
ordering constraint with a very simple primitive (an LSB resource), and in
that case it worked.

I am wondering whether the issue comes from the rather complicated
first resource: it is a cloned group containing a primitive with
conditional instance attributes...
Are you aware of any specific issue in pacemaker 1.1.7 with this kind
of resource?

I will try to simplify the resources by getting rid of the conditional
instance attributes and try again. In the meantime I'd be delighted to
hear what you guys think about it.

Regards, Alex.

2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net:

 On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote:

 Hi,

 I am setting up a cluster on debian wheezy.
 I have installed pacemaker using the debian provided packages (so am
 runing  1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).

 I have roughly 10 nodes, among which some nodes are acting as SAN
 (exporting block devices using AoE protocol) and others nodes acting
 as initiators (they are actually mail servers, storing emails on the
 exported devices).
 Bellow are the defined resources for those nodes:

 xml primitive class=ocf id=pri_aoe1 provider=heartbeat
 type=AoEtarget \
instance_attributes id=pri_aoe1.1-instance_attributes \
rule id=node-sanaoe01 score=1 \
expression attribute=#uname
 id=expr-node-sanaoe01 operation=eq value=sanaoe01/ \
/rule \
nvpair id=pri_aoe1.1-instance_attributes-device
 name=device value=/dev/xvdb/ \
nvpair id=pri_aoe1.1-instance_attributes-nic
 name=nic value=eth0/ \
nvpair id=pri_aoe1.1-instance_attributes-shelf
 name=shelf value=1/ \
nvpair id=pri_aoe1.1-instance_attributes-slot
 name=slot value=1/ \
/instance_attributes \
instance_attributes id=pri_aoe2.1-instance_attributes \
rule id=node-sanaoe02 score=2 \
expression attribute=#uname
 id=expr-node-sanaoe2 operation=eq value=sanaoe02/ \
/rule \
nvpair id=pri_aoe2.1-instance_attributes-device
 name=device value=/dev/xvdb/ \
nvpair id=pri_aoe2.1-instance_attributes-nic
 name=nic value=eth1/ \
nvpair id=pri_aoe2.1-instance_attributes-shelf
 name=shelf value=2/ \
nvpair id=pri_aoe2.1-instance_attributes-slot
 name=slot value=1/ \
/instance_attributes \
 /primitive
 primitive pri_dovecot lsb:dovecot \
op start interval=0 timeout=20 \
op stop interval=0 timeout=30 \
op monitor interval=5 timeout=10
 primitive pri_spamassassin lsb:spamassassin \
op start interval=0 timeout=50 \
op stop interval=0 timeout=60 \
op monitor interval=5 timeout=20
 group grp_aoe pri_aoe1
 group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
 clone cln_mailstore grp_mailstore \
meta ordered=false interleave=true clone-max=2
 clone cln_san grp_aoe \
meta ordered=true interleave=true clone-max=2

 As I am in an opt-in cluster mode (symmetric-cluster=false), I
 have the location constraints bellow for those hosts:

 location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
 location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
 location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
 location LOC_MAIL_STORE_2 cln_mailstore inf: ms02

 So far so good. I want to make sure the initiators won't try to search
 for exported devices before the targets actually exported them. To do
 so, I though I could use the following ordering constraint:

 order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore

 Unfortunately if I add this constraint the clone Set cln_mailstore
 never starts (or even stops if started when I add the constraint).

 Is there something wrong with this ordering rule?
 Where can i find informations on what's going on?

 No errors in the logs?
 If you grep for 'pengine' does it want to start them or just leave them 
 stopped?


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http

Re: [Pacemaker] ordering cloned resources

2014-03-09 Thread Alexandre
So...,

It appears the problem doesn't come from the primitive but from the
cloned resource. If I use the primitive instead of the clone in the
order constraint (thus deleting the clone and the group), the second
resource of the constraint starts up as expected.

Any idea why?

Should I upgrade this pretty old version of pacemaker?

2014-03-08 10:36 GMT+01:00 Alexandre alxg...@gmail.com:
 Hi Andrew,

 I have tried to stop and start the first resource of the ordering
 constraint (cln_san), hoping it would trigger a start attemps of the
 second resource of the ordering constraint (cln_mailstore).
 I tailed the syslog logs on the node where I was expecting the second
 resource to start but really nothing appeared in those logs (I grepped
 'pengine as per your suggestion).

 I have done another test, where I changed the first resource of the
 ordering constraint with a very simple primitive (lsb resource), and
 it worked in this case.

 I am wondering if the issue doesn't come from the rather complicated
 first  resource. It is a cloned group which contains a primitive
 conditional instance attributes...
 Are you aware of any specific issue in pacemaker 1.1.7 with this kind
 of ressources?

 I will try to simplify the resources by getting rid of the conditional
 instance attribute and try again. In the mean time I'd be delighted to
 hear about what you guys think about that.

 Regards, Alex.

 2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net:

 On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote:

 Hi,

 I am setting up a cluster on debian wheezy.
 I have installed pacemaker using the debian provided packages (so am
 runing  1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).

 I have roughly 10 nodes, among which some nodes are acting as SAN
 (exporting block devices using AoE protocol) and others nodes acting
 as initiators (they are actually mail servers, storing emails on the
 exported devices).
 Bellow are the defined resources for those nodes:

 xml primitive class=ocf id=pri_aoe1 provider=heartbeat
 type=AoEtarget \
instance_attributes id=pri_aoe1.1-instance_attributes \
rule id=node-sanaoe01 score=1 \
expression attribute=#uname
 id=expr-node-sanaoe01 operation=eq value=sanaoe01/ \
/rule \
nvpair id=pri_aoe1.1-instance_attributes-device
 name=device value=/dev/xvdb/ \
nvpair id=pri_aoe1.1-instance_attributes-nic
 name=nic value=eth0/ \
nvpair id=pri_aoe1.1-instance_attributes-shelf
 name=shelf value=1/ \
nvpair id=pri_aoe1.1-instance_attributes-slot
 name=slot value=1/ \
/instance_attributes \
instance_attributes id=pri_aoe2.1-instance_attributes \
rule id=node-sanaoe02 score=2 \
expression attribute=#uname
 id=expr-node-sanaoe2 operation=eq value=sanaoe02/ \
/rule \
nvpair id=pri_aoe2.1-instance_attributes-device
 name=device value=/dev/xvdb/ \
nvpair id=pri_aoe2.1-instance_attributes-nic
 name=nic value=eth1/ \
nvpair id=pri_aoe2.1-instance_attributes-shelf
 name=shelf value=2/ \
nvpair id=pri_aoe2.1-instance_attributes-slot
 name=slot value=1/ \
/instance_attributes \
 /primitive
 primitive pri_dovecot lsb:dovecot \
op start interval=0 timeout=20 \
op stop interval=0 timeout=30 \
op monitor interval=5 timeout=10
 primitive pri_spamassassin lsb:spamassassin \
op start interval=0 timeout=50 \
op stop interval=0 timeout=60 \
op monitor interval=5 timeout=20
 group grp_aoe pri_aoe1
 group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
 clone cln_mailstore grp_mailstore \
meta ordered=false interleave=true clone-max=2
 clone cln_san grp_aoe \
meta ordered=true interleave=true clone-max=2

 As I am in an opt-in cluster mode (symmetric-cluster=false), I
 have the location constraints bellow for those hosts:

 location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
 location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
 location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
 location LOC_MAIL_STORE_2 cln_mailstore inf: ms02

 So far so good. I want to make sure the initiators won't try to search
 for exported devices before the targets actually exported them. To do
 so, I though I could use the following ordering constraint:

 order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore

 Unfortunately if I add this constraint the clone Set cln_mailstore
 never starts (or even stops if started when I add the constraint).

 Is there something wrong with this ordering rule?
 Where can i find informations on what's going on?

 No errors in the logs?
 If you grep for 'pengine' does it want to start them or just leave them 
 stopped?


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo

Re: [Pacemaker] ordering cloned resources

2014-03-22 Thread Alexandre
[10989]:   notice: te_rsc_command:
Initiating action 39: start pri_aoe1_start_0 on sanaoe02 (local)
Mar 22 23:37:50 sanaoe02 pengine[10988]:   notice: process_pe_message:
Calculated Transition 381: /var/lib/pacemaker/pengine/pe-input-104.bz2
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: INFO: Exporting
device /dev/xvdb on eth1 as shelf 2, slot 1
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: DEBUG: pri_aoe1 start : 0
Mar 22 23:37:50 sanaoe02 crmd[10989]:   notice: process_lrm_event: LRM
operation pri_aoe1_start_0 (call=198, rc=0, cib-update=1027,
confirmed=true) ok
Mar 22 23:37:50 sanaoe02 crmd[10989]:   notice: te_rsc_command:
Initiating action 25: start pri_dovecot_start_0 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]:   notice: te_rsc_command:
Initiating action 26: monitor pri_dovecot_monitor_5000 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]:   notice: run_graph: Transition
381 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
Mar 22 23:37:50 sanaoe02 crmd[10989]:   notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

and here are the logs on the node where the second resource starts:

Mar 22 22:37:50 ms02 crmd[89496]:   notice: process_lrm_event: LRM
operation pri_dovecot_start_0 (call=151, rc=0, cib-update=197,
confirmed=true) ok
Mar 22 22:37:50 ms02 dovecot: master: Dovecot v2.1.7 starting up
Mar 22 22:37:50 ms02 dovecot: master: Warning: /home is no longer
mounted. If this is intentional, remove it with doveadm mount
Mar 22 22:37:50 ms02 crmd[89496]:   notice: process_lrm_event: LRM
operation pri_dovecot_monitor_5000 (call=152, rc=0, cib-update=198,
confirmed=false) ok

I can't find anything useful in those logs, but if you think something
is or could be relevant, please feel free to point it out.

2014-03-11 2:13 GMT+01:00 Andrew Beekhof and...@beekhof.net:

 On 9 Mar 2014, at 10:36 pm, Alexandre alxg...@gmail.com wrote:

 So...,

 It appears the problem doesn't come from the primitive but for the
 cloned resource. If I use the primitive instead of the clone in the
 order constraint (thus deleting the clone and the group) , the second
 resource of the constraint startup as expected.

 Any idea why?

 Not without logs


 Should I upgrade this pretty old version of pacemaker?

 Yes :)


 2014-03-08 10:36 GMT+01:00 Alexandre alxg...@gmail.com:
 Hi Andrew,

 I have tried to stop and start the first resource of the ordering
 constraint (cln_san), hoping it would trigger a start attemps of the
 second resource of the ordering constraint (cln_mailstore).
 I tailed the syslog logs on the node where I was expecting the second
 resource to start but really nothing appeared in those logs (I grepped
 'pengine as per your suggestion).

 I have done another test, where I changed the first resource of the
 ordering constraint with a very simple primitive (lsb resource), and
 it worked in this case.

 I am wondering if the issue doesn't come from the rather complicated
 first  resource. It is a cloned group which contains a primitive
 conditional instance attributes...
 Are you aware of any specific issue in pacemaker 1.1.7 with this kind
 of ressources?

 I will try to simplify the resources by getting rid of the conditional
 instance attribute and try again. In the mean time I'd be delighted to
 hear about what you guys think about that.

 Regards, Alex.

 2014-03-07 4:21 GMT+01:00 Andrew Beekhof and...@beekhof.net:

 On 3 Mar 2014, at 3:56 am, Alexandre alxg...@gmail.com wrote:

 Hi,

 I am setting up a cluster on debian wheezy.
 I have installed pacemaker using the debian provided packages (so am
 runing  1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).

 I have roughly 10 nodes, among which some nodes are acting as SAN
 (exporting block devices using AoE protocol) and others nodes acting
 as initiators (they are actually mail servers, storing emails on the
 exported devices).
 Bellow are the defined resources for those nodes:

 xml primitive class=ocf id=pri_aoe1 provider=heartbeat
 type=AoEtarget \
   instance_attributes id=pri_aoe1.1-instance_attributes \
   rule id=node-sanaoe01 score=1 \
   expression attribute=#uname
 id=expr-node-sanaoe01 operation=eq value=sanaoe01/ \
   /rule \
   nvpair id=pri_aoe1.1-instance_attributes-device
 name=device value=/dev/xvdb/ \
   nvpair id=pri_aoe1.1-instance_attributes-nic
 name=nic value=eth0/ \
   nvpair id=pri_aoe1.1-instance_attributes-shelf
 name=shelf value=1/ \
   nvpair id=pri_aoe1.1-instance_attributes-slot
 name=slot value=1/ \
   /instance_attributes \
   instance_attributes id=pri_aoe2.1-instance_attributes \
   rule id=node-sanaoe02 score=2 \
   expression attribute=#uname
 id=expr-node-sanaoe2 operation=eq value=sanaoe02/ \
   /rule \
   nvpair id=pri_aoe2.1

[Pacemaker] collocating a set of resources with crmsh

2014-03-24 Thread Alexandre
Hi,

I am configuring a cluster on nodes that don't have pcs installed
(pacemaker 1.1.7 with crmsh).
I would like to configure colocated sets of resources (as shown
here: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/#s-resource-sets-collocation)
in that cluster but can't find the proper way to do it with crm.
I have tried the command below but it just failed:

sudo crm configure xml <constraints><rsc_colocation id=coloc-1
score=INFINITY/><resource_set id=collocated-set-example
sequential=true><resource_ref id=pri_apache2/><resource_ref
id=pri_iscsi/></resource_set></rsc_colocation></constraints>

ERROR: not well-formed (invalid token): line 1, column 32

What is the way to proceed with crm?

Regards.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] no-quorum-policy = demote?

2014-04-08 Thread Alexandre
Have you tried patching the monitor action of your RA, so that it sets a
temporary location constraint on the node to keep it from becoming master?
Something like:
location loc_splited_cluster -inf: MsRsc:Master $node

Not sure about the above crm syntax, but that's the idea.
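A rough sketch of what I mean, as one would type it in the crm configure
shell (resource and node names are made up):

location loc_no_master_node1 MsRsc rule $role=Master -inf: #uname eq node1

The monitor action would add something like this when crm_node -q reports
no quorum, and delete it again once quorum comes back.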
 On 8 Apr 2014, at 02:52, Andrew Beekhof and...@beekhof.net wrote:


 On 7 Apr 2014, at 5:54 pm, Christian Ciach derein...@gmail.com wrote:

  Hello,
 
  I am using Corosync 2.0 with Pacemaker 1.1 on Ubuntu Server 14.04 (daily
 builds until final release).
 
  My problem is as follows: I have a 2-node (plus a quorum-node) cluster
 to manage a multistate-resource. One node should be the master and the
 other one the slave. It is absolutely not allowed to have two masters at
 the same time. To prevent a split-brain situation, I am also using a third
 node as a quorum-only node (set to standby). There is no redundant
 connection because the nodes are connected over the internet.
 
  If one of the two nodes managing the resource becomes disconnected, it
 loses quorum. In this case, I want this resource to become a slave, but the
 resource should never be stopped completely!

 Ever? Including when you stop pacemaker?  If so, maybe the path of least
 resistance is to delete the contents of the stop action in that OCF agent...

  This leaves me with a problem: no-quorum-policy=stop will stop the
 resource, while no-quorum-policy=ignore will keep this resource in a
 master-state. I already tried to demote the resource manually inside the
 monitor-action of the OCF-agent, but pacemaker will promote the resource
 immediately again.
 
  I am aware that I am trying the manage a multi-site-cluster and there is
 something like the booth-daemon, which sounds like the solution to my
 problem. But unfortunately I need the location-constraints of pacemaker
 based on the score of the OCF-agent. As far as I know location-constraints
 are not possible when using booth, because the 2-node-cluster is
 essentially split into two 1-node-clusters. Is this correct?
 
  To conclude: Is it possible to demote a resource on quorum loss instead
 of stopping it? Is booth an option if I need to manage the location of the
 master based on the score returned by the OCF-agent?
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] no-quorum-policy = demote?

2014-04-11 Thread Alexandre
On 10 Apr 2014, at 15:44, Christian Ciach derein...@gmail.com wrote:

 I don't really like the idea to periodically poll crm_node -q for the
current quorum state. No matter how frequently the monitor-function gets
called, there will always be a small time frame where both nodes will be in
the master state at the same time.

 Is there a way to get a notification to the OCF-agent whenever the quorum
state changes?

You should probably look for something like this in the OCF shell functions
file (ocf-shellfuncs).

But also take a look at the page below; it documents a lot of
multi-state-specific notification variables that are most definitely useful
in your case.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_multi_state_proper_interpretation_of_notification_environment_variables.html
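For illustration only (exact handling depends on your agent), a notify
action typically branches on variables such as these:

case "${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}" in
    pre-promote)
        # somebody is about to be promoted
        ;;
    post-demote)
        # a demote has just completed
        ;;
esac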



 2014-04-08 10:14 GMT+02:00 Christian Ciach derein...@gmail.com:

 Interesting idea! I can confirm that this works. So, I need to monitor
the output of crm_node -q to check if the current partition has quorum.
If the partition doesn't have quorum, I need to set the location constraint
according to your example. If the partition gets quorum again, I need to
remove the constraint.

 This seems almost a bit hacky, but it should work okay. Thank you! It
almost a shame that pacemaker doesn't have demote as a
no-quorum-policy, but supports demote as a loss-policy for tickets.

 Yesterday I had another idea: Maybe I won't use a multistate resource
agent but a primitive instead. This way, I will start the resource outside
of pacemaker and let the start-action of the OCF-agent set the resource to
master and the stop-action sets it to slave. Then I will just use
no-quorum-policy=stop. The downside of this is that I cannot distinguish
between a stopped resource and a resource in a slave state using crm_mon.



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Possible to colocate ms ressource with standard ones ?

2014-04-30 Thread Alexandre
Why did you hide the resource agent provider? Is it a custom one? If so, you
probably need to make sure it is able to handle master/slave resources in
pacemaker properly.
On 30 Apr 2014, at 01:10, Andrew Beekhof and...@beekhof.net wrote:


 On 29 Apr 2014, at 11:06 pm, Sékine Coulibaly scoulib...@gmail.com
 wrote:

  Hi,
 
  Let me explain my use case. I'm using RHEL 6.3

 fwiw, there are updates to pacemaker 1.1.10 in 6.4 and 6.5.
 Its even supported now.

  with Corosync + Pacemaker + PostgreSQL9.2 + repmgr 2.0. I have two nodes
 names clustera and clusterb.
 
  I have a total of 3 resources :
  - APACHE
  - BOUM
  - MS_POSTGRESQL
 
  They are defined as follow :
 
  sudo crm configure  primitive APACHE ocf:heartbeat:apache \
 params configfile=/etc/httpd/conf/httpd.conf \
 op monitor interval=5s timeout=10s \
 op start interval=0 timeout=10s \
 op stop interval=0 timeout=10s
 
   sudo crm configure primitive BOUM ocf:heartbeat:anything \
 params binfile=/usr/local/boum/current/bin/boum \
 workdir=/var/boum \
 logfile=/var/log/boum/boum_STDOUT \
 errlogfile=/var/log/boum/boum_STDERR \
 pidfile=/var/run/boum.pid \
 op monitor interval=5s timeout=10s \
 op start interval=0 timeout=10s \
 op stop interval=0 timeout=10s
 
  sudo crm configure primitive POSTGRESQL ocf:xx:postgresql \
 params repmgr_conf=/var/lib/pgsql/repmgr/repmgr.conf
 pgctl=/usr/pgsql-9.2/bin/pg_ctl pgdata=/opt/pgdata \
 op start interval=0 timeout=90s \
 op stop interval=0 timeout=60s \
 op promote interval=0 timeout=120s \
 op monitor interval=53s role=Master \
 op monitor interval=60s role=Slave
 
  Since the PostgreSQL is in streaming replication, I need to have a
 master and a slave constantly running. Hence, I created an MasterSlave
 resource, called MS_POSTGRESQL.
 
  I want to that APACHE, BOUM and the master node of PostgreSQL run
 altogether on the same node. It looks like that as soon as I add a
 colocation, the Postgresql slave doesn't start anymore.
 
  I end up with :
 
  Online: [ clusterb clustera ]
 
   Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
   Masters: [ clustera ]
   Stopped: [ POSTGRESQL:1 ]
  APACHE  (ocf::heartbeat:apache):Started clustera
  BOUM (ocf::heartbeat:anything):   Started clustera
 
  My configuration is as follows :
 
 
  node clustera \
  attributes standby=off
  node clusterb \
  attributes standby=off
  primitive APACHE ocf:heartbeat:apache \
  params configfile=/etc/httpd/conf/httpd.conf \
  op monitor interval=5s timeout=10s \
  op start interval=0 timeout=10s \
  op stop interval=0 timeout=10s \
  meta target-role=Started
  primitive BOUM ocf:heartbeat:anything \
  params binfile=/usr/local/boum/current/bin/boum
 workdir=/var/boum logfile=/var/log/boum/boum_STDOUT
 errlogfile=/var/log/boum/boum_STDERR pidfile=/var/run/boum.pid \
  op monitor interval=5s timeout=10s \
  op start interval=0 timeout=10s \
  op stop interval=0 timeout=10s
  primitive POSTGRESQL ocf:xxx:postgresql \
  params repmgr_conf=/var/lib/pgsql/repmgr/repmgr.conf
 pgctl=/usr/pgsql-9.2/bin/pg_ctl pgdata=/opt/pgdata \
  op start interval=0 timeout=90s \
  op stop interval=0 timeout=60s \
  op promote interval=0 timeout=120s \
  op monitor interval=53s role=Master \
  op monitor interval=60s role=Slave
  ms MS_POSTGRESQL POSTGRESQL \
  meta clone-max=2 target-role=Started
 resource-stickiness=100 notify=true
  colocation link-resources inf: ZK UFO BOUM APACHE MS_POSTGRESQL

 Could you send the raw xml (cibadmin -Ql) please?
 I've never gotten used to crmsh's colocation syntax and don't have it
 installed locally (pcs is the supplied tool for configuring pacemaker on
 rhel)

  property $id=cib-bootstrap-options \
 
 dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
  cluster-infrastructure=openais \
  expected-quorum-votes=2 \
  stonith-enabled=false \
  no-quorum-policy=ignore \
  default-resource-stickiness=10 \
  start-failure-is-fatal=false \
  last-lrm-refresh=1398775386
 
  Is this a normal behaviour ? If it is, is there a workaround I didn't
 think of ?
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-08 Thread Alexandre
IIRC the Xen RA uses 'xm'. However, fixing the RA is trivial and worked
for me (if you're using the same RA).
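The fix was essentially a search-and-replace of the xm calls with xl,
something along these lines (illustrative only; adapt the path to whichever
agent you are actually using, and review the result before trusting it):

sed -i.bak 's/\bxm /xl /g' /path/to/the/resource/agent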
On 2014-07-08 at 21:39, Tobias Reineck tobias.rein...@hotmail.de wrote:

 Hello,

 I try to build a XEN HA cluster with pacemaker/corosync.
 Xen 4.3 works on all nodes and also the xen live migration works fine.
 Pacemaker also works with the cluster virtual IP.
 But when I try to create a XEN OCF Heartbeat resource to get online, an
 error
 appears:

 ##
 Failed actions:
 xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31,
 status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms,
 exec=555ms
 xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10,
 status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms,
 exec=706ms
 ##

 I added the resource with the command

 crm configure primitive xen_dns_ha ocf:heartbeat:Xen \
 params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \
 op monitor interval=10s \
 op start interval=0s timeout=30s \
 op stop interval=0s timeout=300s

 in the /var/log/messages the following error is printed:
 2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]:   notice:
 operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to
 connect to xend: No such file or directory. Is xend running? ]

 I use xen 4.3 with XL toolstack without xend .
 Is it possible to use pacemaker with Xen 4.3 ?
 Can anybody please help me ?

 Best regards
 T. Reineck



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker with Xen 4.3 problem

2014-07-09 Thread Alexandre
Actually I did it for the stonith resource agent external:xen0.
xm and xl are supposed to be semantically very close and as far as I can
see the ocf:heartbeat:Xen agent doesn't seem to use any xm command that
shouldn't work with xl.
What error do you have when using xl instead of xm?

Regards.


2014-07-09 8:39 GMT+02:00 Tobias Reineck tobias.rein...@hotmail.de:

 Hello,

 do you mean the Xen script in /usr/lib/ocf/resource.d/heartbeat/ ?
 I also tried this to replace all xm with xl with no success.
 Is it possible that you can show me you RA resource for Xen ?

 Best regards
 T. Reineck



 --
 Date: Tue, 8 Jul 2014 22:27:59 +0200
 From: alxg...@gmail.com
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Pacemaker with Xen 4.3 problem


 IIRC the xen RA uses 'xm'. However fixing the RAin is trivial and worked
 for me (if you're using the same RA)
 Le 2014-07-08 21:39, Tobias Reineck tobias.rein...@hotmail.de a écrit
 :

 Hello,

 I try to build a XEN HA cluster with pacemaker/corosync.
 Xen 4.3 works on all nodes and also the xen live migration works fine.
 Pacemaker also works with the cluster virtual IP.
 But when I try to create a XEN OCF Heartbeat resource to get online, an
 error
 appears:

 ##
 Failed actions:
 xen_dns_ha_start_0 on xen01.domain.dom 'unknown error' (1): call=31,
 status=complete, last-rc-change='Sun Jul 6 15:02:25 2014', queued=0ms,
 exec=555ms
 xen_dns_ha_start_0 on xen02.domain.dom 'unknown error' (1): call=10,
 status=complete, last-rc-change='Sun Jul 6 15:15:09 2014', queued=0ms,
 exec=706ms
 ##

 I added the resource with the command

 crm configure primitive xen_dns_ha ocf:heartbeat:Xen \
 params xmfile=/root/xen_storage/dns_dhcp/dns_dhcp.xen \
 op monitor interval=10s \
 op start interval=0s timeout=30s \
 op stop interval=0s timeout=300s

 in the /var/log/messages the following error is printed:
 2014-07-08T21:09:19.885239+02:00 xen01 lrmd[3443]:   notice:
 operation_finished: xen_dns_ha_stop_0:18214:stderr [ Error: Unable to
 connect to xend: No such file or directory. Is xend running? ]

 I use xen 4.3 with XL toolstack without xend .
 Is it possible to use pacemaker with Xen 4.3 ?
 Can anybody please help me ?

 Best regards
 T. Reineck



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___ Pacemaker mailing list:
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home:
 http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
 http://bugs.clusterlabs.org

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] help deciphering output

2014-10-09 Thread Alexandre
I have seen this behavior on several virtualized environments. When the VM
backup starts, the VM actually freezes for a (short?) period of time. I guess
it then stops responding to the other cluster nodes, thus triggering an
unexpected failover and/or fencing. I have seen this kind of behavior on a
VMware environment using Veeam backup, as well as on Proxmox (I don't know
which backup tool that one uses).
That's actually an interesting topic I never thought about raising here.
How can we avoid that? Increasing timeouts? I am afraid we would have to
reach unacceptably high timeout values, and I am not even sure that would
fix the problem.
I think not all VM snapshot strategies would trigger that problem; do you
guys have any feedback on which backup/snapshot method best suits corosync
clusters?
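For what it's worth, the only knob I know of on the corosync side is the
totem token timeout, e.g. something like the snippet below in corosync.conf
(the value is purely illustrative, and a long token has its own drawbacks):

totem {
    token: 10000
}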

Regards
On 9 Oct 2014, at 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com
wrote:

 One of my nodes died in a 2 node cluster

 I gather something went wrong, and it fenced/killed itself. But I am not
 sure what happened.

 I think maybe around that time the VM backups happened and snap of the VM
 could have happened

 But there is nothing for me to put my finger on

 Output from messages around that time

 This is on devrp1
 Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed,
 forming new configuration.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking
 activity
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the
 non-primary component and will NOT provide any services.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
 Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
 left the membership and a new membership was formed.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
 r(0) ip(10.172.214.51) ; members(old:2 left:1)
 Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service
 synchronization, ready to provide service.
 Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
 Membership 424: quorum lost
 Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
 left the membership and a new membership was formed.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained,
 resuming activity
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the
 primary component and will provide service.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
 r(0) ip(10.172.214.51) ; members(old:1 left:0)
 Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service
 synchronization, ready to provide service.
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
 cman_event_callback: Node devrp2[2] - state is now lost (was member)
 Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node
 (devrp2) left the cluster
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
 Membership 428: quorum acquired
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
 cman_event_callback: Node devrp2[2] - state is now member (was lost)
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State
 transition S_NOT_DC - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
 origin=reap_dead_nodes ]
 Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we
 were killed by cman_tool or other application
 Oct  8 23:31:42 devrp1 pacemakerd[2339]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 stonith-ng[2346]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: crmd_cs_destroy: connection
 terminated
 Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
 Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
 Oct  8 23:31:43 devrp1 attrd[2348]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost
 connection to Corosync service!
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client
 0x18cf240, pid=2350...
 Oct  8 23:31:43 devrp1 pacemakerd[2339]:error: mcp_cpg_destroy:
 Connection destroyed
 Oct  8 23:31:43 devrp1 cib[2345]:error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 Oct  8 23:31:43 devrp1 cib[2345]:error: cib_cs_destroy: Corosync
 connection lost!  Exiting.
 Oct  8 23:31:43 devrp1 stonith-ng[2346]:error:
 stonith_peer_cs_destroy: Corosync connection terminated
 Oct  8 23:31:43 devrp1 dlm_controld[1752]: 

Re: [Pacemaker] colocate three resources

2014-11-09 Thread Alexandre
I think you can use a single colocation with a set of resources. crmsh
allows you to create such a colocation with:

crm colocation vm_with_disks inf: vm_srv ( ms_disk_R:Master
ms_disk_S:Master )

This forces the cluster to place the two master roles on the same host,
without any specific ordering between them, and to run the VM along with
them.
 On 9 Nov 2014, at 11:31, Matthias Teege matthias-gm...@mteege.de wrote:

 Hallo,

 On a cluster I have to place three resources on the same node.

 ms ms_disk_R p_disk_R
 ms ms_disk_S p_disk_S
 primitive vm_srv ocf:heartbeat:VirtualDomain

 The colocation constraints looks like this:

 colocation vm_with_disk_R inf: vm_srv ms_disk_R:Master
 colocation vm_with_disk_S inf: vm_srv ms_disk_S:Master

 Do I have to add another colocation constraint to define a
 colocation between disk_R and disk_S.  I'm not sure because the
 documentation says:

 with-rsc: The colocation target.  The cluster will decide where to
 put this resource first and then decide where to put the resource in
 the rsc field.

 In my case the colocation targets are ms_disk_R and ms_disk_S.
 If pacemaker decides to put disk_R on node A and disk_S on node B
 vm_srv would not start.

 I use order constraints to start disks before the vm resource.

 order disk_R_before_vm inf: ms_disk_R:promote vm_srv:start
 order disk_S_before_vm inf: ms_disk_S:promote vm_srv:start

 Thanks
 Matthias


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Daemon Start attempt on wrong Server

2014-11-11 Thread Alexandre
You should use an opt-in cluster: set the cluster option
symmetric-cluster=false. This tells the cluster not to place a resource
anywhere unless a location constraint explicitly tells it where that
resource should run.

The cluster will still probe (monitor) the sql resources on the www hosts,
and those probes will return rc 5, but this is expected and harmless.
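For example (assuming crmsh, as used elsewhere in this thread):

crm configure property symmetric-cluster=false

Your existing location constraints then decide where each resource is
allowed to run.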
On 11 Nov 2014, at 13:22, Hauke Homburg hhomb...@w3-creative.de wrote:

 Hello,

 I am installing a 6 Node pacemaker CLuster. 3 Nodes for Apache, 3 Nodes
 for Postgres.

 My Cluster Config is

 node kvm-node1
 node sql-node1
 node sql-node2
 node sql-node3
 node www-node1
 node www-node2
 node www-node3
 primitive pri_kvm_ip ocf:heartbeat:IPaddr2 \
 params ip=10.0.6.41 cidr_netmask=255.255.255.0 \
 op monitor interval=10s timeout=20s
 primitive pri_sql_ip ocf:heartbeat:IPaddr2 \
 params ip=10.0.6.31 cidr_netmask=255.255.255.0 \
 op monitor interval=10s timeout=20s
 primitive pri_www_ip ocf:heartbeat:IPaddr2 \
 params ip=10.0.6.21 cidr_netmask=255.255.255.0 \
 op monitor interval=10s timeout=20s
 primitive res_apache ocf:heartbeat:apache \
 params configfile=/etc/apache2/apache2.conf \
 op start interval=0 timeout=40 \
 op stop interval=0 timeout=60 \
 op monitor interval=60 timeout=120 start-delay=0 \
 meta target-role=Started
 primitive res_pgsql ocf:heartbeat:pgsql \
 params pgctl=/usr/lib/postgresql/9.1/bin/pg_ctl
 psql=/usr/bin/psql start_opt= pgdata=/var/lib/postgresql/9.1/main
 config=/etc/postgresql/9.1/main/postgresql.conf pgdba=postgres \
 op start interval=0 timeout=120s \
 op stop interval=0 timeout=120s \
 op monitor interval=30s timeout=30s depth=0
 location loc_kvm_ip_node1 pri_kvm_ip 10001: kvm-node1
 location loc_sql_ip_node1 pri_sql_ip inf: sql-node1
 location loc_sql_ip_node2 pri_sql_ip inf: sql-node2
 location loc_sql_ip_node3 pri_sql_ip inf: sql-node3
 location loc_sql_srv_node1 res_pgsql inf: sql-node1
 location loc_sql_srv_node2 res_pgsql inf: sql-node2
 location loc_sql_srv_node3 res_pgsql inf: sql-node3
 location loc_www_ip_node1 pri_www_ip inf: www-node1
 location loc_www_ip_node2 pri_www_ip inf: www-node2
 location loc_www_ip_node3 pri_www_ip inf: www-node3
 location loc_www_srv_node1 res_apache inf: www-node1
 location loc_www_srv_node2 res_apache inf: www-node2
 location loc_www_srv_node3 res_apache inf: www-node3
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
 cluster-infrastructurFailed actions:

 Why do i see in crm_mon the following output?

 res_pgsql_start_0 (node=www-node1, call=16, rc=5, status=complete):
 not installed
 res_pgsql_start_0 (node=www-node2, call=13, rc=5, status=complete):
 not installed
 pri_www_ip_monitor_1 (node=www-node3, call=22, rc=7,
 status=complete): not running
 res_pgsql_start_0 (node=www-node3, call=13, rc=5, status=complete):
 not installed
 res_apache_start_0 (node=sql-node2, call=18, rc=5, status=complete):
 not installed
 res_pgsql_start_0 (node=sql-node2, call=12, rc=5, status=complete):
 not installed
 res_apache_start_0 (node=sql-node3, call=12, rc=5, status=complete):
 not installed
 res_pgsql_start_0 (node=sql-node3, call=10, rc=5, status=complete):
 not installed
 res_apache_start_0 (node=kvm-node1, call=12, rc=5, status=complete):
 not installed
 res_pgsql_start_0 (node=kvm-node1, call=20, rc=5, status=complete):
 not installede=openais \
 expected-quorum-votes=7 \
 stonith-enabled=false


 I set the infinity for pgsql on all 3 sql nodes, but not! on the www
 nodes. Why tries Pacemaker to start the Postgres SQL Server on the www
 Node? In example?

 Thank for your Help

 greetings

 Hauke

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reset failcount for resources

2014-11-16 Thread Alexandre
On 13 Nov 2014, at 12:09, Arjun Pandey apandepub...@gmail.com wrote:

 Hi

 I am running a 2 node cluster with this config

 Master/Slave Set: foo-master [foo]
 Masters: [ bharat ]
 Slaves: [ ram ]
 AC_FLT (ocf::pw:IPaddr): Started bharat
 CR_CP_FLT (ocf::pw:IPaddr): Started bharat
 CR_UP_FLT (ocf::pw:IPaddr): Started bharat
 Mgmt_FLT (ocf::pw:IPaddr): Started bharat

 where IPaddr RA is just modified IPAddr2 RA. Additionally i have a
 collocation constraint for the IP addr to be collocated with the master.
 I have set the migration-threshold as 2 for the VIP. I also have set the
failure-timeout to 15s.


 Initially i bring down the interface on bharat to force switch-over to
ram. After this i fail the interfaces on bharat again. Now i bring the
interface up again on ram. However the virtual IP's are now in stopped
state.

 I don't get out of this unless i use crm_resource -C to reset state of
resources.
 However if i check failcount of resources after this it's still set as
INFINITY.
 Based on the documentation the failcount on a node should have expired
after the failure-timeout.That doesn't happen.

Expiration probably happens, meaning the failure is marked as expired.
However, expired failures are only actually removed when the recheck timer
pops, which is governed by the cluster-recheck-interval (15 minutes by
default).
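If you need to get rid of a failcount immediately instead of waiting for the
recheck, the usual commands are along these lines (the resource and node
names are placeholders; crm_failcount is run on the affected node):

crm_failcount -D -r <resource>
crm_resource -C -r <resource> -N <node>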

 However why don't we reset the count after the the crm_resource -C
command too. Any other command to actually reset the failcount.

 Thanks in advance

 Regards
 Arjun

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] postgresql never promoted

2015-02-20 Thread Alexandre
Hi list,

I am facing a very strange issue.
I have set up a postgresql cluster (with streaming replication).
The replication works fine when started manually, but the RA never seems to
promote any host where the resource is started.

My config is below:
node pp-obm-sgbd.upond.fr
node pp-obm-sgbd2.upond.fr \
attributes pri_pgsql-data-status=DISCONNECT
primitive pri_obm-locator lsb:obm-locator \
params \
op start interval=0s timeout=60s \
op stop interval=0s timeout=60s \
op monitor interval=10s timeout=20s
primitive pri_pgsql pgsql \
params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql
pgdata=/var/lib/pgsql/9.1/data/ node_list=pp-obm-sgbd.upond.fr
pp-obm-sgbd2.upond.fr repuser=replication rep_mode=sync
restart_on_promote=true restore_command=cp /var/lib/pgsql/replication/%f
%p primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5
keepalives_count=5 master_ip=193.50.151.200 \
op start interval=0 on-fail=restart timeout=120s \
op monitor interval=20s on-fail=restart timeout=60s \
op monitor interval=15s on-fail=restart role=Master timeout=60s \
op promote interval=0 on-fail=restart timeout=120s \
op demote interval=0 on-fail=stop timeout=120s \
op notify interval=0s timeout=60s \
op stop interval=0 on-fail=block timeout=120s
primitive pri_vip IPaddr2 \
params ip=193.50.151.200 nic=eth1 cidr_netmask=32 \
op start interval=0s timeout=60s \
op monitor interval=10s timeout=60s \
op stop interval=0s timeout=60s
ms ms_pgsql pri_pgsql \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
colocation clc_vip-ms_pgsql inf: pri_vip:Started ms_pgsql:Master
order ord_dm_pgsql-vip 0: ms_pgsql:demote pri_vip:stop
order ord_pm_pgsql-vip 0: ms_pgsql:promote pri_vip:start symmetrical=false
property cib-bootstrap-options: \
dc-version=1.1.11-97629de \
cluster-infrastructure=cman \
last-lrm-refresh=1424459378 \
no-quorum-policy=ignore \
stonith-enabled=false \
maintenance-mode=false
rsc_defaults rsc_defaults-options: \
resource-stickiness=1000 \
migration-threshold=5

crm_mon shows both hosts as slaves and neither is ever promoted:

Master/Slave Set: ms_pgsql [pri_pgsql]
     Slaves: [ pp-obm-sgbd.upond.fr pp-obm-sgbd2.upond.fr ]
Node Attributes:
* Node pp-obm-sgbd.upond.fr:
    + master-pri_pgsql        : 1000
    + pri_pgsql-status        : HS:alone
    + pri_pgsql-xlog-loc      : 2D78
* Node pp-obm-sgbd2.upond.fr:
    + master-pri_pgsql        : -INFINITY
    + pri_pgsql-data-status   : DISCONNECT
    + pri_pgsql-status        : HS:alone
    + pri_pgsql-xlog-loc      : 2D00

On the host where I am expecting the promotion, I see the following when doing cleanups:
Feb 20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]: INFO: Master does not
exist.
Feb 20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]: INFO: My data status=.

And on the other node I see the following logs, which sound interesting:
Feb 20 20:16:10 pp-obm-sgbd2 crmd[19626]:   notice: print_synapse:
[Action   18]: Pending pseudo op ms_pgsql_promoted_0   on N/A
(priority: 100, waiting:  11)
Feb 20 20:16:10 pp-obm-sgbd2 crmd[19626]:   notice: print_synapse:
[Action   17]: Pending pseudo op ms_pgsql_promote_0    on N/A
(priority: 0, waiting:  21)

The N/A part seems to tell me the cluster doesn't know where to promote the
resource, but I can't understand why.
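I guess the next step is to look at the allocation and promotion scores
directly, e.g. with something like:

crm_simulate -sL | grep -i pri_pgsql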

Below are my constraint rules:

pcs constraint show
Location Constraints:
Ordering Constraints:
  demote ms_pgsql then stop pri_vip (score:0)
  promote ms_pgsql then start pri_vip (score:0) (non-symmetrical)
Colocation Constraints:
  pri_vip with ms_pgsql (score:INFINITY) (rsc-role:Started)
(with-rsc-role:Master)

I am now out of ideas so any help is very much appreciated.

Regards.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] postgresql never promoted

2015-02-20 Thread Alexandre
Thanks, I was already on my way to do it.
That's done now.
On 20 Feb 2015, at 20:50, Digimer li...@alteeve.ca wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Do you mind asking this in the new mailing list?

 http://clusterlabs.org/mailman/listinfo/users

 This list is scheduled to be closed and all users are encouraged to
 switch. :)

 On 20/02/15 02:18 PM, Alexandre wrote:
  Hi list,
 
  I am facing a very strange issue. I have setup a postgresql cluster
  (with streaming repl). The replication works ok when started
  manually but the RA seems to never promote any host where the
  resource is started.
 
  my config is bellow: node pp-obm-sgbd.upond.fr
  http://pp-obm-sgbd.upond.fr node pp-obm-sgbd2.upond.fr
  http://pp-obm-sgbd2.upond.fr \ attributes
  pri_pgsql-data-status=DISCONNECT primitive pri_obm-locator
  lsb:obm-locator \ params \ op start interval=0s timeout=60s \ op
  stop interval=0s timeout=60s \ op monitor interval=10s timeout=20s
  primitive pri_pgsql pgsql \ params
  pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql
  pgdata=/var/lib/pgsql/9.1/data/ node_list=pp-obm-sgbd.upond.fr
  http://pp-obm-sgbd.upond.fr pp-obm-sgbd2.upond.fr
  http://pp-obm-sgbd2.upond.fr repuser=replication rep_mode=sync
  restart_on_promote=true restore_command=cp
  /var/lib/pgsql/replication/%f %p
  primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5
  keepalives_count=5 master_ip=193.50.151.200 \ op start interval=0
  on-fail=restart timeout=120s \ op monitor interval=20s
  on-fail=restart timeout=60s \ op monitor interval=15s
  on-fail=restart role=Master timeout=60s \ op promote interval=0
  on-fail=restart timeout=120s \ op demote interval=0 on-fail=stop
  timeout=120s \ op notify interval=0s timeout=60s \ op stop
  interval=0 on-fail=block timeout=120s primitive pri_vip IPaddr2 \
  params ip=193.50.151.200 nic=eth1 cidr_netmask=32 \ op start
  interval=0s timeout=60s \ op monitor interval=10s timeout=60s \ op
  stop interval=0s timeout=60s ms ms_pgsql pri_pgsql \ meta
  master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
  colocation clc_vip-ms_pgsql inf: pri_vip:Started ms_pgsql:Master
  order ord_dm_pgsql-vip 0: ms_pgsql:demote pri_vip:stop order
  ord_pm_pgsql-vip 0: ms_pgsql:promote pri_vip:start
  symmetrical=false property cib-bootstrap-options: \
  dc-version=1.1.11-97629de \ cluster-infrastructure=cman \
  last-lrm-refresh=1424459378 \ no-quorum-policy=ignore \
  stonith-enabled=false \ maintenance-mode=false rsc_defaults
  rsc_defaults-options: \ resource-stickiness=1000 \
  migration-threshold=5
 
  crm_mon shows both hosts as slaves and none is never promoted
  ever:
 
  Master/Slave Set: ms_pgsql [pri_pgsql] Slaves: [
  pp-obm-sgbd.upond.fr http://pp-obm-sgbd.upond.fr
  pp-obm-sgbd2.upond.fr http://pp-obm-sgbd2.upond.fr ] Node
  Attributes: * Node pp-obm-sgbd.upond.fr
  http://pp-obm-sgbd.upond.fr: + master-pri_pgsql
  : 1000 + pri_pgsql-status  : HS:alone +
  pri_pgsql-xlog-loc: 2D78 * Node
  pp-obm-sgbd2.upond.fr http://pp-obm-sgbd2.upond.fr: +
  master-pri_pgsql  : -INFINITY +
  pri_pgsql-data-status : DISCONNECT + pri_pgsql-status
  : HS:alone + pri_pgsql-xlog-loc: 2D00
 
  on the host I am expecting promotion I see when doing cleanups: Feb
  20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]: INFO: Master does
  not exist. Feb 20 20:15:07 pp-obm-sgbd pgsql(pri_pgsql)[30994]:
  INFO: My data status=.
 
  And on the other node I see the following logs that sounds
  interrseting: Feb 20 20:16:10 pp-obm-sgbd2 crmd[19626]:   notice:
  print_synapse: [Action   18]: Pending pseudo op ms_pgsql_promoted_0
  on N/A (priority: 100, waiting:  11) Feb 20 20:16:10
  pp-obm-sgbd2 crmd[19626]:   notice: print_synapse: [Action   17]:
  Pending pseudo op ms_pgsql_promote_0   on N/A
  (priority: 0, waiting:  21)
 
  the N/A part seems to tell me the cluster don't know where to
  promote the resource but I can't understand why.
 
  bellow are my constraint rules:
 
  pcs constraint show Location Constraints: Ordering Constraints:
  demote ms_pgsql then stop pri_vip (score:0) promote ms_pgsql then
  start pri_vip (score:0) (non-symmetrical) Colocation Constraints:
  pri_vip with ms_pgsql (score:INFINITY) (rsc-role:Started)
  (with-rsc-role:Master)
 
  I am now out of ideas so any help is very much appreciated.
 
  Regards.
 
 
  ___ Pacemaker mailing
  list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org Getting started:
  http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
  http://bugs.clusterlabs.org
 


 - --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1

[Pacemaker] Cannot clean history

2015-05-04 Thread Alexandre
Hi,

I have a pacemaker / corosync / cman cluster running on RedHat 6.6.
Although the cluster is working as expected, I have some traces of old
failures (from several months ago) that I can't get rid of.
Basically I have set cluster-recheck-interval=300 and
failure-timeout=600 (in rsc_defaults) as shown below:

property $id=cib-bootstrap-options \
dc-version=1.1.10-14.el6-368c726 \
cluster-infrastructure=cman \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
stonith-enabled=false \
last-lrm-refresh=1429702408 \
maintenance-mode=false \
cluster-recheck-interval=300
rsc_defaults $id=rsc-options \
failure-timeout=600

So I would have expected old failures to be purged from the CIB long ago,
but I still get the following when issuing crm_mon -frA1:

Migration summary:
* Node host1:
   etc_ml_drbd: migration-threshold=100 fail-count=244
last-failure='Sat Feb 14 17:04:05 2015'
   spool_postfix_drbd_msg: migration-threshold=100 fail-count=244
last-failure='Sat Feb 14 17:04:05 2015'
   lib_ml_drbd: migration-threshold=100 fail-count=244
last-failure='Sat Feb 14 17:04:05 2015'
   lib_imap_drbd: migration-threshold=100 fail-count=244
last-failure='Sat Feb 14 17:04:05 2015'
   spool_imap_drbd: migration-threshold=100 fail-count=11654
last-failure='Sat Feb 14 17:04:05 2015'
   spool_ml_drbd: migration-threshold=100 fail-count=244
last-failure='Sat Feb 14 17:04:05 2015'
   documents_drbd: migration-threshold=100 fail-count=248
last-failure='Sat Feb 14 17:58:55 2015'
* Node host2:
   documents_drbd: migration-threshold=100 fail-count=548
last-failure='Sat Feb 14 16:26:33 2015'

I have tried crm_failcount -D on the resources and also tried a cleanup...
but it's still there!
How can I get rid of those records (so my monitoring tools stop
complaining)?

Regards.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] resource dependency

2009-11-20 Thread Alexandre Biancalana
Hi list,

 I'm building a 4-node cluster where 2 nodes will export drbd devices
via the ietd iSCSI target (storage nodes) and the other 2 nodes will run
Xen VMs (app nodes) stored on LVM partitions accessed via the open-iscsi
initiator, using multipath for failover.

 While configuring the cluster resource ordering, I ran into a situation
for which I can't find a solution. The Xen VM resources depend on an iSCSI
initiator resource to run; I have two iSCSI initiator resources, one
for each storage node. How can I make the VM resources depend on
any of the iSCSI initiator resources?

 I am thinking of creating a clone of the iSCSI initiator resource and
using rules to change the clone options so that I can have two clone
instances per app node with different portal parameters. This way I could
make the VM resources depend on this clone (see the sketch below). Is this
possible?
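Roughly what I have in mind (all names and parameters below are made up,
just to illustrate the shape of it):

primitive pri_iscsi ocf:heartbeat:iscsi \
        params portal=192.168.0.10:3260 target=iqn.2009-11.example:storage
clone cln_iscsi pri_iscsi

with per-node rules inside the instance_attributes to point each app node
at a different portal.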

 I'm using debian-lenny with the packages described at
http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo

Excuse me for the bad English.

Best Regards,

Alexandre

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] resource dependency

2009-11-20 Thread Alexandre Biancalana
On Fri, Nov 20, 2009 at 2:53 PM, Matthew Palmer mpal...@hezmatt.org wrote:
 On Fri, Nov 20, 2009 at 02:42:29PM -0200, Alexandre Biancalana wrote:
  I'm building a 4 node cluster where 2 nodes will export drbd devices
 via ietd iscsi target (storage nodes) and other 2 nodes will run xen
 vm (app nodes) stored in lvm partition accessed via open-iscsi
 initiator, using multipath to failover.

  Configuring the cluster resources order I came up with a situation
 that I don't find a solution. The xen vm resources depends of iscsi
 initiator resource to run, I have two iscsi initiator resources, one
 for each storage node, how can I make the vm resources dependent on
 any iscsi initiator resources ?

 Personally, I think you've got the wrong design.  I'd prefer to loosely
 couple the storage and VM clusters, with the storage cluster exporting iSCSI
 initiators which the VM cluster then attaches to the VMs as required.  Put
 the error handling for the case where the iSCSI initiator isn't available
 for a VM into the resource agent for the VM.  To me, this seems like a more
 robust solution.  Tying everything up together feels like you're asking for
 trouble whenever any failover happens -- everything gets recalculated and the
 cluster spends the next several minutes jiggling resources around before
 everything settles back down again.

Hi Matt, thank you for the reply.

OK. But if I go with your suggestion I end up with the same question.

Having the 2-node storage cluster exporting the block device via
iSCSI, how can I make the VM resource on the VM cluster depend on
*any* of the exported iSCSI targets? The standard order configuration only
allows a dependency on *one* resource.

The only way I see is to configure an IP resource on the storage cluster
and use it as the portal for the iSCSI initiator resource of the VM
cluster. I don't want to do it this way because I want to use multipath,
for a quicker failover.

Alexandre

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] resource dependency

2009-11-20 Thread Alexandre Biancalana
On Fri, Nov 20, 2009 at 4:35 PM, Andrew Beekhof and...@beekhof.net wrote:
 On Fri, Nov 20, 2009 at 5:42 PM, Alexandre Biancalana
 biancal...@gmail.com wrote:
 Hi list,

  I'm building a 4 node cluster where 2 nodes will export drbd devices
 via ietd iscsi target (storage nodes) and other 2 nodes will run xen
 vm (app nodes) stored in lvm partition accessed via open-iscsi
 initiator, using multipath to failover.

  Configuring the cluster resources order I came up with a situation
 that I don't find a solution. The xen vm resources depends of iscsi
 initiator resource to run, I have two iscsi initiator resources, one
 for each storage node, how can I make the vm resources dependent on
 any iscsi initiator resources ?

 The cluster can't express this case yet.
 But its on the to-doo list.

Thank you for the answer, Andrew, and congratulations on this great
piece of software.

Best Regards,
Alexandre

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker