[ClusterLabs] Antw: Re: Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-05-19 Thread Ulrich Windl
>>> Jehan-Guillaume de Rorthais  wrote on 19.05.2016 at 21:29 in
message <20160519212947.6cc0fd7b@firost>:
[...]
> I was thinking of a use case where a graceful demote or stop action fails
> multiple times, to give the RA a chance to choose another method to stop
> the resource before it requires a migration. For instance, PostgreSQL has
> three different kinds of stop, the last one not being graceful, but still
> better than a kill -9.

For example, the Xen RA tries a clean shutdown with a timeout of about 2/3 of
the operation's timeout; if that fails, it shuts the VM down the hard way.

I don't know Postgres in detail, but I could imagine a three-step approach:
1) Shutdown after current operations have finished
2) Shutdown regardless of pending operations (doing rollbacks)
3) Shutdown the hard way, requiring recovery on the next start (I think in
Oracle this is called a "shutdown abort")

Depending on the scenario, one may start at step 2).
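For illustration, a rough sketch of such an escalating stop as it could look in
a PostgreSQL RA (assuming ocf-shellfuncs is sourced and the operation timeout is
available in milliseconds as OCF_RESKEY_CRM_meta_timeout; the 2/3 split and the
pgdata parameter name are placeholders, not taken from any real agent):

    pgsql_escalating_stop() {
        op_timeout=$(( ${OCF_RESKEY_CRM_meta_timeout:-60000} / 1000 ))
        soft=$(( op_timeout * 2 / 3 ))

        # 1) "smart": wait for current sessions/operations to finish
        pg_ctl stop -D "$OCF_RESKEY_pgdata" -m smart -t "$soft" -w && return $OCF_SUCCESS
        # 2) "fast": disconnect sessions, roll back open transactions
        pg_ctl stop -D "$OCF_RESKEY_pgdata" -m fast -t $(( op_timeout - soft )) -w && return $OCF_SUCCESS
        # 3) "immediate": the hard way, crash recovery on next start
        pg_ctl stop -D "$OCF_RESKEY_pgdata" -m immediate -w && return $OCF_SUCCESS

        return $OCF_ERR_GENERIC
    }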

[...]
I think RAs should not rely on "stop" being called multiple times for a 
resource to be stopped.

Regards,
Ulrich




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] attrd does not clean per-node cache after node removal

2016-05-19 Thread Ken Gaillot
On 03/23/2016 12:01 PM, Vladislav Bogdanov wrote:
> 23.03.2016 19:52, Vladislav Bogdanov wrote:
>> 23.03.2016 19:39, Ken Gaillot wrote:
>>> On 03/23/2016 07:35 AM, Vladislav Bogdanov wrote:
 Hi!

 It seems like atomic attrd in post-1.1.14 (eb89393) does not
 fully clean the node cache after a node is removed.

I haven't forgotten, this was a tricky one :-)

I believe this regression was introduced in da17fd0, which clears the
node's attribute *values* when purging the node, but not the value
*structures* that contain the node name and ID. That was intended as a
fix for when nodes leave and rejoin. However the same affected function
is used to handle "crm_node -R" requests, which should cause complete
removal.

I hope to have a fix soon.

Note that the behavior may still occur if "crm_node -R" is not called
after reloading corosync.
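
For reference, the sequence implied here would look roughly like this on the
surviving node (node name taken from the report; exact flags can vary between
corosync/pacemaker versions):

    # after stopping pacemaker+corosync on the departing node and editing the
    # nodelist/votequorum settings on the survivor:
    corosync-cfgtool -R                         # reload corosync.conf
    crm_node --force -R wa-test-server-ha-03    # purge the node from the cluster caches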

>>> Is this a regression? Or have you only tried it with this version?
>>
>> Only with this one.
>>
>>>
 After our QA guys remove node wa-test-server-ha-03 from a two-node 
 cluster:
 * stop pacemaker and corosync on wa-test-server-ha-03
 * remove node wa-test-server-ha-03 from corosync nodelist on 
 wa-test-server-ha-04
 * tune votequorum settings
 * reload corosync on wa-test-server-ha-04
 * remove node from pacemaker on wa-test-server-ha-04
 * delete everything from /var/lib/pacemaker/cib on wa-test-server-ha-03
 , and then join it with a different corosync ID (but with the same 
 node name),
 we see the following in logs:

 Leave node 1 (wa-test-server-ha-03):
 Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: 
 crm_update_peer_proc: Node wa-test-server-ha-03[1] - state is now 
 lost (was member)
 Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Removing 
 all wa-test-server-ha-03 (1) attributes for attrd_peer_change_cb
 Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Lost 
 attribute writer wa-test-server-ha-03
 Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Removing 
 wa-test-server-ha-03/1 from the membership list
 Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Purged 1 
 peers with id=1 and/or uname=wa-test-server-ha-03 from the membership 
 cache
 Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: 
 Processing peer-remove from wa-test-server-ha-04: wa-test-server-ha-03 0
 Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Removing 
 all wa-test-server-ha-03 (0) attributes for wa-test-server-ha-04
 Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Removing 
 wa-test-server-ha-03/1 from the membership list
 Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Purged 1 
 peers with id=0 and/or uname=wa-test-server-ha-03 from the membership 
 cache

 Join node 3 (the same one, wa-test-server-ha-03, but ID differs):
 Mar 23 04:21:23 wa-test-server-ha-04 attrd[25962]: notice: 
 crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now 
 member (was (null))
 Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]:  warning: 
 crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - 
 a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
 Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]:  warning: 
 crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 
 6c18faa1-f8c2-4b0c-907c-20db450e2e79
 Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: crit: Node 1 
 and 3 share the same name 'wa-test-server-ha-03'
>>>
>>> It took me a while to understand the above combination of messages. This
>>> is not node 3 joining. This is node 1 joining after node 3 has already
>>> been seen.
>>
>> Hmmm...
>> corosync.conf and corosync-cmapctl both say it is 3
>> Also, cib lists it as 3 and lrmd puts its status records under 3.
> 
> I mean:
> 
> [XML mangled by the list archive; what remains legible: a node_state entry
> with crm-debug-origin="do_update_resource" in_ccm="true" join="member"
> expected="member", and a transient_attributes/instance_attributes section
> holding nvpairs master-rabbitmq-local="1", master-meta-0-0-drbd="1",
> master-staging-0-0-drbd="1", plus one more with value="1458732136".]
> 
>>
>> Actually the issue is that drbd resources are not promoted because their
>> master attributes go to the section with node-id 1. And that is the only
>> reason we found it. Everything not related to volatile attributes
>> works well.
>>
>>>
>>> The warnings are a complete dump of the peer cache. So you can see that
>>> wa-test-server-ha-03 is listed only once, with id 3.
>>>
>>> The critical message ("Node 1 and 3") lists the new id first and the
>>> found ID second. So id 1 is what it's trying to add to the cache.
>>
>> But there is also 'Node 'wa-test-server-ha-03' has changed its ID from 1 
>> to 3' -  it goes first. Does that matter?
>>
>>

Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-05-19 Thread Jehan-Guillaume de Rorthais
On Thu, 19 May 2016 13:15:20 -0500,
Ken Gaillot wrote:

> On 05/19/2016 11:43 AM, Jehan-Guillaume de Rorthais wrote:
>> On Thu, 19 May 2016 10:53:31 -0500,
>> Ken Gaillot wrote:
>> 
>>> A recent thread discussed a proposed new feature, a new environment
>>> variable that would be passed to resource agents, indicating whether a
>>> stop action was part of a recovery.
>>>
>>> Since that thread was long and covered a lot of topics, I'm starting a
>>> new one to focus on the core issue remaining:
>>>
>>> The original idea was to pass the number of restarts remaining before
>>> the resource will no longer be started on the same node. This
>>> involves calculating (fail-count - migration-threshold), and that
>>> implies certain limitations: (1) it will only be set when the cluster
>>> checks migration-threshold; (2) it will only be set for the failed
>>> resource itself, not for other resources that may be recovered due to
>>> dependencies on it.
>>>
>>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>>> new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
>>> scheduled after a stop on the same node in the same transition. This
>>> would avoid the corner cases of the previous approach; instead of being
>>> tied to migration-threshold, it would be set whenever a recovery was
>>> being attempted, for any reason. And with this approach, it should be
>>> easier to set the variable for all actions on the resource
>>> (demote/stop/start/promote), rather than just the stop.
>> 
>> I can see the value of having such a variable during various actions.
>> However, we can also deduce that the transition is a recovery during the
>> notify actions from the notify variables (the only information we lack is
>> the order of the actions). A more flexible approach would be to make sure
>> the notify variables are always available during the whole transition for
>> **all** actions, not just notify. It seems like that is already the case,
>> but a recent discussion emphasized that this is just a side effect of the
>> current implementation. I understand this as: they were sometimes available
>> outside of notifications "by accident".
> 
> It does seem that a recovery could be implied from the
> notify_{start,stop}_uname variables, but notify variables are only set
> for clones that support the notify action. I think the goal here is to
> work with any resource type. Even for clones, if they don't otherwise
> need notifications, they'd have to add the overhead of notify calls on
> all instances that would do nothing.

Exactly: notify variables are only available for clones at present. What I was
suggesting is that the notify variables should always be available, whether the
resource is a clone, a ms or a standard one.

And I wasn't saying the notify *action* should be activated all the time for
all resources. The notify switch for clones/ms could be kept at false by
default, so that the notify action itself is not called during transitions.

> > Also, I can see the benefit of having the remaining attempts for the current
> > action before hitting the migration-threshold. I might misunderstand
> > something here, but it seems to me both pieces of information are different.
> 
> I think the use cases that have been mentioned would all be happy with
> just the boolean. Does anyone need the actual count, or just whether
> this is a stop-start vs a full stop?

I was thinking of a use case where a graceful demote or stop action fails
multiple times, to give the RA a chance to choose another method to stop
the resource before it requires a migration. For instance, PostgreSQL has
three different kinds of stop, the last one not being graceful, but still
better than a kill -9.

> The problem with the migration-threshold approach is that there are
> recoveries that will be missed because they don't involve
> migration-threshold. If the count is really needed, the
> migration-threshold approach is necessary, but if recovery is the really
> interesting information, then a boolean would be more accurate.

I think I misunderstood the original use cases you are trying to achieve. It
seems to me we are talking about a different feature.

>> Basically, what we need is a better understanding of the transition itself
>> from the RA actions.
>> 
>> If you are still brainstorming on this, as a RA dev, what I would
>> suggest is:
>> 
>>   * provide and enforce the notify variables in all actions
>>   * add the actions order during the current transition to these variables
>> using eg. OCF_RESKEY_CRM_meta_notify_*_actionid
> 
> The action ID would be different for each node being acted on, so it
> would be more complicated (maybe *_actions="NODE1:ID1,NODE2:ID2,..."?).

Following the principle adopted for other variables, each ID would apply to the
corresponding resource and node in OCF_RESKEY_CRM_meta_notify_*_uname and
OCF_RESKEY_CRM_meta_notify_*_rsc.
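
As a reminder of that convention, the *_uname and *_resource notify variables
are parallel, space-separated lists, so an RA already walks them in lock step,
roughly like this (a generic sketch, not taken from any particular agent):

    i=1
    for rsc in $OCF_RESKEY_CRM_meta_notify_stop_resource; do
        node=$(echo "$OCF_RESKEY_CRM_meta_notify_stop_uname" | cut -d' ' -f"$i")
        # instance $rsc is scheduled to stop on $node in this transition
        i=$((i + 1))
    done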

> Also, RA writers would need to be aware that some actions may be
> initiated in parallel. Probably more complex than it's worth.
[...]

Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-05-19 Thread Ken Gaillot
On 05/19/2016 11:43 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 19 May 2016 10:53:31 -0500,
> Ken Gaillot wrote:
> 
>> A recent thread discussed a proposed new feature, a new environment
>> variable that would be passed to resource agents, indicating whether a
>> stop action was part of a recovery.
>>
>> Since that thread was long and covered a lot of topics, I'm starting a
>> new one to focus on the core issue remaining:
>>
>> The original idea was to pass the number of restarts remaining before
>> the resource will no longer be started on the same node. This
>> involves calculating (fail-count - migration-threshold), and that
>> implies certain limitations: (1) it will only be set when the cluster
>> checks migration-threshold; (2) it will only be set for the failed
>> resource itself, not for other resources that may be recovered due to
>> dependencies on it.
>>
>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>> new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
>> scheduled after a stop on the same node in the same transition. This
>> would avoid the corner cases of the previous approach; instead of being
>> tied to migration-threshold, it would be set whenever a recovery was
>> being attempted, for any reason. And with this approach, it should be
>> easier to set the variable for all actions on the resource
>> (demote/stop/start/promote), rather than just the stop.
> 
> I can see the value of having such a variable during various actions. However,
> we can also deduce that the transition is a recovery during the notify actions
> from the notify variables (the only information we lack is the order of the
> actions). A more flexible approach would be to make sure the notify variables
> are always available during the whole transition for **all** actions, not just
> notify. It seems like that is already the case, but a recent discussion
> emphasized that this is just a side effect of the current implementation. I
> understand this as: they were sometimes available outside of notifications
> "by accident".

It does seem that a recovery could be implied from the
notify_{start,stop}_uname variables, but notify variables are only set
for clones that support the notify action. I think the goal here is to
work with any resource type. Even for clones, if they don't otherwise
need notifications, they'd have to add the overhead of notify calls on
all instances that would do nothing.

> Also, I can see the benefit of having the remaining attempts for the current
> action before hitting the migration-threshold. I might misunderstand something
> here, but it seems to me both pieces of information are different.

I think the use cases that have been mentioned would all be happy with
just the boolean. Does anyone need the actual count, or just whether
this is a stop-start vs a full stop?

The problem with the migration-threshold approach is that there are
recoveries that will be missed because they don't involve
migration-threshold. If the count is really needed, the
migration-threshold approach is necessary, but if recovery is the really
interesting information, then a boolean would be more accurate.

> Basically, what we need is a better understanding of the transition itself
> from the RA actions.
> 
> If you are still brainstorming on this, as a RA dev, what I would
> suggest is:
> 
>   * provide and enforce the notify variables in all actions
>   * add the actions order during the current transition to these variables 
> using
> eg. OCF_RESKEY_CRM_meta_notify_*_actionid

The action ID would be different for each node being acted on, so it
would be more complicated (maybe *_actions="NODE1:ID1,NODE2:ID2,..."?).
Also, RA writers would need to be aware that some actions may be
initiated in parallel. Probably more complex than it's worth.

>   * add a new variable with the remaining action attempts before migration.
> This one has the advantage of surviving the transition breakage when a
> failure occurs.
> 
> As a second step, we could provide some helper functions in
> ocf_shellfuncs (and in my Perl module equivalent) to compute whether the
> transition is a switchover, a failover, a recovery, etc., based on the notify
> variables.
> 
> Presently, I am detecting such scenarios directly in my RA during the notify
> actions and tracking them as private attributes to be aware of the situation 
> during the real actions (demote and stop). See:
> 
> https://github.com/dalibo/PAF/blob/952cb3cf2f03aad18fbeafe3a91f997a56c3b606/script/pgsqlms#L95
> 
> Regards,
> 




Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-05-19 Thread Jehan-Guillaume de Rorthais
On Thu, 19 May 2016 10:53:31 -0500,
Ken Gaillot wrote:

> A recent thread discussed a proposed new feature, a new environment
> variable that would be passed to resource agents, indicating whether a
> stop action was part of a recovery.
> 
> Since that thread was long and covered a lot of topics, I'm starting a
> new one to focus on the core issue remaining:
> 
> The original idea was to pass the number of restarts remaining before
> the resource will no longer be started on the same node. This
> involves calculating (fail-count - migration-threshold), and that
> implies certain limitations: (1) it will only be set when the cluster
> checks migration-threshold; (2) it will only be set for the failed
> resource itself, not for other resources that may be recovered due to
> dependencies on it.
> 
> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> forgot to cc the list on my reply, so I'll summarize now: We would set a
> new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
> scheduled after a stop on the same node in the same transition. This
> would avoid the corner cases of the previous approach; instead of being
> tied to migration-threshold, it would be set whenever a recovery was
> being attempted, for any reason. And with this approach, it should be
> easier to set the variable for all actions on the resource
> (demote/stop/start/promote), rather than just the stop.

I can see the value of having such a variable during various actions. However, we
can also deduce that the transition is a recovery during the notify actions from
the notify variables (the only information we lack is the order of the actions).
A more flexible approach would be to make sure the notify variables are always
available during the whole transition for **all** actions, not just notify. It
seems like that is already the case, but a recent discussion emphasized that this
is just a side effect of the current implementation. I understand this as: they
were sometimes available outside of notifications "by accident".

Also, I can see the benefit of having the remaining attempts for the current
action before hitting the migration-threshold. I might misunderstand something
here, but it seems to me both pieces of information are different.

Basically, what we need is a better understanding of the transition itself
from the RA actions.

If you are still brainstorming on this, as a RA dev, what I would
suggest is:

  * provide and enforce the notify variables in all actions
  * add the actions order during the current transition to these variables using
eg. OCF_RESKEY_CRM_meta_notify_*_actionid
  * add a new variable with the remaining action attempts before migration. This
one has the advantage of surviving the transition breakage when a failure occurs.

As a second step, we could provide some helper functions in ocf_shellfuncs
(and in my Perl module equivalent) to compute whether the transition is a
switchover, a failover, a recovery, etc., based on the notify variables.

Presently, I am detecting such scenarios directly in my RA during the notify
actions and tracking them as private attributes to be aware of the situation 
during the real actions (demote and stop). See:

https://github.com/dalibo/PAF/blob/952cb3cf2f03aad18fbeafe3a91f997a56c3b606/script/pgsqlms#L95
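
The idea, very roughly, looks like this (a simplified sketch, not PAF's actual
code; the private attribute name is made up):

    notify() {
        if [ "$OCF_RESKEY_CRM_meta_notify_type" = "pre" ] \
           && [ "$OCF_RESKEY_CRM_meta_notify_operation" = "stop" ]; then
            me=$(crm_node -n)
            recover=false
            case " $OCF_RESKEY_CRM_meta_notify_stop_uname " in
                *" $me "*)
                    case " $OCF_RESKEY_CRM_meta_notify_start_uname " in
                        # stopped and started on this node: a local recovery
                        *" $me "*) recover=true ;;
                    esac ;;
            esac
            # remember it as a private node attribute for the later stop/demote
            attrd_updater -p -n my_ra_recovering -U "$recover"
        fi
        return $OCF_SUCCESS
    }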

Regards,



[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-05-19 Thread Ken Gaillot
A recent thread discussed a proposed new feature, a new environment
variable that would be passed to resource agents, indicating whether a
stop action was part of a recovery.

Since that thread was long and covered a lot of topics, I'm starting a
new one to focus on the core issue remaining:

The original idea was to pass the number of restarts remaining before
the resource will no longer be started on the same node. This
involves calculating (fail-count - migration-threshold), and that
implies certain limitations: (1) it will only be set when the cluster
checks migration-threshold; (2) it will only be set for the failed
resource itself, not for other resources that may be recovered due to
dependencies on it.

Ulrich Windl proposed an alternative: setting a boolean value instead. I
forgot to cc the list on my reply, so I'll summarize now: We would set a
new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
scheduled after a stop on the same node in the same transition. This
would avoid the corner cases of the previous approach; instead of being
tied to migration-threshold, it would be set whenever a recovery was
being attempted, for any reason. And with this approach, it should be
easier to set the variable for all actions on the resource
(demote/stop/start/promote), rather than just the stop.
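
For a concrete picture of how an RA might consume it (just a sketch; the
variable name follows the proposal above and nothing is implemented yet, and
my_fast_stop/my_graceful_stop are placeholders):

    stop() {
        if [ "${OCF_RESKEY_CRM_recovery:-false}" = "true" ]; then
            # a start is scheduled after this stop on the same node in the
            # same transition, so a quicker local shutdown is acceptable
            my_fast_stop
        else
            # full stop: the resource is not coming back here in this transition
            my_graceful_stop
        fi
    }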

I think the boolean approach fits all the envisioned use cases that have
been discussed. Any objections to going that route instead of the count?
-- 
Ken Gaillot 



Re: [ClusterLabs] Node attributes

2016-05-19 Thread Ken Gaillot
On 05/18/2016 10:49 PM, H Yavari wrote:
> Hi,
> 
> How can I define a constraint for two resource based on one nodes
> attribute?
> 
> For example resource X and Y are co-located based on node attribute Z.
> 
> 
> 
> Regards,
> H.Yavari

Hi,

See
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#idm140617356537136

High-level tools such as pcs and crm provide a simpler interface, but
the concepts will be the same.

This works for location constraints, not colocation, but you can easily
accomplish what you want. If your goal is that X and Y each can only run
on a node with attribute Z, then set up a location constraint for each
one using the appropriate rule. If your goal is that X and Y must be
colocated together, on a node with attribute Z, then set up a regular
colocation constraint between them, and a location constraint for one of
them with the appropriate rule; or, put them in a group, and set up a
location constraint for the group with the appropriate rule.
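
For example, in crm shell syntax (resource and attribute names taken from the
question; adapt the rule to however Z is actually set on your nodes):

    # X and Y may each run only where node attribute Z is defined
    location loc-X-needs-Z X rule -inf: not_defined Z
    location loc-Y-needs-Z Y rule -inf: not_defined Z

    # or: keep X and Y together and tie only one of them to attribute Z
    colocation col-Y-with-X inf: Y X
    location loc-X-needs-Z X rule -inf: not_defined Z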



[ClusterLabs] Antw: Pacemaker restart resources when node joins cluster after failback

2016-05-19 Thread Ulrich Windl
>>> Dharmesh  wrote on 19.05.2016 at 13:18 in
>>> message:
> Hi,
> 
> I have a 2-node Debian cluster with resources configured in it.
> Everything is working fine apart from one thing.

Usually you find the reasons in the logs (syslog, cluster log, etc.).
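
For example, crm_simulate can show why the cluster scheduled the restarts
(assuming a pengine input file from the relevant transition is still around;
replace NN accordingly):

    # on the DC: show placement scores and the actions the cluster would take now
    crm_simulate -sL

    # or replay the transition that restarted the resources
    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-NN.bz2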

> 
> Whenever one of my two nodes rejoins the cluster, the resources configured
> on the currently active node get restarted. I am not able to figure out why
> the cluster is behaving like this.
> 
> Below is the configuration of my cluster
> 
> node $id="775bad88-0954-40bf-b9e4-4f012a76a34c" testsrv2 \
> attributes standby="off"
> node $id="b1d07507-6191-425c-bee3-14229c85820f" testsrv1 \
> attributes standby="off"
> primitive ClusterIp ocf:heartbeat:IPaddr2 \
> params ip="192.168.120.209" nic="eth0" cidr_netmask="24" \
> op monitor start-delay="0" interval="30" \
> meta target-role="started"
> primitive DBClusterIp ocf:heartbeat:IPaddr2 \
> params ip="192.168.120.210" nic="eth0" cidr_netmask="24" \
> op monitor interval="30" start-delay="0" \
> meta target-role="started"
> primitive Postgres-9.3 lsb:postgres-9.3-openscg \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor interval="15" timeout="15" start-delay="15" \
> meta target-role="started" migration-threshold="1"
> primitive PowerDns lsb:pdns \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor interval="15" timeout="15" start-delay="15" \
> meta target-role="started" migration-threshold="2"
> primitive PsqlMasterToStandby ocf:heartbeat:PsqlMasterToStandby \
> op start interval="0" timeout="20" start-delay="10" \
> op monitor interval="10" timeout="240" start-delay="10" \
> op stop interval="0" timeout="20" \
> meta target-role="started"
> primitive PsqlPromote ocf:heartbeat:PsqlPromote \
> op start interval="0" timeout="20" \
> op stop interval="0" timeout="20" \
> op monitor interval="10" timeout="20" start-delay="10" \
> meta target-role="started"
> group Database Postgres-9.3 PsqlPromote
> colocation col_Database_DBClusterIp inf: Database DBClusterIp
> colocation col_Database_PsqlMasterToStandby inf: Database
> PsqlMasterToStandby
> colocation col_PowerDns_ClusterIp inf: PowerDns ClusterIp
> order ord_Database_DBClusterIp inf: Database DBClusterIp
> order ord_Database_PsqlMasterToStandby inf: Database PsqlMasterToStandby
> order ord_PowerDns_ClusterIp inf: PowerDns ClusterIp
> property $id="cib-bootstrap-options" \
> stonith-enabled="false" \
> dc-version="1.1.10-42f2063" \
> cluster-infrastructure="heartbeat" \
> last-lrm-refresh="1453192778"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100" \
> failure-timeout="60s"
> #vim:set syntax=pcmk
> 
> Let me know if my configuration is not appropriate or some new
> configuration needs to be done.
> 
> Thanks and regards,
> 
> -- 
> Dharmesh Kumar







[ClusterLabs] Pacemaker restart resources when node joins cluster after failback

2016-05-19 Thread Dharmesh
Hi,

I have a 2-node Debian cluster with resources configured in it.
Everything is working fine apart from one thing.

Whenever one of my two nodes rejoins the cluster, the resources configured
on the currently active node get restarted. I am not able to figure out why
the cluster is behaving like this.

Below is the configuration of my cluster

node $id="775bad88-0954-40bf-b9e4-4f012a76a34c" testsrv2 \
attributes standby="off"
node $id="b1d07507-6191-425c-bee3-14229c85820f" testsrv1 \
attributes standby="off"
primitive ClusterIp ocf:heartbeat:IPaddr2 \
params ip="192.168.120.209" nic="eth0" cidr_netmask="24" \
op monitor start-delay="0" interval="30" \
meta target-role="started"
primitive DBClusterIp ocf:heartbeat:IPaddr2 \
params ip="192.168.120.210" nic="eth0" cidr_netmask="24" \
op monitor interval="30" start-delay="0" \
meta target-role="started"
primitive Postgres-9.3 lsb:postgres-9.3-openscg \
op start interval="0" timeout="15" \
op stop interval="0" timeout="15" \
op monitor interval="15" timeout="15" start-delay="15" \
meta target-role="started" migration-threshold="1"
primitive PowerDns lsb:pdns \
op start interval="0" timeout="15" \
op stop interval="0" timeout="15" \
op monitor interval="15" timeout="15" start-delay="15" \
meta target-role="started" migration-threshold="2"
primitive PsqlMasterToStandby ocf:heartbeat:PsqlMasterToStandby \
op start interval="0" timeout="20" start-delay="10" \
op monitor interval="10" timeout="240" start-delay="10" \
op stop interval="0" timeout="20" \
meta target-role="started"
primitive PsqlPromote ocf:heartbeat:PsqlPromote \
op start interval="0" timeout="20" \
op stop interval="0" timeout="20" \
op monitor interval="10" timeout="20" start-delay="10" \
meta target-role="started"
group Database Postgres-9.3 PsqlPromote
colocation col_Database_DBClusterIp inf: Database DBClusterIp
colocation col_Database_PsqlMasterToStandby inf: Database
PsqlMasterToStandby
colocation col_PowerDns_ClusterIp inf: PowerDns ClusterIp
order ord_Database_DBClusterIp inf: Database DBClusterIp
order ord_Database_PsqlMasterToStandby inf: Database PsqlMasterToStandby
order ord_PowerDns_ClusterIp inf: PowerDns ClusterIp
property $id="cib-bootstrap-options" \
stonith-enabled="false" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="heartbeat" \
last-lrm-refresh="1453192778"
rsc_defaults $id="rsc-options" \
resource-stickiness="100" \
failure-timeout="60s"
#vim:set syntax=pcmk

Let me know if my configuration is not appropriate or some new
configuration needs to be done.

Thanks and regards,

-- 
Dharmesh Kumar