Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-28 Thread Klaus Wenninger
On 09/29/2016 05:57 AM, Andrew Beekhof wrote:
> On Mon, Sep 26, 2016 at 7:39 PM, Klaus Wenninger  wrote:
>> On 09/24/2016 01:12 AM, Ken Gaillot wrote:
>>> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
 On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot wrote:

 On 09/22/2016 09:53 AM, Jan Pokorný wrote:
 > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
 >> Ken Gaillot <kgail...@redhat.com> writes:
 >>
 >>> I'm not saying it's a bad idea, just that it's more complicated than it
 >>> first sounds, so it's worth thinking through the implications.
 >>
 >> Thinking about it and looking at how complicated it gets, maybe what
 >> you'd really want, to make it clearer for the user, is the ability to
 >> explicitly configure the behavior, either globally or per-resource. So
 >> instead of having to tweak a set of variables that interact in complex
 >> ways, you'd configure something like rule expressions,
 >>
 >> [escalation rule XML example not preserved in the archive; see the sketch below]
 >>
 >> So, try to restart the service 3 times, if that fails migrate the
 >> service, if it still fails, fence the node.
 >>
 >> (obviously the details and XML syntax are just an example)
 >>
 >> This would then replace on-fail, migration-threshold, etc.
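A minimal sketch of the kind of escalation block being described here; the element
and attribute names are purely illustrative, not existing Pacemaker syntax:

    <escalation id="db-escalation">
      <!-- first try an in-place restart, up to three times -->
      <try recovery="restart" attempts="3"/>
      <!-- then move the service to another node -->
      <try recovery="migrate" attempts="1"/>
      <!-- if it still fails, fence the node -->
      <try recovery="fence"/>
    </escalation>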
 >
 > I must admit that in previous emails in this thread, I wasn't able to
 > follow during the first pass, which is not the case with this procedural
 > (sequence-ordered) approach.  Though someone can argue it doesn't take
 > type of operation into account, which might again open the door for
 > non-obvious interactions.

 "restart" is the only on-fail value that it makes sense to escalate.

 block/stop/fence/standby are final. Block means "don't touch the
 resource again", so there can't be any further response to failures.
 Stop/fence/standby move the resource off the local node, so failure
 handling is reset (there are 0 failures on the new node to begin with).

 "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
 then migrate", but I can't think of a real-world situation where that
 makes sense,


 really?

 it is not uncommon to hear "i know its failed, but i dont want the
 cluster to do anything until its _really_ failed"
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>>
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>>>
>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
>> That is the reason why I suggested thinking of a solution that
>> exposes a certain number of statistics in environment
>> variables and leaves the final logic to be scripted in the RA
>> or an additional script.
> I don't think you want to go down that path.
> Otherwise you'll end up re-implementing parts of the PE in the agents.
>
> They'll want to know which nodes are available, what their scores are,
> what other services are on them, how many times have things failed
> there, etc etc. It will be never-ending
Rather replacing one possibly never ending wish-list by the other ;-)
>
 and it would be a significant re-implementation of "ignore"
 (which currently ignores the state of having failed, as opposed to a
 particular instance of failure).


 agreed



 What the interface needs to express is: "If this operation fails,
 optionally try a soft recovery [always stop+start], but if N failures
 occur on the same node, proceed to a [configurable] hard recovery".

 And of course the interface will need to be different depending on how
 certain details are decided, e.g. whether any failures count toward N
 or just failures of one particular operation type, and whether the hard
 recovery type can vary depending on what operation failed.
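One way such an interface might look per operation; the failures-before-hard-recovery
and hard-recovery attributes are hypothetical sketches, only on-fail exists today:

    <op id="db-monitor-10s" name="monitor" interval="10s"
        on-fail="restart"
        failures-before-hard-recovery="3"
        hard-recovery="ban"/>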

Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-28 Thread Andrew Beekhof
On Mon, Sep 26, 2016 at 7:39 PM, Klaus Wenninger  wrote:
> On 09/24/2016 01:12 AM, Ken Gaillot wrote:
>> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>>
>>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot wrote:
>>>
>>> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>>> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>>> >> Ken Gaillot <kgail...@redhat.com> writes:
>>> >>
>>> >>> I'm not saying it's a bad idea, just that it's more complicated than it
>>> >>> first sounds, so it's worth thinking through the implications.
>>> >>
>>> >> Thinking about it and looking at how complicated it gets, maybe what
>>> >> you'd really want, to make it clearer for the user, is the ability to
>>> >> explicitly configure the behavior, either globally or per-resource. So
>>> >> instead of having to tweak a set of variables that interact in complex
>>> >> ways, you'd configure something like rule expressions,
>>> >>
>>> >> [escalation rule XML example not preserved in the archive]
>>> >>
>>> >> So, try to restart the service 3 times, if that fails migrate the
>>> >> service, if it still fails, fence the node.
>>> >>
>>> >> (obviously the details and XML syntax are just an example)
>>> >>
>>> >> This would then replace on-fail, migration-threshold, etc.
>>> >
>>> > I must admit that in previous emails in this thread, I wasn't able to
>>> > follow during the first pass, which is not the case with this procedural
>>> > (sequence-ordered) approach.  Though someone can argue it doesn't take
>>> > type of operation into account, which might again open the door for
>>> > non-obvious interactions.
>>>
>>> "restart" is the only on-fail value that it makes sense to escalate.
>>>
>>> block/stop/fence/standby are final. Block means "don't touch the
>>> resource again", so there can't be any further response to failures.
>>> Stop/fence/standby move the resource off the local node, so failure
>>> handling is reset (there are 0 failures on the new node to begin with).
>>>
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "i know its failed, but i dont want the
>>> cluster to do anything until its _really_ failed"
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
>>
>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
> That is the reason why I suggested thinking of a solution that
> exposes a certain number of statistics in environment
> variables and leaves the final logic to be scripted in the RA
> or an additional script.

I don't think you want to go down that path.
Otherwise you'll end up re-implementing parts of the PE in the agents.

They'll want to know which nodes are available, what their scores are,
what other services are on them, how many times have things failed
there, etc etc. It will be never-ending


>>> and it would be a significant re-implementation of "ignore"
>>> (which currently ignores the state of having failed, as opposed to a
>>> particular instance of failure).
>>>
>>>
>>> agreed
>>>
>>>
>>>
>>> What the interface needs to express is: "If this operation fails,
>>> optionally try a soft recovery [always stop+start], but if N failures
>>> occur on the same node, proceed to a [configurable] hard recovery".
>>>
>>> And of course the interface will need to be different depending on how
>>> certain details are decided, e.g. whether any failures count toward N
>>> or just failures of one particular operation type, and whether the hard
>>> recovery type can vary depending on what operation failed.

Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-28 Thread Andrew Beekhof
On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot  wrote:
> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>
>>
>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot wrote:
>>
>> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>> >> Ken Gaillot <kgail...@redhat.com> writes:
>> >>
>> >>> I'm not saying it's a bad idea, just that it's more complicated than it
>> >>> first sounds, so it's worth thinking through the implications.
>> >>
>> >> Thinking about it and looking at how complicated it gets, maybe what
>> >> you'd really want, to make it clearer for the user, is the ability to
>> >> explicitly configure the behavior, either globally or per-resource. So
>> >> instead of having to tweak a set of variables that interact in complex
>> >> ways, you'd configure something like rule expressions,
>> >>
>> >> [escalation rule XML example not preserved in the archive]
>> >>
>> >> So, try to restart the service 3 times, if that fails migrate the
>> >> service, if it still fails, fence the node.
>> >>
>> >> (obviously the details and XML syntax are just an example)
>> >>
>> >> This would then replace on-fail, migration-threshold, etc.
>> >
>> > I must admit that in previous emails in this thread, I wasn't able to
>> > follow during the first pass, which is not the case with this procedural
>> > (sequence-ordered) approach.  Though someone can argue it doesn't take
>> > type of operation into account, which might again open the door for
>> > non-obvious interactions.
>>
>> "restart" is the only on-fail value that it makes sense to escalate.
>>
>> block/stop/fence/standby are final. Block means "don't touch the
>> resource again", so there can't be any further response to failures.
>> Stop/fence/standby move the resource off the local node, so failure
>> handling is reset (there are 0 failures on the new node to begin with).
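For comparison, the behaviour discussed above is what migration-threshold plus the
per-operation on-fail attribute already express today; a minimal sketch, with the
resource name and values as placeholders:

    <primitive id="db" class="ocf" provider="heartbeat" type="Dummy">
      <meta_attributes id="db-meta">
        <!-- after 3 failures on a node, move the resource elsewhere -->
        <nvpair id="db-migration-threshold" name="migration-threshold" value="3"/>
      </meta_attributes>
      <operations>
        <!-- on monitor failure, first attempt a local stop+start -->
        <op id="db-monitor-10s" name="monitor" interval="10s" on-fail="restart"/>
      </operations>
    </primitive>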
>>
>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>> then migrate", but I can't think of a real-world situation where that
>> makes sense,
>>
>>
>> really?
>>
>> it is not uncommon to hear "i know its failed, but i dont want the
>> cluster to do anything until its _really_ failed"
>
> Hmm, I guess that would be similar to how monitoring systems such as
> nagios can be configured to send an alert only if N checks in a row
> fail. That's useful where transient outages (e.g. a webserver hitting
> its request limit) are acceptable for a short time.
>
> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
> is not "in a row" but "since the count was last cleared".

It would be a major change, but perhaps it should be "in-a-row" and
successfully performing the action clears the count.
Its entirely possible that the current behaviour is like that because
I wasn't smart enough to implement anything else at the time :-)
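For reference, the current "since the count was last cleared" counter can be inspected
and reset from the shell; a sketch, with resource and node names as placeholders:

    # show the accumulated fail counts per resource and node
    crm_mon --one-shot --failcounts
    # reset the count for one resource on one node (what a cleanup does)
    crm_resource --cleanup --resource my_rsc --node node1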

>
> "Ignore up to three monitor failures if they occur in a row [or, within
> 10 minutes?], then try soft recovery for the next two monitor failures,
> then ban this node for the next monitor failure." Not sure being able to
> say that is worth the complexity.

Not disagreeing

>
>>
>> and it would be a significant re-implementation of "ignore"
>> (which currently ignores the state of having failed, as opposed to a
>> particular instance of failure).
>>
>>
>> agreed
>>
>>
>>
>> What the interface needs to express is: "If this operation fails,
>> optionally try a soft recovery [always stop+start], but if N failures
>> occur on the same node, proceed to a [configurable] hard recovery".
>>
>> And of course the interface will need to be different depending on how
>> certain details are decided, e.g. whether any failures count toward N
>> or just failures of one particular operation type, and whether the hard
>> recovery type can vary depending on what operation failed.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-28 Thread Ken Gaillot
On 09/28/2016 03:57 PM, Scott Greenlese wrote:
> A quick addendum...
> 
> After sending this post, I decided to stop pacemaker on the single,
> Online node in the cluster,
> and this effectively killed the corosync daemon:
> 
> [root@zs93kl VD]# date;pcs cluster stop
> Wed Sep 28 16:39:22 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.

> [root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
> Wed Sep 28 16:46:19 EDT 2016

Totally irrelevant, but a little trick I picked up somewhere: when
grepping for a process, square-bracketing a character lets you avoid the
"grep -v", e.g. "ps -ef | grep cor[o]"

It's nice when I remember to use it ;)

> [root@zs93kl VD]#
> 
> 
> 
> Next, I went to a node in "Pending" state, and sure enough... the pcs
> cluster stop killed the daemon there, too:
> 
> [root@zs95kj VD]# date;pcs cluster stop
> Wed Sep 28 16:48:15 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
> 
> [root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
> Wed Sep 28 16:48:38 EDT 2016
> [root@zs95kj VD]#
> 
> So, this answers my own question... cluster stop should kill corosync.
> So, why isn't `pcs cluster stop --all` killing corosync?

It should. At least you've narrowed it down :)
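One way to double-check after a global stop is to ask every node whether either daemon
is still active; a sketch, assuming systemd and using host names from this thread:

    # verify both daemons are really down after 'pcs cluster stop --all'
    for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1; do
        ssh "$host" 'hostname; systemctl is-active corosync pacemaker'
    done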

> Thanks...
> 
> 
> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
> INTERNET: swgre...@us.ibm.com
> 
> 
> 
> 
> From: Scott Greenlese/Poughkeepsie/IBM
> To: kgail...@redhat.com, Cluster Labs - All topics related to
> open-source clustering welcomed 
> Date: 09/28/2016 04:30 PM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
> 
> 
> 
> 
> Hi folks..
> 
> I have some follow-up questions about corosync daemon status after
> cluster shutdown.
> 
> Basically, what should happen to corosync on a cluster node when
> pacemaker is shutdown on that node?
> On my 5 node cluster, when I do a global shutdown, the pacemaker
> processes exit, but corosync processes remain active.
> 
> Here's an example of where this led me into some trouble...
> 
> My cluster is still configured to use the "symmetric" resource
> distribution. I don't have any location constraints in place, so
> pacemaker tries to evenly distribute resources across all Online nodes.
> 
> With one cluster node (KVM host) powered off, I did the global cluster
> stop:
> 
> [root@zs90KP VD]# date;pcs cluster stop --all
> Wed Sep 28 15:07:40 EDT 2016
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
> zs90kppcs1: Stopping Cluster (pacemaker)...
> zs95KLpcs1: Stopping Cluster (pacemaker)...
> zs95kjpcs1: Stopping Cluster (pacemaker)...
> zs93kjpcs1: Stopping Cluster (pacemaker)...
> Error: unable to stop all nodes
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
> 
> Note: The "No route to host" messages are expected because that node /
> LPAR is powered down.
> 
> (I don't show it here, but the corosync daemon is still running on the 4
> active nodes. I do show it later).
> 
> I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
> have quorum when it comes up and activates
> pacemaker, which is enabled to autostart at boot time on all 5 cluster
> nodes. At this point, only 1 out of 5
> nodes should be Online to the cluster, and therefore ... no quorum.
> 
> I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
> Online, and "partition with quorum":

Corosync determines quorum, pacemaker just uses it. If corosync is
running, the node contributes to quorum.
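A quick way to see that split of responsibilities is to compare what corosync itself
reports with what pacemaker derives from it; a sketch using the standard tools:

    # membership and quorum as corosync sees them
    corosync-quorumtool -s
    # the cluster view pacemaker builds on top of that
    crm_mon -1 | head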

> [root@zs93kl ~]# date;pcs status |less
> Wed Sep 28 15:25:13 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 28 15:25:13 2016 Last change: Mon Sep 26 16:15:08
> 2016 by root via crm_resource on zs95kjpcs1
> Stack: corosync
> Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
> 
> Node zs90kppcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Node zs95kjpcs1: pending
> Online: [ zs93KLpcs1 ]
> 
> Full list of resources:
> 
> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> .
> .
> .
> 
> 
> Here you can see that corosync is up on all 5 nodes:
> 
> [root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
> zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync
> |grep -v grep"; done
> Wed Sep 28 15:22:21 EDT 2016
> zs90KP
> root 155374 1 0 Sep26 ? 00:10:17 corosync
> zs95KL
> root 22933 1 0 11:51 ? 00:00:54 corosync

Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource

2016-09-28 Thread Ken Gaillot
On 09/28/2016 04:04 PM, Christopher Harvey wrote:
> My corosync/pacemaker logs are seeing a bunch of messages like the
> following:
> 
> Sep 22 14:50:36 [1346] node-132-60   crmd: info:
> action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
> exited with rc=4

This is the (unmodified) exit status of the process, so the resource
agent must be returning "4" for some reason. Normally, that is used to
indicate "insufficient privileges".

> Sep 22 14:50:36 [1346] node-132-60   crmd:error:
> generic_get_metadata:   Failed to retrieve meta-data for
> ocf:acme:MsgBB-Active
> Sep 22 14:50:36 [1346] node-132-60   crmd:  warning:
> get_rsc_metadata:   No metadata found for MsgBB-Active::ocf:acme:
> Input/output error (-5)
> Sep 22 14:50:36 [1346] node-132-60   crmd:error:
> build_operation_update: No metadata for acme::ocf:MsgBB-Active
> Sep 22 14:50:36 [1346] node-132-60   crmd:   notice:
> process_lrm_event:  Operation MsgBB-Active_start_0: ok
> (node=node-132-60, call=25, rc=0, cib-update=27, confirmed=true)
> 
> I am able to run the meta-data command on the command line:

I would suspect that your user account has some privileges that the lrmd
user (typically hacluster:haclient) doesn't have. Try "su - hacluster"
first and see if it's any different. Maybe directory or file
permissions, or SELinux?
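A sketch of reproducing the lrmd's view of the agent, using the user and agent path
from this thread (adjust if your cluster user differs):

    # run the meta-data action as the cluster user instead of root
    su - hacluster -s /bin/sh -c \
        '/lib/ocf/resource.d/acme/MsgBB-Active meta-data; echo "rc=$?"'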

> node-132-43 # /lib/ocf/resource.d/acme/MsgBB-Active meta-data
> [meta-data XML output not preserved in the archive]
> node-132-43 # echo $?
> 0
> 
> Resource code here:
> #! /bin/bash
> 
> ###
> # Initialization:
> 
> : ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
> . ${OCF_FUNCTIONS}
> : ${__OCF_ACTION=$1}
> 
> ###
> 
> meta_data()
> {
> cat <<END
> [meta-data XML not preserved in the archive; see the minimal example after the script]
> END
> }
> 
> # don't exit on TERM, to test that lrmd makes sure that we do exit
> trap sigterm_handler TERM
> sigterm_handler() {
> ocf_log info "They use TERM to bring us down. No such luck."
> return
> }
> 
> msgbb_usage() {
> cat <<END
> usage: $0 {start|stop|monitor|validate-all|meta-data}
> 
> Expects to have a fully populated OCF RA-compliant environment set.
> END
> }
> 
> msgbb_monitor() {
> # trimmed.
> }
> 
> msgbb_stop() {
> # trimmed.
> }
> 
> msgbb_start() {
> # trimmed.
> }
> 
> msgbb_validate() {
> # trimmed.
> }
> 
> case $__OCF_ACTION in
> meta-data)  meta_data
> exit $OCF_SUCCESS
> ;;
> start)  msgbb_start;;
> stop)   msgbb_stop;;
> monitor)    msgbb_monitor;;
> reload) ocf_log err "Reloading..."
> msgbb_start
> ;;
> validate-all)   msgbb_validate;;
> usage|help) msgbb_usage
> exit $OCF_SUCCESS
> ;;
> *)  msgbb_usage
> exit $OCF_ERR_UNIMPLEMENTED
> ;;
> esac
> rc=$?
> ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
> exit $rc
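For reference, a minimal well-formed OCF meta-data body of the kind the stripped
heredoc above would emit; the structure follows the OCF RA API, but the action
timeouts here are assumptions:

    <?xml version="1.0"?>
    <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
    <resource-agent name="MsgBB-Active">
      <version>1.0</version>
      <longdesc lang="en">MsgBB-Active resource (long desc)</longdesc>
      <shortdesc lang="en">MsgBB-Active resource</shortdesc>
      <parameters/>
      <actions>
        <action name="start" timeout="20s"/>
        <action name="stop" timeout="20s"/>
        <action name="monitor" timeout="20s" interval="10s"/>
        <action name="validate-all" timeout="20s"/>
        <action name="meta-data" timeout="5s"/>
      </actions>
    </resource-agent>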
> 
> 
> Thanks,
> Chris

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Failed to retrieve meta-data for custom ocf resource

2016-09-28 Thread Christopher Harvey
My corosync/pacemaker logs are seeing a bunch of messages like the
following:

Sep 22 14:50:36 [1346] node-132-60   crmd: info:
action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
exited with rc=4
Sep 22 14:50:36 [1346] node-132-60   crmd:error:
generic_get_metadata:   Failed to retrieve meta-data for
ocf:acme:MsgBB-Active
Sep 22 14:50:36 [1346] node-132-60   crmd:  warning:
get_rsc_metadata:   No metadata found for MsgBB-Active::ocf:acme:
Input/output error (-5)
Sep 22 14:50:36 [1346] node-132-60   crmd:error:
build_operation_update: No metadata for acme::ocf:MsgBB-Active
Sep 22 14:50:36 [1346] node-132-60   crmd:   notice:
process_lrm_event:  Operation MsgBB-Active_start_0: ok
(node=node-132-60, call=25, rc=0, cib-update=27, confirmed=true)

I am able to run the meta-data command on the command line:

node-132-43 # /lib/ocf/resource.d/acme/MsgBB-Active meta-data
[meta-data XML output not preserved in the archive]
node-132-43 # echo $?
0

Resource code here:
#! /bin/bash

###
# Initialization:

: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}

###

meta_data()
{
cat <<END
[meta-data XML not preserved in the archive; see the minimal example in the reply above]
END
}

# don't exit on TERM, to test that lrmd makes sure that we do exit
trap sigterm_handler TERM
sigterm_handler() {
ocf_log info "They use TERM to bring us down. No such luck."
return
}

msgbb_usage() {
cat 

Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-28 Thread Scott Greenlese

A quick addendum...

After sending this post, I decided to stop pacemaker on the single, Online
node in the cluster,
and this effectively killed the corosync daemon:

[root@zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...


[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016
[root@zs93kl VD]#



Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root@zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root@zs95kj VD]#

So, this answers my own question...  cluster stop should kill corosync.
So, why isn't `pcs cluster stop --all` killing corosync?

Thanks...


Scott Greenlese ... IBM KVM on System Z Test,  Poughkeepsie, N.Y.
  INTERNET:  swgre...@us.ibm.com





From:   Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed 
Date:   09/28/2016 04:30 PM
Subject:Re: [ClusterLabs] Pacemaker quorum behavior


Hi folks..

I have some follow-up questions about corosync daemon status after cluster
shutdown.

Basically, what should happen to corosync on a cluster node when pacemaker
is shutdown on that node?
On my 5 node cluster, when I do a global shutdown, the pacemaker processes
exit, but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution.   I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note:  The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered on the one zs93KLpcs1 LPAR,  so in theory I should not have
quorum when it comes up and activates
pacemaker, which is enabled to autostart at boot time on all 5 cluster
nodes.  At this point, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.

I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
Online, and "partition with quorum":

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016  Last change: Mon Sep 26
16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

 zs95kjg109062_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep
-v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374  1  0 Sep26 ?00:10:17 corosync
zs95KL
root  22933  1  0 11:51 ?00:00:54 corosync
zs95kj
root  19382  1  0 Sep26 ?00:10:15 corosync
zs93kj
root 129102  1  0 Sep26 ?00:12:10 corosync
zs93kl
root  21894  1  0 15:19 ?00:00:00 corosync


But, pacemaker is only running on the one, online node:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd |
grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root  23005  1  0 15:19 ?00:00:00 /usr/sbin/pacemakerd -f
You have new mail in /var/spool/mail/root
[root@zs95kj VD]#


This situation wreaks havoc on my VirtualDomain resources, as the majority
of them are in FAILED or Stopped state, and to my
surprise... many of them show as Started:

[root@zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
 zs95kjg109062_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res  (ocf::heartbeat:VirtualDomain)

Re: [ClusterLabs] Failover debugging

2016-09-28 Thread Evan Rinaldo
Thanks for the information. We are actually a few revs behind unfortunately.

Thanks

On Wed, Sep 28, 2016 at 12:38 AM, Klaus Wenninger 
wrote:

> On 09/28/2016 03:13 AM, Evan Rinaldo wrote:
> > Is it possible to trigger the blackbox recorder or even a crm_report
> > on a failover event.  I know these can be triggered manually but I
> > wasn't sure if there was an option in pacemaker that would trigger
> > these on a resource failover.
> At least one way to do this might be registering an alert-agent that can
> observe monitoring actions fail
> or resources being started on certain nodes - given your
> pacemaker-version is >= 1.1.15.
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch07.html
>
> Regards,
> Klaus
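A minimal sketch of registering such an alert agent with pcs; the script path and id
are placeholders, and pcs must be new enough to manage alerts:

    # call this script whenever a node, resource or fencing event occurs
    pcs alert create path=/usr/local/bin/failover_report.sh id=failover-report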
> >
> > Thanks
> >
> >
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-28 Thread Scott Greenlese

Hi folks..

I have some follow-up questions about corosync daemon status after cluster
shutdown.

Basically, what should happen to corosync on a cluster node when pacemaker
is shutdown on that node?
On my 5 node cluster, when I do a global shutdown, the pacemaker processes
exit, but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution.   I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note:  The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered on the one zs93KLpcs1 LPAR,  so in theory I should not have
quorum when it comes up and activates
pacemaker, which is enabled to autostart at boot time on all 5 cluster
nodes.  At this point, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.

I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
Online, and "partition with quorum":

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016  Last change: Mon Sep 26
16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

 zs95kjg109062_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep
-v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374  1  0 Sep26 ?00:10:17 corosync
zs95KL
root  22933  1  0 11:51 ?00:00:54 corosync
zs95kj
root  19382  1  0 Sep26 ?00:10:15 corosync
zs93kj
root 129102  1  0 Sep26 ?00:12:10 corosync
zs93kl
root  21894  1  0 15:19 ?00:00:00 corosync


But, pacemaker is only running on the one, online node:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd |
grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root  23005  1  0 15:19 ?00:00:00 /usr/sbin/pacemakerd -f
You have new mail in /var/spool/mail/root
[root@zs95kj VD]#


This situation wreaks havoc on my VirtualDomain resources, as the majority
of them are in FAILED or Stopped state, and to my
surprise... many of them show as Started:

[root@zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
 zs95kjg109062_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109064_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109065_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109066_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109068_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109069_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109070_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109071_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109072_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109073_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109074_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109075_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109076_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109077_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109078_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109079_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109080_res  (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109081_res  (ocf::hear

Re: [ClusterLabs] pacemaker_remoted XML parse error

2016-09-28 Thread Radoslaw Garbacz
Just to add a possibly helpful observation: either the "cib" or the "pengine" process
goes to ~100% CPU when these remote node errors happen.

On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz <
radoslaw.garb...@xtremedatainc.com> wrote:

> Hi,
>
> I encountered the same problem with pacemaker built from github at around
> August 22.
>
> Remote nodes go offline occasionally and stay so, their logs show same
> errors. The cluster is on AWS ec2 instances, the network works and is an
> unlikely reason.
>
> Have there been any commits on github recently (after August 22) addressing
> this issue?
>
>
> Logs:
> [...]
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> ipc_proxy_remove_provider:ipc proxy connection for client
> ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> node disconnected.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> cancel_recurring_action:  Cancelling ocf operation
> monitor_all_monitor_191000
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -53
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_send:  Failed to send remote msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
> send_client_notify:   Notification of client
> remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
> failed
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
> lrmd_remote_client_destroy:   LRMD client disconnecting remote client
> - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
> 725735d17e22
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> [...]
>
>
>
>
> On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan <
> narayanamoort...@gmail.com> wrote:
>
>> Don't see any issues in network traffic.
>>
>> Some more logs where the XML tags are incomplete:
>>
>> 2016-06-09T03:06:03.096449+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:error: Partial
>> > operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="225:116:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;225:116:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026"
>> last-rc-change="1459491026" exec-time="158" queue-time="0"
>> op-digest="dfb0c861
>> 2016-06-09T03:06:03.097136+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:error: Partial
>> > operation_key="fs-postgresql_monitor_0" operation="monitor"
>> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
>> transition-key="41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> transition-magic="0:0;41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1"
>> on_node="d00-50-56-94-24-dd" call-id="5" rc-code="0" op-status="