Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-29 Thread Ken Gaillot
On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot  wrote:
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>> cluster to do anything until it's _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
> 
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> It's entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :) Take a
look at all of nagios' options for deciding when a failure becomes "real".

If you clear failures after a success, you can't detect/recover a
resource that is flapping.
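
The closest thing today to a time window is the failure-timeout
meta-attribute, which expires old failures so they stop counting toward
migration-threshold; a minimal sketch with a hypothetical resource name:

  # failures expire after 10 minutes instead of counting forever
  pcs resource meta myResource migration-threshold=3 failure-timeout=10min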

>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
> 
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban


To express current default behavior:

  op start ignore-fail=0 soft-fail=0        on-hard-fail=ban
  op stop  ignore-fail=0 soft-fail=0        on-hard-fail=fence
  op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban


on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).
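
For orientation, the three existing knobs being mapped look roughly like
this in pcs terms (resource name and values are just examples):

  pcs resource update myResource op monitor interval=10s on-fail=restart  # per-op recovery action
  pcs resource meta myResource migration-threshold=3                      # ban from the node after 3 failures
  pcs property set start-failure-is-fatal=false                           # count start failures like any other failure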

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.



Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource

2016-09-29 Thread Jan Pokorný
Hello,

On 29/09/16 12:41 -0400, Christopher Harvey wrote:
> I think something is failing at the execvp() level. I'm seeing
> useful-looking trace logs in the code, but can't enable them right
> now. I have:
> PCMK_debug=yes
> PCMK_logfile=/tmp/pacemaker.log
> PCMK_logpriority=debug
> PCMK_trace_files=services_linux.c

Just in case: pacemaker needs to be restarted once this change is made
in the appropriate configuration file.
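
On a typical RHEL-style install the PCMK_* variables live in
/etc/sysconfig/pacemaker (Debian-based systems use /etc/default/pacemaker)
and are only read when the daemon starts, so roughly:

  # assumes a systemd-based host; adjust the path for your distribution
  grep '^PCMK_' /etc/sysconfig/pacemaker
  systemctl restart pacemaker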

Another thing to try: does "crm_resource --show-metadata=ocf:acme:MsgBB-Active"
work for you?

-- 
Jan (Poki)


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-29 Thread Jan Pokorný
On 28/09/16 16:30 -0400, Scott Greenlese wrote:
> Also,  I have tried simulating a failed cluster node (to trigger a
> STONITH action) by killing the corosync daemon on one node, but all
> that does is respawn the daemon ...  causing a temporary / transient
> failure condition, and no fence takes place.   Is there a way to
> kill corosync in such a way that it stays down?   Is there a best
> practice for STONITH testing?

This makes me seriously wonder what could cause this involuntary
daemon-scoped high availability...

Are you sure you are using the upstream-provided init script/unit file?
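
For instance, on a systemd host something like this would show whether a
Restart= directive sneaked into the unit or a local drop-in:

  # prints the unit file plus any drop-ins; look for Restart= lines
  systemctl cat corosync | grep -i restart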

(Just hope there's no fence_corosync_restart.)
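
As for a fencing-test best practice, the usual approaches are to request
fencing through the cluster or to crash the node outright (a sketch;
substitute one of your own node names):

  pcs stonith fence zs93kjpcs1        # ask pacemaker's fencer to fence the node
  stonith_admin --reboot zs93kjpcs1   # or talk to the fencer directly
  # on the victim itself: echo c > /proc/sysrq-trigger  (crashes the kernel; needs sysrq enabled)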

-- 
Jan (Poki)


Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource

2016-09-29 Thread Christopher Harvey
On Thu, Sep 29, 2016, at 12:20 PM, Jan Pokorný wrote:
> On 28/09/16 16:55 -0500, Ken Gaillot wrote:
> > On 09/28/2016 04:04 PM, Christopher Harvey wrote:
> >> My corosync/pacemaker logs are seeing a bunch of messages like the
> >> following:
> >> 
> >> Sep 22 14:50:36 [1346] node-132-60   crmd: info:
> >> action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
> >> exited with rc=4
> 
> Another possibility is that the "execvp" call, i.e., the means to run this
> very agent, failed at a fundamental level (could also be due to the kernel's
> security modules like SELinux, seccomp, etc., as already mentioned).

I don't have seccomp or SELinux.

> Do other agents work flawlessly for you?

I only have my custom agent. All actions work except meta-data. In fact,
I put the following at the very top of my resource agent:
#! /bin/bash
touch /tmp/yeah
echo "yeah running ${@}" >> /tmp/yeah

and the 'yeah' file gets filled with monitor/start/stop, but not
meta-data. I think something is failing at the execvp() level. I'm
seeing useful-looking trace logs in the code, but can't enable them
right now. I have:
PCMK_debug=yes
PCMK_logfile=/tmp/pacemaker.log
PCMK_logpriority=debug
PCMK_trace_files=services_linux.c

but I'm not seeing the pacemaker.log anywhere, and corosync.log only has
info and higher.


Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource

2016-09-29 Thread Jan Pokorný
On 28/09/16 16:55 -0500, Ken Gaillot wrote:
> On 09/28/2016 04:04 PM, Christopher Harvey wrote:
>> My corosync/pacemaker logs are seeing a bunch of messages like the
>> following:
>> 
>> Sep 22 14:50:36 [1346] node-132-60   crmd: info:
>> action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
>> exited with rc=4

Another possibility is that the "execvp" call, i.e., the means to run this
very agent, failed at a fundamental level (could also be due to the kernel's
security modules like SELinux, seccomp, etc., as already mentioned).

Do other agents work flawlessly for you?

> This is the (unmodified) exit status of the process, so the resource
> agent must be returning "4" for some reason. Normally, that is used to
> indicate "insufficient privileges".
> 
>> Sep 22 14:50:36 [1346] node-132-60   crmd:error:
>> generic_get_metadata:   Failed to retrieve meta-data for
>> ocf:acme:MsgBB-Active
>> Sep 22 14:50:36 [1346] node-132-60   crmd:  warning:
>> get_rsc_metadata:   No metadata found for MsgBB-Active::ocf:acme:
>> Input/output error (-5)
>> Sep 22 14:50:36 [1346] node-132-60   crmd:error:
>> build_operation_update: No metadata for acme::ocf:MsgBB-Active
>> Sep 22 14:50:36 [1346] node-132-60   crmd:   notice:
>> process_lrm_event:  Operation MsgBB-Active_start_0: ok
>> (node=node-132-60, call=25, rc=0, cib-update=27, confirmed=true)
>> 
>> I am able to run the meta-data command on the command line:
> 
> I would suspect that your user account has some privileges that the lrmd
> user (typically hacluster:haclient) doesn't have. Try "su - hacluster"
> first and see if it's any different. Maybe directory or file
> permissions, or SELinux?

In fact, lrmd (along with stonithd) is an exception among the daemons, as
it runs as root:root so that it can portably handle execution of resources
that, naturally and in general, require the highest (here: inherited)
privileges.
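
So a quicker sanity check than switching users is to run the action by hand
as root, the way lrmd would (path assumes the standard OCF layout for
ocf:acme:MsgBB-Active):

  # exit code 4 would correspond to OCF_ERR_INSUFFICIENT_PRIV
  OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/acme/MsgBB-Active meta-data
  echo "rc=$?"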

-- 
Jan (Poki)


Re: [ClusterLabs] Pacemaker quorum behavior

2016-09-29 Thread Tomas Jelinek

On 29.9.2016 at 00:14, Ken Gaillot wrote:

On 09/28/2016 03:57 PM, Scott Greenlese wrote:

A quick addendum...

After sending this post, I decided to stop pacemaker on the single,
Online node in the cluster,
and this effectively killed the corosync daemon:

[root@zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...


Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.


[root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016


Totally irrelevant, but a little trick I picked up somewhere: when
grepping for a process, square-bracketing a character lets you avoid the
"grep -v", e.g. "ps -ef | grep cor[o]"

It's nice when I remember to use it ;)


[root@zs93kl VD]#



Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root@zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root@zs95kj VD]#

So, this answers my own question... cluster stop should kill corosync.
So why is `pcs cluster stop --all` failing to kill corosync?


It should. At least you've narrowed it down :)


This is a bug in pcs. Thanks for spotting it and providing a detailed
description. I filed the bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=1380372
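
Until the fix is available, a possible workaround (untested here) is to stop
corosync explicitly on each node after "pcs cluster stop", e.g.:

  # assumes systemd-managed corosync on every node
  for h in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1; do
      ssh "$h" systemctl stop corosync
  done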


Regards,
Tomas




Thanks...


Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
INTERNET: swgre...@us.ibm.com




From: Scott Greenlese/Poughkeepsie/IBM
To: kgail...@redhat.com, Cluster Labs - All topics related to
open-source clustering welcomed 
Date: 09/28/2016 04:30 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior




Hi folks..

I have some follow-up questions about corosync daemon status after
cluster shutdown.

Basically, what should happen to corosync on a cluster node when
pacemaker is shut down on that node?
On my 5 node cluster, when I do a global shutdown, the pacemaker
processes exit, but corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution. I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note: The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
have quorum when it comes up and activates
pacemaker, which is enabled to autostart at boot time on all 5 cluster
nodes. At this point, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.

I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
Online, and "partition with quorum":


Corosync determines quorum, pacemaker just uses it. If corosync is
running, the node contributes to quorum.
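
A quick way to see that from corosync's side, independent of pacemaker:

  corosync-quorumtool -s    # corosync's own view of membership, votes and quorum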


[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016 Last change: Mon Sep 26 16:15:08
2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
partition with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync
|grep -v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374 1 0 Sep26 ? 00:10:17 corosync
zs95KL
root 22933 1 0 11:51 ? 00:00:54 corosync