Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.
On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot wrote:
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>> really?
>>>
>>> it is not uncommon to hear "i know its failed, but i dont want the
>>> cluster to do anything until its _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
>
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> Its entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :)
Take a look at all of nagios' options for deciding when a failure becomes
"real". If you clear failures after a success, you can't detect/recover a
resource that is flapping.

>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

To express current default behavior:

  op start ignore-fail=0 soft-fail=0 on-hard-fail=ban
  op stop ignore-fail=0 soft-fail=0 on-hard-fail=fence
  op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban

on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
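The proposed semantics could be sketched roughly as follows. This is a hypothetical illustration of the escalation logic only, not existing Pacemaker syntax or code; the function name `action_for_failcount` and the thresholds are made up to mirror the `op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban` example above:

```shell
#!/bin/sh
# Hypothetical sketch of "op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban",
# counting failures since the last cleanup (not "in a row").
# action_for_failcount is an illustrative name, not a Pacemaker interface.
action_for_failcount() {
    count=$1
    if [ "$count" -le 3 ]; then
        echo ignore            # first 3 failures: do nothing
    elif [ "$count" -le 5 ]; then
        echo restart           # next 2 failures: soft recovery
    else
        echo ban               # beyond that: hard recovery (ban the node)
    fi
}

for n in 1 2 3 4 5 6 7; do
    printf 'failure %d -> %s\n' "$n" "$(action_for_failcount "$n")"
done
```

Mapping the current defaults onto this scheme would then just mean setting ignore-fail to 0 and soft-fail to 0 or INFINITY, as in the examples above.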
Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource
Hello,

On 29/09/16 12:41 -0400, Christopher Harvey wrote:
> I think something is failing at the execvp() level. I'm seeing
> useful looking trace logs in the code, but can't enable them right
> now. I have:
> PCMK_debug=yes
> PCMK_logfile=/tmp/pacemaker.log
> PCMK_logpriority=debug
> PCMK_trace_files=services_linux.c

Just in case: pacemaker needs to be restarted once this change is made
in the appropriate configuration file.

Another try: does "crm_resource --show-metadata=ocf:acme:MsgBB-Active"
work for you?

--
Jan (Poki)
Re: [ClusterLabs] Pacemaker quorum behavior
On 28/09/16 16:30 -0400, Scott Greenlese wrote:
> Also, I have tried simulating a failed cluster node (to trigger a
> STONITH action) by killing the corosync daemon on one node, but all
> that does is respawn the daemon ... causing a temporary / transient
> failure condition, and no fence takes place. Is there a way to
> kill corosync in such a way that it stays down? Is there a best
> practice for STONITH testing?

This makes me seriously wonder what could cause this involuntary
daemon-scoped high availability... Are you sure you are using the
upstream-provided initscript/unit file? (Just hope there's no
fence_corosync_restart.)

--
Jan (Poki)
Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource
On Thu, Sep 29, 2016, at 12:20 PM, Jan Pokorný wrote:
> On 28/09/16 16:55 -0500, Ken Gaillot wrote:
>> On 09/28/2016 04:04 PM, Christopher Harvey wrote:
>>> My corosync/pacemaker logs are seeing a bunch of messages like the
>>> following:
>>>
>>> Sep 22 14:50:36 [1346] node-132-60 crmd: info:
>>> action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
>>> exited with rc=4
>
> Another possibility is that the "execvp" call, i.e., the means to run
> this very agent, failed at a fundamental level (could also be due to
> kernel's security modules like SELinux, seccomp, etc. as already
> mentioned).

I don't have seccomp or SELinux.

> Do other agents work flawlessly for you?

I only have my custom agent. All actions work except meta-data.

In fact, I put the following at the very top of my resource agent:

#! /bin/bash
touch /tmp/yeah
echo "yeah running ${@}" >> /tmp/yeah

and the 'yeah' file gets filled with monitor/start/stop, but not
meta-data.

I think something is failing at the execvp() level. I'm seeing useful
looking trace logs in the code, but can't enable them right now. I have:

PCMK_debug=yes
PCMK_logfile=/tmp/pacemaker.log
PCMK_logpriority=debug
PCMK_trace_files=services_linux.c

but I'm not seeing the pacemaker.log anywhere, and corosync.log only has
info and higher.
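For comparison, a minimal OCF-style skeleton that does answer the meta-data action can be exercised directly from a shell, the same way one would sanity-check an agent before handing it to the cluster. This is an illustrative stand-in, not the poster's MsgBB-Active agent; it is written to a temp file purely so it can be run on the spot:

```shell
#!/bin/sh
# Write a minimal OCF-style agent skeleton to a temp file and call its
# meta-data action directly, the same action the crmd is requesting.
# Illustrative stand-in only, not the actual ocf:acme:MsgBB-Active agent.
agent=$(mktemp)
cat > "$agent" <<'EOF'
#!/bin/sh
case "$1" in
    meta-data)
        cat <<XML
<?xml version="1.0"?>
<resource-agent name="Dummy" version="0.1">
  <version>1.0</version>
  <actions>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
XML
        exit 0 ;;                # OCF_SUCCESS
    start|stop|monitor)
        exit 0 ;;
    *)
        exit 3 ;;                # OCF_ERR_UNIMPLEMENTED
esac
EOF
chmod +x "$agent"

xml=$("$agent" meta-data)        # should print the XML and exit 0
rc=$?
echo "$xml"
echo "meta-data exit code: $rc"
rm -f "$agent"
```

An agent whose meta-data branch exits non-zero (here rc=4 would map to OCF_ERR_PERM, "insufficient privileges") produces exactly the kind of generic_get_metadata failure quoted in the logs above.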
Re: [ClusterLabs] Failed to retrieve meta-data for custom ocf resource
On 28/09/16 16:55 -0500, Ken Gaillot wrote:
> On 09/28/2016 04:04 PM, Christopher Harvey wrote:
>> My corosync/pacemaker logs are seeing a bunch of messages like the
>> following:
>>
>> Sep 22 14:50:36 [1346] node-132-60 crmd: info:
>> action_synced_wait: Managed MsgBB-Active_meta-data_0 process 15613
>> exited with rc=4

Another possibility is that the "execvp" call, i.e., the means to run
this very agent, failed at a fundamental level (could also be due to
kernel's security modules like SELinux, seccomp, etc. as already
mentioned).

Do other agents work flawlessly for you?

> This is the (unmodified) exit status of the process, so the resource
> agent must be returning "4" for some reason. Normally, that is used to
> indicate "insufficient privileges".
>
>> Sep 22 14:50:36 [1346] node-132-60 crmd: error:
>> generic_get_metadata: Failed to retrieve meta-data for
>> ocf:acme:MsgBB-Active
>> Sep 22 14:50:36 [1346] node-132-60 crmd: warning:
>> get_rsc_metadata: No metadata found for MsgBB-Active::ocf:acme:
>> Input/output error (-5)
>> Sep 22 14:50:36 [1346] node-132-60 crmd: error:
>> build_operation_update: No metadata for acme::ocf:MsgBB-Active
>> Sep 22 14:50:36 [1346] node-132-60 crmd: notice:
>> process_lrm_event: Operation MsgBB-Active_start_0: ok
>> (node=node-132-60, call=25, rc=0, cib-update=27, confirmed=true)
>>
>> I am able to run the meta-data command on the command line:
>
> I would suspect that your user account has some privileges that the lrmd
> user (typically hacluster:haclient) doesn't have. Try "su - hacluster"
> first and see if it's any different. Maybe directory or file
> permissions, or SELinux?

In fact lrmd (along with stonithd) is an exception in the daemons'
conglomerate, as it runs as root:root, so as to portably handle execution
of resources that, naturally and in general, require execution with as
high privileges as possible (here: inherited).
--
Jan (Poki)
Re: [ClusterLabs] Pacemaker quorum behavior
On 29.9.2016 at 00:14, Ken Gaillot wrote:
> On 09/28/2016 03:57 PM, Scott Greenlese wrote:
>> A quick addendum... After sending this post, I decided to stop
>> pacemaker on the single, Online node in the cluster, and this
>> effectively killed the corosync daemon:
>>
>> [root@zs93kl VD]# date;pcs cluster stop
>> Wed Sep 28 16:39:22 EDT 2016
>> Stopping Cluster (pacemaker)...
>> Stopping Cluster (corosync)...
>
> Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.
>
>> [root@zs93kl VD]# date;ps -ef |grep coro|grep -v grep
>> Wed Sep 28 16:46:19 EDT 2016
>
> Totally irrelevant, but a little trick I picked up somewhere: when
> grepping for a process, square-bracketing a character lets you avoid
> the "grep -v", e.g. "ps -ef | grep cor[o]". It's nice when I remember
> to use it ;)
>
>> [root@zs93kl VD]#
>>
>> Next, I went to a node in "Pending" state, and sure enough... the pcs
>> cluster stop killed the daemon there, too:
>>
>> [root@zs95kj VD]# date;pcs cluster stop
>> Wed Sep 28 16:48:15 EDT 2016
>> Stopping Cluster (pacemaker)...
>> Stopping Cluster (corosync)...
>> [root@zs95kj VD]# date;ps -ef |grep coro |grep -v grep
>> Wed Sep 28 16:48:38 EDT 2016
>> [root@zs95kj VD]#
>>
>> So, this answers my own question... cluster stop should kill corosync.
>> So, why is the `pcs cluster stop --all` failing to kill corosync?
>
> It should. At least you've narrowed it down :)

This is a bug in pcs. Thanks for spotting it and providing a detailed
description. I filed the bug here:
https://bugzilla.redhat.com/show_bug.cgi?id=1380372

Regards,
Tomas

>> Thanks...
>>
>> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
>> INTERNET: swgre...@us.ibm.com
>> From: Scott Greenlese/Poughkeepsie/IBM
>> To: kgail...@redhat.com, Cluster Labs - All topics related to
>> open-source clustering welcomed
>> Date: 09/28/2016 04:30 PM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> Hi folks..
>>
>> I have some follow-up questions about corosync daemon status after
>> cluster shutdown. Basically, what should happen to corosync on a
>> cluster node when pacemaker is shutdown on that node?
>>
>> On my 5 node cluster, when I do a global shutdown, the pacemaker
>> processes exit, but corosync processes remain active. Here's an
>> example of where this led me into some trouble...
>>
>> My cluster is still configured to use the "symmetric" resource
>> distribution. I don't have any location constraints in place, so
>> pacemaker tries to evenly distribute resources across all Online
>> nodes. With one cluster node (KVM host) powered off, I did the
>> global cluster stop:
>>
>> [root@zs90KP VD]# date;pcs cluster stop --all
>> Wed Sep 28 15:07:40 EDT 2016
>> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
>> zs90kppcs1: Stopping Cluster (pacemaker)...
>> zs95KLpcs1: Stopping Cluster (pacemaker)...
>> zs95kjpcs1: Stopping Cluster (pacemaker)...
>> zs93kjpcs1: Stopping Cluster (pacemaker)...
>> Error: unable to stop all nodes
>> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
>>
>> Note: The "No route to host" messages are expected because that node /
>> LPAR is powered down. (I don't show it here, but the corosync daemon is
>> still running on the 4 active nodes. I do show it later).
>>
>> I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
>> have quorum when it comes up and activates pacemaker, which is enabled
>> to autostart at boot time on all 5 cluster nodes. At this point, only
>> 1 out of 5 nodes should be Online to the cluster, and therefore ... no
>> quorum.
>>
>> I login to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
>> Online, and "partition with quorum":
>
> Corosync determines quorum, pacemaker just uses it.
> If corosync is running, the node contributes to quorum.
>
>> [root@zs93kl ~]# date;pcs status |less
>> Wed Sep 28 15:25:13 EDT 2016
>> Cluster name: test_cluster_2
>> Last updated: Wed Sep 28 15:25:13 2016
>> Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1
>> Stack: corosync
>> Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
>> 106 nodes and 304 resources configured
>>
>> Node zs90kppcs1: pending
>> Node zs93kjpcs1: pending
>> Node zs95KLpcs1: pending
>> Node zs95kjpcs1: pending
>> Online: [ zs93KLpcs1 ]
>>
>> Full list of resources:
>>
>> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> .
>> .
>> .
>>
>> Here you can see that corosync is up on all 5 nodes:
>>
>> [root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep -v grep"; done
>> Wed Sep 28 15:22:21 EDT 2016
>> zs90KP root 155374 1 0 Sep26 ? 00:10:17 corosync
>> zs95KL root 22933 1 0 11:51 ? 00:00:54 corosync
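As an aside, the square-bracket grep trick Ken mentions above can be demonstrated without a cluster at all. The regex cor[o] matches the text "coro", but the grep process's own command line in ps output contains the literal string "cor[o]", which the pattern does not match, so grep filters itself out. A small simulation using a file in place of live ps output (the file path is arbitrary):

```shell
#!/bin/sh
# Why "ps -ef | grep cor[o]" needs no "grep -v grep":
# the pattern cor[o] matches "coro", but not the literal text "cor[o]"
# that appears on grep's own command line in the ps listing.
printf 'corosync\ngrep cor[o]\n' > /tmp/ps_sample.txt   # simulated ps output

matched=$(grep 'cor[o]' /tmp/ps_sample.txt)
echo "$matched"   # only the "corosync" line matches
rm -f /tmp/ps_sample.txt
```

The same trick works for any process name: bracket any single character of the name and the grep line stops matching itself.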