Re: [ClusterLabs] crmsh resource failcount does not appear to work
02.01.2018 06:48, Ken Gaillot wrote:
> Hmm, I didn't realize crm shell supported setting a fail count.
>
> We discourage setting a fail count attribute directly as of 1.1.17, as
> having a fail count without any failed operation history or last
> failure time can be confusing to users (no failures would show up in
> status, yet failure recovery behavior would be in effect, and failure
> timeouts would not work properly).
>
> It is possible to set the new per-operation attributes directly, if
> that capability is still desired, but I'm not sure there's a good
> reason to do so.
>
> crm_failcount is a better choice than crm_attribute for querying and
> clearing fail count attributes, as it will handle summing per-operation
> fail counts if a resource total fail count is desired. Clearing a fail
> count is now equivalent to crm_resource --cleanup, so it keeps the
> operation history and last failure times consistent.

The problem is that neither "crm resource failcount show" nor "crm
resource failcount delete" works anymore; that is how I hit this issue
in the first place. I do not particularly care whether it is possible to
set fail counts, although I can see that it could be useful for testing.

If it is decided to allow setting them, maybe crmsh could default to the
"monitor" operation if none is explicitly given. That is likely what most
users mean, since during normal operation we expect recurring monitor
errors. That said, I suppose crmsh should really be using crm_failcount,
which makes support for "set" a matter for core pacemaker.

> FYI the per-operation fail counts are not particularly useful now, but
> they will make future failure handling enhancements possible, e.g.
> configuring start-failure-is-fatal per resource, or ignoring a certain
> number of monitor failures before recovering while still recovering
> immediately for other operation failures.

___
Users mailing list: Users@clusterlabs.org
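The suggestion above, that a front end such as crmsh delegate "show" and "delete" to crm_failcount, could look roughly like the following Python sketch. It is only an illustration: the helper name `failcount_argv` is hypothetical, it builds an argument list without executing anything, and while `-G` and `-r` appear in the transcript in this thread, the `-D` (delete) and `-N` (node) options reflect my understanding of the tool and should be checked against `crm_failcount --help` on the installed version.

```python
def failcount_argv(action, resource, node=None):
    """Build a crm_failcount command line for querying or clearing.

    Hypothetical helper: it only assembles arguments, it runs nothing.
    """
    # Assumed option letters: -G query, -D delete (verify locally).
    flags = {"show": "-G", "delete": "-D"}
    if action not in flags:
        # "set" is deliberately unsupported: per this thread,
        # crm_failcount only queries and clears fail counts.
        raise ValueError("unsupported action: %s" % action)
    argv = ["crm_failcount", flags[action], "-r", resource]
    if node is not None:
        # Assumed option letter for restricting to one node.
        argv += ["-N", node]
    return argv


print(" ".join(failcount_argv("show", "rsc_Stateful_1", "ha1")))
```

Rejecting "set" outright mirrors the behavior described in the thread, where crm_failcount refuses to set a fail count and only clears it.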
Re: [ClusterLabs] crmsh resource failcount does not appear to work
On Wed, 2017-12-27 at 14:03 +0300, Andrei Borzenkov wrote:
> There is probably a misunderstanding. The problem is which attribute is
> used, not how it is set. crmsh sets (and as far as I can tell has always
> set) an attribute named fail-count-<resource>, while pacemaker
> internally sets and queries attributes named
> fail-count-<resource>#<operation>.
>
> It is possible that this has changed in recent pacemaker versions, of
> course ... yep, here is the crm_failcount commit that implemented the new
> (per-operation) failcounts. That means "crm resource failcount set"
> without qualifying by operation is simply not valid ... actually,
> crm_failcount will refuse to set a failcount at all (it can only clear it).
>
> https://github.com/ClusterLabs/pacemaker/commit/8323616179dc3f8038c6a69e7323757bd1feacb1#diff-6e58482648938fd488a920b9902daac4

Hmm, I didn't realize crm shell supported setting a fail count.

We discourage setting a fail count attribute directly as of 1.1.17, as
having a fail count without any failed operation history or last
failure time can be confusing to users (no failures would show up in
status, yet failure recovery behavior would be in effect, and failure
timeouts would not work properly).

It is possible to set the new per-operation attributes directly, if
that capability is still desired, but I'm not sure there's a good
reason to do so.

crm_failcount is a better choice than crm_attribute for querying and
clearing fail count attributes, as it will handle summing per-operation
fail counts if a resource total fail count is desired. Clearing a fail
count is now equivalent to crm_resource --cleanup, so it keeps the
operation history and last failure times consistent.

FYI the per-operation fail counts are not particularly useful now, but
they will make future failure handling enhancements possible, e.g.
configuring start-failure-is-fatal per resource, or ignoring a certain
number of monitor failures before recovering while still recovering
immediately for other operation failures.

-- 
Ken Gaillot
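To make the "summing per-operation fail counts" point concrete, here is a minimal Python sketch. It is not crm_failcount's actual implementation: the function name `total_fail_count` is hypothetical, the input is assumed to be a plain mapping of one node's status attribute names to values, and values are assumed to be plain integers (a value such as INFINITY is not handled). The sample data is taken from the cibadmin output quoted in this thread.

```python
def total_fail_count(attrs, resource):
    """Sum the per-operation fail counts of one resource on one node.

    Only attributes qualified by operation ("fail-count-<rsc>#...") are
    counted; an unqualified "fail-count-<rsc>" attribute is ignored.
    """
    prefix = "fail-count-%s#" % resource
    return sum(int(value) for name, value in attrs.items()
               if name.startswith(prefix))


# Attribute names and values as shown by "cibadmin -Q | grep fail-count".
attrs = {
    "fail-count-rsc_Stateful_1#monitor_1": "1",
    "fail-count-rsc_Stateful_1": "4",
}
print(total_fail_count(attrs, "rsc_Stateful_1"))  # prints 1
```

Ignoring the unqualified attribute matches the transcript, where `crm_failcount -G` kept reporting 1 after crmsh had stored 4 under the unqualified name.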
Re: [ClusterLabs] crmsh resource failcount does not appear to work
Andrei Borzenkov writes:

> As far as I can tell, pacemaker acts on fail-count attributes qualified
> by operation name, while crm sets and queries an unqualified attribute;
> I do not see any syntax in crmsh to set the fail count for a specific
> operation.

crmsh uses crm_attribute to get the failcount. It could be that this
usage has stopped working as of 1.1.17..

Cheers,
Kristoffer

> ha1:~ # rpm -q crmsh
> crmsh-4.0.0+git.1511604050.816cb0f5-1.1.noarch
> ha1:~ # crm_mon -1rf
> Stack: corosync
> Current DC: ha2 (version 1.1.17-3.3-36d2962a8) - partition with quorum
> Last updated: Sun Dec 24 10:55:54 2017
> Last change: Sun Dec 24 10:55:47 2017 by hacluster via crmd on ha2
>
> 2 nodes configured
> 4 resources configured
>
> Online: [ ha1 ha2 ]
>
> Full list of resources:
>
>  stonith-sbd (stonith:external/sbd): Started ha1
>  rsc_dummy_1 (ocf::pacemaker:Dummy): Started ha2
>  Master/Slave Set: ms_Stateful_1 [rsc_Stateful_1]
>      Masters: [ ha1 ]
>      Slaves: [ ha2 ]
>
> Migration Summary:
> * Node ha2:
> * Node ha1:
> ha1:~ # echo xxx > /run/Stateful-rsc_Stateful_1.state
> ha1:~ # crm_failcount -G -r rsc_Stateful_1
> scope=status name=fail-count-rsc_Stateful_1 value=1
> ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
> scope=status name=fail-count-rsc_Stateful_1 value=0
> ha1:~ # crm resource failcount rsc_Stateful_1 set ha1 4
> ha1:~ # crm_failcount -G -r rsc_Stateful_1
> scope=status name=fail-count-rsc_Stateful_1 value=1
> ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
> scope=status name=fail-count-rsc_Stateful_1 value=4
> ha1:~ # cibadmin -Q | grep fail-count
>   id="status-1084752129-fail-count-rsc_Stateful_1.monitor_1" name="fail-count-rsc_Stateful_1#monitor_1" value="1"/>
>   name="fail-count-rsc_Stateful_1" value="4"/>
> ha1:~ #

-- 
// Kristoffer Grönlund
// kgronl...@suse.com
Re: [ClusterLabs] crmsh resource failcount does not appear to work
On Wed, Dec 27, 2017 at 11:40 AM, Kristoffer Grönlund wrote:
> crmsh uses crm_attribute to get the failcount. It could be that this
> usage has stopped working as of 1.1.17..

There is probably a misunderstanding. The problem is which attribute is
used, not how it is set. crmsh sets (and as far as I can tell has always
set) an attribute named fail-count-<resource>, while pacemaker internally
sets and queries attributes named fail-count-<resource>#<operation>.

It is possible that this has changed in recent pacemaker versions, of
course ... yep, here is the crm_failcount commit that implemented the new
(per-operation) failcounts. That means "crm resource failcount set"
without qualifying by operation is simply not valid ... actually,
crm_failcount will refuse to set a failcount at all (it can only clear it).

https://github.com/ClusterLabs/pacemaker/commit/8323616179dc3f8038c6a69e7323757bd1feacb1#diff-6e58482648938fd488a920b9902daac4
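The difference between the two attribute names can be illustrated with a small Python sketch that splits a name into its parts. The function name `parse_fail_count_name` is hypothetical, and reading the trailing `_<number>` as the operation interval is an assumption based on the single example in this thread (`fail-count-rsc_Stateful_1#monitor_1`); the thread does not state its unit.

```python
def parse_fail_count_name(name):
    """Split a fail-count attribute name into (resource, operation, interval).

    Returns None for names that are not fail-count attributes. For an
    unqualified name, operation and interval are both None.
    """
    prefix = "fail-count-"
    if not name.startswith(prefix):
        return None
    rest = name[len(prefix):]
    resource, sep, op_key = rest.partition("#")
    if not sep:
        # Unqualified form, as written by crmsh in the transcript.
        return (resource, None, None)
    if "_" in op_key:
        # Assumption: the text after the last "_" is the interval.
        operation, interval = op_key.rsplit("_", 1)
    else:
        operation, interval = op_key, None
    return (resource, operation, interval)


print(parse_fail_count_name("fail-count-rsc_Stateful_1#monitor_1"))
print(parse_fail_count_name("fail-count-rsc_Stateful_1"))
```

Splitting on `#` first keeps underscores inside the resource name (as in `rsc_Stateful_1`) from being mistaken for the operation separator.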
[ClusterLabs] crmsh resource failcount does not appear to work
As far as I can tell, pacemaker acts on fail-count attributes qualified
by operation name, while crm sets and queries an unqualified attribute;
I do not see any syntax in crmsh to set the fail count for a specific
operation.

ha1:~ # rpm -q crmsh
crmsh-4.0.0+git.1511604050.816cb0f5-1.1.noarch
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha2 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Sun Dec 24 10:55:54 2017
Last change: Sun Dec 24 10:55:47 2017 by hacluster via crmd on ha2

2 nodes configured
4 resources configured

Online: [ ha1 ha2 ]

Full list of resources:

 stonith-sbd (stonith:external/sbd): Started ha1
 rsc_dummy_1 (ocf::pacemaker:Dummy): Started ha2
 Master/Slave Set: ms_Stateful_1 [rsc_Stateful_1]
     Masters: [ ha1 ]
     Slaves: [ ha2 ]

Migration Summary:
* Node ha2:
* Node ha1:
ha1:~ # echo xxx > /run/Stateful-rsc_Stateful_1.state
ha1:~ # crm_failcount -G -r rsc_Stateful_1
scope=status name=fail-count-rsc_Stateful_1 value=1
ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
scope=status name=fail-count-rsc_Stateful_1 value=0
ha1:~ # crm resource failcount rsc_Stateful_1 set ha1 4
ha1:~ # crm_failcount -G -r rsc_Stateful_1
scope=status name=fail-count-rsc_Stateful_1 value=1
ha1:~ # crm resource failcount rsc_Stateful_1 show ha1
scope=status name=fail-count-rsc_Stateful_1 value=4
ha1:~ # cibadmin -Q | grep fail-count
  id="status-1084752129-fail-count-rsc_Stateful_1.monitor_1" name="fail-count-rsc_Stateful_1#monitor_1" value="1"/>
  name="fail-count-rsc_Stateful_1" value="4"/>
ha1:~ #

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org