etc.

Ken Gaillot Thu, 22 Sep 2016 12:09:55 -0700

On 09/22/2016 12:58 PM, Kristoffer Grönlund wrote:
> Ken Gaillot <[email protected]> writes:
>>
>> "restart" is the only on-fail value that it makes sense to escalate.
>>
>> block/stop/fence/standby are final. Block means "don't touch the
>> resource again", so there can't be any further response to failures.
>> Stop/fence/standby move the resource off the local node, so failure
>> handling is reset (there are 0 failures on the new node to begin with).
> 
> Hrm. If a restart potentially migrates the resource to a different node,
> is the failcount reset then as well? If so, wouldn't that complicate the
> hard-fail-threshold variable too, since potentially, the resource could
> keep migrating between nodes and since the failcount is reset on each
> migration, it would never reach the hard-fail-threshold. (or am I
> missing something?)


The failure count is specific to each node. By "failure handling is
reset" I mean that when the resource moves to a different node, the
failure count of the original node no longer matters -- the new node's
failure count is now what matters.

A node's failure count is reset only when the user manually clears it,
or the node is rebooted. Also, resources may have a failure-timeout
configured, in which case the count will go down as failures expire.

So, a resource with on-fail=restart would never go back to a node where
it had previously reached the threshold, unless that node's fail count
were cleared in one of those ways.

_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Reply via email to