Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-04 Thread Andrew Beekhof
On Wed, Oct 5, 2016 at 7:03 AM, Ken Gaillot  wrote:

> On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
> >> Take a
> >> look at all of nagios' options for deciding when a failure becomes
> "real".
> >
> > I used to take a very hard line on this: if you don't want the cluster
> > to do anything about an error, don't tell us about it.
> > However I'm slowly changing my position... the reality is that many
> > people do want a heads up in advance and we have been forcing that
> > policy (when does an error become real) into the agents where one size
> > must fit all.
> >
> > So I'm now generally in favour of having the PE handle this "somehow".
>
> Nagios is a useful comparison:
>
> check_interval - like pacemaker's monitor interval
>
> retry_interval - if a check returns failure, switch to this interval
> (i.e. check more frequently when trying to decide whether it's a "hard"
> failure)
>
> max_check_attempts - if a check fails this many times in a row, it's a
> hard failure. Before this is reached, it's considered a soft failure.
> Nagios will call event handlers (comparable to pacemaker's alert agents)
> for both soft and hard failures (distinguishing the two). A service is
> also considered to have a "hard failure" if its host goes down.
>
> high_flap_threshold/low_flap_threshold - a service is considered to be
> flapping when its percent of state changes (ok <-> not ok) in the last
> 21 checks (= max. 20 state changes) reaches high_flap_threshold, and
> stable again once the percentage drops to low_flap_threshold. To put it
> another way, a service that passes every monitor is 0% flapping, and a
> service that fails every other monitor is 100% flapping. With these,
> even if a service never reaches max_check_attempts failures in a row, an
> alert can be sent if it's repeatedly failing and recovering.
>

makes sense.

since we're overhauling this functionality anyway, do you think we need to
add an equivalent of retry_interval too?


>
> >> If you clear failures after a success, you can't detect/recover a
> >> resource that is flapping.
> >
> > Ah, but you can if the thing you're clearing only applies to other
> > failures of the same action.
> > A completed start doesn't clear a previously failed monitor.
>
> Nope -- a monitor can alternately succeed and fail repeatedly, and that
> indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.
>
> >> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> >> something like:
> >>
> >>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
> >>
> > I would favour something more concrete than 'soft' and 'hard' here.
> > Do they have a sufficiently obvious meaning outside of us developers?
> >
> > Perhaps (with or without a "failures-" prefix) :
> >
> >    ignore-count
> >    recover-count
> >    escalation-policy
>
> I think the "soft" vs "hard" terminology is somewhat familiar to
> sysadmins -- there's at least nagios, email (SPF failures and bounces),
> and ECC RAM. But throwing "ignore" into the mix does confuse things.
>
> How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban
>
>
I could live with that :-)


Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-04 Thread Ken Gaillot
On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
>> Take a
>> look at all of nagios' options for deciding when a failure becomes "real".
> 
> I used to take a very hard line on this: if you don't want the cluster
> to do anything about an error, don't tell us about it.
> However I'm slowly changing my position... the reality is that many
> people do want a heads up in advance and we have been forcing that
> policy (when does an error become real) into the agents where one size
> must fit all.
> 
> So I'm now generally in favour of having the PE handle this "somehow".

Nagios is a useful comparison:

check_interval - like pacemaker's monitor interval

retry_interval - if a check returns failure, switch to this interval
(i.e. check more frequently when trying to decide whether it's a "hard"
failure)

max_check_attempts - if a check fails this many times in a row, it's a
hard failure. Before this is reached, it's considered a soft failure.
Nagios will call event handlers (comparable to pacemaker's alert agents)
for both soft and hard failures (distinguishing the two). A service is
also considered to have a "hard failure" if its host goes down.

high_flap_threshold/low_flap_threshold - a service is considered to be
flapping when its percent of state changes (ok <-> not ok) in the last
21 checks (= max. 20 state changes) reaches high_flap_threshold, and
stable again once the percentage drops to low_flap_threshold. To put it
another way, a service that passes every monitor is 0% flapping, and a
service that fails every other monitor is 100% flapping. With these,
even if a service never reaches max_check_attempts failures in a row, an
alert can be sent if it's repeatedly failing and recovering.
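
To make the arithmetic concrete, here's a rough sketch of the flapping
calculation (simplified -- real nagios also weights recent state changes
more heavily, and the names are mine, not nagios code):

    def flap_percentage(last_states):
        """last_states: the last 21 check results, True = ok."""
        changes = sum(a != b for a, b in zip(last_states, last_states[1:]))
        return 100.0 * changes / (len(last_states) - 1)  # 21 checks -> 20

    # flap_percentage([True] * 21)                  -> 0.0 (never flaps)
    # flap_percentage([True, False] * 10 + [True])  -> 100.0 (fails every
    #                                                  other check)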

>> If you clear failures after a success, you can't detect/recover a
>> resource that is flapping.
> 
> Ah, but you can if the thing you're clearing only applies to other
> failures of the same action.
> A completed start doesn't clear a previously failed monitor.

Nope -- a monitor can alternately succeed and fail repeatedly, and that
indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.
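
A quick toy illustration of why clearing the count on success hides
that (names invented, not pacemaker code):

    in_a_row = 0
    for ok in [False, True] * 10:    # monitor fails, recovers, fails, ...
        in_a_row = 0 if ok else in_a_row + 1
        assert in_a_row <= 1         # a threshold of 3 never trips, even
                                     # though half the checks failed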

>> It only makes sense to escalate from ignore -> restart -> hard, so maybe
>> something like:
>>
>>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
>>
> I would favour something more concrete than 'soft' and 'hard' here.
> Do they have a sufficiently obvious meaning outside of us developers?
> 
> Perhaps (with or without a "failures-" prefix) :
> 
>    ignore-count
>    recover-count
>    escalation-policy

I think the "soft" vs "hard" terminology is somewhat familiar to
sysadmins -- there's at least nagios, email (SPF failures and bounces),
and ECC RAM. But throwing "ignore" into the mix does confuse things.

How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban
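
To spell out how I'd expect those three to interact (a sketch of the
semantics I described earlier, not a design):

    def recovery_action(fail_count, max_fail_ignore, max_fail_restart,
                        fail_escalation):
        if fail_count <= max_fail_ignore:
            return "ignore"
        if fail_count <= max_fail_ignore + max_fail_restart:
            return "restart"
        return fail_escalation       # e.g. "ban", "fence", "standby"

    # With max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban:
    # failures 1-3 -> ignore, 4-5 -> restart, 6th onward -> ban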




Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-02 Thread Andrew Beekhof
On Fri, Sep 30, 2016 at 10:28 AM, Ken Gaillot  wrote:
> On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
>> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot  wrote:
 "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
 then migrate", but I can't think of a real-world situation where that
 makes sense,


 really?

 it is not uncommon to hear "i know its failed, but i dont want the
 cluster to do anything until its _really_ failed"
>>>
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>>
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>>
>> It would be a major change, but perhaps it should be "in-a-row" and
>> successfully performing the action clears the count.
>> It's entirely possible that the current behaviour is like that because
>> I wasn't smart enough to implement anything else at the time :-)
>
> Or you were smart enough to realize what a can of worms it is. :)

So you're saying two dumbs makes a smart? :-)

>Take a
> look at all of nagios' options for deciding when a failure becomes "real".

I used to take a very hard line on this: if you don't want the cluster
to do anything about an error, don't tell us about it.
However I'm slowly changing my position... the reality is that many
people do want a heads up in advance and we have been forcing that
policy (when does an error become real) into the agents where one size
must fit all.

So I'm now generally in favour of having the PE handle this "somehow".

>
> If you clear failures after a success, you can't detect/recover a
> resource that is flapping.

Ah, but you can if the thing you're clearing only applies to other
failures of the same action.
A completed start doesn't clear a previously failed monitor.

>
>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
>>
>> Not disagreeing
>
> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> something like:
>
>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

The other idea I had was to create some new return codes:
PCMK_OCF_ERR_BAN, PCMK_OCF_ERR_FENCE, etc.
I.e., make the internal mapping of return codes like
PCMK_OCF_NOT_CONFIGURED and PCMK_OCF_DEGRADED to hard/soft/ignore
recovery logic into something available to the agent.

To use your example above, return PCMK_OCF_DEGRADED for the first 3
monitor failures, PCMK_OCF_ERR_RESTART for the next two and
PCMK_OCF_ERR_BAN for the last.
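
In agent terms it would look something like this (the ERR_* codes are
the proposed, non-existent ones, and the numeric values are
placeholders, not pacemaker's):

    PCMK_OCF_SUCCESS, PCMK_OCF_DEGRADED = 0, 190
    PCMK_OCF_ERR_RESTART, PCMK_OCF_ERR_BAN = 201, 202   # invented

    def monitor(service_healthy, monitor_fail_count):
        if service_healthy:
            return PCMK_OCF_SUCCESS
        if monitor_fail_count <= 3:
            return PCMK_OCF_DEGRADED     # ignore the first 3 failures
        if monitor_fail_count <= 5:
            return PCMK_OCF_ERR_RESTART  # soft recovery for the next two
        return PCMK_OCF_ERR_BAN          # then ban

Note the agent already needs to be tracking monitor_fail_count for
this to work.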

But the more I think about it, the less I like it.
- We lose precision about what the actual error was
- We're pushing too much user config/policy into the agent (every
agent would end up with equivalents of 'ignore-fail', 'soft-fail', and
'on-hard-fail')
- We might need the agent to know about the fencing config
(enabled/disabled/valid)
- It forces the agent to track the number of operation failures

So I think I'm just mentioning it for completeness and in case it
prompts a good idea in someone else.

>
>
> To express current default behavior:
>
>   op start ignore-fail=0 soft-fail=0        on-hard-fail=ban

I would favour something more concrete than 'soft' and 'hard' here.
Do they have a sufficiently obvious meaning outside of us developers?

Perhaps (with or without a "failures-" prefix) :

   ignore-count
   recover-count
   escalation-policy

>   op stop  ignore-fail=0 soft-fail=0        on-hard-fail=fence
>   op *     ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban
>
>
> on-fail, migration-threshold, and start-failure-is-fatal would be
> deprecated (and would be easy to map to the new parameters).
>
> I'd avoid the hassles of counting failures "in a row", and stick with
> counting failures since the last cleanup.

sure



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-29 Thread Ken Gaillot
On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot  wrote:
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "i know its failed, but i dont want the
>>> cluster to do anything until its _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
> 
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> It's entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :) Take a
look at all of nagios' options for deciding when a failure becomes "real".

If you clear failures after a success, you can't detect/recover a
resource that is flapping.

>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
> 
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban


To express current default behavior:

  op start ignore-fail=0 soft-fail=0        on-hard-fail=ban
  op stop  ignore-fail=0 soft-fail=0        on-hard-fail=fence
  op *     ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban


on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.
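
As for the mapping, roughly (my sketch only; off-by-one details and
edge cases deliberately glossed over):

    INF = float("inf")

    def map_legacy(op, on_fail="restart", migration_threshold=INF,
                   start_failure_is_fatal=True):
        if op == "start" and start_failure_is_fatal:
            return dict(ignore_fail=0, soft_fail=0, on_hard_fail="ban")
        if on_fail == "restart":
            return dict(ignore_fail=0, soft_fail=migration_threshold,
                        on_hard_fail="ban")
        # block/stop/fence/standby take effect on the first failure today
        return dict(ignore_fail=0, soft_fail=0, on_hard_fail=on_fail)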



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-23 Thread Ken Gaillot
On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
> 
> 
> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot wrote:
> 
> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
> >> Ken Gaillot writes:
> >>
> >>> I'm not saying it's a bad idea, just that it's more complicated than it
> >>> first sounds, so it's worth thinking through the implications.
> >>
> >> Thinking about it and looking at how complicated it gets, maybe what
> >> you'd really want, to make it clearer for the user, is the ability to
> >> explicitly configure the behavior, either globally or per-resource. So
> >> instead of having to tweak a set of variables that interact in complex
> >> ways, you'd configure something like rule expressions,
> >>
> >> <failure-handling>
> >>   <restart attempts="3"/>
> >>   <migrate/>
> >>   <fence/>
> >> </failure-handling>
> >>
> >> So, try to restart the service 3 times, if that fails migrate the
> >> service, if it still fails, fence the node.
> >>
> >> (obviously the details and XML syntax are just an example)
> >>
> >> This would then replace on-fail, migration-threshold, etc.
> >
> > I must admit that in previous emails in this thread, I wasn't able to
> > follow during the first pass, which is not the case with this procedural
> > (sequence-ordered) approach.  Though someone can argue it doesn't take
> > type of operation into account, which might again open the door for
> > non-obvious interactions.
> 
> "restart" is the only on-fail value that it makes sense to escalate.
> 
> block/stop/fence/standby are final. Block means "don't touch the
> resource again", so there can't be any further response to failures.
> Stop/fence/standby move the resource off the local node, so failure
> handling is reset (there are 0 failures on the new node to begin with).
> 
> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
> then migrate", but I can't think of a real-world situation where that
> makes sense, 
> 
> 
> really?
> 
> it is not uncommon to hear "i know its failed, but i dont want the
> cluster to do anything until its _really_ failed"  

Hmm, I guess that would be similar to how monitoring systems such as
nagios can be configured to send an alert only if N checks in a row
fail. That's useful where transient outages (e.g. a webserver hitting
its request limit) are acceptable for a short time.

I'm not sure that's translatable to Pacemaker. Pacemaker's error count
is not "in a row" but "since the count was last cleared".

"Ignore up to three monitor failures if they occur in a row [or, within
10 minutes?], then try soft recovery for the next two monitor failures,
then ban this node for the next monitor failure." Not sure being able to
say that is worth the complexity.

> 
> and it would be a significant re-implementation of "ignore"
> (which currently ignores the state of having failed, as opposed to a
> particular instance of failure).
> 
> 
> agreed
>  
> 
> 
> What the interface needs to express is: "If this operation fails,
> optionally try a soft recovery [always stop+start], but if <threshold> failures
> occur on the same node, proceed to a [configurable] hard recovery".
> 
> And of course the interface will need to be different depending on how
> certain details are decided, e.g. whether any failures count toward <threshold>
> or just failures of one particular operation type, and whether the hard
> recovery type can vary depending on what operation failed.



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Andrew Beekhof
On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot  wrote:

> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
> >> Ken Gaillot  writes:
> >>
> >>> I'm not saying it's a bad idea, just that it's more complicated than it
> >>> first sounds, so it's worth thinking through the implications.
> >>
> >> Thinking about it and looking at how complicated it gets, maybe what
> >> you'd really want, to make it clearer for the user, is the ability to
> >> explicitly configure the behavior, either globally or per-resource. So
> >> instead of having to tweak a set of variables that interact in complex
> >> ways, you'd configure something like rule expressions,
> >>
> >> <failure-handling>
> >>   <restart attempts="3"/>
> >>   <migrate/>
> >>   <fence/>
> >> </failure-handling>
> >>
> >> So, try to restart the service 3 times, if that fails migrate the
> >> service, if it still fails, fence the node.
> >>
> >> (obviously the details and XML syntax are just an example)
> >>
> >> This would then replace on-fail, migration-threshold, etc.
> >
> > I must admit that in previous emails in this thread, I wasn't able to
> > follow during the first pass, which is not the case with this procedural
> > (sequence-ordered) approach.  Though someone can argue it doesn't take
> > type of operation into account, which might again open the door for
> > non-obvious interactions.
>
> "restart" is the only on-fail value that it makes sense to escalate.
>
> block/stop/fence/standby are final. Block means "don't touch the
> resource again", so there can't be any further response to failures.
> Stop/fence/standby move the resource off the local node, so failure
> handling is reset (there are 0 failures on the new node to begin with).
>
> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
> then migrate", but I can't think of a real-world situation where that
> makes sense,


really?

it is not uncommon to hear "i know its failed, but i dont want the cluster
to do anything until its _really_ failed"


> and it would be a significant re-implementation of "ignore"
> (which currently ignores the state of having failed, as opposed to a
> particular instance of failure).
>

agreed


>
> What the interface needs to express is: "If this operation fails,
> optionally try a soft recovery [always stop+start], but if <threshold> failures
> occur on the same node, proceed to a [configurable] hard recovery".
>
> And of course the interface will need to be different depending on how
> certain details are decided, e.g. whether any failures count toward <threshold>
> or just failures of one particular operation type, and whether the hard
> recovery type can vary depending on what operation failed.
>


Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Ken Gaillot
On 09/22/2016 12:58 PM, Kristoffer Grönlund wrote:
> Ken Gaillot  writes:
>>
>> "restart" is the only on-fail value that it makes sense to escalate.
>>
>> block/stop/fence/standby are final. Block means "don't touch the
>> resource again", so there can't be any further response to failures.
>> Stop/fence/standby move the resource off the local node, so failure
>> handling is reset (there are 0 failures on the new node to begin with).
> 
> Hrm. If a restart potentially migrates the resource to a different node,
> is the failcount reset then as well? If so, wouldn't that complicate the
> hard-fail-threshold variable too, since potentially, the resource could
> keep migrating between nodes and since the failcount is reset on each
> migration, it would never reach the hard-fail-threshold. (or am I
> missing something?)

The failure count is specific to each node. By "failure handling is
reset" I mean that when the resource moves to a different node, the
failure count of the original node no longer matters -- the new node's
failure count is now what matters.

A node's failure count is reset only when the user manually clears it,
or the node is rebooted. Also, resources may have a failure-timeout
configured, in which case the count will go down as failures expire.

So, a resource with on-fail=restart would never go back to a node where
it had previously reached the threshold, unless that node's fail count
were cleared in one of those ways.
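
As a toy model of the bookkeeping (much simplified -- pacemaker really
stores fail-counts as transient node attributes in the CIB status
section; names here are mine):

    import time

    failures = {}   # (node, resource) -> list of failure timestamps

    def record_failure(node, rsc):
        failures.setdefault((node, rsc), []).append(time.time())

    def fail_count(node, rsc, failure_timeout=None):
        stamps = failures.get((node, rsc), [])
        if failure_timeout:   # count drops as old failures expire
            stamps = [t for t in stamps
                      if t > time.time() - failure_timeout]
        return len(stamps)

    def cleanup(node, rsc):   # what a manual clear amounts to
        failures.pop((node, rsc), None)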



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Kristoffer Grönlund
Ken Gaillot  writes:
>
> "restart" is the only on-fail value that it makes sense to escalate.
>
> block/stop/fence/standby are final. Block means "don't touch the
> resource again", so there can't be any further response to failures.
> Stop/fence/standby move the resource off the local node, so failure
> handling is reset (there are 0 failures on the new node to begin with).

Hrm. If a restart potentially migrates the resource to a different node,
is the failcount reset then as well? If so, wouldn't that complicate the
hard-fail-threshold variable too, since potentially, the resource could
keep migrating between nodes and since the failcount is reset on each
migration, it would never reach the hard-fail-threshold. (or am I
missing something?)

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Ken Gaillot
On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>> Ken Gaillot  writes:
>>
>>> I'm not saying it's a bad idea, just that it's more complicated than it
>>> first sounds, so it's worth thinking through the implications.
>>
>> Thinking about it and looking at how complicated it gets, maybe what
>> you'd really want, to make it clearer for the user, is the ability to
>> explicitly configure the behavior, either globally or per-resource. So
>> instead of having to tweak a set of variables that interact in complex
>> ways, you'd configure something like rule expressions,
>>
>> <failure-handling>
>>   <restart attempts="3"/>
>>   <migrate/>
>>   <fence/>
>> </failure-handling>
>>
>> So, try to restart the service 3 times, if that fails migrate the
>> service, if it still fails, fence the node.
>>
>> (obviously the details and XML syntax are just an example)
>>
>> This would then replace on-fail, migration-threshold, etc.
> 
> I must admit that in previous emails in this thread, I wasn't able to
> follow during the first pass, which is not the case with this procedural
> (sequence-ordered) approach.  Though someone can argue it doesn't take
> type of operation into account, which might again open the door for
> non-obvious interactions.

"restart" is the only on-fail value that it makes sense to escalate.

block/stop/fence/standby are final. Block means "don't touch the
resource again", so there can't be any further response to failures.
Stop/fence/standby move the resource off the local node, so failure
handling is reset (there are 0 failures on the new node to begin with).

"Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
then migrate", but I can't think of a real-world situation where that
makes sense, and it would be a significant re-implementation of "ignore"
(which currently ignores the state of having failed, as opposed to a
particular instance of failure).

What the interface needs to express is: "If this operation fails,
optionally try a soft recovery [always stop+start], but if <threshold> failures
occur on the same node, proceed to a [configurable] hard recovery".

And of course the interface will need to be different depending on how
certain details are decided, e.g. whether any failures count toward <threshold>
or just failures of one particular operation type, and whether the hard
recovery type can vary depending on what operation failed.



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Ken Gaillot
On 09/22/2016 10:43 AM, Jan Pokorný wrote:
> On 21/09/16 10:51 +1000, Andrew Beekhof wrote:
>> On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot  wrote:
>>> Our first proposed approach would add a new hard-fail-threshold
>>> operation property. If specified, the cluster would first try restarting
>>> the resource on the same node,
>>
>>
>> Well, just as now, it would be _allowed_ to start on the same node, but
>> this is not guaranteed.
> 
> Yeah, I should attend doublethink classes to understand "the same
> node" term better:
> 
> https://github.com/ClusterLabs/pacemaker/pull/1146/commits/3b3fc1fd8f2c95d8ab757711cf096cf231f27941

"Same node" is really a shorthand to hand-wave some details, because
that's what will typically happen.

The exact behavior is: "If the fail-count on this node reaches <threshold>, ban
this node from running the resource."

That's not the same as *requiring* the resource to restart on the same
node before <threshold> is reached. As in any situation, Pacemaker will
re-evaluate the current state of the cluster, and choose the best node
to try starting the resource on.

For example, consider if the failed resource with on-fail=restart is
colocated with another resource with on-fail=standby that also failed,
then the whole node will be put in standby, and the original resource
will of course move away. It will be restarted, but the start will
happen on another node.

There are endless such scenarios, so "try restarting on the same node"
is not really accurate. To be accurate, I should have said something
like "try restarting without banning the node with the failure".



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Jan Pokorný
On 21/09/16 10:51 +1000, Andrew Beekhof wrote:
> On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot  wrote:
>> Our first proposed approach would add a new hard-fail-threshold
>> operation property. If specified, the cluster would first try restarting
>> the resource on the same node,
> 
> 
> Well, just as now, it would be _allowed_ to start on the same node, but
> this is not guaranteed.

Yeah, I should attend doublethink classes to understand "the same
node" term better:

https://github.com/ClusterLabs/pacemaker/pull/1146/commits/3b3fc1fd8f2c95d8ab757711cf096cf231f27941


-- 
Jan (Poki)




Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-22 Thread Kristoffer Grönlund
Ken Gaillot  writes:

> I'm not saying it's a bad idea, just that it's more complicated than it
> first sounds, so it's worth thinking through the implications.

Thinking about it and looking at how complicated it gets, maybe what
you'd really want, to make it clearer for the user, is the ability to
explicitly configure the behavior, either globally or per-resource. So
instead of having to tweak a set of variables that interact in complex
ways, you'd configure something like rule expressions,

<failure-handling>
  <restart attempts="3"/>
  <migrate/>
  <fence/>
</failure-handling>

So, try to restart the service 3 times, if that fails migrate the
service, if it still fails, fence the node.

(obviously the details and XML syntax are just an example)

This would then replace on-fail, migration-threshold, etc.

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-21 Thread Ken Gaillot
On 09/20/2016 07:51 PM, Andrew Beekhof wrote:
> 
> 
> On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot wrote:
> 
> Hi everybody,
> 
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
> 
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
> 
> I'd like to get your feedback on two alternative approaches we're
> considering.
> 
> ###
> 
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node, 
> 
> 
> Well, just as now, it would be _allowed_ to start on the same node, but
> this is not guaranteed.
>  
> 
> before doing the on-fail handling.
> 
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3
> failures.
> 
> 
> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.
> 
> 
> I think if hard-fail-threshold is per-op, then only failures of that
> operation should count.
>  
> 
> 
> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
> 
> ###
> 
> The second proposed approach would add a new on-restart-fail resource
> property.
> 
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
> 
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban". 
> 
> 
> I do wish you well tracking "is this a restart" across demote -> stop ->
> start -> promote in 4 different transitions :-)
>  
> 
> 
> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and
> on-restart-fail=fence.
> 
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
> 
> ###
> 
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.
> 
> 
> More complex to implement or more complex to configure?

I was thinking more complex in behavior, so perhaps harder to follow /
expect.

For example, "After two start failures, fence this node; after three
promote failures, put the node in standby; but if a monitor failure is
the third operation failure of any type, then move the resource to
another node."

Granted, someone would have to inflict that on themselves :) but another
sysadmin / support tech / etc. who had to deal with the config later
might have trouble following it.

To keep the current default behavior, the default would be complicated,
too: "1 for start and stop operations, and 0 for other operations" where
"0 is equivalent to 1 except when on-fail=restart, in which case
migration-threshold will be used instead".
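
Spelled out as code, that default would be something like (a sketch of
the sentence above, nothing more):

    def default_hard_fail_threshold(op, on_fail, migration_threshold):
        raw = 1 if op in ("start", "stop") else 0
        if raw == 0 and on_fail == "restart":
            return migration_threshold   # defer to migration-threshold
        return max(raw, 1)               # otherwise 0 behaves like 1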

And then add to that tracking fail-count per node+resource+operation
combination, with the associated status output and cleanup options.
"crm_mon -f" currently shows failures like:

* Node node1:
   rsc1: migration-threshold=3 fail-count=1 last-failure='Wed Sep 21
15:12:59 2016'

What should that look like with per-op thresholds and fail-counts?

I'm not saying it's a bad idea, just that it's more complicated than it
first sounds, so it's worth thinking through the implications.

> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.

Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-21 Thread Ken Gaillot
On 09/21/2016 02:23 AM, Kristoffer Grönlund wrote:
> First of all, is there a use case for when fence-after-3-failures is a
> useful behavior? I seem to recall some case where someone expected that
> to be the behavior and were surprised by how pacemaker works, but that
> problem wouldn't be helped by adding another option for them not to know
> about.

I think I've most often encountered it with ignore/block. Sometimes
users have one particular service that's buggy and not really important,
so they ignore errors (or block). But they would like to try restarting
it a few times first.

I think fence-after-3-failures would make as much sense as
fence-immediately. The idea behind restarting a few times then taking a
more drastic action is that restarting is for the case where the service
crashed or is in a buggy state, and if that doesn't work, maybe
something's wrong with the node.



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-21 Thread Klaus Wenninger
On 09/20/2016 10:25 PM, Ken Gaillot wrote:
> Hi everybody,
>
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
>
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
>
> I'd like to get your feedback on two alternative approaches we're
> considering.
>
> ###
>
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node, before doing the on-fail handling.
>
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.
>
> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.
>
> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
Having something like reset-failcount-on-success might
be interesting here as well (as an operation property, per resource,
or both).
>
> ###
>
> The second proposed approach would add a new on-restart-fail resource
> property.
>
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
>
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban".
>
> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and on-restart-fail=fence.
>
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
>
> ###
>
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.
>
> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.
As said, both approaches have their pros and cons, and you
can probably invent more that seem especially suitable
for certain cases.

A different approach would be to gather a couple of statistics
(e.g. restart failures globally & per operation and ...) and leave
it to the RA to derive a return-value from that info and what
it knows itself.

I know we had a similar discussion already ... and this thread
is probably a result of that ...

We could introduce a new RA operation for getting the answer, or
the new statistics could simply be another input to start.
A new operation might be appealing as it could - on demand -
be replaced by calling a custom script - no need to edit RAs; one
RA, with different custom scripts for different resources using the same RA ...
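
Just to make it tangible, a sketch of what such a hook might look like
(pure speculation -- path, environment contract, and return values are
all invented):

    import subprocess

    def decide_recovery(rsc, stats):
        """Ask an optional per-resource script what to do; 'stats' are
        the failure statistics the cluster gathered (per op etc.)."""
        script = "/etc/pacemaker/recovery.d/" + rsc      # hypothetical
        env = {"FAIL_" + op.upper(): str(n) for op, n in stats.items()}
        try:
            res = subprocess.run([script], env=env, timeout=10,
                                 capture_output=True, text=True)
            return res.stdout.strip() or "restart"       # e.g. "ignore"
        except (OSError, subprocess.TimeoutExpired):
            return "restart"                             # default policy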

I know that using scripts makes it impossible to derive the
cluster behavior from just seeing what is configured in the CIB,
and that the scripts have to be kept in sync across the nodes, ...

This could as well be an addition to one of the suggestions above.
Gathering additional statistics will probably be needed for them
anyway.



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-21 Thread Kristoffer Grönlund
Kristoffer Grönlund  writes:

> If implementing the first option, I would prefer to keep the behavior of
> migration-threshold of counting all failures, not just
> monitors. Otherwise there would be two closely related thresholds with
> subtly divergent behavior, which seems confusing indeed.

I see now that the proposed threshold would be per-operation, in which
case I completely reverse opinions and think that a per-operation
configured threshold should apply to instances of that operation
only. :)

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-21 Thread Kristoffer Grönlund
Ken Gaillot  writes:

> Hi everybody,
>
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
>
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
>
> I'd like to get your feedback on two alternative approaches we're
> considering.
>
> ###
>
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node, before doing the on-fail handling.
>
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.
>
> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.
>
> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
>
> ###
>
> The second proposed approach would add a new on-restart-fail resource
> property.
>
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
>
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban".
>
> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and on-restart-fail=fence.
>
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
>
> ###
>
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.
>
> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.

Apologies for quoting the entire mail, but I had a hard time picking out
which part was more relevant when replying.

First of all, is there a use case for when fence-after-3-failures is a
useful behavior? I seem to recall some case where someone expected that
to be the behavior and were surprised by how pacemaker works, but that
problem wouldn't be helped by adding another option for them not to know
about.

My second comment would be that to me, the first option sounds less
complex, but then I don't know the internals of pacemaker that
well. Having a special case on-fail for restarts seems inelegant,
somehow.

If implementing the first option, I would prefer to keep the behavior of
migration-threshold of counting all failures, not just
monitors. Otherwise there would be two closely related thresholds with
subtly divergent behavior, which seems confusing indeed.

Cheers,
Kristoffer

> -- 
> Ken Gaillot 
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-09-20 Thread Andrew Beekhof
On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot  wrote:

> Hi everybody,
>
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
>
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
>
> I'd like to get your feedback on two alternative approaches we're
> considering.
>
> ###
>
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node,


Well, just as now, it would be _allowed_ to start on the same node, but
this is not guaranteed.


> before doing the on-fail handling.
>
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3
> failures.


> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.
>

I think if hard-fail-threshold is per-op, then only failures of that
operation should count.


>
> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
>
> ###
>
> The second proposed approach would add a new on-restart-fail resource
> property.
>
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
>
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban".


I do wish you well tracking "is this a restart" across demote -> stop ->
start -> promote in 4 different transitions :-)


>
> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and on-restart-fail=fence.
>
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
>
> ###
>
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.
>

More complex to implement or more complex to configure?


>
> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.
> --
> Ken Gaillot 
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org