On Fri, Sep 30, 2016 at 10:28 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
>> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>>> then migrate", but I can't think of a real-world situation where that
>>>> makes sense,
>>>>
>>>> really?
>>>>
>>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>>> cluster to do anything until it's _really_ failed"
>>>
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> Nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>>
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>>
>> It would be a major change, but perhaps it should be "in a row", and
>> successfully performing the action would clear the count.
>> It's entirely possible that the current behaviour is like that because
>> I wasn't smart enough to implement anything else at the time :-)
>
> Or you were smart enough to realize what a can of worms it is. :)
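To make the difference between the two counting schemes concrete, here is a
rough sketch (not Pacemaker code, just illustrative shell with a made-up
sequence of monitor exit statuses, 0 = success, non-zero = failure). With
"in a row" counting, where a success clears the count, a flapping resource
never accumulates failures; with "since last cleared" counting, every
failure counts toward the threshold until an explicit cleanup:

```shell
#!/bin/sh
# Illustrative only: compare "failures in a row" vs "failures since
# the count was last cleared" over a sequence of monitor results.

count_in_a_row() {
    # Longest run of consecutive failures; a success resets the run.
    max=0; cur=0
    for rc in "$@"; do
        if [ "$rc" -ne 0 ]; then
            cur=$((cur + 1))
            if [ "$cur" -gt "$max" ]; then max=$cur; fi
        else
            cur=0   # success clears the count
        fi
    done
    echo "$max"
}

count_since_cleared() {
    # Total failures since the last explicit cleanup (current behaviour).
    total=0
    for rc in "$@"; do
        if [ "$rc" -ne 0 ]; then total=$((total + 1)); fi
    done
    echo "$total"
}

# A flapping resource: fail, ok, fail, ok, fail, ok
count_in_a_row      1 0 1 0 1 0   # -> 1 (would never escalate)
count_since_cleared 1 0 1 0 1 0   # -> 3 (would hit migration-threshold=3)
```

This is exactly the flapping concern raised next: resetting on success
means an alternating fail/recover pattern is invisible to any threshold.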
So you're saying two dumbs makes a smart? :-)

> Take a look at all of nagios' options for deciding when a failure
> becomes "real".

I used to take a very hard line on this: if you don't want the cluster
to do anything about an error, don't tell us about it.

However, I'm slowly changing my position... the reality is that many
people do want a heads-up in advance, and we have been forcing that
policy (when does an error become real) into the agents, where one size
must fit all. So I'm now generally in favour of having the PE handle
this "somehow".

> If you clear failures after a success, you can't detect/recover a
> resource that is flapping.

Ah, but you can if the thing you're clearing only applies to other
failures of the same action. A completed start doesn't clear a
previously failed monitor.

>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
>>
>> Not disagreeing
>
> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> something like:
>
>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

The other idea I had was to create some new return codes:
PCMK_OCF_ERR_BAN, PCMK_OCF_ERR_FENCE, etc. I.e. make the internal
mapping of return codes like PCMK_OCF_NOT_CONFIGURED and
PCMK_OCF_DEGRADED to hard/soft/ignore recovery logic into something
available to the agent.

To use your example above, return PCMK_OCF_DEGRADED for the first 3
monitor failures, PCMK_OCF_ERR_RESTART for the next two, and
PCMK_OCF_ERR_BAN for the last.

But the more I think about it, the less I like it.
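For concreteness, an agent's monitor path under that scheme might look
something like the sketch below. To be clear about what is hypothetical:
PCMK_OCF_ERR_RESTART and PCMK_OCF_ERR_BAN are the codes being proposed
here, not codes that exist; their numeric values are invented (only
PCMK_OCF_DEGRADED is a real Pacemaker code); and the agent has to persist
its own failure count in a state file, which is exactly one of the
objections listed next.

```shell
#!/bin/sh
# Sketch of the proposed return-code escalation. Hypothetical codes
# and values; only PCMK_OCF_DEGRADED exists in Pacemaker today.
PCMK_OCF_DEGRADED=190        # real: "success, but degraded"
PCMK_OCF_ERR_RESTART=201     # hypothetical: request soft recovery
PCMK_OCF_ERR_BAN=202         # hypothetical: request a ban on this node

# Agent-side failure counter, persisted across monitor invocations.
STATEFILE="${STATEFILE:-/tmp/myagent.failcount}"

monitor_escalated() {
    count=$(cat "$STATEFILE" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$STATEFILE"
    if [ "$count" -le 3 ]; then
        return "$PCMK_OCF_DEGRADED"      # ignore the first 3 failures
    elif [ "$count" -le 5 ]; then
        return "$PCMK_OCF_ERR_RESTART"   # soft recovery for the next 2
    else
        return "$PCMK_OCF_ERR_BAN"       # then ban this node
    fi
}
```

Even this toy version shows the policy (3 / 2 / ban) ending up hard-coded
in the agent rather than in the cluster configuration.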
- We lose precision about what the actual error was
- We're pushing too much user config/policy into the agent (every agent
  would end up with equivalents of 'ignore-fail', 'soft-fail', and
  'on-hard-fail')
- We might need the agent to know about the fencing config
  (enabled/disabled/valid)
- It forces the agent to track the number of operation failures

So I think I'm just mentioning it for completeness, and in case it
prompts a good idea in someone else.

> To express current default behavior:
>
>   op start ignore-fail=0 soft-fail=0 on-hard-fail=ban

I would favour something more concrete than 'soft' and 'hard' here. Do
they have a sufficiently obvious meaning outside of us developers?

Perhaps (with or without a "failures-" prefix):

   ignore-count
   recover-count
   escalation-policy

>   op stop ignore-fail=0 soft-fail=0 on-hard-fail=fence
>   op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban
>
> on-fail, migration-threshold, and start-failure-is-fatal would be
> deprecated (and would be easy to map to the new parameters).
>
> I'd avoid the hassles of counting failures "in a row", and stick with
> counting failures since the last cleanup.

sure

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org