On Wed, Oct 5, 2016 at 7:03 AM, Ken Gaillot <kgail...@redhat.com> wrote:
> On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
> >> Take a look at all of nagios' options for deciding when a failure
> >> becomes "real".
> >
> > I used to take a very hard line on this: if you don't want the cluster
> > to do anything about an error, don't tell us about it.
> > However I'm slowly changing my position... the reality is that many
> > people do want a heads up in advance and we have been forcing that
> > policy (when does an error become real) into the agents where one size
> > must fit all.
> >
> > So I'm now generally in favour of having the PE handle this "somehow".
>
> Nagios is a useful comparison:
>
> check_interval - like pacemaker's monitor interval
>
> retry_interval - if a check returns failure, switch to this interval
> (i.e. check more frequently when trying to decide whether it's a "hard"
> failure)
>
> max_check_attempts - if a check fails this many times in a row, it's a
> hard failure. Before this is reached, it's considered a soft failure.
> Nagios will call event handlers (comparable to pacemaker's alert agents)
> for both soft and hard failures (distinguishing the two). A service is
> also considered to have a "hard failure" if its host goes down.
>
> high_flap_threshold/low_flap_threshold - a service is considered to be
> flapping when its percentage of state changes (ok <-> not ok) in the
> last 21 checks (= max. 20 state changes) reaches high_flap_threshold,
> and stable again once the percentage drops to low_flap_threshold. To put
> it another way, a service that passes every monitor is 0% flapping, and
> a service that fails every other monitor is 100% flapping. With these,
> even if a service never reaches max_check_attempts failures in a row,
> an alert can be sent if it's repeatedly failing and recovering.

Makes sense. Since we're overhauling this functionality anyway, do you
think we need to add an equivalent of retry_interval too?

> >> If you clear failures after a success, you can't detect/recover a
> >> resource that is flapping.
> >
> > Ah, but you can if the thing you're clearing only applies to other
> > failures of the same action.
> > A completed start doesn't clear a previously failed monitor.
>
> Nope -- a monitor can alternately succeed and fail repeatedly, and that
> indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.
>
> >> It only makes sense to escalate from ignore -> restart -> hard, so
> >> maybe something like:
> >>
> >>    op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
> >>
> > I would favour something more concrete than 'soft' and 'hard' here.
> > Do they have a sufficiently obvious meaning outside of us developers?
> >
> > Perhaps (with or without a "failures-" prefix):
> >
> >    ignore-count
> >    recover-count
> >    escalation-policy
>
> I think the "soft" vs "hard" terminology is somewhat familiar to
> sysadmins -- there's at least nagios, email (SPF failures and bounces),
> and ECC RAM. But throwing "ignore" into the mix does confuse things.
>
> How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban

I could live with that :-)
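For reference, a minimal Python sketch of the flap-percentage
calculation Ken describes above: percentage of state changes across the
last 21 check results (max. 20 transitions), with flapping starting at
high_flap_threshold and ending at low_flap_threshold. Nagios' real
implementation also weights recent state changes more heavily than old
ones; that's omitted here, so treat this as illustration only.

    from collections import deque

    class FlapDetector:
        def __init__(self, high_threshold=50.0, low_threshold=25.0,
                     window=21):
            self.high = high_threshold   # start flapping at/above this %
            self.low = low_threshold     # stop flapping at/below this %
            self.window = window         # number of check results kept
            self.results = deque(maxlen=window)  # True = ok, False = fail
            self.flapping = False

        def record(self, ok):
            """Record one monitor result; return current flap state."""
            self.results.append(ok)
            history = list(self.results)
            changes = sum(1 for prev, cur in zip(history, history[1:])
                          if prev != cur)
            # max. 20 state changes in a 21-check window
            percent = 100.0 * changes / (self.window - 1)
            if not self.flapping and percent >= self.high:
                self.flapping = True    # would trigger a "flapping" alert
            elif self.flapping and percent <= self.low:
                self.flapping = False   # stable again
            return self.flapping

This reproduces the two extremes in Ken's description: a service that
passes every monitor stays at 0%, and one that alternates ok/fail every
check reaches 20 transitions out of 20, i.e. 100%.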
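And a hypothetical sketch of the escalation order being proposed
(ignore -> restart -> hard). The option names max-fail-ignore,
max-fail-restart, and fail-escalation are still under discussion above,
so both the names and the exact counting semantics here are assumptions:

    def failure_response(fail_count,
                         max_fail_ignore=3,
                         max_fail_restart=2,
                         fail_escalation="ban"):
        """Map the Nth failure of a single action to a recovery step."""
        if fail_count <= max_fail_ignore:
            return "ignore"    # soft: just log/alert, no recovery
        if fail_count <= max_fail_ignore + max_fail_restart:
            return "restart"   # try recovering the resource in place
        return fail_escalation # hard: e.g. ban the resource from the node

With the example values, failures 1-3 of a monitor are ignored, 4-5
cause a restart, and the 6th escalates to "ban". Per the exchange above,
the count would be tracked per action, so a completed start would not
clear a previously failed monitor's count.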