Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Andrei Borzenkov
On 15.10.2021 09:24, Klaus Wenninger wrote:
> Main pain-point here is that ping-RA allows us to configure the count of
> pings sent, but it is just using the exit-value from ping that becomes
> negative already when one of the answers is missing.

Looking closer, this is not true. This is the behavior of ping only when
the deadline option (-w) is given, which the ping RA does not use by
default. Otherwise ping fails only if no reply at all is received.

> This is why with
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> I chose to both give the number of packets sent + number received necessary
> to be assumed as alive.

That is of course more flexible, but I am not sure how useful it is
in practice. Can you describe a real-life scenario where it matters
whether you got 3 or 4 replies out of 5 when pinging a *single* server?
For multiple servers you already have the score option.
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Klaus Wenninger
On Fri, Oct 15, 2021 at 12:01 PM Andrei Borzenkov wrote:

> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:
>
> > Main pain-point here is that ping-RA allows us to configure the count of
> > pings sent, but it is just using the exit-value from ping that becomes
> > negative already when one of the answers is missing.
>
> Use fping instead? Which is supported by ping RA and should behave
> exactly as needed - report host alive if at least one reply was
> received.
>
I like fping, but since it has some reputation as a DoS tool, not everybody
might be fine with installing it.
And we would still have something that tolerates a packet loss of 50% or
more, which might not be acceptable for qualifying a host as reachable.
But of course we can still tweak it, even with the current implementation,
to tolerate, let's say, a loss of up to 20% by listing the same host 5 times
and setting the limit to 4.
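
To make that arithmetic concrete, here is a minimal sketch of the scoring as it is described in this thread: one point (times a multiplier) per host_list entry that answered, and a failure once the score drops below the limit. The function names are made up for illustration and this is not the resource agent's actual code.

```python
def ping_score(reachable, multiplier=1):
    """Score: one point (times multiplier) per host_list entry that answered."""
    return sum(1 for ok in reachable if ok) * multiplier


def monitor_ok(reachable, failure_score, multiplier=1):
    """Report failure only when the score drops below failure_score."""
    return ping_score(reachable, multiplier) >= failure_score


# The same host listed five times with a limit (failure_score) of 4:
one_lost = [True, True, False, True, True]    # 20% loss, still acceptable
two_lost = [True, False, False, True, True]   # 40% loss, counts as failure

print(monitor_ok(one_lost, failure_score=4))
print(monitor_ok(two_lost, failure_score=4))
```

With five entries and a limit of 4, exactly one lost probe is tolerated and a second one trips the limit.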

>
> Maybe when using ping RA could also parse ping output instead of
> relying on exit status.
>
as the fence-agent referenced is doing ;-)
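
A minimal sketch of that idea, not the fence agent's actual code: read the number of replies from ping's summary line and compare it with a required minimum. The summary format assumed here is the iputils-style "N packets transmitted, M received" line; other ping implementations word it differently, and the names below are hypothetical.

```python
import re

# Assumption: iputils-style summary line. Other implementations differ.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def replies_received(ping_output):
    """Return (sent, received) from ping's summary line, or None if absent."""
    match = SUMMARY.search(ping_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_alive(ping_output, min_received):
    """Treat the host as alive when at least min_received replies came back."""
    counts = replies_received(ping_output)
    return counts is not None and counts[1] >= min_received


sample = "5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
print(replies_received(sample), host_alive(sample, min_received=4))
```

Setting min_received equal to the number of packets sent would correspond to the strict "every reply must arrive" behavior.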



[ClusterLabs] Antw: [EXT] Re: Trying to understand dampening (ping)

2021-10-15 Thread Ulrich Windl
>>> Andrei Borzenkov wrote on 15.10.2021 at 12:00:
> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:
>
>> Main pain-point here is that ping-RA allows us to configure the count of
>> pings sent, but it is just using the exit-value from ping that becomes
>> negative already when one of the answers is missing.

The manual says:
   If ping does not receive any reply packets at all it will exit with
   code 1. If a packet count and deadline are both specified, and fewer
   than count packets are received by the time the deadline has arrived,
   it will also exit with code 1. On other error it exits with code 2.
   Otherwise it exits with code 0. This makes it possible to use the exit
   code to see if a host is alive or not.

That's odd: The higher the ping count, the more likely an error exit is.
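
The rules quoted from the manual can be written down as a small model. This is only a sketch: it leaves out the "other error" case (code 2), and the probability function is a crude simplification that treats every probe as independently lost with the same probability. It suggests that a higher count makes an error exit more likely only when a deadline is given; without one, more pings make a failure less likely.

```python
def ping_exit_code(count, received, deadline_given):
    """Exit status per the manual text quoted above (error exit 2 omitted)."""
    if received == 0:
        return 1  # no reply at all
    if deadline_given and received < count:
        return 1  # count and deadline given, fewer than count replies
    return 0


def failure_probability(count, loss, deadline_given):
    """Chance of a non-zero exit, assuming independent loss per probe."""
    if deadline_given:
        return 1 - (1 - loss) ** count  # any single lost reply fails
    return loss ** count                # only losing every reply fails


for n in (1, 5, 10):
    print(n, failure_probability(n, 0.1, True), failure_probability(n, 0.1, False))
```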

> 
> Use fping instead? Which is supported by ping RA and should behave
> exactly as needed - report host alive if at least one reply was
> received.
> 
> Maybe when using ping RA could also parse ping output instead of
> relying on exit status.


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Andrei Borzenkov
On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:

> Main pain-point here is that ping-RA allows us to configure the count of
> pings sent, but it is just using the exit-value from ping that becomes
> negative already when one of the answers is missing.

Use fping instead? It is supported by the ping RA and should behave
exactly as needed: report the host alive if at least one reply was
received.

Maybe the ping RA could also parse the ping output instead of
relying on the exit status.


[ClusterLabs] Antw: [EXT] Re: Trying to understand dampening (ping)

2021-10-15 Thread Ulrich Windl
Oh well, pingd is interesting:
My guess is that it was originally designed to check the connectivity of an
interface by pinging some hosts, but some people seem to use it to check the
reachability of a specific host.
Regardless of the number of packets being sent, some non-binary behavior would
be desirable: instead of setting the attribute to 0 or 100 (for example), the
value could _range_ from 0 to 1000, indicating the quality of the
reachability. As said before, some moving average or exponential average,
maybe.
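
As a rough illustration of what such an exponential average could look like (this is not something the ping agent does today, and the function name, the smoothing factor alpha and the 0..1000 scale are assumptions taken from the suggestion above):

```python
def update_quality(previous, replies, sent, alpha=0.3, scale=1000):
    """Blend the latest reply ratio into a running quality value in 0..scale."""
    latest = scale * replies / sent
    return round(alpha * latest + (1 - alpha) * previous)


# Hypothetical series of monitor runs, 5 pings each.
quality = 1000
for replies in (5, 4, 0, 5):
    quality = update_quality(quality, replies, sent=5)
    print(replies, quality)
```

A single bad run lowers the value without dropping it to zero, and it recovers gradually once replies come back.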

When trying to find out more about pingd, I found this interesting thing in 
SLES15 SP2 (resource-agents-4.4.0+git57.70549516-3.36.1.x86_64):
"crm ra info pingd" reports:
---
Monitors connectivity to specific hosts or
IP addresses ("ping nodes") (deprecated) (ocf:heartbeat:pingd)

Deprecation warning: This agent is deprecated and may be removed from
a future release. See the ocf:pacemaker:pingd resource agent for a
supported alternative. --
This is a pingd Resource Agent.
...
---

However when I use the recommended "crm ra info ocf:pacemaker:pingd", I also 
get:
---
pingd resource agent (ocf:pacemaker:pingd)

This agent (ocf:pacemaker:pingd) is deprecated and broken, and has been
replaced by the more reliable ocf:pacemaker:ping. It records (in the CIB)
the current number of ping nodes (specified in the 'host_list' parameter)
a cluster node can connect to.
---
The final ocf:pacemaker:ping still has the same poor description:
---
dampen (integer, [5s]): Dampening interval
The time to wait (dampening) further changes occur
---

(IMHO "wait ... _for_ further changes _to_ occur" would be a halfway correct
sentence)

Regards,
Ulrich


>>> Klaus Wenninger wrote on 15.10.2021 at 08:24:
> On Thu, Oct 14, 2021 at 10:51 PM martin doc wrote:
> 
>>
>>
>> From: Andrei Borzenkov, Friday, 15 October 2021 4:59 AM
>> ...
>> > Dampening defines delay before attributes are committed to CIB.
>> > Private attributes are never ever written into CIB, so dampening
>> > makes no sense here. Private attributes are managed by attrd
>> > itself and you see the latest value.
>>
>> > If you change transient attribute (without -p option) value you
>> > will see different values reported by
>>
>> > attrd_updater -n my_ping -Q
>>
>> > and
>>
>> > cibadmin -Q -A "//nvpair[@name='my_ping']"
>>
>> > until dampening timeout expires.
>>
>> > This applies even to deleting attribute.
>>
>> Ok, now I understand what the dampen function does.
>>
>> If I understand this correctly then this probably makes every documented
>> example of using ocf:pacemaker:ping with a colocation statement wrong
>> because the only way to see the effect of dampen is to use a rule that
>> references the value of pingd directly. That or the script for ping has a
>> major flaw with respect to dampen.
>>
> 
> As we've already tried to explain, purpose of dampening is not
> implementation of any kind of resilience against loss of a certain
> percentage of packets or anything similar.
>
> Basic idea is to have more than one ping host so that - given failure_score
> is low enough - there is gonna be a certain resilience against packet loss.
> If your number of ping-hosts isn't large enough you might play with adding
> them in multiple times to get some kind of resilience.
> But I agree that this one out of two behavior is probably too resilient for
> most cases and thus there might be room for improvement.
> Main pain-point here is that ping-RA allows us to configure the count of
> pings sent, but it is just using the exit-value from ping that becomes
> negative already when one of the answers is missing.
> This is why with
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> I chose to both give the number of packets sent + number received necessary
> to be assumed as alive. If we assume the latter, when not given at all, as
> equal to the number of packets sent we would preserve unchanged behavior for
> existent configurations.
> 
> Klaus
> 
> 
>>
>> That is when I do this:
>>
>> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 \
>>     failure_score=1
>> pcs resource create database ocf:heartbeat:pgsql
>> pcs resource group add pgrp myPing database
>>
>> PCS will move everything to a new node if there is even 1 ping failure
>> because monitor in ping doesn't look at the dampened value, only at the
>> value returned immediately.
>>
>> The same is true with colocation statements - if a constraint is made with
>> a ping resource without using a rule that references pingd, then the dampen
>> behaviour is ignored completely.
>>
>> Is the ping'er missing something that does this:
>>
>> score=`cibadmin -Q -A "//nvpair[@name='ping']" |
>>     sed -e 's/.*value="\([^"]*\)".*/\1/'`
>>
>> before it checks if $score is less than $OCF_RESKEY_failure_score?
>>
>> Thanks
>>