Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-17 Thread martin doc
Some other notes... I really wish there was better documentation for the 
individual resources. from the clusterlabs website, I cannot find a page that 
describes "ping" in any detail.

There's been some suggestions about using the same host more than once. I 
suspect that only really works if you disable fping (but I haven't tried.)

The description for timeout is "how long, in seconds, to wait before declaring 
a ping lost". That kind of sounds like it means that each ping is allowed to 
take "<timeout> seconds", but in the fping case it really means "the total time 
to wait, in seconds, before declaring the ping monitor has failed." I suppose 
it depends on how you interpret "a ping": does it mean one instance of the ping 
command or one ICMP echo?

From the script, the timeout value allowed per ping is actually "timeout * 
1000 / attempts". That's for fping. If fping isn't used, it's "timeout" per 
instance of ping being run.

As an example, using timeout=5,attempts=5 with fping results in fping returning 
after a maximum of 6 seconds, whereas with ping, it can take 9-10 seconds to 
return. To get equivalent behaviour to ping with fping, there should be a "-i 
1000" added to its command line. This behaviour difference is very significant 
because a disruption to the network for 1 second can make fping report a 
failure when ping wouldn't. Unless you dig into the source code, and can 
comprehend the differences, there's no obvious basis for choosing one over 
the other.
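
A minimal sketch of the per-ping arithmetic quoted above for the fping case. The function name is made up for illustration, and the integer division is an assumption about how a shell script would evaluate the expression; it is not copied from the resource agent.

```shell
# Per-ping timeout in milliseconds for the fping case, as described above:
# timeout (seconds) * 1000 / attempts, using integer arithmetic.
# fping_per_ping_timeout_ms is a hypothetical helper name.
fping_per_ping_timeout_ms() {
    timeout_s=$1
    attempts=$2
    echo $(( timeout_s * 1000 / attempts ))
}

fping_per_ping_timeout_ms 5 5    # prints 1000
fping_per_ping_timeout_ms 5 3    # prints 1666
```

With timeout=5 and attempts=5 each echo gets one second, which is why a one-second disruption is enough to lose a ping.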

The ping resource is very important and needs much better documentation, and 
perhaps should be more than one resource ... if only there wasn't the problem 
of backwards compatibility.

____
From: Users  on behalf of martin doc 

Sent: Monday, 18 October 2021 5:35 AM
To: users@clusterlabs.org 
Subject: Re: [ClusterLabs] Trying to understand dampening (ping)


The use case is to detect if the network path to the default gateway has failed 
in one of 3 hosts. The use of "ping" covers cable failure, SFP failure, or some 
other sort of failure that is local to a single host.

In none of the reading I did on the web was there ever a sentence that said 
"dampen is not active if failure_score is not 0."

Given the incompatibility between the two attributes, should both coexist on 
the same resource?

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-17 Thread martin doc

The use case is to detect if the network path to the default gateway has failed 
in one of 3 hosts. The use of "ping" covers cable failure, SFP failure, or some 
other sort of failure that is local to a single host.

In none of the reading I did on the web was there ever a sentence that said 
"dampen is not active if failure_score is not 0."

Given the incompatibility between the two attributes, should both coexist on 
the same resource?



Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-16 Thread Andrei Borzenkov
On 15.10.2021 13:24, Klaus Wenninger wrote:
> On Fri, Oct 15, 2021 at 12:01 PM Andrei Borzenkov 
> wrote:
> 
>> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger 
>> wrote:
>>
>>> Main pain-point here is that ping-RA allows us to configure the count of
>> pings sent, but it
>>> is just using the exit-value from ping that becomes negative already
>> when one of the
>>> answers is missing.
>>
>> Use fping instead? Which is supported by ping RA and should behave
>> exactly as needed - report host alive if at least one reply was
>> received.
>>
> I like fping but it having some reputation as DOS tool not everybody might
> be fine installing it.
> And we will still have something that would be fine with at least a 50%
> packet
> loss, which as well might not be acceptable to qualify a host as reachable.
> But of course we still can tweak it even with the current implementation to
> let's say a loss <20% by giving the same host 5 times and having
> the limit set to 4.
> 
>>
>> Maybe when using ping RA could also parse ping output instead of
>> relying on exit status.
>>
> as the fence-agent referenced is doing ;-)
> 

Actually simply having inner loop from 1 to $OCF_RESKEY_attempts with
"ping -c 1" is more simple and portable. But I am not convinced it is
worth the troubles.
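
The inner loop suggested above could look roughly like this. It is only a sketch, not a patch to the ping RA: host_alive and the PING_CMD override are hypothetical names, the latter added so the loop can be exercised without a network.

```shell
# Report a host as alive if at least one of N single-packet pings succeeds.
# PING_CMD is a hypothetical hook; it defaults to the real "ping".
host_alive() {
    host=$1
    attempts=${2:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # "ping -c 1" sends one echo request and exits 0 if a reply arrived
        if ${PING_CMD:-ping} -c 1 "$host" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Example with a stub that never answers, so no packets are sent:
no_reply() { return 1; }
PING_CMD=no_reply
if host_alive 192.168.1.1 3; then echo "alive"; else echo "unreachable"; fi
```

In the RA the attempt count would come from $OCF_RESKEY_attempts.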


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-16 Thread Andrei Borzenkov
On 14.10.2021 23:51, martin doc wrote:
> 
> 
> 
> From: Andrei Borzenkov ,  Friday, 15 October 2021 4:59 AM
> ...
>> Dampening defines delay before attributes are committed to CIB.
>> Private attributes are never ever written into CIB, so dampening
>> makes no sense here. Private attributes are managed by attrd
>> itself and you see the latest value.
> 
>> If you change transient attribute (without -p option) value you
>> will see different values reported by
> 
>> attrd_updater -n my_ping -Q
> 
>> and
> 
>> cibadmin -Q -A "//nvpair[@name='my_ping']"
> 
>> until dampening timeout expires.
> 
>> This applies even to deleting attribute.
> 
> Ok, now I understand what the dampen function does.
> 
> If I understand this correctly then this probably makes every documented 
> example of using ocf:pacemaker:ping with a colocation statement wrong because 
> the only way to see the effect of dampen is to use a rule that references the 
> value of pingd directly. That or the script for ping has a major flaw with 
> respect to dampen.
> 
> That is when I do this:
> 
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 
> failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
> 
> PCS will move everything to a new node if there is even 1 ping failure 
> because monitor in ping doesn't look at the dampened value, only the value of 
> the immediate returned value.
> 

failure_score is number of hosts that must answer ping during single
monitor invocation. If you have single host, the only meaningful value is 1.

If you want to smooth out single ping failure, use "attempts" parameter.
It defaults to 3, which means every monitor operation does 3 pings and
fails only if all of them fail. So it already does what you want without
any special configuration.
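
A rough model of the failure_score check, for illustration only. It assumes the monitor computes score as the number of hosts that answered times the RA's multiplier parameter and fails when that score is below failure_score; the function name is hypothetical and the agent script should be consulted for the real logic.

```shell
# Assumed model of the monitor check: score = hosts that answered * multiplier,
# and the monitor fails when score is below failure_score.
# ping_monitor_ok is a hypothetical helper name.
ping_monitor_ok() {
    answered=$1
    failure_score=$2
    multiplier=${3:-1}
    score=$(( answered * multiplier ))
    [ "$score" -ge "$failure_score" ]
}

# Same host listed 5 times with failure_score=4: one lost reply is tolerated.
if ping_monitor_ok 4 4; then echo "monitor ok"; else echo "monitor failed"; fi
if ping_monitor_ok 3 4; then echo "monitor ok"; else echo "monitor failed"; fi
```

Under this model a single host with failure_score=1 fails the monitor as soon as that host does not answer at all within the configured attempts.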

> The same is true with colocation statements - if a constraint is made with a 
> ping resource without using a rule that references pingd then  the dampen 
> behaviour is ignored completely.
> 

You completely misunderstand what dampen is used for. It is used to wait
for multiple nodes to record results of their monitor actions so when
policy engine is invoked it (hopefully) has final picture. It has
nothing to do with individual ping results on any single node.

> Is the ping'er missing something that does this:
> 
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 
> 's/.*value="\([^"]*\)".*/\1/'`
> 

The only effect it will have will be using results of previous monitor
invocation instead of current one.

You cannot use dampening to smooth out ping results. You will still
have only one final value recorded, so in the sequence success, success,
failure it will be failure.

To do anything more sophisticated you need to actually record every
individual ping result. This is far more involved and I still miss real
use case.


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Andrei Borzenkov
On 15.10.2021 09:24, Klaus Wenninger wrote:
> Main pain-point here is that ping-RA allows us to configure the count of
> pings sent, but it
> is just using the exit-value from ping that becomes negative already when
> one of the
> answers is missing.

Looking closer, this is not true. This is behavior of ping if deadline
option (-w) is given which ping RA does not use by default. Otherwise
ping fails if no reply is received.

> This is why with
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> I chose to both give the number of packets sent + number received necessary
> to be
> assumed as alive.

That is of course more flexible, except I am not sure how useful it is
in practice. Can you describe real life scenario where it matters
whether you got 3 or 4 replies out of 5 when pinging *single* server?
Because for multiple servers you already have score option.


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Klaus Wenninger
On Fri, Oct 15, 2021 at 12:01 PM Andrei Borzenkov 
wrote:

> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger 
> wrote:
>
> > Main pain-point here is that ping-RA allows us to configure the count of
> pings sent, but it
> > is just using the exit-value from ping that becomes negative already
> when one of the
> > answers is missing.
>
> Use fping instead? Which is supported by ping RA and should behave
> exactly as needed - report host alive if at least one reply was
> received.
>
I like fping but it having some reputation as DOS tool not everybody might
be fine installing it.
And we will still have something that would be fine with at least a 50%
packet
loss, which as well might not be acceptable to qualify a host as reachable.
But of course we still can tweak it even with the current implementation to
let's say a loss <20% by giving the same host 5 times and having
the limit set to 4.

>
> Maybe when using ping RA could also parse ping output instead of
> relying on exit status.
>
as the fence-agent referenced is doing ;-)



Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-15 Thread Andrei Borzenkov
On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger  wrote:

> Main pain-point here is that ping-RA allows us to configure the count of 
> pings sent, but it
> is just using the exit-value from ping that becomes negative already when one 
> of the
> answers is missing.

Use fping instead? Which is supported by ping RA and should behave
exactly as needed - report host alive if at least one reply was
received.

Maybe when using ping RA could also parse ping output instead of
relying on exit status.


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-14 Thread Klaus Wenninger
On Thu, Oct 14, 2021 at 10:51 PM martin doc  wrote:

>
>
> --
> *From: *Andrei Borzenkov ,  Friday, 15 October 2021
> 4:59 AM
> *...*
> > Dampening defines delay before attributes are committed to CIB.
> > Private attributes are never ever written into CIB, so dampening
> > makes no sense here. Private attributes are managed by attrd
> > itself and you see the latest value.
>
> > If you change transient attribute (without -p option) value you
> > will see different values reported by
>
> > attrd_updater -n my_ping -Q
>
> > and
>
> > cibadmin -Q -A "//nvpair[@name='my_ping']"
>
> > until dampening timeout expires.
>
> > This applies even to deleting attribute.
>
> Ok, now I understand what the dampen function does.
>
> If I understand this correctly then this probably makes every documented
> example of using ocf:pacemaker:ping with a colocation statement wrong
> because the only way to see the effect of dampen is to use a rule that
> references the value of pingd directly. That or the script for ping has a
> major flaw with respect to dampen.
>

As we've already tried to explain, purpose of dampening is not
implementation of any
kind of resilience against loss of a certain percentage of packets or
anything similar.

Basic idea is to have more than one ping host so that - given failure_score
is low enough -
there is gonna be a certain resilience against packet loss.
If your number of ping-hosts isn't large enough you might play with adding
them in multiple
times to get some kind of resilience.
But I agree that this one out of two behavior is probably too resilient for
most cases and
thus there might be room for improvement.
Main pain-point here is that ping-RA allows us to configure the count of
pings sent, but it
is just using the exit-value from ping that becomes negative already when
one of the
answers is missing.
This is why with
https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
I chose to both give the number of packets sent + number received necessary
to be
assumed as alive. If we assume the latter, when not given at all, as equal
to the number
of packets sent we would preserve unchanged behavior for existent
configurations.

Klaus


>
> That is when I do this:
>
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1
> failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
>
> PCS will move everything to a new node if there is even 1 ping failure
> because monitor in ping doesn't look at the dampened value, only the value
> of the immediate returned value.
>
> The same is true with colocation statements - if a constraint is made with
> a ping resource without using a rule that references pingd then  the dampen
> behaviour is ignored completely.
>
> Is the ping'er missing something that does this:
>
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e
> 's/.*value="\([^"]*\)".*/\1/'`
>
> before it checks if $score is less than $OCF_RESKEY_failure_score?
>
> Thanks
>


Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-14 Thread Ken Gaillot
On Thu, 2021-10-14 at 20:51 +, martin doc wrote:
> 
> 
> From: Andrei Borzenkov ,  Friday, 15 October
> 2021 4:59 AM
> ...
> > Dampening defines delay before attributes are committed to CIB.
> > Private attributes are never ever written into CIB, so dampening
> > makes no sense here. Private attributes are managed by attrd
> > itself and you see the latest value.
> 
> > If you change transient attribute (without -p option) value you
> > will see different values reported by
> 
> > attrd_updater -n my_ping -Q
> 
> > and
> 
> > cibadmin -Q -A "//nvpair[@name='my_ping']"
> 
> > until dampening timeout expires.
> 
> > This applies even to deleting attribute.
> 
> Ok, now I understand what the dampen function does.
> 
> If I understand this correctly then this probably makes every
> documented example of using ocf:pacemaker:ping with a colocation
> statement wrong because the only way to see the effect of dampen is
> to use a rule that references the value of pingd directly. That or
> the script for ping has a major flaw with respect to dampen.

Basically ping has 2 modes of operation, with and without
failure_score. Without failure_score, a rule must be used.

I only recall examples showing it without failure_score and with a rule

> That is when I do this:
> 
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1
> failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
> 
> PCS will move everything to a new node if there is even 1 ping
> failure because monitor in ping doesn't look at the dampened value,
> only the value of the immediate returned value.

If you use failure_score.

If you don't use failure_score, then the ping monitor does not fail if
a ping fails. The ping monitor only sets a node attribute, which then
is used in a rule. With this setup, the ping resource should be cloned
on all nodes, and usually not involved in any group or constraints.
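
A sketch of the mode without failure_score, along the lines of the commonly documented setup: a cloned ping resource plus a location rule on the pingd attribute. The function name and the PCS override are hypothetical (PCS=echo gives a dry run), the resource name "database" and the address are taken from the example quoted above, and the exact rule syntax may differ between pcs versions.

```shell
# Clone the ping resource on all nodes and keep a resource away from nodes
# where the pingd attribute is missing or below 1.
# PCS is a hypothetical override: PCS=echo prints the commands (dry run).
setup_ping_rule() {
    pcs_cmd=${PCS:-pcs}
    resource=$1
    gateway=$2
    $pcs_cmd resource create ping ocf:pacemaker:ping \
        host_list="$gateway" dampen=5s clone &&
    $pcs_cmd constraint location "$resource" rule score=-INFINITY \
        pingd lt 1 or not_defined pingd
}

# Dry run: prints the two pcs commands without touching a cluster.
PCS=echo setup_ping_rule database 192.168.1.1
```

Here the ping monitor never fails because of a lost ping; only the rule decides placement, so dampen applies to what the policy engine sees.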

> The same is true with colocation statements - if a constraint is made
> with a ping resource without using a rule that references pingd then
>  the dampen behaviour is ignored completely.
> 
> Is the ping'er missing something that does this:
> 
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e
> 's/.*value="\([^"]*\)".*/\1/'`
> 
> before it checks if $score is less than $OCF_RESKEY_failure_score?
> 
> Thanks

-- 
Ken Gaillot 



Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-14 Thread martin doc



From: Andrei Borzenkov ,  Friday, 15 October 2021 4:59 AM
...
> Dampening defines delay before attributes are committed to CIB.
> Private attributes are never ever written into CIB, so dampening
> makes no sense here. Private attributes are managed by attrd
> itself and you see the latest value.

> If you change transient attribute (without -p option) value you
> will see different values reported by

> attrd_updater -n my_ping -Q

> and

> cibadmin -Q -A "//nvpair[@name='my_ping']"

> until dampening timeout expires.

> This applies even to deleting attribute.

Ok, now I understand what the dampen function does.

If I understand this correctly then this probably makes every documented 
example of using ocf:pacemaker:ping with a colocation statement wrong because 
the only way to see the effect of dampen is to use a rule that references the 
value of pingd directly. That or the script for ping has a major flaw with 
respect to dampen.

That is when I do this:

pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 
failure_score=1
pcs resource create database ocf:heartbeat:pgsql
pcs group add pgrp myPing database

PCS will move everything to a new node if there is even 1 ping failure because 
monitor in ping doesn't look at the dampened value, only the value of the 
immediate returned value.

The same is true with colocation statements - if a constraint is made with a 
ping resource without using a rule that references pingd then  the dampen 
behaviour is ignored completely.

Is the ping'er missing something that does this:

score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 
's/.*value="\([^"]*\)".*/\1/'`

before it checks if $score is less than $OCF_RESKEY_failure_score?

Thanks



Re: [ClusterLabs] Trying to understand dampening (ping)

2021-10-14 Thread Andrei Borzenkov
On 13.10.2021 18:01, martin doc wrote:
> In the ping resource script, there's support for "dampen" in the use of 
> attrd_updater.
> 
> My expectation is that it will cause "ping", "no-ping", "ping" to result in 
> the service being continually presented as up rather than to flap about.
> 
> In testing I can't demonstrate this, even using attrd_updater directly.
> 
> To test out how attrd_updater works, I wrote a small script to do this:
> 
> attrd_updater -n my_ping -D
> attrd_updater -n my_ping -p -B 1000 -d 3s

Dampening defines delay before attributes are committed to CIB. Private
attributes are never ever written into CIB, so dampening makes no sense
here. Private attributes are managed by attrd itself and you see the
latest value.

If you change transient attribute (without -p option) value you will see
different values reported by

attrd_updater -n my_ping -Q

and

cibadmin -Q -A "//nvpair[@name='my_ping']"

until dampening timeout expires.

This applies even to deleting attribute.

Somewhat interesting is that it is apparently not possible to change
attribute type at all. The very first command that creates attribute
sets its type forever. attrd_updater --delete seems to only delete
value, but does not make attrd forget about this attribute. So to retry
without -p option you need to restart pacemaker ...

... checking source code, --delete translates to operation
PCMK__ATTRD_CMD_UPDATE with empty value. So it only changes value
indeed. No way to actually delete attribute.
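
One way to observe the difference described above is to use a transient (non-private) attribute and compare both views right after an update. This is a sketch only: show_dampening, run and the DRY_RUN switch are hypothetical names, and the sleep lengths are a guess at "a bit longer than the dampening delay". Given the remark about attribute type, the attribute name should be one that was never created with -p.

```shell
# DRY_RUN=1 is a hypothetical switch that prints each command instead of
# running it, so the sequence can be inspected without a cluster.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

show_dampening() {
    name=$1
    delay=${2:-3}
    run attrd_updater -n "$name" -B 1000 -d "${delay}s"  # value + delay, no -p
    run sleep $((delay + 1))                             # let 1000 reach the CIB
    run attrd_updater -n "$name" -U 0                    # new value, dampened
    run attrd_updater -n "$name" -Q                      # attrd's current value
    run cibadmin -Q -A "//nvpair[@name='$name']"         # what the CIB holds
    run sleep $((delay + 1))                             # dampening expires
    run cibadmin -Q -A "//nvpair[@name='$name']"         # CIB after the delay
}

# Dry run: list the commands that would be executed.
DRY_RUN=1 show_dampening my_ping 3
```

On a live cluster (without DRY_RUN) the first cibadmin query would be expected to still show the old value while attrd_updater -Q already reports the new one.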

> sleep 1
> for i in 0 1 2 3 4 5 6 7 8 9; do
> attrd_updater -n my_ping -Q
> sleep 1
> attrd_updater -n my_ping -p -U 0 -d 3s
> done
> 
> The output always has the first line as 1000 and every other line with a 
> value of "0" - as if there was no dampening actually happening.
> 
> Even if I modify the above to do -U 1000, -U 0, -U 1000, doing -Q at any 
> point always shows the last value supplied, with no evidence of any smoothing 
> as a result of dampening.
> 
> Is the problem here that the -Q doesn't retrieve the value for my_ping using 
> the same method as is used for resource scripts?
> 
> Am I totally misunderstanding how dampening works?
> 
> Thanks.
> 
> 
> 


[ClusterLabs] Trying to understand dampening (ping)

2021-10-13 Thread martin doc
In the ping resource script, there's support for "dampen" in the use of 
attrd_updater.

My expectation is that it will cause "ping", "no-ping", "ping" to result in the 
service being continually presented as up rather than to flap about.

In testing I can't demonstrate this, even using attrd_updater directly.

To test out how attrd_updater works, I wrote a small script to do this:

attrd_updater -n my_ping -D
attrd_updater -n my_ping -p -B 1000 -d 3s
sleep 1
for i in 0 1 2 3 4 5 6 7 8 9; do
attrd_updater -n my_ping -Q
sleep 1
attrd_updater -n my_ping -p -U 0 -d 3s
done

The output always has the first line as 1000 and every other line with a value 
of "0" - as if there was no dampening actually happening.

Even if I modify the above to do -U 1000, -U 0, -U 1000, doing -Q at any point 
always shows the last value supplied, with no evidence of any smoothing as a 
result of dampening.

Is the problem here that the -Q doesn't retrieve the value for my_ping using 
the same method as is used for resource scripts?

Am I totally misunderstanding how dampening works?

Thanks.
