Re: [ClusterLabs] Trying to understand dampening (ping)
Some other notes...

I really wish there were better documentation for the individual resources. On the ClusterLabs website, I cannot find a page that describes "ping" in any detail.

There have been some suggestions about listing the same host more than once. I suspect that only really works if you disable fping (but I haven't tried).

The description for timeout is "how long, in seconds, to wait before declaring a ping lost". That sounds as if each ping is allowed to take "timeout" seconds, but in the fping case it really means "the total time to wait, in seconds, before declaring that the ping monitor has failed." I suppose it depends on how you interpret "a ping": does it mean one invocation of the ping command or one ICMP echo?

From the script, the timeout allowed per ping with fping is actually "timeout * 1000 / attempts" milliseconds. If fping isn't used, it's "timeout" seconds per invocation of ping. As an example, using timeout=5, attempts=5 with fping results in fping returning after a maximum of 6 seconds, whereas with ping it can take 9-10 seconds to return. To get behaviour from fping equivalent to ping, "-i 1000" should be added to its command line. This difference is very significant, because a network disruption of 1 second can make fping report a failure where ping wouldn't.

Unless you dig into the source code and understand the differences, there's no reason to prefer one over the other. The ping resource is very important and needs much better documentation, and perhaps should be more than one resource ... if only there weren't the problem of backwards compatibility.

____
From: Users on behalf of martin doc
Sent: Monday, 18 October 2021 5:35 AM
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Trying to understand dampening (ping)

The use case is to detect if the network path to the default gateway has failed in one of 3 hosts.
The use of "ping" covers cable failure, SFP failure, or some other sort of failure that is local to a single host. In none of the reading I did on the web was there ever a sentence that said "dampen is not active if failure_score is not 0." Given the incompatibility between the two attributes, should both coexist on the same resource? ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
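The per-probe arithmetic described above can be written out as a small sketch. The helper names are hypothetical (they are not functions of the ping resource agent); the formulas only restate the ones quoted from the script: with fping each probe gets timeout * 1000 / attempts milliseconds, while plain ping gets the whole timeout per invocation.

```shell
#!/bin/sh
# Hypothetical helpers restating the timeout arithmetic described above;
# they are not part of the ocf:pacemaker:ping agent itself.

# fping case: the RA "timeout" (seconds) is split across all attempts,
# giving a per-probe timeout in milliseconds (shell integer arithmetic).
fping_probe_timeout_ms() {
    echo $(( $1 * 1000 / $2 ))    # $1 = timeout (s), $2 = attempts
}

# Plain ping case: each invocation of ping may use the whole timeout.
ping_probe_timeout_ms() {
    echo $(( $1 * 1000 ))         # $1 = timeout (s)
}

fping_probe_timeout_ms 5 5    # prints 1000
ping_probe_timeout_ms 5       # prints 5000
```

With timeout=5 and attempts=5, each fping probe gets only one second, which is why a disruption of about a second can already be enough for fping to report a failure.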
Re: [ClusterLabs] Trying to understand dampening (ping)
The use case is to detect if the network path to the default gateway has failed in one of 3 hosts. The use of "ping" covers cable failure, SFP failure, or some other sort of failure that is local to a single host. In none of the reading I did on the web was there ever a sentence that said "dampen is not active if failure_score is not 0." Given the incompatibility between the two attributes, should both coexist on the same resource?
Re: [ClusterLabs] Trying to understand dampening (ping)
On 15.10.2021 13:24, Klaus Wenninger wrote:
> On Fri, Oct 15, 2021 at 12:01 PM Andrei Borzenkov wrote:
>
>> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:
>>
>>> Main pain-point here is that ping-RA allows us to configure the count of pings sent, but it is just using the exit-value from ping, which already signals failure when one of the answers is missing.
>>
>> Use fping instead? It is supported by the ping RA and should behave exactly as needed - report the host alive if at least one reply was received.
>
> I like fping, but it has something of a reputation as a DoS tool, so not everybody might be fine installing it. And we would still have something that tolerates 50% or more packet loss, which might not be acceptable for qualifying a host as reachable either. But of course, even with the current implementation we can tweak it to, let's say, a loss below 20% by giving the same host 5 times and setting the limit to 4.
>
>> Maybe, when using ping, the RA could also parse the ping output instead of relying on the exit status.
>
> As the fence-agent referenced is doing ;-)

Actually, simply having an inner loop from 1 to $OCF_RESKEY_attempts running "ping -c 1" is simpler and more portable. But I am not convinced it is worth the trouble.
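The inner-loop idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the resource agent's code: probe_host and the PROBE_CMD variable are hypothetical names, and PROBE_CMD defaults to "ping -c 1" only so that the probe can be replaced by a stub when testing. Like fping, it reports the host as reachable as soon as any single probe is answered.

```shell
#!/bin/sh
# Hypothetical sketch of the inner-loop idea: send single-packet probes up to
# $attempts times and stop at the first reply. probe_host and PROBE_CMD are
# illustrative names, not part of the ocf:pacemaker:ping agent.
PROBE_CMD=${PROBE_CMD:-"ping -c 1"}

probe_host() {
    host=$1
    attempts=$2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # $PROBE_CMD is left unquoted on purpose so "ping -c 1" splits into words.
        if $PROBE_CMD "$host" >/dev/null 2>&1; then
            return 0    # one reply is enough: host counts as reachable
        fi
        i=$((i + 1))
    done
    return 1            # no probe was answered
}
```

For example, `probe_host 192.168.1.1 3 && echo reachable` would send at most three single-packet pings. Counting the replies instead of returning at the first one would give the N-of-M variant Klaus describes.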
Re: [ClusterLabs] Trying to understand dampening (ping)
On 14.10.2021 23:51, martin doc wrote:
>
> From: Andrei Borzenkov, Friday, 15 October 2021 4:59 AM
> ...
>> Dampening defines the delay before attributes are committed to the CIB. Private attributes are never ever written into the CIB, so dampening makes no sense here. Private attributes are managed by attrd itself and you see the latest value.
>
>> If you change a transient attribute's value (without the -p option), you will see different values reported by
>
>> attrd_updater -n my_ping -Q
>
>> and
>
>> cibadmin -Q -A "//nvpair[@name='my_ping']"
>
>> until the dampening timeout expires.
>
>> This applies even to deleting the attribute.
>
> Ok, now I understand what the dampen function does.
>
> If I understand this correctly, then this probably makes every documented example of using ocf:pacemaker:ping with a colocation statement wrong, because the only way to see the effect of dampen is to use a rule that references the value of pingd directly. That, or the script for ping has a major flaw with respect to dampen.
>
> That is, when I do this:
>
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
>
> PCS will move everything to a new node if there is even 1 ping failure, because monitor in ping doesn't look at the dampened value, only at the immediately returned value.
>

failure_score is the number of hosts that must answer ping during a single monitor invocation. If you have a single host, the only meaningful value is 1. If you want to smooth out a single ping failure, use the "attempts" parameter. It defaults to 3, which means every monitor operation does 3 pings and fails only if all of them fail. So it already does what you want without any special configuration.

> The same is true with colocation statements - if a constraint is made with a ping resource without using a rule that references pingd, then the dampen behaviour is ignored completely.
You completely misunderstand what dampen is used for. It is used to wait for multiple nodes to record the results of their monitor actions, so that when the policy engine is invoked it (hopefully) has the final picture. It has nothing to do with individual ping results on any single node.

> Is the ping'er missing something that does this:
>
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 's/.*value="\([^"]*\)".*/\1/'`
>

The only effect it would have is using the results of the previous monitor invocation instead of the current one. You cannot use dampening to smooth out ping results. You will still have only one final value recorded, so in the sequence success, success, failure it will be failure. To do anything more sophisticated you need to actually record every individual ping result. This is far more involved, and I still don't see a real use case.
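To make the point about dampening concrete, here is a toy model (not attrd's implementation) of what ends up in the CIB when several results are reported within one dampening window: the most recent value is written once the delay expires. The hypothetical majority_value helper shows the kind of smoothing that was expected instead, which, as explained above, would require recording every individual ping result.

```shell
#!/bin/sh
# Toy model only: dampening delays the CIB write, it does not combine values.
# The value committed after the window is the last one reported.
dampened_value() {
    last=
    for v in "$@"; do
        last=$v
    done
    echo "$last"
}

# Hypothetical smoothing that dampening does NOT do: a majority vote over
# every individual result (1 = reply received, 0 = no reply).
majority_value() {
    ones=0
    for v in "$@"; do
        if [ "$v" -ne 0 ]; then
            ones=$((ones + 1))
        fi
    done
    if [ $((ones * 2)) -gt $# ]; then
        echo 1
    else
        echo 0
    fi
}

dampened_value 1 1 0    # prints 0: success, success, failure -> failure
majority_value 1 1 0    # prints 1
```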
Re: [ClusterLabs] Trying to understand dampening (ping)
On 15.10.2021 09:24, Klaus Wenninger wrote:
> Main pain-point here is that ping-RA allows us to configure the count of pings sent, but it is just using the exit-value from ping, which already signals failure when one of the answers is missing.

Looking closer, this is not true. That is the behavior of ping when the deadline option (-w) is given, which the ping RA does not use by default. Otherwise ping fails only if no reply at all is received.

> This is why with
> https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py
> I chose to make configurable both the number of packets sent and the number of replies needed for the host to be assumed alive.

That is of course more flexible, except I am not sure how useful it is in practice. Can you describe a real-life scenario where it matters whether you got 3 or 4 replies out of 5 when pinging a *single* server? For multiple servers you already have the score option.
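The exit-status rule described here matches the iputils ping(8) man page: ping exits with 1 if it received no replies at all, and also if both a count and a deadline (-w) were given and fewer than count replies arrived by the deadline. The sketch below models only those two rules (other errors, reported with exit code 2, are left out) and assumes iputils ping; other ping implementations may behave differently. ping_exit_status is a hypothetical helper.

```shell
#!/bin/sh
# Models the iputils ping exit status for a run with -c COUNT.
# Usage: ping_exit_status COUNT RECEIVED DEADLINE_GIVEN(yes|no)
ping_exit_status() {
    count=$1
    received=$2
    deadline=$3
    if [ "$received" -eq 0 ]; then
        echo 1      # no reply at all: always a failure
    elif [ "$deadline" = yes ] && [ "$received" -lt "$count" ]; then
        echo 1      # -c together with -w: every reply must arrive in time
    else
        echo 0      # without -w, a single reply is enough
    fi
}

ping_exit_status 3 1 no     # prints 0
ping_exit_status 3 1 yes    # prints 1
```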
Re: [ClusterLabs] Trying to understand dampening (ping)
On Fri, Oct 15, 2021 at 12:01 PM Andrei Borzenkov wrote:
> On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:
>
>> Main pain-point here is that ping-RA allows us to configure the count of pings sent, but it is just using the exit-value from ping, which already signals failure when one of the answers is missing.
>
> Use fping instead? It is supported by the ping RA and should behave exactly as needed - report the host alive if at least one reply was received.

I like fping, but it has something of a reputation as a DoS tool, so not everybody might be fine installing it. And we would still have something that tolerates 50% or more packet loss, which might not be acceptable for qualifying a host as reachable either. But of course, even with the current implementation we can tweak it to, let's say, a loss below 20% by giving the same host 5 times and setting the limit to 4.

> Maybe, when using ping, the RA could also parse the ping output instead of relying on the exit status.

As the fence-agent referenced is doing ;-)
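The "same host 5 times, limit 4" workaround could look like the configuration fragment below. It is a sketch under assumptions not stated in the thread: it sets attempts=1 so that each listed copy of the host presumably corresponds to a single echo request, and it disables fping through the agent's use_fping parameter, since duplicate hosts were suspected to work only without fping. With the default multiplier of 1, failure_score=4 then fails the monitor when fewer than 4 of the 5 listings answer. Check the parameter names against your agent version with `pcs resource describe ocf:pacemaker:ping`.

```shell
# Hypothetical configuration: the same gateway listed five times, one echo
# request per listing, and a monitor failure when fewer than 4 listings answer.
pcs resource create myPing ocf:pacemaker:ping \
    host_list="192.168.1.1 192.168.1.1 192.168.1.1 192.168.1.1 192.168.1.1" \
    attempts=1 failure_score=4 use_fping=0
```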
Re: [ClusterLabs] Trying to understand dampening (ping)
On Fri, Oct 15, 2021 at 9:25 AM Klaus Wenninger wrote:
> Main pain-point here is that ping-RA allows us to configure the count of pings sent, but it is just using the exit-value from ping, which already signals failure when one of the answers is missing.

Use fping instead? It is supported by the ping RA and should behave exactly as needed - report the host alive if at least one reply was received.

Maybe, when using ping, the RA could also parse the ping output instead of relying on the exit status.
Re: [ClusterLabs] Trying to understand dampening (ping)
On Thu, Oct 14, 2021 at 10:51 PM martin doc wrote:
>
> From: Andrei Borzenkov, Friday, 15 October 2021 4:59 AM
> ...
> > Dampening defines the delay before attributes are committed to the CIB. Private attributes are never ever written into the CIB, so dampening makes no sense here. Private attributes are managed by attrd itself and you see the latest value.
>
> > If you change a transient attribute's value (without the -p option), you will see different values reported by
>
> > attrd_updater -n my_ping -Q
>
> > and
>
> > cibadmin -Q -A "//nvpair[@name='my_ping']"
>
> > until the dampening timeout expires.
>
> > This applies even to deleting the attribute.
>
> Ok, now I understand what the dampen function does.
>
> If I understand this correctly, then this probably makes every documented example of using ocf:pacemaker:ping with a colocation statement wrong, because the only way to see the effect of dampen is to use a rule that references the value of pingd directly. That, or the script for ping has a major flaw with respect to dampen.

As we've already tried to explain, the purpose of dampening is not to implement any kind of resilience against the loss of a certain percentage of packets or anything similar. The basic idea is to have more than one ping host so that, given failure_score is low enough, there is a certain resilience against packet loss. If your number of ping hosts isn't large enough, you might play with adding them multiple times to get some kind of resilience. But I agree that this one-out-of-two behavior is probably too resilient for most cases, and thus there might be room for improvement.

Main pain-point here is that ping-RA allows us to configure the count of pings sent, but it is just using the exit-value from ping, which already signals failure when one of the answers is missing.
This is why with https://github.com/ClusterLabs/fence-agents/blob/master/agents/heuristics_ping/fence_heuristics_ping.py I chose to make configurable both the number of packets sent and the number of replies needed for the host to be assumed alive. If, when the latter is not given at all, we assume it to be equal to the number of packets sent, we would preserve the behavior of existing configurations.

Klaus

>
> That is, when I do this:
>
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
>
> PCS will move everything to a new node if there is even 1 ping failure, because monitor in ping doesn't look at the dampened value, only at the immediately returned value.
>
> The same is true with colocation statements - if a constraint is made with a ping resource without using a rule that references pingd, then the dampen behaviour is ignored completely.
>
> Is the ping'er missing something that does this:
>
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 's/.*value="\([^"]*\)".*/\1/'`
>
> before it checks if $score is less than $OCF_RESKEY_failure_score?
>
> Thanks
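The proposal above (a number of packets sent plus a number of replies required, with the latter defaulting to the former) can be sketched as a simple check. host_alive is a hypothetical helper, not code from fence_heuristics_ping.py. Note that, as Andrei points out in his reply, plain ping without -w already succeeds when any reply arrives, so "required = sent" would be stricter than what the ping RA currently does.

```shell
#!/bin/sh
# Hypothetical N-of-M check: a host counts as alive when at least REQUIRED
# of SENT probes were answered. REQUIRED defaults to SENT when omitted.
# Usage: host_alive SENT RECEIVED [REQUIRED]
host_alive() {
    sent=$1
    received=$2
    required=${3:-$sent}
    [ "$received" -ge "$required" ]
}

host_alive 5 4 4 && echo alive          # prints "alive": 4 of 5, limit 4
host_alive 5 4   || echo unreachable    # prints "unreachable": default limit is 5
```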
Re: [ClusterLabs] Trying to understand dampening (ping)
On Thu, 2021-10-14 at 20:51 +, martin doc wrote:
>
> From: Andrei Borzenkov, Friday, 15 October 2021 4:59 AM
> ...
> > Dampening defines the delay before attributes are committed to the CIB. Private attributes are never ever written into the CIB, so dampening makes no sense here. Private attributes are managed by attrd itself and you see the latest value.
>
> > If you change a transient attribute's value (without the -p option), you will see different values reported by
>
> > attrd_updater -n my_ping -Q
>
> > and
>
> > cibadmin -Q -A "//nvpair[@name='my_ping']"
>
> > until the dampening timeout expires.
>
> > This applies even to deleting the attribute.
>
> Ok, now I understand what the dampen function does.
>
> If I understand this correctly, then this probably makes every documented example of using ocf:pacemaker:ping with a colocation statement wrong, because the only way to see the effect of dampen is to use a rule that references the value of pingd directly. That, or the script for ping has a major flaw with respect to dampen.

Basically, ping has two modes of operation: with and without failure_score. Without failure_score, a rule must be used. I only recall examples showing it without failure_score and with a rule.

> That is, when I do this:
>
> pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 failure_score=1
> pcs resource create database ocf:heartbeat:pgsql
> pcs group add pgrp myPing database
>
> PCS will move everything to a new node if there is even 1 ping failure, because monitor in ping doesn't look at the dampened value, only at the immediately returned value.

That is only the case if you use failure_score. If you don't use failure_score, the ping monitor does not fail when a ping fails. The ping monitor only sets a node attribute, which is then used in a rule. With this setup, the ping resource should be cloned on all nodes, and usually not involved in any group or constraints.
> The same is true with colocation statements - if a constraint is made with a ping resource without using a rule that references pingd, then the dampen behaviour is ignored completely.
>
> Is the ping'er missing something that does this:
>
> score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 's/.*value="\([^"]*\)".*/\1/'`
>
> before it checks if $score is less than $OCF_RESKEY_failure_score?
>
> Thanks

--
Ken Gaillot
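For reference, the setup described here (ping without failure_score, cloned on all nodes, with the node attribute consumed by a rule) might look like the following with pcs. This is a sketch, not taken from the thread: it assumes the agent's default attribute name pingd and default multiplier of 1, reuses the resource names from the example above, and pcs syntax differs somewhat between versions.

```shell
# Ping clone that only maintains the "pingd" node attribute; no failure_score.
pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 dampen=5s clone

# Keep the database off any node whose pingd attribute is missing or below 1.
pcs constraint location database rule score=-INFINITY pingd lt 1 or not_defined pingd
```

Because the rule is evaluated against the attribute value recorded in the CIB, this is the configuration in which dampen actually has an effect.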
Re: [ClusterLabs] Trying to understand dampening (ping)
From: Andrei Borzenkov, Friday, 15 October 2021 4:59 AM
...
> Dampening defines the delay before attributes are committed to the CIB. Private attributes are never ever written into the CIB, so dampening makes no sense here. Private attributes are managed by attrd itself and you see the latest value.

> If you change a transient attribute's value (without the -p option), you will see different values reported by

> attrd_updater -n my_ping -Q

> and

> cibadmin -Q -A "//nvpair[@name='my_ping']"

> until the dampening timeout expires.

> This applies even to deleting the attribute.

Ok, now I understand what the dampen function does.

If I understand this correctly, then this probably makes every documented example of using ocf:pacemaker:ping with a colocation statement wrong, because the only way to see the effect of dampen is to use a rule that references the value of pingd directly. That, or the script for ping has a major flaw with respect to dampen.

That is, when I do this:

pcs resource create myPing ocf:pacemaker:ping host_list=192.168.1.1 failure_score=1
pcs resource create database ocf:heartbeat:pgsql
pcs group add pgrp myPing database

PCS will move everything to a new node if there is even 1 ping failure, because monitor in ping doesn't look at the dampened value, only at the immediately returned value.

The same is true with colocation statements - if a constraint is made with a ping resource without using a rule that references pingd, then the dampen behaviour is ignored completely.

Is the ping'er missing something that does this:

score=`cibadmin -Q -A "//nvpair[@name='ping']" | sed -e 's/.*value="\([^"]*\)".*/\1/'`

before it checks if $score is less than $OCF_RESKEY_failure_score?

Thanks
Re: [ClusterLabs] Trying to understand dampening (ping)
On 13.10.2021 18:01, martin doc wrote:
> In the ping resource script, there's support for "dampen" in the use of attrd_updater.
>
> My expectation is that it will cause "ping", "no-ping", "ping" to result in the service being continually presented as up rather than flapping about.
>
> In testing I can't demonstrate this, even using attrd_updater directly.
>
> To test out how attrd_updater works, I wrote a small script to do this:
>
> attrd_updater -n my_ping -D
> attrd_updater -n my_ping -p -B 1000 -d 3s

Dampening defines the delay before attributes are committed to the CIB. Private attributes are never ever written into the CIB, so dampening makes no sense here. Private attributes are managed by attrd itself and you see the latest value.

If you change a transient attribute's value (without the -p option), you will see different values reported by

attrd_updater -n my_ping -Q

and

cibadmin -Q -A "//nvpair[@name='my_ping']"

until the dampening timeout expires.

This applies even to deleting the attribute.

Somewhat interesting is that it is apparently not possible to change an attribute's type at all. The very first command that creates the attribute sets its type forever. attrd_updater --delete seems to only delete the value, but does not make attrd forget about the attribute. So to retry without the -p option you need to restart pacemaker...

... Checking the source code, --delete translates to operation PCMK__ATTRD_CMD_UPDATE with an empty value. So it indeed only changes the value. There is no way to actually delete the attribute.

> sleep 1
> for i in 0 1 2 3 4 5 6 7 8 9; do
>     attrd_updater -n my_ping -Q
>     sleep 1
>     attrd_updater -n my_ping -p -U 0 -d 3s
> done
>
> The output always has the first line as 1000 and every other line with a value of "0" - as if there was no dampening actually happening.
>
> Even if I modify the above to do -U 1000, -U 0, -U 1000, doing -Q at any point always shows the last value supplied, with no evidence of any smoothing as a result of dampening.
>
> Is the problem here that -Q doesn't retrieve the value for my_ping using the same method as is used for resource scripts?
>
> Am I totally misunderstanding how dampening works?
>
> Thanks.
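The check suggested above can be turned into a short session on a test cluster. This is a sketch, not a verified transcript: my_ping2 is a hypothetical fresh attribute name (an attribute first created with -p cannot be made non-private again without restarting Pacemaker, as noted above), and the 10s delay is arbitrary. The comments describe what the explanation above says you should see, not captured output.

```shell
attrd_updater -n my_ping2 -U 1000 -d 10s
sleep 12                                      # wait until 1000 has been committed
attrd_updater -n my_ping2 -U 0 -d 10s
attrd_updater -n my_ping2 -Q                  # attrd should already report 0
cibadmin -Q -A "//nvpair[@name='my_ping2']"   # the CIB should still show 1000
sleep 12
cibadmin -Q -A "//nvpair[@name='my_ping2']"   # the CIB should now show 0
```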
[ClusterLabs] Trying to understand dampening (ping)
In the ping resource script, there's support for "dampen" in the use of attrd_updater.

My expectation is that it will cause "ping", "no-ping", "ping" to result in the service being continually presented as up rather than flapping about.

In testing I can't demonstrate this, even using attrd_updater directly.

To test out how attrd_updater works, I wrote a small script to do this:

    attrd_updater -n my_ping -D
    attrd_updater -n my_ping -p -B 1000 -d 3s
    sleep 1
    for i in 0 1 2 3 4 5 6 7 8 9; do
        attrd_updater -n my_ping -Q
        sleep 1
        attrd_updater -n my_ping -p -U 0 -d 3s
    done

The output always has the first line as 1000 and every other line with a value of "0" - as if there was no dampening actually happening.

Even if I modify the above to do -U 1000, -U 0, -U 1000, doing -Q at any point always shows the last value supplied, with no evidence of any smoothing as a result of dampening.

Is the problem here that -Q doesn't retrieve the value for my_ping using the same method as is used for resource scripts?

Am I totally misunderstanding how dampening works?

Thanks.