[ClusterLabs] IPaddr2, interval between unsolicited ARP packets

2016-10-02 Thread Shinjiro Hamaguchi
Hello everyone!!


I'm using IPaddr2 for VIP.

In the IPaddr2 documentation it says the interval between unsolicited ARP
packets defaults to 200 msec and can be changed with the "-i" option, but when
I check with tcpdump it looks like ARP is sent at a fixed 1000 msec interval.
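
For reference, the interval is normally set through the IPaddr2 resource
parameters arp_interval (milliseconds) and arp_count, which the agent passes
to send_arp as "-i" and "-r"; a minimal sketch, assuming pcs is in use and the
resource is named "vip":

#pcs resource update vip arp_interval=200 arp_count=5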

Does anyone have any idea?

Thank you in advance.


[environment]
KVM, CentOS 6.8
pacemaker-1.1.14-8.el6_8.1.x86_64
cman-3.0.12.1-78.el6.x86_64
resource-agents-3.9.5-34.el6_8.2.x86_64


[command used to send unsolicited arp]
NOTE: I got this command from /var/log/cluster/corosync.log
#/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-192.168.12.215 eth0 192.168.12.215 auto
not_used not_used

[result of tcpdump]

#tcpdump arp

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes

05:28:17.267296 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28

05:28:18.267519 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28

05:28:19.267638 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28

05:28:20.267715 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28

05:28:21.267801 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28


Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-02 Thread Andrew Beekhof
On Fri, Sep 30, 2016 at 10:28 AM, Ken Gaillot  wrote:
> On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
>> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot  wrote:
 "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
 then migrate", but I can't think of a real-world situation where that
 makes sense,


 really?

 it is not uncommon to hear "i know its failed, but i dont want the
 cluster to do anything until its _really_ failed"
>>>
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>>
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>>
>> It would be a major change, but perhaps it should be "in-a-row" and
>> successfully performing the action clears the count.
>> It's entirely possible that the current behaviour is like that because
>> I wasn't smart enough to implement anything else at the time :-)
>
> Or you were smart enough to realize what a can of worms it is. :)

So you're saying two dumbs makes a smart? :-)

>Take a
> look at all of nagios' options for deciding when a failure becomes "real".

I used to take a very hard line on this: if you don't want the cluster
to do anything about an error, don't tell us about it.
However, I'm slowly changing my position... the reality is that many
people do want a heads-up in advance, and we have been forcing that
policy (when does an error become real) into the agents, where one
size must fit all.

So I'm now generally in favour of having the PE handle this "somehow".

>
> If you clear failures after a success, you can't detect/recover a
> resource that is flapping.

Ah, but you can if the thing you're clearing only applies to other
failures of the same action.
A completed start doesn't clear a previously failed monitor.

>
>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
>>
>> Not disagreeing
>
> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> something like:
>
>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

The other idea I had was to create some new return codes:
PCMK_OCF_ERR_BAN, PCMK_OCF_ERR_FENCE, etc.
I.e., make the internal mapping of return codes like
PCMK_OCF_NOT_CONFIGURED and PCMK_OCF_DEGRADED to hard/soft/ignore
recovery logic into something available to the agent.

To use your example above, return PCMK_OCF_DEGRADED for the first 3
monitor failures, PCMK_OCF_ERR_RESTART for the next two and
PCMK_OCF_ERR_BAN for the last.
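
As a rough sketch of what that would push into an agent (names only:
PCMK_OCF_ERR_RESTART and PCMK_OCF_ERR_BAN are hypothetical, and check_service
and the counter file are invented for illustration; assumes the usual
ocf-shellfuncs have been sourced), the monitor action would end up looking
something like:

app_monitor() {
    # hypothetical escalation logic inside an OCF agent's monitor action
    fail_file="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.monfail"
    if ! check_service; then
        count=$(cat "$fail_file" 2>/dev/null || echo 0)
        count=$((count + 1))
        echo "$count" >"$fail_file"
        [ "$count" -le 3 ] && return $PCMK_OCF_DEGRADED      # first 3 failures: flag only
        [ "$count" -le 5 ] && return $PCMK_OCF_ERR_RESTART   # next 2: soft recovery
        return $PCMK_OCF_ERR_BAN                             # after that: ban this node
    fi
    rm -f "$fail_file"
    return $OCF_SUCCESS
}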

But the more I think about it, the less I like it.
- We lose precision about what the actual error was
- We're pushing too much user config/policy into the agent (every
agent would end up with equivalents of 'ignore-fail', 'soft-fail', and
'on-hard-fail')
- We might need the agent to know about the fencing config
(enabled/disabled/valid)
- It forces the agent to track the number of operation failures

So I think I'm just mentioning it for completeness and in case it
prompts a good idea in someone else.

>
>
> To express current default behavior:
>
>   op start ignore-fail=0 soft-fail=0 on-hard-fail=ban

I would favour something more concrete than 'soft' and 'hard' here.
Do they have a sufficiently obvious meaning outside of us developers?

Perhaps (with or without a "failures-" prefix):

   ignore-count
   recover-count
   escalation-policy

>   op stop  ignore-fail=0 soft-fail=0 on-hard-fail=fence
>   op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban
>
>
> on-fail, migration-threshold, and start-failure-is-fatal would be
> deprecated (and would be easy to map to the new parameters).
>
> I'd avoid the hassles of counting failures "in a row", and stick with
> counting failures since the last cleanup.

sure
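
For reference, the closest today's attributes get to that example is the blunt
per-resource form (a sketch assuming pcs and a hypothetical resource "db"):
soft recovery via on-fail=restart, escalation to a ban once migration-threshold
failures accumulate, and failure-timeout if you want stale failures to age out
rather than counting "in a row":

#pcs resource update db meta migration-threshold=5 failure-timeout=10min
#pcs resource op add db monitor interval=10s on-fail=restart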
