Sent from iPhone

> On 12 Aug 2019, at 8:46, Jan Friesse <jfrie...@redhat.com> wrote:
> 
> Олег Самойлов wrote:
>>> On 9 Aug 2019, at 9:25, Jan Friesse <jfrie...@redhat.com> wrote:
>>> Please do not set dpd_interval that high. dpd_interval on the qnetd side
>>> is not about how often the ping is sent. Could you please retry your test
>>> with dpd_interval=1000? I'm pretty sure it will work then.
>>> 
>>> Honza
>> Yep. As far as I understand, dpd_interval of qnetd and timeout and
>> sync_timeout of qdevice are somehow linked. By default they are
>> dpd_interval=10, timeout=10, sync_timeout=30. And you advised changing
>> them proportionally.
> 
> Yes, timeout and sync_timeout should be changed proportionally. dpd_interval
> is a different story.
> 
>> https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369
>> But the mechanism by which they depend on each other is mysterious and is
>> not documented.
> 
> Let me try to shed some light on this:
> 
> - dpd_interval is a qnetd variable that sets how often qnetd walks through
> the list of all clients (qdevices) and checks the timestamp of the last sent
> message. If the difference between the current timestamp and the last sent
> message timestamp is larger than 2 * timeout (the timeout sent by the
> client), the client is considered dead.
> 
> - interval - affects how often qdevice sends a heartbeat to corosync (this
> is half of the interval) about its liveness, and also how often it sends a
> heartbeat to qnetd (0.8 * interval). On the corosync side this is used as a
> timeout after which the qdevice daemon is considered dead and its votes are
> no longer valid.
> 
> - sync_timeout - Not used by qdevice/qnetd. Used by corosync during the sync
> phase. If corosync doesn't get a reply from qdevice before this timeout
> expires, it considers the qdevice daemon dead and continues the sync process.
> 
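To make the three descriptions above concrete, here is a minimal Python sketch of the timing relations as Honza describes them. It is only a model of the explanation in this thread, not code taken from corosync, qdevice or qnetd. The class name, the field names, the choice of milliseconds as the unit, and the "worst case" bound are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QdeviceTiming:
    """Hypothetical model of the timers described above (values in ms)."""

    dpd_interval: int  # qnetd: period of the scan over connected clients
    timeout: int       # qdevice: the "interval" above, also sent to qnetd
    sync_timeout: int  # corosync: how long sync waits for a qdevice reply

    @property
    def heartbeat_to_corosync(self) -> int:
        # "half of the interval"
        return self.timeout // 2

    @property
    def heartbeat_to_qnetd(self) -> int:
        # "0.8 * interval", kept in integer arithmetic
        return self.timeout * 8 // 10

    @property
    def dead_after(self) -> int:
        # qnetd treats a client as dead once it is silent for longer than this
        return 2 * self.timeout

    @property
    def worst_case_detection(self) -> int:
        # Derived bound, not stated in the thread: the silence can cross the
        # threshold right after a scan, so add up to one more scan period.
        return self.dead_after + self.dpd_interval


def client_is_dead(now: int, last_message: int, timing: QdeviceTiming) -> bool:
    """The check qnetd is described as running on every dpd_interval tick."""
    return now - last_message > timing.dead_after


if __name__ == "__main__":
    for t in (
        QdeviceTiming(dpd_interval=10_000, timeout=10_000, sync_timeout=30_000),
        QdeviceTiming(dpd_interval=1_000, timeout=20_000, sync_timeout=60_000),
    ):
        print(t, "worst case detection:", t.worst_case_detection)
```

Under this model the first combination gives a worst-case detection time of 30000, which happens to equal its sync_timeout, while dpd_interval=1000 with timeout=20000 gives 41000 against a sync_timeout of 60000. Whether that coincidence has anything to do with the failures reported in this thread is not something the sketch can establish.
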

Looking at the logs at the beginning of this thread, as well as the logs in
the linked GitHub issue, it appears that corosync does not do anything during
sync_timeout; in particular, it does *not* ask qdevice, and qdevice does not
ask qnetd.


>> I rechecked the test with the 20-60 combination. I got the same problem on
>> the 16th failure simulation. qnetd returned its vote in exactly the same
>> second in which qdevice expected it, but slightly late. So the node lost
>> quorum, got the vote slightly later, but did not regain quorum, maybe due
>> to the 'wait for all' option.

That matches the observation above. As soon as corosync is unfrozen, it asks
qnetd, which returns its vote.

So I still do not understand what is supposed to happen during sync_timeout
and whether the observed behavior is intentional. So far it looks just like an
artificial delay.

>> I retried the default 10-30 combination. I got the same problem on the
>> first failure simulation. qnetd sent its vote 1 second later than expected.
>> The next combination is 1-3 (dpd_interval=1, timeout=1, sync_timeout=3).
>> The same problem occurred on the 11th failure simulation. qnetd returned
>> its vote in exactly the same second in which qdevice expected it, but
>> slightly late. So the node lost quorum, got the vote slightly later, but
>> did not regain quorum, maybe due to the 'wait for all' option. And the node
>> was later reset by the watchdog due to lack of quorum.
> 
> It was probably not evident from my reply, but what I meant was to change 
> just dpd_interval. Could you please recheck with dpd_interval=1, timeout=20, 
> sync_timeout=60?
> 
> Honza
> 
> 
>> So, my conclusions:
>> 1. IMHO this bug may depend not on the absolute value of dpd_interval but
>> on the proportion between dpd_interval of qnetd and timeout, sync_timeout
>> of qdevice. Because of this, I cannot predict how to change these options
>> to work around this behaviour.
>> 2. IMHO "wait for all" is also bugged. According to the documentation it
>> must fire only at the start of the cluster, but it looks like it fires
>> every time quorum (or all votes) is lost.
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/