Sent from iPhone
> On 12 Aug 2019, at 8:46, Jan Friesse <jfrie...@redhat.com> wrote:
>
> Олег Самойлов wrote:
>>> On 9 Aug 2019, at 9:25, Jan Friesse <jfrie...@redhat.com> wrote:
>>> Please do not set dpd_interval that high. dpd_interval on the qnetd side
>>> is not about how often the ping is sent. Could you please retry your test
>>> with dpd_interval=1000? I'm pretty sure it will work then.
>>>
>>> Honza
>> Yep. As far as I understand, dpd_interval of qnetd and timeout and
>> sync_timeout of qdevice are somehow linked. By default they are
>> dpd_interval=10, timeout=10, sync_timeout=30. And you advised changing
>> them proportionally.
>
> Yes, timeout and sync_timeout should be changed proportionally.
> dpd_interval is a different story.
>
>> https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369
>> But the mechanism of how they depend on each other is mysterious and is
>> not documented.
>
> Let me try to shed some light on this:
>
> - dpd_interval is a qnetd variable: how often qnetd walks through the list
>   of all clients (qdevices) and checks the timestamp of the last sent
>   message. If the difference between the current timestamp and the last
>   sent message timestamp is larger than 2 * timeout sent by the client,
>   then the client is considered dead.
>
> - interval - affects how often qdevice sends a heartbeat to corosync (this
>   is half of the interval) about its liveness and also how often it sends
>   a heartbeat to qnetd (0.8 * interval). On the corosync side this is used
>   as a timeout after which the qdevice daemon is considered dead and its
>   votes are no longer valid.
>
> - sync_timeout - Not used by qdevice/qnetd. Used by corosync during the
>   sync phase. If corosync doesn't get a reply from qdevice before this
>   timeout, it considers the qdevice daemon dead and continues the sync
>   process.
>

Looking at the logs at the beginning of this thread, as well as the logs in
the linked GitHub issue, it appears that corosync does not do anything
during sync_timeout; in particular, it does *not* ask qdevice, and qdevice
does not ask qnetd.
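
To make the timing rules quoted above concrete, here is a minimal Python
sketch. It is not taken from the corosync, qdevice or qnetd sources. The
names (Timers, client_is_dead, worst_case_detection and so on) are
hypothetical, all values are assumed to be in the same unit, and the
"interval" in the second bullet is assumed to be the qdevice timeout. The
worst-case figure is only an inference from the description: if the sweep
runs once per dpd_interval, a client that crosses the 2 * timeout threshold
just after a sweep is noticed one sweep later.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Timers:
    """Timer values, all in one unit (an assumption of this sketch)."""
    dpd_interval: float  # qnetd: period of the sweep over all clients
    timeout: float       # qdevice: timeout value the client sends to qnetd
    sync_timeout: float  # corosync: how long sync waits for qdevice


def client_is_dead(now: float, last_msg: float, t: Timers) -> bool:
    # Rule as quoted: dead if the last message is older than 2 * timeout.
    return (now - last_msg) > 2 * t.timeout


def heartbeat_periods(t: Timers) -> Tuple[float, float]:
    # (to corosync, to qnetd): half and 0.8 of the interval, as quoted.
    # Assumption: "interval" is the qdevice timeout.
    return t.timeout / 2, 0.8 * t.timeout


def worst_case_detection(t: Timers) -> float:
    # Inferred, not documented: threshold plus one full sweep period.
    return 2 * t.timeout + t.dpd_interval


def margin(t: Timers) -> float:
    # Time left before sync_timeout expires in that worst case.
    return t.sync_timeout - worst_case_detection(t)


if __name__ == "__main__":
    # Combinations mentioned in this thread.
    for combo in [(10, 10, 30), (20, 20, 60), (1, 1, 3), (1, 20, 60)]:
        t = Timers(*combo)
        print(combo, worst_case_detection(t), margin(t))
```

Under these assumptions the three proportional combinations (10/10/30,
20/20/60, 1/1/3) all give a margin of 0, that is, the worst-case detection
time lands exactly on sync_timeout, while dpd_interval=1, timeout=20,
sync_timeout=60 leaves a margin of 19. Whether this is what actually
happens inside qnetd is not confirmed here.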

>> I rechecked the test with the 20-60 combination. I get the same problem
>> on the 16th failure simulation. The qnetd returns the vote in exactly the
>> same second that qdevice expects it, but slightly less. So the node lost
>> quorum, got the vote slightly later, but did not get quorum, maybe due to
>> the 'wait for all' option.

That matches the above observation. As soon as corosync is unfrozen, it asks
qnetd, which returns its vote. So I still do not understand what is supposed
to happen during sync_timeout and whether the observed behavior is
intentional. So far it looks just like an artificial delay.

>> I retried the default 10-30 combination. I got the same problem on the
>> first failure simulation. Qnetd sent the vote 1 second later than
>> expected.
>> Combination is 1-3 (dpd_interval=1, timeout=1, sync_timeout=3). The same
>> problem on the 11th failure simulation. The qnetd returns the vote in
>> exactly the same second that qdevice expects it, but slightly less. So
>> the node lost quorum, got the vote slightly later, but did not get
>> quorum, maybe due to the 'wait for all' option. And the node is
>> watchdogged later due to lack of quorum.
>
> It was probably not evident from my reply, but what I meant was to change
> just dpd_interval. Could you please recheck with dpd_interval=1,
> timeout=20, sync_timeout=60?
>
> Honza
>
>> So, my conclusions:
>> 1. IMHO this bug may depend not on the absolute value of dpd_interval but
>> on the proportion between dpd_interval of qnetd and timeout, sync_timeout
>> of qdevice. Because of this, I cannot predict how to change them to work
>> around this behaviour.
>> 2. IMHO "wait for all" is also bugged. According to the documentation it
>> must fire only at the start of the cluster, but it looks like it fires
>> every time quorum (or all votes) is lost.

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/