On April 8, 2020 8:32:59 PM GMT+03:00, Sherrard Burton <sb-clusterl...@allafrica.com> wrote:
>
>
>On 4/8/20 1:09 PM, Andrei Borzenkov wrote:
>> 08.04.2020 10:12, Jan Friesse wrote:
>>> Sherrard,
>>>
>>>> i could not determine which of these sub-threads to include this in,
>>>> so i am going to (reluctantly) top-post it.
>>>>
>>>> i switched the transport to udp, and in limited testing i seem to not
>>>> be hitting the race condition. of course i have no idea whether this
>>>> will behave consistently, or which part of the knet vs udp setup makes
>>>> the most difference.
>>>>
>>>> ie, is it the overhead of the crypto handshakes/setup? is there some
>>>> other knet layer that imparts additional delay in establishing
>>>> connection to other nodes? is the delay on the rebooted node, the
>>>> standing node, or both?
>>>>
>>>
>>> Very high level, what is happening in corosync when using udpu:
>>> - Corosync started and begins in gather state -> sends "multicast"
>>> (emulated by unicast to all expected members) message telling "I'm here
>>> and this is my view of live nodes").
>>> - In this state, corosync waits for answers
>>> - When node receives this message it "multicast" same message with
>>> updated view of live nodes
>>> - After all nodes agrees, they move to next state (commit/recovery and
>>> finally operational)
>>>
>>> With udp, this happens instantly so most of the time corosync doesn't
>>> even create single node membership, which would be created if no other
>>> nodes exists and/or replies wouldn't be delivered on time.
>>>
>>
>> Is it possible to delay "creating single node membership" until some
>> reasonable initial timeout after corosync starts to ensure node view of
>> cluster is up to date? It is clear that there will always be some corner
>> cases, but at least this would make "obviously correct" configuration to
>> behave as expected.
>>
>> Corosync already must have timeout to declare peers unreachable - it
>> sounds like most logical to use in this case.
>>
>
>i tossed that idea around in my head as well. basically if there was an
>analogue client_leaving called client_joining that could be used to
>allowed the qdevice to return 'ask later'.
>
>i think the trade-off here is that you sacrifice some responsiveness in
>your failover times, since (i'm guessing) the timeout for declaring
>peers unreachable errors on the side of caution.
>
>the other hairy bit is determining the difference between a new
>(illegitimate) single-node membership, and the existing (legitimate)
>single-node membership. both are equally legitimate from the standpoint
>of each client, which can see the qdevice, but not the peer, and from
>the standpoint of the qdevice, which can see both clients.
>
>as such, i suspect that this all comes right back to figuring out how to
>implement issue #7.
>
>
>>>
>>> Knet adds a layer which monitors links between each of the node and it
>>> will make line active after it received configured number of "pong"
>>> packets. Idea behind is to have evidence of reasonable stable line. As
>>> long as line is not active no data packet goes thru (corosync traffic is
>>> just "data"). This basically means, that initial corosync multicast is
>>> not delivered to other nodes so corosync creates single node membership.
>>> After line becomes active "multicast" is delivered to other nodes and
>>> they move to gather state.
>>>
>>
>> I would expect "reasonable timeout" to also take in account knet delay.
>>
>>> So to answer you question. "Delay" is on both nodes side because link is
>>> not established between the nodes.
>>>
>>
>> knet was expected to improve things, was not it? :)
>>
>_______________________________________________
>Manage your subscription:
>https://lists.clusterlabs.org/mailman/listinfo/users
>
>ClusterLabs home: https://www.clusterlabs.org/

I would increase the consensus timeout by several seconds.

Best Regards,
Strahil Nikolov
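
To make the consensus suggestion concrete, here is a rough sketch of how such a value could be worked out. It assumes "consensus" means the consensus option in the totem section of corosync.conf, which, as far as I know from corosync.conf(5), is given in milliseconds and must be at least 1.2 * token. The token value, the number of extra milliseconds, and the helper names below are all made up for illustration, and whether a larger consensus actually avoids the race discussed in this thread is untested.

```python
# Sketch only: token_ms and extra_ms are illustrative values, not taken
# from this thread. Helper names are hypothetical.


def min_consensus_ms(token_ms: int) -> int:
    """Smallest consensus value allowed per corosync.conf(5): 1.2 * token."""
    # Integer form of ceil(1.2 * token_ms), avoiding float rounding.
    return (token_ms * 6 + 4) // 5


def raised_consensus_ms(token_ms: int, extra_ms: int) -> int:
    """Consensus timeout raised by extra_ms above the 1.2 * token minimum."""
    if token_ms <= 0:
        raise ValueError("token_ms must be positive")
    if extra_ms < 0:
        raise ValueError("extra_ms must not be negative")
    return min_consensus_ms(token_ms) + extra_ms


def totem_fragment(token_ms: int, extra_ms: int) -> str:
    """Render only the two relevant totem settings as a config fragment."""
    consensus = raised_consensus_ms(token_ms, extra_ms)
    return (
        "totem {\n"
        f"    token: {token_ms}\n"
        f"    consensus: {consensus}\n"
        "}\n"
    )


if __name__ == "__main__":
    # Assumed example: 3000 ms token, consensus raised by 3 seconds.
    print(totem_fragment(3000, 3000))
```

The fragment shows only token and consensus; in a real corosync.conf these lines would go into the existing totem section next to the other settings, on every node. As Sherrard already noted for the unreachable-peer timeout, the trade-off is slower membership changes and therefore slower failover.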