Sherrard Burton wrote:


On 4/8/20 1:09 PM, Andrei Borzenkov wrote:
08.04.2020 10:12, Jan Friesse wrote:
Sherrard,

i could not determine which of these sub-threads to include this in,
so i am going to (reluctantly) top-post it.

i switched the transport to udp, and in limited testing i seem to not
be hitting the race condition. of course i have no idea whether this
will behave consistently, or which part of the knet vs udp setup makes
the most difference.

ie, is it the overhead of the crypto handshakes/setup? is there some
other knet layer that imparts additional delay in establishing
connection to other nodes? is the delay on the rebooted node, the
standing node, or both?


At a very high level, this is what happens in corosync when using udpu:
- Corosync starts in the gather state and sends a "multicast" message
(emulated by unicast to all expected members) saying "I'm here and this
is my view of live nodes".
- In this state, corosync waits for answers.
- When a node receives this message, it "multicasts" the same message
with its updated view of live nodes.
- After all nodes agree, they move to the next state (commit/recovery
and finally operational).

With udp, this happens instantly, so most of the time corosync doesn't
even create a single-node membership, which would be created only if no
other nodes existed and/or replies weren't delivered in time.
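
The gather exchange described above can be sketched as a toy model. This is only an illustration under stated assumptions, not corosync code: the function name, the node names, the delays and the timeout value are all made up, and the real totem protocol tracks much more state.

```python
def form_membership(me, expected, reply_delay, gather_timeout):
    """Return the membership `me` ends up with after the gather state.

    reply_delay maps a peer name to the seconds until its reply arrives,
    or None if the peer never answers. All values are illustrative.
    """
    view = {me}  # initial view: only ourselves
    for peer in expected:
        if peer == me:
            continue
        delay = reply_delay.get(peer)
        # A peer joins the view only if its reply arrives within the timeout.
        if delay is not None and delay <= gather_timeout:
            view.add(peer)
    return sorted(view)


# Replies arrive almost instantly (the udp/udpu case).
fast = form_membership("node1", ["node1", "node2"], {"node2": 0.01}, gather_timeout=1.0)
# Replies are late: a single-node membership is created.
slow = form_membership("node1", ["node1", "node2"], {"node2": 5.0}, gather_timeout=1.0)
print(fast)  # ['node1', 'node2']
print(slow)  # ['node1']
```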


Is it possible to delay "creating single node membership" until some
reasonable initial timeout after corosync starts, to ensure the node's
view of the cluster is up to date? It is clear that there will always be
some corner cases, but at least this would make an "obviously correct"
configuration behave as expected.

Corosync must already have a timeout for declaring peers unreachable; it
sounds most logical to use that in this case.
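
A minimal sketch of the proposed startup delay, purely hypothetical: corosync does not necessarily behave this way, and the function name, the return strings and the timeout value are assumptions made up for illustration.

```python
def startup_decision(elapsed, peers_seen, initial_timeout):
    """Hypothetical policy for a freshly started node at time `elapsed` (seconds)."""
    if peers_seen:
        # At least one peer answered: no need for a single-node membership.
        return "form membership with peers"
    if elapsed < initial_timeout:
        # Still inside the grace period: do not declare ourselves alone yet.
        return "keep waiting"
    # Grace period expired with no answers: fall back to a single-node membership.
    return "form single-node membership"


print(startup_decision(0.5, peers_seen=False, initial_timeout=3.0))
print(startup_decision(1.2, peers_seen=True, initial_timeout=3.0))
print(startup_decision(3.5, peers_seen=False, initial_timeout=3.0))
```

The cost of such a policy is that a node which really is alone waits the full timeout before it can form a membership.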


i tossed that idea around in my head as well. basically, if there were an analogue of client_leaving called client_joining, it could be used to allow the qdevice to return 'ask later'.

It is there.


i think the trade-off here is that you sacrifice some responsiveness in your failover times, since (i'm guessing) the timeout for declaring peers unreachable errs on the side of caution.

the other hairy bit is determining the difference between a new (illegitimate) single-node membership, and the existing (legitimate) single-node membership. both are equally legitimate from the standpoint of each client, which can see the qdevice, but not the peer, and from the standpoint of the qdevice, which can see both clients.

Yep. Actually, this is a situation I really hadn't thought about. It is quite special, because for more than 2 nodes it works as it should (a single-node partition never gets a vote then). That doesn't mean the 2-node cluster is not important; quite the opposite, this is where qdevice makes sense.
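
The vote arithmetic behind this can be sketched as follows. The sketch assumes one vote per node, a qdevice that contributes a single vote, and a simple majority quorum; the function and parameter names are made up for illustration and are not corosync APIs.

```python
def is_quorate(partition_nodes, total_nodes, has_qdevice_vote, qdevice_votes=1):
    """Return True if a partition of `partition_nodes` nodes reaches quorum.

    Assumes one vote per node and a majority of all expected votes.
    """
    expected_votes = total_nodes + qdevice_votes
    quorum = expected_votes // 2 + 1
    votes = partition_nodes + (qdevice_votes if has_qdevice_vote else 0)
    return votes >= quorum


# 2-node cluster: one node plus the qdevice vote is enough (2 of 3 votes).
two_node = is_quorate(1, total_nodes=2, has_qdevice_vote=True)
# 3-node cluster: one node could not be quorate even with the qdevice vote (2 of 4).
three_node = is_quorate(1, total_nodes=3, has_qdevice_vote=True)
print(two_node, three_node)  # True False
```

Under these assumptions a lone node can only become quorate in the 2-node case, which is why the question of which single-node membership gets the qdevice vote matters there.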


as such, i suspect that this all comes right back to figuring out how to implement issue #7.

It's not hard, it is just quite some work to do. I'm on it, but I have no ETA yet (and of course the current real-life situation doesn't help much). When I have something, I will let you know, and I would be happy if you were able to test it.

Regards,
  Honza




Knet adds a layer which monitors the links between each of the nodes, and
it marks a link active after it has received a configured number of
"pong" packets. The idea is to have evidence of a reasonably stable
link. As long as a link is not active, no data packets go through
(corosync traffic is just "data"). This basically means that the initial
corosync "multicast" is not delivered to the other nodes, so corosync
creates a single-node membership. After the link becomes active, the
"multicast" is delivered to the other nodes and they move to the gather
state.
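
The link gating described above can be illustrated with a small model. It is a sketch only: the class, its methods and the threshold of 2 pongs are assumptions for illustration and do not reflect the knet API or its actual link state machine.

```python
class Link:
    """Toy model of a link that carries data only after enough pongs."""

    def __init__(self, pong_count):
        self.pong_count = pong_count  # pongs required before the link is active
        self.pongs = 0
        self.delivered = []

    @property
    def active(self):
        return self.pongs >= self.pong_count

    def on_pong(self):
        self.pongs += 1

    def send_data(self, packet):
        # Data packets are dropped while the link is not yet active.
        if not self.active:
            return False
        self.delivered.append(packet)
        return True


link = Link(pong_count=2)
early = link.send_data("join from node1")  # link not active yet: dropped
link.on_pong()
link.on_pong()
late = link.send_data("join from node1")   # link active: delivered
print(early, late, link.delivered)
```

In this model the first join message is lost, which corresponds to the window in which the starting node sees no peers and forms a single-node membership.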


I would expect a "reasonable timeout" to also take the knet delay into account.

So, to answer your question: the "delay" is on both nodes' side, because
the link is not established between the nodes.


knet was expected to improve things, wasn't it? :)


