Happy New Year John, Alexey and everyone else!

I just finished catching up with this thread, and I must admit that I now
concur with John's distaste for the asymmetric nature of cluster node restarts!

Although it is correct and gets the job done, the 2.4 "seed" mechanism forces
the admin to conditionally add an "opensipsctl fifo ul_cluster_sync" command
to the startup script of all "seed" nodes.  I think we can do better :)

What if we kept the "seed" concept, but tweaked it such that instead of meaning:

"following a restart, always start in 'synced' state, with an empty dataset"

... it would now mean:

"following a restart or cluster sync command, fall back to a 'synced' state,
with an empty dataset if and only if we are unable to find a suitable sync
candidate within X seconds"

This solution seems to fit all requirements that I've seen posted so far.  It is:

* correct (a cluster with at least 1 "seed" node will still never deadlock)
* symmetric (with the exception of cluster bootstrapping, all node restarts are identical)
* autonomous (users need not even know about "ul_cluster_sync" anymore!  Not saying this is necessarily good, but it brings down the learning curve)

The only downside could be that any cluster bootstrap will now last at least X seconds. But that seems such a rare event (in production, at least) that we need not worry
about it.  Furthermore, the X seconds will be configurable.
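In pseudo-code, the tweaked "seed" behavior would look roughly like this (a sketch only -- the names find_sync_candidate(), sync_from() and seed_fallback_interval are made up for illustration and are not actual OpenSIPS internals):

```python
import time

def startup_sync(is_seed, find_sync_candidate, sync_from,
                 seed_fallback_interval=5.0, poll_interval=0.5):
    """Return the state a node ends up in after a restart (sketch)."""
    deadline = time.monotonic() + seed_fallback_interval
    while time.monotonic() < deadline:
        candidate = find_sync_candidate()
        if candidate is not None:
            sync_from(candidate)       # normal path: pull the full dataset
            return "synced"
        time.sleep(poll_interval)
    if is_seed:
        # fallback: no donor found within X seconds -- only then does a
        # "seed" node declare itself synced with an empty dataset, so a
        # full cluster bootstrap can still never deadlock
        return "synced (empty dataset)"
    return "not synced"                # non-seed nodes keep waiting
```

The key difference from 2.4 is that a restarted seed node first behaves exactly like any other node, and the empty-dataset shortcut only kicks in after the X-second window expires.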

What do you think?

PS: by "cluster bootstrap" I mean (re)starting all nodes simultaneously.

Best regards,

Liviu Chircu
OpenSIPS Developer
http://www.opensips-solutions.com

On 02.01.2019 12:24, John Quick wrote:
Alexey,

Thanks for your feedback. I acknowledge that, in theory, a situation may
arise where a node is brought online and all the previously running nodes
were not fully synchronised so it is then a problem for the newly started
node to know which data set to pull. In addition to the example you give -
lost interconnection - I can also foresee difficulties when several nodes
all start at the same time. However, I do not see how arbitrarily setting
one node as "seed" will help to resolve either of these situations unless
the seed node has more (or better) information than the others.

I am trying to design a multi-node solution that is scalable. I want to be
able to add and remove nodes according to current load. Also, to be able to
take one node offline, do some maintenance, then bring it back online. For
my scenario, the probability of any node being taken offline for maintenance
during the year is 99.9% whereas I would say the probability of partial loss
of LAN connectivity (causing the split-brain issue) is less than 0.01%.

If possible, I would really like to see an option added to the usrloc module
to override the "seed" node concept. Something that allows any node
(including seed) to attempt to pull registration details from another node
on startup. In my scenario, a newly started node with no usrloc data is a
major problem - it could take it 40 minutes to get close to having a full
set of registration data. I would prefer to take the risk of it pulling data
from the wrong node rather than it not attempting to synchronise at all.

Happy New Year to all.

John Quick
Smartvox Limited


Hi John,

What follows is just my opinion, and I haven't explored the OpenSIPS source
code for data syncing.
The problem is a little bit deeper. Since we have a cluster, we potentially
have split-brain.
We could disable the seed node entirely and just let nodes keep working after
a disaster/restart. But that means we cannot guarantee data consistency,
so nodes must show this with a <Not in sync> state.
Clusters usually rely on a quorum as the trust source. But for OpenSIPS I
think this approach is too expensive, and of course a quorum needs a minimum
of 3 hosts. With 2 hosts, after losing/restoring the interconnection it is
impossible to say which host has consistent data. That's why OpenSIPS uses
the seed node as an artificial trust point. I don't think the <seed> node
solves the syncing problems, but it does simplify the overall job.
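To illustrate the quorum point with a toy sketch (not OpenSIPS code): a partition can only trust its own data if it holds a strict majority of the cluster, which is impossible for either side of a 2-node split.

```python
# Toy majority check: a partition has quorum only if it holds a
# strict majority of the full cluster membership.
def has_quorum(partition_size, cluster_size):
    return partition_size > cluster_size // 2

print(has_quorum(1, 2))  # False -- a 1-1 split leaves neither side trusted
print(has_quorum(2, 3))  # True  -- with 3 nodes, one partition can still win
```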
Let's imagine 3 nodes A, B, C. A is Active. A and B lose their
interconnection. C is down. Then C comes up and has 2 hosts to sync from.
But A already has 200 phones re-registered for some reason. So we have 200
conflicts (node B still holds the same phones in memory). Where to sync
from? The <seed> host answers this question in 2 of the cases (A or B). Of
course, if C is the <seed>, it will simply be happy from the start. And I
actually don't know what happens if we now run <ul_cluster_sync> on C: will
it get all the contacts from both A and B, or not?
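For what it's worth, the A/B divergence in that scenario can be sketched as a toy model (node names, contact values and the count of 200 are all illustrative):

```python
# Toy model of the split-brain scenario: A and B diverged during the
# network split, so they hold conflicting bindings for the same phones.
node_a = {f"phone{i}": "contact@A" for i in range(200)}  # re-registered on A
node_b = {f"phone{i}": "contact@B" for i in range(200)}  # stale copies on B

# Node C sees two potential donors, but every shared AoR conflicts --
# C has no way to tell which dataset is correct without a trust point.
conflicts = [aor for aor in node_a if node_b.get(aor) != node_a[aor]]
print(len(conflicts))  # 200
```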
We operate on data which is specific and temporary, so the syncing policy
can be more relaxed. Maybe it's a good idea to somehow tie the <seed> role
to the Active role in the cluster. But again, if the Active node restarts
and is still Active, we will have a problem.
-----
Alexey Vasilyev

_______________________________________________
Users mailing list
[email protected]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users