Hi everybody! I like the approach, but here are some thoughts.
I think the X-seconds delay should not pause all of the OpenSIPS work. It should run asynchronously, allowing requests to be processed even before the data is synced. For example, for syncing I use a systemd "ExecStartPost" script, so it runs once OpenSIPS has already started.
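Roughly like this, if it helps (just a sketch - the drop-in file name, the sleep value and the opensipsctl path are from my setup and will likely differ on yours):

    # /etc/systemd/system/opensips.service.d/cluster-sync.conf
    [Service]
    # Runs after the opensips process itself has been started, so the node
    # is already handling requests while the usrloc data is pulled in.
    # The short sleep gives the clusterer a moment to see the other nodes.
    ExecStartPost=/bin/sh -c 'sleep 5; /usr/sbin/opensipsctl fifo ul_cluster_sync'

OpenSIPS is already up when this runs, so the sync simply completes in the background of the start job.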
(And, by the way, John, be careful: don't run "ul_cluster_sync" when you start the "seed" node first, before any other node is running. It leaves the cluster "Not synced".)

Let's imagine the "seed" node starts and finds 2 nodes (or more): which one should it choose to sync from? And if they hold different data (they were never synced with each other), what should it do?

Thanks.

On Thu, 3 Jan 2019 at 11:33, Liviu Chircu <[email protected]> wrote:

> Happy New Year John, Alexey and everyone else!
>
> I just finished catching up with this thread, and I must admit that I now
> concur with John's distaste of the asymmetric nature of cluster node
> restarts!
>
> Although it is correct and gets the job done, the 2.4 "seed" mechanism forces
> the admin to conditionally add an "opensipsctl fifo ul_cluster_sync" command
> into the startup script of all "seed" nodes. I think we can do better :)
>
> What if we kept the "seed" concept, but tweaked it such that instead of
> meaning:
>
> "following a restart, always start in 'synced' state, with an empty dataset"
>
> ... it would now mean:
>
> "following a restart or cluster sync command, fall back to a 'synced' state,
> with an empty dataset if and only if we are unable to find a suitable sync
> candidate within X seconds"
>
> This solution seems to fit all requirements that I've seen posted so far.
> It is:
>
> * correct (a cluster with at least 1 "seed" node will still never deadlock)
> * symmetric (with the exception of cluster bootstrapping, all node
>   restarts are identical)
> * autonomous (users need not even know about "ul_cluster_sync" anymore!
>   Not saying this is necessarily good, but it brings down the learning curve)
>
> The only downside could be that any cluster bootstrap will now last at
> least X seconds. But that seems such a rare event (in production, at least)
> that we need not worry about it. Furthermore, the X seconds will be
> configurable.
>
> What do you think?
>
> PS: by "cluster bootstrap" I mean (re)starting all nodes simultaneously.
>
> Best regards,
>
> Liviu Chircu
> OpenSIPS Developer
> http://www.opensips-solutions.com
>
> On 02.01.2019 12:24, John Quick wrote:
> > Alexey,
> >
> > Thanks for your feedback. I acknowledge that, in theory, a situation may
> > arise where a node is brought online and all the previously running nodes
> > were not fully synchronised, so it is then a problem for the newly started
> > node to know which data set to pull. In addition to the example you give -
> > lost interconnection - I can also foresee difficulties when several nodes
> > all start at the same time. However, I do not see how arbitrarily setting
> > one node as "seed" will help to resolve either of these situations unless
> > the seed node has more (or better) information than the others.
> >
> > I am trying to design a multi-node solution that is scalable. I want to be
> > able to add and remove nodes according to the current load. Also, to be
> > able to take one node offline, do some maintenance, then bring it back
> > online. For my scenario, the probability of any node being taken offline
> > for maintenance during the year is 99.9%, whereas I would say the
> > probability of partial loss of LAN connectivity (causing the split-brain
> > issue) is less than 0.01%.
> >
> > If possible, I would really like to see an option added to the usrloc
> > module to override the "seed" node concept. Something that allows any node
> > (including seed) to attempt to pull registration details from another node
> > on startup. In my scenario, a newly started node with no usrloc data is a
> > major problem - it could take 40 minutes to get close to having a full set
> > of registration data. I would prefer to take the risk of it pulling data
> > from the wrong node rather than it not attempting to synchronise at all.
> >
> > Happy New Year to all.
> >
> > John Quick
> > Smartvox Limited
> >
> >> Hi John,
> >>
> >> Next is just my opinion, and I didn't explore the OpenSIPS source code
> >> for syncing data.
> >> The problem is a little bit deeper. As we have a cluster, we potentially
> >> have split-brain.
> >> We could disable the seed node altogether and just let the nodes work
> >> after a disaster/restart. But that means we can't guarantee consistency
> >> of the data, so the nodes must show this with the <Not in sync> state.
> >> Usually clusters use a quorum to decide whom to trust, but for OpenSIPS I
> >> think this approach is too expensive. And of course a quorum needs a
> >> minimum of 3 hosts.
> >> For 2 hosts, after losing/restoring the interconnection it is impossible
> >> to say which host has consistent data. That's why OpenSIPS uses the seed
> >> node as an artificial trust point. I think the <seed> node doesn't solve
> >> the syncing problems, but it simplifies the overall work.
> >> Let's imagine 3 nodes A, B, C. A is Active. A and B lose interconnection.
> >> C is down. Then C comes up and has 2 hosts to sync from. But A already
> >> has 200 phones re-registered for some reason. So we have 200 conflicts
> >> (on node B the same phones are still in memory). Where to sync from? The
> >> <seed> host will answer this question in 2 cases (A or B). Of course, if
> >> C is the <seed>, it will just be happy from the start. And I actually
> >> don't know what happens if we now run <ul_cluster_sync> on C. Will it get
> >> all the contacts from A and B or not?
> >> We operate with specific data, which is temporary, so the syncing policy
> >> can be more relaxed. Maybe it's a good idea to somehow tie the <seed>
> >> node to the Active role in the cluster. But again, if the Active node
> >> restarts and is still Active, we will have a problem.
> >> -----
> >> Alexey Vasilyev

--
Best regards
Alexey Vasilyev
