Hello Liviu! First of all, thank you for your detailed explanation. Now I
completely understand the approach that was taken when developing this
feature of the Clusterer module, and it looks logical to me: it leaves
less room for human error.
What I did for now is a clustering super-structure (that works apart from
OpenSIPS), and here is what happens when both nodes see each other again
(when recovering from networking issues): the shared IP keeps working on
the master side, and a one-shot systemd service runs
"clusterer_shtag_set_active" on the master right away, thus forcing the
stand-by node's tag back into the BACKUP state (a rough sketch of such a
call follows below). For now this scheme works perfectly. I may come up
with a more robust solution later; if that happens, I will share my
experience. Have a nice day!
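For reference, a minimal sketch of what such a one-shot call could look
like, assuming an OpenSIPS 3.x setup with the MI interface exposed over
HTTP (mi_http) on 127.0.0.1:8888; the endpoint, tag name and cluster id
are placeholders, not details from Donat's actual deployment:

    #!/usr/bin/env python3
    # Force the local node's sharing tag to ACTIVE through the OpenSIPS
    # MI interface. Meant to be launched by a one-shot systemd service
    # on the node that holds the shared IP, as described above.
    #
    # Placeholders (adjust for your setup): the URL assumes the mi_http
    # module listening on 127.0.0.1:8888, and the tag is assumed to be
    # named "vip" in cluster 1.
    import json
    import sys
    import urllib.request

    MI_URL = "http://127.0.0.1:8888/mi"  # mi_http endpoint (assumed)
    SHTAG = "vip/1"                      # <tag_name>/<cluster_id> (assumed)

    def mi_call(url, method, params):
        # Send a JSON-RPC 2.0 request to an OpenSIPS MI endpoint
        payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                              "method": method, "params": params}).encode()
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        reply = mi_call(MI_URL, "clusterer_shtag_set_active", [SHTAG])
        if "error" in reply:
            print("MI error: %s" % reply["error"], file=sys.stderr)
            sys.exit(1)
        print("sharing tag %s is now ACTIVE" % SHTAG)

A one-shot systemd unit on the master would then simply point its
ExecStart= at this script.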
On Sat, Apr 11, 2020 at 2:13 PM Liviu Chircu <[email protected]> wrote:

> On 09.04.2020 15:03, Donat Zenichev wrote:
> > I have a question, and it's almost theoretical.
> > The question relates to the Clusterer module and its behavior.
> >
> > Will the Clusterer module solve this contradiction on its own?
> > And if so, to which side is precedence given?
> >
> > The other way around would be to manually re-activate all services
> > once the whole cluster resumes normal operation (all nodes are
> > present). This guarantees that the shared tag will only be
> > activated on one of the sides.
>
> Hi, Donat!
>
> A very good question, and one that we had to answer ourselves when we
> came up with the current design. To begin with, in your scenario, for
> all OpenSIPS 2.4+ clusterer versions, after the link between the nodes
> comes back online you will have the following:
>
> * node A: ACTIVE state (holds the VIP), sharing tag state: ACTIVE (1)
> * node B: BACKUP state, sharing tag state: ACTIVE (1)
>
> The main reason behind this inconsistent state is that we did not
> provide an MI command to force a sharing tag to BACKUP (0), which
> could be triggered on node B's transition from ACTIVE -> BACKUP once
> the link is restored. Recovering from this state will therefore not
> happen automatically - you have to provide handling for this scenario
> as well (see the last paragraph).
>
> Reasoning behind this design
> ----------------------------
>
> Ultimately, our priority was not to get into solving consensus
> problems, Paxos algorithms, etc. What we wanted was a robust
> active/backup solution which you could flip back and forth with ease,
> thus achieving both High Availability and easy maintenance. By not
> providing a "set_sharing_tag vip 0" command, we _avoid_ the situation
> where, due to a developer error, both tags end up being BACKUP (0)!
> In such a scenario there would be no more CDRs, and you would be able
> to run infinite CPS/CC through that instance, since all call profile
> counters would be equal to 0. Neither instance takes responsibility
> for any call running through it, so a lot of data would be lost.
>
> On the flip side, in a scenario where both tags erroneously end up in
> the ACTIVE (1) state, you would have: duplicated CDRs (along with
> maybe some DB error logs due to conflicting unique keys) and possibly
> extra-counted calls, leading to a reduction of the maximum supported
> CC/CPS. Assuming the platform wasn't even at 50% of its max limits to
> begin with, the latter has zero impact on the live system. All things
> considered, this didn't sound that bad to us: no data loss, at the
> expense of a few error logs and possibly some tightened call limits.
>
> So you can see that we went for a design which minimizes the errors
> that developers can make and protects the overall system. The
> platform will work decently regardless of network conditions or how
> the tag-managing MI commands are sent or abused.
>
> How to automatically recover from the ACTIVE/ACTIVE sharing tag state
> ---------------------------------------------------------------------
>
> Given that the "clusterer_shtag_set_active" [1] MI command issued to
> a node will force all other nodes to transition from ACTIVE ->
> BACKUP, you could enhance your system with logic that sends this
> command to the opposite node any time a node's VIP performs the
> ACTIVE -> BACKUP transition. This should fix the original problem,
> where both tags end up in the ACTIVE state due to the link between
> the nodes being temporarily down, without either of the OSes
> necessarily being down.
>
> PS: we haven't implemented the above ^ ourselves yet, but it should
> work in theory :) Let me know if it works for you, if you do decide
> to plug this rare issue in your setup!
>
> Best regards,
>
> [1]:
> https://opensips.org/docs/modules/3.1.x/clusterer#mi_clusterer_shtag_set_active
>
> --
> Liviu Chircu
> www.twitter.com/liviuchircu | www.opensips-solutions.com
>
> OpenSIPS Summit, Amsterdam, May 2020
> www.opensips.org/events
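As a rough illustration of the hook Liviu suggests above, here is a
minimal sketch, assuming the VIP manager (e.g. a keepalived notify
script) invokes it with the new state as the last argument; the peer's
MI URL and the tag name are placeholders, not confirmed details:

    #!/usr/bin/env python3
    # Failover hook for the recovery logic described above: whenever
    # the local VIP transitions ACTIVE -> BACKUP, ask the *opposite*
    # node to take the sharing tag, which in turn forces the local copy
    # of the tag back to BACKUP.
    #
    # Placeholders (adjust for your setup): the peer MI URL and tag
    # name are examples; the script assumes the VIP manager passes the
    # new state as the last command-line argument.
    import json
    import sys
    import urllib.request

    PEER_MI_URL = "http://10.0.0.2:8888/mi"  # opposite node's MI (assumed)
    SHTAG = "vip/1"                          # <tag_name>/<cluster_id> (assumed)

    def mi_call(url, method, params):
        # JSON-RPC 2.0 call against an OpenSIPS MI endpoint
        payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                              "method": method, "params": params}).encode()
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        new_state = sys.argv[-1].upper() if len(sys.argv) > 1 else ""
        if new_state == "BACKUP":
            # We just lost the VIP: ensure the peer owns the tag, so
            # our own copy is pushed from ACTIVE back to BACKUP.
            mi_call(PEER_MI_URL, "clusterer_shtag_set_active", [SHTAG])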
--
Best regards,
Donat Zenichev
