Hi,

Sorry for the delay, I was (and still am) quite busy integrating our product with sipx 4.2. First of all, thanks for your reply. I will try to answer as best as I can.

>#1 - If you have two sites, Site 1 and Site 2 then typically, the IP
>subnets for both sites would be different... It seems to me that the
>scheme you present would impose that all nodes be part of the same subnet.

Yes, this is a requirement for this design. All three nodes have to be in the same subnet.

>#2 - Even if we overcome problem #1, sipXecs components cannot reconfigure
>the IP address they bind to on the fly. This means that the whole sipXecs
>would have to be brought up upon detecting the fail-over.

Yes, node 3 is a cold standby. Once the IP failover happens, the sipXecs on it is started up. It should have an exact copy of the config files of node 1 and the same IP address when it starts. In my experience the proxy and registrar come up pretty quickly, so the outage should be short.

>#3 - What is the trigger for initiating the 'fail-over' process? This is
>kind of tricky actually...

Yes it is. The first implementation I plan will be based on a requirement we have where there is a SIP trunk connected to each site and only one is active. There, the SIP trunk failover triggers the IP/SIPX failover. I would like to keep the failover detection logic separate, so that it can be changed later with minimum impact on the rest of the system.

>#4 - Sometimes, configuration changes made through sipXconfig affect
>multiple configuration files. If a crash happens in the middle of the
>generation of a multi-file configuration change then partial configuration
>will end up at the replicated node. It's hard to predict how the partial
>configuration will affect the fail-over system but I think it's fair to say
>that the results will not be what the customer expects.

I think initially it's OK if the profile is sent to the nodes again manually if this happens. It's a good point and worth thinking about.

>I am not a clustering expert so maybe there are trivial answers to the
>points I'm raising.
>I have not studied the techniques employed by clustering technologies
>out there but I'd be interested in getting your perspective on this.

Me neither. If there were trivial answers I wouldn't have brought this discussion to the list. ;)

BR,
Chris

-----Original Message-----
From: JOLY, ROBERT (ROBERT) [mailto:[email protected]]
Sent: Thursday, May 20, 2010 9:43 PM
To: Krisztian Ganyai; [email protected]
Subject: RE: Fully redundant sipxecs

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> Krisztian Ganyai
> Sent: Thursday, May 06, 2010 7:56 AM
> To: [email protected]
> Subject: [sipX-dev] Fully redundant sipxecs
>
> Hi,
>
> There are some services in sipXecs which are SPOFs even in
> redundant sipXecs deployments. These are the mediaservices
> (ACD, conferencing, VM) and the config (admin and user portal).
>
> Our goal is to build a fully HA system. We have come up with
> multiple ideas and some of them we discussed briefly with the
> sipx developers already. The most promising idea we have to
> date looks like this:
>
> .------------------------.   .---------------------------------------------.
> | Site 1                 |   | Site 2                                      |
> |                        |   |                                             |
> | .--------------------. |   | .-------------------. .-------------------. |
> | | Node 1             | |   | | Node 2            | | Node 3(cs)        | |
> | | (service IP)       | |   | |                   | |                   | |
> | |                    | |   | |                   | |                   | |
> | | .---------------.  | |   | | .---------------. | | .---------------. | |
> | | | Proxy(pri.)   |----------->| Proxy(sec.)   | | | | Proxy(pri.)   | | |
> | | '---------------'  | |   | | '---------------' | | '---------------' | |
> | | .---------------.  | |   | |                   | | .---------------. | |
> | | | Config        |  | |   | |                   | | | Config        | | |
> | | '---------------'  | |   | |                   | | '---------------' | |
> | | .---------------.  | |   | |                   | | .---------------. | |
> | | | Mediaservices |  | |   | |                   | | | Mediaservices | | |
> | | '---------------'  | |   | |                   | | '---------------' | |
> | |                    | |   | |                   | |                   | |
> | '--------------------' |   | '-------------------' '-------------------' |
> |           |            |   |                                 |           |
> '-----------|------------'   '---------------------------------|-----------'
>             |            .------------------------.            |
>             |            | Replicated config      |            |
>             '----------->| (using DRBD)           |<-----------'
>                          '------------------------'
> Figure 1: Before failover
>
> The idea is to extend the basic redundant configuration
> sipfoundry suggests (Node 1 on Site 1 and Node 2 on Site 2)
> with a 3rd node (Node 3) on Site 2. Node 3 acts as a cold
> standby primary node and has all configuration files synced
> to/from the primary server.
>
> The configuration directories (/etc/sipxpbx and
> /var/sipxdata) on the primary Node 1 and the cold standby
> Node 3 would be replicated using DRBD, which is a
> filesystem replication facility available in CentOS.
>
> Since the configuration is shared, Node 1 and Node 3 have
> to have the same IP address, as we don't want to mess around
> with the config files.
> To avoid IP conflicts, there would be 3 IP addresses for
> Node 1 and Node 3. Of the three, one would be the "service
> IP", for which sipXecs is configured. When both primary nodes
> start, they come up with one of the non-service IP addresses
> and with some mechanism they decide who should be active. The
> active one takes over the service IP (using ifconfig eth0,
> for example) and starts the sipXecs services. When failover
> happens (the previous active node goes down), the other node
> takes over the service IP and starts the sipXecs services.
>
> .------------------------.   .---------------------------------------------.
> | Site 1                 |   | Site 2                                      |
> | (DOWN)                 |   |                                             |
> | .--------------------. |   | .-------------------. .-------------------. |
> | | Node 1             | |   | | Node 2            | | Node 3            | |
> | | (DOWN)             | |   | |                   | | (service IP)      | |
> | |                    | |   | |                   | |                   | |
> | | .---------------.  | |   | | .---------------. | | .---------------. | |
> | | | Proxy(pri.)   |  | |   | | | Proxy(sec.)   |<----| Proxy(pri.)   | | |
> | | '---------------'  | |   | | '---------------' | | '---------------' | |
> | | .---------------.  | |   | |                   | | .---------------. | |
> | | | Config        |  | |   | |                   | | | Config        | | |
> | | '---------------'  | |   | |                   | | '---------------' | |
> | | .---------------.  | |   | |                   | | .---------------. | |
> | | | Mediaservices |  | |   | |                   | | | Mediaservices | | |
> | | '---------------'  | |   | |                   | | '---------------' | |
> | |                    | |   | |                   | |                   | |
> | '--------------------' |   | '-------------------' '-------------------' |
> |           |            |   |                                 |           |
> '-----------|------------'   '---------------------------------|-----------'
>             |            .------------------------.            |
>             |            | Replicated config      |            |
>             '----------->| (using DRBD)           |<-----------'
>                          '------------------------'
> Figure 2: After failover
>
> Since it's probably not only us who'd like to have a fully
> redundant sipXecs, we decided not to do it undercover but to
> share our ideas/progress with the community. I hope we can
> come up with something usable.
> Chris

Chris, thank you very much for putting this together. Sorry for the late reply. Although your ASCII art didn't come out nice and clean on my e-mail reader, I got an idea of what you are trying to do. There are a couple of fundamental elements I cannot quite piece together. Here they are.

#1 - If you have two sites, Site 1 and Site 2, then typically the IP subnets for both sites would be different. With the description you present, it appears that the Service IP address needs to be routable in both sites. How would you manage that? It seems to me that the scheme you present would impose that all nodes be part of the same subnet.
That way the service IP address remains routable when 'transitioning' from one node to the other. Such an IP address 'transition' would require that you send out gratuitous ARPs to update everyone's ARP table.

#2 - Even if we overcome problem #1, sipXecs components cannot reconfigure the IP address they bind to on the fly. This means that the whole sipXecs would have to be brought up upon detecting the fail-over.

#3 - What is the trigger for initiating the 'fail-over' process? This is kind of tricky actually. If you have some kind of ping going between nodes 1 and 3, how do we deal with cases where the network connectivity between 1 and 3 is broken but both nodes remain operational and reachable to a subset of devices? Would that not lead to a situation where both 1 and 3 think that they should be active, leaving two active primary servers out there? And during the time when both are active, if some configuration changes are independently made on both nodes, how do we reconcile/merge those changes when the connectivity problems between nodes 1 and 3 heal?

#4 - Sometimes, configuration changes made through sipXconfig affect multiple configuration files. If a crash happens in the middle of the generation of a multi-file configuration change then partial configuration will end up at the replicated node. It's hard to predict how the partial configuration will affect the fail-over system but I think it's fair to say that the results will not be what the customer expects.

I am not a clustering expert so maybe there are trivial answers to the points I'm raising. I have not studied the techniques employed by clustering technologies out there but I'd be interested in getting your perspective on this.

Thanks again,
bob

_______________________________________________
sipx-dev mailing list [email protected]
List Archive: http://list.sipfoundry.org/archive/sipx-dev
Unsubscribe: http://list.sipfoundry.org/mailman/listinfo/sipx-dev
sipXecs IP PBX -- http://www.sipfoundry.org/
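
For readers of the thread, here is a rough Python sketch of two ideas discussed above: keeping the failover detection logic separate from the takeover steps, and the takeover order on the cold standby (take the service IP, send gratuitous ARPs, start sipXecs). Nothing here comes from sipXecs itself. The names ServiceIp, takeover_commands, check_once and sip_trunk_failed_over are hypothetical, the address is a documentation example, and the commands (an ifconfig alias, iputils "arping -U", "service sipxecs start") are assumptions about how a node might do this. The sketch only builds and prints the commands; it does not run them.

```python
from dataclasses import dataclass
from typing import Callable, List

Command = List[str]


@dataclass(frozen=True)
class ServiceIp:
    address: str              # the service IP sipXecs is configured for
    netmask: str
    interface: str = "eth0"   # assumed interface name


def takeover_commands(ip: ServiceIp) -> List[Command]:
    """Commands a cold standby might run, in order (all assumptions)."""
    alias = f"{ip.interface}:0"
    return [
        # 1. Bring the service IP up as an alias next to the node's own IP.
        ["ifconfig", alias, ip.address, "netmask", ip.netmask, "up"],
        # 2. Gratuitous ARP so neighbours update their ARP tables.
        ["arping", "-U", "-c", "3", "-I", ip.interface, ip.address],
        # 3. Only now start sipXecs, so it binds to the service IP.
        ["service", "sipxecs", "start"],
    ]


def check_once(detector: Callable[[], bool],
               ip: ServiceIp,
               run: Callable[[Command], None]) -> bool:
    """Ask the pluggable detector; if it fires, hand each command to `run`.

    The detector is the only part that knows *why* failover is needed,
    so it can be swapped without touching the takeover steps.
    """
    if not detector():
        return False
    for cmd in takeover_commands(ip):
        run(cmd)
    return True


def sip_trunk_failed_over() -> bool:
    """Placeholder detector; a real one would watch the SIP trunk state."""
    return True


if __name__ == "__main__":
    ip = ServiceIp("192.0.2.10", "255.255.255.0")
    # Dry run: print the commands instead of executing them.
    check_once(sip_trunk_failed_over, ip, lambda cmd: print(" ".join(cmd)))
```

Passing the detector and the command runner in as plain callables is just one way to keep the trigger replaceable; it does not address the split-brain question raised in #3, which would need a decision both nodes can agree on.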
