Using DNS as the primary means of facilitating communication between endpoints, sipXecs servers, and gateways has proven to be an excellent way to enable load balancing and, to a lesser extent, redundancy of sipXecs services.
As it is currently implemented in sipXecs, however, DNS is also the root cause of many outages and issues. This is because the DNS configuration required for proper sipXecs operation is complex for most network engineers and administrators, and very difficult for most telecom engineers to understand. It will only become more complex as future versions of sipXecs allow many more servers to be added to each sipXecs cluster, each possibly deployed at a remote site for survivability or load balancing.

It has also been observed that individual sipXecs processes make an unnecessarily large number of DNS queries, which can result in network congestion and extra load on the sipXecs cluster. All sipXecs services require proper DNS records to facilitate interservice communication. The current method sipXecs services use to learn these records is to query DNS at seemingly random times:

"2011-06-15T21:19:15.062826Z":9:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:15.062883Z":10:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:15.062937Z":11:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:15.067976Z":13:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:15.068022Z":14:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:15.068162Z":15:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:40.386864Z":22:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:40.386904Z":23:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:40.386919Z":24:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.355097Z":27:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.355154Z":28:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.355168Z":29:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.360783Z":31:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.360851Z":32:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:45.360942Z":33:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:53.439959Z":44:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:53.440031Z":45:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:53.440075Z":46:SIP:WARNING:uc.example.com:SipSrvLookupThread-24:4284E940:SipXProxy:"DNS query for name '_sip._tls.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:53.442692Z":47:SIP:WARNING:uc.example.com:SipSrvLookupThread-23:4274D940:SipXProxy:"DNS query for name '_sip._tcp.example.com', type = 33 (SRV): returned error"
"2011-06-15T21:19:53.442697Z":48:SIP:WARNING:uc.example.com:SipSrvLookupThread-22:4264C940:SipXProxy:"DNS query for name '_sip._udp.example.com', type = 33 (SRV): returned error"

While the traffic and load generated by all of these requests is generally tolerable, it should be noted that:

- If any of the expected records, such as A records or the SRV records for _sip._tcp, _sip._udp, and _sip._tls, are missing, even if they are not necessary, then every service writes a log entry each time it looks one of them up. Because these lookups happen constantly, log files can become large rather quickly.
- Because each service is constantly performing DNS lookups, an unexpected outage or delay on the primary or caching DNS server directly affects signaling between services and as a result can cause service issues.
- This behavior negates the benefit of TTL values for DNS records.

It is apparent that the unnecessary DNS queries should be eliminated in favor of a much more conservative query approach that honors the TTL values received from the DNS server. This would result in much less query traffic and would possibly reduce the load on sipXecs services. One consideration for this approach: if a DNS value is changed shortly before an individual service restarts, and the TTL has not yet expired on the remaining running services, then there could be a potentially fatal conflict between the restarted service and the services that have not yet updated their internal DNS information.
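To make the TTL-based approach above a little more concrete, here is a minimal Python sketch of a cache that only re-queries once the TTL returned with a record set has expired. It is an illustration under stated assumptions, not sipXecs code: the class name TtlDnsCache, the shape of the lookup callback, and the flush() hook (one way a service could discard records it suspects are stale) are all hypothetical. The real resolver call is injected as a function, so the sketch runs without any network access.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not part of sipXecs.
Record = Tuple[str, int]                              # (target host, port)
Lookup = Callable[[str], Tuple[List[Record], int]]    # name -> (records, TTL in seconds)


class TtlDnsCache:
    """Caches lookup results until the TTL that came with them expires."""

    def __init__(self, lookup: Lookup,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup      # assumed to perform the actual DNS query
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[Record]]] = {}

    def resolve(self, name: str) -> List[Record]:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]        # still fresh: no query is sent
        records, ttl = self._lookup(name)
        self._entries[name] = (now + ttl, records)
        return records

    def flush(self) -> None:
        """Drop all cached records so the next resolve() queries again."""
        self._entries.clear()


if __name__ == "__main__":
    queries = []

    def fake_lookup(name: str) -> Tuple[List[Record], int]:
        # Stand-in for a real SRV query; the answer and TTL are made up.
        queries.append(name)
        return [("uc.example.com", 5060)], 300

    cache = TtlDnsCache(fake_lookup)
    for _ in range(1000):
        cache.resolve("_sip._tcp.example.com")
    print(len(queries), "query sent for 1000 resolutions")
```

Failed lookups are not cached in this sketch. Given the log excerpt above, a real implementation would probably also want to remember negative answers for a short period, so that a missing record does not trigger a query and a log entry on every call.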
A possible solution for this conflict is for sipXsupervisor to send a signal to all running DNS-dependent processes, forcing a DNS refresh, whenever an individual sipXecs service is restarted or reloaded.

*Linux DNS subsystem failover*

The default behavior of the DNS client subsystem in Linux is to wait a full 5 seconds for a response from the primary DNS server before attempting to contact the secondary server. In many cases, by the time this 5 second delay has elapsed the request has already timed out and the signaling has been cancelled with an error. The resolution to this problem at first glance seems simple. According to http://forums.whirlpool.net.au/archive/592813 a few small configuration changes to /etc/resolv.conf will allow for a much shorter timeout before failing over to the secondary DNS server. This will require further testing to ensure that it resolves the DNS failover issue.

*Simplified DNS Management from Administration Web Interface*

Advanced DNS configurations are a requirement for proper branch site setups. These configurations are usually beyond the technical capabilities of many telecom engineers and even many network administrators and engineers. To accommodate these groups it will be necessary to simplify DNS management and centralize DNS configuration within sipXecs. Simplification of DNS configuration is the eventual goal, as there are many documented cases where DNS was the root of a system outage or issue.

In anticipation of the 4.6 or 4.8/5.0 release of sipXecs with a MongoDB database backend, http://wiki.sipfoundry.org/display/sipXecs/Location+based+DNS+views+for+sipXecs+using+BIND gives an outline of proper branch site DNS configuration. However, the only comprehensive management tool for configuration of this type is Webmin (http://www.webmin.com), which is not entirely intuitive in its approach to DNS but does provide a complete package for managing existing DNS zones and views.
Some thoughts for adding a comprehensive DNS management suite into sipXecs are:

- Create a default view for subnets that are not assigned to a particular branch.
- Create a view for each branch that a server is assigned to and copy the zone from the default view to the new view assigned to the branch, then allow the administrator to define server priorities per branch.
- Assign subnets to branches, which will in turn be applied to the view assigned to each branch.

Views only need to be configured on the primary DNS server. Secondary DNS servers located on secondary sipXecs servers, when set up with slave zones of the primary server, will retrieve the zone that has been assigned to the branch/view that the secondary sipXecs server is assigned to.

It should also be noted that Microsoft Windows DNS server does not currently have a function similar to BIND views, so it is not yet known how sipXecs will address this issue in large multi-site deployments. One thought for addressing this: if a Windows DC runs DNS for a branch site, that DC can be set up to forward lookups to the sipXecs DNS system, which would return the zone connected to the view that's assigned to the subnet the DC is on.

Please post comments and ideas on this subject. I'd like to see what the community and the engineers think is the best way to approach these issues.

Thanks!
