[sipX-dev] XX-4790: HA pullUpdates request can time out if there are too many changes

Scott Lawrence Mon, 07 Dec 2009 09:08:36 -0800

From XX-4790:

        If one of the registrars is down for a time, and during that
        time its peer(s) process a large number of registrations (100s
        or more), then when the failed registrar comes back up it will
        do a pullUpdates request and the peer will create a very large
        response. Preparing that large response can take several seconds
        (one item to look into - making this faster), but the requester
        times out after only 3 seconds.


We've temporarily made that timeout longer, but I'd like to discuss some
possibilities for a more permanent solution.

Background on how the registry synchronization works is in:
        
http://sipxecs.sipfoundry.org/rep/sipXecs/main/sipXregistry/doc/SyncDesign.html

A simplified view of how sync works is that when each registrar starts
up, it does an xml-rpc call to each of its peers asking for all the
registrations that peer has received since the last update (this is a
'pull' operation), then does a 'reset' xml-rpc to make sure that the
update numbers are synchronized, and subsequently does an xml-rpc call
to 'push' each registration it receives to each of its peers.  The
registrar does not start accepting SIP traffic until after the 'pull'
and 'reset' operations are done or have failed, so that it doesn't
provide any redirects until it is as synchronized as possible.

It is possible for a very large 'pull' operation to fail, which has the
effect that registry startup is delayed and, more important, there then
follows a period of time where the registrar is operational but not
really up to date.  Eventually, the one-update-at-a-time 'push'
operations should catch it up, but this might also take a while (and I'm
not sure that we've really verified this recovery in testing).

It seems to me that the first and simplest fix is to modify the startup
'pull' phase by introducing a maximum number of updates allowed in a
single xml-rpc.  Instead of doing one 'pull' to each peer, the startup
would loop, repeating the pull operation until it either fails (the peer
becomes unavailable) or the number of updates returned is less than the
maximum.   I think this change is backward-compatible (I don't think we
need to change the xml-rpc interface definition): if a system with the
limit receives more updates than it expects, it should just process
them; if a system gets the maximum but doesn't repeat the pull (it has
not been updated yet), then it will eventually catch up via push (the
current situation).
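
A minimal sketch of the looping pull, in Python for brevity.  The limit of
100, the (update number, data) tuple shape, and the names pull_all,
pull_updates and apply are all assumptions made for illustration; none of
them come from the existing xml-rpc interface:

```python
from typing import Callable, List, Tuple

# (update number, registration data); this shape is an assumption.
Update = Tuple[int, str]

MAX_UPDATES_PER_PULL = 100  # hypothetical limit


def pull_all(pull_updates: Callable[[int], List[Update]],
             last_update: int,
             apply: Callable[[Update], None],
             max_updates: int = MAX_UPDATES_PER_PULL) -> int:
    """Repeat the pull until the peer returns fewer than max_updates.

    pull_updates(n) stands in for the xml-rpc call and returns updates
    numbered above n.  If the peer becomes unavailable it is expected to
    raise, which ends the loop by propagating to the caller.
    """
    while True:
        batch = pull_updates(last_update)
        for update in batch:
            # Process whatever arrives, even if it is more than max_updates
            # (a peer that has not been updated may ignore the limit).
            apply(update)
            last_update = max(last_update, update[0])
        if len(batch) < max_updates:
            return last_update
```

A peer without the limit returns everything in the first response; the
loop then issues one more pull, gets a short (probably empty) response,
and stops.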

This also suggests a second change for the 'push' (operational) phase.
At present, each push operation sends exactly one update number (which
can, in theory, modify multiple registration table entries all for the
same identity).  I think it should be possible to change the push call to
send multiple updates at the same time - I don't remember why we limited
it to one (except maybe that it would have made debugging easier, which
is significant).
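
One way the grouping could look, sketched in Python.  The helper name
batches and the max_per_push parameter are hypothetical; how a batch would
actually be carried in the xml-rpc call is not addressed here:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(pending: Sequence[T], max_per_push: int) -> Iterator[List[T]]:
    """Split pending updates, in order, into groups of at most max_per_push."""
    if max_per_push < 1:
        raise ValueError("max_per_push must be at least 1")
    for start in range(0, len(pending), max_per_push):
        yield list(pending[start:start + max_per_push])
```

With max_per_push set to 1 this reduces to the current behaviour of one
update per push, which would keep the easier-to-debug mode available.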

