Re: [sipX-dev] XX-4790: HA pullUpdates request can time out if there are too many changes

Dale Worley Tue, 08 Dec 2009 11:44:23 -0800

[I've already purged the message I'm responding to; I've recovered it
from http://list.sipfoundry.org/archive/sipx-dev/msg20942.html.  But the
archive doesn't have a "show original message text" option, so I can't recover
its Message-Id, and so I can't insert a References header to make this
message thread with the original.]


        From: "Scott Lawrence" <scott.lawre...@xxxxxxxxxx>

        It is possible for a very large 'pull' operation to fail, which has the
        effect that registry startup is delayed and, more important, there then
        follows a period of time where the registrar is operational but not
        really up to date.  Eventually, the one-update-at-a-time 'push'
        operations should catch it up, but this might also take a while (and I'm
        not sure that we've really verified this recovery in testing).
        
I believe we've seen this situation in the wild, and it behaves exactly
as you've described.  Note that "pushes" execute at about 5 per second,
so for a very large system (5,000 active registrations), it takes about
1,000 seconds (~16 minutes) for the starting registrar to get
synchronized.

Assuming that 1/2 of the incoming calls go to the starting registrar,
and assuming that all of the registrations that the registrar obtained
from its disk database have expired (a fairly reasonable assumption),
then over that 16-minute interval an average of 1/4 of all incoming
calls will fail (1/2 of all calls at the beginning of the interval,
falling smoothly to 0 at the end).
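
To spell out the arithmetic in the two paragraphs above, here is a small
Python sketch.  The function names are mine, and the linear decline in
the failure rate is the same simplifying assumption I made in the text;
5 pushes per second and 5,000 registrations are the figures quoted
above.

```python
def catch_up_seconds(registrations, pushes_per_second):
    """Time for one-at-a-time pushes to deliver every registration."""
    return registrations / pushes_per_second


def average_failure_fraction(share_to_starting_registrar):
    """Mean failure rate over the catch-up interval.

    Assumes the failure rate starts at the starting registrar's share of
    calls and falls linearly to 0, so the mean is half the starting value.
    """
    return share_to_starting_registrar / 2


seconds = catch_up_seconds(5000, 5)
print(seconds, "seconds, or about", round(seconds / 60, 1), "minutes")
print("average fraction of calls failing:", average_failure_fraction(0.5))
```

So the interval is 1,000 seconds (between 16 and 17 minutes), and the
average failure fraction over it is 1/4.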
        
        It seems to me that the first and simplest fix is to modify the startup
        'pull' phase by introducing a maximum number of updates allowed in a
        single xml-rpc.  Instead of doing one 'pull' to each peer, the startup
        would loop, repeating the pull operation until it either fails (the peer
        becomes unavailable) or the number of updates returned is less than the
        maximum.   I think this change is backward-compatible (I don't think we
        need to change the xml-rpc interface definition): if a system with the
        limit receives more updates than it expects, it should just process
        them; if a system gets the maximum but doesn't repeat the pull (it has
        not been updated yet), then it will eventually catch up via push (the
        current situation).
        
That makes a great deal of sense.  In theory, we would want to create a
new "pull with maximum" RPC call for compatibility's sake, but since
generating an oversized response to a "pull" significantly burdens the
server system, we probably don't want to change the RPC name: keeping
it means the limit also applies to requests from peers that have not
been upgraded.
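
As a rough illustration of the startup loop Scott describes, here is a
Python sketch.  Nothing in it is taken from the sipXregistry code: the
names pull_until_caught_up, pull, and apply_update are hypothetical, a
peer failure is modelled as ConnectionError, and I assume the peer never
splits the records of one update number across two responses.

```python
def pull_until_caught_up(pull, apply_update, last_update_number, max_updates):
    """Repeat the pull until the peer fails or returns a short batch.

    pull(since) is a hypothetical callable returning a list of
    (update_number, record) pairs newer than `since`, in ascending order,
    and raising ConnectionError if the peer becomes unavailable.
    Returns (last update number applied, True if fully caught up).
    """
    while True:
        try:
            updates = pull(last_update_number)
        except ConnectionError:
            # Peer went away; the remaining updates must arrive via push.
            return last_update_number, False
        for number, record in updates:
            # Process everything received, even if an un-updated peer
            # sent more than max_updates in one response.
            apply_update(number, record)
            last_update_number = max(last_update_number, number)
        if len(updates) < max_updates:
            # A short batch means the peer has nothing more to send.
            return last_update_number, True
```

Note that a batch of exactly max_updates triggers one more pull, which
may come back empty; that costs one extra round trip but keeps the
termination test simple.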
        
        This also suggests a second change for the 'push' (operational) phase.
        At present, each push operation sends exactly one update number (which
        can, in theory, modify multiple registration table entries all for the
        same identity).  I think it should be possible to change this update to
        push multiple updates at the same time - I don't remember why we limited
        it to one (except maybe that it would have made debugging easier, which
        is significant).
        
IIRC, we limited "push" to a single update number just to simplify
the client (pushing) code.  The XML-RPC already accommodates a separate
update number for each registration pushed.  (See section 5.8.3 of
sipXregistry/doc/SyncDesign.txt.)  What I don't know is whether the
server code for this RPC will function correctly if it receives a
"push" with multiple update numbers.  If it does not, we may need to
define a new "push" method name, and play some compatibility tricks.
(See section 5.12, "Protocol Versioning".)
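
One possible shape for such a compatibility trick, sketched in Python.
This is an assumption on my part, not what section 5.12 specifies: the
method names "push_v1" and "push_v2", the MethodNotFound exception, and
the call() helper are all hypothetical stand-ins.

```python
class MethodNotFound(Exception):
    """Stand-in for whatever fault an old peer returns for an unknown method."""


def push_updates(call, updates):
    """Push a batch if the peer supports it, else one update number at a time.

    call(method_name, records) is a hypothetical RPC helper; updates is an
    iterable of (update_number, record) pairs.  Returns the mode used.
    """
    updates = list(updates)
    try:
        # Hypothetical new method accepting several update numbers at once.
        call("push_v2", updates)
        return "batch"
    except MethodNotFound:
        pass
    # Old peer: fall back to today's behaviour, one update number per call.
    grouped = {}
    for number, record in updates:
        grouped.setdefault(number, []).append((number, record))
    for number in sorted(grouped):
        call("push_v1", grouped[number])
    return "single"
```

The fallback keeps all records for one update number in the same call,
which matches the current one-update-number-per-push behaviour.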

Dale


_______________________________________________
sipx-dev mailing list [email protected]
List Archive: http://list.sipfoundry.org/archive/sipx-dev
Unsubscribe: http://list.sipfoundry.org/mailman/listinfo/sipx-dev
sipXecs IP PBX -- http://www.sipfoundry.org/