Re: [sipX-dev] XX-6547 - sipXsupervisor core dump during replication of 10,000 user database

Martin Steinmann Tue, 22 Sep 2009 09:59:39 -0700

>
>
>Raymond Dans wrote:
>> Martin wrote:
>>>> Subject: [sipX-dev] XX-6547 - sipXsupervisor core dump during
>>>> replication of 10,000 user database
>>>>
>>>> In tracing the issue with replication of a 10,000 user database, I've
>>>> found a couple of areas in the system that are problematic.
>>>>
>>>> A typical database replication request involves sipXconfig gathering
>>>> all of the necessary information, constructing an XML-RPC request for
>>>> that replication and sending it to the appropriate sipXsupervisor.
>>>> sipXconfig will then wait a pre-determined amount of time for a
>>>> response.  Should it not receive a response in the allotted time, the
>>>> replication will be marked as failed (problem 1).
>>> Are you saying that if there is no response it is marked
>>> failed for good and no automatic retry is performed?  If so,
>>> that is a problem.
>> 
>> From what I've seen, sipXconfig does not do an immediate retry in a
>> failure scenario.  It will eventually try again when the next send
>> profiles occurs due to some change in configuration.
>> 
>
>I think that current sipXconfig behavior is better than brute-force
>retrying. Making sipXconfig retry the XML-RPC call automatically would,
>in many cases (certainly in this one), make things worse. The underlying
>protocol ensures basic connectivity: if there are problems, they are not
>the kind that would clear themselves in a matter of seconds.
>D.


I think that in the current UI it is very hard for the admin to find out
what to do once a replication error has occurred (also see my recent post
on the Job Status page). We need an automatic recovery process. There
are many reasons why connectivity to another host in the cluster is
temporarily lost. If this happens during some replication event to that
host, then all the admin gets is a failure status on the Job Status
page. Most admins will just clear that page so that the annoying error
that displays on every page goes away. The host in question remains with
failed replication. It might recover sort of 'by accident' next time the
admin makes a config change, but even that is hard to tell. The system
might still be running, but e.g. a distributed proxy might use the wrong
dialplan. Did I get this right?
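
To make the idea of an automatic recovery process concrete, here is a
rough sketch in Python. It is not sipXconfig code: the class
ReplicationRetrier, the send callback, and the delay values are all
hypothetical and only illustrate the shape of the idea. It remembers
which hosts missed a replication, retries them with a growing delay
rather than immediately (so it is not brute-force retrying), and keeps
that record even if the admin clears the Job Status page.

```python
import time
from typing import Callable, Dict, Set


class ReplicationRetrier:
    """Remembers which hosts missed a replication and retries them later.

    Hypothetical helper: none of these names come from sipXconfig.
    """

    def __init__(self, send: Callable[[str], bool],
                 base_delay: float = 30.0, max_delay: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send            # returns True when the host accepted the data
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._clock = clock
        self._failures: Dict[str, int] = {}    # host -> consecutive failures
        self._next_try: Dict[str, float] = {}  # host -> earliest retry time

    def replicate(self, host: str) -> bool:
        """Try one replication; on failure, schedule a retry with backoff."""
        try:
            ok = self._send(host)
        except OSError:
            ok = False  # treat connection errors and timeouts as a failed attempt
        if ok:
            self._failures.pop(host, None)
            self._next_try.pop(host, None)
            return True
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        # Double the wait after each consecutive failure, up to max_delay.
        delay = min(self._base_delay * 2 ** (count - 1), self._max_delay)
        self._next_try[host] = self._clock() + delay
        return False

    def pending_hosts(self) -> Set[str]:
        """Hosts still out of sync, independent of any status page being cleared."""
        return set(self._failures)

    def retry_due(self) -> None:
        """Retry every host whose backoff delay has elapsed; call periodically."""
        now = self._clock()
        for host in [h for h, t in self._next_try.items() if t <= now]:
            self.replicate(host)
```

A periodic timer would call retry_due(), and the UI could show
pending_hosts() so the admin can tell which hosts are still running
with stale data.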
--martin


_______________________________________________
sipx-dev mailing list [email protected]
List Archive: http://list.sipfoundry.org/archive/sipx-dev
Unsubscribe: http://list.sipfoundry.org/mailman/listinfo/sipx-dev
sipXecs IP PBX -- http://www.sipfoundry.org/
