[sipX-dev] XX-6547 - sipXsupervisor core dump during replication of 10,000 user database

Raymond Dans Mon, 21 Sep 2009 11:46:24 -0700

In tracing the issue with replication of a 10,000 user database, I've
found a couple of areas in the system that are problematic.


A typical database replication request involves sipXconfig gathering all
of the necessary information, constructing an XML-RPC request for that
replication and sending it to the appropriate sipXsupervisor.
sipXconfig will then wait a pre-determined amount of time for a
response.  Should it not receive a response in the allotted time, the
replication will be marked as failed (problem 1).
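
Problem 1 can be illustrated with a minimal Python sketch.  This is a
model only, not the sipXconfig code: the function name, the
send_request callable and the result strings are all assumptions made
for illustration.  It shows that a timeout on the waiting side says
nothing about whether the work on the other side eventually finished.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def replicate_with_timeout(send_request, timeout_s):
    """Run send_request in the background and wait at most timeout_s seconds.

    send_request is a hypothetical callable standing in for the XML-RPC call.
    Returns "succeeded" if it finished in time, otherwise "failed".
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(send_request)
    try:
        future.result(timeout=timeout_s)
        return "succeeded"
    except FutureTimeout:
        # The caller gives up here, but the request keeps running.
        return "failed"
    finally:
        # Do not block on the still-running request.
        pool.shutdown(wait=False)
```

If send_request takes longer than timeout_s, the caller records a
failure even though the request may still complete afterwards.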

sipXsupervisor, when it receives the XML-RPC request, will read in the
entire request, check that it's from a valid node and then process the
request before sending back a response indicating the results.

The reading of the XML-RPC request happens in HttpServer which calls
HttpMessage to actually read in the full message.

HttpMessage, when reading in the full message, checks the content length
against a defined maximum value (currently a default of 12000000
bytes).  If the content length is greater than the maximum allowed, the
socket is immediately closed down and a "bytes read" value of 0 is
returned to HttpServer (problem 2).

Now problems really start to happen.  HttpServer DOES NOT check the
number of bytes read and simply goes on to process the request.
sipXsupervisor, in trying to process the request, detects that there are
zero bytes in it and proceeds to build and send back an XML-RPC
response with a failure code.  However, the socket was closed in the
read operation, and hence a core dump occurs in the SSL socket layer
when performing the write.
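
To make the failure path concrete, here is a small Python model of the
sequence described above.  It is a sketch, not the sipX code:
FakeSocket, read_message, process_request and handle_request are
hypothetical names, and the model raises an exception where the real
system core dumps.

```python
MAX_CONTENT_LENGTH = 12_000_000  # default limit quoted above


class FakeSocket:
    """Hypothetical stand-in for the SSL socket."""

    def __init__(self):
        self.open = True
        self.sent = []

    def close(self):
        self.open = False

    def write(self, data):
        if not self.open:
            # The real code core dumps here; the model raises instead.
            raise OSError("write on closed socket")
        self.sent.append(data)


def read_message(sock, content_length):
    """Models the HttpMessage read: too large means close and return 0."""
    if content_length > MAX_CONTENT_LENGTH:
        sock.close()
        return 0
    return content_length


def process_request(bytes_read):
    """Models the supervisor: an empty request produces a failure response."""
    return "fault: empty request" if bytes_read == 0 else "ok"


def handle_request(sock, content_length):
    """Models HttpServer: the bytes-read value is passed on unchecked."""
    bytes_read = read_message(sock, content_length)
    response = process_request(bytes_read)
    sock.write(response)  # fails if read_message closed the socket
```

Calling handle_request with a content length above the limit reaches
the write on a socket that the read has already closed.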

There are a few different types of solutions that can be used to rectify
the situation.

1. Bandage solution:  Increase the maximum content length to some
greater value (let's say 24000000 bytes) such that the read will handle
the very large request.  The issue I have with this is that it's only a
bandage solution and that in some cases sipXconfig will time out waiting
for a response because it takes supervisor quite a while to process
large requests.  As a result, sipXconfig will show the replication as
failed even though it may have succeeded.  We could then increase the
sipXconfig timeout in this situation but it's really just adding another
bandage on top of the first.

2. HttpServer only solution:  Add a check in the read to detect a 0 byte
read return value and not proceed to pass the request to supervisor for
processing.  The problem with this solution is that there are other
cases where 0 bytes can be returned (without closing the socket) and in
those cases we should send a response back indicating an error in the
request.

3. HttpMessage/HttpServer solution: Modify HttpMessage to NOT close the
socket in the case of exceeding the maximum content length but return
some error value (< 0).  Modify HttpServer to detect the error on the
HttpMessage read and construct an appropriate response that is then
returned.

4. HttpServer-2 solution: Same as #2 but also add a check to see if the
socket is still open.  If not, don't send a response back at all.

5. sipXconfig solution:  Instead of sending extremely large replications
in one request, break them up into smaller chunks and use the
appropriate XML-RPC method that adds the records to an IMDB as opposed
to calling the method (replace) which clears the existing IMDB and then
adds all of the records (this of course assumes that all records are in
the request).

6. Some other type of solution I may not have thought of.
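
As a rough illustration of solution 5, the chunking could look like the
Python sketch below.  The names chunk_records, replicate_in_chunks and
send_add are hypothetical; send_add stands in for whichever XML-RPC
"add" call is appropriate and is assumed to return True on success.

```python
def chunk_records(records, chunk_size):
    """Yield consecutive slices of records, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def replicate_in_chunks(records, chunk_size, send_add):
    """Send records in batches and return how many were accepted.

    send_add is a hypothetical callable for the XML-RPC add request.
    Sending stops at the first batch that is not accepted.
    """
    sent = 0
    for batch in chunk_records(records, chunk_size):
        if not send_add(batch):
            break
        sent += len(batch)
    return sent
```

Each request then stays well under the maximum content length, and a
failure part-way through is visible as a count smaller than the total.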

My preference is for solution #3 but I'd like to find out from some of
the experts what their thoughts are before proceeding with a "proper"
solution.
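
For discussion, solution #3 might look roughly like the following
Python model.  Again this is a sketch under assumptions, not the sipX
code: FakeSocket, ERR_CONTENT_TOO_LARGE, read_message and
handle_request are hypothetical names, and the response strings are
placeholders for real XML-RPC faults.

```python
MAX_CONTENT_LENGTH = 12_000_000  # default limit quoted above
ERR_CONTENT_TOO_LARGE = -1       # hypothetical error value (< 0)


class FakeSocket:
    """Hypothetical stand-in for the SSL socket."""

    def __init__(self):
        self.open = True
        self.sent = []

    def write(self, data):
        if not self.open:
            raise OSError("write on closed socket")
        self.sent.append(data)


def read_message(sock, content_length):
    """Models the modified read: report the error, leave the socket open."""
    if content_length > MAX_CONTENT_LENGTH:
        return ERR_CONTENT_TOO_LARGE
    return content_length


def handle_request(sock, content_length):
    """Models the modified server: check the read result before processing."""
    result = read_message(sock, content_length)
    if result < 0:
        sock.write("fault: request exceeds maximum content length")
    elif result == 0:
        sock.write("fault: empty request")
    else:
        sock.write("ok")
```

The sketch glosses over what to do with the unread request body once
the error response has been sent; a real change would need to decide
whether to drain it or close the connection at that point.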


Raymond
