Re: [sipX-dev] XX-6547 - sipXsupervisor core dump during replication of 10,000 user database

Scott Lawrence Mon, 21 Sep 2009 12:56:40 -0700

On Mon, 2009-09-21 at 14:46 -0400, Raymond Dans wrote:
> In tracing the issue with replication of a 10,000 user database, I've
> found a couple of areas in the system that are problematic.


Nice (if depressing) summary, Raymond.

> A typical database replication request involves sipXconfig gathering all
> of the necessary information, constructing an XML-RPC request for that
> replication and sending it to the appropriate sipXsupervisor.
> sipXconfig will then wait a pre-determined amount of time for a
> response.  Should it not receive a response in the allotted time, the
> replication will be marked as failed (problem 1).
> 
> sipXsupervisor, when it receives the XML-RPC request, will read in the
> entire request, check that it's from a valid node and then process the
> request before sending back a response indicating the results.
> 
> The reading of the XML-RPC request happens in HttpServer which calls
> HttpMessage to actually read in the full message.
> 
> HttpMessage, when reading in the full message, checks the content length
> against a defined maximum value (currently a default of 12000000
> bytes).  If the content length is greater than the maximum allowed, the
> socket is immediately closed down and a "bytes read" value of 0 is
> returned to HttpServer (problem 2).
> 
> Now problems really start to happen.  HttpServer DOES NOT check the
> number of bytes read and simply goes on to process the request.
> sipXsupervisor, in trying to process the request, detects that there
> are zero bytes in it and proceeds to build and send back an XML-RPC
> response with a failure code.  However, the socket was closed in the
> read operation, and hence a core dump occurs in the SSL socket layer
> when performing the write.
> 
> There are a few different types of solutions that can be used to rectify
> the situation.
> 
> 1. Bandage solution:  Increase the maximum content length to some
> greater value (let's say 24000000 bytes) such that the read will handle
> the very large request.  The issue I have with this is that it's only a
> bandage solution and that in some cases sipXconfig will time out waiting
> for a response because it takes supervisor quite a while to process
> large requests.  As a result, sipXconfig will show the replication as
> failed even though it may have succeeded.  We could then increase the
> sipXconfig timeout in this situation but it's really just adding another
> bandage on the solution.
> 
> 2. HttpServer only solution:  Add a check in the read to detect a 0 byte
> read return value and not proceed to pass the request to supervisor for
> processing.  The problem with this solution is that there are other
> cases where 0 bytes can be returned (without closing the socket) and we
> should send a response back indicating an error in the request.
> 
> 3. HttpMessage/HttpServer solution: Modify HttpMessage to NOT close the
> socket in the case of exceeding the maximum content length but return
> some error value (< 0).  Modify HttpServer to detect the error on the
> HttpMessage read and construct an appropriate response that is then
> returned.

That seems better.  We would need to actually read the data off the
socket anyway to maintain the connection, but we could just read and
discard (or possibly wait until the response has been sent and then
close).
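
To make the read-and-discard variant of option 3 concrete, here is a
minimal sketch. It is Python rather than the real C++ in HttpMessage and
HttpServer, and every name in it (read_body, handle_request,
ERR_TOO_LARGE) is hypothetical. The 413 status is only my assumption
about what an "appropriate response" could be. The point is that the
reader never closes the stream, an oversized body is drained and
reported with a negative status, and a legitimate 0-byte read stays
distinguishable from the error.

```python
import io

MAX_CONTENT_LENGTH = 12_000_000   # default limit quoted in this thread
ERR_TOO_LARGE = -1                # hypothetical error code (< 0)


def read_body(stream, content_length, max_length=MAX_CONTENT_LENGTH,
              chunk_size=64 * 1024):
    """Return (status, body).

    status >= 0 is the number of bytes read; status < 0 is an error.
    The stream is never closed here, so the caller can still respond.
    """
    too_large = content_length > max_length
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:              # peer stopped sending early
            break
        remaining -= len(chunk)
        if not too_large:          # oversized bodies are read and discarded
            chunks.append(chunk)
    if too_large:
        return ERR_TOO_LARGE, b""
    body = b"".join(chunks)
    return len(body), body


def handle_request(stream, content_length, process,
                   max_length=MAX_CONTENT_LENGTH):
    """Check the read status before handing anything to the processor."""
    status, body = read_body(stream, content_length, max_length)
    if status == ERR_TOO_LARGE:
        # Assumed response; the thread does not specify one.
        return "413 Request Entity Too Large"
    return process(body)
```

With a real socket the drain loop would also want a timeout, so that a
sender that stalls partway through a huge body cannot hold the
connection open indefinitely.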

> 4. HttpServer-2 solution: Same as #2 but also add a check to see if the
> socket is still open.  If not, don't send a response back at all.

Well, if SSL asserts when we write to a closed socket, then we need to
do something to keep that from happening.

> 5. sipXconfig solution:  Instead of sending extremely large replications
> in one request, break them up into smaller chunks and use the appropriate
> XML-RPC method that adds the records to an IMDB as opposed to calling
> the method (replace) which clears the existing IMDB and then adds all of
> the records (this of course assumes that all records are in the
> request).

This is probably needed sooner or later, but is more involved.  We'd
need some kind of transaction control: start, add, delete, add,
modify, stop - where the changes in the middle are applied only after
the stop operation.
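
Roughly what I have in mind, as a sketch only: the class below is a
hypothetical in-memory stand-in for one IMDB table, written in Python
for brevity, and none of these method names exist in the current
XML-RPC interface. Each chunked request would call add, modify or
delete between start and stop; the changes are staged and only become
visible when stop applies them to a copy and swaps it in.

```python
class StagedTable:
    """Hypothetical stand-in for one IMDB table, keyed by record id."""

    def __init__(self, records=None):
        self.records = dict(records or {})
        self._pending = None          # None means no transaction is open

    def start(self):
        if self._pending is not None:
            raise RuntimeError("transaction already open")
        self._pending = []

    def add(self, key, value):
        self._stage(("add", key, value))

    def modify(self, key, value):
        self._stage(("modify", key, value))

    def delete(self, key):
        self._stage(("delete", key, None))

    def _stage(self, op):
        if self._pending is None:
            raise RuntimeError("no transaction open")
        self._pending.append(op)

    def stop(self):
        """Apply every staged change to a copy, then swap the copy in."""
        if self._pending is None:
            raise RuntimeError("no transaction open")
        ops, self._pending = self._pending, None
        working = dict(self.records)
        for kind, key, value in ops:
            if kind == "delete":
                working.pop(key, None)
            elif kind == "modify" and key not in working:
                raise KeyError(key)   # live table is left untouched
            else:
                working[key] = value
        self.records = working


table = StagedTable({"200": "alice"})
table.start()
table.add("201", "bob")       # could arrive in one small request
table.delete("200")           # and this in a later one
table.stop()                  # only now does the live table change
```

If any staged change fails during stop, the live table keeps its old
contents, which is the property the single replace call gives us today.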

> 6. Some other type of solution I may not have thought of.
> 
> My preference is for solution #3 but I'd like to find out from some of
> the experts what their thoughts are before proceeding with a "proper"
> solution.

_______________________________________________
sipx-dev mailing list [email protected]
List Archive: http://list.sipfoundry.org/archive/sipx-dev
Unsubscribe: http://list.sipfoundry.org/mailman/listinfo/sipx-dev
sipXecs IP PBX -- http://www.sipfoundry.org/
