In tracing the issue with replication of a 10,000-user database, I've found a couple of problematic areas in the system.

A typical database replication request involves sipXconfig gathering all of the necessary information, constructing an XML-RPC request for that replication, and sending it to the appropriate sipXsupervisor. sipXconfig then waits a predetermined amount of time for a response. If it does not receive a response in the allotted time, the replication is marked as failed (problem 1).

When sipXsupervisor receives the XML-RPC request, it reads in the entire request, checks that it's from a valid node, and then processes the request before sending back a response indicating the results. The reading of the XML-RPC request happens in HttpServer, which calls HttpMessage to actually read in the full message. While reading, HttpMessage checks the content length against a defined maximum (currently a default of 12000000 bytes). If the content length is greater than the maximum allowed, the socket is immediately closed and a "bytes read" value of 0 is returned to HttpServer (problem 2).

Now the problems really start. HttpServer DOES NOT check the number of bytes read and simply goes on to try to process the request. sipXsupervisor, in trying to process the request, detects that there are zero bytes in it and proceeds to build and try to send back an XML-RPC response with a failure code. However, the socket was closed in the read operation, and hence a core dump occurs in the SSL socket layer when performing the write.

There are a few different types of solution that can be used to rectify the situation.

1. Bandage solution: Increase the maximum content length to some greater value (let's say 24000000 bytes) so that the read can handle the very large request. The issue I have with this is that it's only a bandage solution, and in some cases sipXconfig will time out waiting for a response because it takes supervisor quite a while to process large requests.
As a result, sipXconfig will show the replication as failed even though it may have succeeded. We could then increase the sipXconfig timeout in this situation, but that's really just adding another bandage to the solution.

2. HttpServer only solution: Add a check on the read to detect a 0 byte return value and not pass the request to supervisor for processing. The problem with this solution is that there are other cases where 0 bytes can be returned (without closing the socket), and in those we should send a response back indicating an error in the request.

3. HttpMessage/HttpServer solution: Modify HttpMessage to NOT close the socket when the maximum content length is exceeded but to return some error value (< 0) instead. Modify HttpServer to detect the error on the HttpMessage read and construct an appropriate response that is then returned.

4. HttpServer-2 solution: Same as #2, but also add a check to see if the socket is still open. If not, don't send a response back at all.

5. sipXconfig solution: Instead of sending extremely large replications in 1 request, break them up into smaller chunks and use the appropriate XML-RPC method that adds the records to an IMDB, as opposed to calling the method (replace) which clears the existing IMDB and then adds all of the records (this of course assumes that all records are in the request).

6. Some other type of solution I may not have thought of.

My preference is for solution #3, but I'd like to find out from some of the experts what their thoughts are before proceeding with a "proper" solution.

Raymond
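
To make solution #3 a bit more concrete, here is a rough sketch of the control flow I have in mind. It is a small Python model, not the actual C++ HttpMessage/HttpServer code. The names FakeSocket, read_message, handle_request and ERR_TOO_LARGE are all made up for illustration, and using an HTTP 413 status as the "appropriate response" is my assumption, not something already decided.

```python
from dataclasses import dataclass

MAX_CONTENT_LENGTH = 12_000_000  # default limit mentioned above
ERR_TOO_LARGE = -1               # hypothetical error value (< 0)


@dataclass
class FakeSocket:
    """Stand-in for the connection; tracks open state and what was sent."""
    is_open: bool = True
    sent: str = ""

    def write(self, data: str) -> None:
        # Writing to a closed socket is the failure we want to avoid.
        if not self.is_open:
            raise ConnectionError("write on closed socket")
        self.sent += data


def read_message(content_length, body, max_len=MAX_CONTENT_LENGTH):
    """Model of the read: report an oversize request with a negative
    value instead of closing the socket and returning 0."""
    if content_length > max_len:
        return ERR_TOO_LARGE, b""
    return len(body), body


def handle_request(sock, content_length, body, max_len=MAX_CONTENT_LENGTH):
    """Model of the server side: check the read result before processing."""
    n, _data = read_message(content_length, body, max_len)
    if n < 0:
        # Socket is still open, so an error response can be written safely.
        sock.write("HTTP/1.1 413 Request Entity Too Large\r\n\r\n")
        sock.is_open = False  # close only after the response is sent
        return "rejected"
    if n == 0:
        # Other zero-byte cases still get an error response.
        sock.write("HTTP/1.1 400 Bad Request\r\n\r\n")
        return "empty"
    sock.write("HTTP/1.1 200 OK\r\n\r\n")
    return "processed"
```

The point of the sketch is only the ordering: the read reports the oversize condition, the caller decides what to send, and the socket is closed after the write rather than before it.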

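For solution #5, the chunking itself could look roughly like the following. Again this is only a Python sketch of the idea, not sipXconfig code. The method name "addRecords" is a placeholder I invented, and I'm assuming records can simply be split by count.

```python
from typing import Iterator, List, Sequence, Tuple

ADD_METHOD = "addRecords"  # placeholder name, not a real sipXecs method


def plan_requests(
    records: Sequence[dict], chunk_size: int
) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (method, chunk) pairs so that each XML-RPC request carries
    at most chunk_size records instead of the whole database."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield ADD_METHOD, list(records[start:start + chunk_size])
```

With 10 records and a chunk size of 4, this plans three requests of 4, 4 and 2 records. The sketch does not cover how stale records would get cleared once replace is no longer used. That would still need to be worked out.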