I've got a weird bug with my server and I'm wondering if anyone could provide some insight into it.
The general idea of the server is that I have 8 or so clients constantly pinging it for information. Each ping usually only takes up to 2 seconds to process. Essentially, everything runs fine, until a certain code path is executed - basically, one of the clients causes the server to try to give it some other information than usual. When this client is turned on and starts pinging the server, suddenly all those requests that only took 0.5-2 seconds now take 10, 20, or even 30 seconds. I'm using paster and using its thread-tracker tool I can see what all the threads are doing at any given point. Usually there's anywhere from 0 to 4 requests going on, each having taken less than 1 second up to that point. When the 'special client' is turned on, there's more like 10 requests each having taken 6 seconds or more up to that point. There aren't more requests, it's just that they don't clear out quickly enough. The tail end of the tracebacks are one of two things: File "/home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py", line 89, in commit return self.get().commit() File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 329, in commit self._commitResources() File "/home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py", line 441, in _commitResources rm.tpc_begin(self) File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 547, in tpc_begin self._normal_storage.tpc_begin(transaction) File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 1118, in tpc_begin self._tpc_cond.wait(30) File "/usr/lib/python2.6/threading.py", line 258, in wait _sleep(delay) Or: File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 856, in setstate self._setstate(obj) File "/home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py", line 897, in _setstate p, serial = self._storage.load(obj._p_oid, '') File "/home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py", line 824, in load self._load_lock.acquire() I read through the code a bit and it sees that only one thread can commit at a time. So, my first guess is that the 'special' request is taking particularly long to commit, which ends up slowing everything else down. Would that be a fair estimation? What might cause a commit to take abnormally long? Just looking at the code I can't see anything that particularly stands out, especially since this bug only started happening recently and I haven't really changed that part of the code on the server-side recently. I did refactor the client code but I don't yet see how that might be causing this issue. One last thing. I have certain views which touch a lot of different parts of the database and do a lot of processing. I had many conflict errors in the past, and retrying the entire view all the time proved to be too inefficient, so I separated the code out into blocks, where I retry each block separately. This function describes the pattern well: def commit_and_inner_retry(transaction_manager, func, num_retries, ingdescription)): """Using the `transaction_manager`: 1) commit the transaction thus far 2) try `num_retries` times to begin a new transaction, execute `func`, which is passed the current transaction as its only argument, and then commit the new transaction. if ConflictError is raised, abort & try again 3) if a failure happens, then an error is printed involving `ingdescription` 4) raise ConflictError if retrying `num_retries` times does not work. """ tm = transaction_manager tm.commit() for retry in range(num_retries): tm.begin() return_value = func(tm.get()) try: tm.commit() except ConflictError: print "Conflict error attempt #%d %s, trying again" % (retry+1, ingdescription) tm.abort() continue break else: raise ConflictError("Was never able to commit %s" % ingdescription) return return_value With the view code looking something like this: def long_view(context, request): db_objs = prepare_stuff() def do_stuff_1(): return stuff_1_with_db_objs(db_objs) stuff1 = commit_and_inner_retry( transaction.manager, block, 10, 'doing stuff 1') def do_stuff_2(): return stuff_2_with_db_objs(stuff1) stuff2 = commit_and_inner_retry( transaction.manager, block, 10, 'doing stuff 2') def do_stuff_3(): stuff_3_with_db_objs(stuff2) try: commit_and_inner_retry( transaction.manager, block, 10, 'doing stuff 3') except ConflictError: #this stuff not important transaction.abort() return response(stuff2) It seems like this might cause more transactions than usual. Is it a reasonable design pattern or is there a better way of not having to re-do a lot of computation for a conflict error that only affects a small part of that computation? Thanks, - Claudiu
_______________________________________________ For more information about ZODB, see http://zodb.org/ ZODB-Dev mailing list - ZODB-Dev@zope.org https://mail.zope.org/mailman/listinfo/zodb-dev