[Christian Theune] > this is nasty. We detected a couple of hangs with a Zope 2.7.7 final and > (using the DeadlockDebugger) we found this situation in every of the four > threads: (See attached file for complete listing of all threads)
And there are 4 threads each stuck at self._commit_lock_acquire() in BaseStorage.BaseStorage.tpc_begin(). > ... > We then looked at the locks in BaseStorage, and what I found is that in > tpc_finish, the commit_lock will be freed in any case (using a finally > clause) whereas in tpc_abort, there might be a race condition that does > not free the commit_lock. (the finally clause only covers a second, > different lock). I wouldn't call this "a race", because nothing here appears to be timing-dependent. This is the code: def tpc_abort(self, transaction): self._lock_acquire() try: if transaction is not self._transaction: return self._abort() self._clear_temp() self._transaction = None self._commit_lock_release() finally: self._lock_release() That will release the commit lock if and only if neither of: self._abort() self._clear_temp() raises an exception. The code does seem to implicitly assume that neither of those _will_ raise an exception, and I agree the commit lock release belongs in the `finally` clause instead. I'll change that, but doubt it will make a difference to you: if either of those did raise an exception, I expect you would have seen a traceback. This code doesn't _suppress_ exceptions, right? I'd also look at the source code for whatever storages you're using, to see whether/how they override _abort() and/or _clear_temp(). If any of those _can_ raise exceptions, then (a) they probably ought to be changed so that they cannot; but, (b) that indeed could provoke BaseStorage.tpc_abort() into leaving the commit lock acquired. > Additionally, we are using Ape running on a postgres backend, so this > might trigger some unusual side effects, maybe this possible race > condition. It's also possible for storages (deriving from BaseStorage) to do their own unique things with the locks BaseStorage created. That is, _just_ staring at BaseStorage doesn't tell us what the other stuff you're using may be doing with these locks. For example, FileStorage mucks with the commit lock too, to give packing a way to block commits during the final copy. > (If someone has another suspicion where this hang might come from, I'm > all your's to listen) As above, FileStorage can block commits "for a long time" if a pack is going on. That's not really a hang, but can _appear_ to be a hang until packing completes. There's no evidence in your msg that packing was going on, though. Another possibility is that some path thru the code calls tpc_begin on a storage but never calls tpc_finish or tpc_abort on that storage. Then the storage's commit lock will remain acquired forever. For example, did the following msg show up in your logs? LOG('ZODB', PANIC, "A storage error occurred in the last phase of a " "two-phase commit. This shouldn\'t happen. " "The application will not be allowed to commit " "until the site/storage is reset by a restart. ", error=sys.exc_info()) Or this one? LOG('ZODB', ERROR, "A storage error occured during object abort. This " "shouldn't happen. ", error=sys.exc_info()) _______________________________________________ For more information about ZODB, see the ZODB Wiki: http://www.zope.org/Wikis/ZODB/ ZODB-Dev mailing list - ZODB-Dev@zope.org http://mail.zope.org/mailman/listinfo/zodb-dev