Re: [ZODB-Dev] A certain code path seems to be blocking everything

2013-02-15 Thread Claudiu Saftoiu
I figured out what was wrong and it was so convoluted I thought I'd share
it.

It ended up having nothing to do with zodb *per se*. I had a thread which,
whenever it processes a certain kind of internal message, loaded  saved a
~1.5mb pickled object from disk. The 'special client' ended up triggering
this thread twice/request. It was using pickle instead of cPickle, and I
think due to the GIL it was just thrashing everything else, though in
particular it seemed the zodb commit code was impacted particularly hard.

Using cPickle instead of pickle seems to have fixed the issue for now, but
I'm going to change that part of the code to be more efficient.

- Claudiu

On Thu, Feb 14, 2013 at 4:24 PM, Claudiu Saftoiu csaft...@gmail.com wrote:

 I've got a weird bug with my server and I'm wondering if anyone could
 provide some insight
 into it.

 The general idea of the server is that I have 8 or so clients constantly
 pinging it for information.
 Each ping usually only takes up to 2 seconds to process.

 Essentially, everything runs fine, until a certain code path is executed -
 basically, one of
 the clients causes the server to try to give it some other information
 than usual. When this
 client is turned on and starts pinging the server, suddenly all those
 requests that only took
 0.5-2 seconds now take 10, 20, or even 30 seconds.

 I'm using paster and using its thread-tracker tool I can see what all the
 threads are doing at
 any given point. Usually there's anywhere from 0 to 4 requests going on,
 each having taken
 less than 1 second up to that point. When the 'special client' is turned
 on, there's more like
 10 requests each having taken 6 seconds or more up to that point. There
 aren't more
 requests, it's just that they don't clear out quickly enough. The tail end
 of the tracebacks
 are one of two things:

   File /home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py, 
 line 89, in commit
 return self.get().commit()
   File 
 /home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py, line 
 329, in commit
 self._commitResources()
   File 
 /home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py, line 
 441, in _commitResources
 rm.tpc_begin(self)
   File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py, line 
 547, in tpc_begin
 self._normal_storage.tpc_begin(transaction)
   File /home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py, line 
 1118, in tpc_begin
 self._tpc_cond.wait(30)
   File /usr/lib/python2.6/threading.py, line 258, in wait
 _sleep(delay)

 Or:

   File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py, line 
 856, in setstate
 self._setstate(obj)
   File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py, line 
 897, in _setstate
 p, serial = self._storage.load(obj._p_oid, '')
   File /home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py, line 
 824, in load
 self._load_lock.acquire()

 I read through the code a bit and it sees that only one thread can commit
 at a time.
 So, my first guess is that the 'special' request is taking particularly
 long to commit, which
 ends up slowing everything else down. Would that be a fair estimation?
 What might
 cause a commit to take abnormally long? Just looking at the code I can't
 see anything
 that particularly stands out, especially since this bug only started
 happening recently
 and I haven't really changed that part of the code on the server-side
 recently. I did
 refactor the client code but I don't yet see how that might be causing
 this issue.

 One last thing. I have certain views which touch a lot of different parts
 of the database and
 do a lot of processing. I had many conflict errors in the past, and
 retrying the entire view
 all the time proved to be too inefficient, so I separated the code out
 into blocks, where
 I retry each block separately. This function describes the pattern well:

 def commit_and_inner_retry(transaction_manager, func, num_retries,
 ingdescription)):
 Using the `transaction_manager`:
 1) commit the transaction thus far
 2) try `num_retries` times to begin a new transaction, execute
 `func`, which is
passed the current transaction as its only argument, and then
 commit the
new transaction. if ConflictError is raised, abort  try again
 3) if a failure happens, then an error is printed involving
 `ingdescription`
 4) raise ConflictError if retrying `num_retries` times does not
 work.

 
 tm = transaction_manager
 tm.commit()
 for retry in range(num_retries):
 tm.begin()
 return_value = func(tm.get())
 try:
 tm.commit()
 except ConflictError:
 print Conflict error attempt #%d %s, trying again %
 (retry+1, ingdescription)
 tm.abort()
 continue
 break
 

[ZODB-Dev] A certain code path seems to be blocking everything

2013-02-14 Thread Claudiu Saftoiu
I've got a weird bug with my server and I'm wondering if anyone could
provide some insight
into it.

The general idea of the server is that I have 8 or so clients constantly
pinging it for information.
Each ping usually only takes up to 2 seconds to process.

Essentially, everything runs fine, until a certain code path is executed -
basically, one of
the clients causes the server to try to give it some other information than
usual. When this
client is turned on and starts pinging the server, suddenly all those
requests that only took
0.5-2 seconds now take 10, 20, or even 30 seconds.

I'm using paster and using its thread-tracker tool I can see what all the
threads are doing at
any given point. Usually there's anywhere from 0 to 4 requests going on,
each having taken
less than 1 second up to that point. When the 'special client' is turned
on, there's more like
10 requests each having taken 6 seconds or more up to that point. There
aren't more
requests, it's just that they don't clear out quickly enough. The tail end
of the tracebacks
are one of two things:

  File /home/tsa/env/lib/python2.6/site-packages/transaction/_manager.py,
line 89, in commit
return self.get().commit()
  File /home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py,
line 329, in commit
self._commitResources()
  File /home/tsa/env/lib/python2.6/site-packages/transaction/_transaction.py,
line 441, in _commitResources
rm.tpc_begin(self)
  File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py,
line 547, in tpc_begin
self._normal_storage.tpc_begin(transaction)
  File /home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py,
line 1118, in tpc_begin
self._tpc_cond.wait(30)
  File /usr/lib/python2.6/threading.py, line 258, in wait
_sleep(delay)

Or:

  File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py,
line 856, in setstate
self._setstate(obj)
  File /home/tsa/env/lib/python2.6/site-packages/ZODB/Connection.py,
line 897, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
  File /home/tsa/env/lib/python2.6/site-packages/ZEO/ClientStorage.py,
line 824, in load
self._load_lock.acquire()

I read through the code a bit and it sees that only one thread can commit
at a time.
So, my first guess is that the 'special' request is taking particularly
long to commit, which
ends up slowing everything else down. Would that be a fair estimation? What
might
cause a commit to take abnormally long? Just looking at the code I can't
see anything
that particularly stands out, especially since this bug only started
happening recently
and I haven't really changed that part of the code on the server-side
recently. I did
refactor the client code but I don't yet see how that might be causing this
issue.

One last thing. I have certain views which touch a lot of different parts
of the database and
do a lot of processing. I had many conflict errors in the past, and
retrying the entire view
all the time proved to be too inefficient, so I separated the code out into
blocks, where
I retry each block separately. This function describes the pattern well:

def commit_and_inner_retry(transaction_manager, func, num_retries,
ingdescription)):
Using the `transaction_manager`:
1) commit the transaction thus far
2) try `num_retries` times to begin a new transaction, execute
`func`, which is
   passed the current transaction as its only argument, and then
commit the
   new transaction. if ConflictError is raised, abort  try again
3) if a failure happens, then an error is printed involving
`ingdescription`
4) raise ConflictError if retrying `num_retries` times does not
work.


tm = transaction_manager
tm.commit()
for retry in range(num_retries):
tm.begin()
return_value = func(tm.get())
try:
tm.commit()
except ConflictError:
print Conflict error attempt #%d %s, trying again %
(retry+1, ingdescription)
tm.abort()
continue
break
else:
raise ConflictError(Was never able to commit %s %
ingdescription)

return return_value

With the view code looking something like this:

def long_view(context, request):
db_objs = prepare_stuff()

def do_stuff_1():
return stuff_1_with_db_objs(db_objs)
stuff1 = commit_and_inner_retry(
transaction.manager, block, 10, 'doing stuff 1')

def do_stuff_2():
return stuff_2_with_db_objs(stuff1)
stuff2 = commit_and_inner_retry(
transaction.manager, block, 10, 'doing stuff 2')

def do_stuff_3():
stuff_3_with_db_objs(stuff2)
try:
commit_and_inner_retry(
transaction.manager, block, 10, 'doing stuff 3')
except ConflictError:
#this stuff not important