Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-16 Thread jiajunsu . zju
This patch worked.
https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/747/

But I didn't get the log "Double checkin attempted"; maybe the code that 
actually takes effect is:

if connection_record and connection_record.fairy_ref is not None:
    connection_record.checkin()
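
For illustration, here is a minimal, self-contained sketch of the guard quoted 
above (simplified names and a fake record class; this is not SQLAlchemy's 
actual pool.py, just the shape of the check):

class FakeConnectionRecord(object):
    def __init__(self):
        self.fairy_ref = object()   # non-None while the record is checked out
        self.in_pool = False

    def checkin(self):
        self.fairy_ref = None       # checkin clears the fairy reference...
        self.in_pool = True         # ...and returns the record to the queue


def finalize(connection_record):
    # The guard: skip the checkin entirely when fairy_ref has already been
    # cleared, i.e. the record was already checked in once.
    if connection_record and connection_record.fairy_ref is not None:
        connection_record.checkin()


rec = FakeConnectionRecord()
finalize(rec)   # first call checks the record in
finalize(rec)   # second call is a no-op instead of a double checkin
assert rec.in_pool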




On Wednesday, May 16, 2018 at 4:04:21 PM UTC+8, jiajun...@gmail.com wrote:
>
> Thanks a lot.
>
> I'll test this patch in our environment.
>
> On Wednesday, May 16, 2018 at 2:58:56 AM UTC+8, Mike Bayer wrote:
>>
>> good news, I reproduced it. 
>>
>> Even better news for you, vindication!  nothing to do with greenlets, 
>> threads, concurrency, or anything (though it doesn't have to do with 
>> the "del" or the weakref, this is much simpler ). 
>>
>> This has to do with a special rule in the pool when there is a 
>> so-called "reset agent" involved that cleans up a pooled connection 
>> while taking into account the fact that the enclosing Connection had a 
>> transaction in progress.   That is introducing the 
>> _handle_error()->invalidate() routine *inside* the normal 
>> _finalize_fairy routine and that is why we see _finalize_fairy twice. 
>>  The _finalize_fairy we see that seems to start in the middle of the 
>> operation is called first, then the _finalize_fairy we see failing is 
>> inside of that.   So they aren't sibling operations, they are nested. 
>>
>>
>> https://bitbucket.org/zzzeek/sqlalchemy/issues/4252/double-checkin-due-to-handle_error-called
>>  
>>
>>
>>
>> On Tue, May 15, 2018 at 1:26 PM, Mike Bayer  
>> wrote: 
>> > can you add these lines to your logging, ensure that the check for 
>> > "already in pool" is present, and produce new logs? 
>> > 
>> > try: 
>> > fairy = fairy or _ConnectionFairy( 
>> > connection, connection_record, echo) 
>> > assert fairy.connection is connection 
>> > pool.logger.debug("ABOUT TO RESET CONNECTION %s CONNECTION 
>> > REC %s FAIRY %s", connection, fairy, connection_record) 
>> > fairy._reset(pool) 
>> > pool.logger.debug("COMPLETED RESET CONNECTION %s 
>> > CONNECTION REC %s FAIRY %s", connection, fairy, connection_record) 
>> > 
>> > # Immediately close detached instances 
>> > if not connection_record: 
>> > pool._close_connection(connection) 
>> > except BaseException as e: 
>> > pool.logger.debug("EXCEPTION RAISED IN RESET CONNECTION %s 
>> > CONNECTION REC %s FAIRY %s: %s", connection, fairy, connection_record, 
>> > e) 
>> > pool.logger.error( 
>> > "Exception during reset or similar", exc_info=True) 
>> > if connection_record: 
>> > pool.logger.debug("jjs going to 
>> connection_record.invalidate") 
>> > connection_record.invalidate(e=e) 
>> > if not isinstance(e, Exception): 
>> > pool.logger.debug("jjs e %s is not Exception, raise 
>> again", e) 
>> > raise 
>> > 
>> > my current theory is that a greenlet is being blocked in _reset(), 
>> > then waking up later but is being corrupted at the greenlet level in 
>> > some way. 
>> > 
>> > 
>> > 
>> > On Tue, May 15, 2018 at 12:15 PM, Mike Bayer  
>> wrote: 
>> >> 
>> >> 
>> >> On Tue, May 15, 2018, 3:26 AM  wrote: 
>> >>> 
>> >>> We got the log "jjs _finalize_fairy args" just once but the code in 
>> >>> function _finalize_fairy seem to be called twice. It also confused 
>> me. 
>> >> 
>> >> 
>> >> It's because that code originated elsewhere and got blocked, then it 
>> picks 
>> >> up again but the callstack is somehow corrupted (or logging isn't 
>> telling 
>> >> the truth). 
>> >> 
>> >> Can you please give me: 
>> >> 
>> >> Exact version of eventlet in use 
>> >> Exact version of psycopg2 and psyco-green 
>> >> Exact python interpreter version 
>> >> Exact version of SQLAlchemy 
>> >> Exact version of Oslo.db 
>> >> 
>> >> Thanks 
>> >> 
>> >> 
>> >> 
>> >>> 
>> >>> BTW, I tried to add code[1] to avoid double checkin in 
>> _finalize_fairy but 
>> >>> failed. 
>> >>> That means the code in pool.py#L670-692 was called without L646-668. 
>> >>> 
>> >>> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659 
>> >>> 
>> >>> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote: 
>>  
>>  s 
>>  
>>  
>>  
>>  On Mon, May 14, 2018 at 9:38 PM,   wrote: 
>>  > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in 
>> each 
>>  > coroutine(green thread). 
>>  > 
>>  > We run nova-conductor with monkey_patch, the same as [1], 
>>  > eventlet.monkey_patch(os=False) 
>>  > 
>>  > Our nova-conductor is running with multi workers. 
>>  > 
>>  > It's confusing why the _finalize_fairy is called twice but only 
>> log 
>>  > once at 
>>  > the entrance, that's why I guess it's called by weakref's 
>> callback. 
>>  
>>  For 130446032 which is 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-16 Thread jiajunsu . zju
Thanks a lot.

I'll test this patch in our environment.

On Wednesday, May 16, 2018 at 2:58:56 AM UTC+8, Mike Bayer wrote:
>
> good news, I reproduced it. 
>
> Even better news for you, vindication!  nothing to do with greenlets, 
> threads, concurrency, or anything (though it doesn't have to do with 
> the "del" or the weakref, this is much simpler ). 
>
> This has to do with a special rule in the pool when there is a 
> so-called "reset agent" involved that cleans up a pooled connection 
> while taking into account the fact that the enclosing Connection had a 
> transaction in progress.   That is introducing the 
> _handle_error()->invalidate() routine *inside* the normal 
> _finalize_fairy routine and that is why we see _finalize_fairy twice. 
>  The _finalize_fairy we see that seems to start in the middle of the 
> operation is called first, then the _finalize_fairy we see failing is 
> inside of that.   So they aren't sibling operations, they are nested. 
>
>
> https://bitbucket.org/zzzeek/sqlalchemy/issues/4252/double-checkin-due-to-handle_error-called
>  
>
>
>
> On Tue, May 15, 2018 at 1:26 PM, Mike Bayer  > wrote: 
> > can you add these lines to your logging, ensure that the check for 
> > "already in pool" is present, and produce new logs? 
> > 
> > try: 
> > fairy = fairy or _ConnectionFairy( 
> > connection, connection_record, echo) 
> > assert fairy.connection is connection 
> > pool.logger.debug("ABOUT TO RESET CONNECTION %s CONNECTION 
> > REC %s FAIRY %s", connection, fairy, connection_record) 
> > fairy._reset(pool) 
> > pool.logger.debug("COMPLETED RESET CONNECTION %s 
> > CONNECTION REC %s FAIRY %s", connection, fairy, connection_record) 
> > 
> > # Immediately close detached instances 
> > if not connection_record: 
> > pool._close_connection(connection) 
> > except BaseException as e: 
> > pool.logger.debug("EXCEPTION RAISED IN RESET CONNECTION %s 
> > CONNECTION REC %s FAIRY %s: %s", connection, fairy, connection_record, 
> > e) 
> > pool.logger.error( 
> > "Exception during reset or similar", exc_info=True) 
> > if connection_record: 
> > pool.logger.debug("jjs going to 
> connection_record.invalidate") 
> > connection_record.invalidate(e=e) 
> > if not isinstance(e, Exception): 
> > pool.logger.debug("jjs e %s is not Exception, raise 
> again", e) 
> > raise 
> > 
> > my current theory is that a greenlet is being blocked in _reset(), 
> > then waking up later but is being corrupted at the greenlet level in 
> > some way. 
> > 
> > 
> > 
> > On Tue, May 15, 2018 at 12:15 PM, Mike Bayer  > wrote: 
> >> 
> >> 
> >> On Tue, May 15, 2018, 3:26 AM  
> wrote: 
> >>> 
> >>> We got the log "jjs _finalize_fairy args" just once but the code in 
> >>> function _finalize_fairy seem to be called twice. It also confused me. 
> >> 
> >> 
> >> It's because that code originated elsewhere and got blocked, then it 
> picks 
> >> up again but the callstack is somehow corrupted (or logging isn't 
> telling 
> >> the truth). 
> >> 
> >> Can you please give me: 
> >> 
> >> Exact version of eventlet in use 
> >> Exact version of psycopg2 and psyco-green 
> >> Exact python interpreter version 
> >> Exact version of SQLAlchemy 
> >> Exact version of Oslo.db 
> >> 
> >> Thanks 
> >> 
> >> 
> >> 
> >>> 
> >>> BTW, I tried to add code[1] to avoid double checkin in _finalize_fairy 
> but 
> >>> failed. 
> >>> That means the code in pool.py#L670-692 was called without L646-668. 
> >>> 
> >>> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659 
> >>> 
> >>> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote: 
>  
>  s 
>  
>  
>  
>  On Mon, May 14, 2018 at 9:38 PM,   wrote: 
>  > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in 
> each 
>  > coroutine(green thread). 
>  > 
>  > We run nova-conductor with monkey_patch, the same as [1], 
>  > eventlet.monkey_patch(os=False) 
>  > 
>  > Our nova-conductor is running with multi workers. 
>  > 
>  > It's confusing why the _finalize_fairy is called twice but only log 
>  > once at 
>  > the entrance, that's why I guess it's called by weakref's callback. 
>  
>  For 130446032 which is the tid you mentioned, _finalize_fairy is only 
>  called once in your log: 
>  
>  $ cat nc_5679.log | grep -e "130446032.*jjs _finalize" 
>  2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG 
>  [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug] 
>  [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args 
>  conn None, rec.conn None, rec,   object at 0x70cdb10>. ref None, fairy 
>   debug 
> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
good news, I reproduced it.

Even better news for you, vindication!  nothing to do with greenlets,
threads, concurrency, or anything (though it doesn't have to do with
the "del" or the weakref, this is much simpler ).

This has to do with a special rule in the pool when there is a
so-called "reset agent" involved that cleans up a pooled connection
while taking into account the fact that the enclosing Connection had a
transaction in progress.   That is introducing the
_handle_error()->invalidate() routine *inside* the normal
_finalize_fairy routine and that is why we see _finalize_fairy twice.
 The _finalize_fairy we see that seems to start in the middle of the
operation is called first, then the _finalize_fairy we see failing is
inside of that.   So they aren't sibling operations, they are nested.

https://bitbucket.org/zzzeek/sqlalchemy/issues/4252/double-checkin-due-to-handle_error-called

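
To make the nesting concrete, here is a minimal toy sketch (not SQLAlchemy's 
actual code; the names and the error path are simplified assumptions) of how a 
failing rollback inside the reset step can re-enter the finalize/checkin logic 
and check the same record in twice:

class Record(object):
    def __init__(self):
        self.checkins = 0

    def checkin(self):
        self.checkins += 1
        print("checkin #%d" % self.checkins)


def finalize_fairy(record, depth=0):
    print("enter finalize_fairy (depth=%d)" % depth)
    try:
        if depth == 0:
            reset_via_reset_agent(record, depth)
    except Exception:
        pass   # the outer frame swallows the error and still falls through
    record.checkin()


def reset_via_reset_agent(record, depth):
    try:
        # the rollback fails because the server connection is gone
        raise RuntimeError("server closed the connection unexpectedly")
    except RuntimeError:
        # the error-handling path invalidates the connection, which (in the
        # real pool) ends up running the finalize/checkin logic again,
        # nested inside the first frame rather than as a sibling call
        finalize_fairy(record, depth + 1)
        raise


rec = Record()
finalize_fairy(rec)        # one record, two checkins: the double checkin
assert rec.checkins == 2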


On Tue, May 15, 2018 at 1:26 PM, Mike Bayer  wrote:
> can you add these lines to your logging, ensure that the check for
> "already in pool" is present, and produce new logs?
>
> try:
> fairy = fairy or _ConnectionFairy(
> connection, connection_record, echo)
> assert fairy.connection is connection
> pool.logger.debug("ABOUT TO RESET CONNECTION %s CONNECTION
> REC %s FAIRY %s", connection, fairy, connection_record)
> fairy._reset(pool)
> pool.logger.debug("COMPLETED RESET CONNECTION %s
> CONNECTION REC %s FAIRY %s", connection, fairy, connection_record)
>
> # Immediately close detached instances
> if not connection_record:
> pool._close_connection(connection)
> except BaseException as e:
> pool.logger.debug("EXCEPTION RAISED IN RESET CONNECTION %s
> CONNECTION REC %s FAIRY %s: %s", connection, fairy, connection_record,
> e)
> pool.logger.error(
> "Exception during reset or similar", exc_info=True)
> if connection_record:
> pool.logger.debug("jjs going to connection_record.invalidate")
> connection_record.invalidate(e=e)
> if not isinstance(e, Exception):
> pool.logger.debug("jjs e %s is not Exception, raise again", e)
> raise
>
> my current theory is that a greenlet is being blocked in _reset(),
> then waking up later but is being corrupted at the greenlet level in
> some way.
>
>
>
> On Tue, May 15, 2018 at 12:15 PM, Mike Bayer  wrote:
>>
>>
>> On Tue, May 15, 2018, 3:26 AM  wrote:
>>>
>>> We got the log "jjs _finalize_fairy args" just once but the code in
>>> function _finalize_fairy seem to be called twice. It also confused me.
>>
>>
>> It's because that code originated elsewhere and got blocked, then it picks
>> up again but the callstack is somehow corrupted (or logging isn't telling
>> the truth).
>>
>> Can you please give me:
>>
>> Exact version of eventlet in use
>> Exact version of psycopg2 and psyco-green
>> Exact python interpreter version
>> Exact version of SQLAlchemy
>> Exact version of Oslo.db
>>
>> Thanks
>>
>>
>>
>>>
>>> BTW, I tried to add code[1] to avoid double checkin in _finalize_fairy but
>>> failed.
>>> That means the code in pool.py#L670-692 was called without L646-668.
>>>
>>> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659
>>>
>>> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote:

 s



 On Mon, May 14, 2018 at 9:38 PM,   wrote:
 > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in each
 > coroutine(green thread).
 >
 > We run nova-conductor with monkey_patch, the same as [1],
 > eventlet.monkey_patch(os=False)
 >
 > Our nova-conductor is running with multi workers.
 >
 > It's confusing why the _finalize_fairy is called twice but only log
 > once at
 > the entrance, that's why I guess it's called by weakref's callback.

 For 130446032 which is the tid you mentioned, _finalize_fairy is only
 called once in your log:

 $ cat nc_5679.log | grep -e "130446032.*jjs _finalize"
 2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG
 [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug]
 [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args
 conn None, rec.conn None, rec, >>> object at 0x70cdb10>. ref None, fairy
  debug

 Also, the log shows that no _finalize_fairy is ever called due to
 "del" - this would show the weakref "ref" as non-None:

 $ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF:
 None"
 $

 I would assume by "finalize fairy" called twice, you're referring to
 the fact that you see the "exception during reset or simlar" message.
 But that message is still AFAICT impossible to be emitted without the
 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
can you add these lines to your logging, ensure that the check for
"already in pool" is present, and produce new logs?

    try:
        fairy = fairy or _ConnectionFairy(
            connection, connection_record, echo)
        assert fairy.connection is connection
        pool.logger.debug(
            "ABOUT TO RESET CONNECTION %s CONNECTION REC %s FAIRY %s",
            connection, fairy, connection_record)
        fairy._reset(pool)
        pool.logger.debug(
            "COMPLETED RESET CONNECTION %s CONNECTION REC %s FAIRY %s",
            connection, fairy, connection_record)

        # Immediately close detached instances
        if not connection_record:
            pool._close_connection(connection)
    except BaseException as e:
        pool.logger.debug(
            "EXCEPTION RAISED IN RESET CONNECTION %s CONNECTION REC %s "
            "FAIRY %s: %s", connection, fairy, connection_record, e)
        pool.logger.error(
            "Exception during reset or similar", exc_info=True)
        if connection_record:
            pool.logger.debug("jjs going to connection_record.invalidate")
            connection_record.invalidate(e=e)
        if not isinstance(e, Exception):
            pool.logger.debug("jjs e %s is not Exception, raise again", e)
            raise

my current theory is that a greenlet is being blocked in _reset(),
then waking up later but is being corrupted at the greenlet level in
some way.



On Tue, May 15, 2018 at 12:15 PM, Mike Bayer  wrote:
>
>
> On Tue, May 15, 2018, 3:26 AM  wrote:
>>
>> We got the log "jjs _finalize_fairy args" just once but the code in
>> function _finalize_fairy seem to be called twice. It also confused me.
>
>
> It's because that code originated elsewhere and got blocked, then it picks
> up again but the callstack is somehow corrupted (or logging isn't telling
> the truth).
>
> Can you please give me:
>
> Exact version of eventlet in use
> Exact version of psycopg2 and psyco-green
> Exact python interpreter version
> Exact version of SQLAlchemy
> Exact version of Oslo.db
>
> Thanks
>
>
>
>>
>> BTW, I tried to add code[1] to avoid double checkin in _finalize_fairy but
>> failed.
>> That means the code in pool.py#L670-692 was called without L646-668.
>>
>> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659
>>
>> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote:
>>>
>>> s
>>>
>>>
>>>
>>> On Mon, May 14, 2018 at 9:38 PM,   wrote:
>>> > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in each
>>> > coroutine(green thread).
>>> >
>>> > We run nova-conductor with monkey_patch, the same as [1],
>>> > eventlet.monkey_patch(os=False)
>>> >
>>> > Our nova-conductor is running with multi workers.
>>> >
>>> > It's confusing why the _finalize_fairy is called twice but only log
>>> > once at
>>> > the entrance, that's why I guess it's called by weakref's callback.
>>>
>>> For 130446032 which is the tid you mentioned, _finalize_fairy is only
>>> called once in your log:
>>>
>>> $ cat nc_5679.log | grep -e "130446032.*jjs _finalize"
>>> 2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG
>>> [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug]
>>> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args
>>> conn None, rec.conn None, rec, >> object at 0x70cdb10>. ref None, fairy
>>>  debug
>>>
>>> Also, the log shows that no _finalize_fairy is ever called due to
>>> "del" - this would show the weakref "ref" as non-None:
>>>
>>> $ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF:
>>> None"
>>> $
>>>
>>> I would assume by "finalize fairy" called twice, you're referring to
>>> the fact that you see the "exception during reset or similar" message.
>>> But that message is still AFAICT impossible to be emitted without the
>>> function being entered, and a corresponding "FINALIZE FAIRY" log line
>>> preceding it, unless something is going wrong with the logging, or the
>>> interpreter is somehow not functioning correctly (I wonder if eventlet
>>> had any such issues).   I'm struggling to see any other possibility
>>> but not coming up with any.
>>>
>>> One weird thing I'm trying to figure out is that I see a connection
>>> being returned to the pool normally, but then nothing is logged about
>>> it after that:
>>>
>>> 2018-05-14T11:11:21.183+08:00 localhost nova-conductor DEBUG
>>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>>> [req-19a57f61-bc64-4575-8253-00b5bfe62a74] Connection >> object at 0x7f419c0; closed: 0> being returned to pool,
>>> connection_record.connection >> 0> debug
>>>
>>>
>>> the next log message from tid:130446352 should be "jjs conn reset".
>>> But that never happens.  The next we see of that tid is:
>>>
>>> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG
>>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>>> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs
>>> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
On Tue, May 15, 2018, 3:26 AM  wrote:

> We got the log "jjs _finalize_fairy args" just once but the code in
> function _finalize_fairy seem to be called twice. It also confused me.
>

It's because that code originated elsewhere and got blocked, then it picks
up again but the callstack is somehow corrupted (or logging isn't telling
the truth).

Can you please give me:

Exact version of eventlet in use
Exact version of psycopg2 and psyco-green
Exact python interpreter version
Exact version of SQLAlchemy
Exact version of Oslo.db

Thanks
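
For anyone gathering those details, one possible way to dump them in one go 
(this snippet is an assumption on my part, not something from the thread):

import sys
import pkg_resources

print("python", sys.version)
for dist in ("eventlet", "psycopg2", "psycogreen", "SQLAlchemy", "oslo.db"):
    try:
        # pkg_resources reports the installed distribution version, if any
        print(dist, pkg_resources.get_distribution(dist).version)
    except pkg_resources.DistributionNotFound:
        print(dist, "not installed")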




> BTW, I tried to add code[1] to avoid double checkin in _finalize_fairy but
> failed.
> That means the code in pool.py#L670-692 was called without L646-668.
>
> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659
>
> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote:
>>
>> s
>>
>>
>>
>> On Mon, May 14, 2018 at 9:38 PM,   wrote:
>> > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in each
>> > coroutine(green thread).
>> >
>> > We run nova-conductor with monkey_patch, the same as [1],
>> > eventlet.monkey_patch(os=False)
>> >
>> > Our nova-conductor is running with multi workers.
>> >
>> > It's confusing why the _finalize_fairy is called twice but only log
>> once at
>> > the entrance, that's why I guess it's called by weakref's callback.
>>
>> For 130446032 which is the tid you mentioned, _finalize_fairy is only
>> called once in your log:
>>
>> $ cat nc_5679.log | grep -e "130446032.*jjs _finalize"
>> 2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG
>> [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug]
>> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args
>> conn None, rec.conn None, rec, > object at 0x70cdb10>. ref None, fairy
>>  debug
>>
>> Also, the log shows that no _finalize_fairy is ever called due to
>> "del" - this would show the weakref "ref" as non-None:
>>
>> $ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF:
>> None"
>> $
>>
>> I would assume by "finalize fairy" called twice, you're referring to
>> the fact that you see the "exception during reset or similar" message. 
>> But that message is still AFAICT impossible to be emitted without the
>> function being entered, and a corresponding "FINALIZE FAIRY" log line
>> preceding it, unless something is going wrong with the logging, or the
>> interpreter is somehow not functioning correctly (I wonder if eventlet
>> had any such issues).   I'm struggling to see any other possibility
>> but not coming up with any.
>>
>> One weird thing I'm trying to figure out is that I see a connection
>> being returned to the pool normally, but then nothing is logged about
>> it after that:
>>
>> 2018-05-14T11:11:21.183+08:00 localhost nova-conductor DEBUG
>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>> [req-19a57f61-bc64-4575-8253-00b5bfe62a74] Connection > object at 0x7f419c0; closed: 0> being returned to pool,
>> connection_record.connection > 0> debug
>>
>>
>> the next log message from tid:130446352 should be "jjs conn reset".
>> But that never happens.  The next we see of that tid is:
>>
>> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG
>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs
>> _ConnectionFairy._checkout fariy None debug
>> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG
>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs before pool._do_get
>> pool queue deque([> 0xe10>, ,
>> ,
>> ,
>> ,
>> ,
>> ,
>> ,
>> ]) debug
>>
>> that is, it's checking out a new connection ?  If it were blocking on
>> trying to reset the connection, we wouldn't see that tid again.  I
>> wonder if the greenlet was interrupted or something.I'm finding
>> various impossibilities like this going on and not able to come up
>> with any theory other than the interpreter is somehow not functioning
>> correctly, or the logging here is somehow incomplete. 
>>
>> I notice you're on a very old version of SQLAlchemy (three years old).
>> You really should be on at least the 1.1 series because for
>> greenlets you need to have:
>>
>> http://docs.sqlalchemy.org/en/latest/changelog/migration_11.html#change-3803.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> >
>> > [1]https://github.com/openstack/nova/blob/master/nova/cmd/__init__.py
>> >
>> >
>> > On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote:
>> >>
>> >> I notice in this log file there are multiple "tid", which I assume to
>> >> be thread identifiers ?   E.g. OS-level threads?Are you using
>> >> eventlet or not?Are you monkeypatching SQLAlchemy and then using
>> >> multiple OS-level threads at the same time? Because that likely
>> >> doesn't work.
>> >>
>> >> Particularly, the log here illustrates an impossible codepath.
>> >>
>> >> We enter finalize_fairy with connection is None:
>> >>
>> >> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread jiajunsu . zju
We got the log "jjs _finalize_fairy args" just once, but the code in 
function _finalize_fairy seems to be called twice. It also confused me.

BTW, I tried to add code[1] to avoid double checkin in _finalize_fairy but 
failed.
That means the code in pool.py#L670-692 was called without L646-668.

[1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659

On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote:
>
> s 
>
>
>
> On Mon, May 14, 2018 at 9:38 PM,   
> wrote: 
> > The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in each 
> > coroutine(green thread). 
> > 
> > We run nova-conductor with monkey_patch, the same as [1], 
> > eventlet.monkey_patch(os=False) 
> > 
> > Our nova-conductor is running with multi workers. 
> > 
> > It's confusing why the _finalize_fairy is called twice but only log once 
> at 
> > the entrance, that's why I guess it's called by weakref's callback. 
>
> For 130446032 which is the tid you mentioned, _finalize_fairy is only 
> called once in your log: 
>
> $ cat nc_5679.log | grep -e "130446032.*jjs _finalize" 
> 2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug] 
> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args 
> conn None, rec.conn None, rec,  object at 0x70cdb10>. ref None, fairy 
>  debug 
>
> Also, the log shows that no _finalize_fairy is ever called due to 
> "del" - this would show the weakref "ref" as non-None: 
>
> $ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF: None" 
> $ 
>
> I would assume by "finalize fairy" called twice, you're referring to 
> the fact that you see the "exception during reset or similar" message. 
> But that message is still AFAICT impossible to be emitted without the 
> function being entered, and a corresponding "FINALIZE FAIRY" log line 
> preceding it, unless something is going wrong with the logging, or the 
> interpreter is somehow not functioning correctly (I wonder if eventlet 
> had any such issues).   I'm struggling to see any other possibility 
> but not coming up with any. 
>
> One weird thing I'm trying to figure out is that I see a connection 
> being returned to the pool normally, but then nothing is logged about 
> it after that: 
>
> 2018-05-14T11:11:21.183+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> [req-19a57f61-bc64-4575-8253-00b5bfe62a74] Connection  object at 0x7f419c0; closed: 0> being returned to pool, 
> connection_record.connection  0> debug 
>
>
> the next log message from tid:130446352 should be "jjs conn reset". 
> But that never happens.  The next we see of that tid is: 
>
> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs 
> _ConnectionFairy._checkout fariy None debug 
> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs before pool._do_get 
> pool queue deque([ 0xe10>, , 
> , 
> , 
> , 
> , 
> , 
> , 
> ]) debug 
>
> that is, it's checking out a new connection ?  If it were blocking on 
> trying to reset the connection, we wouldn't see that tid again.  I 
> wonder if the greenlet was interrupted or something.I'm finding 
> various impossibilities like this going on and not able to come up 
> with any theory other than the interpreter is somehow not functioning 
> correctly, or the logging here is somehow incomplete. 
>
> I notice you're on a very old version of SQLAlchemy (three years old). 
> You really should be on at least the 1.1 series because for 
> greenlets you need to have: 
>
> http://docs.sqlalchemy.org/en/latest/changelog/migration_11.html#change-3803. 
>
>
>
>
>
>
>
>
>
> > 
> > [1]https://github.com/openstack/nova/blob/master/nova/cmd/__init__.py 
> > 
> > 
> > On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote: 
> >> 
> >> I notice in this log file there are multiple "tid", which I assume to 
> >> be thread identifiers ?   E.g. OS-level threads?Are you using 
> >> eventlet or not?Are you monkeypatching SQLAlchemy and then using 
> >> multiple OS-level threads at the same time? Because that likely 
> >> doesn't work. 
> >> 
> >> Particularly, the log here illustrates an impossible codepath. 
> >> 
> >> We enter finalize_fairy with connection is None: 
> >> 
> >> 2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG 
> >> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> >> [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN: 
> >> None, FAIRY: , 
> >> REF: None  CONNECTION_REC:  >> at 0x776fe90> CONNECTION_REC REF  >> '_ConnectionFairy' at 0x87dd3d0> debug 
> >> 
> >> this seems to make it through to the checkin at the bottom. 
> >> 
> >> Then, we see this: 
> >> 
> >> 2018-05-14T11:11:39.968+08:00 localhost 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
 s



On Mon, May 14, 2018 at 9:38 PM,   wrote:
> The tid is coroutine_id, we get it by id(greenlet.getcurrent()) in each
> coroutine(green thread).
>
> We run nova-conductor with monkey_patch, the same as [1],
> eventlet.monkey_patch(os=False)
>
> Our nova-conductor is running with multi workers.
>
> It's confusing why the _finalize_fairy is called twice but only log once at
> the entrance, that's why I guess it's called by weakref's callback.

For 130446032 which is the tid you mentioned, _finalize_fairy is only
called once in your log:

$ cat nc_5679.log | grep -e "130446032.*jjs _finalize"
2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG
[pid:5679] [MainThread] [tid:130446032] [log.py:104 debug]
[req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args
conn None, rec.conn None, rec, . ref None, fairy
 debug

Also, the log shows that no _finalize_fairy is ever called due to
"del" - this would show the weakref "ref" as non-None:

$ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF: None"
$

I would assume by "finalize fairy" called twice, you're referring to
the fact that you see the "exception during reset or similar" message.
But that message is still AFAICT impossible to be emitted without the
function being entered, and a corresponding "FINALIZE FAIRY" log line
preceding it, unless something is going wrong with the logging, or the
interpreter is somehow not functioning correctly (I wonder if eventlet
had any such issues).   I'm struggling to see any other possibility
but not coming up with any.

One weird thing I'm trying to figure out is that I see a connection
being returned to the pool normally, but then nothing is logged about
it after that:

2018-05-14T11:11:21.183+08:00 localhost nova-conductor DEBUG
[pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
[req-19a57f61-bc64-4575-8253-00b5bfe62a74] Connection  being returned to pool,
connection_record.connection  debug


the next log message from tid:130446352 should be "jjs conn reset".
But that never happens.  The next we see of that tid is:

2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG
[pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
[req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs
_ConnectionFairy._checkout fariy None debug
2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG
[pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
[req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs before pool._do_get
pool queue deque([, ,
,
,
,
,
,
,
]) debug

that is, it's checking out a new connection ?  If it were blocking on
trying to reset the connection, we wouldn't see that tid again.  I
wonder if the greenlet was interrupted or something.I'm finding
various impossibilities like this going on and not able to come up
with any theory other than the interpreter is somehow not functioning
correctly, or the logging here is somehow incomplete.

I notice you're on a very old version of SQLAlchemy (three years old).
You really should be on at least the 1.1 series because for
greenlets you need to have:
http://docs.sqlalchemy.org/en/latest/changelog/migration_11.html#change-3803.
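
A hedged example of turning that recommendation into a runtime check (the 
exact minimum version to require is an assumption; pick whatever matches your 
distribution constraints):

from distutils.version import LooseVersion

import sqlalchemy

# Fail fast if the process is running a SQLAlchemy older than the 1.1 series
# referenced above (change-3803).
assert LooseVersion(sqlalchemy.__version__) >= LooseVersion("1.1"), (
    "SQLAlchemy %s is older than the 1.1 series recommended above"
    % sqlalchemy.__version__
)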








>
> [1]https://github.com/openstack/nova/blob/master/nova/cmd/__init__.py
>
>
>> On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote:
>>
>> I notice in this log file there are multiple "tid", which I assume to
>> be thread identifiers ?   E.g. OS-level threads?Are you using
>> eventlet or not?Are you monkeypatching SQLAlchemy and then using
>> multiple OS-level threads at the same time? Because that likely
>> doesn't work.
>>
>> Particularly, the log here illustrates an impossible codepath.
>>
>> We enter finalize_fairy with connection is None:
>>
>> 2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG
>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
>> [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN:
>> None, FAIRY: ,
>> REF: None  CONNECTION_REC: > at 0x776fe90> CONNECTION_REC REF > '_ConnectionFairy' at 0x87dd3d0> debug
>>
>> this seems to make it through to the checkin at the bottom.
>>
>> Then, we see this:
>>
>> 2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR
>> [pid:5679] [MainThread] [tid:130446032] [log.py:122 error]
>> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or
>> similar
>>
>> The only place that message is generated is *in* finalize_fairy,
>> *before* the checkin occurs, and *only* if _finalize_fairy was called
>> *with* a connection that is non -None.There is no other call
>> _finalize_fairy with that request identifier, therefore I see no
>> evidence of where this message "exception during reset" can be
>> generated from - there would at least need to be an additional log
>> line with "FINALIZE FAIRY" and the connection non-None for this to be
>> reached.I see no way that the pool.py shipped with SQLAlchemy can
>> possibly generate this series of steps.
>>
>> This is an 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju

The exc_info of tid 130446032 "Exception during reset or similar" is:

sqlalchemy.pool.QueuePool Traceback (most recent call last):
sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 679, in _finalize_fairy
sqlalchemy.pool.QueuePool     fairy._reset(pool)
sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 825, in _reset
sqlalchemy.pool.QueuePool     self._reset_agent.rollback()
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 1586, in rollback
sqlalchemy.pool.QueuePool     self._do_rollback()
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 1624, in _do_rollback
sqlalchemy.pool.QueuePool     self.connection._rollback_impl()
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 672, in _rollback_impl
sqlalchemy.pool.QueuePool     self._handle_dbapi_exception(e, None, None, None, None)
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 1356, in _handle_dbapi_exception
sqlalchemy.pool.QueuePool     util.raise_from_cause(newraise, exc_info)
sqlalchemy.pool.QueuePool   File "sqlalchemy/util/compat.py", line 200, in raise_from_cause
sqlalchemy.pool.QueuePool     reraise(type(exception), exception, tb=exc_tb)
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 670, in _rollback_impl
sqlalchemy.pool.QueuePool     self.engine.dialect.do_rollback(self.connection)
sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/default.py", line 421, in do_rollback
sqlalchemy.pool.QueuePool     dbapi_connection.rollback()
sqlalchemy.pool.QueuePool   File "eventlet/support/psycopg2_patcher.py", line 46, in eventlet_wait_callback
sqlalchemy.pool.QueuePool     state = conn.poll()
sqlalchemy.pool.QueuePool DBConnectionError: (psycopg2.OperationalError) server closed the connection unexpectedly
sqlalchemy.pool.QueuePool     This probably means the server terminated abnormally
sqlalchemy.pool.QueuePool     before or while processing the request.
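
For context on the last two frames: the conn.poll() call comes from eventlet's 
psycopg2 wait callback. A minimal sketch of how that callback gets installed 
(assuming eventlet's bundled patcher, as the traceback suggests) looks like:

import eventlet
eventlet.monkey_patch(os=False)          # as in nova/cmd/__init__.py above

from eventlet.support import psycopg2_patcher

# Registers eventlet_wait_callback via psycopg2.extensions.set_wait_callback,
# so every psycopg2 wait (including rollback) yields to the eventlet hub and
# surfaces OperationalError from conn.poll() when the server goes away.
psycopg2_patcher.make_psycopg_green()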





On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote:
>
> I notice in this log file there are multiple "tid", which I assume to 
> be thread identifiers ?   E.g. OS-level threads?Are you using 
> eventlet or not?Are you monkeypatching SQLAlchemy and then using 
> multiple OS-level threads at the same time? Because that likely 
> doesn't work. 
>
> Particularly, the log here illustrates an impossible codepath. 
>
> We enter finalize_fairy with connection is None: 
>
> 2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN: 
> None, FAIRY: , 
> REF: None  CONNECTION_REC:  at 0x776fe90> CONNECTION_REC REF  '_ConnectionFairy' at 0x87dd3d0> debug 
>
> this seems to make it through to the checkin at the bottom. 
>
> Then, we see this: 
>
> 2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR 
> [pid:5679] [MainThread] [tid:130446032] [log.py:122 error] 
> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or 
> similar 
>
> The only place that message is generated is *in* finalize_fairy, 
> *before* the checkin occurs, and *only* if _finalize_fairy was called 
> *with* a connection that is non -None.There is no other call 
> _finalize_fairy with that request identifier, therefore I see no 
> evidence of where this message "exception during reset" can be 
> generated from - there would at least need to be an additional log 
> line with "FINALIZE FAIRY" and the connection non-None for this to be 
> reached.I see no way that the pool.py shipped with SQLAlchemy can 
> possibly generate this series of steps. 
>
> This is an extremely intricate issue and I still have only the vaguest 
> notion of how you are running SQLAlchemy / Nova. 
>
>
>
>
> On Mon, May 14, 2018 at 11:10 AM, Mike Bayer  > wrote: 
> > OK I need the exact pool.py you are using as well that includes all 
> > these logging statements and any other adjustments that were made. 
> > 
> > 
> > 
> > On Mon, May 14, 2018 at 8:02 AM,   
> wrote: 
> >> I've uploaded the logs here: 
> >> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log 
> >> 
> >> The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032 
> >> 
> >> At  2018-05-14T11:11:39.970, we got the log " jjs rec already in pool 
> >> checkin", but I just deleted `return` in code, so the problem happen 
> again. 
> >> 
> >> 
> >> 
> >> 
> >>> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote: 
> >>> 
> >>> as well as the log that shows this: 
> >>> 
> >>> +if self in pool._pool.queue: 
> >>> +pool.logger.debug("jjs rec already in pool checkin, skip %r", 
> >>> self) 
> >>> +return 
> >>> 
> >>> 
> >>> actually happening. 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  
>
> >>> wrote: 
> >>> > On Sun, May 13, 2018 at 10:35 PM,   wrote: 
> >>> >> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
The tid is a coroutine id; we get it by id(greenlet.getcurrent()) in each 
coroutine (green thread).

We run nova-conductor with monkey_patch, the same as [1]:
eventlet.monkey_patch(os=False)

Our nova-conductor is running with multiple workers.

It's confusing that _finalize_fairy is called twice but only logs once at 
the entrance; that's why I guess it's called by weakref's callback.

[1]https://github.com/openstack/nova/blob/master/nova/cmd/__init__.py
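
A small sketch of how such a tid can be produced (the logging setup here is an 
assumption; only the id(greenlet.getcurrent()) part comes from this thread):

import eventlet
eventlet.monkey_patch(os=False)

import logging

import greenlet

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s [tid:%(tid)s] %(message)s")
log = logging.getLogger(__name__)


def debug(msg, *args):
    # id(greenlet.getcurrent()) identifies the green thread (coroutine),
    # not an OS-level thread
    log.debug(msg, *args, extra={"tid": id(greenlet.getcurrent())})


def worker(name):
    debug("jjs example log line from %s", name)


pool = eventlet.GreenPool()
for name in ("a", "b"):
    pool.spawn(worker, name)
pool.waitall()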


On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote:
>
> I notice in this log file there are multiple "tid", which I assume to 
> be thread identifiers ?   E.g. OS-level threads?Are you using 
> eventlet or not?Are you monkeypatching SQLAlchemy and then using 
> multiple OS-level threads at the same time? Because that likely 
> doesn't work. 
>
> Particularly, the log here illustrates an impossible codepath. 
>
> We enter finalize_fairy with connection is None: 
>
> 2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG 
> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
> [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN: 
> None, FAIRY: , 
> REF: None  CONNECTION_REC:  at 0x776fe90> CONNECTION_REC REF  '_ConnectionFairy' at 0x87dd3d0> debug 
>
> this seems to make it through to the checkin at the bottom. 
>
> Then, we see this: 
>
> 2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR 
> [pid:5679] [MainThread] [tid:130446032] [log.py:122 error] 
> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or 
> similar 
>
> The only place that message is generated is *in* finalize_fairy, 
> *before* the checkin occurs, and *only* if _finalize_fairy was called 
> *with* a connection that is non -None.There is no other call 
> _finalize_fairy with that request identifier, therefore I see no 
> evidence of where this message "exception during reset" can be 
> generated from - there would at least need to be an additional log 
> line with "FINALIZE FAIRY" and the connection non-None for this to be 
> reached.I see no way that the pool.py shipped with SQLAlchemy can 
> possibly generate this series of steps. 
>
> This is an extremely intricate issue and I still have only the vaguest 
> notion of how you are running SQLAlchemy / Nova. 
>
>
>
>
> On Mon, May 14, 2018 at 11:10 AM, Mike Bayer  > wrote: 
> > OK I need the exact pool.py you are using as well that includes all 
> > these logging statements and any other adjustments that were made. 
> > 
> > 
> > 
> > On Mon, May 14, 2018 at 8:02 AM,   
> wrote: 
> >> I've uploaded the logs here: 
> >> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log 
> >> 
> >> The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032 
> >> 
> >> At  2018-05-14T11:11:39.970, we got the log " jjs rec already in pool 
> >> checkin", but I just deleted `return` in code, so the problem happen 
> again. 
> >> 
> >> 
> >> 
> >> 
> >>> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote: 
> >>> 
> >>> as well as the log that shows this: 
> >>> 
> >>> +if self in pool._pool.queue: 
> >>> +pool.logger.debug("jjs rec already in pool checkin, skip %r", 
> >>> self) 
> >>> +return 
> >>> 
> >>> 
> >>> actually happening. 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  
>
> >>> wrote: 
> >>> > On Sun, May 13, 2018 at 10:35 PM,   wrote: 
> >>> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong 
> Cai). We 
> >>> >> work 
> >>> >> together and add logs in the same environment, add prefix of name 
> to 
> >>> >> get a 
> >>> >> keyword for log filter. 
> >>> >> 
> >>> >> The doubled ConnectionRecords(with None connection) are got by two 
> >>> >> different 
> >>> >> coroutines, and the variable record.connection is rewrite by the 
> second 
> >>> >> coroutines. 
> >>> >> After the record being put back to QueuePool again, there are two 
> same 
> >>> >> record with same connection in the pool, which will cause the next 
> two 
> >>> >> coroutines get conflict and deadlock in psycopg2 as above. 
> >>> > 
> >>> > OK, I need the detailed step by step that shows how the two 
> coroutines 
> >>> > are interacting such that they both are returning the same 
> connection. 
> >>> >   The nova log above seems to show just one coroutine I assume since 
> >>> > there is just one nova request ID, if I understand correctly. 
> >>> > 
> >>> >> 
> >>> >> The logs of psycopg2 has been pasted at 
> >>> >> https://github.com/psycopg/psycopg2/issues/703 
> >>> > 
> >>> > The above issue says the bug has been found in SQLAlchemy but there 
> is 
> >>> > no bug report yet and I've not yet been shown how the behavior here 
> is 
> >>> > possible.  I posted a straightforward example above, can you 
> >>> > please show me where the two coroutines get mixed up ? 
> >>> > 
> >>> > 
> >>> > 
> >>> > 
> >>> >> 
> >>> >> 
> >>> >> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
I put the files we added logs to here, including engine/base.py and pool.py:
https://github.com/jiajunsu/test_logs

On Monday, May 14, 2018 at 11:10:37 PM UTC+8, Mike Bayer wrote:
>
> OK I need the exact pool.py you are using as well that includes all 
> these logging statements and any other adjustments that were made. 
>
>
>
> On Mon, May 14, 2018 at 8:02 AM,   
> wrote: 
> > I've uploaded the logs here: 
> > https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log 
> > 
> > The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032 
> > 
> > At  2018-05-14T11:11:39.970, we got the log " jjs rec already in pool 
> > checkin", but I just deleted `return` in code, so the problem happen 
> again. 
> > 
> > 
> > 
> > 
> > On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote: 
> >> 
> >> as well as the log that shows this: 
> >> 
> >> +if self in pool._pool.queue: 
> >> +pool.logger.debug("jjs rec already in pool checkin, skip %r", 
> >> self) 
> >> +return 
> >> 
> >> 
> >> actually happening. 
> >> 
> >> 
> >> 
> >> 
> >> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  
> >> wrote: 
> >> > On Sun, May 13, 2018 at 10:35 PM,   wrote: 
> >> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). 
> We 
> >> >> work 
> >> >> together and add logs in the same environment, add prefix of name to 
> >> >> get a 
> >> >> keyword for log filter. 
> >> >> 
> >> >> The doubled ConnectionRecords(with None connection) are got by two 
> >> >> different 
> >> >> coroutines, and the variable record.connection is rewrite by the 
> second 
> >> >> coroutines. 
> >> >> After the record being put back to QueuePool again, there are two 
> same 
> >> >> record with same connection in the pool, which will cause the next 
> two 
> >> >> coroutines get conflict and deadlock in psycopg2 as above. 
> >> > 
> >> > OK, I need the detailed step by step that shows how the two 
> coroutines 
> >> > are interacting such that they both are returning the same 
> connection. 
> >> >   The nova log above seems to show just one coroutine I assume since 
> >> > there is just one nova request ID, if I understand correctly. 
> >> > 
> >> >> 
> >> >> The logs of psycopg2 has been pasted at 
> >> >> https://github.com/psycopg/psycopg2/issues/703 
> >> > 
> >> > The above issue says the bug has been found in SQLAlchemy but there 
> is 
> >> > no bug report yet and I've not yet been shown how the behavior here 
> is 
> >> > possible.  I posted a straightforward example above, can you 
> >> > please show me where the two coroutines get mixed up ? 
> >> > 
> >> > 
> >> > 
> >> > 
> >> >> 
> >> >> 
> >> >> 
> >> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote: 
> >> >>> 
> >> >>> I tried to follow your log lines, which is difficult because I 
> don't 
> >> >>> know where those log statements actually are and I don't know what 
> >> >>> acronyms like "jjs", "cjr", mean.   But it does look like it's all 
> in 
> >> >>> one thread (so not async GC) and all in one Nova request (so 
> perhaps 
> >> >>> just one greenlet involved, unless other greenlets are affecting 
> this 
> >> >>> without them being displayed in this log snippet).   I do see the 
> >> >>> connection record doubled at the end.   So here is the illustration 
> of 
> >> >>> those steps in isolation, the connection record is not doubled. 
>  Can 
> >> >>> you alter this program to show this condition occurring? 
> >> >>> 
> >> >>> from sqlalchemy import create_engine 
> >> >>> 
> >> >>> e = create_engine("postgresql://scott:tiger@localhost/test", 
> >> >>> echo_pool='debug') 
> >> >>> e.connect().close() 
> >> >>> 
> >> >>> assert len(e.pool._pool.queue) == 1 
> >> >>> 
> >> >>> c1 = e.connect() 
> >> >>> 
> >> >>> assert len(e.pool._pool.queue) == 0 
> >> >>> 
> >> >>> c1.engine.pool._invalidate(c1._Connection__connection, None) 
> >> >>> c1.invalidate(e) 
> >> >>> 
> >> >>> # connection record is back 
> >> >>> assert len(e.pool._pool.queue) == 1 
> >> >>> 
> >> >>> # do a reconnect 
> >> >>> c1.connection 
> >> >>> 
> >> >>> # uses the record again 
> >> >>> assert len(e.pool._pool.queue) == 0 
> >> >>> 
> >> >>> # close the Connection 
> >> >>> c1.close() 
> >> >>> 
> >> >>> # record is back 
> >> >>> assert len(e.pool._pool.queue) == 1 
> >> >>> 
> >> >>> # etc. 
> >> >>> c1 = e.connect() 
> >> >>> assert len(e.pool._pool.queue) == 0 
> >> >>> 
> >> >>> c1.close() 
> >> >>> assert len(e.pool._pool.queue) == 1 
> >> >>> 
> >> >>> 
> >> >>> 
> >> >>> 
> >> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer <
> mik...@zzzcomputing.com> 
> >> >>> wrote: 
> >> >>> > Here's a print statement you might want to try: 
> >> >>> > 
> >> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py 
> >> >>> > index 89a4cea7c..452db4883 100644 
> >> >>> > --- a/lib/sqlalchemy/pool.py 
> >> >>> > +++ b/lib/sqlalchemy/pool.py 
> >> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, 
> >> >>> > 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
I notice in this log file there are multiple "tid", which I assume to
be thread identifiers ?   E.g. OS-level threads?Are you using
eventlet or not?Are you monkeypatching SQLAlchemy and then using
multiple OS-level threads at the same time? Because that likely
doesn't work.

Particularly, the log here illustrates an impossible codepath.

We enter finalize_fairy with connection is None:

2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG
[pid:5679] [MainThread] [tid:130446352] [log.py:104 debug]
[req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN:
None, FAIRY: ,
REF: None  CONNECTION_REC:  CONNECTION_REC REF  debug

this seems to make it through to the checkin at the bottom.

Then, we see this:

2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR
[pid:5679] [MainThread] [tid:130446032] [log.py:122 error]
[req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or
similar

The only place that message is generated is *in* finalize_fairy,
*before* the checkin occurs, and *only* if _finalize_fairy was called
*with* a connection that is non-None.   There is no other call to
_finalize_fairy with that request identifier, therefore I see no
evidence of where this "exception during reset" message can be
generated from - there would at least need to be an additional log
line with "FINALIZE FAIRY" and the connection non-None for this to be
reached.   I see no way that the pool.py shipped with SQLAlchemy could
possibly generate this series of steps.

This is an extremely intricate issue and I still have only the vaguest
notion of how you are running SQLAlchemy / Nova.
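
One way to double-check that reasoning against the uploaded log (a sketch; the 
file name and message texts are taken from the excerpts in this thread, 
everything else is an assumption):

import re
from collections import defaultdict

finalize = defaultdict(list)   # request id -> list of "DB CONN" values seen
reset_errors = []              # request ids that logged the reset exception

with open("nc_5679.log") as f:
    for line in f:
        req = re.search(r"\[(req-[0-9a-f-]+)\]", line)
        if not req:
            continue
        if "FINALIZE FAIRY" in line:
            conn = re.search(r"DB CONN:\s*(\S+?),", line)
            finalize[req.group(1)].append(conn.group(1) if conn else "?")
        elif "Exception during reset or similar" in line:
            reset_errors.append(req.group(1))

for req_id in reset_errors:
    conns = finalize.get(req_id, [])
    # per the reasoning above, a non-None DB CONN should appear here
    print(req_id, "FINALIZE FAIRY conns seen:", conns or "none (impossible path)")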




On Mon, May 14, 2018 at 11:10 AM, Mike Bayer  wrote:
> OK I need the exact pool.py you are using as well that includes all
> these logging statements and any other adjustments that were made.
>
>
>
> On Mon, May 14, 2018 at 8:02 AM,   wrote:
>> I've uploaded the logs here:
>> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log
>>
>> The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032
>>
>> At  2018-05-14T11:11:39.970, we got the log " jjs rec already in pool
>> checkin", but I just deleted `return` in code, so the problem happen again.
>>
>>
>>
>>
>> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote:
>>>
>>> as well as the log that shows this:
>>>
>>> +if self in pool._pool.queue:
>>> +pool.logger.debug("jjs rec already in pool checkin, skip %r",
>>> self)
>>> +return
>>>
>>>
>>> actually happening.
>>>
>>>
>>>
>>>
>>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer 
>>> wrote:
>>> > On Sun, May 13, 2018 at 10:35 PM,   wrote:
>>> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We
>>> >> work
>>> >> together and add logs in the same environment, add prefix of name to
>>> >> get a
>>> >> keyword for log filter.
>>> >>
>>> >> The doubled ConnectionRecords(with None connection) are got by two
>>> >> different
>>> >> coroutines, and the variable record.connection is rewrite by the second
>>> >> coroutines.
>>> >> After the record being put back to QueuePool again, there are two same
>>> >> record with same connection in the pool, which will cause the next two
>>> >> coroutines get conflict and deadlock in psycopg2 as above.
>>> >
>>> > OK, I need the detailed step by step that shows how the two coroutines
>>> > are interacting such that they both are returning the same connection.
>>> >   The nova log above seems to show just one coroutine I assume since
>>> > there is just one nova request ID, if I understand correctly.
>>> >
>>> >>
>>> >> The logs of psycopg2 has been pasted at
>>> >> https://github.com/psycopg/psycopg2/issues/703
>>> >
>>> > The above issue says the bug has been found in SQLAlchemy but there is
>>> > no bug report yet and I've not yet been shown how the behavior here is
>>> > possible.  I posted a straightforward example above, can you
>>> > please show me where the two coroutines get mixed up ?
>>> >
>>> >
>>> >
>>> >
>>> >>
>>> >>
>>> >>
>>> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>>> >>>
>>> >>> I tried to follow your log lines, which is difficult because I don't
>>> >>> know where those log statements actually are and I don't know what
>>> >>> acronyms like "jjs", "cjr", mean.   But it does look like it's all in
>>> >>> one thread (so not async GC) and all in one Nova request (so perhaps
>>> >>> just one greenlet involved, unless other greenlets are affecting this
>>> >>> without them being displayed in this log snippet).   I do see the
>>> >>> connection record doubled at the end.   So here is the illustration of
>>> >>> those steps in isolation, the connection record is not doubled.  Can
>>> >>> you alter this program to show this condition occurring?
>>> >>>
>>> >>> from sqlalchemy import create_engine
>>> >>>
>>> >>> e = create_engine("postgresql://scott:tiger@localhost/test",
>>> >>> echo_pool='debug')

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
OK I need the exact pool.py you are using as well that includes all
these logging statements and any other adjustments that were made.



On Mon, May 14, 2018 at 8:02 AM,   wrote:
> I've uploaded the logs here:
> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log
>
> The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032
>
> At  2018-05-14T11:11:39.970, we got the log " jjs rec already in pool
> checkin", but I just deleted `return` in code, so the problem happen again.
>
>
>
>
> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote:
>>
>> as well as the log that shows this:
>>
>> +if self in pool._pool.queue:
>> +pool.logger.debug("jjs rec already in pool checkin, skip %r",
>> self)
>> +return
>>
>>
>> actually happening.
>>
>>
>>
>>
>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer 
>> wrote:
>> > On Sun, May 13, 2018 at 10:35 PM,   wrote:
>> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We
>> >> work
>> >> together and add logs in the same environment, add prefix of name to
>> >> get a
>> >> keyword for log filter.
>> >>
>> >> The doubled ConnectionRecords(with None connection) are got by two
>> >> different
>> >> coroutines, and the variable record.connection is rewrite by the second
>> >> coroutines.
>> >> After the record being put back to QueuePool again, there are two same
>> >> record with same connection in the pool, which will cause the next two
>> >> coroutines get conflict and deadlock in psycopg2 as above.
>> >
>> > OK, I need the detailed step by step that shows how the two coroutines
>> > are interacting such that they both are returning the same connection.
>> >   The nova log above seems to show just one coroutine I assume since
>> > there is just one nova request ID, if I understand correctly.
>> >
>> >>
>> >> The logs of psycopg2 has been pasted at
>> >> https://github.com/psycopg/psycopg2/issues/703
>> >
>> > The above issue says the bug has been found in SQLAlchemy but there is
>> > no bug report yet and I've not yet been shown how the behavior here is
>> > possible.  I posted a straightforward example above, can you
>> > please show me where the two coroutines get mixed up ?
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >>
>> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>> >>>
>> >>> I tried to follow your log lines, which is difficult because I don't
>> >>> know where those log statements actually are and I don't know what
>> >>> acronyms like "jjs", "cjr", mean.   But it does look like it's all in
>> >>> one thread (so not async GC) and all in one Nova request (so perhaps
>> >>> just one greenlet involved, unless other greenlets are affecting this
>> >>> without them being displayed in this log snippet).   I do see the
>> >>> connection record doubled at the end.   So here is the illustration of
>> >>> those steps in isolation, the connection record is not doubled.  Can
>> >>> you alter this program to show this condition occurring?
>> >>>
>> >>> from sqlalchemy import create_engine
>> >>>
>> >>> e = create_engine("postgresql://scott:tiger@localhost/test",
>> >>> echo_pool='debug')
>> >>> e.connect().close()
>> >>>
>> >>> assert len(e.pool._pool.queue) == 1
>> >>>
>> >>> c1 = e.connect()
>> >>>
>> >>> assert len(e.pool._pool.queue) == 0
>> >>>
>> >>> c1.engine.pool._invalidate(c1._Connection__connection, None)
>> >>> c1.invalidate(e)
>> >>>
>> >>> # connection record is back
>> >>> assert len(e.pool._pool.queue) == 1
>> >>>
>> >>> # do a reconnect
>> >>> c1.connection
>> >>>
>> >>> # uses the record again
>> >>> assert len(e.pool._pool.queue) == 0
>> >>>
>> >>> # close the Connection
>> >>> c1.close()
>> >>>
>> >>> # record is back
>> >>> assert len(e.pool._pool.queue) == 1
>> >>>
>> >>> # etc.
>> >>> c1 = e.connect()
>> >>> assert len(e.pool._pool.queue) == 0
>> >>>
>> >>> c1.close()
>> >>> assert len(e.pool._pool.queue) == 1
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer 
>> >>> wrote:
>> >>> > Here's a print statement you might want to try:
>> >>> >
>> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
>> >>> > index 89a4cea7c..452db4883 100644
>> >>> > --- a/lib/sqlalchemy/pool.py
>> >>> > +++ b/lib/sqlalchemy/pool.py
>> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection,
>> >>> > connection_record,
>> >>> >  been garbage collected.
>> >>> >
>> >>> >  """
>> >>> > +print(
>> >>> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s,
>> >>> > REF:
>> >>> > %s  "
>> >>> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % (
>> >>> > +"put greenlet id here",
>> >>> > +connection,
>> >>> > +fairy,
>> >>> > +ref,
>> >>> > +connection_record,
>> >>> > +connection_record.fairy_ref
>> >>> > +if connection_record is not None else "n/a"
>> >>> > +)
>> >>> > 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
I've uploaded the logs 
here: https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log

The ConnectionRecord is 0x70cdb10, tid(coroutine id) is 130446032

At 2018-05-14T11:11:39.970 we got the log "jjs rec already in pool
checkin", but I had just deleted the `return` in the code, so the problem happened again.




On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote:
>
> as well as the log that shows this: 
>
> +if self in pool._pool.queue: 
> +pool.logger.debug("jjs rec already in pool checkin, skip %r", 
> self) 
> +return 
>
>
> actually happening. 
>
>
>
>
> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  > wrote: 
> > On Sun, May 13, 2018 at 10:35 PM,   
> wrote: 
> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We 
> work 
> >> together and add logs in the same environment, add prefix of name to 
> get a 
> >> keyword for log filter. 
> >> 
> >> The doubled ConnectionRecords(with None connection) are got by two 
> different 
> >> coroutines, and the variable record.connection is rewrite by the second 
> >> coroutines. 
> >> After the record being put back to QueuePool again, there are two same 
> >> record with same connection in the pool, which will cause the next two 
> >> coroutines get conflict and deadlock in psycopg2 as above. 
> > 
> > OK, I need the detailed step by step that shows how the two coroutines 
> > are interacting such that they both are returning the same connection. 
> >   The nova log above seems to show just one coroutine I assume since 
> > there is just one nova request ID, if I understand correctly. 
> > 
> >> 
> >> The logs of psycopg2 has been pasted at 
> >> https://github.com/psycopg/psycopg2/issues/703 
> > 
> > The above issue says the bug has been found in SQLAlchemy but there is 
> > no bug report yet and I've not yet been shown how the behavior here is 
> > possible.  I posted a straightforward example above, can you 
> > please show me where the two coroutines get mixed up ? 
> > 
> > 
> > 
> > 
> >> 
> >> 
> >> 
> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
> >>> 
> >>> I tried to follow your log lines, which is difficult because I don't 
> >>> know where those log statements actually are and I don't know what 
> >>> acronyms like "jjs", "cjr", mean.   But it does look like it's al in 
> >>> one thread (so not async GC) and all in one Nova request (so perhaps 
> >>> just one greenlet involved, unless other greenlets are affecting this 
> >>> without them being displayed in this log snippet).   I do see the 
> >>> connection record doubled at the end.   So here is the illustration of 
> >>> those steps in isolation, the connection record is not doubled.  Can 
> >>> you alter this program to show this condition occurring? 
> >>> 
> >>> from sqlalchemy import create_engine 
> >>> 
> >>> e = create_engine("postgresql://scott:tiger@localhost/test", 
> >>> echo_pool='debug') 
> >>> e.connect().close() 
> >>> 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> c1 = e.connect() 
> >>> 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> c1.engine.pool._invalidate(c1._Connection__connection, None) 
> >>> c1.invalidate(e) 
> >>> 
> >>> # connection record is back 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> # do a reconnect 
> >>> c1.connection 
> >>> 
> >>> # uses the record again 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> # close the Connection 
> >>> c1.close() 
> >>> 
> >>> # record is back 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> # etc. 
> >>> c1 = e.connect() 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> c1.close() 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer  
>
> >>> wrote: 
> >>> > Here's a print statement you might want to try: 
> >>> > 
> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py 
> >>> > index 89a4cea7c..452db4883 100644 
> >>> > --- a/lib/sqlalchemy/pool.py 
> >>> > +++ b/lib/sqlalchemy/pool.py 
> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, 
> connection_record, 
> >>> >  been garbage collected. 
> >>> > 
> >>> >  """ 
> >>> > +print( 
> >>> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, 
> REF: 
> >>> > %s  " 
> >>> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % ( 
> >>> > +"put greenlet id here", 
> >>> > +connection, 
> >>> > +fairy, 
> >>> > +ref, 
> >>> > +connection_record, 
> >>> > +connection_record.fairy_ref 
> >>> > +if connection_record is not None else "n/a" 
> >>> > +) 
> >>> > +) 
> >>> > + 
> >>> >  _refs.discard(connection_record) 
> >>> > 
> >>> >  if ref is not None: 
> >>> > 
> >>> > 
> >>> > 
> >>> > 
> >>> > 
> >>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer <
> mik...@zzzcomputing.com> 
> >>> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
OK. I'll reproduce it and upload all logs of the process.

On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote:
>
> as well as the log that shows this: 
>
> +if self in pool._pool.queue: 
> +pool.logger.debug("jjs rec already in pool checkin, skip %r", 
> self) 
> +return 
>
>
> actually happening. 
>
>
>
>
> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  > wrote: 
> > On Sun, May 13, 2018 at 10:35 PM,   
> wrote: 
> >> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We 
> work 
> >> together and add logs in the same environment, add prefix of name to 
> get a 
> >> keyword for log filter. 
> >> 
> >> The doubled ConnectionRecords(with None connection) are got by two 
> different 
> >> coroutines, and the variable record.connection is rewrite by the second 
> >> coroutines. 
> >> After the record being put back to QueuePool again, there are two same 
> >> record with same connection in the pool, which will cause the next two 
> >> coroutines get conflict and deadlock in psycopg2 as above. 
> > 
> > OK, I need the detailed step by step that shows how the two coroutines 
> > are interacting such that they both are returning the same connection. 
> >   The nova log above seems to show just one coroutine I assume since 
> > there is just one nova request ID, if I understand correctly. 
> > 
> >> 
> >> The logs of psycopg2 has been pasted at 
> >> https://github.com/psycopg/psycopg2/issues/703 
> > 
> > The above issue says the bug has been found in SQLAlchemy but there is 
> > no bug report yet and I've not yet been shown how the behavior here is 
> > possible.  I posted a straightforward example above, can you 
> > please show me where the two coroutines get mixed up ? 
> > 
> > 
> > 
> > 
> >> 
> >> 
> >> 
> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
> >>> 
> >>> I tried to follow your log lines, which is difficult because I don't 
> >>> know where those log statements actually are and I don't know what 
> >>> acronyms like "jjs", "cjr", mean.   But it does look like it's al in 
> >>> one thread (so not async GC) and all in one Nova request (so perhaps 
> >>> just one greenlet involved, unless other greenlets are affecting this 
> >>> without them being displayed in this log snippet).   I do see the 
> >>> connection record doubled at the end.   So here is the illustration of 
> >>> those steps in isolation, the connection record is not doubled.  Can 
> >>> you alter this program to show this condition occurring? 
> >>> 
> >>> from sqlalchemy import create_engine 
> >>> 
> >>> e = create_engine("postgresql://scott:tiger@localhost/test", 
> >>> echo_pool='debug') 
> >>> e.connect().close() 
> >>> 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> c1 = e.connect() 
> >>> 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> c1.engine.pool._invalidate(c1._Connection__connection, None) 
> >>> c1.invalidate(e) 
> >>> 
> >>> # connection record is back 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> # do a reconnect 
> >>> c1.connection 
> >>> 
> >>> # uses the record again 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> # close the Connection 
> >>> c1.close() 
> >>> 
> >>> # record is back 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> # etc. 
> >>> c1 = e.connect() 
> >>> assert len(e.pool._pool.queue) == 0 
> >>> 
> >>> c1.close() 
> >>> assert len(e.pool._pool.queue) == 1 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer  
>
> >>> wrote: 
> >>> > Here's a print statement you might want to try: 
> >>> > 
> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py 
> >>> > index 89a4cea7c..452db4883 100644 
> >>> > --- a/lib/sqlalchemy/pool.py 
> >>> > +++ b/lib/sqlalchemy/pool.py 
> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, 
> connection_record, 
> >>> >  been garbage collected. 
> >>> > 
> >>> >  """ 
> >>> > +print( 
> >>> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, 
> REF: 
> >>> > %s  " 
> >>> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % ( 
> >>> > +"put greenlet id here", 
> >>> > +connection, 
> >>> > +fairy, 
> >>> > +ref, 
> >>> > +connection_record, 
> >>> > +connection_record.fairy_ref 
> >>> > +if connection_record is not None else "n/a" 
> >>> > +) 
> >>> > +) 
> >>> > + 
> >>> >  _refs.discard(connection_record) 
> >>> > 
> >>> >  if ref is not None: 
> >>> > 
> >>> > 
> >>> > 
> >>> > 
> >>> > 
> >>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer <
> mik...@zzzcomputing.com> 
> >>> > wrote: 
> >>> >> On Thu, May 10, 2018 at 7:23 AM,   wrote: 
> >>> >>> Hi Mike, 
> >>> >>> 
> >>> >>> I added more logs and finnaly find the reason and a method to 
> avoid 
> >>> >>> the 
> >>> >>> deadlock between coroutines. 
> >>> >>> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
as well as the log that shows this:

+if self in pool._pool.queue:
+pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
+return


actually happening.




On Sun, May 13, 2018 at 10:42 PM, Mike Bayer  wrote:
> On Sun, May 13, 2018 at 10:35 PM,   wrote:
>> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We work
>> together and add logs in the same environment, add prefix of name to get a
>> keyword for log filter.
>>
>> The doubled ConnectionRecords(with None connection) are got by two different
>> coroutines, and the variable record.connection is rewrite by the second
>> coroutines.
>> After the record being put back to QueuePool again, there are two same
>> record with same connection in the pool, which will cause the next two
>> coroutines get conflict and deadlock in psycopg2 as above.
>
> OK, I need the detailed step by step that shows how the two coroutines
> are interacting such that they both are returning the same connection.
>   The nova log above seems to show just one coroutine I assume since
> there is just one nova request ID, if I understand correctly.
>
>>
>> The logs of psycopg2 has been pasted at
>> https://github.com/psycopg/psycopg2/issues/703
>
> The above issue says the bug has been found in SQLAlchemy but there is
> no bug report yet and I've not yet been shown how the behavior here is
> possible.  I posted a straightforward example above, can you
> please show me where the two coroutines get mixed up ?
>
>
>
>
>>
>>
>>
>> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>>>
>>> I tried to follow your log lines, which is difficult because I don't
>>> know where those log statements actually are and I don't know what
>>> acronyms like "jjs", "cjr", mean.   But it does look like it's al in
>>> one thread (so not async GC) and all in one Nova request (so perhaps
>>> just one greenlet involved, unless other greenlets are affecting this
>>> without them being displayed in this log snippet).   I do see the
>>> connection record doubled at the end.   So here is the illustration of
>>> those steps in isolation, the connection record is not doubled.  Can
>>> you alter this program to show this condition occurring?
>>>
>>> from sqlalchemy import create_engine
>>>
>>> e = create_engine("postgresql://scott:tiger@localhost/test",
>>> echo_pool='debug')
>>> e.connect().close()
>>>
>>> assert len(e.pool._pool.queue) == 1
>>>
>>> c1 = e.connect()
>>>
>>> assert len(e.pool._pool.queue) == 0
>>>
>>> c1.engine.pool._invalidate(c1._Connection__connection, None)
>>> c1.invalidate(e)
>>>
>>> # connection record is back
>>> assert len(e.pool._pool.queue) == 1
>>>
>>> # do a reconnect
>>> c1.connection
>>>
>>> # uses the record again
>>> assert len(e.pool._pool.queue) == 0
>>>
>>> # close the Connection
>>> c1.close()
>>>
>>> # record is back
>>> assert len(e.pool._pool.queue) == 1
>>>
>>> # etc.
>>> c1 = e.connect()
>>> assert len(e.pool._pool.queue) == 0
>>>
>>> c1.close()
>>> assert len(e.pool._pool.queue) == 1
>>>
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer 
>>> wrote:
>>> > Here's a print statement you might want to try:
>>> >
>>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
>>> > index 89a4cea7c..452db4883 100644
>>> > --- a/lib/sqlalchemy/pool.py
>>> > +++ b/lib/sqlalchemy/pool.py
>>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
>>> >  been garbage collected.
>>> >
>>> >  """
>>> > +print(
>>> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF:
>>> > %s  "
>>> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % (
>>> > +"put greenlet id here",
>>> > +connection,
>>> > +fairy,
>>> > +ref,
>>> > +connection_record,
>>> > +connection_record.fairy_ref
>>> > +if connection_record is not None else "n/a"
>>> > +)
>>> > +)
>>> > +
>>> >  _refs.discard(connection_record)
>>> >
>>> >  if ref is not None:
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer 
>>> > wrote:
>>> >> On Thu, May 10, 2018 at 7:23 AM,   wrote:
>>> >>> Hi Mike,
>>> >>>
>>> >>> I added more logs and finnaly find the reason and a method to avoid
>>> >>> the
>>> >>> deadlock between coroutines.
>>> >>>
>>> >>> In summary, the deadlock happend because two different coroutines use
>>> >>> the
>>> >>> same ConnectionRecord with same connection.
>>> >>> It happends under QueuePool, when the DB connections are reset by the
>>> >>> DB
>>> >>> server(not related with the backend driver).
>>> >>>
>>> >>> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
>>> >>> weakref of the ConnectionFairy, with callback function
>>> >>> `_finalize_fairy`.
>>> >>> 2.When the connection get disconnected, it receives an
>>> >>> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
On Sun, May 13, 2018 at 10:35 PM,   wrote:
> "jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We work
> together and add logs in the same environment, add prefix of name to get a
> keyword for log filter.
>
> The doubled ConnectionRecords(with None connection) are got by two different
> coroutines, and the variable record.connection is rewrite by the second
> coroutines.
> After the record being put back to QueuePool again, there are two same
> record with same connection in the pool, which will cause the next two
> coroutines get conflict and deadlock in psycopg2 as above.

OK, I need the detailed step by step that shows how the two coroutines
are interacting such that they both are returning the same connection.
  The nova log above seems to show just one coroutine I assume since
there is just one nova request ID, if I understand correctly.

>
> The logs of psycopg2 has been pasted at
> https://github.com/psycopg/psycopg2/issues/703

The above issue says the bug has been found in SQLAlchemy but there is
no bug report yet and I've not yet been shown how the behavior here is
possible.  I posted a straightforward example above, can you
please show me where the two coroutines get mixed up ?




>
>
>
> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>>
>> I tried to follow your log lines, which is difficult because I don't
>> know where those log statements actually are and I don't know what
>> acronyms like "jjs", "cjr", mean.   But it does look like it's al in
>> one thread (so not async GC) and all in one Nova request (so perhaps
>> just one greenlet involved, unless other greenlets are affecting this
>> without them being displayed in this log snippet).   I do see the
>> connection record doubled at the end.   So here is the illustration of
>> those steps in isolation, the connection record is not doubled.  Can
>> you alter this program to show this condition occurring?
>>
>> from sqlalchemy import create_engine
>>
>> e = create_engine("postgresql://scott:tiger@localhost/test",
>> echo_pool='debug')
>> e.connect().close()
>>
>> assert len(e.pool._pool.queue) == 1
>>
>> c1 = e.connect()
>>
>> assert len(e.pool._pool.queue) == 0
>>
>> c1.engine.pool._invalidate(c1._Connection__connection, None)
>> c1.invalidate(e)
>>
>> # connection record is back
>> assert len(e.pool._pool.queue) == 1
>>
>> # do a reconnect
>> c1.connection
>>
>> # uses the record again
>> assert len(e.pool._pool.queue) == 0
>>
>> # close the Connection
>> c1.close()
>>
>> # record is back
>> assert len(e.pool._pool.queue) == 1
>>
>> # etc.
>> c1 = e.connect()
>> assert len(e.pool._pool.queue) == 0
>>
>> c1.close()
>> assert len(e.pool._pool.queue) == 1
>>
>>
>>
>>
>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer 
>> wrote:
>> > Here's a print statement you might want to try:
>> >
>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
>> > index 89a4cea7c..452db4883 100644
>> > --- a/lib/sqlalchemy/pool.py
>> > +++ b/lib/sqlalchemy/pool.py
>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
>> >  been garbage collected.
>> >
>> >  """
>> > +print(
>> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF:
>> > %s  "
>> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % (
>> > +"put greenlet id here",
>> > +connection,
>> > +fairy,
>> > +ref,
>> > +connection_record,
>> > +connection_record.fairy_ref
>> > +if connection_record is not None else "n/a"
>> > +)
>> > +)
>> > +
>> >  _refs.discard(connection_record)
>> >
>> >  if ref is not None:
>> >
>> >
>> >
>> >
>> >
>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer 
>> > wrote:
>> >> On Thu, May 10, 2018 at 7:23 AM,   wrote:
>> >>> Hi Mike,
>> >>>
>> >>> I added more logs and finnaly find the reason and a method to avoid
>> >>> the
>> >>> deadlock between coroutines.
>> >>>
>> >>> In summary, the deadlock happend because two different coroutines use
>> >>> the
>> >>> same ConnectionRecord with same connection.
>> >>> It happends under QueuePool, when the DB connections are reset by the
>> >>> DB
>> >>> server(not related with the backend driver).
>> >>>
>> >>> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
>> >>> weakref of the ConnectionFairy, with callback function
>> >>> `_finalize_fairy`.
>> >>> 2.When the connection get disconnected, it receives an
>> >>> exception(psycopg2.OperationalError), and goto
>> >>> Connection._handle_dbapi_exception.
>> >>> 3.In `_handle_dbapi_exception.finnaly`, it will call both
>> >>> `ConnectionFairy.invalidate` and `self.invalidate`.
>> >>> 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is
>> >>> called,
>> >>> and put the ConnectionRecord back to the QueuePool.
>> >>> 5.In `self.invalidate`, `del self._root.__connection` is 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
"jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We 
work together and add logs in the same environment, add prefix of name to 
get a keyword for log filter.

The doubled ConnectionRecords(with None connection) are got by two 
different coroutines, and the variable record.connection is rewrite by the 
second coroutines.
After the record being put back to QueuePool again, there are two same 
record with same connection in the pool, which will cause the next two 
coroutines get conflict and deadlock in psycopg2 as above.

The psycopg2 logs have been pasted
at https://github.com/psycopg/psycopg2/issues/703



On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>
> I tried to follow your log lines, which is difficult because I don't 
> know where those log statements actually are and I don't know what 
> acronyms like "jjs", "cjr", mean.   But it does look like it's al in 
> one thread (so not async GC) and all in one Nova request (so perhaps 
> just one greenlet involved, unless other greenlets are affecting this 
> without them being displayed in this log snippet).   I do see the 
> connection record doubled at the end.   So here is the illustration of 
> those steps in isolation, the connection record is not doubled.  Can 
> you alter this program to show this condition occurring? 
>
> from sqlalchemy import create_engine 
>
> e = create_engine("postgresql://scott:tiger@localhost/test", 
> echo_pool='debug') 
> e.connect().close() 
>
> assert len(e.pool._pool.queue) == 1 
>
> c1 = e.connect() 
>
> assert len(e.pool._pool.queue) == 0 
>
> c1.engine.pool._invalidate(c1._Connection__connection, None) 
> c1.invalidate(e) 
>
> # connection record is back 
> assert len(e.pool._pool.queue) == 1 
>
> # do a reconnect 
> c1.connection 
>
> # uses the record again 
> assert len(e.pool._pool.queue) == 0 
>
> # close the Connection 
> c1.close() 
>
> # record is back 
> assert len(e.pool._pool.queue) == 1 
>
> # etc. 
> c1 = e.connect() 
> assert len(e.pool._pool.queue) == 0 
>
> c1.close() 
> assert len(e.pool._pool.queue) == 1 
>
>
>
>
> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer  > wrote: 
> > Here's a print statement you might want to try: 
> > 
> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py 
> > index 89a4cea7c..452db4883 100644 
> > --- a/lib/sqlalchemy/pool.py 
> > +++ b/lib/sqlalchemy/pool.py 
> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record, 
> >  been garbage collected. 
> > 
> >  """ 
> > +print( 
> > +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: 
> %s  " 
> > +"CONNECTION_REC: %s CONNECTION_REC REF %s" % ( 
> > +"put greenlet id here", 
> > +connection, 
> > +fairy, 
> > +ref, 
> > +connection_record, 
> > +connection_record.fairy_ref 
> > +if connection_record is not None else "n/a" 
> > +) 
> > +) 
> > + 
> >  _refs.discard(connection_record) 
> > 
> >  if ref is not None: 
> > 
> > 
> > 
> > 
> > 
> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer  > wrote: 
> >> On Thu, May 10, 2018 at 7:23 AM,   
> wrote: 
> >>> Hi Mike, 
> >>> 
> >>> I added more logs and finnaly find the reason and a method to avoid 
> the 
> >>> deadlock between coroutines. 
> >>> 
> >>> In summary, the deadlock happend because two different coroutines use 
> the 
> >>> same ConnectionRecord with same connection. 
> >>> It happends under QueuePool, when the DB connections are reset by the 
> DB 
> >>> server(not related with the backend driver). 
> >>> 
> >>> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a 
> >>> weakref of the ConnectionFairy, with callback function 
> `_finalize_fairy`. 
> >>> 2.When the connection get disconnected, it receives an 
> >>> exception(psycopg2.OperationalError), and goto 
> >>> Connection._handle_dbapi_exception. 
> >>> 3.In `_handle_dbapi_exception.finnaly`, it will call both 
> >>> `ConnectionFairy.invalidate` and `self.invalidate`. 
> >>> 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is 
> called, 
> >>> and put the ConnectionRecord back to the QueuePool. 
> >>> 5.In `self.invalidate`, `del self._root.__connection` is called, and 
> **del** 
> >>> the ConnectionFairy. 
> >>> 6.The **del** operation will make weakref of ConnectionFairy to dead, 
> and 
> >>> the callback `_finalize_fairy` executed. 
> >>> 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is 
> called, put 
> >>> the ConnectionRecord to the QueuePool again. 
> >> 
> >> Here's what does not make sense.  You are claiming that a simple 
> >> invalidation will result in the same ConnectionRecord being checked in 
> >> twice.   This is obviously not the case, what is special in your 
> >> situation that makes this happen? 
> >> 
> >> If I follow the steps you refer towards, in step 4, when 
> >> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
If that connection error is not handled in "is_disconnect", then this
wouldn't invalidate the pool.

However, we've had lots of problems moving VIPs around with Galera.  I
would recommend using HAProxy or PgBouncer, which are both designed to
mitigate this issue, as well as running Nova under mod_wsgi instead
of eventlet.

There are lots of ways your problem can be solved here.   I'd like to
find the specific race in SQLAlchemy, but I need specific code steps.
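
A minimal sketch of one way to do the "is_disconnect" part in application
code, assuming SQLAlchemy 1.x, a psycopg2 engine, and that the error
surfaces through a normal Connection execution path (the URL and handler
name are placeholders, not from this thread):

from sqlalchemy import create_engine, event

engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")

@event.listens_for(engine, "handle_error")
def _flag_null_pointer_as_disconnect(context):
    # context.original_exception is the raw psycopg2 error
    if "connection pointer is NULL" in str(context.original_exception):
        # treat it as a disconnect so stale pooled connections get recycled
        context.is_disconnect = True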

On Sun, May 13, 2018 at 9:50 PM,   wrote:
> The scenario we make to reproduce the problem is:
> 1.stop PG master
> 2.move the IP of PG from master node to slave node, and make an ARP
> 3.fail over the PG slave to master
>
> That will shut all DB connections to PG master, and in SQLAlchemy we got an
> exception:
>
> ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 655, in
> _finalize_fairy
> ERROR sqlalchemy.pool.QueuePool fairy._reset(pool)
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 798, in
> _reset
> ERROR sqlalchemy.pool.QueuePool self._reset_agent.rollback()
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line
> 1586, in rollback
> ERROR sqlalchemy.pool.QueuePool self._do_rollback()
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line
> 1624, in _do_rollback
> ERROR sqlalchemy.pool.QueuePool self.connection._rollback_impl()
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line
> 672, in _rollback_impl
> ERROR sqlalchemy.pool.QueuePool self._handle_dbapi_exception(e, None,
> None, None, None)
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line
> 1356, in _handle_dbapi_exception
> ERROR sqlalchemy.pool.QueuePool util.raise_from_cause(newraise,
> exc_info)
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/util/compat.py", line
> 200, in raise_from_cause
> ERROR sqlalchemy.pool.QueuePool reraise(type(exception), exception,
> tb=exc_tb)
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line
> 670, in _rollback_impl
> ERROR sqlalchemy.pool.QueuePool
> self.engine.dialect.do_rollback(self.connection)
> ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/default.py", line
> 421, in do_rollback
> ERROR sqlalchemy.pool.QueuePool dbapi_connection.rollback()
> ERROR sqlalchemy.pool.QueuePool DBConnectionError:
> (psycopg2.OperationalError) connection pointer is NULL
>
> With that exception, SQLAlchemy first get in
> `Connection._handle_dbapi_exception`(keyword "err connection pointer is
> NULL" in log) and then get it again in `_finalize_fairy`(keyword "Exception
> during reset or similar" in log)
>
>
>
> On Friday, May 11, 2018 at 12:21:42 AM UTC+8, Mike Bayer wrote:
>>
>> On Thu, May 10, 2018 at 7:23 AM,   wrote:
>> > Hi Mike,
>> >
>> > I added more logs and finnaly find the reason and a method to avoid the
>> > deadlock between coroutines.
>> >
>> > In summary, the deadlock happend because two different coroutines use
>> > the
>> > same ConnectionRecord with same connection.
>> > It happends under QueuePool, when the DB connections are reset by the DB
>> > server(not related with the backend driver).
>> >
>> > 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
>> > weakref of the ConnectionFairy, with callback function
>> > `_finalize_fairy`.
>> > 2.When the connection get disconnected, it receives an
>> > exception(psycopg2.OperationalError), and goto
>> > Connection._handle_dbapi_exception.
>> > 3.In `_handle_dbapi_exception.finnaly`, it will call both
>> > `ConnectionFairy.invalidate` and `self.invalidate`.
>> > 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is
>> > called,
>> > and put the ConnectionRecord back to the QueuePool.
>> > 5.In `self.invalidate`, `del self._root.__connection` is called, and
>> > **del**
>> > the ConnectionFairy.
>> > 6.The **del** operation will make weakref of ConnectionFairy to dead,
>> > and
>> > the callback `_finalize_fairy` executed.
>> > 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called,
>> > put
>> > the ConnectionRecord to the QueuePool again.
>>
>> Here's what does not make sense.  You are claiming that a simple
>> invalidation will result in the same ConnectionRecord being checked in
>> twice.   This is obviously not the case, what is special in your
>> situation that makes this happen?
>>
>> If I follow the steps you refer towards, in step 4, when
>> ConnectionFairy._checkin is called, that calls finalize_fairy
>> directly, which then calls ConnectionRecord.checkin().  Within
>> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None.
>>Then back in ConnectionFairy._checkin(), it also sets
>> self._connection_record = None.
>>
>> If we come back into finalize_fairy a *second* time then, as a result
>> of the "del" which occurs in _handle_error(), and the
>> 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
The scenario we use to reproduce the problem is:
1. Stop the PG master.
2. Move the IP of PG from the master node to the slave node, and send an
ARP announcement.
3. Fail over the PG slave to master.

That shuts down all DB connections to the PG master, and in SQLAlchemy we
get an exception:

ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 655, in 
_finalize_fairy
ERROR sqlalchemy.pool.QueuePool fairy._reset(pool)
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 798, in 
_reset
ERROR sqlalchemy.pool.QueuePool self._reset_agent.rollback()
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 
1586, in rollback
ERROR sqlalchemy.pool.QueuePool self._do_rollback()
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 
1624, in _do_rollback
ERROR sqlalchemy.pool.QueuePool self.connection._rollback_impl()
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 
672, in _rollback_impl
ERROR sqlalchemy.pool.QueuePool self._handle_dbapi_exception(e, None, 
None, None, None)
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 
1356, in _handle_dbapi_exception
ERROR sqlalchemy.pool.QueuePool util.raise_from_cause(newraise, 
exc_info)
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/util/compat.py", line 
200, in raise_from_cause
ERROR sqlalchemy.pool.QueuePool reraise(type(exception), exception, 
tb=exc_tb)
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/base.py", line 
670, in _rollback_impl
ERROR sqlalchemy.pool.QueuePool
 self.engine.dialect.do_rollback(self.connection)
ERROR sqlalchemy.pool.QueuePool   File "sqlalchemy/engine/default.py", line 
421, in do_rollback
ERROR sqlalchemy.pool.QueuePool dbapi_connection.rollback()
ERROR sqlalchemy.pool.QueuePool DBConnectionError: 
(psycopg2.OperationalError) connection pointer is NULL

With that exception, SQLAlchemy first gets into
`Connection._handle_dbapi_exception` (keyword "err connection pointer is
NULL" in the log) and then hits it again in `_finalize_fairy` (keyword
"Exception during reset or similar" in the log).



On Friday, May 11, 2018 at 12:21:42 AM UTC+8, Mike Bayer wrote:
>
> On Thu, May 10, 2018 at 7:23 AM,   
> wrote: 
> > Hi Mike, 
> > 
> > I added more logs and finnaly find the reason and a method to avoid the 
> > deadlock between coroutines. 
> > 
> > In summary, the deadlock happend because two different coroutines use 
> the 
> > same ConnectionRecord with same connection. 
> > It happends under QueuePool, when the DB connections are reset by the DB 
> > server(not related with the backend driver). 
> > 
> > 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a 
> > weakref of the ConnectionFairy, with callback function 
> `_finalize_fairy`. 
> > 2.When the connection get disconnected, it receives an 
> > exception(psycopg2.OperationalError), and goto 
> > Connection._handle_dbapi_exception. 
> > 3.In `_handle_dbapi_exception.finnaly`, it will call both 
> > `ConnectionFairy.invalidate` and `self.invalidate`. 
> > 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is 
> called, 
> > and put the ConnectionRecord back to the QueuePool. 
> > 5.In `self.invalidate`, `del self._root.__connection` is called, and 
> **del** 
> > the ConnectionFairy. 
> > 6.The **del** operation will make weakref of ConnectionFairy to dead, 
> and 
> > the callback `_finalize_fairy` executed. 
> > 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, 
> put 
> > the ConnectionRecord to the QueuePool again. 
>
> Here's what does not make sense.  You are claiming that a simple 
> invalidation will result in the same ConnectionRecord being checked in 
> twice.   This is obviously not the case, what is special in your 
> situation that makes this happen? 
>
> If I follow the steps you refer towards, in step 4, when 
> ConnectionFairy._checkin is called, that calls finalize_fairy 
> directly, which then calls ConnectionRecord.checkin().  Within 
> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None. 
>Then back in ConnectionFairy._checkin(), it also sets 
> self._connection_record = None. 
>
> If we come back into finalize_fairy a *second* time then, as a result 
> of the "del" which occurs in _handle_error(), and the 
> connection_record is present, the function will exit immediately, 
> because: 
>
>if connection_record.fairy_ref is not ref: 
> return 
>
> So the situation as given is not possible without concurrent access to 
> the ConnectionFairy being introduced externally to the pool. 
>
> There is exactly one thing I can think of that is both unique to the 
> psycopg2-style of asynchronous connection, not the PyMySQL form of it, 
> that might be important here, and it's that I think psycopg2's 
> connection.close() method will defer to another greenlet.  That means 
> when we're inside of 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
I tried to follow your log lines, which is difficult because I don't
know where those log statements actually are and I don't know what
acronyms like "jjs", "cjr", mean.   But it does look like it's al in
one thread (so not async GC) and all in one Nova request (so perhaps
just one greenlet involved, unless other greenlets are affecting this
without them being displayed in this log snippet).   I do see the
connection record doubled at the end.   So here is the illustration of
those steps in isolation, the connection record is not doubled.  Can
you alter this program to show this condition occurring?

from sqlalchemy import create_engine

e = create_engine("postgresql://scott:tiger@localhost/test", echo_pool='debug')
e.connect().close()

assert len(e.pool._pool.queue) == 1

c1 = e.connect()

assert len(e.pool._pool.queue) == 0

c1.engine.pool._invalidate(c1._Connection__connection, None)
c1.invalidate(e)

# connection record is back
assert len(e.pool._pool.queue) == 1

# do a reconnect
c1.connection

# uses the record again
assert len(e.pool._pool.queue) == 0

# close the Connection
c1.close()

# record is back
assert len(e.pool._pool.queue) == 1

# etc.
c1 = e.connect()
assert len(e.pool._pool.queue) == 0

c1.close()
assert len(e.pool._pool.queue) == 1
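
As a starting point, here is a rough, untested sketch (an assumed
adaptation, not code from this thread) that drives the same
invalidate/reconnect/close cycle from two eventlet greenthreads, to check
whether any ConnectionRecord ever shows up in the queue twice:

import eventlet
eventlet.monkey_patch()

from sqlalchemy import create_engine

e = create_engine("postgresql://scott:tiger@localhost/test", echo_pool='debug')

def invalidate_cycle():
    c = e.connect()
    e.pool._invalidate(c._Connection__connection, None)
    c.invalidate()
    c.connection   # reconnect
    c.close()

threads = [eventlet.spawn(invalidate_cycle) for _ in range(2)]
for gt in threads:
    gt.wait()

# with no bug present, the checked-in records are all distinct
assert len(set(e.pool._pool.queue)) == len(e.pool._pool.queue)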




On Thu, May 10, 2018 at 12:25 PM, Mike Bayer  wrote:
> Here's a print statement you might want to try:
>
> diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
> index 89a4cea7c..452db4883 100644
> --- a/lib/sqlalchemy/pool.py
> +++ b/lib/sqlalchemy/pool.py
> @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
>  been garbage collected.
>
>  """
> +print(
> +"FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: %s  "
> +"CONNECTION_REC: %s CONNECTION_REC REF %s" % (
> +"put greenlet id here",
> +connection,
> +fairy,
> +ref,
> +connection_record,
> +connection_record.fairy_ref
> +if connection_record is not None else "n/a"
> +)
> +)
> +
>  _refs.discard(connection_record)
>
>  if ref is not None:
>
>
>
>
>
> On Thu, May 10, 2018 at 12:21 PM, Mike Bayer  wrote:
>> On Thu, May 10, 2018 at 7:23 AM,   wrote:
>>> Hi Mike,
>>>
>>> I added more logs and finnaly find the reason and a method to avoid the
>>> deadlock between coroutines.
>>>
>>> In summary, the deadlock happend because two different coroutines use the
>>> same ConnectionRecord with same connection.
>>> It happends under QueuePool, when the DB connections are reset by the DB
>>> server(not related with the backend driver).
>>>
>>> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
>>> weakref of the ConnectionFairy, with callback function `_finalize_fairy`.
>>> 2.When the connection get disconnected, it receives an
>>> exception(psycopg2.OperationalError), and goto
>>> Connection._handle_dbapi_exception.
>>> 3.In `_handle_dbapi_exception.finnaly`, it will call both
>>> `ConnectionFairy.invalidate` and `self.invalidate`.
>>> 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called,
>>> and put the ConnectionRecord back to the QueuePool.
>>> 5.In `self.invalidate`, `del self._root.__connection` is called, and **del**
>>> the ConnectionFairy.
>>> 6.The **del** operation will make weakref of ConnectionFairy to dead, and
>>> the callback `_finalize_fairy` executed.
>>> 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, put
>>> the ConnectionRecord to the QueuePool again.
>>
>> Here's what does not make sense.  You are claiming that a simple
>> invalidation will result in the same ConnectionRecord being checked in
>> twice.   This is obviously not the case, what is special in your
>> situation that makes this happen?
>>
>> If I follow the steps you refer towards, in step 4, when
>> ConnectionFairy._checkin is called, that calls finalize_fairy
>> directly, which then calls ConnectionRecord.checkin().  Within
>> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None.
>>Then back in ConnectionFairy._checkin(), it also sets
>> self._connection_record = None.
>>
>> If we come back into finalize_fairy a *second* time then, as a result
>> of the "del" which occurs in _handle_error(), and the
>> connection_record is present, the function will exit immediately,
>> because:
>>
>>if connection_record.fairy_ref is not ref:
>> return
>>
>> So the situation as given is not possible without concurrent access to
>> the ConnectionFairy being introduced externally to the pool.
>>
>> There is exactly one thing I can think of that is both unique to the
>> psycopg2-style of asynchronous connection, not the PyMySQL form of it,
>> that might be important here, and it's that I think psycopg2's
>> connection.close() method will defer to another greenlet.  That means
>> when we're inside of 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
Here's a print statement you might want to try:

diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
index 89a4cea7c..452db4883 100644
--- a/lib/sqlalchemy/pool.py
+++ b/lib/sqlalchemy/pool.py
@@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
     been garbage collected.
 
     """
+    print(
+        "FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: %s  "
+        "CONNECTION_REC: %s CONNECTION_REC REF %s" % (
+            "put greenlet id here",
+            connection,
+            fairy,
+            ref,
+            connection_record,
+            connection_record.fairy_ref
+            if connection_record is not None else "n/a"
+        )
+    )
+
     _refs.discard(connection_record)
 
     if ref is not None:





On Thu, May 10, 2018 at 12:21 PM, Mike Bayer  wrote:
> On Thu, May 10, 2018 at 7:23 AM,   wrote:
>> Hi Mike,
>>
>> I added more logs and finnaly find the reason and a method to avoid the
>> deadlock between coroutines.
>>
>> In summary, the deadlock happend because two different coroutines use the
>> same ConnectionRecord with same connection.
>> It happends under QueuePool, when the DB connections are reset by the DB
>> server(not related with the backend driver).
>>
>> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
>> weakref of the ConnectionFairy, with callback function `_finalize_fairy`.
>> 2.When the connection get disconnected, it receives an
>> exception(psycopg2.OperationalError), and goto
>> Connection._handle_dbapi_exception.
>> 3.In `_handle_dbapi_exception.finnaly`, it will call both
>> `ConnectionFairy.invalidate` and `self.invalidate`.
>> 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called,
>> and put the ConnectionRecord back to the QueuePool.
>> 5.In `self.invalidate`, `del self._root.__connection` is called, and **del**
>> the ConnectionFairy.
>> 6.The **del** operation will make weakref of ConnectionFairy to dead, and
>> the callback `_finalize_fairy` executed.
>> 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, put
>> the ConnectionRecord to the QueuePool again.
>
> Here's what does not make sense.  You are claiming that a simple
> invalidation will result in the same ConnectionRecord being checked in
> twice.   This is obviously not the case, what is special in your
> situation that makes this happen?
>
> If I follow the steps you refer towards, in step 4, when
> ConnectionFairy._checkin is called, that calls finalize_fairy
> directly, which then calls ConnectionRecord.checkin().  Within
> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None.
>Then back in ConnectionFairy._checkin(), it also sets
> self._connection_record = None.
>
> If we come back into finalize_fairy a *second* time then, as a result
> of the "del" which occurs in _handle_error(), and the
> connection_record is present, the function will exit immediately,
> because:
>
>if connection_record.fairy_ref is not ref:
> return
>
> So the situation as given is not possible without concurrent access to
> the ConnectionFairy being introduced externally to the pool.
>
> There is exactly one thing I can think of that is both unique to the
> psycopg2-style of asynchronous connection, not the PyMySQL form of it,
> that might be important here, and it's that I think psycopg2's
> connection.close() method will defer to another greenlet.  That means
> when we're inside of ConnectionRecord.invalidate(), and we call
> self.__close(), that could be deferring to another greenlet.   But
> this all happens before the connection is returned to the pool, so I
> still don't know what is happening.
>
>>
>> The logs in detail is below. To prove my result is right, I tried to avoid
>> the scienario with this patch, and it worked.
>>
>> def checkin(self):
>> self.fairy_ref = None
>> connection = self.connection
>> pool = self.__pool
>>
>> while self.finalize_callback:
>> finalizer = self.finalize_callback.pop()
>> finalizer(connection)
>> +if self in pool._pool.queue:
>> +pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
>> +return
>
> as mentioned above it's not possible to reach this point without
> concurrent access to the pooled connection, which means this fix could
> fail also (with at least threaded-style concurrency, not greenlet
> style).   The original source of the greenlet issue must still be
> identified.
>
> I would need a self-contained proof of concept that demonstrates this
> issue occurring.  The log output is not that helpful because it at
> least would need to write out the identifier of the current greenlet,
> since it is very likely that greenlets have something to do with this
> issue.
>
>
>
>>
>> if pool.dispatch.checkin:
>> pool.dispatch.checkin(connection, self)
>> pool._return_conn(self)
>>
>> Trace the 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
On Thu, May 10, 2018 at 7:23 AM,   wrote:
> Hi Mike,
>
> I added more logs and finnaly find the reason and a method to avoid the
> deadlock between coroutines.
>
> In summary, the deadlock happend because two different coroutines use the
> same ConnectionRecord with same connection.
> It happends under QueuePool, when the DB connections are reset by the DB
> server(not related with the backend driver).
>
> 1.A coroutine get a ConnectionRecord into ConnectionFairy, and apply a
> weakref of the ConnectionFairy, with callback function `_finalize_fairy`.
> 2.When the connection get disconnected, it receives an
> exception(psycopg2.OperationalError), and goto
> Connection._handle_dbapi_exception.
> 3.In `_handle_dbapi_exception.finnaly`, it will call both
> `ConnectionFairy.invalidate` and `self.invalidate`.
> 4.In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called,
> and put the ConnectionRecord back to the QueuePool.
> 5.In `self.invalidate`, `del self._root.__connection` is called, and **del**
> the ConnectionFairy.
> 6.The **del** operation will make weakref of ConnectionFairy to dead, and
> the callback `_finalize_fairy` executed.
> 7.In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, put
> the ConnectionRecord to the QueuePool again.

Here's what does not make sense.  You are claiming that a simple
invalidation will result in the same ConnectionRecord being checked in
twice.   This is obviously not the case; what is special about your
situation that makes this happen?

If I follow the steps you refer to, in step 4, when
ConnectionFairy._checkin is called, that calls finalize_fairy
directly, which then calls ConnectionRecord.checkin().  Within
ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None.
   Then back in ConnectionFairy._checkin(), it also sets
self._connection_record = None.

If we come back into finalize_fairy a *second* time then, as a result
of the "del" which occurs in _handle_error(), and the
connection_record is present, the function will exit immediately,
because:

    if connection_record.fairy_ref is not ref:
        return

So the situation as given is not possible without concurrent access to
the ConnectionFairy being introduced externally to the pool.

There is exactly one thing I can think of that is both unique to the
psycopg2-style of asynchronous connection, not the PyMySQL form of it,
that might be important here, and it's that I think psycopg2's
connection.close() method will defer to another greenlet.  That means
when we're inside of ConnectionRecord.invalidate(), and we call
self.__close(), that could be deferring to another greenlet.   But
this all happens before the connection is returned to the pool, so I
still don't know what is happening.
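
For context, this is roughly how psycopg2 is typically put into
cooperative "green" mode under eventlet; the use of the psycogreen package
here is an assumption about the environment, not something confirmed in
this thread.  With the wait callback installed, psycopg2's blocking
operations (connect, execute, rollback) become points where control can
pass to another greenlet:

import eventlet
eventlet.monkey_patch()

from psycogreen.eventlet import patch_psycopg
patch_psycopg()  # installs a psycopg2 wait callback that yields to the eventlet hub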

>
> The logs in detail is below. To prove my result is right, I tried to avoid
> the scienario with this patch, and it worked.
>
> def checkin(self):
> self.fairy_ref = None
> connection = self.connection
> pool = self.__pool
>
> while self.finalize_callback:
> finalizer = self.finalize_callback.pop()
> finalizer(connection)
> +if self in pool._pool.queue:
> +pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
> +return

as mentioned above it's not possible to reach this point without
concurrent access to the pooled connection, which means this fix could
fail also (with at least threaded-style concurrency, not greenlet
style).   The original source of the greenlet issue must still be
identified.

I would need a self-contained proof of concept that demonstrates this
issue occurring.  The log output is not that helpful because it at
least would need to write out the identifier of the current greenlet,
since it is very likely that greenlets have something to do with this
issue.
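
A small sketch of how the current greenlet could be identified for those
log lines (an illustrative helper, not code from this thread):

import greenlet

def current_greenlet_id():
    # id() of the current greenlet object is enough to tell coroutines apart
    return id(greenlet.getcurrent())

# e.g. substitute current_greenlet_id() for "put greenlet id here"
# in the print statement shown earlier.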



>
> if pool.dispatch.checkin:
> pool.dispatch.checkin(connection, self)
> pool._return_conn(self)
>
> Trace the ConnectionRecord 0x88fc5d0
>
> 2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528]
> [MainThread] [tid:122769072] [log.py:104 debug]
> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection
> 
> 2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528]
> [MainThread] [tid:122769072] [log.py:104 debug]
> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success  object at 0x81f69b0; closed: 0>, rec.connection  0x81f69b0; closed: 0>, pool queue deque([ object at 0x7a04810>,  0x79fdad0>])
> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528]
> [MainThread] [tid:122769072] [log.py:104 debug]
> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy
>  rec
>  pool queue
> deque([,
> ])
> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528]
> [MainThread] [tid:122769072] [log.py:104 debug]
> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection  0x81f69b0; closed: 0> checked out from pool, status 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread jiajunsu . zju
Hi Mike,

I added more logs and finally found the reason, and a method to avoid the
deadlock between coroutines.

In summary, the deadlock happened because two different coroutines used the
same ConnectionRecord with the same connection.
It happens under QueuePool when the DB connections are reset by the DB
server (not related to the backend driver).

1. A coroutine gets a ConnectionRecord into a ConnectionFairy, and a
weakref to the ConnectionFairy is created with the callback function
`_finalize_fairy`.
2. When the connection gets disconnected, it receives an
exception (psycopg2.OperationalError) and goes into
Connection._handle_dbapi_exception.
3. In the `finally` block of `_handle_dbapi_exception`, it calls both
`ConnectionFairy.invalidate` and `self.invalidate`.
4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called,
which puts the ConnectionRecord back into the QueuePool.
5. In `self.invalidate`, `del self._root.__connection` is called, which
**del**s the ConnectionFairy.
6. The **del** operation makes the weakref to the ConnectionFairy dead, so
the callback `_finalize_fairy` is executed (see the small weakref
illustration after this list).
7. In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called,
putting the ConnectionRecord into the QueuePool again.
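
Step 6 relies on standard weakref behavior: dropping the last strong
reference fires the weakref callback synchronously, in the same thread or
greenlet.  A tiny stand-alone illustration (plain Python, not SQLAlchemy
code):

import weakref

class Fairy(object):
    pass

def finalize(ref):
    # runs as soon as the last strong reference to the Fairy disappears
    print("weakref callback fired for %r" % ref)

fairy = Fairy()
ref = weakref.ref(fairy, finalize)
del fairy  # the callback executes here, before the next statement runs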

The detailed logs are below. To prove this conclusion is right, I tried to
avoid the scenario with this patch, and it worked.

def checkin(self):
    self.fairy_ref = None
    connection = self.connection
    pool = self.__pool

    while self.finalize_callback:
        finalizer = self.finalize_callback.pop()
        finalizer(connection)
+   if self in pool._pool.queue:
+       pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
+       return

    if pool.dispatch.checkin:
        pool.dispatch.checkin(connection, self)
    pool._return_conn(self)

Trace the ConnectionRecord 0x88fc5d0

2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection 
 
2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success , rec.connection , pool queue deque([, ]) 
2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy 
 rec 
 pool queue 
deque([, 
]) 
2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection  checked out from pool, status Pool size: 10  
Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current Checked 
out connections: 8, queue deque([, ]) 
2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkout 
conn , rec 
, rec.conn 
 
2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn 
, err connection 
pointer is NULL
2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise
2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally
>>Connection._handle_dbapi_exception --> finally --> 
**self.engine.pool._invalidate** & **self.invalidate**
>>self.engine.pool._invalidate --> Pool._invalidate --> 
ConnectionFairy.invalidate --> _ConnectionRecord.invalidate & 
**self._checkin()**
>>self.invalidate() --> **del self._root.__connection**
2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy invalidate 
2018-05-10T14:02:49.718+08:00 localhost nova-conductor INFO [pid:4528] 
[MainThread] [tid:122769072] [log.py:109 info] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection 
 (reason: 
OperationalError:connection pointer is NULL
>>_ConnectionRecord.invalidate(e)  --> 
self.__pool.dispatch.invalidate(self.connection, self, e) & self.__close() 
& self.connection = None
>>self.__close() --> self.finalize_callback.clear() & 
self.__pool._close_connection(self.connection)
2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection  
>>Pool._close_connection(connection) --> self._dialect.do_close(connection) 
--> _ConnDialect.do_close(dbapi_connection) --> dbapi_connection.close() FIN
2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
I found out why we could get the same connection twice from the pool...

Because we put it into the pool twice.

Any suggestions for finding out why the same connection was put into the
pool from two different coroutines?


2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755] 
[MainThread] [tid:187954992] [log.py:104 debug] 
[req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin  
2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG [pid:15755] 
[MainThread] [tid:178645264] [log.py:104 debug] 
[req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin  
...
2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG [pid:15755] 
[MainThread] [tid:309091856] [log.py:104 debug] 
[req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection  checked out from pool 
2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG [pid:15755] 
[MainThread] [tid:178645104] [log.py:104 debug] 
[req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection  checked out from pool 


On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, jiajun...@gmail.com wrote:
>
> I added logs in sqlalchemy/pool.py
>
> QueuePool.checkout
>
> @classmethod
> def checkout(cls, pool):
>     rec = pool._do_get()
>     try:
>         dbapi_connection = rec.get_connection()
>         pool.logger.debug("jjs get conn success %s", dbapi_connection)
>     except:
>         pool.logger.debug("jjs get error in get_connection")
>         with util.safe_reraise():
>             rec.checkin()
>     echo = pool._should_log_debug()
>     fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>     rec.fairy_ref = weakref.ref(
>         fairy,
>         lambda ref: _finalize_fairy and
>         _finalize_fairy(
>             dbapi_connection,
>             rec, pool, ref, echo)
>     )
>     _refs.add(rec)
>     if echo:
>         pool.logger.debug("jjs conn %s, pool status %s", dbapi_connection,
>                           pool.status())
>         pool.logger.debug("Connection %r checked out from pool",
>                           dbapi_connection)
>     return fairy
>
> Then I got logs below:
>
> coroutine A:
>
> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] 
> [MainThread] [tid:197123760] [log.py:109 info] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 
> 30
> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection 
>  
> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success  object at 0x903f8c0; closed: 0> 
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn  0x903f8c0;  closed: 0>, pool status Pool size: 10  Connections in pool: 1 
> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45 
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection  0x903f8c0; closed: 0> checked out from pool 
>
> coroutine B:
>
> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] 
> [MainThread] [tid:151519312] [log.py:109 info] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 
> 30
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success  object at 0x903f8c0;  closed: 0> 
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn  0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 0 
> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46 
> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection  0x903f8c0; closed: 0> checked out from pool 
>
>
> It seems that when we create_connection in QueuePool._do_get, we may not 
> get the lock _pool.mutex(~Queue.mutex).
>
>
>
> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>
>>
>>
>> On Mon, May 7, 2018, 7:01 AM  wrote:
>>
>>> We added coroutine_id in psycopg2, and found that two coroutine use the 
>>> same connection before the pthread_mutex_lock released.
>>>
>>> Maybe something in the connection pool goes wrong?
>>> [pid] [coroutine_id] msg
>>>
>>
>>
>> This seems unlikely.   Openstack uses the SQLAlchemy connection pool with 
>> eventlet patching in conjunction with the MySQL drivers and nothing like 
>> that happens.   The biggest issue SQLAlchemy had with greenlets is that a 
>> timed out 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
It's not a problem with the lock. I tried holding _pool.mutex in 
create_connection, but it didn't help.

Maybe we create the connection and put it into the pool, but then use it 
without getting it out?

On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, jiajun...@gmail.com wrote:
>
> I added logs in sqlalchemy/pool.py
>
> QueuePool.checkout
>
> @classmethod
> def checkout(cls, pool):
>     rec = pool._do_get()
>     try:
>         dbapi_connection = rec.get_connection()
>         pool.logger.debug("jjs get conn success %s", dbapi_connection)
>     except:
>         pool.logger.debug("jjs get error in get_connection")
>         with util.safe_reraise():
>             rec.checkin()
>     echo = pool._should_log_debug()
>     fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>     rec.fairy_ref = weakref.ref(
>         fairy,
>         lambda ref: _finalize_fairy and
>         _finalize_fairy(
>             dbapi_connection,
>             rec, pool, ref, echo)
>     )
>     _refs.add(rec)
>     if echo:
>         pool.logger.debug("jjs conn %s, pool status %s", dbapi_connection,
>                           pool.status())
>         pool.logger.debug("Connection %r checked out from pool",
>                           dbapi_connection)
>     return fairy
>
> Then I got logs below:
>
> coroutine A:
>
> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] 
> [MainThread] [tid:197123760] [log.py:109 info] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 
> 30
> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection 
>  
> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success  object at 0x903f8c0; closed: 0> 
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn  0x903f8c0;  closed: 0>, pool status Pool size: 10  Connections in pool: 1 
> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45 
> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:197123760] [log.py:104 debug] 
> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection  0x903f8c0; closed: 0> checked out from pool 
>
> coroutine B:
>
> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] 
> [MainThread] [tid:151519312] [log.py:109 info] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 
> 30
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success  object at 0x903f8c0;  closed: 0> 
> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn  0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 0 
> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46 
> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] 
> [MainThread] [tid:151519312] [log.py:104 debug] 
> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection  0x903f8c0; closed: 0> checked out from pool 
>
>
> It seems that when we create_connection in QueuePool._do_get, we may not 
> get the lock _pool.mutex(~Queue.mutex).
>
>
>
> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>
>>
>>
>> On Mon, May 7, 2018, 7:01 AM  wrote:
>>
>>> We added coroutine_id in psycopg2, and found that two coroutine use the 
>>> same connection before the pthread_mutex_lock released.
>>>
>>> Maybe something in the connection pool goes wrong?
>>> [pid] [coroutine_id] msg
>>>
>>
>>
>> This seems unlikely.   Openstack uses the SQLAlchemy connection pool with 
>> eventlet patching in conjunction with the MySQL drivers and nothing like 
>> that happens.   The biggest issue SQLAlchemy had with greenlets is that a 
>> timed out greenlet would throw GreenletExit and cause a dirty connection to 
>> go back into the pool, that was fixed in 1.2.  
>>
>>
>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, 
>>> cb 0x23ad320
>>>
>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>>
>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>>
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>> async_cursor 0x881ac00
>>>
>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>> async_cursor 0x881ac00
>>>
>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
>>> status = 2
>>>
>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>
>>> Below is the logs grep by coroutine id:
>>>
>>> [49174] 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
I added logs in sqlalchemy/pool.py

QueuePool.checkout

@classmethod
def checkout(cls, pool):
    rec = pool._do_get()
    try:
        dbapi_connection = rec.get_connection()
        pool.logger.debug("jjs get conn success %s", dbapi_connection)
    except:
        pool.logger.debug("jjs get error in get_connection")
        with util.safe_reraise():
            rec.checkin()
    echo = pool._should_log_debug()
    fairy = _ConnectionFairy(dbapi_connection, rec, echo)
    rec.fairy_ref = weakref.ref(
        fairy,
        lambda ref: _finalize_fairy and
        _finalize_fairy(
            dbapi_connection,
            rec, pool, ref, echo)
    )
    _refs.add(rec)
    if echo:
        pool.logger.debug("jjs conn %s, pool status %s", dbapi_connection,
                          pool.status())
        pool.logger.debug("Connection %r checked out from pool",
                          dbapi_connection)
    return fairy

Then I got logs below:

coroutine A:

2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] 
[MainThread] [tid:197123760] [log.py:109 info] 
[req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 
30
2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:197123760] [log.py:104 debug] 
[req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection 
 
2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:197123760] [log.py:104 debug] 
[req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success  
2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:197123760] [log.py:104 debug] 
[req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn , pool status Pool size: 10  Connections in pool: 1 
Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45 
2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:197123760] [log.py:104 debug] 
[req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection  checked out from pool 

coroutine B:

2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] 
[MainThread] [tid:151519312] [log.py:109 info] 
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 
30
2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:151519312] [log.py:104 debug] 
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success  
2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:151519312] [log.py:104 debug] 
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn , pool status Pool size: 10  Connections in pool: 0 
Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46 
2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] 
[MainThread] [tid:151519312] [log.py:104 debug] 
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection  checked out from pool 


It seems that when we call create_connection in QueuePool._do_get, we may not 
be holding the lock _pool.mutex (i.e. Queue.mutex).
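
A quick way to check that suspicion in isolation, using only the public QueuePool API (the fake DBAPI connection below is my own stand-in): new connections are indeed created without the queue's mutex held, yet each checkout still returns a distinct connection, so the missing mutex by itself doesn't explain two coroutines sharing one.

import itertools

from sqlalchemy.pool import QueuePool

_counter = itertools.count()


class FakeDBAPIConnection(object):
    def __init__(self):
        self.id = next(_counter)

    def rollback(self):
        pass

    def close(self):
        pass


pool = QueuePool(FakeDBAPIConnection, pool_size=1, max_overflow=5)

c1 = pool.connect()
c2 = pool.connect()          # created while c1 is still checked out
print("%s %s" % (c1.connection.id, c2.connection.id))   # two distinct ids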



On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>
>
>
> On Mon, May 7, 2018, 7:01 AM  wrote:
>
>> We added coroutine_id in psycopg2, and found that two coroutine use the 
>> same connection before the pthread_mutex_lock released.
>>
>> Maybe something in the connection pool goes wrong?
>> [pid] [coroutine_id] msg
>>
>
>
> This seems unlikely.   Openstack uses the SQLAlchemy connection pool with 
> eventlet patching in conjunction with the MySQL drivers and nothing like 
> that happens.   The biggest issue SQLAlchemy had with greenlets is that a 
> timed out greenlet would throw GreenletExit and cause a dirty connection to 
> go back into the pool, that was fixed in 1.2.  
>
>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
>> status = 2
>>
>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>
>> Below is the logs grep by coroutine id:
>>
>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>
>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: poll writing
>>
>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>
>> -
>>
>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection 
>> at 0x8de2d30
>>

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread Mike Bayer
just to note, there's definitely a bug somewhere, and if the bug is
found there's a 90% chance it involves code that I wrote at some point
:)  (though likely some assumption that the PG /async stuff does not
meet)  but this is part of oslo / openstack.  I work on oslo /
openstack, but that's my employment and the Postgresql space is
something we as a community decided we didn't have the resources to
work on.  The community certainly welcomes people to work on the
Postgresql side of things, though.   But it depends on your
priorities.  If you really need to scale in production and not get
into full blown development, the majority of the community is still on
MySQL and Galera.  You also might consider picking up this thread on
the openstack-dev list where there may be other Postgresql-minded
developers with insights.



On Mon, May 7, 2018 at 8:40 AM, Mike Bayer  wrote:
>
>
> On Mon, May 7, 2018, 7:01 AM  wrote:
>>
>> We added coroutine_id in psycopg2, and found that two coroutine use the
>> same connection before the pthread_mutex_lock released.
>>
>> Maybe something in the connection pool goes wrong?
>> [pid] [coroutine_id] msg
>
>
>
> This seems unlikely.   Openstack uses the SQLAlchemy connection pool with
> eventlet patching in conjunction with the MySQL drivers and nothing like
> that happens.   The biggest issue SQLAlchemy had with greenlets is that a
> timed out greenlet would throw GreenletExit and cause a dirty connection to
> go back into the pool, that was fixed in 1.2.
>
>>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>> status = 2
>>
>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>
>>
>> Below is the logs grep by coroutine id:
>>
>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>
>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: poll writing
>>
>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>
>> -
>>
>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at
>> 0x8de2d30
>>
>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>
>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn =
>> 0x8de2d30
>>
>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt =
>> 1
>>
>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>> status = 2
>>
>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>
>>
>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, jiajun...@gmail.com wrote:
>>>
>>> We reproduced this problem and added logs in psycopg2, found a confusing
>>> thing and reported it to psycopg2.
>>>
>>> It seems sqlalchemy & psycopg2 stack flow is:
>>> 1.from sqlalchemy to psycopg: pq_execute/pq_commit
>>> 2.from psycopg to sqlalchemy PyWeakref_NewRef(conn)
>>> 3.get exception in sqlalchemy, and do_rollback
>>> 4.from sqlalchemy to psycopg: pq_abort
>>> 5.psycopg get pthread_mutex_lock deadlock
>>>
>>> What we have done to reproduce it is: stop master pg-server and promote
>>> the slave pg-server to master, with moving FIP from old master to slave. At
>>> the same time, let nova-conductor do quite a lot db query requests.
>>>
>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>>
>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:

 On Mon, Apr 23, 2018 at 9:03 AM,   wrote:
 > Sorry for reply on this topic.
 >
 > We recently get the same problem in our production environment.
 >
 > I found a patch in other lib [1], and they added conn.close() in
 > exception
 > psycopg2.ProgrammingError.
 >
 > Shall we do the same in [2] ?

 SQLAlchemy does things much more carefully than that, we parse the
 error message for specific ones that correspond to "connection is no
 longer usable", we call these "is_disconnect" but it can be any
 invalidating condition.

 You can make these yourself, and they can also be made to 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread Mike Bayer
On Mon, May 7, 2018, 7:01 AM  wrote:

> We added coroutine_id in psycopg2, and found that two coroutine use the
> same connection before the pthread_mutex_lock released.
>
> Maybe something in the connection pool goes wrong?
> [pid] [coroutine_id] msg
>


This seems unlikely.   Openstack uses the SQLAlchemy connection pool with
eventlet patching in conjunction with the MySQL drivers and nothing like
that happens.   The biggest issue SQLAlchemy had with greenlets is that a
timed out greenlet would throw GreenletExit and cause a dirty connection to
go back into the pool, that was fixed in 1.2.
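
For readers following along, a tiny illustration of why that greenlet timeout was so nasty (my own sketch, not SQLAlchemy code): GreenletExit derives from BaseException, so an "except Exception" cleanup path never sees it, and the connection could go back to the pool with its transaction still open.

class GreenletExit(BaseException):     # stand-in for greenlet.GreenletExit
    pass


def run_query():
    raise GreenletExit()               # greenlet killed mid-query


try:
    try:
        run_query()
    except Exception:                  # rollback-on-error handler never runs
        print("rolled back")
except BaseException as exc:
    print("escaped the Exception handler: %r" % exc)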


> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
> 0x23ad320
>
> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>
> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
> async_cursor 0x881ac00
>
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
> async_cursor 0x881ac00
>
> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
> status = 2
>
> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>
> Below is the logs grep by coroutine id:
>
> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>
> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>
> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
> 0x23ad320
>
> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>
> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>
> [49174] [0xa5db730]conn_poll: poll writing
>
> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>
> -
>
> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at
> 0x8de2d30
>
> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>
> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn =
> 0x8de2d30
>
> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt =
> 1
>
> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
>
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
> async_cursor 0x881ac00
>
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
> async_cursor 0x881ac00
>
> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
> status = 2
>
> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>
> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, jiajun...@gmail.com wrote:
>>
>> We reproduced this problem and added logs in psycopg2, found a confusing
>> thing and reported it to psycopg2.
>>
>> It seems sqlalchemy & psycopg2 stack flow is:
>> 1.from sqlalchemy to psycopg: pq_execute/pq_commit
>> 2.from psycopg to sqlalchemy PyWeakref_NewRef(conn)
>> 3.get exception in sqlalchemy, and do_rollback
>> 4.from sqlalchemy to psycopg: pq_abort
>> 5.psycopg get pthread_mutex_lock deadlock
>>
>> What we have done to reproduce it is: stop master pg-server and promote
>> the slave pg-server to master, with moving FIP from old master to slave. At
>> the same time, let nova-conductor do quite a lot db query requests.
>>
>> [1] https://github.com/psycopg/psycopg2/issues/703
>>
>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>>
>>> On Mon, Apr 23, 2018 at 9:03 AM,   wrote:
>>> > Sorry for reply on this topic.
>>> >
>>> > We recently get the same problem in our production environment.
>>> >
>>> > I found a patch in other lib [1], and they added conn.close() in
>>> exception
>>> > psycopg2.ProgrammingError.
>>> >
>>> > Shall we do the same in [2] ?
>>>
>>> SQLAlchemy does things much more carefully than that, we parse the
>>> error message for specific ones that correspond to "connection is no
>>> longer usable", we call these "is_disconnect" but it can be any
>>> invalidating condition.
>>>
>>> You can make these yourself, and they can also be made to be part of
>>> oslo.db, using the handle_error event:
>>>
>>>
>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>>
>>> within oslo.db you would want to propose a change here:
>>>
>>>
>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>>
>>>
>>>
>>>
>>>
>>> >
>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>>> >
>>> > [2]
>>> >
>>> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>> >
>>> >
>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>> >>
>>> >> Hi  Michael ,
>>> >>
>>> >> I am using openstack with postgresql which sqlalchemy and oslo_db
>>> module
>>> >> were used.
>>> >> And there are some problems after my pg database switched over.
>>> >>
>>> >> Here is my switch over process:
>>> >> 1. nova-conductor(python application) is running with DB connection
>>> >> strings point to vip , which is in primary site(A) of pg.
>>> >> 2. switch VIP 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread jiajunsu . zju
We added a coroutine_id to the psycopg2 logs, and found that two coroutines 
use the same connection before the pthread_mutex_lock is released.

Maybe something in the connection pool has gone wrong?
Log format: [pid] [coroutine_id] msg

[49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 
0x23ad320

[49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0

[49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
async_cursor 0x881ac00

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
async_cursor 0x881ac00

[49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
status = 2

[49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1

Below are the logs grepped by coroutine id:

[49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0

[49174] [0xa5db730]before have_wait_callback conn 0x94122f0

[49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 
0x23ad320

[49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0

[49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0

[49174] [0xa5db730]conn_poll: poll writing

[49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ

-

[49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 
0x8de2d30

[49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650

[49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30

[49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1

[49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
async_cursor 0x881ac00

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
async_cursor 0x881ac00

[49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
status = 2

[49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1

On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, jiajun...@gmail.com wrote:
>
> We reproduced this problem and added logs in psycopg2, found a confusing 
> thing and reported it to psycopg2.
>
> It seems sqlalchemy & psycopg2 stack flow is:
> 1.from sqlalchemy to psycopg: pq_execute/pq_commit 
> 2.from psycopg to sqlalchemy PyWeakref_NewRef(conn)
> 3.get exception in sqlalchemy, and do_rollback
> 4.from sqlalchemy to psycopg: pq_abort
> 5.psycopg get pthread_mutex_lock deadlock
>
> What we have done to reproduce it is: stop master pg-server and promote 
> the slave pg-server to master, with moving FIP from old master to slave. At 
> the same time, let nova-conductor do quite a lot db query requests.
>
> [1] https://github.com/psycopg/psycopg2/issues/703
>
> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>
>> On Mon, Apr 23, 2018 at 9:03 AM,   wrote: 
>> > Sorry for reply on this topic. 
>> > 
>> > We recently get the same problem in our production environment. 
>> > 
>> > I found a patch in other lib [1], and they added conn.close() in 
>> exception 
>> > psycopg2.ProgrammingError. 
>> > 
>> > Shall we do the same in [2] ? 
>>
>> SQLAlchemy does things much more carefully than that, we parse the 
>> error message for specific ones that correspond to "connection is no 
>> longer usable", we call these "is_disconnect" but it can be any 
>> invalidating condition. 
>>
>> You can make these yourself, and they can also be made to be part of 
>> oslo.db, using the handle_error event: 
>>
>>
>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>  
>>
>> within oslo.db you would want to propose a change here: 
>>
>>
>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>  
>>
>>
>>
>>
>>
>> > 
>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split 
>> > 
>> > [2] 
>> > 
>> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>  
>> > 
>> > 
>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>> >> 
>> >> Hi  Michael , 
>> >> 
>> >> I am using openstack with postgresql which sqlalchemy and oslo_db 
>> module 
>> >> were used. 
>> >> And there are some problems after my pg database switched over. 
>> >> 
>> >> Here is my switch over process: 
>> >> 1. nova-conductor(python application) is running with DB connection 
>> >> strings point to vip , which is in primary site(A) of pg. 
>> >> 2. switch VIP from primary(A) to new primary(B) 
>> >> 3. switch over pg: shutdown primary(A), promopt standby(B) to new 
>> primary. 
>> >> 4. nova-conductor is running in the whole process. 
>> >> 
>> >> After some seconds, I found some nova-conductor processes are hang 
>> with 
>> >> status futex_wait_queue_me, and the status of the query in DB is "idle 
>> in 
>> >> transaction", the transaction was not commited or rollbacked! 
>> >> I think disconnection was handled in the oslo_db, which will send a 
>> >> ping(select 1) to DB. 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-04-28 Thread jiajunsu . zju
We reproduced this problem, added logs in psycopg2, found a confusing thing, 
and reported it to psycopg2 [1].

The sqlalchemy & psycopg2 call flow seems to be:
1. from sqlalchemy to psycopg: pq_execute/pq_commit
2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn)
3. an exception is raised in sqlalchemy, which calls do_rollback
4. from sqlalchemy to psycopg: pq_abort
5. psycopg deadlocks on pthread_mutex_lock

What we did to reproduce it: stop the master pg-server and promote the slave 
pg-server to master, moving the FIP from the old master to the slave. At the 
same time, have nova-conductor issue a large number of DB query requests.

[1] https://github.com/psycopg/psycopg2/issues/703
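
A toy analogy for step 5 in the flow above (plain Python, not psycopg2's C code): the connection guards itself with a non-reentrant mutex, so if the query from step 1 is interrupted while that mutex is held, the rollback issued via steps 3-4 tries to take the same mutex and can never get it.

import threading

conn_mutex = threading.Lock()   # stand-in for the connection's pthread mutex

conn_mutex.acquire()            # pq_execute took the mutex, then was interrupted


def pq_abort():
    # the rollback path tries to take the same, still-held mutex
    if not conn_mutex.acquire(False):
        print("pq_abort would block forever: mutex still held by the aborted query")


pq_abort()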

On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>
> On Mon, Apr 23, 2018 at 9:03 AM,   
> wrote: 
> > Sorry for reply on this topic. 
> > 
> > We recently get the same problem in our production environment. 
> > 
> > I found a patch in other lib [1], and they added conn.close() in 
> exception 
> > psycopg2.ProgrammingError. 
> > 
> > Shall we do the same in [2] ? 
>
> SQLAlchemy does things much more carefully than that, we parse the 
> error message for specific ones that correspond to "connection is no 
> longer usable", we call these "is_disconnect" but it can be any 
> invalidating condition. 
>
> You can make these yourself, and they can also be made to be part of 
> oslo.db, using the handle_error event: 
>
>
> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>  
>
> within oslo.db you would want to propose a change here: 
>
>
> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>  
>
>
>
>
>
> > 
> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split 
> > 
> > [2] 
> > 
> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>  
> > 
> > 
> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
> >> 
> >> Hi  Michael , 
> >> 
> >> I am using openstack with postgresql which sqlalchemy and oslo_db 
> module 
> >> were used. 
> >> And there are some problems after my pg database switched over. 
> >> 
> >> Here is my switch over process: 
> >> 1. nova-conductor(python application) is running with DB connection 
> >> strings point to vip , which is in primary site(A) of pg. 
> >> 2. switch VIP from primary(A) to new primary(B) 
> >> 3. switch over pg: shutdown primary(A), promopt standby(B) to new 
> primary. 
> >> 4. nova-conductor is running in the whole process. 
> >> 
> >> After some seconds, I found some nova-conductor processes are hang with 
> >> status futex_wait_queue_me, and the status of the query in DB is "idle 
> in 
> >> transaction", the transaction was not commited or rollbacked! 
> >> I think disconnection was handled in the oslo_db, which will send a 
> >> ping(select 1) to DB. 
> >> 
> >> If DB was switchd over, the connection in the pool would be set with 
> >> status invalid, and reconnect after next check out. 
> >> 
> >> ###error messages from nova-conductor 
> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] 
> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError] 
> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped 
> from 
> >> (psycopg2.ProgrammingError) execute cannot be used while an 
> asynchronous 
> >> query is underway [SQL: 'SELECT 1'] 
> >>  36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call 
> >> last): 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File 
> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in 
> >> _execute_context 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters context) 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File 
> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in 
> >> do_execute 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters 
> cursor.execute(statement, 
> >> parameters) 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute 
> >> cannot be used while an asynchronous query is underway 
> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters 
> >>  localhost nova-conductor ERROR [pid:36365] [MainThread] 
> [tid:122397712] 
> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error 
> closing 
> >> cursor 
> >>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call 
> last): 
> >>   36365 ERROR sqlalchemy.pool.QueuePool   File 
> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in 
> >> _safe_close_cursor 
> >>   36365 ERROR sqlalchemy.pool.QueuePool cursor.close() 
> >>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot 
> be 
> >> used while an asynchronous query is underway 
> >>   36365 ERROR sqlalchemy.pool.QueuePool 
> >> 
> >> ###ps status of nova-conductor 
> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat 
> >> /proc/33316/stack 
> >> [] 

Re: [sqlalchemy] Re: long transaction after database switched over

2018-04-23 Thread Mike Bayer
On Mon, Apr 23, 2018 at 9:03 AM,   wrote:
> Sorry for reply on this topic.
>
> We recently get the same problem in our production environment.
>
> I found a patch in other lib [1], and they added conn.close() in exception
> psycopg2.ProgrammingError.
>
> Shall we do the same in [2] ?

SQLAlchemy does things much more carefully than that, we parse the
error message for specific ones that correspond to "connection is no
longer usable", we call these "is_disconnect" but it can be any
invalidating condition.

You can make these yourself, and they can also be made to be part of
oslo.db, using the handle_error event:

http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error

within oslo.db you would want to propose a change here:

https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
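
For reference, a minimal sketch of such a handler (my own example, not the oslo.db filter itself; the engine URL and the matched message are placeholders): it flags the "connection pointer is NULL" failure as a disconnect so the pool invalidates that connection instead of reusing it.

from sqlalchemy import create_engine, event

# placeholder URL; substitute the real connection string
engine = create_engine("postgresql+psycopg2://user:password@vip/nova")


@event.listens_for(engine, "handle_error")
def mark_null_pointer_as_disconnect(context):
    if "connection pointer is NULL" in str(context.original_exception):
        # ask SQLAlchemy to treat this as a disconnect so the pooled
        # connection gets invalidated rather than handed out again
        context.is_disconnect = True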





>
> [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>
> [2]
> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>
>
> On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>
>> Hi  Michael ,
>>
>> I am using openstack with postgresql which sqlalchemy and oslo_db module
>> were used.
>> And there are some problems after my pg database switched over.
>>
>> Here is my switch over process:
>> 1. nova-conductor(python application) is running with DB connection
>> strings point to vip , which is in primary site(A) of pg.
>> 2. switch VIP from primary(A) to new primary(B)
>> 3. switch over pg: shutdown primary(A), promopt standby(B) to new primary.
>> 4. nova-conductor is running in the whole process.
>>
>> After some seconds, I found some nova-conductor processes are hang with
>> status futex_wait_queue_me, and the status of the query in DB is "idle in
>> transaction", the transaction was not commited or rollbacked!
>> I think disconnection was handled in the oslo_db, which will send a
>> ping(select 1) to DB.
>>
>> If DB was switchd over, the connection in the pool would be set with
>> status invalid, and reconnect after next check out.
>>
>> ###error messages from nova-conductor
>> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
>> (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
>> query is underway [SQL: 'SELECT 1']
>>  36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call
>> last):
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in
>> _execute_context
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters context)
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in
>> do_execute
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters cursor.execute(statement,
>> parameters)
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
>> cannot be used while an asynchronous query is underway
>>   36365 ERROR oslo_db.sqlalchemy.exc_filters
>>  localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
>> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing
>> cursor
>>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>>   36365 ERROR sqlalchemy.pool.QueuePool   File
>> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in
>> _safe_close_cursor
>>   36365 ERROR sqlalchemy.pool.QueuePool cursor.close()
>>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be
>> used while an asynchronous query is underway
>>   36365 ERROR sqlalchemy.pool.QueuePool
>>
>> ###ps status of nova-conductor
>> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat
>> /proc/33316/stack
>> [] futex_wait_queue_me+0xc4/0x120
>> [] futex_wait+0x179/0x280
>> [] do_futex+0xfe/0x5b0
>> [] SyS_futex+0x80/0x180
>> [] system_call_fastpath+0x16/0x1b
>> [] 0x
>>
>> ### stack of the nova-conductor process
>> POD6-Mongodb03:/tmp # pstack 33316
>> #0  0x2b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x2b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>> #2  0x2b8449e31c08 in pthread_mutex_lock () from
>> /lib64/libpthread.so.0
>> #3  0x2b84554c44ab in pq_abort () from
>> /python2.7/site-packages/psycopg2/_psycopg.so
>> #4  0x2b84554c955e in psyco_conn_rollback () from
>> /python2.7/site-packages/psycopg2/_psycopg.so
>> #5  0x2b8449b42b50 in PyEval_EvalFrameEx () from
>> /lib64/libpython2.7.so.1.0
>> #6  0x2b8449b42ad0 in PyEval_EvalFrameEx () from
>> /lib64/libpython2.7.so.1.0
>>
>> The psycopg2 was trying to close the cursor, and try to get the mutex lock
>> "pthread_mutex_lock", but it seems that the cursor was used by other
>> session.
>>
>>
>> Questions:
>>