This patch worked.
https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/747/

But I didn't get the log "Double checkin attempted"; maybe the code that actually 
did the work is 
    if connection_record and connection_record.fairy_ref is not None:
        connection_record.checkin()
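
As a sanity check, here is a minimal, self-contained toy (a stand-in pool and record, 
not SQLAlchemy's actual classes) showing why a guard on fairy_ref turns the second 
cleanup into a no-op instead of a double checkin:

    # Toy sketch only: stand-ins for the pool and record, to show why the
    # fairy_ref guard makes a second cleanup a no-op rather than a double checkin.

    class ToyPool(object):
        def __init__(self):
            self.queue = []

    class ToyRecord(object):
        def __init__(self, pool):
            self.pool = pool
            self.fairy_ref = object()   # non-None while the record is checked out

        def checkin(self):
            self.fairy_ref = None       # cleared once the record is back in the pool
            self.pool.queue.append(self)

    def finalize(connection_record):
        # the proposed guard: only check in while the record is still checked out
        if connection_record and connection_record.fairy_ref is not None:
            connection_record.checkin()
        else:
            print("Double checkin attempted, skipping")

    pool = ToyPool()
    rec = ToyRecord(pool)

    finalize(rec)   # first cleanup path: record goes back to the pool
    finalize(rec)   # second cleanup path: guarded, nothing happens
    assert pool.queue.count(rec) == 1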




On Wednesday, May 16, 2018 at 4:04:21 PM UTC+8, jiajun...@gmail.com wrote:
>
> Thanks a lot.
>
> I'll test this patch in our environment.
>
> On Wednesday, May 16, 2018 at 2:58:56 AM UTC+8, Mike Bayer wrote:
>>
>> good news, I reproduced it. 
>>
>> Even better news for you: vindication!  Nothing to do with greenlets, 
>> threads, concurrency, or anything (though it also doesn't have to do with 
>> the "del" or the weakref; this is much simpler). 
>>
>> This has to do with a special rule in the pool when there is a 
>> so-called "reset agent" involved that cleans up a pooled connection 
>> while taking into account the fact that the enclosing Connection had a 
>> transaction in progress.   That is introducing the 
>> _handle_error()->invalidate() routine *inside* the normal 
>> _finalize_fairy routine and that is why we see _finalize_fairy twice. 
>>  The _finalize_fairy we see that seems to start in the middle of the 
>> operation is called first, then the _finalize_fairy we see failing is 
>> inside of that.   So they aren't sibling operations, they are nested. 
>>
>>
>> https://bitbucket.org/zzzeek/sqlalchemy/issues/4252/double-checkin-due-to-handle_error-called
>>  
>>
>>
>>
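
To make that nesting concrete, here is a small self-contained toy (plain Python, not 
the real pool.py) modeling the sequence described above: the reset step fails, the 
error path re-enters the same cleanup routine, and the outer call then checks the 
record in a second time, so the two checkins are nested rather than siblings:

    # Toy model of the nested cleanup described above (not SQLAlchemy code).

    pool_queue = []

    def checkin(record):
        pool_queue.append(record)

    def finalize(record, fail_reset=False):
        try:
            if fail_reset:
                # stands in for fairy._reset() blowing up on a dead connection
                raise RuntimeError("connection lost during reset")
        except RuntimeError:
            # stands in for the _handle_error() -> invalidate() path, which
            # re-enters the same finalize/checkin logic *inside* this call
            finalize(record)
        checkin(record)   # the outer call then checks the record in as well

    record = object()
    finalize(record, fail_reset=True)

    # the same record ends up in the queue twice: a double checkin
    assert pool_queue.count(record) == 2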
>> On Tue, May 15, 2018 at 1:26 PM, Mike Bayer <cla...@zzzcomputing.com> 
>> wrote: 
>> > can you add these lines to your logging, ensure that the check for 
>> > "already in pool" is present, and produce new logs? 
>> > 
>> >         try: 
>> >             fairy = fairy or _ConnectionFairy( 
>> >                 connection, connection_record, echo) 
>> >             assert fairy.connection is connection 
>> >             pool.logger.debug("ABOUT TO RESET CONNECTION %s CONNECTION REC %s FAIRY %s", 
>> >                               connection, fairy, connection_record) 
>> >             fairy._reset(pool) 
>> >             pool.logger.debug("COMPLETED RESET CONNECTION %s CONNECTION REC %s FAIRY %s", 
>> >                               connection, fairy, connection_record) 
>> > 
>> >             # Immediately close detached instances 
>> >             if not connection_record: 
>> >                 pool._close_connection(connection) 
>> >         except BaseException as e: 
>> >             pool.logger.debug("EXCEPTION RAISED IN RESET CONNECTION %s CONNECTION REC %s FAIRY %s: %s", 
>> >                               connection, fairy, connection_record, e) 
>> >             pool.logger.error( 
>> >                 "Exception during reset or similar", exc_info=True) 
>> >             if connection_record: 
>> >                 pool.logger.debug("jjs going to connection_record.invalidate") 
>> >                 connection_record.invalidate(e=e) 
>> >             if not isinstance(e, Exception): 
>> >                 pool.logger.debug("jjs e %s is not Exception, raise again", e) 
>> >                 raise 
>> > 
>> > my current theory is that a greenlet is being blocked in _reset(), 
>> > then waking up later but is being corrupted at the greenlet level in 
>> > some way. 
>> > 
>> > 
>> > 
>> > On Tue, May 15, 2018 at 12:15 PM, Mike Bayer <cla...@zzzcomputing.com> 
>> wrote: 
>> >> 
>> >> 
>> >> On Tue, May 15, 2018, 3:26 AM <jiajun...@gmail.com> wrote: 
>> >>> 
>> >>> We got the log "jjs _finalize_fairy args" just once, but the code in 
>> >>> the function _finalize_fairy seems to be called twice. It also confused me. 
>> >> 
>> >> 
>> >> It's because that code originated elsewhere and got blocked; then it picks 
>> >> up again, but the call stack is somehow corrupted (or logging isn't telling 
>> >> the truth). 
>> >> 
>> >> Can you please give me: 
>> >> 
>> >> Exact version of eventlet in use 
>> >> Exact version of psycopg2 and psyco-green 
>> >> Exact python interpreter version 
>> >> Exact version of SQLAlchemy 
>> >> Exact version of Oslo.db 
>> >> 
>> >> Thanks 
>> >> 
>> >> 
>> >> 
>> >>> 
>> >>> BTW, I tried to add code [1] to avoid the double checkin in _finalize_fairy, 
>> >>> but it failed. 
>> >>> That means the code in pool.py#L670-692 was called without L646-668. 
>> >>> 
>> >>> [1] https://github.com/jiajunsu/test_logs/blob/master/pool.py#L659 
>> >>> 
>> >>> On Tuesday, May 15, 2018 at 12:41:17 PM UTC+8, Mike Bayer wrote: 
>> >>>> 
>> >>>> s 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> On Mon, May 14, 2018 at 9:38 PM,  <jiajun...@gmail.com> wrote: 
>> >>>> > The tid is the coroutine id; we get it by id(greenlet.getcurrent()) in each 
>> >>>> > coroutine (green thread). 
>> >>>> > 
>> >>>> > We run nova-conductor with monkey_patch, the same as [1], 
>> >>>> > eventlet.monkey_patch(os=False) 
>> >>>> > 
>> >>>> > Our nova-conductor is running with multi workers. 
>> >>>> > 
>> >>>> > It's confusing why _finalize_fairy is called twice but is only logged once at 
>> >>>> > the entrance; that's why I guess it's called by the weakref's callback. 
>> >>>> 
>> >>>> For 130446032, which is the tid you mentioned, _finalize_fairy is only 
>> >>>> called once in your log: 
>> >>>> 
>> >>>> $ cat nc_5679.log | grep -e "130446032.*jjs _finalize" 
>> >>>> 2018-05-14T11:11:39.955+08:00 localhost nova-conductor DEBUG 
>> >>>> [pid:5679] [MainThread] [tid:130446032] [log.py:104 debug] 
>> >>>> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] jjs _finalize_fairy args 
>> >>>> conn None, rec.conn None, rec, <sqlalchemy.pool._ConnectionRecord 
>> >>>> object at 0x70cdb10>. ref None, fairy 
>> >>>> <sqlalchemy.pool._ConnectionFairy object at 0x8a87390> debug 
>> >>>> 
>> >>>> Also, the log shows that no _finalize_fairy is ever called due to 
>> >>>> "del" - this would show the weakref "ref" as non-None: 
>> >>>> 
>> >>>> $ cat nc_5679.log | grep -e "FINALIZE FAIRY.*REF:.*" | grep -v "REF: None" 
>> >>>> $ 
>> >>>> 
>> >>>> I would assume that by "finalize fairy" called twice, you're referring to 
>> >>>> the fact that you see the "exception during reset or similar" message. 
>> >>>> But that message is still, AFAICT, impossible to emit without the 
>> >>>> function being entered, and a corresponding "FINALIZE FAIRY" log line 
>> >>>> preceding it, unless something is going wrong with the logging, or the 
>> >>>> interpreter is somehow not functioning correctly (I wonder if eventlet 
>> >>>> has had any such issues).   I'm struggling to come up with any other 
>> >>>> possibility. 
>> >>>> 
>> >>>> One weird thing I'm trying to figure out is that I see a connection 
>> >>>> being returned to the pool normally, but then nothing is logged about 
>> >>>> it after that: 
>> >>>> 
>> >>>> 2018-05-14T11:11:21.183+08:00 localhost nova-conductor DEBUG 
>> >>>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
>> >>>> [req-19a57f61-bc64-4575-8253-00b5bfe62a74] Connection <connection 
>> >>>> object at 0x7f419c0; closed: 0> being returned to pool, 
>> >>>> connection_record.connection <connection object at 0x7f419c0; closed: 0> debug 
>> >>>> 
>> >>>> 
>> >>>> the next log message from tid:130446352 should be "jjs conn reset". 
>> >>>> But that never happens.  The next we see of that tid is: 
>> >>>> 
>> >>>> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG 
>> >>>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
>> >>>> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs 
>> >>>> _ConnectionFairy._checkout fariy None debug 
>> >>>> 2018-05-14T11:11:21.549+08:00 localhost nova-conductor DEBUG 
>> >>>> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
>> >>>> [req-0ef61146-5cb6-4f46-9788-406b6d376843] jjs before pool._do_get 
>> >>>> pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 
>> >>>> 0x7777e10>, <sqlalchemy.pool._ConnectionRecord object at 0x75fb7d0>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x778d810>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x7600610>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x6ea7e10>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x7798050>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x796d110>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x7798d50>, 
>> >>>> <sqlalchemy.pool._ConnectionRecord object at 0x75f5710>]) debug 
>> >>>> 
>> >>>> that is, it's checking out a new connection?  If it were blocked on 
>> >>>> trying to reset the connection, we wouldn't see that tid again.  I 
>> >>>> wonder if the greenlet was interrupted or something.    I'm finding 
>> >>>> various impossibilities like this going on and can't come up with any 
>> >>>> theory other than that the interpreter is somehow not functioning 
>> >>>> correctly, or the logging here is somehow incomplete. 
>> >>>> 
>> >>>> I notice you're on a very old version of SQLAlchemy (three years old). 
>> >>>> You really should be on at least the 1.1 series, because for greenlets 
>> >>>> you need to have: 
>> >>>> 
>> >>>> http://docs.sqlalchemy.org/en/latest/changelog/migration_11.html#change-3803 
>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> 
>> >>>> > 
>> >>>> > [1] https://github.com/openstack/nova/blob/master/nova/cmd/__init__.py 
>> >>>> > 
>> >>>> > 
>> >>>> > On Monday, May 14, 2018 at 11:59:43 PM UTC+8, Mike Bayer wrote: 
>> >>>> >> 
>> >>>> >> I notice in this log file there are multiple "tid" values, which I assume 
>> >>>> >> to be thread identifiers, e.g. OS-level threads?    Are you using 
>> >>>> >> eventlet or not?    Are you monkeypatching SQLAlchemy and then using 
>> >>>> >> multiple OS-level threads at the same time? Because that likely 
>> >>>> >> doesn't work. 
>> >>>> >> 
>> >>>> >> Particularly, the log here illustrates an impossible codepath. 
>> >>>> >> 
>> >>>> >> We enter finalize_fairy with connection is None: 
>> >>>> >> 
>> >>>> >> 2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG 
>> >>>> >> [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] 
>> >>>> >> [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN: 
>> >>>> >> None, FAIRY: <sqlalchemy.pool._ConnectionFairy object at 0x87dd3d0>, 
>> >>>> >> REF: None  CONNECTION_REC: <sqlalchemy.pool._ConnectionRecord object 
>> >>>> >> at 0x776fe90> CONNECTION_REC REF <weakref at 0x7c5cdb8; to 
>> >>>> >> '_ConnectionFairy' at 0x87dd3d0> debug 
>> >>>> >> 
>> >>>> >> this seems to make it through to the checkin at the bottom. 
>> >>>> >> 
>> >>>> >> Then, we see this: 
>> >>>> >> 
>> >>>> >> 2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR 
>> >>>> >> [pid:5679] [MainThread] [tid:130446032] [log.py:122 error] 
>> >>>> >> [req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or 
>> >>>> >> similar 
>> >>>> >> 
>> >>>> >> The only place that message is generated is *in* finalize_fairy, 
>> >>>> >> *before* the checkin occurs, and *only* if _finalize_fairy was called 
>> >>>> >> *with* a connection that is non-None.    There is no other call to 
>> >>>> >> _finalize_fairy with that request identifier, therefore I see no 
>> >>>> >> evidence of where this "exception during reset" message can be 
>> >>>> >> generated from - there would at least need to be an additional log 
>> >>>> >> line with "FINALIZE FAIRY" and a non-None connection for this to be 
>> >>>> >> reached.    I see no way that the pool.py shipped with SQLAlchemy can 
>> >>>> >> possibly generate this series of steps. 
>> >>>> >> 
>> >>>> >> This is an extremely intricate issue and I still have only the vaguest 
>> >>>> >> notion of how you are running SQLAlchemy / Nova. 
>> >>>> >> 
>> >>>> >> 
>> >>>> >> 
>> >>>> >> 
>> >>>> >> On Mon, May 14, 2018 at 11:10 AM, Mike Bayer <mik...@zzzcomputing.com> 
>> >>>> >> wrote: 
>> >>>> >> > OK, I need the exact pool.py you are using as well, including all 
>> >>>> >> > of these logging statements and any other adjustments that were made. 
>> >>>> >> > 
>> >>>> >> > 
>> >>>> >> > 
>> >>>> >> > On Mon, May 14, 2018 at 8:02 AM,  <jiajun...@gmail.com> wrote: 
>> >>>> >> >> I've uploaded the logs here: 
>> >>>> >> >> 
>> >>>> >> >> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log 
>> >>>> >> >> 
>> >>>> >> >> The ConnectionRecord is 0x70cdb10, and the tid (coroutine id) is 130446032. 
>> >>>> >> >> 
>> >>>> >> >> At 2018-05-14T11:11:39.970, we got the log "jjs rec already in pool 
>> >>>> >> >> checkin", but I had just deleted the `return` in the code, so the problem 
>> >>>> >> >> happened again. 
>> >>>> >> >> 
>> >>>> >> >> 
>> >>>> >> >> 
>> >>>> >> >> 
>> >>>> >> >> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote: 
>> >>>> >> >>> 
>> >>>> >> >>> as well as the log that shows this: 
>> >>>> >> >>> 
>> >>>> >> >>> +    if self in pool._pool.queue: 
>> >>>> >> >>> +        pool.logger.debug("jjs rec already in pool checkin, skip %r", self) 
>> >>>> >> >>> +        return 
>> >>>> >> >>> 
>> >>>> >> >>> 
>> >>>> >> >>> actually happening. 
>> >>>> >> >>> 
>> >>>> >> >>> 
>> >>>> >> >>> 
>> >>>> >> >>> 
>> >>>> >> >>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer <mik...@zzzcomputing.com> 
>> >>>> >> >>> wrote: 
>> >>>> >> >>> > On Sun, May 13, 2018 at 10:35 PM,  <jiajun...@gmail.com> wrote: 
>> >>>> >> >>> >> "jjs" is short for my name and "cjr" is my 
>> colleague's(JinRong 
>> >>>> >> >>> >> Cai). We 
>> >>>> >> >>> >> work 
>> >>>> >> >>> >> together and add logs in the same environment, add prefix 
>> of 
>> >>>> >> >>> >> name 
>> >>>> >> >>> >> to 
>> >>>> >> >>> >> get a 
>> >>>> >> >>> >> keyword for log filter. 
>> >>>> >> >>> >> 
>> >>>> >> >>> >> The doubled ConnectionRecords(with None connection) are 
>> got by 
>> >>>> >> >>> >> two 
>> >>>> >> >>> >> different 
>> >>>> >> >>> >> coroutines, and the variable record.connection is rewrite 
>> by 
>> >>>> >> >>> >> the 
>> >>>> >> >>> >> second 
>> >>>> >> >>> >> coroutines. 
>> >>>> >> >>> >> After the record being put back to QueuePool again, there 
>> are 
>> >>>> >> >>> >> two 
>> >>>> >> >>> >> same 
>> >>>> >> >>> >> record with same connection in the pool, which will cause 
>> the 
>> >>>> >> >>> >> next 
>> >>>> >> >>> >> two 
>> >>>> >> >>> >> coroutines get conflict and deadlock in psycopg2 as above. 
>> >>>> >> >>> > 
>> >>>> >> >>> > OK, I need the detailed step-by-step account that shows how the two 
>> >>>> >> >>> > coroutines are interacting such that they both return the same 
>> >>>> >> >>> > connection.   The nova log above seems to show just one coroutine, I 
>> >>>> >> >>> > assume, since there is just one nova request ID, if I understand 
>> >>>> >> >>> > correctly. 
>> >>>> >> >>> > 
>> >>>> >> >>> >> 
>> >>>> >> >>> >> The psycopg2 logs have been pasted at 
>> >>>> >> >>> >> https://github.com/psycopg/psycopg2/issues/703 
>> >>>> >> >>> > 
>> >>>> >> >>> > The above issue says the bug has been found in SQLAlchemy, but there 
>> >>>> >> >>> > is no bug report yet, and I've not yet been shown how the behavior 
>> >>>> >> >>> > here is possible.      I posted a straightforward example above; can 
>> >>>> >> >>> > you please show me where the two coroutines get mixed up? 
>> >>>> >> >>> > 
>> >>>> >> >>> > 
>> >>>> >> >>> > 
>> >>>> >> >>> > 
>> >>>> >> >>> >> 
>> >>>> >> >>> >> 
>> >>>> >> >>> >> 
>> >>>> >> >>> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote: 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> I tried to follow your log lines, which is difficult because I don't 
>> >>>> >> >>> >>> know where those log statements actually are and I don't know what 
>> >>>> >> >>> >>> acronyms like "jjs" and "cjr" mean.   But it does look like it's all 
>> >>>> >> >>> >>> in one thread (so not async GC) and all in one Nova request (so 
>> >>>> >> >>> >>> perhaps just one greenlet involved, unless other greenlets are 
>> >>>> >> >>> >>> affecting this without being displayed in this log snippet).   I do 
>> >>>> >> >>> >>> see the connection record doubled at the end.   So here is an 
>> >>>> >> >>> >>> illustration of those steps in isolation, where the connection record 
>> >>>> >> >>> >>> is not doubled.   Can you alter this program to show this condition 
>> >>>> >> >>> >>> occurring? 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> from sqlalchemy import create_engine 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> e = create_engine("postgresql://scott:tiger@localhost/test", 
>> >>>> >> >>> >>>                   echo_pool='debug') 
>> >>>> >> >>> >>> e.connect().close() 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 1 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> c1 = e.connect() 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 0 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> c1.engine.pool._invalidate(c1._Connection__connection, None) 
>> >>>> >> >>> >>> c1.invalidate(e) 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # connection record is back 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 1 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # do a reconnect 
>> >>>> >> >>> >>> c1.connection 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # uses the record again 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 0 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # close the Connection 
>> >>>> >> >>> >>> c1.close() 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # record is back 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 1 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> # etc. 
>> >>>> >> >>> >>> c1 = e.connect() 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 0 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> c1.close() 
>> >>>> >> >>> >>> assert len(e.pool._pool.queue) == 1 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> 
>> >>>> >> >>> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer 
>> >>>> >> >>> >>> <mik...@zzzcomputing.com> 
>> >>>> >> >>> >>> wrote: 
>> >>>> >> >>> >>> > Here's a print statement you might want to try: 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py 
>> >>>> >> >>> >>> > index 89a4cea7c..452db4883 100644 
>> >>>> >> >>> >>> > --- a/lib/sqlalchemy/pool.py 
>> >>>> >> >>> >>> > +++ b/lib/sqlalchemy/pool.py 
>> >>>> >> >>> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record, 
>> >>>> >> >>> >>> >      been garbage collected. 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> >      """ 
>> >>>> >> >>> >>> > +    print( 
>> >>>> >> >>> >>> > +        "FINALIZE FAIRY:  GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: %s  " 
>> >>>> >> >>> >>> > +        "CONNECTION_REC: %s CONNECTION_REC REF %s" % ( 
>> >>>> >> >>> >>> > +            "put greenlet id here", 
>> >>>> >> >>> >>> > +            connection, 
>> >>>> >> >>> >>> > +            fairy, 
>> >>>> >> >>> >>> > +            ref, 
>> >>>> >> >>> >>> > +            connection_record, 
>> >>>> >> >>> >>> > +            connection_record.fairy_ref 
>> >>>> >> >>> >>> > +            if connection_record is not None else "n/a" 
>> >>>> >> >>> >>> > +        ) 
>> >>>> >> >>> >>> > +    ) 
>> >>>> >> >>> >>> > + 
>> >>>> >> >>> >>> >      _refs.discard(connection_record) 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> >      if ref is not None: 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > 
>> >>>> >> >>> >>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer 
>> >>>> >> >>> >>> > <mik...@zzzcomputing.com> 
>> >>>> >> >>> >>> > wrote: 
>> >>>> >> >>> >>> >> On Thu, May 10, 2018 at 7:23 AM,  <jiajun...@gmail.com> wrote: 
>> >>>> >> >>> >>> >>> Hi Mike, 
>> >>>> >> >>> >>> >>> 
>> >>>> >> >>> >>> >>> I added more logs and finally found the reason, and a method to 
>> >>>> >> >>> >>> >>> avoid the deadlock between coroutines. 
>> >>>> >> >>> >>> >>> 
>> >>>> >> >>> >>> >>> In summary, the deadlock happened because two different coroutines 
>> >>>> >> >>> >>> >>> used the same ConnectionRecord with the same connection. 
>> >>>> >> >>> >>> >>> It happens under QueuePool, when the DB connections are reset by the 
>> >>>> >> >>> >>> >>> DB server (not related to the backend driver). 
>> >>>> >> >>> >>> >>> 
>> >>>> >> >>> >>> >>> 1. A coroutine gets a ConnectionRecord into a ConnectionFairy, and 
>> >>>> >> >>> >>> >>> creates a weakref to the ConnectionFairy, with the callback function 
>> >>>> >> >>> >>> >>> `_finalize_fairy`. 
>> >>>> >> >>> >>> >>> 2. When the connection gets disconnected, it receives an exception 
>> >>>> >> >>> >>> >>> (psycopg2.OperationalError) and goes to 
>> >>>> >> >>> >>> >>> Connection._handle_dbapi_exception. 
>> >>>> >> >>> >>> >>> 3. In `_handle_dbapi_exception`'s finally block, it calls both 
>> >>>> >> >>> >>> >>> `ConnectionFairy.invalidate` and `self.invalidate`. 
>> >>>> >> >>> >>> >>> 4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is 
>> >>>> >> >>> >>> >>> called, and the ConnectionRecord is put back into the QueuePool. 
>> >>>> >> >>> >>> >>> 5. In `self.invalidate`, `del self._root.__connection` is called, 
>> >>>> >> >>> >>> >>> which **del**s the ConnectionFairy. 
>> >>>> >> >>> >>> >>> 6. The **del** operation kills the weakref to the ConnectionFairy, 
>> >>>> >> >>> >>> >>> and the callback `_finalize_fairy` is executed. 
>> >>>> >> >>> >>> >>> 7. In `_finalize_fairy`, another `_ConnectionRecord._checkin` is 
>> >>>> >> >>> >>> >>> called, putting the ConnectionRecord into the QueuePool again. 
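
A small self-contained toy of steps 4-7 above (again, not the real pool classes; 
CPython's immediate refcounting is assumed): an explicit checkin during invalidation, 
then a del of the fairy, whose weakref callback checks the very same record in again:

    # Toy model of steps 4-7 above (not SQLAlchemy's real pool classes).
    import weakref

    pool_queue = []

    class Record(object):
        def checkin(self):
            pool_queue.append(self)

    class Fairy(object):
        def __init__(self, record):
            self.record = record

    record = Record()
    fairy = Fairy(record)

    def _finalize(ref, record=record):
        # weakref callback, playing the role of _finalize_fairy: checks in again
        record.checkin()

    ref = weakref.ref(fairy, _finalize)

    record.checkin()   # step 4: invalidate -> _ConnectionRecord._checkin

    del fairy          # steps 5-6: the weakref dies and the callback fires (CPython)

    assert pool_queue.count(record) == 2   # step 7: the record is checked in twice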
>> >>>> >> >>> >>> >> 
>> >>>> >> >>> >> Here's what does not make sense.  You are claiming that a simple 
>> >>>> >> >>> >> invalidation will result in the same ConnectionRecord being checked in 
>> >>>> >> >>> >> twice.   This is obviously not the case; what is special in your 
>> >>>> >> >>> >> situation that makes this happen? 
>> >>>> >> >>> >>>
>
>
