Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-16 Thread jiajunsu . zju
This patch worked: https://gerrit.sqlalchemy.org/#/c/zzzeek/sqlalchemy/+/747/ But I didn't get the log "Double checkin attempted"; maybe the real working code is: if connection_record and connection_record.fairy_ref is not None: connection_record.checkin() On Wednesday, May 16, 2018
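
Reindented, the guard quoted above reads as follows; the wrapper function here is a hypothetical stand-in for the tail of _finalize_fairy, not the code of the merged patch:

    # Reindented form of the guard quoted above; the surrounding function is
    # only sketched (a hypothetical helper), not copied from the patch.
    def _finalize_fairy_tail(connection_record):
        if connection_record and connection_record.fairy_ref is not None:
            # check the record in only while a fairy still references it;
            # otherwise it has already been returned to the pool
            connection_record.checkin()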

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-16 Thread jiajunsu . zju
Thanks a lot. I'll test this patch in our environment. On Wednesday, May 16, 2018 at 2:58:56 AM UTC+8, Mike Bayer wrote: > > good news, I reproduced it. > > Even better news for you, vindication! nothing to do with greenlets, > threads, concurrency, or anything (though it doesn't have to do with > the "del" or

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
good news, I reproduced it. Even better news for you, vindication! nothing to do with greenlets, threads, concurrency, or anything (though it doesn't have to do with the "del" or the weakref, this is much simpler). This has to do with a special rule in the pool when there is a so-called "reset
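
A toy sketch (plain Python, not SQLAlchemy's code) of the failure shape being described: if a failed "reset" during checkin is handled by checking the record in, and the normal path then checks it in again, the same record lands in the pool twice.

    # Toy illustration only -- not SQLAlchemy's implementation.
    import queue

    pool = queue.Queue()

    def reset(record, fail=False):
        if fail:
            raise RuntimeError("connection lost during reset")

    def checkin(record):
        pool.put(record)            # no guard against a second call

    def finalize(record, fail_reset=False):
        try:
            reset(record, fail=fail_reset)
        except Exception:
            checkin(record)         # error path returns the record...
        checkin(record)             # ...and so does the normal path

    finalize(object(), fail_reset=True)
    print(pool.qsize())             # prints 2: one record, checked in twice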

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
can you add these lines to your logging, ensure that the check for "already in pool" is present, and produce new logs?

    try:
        fairy = fairy or _ConnectionFairy(
            connection, connection_record, echo)
        assert fairy.connection is connection

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread Mike Bayer
On Tue, May 15, 2018, 3:26 AM wrote: > We got the log "jjs _finalize_fairy args" just once but the code in > function _finalize_fairy seems to be called twice. It also confused me. > It's because that code originated elsewhere and got blocked, then it picks up again but

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-15 Thread jiajunsu . zju
We got the log "jjs _finalize_fairy args" just once but the code in function _finalize_fairy seems to be called twice. It also confused me. BTW, I tried to add code [1] to avoid double checkin in _finalize_fairy but it failed. That means the code in pool.py#L670-692 was called without L646-668. [1]

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
On Mon, May 14, 2018 at 9:38 PM, wrote: > The tid is the coroutine id; we get it by id(greenlet.getcurrent()) in each > coroutine (green thread). > > We run nova-conductor with monkey_patch, the same as [1], > eventlet.monkey_patch(os=False) > > Our nova-conductor is

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
The exc_info of tid 130446032 "Exception during reset or similar" is:
sqlalchemy.pool.QueuePool Traceback (most recent call last):
sqlalchemy.pool.QueuePool   File "sqlalchemy/pool.py", line 679, in _finalize_fairy
sqlalchemy.pool.QueuePool     fairy._reset(pool)
sqlalchemy.pool.QueuePool   File

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
The tid is the coroutine id; we get it by id(greenlet.getcurrent()) in each coroutine (green thread). We run nova-conductor with monkey_patch, the same as [1], eventlet.monkey_patch(os=False). Our nova-conductor is running with multiple workers. It's confusing why _finalize_fairy is called twice
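
A minimal, self-contained sketch of that setup; only id(greenlet.getcurrent()) and eventlet.monkey_patch(os=False) come from the message above, the logger and worker names are illustrative.

    # monkey-patch as described above, leaving os untouched
    import eventlet
    eventlet.monkey_patch(os=False)

    import logging
    import greenlet

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("pool-debug")

    def tid():
        # the "tid" in the logs: id() of the current green thread
        return id(greenlet.getcurrent())

    def worker(name):
        log.debug("[tid %s] %s running", tid(), name)

    pool = eventlet.GreenPool()
    for name in ("a", "b"):
        pool.spawn(worker, name)
    pool.waitall()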

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
I put the files where we added logs here, including engine/base.py and pool.py: https://github.com/jiajunsu/test_logs On Monday, May 14, 2018 at 11:10:37 PM UTC+8, Mike Bayer wrote: > > OK I need the exact pool.py you are using as well that includes all > these logging statements and any other adjustments that were made.

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
I notice in this log file there are multiple "tid", which I assume to be thread identifiers? E.g. OS-level threads? Are you using eventlet or not? Are you monkeypatching SQLAlchemy and then using multiple OS-level threads at the same time? Because that likely doesn't work. Particularly,

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread Mike Bayer
OK I need the exact pool.py you are using as well that includes all these logging statements and any other adjustments that were made. On Mon, May 14, 2018 at 8:02 AM, wrote: > I've uploaded the logs here: >

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-14 Thread jiajunsu . zju
I've uploaded the logs here: https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log The ConnectionRecord is 0x70cdb10, tid (coroutine id) is 130446032. At 2018-05-14T11:11:39.970, we got the log "jjs rec already in pool checkin", but I just deleted `return` in the code, so the

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
OK. I'll reproduce it and upload all logs of the process. On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote: > > as well as the log that shows this: > > +if self in pool._pool.queue: +pool.logger.debug("jjs rec already in pool checkin, skip %r", > self) > +return > > >

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
as well as the log that shows this:

+if self in pool._pool.queue:
+    pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
+    return

actually happening. On Sun, May 13, 2018 at 10:42 PM, Mike Bayer wrote: > On Sun, May 13, 2018 at
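
A standalone illustration of the same guard technique with the standard-library queue (not the SQLAlchemy diff itself): the pool's Queue keeps its items in a deque reachable as .queue, so membership can be tested before putting a record back a second time.

    import queue

    q = queue.Queue()
    record = object()

    def checkin(rec):
        if rec in q.queue:          # already returned to the pool: skip
            print("already in pool checkin, skip %r" % (rec,))
            return
        q.put(rec)

    checkin(record)
    checkin(record)                 # second checkin is skipped
    print(q.qsize())                # prints 1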

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
On Sun, May 13, 2018 at 10:35 PM, wrote: > "jjs" is short for my name and "cjr" is my colleague's (JinRong Cai). We work > together and add logs in the same environment, adding a name prefix to get a > keyword for log filtering. > > The doubled ConnectionRecords (with None

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
"jjs" is short for my name and "cjr" is my colleague's(JinRong Cai). We work together and add logs in the same environment, add prefix of name to get a keyword for log filter. The doubled ConnectionRecords(with None connection) are got by two different coroutines, and the variable

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread Mike Bayer
if that connection error is not handled in "is_disconnect" then this wouldn't invalidate the pool. However, we've had lots of problems moving VIPs around with Galera. I would recommend using haproxy or PgBouncer, which are both designed to mitigate this issue. As well as, run Nova under
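
A hedged sketch (not from this thread) of the "is_disconnect" handling being referred to: SQLAlchemy's handle_error engine event lets an application flag additional driver errors as disconnects so pooled connections get invalidated. The connection URL and the exact error matching below are illustrative only.

    import psycopg2
    from sqlalchemy import create_engine, event

    # illustrative URL; replace with the real DSN
    engine = create_engine("postgresql+psycopg2://user:pass@db-vip/nova")

    @event.listens_for(engine, "handle_error")
    def mark_disconnect(context):
        # treat lost-connection errors from psycopg2 as disconnects so
        # SQLAlchemy invalidates the pooled connections
        if isinstance(context.original_exception, psycopg2.OperationalError):
            context.is_disconnect = True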

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-13 Thread jiajunsu . zju
The scenario we use to reproduce the problem is:
1. stop the PG master
2. move the IP of PG from the master node to the slave node, and send an ARP
3. fail over the PG slave to master
That shuts all DB connections to the PG master, and in SQLAlchemy we get an exception: ERROR sqlalchemy.pool.QueuePool

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
I tried to follow your log lines, which is difficult because I don't know where those log statements actually are and I don't know what acronyms like "jjs", "cjr" mean. But it does look like it's all in one thread (so not async GC) and all in one Nova request (so perhaps just one greenlet

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
Here's a print statement you might want to try:

diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
index 89a4cea7c..452db4883 100644
--- a/lib/sqlalchemy/pool.py
+++ b/lib/sqlalchemy/pool.py
@@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
 been garbage
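
The body of the diff is cut off in this digest; a debugging print in that spirit (purely hypothetical, not the actual patch) would dump the call stack each time the finalize path runs so the two call paths can be told apart:

    import traceback

    def log_finalize_call(connection, connection_record):
        # hypothetical helper: show which caller reached the finalize path
        print(
            "_finalize_fairy called for connection=%r record=%r\n%s"
            % (connection, connection_record, "".join(traceback.format_stack()))
        )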

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread Mike Bayer
On Thu, May 10, 2018 at 7:23 AM, wrote: > Hi Mike, > > I added more logs and finally found the reason and a method to avoid the > deadlock between coroutines. > > In summary, the deadlock happened because two different coroutines use the > same ConnectionRecord with the same

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-10 Thread jiajunsu . zju
Hi Mike, I added more logs and finally found the reason and a method to avoid the deadlock between coroutines. In summary, the deadlock happened because two different coroutines use the same ConnectionRecord with the same connection. It happens under QueuePool, when the DB connections are reset by

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
I found why we could get the same connection twice from the pool... Because we put it into the pool twice. Any suggestion for finding why the same connection was put into the pool from different coroutines? 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread]

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
It's not a problem with the lock. I tried to acquire _pool.mutex in create_connection but it didn't help. Maybe we create the connection, put it into the pool, but use it without getting it out? On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, jiajun...@gmail.com wrote: > > I added logs in sqlalchemy/pool.py > > QueuePool.checkout >

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-08 Thread jiajunsu . zju
I added logs in sqlalchemy/pool.py, QueuePool.checkout:

    @classmethod
    def checkout(cls, pool):
        rec = pool._do_get()
        try:
            dbapi_connection = rec.get_connection()
            pool.logger.debug("jjs get conn success %s", dbapi_connection)
        except:
            pool.logger.debug("jjs get

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread Mike Bayer
just to note, there's definitely a bug somewhere, and if the bug is found there's a 90% chance it involves code that I wrote at some point :) (though likely some assumption that the PG /async stuff does not meet) but this is part of oslo / openstack. I work on oslo / openstack, but that's my

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread Mike Bayer
On Mon, May 7, 2018, 7:01 AM wrote: > We added coroutine_id in psycopg2, and found that two coroutines use the > same connection before the pthread_mutex_lock is released. > > Maybe something in the connection pool goes wrong? > [pid] [coroutine_id] msg > This seems

Re: [sqlalchemy] Re: long transaction after database switched over

2018-05-07 Thread jiajunsu . zju
We added coroutine_id in psycopg2, and found that two coroutines use the same connection before the pthread_mutex_lock is released. Maybe something in the connection pool goes wrong?
[pid] [coroutine_id] msg
[49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320

Re: [sqlalchemy] Re: long transaction after database switched over

2018-04-28 Thread jiajunsu . zju
We reproduced this problem and added logs in psycopg2, found a confusing thing, and reported it to psycopg2. The sqlalchemy & psycopg2 stack flow seems to be:
1. from sqlalchemy to psycopg: pq_execute/pq_commit
2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn)
3. get an exception in sqlalchemy, and

Re: [sqlalchemy] Re: long transaction after database switched over

2018-04-23 Thread Mike Bayer
On Mon, Apr 23, 2018 at 9:03 AM, wrote: > Sorry for replying on this topic. > > We recently got the same problem in our production environment. > > I found a patch in another lib [1], and they added conn.close() in the > psycopg2.ProgrammingError exception handler. > > Shall we do the
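
A hedged reconstruction of the pattern being described (the library referenced as [1] is not named in this digest): close the raw psycopg2 connection when a ProgrammingError escapes, so a broken connection is not reused. The function name and structure here are illustrative only.

    import psycopg2

    def commit_or_close(conn):
        # illustrative sketch of "conn.close() in the
        # psycopg2.ProgrammingError exception handler"
        try:
            conn.commit()
        except psycopg2.ProgrammingError:
            conn.close()
            raise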