I found why we could get the same connection twice from the pool: because we put it into the pool twice.
Any suggestions for finding out why the same connection gets checked into the pool from two different coroutines?

2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:187954992] [log.py:104 debug] [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection object at 0x111d9740; closed: 0>
2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:178645264] [log.py:104 debug] [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection object at 0x111d9740; closed: 0>
...
2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:309091856] [log.py:104 debug] [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection object at 0x111d9740; closed: 0> checked out from pool
2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:178645104] [log.py:104 debug] [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection object at 0x111d9740; closed: 0> checked out from pool

On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote:

I added logging in sqlalchemy/pool.py, in QueuePool.checkout:

    @classmethod
    def checkout(cls, pool):
        rec = pool._do_get()
        try:
            dbapi_connection = rec.get_connection()
            pool.logger.debug("jjs get conn success %s", dbapi_connection)
        except:
            pool.logger.debug("jjs get error in get_connection")
            with util.safe_reraise():
                rec.checkin()
        echo = pool._should_log_debug()
        fairy = _ConnectionFairy(dbapi_connection, rec, echo)
        rec.fairy_ref = weakref.ref(
            fairy,
            lambda ref: _finalize_fairy and
            _finalize_fairy(
                dbapi_connection,
                rec, pool, ref, echo)
        )
        _refs.add(rec)
        if echo:
            pool.logger.debug("jjs conn %s, pool status %s", dbapi_connection,
                              pool.status())
            pool.logger.debug("Connection %r checked out from pool",
                              dbapi_connection)
        return fairy

Then I got the logs below.

Coroutine A:

2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:197123760] [log.py:109 info] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 30
2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection <connection object at 0x903f8c0; closed: 0>
2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success <connection object at 0x903f8c0; closed: 0>
2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: 1 Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45
2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool

Coroutine B:

2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:151519312] [log.py:109 info] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 30
2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success <connection object at 0x903f8c0; closed: 0>
2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: 0 Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46
2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool

It seems that when we create a connection in QueuePool._do_get, we may not be holding the lock _pool.mutex (~Queue.mutex).

On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:

On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote:

> We added a coroutine_id in psycopg2, and found that two coroutines use the same connection before the pthread_mutex_lock is released.
>
> Maybe something in the connection pool goes wrong?
> [pid] [coroutine_id] msg

This seems unlikely. OpenStack uses the SQLAlchemy connection pool with eventlet patching in conjunction with the MySQL drivers, and nothing like that happens. The biggest issue SQLAlchemy had with greenlets is that a timed-out greenlet would throw GreenletExit and cause a dirty connection to go back into the pool; that was fixed in 1.2.

> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>
> Below are the logs grepped by coroutine id:
>
> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
> [49174] [0xa5db730]conn_poll: poll writing
> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>
> -----
>
> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 0x8de2d30
> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30
> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1
> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
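Coming back to the question at the top of the thread: one lighter-weight way to watch for a double checkin, without patching pool.py, is to hook SQLAlchemy's documented pool events and remember which green thread last checked each DBAPI connection in. A minimal sketch (the engine URL and the direct use of eventlet here are illustrative assumptions, not taken from the nova-conductor setup):

    import traceback

    from eventlet import greenthread
    from sqlalchemy import create_engine, event

    # Hypothetical engine; nova-conductor actually gets its engine from oslo.db.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/nova")

    # id(dbapi_connection) -> green thread that last checked it in (still "in pool")
    checked_in_by = {}

    @event.listens_for(engine, "checkin")
    def on_checkin(dbapi_connection, connection_record):
        key = id(dbapi_connection)
        me = greenthread.getcurrent()
        if key in checked_in_by:
            # The same DBAPI connection was checked in a second time without an
            # intervening checkout -- the situation shown in the logs above.
            print("double checkin of %r by %r (previous checkin by %r)"
                  % (dbapi_connection, me, checked_in_by[key]))
            traceback.print_stack()
        checked_in_by[key] = me

    @event.listens_for(engine, "checkout")
    def on_checkout(dbapi_connection, connection_record, connection_proxy):
        checked_in_by.pop(id(dbapi_connection), None)

The checkout listener clears the marker, so any checkin that finds its connection already marked is a second return of the same connection, and the printed stack shows where that second checkin came from.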
On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote:

We reproduced this problem, added logging in psycopg2, found a confusing thing, and reported it to psycopg2 [1].

The sqlalchemy & psycopg2 call flow seems to be:
1. from sqlalchemy into psycopg2: pq_execute/pq_commit
2. from psycopg2 back into sqlalchemy: PyWeakref_NewRef(conn)
3. sqlalchemy gets an exception and does do_rollback
4. from sqlalchemy into psycopg2: pq_abort
5. psycopg2 deadlocks on pthread_mutex_lock

What we did to reproduce it: stop the master pg-server and promote the slave pg-server to master, moving the FIP from the old master to the slave, while nova-conductor is doing a large number of DB query requests.

[1] https://github.com/psycopg/psycopg2/issues/703

On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:

On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote:
> Sorry for replying on this old topic.
>
> We recently hit the same problem in our production environment.
>
> I found a patch in another library [1]; they added conn.close() in the
> psycopg2.ProgrammingError exception handler.
>
> Shall we do the same in [2]?

SQLAlchemy does things much more carefully than that: we parse the error message for specific ones that correspond to "connection is no longer usable"; we call these "is_disconnect", but it can be any invalidating condition.

You can make these yourself, and they can also be made to be part of oslo.db, using the handle_error event:

http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error

Within oslo.db you would want to propose a change here:

https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387

> [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>
> [2] https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
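To make the handle_error suggestion concrete: a listener that treats the psycopg2 "asynchronous query is underway" errors shown below as an invalidating condition might look roughly like this sketch (the engine is a stand-in, and in oslo.db the equivalent change would go through exc_filters.py rather than a per-engine listener):

    import psycopg2
    from sqlalchemy import create_engine, event

    # Stand-in engine; nova-conductor actually gets its engine from oslo.db.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/nova")

    @event.listens_for(engine, "handle_error")
    def treat_async_error_as_disconnect(context):
        # context is a sqlalchemy.engine.ExceptionContext.  Setting is_disconnect
        # makes SQLAlchemy treat the failure as a disconnect, so the connection
        # is invalidated instead of going back into the pool in a broken state.
        exc = context.original_exception
        if (isinstance(exc, psycopg2.ProgrammingError)
                and "asynchronous query is underway" in str(exc)):
            context.is_disconnect = True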
On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:

Hi Michael,

I am using OpenStack with PostgreSQL, with the sqlalchemy and oslo_db modules, and there are some problems after my pg database switched over.

Here is my switchover process:
1. nova-conductor (a Python application) is running with DB connection strings pointing to the VIP, which is on the primary site (A) of pg.
2. Switch the VIP from the primary (A) to the new primary (B).
3. Switch over pg: shut down the primary (A), promote the standby (B) to new primary.
4. nova-conductor keeps running during the whole process.

After some seconds, I found some nova-conductor processes hanging with status futex_wait_queue_me, and the status of the query in the DB is "idle in transaction"; the transaction was neither committed nor rolled back!

I think disconnection is handled in oslo_db, which sends a ping (SELECT 1) to the DB. If the DB is switched over, the connection in the pool should be marked invalid and reconnect on the next checkout.

### error messages from nova-conductor

localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [exc_filters.py:330 _raise_for_remaining_DBAPIError] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from (psycopg2.ProgrammingError) execute cannot be used while an asynchronous query is underway [SQL: 'SELECT 1']
36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway
36365 ERROR oslo_db.sqlalchemy.exc_filters
localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway
36365 ERROR sqlalchemy.pool.QueuePool

### ps status of nova-conductor

POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
[<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
[<ffffffff810e5799>] futex_wait+0x179/0x280
[<ffffffff810e782e>] do_futex+0xfe/0x5b0
[<ffffffff810e7d60>] SyS_futex+0x80/0x180
[<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

### stack of the nova-conductor process

POD6-Mongodb03:/tmp # pstack 33316
#0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
#4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
#5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

psycopg2 was trying to close the cursor and to take the mutex with pthread_mutex_lock, but it seems the cursor was in use by another session.
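For reference, the first ProgrammingError above is what psycopg2 raises when a second green thread calls execute() on a connection that already has a query in flight. A rough way to reproduce just that error outside of nova (assuming eventlet's psycopg2 patching and a throwaway local database, both assumptions for illustration only):

    import eventlet
    from eventlet.support.psycopg2_patcher import make_psycopg_green

    # Make psycopg2 cooperative: its wait callback yields to the eventlet hub
    # while a query is in flight, so another green thread gets to run.
    make_psycopg_green()

    import psycopg2

    # Hypothetical DSN for illustration only.
    conn = psycopg2.connect("dbname=test user=test password=test host=localhost")

    def worker(n):
        cur = conn.cursor()
        try:
            # pg_sleep keeps the first query "underway" long enough for the
            # second green thread to hit the same connection and raise
            # "execute cannot be used while an asynchronous query is underway".
            cur.execute("SELECT pg_sleep(1)")
            print("worker %d finished" % n)
        except psycopg2.ProgrammingError as exc:
            print("worker %d: %s" % (n, exc))

    threads = [eventlet.spawn(worker, n) for n in range(2)]
    for gt in threads:
        gt.wait()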
Questions:

1. What does the error "ProgrammingError: close cannot be used while an asynchronous query is underway" mean?

AFAIK this is raised by psycopg2, and it means an asynchronous query is already being executed on that connection. But I think sqlalchemy is green-thread safe here since psycopg2 is patched by eventlet (see eventlet/support/psycopg2_patcher.py), and we can see different green thread numbers in the log, e.g.:

[pid:36365] [MainThread] [tid:122397712]
[pid:36365] [MainThread] [tid:122397815]

So I guess the connection pool within one process is safe.

2. nova-conductor is a multi-threaded Python service which forks several child processes:

ps -elf | grep -i nova-conductor
30878 1 pool_s /usr/bin/nova-conductor
36364 1 ep_pol /usr/bin/nova-conductor
36365 1 futex_ /usr/bin/nova-conductor
36366 1 ep_pol /usr/bin/nova-conductor
36367 1 ep_pol /usr/bin/nova-conductor
36368 1 ep_pol /usr/bin/nova-conductor

If nova-conductor is started with only one child, the problem does not happen. Does this mean the connection/engine CANNOT be shared across these child processes?

Thanks.
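On question 2: a DBAPI connection cannot be shared across a fork; if the children inherit the parent's pooled connections they all talk over the same sockets, which can lead to interleaved protocol traffic and errors like the ones above. The usual pattern is to make sure every worker process starts with an empty pool, e.g. by disposing the engine right after the fork. A sketch (the engine, DSN, and fork handling are illustrative; nova/oslo.db wire this up differently):

    import os

    from sqlalchemy import create_engine, text

    # Hypothetical engine/DSN; oslo.db and nova build theirs from configuration.
    engine = create_engine("postgresql+psycopg2://user:password@localhost/nova")

    def run_worker(eng):
        # Placeholder for the real worker loop; it just shows the child using
        # its own, freshly opened connection.
        with eng.connect() as conn:
            conn.execute(text("SELECT 1"))

    def spawn_worker():
        pid = os.fork()
        if pid == 0:
            # Child process: drop every pooled connection inherited from the
            # parent, so this child opens its own connections instead of
            # sharing the parent's sockets with its siblings.
            engine.dispose()
            run_worker(engine)
            os._exit(0)
        return pid

    if __name__ == "__main__":
        children = [spawn_worker() for _ in range(4)]
        for pid in children:
            os.waitpid(pid, 0)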
