Hi Mike,

I added more logs and finally found the reason, and a way to avoid the 
deadlock between coroutines.

In summary, the deadlock happens because two different coroutines use the 
same ConnectionRecord with the same connection.
It happens under QueuePool when the DB connections are reset by the DB 
server (it is not related to the backend driver).

1. A coroutine checks out a ConnectionRecord into a ConnectionFairy and 
registers a weakref to the ConnectionFairy with the callback function 
`_finalize_fairy`.
2. When the connection gets disconnected, the coroutine receives an 
exception (psycopg2.OperationalError) and goes into 
Connection._handle_dbapi_exception.
3. In the `finally` block of `_handle_dbapi_exception`, both 
`ConnectionFairy.invalidate` and `self.invalidate` are called.
4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called, 
which puts the ConnectionRecord back into the QueuePool.
5. In `self.invalidate`, `del self._root.__connection` is called, which 
**del**s the ConnectionFairy.
6. The **del** operation makes the weakref to the ConnectionFairy dead, so 
the callback `_finalize_fairy` is executed.
7. In `_finalize_fairy`, `_ConnectionRecord._checkin` is called a second 
time, putting the same ConnectionRecord into the QueuePool again (see the 
minimal sketch after this list).
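
To make the sequence concrete, here is a minimal standalone sketch of the 
pattern (plain Python with a deque and a weakref, not SQLAlchemy's actual 
classes; Record, Fairy and finalize are made-up stand-ins for 
_ConnectionRecord, _ConnectionFairy and _finalize_fairy): an explicit 
check-in plus a weakref callback that also checks in queues the same record 
twice.

import weakref
from collections import deque

pool = deque()  # stands in for the QueuePool's internal queue


class Record(object):
    """Stand-in for _ConnectionRecord."""


class Fairy(object):
    """Stand-in for _ConnectionFairy."""
    def __init__(self, rec):
        self.rec = rec


def finalize(ref, rec):
    # stand-in for _finalize_fairy: the weakref callback also checks in
    pool.append(rec)


rec = Record()
fairy = Fairy(rec)
rec.fairy_ref = weakref.ref(fairy, lambda ref: finalize(ref, rec))

# step 4: ConnectionFairy.invalidate checks the record in explicitly
pool.append(rec)

# steps 5 and 6: the fairy is deleted, the weakref dies, the callback fires
del fairy

# step 7: the same record is now in the queue twice
assert sum(1 for r in pool if r is rec) == 2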

The detailed logs are below. To confirm this analysis, I tried to avoid the 
scenario with the following patch, and it worked.

def checkin(self):
    self.fairy_ref = None
    connection = self.connection
    pool = self.__pool

    while self.finalize_callback:
        finalizer = self.finalize_callback.pop()
        finalizer(connection)
+    # the record is already in the pool's queue; this is the duplicate
+    # check-in triggered by the weakref callback, so skip it
+    if self in pool._pool.queue:
+        pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
+        return

    if pool.dispatch.checkin:
        pool.dispatch.checkin(connection, self)
    pool._return_conn(self)
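
For testing, a small helper like the one below could also make the duplicate 
check-in fail loudly instead of only logging it. It is just a sketch (the 
helper name is made up) and it relies on the same private pool._pool.queue 
attribute of QueuePool that the patch above already touches:

from collections import Counter

def assert_no_duplicate_records(pool):
    # debugging aid: fail loudly if any _ConnectionRecord is queued more
    # than once in the QueuePool's underlying deque (a private attribute)
    counts = Counter(id(rec) for rec in list(pool._pool.queue))
    duplicates = [rec_id for rec_id, n in counts.items() if n > 1]
    assert not duplicates, "duplicate _ConnectionRecord(s): %r" % duplicates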

Here is the trace of the ConnectionRecord at 0x88fc5d0:

2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection 
<connection object at 0x81f69b0; closed: 0> 
2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success <connection 
object at 0x81f69b0; closed: 0>, rec.connection <connection object at 
0x81f69b0; closed: 0>, pool queue deque([<sqlalchemy.pool._ConnectionRecord 
object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 
0x79fdad0>]) 
2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy 
<sqlalchemy.pool._ConnectionFairy object at 0x329c0d0> rec 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> pool queue 
deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, 
<sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) 
2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection <connection object at 
0x81f69b0; closed: 0> checked out from pool, status Pool size: 10  
Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current Checked 
out connections: 8, queue deque([<sqlalchemy.pool._ConnectionRecord object 
at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) 
2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkout 
conn <connection object at 0x81f69b0; closed: 0>, rec 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, rec.conn 
<connection object at 0x81f69b0; closed: 0> 
2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn 
<sqlalchemy.engine.base.Connection object at 0x329c150>, err connection 
pointer is NULL
2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise
2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally
>>Connection._handle_dbapi_exception --> finally --> 
**self.engine.pool._invalidate** & **self.invalidate**
>>self.engine.pool._invalidate --> Pool._invalidate --> 
ConnectionFairy.invalidate --> _ConnectionRecord.invalidate & 
**self._checkin()**
>>self.invalidate() --> **del self._root.__connection**
2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy invalidate 
2018-05-10T14:02:49.718+08:00 localhost nova-conductor INFO [pid:4528] 
[MainThread] [tid:122769072] [log.py:109 info] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection 
<connection object at 0x81f69b0; closed: 2> (reason: 
OperationalError:connection pointer is NULL
>>_ConnectionRecord.invalidate(e)  --> 
self.__pool.dispatch.invalidate(self.connection, self, e) & self.__close() 
& self.connection = None
>>self.__close() --> self.finalize_callback.clear() & 
self.__pool._close_connection(self.connection)
2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection <connection 
object at 0x81f69b0; closed: 2> 
>>Pool._close_connection(connection) --> self._dialect.do_close(connection) 
--> _ConnDialect.do_close(dbapi_connection) --> dbapi_connection.close() FIN
2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkin 
conn None rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> 
rec.conn None 
>>_ConnectionFairy._checkin --> _finalize_fairy(conn, conn_rec, pool, None, 
echo, fairy) & self.conn = None & self.conn_rec = None
2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _finalize_fairy args conn 
None, rec.conn None, rec, <sqlalchemy.pool._ConnectionRecord object at 
0x88fc5d0>. ref None, fairy <sqlalchemy.pool._ConnectionFairy object at 
0x329c0d0> 
>>_finalize_fairy --> connection_record.checkin()
2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to 
connection_record.checkin 
2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, 
pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, 
<sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>]), 
finalize_callback deque([]) 
2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool 
queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, 
<sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) 
2018-05-10T14:02:49.723+08:00 localhost nova-conductor ERROR [pid:4528] 
[MainThread] [tid:122769072] [log.py:122 error] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Exception during reset or similar
2018-05-10T14:02:49.723+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to 
connection_record.invalidate 
2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to 
connection_record.checkin 
2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, 
pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, 
<sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]), 
finalize_callback deque([]) 
2018-05-10T14:02:49.732+08:00 localhost nova-conductor DEBUG [pid:4528] 
[MainThread] [tid:122769072] [log.py:104 debug] 
[req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool 
queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, 
<sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, 
<sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, 
<sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) 






On Tuesday, May 8, 2018 at 11:06:04 PM UTC+8, [email protected] wrote:
>
> I found why we could get the same connection twice from the pool...
>
> Because we put it into the pool twice.
>
> Any suggestions for finding out why the same connection was put into the 
> pool from different coroutines?
>
>
> 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755] 
> [MainThread] [tid:187954992] [log.py:104 debug] 
> [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection 
> object at 0x111d9740; closed: 0> 
> 2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG [pid:15755] 
> [MainThread] [tid:178645264] [log.py:104 debug] 
> [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection 
> object at 0x111d9740; closed: 0> 
> ...
> 2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG [pid:15755] 
> [MainThread] [tid:309091856] [log.py:104 debug] 
> [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection object at 
> 0x111d9740; closed: 0> checked out from pool 
> 2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG [pid:15755] 
> [MainThread] [tid:178645104] [log.py:104 debug] 
> [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection object at 
> 0x111d9740; closed: 0> checked out from pool 
>
>
> On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote:
>>
>> I added logs in sqlalchemy/pool.py
>>
>> QueuePool.checkout
>>
>> @classmethod
>> def checkout(cls, pool):
>>     rec = pool._do_get()
>>     try:
>>         dbapi_connection = rec.get_connection()
>>         pool.logger.debug("jjs get conn success %s", dbapi_connection)
>>     except:
>>         pool.logger.debug("jjs get error in get_connection")
>>         with util.safe_reraise():
>>             rec.checkin()
>>     echo = pool._should_log_debug()
>>     fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>>     rec.fairy_ref = weakref.ref(
>>         fairy,
>>         lambda ref: _finalize_fairy and
>>         _finalize_fairy(
>>             dbapi_connection,
>>             rec, pool, ref, echo)
>>     )
>>     _refs.add(rec)
>>     if echo:
>>         pool.logger.debug("jjs conn %s, pool status %s", 
>> dbapi_connection, pool.status())
>>         pool.logger.debug("Connection %r checked out from pool",
>>                           dbapi_connection)
>>     return fairy
>>
>> Then I got logs below:
>>
>> coroutine A:
>>
>> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] 
>> [MainThread] [tid:197123760] [log.py:109 info] 
>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 
>> 30
>> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:197123760] [log.py:104 debug] 
>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection 
>> <connection object at 0x903f8c0; closed: 0> 
>> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:197123760] [log.py:104 debug] 
>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success <connection 
>> object at 0x903f8c0; closed: 0> 
>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:197123760] [log.py:104 debug] 
>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object at 
>> 0x903f8c0;  closed: 0>, pool status Pool size: 10  Connections in pool: 1 
>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45 
>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:197123760] [log.py:104 debug] 
>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection object at 
>> 0x903f8c0; closed: 0> checked out from pool 
>>
>> coroutine B:
>>
>> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] 
>> [MainThread] [tid:151519312] [log.py:109 info] 
>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 
>> 30
>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:151519312] [log.py:104 debug] 
>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success <connection 
>> object at 0x903f8c0;  closed: 0> 
>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:151519312] [log.py:104 debug] 
>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object at 
>> 0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 0 
>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46 
>> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] 
>> [MainThread] [tid:151519312] [log.py:104 debug] 
>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection object at 
>> 0x903f8c0; closed: 0> checked out from pool 
>>
>>
>> It seems that when we call create_connection in QueuePool._do_get, we may 
>> not be holding the lock _pool.mutex (i.e. Queue.mutex).
>>
>>
>>
>> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>>
>>>
>>>
>>> On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote:
>>>
>>>> We added a coroutine_id in psycopg2, and found that two coroutines use 
>>>> the same connection before the pthread_mutex_lock is released.
>>>>
>>>> Maybe something in the connection pool is going wrong?
>>>> [pid] [coroutine_id] msg
>>>>
>>>
>>>
>>> This seems unlikely.   Openstack uses the SQLAlchemy connection pool 
>>> with eventlet patching in conjunction with the MySQL drivers and nothing 
>>> like that happens.   The biggest issue SQLAlchemy had with greenlets is 
>>> that a timed out greenlet would throw GreenletExit and cause a dirty 
>>> connection to go back into the pool, that was fixed in 1.2.  
>>>
>>>
>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, 
>>>> cb 0x23ad320
>>>>
>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>>>
>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>>>
>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>>> async_cursor 0x881ac00
>>>>
>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>>> async_cursor 0x881ac00
>>>>
>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
>>>> status = 2
>>>>
>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>>
>>>> Below are the logs grepped by coroutine id:
>>>>
>>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>>>
>>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>>>
>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, 
>>>> cb 0x23ad320
>>>>
>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>>>
>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>>>
>>>> [49174] [0xa5db730]conn_poll: poll writing
>>>>
>>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>>>
>>>> -----
>>>>
>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection 
>>>> at 0x8de2d30
>>>>
>>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>>>
>>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 
>>>> 0x8de2d30
>>>>
>>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, 
>>>> refcnt = 1
>>>>
>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 
>>>> 1
>>>>
>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>>> async_cursor 0x881ac00
>>>>
>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, 
>>>> async_cursor 0x881ac00
>>>>
>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, 
>>>> status = 2
>>>>
>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>>
>>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote:
>>>>>
>>>>> We reproduced this problem and added logs in psycopg2, found a 
>>>>> confusing thing, and reported it to psycopg2 [1].
>>>>>
>>>>> It seems the sqlalchemy & psycopg2 call flow is:
>>>>> 1. from sqlalchemy to psycopg2: pq_execute/pq_commit 
>>>>> 2. from psycopg2 to sqlalchemy: PyWeakref_NewRef(conn)
>>>>> 3. sqlalchemy gets an exception and does do_rollback
>>>>> 4. from sqlalchemy to psycopg2: pq_abort
>>>>> 5. psycopg2 deadlocks on pthread_mutex_lock
>>>>>
>>>>> What we did to reproduce it: stop the master pg-server and promote the 
>>>>> slave pg-server to master, moving the FIP from the old master to the 
>>>>> slave. At the same time, have nova-conductor issue quite a lot of DB 
>>>>> query requests.
>>>>>
>>>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>>>>
>>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>>>>>
>>>>>> On Mon, Apr 23, 2018 at 9:03 AM,  <[email protected]> wrote: 
>>>>>> > Sorry for reply on this topic. 
>>>>>> > 
>>>>>> > We recently get the same problem in our production environment. 
>>>>>> > 
>>>>>> > I found a patch in other lib [1], and they added conn.close() in 
>>>>>> exception 
>>>>>> > psycopg2.ProgrammingError. 
>>>>>> > 
>>>>>> > Shall we do the same in [2] ? 
>>>>>>
>>>>>> SQLAlchemy does things much more carefully than that, we parse the 
>>>>>> error message for specific ones that correspond to "connection is no 
>>>>>> longer usable", we call these "is_disconnect" but it can be any 
>>>>>> invalidating condition. 
>>>>>>
>>>>>> You can make these yourself, and they can also be made to be part of 
>>>>>> oslo.db, using the handle_error event: 
>>>>>>
>>>>>>
>>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>>>>>  
>>>>>>
>>>>>> within oslo.db you would want to propose a change here: 
>>>>>>
>>>>>>
>>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>>>>>  
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> > 
>>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split 
>>>>>> > 
>>>>>> > [2] 
>>>>>> > 
>>>>>> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>>>>>  
>>>>>> > 
>>>>>> > 
>>>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote: 
>>>>>> >> 
>>>>>> >> Hi  Michael , 
>>>>>> >> 
>>>>>> >> I am using openstack with postgresql which sqlalchemy and oslo_db 
>>>>>> module 
>>>>>> >> were used. 
>>>>>> >> And there are some problems after my pg database switched over. 
>>>>>> >> 
>>>>>> >> Here is my switch over process: 
>>>>>> >> 1. nova-conductor(python application) is running with DB 
>>>>>> connection 
>>>>>> >> strings point to vip , which is in primary site(A) of pg. 
>>>>>> >> 2. switch VIP from primary(A) to new primary(B) 
>>>>>> >> 3. switch over pg: shutdown primary(A), promopt standby(B) to new 
>>>>>> primary. 
>>>>>> >> 4. nova-conductor is running in the whole process. 
>>>>>> >> 
>>>>>> >> After some seconds, I found some nova-conductor processes are hang 
>>>>>> with 
>>>>>> >> status futex_wait_queue_me, and the status of the query in DB is 
>>>>>> "idle in 
>>>>>> >> transaction", the transaction was not commited or rollbacked! 
>>>>>> >> I think disconnection was handled in the oslo_db, which will send 
>>>>>> a 
>>>>>> >> ping(select 1) to DB. 
>>>>>> >> 
>>>>>> >> If DB was switchd over, the connection in the pool would be set 
>>>>>> with 
>>>>>> >> status invalid, and reconnect after next check out. 
>>>>>> >> 
>>>>>> >> ###error messages from nova-conductor 
>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] 
>>>>>> [tid:122397712] 
>>>>>> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError] 
>>>>>> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception 
>>>>>> wrapped from 
>>>>>> >> (psycopg2.ProgrammingError) execute cannot be used while an 
>>>>>> asynchronous 
>>>>>> >> query is underway [SQL: 'SELECT 1'] 
>>>>>> >>  36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent 
>>>>>> call 
>>>>>> >> last): 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File 
>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, 
>>>>>> in 
>>>>>> >> _execute_context 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters     context) 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File 
>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, 
>>>>>> in 
>>>>>> >> do_execute 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters     
>>>>>> cursor.execute(statement, 
>>>>>> >> parameters) 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: 
>>>>>> execute 
>>>>>> >> cannot be used while an asynchronous query is underway 
>>>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters 
>>>>>> >>  localhost nova-conductor ERROR [pid:36365] [MainThread] 
>>>>>> [tid:122397712] 
>>>>>> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] 
>>>>>> Error closing 
>>>>>> >> cursor 
>>>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent 
>>>>>> call last): 
>>>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool   File 
>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, 
>>>>>> in 
>>>>>> >> _safe_close_cursor 
>>>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool     cursor.close() 
>>>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close 
>>>>>> cannot be 
>>>>>> >> used while an asynchronous query is underway 
>>>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool 
>>>>>> >> 
>>>>>> >> ###ps status of nova-conductor 
>>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log 
>>>>>> # cat 
>>>>>> >> /proc/33316/stack 
>>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120 
>>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280 
>>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0 
>>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180 
>>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b 
>>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff 
>>>>>> >> 
>>>>>> >> ### stack of the nova-conductor process 
>>>>>> >> POD6-Mongodb03:/tmp # pstack 33316 
>>>>>> >> #0  0x00002b8449e35f4d in __lll_lock_wait () from 
>>>>>> /lib64/libpthread.so.0 
>>>>>> >> #1  0x00002b8449e31d02 in _L_lock_791 () from 
>>>>>> /lib64/libpthread.so.0 
>>>>>> >> #2  0x00002b8449e31c08 in pthread_mutex_lock () from 
>>>>>> >> /lib64/libpthread.so.0 
>>>>>> >> #3  0x00002b84554c44ab in pq_abort () from 
>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so 
>>>>>> >> #4  0x00002b84554c955e in psyco_conn_rollback () from 
>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so 
>>>>>> >> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from 
>>>>>> >> /lib64/libpython2.7.so.1.0 
>>>>>> >> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from 
>>>>>> >> /lib64/libpython2.7.so.1.0 
>>>>>> >> 
>>>>>> >> The psycopg2 was trying to close the cursor, and try to get the 
>>>>>> mutex lock 
>>>>>> >> "pthread_mutex_lock", but it seems that the cursor was used by 
>>>>>> other 
>>>>>> >> session. 
>>>>>> >> 
>>>>>> >> 
>>>>>> >> Questions: 
>>>>>> >> 
>>>>>> >> 1. What the error "ProgrammingError: close cannot be used while an 
>>>>>> >> asynchronous query is underway" mean? 
>>>>>> >> AFAIK, these caused by psycopg2, which means a asynchronous query 
>>>>>> was 
>>>>>> >> executed in one connection. 
>>>>>> >> But the I think the sqlalchemy was thread safe since it was 
>>>>>> patched by 
>>>>>> >> eventlet, see details in  eventlet/support/psycopg2_patcher.py 
>>>>>> >> And we can see different green thread number in the log, as: 
>>>>>> >> [pid:36365] [MainThread] [tid:122397712] 
>>>>>> >> [pid:36365] [MainThread] [tid:122397815] 
>>>>>> >> So, I guess the connection pool in one process is safe. 
>>>>>> >> 
>>>>>> >> 2. The nova-conductor was a multi-thread python client, which 
>>>>>> forked 
>>>>>> >> several child process. 
>>>>>> >> ps -elf|grep -i nova-conductor 
>>>>>> >> 30878  1 pool_s /usr/bin/nova-conductor 
>>>>>> >> 36364  1 ep_pol /usr/bin/nova-conductor 
>>>>>> >> 36365  1 futex_ /usr/bin/nova-conductor 
>>>>>> >> 36366  1 ep_pol /usr/bin/nova-conductor 
>>>>>> >> 36367  1 ep_pol /usr/bin/nova-conductor 
>>>>>> >> 36368  1 ep_pol /usr/bin/nova-conductor 
>>>>>> >> 
>>>>>> >> If the nova-conductor was started with only one child, the problem 
>>>>>> was not 
>>>>>> >> happen. 
>>>>>> >> Does this mean the connection/engine CAN NOT shared in these child 
>>>>>> >> processes? 
>>>>>> >> 
>>>>>> >> Thanks. 
>>>>>> > 
>>>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.
