Here's a print statement you might want to try:
diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
index 89a4cea7c..452db4883 100644
--- a/lib/sqlalchemy/pool.py
+++ b/lib/sqlalchemy/pool.py
@@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
been garbage collected.
"""
+ print(
+ "FINALIZE FAIRY: GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: %s "
+ "CONNECTION_REC: %s CONNECTION_REC REF %s" % (
+ "put greenlet id here",
+ connection,
+ fairy,
+ ref,
+ connection_record,
+ connection_record.fairy_ref
+ if connection_record is not None else "n/a"
+ )
+ )
+
_refs.discard(connection_record)
if ref is not None:
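
For the "put greenlet id here" part, assuming eventlet is what's providing the
greenlets here (as it does under nova-conductor), something along these lines
should work; this is only a sketch, adjust it to whatever greenlet library is
actually in play:

    # one way to fill in the "put greenlet id here" placeholder; assumes
    # eventlet (or the plain greenlet library) is importable in this process
    try:
        from eventlet import greenthread
        current = greenthread.getcurrent()
    except ImportError:
        import greenlet
        current = greenlet.getcurrent()

    greenlet_id = hex(id(current))

then pass greenlet_id in place of the "put greenlet id here" string.
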
On Thu, May 10, 2018 at 12:21 PM, Mike Bayer <[email protected]> wrote:
> On Thu, May 10, 2018 at 7:23 AM, <[email protected]> wrote:
>> Hi Mike,
>>
>> I added more logs and finally found the reason, and a method to avoid the
>> deadlock between coroutines.
>>
>> In summary, the deadlock happened because two different coroutines used the
>> same ConnectionRecord with the same connection.
>> It happens under QueuePool, when the DB connections are reset by the DB
>> server (not related to the backend driver).
>>
>> 1. A coroutine gets a ConnectionRecord into a ConnectionFairy, and applies a
>> weakref to the ConnectionFairy, with callback function `_finalize_fairy`.
>> 2. When the connection gets disconnected, it receives an
>> exception (psycopg2.OperationalError) and goes to
>> Connection._handle_dbapi_exception.
>> 3. In the `finally` block of `_handle_dbapi_exception`, it calls both
>> `ConnectionFairy.invalidate` and `self.invalidate`.
>> 4. In `ConnectionFairy.invalidate`, `_ConnectionRecord.checkin` is called,
>> which puts the ConnectionRecord back into the QueuePool.
>> 5. In `self.invalidate`, `del self._root.__connection` is called, which
>> **del**s the ConnectionFairy.
>> 6. The **del** makes the weakref to the ConnectionFairy go dead, and the
>> callback `_finalize_fairy` is executed.
>> 7. In `_finalize_fairy`, `_ConnectionRecord.checkin` is called again, putting
>> the ConnectionRecord into the QueuePool a second time (see the standalone
>> sketch below).
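>>
>> To make steps 5-7 concrete, here is a minimal standalone sketch of the
>> weakref mechanism only (it is not the actual pool code, which also has the
>> fairy_ref guard discussed in the reply below): an explicit checkin followed
>> by a del of the last reference fires the weakref callback, so the same
>> record ends up "checked in" twice.
>>
>>     import weakref
>>
>>     class Record(object):
>>         def __init__(self):
>>             self.checkins = 0
>>
>>         def checkin(self):
>>             self.checkins += 1
>>
>>     class Fairy(object):
>>         pass
>>
>>     rec = Record()
>>     fairy = Fairy()
>>
>>     def _finalize(ref, rec=rec):
>>         # weakref callback, playing the role of _finalize_fairy in steps 6-7
>>         rec.checkin()
>>
>>     ref = weakref.ref(fairy, _finalize)
>>
>>     rec.checkin()        # explicit checkin, as in step 4
>>     del fairy            # last strong reference dropped: the callback fires
>>     print(rec.checkins)  # prints 2 -- the record was "checked in" twice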
>
> Here's what does not make sense.  You are claiming that a simple
> invalidation will result in the same ConnectionRecord being checked in
> twice.  This is obviously not the case; what is special about your
> situation that makes this happen?
>
> If I follow the steps you refer to, in step 4, when
> ConnectionFairy._checkin is called, that calls finalize_fairy
> directly, which then calls ConnectionRecord.checkin(). Within
> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None.
> Then back in ConnectionFairy._checkin(), it also sets
> self._connection_record = None.
>
> If we then come back into finalize_fairy a *second* time, as a result
> of the "del" which occurs in _handle_error(), and the
> connection_record is present, the function will exit immediately,
> because:
>
>     if connection_record.fairy_ref is not ref:
>         return
>
> So the situation as given is not possible without concurrent access to
> the ConnectionFairy being introduced externally to the pool.
>
> There is exactly one thing I can think of that is unique to the
> psycopg2 style of asynchronous connection, and not to the PyMySQL form
> of it, that might be important here: I think psycopg2's
> connection.close() method will defer to another greenlet.   That means
> when we're inside of ConnectionRecord.invalidate(), and we call
> self.__close(), that could be deferring to another greenlet. But
> this all happens before the connection is returned to the pool, so I
> still don't know what is happening.
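>
> For context on how that deferral can happen: in an eventlet-style setup,
> psycopg2 runs in asynchronous mode behind a wait callback, and blocking
> operations such as execute() and rollback() are driven through conn.poll()
> plus a monkey-patched select(), with each wait yielding to the hub and
> therefore to other greenlets.  A rough sketch of such a callback (not
> eventlet's actual code) looks like:
>
>     import select
>     import psycopg2.extensions
>
>     def wait_callback(conn, timeout=None):
>         # drive the async connection to completion; under eventlet,
>         # select.select is green, so each wait switches to another greenlet
>         while True:
>             state = conn.poll()
>             if state == psycopg2.extensions.POLL_OK:
>                 break
>             elif state == psycopg2.extensions.POLL_READ:
>                 select.select([conn.fileno()], [], [], timeout)
>             elif state == psycopg2.extensions.POLL_WRITE:
>                 select.select([], [conn.fileno()], [], timeout)
>             else:
>                 raise psycopg2.OperationalError("bad poll state: %r" % state)
>
>     psycopg2.extensions.set_wait_callback(wait_callback)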
>
>>
>> The detailed logs are below. To prove my conclusion is right, I tried to avoid
>> the scenario with this patch, and it worked.
>>
>>     def checkin(self):
>>         self.fairy_ref = None
>>         connection = self.connection
>>         pool = self.__pool
>>
>>         while self.finalize_callback:
>>             finalizer = self.finalize_callback.pop()
>>             finalizer(connection)
>> +       if self in pool._pool.queue:
>> +           pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
>> +           return
>
> As mentioned above, it's not possible to reach this point without
> concurrent access to the pooled connection, which means this fix could
> fail also (with at least threaded-style concurrency, not greenlet
> style). The original source of the greenlet issue must still be
> identified.
>
> I would need a self-contained proof of concept that demonstrates this
> issue occurring.   The log output is not that helpful because it would
> at least need to write out the identifier of the current greenlet,
> since it is very likely that greenlets have something to do with this
> issue.
>
>
>
>>
>>         if pool.dispatch.checkin:
>>             pool.dispatch.checkin(connection, self)
>>         pool._return_conn(self)
>>
>> Tracing ConnectionRecord 0x88fc5d0:
>>
>> 2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection
>> <connection object at 0x81f69b0; closed: 0>
>> 2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success <connection
>> object at 0x81f69b0; closed: 0>, rec.connection <connection object at
>> 0x81f69b0; closed: 0>, pool queue deque([<sqlalchemy.pool._ConnectionRecord
>> object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at
>> 0x79fdad0>])
>> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy
>> <sqlalchemy.pool._ConnectionFairy object at 0x329c0d0> rec
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> pool queue
>> deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>])
>> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection <connection object at
>> 0x81f69b0; closed: 0> checked out from pool, status Pool size: 10
>> Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current Checked
>> out connections: 8, queue deque([<sqlalchemy.pool._ConnectionRecord object
>> at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>])
>> 2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkout
>> conn <connection object at 0x81f69b0; closed: 0>, rec
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, rec.conn
>> <connection object at 0x81f69b0; closed: 0>
>> 2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528]
>> [MainThread] [tid:122769072] [log.py:122 error]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn
>> <sqlalchemy.engine.base.Connection object at 0x329c150>, err connection
>> pointer is NULL
>> 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528]
>> [MainThread] [tid:122769072] [log.py:122 error]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise
>> 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528]
>> [MainThread] [tid:122769072] [log.py:122 error]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally
>>>>Connection._handle_dbapi_exception --> finally -->
>>>> **self.engine.pool._invalidate** & **self.invalidate**
>>>>self.engine.pool._invalidate --> Pool._invalidate -->
>>>> ConnectionFairy.invalidate --> _ConnectionRecord.invalidate &
>>>> **self._checkin()**
>>>>self.invalidate() --> **del self._root.__connection**
>> 2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy invalidate
>> 2018-05-10T14:02:49.718+08:00 localhost nova-conductor INFO [pid:4528]
>> [MainThread] [tid:122769072] [log.py:109 info]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection <connection
>> object at 0x81f69b0; closed: 2> (reason: OperationalError:connection pointer
>> is NULL
>>>>_ConnectionRecord.invalidate(e) -->
>>>> self.__pool.dispatch.invalidate(self.connection, self, e) & self.__close()
>>>> &
>>>> self.connection = None
>>>>self.__close() --> self.finalize_callback.clear() &
>>>> self.__pool._close_connection(self.connection)
>> 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection <connection
>> object at 0x81f69b0; closed: 2>
>>>>Pool._close_connection(connection) --> self._dialect.do_close(connection)
>>>> --> _ConnDialect.do_close(dbapi_connection) --> dbapi_connection.close()
>>>> FIN
>> 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkin
>> conn None rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>
>> rec.conn None
>>>>_ConnectionFairy._checkin --> _finalize_fairy(conn, conn_rec, pool, None,
>>>> echo, fairy) & self.conn = None & self.conn_rec = None
>> 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _finalize_fairy args conn
>> None, rec.conn None, rec, <sqlalchemy.pool._ConnectionRecord object at
>> 0x88fc5d0>. ref None, fairy <sqlalchemy.pool._ConnectionFairy object at
>> 0x329c0d0>
>>>>_finalize_fairy --> connection_record.checkin()
>> 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to
>> connection_record.checkin
>> 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None,
>> pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>,
>> <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>]), finalize_callback
>> deque([])
>> 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool
>> queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>,
>> <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>])
>> 2018-05-10T14:02:49.723+08:00 localhost nova-conductor ERROR [pid:4528]
>> [MainThread] [tid:122769072] [log.py:122 error]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Exception during reset or similar
>> 2018-05-10T14:02:49.723+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to
>> connection_record.invalidate
>> 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to
>> connection_record.checkin
>> 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None,
>> pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>,
>> <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]), finalize_callback
>> deque([])
>> 2018-05-10T14:02:49.732+08:00 localhost nova-conductor DEBUG [pid:4528]
>> [MainThread] [tid:122769072] [log.py:104 debug]
>> [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool
>> queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>,
>> <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>,
>> <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>])
>>
>>
>>
>>
>>
>>
>> On Tuesday, May 8, 2018 at 11:06:04 PM UTC+8, [email protected] wrote:
>>>
>>> I found why we could get the same connection twice from the pool...
>>>
>>> Because we put it into the pool twice.
>>>
>>> Any suggestion for finding out why the same connection gets put into the pool
>>> by different coroutines?
>>>
>>>
>>> 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755]
>>> [MainThread] [tid:187954992] [log.py:104 debug]
>>> [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection
>>> object at 0x111d9740; closed: 0>
>>> 2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG [pid:15755]
>>> [MainThread] [tid:178645264] [log.py:104 debug]
>>> [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection
>>> object at 0x111d9740; closed: 0>
>>> ...
>>> 2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG [pid:15755]
>>> [MainThread] [tid:309091856] [log.py:104 debug]
>>> [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection object at
>>> 0x111d9740; closed: 0> checked out from pool
>>> 2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG [pid:15755]
>>> [MainThread] [tid:178645104] [log.py:104 debug]
>>> [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection object at
>>> 0x111d9740; closed: 0> checked out from pool
>>>
>>>
>>> On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote:
>>>>
>>>> I added logs in sqlalchemy/pool.py
>>>>
>>>> QueuePool.checkout
>>>>
>>>>     @classmethod
>>>>     def checkout(cls, pool):
>>>>         rec = pool._do_get()
>>>>         try:
>>>>             dbapi_connection = rec.get_connection()
>>>>             pool.logger.debug("jjs get conn success %s", dbapi_connection)
>>>>         except:
>>>>             pool.logger.debug("jjs get error in get_connection")
>>>>             with util.safe_reraise():
>>>>                 rec.checkin()
>>>>         echo = pool._should_log_debug()
>>>>         fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>>>>         rec.fairy_ref = weakref.ref(
>>>>             fairy,
>>>>             lambda ref: _finalize_fairy and
>>>>             _finalize_fairy(
>>>>                 dbapi_connection,
>>>>                 rec, pool, ref, echo)
>>>>         )
>>>>         _refs.add(rec)
>>>>         if echo:
>>>>             pool.logger.debug("jjs conn %s, pool status %s",
>>>>                               dbapi_connection, pool.status())
>>>>             pool.logger.debug("Connection %r checked out from pool",
>>>>                               dbapi_connection)
>>>>         return fairy
>>>>
>>>> Then I got the logs below:
>>>>
>>>> coroutine A:
>>>>
>>>> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528]
>>>> [MainThread] [tid:197123760] [log.py:109 info]
>>>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout
>>>> 30
>>>> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:197123760] [log.py:104 debug]
>>>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection
>>>> <connection object at 0x903f8c0; closed: 0>
>>>> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:197123760] [log.py:104 debug]
>>>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success <connection
>>>> object at 0x903f8c0; closed: 0>
>>>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:197123760] [log.py:104 debug]
>>>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object at
>>>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: 1
>>>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45
>>>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:197123760] [log.py:104 debug]
>>>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection object at
>>>> 0x903f8c0; closed: 0> checked out from pool
>>>>
>>>> coroutine B:
>>>>
>>>> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528]
>>>> [MainThread] [tid:151519312] [log.py:109 info]
>>>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout
>>>> 30
>>>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:151519312] [log.py:104 debug]
>>>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success <connection
>>>> object at 0x903f8c0; closed: 0>
>>>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:151519312] [log.py:104 debug]
>>>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object at
>>>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: 0
>>>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46
>>>> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528]
>>>> [MainThread] [tid:151519312] [log.py:104 debug]
>>>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection object at
>>>> 0x903f8c0; closed: 0> checked out from pool
>>>>
>>>>
>>>> It seems that when we create_connection in QueuePool._do_get, we may not
>>>> be holding the lock _pool.mutex (i.e. Queue.mutex).
>>>>
>>>>
>>>>
>>>> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote:
>>>>>>
>>>>>> We added a coroutine_id in psycopg2, and found that two coroutines used the
>>>>>> same connection before the pthread_mutex_lock was released.
>>>>>>
>>>>>> Maybe something goes wrong in the connection pool?
>>>>>> [pid] [coroutine_id] msg
>>>>>
>>>>>
>>>>>
>>>>> This seems unlikely.   OpenStack uses the SQLAlchemy connection pool
>>>>> with eventlet patching in conjunction with the MySQL drivers and nothing
>>>>> like that happens.   The biggest issue SQLAlchemy had with greenlets is
>>>>> that a timed-out greenlet would throw GreenletExit and cause a dirty
>>>>> connection to go back into the pool; that was fixed in 1.2.
>>>>>
>>>>>>
>>>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0,
>>>>>> cb 0x23ad320
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>>>>>> async_cursor 0x881ac00
>>>>>>
>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>>>>>> async_cursor 0x881ac00
>>>>>>
>>>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>>>>>> status = 2
>>>>>>
>>>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>>>>
>>>>>>
>>>>>> Below is the logs grep by coroutine id:
>>>>>>
>>>>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0,
>>>>>> cb 0x23ad320
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: poll writing
>>>>>>
>>>>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>>>>>
>>>>>> -----
>>>>>>
>>>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection
>>>>>> at 0x8de2d30
>>>>>>
>>>>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>>>>>
>>>>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn =
>>>>>> 0x8de2d30
>>>>>>
>>>>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650,
>>>>>> refcnt = 1
>>>>>>
>>>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt =
>>>>>> 1
>>>>>>
>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>>>>>> async_cursor 0x881ac00
>>>>>>
>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>>>>>> async_cursor 0x881ac00
>>>>>>
>>>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>>>>>> status = 2
>>>>>>
>>>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>>>>>
>>>>>>
>>>>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote:
>>>>>>>
>>>>>>> We reproduced this problem, added logs in psycopg2, found a
>>>>>>> confusing thing, and reported it to psycopg2 [1].
>>>>>>>
>>>>>>> It seems the sqlalchemy & psycopg2 stack flow is:
>>>>>>> 1. from sqlalchemy to psycopg: pq_execute/pq_commit
>>>>>>> 2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn)
>>>>>>> 3. get an exception in sqlalchemy, and do_rollback
>>>>>>> 4. from sqlalchemy to psycopg: pq_abort
>>>>>>> 5. psycopg hits a pthread_mutex_lock deadlock
>>>>>>>
>>>>>>> What we did to reproduce it: stop the master pg-server and promote the
>>>>>>> slave pg-server to master, while moving the FIP from the old master to
>>>>>>> the slave. At the same time, have nova-conductor issue quite a lot of DB
>>>>>>> query requests.
>>>>>>>
>>>>>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>>>>>>
>>>>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>>>>>>>
>>>>>>>> On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote:
>>>>>>>> > Sorry for replying on this old topic.
>>>>>>>> >
>>>>>>>> > We recently hit the same problem in our production environment.
>>>>>>>> >
>>>>>>>> > I found a patch in another lib [1], where they added conn.close() on the
>>>>>>>> > psycopg2.ProgrammingError exception.
>>>>>>>> >
>>>>>>>> > Shall we do the same in [2]?
>>>>>>>>
>>>>>>>> SQLAlchemy does things much more carefully than that: we parse the
>>>>>>>> error message for specific ones that correspond to "connection is no
>>>>>>>> longer usable".  We call these "is_disconnect", but it can be any
>>>>>>>> invalidating condition.
>>>>>>>>
>>>>>>>> You can make these yourself, and they can also be made to be part of
>>>>>>>> oslo.db, using the handle_error event:
>>>>>>>>
>>>>>>>>
>>>>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>>>>>>>
>>>>>>>> within oslo.db you would want to propose a change here:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
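>>>>>>>>
>>>>>>>> As a rough sketch of that approach (the connection URL and the error
>>>>>>>> string here are only illustrative, not a tested recipe), a handle_error
>>>>>>>> listener that flags an additional message as a disconnect would look
>>>>>>>> something like:
>>>>>>>>
>>>>>>>>     from sqlalchemy import create_engine, event
>>>>>>>>
>>>>>>>>     engine = create_engine(
>>>>>>>>         "postgresql+psycopg2://user:pass@db-vip/nova")  # illustrative URL
>>>>>>>>
>>>>>>>>     @event.listens_for(engine, "handle_error")
>>>>>>>>     def _mark_disconnect(context):
>>>>>>>>         # context is an ExceptionContext; setting is_disconnect makes
>>>>>>>>         # the pool treat this error as an invalidating condition
>>>>>>>>         msg = str(context.original_exception)
>>>>>>>>         if "connection pointer is NULL" in msg:
>>>>>>>>             context.is_disconnect = True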
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> >
>>>>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>>>>>>>> >
>>>>>>>> > [2]
>>>>>>>> >
>>>>>>>> > https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >> On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>>>>>>> >>
>>>>>>>> >> Hi Michael,
>>>>>>>> >>
>>>>>>>> >> I am using OpenStack with PostgreSQL, where the sqlalchemy and oslo_db
>>>>>>>> >> modules are used.
>>>>>>>> >> And there are some problems after my pg database was switched over.
>>>>>>>> >>
>>>>>>>> >> Here is my switchover process:
>>>>>>>> >> 1. nova-conductor (a Python application) is running with DB connection
>>>>>>>> >> strings pointing to the VIP, which is at the primary site (A) of pg.
>>>>>>>> >> 2. switch the VIP from the primary (A) to the new primary (B)
>>>>>>>> >> 3. switch over pg: shut down the primary (A), promote the standby (B)
>>>>>>>> >> to the new primary.
>>>>>>>> >> 4. nova-conductor keeps running during the whole process.
>>>>>>>> >>
>>>>>>>> >> After some seconds, I found some nova-conductor processes hung with
>>>>>>>> >> status futex_wait_queue_me, and the status of the query in the DB was
>>>>>>>> >> "idle in transaction"; the transaction was not committed or rolled back!
>>>>>>>> >> I think the disconnection is handled in oslo_db, which sends a
>>>>>>>> >> ping (SELECT 1) to the DB.
>>>>>>>> >>
>>>>>>>> >> If the DB is switched over, the connection in the pool should be marked
>>>>>>>> >> invalid and reconnected at the next checkout.
>>>>>>>> >>
>>>>>>>> >> ###error messages from nova-conductor
>>>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread]
>>>>>>>> >> [tid:122397712]
>>>>>>>> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>>>>>>>> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception
>>>>>>>> >> wrapped from
>>>>>>>> >> (psycopg2.ProgrammingError) execute cannot be used while an
>>>>>>>> >> asynchronous
>>>>>>>> >> query is underway [SQL: 'SELECT 1']
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent
>>>>>>>> >> call
>>>>>>>> >> last):
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File
>>>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139,
>>>>>>>> >> in
>>>>>>>> >> _execute_context
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters context)
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File
>>>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450,
>>>>>>>> >> in
>>>>>>>> >> do_execute
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters
>>>>>>>> >> cursor.execute(statement,
>>>>>>>> >> parameters)
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError:
>>>>>>>> >> execute
>>>>>>>> >> cannot be used while an asynchronous query is underway
>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters
>>>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread]
>>>>>>>> >> [tid:122397712]
>>>>>>>> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85]
>>>>>>>> >> Error closing
>>>>>>>> >> cursor
>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent
>>>>>>>> >> call last):
>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool File
>>>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226,
>>>>>>>> >> in
>>>>>>>> >> _safe_close_cursor
>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool cursor.close()
>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close
>>>>>>>> >> cannot be
>>>>>>>> >> used while an asynchronous query is underway
>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool
>>>>>>>> >>
>>>>>>>> >> ###ps status of nova-conductor
>>>>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log
>>>>>>>> >> # cat
>>>>>>>> >> /proc/33316/stack
>>>>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>>>>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280
>>>>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>>>>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>>>>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>>>>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>> >>
>>>>>>>> >> ### stack of the nova-conductor process
>>>>>>>> >> POD6-Mongodb03:/tmp # pstack 33316
>>>>>>>> >> #0 0x00002b8449e35f4d in __lll_lock_wait () from
>>>>>>>> >> /lib64/libpthread.so.0
>>>>>>>> >> #1 0x00002b8449e31d02 in _L_lock_791 () from
>>>>>>>> >> /lib64/libpthread.so.0
>>>>>>>> >> #2 0x00002b8449e31c08 in pthread_mutex_lock () from
>>>>>>>> >> /lib64/libpthread.so.0
>>>>>>>> >> #3 0x00002b84554c44ab in pq_abort () from
>>>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so
>>>>>>>> >> #4 0x00002b84554c955e in psyco_conn_rollback () from
>>>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so
>>>>>>>> >> #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from
>>>>>>>> >> /lib64/libpython2.7.so.1.0
>>>>>>>> >> #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from
>>>>>>>> >> /lib64/libpython2.7.so.1.0
>>>>>>>> >>
>>>>>>>> >> psycopg2 was trying to close the cursor and to acquire the mutex via
>>>>>>>> >> pthread_mutex_lock, but it seems that the cursor was being used by
>>>>>>>> >> another session.
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> Questions:
>>>>>>>> >>
>>>>>>>> >> 1. What does the error "ProgrammingError: close cannot be used while an
>>>>>>>> >> asynchronous query is underway" mean?
>>>>>>>> >> AFAIK, this comes from psycopg2 and means another asynchronous query was
>>>>>>>> >> already being executed on the same connection.
>>>>>>>> >> But I think the sqlalchemy usage is thread safe, since it was patched by
>>>>>>>> >> eventlet; see details in eventlet/support/psycopg2_patcher.py (and the
>>>>>>>> >> snippet below).
>>>>>>>> >> And we can see different green thread numbers in the log, e.g.:
>>>>>>>> >> [pid:36365] [MainThread] [tid:122769072]
>>>>>>>> >> [pid:36365] [MainThread] [tid:122397815]
>>>>>>>> >> So, I guess the connection pool within one process is safe.
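>>>>>>>> >>
>>>>>>>> >> For reference, the eventlet patch mentioned above is enabled with
>>>>>>>> >> something like the following (assuming make_psycopg_green is still the
>>>>>>>> >> entry point in eventlet/support/psycopg2_patcher.py):
>>>>>>>> >>
>>>>>>>> >>     # installs a psycopg2 wait callback so blocking DB calls yield
>>>>>>>> >>     # to the eventlet hub instead of blocking the whole thread
>>>>>>>> >>     from eventlet.support.psycopg2_patcher import make_psycopg_green
>>>>>>>> >>     make_psycopg_green()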
>>>>>>>> >>
>>>>>>>> >> 2. nova-conductor is a multi-threaded Python client, which forked
>>>>>>>> >> several child processes.
>>>>>>>> >> ps -elf|grep -i nova-conductor
>>>>>>>> >> 30878 1 pool_s /usr/bin/nova-conductor
>>>>>>>> >> 36364 1 ep_pol /usr/bin/nova-conductor
>>>>>>>> >> 36365 1 futex_ /usr/bin/nova-conductor
>>>>>>>> >> 36366 1 ep_pol /usr/bin/nova-conductor
>>>>>>>> >> 36367 1 ep_pol /usr/bin/nova-conductor
>>>>>>>> >> 36368 1 ep_pol /usr/bin/nova-conductor
>>>>>>>> >>
>>>>>>>> >> If nova-conductor is started with only one child, the problem does not
>>>>>>>> >> happen.
>>>>>>>> >> Does this mean the connection/engine CANNOT be shared among these child
>>>>>>>> >> processes?
>>>>>>>> >>
>>>>>>>> >> Thanks.
>>>>>>>> >
>>>>>>
>>
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper
http://www.sqlalchemy.org/
To post example code, please provide an MCVE: Minimal, Complete, and Verifiable
Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.