I notice in this log file there are multiple "tid" values, which I assume are thread identifiers? E.g., OS-level threads? Are you using eventlet or not? Are you monkeypatching SQLAlchemy and then using multiple OS-level threads at the same time? Because that likely doesn't work.
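For reference, the eventlet-style setup being asked about looks roughly like this (a sketch; the psycopg=True flag assumes eventlet's psycopg2 wait-callback patcher is available, as in eventlet/support/psycopg2_patcher.py mentioned later in this thread):

import eventlet
# patch before sqlalchemy / psycopg2 are imported; after this, "threads"
# in the process are greenlets, and mixing real OS-level threads on the
# same engine is the combination that likely doesn't work
eventlet.monkey_patch(psycopg=True)

from sqlalchemy import create_engine

engine = create_engine("postgresql://scott:tiger@localhost/test")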
In particular, the log here illustrates an impossible codepath. We enter _finalize_fairy with connection as None:

2018-05-14T11:11:39.961+08:00 localhost nova-conductor DEBUG [pid:5679] [MainThread] [tid:130446352] [log.py:104 debug] [req-74ef6531-2d92-40ea-a524-f54e11edab06] FINALIZE FAIRY: DB CONN: None, FAIRY: <sqlalchemy.pool._ConnectionFairy object at 0x87dd3d0>, REF: None CONNECTION_REC: <sqlalchemy.pool._ConnectionRecord object at 0x776fe90> CONNECTION_REC REF <weakref at 0x7c5cdb8; to '_ConnectionFairy' at 0x87dd3d0>

This seems to make it through to the checkin at the bottom. Then, we see this:

2018-05-14T11:11:39.968+08:00 localhost nova-conductor ERROR [pid:5679] [MainThread] [tid:130446032] [log.py:122 error] [req-841bb432-7cd2-4fa3-9def-559a34b05a21] Exception during reset or similar

The only place that message is generated is *in* _finalize_fairy, *before* the checkin occurs, and *only* if _finalize_fairy was called *with* a connection that is non-None. There is no other call to _finalize_fairy with that request identifier, so I see no evidence of where this "exception during reset" message can be generated from - there would at least need to be an additional "FINALIZE FAIRY" log line with a non-None connection for that path to be reached. I see no way that the pool.py shipped with SQLAlchemy can possibly generate this series of steps.

This is an extremely intricate issue and I still have only the vaguest notion of how you are running SQLAlchemy / Nova.

On Mon, May 14, 2018 at 11:10 AM, Mike Bayer <[email protected]> wrote:
> OK, I need the exact pool.py you are using as well, one that includes all these logging statements and any other adjustments that were made.
>
> On Mon, May 14, 2018 at 8:02 AM, <[email protected]> wrote:
>> I've uploaded the logs here:
>> https://raw.githubusercontent.com/jiajunsu/test_logs/master/nc_5679.log
>>
>> The ConnectionRecord is 0x70cdb10; the tid (coroutine id) is 130446032.
>>
>> At 2018-05-14T11:11:39.970 we got the log "jjs rec already in pool checkin", but I had deleted the `return` from the code, so the problem happened again.
>>
>> On Monday, May 14, 2018 at 10:48:49 AM UTC+8, Mike Bayer wrote:
>>>
>>> as well as the log that shows this:
>>>
>>> +        if self in pool._pool.queue:
>>> +            pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
>>> +            return
>>>
>>> actually happening.
>>>
>>> On Sun, May 13, 2018 at 10:42 PM, Mike Bayer <[email protected]> wrote:
>>> > On Sun, May 13, 2018 at 10:35 PM, <[email protected]> wrote:
>>> >> "jjs" is short for my name and "cjr" is my colleague's (JinRong Cai). We work together and add logs in the same environment, prefixing our names so we have a keyword to filter the logs by.
>>> >>
>>> >> The doubled ConnectionRecords (with None connection) are obtained by two different coroutines, and record.connection is rewritten by the second coroutine. After the record is put back into the QueuePool again, there are two identical records with the same connection in the pool, which causes the next two coroutines to conflict and deadlock in psycopg2 as above.
>>> >
>>> > OK, I need the detailed step-by-step that shows how the two coroutines are interacting such that they both end up returning the same connection. The nova log above seems to show just one coroutine, I assume, since there is just one nova request ID, if I understand correctly.
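To spell out why, here is the shape of the logic being described (a paraphrase for illustration, not the shipped pool.py source):

def _finalize_fairy(connection, connection_record, pool, ref, echo, fairy=None):
    # ref is non-None only when we arrive via the weakref callback
    if ref is not None and connection_record.fairy_ref is not ref:
        # stale callback for a fairy that was already checked in
        return
    if connection is not None:
        try:
            fairy._reset(pool)  # rollback / reset-on-return happens here
        except Exception:
            # the only place this message is emitted, and it precedes checkin
            pool.logger.error("Exception during reset or similar", exc_info=True)
    if connection_record is not None:
        connection_record.checkin()

So a log line reading "Exception during reset or similar" implies _finalize_fairy was entered with a non-None connection, and it fires before any checkin.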
>>> >
>>> >> The logs of psycopg2 have been pasted at
>>> >> https://github.com/psycopg/psycopg2/issues/703
>>> >
>>> > The above issue says the bug has been found in SQLAlchemy, but there is no bug report yet and I've not yet been shown how the behavior here is possible. I posted a straightforward example above; can you please show me where the two coroutines get mixed up?
>>> >>
>>> >> On Friday, May 11, 2018 at 3:05:46 AM UTC+8, Mike Bayer wrote:
>>> >>>
>>> >>> I tried to follow your log lines, which is difficult because I don't know where those log statements actually are and I don't know what acronyms like "jjs", "cjr" mean. But it does look like it's all in one thread (so not async GC) and all in one Nova request (so perhaps just one greenlet involved, unless other greenlets are affecting this without being displayed in this log snippet). I do see the connection record doubled at the end. So here is an illustration of those steps in isolation, where the connection record is not doubled. Can you alter this program to show this condition occurring?
>>> >>>
>>> >>> from sqlalchemy import create_engine
>>> >>>
>>> >>> e = create_engine("postgresql://scott:tiger@localhost/test", echo_pool='debug')
>>> >>> e.connect().close()
>>> >>>
>>> >>> assert len(e.pool._pool.queue) == 1
>>> >>>
>>> >>> c1 = e.connect()
>>> >>>
>>> >>> assert len(e.pool._pool.queue) == 0
>>> >>>
>>> >>> c1.engine.pool._invalidate(c1._Connection__connection, None)
>>> >>> c1.invalidate(e)
>>> >>>
>>> >>> # connection record is back
>>> >>> assert len(e.pool._pool.queue) == 1
>>> >>>
>>> >>> # do a reconnect
>>> >>> c1.connection
>>> >>>
>>> >>> # uses the record again
>>> >>> assert len(e.pool._pool.queue) == 0
>>> >>>
>>> >>> # close the Connection
>>> >>> c1.close()
>>> >>>
>>> >>> # record is back
>>> >>> assert len(e.pool._pool.queue) == 1
>>> >>>
>>> >>> # etc.
>>> >>> c1 = e.connect()
>>> >>> assert len(e.pool._pool.queue) == 0
>>> >>>
>>> >>> c1.close()
>>> >>> assert len(e.pool._pool.queue) == 1
>>> >>>
>>> >>> On Thu, May 10, 2018 at 12:25 PM, Mike Bayer <[email protected]> wrote:
>>> >>> > Here's a print statement you might want to try:
>>> >>> >
>>> >>> > diff --git a/lib/sqlalchemy/pool.py b/lib/sqlalchemy/pool.py
>>> >>> > index 89a4cea7c..452db4883 100644
>>> >>> > --- a/lib/sqlalchemy/pool.py
>>> >>> > +++ b/lib/sqlalchemy/pool.py
>>> >>> > @@ -689,6 +689,19 @@ def _finalize_fairy(connection, connection_record,
>>> >>> >      been garbage collected.
>>> >>> >
>>> >>> >      """
>>> >>> > +    print(
>>> >>> > +        "FINALIZE FAIRY: GREENLET ID: %s, DB CONN: %s, FAIRY: %s, REF: %s "
>>> >>> > +        "CONNECTION_REC: %s CONNECTION_REC REF %s" % (
>>> >>> > +            "put greenlet id here",
>>> >>> > +            connection,
>>> >>> > +            fairy,
>>> >>> > +            ref,
>>> >>> > +            connection_record,
>>> >>> > +            connection_record.fairy_ref
>>> >>> > +            if connection_record is not None else "n/a"
>>> >>> > +        )
>>> >>> > +    )
>>> >>> > +
>>> >>> >      _refs.discard(connection_record)
>>> >>> >
>>> >>> >      if ref is not None:
>>> >>> >
>>> >>> > On Thu, May 10, 2018 at 12:21 PM, Mike Bayer <[email protected]> wrote:
>>> >>> >> On Thu, May 10, 2018 at 7:23 AM, <[email protected]> wrote:
>>> >>> >>> Hi Mike,
>>> >>> >>>
>>> >>> >>> I added more logs and finally found the reason, and a method to avoid the deadlock between coroutines.
>>> >>> >>>
>>> >>> >>> In summary, the deadlock happens because two different coroutines use the same ConnectionRecord with the same connection. It happens under QueuePool, when the DB connections are reset by the DB server (not related to the backend driver).
>>> >>> >>>
>>> >>> >>> 1. A coroutine gets a ConnectionRecord into a ConnectionFairy, and takes a weakref to the ConnectionFairy with the callback function `_finalize_fairy`.
>>> >>> >>> 2. When the connection gets disconnected, it receives an exception (psycopg2.OperationalError) and goes to Connection._handle_dbapi_exception.
>>> >>> >>> 3. In `_handle_dbapi_exception`'s finally block, it calls both `ConnectionFairy.invalidate` and `self.invalidate`.
>>> >>> >>> 4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is called, which puts the ConnectionRecord back into the QueuePool.
>>> >>> >>> 5. In `self.invalidate`, `del self._root.__connection` is called, which **del**s the ConnectionFairy.
>>> >>> >>> 6. The **del** operation makes the weakref to the ConnectionFairy dead, and the callback `_finalize_fairy` is executed.
>>> >>> >>> 7. In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, putting the ConnectionRecord into the QueuePool again.
>>> >>> >>
>>> >>> >> Here's what does not make sense. You are claiming that a simple invalidation will result in the same ConnectionRecord being checked in twice. This is obviously not the case; what is special in your situation that makes this happen?
>>> >>> >>
>>> >>> >> If I follow the steps you refer to, in step 4, when ConnectionFairy._checkin is called, that calls _finalize_fairy directly, which then calls ConnectionRecord.checkin(). Within ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None. Then back in ConnectionFairy._checkin(), it also sets self._connection_record = None.
>>> >>> >>
>>> >>> >> If we come back into _finalize_fairy a *second* time, as a result of the "del" which occurs in _handle_error(), and the connection_record is present, the function will exit immediately, because:
>>> >>> >>
>>> >>> >>     if connection_record.fairy_ref is not ref:
>>> >>> >>         return
>>> >>> >>
>>> >>> >> So the situation as given is not possible without concurrent access to the ConnectionFairy being introduced externally to the pool.
>>> >>> >>
>>> >>> >> There is exactly one thing I can think of that is unique to the psycopg2 style of asynchronous connection, not the PyMySQL form of it, that might be important here, and it's that I think psycopg2's connection.close() method will defer to another greenlet. That means when we're inside of ConnectionRecord.invalidate() and we call self.__close(), that could be deferring to another greenlet. But this all happens before the connection is returned to the pool, so I still don't know what is happening.
>>> >>> >>
>>> >>> >>> The detailed logs are below. To verify that my conclusion is right, I tried to avoid the scenario with this patch, and it worked.
>>> >>> >>>
>>> >>> >>>     def checkin(self):
>>> >>> >>>         self.fairy_ref = None
>>> >>> >>>         connection = self.connection
>>> >>> >>>         pool = self.__pool
>>> >>> >>>
>>> >>> >>>         while self.finalize_callback:
>>> >>> >>>             finalizer = self.finalize_callback.pop()
>>> >>> >>>             finalizer(connection)
>>> >>> >>> +       if self in pool._pool.queue:
>>> >>> >>> +           pool.logger.debug("jjs rec already in pool checkin, skip %r", self)
>>> >>> >>> +           return
>>> >>> >>
>>> >>> >> as mentioned above, it's not possible to reach this point without concurrent access to the pooled connection, which means this fix could fail also (with at least threaded-style concurrency, not greenlet style). The original source of the greenlet issue must still be identified.
>>> >>> >>
>>> >>> >> I would need a self-contained proof of concept that demonstrates this issue occurring. The log output is not that helpful, because it would at least need to write out the identifier of the current greenlet, since it is very likely that greenlets have something to do with this issue.
>>> >>> >>
>>> >>> >>>         if pool.dispatch.checkin:
>>> >>> >>>             pool.dispatch.checkin(connection, self)
>>> >>> >>>         pool._return_conn(self)
>>> >>> >>>
>>> >>> >>> Trace of the ConnectionRecord 0x88fc5d0:
>>> >>> >>>
>>> >>> >>> 2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection <connection object at 0x81f69b0; closed: 0>
>>> >>> >>> 2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success <connection object at 0x81f69b0; closed: 0>, rec.connection <connection object at 0x81f69b0; closed: 0>, pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>])
>>> >>> >>> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy <sqlalchemy.pool._ConnectionFairy object at 0x329c0d0> rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>])
>>> >>> >>> 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection <connection object at 0x81f69b0; closed: 0> checked out from pool, status Pool size: 10  Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current Checked out connections: 8, queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>])
>>> >>> >>> 2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkout conn <connection object at 0x81f69b0; closed: 0>, rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, rec.conn <connection object at 0x81f69b0; closed: 0>
>>> >>> >>> 2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528] [MainThread] [tid:122769072] [log.py:122 error] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn <sqlalchemy.engine.base.Connection object at 0x329c150>, err connection pointer is NULL
>>> >>> >>> 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] [MainThread] [tid:122769072] [log.py:122 error] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise
>>> >>> >>> 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] [MainThread] [tid:122769072] [log.py:122 error] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally
>>> >>> >>> >>>>> Connection._handle_dbapi_exception --> finally --> **self.engine.pool._invalidate** & **self.invalidate**
>>> >>> >>> >>>>> self.engine.pool._invalidate --> Pool._invalidate --> ConnectionFairy.invalidate --> _ConnectionRecord.invalidate & **self._checkin()**
>>> >>> >>> >>>>> self.invalidate() --> **del self._root.__connection**
>>> >>> >>> 2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy invalidate
>>> >>> >>> 2018-05-10T14:02:49.718+08:00 localhost nova-conductor INFO [pid:4528] [MainThread] [tid:122769072] [log.py:109 info] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection <connection object at 0x81f69b0; closed: 2> (reason: OperationalError:connection pointer is NULL)
>>> >>> >>> >>>>> _ConnectionRecord.invalidate(e) --> self.__pool.dispatch.invalidate(self.connection, self, e) & self.__close() & self.connection = None
>>> >>> >>> >>>>> self.__close() --> self.finalize_callback.clear() & self.__pool._close_connection(self.connection)
>>> >>> >>> 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection <connection object at 0x81f69b0; closed: 2>
>>> >>> >>> >>>>> Pool._close_connection(connection) --> self._dialect.do_close(connection) --> _ConnDialect.do_close(dbapi_connection) --> dbapi_connection.close() FIN
>>> >>> >>> 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkin conn None rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> rec.conn None
>>> >>> >>> >>>>> _ConnectionFairy._checkin --> _finalize_fairy(conn, conn_rec, pool, None, echo, fairy) & self.conn = None & self.conn_rec = None
>>> >>> >>> 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _finalize_fairy args conn None, rec.conn None, rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, ref None, fairy <sqlalchemy.pool._ConnectionFairy object at 0x329c0d0>
>>> >>> >>> >>>>> _finalize_fairy --> connection_record.checkin()
>>> >>> >>> 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to connection_record.checkin
>>> >>> >>> 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>]), finalize_callback deque([])
>>> >>> >>> 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>])
>>> >>> >>> 2018-05-10T14:02:49.723+08:00 localhost nova-conductor ERROR [pid:4528] [MainThread] [tid:122769072] [log.py:122 error] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Exception during reset or similar
>>> >>> >>> 2018-05-10T14:02:49.723+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to connection_record.invalidate
>>> >>> >>> 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to connection_record.checkin
>>> >>> >>> 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]), finalize_callback deque([])
>>> >>> >>> 2018-05-10T14:02:49.732+08:00 localhost nova-conductor DEBUG [pid:4528] [MainThread] [tid:122769072] [log.py:104 debug] [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>])
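Which is exactly why the doubled entry is fatal: once the same record object sits in the queue twice, two subsequent checkouts must collide, as a trivial demonstration shows:

try:
    import queue          # Python 3
except ImportError:
    import Queue as queue  # Python 2, as in the environment in this thread

q = queue.Queue()
record = object()  # stand-in for a _ConnectionRecord
q.put(record)
q.put(record)      # the double checkin seen in the deque above

a = q.get()
b = q.get()
assert a is b      # two consumers now hold the very same record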
>>> >>> >>>
>>> >>> >>> On Tuesday, May 8, 2018 at 11:06:04 PM UTC+8, [email protected] wrote:
>>> >>> >>>>
>>> >>> >>>> I found out why we could get the same connection twice from the pool...
>>> >>> >>>>
>>> >>> >>>> Because we put it into the pool twice.
>>> >>> >>>>
>>> >>> >>>> Any suggestions for finding out why the same connection was put into the pool by different coroutines?
>>> >>> >>>>
>>> >>> >>>> 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:187954992] [log.py:104 debug] [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection object at 0x111d9740; closed: 0>
>>> >>> >>>> 2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:178645264] [log.py:104 debug] [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection object at 0x111d9740; closed: 0>
>>> >>> >>>> ...
>>> >>> >>>> 2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:309091856] [log.py:104 debug] [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection object at 0x111d9740; closed: 0> checked out from pool
>>> >>> >>>> 2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG [pid:15755] [MainThread] [tid:178645104] [log.py:104 debug] [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection object at 0x111d9740; closed: 0> checked out from pool
>>> >>> >>>>
>>> >>> >>>> On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote:
>>> >>> >>>>>
>>> >>> >>>>> I added logs in sqlalchemy/pool.py, in QueuePool.checkout:
>>> >>> >>>>>
>>> >>> >>>>>     @classmethod
>>> >>> >>>>>     def checkout(cls, pool):
>>> >>> >>>>>         rec = pool._do_get()
>>> >>> >>>>>         try:
>>> >>> >>>>>             dbapi_connection = rec.get_connection()
>>> >>> >>>>>             pool.logger.debug("jjs get conn success %s", dbapi_connection)
>>> >>> >>>>>         except:
>>> >>> >>>>>             pool.logger.debug("jjs get error in get_connection")
>>> >>> >>>>>             with util.safe_reraise():
>>> >>> >>>>>                 rec.checkin()
>>> >>> >>>>>         echo = pool._should_log_debug()
>>> >>> >>>>>         fairy = _ConnectionFairy(dbapi_connection, rec, echo)
>>> >>> >>>>>         rec.fairy_ref = weakref.ref(
>>> >>> >>>>>             fairy,
>>> >>> >>>>>             lambda ref: _finalize_fairy and
>>> >>> >>>>>             _finalize_fairy(
>>> >>> >>>>>                 dbapi_connection,
>>> >>> >>>>>                 rec, pool, ref, echo)
>>> >>> >>>>>         )
>>> >>> >>>>>         _refs.add(rec)
>>> >>> >>>>>         if echo:
>>> >>> >>>>>             pool.logger.debug("jjs conn %s, pool status %s", dbapi_connection, pool.status())
>>> >>> >>>>>             pool.logger.debug("Connection %r checked out from pool", dbapi_connection)
>>> >>> >>>>>         return fairy
>>> >>> >>>>>
>>> >>> >>>>> Then I got the logs below:
>>> >>> >>>>>
>>> >>> >>>>> coroutine A:
>>> >>> >>>>>
>>> >>> >>>>> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:197123760] [log.py:109 info] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, timeout 30
>>> >>> >>>>> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection <connection object at 0x903f8c0; closed: 0>
>>> >>> >>>>> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success <connection object at 0x903f8c0; closed: 0>
>>> >>> >>>>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 1 Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 45
>>> >>> >>>>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:197123760] [log.py:104 debug] [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool
>>> >>> >>>>>
>>> >>> >>>>> coroutine B:
>>> >>> >>>>>
>>> >>> >>>>> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO [pid:55528] [MainThread] [tid:151519312] [log.py:109 info] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, timeout 30
>>> >>> >>>>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success <connection object at 0x903f8c0; closed: 0>
>>> >>> >>>>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object at 0x903f8c0; closed: 0>, pool status Pool size: 10  Connections in pool: 0 Current Overflow: 36 Max Overflow: 40 Current Checked out connections: 46
>>> >>> >>>>> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG [pid:55528] [MainThread] [tid:151519312] [log.py:104 debug] [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection object at 0x903f8c0; closed: 0> checked out from pool
>>> >>> >>>>>
>>> >>> >>>>> It seems that when we create_connection in QueuePool._do_get, we may not be holding the lock _pool.mutex (~Queue.mutex).
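For context, a queue-backed checkout of that general shape looks like this (a minimal paraphrase, not SQLAlchemy's actual _do_get; the class and names are illustrative only):

try:
    import queue          # Python 3
except ImportError:
    import Queue as queue  # Python 2

class MiniPool(object):
    """Minimal paraphrase of a queue-backed pool checkout."""

    def __init__(self, creator, size=5, timeout=30):
        self._q = queue.Queue(size)  # the Queue's own mutex guards get()/put()
        self._creator = creator
        self._timeout = timeout

    def do_get(self):
        try:
            # happy path: pop an idle record under the Queue's mutex
            return self._q.get(block=True, timeout=self._timeout)
        except queue.Empty:
            # overflow path: the new connection is created outside the queue
            # mutex, but it stays private to this caller until it is checked
            # in, so the mutex isn't needed for correctness here (real pools
            # do guard their overflow *counter* separately)
            return self._creator()

On that reading, skipping the mutex during create_connection cannot by itself hand the same record to two greenlets; the same record would have to be put into the queue twice first, which matches the doubled entries in the log above.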
>>> >>> >>>>>
>>> >>> >>>>> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote:
>>> >>> >>>>>>
>>> >>> >>>>>> On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote:
>>> >>> >>>>>>>
>>> >>> >>>>>>> We added a coroutine_id in psycopg2, and found that two coroutines use the same connection before the pthread_mutex_lock is released.
>>> >>> >>>>>>>
>>> >>> >>>>>>> Maybe something in the connection pool goes wrong?
>>> >>> >>>>>>> [pid] [coroutine_id] msg
>>> >>> >>>>>>
>>> >>> >>>>>> This seems unlikely. OpenStack uses the SQLAlchemy connection pool with eventlet patching in conjunction with the MySQL drivers, and nothing like that happens. The biggest issue SQLAlchemy had with greenlets is that a timed-out greenlet would throw GreenletExit and cause a dirty connection to go back into the pool; that was fixed in 1.2.
>>> >>> >>>>>>>
>>> >>> >>>>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> >>> >>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> >>> >>>>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
>>> >>> >>>>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>> >>> >>>>>>>
>>> >>> >>>>>>> Below are the logs grepped by coroutine id:
>>> >>> >>>>>>>
>>> >>> >>>>>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: poll writing
>>> >>> >>>>>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>> >>> >>>>>>>
>>> >>> >>>>>>> -----
>>> >>> >>>>>>>
>>> >>> >>>>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 0x8de2d30
>>> >>> >>>>>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>> >>> >>>>>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30
>>> >>> >>>>>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1
>>> >>> >>>>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
>>> >>> >>>>>>>
>>> >>> >>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> >>> >>>>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00
>>> >>> >>>>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2
>>> >>> >>>>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>> >>> >>>>>>>
>>> >>> >>>>>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote:
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> We reproduced this problem and added logs in psycopg2, found a confusing thing, and reported it to psycopg2.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> It seems the sqlalchemy & psycopg2 stack flow is:
>>> >>> >>>>>>>> 1. from sqlalchemy to psycopg: pq_execute/pq_commit
>>> >>> >>>>>>>> 2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn)
>>> >>> >>>>>>>> 3. get an exception in sqlalchemy, and do_rollback
>>> >>> >>>>>>>> 4. from sqlalchemy to psycopg: pq_abort
>>> >>> >>>>>>>> 5. psycopg gets a pthread_mutex_lock deadlock
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> What we did to reproduce it: stop the master pg-server and promote the slave pg-server to master, moving the FIP from the old master to the slave. At the same time, let nova-conductor issue quite a lot of DB query requests.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote:
>>> >>> >>>>>>>>> > Sorry for replying on this old topic.
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > We recently hit the same problem in our production environment.
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > I found a patch in another lib [1]; they added conn.close() in the psycopg2.ProgrammingError exception handler.
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > Shall we do the same in [2]?
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> SQLAlchemy does things much more carefully than that: we parse the error message for specific ones that correspond to "connection is no longer usable". We call these "is_disconnect", but it can be any invalidating condition.
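As a sketch of flagging such a condition yourself (via the handle_error event linked just below; the error string is the one from the logs in this thread):

from sqlalchemy import create_engine, event

engine = create_engine("postgresql://scott:tiger@localhost/test")

@event.listens_for(engine, "handle_error")
def flag_disconnect(context):
    # treat psycopg2's "connection pointer is NULL" as an invalidating
    # condition, so the pool discards the connection instead of reusing it
    if "connection pointer is NULL" in str(context.original_exception):
        context.is_disconnect = True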
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> You can make these yourself, and they can also be made part of oslo.db, using the handle_error event:
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Within oslo.db you would want to propose a change here:
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > [2] https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>> >>> >>>>>>>>> >
>>> >>> >>>>>>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> Hi Michael,
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> I am using OpenStack with PostgreSQL, where the sqlalchemy and oslo_db modules are used. And there are some problems after my pg database switched over.
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> Here is my switchover process:
>>> >>> >>>>>>>>> >> 1. nova-conductor (a Python application) is running with DB connection strings pointing to a VIP, which is at the primary site (A) of pg.
>>> >>> >>>>>>>>> >> 2. Switch the VIP from the primary (A) to the new primary (B).
>>> >>> >>>>>>>>> >> 3. Switch over pg: shut down the primary (A), promote the standby (B) to the new primary.
>>> >>> >>>>>>>>> >> 4. nova-conductor keeps running through the whole process.
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> After some seconds, I found some nova-conductor processes hung with status futex_wait_queue_me, and the status of the query in the DB was "idle in transaction" - the transaction was neither committed nor rolled back!
>>> >>> >>>>>>>>> >> I think disconnection is handled in oslo_db, which sends a ping (SELECT 1) to the DB.
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> If the DB is switched over, the connections in the pool should be marked invalid, and reconnect after the next checkout.
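(The ping mentioned here is, roughly, SQLAlchemy's documented pessimistic-disconnect recipe; a sketch of it:)

from sqlalchemy import create_engine, event, exc, select

engine = create_engine("postgresql://scott:tiger@localhost/test")

@event.listens_for(engine, "engine_connect")
def ping_connection(connection, branch):
    if branch:
        return  # "branch" connections share a parent that was already pinged
    try:
        connection.scalar(select([1]))  # the "SELECT 1" ping
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # the pool has invalidated the connection; retrying the ping
            # transparently establishes a fresh one
            connection.scalar(select([1]))
        else:
            raise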
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> ### error messages from nova-conductor
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [exc_filters.py:330 _raise_for_remaining_DBAPIError] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from (psycopg2.ProgrammingError) execute cannot be used while an asynchronous query is underway [SQL: 'SELECT 1']
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway
>>> >>> >>>>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters
>>> >>> >>>>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
>>> >>> >>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>>> >>> >>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
>>> >>> >>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>>> >>> >>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway
>>> >>> >>>>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> ### ps status of nova-conductor
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
>>> >>> >>>>>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>>> >>> >>>>>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280
>>> >>> >>>>>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>>> >>> >>>>>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>>> >>> >>>>>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>>> >>> >>>>>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> ### stack of the nova-conductor process
>>> >>> >>>>>>>>> >> POD6-Mongodb03:/tmp # pstack 33316
>>> >>> >>>>>>>>> >> #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
>>> >>> >>>>>>>>> >> #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>>> >>> >>>>>>>>> >> #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
>>> >>> >>>>>>>>> >> #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
>>> >>> >>>>>>>>> >> #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
>>> >>> >>>>>>>>> >> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>>> >>> >>>>>>>>> >> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> psycopg2 was trying to close the cursor and to acquire the mutex lock "pthread_mutex_lock", but it seems the cursor was being used by another session.
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> Questions:
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> 1. What does the error "ProgrammingError: close cannot be used while an asynchronous query is underway" mean?
>>> >>> >>>>>>>>> >> AFAIK, this comes from psycopg2, and it means an asynchronous query was executed on one connection.
>>> >>> >>>>>>>>> >> But I think SQLAlchemy is thread-safe here, since it was patched by eventlet - see details in eventlet/support/psycopg2_patcher.py.
>>> >>> >>>>>>>>> >> And we can see different green thread numbers in the log, as:
>>> >>> >>>>>>>>> >> [pid:36365] [MainThread] [tid:122397712]
>>> >>> >>>>>>>>> >> [pid:36365] [MainThread] [tid:122397815]
>>> >>> >>>>>>>>> >> So I guess the connection pool within one process is safe.
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> 2. nova-conductor is a multi-threaded Python client, which forks several child processes.
>>> >>> >>>>>>>>> >> ps -elf|grep -i nova-conductor
>>> >>> >>>>>>>>> >> 30878     1 pool_s /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >> 36364     1 ep_pol /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >> 36365     1 futex_ /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >> 36366     1 ep_pol /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >> 36367     1 ep_pol /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >> 36368     1 ep_pol /usr/bin/nova-conductor
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> If nova-conductor is started with only one child, the problem does not happen.
>>> >>> >>>>>>>>> >> Does this mean the connection/engine CANNOT be shared among these child processes?
>>> >>> >>>>>>>>> >>
>>> >>> >>>>>>>>> >> Thanks.
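(Regarding question 2, as a sketch: a pooled Engine should indeed not be carried across os.fork(); the usual pattern is for each child to dispose the inherited pool so it lazily opens fresh connections of its own.)

import os
from sqlalchemy import create_engine

engine = create_engine("postgresql://scott:tiger@localhost/test")

pid = os.fork()
if pid == 0:
    # child: discard pooled connections inherited from the parent; sharing
    # the same sockets across processes is what yields interleaved protocol
    # traffic like the "asynchronous query is underway" errors above
    engine.dispose()
    # ... child does its work with its own fresh connections ...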
