If that connection error is not handled in "is_disconnect", then this wouldn't invalidate the pool.
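As a rough sketch of how an error message like that one could be flagged as a disconnect (the DSN and the exact string being matched are placeholders I haven't verified against your driver version), a handle_error hook would look roughly like this:

    from sqlalchemy import create_engine, event

    # placeholder DSN; adjust to your environment
    engine = create_engine("postgresql+psycopg2://user:pass@vip/nova")

    @event.listens_for(engine, "handle_error")
    def _flag_null_pointer_as_disconnect(context):
        # treat psycopg2's "connection pointer is NULL" error as a disconnect
        # so the pool is invalidated; the message match here is an assumption
        if context.original_exception is not None and \
                "connection pointer is NULL" in str(context.original_exception):
            context.is_disconnect = True

For Nova specifically, oslo.db's exc_filters (linked further down in this thread) would be the more appropriate place to add such a rule.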
However, we've had lots of problems moving VIPs around with Galera. I would recommend using HAProxy or PgBouncer, which are both designed to mitigate this issue, as well as running Nova under mod_wsgi instead of eventlet. There are lots of ways your problem can be solved here. I'd like to find the specific race in SQLAlchemy, but I need specific steps in code. On Sun, May 13, 2018 at 9:50 PM, <[email protected]> wrote: > The scenario we use to reproduce the problem is: > 1. stop the PG master > 2. move the PG IP from the master node to the slave node, and send an ARP > 3. fail over the PG slave to master > > That shuts down all DB connections to the PG master, and in SQLAlchemy we get an > exception: > > ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last): > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/pool.py", line 655, in > _finalize_fairy > ERROR sqlalchemy.pool.QueuePool fairy._reset(pool) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/pool.py", line 798, in > _reset > ERROR sqlalchemy.pool.QueuePool self._reset_agent.rollback() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1586, in rollback > ERROR sqlalchemy.pool.QueuePool self._do_rollback() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1624, in _do_rollback > ERROR sqlalchemy.pool.QueuePool self.connection._rollback_impl() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 672, in _rollback_impl > ERROR sqlalchemy.pool.QueuePool self._handle_dbapi_exception(e, None, > None, None, None) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1356, in _handle_dbapi_exception > ERROR sqlalchemy.pool.QueuePool util.raise_from_cause(newraise, > exc_info) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/util/compat.py", line > 200, in raise_from_cause > ERROR sqlalchemy.pool.QueuePool reraise(type(exception), exception, > tb=exc_tb) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 670, in _rollback_impl > ERROR sqlalchemy.pool.QueuePool > self.engine.dialect.do_rollback(self.connection) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/default.py", line > 421, in do_rollback > ERROR sqlalchemy.pool.QueuePool dbapi_connection.rollback() > ERROR sqlalchemy.pool.QueuePool DBConnectionError: > (psycopg2.OperationalError) connection pointer is NULL > > With that exception, SQLAlchemy first goes into > `Connection._handle_dbapi_exception` (keyword "err connection pointer is > NULL" in the log) and then hits it again in `_finalize_fairy` (keyword "Exception > during reset or similar" in the log). > > > > On Friday, May 11, 2018 at 12:21:42 AM UTC+8, Mike Bayer wrote: >> >> On Thu, May 10, 2018 at 7:23 AM, <[email protected]> wrote: >> > Hi Mike, >> > >> > I added more logs and finally found the reason and a method to avoid the >> > deadlock between coroutines. >> > >> > In summary, the deadlock happened because two different coroutines use >> > the >> > same ConnectionRecord with the same connection. >> > It happens under QueuePool, when the DB connections are reset by the DB >> > server (not related to the backend driver). >> > >> > 1. A coroutine checks a ConnectionRecord out into a ConnectionFairy, and registers a >> > weakref to the ConnectionFairy, with callback function >> > `_finalize_fairy`. >> > 2. When the connection gets disconnected, it receives an >> > exception (psycopg2.OperationalError), and goes into >> > Connection._handle_dbapi_exception.
>> > 3. In the `finally` block of `_handle_dbapi_exception`, it will call both >> > `ConnectionFairy.invalidate` and `self.invalidate`. >> > 4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is >> > called, >> > which puts the ConnectionRecord back into the QueuePool. >> > 5. In `self.invalidate`, `del self._root.__connection` is called, which >> > **del**s >> > the ConnectionFairy. >> > 6. The **del** operation causes the weakref to the ConnectionFairy to die, >> > and >> > the callback `_finalize_fairy` to be executed. >> > 7. In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, >> > putting >> > the ConnectionRecord into the QueuePool again. >> >> Here's what does not make sense. You are claiming that a simple >> invalidation will result in the same ConnectionRecord being checked in >> twice. This is obviously not the case, so what is special in your >> situation that makes this happen? >> >> If I follow the steps you refer to, in step 4, when >> ConnectionFairy._checkin is called, that calls finalize_fairy >> directly, which then calls ConnectionRecord.checkin(). Within >> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None. >> Then back in ConnectionFairy._checkin(), it also sets >> self._connection_record = None. >> >> If we come back into finalize_fairy a *second* time then, as a result >> of the "del" which occurs in _handle_error(), and the >> connection_record is present, the function will exit immediately, >> because: >> >> if connection_record.fairy_ref is not ref: >> return >> >> So the situation as given is not possible without concurrent access to >> the ConnectionFairy being introduced externally to the pool. >> >> There is exactly one thing I can think of that is unique to the >> psycopg2 style of asynchronous connection, not the PyMySQL form of it, >> that might be important here, and it's that I think psycopg2's >> connection.close() method will defer to another greenlet. That means >> when we're inside of ConnectionRecord.invalidate(), and we call >> self.__close(), that could be deferring to another greenlet. But >> this all happens before the connection is returned to the pool, so I >> still don't know what is happening. >> >> > >> > The logs in detail are below. To prove my conclusion is right, I tried to >> > avoid >> > the scenario with this patch, and it worked. >> > >> > def checkin(self): >> > self.fairy_ref = None >> > connection = self.connection >> > pool = self.__pool >> > >> > while self.finalize_callback: >> > finalizer = self.finalize_callback.pop() >> > finalizer(connection) >> > + if self in pool._pool.queue: >> > + pool.logger.debug("jjs rec already in pool checkin, skip %r", >> > self) >> > + return >> >> As mentioned above, it's not possible to reach this point without >> concurrent access to the pooled connection, which means this fix could >> fail also (with at least threaded-style concurrency, not greenlet >> style). The original source of the greenlet issue must still be >> identified. >> >> I would need a self-contained proof of concept that demonstrates this >> issue occurring. The log output is not that helpful because it at >> least would need to write out the identifier of the current greenlet, >> since it is very likely that greenlets have something to do with this >> issue.
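One way to get that identifier into the log output, as a minimal sketch (assuming the greenlet library that eventlet is built on; the logger name and helper function are made up for illustration), is to tag each pool debug message with id(greenlet.getcurrent()):

    import logging
    import greenlet

    log = logging.getLogger("sqlalchemy.pool.debug")

    def pool_debug(msg, *args):
        # tag each message with the id of the currently running greenlet so
        # that interleaved checkouts/checkins can be told apart in the output
        log.debug("[greenlet %x] " + msg, id(greenlet.getcurrent()), *args)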
>> >> >> >> > >> > if pool.dispatch.checkin: >> > pool.dispatch.checkin(connection, self) >> > pool._return_conn(self) >> > >> > Trace the ConnectionRecord 0x88fc5d0 >> > >> > 2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection >> > <connection object at 0x81f69b0; closed: 0> >> > 2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success >> > <connection >> > object at 0x81f69b0; closed: 0>, rec.connection <connection object at >> > 0x81f69b0; closed: 0>, pool queue >> > deque([<sqlalchemy.pool._ConnectionRecord >> > object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at >> > 0x79fdad0>]) >> > 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy >> > <sqlalchemy.pool._ConnectionFairy object at 0x329c0d0> rec >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> pool queue >> > deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) >> > 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection <connection object >> > at >> > 0x81f69b0; closed: 0> checked out from pool, status Pool size: 10 >> > Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current >> > Checked >> > out connections: 8, queue deque([<sqlalchemy.pool._ConnectionRecord >> > object >> > at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) >> > 2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs >> > _ConnectionFairy._checkout >> > conn <connection object at 0x81f69b0; closed: 0>, rec >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, rec.conn >> > <connection object at 0x81f69b0; closed: 0> >> > 2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn >> > <sqlalchemy.engine.base.Connection object at 0x329c150>, err connection >> > pointer is NULL >> > 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise >> > 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally >> >>>Connection._handle_dbapi_exception --> finally --> >> >>> **self.engine.pool._invalidate** & **self.invalidate** >> >>>self.engine.pool._invalidate --> Pool._invalidate --> >> >>> ConnectionFairy.invalidate --> _ConnectionRecord.invalidate & >> >>> **self._checkin()** >> >>>self.invalidate() --> **del self._root.__connection** >> > 2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy >> > invalidate >> > 2018-05-10T14:02:49.718+08:00 
localhost nova-conductor INFO [pid:4528] >> > [MainThread] [tid:122769072] [log.py:109 info] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection >> > <connection >> > object at 0x81f69b0; closed: 2> (reason: OperationalError:connection >> > pointer >> > is NULL >> >>>_ConnectionRecord.invalidate(e) --> >> >>> self.__pool.dispatch.invalidate(self.connection, self, e) & >> >>> self.__close() & >> >>> self.connection = None >> >>>self.__close() --> self.finalize_callback.clear() & >> >>> self.__pool._close_connection(self.connection) >> > 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection >> > <connection >> > object at 0x81f69b0; closed: 2> >> >>>Pool._close_connection(connection) --> >> >>> self._dialect.do_close(connection) >> >>> --> _ConnDialect.do_close(dbapi_connection) --> >> >>> dbapi_connection.close() FIN >> > 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkin >> > conn None rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> >> > rec.conn None >> >>>_ConnectionFairy._checkin --> _finalize_fairy(conn, conn_rec, pool, >> >>> None, >> >>> echo, fairy) & self.conn = None & self.conn_rec = None >> > 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _finalize_fairy args conn >> > None, rec.conn None, rec, <sqlalchemy.pool._ConnectionRecord object at >> > 0x88fc5d0>. ref None, fairy <sqlalchemy.pool._ConnectionFairy object at >> > 0x329c0d0> >> >>>_finalize_fairy --> connection_record.checkin() >> > 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > connection_record.checkin >> > 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, >> > pool queue deque([<sqlalchemy.pool._ConnectionRecord object at >> > 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>]), >> > finalize_callback >> > deque([]) >> > 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool >> > queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) >> > 2018-05-10T14:02:49.723+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Exception during reset or >> > similar >> > 2018-05-10T14:02:49.723+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > 
connection_record.invalidate >> > 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > connection_record.checkin >> > 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, >> > pool queue deque([<sqlalchemy.pool._ConnectionRecord object at >> > 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]), >> > finalize_callback >> > deque([]) >> > 2018-05-10T14:02:49.732+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool >> > queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) >> > >> > >> > >> > >> > >> > >> > On Tuesday, May 8, 2018 at 11:06:04 PM UTC+8, [email protected] wrote: >> >> >> >> I found out why we could get the same connection twice from the pool... >> >> >> >> Because we put it into the pool twice. >> >> >> >> Any suggestion for finding out why it puts the same connection into the pool >> >> from >> >> two different coroutines? >> >> >> >> >> >> 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:187954992] [log.py:104 debug] >> >> [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection >> >> object at 0x111d9740; closed: 0> >> >> 2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:178645264] [log.py:104 debug] >> >> [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection >> >> object at 0x111d9740; closed: 0> >> >> ...
>> >> 2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:309091856] [log.py:104 debug] >> >> [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection >> >> object at >> >> 0x111d9740; closed: 0> checked out from pool >> >> 2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:178645104] [log.py:104 debug] >> >> [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection >> >> object at >> >> 0x111d9740; closed: 0> checked out from pool >> >> >> >> >> >> On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote: >> >>> >> >>> I added logs in sqlalchemy/pool.py >> >>> >> >>> QueuePool.checkout >> >>> >> >>> @classmethod >> >>> def checkout(cls, pool): >> >>> rec = pool._do_get() >> >>> try: >> >>> dbapi_connection = rec.get_connection() >> >>> pool.logger.debug("jjs get conn success %s", dbapi_connection) >> >>> except: >> >>> pool.logger.debug("jjs get error in get_connection") >> >>> with util.safe_reraise(): >> >>> rec.checkin() >> >>> echo = pool._should_log_debug() >> >>> fairy = _ConnectionFairy(dbapi_connection, rec, echo) >> >>> rec.fairy_ref = weakref.ref( >> >>> fairy, >> >>> lambda ref: _finalize_fairy and >> >>> _finalize_fairy( >> >>> dbapi_connection, >> >>> rec, pool, ref, echo) >> >>> ) >> >>> _refs.add(rec) >> >>> if echo: >> >>> pool.logger.debug("jjs conn %s, pool status %s", >> >>> dbapi_connection, pool.status()) >> >>> pool.logger.debug("Connection %r checked out from pool", >> >>> dbapi_connection) >> >>> return fairy >> >>> >> >>> Then I got the logs below: >> >>> >> >>> coroutine A: >> >>> >> >>> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:109 info] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, >> >>> timeout >> >>> 30 >> >>> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection >> >>> <connection object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success >> >>> <connection >> >>> object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object >> >>> at >> >>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in >> >>> pool: 1 >> >>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: >> >>> 45 >> >>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection >> >>> object at >> >>> 0x903f8c0; closed: 0> checked out from pool >> >>> >> >>> coroutine B: >> >>> >> >>> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:109 info] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, >> >>> timeout >> >>> 30 >> >>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>>
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success >> >>> <connection >> >>> object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object >> >>> at >> >>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: >> >>> 0 >> >>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: >> >>> 46 >> >>> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection >> >>> object at >> >>> 0x903f8c0; closed: 0> checked out from pool >> >>> >> >>> >> >>> It seems that when we create_connection in QueuePool._do_get, we may >> >>> not >> >>> be holding the lock _pool.mutex (~Queue.mutex). >> >>> >> >>> >> >>> >> >>> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote: >> >>>> >> >>>> >> >>>> >> >>>> On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote: >> >>>>> >> >>>>> We added a coroutine_id in psycopg2, and found that two coroutines use >> >>>>> the >> >>>>> same connection before the pthread_mutex_lock is released. >> >>>>> >> >>>>> Maybe something in the connection pool is going wrong? >> >>>>> [pid] [coroutine_id] msg >> >>>> >> >>>> >> >>>> >> >>>> This seems unlikely. Openstack uses the SQLAlchemy connection pool >> >>>> with eventlet patching in conjunction with the MySQL drivers and >> >>>> nothing >> >>>> like that happens. The biggest issue SQLAlchemy had with greenlets >> >>>> is that >> >>>> a timed out greenlet would throw GreenletExit and cause a dirty >> >>>> connection >> >>>> to go back into the pool; that was fixed in 1.2.
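A self-contained reproduction along the lines being asked for above might start from a sketch like the following; the DSN, pool sizes, and worker count are placeholders, and it assumes psycopg2 has been made green (wait callback installed) the same way nova-conductor does it. The PG failover would be performed while this is running:

    import eventlet
    eventlet.monkey_patch()

    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import DBAPIError

    # placeholder DSN; pool sizes mirror the ones visible in the logs above
    engine = create_engine(
        "postgresql+psycopg2://user:pass@127.0.0.1/test",
        pool_size=10, max_overflow=40)

    def worker():
        while True:
            try:
                with engine.connect() as conn:
                    conn.execute(text("SELECT pg_sleep(0.1)"))
            except DBAPIError:
                # disconnect errors are expected while the server fails over
                pass
            eventlet.sleep(0)

    pool = eventlet.GreenPool()
    for _ in range(50):
        pool.spawn(worker)
    pool.waitall()  # fail the PG server over to the slave while this runs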
>> >>>> >> >>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn >> >>>>> 0x94122f0, >> >>>>> cb 0x23ad320 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = >> >>>>> 0, >> >>>>> status = 2 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 >> >>>>> __lock 1 >> >>>>> >> >>>>> >> >>>>> Below are the logs grepped by coroutine id: >> >>>>> >> >>>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn >> >>>>> 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn >> >>>>> 0x94122f0, >> >>>>> cb 0x23ad320 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: poll writing >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ >> >>>>> >> >>>>> ----- >> >>>>> >> >>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for >> >>>>> connection >> >>>>> at 0x8de2d30 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = >> >>>>> 0x8de2d30 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, >> >>>>> refcnt = 1 >> >>>>> >> >>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: >> >>>>> refcnt = >> >>>>> 1 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = >> >>>>> 0, >> >>>>> status = 2 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 >> >>>>> __lock 1 >> >>>>> >> >>>>> >> >>>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote: >> >>>>>> >> >>>>>> We reproduced this problem and added logs in psycopg2, found a >> >>>>>> confusing thing and reported it to psycopg2. >> >>>>>> >> >>>>>> It seems the sqlalchemy & psycopg2 stack flow is: >> >>>>>> 1. from sqlalchemy to psycopg: pq_execute/pq_commit >> >>>>>> 2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn) >> >>>>>> 3. get an exception in sqlalchemy, and do_rollback >> >>>>>> 4. from sqlalchemy to psycopg: pq_abort >> >>>>>> 5. psycopg gets a pthread_mutex_lock deadlock >> >>>>>> >> >>>>>> What we have done to reproduce it is: stop the master pg-server and >> >>>>>> promote the slave pg-server to master, while moving the FIP from the old >> >>>>>> master to the >> >>>>>> slave. At the same time, let nova-conductor do quite a lot of DB query >> >>>>>> requests. >> >>>>>> >> >>>>>> [1] https://github.com/psycopg/psycopg2/issues/703 >> >>>>>> >> >>>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote: >> >>>>>>> >> >>>>>>> On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote: >> >>>>>>> > Sorry for replying on this topic.
>> >>>>>>> > > >> >>>>>>> > We recently got the same problem in our production environment. >> >>>>>>> > >> >>>>>>> > I found a patch in another lib [1], where they added conn.close() in >> >>>>>>> > the >> >>>>>>> > psycopg2.ProgrammingError exception handler. >> >>>>>>> > >> >>>>>>> > Shall we do the same in [2]? >> >>>>>>> >> >>>>>>> SQLAlchemy does things much more carefully than that: we parse the >> >>>>>>> error message for specific ones that correspond to "connection is >> >>>>>>> no >> >>>>>>> longer usable". We call these "is_disconnect", but it can be any >> >>>>>>> invalidating condition. >> >>>>>>> >> >>>>>>> You can make these yourself, and they can also be made to be part >> >>>>>>> of >> >>>>>>> oslo.db, using the handle_error event: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error >> >>>>>>> >> >>>>>>> within oslo.db you would want to propose a change here: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387 >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> > >> >>>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split >> >>>>>>> > >> >>>>>>> > [2] >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289 >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote: >> >>>>>>> >> >> >>>>>>> >> Hi Michael, >> >>>>>>> >> >> >>>>>>> >> I am using OpenStack with PostgreSQL, where the sqlalchemy and >> >>>>>>> >> oslo_db >> >>>>>>> >> modules >> >>>>>>> >> are used. >> >>>>>>> >> And there are some problems after my PG database switched over. >> >>>>>>> >> >> >>>>>>> >> Here is my switchover process: >> >>>>>>> >> 1. nova-conductor (a Python application) is running with DB >> >>>>>>> >> connection >> >>>>>>> >> strings pointing to the VIP, which is at the primary site (A) of PG. >> >>>>>>> >> 2. switch the VIP from primary (A) to the new primary (B) >> >>>>>>> >> 3. switch over PG: shut down primary (A), promote standby (B) to >> >>>>>>> >> the new >> >>>>>>> >> primary. >> >>>>>>> >> 4. nova-conductor is running during the whole process. >> >>>>>>> >> >> >>>>>>> >> After some seconds, I found some nova-conductor processes are >> >>>>>>> >> hung >> >>>>>>> >> with >> >>>>>>> >> status futex_wait_queue_me, and the status of the query in the DB >> >>>>>>> >> is >> >>>>>>> >> "idle in >> >>>>>>> >> transaction"; the transaction was not committed or rolled back! >> >>>>>>> >> I think disconnection is handled in oslo_db, which will >> >>>>>>> >> send >> >>>>>>> >> a >> >>>>>>> >> ping (SELECT 1) to the DB. >> >>>>>>> >> >> >>>>>>> >> If the DB is switched over, the connections in the pool should be set >> >>>>>>> >> to >> >>>>>>> >> status invalid, and reconnect after the next checkout.
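A minimal sketch of that kind of "ping before handing out a connection" behavior, using the pool_pre_ping option available in SQLAlchemy 1.2 and later (the DSN below is a placeholder; oslo.db's own SELECT 1 ping mentioned above serves the same purpose):

    from sqlalchemy import create_engine

    # pool_pre_ping emits a lightweight ping on each checkout and transparently
    # recycles the connection if the ping fails; the DSN is a placeholder
    engine = create_engine(
        "postgresql+psycopg2://user:pass@vip/nova",
        pool_pre_ping=True)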
>> >>>>>>> >> >> >>>>>>> >> ###error messages from nova-conductor >> >>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> >>>>>>> >> [tid:122397712] >> >>>>>>> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError] >> >>>>>>> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception >> >>>>>>> >> wrapped from >> >>>>>>> >> (psycopg2.ProgrammingError) execute cannot be used while an >> >>>>>>> >> asynchronous >> >>>>>>> >> query is underway [SQL: 'SELECT 1'] >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most >> >>>>>>> >> recent >> >>>>>>> >> call >> >>>>>>> >> last): >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line >> >>>>>>> >> 1139, >> >>>>>>> >> in >> >>>>>>> >> _execute_context >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters context) >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line >> >>>>>>> >> 450, >> >>>>>>> >> in >> >>>>>>> >> do_execute >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> >>>>>>> >> cursor.execute(statement, >> >>>>>>> >> parameters) >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: >> >>>>>>> >> execute >> >>>>>>> >> cannot be used while an asynchronous query is underway >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> >>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> >>>>>>> >> [tid:122397712] >> >>>>>>> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] >> >>>>>>> >> Error closing >> >>>>>>> >> cursor >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent >> >>>>>>> >> call last): >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line >> >>>>>>> >> 1226, >> >>>>>>> >> in >> >>>>>>> >> _safe_close_cursor >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool cursor.close() >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close >> >>>>>>> >> cannot be >> >>>>>>> >> used while an asynchronous query is underway >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool >> >>>>>>> >> >> >>>>>>> >> ###ps status of nova-conductor >> >>>>>>> >> >> >>>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log >> >>>>>>> >> # cat >> >>>>>>> >> /proc/33316/stack >> >>>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120 >> >>>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280 >> >>>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0 >> >>>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180 >> >>>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b >> >>>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff >> >>>>>>> >> >> >>>>>>> >> ### stack of the nova-conductor process >> >>>>>>> >> POD6-Mongodb03:/tmp # pstack 33316 >> >>>>>>> >> #0 0x00002b8449e35f4d in __lll_lock_wait () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #1 0x00002b8449e31d02 in _L_lock_791 () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #2 0x00002b8449e31c08 in pthread_mutex_lock () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #3 0x00002b84554c44ab in pq_abort () from >> >>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >>>>>>> >> #4 0x00002b84554c955e in psyco_conn_rollback () from >> >>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >>>>>>> >> #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from >> >>>>>>> >> 
/lib64/libpython2.7.so.1.0 >> >>>>>>> >> #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from >> >>>>>>> >> /lib64/libpython2.7.so.1.0 >> >>>>>>> >> >> >>>>>>> >> psycopg2 was trying to close the cursor and to get the >> >>>>>>> >> mutex lock >> >>>>>>> >> "pthread_mutex_lock", but it seems that the cursor was being used by >> >>>>>>> >> another >> >>>>>>> >> session. >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> Questions: >> >>>>>>> >> >> >>>>>>> >> 1. What does the error "ProgrammingError: close cannot be used while >> >>>>>>> >> an >> >>>>>>> >> asynchronous query is underway" mean? >> >>>>>>> >> AFAIK, these are caused by psycopg2, which means an asynchronous >> >>>>>>> >> query >> >>>>>>> >> was >> >>>>>>> >> already executing on that connection. >> >>>>>>> >> But I think the SQLAlchemy usage is thread safe since it was >> >>>>>>> >> patched by >> >>>>>>> >> eventlet; see details in eventlet/support/psycopg2_patcher.py. >> >>>>>>> >> And we can see different green thread numbers in the log, e.g.: >> >>>>>>> >> [pid:36365] [MainThread] [tid:122397712] >> >>>>>>> >> [pid:36365] [MainThread] [tid:122397815] >> >>>>>>> >> So, I guess the connection pool in one process is safe. >> >>>>>>> >> >> >>>>>>> >> 2. The nova-conductor is a multi-threaded Python client, which >> >>>>>>> >> forks >> >>>>>>> >> several child processes. >> >>>>>>> >> ps -elf|grep -i nova-conductor >> >>>>>>> >> 30878 1 pool_s /usr/bin/nova-conductor >> >>>>>>> >> 36364 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36365 1 futex_ /usr/bin/nova-conductor >> >>>>>>> >> 36366 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36367 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36368 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> >> >>>>>>> >> If the nova-conductor is started with only one child, the >> >>>>>>> >> problem >> >>>>>>> >> does not >> >>>>>>> >> happen. >> >>>>>>> >> Does this mean the connection/engine CANNOT be shared across these >> >>>>>>> >> child >> >>>>>>> >> processes? >> >>>>>>> >> >> >>>>>>> >> Thanks.
