If that connection error is not handled in "is_disconnect", then this wouldn't invalidate the pool.
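As a rough sketch of how an error message like that one could be flagged as a disconnect (the DSN and the exact string being matched are placeholders I haven't verified against your driver version), a handle_error hook would look roughly like this:

    from sqlalchemy import create_engine, event

    # placeholder DSN; adjust to your environment
    engine = create_engine("postgresql+psycopg2://user:pass@vip/nova")

    @event.listens_for(engine, "handle_error")
    def _flag_null_pointer_as_disconnect(context):
        # treat psycopg2's "connection pointer is NULL" error as a disconnect
        # so the pool is invalidated; the message match here is an assumption
        if context.original_exception is not None and \
                "connection pointer is NULL" in str(context.original_exception):
            context.is_disconnect = True

For Nova specifically, oslo.db's exc_filters (linked further down in this thread) would be the more appropriate place to add such a rule.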
However, we've had lots of problems moving VIPs around with Galera. I would recommend using HAProxy or PgBouncer, which are both designed to mitigate this issue, as well as running Nova under mod_wsgi instead of eventlet. There are lots of ways your problem can be solved here. I'd like to find the specific race in SQLAlchemy, but I need specific steps in code. On Sun, May 13, 2018 at 9:50 PM, <[email protected]> wrote: > The scenario we use to reproduce the problem is: > 1. stop the PG master > 2. move the PG IP from the master node to the slave node, and send an ARP > 3. fail over the PG slave to master > > That shuts down all DB connections to the PG master, and in SQLAlchemy we get an > exception: > > ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last): > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/pool.py", line 655, in > _finalize_fairy > ERROR sqlalchemy.pool.QueuePool fairy._reset(pool) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/pool.py", line 798, in > _reset > ERROR sqlalchemy.pool.QueuePool self._reset_agent.rollback() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1586, in rollback > ERROR sqlalchemy.pool.QueuePool self._do_rollback() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1624, in _do_rollback > ERROR sqlalchemy.pool.QueuePool self.connection._rollback_impl() > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 672, in _rollback_impl > ERROR sqlalchemy.pool.QueuePool self._handle_dbapi_exception(e, None, > None, None, None) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 1356, in _handle_dbapi_exception > ERROR sqlalchemy.pool.QueuePool util.raise_from_cause(newraise, > exc_info) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/util/compat.py", line > 200, in raise_from_cause > ERROR sqlalchemy.pool.QueuePool reraise(type(exception), exception, > tb=exc_tb) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/base.py", line > 670, in _rollback_impl > ERROR sqlalchemy.pool.QueuePool > self.engine.dialect.do_rollback(self.connection) > ERROR sqlalchemy.pool.QueuePool File "sqlalchemy/engine/default.py", line > 421, in do_rollback > ERROR sqlalchemy.pool.QueuePool dbapi_connection.rollback() > ERROR sqlalchemy.pool.QueuePool DBConnectionError: > (psycopg2.OperationalError) connection pointer is NULL > > With that exception, SQLAlchemy first goes into > `Connection._handle_dbapi_exception` (keyword "err connection pointer is > NULL" in the log) and then hits it again in `_finalize_fairy` (keyword "Exception > during reset or similar" in the log). > > > > On Friday, May 11, 2018 at 12:21:42 AM UTC+8, Mike Bayer wrote: >> >> On Thu, May 10, 2018 at 7:23 AM, <[email protected]> wrote: >> > Hi Mike, >> > >> > I added more logs and finally found the reason and a method to avoid the >> > deadlock between coroutines. >> > >> > In summary, the deadlock happened because two different coroutines use >> > the >> > same ConnectionRecord with the same connection. >> > It happens under QueuePool, when the DB connections are reset by the DB >> > server (not related to the backend driver). >> > >> > 1. A coroutine checks a ConnectionRecord out into a ConnectionFairy, and registers a >> > weakref to the ConnectionFairy, with callback function >> > `_finalize_fairy`. >> > 2. When the connection gets disconnected, it receives an >> > exception (psycopg2.OperationalError), and goes into >> > Connection._handle_dbapi_exception.
>> > 3. In the `finally` block of `_handle_dbapi_exception`, it will call both >> > `ConnectionFairy.invalidate` and `self.invalidate`. >> > 4. In `ConnectionFairy.invalidate`, `_ConnectionRecord._checkin` is >> > called, >> > which puts the ConnectionRecord back into the QueuePool. >> > 5. In `self.invalidate`, `del self._root.__connection` is called, which >> > **del**s >> > the ConnectionFairy. >> > 6. The **del** operation causes the weakref to the ConnectionFairy to die, >> > and >> > the callback `_finalize_fairy` to be executed. >> > 7. In `_finalize_fairy`, another `_ConnectionRecord._checkin` is called, >> > putting >> > the ConnectionRecord into the QueuePool again. >> >> Here's what does not make sense. You are claiming that a simple >> invalidation will result in the same ConnectionRecord being checked in >> twice. This is obviously not the case, so what is special in your >> situation that makes this happen? >> >> If I follow the steps you refer to, in step 4, when >> ConnectionFairy._checkin is called, that calls finalize_fairy >> directly, which then calls ConnectionRecord.checkin(). Within >> ConnectionRecord.checkin(), ConnectionRecord.fairy_ref is set to None. >> Then back in ConnectionFairy._checkin(), it also sets >> self._connection_record = None. >> >> If we come back into finalize_fairy a *second* time then, as a result >> of the "del" which occurs in _handle_error(), and the >> connection_record is present, the function will exit immediately, >> because: >> >> if connection_record.fairy_ref is not ref: >> return >> >> So the situation as given is not possible without concurrent access to >> the ConnectionFairy being introduced externally to the pool. >> >> There is exactly one thing I can think of that is unique to the >> psycopg2 style of asynchronous connection, not the PyMySQL form of it, >> that might be important here, and it's that I think psycopg2's >> connection.close() method will defer to another greenlet. That means >> when we're inside of ConnectionRecord.invalidate(), and we call >> self.__close(), that could be deferring to another greenlet. But >> this all happens before the connection is returned to the pool, so I >> still don't know what is happening. >> >> > >> > The logs in detail are below. To prove my conclusion is right, I tried to >> > avoid >> > the scenario with this patch, and it worked. >> > >> > def checkin(self): >> > self.fairy_ref = None >> > connection = self.connection >> > pool = self.__pool >> > >> > while self.finalize_callback: >> > finalizer = self.finalize_callback.pop() >> > finalizer(connection) >> > + if self in pool._pool.queue: >> > + pool.logger.debug("jjs rec already in pool checkin, skip %r", >> > self) >> > + return >> >> As mentioned above, it's not possible to reach this point without >> concurrent access to the pooled connection, which means this fix could >> fail also (with at least threaded-style concurrency, not greenlet >> style). The original source of the greenlet issue must still be >> identified. >> >> I would need a self-contained proof of concept that demonstrates this >> issue occurring. The log output is not that helpful because it at >> least would need to write out the identifier of the current greenlet, >> since it is very likely that greenlets have something to do with this >> issue.
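One way to get that identifier into the log output, as a minimal sketch (assuming the greenlet library that eventlet is built on; the logger name and helper function are made up for illustration), is to tag each pool debug message with id(greenlet.getcurrent()):

    import logging
    import greenlet

    log = logging.getLogger("sqlalchemy.pool.debug")

    def pool_debug(msg, *args):
        # tag each message with the id of the currently running greenlet so
        # that interleaved checkouts/checkins can be told apart in the output
        log.debug("[greenlet %x] " + msg, id(greenlet.getcurrent()), *args)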
>> >> >> >> > >> > if pool.dispatch.checkin: >> > pool.dispatch.checkin(connection, self) >> > pool._return_conn(self) >> > >> > Trace the ConnectionRecord 0x88fc5d0 >> > >> > 2018-05-10T14:02:17.927+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Created new connection >> > <connection object at 0x81f69b0; closed: 0> >> > 2018-05-10T14:02:17.928+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs get conn success >> > <connection >> > object at 0x81f69b0; closed: 0>, rec.connection <connection object at >> > 0x81f69b0; closed: 0>, pool queue >> > deque([<sqlalchemy.pool._ConnectionRecord >> > object at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at >> > 0x79fdad0>]) >> > 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs checkout fairy >> > <sqlalchemy.pool._ConnectionFairy object at 0x329c0d0> rec >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> pool queue >> > deque([<sqlalchemy.pool._ConnectionRecord object at 0x7a04810>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) >> > 2018-05-10T14:02:17.929+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Connection <connection object >> > at >> > 0x81f69b0; closed: 0> checked out from pool, status Pool size: 10 >> > Connections in pool: 2 Current Overflow: 0 Max Overflow: 40 Current >> > Checked >> > out connections: 8, queue deque([<sqlalchemy.pool._ConnectionRecord >> > object >> > at 0x7a04810>, <sqlalchemy.pool._ConnectionRecord object at 0x79fdad0>]) >> > 2018-05-10T14:02:17.930+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs >> > _ConnectionFairy._checkout >> > conn <connection object at 0x81f69b0; closed: 0>, rec >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, rec.conn >> > <connection object at 0x81f69b0; closed: 0> >> > 2018-05-10T14:02:49.716+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr error conn >> > <sqlalchemy.engine.base.Connection object at 0x329c150>, err connection >> > pointer is NULL >> > 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter newraise >> > 2018-05-10T14:02:49.717+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] cjr enter finally >> >>>Connection._handle_dbapi_exception --> finally --> >> >>> **self.engine.pool._invalidate** & **self.invalidate** >> >>>self.engine.pool._invalidate --> Pool._invalidate --> >> >>> ConnectionFairy.invalidate --> _ConnectionRecord.invalidate & >> >>> **self._checkin()** >> >>>self.invalidate() --> **del self._root.__connection** >> > 2018-05-10T14:02:49.718+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs connectionFairy >> > invalidate >> > 2018-05-10T14:02:49.718+08:00 
localhost nova-conductor INFO [pid:4528] >> > [MainThread] [tid:122769072] [log.py:109 info] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Invalidate connection >> > <connection >> > object at 0x81f69b0; closed: 2> (reason: OperationalError:connection >> > pointer >> > is NULL >> >>>_ConnectionRecord.invalidate(e) --> >> >>> self.__pool.dispatch.invalidate(self.connection, self, e) & >> >>> self.__close() & >> >>> self.connection = None >> >>>self.__close() --> self.finalize_callback.clear() & >> >>> self.__pool._close_connection(self.connection) >> > 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Closing connection >> > <connection >> > object at 0x81f69b0; closed: 2> >> >>>Pool._close_connection(connection) --> >> >>> self._dialect.do_close(connection) >> >>> --> _ConnDialect.do_close(dbapi_connection) --> >> >>> dbapi_connection.close() FIN >> > 2018-05-10T14:02:49.719+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _ConnectionFairy._checkin >> > conn None rec <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0> >> > rec.conn None >> >>>_ConnectionFairy._checkin --> _finalize_fairy(conn, conn_rec, pool, >> >>> None, >> >>> echo, fairy) & self.conn = None & self.conn_rec = None >> > 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs _finalize_fairy args conn >> > None, rec.conn None, rec, <sqlalchemy.pool._ConnectionRecord object at >> > 0x88fc5d0>. ref None, fairy <sqlalchemy.pool._ConnectionFairy object at >> > 0x329c0d0> >> >>>_finalize_fairy --> connection_record.checkin() >> > 2018-05-10T14:02:49.720+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > connection_record.checkin >> > 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, >> > pool queue deque([<sqlalchemy.pool._ConnectionRecord object at >> > 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>]), >> > finalize_callback >> > deque([]) >> > 2018-05-10T14:02:49.721+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool >> > queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) >> > 2018-05-10T14:02:49.723+08:00 localhost nova-conductor ERROR [pid:4528] >> > [MainThread] [tid:122769072] [log.py:122 error] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] Exception during reset or >> > similar >> > 2018-05-10T14:02:49.723+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > 
connection_record.invalidate >> > 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs going to >> > connection_record.checkin >> > 2018-05-10T14:02:49.724+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs before conn checkin None, >> > pool queue deque([<sqlalchemy.pool._ConnectionRecord object at >> > 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]), >> > finalize_callback >> > deque([]) >> > 2018-05-10T14:02:49.732+08:00 localhost nova-conductor DEBUG [pid:4528] >> > [MainThread] [tid:122769072] [log.py:104 debug] >> > [req-9e51bb8d-cdfe-434b-859a-9211ac2087f9] jjs after conn checkin, pool >> > queue deque([<sqlalchemy.pool._ConnectionRecord object at 0x7524590>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7a2b990>, >> > <sqlalchemy.pool._ConnectionRecord object at 0xa856b10>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x7ebee90>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>, >> > <sqlalchemy.pool._ConnectionRecord object at 0x88fc5d0>]) >> > >> > >> > >> > >> > >> > >> > On Tuesday, May 8, 2018 at 11:06:04 PM UTC+8, [email protected] wrote: >> >> >> >> I found out why we could get the same connection twice from the pool... >> >> >> >> Because we put it into the pool twice. >> >> >> >> Any suggestion for finding out why it puts the same connection into the pool >> >> from >> >> two different coroutines? >> >> >> >> >> >> 2018-05-08T22:39:17.448+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:187954992] [log.py:104 debug] >> >> [req-7c2199c0-d9a6-4131-8358-2aebcf5a4ce1] jjs conn checkin <connection >> >> object at 0x111d9740; closed: 0> >> >> 2018-05-08T22:39:17.502+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:178645264] [log.py:104 debug] >> >> [req-7ab83c30-91bd-4278-b694-85a71035992c] jjs conn checkin <connection >> >> object at 0x111d9740; closed: 0> >> >> ...
>> >> 2018-05-08T22:39:17.992+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:309091856] [log.py:104 debug] >> >> [req-dee5dd2d-866f-408e-999b-b415ea7ce953] Connection <connection >> >> object at >> >> 0x111d9740; closed: 0> checked out from pool >> >> 2018-05-08T22:39:18.039+08:00 localhost nova-conductor DEBUG >> >> [pid:15755] >> >> [MainThread] [tid:178645104] [log.py:104 debug] >> >> [req-ac8017d7-c97e-44ca-92d2-99f180858e14] Connection <connection >> >> object at >> >> 0x111d9740; closed: 0> checked out from pool >> >> >> >> >> >> On Tuesday, May 8, 2018 at 10:18:52 PM UTC+8, [email protected] wrote: >> >>> >> >>> I added logs in sqlalchemy/pool.py >> >>> >> >>> QueuePool.checkout >> >>> >> >>> @classmethod >> >>> def checkout(cls, pool): >> >>> rec = pool._do_get() >> >>> try: >> >>> dbapi_connection = rec.get_connection() >> >>> pool.logger.debug("jjs get conn success %s", dbapi_connection) >> >>> except: >> >>> pool.logger.debug("jjs get error in get_connection") >> >>> with util.safe_reraise(): >> >>> rec.checkin() >> >>> echo = pool._should_log_debug() >> >>> fairy = _ConnectionFairy(dbapi_connection, rec, echo) >> >>> rec.fairy_ref = weakref.ref( >> >>> fairy, >> >>> lambda ref: _finalize_fairy and >> >>> _finalize_fairy( >> >>> dbapi_connection, >> >>> rec, pool, ref, echo) >> >>> ) >> >>> _refs.add(rec) >> >>> if echo: >> >>> pool.logger.debug("jjs conn %s, pool status %s", >> >>> dbapi_connection, pool.status()) >> >>> pool.logger.debug("Connection %r checked out from pool", >> >>> dbapi_connection) >> >>> return fairy >> >>> >> >>> Then I got the logs below: >> >>> >> >>> coroutine A: >> >>> >> >>> 2018-05-08T21:39:59.055+08:00 localhost nova-conductor INFO >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:109 info] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs Pool wait is False, >> >>> timeout >> >>> 30 >> >>> 2018-05-08T21:39:59.078+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Created new connection >> >>> <connection object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.079+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs get conn success >> >>> <connection >> >>> object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] jjs conn <connection object >> >>> at >> >>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in >> >>> pool: 1 >> >>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: >> >>> 45 >> >>> 2018-05-08T21:39:59.080+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:197123760] [log.py:104 debug] >> >>> [req-ba1ab897-c80a-4bc4-922a-39044d669da1] Connection <connection >> >>> object at >> >>> 0x903f8c0; closed: 0> checked out from pool >> >>> >> >>> coroutine B: >> >>> >> >>> 2018-05-08T21:39:59.212+08:00 localhost nova-conductor INFO >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:109 info] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs Pool wait is False, >> >>> timeout >> >>> 30 >> >>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>>
[req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs get conn success >> >>> <connection >> >>> object at 0x903f8c0; closed: 0> >> >>> 2018-05-08T21:39:59.213+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] jjs conn <connection object >> >>> at >> >>> 0x903f8c0; closed: 0>, pool status Pool size: 10 Connections in pool: >> >>> 0 >> >>> Current Overflow: 36 Max Overflow: 40 Current Checked out connections: >> >>> 46 >> >>> 2018-05-08T21:39:59.214+08:00 localhost nova-conductor DEBUG >> >>> [pid:55528] >> >>> [MainThread] [tid:151519312] [log.py:104 debug] >> >>> [req-a78075b5-c55c-40c4-86d0-65957e89cb56] Connection <connection >> >>> object at >> >>> 0x903f8c0; closed: 0> checked out from pool >> >>> >> >>> >> >>> It seems that when we create_connection in QueuePool._do_get, we may >> >>> not >> >>> be holding the lock _pool.mutex (~Queue.mutex). >> >>> >> >>> >> >>> >> >>> On Monday, May 7, 2018 at 8:41:14 PM UTC+8, Mike Bayer wrote: >> >>>> >> >>>> >> >>>> >> >>>> On Mon, May 7, 2018, 7:01 AM <[email protected]> wrote: >> >>>>> >> >>>>> We added a coroutine_id in psycopg2, and found that two coroutines use >> >>>>> the >> >>>>> same connection before the pthread_mutex_lock is released. >> >>>>> >> >>>>> Maybe something in the connection pool is going wrong? >> >>>>> [pid] [coroutine_id] msg >> >>>> >> >>>> >> >>>> >> >>>> This seems unlikely. Openstack uses the SQLAlchemy connection pool >> >>>> with eventlet patching in conjunction with the MySQL drivers and >> >>>> nothing >> >>>> like that happens. The biggest issue SQLAlchemy had with greenlets >> >>>> is that >> >>>> a timed out greenlet would throw GreenletExit and cause a dirty >> >>>> connection >> >>>> to go back into the pool; that was fixed in 1.2.
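A self-contained reproduction along the lines being asked for above might start from a sketch like the following; the DSN, pool sizes, and worker count are placeholders, and it assumes psycopg2 has been made green (wait callback installed) the same way nova-conductor does it. The PG failover would be performed while this is running:

    import eventlet
    eventlet.monkey_patch()

    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import DBAPIError

    # placeholder DSN; pool sizes mirror the ones visible in the logs above
    engine = create_engine(
        "postgresql+psycopg2://user:pass@127.0.0.1/test",
        pool_size=10, max_overflow=40)

    def worker():
        while True:
            try:
                with engine.connect() as conn:
                    conn.execute(text("SELECT pg_sleep(0.1)"))
            except DBAPIError:
                # disconnect errors are expected while the server fails over
                pass
            eventlet.sleep(0)

    pool = eventlet.GreenPool()
    for _ in range(50):
        pool.spawn(worker)
    pool.waitall()  # fail the PG server over to the slave while this runs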
>> >>>> >> >>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn >> >>>>> 0x94122f0, >> >>>>> cb 0x23ad320 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = >> >>>>> 0, >> >>>>> status = 2 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 >> >>>>> __lock 1 >> >>>>> >> >>>>> >> >>>>> Below are the logs grepped by coroutine id: >> >>>>> >> >>>>> [49174] [0xa5db730]finish send query, before psyco_wait, conn >> >>>>> 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn >> >>>>> 0x94122f0, >> >>>>> cb 0x23ad320 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: poll writing >> >>>>> >> >>>>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ >> >>>>> >> >>>>> ----- >> >>>>> >> >>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for >> >>>>> connection >> >>>>> at 0x8de2d30 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = >> >>>>> 0x8de2d30 >> >>>>> >> >>>>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, >> >>>>> refcnt = 1 >> >>>>> >> >>>>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: >> >>>>> refcnt = >> >>>>> 1 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, >> >>>>> async_cursor 0x881ac00 >> >>>>> >> >>>>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = >> >>>>> 0, >> >>>>> status = 2 >> >>>>> >> >>>>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 >> >>>>> __lock 1 >> >>>>> >> >>>>> >> >>>>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote: >> >>>>>> >> >>>>>> We reproduced this problem and added logs in psycopg2, found a >> >>>>>> confusing thing and reported it to psycopg2. >> >>>>>> >> >>>>>> It seems the sqlalchemy & psycopg2 stack flow is: >> >>>>>> 1. from sqlalchemy to psycopg: pq_execute/pq_commit >> >>>>>> 2. from psycopg to sqlalchemy: PyWeakref_NewRef(conn) >> >>>>>> 3. get an exception in sqlalchemy, and do_rollback >> >>>>>> 4. from sqlalchemy to psycopg: pq_abort >> >>>>>> 5. psycopg gets a pthread_mutex_lock deadlock >> >>>>>> >> >>>>>> What we have done to reproduce it is: stop the master pg-server and >> >>>>>> promote the slave pg-server to master, while moving the FIP from the old >> >>>>>> master to the >> >>>>>> slave. At the same time, let nova-conductor do quite a lot of DB query >> >>>>>> requests. >> >>>>>> >> >>>>>> [1] https://github.com/psycopg/psycopg2/issues/703 >> >>>>>> >> >>>>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote: >> >>>>>>> >> >>>>>>> On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote: >> >>>>>>> > Sorry for replying on this topic.
>> >>>>>>> > > >> >>>>>>> > We recently got the same problem in our production environment. >> >>>>>>> > >> >>>>>>> > I found a patch in another lib [1], where they added conn.close() in >> >>>>>>> > the >> >>>>>>> > psycopg2.ProgrammingError exception handler. >> >>>>>>> > >> >>>>>>> > Shall we do the same in [2]? >> >>>>>>> >> >>>>>>> SQLAlchemy does things much more carefully than that: we parse the >> >>>>>>> error message for specific ones that correspond to "connection is >> >>>>>>> no >> >>>>>>> longer usable". We call these "is_disconnect", but it can be any >> >>>>>>> invalidating condition. >> >>>>>>> >> >>>>>>> You can make these yourself, and they can also be made to be part >> >>>>>>> of >> >>>>>>> oslo.db, using the handle_error event: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error >> >>>>>>> >> >>>>>>> within oslo.db you would want to propose a change here: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387 >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> > >> >>>>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split >> >>>>>>> > >> >>>>>>> > [2] >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289 >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote: >> >>>>>>> >> >> >>>>>>> >> Hi Michael, >> >>>>>>> >> >> >>>>>>> >> I am using OpenStack with PostgreSQL, where the sqlalchemy and >> >>>>>>> >> oslo_db >> >>>>>>> >> modules >> >>>>>>> >> are used. >> >>>>>>> >> And there are some problems after my PG database switched over. >> >>>>>>> >> >> >>>>>>> >> Here is my switchover process: >> >>>>>>> >> 1. nova-conductor (a Python application) is running with DB >> >>>>>>> >> connection >> >>>>>>> >> strings pointing to the VIP, which is at the primary site (A) of PG. >> >>>>>>> >> 2. switch the VIP from primary (A) to the new primary (B) >> >>>>>>> >> 3. switch over PG: shut down primary (A), promote standby (B) to >> >>>>>>> >> the new >> >>>>>>> >> primary. >> >>>>>>> >> 4. nova-conductor is running during the whole process. >> >>>>>>> >> >> >>>>>>> >> After some seconds, I found some nova-conductor processes are >> >>>>>>> >> hung >> >>>>>>> >> with >> >>>>>>> >> status futex_wait_queue_me, and the status of the query in the DB >> >>>>>>> >> is >> >>>>>>> >> "idle in >> >>>>>>> >> transaction"; the transaction was not committed or rolled back! >> >>>>>>> >> I think disconnection is handled in oslo_db, which will >> >>>>>>> >> send >> >>>>>>> >> a >> >>>>>>> >> ping (SELECT 1) to the DB. >> >>>>>>> >> >> >>>>>>> >> If the DB is switched over, the connections in the pool should be set >> >>>>>>> >> to >> >>>>>>> >> status invalid, and reconnect after the next checkout.
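A minimal sketch of that kind of "ping before handing out a connection" behavior, using the pool_pre_ping option available in SQLAlchemy 1.2 and later (the DSN below is a placeholder; oslo.db's own SELECT 1 ping mentioned above serves the same purpose):

    from sqlalchemy import create_engine

    # pool_pre_ping emits a lightweight ping on each checkout and transparently
    # recycles the connection if the ping fails; the DSN is a placeholder
    engine = create_engine(
        "postgresql+psycopg2://user:pass@vip/nova",
        pool_pre_ping=True)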
>> >>>>>>> >> >> >>>>>>> >> ###error messages from nova-conductor >> >>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> >>>>>>> >> [tid:122397712] >> >>>>>>> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError] >> >>>>>>> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception >> >>>>>>> >> wrapped from >> >>>>>>> >> (psycopg2.ProgrammingError) execute cannot be used while an >> >>>>>>> >> asynchronous >> >>>>>>> >> query is underway [SQL: 'SELECT 1'] >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most >> >>>>>>> >> recent >> >>>>>>> >> call >> >>>>>>> >> last): >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line >> >>>>>>> >> 1139, >> >>>>>>> >> in >> >>>>>>> >> _execute_context >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters context) >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line >> >>>>>>> >> 450, >> >>>>>>> >> in >> >>>>>>> >> do_execute >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> >>>>>>> >> cursor.execute(statement, >> >>>>>>> >> parameters) >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: >> >>>>>>> >> execute >> >>>>>>> >> cannot be used while an asynchronous query is underway >> >>>>>>> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> >>>>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> >>>>>>> >> [tid:122397712] >> >>>>>>> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] >> >>>>>>> >> Error closing >> >>>>>>> >> cursor >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent >> >>>>>>> >> call last): >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool File >> >>>>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line >> >>>>>>> >> 1226, >> >>>>>>> >> in >> >>>>>>> >> _safe_close_cursor >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool cursor.close() >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close >> >>>>>>> >> cannot be >> >>>>>>> >> used while an asynchronous query is underway >> >>>>>>> >> 36365 ERROR sqlalchemy.pool.QueuePool >> >>>>>>> >> >> >>>>>>> >> ###ps status of nova-conductor >> >>>>>>> >> >> >>>>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log >> >>>>>>> >> # cat >> >>>>>>> >> /proc/33316/stack >> >>>>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120 >> >>>>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280 >> >>>>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0 >> >>>>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180 >> >>>>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b >> >>>>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff >> >>>>>>> >> >> >>>>>>> >> ### stack of the nova-conductor process >> >>>>>>> >> POD6-Mongodb03:/tmp # pstack 33316 >> >>>>>>> >> #0 0x00002b8449e35f4d in __lll_lock_wait () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #1 0x00002b8449e31d02 in _L_lock_791 () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #2 0x00002b8449e31c08 in pthread_mutex_lock () from >> >>>>>>> >> /lib64/libpthread.so.0 >> >>>>>>> >> #3 0x00002b84554c44ab in pq_abort () from >> >>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >>>>>>> >> #4 0x00002b84554c955e in psyco_conn_rollback () from >> >>>>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >>>>>>> >> #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from >> >>>>>>> >> 
/lib64/libpython2.7.so.1.0 >> >>>>>>> >> #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from >> >>>>>>> >> /lib64/libpython2.7.so.1.0 >> >>>>>>> >> >> >>>>>>> >> psycopg2 was trying to close the cursor and to get the >> >>>>>>> >> mutex lock >> >>>>>>> >> "pthread_mutex_lock", but it seems that the cursor was being used by >> >>>>>>> >> another >> >>>>>>> >> session. >> >>>>>>> >> >> >>>>>>> >> >> >>>>>>> >> Questions: >> >>>>>>> >> >> >>>>>>> >> 1. What does the error "ProgrammingError: close cannot be used while >> >>>>>>> >> an >> >>>>>>> >> asynchronous query is underway" mean? >> >>>>>>> >> AFAIK, these are caused by psycopg2, which means an asynchronous >> >>>>>>> >> query >> >>>>>>> >> was >> >>>>>>> >> already executing on that connection. >> >>>>>>> >> But I think the SQLAlchemy usage is thread safe since it was >> >>>>>>> >> patched by >> >>>>>>> >> eventlet; see details in eventlet/support/psycopg2_patcher.py. >> >>>>>>> >> And we can see different green thread numbers in the log, e.g.: >> >>>>>>> >> [pid:36365] [MainThread] [tid:122397712] >> >>>>>>> >> [pid:36365] [MainThread] [tid:122397815] >> >>>>>>> >> So, I guess the connection pool in one process is safe. >> >>>>>>> >> >> >>>>>>> >> 2. The nova-conductor is a multi-threaded Python client, which >> >>>>>>> >> forks >> >>>>>>> >> several child processes. >> >>>>>>> >> ps -elf|grep -i nova-conductor >> >>>>>>> >> 30878 1 pool_s /usr/bin/nova-conductor >> >>>>>>> >> 36364 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36365 1 futex_ /usr/bin/nova-conductor >> >>>>>>> >> 36366 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36367 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> 36368 1 ep_pol /usr/bin/nova-conductor >> >>>>>>> >> >> >>>>>>> >> If the nova-conductor is started with only one child, the >> >>>>>>> >> problem >> >>>>>>> >> does not >> >>>>>>> >> happen. >> >>>>>>> >> Does this mean the connection/engine CANNOT be shared across these >> >>>>>>> >> child >> >>>>>>> >> processes? >> >>>>>>> >> >> >>>>>>> >> Thanks.
