Just to note, there's definitely a bug somewhere, and if it's found
there's a 90% chance it involves code I wrote at some point :)
(though likely some assumption that the PG / async stuff does not
meet).  That code is part of oslo / openstack.  I work on oslo /
openstack, but that's my employment, and the PostgreSQL space is
something we as a community decided we didn't have the resources to
work on.  The community certainly welcomes people to work on the
PostgreSQL side of things, though.  But it depends on your
priorities.  If you really need to scale in production and don't want
to get into full-blown development, the majority of the community is
still on MySQL and Galera.  You might also consider picking up this
thread on the openstack-dev list, where there may be other
PostgreSQL-minded developers with insights.



On Mon, May 7, 2018 at 8:40 AM, Mike Bayer <clas...@zzzcomputing.com> wrote:
>
>
> On Mon, May 7, 2018, 7:01 AM <jiajunsu....@gmail.com> wrote:
>>
>> We added coroutine_id in psycopg2, and found that two coroutines use the
>> same connection before the pthread_mutex_lock is released.
>>
>> Maybe something in the connection pool is going wrong?
>> Log line format: [pid] [coroutine_id] msg
>
>
>
> This seems unlikely.  Openstack uses the SQLAlchemy connection pool with
> eventlet patching in conjunction with the MySQL drivers, and nothing like
> that happens.  The biggest issue SQLAlchemy had with greenlets is that a
> timed-out greenlet would throw GreenletExit and cause a dirty connection
> to go back into the pool; that was fixed in 1.2.
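
A minimal sketch of that failure mode and the kind of application-side
guard one could use on pre-1.2 versions, assuming an eventlet timeout
wrapping a query; the helper name and DSN are illustrative only, not
oslo.db or SQLAlchemy code:

import eventlet
from sqlalchemy import create_engine

# illustrative DSN; substitute a real one
engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")

def run_with_timeout(seconds, sql):
    conn = engine.connect()
    try:
        with eventlet.Timeout(seconds):
            return conn.execute(sql).fetchall()
    except BaseException:
        # GreenletExit (and eventlet.Timeout) need not derive from
        # Exception, so catch BaseException and invalidate: the half-used
        # DBAPI connection is then discarded rather than returned dirty
        # to the pool -- the problem described above as fixed in 1.2.
        conn.invalidate()
        raise
    finally:
        conn.close()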
>
>>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>> status = 2
>>
>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>
>>
>> Below are the logs grepped by coroutine id:
>>
>> [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0
>>
>> [49174] [0xa5db730]before have_wait_callback conn 0x94122f0
>>
>> [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb
>> 0x23ad320
>>
>> [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0
>>
>> [49174] [0xa5db730]conn_poll: poll writing
>>
>> [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ
>>
>> -----
>>
>> [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at
>> 0x8de2d30
>>
>> [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650
>>
>> [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn =
>> 0x8de2d30
>>
>> [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt =
>> 1
>>
>> [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0,
>> async_cursor 0x881ac00
>>
>> [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,
>> status = 2
>>
>> [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
>>
>>
>> On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, jiajun...@gmail.com wrote:
>>>
>>> We reproduced this problem and added logging in psycopg2, found a
>>> confusing behavior, and reported it to psycopg2 [1].
>>>
>>> It seems the sqlalchemy & psycopg2 call flow is:
>>> 1. from sqlalchemy to psycopg2: pq_execute / pq_commit
>>> 2. from psycopg2 to sqlalchemy: PyWeakref_NewRef(conn)
>>> 3. sqlalchemy gets an exception and does do_rollback
>>> 4. from sqlalchemy to psycopg2: pq_abort
>>> 5. psycopg2 deadlocks on pthread_mutex_lock
>>>
>>> What we did to reproduce it: stop the master pg-server and promote the
>>> slave pg-server to master, moving the FIP from the old master to the
>>> slave. At the same time, have nova-conductor issue quite a lot of db
>>> query requests.
>>>
>>> [1] https://github.com/psycopg/psycopg2/issues/703
>>>
>>> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>>>
>>>> On Mon, Apr 23, 2018 at 9:03 AM,  <jiajun...@gmail.com> wrote:
>>>> > Sorry for replying on this old topic.
>>>> >
>>>> > We recently hit the same problem in our production environment.
>>>> >
>>>> > I found a patch in another lib [1], where they added conn.close() in
>>>> > the psycopg2.ProgrammingError exception handler.
>>>> >
>>>> > Shall we do the same in [2]?
>>>>
>>>> SQLAlchemy does things much more carefully than that: we parse the
>>>> error messages for specific ones that correspond to "connection is no
>>>> longer usable".  We call these "is_disconnect", but it can be any
>>>> invalidating condition.
>>>>
>>>> You can make these yourself, and they can also be made part of
>>>> oslo.db, using the handle_error event:
>>>>
>>>>
>>>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>>>
>>>> Within oslo.db you would want to propose a change here:
>>>>
>>>>
>>>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
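
A minimal sketch of such a filter using the handle_error event mentioned
above, assuming an already-created engine; the error message being matched
is just one example, not an exhaustive list of disconnect conditions:

from sqlalchemy import create_engine, event

engine = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")

@event.listens_for(engine, "handle_error")
def mark_disconnect(context):
    # context.original_exception is the raw DBAPI (psycopg2) error; if it
    # indicates the connection is no longer usable, set is_disconnect so
    # SQLAlchemy invalidates the connection and the rest of the pool.
    msg = str(context.original_exception)
    if "server closed the connection unexpectedly" in msg:
        context.is_disconnect = True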
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> >
>>>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split
>>>> >
>>>> > [2]
>>>> >
>>>> > https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>>>> >
>>>> >
>>>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote:
>>>> >>
>>>> >> Hi Michael,
>>>> >>
>>>> >> I am using openstack with postgresql, where the sqlalchemy and oslo_db
>>>> >> modules are used.
>>>> >> And there are some problems after my pg database was switched over.
>>>> >>
>>>> >> Here is my switchover process:
>>>> >> 1. nova-conductor (python application) is running with DB connection
>>>> >> strings pointing to the VIP, which is on the primary site (A) of pg.
>>>> >> 2. switch the VIP from the primary (A) to the new primary (B)
>>>> >> 3. switch over pg: shut down the primary (A), promote the standby (B)
>>>> >> to the new primary.
>>>> >> 4. nova-conductor keeps running during the whole process.
>>>> >>
>>>> >> After some seconds, I found some nova-conductor processes hung with
>>>> >> status futex_wait_queue_me, and the status of the query in the DB was
>>>> >> "idle in transaction"; the transaction was neither committed nor
>>>> >> rolled back!
>>>> >> I think disconnection is handled in oslo_db, which will send a
>>>> >> ping (SELECT 1) to the DB.
>>>> >>
>>>> >> If the DB is switched over, the connection in the pool should be
>>>> >> marked invalid and reconnect on the next checkout.
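
That check is roughly the standard SQLAlchemy "pessimistic" ping recipe; a
minimal sketch (not necessarily the exact oslo.db implementation):

from sqlalchemy import event, exc
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # Run a cheap query on every checkout; if it fails, raising
    # DisconnectionError makes the pool drop this connection and retry
    # the checkout with a fresh one.
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise exc.DisconnectionError()
    cursor.close()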
>>>> >>
>>>> >> ### error messages from nova-conductor
>>>> >> localhost nova-conductor ERROR [pid:36365] [MainThread]
>>>> >> [tid:122397712]
>>>> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
>>>> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception
>>>> >> wrapped from
>>>> >> (psycopg2.ProgrammingError) execute cannot be used while an
>>>> >> asynchronous
>>>> >> query is underway [SQL: 'SELECT 1']
>>>> >>  36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent
>>>> >> call
>>>> >> last):
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in
>>>> >> _execute_context
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>>>> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in
>>>> >> do_execute
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters
>>>> >> cursor.execute(statement,
>>>> >> parameters)
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError:
>>>> >> execute
>>>> >> cannot be used while an asynchronous query is underway
>>>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters
>>>> >>  localhost nova-conductor ERROR [pid:36365] [MainThread]
>>>> >> [tid:122397712]
>>>> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error
>>>> >> closing
>>>> >> cursor
>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call
>>>> >> last):
>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool   File
>>>> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in
>>>> >> _safe_close_cursor
>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close
>>>> >> cannot be
>>>> >> used while an asynchronous query is underway
>>>> >>   36365 ERROR sqlalchemy.pool.QueuePool
>>>> >>
>>>> >> ### ps status of nova-conductor
>>>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log #
>>>> >> cat
>>>> >> /proc/33316/stack
>>>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
>>>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280
>>>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
>>>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
>>>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
>>>> >> [<ffffffffffffffff>] 0xffffffffffffffff
>>>> >>
>>>> >> ### stack of the nova-conductor process
>>>> >> POD6-Mongodb03:/tmp # pstack 33316
>>>> >> #0  0x00002b8449e35f4d in __lll_lock_wait () from
>>>> >> /lib64/libpthread.so.0
>>>> >> #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
>>>> >> #2  0x00002b8449e31c08 in pthread_mutex_lock () from
>>>> >> /lib64/libpthread.so.0
>>>> >> #3  0x00002b84554c44ab in pq_abort () from
>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so
>>>> >> #4  0x00002b84554c955e in psyco_conn_rollback () from
>>>> >> /python2.7/site-packages/psycopg2/_psycopg.so
>>>> >> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from
>>>> >> /lib64/libpython2.7.so.1.0
>>>> >> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from
>>>> >> /lib64/libpython2.7.so.1.0
>>>> >>
>>>> >> psycopg2 was trying to close the cursor and to take the mutex via
>>>> >> pthread_mutex_lock, but it seems that the cursor was being used by
>>>> >> another session.
>>>> >>
>>>> >>
>>>> >> Questions:
>>>> >>
>>>> >> 1. What does the error "ProgrammingError: close cannot be used while
>>>> >> an asynchronous query is underway" mean?
>>>> >> AFAIK, this is raised by psycopg2 and means an asynchronous query was
>>>> >> already in progress on that connection.
>>>> >> But I think sqlalchemy should be green-thread safe here, since
>>>> >> psycopg2 is patched by eventlet; see details in
>>>> >> eventlet/support/psycopg2_patcher.py
>>>> >> And we can see different green thread numbers in the log, e.g.:
>>>> >> [pid:36365] [MainThread] [tid:122397712]
>>>> >> [pid:36365] [MainThread] [tid:122397815]
>>>> >> So, I guess the connection pool in one process is safe.
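
For context, a simplified sketch of what that eventlet patcher installs: a
wait callback that polls the libpq socket and yields to the hub instead of
blocking (names and error handling simplified; see the eventlet source for
the real version):

import psycopg2
from psycopg2 import extensions
from eventlet.hubs import trampoline

def green_wait_callback(conn, timeout=-1):
    # Poll the connection and trampoline (yield to the eventlet hub)
    # until libpq reports the async operation is complete.
    while True:
        state = conn.poll()
        if state == extensions.POLL_OK:
            break
        elif state == extensions.POLL_READ:
            trampoline(conn.fileno(), read=True)
        elif state == extensions.POLL_WRITE:
            trampoline(conn.fileno(), write=True)
        else:
            raise psycopg2.OperationalError(
                "Bad result from poll: %r" % state)

extensions.set_wait_callback(green_wait_callback)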
>>>> >>
>>>> >> 2. nova-conductor is a multi-threaded python service which forks
>>>> >> several child processes.
>>>> >> ps -elf|grep -i nova-conductor
>>>> >> 30878  1 pool_s /usr/bin/nova-conductor
>>>> >> 36364  1 ep_pol /usr/bin/nova-conductor
>>>> >> 36365  1 futex_ /usr/bin/nova-conductor
>>>> >> 36366  1 ep_pol /usr/bin/nova-conductor
>>>> >> 36367  1 ep_pol /usr/bin/nova-conductor
>>>> >> 36368  1 ep_pol /usr/bin/nova-conductor
>>>> >>
>>>> >> If nova-conductor is started with only one child, the problem does
>>>> >> not happen.
>>>> >> Does this mean the connection/engine CANNOT be shared between these
>>>> >> child processes?
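
For reference, pooled DBAPI connections are not meant to be shared across
an os.fork(), and the SQLAlchemy docs of that era include a "pid guard"
recipe for exactly this: connections created in another process are
invalidated at checkout.  A sketch along the lines of that recipe:

import os

from sqlalchemy import event, exc

def add_engine_pidguard(engine):
    """Invalidate pooled connections that were created in another process."""

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        # remember which process created this DBAPI connection
        connection_record.info["pid"] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info["pid"] != pid:
            # connection was created in a different (parent) process;
            # discard it and let the pool open a fresh one
            connection_record.connection = connection_proxy.connection = None
            raise exc.DisconnectionError(
                "Connection record belongs to pid %s, attempting to check "
                "out in pid %s" % (connection_record.info["pid"], pid)
            )

It would be called on the engine right after creation, before any workers
are forked.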
>>>> >>
>>>> >> Thanks.
>>>> >
>>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.
