We added coroutine_id in psycopg2, and found that two coroutine use the same connection before the pthread_mutex_lock released.
Maybe something in the connection pool goes wrong? [pid] [coroutine_id] msg [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320 [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00 [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00 [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2 [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1 Below is the logs grep by coroutine id: [49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0 [49174] [0xa5db730]before have_wait_callback conn 0x94122f0 [49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320 [49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0 [49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0 [49174] [0xa5db730]conn_poll: poll writing [49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ ----- [49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 0x8de2d30 [49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650 [49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30 [49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1 [49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1 [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00 [49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00 [49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2 [49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1 在 2018年4月28日星期六 UTC+8下午4:07:34,[email protected]写道: > > We reproduced this problem and added logs in psycopg2, found a confusing > thing and reported it to psycopg2. > > It seems sqlalchemy & psycopg2 stack flow is: > 1.from sqlalchemy to psycopg: pq_execute/pq_commit > 2.from psycopg to sqlalchemy PyWeakref_NewRef(conn) > 3.get exception in sqlalchemy, and do_rollback > 4.from sqlalchemy to psycopg: pq_abort > 5.psycopg get pthread_mutex_lock deadlock > > What we have done to reproduce it is: stop master pg-server and promote > the slave pg-server to master, with moving FIP from old master to slave. At > the same time, let nova-conductor do quite a lot db query requests. > > [1] https://github.com/psycopg/psycopg2/issues/703 > > 在 2018年4月23日星期一 UTC+8下午9:45:04,Mike Bayer写道: >> >> On Mon, Apr 23, 2018 at 9:03 AM, <[email protected]> wrote: >> > Sorry for reply on this topic. >> > >> > We recently get the same problem in our production environment. >> > >> > I found a patch in other lib [1], and they added conn.close() in >> exception >> > psycopg2.ProgrammingError. >> > >> > Shall we do the same in [2] ? >> >> SQLAlchemy does things much more carefully than that, we parse the >> error message for specific ones that correspond to "connection is no >> longer usable", we call these "is_disconnect" but it can be any >> invalidating condition. >> >> You can make these yourself, and they can also be made to be part of >> oslo.db, using the handle_error event: >> >> >> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error >> >> >> within oslo.db you would want to propose a change here: >> >> >> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387 >> >> >> >> >> >> >> > >> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split >> > >> > [2] >> > >> https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289 >> >> > >> > >> > 在 2017年11月13日星期一 UTC+8上午10:44:31,JinRong Cai写道: >> >> >> >> Hi Michael , >> >> >> >> I am using openstack with postgresql which sqlalchemy and oslo_db >> module >> >> were used. >> >> And there are some problems after my pg database switched over. >> >> >> >> Here is my switch over process: >> >> 1. nova-conductor(python application) is running with DB connection >> >> strings point to vip , which is in primary site(A) of pg. >> >> 2. switch VIP from primary(A) to new primary(B) >> >> 3. switch over pg: shutdown primary(A), promopt standby(B) to new >> primary. >> >> 4. nova-conductor is running in the whole process. >> >> >> >> After some seconds, I found some nova-conductor processes are hang >> with >> >> status futex_wait_queue_me, and the status of the query in DB is "idle >> in >> >> transaction", the transaction was not commited or rollbacked! >> >> I think disconnection was handled in the oslo_db, which will send a >> >> ping(select 1) to DB. >> >> >> >> If DB was switchd over, the connection in the pool would be set with >> >> status invalid, and reconnect after next check out. >> >> >> >> ###error messages from nova-conductor >> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> [tid:122397712] >> >> [exc_filters.py:330 _raise_for_remaining_DBAPIError] >> >> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception >> wrapped from >> >> (psycopg2.ProgrammingError) execute cannot be used while an >> asynchronous >> >> query is underway [SQL: 'SELECT 1'] >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent >> call >> >> last): >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in >> >> _execute_context >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters context) >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters File >> >> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in >> >> do_execute >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> cursor.execute(statement, >> >> parameters) >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute >> >> cannot be used while an asynchronous query is underway >> >> 36365 ERROR oslo_db.sqlalchemy.exc_filters >> >> localhost nova-conductor ERROR [pid:36365] [MainThread] >> [tid:122397712] >> >> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error >> closing >> >> cursor >> >> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call >> last): >> >> 36365 ERROR sqlalchemy.pool.QueuePool File >> >> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in >> >> _safe_close_cursor >> >> 36365 ERROR sqlalchemy.pool.QueuePool cursor.close() >> >> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot >> be >> >> used while an asynchronous query is underway >> >> 36365 ERROR sqlalchemy.pool.QueuePool >> >> >> >> ###ps status of nova-conductor >> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # >> cat >> >> /proc/33316/stack >> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120 >> >> [<ffffffff810e5799>] futex_wait+0x179/0x280 >> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0 >> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180 >> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b >> >> [<ffffffffffffffff>] 0xffffffffffffffff >> >> >> >> ### stack of the nova-conductor process >> >> POD6-Mongodb03:/tmp # pstack 33316 >> >> #0 0x00002b8449e35f4d in __lll_lock_wait () from >> /lib64/libpthread.so.0 >> >> #1 0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0 >> >> #2 0x00002b8449e31c08 in pthread_mutex_lock () from >> >> /lib64/libpthread.so.0 >> >> #3 0x00002b84554c44ab in pq_abort () from >> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >> #4 0x00002b84554c955e in psyco_conn_rollback () from >> >> /python2.7/site-packages/psycopg2/_psycopg.so >> >> #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from >> >> /lib64/libpython2.7.so.1.0 >> >> #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from >> >> /lib64/libpython2.7.so.1.0 >> >> >> >> The psycopg2 was trying to close the cursor, and try to get the mutex >> lock >> >> "pthread_mutex_lock", but it seems that the cursor was used by other >> >> session. >> >> >> >> >> >> Questions: >> >> >> >> 1. What the error "ProgrammingError: close cannot be used while an >> >> asynchronous query is underway" mean? >> >> AFAIK, these caused by psycopg2, which means a asynchronous query was >> >> executed in one connection. >> >> But the I think the sqlalchemy was thread safe since it was patched by >> >> eventlet, see details in eventlet/support/psycopg2_patcher.py >> >> And we can see different green thread number in the log, as: >> >> [pid:36365] [MainThread] [tid:122397712] >> >> [pid:36365] [MainThread] [tid:122397815] >> >> So, I guess the connection pool in one process is safe. >> >> >> >> 2. The nova-conductor was a multi-thread python client, which forked >> >> several child process. >> >> ps -elf|grep -i nova-conductor >> >> 30878 1 pool_s /usr/bin/nova-conductor >> >> 36364 1 ep_pol /usr/bin/nova-conductor >> >> 36365 1 futex_ /usr/bin/nova-conductor >> >> 36366 1 ep_pol /usr/bin/nova-conductor >> >> 36367 1 ep_pol /usr/bin/nova-conductor >> >> 36368 1 ep_pol /usr/bin/nova-conductor >> >> >> >> If the nova-conductor was started with only one child, the problem was >> not >> >> happen. >> >> Does this mean the connection/engine CAN NOT shared in these child >> >> processes? >> >> >> >> Thanks. >> > >> > -- >> > SQLAlchemy - >> > The Python SQL Toolkit and Object Relational Mapper >> > >> > http://www.sqlalchemy.org/ >> > >> > To post example code, please provide an MCVE: Minimal, Complete, and >> > Verifiable Example. See http://stackoverflow.com/help/mcve for a full >> > description. >> > --- >> > You received this message because you are subscribed to the Google >> Groups >> > "sqlalchemy" group. >> > To unsubscribe from this group and stop receiving emails from it, send >> an >> > email to [email protected]. >> > To post to this group, send email to [email protected]. >> > Visit this group at https://groups.google.com/group/sqlalchemy. >> > For more options, visit https://groups.google.com/d/optout. >> > -- SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper http://www.sqlalchemy.org/ To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description. --- You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/sqlalchemy. For more options, visit https://groups.google.com/d/optout.
