Re: [sqlalchemy] Re: long transaction after database switched over

jiajunsu . zju Mon, 07 May 2018 04:02:37 -0700

We added a coroutine_id to the psycopg2 debug logs and found that two 
coroutines use the same connection before the pthread_mutex_lock is released.


Maybe something is going wrong in the connection pool? The log format is:
[pid] [coroutine_id] msg

[49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320

[49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0

[49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00

[49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2

[49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1

Below are the logs, grepped by coroutine id:

[49174] [0xa5db730]finish send query, before psyco_wait, conn 0x94122f0

[49174] [0xa5db730]before have_wait_callback conn 0x94122f0

[49174] [0xa5db730]before PyObject_CallFunctionObjArgs conn 0x94122f0, cb 0x23ad320

[49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0

[49174] [0xa5db730]conn_poll: async_status = ASYNC_WRITE 0x94122f0

[49174] [0xa5db730]conn_poll: poll writing

[49174] [0xa5db730]conn_poll: async_status -> ASYNC_READ

-----

[49174] [0xa5db4b0]psyco_conn_cursor: new unnamed cursor for connection at 0x8de2d30

[49174] [0xa5db4b0]cursor_setup: init cursor object at 0xa6c2650

[49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30

[49174] [0xa5db4b0]cursor_setup: good cursor object at 0xa6c2650, refcnt = 1

[49174] [0xa5db4b0]psyco_conn_cursor: new cursor at 0xa6c2650: refcnt = 1

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00

[49174] [0xa5db4b0]before EXC_IF_ASYNC_IN_PROGRESS conn 0x94122f0, async_cursor 0x881ac00

[49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0, status = 2

[49174] [0xa5db4b0]before lock pgconn = 0x94122f0, owner 49174 __lock 1
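
The check done by hand above (grepping by coroutine id and comparing connection addresses) can be sketched in Python. This is only an illustrative helper, not part of psycopg2 or SQLAlchemy: the names LINE_RE, CONN_RE and shared_connections are made up here, and the regular expressions assume the "[pid] [coroutine_id] msg" format and the "conn 0x..." / "pgconn = 0x..." spellings seen in the debug output above.

```python
import re
from collections import defaultdict

# Assumed prefix format from the debug output above: "[pid] [coroutine_id]msg".
LINE_RE = re.compile(r"^\[(?P<pid>\d+)\]\s*\[(?P<coro>0x[0-9a-f]+)\]\s*(?P<msg>.*)$")

# Assumed spellings of a connection address: "conn 0x...", "conn = 0x...",
# "pgconn = 0x...". Words such as "conn_poll" or "connection" do not match.
CONN_RE = re.compile(r"\b(?:pg)?conn(?:\s*=)?\s+(0x[0-9a-f]+)")


def shared_connections(lines):
    """Return {connection address: set of coroutine ids} for every
    connection that is mentioned by more than one coroutine."""
    users = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # continuation line or unrelated output
        for conn in CONN_RE.findall(match.group("msg")):
            users[conn].add(match.group("coro"))
    return {conn: coros for conn, coros in users.items() if len(coros) > 1}


if __name__ == "__main__":
    sample = [
        "[49174] [0xa5db730]conn_poll: status = 2, conn 0x94122f0",
        "[49174] [0xa5db4b0]cursor_setup: parameters: name = (null), conn = 0x8de2d30",
        "[49174] [0xa5db4b0]pq_abort: enter pgconn = 0x94122f0, autocommit = 0,",
    ]
    for conn, coros in shared_connections(sample).items():
        print(conn, sorted(coros))
```

Run against the excerpt above, this flags 0x94122f0 as touched by both 0xa5db730 and 0xa5db4b0, while 0x8de2d30 is only seen from 0xa5db4b0.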

On Saturday, April 28, 2018 at 4:07:34 PM UTC+8, [email protected] wrote:
>
> We reproduced this problem with extra logging added to psycopg2, found 
> something confusing, and reported it to psycopg2 [1].
>
> The call flow between sqlalchemy and psycopg2 seems to be:
> 1. sqlalchemy calls into psycopg2: pq_execute/pq_commit
> 2. psycopg2 calls back into sqlalchemy: PyWeakref_NewRef(conn)
> 3. sqlalchemy gets an exception and calls do_rollback
> 4. sqlalchemy calls into psycopg2: pq_abort
> 5. psycopg2 deadlocks in pthread_mutex_lock
>
> To reproduce it, we stop the master pg-server and promote the slave 
> pg-server to master, moving the FIP from the old master to the slave. At 
> the same time, we have nova-conductor issue a large number of DB queries.
>
> [1] https://github.com/psycopg/psycopg2/issues/703
>
> On Monday, April 23, 2018 at 9:45:04 PM UTC+8, Mike Bayer wrote:
>>
>> On Mon, Apr 23, 2018 at 9:03 AM,  <[email protected]> wrote: 
>> > Sorry to reply on this old topic. 
>> > 
>> > We recently hit the same problem in our production environment. 
>> > 
>> > I found a patch in another library [1], where they added conn.close() 
>> > in the handler for the psycopg2.ProgrammingError exception. 
>> > 
>> > Shall we do the same in [2]? 
>>
>> SQLAlchemy does things much more carefully than that: we parse the 
>> error message for specific ones that correspond to "connection is no 
>> longer usable". We call these "is_disconnect", but it can be any 
>> invalidating condition. 
>>
>> You can make these yourself, and they can also be made to be part of 
>> oslo.db, using the handle_error event: 
>>
>> http://docs.sqlalchemy.org/en/latest/core/events.html?highlight=handle_error#sqlalchemy.events.ConnectionEvents.handle_error
>>
>> within oslo.db you would want to propose a change here: 
>>
>> https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L387
>>
>> > 
>> > [1] https://github.com/aio-libs/aiopg/pull/415/files?diff=split 
>> > 
>> > [2] https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/engine/base.py#L1289
>> > 
>> > 
>> > On Monday, November 13, 2017 at 10:44:31 AM UTC+8, JinRong Cai wrote: 
>> >> 
>> >> Hi Michael, 
>> >> 
>> >> I am using OpenStack with PostgreSQL, accessed through the sqlalchemy 
>> >> and oslo_db modules. 
>> >> There are some problems after my pg database is switched over. 
>> >> 
>> >> Here is my switchover process: 
>> >> 1. nova-conductor (a Python application) is running with its DB 
>> >> connection string pointing to the VIP, which is on the primary site (A) 
>> >> of pg. 
>> >> 2. Switch the VIP from the primary (A) to the new primary (B). 
>> >> 3. Switch over pg: shut down the primary (A) and promote the standby (B) 
>> >> to new primary. 
>> >> 4. nova-conductor keeps running during the whole process. 
>> >> 
>> >> After some seconds, I found that some nova-conductor processes were hung 
>> >> with status futex_wait_queue_me, and the status of the query in the DB 
>> >> was "idle in transaction"; the transaction was neither committed nor 
>> >> rolled back! 
>> >> I think disconnection is handled in oslo_db, which sends a 
>> >> ping (select 1) to the DB. 
>> >> 
>> >> If the DB is switched over, the connection in the pool should be marked 
>> >> invalid and reconnect on the next checkout. 
>> >> 
>> >> ###error messages from nova-conductor 
>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [exc_filters.py:330 _raise_for_remaining_DBAPIError] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from (psycopg2.ProgrammingError) execute cannot be used while an asynchronous query is underway [SQL: 'SELECT 1'] 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last): 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters     context) 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters) 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute cannot be used while an asynchronous query is underway 
>> >>   36365 ERROR oslo_db.sqlalchemy.exc_filters 
>> >> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712] [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor 
>> >>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last): 
>> >>   36365 ERROR sqlalchemy.pool.QueuePool   File "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor 
>> >>   36365 ERROR sqlalchemy.pool.QueuePool     cursor.close() 
>> >>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be used while an asynchronous query is underway 
>> >>   36365 ERROR sqlalchemy.pool.QueuePool 
>> >> 
>> >> ###ps status of nova-conductor 
>> >> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack 
>> >> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120 
>> >> [<ffffffff810e5799>] futex_wait+0x179/0x280 
>> >> [<ffffffff810e782e>] do_futex+0xfe/0x5b0 
>> >> [<ffffffff810e7d60>] SyS_futex+0x80/0x180 
>> >> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b 
>> >> [<ffffffffffffffff>] 0xffffffffffffffff 
>> >> 
>> >> ### stack of the nova-conductor process 
>> >> POD6-Mongodb03:/tmp # pstack 33316 
>> >> #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0 
>> >> #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0 
>> >> #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0 
>> >> #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so 
>> >> #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so 
>> >> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 
>> >> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 
>> >> 
>> >> psycopg2 was trying to close the cursor and acquire the mutex with 
>> >> pthread_mutex_lock, but it seems that the cursor was in use by another 
>> >> session. 
>> >> 
>> >> 
>> >> Questions: 
>> >> 
>> >> 1. What does the error "ProgrammingError: close cannot be used while an 
>> >> asynchronous query is underway" mean? 
>> >> AFAIK, it is raised by psycopg2 and means that an asynchronous query is 
>> >> already executing on the connection. 
>> >> But I think sqlalchemy is thread safe here since it is patched by 
>> >> eventlet; see eventlet/support/psycopg2_patcher.py for details. 
>> >> And we can see different green thread numbers in the log, such as: 
>> >> [pid:36365] [MainThread] [tid:122397712] 
>> >> [pid:36365] [MainThread] [tid:122397815] 
>> >> So, I guess the connection pool in one process is safe. 
>> >> 
>> >> 2. nova-conductor is a multi-threaded Python client that forks 
>> >> several child processes. 
>> >> ps -elf|grep -i nova-conductor 
>> >> 30878  1 pool_s /usr/bin/nova-conductor 
>> >> 36364  1 ep_pol /usr/bin/nova-conductor 
>> >> 36365  1 futex_ /usr/bin/nova-conductor 
>> >> 36366  1 ep_pol /usr/bin/nova-conductor 
>> >> 36367  1 ep_pol /usr/bin/nova-conductor 
>> >> 36368  1 ep_pol /usr/bin/nova-conductor 
>> >> 
>> >> If nova-conductor is started with only one child, the problem does 
>> >> not happen. 
>> >> Does this mean the connection/engine CANNOT be shared among these child 
>> >> processes? 
>> >> 
>> >> Thanks. 
>> > 
>> > -- 
>> > SQLAlchemy - 
>> > The Python SQL Toolkit and Object Relational Mapper 
>> > 
>> > http://www.sqlalchemy.org/ 
>> > 
>> > To post example code, please provide an MCVE: Minimal, Complete, and 
>> > Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
>> > description. 
>> > --- 
>> > You received this message because you are subscribed to the Google 
>> Groups 
>> > "sqlalchemy" group. 
>> > To unsubscribe from this group and stop receiving emails from it, send 
>> an 
>> > email to [email protected]. 
>> > To post to this group, send email to [email protected]. 
>> > Visit this group at https://groups.google.com/group/sqlalchemy. 
>> > For more options, visit https://groups.google.com/d/optout. 
>>
>
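
For reference, the handle_error approach Mike Bayer describes in the quoted reply could look roughly like the sketch below. It is written for Python 3 and a current SQLAlchemy, not the Python 2.7 stack in the traces. The names ASYNC_BUSY_MARKER, looks_like_stuck_async and install_disconnect_filter are hypothetical, and treating this particular ProgrammingError as a disconnect is an assumption that nobody in this thread has confirmed to be safe.

```python
# Substring taken from the psycopg2 errors quoted in this thread
# ("execute cannot be used ...", "close cannot be used ...").
ASYNC_BUSY_MARKER = "cannot be used while an asynchronous query is underway"


def looks_like_stuck_async(exc):
    """Return True if the DBAPI exception text carries the marker above."""
    return ASYNC_BUSY_MARKER in str(exc)


def install_disconnect_filter(engine):
    """Register a handle_error listener on the given Engine.

    Assumption: a connection that reports this error is no longer usable,
    so SQLAlchemy is asked to treat the error as a disconnect.
    """
    from sqlalchemy import event  # imported lazily; requires SQLAlchemy

    @event.listens_for(engine, "handle_error")
    def _mark_disconnect(context):
        # context is an ExceptionContext; original_exception is the raw
        # DBAPI error, is_disconnect may be set by the listener.
        if looks_like_stuck_async(context.original_exception):
            context.is_disconnect = True

    return _mark_disconnect
```

It would be installed once per engine, e.g. install_disconnect_filter(engine) right after create_engine(). Within oslo.db, the equivalent change would go into exc_filters.py as suggested in the quoted reply, rather than into application code.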

