Hi Michael,

Thanks for the reply. It is a pity that OpenStack supports MySQL only. As for this problem, I will dig into the code more first, and will let you know if I find anything.
Thanks again!
--Jinrong

On Monday, November 13, 2017 at 10:32:21 PM UTC+8, Mike Bayer wrote:
> On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai <cai...@gmail.com> wrote:
> > Hi Michael,
> >
> > Thanks for your detailed answers.
> >
> > 1. We can see several transactions hang during the failover, including:
> >    "SELECT 1"
> >    "SELECT xxx FROM instances"
> >    and we even see transactions hang after "UPDATE services SET xxxx".
> >    So I am not sure whether the problem is related to connection
> >    invalidation in the pool.
> >
> > 2. I have tried disabling the connection pool by using NullPool.
> >    With that, failover works, and there are no long transactions on the
> >    DB side afterwards. But in my understanding, if we disable the
> >    connection pool this way, the database will suffer poor performance.
> >
> > 3. For the nova workers, I did not see anything in the log indicating
> >    that the failed connection object was shared with other nova
> >    processes (PIDs).
> >
> > BTW, on nova-conductor we can see the socket receive queue growing;
> > the response from the DB was not being consumed by nova-conductor:
> >
> > Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
> > tcp   69     0      nova-conductor_ip:60132  pg_ip:5432       ESTABLISHED  36365/python2.7-nova-conductor
> >
> > So now I suspect the long transaction, which is neither committed nor
> > rolled back, is caused by the connection pool.
> > I am wondering what situation could cause the nova client not to
> > receive the response, and what information I should collect to dig
> > further.
>
> this is really more of an openstack integration issue and it should be
> taken up by the Nova team first, as it would involve being able to
> reproduce your problem and digging into the architecture to see where
> the engine might be being misused.
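[As an editorial aside: the "disable the pool with NullPool" experiment from point 2 above can be sketched like this. The `sqlite://` URL is a stand-in so the sketch runs anywhere; the real deployment would use the `postgresql+psycopg2://` DSN behind the VIP.]

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# stand-in URL so the sketch is self-contained; the real case would be
# the postgresql+psycopg2:// DSN pointing at the VIP
db_url = "sqlite://"

# NullPool opens a brand-new DBAPI connection on every checkout and
# closes it on return, so no stale connection can survive a failover --
# at the cost of a fresh connect for every use (the "poor performance"
# concern above)
engine = create_engine(db_url, poolclass=NullPool)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # opens and closes a real connection
```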
> Unfortunately you might find it difficult to get support because this
> is Postgresql. Even though it's historically been "supported", I'm sure
> you're aware they've been trying for some time to find a way for
> Postgresql to not really be supported, and that may be the case today,
> even though IIUC they've still left the door open for alternative
> backends (because there is always pressure to support other backends).
>
> If you're truly looking for a stable and performant Openstack without
> being deeply involved in fixing issues, I'd switch to MySQL, because
> that's really the only backend that has widespread testing and support.
> OTOH, if you're involved with development of Nova against Postgresql,
> then filing a launchpad issue with steps to reproduce would be a good
> start (e.g. the rally job that you're running, something like that).
>
> > Thanks for your help again.
> > --Jinrong.
> >
> > On Monday, November 13, 2017 at 11:42:13 AM UTC+8, Mike Bayer wrote:
> >>
> >> On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <cai...@gmail.com> wrote:
> >> > Hi Michael,
> >> >
> >> > I am using openstack with postgresql, in which the sqlalchemy and
> >> > oslo_db modules are used.
> >> > There are some problems after my pg database switched over.
> >> >
> >> > Here is my switchover process:
> >> > 1. nova-conductor (a Python application) is running with its DB
> >> >    connection string pointing to a VIP, which is on the primary
> >> >    site (A) of pg.
> >> > 2. Switch the VIP from the primary (A) to the new primary (B).
> >> > 3. Switch over pg: shut down the primary (A), promote the standby
> >> >    (B) to new primary.
> >> > 4. nova-conductor keeps running during the whole process.
> >> >
> >> > After some seconds, I found some nova-conductor processes hang in
> >> > futex_wait_queue_me, and the status of the query on the DB side is
> >> > "idle in transaction"; the transaction was neither committed nor
> >> > rolled back!
> >> > I think disconnection is handled in oslo_db, which sends a ping
> >> > ("SELECT 1") to the DB.
> >> > If the DB is switched over, the connections in the pool should be
> >> > marked invalid and reconnect on the next checkout.
> >> >
> >> > ### Error messages from nova-conductor
> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> >> > [exc_filters.py:330 _raise_for_remaining_DBAPIError]
> >> > [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
> >> > (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
> >> > query is underway [SQL: 'SELECT 1']
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
> >> >   cannot be used while an asynchronous query is underway
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters
> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> >> > [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
> >> > 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
> >> > 36365 ERROR sqlalchemy.pool.QueuePool   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
> >> > 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
> >> > 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot
> >> >   be used while an asynchronous query is underway
> >> > 36365 ERROR sqlalchemy.pool.QueuePool
> >> >
> >> > ### ps status of nova-conductor
> >> > POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
> >> > [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
> >> > [<ffffffff810e5799>] futex_wait+0x179/0x280
> >> > [<ffffffff810e782e>] do_futex+0xfe/0x5b0
> >> > [<ffffffff810e7d60>] SyS_futex+0x80/0x180
> >> > [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > ### Stack of the nova-conductor process
> >> > POD6-Mongodb03:/tmp # pstack 33316
> >> > #0 0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
> >> > #1 0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
> >> > #2 0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> >> > #3 0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
> >> > #4 0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
> >> > #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> >> > #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> >> >
> >> > psycopg2 was trying to close the cursor and acquire the mutex via
> >> > pthread_mutex_lock, but it seems the cursor was in use by another
> >> > session.
> >> >
> >> > Questions:
> >> >
> >> > 1. What does the error "ProgrammingError: close cannot be used
> >> > while an asynchronous query is underway" mean?
> >> > AFAIK, it is raised by psycopg2 and means an asynchronous query was
> >> > executed on one connection.
> >> > But I think sqlalchemy is thread-safe here, since it was patched by
> >> > eventlet; see the details in eventlet/support/psycopg2_patcher.py.
> >>
> >> it's kind of news to me that postgresql/eventlet is even doing the
> >> psyco-green monkeypatch, but this seems to have been the case for a
> >> long time since eventlet 0.9.8.
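[Editorial aside: the oslo_db "SELECT 1" ping mentioned above is essentially SQLAlchemy's classic pessimistic disconnect-handling recipe, sketched below. The `sqlite://` URL is a stand-in for the pg DSN; under the psycopg2 green patch this ping is exactly the statement that fails to raise a clean error.]

```python
from sqlalchemy import create_engine, event, text
from sqlalchemy.exc import DisconnectionError

engine = create_engine("sqlite://")  # stand-in URL for the pg DSN

@event.listens_for(engine, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # run a cheap statement on every pool checkout; if the connection is
    # dead, raise DisconnectionError so the pool discards it and retries
    # the checkout with a fresh connection
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise DisconnectionError()
    finally:
        cursor.close()

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # the checkout listener pinged first
```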
> >> So what we're seeing is just a poor failure mode, which is helped
> >> along by the awkward limitations of the async connection, where the
> >> "SELECT 1" ping we do to test liveness of the connection is not able
> >> to raise a proper error code.
> >>
> >> Right off I will note that "moving the VIP" is not adequate for a
> >> clean HA switchover; we've had lots of problems doing things that
> >> way, including specifically that a connection that goes over the VIP
> >> has a very hard time detecting that it's no longer "open" when you
> >> switch the VIP under it. In the MySQL world we've had to modify the
> >> socket with an additional parameter, "source_address"
> >> (https://docs.python.org/2/library/socket.html#socket.create_connection),
> >> to help with VIP switchovers; I'm not sure what options psycopg2
> >> offers in this regard. But overall we don't switch VIPs around; we
> >> use HAProxy to do smooth switchovers and load balancing.
> >>
> >> Another option here would be to not use connection pooling (which
> >> unfortunately is not an option that oslo.db exposes at this time,
> >> though it could/should be added). Or, what you could do is set
> >> pool_timeout (the lifetime of a connection) to be very low; assuming
> >> these errors involve stale connections only, that would limit how
> >> long they last.
> >>
> >> Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly
> >> recommend running it that way. I've been pushing for years to get
> >> openstack off of eventlet entirely, as it's entirely useless for the
> >> architecture of openstack.
> >>
> >> > And we can see different green thread numbers in the log, e.g.:
> >> > [pid:36365] [MainThread] [tid:122397712]
> >> > [pid:36365] [MainThread] [tid:122397815]
> >> > So I guess the connection pool within one process is safe.
> >> >
> >> > 2. nova-conductor is a multi-threaded Python client, which forked
> >> > several child processes.
> >> > ps -elf | grep -i nova-conductor
> >> > 30878 1 pool_s /usr/bin/nova-conductor
> >> > 36364 1 ep_pol /usr/bin/nova-conductor
> >> > 36365 1 futex_ /usr/bin/nova-conductor
> >> > 36366 1 ep_pol /usr/bin/nova-conductor
> >> > 36367 1 ep_pol /usr/bin/nova-conductor
> >> > 36368 1 ep_pol /usr/bin/nova-conductor
> >> >
> >> > If nova-conductor is started with only one child, the problem does
> >> > not happen.
> >> > Does this mean the connection/engine CANNOT be shared among these
> >> > child processes?
> >>
> >> nova starts a lot of workers but they should not be starting up the
> >> engine in the parent process. if you can identify that nova is
> >> sending a filehandle from the parent process to multiple child
> >> processes then please report a nova issue on launchpad; that has to
> >> be fixed ASAP.
> >>
> >> > Thanks.
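[Editorial aside: the rule Mike describes is SQLAlchemy's standard guidance for forked workers: if an Engine exists before fork(), each child should dispose it so it does not reuse the parent's pooled sockets. A minimal POSIX sketch, with `sqlite://` standing in for the pg DSN:]

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in URL; created in the parent

pid = os.fork()
if pid == 0:
    # child: throw away any pooled connections inherited from the parent;
    # the pool repopulates lazily with connections owned by this process
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # parent keeps using its own connections as before
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
```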
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.