Hi Michael,

Thanks for the reply. It is a pity that OpenStack supports MySQL only. As for this problem, I will dig into the code more first, and will let you know if I find anything.
Thanks again!
--Jinrong

On Monday, November 13, 2017 at 10:32:21 PM UTC+8, Mike Bayer wrote:
> On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai <cai...@gmail.com> wrote:
> > Hi Michael,
> >
> > Thanks for your detailed answers.
> >
> > 1. We can see several transactions hang during the failover, including:
> >    "SELECT 1"
> >    "SELECT xxx FROM instances"
> >    and we even see transactions hang after "UPDATE services SET xxxx".
> >    So I am not sure whether the problem is related to connection
> >    invalidation in the pool.
> >
> > 2. I have tried disabling the connection pool by using NullPool.
> >    With that, failover works, and there are no long transactions on the
> >    DB side afterwards. But in my understanding, if we disable the
> >    connection pool this way, the database will suffer poor performance.
> >
> > 3. For the nova workers, I did not see anything in the log indicating
> >    that the failed connection object was shared with other nova
> >    processes (PIDs).
> >
> > BTW, on nova-conductor we can see the socket receive queue growing;
> > the response from the DB was not being consumed by nova-conductor:
> >
> > Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
> > tcp   69     0      nova-conductor_ip:60132  pg_ip:5432       ESTABLISHED  36365/python2.7-nova-conductor
> >
> > So now I suspect the long transaction, which is neither committed nor
> > rolled back, is caused by the connection pool.
> > I am wondering what situation could cause the nova client not to
> > receive the response, and what information I should collect to dig
> > further.
>
> this is really more of an openstack integration issue and it should be
> taken up by the Nova team first, as it would involve being able to
> reproduce your problem and digging into the architecture to see where
> the engine might be being misused.
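[As an editorial aside: the "disable the pool with NullPool" experiment from point 2 above can be sketched like this. The `sqlite://` URL is a stand-in so the sketch runs anywhere; the real deployment would use the `postgresql+psycopg2://` DSN behind the VIP.]

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# stand-in URL so the sketch is self-contained; the real case would be
# the postgresql+psycopg2:// DSN pointing at the VIP
db_url = "sqlite://"

# NullPool opens a brand-new DBAPI connection on every checkout and
# closes it on return, so no stale connection can survive a failover --
# at the cost of a fresh connect for every use (the "poor performance"
# concern above)
engine = create_engine(db_url, poolclass=NullPool)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # opens and closes a real connection
```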
> Unfortunately you might find it difficult to get support because this
> is Postgresql. Even though it's historically been "supported", I'm sure
> you're aware they've been trying for some time to find a way for
> Postgresql to not really be supported, and that may be the case today,
> even though IIUC they've still left the door open for alternative
> backends (because there is always pressure to support other backends).
>
> If you're truly looking for a stable and performant Openstack without
> being deeply involved in fixing issues, I'd switch to MySQL, because
> that's really the only backend that has widespread testing and support.
> OTOH, if you're involved with development of Nova against Postgresql,
> then filing a launchpad issue with steps to reproduce would be a good
> start (e.g. the rally job that you're running, something like that).
>
> > Thanks for your help again.
> > --Jinrong.
> >
> > On Monday, November 13, 2017 at 11:42:13 AM UTC+8, Mike Bayer wrote:
> >>
> >> On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <cai...@gmail.com> wrote:
> >> > Hi Michael,
> >> >
> >> > I am using openstack with postgresql, in which the sqlalchemy and
> >> > oslo_db modules are used.
> >> > There are some problems after my pg database switched over.
> >> >
> >> > Here is my switchover process:
> >> > 1. nova-conductor (a Python application) is running with its DB
> >> >    connection string pointing to a VIP, which is on the primary
> >> >    site (A) of pg.
> >> > 2. Switch the VIP from the primary (A) to the new primary (B).
> >> > 3. Switch over pg: shut down the primary (A), promote the standby
> >> >    (B) to new primary.
> >> > 4. nova-conductor keeps running during the whole process.
> >> >
> >> > After some seconds, I found some nova-conductor processes hang in
> >> > futex_wait_queue_me, and the status of the query on the DB side is
> >> > "idle in transaction"; the transaction was neither committed nor
> >> > rolled back!
> >> > I think disconnection is handled in oslo_db, which sends a ping
> >> > ("SELECT 1") to the DB.
> >> > If the DB is switched over, the connections in the pool should be
> >> > marked invalid and reconnect on the next checkout.
> >> >
> >> > ### Error messages from nova-conductor
> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> >> > [exc_filters.py:330 _raise_for_remaining_DBAPIError]
> >> > [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
> >> > (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
> >> > query is underway [SQL: 'SELECT 1']
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
> >> >   cannot be used while an asynchronous query is underway
> >> > 36365 ERROR oslo_db.sqlalchemy.exc_filters
> >> > localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> >> > [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
> >> > 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
> >> > 36365 ERROR sqlalchemy.pool.QueuePool   File
> >> >   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
> >> > 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
> >> > 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot
> >> >   be used while an asynchronous query is underway
> >> > 36365 ERROR sqlalchemy.pool.QueuePool
> >> >
> >> > ### ps status of nova-conductor
> >> > POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
> >> > [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
> >> > [<ffffffff810e5799>] futex_wait+0x179/0x280
> >> > [<ffffffff810e782e>] do_futex+0xfe/0x5b0
> >> > [<ffffffff810e7d60>] SyS_futex+0x80/0x180
> >> > [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
> >> > [<ffffffffffffffff>] 0xffffffffffffffff
> >> >
> >> > ### Stack of the nova-conductor process
> >> > POD6-Mongodb03:/tmp # pstack 33316
> >> > #0 0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
> >> > #1 0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
> >> > #2 0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> >> > #3 0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
> >> > #4 0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
> >> > #5 0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> >> > #6 0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> >> >
> >> > psycopg2 was trying to close the cursor and acquire the mutex via
> >> > pthread_mutex_lock, but it seems the cursor was in use by another
> >> > session.
> >> >
> >> > Questions:
> >> >
> >> > 1. What does the error "ProgrammingError: close cannot be used
> >> > while an asynchronous query is underway" mean?
> >> > AFAIK, it is raised by psycopg2 and means an asynchronous query was
> >> > executed on one connection.
> >> > But I think sqlalchemy is thread-safe here, since it was patched by
> >> > eventlet; see the details in eventlet/support/psycopg2_patcher.py.
> >>
> >> it's kind of news to me that postgresql/eventlet is even doing the
> >> psyco-green monkeypatch, but this seems to have been the case for a
> >> long time since eventlet 0.9.8.
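[Editorial aside: the oslo_db "SELECT 1" ping mentioned above is essentially SQLAlchemy's classic pessimistic disconnect-handling recipe, sketched below. The `sqlite://` URL is a stand-in for the pg DSN; under the psycopg2 green patch this ping is exactly the statement that fails to raise a clean error.]

```python
from sqlalchemy import create_engine, event, text
from sqlalchemy.exc import DisconnectionError

engine = create_engine("sqlite://")  # stand-in URL for the pg DSN

@event.listens_for(engine, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # run a cheap statement on every pool checkout; if the connection is
    # dead, raise DisconnectionError so the pool discards it and retries
    # the checkout with a fresh connection
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise DisconnectionError()
    finally:
        cursor.close()

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # the checkout listener pinged first
```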
> >> So what we're seeing is just a poor failure mode, which is helped
> >> along by the awkward limitations of the async connection, where the
> >> "SELECT 1" ping we do to test liveness of the connection is not able
> >> to raise a proper error code.
> >>
> >> Right off I will note that "moving the VIP" is not adequate for a
> >> clean HA switchover; we've had lots of problems doing things that
> >> way, including specifically that a connection that goes over the VIP
> >> has a very hard time detecting that it's no longer "open" when you
> >> switch the VIP under it. In the MySQL world we've had to modify the
> >> socket with an additional parameter, "source_address"
> >> (https://docs.python.org/2/library/socket.html#socket.create_connection),
> >> to help with VIP switchovers; I'm not sure what options psycopg2
> >> offers in this regard. But overall we don't switch VIPs around; we
> >> use HAProxy to do smooth switchovers and load balancing.
> >>
> >> Another option here would be to not use connection pooling (which
> >> unfortunately is not an option that oslo.db exposes at this time,
> >> though it could/should be added). Or, what you could do is set
> >> pool_timeout (the lifetime of a connection) to be very low; assuming
> >> these errors involve stale connections only, that would limit how
> >> long they last.
> >>
> >> Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly
> >> recommend running it that way. I've been pushing for years to get
> >> openstack off of eventlet entirely, as it's entirely useless for the
> >> architecture of openstack.
> >>
> >> > And we can see different green thread numbers in the log, e.g.:
> >> > [pid:36365] [MainThread] [tid:122397712]
> >> > [pid:36365] [MainThread] [tid:122397815]
> >> > So I guess the connection pool within one process is safe.
> >> >
> >> > 2. nova-conductor is a multi-threaded Python client, which forked
> >> > several child processes.
> >> > ps -elf | grep -i nova-conductor
> >> > 30878 1 pool_s /usr/bin/nova-conductor
> >> > 36364 1 ep_pol /usr/bin/nova-conductor
> >> > 36365 1 futex_ /usr/bin/nova-conductor
> >> > 36366 1 ep_pol /usr/bin/nova-conductor
> >> > 36367 1 ep_pol /usr/bin/nova-conductor
> >> > 36368 1 ep_pol /usr/bin/nova-conductor
> >> >
> >> > If nova-conductor is started with only one child, the problem does
> >> > not happen.
> >> > Does this mean the connection/engine CANNOT be shared among these
> >> > child processes?
> >>
> >> nova starts a lot of workers but they should not be starting up the
> >> engine in the parent process. if you can identify that nova is
> >> sending a filehandle from the parent process to multiple child
> >> processes then please report a nova issue on launchpad; that has to
> >> be fixed ASAP.
> >>
> >> > Thanks.
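[Editorial aside: the rule Mike describes is SQLAlchemy's standard guidance for forked workers: if an Engine exists before fork(), each child should dispose it so it does not reuse the parent's pooled sockets. A minimal POSIX sketch, with `sqlite://` standing in for the pg DSN:]

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in URL; created in the parent

pid = os.fork()
if pid == 0:
    # child: throw away any pooled connections inherited from the parent;
    # the pool repopulates lazily with connections owned by this process
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)
    # parent keeps using its own connections as before
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
```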
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.