On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <caij...@gmail.com> wrote:
> Hi Michael,
>
> I am using OpenStack with PostgreSQL; the sqlalchemy and oslo_db modules
> are used. There are some problems after my pg database was switched over.
>
> Here is my switchover process:
> 1. nova-conductor (a Python application) is running with its DB connection
>    string pointed at the VIP, which is at the primary site (A) of pg.
> 2. Switch the VIP from the primary (A) to the new primary (B).
> 3. Switch over pg: shut down the primary (A), promote the standby (B) to
>    new primary.
> 4. nova-conductor keeps running through the whole process.
>
> After some seconds, I found some nova-conductor processes hung with
> status futex_wait_queue_me, and the status of the query in the DB was
> "idle in transaction": the transaction was neither committed nor rolled
> back!
> I think disconnection is handled in oslo_db, which sends a ping
> ("SELECT 1") to the DB.
>
> If the DB is switched over, the connection in the pool should be marked
> invalid and reconnect on the next checkout.
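[Editor's note: the ping-and-invalidate behavior described above can be sketched with SQLAlchemy's classic pessimistic-disconnect recipe, a "checkout" pool event listener. This is an illustrative sketch only, not the actual oslo_db code; an in-memory SQLite URL stands in for the real PostgreSQL VIP connection string so the example is self-contained.]

```python
import sqlalchemy as sa
from sqlalchemy import event, exc

# Placeholder URL: the real deployment would use the PostgreSQL VIP,
# e.g. "postgresql+psycopg2://user:pw@vip/nova".
engine = sa.create_engine("sqlite://")

@event.listens_for(engine, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # Send a liveness ping each time a connection is checked out of the
    # pool; if it fails, tell the pool the connection is dead so it is
    # discarded and a fresh one is opened transparently.
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise exc.DisconnectionError()
    finally:
        cursor.close()

with engine.connect() as conn:
    print(conn.scalar(sa.text("SELECT 1")))  # prints 1
```

Note that this recipe can only detect a dead connection if the DBAPI raises cleanly on the ping, which is exactly what the async psycopg2 connection fails to do here.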
>
> ### Error messages from nova-conductor
> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
> (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
> query is underway [SQL: 'SELECT 1']
> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
> 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>   "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
> 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
>   cannot be used while an asynchronous query is underway
> 36365 ERROR oslo_db.sqlalchemy.exc_filters
> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
> 36365 ERROR sqlalchemy.pool.QueuePool   File
>   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
> 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be
>   used while an asynchronous query is underway
> 36365 ERROR sqlalchemy.pool.QueuePool
>
> ### ps status of nova-conductor
> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
> [<ffffffff810e5799>] futex_wait+0x179/0x280
> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> ### Stack of the nova-conductor process
> POD6-Mongodb03:/tmp # pstack 33316
> #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
> #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
> #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>
> psycopg2 was trying to close the cursor and acquire the mutex via
> pthread_mutex_lock, but it seems the cursor was in use by another session.
>
> Questions:
>
> 1. What does the error "ProgrammingError: close cannot be used while an
> asynchronous query is underway" mean?
> AFAIK, this is raised by psycopg2 and means an asynchronous query was
> issued on a connection that was already busy with one.
> But I think SQLAlchemy is thread safe here, since psycopg2 was patched by
> eventlet; see details in eventlet/support/psycopg2_patcher.py
It's kind of news to me that the postgresql/eventlet combination is even doing the psyco-green monkeypatch, but this seems to have been the case for a long time, since eventlet 0.9.8. So what we're seeing is just a poor failure mode, helped along by the awkward limitations of the async connection: the "SELECT 1" ping we do to test liveness of the connection is not able to raise a proper error code.

Right off I will note that "moving the VIP" is not adequate for a clean HA switchover. We've had lots of problems doing things that way, including specifically that a connection going over the VIP has a very hard time detecting that it's no longer "open" when you switch the VIP under it. In the MySQL world we've had to modify the socket with an additional parameter, "source_address" (https://docs.python.org/2/library/socket.html#socket.create_connection), to help with VIP switchovers; I'm not sure what options psycopg2 offers in this regard. But overall we don't switch VIPs around; we use HAProxy to do smooth switchovers and load balancing.

Another option here would be to not use connection pooling at all (which unfortunately is not an option that oslo.db exposes at this time, though it could/should be added). Or you could set pool_recycle (the maximum lifetime of a pooled connection) to be very low; assuming these errors involve stale connections only, that would limit how long they can linger.

Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly recommend running it that way. I've been pushing for years to get OpenStack off of eventlet entirely, as it's entirely useless for the architecture of OpenStack.

> And we can see different green thread numbers in the log, e.g.:
> [pid:36365] [MainThread] [tid:122397712]
> [pid:36365] [MainThread] [tid:122397815]
> So, I guess the connection pool in one process is safe.
>
> 2. nova-conductor is a multi-threaded Python client which forked several
> child processes.
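[Editor's note: at the SQLAlchemy level, the two pooling suggestions above (no pooling at all, or a short connection lifetime) would look roughly like the sketch below. This is illustrative only; oslo.db layers its own configuration on top of create_engine(), and an in-memory SQLite URL stands in for the real connection string.]

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# No connection pooling: every checkout opens a brand-new DBAPI
# connection, so a stale post-failover connection is never reused.
engine_nopool = create_engine("sqlite://", poolclass=NullPool)

# Pooling kept, but each connection is recycled after at most 30
# seconds, bounding how long a stale connection can linger in the pool.
engine_recycle = create_engine("sqlite://", pool_recycle=30)

for eng in (engine_nopool, engine_recycle):
    with eng.connect() as conn:
        print(conn.scalar(text("SELECT 1")))  # prints 1 for each engine
```

NullPool trades away connection reuse for robustness, which is usually an acceptable cost for a service that holds few concurrent connections.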
>
> ps -elf | grep -i nova-conductor
> 30878  1  pool_s  /usr/bin/nova-conductor
> 36364  1  ep_pol  /usr/bin/nova-conductor
> 36365  1  futex_  /usr/bin/nova-conductor
> 36366  1  ep_pol  /usr/bin/nova-conductor
> 36367  1  ep_pol  /usr/bin/nova-conductor
> 36368  1  ep_pol  /usr/bin/nova-conductor
>
> If nova-conductor is started with only one child, the problem does not
> happen.
> Does this mean the connection/engine CANNOT be shared across these child
> processes?

Nova starts a lot of workers, but they should not be starting up the engine in the parent process. If you can identify that nova is sharing a database file handle from the parent process across multiple child processes, then please report a nova issue on Launchpad; that has to be fixed ASAP.

> Thanks.
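[Editor's note: if an engine does end up created before fork, the usual SQLAlchemy remedy is for each child to call engine.dispose() immediately after forking, so no pooled DBAPI connections are shared between processes. A hypothetical sketch follows; an in-memory SQLite URL stands in for the real connection string, and it is POSIX-only because of os.fork().]

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # placeholder URL

pid = os.fork()
if pid == 0:
    # Child process: discard any DBAPI connections inherited from the
    # parent's pool; the pool re-opens connections lazily as needed.
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
```

Without the dispose() call, parent and child can end up interleaving traffic on the same socket, which produces exactly the kind of protocol-state corruption seen in the tracebacks above.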
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.