On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai <caij...@gmail.com> wrote:
> Hi Michael,
>
> I am using OpenStack with PostgreSQL; the sqlalchemy and oslo_db modules
> are used. There are some problems after my pg database was switched over.
>
> Here is my switchover process:
> 1. nova-conductor (a Python application) is running with its DB connection
>    string pointed at the VIP, which is at the primary site (A) of pg.
> 2. Switch the VIP from the primary (A) to the new primary (B).
> 3. Switch over pg: shut down the primary (A), promote the standby (B) to
>    new primary.
> 4. nova-conductor keeps running through the whole process.
>
> After some seconds, I found some nova-conductor processes hung with
> status futex_wait_queue_me, and the status of the query in the DB was
> "idle in transaction": the transaction was neither committed nor rolled
> back!
> I think disconnection is handled in oslo_db, which sends a ping
> ("SELECT 1") to the DB.
>
> If the DB is switched over, the connection in the pool should be marked
> invalid and reconnect on the next checkout.
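[Editor's note: the ping-and-invalidate behavior described above can be sketched with SQLAlchemy's classic pessimistic-disconnect recipe, a "checkout" pool event listener. This is an illustrative sketch only, not the actual oslo_db code; an in-memory SQLite URL stands in for the real PostgreSQL VIP connection string so the example is self-contained.]

```python
import sqlalchemy as sa
from sqlalchemy import event, exc

# Placeholder URL: the real deployment would use the PostgreSQL VIP,
# e.g. "postgresql+psycopg2://user:pw@vip/nova".
engine = sa.create_engine("sqlite://")

@event.listens_for(engine, "checkout")
def ping_connection(dbapi_connection, connection_record, connection_proxy):
    # Send a liveness ping each time a connection is checked out of the
    # pool; if it fails, tell the pool the connection is dead so it is
    # discarded and a fresh one is opened transparently.
    cursor = dbapi_connection.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        raise exc.DisconnectionError()
    finally:
        cursor.close()

with engine.connect() as conn:
    print(conn.scalar(sa.text("SELECT 1")))  # prints 1
```

Note that this recipe can only detect a dead connection if the DBAPI raises cleanly on the ping, which is exactly what the async psycopg2 connection fails to do here.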
>
> ### Error messages from nova-conductor
> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
> (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
> query is underway [SQL: 'SELECT 1']
> 36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
> 36365 ERROR oslo_db.sqlalchemy.exc_filters     context)
> 36365 ERROR oslo_db.sqlalchemy.exc_filters   File
>   "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
> 36365 ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
> 36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
>   cannot be used while an asynchronous query is underway
> 36365 ERROR oslo_db.sqlalchemy.exc_filters
> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing cursor
> 36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
> 36365 ERROR sqlalchemy.pool.QueuePool   File
>   "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _safe_close_cursor
> 36365 ERROR sqlalchemy.pool.QueuePool     cursor.close()
> 36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be
>   used while an asynchronous query is underway
> 36365 ERROR sqlalchemy.pool.QueuePool
>
> ### ps status of nova-conductor
> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat /proc/33316/stack
> [<ffffffff810e4c24>] futex_wait_queue_me+0xc4/0x120
> [<ffffffff810e5799>] futex_wait+0x179/0x280
> [<ffffffff810e782e>] do_futex+0xfe/0x5b0
> [<ffffffff810e7d60>] SyS_futex+0x80/0x180
> [<ffffffff81654e09>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> ### Stack of the nova-conductor process
> POD6-Mongodb03:/tmp # pstack 33316
> #0  0x00002b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00002b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
> #2  0x00002b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00002b84554c44ab in pq_abort () from /python2.7/site-packages/psycopg2/_psycopg.so
> #4  0x00002b84554c955e in psyco_conn_rollback () from /python2.7/site-packages/psycopg2/_psycopg.so
> #5  0x00002b8449b42b50 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
> #6  0x00002b8449b42ad0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>
> psycopg2 was trying to close the cursor and acquire the mutex via
> pthread_mutex_lock, but it seems the cursor was in use by another session.
>
> Questions:
>
> 1. What does the error "ProgrammingError: close cannot be used while an
> asynchronous query is underway" mean?
> AFAIK, this is raised by psycopg2 and means an asynchronous query was
> issued on a connection that was already busy with one.
> But I think SQLAlchemy is thread safe here, since psycopg2 was patched by
> eventlet; see details in eventlet/support/psycopg2_patcher.py
It's kind of news to me that the postgresql/eventlet combination is even doing the psyco-green monkeypatch, but this seems to have been the case for a long time, since eventlet 0.9.8. So what we're seeing is just a poor failure mode, helped along by the awkward limitations of the async connection: the "SELECT 1" ping we do to test liveness of the connection is not able to raise a proper error code.

Right off I will note that "moving the VIP" is not adequate for a clean HA switchover. We've had lots of problems doing things that way, including specifically that a connection going over the VIP has a very hard time detecting that it's no longer "open" when you switch the VIP under it. In the MySQL world we've had to modify the socket with an additional parameter, "source_address" (https://docs.python.org/2/library/socket.html#socket.create_connection), to help with VIP switchovers; I'm not sure what options psycopg2 offers in this regard. But overall we don't switch VIPs around; we use HAProxy to do smooth switchovers and load balancing.

Another option here would be to not use connection pooling at all (which unfortunately is not an option that oslo.db exposes at this time, though it could/should be added). Or you could set pool_recycle (the maximum lifetime of a pooled connection) to be very low; assuming these errors involve stale connections only, that would limit how long they can linger.

Finally, nova is moving off of eventlet to mod_wsgi, and I'd strongly recommend running it that way. I've been pushing for years to get OpenStack off of eventlet entirely, as it's entirely useless for the architecture of OpenStack.

> And we can see different green thread numbers in the log, e.g.:
> [pid:36365] [MainThread] [tid:122397712]
> [pid:36365] [MainThread] [tid:122397815]
> So, I guess the connection pool in one process is safe.
>
> 2. nova-conductor is a multi-threaded Python client which forked several
> child processes.
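[Editor's note: at the SQLAlchemy level, the two pooling suggestions above (no pooling at all, or a short connection lifetime) would look roughly like the sketch below. This is illustrative only; oslo.db layers its own configuration on top of create_engine(), and an in-memory SQLite URL stands in for the real connection string.]

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# No connection pooling: every checkout opens a brand-new DBAPI
# connection, so a stale post-failover connection is never reused.
engine_nopool = create_engine("sqlite://", poolclass=NullPool)

# Pooling kept, but each connection is recycled after at most 30
# seconds, bounding how long a stale connection can linger in the pool.
engine_recycle = create_engine("sqlite://", pool_recycle=30)

for eng in (engine_nopool, engine_recycle):
    with eng.connect() as conn:
        print(conn.scalar(text("SELECT 1")))  # prints 1 for each engine
```

NullPool trades away connection reuse for robustness, which is usually an acceptable cost for a service that holds few concurrent connections.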
>
> ps -elf | grep -i nova-conductor
> 30878  1  pool_s  /usr/bin/nova-conductor
> 36364  1  ep_pol  /usr/bin/nova-conductor
> 36365  1  futex_  /usr/bin/nova-conductor
> 36366  1  ep_pol  /usr/bin/nova-conductor
> 36367  1  ep_pol  /usr/bin/nova-conductor
> 36368  1  ep_pol  /usr/bin/nova-conductor
>
> If nova-conductor is started with only one child, the problem does not
> happen.
> Does this mean the connection/engine CANNOT be shared across these child
> processes?

Nova starts a lot of workers, but they should not be starting up the engine in the parent process. If you can identify that nova is sharing a database file handle from the parent process across multiple child processes, then please report a nova issue on Launchpad; that has to be fixed ASAP.

> Thanks.
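[Editor's note: if an engine does end up created before fork, the usual SQLAlchemy remedy is for each child to call engine.dispose() immediately after forking, so no pooled DBAPI connections are shared between processes. A hypothetical sketch follows; an in-memory SQLite URL stands in for the real connection string, and it is POSIX-only because of os.fork().]

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # placeholder URL

pid = os.fork()
if pid == 0:
    # Child process: discard any DBAPI connections inherited from the
    # parent's pool; the pool re-opens connections lazily as needed.
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
```

Without the dispose() call, parent and child can end up interleaving traffic on the same socket, which produces exactly the kind of protocol-state corruption seen in the tracebacks above.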
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.