Re: [sqlalchemy] long transaction after database switched over

2017-11-16 Thread JinRong Cai
Hi Michael,

Thanks for the reply.
It is a pity that OpenStack will only support MySQL.
As for this problem, I will dig into the code more first.
I will let you know if I find more.

Thanks again!
--Jinrong


Re: [sqlalchemy] long transaction after database switched over

2017-11-13 Thread Mike Bayer
On Mon, Nov 13, 2017 at 2:14 AM, JinRong Cai  wrote:
> Hi  Michael,
>
> Thanks for your detailed answers.
>
> 1. We can see several transactions hanging during the failover, including:
> "select 1"
> "select xxx from instances"
> We even see the transaction hang after "update services set "
> So I am not sure whether the problem is related to the connection
> invalidation in the pool.
>
>
> 2. I have tried to disable the connection pool by using NullPool.
> And yes, the failover is OK, and there is no long transaction on the DB side
> after failover.
> But in my understanding, if we disable the connection pool in this way, the
> DB will suffer from poor performance.
>
>
> 3. For the nova worker, I did not see in the log that the failed connection
> object was shared with other nova processes (PIDs).
>
>
> BTW, from the nova-conductor side we can see that the socket receive queue is
> increasing and the responses from the DB are not being consumed by nova-conductor:
>
> Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
> tcp       69      0 nova-conductor_ip:60132  pg_ip:5432       established  36365/python2.7-nova-conductor
>
>
> And now I suspect that the long transaction, which is not
> committed/rolled back, was caused by the connection pool.
> I am wondering which situation would cause the nova client to not
> receive the response?
> And what information do I need if I dig into it further?

This is really more of an OpenStack integration issue and it should be
taken up by the Nova team first, as it would involve being able to
reproduce your problem and digging into the architecture to see where
the engine might be being misused.   Unfortunately you might find it
difficult to get support because this is PostgreSQL; even though it has
historically been "supported", I'm sure you're aware they've been
trying for some time to find a way for PostgreSQL to not really be
supported, and that may be the case today, even though IIUC they've
still left the door open for alternative backends (because there is
always pressure to support other backends).

If you're truly looking for a stable and performant OpenStack without
being deeply involved in fixing issues, I'd be switching to MySQL,
because that's really the only backend that has widespread testing and
support.  OTOH, if you're involved with development of Nova against
PostgreSQL, then getting a Launchpad issue with steps to reproduce
would be a good start (e.g. the rally job that you're running, something
like that).



Re: [sqlalchemy] long transaction after database switched over

2017-11-12 Thread JinRong Cai
Hi  Michael,

Thanks for your detailed answers.

1. We can see several transactions hanging during the failover, including:
"select 1"
"select xxx from instances"
We even see the transaction hang after "update services set "
So I am not sure whether the problem is related to the connection
invalidation in the pool.


2. I have tried to disable the connection pool by using NullPool.
And yes, the failover is OK, and there is no long transaction on the DB side
after failover.
But in my understanding, if we disable the connection pool in this way, the
DB will suffer from poor performance.
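
[As a point of reference, this is roughly what running without a pool means at
the plain SQLAlchemy level. A minimal sketch with a placeholder URL; nova
actually configures its engine through oslo_db rather than calling
create_engine directly.]

from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Placeholder URL; the real one comes from nova.conf via oslo_db.
engine = create_engine(
    "postgresql+psycopg2://nova:nova@pg_vip/nova",
    # NullPool: every checkout opens a fresh DBAPI connection and it is
    # really closed again on release, so nothing stale survives a failover.
    poolclass=NullPool,
)

with engine.connect() as conn:
    print(conn.scalar(text("SELECT 1")))

[The trade-off is the one described above: nothing stale can survive the
failover, but every checkout pays the full connection setup cost.]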


3. For the nova worker, I did not see in the log that the failed connection
object was shared with other nova processes (PIDs).


BTW, from the nova-conductor side we can see that the socket receive queue is
increasing and the responses from the DB are not being consumed by nova-conductor:

Proto Recv-Q Send-Q Local Address            Foreign Address  State        PID/Program name
tcp       69      0 nova-conductor_ip:60132  pg_ip:5432       established  36365/python2.7-nova-conductor


And now I suspect that the long transaction, which is not
committed/rolled back, was caused by the connection pool.
I am wondering which situation would cause the nova client to not
receive the response?
And what information do I need if I dig into it further?

Thanks for your help again.
--Jinrong.



Re: [sqlalchemy] long transaction after database switched over

2017-11-12 Thread Mike Bayer
On Sun, Nov 12, 2017 at 9:44 PM, JinRong Cai  wrote:
> Hi  Michael ,
>
> I am using OpenStack with PostgreSQL, where the sqlalchemy and oslo_db modules
> are used.
> And there are some problems after my PG database was switched over.
>
> Here is my switchover process:
> 1. nova-conductor (a Python application) is running with DB connection strings
> pointing to the VIP, which is on the primary site (A) of PG.
> 2. Switch the VIP from the primary (A) to the new primary (B).
> 3. Switch over PG: shut down the primary (A), promote the standby (B) to the
> new primary.
> 4. nova-conductor keeps running during the whole process.
>
> After some seconds, I found that some nova-conductor processes hang with
> status futex_wait_queue_me, and the status of the query in the DB is "idle in
> transaction"; the transaction was not committed or rolled back!
> I think disconnection is handled in oslo_db, which will send a
> ping (select 1) to the DB.
>
> If the DB was switched over, the connection in the pool would be marked
> invalid and reconnected after the next checkout.
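
[For context, the checkout-time ping described here follows SQLAlchemy's
documented "pessimistic disconnect handling" pattern. A simplified sketch with a
placeholder URL, not oslo_db's actual listener: the SELECT 1 is expected to fail
with a recognizable disconnect error on a dead connection, which invalidates the
pooled connection so that a fresh one is established.]

from sqlalchemy import create_engine, event, exc, select

engine = create_engine("postgresql+psycopg2://nova:nova@pg_vip/nova")  # placeholder URL

@event.listens_for(engine, "engine_connect")
def ping_connection(connection, branch):
    if branch:
        # "branch" connections are sub-connections of an existing one; skip them.
        return
    try:
        # Cheap liveness probe before the connection is handed out.
        connection.scalar(select([1]))
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # The pool has invalidated the dead connection; retrying the ping
            # runs on a freshly established connection.
            connection.scalar(select([1]))
        else:
            raise

[Newer SQLAlchemy releases expose the same idea as create_engine(...,
pool_pre_ping=True). The failure mode discussed further down is what happens
when this SELECT 1 cannot raise a clean disconnect error.]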
>
> ###error messages from nova-conductor
> localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [exc_filters.py:330 _raise_for_remaining_DBAPIError]
> [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] DBAPIError exception wrapped from
> (psycopg2.ProgrammingError) execute cannot be used while an asynchronous
> query is underway [SQL: 'SELECT 1']
>  36365 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call
> last):
>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in
> _execute_context
>   36365 ERROR oslo_db.sqlalchemy.exc_filters context)
>   36365 ERROR oslo_db.sqlalchemy.exc_filters   File
> "/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in
> do_execute
>   36365 ERROR oslo_db.sqlalchemy.exc_filters cursor.execute(statement,
> parameters)
>   36365 ERROR oslo_db.sqlalchemy.exc_filters ProgrammingError: execute
> cannot be used while an asynchronous query is underway
>   36365 ERROR oslo_db.sqlalchemy.exc_filters
>  localhost nova-conductor ERROR [pid:36365] [MainThread] [tid:122397712]
> [log.py:122 error] [req-2bd8a290-e17b-4178-80a6-4b36d5793d85] Error closing
> cursor
>   36365 ERROR sqlalchemy.pool.QueuePool Traceback (most recent call last):
>   36365 ERROR sqlalchemy.pool.QueuePool   File
> "/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in
> _safe_close_cursor
>   36365 ERROR sqlalchemy.pool.QueuePool cursor.close()
>   36365 ERROR sqlalchemy.pool.QueuePool ProgrammingError: close cannot be
> used while an asynchronous query is underway
>   36365 ERROR sqlalchemy.pool.QueuePool
>
> ###ps status of nova-conductor
> POD6-Mongodb03:/var/log/uvp-getosstat/statistics20171106101500log # cat
> /proc/33316/stack
> [] futex_wait_queue_me+0xc4/0x120
> [] futex_wait+0x179/0x280
> [] do_futex+0xfe/0x5b0
> [] SyS_futex+0x80/0x180
> [] system_call_fastpath+0x16/0x1b
> [] 0x
>
> ### stack of the nova-conductor process
> POD6-Mongodb03:/tmp # pstack 33316
> #0  0x2b8449e35f4d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x2b8449e31d02 in _L_lock_791 () from /lib64/libpthread.so.0
> #2  0x2b8449e31c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x2b84554c44ab in pq_abort () from
> /python2.7/site-packages/psycopg2/_psycopg.so
> #4  0x2b84554c955e in psyco_conn_rollback () from
> /python2.7/site-packages/psycopg2/_psycopg.so
> #5  0x2b8449b42b50 in PyEval_EvalFrameEx () from
> /lib64/libpython2.7.so.1.0
> #6  0x2b8449b42ad0 in PyEval_EvalFrameEx () from
> /lib64/libpython2.7.so.1.0
>
> psycopg2 was trying to close the cursor and to acquire the mutex via
> "pthread_mutex_lock", but it seems that the cursor was being used by another
> session.
>
>
> Questions:
>
> 1. What does the error "ProgrammingError: close cannot be used while an
> asynchronous query is underway" mean?
> AFAIK, this is raised by psycopg2, and it means an asynchronous query was
> being executed on the same connection.
> But I think SQLAlchemy is thread safe, since it was patched by eventlet;
> see details in eventlet/support/psycopg2_patcher.py

It's kind of news to me that postgresql/eventlet is even doing the
psyco-green monkeypatch, but this seems to have been the case for a
long time, since eventlet 0.9.8.  So what we're seeing is just a poor
failure mode which is helped along by the awkward limitations of the
async connection, where the "SELECT 1" ping we do to test liveness of
the connection is not able to raise a proper error code.
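
[For readers unfamiliar with the psyco-green monkeypatch, below is a rough
sketch of the kind of wait callback it registers; this is simplified, not
eventlet's exact psycopg2_patcher code. The callback puts psycopg2 connections
into asynchronous mode and drives their poll loop from the eventlet hub. If that
loop is interrupted partway through a query, for example by a killed or
timed-out greenlet, the connection can be left with an async query still in
flight, and any later execute() or cursor.close() on it then fails with "cannot
be used while an asynchronous query is underway" rather than a clean disconnect
error.]

import psycopg2
from psycopg2 import extensions
from eventlet.hubs import trampoline

def green_wait_callback(conn, timeout=-1):
    """Cooperatively wait on psycopg2's async protocol instead of blocking."""
    # timeout is unused; kept only for signature compatibility.
    while True:
        state = conn.poll()
        if state == extensions.POLL_OK:
            return                                 # query finished, results ready
        elif state == extensions.POLL_READ:
            trampoline(conn.fileno(), read=True)   # yield until socket is readable
        elif state == extensions.POLL_WRITE:
            trampoline(conn.fileno(), write=True)  # yield until socket is writable
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)

# Registering the callback is what switches psycopg2 into this green/async mode.
extensions.set_wait_callback(green_wait_callback)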

Right off I will note that "moving the VIP" is not adequate for a
clean HA switchover; we've had lots of problems doing things that way,
including specifically that a connection that goes over the VIP has a
very hard time detecting that it's no longer "open" when you switch
the VIP under it.  In the MySQL world we've had to modify the socket
with an additional parameter, "source_address"