Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers

2014-12-12 Thread Ihar Hrachyshka

Reading the latest comments at
https://github.com/PyMySQL/PyMySQL/issues/275, it seems to me that the
issue is not to be solved in drivers themselves but instead in
libraries that arrange connections (sqlalchemy/oslo.db), correct?

Will the proposed connection reopening help?

/Ihar

On 05/12/14 23:43, Mike Bayer wrote:
 Hey list -
 
 I’m posting this here just to get some ideas on what might be
 happening here, as it may or may not have some impact on Openstack
 if and when we move to MySQL drivers that are async-patchable, like
 MySQL-connector or PyMySQL.  I had a user post this issue a few
 days ago which I’ve since distilled into test cases for PyMySQL and
 MySQL-connector separately.   It uses gevent, not eventlet, so I’m
 not really sure if this applies.  But there’s plenty of very smart
 people here so if anyone can shed some light on what is actually
 happening here, that would help.
 
 [...]

Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers

2014-12-12 Thread Mike Bayer

 On Dec 12, 2014, at 9:27 AM, Ihar Hrachyshka ihrac...@redhat.com wrote:
 
 
 Reading the latest comments at
 https://github.com/PyMySQL/PyMySQL/issues/275, it seems to me that the
 issue is not to be solved in drivers themselves but instead in
 libraries that arrange connections (sqlalchemy/oslo.db), correct?
 
 Will the proposed connection reopening help?

disagree, this is absolutely a driver bug.  I’ve re-read that last comment and 
now I see that the developer is suggesting that this condition not be flagged 
in any way, so I’ve responded.  The connection should absolutely blow up and if 
it wants to refuse to be usable afterwards, that’s fine (it’s the same as 
MySQLdb “commands out of sync”).  It just has to *not* emit any further SQL as 
though nothing is wrong.
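
(To make that expectation concrete, here is a rough sketch at the wrapper
level rather than the driver’s internals - the GuardedConnection name and API
are made up - of what “blow up and refuse to be usable afterwards” could look
like:)

class GuardedConnection(object):
    def __init__(self, dbapi_connection):
        self._conn = dbapi_connection
        self._broken = False

    def _guard(self):
        if self._broken:
            raise RuntimeError(
                "connection was interrupted mid-operation; "
                "refusing to emit further SQL")

    def execute(self, statement, parameters=()):
        self._guard()
        cursor = self._conn.cursor()
        try:
            cursor.execute(statement, parameters)
            return cursor
        except BaseException:
            # a gevent Timeout / GreenletExit does not derive from Exception,
            # so BaseException is what has to be watched to notice the kill
            self._broken = True
            raise

    def commit(self):
        self._guard()
        self._conn.commit()

    def rollback(self):
        # rollback stays allowed so a pool can still attempt cleanup,
        # but the connection remains poisoned afterwards
        self._conn.rollback()

Once an operation on the connection has been interrupted, nothing further goes
out on the wire from it; that’s the whole guarantee being asked for here.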

It doesn’t matter much for PyMySQL anyway, I don’t know that PyMySQL is up to 
par for openstack in any case (look at the entries in their changelog at 
https://github.com/PyMySQL/PyMySQL/blob/master/CHANGELOG - “Several other bug 
fixes”, “Many bug fixes” - really?  is this an iPhone app?)

We really should be looking to get this fixed in MySQL-connector, which seems 
to have a similar issue.   It’s just so difficult to get responses from 
MySQL-connector that the PyMySQL thread is at least informative.





 
 /Ihar
 

Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers

2014-12-12 Thread Doug Hellmann

On Dec 12, 2014, at 1:16 PM, Mike Bayer mba...@redhat.com wrote:

 
 On Dec 12, 2014, at 9:27 AM, Ihar Hrachyshka ihrac...@redhat.com wrote:
 
 
 Reading the latest comments at
 https://github.com/PyMySQL/PyMySQL/issues/275, it seems to me that the
 issue is not to be solved in drivers themselves but instead in
 libraries that arrange connections (sqlalchemy/oslo.db), correct?
 
 Will the proposed connection reopening help?
 
 disagree, this is absolutely a driver bug.  I’ve re-read that last comment 
 and now I see that the developer is suggesting that this condition not be 
 flagged in any way, so I’ve responded.  The connection should absolutely blow 
 up and if it wants to refuse to be usable afterwards, that’s fine (it’s the 
 same as MySQLdb “commands out of sync”).  It just has to *not* emit any 
 further SQL as though nothing is wrong.
 
 It doesn’t matter much for PyMySQL anyway, I don’t know that PyMySQL is up to 
 par for openstack in any case (look at the entries in their changelog at 
 https://github.com/PyMySQL/PyMySQL/blob/master/CHANGELOG - “Several other bug 
 fixes”, “Many bug fixes” - really?  is this an iPhone app?)

This does make me a little concerned about merging 
https://review.openstack.org/#/c/133962/, so I’ve added a -2 for the time being 
to let the discussion go on here.

Doug


 
 We really should be looking to get this fixed in MySQL-connector, which seems 
 to have a similar issue.   It’s just so difficult to get responses from 
 MySQL-connector that the PyMySQL thread is at least informative.
 
 
 
 
 
 
 /Ihar
 

[openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers

2014-12-05 Thread Mike Bayer
Hey list -

I’m posting this here just to get some ideas on what might be happening here, 
as it may or may not have some impact on Openstack if and when we move to MySQL 
drivers that are async-patchable, like MySQL-connector or PyMySQL.  I had a 
user post this issue a few days ago which I’ve since distilled into test cases 
for PyMySQL and MySQL-connector separately.   It uses gevent, not eventlet, so 
I’m not really sure if this applies.  But there’s plenty of very smart people 
here so if anyone can shed some light on what is actually happening here, that 
would help.

The program essentially illustrates code that performs several steps upon a 
connection; however, if the greenlet is suddenly killed, the connection’s 
state, while damaged, is still allowed to continue on in some way, and what’s 
super-catastrophic here is that you see a transaction actually being committed 
*without* all of the statements that were supposed to be part of it.

In my work with MySQL drivers, I’ve noted for years that they are all very, 
very bad at dealing with concurrency-related issues.  The whole “MySQL has gone 
away” and “commands out of sync” errors are ones that we’ve all just drowned 
in, and so often these are due to the driver getting mixed up due to concurrent 
use of a connection.  However this one seems more insidious.   Though at the 
same time, the script has some complexity happening (like a simplistic 
connection pool) and I’m not really sure where the core of the issue lies.

The script is at https://gist.github.com/zzzeek/d196fa91c40cb515365e and also 
below.  If you run it for a few seconds, go over to your MySQL command line and 
run this query:

SELECT * FROM table_b WHERE a_id not in (SELECT id FROM table_a) ORDER BY a_id 
DESC;

and what you’ll see is tons of rows in table_b where the “a_id” is zero 
(because cursor.lastrowid fails), but the *rows are committed*.   If you read 
the segment of code that does this, it should be impossible:

connection = pool.get()
rowid = execute_sql(
    connection,
    "INSERT INTO table_a (data) VALUES (%s)", ("a",)
)

gevent.sleep(random.random() * 0.2)

try:
    execute_sql(
        connection,
        "INSERT INTO table_b (a_id, data) VALUES (%s, %s)",
        (rowid, "b",)
    )

    connection.commit()

    pool.return_conn(connection)

except Exception:
    connection.rollback()
    pool.return_conn(connection)

so if the gevent.sleep() throws a timeout error, somehow we are getting thrown 
back in there, with the connection in an invalid state, but not invalid enough 
to commit.
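
(To isolate just the gevent part of that: here’s a standalone toy, no database
involved and the names are made up, showing that when a greenlet is killed
between two steps that were meant to happen together, the later step silently
never runs, while whatever shared thing it was mutating is left half-done for
the next user.)

import gevent

state = {"step_one_done": False, "step_two_done": False}

def unit_of_work():
    state["step_one_done"] = True
    gevent.sleep(0.2)              # the greenlet gets killed while blocked here
    state["step_two_done"] = True  # never runs

g = gevent.spawn(unit_of_work)
gevent.sleep(0.05)
g.kill()                           # analogous to the timeout firing mid-work
g.join()

print(state)   # {'step_one_done': True, 'step_two_done': False}

In the script the shared thing is a pooled DB connection rather than a dict,
which is what makes the half-done state so dangerous when a commit still
manages to happen on it.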

If a simple check for “SELECT connection_id()” is added, this query fails and 
the whole issue is prevented.  Additionally, if you put a foreign key 
constraint on table_b.a_id, then the issue is prevented, and you see the 
constraint violation happening all over the place within the commit() 
call.   The connection is being used as though its state simply picked up 
again right after the gevent.sleep() call.
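
(For concreteness, the kind of check meant above, sketched against the plain
DB-API; the gist’s verify_connection_id() is along these lines, though this
exact helper is mine:)

def get_connection_id(conn):
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT CONNECTION_ID()")
        return cursor.fetchone()[0]
    finally:
        cursor.close()

def guarded_commit(conn, expected_id):
    # the point is that on a damaged connection the SELECT itself blows up
    # before anything can be committed; comparing the id is an extra sanity
    # check on top of that
    if get_connection_id(conn) != expected_id:
        raise RuntimeError("connection answered as a different session; "
                           "refusing to commit")
    conn.commit()

The expected_id would be captured via get_connection_id() right after
pool.get().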

Now, there’s also a very rudimentary connection pool here.   That is also part 
of what’s going on.  If I try to run without the pool, the whole script just 
runs out of connections, fast, which suggests that this gevent timeout cleans 
itself up very, very badly.   However, SQLAlchemy’s pool works a lot like this 
one, so if folks here can tell me whether the connection pool is doing 
something bad, then that’s key, because I need to make a comparable change in 
SQLAlchemy’s pool.   Otherwise I worry our eventlet use could have big problems 
under high load.
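
(To spell out what a pool-side guard could look like - this is not
SQLAlchemy’s actual pool code, just a sketch: make sure the checkout is
closed out even when the greenlet dies with something that doesn’t derive
from Exception, since gevent’s Timeout and GreenletExit are BaseExceptions
and an “except Exception:” never sees them.)

from contextlib import contextmanager

@contextmanager
def checked_out(pool):
    conn = pool.get()
    try:
        yield conn
    finally:
        # return_conn() in the script below already rolls back, closes and
        # replaces the connection, so every exit path - normal return, an
        # Exception, or a greenlet kill - recycles it rather than reusing it
        pool.return_conn(conn)

The worker body would then become “with checked_out(pool) as connection: ...”
and drop its explicit return_conn() calls.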





# -*- coding: utf-8 -*-
import gevent.monkey
gevent.monkey.patch_all()

import collections
import threading
import time
import random
import sys

import logging
logging.basicConfig()
log = logging.getLogger('foo')
log.setLevel(logging.DEBUG)

#import pymysql as dbapi
from mysql import connector as dbapi


class SimplePool(object):
    def __init__(self):
        self.checkedin = collections.deque([
            self._connect() for i in range(50)
        ])
        self.checkout_lock = threading.Lock()
        self.checkin_lock = threading.Lock()

    def _connect(self):
        return dbapi.connect(
            user="scott", passwd="tiger",
            host="localhost", db="test")

    def get(self):
        with self.checkout_lock:
            while not self.checkedin:
                time.sleep(.1)
            return self.checkedin.pop()

    def return_conn(self, conn):
        # every returned connection is rolled back, closed and replaced
        # with a freshly opened one
        try:
            conn.rollback()
        except:
            log.error("Exception during rollback", exc_info=True)
        try:
            conn.close()
        except:
            log.error("Exception during close", exc_info=True)

        # recycle to a new connection
        conn = self._connect()
        with self.checkin_lock:
            self.checkedin.append(conn)