Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers
Reading the latest comments at https://github.com/PyMySQL/PyMySQL/issues/275, it seems to me that the issue is not to be solved in the drivers themselves but instead in the libraries that arrange connections (sqlalchemy/oslo.db), correct? Will the proposed connection reopening help?

/Ihar

On 05/12/14 23:43, Mike Bayer wrote:
> Hey list -
>
> I’m posting this here just to get some ideas on what might be happening
> here, as it may or may not have some impact on Openstack if and when we
> move to MySQL drivers that are async-patchable, like MySQL-connector or
> PyMySQL.
>
> [...]
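To make the “connection reopening” idea concrete, here is a minimal sketch of a pool that pessimistically pings each connection at checkout and replaces broken ones. The class name, pool size, and “SELECT 1” ping are illustrative assumptions, not code from this thread:

    import collections

    class ReopeningPool(object):
        def __init__(self, connect, size=10):
            # `connect` is any zero-argument factory returning a DB-API connection.
            self._connect = connect
            self._checkedin = collections.deque(connect() for _ in range(size))

        def get(self):
            conn = self._checkedin.pop()
            try:
                # Pessimistic ping: one trivial round-trip to the server.
                cursor = conn.cursor()
                cursor.execute("SELECT 1")
                cursor.fetchall()
                cursor.close()
            except Exception:
                # The connection is unusable: close it if possible, then reopen.
                try:
                    conn.close()
                except Exception:
                    pass
                conn = self._connect()
            return conn

        def return_conn(self, conn):
            try:
                conn.rollback()
                self._checkedin.append(conn)
            except Exception:
                # Rollback failed; discard and replace with a fresh connection.
                try:
                    conn.close()
                except Exception:
                    pass
                self._checkedin.append(self._connect())

As the replies below argue, though, a library-side guard like this can only paper over what is ultimately driver-level state corruption.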
Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers
On Dec 12, 2014, at 9:27 AM, Ihar Hrachyshka ihrac...@redhat.com wrote:
> Reading the latest comments at
> https://github.com/PyMySQL/PyMySQL/issues/275, it seems to me that the
> issue is not to be solved in the drivers themselves but instead in the
> libraries that arrange connections (sqlalchemy/oslo.db), correct? Will
> the proposed connection reopening help?
>
> /Ihar

disagree, this is absolutely a driver bug. I’ve re-read that last comment and now I see that the developer is suggesting that this condition not be flagged in any way, so I’ve responded. The connection should absolutely blow up, and if it wants to refuse to be usable afterwards, that’s fine (it’s the same as MySQLdb’s “commands out of sync”). It just has to *not* emit any further SQL as though nothing is wrong.

It doesn’t matter much for PyMySQL anyway; I don’t know that PyMySQL is up to par for Openstack in any case (look at the entries in their changelog at https://github.com/PyMySQL/PyMySQL/blob/master/CHANGELOG: “Several other bug fixes”, “Many bug fixes” - really? is this an iPhone app?).

We really should be looking to get this fixed in MySQL-connector, which seems to have a similar issue. It’s just so difficult to get responses from MySQL-connector that the PyMySQL thread is at least informative.
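To illustrate the contract being asked of the driver here - blow up, optionally refuse to be usable afterwards, but never emit further SQL on damaged state - a rough sketch of a connection-level guard. GuardedConnection and its attributes are hypothetical, not PyMySQL or MySQL-connector code:

    class GuardedConnection(object):
        def __init__(self, raw_connection):
            self._raw = raw_connection
            self._broken = False

        def _guarded(self, operation):
            if self._broken:
                # Same spirit as MySQLdb's "commands out of sync": once the
                # protocol state is suspect, never run SQL as if nothing is wrong.
                raise RuntimeError("connection unusable after interrupted operation")
            try:
                return operation()
            except BaseException:
                # A gevent Timeout or GreenletExit can land here mid-protocol;
                # poison the connection so commit() cannot run on damaged state.
                self._broken = True
                raise

        def cursor(self):
            return self._guarded(self._raw.cursor)

        def commit(self):
            return self._guarded(self._raw.commit)

        def rollback(self):
            return self._guarded(self._raw.rollback)

    # A full implementation would wrap cursors the same way, since
    # execute() calls must honor the same "broken" flag.

Catching BaseException matters in this sketch: gevent’s Timeout and GreenletExit derive from BaseException, so an ordinary “except Exception” handler never sees them.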
Re: [openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers
On Dec 12, 2014, at 1:16 PM, Mike Bayer mba...@redhat.com wrote:
> disagree, this is absolutely a driver bug. I’ve re-read that last
> comment and now I see that the developer is suggesting that this
> condition not be flagged in any way, so I’ve responded. The connection
> should absolutely blow up, and if it wants to refuse to be usable
> afterwards, that’s fine (it’s the same as MySQLdb’s “commands out of
> sync”). It just has to *not* emit any further SQL as though nothing is
> wrong.
>
> It doesn’t matter much for PyMySQL anyway; I don’t know that PyMySQL is
> up to par for Openstack in any case [...]

This does make me a little concerned about merging https://review.openstack.org/#/c/133962/ so I’ve added a -2 for the time being to let the discussion go on here.

Doug
[openstack-dev] [all] [ha] potential issue with implicit async-compatible mysql drivers
Hey list -

I’m posting this here just to get some ideas on what might be happening here, as it may or may not have some impact on Openstack if and when we move to MySQL drivers that are async-patchable, like MySQL-connector or PyMySQL.

I had a user post this issue a few days ago, which I’ve since distilled into test cases for PyMySQL and MySQL-connector separately. It uses gevent, not eventlet, so I’m not really sure if this applies. But there are plenty of very smart people here, so if anyone can shed some light on what is actually happening here, that would help.

The program essentially illustrates code that performs several steps upon a connection; however, if the greenlet is suddenly killed, the state of the connection, while damaged, is still being allowed to continue on in some way, and what’s super-catastrophic here is that you see a transaction actually being committed *without* all the statements that were supposed to proceed on it.

In my work with MySQL drivers, I’ve noted for years that they are all very, very bad at dealing with concurrency-related issues. The whole “MySQL has gone away” and “commands out of sync” errors are ones that we’ve all just drowned in, and so often these are due to the driver getting mixed up by concurrent use of a connection. However, this one seems more insidious. Though at the same time, the script has some complexity happening (like a simplistic connection pool) and I’m not really sure where the core of the issue lies.

The script is at https://gist.github.com/zzzeek/d196fa91c40cb515365e and also below. If you run it for a few seconds, go over to your MySQL command line and run this query:

    SELECT * FROM table_b
    WHERE a_id NOT IN (SELECT id FROM table_a)
    ORDER BY a_id DESC;

and what you’ll see is tons of rows in table_b where the “a_id” is zero (because cursor.lastrowid fails), but the *rows are committed*. If you read the segment of code that does this, it should be impossible:

    connection = pool.get()
    rowid = execute_sql(
        connection,
        "INSERT INTO table_a (data) VALUES (%s)", (a, )
    )

    gevent.sleep(random.random() * 0.2)

    try:
        execute_sql(
            connection,
            "INSERT INTO table_b (a_id, data) VALUES (%s, %s)",
            (rowid, b, )
        )
        connection.commit()
        pool.return_conn(connection)
    except Exception:
        connection.rollback()
        pool.return_conn(connection)

So if the gevent.sleep() throws a timeout error, somehow we are getting thrown back in there with the connection in an invalid state, but not invalid enough to commit. If a simple check for “SELECT connection_id()” is added, this query fails and the whole issue is prevented. Additionally, if you put a foreign key constraint on that table_b.a_id, then the issue is prevented, and you see the constraint violation happening all over the place within the commit() call. The connection is being used as though its state simply started over after the gevent.sleep() call.

Now, there’s also a very rudimentary connection pool here. That is also part of what’s going on. If I try to run without the pool, the whole script just runs out of connections, fast, which suggests that this gevent timeout cleans itself up very, very badly.

However, SQLAlchemy’s pool works a lot like this one, so if folks here can tell me whether the connection pool is doing something bad, that’s key, because I would need to make a comparable change in SQLAlchemy’s pool. Otherwise I worry our eventlet use could have big problems under high load.
# -*- coding: utf-8 -*-
import gevent.monkey
gevent.monkey.patch_all()

import collections
import threading
import time
import random
import sys
import logging

logging.basicConfig()
log = logging.getLogger('foo')
log.setLevel(logging.DEBUG)

#import pymysql as dbapi
from mysql import connector as dbapi


class SimplePool(object):
    def __init__(self):
        self.checkedin = collections.deque([
            self._connect() for i in range(50)
        ])
        self.checkout_lock = threading.Lock()
        self.checkin_lock = threading.Lock()

    def _connect(self):
        return dbapi.connect(
            user="scott", passwd="tiger",
            host="localhost", db="test")

    def get(self):
        with self.checkout_lock:
            while not self.checkedin:
                time.sleep(.1)
            return self.checkedin.pop()

    def return_conn(self, conn):
        try:
            conn.rollback()
        except:
            log.error("Exception during rollback", exc_info=True)
        try:
            conn.close()
        except:
            log.error("Exception during close", exc_info=True)

        # recycle to a new connection
        conn = self._connect()
        with self.checkin_lock:
            self.checkedin.append(conn)
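The script goes on (in the gist) to define a verify_connection_id() helper; per the prose above, a simple “SELECT connection_id()” check fails on the damaged connection and prevents the bad commit. A sketch of what that check might look like - the body here is an assumption, not the gist’s exact code:

    def verify_connection_id(conn):
        # On a healthy connection this trivial round-trip succeeds; on the
        # damaged connection in this scenario it raises, which is what
        # prevents the erroneous commit described above.
        cursor = conn.cursor()
        try:
            cursor.execute("SELECT connection_id()")
            return cursor.fetchone()[0]
        finally:
            cursor.close()

Similarly, a plain FOREIGN KEY constraint on table_b.a_id surfaces the corruption as a constraint violation inside commit() instead of silently persisting rows with a_id = 0.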