Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
Mike Bayer wrote:
> Doug Hellmann wrote:
>>> My question is then how is it that such an architecture would be
>>> possible, that Cinder’s service starts up without greenlets yet allows
>>> greenlet-based requests to come in before this critical task is
>>> complete? Shouldn’t the various oslo systems be providing patterns to
>>> prevent this disastrous combination?
>>
>> I would have thought so, but they are (mostly) libraries not frameworks
>> so they are often combined in unexpected ways. Let's see where the issue
>> is before deciding on where the fix should be.
>
> my next suspicion is that this is actually a spawned subprocess, through
> oslo.concurrency, and it is failing to create a new connection pool and
> thus is sharing the same file descriptor between processes. That is a
> much more ordinary issue and would explain everything I’m seeing in a
> more familiar way. Let me see if I can confirm *that*.

So, that was it. Here’s some debugging I added to oslo.db inside of its own create_engine() that catches this easily:

    engine = sqlalchemy.create_engine(url, **engine_args)

    import os
    from sqlalchemy import event

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            raise Exception(
                "Connection record belongs to pid %s, "
                "attempting to check out in pid %s" %
                (connection_record.info['pid'], pid))

Sure enough, the database errors go away, and Cinder’s logs on a plain startup are filled with:

    ERROR cinder.openstack.common.threadgroup [-] Connection record belongs to pid 21200, attempting to check out in pid 21408
    ERROR cinder.openstack.common.threadgroup [-] Connection record belongs to pid 21200, attempting to check out in pid 21409

etc.
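The failure mode diagnosed above, a forked worker inheriting the parent's pooled connections, comes down to fork() duplicating file descriptors. A minimal stdlib-only sketch of that mechanism, using a pipe as a stand-in for the MySQL socket (no SQLAlchemy or oslo code involved):

```python
import os

# fork() duplicates file descriptors: parent and child end up sharing the
# same underlying channel, just as a forked worker shares its parent's
# MySQL socket unless it rebuilds its connection pool.  A pipe stands in
# for the database socket here.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # child: the descriptor inherited from the parent still works --
    # it refers to the very same pipe the parent created
    os.write(w, str(os.getpid()).encode())
    os._exit(0)
os.waitpid(pid, 0)
child_pid = int(os.read(r, 64))
os.close(r)
os.close(w)
# the child's message arrived over the descriptor created in the parent:
# one channel, two processes -- exactly the sharing the pid check detects
```

The usual remedy, per SQLAlchemy's pooling documentation, is for the child process to call engine.dispose() (or to create its engine only after forking) so that it opens fresh connections instead of reusing the parent's descriptors.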
The subprocesses here are logged as:

    2015-02-17 13:05:12.583 DEBUG oslo_concurrency.processutils [req-a06464d1-4785-4a29-8e58-239d5a674451 None None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C LVM_SYSTEM_DIR=/etc/cinder vgs --noheadings --unit=g -o name,size,free,lv_count,uuid --separator : --nosuffix stack-volumes-lvm-1" returned: 0 in 0.828s from (pid=21408) execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:221

a trace is as follows:

    Traceback (most recent call last):
      File "/opt/stack/cinder/cinder/openstack/common/threadgroup.py", line 143, in wait
        x.wait()
      File "/opt/stack/cinder/cinder/openstack/common/threadgroup.py", line 47, in wait
        return self.thread.wait()
      File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
        return self._exit_event.wait()
      File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
        return hubs.get_hub().switch()
      File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
        return self.greenlet.switch()
      File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
        result = function(*args, **kwargs)
      File "/opt/stack/cinder/cinder/openstack/common/service.py", line 492, in run_service
        service.start()
      File "/opt/stack/cinder/cinder/service.py", line 142, in start
        self.manager.init_host()
      File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
        return f(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
        return f(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
        return f(*args, **kwargs)
      File "/opt/stack/cinder/cinder/volume/manager.py", line 294, in init_host
        volumes = self.db.volume_get_all_by_host(ctxt, self.host)
      File "/opt/stack/cinder/cinder/db/api.py", line 191, in volume_get_all_by_host
        return IMPL.volume_get_all_by_host(context, host)
      File "/opt/stack/cinder/cinder/db/sqlalchemy/api.py", line 160, in wrapper
        return f(*args, **kwargs)
      File "/opt/stack/cinder/cinder/db/sqlalchemy/api.py", line 1229, in volume_get_all_by_host
        result = _volume_get_query(context).filter(or_(*conditions)).all()
      File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2320, in all
        return list(self)
      File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2438, in __iter__
        return self._execute_and_instances(context)
      File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2451, in _execute_and_instances
        close_with_result=True)
      File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2442, in _connection_from_session
        **kw)
      File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 854, in connection
        close_with_result=close_with_result)
      Fil
Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
Doug Hellmann wrote:
>> My question is then how is it that such an architecture would be
>> possible, that Cinder’s service starts up without greenlets yet allows
>> greenlet-based requests to come in before this critical task is
>> complete? Shouldn’t the various oslo systems be providing patterns to
>> prevent this disastrous combination?
>
> I would have thought so, but they are (mostly) libraries not frameworks
> so they are often combined in unexpected ways. Let's see where the issue
> is before deciding on where the fix should be.

my next suspicion is that this is actually a spawned subprocess, through oslo.concurrency, and it is failing to create a new connection pool and thus is sharing the same file descriptor between processes. That is a much more ordinary issue and would explain everything I’m seeing in a more familiar way. Let me see if I can confirm *that*.

> Doug
>
>>> Current status is that I’m continuing to try to determine why this is
>>> happening here, and seemingly nowhere else.
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
On Tue, Feb 17, 2015, at 11:17 AM, Mike Bayer wrote:
> Doug Hellmann wrote:
>>> So I’m not really sure what’s going on here. Cinder seems to have some
>>> openstack greenlet code of its own in
>>> cinder/openstack/common/threadgroup.py; I don’t know the purpose of
>>> this code. SQLAlchemy’s connection pool has been tested a lot with
>>> eventlet / gevent and this has never been reported before. This is a
>>> very severe race and I’d think that this would be happening all over
>>> the place.
>>
>> The threadgroup module is from the Oslo incubator, so if you need to
>> review the git history you'll want to look at that copy.
>
> I haven’t confirmed this yet today, but based on some greenlet research
> as well as things I observed with PDB yesterday, my suspicion is that
> Cinder’s startup code runs in a traditional thread while, at the same
> time, the service is allowing connections to come in via green threads,
> which are running in a separate greenlet event loop (how else did my PDB
> sessions have continued echo output stepping on my typing?). greenlet
> performs stack-slicing, where it memoizes the state of the interpreter
> to some extent, but importantly it does not provide this in conjunction
> with traditional threads. So Python code can’t even tell that it’s being
> shared, because all of the state is completely swapped out (but of
> course this doesn’t count when you’re a file descriptor). I’ve been
> observing this by watching identical objects (same ID) magically have
> different state as a stale greenlet suddenly wakes up in the middle of
> the presumably thread-bound initialization code.
>
> My question is then how is it that such an architecture would be
> possible, that Cinder’s service starts up without greenlets yet allows
> greenlet-based requests to come in before this critical task is
> complete? Shouldn’t the various oslo systems be providing patterns to
> prevent this disastrous combination?

I would have thought so, but they are (mostly) libraries not frameworks so they are often combined in unexpected ways. Let's see where the issue is before deciding on where the fix should be.

Doug

>> Current status is that I’m continuing to try to determine why this is
>> happening here, and seemingly nowhere else.
Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
Mike Bayer wrote:
> I haven’t confirmed this yet today but based on some greenlet research
> as well as things I observed with PDB yesterday, my suspicion is that
> Cinder’s startup code runs in a traditional thread, at the same time the
> service is allowing connections to come in via green-threads, which are
> running in a separate greenlet event loop

OK, it seems that is not what’s happening. Which is actually very bad news, because it’s starting to look like SQLAlchemy’s connection pool, even when I directly patch it with eventlet’s threading and Queue implementations, is failing. Which would make it all the more amazing that we don’t see this happening everywhere, all the time. The story is still not told yet.
Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
Doug Hellmann wrote:
>> So I’m not really sure what’s going on here. Cinder seems to have some
>> openstack greenlet code of its own in
>> cinder/openstack/common/threadgroup.py; I don’t know the purpose of
>> this code. SQLAlchemy’s connection pool has been tested a lot with
>> eventlet / gevent and this has never been reported before. This is a
>> very severe race and I’d think that this would be happening all over
>> the place.
>
> The threadgroup module is from the Oslo incubator, so if you need to
> review the git history you'll want to look at that copy.

I haven’t confirmed this yet today, but based on some greenlet research as well as things I observed with PDB yesterday, my suspicion is that Cinder’s startup code runs in a traditional thread while, at the same time, the service is allowing connections to come in via green threads, which are running in a separate greenlet event loop (how else did my PDB sessions have continued echo output stepping on my typing?). greenlet performs stack-slicing, where it memoizes the state of the interpreter to some extent, but importantly it does not provide this in conjunction with traditional threads. So Python code can’t even tell that it’s being shared, because all of the state is completely swapped out (but of course this doesn’t count when you’re a file descriptor). I’ve been observing this by watching identical objects (same ID) magically have different state as a stale greenlet suddenly wakes up in the middle of the presumably thread-bound initialization code.

My question is then how is it that such an architecture would be possible, that Cinder’s service starts up without greenlets yet allows greenlet-based requests to come in before this critical task is complete? Shouldn’t the various oslo systems be providing patterns to prevent this disastrous combination?

> Doug
>
>> Current status is that I’m continuing to try to determine why this is
>> happening here, and seemingly nowhere else.
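A cheap way to surface the kind of sharing described above, in the same spirit as the pid check that eventually cracked the case, is to stamp each connection with the identity of the thread that created it and refuse use from any other. A minimal stdlib sketch; ThreadGuardedConnection is a hypothetical name for illustration, not an oslo.db or SQLAlchemy API:

```python
import threading

class ThreadGuardedConnection:
    """Hypothetical diagnostic wrapper: records which thread created the
    connection and raises if any other thread tries to use it."""
    def __init__(self, raw_conn):
        self.raw_conn = raw_conn
        self.owner = threading.get_ident()   # thread that created us

    def execute(self, stmt):
        tid = threading.get_ident()
        if tid != self.owner:
            raise RuntimeError(
                "connection owned by tid %s used from tid %s" %
                (self.owner, tid))
        return "OK: %s" % stmt               # stand-in for real execution

conn = ThreadGuardedConnection(object())
same_thread = conn.execute("SELECT 1")       # fine: creator thread

errors = []
def misuse():
    # a second OS thread touching the same connection trips the guard
    try:
        conn.execute("SELECT 1")
    except RuntimeError as exc:
        errors.append(str(exc))

t = threading.Thread(target=misuse)
t.start()
t.join()
# errors now holds the cross-thread violation message
```

With eventlet's monkey patching in effect, threading.get_ident() is typically patched to return per-greenlet identities, so the same guard can, under that assumption, also flag cross-greenlet sharing like the tid mismatches in the logs below.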
Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently
On Mon, Feb 16, 2015, at 07:50 PM, Mike Bayer wrote:
> hi all -
>
> I’ve been researching this cinder issue
> https://bugs.launchpad.net/cinder/+bug/1417018 and I’ve found something
> concerning.
>
> Basically there’s a race condition here which is occurring because a
> single MySQL connection is shared between two green threads. This
> occurs while Cinder is doing its startup work in
> cinder/volume/manager.py -> init_host(), and at the same time a request
> comes in from a separate service call that seems to be part of the
> startup.
>
> The log output at http://paste.openstack.org/show/175571/ shows this
> happening. I can break it down:
>
> 1. A big query for volumes occurs as part of
> "self.db.volume_get_all_by_host(ctxt, self.host)". To reproduce the
> error more regularly I’ve placed it into a loop of 100 calls. We can
> see that the thread id is 68089648 and the MySQL connection is 3a9c5a0.
>
>     2015-02-16 19:32:47.236 INFO sqlalchemy.engine.base.Engine
>     [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] tid: 68089648,
>     connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
>     SELECT volumes.created_at AS
>     2015-02-16 19:32:47.237 INFO sqlalchemy.engine.base.Engine
>     [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None]
>     ('localhost.localdomain@ceph', 'localhost.localdomain@ceph#%')
>
> 2. A “ping” query comes in related to a different API call - different
> thread id, *same* connection:
>
>     2015-02-16 19:32:47.276 INFO sqlalchemy.engine.base.Engine
>     [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] tid: 68081456,
>     connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
>     SELECT 1
>     2015-02-16 19:32:47.279 INFO sqlalchemy.engine.base.Engine
>     [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] ()
>
> 3. The first statement is still in the middle of invocation, so we get
> a failure - either a mismatch of the statement to the cursor, or a
> MySQL lost connection (stack trace begins):
>
>     Traceback (most recent call last):
>       File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
>     … more stack trace
>
> 4. Another thread id, *same* connection:
>
>     2015-02-16 19:32:47.290 INFO sqlalchemy.engine.base.Engine
>     [req-f980de7c-151d-4fed-b45e-d12b133859a6 None None] tid: 61238160,
>     connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
>     SELECT 1
>
>         rows = [process[0](row, None) for row in fetch]
>       File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 363, in _instance
>         tuple([row[column] for column in pk_cols])
>       File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 331, in _key_fallback
>         expression._string_or_unprintable(key))
>     NoSuchColumnError: "Could not locate column in row for column 'volumes.id'"
>     2015-02-16 19:32:47.293 ERROR cinder.openstack.common.threadgroup [-]
>     "Could not locate column in row for column 'volumes.id'"
>
> So I’m not really sure what’s going on here. Cinder seems to have some
> openstack greenlet code of its own in
> cinder/openstack/common/threadgroup.py; I don’t know the purpose of
> this code. SQLAlchemy’s connection pool has been tested a lot with
> eventlet / gevent and this has never been reported before. This is a
> very severe race and I’d think that this would be happening all over
> the place.

The threadgroup module is from the Oslo incubator, so if you need to review the git history you'll want to look at that copy.

Doug

> Current status is that I’m continuing to try to determine why this is
> happening here, and seemingly nowhere else.
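The interleaving in steps 1-4 can be reproduced without MySQL at all. A toy model in plain Python (all names here are illustrative; a real DBAPI connection's wire state is far more involved) in which one connection holds a single in-flight result slot and a cooperative scheduler switches between two "green threads" after each sends its statement:

```python
# Toy model of the race: one "connection" with a single in-flight result
# slot (a stand-in for a DBAPI connection's wire state), shared by two
# cooperatively scheduled tasks.
class ToyConnection:
    def __init__(self):
        self.pending = None

    def send_query(self, stmt):
        self.pending = stmt          # clobbers any statement already in flight

    def read_result(self):
        return "rows for: %s" % self.pending

conn = ToyConnection()
log = []

def green_thread(name, stmt):
    conn.send_query(stmt)
    yield                            # simulate eventlet yielding on socket I/O
    log.append((name, conn.read_result()))

# A round-robin "hub": both tasks send before either reads, exactly the
# interleaving shown in the Cinder logs above.
tasks = [green_thread("volume_query", "SELECT * FROM volumes"),
         green_thread("ping", "SELECT 1")]
for t in tasks:
    next(t)                          # each task sends its statement, then yields
for t in tasks:
    try:
        next(t)                      # each task resumes and reads a result
    except StopIteration:
        pass

# log[0] shows the volume query receiving the ping's result set -- the
# same cross-talk that surfaced as NoSuchColumnError
```

The fix is not a smarter pool but ensuring each green thread checks out its own connection for the duration of its work, so no two tasks ever interleave on one wire.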