Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Mike Bayer


Mike Bayer  wrote:

> 
> 
> Doug Hellmann  wrote:
> 
>>> My question, then, is how such an architecture is even possible: how
>>> does Cinder’s service start up without greenlets, yet allow
>>> greenlet-based requests to come in before this critical task is
>>> complete? Shouldn’t the various oslo systems be providing patterns to
>>> prevent this disastrous combination?
>> 
>> I would have thought so, but they are (mostly) libraries not frameworks
>> so they are often combined in unexpected ways. Let's see where the issue
>> is before deciding on where the fix should be.
> 
> 
> My next suspicion is that this is actually a spawned subprocess, through
> oslo.concurrency, which is failing to create a new connection pool and is
> thus sharing the same file descriptor between processes. That is a much
> more ordinary issue and would explain everything I’m seeing in a more
> familiar way. Let me see if I can confirm *that*.


So, that was it.

Here’s some debugging I added to oslo.db, inside its own create_engine(), that
catches this easily:

    engine = sqlalchemy.create_engine(url, **engine_args)

    import os
    from sqlalchemy import event

    @event.listens_for(engine, "connect")
    def connect(dbapi_connection, connection_record):
        # remember which process opened this DBAPI connection
        connection_record.info['pid'] = os.getpid()

    @event.listens_for(engine, "checkout")
    def checkout(dbapi_connection, connection_record, connection_proxy):
        # refuse to hand a connection to a process that didn't create it
        pid = os.getpid()
        if connection_record.info['pid'] != pid:
            raise Exception(
                "Connection record belongs to pid %s, "
                "attempting to check out in pid %s" %
                (connection_record.info['pid'], pid))


Sure enough, the original database errors go away, and on a plain startup
Cinder’s logs are filled with:

 ERROR cinder.openstack.common.threadgroup [-] Connection record belongs to pid 21200, attempting to check out in pid 21408
 ERROR cinder.openstack.common.threadgroup [-] Connection record belongs to pid 21200, attempting to check out in pid 21409

etc.
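
For what it’s worth, once this is confirmed the usual remedy is just to make
sure each process builds its own pool, e.g. by disposing of the inherited one
right after the fork. A minimal sketch, assuming a plain os.fork() worker and
a made-up URL rather than whatever oslo.concurrency actually does:

import os

import sqlalchemy

# hypothetical URL, for illustration only
engine = sqlalchemy.create_engine("mysql://cinder:pw@127.0.0.1/cinder")

pid = os.fork()
if pid == 0:
    # child process: discard the connections inherited from the parent so
    # the pool opens fresh file descriptors instead of sharing the parent's
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(sqlalchemy.text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)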

The subprocesses here are logged as:

2015-02-17 13:05:12.583 DEBUG oslo_concurrency.processutils [req-a06464d1-4785-4a29-8e58-239d5a674451 None None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C LVM_SYSTEM_DIR=/etc/cinder vgs --noheadings --unit=g -o name,size,free,lv_count,uuid --separator : --nosuffix stack-volumes-lvm-1" returned: 0 in 0.828s from (pid=21408) execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:221

A trace is as follows:

Traceback (most recent call last):
  File "/opt/stack/cinder/cinder/openstack/common/threadgroup.py", line 143, in wait
    x.wait()
  File "/opt/stack/cinder/cinder/openstack/common/threadgroup.py", line 47, in wait
    return self.thread.wait()
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
    return self._exit_event.wait()
  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/cinder/cinder/openstack/common/service.py", line 492, in run_service
    service.start()
  File "/opt/stack/cinder/cinder/service.py", line 142, in start
    self.manager.init_host()
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/cinder/cinder/volume/manager.py", line 294, in init_host
    volumes = self.db.volume_get_all_by_host(ctxt, self.host)
  File "/opt/stack/cinder/cinder/db/api.py", line 191, in volume_get_all_by_host
    return IMPL.volume_get_all_by_host(context, host)
  File "/opt/stack/cinder/cinder/db/sqlalchemy/api.py", line 160, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/cinder/cinder/db/sqlalchemy/api.py", line 1229, in volume_get_all_by_host
    result = _volume_get_query(context).filter(or_(*conditions)).all()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2320, in all
    return list(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2438, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2451, in _execute_and_instances
    close_with_result=True)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2442, in _connection_from_session
    **kw)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 854, in connection
    close_with_result=close_with_result)
  Fil

Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Mike Bayer


Doug Hellmann  wrote:

>> My question, then, is how such an architecture is even possible: how
>> does Cinder’s service start up without greenlets, yet allow
>> greenlet-based requests to come in before this critical task is
>> complete? Shouldn’t the various oslo systems be providing patterns to
>> prevent this disastrous combination?
> 
> I would have thought so, but they are (mostly) libraries not frameworks
> so they are often combined in unexpected ways. Let's see where the issue
> is before deciding on where the fix should be.


My next suspicion is that this is actually a spawned subprocess, through
oslo.concurrency, which is failing to create a new connection pool and is
thus sharing the same file descriptor between processes. That is a much more
ordinary issue and would explain everything I’m seeing in a more familiar
way. Let me see if I can confirm *that*.
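
The mechanics, if that’s what it is, are easy to see without any database at
all; a rough sketch with a plain socket standing in for the MySQL connection
(illustration only, not Cinder code):

import os
import socket

# a connected socket pair stands in for the client/server DB connection
client, server = socket.socketpair()

pid = os.fork()
if pid == 0:
    # the child inherits the very same file descriptor as the parent
    client.send(b"child\n")
    os._exit(0)
else:
    client.send(b"parent\n")
    os.waitpid(pid, 0)
    # both writes arrive interleaved on the single shared stream
    print(server.recv(1024))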



> Doug
> 
>>> Doug
>>> 
 Current status is that I’m continuing to try to determine why this is
 happening here, and seemingly nowhere else.
 
 
 
 


Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Doug Hellmann


On Tue, Feb 17, 2015, at 11:17 AM, Mike Bayer wrote:
> 
> 
> Doug Hellmann  wrote:
> 
> >> 
> >> So I’m not really sure what’s going on here.  Cinder seems to have some
> >> openstack greenlet code of its own in
> >> cinder/openstack/common/threadgroup.py, I don’t know the purpose of this
> >> code.   SQLAlchemy’s connection pool has been tested a lot with eventlet
> >> / gevent and this has never been reported before. This is a very
> >> severe race and I’d think that this would be happening all over the
> >> place.
> > 
> > The threadgroup module is from the Oslo incubator, so if you need to
> > review the git history you'll want to look at that copy.
> 
> 
> I haven’t confirmed this yet today but based on some greenlet research as
> well as things I observed with PDB yesterday, my suspicion is that
> Cinder’s
> startup code runs in a traditional thread, at the same time the service
> is
> allowing connections to come in via green-threads, which are running in a
> separate greenlet event loop (how else did my PDB sessions have continued
> echo output stepping on my typing?). greenlet performs stack-slicing
> where
> it is memoizing the state of the interpreter to some extent, but
> importantly
> it does not provide this in conjunction with traditional threads. So
> Python
> code can’t even tell that it’s being shared, because all of the state is
> completely swapped out (but of course this doesn’t count when you’re a
> file
> descriptor). I’ve been observing this by watching the identical objects
> (same ID) magically have different state as a stale greenlet suddenly
> wakes
> up in the middle of the presumably thread-bound initialization code.
> 
> My question, then, is how such an architecture is even possible: how
> does Cinder’s service start up without greenlets, yet allow
> greenlet-based requests to come in before this critical task is
> complete? Shouldn’t the various oslo systems be providing patterns to
> prevent this disastrous combination?

I would have thought so, but they are (mostly) libraries not frameworks
so they are often combined in unexpected ways. Let's see where the issue
is before deciding on where the fix should be.

Doug

> 
> 
> > Doug
> > 
> >> Current status is that I’m continuing to try to determine why this is
> >> happening here, and seemingly nowhere else.
> >> 
> >> 
> >> 
> >> 


Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Mike Bayer


Mike Bayer  wrote:

> 
> I haven’t confirmed this yet today but based on some greenlet research as
> well as things I observed with PDB yesterday, my suspicion is that Cinder’s
> startup code runs in a traditional thread, at the same time the service is
> allowing connections to come in via green-threads, which are running in a
> separate greenlet event loop

OK, it seems that is not what’s happening. Which is actually very bad news,
because it’s starting to look like SQLAlchemy’s connection pool is failing
even if I directly patch it with eventlet’s threading and Queue
implementations. That would make it all the more amazing that we don’t see
this happening everywhere, all the time. The story is still not told yet.
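
To clarify what I mean by patching: the effect is roughly what you get from
monkey-patching the stdlib before anything else imports it. A sketch of that
setup, not the exact sequence Cinder (or I) used:

import eventlet

# make threading, Queue/queue, socket, time, etc. greenlet-aware before
# anything else imports them
eventlet.monkey_patch()

import sqlalchemy

# the pool created here now uses green locks and queues internally;
# hypothetical URL, for illustration only
engine = sqlalchemy.create_engine("mysql://cinder:pw@127.0.0.1/cinder")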


Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Mike Bayer


Doug Hellmann  wrote:

>> 
>> So I’m not really sure what’s going on here.  Cinder seems to have some
>> openstack greenlet code of its own in
>> cinder/openstack/common/threadgroup.py, I don’t know the purpose of this
>> code.   SQLAlchemy’s connection pool has been tested a lot with eventlet
>> / gevent and this has never been reported before. This is a very
>> severe race and I’d think that this would be happening all over the
>> place.
> 
> The threadgroup module is from the Oslo incubator, so if you need to
> review the git history you'll want to look at that copy.


I haven’t confirmed this yet today but based on some greenlet research as
well as things I observed with PDB yesterday, my suspicion is that Cinder’s
startup code runs in a traditional thread, at the same time the service is
allowing connections to come in via green-threads, which are running in a
separate greenlet event loop (how else did my PDB sessions have continued
echo output stepping on my typing?). greenlet performs stack-slicing where
it is memoizing the state of the interpreter to some extent, but importantly
it does not provide this in conjunction with traditional threads. So Python
code can’t even tell that it’s being shared, because all of the state is
completely swapped out (but of course this doesn’t count when you’re a file
descriptor). I’ve been observing this by watching the identical objects
(same ID) magically have different state as a stale greenlet suddenly wakes
up in the middle of the presumably thread-bound initialization code.
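
To make the suspected mix concrete, here’s the shape of it as a standalone
sketch (not Cinder code): a native thread and a greenthread both touching the
same object, with neither one aware of the other:

import threading

import eventlet

shared = {"owner": None}   # stand-in for a pooled DB connection

def native_worker():
    shared["owner"] = "native thread"

def green_worker():
    shared["owner"] = "greenthread"
    eventlet.sleep(0)      # yield to the hub, as real network I/O would

t = threading.Thread(target=native_worker)
t.start()
gt = eventlet.spawn(green_worker)
t.join()
gt.wait()
print(shared["owner"])     # whichever wrote last; no error, no warning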

My question, then, is how such an architecture is even possible: how does
Cinder’s service start up without greenlets, yet allow greenlet-based
requests to come in before this critical task is complete? Shouldn’t the
various oslo systems be providing patterns to prevent this disastrous
combination?


> Doug
> 
>> Current status is that I’m continuing to try to determine why this is
>> happening here, and seemingly nowhere else.
>> 
>> 
>> 
>> 


Re: [openstack-dev] [cinder] [oslo] MySQL connection shared between green threads concurrently

2015-02-17 Thread Doug Hellmann


On Mon, Feb 16, 2015, at 07:50 PM, Mike Bayer wrote:
> hi all -
> 
> I’ve been researching this cinder issue
> https://bugs.launchpad.net/cinder/+bug/1417018 and I’ve found something
> concerning.
> 
> Basically there’s a race condition here which is occurring because a
> single MySQL connection is shared between two green threads.  This occurs
> while Cinder is doing its startup work in cinder/volume/manager.py ->
> init_host(), and at the same time a request comes in from a separate
> service call that seems to be part of the startup.
> 
> The log output at http://paste.openstack.org/show/175571/ shows this
> happening.  I can break it down:
> 
> 
> 1. A big query for volumes occurs as part of
> "self.db.volume_get_all_by_host(ctxt, self.host)”.   To reproduce the
> error more regularly I’ve placed it into a loop of 100 calls.  We can see
> that the thread id is 68089648 and the MySQL connection is 3a9c5a0.
> 
> 2015-02-16 19:32:47.236 INFO sqlalchemy.engine.base.Engine
> [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] tid: 68089648,
> connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
> SELECT volumes.created_at AS 
> 2015-02-16 19:32:47.237 INFO sqlalchemy.engine.base.Engine
> [req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None]
> ('localhost.localdomain@ceph', 'localhost.localdomain@ceph#%’)
> 
> 2. A “ping” query comes in related to a different API call - different
> thread ID, *same* connection
> 
> 2015-02-16 19:32:47.276 INFO sqlalchemy.engine.base.Engine
> [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] tid: 68081456,
> connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
> SELECT 1
> 2015-02-16 19:32:47.279 INFO sqlalchemy.engine.base.Engine
> [req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] ()
> 
> 3. The first statement is still in the middle of invocation, so we get a
> failure: either a mismatch of the statement to the cursor, or a MySQL
> lost-connection error (stack trace begins):
> 
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
>  … more stack trace
> 
> 4. Another thread id, *same* connection.
> 
> 2015-02-16 19:32:47.290 INFO sqlalchemy.engine.base.Engine
> [req-f980de7c-151d-4fed-b45e-d12b133859a6 None None] tid: 61238160,
> connection: <_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt
> SELECT 1
> 
> rows = [process[0](row, None) for row in fetch]
>   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py",
>   line 363, in _instance
> tuple([row[column] for column in pk_cols])
>   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py",
>   line 331, in _key_fallback
> expression._string_or_unprintable(key))
> NoSuchColumnError: "Could not locate column in row for column
> 'volumes.id'"
> 2015-02-16 19:32:47.293 ERROR cinder.openstack.common.threadgroup [-]
> "Could not locate column in row for column 'volumes.id’"
> 
> 
> So I’m not really sure what’s going on here.  Cinder seems to have some
> openstack greenlet code of its own in
> cinder/openstack/common/threadgroup.py, I don’t know the purpose of this
> code.   SQLAlchemy’s connection pool has been tested a lot with eventlet
> / gevent and this has never been reported before. This is a very
> severe race and I’d think that this would be happening all over the
> place.

The threadgroup module is from the Oslo incubator, so if you need to
review the git history you'll want to look at that copy.

Doug

> 
> Current status is that I’m continuing to try to determine why this is
> happening here, and seemingly nowhere else.