Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 21/09/16 11:41, Joshua Harlow wrote: tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. Update: Mike was kind enough to run this one to ground and it looks like we could get a fix in sqlalchemy 1.1.0: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on I've seen something similar at https://review.openstack.org/#/c/316935/ __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean Dague wrote: > On 09/15/2016 09:20 AM, Roman Podoliaka wrote: >> Sean, >> >> So currently we have a default timeout of 160s in Nova. And >> specifically for migration tests we set a scaling factor of 2. Let's >> maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) >> and make a couple of "rechecks" to see if it helps or not. > > Given the fail rate, let's just go safe and double it to 4. If it's > really a timeout issue that should move the bell curve to mostly cover > it. Timeouts are really just supposed to be a backstop to ensure the > tests don't deadlock forever. > > The incidence rate is at that odd race rate that we'll probably have to > merge and just see if it goes away. Neutron also started seeing timeouts in DB migration tests today. We're going to try a similar timeout scaling factor for these tests.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
FWIW, there have been no new failures in Nova jobs since then. I'm confused as well why these tests would sporadically take much longer to execute. Perhaps we could install something like atop on our nodes to answer that question. On Wed, Sep 21, 2016 at 5:46 PM, Ihar Hrachyshka wrote: > I just hit that TimeoutException error in neutron functional tests: > > http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz > > It’s a bit weird that we hit that 180 sec timeout because in good runs, the > test takes ~5 secs. > > Do we have a remedy against that kind of failure? I saw nova bumped the > timeout length for the tests. Is it the approach we should apply across the > board for other projects? > > Ihar > > > Zane Bitter wrote: > >> On 14/09/16 11:44, Mike Bayer wrote: >>> >>> On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: > > I noticed the following issues happening quite often now in the > opportunistic db tests for nova - > > http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 > > > > > It looks like some race has been introduced where the various db > connections are not fully isolated from each other like they used to > be. > The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect.
>>> >>> >>> So, I think by "fully isolated from each other" what you really mean is >>> "operations upon a connection are not fully isolated from the subsequent >>> use of that connection", since that's what I see in the logs. A >>> connection is attempting to be used during teardown to drop tables, >>> however it's in this essentially broken state from a PyMySQL >>> perspective, which would indicate something has gone wrong with this >>> (pooled) connection in the preceding test that could not be detected or >>> reverted once the connection was returned to the pool. >>> >>> From Roman's observation, it looks like a likely source of this >>> corruption is a timeout that is interrupting the state of the PyMySQL >>> connection. In the preceding stack trace, PyMySQL is encountering a >>> raise as it attempts to call "self._sock.recv_into(b)", and it seems >>> like some combination of eventlet's response to signals and the >>> fixtures.Timeout() fixture is the cause of this interruption. As an >>> additional wart, something else is getting involved and turning it into >>> an IndexError, I'm not sure what that part is yet though I can imagine >>> that might be SQLAlchemy mis-interpreting what it expects to be a >>> PyMySQL exception class, since we normally look inside of >>> exception.args[0] to get the MySQL error code. With a blank exception >>> like fixtures.TimeoutException, .args is the empty tuple. >>> >>> The PyMySQL connection is now in an invalid state and unable to perform >>> a SELECT statement correctly, but the connection is not invalidated and >>> is instead returned to the connection pool in a broken state. So the >>> subsequent teardown, if it uses this same connection (which is likely), >>> fails because the connection has been interrupted in the middle of its >>> work and not given the chance to clean up. 
>>> >>> Seems like the use of fixtures.Timeout() fixture here is not organized >>> to work with a database operation in progress, especially an >>> eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout >>> due to a signal handler occurs, the entire connection pool should be >>> disposed (quickest way, engine.dispose()), or at the very least (and >>> much more targeted), the connection that's involved should be >>> invalidated from the pool, e.g. connection.invalidate(). >>> >>> The change to the environment here would be that this timeout is >>> happening at all - the reason for that is not yet known. If oslo.db's >>> version were involved in this error, I would guess that it would be >>> related to this timeout condition being caused, and not anything to do >>> with the connection provisioning. >>> > Oslo.db 4.13.3 did hit the scene about the time this showed up. So I > think we need to strongly consider blocking it and revisiting these > issues post newton. >> >> >> We've been seeing similar errors in Heat since at least Liberty >> (https://bugs.launchpad.net/heat/+b
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike Bayer wrote: On 09/21/2016 11:41 AM, Joshua Harlow wrote: I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github.com/sileht/cotyledon for service oriented tasks). Plus Keystone has gotten off of it. I actually like eventlet and gevent quite a lot. I am using it in a new middleware component that will be involved with database connection pooling. However, I *don't* use the global monkeypatching aspect. That's where this all goes very wrong. Things that are designed for synchronous operations, like database-oriented business methods as well as the work of the database driver itself, should run within threads. You can in fact use eventlet/gevent's APIs explicitly and you can even combine it with traditional threading explicitly. I'm actually using a stdlib Queue (carefully) to send data between greenlets and threads. Madness! Agreed (thanks for making that clear), it's really the monkeying that kills things, not eventlet/gevent directly, fair point. -Josh
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/21/2016 11:41 AM, Joshua Harlow wrote: I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github.com/sileht/cotyledon for service oriented tasks). Plus Keystone has gotten off of it. I actually like eventlet and gevent quite a lot. I am using it in a new middleware component that will be involved with database connection pooling. However, I *don't* use the global monkeypatching aspect. That's where this all goes very wrong. Things that are designed for synchronous operations, like database-oriented business methods as well as the work of the database driver itself, should run within threads. You can in fact use eventlet/gevent's APIs explicitly and you can even combine it with traditional threading explicitly. I'm actually using a stdlib Queue (carefully) to send data between greenlets and threads. Madness! -Josh
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Zane Bitter wrote: On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection.
In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton.
We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above. If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
I just hit that TimeoutException error in neutron functional tests: http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz It’s a bit weird that we hit that 180 sec timeout because in good runs, the test takes ~5 secs. Do we have a remedy against that kind of failure? I saw nova bumped the timeout length for the tests. Is it the approach we should apply across the board for other projects? Ihar Zane Bitter wrote: On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs.
A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). 
The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above. If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption.
As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above.
If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. cheers, Zane.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:57 PM, Mike Bayer wrote: > > > On 09/14/2016 11:05 PM, Mike Bayer wrote: >> >> Are *these* errors also new as of version 4.13.3 of oslo.db ? Because >> here I have more suspicion of one particular oslo.db change here. > > The version in question that has the changes to provisioning and > anything really to do with this area is 4.12.0. So if you didn't see > any problem w/ 4.12 then almost definitely oslo.db is not the cause - > the code changes subsequent to 4.12 have no relationship to any system > used by the opportunistic test base. I would hope at least that 4.12 > is the version where we see things changing because there were small > changes to the provisioning code. > > But at the same time, I'm combing through the quite small adjustments to > the provisioning code as of 4.12.0 and I'm not seeing what could > introduce this issue. That said, we really should never see the kind > of error we see with the "DROP DATABASE" failing because it remains in > use, however this can be a side effect of the test itself having > problems with the state of a different connection, not being closed and > locks remain held. > > That is, there are poor failure modes for sure here, I just can't see > anything in 4.13 or even 4.12 that would suddenly introduce them. > > By all means if these failures disappear when we go to 4.11 vs. 4.12, > that would be where we need to go and to look for next cycle. From > my POV if the failures do disappear then that would be the best evidence > that the oslo.db version is the factor. In looking through the change history, I agree that nothing in the 4.13.x stream is plausibly related to this failure. The last time something seems related is back in the 4.11/4.12 space. That would be a much riskier change than just rolling back to 4.13.0 at this point in the release, so I would not recommend it. So I think Roman's approach of trying to just adjust the timeout seems like the lowest risk way to move forward.
Hopefully that mitigates things. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/15/2016 09:20 AM, Roman Podoliaka wrote: > Sean, > > So currently we have a default timeout of 160s in Nova. And > specifically for migration tests we set a scaling factor of 2. Let's > maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) > and make a couple of "rechecks" to see if it helps or not. Given the fail rate, let's just go safe and double it to 4. If it's really a timeout issue that should move the bell curve to mostly cover it. Timeouts are really just supposed to be a backstop to ensure the tests don't deadlock forever. The incidence rate is at that odd race rate that we'll probably have to merge and just see if it goes away. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean, So currently we have a default timeout of 160s in Nova. And specifically for migration tests we set a scaling factor of 2. Let's maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) and make a couple of "rechecks" to see if it helps or not. In Ocata we could revisit the migrations collapse again to reduce the number of scripts. On the testing side, we could probably "cheat" a bit to trade data safety for performance. E.g. we could set "fsync = off" for PostgreSQL (https://www.postgresql.org/docs/9.2/static/runtime-config-wal.html). Similar settings must be available for MySQL as well. Thanks, Roman On Thu, Sep 15, 2016 at 3:07 PM, Sean Dague wrote: > On 09/15/2016 05:52 AM, Roman Podoliaka wrote: >> Mike, >> >> On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: >> >>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? >>> If so, was it also accompanied by the same "resource closed" condition or >>> did this second part of the condition only appear at 4.13.3? >>> * Did we see a similar "timeout" / "resource closed" combination prior to >>> 4.13.3, just with less frequency? >> >> I believe we did - >> https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we >> used mysql-python back then, so the error was slightly different. >> >>> * What is the magnitude of the "timeout" this fixture is using, is it on the >>> order of seconds, minutes, hours? >> >> It's set in seconds per project in .testr.conf, e.g.: >> >> https://github.com/openstack/nova/blob/master/.testr.conf >> https://github.com/openstack/ironic/blob/master/.testr.conf >> >> In Nova we also have a 'timeout scaling factor' specifically set for >> migration tests: >> >> https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 >> >>> * If many minutes or hours, can the test suite be observed to be stuck on >>> this test? 
Has someone tried to run a "SHOW PROCESSLIST" while this >>> condition is occurring to see what SQL is pausing? >> >> We could try to do that in the gate, but I don't expect to see >> anything interesting: IMO, we'd see regular queries that should have >> been executed fast, but actually took much longer time (presumably due >> to heavy disk IO caused by multiple workers running similar tests in >> parallel). >> >>> * Is this failure only present within the Nova test suite or has it been >>> observed in the test suites of other projects? >> >> According to >> >> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 >> >> it's mostly Nova, but this has also been observed in Ironic, Manila >> and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT value set to 60 >> seconds. >> >>> * Is this failure present only on the "database migration" test suite or is >>> it present in other opportunistic tests, for Nova and others? >> >> Based on the console logs I've checked only migration tests failed, >> but that's probably due to the fact that they are usually the slowest >> ones (again, presumably due to heavy disk IO). >> >>> * Have there been new database migrations added to Nova which are being >>> exercised here and may be involved? >> >> Looks like there were no changes recently: >> >> https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) >> >>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If >>> downgrading it is feasible, that would at least be a way to eliminate it as >>> a possibility if these same failures continue to occur, or a way to confirm >>> its involvement if they disappear. But if downgrading is disruptive then >>> there are other things to look at in order to have a better chance at >>> predicting its involvement. 
>> >> I don't think we need to block oslo.db 4.13.3, unless we clearly see >> it's this version that causes these failures. >> >> I gave version 4.11 (before changes to provisioning) a try on my local >> machine and see the very same errors when concurrency level is high ( >> http://paste.openstack.org/show/577350/ ), so I don't think the latest >> oslo.db release has anything to do with the increase of the number of >> failures on CI. >> >> My current understanding is that the load on gate nodes somehow >> increased (either we run more testr workers in parallel now or >> apply/test more migrations or just more run VMs per host or the gate >> is simply busy at this point of the release cycle), so that we started >> to see these timeouts more often. > > The migration count is definitely going to grow over time, as is the > nature of the beast. Nova hasn't had a migration collapse in quite a > while. The higher patch volume in Nova and larger number of db > migrations could definitely account for Nova being higher here. > > Is there a better timeout value that you think will make these timeouts happen less often?
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike, I think the exact error (InterfaceError vs TimeoutException) varies depending on what code is being executed at the very moment when a process receives SIGALRM. I tried to run the tests against PostgreSQL passing very small timeout values (OS_TEST_TIMEOUT=5 python -m testtools.run nova.tests.unit.db.test_migrations.TestNovaMigrationsPostgreSQL.test_walk_versions) and saw both InterfaceError and TimeoutException: http://paste.openstack.org/show/577410/ http://paste.openstack.org/show/577411/ strace'ing shows that a connection to PostgreSQL is closed right after SIGALRM is handled: http://paste.openstack.org/show/577425/ I tried to reproduce that manually by means of gdb and set a breakpoint on close(): http://paste.openstack.org/show/577422/ ^ looks like psycopg2 closes the connection automatically if a query was interrupted by SIGALRM. The corresponding Python-level backtrace is: http://paste.openstack.org/show/577423/ ^ i.e. connection closing happens in the middle of a cursor.execute() call. In the end I see a similar InterfaceError: http://paste.openstack.org/show/577424/ That being said, this does not explain the "DB is in use" part. Thanks, Roman On Thu, Sep 15, 2016 at 6:05 AM, Mike Bayer wrote: > There's a different set of logs attached to the launchpad issue, that's not > what I was looking at before. > > These logs are at > http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031 > . In these logs, I see something *very* different, not just the MySQL > tests but the Postgresql tests are definitely hitting conflicts against the > randomly generated database. > > This set of traces, e.g.: > > sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database > "dbzrtmgbxv" is being accessed by other users > 2016-09-13 14:54:18.093723 | DETAIL: There is 1 other session using the > database. 
> 2016-09-13 14:54:18.093736 | [SQL: 'DROP DATABASE dbzrtmgbxv'] > > and > > File > "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", > line 668, in _rollback_impl > 2016-09-13 14:54:18.095470 | > self.engine.dialect.do_rollback(self.connection) > 2016-09-13 14:54:18.095513 | File > "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", > line 420, in do_rollback > 2016-09-13 14:54:18.095526 | dbapi_connection.rollback() > 2016-09-13 14:54:18.095548 | sqlalchemy.exc.InterfaceError: > (psycopg2.InterfaceError) connection already closed > > are a very different animal. For one thing, they're on Postgresql where the > driver and DB act extremely rationally. For another, there's no timeout > exception here, and not all the conflicts are within the teardown. > > Are *these* errors also new as of version 4.13.3 of oslo.db? Because here > I have more suspicion of one particular oslo.db change here. > > fits much more with your initial description > > > On 09/14/2016 10:48 PM, Mike Bayer wrote: >> >> On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. >>> >>> So that means reverting all stable/newton changes, previous 4.13.x >>> have been already blocked https://review.openstack.org/365565 >>> How would we proceed, do we need to revert all backports on stable/newton? >> >> >> In case my previous email wasn't clear, I don't *yet* see evidence that >> the recent 4.13.3 release of oslo.db is the cause of this problem. >> However, that is only based upon what I see in this stack trace, which >> is that the test framework is acting predictably (though erroneously) >> based on the timeout condition which is occurring. 
I don't (yet) see a >> reason that the same effect would not occur prior to 4.13.3 in the face >> of a signal pre-empting the work of the pymysql driver mid-stream. >> However, this assumes that the timeout condition itself is not a product >> of the current oslo.db version and that is not known yet. >> >> There's a list of questions that should all be answerable which could >> assist in giving some hints towards this. >> >> There's two parts to the error in the logs. There's the "timeout" >> condition, then there is the bad reaction of the PyMySQL driver and the >> test framework as a result of the operation being interrupted within the >> test. >> >> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition >> occur? If so, was it also accompanied by the same "resource closed" >> condition or did this second part of the condition only appear at 4.13.3? >> >> * Did we see a similar "timeout" / "resource closed" combination prior >> to 4.13.3, just with less frequency? >> >> * Was the version of PyMySQL also recently upgraded (I'm assuming this >> environment has been on PyMySQL for a long time at this point)? What was >> the version change if so? Especially if we previously saw "timeout" but >> no "resource closed", perhaps an older version of PyMySQL didn't react >> in this way?
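The SIGALRM mechanics Roman describes (an alarm pre-empting the driver mid-call, as fixtures.Timeout() does) can be sketched roughly as follows. This is an illustration, not oslo's or fixtures' actual code; the exception class and helper names are invented, and time.sleep() stands in for a long-running cursor.execute():

```python
import signal
import time


class TimeoutException(Exception):
    """Stand-in for the test framework's timeout exception."""


def _handler(signum, frame):
    # Fires asynchronously, interrupting whatever call is in flight --
    # this is what leaves a DB driver in a half-finished state.
    raise TimeoutException("test exceeded its time budget")


def run_with_timeout(func, seconds):
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)              # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)


try:
    run_with_timeout(lambda: time.sleep(5), seconds=1)
    interrupted = False
except TimeoutException:
    interrupted = True

print(interrupted)  # → True: the alarm cut the 5s "query" short
```

If the interrupted call is a DBAPI query rather than a sleep, the driver never sees the rest of the server's response, which is consistent with the InterfaceError/ResourceClosedError symptoms discussed in this thread.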
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/15/2016 05:52 AM, Roman Podoliaka wrote: > Mike, > > On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: > >> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? >> If so, was it also accompanied by the same "resource closed" condition or >> did this second part of the condition only appear at 4.13.3? >> * Did we see a similar "timeout" / "resource closed" combination prior to >> 4.13.3, just with less frequency? > > I believe we did - > https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we > used mysql-python back then, so the error was slightly different. > >> * What is the magnitude of the "timeout" this fixture is using, is it on the >> order of seconds, minutes, hours? > > It's set in seconds per project in .testr.conf, e.g.: > > https://github.com/openstack/nova/blob/master/.testr.conf > https://github.com/openstack/ironic/blob/master/.testr.conf > > In Nova we also have a 'timeout scaling factor' specifically set for > migration tests: > > https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 > >> * If many minutes or hours, can the test suite be observed to be stuck on >> this test? Has someone tried to run a "SHOW PROCESSLIST" while this >> condition is occurring to see what SQL is pausing? > > We could try to do that in the gate, but I don't expect to see > anything interesting: IMO, we'd see regular queries that should have > been executed fast, but actually took much longer time (presumably due > to heavy disk IO caused by multiple workers running similar tests in > parallel). > >> * Is this failure only present within the Nova test suite or has it been >> observed in the test suites of other projects? > > According to > > http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 > > it's mostly Nova, but this has also been observed in Ironic, Manila > and Ceilometer. 
Ironic and Manila have OS_TEST_TIMEOUT value set to 60 > seconds. > >> * Is this failure present only on the "database migration" test suite or is >> it present in other opportunistic tests, for Nova and others? > > Based on the console logs I've checked only migration tests failed, > but that's probably due to the fact that they are usually the slowest > ones (again, presumably due to heavy disk IO). > >> * Have there been new database migrations added to Nova which are being >> exercised here and may be involved? > > Looks like there were no changes recently: > > https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) > >> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If >> downgrading it is feasible, that would at least be a way to eliminate it as >> a possibility if these same failures continue to occur, or a way to confirm >> its involvement if they disappear. But if downgrading is disruptive then >> there are other things to look at in order to have a better chance at >> predicting its involvement. > > I don't think we need to block oslo.db 4.13.3, unless we clearly see > it's this version that causes these failures. > > I gave version 4.11 (before changes to provisioning) a try on my local > machine and see the very same errors when concurrency level is high ( > http://paste.openstack.org/show/577350/ ), so I don't think the latest > oslo.db release has anything to do with the increase of the number of > failures on CI. > > My current understanding is that the load on gate nodes somehow > increased (either we run more testr workers in parallel now or > apply/test more migrations or just more run VMs per host or the gate > is simply busy at this point of the release cycle), so that we started > to see these timeouts more often. 
The migration count is definitely going to grow over time, as is the nature of the beast. Nova hasn't had a migration collapse in quite a while. The higher patch volume in Nova and larger number of db migrations could definitely account for Nova being higher here. Is there a better timeout value that you think will make these timeouts happen less often? -Sean -- Sean Dague http://dague.net __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike, On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: > * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? > If so, was it also accompanied by the same "resource closed" condition or > did this second part of the condition only appear at 4.13.3? > * Did we see a similar "timeout" / "resource closed" combination prior to > 4.13.3, just with less frequency? I believe we did - https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we used mysql-python back then, so the error was slightly different. > * What is the magnitude of the "timeout" this fixture is using, is it on the > order of seconds, minutes, hours? It's set in seconds per project in .testr.conf, e.g.: https://github.com/openstack/nova/blob/master/.testr.conf https://github.com/openstack/ironic/blob/master/.testr.conf In Nova we also have a 'timeout scaling factor' specifically set for migration tests: https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 > * If many minutes or hours, can the test suite be observed to be stuck on > this test? Has someone tried to run a "SHOW PROCESSLIST" while this > condition is occurring to see what SQL is pausing? We could try to do that in the gate, but I don't expect to see anything interesting: IMO, we'd see regular queries that should have been executed fast, but actually took much longer time (presumably due to heavy disk IO caused by multiple workers running similar tests in parallel). > * Is this failure only present within the Nova test suite or has it been > observed in the test suites of other projects? According to http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 it's mostly Nova, but this has also been observed in Ironic, Manila and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT value set to 60 seconds. 
> * Is this failure present only on the "database migration" test suite or is > it present in other opportunistic tests, for Nova and others? Based on the console logs I've checked only migration tests failed, but that's probably due to the fact that they are usually the slowest ones (again, presumably due to heavy disk IO). > * Have there been new database migrations added to Nova which are being > exercised here and may be involved? Looks like there were no changes recently: https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) > I'm not sure how much of an inconvenience it is to downgrade oslo.db. If > downgrading it is feasible, that would at least be a way to eliminate it as > a possibility if these same failures continue to occur, or a way to confirm > its involvement if they disappear. But if downgrading is disruptive then > there are other things to look at in order to have a better chance at > predicting its involvement. I don't think we need to block oslo.db 4.13.3, unless we clearly see it's this version that causes these failures. I gave version 4.11 (before changes to provisioning) a try on my local machine and see the very same errors when concurrency level is high ( http://paste.openstack.org/show/577350/ ), so I don't think the latest oslo.db release has anything to do with the increase of the number of failures on CI. My current understanding is that the load on gate nodes somehow increased (either we run more testr workers in parallel now or apply/test more migrations or just more run VMs per host or the gate is simply busy at this point of the release cycle), so that we started to see these timeouts more often. 
Thanks, Roman
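The timeout arithmetic Roman points at can be made concrete. This is a hypothetical mirror of the pattern in nova's test_migrations.py (the constant and function names here are illustrative, not Nova's actual identifiers): the base budget comes from OS_TEST_TIMEOUT and migration tests multiply it by a scaling factor.

```python
import os

# Illustrative only: Nova sets a per-test-class scaling factor on top of
# the suite-wide OS_TEST_TIMEOUT from .testr.conf.
TIMEOUT_SCALING_FACTOR = 2


def effective_timeout(default=160):
    base = int(os.environ.get('OS_TEST_TIMEOUT', default))
    return base * TIMEOUT_SCALING_FACTOR


os.environ['OS_TEST_TIMEOUT'] = '160'
print(effective_timeout())  # 160 * 2 = 320, matching the 320s timeouts in the gate logs
```

Raising the factor (to 3 or 4, as proposed in this thread) shifts the budget without touching the suite-wide default for non-migration tests.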
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:05 PM, Mike Bayer wrote: Are *these* errors also new as of version 4.13.3 of oslo.db? Because here I have more suspicion of one particular oslo.db change here. The version in question that has the changes to provisioning and anything really to do with this area is 4.12.0. So if you didn't see any problem w/ 4.12 then almost definitely oslo.db is not the cause - the code changes subsequent to 4.12 have no relationship to any system used by the opportunistic test base. I would hope at least that 4.12 is the version where we see things changing because there were small changes to the provisioning code. But at the same time, I'm combing through the quite small adjustments to the provisioning code as of 4.12.0 and I'm not seeing what could introduce this issue. That said, we really should never see the kind of error we see with the "DROP DATABASE" failing because it remains in use, however this can be a side effect of the test itself having problems with the state of a different connection, not being closed and locks remain held. That is, there's poor failure modes for sure here, I just can't see anything in 4.13 or even 4.12 that would suddenly introduce them. By all means if these failures disappear when we go to 4.11 vs. 4.12, that would be where we need to go and to look for next cycle. From my POV if the failures do disappear then that would be the best evidence that the oslo.db version is the factor. fits much more with your initial description On 09/14/2016 10:48 PM, Mike Bayer wrote: On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? 
In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? * Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? * Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? 
* What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others? * Have there been new database migrations added to Nova which are being exercised here and may be involved? I'm not sure how much of an inconvenience it is to downgrade oslo.db. If downgrading it is feasible, that would at least be a way to eliminate it as a possibility if these same failures continue to occur, or a way to confirm its involvement if they disappear. But if downgrading is disruptive then there are other things to look at in order to have a better chance at predicting its involvement.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
There's a different set of logs attached to the launchpad issue, that's not what I was looking at before. These logs are at http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031 . In these logs, I see something *very* different, not just the MySQL tests but the Postgresql tests are definitely hitting conflicts against the randomly generated database. This set of traces, e.g.: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database "dbzrtmgbxv" is being accessed by other users 2016-09-13 14:54:18.093723 | DETAIL: There is 1 other session using the database. 2016-09-13 14:54:18.093736 | [SQL: 'DROP DATABASE dbzrtmgbxv'] and File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 668, in _rollback_impl 2016-09-13 14:54:18.095470 | self.engine.dialect.do_rollback(self.connection) 2016-09-13 14:54:18.095513 | File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 420, in do_rollback 2016-09-13 14:54:18.095526 | dbapi_connection.rollback() 2016-09-13 14:54:18.095548 | sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed are a very different animal. For one thing, they're on Postgresql where the driver and DB act extremely rationally. For another, there's no timeout exception here, and not all the conflicts are within the teardown. Are *these* errors also new as of version 4.13.3 of oslo.db? Because here I have more suspicion of one particular oslo.db change here. fits much more with your initial description On 09/14/2016 10:48 PM, Mike Bayer wrote: On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. 
So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? * Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? 
* Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? * What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others?
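For context, the randomly named databases in these traces (e.g. "dbzrtmgbxv") come from the opportunistic fixture creating a throwaway schema per test run and dropping it on teardown. A minimal sketch of that pattern follows; this is illustrative, not oslo.db's actual implementation, and the name length is an assumption based on the names in the logs:

```python
import random
import string


def random_db_name(length=10):
    """Generate a throwaway database name like the 'dbzrtmgbxv' seen above."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))


name = random_db_name()
create_sql = 'CREATE DATABASE %s' % name
drop_sql = 'DROP DATABASE %s' % name
print(create_sql)
print(drop_sql)
```

The failure mode in the logs follows directly from this design: PostgreSQL refuses the teardown DROP DATABASE while any other session still holds a connection to that database, so a single leaked connection from the test body turns into the "is being accessed by other users" error.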
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? 
* Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? * Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? * What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others? * Have there been new database migrations added to Nova which are being exercised here and may be involved? I'm not sure how much of an inconvenience it is to downgrade oslo.db. If downgrading it is feasible, that would at least be a way to eliminate it as a possibility if these same failures continue to occur, or a way to confirm its involvement if they disappear. But if downgrading is disruptive then there are other things to look at in order to have a better chance at predicting its involvement. 
Cheers, Alan
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On Wed, Sep 14, 2016, at 10:01 AM, Sean Dague wrote: > On 09/14/2016 12:06 PM, Roman Podoliaka wrote: > > Hmm, looks like we now run more testr workers in parallel (8 instead of 4): > > > > http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz > > http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html > > > > On my machine running Nova migration tests against MySQL is much > > slower with 8 workers than with 4 due to disk IO (it's HDD). When they > > time out (after 320s) I see the very same TimeoutException and > > IndexError (probably something messes up with TimeoutException up the > > stack). > > Yes, by default testr runs with the number of workers matching the # of > cpus on the target. I think all our cloud providers are now 8 cpu > guests. So unit / functional tests are running 8 way. That's been true > for quite a while IIRC. OSIC is 4vcpu and the others have 8vcpu. Clark
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I > think we need to strongly consider blocking it and revisiting these > issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? Cheers, Alan
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 12:06 PM, Roman Podoliaka wrote: > Hmm, looks like we now run more testr workers in parallel (8 instead of 4): > > http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz > http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html > > On my machine running Nova migration tests against MySQL is much > slower with 8 workers than with 4 due to disk IO (it's HDD). When they > time out (after 320s) I see the very same TimeoutException and > IndexError (probably something messes up with TimeoutException up the > stack). Yes, by default testr runs with the number of workers matching the # of cpus on the target. I think all our cloud providers are now 8 cpu guests. So unit / functional tests are running 8 way. That's been true for quite a while IIRC. -Sean -- Sean Dague http://dague.net
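Sean's point about the default worker count can be sketched simply. testr's real concurrency detection has more to it, but the core heuristic is "one worker per CPU", which is why an 8-vCPU gate VM runs 8 workers while a 4-vCPU OSIC node runs 4:

```python
import multiprocessing


def default_concurrency():
    """Rough sketch of testr's default: worker count follows the CPU count."""
    return multiprocessing.cpu_count()


print(default_concurrency())  # 8 on the 8-vCPU gate guests discussed above
```

On an I/O-bound workload like parallel DB migrations, doubling the worker count can make each worker slower rather than the suite faster, which is consistent with Roman's HDD observation.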
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:23 AM, Thomas Goirand wrote: > On 09/14/2016 03:15 PM, Sean Dague wrote: >> I noticed the following issues happening quite often now in the >> opportunistic db tests for nova - >> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 >> >> >> It looks like some race has been introduced where the various db >> connections are not fully isolated from each other like they used to be. >> The testing magic for this is buried pretty deep in oslo.db. >> >> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I >> think we need to strongly consider blocking it and revisiting these >> issues post newton. >> >> -Sean > > Blocking a version of oslo.db just because your 6th sense tells you > it may have introduced some issues isn't the best way to decide. We need > a better investigation than just this, and find the root cause. > > Cheers, > > Thomas Goirand (zigo) Sure, but investigations all start with a hypothesis, which is then proved / disproved, and we move on. Given that we're in RC week, and we've had library releases after the freeze point, it's reasonable to see if rolling back to previous versions mitigates things. Root cause analysis may take more time, and need different folks, than the RC period allows. So given imperfect information we need to consider what the safest path forward is. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Hmm, looks like we now run more testr workers in parallel (8 instead of 4):

http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz
http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html

On my machine running Nova migration tests against MySQL is much slower with 8 workers than with 4 due to disk IO (it's an HDD). When they time out (after 320s) I see the very same TimeoutException and IndexError (probably something messes up with TimeoutException up the stack).

On Wed, Sep 14, 2016 at 6:44 PM, Mike Bayer wrote:
> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>> I noticed the following issues happening quite often now in the
>>> opportunistic db tests for nova -
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>
>>> It looks like some race has been introduced where the various db
>>> connections are not fully isolated from each other like they used to be.
>>> The testing magic for this is buried pretty deep in oslo.db.
>>
>> That error message occurs when a connection that is intended against a
>> SELECT statement fails to provide a cursor.description attribute. It is
>> typically a driver-level bug in the MySQL world and corresponds to
>> mis-handled failure modes from the MySQL connection.
>>
>> By "various DB connections are not fully isolated from each other" are
>> you suggesting that a single in-Python connection object itself is being
>> shared among multiple greenlets? I'm not aware of a change in oslo.db
>> that would have a relationship to such an effect.
>
> So, I think by "fully isolated from each other" what you really mean is
> "operations upon a connection are not fully isolated from the subsequent use
> of that connection", since that's what I see in the logs. A connection is
> attempting to be used during teardown to drop tables, however it's in this
> essentially broken state from a PyMySQL perspective, which would indicate
> something has gone wrong with this (pooled) connection in the preceding test
> that could not be detected or reverted once the connection was returned to
> the pool.
>
> From Roman's observation, it looks like a likely source of this corruption
> is a timeout that is interrupting the state of the PyMySQL connection. In
> the preceding stack trace, PyMySQL is encountering a raise as it attempts to
> call "self._sock.recv_into(b)", and it seems like some combination of
> eventlet's response to signals and the fixtures.Timeout() fixture is the
> cause of this interruption. As an additional wart, something else is
> getting involved and turning it into an IndexError; I'm not sure what that
> part is yet, though I can imagine that might be SQLAlchemy mis-interpreting
> what it expects to be a PyMySQL exception class, since we normally look
> inside of exception.args[0] to get the MySQL error code. With a blank
> exception like fixtures.TimeoutException, .args is the empty tuple.
>
> The PyMySQL connection is now in an invalid state and unable to perform a
> SELECT statement correctly, but the connection is not invalidated and is
> instead returned to the connection pool in a broken state. So the
> subsequent teardown, if it uses this same connection (which is likely),
> fails because the connection has been interrupted in the middle of its work
> and not given the chance to clean up.
>
> It seems like the use of the fixtures.Timeout() fixture here is not organized
> to work with a database operation in progress, especially an
> eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due
> to a signal handler occurs, the entire connection pool should be disposed
> (quickest way, engine.dispose()), or at the very least (and much more
> targeted), the connection that's involved should be invalidated from the
> pool, e.g. connection.invalidate().
>
> The change to the environment here would be that this timeout is happening
> at all - the reason for that is not yet known. If oslo.db's version were
> involved in this error, I would guess that it would be related to this
> timeout condition being caused, and not anything to do with the connection
> provisioning.
>
>>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>>> think we need to strongly consider blocking it and revisiting these
>>> issues post newton.
>>>
>>> -Sean
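Mike's two remedies can be sketched like this, using an in-memory SQLite engine as a stand-in for the pooled PyMySQL engine (a minimal sketch of the SQLAlchemy calls he names, not oslo.db's actual teardown code):

```python
from sqlalchemy import create_engine, text

# Stand-in for the test suite's pooled MySQL engine.
engine = create_engine('sqlite://')

conn = engine.connect()
try:
    result = conn.execute(text('SELECT 1')).scalar()
except Exception:
    # Targeted remedy: flag just this DBAPI connection as broken so
    # the pool opens a fresh one on the next checkout instead of
    # handing the corrupted connection back out.
    conn.invalidate()
    raise
finally:
    conn.close()

# Heavy-handed remedy after an asynchronous interruption such as a
# signal-driven timeout: discard every pooled connection at once.
engine.dispose()

print(result)  # 1
```

The distinction matters here because the interrupted connection otherwise goes back into the pool looking healthy, and the very next checkout (the teardown's DROP TABLE work) inherits its broken protocol state.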
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:08 AM, Mike Bayer wrote:
> On 09/14/2016 09:15 AM, Sean Dague wrote:
>> I noticed the following issues happening quite often now in the
>> opportunistic db tests for nova -
>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>
>> It looks like some race has been introduced where the various db
>> connections are not fully isolated from each other like they used to be.
>> The testing magic for this is buried pretty deep in oslo.db.
>
> That error message occurs when a connection that is intended against a
> SELECT statement fails to provide a cursor.description attribute. It is
> typically a driver-level bug in the MySQL world and corresponds to
> mis-handled failure modes from the MySQL connection.
>
> By "various DB connections are not fully isolated from each other" are
> you suggesting that a single in-Python connection object itself is being
> shared among multiple greenlets? I'm not aware of a change in oslo.db
> that would have a relationship to such an effect.

So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool.

From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError; I'm not sure what that part is yet, though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple.

The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up.

It seems like the use of the fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate().

The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning.

>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>> think we need to strongly consider blocking it and revisiting these
>> issues post newton.
>>
>> -Sean
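Mike's IndexError hypothesis is easy to reproduce in pure Python: code that reads the MySQL error code out of `exception.args[0]` blows up when the interrupting exception was raised with no arguments. `TimeoutException` and `mysql_error_code` below are stand-ins for fixtures.TimeoutException and SQLAlchemy's error-code lookup, not the real implementations:

```python
class TimeoutException(Exception):
    """Stand-in for fixtures.TimeoutException, raised with no args."""

def mysql_error_code(exc):
    # Hypothetical helper mimicking the args[0] lookup described
    # above: valid for a PyMySQL error like (2013, 'Lost connection'),
    # but not for an argument-less exception.
    return exc.args[0]

exc = TimeoutException()
print(exc.args)  # () - the empty tuple, as Mike notes

try:
    mysql_error_code(exc)
    caught = None
except IndexError:
    # This matches the "IndexError: tuple index out of range" seen in
    # the gate logs in place of the original timeout.
    caught = 'IndexError'

print(caught)  # IndexError
```

This also explains why the failure is reported as an IndexError rather than a timeout: the original TimeoutException is swallowed inside the error-inspection path.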
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 03:15 PM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.
>
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean

Blocking a version of oslo.db just because your 6th sense tells you it may have introduced some issues isn't the best way to decide. We need a better investigation than just this, and to find the root cause.

Cheers,

Thomas Goirand (zigo)
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 09:15 AM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.

That error message occurs when a connection that is intended against a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection.

By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect.

> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean
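The cursor.description attribute Mike refers to is part of the Python DB-API: after a successful SELECT, the driver populates it with one 7-tuple per result column. A short illustration using stdlib sqlite3 in place of PyMySQL:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.execute('SELECT 1 AS answer')

# description holds one 7-tuple per result column; sqlite3 populates
# only the column name, padding the rest with None.
col_name = cur.description[0][0]
print(col_name)  # answer

conn.close()
```

A driver that returns from a SELECT without populating description, as with the mis-handled MySQL failure modes described above, is the condition SQLAlchemy surfaces as ResourceClosedError.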
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean,

I'll take a closer look, but the test execution times and errors look suspicious:

ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsPostgreSQL.test_walk_versions 60.002
2016-09-14 14:21:38.756421 |   File "/home/jenkins/workspace/gate-ironic-python27-db-ubuntu-xenial/.tox/py27/local/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
2016-09-14 14:21:38.756435 |     return self.poll.poll(seconds)
2016-09-14 14:21:38.756481 |   File "/home/jenkins/workspace/gate-ironic-python27-db-ubuntu-xenial/.tox/py27/local/lib/python2.7/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2016-09-14 14:21:38.756494 |     raise TimeoutException()
2016-09-14 14:21:38.756508 | IndexError: tuple index out of range

As if the test case was forcibly stopped after the timeout.

Thanks,
Roman

On Wed, Sep 14, 2016 at 4:15 PM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.
>
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
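The forced stop in the traceback above comes from a signal-based timeout fixture. The timeout pattern discussed elsewhere in this thread, a base value from the OS_TEST_TIMEOUT environment variable multiplied by a per-test scaling factor, can be sketched like this (`scaled_timeout` is an illustrative helper, not the projects' actual test-base code):

```python
import os

def scaled_timeout(scaling_factor=1):
    """Sketch of the timeout computation pattern: a base number of
    seconds from OS_TEST_TIMEOUT, multiplied by a per-test scaling
    factor (migration tests use a larger factor than ordinary
    tests). The result is what gets handed to a signal-based
    timeout fixture such as fixtures.Timeout()."""
    try:
        base = int(os.environ.get('OS_TEST_TIMEOUT', 0))
    except ValueError:
        base = 0
    return base * scaling_factor

# e.g. a 160s base with the migration tests' factor of 2 gives the
# 320s timeout Roman observed; a factor of 4 would give 640s.
os.environ['OS_TEST_TIMEOUT'] = '160'
print(scaled_timeout(2))  # 320
print(scaled_timeout(4))  # 640
```

When the deadline fires, the fixture's signal handler raises TimeoutException from whatever frame happens to be executing, which is exactly how it can land inside a PyMySQL socket read.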
[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
I noticed the following issues happening quite often now in the opportunistic db tests for nova -

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22

It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db.

Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton.

-Sean

--
Sean Dague
http://dague.net