Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 21/09/16 11:41, Joshua Harlow wrote: tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. Update: Mike was kind enough to run this one to ground and it looks like we could get a fix in sqlalchemy 1.1.0: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on I've seen something similar at https://review.openstack.org/#/c/316935/ __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean Dague wrote: > On 09/15/2016 09:20 AM, Roman Podoliaka wrote: >> Sean, >> >> So currently we have a default timeout of 160s in Nova. And >> specifically for migration tests we set a scaling factor of 2. Let's >> maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) >> and make a couple of "rechecks" to see if it helps or not. > > Given the fail rate, let's just go safe and double it to 4. If it's > really a timeout issue that should move the bell curve to mostly cover > it. Timeouts are really just supposed to be a backstop to ensure the > tests don't deadlock forever. > > The incidence rate is at that odd race rate that we'll probably have to > merge and just see if it goes away. Neutron also started seeing timeouts in DB migration tests today. We're going to try a similar timeout scaling factor for these tests.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
FWIW, there have been no new failures in Nova jobs since then. I'm confused as well why these tests would sporadically take much longer to execute. Perhaps we could install something like atop on our nodes to answer that question. On Wed, Sep 21, 2016 at 5:46 PM, Ihar Hrachyshka wrote: > I just hit that TimeoutException error in neutron functional tests: > > http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz > > It’s a bit weird that we hit that 180 sec timeout because in good runs, the > test takes ~5 secs. > > Do we have a remedy against that kind of failure? I saw nova bumped the > timeout length for the tests. Is it the approach we should apply across the > board for other projects? > > Ihar > > > Zane Bitter wrote: > >> On 14/09/16 11:44, Mike Bayer wrote: >>> >>> On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: > > I noticed the following issues happening quite often now in the > opportunistic db tests for nova - > > http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 > > > > > It looks like some race has been introduced where the various db > connections are not fully isolated from each other like they used to > be. > The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect.
>>> >>> >>> So, I think by "fully isolated from each other" what you really mean is >>> "operations upon a connection are not fully isolated from the subsequent >>> use of that connection", since that's what I see in the logs. A >>> connection is attempting to be used during teardown to drop tables, >>> however it's in this essentially broken state from a PyMySQL >>> perspective, which would indicate something has gone wrong with this >>> (pooled) connection in the preceding test that could not be detected or >>> reverted once the connection was returned to the pool. >>> >>> From Roman's observation, it looks like a likely source of this >>> corruption is a timeout that is interrupting the state of the PyMySQL >>> connection. In the preceding stack trace, PyMySQL is encountering a >>> raise as it attempts to call "self._sock.recv_into(b)", and it seems >>> like some combination of eventlet's response to signals and the >>> fixtures.Timeout() fixture is the cause of this interruption. As an >>> additional wart, something else is getting involved and turning it into >>> an IndexError, I'm not sure what that part is yet though I can imagine >>> that might be SQLAlchemy mis-interpreting what it expects to be a >>> PyMySQL exception class, since we normally look inside of >>> exception.args[0] to get the MySQL error code. With a blank exception >>> like fixtures.TimeoutException, .args is the empty tuple. >>> >>> The PyMySQL connection is now in an invalid state and unable to perform >>> a SELECT statement correctly, but the connection is not invalidated and >>> is instead returned to the connection pool in a broken state. So the >>> subsequent teardown, if it uses this same connection (which is likely), >>> fails because the connection has been interrupted in the middle of its >>> work and not given the chance to clean up. 
>>> >>> Seems like the use of fixtures.Timeout() fixture here is not organized >>> to work with a database operation in progress, especially an >>> eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout >>> due to a signal handler occurs, the entire connection pool should be >>> disposed (quickest way, engine.dispose()), or at the very least (and >>> much more targeted), the connection that's involved should be >>> invalidated from the pool, e.g. connection.invalidate(). >>> >>> The change to the environment here would be that this timeout is >>> happening at all - the reason for that is not yet known. If oslo.db's >>> version were involved in this error, I would guess that it would be >>> related to this timeout condition being caused, and not anything to do >>> with the connection provisioning. >>> > Oslo.db 4.13.3 did hit the scene about the time this showed up. So I > think we need to strongly consider blocking it and revisiting these > issues post newton. >> >> >> We've been seeing similar errors in Heat since at least Liberty >> (https://bugs.launchpad.net/heat/+b
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike Bayer wrote: On 09/21/2016 11:41 AM, Joshua Harlow wrote: I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github.com/sileht/cotyledon for service oriented tasks). Plus Keystone has gotten off of it. I actually like eventlet and gevent quite a lot. I am using it in a new middleware component that will be involved with database connection pooling. However, I *don't* use the global monkeypatching aspect. That's where this all goes very wrong. Things that are designed for synchronous operations, like database-oriented business methods as well as the work of the database driver itself, should run within threads. You can in fact use eventlet/gevent's APIs explicitly and you can even combine it with traditional threading explicitly. I'm actually using a stdlib Queue (carefully) to send data between greenlets and threads. Madness! Agreed (thanks for making that clear), it's really the monkeying that kills things, not eventlet/gevent directly, fair point. -Josh
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/21/2016 11:41 AM, Joshua Harlow wrote: I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github.com/sileht/cotyledon for service oriented tasks). Plus Keystone has gotten off of it. I actually like eventlet and gevent quite a lot. I am using it in a new middleware component that will be involved with database connection pooling. However, I *don't* use the global monkeypatching aspect. That's where this all goes very wrong. Things that are designed for synchronous operations, like database-oriented business methods as well as the work of the database driver itself, should run within threads. You can in fact use eventlet/gevent's APIs explicitly and you can even combine it with traditional threading explicitly. I'm actually using a stdlib Queue (carefully) to send data between greenlets and threads. Madness! -Josh
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Zane Bitter wrote: On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection.
In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton.
We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above. If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. I've seen something similar at https://review.openstack.org/#/c/316935/ Maybe it's time we asked again why we're still using eventlet and whether we need to anymore. What functionality of it are people actually taking advantage of? If it's supporting libraries like oslo.service then it'd probably be useful to talk to the ceilometer folks who replaced oslo.service with something else (another oslo library for periodics and https://github
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
I just hit that TimeoutException error in neutron functional tests: http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz It’s a bit weird that we hit that 180 sec timeout because in good runs, the test takes ~5 secs. Do we have a remedy against that kind of failure? I saw nova bumped the timeout length for the tests. Is it the approach we should apply across the board for other projects? Ihar Zane Bitter wrote: On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs.
A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). 
The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above. If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 14/09/16 11:44, Mike Bayer wrote: On 09/14/2016 11:08 AM, Mike Bayer wrote: On 09/14/2016 09:15 AM, Sean Dague wrote: I noticed the following issues happening quite often now in the opportunistic db tests for nova - http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db. that error message occurs when a connection that is executing a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection. By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect. So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool. From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption.
As an additional wart, something else is getting involved and turning it into an IndexError, I'm not sure what that part is yet though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple. The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up. Seems like the use of fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate(). The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning. Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. We've been seeing similar errors in Heat since at least Liberty (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking around yesterday and basically confirmed his theory above.
If you look at the PyMySQL code, it believes that only an IOError can occur while writing to a socket, so it has no handling for any other type of exception, thus it can't deal with signal handlers raising exceptions or other exceptions being thrown into the greenthread by eventlet. It sounds like sqlalchemy also fails to catch at least some of these exceptions and invalidate the connection. tl;dr this appears to have been around forever (at least since we switched to using a pure-Python MySQL client) and is almost certainly completely unrelated to any particular release of oslo.db. cheers, Zane.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:57 PM, Mike Bayer wrote: > > > On 09/14/2016 11:05 PM, Mike Bayer wrote: >> >> Are *these* errors also new as of version 4.13.3 of oslo.db ? Because >> here I have more suspicion of one particular oslo.db change here. > > The version in question that has the changes to provisioning and > anything really to do with this area is 4.12.0. So if you didn't see > any problem w/ 4.12 then almost definitely oslo.db is not the cause - > the code changes subsequent to 4.12 have no relationship to any system > used by the opportunistic test base. I would hope at least that 4.12 > is the version where we see things changing because there were small > changes to the provisioning code. > > But at the same time, I'm combing through the quite small adjustments to > the provisioning code as of 4.12.0 and I'm not seeing what could > introduce this issue. That said, we really should never see the kind > of error we see with the "DROP DATABASE" failing because it remains in > use, however this can be a side effect of the test itself having > problems with the state of a different connection, not being closed and > locks remain held. > > That is, there are poor failure modes for sure here, I just can't see > anything in 4.13 or even 4.12 that would suddenly introduce them. > > By all means if these failures disappear when we go to 4.11 vs. 4.12, > that would be where we need to go and to look for next cycle. From > my POV if the failures do disappear then that would be the best evidence > that the oslo.db version is the factor. In looking through the change history, I agree that nothing in the 4.13.x stream is plausibly related to this failure. The last time something seems related is back in the 4.11/4.12 space. That would be a much riskier change than just rolling back to 4.13.0 at this point in the release, so I would not recommend it. So I think Roman's approach of trying to just adjust the timeout seems like the lowest risk way to move forward.
Hopefully that mitigates things. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/15/2016 09:20 AM, Roman Podoliaka wrote: > Sean, > > So currently we have a default timeout of 160s in Nova. And > specifically for migration tests we set a scaling factor of 2. Let's > maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) > and make a couple of "rechecks" to see if it helps or not. Given the fail rate, let's just go safe and double it to 4. If it's really a timeout issue that should move the bell curve to mostly cover it. Timeouts are really just supposed to be a backstop to ensure the tests don't deadlock forever. The incidence rate is at that odd race rate that we'll probably have to merge and just see if it goes away. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean, So currently we have a default timeout of 160s in Nova. And specifically for migration tests we set a scaling factor of 2. Let's maybe give 2.5 or 3 a try ( https://review.openstack.org/#/c/370805/ ) and make a couple of "rechecks" to see if it helps or not. In Ocata we could revisit the migrations collapse again to reduce the number of scripts. On the testing side, we could probably "cheat" a bit to trade data safety for performance. E.g. we could set "fsync = off" for PostgreSQL (https://www.postgresql.org/docs/9.2/static/runtime-config-wal.html). Similar settings must be available for MySQL as well. Thanks, Roman On Thu, Sep 15, 2016 at 3:07 PM, Sean Dague wrote: > On 09/15/2016 05:52 AM, Roman Podoliaka wrote: >> Mike, >> >> On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: >> >>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? >>> If so, was it also accompanied by the same "resource closed" condition or >>> did this second part of the condition only appear at 4.13.3? >>> * Did we see a similar "timeout" / "resource closed" combination prior to >>> 4.13.3, just with less frequency? >> >> I believe we did - >> https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we >> used mysql-python back then, so the error was slightly different. >> >>> * What is the magnitude of the "timeout" this fixture is using, is it on the >>> order of seconds, minutes, hours? >> >> It's set in seconds per project in .testr.conf, e.g.: >> >> https://github.com/openstack/nova/blob/master/.testr.conf >> https://github.com/openstack/ironic/blob/master/.testr.conf >> >> In Nova we also have a 'timeout scaling factor' specifically set for >> migration tests: >> >> https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 >> >>> * If many minutes or hours, can the test suite be observed to be stuck on >>> this test? 
Has someone tried to run a "SHOW PROCESSLIST" while this >>> condition is occurring to see what SQL is pausing? >> >> We could try to do that in the gate, but I don't expect to see >> anything interesting: IMO, we'd see regular queries that should have >> been executed fast, but actually took much longer time (presumably due >> to heavy disk IO caused by multiple workers running similar tests in >> parallel). >> >>> * Is this failure only present within the Nova test suite or has it been >>> observed in the test suites of other projects? >> >> According to >> >> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 >> >> it's mostly Nova, but this has also been observed in Ironic, Manila >> and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT value set to 60 >> seconds. >> >>> * Is this failure present only on the "database migration" test suite or is >>> it present in other opportunistic tests, for Nova and others? >> >> Based on the console logs I've checked only migration tests failed, >> but that's probably due to the fact that they are usually the slowest >> ones (again, presumably due to heavy disk IO). >> >>> * Have there been new database migrations added to Nova which are being >>> exercised here and may be involved? >> >> Looks like there were no changes recently: >> >> https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) >> >>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If >>> downgrading it is feasible, that would at least be a way to eliminate it as >>> a possibility if these same failures continue to occur, or a way to confirm >>> its involvement if they disappear. But if downgrading is disruptive then >>> there are other things to look at in order to have a better chance at >>> predicting its involvement. 
>> >> I don't think we need to block oslo.db 4.13.3, unless we clearly see >> it's this version that causes these failures. >> >> I gave version 4.11 (before changes to provisioning) a try on my local >> machine and see the very same errors when concurrency level is high ( >> http://paste.openstack.org/show/577350/ ), so I don't think the latest >> oslo.db release has anything to do with the increase of the number of >> failures on CI. >> >> My current understanding is that the load on gate nodes somehow >> increased (either we run more testr workers in parallel now or >> apply/test more migrations or just more run VMs per host or the gate >> is simply busy at this point of the release cycle), so that we started >> to see these timeouts more often. > > The migration count is definitely going to grow over time, as is the > nature of the beast. Nova hasn't had a migration collapse in quite a > while. The higher patch volume in Nova and larger number of db > migrations could definitely account for Nova being higher here. > > Is there a better timeout value that you think will make these timeouts happen less often?
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike, I think the exact error (InterfaceError vs TimeoutException) varies depending on what code is being executed at the very moment when a process receives SIGALRM. I tried to run the tests against PostgreSQL passing very small timeout values (OS_TEST_TIMEOUT=5 python -m testtools.run nova.tests.unit.db.test_migrations.TestNovaMigrationsPostgreSQL.test_walk_versions) and saw both InterfaceError and TimeoutException: http://paste.openstack.org/show/577410/ http://paste.openstack.org/show/577411/ strace'ing shows that a connection to PostgreSQL is closed right after SIGALRM is handled: http://paste.openstack.org/show/577425/ I tried to reproduce that manually by means of gdb and set a breakpoint on close(): http://paste.openstack.org/show/577422/ ^ looks like psycopg2 closes the connection automatically if a query was interrupted by SIGALRM. The corresponding Python-level backtrace is: http://paste.openstack.org/show/577423/ ^ i.e. connection closing happens in the middle of a cursor.execute() call. In the end I see a similar InterfaceError: http://paste.openstack.org/show/577424/ That being said, this does not explain the "DB is in use" part. Thanks, Roman On Thu, Sep 15, 2016 at 6:05 AM, Mike Bayer wrote: > There's a different set of logs attached to the launchpad issue, that's not > what I was looking at before. > > These logs are at > http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031 > . In these logs, I see something *very* different, not just the MySQL > tests but the Postgresql tests are definitely hitting conflicts against the > randomly generated database. > > This set of traces, e.g.: > > sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database > "dbzrtmgbxv" is being accessed by other users > 2016-09-13 14:54:18.093723 | DETAIL: There is 1 other session using the > database. 
> 2016-09-13 14:54:18.093736 | [SQL: 'DROP DATABASE dbzrtmgbxv'] > > and > > File > "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", > line 668, in _rollback_impl > 2016-09-13 14:54:18.095470 | > self.engine.dialect.do_rollback(self.connection) > 2016-09-13 14:54:18.095513 | File > "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", > line 420, in do_rollback > 2016-09-13 14:54:18.095526 | dbapi_connection.rollback() > 2016-09-13 14:54:18.095548 | sqlalchemy.exc.InterfaceError: > (psycopg2.InterfaceError) connection already closed > > are a very different animal. For one thing, they're on Postgresql where the > driver and DB act extremely rationally. For another, there's no timeout > exception here, and not all the conflicts are within the teardown. > > Are *these* errors also new as of version 4.13.3 of oslo.db? Because here > I have more suspicion of one particular oslo.db change here. > > fits much more with your initial description > > > On 09/14/2016 10:48 PM, Mike Bayer wrote: >> >> On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. >>> >>> So that means reverting all stable/newton changes, previous 4.13.x >>> have been already blocked https://review.openstack.org/365565 >>> How would we proceed, do we need to revert all backports on stable/newton? >> >> >> In case my previous email wasn't clear, I don't *yet* see evidence that >> the recent 4.13.3 release of oslo.db is the cause of this problem. >> However, that is only based upon what I see in this stack trace, which >> is that the test framework is acting predictably (though erroneously) >> based on the timeout condition which is occurring. 
I don't (yet) see a >> reason that the same effect would not occur prior to 4.13.3 in the face >> of a signal pre-empting the work of the pymysql driver mid-stream. >> However, this assumes that the timeout condition itself is not a product >> of the current oslo.db version and that is not known yet. >> >> There's a list of questions that should all be answerable which could >> assist in giving some hints towards this. >> >> There's two parts to the error in the logs. There's the "timeout" >> condition, then there is the bad reaction of the PyMySQL driver and the >> test framework as a result of the operation being interrupted within the >> test. >> >> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition >> occur? If so, was it also accompanied by the same "resource closed" >> condition or did this second part of the condition only appear at 4.13.3? >> >> * Did we see a similar "timeout" / "resource closed" combination prior >> to 4.13.3, just with less frequency? >> >> * Was the version of PyMySQL also recently upgraded (I'm assuming this >> environment has been on PyMySQL for a long time at this point)? What was >> the version change if so? Especially if we previously saw "timeout" but >> no "resource closed", perhaps an older version of PyMySQL didn't react >> in this way?
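The SIGALRM mechanics Roman describes (an alarm pre-empting the driver mid-call, as fixtures.Timeout() does) can be sketched roughly as follows. This is an illustration, not oslo's or fixtures' actual code; the exception class and helper names are invented, and time.sleep() stands in for a long-running cursor.execute():

```python
import signal
import time


class TimeoutException(Exception):
    """Stand-in for the test framework's timeout exception."""


def _handler(signum, frame):
    # Fires asynchronously, interrupting whatever call is in flight --
    # this is what leaves a DB driver in a half-finished state.
    raise TimeoutException("test exceeded its time budget")


def run_with_timeout(func, seconds):
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)              # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)


try:
    run_with_timeout(lambda: time.sleep(5), seconds=1)
    interrupted = False
except TimeoutException:
    interrupted = True

print(interrupted)  # → True: the alarm cut the 5s "query" short
```

If the interrupted call is a DBAPI query rather than a sleep, the driver never sees the rest of the server's response, which is consistent with the InterfaceError/ResourceClosedError symptoms discussed in this thread.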
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/15/2016 05:52 AM, Roman Podoliaka wrote: > Mike, > > On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: > >> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? >> If so, was it also accompanied by the same "resource closed" condition or >> did this second part of the condition only appear at 4.13.3? >> * Did we see a similar "timeout" / "resource closed" combination prior to >> 4.13.3, just with less frequency? > > I believe we did - > https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we > used mysql-python back then, so the error was slightly different. > >> * What is the magnitude of the "timeout" this fixture is using, is it on the >> order of seconds, minutes, hours? > > It's set in seconds per project in .testr.conf, e.g.: > > https://github.com/openstack/nova/blob/master/.testr.conf > https://github.com/openstack/ironic/blob/master/.testr.conf > > In Nova we also have a 'timeout scaling factor' specifically set for > migration tests: > > https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 > >> * If many minutes or hours, can the test suite be observed to be stuck on >> this test? Has someone tried to run a "SHOW PROCESSLIST" while this >> condition is occurring to see what SQL is pausing? > > We could try to do that in the gate, but I don't expect to see > anything interesting: IMO, we'd see regular queries that should have > been executed fast, but actually took much longer time (presumably due > to heavy disk IO caused by multiple workers running similar tests in > parallel). > >> * Is this failure only present within the Nova test suite or has it been >> observed in the test suites of other projects? > > According to > > http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 > > it's mostly Nova, but this has also been observed in Ironic, Manila > and Ceilometer. 
Ironic and Manila have OS_TEST_TIMEOUT value set to 60 > seconds. > >> * Is this failure present only on the "database migration" test suite or is >> it present in other opportunistic tests, for Nova and others? > > Based on the console logs I've checked only migration tests failed, > but that's probably due to the fact that they are usually the slowest > ones (again, presumably due to heavy disk IO). > >> * Have there been new database migrations added to Nova which are being >> exercised here and may be involved? > > Looks like there were no changes recently: > > https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) > >> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If >> downgrading it is feasible, that would at least be a way to eliminate it as >> a possibility if these same failures continue to occur, or a way to confirm >> its involvement if they disappear. But if downgrading is disruptive then >> there are other things to look at in order to have a better chance at >> predicting its involvement. > > I don't think we need to block oslo.db 4.13.3, unless we clearly see > it's this version that causes these failures. > > I gave version 4.11 (before changes to provisioning) a try on my local > machine and see the very same errors when concurrency level is high ( > http://paste.openstack.org/show/577350/ ), so I don't think the latest > oslo.db release has anything to do with the increase of the number of > failures on CI. > > My current understanding is that the load on gate nodes somehow > increased (either we run more testr workers in parallel now or > apply/test more migrations or just more run VMs per host or the gate > is simply busy at this point of the release cycle), so that we started > to see these timeouts more often. 
The migration count is definitely going to grow over time, as is the nature of the beast. Nova hasn't had a migration collapse in quite a while. The higher patch volume in Nova and larger number of db migrations could definitely account for Nova being higher here. Is there a better timeout value that you think will make these timeouts happen less often? -Sean -- Sean Dague http://dague.net __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Mike, On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer wrote: > * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? > If so, was it also accompanied by the same "resource closed" condition or > did this second part of the condition only appear at 4.13.3? > * Did we see a similar "timeout" / "resource closed" combination prior to > 4.13.3, just with less frequency? I believe we did - https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we used mysql-python back then, so the error was slightly different. > * What is the magnitude of the "timeout" this fixture is using, is it on the > order of seconds, minutes, hours? It's set in seconds per project in .testr.conf, e.g.: https://github.com/openstack/nova/blob/master/.testr.conf https://github.com/openstack/ironic/blob/master/.testr.conf In Nova we also have a 'timeout scaling factor' specifically set for migration tests: https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67 > * If many minutes or hours, can the test suite be observed to be stuck on > this test? Has someone tried to run a "SHOW PROCESSLIST" while this > condition is occurring to see what SQL is pausing? We could try to do that in the gate, but I don't expect to see anything interesting: IMO, we'd see regular queries that should have been executed fast, but actually took much longer time (presumably due to heavy disk IO caused by multiple workers running similar tests in parallel). > * Is this failure only present within the Nova test suite or has it been > observed in the test suites of other projects? According to http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 it's mostly Nova, but this has also been observed in Ironic, Manila and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT value set to 60 seconds. 
> * Is this failure present only on the "database migration" test suite or is > it present in other opportunistic tests, for Nova and others? Based on the console logs I've checked only migration tests failed, but that's probably due to the fact that they are usually the slowest ones (again, presumably due to heavy disk IO). > * Have there been new database migrations added to Nova which are being > exercised here and may be involved? Looks like there were no changes recently: https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22) > I'm not sure how much of an inconvenience it is to downgrade oslo.db. If > downgrading it is feasible, that would at least be a way to eliminate it as > a possibility if these same failures continue to occur, or a way to confirm > its involvement if they disappear. But if downgrading is disruptive then > there are other things to look at in order to have a better chance at > predicting its involvement. I don't think we need to block oslo.db 4.13.3, unless we clearly see it's this version that causes these failures. I gave version 4.11 (before changes to provisioning) a try on my local machine and see the very same errors when concurrency level is high ( http://paste.openstack.org/show/577350/ ), so I don't think the latest oslo.db release has anything to do with the increase of the number of failures on CI. My current understanding is that the load on gate nodes somehow increased (either we run more testr workers in parallel now or apply/test more migrations or just more run VMs per host or the gate is simply busy at this point of the release cycle), so that we started to see these timeouts more often. 
Thanks, Roman
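The timeout arithmetic Roman points at can be made concrete. This is a hypothetical mirror of the pattern in nova's test_migrations.py (the constant and function names here are illustrative, not Nova's actual identifiers): the base budget comes from OS_TEST_TIMEOUT and migration tests multiply it by a scaling factor.

```python
import os

# Illustrative only: Nova sets a per-test-class scaling factor on top of
# the suite-wide OS_TEST_TIMEOUT from .testr.conf.
TIMEOUT_SCALING_FACTOR = 2


def effective_timeout(default=160):
    base = int(os.environ.get('OS_TEST_TIMEOUT', default))
    return base * TIMEOUT_SCALING_FACTOR


os.environ['OS_TEST_TIMEOUT'] = '160'
print(effective_timeout())  # 160 * 2 = 320, matching the 320s timeouts in the gate logs
```

Raising the factor (to 3 or 4, as proposed in this thread) shifts the budget without touching the suite-wide default for non-migration tests.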
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:05 PM, Mike Bayer wrote: Are *these* errors also new as of version 4.13.3 of oslo.db? Because here I have more suspicion of one particular oslo.db change here. The version in question that has the changes to provisioning and anything really to do with this area is 4.12.0. So if you didn't see any problem w/ 4.12 then almost definitely oslo.db is not the cause - the code changes subsequent to 4.12 have no relationship to any system used by the opportunistic test base. I would hope at least that 4.12 is the version where we see things changing because there were small changes to the provisioning code. But at the same time, I'm combing through the quite small adjustments to the provisioning code as of 4.12.0 and I'm not seeing what could introduce this issue. That said, we really should never see the kind of error we see with the "DROP DATABASE" failing because it remains in use, however this can be a side effect of the test itself having problems with the state of a different connection, not being closed and locks remain held. That is, there's poor failure modes for sure here, I just can't see anything in 4.13 or even 4.12 that would suddenly introduce them. By all means if these failures disappear when we go to 4.11 vs. 4.12, that would be where we need to go and to look for next cycle. From my POV if the failures do disappear then that would be the best evidence that the oslo.db version is the factor. fits much more with your initial description On 09/14/2016 10:48 PM, Mike Bayer wrote: On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? 
In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? * Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? * Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? 
* What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others? * Have there been new database migrations added to Nova which are being exercised here and may be involved? I'm not sure how much of an inconvenience it is to downgrade oslo.db. If downgrading it is feasible, that would at least be a way to eliminate it as a possibility if these same failures continue to occur, or a way to confirm its involvement if they disappear. But if downgrading is disruptive then there are other things to look at in order to have a better chance at predicting its involvement.
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
There's a different set of logs attached to the launchpad issue, that's not what I was looking at before. These logs are at http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031 . In these logs, I see something *very* different, not just the MySQL tests but the Postgresql tests are definitely hitting conflicts against the randomly generated database. This set of traces, e.g.: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database "dbzrtmgbxv" is being accessed by other users 2016-09-13 14:54:18.093723 | DETAIL: There is 1 other session using the database. 2016-09-13 14:54:18.093736 | [SQL: 'DROP DATABASE dbzrtmgbxv'] and File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 668, in _rollback_impl 2016-09-13 14:54:18.095470 | self.engine.dialect.do_rollback(self.connection) 2016-09-13 14:54:18.095513 | File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 420, in do_rollback 2016-09-13 14:54:18.095526 | dbapi_connection.rollback() 2016-09-13 14:54:18.095548 | sqlalchemy.exc.InterfaceError: (psycopg2.InterfaceError) connection already closed are a very different animal. For one thing, they're on Postgresql where the driver and DB act extremely rationally. For another, there's no timeout exception here, and not all the conflicts are within the teardown. Are *these* errors also new as of version 4.13.3 of oslo.db? Because here I have more suspicion of one particular oslo.db change here. fits much more with your initial description On 09/14/2016 10:48 PM, Mike Bayer wrote: On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. 
So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? * Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? 
* Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? * What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others?
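For context, the randomly named databases in these traces (e.g. "dbzrtmgbxv") come from the opportunistic fixture creating a throwaway schema per test run and dropping it on teardown. A minimal sketch of that pattern follows; this is illustrative, not oslo.db's actual implementation, and the name length is an assumption based on the names in the logs:

```python
import random
import string


def random_db_name(length=10):
    """Generate a throwaway database name like the 'dbzrtmgbxv' seen above."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))


name = random_db_name()
create_sql = 'CREATE DATABASE %s' % name
drop_sql = 'DROP DATABASE %s' % name
print(create_sql)
print(drop_sql)
```

The failure mode in the logs follows directly from this design: PostgreSQL refuses the teardown DROP DATABASE while any other session still holds a connection to that database, so a single leaked connection from the test body turns into the "is being accessed by other users" error.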
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 07:04 PM, Alan Pevec wrote: Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? In case my previous email wasn't clear, I don't *yet* see evidence that the recent 4.13.3 release of oslo.db is the cause of this problem. However, that is only based upon what I see in this stack trace, which is that the test framework is acting predictably (though erroneously) based on the timeout condition which is occurring. I don't (yet) see a reason that the same effect would not occur prior to 4.13.3 in the face of a signal pre-empting the work of the pymysql driver mid-stream. However, this assumes that the timeout condition itself is not a product of the current oslo.db version and that is not known yet. There's a list of questions that should all be answerable which could assist in giving some hints towards this. There's two parts to the error in the logs. There's the "timeout" condition, then there is the bad reaction of the PyMySQL driver and the test framework as a result of the operation being interrupted within the test. * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur? If so, was it also accompanied by the same "resource closed" condition or did this second part of the condition only appear at 4.13.3? * Did we see a similar "timeout" / "resource closed" combination prior to 4.13.3, just with less frequency? * Was the version of PyMySQL also recently upgraded (I'm assuming this environment has been on PyMySQL for a long time at this point)? What was the version change if so? Especially if we previously saw "timeout" but no "resource closed", perhaps an older version of PyMySQL didn't react in this way? 
* Was the version of MySQL running in the CI environment changed? What was the version change if so? Were there any configuration changes such as transaction isolation, memory or process settings? * Have there been changes to the "timeout" logic itself in the test suite, e.g. whatever it is that sets up fixtures.Timeout()? Or some change that alters how teardown of tests occurs when a test is interrupted via this timeout? * What is the magnitude of the "timeout" this fixture is using, is it on the order of seconds, minutes, hours? * If many minutes or hours, can the test suite be observed to be stuck on this test? Has someone tried to run a "SHOW PROCESSLIST" while this condition is occurring to see what SQL is pausing? * Has there been some change such that the migration tests are running against non-empty tables or tables with much more data than was present before? * Is this failure only present within the Nova test suite or has it been observed in the test suites of other projects? * Is this failure present only on the "database migration" test suite or is it present in other opportunistic tests, for Nova and others? * Have there been new database migrations added to Nova which are being exercised here and may be involved? I'm not sure how much of an inconvenience it is to downgrade oslo.db. If downgrading it is feasible, that would at least be a way to eliminate it as a possibility if these same failures continue to occur, or a way to confirm its involvement if they disappear. But if downgrading is disruptive then there are other things to look at in order to have a better chance at predicting its involvement. 
Cheers, Alan
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On Wed, Sep 14, 2016, at 10:01 AM, Sean Dague wrote: > On 09/14/2016 12:06 PM, Roman Podoliaka wrote: > > Hmm, looks like we now run more testr workers in parallel (8 instead of 4): > > > > http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz > > http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html > > > > On my machine running Nova migration tests against MySQL is much > > slower with 8 workers than with 4 due to disk IO (it's HDD). When they > > time out (after 320s) I see the very same TimeoutException and > > IndexError (probably something messes up with TimeoutException up the > > stack). > > Yes, by default testr runs with the number of workers matching the # of > cpus on the target. I think all our cloud providers are now 8 cpu > guests. So unit / functional tests are running 8 way. That's been true > for quite a while IIRC. OSIC is 4vcpu and the others have 8vcpu. Clark
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I > think we need to strongly consider blocking it and revisiting these > issues post newton. So that means reverting all stable/newton changes, previous 4.13.x have been already blocked https://review.openstack.org/365565 How would we proceed, do we need to revert all backports on stable/newton? Cheers, Alan
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 12:06 PM, Roman Podoliaka wrote: > Hmm, looks like we now run more testr workers in parallel (8 instead of 4): > > http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz > http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html > > On my machine running Nova migration tests against MySQL is much > slower with 8 workers than with 4 due to disk IO (it's HDD). When they > time out (after 320s) I see the very same TimeoutException and > IndexError (probably something messes up with TimeoutException up the > stack). Yes, by default testr runs with the number of workers matching the # of cpus on the target. I think all our cloud providers are now 8 cpu guests. So unit / functional tests are running 8 way. That's been true for quite a while IIRC. -Sean -- Sean Dague http://dague.net
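Sean's point about the default worker count can be sketched simply. testr's real concurrency detection has more to it, but the core heuristic is "one worker per CPU", which is why an 8-vCPU gate VM runs 8 workers while a 4-vCPU OSIC node runs 4:

```python
import multiprocessing


def default_concurrency():
    """Rough sketch of testr's default: worker count follows the CPU count."""
    return multiprocessing.cpu_count()


print(default_concurrency())  # 8 on the 8-vCPU gate guests discussed above
```

On an I/O-bound workload like parallel DB migrations, doubling the worker count can make each worker slower rather than the suite faster, which is consistent with Roman's HDD observation.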
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:23 AM, Thomas Goirand wrote: > On 09/14/2016 03:15 PM, Sean Dague wrote: >> I noticed the following issues happening quite often now in the >> opportunistic db tests for nova - >> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22 >> >> >> It looks like some race has been introduced where the various db >> connections are not fully isolated from each other like they used to be. >> The testing magic for this is buried pretty deep in oslo.db. >> >> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I >> think we need to strongly consider blocking it and revisiting these >> issues post newton. >> >> -Sean > > Blocking a version of oslo.db just because your 6th sense tells you > it may have introduced some issues isn't the best way to decide. We need > a better investigation than just this, and find the root cause. > > Cheers, > > Thomas Goirand (zigo) Sure, but investigations all start with a hypothesis, which is then proved / disproved, and we move on. Given that we're in RC week, and we've had library releases after the freeze point, it's reasonable to see if rolling back to previous versions mitigates things. Root cause analysis may take more time, and need different folks, than the RC period allows. So given imperfect information we need to consider what the safest path forward is. -Sean -- Sean Dague http://dague.net
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Hmm, looks like we now run more testr workers in parallel (8 instead of 4):

http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz
http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html

On my machine running Nova migration tests against MySQL is much slower with 8 workers than with 4 due to disk IO (it's an HDD). When they time out (after 320s) I see the very same TimeoutException and IndexError (probably something messes up with TimeoutException up the stack).

On Wed, Sep 14, 2016 at 6:44 PM, Mike Bayer wrote:
> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>> I noticed the following issues happening quite often now in the
>>> opportunistic db tests for nova -
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>
>>> It looks like some race has been introduced where the various db
>>> connections are not fully isolated from each other like they used to be.
>>> The testing magic for this is buried pretty deep in oslo.db.
>>
>> That error message occurs when a connection that is intended against a
>> SELECT statement fails to provide a cursor.description attribute. It is
>> typically a driver-level bug in the MySQL world and corresponds to
>> mis-handled failure modes from the MySQL connection.
>>
>> By "various DB connections are not fully isolated from each other" are
>> you suggesting that a single in-Python connection object itself is being
>> shared among multiple greenlets? I'm not aware of a change in oslo.db
>> that would have a relationship to such an effect.
>
> So, I think by "fully isolated from each other" what you really mean is
> "operations upon a connection are not fully isolated from the subsequent use
> of that connection", since that's what I see in the logs. A connection is
> attempting to be used during teardown to drop tables, however it's in this
> essentially broken state from a PyMySQL perspective, which would indicate
> something has gone wrong with this (pooled) connection in the preceding test
> that could not be detected or reverted once the connection was returned to
> the pool.
>
> From Roman's observation, it looks like a likely source of this corruption
> is a timeout that is interrupting the state of the PyMySQL connection. In
> the preceding stack trace, PyMySQL is encountering a raise as it attempts to
> call "self._sock.recv_into(b)", and it seems like some combination of
> eventlet's response to signals and the fixtures.Timeout() fixture is the
> cause of this interruption. As an additional wart, something else is
> getting involved and turning it into an IndexError; I'm not sure what that
> part is yet, though I can imagine that might be SQLAlchemy mis-interpreting
> what it expects to be a PyMySQL exception class, since we normally look
> inside of exception.args[0] to get the MySQL error code. With a blank
> exception like fixtures.TimeoutException, .args is the empty tuple.
>
> The PyMySQL connection is now in an invalid state and unable to perform a
> SELECT statement correctly, but the connection is not invalidated and is
> instead returned to the connection pool in a broken state. So the
> subsequent teardown, if it uses this same connection (which is likely),
> fails because the connection has been interrupted in the middle of its work
> and not given the chance to clean up.
>
> It seems like the use of the fixtures.Timeout() fixture here is not organized
> to work with a database operation in progress, especially an
> eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due
> to a signal handler occurs, the entire connection pool should be disposed
> (quickest way, engine.dispose()), or at the very least (and much more
> targeted), the connection that's involved should be invalidated from the
> pool, e.g. connection.invalidate().
>
> The change to the environment here would be that this timeout is happening
> at all - the reason for that is not yet known. If oslo.db's version were
> involved in this error, I would guess that it would be related to this
> timeout condition being caused, and not anything to do with the connection
> provisioning.
>
>>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>>> think we need to strongly consider blocking it and revisiting these
>>> issues post newton.
>>>
>>> -Sean
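Mike's two remedies can be sketched like this, using an in-memory SQLite engine as a stand-in for the pooled PyMySQL engine (a minimal sketch of the SQLAlchemy calls he names, not oslo.db's actual teardown code):

```python
from sqlalchemy import create_engine, text

# Stand-in for the test suite's pooled MySQL engine.
engine = create_engine('sqlite://')

conn = engine.connect()
try:
    result = conn.execute(text('SELECT 1')).scalar()
except Exception:
    # Targeted remedy: flag just this DBAPI connection as broken so
    # the pool opens a fresh one on the next checkout instead of
    # handing the corrupted connection back out.
    conn.invalidate()
    raise
finally:
    conn.close()

# Heavy-handed remedy after an asynchronous interruption such as a
# signal-driven timeout: discard every pooled connection at once.
engine.dispose()

print(result)  # 1
```

The distinction matters here because the interrupted connection otherwise goes back into the pool looking healthy, and the very next checkout (the teardown's DROP TABLE work) inherits its broken protocol state.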
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 11:08 AM, Mike Bayer wrote:
> On 09/14/2016 09:15 AM, Sean Dague wrote:
>> I noticed the following issues happening quite often now in the
>> opportunistic db tests for nova -
>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>
>> It looks like some race has been introduced where the various db
>> connections are not fully isolated from each other like they used to be.
>> The testing magic for this is buried pretty deep in oslo.db.
>
> That error message occurs when a connection that is intended against a
> SELECT statement fails to provide a cursor.description attribute. It is
> typically a driver-level bug in the MySQL world and corresponds to
> mis-handled failure modes from the MySQL connection.
>
> By "various DB connections are not fully isolated from each other" are
> you suggesting that a single in-Python connection object itself is being
> shared among multiple greenlets? I'm not aware of a change in oslo.db
> that would have a relationship to such an effect.

So, I think by "fully isolated from each other" what you really mean is "operations upon a connection are not fully isolated from the subsequent use of that connection", since that's what I see in the logs. A connection is attempting to be used during teardown to drop tables, however it's in this essentially broken state from a PyMySQL perspective, which would indicate something has gone wrong with this (pooled) connection in the preceding test that could not be detected or reverted once the connection was returned to the pool.

From Roman's observation, it looks like a likely source of this corruption is a timeout that is interrupting the state of the PyMySQL connection. In the preceding stack trace, PyMySQL is encountering a raise as it attempts to call "self._sock.recv_into(b)", and it seems like some combination of eventlet's response to signals and the fixtures.Timeout() fixture is the cause of this interruption. As an additional wart, something else is getting involved and turning it into an IndexError; I'm not sure what that part is yet, though I can imagine that might be SQLAlchemy mis-interpreting what it expects to be a PyMySQL exception class, since we normally look inside of exception.args[0] to get the MySQL error code. With a blank exception like fixtures.TimeoutException, .args is the empty tuple.

The PyMySQL connection is now in an invalid state and unable to perform a SELECT statement correctly, but the connection is not invalidated and is instead returned to the connection pool in a broken state. So the subsequent teardown, if it uses this same connection (which is likely), fails because the connection has been interrupted in the middle of its work and not given the chance to clean up.

It seems like the use of the fixtures.Timeout() fixture here is not organized to work with a database operation in progress, especially an eventlet-monkeypatched PyMySQL. Ideally, if something like a timeout due to a signal handler occurs, the entire connection pool should be disposed (quickest way, engine.dispose()), or at the very least (and much more targeted), the connection that's involved should be invalidated from the pool, e.g. connection.invalidate().

The change to the environment here would be that this timeout is happening at all - the reason for that is not yet known. If oslo.db's version were involved in this error, I would guess that it would be related to this timeout condition being caused, and not anything to do with the connection provisioning.

>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>> think we need to strongly consider blocking it and revisiting these
>> issues post newton.
>>
>> -Sean
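Mike's IndexError hypothesis is easy to reproduce in pure Python: code that reads the MySQL error code out of `exception.args[0]` blows up when the interrupting exception was raised with no arguments. `TimeoutException` and `mysql_error_code` below are stand-ins for fixtures.TimeoutException and SQLAlchemy's error-code lookup, not the real implementations:

```python
class TimeoutException(Exception):
    """Stand-in for fixtures.TimeoutException, raised with no args."""

def mysql_error_code(exc):
    # Hypothetical helper mimicking the args[0] lookup described
    # above: valid for a PyMySQL error like (2013, 'Lost connection'),
    # but not for an argument-less exception.
    return exc.args[0]

exc = TimeoutException()
print(exc.args)  # () - the empty tuple, as Mike notes

try:
    mysql_error_code(exc)
    caught = None
except IndexError:
    # This matches the "IndexError: tuple index out of range" seen in
    # the gate logs in place of the original timeout.
    caught = 'IndexError'

print(caught)  # IndexError
```

This also explains why the failure is reported as an IndexError rather than a timeout: the original TimeoutException is swallowed inside the error-inspection path.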
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 03:15 PM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.
>
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean

Blocking a version of oslo.db just because your 6th sense tells you it may have introduced some issues isn't the best way to decide. We need a better investigation than just this, and to find the root cause.

Cheers,

Thomas Goirand (zigo)
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
On 09/14/2016 09:15 AM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.

That error message occurs when a connection that is intended against a SELECT statement fails to provide a cursor.description attribute. It is typically a driver-level bug in the MySQL world and corresponds to mis-handled failure modes from the MySQL connection.

By "various DB connections are not fully isolated from each other" are you suggesting that a single in-Python connection object itself is being shared among multiple greenlets? I'm not aware of a change in oslo.db that would have a relationship to such an effect.

> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean
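The cursor.description attribute Mike refers to is part of the Python DB-API: after a successful SELECT, the driver populates it with one 7-tuple per result column. A short illustration using stdlib sqlite3 in place of PyMySQL:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.execute('SELECT 1 AS answer')

# description holds one 7-tuple per result column; sqlite3 populates
# only the column name, padding the rest with None.
col_name = cur.description[0][0]
print(col_name)  # answer

conn.close()
```

A driver that returns from a SELECT without populating description, as with the mis-handled MySQL failure modes described above, is the condition SQLAlchemy surfaces as ResourceClosedError.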
Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
Sean,

I'll take a closer look, but the test execution times and errors look suspicious:

ironic.tests.unit.db.sqlalchemy.test_migrations.TestMigrationsPostgreSQL.test_walk_versions 60.002
2016-09-14 14:21:38.756421 |   File "/home/jenkins/workspace/gate-ironic-python27-db-ubuntu-xenial/.tox/py27/local/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
2016-09-14 14:21:38.756435 |     return self.poll.poll(seconds)
2016-09-14 14:21:38.756481 |   File "/home/jenkins/workspace/gate-ironic-python27-db-ubuntu-xenial/.tox/py27/local/lib/python2.7/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2016-09-14 14:21:38.756494 |     raise TimeoutException()
2016-09-14 14:21:38.756508 | IndexError: tuple index out of range

As if the test case was forcibly stopped after the timeout.

Thanks,
Roman

On Wed, Sep 14, 2016 at 4:15 PM, Sean Dague wrote:
> I noticed the following issues happening quite often now in the
> opportunistic db tests for nova -
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> It looks like some race has been introduced where the various db
> connections are not fully isolated from each other like they used to be.
> The testing magic for this is buried pretty deep in oslo.db.
>
> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
> think we need to strongly consider blocking it and revisiting these
> issues post newton.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
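The forced stop in the traceback above comes from a signal-based timeout fixture. The timeout pattern discussed elsewhere in this thread, a base value from the OS_TEST_TIMEOUT environment variable multiplied by a per-test scaling factor, can be sketched like this (`scaled_timeout` is an illustrative helper, not the projects' actual test-base code):

```python
import os

def scaled_timeout(scaling_factor=1):
    """Sketch of the timeout computation pattern: a base number of
    seconds from OS_TEST_TIMEOUT, multiplied by a per-test scaling
    factor (migration tests use a larger factor than ordinary
    tests). The result is what gets handed to a signal-based
    timeout fixture such as fixtures.Timeout()."""
    try:
        base = int(os.environ.get('OS_TEST_TIMEOUT', 0))
    except ValueError:
        base = 0
    return base * scaling_factor

# e.g. a 160s base with the migration tests' factor of 2 gives the
# 320s timeout Roman observed; a factor of 4 would give 640s.
os.environ['OS_TEST_TIMEOUT'] = '160'
print(scaled_timeout(2))  # 320
print(scaled_timeout(4))  # 640
```

When the deadline fires, the fixture's signal handler raises TimeoutException from whatever frame happens to be executing, which is exactly how it can land inside a PyMySQL socket read.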
[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly
I noticed the following issues happening quite often now in the opportunistic db tests for nova -

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22

It looks like some race has been introduced where the various db connections are not fully isolated from each other like they used to be. The testing magic for this is buried pretty deep in oslo.db.

Oslo.db 4.13.3 did hit the scene about the time this showed up. So I think we need to strongly consider blocking it and revisiting these issues post newton.

-Sean

--
Sean Dague
http://dague.net