Re: [openstack-dev] [nova][neutron][qa] top gate bugs: a plea for help

David Kranz Sun, 12 Jan 2014 13:08:23 -0800

On 01/11/2014 05:06 PM, Russell Bryant wrote:

On 01/11/2014 11:38 AM, Sean Dague wrote:

3) (still testing) https://review.openstack.org/#/c/65805/


Right now when tempest runs in the devstack-gate jobs, it runs with
concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
this maxes out the deployment and results in timeouts (usually network
related).

This patch changes tempest concurrency to 2 instead of 4.  The initial
results are quite promising.  The tests have been passing reliably so
far, but we're going to continue to recheck this for a while longer for
more data.

One very interesting observation on this came from Jim where he said "A
quick glance suggests 1.2x -- 1.4x change in runtime."  If the
deployment were *not* being maxed out, we would expect this change to
result in much closer to a 2x runtime increase.

We could also address this by locally turning up timeouts on operations
that are timing out, which would let those things take the time they need.

Before dropping the concurrency I'd really like to make sure we can
point to specific fails that we think will go away. There was a lot of
speculation around nova-network, however the nova-network timeout errors
only pop up in Elasticsearch on large-ops jobs, not normal tempest
jobs. Definitely making OpenStack more idle will make more tests pass.
The Neutron team has experienced that.

It would be a ton better if we could actually feed back a 503 with a
retry time (which I realize is a ton of work).

Because if we decide we're now always pinned to only 2way, we have to
start doing some major rethinking on our test strategy, as we'll be way
outside the soft 45min time budget we've been trying to operate on. We'd
actually been planning on going up to 8way, but were waiting for some
issues to get fixed before we did that. It would sort of immediately put
a moratorium on new tests. If that's what we need to do, that's what we
need to do, but we should talk it through.

I can try to write up some detailed analysis on a few failures next week
to help justify it, but FWIW, when I was looking at this last week, I felt
like making this change was going to fix a lot more than the
nova-network timeout errors.

If we can already tell this is going to improve reliability, both when
using nova-network and neutron, then I think that should be enough to
justify it.  Taking longer seems acceptable if that comes with a more
acceptable pass rate.

Right now I'd like to see us set concurrency=2 while we work on the more
difficult performance improvements to both neutron and nova-network, and
we can turn it back up later on once we're able to demonstrate that it
passes reliably without failures with a root cause of test load being
too high.

I have to agree with Russell here. The way we run Tempest has morphed it
from a simple functional test suite to include stress/performance test
characteristics as well. This is great because it has found a lot of
bugs but obviously there is a huge downside in having such test
characteristics in the gate at the current failure rate. But it is not
an either/or between acceptable performance/stress levels and acceptable
run time. If we cut the concurrency to 2 and split each full tempest job
into two jobs, each running "half" the tests (based on splitting the
expected execution time), then we can have both until we are able to
crank up the concurrency to 8 or beyond.
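
To make the "split by expected execution time" idea concrete, here is a
rough sketch, not an actual tempest or devstack-gate feature. It assumes we
already have a per-test expected runtime from earlier runs; the function
name, the test names and the timings are all made up for illustration. It
uses a simple greedy longest-first assignment, which is only one possible
way to balance the halves.

```python
import heapq


def split_by_expected_time(expected_seconds, jobs=2):
    """Partition tests into `jobs` groups with roughly equal total time.

    `expected_seconds` maps a test name to its expected runtime in
    seconds. Both the mapping and this helper are hypothetical; nothing
    here is an existing tempest API.
    """
    # Min-heap of (accumulated_seconds, job_index): the least loaded
    # job is always on top.
    heap = [(0.0, i) for i in range(jobs)]
    groups = [[] for _ in range(jobs)]

    # Place the longest tests first so the short ones can even things out.
    ordered = sorted(expected_seconds.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))

    totals = [0.0] * jobs
    for total, idx in heap:
        totals[idx] = total
    return groups, totals


# Made-up timings, purely for illustration.
timings = {
    "test_a": 300, "test_b": 200, "test_c": 200,
    "test_d": 100, "test_e": 100, "test_f": 100,
}
groups, totals = split_by_expected_time(timings, jobs=2)
for group, total in zip(groups, totals):
    print(total, sorted(group))
```

Each resulting group would then run as its own job at concurrency 2. The
balance is only as good as the timing data, so the real halves would need
to be recomputed as tests are added or change speed.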


 -David

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
