Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-13 Thread Russell Bryant
On 01/11/2014 09:57 AM, Russell Bryant wrote:
 5) https://review.openstack.org/#/c/65989/
 
 This patch isn't a candidate for merging, but was written to test the
 theory that by updating nova-network to use conductor instead of direct
 database access, nova-network will be able to do work in parallel better
 than it does today, just as we have observed with nova-compute.
 
 Dan's initial test results from this are **very** promising.  Initial
 testing showed a 20% speedup in runtime and a 33% decrease in CPU
 consumption by nova-network.
 
 Doing this properly will not be quick, but I'm hopeful that we can
 complete it by the Icehouse release.  We will need to convert
 nova-network to use Nova's object model.  Much of this work is starting
 to catch nova-network up on work that we've been doing in the rest of
 the tree but have passed on doing for nova-network due to nova-network
 being in a freeze.

I have filed a blueprint to track the completion of this work throughout
the rest of Icehouse.

https://blueprints.launchpad.net/nova/+spec/nova-network-objects

-- 
Russell Bryant



Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-11 Thread Russell Bryant
On 01/09/2014 04:16 PM, Russell Bryant wrote:
 On 01/08/2014 05:53 PM, Joe Gordon wrote:
 Hi All, 

 As you know the gate has been in particularly bad shape (gate queue over
 100!) this week due to a number of factors. One factor is how many major
 outstanding bugs we have in the gate.  Below is a list of the top 4 open
 gate bugs.

 Here are some fun facts about this list:
 * All bugs have been open for over a month
 * All are nova bugs
 * These 4 bugs alone were hit 588 times which averages to 42 hits per
 day (data is over two weeks)!

 If we want the gate queue to drop and not have to continuously run
 'recheck bug x' we need to fix these bugs.  So I'm looking for
 volunteers to help debug and fix these bugs.
 
 I created the following etherpad to help track the most important Nova
 gate bugs, who is actively working on them, and any patches that we have
 in flight to help address them:
 
   https://etherpad.openstack.org/p/nova-gate-issue-tracking
 
 Please jump in if you can.  We shouldn't wait for the gate bug day to
 move on these.  Even if others are already looking at a bug, feel free
 to do the same.  We need multiple sets of eyes on each of these issues.
 

Some good progress from the last few days:

After looking at a lot of failures, we determined that the vast majority
of failures are performance related.  The load being put on the
OpenStack deployment is just too high.  We're working to address this to
make the gate more reliable in a number of ways.

1) (merged) https://review.openstack.org/#/c/65760/

The large-ops test was cut back from spawning 100 instances to 50.  From
the commit message:

  It turns out the variance in cloud instances is very high, especially
  when comparing different cloud providers and regions. This test was
  originally added as a regression test for the nova-network issues with
  rootwrap. At which time this test wouldn't pass for 30 instances.  So
  50 is still a valid regression test.

2) (merged) https://review.openstack.org/#/c/45766/

nova-compute is able to do work in parallel very well.  nova-conductor
can not by default due to the details of our use of eventlet + how we
talk to MySQL.  The way you allow nova-conductor to do its work in
parallel is by running multiple conductor workers.  We had not enabled
this by default in devstack, so our 4 vCPU test nodes were only using a
single conductor worker.  They now use 4 conductor workers.
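
For reference, this boils down to a single option; a minimal sketch,
assuming the standard [conductor] section in nova.conf (devstack
generates the actual file on the gate nodes):

    # nova.conf (illustrative only)
    [conductor]
    # Number of nova-conductor worker processes; one per vCPU on the
    # 4 vCPU test nodes.
    workers = 4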

3) (still testing) https://review.openstack.org/#/c/65805/

Right now when tempest runs in the devstack-gate jobs, it runs with
concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
this maxes out the deployment and results in timeouts (usually network
related).

This patch changes tempest concurrency to 2 instead of 4.  The initial
results are quite promising.  The tests have been passing reliably so
far, but we're going to continue to recheck this for a while longer for
more data.
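
For anyone not familiar with how that knob is plumbed through, the
change amounts to running the test runner with fewer workers; roughly
the following, though the exact invocation is wired up inside
devstack-gate:

    # Illustrative only -- run the tempest suite with 2 parallel
    # workers instead of 4.
    testr run --parallel --concurrency=2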

One very interesting observation on this came from Jim, where he said "A
quick glance suggests 1.2x -- 1.4x change in runtime."  If the
deployment were *not* being maxed out, we would expect this change to
result in much closer to a 2x runtime increase.
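
As a sanity check on that reasoning (simple arithmetic, not measured
data): on an unsaturated node, halving the number of parallel workers
should roughly double wall-clock time, so seeing only 1.2x -- 1.4x
implies the 4-way run was mostly contending for saturated resources.

    # Back-of-the-envelope estimate of how much parallel capacity the
    # 4-way run was losing to contention (illustrative numbers).
    observed_slowdown = 1.3   # midpoint of Jim's 1.2x -- 1.4x estimate
    ideal_slowdown = 2.0      # expected 4-way -> 2-way slowdown on an
                              # unsaturated node
    lost = 1.0 - observed_slowdown / ideal_slowdown
    print("~%d%% of the 4-way parallelism was lost to contention" % (lost * 100))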

4) (approved, not yet merged) https://review.openstack.org/#/c/65784/

nova-network seems to be the largest bottleneck in terms of performance
problems when nova is maxed out on these test nodes.  This patch is one
quick speedup we can make by not using rootwrap in a few cases where it
wasn't necessary.  These really add up.
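
For anyone curious what "not using rootwrap" looks like in practice,
the pattern is roughly the following (a hypothetical example using
nova's utils.execute helper; the actual call sites are in the review
above):

    from nova import utils

    # Before: every call goes through sudo + rootwrap, which spawns a
    # separate Python interpreter to check the command against filters.
    utils.execute('ip', 'link', 'show', 'dev', 'br100', run_as_root=True)

    # After: commands that don't actually need root privileges skip
    # rootwrap entirely, avoiding the per-call fork/exec overhead.
    utils.execute('ip', 'link', 'show', 'dev', 'br100')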

5) https://review.openstack.org/#/c/65989/

This patch isn't a candidate for merging, but was written to test the
theory that by updating nova-network to use conductor instead of direct
database access, nova-network will be able to do work in parallel better
than it does today, just as we have observed with nova-compute.

Dan's initial test results from this are **very** promising.  Initial
testing showed a 20% speedup in runtime and a 33% decrease in CPU
consumption by nova-network.

Doing this properly will not be quick, but I'm hopeful that we can
complete it by the Icehouse release.  We will need to convert
nova-network to use Nova's object model.  Much of this work is starting
to catch nova-network up on work that we've been doing in the rest of
the tree but have passed on doing for nova-network due to nova-network
being in a freeze.
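
To give a feel for the shape of that conversion, here is a heavily
simplified before/after sketch.  The object name below is illustrative
of the kind of object this work would add, not code from the patch:

    from nova import db
    from nova import objects

    # Today: nova-network queries the database directly, and the
    # blocking MySQL driver limits how much eventlet can parallelize
    # within the process.
    fixed_ip = db.fixed_ip_get_by_address(context, address)

    # With the object model: the same lookup goes through an object
    # whose backend can be RPC to nova-conductor, so the (multiple)
    # conductor workers do the database work in parallel.
    fixed_ip = objects.FixedIP.get_by_address(context, address)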

6) (no patch yet)

We haven't had time to dive too deep into this yet, but we would also
like to revisit our locking usage and how it is affecting nova-network
performance.  There may be some more significant improvements we can
make there.
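
For context, the locking in question is the synchronized-decorator
style used around network operations; a purely hypothetical sketch of
the pattern (not a proposed change):

    from nova.openstack.common import lockutils

    # A coarse external (cross-process) lock like this serializes every
    # allocation nova-network performs; narrowing its scope, or making
    # it per-network rather than global, could win back parallelism.
    @lockutils.synchronized('network-allocate', 'nova-', external=True)
    def allocate_fixed_ip(context, instance_id, network):
        ...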


Final notes:

I am hopeful that by addressing these performance issues, both in Nova's
code and by turning down the test load, we will see a significant
increase in gate reliability in the near future.  I
apologize on behalf of the Nova team for Nova's contribution to gate
instability.

*Thank you* to everyone who has been helping out!

-- 
Russell Bryant


Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-11 Thread Sean Dague
First, thanks a ton for diving in on all of this, Russell. The big push by
the Nova team recently is really helpful.


On 01/11/2014 09:57 AM, Russell Bryant wrote:

On 01/09/2014 04:16 PM, Russell Bryant wrote:

On 01/08/2014 05:53 PM, Joe Gordon wrote:

Hi All,

As you know the gate has been in particularly bad shape (gate queue over
100!) this week due to a number of factors. One factor is how many major
outstanding bugs we have in the gate.  Below is a list of the top 4 open
gate bugs.

Here are some fun facts about this list:
* All bugs have been open for over a month
* All are nova bugs
* These 4 bugs alone were hit 588 times which averages to 42 hits per
day (data is over two weeks)!

If we want the gate queue to drop and not have to continuously run
'recheck bug x' we need to fix these bugs.  So I'm looking for
volunteers to help debug and fix these bugs.


I created the following etherpad to help track the most important Nova
gate bugs, who is actively working on them, and any patches that we have
in flight to help address them:

   https://etherpad.openstack.org/p/nova-gate-issue-tracking

Please jump in if you can.  We shouldn't wait for the gate bug day to
move on these.  Even if others are already looking at a bug, feel free
to do the same.  We need multiple sets of eyes on each of these issues.



Some good progress from the last few days:

After looking at a lot of failures, we determined that the vast majority
of failures are performance related.  The load being put on the
OpenStack deployment is just too high.  We're working to address this to
make the gate more reliable in a number of ways.

1) (merged) https://review.openstack.org/#/c/65760/

The large-ops test was cut back from spawning 100 instances to 50.  From
the commit message:

   It turns out the variance in cloud instances is very high, especially
   when comparing different cloud providers and regions. This test was
   originally added as a regression test for the nova-network issues with
   rootwrap. At which time this test wouldn't pass for 30 instances.  So
   50 is still a valid regression test.

2) (merged) https://review.openstack.org/#/c/45766/

nova-compute is able to do work in parallel very well.  nova-conductor
can not by default due to the details of our use of eventlet + how we
talk to MySQL.  The way you allow nova-conductor to do its work in
parallel is by running multiple conductor workers.  We had not enabled
this by default in devstack, so our 4 vCPU test nodes were only using a
single conductor worker.  They now use 4 conductor workers.

3) (still testing) https://review.openstack.org/#/c/65805/

Right now when tempest runs in the devstack-gate jobs, it runs with
concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
this maxes out the deployment and results in timeouts (usually network
related).

This patch changes tempest concurrency to 2 instead of 4.  The initial
results are quite promising.  The tests have been passing reliably so
far, but we're going to continue to recheck this for a while longer for
more data.

One very interesting observation on this came from Jim, where he said "A
quick glance suggests 1.2x -- 1.4x change in runtime."  If the
deployment were *not* being maxed out, we would expect this change to
result in much closer to a 2x runtime increase.


We could also address this by locally turning up timeouts on operations
that are timing out, which would let those things take the time they need.
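
If we went that route, the knobs are the tempest waiter timeouts; a
rough sketch, assuming the usual build_timeout options (double check
against tempest.conf.sample before relying on the exact names):

    # tempest.conf (illustrative only)
    [compute]
    build_timeout = 300

    [network]
    build_timeout = 300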


Before dropping the concurrency I'd really like to make sure we can
point to specific failures that we think will go away. There was a lot of
speculation around nova-network; however, the nova-network timeout errors
only pop up in Elasticsearch on large-ops jobs, not in normal tempest
jobs. Definitely making OpenStack more idle will make more tests pass.
The Neutron team has experienced that.


It would be a ton better if we could actually feed back a 503 with a 
retry time (which I realize is a ton of work).
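
(Mechanically that's just a 503 with a Retry-After header; a tiny
hypothetical sketch of what an API service could return when it knows
it's overloaded -- the ton of work is everything around deciding when
and what to return:)

    import webob

    def overloaded_response(retry_seconds=30):
        # Hypothetical helper: tell the client to back off and retry
        # once the service expects to have capacity again.
        resp = webob.Response(status=503)
        resp.headers['Retry-After'] = str(retry_seconds)
        return resp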


Because if we decide we're now always pinned to only 2way, we have to 
start doing some major rethinking on our test strategy, as we'll be way 
outside the soft 45min time budget we've been trying to operate on. We'd 
actually been planning on going up to 8way, but were waiting for some 
issues to get fixed before we did that. It would sort of immediately put 
a moratorium on new tests. If that's what we need to do, that's what we 
need to do, but we should talk it through.



4) (approved, not yet merged) https://review.openstack.org/#/c/65784/

nova-network seems to be the largest bottleneck in terms of performance
problems when nova is maxed out on these test nodes.  This patch is one
quick speedup we can make by not using rootwrap in a few cases where it
wasn't necessary.  These really add up.

5) https://review.openstack.org/#/c/65989/

This patch isn't a candidate for merging, but was written to test the
theory that by updating nova-network to use conductor instead of direct
database access, nova-network will be able to do work in parallel better
than it does today, just as we have observed with nova-compute.

Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-11 Thread Russell Bryant
On 01/11/2014 11:38 AM, Sean Dague wrote:
 3) (still testing) https://review.openstack.org/#/c/65805/

 Right now when tempest runs in the devstack-gate jobs, it runs with
 concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
 this maxes out the deployment and results in timeouts (usually network
 related).

 This patch changes tempest concurrency to 2 instead of 4.  The initial
 results are quite promising.  The tests have been passing reliably so
 far, but we're going to continue to recheck this for a while longer for
 more data.

 One very interesting observation on this came from Jim, where he said "A
 quick glance suggests 1.2x -- 1.4x change in runtime."  If the
 deployment were *not* being maxed out, we would expect this change to
 result in much closer to a 2x runtime increase.
 
 We could also address this by locally turning up timeouts on operations
 that are timing out, which would let those things take the time they need.
 
 Before dropping the concurrency I'd really like to make sure we can
 point to specific failures that we think will go away. There was a lot of
 speculation around nova-network; however, the nova-network timeout errors
 only pop up in Elasticsearch on large-ops jobs, not in normal tempest
 jobs. Definitely making OpenStack more idle will make more tests pass.
 The Neutron team has experienced that.
 
 It would be a ton better if we could actually feed back a 503 with a
 retry time (which I realize is a ton of work).
 
 Because if we decide we're now always pinned to only 2way, we have to
 start doing some major rethinking on our test strategy, as we'll be way
 outside the soft 45min time budget we've been trying to operate on. We'd
 actually been planning on going up to 8way, but were waiting for some
 issues to get fixed before we did that. It would sort of immediately put
 a moratorium on new tests. If that's what we need to do, that's what we
 need to do, but we should talk it through.

I can try to write up some detailed analysis on a few failures next week
to help justify it, but FWIW, when I was looking this last week, I felt
like making this change was going to fix a lot more than the
nova-network timeout errors.

If we can already tell this is going to improve reliability, both when
using nova-network and neutron, then I think that should be enough to
justify it.  Taking longer seems acceptable if that comes with a more
acceptable pass rate.

Right now I'd like to see us set concurrency=2 while we work on the more
difficult performance improvements to both neutron and nova-network, and
we can turn it back up later once we're able to demonstrate that the gate
passes reliably without failures whose root cause is the test load being
too high.

 5) https://review.openstack.org/#/c/65989/

 This patch isn't a candidate for merging, but was written to test the
 theory that by updating nova-network to use conductor instead of direct
 database access, nova-network will be able to do work in parallel better
 than it does today, just as we have observed with nova-compute.

 Dan's initial test results from this are **very** promising.  Initial
 testing showed a 20% speedup in runtime and a 33% decrease in CPU
 consumption by nova-network.

 Doing this properly will not be quick, but I'm hopeful that we can
 complete it by the Icehouse release.  We will need to convert
 nova-network to use Nova's object model.  Much of this work is starting
 to catch nova-network up on work that we've been doing in the rest of
 the tree but have passed on doing for nova-network due to nova-network
 being in a freeze.
 
 I'm a huge +1 on fixing this in nova-network.

Of course.  This is just a bit of a longer term effort.

-- 
Russell Bryant



Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-09 Thread Salvatore Orlando
I think I have found another fault triggering bug 1253896 when neutron is
enabled.

I've added a comment to https://bugs.launchpad.net/bugs/1253896
On another note, I'm also seeing occurrences of this bug with nova-network.
Is there anybody from the nova side looking at it? (I can give it a try, but
I don't know a lot about nova-network.)

Salvatore


On 8 January 2014 23:53, Joe Gordon joe.gord...@gmail.com wrote:

 Hi All,

 As you know the gate has been in particularly bad shape (gate queue over
 100!) this week due to a number of factors. One factor is how many major
 outstanding bugs we have in the gate.  Below is a list of the top 4 open
 gate bugs.

 Here are some fun facts about this list:
 * All bugs have been open for over a month
 * All are nova bugs
 * These 4 bugs alone were hit 588 times which averages to 42 hits per day
 (data is over two weeks)!

 If we want the gate queue to drop and not have to continuously run
 'recheck bug x' we need to fix these bugs.  So I'm looking for volunteers
 to help debug and fix these bugs.


 best,
 Joe

 Bug: https://bugs.launchpad.net/bugs/1253896
 Fingerprint: message:"SSHTimeout: Connection to the" AND
 message:"via SSH timed out." AND filename:"console.html"
 Filed: 2013-11-21
 Title: Attempts to verify guests are running via SSH fails. SSH connection
 to guest does not work.
 Project: Status
   neutron: In Progress
   nova: Triaged
   tempest: Confirmed
 Hits
   FAILURE: 243
 Percentage of Gate Queue Job failures triggered by this bug
   gate-tempest-dsvm-postgres-full: 0.35%
   gate-grenade-dsvm: 0.68%
   gate-tempest-dsvm-neutron: 0.39%
   gate-tempest-dsvm-neutron-isolated: 4.76%
   gate-tempest-dsvm-full: 0.19%

 Bug: https://bugs.launchpad.net/bugs/1254890
 Fingerprint: message:"Details: Timed out waiting for thing" AND
 message:"to become" AND (message:"ACTIVE" OR message:"in-use" OR
 message:"available")
 Filed: 2013-11-25
 Title: Timed out waiting for thing causes tempest-dsvm-neutron-* failures
 Project: Status
   neutron: Invalid
   nova: Triaged
   tempest: Confirmed
 Hits
   FAILURE: 173
 Percentage of Gate Queue Job failures triggered by this bug
   gate-tempest-dsvm-neutron-isolated: 4.76%
   gate-tempest-dsvm-postgres-full: 0.35%
   gate-tempest-dsvm-large-ops: 0.68%
   gate-tempest-dsvm-neutron-large-ops: 0.70%
   gate-tempest-dsvm-full: 0.19%
   gate-tempest-dsvm-neutron-pg: 3.57%

 Bug: https://bugs.launchpad.net/bugs/1257626
 Fingerprint: message:"nova.compute.manager Timeout: Timeout while waiting
 on RPC response - topic: \"network\", RPC method:
 \"allocate_for_instance\"" AND filename:"logs/screen-n-cpu.txt"
 Filed: 2013-12-04
 Title: Timeout while waiting on RPC response - topic: network, RPC
 method: allocate_for_instance info: unknown
 Project: Status
   nova: Triaged
 Hits
   FAILURE: 118
 Percentage of Gate Queue Job failures triggered by this bug
   gate-tempest-dsvm-large-ops: 0.68%

 Bug: https://bugs.launchpad.net/bugs/1254872
 Fingerprint: message:"libvirtError: Timed out during operation: cannot
 acquire state change lock" AND filename:"logs/screen-n-cpu.txt"
 Filed: 2013-11-25
 Title: libvirtError: Timed out during operation: cannot acquire state
 change lock
 Project: Status
   nova: Triaged
 Hits
   FAILURE: 54
   SUCCESS: 3
 Percentage of Gate Queue Job failures triggered by this bug
   gate-tempest-dsvm-postgres-full: 0.35%
   gate-tempest-dsvm-full: 0.19%


 Generated with: elastic-recheck-success



Re: [openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-09 Thread Russell Bryant
On 01/08/2014 05:53 PM, Joe Gordon wrote:
 Hi All, 
 
 As you know the gate has been in particularly bad shape (gate queue over
 100!) this week due to a number of factors. One factor is how many major
 outstanding bugs we have in the gate.  Below is a list of the top 4 open
 gate bugs.
 
 Here are some fun facts about this list:
 * All bugs have been open for over a month
 * All are nova bugs
 * These 4 bugs alone were hit 588 times which averages to 42 hits per
 day (data is over two weeks)!
 
 If we want the gate queue to drop and not have to continuously run
 'recheck bug x' we need to fix these bugs.  So I'm looking for
 volunteers to help debug and fix these bugs.

I created the following etherpad to help track the most important Nova
gate bugs, who is actively working on them, and any patches that we have
in flight to help address them:

  https://etherpad.openstack.org/p/nova-gate-issue-tracking

Please jump in if you can.  We shouldn't wait for the gate bug day to
move on these.  Even if others are already looking at a bug, feel free
to do the same.  We need multiple sets of eyes on each of these issues.

-- 
Russell Bryant



[openstack-dev] [nova][neutron] top gate bugs: a plea for help

2014-01-08 Thread Joe Gordon
Hi All,

As you know the gate has been in particularly bad shape (gate queue over
100!) this week due to a number of factors. One factor is how many major
outstanding bugs we have in the gate.  Below is a list of the top 4 open
gate bugs.

Here are some fun facts about this list:
* All bugs have been open for over a month
* All are nova bugs
* These 4 bugs alone were hit 588 times which averages to 42 hits per day
(data is over two weeks)!

If we want the gate queue to drop and not have to continuously run 'recheck
bug x' we need to fix these bugs.  So I'm looking for volunteers to help
debug and fix these bugs.


best,
Joe

Bug: https://bugs.launchpad.net/bugs/1253896
Fingerprint: message:"SSHTimeout: Connection to the" AND
message:"via SSH timed out." AND filename:"console.html"
Filed: 2013-11-21
Title: Attempts to verify guests are running via SSH fails. SSH connection
to guest does not work.
Project: Status
  neutron: In Progress
  nova: Triaged
  tempest: Confirmed
Hits
  FAILURE: 243
Percentage of Gate Queue Job failures triggered by this bug
  gate-tempest-dsvm-postgres-full: 0.35%
  gate-grenade-dsvm: 0.68%
  gate-tempest-dsvm-neutron: 0.39%
  gate-tempest-dsvm-neutron-isolated: 4.76%
  gate-tempest-dsvm-full: 0.19%

Bug: https://bugs.launchpad.net/bugs/1254890
Fingerprint: message:"Details: Timed out waiting for thing" AND
message:"to become" AND (message:"ACTIVE" OR message:"in-use" OR message:"available")
Filed: 2013-11-25
Title: Timed out waiting for thing causes tempest-dsvm-neutron-* failures
Project: Status
  neutron: Invalid
  nova: Triaged
  tempest: Confirmed
Hits
  FAILURE: 173
Percentage of Gate Queue Job failures triggered by this bug
  gate-tempest-dsvm-neutron-isolated: 4.76%
  gate-tempest-dsvm-postgres-full: 0.35%
  gate-tempest-dsvm-large-ops: 0.68%
  gate-tempest-dsvm-neutron-large-ops: 0.70%
  gate-tempest-dsvm-full: 0.19%
  gate-tempest-dsvm-neutron-pg: 3.57%

Bug: https://bugs.launchpad.net/bugs/1257626
Fingerprint: message:"nova.compute.manager Timeout: Timeout while waiting
on RPC response - topic: \"network\", RPC method:
\"allocate_for_instance\"" AND filename:"logs/screen-n-cpu.txt"
Filed: 2013-12-04
Title: Timeout while waiting on RPC response - topic: network, RPC
method: allocate_for_instance info: unknown
Project: Status
  nova: Triaged
Hits
  FAILURE: 118
Percentage of Gate Queue Job failures triggered by this bug
  gate-tempest-dsvm-large-ops: 0.68%

Bug: https://bugs.launchpad.net/bugs/1254872
Fingerprint: message:"libvirtError: Timed out during operation: cannot
acquire state change lock" AND filename:"logs/screen-n-cpu.txt"
Filed: 2013-11-25
Title: libvirtError: Timed out during operation: cannot acquire state
change lock
Project: Status
  nova: Triaged
Hits
  FAILURE: 54
  SUCCESS: 3
Percentage of Gate Queue Job failures triggered by this bug
  gate-tempest-dsvm-postgres-full: 0.35%
  gate-tempest-dsvm-full: 0.19%


Generated with: elastic-recheck-success
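
(For anyone who wants to add or refine fingerprints: each of the
queries above lives in the elastic-recheck repo as a small YAML file
keyed by bug number; roughly the following shape, though check the
repo for the exact layout.)

    # queries/1253896.yaml (illustrative)
    query: >
      message:"SSHTimeout: Connection to the" AND
      message:"via SSH timed out." AND
      filename:"console.html"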