Re: [openstack-dev] [Tempest][Production] Tempest / the gate / real world load

2014-01-13 Thread Maru Newby
I'm afraid I missed this topic the first time around, and I think it bears 
revisiting.

tl;dr: I think we should consider ensuring gate stability in the face of 
resource-starved services by some combination of more intelligent test design 
and better handling of resource starvation (for example, rate-limiting).  
Stress-testing would be more effective if it were explicitly focused on 
real-world usage scenarios and run separately from the gate.  I think 
stress-testing is about the 'when' of failure, whereas the gate is about 'if'.

I don't think anyone would argue that OpenStack services (especially Neutron) 
couldn't do better at ensuring reliability under load.  Running things in 
parallel in the gate shone a bright light on many problem areas, and that was 
inarguably a good thing.  Now that we have a better sense of the problem, 
though, it may be time to think about evolving our approach.

From the perspective of gating commits, I think it makes sense to (a) minimize 
gate execution time and (b) provide some guarantees of reliability under 
reasonable load.  I don't think either of these requires continuing to 
evaluate unrealistic usage scenarios against services running in a severely 
resource-starved environment.  Every service eventually falls over when too 
much is asked of it.  These kinds of failure are not likely to be particularly 
deterministic, so wouldn't it make sense to avoid triggering them in the gate 
as much as possible?
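As an illustration of the rate-limiting idea, here is a minimal token-bucket sketch. The class and the numbers are hypothetical, not an existing OpenStack mechanism; the point is that a starved service can shed excess requests predictably rather than fall over nondeterministically:

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter: admit bursts up to 'capacity',
    then shed requests until tokens refill at 'rate' per second."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a real service would answer 429 / "retry later"


bucket = TokenBucket(rate=10, capacity=10)
results = [bucket.allow() for _ in range(15)]
# A burst of 15 back-to-back requests: the first 10 are admitted,
# the remainder are shed instead of piling up on the service.
```

Shedding load this way makes failure under starvation deterministic (a clean rejection) instead of a timeout somewhere deep in the stack.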

In the specific case of Neutron, the current approach to testing isolation 
involves creating and tearing down networks at a tremendous rate.  I'm not sure 
anyone can argue that this constitutes a usage scenario that is likely to 
appear in production, but because it causes problems in the gate, we've had to 
prioritize working on it over initiatives that might prove more useful to 
operators.  While this may have been a necessary stop on the road to Neutron 
stability, I think it may be worth considering whether we want the gate to 
continue having an outsized role in defining optimization priorities.  
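To make the churn concrete, here is a hypothetical sketch (not actual Tempest or Neutron code; FakeNeutronClient and run_isolated_tests are invented for illustration) of the per-test isolation pattern described above:

```python
class FakeNeutronClient:
    """Stand-in for the Neutron API client; just counts calls."""

    def __init__(self):
        self.created = 0
        self.deleted = 0

    def create_network(self, name):
        self.created += 1
        return {"id": "net-%d" % self.created, "name": name}

    def delete_network(self, net_id):
        self.deleted += 1


def run_isolated_tests(client, tests):
    # Each test gets a fresh network: create in setup, delete in teardown.
    for test in tests:
        net = client.create_network(test.__name__)
        try:
            test(net)
        finally:
            client.delete_network(net["id"])


client = FakeNeutronClient()
run_isolated_tests(client, [lambda net: None] * 50)
# Even 50 trivial tests mean 50 create/delete round trips against the API.
```

Run in parallel across a whole suite, that create/delete pattern is exactly the network churn no real tenant workload resembles.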

Thoughts?


m.

On Dec 12, 2013, at 11:23 AM, Robert Collins robe...@robertcollins.net wrote:



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Tempest][Production] Tempest / the gate / real world load

2013-12-13 Thread Salvatore Orlando
Robert,
As you've deliberately picked on me I feel compelled to reply!

Jokes aside, I am going to retire that patch and push the new default into
Neutron. Regardless of considerations about real loads vs. gate loads, I think
it is correct to assume the default configuration should be one that allows
the gate tests to pass - a sort of lowest common denominator, if you like.
I think, however, that the discussion of whether our gate tests are
representative of real-world deployments is outside the scope of this thread,
even if it is a very interesting one.

On the specific matter of this patch, we've been noticing the CPU on the
Neutron gate tests easily reaching 100%; this is not because of (b). I can
replicate the same behaviour on any other VM, even with twice as many vCPUs.
I've never tried bare metal, though.
However, the fact that 'just' the gate tests drive the CPU on a single host
to 100% should make us suspect that deployers might easily end up facing the
same problem in a real environment (your point (a)), regardless of how the
components are split.

Thankfully, Armando found a related issue with the DHCP agent, which was
causing it to use a lot of CPU as well as terribly stressing ovsdb-server,
and fixed it. Since then we've been seeing far fewer timeout errors in the gate.

Salvatore

On 12 December 2013 20:23, Robert Collins robe...@robertcollins.net wrote:




[openstack-dev] [Tempest][Production] Tempest / the gate / real world load

2013-12-12 Thread Robert Collins
A few times now we've run into patches for devstack-gate / devstack
that change default configuration to handle 'tempest load'.

For instance - https://review.openstack.org/61137 (Sorry Salvatore I'm
not picking on you really!)

So there appears to be a meme that the gate is particularly stressful
- a bad environment - and that real world situations have less load.

This could happen a few ways: (a) deployers might separate out
components more; (b) they might have faster machines; (c) they might
have less concurrent activity.

(a) - unlikely! Deployers will cram stuff together as much as they can
to save overheads. Big clouds will have components split out - yes,
but they will also have correspondingly more load to drive that split
out.

(b) Perhaps, but not orders of magnitude faster. The clouds we run on
are running on fairly recent hardware, and by using big instances we
don't get crammed in with that many other tenants.

(c) Almost certainly not. Tempest currently does a maximum of four
concurrent requests. A small business cloud could easily have 5 or 6
people making concurrent requests from time to time, and bigger but
not huge clouds will certainly have that. Their /average/ rate of API
requests may be much lower, but when they point service orchestration
tools at it -- particularly tools that walk their dependencies in
parallel - load is going to be much much higher than what we generate
with Tempest.
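A rough sketch of the orchestration behaviour described in (c): a tool that walks a dependency graph and fires API calls for all independent resources at once. The graph and the create_resource stand-in are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dependency graph: resource name -> prerequisites.
graph = {
    "network": [],
    "subnet": ["network"],
    "router": ["network"],
    "server1": ["subnet"],
    "server2": ["subnet"],
    "server3": ["subnet"],
}


def create_resource(name):
    return name  # stand-in for a real API call


def deploy(graph, max_workers=8):
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(graph):
            # Everything whose prerequisites are satisfied is requested
            # at once - bursts far wider than Tempest's four requests
            # on any non-trivial template.
            ready = [n for n, deps in graph.items()
                     if n not in done and all(d in done for d in deps)]
            for name in pool.map(create_resource, ready):
                done.add(name)
                order.append(name)
    return order


order = deploy(graph)
```

Once the subnet exists, all three servers are requested concurrently; scale the template up and the burst width grows with it.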

tl;dr : if we need to change a config file setting in devstack-gate or
devstack *other than* for setting up the specific scenario, think thrice -
should it be a production default, set in the relevant project's
default config?
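The principle above can be sketched hypothetically (using stdlib configparser rather than any real project's config machinery): the project ships the production-safe default, and devstack-gate layers on only scenario-specific overrides:

```python
import configparser

# Hypothetical project default: shipped in the project's own config,
# not patched into devstack-gate.
PROJECT_DEFAULTS = {"DEFAULT": {"api_workers": "4"}}


def effective_config(gate_overrides):
    cfg = configparser.ConfigParser()
    cfg.read_dict(PROJECT_DEFAULTS)   # production-safe defaults
    cfg.read_dict(gate_overrides)     # only scenario-specific knobs on top
    return cfg


# The gate overrides only what its scenario needs; every other default
# it exercises is exactly what a deployer would get.
cfg = effective_config({"DEFAULT": {"debug": "True"}})
```

If the gate needs a *different* value to survive, that is a signal the project default is wrong for production too.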

Cheers,
Rob
-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud
