Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-20 Thread Matt Riedemann



On 5/19/2014 1:25 PM, Sean Dague wrote:

On 05/19/2014 02:13 PM, Matt Riedemann wrote:



On 5/19/2014 11:33 AM, Matt Riedemann wrote:



On 5/19/2014 10:53 AM, Matt Riedemann wrote:

I was looking through this timeout bug [1] this morning and am able to
correlate that around the time of the image snapshot timeout, ceilometer
was really hammering CPU on the host.  There are already threads on
ceilometer performance and how that needs to be improved for Tempest
runs so I don't want to get into that here.

What I'm thinking about is if there is a way to be smarter about how we
do timeouts in the tests, rather than just rely on globally configured
hard-coded timeouts which are bound to fail intermittently in dynamic
environments like this.

I'm thinking something along the lines of keeping track of CPU stats on
intervals in our waiter loops, then when we reach our configured
timeout, calculate the average CPU load/idle and if it falls below some
threshold, we cut the timeout in half and redo the timeout loop - and we
continue that until our timeout reaches some level that no longer makes
sense, like once it drops less than a minute for example.
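That adaptive loop could look something like the sketch below (purely illustrative: the status and CPU-idle callbacks, threshold, and floor are assumptions for discussion, not Tempest's actual waiter API):

```python
import time

def wait_for_status(get_status, target, timeout, interval=1.0,
                    get_cpu_idle=None, idle_threshold=10.0,
                    min_timeout=60.0):
    """Poll get_status() until it returns `target` or we give up.

    On timeout, look at the average CPU idle sampled during the wait:
    if the host was starved (idle below idle_threshold), assume the
    operation was merely slow rather than hung, halve the timeout and
    retry, until the timeout drops below min_timeout.
    """
    while timeout >= min_timeout:
        deadline = time.time() + timeout
        idle_samples = []
        while time.time() < deadline:
            if get_status() == target:
                return True
            if get_cpu_idle is not None:
                idle_samples.append(get_cpu_idle())
            time.sleep(interval)
        avg_idle = (sum(idle_samples) / len(idle_samples)
                    if idle_samples else 100.0)
        if avg_idle >= idle_threshold:
            # Host was not busy, so this looks like a real hang/failure.
            return False
        timeout /= 2.0
    return False
```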

Are there other ideas here?  My main concern is the number of random
timeout failures we see in the tests and then people are trying to
fingerprint them with elastic-recheck but the queries are so generic
they are not really useful.  We now put the test class and test case in
the compute test timeout messages, but it's also not very useful to
fingerprint every individual permutation of test class/case that we can
hit a timeout in.

[1] https://bugs.launchpad.net/nova/+bug/1320617



This change to devstack should help [1].

It would be good if we actually used the default timeouts we have
configured in Tempest rather than hard-coding them in devstack based on
the latest state of the gate at the time.

[1] https://review.openstack.org/#/c/94221/



I have a proof of concept up for Tempest with adjusted timeouts based on
CPU idle values here:

https://review.openstack.org/#/c/94245/


The problem is this makes an assumption that Tempest is on the host
where the services are. We actually need to get away from that assumption.

If there is something in ceilometer that would make sense to poll, that
might be an option. But psutil definitely can't be a thing we use here.

-Sean



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



I've abandoned both changes in great shame. :P

To target the timeout failure in the image snapshot tests and bug 
1320617, I'm looking to see if maybe the experimental tasks API in 
glance v2 could help get some diagnostic information at the point of 
failure to see if things are just slow or if the snapshot is actually 
hung and will never complete.


Ideally we could leverage tasks across the services for debugging issues 
like this.


--

Thanks,

Matt Riedemann




Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-19 Thread Matt Riedemann





This change to devstack should help [1].

It would be good if we actually used the default timeouts we have 
configured in Tempest rather than hard-coding them in devstack based on 
the latest state of the gate at the time.


[1] https://review.openstack.org/#/c/94221/

--

Thanks,

Matt Riedemann




Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-19 Thread Matt Riedemann





I have a proof of concept up for Tempest with adjusted timeouts based on 
CPU idle values here:


https://review.openstack.org/#/c/94245/

--

Thanks,

Matt Riedemann




Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-19 Thread Sean Dague
On 05/19/2014 02:13 PM, Matt Riedemann wrote:
 I have a proof of concept up for Tempest with adjusted timeouts based on
 CPU idle values here:
 
 https://review.openstack.org/#/c/94245/

The problem is this makes an assumption that Tempest is on the host
where the services are. We actually need to get away from that assumption.

If there is something in ceilometer that would make sense to poll, that
might be an option. But psutil definitely can't be a thing we use here.

-Sean

-- 
Sean Dague
http://dague.net





Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-19 Thread Daryl Walleck
This is still a partially formed idea, but one of the ways I've thought of 
solving the timeout issue is by using aggregate data from previous results. Any 
of the event systems (Ceilometer, StackTach) should allow you to build models 
of what an in-bounds server build (or any action) time per image looks like, and 
what might be considered an outlier. The general idea feels right, but it could 
also be fooled by slow, creeping increases in the execution time of an action.

Daryl
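A minimal sketch of that aggregate-data idea, assuming a per-action history of durations has already been collected from one of the event systems (the function name, threshold, and sample-count floor are all illustrative):

```python
import statistics

def is_outlier(duration, history, k=3.0, min_samples=20):
    """Return True if `duration` is anomalously slow compared to the
    historical timings for the same action/image.

    Uses a simple mean + k*stdev cutoff; with too few samples we
    cannot judge, so nothing is flagged.
    """
    if len(history) < min_samples:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return duration > mean + k * stdev
```

Computing the baseline over a sliding window of only recent runs would also mitigate the slow-drift caveat above, since the baseline would track gradual changes in execution time.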



Re: [openstack-dev] [qa] Smarter timeouts in Tempest?

2014-05-19 Thread Joe Gordon
I see this as the result of two unrelated issues.


On Mon, May 19, 2014 at 8:53 AM, Matt Riedemann
mrie...@linux.vnet.ibm.comwrote:

 I was looking through this timeout bug [1] this morning and am able to
 correlate that around the time of the image snapshot timeout, ceilometer
 was really hammering CPU on the host.  There are already threads on
 ceilometer performance and how that needs to be improved for Tempest runs
 so I don't want to get into that here.

 What I'm thinking about is if there is a way to be smarter about how we do
 timeouts in the tests, rather than just rely on globally configured
 hard-coded timeouts which are bound to fail intermittently in dynamic
 environments like this.

 I'm thinking something along the lines of keeping track of CPU stats on
 intervals in our waiter loops, then when we reach our configured timeout,
 calculate the average CPU load/idle and if it falls below some threshold,
 we cut the timeout in half and redo the timeout loop - and we continue that
 until our timeout reaches some level that no longer makes sense, like once
 it drops less than a minute for example


1.  Our test environment is being pushed to its limits. In the past we have
seen things fail in strange ways when CPU idle % drops below 10%. To
address this we can do a few things:

  * Better track when our test environment has low idle CPU (post
processing on gate jobs?)
  * Make gate jobs use less CPU (ceilometer issues etc).
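The first bullet could start as simple post-processing of the CPU-idle samples a job already records (e.g. dstat output); a hypothetical scan for sustained starvation, with an assumed flat list of idle percentages as input:

```python
def low_idle_windows(idle_samples, threshold=10.0, min_len=3):
    """Return (start, end) index pairs for every run of at least
    `min_len` consecutive samples where CPU idle % stayed below
    `threshold` (end is exclusive)."""
    runs, start = [], None
    for i, idle in enumerate(idle_samples):
        if idle < threshold:
            if start is None:
                start = i  # a starved run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Handle a run that extends to the end of the samples.
    if start is not None and len(idle_samples) - start >= min_len:
        runs.append((start, len(idle_samples)))
    return runs
```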



 Are there other ideas here?  My main concern is the number of random
 timeout failures we see in the tests and then people are trying to
 fingerprint them with elastic-recheck but the queries are so generic they
 are not really useful.  We now put the test class and test case in the
 compute test timeout messages, but it's also not very useful to fingerprint
 every individual permutation of test class/case that we can hit a timeout
 in.


2. OpenStack is hard to debug.  If we, the developers, cannot figure out
what is failing then imagine how hard debugging is for non-openstack
developers.  When we see these types of issues, we should work on making
the logs more useful so we can create better elastic-recheck fingerprints.


