Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
On 5/19/2014 1:25 PM, Sean Dague wrote:
> On 05/19/2014 02:13 PM, Matt Riedemann wrote:
>> On 5/19/2014 11:33 AM, Matt Riedemann wrote:
>>> On 5/19/2014 10:53 AM, Matt Riedemann wrote:
>>>> I was looking through this timeout bug [1] this morning and am
>>>> able to correlate that around the time of the image snapshot
>>>> timeout, ceilometer was really hammering CPU on the host. There
>>>> are already threads on ceilometer performance and how that needs
>>>> to be improved for Tempest runs so I don't want to get into that
>>>> here.
>>>>
>>>> What I'm thinking about is if there is a way to be smarter about
>>>> how we do timeouts in the tests, rather than just rely on globally
>>>> configured hard-coded timeouts which are bound to fail
>>>> intermittently in dynamic environments like this.
>>>>
>>>> I'm thinking something along the lines of keeping track of CPU
>>>> stats on intervals in our waiter loops, then when we reach our
>>>> configured timeout, calculate the average CPU load/idle and if it
>>>> falls below some threshold, we cut the timeout in half and redo
>>>> the timeout loop - and we continue that until our timeout reaches
>>>> some level that no longer makes sense, like once it drops less
>>>> than a minute for example.
>>>>
>>>> Are there other ideas here? My main concern is the number of
>>>> random timeout failures we see in the tests and then people are
>>>> trying to fingerprint them with elastic-recheck but the queries
>>>> are so generic they are not really useful. We now put the test
>>>> class and test case in the compute test timeout messages, but it's
>>>> also not very useful to fingerprint every individual permutation
>>>> of test class/case that we can hit a timeout in.
>>>>
>>>> [1] https://bugs.launchpad.net/nova/+bug/1320617
>>>
>>> This change to devstack should help [1].
>>>
>>> It would be good if we actually used the default timeouts we have
>>> configured in Tempest rather than hard-coding them in devstack
>>> based on the latest state of the gate at the time.
>>>
>>> [1] https://review.openstack.org/#/c/94221/
>>
>> I have a proof of concept up for Tempest with adjusted timeouts
>> based on CPU idle values here:
>>
>> https://review.openstack.org/#/c/94245/
>
> The problem is this makes an assumption that Tempest is on the host
> where the services are. We actually need to get away from that
> assumption.
>
> If there is something in ceilometer that would make sense to poll,
> that might be an option. But psutil definitely can't be a thing we
> use here.
>
> -Sean

I've abandoned both changes in great shame. :P

To target the timeout failure in the image snapshot tests and bug
1320617, I'm looking to see if maybe the experimental tasks API in
glance v2 could help get some diagnostic information at the point of
failure to see if things are just slow or if the snapshot is actually
hung and will never complete.

Ideally we could leverage tasks across the services for debugging
issues like this.

--

Thanks,

Matt Riedemann

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
I see this as the result of two unrelated issues.

On Mon, May 19, 2014 at 8:53 AM, Matt Riedemann wrote:
> I was looking through this timeout bug [1] this morning and am able to
> correlate that around the time of the image snapshot timeout, ceilometer
> was really hammering CPU on the host. There are already threads on
> ceilometer performance and how that needs to be improved for Tempest runs
> so I don't want to get into that here.
>
> What I'm thinking about is if there is a way to be smarter about how we do
> timeouts in the tests, rather than just rely on globally configured
> hard-coded timeouts which are bound to fail intermittently in dynamic
> environments like this.
>
> I'm thinking something along the lines of keeping track of CPU stats on
> intervals in our waiter loops, then when we reach our configured timeout,
> calculate the average CPU load/idle and if it falls below some threshold,
> we cut the timeout in half and redo the timeout loop - and we continue that
> until our timeout reaches some level that no longer makes sense, like once
> it drops less than a minute for example.
>

1. Our test environment is being pushed to its limits.

In the past we have seen things fail in strange ways when CPU idle %
drops below 10%. To address this we can do a few things:

* Better track when our test environment has low idle CPU (post
  processing on gate jobs?)
* Make gate jobs use less CPU (ceilometer issues etc.).

>
> Are there other ideas here? My main concern is the number of random
> timeout failures we see in the tests and then people are trying to
> fingerprint them with elastic-recheck but the queries are so generic they
> are not really useful. We now put the test class and test case in the
> compute test timeout messages, but it's also not very useful to fingerprint
> every individual permutation of test class/case that we can hit a timeout
> in.
>

2. OpenStack is hard to debug.

If we, the developers, cannot figure out what is failing, then imagine
how hard debugging is for non-OpenStack developers. When we see these
types of issues, we should work on making the logs more useful so we
can create better elastic-recheck fingerprints.

>
> [1] https://bugs.launchpad.net/nova/+bug/1320617
>
> --
>
> Thanks,
>
> Matt Riedemann
>
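One way to picture the "post processing on gate jobs" suggestion is the sketch below. It assumes the job has already recorded CPU idle percentages as (timestamp, idle_percent) pairs; that input format and the name low_idle_windows are hypothetical and do not come from any gate tooling. The 10% default is the figure mentioned in the reply above.

```python
def low_idle_windows(samples, threshold=10.0):
    """Return (start, end) timestamp pairs for runs of starved samples.

    samples is an iterable of (timestamp, idle_percent) pairs in time
    order. This input format is an assumption made for illustration.
    """
    windows = []
    start = None
    last = None
    for timestamp, idle in samples:
        if idle < threshold:
            # Open a new window, or extend the one already open.
            if start is None:
                start = timestamp
            last = timestamp
        elif start is not None:
            # Idle CPU recovered: close the open window.
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows
```

A timeout that falls inside one of the returned windows could then be tagged as "environment starved" rather than fingerprinted per test case.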
Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
This is still a partially formed idea, but one of the ways I've thought
of solving the timeout issue is by using aggregate data from previous
results. Any of the event systems (Ceilometer, StackTach) should allow
you to create models of what an in-bounds server build time (or the
time for any other action) per image looks like, and what might be
considered an outlier. The general idea feels right, but it could also
be fooled by slow, gradual increases in the execution time of an
action.

Daryl

-----Original Message-----
From: Matt Riedemann [mailto:mrie...@linux.vnet.ibm.com]
Sent: Monday, May 19, 2014 10:54 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [qa] Smarter timeouts in Tempest?

I was looking through this timeout bug [1] this morning and am able to
correlate that around the time of the image snapshot timeout,
ceilometer was really hammering CPU on the host. There are already
threads on ceilometer performance and how that needs to be improved for
Tempest runs so I don't want to get into that here.

What I'm thinking about is if there is a way to be smarter about how we
do timeouts in the tests, rather than just rely on globally configured
hard-coded timeouts which are bound to fail intermittently in dynamic
environments like this.

I'm thinking something along the lines of keeping track of CPU stats on
intervals in our waiter loops, then when we reach our configured
timeout, calculate the average CPU load/idle and if it falls below some
threshold, we cut the timeout in half and redo the timeout loop - and
we continue that until our timeout reaches some level that no longer
makes sense, like once it drops less than a minute for example.

Are there other ideas here? My main concern is the number of random
timeout failures we see in the tests and then people are trying to
fingerprint them with elastic-recheck but the queries are so generic
they are not really useful. We now put the test class and test case in
the compute test timeout messages, but it's also not very useful to
fingerprint every individual permutation of test class/case that we can
hit a timeout in.

[1] https://bugs.launchpad.net/nova/+bug/1320617

--

Thanks,

Matt Riedemann
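The aggregate-data idea above could look roughly like the following sketch. It assumes the durations of previous successful runs of one action are already available as a plain list of seconds; how they would be pulled out of Ceilometer or StackTach is not shown, and the names timeout_from_history and is_outlier are hypothetical. The "mean plus k standard deviations" rule is one simple stand-in for a model, not something proposed in the thread.

```python
import statistics


def timeout_from_history(durations, k=3.0):
    """Derive an upper bound for an action from past run times.

    durations holds the observed times (in seconds) of earlier
    successful runs. The bound is mean + k * sample standard deviation.
    """
    if len(durations) < 2:
        raise ValueError("need at least two past durations")
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    return mean + k * stdev


def is_outlier(duration, durations, k=3.0):
    """True if duration exceeds the bound derived from the history."""
    return duration > timeout_from_history(durations, k)
```

Because the bound is recomputed from whatever history it is given, a slow, steady regression would drag the bound up with it, which is the weakness noted above. Pinning the history to a fixed baseline period would be one way to limit that.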
Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
On 05/19/2014 02:13 PM, Matt Riedemann wrote:
>
>
> On 5/19/2014 11:33 AM, Matt Riedemann wrote:
>>
>>
>> On 5/19/2014 10:53 AM, Matt Riedemann wrote:
>>> I was looking through this timeout bug [1] this morning and am able to
>>> correlate that around the time of the image snapshot timeout, ceilometer
>>> was really hammering CPU on the host. There are already threads on
>>> ceilometer performance and how that needs to be improved for Tempest
>>> runs so I don't want to get into that here.
>>>
>>> What I'm thinking about is if there is a way to be smarter about how we
>>> do timeouts in the tests, rather than just rely on globally configured
>>> hard-coded timeouts which are bound to fail intermittently in dynamic
>>> environments like this.
>>>
>>> I'm thinking something along the lines of keeping track of CPU stats on
>>> intervals in our waiter loops, then when we reach our configured
>>> timeout, calculate the average CPU load/idle and if it falls below some
>>> threshold, we cut the timeout in half and redo the timeout loop - and we
>>> continue that until our timeout reaches some level that no longer makes
>>> sense, like once it drops less than a minute for example.
>>>
>>> Are there other ideas here? My main concern is the number of random
>>> timeout failures we see in the tests and then people are trying to
>>> fingerprint them with elastic-recheck but the queries are so generic
>>> they are not really useful. We now put the test class and test case in
>>> the compute test timeout messages, but it's also not very useful to
>>> fingerprint every individual permutation of test class/case that we can
>>> hit a timeout in.
>>>
>>> [1] https://bugs.launchpad.net/nova/+bug/1320617
>>>
>>
>> This change to devstack should help [1].
>>
>> It would be good if we actually used the default timeouts we have
>> configured in Tempest rather than hard-coding them in devstack based on
>> the latest state of the gate at the time.
>>
>> [1] https://review.openstack.org/#/c/94221/
>>
>
> I have a proof of concept up for Tempest with adjusted timeouts based on
> CPU idle values here:
>
> https://review.openstack.org/#/c/94245/

The problem is this makes an assumption that Tempest is on the host
where the services are. We actually need to get away from that
assumption.

If there is something in ceilometer that would make sense to poll, that
might be an option. But psutil definitely can't be a thing we use here.

	-Sean

--
Sean Dague
http://dague.net
Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
On 5/19/2014 11:33 AM, Matt Riedemann wrote:
> On 5/19/2014 10:53 AM, Matt Riedemann wrote:
>> I was looking through this timeout bug [1] this morning and am able
>> to correlate that around the time of the image snapshot timeout,
>> ceilometer was really hammering CPU on the host. There are already
>> threads on ceilometer performance and how that needs to be improved
>> for Tempest runs so I don't want to get into that here.
>>
>> What I'm thinking about is if there is a way to be smarter about how
>> we do timeouts in the tests, rather than just rely on globally
>> configured hard-coded timeouts which are bound to fail
>> intermittently in dynamic environments like this.
>>
>> I'm thinking something along the lines of keeping track of CPU stats
>> on intervals in our waiter loops, then when we reach our configured
>> timeout, calculate the average CPU load/idle and if it falls below
>> some threshold, we cut the timeout in half and redo the timeout loop
>> - and we continue that until our timeout reaches some level that no
>> longer makes sense, like once it drops less than a minute for
>> example.
>>
>> Are there other ideas here? My main concern is the number of random
>> timeout failures we see in the tests and then people are trying to
>> fingerprint them with elastic-recheck but the queries are so generic
>> they are not really useful. We now put the test class and test case
>> in the compute test timeout messages, but it's also not very useful
>> to fingerprint every individual permutation of test class/case that
>> we can hit a timeout in.
>>
>> [1] https://bugs.launchpad.net/nova/+bug/1320617
>
> This change to devstack should help [1].
>
> It would be good if we actually used the default timeouts we have
> configured in Tempest rather than hard-coding them in devstack based
> on the latest state of the gate at the time.
>
> [1] https://review.openstack.org/#/c/94221/

I have a proof of concept up for Tempest with adjusted timeouts based
on CPU idle values here:

https://review.openstack.org/#/c/94245/

--

Thanks,

Matt Riedemann
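For readers following the thread, the waiter-loop idea described above can be sketched as below. This is not the code from the linked review; every name in it (wait_with_adaptive_timeout, WaitTimeout, sample_cpu_idle and the parameters) is hypothetical. CPU idle is read through a caller-supplied callable so that the sketch does not assume Tempest runs on the service host, and the clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still false after the final window."""


def wait_with_adaptive_timeout(is_done, sample_cpu_idle, timeout,
                               interval=1.0, idle_threshold=10.0,
                               min_timeout=60.0,
                               clock=time.monotonic, sleep=time.sleep):
    """Poll is_done(); grant halved extra windows while CPU is starved.

    Returns the number of extra windows that were granted.
    """
    window = timeout
    extensions = 0
    while True:
        samples = []
        deadline = clock() + window
        while clock() < deadline:
            if is_done():
                return extensions
            # Record CPU idle (percent) once per polling interval.
            samples.append(sample_cpu_idle())
            sleep(interval)
        # Window expired: was the environment starved while we waited?
        avg_idle = sum(samples) / len(samples) if samples else 100.0
        next_window = window / 2
        if avg_idle >= idle_threshold or next_window < min_timeout:
            raise WaitTimeout(
                "gave up after %d extra window(s); average CPU idle "
                "was %.1f%%" % (extensions, avg_idle))
        window = next_window
        extensions += 1
```

With a 200 second timeout and a 60 second floor, a starved host would get one extra 100 second window before the loop gives up, since the next halving would drop to 50 seconds.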
Re: [openstack-dev] [qa] Smarter timeouts in Tempest?
On 5/19/2014 10:53 AM, Matt Riedemann wrote:
> I was looking through this timeout bug [1] this morning and am able to
> correlate that around the time of the image snapshot timeout,
> ceilometer was really hammering CPU on the host. There are already
> threads on ceilometer performance and how that needs to be improved
> for Tempest runs so I don't want to get into that here.
>
> What I'm thinking about is if there is a way to be smarter about how
> we do timeouts in the tests, rather than just rely on globally
> configured hard-coded timeouts which are bound to fail intermittently
> in dynamic environments like this.
>
> I'm thinking something along the lines of keeping track of CPU stats
> on intervals in our waiter loops, then when we reach our configured
> timeout, calculate the average CPU load/idle and if it falls below
> some threshold, we cut the timeout in half and redo the timeout loop -
> and we continue that until our timeout reaches some level that no
> longer makes sense, like once it drops less than a minute for example.
>
> Are there other ideas here? My main concern is the number of random
> timeout failures we see in the tests and then people are trying to
> fingerprint them with elastic-recheck but the queries are so generic
> they are not really useful. We now put the test class and test case in
> the compute test timeout messages, but it's also not very useful to
> fingerprint every individual permutation of test class/case that we
> can hit a timeout in.
>
> [1] https://bugs.launchpad.net/nova/+bug/1320617

This change to devstack should help [1].

It would be good if we actually used the default timeouts we have
configured in Tempest rather than hard-coding them in devstack based on
the latest state of the gate at the time.

[1] https://review.openstack.org/#/c/94221/

--

Thanks,

Matt Riedemann