Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-18 Thread Salvatore Orlando
Thanks Mark.
As usual I had to fetch your message from the spam folder!

Anyway, I received a sensible request to avoid running neutron tests for
advanced services (load balancing, firewall, vpn) in the integrated gate.
Therefore the patches [1] and [2] will no longer run service plugins in
the 'standard' neutron job. Instead, they introduce a new "extended"
neutron job - exactly the same as the neutron full job we have now -
which will run only on the neutron gate.

This is good because neutron services and other openstack projects such as
nova or keystone are pretty much orthogonal, and this will avoid causing
failures there should any of these services become unstable. On the other
hand, it is worth pointing out that if a change in the oslo libraries were
to break the service plugins, we would no longer be able to detect it, as
the oslo libraries use the integrated gate.

Some services, such as load balancing and firewall, also run smoketests.
Given the current job structure, this means they have already been
executed for months in the integrated gate. They will also keep being
executed, as the postgresql smoke job will keep running in place of the
full one until bug [3] is fixed.

Obviously if you disagree with this approach, please speak up. And note
that the patches [1] and [2] are WIPs at the moment. I'm aware that they
don't work ;)

Salvatore

[1] https://review.openstack.org/#/c/114933/
[2] https://review.openstack.org/#/c/114932/
[3] https://bugs.launchpad.net/nova/+bug/1305892

On 16 August 2014 01:13, Mark McClain  wrote:

>
>  On Aug 15, 2014, at 6:20 PM, Salvatore Orlando 
> wrote:
>
>  The neutron full job is finally voting, and the first patch [1] has
> already passed it in gate checks!
> I've collected a few data points before it was switched to voting, and we
> should probably expect a failure rate around 4%. This is not bad, but not
> great either, and everybody's contribution will be appreciated in
> reporting and assessing the nature of gate failures, which, needless to
> say, are mostly races.
>
>  Note: we've also added the postgresql version of the same job, but that
> is not voting yet as we never executed it before.
>
>  Salvatore
>
>  [1] https://review.openstack.org/#/c/105694/
>
>
>  Thanks to Salvatore for driving this effort and to everyone who
> contributed patches and reviews.  It is exciting to see it enabled.
>
>  mark
>
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-15 Thread Mark McClain

On Aug 15, 2014, at 6:20 PM, Salvatore Orlando <sorla...@nicira.com> wrote:

The neutron full job is finally voting, and the first patch [1] has already 
passed it in gate checks!
I've collected a few data points before it was switched to voting, and we 
should probably expect a failure rate around 4%. This is not bad, but not
great either, and everybody's contribution will be appreciated in reporting
and assessing the nature of gate failures, which, needless to say, are
mostly races.

Note: we've also added the postgresql version of the same job, but that is not 
voting yet as we never executed it before.

Salvatore

[1] https://review.openstack.org/#/c/105694/

Thanks to Salvatore for driving this effort and to everyone who contributed
patches and reviews.  It is exciting to see it enabled.

mark


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-15 Thread Salvatore Orlando
The neutron full job is finally voting, and the first patch [1] has already
passed it in gate checks!
I've collected a few data points before it was switched to voting, and we
should probably expect a failure rate around 4%. This is not bad, but not
great either, and everybody's contribution will be appreciated in
reporting and assessing the nature of gate failures, which, needless to
say, are mostly races.

Note: we've also added the postgresql version of the same job, but that is
not voting yet as we never executed it before.

Salvatore

[1] https://review.openstack.org/#/c/105694/


On 12 August 2014 20:14, Salvatore Orlando  wrote:

> And just when the patch was only missing a +A, another bug slipped in!
> The nova patch to fix it is available at [1]
>
> And while we're there, it won't be a bad idea to also push the neutron
> full job, as non-voting, into the integrated gate [2]
>
> Thanks in advance,
> (especially to the nova and infra cores who'll review these patches!)
> Salvatore
>
> [1] https://review.openstack.org/#/c/113554/
> [2] https://review.openstack.org/#/c/113562/
>
>
> On 7 August 2014 17:51, Salvatore Orlando  wrote:
>
>> Thanks Armando,
>>
>> The bug you pointed out was the reason for the failures we've
>> been seeing.
>> The follow-up patch merged and I've removed the wip status from the patch
>> for the full job [1]
>>
>> Salvatore
>>
>> [1] https://review.openstack.org/#/c/88289/
>>
>>
>> On 7 August 2014 16:50, Armando M.  wrote:
>>
>>> Hi Salvatore,
>>>
>>> I did notice the issue and I flagged this bug report:
>>>
>>> https://bugs.launchpad.net/nova/+bug/1352141
>>>
>>> I'll follow up.
>>>
>>> Cheers,
>>> Armando
>>>
>>>
>>> On 7 August 2014 01:34, Salvatore Orlando  wrote:
>>>
 I had to put the patch back on WIP because yesterday a bug causing a
 100% failure rate slipped in.
 It should be an easy fix, and I'm already working on it.
 Situations like this, exemplified by [1], are a bit frustrating for all
 the people working on improving neutron quality.
 Now, if you allow me a little rant: as Neutron is receiving a lot of
 attention for all the ongoing discussion regarding this group policy stuff,
 would it be possible for us to receive a bit of attention to ensure both
 the full job and the grenade one are switched to voting before the juno-3
 review crunch?

 We've already had the attention of the QA team; it would probably be good
 if we could get the attention of the infra core team to ensure:
 1) the jobs are also deemed by them stable enough to be switched to
 voting
 2) the relevant patches for openstack-infra/config are reviewed

 Regards,
 Salvatore

 [1]
 http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==


 On 23 July 2014 14:59, Matthew Treinish  wrote:

> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
> > Here I am again bothering you with the state of the full job for
> Neutron.
> >
> > The patch for fixing an issue in nova's server external events
> extension
> > merged yesterday [1]
> > We do not yet have enough data points to make a reliable assessment,
> but out
> > of 37 runs since the patch merged, we had "only" 5 failures, which
> puts
> > the failure rate at about 13%
> >
> > This is ugly compared with the current failure rate of the smoketest
> (3%).
> > However, I think it is good enough to start making the full job
> voting at
> > least for neutron patches.
> > Once we are able to bring the failure rate down to around 5%,
> we can
> > then enable the job everywhere.
>
> I think that sounds like a good plan. I'm also curious how the failure
> rates
> compare to the other non-neutron jobs; that might be a useful
> comparison too
> for deciding when to flip the switch everywhere.
>
> >
> > As much as I hate asymmetric gating, I think this is a good
> compromise for
> > ensuring that developers working on other projects aren't badly affected by
> the
> > higher failure rate in the neutron full job.
>
> So we discussed this during the project meeting a couple of weeks ago
> [3] and
> there was a general agreement that doing it asymmetrically at first
> would be
> better. Everyone should be wary of the potential harms with doing it
> asymmetrically and I think priority will be given to fixing issues
> that block
> the neutron gate should they arise.
>
> > I will therefore resume wor

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-12 Thread Salvatore Orlando
And just when the patch was only missing a +A, another bug slipped in!
The nova patch to fix it is available at [1]

And while we're there, it won't be a bad idea to also push the neutron full
job, as non-voting, into the integrated gate [2]

Thanks in advance,
(especially to the nova and infra cores who'll review these patches!)
Salvatore

[1] https://review.openstack.org/#/c/113554/
[2] https://review.openstack.org/#/c/113562/


On 7 August 2014 17:51, Salvatore Orlando  wrote:

> Thanks Armando,
>
> The bug you pointed out was the reason for the failures we've
> been seeing.
> The follow-up patch merged and I've removed the wip status from the patch
> for the full job [1]
>
> Salvatore
>
> [1] https://review.openstack.org/#/c/88289/
>
>
> On 7 August 2014 16:50, Armando M.  wrote:
>
>> Hi Salvatore,
>>
>> I did notice the issue and I flagged this bug report:
>>
>> https://bugs.launchpad.net/nova/+bug/1352141
>>
>> I'll follow up.
>>
>> Cheers,
>> Armando
>>
>>
>> On 7 August 2014 01:34, Salvatore Orlando  wrote:
>>
>>> I had to put the patch back on WIP because yesterday a bug causing a
>>> 100% failure rate slipped in.
>>> It should be an easy fix, and I'm already working on it.
>>> Situations like this, exemplified by [1], are a bit frustrating for all
>>> the people working on improving neutron quality.
>>> Now, if you allow me a little rant: as Neutron is receiving a lot of
>>> attention for all the ongoing discussion regarding this group policy stuff,
>>> would it be possible for us to receive a bit of attention to ensure both
>>> the full job and the grenade one are switched to voting before the juno-3
>>> review crunch?
>>>
>>> We've already had the attention of the QA team; it would probably be good
>>> if we could get the attention of the infra core team to ensure:
>>> 1) the jobs are also deemed by them stable enough to be switched to
>>> voting
>>> 2) the relevant patches for openstack-infra/config are reviewed
>>>
>>> Regards,
>>> Salvatore
>>>
>>> [1]
>>> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>>>
>>>
>>> On 23 July 2014 14:59, Matthew Treinish  wrote:
>>>
 On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
 > Here I am again bothering you with the state of the full job for
 Neutron.
 >
 > The patch for fixing an issue in nova's server external events
 extension
 > merged yesterday [1]
 > We do not yet have enough data points to make a reliable assessment,
 but out
 > of 37 runs since the patch merged, we had "only" 5 failures, which
 puts
 > the failure rate at about 13%
 >
 > This is ugly compared with the current failure rate of the smoketest
 (3%).
 > However, I think it is good enough to start making the full job
 voting at
 > least for neutron patches.
 > Once we are able to bring the failure rate down to around 5%,
 we can
 > then enable the job everywhere.

 I think that sounds like a good plan. I'm also curious how the failure
 rates
 compare to the other non-neutron jobs; that might be a useful
 comparison too
 for deciding when to flip the switch everywhere.

 >
 > As much as I hate asymmetric gating, I think this is a good
 compromise for
 > ensuring that developers working on other projects aren't badly affected by
 the
 > higher failure rate in the neutron full job.

 So we discussed this during the project meeting a couple of weeks ago
 [3] and
 there was a general agreement that doing it asymmetrically at first
 would be
 better. Everyone should be wary of the potential harms with doing it
 asymmetrically and I think priority will be given to fixing issues that
 block
 the neutron gate should they arise.

 > I will therefore resume work on [2] and remove the WIP status as soon
 as I
 > can confirm a failure rate below 15% with more data points.
 >

 Thanks for keeping on top of this Salvatore. It'll be good to finally
 be at
 least partially gating with a parallel job.

 -Matt Treinish

 >
 > [1] https://review.openstack.org/#/c/103865/
 > [2] https://review.openstack.org/#/c/88289/
 [3]
 http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28

 >
 >
 > On 10 July 2014 11:49, Salvatore Orlando  wrote:
 >
 > >
 > >
 > >
 > > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
 > >
 > >> -BEGIN PGP SIGNED MESSAGE-
 > >> Hash: SHA512
 > >>

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-07 Thread Salvatore Orlando
Thanks Armando,

The bug you pointed out was the reason for the failures we've
been seeing.
The follow-up patch merged and I've removed the wip status from the patch
for the full job [1]

Salvatore

[1] https://review.openstack.org/#/c/88289/


On 7 August 2014 16:50, Armando M.  wrote:

> Hi Salvatore,
>
> I did notice the issue and I flagged this bug report:
>
> https://bugs.launchpad.net/nova/+bug/1352141
>
> I'll follow up.
>
> Cheers,
> Armando
>
>
> On 7 August 2014 01:34, Salvatore Orlando  wrote:
>
>> I had to put the patch back on WIP because yesterday a bug causing a 100%
>> failure rate slipped in.
>> It should be an easy fix, and I'm already working on it.
>> Situations like this, exemplified by [1], are a bit frustrating for all
>> the people working on improving neutron quality.
>> Now, if you allow me a little rant: as Neutron is receiving a lot of
>> attention for all the ongoing discussion regarding this group policy stuff,
>> would it be possible for us to receive a bit of attention to ensure both
>> the full job and the grenade one are switched to voting before the juno-3
>> review crunch?
>>
>> We've already had the attention of the QA team; it would probably be good
>> we could get the attention of the infra core team to ensure:
>> 1) the jobs are also deemed by them stable enough to be switched to voting
>> 2) the relevant patches for openstack-infra/config are reviewed
>>
>> Regards,
>> Salvatore
>>
>> [1]
>> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>>
>>
>> On 23 July 2014 14:59, Matthew Treinish  wrote:
>>
>>> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
>>> > Here I am again bothering you with the state of the full job for
>>> Neutron.
>>> >
>>> > The patch for fixing an issue in nova's server external events
>>> extension
>>> > merged yesterday [1]
>>> > We do not yet have enough data points to make a reliable assessment,
>>> but out
>>> > of 37 runs since the patch merged, we had "only" 5 failures, which
>>> puts
>>> > the failure rate at about 13%
>>> >
>>> > This is ugly compared with the current failure rate of the smoketest
>>> (3%).
>>> > However, I think it is good enough to start making the full job voting
>>> at
>>> > least for neutron patches.
>>> > Once we are able to bring the failure rate down to around 5%,
>>> we can
>>> > then enable the job everywhere.
>>>
>>> I think that sounds like a good plan. I'm also curious how the failure
>>> rates
>>> compare to the other non-neutron jobs; that might be a useful comparison
>>> too
>>> for deciding when to flip the switch everywhere.
>>>
>>> >
>>> > As much as I hate asymmetric gating, I think this is a good compromise
>>> for
>>> > ensuring that developers working on other projects aren't badly affected by the
>>> > higher failure rate in the neutron full job.
>>>
>>> So we discussed this during the project meeting a couple of weeks ago
>>> [3] and
>>> there was a general agreement that doing it asymmetrically at first
>>> would be
>>> better. Everyone should be wary of the potential harms with doing it
>>> asymmetrically and I think priority will be given to fixing issues that
>>> block
>>> the neutron gate should they arise.
>>>
>>> > I will therefore resume work on [2] and remove the WIP status as soon
>>> as I
>>> > can confirm a failure rate below 15% with more data points.
>>> >
>>>
>>> Thanks for keeping on top of this Salvatore. It'll be good to finally be
>>> at
>>> least partially gating with a parallel job.
>>>
>>> -Matt Treinish
>>>
>>> >
>>> > [1] https://review.openstack.org/#/c/103865/
>>> > [2] https://review.openstack.org/#/c/88289/
>>> [3]
>>> http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28
>>>
>>> >
>>> >
>>> > On 10 July 2014 11:49, Salvatore Orlando  wrote:
>>> >
>>> > >
>>> > >
>>> > >
>>> > > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
>>> > >
>>> > >> -BEGIN PGP SIGNED MESSAGE-
>>> > >> Hash: SHA512
>>> > >>
>>> > >> On 10/07/14 11:07, Salvatore Orlando wrote:
>>> > >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>>> > >> > it seems there has been an improvement in the failure rate, which
>>> > >> > seems to have dropped to 25% from over 40%. Still, since the patch
>>> > >> > merged there have been 11 failures already in the full job out of
>>> > >> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
>>> > >> > problems in the patches being tested - 1 had the same root cause
>>> as
>>> > >> > bug 1329564. Indeed the related job started before the patch
>>> merged
>>> > >> > but finished 

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-07 Thread Armando M.
Hi Salvatore,

I did notice the issue and I flagged this bug report:

https://bugs.launchpad.net/nova/+bug/1352141

I'll follow up.

Cheers,
Armando


On 7 August 2014 01:34, Salvatore Orlando  wrote:

> I had to put the patch back on WIP because yesterday a bug causing a 100%
> failure rate slipped in.
> It should be an easy fix, and I'm already working on it.
> Situations like this, exemplified by [1], are a bit frustrating for all the
> people working on improving neutron quality.
> Now, if you allow me a little rant: as Neutron is receiving a lot of
> attention for all the ongoing discussion regarding this group policy stuff,
> would it be possible for us to receive a bit of attention to ensure both
> the full job and the grenade one are switched to voting before the juno-3
> review crunch?
>
> We've already had the attention of the QA team; it would probably be good if
> we could get the attention of the infra core team to ensure:
> 1) the jobs are also deemed by them stable enough to be switched to voting
> 2) the relevant patches for openstack-infra/config are reviewed
>
> Regards,
> Salvatore
>
> [1]
> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>
>
> On 23 July 2014 14:59, Matthew Treinish  wrote:
>
>> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
>> > Here I am again bothering you with the state of the full job for
>> Neutron.
>> >
>> > The patch for fixing an issue in nova's server external events extension
>> > merged yesterday [1]
>> > We do not yet have enough data points to make a reliable assessment,
>> but out
>> > of 37 runs since the patch merged, we had "only" 5 failures, which puts
>> > the failure rate at about 13%
>> >
>> > This is ugly compared with the current failure rate of the smoketest
>> (3%).
>> > However, I think it is good enough to start making the full job voting
>> at
>> > least for neutron patches.
>> > Once we are able to bring the failure rate down to around 5%, we
>> can
>> > then enable the job everywhere.
>>
>> I think that sounds like a good plan. I'm also curious how the failure
>> rates
>> compare to the other non-neutron jobs; that might be a useful comparison
>> too
>> for deciding when to flip the switch everywhere.
>>
>> >
>> > As much as I hate asymmetric gating, I think this is a good compromise
>> for
>> > ensuring that developers working on other projects aren't badly affected by the
>> > higher failure rate in the neutron full job.
>>
>> So we discussed this during the project meeting a couple of weeks ago [3]
>> and
>> there was a general agreement that doing it asymmetrically at first would
>> be
>> better. Everyone should be wary of the potential harms with doing it
>> asymmetrically and I think priority will be given to fixing issues that
>> block
>> the neutron gate should they arise.
>>
>> > I will therefore resume work on [2] and remove the WIP status as soon
>> as I
>> > can confirm a failure rate below 15% with more data points.
>> >
>>
>> Thanks for keeping on top of this Salvatore. It'll be good to finally be
>> at
>> least partially gating with a parallel job.
>>
>> -Matt Treinish
>>
>> >
>> > [1] https://review.openstack.org/#/c/103865/
>> > [2] https://review.openstack.org/#/c/88289/
>> [3]
>> http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28
>>
>> >
>> >
>> > On 10 July 2014 11:49, Salvatore Orlando  wrote:
>> >
>> > >
>> > >
>> > >
>> > > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
>> > >
>> > >> -BEGIN PGP SIGNED MESSAGE-
>> > >> Hash: SHA512
>> > >>
>> > >> On 10/07/14 11:07, Salvatore Orlando wrote:
>> > >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>> > >> > it seems there has been an improvement in the failure rate, which
>> > >> > seems to have dropped to 25% from over 40%. Still, since the patch
>> > >> > merged there have been 11 failures already in the full job out of
>> > >> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
>> > >> > problems in the patches being tested - 1 had the same root cause as
>> > >> > bug 1329564. Indeed the related job started before the patch merged
>> > >> > but finished after. So this failure "doesn't count". - 1 was for an
>> > >> > issue introduced about a week ago which is actually causing a lot of
>> > >> > failures in the full job [3]. Fix should be easy for it; however
>> > >> > given the nature of the test we might even skip it while it's
>> > >> > fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
>> > >> > going on on gerrit regarding the most suitable approach. - 3 were
>> > >> > fo

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-07 Thread Salvatore Orlando
Patch [1] will solve the observed issue. It has already passed Jenkins
tests.
As it is a nova patch, the neutron full job did not run for it.
To check the neutron full job outcome with [1], please check [2].

Salvatore

[1] https://review.openstack.org/#/c/112541/
[2] https://review.openstack.org/#/c/98441/


On 7 August 2014 10:34, Salvatore Orlando  wrote:

> I had to put the patch back on WIP because yesterday a bug causing a 100%
> failure rate slipped in.
> It should be an easy fix, and I'm already working on it.
> Situations like this, exemplified by [1], are a bit frustrating for all the
> people working on improving neutron quality.
> Now, if you allow me a little rant: as Neutron is receiving a lot of
> attention for all the ongoing discussion regarding this group policy stuff,
> would it be possible for us to receive a bit of attention to ensure both
> the full job and the grenade one are switched to voting before the juno-3
> review crunch?
>
> We've already had the attention of the QA team; it would probably be good if
> we could get the attention of the infra core team to ensure:
> 1) the jobs are also deemed by them stable enough to be switched to voting
> 2) the relevant patches for openstack-infra/config are reviewed
>
> Regards,
> Salvatore
>
> [1]
> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>
>
> On 23 July 2014 14:59, Matthew Treinish  wrote:
>
>> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
>> > Here I am again bothering you with the state of the full job for
>> Neutron.
>> >
>> > The patch for fixing an issue in nova's server external events extension
>> > merged yesterday [1]
>> > We do not yet have enough data points to make a reliable assessment,
>> but out
>> > of 37 runs since the patch merged, we had "only" 5 failures, which puts
>> > the failure rate at about 13%
>> >
>> > This is ugly compared with the current failure rate of the smoketest
>> (3%).
>> > However, I think it is good enough to start making the full job voting
>> at
>> > least for neutron patches.
>> > Once we are able to bring the failure rate down to around 5%, we
>> can
>> > then enable the job everywhere.
>>
>> I think that sounds like a good plan. I'm also curious how the failure
>> rates
>> compare to the other non-neutron jobs; that might be a useful comparison
>> too
>> for deciding when to flip the switch everywhere.
>>
>> >
>> > As much as I hate asymmetric gating, I think this is a good compromise
>> for
>> > ensuring that developers working on other projects aren't badly affected by the
>> > higher failure rate in the neutron full job.
>>
>> So we discussed this during the project meeting a couple of weeks ago [3]
>> and
>> there was a general agreement that doing it asymmetrically at first would
>> be
>> better. Everyone should be wary of the potential harms with doing it
>> asymmetrically and I think priority will be given to fixing issues that
>> block
>> the neutron gate should they arise.
>>
>> > I will therefore resume work on [2] and remove the WIP status as soon
>> as I
>> > can confirm a failure rate below 15% with more data points.
>> >
>>
>> Thanks for keeping on top of this Salvatore. It'll be good to finally be
>> at
>> least partially gating with a parallel job.
>>
>> -Matt Treinish
>>
>> >
>> > [1] https://review.openstack.org/#/c/103865/
>> > [2] https://review.openstack.org/#/c/88289/
>> [3]
>> http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28
>>
>> >
>> >
>> > On 10 July 2014 11:49, Salvatore Orlando  wrote:
>> >
>> > >
>> > >
>> > >
>> > > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
>> > >
>> > >> -BEGIN PGP SIGNED MESSAGE-
>> > >> Hash: SHA512
>> > >>
>> > >> On 10/07/14 11:07, Salvatore Orlando wrote:
>> > >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>> > >> > it seems there has been an improvement in the failure rate, which
>> > >> > seems to have dropped to 25% from over 40%. Still, since the patch
>> > >> > merged there have been 11 failures already in the full job out of
>> > >> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
>> > >> > problems in the patches being tested - 1 had the same root cause as
>> > >> > bug 1329564. Indeed the related job started before the patch merged
>> > >> > but finished after. So this failure "doesn't count". - 1 was for an
>> > >> > issue introduced about a week ago which is actually causing a lot of
>> > >> > failures in the full job [3]. Fix should be easy for it; however
>> > >> > given the nature of the test we might even skip it while it's
>>

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-08-07 Thread Salvatore Orlando
I had to put the patch back on WIP because yesterday a bug causing a 100%
failure rate slipped in.
It should be an easy fix, and I'm already working on it.
Situations like this, exemplified by [1], are a bit frustrating for all the
people working on improving neutron quality.
Now, if you allow me a little rant: as Neutron is receiving a lot of
attention for all the ongoing discussion regarding this group policy stuff,
would it be possible for us to receive a bit of attention to ensure both
the full job and the grenade one are switched to voting before the juno-3
review crunch?

We've already had the attention of the QA team; it would probably be good if
we could get the attention of the infra core team to ensure:
1) the jobs are also deemed by them stable enough to be switched to voting
2) the relevant patches for openstack-infra/config are reviewed

Regards,
Salvatore

[1]
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwie3UnbWVzc2FnZSc6IHUnRmxvYXRpbmcgaXAgcG9vbCBub3QgZm91bmQuJywgdSdjb2RlJzogNDAwfVwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzQwMDExMDIwNywibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==


On 23 July 2014 14:59, Matthew Treinish  wrote:

> On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
> > Here I am again bothering you with the state of the full job for Neutron.
> >
> > The patch for fixing an issue in nova's server external events extension
> > merged yesterday [1]
> > We do not yet have enough data points to make a reliable assessment, but
> out
> > of 37 runs since the patch merged, we had "only" 5 failures, which puts
> > the failure rate at about 13%
> >
> > This is ugly compared with the current failure rate of the smoketest
> (3%).
> > However, I think it is good enough to start making the full job voting at
> > least for neutron patches.
> > Once we are able to bring the failure rate down to around 5%, we
> can
> > then enable the job everywhere.
>
> I think that sounds like a good plan. I'm also curious how the failure
> rates
> compare to the other non-neutron jobs; that might be a useful comparison
> too
> for deciding when to flip the switch everywhere.
>
> >
> > As much as I hate asymmetric gating, I think this is a good compromise
> for
> > ensuring that developers working on other projects aren't badly affected by the
> > higher failure rate in the neutron full job.
>
> So we discussed this during the project meeting a couple of weeks ago [3]
> and
> there was a general agreement that doing it asymmetrically at first would
> be
> better. Everyone should be wary of the potential harms with doing it
> asymmetrically and I think priority will be given to fixing issues that
> block
> the neutron gate should they arise.
>
> > I will therefore resume work on [2] and remove the WIP status as soon as
> I
> > can confirm a failure rate below 15% with more data points.
> >
>
> Thanks for keeping on top of this Salvatore. It'll be good to finally be at
> least partially gating with a parallel job.
>
> -Matt Treinish
>
> >
> > [1] https://review.openstack.org/#/c/103865/
> > [2] https://review.openstack.org/#/c/88289/
> [3]
> http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28
>
> >
> >
> > On 10 July 2014 11:49, Salvatore Orlando  wrote:
> >
> > >
> > >
> > >
> > > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
> > >
> > >> -BEGIN PGP SIGNED MESSAGE-
> > >> Hash: SHA512
> > >>
> > >> On 10/07/14 11:07, Salvatore Orlando wrote:
> > >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
> > >> > it seems there has been an improvement in the failure rate, which
> > >> > seems to have dropped to 25% from over 40%. Still, since the patch
> > >> > merged there have been 11 failures already in the full job out of
> > >> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
> > >> > problems in the patches being tested - 1 had the same root cause as
> > >> > bug 1329564. Indeed the related job started before the patch merged
> > >> > but finished after. So this failure "doesn't count". - 1 was for an
> > >> > issue introduced about a week ago which is actually causing a lot of
> > >> > failures in the full job [3]. Fix should be easy for it; however
> > >> > given the nature of the test we might even skip it while it's
> > >> > fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
> > >> > going on on gerrit regarding the most suitable approach. - 3 were
> > >> > for lock wait timeout errors. Several people in the community are
> > >> > already working on them. I hope this will raise the profile of this
> > >> > issue (maybe some might think it's just a corner case as it rarely
> > >> > causes failures in smoke jobs, whereas the truth is that the error
> > >> > occurs but it does not cause job failur

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-23 Thread Matthew Treinish
On Wed, Jul 23, 2014 at 02:40:02PM +0200, Salvatore Orlando wrote:
> Here I am again bothering you with the state of the full job for Neutron.
> 
> The patch for fixing an issue in nova's server external events extension
> merged yesterday [1]
> We do not yet have enough data points to make a reliable assessment, but out
> of 37 runs since the patch merged, we had "only" 5 failures, which puts
> the failure rate at about 13%
> 
> This is ugly compared with the current failure rate of the smoketest (3%).
> However, I think it is good enough to start making the full job voting at
> least for neutron patches.
> Once we are able to bring the failure rate down to around 5%, we can
> then enable the job everywhere.

I think that sounds like a good plan. I'm also curious how the failure rates
compare to the other non-neutron jobs; that might be a useful comparison too
for deciding when to flip the switch everywhere.

> 
> As much as I hate asymmetric gating, I think this is a good compromise for
> ensuring that developers working on other projects aren't badly affected by the
> higher failure rate in the neutron full job.

So we discussed this during the project meeting a couple of weeks ago [3] and
there was a general agreement that doing it asymmetrically at first would be
better. Everyone should be wary of the potential harms with doing it
asymmetrically and I think priority will be given to fixing issues that block
the neutron gate should they arise.

> I will therefore resume work on [2] and remove the WIP status as soon as I
> can confirm a failure rate below 15% with more data points.
> 

Thanks for keeping on top of this Salvatore. It'll be good to finally be at
least partially gating with a parallel job.

-Matt Treinish

> 
> [1] https://review.openstack.org/#/c/103865/
> [2] https://review.openstack.org/#/c/88289/
[3] 
http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.log.html#l-28

> 
> 
> On 10 July 2014 11:49, Salvatore Orlando  wrote:
> 
> >
> >
> >
> > On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
> >
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA512
> >>
> >> On 10/07/14 11:07, Salvatore Orlando wrote:
> >> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
> >> > it seems there has been an improvement in the failure rate, which
> >> > seems to have dropped to 25% from over 40%. Still, since the patch
> >> > merged there have been 11 failures already in the full job out of
> >> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
> >> > problems in the patches being tested - 1 had the same root cause as
> >> > bug 1329564. Indeed the related job started before the patch merged
> >> > but finished after. So this failure "doesn't count". - 1 was for an
> >> > issue introduced about a week ago which is actually causing a lot of
> >> > failures in the full job [3]. Fix should be easy for it; however
> >> > given the nature of the test we might even skip it while it's
> >> > fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
> >> > going on on gerrit regarding the most suitable approach. - 3 were
> >> > for lock wait timeout errors. Several people in the community are
> >> > already working on them. I hope this will raise the profile of this
> >> > issue (maybe some might think it's just a corner case as it rarely
> >> > causes failures in smoke jobs, whereas the truth is that the error
> >> > occurs but it does not cause job failures because the job isn't
> >> > parallel).
> >>
> >> Can you give directions on where to find those lock timeout failures?
> >> I'd like to check logs to see whether they have the same nature as
> >> most other failures (e.g. improper yield under transaction).
> >>
> >
> > This logstash query will give you all occurrences of lock wait timeout
> > issues: message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try
> > restarting transaction')" AND tags:"screen-q-svc.txt"
> >
> > The fact that in most cases the build succeeds anyway is misleading,
> > because in many cases these errors occur in RPC handling between agents and
> > servers, and therefore are not detected by tempest. The neutron full job,
> > which is parallel, increases their occurrence because of parallelism - and
> > since API requests too occur concurrently, it also yields a higher tempest
> > build failure rate.
> >
> > However, as I argued in the past, the "lock wait timeout" error should
> > always be treated as an error condition.
> > Eugene has already classified lock wait timeout failures and filed bugs
> > for them a few weeks ago.
> >
> >
> >> >
> >> > Summarizing, I think time is not yet ripe to enable the full job;
> >> > once bug 1333654 is fixed, we should go for it. AFAIK there is no
> >> > way to work around it in gate tests other than disabling
> >> > nova/neutron event reporting, which I guess we don't want to do.
> >> >
> >> > Salvatore
> >> >
> >> > [1] https://review.openstack.org/#/c/105239 [2]
> >> >
> >> http

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-23 Thread Salvatore Orlando
Here I am again bothering you with the state of the full job for Neutron.

The patch for fixing an issue in nova's server external events extension
merged yesterday [1]
We do not yet have enough data points to make a reliable assessment, but out
of 37 runs since the patch merged, we had "only" 5 failures, which puts
the failure rate at about 13%.
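
(For illustration: with only 37 runs the uncertainty on that 13% is still
large. A rough way to put error bars on it is a binomial confidence
interval; the sketch below - a standard Wilson score interval, nothing to
do with any of the patches - is one way to compute it:

    import math

    def wilson_interval(failures, runs, z=1.96):
        """Approximate 95% confidence interval for a failure rate."""
        p = failures / float(runs)
        denom = 1 + z ** 2 / runs
        centre = p + z ** 2 / (2 * runs)
        margin = z * math.sqrt(p * (1 - p) / runs + z ** 2 / (4 * runs ** 2))
        return (centre - margin) / denom, (centre + margin) / denom

    # 5 failures out of 37 runs gives roughly (0.06, 0.28), i.e. the true
    # rate could still plausibly be anywhere between ~6% and ~28%.
    print(wilson_interval(5, 37))

hence the caveat about data points.)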

This is ugly compared with the current failure rate of the smoketest (3%).
However, I think it is good enough to start making the full job voting at
least for neutron patches.
Once we are able to bring the failure rate down to around 5%, we can
then enable the job everywhere.

As much as I hate asymmetric gating, I think this is a good compromise for
ensuring that developers working on other projects aren't badly affected by the
higher failure rate in the neutron full job.
I will therefore resume work on [2] and remove the WIP status as soon as I
can confirm a failure rate below 15% with more data points.

Salvatore

[1] https://review.openstack.org/#/c/103865/
[2] https://review.openstack.org/#/c/88289/


On 10 July 2014 11:49, Salvatore Orlando  wrote:

>
>
>
> On 10 July 2014 11:27, Ihar Hrachyshka  wrote:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA512
>>
>> On 10/07/14 11:07, Salvatore Orlando wrote:
>> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
>> > it seems there has been an improvement in the failure rate, which
>> > seems to have dropped to 25% from over 40%. Still, since the patch
>> > merged there have been 11 failures already in the full job out of
>> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
>> > problems in the patches being tested - 1 had the same root cause as
>> > bug 1329564. Indeed the related job started before the patch merged
>> > but finished after. So this failure "doesn't count". - 1 was for an
>> > issue introduced about a week ago which is actually causing a lot of
>> > failures in the full job [3]. Fix should be easy for it; however
>> > given the nature of the test we might even skip it while it's
>> > fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
>> > going on on gerrit regarding the most suitable approach. - 3 were
>> > for lock wait timeout errors. Several people in the community are
>> > already working on them. I hope this will raise the profile of this
>> > issue (maybe some might think it's just a corner case as it rarely
>> > causes failures in smoke jobs, whereas the truth is that the error
>> > occurs but it does not cause job failures because the job isn't
>> > parallel).
>>
>> Can you give directions on where to find those lock timeout failures?
>> I'd like to check logs to see whether they have the same nature as
>> most other failures (e.g. improper yield under transaction).
>>
>
> This logstash query will give you all occurrences of lock wait timeout
> issues: message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try
> restarting transaction')" AND tags:"screen-q-svc.txt"
>
> The fact that in most cases the build succeeds anyway is misleading,
> because in many cases these errors occur in RPC handling between agents and
> servers, and therefore are not detected by tempest. The neutron full job,
> which is parallel, increases their occurrence because of parallelism - and
> since API requests too occur concurrently, it also yields a higher tempest
> build failure rate.
>
> However, as I argued in the past, the "lock wait timeout" error should
> always be treated as an error condition.
> Eugene has already classified lock wait timeout failures and filed bugs
> for them a few weeks ago.
>
>
>> >
>> > Summarizing, I think time is not yet ripe to enable the full job;
>> > once bug 1333654 is fixed, we should go for it. AFAIK there is no
>> > way to work around it in gate tests other than disabling
>> > nova/neutron event reporting, which I guess we don't want to do.
>> >
>> > Salvatore
>> >
>> > [1] https://review.openstack.org/#/c/105239 [2]
>> >
>> http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>> >
>> >
>> [3]
>> >
>> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxk

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-10 Thread Salvatore Orlando
On 10 July 2014 11:27, Ihar Hrachyshka  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> On 10/07/14 11:07, Salvatore Orlando wrote:
> > The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
> > it seems there has been an improvement in the failure rate, which
> > seems to have dropped to 25% from over 40%. Still, since the patch
> > merged there have been 11 failures already in the full job out of
> > 42 jobs executed in total. Of these 11 failures: - 3 were due to
> > problems in the patches being tested - 1 had the same root cause as
> > bug 1329564. Indeed the related job started before the patch merged
> > but finished after. So this failure "doesn't count". - 1 was for an
> > issue introduced about a week ago which is actually causing a lot of
> > failures in the full job [3]. Fix should be easy for it; however
> > given the nature of the test we might even skip it while it's
> > fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
> > going on on gerrit regarding the most suitable approach. - 3 were
> > for lock wait timeout errors. Several people in the community are
> > already working on them. I hope this will raise the profile of this
> > issue (maybe some might think it's just a corner case as it rarely
> > causes failures in smoke jobs, whereas the truth is that the error
> > occurs but it does not cause job failures because the job isn't
> > parallel).
>
> Can you give directions on where to find those lock timeout failures?
> I'd like to check logs to see whether they have the same nature as
> most other failures (e.g. improper yield under transaction).
>

This logstash query will give you all occurrences of lock wait timeout
issues: message:"(OperationalError) (1205, 'Lock wait timeout exceeded; try
restarting transaction')" AND tags:"screen-q-svc.txt"
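
In case it helps: if you have direct query access to the Elasticsearch
backend behind the logstash UI (an assumption on my part - it may only be
exposed through the web interface), the same query can be run
programmatically, e.g. with the elasticsearch Python client:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # assumed endpoint
    query = ('message:"(OperationalError) (1205, \'Lock wait timeout '
             'exceeded; try restarting transaction\')" '
             'AND tags:"screen-q-svc.txt"')
    result = es.search(index="logstash-*",
                       body={"query": {"query_string": {"query": query}}})
    print(result["hits"]["total"])  # hit count (shape varies by ES version)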

The fact that in most cases the build succeeds anyway is misleading,
because in many cases these errors occur in RPC handling between agents and
servers, and therefore are not detected by tempest. The neutron full job,
which is parallel, increases their occurrence because of parallelism - and
since API requests too occur concurrently, it also yields a higher tempest
build failure rate.

However, as I argued in the past, the "lock wait timeout" error should
always be treated as an error condition.
Eugene has already classified lock wait timeout failures and filed bugs for
them a few weeks ago.
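
To make the failure mode concrete: the usual suspect is a greenthread
yield while a DB transaction still holds row locks. Below is a minimal,
made-up sketch of the "improper yield under transaction" pattern (the Port
model and notify_agents helper are hypothetical stand-ins, not actual
neutron code):

    import eventlet
    eventlet.monkey_patch()

    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Port(Base):  # toy stand-in for a neutron model
        __tablename__ = "ports"
        id = Column(String(36), primary_key=True)
        status = Column(String(16))

    engine = create_engine("mysql://user:secret@127.0.0.1/neutron")
    Session = sessionmaker(bind=engine)

    def notify_agents(port):
        eventlet.sleep(1)  # stands in for a blocking RPC/HTTP call

    def update_port_status(port_id, status):
        session = Session()
        with session.begin():
            # SELECT ... FOR UPDATE takes a row lock held until commit.
            port = (session.query(Port).filter_by(id=port_id)
                    .with_for_update().one())
            port.status = status
            # BUG: this call yields to other greenthreads while the row
            # lock is still held; a concurrent transaction updating the
            # same row waits until innodb_lock_wait_timeout expires and
            # fails with (OperationalError) (1205, 'Lock wait timeout
            # exceeded; try restarting transaction').
            notify_agents(port)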


> >
> > Summarizing, I think time is not yet ripe to enable the full job;
> > once bug 1333654 is fixed, we should go for it. AFAIK there is no
> > way to work around it in gate tests other than disabling
> > nova/neutron event reporting, which I guess we don't want to do.
> >
> > Salvatore
> >
> > [1] https://review.openstack.org/#/c/105239 [2]
> >
> http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
> >
> >
> [3]
> >
> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
> >
> >
> [4] https://bugs.launchpad.net/nova/+bug/1333654
> >
> >
> > On 2 July 2014 17:57, Salvatore Orlando 
> > wrote:
> >
> >> Hi again,
> >>
> >> From my analysis most of the failures affecting the neutron full
> >> job are because of bugs [1] and [2] for which patches [3] and [4]
> >> have been proposed. Both patches address the nova side of the
> >> neutron/nova notification system for vif plugging. It is worth
> >> noting that these bugs did manifest only in the neutron full job
> >> not because of its "full" nature, but because of its "parallel"
> >> nature.
> >>
> >> Openstackers with a good memory will probably remember we fixed
> >> the parallel job back in January, before the massive "kernel bug"
> >> gate outage [5]. However, since parallel testing was
> >> unfortunately never enabled on the smoke job we run on the gate,
> >> we allowed new bugs to slip in. For this reason I would recommend
> >> the following: - once patches [3] and [4] have been reviewed and
> >> merged, re-assess neutron full job failure rate over a period of
> >> 48 hours (72 if the period includes at least 24 hours within a
> >> weekend - GMT time) - turn neutron full job to voting if the

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-10 Thread Ihar Hrachyshka
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On 10/07/14 11:07, Salvatore Orlando wrote:
> The patch for bug 1329564 [1] merged about 11 hours ago. From [2]
> it seems there has been an improvement in the failure rate, which
> seems to have dropped to 25% from over 40%. Still, since the patch
> merged there have been 11 failures already in the full job out of
> 42 jobs executed in total. Of these 11 failures: - 3 were due to
> problems in the patches being tested - 1 had the same root cause as
> bug 1329564. Indeed the related job started before the patch merged
> but finished after. So this failure "doesn't count". - 1 was for an
> issue introduced about a week ago which is actually causing a lot of
> failures in the full job [3]. Fix should be easy for it; however 
> given the nature of the test we might even skip it while it's
> fixed. - 3 were for bug 1333654 [4]; for this bug discussion is
> going on on gerrit regarding the most suitable approach. - 3 were
> for lock wait timeout errors. Several people in the community are 
> already working on them. I hope this will raise the profile of this
> issue (maybe some might think it's just a corner case as it rarely
> causes failures in smoke jobs, whereas the truth is that the error
> occurs but it does not cause job failures because the job isn't
> parallel).

Can you give directions on where to find those lock timeout failures?
I'd like to check logs to see whether they have the same nature as
most other failures (e.g. improper yield under transaction).

> 
> Summarizing, I think time is not yet ripe to enable the full job;
> once bug 1333654 is fixed, we should go for it. AFAIK there is no
> way to work around it in gate tests other than disabling
> nova/neutron event reporting, which I guess we don't want to do.
> 
> Salvatore
> 
> [1] https://review.openstack.org/#/c/105239 [2] 
> http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
>
> 
[3]
> http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
>
> 
[4] https://bugs.launchpad.net/nova/+bug/1333654
> 
> 
> On 2 July 2014 17:57, Salvatore Orlando 
> wrote:
> 
>> Hi again,
>> 
>> From my analysis most of the failures affecting the neutron full
>> job are because of bugs [1] and [2] for which patches [3] and [4]
>> have been proposed. Both patches address the nova side of the
>> neutron/nova notification system for vif plugging. It is worth
>> noting that these bugs did manifest only in the neutron full job
>> not because of its "full" nature, but because of its "parallel"
>> nature.
>> 
>> Openstackers with a good memory will probably remember we fixed
>> the parallel job back in January, before the massive "kernel bug"
>> gate outage [5]. However, since parallel testing was
>> unfortunately never enabled on the smoke job we run on the gate,
>> we allowed new bugs to slip in. For this reason I would recommend
>> the following: - once patches [3] and [4] have been reviewed and
>> merged, re-assess neutron full job failure rate over a period of
>> 48 hours (72 if the period includes at least 24 hours within a
>> weekend - GMT time) - turn neutron full job to voting if the
>> previous step reveals a failure rate below 10%, otherwise go back
>> to the drawing board
>> 
>> In my opinion whether the full job should be enabled in an
>> asymmetric fashion or not should be a decision for the QA and
>> Infra teams. Once the full job is made voting there will
>> inevitably be a higher failure rate. An asymmetric gate will not
>> cause backlogs on other projects, so less angry people, but as
>> Matt said it will still allow other bugs to slip in. Personally
>> I'm ok either way.
>> 
>> The reason why we're expecting a higher failure rate on the full
>> job is that we have already observed that some "known" bugs, such
>> as the various lock timeout issues affecting neutron, tend to show up
>> with a higher frequency on the full job because of its parallel
>> nature.
>> 
>> Salvatore
>> 
>> [1] https://launchpad.net/bugs/1329546 [2]
>> https://launchpad.net/bugs/1333654 [3]
>> https://review.openstack.org/#/c/99182/ [4]
>> https://review.open

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-10 Thread Salvatore Orlando
The patch for bug 1329564 [1] merged about 11 hours ago.
From [2] it seems there has been an improvement in the failure rate, which
seems to have dropped to 25% from over 40%.
Still, since the patch merged there have been 11 failures already in the
full job out of 42 jobs executed in total.
Of these 11 failures:
- 3 were due to problems in the patches being tested
- 1 had the same root cause as bug 1329564. Indeed the related job started
before the patch merged but finished after. So this failure "doesn't count".
- 1 was for an issue introduced about a week ago which is actually causing a
lot of failures in the full job [3]. Fix should be easy for it; however
given the nature of the test we might even skip it while it's fixed.
- 3 were for bug 1333654 [4]; for this bug discussion is going on on gerrit
regarding the most suitable approach.
- 3 were for lock wait timeout errors. Several people in the community are
already working on them. I hope this will raise the profile of this issue
(maybe some might think it's just a corner case as it rarely causes
failures in smoke jobs, whereas the truth is that the error occurs but it
does not cause job failures because the job isn't parallel).

Summarizing, I think time is not yet ripe to enable the full job; once bug
1333654 is fixed, we should go for it. AFAIK there is no way to work
around it in gate tests other than disabling nova/neutron event reporting,
which I guess we don't want to do.

Salvatore

[1] https://review.openstack.org/#/c/105239
[2]
http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9zdGF0dXM6RkFJTFVSRSBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBGQUlMVVJFXCIgQU5EIGJ1aWxkX25hbWU6XCJjaGVjay10ZW1wZXN0LWRzdm0tbmV1dHJvbi1mdWxsXCIgQU5EIGJ1aWxkX2JyYW5jaDpcIm1hc3RlclwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiIxNzI4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTBUMDA6MjQ6NTcrMDA6MDAiLCJ0byI6IjIwMTQtMDctMTBUMDg6MjQ6NTMrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNDk4MjU2MjM2OCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==
[3]
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUEJhZFJlcXVlc3Q6IFVucmVjb2duaXplZCBhdHRyaWJ1dGUocykgJ21lbWJlciwgdmlwLCBwb29sLCBoZWFsdGhfbW9uaXRvcidcIiBBTkQgdGFnczpcInNjcmVlbi1xLXN2Yy50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiY3VzdG9tIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTA3LTAxVDA4OjU5OjAxKzAwOjAwIiwidG8iOiIyMDE0LTA3LTEwVDA4OjU5OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjE0MDQ5ODI3OTc3ODAsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=
[4] https://bugs.launchpad.net/nova/+bug/1333654


On 2 July 2014 17:57, Salvatore Orlando  wrote:

> Hi again,
>
> From my analysis, most of the failures affecting the neutron full job are
> because of bugs [1] and [2], for which patches [3] and [4] have been proposed.
> Both patches address the nova side of the neutron/nova notification system
> for vif plugging.
> It is worth noting that these bugs did manifest only in the neutron full
> job, not because of its "full" nature, but because of its "parallel" nature.
>
> Openstackers with a good memory will probably remember we fixed the
> parallel job back in January, before the massive "kernel bug" gate outage
> [5]. However, since parallel testing was unfortunately never enabled on the
> smoke job we run on the gate, we allowed new bugs to slip in.
> For this reason I would recommend the following:
> - once patches [3] and [4] have been reviewed and merged, re-assess the neutron
> full job failure rate over a period of 48 hours (72 if the period includes
> at least 24 hours within a weekend - GMT time)
> - turn the neutron full job to voting if the previous step reveals a failure
> rate below 10%, otherwise go back to the drawing board
>
> In my opinion whether the full job should be enabled in an asymmetric
> fashion or not should be a decision for the QA and Infra teams. Once the
> full job is made voting there will inevitably be a higher failure rate. An
> asymmetric gate will not cause backlogs on other projects, so fewer angry
> people, but as Matt said it will still allow other bugs to slip in.
> Personally I'm ok either way.
>
> The reason why we're expecting a higher failure rate on the full job is
> that we have already observed that some "known" bugs, such as the various
> lock timeout issues affecting neutron, tend to show with a higher frequency
> on the full job because of its parallel nature.
>
> Salvatore
>
> [1] https://launchpad.net/bugs/1329546
> [2] https://launchpad.net/bugs/1333654
> [3] https://review.openstack.org/#/c/99182/
> [4] https://review.openstack.org/#/c/103865/
> [5] https://bugs.launchpad.net/neutron/+bug/1273386
>
>
>
>
> On 25 June 2014 23:38, Matthew Treinish  wrote:
>
>> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
>> > There is a long-standing patch [1] for enabling the neutron full job.
>> > A little before the Icehouse release date, when we first pushed this, the
>> > neutron full job had a failure rate of less than 10%. However, since then
>> > time has gone by, and perceived failure rates were higher, we ran this
>> > analysis again.

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-07-02 Thread Salvatore Orlando
Hi again,

From my analysis, most of the failures affecting the neutron full job are
because of bugs [1] and [2], for which patches [3] and [4] have been proposed.
Both patches address the nova side of the neutron/nova notification system
for vif plugging.
It is worth noting that these bugs did manifest only in the neutron full
job, not because of its "full" nature, but because of its "parallel" nature.

Openstackers with a good memory will probably remember we fixed the
parallel job back in January, before the massive "kernel bug" gate outage
[5]. However, since parallel testing was unfortunately never enabled on the
smoke job we run on the gate, we allowed new bugs to slip in.
For this reason I would recommend the following (the sketch after this list
restates the voting rule in code):
- once patches [3] and [4] have been reviewed and merged, re-assess the
neutron full job failure rate over a period of 48 hours (72 if the period
includes at least 24 hours within a weekend - GMT time)
- turn the neutron full job to voting if the previous step reveals a failure
rate below 10%, otherwise go back to the drawing board
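
A minimal restatement of that rule in code (the helper name is mine, and the
failure/run counts would have to come from logstash):

    def ready_to_vote(failures, runs, threshold=0.10):
        """Enable voting only if the failure rate observed over the
        re-assessment window (48-72 hours) stays below the threshold."""
        return runs > 0 and float(failures) / runs < threshold

    # e.g. 24 failures out of 184 runs is ~13%, so not ready yet:
    print(ready_to_vote(24, 184))  # False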

In my opinion whether the full job should be enabled in an asymmetric
fashion or not should be a decision for the QA and Infra teams. Once the
full job is made voting there will inevitably be a higher failure rate. An
asymmetric gate will not cause backlogs on other projects, so fewer angry
people, but as Matt said it will still allow other bugs to slip in.
Personally I'm ok either way.

The reason why we're expecting a higher failure rate on the full job is
that we have already observed that some "known" bugs, such as the various
lock timeout issues affecting neutron, tend to show with a higher frequency
on the full job because of its parallel nature.

Salvatore

[1] https://launchpad.net/bugs/1329546
[2] https://launchpad.net/bugs/1333654
[3] https://review.openstack.org/#/c/99182/
[4] https://review.openstack.org/#/c/103865/
[5] https://bugs.launchpad.net/neutron/+bug/1273386




On 25 June 2014 23:38, Matthew Treinish  wrote:

> On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
> > There is a long-standing patch [1] for enabling the neutron full job.
> > A little before the Icehouse release date, when we first pushed this, the
> > neutron full job had a failure rate of less than 10%. However, since then
> > time has gone by, and perceived failure rates were higher, we ran this
> > analysis again.
>
> So I'm not exactly a fan of having the gates be asymmetrical. It's very easy
> for breaks that block the neutron gate to slip in if the job is not voting
> everywhere, especially because I think most people have been trained to
> ignore the full job after it's been non-voting for so long. Is there a
> particular reason we don't just switch everything all at once? I think
> having a little bit of friction everywhere during the migration is fine,
> especially if we do it well before a milestone (as opposed to the original
> parallel switch, which was right before H-3).
>
> >
> > Here are the findings in a nutshell.
> > 1) If we were to enable the job today we might expect about a 3-fold
> > increase in neutron job failures when compared with the smoke test. This is
> > unfortunately not acceptable and we therefore need to identify and fix the
> > issues causing the additional failure rate.
> > 2) However this also puts us in a position where if we wait until the
> > failure rate drops under a given threshold we might end up chasing a moving
> > target as new issues might be introduced at any time since the job is not
> > voting.
> > 3) When it comes to evaluating failure rates for a non-voting job, taking
> > the rough numbers does not mean anything, as that will take into account
> > patches 'in progress' which end up failing the tests because of problems
> > in the patches themselves.
> >
> > Well, that was pretty much a lot for a "nutshell"; however if you're not
> > yet bored to death please go on reading.
> >
> > The data in this post are a bit skewed because of a rise in neutron job
> > failures in the past 36 hours. However, this rise affects both the full and
> > the smoke job so it does not invalidate what we say here. The results shown
> > below are representative of the gate status 12 hours ago.
> >
> > - Neutron smoke job failure rates (all queues)
> >   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
> > - Neutron smoke job failure rates (gate queue only):
> >   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
> > - Neutron full job failure rate (check queue only as it's non voting):
> >   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
> >
> > Check/Gate Ratio between neutron smoke failures
> > 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
> >
> > Estimated job failure rate for neutron full job if it were to run in the
> > gate:
> > 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
> >
> > The numbers are therefore not terrible, but definitely not good enough;
> > looking at the last 7 days the full job will have a failure rate about 3
> > times higher than the smoke job.

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-06-25 Thread Matthew Treinish
On Tue, Jun 24, 2014 at 02:14:16PM +0200, Salvatore Orlando wrote:
> There is a long-standing patch [1] for enabling the neutron full job.
> A little before the Icehouse release date, when we first pushed this, the
> neutron full job had a failure rate of less than 10%. However, since then
> time has gone by, and perceived failure rates were higher, we ran this
> analysis again.

So I'm not exactly a fan of having the gates be asymmetrical. It's very easy
for breaks that block the neutron gate to slip in if the job is not voting
everywhere, especially because I think most people have been trained to ignore
the full job after it's been non-voting for so long. Is there a particular
reason we don't just switch everything all at once? I think having a little
bit of friction everywhere during the migration is fine, especially if we do
it well before a milestone (as opposed to the original parallel switch, which
was right before H-3).

> 
> Here are the findings in a nutshell.
> 1) If we were to enable the job today we might expect about a 3-fold
> increase in neutron job failures when compared with the smoke test. This is
> unfortunately not acceptable and we therefore need to identify and fix the
> issues causing the additional failure rate.
> 2) However this also puts us in a position where if we wait until the
> failure rate drops under a given threshold we might end up chasing a moving
> target as new issues might be introduced at any time since the job is not
> voting.
> 3) When it comes to evaluating failure rates for a non-voting job, taking
> the rough numbers does not mean anything, as that will take into account
> patches 'in progress' which end up failing the tests because of problems in
> the patches themselves.
> 
> Well, that was pretty much a lot for a "nutshell"; however if you're not
> yet bored to death please go on reading.
> 
> The data in this post are a bit skewed because of a rise in neutron job
> failures in the past 36 hours. However, this rise affects both the full and
> the smoke job so it does not invalidate what we say here. The results shown
> below are representative of the gate status 12 hours ago.
> 
> - Neutron smoke job failure rates (all queues)
>   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
> - Neutron smoke job failure rates (gate queue only):
>   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
> - Neutron full job failure rate (check queue only as it's non voting):
>   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
> 
> Check/Gate Ratio between neutron smoke failures
> 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
> 
> Estimated job failure rate for neutron full job if it were to run in the
> gate:
> 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
> 
> The numbers are therefore not terrible, but definitely not good enough;
> looking at the last 7 days the full job will have a failure rate about 3
> times higher than the smoke job.
> 
> We then took, as is usual for us when we do this kind of evaluation, a
> window with a reasonable number of failures (41 in our case), and analysed
> them in detail.
> 
> Of these 41 failures, 17 were excluded because of infra problems, patches
> 'in progress', or other transient failures; considering that over the same
> period of time 160 full job runs succeeded, this would leave us with 24
> failures out of 184 runs, and therefore a failure rate of 13.04%, which is
> not far from the estimate.
> 
> Let's now consider these 24 'real' failures:
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
> runs). These specific failures are being analyzed to see if a specific
> fingerprint can be found
> B) 2  (8.33% of failures, 1.08% of total full job runs) were for a failure
> in test load balancer basic, which is actually a test design issue and is
> already being addressed [2]
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
> while resizing a server, which has already been spotted and has a bug in
> progress [3]
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
> failure in test_server_address; however the actual root cause was being
> masked by [4]. A bug has been filed [5]; this is the most worrying one in
> my opinion as there are many cases where the fault happens but does not
> trigger a failure because of the way tempest tests are designed.
> E) 6 are because of our friend lock wait timeout. This was initially filed
> as [6] but since then we've closed it to file more detailed bug reports as
> the lock wait timeout can manifest in various places; Eugene is leading the
> effort on this problem with Kevin B.
> 
> 
> Summarizing, the only failure modes specific to the full job seem to be C &
> D. If we were able to fix those we should reasonably expect a failure rate
> of about 6.5%. That's still almost twice that of the smoke job, but I deem
> it acceptable for two reasons:
> 1- by voting, we will prevent new bugs affecting the full job from being
> introduced. It is worth reminding people that any bug affecting the full
> job is likely to affect production environments

Re: [openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-06-24 Thread Salvatore Orlando
Oops... I forgot to mention that, in agreement with sdague, we won't
enable this job before Thursday, June 26th, in order to give the trusty
update a few days to settle down.

Salvatore


On 24 June 2014 14:14, Salvatore Orlando  wrote:

> There is a long-standing patch [1] for enabling the neutron full job.
> A little before the Icehouse release date, when we first pushed this, the
> neutron full job had a failure rate of less than 10%. However, since then
> time has gone by, and perceived failure rates were higher, we ran this
> analysis again.
>
> Here are the findings in a nutshell.
> 1) If we were to enable the job today we might expect about a 3-fold
> increase in neutron job failures when compared with the smoke test. This is
> unfortunately not acceptable and we therefore need to identify and fix the
> issues causing the additional failure rate.
> 2) However this also puts us in a position where if we wait until the
> failure rate drops under a given threshold we might end up chasing a moving
> target as new issues might be introduced at any time since the job is not
> voting.
> 3) When it comes to evaluating failure rates for a non-voting job, taking
> the rough numbers does not mean anything, as that will take into account
> patches 'in progress' which end up failing the tests because of problems in
> the patches themselves.
>
> Well, that was pretty much a lot for a "nutshell"; however if you're not
> yet bored to death please go on reading.
>
> The data in this post are a bit skewed because of a rise in neutron job
> failures in the past 36 hours. However, this rise affects both the full and
> the smoke job so it does not invalidate what we say here. The results shown
> below are representative of the gate status 12 hours ago.
>
> - Neutron smoke job failure rates (all queues)
>   24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
> - Neutron smoke job failure rates (gate queue only):
>   24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
> - Neutron full job failure rate (check queue only as it's non voting):
>   24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%
>
> Check/Gate Ratio between neutron smoke failures
> 24 hours: 2.15 48 hours: 1.89 7 days: 2.53
>
> Estimated job failure rate for neutron full job if it were to run in the
> gate:
> 24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
>
> The numbers are therefore not terrible, but definitely not good enough;
> looking at the last 7 days the full job will have a failure rate about 3
> times higher than the smoke job.
>
> We then took, as is usual for us when we do this kind of evaluation, a
> window with a reasonable number of failures (41 in our case), and analysed
> them in detail.
>
> Of these 41 failures, 17 were excluded because of infra problems, patches
> 'in progress', or other transient failures; considering that over the same
> period of time 160 full job runs succeeded, this would leave us with 24
> failures out of 184 runs, and therefore a failure rate of 13.04%, which is
> not far from the estimate.
>
> Let's now consider these 24 'real' failures:
> A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
> runs). These specific failures are being analyzed to see if a specific
> fingerprint can be found
> B) 2  (8.33% of failures, 1.08% of total full job runs) were for a failure
> in test load balancer basic, which is actually a test design issue and is
> already being addressed [2]
> C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
> while resizing a server, which has already been spotted and has a bug in
> progress [3]
> D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
> failure in test_server_address; however the actual root cause was being
> masked by [4]. A bug has been filed [5]; this is the most worrying one in
> my opinion as there are many cases where the fault happens but does not
> trigger a failure because of the way tempest tests are designed.
> E) 6 are because of our friend lock wait timeout. This was initially filed
> as [6] but since then we've closed it to file more detailed bug reports as
> the lock wait timeout can manifest in various places; Eugene is leading the
> effort on this problem with Kevin B.
>
>
> Summarizing, the only failure modes specific to the full job seem to be C &
> D. If we were able to fix those we should reasonably expect a failure rate
> of about 6.5%. That's still almost twice that of the smoke job, but I deem
> it acceptable for two reasons:
> 1- by voting, we will prevent new bugs affecting the full job from being
> introduced. It is worth reminding people that any bug affecting the full
> job is likely to affect production environments
> 2- patches failing in the gate will spur neutron developers to quickly
> find a fix. Patches failing a non-voting job will cause some neutron core
> team members to write long and boring posts to the mailing list.
>
> Salvatore
>
>
>
>
> [1] https://review.openstack.org/#/c/88289/
> [2] https://review.openstack.org/#/c/98065/

[openstack-dev] [Neutron][QA] Enabling full neutron Job

2014-06-24 Thread Salvatore Orlando
There is a long-standing patch [1] for enabling the neutron full job.
A little before the Icehouse release date, when we first pushed this, the
neutron full job had a failure rate of less than 10%. However, since then
time has gone by, and perceived failure rates were higher, we ran this
analysis again.

Here are the findings in a nutshell.
1) If we were to enable the job today we might expect about a 3-fold
increase in neutron job failures when compared with the smoke test. This is
unfortunately not acceptable and we therefore need to identify and fix the
issues causing the additional failure rate.
2) However this also puts us in a position where if we wait until the
failure rate drops under a given threshold we might end up chasing a moving
target as new issues might be introduced at any time since the job is not
voting.
3) When it comes to evaluating failure rates for a non-voting job, taking
the rough numbers does not mean anything, as that will take into account
patches 'in progress' which end up failing the tests because of problems in
the patches themselves.

Well, that was pretty much a lot for a "nutshell"; however if you're not
yet bored to death please go on reading.

The data in this post are a bit skewed because of a rise in neutron job
failures in the past 36 hours. However, this rise affects both the full and
the smoke job so it does not invalidate what we say here. The results shown
below are representative of the gate status 12 hours ago.

- Neutron smoke job failure rates (all queues)
  24 hours: 22.4% 48 hours: 19.3% 7 days: 8.96%
- Neutron smoke job failure rates (gate queue only):
  24 hours: 10.41% 48 hours: 10.20% 7 days: 3.53%
- Neutron full job failure rate (check queue only as it's non voting):
  24 hours: 31.54% 48 hours: 28.87% 7 days: 25.73%

Check/Gate Ratio between neutron smoke failures
24 hours: 2.15 48 hours: 1.89 7 days: 2.53

Estimated job failure rate for neutron full job if it were to run in the
gate:
24 hours: 14.67% 48 hours: 15.27% 7 days: 10.16%
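
For clarity: the estimate scales the full job's check-queue failure rate by
the check/gate ratio observed for the smoke job, on the assumption that the
same ratio would apply to the full job. A quick sketch, which reproduces the
figures above up to rounding:

    # failure rates in percent, taken from the figures above
    smoke_all = {"24h": 22.4, "48h": 19.3, "7d": 8.96}      # all queues
    smoke_gate = {"24h": 10.41, "48h": 10.20, "7d": 3.53}   # gate queue only
    full_check = {"24h": 31.54, "48h": 28.87, "7d": 25.73}  # check queue only

    for window in ("24h", "48h", "7d"):
        ratio = smoke_all[window] / smoke_gate[window]   # check/gate ratio
        estimate = full_check[window] / ratio            # projected gate rate
        print("%s: ratio %.2f -> %.2f%%" % (window, ratio, estimate))
    # 24h: ratio 2.15 -> 14.66%; 48h: 1.89 -> 15.26%; 7d: 2.54 -> 10.14%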

The numbers are therefore not terrible, but definitely not good enough;
looking at the last 7 days the full job will have a failure rate about 3
times higher than the smoke job.

We then took, as is usual for us when we do this kind of evaluation, a
window with a reasonable number of failures (41 in our case), and analysed
them in detail.

Of these 41 failures, 17 were excluded because of infra problems, patches
'in progress', or other transient failures; considering that over the same
period of time 160 full job runs succeeded, this would leave us with 24
failures out of 184 runs, and therefore a failure rate of 13.04%, which is
not far from the estimate.
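
Spelling out that arithmetic, with variable names of my own choosing:

    total_failures = 41
    excluded = 17                # infra problems, WIP patches, transient issues
    successes = 160
    real_failures = total_failures - excluded      # 24
    relevant_runs = successes + real_failures      # 184
    print("%.2f%%" % (100.0 * real_failures / relevant_runs))  # 13.04%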

Let's now consider these 24 'real' failures:
A) 2 were for the SSH timeout (8.33% of failures, 1.08% of total full job
runs). These specific failures are being analyzed to see if a specific
fingerprint can be found
B) 2  (8.33% of failures, 1.08% of total full job runs) were for a failure
in test load balancer basic, which is actually a test design issue and is
already being addressed [2]
C) 7 (29.16% of failures, 3.81% of total full job runs) were for an issue
while resizing a server, which has already been spotted and has a bug in
progress [3]
D) 5 (20.83% of failures, 2.72% of total full job runs) manifested as a
failure in test_server_address; however the actual root cause was being
masked by [4]. A bug has been filed [5]; this is the most worrying one in
my opinion as there are many cases where the fault happens but does not
trigger a failure because of the way tempest tests are designed.
E) 6 are because of our friend lock wait timeout. This was initially filed
as [6] but since then we've closed it to file more detailed bug reports as
the lock wait timeout can manifest in various places; Eugene is leading the
effort on this problem with Kevin B.


Summarizing, the only failure modes specific to the full job seem to be C &
D. If we were able to fix those we should reasonably expect a failure rate
of about 6.5% (see the quick calculation after this list). That's still
almost twice that of the smoke job, but I deem it acceptable for two reasons:
1- by voting, we will prevent new bugs affecting the full job from being
introduced. It is worth reminding people that any bug affecting the full
job is likely to affect production environments
2- patches failing in the gate will spur neutron developers to quickly find
a fix. Patches failing a non-voting job will cause some neutron core team
members to write long and boring posts to the mailing list.
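
For the record, the quick calculation behind the "about 6.5%" figure,
presumably obtained by dropping the C and D failures from the 24 real ones:

    real_failures, runs = 24, 184
    fixable = 7 + 5        # categories C and D above
    print("%.2f%%" % (100.0 * (real_failures - fixable) / runs))  # 6.52%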

Salvatore




[1] https://review.openstack.org/#/c/88289/
[2] https://review.openstack.org/#/c/98065/
[3] https://bugs.launchpad.net/nova/+bug/1329546
[4] https://bugs.launchpad.net/tempest/+bug/1332414
[5] https://bugs.launchpad.net/nova/+bug/1333654
[6] https://bugs.launchpad.net/nova/+bug/1283522
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev