Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-21 Thread Michael Still
On Thu, Nov 21, 2013 at 9:10 AM, Michael Still mi...@stillhq.com wrote:
 On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:

 How do we avoid this in the future? Step one is reviewers that are
 approving changes (or reverifying them) should keep an eye on the gate
 queue.

 Talking on the -infra IRC channel just now, it has become clear to me
 that we need to stop approving _any_ change for now until we have the
 gate fixed. All we're doing at the moment is rechecking over and over
 because the gate is too unreliable to actually pass changes. This is
 making debugging the gate significantly harder.

 Could cores please refrain from approving code until the gate issues
 are resolved?

I am pleased to say that people much smarter than me seem to have now
resolved the gate issues. It is now safe to approve code once again.

Expect a long merge queue as the backlog clears, so perhaps start by
approving patches which were approved before we downed tools?

Cheers,
Michael

-- 
Rackspace Australia

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-21 Thread James E. Blair
Matt Riedemann mrie...@linux.vnet.ibm.com writes:

 People get heads-down in their own projects and what they are working
 on and it's hard to keep up with what's going on in the infra channel
 (or nova channel for that matter), so sending out a recap that
 everyone can see in the mailing list is helpful to reset where things
 are at and to focus what may be various isolated investigations (as we saw
 happen this week).

Further on that point, Joe and I and others have been brainstorming
about how to prevent this situation and improve things when it does
happen.  To that end, I'd like to propose we adopt some process around
gate-blocking bugs:

1) The QA team should have the ability to triage bugs in _all_ OpenStack
projects, specifically so that they may set gate-blocking bugs to
critical priority.

2) If there isn't an immediately obvious assignee for the bug, send an
email to the -dev list announcing it and asking for someone to take or
be assigned to the bug.

I think the expectation should be that the bug triage teams or PTLs
should help get someone assigned to the bug in a reasonable time (say,
24 hours, or ideally much less).

3) If things get really bad, as they have recently, we send a mail to
the list asking core devs to stop approving patches that don't address
gate-blocking bugs.

I don't think any of this is revolutionary -- we have more or less done
these things already in this situation, but we usually take a while to
get there.  I think setting expectations around this and standardizing
how we proceed will make us better able to handle it.

Separately, we will be following up with information on some changes that
we hope will reduce the likelihood of nondeterministic bugs creeping in
in the first place.

-Jim



[openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Clark Boylan
Joe Gordon has been doing great work tracking test failures and how
often they affect us. Since the Havana release the failure rate has
increased dramatically, negatively affecting the gate and forcing it to
run in a near-worst-case scenario. That is, changes are being tested in
parallel, but the head of the queue is more often than not running into a
failed job, forcing all changes behind it to be retested, and so on.
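
To make the retest cascade concrete, here is a minimal Python sketch of a speculative gate queue. It is a toy model under stated assumptions, not Zuul's actual implementation: every job fails independently with a fixed probability (the `failure_rate` parameter is hypothetical), a failing change is reverified and rejoins the back of the queue, and every change behind a failure has to be retested. The queue depth of 130 is taken from the figure reported in this thread.

```python
import random


def simulate_gate(num_changes, failure_rate, seed=0):
    """Return (test_runs, rounds) needed to merge every change.

    Toy model, not Zuul itself: each round, every queued change is tested
    in parallel on top of the changes ahead of it. Changes merge from the
    head of the queue until the first failure; results behind that failure
    are thrown away, so those changes are retested in the next round.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    rng = random.Random(seed)  # seeded so a given run is repeatable
    queue = list(range(num_changes))
    test_runs = 0
    rounds = 0
    while queue:
        rounds += 1
        test_runs += len(queue)  # one job per queued change
        outcomes = [rng.random() >= failure_rate for _ in queue]
        if all(outcomes):
            queue = []  # everything in the queue merges
            continue
        first_fail = outcomes.index(False)
        # Assumption: the failed change is reverified and rejoins at the back.
        queue = queue[first_fail + 1:] + [queue[first_fail]]
    return test_runs, rounds


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.2):
        runs, rounds = simulate_gate(130, rate)
        print(f"failure_rate={rate}: {runs} test runs over {rounds} rounds")
```

With a failure rate of zero the whole queue merges in a single round. Once jobs start failing, the number of test runs exceeds the number of changes, because each failure discards the results of everything queued behind it.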

This led to a gate queue 130 deep, with the head of the queue 18 hours
behind its approval. We have identified fixes for some of the worst
current bugs and, in order to get them in, have restarted Zuul, effectively
cancelling the gate queue, and have queued these changes up at the front
of the queue. Once these changes are in and we are happy with the bug
fixing results, we will requeue the changes that were in the queue when it
got cancelled.

How do we avoid this in the future? Step one is reviewers that are
approving changes (or reverifying them) should keep an eye on the gate
queue. If it is struggling, adding more changes to that queue probably
won't help. Instead we should focus on identifying the bugs, submitting
changes to elastic-recheck to track these bugs, and working towards fixing
the bugs. Everyone is affected by persistent gate failures; we need to
work together to fix them.

Thank you for your patience,

Clark



Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Matt Riedemann



On Wednesday, November 20, 2013 2:44:52 PM, Clark Boylan wrote:

Joe Gordon has been doing great work tracking test failures and how
often they affect us. Since the Havana release the failure rate has
increased dramatically, negatively affecting the gate and forcing it to
run in a near-worst-case scenario. That is, changes are being tested in
parallel, but the head of the queue is more often than not running into a
failed job, forcing all changes behind it to be retested, and so on.

This led to a gate queue 130 deep, with the head of the queue 18 hours
behind its approval. We have identified fixes for some of the worst
current bugs and, in order to get them in, have restarted Zuul, effectively
cancelling the gate queue, and have queued these changes up at the front
of the queue. Once these changes are in and we are happy with the bug
fixing results, we will requeue the changes that were in the queue when it
got cancelled.

How do we avoid this in the future? Step one is reviewers that are
approving changes (or reverifying them) should keep an eye on the gate
queue. If it is struggling, adding more changes to that queue probably
won't help. Instead we should focus on identifying the bugs, submitting
changes to elastic-recheck to track these bugs, and working towards fixing
the bugs. Everyone is affected by persistent gate failures; we need to
work together to fix them.

Thank you for your patience,

Clark




Let me also say that I think it's really helpful that Joe has been 
sending out recaps to the mailing list about the top offenders so 
people can help pitch in on investigating and fixing those (like we saw 
with the Neutron team's response to Joe's recent post about the top 
gate failures).


People get heads-down in their own projects and what they are working 
on and it's hard to keep up with what's going on in the infra channel 
(or nova channel for that matter), so sending out a recap that everyone 
can see in the mailing list is helpful to reset where things are at and
to focus what may be various isolated investigations (as we saw happen this
week).


--

Thanks,

Matt Riedemann




Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Michael Still
On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:

 How do we avoid this in the future? Step one is reviewers that are
 approving changes (or reverifying them) should keep an eye on the gate
 queue.

Talking on the -infra IRC channel just now, it has become clear to me
that we need to stop approving _any_ change for now until we have the
gate fixed. All we're doing at the moment is rechecking over and over
because the gate is too unreliable to actually pass changes. This is
making debugging the gate significantly harder.

Could cores please refrain from approving code until the gate issues
are resolved?

Thanks,
Michael

-- 
Rackspace Australia



Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Anita Kuno
On 11/20/2013 05:10 PM, Michael Still wrote:
 On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:
 
 How do we avoid this in the future? Step one is reviewers that are
 approving changes (or reverifying them) should keep an eye on the gate
 queue.
 
 Talking on the -infra IRC channel just now, it has become clear to me
 that we need to stop approving _any_ change for now until we have the
 gate fixed. All we're doing at the moment is rechecking over and over
 because the gate is too unreliable to actually pass changes. This is
 making debugging the gate significantly harder.
 
 Could cores please refrain from approving code until the gate issues
 are resolved?
 
 Thanks,
 Michael
 
We are talking in -neutron and Mark McClain sent out an email to all
cores expressing this objective. He is also -2'ing any patch in check that
is not directly related to a gate-blocking bug fix. Neutron is on
hold until we get the all clear from -infra.

Thanks Michael,
Anita.



Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Davanum Srinivas
Anita, Joe,

I need help with adding debug message in:
https://review.openstack.org/#/c/56316/

to track down:
Bug: https://bugs.launchpad.net/bugs/1251784 = message:Connection to
neutron failed: Maximum attempts reached AND
filename:logs/screen-n-cpu.txt Title: nova+neutron scheduling error:
Connection to neutron failed: Maximum attempts reached
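
The bug line above pairs a Launchpad bug with what looks like a Lucene-style log query. As a rough illustration only, the Python sketch below applies a simplified version of that query to in-memory log records. The record layout, the sample lines, and the helper names are hypothetical; this does not reproduce how elastic-recheck itself evaluates queries.

```python
# Simplified stand-in for the query quoted above:
#   message:Connection to neutron failed: Maximum attempts reached
#   AND filename:logs/screen-n-cpu.txt
QUERY = {
    "message": "Connection to neutron failed: Maximum attempts reached",
    "filename": "logs/screen-n-cpu.txt",
}


def matches_query(record, query=QUERY):
    """True if the message contains the phrase AND the filename matches."""
    return (
        query["message"] in record.get("message", "")
        and record.get("filename") == query["filename"]
    )


def count_hits(records, query=QUERY):
    """Count the records that satisfy both conditions of the query."""
    return sum(1 for record in records if matches_query(record, query))


# Hypothetical sample records, invented purely for illustration.
SAMPLE_RECORDS = [
    {"filename": "logs/screen-n-cpu.txt",
     "message": "ERROR Connection to neutron failed: Maximum attempts reached"},
    {"filename": "logs/screen-n-cpu.txt",
     "message": "INFO instance spawned"},
    {"filename": "logs/screen-q-svc.txt",
     "message": "Connection to neutron failed: Maximum attempts reached"},
]

if __name__ == "__main__":
    print(count_hits(SAMPLE_RECORDS))
```

Only the first sample record counts as a hit: the second has the right file but the wrong message, and the third has the right message but comes from a different file.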

-- dims



On Wed, Nov 20, 2013 at 5:43 PM, Anita Kuno ante...@anteaya.info wrote:
 On 11/20/2013 05:10 PM, Michael Still wrote:
 On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:

 How do we avoid this in the future? Step one is reviewers that are
 approving changes (or reverifying them) should keep an eye on the gate
 queue.

 Talking on the -infra IRC channel just now, it has become clear to me
 that we need to stop approving _any_ change for now until we have the
 gate fixed. All we're doing at the moment is rechecking over and over
 because the gate is too unreliable to actually pass changes. This is
 making debugging the gate significantly harder.

 Could cores please refrain from approving code until the gate issues
 are resolved?

 Thanks,
 Michael

 We are talking in -neutron and Mark McClain sent out an email to all
 cores expressing this objective. He is also -2'ing any patch in check that
 is not directly related to a gate-blocking bug fix. Neutron is on
 hold until we get the all clear from -infra.

 Thanks Michael,
 Anita.




-- 
Davanum Srinivas :: http://davanum.wordpress.com



Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Anita Kuno
On 11/20/2013 06:58 PM, Davanum Srinivas wrote:
 Anita, Joe,
 
 I need help with adding debug message in:
 https://review.openstack.org/#/c/56316/
 
 to track down:
 Bug: https://bugs.launchpad.net/bugs/1251784 = message:Connection to
 neutron failed: Maximum attempts reached AND
 filename:logs/screen-n-cpu.txt Title: nova+neutron scheduling error:
 Connection to neutron failed: Maximum attempts reached

It has been decided on the bug report that 1251784 is a duplicate of
https://bugs.launchpad.net/nova/+bug/1251920, which has patch 57357
in the gate.

I have no problem reviewing 56316 though.

Thanks dims,
Anita.
 
 
 
 




Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Joe Gordon
Dims, we think https://review.openstack.org/#/c/57509/ will fix
1251784 (https://launchpad.net/bugs/1251784).





Re: [openstack-dev] The recent gate performance and how it affects you

2013-11-20 Thread Davanum Srinivas
Nice, Joe!

thanks,
-- dims

On Wed, Nov 20, 2013 at 7:15 PM, Joe Gordon joe.gord...@gmail.com wrote:
 Dims, we think https://review.openstack.org/#/c/57509/ will fix 1251784.






-- 
Davanum Srinivas :: http://davanum.wordpress.com
