Re: [openstack-dev] The recent gate performance and how it affects you
On Thu, Nov 21, 2013 at 9:10 AM, Michael Still mi...@stillhq.com wrote:
> On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:
>> How do we avoid this in the future? Step one is reviewers that are approving changes (or reverifying them) should keep an eye on the gate queue.
>
> Talking on the -infra IRC channel just now, it has become clear to me that we need to stop approving _any_ change for now until we have the gate fixed. All we're doing at the moment is rechecking over and over because the gate is too unreliable to actually pass changes. This is making debugging the gate significantly harder.
>
> Could cores please refrain from approving code until the gate issues are resolved?

I am pleased to say that people much smarter than me seem to have now resolved the gate issues. It is now safe to approve code once again. Expect a long merge queue as the backlog clears, so perhaps start by approving patches which were approved before we downed tools?

Cheers,
Michael

--
Rackspace Australia

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] The recent gate performance and how it affects you
Matt Riedemann mrie...@linux.vnet.ibm.com writes:
> People get heads-down in their own projects and what they are working on and it's hard to keep up with what's going on in the infra channel (or nova channel for that matter), so sending out a recap that everyone can see in the mailing list is helpful to reset where things are at and focus the various, possibly isolated, investigations (as we saw happen this week).

Further on that point, Joe and I and others have been brainstorming about how to prevent this situation and improve things when it does happen. To that end, I'd like to propose we adopt some process around gate-blocking bugs:

1) The QA team should have the ability to triage bugs in _all_ OpenStack projects, specifically so that they may set gate-blocking bugs to critical priority.

2) If there isn't an immediately obvious assignee for the bug, send an email to the -dev list announcing it and asking for someone to take or be assigned to the bug. I think the expectation should be that the bug triage teams or PTLs should help get someone assigned to the bug in a reasonable time (say, 24 hours, or ideally much less).

3) If things get really bad, as they have recently, we send a mail to the list asking core devs to stop approving patches that don't address gate-blocking bugs.

I don't think any of this is revolutionary -- we have more or less done these things already in this situation, but we usually take a while to get there. I think setting expectations around this and standardizing how we proceed will make us better able to handle it.

Separately we will be following up with information on some changes that we hope will reduce the likelihood of nondeterministic bugs creeping in in the first place.

-Jim
[openstack-dev] The recent gate performance and how it affects you
Joe Gordon has been doing great work tracking test failures and how often they affect us. Post Havana release the failure rate has increased dramatically, negatively affecting the gate and forcing it to run in a near worst case scenario. That is, changes are being tested in parallel, but the head of the queue is more often than not running into a failed job, forcing all changes behind it to be retested, and so on. This led to a gate queue 130 deep with the head of the queue 18 hours behind its approval.

We have identified fixes for some of the worst current bugs and, in order to get them in, have restarted Zuul, effectively cancelling the gate queue, and have queued these changes up at the front of the queue. Once these changes are in and we are happy with the bug fixing results, we will requeue changes that were in the queue when it got cancelled.

How do we avoid this in the future? Step one is reviewers that are approving changes (or reverifying them) should keep an eye on the gate queue. If it is struggling, adding more changes to that queue probably won't help. Instead we should focus on identifying the bugs, submitting changes to elastic-recheck to track these bugs, and working towards fixing the bugs. Everyone is affected by persistent gate failures; we need to work together to fix them.

Thank you for your patience,
Clark
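To make the cost of a failing head of queue concrete, here is a rough Python sketch of the behaviour described above: changes are tested in parallel, and a failure forces everything behind it to be retested. This is a toy model, not Zuul's code. The function name `count_test_runs` and the fixed pass/fail flag per change are assumptions made for illustration; real failures are nondeterministic and a real queue keeps receiving new changes.

```python
def count_test_runs(passes):
    """Toy model of a gate queue that tests changes in parallel.

    passes[i] is True if change i's jobs pass and False if they fail
    (an illustrative simplification: real failures are often flaky).
    Returns (merged_changes, total_test_runs).
    """
    queue = list(range(len(passes)))
    merged = []
    total_runs = 0
    while queue:
        # Every queued change is tested in parallel on top of those ahead of it.
        total_runs += len(queue)
        first_failure = next(
            (pos for pos, change in enumerate(queue) if not passes[change]),
            None,
        )
        if first_failure is None:
            # Nothing failed, so the whole queue merges.
            merged.extend(queue)
            queue = []
        else:
            # Changes ahead of the failure merge; the failing change is
            # ejected and everything behind it has to be retested.
            merged.extend(queue[:first_failure])
            queue = queue[first_failure + 1:]
    return merged, total_runs


clean = count_test_runs([True, True, True, True])
flaky_head = count_test_runs([False, True, True, True])
```

In this model a clean queue of four changes needs four test runs, while the same queue with a failing head needs seven, because the three changes behind it are all tested twice. With a queue 130 deep and frequent failures at the head, that retesting is what dominates.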
Re: [openstack-dev] The recent gate performance and how it affects you
On Wednesday, November 20, 2013 2:44:52 PM, Clark Boylan wrote:
> Joe Gordon has been doing great work tracking test failures and how often they affect us.
> [...]

Let me also say that I think it's really helpful that Joe has been sending out recaps to the mailing list about the top offenders so people can help pitch in on investigating and fixing those (like we saw with the Neutron team's response to Joe's recent post about the top gate failures).

People get heads-down in their own projects and what they are working on and it's hard to keep up with what's going on in the infra channel (or nova channel for that matter), so sending out a recap that everyone can see in the mailing list is helpful to reset where things are at and focus the various, possibly isolated, investigations (as we saw happen this week).

--
Thanks,
Matt Riedemann
Re: [openstack-dev] The recent gate performance and how it affects you
On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:
> How do we avoid this in the future? Step one is reviewers that are approving changes (or reverifying them) should keep an eye on the gate queue.

Talking on the -infra IRC channel just now, it has become clear to me that we need to stop approving _any_ change for now until we have the gate fixed. All we're doing at the moment is rechecking over and over because the gate is too unreliable to actually pass changes. This is making debugging the gate significantly harder.

Could cores please refrain from approving code until the gate issues are resolved?

Thanks,
Michael

--
Rackspace Australia
Re: [openstack-dev] The recent gate performance and how it affects you
On 11/20/2013 05:10 PM, Michael Still wrote:
> On Thu, Nov 21, 2013 at 7:44 AM, Clark Boylan clark.boy...@gmail.com wrote:
>> How do we avoid this in the future? Step one is reviewers that are approving changes (or reverifying them) should keep an eye on the gate queue.
>
> Talking on the -infra IRC channel just now, it has become clear to me that we need to stop approving _any_ change for now until we have the gate fixed. All we're doing at the moment is rechecking over and over because the gate is too unreliable to actually pass changes. This is making debugging the gate significantly harder.
>
> Could cores please refrain from approving code until the gate issues are resolved?
>
> Thanks,
> Michael

We are talking in -neutron, and Mark McClain sent out an email to all cores expressing this objective. He is also -2'ing any patch in check that is not directly related to a gate-blocking bug fix. Neutron is in holding until we get the all clear from -infra.

Thanks Michael,
Anita.
Re: [openstack-dev] The recent gate performance and how it affects you
Anita, Joe,

I need help with adding a debug message in: https://review.openstack.org/#/c/56316/ to track down:

Bug: https://bugs.launchpad.net/bugs/1251784
= message:Connection to neutron failed: Maximum attempts reached AND filename:logs/screen-n-cpu.txt
Title: nova+neutron scheduling error: Connection to neutron failed: Maximum attempts reached

-- dims

On Wed, Nov 20, 2013 at 5:43 PM, Anita Kuno ante...@anteaya.info wrote:
> We are talking in -neutron and Mark McClain sent out an email to all cores expressing this objective. He is also -2'ing any patch in check that is not directly related to a gate-blocking bug fix. Neutron is in holding until we get the all clear from -infra.
>
> Thanks Michael,
> Anita.

--
Davanum Srinivas :: http://davanum.wordpress.com
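The query quoted in the message above has two clauses joined by AND: one on the log message and one on the log file name. As a rough illustration of what such a fingerprint checks, here is a Python sketch that applies the same two conditions to in-memory log events. The function `matches_fingerprint`, the dict layout of an event, and the sample events are all hypothetical and made up for this sketch. The message clause is treated as a plain substring test, which is a simplification of real search-index query semantics.

```python
# The two clauses from the query quoted above.
MESSAGE = "Connection to neutron failed: Maximum attempts reached"
FILENAME = "logs/screen-n-cpu.txt"


def matches_fingerprint(event, message_substring, filename):
    """Return True if a log event satisfies both clauses of the query.

    `event` is assumed to be a dict with "message" and "filename" keys;
    that layout is an assumption of this sketch.
    """
    return (
        message_substring in event.get("message", "")
        and event.get("filename") == filename
    )


# Made-up sample events, for illustration only.
events = [
    {"filename": "logs/screen-n-cpu.txt", "message": MESSAGE},
    {"filename": "logs/some-other-file.txt", "message": MESSAGE},
    {"filename": "logs/screen-n-cpu.txt", "message": "an unrelated line"},
]

hits = [e for e in events if matches_fingerprint(e, MESSAGE, FILENAME)]
```

Only the first sample event matches, since the other two each fail one clause. Requiring both the message and the file name is what keeps a fingerprint specific to one bug rather than to any log that happens to contain the same text.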
Re: [openstack-dev] The recent gate performance and how it affects you
On 11/20/2013 06:58 PM, Davanum Srinivas wrote:
> Anita, Joe,
>
> I need help with adding a debug message in: https://review.openstack.org/#/c/56316/ to track down:
>
> Bug: https://bugs.launchpad.net/bugs/1251784
> = message:Connection to neutron failed: Maximum attempts reached AND filename:logs/screen-n-cpu.txt
> Title: nova+neutron scheduling error: Connection to neutron failed: Maximum attempts reached
>
> -- dims

It has been decided on the bug report that 1251784 is a duplicate of https://bugs.launchpad.net/nova/+bug/1251920, which has patch 57357, which is in the gate. I have no problem reviewing 56316 though.

Thanks dims,
Anita.
Re: [openstack-dev] The recent gate performance and how it affects you
Dims, we think https://review.openstack.org/#/c/57509/ will fix 1251784 (https://launchpad.net/bugs/1251784).

On Wed, Nov 20, 2013 at 3:58 PM, Davanum Srinivas dava...@gmail.com wrote:
> Anita, Joe,
>
> I need help with adding debug message in: https://review.openstack.org/#/c/56316/ to track down:
>
> Bug: https://bugs.launchpad.net/bugs/1251784
> = message:Connection to neutron failed: Maximum attempts reached AND filename:logs/screen-n-cpu.txt
> Title: nova+neutron scheduling error: Connection to neutron failed: Maximum attempts reached
>
> -- dims
Re: [openstack-dev] The recent gate performance and how it affects you
Nice! Joe.

thanks,
-- dims

On Wed, Nov 20, 2013 at 7:15 PM, Joe Gordon joe.gord...@gmail.com wrote:
> Dims, we think https://review.openstack.org/#/c/57509/ will fix 1251784.