Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-26 Thread Flavio Percoco

On 22/08/13 21:37 -0500, Dolph Mathews wrote:
> On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair wrote:
> Wow, nice work! Thank you, infra!

You guys ROCK! Thanks for everything!
FF

--
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-23 Thread Russell Bryant
On 08/22/2013 10:37 PM, Dolph Mathews wrote:
> On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair wrote:
> > [...]
>
> Wow, nice work! Thank you, infra!

Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-23 Thread Joshua Harlow
+2

Infra is the backbone of OpenStack, and the work you guys/gals do is much appreciated!

Sent from my really tiny device...

On Aug 23, 2013, at 1:20 AM, Chmouel Boudjnah wrote:

> Dolph Mathews writes:
> 
>>pretty excited! As always, if you'd like to pitch in, stop by
>>#openstack-infra on Freenode and see what we're up to.
>> Wow, nice work! Thank you, infra!
> 
> agreed, thanks for the good work infra.
> 
> Chmouel.
> 



Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-23 Thread Chmouel Boudjnah
Dolph Mathews writes:

> pretty excited! As always, if you'd like to pitch in, stop by
> #openstack-infra on Freenode and see what we're up to.
> Wow, nice work! Thank you, infra!

agreed, thanks for the good work infra.

Chmouel.



Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-22 Thread Dolph Mathews
On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair wrote:

> [...]
>
> As always, if you'd like to pitch in, stop by
> #openstack-infra on Freenode and see what we're up to.

Wow, nice work! Thank you, infra!


Re: [openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-22 Thread James E. Blair
Monty Taylor writes:

> The infra team has done a lot of work in prep for our favorite time of
> year, and we've actually landed several upgrades to the gate without
> which we'd be in particularly bad shape right now. (I'll let Jim write
> about some of them later when he's not battling the current operational
> issues - they're pretty spectacular) As with many scaling issues, some
> of these upgrades have resulted in moving the point of pain further
> along the stack. We're working on solutions to the current pain points.
> (Or, I should say they are, because I'm on a plane headed to Burning Man
> and not useful for much other than writing emails.)

Hi!

The good news is that a lot of the operational problems over the past
few days have been corrected; we are now pretty close to the noise floor
of infrastructure issues in the gate, and over the next few days we'll
work to get rid of the remaining bugs.

As I'm sure everyone knows, we've seen a huge growth in the project, the
number of changes, and the number of tests we run.  That is both
wonderful, and a little terrifying!  But we haven't been idle: we have
made some significant improvements and innovations to the project
infrastructure to deal with our growing load, especially during these
peak times.

About a year ago, we realized that the growing number of jobs run (and
number of test machines on which we run those jobs) was going to cause
scaling issues with Jenkins.  So with the help of Khai Do, we created
the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
That means that Zuul isn't directly tied to Jenkins anymore, and can
distribute the jobs it needs to run to anything that can run them via
Gearman.
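The decoupling can be pictured with a toy work queue: a scheduler puts
jobs on a shared queue and whichever worker is free picks them up. This
is only a stdlib sketch of the pattern, not Zuul's or the
gearman-plugin's actual code:

```python
import queue
import threading

# Toy model: a scheduler hands jobs to whichever worker grabs them
# first, instead of being tied to one executor (the role Gearman plays
# between Zuul and the Jenkins masters). Names are illustrative.

job_queue = queue.Queue()
results = {}
results_lock = threading.Lock()

def worker(name):
    while True:
        job = job_queue.get()
        if job is None:          # sentinel: shut this worker down
            job_queue.task_done()
            return
        with results_lock:
            results[job] = name  # record which worker ran the job
        job_queue.task_done()

workers = [threading.Thread(target=worker, args=(f"jenkins{i:02d}",))
           for i in (1, 2)]
for w in workers:
    w.start()

for change in ["change-1", "change-2", "change-3", "change-4"]:
    job_queue.put(change)

job_queue.join()                  # wait until every job has run
for _ in workers:
    job_queue.put(None)           # one sentinel per worker
for w in workers:
    w.join()

print(sorted(results))            # every job ran on *some* worker
```

Which worker runs a given job is nondeterministic; that is exactly the
point of putting a queue between the scheduler and the executors.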

A few weeks ago we took advantage of that by adding two new Jenkins
masters to our system, giving us one of the first (if not the first)
multi-master Jenkins systems.  Since then, all of the test jobs have
been run on nodes attached to either jenkins01.openstack.org or
jenkins02.openstack.org (which you may have seen linked to from the Zuul
status page).  That has given us the ability to upgrade Jenkins and its
plugins with no interruption due to the active-active nature of the
system.  And we can add hundreds of test nodes to each of these systems
and continue to scale them horizontally as our load increases.

With Jenkins now able to scale, the next bottleneck was the number of
test nodes.  Until recently, we had a handful of special Jenkins jobs
which would launch and destroy the single-use nodes that are used for
devstack tests.  We were seeing issues with Jenkins running those jobs,
as well as their ability to keep up with demand.  So we started the
Nodepool project[2] to create a daemon that could keep up with the
demand for test nodes, be much more responsive, and eliminate some of
the occasional errors that we would see in the old Rube Goldberg system
we had for managing nodes.
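The core of such a daemon is a simple invariant: keep N nodes booted and
ready at all times, so demand is served from the pool instead of waiting
on a cloud API. A minimal sketch with the provider calls stubbed out
(none of these names are Nodepool's real API):

```python
# Hypothetical sketch of the pool-replenishing idea behind Nodepool.

TARGET_READY = 5
ready_nodes = []
_counter = 0

def launch_node():
    """Stub for a cloud API call that boots a single-use test node."""
    global _counter
    _counter += 1
    return f"node-{_counter}"

def allocate_node():
    """Hand a ready node to a test job (None if the pool is empty)."""
    return ready_nodes.pop() if ready_nodes else None

def replenish():
    """One pass of the daemon loop: top the pool back up to target."""
    while len(ready_nodes) < TARGET_READY:
        ready_nodes.append(launch_node())

replenish()                                   # initial fill
used = [allocate_node() for _ in range(3)]    # three jobs claim nodes
replenish()                                   # daemon refills the pool
print(len(ready_nodes))  # back at the target: 5
```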

In anticipation of the rush of patches for the feature freeze, we rolled
that out over the weekend so it was ready to go Monday.  And it worked!

In fact, it's extremely responsive.  It immediately utilized our entire
capacity to supply test nodes.  Which was great, except that a lot of
our tests are configured to use the git repos from Gerrit, which is why
Gerrit was very slow early in the week.  Fortunately, Elizabeth Krumbach
Joseph has been working on setting up a new Git server.  That alone is
pretty exciting, and she's going to send an announcement about it soon.
Since it was ready to go, we moved the test load from Gerrit to the new
git server, which has made Gerrit much more responsive again.
Unfortunately, the new git server still wasn't quite able to keep up
with the test load, so Clark Boylan, Elizabeth and I have spent some
time tuning it as well as load-balancing it across several hosts.
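The balancing idea, reduced to a sketch: rotate read-only git traffic
across several mirrors so no single host takes the whole fetch load.
Hostnames below are made up for illustration, not the actual topology:

```python
import itertools

# Spread clone/fetch traffic round-robin across hypothetical mirrors.
mirrors = ["git01.example.org", "git02.example.org", "git03.example.org"]
next_mirror = itertools.cycle(mirrors).__next__

def repo_url(project):
    """Pick the next mirror in rotation for a test node's git fetch."""
    return f"https://{next_mirror()}/{project}"

print(repo_url("openstack/nova"))     # git01...
print(repo_url("openstack/neutron"))  # git02...
print(repo_url("openstack/cinder"))   # git03...
print(repo_url("openstack/glance"))   # wraps back to git01
```

In practice this lives in a load balancer in front of the git servers
rather than in the clients, but the rotation is the same.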

That is now in place, and the new system seems able to cope with the
load from the current rush of patches.

We're still seeing an occasional issue where a job is reported as LOST
because Jenkins is apparently unaware that it can't talk to the test
node.  We have some workarounds in progress that we hope to have in
place soon.

Our goal is to have the most robust and accurate test system possible,
that can run all of the tests we can think to throw at it.  I think the
improvements we've made recently are going to help tremendously and I'm
pretty excited!  As always, if you'd like to pitch in, stop by
#openstack-infra on Freenode and see what we're up to.

-Jim

[1] http://git.openstack.org/cgit/openstack-infra/gearman-plugin/
[2] http://git.openstack.org/cgit/openstack-infra/nodepool/



[openstack-dev] [IMPORTANT] The Gate around Feature Freeze

2013-08-21 Thread Monty Taylor
Hey all!

tl;dr - PLEASE DO NOT APPROVE PATCHES BEFORE THE CHECK JOBS HAVE FINISHED

As I'm sure everyone has noticed, it's Feature Freeze time again, which
means that everyone has eleventy-billion patches that need to land NOW.
As always, we're discovering new and exciting scaling opportunities!

The infra team has done a lot of work in prep for our favorite time of
year, and we've actually landed several upgrades to the gate without
which we'd be in particularly bad shape right now. (I'll let Jim write
about some of them later when he's not battling the current operational
issues - they're pretty spectacular) As with many scaling issues, some
of these upgrades have resulted in moving the point of pain further
along the stack. We're working on solutions to the current pain points.
(Or, I should say they are, because I'm on a plane headed to Burning Man
and not useful for much other than writing emails.)

Which brings me to - the gate queue is really long right now. As much
technology as we are bringing to bear on the problem, it's just a
reality. That means everyone gets antsy and wants their stuff to land
RIGHT NOW. I get it; I'm probably the most impatient person in the world.

However, as the tl;dr says above, we're seeing folks hit Approve before
the check jobs on a freshly uploaded patch have passed. That may seem
expedient, but failures in the gate itself are actually quite costly,
because they cause everything behind them to stack up. If the check
queue can catch an error, we REALLY need it to right now, because we
need every little bit of help we can get.
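The arithmetic behind that cost is simple: changes in the gate are
tested speculatively on top of everything ahead of them, so one failure
invalidates the results of every change queued behind it. A toy
illustration (the queue depth is hypothetical):

```python
# Why a gate failure is costly: every change behind the failing one
# must be retested, while a failure caught in check costs nothing extra.

def retests_after_failure(queue_len, failed_pos):
    """Changes behind 0-based position `failed_pos` lose their results."""
    return queue_len - failed_pos - 1

print(retests_after_failure(20, 0))   # failure at the head: 19 wasted runs
print(retests_after_failure(20, 19))  # failure at the tail: 0 wasted runs
```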

So to summarize:

PLEASE DO NOT APPROVE PATCHES BEFORE THE CHECK JOBS HAVE FINISHED

At least not this week.

Thanks!
Monty
