Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-11-26 Thread Bogdan Dobrelya

Here is a related bug [0] and implementation [1] for that. PTAL folks!

[0] https://bugs.launchpad.net/tripleo/+bug/1804822
[1] https://review.openstack.org/#/q/topic:base-container-reduction


Let's also think of removing puppet-tripleo from the base container.
It really brings in the world (and yum updates in CI!) for each job and each
container!
So if we did so, we should then either install puppet-tripleo and co on
the host and bind-mount it for the docker-puppet deployment task steps
(bad idea IMO), OR use the magical --volumes-from <container>
option to mount volumes from some "puppet-config" sidecar container 
inside each of the containers being launched by docker-puppet tooling.
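
For illustration only (image names, the module path and the puppet
invocation below are placeholders, not what docker-puppet actually runs
today):

  # a throw-away "puppet-config" sidecar that only carries the puppet
  # modules (puppet-tripleo and co) in a volume
  docker create --name puppet-config \
      -v /usr/share/openstack-puppet/modules \
      PUPPET_CONFIG_IMAGE /bin/true

  # each container launched by docker-puppet then mounts that volume
  # instead of carrying its own copy of the modules in its image
  docker run --rm --volumes-from puppet-config SERVICE_IMAGE \
      puppet apply ...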


On Wed, Oct 31, 2018 at 11:16 AM Harald Jensås  
wrote:

We add this to all images:

https://github.com/openstack/tripleo-common/blob/d35af75b0d8c4683a677660646e535cf972c98ef/container-images/tripleo_kolla_template_overrides.j2#L35

/bin/sh -c yum -y install iproute iscsi-initiator-utils lvm2 python
socat sudo which openstack-tripleo-common-container-base rsync cronie
crudini openstack-selinux ansible python-shade puppet-tripleo
python2-kubernetes && yum clean all && rm -rf /var/cache/yum   276 MB


Is the additional 276 MB reasonable here?
openstack-selinux <- This package runs relabeling; does that kind of
touching of the filesystem impact the size due to docker layers?

Also: python2-kubernetes is a fairly large package (18007990 bytes); do we
use that in every image? I don't see any tripleo-related repos importing
from it when searching on Hound. The original commit message [1]
adding it states it is for future convenience.
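
(Roughly what I searched, for reference; this assumes the Hound JSON API
on codesearch.openstack.org, so treat the exact endpoint and parameters
as approximate:)

  curl 'http://codesearch.openstack.org/api/v1/search?q=from+kubernetes+import&repos=*'
  curl 'http://codesearch.openstack.org/api/v1/search?q=python2-kubernetes&repos=*'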

On my undercloud we have 101 images; if we are downloading that 18 MB for
every image, that's almost 1.8 GB for a package we don't use. (I hope it's
not actually like this; with docker layers, we only download that 276 MB
transaction once, right?)
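
(One way to check; the image names below are only examples:)

  # per-layer sizes for a single image; the yum transaction above shows
  # up as one ~276 MB layer
  docker history docker.io/tripleomaster/centos-binary-nova-api:current-tripleo

  # compare the layer digests of two images: layers with the same digest
  # are stored, and downloaded, only once
  docker inspect -f '{{json .RootFS.Layers}}' \
      docker.io/tripleomaster/centos-binary-nova-api:current-tripleo
  docker inspect -f '{{json .RootFS.Layers}}' \
      docker.io/tripleomaster/centos-binary-neutron-server:current-tripleo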


[1] https://review.openstack.org/527927




--
Best regards,
Bogdan Dobrelya,
Irc #bogdando



Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-11-05 Thread Alex Schultz
On Mon, Nov 5, 2018 at 3:47 AM Bogdan Dobrelya  wrote:
>
> Let's also think of removing puppet-tripleo from the base container.
> It really brings in the world (and yum updates in CI!) for each job and each
> container!
> So if we did so, we should then either install puppet-tripleo and co on
> the host and bind-mount it for the docker-puppet deployment task steps
> (bad idea IMO), OR use the magical --volumes-from <container>
> option to mount volumes from some "puppet-config" sidecar container
> inside each of the containers being launched by docker-puppet tooling.
>

This does bring up an interesting point, as we also include this in
overcloud-full. I know Dan had a patch to stop using puppet-tripleo
from the host [0], which is the opposite of this. While these yum
updates happen a bunch in CI, they aren't super large updates. But yes,
I think we need to figure out the correct way forward with these
packages.

Thanks,
-Alex

[0] https://review.openstack.org/#/c/550848/


> On 10/31/18 6:35 PM, Alex Schultz wrote:
> >
> > So this is a single layer that is updated once and shared by all the
> > containers that inherit from it. I did notice the same thing and have
> > proposed a change in the layering of these packages last night.
> >
> > https://review.openstack.org/#/c/614371/
> >
> > In general this does raise a point about dependencies of services and
> > what the actual impact of adding new ones to projects is. Especially
> > in the container world where this might be duplicated N times
> > depending on the number of services deployed.  With the move to
> > containers, much of the sharedness that being on a single host
> > provided has been lost at a cost of increased bandwidth, memory, and
> > storage usage.
> >
> > Thanks,
> > -Alex
> >
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando



Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-11-05 Thread Cédric Jeanneret


On 11/5/18 11:47 AM, Bogdan Dobrelya wrote:
> Let's also think of removing puppet-tripleo from the base container.
> It really brings in the world (and yum updates in CI!) for each job and each
> container!
> So if we did so, we should then either install puppet-tripleo and co on
> the host and bind-mount it for the docker-puppet deployment task steps
> (bad idea IMO), OR use the magical --volumes-from <container>
> option to mount volumes from some "puppet-config" sidecar container
> inside each of the containers being launched by docker-puppet tooling.

And, in addition, I'd rather see the "podman" thingy as a bind-mount,
especially since we MUST get the same version in all the calls.
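
Something along these lines, as a sketch only (paths are illustrative,
and the host binary's shared libraries would also have to be available
inside the image for this to actually run):

  # bind-mount the host's podman read-only so every call inside the
  # container uses exactly the same binary/version as the host
  podman run --rm \
      -v /usr/bin/podman:/usr/bin/podman:ro \
      SOME_SERVICE_IMAGE podman --version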

> 
> On 10/31/18 6:35 PM, Alex Schultz wrote:
>>
>> So this is a single layer that is updated once and shared by all the
>> containers that inherit from it. I did notice the same thing and have
>> proposed a change in the layering of these packages last night.
>>
>> https://review.openstack.org/#/c/614371/
>>
>> In general this does raise a point about dependencies of services and
>> what the actual impact of adding new ones to projects is. Especially
>> in the container world where this might be duplicated N times
>> depending on the number of services deployed.  With the move to
>> containers, much of the sharedness that being on a single host
>> provided has been lost at a cost of increased bandwidth, memory, and
>> storage usage.
>>
>> Thanks,
>> -Alex
>>
> 

-- 
Cédric Jeanneret
Software Engineer
DFG:DF





Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-11-05 Thread Bogdan Dobrelya

Let's also think of removing puppet-tripleo from the base container.
It really brings in the world (and yum updates in CI!) for each job and each
container!
So if we did so, we should then either install puppet-tripleo and co on
the host and bind-mount it for the docker-puppet deployment task steps
(bad idea IMO), OR use the magical --volumes-from <container>
option to mount volumes from some "puppet-config" sidecar container 
inside each of the containers being launched by docker-puppet tooling.


On 10/31/18 6:35 PM, Alex Schultz wrote:


So this is a single layer that is updated once and shared by all the
containers that inherit from it. I did notice the same thing and have
proposed a change in the layering of these packages last night.

https://review.openstack.org/#/c/614371/

In general this does raise a point about dependencies of services and
what the actual impact of adding new ones to projects is. Especially
in the container world where this might be duplicated N times
depending on the number of services deployed.  With the move to
containers, much of the sharedness that being on a single host
provided has been lost at a cost of increased bandwidth, memory, and
storage usage.

Thanks,
-Alex



--
Best regards,
Bogdan Dobrelya,
Irc #bogdando



Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-11-01 Thread Ben Nemec



On 10/30/18 4:16 PM, Clark Boylan wrote:

On Tue, Oct 30, 2018, at 1:01 PM, Ben Nemec wrote:



On 10/30/18 1:25 PM, Clark Boylan wrote:

On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:

On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:


Tagging with tripleo since my suggestion below is specific to that project.

On 10/30/18 11:03 AM, Clark Boylan wrote:

Hello everyone,

A little while back I sent email explaining how the gate queues work and how 
fixing bugs helps us test and merge more code. All of this is still true
and we should keep pushing to improve our testing to avoid gate resets.

Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
process of doing this we had to restart Zuul which brought in a new logging 
feature that exposes node resource usage by jobs. Using this data I've been 
able to generate some report information on where our node demand is going. 
This change [0] produces this report [1].

As with optimizing software we want to identify which changes will have the 
biggest impact and to be able to measure whether or not changes have had an 
impact once we have made them. Hopefully this information is a start at doing 
that. Currently we can only look back to the point Zuul was restarted, but we 
have a thirty day log rotation for this service and should be able to look at a 
month's worth of data going forward.

Looking at the data you might notice that Tripleo is using many more node 
resources than our other projects. They are aware of this and have a plan [2] 
to reduce their resource consumption. We'll likely be using this report 
generator to check progress of this plan over time.


I know at one point we had discussed reducing the concurrency of the
tripleo gate to help with this. Since tripleo is still using >50% of the
resources it seems like maybe we should revisit that, at least for the
short-term until the more major changes can be made? Looking through the
merge history for tripleo projects I don't see a lot of cases (any, in
fact) where more than a dozen patches made it through anyway*, so I
suspect it wouldn't have a significant impact on gate throughput, but it
would free up quite a few nodes for other uses.



It's the failures in gate and resets.  At this point I think it would
be a good idea to turn down the concurrency of the tripleo queue in
the gate if possible. As of late it's been timeouts but we've been
unable to track down why it's timing out specifically.  I personally
have a feeling it's the container download times since we do not have
a local registry available and are only able to leverage the mirrors
for some levels of caching. Unfortunately we don't get the best
information about this out of docker (or the mirrors) and it's really
hard to determine what exactly makes things run a bit slower.


We actually tried this not too long ago 
https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
 but decided to revert it because it didn't decrease the check queue backlog 
significantly. We were still running at several hours behind most of the time.


I'm surprised to hear that. Counting the tripleo jobs in the gate at
positions 11-20 right now, I see around 84 nodes tied up in long-running
jobs and another 32 for shorter unit test jobs. The latter probably
don't have much impact, but the former is a non-trivial amount. It may
not erase the entire 2300+ job queue that we have right now, but it
seems like it should help.



If we want to set up better monitoring and measuring and try it again we can do 
that. But we probably want to measure queue sizes with and without the change 
like that to better understand if it helps.


This seems like good information to start capturing, otherwise we are
kind of just guessing. Is there something in infra already that we could
use or would it need to be new tooling?


Digging around in graphite we currently track mean in pipelines. This is 
probably a reasonable metric to use for this specific case.

Looking at the check queue, [3] shows the mean time enqueued in check during the rough
period when the window floor was 10, and [4] shows it since then. The 26th and 27th are bigger
peaks than previously seen (possibly due to losing inap temporarily) but otherwise a 
queue backlog of ~200 minutes was "normal" in both time periods.

[3]
http://graphite.openstack.org/render/?from=20181015&until=20181019&target=scale(stats.timers.zuul.tenant.openstack.pipeline.check.resident_time.mean,%200.166)
[4]
http://graphite.openstack.org/render/?from=20181019&until=20181030&target=scale(stats.timers.zuul.tenant.openstack.pipeline.check.resident_time.mean,%200.166)

You should be able to change check to eg gate or other queue names and poke 
around more if you like. Note the scale factor scales from milliseconds to 
minutes.
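
e.g. something along these lines (untested; adjust the time range and
pipeline name to taste):

  curl -s 'http://graphite.openstack.org/render/?format=json&from=-7days&target=scale(stats.timers.zuul.tenant.openstack.pipeline.check.resident_time.mean,%200.166)'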

Clark



Cool, thanks. Seems like things have been better for the past couple of 
days, but I'll keep this in my back pocket for 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-31 Thread Harald Jensås
On Wed, 2018-10-31 at 11:35 -0600, Alex Schultz wrote:
> On Wed, Oct 31, 2018 at 11:16 AM Harald Jensås 
> wrote:
> > 
> > On Tue, 2018-10-30 at 15:00 -0600, Alex Schultz wrote:
> > > On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan <
> > > cboy...@sapwetik.org>
> > > wrote:
> > > > 
> > > > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > > > > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <
> > > > > openst...@nemebean.com> wrote:
> > > > > > 
> > > > > > Tagging with tripleo since my suggestion below is specific
> > > > > > to
> > > > > > that project.
> > > > > > 
> > > > > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > > > > Hello everyone,
> > > > > > > 
> > > > > > > A little while back I sent email explaining how the gate
> > > > > > > queues work and how fixing bugs helps us test and merge
> > > > > > > more
> > > > > > > code. All of this is still true and we should keep
> > > > > > > pushing to improve our testing to avoid gate resets.
> > > > > > > 
> > > > > > > Last week we migrated Zuul and Nodepool to a new
> > > > > > > Zookeeper
> > > > > > > cluster. In the process of doing this we had to restart
> > > > > > > Zuul
> > > > > > > which brought in a new logging feature that exposes node
> > > > > > > resource usage by jobs. Using this data I've been able to
> > > > > > > generate some report information on where our node demand
> > > > > > > is
> > > > > > > going. This change [0] produces this report [1].
> > > > > > > 
> > > > > > > As with optimizing software we want to identify which
> > > > > > > changes
> > > > > > > will have the biggest impact and to be able to measure
> > > > > > > whether or not changes have had an impact once we have
> > > > > > > made
> > > > > > > them. Hopefully this information is a start at doing
> > > > > > > that.
> > > > > > > Currently we can only look back to the point Zuul was
> > > > > > > restarted, but we have a thirty day log rotation for this
> > > > > > > service and should be able to look at a month's worth of
> > > > > > > data
> > > > > > > going forward.
> > > > > > > 
> > > > > > > Looking at the data you might notice that Tripleo is
> > > > > > > using
> > > > > > > many more node resources than our other projects. They
> > > > > > > are
> > > > > > > aware of this and have a plan [2] to reduce their
> > > > > > > resource
> > > > > > > consumption. We'll likely be using this report generator
> > > > > > > to
> > > > > > > check progress of this plan over time.
> > > > > > 
> > > > > > I know at one point we had discussed reducing the
> > > > > > concurrency
> > > > > > of the
> > > > > > tripleo gate to help with this. Since tripleo is still
> > > > > > using >50% of the
> > > > > > resources it seems like maybe we should revisit that, at
> > > > > > least
> > > > > > for the
> > > > > > short-term until the more major changes can be made?
> > > > > > Looking
> > > > > > through the
> > > > > > merge history for tripleo projects I don't see a lot of
> > > > > > cases
> > > > > > (any, in
> > > > > > fact) where more than a dozen patches made it through
> > > > > > anyway*,
> > > > > > so I
> > > > > > suspect it wouldn't have a significant impact on gate
> > > > > > throughput, but it
> > > > > > would free up quite a few nodes for other uses.
> > > > > > 
> > > > > 
> > > > > It's the failures in gate and resets.  At this point I think
> > > > > it
> > > > > would
> > > > > be a good idea to turn down the concurrency of the tripleo
> > > > > queue
> > > > > in
> > > > > the gate if possible. As of late it's been timeouts but we've
> > > > > been
> > > > > unable to track down why it's timing out specifically.  I
> > > > > personally
> > > > > have a feeling it's the container download times since we do
> > > > > not
> > > > > have
> > > > > a local registry available and are only able to leverage the
> > > > > mirrors
> > > > > for some levels of caching. Unfortunately we don't get the
> > > > > best
> > > > > information about this out of docker (or the mirrors) and
> > > > > it's
> > > > > really
> > > > > hard to determine what exactly makes things run a bit slower.
> > > > 
> > > > We actually tried this not too long ago
> > > > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> > > >  but decided to revert it because it didn't decrease the check
> > > > queue backlog significantly. We were still running at several
> > > > hours
> > > > behind most of the time.
> > > > 
> > > > If we want to set up better monitoring and measuring and try it
> > > > again we can do that. But we probably want to measure queue
> > > > sizes
> > > > with and without the change like that to better understand if
> > > > it
> > > > helps.
> > > > 
> > > > As for container image download times can we quantify that via
> > > > docker logs? Basically sum up the amount of time spent by a job
> > > > downloading images so that we can see what the impact is 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-31 Thread Alex Schultz
On Wed, Oct 31, 2018 at 11:16 AM Harald Jensås  wrote:
>
> On Tue, 2018-10-30 at 15:00 -0600, Alex Schultz wrote:
> > On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan 
> > wrote:
> > >
> > > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > > > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <
> > > > openst...@nemebean.com> wrote:
> > > > >
> > > > > Tagging with tripleo since my suggestion below is specific to
> > > > > that project.
> > > > >
> > > > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > > > Hello everyone,
> > > > > >
> > > > > > A little while back I sent email explaining how the gate
> > > > > > queues work and how fixing bugs helps us test and merge more
> > > > > > code. All of this is still true and we should keep
> > > > > > pushing to improve our testing to avoid gate resets.
> > > > > >
> > > > > > Last week we migrated Zuul and Nodepool to a new Zookeeper
> > > > > > cluster. In the process of doing this we had to restart Zuul
> > > > > > which brought in a new logging feature that exposes node
> > > > > > resource usage by jobs. Using this data I've been able to
> > > > > > generate some report information on where our node demand is
> > > > > > going. This change [0] produces this report [1].
> > > > > >
> > > > > > As with optimizing software we want to identify which changes
> > > > > > will have the biggest impact and to be able to measure
> > > > > > whether or not changes have had an impact once we have made
> > > > > > them. Hopefully this information is a start at doing that.
> > > > > > Currently we can only look back to the point Zuul was
> > > > > > restarted, but we have a thirty day log rotation for this
> > > > > > service and should be able to look at a month's worth of data
> > > > > > going forward.
> > > > > >
> > > > > > Looking at the data you might notice that Tripleo is using
> > > > > > many more node resources than our other projects. They are
> > > > > > aware of this and have a plan [2] to reduce their resource
> > > > > > consumption. We'll likely be using this report generator to
> > > > > > check progress of this plan over time.
> > > > >
> > > > > I know at one point we had discussed reducing the concurrency
> > > > > of the
> > > > > tripleo gate to help with this. Since tripleo is still using
> > > > > >50% of the
> > > > > resources it seems like maybe we should revisit that, at least
> > > > > for the
> > > > > short-term until the more major changes can be made? Looking
> > > > > through the
> > > > > merge history for tripleo projects I don't see a lot of cases
> > > > > (any, in
> > > > > fact) where more than a dozen patches made it through anyway*,
> > > > > so I
> > > > > suspect it wouldn't have a significant impact on gate
> > > > > throughput, but it
> > > > > would free up quite a few nodes for other uses.
> > > > >
> > > >
> > > > It's the failures in gate and resets.  At this point I think it
> > > > would
> > > > be a good idea to turn down the concurrency of the tripleo queue
> > > > in
> > > > the gate if possible. As of late it's been timeouts but we've
> > > > been
> > > > unable to track down why it's timing out specifically.  I
> > > > personally
> > > > have a feeling it's the container download times since we do not
> > > > have
> > > > a local registry available and are only able to leverage the
> > > > mirrors
> > > > for some levels of caching. Unfortunately we don't get the best
> > > > information about this out of docker (or the mirrors) and it's
> > > > really
> > > > hard to determine what exactly makes things run a bit slower.
> > >
> > > We actually tried this not too long ago
> > > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> > >  but decided to revert it because it didn't decrease the check
> > > queue backlog significantly. We were still running at several hours
> > > behind most of the time.
> > >
> > > If we want to set up better monitoring and measuring and try it
> > > again we can do that. But we probably want to measure queue sizes
> > > with and without the change like that to better understand if it
> > > helps.
> > >
> > > As for container image download times can we quantify that via
> > > docker logs? Basically sum up the amount of time spent by a job
> > > downloading images so that we can see what the impact is but also
> > > measure if changes improve that? As for other ideas improving
> > > things seems like many of the images that tripleo use are quite
> > > large. I recall seeing a > 600MB image just for rsyslog. Wouldn't
> > > it be advantageous for both the gate and tripleo in the real world
> > > to trim the size of those images (which should improve download
> > > times). In any case quantifying the size of the downloads and
> > > trimming those if possible is likely also worthwhile.
> > >
> >
> > So it's not that simple as we don't just download all the images in a
> > distinct task and there isn't any 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-31 Thread Harald Jensås
On Tue, 2018-10-30 at 15:00 -0600, Alex Schultz wrote:
> On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan 
> wrote:
> > 
> > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <
> > > openst...@nemebean.com> wrote:
> > > > 
> > > > Tagging with tripleo since my suggestion below is specific to
> > > > that project.
> > > > 
> > > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > A little while back I sent email explaining how the gate
> > > > > queues work and how fixing bugs helps us test and merge more
> > > > > code. All of this is still true and we should keep
> > > > > pushing to improve our testing to avoid gate resets.
> > > > > 
> > > > > Last week we migrated Zuul and Nodepool to a new Zookeeper
> > > > > cluster. In the process of doing this we had to restart Zuul
> > > > > which brought in a new logging feature that exposes node
> > > > > resource usage by jobs. Using this data I've been able to
> > > > > generate some report information on where our node demand is
> > > > > going. This change [0] produces this report [1].
> > > > > 
> > > > > As with optimizing software we want to identify which changes
> > > > > will have the biggest impact and to be able to measure
> > > > > whether or not changes have had an impact once we have made
> > > > > them. Hopefully this information is a start at doing that.
> > > > > Currently we can only look back to the point Zuul was
> > > > > restarted, but we have a thirty day log rotation for this
> > > > > service and should be able to look at a month's worth of data
> > > > > going forward.
> > > > > 
> > > > > Looking at the data you might notice that Tripleo is using
> > > > > many more node resources than our other projects. They are
> > > > > aware of this and have a plan [2] to reduce their resource
> > > > > consumption. We'll likely be using this report generator to
> > > > > check progress of this plan over time.
> > > > 
> > > > I know at one point we had discussed reducing the concurrency
> > > > of the
> > > > tripleo gate to help with this. Since tripleo is still using
> > > > >50% of the
> > > > resources it seems like maybe we should revisit that, at least
> > > > for the
> > > > short-term until the more major changes can be made? Looking
> > > > through the
> > > > merge history for tripleo projects I don't see a lot of cases
> > > > (any, in
> > > > fact) where more than a dozen patches made it through anyway*,
> > > > so I
> > > > suspect it wouldn't have a significant impact on gate
> > > > throughput, but it
> > > > would free up quite a few nodes for other uses.
> > > > 
> > > 
> > > It's the failures in gate and resets.  At this point I think it
> > > would
> > > be a good idea to turn down the concurrency of the tripleo queue
> > > in
> > > the gate if possible. As of late it's been timeouts but we've
> > > been
> > > unable to track down why it's timing out specifically.  I
> > > personally
> > > have a feeling it's the container download times since we do not
> > > have
> > > a local registry available and are only able to leverage the
> > > mirrors
> > > for some levels of caching. Unfortunately we don't get the best
> > > information about this out of docker (or the mirrors) and it's
> > > really
> > > hard to determine what exactly makes things run a bit slower.
> > 
> > We actually tried this not too long ago 
> > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> >  but decided to revert it because it didn't decrease the check
> > queue backlog significantly. We were still running at several hours
> > behind most of the time.
> > 
> > If we want to set up better monitoring and measuring and try it
> > again we can do that. But we probably want to measure queue sizes
> > with and without the change like that to better understand if it
> > helps.
> > 
> > As for container image download times can we quantify that via
> > docker logs? Basically sum up the amount of time spent by a job
> > downloading images so that we can see what the impact is but also
> > measure if changes improve that? As for other ideas improving
> > things seems like many of the images that tripleo use are quite
> > large. I recall seeing a > 600MB image just for rsyslog. Wouldn't
> > it be advantageous for both the gate and tripleo in the real world
> > to trim the size of those images (which should improve download
> > times). In any case quantifying the size of the downloads and
> > trimming those if possible is likely also worthwhile.
> > 
> 
> So it's not that simple as we don't just download all the images in a
> distinct task and there isn't any information provided around
> size/speed AFAIK.  Additionally we aren't doing anything special with
> the images (it's mostly kolla built containers with a handful of
> tweaks) so that's just the size of the containers.  I am currently
> working on 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Clark Boylan
On Tue, Oct 30, 2018, at 1:01 PM, Ben Nemec wrote:
> 
> 
> On 10/30/18 1:25 PM, Clark Boylan wrote:
> > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> >> On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:
> >>>
> >>> Tagging with tripleo since my suggestion below is specific to that 
> >>> project.
> >>>
> >>> On 10/30/18 11:03 AM, Clark Boylan wrote:
>  Hello everyone,
> 
>  A little while back I sent email explaining how the gate queues work and 
>  how fixing bugs helps us test and merge more code. All of this is
>  still true and we should keep pushing to improve our testing to avoid 
>  gate resets.
> 
>  Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In 
>  the process of doing this we had to restart Zuul which brought in a new 
>  logging feature that exposes node resource usage by jobs. Using this 
>  data I've been able to generate some report information on where our 
>  node demand is going. This change [0] produces this report [1].
> 
>  As with optimizing software we want to identify which changes will have 
>  the biggest impact and to be able to measure whether or not changes have 
>  had an impact once we have made them. Hopefully this information is a 
>  start at doing that. Currently we can only look back to the point Zuul 
>  was restarted, but we have a thirty day log rotation for this service 
>  and should be able to look at a month's worth of data going forward.
> 
>  Looking at the data you might notice that Tripleo is using many more 
>  node resources than our other projects. They are aware of this and have 
>  a plan [2] to reduce their resource consumption. We'll likely be using 
>  this report generator to check progress of this plan over time.
> >>>
> >>> I know at one point we had discussed reducing the concurrency of the
> >>> tripleo gate to help with this. Since tripleo is still using >50% of the
> >>> resources it seems like maybe we should revisit that, at least for the
> >>> short-term until the more major changes can be made? Looking through the
> >>> merge history for tripleo projects I don't see a lot of cases (any, in
> >>> fact) where more than a dozen patches made it through anyway*, so I
> >>> suspect it wouldn't have a significant impact on gate throughput, but it
> >>> would free up quite a few nodes for other uses.
> >>>
> >>
> >> It's the failures in gate and resets.  At this point I think it would
> >> be a good idea to turn down the concurrency of the tripleo queue in
> >> the gate if possible. As of late it's been timeouts but we've been
> >> unable to track down why it's timing out specifically.  I personally
> >> have a feeling it's the container download times since we do not have
> >> a local registry available and are only able to leverage the mirrors
> >> for some levels of caching. Unfortunately we don't get the best
> >> information about this out of docker (or the mirrors) and it's really
> >> hard to determine what exactly makes things run a bit slower.
> > 
> > We actually tried this not too long ago 
> > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> >  but decided to revert it because it didn't decrease the check queue 
> > backlog significantly. We were still running at several hours behind most 
> > of the time.
> 
> I'm surprised to hear that. Counting the tripleo jobs in the gate at 
> positions 11-20 right now, I see around 84 nodes tied up in long-running 
> jobs and another 32 for shorter unit test jobs. The latter probably 
> don't have much impact, but the former is a non-trivial amount. It may 
> not erase the entire 2300+ job queue that we have right now, but it 
> seems like it should help.
> 
> > 
> > If we want to set up better monitoring and measuring and try it again we 
> > can do that. But we probably want to measure queue sizes with and without 
> > the change like that to better understand if it helps.
> 
> This seems like good information to start capturing, otherwise we are 
> kind of just guessing. Is there something in infra already that we could 
> use or would it need to be new tooling?

Digging around in graphite we currently track mean in pipelines. This is 
probably a reasonable metric to use for this specific case.

Looking at the check queue, [3] shows the mean time enqueued in check during the
rough period when the window floor was 10, and [4] shows it since then. The 26th and 27th
are bigger peaks than previously seen (possibly due to losing inap temporarily) 
but otherwise a queue backlog of ~200 minutes was "normal" in both time periods.

[3] 
http://graphite.openstack.org/render/?from=20181015&until=20181019&target=scale(stats.timers.zuul.tenant.openstack.pipeline.check.resident_time.mean,%200.166)
[4] 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Alex Schultz
On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan  wrote:
>
> On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:
> > >
> > > Tagging with tripleo since my suggestion below is specific to that 
> > > project.
> > >
> > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > Hello everyone,
> > > >
> > > > A little while back I sent email explaining how the gate queues work 
> > > > and how fixing bugs helps us test and merge more code. All of this 
> > > > is still true and we should keep pushing to improve our testing
> > > > to avoid gate resets.
> > > >
> > > > Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In 
> > > > the process of doing this we had to restart Zuul which brought in a new 
> > > > logging feature that exposes node resource usage by jobs. Using this 
> > > > data I've been able to generate some report information on where our 
> > > > node demand is going. This change [0] produces this report [1].
> > > >
> > > > As with optimizing software we want to identify which changes will have 
> > > > the biggest impact and to be able to measure whether or not changes 
> > > > have had an impact once we have made them. Hopefully this information 
> > > > is a start at doing that. Currently we can only look back to the point 
> > > > Zuul was restarted, but we have a thirty day log rotation for this 
> > > > service and should be able to look at a month's worth of data going 
> > > > forward.
> > > >
> > > > Looking at the data you might notice that Tripleo is using many more 
> > > > node resources than our other projects. They are aware of this and have 
> > > > a plan [2] to reduce their resource consumption. We'll likely be using 
> > > > this report generator to check progress of this plan over time.
> > >
> > > I know at one point we had discussed reducing the concurrency of the
> > > tripleo gate to help with this. Since tripleo is still using >50% of the
> > > resources it seems like maybe we should revisit that, at least for the
> > > short-term until the more major changes can be made? Looking through the
> > > merge history for tripleo projects I don't see a lot of cases (any, in
> > > fact) where more than a dozen patches made it through anyway*, so I
> > > suspect it wouldn't have a significant impact on gate throughput, but it
> > > would free up quite a few nodes for other uses.
> > >
> >
> > It's the failures in gate and resets.  At this point I think it would
> > be a good idea to turn down the concurrency of the tripleo queue in
> > the gate if possible. As of late it's been timeouts but we've been
> > unable to track down why it's timing out specifically.  I personally
> > have a feeling it's the container download times since we do not have
> > a local registry available and are only able to leverage the mirrors
> > for some levels of caching. Unfortunately we don't get the best
> > information about this out of docker (or the mirrors) and it's really
> > hard to determine what exactly makes things run a bit slower.
>
> We actually tried this not too long ago 
> https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
>  but decided to revert it because it didn't decrease the check queue backlog 
> significantly. We were still running at several hours behind most of the time.
>
> If we want to set up better monitoring and measuring and try it again we can 
> do that. But we probably want to measure queue sizes with and without the 
> change like that to better understand if it helps.
>
> As for container image download times can we quantify that via docker logs? 
> Basically sum up the amount of time spent by a job downloading images so that 
> we can see what the impact is but also measure if changes improve that? As 
> for other ideas improving things seems like many of the images that tripleo 
> use are quite large. I recall seeing a > 600MB image just for rsyslog. 
> Wouldn't it be advantageous for both the gate and tripleo in the real world 
> to trim the size of those images (which should improve download times). In 
> any case quantifying the size of the downloads and trimming those if possible 
> is likely also worthwhile.
>

So it's not that simple as we don't just download all the images in a
distinct task and there isn't any information provided around
size/speed AFAIK.  Additionally we aren't doing anything special with
the images (it's mostly kolla built containers with a handful of
tweaks) so that's just the size of the containers.  I am currently
working on reducing any tripleo specific dependencies (ie removal of
instack-undercloud, etc) in hopes that we'll shave off some of the
dependencies but it seems that there's a larger (bloat) issue around
containers in general.  I have no idea why the rsyslog container would
be 600M, but yea that does seem excessive.
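
(For whoever picks this up, a quick way to see which layers those ~600M
are in; the image name here is only a guess:)

  docker history docker.io/tripleomaster/centos-binary-rsyslog:current-tripleo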

> Clark
>
> 

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Ben Nemec



On 10/30/18 1:25 PM, Clark Boylan wrote:

On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:

On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:


Tagging with tripleo since my suggestion below is specific to that project.

On 10/30/18 11:03 AM, Clark Boylan wrote:

Hello everyone,

A little while back I sent email explaining how the gate queues work and how 
fixing bugs helps us test and merge more code. All of this is still true
and we should keep pushing to improve our testing to avoid gate resets.

Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
process of doing this we had to restart Zuul which brought in a new logging 
feature that exposes node resource usage by jobs. Using this data I've been 
able to generate some report information on where our node demand is going. 
This change [0] produces this report [1].

As with optimizing software we want to identify which changes will have the 
biggest impact and to be able to measure whether or not changes have had an 
impact once we have made them. Hopefully this information is a start at doing 
that. Currently we can only look back to the point Zuul was restarted, but we 
have a thirty day log rotation for this service and should be able to look at a 
month's worth of data going forward.

Looking at the data you might notice that Tripleo is using many more node 
resources than our other projects. They are aware of this and have a plan [2] 
to reduce their resource consumption. We'll likely be using this report 
generator to check progress of this plan over time.


I know at one point we had discussed reducing the concurrency of the
tripleo gate to help with this. Since tripleo is still using >50% of the
resources it seems like maybe we should revisit that, at least for the
short-term until the more major changes can be made? Looking through the
merge history for tripleo projects I don't see a lot of cases (any, in
fact) where more than a dozen patches made it through anyway*, so I
suspect it wouldn't have a significant impact on gate throughput, but it
would free up quite a few nodes for other uses.



It's the failures in gate and resets.  At this point I think it would
be a good idea to turn down the concurrency of the tripleo queue in
the gate if possible. As of late it's been timeouts but we've been
unable to track down why it's timing out specifically.  I personally
have a feeling it's the container download times since we do not have
a local registry available and are only able to leverage the mirrors
for some levels of caching. Unfortunately we don't get the best
information about this out of docker (or the mirrors) and it's really
hard to determine what exactly makes things run a bit slower.


We actually tried this not too long ago 
https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
 but decided to revert it because it didn't decrease the check queue backlog 
significantly. We were still running at several hours behind most of the time.


I'm surprised to hear that. Counting the tripleo jobs in the gate at 
positions 11-20 right now, I see around 84 nodes tied up in long-running 
jobs and another 32 for shorter unit test jobs. The latter probably 
don't have much impact, but the former is a non-trivial amount. It may 
not erase the entire 2300+ job queue that we have right now, but it 
seems like it should help.




If we want to set up better monitoring and measuring and try it again we can do 
that. But we probably want to measure queue sizes with and without the change 
like that to better understand if it helps.


This seems like good information to start capturing, otherwise we are 
kind of just guessing. Is there something in infra already that we could 
use or would it need to be new tooling?




As for container image download times can we quantify that via docker logs? 
Basically sum up the amount of time spent by a job downloading images so that we 
can see what the impact is but also measure if changes improve that? As for other 
ideas improving things seems like many of the images that tripleo use are quite 
large. I recall seeing a > 600MB image just for rsyslog. Wouldn't it be 
advantageous for both the gate and tripleo in the real world to trim the size of 
those images (which should improve download times). In any case quantifying the 
size of the downloads and trimming those if possible is likely also worthwhile.

Clark





Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Clark Boylan
On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:
> >
> > Tagging with tripleo since my suggestion below is specific to that project.
> >
> > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > Hello everyone,
> > >
> > > A little while back I sent email explaining how the gate queues work and 
> > > how fixing bugs helps us test and merge more code. All of this is
> > > still true and we should keep pushing to improve our testing to avoid 
> > > gate resets.
> > >
> > > Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In 
> > > the process of doing this we had to restart Zuul which brought in a new 
> > > logging feature that exposes node resource usage by jobs. Using this data 
> > > I've been able to generate some report information on where our node 
> > > demand is going. This change [0] produces this report [1].
> > >
> > > As with optimizing software we want to identify which changes will have 
> > > the biggest impact and to be able to measure whether or not changes have 
> > > had an impact once we have made them. Hopefully this information is a 
> > > start at doing that. Currently we can only look back to the point Zuul 
> > > was restarted, but we have a thirty day log rotation for this service and 
> > > should be able to look at a month's worth of data going forward.
> > >
> > > Looking at the data you might notice that Tripleo is using many more node 
> > > resources than our other projects. They are aware of this and have a plan 
> > > [2] to reduce their resource consumption. We'll likely be using this 
> > > report generator to check progress of this plan over time.
> >
> > I know at one point we had discussed reducing the concurrency of the
> > tripleo gate to help with this. Since tripleo is still using >50% of the
> > resources it seems like maybe we should revisit that, at least for the
> > short-term until the more major changes can be made? Looking through the
> > merge history for tripleo projects I don't see a lot of cases (any, in
> > fact) where more than a dozen patches made it through anyway*, so I
> > suspect it wouldn't have a significant impact on gate throughput, but it
> > would free up quite a few nodes for other uses.
> >
> 
> It's the failures in gate and resets.  At this point I think it would
> be a good idea to turn down the concurrency of the tripleo queue in
> the gate if possible. As of late it's been timeouts but we've been
> unable to track down why it's timing out specifically.  I personally
> have a feeling it's the container download times since we do not have
> a local registry available and are only able to leverage the mirrors
> for some levels of caching. Unfortunately we don't get the best
> information about this out of docker (or the mirrors) and it's really
> hard to determine what exactly makes things run a bit slower.

We actually tried this not too long ago 
https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
 but decided to revert it because it didn't decrease the check queue backlog 
significantly. We were still running at several hours behind most of the time.

If we want to set up better monitoring and measuring and try it again we can do 
that. But we probably want to measure queue sizes with and without the change 
like that to better understand if it helps.

As for container image download times can we quantify that via docker logs? 
Basically sum up the amount of time spent by a job downloading images so that 
we can see what the impact is but also measure if changes improve that? As for 
other ideas improving things seems like many of the images that tripleo use are 
quite large. I recall seeing a > 600MB image just for rsyslog. Wouldn't it be 
advantageous for both the gate and tripleo in the real world to trim the size 
of those images (which should improve download times). In any case quantifying 
the size of the downloads and trimming those if possible is likely also 
worthwhile.
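
As a first rough cut (local, uncompressed sizes only; the on-the-wire
transfer is the compressed layers, and shared layers are pulled once):

  # per-image size as docker reports it locally
  docker images --format 'table {{.Repository}}:{{.Tag}}\t{{.Size}}'

  # overall local image disk usage
  docker system df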

Clark



Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Alex Schultz
On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec  wrote:
>
> Tagging with tripleo since my suggestion below is specific to that project.
>
> On 10/30/18 11:03 AM, Clark Boylan wrote:
> > Hello everyone,
> >
> > A little while back I sent email explaining how the gate queues work and 
> > how fixing bugs helps us test and merge more code. All of this is
> > still true and we should keep pushing to improve our testing to avoid gate 
> > resets.
> >
> > Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
> > process of doing this we had to restart Zuul which brought in a new logging 
> > feature that exposes node resource usage by jobs. Using this data I've been 
> > able to generate some report information on where our node demand is going. 
> > This change [0] produces this report [1].
> >
> > As with optimizing software we want to identify which changes will have the 
> > biggest impact and to be able to measure whether or not changes have had an 
> > impact once we have made them. Hopefully this information is a start at 
> > doing that. Currently we can only look back to the point Zuul was 
> > restarted, but we have a thirty day log rotation for this service and 
> > should be able to look at a month's worth of data going forward.
> >
> > Looking at the data you might notice that Tripleo is using many more node 
> > resources than our other projects. They are aware of this and have a plan 
> > [2] to reduce their resource consumption. We'll likely be using this report 
> > generator to check progress of this plan over time.
>
> I know at one point we had discussed reducing the concurrency of the
> tripleo gate to help with this. Since tripleo is still using >50% of the
> resources it seems like maybe we should revisit that, at least for the
> short-term until the more major changes can be made? Looking through the
> merge history for tripleo projects I don't see a lot of cases (any, in
> fact) where more than a dozen patches made it through anyway*, so I
> suspect it wouldn't have a significant impact on gate throughput, but it
> would free up quite a few nodes for other uses.
>

It's the failures in gate and resets.  At this point I think it would
be a good idea to turn down the concurrency of the tripleo queue in
the gate if possible. As of late it's been timeouts but we've been
unable to track down why it's timing out specifically.  I personally
have a feeling it's the container download times since we do not have
a local registry available and are only able to leverage the mirrors
for some levels of caching. Unfortunately we don't get the best
information about this out of docker (or the mirrors) and it's really
hard to determine what exactly makes things run a bit slower.

I've asked about the status of moving the scenarios off of multinode
to standalone, which would halve the number of systems being run for
these jobs. It's currently next on the list of things to tackle after
we get a single fedora28 job up and running.

Thanks,
-Alex

> *: I have no actual stats to back that up, I'm just looking through the
> IRC backlog for merge bot messages. If such stats do exist somewhere we
> should look at them instead. :-)
>
> >
> > Also related to the long queue backlogs is this proposal [3] to change how 
> > Zuul prioritizes resource allocations to try to be more fair.
> >
> > [0] https://review.openstack.org/#/c/613674/
> > [1] http://paste.openstack.org/show/733644/
> > [2] 
> > http://lists.openstack.org/pipermail/openstack-dev/2018-October/135396.html
> > [3] http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-October/000575.html
> >
> > If you find any of this interesting and would like to help feel free to 
> > reach out to myself or the infra team.
> >
> > Thank you,
> > Clark
> >


Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

2018-10-30 Thread Ben Nemec

Tagging with tripleo since my suggestion below is specific to that project.

On 10/30/18 11:03 AM, Clark Boylan wrote:

Hello everyone,

A little while back I sent email explaining how the gate queues work and how 
fixing bugs helps us test and merge more code. All of this is still true
and we should keep pushing to improve our testing to avoid gate resets.

Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
process of doing this we had to restart Zuul which brought in a new logging 
feature that exposes node resource usage by jobs. Using this data I've been 
able to generate some report information on where our node demand is going. 
This change [0] produces this report [1].

As with optimizing software we want to identify which changes will have the 
biggest impact and to be able to measure whether or not changes have had an 
impact once we have made them. Hopefully this information is a start at doing 
that. Currently we can only look back to the point Zuul was restarted, but we 
have a thirty day log rotation for this service and should be able to look at a 
month's worth of data going forward.

Looking at the data you might notice that Tripleo is using many more node 
resources than our other projects. They are aware of this and have a plan [2] 
to reduce their resource consumption. We'll likely be using this report 
generator to check progress of this plan over time.


I know at one point we had discussed reducing the concurrency of the 
tripleo gate to help with this. Since tripleo is still using >50% of the 
resources it seems like maybe we should revisit that, at least for the 
short-term until the more major changes can be made? Looking through the 
merge history for tripleo projects I don't see a lot of cases (any, in 
fact) where more than a dozen patches made it through anyway*, so I 
suspect it wouldn't have a significant impact on gate throughput, but it 
would free up quite a few nodes for other uses.


*: I have no actual stats to back that up, I'm just looking through the 
IRC backlog for merge bot messages. If such stats do exist somewhere we 
should look at them instead. :-)




Also related to the long queue backlogs is this proposal [3] to change how Zuul 
prioritizes resource allocations to try to be more fair.

[0] https://review.openstack.org/#/c/613674/
[1] http://paste.openstack.org/show/733644/
[2] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135396.html
[3] http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-October/000575.html

If you find any of this interesting and would like to help feel free to reach 
out to myself or the infra team.

Thank you,
Clark
