Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-15 Thread Sergii Golovatiuk
Wesley,

For Ubuntu, I suggest enabling the 'proposed' repo to catch problems
before packages are moved to 'updates'.
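
For illustration, a minimal sketch of what enabling that repo from a job could
look like (purely hypothetical; the series name and component list below are
assumptions, not actual infra configuration), written as an Ansible task in YAML:

  # Enable the Ubuntu 'proposed' pocket so a periodic job sees packages
  # before they migrate to '-updates'.
  - name: Enable the xenial-proposed pocket
    become: true
    apt_repository:
      repo: "deb http://archive.ubuntu.com/ubuntu/ xenial-proposed main restricted universe multiverse"
      state: present
      update_cache: true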

On Mon, May 14, 2018 at 11:42 PM, Wesley Hayutin  wrote:
>
>
> On Sun, May 13, 2018 at 11:50 PM Tristan Cacqueray 
> wrote:
>>
>> On May 14, 2018 2:44 am, Wesley Hayutin wrote:
>> [snip]
>> > I do think it would be helpful to say have a one week change window
>> > where
>> > folks are given the opportunity to preflight check a new image and the
>> > potential impact on the job workflow the updated image may have.
>> [snip]
>>
>> How about adding a periodic job that sets up centos-release-cr in a pre
>> task? This should highlight issues with upcoming updates:
>> https://wiki.centos.org/AdditionalResources/Repositories/CR
>>
>> -Tristan
>
>
> Thanks for the suggestion, Tristan; I'm going to propose using this repo at
> the next TripleO meeting.
>
> Thanks
>



-- 
Best Regards,
Sergii Golovatiuk

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Wesley Hayutin
On Sun, May 13, 2018 at 11:50 PM Tristan Cacqueray 
wrote:

> On May 14, 2018 2:44 am, Wesley Hayutin wrote:
> [snip]
> > I do think it would be helpful to say have a one week change window where
> > folks are given the opportunity to preflight check a new image and the
> > potential impact on the job workflow the updated image may have.
> [snip]
>
> How about adding a periodic job that sets up centos-release-cr in a pre
> task? This should highlight issues with upcoming updates:
> https://wiki.centos.org/AdditionalResources/Repositories/CR
>
> -Tristan
>

Thanks for the suggestion, Tristan; I'm going to propose using this repo at
the next TripleO meeting.

Thanks


> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Jeremy Stanley
On 2018-05-14 18:56:51 + (+), Jeremy Stanley wrote:
[...]
> Gödel's completeness theorem at work
[...]

More accurately, Gödel's first incompleteness theorem, I suppose. ;)
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Jeremy Stanley
On 2018-05-14 12:00:05 -0600 (-0600), Wesley Hayutin wrote:
[...]
> Project non-voting check jobs on the node-pool image creation job
> perhaps could be the canary in the coal mine we are seeking. Maybe
> we could see if that would be something that could be useful to
> both infra and to various OpenStack projects?
[...]

This presumes that Nodepool image builds are Zuul jobs, which they
aren't (at least not today). Long, long ago in a CI system not so
far away, our DevStack-specific image builds were in fact CI jobs
and for a while back then we did run DevStack's "smoke" tests as an
acceptance test before putting a new image into service. At the time
we discovered that even deploying DevStack was too complex and racy
to make for a viable acceptance test. The lesson we learned is that
most of the image regressions we were concerned with preventing
required testing complex enough to be a significant regression
magnet itself (Gödel's completeness theorem at work, I expect?).

That said, the idea of turning more of Nodepool's tasks into Zuul
jobs is an interesting one worthy of lengthy discussion sometime.
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Clark Boylan
On Mon, May 14, 2018, at 10:11 AM, Wesley Hayutin wrote:
> On Mon, May 14, 2018 at 12:08 PM Clark Boylan  wrote:
> 
> > On Mon, May 14, 2018, at 8:57 AM, Wesley Hayutin wrote:
> > > On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley 
> > wrote:
> > >
> > > > On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:

snip

> > > > Our automation doesn't know that there's a difference between
> > > > packages which were part of CentOS 7.4 and 7.5 any more than it
> > > > knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
> > > > Even if we somehow managed to pause our CentOS image updates
> > > > immediately prior to 7.5, jobs would still try to upgrade those
> > > > 7.4-based images to the 7.5 packages in our mirror, right?
> > > >
> > >
> > > Understood, I suspect this will become a more widespread issue as
> > > more projects start to use containers ( not sure ).  It's my
> > understanding
> > > that
> > > there are some mechanisms in place to pin packages in the centos nodepool
> > > image so
> > > there have been some thoughts generally in the area of this issue.
> >
> > Again, I think we need to understand why containers would make this worse
> > not better. Seems like the big feature everyone talks about when it comes
> > to containers is isolating packaging whether that be python packages so
> > that nova and glance can use a different version of oslo or cohabitating
> > software that would otherwise conflict. Why do the packages on the host
> > platform so strongly impact your container package lists?
> >
> 
> I'll let others comment on that, however my thought is you don't move from
> A -> Z in one step and containers do not make everything easier
> immediately.  Like most things, it takes a little time.
> 

If the main issue is being caught in a transition period at the same time a
minor update happens, can we treat this as a temporary state? Rather than
attempting to solve this particular case happening again in the future, we
might be better served testing that upcoming CentOS releases won't break
tripleo due to changes in the packaging, using the centos-release-cr repo as
Tristan suggests. That should tell you if something like pacemaker were to stop
working. Note this wouldn't require any infra-side updates; you would just have
these jobs configure the additional repo and go from there.
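
As a rough sketch (not an existing tripleo-ci playbook), such a pre task could
be as small as installing the centos-release-cr package, which drops the CR
repo definition into /etc/yum.repos.d and updates to the pending minor release:

  # Hypothetical pre-run tasks (Ansible/YAML): opt the node into the CentOS CR
  # (continuous release) repo and update to the not-yet-released packages.
  - name: Enable the CentOS CR repository
    become: true
    yum:
      name: centos-release-cr
      state: present

  - name: Update to the pending minor release packages
    become: true
    yum:
      name: '*'
      state: latest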

Then on top of that get through the transition period so that the containers 
isolate you from these changes in the way they should. Then when 7.6 happens 
you'll have hopefully identified all the broken packaging ahead of time and 
worked with upstream to address those problems (which should be important for a 
stable long term support distro) and your containers can update at whatever 
pace they choose?

I don't think it would be appropriate for Infra to stage centos minor versions,
for a couple of reasons. The first is that we don't support specific minor
versions of CentOS/RHEL; we support the major version, and if it updates and
OpenStack stops working, that is CI doing its job and providing that info. The
other major concern is CentOS specifically says "We are trying to make sure
people understand they can NOT use older minor versions and still be secure."
Similarly to how we won't support Ubuntu 12.04 because it is no longer
supported, we shouldn't support CentOS 7.4 at this point. These are no longer
secure platforms.

However, I think testing using the pre-release repo as proposed above should
allow you to catch issues before updates happen just as well as a staged minor 
version update would. The added benefit of using this process is you should 
know as soon as possible and not after the release has been made (helping other 
users of CentOS by not releasing broken packages in the first place).

Clark

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Wesley Hayutin
On Mon, May 14, 2018 at 12:37 PM Jeremy Stanley  wrote:

> On 2018-05-14 09:57:17 -0600 (-0600), Wesley Hayutin wrote:
> > On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley 
> wrote:
> [...]
> > > Couldn't a significant burst of new packages cause the same
> > > symptoms even without it being tied to a minor version increase?
> >
> > Yes, certainly this could happen outside of a minor update of the
> > baseos.
>
> Thanks for confirming. So this is not specifically a CentOS minor
> version increase issue, it's just more likely to occur at minor
> version boundaries.
>

Correct, you got it


>
> > So the only thing out of our control is the package set on the
> > base nodepool image. If that suddenly gets updated with too many
> > packages, then we have to scramble to ensure the images and
> > containers are also updated.
>
> It's still unclear to me why the packages on the test instance image
> (i.e. the "container host") are related to the packages in the
> container guest images at all. That would seem to be the whole point
> of having containers?
>

You are right; just note that some services are not 100% containerized yet.
This doesn't happen overnight; it's a process and we're getting there.


>
> > If there is a breaking change in the nodepool image for example
> > [a], we have to react to and fix that as well.
>
> I would argue that one is a terrible workaround which happened to
> show its warts. We should fix DIB's pip-and-virtualenv element
> rather than continue to rely on side effects of pinning RPM versions.
> I've commented to that effect on https://launchpad.net/bugs/1770298
> just now.
>
>
k.. thanks


> > > It sounds like a problem with how the jobs are designed
> > > and expectations around distros slowly trickling package updates
> > > into the series without occasional larger bursts of package deltas.
> > > I'd like to understand more about why you upgrade packages inside
> > > your externally-produced container images at job runtime at all,
> > > rather than relying on the package versions baked into them.
> >
> > We do that to ensure the gerrit review itself and its
> > dependencies are built via rpm and injected into the build. If we
> > did not do this the job would not be testing the change at all.
> > This is a result of being a package based deployment for better or
> > worse.
> [...]
>
> Now I'll risk jumping to proposing solutions, but have you
> considered building those particular packages in containers too?
> That way they're built against the same package versions as will be
> present in the other container images you're using rather than to
> the package versions on the host, right? Seems like it would
> completely sidestep the problem.
>

So, a little background.  The containers and images used in TripleO are
rebuilt multiple times each day via periodic jobs; when they pass our
criteria they are pushed out and used upstream.
Each zuul change and its dependencies can potentially impact a few or all of
the containers in play.  We cannot rebuild all the containers due to time
constraints in each job.  We have been able to mount and yum update the
containers involved with the zuul change.

The latest patch to fine-tune that process is here:
https://review.openstack.org/#/c/567550/


>
> > An enhancement could be to stage the new images for say one week
> > or so. Do we need the CentOS updates immediately? Is there a
> > possible path that does not create a lot of work for infra, but
> > also provides some space for projects to prep for the consumption
> > of the updates?
> [...]
>
> Nodepool builds new images constantly, but at least daily. Part of
> this is to prevent the delta of available packages/indices and other
> files baked into those images from being more than a day or so stale
> at any given point in time. The older the image, the more packages
> (on average) jobs will need to download if they want to test with
> latest package versions and the more strain it will put on our
> mirrors and on our bandwidth quotas/donors' networks.
>

Sure that makes perfect sense.  We do the same with our containers and
images.


>
> There's also a question of retention: if we're building images at
> least daily but keeping them around for 7 days, that means more
> storage on the builders and larger tenant quotas for Glance in our
> providers, as well as an explosion of additional nodes we'd need,
> since we pre-boot nodes with each of our images (and the idea as I
> understand it is that you would want jobs to be able to select
> between any of them). One option, I suppose, would be to switch to
> building images weekly instead of daily, but that only solves the
> storage and node count problem, not the additional bandwidth and
> mirror load. And of course, nodepool would need to learn to be able
> to boot nodes from older versions of an image on record, which is
> not a feature it has right now.
>

OK.. thanks for walking me through that.  It totally makes sense to be
concerned with 

Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Wesley Hayutin
On Mon, May 14, 2018 at 12:08 PM Clark Boylan  wrote:

> On Mon, May 14, 2018, at 8:57 AM, Wesley Hayutin wrote:
> > On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley 
> wrote:
> >
> > > On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
> > > [...]
>
> snip
>
> > >
> > > This _doesn't_ sound to me like a problem with how we've designed
> > > our infrastructure, unless there are additional details you're
> > > omitting.
> >
> >
> > So the only thing out of our control is the package set on the base
> > nodepool image.
> > If that suddenly gets updated with too many packages, then we have to
> > scramble to ensure the images and containers are also updated.
> > If there is a breaking change in the nodepool image for example [a], we
> > have to react to and fix that as well.
>
> Aren't the container images independent of the hosting platform (eg what
> infra hosts)? I'm not sure I understand why the host platform updating
> implies all the container images must also be updated.
>

You make a fine point here; I think, as with anything, there are some bits
that are still being worked on. At this moment it's my understanding that
pacemaker and possibly a few other components are not 100% containerized
atm.  I'm not an expert in the subject and my understanding may not be
correct.  Until you are 100% containerized there may still be some
dependencies on the base image and an impact from changes.


>
> >
> >
> > > It sounds like a problem with how the jobs are designed
> > > and expectations around distros slowly trickling package updates
> > > into the series without occasional larger bursts of package deltas.
> > > I'd like to understand more about why you upgrade packages inside
> > > your externally-produced container images at job runtime at all,
> > > rather than relying on the package versions baked into them.
> >
> >
> > We do that to ensure the gerrit review itself and its dependencies are
> > built via rpm and injected into the build.
> > If we did not do this the job would not be testing the change at all.
> >  This is a result of being a package based deployment for better or
> worse.
>
> You'd only need to do that for the change in review, not the entire system
> right?
>

Correct, there is no intention of updating the entire distribution at run
time; the intent is to have as much as possible updated in the jobs that
build the containers and images.
Only the rpm-built zuul change should be included in the update; however,
some zuul changes require a CentOS base package that was not previously
installed on the container, e.g. a new python dependency introduced in a
zuul change.  Previously we had not enabled any CentOS repos in the
container update, but found that was not viable 100% of the time.

We have a change to further limit the scope of the update which should help
[1], especially when facing a minor version update.

 [1] https://review.openstack.org/#/c/567550/
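
Purely as an illustration of that scope (this is not the logic in [1]; the
variables, image names and docker usage below are placeholders), the idea is
roughly: for each container touched by the change, layer in only the RPMs built
from the zuul change plus whatever new base packages they drag in:

  # Hypothetical Ansible/YAML task sketch, not the tripleo-ci implementation.
  - name: Update affected containers with only the change-built RPMs
    shell: |
      cat > Dockerfile.update <<'EOF'
      FROM {{ item }}
      RUN yum -y install {{ changed_packages | join(' ') }} && yum clean all
      EOF
      docker build -f Dockerfile.update -t {{ item }}-updated .
    args:
      chdir: "{{ workdir }}"
    loop: "{{ affected_containers }}"
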

>
> >
>
> snip
>
> > > Our automation doesn't know that there's a difference between
> > > packages which were part of CentOS 7.4 and 7.5 any more than it
> > > knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
> > > Even if we somehow managed to pause our CentOS image updates
> > > immediately prior to 7.5, jobs would still try to upgrade those
> > > 7.4-based images to the 7.5 packages in our mirror, right?
> > >
> >
> > Understood, I suspect this will become a more widespread issue as
> > more projects start to use containers ( not sure ).  It's my
> understanding
> > that
> > there are some mechanisms in place to pin packages in the centos nodepool
> > image so
> > there have been some thoughts generally in the area of this issue.
>
> Again, I think we need to understand why containers would make this worse
> not better. Seems like the big feature everyone talks about when it comes
> to containers is isolating packaging whether that be python packages so
> that nova and glance can use a different version of oslo or cohabitating
> software that would otherwise conflict. Why do the packages on the host
> platform so strongly impact your container package lists?
>

I'll let others comment on that, however my thought is you don't move from
A -> Z in one step and containers do not make everything easier
immediately.  Like most things, it takes a little time.

>
> >
> > TripleO may be the exception to the rule here and that is fine, I'm more
> > interested in exploring
> > the possibilities of delivering updates in a staged fashion than
> anything.
> > I don't have insight into
> > what the possibilities are, or if other projects have similar issues or
> > requests.  Perhaps the TripleO
> > project could share the details of our job workflow with the community
> and
> > this would make more sense.
> >
> > I appreciate your time, effort and thoughts you have shared in the
> thread.
> >
> >
> > > --
> > > Jeremy Stanley
> > >
> >
> > [a] https://bugs.launchpad.net/tripleo/+bug/1770298
>
> 

Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Jeremy Stanley
On 2018-05-14 09:57:17 -0600 (-0600), Wesley Hayutin wrote:
> On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley  wrote:
[...]
> > Couldn't a significant burst of new packages cause the same
> > symptoms even without it being tied to a minor version increase?
> 
> Yes, certainly this could happen outside of a minor update of the
> baseos.

Thanks for confirming. So this is not specifically a CentOS minor
version increase issue, it's just more likely to occur at minor
version boundaries.

> So the only thing out of our control is the package set on the
> base nodepool image. If that suddenly gets updated with too many
> packages, then we have to scramble to ensure the images and
> containers are also updated.

It's still unclear to me why the packages on the test instance image
(i.e. the "container host") are related to the packages in the
container guest images at all. That would seem to be the whole point
of having containers?

> If there is a breaking change in the nodepool image for example
> [a], we have to react to and fix that as well.

I would argue that one is a terrible workaround which happened to
show its warts. We should fix DIB's pip-and-virtualenv element
rather than continue to rely on side effects of pinning RPM versions.
I've commented to that effect on https://launchpad.net/bugs/1770298
just now.

> > It sounds like a problem with how the jobs are designed
> > and expectations around distros slowly trickling package updates
> > into the series without occasional larger bursts of package deltas.
> > I'd like to understand more about why you upgrade packages inside
> > your externally-produced container images at job runtime at all,
> > rather than relying on the package versions baked into them.
> 
> We do that to ensure the gerrit review itself and its
> dependencies are built via rpm and injected into the build. If we
> did not do this the job would not be testing the change at all.
> This is a result of being a package based deployment for better or
> worse.
[...]

Now I'll risk jumping to proposing solutions, but have you
considered building those particular packages in containers too?
That way they're built against the same package versions as will be
present in the other container images you're using rather than to
the package versions on the host, right? Seems like it would
completely sidestep the problem.

> An enhancement could be to stage the new images for say one week
> or so. Do we need the CentOS updates immediately? Is there a
> possible path that does not create a lot of work for infra, but
> also provides some space for projects to prep for the consumption
> of the updates?
[...]

Nodepool builds new images constantly, but at least daily. Part of
this is to prevent the delta of available packages/indices and other
files baked into those images from being more than a day or so stale
at any given point in time. The older the image, the more packages
(on average) jobs will need to download if they want to test with
latest package versions and the more strain it will put on our
mirrors and on our bandwidth quotas/donors' networks.

There's also a question of retention: if we're building images at
least daily but keeping them around for 7 days, that means more
storage on the builders and larger tenant quotas for Glance in our
providers, as well as an explosion of additional nodes we'd need,
since we pre-boot nodes with each of our images (and the idea as I
understand it is that you would want jobs to be able to select
between any of them). One option, I suppose, would be to switch to
building images weekly instead of daily, but that only solves the
storage and node count problem, not the additional bandwidth and
mirror load. And of course, nodepool would need to learn to be able
to boot nodes from older versions of an image on record, which is
not a feature it has right now.

> Understood, I suspect this will become a more widespread issue as
> more projects start to use containers ( not sure ).

I'm still confused as to what makes this a container problem in the
general sense, rather than just a problem (leaky abstraction) with
how you've designed the job framework in which you're using them.

> It's my understanding that there are some mechanisms in place to
> pin packages in the centos nodepool image so there have been some
> thoughts generally in the area of this issue.
[...]

If this is a reference back to bug 1770298, as mentioned already I
think that's a mistake in diskimage-builder's stdlib which should be
corrected, not a pattern we should propagate.
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Clark Boylan
On Mon, May 14, 2018, at 8:57 AM, Wesley Hayutin wrote:
> On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley  wrote:
> 
> > On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
> > [...]

snip

> >
> > This _doesn't_ sound to me like a problem with how we've designed
> > our infrastructure, unless there are additional details you're
> > omitting.
> 
> 
> So the only thing out of our control is the package set on the base
> nodepool image.
> If that suddenly gets updated with too many packages, then we have to
> scramble to ensure the images and containers are also udpated.
> If there is a breaking change in the nodepool image for example [a], we
> have to react to and fix that as well.

Aren't the container images independent of the hosting platform (eg what infra 
hosts)? I'm not sure I understand why the host platform updating implies all 
the container images must also be updated.

> 
> 
> > It sounds like a problem with how the jobs are designed
> > and expectations around distros slowly trickling package updates
> > into the series without occasional larger bursts of package deltas.
> > I'd like to understand more about why you upgrade packages inside
> > your externally-produced container images at job runtime at all,
> > rather than relying on the package versions baked into them.
> 
> 
> We do that to ensure the gerrit review itself and its dependencies are
> built via rpm and injected into the build.
> If we did not do this the job would not be testing the change at all.
>  This is a result of being a package based deployment for better or worse.

You'd only need to do that for the change in review, not the entire system 
right?

> 

snip

> > Our automation doesn't know that there's a difference between
> > packages which were part of CentOS 7.4 and 7.5 any more than it
> > knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
> > Even if we somehow managed to pause our CentOS image updates
> > immediately prior to 7.5, jobs would still try to upgrade those
> > 7.4-based images to the 7.5 packages in our mirror, right?
> >
> 
> Understood, I suspect this will become a more widespread issue as
> more projects start to use containers ( not sure ).  It's my understanding
> that
> there are some mechanisms in place to pin packages in the centos nodepool
> image so
> there have been some thoughts generally in the area of this issue.

Again, I think we need to understand why containers would make this worse not 
better. Seems like the big feature everyone talks about when it comes to 
containers is isolating packaging whether that be python packages so that nova 
and glance can use a different version of oslo or cohabitating software that 
would otherwise conflict. Why do the packages on the host platform so strongly 
impact your container package lists?

> 
> TripleO may be the exception to the rule here and that is fine, I'm more
> interested in exploring
> the possibilities of delivering updates in a staged fashion than anything.
> I don't have insight into
> what the possibilities are, or if other projects have similar issues or
> requests.  Perhaps the TripleO
> project could share the details of our job workflow with the community and
> this would make more sense.
> 
> I appreciate your time, effort and thoughts you have shared in the thread.
> 
> 
> > --
> > Jeremy Stanley
> >
> 
> [a] https://bugs.launchpad.net/tripleo/+bug/1770298

I think understanding the questions above may be the important aspect of 
understanding what the underlying issue is here and how we might address it.

Clark

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Wesley Hayutin
On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley  wrote:

> On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
> [...]
> > I think you may be conflating the notion that ubuntu or rhel/cent
> > can be updated w/o any issues to applications that run atop of the
> > distributions with what it means to introduce a minor update into
> > the upstream openstack ci workflow.
> >
> > If jobs could execute w/o a timeout the tripleo jobs would have
> > not gone red.  Since we do have constraints in the upstream like a
> > timeouts and others we have to prepare containers, images etc to
> > work efficiently in the upstream.  For example, if our jobs had
> > the time to yum update the roughly 120 containers in play in each
> > job the tripleo jobs would have just worked.  I am not advocating
> > for not having timeouts or constraints on jobs, however I am
> > saying this is an infra issue, not a distribution or distribution
> > support issue.
> >
> > I think this is an important point to consider and I view it as
> > mostly unrelated to the support claims by the distribution.  Does
> > that make sense?
> [...]
>
> Thanks, the thread jumped straight to suggesting costly fixes
> (separate images for each CentOS point release, adding an evaluation
> period or acceptance testing for new point releases, et cetera)
> without coming anywhere close to exploring the problem space. Is
> your only concern that when your jobs started using CentOS 7.5
> instead of 7.4 they took longer to run?


Yes, if they had unlimited time to run, our workflow would have everything
updated to CentOS 7.5 in the job itself and I would expect everything to
just work.


> What was the root cause? Are
> you saying your jobs consume externally-produced artifacts which lag
> behind CentOS package updates?


Yes, TripleO has externally produced overcloud images and containers, both
of which can be yum updated, but we try to ensure they are frequently
recreated so the yum transaction is small.


> Couldn't a significant burst of new
> packages cause the same symptoms even without it being tied to a
> minor version increase?
>

Yes, certainly this could happen outside of a minor update of the baseos.


>
> This _doesn't_ sound to me like a problem with how we've designed
> our infrastructure, unless there are additional details you're
> omitting.


So the only thing out of our control is the package set on the base
nodepool image.
If that suddenly gets updated with too many packages, then we have to
scramble to ensure the images and containers are also updated.
If there is a breaking change in the nodepool image, for example [a], we
have to react to and fix that as well.


> It sounds like a problem with how the jobs are designed
> and expectations around distros slowly trickling package updates
> into the series without occasional larger bursts of package deltas.
> I'd like to understand more about why you upgrade packages inside
> your externally-produced container images at job runtime at all,
> rather than relying on the package versions baked into them.


We do that to ensure the gerrit review itself and its dependencies are
built via rpm and injected into the build.
If we did not do this the job would not be testing the change at all.
This is a result of being a package-based deployment, for better or worse.


> It
> seems like you're arguing that the existence of lots of new package
> versions which aren't already in your container images is the
> problem, in which case I have trouble with the rationalization of it
> being "an infra issue" insofar as it requires changes to the
> services as provided by the OpenStack Infra team.
>
> Just to be clear, we didn't "introduce a minor update into the
> upstream openstack ci workflow." We continuously pull CentOS 7
> packages into our package mirrors, and continuously rebuild our
> centos-7 images from whatever packages the distro says are current.
>

Understood, which I think is fine and probably works for most projects.
An enhancement could be to stage the new images for say one week or so.
Do we need the CentOS updates immediately? Is there a possible path that
does not create a lot of work for infra, but also provides some space for
projects
to prep for the consumption of the updates?


> Our automation doesn't know that there's a difference between
> packages which were part of CentOS 7.4 and 7.5 any more than it
> knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
> Even if we somehow managed to pause our CentOS image updates
> immediately prior to 7.5, jobs would still try to upgrade those
> 7.4-based images to the 7.5 packages in our mirror, right?
>

Understood, I suspect this will become a more widespread issue as more
projects start to use containers (not sure).  It's my understanding that
there are some mechanisms in place to pin packages in the centos nodepool
image, so there have been some thoughts generally in the area of this issue.

TripleO may 

Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Jeremy Stanley
On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
[...]
> I think you may be conflating the notion that ubuntu or rhel/cent
> can be updated w/o any issues to applications that run atop of the
> distributions with what it means to introduce a minor update into
> the upstream openstack ci workflow.
> 
> If jobs could execute w/o a timeout the tripleo jobs would have
> not gone red.  Since we do have constraints in the upstream like a
> timeouts and others we have to prepare containers, images etc to
> work efficiently in the upstream.  For example, if our jobs had
> the time to yum update the roughly 120 containers in play in each
> job the tripleo jobs would have just worked.  I am not advocating
> for not having timeouts or constraints on jobs, however I am
> saying this is an infra issue, not a distribution or distribution
> support issue.
> 
> I think this is an important point to consider and I view it as
> mostly unrelated to the support claims by the distribution.  Does
> that make sense?
[...]

Thanks, the thread jumped straight to suggesting costly fixes
(separate images for each CentOS point release, adding an evaluation
period or acceptance testing for new point releases, et cetera)
without coming anywhere close to exploring the problem space. Is
your only concern that when your jobs started using CentOS 7.5
instead of 7.4 they took longer to run? What was the root cause? Are
you saying your jobs consume externally-produced artifacts which lag
behind CentOS package updates? Couldn't a significant burst of new
packages cause the same symptoms even without it being tied to a
minor version increase?

This _doesn't_ sound to me like a problem with how we've designed
our infrastructure, unless there are additional details you're
omitting. It sounds like a problem with how the jobs are designed
and expectations around distros slowly trickling package updates
into the series without occasional larger bursts of package deltas.
I'd like to understand more about why you upgrade packages inside
your externally-produced container images at job runtime at all,
rather than relying on the package versions baked into them. It
seems like you're arguing that the existence of lots of new package
versions which aren't already in your container images is the
problem, in which case I have trouble with the rationalization of it
being "an infra issue" insofar as it requires changes to the
services as provided by the OpenStack Infra team.

Just to be clear, we didn't "introduce a minor update into the
upstream openstack ci workflow." We continuously pull CentOS 7
packages into our package mirrors, and continuously rebuild our
centos-7 images from whatever packages the distro says are current.
Our automation doesn't know that there's a difference between
packages which were part of CentOS 7.4 and 7.5 any more than it
knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
Even if we somehow managed to pause our CentOS image updates
immediately prior to 7.5, jobs would still try to upgrade those
7.4-based images to the 7.5 packages in our mirror, right?
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-14 Thread Wesley Hayutin
On Sun, May 13, 2018 at 11:30 PM Jeremy Stanley  wrote:

> On 2018-05-13 20:44:25 -0600 (-0600), Wesley Hayutin wrote:
> [...]
> > I do think it would be helpful to say have a one week change
> > window where folks are given the opportunity to preflight check a
> > new image and the potential impact on the job workflow the updated
> > image may have. If I could update or create a non-voting job w/
> > the new image that would provide two things.
> >
> > 1. The first is the head's up, this new minor version of centos is
> > coming into the system and you have $x days to deal with it.
> >
> > 2. The ability to build a few non-voting jobs w/ the new image to
> > see what kind of impact it has on the workflow and deployments.
> [...]
>
> While I can see where you're coming from, right now even the Infra
> team doesn't know immediately when a new CentOS minor release starts
> to be used. The packages show up in the mirrors automatically and
> images begin to be built with them right away. There isn't a
> conscious "switch" which is thrown by anyone. This is essentially
> the same way we treat Ubuntu LTS point releases as well. If this is
> _not_ the way RHEL/CentOS are intended to be consumed (i.e. just
> upgrade to and run the latest packages available for a given major
> release series) then we should perhaps take a step back and
> reevaluate this model.


I think you may be conflating the notion that ubuntu or rhel/cent can be
updated w/o any issues for applications that run atop the distributions
with what it means to introduce a minor update into the upstream openstack
ci workflow.

If jobs could execute w/o a timeout the tripleo jobs would not have gone
red.  Since we do have constraints in the upstream, like timeouts and
others, we have to prepare containers, images, etc. to work efficiently in
the upstream.  For example, if our jobs had the time to yum update the
roughly 120 containers in play in each job, the tripleo jobs would have just
worked.  I am not advocating for not having timeouts or constraints on
jobs; however, I am saying this is an infra issue, not a distribution or
distribution support issue.

I think this is an important point to consider and I view it as mostly
unrelated to the support claims by the distribution.  Does that make sense?
Thanks




> For now we have some fairly deep-driven
> assumptions in that regard which are reflected in the Linux
> distributions support policy of our project testing interface as
> documented in OpenStack governance.
> --
> Jeremy Stanley
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Tristan Cacqueray

On May 14, 2018 2:44 am, Wesley Hayutin wrote:
[snip]

I do think it would be helpful to say have a one week change window where
folks are given the opportunity to preflight check a new image and the
potential impact on the job workflow the updated image may have.

[snip]

How about adding a periodic job that sets up centos-release-cr in a pre
task? This should highlight issues with upcoming updates:
https://wiki.centos.org/AdditionalResources/Repositories/CR
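
A minimal sketch of what that could look like in Zuul v3 syntax (the job and
playbook names below are made up; only the CR pre-run idea is the point):

  # Hypothetical periodic job: reuse an existing CentOS job, but enable the
  # CR repo in a pre-run playbook before the normal run phase.
  - job:
      name: tripleo-ci-centos-7-cr
      parent: tripleo-ci-centos-7-containers-multinode  # placeholder parent
      pre-run: playbooks/enable-centos-cr.yaml
      voting: false

  - project:
      periodic:
        jobs:
          - tripleo-ci-centos-7-cr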

-Tristan




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Jeremy Stanley
On 2018-05-13 20:44:25 -0600 (-0600), Wesley Hayutin wrote:
[...]
> I do think it would be helpful to say have a one week change
> window where folks are given the opportunity to preflight check a
> new image and the potential impact on the job workflow the updated
> image may have. If I could update or create a non-voting job w/
> the new image that would provide two things.
> 
> 1. The first is the head's up, this new minor version of centos is
> coming into the system and you have $x days to deal with it.
> 
> 2. The ability to build a few non-voting jobs w/ the new image to
> see what kind of impact it has on the workflow and deployments.
[...]

While I can see where you're coming from, right now even the Infra
team doesn't know immediately when a new CentOS minor release starts
to be used. The packages show up in the mirrors automatically and
images begin to be built with them right away. There isn't a
conscious "switch" which is thrown by anyone. This is essentially
the same way we treat Ubuntu LTS point releases as well. If this is
_not_ the way RHEL/CentOS are intended to be consumed (i.e. just
upgrade to and run the latest packages available for a given major
release series) then we should perhaps take a step back and
reevaluate this model. For now we have some fairly deep-driven
assumptions in that regard which are reflected in the Linux
distributions support policy of our project testing interface as
documented in OpenStack governance.
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Wesley Hayutin
On Sun, May 13, 2018 at 11:25 AM Jeremy Stanley  wrote:

> On 2018-05-13 08:25:25 -0600 (-0600), Wesley Hayutin wrote:
> [...]
> > We need to in coordination with the infra team be able to pin / lock
> > content for production check and gate jobs while also have the ability to
> > stage new content e.g. centos 7.5 with experimental or periodic jobs.
> [...]
>
> It looks like adjustments would be needed to DIB's centos-minimal
> element if we want to be able to pin it to specific minor releases.
> However, having to rotate out images in the fashion described would
> be a fair amount of manual effort and seems like it would violate
> our support expectations in governance if we end up pinning to older
> minor versions (for major LTS versions on the other hand, we expect
> to undergo this level of coordination but they come at a much slower
> pace with a lot more advance warning). If we need to add controlled
> roll-out of CentOS minor version updates, this is really no better
> than Fedora from the Infra team's perspective and we've already said
> we can't make stable branch testing guarantees for Fedora due to the
> complexity involved in using different releases for each branch and
> the need to support our stable branches longer than the distros are
> supporting the releases on which we're testing.
>

This is good insight Jeremy, thanks for replying.



>
> For example, how long would the distro maintainers have committed to
> supporting RHEL 7.4 after 7.5 was released? Longer than we're
> committing to extended maintenance on our stable/queens branches? Or
> would you expect projects to still continue to backport support for
> these minor platform bumps to all their stable branches too? And
> what sort of grace period should we give them before we take away
> the old versions? Also, how many minor versions of CentOS should we
> expect to end up maintaining in parallel? (Remember, every
> additional image means that much extra time to build and upload to
> all our providers, as well as that much more storage on our builders
> and in our Glance quotas.)
> --
> Jeremy Stanley
>

I think you may be describing a level of support that is far greater than
what I was thinking. I also don't want to tax the infra team w/ n+ versions
of the baseos to support.
I do think it would be helpful to, say, have a one-week change window where
folks are given the opportunity to preflight-check a new image and the
potential impact the updated image may have on the job workflow.  If I
could update or create a non-voting job w/ the new image, that would provide
two things.

1. The first is the heads-up: this new minor version of centos is coming
into the system and you have $x days to deal with it.
2. The ability to build a few non-voting jobs w/ the new image to see what
kind of impact it has on the workflow and deployments (see the sketch below).
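
If a staged image were published under its own label (which, as noted elsewhere
in the thread, nodepool does not support today), the consuming side of point 2
could be a simple non-voting check job; the job, parent and label names below
are all hypothetical:

  # Hypothetical non-voting job pinned to a staged "next minor release" image.
  - job:
      name: tripleo-ci-centos-7-staged-image
      parent: tripleo-ci-centos-7-containers-multinode  # placeholder parent
      voting: false
      nodeset:
        nodes:
          - name: primary
            label: centos-7-next  # hypothetical staged-image label

  - project:
      check:
        jobs:
          - tripleo-ci-centos-7-staged-image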

In this case the updated 7.5 CentOS image worked fine w/ TripleO; however,
it did cause our gates to go red because:
a. when we update containers w/ zuul dependencies all the base-os
updates were pulled in and jobs timed out.
b. a kernel bug workaround with virt-customize failed to work due to the
kernel packages changing ( 3rd party job )
c. the containers we use were not yet at CentOS 7.5 but the bm image was,
causing issues w/ pacemaker.
d. there may be a few more that I am forgetting, but hopefully the point is
made.

We can fix a lot of the issues, and I'm not blaming anyone, because if we
(tripleo) had thought of all the corner cases with our workflow we would have
been able to avoid some of these issues.  However, it does seem like we get
hit by $something every time we update a minor version of the baseos.  My
preference would be to have a heads-up and work through the issues rather
than go immediately red and unable to merge patches.  I don't know if other
teams get impacted in similar ways, and I understand this is a big ship
and updating CentOS may work just fine for everyone else.

Thanks all for your time and effort!




> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Matt Young
Re: resolving network latency issue on the promotion server in
tripleo-infra tenant, that's great news!

Re: retrospective on this class of issue, I'll reach out directly early
this week to get something on the calendar for our two teams.  We clearly
need to brainstorm/hash out together how we can reduce the turbulence
moving forward.

In addition, as a result of working these issues over the past few days
we've identified a few pieces of low-hanging (tooling) fruit that are ripe
for improvements that will speed diagnosis / debug in the future.
We'll capture these as RFEs and get them into our backlog.

Matt

On Sun, May 13, 2018 at 10:25 AM, Wesley Hayutin 
wrote:

>
>
> On Sat, May 12, 2018 at 11:45 PM Emilien Macchi 
> wrote:
>
>> On Sat, May 12, 2018 at 9:10 AM, Wesley Hayutin 
>> wrote:
>>>
>>> 2. Shortly after #1 was resolved CentOS released 7.5 which comes
>>> directly into the upstream repos untested and ungated.  Additionally the
>>> associated qcow2 image and container-base images were not updated at the
>>> same time as the yum repos.  https://bugs.launchpad.net/tripleo/+bug/
>>> 1770355
>>>
>>
>> Why do we have this situation every time the OS is upgraded to a major
>> version? Can't we test the image before actually using it? We could have
>> experimental jobs testing latest image and pin gate images to a specific
>> one?
>> Like we could configure infra to deploy centos 7.4 in our gate and 7.5 in
>> experimental, so we can take our time to fix eventual problems and make the
>> switch when we're ready, instead of dealing with fires (that usually come
>> all together).
>>
>> It would be great to make a retrospective on this thing between tripleo
>> ci & infra folks, and see how we can improve things.
>>
>
> I agree,
> We need to in coordination with the infra team be able to pin / lock
> content for production check and gate jobs while also have the ability to
> stage new content e.g. centos 7.5 with experimental or periodic jobs.
> In this particular case the ci team did check the tripleo deployment w/
> centos 7.5 updates, however we did not stage or test what impact the centos
> minor update would have on the upstream job workflow.
> The key issue is that the base centos image used upstream can not be
> pinned by the ci team, if say we could pin that image the ci team could pin
> the centos repos used in ci and run staging jobs on the latest centos
> content.
>
> I'm glad that you also see the need for some amount of coordination here,
> I've been in contact with a few folks to initiate the conversation.
>
> In an unrelated note, Sagi and I just fixed the network latency issue on
> our promotion server, it was related to DNS.  Automatic promotions should
> be back online.
> Thanks all.
>
>
>> --
>> Emilien Macchi
>> 
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:
>> unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Jeremy Stanley
On 2018-05-13 08:25:25 -0600 (-0600), Wesley Hayutin wrote:
[...]
> We need to in coordination with the infra team be able to pin / lock
> content for production check and gate jobs while also have the ability to
> stage new content e.g. centos 7.5 with experimental or periodic jobs.
[...]

It looks like adjustments would be needed to DIB's centos-minimal
element if we want to be able to pin it to specific minor releases.
However, having to rotate out images in the fashion described would
be a fair amount of manual effort and seems like it would violate
our support expectations in governance if we end up pinning to older
minor versions (for major LTS versions on the other hand, we expect
to undergo this level of coordination but they come at a much slower
pace with a lot more advance warning). If we need to add controlled
roll-out of CentOS minor version updates, this is really no better
than Fedora from the Infra team's perspective and we've already said
we can't make stable branch testing guarantees for Fedora due to the
complexity involved in using different releases for each branch and
the need to support our stable branches longer than the distros are
supporting the releases on which we're testing.

For example, how long would the distro maintainers have committed to
supporting RHEL 7.4 after 7.5 was released? Longer than we're
committing to extended maintenance on our stable/queens branches? Or
would you expect projects to still continue to backport support for
these minor platform bumps to all their stable branches too? And
what sort of grace period should we give them before we take away
the old versions? Also, how many minor versions of CentOS should we
expect to end up maintaining in parallel? (Remember, every
additional image means that much extra time to build and upload to
all our providers, as well as that much more storage on our builders
and in our Glance quotas.)
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Wesley Hayutin
On Sat, May 12, 2018 at 11:45 PM Emilien Macchi  wrote:

> On Sat, May 12, 2018 at 9:10 AM, Wesley Hayutin 
> wrote:
>>
>> 2. Shortly after #1 was resolved CentOS released 7.5 which comes directly
>> into the upstream repos untested and ungated.  Additionally the associated
>> qcow2 image and container-base images were not updated at the same time as
>> the yum repos.  https://bugs.launchpad.net/tripleo/+bug/1770355
>>
>
> Why do we have this situation every time the OS is upgraded to a major
> version? Can't we test the image before actually using it? We could have
> experimental jobs testing latest image and pin gate images to a specific
> one?
> Like we could configure infra to deploy centos 7.4 in our gate and 7.5 in
> experimental, so we can take our time to fix eventual problems and make the
> switch when we're ready, instead of dealing with fires (that usually come
> all together).
>
> It would be great to make a retrospective on this thing between tripleo ci
> & infra folks, and see how we can improve things.
>

I agree,
We need to, in coordination with the infra team, be able to pin / lock
content for production check and gate jobs while also having the ability to
stage new content, e.g. centos 7.5, with experimental or periodic jobs.
In this particular case the ci team did check the tripleo deployment w/
centos 7.5 updates; however, we did not stage or test what impact the centos
minor update would have on the upstream job workflow.
The key issue is that the base centos image used upstream cannot be pinned
by the ci team; if we could pin that image, the ci team could pin the
centos repos used in ci and run staging jobs on the latest centos content.

I'm glad that you also see the need for some amount of coordination here,
I've been in contact with a few folks to initiate the conversation.

On an unrelated note, Sagi and I just fixed the network latency issue on
our promotion server; it was related to DNS.  Automatic promotions should
be back online.
Thanks all.


> --
> Emilien Macchi
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-13 Thread Jeremy Stanley
On 2018-05-12 20:44:04 -0700 (-0700), Emilien Macchi wrote:
[...]
> Why do we have this situation every time the OS is upgraded to a major
> version? Can't we test the image before actually using it? We could have
> experimental jobs testing latest image and pin gate images to a specific
> one?
> 
> Like we could configure infra to deploy centos 7.4 in our gate and 7.5 in
> experimental, so we can take our time to fix eventual problems and make the
> switch when we're ready, instead of dealing with fires (that usually come
> all together).
> 
> It would be great to make a retrospective on this thing between tripleo ci
> & infra folks, and see how we can improve things.

In the past we've trusted statements from Red Hat that you should be
able to upgrade to newer point releases without experiencing
backward-incompatible breakage. Right now all our related tooling is
based on the assumption we made in governance that we can just
treat, e.g., RHEL/CentOS 7 as a long-term stable release
distribution similar to an Ubuntu LTS and not have to worry about
tracking individual point releases.

If this is not actually the case any longer, we should likely
reevaluate our support claims.
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-12 Thread Emilien Macchi
On Sat, May 12, 2018 at 9:10 AM, Wesley Hayutin  wrote:
>
> 2. Shortly after #1 was resolved CentOS released 7.5 which comes directly
> into the upstream repos untested and ungated.  Additionally the associated
> qcow2 image and container-base images were not updated at the same time as
> the yum repos.  https://bugs.launchpad.net/tripleo/+bug/1770355
>

Why do we have this situation every time the OS is upgraded to a major
version? Can't we test the image before actually using it? We could have
experimental jobs testing the latest image and pin gate images to a specific
one?
Like we could configure infra to deploy centos 7.4 in our gate and 7.5 in
experimental, so we can take our time to fix eventual problems and make the
switch when we're ready, instead of dealing with fires (that usually come
all together).

It would be great to hold a retrospective on this between the tripleo ci
& infra folks, and see how we can improve things.
-- 
Emilien Macchi
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

2018-05-12 Thread Wesley Hayutin
On Wed, May 9, 2018 at 10:43 PM Wesley Hayutin  wrote:

> FYI.. https://bugs.launchpad.net/tripleo/+bug/1770298
>
> I'm on #openstack-infra chatting w/ Ian atm.
> Thanks
>
>
Greetings,

I wanted to update everyone on the status of the upstream tripleo check
and gate jobs.
There have been a series of infra-related issues that caused the upstream
tripleo gates to go red.

1. The first issue hit was
https://bugs.launchpad.net/tripleo/+bug/1770298 which
caused package install errors
2. Shortly after #1 was resolved CentOS released 7.5 which comes directly
into the upstream repos untested and ungated.  Additionally the associated
qcow2 image and container-base images were not updated at the same time as
the yum repos.  https://bugs.launchpad.net/tripleo/+bug/1770355
3.  Related to #2 the container and bm image rpms were not in sync causing
https://bugs.launchpad.net/tripleo/+bug/1770692
4. Building the bm images was failing due to an open issue with the centos
kernel, thanks to Yatin and Alfredo for
https://review.rdoproject.org/r/#/c/13737/
5. To ensure the containers are updated to the latest rpms at build time,
we have the following patch from Alex
https://review.openstack.org/#/c/567636/.
6.  I also noticed that we are building the centos-base container in our
container build jobs; however, it is not pushed out to the container
registries because it is not included in the tripleo-common repo.

I would like to discuss this with some of the folks working on containers.
If we had an updated centos-base container I think some of these issues
would have been prevented.

The above issues were resolved, and the master promotion jobs all had
passed.  Thanks to all who were involved!

Once the promotion jobs passed and reported status to the dlrn_api, a
promotion was triggered automatically to upload the promoted images,
containers, and updated dlrn hash.  This failed due to network latency in the
tenant where the tripleo-ci infra is hosted.  The issue is tracked here:
https://bugs.launchpad.net/tripleo/+bug/1770860

Matt Young and I worked well into the evening on Friday to diagnose
the issue and ended up having to execute the image, container and dlrn_hash
promotion outside of our tripleo-infra tenant.  Thanks to Matt for his
effort.

At the moment I have updated the ci status in #tripleo; the master check
and gate jobs are green in the upstream, which should unblock merging most
patches.  The status of stable branches and third-party ci is still being
investigated.

Automatic promotions are blocked until the network issues in the
tripleo-infra tenant are resolved.  The bug is marked with alert in
#tripleo.  Please see #tripleo for future status updates.

Thanks all
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev