Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-21 Thread Paul Belanger
On Mon, Aug 21, 2017 at 10:43:07AM +1200, Steve Baker wrote:
> On Thu, Aug 17, 2017 at 4:13 PM, Steve Baker  wrote:
> 
> >
> >
> > On Thu, Aug 17, 2017 at 10:47 AM, Emilien Macchi 
> > wrote:
> >
> >>
> >> > Problem #3: from Ocata to Pike: all container images are
> >> > uploaded/specified, even for services not deployed
> >> > https://bugs.launchpad.net/tripleo/+bug/1710992
> >> > The CI jobs are timing out during the upgrade process because
> >> > downloading + uploading _all_ containers in local cache takes more
> >> > than 20 minutes.
> >> > So this is where we are now, upgrade jobs timeout on that. Steve Baker
> >> > is currently looking at it but we'll probably offer some help.
> >>
> >> Steve is still working on it: https://review.openstack.org/#/c/448328/
> >> Steve, if you need any help (reviewing or coding) - please let us
> >> know, as we consider this thing important to have and probably good to
> >> have in Pike.
> >>
> >
> > I have a couple of changes up now, one to capture the relationship between
> > images and services[1], and another to add an argument to the prepare
> > command to filter the image list based on which services are containerised
> > [2]. Once these land, all the calls to prepare in CI can be modified to
> > also specify these heat environment files, and this will reduce uploads to
> > only the images required.
> >
> > [1] https://review.openstack.org/#/c/448328/
> > [2] https://review.openstack.org/#/c/494367/
> >
> >
> Just updating progress on this: with infra caching of docker.io I'm
> seeing transfer times of 16 minutes (an improvement on the previous
> 20+ minutes, which hit the timeout).
> 
> Only transferring the required images [3] reduces this to 8 minutes.
> 
> [3] https://review.openstack.org/#/c/494767/

I'd still like to have the docker daemon running with debug: true, just for
peace of mind. In our testing of the cache, it was possible for docker to
silently fall back from the reverse proxy cache and hit docker.io directly.
Regardless, this is good news.
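
To make that concrete, here is a minimal sketch of what I have in mind
(assuming the nodes use /etc/docker/daemon.json and a systemd-managed docker
service; both are assumptions on my part, adjust for the actual CI images):

    import json
    import subprocess

    DAEMON_JSON = "/etc/docker/daemon.json"

    def enable_docker_debug():
        # Merge "debug": true into the existing daemon config, then restart.
        try:
            with open(DAEMON_JSON) as f:
                conf = json.load(f)
        except (IOError, ValueError):
            conf = {}
        conf["debug"] = True
        with open(DAEMON_JSON, "w") as f:
            json.dump(conf, f, indent=2)
        subprocess.check_call(["systemctl", "restart", "docker"])

    # With debug on, `journalctl -u docker` shows which registries the daemon
    # actually talks to, so a silent fallback to docker.io is visible in the logs.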

Because of the size of the containers we are talking about here, I think it is
a great idea to only download / cache the images that will actually be used by
the job.

Let me know if you see any issues.

> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-20 Thread Steve Baker
On Thu, Aug 17, 2017 at 4:13 PM, Steve Baker  wrote:

>
>
> On Thu, Aug 17, 2017 at 10:47 AM, Emilien Macchi 
> wrote:
>
>>
>> > Problem #3: from Ocata to Pike: all container images are
>> > uploaded/specified, even for services not deployed
>> > https://bugs.launchpad.net/tripleo/+bug/1710992
>> > The CI jobs are timing out during the upgrade process because
>> > downloading + uploading _all_ containers in local cache takes more
>> > than 20 minutes.
>> > So this is where we are now, upgrade jobs timeout on that. Steve Baker
>> > is currently looking at it but we'll probably offer some help.
>>
>> Steve is still working on it: https://review.openstack.org/#/c/448328/
>> Steve, if you need any help (reviewing or coding) - please let us
>> know, as we consider this thing important to have and probably good to
>> have in Pike.
>>
>
> I have a couple of changes up now, one to capture the relationship between
> images and services[1], and another to add an argument to the prepare
> command to filter the image list based on which services are containerised
> [2]. Once these land, all the calls to prepare in CI can be modified to
> also specify these heat environment files, and this will reduce uploads to
> only the images required.
>
> [1] https://review.openstack.org/#/c/448328/
> [2] https://review.openstack.org/#/c/494367/
>
>
Just updating progress on this: with infra caching of docker.io I'm
seeing transfer times of 16 minutes (an improvement on the previous
20+ minutes, which hit the timeout).

Only transferring the required images [3] reduces this to 8 minutes.

[3] https://review.openstack.org/#/c/494767/
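
For anyone who wants to break the remaining time down per image, here's a
rough sketch of the kind of loop that would do it (the registry address and
image list are placeholders, not what the CI scripts actually run):

    import subprocess
    import time

    LOCAL_REGISTRY = "192.168.24.1:8787"  # placeholder local registry
    IMAGES = [
        "tripleoupstream/centos-binary-nova-api:latest",  # placeholder entry
    ]

    def mirror_image(image):
        # Pull from the source, retag and push into the local registry.
        target = "%s/%s" % (LOCAL_REGISTRY, image.split("/", 1)[1])
        start = time.time()
        subprocess.check_call(["docker", "pull", image])
        subprocess.check_call(["docker", "tag", image, target])
        subprocess.check_call(["docker", "push", target])
        return time.time() - start

    for image in IMAGES:
        print("%-60s %6.1fs" % (image, mirror_image(image)))

A breakdown like this makes it easy to see whether the remaining 8 minutes are
mostly pulls from the mirror or pushes into the local registry.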
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-17 Thread Jiří Stránský

On 17.8.2017 00:47, Emilien Macchi wrote:

Here's an update on the situation.

On Tue, Aug 15, 2017 at 6:33 PM, Emilien Macchi  wrote:

Problem #1: Upgrade jobs timeout from Newton to Ocata
https://bugs.launchpad.net/tripleo/+bug/1702955

[...]

- revert distgit patch in RDO: https://review.rdoproject.org/r/8575
- push https://review.openstack.org/#/c/494334/ as a temporary solution
- we need https://review.openstack.org/#/c/489874/ landed ASAP.
- once https://review.openstack.org/#/c/489874/ is landed, we need to
revert https://review.openstack.org/#/c/494334 ASAP.

We still need some help to find out why upgrade jobs time out so often
in stable/ocata.


Problem #2: from Ocata to Pike (containerized) missing container upload step
https://bugs.launchpad.net/tripleo/+bug/1710938
Wes has a patch (thanks!) that is currently in the gate:
https://review.openstack.org/#/c/493972

[...]

The patch worked and helped! We've got a successful job running today:
http://logs.openstack.org/00/461000/32/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2f13627/console.html#_2017-08-16_01_31_32_009061

We're now pushing to the next step: testing the upgrade with pingtest.
See https://review.openstack.org/#/c/494268/ and the Depends-On: on
https://review.openstack.org/#/c/461000/.

If pingtest proves to work, it will be good news and show that we
have a basic workflow in place on which we can iterate.

The next iterations afterwards will be to work on the 4 scenarios that
are also going to be upgraded from Ocata to Pike (001 to 004).
For that, we'll need Problems #1 and #2 resolved before we make any
progress here, so we don't hit the same issues as before.


Problem #3: from Ocata to Pike: all container images are
uploaded/specified, even for services not deployed
https://bugs.launchpad.net/tripleo/+bug/1710992
The CI jobs are timing out during the upgrade process because
downloading + uploading _all_ containers in local cache takes more
than 20 minutes.
So this is where we are now, upgrade jobs timeout on that. Steve Baker
is currently looking at it but we'll probably offer some help.


Steve is still working on it: https://review.openstack.org/#/c/448328/
Steve, if you need any help (reviewing or coding), please let us
know, as we consider this important to have, ideally in Pike.


An independent but related issue is that the job doesn't make use of
CI-local registry mirrors. I seem to recall we already had mirror usage
implemented at some point, but we must have lost it somehow. The fix is here:

https://review.openstack.org/#/c/494525/
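
For reference, a sketch of the end state on a CI node -- the docker daemon
pointed at a region-local mirror (this is not what the review does verbatim,
and the mirror URL below is just a placeholder for the job's real proxy/AFS
mirror):

    import json
    import subprocess

    MIRROR = "http://mirror.regionone.example.org:8081/registry-1.docker"  # placeholder

    def configure_registry_mirror(path="/etc/docker/daemon.json"):
        # Add the mirror to registry-mirrors in daemon.json and restart docker.
        try:
            with open(path) as f:
                conf = json.load(f)
        except (IOError, ValueError):
            conf = {}
        mirrors = conf.setdefault("registry-mirrors", [])
        if MIRROR not in mirrors:
            mirrors.append(MIRROR)
        with open(path, "w") as f:
            json.dump(conf, f, indent=2)
        subprocess.check_call(["systemctl", "restart", "docker"])

Combined with Paul's debug suggestion, the daemon logs then show whether pulls
really go through the mirror or silently fall back to docker.io.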

Jirka







__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Steve Baker
On Thu, Aug 17, 2017 at 10:47 AM, Emilien Macchi  wrote:

>
> > Problem #3: from Ocata to Pike: all container images are
> > uploaded/specified, even for services not deployed
> > https://bugs.launchpad.net/tripleo/+bug/1710992
> > The CI jobs are timing out during the upgrade process because
> > downloading + uploading _all_ containers in local cache takes more
> > than 20 minutes.
> > So this is where we are now, upgrade jobs timeout on that. Steve Baker
> > is currently looking at it but we'll probably offer some help.
>
> Steve is still working on it: https://review.openstack.org/#/c/448328/
> Steve, if you need any help (reviewing or coding) - please let us
> know, as we consider this thing important to have and probably good to
> have in Pike.
>

I have a couple of changes up now: one to capture the relationship between
images and services [1], and another to add an argument to the prepare
command to filter the image list based on which services are containerised
[2]. Once these land, all the calls to prepare in CI can be modified to
also specify these heat environment files, which will reduce uploads to
only the images required.

[1] https://review.openstack.org/#/c/448328/
[2] https://review.openstack.org/#/c/494367/
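
To give a feel for the idea without anyone having to read the reviews, here is
a purely illustrative sketch (not the actual tripleoclient code from [1]/[2]):
each image entry carries the services that need it, and prepare keeps only the
images whose services appear in the resource_registry of the environment files
it was given:

    import yaml

    def containerized_services(env_files):
        # Collect service resource types that are mapped to docker templates.
        services = set()
        for path in env_files:
            with open(path) as f:
                env = yaml.safe_load(f) or {}
            for name, template in (env.get("resource_registry") or {}).items():
                if "docker" in str(template):
                    services.add(name)
        return services

    def filter_images(images, env_files):
        # images: list of dicts like {"imagename": ..., "services": [...]}
        wanted = containerized_services(env_files)
        return [img for img in images
                if not img.get("services") or wanted & set(img["services"])]

Images with no service annotation are kept, so anything not yet mapped still
gets uploaded rather than silently dropped.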
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Emilien Macchi
On Wed, Aug 16, 2017 at 3:47 PM, Emilien Macchi  wrote:
> Here's an update on the situation.
>
> On Tue, Aug 15, 2017 at 6:33 PM, Emilien Macchi  wrote:
>> Problem #1: Upgrade jobs timeout from Newton to Ocata
>> https://bugs.launchpad.net/tripleo/+bug/1702955
> [...]
>
> - revert distgit patch in RDO: https://review.rdoproject.org/r/8575
> - push https://review.openstack.org/#/c/494334/ as a temporary solution
> - we need https://review.openstack.org/#/c/489874/ landed ASAP.
> - once https://review.openstack.org/#/c/489874/ is landed, we need to
> revert https://review.openstack.org/#/c/494334 ASAP.
>
> We still need some help to find out why upgrade jobs timeout so much
> in stable/ocata.
>
>> Problem #2: from Ocata to Pike (containerized) missing container upload step
>> https://bugs.launchpad.net/tripleo/+bug/1710938
>> Wes has a patch (thanks!) that is currently in the gate:
>> https://review.openstack.org/#/c/493972
> [...]
>
> The patch worked and helped! We've got a successful job running today:
> http://logs.openstack.org/00/461000/32/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2f13627/console.html#_2017-08-16_01_31_32_009061
>
> We're now pushing to the next step: testing the upgrade with pingtest.
> See https://review.openstack.org/#/c/494268/ and the Depends-On: on
> https://review.openstack.org/#/c/461000/.
>
> If pingtest proves to work, it would be a good news and prove that we
> have a basic workflow in place on which we can iterate.

Pingtest doesn't work:
http://logs.openstack.org/00/461000/37/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/1beac0e/logs/undercloud/home/jenkins/overcloud_validate.log.txt.gz#_2017-08-17_01_03_09

We need to investigate and find out why.
If nobody looks at it before then, I'll take a crack at it tomorrow.

> The next iterations afterward would be to work on the 4 scenarios that
> are also going to be upgrades from Ocata to pike (001 to 004).
> For that, we'll need Problem #1 and #2 resolved before we want to make
> any progress here, to not hit the same issues that before.
>
>> Problem #3: from Ocata to Pike: all container images are
>> uploaded/specified, even for services not deployed
>> https://bugs.launchpad.net/tripleo/+bug/1710992
>> The CI jobs are timing out during the upgrade process because
>> downloading + uploading _all_ containers in local cache takes more
>> than 20 minutes.
>> So this is where we are now, upgrade jobs timeout on that. Steve Baker
>> is currently looking at it but we'll probably offer some help.
>
> Steve is still working on it: https://review.openstack.org/#/c/448328/
> Steve, if you need any help (reviewing or coding) - please let us
> know, as we consider this thing important to have and probably good to
> have in Pike.
>
> Thanks,
> --
> Emilien Macchi



-- 
Emilien Macchi

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Emilien Macchi
Here's an update on the situation.

On Tue, Aug 15, 2017 at 6:33 PM, Emilien Macchi  wrote:
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
[...]

- revert distgit patch in RDO: https://review.rdoproject.org/r/8575
- push https://review.openstack.org/#/c/494334/ as a temporary solution
- we need https://review.openstack.org/#/c/489874/ landed ASAP.
- once https://review.openstack.org/#/c/489874/ is landed, we need to
revert https://review.openstack.org/#/c/494334 ASAP.

We still need some help to find out why upgrade jobs time out so often
in stable/ocata.

> Problem #2: from Ocata to Pike (containerized) missing container upload step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
[...]

The patch worked and helped! We've got a successful job running today:
http://logs.openstack.org/00/461000/32/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2f13627/console.html#_2017-08-16_01_31_32_009061

We're now pushing to the next step: testing the upgrade with pingtest.
See https://review.openstack.org/#/c/494268/ and the Depends-On: on
https://review.openstack.org/#/c/461000/.

If pingtest proves to work, it will be good news and show that we
have a basic workflow in place on which we can iterate.

The next iterations afterwards will be to work on the 4 scenarios that
are also going to be upgraded from Ocata to Pike (001 to 004).
For that, we'll need Problems #1 and #2 resolved before we make any
progress here, so we don't hit the same issues as before.

> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers in local cache takes more
> than 20 minutes.
> So this is where we are now, upgrade jobs timeout on that. Steve Baker
> is currently looking at it but we'll probably offer some help.

Steve is still working on it: https://review.openstack.org/#/c/448328/
Steve, if you need any help (reviewing or coding), please let us
know, as we consider this important to have, ideally in Pike.

Thanks,
-- 
Emilien Macchi

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Paul Belanger
On Tue, Aug 15, 2017 at 11:06:20PM -0400, Wesley Hayutin wrote:
> On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi  wrote:
> 
> > So far, we're having 3 critical issues, that we all need to address as
> > soon as we can.
> >
> > Problem #1: Upgrade jobs timeout from Newton to Ocata
> > https://bugs.launchpad.net/tripleo/+bug/1702955
> > Today I spent an hour to look at it and here's what I've found so far:
> > depending on which public cloud we're running the TripleO CI jobs, it
> > timeouts or not.
> > Here's an example of Heat resources that run in our CI:
> > https://www.diffchecker.com/VTXkNFuk
> > On the left, resources on a job that failed (running on internap) and
> > on the right (running on citycloud) it worked.
> > I've been through all upgrade steps and I haven't seen specific tasks
> > that take more time here or here, but some little changes that make
> > the big change at the end (so hard to debug).
> > Note: both jobs use AFS mirrors.
> > Help on that front would be very welcome.
> >
> >
> > Problem #2: from Ocata to Pike (containerized) missing container upload
> > step
> > https://bugs.launchpad.net/tripleo/+bug/1710938
> > Wes has a patch (thanks!) that is currently in the gate:
> > https://review.openstack.org/#/c/493972
> > Thanks to that work, we managed to find the problem #3.
> >
> >
> > Problem #3: from Ocata to Pike: all container images are
> > uploaded/specified, even for services not deployed
> > https://bugs.launchpad.net/tripleo/+bug/1710992
> > The CI jobs are timing out during the upgrade process because
> > downloading + uploading _all_ containers in local cache takes more
> > than 20 minutes.
> > So this is where we are now, upgrade jobs timeout on that. Steve Baker
> > is currently looking at it but we'll probably offer some help.
> >
> >
> > Solutions:
> > - for stable/ocata: make upgrade jobs non-voting
> > - for pike: keep upgrade jobs non-voting and release without upgrade
> > testing
> >
> > Risks:
> > - for stable/ocata: it's highly possible to inject regression if jobs
> > aren't voting anymore.
> > - for pike: the quality of the release won't be good enough in term of
> > CI coverage comparing to Ocata.
> >
> > Mitigations:
> > - for stable/ocata: make jobs non-voting and enforce our
> > core-reviewers to pay double attention on what is landed. It should be
> > temporary until we manage to fix the CI jobs.
> > - for master: release RC1 without upgrade jobs and make progress
> > - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> > somewhere with resources and without timeout constraints.
> >
> > I would like some feedback on the proposal so we can move forward this
> > week,
> > Thanks.
> > --
> > Emilien Macchi
> >
> 
> I think due to some of the limitations with run times upstream we may need
> to rethink the workflow with upgrade tests upstream. It's not very clear to
> me what can be done with the multinode nodepool jobs outside of what is
> already being done.  I think we do have some choices with ovb jobs.   I'm
> not going to try and solve in this email but rethinking how we CI upgrades
> in the upstream infrastructure should be a focus for the Queens PTG.  We
> will need to focus on bringing run times significantly down as it's
> incredibly difficult to run two installs in 175 minutes across all the
> upstream cloud providers.
> 
Can you explain in more detail where the bottlenecks are within the 175 mins?
That's just shy of 3 hours, and seems like more than enough time.

Not that it can be solved now, but maybe it is time to look at these jobs the
other way around: how can we make them faster, and what optimizations need to
be made?

One example: we spend a lot of time rebuilding RPM packages with DLRN.  It is
possible that in zuulv3 we'll be able to change the CI workflow so that only
one node builds the packages, and all other jobs then download the new
packages from that node.

Another thing we can look at is more parallel testing in place of serial. I
can't point to anything specific, but it would be helpful to sit down with
somebody to better understand all the back and forth between undercloud /
overcloud / multinode / etc.
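
Not a concrete proposal, just to show the shape of "parallel instead of
serial" for steps that don't depend on each other (the step names below are
placeholders, not actual CI steps):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    INDEPENDENT_STEPS = [
        ["bash", "-c", "echo pull overcloud container images"],
        ["bash", "-c", "echo build test packages with DLRN"],
        ["bash", "-c", "echo prepare overcloud images"],
    ]

    def run(cmd):
        # Run one step and return its exit code alongside the command.
        return cmd, subprocess.call(cmd)

    with ThreadPoolExecutor(max_workers=len(INDEPENDENT_STEPS)) as pool:
        for cmd, rc in pool.map(run, INDEPENDENT_STEPS):
            print(rc, " ".join(cmd))

The hard part is obviously identifying which steps really are independent,
which is exactly the sitting-down-with-somebody part.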

> Thanks Emilien for all the work you have done around upgrades!
> 
> 
> 
> >
> > __
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >

> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 

Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Emilien Macchi
On Wed, Aug 16, 2017 at 3:17 AM, Bogdan Dobrelya  wrote:
> We could limit the scope of the upstream multinode jobs to only do upgrade
> testing of a couple of the deployed services, like keystone, nova and
> neutron, or so.

That would be a huge regression in our CI. Strong -2 on this idea.
We worked hard to get pretty decent coverage during Ocata; we're
not going to give it up easily.
-- 
Emilien Macchi

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Emilien Macchi
On Wed, Aug 16, 2017 at 12:37 AM, Marios Andreou  wrote:
> For Newton to Ocata, is it consistent which clouds we are timing out on?

It's not consistent, but the failure rate is very high:
http://cistatus.tripleo.org/

gate-tripleo-ci-centos-7-multinode-upgrades - 30 % of success this week
gate-tripleo-ci-centos-7-scenario001-multinode-upgrades - 13% of
success this week
gate-tripleo-ci-centos-7-scenario002-multinode-upgrades - 34% of
success this week
gate-tripleo-ci-centos-7-scenario003-multinode-upgrades - 78% of
success this week

(results on stable/ocata)

So as you can see, the results are not good at all for gate jobs.

> for master, +1 I think this is essentially what I am saying above for O...P
> - sounds like problem 2 is well in progress from weshay and the other
> container/image related problem 3 is the main outstanding item. Since RC1 is
> this week I think what you are proposing as mitigation is fair. So we
> re-evaluate making these jobs voting before the final RCs end of August

We might need to help him, and see how we can accelerate this work now.

> thanks for putting this together. I think if we really had to pick one the
> O..P ci has priority obviously this week (!)... I think the container/images
> related issues for O...P are both expected/teething issues from the huge
> amount of work done by the containerization team and can hopefully be
> resolved quickly.

I agree, the priority is O..P for now - getting these upgrade jobs working.
Note that the upgrade scenarios are not working correctly on master
yet; we'll need to figure that out as well. If you can help take a
look, that would be awesome.

Thanks,
-- 
Emilien Macchi

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Bogdan Dobrelya
On 16.08.2017 3:33, Emilien Macchi wrote:
> So far, we're having 3 critical issues, that we all need to address as
> soon as we can.
> 
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> Today I spent an hour to look at it and here's what I've found so far:
> depending on which public cloud we're running the TripleO CI jobs, it
> timeouts or not.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> On the left, resources on a job that failed (running on internap) and
> on the right (running on citycloud) it worked.
> I've been through all upgrade steps and I haven't seen specific tasks
> that take more time here or here, but some little changes that make
> the big change at the end (so hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
> 
> 
> Problem #2: from Ocata to Pike (containerized) missing container upload step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> Thanks to that work, we managed to find the problem #3.
> 
> 
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers in local cache takes more
> than 20 minutes.
> So this is where we are now, upgrade jobs timeout on that. Steve Baker
> is currently looking at it but we'll probably offer some help.
> 
> 
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade testing

This doesn't look like a viable option to me. I'd prefer to reduce the
scope of the upgrade testing (the set of deployed services under upgrade
testing), but then only release with it passing for that scope.

> 
> Risks:
> - for stable/ocata: it's highly possible to inject regression if jobs
> aren't voting anymore.
> - for pike: the quality of the release won't be good enough in term of
> CI coverage comparing to Ocata.
> 
> Mitigations:
> - for stable/ocata: make jobs non-voting and enforce our
> core-reviewers to pay double attention on what is landed. It should be
> temporary until we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and make progress
> - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> somewhere with resources and without timeout constraints.
> 
> I would like some feedback on the proposal so we can move forward this week,
> Thanks.
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Bogdan Dobrelya
On 16.08.2017 5:06, Wesley Hayutin wrote:
> 
> 
> On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi  > wrote:
> 
> So far, we're having 3 critical issues, that we all need to address as
> soon as we can.
> 
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> 
> Today I spent an hour to look at it and here's what I've found so far:
> depending on which public cloud we're running the TripleO CI jobs, it
> timeouts or not.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> 
> On the left, resources on a job that failed (running on internap) and
> on the right (running on citycloud) it worked.
> I've been through all upgrade steps and I haven't seen specific tasks
> that take more time here or here, but some little changes that make
> the big change at the end (so hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
> 
> 
> Problem #2: from Ocata to Pike (containerized) missing container
> upload step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> 
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> 
> Thanks to that work, we managed to find the problem #3.
> 
> 
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> 
> The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers in local cache takes more
> than 20 minutes.
> So this is where we are now, upgrade jobs timeout on that. Steve Baker
> is currently looking at it but we'll probably offer some help.
> 
> 
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade
> testing
> 
> Risks:
> - for stable/ocata: it's highly possible to inject regression if jobs
> aren't voting anymore.
> - for pike: the quality of the release won't be good enough in term of
> CI coverage comparing to Ocata.
> 
> Mitigations:
> - for stable/ocata: make jobs non-voting and enforce our
> core-reviewers to pay double attention on what is landed. It should be
> temporary until we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and make progress
> - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> somewhere with resources and without timeout constraints.
> 
> I would like some feedback on the proposal so we can move forward
> this week,
> Thanks.
> --
> Emilien Macchi
> 
> 
> I think due to some of the limitations with run times upstream we may
> need to rethink the workflow with upgrade tests upstream. It's not very
> clear to me what can be done with the multinode nodepool jobs outside of
> what is already being done.  I think we do have some choices with ovb

We could limit the scope of the upstream multinode jobs to only do upgrade
testing of a couple of the deployed services, like keystone, nova and
neutron, or so.

> jobs.   I'm not going to try and solve in this email but rethinking how
> we CI upgrades in the upstream infrastructure should be a focus for the
> Queens PTG.  We will need to focus on bringing run times significantly
> down as it's incredibly difficult to run two installs in 175 minutes
> across all the upstream cloud providers.
> 
> Thanks Emilien for all the work you have done around upgrades!
> 
>  
> 
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> 
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> 
> 
> 
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 

Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-16 Thread Marios Andreou
On Wed, Aug 16, 2017 at 4:33 AM, Emilien Macchi  wrote:

> So far, we're having 3 critical issues, that we all need to address as
> soon as we can.
>
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> Today I spent an hour to look at it and here's what I've found so far:
> depending on which public cloud we're running the TripleO CI jobs, it
> timeouts or not.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> On the left, resources on a job that failed (running on internap) and
> on the right (running on citycloud) it worked.
> I've been through all upgrade steps and I haven't seen specific tasks
> that take more time here or here, but some little changes that make
> the big change at the end (so hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
>
>
> Problem #2: from Ocata to Pike (containerized) missing container upload
> step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> Thanks to that work, we managed to find the problem #3.
>
>
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers in local cache takes more
> than 20 minutes.
> So this is where we are now, upgrade jobs timeout on that. Steve Baker
> is currently looking at it but we'll probably offer some help.
>
>
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade
> testing
>
>
+1, but for Ocata to Pike it sounds like the container/image-related problems
2 and 3 above are both in progress or being looked at (weshay/sbaker ++), in
which case we might be able to fix the O...P jobs at least?

For Newton to Ocata, is it consistent which clouds we are timing out on?
I've looked at https://bugs.launchpad.net/tripleo/+bug/1702955 before,
and I know other folks from upgrades have too, but we couldn't find a root
cause or any upgrade operations taking too long / timing out / erroring, etc.
If it is consistent which clouds time out, we can use that info to guide us
in the case that we make the jobs non-voting for N...O (e.g. a known list of
'timing-out clouds' to decide whether we should inspect the CI logs more
closely before merging a patch). Obviously only until/unless we actually
root-cause that one (I will also find some time to check again).
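
A rough sketch of the kind of per-cloud tally I mean (the input format is
made up here; in practice it would come from the job logs or logstash):

    from collections import Counter, defaultdict

    # one (cloud provider, final status) tuple per upgrade job run
    results = [
        ("internap", "TIMED_OUT"),
        ("citycloud", "SUCCESS"),
        # ...
    ]

    per_cloud = defaultdict(Counter)
    for cloud, status in results:
        per_cloud[cloud][status] += 1

    for cloud, counts in sorted(per_cloud.items()):
        total = sum(counts.values())
        print("%-12s %3d runs, %5.1f%% timed out"
              % (cloud, total, 100.0 * counts["TIMED_OUT"] / total))

Even a crude table like that would tell us whether a 'timing-out clouds' list
is a workable stopgap.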



> Risks:
> - for stable/ocata: it's highly possible to inject regression if jobs
> aren't voting anymore.
> - for pike: the quality of the release won't be good enough in term of
> CI coverage comparing to Ocata.
>
> Mitigations:
> - for stable/ocata: make jobs non-voting and enforce our
> core-reviewers to pay double attention on what is landed. It should be
> temporary until we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and make progress
>

For master, +1 - I think this is essentially what I am saying above for O...P:
it sounds like problem 2 is well in progress from weshay, and the other
container/image-related problem 3 is the main outstanding item. Since RC1
is this week, I think what you are proposing as mitigation is fair, so we
re-evaluate making these jobs voting before the final RCs at the end of August.


> - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> somewhere with resources and without timeout constraints.


> I would like some feedback on the proposal so we can move forward this
> week,
> Thanks.
>


Thanks for putting this together. I think if we really had to pick one, the
O..P CI obviously has priority this week (!)... I think the
container/image-related issues for O...P are both expected teething issues
from the huge amount of work done by the containerization team and can
hopefully be resolved quickly.

marios



> --
> Emilien Macchi
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-15 Thread Wesley Hayutin
On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi  wrote:

> So far, we're having 3 critical issues, that we all need to address as
> soon as we can.
>
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> Today I spent an hour to look at it and here's what I've found so far:
> depending on which public cloud we're running the TripleO CI jobs, it
> timeouts or not.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> On the left, resources on a job that failed (running on internap) and
> on the right (running on citycloud) it worked.
> I've been through all upgrade steps and I haven't seen specific tasks
> that take more time here or here, but some little changes that make
> the big change at the end (so hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
>
>
> Problem #2: from Ocata to Pike (containerized) missing container upload
> step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> Thanks to that work, we managed to find the problem #3.
>
>
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> > The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers in local cache takes more
> than 20 minutes.
> So this is where we are now, upgrade jobs timeout on that. Steve Baker
> is currently looking at it but we'll probably offer some help.
>
>
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade
> testing
>
> Risks:
> - for stable/ocata: it's highly possible to inject regression if jobs
> aren't voting anymore.
> - for pike: the quality of the release won't be good enough in term of
> CI coverage comparing to Ocata.
>
> Mitigations:
> - for stable/ocata: make jobs non-voting and enforce our
> core-reviewers to pay double attention on what is landed. It should be
> temporary until we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and make progress
> - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> somewhere with resources and without timeout constraints.
>
> I would like some feedback on the proposal so we can move forward this
> week,
> Thanks.
> --
> Emilien Macchi
>

I think, due to some of the limitations on run times upstream, we may need
to rethink the workflow for upgrade tests upstream. It's not very clear to
me what can be done with the multinode nodepool jobs beyond what is
already being done.  I think we do have some choices with OVB jobs.  I'm
not going to try to solve this in this email, but rethinking how we CI
upgrades in the upstream infrastructure should be a focus for the Queens
PTG.  We will need to focus on bringing run times down significantly, as
it's incredibly difficult to run two installs in 175 minutes across all the
upstream cloud providers.

Thanks Emilien for all the work you have done around upgrades!





[openstack-dev] [tripleo] critical situation with CI / upgrade jobs

2017-08-15 Thread Emilien Macchi
So far we have 3 critical issues that we all need to address as
soon as we can.

Problem #1: Upgrade jobs timeout from Newton to Ocata
https://bugs.launchpad.net/tripleo/+bug/1702955
Today I spent an hour looking at it, and here's what I've found so far:
depending on which public cloud the TripleO CI jobs run on, they either
time out or they don't.
Here's an example of the Heat resources that run in our CI:
https://www.diffchecker.com/VTXkNFuk
On the left are the resources of a job that failed (running on internap),
and on the right a job that worked (running on citycloud).
I've been through all the upgrade steps and I haven't seen specific tasks
that take much more time on one cloud than the other, just lots of small
differences that add up to a big difference at the end (so it's hard to debug).
Help on that front would be very welcome.
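
For anyone digging in, here is a sketch of the comparison I was doing by hand
with diffchecker, assuming you have already extracted per-resource durations
from the two jobs (how you extract them from the logs is left out here):

    def biggest_drifts(slow, fast, top=15):
        # slow/fast: dicts of {resource_name: seconds} from the two jobs.
        deltas = {name: seconds - fast.get(name, 0.0)
                  for name, seconds in slow.items()}
        for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1])[:top]:
            print("%-60s +%6.1fs" % (name, delta))

    # e.g. biggest_drifts(internap_durations, citycloud_durations)

If the output is a long tail of small positive deltas rather than a few big
ones, that matches the "lots of small differences" observation above.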


Problem #2: from Ocata to Pike (containerized) missing container upload step
https://bugs.launchpad.net/tripleo/+bug/1710938
Wes has a patch (thanks!) that is currently in the gate:
https://review.openstack.org/#/c/493972
Thanks to that work, we managed to find the problem #3.


Problem #3: from Ocata to Pike: all container images are
uploaded/specified, even for services not deployed
https://bugs.launchpad.net/tripleo/+bug/1710992
The CI jobs are timing out during the upgrade process because
downloading + uploading _all_ containers in local cache takes more
than 20 minutes.
So this is where we are now, upgrade jobs timeout on that. Steve Baker
is currently looking at it but we'll probably offer some help.


Solutions:
- for stable/ocata: make upgrade jobs non-voting
- for pike: keep upgrade jobs non-voting and release without upgrade testing

Risks:
- for stable/ocata: it's highly possible to inject regressions if jobs
aren't voting anymore.
- for pike: the quality of the release won't be good enough in terms of
CI coverage compared to Ocata.

Mitigations:
- for stable/ocata: make jobs non-voting and ask our
core reviewers to pay double attention to what lands. This should be
temporary, until we manage to fix the CI jobs.
- for master: release RC1 without upgrade jobs and make progress
- Run TripleO upgrade scenarios as third party CI in RDO Cloud or
somewhere with resources and without timeout constraints.

I would like some feedback on the proposal so we can move forward this week,
Thanks.
-- 
Emilien Macchi

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev