Re: [openstack-dev] [tripleo] container jobs are unstable

2017-04-07 Thread Dan Prince
On Thu, 2017-04-06 at 15:32 -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 11:01:08AM -0400, Paul Belanger wrote:
> > On Thu, Mar 30, 2017 at 03:08:57PM +0100, Steven Hardy wrote:
> > > To be fair, we discussed this on IRC yesterday, everyone agreed
> > > infra
> > > supported docker cache/registry was a great idea, but you said
> > > there was no
> > > known timeline for it actually getting done.
> > > 
> > > So while we all want to see that happen, and potentially help out
> > > with the
> > > effort, we're also trying to mitigate the fact that work isn't
> > > done by
> > > working around it in our OVB environment.
> > > 
> > > FWIW I think we absolutely need multinode container jobs, e.g.
> > > using infra
> > > resources, as that has worked out great for our puppet-based CI,
> > > but we
> > > really need to work out how to optimize the container download
> > > speed in
> > > that environment before that will work well AFAIK.
> > > 
> > > You referenced https://review.openstack.org/#/c/447524/ in your
> > > other
> > > reply, which AFAICS is a spec about publishing to dockerhub,
> > > which sounds
> > > great, but we have the opposite problem, we need to consume those
> > > published
> > > images during our CI runs, and currently downloading images takes
> > > too long.
> > > So we ideally need some sort of local registry/pull-through-cache 
> > > that
> > > speeds up that process.
> > > 
> > > How can we move forward here, is there anyone on the infra side
> > > we can work
> > > with to discuss further?
> > > 
> > 
> > Yes, I am currently working with clarkb to address some of these
> > concerns. Today we are looking at setting up our cloud mirrors to
> > cache[1] specific URLs; for example, we are testing out
> > http://trunk.rdoproject.org  This is not a long-term solution for
> > projects, but a short-term one. It will be opt-in for now, rather
> > than us setting it up for all jobs.  Long term, we move
> > rdoproject.org into AFS.
> > 
> > I have been trying to see if we can do the same for docker hub, and
> > continue to run it.  The main issue, at least for me, is we don't
> > want to depend on docker tooling for this. I'd rather not install
> > docker into our control plane at this point in time.
> > 
> > So, all of that to say, it will take some time. I understand it is
> > a high priority, but let's solve the current mirroring issues with
> > tripleo first (RDO, gems, github), and let's see if the apache
> > cache proxy will work for hub.docker.com too.
> > 
> > [1] https://review.openstack.org/451554
> 
> Wanted to follow up on this thread: we managed to get a reverse proxy
> cache[2] for https://registry-1.docker.io working. So far I've only
> tested ubuntu, fedora, and centos images, but the caching works. Once
> we land this, any jobs using docker can take advantage of the mirror.
> 
> [2] https://review.openstack.org/#/c/453811


Thanks for your help with this, Paul.

A reverse proxy cache wasn't exactly what I was expecting, so it took a
few more patches to get all this initially wired into the TripleO OVB
jobs (6 patches so far). Once these land, we can replicate the setup
for the multinode jobs as well.

I created a quick etherpad below [1] to track the status of these
patches. I think they mostly need to land in the order they are listed
in the etherpad...

[1] https://etherpad.openstack.org/p/tripleo-docker-registry-mirror
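
For anyone wiring up a similar job: consuming the cache essentially means
pointing the docker daemon at the mirror before any images are pulled. A
minimal sketch, assuming a hypothetical per-region mirror endpoint (the
real URL comes from the infra mirror setup in the patches above):

  # Sketch only: the mirror URL is a placeholder, not the real endpoint.
  cat > /etc/docker/daemon.json <<'EOF'
  {
    "registry-mirrors": ["http://mirror.regionone.example.org:8082"]
  }
  EOF
  systemctl restart docker

With registry-mirrors set, docker tries the mirror first and falls back
to Docker Hub if the mirror is unavailable.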




Re: [openstack-dev] [tripleo] container jobs are unstable

2017-04-06 Thread Wesley Hayutin
On Thu, Mar 30, 2017 at 10:08 AM, Steven Hardy  wrote:

> On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:
> > On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:
> > >
> > > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > > >>
> > > > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> > > > >>>
> > > > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > > 
> > > >  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > > >  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > >  > > Hey,
> > > >  > >
> > > >  > > I've noticed that container jobs look pretty unstable
> lately; to
> > > >  > > me,
> > > >  > > it sounds like a timeout:
> > > >  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-
> tripleo-
> > > >  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_
> 2017-03-
> > > >  > > 22_00_08_55_358973
> > > >  >
> > > >  > There are different hypotheses on what is going on here. Some
> > > >  > patches have
> > > >  > landed to improve the write performance on containers by using
> > > >  > hostpath mounts
> > > >  > but we think the real slowness is coming from the images
> download.
> > > >  >
> > > >  > This said, this is still under investigation and the
> containers
> > > >  > squad will
> > > >  > report back as soon as there are new findings.
> > > > 
> > > >  Also, to be more precise, Martin André is looking into this. He
> also
> > > >  fixed the
> > > >  gate in the last 2 weeks.
> > > > >>>
> > > > >>>
> > > > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of
> some
> > > > >>> of the failures:
> > > > >>>
> > > > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > > > tripleo-ci-cen
> > > > >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-
> controller-
> > > > >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> > > > >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > > >>>
> > > > >>>
> > > > >>> Looks like Heat isn't able to create Nova instances in the
> overcloud
> > > > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any
> cell'. This
> > > > >>> means our cells initialization code for containers may not be
> quite
> > > > >>> right... or there is a race somewhere.
> > > > >>
> > > > >>
> > > > >> Here are some findings. I've looked at time measures from CI for
> > > > >> https://review.openstack.org/#/c/448533/ which provided the most
> > > > >> recent results:
> > > > >>
> > > > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > > >>undercloud install: 23
> > > > >>overcloud deploy: 72
> > > > >>total time: 125
> > > > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > > >>undercloud install: 25
> > > > >>overcloud deploy: 48
> > > > >>total time: 122
> > > > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > > >>undercloud install: 24
> > > > >>overcloud deploy: 57
> > > > >>total time: 152
> > > > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > > >>undercloud install: 28
> > > > >>overcloud deploy: 48
> > > > >>total time: 165 (timeout)
> > > > >>
> > > > >> Looking at the undercloud & overcloud install times, the most
> > > > >> time-consuming tasks, the containers job isn't doing that badly
> > > > >> compared to other OVB jobs. But looking closer I could see that:
> > > > >> - the containers job pulls docker images from dockerhub, this
> process
> > > > >> takes roughly 18 min.
> > > > >
> > > > >
> > > > > I think we can optimize this a bit by having the script that
> > > > > populates the local registry in the overcloud job run in parallel.
> > > > > The docker daemon can do multiple pulls w/o problems.
> > > > >
> > > > >> - the overcloud validate task takes 10 min more than it should
> because
> > > > >> of the bug Dan mentioned (a fix is in the queue at
> > > > >> https://review.openstack.org/#/c/448575/)
> > > > >
> > > > >
> > > > > +A
> > > > >
> > > > >> - the postci takes a long time with quickstart, 13 min (4 min
> alone
> > > > >> spent on docker log collection) whereas it takes only 3 min when
> using
> > > > >> tripleo.sh
> > > > >
> > > > >
> > > > > mmh, does this have anything to do with ansible being in between?
> > > > > Or is that time specifically for the part that gets the logs?
> > > > >
> > > > >>
> > > > >> Adding all these numbers, we're at about 40 min of additional
> > > > >> time for the oooq containers job, which is enough to cross the
> > > > >> CI job limit.
> > > > >>
> > > > >> There is certainly a lot of room for optimization here and there
> and
> > > > >> I'll explore how we can speed up the containers CI job 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-04-06 Thread Paul Belanger
On Thu, Mar 30, 2017 at 11:01:08AM -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 03:08:57PM +0100, Steven Hardy wrote:
> > To be fair, we discussed this on IRC yesterday, everyone agreed infra
> > supported docker cache/registry was a great idea, but you said there was no
> > known timeline for it actually getting done.
> > 
> > So while we all want to see that happen, and potentially help out with the
> > effort, we're also trying to mitigate the fact that work isn't done by
> > working around it in our OVB environment.
> > 
> > FWIW I think we absolutely need multinode container jobs, e.g. using infra
> > resources, as that has worked out great for our puppet-based CI, but we
> > really need to work out how to optimize the container download speed in
> > that environment before that will work well AFAIK.
> > 
> > You referenced https://review.openstack.org/#/c/447524/ in your other
> > reply, which AFAICS is a spec about publishing to dockerhub, which sounds
> > great, but we have the opposite problem, we need to consume those published
> > images during our CI runs, and currently downloading images takes too long.
> > So we ideally need some sort of local registry/pull-through-cache that
> > speeds up that process.
> > 
> > How can we move forward here, is there anyone on the infra side we can work
> > with to discuss further?
> > 
> Yes, I am currently working with clarkb to address some of these concerns.
> Today we are looking at setting up our cloud mirrors to cache[1] specific
> URLs; for example, we are testing out http://trunk.rdoproject.org  This is
> not a long-term solution for projects, but a short-term one. It will be
> opt-in for now, rather than us setting it up for all jobs.  Long term, we
> move rdoproject.org into AFS.
> 
> I have been trying to see if we can do the same for docker hub, and
> continue to run it.  The main issue, at least for me, is we don't want to
> depend on docker tooling for this. I'd rather not install docker into our
> control plane at this point in time.
> 
> So, all of that to say, it will take some time. I understand it is a high
> priority, but let's solve the current mirroring issues with tripleo first
> (RDO, gems, github), and let's see if the apache cache proxy will work for
> hub.docker.com too.
> 
> [1] https://review.openstack.org/451554

Wanted to follow up on this thread: we managed to get a reverse proxy cache[2]
for https://registry-1.docker.io working. So far I've only tested ubuntu,
fedora, and centos images, but the caching works. Once we land this, any jobs
using docker can take advantage of the mirror.

[2] https://review.openstack.org/#/c/453811
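
For the curious, the general shape of such a reverse proxy cache in Apache
is roughly the sketch below. This assumes mod_proxy, mod_ssl and
mod_cache_disk are loaded; the port, paths and cache sizes are illustrative
only, and the actual change is in [2]:

  # Rough sketch of a caching reverse proxy for registry-1.docker.io.
  # All values here are illustrative assumptions; see review 453811 for
  # the real configuration.
  <VirtualHost *:8081>
      SSLProxyEngine on
      ProxyRequests off
      ProxyPass        /registry-1.docker.io/ https://registry-1.docker.io/
      ProxyPassReverse /registry-1.docker.io/ https://registry-1.docker.io/
      CacheEnable disk /registry-1.docker.io
      CacheRoot /var/cache/apache2/proxy
      CacheMaxFileSize 1073741824
  </VirtualHost>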



Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-30 Thread Paul Belanger
On Thu, Mar 30, 2017 at 03:08:57PM +0100, Steven Hardy wrote:
> On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:
> > On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  
> > > wrote:
> > > 
> > > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  
> > > > wrote:
> > > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > > >>
> > > > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  
> > > > >> wrote:
> > > > >>>
> > > > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > > 
> > > >  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > > >  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > >  > > Hey,
> > > >  > >
> > > >  > > I've noticed that container jobs look pretty unstable lately; 
> > > >  > > to
> > > >  > > me,
> > > >  > > it sounds like a timeout:
> > > >  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> > > >  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> > > >  > > 22_00_08_55_358973
> > > >  >
> > > >  > There are different hypotheses on what is going on here. Some
> > > >  > patches have
> > > >  > landed to improve the write performance on containers by using
> > > >  > hostpath mounts
> > > >  > but we think the real slowness is coming from the images 
> > > >  > download.
> > > >  >
> > > >  > This said, this is still under investigation and the containers
> > > >  > squad will
> > > >  > report back as soon as there are new findings.
> > > > 
> > > >  Also, to be more precise, Martin André is looking into this. He 
> > > >  also
> > > >  fixed the
> > > >  gate in the last 2 weeks.
> > > > >>>
> > > > >>>
> > > > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of 
> > > > >>> some
> > > > >>> of the failures:
> > > > >>>
> > > > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > > > tripleo-ci-cen
> > > > >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> > > > >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> > > > >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > > >>>
> > > > >>>
> > > > >>> Looks like Heat isn't able to create Nova instances in the overcloud
> > > > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. 
> > > > >>> This
> > > > >>> means our cells initialization code for containers may not be quite
> > > > >>> right... or there is a race somewhere.
> > > > >>
> > > > >>
> > > > >> Here are some findings. I've looked at time measures from CI for
> > > > >> https://review.openstack.org/#/c/448533/ which provided the most
> > > > >> recent results:
> > > > >>
> > > > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > > >>undercloud install: 23
> > > > >>overcloud deploy: 72
> > > > >>total time: 125
> > > > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > > >>undercloud install: 25
> > > > >>overcloud deploy: 48
> > > > >>total time: 122
> > > > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > > >>undercloud install: 24
> > > > >>overcloud deploy: 57
> > > > >>total time: 152
> > > > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > > >>undercloud install: 28
> > > > >>overcloud deploy: 48
> > > > >>total time: 165 (timeout)
> > > > >>
> > > > >> Looking at the undercloud & overcloud install times, the most
> > > > >> time-consuming tasks, the containers job isn't doing that badly
> > > > >> compared to other OVB jobs. But looking closer I could see that:
> > > > >> - the containers job pulls docker images from dockerhub, this process
> > > > >> takes roughly 18 min.
> > > > >
> > > > >
> > > > > I think we can optimize this a bit by having the script that
> > > > > populates the local registry in the overcloud job run in parallel.
> > > > > The docker daemon can do multiple pulls w/o problems.
> > > > >
> > > > >> - the overcloud validate task takes 10 min more than it should 
> > > > >> because
> > > > >> of the bug Dan mentioned (a fix is in the queue at
> > > > >> https://review.openstack.org/#/c/448575/)
> > > > >
> > > > >
> > > > > +A
> > > > >
> > > > >> - the postci takes a long time with quickstart, 13 min (4 min alone
> > > > >> spent on docker log collection) whereas it takes only 3 min when 
> > > > >> using
> > > > >> tripleo.sh
> > > > >
> > > > >
> > > > > mmh, does this have anything to do with ansible being in between?
> > > > > Or is that time specifically for the part that gets the logs?
> > > > >
> > > > >>
> > > > >> Adding all these numbers, we're at about 40 min of additional
> > > > >> time for the oooq containers job, which is enough to cross the
> > > > >> CI job limit.
> > > > >>
> > > > >> There is certainly a lot of room for 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-30 Thread Dan Prince
On Wed, 2017-03-29 at 22:07 -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:
> > 
> > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > > > 
> > > > > On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> > > > > > 
> > > > > > On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > > > > > 
> > > > > > > On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > > > > > > > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > > > > > > > Hey,
> > > > > > > > > 
> > > > > > > > > I've noticed that container jobs look pretty unstable
> > > > > > > > > lately; to
> > > > > > > > > me,
> > > > > > > > > it sounds like a timeout:
> > > > > > > > > http://logs.openstack.org/19/447319/2/check-tripleo/g
> > > > > > > > > ate-tripleo-
> > > > > > > > > ci-centos-7-ovb-containers-oooq-
> > > > > > > > > nv/bca496a/console.html#_2017-03-
> > > > > > > > > 22_00_08_55_358973
> > > > > > > > 
> > > > > > > > There are different hypotheses on what is going on
> > > > > > > > here. Some
> > > > > > > > patches have
> > > > > > > > landed to improve the write performance on containers
> > > > > > > > by using
> > > > > > > > hostpath mounts
> > > > > > > > but we think the real slowness is coming from the
> > > > > > > > images download.
> > > > > > > > 
> > > > > > > > This said, this is still under investigation and the
> > > > > > > > containers
> > > > > > > > squad will
> > > > > > > > report back as soon as there are new findings.
> > > > > > > 
> > > > > > > Also, to be more precise, Martin André is looking into
> > > > > > > this. He also
> > > > > > > fixed the
> > > > > > > gate in the last 2 weeks.
> > > > > > 
> > > > > > 
> > > > > > I spoke w/ Martin on IRC. He seems to think this is the
> > > > > > cause of some
> > > > > > of the failures:
> > > > > > 
> > > > > > http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > > 
> > > tripleo-ci-cen
> > > > > > tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-
> > > > > > controller-
> > > > > > 0/var/log/extra/docker/containers/heat_engine/log/heat/heat
> > > > > > -
> > > > > > engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > > > > 
> > > > > > 
> > > > > > Looks like Heat isn't able to create Nova instances in the
> > > > > > overcloud
> > > > > > due to "Host 'overcloud-novacompute-0' is not mapped to any
> > > > > > cell". This
> > > > > > means our cells initialization code for containers may not
> > > > > > be quite
> > > > > > right... or there is a race somewhere.
> > > > > 
> > > > > 
> > > > > Here are some findings. I've looked at time measures from CI
> > > > > for
> > > > > https://review.openstack.org/#/c/448533/ which provided the
> > > > > most
> > > > > recent results:
> > > > > 
> > > > > * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > > >    undercloud install: 23
> > > > >    overcloud deploy: 72
> > > > >    total time: 125
> > > > > * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > > >    undercloud install: 25
> > > > >    overcloud deploy: 48
> > > > >    total time: 122
> > > > > * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > > >    undercloud install: 24
> > > > >    overcloud deploy: 57
> > > > >    total time: 152
> > > > > * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > > >    undercloud install: 28
> > > > >    overcloud deploy: 48
> > > > >    total time: 165 (timeout)
> > > > > 
> > > > > Looking at the undercloud & overcloud install times, the most
> > > > > time-consuming tasks, the containers job isn't doing that badly
> > > > > compared to other OVB jobs. But looking closer I could see that:
> > > > > - the containers job pulls docker images from dockerhub, this
> > > > > process
> > > > > takes roughly 18 min.
> > > > 
> > > > 
> > > > I think we can optimize this a bit by having the script that
> > > > populates the local registry in the overcloud job run in parallel.
> > > > The docker daemon can do multiple pulls w/o problems.
> > > > 
> > > > > - the overcloud validate task takes 10 min more than it
> > > > > should because
> > > > > of the bug Dan mentioned (a fix is in the queue at
> > > > > https://review.openstack.org/#/c/448575/)
> > > > 
> > > > 
> > > > +A
> > > > 
> > > > > - the postci takes a long time with quickstart, 13 min (4 min
> > > > > alone
> > > > > spent on docker log collection) whereas it takes only 3 min
> > > > > when using
> > > > > tripleo.sh
> > > > 
> > > > 
> > > > mmh, does this have anything to do with ansible being in between?
> > > > Or is that time specifically for the part that gets the logs?
> > > > 
> > > > > 
> > > > > Adding all these numbers, we're at about 40 min of additional
> > > > > time for the oooq containers 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-30 Thread Steven Hardy
On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:
> > 
> > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > >>
> > > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> > > >>>
> > > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > 
> > >  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > >  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > >  > > Hey,
> > >  > >
> > >  > > I've noticed that container jobs look pretty unstable lately; to
> > >  > > me,
> > >  > > it sounds like a timeout:
> > >  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> > >  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> > >  > > 22_00_08_55_358973
> > >  >
> > >  > There are different hypotheses on what is going on here. Some
> > >  > patches have
> > >  > landed to improve the write performance on containers by using
> > >  > hostpath mounts
> > >  > but we think the real slowness is coming from the images download.
> > >  >
> > >  > This said, this is still under investigation and the containers
> > >  > squad will
> > >  > report back as soon as there are new findings.
> > > 
> > >  Also, to be more precise, Martin André is looking into this. He also
> > >  fixed the
> > >  gate in the last 2 weeks.
> > > >>>
> > > >>>
> > > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> > > >>> of the failures:
> > > >>>
> > > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > > tripleo-ci-cen
> > > >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> > > >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> > > >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > >>>
> > > >>>
> > > >>> Looks like Heat isn't able to create Nova instances in the overcloud
> > > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. 
> > > >>> This
> > > >>> means our cells initialization code for containers may not be quite
> > > >>> right... or there is a race somewhere.
> > > >>
> > > >>
> > > >> Here are some findings. I've looked at time measures from CI for
> > > >> https://review.openstack.org/#/c/448533/ which provided the most
> > > >> recent results:
> > > >>
> > > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > >>undercloud install: 23
> > > >>overcloud deploy: 72
> > > >>total time: 125
> > > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > >>undercloud install: 25
> > > >>overcloud deploy: 48
> > > >>total time: 122
> > > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > >>undercloud install: 24
> > > >>overcloud deploy: 57
> > > >>total time: 152
> > > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > >>undercloud install: 28
> > > >>overcloud deploy: 48
> > > >>total time: 165 (timeout)
> > > >>
> > > >> Looking at the undercloud & overcloud install times, the most
> > > >> time-consuming tasks, the containers job isn't doing that badly
> > > >> compared to other OVB jobs. But looking closer I could see that:
> > > >> - the containers job pulls docker images from dockerhub, this process
> > > >> takes roughly 18 min.
> > > >
> > > >
> > > > I think we can optimize this a bit by having the script that
> > > > populates the local registry in the overcloud job run in parallel.
> > > > The docker daemon can do multiple pulls w/o problems.
> > > >
> > > >> - the overcloud validate task takes 10 min more than it should because
> > > >> of the bug Dan mentioned (a fix is in the queue at
> > > >> https://review.openstack.org/#/c/448575/)
> > > >
> > > >
> > > > +A
> > > >
> > > >> - the postci takes a long time with quickstart, 13 min (4 min alone
> > > >> spent on docker log collection) whereas it takes only 3 min when using
> > > >> tripleo.sh
> > > >
> > > >
> > > > mmh, does this have anything to do with ansible being in between?
> > > > Or is that time specifically for the part that gets the logs?
> > > >
> > > >>
> > > >> Adding all these numbers, we're at about 40 min of additional time for
> > > >> the oooq containers job, which is enough to cross the CI job limit.
> > > >>
> > > >> There is certainly a lot of room for optimization here and there and
> > > >> I'll explore how we can speed up the containers CI job over the next
> > > >
> > > >
> > > > Thanks a lot for the update. The time breakdown is fantastic,
> > > > Flavio
> > >
> > > TBH the problem is far from being solved:
> > >
> > > 1. Click on https://status-tripleoci.rhcloud.com/
> > > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> > >
> > > Container job has been 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-29 Thread Paul Belanger
On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:
> > 
> > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > >>
> > > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> > > >>>
> > > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > 
> > >  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > >  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > >  > > Hey,
> > >  > >
> > >  > > I've noticed that container jobs look pretty unstable lately; to
> > >  > > me,
> > >  > > it sounds like a timeout:
> > >  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> > >  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> > >  > > 22_00_08_55_358973
> > >  >
> > >  > There are different hypotheses on what is going on here. Some
> > >  > patches have
> > >  > landed to improve the write performance on containers by using
> > >  > hostpath mounts
> > >  > but we think the real slowness is coming from the images download.
> > >  >
> > >  > This said, this is still under investigation and the containers
> > >  > squad will
> > >  > report back as soon as there are new findings.
> > > 
> > >  Also, to be more precise, Martin André is looking into this. He also
> > >  fixed the
> > >  gate in the last 2 weeks.
> > > >>>
> > > >>>
> > > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> > > >>> of the failures:
> > > >>>
> > > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > > tripleo-ci-cen
> > > >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> > > >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> > > >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > >>>
> > > >>>
> > > >>> Looks like Heat isn't able to create Nova instances in the overcloud
> > > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. 
> > > >>> This
> > > >>> means our cells initialization code for containers may not be quite
> > > >>> right... or there is a race somewhere.
> > > >>
> > > >>
> > > >> Here are some findings. I've looked at time measures from CI for
> > > >> https://review.openstack.org/#/c/448533/ which provided the most
> > > >> recent results:
> > > >>
> > > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > >>undercloud install: 23
> > > >>overcloud deploy: 72
> > > >>total time: 125
> > > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > >>undercloud install: 25
> > > >>overcloud deploy: 48
> > > >>total time: 122
> > > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > >>undercloud install: 24
> > > >>overcloud deploy: 57
> > > >>total time: 152
> > > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > >>undercloud install: 28
> > > >>overcloud deploy: 48
> > > >>total time: 165 (timeout)
> > > >>
> > > >> Looking at the undercloud & overcloud install times, the most
> > > >> time-consuming tasks, the containers job isn't doing that badly
> > > >> compared to other OVB jobs. But looking closer I could see that:
> > > >> - the containers job pulls docker images from dockerhub, this process
> > > >> takes roughly 18 min.
> > > >
> > > >
> > > > I think we can optimize this a bit by having the script that
> > > > populates the local registry in the overcloud job run in parallel.
> > > > The docker daemon can do multiple pulls w/o problems.
> > > >
> > > >> - the overcloud validate task takes 10 min more than it should because
> > > >> of the bug Dan mentioned (a fix is in the queue at
> > > >> https://review.openstack.org/#/c/448575/)
> > > >
> > > >
> > > > +A
> > > >
> > > >> - the postci takes a long time with quickstart, 13 min (4 min alone
> > > >> spent on docker log collection) whereas it takes only 3 min when using
> > > >> tripleo.sh
> > > >
> > > >
> > > > mmh, does this have anything to do with ansible being in between?
> > > > Or is that time specifically for the part that gets the logs?
> > > >
> > > >>
> > > >> Adding all these numbers, we're at about 40 min of additional time for
> > > >> the oooq containers job, which is enough to cross the CI job limit.
> > > >>
> > > >> There is certainly a lot of room for optimization here and there and
> > > >> I'll explore how we can speed up the containers CI job over the next
> > > >
> > > >
> > > > Thanks a lot for the update. The time breakdown is fantastic,
> > > > Flavio
> > >
> > > TBH the problem is far from being solved:
> > >
> > > 1. Click on https://status-tripleoci.rhcloud.com/
> > > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> > >
> > > Container job has been 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-29 Thread Paul Belanger
On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:
> 
> > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > > On 23/03/17 16:24 +0100, Martin André wrote:
> > >>
> > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> > >>>
> > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > 
> >  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> >  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> >  > > Hey,
> >  > >
> >  > > I've noticed that container jobs look pretty unstable lately; to
> >  > > me,
> >  > > it sounds like a timeout:
> >  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> >  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> >  > > 22_00_08_55_358973
> >  >
> >  > There are different hypotheses on what is going on here. Some
> >  > patches have
> >  > landed to improve the write performance on containers by using
> >  > hostpath mounts
> >  > but we think the real slowness is coming from the images download.
> >  >
> >  > This said, this is still under investigation and the containers
> >  > squad will
> >  > report back as soon as there are new findings.
> > 
> >  Also, to be more precise, Martin André is looking into this. He also
> >  fixed the
> >  gate in the last 2 weeks.
> > >>>
> > >>>
> > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> > >>> of the failures:
> > >>>
> > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> > tripleo-ci-cen
> > >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> > >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> > >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> > >>>
> > >>>
> > >>> Looks like Heat isn't able to create Nova instances in the overcloud
> > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
> > >>> means our cells initialization code for containers may not be quite
> > >>> right... or there is a race somewhere.
> > >>
> > >>
> > >> Here are some findings. I've looked at time measures from CI for
> > >> https://review.openstack.org/#/c/448533/ which provided the most
> > >> recent results:
> > >>
> > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > >>undercloud install: 23
> > >>overcloud deploy: 72
> > >>total time: 125
> > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > >>undercloud install: 25
> > >>overcloud deploy: 48
> > >>total time: 122
> > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > >>undercloud install: 24
> > >>overcloud deploy: 57
> > >>total time: 152
> > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > >>undercloud install: 28
> > >>overcloud deploy: 48
> > >>total time: 165 (timeout)
> > >>
> > >> Looking at the undercloud & overcloud install times, the most
> > >> time-consuming tasks, the containers job isn't doing that badly
> > >> compared to other OVB jobs. But looking closer I could see that:
> > >> - the containers job pulls docker images from dockerhub, this process
> > >> takes roughly 18 min.
> > >
> > >
> > > I think we can optimize this a bit by having the script that
> > > populates the local registry in the overcloud job run in parallel.
> > > The docker daemon can do multiple pulls w/o problems.
> > >
> > >> - the overcloud validate task takes 10 min more than it should because
> > >> of the bug Dan mentioned (a fix is in the queue at
> > >> https://review.openstack.org/#/c/448575/)
> > >
> > >
> > > +A
> > >
> > >> - the postci takes a long time with quickstart, 13 min (4 min alone
> > >> spent on docker log collection) whereas it takes only 3 min when using
> > >> tripleo.sh
> > >
> > >
> > > mmh, does this have anything to do with ansible being in between?
> > > Or is that time specifically for the part that gets the logs?
> > >
> > >>
> > >> Adding all these numbers, we're at about 40 min of additional time for
> > >> the oooq containers job, which is enough to cross the CI job limit.
> > >>
> > >> There is certainly a lot of room for optimization here and there and
> > >> I'll explore how we can speed up the containers CI job over the next
> > >
> > >
> > > Thanks a lot for the update. The time breakdown is fantastic,
> > > Flavio
> >
> > TBH the problem is far from being solved:
> >
> > 1. Click on https://status-tripleoci.rhcloud.com/
> > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> >
> > The container job has been failing more than 55% of the time.
> > 
> > As a reference,
> > gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
> > gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
> > 
> > It clearly means the ovb-containers job was and is not ready to be run
> > in the check pipeline; it's not reliable enough.
> >
> > 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-29 Thread Steve Baker
On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi  wrote:

> On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> > On 23/03/17 16:24 +0100, Martin André wrote:
> >>
> >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> >>>
> >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> 
>  On 22/03/17 13:32 +0100, Flavio Percoco wrote:
>  > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
>  > > Hey,
>  > >
>  > > I've noticed that container jobs look pretty unstable lately; to
>  > > me,
>  > > it sounds like a timeout:
>  > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
>  > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
>  > > 22_00_08_55_358973
>  >
>  > There are different hypotheses on what is going on here. Some
>  > patches have
>  > landed to improve the write performance on containers by using
>  > hostpath mounts
>  > but we think the real slowness is coming from the images download.
>  >
>  > This said, this is still under investigation and the containers
>  > squad will
>  > report back as soon as there are new findings.
> 
>  Also, to be more precise, Martin André is looking into this. He also
>  fixed the
>  gate in the last 2 weeks.
> >>>
> >>>
> >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> >>> of the failures:
> >>>
> >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-
> tripleo-ci-cen
> >>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> >>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> >>> engine.log.txt.gz#_2017-03-21_20_26_29_697
> >>>
> >>>
> >>> Looks like Heat isn't able to create Nova instances in the overcloud
> >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
> >>> means our cells initialization code for containers may not be quite
> >>> right... or there is a race somewhere.
> >>
> >>
> >> Here are some findings. I've looked at time measures from CI for
> >> https://review.openstack.org/#/c/448533/ which provided the most
> >> recent results:
> >>
> >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> >>undercloud install: 23
> >>overcloud deploy: 72
> >>total time: 125
> >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> >>undercloud install: 25
> >>overcloud deploy: 48
> >>total time: 122
> >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> >>undercloud install: 24
> >>overcloud deploy: 57
> >>total time: 152
> >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> >>undercloud install: 28
> >>overcloud deploy: 48
> >>total time: 165 (timeout)
> >>
> >> Looking at the undercloud & overcloud install times, the most
> >> time-consuming tasks, the containers job isn't doing that badly
> >> compared to other OVB jobs. But looking closer I could see that:
> >> - the containers job pulls docker images from dockerhub, this process
> >> takes roughly 18 min.
> >
> >
> > I think we can optimize this a bit by having the script that
> > populates the local registry in the overcloud job run in parallel.
> > The docker daemon can do multiple pulls w/o problems.
> >
> >> - the overcloud validate task takes 10 min more than it should because
> >> of the bug Dan mentioned (a fix is in the queue at
> >> https://review.openstack.org/#/c/448575/)
> >
> >
> > +A
> >
> >> - the postci takes a long time with quickstart, 13 min (4 min alone
> >> spent on docker log collection) whereas it takes only 3 min when using
> >> tripleo.sh
> >
> >
> > mmh, does this have anything to do with ansible being in between?
> > Or is that time specifically for the part that gets the logs?
> >
> >>
> >> Adding all these numbers, we're at about 40 min of additional time for
> >> the oooq containers job, which is enough to cross the CI job limit.
> >>
> >> There is certainly a lot of room for optimization here and there and
> >> I'll explore how we can speed up the containers CI job over the next
> >
> >
> > Thanks a lot for the update. The time breakdown is fantastic,
> > Flavio
>
> TBH the problem is far from being solved:
>
> 1. Click on https://status-tripleoci.rhcloud.com/
> 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
>
> The container job has been failing more than 55% of the time.
>
> As a reference,
> gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
> gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
>
> It clearly means the ovb-containers job was and is not ready to be run
> in the check pipeline; it's not reliable enough.
>
> The current queue time in TripleO OVB is 11 hours. This is not
> acceptable for TripleO developers, and we need a short-term solution,
> which is disabling this job from the check pipeline:
> https://review.openstack.org/#/c/451546/
>
>
Yes, given resource constraints I don't see an alternative in the short
term.


> 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-29 Thread Emilien Macchi
On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco  wrote:
> On 23/03/17 16:24 +0100, Martin André wrote:
>>
>> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
>>>
>>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:

 On 22/03/17 13:32 +0100, Flavio Percoco wrote:
 > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
 > > Hey,
 > >
 > > I've noticed that container jobs look pretty unstable lately; to
 > > me,
 > > it sounds like a timeout:
 > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
 > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
 > > 22_00_08_55_358973
 >
 > There are different hypotheses on what is going on here. Some
 > patches have
 > landed to improve the write performance on containers by using
 > hostpath mounts
 > but we think the real slowness is coming from the images download.
 >
 > This said, this is still under investigation and the containers
 > squad will
 > report back as soon as there are new findings.

 Also, to be more precise, Martin André is looking into this. He also
 fixed the
 gate in the last 2 weeks.
>>>
>>>
>>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
>>> of the failures:
>>>
>>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-cen
>>> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
>>> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
>>> engine.log.txt.gz#_2017-03-21_20_26_29_697
>>>
>>>
>>> Looks like Heat isn't able to create Nova instances in the overcloud
>>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
>>> means our cells initialization code for containers may not be quite
>>> right... or there is a race somewhere.
>>
>>
>> Here are some findings. I've looked at time measures from CI for
>> https://review.openstack.org/#/c/448533/ which provided the most
>> recent results:
>>
>> * gate-tripleo-ci-centos-7-ovb-ha [1]
>>undercloud install: 23
>>overcloud deploy: 72
>>total time: 125
>> * gate-tripleo-ci-centos-7-ovb-nonha [2]
>>undercloud install: 25
>>overcloud deploy: 48
>>total time: 122
>> * gate-tripleo-ci-centos-7-ovb-updates [3]
>>undercloud install: 24
>>overcloud deploy: 57
>>total time: 152
>> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
>>undercloud install: 28
>>overcloud deploy: 48
>>total time: 165 (timeout)
>>
>> Looking at the undercloud & overcloud install times, the most
>> time-consuming tasks, the containers job isn't doing that badly
>> compared to other OVB jobs. But looking closer I could see that:
>> - the containers job pulls docker images from dockerhub, this process
>> takes roughly 18 min.
>
>
> I think we can optimize this a bit by having the script that populates
> the local registry in the overcloud job run in parallel. The docker
> daemon can do multiple pulls w/o problems.
>
>> - the overcloud validate task takes 10 min more than it should because
>> of the bug Dan mentioned (a fix is in the queue at
>> https://review.openstack.org/#/c/448575/)
>
>
> +A
>
>> - the postci takes a long time with quickstart, 13 min (4 min alone
>> spent on docker log collection) whereas it takes only 3 min when using
>> tripleo.sh
>
>
> mmh, does this have anything to do with ansible being in between? Or is that
> time specifically for the part that gets the logs?
>
>>
>> Adding all these numbers, we're at about 40 min of additional time for
>> the oooq containers job, which is enough to cross the CI job limit.
>>
>> There is certainly a lot of room for optimization here and there and
>> I'll explore how we can speed up the containers CI job over the next
>
>
> Thanks a lot for the update. The time breakdown is fantastic,
> Flavio

TBH the problem is far from being solved:

1. Click on https://status-tripleoci.rhcloud.com/
2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv

The container job has been failing more than 55% of the time.

As a reference,
gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.

It clearly means the ovb-containers job was and is not ready to be run
in the check pipeline; it's not reliable enough.

The current queue time in TripleO OVB is 11 hours. This is not
acceptable for TripleO developers, and we need a short-term solution,
which is disabling this job in the check pipeline:
https://review.openstack.org/#/c/451546/

In the long term, we need to:

- Stabilize ovb-containers, which is AFAIK already WIP by Martin (kudos
to him). My hope is that Martin gets enough help from the Container
squad to work on this topic.
- Remove the ovb-nonha scenario from the check pipeline - and probably
keep it periodic. Dan Prince started some work on it:
https://review.openstack.org/#/c/449791/ and
https://review.openstack.org/#/c/449785/ - but not much 

Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-27 Thread Flavio Percoco

On 23/03/17 16:24 +0100, Martin André wrote:

On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:

On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:

On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > Hey,
> >
> > I've noticed that container jobs look pretty unstable lately; to
> > me,
> > it sounds like a timeout:
> > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> > 22_00_08_55_358973
>
> There are different hypotheses on what is going on here. Some
> patches have
> landed to improve the write performance on containers by using
> hostpath mounts
> but we think the real slowness is coming from the images download.
>
> This said, this is still under investigation and the containers
> squad will
> report back as soon as there are new findings.

Also, to be more precise, Martin André is looking into this. He also
fixed the
gate in the last 2 weeks.


I spoke w/ Martin on IRC. He seems to think this is the cause of some
of the failures:

http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-cen
tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
engine.log.txt.gz#_2017-03-21_20_26_29_697


Looks like Heat isn't able to create Nova instances in the overcloud
due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
means our cells initialization code for containers may not be quite
right... or there is a race somewhere.


Here are some findings. I've looked at time measures from CI for
https://review.openstack.org/#/c/448533/ which provided the most
recent results:

* gate-tripleo-ci-centos-7-ovb-ha [1]
   undercloud install: 23
   overcloud deploy: 72
   total time: 125
* gate-tripleo-ci-centos-7-ovb-nonha [2]
   undercloud install: 25
   overcloud deploy: 48
   total time: 122
* gate-tripleo-ci-centos-7-ovb-updates [3]
   undercloud install: 24
   overcloud deploy: 57
   total time: 152
* gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
   undercloud install: 28
   overcloud deploy: 48
   total time: 165 (timeout)

Looking at the undercloud & overcloud install times, the most
time-consuming tasks, the containers job isn't doing that badly compared
to other OVB jobs. But looking closer I could see that:
- the containers job pulls docker images from dockerhub, this process
takes roughly 18 min.


I think we can optimize this a bit by having the script that populates the
local registry in the overcloud job run in parallel. The docker daemon can
do multiple pulls w/o problems.
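
As an illustration only -- image_list.txt below is a made-up stand-in for
however the script enumerates images -- something like this would already
fan the pulls out:

  # Sketch: pull up to 4 images at a time instead of looping serially.
  # image_list.txt (one image name per line) is a hypothetical input.
  xargs -n1 -P4 docker pull < image_list.txt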


- the overcloud validate task takes 10 min more than it should because
of the bug Dan mentioned (a fix is in the queue at
https://review.openstack.org/#/c/448575/)


+A


- the postci takes a long time with quickstart, 13 min (4 min alone
spent on docker log collection) whereas it takes only 3 min when using
tripleo.sh


mmh, does this have anything to do with ansible being in between? Or is that
time specifically for the part that gets the logs?



Adding all these numbers, we're at about 40 min of additional time for
the oooq containers job, which is enough to cross the CI job limit.

There is certainly a lot of room for optimization here and there and
I'll explore how we can speed up the containers CI job over the next
weeks.


Thanks a lot for the update. The time breakdown is fantastic,
Flavio

Martin

[1] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/d2c1b16/
[2] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/d6df760/
[3] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/3b1f795/
[4] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/


Dan



Flavio









--
@flaper87
Flavio Percoco



Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-23 Thread Martin André
On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince  wrote:
> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
>> On 22/03/17 13:32 +0100, Flavio Percoco wrote:
>> > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
>> > > Hey,
>> > >
>> > > I've noticed that container jobs look pretty unstable lately; to
>> > > me,
>> > > it sounds like a timeout:
>> > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
>> > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
>> > > 22_00_08_55_358973
>> >
>> > There are different hypotheses on what is going on here. Some
>> > patches have
>> > landed to improve the write performance on containers by using
>> > hostpath mounts
>> > but we think the real slowness is coming from the images download.
>> >
>> > This said, this is still under investigation and the containers
>> > squad will
>> > report back as soon as there are new findings.
>>
>> Also, to be more precise, Martin André is looking into this. He also
>> fixed the
>> gate in the last 2 weeks.
>
> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> of the failures:
>
> http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-cen
> tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
> 0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
> engine.log.txt.gz#_2017-03-21_20_26_29_697
>
>
> Looks like Heat isn't able to create Nova instances in the overcloud
> due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
> means our cells initialization code for containers may not be quite
> right... or there is a race somewhere.

Here are some findings. I've looked at time measures from CI for
https://review.openstack.org/#/c/448533/ which provided the most
recent results:

* gate-tripleo-ci-centos-7-ovb-ha [1]
undercloud install: 23
overcloud deploy: 72
total time: 125
* gate-tripleo-ci-centos-7-ovb-nonha [2]
undercloud install: 25
overcloud deploy: 48
total time: 122
* gate-tripleo-ci-centos-7-ovb-updates [3]
undercloud install: 24
overcloud deploy: 57
total time: 152
* gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
undercloud install: 28
overcloud deploy: 48
total time: 165 (timeout)

Looking at the undercloud & overcloud install times, the most
time-consuming tasks, the containers job isn't doing that badly compared
to other OVB jobs. But looking closer I could see that:
- the containers job pulls docker images from dockerhub, this process
takes roughly 18 min.
- the overcloud validate task takes 10 min more than it should because
of the bug Dan mentioned (a fix is in the queue at
https://review.openstack.org/#/c/448575/)
- the postci takes a long time with quickstart, 13 min (4 min alone
spent on docker log collection) whereas it takes only 3 min when using
tripleo.sh

Adding all these numbers, we're at about 40 min of additional time for
the oooq containers job, which is enough to cross the CI job limit.
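
(Roughly: ~18 min of image pulls + ~10 min of extra validate time +
~10 min of extra postci (13 min vs 3 min) comes to about 40 min.)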

There is certainly a lot of room for optimization here and there and
I'll explore how we can speed up the containers CI job over the next
weeks.

Martin

[1] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/d2c1b16/
[2] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/d6df760/
[3] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/3b1f795/
[4] 
http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/

> Dan
>
>>
>> Flavio
>>
>>
>>
>



Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-22 Thread Dan Prince
On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > Hey,
> > > 
> > > I've noticed that container jobs look pretty unstable lately; to
> > > me,
> > > it sounds like a timeout:
> > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-
> > > ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-
> > > 22_00_08_55_358973
> > 
> > There are different hypotheses on what is going on here. Some
> > patches have
> > landed to improve the write performance on containers by using
> > hostpath mounts
> > but we think the real slowness is coming from the images download.
> > 
> > This said, this is still under investigation and the containers
> > squad will
> > report back as soon as there are new findings.
> 
> Also, to be more precise, Martin André is looking into this. He also
> fixed the
> gate in the last 2 weeks.

I spoke w/ Martin on IRC. He seems to think this is the cause of some
of the failures:

http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-cen
tos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-
0/var/log/extra/docker/containers/heat_engine/log/heat/heat-
engine.log.txt.gz#_2017-03-21_20_26_29_697


Looks like Heat isn't able to create Nova instances in the overcloud
due to "Host 'overcloud-novacompute-0' is not mapped to any cell'. This
means our cells initialization code for containers may not be quite
right... or there is a race somewhere.
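
For reference, with cells v2 a compute host only becomes usable once it
has been mapped to a cell, which normally happens by running something
like the sketch below after nova-compute has registered itself (the
exact invocation in our initialization code may differ):

  # Sketch: map newly registered compute hosts into their cell (cells v2).
  # This has to run *after* the compute service has checked in, otherwise
  # the host stays unmapped -- which would be consistent with a race.
  nova-manage cell_v2 discover_hosts --verbose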

Dan

> 
> Flavio
> 
> 
> 



Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-22 Thread Flavio Percoco

On 22/03/17 13:32 +0100, Flavio Percoco wrote:

On 21/03/17 23:15 -0400, Emilien Macchi wrote:

Hey,

I've noticed that container jobs look pretty unstable lately; to me,
it sounds like a timeout:
http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973


There are different hypotheses on what is going on here. Some patches have
landed to improve the write performance on containers by using hostpath mounts
but we think the real slowness is coming from the images download.

This said, this is still under investigation and the containers squad will
report back as soon as there are new findings.


Also, to be more precise, Martin André is looking into this. He also fixed the
gate in the last 2 weeks.

Flavio



--
@flaper87
Flavio Percoco




Re: [openstack-dev] [tripleo] container jobs are unstable

2017-03-22 Thread Flavio Percoco

On 21/03/17 23:15 -0400, Emilien Macchi wrote:

Hey,

I've noticed that container jobs look pretty unstable lately; to me,
it sounds like a timeout:
http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973


There are different hypotheses on what is going on here. Some patches have
landed to improve the write performance on containers by using hostpath mounts
but we think the real slowness is coming from the images download.

This said, this is still under investigation and the containers squad will
report back as soon as there are new findings.


If anyone could file a bug and see how we can bring it back as soon as
possible, I think we want to maintain this job in stable shape.
I remember the Container squad wanted it voting because it was supposed to
be stable, but I'm not sure that's the case today.

Also, it would be great to have the container jobs in
http://tripleo.org/cistatus.html - what do you think?


As I mentioned here (and in my email yesterday), this is work in progress and the
containers squad is aware of it. Just not enough news today.

Flavio

--
@flaper87
Flavio Percoco

