Re: [openstack-dev] [all] Zuul job backlog

2018-10-31 Thread Abhishek Kekane
Hi All,

I have fixed the glance functional test issue; patch [1] has been merged into
master. I hope the issue discussed in this thread is now resolved.

Kindly let me know.

[1] https://review.openstack.org/#/c/608856/

Thank you,
Abhishek

On Mon, 8 Oct 2018 at 11:37 PM, Doug Hellmann  wrote:

> Abhishek Kekane  writes:
>
> > Hi Doug,
> >
> > Should I use something like SimpleHTTPServer to upload a file and
> > download it, or are there other, more efficient ways to handle it?
> > Kindly let me know if you have any suggestions.
>
> Sure, that would work, especially if your tests are running in the unit
> test jobs. If you're running a functional test, it seems like it would
> also be OK to just copy a file into the directory Apache is serving from
> and then download it from there.
>
> Doug
>
> >
> > Thanks & Best Regards,
> >
> > Abhishek Kekane
> >
> >
> > On Fri, Oct 5, 2018 at 4:57 PM Doug Hellmann 
> wrote:
> >
> >> Abhishek Kekane  writes:
> >>
> >> > Hi Matt,
> >> >
> >> > Thanks for the input, I guess I should use '
> >> > http://git.openstack.org/static/openstack.png' which will definitely
> >> work.
> >> > Clark, Matt, Kindly let me know your opinion about the same.
> >>
> >> That URL would not be on the local node running the test, and would
> >> eventually exhibit the same problems. In fact we have seen issues
> >> cloning git repositories as part of the tests in the past.
> >>
> >> You need to use a localhost URL to ensure that the download doesn't have
> >> to go off of the node. That may mean placing something into the
> directory
> >> where Apache is serving files as part of the test setup.
> >>
> >> Doug
> >>
>
-- 
Thanks & Best Regards,

Abhishek Kekane


Re: [openstack-dev] [all] Zuul job backlog

2018-10-08 Thread Doug Hellmann
Abhishek Kekane  writes:

> Hi Doug,
>
> Should I use something like SimpleHTTPServer to upload a file and download
> it, or are there other, more efficient ways to handle it?
> Kindly let me know if you have any suggestions.

Sure, that would work, especially if your tests are running in the unit
test jobs. If you're running a functional test, it seems like it would
also be OK to just copy a file into the directory Apache is serving from
and then download it from there.

Doug
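
A rough sketch of that approach, using only the Python standard library (the
helper name and fixture file below are made up, not Glance's actual test code):

    import functools
    import os
    import tempfile
    import threading
    from http.server import HTTPServer, SimpleHTTPRequestHandler


    def serve_local_fixture(data=b"not really an image"):
        """Serve a temporary file over 127.0.0.1 and return (server, url)."""
        serve_dir = tempfile.mkdtemp()
        with open(os.path.join(serve_dir, "fixture.img"), "wb") as f:
            f.write(data)

        # directory= needs Python 3.7+; on older interpreters, chdir into
        # serve_dir before creating the handler instead.
        handler = functools.partial(SimpleHTTPRequestHandler,
                                    directory=serve_dir)
        server = HTTPServer(("127.0.0.1", 0), handler)  # port 0: pick a free port
        threading.Thread(target=server.serve_forever, daemon=True).start()

        url = "http://127.0.0.1:%d/fixture.img" % server.server_address[1]
        return server, url


    # In the test: the download never has to leave the node.
    server, url = serve_local_fixture()
    # ... point the code under test at `url` ...
    server.shutdown()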

>
> Thanks & Best Regards,
>
> Abhishek Kekane
>
>
> On Fri, Oct 5, 2018 at 4:57 PM Doug Hellmann  wrote:
>
>> Abhishek Kekane  writes:
>>
>> > Hi Matt,
>> >
>> > Thanks for the input, I guess I should use '
>> > http://git.openstack.org/static/openstack.png' which will definitely
>> work.
>> > Clark, Matt, Kindly let me know your opinion about the same.
>>
>> That URL would not be on the local node running the test, and would
>> eventually exhibit the same problems. In fact we have seen issues
>> cloning git repositories as part of the tests in the past.
>>
>> You need to use a localhost URL to ensure that the download doesn't have
>> to go off of the node. That may mean placing something into the directory
>> where Apache is serving files as part of the test setup.
>>
>> Doug
>>



Re: [openstack-dev] [all] Zuul job backlog

2018-10-08 Thread Abhishek Kekane
Hi Doug,

Should I use something like SimpleHTTPServer to upload a file and download
it, or are there other, more efficient ways to handle it?
Kindly let me know if you have any suggestions.

Thanks & Best Regards,

Abhishek Kekane


On Fri, Oct 5, 2018 at 4:57 PM Doug Hellmann  wrote:

> Abhishek Kekane  writes:
>
> > Hi Matt,
> >
> > Thanks for the input, I guess I should use '
> > http://git.openstack.org/static/openstack.png' which will definitely
> work.
> > Clark, Matt, Kindly let me know your opinion about the same.
>
> That URL would not be on the local node running the test, and would
> eventually exhibit the same problems. In fact we have seen issues
> cloning git repositories as part of the tests in the past.
>
> You need to use a localhost URL to ensure that the download doesn't have
> to go off of the node. That may mean placing something into the directory
> where Apache is serving files as part of the test setup.
>
> Doug
>


Re: [openstack-dev] [all] Zuul job backlog

2018-10-05 Thread Doug Hellmann
Abhishek Kekane  writes:

> Hi Matt,
>
> Thanks for the input, I guess I should use '
> http://git.openstack.org/static/openstack.png' which will definitely work.
> Clark, Matt, Kindly let me know your opinion about the same.

That URL would not be on the local node running the test, and would
eventually exhibit the same problems. In fact we have seen issues
cloning git repositories as part of the tests in the past.

You need to use a localhost URL to ensure that the download doesn't have
to go off of the node. That may mean placing something into the directory
where Apache is serving files as part of the test setup.

Doug



Re: [openstack-dev] [all] Zuul job backlog

2018-10-04 Thread Abhishek Kekane
Hi Matt,

Thanks for the input. I guess I should use
'http://git.openstack.org/static/openstack.png', which will definitely work.
Clark, Matt, kindly let me know your opinions on this.

Thanks & Best Regards,

Abhishek Kekane


On Fri, Oct 5, 2018 at 10:20 AM Matthew Treinish 
wrote:

>
>
> On October 5, 2018 12:11:51 AM EDT, Abhishek Kekane 
> wrote:
> >Hi Clark,
> >
> >Thank you for the inputs. I have verified the logs and found that
> >mostly
> >image import web-download import method related tests are failing.
> >Now in this test [1] we are trying to download a file from '
> >
> https://www.openstack.org/assets/openstack-logo/2016R/OpenStack-Logo-Horizontal.eps.zip
> '
> >in glance. Here we are assuming image will be downloaded and active
> >within
> >20 seconds of time and if not it will be marked as failed. Now this
> >test
> >never fails in local environment but their might be a problem of
> >connecting
> >to remote while this test is executed in zuul jobs.
> >
> >Do you have any alternative idea how we can test this scenario, as it
> >is
> >very hard to reproduce this in local environment.
> >
>
> External networking will always be unreliable from the ci environment,
> nothing is 100% reliable and just given the sheer number of jobs we execute
> there will be an appreciable number of failures just from that. That being
> said this exact problem you've described is one we fixed in
> devstack/tempest over 5 years ago:
>
> https://bugs.launchpad.net/tempest/+bug/1190623
>
> It'd be nice if we didn't keep repeating problems. The solution for that
> bug is likely to be the same thing here, and not relying on pulling
> something from the external network in the test. Just use something else
> hosted on the local apache httpd of the test node and use that as the url
> to import in the test.
>
> -Matt Treinish
>
> >
> >
> >On Thu, Oct 4, 2018 at 7:43 PM Clark Boylan 
> >wrote:
> >
> >> On Thu, Oct 4, 2018, at 12:16 AM, Abhishek Kekane wrote:
> >> > Hi,
> >> > Could you please point out some of the glance functional tests
> >which are
> >> > failing and causing this resets?
> >> > I will like to put some efforts towards fixing those.
> >>
> >> http://status.openstack.org/elastic-recheck/data/integrated_gate.html
> >is
> >> a good place to start. That shows you a list of tests that failed in
> >the
> >> OpenStack Integrated gate that elastic-recheck could not identify the
> >> failure for including those for several functional jobs.
> >>
> >> If you'd like to start looking at identified bugs first then
> >> http://status.openstack.org/elastic-recheck/gate.html shows
> >identified
> >> failures that happened in the gate.
> >>
> >> For glance functional jobs the first link points to:
> >>
> >>
> >
> http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional/fc13eca/
> >>
> >>
> >
> http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional/b7c487c/
> >>
> >>
> >
> http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional-py35/b166313/
> >>
> >>
> >
> http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional-py35/ce262ab/
> >>
> >> Clark
> >>
> >>
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>


Re: [openstack-dev] [all] Zuul job backlog

2018-10-04 Thread Matthew Treinish


On October 5, 2018 12:11:51 AM EDT, Abhishek Kekane  wrote:
>Hi Clark,
>
>Thank you for the inputs. I have verified the logs and found that
>mostly
>image import web-download import method related tests are failing.
>Now in this test [1] we are trying to download a file from '
>https://www.openstack.org/assets/openstack-logo/2016R/OpenStack-Logo-Horizontal.eps.zip'
>in glance. Here we are assuming image will be downloaded and active
>within
>20 seconds of time and if not it will be marked as failed. Now this
>test
>never fails in local environment but there might be a problem of
>connecting
>to remote while this test is executed in zuul jobs.
>
>Do you have any alternative idea how we can test this scenario, as it
>is
>very hard to reproduce this in local environment.
>

External networking will always be unreliable from the CI environment; nothing
is 100% reliable, and given the sheer number of jobs we execute there will
be an appreciable number of failures from that alone. That being said, this exact
problem you've described is one we fixed in devstack/tempest over 5 years ago:

https://bugs.launchpad.net/tempest/+bug/1190623

It'd be nice if we didn't keep repeating problems. The solution for that bug is
likely to be the same thing here: don't rely on pulling something from the
external network in the test. Just use something hosted on the local
Apache httpd of the test node and use that as the URL to import in the test.

-Matt Treinish
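
A sketch of what that looks like in practice (the docroot path and fixture
layout below are assumptions that vary by job setup; this is not the actual
devstack/tempest change):

    import os
    import shutil

    # Assumed Apache docroot on the test node; adjust to match the job.
    DOCROOT = "/var/www/html"
    FIXTURE = "import-fixture.img"

    # Test setup: drop a local fixture into the docroot...
    shutil.copy(os.path.join("tests", "data", FIXTURE),
                os.path.join(DOCROOT, FIXTURE))

    # ...and import from localhost instead of the external network.
    local_url = "http://127.0.0.1/%s" % FIXTURE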

>
>
>On Thu, Oct 4, 2018 at 7:43 PM Clark Boylan 
>wrote:
>
>> On Thu, Oct 4, 2018, at 12:16 AM, Abhishek Kekane wrote:
>> > Hi,
>> > Could you please point out some of the glance functional tests
>which are
>> > failing and causing this resets?
>> > I will like to put some efforts towards fixing those.
>>
>> http://status.openstack.org/elastic-recheck/data/integrated_gate.html
>is
>> a good place to start. That shows you a list of tests that failed in
>the
>> OpenStack Integrated gate that elastic-recheck could not identify the
>> failure for including those for several functional jobs.
>>
>> If you'd like to start looking at identified bugs first then
>> http://status.openstack.org/elastic-recheck/gate.html shows
>identified
>> failures that happened in the gate.
>>
>> For glance functional jobs the first link points to:
>>
>>
>http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional/fc13eca/
>>
>>
>http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional/b7c487c/
>>
>>
>http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional-py35/b166313/
>>
>>
>http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional-py35/ce262ab/
>>
>> Clark
>>
>>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [openstack-dev] [all] Zuul job backlog

2018-10-04 Thread Abhishek Kekane
Hi Clark,

Thank you for the inputs. I have checked the logs and found that mostly
tests related to the image import web-download method are failing.
In this test [1] we are trying to download a file from
'https://www.openstack.org/assets/openstack-logo/2016R/OpenStack-Logo-Horizontal.eps.zip'
in glance. Here we assume the image will be downloaded and become active within
20 seconds; if not, it is marked as failed. This test never fails in a local
environment, but there might be a problem connecting to the remote server
while the test is executed in Zuul jobs.

Do you have any alternative ideas for how we can test this scenario? It is
very hard to reproduce this in a local environment.
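
For context, the check in the test boils down to a loop like the one below
(get_status here is just a stand-in for however the test reads the image
state, not a real helper); any external download slower than the deadline
shows up as a failed test:

    import time


    def wait_for_active(get_status, timeout=20.0, interval=0.5):
        """Poll image status until 'active', 'failed', or the deadline passes."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = get_status()
            if status == "active":
                return True
            if status == "failed":
                return False
            time.sleep(interval)
        # Timed out: a slow download from an external site looks like a failure.
        return False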

Thanks & Best Regards,

Abhishek Kekane


On Thu, Oct 4, 2018 at 7:43 PM Clark Boylan  wrote:

> On Thu, Oct 4, 2018, at 12:16 AM, Abhishek Kekane wrote:
> > Hi,
> > Could you please point out some of the glance functional tests which are
> > failing and causing this resets?
> > I will like to put some efforts towards fixing those.
>
> http://status.openstack.org/elastic-recheck/data/integrated_gate.html is
> a good place to start. That shows you a list of tests that failed in the
> OpenStack Integrated gate that elastic-recheck could not identify the
> failure for including those for several functional jobs.
>
> If you'd like to start looking at identified bugs first then
> http://status.openstack.org/elastic-recheck/gate.html shows identified
> failures that happened in the gate.
>
> For glance functional jobs the first link points to:
>
> http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional/fc13eca/
>
> http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional/b7c487c/
>
> http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional-py35/b166313/
>
> http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional-py35/ce262ab/
>
> Clark
>


Re: [openstack-dev] [all] Zuul job backlog

2018-10-04 Thread Clark Boylan
On Thu, Oct 4, 2018, at 12:16 AM, Abhishek Kekane wrote:
> Hi,
> Could you please point out some of the glance functional tests which are
> failing and causing these resets?
> I would like to put some effort towards fixing those.

http://status.openstack.org/elastic-recheck/data/integrated_gate.html is a good
place to start. That shows you a list of tests that failed in the OpenStack
integrated gate whose failures elastic-recheck could not identify, including
those from several functional jobs.

If you'd like to start looking at identified bugs first then 
http://status.openstack.org/elastic-recheck/gate.html shows identified failures 
that happened in the gate.

For glance functional jobs the first link points to:
http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional/fc13eca/
http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional/b7c487c/
http://logs.openstack.org/99/595299/1/gate/openstack-tox-functional-py35/b166313/
http://logs.openstack.org/44/569644/3/gate/openstack-tox-functional-py35/ce262ab/

Clark



Re: [openstack-dev] [all] Zuul job backlog

2018-10-04 Thread Abhishek Kekane
Hi,
Could you please point out some of the glance functional tests which are
failing and causing these resets?
I would like to put some effort towards fixing those.

Thanks & Best Regards,

Abhishek Kekane


On Wed, Oct 3, 2018 at 10:14 PM Doug Hellmann  wrote:

> Wesley Hayutin  writes:
>
> [snip]
>
> > The TripleO project has created a single node container based composable
> > OpenStack deployment [2]. It is the project's intention to replace most of
> > the TripleO upstream jobs with the Standalone deployment.  We would like
> to
> > reduce our multi-node usage to a total of two or three multinode jobs to
> > handle a basic overcloud deployment, updates and upgrades[a]. Currently
> in
> > master we are relying on multiple multi-node scenario jobs to test many
> of
> > the OpenStack services in a single job. Our intention is to move these
> > multinode scenario jobs to single node job(s) that test a smaller subset
> > of services. The goal of this would be to target the specific areas of the
> > TripleO code base that affect these services and only run those there.
> This
> > would replace the existing 2-3 hour two node job(s) with single node
> job(s)
> > for specific services that complete in about half the time.  This
> > unfortunately will reduce the overall coverage upstream but still allows
> us
> > a basic smoke test of the supported OpenStack services and their
> deployment
> > upstream.
> >
> > Ideally projects other than TripleO would make use of the Standalone
> > deployment to test their particular service with containers, upgrades or
> > for various other reasons.  Additional projects using this deployment
> would
> > help ensure bugs are found quickly and resolved providing additional
> > resilience to the upstream gate jobs. The TripleO team will begin review
> to
> > scope out and create estimates for the above work starting on October 18
> > 2018.  One should expect to see updates on our progress posted to the
> > list.  Below are some details on the proposed changes.
>
> [snip]
>
> Thanks for all of the details, Wes. I know the current situation has
> been hurting the TripleO team as well, so I'm glad to see a good plan in
> place to address it. I look forward to seeing updates about the
> progress.
>
> Doug
>


Re: [openstack-dev] [all] Zuul job backlog

2018-10-03 Thread Doug Hellmann
Wesley Hayutin  writes:

[snip]

> The TripleO project has created a single node container based composable
> OpenStack deployment [2]. It is the project's intention to replace most of
> the TripleO upstream jobs with the Standalone deployment.  We would like to
> reduce our multi-node usage to a total of two or three multinode jobs to
> handle a basic overcloud deployment, updates and upgrades[a]. Currently in
> master we are relying on multiple multi-node scenario jobs to test many of
> the OpenStack services in a single job. Our intention is to move these
> multinode scenario jobs to single node job(s) that test a smaller subset
> of services. The goal of this would be to target the specific areas of the
> TripleO code base that affect these services and only run those there. This
> would replace the existing 2-3 hour two node job(s) with single node job(s)
> for specific services that complete in about half the time.  This
> unfortunately will reduce the overall coverage upstream but still allows us
> a basic smoke test of the supported OpenStack services and their deployment
> upstream.
>
> Ideally projects other than TripleO would make use of the Standalone
> deployment to test their particular service with containers, upgrades or
> for various other reasons.  Additional projects using this deployment would
> help ensure bugs are found quickly and resolved providing additional
> resilience to the upstream gate jobs. The TripleO team will begin review to
> scope out and create estimates for the above work starting on October 18
> 2018.  One should expect to see updates on our progress posted to the
> list.  Below are some details on the proposed changes.

[snip]

Thanks for all of the details, Wes. I know the current situation has
been hurting the TripleO team as well, so I'm glad to see a good plan in
place to address it. I look forward to seeing updates about the
progress.

Doug



Re: [openstack-dev] [all] Zuul job backlog

2018-10-03 Thread Wesley Hayutin
On Fri, Sep 28, 2018 at 3:02 PM Matt Riedemann  wrote:

> On 9/28/2018 3:12 PM, Clark Boylan wrote:
> > I was asked to write a followup to this as the long Zuul queues have
> persisted through this week. Largely because the situation from last week
> hasn't changed much. We were down the upgraded cloud region while we worked
> around a network configuration bug, then once that was addressed we ran
> into neutron port assignment and deletion issues. We think these are both
> fixed and we are running in this region again as of today.
> >
> > Other good news is our classification rate is up significantly. We can
> use that information to go through the top identified gate bugs:
> >
> > Network Connectivity issues to test nodes [2]. This is the current top
> of the list, but I think its impact is relatively small. What is happening
> here is jobs fail to connect to their test nodes early in the pre-run
> playbook and then fail. Zuul will rerun these jobs for us because they
> failed in the pre-run step. Prior to zuulv3 we had nodepool run a ready
> script before marking test nodes as ready, this script would've caught and
> filtered out these broken network nodes early. We now notice them late
> during the pre-run of a job.
> >
> > Pip fails to find distribution for package [3]. Earlier in the week we
> had the in region mirror fail in two different regions for unrelated
> errors. These mirrors were fixed and the only other hits for this bug come
> from Ara which tried to install the 'black' package on python3.5 but this
> package requires python>=3.6.
> >
> > yum, no more mirrors to try [4]. At first glance this appears to be an
> infrastructure issue because the mirror isn't serving content to yum. On
> further investigation it turned out to be a DNS resolution issue caused by
> the installation of designate in the tripleo jobs. Tripleo is aware of this
> issue and working to correct it.
> >
> > Stackviz failing on py3 [5]. This is a real bug in stackviz caused by
> subunit data being binary not utf8 encoded strings. I've written a fix for
> this problem athttps://review.openstack.org/606184, but in doing so found
> that this was a known issue back in March and there was already a proposed
> fix,https://review.openstack.org/#/c/555388/3. It would be helpful if the
> QA team could care for this project and get a fix in. Otherwise, we should
> consider disabling stackviz on our tempest jobs (though the output from
> stackviz is often useful).
> >
> > There are other bugs being tracked by e-r. Some are bugs in the
> openstack software and I'm sure some are also bugs in the infrastructure. I
> have not yet had the time to work through the others though. It would be
> helpful if project teams could prioritize the debugging and fixing of these
> issues though.
> >
> > [2]http://status.openstack.org/elastic-recheck/gate.html#1793370
> > [3]http://status.openstack.org/elastic-recheck/gate.html#1449136
> > [4]http://status.openstack.org/elastic-recheck/gate.html#1708704
> > [5]http://status.openstack.org/elastic-recheck/gate.html#1758054
>
> Thanks for the update Clark.
>
> Another thing this week is the logstash indexing is behind by at least
> half a day. That's because workers were hitting OOM errors due to giant
> screen log files that aren't formatted properly so that we only index
> INFO+ level logs, and were instead trying to index the entire files,
> some of which are 33MB *compressed*. So indexing of those
> identified problematic screen logs has been disabled:
>
> https://review.openstack.org/#/c/606197/
>
> I've reported bugs against each related project.
>
> --
>
> Thanks,
>
> Matt
>



Greetings Clark and all,
The TripleO team would like to announce a significant change to the
upstream CI the project has in place today.

TripleO can at times consume a large share of the compute resources [1]
provided by the OpenStack upstream infrastructure team and OpenStack
providers.  The TripleO project has a large code base and a high velocity of
change, which alone can tax the upstream CI system [3]. Additionally, as for
other projects, the issue is particularly acute when gate jobs are reset at
a high rate.  Unlike most other projects in OpenStack, TripleO uses
multiple nodepool nodes in each job to more closely emulate customer-like
deployments.  While using multiple nodes per job helps to uncover bugs
that are not found in other projects, the resources used, the run time of
each job, and usability have proven to be challenging.  It has been a
challenge to maintain job run times, quality, and usability for TripleO, and
a challenge for the infra team to provide the required compute resources
for the project.

A simplification of our 

Re: [openstack-dev] [all] Zuul job backlog

2018-09-28 Thread Matt Riedemann

On 9/28/2018 3:12 PM, Clark Boylan wrote:

I was asked to write a followup to this as the long Zuul queues have persisted 
through this week. Largely because the situation from last week hasn't changed 
much. We were down the upgraded cloud region while we worked around a network 
configuration bug, then once that was addressed we ran into neutron port 
assignment and deletion issues. We think these are both fixed and we are 
running in this region again as of today.

Other good news is our classification rate is up significantly. We can use that 
information to go through the top identified gate bugs:

Network Connectivity issues to test nodes [2]. This is the current top of the 
list, but I think its impact is relatively small. What is happening here is 
jobs fail to connect to their test nodes early in the pre-run playbook and then 
fail. Zuul will rerun these jobs for us because they failed in the pre-run 
step. Prior to zuulv3 we had nodepool run a ready script before marking test 
nodes as ready, this script would've caught and filtered out these broken 
network nodes early. We now notice them late during the pre-run of a job.

Pip fails to find distribution for package [3]. Earlier in the week we had the in 
region mirror fail in two different regions for unrelated errors. These mirrors 
were fixed and the only other hits for this bug come from Ara which tried to 
install the 'black' package on python3.5 but this package requires python>=3.6.

yum, no more mirrors to try [4]. At first glance this appears to be an 
infrastructure issue because the mirror isn't serving content to yum. On 
further investigation it turned out to be a DNS resolution issue caused by the 
installation of designate in the tripleo jobs. Tripleo is aware of this issue 
and working to correct it.

Stackviz failing on py3 [5]. This is a real bug in stackviz caused by subunit 
data being binary not utf8 encoded strings. I've written a fix for this problem 
athttps://review.openstack.org/606184, but in doing so found that this was a 
known issue back in March and there was already a proposed 
fix,https://review.openstack.org/#/c/555388/3. It would be helpful if the QA 
team could care for this project and get a fix in. Otherwise, we should 
consider disabling stackviz on our tempest jobs (though the output from 
stackviz is often useful).

There are other bugs being tracked by e-r. Some are bugs in the openstack 
software and I'm sure some are also bugs in the infrastructure. I have not yet 
had the time to work through the others though. It would be helpful if project 
teams could prioritize the debugging and fixing of these issues though.

[2]http://status.openstack.org/elastic-recheck/gate.html#1793370
[3]http://status.openstack.org/elastic-recheck/gate.html#1449136
[4]http://status.openstack.org/elastic-recheck/gate.html#1708704
[5]http://status.openstack.org/elastic-recheck/gate.html#1758054


Thanks for the update Clark.

Another thing this week is that the logstash indexing is behind by at least
half a day. That's because workers were hitting OOM errors due to giant
screen log files that aren't formatted properly so that we only index
INFO+ level logs; instead the workers were trying to index the entire files,
some of which are 33MB *compressed*. So indexing of those
identified problematic screen logs has been disabled:


https://review.openstack.org/#/c/606197/

I've reported bugs against each related project.

--

Thanks,

Matt



Re: [openstack-dev] [all] Zuul job backlog

2018-09-28 Thread Clark Boylan
On Wed, Sep 19, 2018, at 12:11 PM, Clark Boylan wrote:
> Hello everyone,
> 
> You may have noticed there is a large Zuul job backlog and changes are 
> not getting CI reports as quickly as you might expect. There are several 
> factors interacting with each other to make this the case. The short 
> version is that one of our clouds is performing upgrades and has been 
> removed from service, and we have a large number of gate failures which 
> cause things to reset and start over. We have fewer resources than 
> normal and are using them inefficiently. Zuul is operating as expected.
> 
> Continue reading if you'd like to understand the technical details and 
> find out how you can help make this better.
> 
> Zuul gates related projects in shared queues. Changes enter these queues 
> and are ordered in a speculative future state that Zuul assumes will 
> pass because multiple humans have reviewed the changes and said they are 
> good (also they had to pass check testing first). Problems arise when 
> tests fail forcing Zuul to evict changes from the speculative future 
> state, build a new state, then start jobs over again for this new 
> future.
> 
> Typically this doesn't happen often and we merge many changes at a time, 
> quickly pushing code into our repos. Unfortunately, the results are 
> painful when we fail often as we end up rebuilding future states and 
> restarting jobs often. Currently we have the gate and release jobs set 
> to the highest priority as well so they run jobs before other queues. 
> This means the gate can starve other work if it is flaky. We've 
> configured things this way because the gate is not supposed to be flaky 
> since we've reviewed things and already passed check testing. One of the 
> tools we have in place to make this less painful is each gate queue 
> operates on a window that grows and shrinks similar to how TCP
> slow start works. As changes merge we increase the size of the window and when
> they fail to merge we decrease it. This reduces the size of the future 
> state that must be rebuilt and retested on failure when things are 
> persistently flaky.
> 
> The best way to make this better is to fix the bugs in our software, 
> whether that is in the CI system itself or the software being tested. 
> The first step in doing that is to identify and track the bugs that we 
> are dealing with. We have a tool called elastic-recheck that does this 
> using indexed logs from the jobs. The idea there is to go through the 
> list of unclassified failures [0] and fingerprint them so that we can 
> track them [1]. With that data available we can then prioritize fixing 
> the bugs that have the biggest impact.
> 
> Unfortunately, right now our classification rate is very poor (only 
> 15%), which makes it difficult to know what exactly is causing these 
> failures. Mriedem and I have quickly scanned the unclassified list, and 
> it appears there is a db migration testing issue causing these tests to 
> timeout across several projects. Mriedem is working to get this 
> classified and tracked which should help, but we will also need to fix 
> the bug. On top of that it appears that Glance has flaky functional 
> tests (both python2 and python3) which are causing resets and should be 
> looked into.
> 
> If you'd like to help, let mriedem or myself know and we'll gladly work 
> with you to get elasticsearch queries added to elastic-recheck. We are 
> likely less help when it comes to fixing functional tests in Glance, but 
> I'm happy to point people in the right direction for that as much as I 
> can. If you can take a few minutes to do this before/after you issue a 
> recheck it does help quite a bit.
> 
> One general thing I've found would be helpful is if projects can clean 
> up the deprecation warnings in their log outputs. The persistent 
> "WARNING you used the old name for a thing" messages make the logs large 
> and much harder to read to find the actual failures.
> 
> As a final note this is largely targeted at the OpenStack Integrated 
> gate (Nova, Glance, Cinder, Keystone, Swift, Neutron) since that appears 
> to be particularly flaky at the moment. The Zuul behavior applies to 
> other gate pipelines (OSA, Tripleo, Airship, etc) as does elastic-
> recheck and related tooling. If you find your particular pipeline is 
> flaky I'm more than happy to help in that context as well.
> 
> [0] http://status.openstack.org/elastic-recheck/data/integrated_gate.html
> [1] http://status.openstack.org/elastic-recheck/gate.html
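
A toy illustration of the growing/shrinking gate window described above (this
is not Zuul's code, and the floor and ceiling values are made up):

    class GateWindow:
        """Toy model of a gate queue window, not Zuul's implementation."""

        def __init__(self, floor=3, ceiling=20):
            self.floor = floor
            self.ceiling = ceiling
            self.size = floor  # number of changes tested in parallel

        def on_change_merged(self):
            # Success: widen the speculative future state.
            self.size = min(self.size + 1, self.ceiling)

        def on_gate_reset(self):
            # Failure: shrink it so less work is rebuilt and retested next time.
            self.size = max(self.size // 2, self.floor)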

I was asked to write a followup to this, as the long Zuul queues have persisted
through this week, largely because the situation from last week hasn't changed
much. We were down the upgraded cloud region while we worked around a network
configuration bug; then, once that was addressed, we ran into neutron port
assignment and deletion issues. We think these are both fixed and we are
running in this region again as of today.

Other good news is our 

Re: [openstack-dev] [all] Zuul job backlog

2018-09-19 Thread Matt Riedemann

On 9/19/2018 2:45 PM, Matt Riedemann wrote:

Another one we need to make a decision on is:

https://bugs.launchpad.net/tempest/+bug/1783405

For that one I'm suggesting we mark more slow tests with the actual
"slow" tag in Tempest so they are only run in the tempest-slow
job. gmann and I talked about this last week over IRC but I forgot to 
update the bug report with details. I think rather than increase the 
timeout of the tempest-full job we should be marking more slow tests as 
slow. Increasing timeouts gives some short-term relief but eventually we 
just have to look at these issues again, and a tempest run shouldn't 
take over 2 hours (remember when it used to take ~45 minutes?).


https://review.openstack.org/#/c/603900/

--

Thanks,

Matt



Re: [openstack-dev] [all] Zuul job backlog

2018-09-19 Thread Matt Riedemann

On 9/19/2018 2:11 PM, Clark Boylan wrote:

Unfortunately, right now our classification rate is very poor (only 15%), which 
makes it difficult to know what exactly is causing these failures. Mriedem and 
I have quickly scanned the unclassified list, and it appears there is a db 
migration testing issue causing these tests to timeout across several projects. 
Mriedem is working to get this classified and tracked which should help, but we 
will also need to fix the bug. On top of that it appears that Glance has flaky 
functional tests (both python2 and python3) which are causing resets and should 
be looked into.

If you'd like to help, let mriedem or myself know and we'll gladly work with 
you to get elasticsearch queries added to elastic-recheck. We are likely less 
help when it comes to fixing functional tests in Glance, but I'm happy to point 
people in the right direction for that as much as I can. If you can take a few 
minutes to do this before/after you issue a recheck it does help quite a bit.


Things have gotten bad enough that I've started proposing changes to 
skip particularly high failure rate tests that are not otherwise getting 
attention to help triage and fix the bugs. For example:


https://review.openstack.org/#/c/602649/

https://review.openstack.org/#/c/602656/

Generally this is a last resort since it means we're losing test 
coverage, but when we hit a critical mass of random failures it becomes 
extremely difficult to merge code.


Another one we need to make a decision on is:

https://bugs.launchpad.net/tempest/+bug/1783405

For that one I'm suggesting we mark more slow tests with the actual
"slow" tag in Tempest so they are only run in the tempest-slow
job. gmann and I talked about this last week over IRC but I forgot to 
update the bug report with details. I think rather than increase the 
timeout of the tempest-full job we should be marking more slow tests as 
slow. Increasing timeouts gives some short-term relief but eventually we 
just have to look at these issues again, and a tempest run shouldn't 
take over 2 hours (remember when it used to take ~45 minutes?).
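
For concreteness, tagging a test as slow looks roughly like this (the test
class and body are made up; the decorator is Tempest's standard mechanism):

    import testtools

    from tempest.lib import decorators


    class ExampleComputeTest(testtools.TestCase):
        # Real Tempest tests derive from Tempest base classes; testtools is
        # used here only to keep the sketch self-contained.

        @decorators.attr(type='slow')
        @decorators.idempotent_id('2f2a4f3c-0000-0000-0000-000000000000')
        def test_expensive_scenario(self):
            # Tests tagged 'slow' are excluded from tempest-full and picked
            # up by the tempest-slow job instead.
            pass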


--

Thanks,

Matt
