[OpenStack-Infra] Zuul v3: some layout checks disabled in project-config

2017-08-10 Thread James E. Blair
Hi,

With https://review.openstack.org/492697 we are moving gating of Zuul
itself and some related job repos from Zuul v2 to Zuul v3.  As part of
this, we need to disable some of the checks that we perform on the
layout file.  That change disables the following checks for the
openstack-infra/* repos only:

* usage of the merge-check template
* at least one check job
* at least one gate job
* every gerrit project appears in zuul

The first three should only be needed for a short time while we continue
to construct the post and release pipelines in Zuul v3.  After that is
complete, we should be able to reinstate those checks, but we will need
to keep the final check disabled (for openstack-infra repos at least)
until Zuul v2 is retired.
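
For illustration only (this is not the actual project-config test code,
and the layout structure below is simplified), the relaxed checks amount
to roughly the following Python sketch, with openstack-infra/* skipped:

    import yaml

    # Repos whose layout checks are temporarily relaxed while they gate
    # under Zuul v3.
    EXEMPT_PREFIX = 'openstack-infra/'

    def check_layout(layout_path):
        """Return a list of problems found in a Zuul v2 style layout."""
        with open(layout_path) as f:
            layout = yaml.safe_load(f)

        problems = []
        for project in layout.get('projects', []):
            name = project['name']
            if name.startswith(EXEMPT_PREFIX):
                # Checks disabled here until they can be reinstated.
                continue
            templates = [t.get('name') for t in project.get('template', [])]
            if 'merge-check' not in templates:
                problems.append('%s: missing merge-check template' % name)
            if not project.get('check'):
                problems.append('%s: no check jobs' % name)
            if not project.get('gate'):
                problems.append('%s: no gate jobs' % name)
        return problems

(The "every gerrit project appears in zuul" check also needs the Gerrit
projects list as input, so it is left out of this sketch.)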

-Jim


Re: [OpenStack-Infra] citycloud lon1 mirror postmortem

2017-08-10 Thread Paul Belanger
On Thu, Aug 10, 2017 at 10:34:56PM +1000, Ian Wienand wrote:
> Hi,
> 
> In response to sdague reporting that citycloud jobs were timing out, I
> investigated the mirror, suspecting it was not providing data fast enough.
> 
> There were some 170 htcacheclean jobs running, and the host had a load
> over 100.  I killed all these, but performance was still unacceptable.
> 
> I suspected networking, but since the host was in such a bad state I
> decided to reboot it.  Unfortunately it would get an address from DHCP
> but seemed to have DNS issues ... eventually it would ping but nothing
> else was working.
> 
> nodepool.o.o was placed in the emergency file and I removed lon1 to
> avoid jobs going there.
> 
> I used the citycloud live chat, and Kim helpfully investigated and
> ended up migrating mirror.lon1.citycloud.openstack.org to a new
> compute node.  This appeared to fix things, for us at least.
> 
> nodepool.o.o is removed from the emergency file and original config
> restored.
> 
> With hindsight, the excessive htcacheclean processes were clearly the
> result of slow runs piling up on each other: the network/DNS issues made
> each invocation slow, so successive runs overlapped and accumulated over
> time.  However, I still think we could minimise further issues by
> running it under a lock [1].  Other than that, I am not sure there is
> much else we can do; I think this was largely an upstream issue.
> 
> Cheers,
> 
> -i
> 
> [1] https://review.openstack.org/#/c/492481/
> 
Thanks. I also noticed a job failing to download a package from
mirror.iad.rax.openstack.org. When I SSH'd to that server, I too saw high load
(6.0+) and multiple htcacheclean processes running.

I did an audit of the other mirrors and they had the same problem, so I killed
all the stray htcacheclean processes there.  I can confirm the lock patch has
merged, but I will keep an eye on it.

I did notice that mirror.lon1.citycloud.openstack.org was still slow to respond
to shell commands. I still think we have an IO bottleneck somewhere; possibly
the compute host is throttling something.  We should keep an eye on it.

-PB


[OpenStack-Infra] citycloud lon1 mirror postmortem

2017-08-10 Thread Ian Wienand

Hi,

In response to sdague reporting that citycloud jobs were timing out, I
investigated the mirror, suspecting it was not providing data fast enough.

There were some 170 htcacheclean jobs running, and the host had a load
over 100.  I killed all these, but performance was still unacceptable.

I suspected networking, but since the host was in such a bad state I
decided to reboot it.  Unfortunately it would get an address from DHCP
but seemed to have DNS issues ... eventually it would ping but nothing
else was working.

nodepool.o.o was placed in the emergency file and I removed lon1 to
avoid jobs going there.

I used the citycloud live chat, and Kim helpfully investigated and
ended up migrating mirror.lon1.citycloud.openstack.org to a new
compute node.  This appeared to fix things, for us at least.

nodepool.o.o is removed from the emergency file and original config
restored.

With hindsight, the excessive htcacheclean processes were clearly the
result of slow runs piling up on each other: the network/DNS issues made
each invocation slow, so successive runs overlapped and accumulated over
time.  However, I still think we could minimise further issues by
running it under a lock [1].  Other than that, I am not sure there is
much else we can do; I think this was largely an upstream issue.

Cheers,

-i

[1] https://review.openstack.org/#/c/492481/
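
For illustration only (the actual change in [1] may look different), the
idea is simply to serialise htcacheclean behind a non-blocking lock so
that overlapping cron invocations exit immediately instead of piling up.
A minimal Python sketch, with hypothetical paths and limits:

    import fcntl
    import subprocess
    import sys

    LOCK_PATH = '/var/run/htcacheclean.lock'   # hypothetical lock file
    CACHE_ROOT = '/var/cache/apache2/proxy'    # hypothetical cache root

    def main():
        lock_file = open(LOCK_PATH, 'w')
        try:
            # Exclusive, non-blocking lock: if another run already holds
            # it, bail out rather than adding one more cleaner process.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError):
            print('htcacheclean already running; skipping this run')
            return 0
        # -n: be nice to other IO, -p: cache root, -l: size limit.
        return subprocess.call(
            ['htcacheclean', '-n', '-p', CACHE_ROOT, '-l', '8192M'])

    if __name__ == '__main__':
        sys.exit(main())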
