[openstack-dev] [TripleO]: CI outage yesterday/today

2016-02-16 Thread Dan Prince
Just a quick update about the CI outage today and yesterday. Turns out
our jobs weren't running due to a bad Keystone URL (it was pointing to
localhost:5000 instead of our public SSL endpoint).
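
For anyone hitting something similar: a quick way to spot a stale or wrong
identity URL is to dump the endpoint catalog and look for localhost entries.
A minimal sketch using openstacksdk (the cloud name is hypothetical, this
assumes admin credentials in clouds.yaml, and it's just an illustration,
not what we actually ran):

    import openstack

    # List every endpoint registered in Keystone and flag anything that
    # still points at localhost. The cloud name below is hypothetical.
    conn = openstack.connect(cloud="tripleo-ci")
    for ep in conn.identity.endpoints():
        flag = "  <-- suspicious" if "localhost" in ep.url else ""
        print(f"{ep.interface:10} {ep.url}{flag}")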

We've now fixed that issue and I'm told that as soon as Infra restarts
nodepool (it caches the Keystone endpoints) we should start processing
jobs again.

Wait on it...

http://status.openstack.org/zuul/

Dan



Re: [openstack-dev] [TripleO] CI outage

2015-03-30 Thread Derek Higgins


TL;DR: TripleO CI is back up and running; see below for more.

On 21/03/15 01:41, Dan Prince wrote:

Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries from OpenStack clients external to the rack. This
setting hadn't been changed recently, however, and didn't seem to bother
nodepool before, so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the Keystone setting and bouncing MySQL, instances appeared
to go ACTIVE, but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. It turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (ports are moving around).

Until this is resolved, RH1 is unavailable to host CI jobs. We will
post back here with an update once we have more information.


RH1 has been running as expected since last Thursday afternoon, which
means the cloud was down for almost a week. I'm left not entirely sure
what some of the problems were; at various times during the week we tried
a number of different interventions which may have caused (or exposed)
some of our problems, e.g.:


At one stage we restarted openvswitch in an attempt to ensure nothing
had gone wrong with our OVS tunnels. Around the same time (and possibly
caused by the restart), we started getting progressively worse
connections to some of our servers, with lots of entries like this on
our bastion server:
Mar 20 13:22:49 host01-rack01 kernel: bond0.5: received packet with own
address as source address


Not linking the restart with the looping-packets message, and instead
thinking we might have a problem with the switch, we put in a call with
our switch vendor.


Continuing to chase down a problem on our own servers, we noticed that
tcpdump was at times reporting about 100,000 ARP packets per second
(sometimes more).
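
If anyone wants to reproduce that kind of measurement, a rough sketch is
below (scapy is just my choice for illustration, not necessarily what we
used; the interface name comes from the kernel log line above, and capture
needs root):

    from scapy.all import ARP, sniff

    # Capture for 10 seconds on the bonded VLAN interface and report the
    # observed ARP rate.
    packets = sniff(iface="bond0.5", filter="arp", timeout=10)
    arp_count = sum(1 for p in packets if p.haslayer(ARP))
    print(f"{arp_count} ARP packets in 10s (~{arp_count / 10:.0f} pps)")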


Various interventions stopped the excess broadcast traffic, e.g.:
  o Shutting down most of the compute nodes stopped the excess traffic,
but the problem wasn't linked to any one particular compute node.
  o Running the TripleO os-refresh-config script on each compute node
stopped the excess traffic.


But restarting the controller node caused the excess traffic to return.

Eventually we got the cloud running without the flood of broadcast
traffic, with a small number of compute nodes, but instances still
weren't getting IP addresses. With nova and neutron in debug mode we saw
an error where nova was failing to mount the qcow image (IIRC it was
attempting to resize the image).
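
For anyone debugging something similar, confirming what qemu-img reports
for the image is a quick sanity check; a minimal sketch (the image path
is hypothetical):

    import json
    import subprocess

    # Ask qemu-img to describe the image; the path here is hypothetical.
    out = subprocess.check_output(
        ["qemu-img", "info", "--output=json",
         "/var/lib/nova/instances/_base/example-image"])
    info = json.loads(out)
    print(info["format"], info["virtual-size"])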


Unable to figure out why this had worked in the past but no longer did,
we redeployed this single compute node using the original image that was
used (over a year ago). Instances on this compute node were booting but
failing to get an IP address; we noticed this was because of a
difference between the time on the controller and the time on the
compute node. After resetting the time, instances were booting and
networking was working as expected (this was now Wednesday evening).
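
A quick way to spot that kind of clock skew across nodes is sketched below
(the host names are hypothetical and this is only an illustration, not what
we actually ran; in practice you also want NTP configured everywhere):

    import subprocess
    import time

    # Hypothetical host names; compare each host's epoch time with ours.
    HOSTS = ["overcloud-controller-0", "overcloud-compute-0"]

    for host in HOSTS:
        remote = int(subprocess.check_output(
            ["ssh", host, "date", "+%s"]).strip())
        print(f"{host}: skew {remote - int(time.time()):+d}s")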


Looking back at the error while mounting the qcow image, I believe this
was a red herring; it looks like this problem was always present on our
system, but we didn't have scary-looking tracebacks in the logs until we
switched to debug mode.


Now pretty confident we could get back to a running system by starting up
all the compute nodes again, ensuring the os-refresh-config scripts were
run, and then ensuring the times were set properly on each host, we
decided to remove any entropy that may have built up while debugging
problems on each compute node, so we redeployed all of our compute nodes
from scratch. This all went as expected but was a little time consuming,
as we took time to verify each step as we went along. The steps went
something like this:


o With the exception of the overcloud controller, nova delete all of
the hosts on the undercloud (31 hosts).


o We now have a problem: in TripleO the controller and compute nodes are
tied together in a single heat template, so we need the heat template
that was used a year ago to deploy the whole overcloud, along with the
parameters that were passed into it. We had actually done this before
when adding new compute nodes to the cloud, so it wasn't new territory.
   o Use heat template-show ci-overcloud to get the original heat
template (a

[openstack-dev] [TripleO] CI outage

2015-03-20 Thread Dan Prince
Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries from OpenStack clients external to the rack. This
setting hadn't been changed recently, however, and didn't seem to bother
nodepool before, so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the Keystone setting and bouncing MySQL, instances appeared
to go ACTIVE, but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. It turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (ports are moving around).
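
If it helps anyone poking at this: a quick way to see which overcloud
instances are ACTIVE and what addresses the API has recorded for them,
using openstacksdk with a hypothetical clouds.yaml entry (this only shows
the API view, not whether DHCP or floating IPs actually work):

    import openstack

    # Print each server's status and the addresses recorded for it.
    conn = openstack.connect(cloud="rh1")
    for server in conn.compute.servers():
        ips = [a["addr"] for nets in server.addresses.values() for a in nets]
        print(server.name, server.status, ips)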

Until this is resolved, RH1 is unavailable to host CI jobs. We will
post back here with an update once we have more information.

Dan

